The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Part 2

Cluso99 · 2015-09-27 08:27

Bill Henning wrote: »

Questions:

Is that up to 32 bits, one bit per clock?

Can you also read up to 32 bits, one per clock from a pin?

Is that instruction clock, or system clock?

Can the clock be exposed on an adjacent pin?

If the clock can be exposed, that gives us SPI master (half duplex) with the above for free.

For full duplex SPI, two pins would need to be sync'd with a third as a clock. An arbitrary other pin could be the chip select.

cgracey wrote: »

The streamer can write data directly to the i/o pins, not just to the DACs, up to 32 bits per clock, from hub or LUT.

And can we input the clock from a pin?

Permits absolute timing between props for data exchange at max clock frequency, as well as slave SPI.

Seairth · 2015-09-27 13:25

It looks like DJNS keeps going until the counter is less than zero. If so, I suggest renaming it to DJNC, which keeps it more consistent with the DJNZ (and their association to the C/Z flags).
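
A minimal sketch of the difference being discussed, assuming DJNZ jumps back while the decremented counter is non-zero and DJNS jumps back while it has not yet gone negative. The DJNS behavior here is Seairth's reading of the instruction, not something confirmed in the thread, and both function names are hypothetical.

```python
def djnz_iterations(count):
    """Count loop passes for a DJNZ-style loop; count must be >= 1."""
    iterations = 0
    while True:
        iterations += 1          # loop body runs
        count -= 1               # decrement D
        if count == 0:           # falls through once D reaches zero
            return iterations


def djns_iterations(count):
    """Count loop passes for the assumed DJNS behavior; count must be >= 0."""
    iterations = 0
    while True:
        iterations += 1          # loop body runs
        count -= 1               # decrement D
        if count < 0:            # falls through once D goes below zero
            return iterations
```

Under these assumptions, a counter loaded with N gives N passes with DJNZ and N+1 passes with DJNS.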

Conga · 2015-09-27 14:25

Seairth wrote: »

It looks like DJNS keeps going until the counter is less than zero.
If so, I suggest renaming it to DJNC, which keeps it more consistent with the DJNZ (and their association to the C/Z flags).

The name for this family of jumps could be clearer: if they began with DEC it would be really obvious what they do.
In Prop world, the letter D's most frequent association is with a register identified in the Destination field.

Names like DECJZ, DECJNZ, ... would parallel DECMOD (consistently using DEC for "Decrement").

Seairth · 2015-09-27 14:28

So, if you want to use a branch instruction with a 20-bit immediate address, it looks like you use "#". If you want to use an instruction with a 9-bit immediate address, you also use "#". If you want to use AUGx, you use "##".

So my question is... why bother with "##" at all? Just use "#" for all of them. The assembler can determine whether to add AUGx by the value of the immediate. This also avoids an inadvertent "JMP ##addr".

(by the way, what happens if AUGx precedes a 20-bit branch instruction?)
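
A rough sketch of the assembler behavior suggested above, where the encoding is chosen from the size of the immediate rather than from "#" versus "##". It assumes a 9-bit S field and an AUGx prefix that carries the remaining upper 23 bits of a 32-bit value. The 23/9 split and the name `encode_immediate` are assumptions for illustration, not details given in this thread.

```python
def encode_immediate(value):
    """Return (aug_bits, s_field) for a 32-bit immediate.

    aug_bits is None when the value fits in the 9-bit S field;
    otherwise it holds the upper 23 bits an AUGx prefix would carry.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("immediate must fit in 32 bits")
    s_field = value & 0x1FF          # low 9 bits stay in the instruction
    if value <= 0x1FF:
        return None, s_field         # plain "#": no prefix needed
    return value >> 9, s_field       # prefix carries the remaining bits
```

One consequence of automatic selection is that the number of instructions emitted depends on the operand's value, which may matter for code that relies on fixed offsets or timing.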

tonyp12 · 2015-09-27 17:11

Now that you're moving on to smart-pins, how hard would it be to current-isolate one GPIO bank (maybe the last in the daisy-chain?)
[Attached image: P2_RTC.jpg]

jmg · 2015-09-27 21:53

tonyp12 wrote: »

Now that you're moving on to smart-pins, how hard would it be to current-isolate one GPIO bank (maybe the last in the daisy-chain?)

Interesting, but power islands are tricky, and next thing users expect is some means to wake-up the rest of the chip, which complicates even more.

( Pushing to 1uA needs special buffer designs & I've seen one chip where the designers forgot that. They had a nice 1uA 32KHz oscillator feeding a more generic schmitt, and the 1uA became > 100uA thanks to the transition currents of that stage. )
Getting RTC support on P2 probably depends on OnSemi having a proven cell they can drop-in.

It would be useful to know the 0MHz predicted Icc on the P2 die ?

Starting and stopping clocks is easier than power-management and switching.

tonyp12 · 2015-09-27 22:53

I just want something simple: you set the counter with UTC seconds since 2000 whenever it has access to that source.
With power failures that may last up to 72hrs I just want the seconds to tick away on the regular cog counter,
so when power comes back up I just want seconds since 2000 to be still accurate.

GPIO state is not that important, and 8KB hub RAM retention: if it's too much work, skip it. Cog0 RAM retention maybe?
If the new IRQ system can be set to wait for the smartpin/cog counter to reach a certain number software rtc-alarm(s) is easy.

This is a 0.9uA 1Hz MEMS osc, so implementing an internal 32kHz crystal osc is then not needed.
http://www.mouser.com/ds/2/3/ASTMK-604412.pdf

jmg · 2015-09-28 00:42

tonyp12 wrote: »

GPIO state is not that important, and 8KB hub RAM retention: if it's too much work, skip it. Cog0 RAM retention maybe?

Partial HUB save would be too difficult, but perhaps a COG can be ring-fenced enough to have a low power island ?
This still comes down to the Static ICC expected for the P2, and how that stacks up against a separate RTC chip with maybe TCXO and RAM.

evanh · 2015-09-28 01:01

The Prop1 will have much lower minimum power consumption for battery operation.

In either case, if you aren't designing a battery run solution then it would be advised to use a separate RTC chip.

Bill Henning · 2015-09-28 02:31

I like the 1/2/4/8/16/32 bit options

24 bits might be handy too. Or a counter, so bits can be 1..32, possibly causing an interrupt when the count is done.

I like the clocking.

cgracey wrote: »

Bill Henning wrote: »

Questions:

Is that up to 32 bits, one bit per clock?

Can you also read up to 32 bits, one per clock from a pin?

Is that instruction clock, or system clock?

Can the clock be exposed on an adjacent pin?

If the clock can be exposed, that gives us SPI master (half duplex) with the above for free.

For full duplex SPI, two pins would need to be sync'd with a third as a clock. An arbitrary other pin could be the chip select.

cgracey wrote: »

The streamer can write data directly to the i/o pins, not just to the DACs, up to 32 bits per clock, from hub or LUT.

It captures bytes, words, or longs. I like the idea of one, two, or four bits, as well, getting written as bytes! The rate is already programmable by SETXFRQ: $80000000 = every clock, $40000000 = every 2nd clock, $2AAAAAAB = every 3rd clock. In the case of every third clock, the LSB must be set to ensure that it rolls over (reaches $80000000+) on the initial third clock. Bit 31 is not kept by the phase accumulator.
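
The rollover behavior can be sketched with a small model. This assumes the accumulator keeps only bits 0..30, adds the SETXFRQ value each clock, and triggers a transfer whenever the sum reaches bit 31. The function name `transfer_clocks` and the exact accumulator model are assumptions for illustration, not Chip's implementation.

```python
def transfer_clocks(freq, n_clocks):
    """Return the 1-based clock numbers on which a transfer occurs."""
    acc = 0
    fired = []
    for clock in range(1, n_clocks + 1):
        total = acc + freq
        if total & 0x80000000:       # sum reached bit 31: transfer now
            fired.append(clock)
        acc = total & 0x7FFFFFFF     # bit 31 is not kept
    return fired
```

In this model, `transfer_clocks(0x2AAAAAAB, 9)` fires on clocks 3, 6 and 9, while `transfer_clocks(0x2AAAAAAA, 9)` (LSB clear) sums to only $7FFFFFFE after three clocks and first fires on clock 4, which is the reason for setting the LSB.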

jmg · 2015-09-28 03:13

Bill Henning wrote: »

I like the 1/2/4/8/16/32 bit options

24 bits might be handy too. Or a counter, so bits can be 1..32, possibly causing an interrupt when the count is done.

I think Chip meant in parallel, for those sizes.
(4bW covers QuadSPI for example & 24bW could be useful for LCDs )
I guess x1 effectively means DMA into a Serial engine

Serial is slightly different, managed in the Smart-Pins, but yes, a bit-size field allowing 1..32 would be useful. I think an Infineon MCU has that feature, along with good FIFOs on the serial.

tonyp12 · 2015-09-28 17:28

I know this breaks the idea that all cogs are just cookie cutters of themselves.

But why not make each cog really good at one thing?, don't use the extra feature if you want them all to be plain.
Of course there needs to be some way cognew can specify what cog it wants.

One can have hardware AES, one can be good at usb2.0
and one at Ethernet, Fourier transform math and so on.... for 16 really useful features.
Maybe just one or two very specific op-code to boost encoding/decoding by 8x or some type of hardware assist so 480P hdmi is possible etc

Say each feature adds ~20% in gate logic; giving all the cogs the same 16 features would add 300%, but keeping each feature to just one cog is manageable.

Though coming up with the gate logic will take time and we don't want to wait, unless there are open source blocks already out there to just drop in.

potatohead · 2015-09-28 17:48

Well, that's not a bad idea, but it's not really a Propeller either. More like a hodge-podge of things that happen to share the HUB memory.

Could be pretty great too. But it's not the project at hand.

Maybe someone else can attempt this, or some ideas get considered after P2 is done.

Heater. · 2015-09-28 19:00

tonyp12,

But why not make each cog really good at one thing?,

Because at that point you have totally destroyed any idea of somebody being able to mix and match that really smart code you have written with the really smart code I have written into the project of their dreams.

What you are suggesting is basically equivalent to having customised hardware for whatever task. Like a normal SoC,

tonyp12 · 2015-09-28 19:35

16 plain cogs: Jack of all trades, master of none.

It's highly unlikely you would have the need to mix and match two of the same "type"
They are still 95% the same, and the compiler will handle it, using a ~8x slower emulated version of a specific op-code if it's not available in this cog.

If an engineer sees a diagram of 16 cores, showing hardware-assisted-"acronym" in each box, that will get their attention.

potatohead · 2015-09-28 19:44

It's way too late for anything like that now. And as mentioned, not what a Propeller is about.

Besides, what is this master of none business?

Good software will do the job well, and it can improve over time. This chip will do a lot more.

Getting good at things in hardware means dealing with IP, testing, etc., all of which will come at considerable time and expense.

That is time software can be developed on real chips.

There are a ton of SoC devices out there now.

Heater. · 2015-09-28 19:45

So, you are suggesting the compiler generates 8 times slower code for my object when I give it to you because you are using the special hardware my code expects.

And so your program does not work because it does not have the speed for my component.

And then you have to analyse the whole source code of all the objects you are using to find out what the problem is.

This is chaos.

No thanks. I like predictability and determinism, even at the cost of extreme performance. If I need extreme performance I can find that elsewhere if I want, and accept that it will be more work on my part to attain.

tonyp12 · 2015-09-28 19:56

>Besides, what is this master of none business?

Maybe USA specific? Many handymen claim they can do a little bit of everything, but then the work turns out subpar.
So you are better off hiring someone who specializes in just one area.

It's not a SoC, it's just a boost in specific areas that is likely only needed for one routine.
It's probably too late to add anything to P2 now.

potatohead · 2015-09-28 20:12

Again. Good software can do the job well. Specialized people can make specialized code.

If these hardware assists are in fact minor, good software is needed anyway.

jmg · 2015-09-28 20:17

tonyp12 wrote: »

They are still 95% the same, and the compiler will handle it, using a ~8x slower emulated version of a specific op-code if it's not available in this cog.

I think making an opcode divergence would cause more problems than it would solve.

tonyp12 wrote: »

One can have hardware AES, one can be good at usb2.0
and one at Ethernet, Fourier transform math and so on.... for 16 really useful features.
Maybe just one or two very specific op-code to boost encoding/decoding by 8x or some type of hardware assist so 480P hdmi is possible etc

Chip has this already in P2, in the way MathOPS and Cordic are managed,
ie That resource is not duplicated in each COG and is available for any COG to use.

If any special HW intensive opcodes are needed, they can best go into that common 'Math pool', following the model that exists already.

Serial items like USB2 could certainly benefit from some small HW level helpers, but that is best mapped into SmartPin resource, not pulled into any one COG.
There are already pin-grouping implied in some operations, so that can continue.

tonyp12 · 2015-09-28 20:31

>If any special HW intensive opcodes are needed, they can best go into that common 'Math pool'
15-20 cycles?, sometimes you need something that is done in 1-2 cycles for immediate use.

Just because one cog is good at something, should nobody have it, just to keep them all equally slow?
I'm pretty sure the compiler could have a "don't use hardware assist" for that.
But probably something for the P3.

jmg · 2015-09-28 21:01

tonyp12 wrote: »

>If any special HW intensive opcodes are needed, they can best go into that common 'Math pool'
15-20 cycles?, sometimes you need something that is done in 1-2 cycles for immediate use.

You need to give some real use case examples.

eg I can think of USB, so let's follow that :

That could benefit from bit-stuff/unstuff, but rather than a special opcode, that is best done within the smart pins.
There was talk of a CRC opcode, I've not kept up with where that is, but that could go into the math-block, as USB is byte-based, 15~20 cycles is fine.

Notice how, by using existing P2 flows and blocks, this avoids needing any COG specific divergence ?

Seairth · 2015-09-28 21:23

Hey, could the "specialized cog" conversation be moved to a separate thread? It would be nice to keep the topic of this thread primarily about the FPGA image that was just released.

Electrodude · 2015-09-29 21:25

Are things like "rdlong x, ptra++" possible? If so, how is the ++ encoded in the instruction?

Seairth · 2015-09-29 23:21

Electrodude wrote: »

Are things like "rdlong x, ptra++" possible? If so, how is the ++ encoded in the instruction?

Yes.

Electrodude · 2015-09-30 03:30

I saw that, but how is it specified that the S field means that and not its usual (P1) meaning? How are three possibilities for the address (S/#/PTRx) encoded in just one bit (I), or is the documentation wrong?

EDIT: actually read the entirety of that post

So, should

CCCC 1011010 CZI DDDDDDDDD SSSSSSSSS        RDLONG  D,S/#/PTRx  {WC,WZ}

in the (maybe) latest docs read

CCCC 1011010 CZI DDDDDDDDD SSSSSSSSS        RDLONG  D,S/PTRx  {WC,WZ}

?

ozpropdev · 2015-09-30 03:45

Immediate source values are invalid for RDLONG so the I flag changes the SSSSSSSSS values to mean the following.
Note: Chip indicated that this has changed slightly, but this should show how it all works.

    000000000     PTRA              'use PTRA
    100000000     PTRB              'use PTRB
    011000001     PTRA++            'use PTRA,                PTRA += SCALE
    111000001     PTRB++            'use PTRB,                PTRB += SCALE
    011111111     PTRA--            'use PTRA,                PTRA -= SCALE
    111111111     PTRB--            'use PTRB,                PTRB -= SCALE
    010000001     ++PTRA            'use PTRA + SCALE,        PTRA += SCALE
    110000001     ++PTRB            'use PTRB + SCALE,        PTRB += SCALE
    010111111     --PTRA            'use PTRA - SCALE,        PTRA -= SCALE
    110111111     --PTRB            'use PTRB - SCALE,        PTRB -= SCALE

    000NNNNNN     PTRA[INDEX]       'use PTRA + INDEX*SCALE
    100NNNNNN     PTRB[INDEX]       'use PTRB + INDEX*SCALE
    011NNNNNN     PTRA++[INDEX]     'use PTRA,                PTRA += INDEX*SCALE
    111NNNNNN     PTRB++[INDEX]     'use PTRB,                PTRB += INDEX*SCALE
    011nnnnnn     PTRA--[INDEX]     'use PTRA,                PTRA -= INDEX*SCALE
    111nnnnnn     PTRB--[INDEX]     'use PTRB,                PTRB -= INDEX*SCALE
    010NNNNNN     ++PTRA[INDEX]     'use PTRA + INDEX*SCALE,  PTRA += INDEX*SCALE
    110NNNNNN     ++PTRB[INDEX]     'use PTRB + INDEX*SCALE,  PTRB += INDEX*SCALE
    010nnnnnn     --PTRA[INDEX]     'use PTRA - INDEX*SCALE,  PTRA -= INDEX*SCALE
    110nnnnnn     --PTRB[INDEX]     'use PTRB - INDEX*SCALE,  PTRB -= INDEX*SCALE
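
The table can be read as a small bit-field layout, sketched below as a decoder. This is an interpretation of the patterns as posted (which Chip has since changed): bit 8 selects PTRA/PTRB, bit 7 enables the pointer update, bit 6 selects post- versus pre-modify, and bits 5..0 hold a signed index. The function names are hypothetical, and `scale` is passed in by the caller because the table does not define it.

```python
def decode_ptr_field(s):
    """Decode a 9-bit S field per the table above.

    Returns (pointer, index, update, post).
    """
    if not 0 <= s <= 0x1FF:
        raise ValueError("S field is 9 bits")
    pointer = "PTRB" if s & 0x100 else "PTRA"   # bit 8 selects the pointer
    update = bool(s & 0x080)                    # bit 7: write pointer back
    post = bool(s & 0x040)                      # bit 6: post- vs pre-modify
    index = s & 0x3F                            # bits 5..0: 6-bit index
    if index & 0x20:
        index -= 0x40                           # sign-extend negative index
    if post and not update:
        raise ValueError("pattern not listed in the table")
    return pointer, index, update, post


def effective_offset(s, scale):
    """Offset added to the pointer to form the address used."""
    _, index, _, post = decode_ptr_field(s)
    return 0 if post else index * scale
```

For example, `decode_ptr_field(0b011000001)` gives the PTRA++ case: PTRA is used unmodified for the access and then advanced by one scaled step.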

Electrodude · 2015-09-30 03:51

OK, thanks. That makes sense.

evanh · 2015-10-03 11:19

AntoineDoinel wrote: »

Rename STALLI and ALLOWI to FORBID and PERMIT !

Belated answer: Those names were chosen to convey that pending requests aren't forgotten, i.e. any IRQ that fires while STALLIed will still be marked as pending ... and will immediately generate a call to its related ISR upon ALLOWI.
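
A toy model of that behavior, assuming a request raised while stalled is simply held and then serviced as soon as interrupts are allowed again. The class and method names are hypothetical, and the model ignores priorities and everything else about the real interrupt hardware.

```python
class InterruptGate:
    """Toy model: requests raised while stalled are held, not dropped."""

    def __init__(self):
        self.stalled = False
        self.pending = []      # IRQ names waiting for ALLOWI
        self.serviced = []     # order in which ISRs were called

    def stalli(self):
        self.stalled = True

    def allowi(self):
        self.stalled = False
        while self.pending:                    # held requests fire at once
            self.serviced.append(self.pending.pop(0))

    def raise_irq(self, name):
        if self.stalled:
            if name not in self.pending:
                self.pending.append(name)      # marked pending, not forgotten
        else:
            self.serviced.append(name)
```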

cgracey · 2015-10-03 12:05

In the new Prop2, for RDxxxx/WRxxxx instructions, #0..255 is allowed for S. Values 256..511 are used by PTRx expressions. This means we have 5, not 6, relative index bits.
