
The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • Heater. wrote: »
    No more frikken new features please. Get me the chip already !

    In honor of Dave Hein's informative countdown (countup?) calendar, here's another one:

    [Attachment: ZeroDays.png]
  • LOL! :D
  • The only additional feature I'd like to see, and I doubt it is possible in the P2 timeframe, is some sort of compressed instruction set that would make better use of the on-chip memory. It is clear that XMM will not be supported by Parallax for P1 and it shouldn't be necessary for P2 but it would be nice to squeeze more code space out of the 512K of hub memory. You might think 512K should be good enough for any MCU program but remember that some people will want to use a lot of that for data buffers especially if they are making use of video.
  • How many instructions and what would they look like? At this point, I'm just curious about what would be needed to be meaningful.

    Yeah, it's about 300KB for a single-buffer 640x480 display at 8 bits per pixel.

    It also occurs to me having more cogs means being able to set some of them up to do "DMA" type operations to help with XMM too.

    Of course, we've not yet arrived at that stage of dealing with external RAM. Maybe there are options in the queue for it yet. On the last one, we had an SDRAM driver...

    And for the record, I'm happy with current video capabilities. They are almost entirely software and that means it's a bit more work to get the stuff we might want, but it also means a lot of tricks are on the table too. Where needed, it's gonna be possible to really max out how that RAM gets used. :)

    But I am curious about larger programs myself. This one is capable enough to warrant the discussion on how those get done at some point. Maybe this isn't the point.
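
    For reference, a back-of-the-envelope check of that buffer-size figure - just a sketch, assuming one byte per pixel and the 512KB of hub RAM mentioned above:

    # Rough framebuffer size check (assumes 8 bits, i.e. one byte, per pixel).
    width, height, bytes_per_pixel = 640, 480, 1
    buffer_bytes = width * height * bytes_per_pixel        # 307,200 bytes
    hub_ram = 512 * 1024                                   # 524,288 bytes

    print(f"frame buffer: {buffer_bytes / 1024:.0f} KB")   # ~300 KB
    print(f"hub RAM left: {(hub_ram - buffer_bytes) / 1024:.0f} KB")  # ~212 KB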




  • potatohead wrote: »
    How many instructions and what would they look like? At this point, I'm just curious about what would be needed to be meaningful.

    Yeah, it's about 300KB for a single-buffer 640x480 display at 8 bits per pixel.

    It also occurs to me having more cogs means being able to set some of them up to do "DMA" type operations to help with XMM too.

    Of course, we've not yet arrived at that stage of dealing with external RAM. Maybe there are options in the queue for it yet. On the last one, we had an SDRAM driver...

    And for the record, I'm happy with current video capabilities. They are almost entirely software and that means it's a bit more work to get the stuff we might want, but it also means a lot of tricks are on the table too. Where needed, it's gonna be possible to really max out how that RAM gets used. :)

    But I am curious about larger programs myself. This one is capable enough to warrant the discussion on how those get done at some point. Maybe this isn't the point.




    I'm not sure anyone will be happy with XMM performance on P2. On P1 we pretty much had to use LMM or CMM to get any reasonable amount of code space. XMM wasn't *that* much worse than CMM. However, on P2 we have hub exec and then we go immediately to XMM. That's going to be a huge difference in performance and probably not acceptable to most people. Overlays might be more appropriate but that puts us back in the dark ages of computing. :-)
  • Yeah, that's going to be a big step for sure.

    What about a really great byte or word code interpreter? I've no idea on the performance differences, but maybe that's a fallback option? With the COG being able to directly execute more code, there is room to unroll and optimize most of that interpreter.

  • jmg Posts: 15,140
    David Betz wrote: »
    The only additional feature I'd like to see, and I doubt it is possible in the P2 timeframe, is some sort of compressed instruction set that would make better use of the on-chip memory. It is clear that XMM will not be supported by Parallax for P1 and it shouldn't be necessary for P2 but it would be nice to squeeze more code space out of the 512K of hub memory. You might think 512K should be good enough for any MCU program but remember that some people will want to use a lot of that for data buffers especially if they are making use of video.

    Compressed instructions usually work by limiting the resource they work on.
    The bulk of the P2/P1 opcode width comes from the dual 9-bit register addresses, but compressing an opcode means removing 16 bits, plus adding the means to then fetch on either 16- or 32-bit boundaries - suddenly, you have limited the opcode reach even more (see the quick tally at the end of this post).

    My take on this boundary issue comes from a slightly different angle - I say when you do need to go off-chip, make sure the HW does so elegantly.
    That drops the speed step, and expands the usage.

    At a minimum, QuadSPI is needed, ideally with a dual-edge option, and the obvious modern memory to also support is the new HyperBus / HyperRAM.

    This becomes a smart-pin issue, and the data-flow pathways certainly have high bandwidth up to the pins.


    Good byte-code support should (hopefully) already be in P2, from the experience of Spin and the need to support Spin2.
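
    To make that compression trade-off concrete, here is a quick, purely illustrative tally (a sketch only - the 5/4/3-bit operand widths below are hypothetical, not any proposed P2 encoding):

    # Why squeezing dual-operand instructions into 16 bits hurts register reach:
    # the two 9-bit operand fields alone already take 18 bits.
    full_operand_bits = 9
    print(f"current reach: {2 ** full_operand_bits} cog registers per operand")

    for bits in (5, 4, 3):                 # hypothetical compressed operand widths
        reach = 2 ** bits
        opcode_bits_left = 16 - 2 * bits   # what remains for the opcode itself
        print(f"{bits}-bit operands: reach {reach} registers, "
              f"{opcode_bits_left} bits left for opcode/condition")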
  • Thanks potatohead!

    Working on documentation is NOT fun...
    potatohead wrote: »
    Make that launch awesome Bill.

  • Love the block load / save.

    As a matter of fact, I am having visions of setting up an SDRAM /RAS cycle, then going nuts with

    {RD|WR}BLOCK PORTA,S/#

    Add an exposed clock, and we have a fast SDRAM interface... use the exposed clock for CAS burst mode.

    Hmm... it would need a way to force the block move to wait until it did not have to scramble the hub order, so it may add up to a 15-cycle delay, but then it could back-to-back block-transfer a whole page.

    Maybe re-purpose WZ to mean wait for eggbeater to reach 0?
    cgracey wrote: »

    You're right about the delays and I've been thinking that a block read which plays along with the hub ram rotation is badly needed.
    I'm adding two instructions:
    RDBLOCK D/#,S/#
    WRBLOCK D/#,S/#
    D is %rrrrrbbbbb, where %rrrrr is the x16 register base address and %bbbbb + 1 is the number of 16-long blocks to read/write from/to hub RAM starting at {S[19:6],5'b00000}. For example:
    RDBLOCK %00000_11110,#$00000
    ...would read locations $00000..$007BF from the hub into registers $000..$1EF in the cog.
    These instructions are fast and deterministic, taking 5+blocks*16 clocks to get the job done. They start at whatever position the hub is in and read/write 16 longs from/to the same hub page (16 longs on a 16-long boundary) before moving to the next page and the next 16-register block in the cog.
    RDBLOCK/WRBLOCK don't use the FIFO streamer like hub exec does, but use the same conduit as the random RDxxxx/WRxxxx instructions do, without waiting for a particular hub position. This means that hub exec code can quickly load cog exec code and execute it, or just read/write data blocks.
    I think this totally rounds out the Prop 2 instruction set.
    I believe I have the interrupt stuff all done, with the breakpoint and single-step functions, but I'm still on the road, so I haven't been able to test anything. I'll probably have this RDBLOCK/WRBLOCK implemented before we get back home.
    Sorry I haven't been more responsive to this thread in the last few days. We've been doing lots of things with our kids on this trip and whole days have gone by without any computer time.
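
    To make the encoding and timing in Chip's description above concrete, here is a small model - not Parallax code, just a sketch of the D-field packing and the stated 5 + blocks*16 clock cost:

    # Model of the RDBLOCK/WRBLOCK D field (%rrrrrbbbbb) and its timing.
    def rdblock_d_field(reg_base_x16, num_blocks):
        """D = %rrrrrbbbbb: %rrrrr is the x16 register base, %bbbbb+1 is blocks."""
        assert 0 <= reg_base_x16 < 32 and 1 <= num_blocks <= 32
        return (reg_base_x16 << 5) | (num_blocks - 1)

    def rdblock_clocks(num_blocks):
        return 5 + num_blocks * 16         # stated deterministic cost

    # The example from the post: RDBLOCK %00000_11110,#$00000
    d = rdblock_d_field(0, 31)             # -> %00000_11110
    longs = 31 * 16                        # 496 longs -> cog registers $000..$1EF
    hub_bytes = longs * 4                  # hub locations $00000..$007BF
    print(f"D = %{d:010b}, {longs} longs ({hub_bytes} bytes), "
          f"{rdblock_clocks(31)} clocks")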

  • cgracey Posts: 14,133
    edited 2015-08-06 00:21
    How would you guys feel about getting rid of WAITPAE/WAITPAN/WAITPBE/WAITPBN and replacing them with event sensors?

    This way, you could poll for pin-pattern matches, as well as interrupt on them.

    The new instructions would be:

    SETPM D/# - set 32-bit pin mask
    SETPT D/# - set 32-bit (pin AND mask) target
    SETPS D/# - %00 = port A equals, %01 = port A doesn't equal, %10 = port B equals, %11 = port B doesn't equal

    And what about shrinking the sensitivity down to 16 or 8 contiguous pins? 8-pin sensitivity would let us configure the event sensor with a single 20-bit value: %n_ppp_tttttttt_mmmmmmmm, which means only one D/# instruction.

    And, lastly, is this pin checking even necessary to keep? Does anyone use it?

    I think this function is really important for logic analyzers, where you need to trap multiple pin states simultaneously. And this could even be expanded to 64-bit sensitivity, instead of shrinking down to 8 or 16.

    Thanks.
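
    As a rough illustration of the proposed single 20-bit configuration value (%n_ppp_tttttttt_mmmmmmmm), here is a sketch - the field meanings (n = match/mismatch sense, ppp = which 8-pin group, and a combined 64-bit A/B pin state) are my reading of the post, not a settled spec:

    # Sketch of packing and evaluating the proposed 8-pin event-sensor config.
    def pin_event_config(invert, group, target, mask):
        assert invert in (0, 1) and 0 <= group < 8
        assert 0 <= target < 256 and 0 <= mask < 256
        return (invert << 19) | (group << 16) | (target << 8) | mask

    def event_fires(port_value, invert, group, target, mask):
        """True when the selected 8 pins (ANDed with mask) match/mismatch target."""
        pins = (port_value >> (group * 8)) & 0xFF
        equal = (pins & mask) == (target & mask)
        return equal != bool(invert)       # invert=1 flips to "doesn't equal"

    cfg = pin_event_config(0, 1, 0b1010_0000, 0b1111_0000)   # watch pins 15..8
    print(f"config = %{cfg:020b}")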
  • cgracey Posts: 14,133
    I'm working on the block read/write instruction now, also, which poses some new challenges.
  • My understanding is that the waitxx instructions also put a cog in low power mode. Not sure if this still applies to the P2.

    Not being able to waitxx on a single pin is a big mistake. Just having 8-pin groups to wait on would render the other 7 pins in the group useless.

    We hopefully have 16 COGs now, and a simple serial receiver could just wait for its pin as on the P1, with its cog sleeping.

    Maybe the smart pins could help there, but not much info known to me about them, yet.

    For Logic Analyzer and friends a 64 bit read would be nice but in reality? Not really needed. You need to capture into hub anyways, maybe with 2 cogs to get every clock cycle. Or four COGs to have 64 bit. But then you have no buttons, display, EEPROM or serial. Just 64 pins there as far as I understand.

    Please @Chip, keep that interrupt stuff simple. We do not need a whole event system; that is the trap other MCUs fall into, and they have HUGE datasheets. Please stay away from that and do it the KISS way of Prop1.

    Since the P2 can do ADC/DAC and not just digital, any capture program will use this feature. So any LA will capture over the smart-pin interface. And who needs 64 triggers?

    Please keep it simple. Block read write is good. Some helper INS for USB maybe.

    Then make the smart pins really smart. I really think moving stuff into those independent pin systems was a very good decision. It keeps the COG clearer and allows for way more parallel execution.

    Just my two cents.

    Mike




  • jmg Posts: 15,140
    edited 2015-08-06 01:37
    cgracey wrote: »
    How would you guys feel about getting rid of WAITPAE/WAITPAN/WAITPBE/WAITPBN and replacing them with event sensors?

    This way, you could poll for pin-pattern matches, as well as interrupt on them.

    The new instructions would be:

    SETPM D/# - set 32-bit pin mask
    SETPT D/# - set 32-bit (pin AND mask) target
    SETPS D/# - %00 = port A equals, %01 = port A doesn't equal, %10 = port B equals, %11 = port B doesn't equal
    Some use examples would help?
    The present WAITxx are single-cycle granular and are easily portable from P1.
    I'm not sure if you are saying the ability to wait is gone, or if this extends it to allow both WAIT and interrupts.

    IIRC, there were edge versions of WAITxx added - are those still in P2?
    cgracey wrote:
    And what about shrinking the sensitivity down to 16 or 8 contiguous pins? 8-pin sensitivity would let us configure the event sensor with a single 20-bit value: %n_ppp_tttttttt_mmmmmmmm, which means only one D/# instruction.
    Do you mean 8 bits as a compact opcode, but have 32b available for some uses ?
    cgracey wrote:
    And, lastly, is this pin checking even necessary to keep? Does anyone use it?
    ?? Pin-Match events are very common in MCUs, and are used for many things, like keyboards, Quadrature Counting, AutoBAUD, etc.
    cgracey wrote:
    I think this function is really important for logic analyzers, where you need to trap multiple pin states simultaneously. And this could even be expanded to 64-bit sensitivity, instead of shrinking down to 8 or 16.
    Yes, the ability to capture on any change is also nice for Logic Analyzers, and yes, 64b wide would be useful there.

    Of course, ideally, that 64b needs to be a same-edge captured value, but that needs more logic.
    A fall-back method is to have one code line where the port is read and the match register is then loaded with that copy, which means the interrupt re-enters should any pin change occur between the read and the update.

    My preferred/ideal Logic Analyzer captures a Pin-State and a Time-Stamp, and does so to the Silicon BW.

    The most compact coding for this would be
    ReadPinState
    UpdatePinMatch
    CaptureTimeStamp
    SavePinState
    SaveTimeStamp

    and the peak MHz is set by how compact that group can be.

    Some MCUs have small FIFOs on time-stamp capture, which allow narrow pulses down to SysCLK to be read correctly. Otherwise, there is some larger minimum pulse width that can be measured correctly.
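
    Purely as a behavioral sketch of that capture sequence (not P2 code - read_pins and read_timestamp are hypothetical stand-ins for the hardware reads):

    # Behavioral model of the pin-change capture loop described above.
    def capture(read_pins, read_timestamp, n):
        samples = []
        match = read_pins()                  # initial state to compare against
        while len(samples) < n:
            state = read_pins()              # ReadPinState
            if state == match:
                continue                     # no change yet; keep waiting
            match = state                    # UpdatePinMatch
            stamp = read_timestamp()         # CaptureTimeStamp
            samples.append((state, stamp))   # SavePinState + SaveTimeStamp
        return samples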

  • cgracey wrote: »
    I'm working on the block read/write instruction now, also, which poses some new challenges.

    Not knowing the particulars, I am going to throw out a suggestion that may (or may not) help: Have the hub rotate every two clock cycles instead of every clock cycle. For one clock cycle, perform instruction streaming (if in hubexec mode). For the other clock cycle, perform any pending memory HUBOP.

    With this approach, I'd think the following would result:
    • Obviously, if a pending memory HUBOP isn't aligned, it will stall. But it won't stop the instruction streamer.
    • Both the memory HUBOPs and the instruction streamer can use the same data bus.
    • The instruction streamer will still be fast enough, since instructions are fetched into the pipeline at the same rate.
    • It would be possible to use existing RDxxx/WRxxx instructions back-to-back optimally (after the hub alignment delay).

    Note: You could still have a RDBLOCK/WRBLOCK instruction, but I would suggest that it still require alignment to start. This would keep things consistent with the other memory HUBOPS.

    I'm not sure if this can be implemented at the current design stage, but I wondered whether it wouldn't be better to implement the block-moving instructions using an interleaved clock scheme: one clock for instruction execution and one for hub-to/from-cog data movement, or even two consecutive clocks for instruction execution, then one clock for data movement.

    Sure, the clock interleave should only begin if there are pending data transfers, and it could begin or resume after losing sync - whether through changes in the address sequence or through out-of-sync hubops interspersed in the code, whichever the case may be.

    With that scheme, instruction execution can proceed normally, except for stalls due to the inherent waits introduced by address misalignments - and even there, since instruction execution has stalled, the data transfer can by default continue at twice the normal rate, until realignment resolves the pending operation that caused the stall.

    With this approach, despite losing strict determinism - instruction execution timing now depends on whether block-moving operations are active - one could process information and move the resulting data non-stop, with the throughput depending solely on the number of instructions executed between data movements.

    In fact, I'm all in favor of the first (or a single) write operation not stalling at all. If one writes at a rate of one access per sixteen clocks, the write operation would be totally transparent from the code-execution standpoint. Only a second write or a read operation, if any, would then stall the execution unit until the transfer circuit completes any pending operation(s).

    Better yet if this can be a choice, i.e., programmable by a control bit crafted into the instruction itself.

    Continuous versus burst, which one is better?

    I'm not sure, but if one can choose the way that best fits one's needs, surely that's better than not being able to choose and compare the results.

    Henrique

    <"http://forums.parallax.com/profile/41126/cgracey"> wrote:

    I'm working on the block read/write instruction now, also, which poses some new challenges.
  • cgracey Posts: 14,133
    edited 2015-08-06 06:29
    There's no need for splitting the hub cycles between hub exec and block r/w's, because the hub exec FIFO will fill up in a few clocks once the block r/w instruction engages, maybe even before the first block move slot comes up. And once the block move engages, it will move one long per clock.
  • Hi Chip

    Sorry for my poor English, but I was wondering about the block-moving instructions you are now crafting not stalling cog-resident code execution at all.

    Is that possible in the current design?

    Henrique
  • cgracey Posts: 14,133
    edited 2015-08-06 07:18
    msrobots wrote: »
    My understanding is that the waitxx instructions also put a cog in low power mode. Not sure if this still applies to the P2.

    Not being able to waitxx on a single pin is a big mistake. Just having 8-pin groups to wait on would render the other 7 pins in the group useless.

    We hopefully have 16 COGs now, and a simple serial receiver could just wait for its pin as on the P1, with its cog sleeping.

    Maybe the smart pins could help there, but not much info known to me about them, yet.

    For Logic Analyzer and friends a 64 bit read would be nice but in reality? Not really needed. You need to capture into hub anyways, maybe with 2 cogs to get every clock cycle. Or four COGs to have 64 bit. But then you have no buttons, display, EEPROM or serial. Just 64 pins there as far as I understand.

    Please @Chip, keep that interrupt stuff simple. We do not need a whole event system; that is the trap other MCUs fall into, and they have HUGE datasheets. Please stay away from that and do it the KISS way of Prop1.

    Since the P2 can do ADC/DAC and not just digital, any capture program will use this feature. So any LA will capture over the smart-pin interface. And who needs 64 triggers?

    Please keep it simple. Block read write is good. Some helper INS for USB maybe.

    Then make the smart pins really smart. I really think moving stuff into those independent pin systems was a very good decision. It keeps the COG clearer and allows for way more parallel execution.

    Just my two cents.

    Mike




    If you do a wait-event instruction like WAITEDG, you are in a low-power mode. We already can wait for a single-pin edge event. I'm talking about making those multi-pin WAITPAE, etc. instructions into events that you configure. Then, you can poll or wait for the event. It's actually very simple and it lets you see if the event occurred, without you having to wait for it. Of course, you can wait for it, too, if you want.

    I guess here's my question: Does anybody ever need to wait for a pattern to appear on a set of pins, or is waiting for a transition on a single pin enough? Every time I've used WAITPxx, it's been to trap an event on a single pin. We already have this with GETEDG/WAITEDG. Do we need this function for multiple pins? If not, we could get rid of the WAITPxx instructions.
  • cgracey Posts: 14,133
    Seairth wrote: »
    cgracey wrote: »
    I'm working on the block read/write instruction now, also, which poses some new challenges.

    Not knowing the particulars, I am going to throw out a suggestion that may (or may not) help: Have the hub rotate every two clock cycles instead of every clock cycle. For one clock cycle, perform instruction streaming (if in hubexec mode). For the other clock cycle, perform any pending memory HUBOP.

    With this approach, I'd think the following would result:
    • Obviously, if a pending memory HUBOP isn't aligned, it will stall. But it won't stop the instruction streamer.
    • Both the memory HUBOPs and the instruction streamer can use the same data bus.
    • The instruction streamer will still be fast enough, since instructions are fetched into the pipeline at the same rate.
    • It would be possible to use existing RDxxx/WRxxx instructions back-to-back optimally (after the hub alignment delay).

    Note: You could still have a RDBLOCK/WRBLOCK instruction, but I would suggest that it still require alignment to start. This would keep things consistent with the other memory HUBOPS.

    This is a neat idea. I had to read it a few times before I got what you were saying. We don't really need to do this, though, because the hub exec FIFO will fill up soon enough if you are in a block r/w instruction, and once filled, the block r/w can proceed at the rate of one long per clock. Also, cutting the hub rate in half would cut the top streamer I/O rate in half, too.
  • cgracey Posts: 14,133
    Yanomani wrote: »
    Hi Chip

    Sorry for my poor English, but I was wondering about the block-moving instructions you are now crafting not stalling cog-resident code execution at all.

    Is that possible in the current design?

    Henrique

    They must stall code execution because they need access to the cog RAM, just like execution does.
  • Hi Chip.

    I think that that capability would be most useful in numerically controlled robots!

    cgracey wrote: »

    I guess here's my question: Does anybody ever need to wait for a pattern to appear on a set of pins, or is waiting for a transition on a single pin enough? Every time I've used WAITPxx, it's been to trap an event on a single pin. We already have this with GETEDG/WAITEDG. Do we need this function for multiple pins? If not, we could get rid of the WAITPxx instructions.

  • cgracey Posts: 14,133
    I can see, as one of you pointed out, that keyboards could use multiple-pin-sensitive events or interrupts. I'm thinking that the WAITPxx instructions could be redone as an event/interrupt, with 8-pin sensitivity. 32-pin sensitivity is overkill for most practical applications.
  • MJB Posts: 1,235
    cgracey wrote: »
    I guess here's my question: Does anybody ever need to wait for a pattern to appear on a set of pins, or is waiting for a transition on a single pin enough? Every time I've used WAITPxx, it's been to trap an event on a single pin. We already have this with GETEDG/WAITEDG. Do we need this function for multiple pins? If not, we could get rid of the WAITPxx instructions.
    - Obviously, the classic logic analyser case you already mentioned.
    - Incremental encoders would be the other one, but this might be handled in the smart pins now.


  • cgracey Posts: 14,133
    Sapieha wrote: »
    Hi Chip.

    I think that that capability would be most useful in numerically controlled robots!

    cgracey wrote: »

    I guess here's my question: Does anybody ever need to wait for a pattern to appear on a set of pins, or is waiting for a transition on a single pin enough? Every time I've used WAITPxx, it's been to trap an event on a single pin. We already have this with GETEDG/WAITEDG. Do we need this function for multiple pins? If not, we could get rid of the WAITPxx instructions.

    Do you think 8-pin sensitivity would be adequate?
  • jmg Posts: 15,140
    cgracey wrote: »
    I guess here's my question: Does anybody ever need to wait for a pattern to appear on a set of pins, or is waiting for a transition on a single pin enough? Every time I've used WAITPxx, it's been to trap an event on a single pin. We already have this with GETEDG/WAITEDG. Do we need this function for multiple pins? If not, we could get rid of the WAITPxx instructions.

    See above for examples of multiple-pin waits:
    Keyboard scan is a classic one.
    Multiple-channel quadrature counting is another...
    (I know Smart Pins are expected to handle quadrature counting, but SW counting should not be excluded.)

    Sports Event timing is another app...
  • jmg Posts: 15,140
    edited 2015-08-06 08:01
    cgracey wrote: »
    I can see, as one of you pointed out, that keyboards could use multiple-pin-sensitive events or interrupts. I'm thinking that the WAITPxx instructions could be redone as an event/interrupt, with 8-pin sensitivity. 32-pin sensitivity is overkill for most practical applications.
    Do you think 8-pin sensitivity would be adequate?
    Given P1 has 32 pin sensitivity, I would be very wary of making P2 a subset of any P1 feature.
    What is the Logic Cost of the 32 wide ability ?

    I've just checked: the cheap ~30c MCUs here have 16-wide Port Match, and others with slightly more pins have 24-wide Port Match.

    8 wide is looking a tad anemic ?
  • Hi Chip.

    8-pin sensitivity is good enough.

    cgracey wrote: »

    Do you think 8-pin sensitivity would be adequate?
  • cgracey wrote: »
    Yanomani wrote: »
    Hi Chip

    Sorry for my poor English, but I was wondering about the block-moving instructions you are now crafting not stalling cog-resident code execution at all.

    Is that possible in the current design?

    Henrique

    They must stall code execution because they need access to the cog RAM, just like execution does.



    I'm not asking for it, only trying to get some understanding of it.

    I'll try to explain it better, without garbling it with my poor written English.

    It's like time-multiplexing a third port into cog RAM, and only during block moves: one cycle runs the cog (or two, in the one-out-of-three case), the next runs data transfers.

    The FIFO shouldn't hold the long values themselves (the data to be moved), only the hub and cog addresses and, of course, the operation to be performed: read from hub to cog RAM, or the reverse.

    The addresses don't even have to carry their least significant four bits, because those are implied by the four bits that address each FIFO position, from either standpoint, hub or cog.

    So, if there is an operation request for a certain long within the FIFO, it'll block further ops at that address (as represented by the lower four bits) until it is serviced by the passing of the egg beater, which gets the operation done. Until the egg beater passes, the cog execute unit receives all the clocks; only during the hub-to-cog transfer (or vice versa) is there a one-clock stall.

    Then each FIFO position will in fact hold 16+5+1 bits (for operating on longs), or 16+4+5+4+1 bits (for operating on individual bytes), plus 1 bit to signal a pending operation.

    Since the egg beater is fully predictable, all setup could be done in the preceding cycle, just before the data movement takes place, so the timing isn't blown away.

    I don't know if this is possible, but if it is, it would spare some logic, if my thoughts are correct.

    Henrique
  • cgracey Posts: 14,133
    jmg wrote: »
    cgracey wrote: »
    I can see, as one of you pointed out, that keyboards could use multiple-pin-sensitive events or interrupts. I'm thinking that the WAITPxx instructions could be redone as an event/interrupt, with 8-pin sensitivity. 32-pin sensitivity is overkill for most practical applications.
    Do you think 8-pin sensitivity would be adequate?
    Given P1 has 32 pin sensitivity, I would be very wary of making P2 a subset of any P1 feature.
    What is the Logic Cost of the 32 wide ability ?

    I've just checked: the cheap ~30c MCUs here have 16-wide Port Match, and others with slightly more pins have 24-wide Port Match.

    8 wide is looking a tad anemic ?

    Do they offer mismatch, as well?
  • jmg Posts: 15,140
    cgracey wrote: »
    jmg wrote: »
    What is the Logic Cost of the 32 wide ability ?

    I've just checked: the cheap ~30c MCUs here have 16-wide Port Match, and others with slightly more pins have 24-wide Port Match.

    8 wide is looking a tad anemic ?

    Do they offer mismatch, as well?

    They have an N-wide mask and an N-wide XOR between the pins and the match register, so it is called Port Match, but it actually fires on mismatch. On INT entry, you update the compare register.
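
    In other words, the comparison boils down to a mask-and-XOR. A minimal sketch of that model (my own illustration, not any vendor's register layout):

    # N-wide "port match" that actually fires on mismatch: any masked pin that
    # differs from the match register raises the event; the handler then reloads
    # the match register with the current pin state.
    def port_match_fires(pins, match_reg, mask):
        return ((pins ^ match_reg) & mask) != 0

    pins, match_reg, mask = 0b1011_0001, 0b1011_0000, 0b0000_1111
    if port_match_fires(pins, match_reg, mask):
        match_reg = pins          # "On INT entry, you update the compare register."
        print("port-match event fired; match register updated")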