The New 16-Cog, 512KB, 64 analog I/O Propeller Chip

Cluso99 · 2016-04-15 13:26

evanh wrote: »

I still think HubRAM as soft buffers is best interCog comms, it's so much more flexible. Prop2 Hub has the potential bandwidth, why not use it?

It is still a bottleneck for highly integrated co-operating cogs as there is too much delay to get data between cogs.

It was a problem in P1, but there were other constraints too in the P1 that prevented tightly co-operating cogs.

If there is space and it is simple (as Chip has already indicated), then some form of direct cog-cog (adjacent cogs) communication should seriously be considered.

Cluso99 · 2016-04-15 13:29

Seairth wrote: »

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean so they would dual-port to adjacent COGs ? or just be able to extend std COG memory ?

I just meant that the streamer would be able to access the LUT from its own bus.

What you are talking about would certainly solve the cog-to-cog communication problem!

That would mean that there would be only 8 LUT blocks (one per pair of cogs). That may still be reasonable. Could you then go to 4-port RAM and provide both glitch-free streaming and cog-to-cog access?

Edit: also add one more event for writing to LUT address $1FE. That way, the paired cogs can use $1FE and $1FF for signaling.

I was thinking there would still be 16 LUT blocks. Cog0 to Cog1, Cog1 to Cog2, ... , Cog15 to Cog0.

evanh · 2016-04-15 13:40

The only example given so far is one of instruction determinism issues, with an implied preferred increase in throughput on top of that.

Eliminating Hub write stalls is such a nice improvement to all operations. And once that's in place the usefulness of pairing up Cogs is probably zero.

Cluso99 · 2016-04-15 15:39

evanh wrote: »

The only example given so far is one of instruction determinism issues, with an implied preferred increase in throughput on top of that.

Eliminating Hub write stalls is such a nice improvement to all operations. And once that's in place the usefulness of pairing up Cogs is probably zero.

On P1 I wrote a datalogger program but I was not able to get the data out of a cog fast enough, so I was limited to the cog ram as a buffer. It is not the only time I haven't been able to get data between cogs fast enough.

Similar problems were found in getting USB FS running on P1. The turnaround time is extremely tight in USB.

It is one of the reasons P2HOT (and IIRC the prior P2 versions) had an additional internal 32bit I/O bus and support instructions. Most of those who supported this are no longer active on the forum. Did you see Linus's demo???

We now have 16 cogs. It would be a shame if we could not use these co-operatively when we have so much power in the P2. There are so many more possibilities with P2.

Some of the things possible would be to share some cordic processing between cogs so that the cordic could be used at full speed. By this I mean interleave cogs to get multiple cordics operating in successive clocks.

Seairth · 2016-04-15 16:16

Cluso99 wrote: »

Oooooh! Maybe 16mm2 available

Two thoughts come to mind...

16 of 512x32 DP RAM 16 x 0.292 mm2 = 4.7 mm2
Additional 4KB of LUT dual-ported between adjacent cogs (addresses $400-$5FF) and $3FF interrupt becomes $5FF.

16 of 4096x32 SP RAM 16 x 1.57 x0.5 mm2 = 25.1/2 mm2 = ~12.6 mm2
Additional 256KB of hub ram (total 768KB)
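The area figures above can be sanity-checked with a few lines of arithmetic. This is only a check of the numbers quoted in the post: the per-block areas (0.292 mm2 and 1.57 mm2) and the 0.5 factor are taken as given, and the variable names are made up for illustration.

```python
# Per-block areas as quoted in the post (mm^2); names are illustrative only.
DP_512X32_AREA = 0.292    # one 512x32 dual-port RAM
SP_4096X32_AREA = 1.57    # one 4096x32 single-port RAM
SP_SCALE = 0.5            # the "x0.5" factor used in the post
BLOCKS = 16
EXISTING_HUB_KB = 512

lut_area = BLOCKS * DP_512X32_AREA
hub_area = BLOCKS * SP_4096X32_AREA * SP_SCALE

# 16 blocks of 4096 longs, 4 bytes per long, expressed in KB.
extra_hub_kb = BLOCKS * 4096 * 4 // 1024

print(f"LUT option: {lut_area:.1f} mm2")
print(f"Hub option: {hub_area:.1f} mm2, +{extra_hub_kb} KB "
      f"(total {EXISTING_HUB_KB + extra_hub_kb} KB)")
```

Both options come in under the "maybe 16mm2" figure individually, but not together.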

Cluso99 wrote: »

Seairth wrote: »

cgracey wrote: »

jmg wrote: »

cgracey wrote: »

There's not enough room to double the hub RAM to 1MB, but we could do something like switch the 512x32 SP LUT RAMs to dual-ports (like the cogs have) at a cost of 3.2 mm2.

Do you mean so they would dual-port to adjacent COGs ? or just be able to extend std COG memory ?

I just meant that the streamer would be able to access the LUT from its own bus.

What you are talking about would certainly solve the cog-to-cog communication problem!

That would mean that there would be only 8 LUT blocks (one per pair of cogs). That may still be reasonable. Could you then go to 4-port RAM and provide both glitch-free streaming and cog-to-cog access?

Edit: also add one more event for writing to LUT address $1FE. That way, the paired cogs can use $1FE and $1FF for signaling.

I was thinking there would still be 16 LUT blocks. Cog0 to Cog1, Cog1 to Cog2, ... , Cog15 to Cog0.

I think the more common use case is to have pairs of cogs talking to each other quickly. The problem with the daisy-chain is that it's fast in only one direction and requires a lot of cooperation from every cog to make it two-way.
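To put a number on the one-direction point, here is a minimal sketch. It assumes each cog can only hand data to the next-higher cog (Cog15 wrapping to Cog0), as in the 16-block daisy chain described above; the function name is hypothetical.

```python
NUM_COGS = 16

def hops_forward(src, dst, num_cogs=NUM_COGS):
    """Hand-offs needed to get data from src to dst around a one-way ring."""
    return (dst - src) % num_cogs

print(hops_forward(0, 1))   # Cog0 -> Cog1: direct neighbour
print(hops_forward(1, 0))   # Cog1 -> Cog0: has to go all the way round
```

Under that assumption a reply from Cog1 back to Cog0 passes through every other cog, which is the cooperation cost mentioned above.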

As for the additional HUB-vs-LUT space, I say leave both of them as they are. Adding LUT space will require either additional instructions or modification of the LUT addressing. While more HUB space would be nice (it always is), I think more functional LUTs would give you better bang for your buck. Sharing a LUT between pairs of cogs would also open up some interesting possibilities! For instance, one cog could be a virtual memory manager for the other cog. Or one cog could be streaming and both cogs could be writing (e.g. Cog A writes while Cog B renders, streamer event occurs, Cog A renders while Cog B writes, etc).

Circuitsoft · 2016-04-15 16:48

Does HUB RAM /necessarily/ have to be a power-of-two size? I could see 768k or even just 640k (Thanks IBM!) being useful amounts. If addresses beyond those just always read 0 and accepted any write, and we knew to expect that, then there would be no issues there.
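A toy model of the behaviour suggested here, as a sketch only: reads beyond the populated RAM return 0 and writes there are accepted and dropped. The class name and the 20-bit address space are assumptions for illustration, not anything from the actual design.

```python
class SparseHub:
    """Toy hub RAM that is smaller than its address space."""

    def __init__(self, size_bytes, address_bits=20):
        self.size = size_bytes
        self.limit = 1 << address_bits      # assumed 1 MB address space
        self.mem = bytearray(size_bytes)

    def read(self, addr):
        addr %= self.limit
        # Unpopulated addresses always read as 0.
        return self.mem[addr] if addr < self.size else 0

    def write(self, addr, value):
        addr %= self.limit
        if addr < self.size:
            self.mem[addr] = value & 0xFF
        # Writes past the populated region are accepted and ignored.

hub = SparseHub(640 * 1024)
hub.write(0x1000, 0xAB)
hub.write(700 * 1024, 0xCD)     # beyond 640 KB: silently dropped
print(hex(hub.read(0x1000)), hub.read(700 * 1024))
```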

Kerry S · 2016-04-15 16:53

Give each cog 16 - 32bit registers that are read only to all other cogs. Would work same as how each cog can READ the IN ports at the same time, just with one Cog being able to do the write for their block of 16 registers.

Programmer can then mix, match, all kinds of comm signals from individual bits to longs depending on need.
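As a rough software model of this idea (a sketch, with hypothetical names): each cog owns a block of 16 longs that only it can write, and any cog can read any block.

```python
NUM_COGS = 16
REGS_PER_COG = 16
MASK32 = 0xFFFFFFFF

class BroadcastRegs:
    """Per-cog register blocks: writable by the owner, readable by all."""

    def __init__(self):
        self._regs = [[0] * REGS_PER_COG for _ in range(NUM_COGS)]

    def write(self, cog, index, value):
        # A cog can only address its own block here, so every other
        # cog's block is read-only to it by construction.
        self._regs[cog][index] = value & MASK32

    def read(self, owner, index):
        # Any cog may read any owner's block.
        return self._regs[owner][index]

regs = BroadcastRegs()
regs.write(cog=3, index=0, value=0x1_0000_0001)   # truncated to 32 bits
regs.write(cog=3, index=1, value=0b101)           # individual flag bits
print(hex(regs.read(owner=3, index=0)))
print(regs.read(owner=3, index=1) & 1)
```

Whether a long is treated as one value or as 32 separate flags is left to the program, which is the mix-and-match point made above.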

78rpm · 2016-04-15 17:42

If it is a choice between more HUB RAM, more/further LUT space, or an inter-COG communication buffer, make it configurable as any of them, so the designer can use it to their best advantage. Think of it as smart memory to match smart pins.

1. Configured as HUB RAM
2. Configured as additional LUT
3. Configured as SP RAM between adjacent COGs for bidirectional use. A simple SYSCLK/2 counter sync'd with the instructions determines whether Cog-n or (Cog-n-1 and Cog-n+1) currently has access and stalls the other, so latency between cogs is low. Even-numbered cogs get one access phase, odd-numbered cogs the other.
4. Configured as SP RAM between adjacent COGs for unidirectional use. A simple SYSCLK/2 counter sync'd with the instructions determines whether Cog-n or (Cog-n-1 and Cog-n+1) currently has access and stalls the other, so latency between cogs is low. Even-numbered cogs get one access phase, odd-numbered cogs the other. Flow is from Cog-n-1 to Cog-n to Cog-n+1.

evanh · 2016-04-15 23:24

Cluso99 wrote: »

evanh wrote: »

The only example given so far is one of instruction determinism issues, with an implied preferred increase in throughput on top of that.

Eliminating Hub write stalls is such a nice improvement to all operations. And once that's in place the usefulness of pairing up Cogs is probably zero.

On P1 I wrote a datalogger program but I was not able to get the data out of a cog fast enough, so I was limited to the cog ram as a buffer. It is not the only time I haven't been able to get data between cogs fast enough.

Throughput. Sorted with the faster writes.

Similar problems were found in getting USB FS running on P1. The turnaround time is extremely tight in USB.

That sounds like instruction determinism again. Sorted with the faster writes.

It is one of the reasons P2HOT (and IIRC the prior P2 versions) had an additional internal 32bit I/O bus and support instructions. Most of those who supported this are no longer active on the forum. Did you see Linus's demo???

Smartpins is the replacement there, not Cog interlinking.

We now have 16 cogs. It would be a shame if we could not use these co-operatively when we have so much power in the P2. There are so many more possibilities with P2.

Two-clock Hub writes on all instructions from all Cogs! That's got to be attractive!

Some of the things possible would be to share some cordic processing between cogs so that the cordic could be used at full speed. By this I mean interleave cogs to get multiple cordics operating in successive clocks.

The CORDIC is in the Hub. Interleaving Cog results I'm sure can be managed already.

evanh · 2016-04-15 23:37

evanh wrote: »

It is one of the reasons P2HOT (and IIRC the prior P2 versions) had an additional internal 32bit I/O bus and support instructions. Most of those who supported this are no longer active on the forum. Did you see Linus's demo???

Smartpins is the replacement there, not Cog interlinking.

Oh, the hidden pins, now I remember. I was thinking about the massive data bus for fast I/O ops in the very first Prop2 incarnation.

Chip has added some interrupts(poll-able) for this I think. There's also the tried and true system counter method.

Cluso99 · 2016-04-15 23:47

Evan,
You are totally missing the point.

There will always be a delay from writing to the hub until the other cog reads that hub. Not only is that delay almost impossible to determine, it can be substantial in the number of clocks.

Faster writes only permit the writing cog to get on with processing. It just means the data is cached until the appropriate hub slot comes along. Then the receiving cog has to read it, waiting until the particular slot comes around. And yes, it is indeterminate (or extremely difficult to calculate).

However, when using a normal 2-clock instruction to write to a register that another cog can view directly, the other cog could be sitting in a loop waiting for that data. The receiving cog's worst-case delay would be 2 instructions = 4 clocks, so a total of 6 clocks to send and receive the byte/word/long. This is not physically possible using the hub.
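The clock count in this paragraph can be written out as a small sketch. The direct-register figures come from the post itself. The hub comparison is a deliberately crude toy model that assumes one round-robin slot per clock across 16 cogs; that is an assumption for illustration, not the real P2 hub timing, and the function names are made up.

```python
CLOCKS_PER_INSTRUCTION = 2      # "normal 2 clock instruction" from the post

def direct_latency(poll_loop_instructions=2):
    """Write instruction plus the receiver's worst-case poll loop."""
    write = CLOCKS_PER_INSTRUCTION
    receive = poll_loop_instructions * CLOCKS_PER_INSTRUCTION
    return write + receive

def toy_hub_wait_range(num_slots=16):
    """ASSUMPTION: round-robin hub, one slot per clock. Not real P2 timing.

    Returns (best, worst) extra clocks spent waiting for slots:
    the writer waits for its slot, then the reader waits for its own.
    """
    worst_per_access = num_slots - 1
    return 0, 2 * worst_per_access

print(direct_latency())
print(toy_hub_wait_range())
```

Even in this simplified model the hub path has a spread between best and worst case, which is the indeterminacy being argued about.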

USB FS is no longer the high end, yet it requires a response in only a few bit times. Processing must also occur within this turnaround time.

Other much faster protocols and interfaces will require much tighter responses.

If the bi-directional small register file between adjacent cogs is implemented (a small piece of silicon really), then for example, COGn could enslave 2 adjacent cogs for assistance, COGn-1 and COGn+1.

With all the extra power of the P2, it would really be a shame to miss this opportunity, especially since Chip has said that it is easy, and a small piece of silicon, and now we have the space too.

Evan, why are you totally against it? Real examples please.

evanh · 2016-04-15 23:58

Cluso99 wrote: »

Evan, why are you totally against it? Real examples please.

Because fast, non-stalling Hub writes are obviously good.

And I feel what you are wanting is already covered by both the fast writes and other combinations already provided for. Eg: You can't get more exact than the system counter for Cog interleaving.

EDIT: Eliminating non-deterministic instructions is the win-win.

evanh · 2016-04-16 00:20

Cluso99 wrote: »

Faster writes only permit the writing cog to get on with processing. It just means the data is cached until the appropriate hub slot comes along. Then the receiving cog has to read it, waiting until the particular slot comes around. And yes, it is indeterminate (or extremely difficult to calculate).

Not if the existing FIFO is doing the reads. Everything becomes deterministic and fast for both reading and writing(with the added write buffering).

However, when using a normal 2-clock instruction to write to a register that another cog can view directly, the other cog could be sitting in a loop waiting for that data. The receiving cog's worst-case delay would be 2 instructions = 4 clocks, so a total of 6 clocks to send and receive the byte/word/long. This is not physically possible using the hub.

USB FS is no longer the high end, yet it requires a response in only a few bit times. Processing must also occur within this turnaround time.

It's a response best dealt with by the one Cog. What you appear to be trying to do is use a second Cog just because of the instruction stall in the first Cog.

evanh · 2016-04-16 00:25

Cluso99 wrote: »

If the bi-directional small register file between adjacent cogs is implemented (a small piece of silicon really), then for example, COGn could enslave 2 adjacent cogs for assistance, COGn-1 and COGn+1.

With all the extra power of the P2, it would really be a shame to miss this opportunity, especially since Chip has said that it is easy, and a small piece of silicon, and now we have the space too.

If I had to choose between a single small latch and using dual-ported LUT shared between neighbouring Cogs then I'd choose the LUT; the larger window is useful.

On the other hand, I'd rather have double the HubRAM so maybe then I'd go with your single latch instead.

Cluso99 · 2016-04-16 00:36

Since we have OTP for the fuses, perhaps some user OTP in conjunction with the ROM would be useful. Maybe 4KB or 8KB. It might permit booting without external SPI flash, e.g. from SD, or just a serial monitor.
I guess the OTP would be slightly smaller than RAM.

msrobots · 2016-04-16 00:39

I think @Evan you are missing the basic feature on @Cluso99's idea.

Each cog would have two com-channels to its neighbors, entirely independent of hub access. A built-in mailbox for the 'left' and 'right' neighboring cog.

I personally think that this would allow a lot of fast synchronization between cooperating cogs, say multiple cogs driving sprites on video, or Cluso's USB needing a second cog as co-processor.

The use of 32bit mailboxes is a quite common concept in the P1 software. There the mailbox is in the HUB. Having a mailbox for each of your neighbors would be quite easy to adapt to.

Basically I think we just would need read and write a long, and maybe test if zero. For the left and the right next cog.

Enjoy!

Mike

kwinn · 2016-04-16 02:25

msrobots wrote: »

I think @Evan you are missing the basic feature on @Cluso99's idea.

Each cog would have two com-channels to its neighbors, entirely independent of hub access. A built-in mailbox for the 'left' and 'right' neighboring cog.

I personally think that this would allow a lot of fast synchronization between cooperating cogs, say multiple cogs driving sprites on video, or Cluso's USB needing a second cog as co-processor.

The use of 32bit mailboxes is a quite common concept in the P1 software. There the mailbox is in the HUB. Having a mailbox for each of your neighbors would be quite easy to adapt to.

Basically I think we just would need read and write a long, and maybe test if zero. For the left and the right next cog.

Enjoy!

Mike

If this can be added without taking away something else it would be a mistake to pass up the opportunity. Even the suggested uses so far make it worthwhile, and I am sure more creative uses will come along once it is available.

evanh · 2016-04-16 03:36

I guess what I'm saying is the Hub write buffering is even more useful because it covers those uses and far more to boot.

There is already clock aligning. There is already mailbox notification. There is even the semaphore lock bits that could be used - they're poll-able now.

One thing that is still an issue is deterministic hub instructions. Even the Smartpin instructions are deterministic I think.

A big bonus of adding the write buffering is you get concurrent, fast, non-stalling hub reads and writes in the one Cog.

evanh · 2016-04-16 08:19

evanh wrote: »

One thing that is still an issue is deterministic hub instructions. Even the Smartpin instructions are deterministic I think.

A big bonus of adding the write buffering is you get concurrent, fast, non-stalling hub reads and writes in the one Cog.

I'll just add that while HubRAM write buffering should be entirely transparent and deterministic for all writes, there are caveats for HubRAM reads via the FIFO:
- The blocks of 16 longs are sequentially syphoned through the FIFO, they aren't addressable once in the FIFO. Which means trying to read just the last long of the block requires programmatically reading all prior 15 longs too.

- Also, I think, issuing a FBLOCK, presumably pointing to another block of HubRAM, only takes effect for the Cog once the existing 16 longs have been syphoned.
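The first caveat can be illustrated with a minimal sketch that models the FIFO as a plain queue. This is a software analogy only, assuming strictly sequential access as described above; the helper name and the placeholder data are made up, and it says nothing about actual FIFO timing.

```python
from collections import deque

def read_nth_long(fifo, n):
    """Pop longs until the n-th (0-based) one is reached.

    Returns (value, reads) where reads is how many longs had to be consumed.
    """
    value = None
    reads = 0
    for _ in range(n + 1):
        value = fifo.popleft()   # no random access: take the next long in order
        reads += 1
    return value, reads

block = deque(range(100, 116))   # 16 placeholder longs standing in for one block
value, reads = read_nth_long(block, 15)
print(value, reads)
```

Getting at the last long consumes the whole block, so the earlier 15 longs have to be read (and kept or discarded) by the program.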

evanh · 2016-04-16 08:21

Actually, I'm not sure if anyone is even thinking about using the FIFO or not? I've obviously been counting on using it but I'm now getting the feeling many were avoiding it for its complexity ... opinions?

Dave Hein · 2016-04-16 11:46

I suggest freezing the design and getting the P2 out in silicon. Leave the empty space for future enhancements of the P2 after the initial chip has been in use for a while.

ErNa · 2016-04-16 11:55

Yes, I support Dave! We can do so much with P2, there is no need to be able to do more now. Please finish as fast as possible and there will be P3!

cgracey · 2016-04-16 12:01

ErNa wrote: »

Yes, I support Dave! We can do so much with P2, there is no need to be able to do more now. Please finish as fast as possible and there will be P3!

Okay.

There are a few very minor enhancements I'm making to the smart pins. This will take another day.

I want to see about having the streamer do 4/2/1-bit operations, not just 8/16/32. This will take a day.

I will make the LUT dual-port, so that the streamer can read data while the cog writes simultaneously, without any glitching. This takes about 5 minutes.

cgracey · 2016-04-16 12:10

Here is the full pad frame that Treehouse has pretty much finished. The breaks in the power rings are due to this not being the top-most-level view. They are actually connected:

Here is the I/O pad, with 1k-ohm and 120-ohm DACs, high-z DAC for comparator use, ADC, and all the steering logic:

Here is a detail from the upper-left corner (RESn and TESn pads, plus I/O's, power):

Here is the lower-right corner (XI, XO, I/O's and power):

The power pads have been beefed up quite a bit, since these screenshots.

ozpropdev · 2016-04-16 12:15

Looks fabulous!

cgracey · 2016-04-16 12:31

ozpropdev wrote: »

Looks fabulous!

Yeah, they did a really good job, I think. They made it look simple.

Almost all of this work was done by a young layout engineer in Mexico who works for Treehouse. At first, he wasn't aware that big transistors needed big metal, as everything was passing LVS (layout vs. schematic) just fine, but we talked about that and he started doing the metal sizing perfectly. Maybe it was dumb luck, but he kept the clock runs close and tight and only encroaching into the edge of the ADC, where they needed to be, anyway. I think the analog should work really well.

T Chap · 2016-04-16 13:04

Chip, do you have an ETA to lock this up? ETA for earliest possible real chips? What is going to be the cheapest and smallest footprint option for embedded solutions in the meantime? A9's are over 200 at digikey.

evanh · 2016-04-16 13:30

I'd guess at next year still. The singular package hasn't changed from the QFP100.

T Chap · 2016-04-16 13:32

I am wondering about an interim embedded A9 or similar solution.

cgracey · 2016-04-16 13:39

T Chap wrote: »

Chip, do you have an ETA to lock this up? ETA for earliest possible real chips? What is going to be the cheapest and smallest footprint option for embedded solutions in the meantime? A9's are over 200 at digikey.

We are going to do a shuttle run in July for the pad frame elements, in order to test the reset, clock, and I/O pads. We can start the synthesis any time, but I would rather wait until we are certain that the pads are okay. We could have real chips by the end of the year.

I could support the BeMicroCV which has a Cyclone V -A2 device on it, which holds more than the DE0-Nano. I think that board is only $49. It would do one or two cogs and several smart pins, maybe 64KB hub RAM. If you wanted that working, I could start doing compiles for it, along with the others.
