Prop II question : Serial Chip-To-Chip Communication

jmg · 2012-03-06 21:30

The new Prop II spec
says
["Chip-To-Chip Communication
Each cog now also features high-speed serial transfer and receive hardware for chip-to-chip communication. The hardware requires three I/O pins (SO, SI, CLK). Opcodes are SNDSER RCVSER SETSER "]

but this seems to exclude both Standard SPI and QuadSPI support ?

That's a large blind spot - standard SPI receives as it sends, and QuadSPI is the emerging standard for large FLASH memory.
So near, and yet so far...

SPI does not appear anywhere in the DOCs ... Does Parallax imagine Prop IIs will ONLY talk to other Prop IIs ?
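
To make the "receives as it sends" point concrete, here is a minimal Python sketch of one SPI byte exchange, MSB first. It is only a software model: the function name `spi_exchange` and the 8-bit width are assumptions for illustration, and nothing here describes the actual Prop II serial hardware, whose details are not given in the spec quoted above.

```python
def spi_exchange(master_byte, slave_byte):
    """Model one 8-bit SPI transfer, MSB first (illustrative only).

    On every clock the master drives one bit on MOSI while the slave
    drives one bit on MISO, so each side ends up holding the other's byte.
    """
    master_shift = master_byte & 0xFF
    slave_shift = slave_byte & 0xFF
    for _ in range(8):
        mosi = (master_shift >> 7) & 1   # bit the master puts on the wire
        miso = (slave_shift >> 7) & 1    # bit the slave puts on the wire
        # Both shift registers move one place and capture the other side's bit.
        master_shift = ((master_shift << 1) & 0xFF) | miso
        slave_shift = ((slave_shift << 1) & 0xFF) | mosi
    return master_shift, slave_shift


if __name__ == "__main__":
    got_by_master, got_by_slave = spi_exchange(0xA5, 0x3C)
    print(hex(got_by_master), hex(got_by_slave))
```

A send-only or receive-only shifter would need two such passes to do what a standard SPI port does in one.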

Roy Eltham · 2012-03-06 21:38

You can do SPI just like you do now on the Prop1, same with I2C.

This stuff is a specific protocol for chip to chip between prop 2s (or other chip that can do the protocol and keep up).

Kye · 2012-03-06 21:41

The purpose of that command is not to support SPI communication.

It uses a built-in state machine in each cog to quickly move data between cogs in different chips. The instruction is only for users who are making propeller chip meshes.

I believe Chip designed that instruction to move data at some insane speed between prop chip cores. I don't believe it even sends SPI-like data. The instruction is for users who really want to make a multi-processor prop chip system. It should make that possible.

---

SPI will be bit-banged as before. It would be nice to have hardware SPI support. Given the speed of each core and the limits of the SPI bus there will be more than enough time to bit-bang out data for the SPI bus. Since there are no interrupts it's not like the core will be doing anything else.

---

In general, when there are no interrupts in the whole system... Having a fast processor and then just making it do everything manually gives you more control. Since, again, the processor has nothing else to do.

I guess you can get into the details of using the JMPRET instruction. But, that's a different matter.

jmg · 2012-03-06 21:48

That's a shame, reduced to bit-banging, when a fast serial silicon engine is sitting right there, just not of the right type..

QuadSPI really is an emerging standard, it seems in the rush for the fancy stuff, the basics got overlooked here ...

jmg · 2012-03-06 21:52

Roy Eltham wrote: »

This stuff is a specific protocol for chip to chip between prop 2s (or other chip that can do the protocol and keep up).

Details of this 'special protocol' are where ?
I can see a pin-mapping opcode, but no mention of speed control ?
If this goes across isolation barriers, some speed choice will be needed.

Kye · 2012-03-06 21:54

Sorry, jmg. Me and Roy have been insiders so we have a little more knowledge about all this stuff. And only a little...

jmg · 2012-03-06 22:04

Kye wrote: »

In general, when there are no interrupts in the whole system... Having a fast processor and then just making it do everything manually gives you more control. Since, again, the processor has nothing else to do.

That's incorrect - I have plenty of more important things for the core to do, than twiddle pins.

Using your logic, no fast serial silicon at all would be included.

There IS fast serial silicon, which is a clear admission that speed and low overhead really does matter.

Why limit that, to just talking to another Prop II ??

Cluso99 · 2012-03-06 22:11

jmg: I beg to differ. This is why we love the prop. There are no preconceived ideas or hardware restrictions. That is what some of the cogs are for. If you don't have SPI, then you don't waste the hardware. Almost nothing is tied to pins, so you can place your SPI anywhere. And because of the way they work, you can even share SPI pins with other things. I share my SPI pins with the SRAM data pins. I share my I2C pins with SRAM R & W. When I don't want I2C I don't have any cogs doing this.

I am into dynamic loading of the cogs. While you can do this now, the code must be in hub to load from. While it has been done, I am trying to make it more general to dynamically load from SD card. Of course, the concept will also work for I2C eeprom and flash too.

So, what we have is a huge set of peripherals that are soft configured. For the prop 1 you could have 16 2-wire (TX & RX) serial ports, all running at different speeds (=32 I/O pins). And the same chip on a different board can have 16 I2C busses, or 8+ SPI busses, etc, etc, etc. Do you see where this is leading? We don't have interrupts to disrupt our code. And our software peripherals are much smarter than just a simple old UART, yet are actually simpler to use because we don't have lots of registers to control them. And within the software peripheral we can also do some processing on the data as it comes in and/or out. So, we can also do some protocols here too. There are lots of objects to do lots of different peripherals. The only downside is working out which are good and which are not, which have good comments and which do not. But this is getting better all the time.

While on UARTs, how many character buffers do you want? If you take over the whole prop, you could have an input buffer of almost 32KB! Just takes thinking in a different way.

How many instructions does it take to flash a led on other processors (and I don't mean using C or similar where they load many KB worth of data just to start with)?

PUB Main
  DIRA[17]~~                 'make P17 an output
  repeat
    !OUTA[17]                 'toggle P17
    waitcnt(clkfreq + cnt)  'wait some time (if you add 2 lines to use a crystal, this will be 1 second accurately)

Hope I got this right!

And here is one to output a message to the serial port. One of the cogs is used to perform the software uart (in the object FullDuplexSerial)

CON
  _clkmode  = xtal1 + pll16x
  _xinfreq  = 5_000_000
  rxPin  = 31                   'serial
  txPin  = 30
  baud   = 115200
OBJ
  fdx  : "FullDuplexSerial"
PUB Main
  fdx.start(rxPin,txPin,0,baud)                         'start serial driver to PC
  fdx.tx(0)                                             'cls
  repeat
    fdx.str(string("This program does nothing!",13))
    waitcnt(clkfreq*1 + cnt)                            'delay 1s

jmg · 2012-03-06 22:15

Cluso99 wrote: »

jmg: I beg to differ. This is why we love the prop. There are no preconceived ideas or hardware restrictions. That is what some of the cogs are for. If you don't have SPI, then you don't waste the hardware.

Only now in Prop II, I DO have the hardware, and yes, I hate to waste it.
I want to use the hardware that is already paid for and sitting there, designed for fast serial IO, for fast Serial IO.

Roy Eltham · 2012-03-06 22:57

jmg, I think you are attributing more to these instructions and hardware than exists. It's just taking a long and banging it out the SO/CLK pins, or banging the SI/CLK pins to read a long.
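
As a rough illustration of "taking a long and banging it out", the sketch below splits a 32-bit value into the bit stream that would appear on SO, one bit per clock, and rebuilds it from bits sampled on SI. The MSB-first order and the helper names `long_to_bits` and `bits_to_long` are assumptions; the thread does not say what bit order or framing the Prop II hardware uses.

```python
def long_to_bits(value, width=32):
    """Split a long into the bit sequence clocked out on SO.

    MSB-first order is an assumption, not something the spec states.
    """
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def bits_to_long(bits):
    """Rebuild a long from bits sampled on SI, one per clock, MSB first."""
    value = 0
    for bit in bits:
        value = (value << 1) | (bit & 1)
    return value


if __name__ == "__main__":
    stream = long_to_bits(0xDEADBEEF)
    print(len(stream), hex(bits_to_long(stream)))
```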

MagIO2 · 2012-03-06 23:36

Does Chip to chip communication mean prop II chip to prop II chip or does it mean prop II chip to Chip Gracey??? Does the prop II call home??? ;o)

I would not waste too much time in discussing the specs at this point in time! On one hand the chip is not finalized yet, but on the other hand each change to the basic architecture or bigger building blocks would delay the release date some more years!
jmg, I understand your point: when there is some built-in serial interface why not support a standard instead of doing your own stuff? Well, I don't know why but I believe there must be a reason and I trust Chip in this matter.

And in the end Chip is the one who designs the prop II. Listening to each and every comment coming from outside simply adds delay. Parallax has to pay the development and thus they have the right to design the silicon in their way! You on the other hand have the freedom to buy the prop II or use another uC.
As all the others have already stated, the goal is to add as little specialized hardware as possible. I like the C2C idea even if it is not standard. Having one prop II dedicated to all the user interface stuff and another dedicated to the system-critical stuff, with both able to communicate with little delay and few software resources, makes sense.

There is a full duplex driver for 4 serial interfaces for the prop I. Having 10 times more MIPS per COG theoretically allows 40 serial interfaces in one COG. I bet there will be drivers which allow SPI, I2C and a serial interface to be driven by one COG. So, 7 COGs left for doing your more important stuff.

jmg · 2012-03-07 02:13

Roy Eltham wrote: »

jmg, I think you are attributing more to these instructions and hardware than exists. It's just taking a long and banging it out the SO/CLK pins, or banging the SI/CLK pins to read a long.

You have almost exactly described a SPI port there. That is all it does, too.

For the want of a handful of flip-flops & a few SFR bits, this Serial shifter COULD still be a lot more flexible.
It could actually support a standard interface, as well as the 'in-house' one.

QuadSPI parts can clock to 100MHz/104MHz, and they are cheap, and widely available.
Programmable Logic can do QuadSPI using DDR IO.

Javalin · 2012-03-07 02:53

>For the want of a hand full of flip-flips & a few SFR bits, this Serial shifter COULD still be a lot more flexible.
> It could actually support a standard interface, as well as the 'in-house' one.

now that makes sense.... You don't want to tie pins to a task - i.e. an ADC pin, or SPI pin as other manufacturers do, but having some hardware functions per-cog that can drive fast SPI would be very useful and very powerful.

James

Heater. · 2012-03-07 03:17

Strangely enough a certain multi-core MCU from the company that shall remain nameless here also has serial chip to chip communications links using some weird protocol of their own creation whilst not having flexible serdes hardware for other serial devices which are expected to be handled in software. However they do have some extra tricks on the I/O such as clocking I/O to help with that. Their I/O pins are a lot less flexible in the way they can be allocated to tasks and cores though.

Cluso99 · 2012-03-07 04:45

I believe we can shift out quite easily. The video counters can even do it in P1. What is missing is the ability to shift in easily and I think that is a major missing feature for minimal silicon. Those requests have fallen upon deaf ears.

Heater. · 2012-03-07 05:56

Cluso,

I am inclined to agree. There was much discussion about this a long while ago and myself and others were proposing serdes hardware assistance in such a way that different serial input and output could be configured. Perhaps not optimal for serious high speed chip to chip comms but general purpose.

hinv · 2012-03-07 08:22

I am inclined to agree with jmg about having a standard interface if it can be just as fast. The QuadSPI thing though, I think is proprietary and needs to be licensed, and Parallax would like to avoid that...me too.
I would like to be proven wrong though.

jmg · 2012-03-07 12:02

hinv wrote: »

I am inclined to agree with jmg about having a standard interface if it can be just as fast. The QuadSPI thing though, I think is proprietary and needs to be licensed, and Parallax would like to avoid that...me too.
I would like to be proven wrong though.

QuadSPI is simply an electrical variant, so there is no license.
Quite a few uC already offer this, and I believe the new Infineon XMC45xx has this down to quite low cost parts.

You may be thinking of some Consumer Storage cards, where the form factor and some of the format/control, is licensed by some companies.

pedward · 2012-03-07 12:22

Guys, the prop 2 is 200MHz and is single cycle per instruction. I researched SPI devices recently and the fastest I saw them up to was 80MHz in single wire mode. I know that the SD spec says min 20MHz for SPI and QSPI. What that means is for the vast majority of the devices, you will have 10 instructions inside a loop to read and process a nibble. I have a Sandisk flip-chip USB/SD card that does ~26MB per second via the builtin USB and less via SPI. You can probably do the read/pack in 4 or 5 instructions, which means 40 to 50MHz, or 20 to 25MB per second in quad mode.

I don't really see the problem here.
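
As a rough check on the instruction-budget argument, here is a small Python sketch. It assumes one instruction per CPU clock, one loop pass per SPI clock, and four data lines (one nibble per SPI clock); the function names are made up for illustration. Under those assumptions a 20 MHz clock leaves 10 instructions per nibble, a 4-instruction loop works out to 50 MHz and 25 MB/s, and a 2-instruction loop would be needed to reach 100 MHz and 50 MB/s.

```python
def instructions_per_spi_clock(cpu_hz, spi_clock_hz):
    """Instructions available per SPI clock, assuming one instruction per CPU clock."""
    return cpu_hz / spi_clock_hz


def quad_spi_throughput(cpu_hz, instructions_per_nibble):
    """Return (spi_clock_hz, bytes_per_second) for a bit-banged quad-SPI read loop.

    Assumes each loop pass handles one 4-bit nibble and costs
    `instructions_per_nibble` single-cycle instructions.
    """
    spi_clock_hz = cpu_hz / instructions_per_nibble
    bytes_per_second = spi_clock_hz / 2   # two nibbles make one byte
    return spi_clock_hz, bytes_per_second


if __name__ == "__main__":
    cpu_hz = 200_000_000
    print(instructions_per_spi_clock(cpu_hz, 20_000_000))
    for loop_len in (10, 5, 4, 2):
        clock, rate = quad_spi_throughput(cpu_hz, loop_len)
        print(loop_len, clock / 1e6, "MHz", rate / 1e6, "MB/s")
```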

jmg · 2012-03-07 14:17

pedward wrote: »

Guys, the prop 2 is 200MHz and is single cycle per instruction. I researched SPI devices recently and the fastest I saw them up to was 80MHz in single wire mode.

Try
http://www.winbond-usa.com/hq/enu/ProductAndSales/ProductLines/FlashMemory/SerialFlash/
They have had 104MHz devices for a while. Faster ones WILL come, in the next few years.

This is not just for SD cards, it is also for fast table access, external code fetch, and communicating to other devices that are NOT Prop IIs using an Industry Standard interface.

If I was doing a new chip design now, I would even include DDR (optional) QuadSPI , as a half speed CLK is easier to transport and isolate than a full speed one.

The point is, the Prop core can do many other things besides flip pins, and a 100MHz QuadSPI would deliver/accept 1 Word every 16 machine cycles (A DDR one would do it in 8 cycles) - and those cycles are far better spent doing data manipulation, than pin flips.
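
The "1 Word every 16 machine cycles" figure can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming a 200 MHz core, a 32-bit word, and four data lines; the function name and parameters are illustrative.

```python
def cpu_cycles_per_word(cpu_hz, spi_clock_hz, data_lines=4, ddr=False, word_bits=32):
    """CPU clocks that elapse while one word arrives over the serial link.

    DDR moves data on both clock edges, doubling the bits per SPI clock.
    """
    bits_per_spi_clock = data_lines * (2 if ddr else 1)
    spi_clocks = word_bits / bits_per_spi_clock
    return spi_clocks * cpu_hz / spi_clock_hz


if __name__ == "__main__":
    print(cpu_cycles_per_word(200_000_000, 100_000_000))            # single data rate
    print(cpu_cycles_per_word(200_000_000, 100_000_000, ddr=True))  # double data rate
```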

A relatively minor silicon change, would mean a BIG increase in usable processing + data bandwidth.

FLASH is also one thing a Prop lacks, so being smart about how you talk to external flash, is pretty obviously important.

Cluso99 · 2012-03-07 19:31

P2 to P2 comms is, as previously stated, just a simple input pin, output pin and an internal or external clock and presumably two 32bit serial/parallel latches. Could even be 4*longs = 128bits. I would expect the clock generator will be scalable. At 200MHz that would give 200Mbps or perhaps even 400Mbps if clocked on both edges.
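
For what it is worth, the link-rate guess above is simple arithmetic, sketched here in Python. It assumes one data pin per direction and one bit per clock edge used; the function names are illustrative, and none of this is confirmed behaviour of the Prop II hardware.

```python
def link_rate_bps(clock_hz, both_edges=False):
    """Raw bit rate of a one-wire serial link: one bit per clock edge used."""
    return clock_hz * (2 if both_edges else 1)


def nanoseconds_per_long(clock_hz, both_edges=False, bits=32):
    """Time on the wire for one long at the raw link rate."""
    return bits * 1e9 / link_rate_bps(clock_hz, both_edges)


if __name__ == "__main__":
    print(link_rate_bps(200_000_000), nanoseconds_per_long(200_000_000))
    print(link_rate_bps(200_000_000, True), nanoseconds_per_long(200_000_000, True))
```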

No doubt we will be able to use this for other things once the details are known.

evanh · 2012-03-08 02:46

Hardware wise, it is a generic SPI port. The devil will be in the details though. Having controls on how the clock pin works is a trait of SPI. High-performance SPI ports always have multi-word buffering and interrupts backing the shift register, but that assumes a single core design. A Cog can easily and reliably perform the buffering.

One thing that starts to stand out in the Prop II threads is the increased value of faster Cogs. Throwing a whole Cog at a single minded task of one I/O function/engine is looking more and more wasteful as more and more transistors go into the Propeller design. Or, the more emphasis on MIPS per Cog the more precious each Cog becomes and the more pressure there is to have them doing more than waiting on a pin change to occur.

Needless to say this all leads to more demands for specialised hardware to perform common I/O functions.

Is this the only path? Can there be a smaller core design that can multiply out to 16, 32, 64 cores without losing a lot of individual bandwidth and the niceties of a register rich 32 bit architecture?

evanh · 2012-03-08 03:05

Hmm, the Prop II has a shadow register set in each Cog, right? I don't suppose there is hardware threading in there somewhere? It would be pretty cool to have a high and low priority thread in each Cog. This would effectively allow for eight background I/O controllers, one in each Cog, running at high priority. The eight low priority threads would run non-deterministic application level code.

In effect, 16 Cogs with the Hub access of 8 Cogs.

jmg · 2012-03-08 03:07

evanh wrote: »

Is this the only path? Can there be a smaller core design that can multiply out to 16, 32, 64 cores without losing a lot of individual bandwidth and the niceties of a register rich 32 bit architecture?

Sure, but that breaks the software compatibility.

An easier 'family variant', could be one with 4 cogs dropped out, and MORE RAM dropped in, even PSRAM.

NOT much design effort there, we just need a number on how much RAM a COG maps to.

Or one that does not allocate full Maths to ALL COGs.
Another number here, for Die cost of Maths in All Cogs ?

evanh · 2012-03-08 03:37

What?! Of course it breaks software compatibility. That was biffed out in the second P2 thread where Chip asked that very question.

Heater. · 2012-03-08 04:35

evanh,

Can there be a smaller core design that can multiply out to 16, 32, 64 cores without losing a lot of individual bandwidth and the niceties of a register rich 32 bit architecture?

The way I see it is that if one had 64 smaller cores, whatever bandwidth they have to the outside world is going to be wasted because now you have 8 times more contention for the HUB RAM. There is not much space in COG so data has to go to HUB and there is your bottleneck.

This was discussed a long while ago when Chip was musing over having 16 COGs or more RAM. With any shared memory architecture and especially if you have the round robin HUB access of the prop adding more cores has diminishing returns.
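
The diminishing-returns point can be put in numbers with a small Python sketch. It assumes a strict round-robin hub that serves one core per slot at a fixed total rate; the 200 MHz figure and the function names are assumptions for illustration, not Prop II specifications.

```python
def hub_share(num_cores, hub_accesses_per_second):
    """Per-core hub bandwidth under strict round-robin: an equal 1/N slice."""
    return hub_accesses_per_second / num_cores


def worst_case_wait_slots(num_cores):
    """Other cores' slots that pass before a core that just missed its window is served."""
    return num_cores - 1


if __name__ == "__main__":
    total = 200_000_000  # assumed hub slots per second, illustrative only
    for cores in (8, 16, 64):
        print(cores, hub_share(cores, total), worst_case_wait_slots(cores))
```

Going from 8 to 64 cores cuts each core's hub slice by a factor of 8 in this model, which is the contention described above.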

pedward · 2012-03-08 13:17

I talked to Chip last night about the P2P comms. He said that the design was changed so that it uses the internal clock to gate the data. This means that any chips that communicate need to be run off the same oscillator so they are synchronized. This is a compromise for very high speed chip2chip comms because the external clock input caused all sorts of havoc in the synthesis process.

A very simple topology to take would be like JTAG and establish a token ring configuration.

Let's say you have 4 devices, all slaved off of an external oscillator with equal trace lengths to the clock bus. The output of one chip is tied to the input of its neighbor, repeat until all chips are connected. If you want to communicate from the first chip to the last chip the data has to pass through 2 chips to get there.
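
The hop count in that ring can be sketched in a few lines of Python. This assumes a closed, one-way ring with chips numbered 0 to N-1 in the direction data flows; the function name is illustrative.

```python
def intermediate_chips(src, dst, num_chips):
    """Chips a message passes through between src and dst in a one-way ring."""
    if not (0 <= src < num_chips and 0 <= dst < num_chips):
        raise ValueError("chip index out of range")
    hops = (dst - src) % num_chips   # links crossed travelling downstream
    return max(hops - 1, 0)


if __name__ == "__main__":
    print(intermediate_chips(0, 3, 4))  # first chip to last chip of four
    print(intermediate_chips(3, 0, 4))  # last chip back to first, direct neighbour
```

Each intermediate chip has to receive and resend the data, so latency grows with the distance around the ring.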

The alternative is to tie the tx and rx together into a single wire, but you need some sort of arbitration scheme then, like CSMA/CD.

It boils down to if you want to do 200Mbps comms, you have some compromises you have to accept to get that level of speed.

The alternative is to use a coded message passing scheme. If it were me, I'd use a token ring topology and the JTAG chaining.

Cluso99 · 2012-03-08 14:35

There are hardware instructions that have been provided in P2 for multiple threading. Currently, not everything is fully explained. We also now have an additional 128 longs (cluts) that can be used as fifos or stacks, from both ends!! IIRC we can also get to this indirectly too. This will provide some extra space in the cogs.

Provided the P2 takes off, developing new P2 variations is not so expensive time and money wise now - Chip has learnt a lot and there are better tools.

(rather than pollute this thread, see my musings thread where heater put forward some ideas. I am going to put some up there now too)

evanh · 2012-03-08 15:10

pedward wrote: »

The alternative is to use a coded message passing scheme. If it were me, I'd use a token ring topology and the JTAG chaining.

That is SPI chaining as per default recommended wiring. I guess that figures since JTAG uses a SPI port.

evanh · 2012-03-08 15:33

Cluso99 wrote: »

There are hardware instructions that have been provided in P2 for multiple threading.

I suspect that's just to help with software threading, ie: task switching. What I'm thinking of is a WAIT instruction on the high priority thread being able to immediately return and take off at full speed with say only one clock of delay to flip in the shadow registers. And only gives up the Cog again when it goes back to waiting. Not unlike an interrupt from the low priority thread's POV.

Of course, this slaughters determinism for the low priority thread. Not sure if that's so desirable.

EDIT: I guess this *is* more like an interrupt than a thread. O_o

Provided the P2 takes off, developing new P2 variations is not so expensive time and money wise now - Chip has learnt a lot and there are better tools.

That sure brings on the anticipation.

evanh · 2012-03-08 15:51

But it is a cool interrupt method as the inline code effectively defines the IRQ source on the fly. Indirect WAITs anyone?

Err, a flaw, maybe a fatal one ... any Hub accesses from the interrupt code are indeterminate ... or not, Hub accesses after a WAIT on the Prop1 are indeterminate anyway, right?
