Prop II question : Serial Chip-To-Chip Communication

pedward · 2012-03-08 15:58

With the new instruction that gives you the number of clocks since the last invocation, you can use this to average out the clocks per task. This architecture is called cooperative multi-tasking and has always had caveats, but is traditionally the easiest to implement.
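A minimal sketch of that idea, assuming a simple round-robin loop: each task runs until it yields voluntarily, and the scheduler averages the elapsed ticks per turn. The task names `blink` and `poll` are hypothetical, and `time.perf_counter_ns` is only a stand-in for the clock-count instruction, whose actual name and semantics are not given here.

```python
import time
from collections import defaultdict


def blink():
    """Hypothetical light task: does a little work, then yields."""
    while True:
        sum(range(100))
        yield


def poll():
    """Hypothetical heavier task: does more work per turn, then yields."""
    while True:
        sum(range(1000))
        yield


def run(tasks, rounds, clock=time.perf_counter_ns):
    """Round-robin cooperative scheduler; returns average ticks per turn."""
    totals = defaultdict(int)
    for _ in range(rounds):
        for name, task in tasks.items():
            start = clock()        # stand-in for the clock-count instruction
            next(task)             # the task runs until it yields voluntarily
            totals[name] += clock() - start
    return {name: totals[name] / rounds for name in tasks}


averages = run({"blink": blink(), "poll": poll()}, rounds=50)
for name, ticks in averages.items():
    print(f"{name}: {ticks:.0f} ns per turn on average")
```

The caveat shows up directly: nothing stops a task from holding the processor for as long as it likes before it yields, so the averages describe behaviour rather than guarantee it.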

evanh · 2012-03-08 16:02

Right, so not applicable to the interrupt mechanism I'm presenting.

jmg · 2012-03-08 20:13

pedward wrote: »

I talked to Chip last night about the P2P comms. He said that the design was changed so that it uses the internal clock to gate the data.

So it has moved closer to SPI already.

pedward wrote: »

This means that any chips that communicate need to be run off the same oscillator so they are synchronized. This is a compromise for very high speed chip2chip comms because the external clock input caused all sorts of havoc in the synthesis process.

Did he mention what the clock speed choices were ?
Did you ask specifically about QuadSPI support ?

Running off one PLL oscillator does not eliminate the system fish-hooks, it simply makes synthesis happier.

Actually getting real speed across multiple chips is still going to need a choice of CLOCK rates to work in the real world.
( Hey : we are now 99% of the way to SPI !!)

Most standard SPI ports run up to /2 on MASTER mode, and less on SLAVE (/4 or /10), and some specify higher speeds for Simplex.

I've even seen one that implements TWO nicely granular baud rate generators, and a bit-pattern flips between the two.

If someone has done that, they must have had a reason... eg I think it allows easy support of different SPI chips on one BUS.

pedward · 2012-03-08 21:25

It uses the internal clock, ala 200MHz. There is no external clock line present because he said the synthesis compiler was looking at it as another clock net and it screwed up a whole bunch of things.

It doesn't look like SPI any more than SPI is the name for a clocked serial protocol.

I just don't know why you are so hung up on SPI. He isn't going to put SPI or QSPI on the chip, just like it won't have USB or Ethernet or UARTs, all of this is left up to the developer to implement as a software module.

jmg · 2012-03-08 22:05

pedward wrote: »

I just don't know why you are so hung up on SPI

Is that a serious question ?

The REALLY OBVIOUS answer, is because it is a STANDARD, thousands of chips use it.
Another simple reason : It will cost Parallax LESS to include, than answer (repeatedly) why they failed to do so...
(but came so close...). Sometimes, avoiding being a laughing stock matters.

pedward wrote: »

He isn't going to put SPI or QSPI on the chip

What you have described so far, is VERY close to SPI - do you KNOW it is not going to finally have SPI ability ?

You are not designing the chip. The steps needed from here are trivial.
All it really needs, is someone to actually pause, long enough to think of it.

Roy Eltham · 2012-03-08 22:20

jmg,

I doubt it's trivial, or it would have been done. I honestly don't know your knowledge level on actual chip design and synthesis, but I just don't see it as the easy task you make it out to be. Case in point, to support SPI properly in both directions, it would have to support an external clock, and we've already been told that an external clock causes synthesis problems. Even if that were not the case, my understanding of how things work in the COGs is such that SPI at best would be an instruction to read a byte and an instruction to write a byte, and you would still have to do everything else yourself. So you are arguing over one instruction versus a handful.

Also, by the time the Prop 2 actually hits the market we'll have code examples and objects for doing SPI, QSPI, I2C, and others. So it's not like you won't be able to easily interface to all those standard devices...

Roy

jmg · 2012-03-08 23:35

Roy Eltham wrote: »

jmg,

I doubt it's trivial, or it would have been done. I honestly don't know your knowledge level on actual chip design and synthesis, but I just don't see it as the easy task you make it out to be. Case in point, to support SPI properly in both directions, it would have to support an external clock, and we've already been told that an external clock causes synthesis problems. Even if that were not the case, my understanding of how things work in the COGs is such that SPI at best would be an instruction to read a byte and an instruction to write a byte, and you would still have to do everything else yourself. So you are arguing over one instruction versus a handful.

Also, by the time the Prop 2 actually hits the market we'll have code examples and objects for doing SPI, QSPI, I2C, and others. So it's not like you won't be able to easily interface to all those standard devices...

Roy

Take a look at (almost) any uC, and you will see how they implement a slave SPI.
They use a Clock sync scheme, based on the internal clock. ( which is what the Prop II now does, see #27 )

Next, take a look at some Verilog, or VHDL, and see the difference between write only, and read-while-write, serial interface, is really just a MUX.
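To make the "just a MUX" point concrete, here is a behavioural sketch in Python rather than HDL. It is not taken from any Prop II design; the function names and the MSB-first bit order are assumptions for illustration. The only difference between the two modes is what gets fed into the vacated bit position on each clock.

```python
def shift_step(reg, width, serial_in, read_while_write):
    """One clock of an MSB-first shift register.

    Returns (new_reg, serial_out). The mode only changes what fills
    the vacated LSB: the input pin, or a constant 0.
    """
    serial_out = (reg >> (width - 1)) & 1
    fill = serial_in if read_while_write else 0   # this line is the mux
    new_reg = ((reg << 1) | fill) & ((1 << width) - 1)
    return new_reg, serial_out


def transfer(tx_word, rx_bits, width=8, read_while_write=True):
    """Clock out tx_word while (optionally) clocking in rx_bits."""
    reg = tx_word
    sent = []
    for bit in rx_bits:
        reg, out = shift_step(reg, width, bit, read_while_write)
        sent.append(out)
    return sent, reg


incoming = [0, 0, 1, 1, 1, 1, 0, 0]
sent, received = transfer(0xA5, incoming)
print(sent, hex(received))
```

With `read_while_write=False` the same bits go out, but the register ends up holding zeros instead of the incoming word.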

Sure, you can bash away in SW, but always at a lower speed, and those cycles COULD have been otherwise used for something more productive.

With 32 bit registers, you would logically have a choice of 1/2/3/4 bytes per load. - at full speed & width, that is 32 opcodes, and at half speed, 64 opcodes, you have available for anything you choose BEFORE you even start to slow the link down.
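The arithmetic behind those figures can be sketched as follows, assuming one bit shifted per clock at full speed and one opcode executed per clock. Both are assumptions read off the numbers above, not confirmed Prop II timings, and the function name is made up for illustration.

```python
def opcodes_between_loads(bytes_per_load, clocks_per_bit, clocks_per_opcode=1):
    """Opcodes the cog can run while one load is being shifted out."""
    bits = 8 * bytes_per_load
    return bits * clocks_per_bit // clocks_per_opcode


for n in (1, 2, 3, 4):
    full = opcodes_between_loads(n, clocks_per_bit=1)
    half = opcodes_between_loads(n, clocks_per_bit=2)
    print(f"{n} byte(s) per load: {full} opcodes at full speed, {half} at half speed")
```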

evanh · 2012-03-09 03:29

I'll start by saying there is no such thing as standard SPI. The basic structure is just a shift-register. There is a little more but basically everything else is built on top of that. There are plenty of possibilities for incompatibility.

jmg wrote: »

With 32 bit registers, you would logically have a choice of 1/2/3/4 bytes per load. - at full speed & width, that is 32 opcodes, and at half speed, 64 opcodes, you have available for anything you choose BEFORE you even start to slow the link down.

Here we are hitting on my side topic again - The increasing demand for offloading trivial I/O functionality to dedicated hardware as the MIPS and complexity of each Cog increases.

The "interrupt" like background thread, proposed above, I feel is a great solution. Without trying to think of and tack on every I/O device conceivable it allows the project designer to load up to eight timing-critical "drivers" while still having all spare MIPS available for a general computing level.

Phil Pilgrim (PhiPi) · 2012-03-09 08:41

jmg wrote:

The REALLY OBVIOUS answer, is because it [SPI] is a STANDARD, thousands of chips use it.

Thousands of chips implement UARTs and I2C interfaces in hardware, too. But that's neither the way things are done with the Propeller nor an argument for including the facility in hardware. In the Propeller chip, you can think of each cog as a microcoded peripheral. And that's what gives the chip its ultimate flexibility. Today it's SPI; tomorrow, who knows? But a Propeller program will probably be able to handle it.

-Phil

richaj45 · 2012-03-09 11:44

Hello:

My first point is external asynchronous inputs, like an external shift clock, are done in HDL and synthesized every day using modern ASIC tools. (I know from current and past experience)

I would think that a useful bit of hardware would be a shift-register with a wait-on empty/full instruction.
Add in different clocking modes and length options and it would be a great benefit to PASM code.

After all, that is what was done for the video generator, albeit only for output.

cheers,
rich

jmg · 2012-03-09 11:47

evanh wrote: »

Here we are hitting on my side topic again - The increasing demand for offloading trivial I/O functionality to dedicated hardware as the MIPS and complexity of each Cog increases.

The "interrupt" like background thread, proposed above, I feel is a great solution. Without trying to think of and tack on every I/O device conceivable it allows the project designer to load up to eight timing-critical "drivers" while still having all spare MIPS available for a general computing level.

Sure, but that better belongs in the musing thread, as it is certainly not a minor change to an existing peripheral.

Hochips is offering a 3 core/thread 8051, but there, they hard slice the core, which misses some targets.
Better would be a user access to the time-slice allocation, which could then deliver any combination of deterministic, and resource/priority based sharing.

pedward · 2012-03-09 11:48

FWIW, the Prop 2 will have the guts to do 10Mb ethernet and Full-speed USB in 1 COG. Each pair of I/O pins has a comparator with a 30MHz bandwidth (per Chip), that can do differential comms. There will be a lot more possible with soft-peripherals on the Prop2, eliminating a lot of external stuff.

jmg · 2012-03-09 12:09

richaj45 wrote: »

Hello:

My first point is external asynchronous inputs, like an external shift clock, are done in HDL and synthesized every day using modern ASIC tools. (I know from current and past experience)

I would think that a useful bit of hardware would be a shift-register with a wait-on empty/full instruction.
Add in different clocking modes and length options and it would be a great benefit to PASM code.

After all, that is what was done for the video generator, albeit only for output.

Correct.
I would expand on the "wait-on empty/full instruction." a little, given the Prop already does this very well.

Ideally, you want to both pace-on-input, _and_ to give user possible spare cycles, and you _can_ do both with a variant of "wait-on empty/full instruction", that 'waits if needed' - ie a READ of the shifter is done with an inbuilt/implicit wait (as distinct from polling a flag, then read, as more common uC do).

A write would be no-delay on the first one, and a possible delay on the successive writes, depending on if the shifter is done.

Ideally, a queued send, would deliver the next bits without any extra delays. That detail can matter.
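A toy timing model of the "waits if needed" write, counted in clocks. It assumes the simplest case (a single shift register, no extra holding register), and the class and method names are hypothetical. The first write returns immediately; later writes stall only for as long as the shifter is still busy, so user code keeps its spare cycles and the link never idles.

```python
class WaitingShifter:
    """Toy model: a write stalls only while the previous word is still shifting."""

    def __init__(self, width=32):
        self.width = width      # clocks needed to shift one word out
        self.now = 0            # current clock count
        self.busy_until = 0     # clock at which the shifter becomes free

    def work(self, clocks):
        """User code spending `clocks` clocks on something else."""
        self.now += clocks

    def write(self):
        """Load the next word; return how many clocks the write stalled."""
        stall = max(0, self.busy_until - self.now)
        self.now += stall
        self.busy_until = self.now + self.width
        return stall


port = WaitingShifter(width=32)
stalls = []
for _ in range(3):
    stalls.append(port.write())
    port.work(20)               # 20 clocks of other work between writes
print(stalls, port.busy_until)
```

In this run the shifter is busy for all 96 clocks of the three words, even though the user code spent 20 clocks per word doing something else.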

This new Prop II port even has 3 opcodes allocated, so the silicon certainly can know 'what the user intends'

By doing that, a user can use any spare cycles, and still keep full IO speed.

It's really a simple balance, of using simple IO silicon, (now already 99% there) to bring the data rates down to user software rates, without the serious resource waste of pin-bashing.
Doing that wastes bandwidth, and is a poor use of silicon.

If Chip is serious about inter-chip speed, I fully expect he _already_ does a DDR option, as the pin-buffers will be a bottle neck.

jmg · 2012-03-09 12:36

pedward wrote: »

FWIW, the Prop 2 will have the guts to do 10Mb ethernet and Full-speed USB in 1 COG. Each pair of I/O pins has a comparator with a 30MHz bandwidth (per Chip), that can do differential comms. There will be a lot more possible with soft-peripherals on the Prop2, eliminating a lot of external stuff.

Great, besides USB and Ethernet, there are also Profibus/RS485 networks that are now over 50MHz.

I see Prop II does mention this :
["CLUT ...may be used as a 256 Long FIFO buffer "]

If that can work in both directions, and 'clip directly' onto Data/ comparator port pins, on one end. (just as it does in Video) then we have the balance of Silicon and SW everyone is hoping for.

( On the topic of FIFOs, I see new Infineon devices have a FIFO block, you can skew allocate between TX and RX - clever, so one block of ram can be allocated to suit the application, and die area is not wasted)
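A rough sketch of that skewed allocation: one shared block of words carved into a TX ring and an RX ring at a chosen split point. The class name, the 256-word size and the method names are assumptions for illustration, not a description of the Infineon block or of the Prop II CLUT.

```python
class SplitFifo:
    """Two ring buffers carved out of one shared block of `size` words."""

    def __init__(self, size=256, tx_words=128):
        if not 0 < tx_words < size:
            raise ValueError("tx_words must leave room for both directions")
        self.ram = [0] * size
        # Each region is (base address, length) inside the shared block.
        self.regions = {"tx": (0, tx_words), "rx": (tx_words, size - tx_words)}
        self.head = {"tx": 0, "rx": 0}
        self.count = {"tx": 0, "rx": 0}

    def push(self, side, word):
        """Append a word to one side; returns False if that side is full."""
        base, length = self.regions[side]
        if self.count[side] == length:
            return False
        tail = (self.head[side] + self.count[side]) % length
        self.ram[base + tail] = word
        self.count[side] += 1
        return True

    def pop(self, side):
        """Remove the oldest word from one side; returns None if empty."""
        base, length = self.regions[side]
        if self.count[side] == 0:
            return None
        word = self.ram[base + self.head[side]]
        self.head[side] = (self.head[side] + 1) % length
        self.count[side] -= 1
        return word


# A transmit-heavy application: 192 words for TX, the remaining 64 for RX.
fifo = SplitFifo(size=256, tx_words=192)
fifo.push("tx", 0x1234)
fifo.push("rx", 0x5678)
print(hex(fifo.pop("tx")), hex(fifo.pop("rx")))
```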

Cluso99 · 2012-03-09 16:46

The clut/fifo in each cog can indeed be used as two fifos, one from each end.

I did not see anything about ethernet or usb and presumed that was no longer in the P2. For these, serial in shifters will no doubt be required. So maybe we will actually get these functions and that it just has not been detailed to us ??? I at least now have some hope of getting this.

BTW Just to make it clear, I am not looking for I2C or SPI silicon - just the ability to use the vga and counters in such a way as to clock in, just as they can be used to clock out. With this addition, the P2 would be capable of all sorts of comms, maybe even CAN.

Hindsight is wonderful... It is a shame P1 didn't have provision to gate an input into the vga serialiser. But I have said before, we have not exhausted the functions of the counters and vga in P1 yet. The new sd spi drivers use counters. I am sure we could use the vga to output serial (uart) and this would save code space.

evanh · 2012-03-09 17:14

jmg wrote: »

Sure, but that better belongs in the musing thread, as it is certainly not a minor change to an existing peripheral.

Minor change or not, if it's not in the finished product then tough. And I feel this conversation is in the correct thread, the deserialiser issue is one I'm very interested in and can be achieved in software. But at the cost of a whole Cog and reduced max throughput, which does seem a bit wasteful given how simple the basic hardware is.

Actually, I'm wondering if Chip may have already thought of what I'm proposing. What exactly is the main purpose of the shadow registers?

Better would be a user access to the time-slice allocation, which could then deliver any combination of deterministic, and resource/priority based sharing.

The performance of 100% when it's needed is better. And also, when it's not needed the 100% goes to the low priority thread. Better all round.

Even multitasking with its context switching is superior.

pedward · 2012-03-09 18:06

I guess I should say that "guts" means horsepower. The Prop 2 doesn't have USB or Ethernet, but you can implement them (with differential signalling) using a COG and the comparator inputs.

evanh · 2012-03-09 18:19

One example of where I'm looking would be putting the raw bit-bashing code in the high-priority thread with say a basic double-buffer inline hard coded for maximum sampling speed and what goes in the low-priority is the associated extended buffering and protocol stack handling everything coming from the OS or app.

Even better, this neatly avoids the timing critical part from ever having to deal with Hub accesses at all.
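A minimal sketch of the double-buffer split described above, assuming a fixed block size. `sample()` stands for the hard-coded fast path in the high-priority thread and `take()` for the low-priority side; both names are hypothetical, and a real driver would also have to cope with the slow side falling behind.

```python
class PingPong:
    """Two fixed-size buffers: one being filled, one available to drain."""

    def __init__(self, size):
        self.buffers = [[0] * size, [0] * size]
        self.fill = 0        # index of the buffer the sampler writes into
        self.pos = 0         # next free slot in that buffer
        self.ready = None    # index of a completed buffer, if any

    def sample(self, value):
        """Fast path: store one sample and swap buffers when full."""
        self.buffers[self.fill][self.pos] = value
        self.pos += 1
        if self.pos == len(self.buffers[self.fill]):
            self.ready = self.fill
            self.fill ^= 1
            self.pos = 0

    def take(self):
        """Slow path: return a completed block once, or None if none is ready."""
        if self.ready is None:
            return None
        block = list(self.buffers[self.ready])
        self.ready = None
        return block


pp = PingPong(size=4)
for value in range(4):
    pp.sample(value)
print(pp.take())
```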

evanh · 2012-03-09 18:23

evanh wrote: »

Even better, this neatly avoids the timing critical part from ever having to deal with Hub accesses at all.

Well, that part would depend on some sort of shared memory within the Cog so maybe not. Although, the CLUT could fill this purpose.

Ariba · 2012-03-09 18:27

Cluso99 wrote: »

...
I did not see anything about ethernet or usb and presumed that was no longer in the P2. For these, serial in shifters will no doubt be required. So maybe we will actually get these functions and that it just has not been detailed to us ??? I at least now have some hope of getting this.
...

The Prop II only provides some hardware at the pins to support USB and Ethernet and other such interfaces. There are these switchable 1.5k pullups and the differential inputs with programmable threshold levels. All the rest is done by software. You don't need a serial shifter in hardware for a 10..12 MBit stream when you have a 200 MIPS processor. The Prop 1 can do lowspeed USB (1.5 MBit) with its 20 MIPS.
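As a back-of-envelope check on those figures, the instruction budget per bit is simply MIPS divided by the bit rate in Mbit/s. The helper name is made up, and this ignores bit stuffing, clock recovery and any hub access stalls.

```python
def instructions_per_bit(mips, mbit_per_s):
    """Average number of instructions available for each bit on the wire."""
    return mips / mbit_per_s


for label, mips, rate in [
    ("Prop 1, low-speed USB", 20, 1.5),
    ("Prop II, 10 Mbit Ethernet", 200, 10),
    ("Prop II, full-speed USB", 200, 12),
]:
    print(f"{label}: {instructions_per_bit(mips, rate):.1f} instructions per bit")
```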

I agree that a deserializer on the Prop 1 would be great, but the Prop 2 is so much faster that this is no longer so important.

Andy

evanh · 2012-03-10 17:51

evanh wrote: »

jmg wrote: »

Better would be a user access to the time-slice allocation, which could then deliver any combination of deterministic, and resource/priority based sharing.

The performance of 100% when it's needed is better. And also, when it's not needed the 100% goes to the low priority thread.

Just been thinking about this a bit more and I've realised there is a downside to having any sort of priority. And that is the pipeline takes time to fill upon any unpredicted deviation. There is nothing that can be done to reduce "interrupt" lag but thrashing can be eliminated by making sure that what is already in the pipe gets completed. So, it's not all bad.

Hard slicing, which is what I was responding to, doesn't have this problem as its instruction fetches are preallocated. If not used then the Cog is idle for that clock.
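A small sketch of hard slicing with a user-visible slot table, assuming one instruction fetch per clock. The thread names, the table contents and the `ready` callback are all hypothetical. A slot whose owner has nothing to run becomes an idle clock instead of being handed to another thread.

```python
def run_slots(slot_table, ready, clocks):
    """Return which thread (or None for an idle clock) runs on each clock."""
    trace = []
    for clock in range(clocks):
        owner = slot_table[clock % len(slot_table)]   # preallocated fetch slot
        trace.append(owner if ready(owner, clock) else None)
    return trace


# One slot in four goes to the I/O thread, the rest to the application.
table = ["io", "app", "app", "app"]


def ready(thread, clock):
    # Assumption for the demo: the I/O thread has no work after clock 3.
    return thread == "app" or clock < 4


print(run_slots(table, ready, clocks=8))
```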

On a related note, how many banks of shadow registers are there? On page 14 of the Detailed Preliminary Feature List there is one instruction, SETMAP, that seems to suggest there are many. I'm guessing that there is a fair amount of future-proofing in the range as it appears to be a 6 bit value - potentially up to 64 banks per Cog! That's a total of one megabyte of quad-ported Cog RAM! Maybe there are plans for portions of the register set to be switchable ...

Kye · 2012-03-11 08:54

It's not shadow RAM. The setmap function uses the regular cog RAM to remap portions of it.

evanh · 2012-03-11 18:35

Ah, cool, thanks, makes sense.

I thought I saw a comment somewhere regarding a single shadow register bank. Maybe I misunderstood which registers were being referenced, or maybe it was just speculation ...

evanh · 2012-04-29 04:34

evanh wrote: »

Just been thinking about this a bit more and I've realised there is a downside to having any sort of priority. And that is the pipeline takes time to fill upon any unpredicted deviation. There is nothing that can be done to reduce "interrupt" lag but thrashing can be eliminated by making sure that what is already in the pipe gets completed. So, it's not all bad.

It's a necro I know ...

Heh, there's always ways isn't there, ... the fix for laggy pipeline filling is to have two pipelines that switch with the registers. Dunno why I hadn't thought of this earlier.

jmg, you're right of course, slicing works too and I was wrong to say multitasking is superior. Slicing does require another register set for each thread I believe ... could get hungry on the silicon real fast. Either way, I still like my idea of a wait instruction instantly livening up a dormant thread while the processor is already busy.

pjv · 2012-04-29 08:22

Hi All;

I was REALLY disappointed to hear Chip say in his UPEW address that serializer SER/DES functions had been eliminated. In my opinion this is a huge error in judgement (barring of course technical impossibilities). It is the second most important thing missing from Prop1 (indirect addressing being number one for me), and not having those serial capabilities for Prop 2.... well !

It seems to me such a small issue to add some input and auto-shift capability to the counters' phase registers, that I'm just aghast at this revelation.

So much effort is dedicated to the video end of things which I have not yet ever used. Surely a tiny bit of silicon for high speed serial is not too much to ask ?

Perhaps in Prop 3 you say ? Unfortunately at my age, I will not be alive to enjoy that epiphany!

(Sad) Cheers,

Peter (pjv)

jmg · 2012-04-29 14:17

pjv wrote: »

It seems to me such a small issue to add some input and auto-shift capability to the counters' phase registers, that I'm just aghast at this revelation.

Yes, counters is the obvious place to 'patch in' a simple serial option.

The counters already have Pin IO ability, and they already have registers decoded in the memory map, so the cost is buried right in the counter cell, and simply needs a SHIFT option, and a register BIT to enable it. (and maybe a couple more for Clk Dividers for master )
There are spare bits in the Counter control register.

Cost is very low - A Mux in the counter carry logic, and bit(s) that are already spare in the control register.
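A behavioural sketch of that mux, in Python rather than HDL. It assumes a 32-bit phase register and a single hypothetical `shift_mode` control bit; no existing Propeller counter mode works this way, which is the point of the request.

```python
MASK32 = 0xFFFFFFFF


def counter_step(phs, frq, pin, shift_mode):
    """One clock of a counter whose phase register can also shift in a pin."""
    if shift_mode:
        # Proposed option: shift left and take the pin state as the new LSB.
        return ((phs << 1) | (pin & 1)) & MASK32
    # Existing behaviour: accumulate the frequency register, wrapping at 32 bits.
    return (phs + frq) & MASK32


phs = 0
for bit in (1, 0, 1, 1):
    phs = counter_step(phs, frq=0, pin=bit, shift_mode=True)
print(bin(phs))
```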

( I'd also like to capture from the Systick timer on an external event, but that is a different area of silicon.)
