What is the maximum possible throughput of SPI on the Prop?

Mickster · 2016-05-03 16:15

IIRC @FriedV was achieving 1MHz but believed he could push it beyond 2MHz, using PropBASIC. I am looking at a clock speed of 96MHz.

Chris Savage · 2016-05-03 16:29

I have characterized the speed using a SPIN object I created bit-banging the protocol, but in PASM it would be so much faster. I should have the relevant article back on my site soon.

Ariba · 2016-05-03 16:47

The maximum bit rate with a 96MHz system clock is 24 Mbit/s, with the help of the counters.
This works well for transmit; for receiving, 12 Mbit/s is safer (but you asked for the maximum).

Andy
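
A rough sanity check on the 24 Mbit/s figure, as a sketch only. It assumes a typical PASM instruction takes 4 system clocks and that the counter-assisted transmit loop shifts out one bit per instruction. Modelling the 12 Mbit/s receive rate as two instructions per bit is an assumption for illustration, not something stated in the post, and the function name is made up.

```python
def max_bitrate(sysclk_hz, clocks_per_instruction=4, instructions_per_bit=1):
    """Bit rate when every bit costs a fixed number of instructions."""
    return sysclk_hz / (clocks_per_instruction * instructions_per_bit)

SYSCLK = 96_000_000  # the 96MHz system clock from the original question

tx = max_bitrate(SYSCLK)                          # one instruction per bit
rx = max_bitrate(SYSCLK, instructions_per_bit=2)  # assumed: two per bit

print(f"transmit: {tx / 1e6:.0f} Mbit/s, receive: {rx / 1e6:.0f} Mbit/s")
```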

Chris Savage · 2016-05-03 16:56

Of course, you need to check the max speed of the device you're using. Many popular SPI chips still have a fairly slow max clock speed, relatively speaking.

JasonDorie · 2016-05-03 17:02

I've used the video write hardware to push SPI data at 20MHz on a standard 80MHz clock prop.

Mickster · 2016-05-04 11:26

Thanks guys. I am looking at a 10MHz device but about half that speed will be more than adequate.

Peter Jakacki · 2016-05-04 12:03

I thought the OP was talking about "throughput", not clock speed. I've seen Spin/PASM implementations that may boast 10MHz but the throughput is way way below that. Then there is sustained throughput vs burst.....

Wossname · 2017-06-16 19:22

I wonder if it is possible to have Counter A and Counter B synchronised 180 degrees out of phase (I'm thinking this is needed for 'setup time' on some long SPI bus wires) so that PHSA provides the clock pulses and PHSB (fed with consecutive MOV and SHR) can achieve 1 bit per instruction of TX bandwidth. I guess the clock counter would have to be running 2x the data frequency.

Achieving this fine phasing control without upsetting the SPI bus might be a fun experiment.

SaucySoliton · 2017-06-16 21:05

The fsrw object does this. It sets the initial PHSA to 0. I think you could get the phase shift you describe by choosing some other initial value. I just posted some similar code as part of a memory interface.
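
The "some other initial value" idea can be worked out numerically. This is a minimal sketch, assuming the counter is in NCO mode, where FRQx is added to the 32-bit PHSx every system clock and the pin follows PHSx bit 31. The helper names are hypothetical, and the exact instruction at which PHSx gets written (which also shifts the phase) is not modelled.

```python
def nco_frq(f_out_hz, sysclk_hz):
    """FRQx value so that PHSx bit 31 toggles at f_out_hz (NCO mode assumed)."""
    return round(f_out_hz * 2**32 / sysclk_hz) & 0xFFFFFFFF

def phase_preset(degrees):
    """Initial PHSx value that offsets the bit-31 output by the given phase."""
    return round((degrees % 360) / 360 * 2**32) & 0xFFFFFFFF

# A 24MHz clock from a 96MHz system clock, started half a period late.
print(hex(nco_frq(24_000_000, 96_000_000)))
print(hex(phase_preset(180)))
```

Presetting PHSA to 0 gives no offset; a preset of half the 32-bit range gives the 180 degree shift asked about.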

Wossname · 2017-06-16 21:26

SaucySoliton wrote: »

The fsrw object does this. It sets the initial PHSA to 0. I think you could get the phase shift you describe by choosing some other initial value. I just posted some similar code as part of a memory interface.

Yeah, but I'm thinking that sometimes SPI needs short transactions (less than 32 bits, or even 16 bits). The shorter the transactions become, the more problematic the counter-manipulation overheads will be.

jmg · 2017-06-17 02:18

Wossname wrote: »

SaucySoliton wrote: »

The fsrw object does this. It sets the initial PHSA to 0. I think you could get the phase shift you describe by choosing some other initial value. I just posted some similar code as part of a memory interface.

Yeah, but I'm thinking that sometimes SPI needs short transactions (less than 32 bits, or even 16 bits). The shorter the transactions become, the more problematic the counter-manipulation overheads will be.

Yes, there are many trade-offs, and to complicate things more, "SPI" is not a formal standard.

Strictly, SPI is duplex with data-in at the same time as data-out, but in most real memory uses, you can speed things up with a simplex design.

Then, there is QuadSPI, which just about all FLASH parts now support, and certainly is the main-volume driver.

Mark_T · 2017-06-17 23:32

The fast way for SPI that only writes and doesn't read back is to use WAITVID to push out the bits.

Mark_T · 2017-06-18 00:07

I had a search and found some test code I had. The basic approach is to use 2 bit per pixel mode
of video driving the SCLK and MOSI. 2 video clocks are needed per bit output. This means having to
duplicate and spread the bits apart to send them, but the application I was working with was painting
pixels into a SPI TFT chip where to fill a rectangle you have to send the same pixel data repeatedly
so the cost of pre-processing the data is spread out across many bytes sent.

To be more precise I was encoding the byte abcdefgh as
a1a0b1b0c1c0d1d0e1e0f1f0g1g0h1h0, ie alternating data/clock and every other clock is inverted.

Can't recall the max video clock rate, but the max SPI clock will be 1/2 of that by this method.
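
The bit-spreading step described above can be sketched as follows. This only reproduces the written pattern (each data bit twice, paired with clock 1 and then clock 0) as a string read left to right. The function name is hypothetical, and the order in which the video hardware actually shifts bits out, and which pin carries data versus clock, are not modelled.

```python
def encode_byte(value):
    """Spread one byte abcdefgh into the 32-character pattern a1a0b1b0...h1h0."""
    out = []
    for i in range(7, -1, -1):          # bit a (MSB) first
        d = str((value >> i) & 1)
        out.append(d + "1" + d + "0")   # data with clock high, then data with clock low
    return "".join(out)

pattern = encode_byte(0xA5)
print(pattern, len(pattern))
```

Each byte becomes 32 output bits, which is why two video clocks are spent per SPI bit.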

kwinn · 2017-06-18 01:41

Mark_T wrote: »

I had a search and found some test code I had. The basic approach is to use 2 bit per pixel mode
of video driving the SCLK and MOSI. 2 video clocks are needed per bit output. This means having to
duplicate and spread the bits apart to send them, but the application I was working with was painting
pixels into a SPI TFT chip where to fill a rectangle you have to send the same pixel data repeatedly
so the cost of pre-processing the data is spread out across many bytes sent.

To be more precise I was encoding the byte abcdefgh as
a1a0b1b0c1c0d1d0e1e0f1f0g1g0h1h0, ie alternating data/clock and every other clock is inverted.

Can't recall the max video clock rate, but the max SPI clock will be 1/2 of that by this method.

IIRC max video clock is 128MHz so max spi clock would be 64MHz, although I doubt that even a cog pasm program could sustain that data rate for more than a few bytes of data.

SaucySoliton · 2017-06-19 17:45

Repacking the data is not necessary. You can get the clock from the counter. I had some test code to do this which I cleaned up and posted below.

Mark_T · 2017-06-20 17:52

You can't easily start and stop the counter in sync with the data pulses, and a missed, extra or runt clock pulse in SPI
will break things. For really fast transfers you have to use the video outputs for all the signals, I think.

jmg · 2017-06-20 23:02

I'm looking at the FT4222H, which seems a nice low cost, i2c/SPI bridge.

This is the SPI speed table (i2c can go to the standard 3.4Mbit/s, and can be set to 6.66Mbit/s):

;  FT4222 SPI Speed choices:
; SCK   Rate      Supported Dividers on Master CLK
;                 1/2   1/4   1/8   1/16    1/32    1/64    1/128    1/256      SlaveCLK    Bandwidth Utiliz, appx, QSPI
; 80MHz 53.8Mbps* 40M*  20M*  10M   5M      2.5M    1.25M   625K     312.5K     <= 20MHz   53.8M/(4*40M) = 33.62%
; 60MHz 39.7Mbps* 30M*  15M   7.5M  3.75M   1.875M  937.5K  468.75K  234.375K   <= 15MHz   39.7M/(4*30M) = 33.08%
; 48MHz 31.5Mbps* 24M*  12M   6M    3M      1.5M    750K    375K     187.5K     <= 12MHz   31.5M/(4*24M) = 32.81%
; 24MHz 15.8Mbps* 12M*  6M    3M    1.5M    750K    375K    187.5K   93.75K     <= 6MHz    15.8M/(4*12M) = 32.91%
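
The utilization column can be reproduced from the other columns. A small sketch, using only the numbers in the table and assuming the percentage means measured throughput divided by the raw quad-SPI bit rate (4 lanes times the fastest SCK); the names are made up for illustration.

```python
# system clock (MHz) -> (measured throughput in Mbit/s, fastest SCK in MHz)
ROWS = {80: (53.8, 40), 60: (39.7, 30), 48: (31.5, 24), 24: (15.8, 12)}

def utilization(throughput_mbps, sck_mhz, lanes=4):
    """Share of the raw quad-SPI bit rate that shows up as throughput."""
    return throughput_mbps / (lanes * sck_mhz)

for sysclk, (mbps, sck) in ROWS.items():
    print(f"{sysclk}MHz: {utilization(mbps, sck):.2%}")
```

All four rows land at roughly one third, so about two thirds of the bus time goes to overhead rather than payload.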

QuadSPI is available in master only. Highest slave CLK is 20MHz
WAITPIN can be used on CS# and then CLK to sync a P1 to the master clk.

For SW based Quad SPI, I think you need a rotate and output, plus the clock.
I figure highest speed without SW CLK (timer CLK) is 10MHz and with SW clk is 5MHz

In one burst design, with a small 8b MCU, I allocate first clocks for WRITE, then some dummy/turnaround clocks, and then some reply clocks.
You can also shift the window, to allow one FT4222 master to send to multiple slaves (can be separate MCU, or separate COG) in the same packet.

FT4222 -> MCU is ok with CLK sync, & the skip clocks can be managed in 8b MCU with a Reload-Timer with external CLK and TF0 polling.

That's not so easy on P1, as you cannot wait on a timer, but you can externally clock it, so it seems a =\_ CLK, and a 2 line poll-timer loop can coarse sync, (Skip-1) and then a WAITPIN pair can sync on the final edge, for replies.

Such (Sync + R or W) on paper can unroll to around 5MHz/nibble, which is a decent ~20Mbit burst speed.
(An AUP1G80 from the FT4222 12MHz can give a 6MHz P1 clk, and that is 19.2 SysCLKs at 5MHz, 25.6 SysCLKs at 3.75MHz, 32 SysCLKs at 3MHz.)
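
The SysCLK budget per nibble follows from simple division. This sketch assumes the 6MHz P1 clock is multiplied by the x16 PLL to give a 96MHz system clock (which matches the 19.2, 25.6 and 32 figures), and that a typical instruction takes 4 system clocks; both are assumptions, and the function name is hypothetical.

```python
def sysclks_per_transfer(sysclk_hz, transfer_hz):
    """System clocks available for each nibble at a given nibble rate."""
    return sysclk_hz / transfer_hz

SYSCLK = 6_000_000 * 16  # assumed: 6MHz input clock with the x16 PLL

for nibble_rate in (5_000_000, 3_750_000, 3_000_000):
    clks = sysclks_per_transfer(SYSCLK, nibble_rate)
    burst_mbit = nibble_rate * 4 / 1e6   # 4 bits per nibble
    print(f"{nibble_rate / 1e6:g}MHz: {clks:g} SysCLKs, "
          f"{clks / 4:g} instructions, {burst_mbit:g} Mbit/s burst")
```

At 5MHz per nibble that leaves fewer than five instructions per nibble, which is why the loop would have to be unrolled.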

Has anyone done an i2c Master or Slave on P1 at 3.4MHz? Seems doable?

SaucySoliton · 2017-06-20 23:15

My last post was a very incomplete description of what the code does. I agree that trying to synchronize a counter with the video generator would likely be a disaster.

The counter runs continuously in PLL single-ended (or differential) mode. The PLL output is the clock. The PLL output also feeds the video generator. It seems to work fine with the counter in PLL output mode. The video generator outputs the data and chip select. I also use it to force the clock output high when not sending. It looks better on the scope, but may not be required. The video generator is in 1 bit mode. The chip select and clock masking are generated with a separate waitvid using different "colors."

The writes to OUTA and VCFG are to keep the outputs in a known state without having to keep the video generator constantly fed.

jmg · 2017-06-20 23:34

SaucySoliton wrote: »

My last post was a very incomplete description of what the code does. I agree that trying to synchronize a counter with the video generator would likely be a disaster.

The counter runs continuously in PLL single-ended (or differential) mode. The PLL output is the clock. The PLL output also feeds the video generator. It seems to work fine with the counter in PLL output mode. The video generator outputs the data and chip select. I also use it to force the clock output high when not sending. It looks better on the scope, but may not be required. The video generator is in 1 bit mode. The chip select and clock masking are generated with a separate waitvid using different "colors."

The writes to OUTA and VCFG are to keep the outputs in a known state without having to keep the video generator constantly fed.

What clock speeds did you achieve with that, and what was it connected to ?
Video modes are master only, (one bit?) and also can only write (not duplex, no read)

SaucySoliton · 2017-06-21 00:35

jmg wrote: »

What clock speeds did you achieve with that, and what was it connected to ?
Video modes are master only, (one bit?) and also can only write (not duplex, no read)

Yes, master only, write only. Never tried it with a device, just the oscilloscope. It seems to work at 160MHz. 2 data bits should be possible. Could also do 8 bits.

The plots show a 32 bit transfer at 20 MHz. The last one is at 100MHz. It's a 70MHz oscilloscope so the 160MHz test didn't look that good.

jmg · 2017-06-21 01:14

SaucySoliton wrote: »

Yes, master only, write only. Never tried it with a device, just the oscilloscope. It seems to work at 160MHz. 2 data bits should be possible. Could also do 8 bits.

The plots show a 32 bit transfer at 20 MHz. ...

Nice looking plots.
The FT4222H I am looking at, can support 20MHz SCK in slave mode, so this example could work well with that for fast, one way P1 -> PC.
Where PC-> P1 is needed the P1 can lower the clock easily, since P1 generates the clock. GPIO lines can set direction, at a more modest speed.

What is the sustainable data rate ? (ie how much housekeeping overhead is involved)
Scope shows I think 417k*32 = 13.344Mb/s sustained, but that looks like a repeating payload ?

A couple of possible benchmarks
a) Read (say) 8-16-24 pins, and stream to PC, continually.

b) use WAITPIN and Counter, to send a 24b(pin)+32b(dT) packet, of PinPatterns + TimeStamp.
An option of 8b(pin) and 24b(dT) would have a data-send time of 1.6us at 20MHz, so that sets a ceiling on peak change-rate supported, but maybe a simple FIFO can allow glitch capture lower than 1.6us ?
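
The packet timing in benchmark (b) and the sustained rate read off the scope can be checked with a few lines. A sketch only: it ignores any gap between packets, and the function names are made up for illustration.

```python
def send_time_us(bits, sck_hz):
    """Time to clock out one packet, in microseconds."""
    return bits / sck_hz * 1e6

def max_packet_rate_hz(bits, sck_hz):
    """Upper bound on packets per second with back-to-back transfers."""
    return sck_hz / bits

SCK = 20_000_000  # 20MHz SPI clock

print(send_time_us(8 + 24, SCK))        # 8b pins + 24b timestamp
print(send_time_us(24 + 32, SCK))       # 24b pins + 32b timestamp
print(max_packet_rate_hz(8 + 24, SCK))  # ceiling on change rate, no FIFO
print(417_000 * 32)                     # sustained bit/s from the scope estimate
```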
