Faster SPI Bus Transfers
cgracey
Posts: 14,155
I'm working on the 2nd-stage flash booter for application launching. The first thing to sort out is how to quickly program the flash, so the user doesn't have to wait long. Then, the loader which executes on reset must pull the data from the flash into memory very quickly.
The regular way of looping to get the next data bit, outputting it, raising the clock, and lowering the clock is quite slow. I've been working out how to speed it up. This code runs off the RCFAST oscillator at boot, which is always over 20MHz and usually ~24MHz. The oscillator is designed to not drop below 20MHz, worst-case, to support auto-baud serial connections of up to 2Mbaud.
As a first pass, I used the smart pin mode which outputs timed transitions to generate the SPI clock. I then output the data manually in software. I was lamenting that I didn't make an instruction to just shift a register and output the bit to some pin. That would have made things really easy and fast. We don't have that, but I realized that the RCZL instruction, which rotates a register two bits left and puts the bits into C and Z, could save some time. The transition mode can then generate the clock in the background while my code outputs the data bit stream. It works really nicely.
Here's the code:
CON
        dpin    = 17                    'data pin
        cpin    = 16                    'clock pin

DAT     org

        hubset  #%10_00                 'use 20MHz crystal for clean scoping
        waitx   ##20_000_000/100
        hubset  #%10_10

        wrpin   #%01_00101_0,#cpin      'set cpin for transition-mode output
        wxpin   #2,#cpin                'timebase is 2 clocks per transition

.loop   mov     cmd,#$55                'ready cmd data
        shl     cmd,#24

        dirl    #cpin           '2      reset transition pin, reset timebase
        dirh    #cpin           '2      (outputs low during reset)

        rczl    cmd     wcz     '2      ready bits 7/6
        drvc    #dpin           '2!     output bit7
        wypin   #16,#cpin       '2      start 16 transitions
        drvz    #dpin           '2!     output bit6

        rczl    cmd     wcz     '2      ready bits 5/4
        drvc    #dpin           '2!     output bit5
        nop                     '2
        drvz    #dpin           '2!     output bit4

        rczl    cmd     wcz     '2      ready bits 3/2
        drvc    #dpin           '2!     output bit3
        nop                     '2
        drvz    #dpin           '2!     output bit2

        rczl    cmd     wcz     '2      ready bits 1/0
        drvc    #dpin           '2!     output bit1
        nop                     '2
        drvz    #dpin           '2!     output bit0

        jmp     #.loop

cmd     res     1
And see the picture of what it does...
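For anyone who wants to check the bit order without a scope, here is a rough Python model of the listing above. It is only a sketch, not P2 code: it assumes RCZL copies the top two bits of the register into C and Z, as described in the post, and that the old C and Z rotate in at the bottom (my reading of the instruction, and irrelevant for the 8 bits sent here). The names rczl, bits_sent and spi_clock_hz are made up for this example.

```python
MASK32 = 0xFFFFFFFF

def rczl(d, c=0, z=0):
    """Model of RCZL D WCZ (assumed semantics): returns (new_d, new_c, new_z)."""
    new_c = (d >> 31) & 1                        # C gets the old bit 31
    new_z = (d >> 30) & 1                        # Z gets the old bit 30
    new_d = ((d << 2) & MASK32) | (c << 1) | z   # old C and Z enter at the bottom
    return new_d, new_c, new_z

def bits_sent(byte):
    """Bits driven onto the data pin, in order, for one command byte."""
    cmd = (byte << 24) & MASK32                  # same as MOV cmd,#byte / SHL cmd,#24
    c = z = 0
    out = []
    for _ in range(4):                           # four RCZLs, two bits each
        cmd, c, z = rczl(cmd, c, z)
        out.append(c)                            # DRVC #dpin
        out.append(z)                            # DRVZ #dpin
    return out

# In the listing a new bit goes out every 4 clocks (two 2-clock instructions),
# matching the 2-clocks-per-transition timebase on the clock pin.
CLOCKS_PER_BIT = 4

def spi_clock_hz(sysclk_hz):
    return sysclk_hz / CLOCKS_PER_BIT

print(bits_sent(0x55))
print(spi_clock_hz(20_000_000))
```

With the 20MHz crystal used for scoping, this works out to a 5MHz SPI clock for this first version.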
Comments
That said, I never got round to testing it on a real SPI device. I think it was still RevA silicon.
I remember Peter had asked if it was worth using the smartpins at all and I'd initially said not really.
I just got sysclock/2 working and it's mind-blowingly simple. Sometimes things just work out. It was accidental that it could work so perfectly. Just a minute...
This means that running from RCFAST, not counting flash erase and program delays, you could load 512KB into the flash in just 400ms! No need to even use the crystal/PLL, which could actually make the software much more complicated.
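The 400ms figure is easy to sanity-check. A quick Python sketch, assuming one SPI bit every 2 system clocks (sysclock/2), no gaps between words, and no erase or program delays; the function name is just for illustration:

```python
def spi_transfer_seconds(num_bytes, sysclk_hz, clocks_per_bit=2):
    """Raw shift-out time, ignoring flash erase/program delays and command overhead."""
    bits = num_bytes * 8
    return bits * clocks_per_bit / sysclk_hz

for sysclk in (20_000_000, 24_000_000):          # RCFAST worst case and typical
    t = spi_transfer_seconds(512 * 1024, sysclk)
    print(f"{sysclk / 1e6:.0f} MHz sysclock: {t * 1000:.0f} ms")
```

That comes to roughly 419ms at the 20MHz worst case and about 350ms at a typical 24MHz, so "just 400ms" is the right ballpark.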
In this program, I'm outputting a whole 32 bits, which is what the loader will be doing to program the flash. Note that the data changes on the falling clock, so that it's stable during the rising clock. This will run SPI at over 10MHz using RCFAST:
Here's a picture of it running...
For reading back, it's easier to use a smartpin though.
There can't be a faster or a simpler way to do this. It's miraculous that the timing aligned so well. Note that it's ONE clock different, as needed, due to an extra clock delay in the streamer design.
So, this wraps up how to do fast SPI output. Now, I've got to see about SPI input using the same ideas, but with the streamer inputting a pin. Not sure how that timing will be.
Yes, I had to cover for that in the first example in the initial post, but when you set the timebase to ONE clock, you don't have that problem because its metronome ticks on every clock. The only way you could screw it up would be issuing another command before it finishes the current command, causing it to toggle some odd number of times, leaving it in the opposite state you intended. We have provision for that in the WAITXFI.
Streamer for reading SPI data does work but it's a lot of trial and error to align timing. Here's an example HyperRAM snippet I was using for testing various questions:
I'm thinking that to develop the SPI input, I'll have another cog output the same clock stream, but with output data. I'll then tune the inputting cog, which is outputting a sync'd clock stream, from the other cog's output data.
When the SPI device outputs, it updates its data output after the falling edge of the clock. I'll have to make my simulator work like this.
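The behaviour the simulator needs can be sketched in a few lines of Python. This is an idealised model with no propagation or slew delay (which is exactly the part that causes trouble on real hardware), and it assumes the first bit is already on the line before the first rising edge. SpiSlaveModel and master_read are hypothetical names.

```python
class SpiSlaveModel:
    """Shifts a word out MSB-first, changing its output after each falling clock edge."""

    def __init__(self, word, nbits=32):
        self.bits = [(word >> (nbits - 1 - i)) & 1 for i in range(nbits)]
        self.index = 0
        self.miso = self.bits[0]      # assumption: first bit is presented up front
        self.prev_clk = 0

    def step(self, clk):
        """Feed the current clock level; returns the level on the data-out pin."""
        if self.prev_clk == 1 and clk == 0:          # falling edge: advance to next bit
            self.index += 1
            if self.index < len(self.bits):
                self.miso = self.bits[self.index]
        self.prev_clk = clk
        return self.miso

def master_read(slave, nbits=32):
    """Master raises the clock, samples, then lowers the clock."""
    value = 0
    for _ in range(nbits):
        bit = slave.step(1)           # rising edge: data is stable, sample it
        value = (value << 1) | bit
        slave.step(0)                 # falling edge: slave updates its output
    return value

print(hex(master_read(SpiSlaveModel(0xDEADBEEF))))
```

In a real test the sample point has to be shifted later by however many clocks the pin and register stages add, which is what the tuning against the second cog is for.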
I'm now thinking that flash programming should happen DURING the download, so that you don't suffer the download time, then have the programming time on top of it. The bigger the download to flash, the more it will benefit from download/programming overlap.
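To put rough numbers on the overlap idea, here is a small Python sketch of a two-stage pipeline: while one chunk is being programmed, the next is downloading. The times used in the example are made-up placeholders, not measurements, and the chunks are assumed to be equal in size.

```python
def sequential_seconds(download_s, program_s):
    """Download everything first, then program everything."""
    return download_s + program_s

def overlapped_seconds(download_s, program_s, chunks):
    """Program chunk k while chunk k+1 downloads (equal-sized chunks assumed)."""
    d = download_s / chunks           # download time per chunk
    p = program_s / chunks            # program time per chunk
    # First chunk must download, last chunk must program; in between the
    # slower of the two stages sets the pace.
    return d + (chunks - 1) * max(d, p) + p

# Hypothetical figures: 2.0 s to download, 1.0 s to program, 256 chunks.
print(sequential_seconds(2.0, 1.0))
print(overlapped_seconds(2.0, 1.0, 256))
```

With small chunks the total approaches whichever of the two is slower, so the saving grows with the size of the download.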
Thanks, Evanh. That data is really interesting. Sheesh.... What do we do? Is it practical to try to adjust dynamically to these shifts?
EDIT: The board layout has a large impact on the slew rate. That was proven with the revA Eval boards where the SD slot and EEPROM were placed on the opposite side of the board from the I/O header and prop2 pins. The max SPI clock was really bad there.
He was initially only talking about bit-bashing methods.
Smartpins can't improve the read slew rate issue; they're not true clock inputs.
> Smartpins can't improve the read slew rate issue; they're not true clock inputs.
Not to derail the topic, but it seems reasonable that most people coming to the P2 will gravitate towards those pin modes first. I get that they are not the fastest possible solution in all cases, but the fact that Chip seemed to skip over them as a possible solution makes me concerned about their actual usefulness. I'm particularly surprised that they're not being considered for the receive mode, where it seems they should be the ideal choice here.
Here's the output for the same config and write timings, but with the unmodified HyperRAM board fitted. Note that it has only one good compensation column. The problem is that when attempting to go to the full DDR capabilities of the HyperRAM, the column with zero errors vanishes entirely. Attached is the full output, which demonstrates that the slew rate issue doesn't affect writes.
It's just that the Prop2 can go so fast, and is so easy to push there, that there are other potential issues that could never affect the Prop1. Many other micros didn't have the speed in the past either. It's all a little new in some ways.
Then, the other half of dealing with this is latching the shifted data into a buffer without any potential glitches between the two clocks. It shouldn't be a huge issue given the ratio between shifting and latching. Similar to solving the sysclock PLL mode change.
The smart pin serial synchronous output mode, on the other hand, inputs the clock and outputs the data, so it suffers turn-around delays, making sysclock/2 impossible.
First, you need to set up a smart pin to generate the clock signal and set the streamer rate:
To output a value:
To output from hub memory:
To input to hub memory:
That's all there is to it!
Here is a test program that I developed this with. There are two cog programs. One outputs data and the other receives and verifies data. They time-align their clock outputs so that you can know that the receiver (clock on P18) is aligned with the transmitter (clock on P16). The transmitter outputs data on P17 and the receiver inputs from P17. It's doing 32 bits at a time. In the hub-transfer modes, you could do up to 8191 bytes at a time, unless you could use $FFFF for infinite and then do an XSTOP at the right time.
Reading back through the docs, I now see the two-clock delay comment. For slaves that can read on the rising edge, I suppose you could get down to sysclock/4 (so that output is effectively written on the falling edge). But other than that, is sysclock/8 (or maybe sysclock/6 for slow-enough clock settings) the best you can achieve?
I don't know. This gets so complex that writing code and looking at it on the scope is the best way to know the timing.
The smart pin synchronous serial input suffers from the turn-around delays. We added another flop on each input on Rev B silicon, and I don't think I updated the docs for that mode.
If you can control the clock, you can do much better than the smart pin synchronous input mode. If you are waiting for an external clock, you can't improve its function. There are just a lot of register stages.
Maybe even 4-bit SD bus? Although the way the SD card is connected for booting complicates this.