[EDITED] fast SPI out, 1 bit per instruction
lonesock
Posts: 917
Hi, All.
EDITED: GEAR timing was off by a tiny bit...used a scope, updated the counter PHSx initialization values!!
You may remember this thread here:
http://forums.parallax.com/showthread.php?p=811943
I was trying to use both counters to get SPI data at one bit per instruction. kuroneko pointed me to a similar thread of his where he used almost exactly the same technique, and of course did it months before I did [8^)
http://forums.parallax.com/showthread.php?p=784536
Well, I wanted to do the same thing for SPI output, so here is the test framework. It looks a bit goofy in GEAR (the clock pin trace is off by 1 prop-clock relative to the data pin trace, not sure why), but the scope looks nice and clean.
{{
  Jonathan "lonesock" Dummer
  Testing a fast SPI clock out routine

  Use both counters in NCO single-ended mode, where the
  output pin is equal to PHSx's high bit.  Use Counter B
  to drive the clock pin, and Counter A to drive the data
  line.  B actually changes the pin automatically, while
  you update the Data pin using a series of SHL's on PHSA
  (we set FRQA to 0, so no updates are happening
  automatically).

  This is for an SPI interface where the data is latched
  in on the rising edge of the Clock line, so you want
  your Data pin to be stable before the clock pin goes high.
  You might have to adjust the "movi phsb,#%xxx000000" line
  to initialize the PHSB into the right state for your SPI
  definition.
}}

CON
  '_clkmode = RCFast
  _clkmode = RCSlow

  pinDataOut = 25
  pinClock = 24
  pinChipSelect = 26

PUB start_test
  ' start out our assembly test framework, then we're done!
  cognew( @fast_SPI_out_test_entry, 0 )
  repeat ' do nothing forever (looking at you, Wally!)

DAT
        ORG 0
fast_SPI_out_test_entry
        ' set up Counter A to be the data counter
        mov frqa,#0 ' unnecessary
        mov phsa,#0 ' unnecessary
        mov ctra,#pinDataOut ' set the data pin
        movi ctra,#%0_00100_000 ' set the mode to NCO, single output pin

        ' set up Counter B to be the clock counter
        mov frqb,#0 ' unnecessary
        mov phsb,#0 ' unnecessary
        mov ctrb,#pinClock ' set the clock pin
        movi ctrb,#%0_00100_000 ' set the mode to NCO, single output pin

        ' set up my 3 pins as outputs
        mov t,#1 ' temp = 1
        shl t,#pinDataOut ' temp = 1 << pinDataOut
        mov dira,t ' DIRA now has the DataOut pin as an output
        mov t,#1 ' temp = 1
        shl t,#pinClock ' temp = 1 << pinClock
        or dira,t ' DIRA now has both DataOut and Clock as outputs
        mov maskCS,#1 ' ditto for the ChipSelect pin, but keep the mask for later
        shl maskCS,#pinChipSelect
        or dira,maskCS ' DIRA now has all 3 pins set to outputs
        mov outa,maskCS ' set the Chip Select pin high (usually active low)

fast_SPI_out_test
        ' what is my data byte?
        mov data,#%10101010 ' randomly selected by myself

        ' here is the super fast unrolled version
'{
        mov phsa,data ' start with the raw data byte
        shl phsa,#24 ' get the MSb into position 31
        'rev phsa,#0 ' do this instead of the above line for LSb first
        andn outa,maskCS ' CS goes low, signifying a start
        movi phsb,#%000000000 ' set up my clock register
        movi frqb,#%010000000 ' start my clock line ticking!
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        mov frqb,#0 ' stop my clock
        or outa,maskCS ' CS goes high again
'}

        ' here is the 2x slower looped version
'{
        '' NOTE: The 1st one will be primed, so the number
        '' of remaining bits = total-1.  For this 8-bit
        '' test, I have 7 bits remaining to be shifted out.
        mov t,#7 ' number of bits left
        mov phsa,data ' start with the raw data byte
        shl phsa,#24 ' get the MSb into position 31
        'rev phsa,#0 ' do this instead of the above line for LSb first
        andn outa,maskCS ' CS goes low, signifying a start
        movi phsb,#%011000000 ' set up my clock register
        movi frqb,#%001000000 ' start my clock line ticking!
:bit_shift_loop
        shl phsa,#1 ' move next bit into the PHSA[31] slot
        djnz t,#:bit_shift_loop ' keep going till we run out of bits
        mov frqb,#0 ' stop my clock
        or outa,maskCS ' CS goes high again
'}

        ' wait and repeat
        mov t,#1 ' 511 clocks is a good number for fitting into 9 bits [8^)
        shl t,#9
        add t,cnt ' add in the current time
        waitcnt t,#511 ' wait for a little while
        jmp #fast_SPI_out_test ' start our test over again

data    res 1
t       res 1
maskCS  res 1

        FIT 496
edit: changed the code to work with what the scope says!!
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Post Edited (lonesock) : 6/17/2009 11:51:57 PM GMT
Comments
-Phil
edit: latest code is in the top post
Note that this can be used to send more than 8 bits, up to 32, obviously, but you just lose that many words of Cog RAM as the loop must be unrolled. Alternatively, if memory was more important than speed, you could use a loop at 1/2 the data rate, but almost no code size...you'd just have to play with the starting value for phsb, and frqb would be 1/2 the current value.
Theoretically, with an 80 MHz clock, you can now drive SD cards at the specified max data rate (20 MHz clock) for both input and output. If you overclock your prop, you might go too fast! I doubt that is a problem, as the SD spec was artificially limited, and I don't think any companies would go out of their way to corrupt data if it goes a tiny bit faster than spec. On the other hand, going with a cheap SD card could spell trouble simply through poor construction.
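For anyone checking the arithmetic behind the 20 MHz figure: in NCO mode the pin follows PHSB[31], so one full SPI clock period takes 2^32 / FRQB system clocks. A rough sketch of the numbers, assuming an 80 MHz system clock (the test code above actually runs from RCSlow) and using constant names that are only illustrative, not from the posted code:

```spin
CON
  ' Illustrative names only; none of these appear in the posted test code
  SYS_HZ        = 80_000_000                    ' assumed 80 MHz system clock
  CLKS_PER_INST = 4                             ' shl, mov, djnz (taken) all take 4 clocks
  FAST_BIT_HZ   = SYS_HZ / CLKS_PER_INST        ' 20_000_000: unrolled, 1 instruction per bit
  LOOP_BIT_HZ   = SYS_HZ / (2 * CLKS_PER_INST)  ' 10_000_000: looped, shl + djnz per bit
  FRQB_FAST     = $4000_0000                    ' 2^32 / 4, what movi frqb,#%010000000 leaves in FRQB
  FRQB_LOOP     = $2000_0000                    ' 2^32 / 8, what movi frqb,#%001000000 leaves in FRQB
```

The movi equivalence only holds because the low 23 bits of FRQB were already cleared by the earlier mov frqb,#0.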
The next challenge would be to read and write SPI data concurrently. I'm sure you could do it with 2 instructions per bit, but the supreme awesomeness would be one bit (each way) per instruction. I'm dubious, but hey, food for thought [8^)
Post Edited (lonesock) : 6/17/2009 11:52:39 PM GMT
-Phil
We can dedicate a whole cog for it in our emulations.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBladeProp, SixBladeProp, website (Multiple propeller pcbs)
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: Micros eg Altair, and Terminals eg VT100 (Index)
· Search the Propeller forums (via Google)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
EDITED: put the new code in the sample up top
At the risk of derailing my own thread, if I made a fast FAT32-only & MMC/SD/SDHC object, would it:
A) be of use to anyone besides myself?
B) duplicate the effort of anyone else? (I know many people have mentioned FAT32 support already, but I have no status updates)
C) be a problem if there was no FAT16 or FAT12 support?
D) annoy people with limitations like "only one file open for writing at a time"?
I have some ideas for optimization, and even a cool name: FlashFAT32! (and a "Lion Tamer" hat!)
Anyway...bedtime for me...I appreciate any feedback, which I will collect in the morning [8^)
Jonathan
Post Edited (lonesock) : 6/17/2009 11:53:20 PM GMT
read-ahead and write-behind, and a bunch of other stuff, but if you're considering fat32
and sdhc, I wasn't planning on that. I'd hate to waste my effort if you're going to be
leapfrogging that anyway.
I was also going to have a nifty DMA mode so you could get real speed even when using
only Spin.
Your code will be very useful for Ramtron's 2 Mbit FRAMs,
if you have that driver.
I use those chips.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
For every stupid question there is at least one intelligent answer.
Don't guess - ask instead.
If you don't ask you won't know.
If you're gonna construct something, make it as simple as possible yet as versatile as possible.
Sapieha
On the TriBlade and RamBlade, DMA is not possible, nor is read ahead or write behind. This is because it shares pins with the ram.
read-ahead and write-behind, and DMA too, so PLEASE do continue
@lonesock: likewise; I expect loads of people would "rip yer arm off" to get FAT32 / SD / SDHC support on the Prop' - can we have it yesterday please?! (BTW: Your SPI work looks great).
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.com
“Before you criticize someone, you should walk a mile in their shoes. That way when you criticize them, you are a mile away from them and you have their shoes.” - Jack Handey.
However, then the feature set wouldn't be good...
Also, you can't just pump the SPI clock to max with SD cards. They have a CSD register which tells you their maximum limit, and each SD card can report a different value. Most, however, should be able to take 5 MHz.
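As a rough illustration of reading that limit: the CSD field in question is the one-byte TRAN_SPEED value, which packs a rate unit in bits 2..0 and a multiplier in bits 6..3. The Spin method below is only a sketch of the decoding as I read the SD Physical Layer spec; the method name is made up, and it assumes you have already fetched the CSD and pulled out that byte.

```spin
PUB tran_speed_to_hz(ts) : hz | unit, mult
  ' Hypothetical helper, not part of fsrw or the code in this thread.
  ' bits 2..0 = rate unit: 100 kbit/s, 1, 10, 100 Mbit/s (stored here pre-divided by 10)
  unit := lookupz(ts & 7 : 10_000, 100_000, 1_000_000, 10_000_000)
  ' bits 6..3 = multiplier scaled by 10 (1.0, 1.2, 1.3 ... 8.0); index 0 is reserved
  mult := lookupz((ts >> 3) & 15 : 0, 10, 12, 13, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 70, 80)
  ' reserved encodings fall out as 0 because lookupz returns 0 for an out-of-range index
  hz := unit * mult
```

For example, a TRAN_SPEED byte of $32 decodes to unit index 2 and multiplier index 6, i.e. 1_000_000 * 25 = 25_000_000. A driver could compare the result against its intended SPI clock and fall back to the slower looped routine when the card reports less.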
Plus you need high speed input for the SD card also, not high speed output... Maybe input would be possible with the counters also.
Good luck, if you want to try. Sorry for raining on the happy parade.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nyamekye,
I have looked some more at fat32, and I agree, it would be straightforward to modify fsrw to do fat32
(the cluster r/w is part of it, but another part is the extendable root directory).
At first blush it might be easy to divide work into the block layer and the file system layer (as it
currently is divided). In other words, we can use the 1-bit-per-instruction ideas in a really simple
spisasm routine that plugs into the existing fsrw to make it much faster (and this layer can
implement read-ahead and write-behind), but I was going to move the FAT manipulation into the
cog too, so it may not be so quick and dried.
Lonesock, email me at my username at gmail.com (same username as what I use here) and
let's discuss if we want to work together on this.
Thanks for your responses! Rokicki and I will collaborate on the next rev of fsrw. There will be some more news when we have it (and look for an upcoming forum poll from rokicki).
(note: the next release will be ready on or before the Duke Nukem Forever ship-date [8^)
thanks,
Jonathan
Well, I finally plugged a scope in and looked at my waveforms: GEAR != HW+Scope
I changed my unrolled SPI code so that the clock pin transitions to high are dead center of each data bit on the scope. I then looked at the traces in GEAR, and the trace for the clock pin was advanced by one, relative to the data pin. The GEAR output is attached (remember, it looks perfect on the scope..."Don't Panic")
The updated code is in the 1st post, both the embedded code block and the attached file. The code shows the fast unrolled "1 instruction per bit" way, and it also has the looped "2 instructions per bit" way.
In the looped mode, the clock transition to high does not land exactly in the middle of the data bit, but is in fact one clock late. I.e. each data bit is 8 propeller-clocks wide, and the clock pin's transition to high occurs on relative clock 5. I could not figure out a way to get that transition exactly in the middle of the data bit without an extra clock-pin-transition-to-high slipping in.
Anyway, "share and enjoy"
Jonathan
Now ctrb starts incrementing (assuming phsb is initially 0) and reaches 4 during R of movi phsb, .... At the same time however we force phsb to 4 (so no harm done here, i.e. we could place a nop there if we could guarantee phsb being 0 when we set frqb). Anyway, the clock transitions are now locked to 4n+3 as well which is what we wanted.
Maybe I could try something like:
* compute the exact time I should shut off the clock counter
* do something like:
This would let me perform both the jump and the fall through in only 4 clocks. The waitcnt would make sure that the clock pin transitioned the requisite number of times, even if I ran out of data bits early. I think getting the ending cnt value would be tricky, but possible. I also run the risk (even if this works) of losing in setup instructions what I gained by going to the looped version in the 1st place.
Of course the alternative is to just let my clock pin transition on clock 5 of every 8-clock-wide data bit, instead of on clock 4 [8^). And since this is going at 1/2 speed, I wouldn't think that the precise timing is that important, but I guess that is device specific.
Any feedback on the "waitcnt" idea, or the usefulness thereof?
I don't think the waitcnt idea is going to fly. Actually, I'm sure. Imagine the last bit sent is 1, this means we have one more round-trip for the jmp instruction in order to clear phsa (bit 9 if you like), then we have a nop (jump not taken) and that's already too late (clock transition during nop.R). Even without the extra loop cycle you'd only have 8+3 cycles left to stop the clock (4 of which are consumed by the jump not taken, 6 by the waitcnt ...).
So I'd suggest you ignore that slightly off center transition and just use it. The data hold time should be long enough.
I've got 2 routines, one fast pasm spi write only (using lonesock's counter idea), and one pasm read-write. I can call either one repeatedly, however once I call the read/write version the write only (counter version) no longer clocks out data. Can anyone explain what's going on that prevents me from using the r/w version (non-counter) and then using the w_fast(counter) version? I know for sure that I can get into the fast (counter) version again because the XOR statements there, if uncommented, show up on the scope. But I just don't get any data. It seems like my setup of the counters is wrong, but I don't know why.
Thanks,
Peter
Anyway, I lost interest because of the lack of feedback.
Even better than 20 Mbit out is possible in bursts. The bottleneck is feeding the video generator fast enough with data from hub RAM.
@Peter: Here's what I have in the latest version of the FSRW block code, byte versions only: The commented out middle version was the 20 Mbps read, but it didn't seem to work on all hardware & cog/pin combinations...so it's in disuse, and hasn't been tested since the rest of the counter framework has evolved around it. Sorry for not answering your question directly...a bit busy!
thanks,
Jonathan
Do you mind posting a link to that thread? If I can't get what I'm working on fixed, I'm open to a different tactic.
Thanks
Thanks for posting. I'd looked over that section of fsrw to see if it was different.
The code I posted above (based directly on yours) works, but it basically doesn't get along with the other stuff I'd written and I can't understand why. I really like these write speeds, and I'm only working with one specific device rather than a bunch of different SD card manufacturers. My device is spec'd to 25 MHz for the SPI line, so I'm still considerably below it. However, I do need to read with this device, as well as read while writing (true SPI), so I need to get a version working that will do both in and out. My SPI read/write routines aren't nearly as fast as what you've got, but they are fast enough since I don't use read AND write simultaneously that frequently. For the heavy throughput stuff I want to use your code, and for the read/write stuff I can go slowly. But to do this they need to cooperate.
Ahle2, if Jonathan's memory is correct your methods might not be suitable for my application (everything here is treated as a byte) but I'd still like to see them.
Thanks,
Peter
I appreciate you looking at this. I must say I don't see the problem (which is probably how I got into this situation). In the highlighted code above I'm using the red line to toggle the clock line. That works fine. I'm only changing the state of one pin, outa[clkMask]. As I'm sure you're aware, this works fine. However after running this, when I jump into lonesock's code I don't get any output - clocks (or data) coming out. I think you're suggesting that when I leave my slow read/write routine, I'm leaving outa[clkMask] := 1 and this somehow prevents other output. But since this is the same cog (and NOT the direction register) why wouldn't the faster routine be able to then toggle the counter output pin (which just so happens to be the same as the bit mask used in the code above).
I'm sure you see the problem. Unfortunately, I don't (yet).
Thanks,
Peter
IOW either the fast code has to unblock outa and raise it again once finished or you change all routines to leave the clock line low once done.
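Note: if a pin is set to output, pin := outa|ctra|ctrb|video. The sources are ORed together, so a 1 left in outa holds the pin high no matter what the counter does.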
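A minimal sketch of the first option (the fast routine unblocks outa itself), assuming the clock is on pin 24 as in the test code in the first post. maskClk is a hypothetical register name, not something from Peter's or lonesock's code:

```spin
DAT
        ' Sketch only: wrap the counter-driven write so OUTA stops holding the clock high
        andn    outa,maskClk            ' clear OUTA's clock bit; the pin now follows CTRB alone
        ' ... counter-driven write goes here, as in the unrolled routine in the first post ...
        or      outa,maskClk            ' put the clock back to idle-high for the slow routine

maskClk long    1 << 24                 ' hypothetical mask for the clock pin (pin 24 assumed)
```

This also relies on PHSB[31] being low whenever the counter is stopped, otherwise CTRB holds the pin high in the same way and the slow routine would be the one that stops working.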