Sweet-spot INA to HUB Byte Transfers (6MB/s or 48Mbit/s 1 COG @ 96MHz)
I guess something similar has been done before. After all there is nothing new under the sun. Still, I thought it was worth sharing. You will find a code fragment below, but first some background.
I'm working on a driver to program/read 2 W25Q flash chips in parallel for SpinSocket-Flash (probably renamed to !nSocket to avoid a trademark dispute). SpinSocket-Flash has 2 of these chips for 2MB of program storage on an 8 bit synchronous data bus. The devices have 4 IO pins, a chip select, and a clock. Basically the W25Q parts are SPI devices with 4 bits on each 8 pin chip rather than just 1 bit per chip. Rayman is using a similar part in Flash Point. I have samples of these SQI parts too, but have not developed a driver for them yet.
As mentioned above, with 2 chips you can create an 8-bit synchronous data bus. The Propeller is very good at precision timing (precise to a fault, really: straitjacket timing). Because a Propeller output pin driven by CTRA clocks the SPI bus, an edge counter (CTRB) can advance PHSB on each falling edge of that clock, so PHSB acts as an auto-incrementing pointer into hub memory. This is a very convenient setup for getting the highest possible "sweet-spot" byte-wide throughput available on the Propeller in a single COG from INA to HUB.
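For anyone who wants to see the pointer idea without a Propeller on the bench, here is a rough Python model of it. It is only a sketch: `simulate_read` and `hub_size` are made-up names, and it assumes an idealized alignment where each pass of the loop stores one byte and then sees exactly one falling clock edge.

```python
def simulate_read(samples, start_addr, hub_size=64):
    """Idealized model: PHSB starts at the buffer address and FRQB=1
    adds one to it for every falling edge of the SPI clock."""
    hub = bytearray(hub_size)
    phsb = start_addr              # pointer preloaded with the buffer address
    frqb = 1                       # counter adds 1 per detected edge
    for value in samples:          # one loop pass per clock period
        byte0 = value & 0xFF       # only P0..P7 carry data
        hub[phsb] = byte0          # store through the counter "pointer"
        phsb += frqb               # falling edge: CTRB bumps the pointer
    return hub, phsb


hub, end_ptr = simulate_read([0x55, 0x01, 0x02, 0x03], start_addr=8)
print(hub[8:12].hex(" "), end_ptr)
```

The point is that the loop itself never spends an instruction on incrementing the address; the counter does it for free.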
Here is an example of reading INA and saving data to HUB using PHSB counter pointer.
'--------------------------------------------------------------------
' read buffer from Flash to HUB at 5MB/s @ 80MHz, 6MB/s @ 96MHz, or 6.25MB/s @ 100MHz
'
_doRead
        mov     phsb,tmp                ' tmp is the HUB pointer start address
        mov     frqb,#1                 ' phsb accumulate once per EDGE
        movs    ctrb,#clkpin            ' detect the SPI clock edge
        movi    ctrb,#%01110<<3         ' Negative EDGE detect
        mov     frqa,_frqa1             ' phsa accumulate frequency at clkfreq/16
        mov     phsa,#0                 ' phsa initial value
        rdlong  _len, _mblen            ' get length here to sync before ctra enable
        mov     ctra,_ctra              ' start clock here
:loop
        mov     _byte0,ina              ' read byte
        wrbyte  _byte0,phsb             ' write to hub pointer
        djnz    _len,#:loop             ' do until _len bytes are stored
        mov     ctra,#0                 ' kill clock
_doRead_ret     ret
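The rates in the header comment fall out of the loop length. A quick sanity check in Python, assuming 4 clocks each for mov and djnz and 8 clocks for a wrbyte that hits its hub window (so 16 clocks per byte); `bytes_per_second` and `nco_frq` are just illustrative helper names:

```python
CLOCKS_PER_LOOP = 16   # assumed: mov (4) + hub-synced wrbyte (8) + djnz (4)


def bytes_per_second(clkfreq_hz):
    """One byte is stored per loop pass."""
    return clkfreq_hz // CLOCKS_PER_LOOP


def nco_frq(divider):
    """FRQA value for an NCO output of clkfreq/divider.
    The NCO pin follows PHSA bit 31, so a full period needs 2**32 counts."""
    return (1 << 32) // divider


for mhz in (80, 96, 100):
    rate = bytes_per_second(mhz * 1_000_000)
    print(f"{mhz} MHz: {rate / 1e6:.2f} MB/s = {rate * 8 / 1e6:.0f} Mbit/s")

print(f"FRQA for clkfreq/16: ${nco_frq(16):08X}")
```

Under those assumptions the clock has to run at exactly one period per loop pass, which is why the FRQA setting is clkfreq/16.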
Here's a dump from the test program. Just stepping through the output, the algorithm is:
- Read JEDEC Manufacturer's ID and Device Type
- Erase a sector for programming
- Read some values to ensure erase worked
- Fill buffer with an incrementing pattern (except $55 at offset 0)
- Program the buffer
- Clear the buffer so we know the data read is valid
- Read some values to make sure reads worked
- Read Flash buffer and show programmed data
c:\Propeller\inSocket>bstc -ls -d COM6 -p0 -Ograux -L c:\bstc\spin w25qx2.spin
Brads Spin Tool Compiler v0.15.4-pre3 - Copyright 2008,2009 All rights reserved
Compiled for i386 Win32 at 19:57:58 on 2010/01/15
Loading Object w25qx2
Loading Object FullDuplexSingleton
Program size is 2772 longs
Compiled 688 Lines of Code in 0.017 Seconds
We found a Propeller Version 1
Propeller Load took 1.103 Seconds
c:\Propeller\inSocket>uterm COM6 115200
MFG ID EF
DEV TYPE 40
erase sector
read byte 000000FF
read word 0000FFFF
read long FFFFFFFF
fill buffer
prog buffer
clear buffer
read bytes 55 01 02 03
read word 00000155
read long 03020155
read buffer
show buffer
080800 55 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080810 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080820 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080830 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080840 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080850 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080860 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080870 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080880 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080890 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0808A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0808B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0808C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0808D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0808E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0808F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
080900 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080910 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080920 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080930 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080940 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080950 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080960 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080970 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080980 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080990 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0809A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0809B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0809C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0809D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0809E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0809F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
test complete.
Thanks for reading.
--Steve
Comments
I wish there were a more in-depth explanation of the counters, or should I say counters for dummies. I feel it's a hard subject to grasp for the beginner.
Jeff T.
BTW, I'll post the whole driver later.
Cheers.
--Steve
BTW: Do you see a way to do something similar with a 4-bit bus?
Kuroneko's counter investigations.
Are you referring to one particular post, and if so, do you have a link to it?
Or just the many great posts Kuroneko has done involving the Prop?
Tom
@tdlivings I don't remember a specific thread where kuroneko went into this specifically, but he does show some awesome work.
(first byte written is junk)
loop:   mov     nib0,ina
        wrbyte  nib1,phsb
        sub     phsb,#1
        mov     nib1,ina
        shl     nib0,#4
        or      nib1,#$FF
        add     nib1,nib0
        djnz    ...
I think I've got one too many instructions in my loop....
Yes, my first code fragment depends on flash or other synchronous device like sdram being on P0..7.
It looks like the rdlong before entering the loop is required to sync the loop with hub access, right?
How many clocks does the wrbyte use? Manual says 7..22, so I suppose you're close to 7...
Yes, some HUB op is required to sync things up before turning on CTRA.
You get 2, 6, 10, 14 etc... normal instructions to stay in sync between HUB op windows.
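The 2, 6, 10, 14 sequence can be reproduced with a little arithmetic. This sketch assumes a cog gets a hub window every 16 clocks, that a hub instruction arriving exactly on its window takes 8 clocks, and that ordinary instructions take 4; `free_slots` is a hypothetical helper name.

```python
HUB_PERIOD = 16   # assumed clocks between one cog's hub windows
HUB_OP = 8        # assumed clocks for a hub op that is already in sync
INSTR = 4         # clocks for an ordinary instruction


def free_slots(windows_skipped):
    """Ordinary instructions that fit between two in-sync hub ops
    when the loop lets `windows_skipped` hub windows go by unused."""
    return ((windows_skipped + 1) * HUB_PERIOD - HUB_OP) // INSTR


print([free_slots(k) for k in range(4)])
```

Any other instruction count makes the next hub op wait for its window, so the loop rounds up to the next multiple of 16 clocks anyway.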
I realized while I was out that if the modified fragment I posted works, you don't need to use PHSB as the HUB pointer. Just increment your HUB pointer in that left-over space.
PS: Maybe you could double speed also with two cogs? Do two ina reads and one wrword in each cog?
Parallel SRAM using lots of pins would only be faster for element at a time access (byte, word, or long).
Initial addressing for 8xSPI SRAM would be slower, but page data could be read/written at full speed (6MB/s @ 96MHz), and only 2 more pins would be used for SPI CS and Clock. It would be a good solution when used with a cache.
So for SpinSocket-Flash for example, you could have 2MB of Flash + 256KB of SRAM in a sandwich and still have 16 free pins (enough for VGA, SD-Card, Keyboard and Mouse).
Slick! ... I had to sleep on it to figure out what you were doing. :-)
(or, in my case use 4 cogs with a nibble not on P0..P3)
Where's your driver? I'll see what I can do with it. I have several of the SST26VF parts.
http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm
It uses 2 interleaved cogs but without counters...
I'll write a driver that looks more like the attachment below, then make a 2 COG version.
Here is the W25Qx2.spin driver. I'll make an SST26VFx2.spin version soon.
Right now, the screen is 256 pixels wide, but my driver can only keep up with a 156 pixel wide image...
I'd love to extend that out to the full 256x192 that this full color TV driver does...
Hi Rayman.
This is the best I can do in the time I have. Nibbles must be on D0..3. It does not try to manage 2 cog resources. There is no way that I can see to shift both nibbles from some arbitrary start pin.
It may be possible to use an arbitrary start pin if you program the data backwards and read it the same way .... Also, PHSB as a pointer will not work with 2 CTRA clock cycles per byte since it will increment by 2 (changing the PLL rate does not help). However, with 2 COGs, having COG 1 PHSB do even writes and COG 2 PHSB do odd writes may work. I'm due for a nap, so there may be errors in my reasoning about 2 COG PHSB or in the code below. My previous snippet was wrong, btw.
Code below is not tested. I haven't had time to build an SST26VF module. If it doesn't work, you can temporarily replace movs t1,ina with an xor ... to look at the timing on a scope and see if it needs to be adjusted. It should be clear where to put this fragment. If this works, you get 2.5MB/s at 80MHz, which is twice as fast in a single COG as what you had.
Once you get this working, then we can try to interleave even/odd address writes with a PHSB setup. With 2 COGs I think only one COG should generate a CTRA clock at sysclk/16 and then the read has to start-up at the right time.
InputData       'Data Input (only do this if arg3>0)
        cmp     arg3,#0 wz,wc
 if_e   jmp     #SqiBufferIO_End
        mov     frqa,_frqa              ' sysclk/32
        mov     phsa,#0
        movs    ctra,#SckPin
        sub     arg4,#1                 ' loop preincrements before wrbyte
        rdbyte  t3, arg4                ' w0 save byte and sync up - restore byte later
        mov     savep, arg4
        movi    ctra,#4 << 3            ' single end NCO clock mode
        rdbyte  t3, savep               ' w0 save byte and sync up - restore byte later
        'Write data
        'First Nibble already on output
InputDataLoop
        movs    t1,ina                  ' get first upper nibble
        rol     t1,#4                   ' retrieve lower from D28..31
        wrbyte  t1, arg4                ' w1 first byte is garbage
        movs    t1,ina                  ' get second nibble
        ror     t1,#4                   ' w2 save lower in D28..31
        add     arg4,#1
        djnz    arg3,#InputDataLoop     ' next byte
        wrbyte  t3, savep
SqiBufferIO_End
        'Bring CE high to end transfer
        or      outa,Cen_Mask
        jmp     #loopEnd
savep   long    0
_frqa   long    $0800_0000
I think it's clear that I'm going to need 2 sets of drivers: One for where we're on P0..P3 and another for when on any other pins.
I'd rather just have one driver, but the ability to double the speed by using P0..P3 is hard to resist...
I just wish our Prop Platform USB and compatible boards didn't have SD on P0..P3. That's a real problem...
I asked Nick to allow cuts/jumps for different pins, but am not sure if he did anything about it.
This reminds me. I think we need a Propeller Platform combo card that combines 8 bit Flash/SPI RAM, and SDCARD.
My only concern is that unlike the 16 or 32 clock cycle loops, 48 does not divide evenly into 2^32.
Not quite sure how big a problem that is though...
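To put a number on the 48-clock worry, here is a small calculation. It assumes FRQA is simply rounded down to the nearest integer for a clkfreq/48 NCO and that PHSA is left free-running; `nco_error` is a hypothetical helper name.

```python
def nco_error(period_clocks):
    """Return (FRQA, residue): the truncated FRQA for a clkfreq/period NCO
    and how many PHSA counts short of 2**32 one nominal period ends up."""
    frqa = (1 << 32) // period_clocks
    residue = (1 << 32) - frqa * period_clocks
    return frqa, residue


for period in (16, 32, 48):
    frqa, residue = nco_error(period)
    print(f"clkfreq/{period}: FRQA={frqa} residue={residue}")

frqa48, residue48 = nco_error(48)
# Loop passes before the accumulated lag equals one whole system clock
print(frqa48 // residue48)
```

With these assumptions the 16 and 32 clock cases come out exact, while the 48 clock case falls behind by a few counts per pass, so the drift only matters over millions of passes, or not at all if PHSA is reloaded regularly.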
There was one major problem... The counter loop would always cycle the clock one more time than I wanted. That made attempts at successive read bursts not work right. (I could end the transfer by raising CE and start over, but didn't want to for this project).
Anyway, here's the trick I used to fix the problem:
        'Setup counters
        sub     t5,#2                   'need to subtract because adding before wrbyte
        mov     frqa,aFreq              '
        mov     phsa,aPhsa
        movs    ctra,#SckPin
        rdbyte  t1, #0                  ' sync up with hub
        movi    ctra,#4 << 3            ' single end NCO clock mode
ReadBmpLineLoop
        mov     phsa,aPhsa              '1
        mov     t1,ina                  '2 'First Nibble already on output
        ror     t1,#Sio0Pin             '3
        shl     t1,#4                   '4
        'need a clock every 3 instructions=12 cycles
        mov     t2,ina                  '5
        ror     t2,#Sio0Pin             '6
        and     t2,#15                  '7
        add     t2,t1                   '8
        add     t5,#2                   '9
        wrbyte  t2,t5                   '10
        'wrbyte counts for 2            '11
        djnz    t4,#ReadBmpLineLoop     '12
        mov     ctra,#0                 'kill clock
I had one free instruction in the loop that I changed from "nop" to "mov phsa,aPhsa".
Now, successive reads work perfectly.
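For anyone following the nibble shuffling in that loop, here is a Python model of just the data path (two INA samples in, one byte out). It is a sketch, not the driver: `ror32` and `assemble_byte` are made-up names, and the pin number in the example is arbitrary.

```python
MASK32 = 0xFFFFFFFF


def ror32(value, bits):
    """32-bit rotate right, like PASM ror."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def assemble_byte(ina_first, ina_second, sio0_pin):
    """Combine two 4-bit samples taken on pins sio0_pin..sio0_pin+3."""
    t1 = ror32(ina_first, sio0_pin)    # first nibble down to bits 3..0
    t1 = (t1 << 4) & MASK32            # move it up to bits 7..4
    t2 = ror32(ina_second, sio0_pin)   # second nibble down to bits 3..0
    t2 &= 15                           # drop everything but the nibble
    t2 = (t2 + t1) & MASK32            # merge the two nibbles
    return t2 & 0xFF                   # wrbyte only stores the low byte


# High nibble $A then low nibble $5 on pins 12..15, with unrelated pins set
first = (0xA << 12) | (0xF << 16) | 0x3
second = (0x5 << 12) | (0xF << 16)
print(hex(assemble_byte(first, second, 12)))
```

Note that the first sample is never masked: whatever sits above the nibble lands above bit 7 after the shift, and the byte-wide store throws it away.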
So, using counters, I've reduced my two cog reading loop from 16 to 12 instruction cycles.
This was enough to increase the refresh rate in 6-bit fullscreen driver for my 4.3" screens from 18 to 25 Hz, which I needed to reduce flicker to a reasonable level...
I really appreciate being able to change the timing of the clock transition.