Sweet-spot INA to HUB Byte Transfers (6MB/s or 48Mbit/s 1 COG @ 96MHz)
I guess something similar has been done before. After all there is nothing new under the sun. Still, I thought it was worth sharing. You will find a code fragment below, but first some background.
I'm working on a driver to program/read 2 W25Q flash chips in parallel for SpinSocket-Flash (probably renamed to !nSocket to avoid a trademark dispute). SpinSocket-Flash has 2 of these chips for 2MB of program storage on an 8 bit synchronous data bus. The devices have 4 IO pins, a chip select, and a clock. Basically the W25Q parts are SPI devices with 4 bits on each 8 pin chip rather than just 1 bit per chip. Rayman is using a similar part in Flash Point. I have samples of these SQI parts too, but have not developed a driver for them yet.
As mentioned above, with 2 chips, you can create an 8 bit synchronous data bus. Propeller is very good at precision timing (precise to a fault really: straight-jacket timing). Given the fact that a Propeller output pin clocks the SPI (CTRA in this case), an edge counter (CTRB) can be used to adjust PHSB on each transition of the clock. So PHSB is an indirect pointer to hub memory. This is a very convenient set up to get the highest possible "sweet-spot" byte-wide throughput available on Propeller in a single COG from INA to HUB.
Here is an example of reading INA and saving data to HUB using PHSB counter pointer.
'--------------------------------------------------------------------
' read buffer from Flash to HUB at 5MB/s @ 80MHz, 6MB/s @ 96MHz, or 6.25MB/s @ 100MHz
'
_doRead
                mov     phsb,tmp                ' tmp is the HUB pointer start address
                mov     frqb,#1                 ' phsb accumulate once per EDGE
                movs    ctrb,#clkpin            ' detect the SPI clock edge
                movi    ctrb,#%01110<<3         ' Negative EDGE detect
                mov     frqa,_frqa1             ' phsa accumulate frequency at clkfreq/16
                mov     phsa,#0                 ' phsa initial value
                rdlong  _len, _mblen            ' get length here to sync before ctra enable
                mov     ctra,_ctra              ' start clock here
:loop           mov     _byte0,ina              ' read byte
                wrbyte  _byte0,phsb             ' write to hub pointer
                djnz    _len,#:loop             ' do until _len bytes are stored
                mov     ctra,#0                 ' kill clock
_doRead_ret     ret
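The rates quoted in the code comment follow from the loop taking exactly one hub window (16 system clocks) per byte. Here is a rough Python check of that arithmetic. It is only a sketch: the function names are made up for illustration, and the FRQA value assumes CTRA runs in plain NCO mode, which the post does not actually show (the `_ctra` and `_frqa1` constants are not listed).

```python
HUB_WINDOW_CLOCKS = 16  # a cog gets one hub access slot every 16 system clocks


def bytes_per_second(clkfreq_hz, windows_per_byte=1):
    """Throughput when each stored byte costs a whole number of hub windows."""
    return clkfreq_hz // (HUB_WINDOW_CLOCKS * windows_per_byte)


def nco_frq(divider):
    """FRQ value that makes a 32-bit phase accumulator wrap once every
    `divider` system clocks (assumed NCO mode; exact for powers of two)."""
    return (1 << 32) // divider


for mhz in (80, 96, 100):
    print(mhz, "MHz ->", bytes_per_second(mhz * 1_000_000), "bytes/s")

# Assumed value for a clkfreq/16 SPI clock in NCO mode
print(hex(nco_frq(16)))
```

With one window per byte this gives 5,000,000, 6,000,000 and 6,250,000 bytes/s for 80, 96 and 100MHz, matching the comment at the top of the routine.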
Here's a dump from the test program. Just stepping through the output, the algorithm is:
- Read JEDEC Manufacturer's ID and Device Type
- Erase a sector for programming
- Read some values to ensure erase worked
- Fill buffer with an incrementing pattern (except $55 at offset 0)
- Program the buffer
- Clear the buffer so we know the data read is valid
- Read some values to make sure reads worked
- Read Flash buffer and show programmed data
c:\Propeller\inSocket>bstc -ls -d COM6 -p0 -Ograux -L c:\bstc\spin w25qx2.spin
Brads Spin Tool Compiler v0.15.4-pre3 - Copyright 2008,2009 All rights reserved
Compiled for i386 Win32 at 19:57:58 on 2010/01/15
Loading Object w25qx2
Loading Object FullDuplexSingleton
Program size is 2772 longs
Compiled 688 Lines of Code in 0.017 Seconds
We found a Propeller Version 1
Propeller Load took 1.103 Seconds

c:\Propeller\inSocket>uterm COM6 115200
MFG ID EF
DEV TYPE 40
erase sector
read byte 000000FF
read word 0000FFFF
read long FFFFFFFF
fill buffer
prog buffer
clear buffer
read bytes 55 01 02 03
read word 00000155
read long 03020155
read buffer
show buffer
080800 55 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080810 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080820 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080830 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080840 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080850 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080860 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080870 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080880 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080890 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0808A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0808B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0808C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0808D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0808E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0808F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
080900 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080910 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080920 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080930 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080940 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080950 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080960 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080970 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080980 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080990 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0809A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0809B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0809C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0809D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0809E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0809F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
test complete.
Thanks for reading.
--Steve
Comments
I wish there were a more in-depth explanation of the counters, or should I say "counters for dummies". I feel it's a hard subject for a beginner to grasp.
Jeff T.
BTW, I'll post the whole driver later.
Cheers.
--Steve
BTW: Do you see a way to do something similar with a 4-bit bus?
Kuroneko's counter investigations.
Are you referring to one particular post, and if so do you have a link to it? Or just to the many great posts Kuroneko has done involving the Prop?
Tom
@tdlivings I don't remember a specific thread where kuroneko went into this, but he does show some awesome work.
(first byte written is junk)
I think I've got one too many instructions in my loop....
Yes, my first code fragment depends on flash or other synchronous device like sdram being on P0..7.
It looks like the rdlong before entering the loop is required to sync the loop with hub access, right?
How many clocks does the wrbyte use? Manual says 7..22, so I suppose you're close to 7...
Yes, some HUB op is required to sync things up before turning on CTRA.
You get 2, 6, 10, 14 etc... normal instructions to stay in sync between HUB op windows.
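The 2, 6, 10, 14 sequence falls out of simple clock counting. Here is a quick Python sketch of it. The names are illustrative, and the 8-clock figure for a hub instruction that is already in sync is an assumption on my part (the manual figure quoted above is 7..22, so treat that constant as adjustable).

```python
HUB_WINDOW_CLOCKS = 16   # hub access comes around every 16 system clocks
HUB_OP_CLOCKS = 8        # assumed best-case cost of a hub op that is already in sync
NORMAL_OP_CLOCKS = 4     # ordinary instructions such as mov or djnz (jump taken)


def free_instruction_slots(windows):
    """Normal instructions that fit in a loop spanning `windows` hub windows
    and containing exactly one hub instruction."""
    return (windows * HUB_WINDOW_CLOCKS - HUB_OP_CLOCKS) // NORMAL_OP_CLOCKS


print([free_instruction_slots(k) for k in range(1, 5)])  # [2, 6, 10, 14]
```

The read loop in the first post is the one-window case: `mov` and `djnz` are the two normal instructions around the `wrbyte`.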
I realized while I was out that if the modified fragment I posted works, you don't need to use PHSB as the HUB pointer. Just increment your HUB pointer in that left-over space.
PS: Maybe you could double speed also with two cogs? Do two ina reads and one wrword in each cog?
Parallel SRAM using lots of pins would only be faster for element at a time access (byte, word, or long).
Initial addressing for 8xSPI SRAM would be slower, but page data could be read/written at full speed (6MB/s @ 96MHz), and only 2 more pins would be used for SPI CS and Clock. It would be a good solution when used with a cache.
So for SpinSocket-Flash for example, you could have 2MB of Flash + 256KB of SRAM in a sandwich and still have 16 free pins (enough for VGA, SD-Card, Keyboard and Mouse).
Slick! ... I had to sleep on it to figure out what you were doing. :-)
(or, in my case use 4 cogs with a nibble not on P0..P3)
Where's your driver? I'll see what I can do with it. I have several of the SST26VF parts.
http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm
It uses 2 interleaved cogs but without counters...
I'll write a driver that looks more like the attachment below, then make a 2 COG version.
Here is the W25Qx2.spin driver. I'll make an SST26VFx2.spin version soon.
Right now, the screen is 256 pixels wide, but my driver can only keep up with a 156 pixel wide image...
I'd love to extend that out to the full 256x192 that this full color TV driver does...
Hi Rayman.
This is the best I can do in the time I have. Nibbles must be on D0..3. It does not try to manage 2 cog resources. There is no way that I can see to shift both nibbles from some arbitrary start pin.
It may be possible to use an arbitrary start pin if you program the data backwards and read it the same way .... Also PHSB as a pointer will not work with 2 CTRA clock cycles since it will increment by 2 (changing PLL rate does not help). However with 2 COGS, having COG 1 PHSB do even writes and COG 2 PHSB do odd writes may work. I'm due for a nap, so there may be errors in my reasoning about 2 COG PHSB or in the code below. My previous snippet was wrong btw.
Code below is not tested. I haven't had time to build an SST26VF module. In case it doesn't work, you can temporarily replace movs t1,ina with an xor ... to look at the timing on a scope and see if it needs to be adjusted. It should be clear where to put this fragment. If this works, you get 2.5MB/s at 80MHz, which is twice as fast in a single COG as what you had.
Once you get this working, then we can try to interleave even/odd address writes with a PHSB setup. With 2 COGs I think only one COG should generate a CTRA clock at sysclk/16 and then the read has to start-up at the right time.
I think it's clear that I'm going to need 2 sets of drivers: One for where we're on P0..P3 and another for when on any other pins.
I'd rather just have one driver, but the ability to double the speed by using P0..P3 is hard to resist...
I just wish our Prop Platform USB and compatible boards didn't have SD on P0..P3. That's a real problem...
I asked Nick to allow cuts/jumps for different pins, but am not sure if he did anything about it.
This reminds me. I think we need a Propeller Platform combo card that combines 8 bit Flash/SPI RAM, and SDCARD.
My only concern is that unlike the 16 or 32 clock cycle loops, 48 does not divide evenly into 2^32.
Not quite sure how big a problem that is though...
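To put a number on the 48-clock concern, here is a small Python calculation. It is a sketch under assumptions, not a statement about the actual driver: it assumes a free-running NCO whose FRQ value is 2^32 divided by the loop length and rounded down, and the function name is made up.

```python
PHASE_WRAP = 1 << 32  # PHSx is a 32-bit accumulator


def nco_divider_error(divider):
    """Return (frq, shortfall): the rounded-down FRQ value for a clock period of
    `divider` system clocks, and the phase counts still missing after one period."""
    frq = PHASE_WRAP // divider
    shortfall = PHASE_WRAP - frq * divider
    return frq, shortfall


for divider in (16, 32, 48):
    frq, shortfall = nco_divider_error(divider)
    print(divider, hex(frq), shortfall)

frq48, shortfall48 = nco_divider_error(48)
# Periods before the accumulated shortfall equals one system clock of phase
print(round(frq48 / shortfall48))
```

For 16 and 32 the shortfall is zero. For 48 the FRQ value comes out as $555_5555 and the phase falls 16 counts behind per period, so under these assumptions it would take roughly 5.6 million periods to slip by one system clock. Reloading PHSA on every pass of the loop would presumably keep that error from accumulating.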
There was one major problem... The counter loop would always cycle the clock one more time than I wanted. That made attempts at successive read bursts not work right. (I could end the transfer by raising CE and start over, but didn't want to for this project).
Anyway, here's the trick I used to fix the problem:
I had one free instruction in the loop that I changed from "nop" to "mov phsa,aPhsa".
Now, successive reads work perfectly.
So, using counters, I've reduced my two cog reading loop from 16 to 12 instruction cycles.
This was enough to increase the refresh rate in the 6-bit fullscreen driver for my 4.3" screens from 18 to 25 Hz, which I needed to reduce flicker to a reasonable level...
I really appreciate being able to change the timing of the clock transition.