Sweet-spot INA to HUB Byte Transfers (6MB/s or 48Mbit/s 1 COG @ 96MHz)

jazzed · 2011-03-28 17:10

I guess something similar has been done before. After all there is nothing new under the sun. Still, I thought it was worth sharing. You will find a code fragment below, but first some background.

I'm working on a driver to program/read 2 W25Q flash chips in parallel for SpinSocket-Flash (probably renamed to !nSocket to avoid a trademark dispute). SpinSocket-Flash has 2 of these chips for 2MB of program storage on an 8 bit synchronous data bus. The devices have 4 IO pins, a chip select, and a clock. Basically the W25Q parts are SPI devices with 4 bits on each 8 pin chip rather than just 1 bit per chip. Rayman is using a similar part in Flash Point. I have samples of these SQI parts too, but have not developed a driver for them yet.

As mentioned above, with 2 chips you can create an 8 bit synchronous data bus. Propeller is very good at precision timing (precise to a fault really: straitjacket timing). Since a Propeller output pin driven by CTRA clocks the SPI bus, a second counter (CTRB) in edge-detect mode can add FRQB to PHSB on each falling edge of that clock. PHSB then serves as an auto-incrementing pointer into hub memory. This is a very convenient setup for getting the highest possible "sweet-spot" byte-wide throughput available on Propeller in a single COG from INA to HUB.

Here is an example of reading INA and saving data to HUB using PHSB counter pointer.

'--------------------------------------------------------------------
' read buffer from Flash to HUB at 5MB/s @ 80MHz, 6MB/s @ 96MHz, or 6.25MB/s @ 100MHz
'
_doRead
                mov     phsb,tmp        ' tmp is the HUB pointer start address
                mov     frqb,#1         ' phsb accumulate once per EDGE
                movs    ctrb,#clkpin    ' detect the SPI clock edge
                movi    ctrb,#%01110<<3 ' Negative EDGE detect
                mov     frqa,_frqa1     ' phsa accumulate frequency at clkfreq/16
                mov     phsa,#0         ' phsa initial value
                rdlong  _len, _mblen    ' get length here to sync before ctra enable
                mov     ctra,_ctra      ' start clock here
:loop
                mov     _byte0,ina      ' read byte
                wrbyte  _byte0,phsb     ' write to hub pointer
                djnz    _len,#:loop     ' do until _len bytes are stored

                mov     ctra,#0         ' kill clock
_doRead_ret     ret
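
The rates quoted in the fragment's header comment follow from the loop timing: mov (4 clocks), a hub-synchronized wrbyte (8 clocks) and djnz (4 clocks) add up to one 16-clock hub window, so one byte moves per 16 system clocks. The Python sketch below is only a check of that arithmetic, not part of the driver. The value of _frqa1 is not shown in the fragment, so the number computed here is an assumption derived from the "clkfreq/16" comment and from the NCO output following PHSA bit 31.

```python
# Back-of-envelope check of the loop timing (not driver code).
HUB_WINDOW_CLOCKS = 16  # a cog gets one hub access slot every 16 system clocks


def bytes_per_second(clkfreq_hz):
    """One byte is transferred per hub window."""
    return clkfreq_hz // HUB_WINDOW_CLOCKS


def nco_frq(period_clocks):
    """FRQ value that makes PHSx bit 31 complete one cycle every period_clocks.

    Assumption: NCO mode adds FRQ to PHS every clock, so the period is
    2**32 / FRQ system clocks.
    """
    return (1 << 32) // period_clocks


if __name__ == "__main__":
    for mhz in (80, 96, 100):
        rate = bytes_per_second(mhz * 1_000_000)
        print(f"{mhz} MHz -> {rate / 1e6} MB/s")
    print(f"FRQA for clkfreq/16: ${nco_frq(16):08X}")
```

The same formula gives $0800_0000 for a clkfreq/32 clock.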

Here's a dump from the test program. Just stepping through the output, the algorithm is:

Read JEDEC Manufacturer's ID and Device Type
Erase a sector for programming
Read some values to ensure erase worked
Fill buffer with an incrementing pattern (except $55 at offset 0)
Program the buffer
Clear the buffer so we know the data read is valid
Read some values to make sure reads worked
Read Flash buffer and show programmed data

c:\Propeller\inSocket>bstc -ls -d COM6 -p0 -Ograux -L c:\bstc\spin w25qx2.spin
Brads Spin Tool Compiler v0.15.4-pre3 - Copyright 2008,2009 All rights reserved
Compiled for i386 Win32 at 19:57:58 on 2010/01/15
Loading Object w25qx2
Loading Object FullDuplexSingleton
Program size is 2772 longs
Compiled 688 Lines of Code in 0.017 Seconds
We found a Propeller Version 1
Propeller Load took 1.103 Seconds

c:\Propeller\inSocket>uterm COM6 115200

MFG ID   EF
DEV TYPE 40
erase sector
read byte 000000FF
read word 0000FFFF
read long FFFFFFFF
fill buffer
prog buffer
clear buffer
read bytes 55 01 02 03
read word 00000155
read long 03020155
read buffer
show buffer

080800 55 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080810 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080820 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080830 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080840 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080850 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080860 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080870 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080880 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080890 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0808A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0808B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0808C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0808D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0808E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0808F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
080900 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
080910 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
080920 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
080930 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
080940 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
080950 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
080960 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F
080970 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E 7F
080980 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
080990 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
0809A0 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
0809B0 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
0809C0 C0 C1 C2 C3 C4 C5 C6 C7 C8 C9 CA CB CC CD CE CF
0809D0 D0 D1 D2 D3 D4 D5 D6 D7 D8 D9 DA DB DC DD DE DF
0809E0 E0 E1 E2 E3 E4 E5 E6 E7 E8 E9 EA EB EC ED EE EF
0809F0 F0 F1 F2 F3 F4 F5 F6 F7 F8 F9 FA FB FC FD FE FF
test complete.

Thanks for reading.
--Steve

Unsoundcode · 2011-03-28 18:28

Steve, glad you shared, more from the perspective of learning assembly than any particular use of the snippet.

I wish there were a more in-depth explanation of the counters, or should I say counters for dummies. I feel it's a hard subject to grasp for the beginner.

Jeff T.

jazzed · 2011-03-28 20:07

Unsoundcode wrote: »

I wish there were a more in-depth explanation of the counters, or should I say counters for dummies. I feel it's a hard subject to grasp for the beginner.

I agree wholeheartedly. It took me a while to "get it" even with the counter app note, Propeller datasheet, Propeller manual, and an oscilloscope on my desk. I know there are things I still have to learn about the counters. Thank goodness for Kuroneko's counter investigations.

BTW, I'll post the whole driver later.

Cheers.
--Steve

Andrey Demenev · 2011-03-29 01:07

jazzed wrote: »

After all there is nothing new under the sun.

For sure. I've been doing a similar thing when sampling 2 10-bit ADCs at 24 MSPS with 4 interleaved cogs.

Rayman · 2011-03-29 06:28

Thanks for posting this. I think others have mentioned a similar thing before, but you've made it very clear...

BTW: Do you see a way to do something similar with a 4-bit bus?

tdlivings · 2011-03-29 08:39

@Steve
Regarding "Kuroneko's counter investigations": are you referring to one particular post, and if so do you have a link to it, or just the many great posts Kuroneko has done involving the Prop?

Tom

jazzed · 2011-03-29 09:31

Rayman wrote: »

Do you see a way to do something similar with a 4-bit bus?

Rayman, I've played with this a little today, and there are big challenges. Even at half the data rate it does not seem possible to sample and assemble a BYTE in time for writing because of asymmetric code (maybe a WORD or LONG could be done?). It seems easy at a fourth the data rate, but I don't know if using PHSB as a hub pointer buys anything at that point.

@tdlivings I don't remember a specific thread where kuroneko went into this specifically, but he does show some awesome work.

Rayman · 2011-03-29 09:52

jazzed, I think I just noticed that your fragment requires the bus to be on P0..P7, right?

Rayman · 2011-03-29 10:02

jazzed, what about something like this for a 4-bit bus:
(first byte written is junk)

loop:
  mov  nib0,ina
  wrbyte nib1,phsb
  sub phsb,#1
  mov nib1,ina
  shl  nib0,#4
  and nib1,#$0F   ' keep only the low nibble
  add nib1,nib0
  djnz ...

I think I've got one too many instructions in my loop....

jazzed · 2011-03-29 10:31

Rayman wrote: »

jazzed, what about something like this for a 4-bit bus:
(first byte written is junk)

I think I've got one too many instructions in my loop....

Ok, moving the wrbyte up in the loop helps the symmetry problem. Post #25 has a better solution.

Yes, my first code fragment depends on flash or other synchronous device like sdram being on P0..7.

Rayman · 2011-03-29 10:59

jazzed, Thanks for the tips.

It looks like the rdlong before entering the loop is required to sync the loop with hub access, right?

How many clocks does the wrbyte use? Manual says 7..22, so I suppose you're close to 7...

jazzed · 2011-03-29 12:33

Rayman wrote: »

jazzed, Thanks for the tips.

It looks like the rdlong before entering the loop is required to sync the loop with hub access, right?

How many clocks does the wrbyte use? Manual says 7..22, so I suppose you're close to 7...

Yes, some HUB op is required to sync things up before turning on CTRA.

You get 2, 6, 10, 14 etc... normal instructions to stay in sync between HUB op windows.
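
The 2, 6, 10, 14 sequence can be checked with a little arithmetic. This is a sketch in Python (not Propeller code), assuming a hub op that hits its window costs 8 clocks, every other instruction in the loop costs 4 clocks, and the hub window comes around every 16 clocks:

```python
# Which counts of ordinary instructions keep a loop locked to the hub window?
HUB_WINDOW_CLOCKS = 16  # hub access slot repeats every 16 system clocks
HUB_OP_CLOCKS = 8       # assumed cost of a hub op that is already in sync
NORMAL_OP_CLOCKS = 4    # assumed cost of every other instruction in the loop


def in_sync_counts(limit):
    """Return the instruction counts n (1..limit) for which one hub op plus
    n normal instructions take a whole number of hub windows."""
    return [
        n
        for n in range(1, limit + 1)
        if (HUB_OP_CLOCKS + NORMAL_OP_CLOCKS * n) % HUB_WINDOW_CLOCKS == 0
    ]


if __name__ == "__main__":
    print(in_sync_counts(16))  # prints [2, 6, 10, 14]
```

Any other count still works, but the hub op then waits for the next window, so the loop costs the same as the next larger count in the list.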

I realized while I was out that if the modified fragment I posted works, you don't need to use PHSB as the HUB pointer. Just increment your HUB pointer in that left-over space.

Rayman · 2011-03-29 12:46

Excellent... So, I think it's 8 clocks for the hub access... So, now I want to make it even more difficult and use 2 cogs to get 8-bit 1-cog speed with 4-bit bus...

PS: Maybe you could double speed also with two cogs? Do two ina reads and one wrword in each cog?

jazzed · 2011-03-29 15:27

Rayman wrote: »

PS: Maybe you could double speed also with two cogs? Do two ina reads and one wrword in each cog?

Sure, such designs are a little tedious, but I've done several. There is certainly more to explore. The rest of my day today is wasted though. I'll look at it more tomorrow.

hinv · 2011-03-29 18:53

Are there any sram devices with the same bus?

jazzed · 2011-03-29 19:50

hinv wrote: »

Are there any sram devices with the same bus?

I wish!

Of course 8x TSSOP SPI chips would fit on a DIP32 size board. Guess I'll have to make one.

David Betz · 2011-03-29 20:25

jazzed wrote: »

I wish! Of course 8x TSSOP SPI chips would fit on a DIP32 size board. Guess I'll have to make one.

What would be the advantage of doing that over just using a parallel SRAM chip? I guess you need a bunch of latches and/or counters to drive the SRAM chip but wouldn't it end up being faster than a bunch of SPI chips?

jazzed · 2011-03-29 21:44

David Betz wrote: »

What would be the advantage of doing that over just using a parallel SRAM chip? I guess you need a bunch of latches and/or counters to drive the SRAM chip but wouldn't it end up being faster than a bunch of SPI chips?

Fast access and lots of free pins!

Parallel SRAM using lots of pins would only be faster for element at a time access (byte, word, or long).

Initial addressing for 8xSPI SRAM would be slower, but page data could be read/written at full speed (6MB/s @ 96MHz), and only 2 more pins would be used for SPI CS and Clock. It would be a good solution when used with a cache.

So for SpinSocket-Flash for example, you could have 2MB of Flash + 256KB of SRAM in a sandwich and still have 16 free pins (enough for VGA, SD-Card, Keyboard and Mouse).

Beau Schwabe · 2011-03-29 22:51

jazzed,

Slick! ... I had to sleep on it to figure out what you were doing. :-)

Rayman · 2011-03-30 10:23

I wonder if there's a way to use 2 cogs for cases where you're not on P0..P7 in order to get the speed you get here with one cog...

(or, in my case use 4 cogs with a nibble not on P0..P3)

jazzed · 2011-03-30 11:17

Beau Schwabe (Parallax) wrote: »

jazzed,

Slick! ... I had to sleep on it to figure out what you were doing. :-)

Sorry if I did not present the idea very clearly.

Rayman wrote: »

I wonder if there's a way to use 2 cogs for cases where you're not on P0..P7 in order to get the speed you get here with one cog...

(or, in my case use 4 cogs with a nibble not on P0..P3)

Where's your driver? I'll see what I can do with it. I have several of the SST26VF parts.

Rayman · 2011-03-30 11:35

jazzed, I have an assembly driver as part of the FPS TV Bitmap Viewer example here:

http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm

It uses 2 interleaved cogs but without counters...

jazzed · 2011-03-30 12:10

Rayman wrote: »

jazzed, I have an assembly driver as part of the FPS TV Bitmap Viewer example here:

http://www.rayslogic.com/Propeller/Products/FlashPoint/FlashPoint.htm

It uses 2 interleaved cogs but without counters...

PASM in FlashPointSuperQuadA2.spin ?
I'll write a driver that looks more like the attachment below, then make a 2 COG version.

Here is the W25Qx2.spin driver. I'll make an SST26VFx2.spin version soon.

Rayman · 2011-03-30 15:54

jazzed, yep, that's it. I'm hoping it can be made faster with counters...

Right now, the screen is 256 pixels wide, but my driver can only keep up with a 156 pixel wide image...
I'd love to extend that out to the full 256x192 that this full color TV driver does...

jazzed · 2011-03-30 19:34

Rayman wrote: »

jazzed, yep, that's it. I'm hoping it can be made faster with counters...

Right now, the screen is 256 pixels wide, but my driver can only keep up with a 156 pixel wide image...
I'd love to extend that out to the full 256x192 that this full color TV driver does...

Hi Rayman.

This is the best I can do in the time I have. Nibbles must be on D0..3. It does not try to manage 2 cog resources. There is no way that I can see to shift both nibbles from some arbitrary start pin.

It may be possible to use an arbitrary start pin if you program the data backwards and read it the same way .... Also PHSB as a pointer will not work with 2 CTRA clock cycles per byte since it will increment by 2 (changing PLL rate does not help). However with 2 COGS, having COG 1 PHSB do even writes and COG 2 PHSB do odd writes may work. I'm due for a nap, so there may be errors in my reasoning regarding 2 COG PHSB or in the code below. My previous snippet was wrong btw.

Code below is not tested. I haven't had time to build an SST26VF module. In case it doesn't work, you can temporarily replace movs t1,ina with an xor ... to look at the timing on a scope and see if it needs to be adjusted. It should be clear where to put this fragment. If this works, you get 2.5MB/s at 80MHz, which is twice as fast in a single COG as what you had.

Once you get this working, then we can try to interleave even/odd address writes with a PHSB setup. With 2 COGs I think only one COG should generate a CTRA clock at sysclk/16 and then the read has to start-up at the right time.

InputData
              'Data Input (only do this if arg3>0)
              cmp       arg3,#0 wz,wc
        if_e  jmp       #SqiBufferIO_End

              mov       frqa,_frqa              ' sysclk/32
              mov       phsa,#0
              movs      ctra,#SckPin

              sub       arg4,#1                 ' loop preincrements before wrbyte
              rdbyte    t3, arg4                ' w0 save byte and sync up - restore byte later
              mov       savep, arg4
              movi      ctra,#4 << 3            ' single end NCO clock mode
              rdbyte    t3, savep               ' w0 save byte and sync up - restore byte later
              'Write data
              'First Nibble already on output
InputDataLoop
              movs      t1,ina                  ' get first upper nibble  
              rol       t1,#4                   ' retrieve lower from D28..31
              wrbyte    t1, arg4                ' w1 first byte is garbage
              movs      t1,ina                  ' get second nibble
              ror       t1,#4                   ' w2 save lower in D28..31
              add       arg4,#1
              djnz      arg3,#InputDataLoop     ' next byte

              wrbyte    t3, savep
              
SqiBufferIO_End
              'Bring CE high to end transfer 
              or        outa,Cen_Mask                                     
              jmp       #loopEnd
              
savep         long      0
_frqa         long      $0800_0000

Rayman · 2011-03-31 06:22

jazzed, thanks. It's going to take me a while to digest that...

I think it's clear that I'm going to need 2 sets of drivers: One for where we're on P0..P3 and another for when on any other pins.

I'd rather just have one driver, but the ability to double the speed by using P0..P3 is hard to resist...

I just wish our Prop Platform USB and compatible boards didn't have SD on P0..P3. That's a real problem...

jazzed · 2011-03-31 08:53

Rayman wrote: »

I just wish our Prop Platform USB and compatible boards didn't have SD on P0..P3. That's a real problem...

That's why I put an SDCARD on the SDRAM board in a more reasonable place. SDCARD does not benefit from being on D0..3 and is generally in the way but the pull-ups will not hurt data when the SDCARD is not installed.

I asked Nick to allow cuts/jumps for different pins, but am not sure if he did anything about it.

This reminds me. I think we need a Propeller Platform combo card that combines 8 bit Flash/SPI RAM, and SDCARD.

Rayman · 2011-04-20 00:15

I'm about to try a counter controlled read with 10 regular and 1 wrbyte instruction. That's 12 instruction cycles or 48 clock cycles.

My only concern is that unlike the 16 or 32 clock cycle loops, 48 does not divide evenly into 2^32.

Not quite sure how big a problem that is though...

kuroneko · 2011-04-20 00:25

Rayman wrote: »

My only concern is that unlike the 16 or 32 clock cycle loops, 48 does not divide evenly into 2^32.

It's still 16n, which is all you need re: hub window alignment. So I wouldn't worry just yet.
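
Both points can be put in numbers. The sketch below is Python arithmetic, not Propeller code; it assumes an NCO clock whose period is 2^32 / FRQA system clocks and treats the 48-clock figure as an example period only. It checks the 16n condition and shows how far the phase accumulator falls short per period when the period is not a power of two:

```python
# Hub alignment versus NCO divisibility for a given period in system clocks.
HUB_WINDOW_CLOCKS = 16


def hub_aligned(loop_clocks):
    """True when the loop length is a whole number of hub windows (16n)."""
    return loop_clocks % HUB_WINDOW_CLOCKS == 0


def nco_frq_and_shortfall(period_clocks):
    """Return (frq, shortfall): the truncated FRQ value for the requested
    period and the number of accumulator counts by which PHS misses a full
    2**32 wrap after period_clocks clocks."""
    frq = (1 << 32) // period_clocks
    shortfall = (1 << 32) - frq * period_clocks
    return frq, shortfall


if __name__ == "__main__":
    for period in (16, 32, 48):
        frq, shortfall = nco_frq_and_shortfall(period)
        print(period, hub_aligned(period), hex(frq), shortfall)
```

For 48 clocks the shortfall is 16 counts out of 2^32 per period, so the clock edge creeps very slowly relative to the loop. Whether that matters depends on how long a burst runs without PHSA being reloaded.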

Rayman · 2011-04-22 02:59

Well, it took a lot of tinkering, but I finally have two cogs working together using counters to clock the FlashPoint SQI flash chip.

There was one major problem... The counter loop would always cycle the clock one more time than I wanted. That made attempts at successive read bursts not work right. (I could end the transfer by raising CE and start over, but didn't want to for this project).

Anyway, here's the trick I used to fix the problem:

'Setup counters
              sub       t5,#2 'need to subtract because adding before wrbyte
              mov       frqa,aFreq           ' 
              mov       phsa,aPhsa
              movs      ctra,#SckPin                   
              rdbyte    t1, #0               ' sync up with hub 
              movi      ctra,#4 << 3            ' single end NCO clock mode 
ReadBmpLineLoop
              mov       phsa,aPhsa          '1
              mov       t1,ina              '2  'First Nibble already on output
              ror       t1,#Sio0Pin         '3
              shl       t1,#4               '4  'need a clock every 3 instructions=12 cycles
              mov       t2,ina              '5
              ror       t2,#Sio0Pin         '6
              and       t2,#15              '7
              add       t2,t1               '8
              add       t5,#2               '9
              wrbyte    t2,t5               '10
              'wrbyte counts for 2          '11
              djnz      t4,#ReadBmpLineLoop '12
              mov       ctra,#0 'kill clock

I had one free instruction in the loop that I changed from "nop" to "mov phsa,aPhsa".
Now, successive reads work perfectly.

So, using counters, I've reduced my two cog reading loop from 16 to 12 instruction cycles.
This was enough to increase the refresh rate in 6-bit fullscreen driver for my 4.3" screens from 18 to 25 Hz, which I needed to reduce flicker to a reasonable level...
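
For readers following the nibble handling in ReadBmpLineLoop, here is a bit-level model of instructions 2 through 8 and the wrbyte. It is a sketch in Python rather than PASM; the pin number used in the example is hypothetical, since the value of Sio0Pin is not given in the post.

```python
# Model of how two INA samples become one byte in the loop above.
MASK32 = 0xFFFFFFFF


def ror(value, bits):
    """32-bit rotate right, as PASM ror does on a cog register."""
    bits %= 32
    return ((value >> bits) | (value << (32 - bits))) & MASK32


def assemble_byte(ina_first, ina_second, sio0_pin):
    """Combine the high nibble (first sample) and low nibble (second sample)."""
    t1 = ror(ina_first, sio0_pin)   # ror t1,#Sio0Pin : nibble moves to bits 3..0
    t1 = (t1 << 4) & MASK32         # shl t1,#4       : high nibble into bits 7..4
    t2 = ror(ina_second, sio0_pin)  # ror t2,#Sio0Pin
    t2 &= 15                        # and t2,#15      : keep only the low nibble
    t2 = (t2 + t1) & MASK32         # add t2,t1       : same as OR, bits 3..0 of t1 are 0
    return t2 & 0xFF                # wrbyte stores only the low 8 bits


if __name__ == "__main__":
    pin = 4  # hypothetical Sio0Pin: nibble on P4..P7
    first = (0x5 << pin) | 0x80000003   # high nibble 5 plus unrelated pin activity
    second = (0xC << pin) | 0x00000F00  # low nibble C plus unrelated pin activity
    print(hex(assemble_byte(first, second, pin)))
```

The first sample is never masked: whatever other pins contribute ends up above bit 7 after the shift, and wrbyte discards it.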

jazzed · 2011-04-22 07:12

Rayman wrote: »

Well, it took a lot of tinkering, but I finally have two cogs working together using counters to clock the FlashPoint SQI flash chip.

Good job.

What is the value of aPhsa ?
I really appreciate being able to change the timing of the clock transition.
