Bursting data to/from Cog RAM- 6.6MB/sec using 9 pins? — Parallax Forums

HannoHanno Posts: 1,130
edited 2009-12-14 17:51 in Propeller 1
What's the fastest way to access lots of RAM from a Propeller cog using the fewest IO pins?
I spent some time thinking (started by another very lively thread :) ) and had a very rough thought- maybe others can fill in the details.
I would use 8 IO lines to get bytes in/out. I would use burst mode, where I ideally tell some other device (might be discrete logic, or another cog) the start address and the read/write mode- this takes at minimum 1 IO line. Once set up, the cog could burst read/write into cog memory- very fast! Something like:
  read     mov addr,ina 
           add read,incdest
           djnz cntr,#read



This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve? Please run with this and turn it into an affordable product!
Hanno
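The 6.6MB/sec figure above follows directly from the instruction count. A minimal sketch of the arithmetic, assuming an 80 MHz system clock and 4 clocks per instruction (neither number is stated in the post, and the names are only illustrative):

```python
CLOCK_HZ = 80_000_000        # assumed system clock
CLOCKS_PER_INSTRUCTION = 4   # assumed cost of mov / add / djnz (jump taken)

def burst_rate(instructions_per_byte: int) -> float:
    """Bytes per second for a loop that moves one byte per pass."""
    clocks_per_byte = instructions_per_byte * CLOCKS_PER_INSTRUCTION
    return CLOCK_HZ / clocks_per_byte

rate = burst_rate(3)   # mov + add + djnz
print(f"{rate / 1e6:.2f} MB/sec")
```

Under the same assumptions, a fully unrolled loop with one instruction per byte would come out at three times this rate.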

Comments

  • GiemmeGiemme Posts: 85
    edited 2009-05-06 14:58
    Hi Hanno

    An 8-bit data bus was proven with the open source HIVE computer by drohne23
    http://drohne.piranho.de/pic-hive-board2jpg.htm

    I am prototyping it right now

    Regards

    Gianni
  • HannoHanno Posts: 1,130
    edited 2009-05-06 15:09
    Sounds great! Once it's working maybe we can get spinstudio to do a board for us?
    Hanno
  • Linus AkessonLinus Akesson Posts: 22
    edited 2009-05-06 15:32
    Hanno said...
    What's the fastest way to access lots of RAM from a Propeller cog using the fewest IO pins?
    I spent some time thinking (started by another very lively thread :) ) and had a very rough thought- maybe others can fill in the details.
    I would use 8 IO lines to get bytes in/out. I would use burst mode, where I ideally tell some other device (might be discrete logic, or another cog) the start address and the read/write mode- this takes at minimum 1 IO line. Once set up, the cog could burst read/write into cog memory- very fast! Something like:
      read     mov addr,ina 
               add read,incdest
               djnz cntr,#read
    
    


    This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve? Please run with this and turn it into an affordable product!
    Hanno

    You could always unroll the loop, depending on the amount of data. The unrolled loop could even overlap the buffer:

    addr
    read
                        mov    addr, INA
                        mov    addr + 1, INA
                        mov    addr + 2, INA
                        ...
    
    



    The unrolled loop would obviously be generated during the preparation phase. Something like:

    prepare
                        mov    temp, instr
                        movd   loop, #read
                        mov    count, #N
    loop
                        mov    0, temp
                        add    loop, d0
                        add    temp, d0
                        djnz   count, #loop
    prepare_ret
                        ret
    instr
                        mov    addr, INA
    d0
                        long   1 << 9
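The prepare routine above stamps out copies of one template instruction, bumping the 9-bit destination field (bits 17..9) by one each time. A rough Python model of that idea, assuming the template is just a 32-bit integer; the opcode and source bits used here are placeholders, not a verified encoding:

```python
D_SHIFT = 9                   # destination field starts at bit 9
D_MASK = 0x1FF << D_SHIFT     # 9-bit destination field
D0 = 1 << D_SHIFT             # same constant as the d0 long in the post

def generate_unrolled(template: int, count: int) -> list[int]:
    """Return `count` copies of `template` with consecutive destinations."""
    out = []
    temp = template
    for _ in range(count):
        out.append(temp)      # models: mov 0-0, temp
        temp += D0            # models: add temp, d0
    return out

def dest(instr: int) -> int:
    """Extract the destination field of a generated instruction."""
    return (instr & D_MASK) >> D_SHIFT

# Hypothetical template: placeholder opcode/source bits, destination 0x040.
template = (0b101000 << 26) | (0x040 << D_SHIFT) | 0x1F2
code = generate_unrolled(template, 4)
print([hex(dest(i)) for i in code])
```

The model does not guard against the destination field overflowing into the neighbouring bits; the real routine would have to keep N small enough for that not to happen.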
    
    
  • mynet43mynet43 Posts: 644
    edited 2009-05-06 15:34
    Hi Hanno,

    I'm trying to do something similar, to read from SRAM.

    The problem with your example seems to be that INA in assembly is not the same as in spin.

    You can't specify the pins in assembly, so you're reading all 32 pins every time.

    So, unless you want to store a 32 bit long for every byte you input, you'll have to shift/mask the input to get the byte to store.

    I may be showing my ignorance (nothing new), but this is what I'm doing.

    Jim
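The shift/mask step described above can be sketched in Python, treating INA as a 32-bit snapshot of all pins. The function name and the `base_pin` parameter are made up for illustration:

```python
def bus_byte(ina: int, base_pin: int = 0) -> int:
    """Extract the 8-bit bus value from a 32-bit INA snapshot."""
    return (ina >> base_pin) & 0xFF

snapshot = 0xDEAD_BEA5                 # made-up state of all 32 pins
print(hex(bus_byte(snapshot)))         # bus wired to P0..P7
print(hex(bus_byte(snapshot, 8)))      # bus wired to P8..P15
```

With the bus on P0..P7 only the mask is needed, which is one reason later posts in the thread insist on that pin placement.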
  • Linus AkessonLinus Akesson Posts: 22
    edited 2009-05-06 15:39
    You could even improve the unrolled loop, actually. By executing that code in four different cogs, each of them running one cycle later than its predecessor, you'd be able to sample at 80 MHz (times 32 bits).
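To picture the four-cog interleave: if each cog's unrolled `mov` samples INA once every 4 clocks and the cogs start one clock apart, the union of their sample points covers every clock. A small sketch, assuming 4 clocks per instruction and ignoring how the cogs would actually be synchronised:

```python
CLOCKS_PER_INSTRUCTION = 4   # assumed cost of each unrolled mov

def sample_clocks(cog_offset: int, samples: int) -> list[int]:
    """Clock numbers at which one cog's unrolled movs read INA."""
    return [cog_offset + n * CLOCKS_PER_INSTRUCTION for n in range(samples)]

# Four cogs, started 0, 1, 2 and 3 clocks late, five samples each.
schedule = sorted(c for cog in range(4) for c in sample_clocks(cog, 5))
print(schedule)
```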
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 16:29
    @Hanno, Nice COG loop.

    I've worked through some of these details in hardware ideas and have Verilog that can be used in a CPLD.

    Obviously as you mention a synchronous access device is necessary. As I see it, one needs 2 pins in addition to the 8 for byte data. One pin would be for the clock which would be produced by the CTRA on demand. The other pin would be a start bit. The data access would be via a packet as described below.

    Synchronous memory transfer packet format:
    
    # S BYTE
    1 1 WLLLLLLL
    2 0 AAAAAAAA
    3 0 AAAAAAAA
    4 0 AAAAAAAA
    5 0 XXXXXXXX
    6 0 DDDDDDDD
    7 0 DDDDDDDD
    8 0 DDDDDDDD
    M 0 DDDDDDDD
    
    Legend:
    # - Packet Byte
    S - Start bit state
    W - Write Bit:  Write if high, Read if low
    L - Length Bit: Transaction up to N bytes
    A - Address Bit: Target address
    X - Turn Around: Need to turn the BUS to input
    D - Data Bits
    M - Packet Length: N data + 5 setup
    
    


    Timing all depends on how data is stored by Propeller. The packet could be smaller for smaller length and address.
    The turnaround byte can be skipped in a write packet. Obviously the more data that is transferred, the higher the burst rate.
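The packet layout above could be assembled roughly like this. This is only a sketch: the address byte order (most significant first), the turnaround byte value, and the idea that the L field carries the byte count directly are assumptions, and `build_packet` is a hypothetical helper, not part of any existing code:

```python
def build_packet(write: bool, address: int, data: bytes = b"", length: int = 0):
    """Return the (start_bit, byte) pairs the Propeller would drive."""
    n = len(data) if write else length
    if not 0 < n <= 0x7F:
        raise ValueError("length must fit the 7-bit L field")
    if not 0 <= address < 1 << 24:
        raise ValueError("address must fit in three bytes")
    packet = [(1, (int(write) << 7) | n)]                    # S=1: W bit + length
    packet += [(0, b) for b in address.to_bytes(3, "big")]   # A bytes (order assumed)
    if write:
        packet += [(0, b) for b in data]                     # D bytes, no turnaround
    else:
        packet.append((0, 0x00))                             # X: bus turnaround
    return packet

for start, value in build_packet(True, 0x012345, b"\xAA\x55"):
    print(start, f"{value:08b}")
```

For a read, the data bytes that follow the turnaround are driven by the memory side, so they are not part of what this helper returns.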

    I've also thought a little about a more effective asynchronous transfer, but that is less attractive.

    @Linus,
    It's great to see you participate here. I would like to see your fully developed write/read routines and pinout for COG and/or HUB memory transactions. Not being pushy of course :)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve


    Propalyzer: Propeller PC Logic Analyzer
    http://forums.parallax.com/showthread.php?p=788230

    Post Edited (jazzed) : 5/6/2009 4:34:29 PM GMT
  • Linus AkessonLinus Akesson Posts: 22
    edited 2009-05-06 16:44
    jazzed said...
    @Linus,
    It's great to see you participate here. I would like to see your fully developed write/read routines and pinout for COG and/or HUB memory transactions. Not being pushy of course :)

    Ah, but that depends on the use case. Perhaps you don't need to store the data in HUB memory. Maybe you can only access the external RAM one block (cache line) at a time. To change a single byte you'd have to read the entire block, then modify one of the registers, and finally write everything back using a similar unrolled loop. Basically, the four cogs would perform the role of a traditional cache.

    If I had any fully developed write/read routines I'd post them. =)
  • hippyhippy Posts: 1,981
    edited 2009-05-06 16:56
    An un-rolled loop ( mov plus space for storage ) would allow a block of 128 bytes to be read. You could go to 256 bytes per block if you overwrote the instructions ( mov $,INA ) but that would mean reconstructing the mov's every block.

    If the unrolled loop packs the bytes into words or longs, that reduces the block size and decreases overall throughput; a rolled-up loop decreases throughput further.

    Ultimately real throughput depends on what you are going to do with the data when you have it. If you need to put it to Hub that's another overhead, though maybe a 'mov acc,INA / wrbyte acc,hubPtr' loop could hit the sweet spot, making that quite efficient. That may well be the case using multiple Cogs, because you've got to get the data in multiple Cogs to one place to use it.

    If this data is to be interpreted LMM or similar, a 128 byte block becomes 32 PASM/LMM instructions, which isn't a lot, and how much overhead does it add in practice when code fetches alternate between blocks? When I did block overlays in LMM for a VM I found it was generally faster to fetch and execute from hub on demand than to load a block and execute.

    The only case I can see where speed of memory access would help is being able to keep a Hub cache filled from one Cog while an LMM interpreter runs in another, but even there having to synchronise the two will likely wipe out any gains.

    We don't have DMA to Hub or Cog, Cogs can only access Hub round robin, there are no efficient inter-Cog links ( unless sacrificing I/O ), and every extra PASM instruction needed eats into throughput, so it's a case of banging a square peg into a round hole. The search is for the best square peg.

    I think the bottom line is that ultra high speed memory transfer is seen as a Holy Grail when there are other bottlenecks which ultimately cap throughput.
  • mikedivmikediv Posts: 825
    edited 2009-05-06 17:28
    Giemme, I don't think I have seen that computer before. Can I ask where to find details? Also, is there any software for it? I recognize the Prop chip, but what are the other large chips?
    Thanks
  • Mike HuseltonMike Huselton Posts: 746
    edited 2009-05-06 19:14
    Please forgive me if I am wrong, but isn't this the same solution Andre came up with encapsulated in his HYDRA Xtreme 512K SRAM Card?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    JMH
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 19:27
    James Michael Huselton said...
    Please forgive me if I am wrong, but isn't this the same solution Andre came up with encapsulated in his HYDRA Xtreme 512K SRAM Card?

    Andre' uses 11 pins instead of 10, among other things. His Hydra data bits are badly placed, meaning you have to add an extra instruction per byte for shifting, but that can be cured with a floppy drive cable. I wanted to prototype on the HX512, especially since I have one, but am not willing to pay for a Lattice license.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve


  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 19:30
    I already mentioned it in other threads ... My idea is to use the video generator to shift the address out into a CPLD. I already tried that out with an 8-bit shift register I had at hand. As the video generator clock can be 128MHz, you can shift out roughly 6 bits per cycle. The pins used for that will be the data bus pins. Once the address is shifted in, the CPLD works as a counter driven by a clock signal generated by a counter. Why would you need to transfer a number of bytes? The CPLD increments the address any time you send a clock pulse. This way you could even do read-modify-write.

    11 pins is all that we need.
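One way to read the "roughly 6 bits per cycle" estimate above is as bits shifted out per instruction time. A quick check of that reading, assuming an 80 MHz system clock and 4 clocks per instruction (both are assumptions; only the 128 MHz figure comes from the post):

```python
CPU_CLOCK_HZ = 80_000_000      # assumed system clock
SHIFT_CLOCK_HZ = 128_000_000   # video generator clock quoted in the post
CLOCKS_PER_INSTRUCTION = 4     # assumed

bits_per_clock = SHIFT_CLOCK_HZ / CPU_CLOCK_HZ
bits_per_instruction = bits_per_clock * CLOCKS_PER_INSTRUCTION
print(bits_per_clock, bits_per_instruction)
```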
  • Mike HuseltonMike Huselton Posts: 746
    edited 2009-05-06 19:30
    mikediv, the thread you are looking for is : http://forums.parallax.com/showthread.php?p=771392

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    JMH
  • AribaAriba Posts: 2,690
    edited 2009-05-06 19:35
    Hanno said...
    ..Something like:
      read     mov addr,ina 
               add read,incdest
               djnz cntr,#read
    
    


    This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve?

    Yes, you can pack 3 bytes in one long with the MOVS, MOVD, MOVI instructions. The Databus must be at PA0..PA7 (or PA1..PA8).
    This has the additional advantage, that you can fill up to ~ 1400 bytes into the cog memory:
    loop  movs t1,ina      '3 bytes per long = up to ~1400 bytes in cog
          movd t1,ina
          movi t1,ina
    :wrt  mov  0-0,t1
          add  :wrt,d_inc
          djnz cntr,#loop  '24 cycles for 3 bytes = 10MByte/sec
    
    


    This loop reads on the first 3 instructions and then pauses for 3 instructions, which is not ideal for a bus interface. So it's better
    to reorder the instructions to:
    loop  movi t1,ina
    :wrt  mov  0-0,t1
          movs t1,ina
          add  :wrt,d_inc
          movd t1,ina
          djnz cntr,#loop
    
    


    Now a read happens every 2 instructions, at a constant 10 MHz rate. But the first long holds only 1 byte:
    long0 [byte0            ]
    long1 [byte3 byte2 byte1]
    long2 [byte6 byte5 byte4]
    ...

    Whether this is useful depends on what you will do with the data bytes. The cog can now copy the data bytes to the HubRAM,
    and another cog (also on another Propeller) can use the bus to receive a burst of 1400 bytes. Or the cog can rearrange the bytes into full instructions (4 bytes in 1 long) and then execute the loaded data as a cog overlay (like on this other product :)

    Andy
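To make the 3-bytes-per-long layout above concrete, here is a small Python model of the three field-write instructions, each of which replaces a 9-bit field of the destination (bits 8..0, 17..9 and 31..23). It assumes the bus sits on PA0..PA7 with PA8 low; the helper names simply mirror the mnemonics:

```python
MASK32 = 0xFFFF_FFFF

def movs(reg, src):
    return (reg & ~0x1FF & MASK32) | (src & 0x1FF)

def movd(reg, src):
    return (reg & ~(0x1FF << 9) & MASK32) | ((src & 0x1FF) << 9)

def movi(reg, src):
    return (reg & ~(0x1FF << 23) & MASK32) | ((src & 0x1FF) << 23)

def pack3(b0, b1, b2):
    """Pack three bus reads the way the first loop does."""
    t1 = 0
    t1 = movs(t1, b0)
    t1 = movd(t1, b1)
    t1 = movi(t1, b2)
    return t1

def unpack3(t1):
    """Recover the three bytes from their 9-bit-aligned fields."""
    return t1 & 0xFF, (t1 >> 9) & 0xFF, (t1 >> 23) & 0xFF

word = pack3(0x11, 0x22, 0x33)
print(hex(word), [hex(b) for b in unpack3(word)])
```

The bytes land on 9-bit boundaries rather than byte boundaries, which is why the thread goes on to discuss how to shift them into normal alignment.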
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 19:43
    @MagIO,

    I've looked at your proposal, and it has merit. It would enable using a smaller, cheaper CPLD or just some 74LVxxx counters. Setting up the address bits though would take at least 3 instructions beyond the overhead of getting the address. Is there anything that can be done while the bits are being shifted before grabbing or writing data?

    BTW, I don't think it's possible to use 9 pins for this with an 8 bit bus unless the Propeller XO clock can be used some way.


    @Andy,

    That's pretty cool, but you have to shift the bytes around (not good) and the clocking in the CPLD has to adapt to the code (not too bad as long as it's consistent). Can you think of a good, fast way to pack the bytes into normal alignment?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve



    Post Edited (jazzed) : 5/6/2009 7:56:00 PM GMT
  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 19:49
    The given loop reads bytes into longs
    I'd suggest to use input pins 0-7 for the data-bus. Then you can do:
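      read1    movs addr,ina        'byte 0 -> bits 8..0 of the long
               add  read1,incdest   'advance read1's destination
               nop
      read2    movd addr,ina        'byte 1 -> bits 17..9
               add  read2,incdest
               nop
      read3    movi addr,ina        'byte 2 -> bits 31..23
               add  read3,incdest
               djnz cntr,#read1     'loop back to the first read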
    
    

    This way 3 bytes of one long can be used after a little adjustment.
    Maybe it's worth thinking about the rev instruction as well. The nops could then be used to do the adjustment straight away.
    Of course rev would swap little endian and big endian, and the RAM would hold data in mixed modes, but if you store the data the same way it's not a problem, only funny ;o)
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 20:00
    hmmm.
    movi val, ina
    shr   val, #8
    movi val, ina
    shr   val, #8
    movi val, ina
    shr   val, #8
    movi val, ina
    mov addr, val
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve


  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 20:02
    I currently plan to use a 72 IO CPLD (because I already have it here ;o). This would allow a 33 bit shift register and a 32 bit latch which takes over the shifted-in bits and then counts up the address. The benefit of this approach is that all unused address bits can be used as a port expander.
    Maybe some of the free macrocells can be used to adjust CPLD operation to the code that reads the stuff. For example, the CPLD increments the address on 3 clocks in a row and then skips one clock to allow the code to do the djnz. This means the NOPs in the code I gave above can be removed.
  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 20:04
    @jazzed:
    does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 20:11
    Which CPLD ?

    >>> does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.

    Dang. Ok, how 'bout this?
    movi val, ina
    ror   val, #8
    movi val, ina
    ror   val, #8
    movi val, ina
    ror   val, #8 wc
    movi val, ina
    shl   val, #1
    muxc val,#1
    mov addr, val
    
    

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve


  • AribaAriba Posts: 2,690
    edited 2009-05-06 20:18
    Jazzed
    Again, the next movi overwrites the highest bit of the previous byte (because 9 data bits are read).
    You need a 9 bit databus for this kind of solution.

    Andy
  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 20:21
    Xilinx XC9572

    No, does not work either. Now you have additional bits between the 8 bits that we read.

    Getting 3 bytes into the right place with only one additional instruction besides the movi only works with rev. But for the 4th byte we need masking. And as I already mentioned, the memory will then hold big endian mixed with little endian. So it would be essential to have long-aligned reads/writes.
  • AribaAriba Posts: 2,690
    edited 2009-05-06 20:25
    Jazzed
    If you have the data on Bit0..7 and PA8=0, then something like this works:
    loop  movs t1,ina    '8 bit data on PA0..PA7 and PA8=0 (can be /RD)
          shl  t1,#8
          movs t2,ina
          or   t1,t2
          shl  t1,#8
          movs t2,ina
          or   t1,t2
          shl  t1,#8
          movs t2,ina
          or   t1,t2
    :wrt  mov  0-0,t1
          add  :wrt,d_inc
          djnz cntr,#loop  '52 cycles for 1 long = 6.15 MByte/sec
    
    


    With some nops to get a constant rate, you will end up with 5 MByte/sec.

    But at this rate you can also read the burst directly into HubRAM (which I would prefer, to do some LMM overlays):
    loop  rdbyte 0-0,ina
          add loop,d_inc
          djnz  cntr,#loop  '16 cycles for 1 byte = 5 MByte/sec
    
    



    Andy
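The shl/or loop above can be checked with a small Python model. It assumes PA8 is held low so that `movs` picks up a clean 8-bit value, and it starts `t1` with all bits set to show that stale contents are shifted out by the three `shl` instructions:

```python
MASK32 = 0xFFFF_FFFF

def movs(reg, src):
    """Model of movs: replace the low 9 bits of reg with the low 9 bits of src."""
    return (reg & ~0x1FF & MASK32) | (src & 0x1FF)

def read_long(bus_bytes, t1=0xFFFF_FFFF, t2=0):
    """Run one pass of the loop body over four bus reads (PA8 held low)."""
    t1 = movs(t1, bus_bytes[0])
    for b in bus_bytes[1:]:
        t1 = (t1 << 8) & MASK32      # shl t1,#8
        t2 = movs(t2, b)             # movs t2,ina
        t1 |= t2                     # or  t1,t2
    return t1

print(hex(read_long([0x12, 0x34, 0x56, 0x78])))
```

In this model the first byte read ends up in the most significant position of the long. The model also relies on the upper bits of `t2` being zero, as the original loop does.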
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-05-06 20:27
    Overwriting the MSB (shown in lowercase) after the shift is okay. It's just that last byte that takes a little extra finagling:

                  movi      val,ina     'aAAAAAAAA....................... .
                  shr       val,#8      '........aAAAAAAAA............... .
                  movi      val,ina     'bBBBBBBBBAAAAAAAA............... .
                  shr       val,#8      '........bBBBBBBBBAAAAAAAA....... .
                  movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAA....... .
                  shr       val,#7      '.......cCCCCCCCCBBBBBBBBAAAAAAAA .
                  shr       val,#1 wc   '........cCCCCCCCCBBBBBBBBAAAAAAA A
                  movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A
                  rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A
    
    
    



    -Phil
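The nine-instruction sequence above can be replayed in Python to confirm that the stray ninth bit of each `movi` never survives. This is a model only: `movi` is taken to write bits 31..23, `shr ... wc` to put the old bit 0 into carry, and `rcl` to shift carry in at bit 0. Each sample stands for the low 9 bits of INA at the time of the read:

```python
MASK32 = 0xFFFF_FFFF

def movi(reg, src):
    """Model of movi: replace bits 31..23 of reg with the low 9 bits of src."""
    return (reg & ~(0x1FF << 23) & MASK32) | ((src & 0x1FF) << 23)

def pack4(samples, val=0):
    """Apply the nine-instruction sequence to four 9-bit INA samples."""
    a, b, c, d = samples
    val = movi(val, a)
    val >>= 8                                # shr val,#8
    val = movi(val, b)
    val >>= 8                                # shr val,#8
    val = movi(val, c)
    val >>= 7                                # shr val,#7
    carry = val & 1                          # shr val,#1 wc: old bit 0 -> C
    val >>= 1
    val = movi(val, d)
    val = ((val << 1) & MASK32) | carry      # rcl val,#1
    return val

# Bit 8 of every sample is set to show that the extra bit is discarded.
print(hex(pack4([0x1AA, 0x1BB, 0x1CC, 0x1DD])))
```

The first byte read lands in the least significant byte of the result, so the long comes out in normal little-endian order.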
  • jazzedjazzed Posts: 11,803
    edited 2009-05-06 20:35
    @Phil,
    Very nice! :) I knew something was there. Why not just use "rol val, #8 wc" before movi... and rcl?

    @MagIO2,
    Endianness is corrected by the CPLD counter. But now the CPLD is harder for a burst.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    --Steve


  • parts-man73parts-man73 Posts: 830
    edited 2009-05-06 20:36
    Hanno said...
    Sounds great! Once it's working maybe we can get spinstudio to do a board for us?

    I'd be glad to oblige. All the interest lately in large and fast memory.

    I actually had a project in mind that would require large and fast memory. My project was a Sony PSP screen. They can be bought on eBay quite cheaply ($30-40), but the interface is quite complicated, yet simple.

    Each tick of the clock (9 MHz) advances to the next pixel, and each pixel is 24 bits of data: 8 bits per color (RGB). Add to that the applicable front and back porch. Resolution is 480x272. That's quite a bit of data to stream.

    I had thought about feeding the data to the LCD directly from the memory and just providing addresses to the memory chips from the Propeller, or using a counter that increments the address on every clock. But then how to load the RAM with the image to be displayed?

    This thread has me thinking about streaming the data from memory through the Propeller. This would be a dedicated Propeller just for the display, like a serial backpack on a HD44780 LCD, but allowing some pretty high-res graphics with millions of colors.

    This is something that I'm just envisioning in my mind right now. I've got a few important projects in front of it, so don't expect to see a functioning prototype at the California show (Ohio maybe????)

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Brian

    uController.com - home of SpinStudio - the modular Development system for the Propeller

    PropNIC - Add ethernet ability to your Propeller! PropJoy - Plug in a joystick and play some games!

    SD card Adapter - mass storage for the masses Audio/Video adapter add composite video and sound to your Proto Board
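For scale, the PSP panel figures quoted above work out as follows. This is a back-of-the-envelope sketch that ignores the porch and blanking intervals, which the post does not give numbers for:

```python
WIDTH, HEIGHT = 480, 272      # resolution from the post
BYTES_PER_PIXEL = 3           # 24 bits: 8 per colour channel
PIXEL_CLOCK_HZ = 9_000_000    # 9 MHz pixel clock from the post

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
active_rate = PIXEL_CLOCK_HZ * BYTES_PER_PIXEL   # bytes/sec while pixels are clocked

print(frame_bytes, active_rate / 1e6)
```

While pixels are being clocked, the panel consumes data several times faster than the byte-wide burst rates discussed earlier in the thread, which is the difficulty the post is pointing at.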
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2009-05-06 20:41
    jazzed,

    I'm not sure I follow you. One of the gotchas with the carry flag and shifts/rotates is that only the starting bit (31 or 0) gets shifted into carry. IOW, the carry bit doesn't act like a super MSB or LSB when you do a shift or rotate with a wc.

    -Phil
  • AribaAriba Posts: 2,690
    edited 2009-05-06 20:43
    Phil
    Yes, you have convinced me

    So a constant burst rate of 6.6 Mbyte/sec should be possible for Cog code overlays if the databus is at PA0..PA7.

    Andy
  • MagIO2MagIO2 Posts: 2,243
    edited 2009-05-06 20:50
    Thanks Phil!

    So what's missing is the update of the pointer, writing the long to the destination, and the djnz. So we are again at the point where 3 instructions per byte are needed to do the burst read - but now we have longs. Not too bad.
  • Dr. JimDr. Jim Posts: 7
    edited 2009-05-06 20:52
    Found the new thread. Looking good.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Yours in continuing machine intelligence research,

    Dr. Jim
    http://www.machineinteltech.com
    support@machineinteltech.com