Bursting data to/from Cog RAM- 6.6MB/sec using 9 pins?

Hanno · 2009-05-06 14:24

What's the fastest way to access lots of RAM from a Propeller cog using the fewest IO pins?
I spent some time thinking (started by another very lively thread :) ) and had a very rough thought- maybe others can fill in the details.
I would use 8 IO lines to get bytes in/out. I would use burst mode, where I ideally tell some other device (might be discrete logic, or another cog) the start address and the read/write mode- this takes at minimum 1 IO line. Once set up, the cog could burst read/write into cog memory- very fast! Something like:

  read     mov addr,ina 
           add read,incdest
           djnz cntr,#read

This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve? Please run with this and turn it into an affordable product!
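For anyone who wants to check the arithmetic, here is a minimal Python sketch. It assumes the usual 80 MHz system clock and 4 clocks per ordinary cog instruction, neither of which is stated above, and `burst_rate` is just a name made up for illustration:

```python
CLOCK_HZ = 80_000_000      # assumption: Propeller running at 80 MHz
CLOCKS_PER_INSTR = 4       # assumption: ordinary PASM instructions take 4 clocks

def burst_rate(instructions, bytes_per_pass, clock_hz=CLOCK_HZ):
    """Bytes per second for a loop of `instructions` that moves `bytes_per_pass` bytes."""
    clocks = instructions * CLOCKS_PER_INSTR
    return clock_hz * bytes_per_pass / clocks

# mov / add / djnz: 3 instructions = 12 clocks per byte
rate = burst_rate(instructions=3, bytes_per_pass=1)
print(f"{rate / 1e6:.2f} MB/s")
```

The same helper can be reused for any other loop by plugging in its instruction count and the number of bytes it moves per pass.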
Hanno

Giemme · 2009-05-06 14:58

Hi Hanno

An 8-bit data bus was proven with the open-source HIVE computer by drohne23:
http://drohne.piranho.de/pic-hive-board2jpg.htm

I am prototyping it right now

Regards

Gianni

Hanno · 2009-05-06 15:09

Sounds great! Once it's working maybe we can get spinstudio to do a board for us?
Hanno

Linus Akesson · 2009-05-06 15:32

Hanno said...
What's the fastest way to access lots of RAM from a Propeller cog using the fewest IO pins?
I spent some time thinking (started by another very lively thread :) ) and had a very rough thought- maybe others can fill in the details.
I would use 8 IO lines to get bytes in/out. I would use burst mode, where I ideally tell some other device (might be discrete logic, or another cog) the start address and the read/write mode- this takes at minimum 1 IO line. Once set up, the cog could burst read/write into cog memory- very fast! Something like:
  read     mov addr,ina 
           add read,incdest
           djnz cntr,#read
This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve? Please run with this and turn it into an affordable product!
Hanno

You could always unroll the loop, depending on the amount of data. The unrolled loop could even overlap the buffer:

addr
read
                    mov    addr, INA
                    mov    addr + 1, INA
                    mov    addr + 2, INA
                    ...

The unrolled loop would obviously be generated during the preparation phase. Something like:

prepare
                    mov    temp, instr
                    movd   loop, #read
                    mov    count, #N
loop
                    mov    0, temp
                    add    loop, d0
                    add    temp, d0
                    djnz   count, #loop
prepare_ret
                    ret
instr
                    mov    addr, INA
d0
                    long   1 << 9

mynet43 · 2009-05-06 15:34

Hi Hanno,

I'm trying to do something similar, to read from SRAM.

The problem with your example seems to be that INA in assembly is not the same as in spin.

You can't specify the pins in assembly, so you're reading all 32 pins every time.

So, unless you want to store a 32 bit long for every byte you input, you'll have to shift/mask the input to get the byte to store.

I may be showing my ignorance (nothing new), but this is what I'm doing.

Jim

Linus Akesson · 2009-05-06 15:39

You could even improve the unrolled loop, actually. By executing that code in four different cogs, each of them running one cycle later than its predecessor, you'd be able to sample at 80 MHz (times 32 bits).
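As a rough model of that interleaving, the Python sketch below lists the clock ticks at which the pins get sampled. It assumes each unrolled `mov x, INA` samples at the same point within its 4-clock slot and that the cogs really can be started exactly one clock apart; the function names are made up for illustration:

```python
def sample_clocks(cog_offset, n_samples, clocks_per_instr=4):
    """Clock ticks at which one cog's unrolled `mov x, INA` chain samples the pins."""
    return [cog_offset + i * clocks_per_instr for i in range(n_samples)]

def interleaved(n_cogs=4, n_samples=4):
    """Merge the sample ticks of n_cogs cogs, each started one clock after the last."""
    ticks = []
    for cog in range(n_cogs):
        ticks.extend(sample_clocks(cog, n_samples))
    return sorted(ticks)

print(interleaved())   # consecutive clock ticks with no gaps
```

Each cog on its own only samples every fourth clock; it is the one-clock stagger between the four that fills in the gaps.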

jazzed · 2009-05-06 16:29

@Hanno, Nice COG loop.

I've worked through some of these details in hardware ideas and have Verilog that can be used in a CPLD.

Obviously as you mention a synchronous access device is necessary. As I see it, one needs 2 pins in addition to the 8 for byte data. One pin would be for the clock which would be produced by the CTRA on demand. The other pin would be a start bit. The data access would be via a packet as described below.

Synchronous memory transfer packet format:

# S BYTE
1 1 WLLLLLLL
2 0 AAAAAAAA
3 0 AAAAAAAA
4 0 AAAAAAAA
5 0 XXXXXXXX
6 0 DDDDDDDD
7 0 DDDDDDDD
8 0 DDDDDDDD
M 0 DDDDDDDD

Legend:
# - Packet Byte
S - Start bit state
W - Write Bit:  Write if high, Read if low
L - Length Bit: Transaction up to N bytes
A - Address Bit: Target address
X - Turn Around: Need to turn the BUS to input
D - Data Bits
M - Packet Length: N data + 5 setup

Timing all depends on how data is stored by Propeller. The packet could be smaller for smaller length and address.
The turnaround byte can be skipped in a write packet. Obviously the more data that is transferred, the higher the burst rate.
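To make the table concrete, here is a small Python sketch that lays out one packet as (start bit, byte) pairs. Two things are assumptions not stated above: the 24-bit address is sent most significant byte first, and the 7-bit L field carries the byte count directly. `build_packet` is a made-up name:

```python
def build_packet(write, address, length, data=(), turnaround=True):
    """Return a list of (start_bit, byte) pairs following the packet table.

    Assumptions: address is sent MSB first, and the 7-bit L field
    holds the number of data bytes directly.
    """
    if not 0 <= length < 128:
        raise ValueError("length must fit in 7 bits")
    if not 0 <= address < 1 << 24:
        raise ValueError("address must fit in 24 bits")
    header = (0x80 if write else 0x00) | length
    packet = [(1, header)]                       # start bit high on the first byte only
    for shift in (16, 8, 0):
        packet.append((0, (address >> shift) & 0xFF))
    if turnaround:
        packet.append((0, 0x00))                 # X: bus turnaround slot
    packet.extend((0, b & 0xFF) for b in data)   # D: data bytes (write case)
    return packet

print(build_packet(True, 0x012345, 3, data=[0xAA, 0xBB, 0xCC]))
```

For a read, the master would only send the five setup bytes and the device would supply the data bytes after the turnaround.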

I've also thought a little about a more effective asynchronous transfer, but that is less attractive.

@Linus,
It's great to see you participate here. I would like to see your fully developed write/read routines and pinout for COG and/or HUB memory transactions. Not being pushy of course :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

Post Edited (jazzed) : 5/6/2009 4:34:29 PM GMT

Linus Akesson · 2009-05-06 16:44

jazzed said...
@Linus,
It's great to see you participate here. I would like to see your fully developed write/read routines and pinout for COG and/or HUB memory transactions. Not being pushy of course :)

Ah, but that depends on the use case. Perhaps you don't need to store the data in HUB memory. Maybe you can only access the external RAM one block (cache line) at a time. To change a single byte you'd have to read the entire block, then modify one of the registers, and finally write everything back using a similar unrolled loop. Basically, the four cogs would perform the role of a traditional cache.

If I had any fully developed write/read routines I'd post them.

hippy · 2009-05-06 16:56

An un-rolled loop ( mov plus space for storage ) would allow a block of 128 bytes to be read. You could go to 256 bytes per block if you overwrote the instructions ( mov $,INA ) but that would mean reconstructing the mov's every block.

If you have an unrolled loop packing the bytes into words or longs, that reduces the block size and decreases overall throughput; a rolled-up loop decreases throughput further.

Ultimately real throughput depends on what you are going to do with the data when you have it. If you need to put it to Hub that's another overhead, though maybe a 'mov acc,INA / wrbyte acc,hubPtr' loop could hit the sweetspot making that quite efficient. That may well be the case using multiple Cogs, because you've got to get the data in multiple Cogs to one place to use it.

If this data is to be interpreted as LMM or similar, a 128-byte block becomes 32 PASM/LMM instructions, which isn't a lot, and how much overhead does it add in practice when code fetches alternate between blocks? When I did block overlays in LMM for a VM I found it was generally faster to fetch and execute from hub on demand than to load a block and execute.

The only case I can see where speed of memory access would help is being able to keep a Hub cache filled from one Cog while an LMM interpreter runs in another but even there having to synchronise the two will likely wipe out any gains.

We don't have DMA to Hub or Cog, Cogs can only access Hub round-robin, there are no efficient inter-Cog links (unless sacrificing I/O), and every extra PASM instruction needed eats into throughput, so it's a case of banging a square peg into a round hole. The search is for the best square peg.

I think the bottom line is that ultra high speed memory transfer is seen as a Holy Grail when there are other bottlenecks which ultimately cap throughput.

mikediv · 2009-05-06 17:28

Giemme, I don't think I have seen that computer before. Can I ask where to find details? Also, is there any software for it? I recognize the Prop chip, but what are the other large chips?
Thanks

Mike Huselton · 2009-05-06 19:14

Please forgive me if I am wrong, but isn't this the same solution Andre came up with encapsulated in his HYDRA Xtreme 512K SRAM Card?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
JMH

jazzed · 2009-05-06 19:27

James Michael Huselton said...
Please forgive me if I am wrong, but isn't this the same solution Andre came up with encapsulated in his HYDRA Xtreme 512K SRAM Card?

Andre' uses 11 pins instead of 10, among other things. His Hydra data bits are badly placed, meaning you have to add an extra instruction per byte for shifting, but that can be cured with a floppy drive cable. I wanted to prototype on the HX512, especially since I have one, but am not willing to pay for a Lattice license.

MagIO2 · 2009-05-06 19:30

I already mentioned it in other threads ... My idea is to use the video generator to shift the address out into a CPLD. I already tried that out with an 8-bit shift register I had at hand. As the video generator clock can be 128MHz, you can shift out roughly 6 bits per cycle. The pins used for that will be the data bus pins. Once the address is shifted in, the CPLD works as a counter driven by a clock signal generated by a counter. Why would you need to transfer a number of bytes? The CPLD increases the address any time you send a clock pulse. This way you could even do read-modify-write.

11 pins is all that we need.

Mike Huselton · 2009-05-06 19:30

mikediv, the thread you are looking for is : http://forums.parallax.com/showthread.php?p=771392

Ariba · 2009-05-06 19:35

Hanno said...
..Something like:
  read     mov addr,ina 
           add read,incdest
           djnz cntr,#read
This takes 12 cycles per byte read/written, so 6.6MB/sec- once everything has been set up. Can anyone improve?

Yes, you can pack 3 bytes in one long with the MOVS, MOVD, MOVI instructions. The Databus must be at PA0..PA7 (or PA1..PA8).
This has the additional advantage, that you can fill up to ~ 1400 bytes into the cog memory:

loop  movs t1,ina      '3 bytes per long = up to ~1400 bytes in cog
      movd t1,ina
      movi t1,ina
:wrt  mov  0-0,t1
      add  :wrt,d_inc
      djnz cntr,#loop  '24 cycles for 3 bytes = 10MByte/sec

This loop reads on the first 3 instructions and then pauses for 3 instructions, which is not ideal for a bus interface. So it's better
to reorder the instructions to:

loop  movi t1,ina
:wrt  mov  0-0,t1
      movs t1,ina
      add  :wrt,d_inc
      movd t1,ina
      djnz cntr,#loop

Now the reads are all 2 instructions with a constant 10 MHz rate. But the first long holds only 1 byte:
long0 [byte0            ]
long1 [byte3 byte2 byte1]
long2 [byte6 byte5 byte4]
...
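The packing relies on where MOVS, MOVD and MOVI put their 9 bits in the destination long (bits 8..0, 17..9 and 31..23). A small Python model of that, as a sketch only: the helper names mirror the PASM mnemonics but are otherwise made up, and `unpack3` assumes the ninth pin (PA8) read as 0.

```python
MASK9 = 0x1FF
M32 = 0xFFFFFFFF

def movs(dest, src):
    """Insert src[8:0] into bits 8..0 of dest."""
    return (dest & ~MASK9 & M32) | (src & MASK9)

def movd(dest, src):
    """Insert src[8:0] into bits 17..9 of dest."""
    return (dest & ~(MASK9 << 9) & M32) | ((src & MASK9) << 9)

def movi(dest, src):
    """Insert src[8:0] into bits 31..23 of dest."""
    return (dest & ~(MASK9 << 23) & M32) | ((src & MASK9) << 23)

def unpack3(word):
    """Recover the three bytes, assuming the ninth pin (PA8) read as 0."""
    return word & 0xFF, (word >> 9) & 0xFF, (word >> 23) & 0xFF

t1 = 0
t1 = movs(t1, 0x11)   # first byte on the bus
t1 = movd(t1, 0x22)   # second byte
t1 = movi(t1, 0x33)   # third byte
print(hex(t1), [hex(b) for b in unpack3(t1)])
```

Bits 22..18 are never written by any of the three instructions, which is why only 3 bytes fit in each long this way.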

Whether this is useful depends on what you will do with the data bytes. The cog can now copy the data bytes to the HubRAM,
and another cog (also on another Propeller) can use the bus to receive a burst of 1400 bytes. Or the cog can rearrange the bytes to full instructions (4 bytes in 1 long) and then execute the loaded data as a cog overlay (like on this other product :) ).

Andy

jazzed · 2009-05-06 19:43

@MagIO,

I've looked at your proposal, and it has merit. It would enable using a smaller, cheaper CPLD or just some 74LVxxx counters. Setting up the address bits though would take at least 3 instructions beyond the overhead of getting the address. Is there anything that can be done while the bits are being shifted before grabbing or writing data?

BTW, I don't think it's possible to use 9 pins for this with an 8 bit bus unless the Propeller XO clock can be used some way.

@Andy,

That's pretty cool, but you have to shift the bytes around (not good) and the clocking in the CPLD has to adapt to the code (not too bad as long as it's consistent). Can you think of a good, fast way to pack the bytes into normal alignment?

Post Edited (jazzed) : 5/6/2009 7:56:00 PM GMT

MagIO2 · 2009-05-06 19:49

The given loop reads bytes into longs.
I'd suggest using input pins 0-7 for the data bus. Then you can do:

  read1    movs addr,ina 
           add  read1,incdest
           nop
  read2    movd addr,ina
           add  read2,incdest
           nop
  read3    movi addr, ina
           add  read3,incdest
           djnz cntr,#read1

This way 3 bytes of one long can be used after a little adjustment.
Maybe it's worth thinking about the rev instruction as well. The nops could then be used to do the adjustment straight away.
Of course rev would swap little endian and big endian and the RAM would hold data in mixed modes, but if you store the data the same way it's not a problem, only funny ;o)

jazzed · 2009-05-06 20:00

hmmm.

movi val, ina
shr   val, #8
movi val, ina
shr   val, #8
movi val, ina
shr   val, #8
movi val, ina
mov addr, val

MagIO2 · 2009-05-06 20:02

I currently plan to use a 72 IO CPLD (because I already have it here ;o) ). This would allow a 33-bit shift register and a 32-bit latch which takes over the shifted-in bits and then counts up the address. The benefit of this approach is that all unused address bits can be used as a port expander.
Maybe some of the free macrocells can be used to adjust CPLD operation to the code that reads the stuff. For example, the CPLD only increases the address on 3 consecutive clocks and then skips one clock to allow the code to do the djnz. This means the NOPs in the code I gave above can be removed.

MagIO2 · 2009-05-06 20:04

@jazzed:
does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.

jazzed · 2009-05-06 20:11

Which CPLD ?

>>> does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.

Dang. Ok, how 'bout this?

movi val, ina
ror   val, #8
movi val, ina
ror   val, #8
movi val, ina
ror   val, #8 wc
movi val, ina
shl   val, #1
muxc val,#1
mov addr, val

Ariba · 2009-05-06 20:18

Jazzed
Again, the next movi overwrites the highest bit of the previous byte (because 9 data bits are read).
You need a 9 bit databus for this kind of solution.

Andy

MagIO2 · 2009-05-06 20:21

Xilinx XC9572

No, does not work either. Now you have additional bits between the 8 bits that we read.

Getting 3 bytes to the right place with only one additional instruction besides the movi only works with rev. But for the 4th byte we need masking. But as I already mentioned, the memory will then hold big endian mixed with little endian. So it would be essential to have long-aligned read/write.

Ariba · 2009-05-06 20:25

Jazzed
If you have the data on Bit0..7 and PA8=0, then something like this works:

loop  movs t1,ina    '8 bit data on PA0..PA7 and PA8=0 (can be /RD)
      shl  t1,#8
      movs t2,ina
      or   t1,t2
      shl  t1,#8
      movs t2,ina
      or   t1,t2
      shl  t1,#8
      movs t2,ina
      or   t1,t2
:wrt  mov  0-0,t1
      add  :wrt,d_inc
      djnz cntr,#loop  '52 cycles for 1 long = 6.15 MByte/sec

With some nop's to get a constant rate, you will end up with 5 MByte/sec.
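Here is a quick Python model of what ends up in t1 after one pass of the shift/or loop. It is a sketch under two assumptions: PA8 reads as 0 (as stated above) and t2's upper bits start out as zero, since MOVS leaves them untouched. The function names are made up for illustration.

```python
M32 = 0xFFFFFFFF

def movs(dest, src):
    """Insert src[8:0] into bits 8..0 of dest, leaving the other bits alone."""
    return (dest & 0xFFFFFE00) | (src & 0x1FF)

def pack_shift_or(b0, b1, b2, b3):
    """Model one pass: movs, then three rounds of shl #8 / movs t2 / or."""
    t1 = movs(0, b0)              # movs t1,ina  (bit 8 is PA8 = 0)
    t2 = 0                        # assumption: t2 starts with clean upper bits
    for nxt in (b1, b2, b3):
        t1 = (t1 << 8) & M32      # shl t1,#8
        t2 = movs(t2, nxt)        # movs t2,ina
        t1 |= t2                  # or t1,t2
    return t1

print(hex(pack_shift_or(0x11, 0x22, 0x33, 0x44)))
```

Note that in this model the first byte read lands in the most significant byte of the long.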

But at this rate you can also read the burst directly into HubRAM (which I would prefer, to do some LMM overlays):

loop  rdbyte 0-0,ina
      add loop,d_inc
      djnz  cntr,#loop  '16 cycles for 1 byte = 5 MByte/sec

Andy

Phil Pilgrim (PhiPi) · 2009-05-06 20:27

Overwriting the MSB (shown in lowercase) after the shift is okay. It's just that last byte that takes a little extra finagling:

              movi      val,ina     'aAAAAAAAA....................... .
              shr       val,#8      '........aAAAAAAAA............... .
              movi      val,ina     'bBBBBBBBBAAAAAAAA............... .
              shr       val,#8      '........bBBBBBBBBAAAAAAAA....... .
              movi      val,ina     'cCCCCCCCCBBBBBBBBAAAAAAAA....... .
              shr       val,#7      '.......cCCCCCCCCBBBBBBBBAAAAAAAA .
              shr       val,#1 wc   '........cCCCCCCCCBBBBBBBBAAAAAAA A
              movi      val,ina     'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A
              rcl       val,#1      'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A
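
One way to sanity-check the sequence above is to model it in Python. This is only a sketch: `pin8` stands for whatever happens to sit on the ninth pin (the lowercase bit in the comments), and `movi` and `pack4` are illustrative names rather than anything from a library.

```python
M32 = 0xFFFFFFFF

def movi(val, pins):
    """Write pins[8:0] into bits 31..23 of val, as PASM movi does."""
    return (val & 0x007FFFFF) | ((pins & 0x1FF) << 23)

def pack4(a, b, c, d, pin8=1):
    """Model the nine-instruction sequence for four bus bytes a..d."""
    junk = (pin8 & 1) << 8       # ninth pin read along with each byte
    val = movi(0, junk | a)      # movi val,ina
    val >>= 8                    # shr val,#8
    val = movi(val, junk | b)    # movi overwrites the stray ninth bit of a
    val >>= 8
    val = movi(val, junk | c)
    val >>= 7                    # shr val,#7
    carry = val & 1              # shr val,#1 wc: C takes the original bit 0
    val >>= 1
    val = movi(val, junk | d)
    val = ((val << 1) | carry) & M32   # rcl val,#1 brings bit 0 of a back
    return val

print(hex(pack4(0x11, 0x22, 0x33, 0x44)))
```

In this model the result does not depend on `pin8`, and the first byte read ends up in the least significant byte.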

-Phil

jazzed · 2009-05-06 20:35

@Phil,
Very nice! :) I knew something was there. Why not just use "rol val, #8 wc" before movi... and rcl?

@MagIO2,
Endianness is corrected by the CPLD counter. But now the CPLD is harder for a burst.

parts-man73 · 2009-05-06 20:36

Hanno said...
Sounds great! Once it's working maybe we can get spinstudio to do a board for us?

I'd be glad to oblige. All the interest lately in large and fast memory.

I actually had a project in mind that would require large and fast memory. My project was a Sony PSP screen. They can be bought on eBay quite cheaply ($30-40), but the interface is quite complicated, yet simple.

Each tick of the clock (9 MHz clock) increments to the next pixel; each pixel is 24 bits of data, 8 bits per color (RGB). Plus the applicable front and back porch... Resolution is 480x272. That's quite a bit of data to stream.
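
To put rough numbers on that, here is a short Python calculation. The pixel clock, color depth and resolution come from the description above; the comparison figure assumes an 80 MHz Propeller moving one byte every 12 clocks, as in the loop from the opening post, and ignores porch time.

```python
PIXEL_CLOCK_HZ = 9_000_000   # 9 MHz pixel clock
BYTES_PER_PIXEL = 3          # 24 bits: 8 each for R, G, B
WIDTH, HEIGHT = 480, 272

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL     # one full frame of pixel data
stream_rate = PIXEL_CLOCK_HZ * BYTES_PER_PIXEL     # bytes/s while pixels are clocked
burst_rate = 80_000_000 / 12                       # assumption: 80 MHz, 12 clocks per byte

print(frame_bytes, stream_rate / 1e6, round(stream_rate / burst_rate, 2))
```

So a byte-wide burst at that rate falls several times short of what the panel consumes while pixels are being clocked.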

I had thought about feeding the data to the LCD directly from the memory and just provide addresses to the memory chips from the propeller, or use a counter that increments the address on every clock. But then how to load the RAM with the image to be displayed?

This thread has me thinking about streaming the data from memory through the Propeller. This would be a dedicated Propeller just for the display, like a serial backpack on an HD44780 LCD, but allowing some pretty high-res graphics with millions of colors.

This is something that I'm just envisioning in my mind right now. I've got a few important projects in front of it, so don't expect to see a functioning prototype at the California show (Ohio maybe????)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Brian

uController.com - home of SpinStudio - the modular Development system for the Propeller

PropNIC - Add ethernet ability to your Propeller! PropJoy - Plug in a joystick and play some games!

SD card Adapter - mass storage for the masses Audio/Video adapter add composite video and sound to your Proto Board

Phil Pilgrim (PhiPi) · 2009-05-06 20:41

jazzed,

I'm not sure I follow you. One of the gotchas with the carry flag and shifts/rotates is that only the starting bit (31 or 0) gets shifted into carry. IOW, the carry bit doesn't act like a super MSB or LSB when you do a shift or rotate with a wc.
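
A small Python model of that gotcha, as a sketch of the documented ROL behavior (with wc, C takes the value's original bit 31); `rol_wc` is a made-up helper name:

```python
M32 = 0xFFFFFFFF

def rol_wc(val, n):
    """Model PASM `rol val,#n wc`: returns (result, carry)."""
    carry = (val >> 31) & 1                          # C = original bit 31 only
    result = ((val << n) | (val >> (32 - n))) & M32  # plain 32-bit rotate left
    return result, carry

val = 0x01000000          # bit 24 set: it is the LAST bit rotated around by rol #8
result, carry = rol_wc(val, 8)
print(hex(result), carry)
```

Bit 24 wraps around to bit 0 of the result, but the carry stays clear because bit 31 was 0 to begin with.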

-Phil

Ariba · 2009-05-06 20:43

Phil
Yes, you have convinced me

So a constant burst rate of 6.6 Mbyte/sec should be possible for Cog code overlays if the databus is at PA0..PA7.

Andy

MagIO2 · 2009-05-06 20:50

Thanks Phil!

So what's missing is the update of the pointer, writing the long to the destination, and the djnz. So we are again at the point where 3 cycles are needed to do the burst read, but now we have longs. Not too bad.

Dr. Jim · 2009-05-06 20:52

Found the new thread. Looking good.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Yours in continuing machine intelligence research,

Dr. Jim
http://www.machineinteltech.com
support@machineinteltech.com
