Bursting data to/from Cog RAM- 6.6MB/sec using 9 pins?
Hanno
Posts: 1,130
What's the fastest way to access lots of RAM from a Propeller cog using the fewest IO pins?
I spent some time thinking (prompted by another very lively thread :) ) and had a very rough thought; maybe others can fill in the details.
I would use 8 IO lines to get bytes in/out. I would use burst mode, where I ideally tell some other device (might be discrete logic, or another cog) the start address and the read/write mode- this takes at minimum 1 IO line. Once set up, the cog could burst read/write into cog memory- very fast! Something like:
read    mov     addr, ina
        add     read, incdest
        djnz    cntr, #read
This takes 12 clock cycles (three 4-cycle instructions) per byte read/written, so about 6.6 MB/sec at an 80 MHz clock, once everything has been set up. Can anyone improve on this? Please run with it and turn it into an affordable product!
Hanno
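As a quick check of the 12-cycle figure, here is a small Python sketch (a model, not Propeller code). It assumes an 80 MHz system clock and 4 clocks per instruction; `burst_rate` is just a made-up helper name for illustration.

```python
CLOCK_HZ = 80_000_000      # assumed system clock
CYCLES_PER_INSTR = 4       # assumed cost of mov / add / djnz (jump taken)

def burst_rate(instr_per_byte, clock_hz=CLOCK_HZ):
    """Bytes per second for a loop that spends instr_per_byte instructions per byte."""
    return clock_hz // (instr_per_byte * CYCLES_PER_INSTR)

if __name__ == "__main__":
    print(burst_rate(3))   # mov + add + djnz per byte
    print(burst_rate(1))   # one mov per byte, i.e. a fully unrolled loop
```

Under these assumptions the three-instruction loop lands at roughly 6.6 MB/sec, and a fully unrolled run of mov instructions would be the upper bound at one instruction per byte.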
Comments
An 8-bit data bus has already been proven in the open-source HIVE computer by drohne23:
http://drohne.piranho.de/pic-hive-board2jpg.htm
I am prototyping it right now
Regards
Gianni
Hanno
You could always unroll the loop, depending on the amount of data. The unrolled loop could even overlap the buffer:
The unrolled loop would obviously be generated during the preparation phase. Something like:
I'm trying to do something similar, to read from SRAM.
The problem with your example seems to be that INA in assembly is not the same as in spin.
You can't specify the pins in assembly, so you're reading all 32 pins every time.
So, unless you want to store a 32 bit long for every byte you input, you'll have to shift/mask the input to get the byte to store.
I may be showing my ignorance (nothing new), but this is what I'm doing.
Jim
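The shift/mask step Jim describes can be modelled in a few lines of Python (a sketch only, not his routine). `base_pin` is a hypothetical parameter for the lowest pin of the data bus; on the Propeller itself this would be a shr followed by an and.

```python
def byte_from_ina(ina, base_pin=0):
    """Pick the 8 data-bus bits out of a 32-bit INA snapshot.

    base_pin is the lowest pin of the bus (assumed contiguous, 8 pins wide).
    """
    return (ina >> base_pin) & 0xFF

if __name__ == "__main__":
    snapshot = 0xDEADBEEF                        # pretend value of all 32 pins
    print(hex(byte_from_ina(snapshot, 0)))       # bus on P0..P7
    print(hex(byte_from_ina(snapshot, 8)))       # bus on P8..P15
```

With the bus on P0..P7 the shift disappears and only the mask is left, which is one reason later posts prefer those pins.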
I've worked through some of these details as hardware ideas and have Verilog that can be used in a CPLD.
Obviously as you mention a synchronous access device is necessary. As I see it, one needs 2 pins in addition to the 8 for byte data. One pin would be for the clock which would be produced by the CTRA on demand. The other pin would be a start bit. The data access would be via a packet as described below.
Timing all depends on how data is stored by Propeller. The packet could be smaller for smaller length and address.
The turnaround byte can be skipped in a write packet. Obviously the more data that is transferred, the higher the burst rate.
I've also thought a little about a more effective asynchronous transfer, but that is less attractive.
@Linus,
It's great to see you participate here. I would like to see your fully developed write/read routines and pinout for COG and/or HUB memory transactions. Not being pushy of course :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230
Post Edited (jazzed) : 5/6/2009 4:34:29 PM GMT
Ah, but that depends on the use case. Perhaps you don't need to store the data in HUB memory. Maybe you can only access the external RAM one block (cache line) at a time. To change a single byte you'd have to read the entire block, then modify one of the registers, and finally write everything back using a similar unrolled loop. Basically, the four cogs would perform the role of a traditional cache.
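The read-modify-write cycle for a single byte can be sketched in Python as follows. This is only a model: the `BlockCache` class, the `bursts` counter and the 128-byte block size are assumptions for illustration, and a `bytearray` stands in for the external RAM.

```python
BLOCK_SIZE = 128  # bytes per block (cache line); assumed for illustration

class BlockCache:
    """One-block cache in front of a bytearray that stands in for external RAM."""

    def __init__(self, ram):
        self.ram = ram
        self.block_no = None   # which block is currently held
        self.block = None      # local copy of that block
        self.bursts = 0        # number of whole-block transfers so far

    def _load(self, block_no):
        start = block_no * BLOCK_SIZE
        self.block = bytearray(self.ram[start:start + BLOCK_SIZE])  # burst read
        self.block_no = block_no
        self.bursts += 1

    def write_byte(self, addr, value):
        block_no, offset = divmod(addr, BLOCK_SIZE)
        if block_no != self.block_no:
            self._load(block_no)                 # read the entire block
        self.block[offset] = value & 0xFF        # modify one entry
        start = block_no * BLOCK_SIZE
        self.ram[start:start + BLOCK_SIZE] = self.block  # burst write back
        self.bursts += 1

if __name__ == "__main__":
    ram = bytearray(512)
    cache = BlockCache(ram)
    cache.write_byte(130, 0x5A)
    print(ram[130], cache.bursts)
```

Changing one byte in a block that is not yet held costs two full block transfers in this model, which is the overhead the following posts weigh against raw burst speed.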
If I had any fully developed write/read routines I'd post them.
If you have an unrolled loop packing the bytes into words or longs, that reduces the block size and decreases overall throughput; a rolled-up loop decreases throughput further.
Ultimately real throughput depends on what you are going to do with the data when you have it. If you need to put it to Hub that's another overhead, though maybe a 'mov acc,INA / wrbyte acc,hubPtr' loop could hit the sweet spot, making that quite efficient. That may well be the case using multiple Cogs, because you've got to get the data in multiple Cogs to one place to use it.
If this data is to be interpreted as LMM or similar, a 128 byte block becomes 32 PASM/LMM instructions, which isn't a lot, and how much overhead does it add in practice when code fetches alternate between blocks? When I did block overlays in LMM for a VM I found it was generally faster to fetch and execute from hub on demand than to load a block and execute.
The only case I can see where speed of memory access would help is being able to keep a Hub cache filled from one Cog while an LMM interpreter runs in another but even there having to synchronise the two will likely wipe out any gains.
We don't have DMA to Hub or Cog, Cogs can only access Hub round robin, there are no efficient inter-Cog links (unless sacrificing I/O), and every extra PASM instruction needed eats into throughput, so it's a case of banging a square peg into a round hole. The search is for the best square peg.
I think the bottom line is that ultra high speed memory transfer is seen as a Holy Grail when there are other bottlenecks which ultimately cap throughput.
Thanks
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
JMH
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
11 pins is all that we need.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
JMH
Yes, you can pack 3 bytes in one long with the MOVS, MOVD, MOVI instructions. The data bus must be at PA0..PA7 (or PA1..PA8).
This has the additional advantage that you can fit up to ~1400 bytes into the cog memory:
This loop reads on the first 3 instructions and then pauses for 3 instructions, which is not ideal for a bus interface. So it's better to reorder the instructions to:
Now the reads are all 2 instructions apart, giving a constant 10 MHz rate. But the first long holds only 1 byte:
long0 [byte0            ]
long1 [byte3 byte2 byte1]
long2 [byte6 byte5 byte4]
...
Whether this is useful depends on what you will do with the data bytes. The cog can now copy the data bytes to hub RAM, and another cog (also on another Propeller) can use the bus to receive a burst of 1400 bytes. Or the cog can rearrange the bytes into full instructions (4 bytes in 1 long) and then execute the loaded data as a cog overlay (like on this other product :)
Andy
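Andy's listings are not reproduced in this thread, so as a rough stand-in here is a Python model of the three-bytes-per-long idea (not Propeller code). It relies on the 9-bit fields that MOVS, MOVD and MOVI write (bits 0..8, 9..17 and 23..31); which byte goes to which field, and the names `pack3` / `unpack3`, are assumptions for illustration. The bus is assumed to sit on P0..P7 with P8 held low.

```python
MASK32 = 0xFFFFFFFF

def movs(dest, src):
    """Model of MOVS: low 9 bits of src go to bits 0..8 of dest."""
    return (dest & ~0x1FF & MASK32) | (src & 0x1FF)

def movd(dest, src):
    """Model of MOVD: low 9 bits of src go to bits 9..17 of dest."""
    return (dest & ~(0x1FF << 9) & MASK32) | ((src & 0x1FF) << 9)

def movi(dest, src):
    """Model of MOVI: low 9 bits of src go to bits 23..31 of dest."""
    return (dest & ~(0x1FF << 23) & MASK32) | ((src & 0x1FF) << 23)

def pack3(b1, b2, b3):
    """Pack three bus reads into one long (assumed order: movs, movd, movi)."""
    long_ = movs(0, b1)
    long_ = movd(long_, b2)
    return movi(long_, b3)

def unpack3(long_):
    """Recover the three bytes; each one sits at the bottom of its 9-bit field."""
    return long_ & 0xFF, (long_ >> 9) & 0xFF, (long_ >> 23) & 0xFF

if __name__ == "__main__":
    packed = pack3(0x11, 0x22, 0x33)
    print(hex(packed), [hex(b) for b in unpack3(packed)])
```

The bytes are not on normal byte boundaries, which is why a "little adjustment" is needed afterwards. At three bytes per long, a burst of about 1400 bytes needs roughly 467 longs of cog RAM.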
I've looked at your proposal, and it has merit. It would enable using a smaller, cheaper CPLD or just some 74LVxxx counters. Setting up the address bits though would take at least 3 instructions beyond the overhead of getting the address. Is there anything that can be done while the bits are being shifted before grabbing or writing data?
BTW, I don't think it's possible to use 9 pins for this with an 8 bit bus unless the Propeller XO clock can be used some way.
@Andy,
That's pretty cool, but you have to shift the bytes around (not good) and the clocking in the CPLD has to adapt to the code (not too bad as long as it's consistent). Can you think of a good, fast way to pack the bytes into normal alignment?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Post Edited (jazzed) : 5/6/2009 7:56:00 PM GMT
I'd suggest to use input pins 0-7 for the data-bus. Then you can do:
This way 3 bytes of one long can be used after a little adjustment.
It may be worth thinking about the rev instruction as well. The nops could then be used to do the adjustment straight away.
Of course rev would change little endian and big endian and the RAM would hold data in mixed modes, but if you store the data the same way it's not a problem, only funny ;o)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Maybe some of the free macrocells can be used to adjust CPLD operation to the code that reads the stuff. For example, the CPLD only increments the address on 3 consecutive clocks and then skips one clock to allow the code to do the djnz. This means the NOPs in the code I gave above can be removed.
does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.
>>> does not work ... movi writes 9 bits, so you'd overwrite one of the shifted bits.
Dang. Ok, how 'bout this?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
Again, the next movi overwrites the highest bit of the previous byte (because 9 data bits are read).
You need a 9 bit databus for this kind of solution.
Andy
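The listing under discussion is not in the thread, so the following Python model only illustrates the general problem Andy points out, under an assumed scenario: a previous byte already sits on a normal byte boundary (bits 16..23), and the bus is on P1..P8 so that MOVI drops the new byte into bits 24..31. Because MOVI always writes 9 bits (23..31), bit 23 of the old byte is replaced by whatever is on P0.

```python
def movi(dest, src):
    """Model of MOVI: low 9 bits of src replace bits 23..31 of dest."""
    return (dest & 0x007FFFFF) | ((src & 0x1FF) << 23)

def demo():
    prev = 0x00FF0000          # previous byte 0xFF in bits 16..23 (assumed layout)
    pins = (0xAA << 1) | 0     # new byte 0xAA on P1..P8, P0 reads as 0
    after = movi(prev, pins)
    old_byte = (after >> 16) & 0xFF
    new_byte = (after >> 24) & 0xFF
    return after, old_byte, new_byte

if __name__ == "__main__":
    after, old_byte, new_byte = demo()
    print(hex(after), hex(old_byte), hex(new_byte))
```

In this model the new byte arrives intact but the old 0xFF comes back as 0x7F, which is the lost top bit. A ninth data line would fill that bit with real data instead of losing it.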
No, does not work either. Now you have additional bits between the 8 bits that we read.
Getting 3 bytes to the right place with only one additional instruction besides the movi only works with rev. But for the 4th byte we need masking. And as I already mentioned, the memory will then hold big-endian mixed with little-endian. So it would be essential to have long-aligned reads/writes.
If you have the data on Bit0..7 and PA8=0, then something like this works:
With some nops to get a constant rate, you will end up with 5 MByte/sec.
But at this rate you can also read the burst directly into hub RAM (which I would prefer, to do some LMM overlays):
Andy
-Phil
Very nice! :) I knew something was there. Why not just use "rol val, #8 wc" before movi... and rcl?
@MagIO2,
Endianness is corrected by the CPLD counter. But now the CPLD is harder for a burst.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve
I'd be glad to oblige. All the interest lately in large and fast memory.
I actually had a project in mind that would require large and fast memory. My project was a Sony PSP screen. They can be bought on Ebay quite cheaply ($30-40) but the interface is quite complicated, yet simple.
Each tick of the clock (9 MHz clock) increments to the next pixel, each pixel is 24 bits of data - 8 bits per color RGB. Plus the applicable front and back porch... Resolution is 480x272. That's quite a bit of data to stream.
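To put numbers on "quite a bit of data", here is a short Python calculation using only the figures above (9 MHz pixel clock, 24 bits per pixel, 480x272). The function names are made up for illustration, and porch timing is ignored.

```python
PIXEL_CLOCK_HZ = 9_000_000   # one pixel per clock tick
BYTES_PER_PIXEL = 3          # 24 bits: 8 bits each for R, G, B
WIDTH, HEIGHT = 480, 272

def stream_rate():
    """Bytes per second needed while pixels are being clocked out."""
    return PIXEL_CLOCK_HZ * BYTES_PER_PIXEL

def frame_bytes():
    """Bytes needed to hold one full frame."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL

if __name__ == "__main__":
    print(stream_rate(), frame_bytes())
```

That works out to 27 MB/sec during active pixels and about 390 KB per frame, so an 8-bit bus at around 6.6 MB/sec would fall well short of feeding the panel at full color depth.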
I had thought about feeding the data to the LCD directly from the memory and just provide addresses to the memory chips from the propeller, or use a counter that increments the address on every clock. But then how to load the RAM with the image to be displayed?
This thread has me thinking about streaming the data from memory through the Propeller. This would be a dedicated Propeller just for the display, like a serial backpack on a HD44780 LCD, but it would allow some pretty high-res graphics with millions of colors.
This is something that I'm just envisioning in my mind right now. I've got a few important projects in front of it, so don't expect to see a functioning prototype at the California show (Ohio maybe???? )
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Brian
uController.com - home of SpinStudio - the modular Development system for the Propeller
PropNIC - Add ethernet ability to your Propeller! PropJoy - Plug in a joystick and play some games!
SD card Adapter - mass storage for the masses Audio/Video adapter add composite video and sound to your Proto Board
I'm not sure I follow you. One of the gotchas with the carry flag and shifts/rotates is that only the starting bit (31 or 0) gets shifted into carry. IOW, the carry bit doesn't act like a super MSB or LSB when you do a shift or rotate with a wc.
-Phil
Yes, you have convinced me
So a constant burst rate of 6.6 Mbyte/sec should be possible for Cog code overlays if the databus is at PA0..PA7.
Andy
So what's missing is the update of the pointer, writing the long to the destination, and the djnz. So we are again at the point where 3 cycles are needed to do the burst read - but now we have longs. Not too bad.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Yours in continuing machine intelligence research,
Dr. Jim
http://www.machineinteltech.com
support@machineinteltech.com