Bursting data to/from Cog RAM- 6.6MB/sec using 9 pins?

MagIO2 · 2009-05-08 08:09

Nitpicking is easy, isn't it?

Guess you want to write ina to hubbuf. But as far as I know ina should not be a destination because then it would use the COG-RAM content instead of the port register.

Cluso99 · 2009-05-08 08:36

Oops... typo thanks. (or as my teachers used to say - just checking you are awake)

@MagIO2: No, you cannot use ina as a destination - you will get the shadow RAM.

@virtuPIC: Why would I want to put it into hub? It would be much quicker to assemble a long in the cog and do one write to hub. D0..7 = P0..7 is the way I did it in the TriBlade, but it will be quite a bit quicker to move D0..7 to P24..P31. The same applies to writes.

@jazzed: The eeprom is not connected this way - you will just have to wait

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBladeProp, SixBladeProp, website (Multiple propeller pcbs)
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: Micros eg Altair, and Terminals eg VT100 (Index)
· Search the Propeller forums (via Google)
My cruising website is: www.bluemagic.biz   MultiBladeProp is: www.bluemagic.biz/cluso.htm

virtuPIC · 2009-05-08 11:34

Cluso, you can put it into hub temporarily for assembling words from bytes. Write the bytes into a temporary long in hub memory (wrbyte) and read the complete value (rdlong). The hub instructions need more time but you can intersperse other code.

Well, you are right, too. After detailed calculation I found that your cog-only code is still faster than my hub-too code.
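The hub-assembly idea above can be modelled in a few lines of Python. This is only a sketch: the `bytearray` stands in for a temporary hub long, hub memory is assumed to be little-endian, and `wrbyte` and `rdlong` are illustrative helpers named after the PASM instructions, not a real API.

```python
# Model of assembling a long from four bytes through a temporary hub long.
# Assumption: hub memory is little-endian; the bytearray stands in for hub RAM.
hub = bytearray(4)

def wrbyte(value, addr):
    """Store the low 8 bits of value at hub[addr] (models PASM wrbyte)."""
    hub[addr] = value & 0xFF

def rdlong(addr):
    """Read a 32-bit little-endian long starting at hub[addr] (models PASM rdlong)."""
    return int.from_bytes(hub[addr:addr + 4], "little")

# Four bytes arrive one at a time and land at consecutive hub addresses.
for offset, byte in enumerate([0xAA, 0xBB, 0xCC, 0xDD]):
    wrbyte(byte, offset)

assembled = rdlong(0)
print(hex(assembled))
```

The model only shows the data path; it says nothing about timing, which is where the cog-only version wins.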

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Airspace V - international hangar flying!
www.airspace-v.com/ggadgets for tools & toys

jazzed · 2009-05-08 14:35

@Cluso99,
Ok, good marketing to draw suspense.

@virtuPIC,
It might be good to post a working version of your code. I'm quite confused by what you posted.

@MagIO2,
I missed your edited comments before. A WE jumper sounds appropriate for a generic board. Also, I'm convinced that 32 macrocells is not enough. Of course to do what hippy mentioned would take closer to 128 macrocells. I've tried putting a PLCC84 on my 3x4 board, but it's just too big. Using a TQFP-100 would be best I guess since there is a smaller macrocell limit on 44 pin devices. BTW, to use Propeller XO to drive the CPLD, one needs a 10MHz crystal ... never thought we would have that problem :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
--Steve

Propalyzer: Propeller PC Logic Analyzer
http://forums.parallax.com/showthread.php?p=788230

MagIO2 · 2009-05-08 19:41

@jazzed:
I doubt that we need a 128 macrocell CPLD. In the project where I learned CPLD programming I did something very similar. It was a CPLD to support a microcontroller in driving 6 seven-segment displays for displaying the time. It could be loaded via 7 pins plus digit address - this is similar to loading the latches. And then it counted upwards with a given clock signal - but it did not count binary, it directly counted in 7-segment output mode to drive the LEDs directly. And it counted the time correctly, with overflow from seconds to ten seconds to minutes and so on. So it was more complex than the simple binary counter that we need for counting the address. And the 72 macrocell CPLD was not fully used for that!

jazzed · 2009-05-08 20:06

@MagIO2, did you look at hippy's proposal? What does it take to accommodate all the modes he mentioned?

MagIO2 · 2009-05-08 21:06

I have something else in mind ... similar, but not the same.

As already mentioned, I don't see a reason for giving a length of something: neither the length of the address I want to send nor the length of the burst. You have to signal that you want to send an address anyway. So the CPLD simply takes up to 4 bytes as address. When you send only 1 - fine, when you send 2 - fine ... The fifth could maybe be some kind of mode byte. Don't know yet. This has the advantage that you don't have to send it. I suppose that the kind of RAM access won't change so rapidly in one application. But that's only a detail which can be changed if practice teaches something different.

When you change back to "read memory" mode you are also free to send as many clocks as you want. With each clock the CPLD increments the address. If you don't need burst mode, then you simply don't send clocks.

For the clock I see two possibilities. Either the code is symmetrical, which means getting each byte takes the same time even if you are in burst mode. Then a counter can be used to generate the clock. Or the code is asymmetrical to allow max performance when fetching one long, but then needs some time to do a djnz for the next long. This clock signal hopefully can be generated by the video generator. The VG can create any clock pattern which is 32 bits wide. So, for example, it generates 010101010000 where each 1 is as long as 2 instruction cycles ... have to do some tests with that idea.

Maybe we can have two independent address registers? That would be handy as well. One COG is reading for TV output, the other for data processing?

jazzed · 2009-05-08 21:34

This sounds pretty sweet MagIO2 :)

So without a mode byte, the most typical performance demanding transfer could be done to save a byte ... perhaps replacing the wasted turn-around byte. Excellent.

Thinking out loud here ... the start bit could be used to specify the entirety of the header or command phase. In the case of a read, since we have to change to INA anyway, a pullup on the start bit can signal command phase complete. In the case of a write, the start bit can be driven as part of the address or mode byte.

I especially like managing the clock with the video generator ... having clock control is always useful and that's why I kept pushing for it. Yes, this way the length is controlled by the clock ticks and would not need a counter of any type. Excellent again.

You are suggesting maybe a form of serial DMA with "reading for TV output"? This could also make Ethernet a reality ... I won't mention size this time :)

Bill Henning · 2009-05-11 16:01

True, but you'd need an external counter, and it would have to be synchronized VERY tightly... and you would need ram with (12.5ns-C) access time, where C is the settling time of the counter.

Linus Akesson said...
You could even improve the unrolled loop, actually. By executing that code in four different cogs, each of them running one cycle later than its predecessor, you'd be able to sample at 80 MHz (times 32 bits).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers

MagIO2 · 2009-05-11 16:49

Why would you need 12.5ns-C RAMs? The execution time of one instruction is 50ns.

Bill Henning · 2009-05-11 17:49

Because he was suggesting using four interleaved cogs to read at 80MB/sec into cog memory.

MagIO2 said...
Why would you need 12.5ns-C RAMs? The execution time of one instruction is 50ns.
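The two figures in this exchange (50ns and 12.5ns) are both consistent with an 80MHz system clock. A small worked sketch, assuming 80MHz, 4 clocks per instruction, and one INA sample per instruction in each cog (the function name is illustrative):

```python
CLKFREQ_HZ = 80_000_000        # assumed system clock
CLOCKS_PER_INSTRUCTION = 4     # typical PASM instruction time in clocks

clock_period_ns = 1e9 / CLKFREQ_HZ
instruction_time_ns = CLOCKS_PER_INSTRUCTION * clock_period_ns

def interleaved_sample_period_ns(cogs):
    """Time between samples when `cogs` cogs run the same unrolled loop,
    each started one clock later than its predecessor."""
    return instruction_time_ns / cogs

print(clock_period_ns, instruction_time_ns, interleaved_sample_period_ns(4))
```

With one cog the RAM has a full instruction time per sample; with four interleaved cogs it has only one clock period, less whatever the external counter needs to settle.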

Bill Henning · 2009-05-11 21:52

Taking jazzed's & Phil's & Hanno's code, one can add:

loop
        ' right-hand column shows the C flag after each instruction

        mov     val,ina         'xxxxxxxxxxxxxxxxxxxxxxxxAAAAAAAA .
        ror     val,#17 wc      'xxxxxxxxxAAAAAAAAxxxxxxxxxxxxxxx A0
        movi    val,ina         'bBBBBBBBBAAAAAAAAxxxxxxxxxxxxxxx A0
        shr     val,#8          '........bBBBBBBBBAAAAAAAAxxxxxxx A0

        movi    val,ina         'cCCCCCCCCBBBBBBBBAAAAAAAAxxxxxxx A0
        shr     val,#8          '........cCCCCCCCCBBBBBBBBAAAAAAA A0
        movi    val,ina         'dDDDDDDDDCCCCCCCCBBBBBBBBAAAAAAA A0
        rcl     val,#1          'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA A0

        wrlong  val,hubptr
        add     hubptr,#4
        djnz    longcount,#loop

which will lock it to the hub, and write one long every 48 cycles (or 12 cycles per byte)
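The bit juggling in that loop is easy to check off-chip. Below is a minimal Python model of the eight packing instructions, assuming the data byte sits on P0..P7 with arbitrary junk on the other pins; `ror`, `movi`, `rcl` and `pack_long` are illustrative helpers that mimic the PASM instructions of the same name, not code from the thread.

```python
MASK32 = 0xFFFFFFFF

def ror(value, bits):
    """Rotate right; also return the original bit 0, as ROR ... WC puts it in C."""
    bits %= 32
    result = ((value >> bits) | (value << (32 - bits))) & MASK32
    return result, value & 1

def movi(dest, src):
    """Copy the low 9 bits of src into bits 31..23 of dest (models MOVI)."""
    return (dest & 0x007FFFFF) | ((src & 0x1FF) << 23)

def rcl(value, bits, carry):
    """Shift left, filling the vacated low bits with the C flag (models RCL)."""
    fill = ((1 << bits) - 1) if carry else 0
    return ((value << bits) & MASK32) | fill

def pack_long(ina_samples):
    """Pack the low bytes of four successive INA reads into one long."""
    a, b, c, d = ina_samples
    val = a & MASK32              # mov  val,ina
    val, carry = ror(val, 17)     # ror  val,#17 wc   (A0 saved in C)
    val = movi(val, b)            # movi val,ina
    val >>= 8                     # shr  val,#8
    val = movi(val, c)            # movi val,ina
    val >>= 8                     # shr  val,#8       (A0 falls off here)
    val = movi(val, d)            # movi val,ina
    return rcl(val, 1, carry)     # rcl  val,#1       (A0 restored from C)

print(hex(pack_long([0xFFFFFF11, 0x00000122, 0xABCDEF33, 0x00000044])))
```

The junk above bit 7 in each sample, including the ninth pin picked up by `movi`, is shifted out or overwritten by the time the long is complete.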

Mind you, now the clocking of the counter becomes --__--__ --__--__ ________ ... that is 32 cycles for reading the four bytes, and 16 cycles for flushing it to the hub.

6.667MB/sec is not bad at all!
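For anyone checking the arithmetic, here is the same calculation spelled out, assuming an 80MHz system clock (the post does not state the clock, but the result only matches at 80MHz):

```python
CLKFREQ_HZ = 80_000_000   # assumed system clock
READ_CYCLES = 8 * 4       # eight 4-clock instructions sample and pack four bytes
FLUSH_CYCLES = 16         # wrlong + add + djnz, locked to the hub window
BYTES_PER_LONG = 4

cycles_per_long = READ_CYCLES + FLUSH_CYCLES
cycles_per_byte = cycles_per_long / BYTES_PER_LONG
throughput_mb_s = CLKFREQ_HZ / cycles_per_byte / 1e6

print(cycles_per_long, cycles_per_byte, round(throughput_mb_s, 3))
```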

Very good work guys.

Phil Pilgrim (PhiPi) · 2009-05-11 23:19

Just to give credit where credit is due, I think Lonesock's insight with the mov and ror was far keener than my own contribution to this discussion.

-Phil

jazzed · 2009-05-12 00:23

Taking what Linus and Bill said and extending it a bit from a theoretical, collaborative-COG viewpoint ... Performance in both the synchronous packet transfer case and the asynchronous access case can be improved by using two COGs.

While the read and write cycle time of an SRAM is fixed to whatever you are willing to pay essentially, the addressing phase is usually less sensitive. So for synchronous performance, one can employ COGs to send the header (whatever that is) on 2n and 2n+2 counts to the CPLD and thereafter use one COG to fetch bytes.

In the asynchronous case, COGs can be used to effectively double transfer rates since the address must be set before the data can be fetched. The only thing that is necessary to coordinate the COG collaborative access is the CNT register; the addresser can fire on 2n steps and fetcher can fire on 2n+2 steps. In the case of the multi-propeller design, this would be more attractive though because more COGs are available for the execution engine. Of course this assumes an interleaved address and data cycle which has been worked out for data on the lower 8 pins.

The other possibility is to use multiple COGs for setting up and accessing legacy burst mode DRAM although that is not available in production quantity. I just happen to have one on my desk connected to a Propeller though :) Note to self ... put in project queue.

Bill Henning said...

True, but you'd need an external counter, and it would have to be synchronized VERY tightly... and you would need ram with (12.5ns-C) access time, where C is the settling time of the counter.

Linus Akesson said...

You could even improve the unrolled loop, actually. By executing that code in four different cogs, each of them running one cycle later than its predecessor, you'd be able to sample at 80 MHz (times 32 bits).

jazzed · 2009-05-14 22:24


Cluso99 wrote...

add      IdSDeR
mov          IdSDer
              ^      <-- write of address to the output pins
               ^     <-- read of data pins
              ||     less than 12.5nS but cannot guarantee any time > 0nS

Cluso99 said...

The address output will be output at the R cycle and the data will be read in the next S clock cycle. The timing between these cannot be guaranteed (I don't recall this sort of timing published on the data sheet). So you will require the first 'nop' no matter how fast the memory.

Edit to Un-strike this:
Ray, looking at other threads on this subject leads me to believe that the operand and not the register is being fetched on S. The register appears to be fetched on the "e" or execute part of the cycle, which would give at least a 25ns margin between OUTA and INA at 80MHz. One would have to use the more expensive SRAM for this though.
Looking at the new P8X32A Datasheet 1.1, it is clear that Ray's interpretation is correct except that it appears the timing will be 1/clkfreq if there is no conditional.

It has been confirmed in another thread with Chip's help that on back-to-back instructions we get 3 clock cycles minus a delay of about 4.5ns (37.5ns - 4.5ns = 33ns at 80MHz) from the time an address is emitted by "mov outa, addr" to the time the data can be collected by "mov data, ina". A NOP is absolutely not required for an 80MHz system clock with SRAM rated at less than 33ns.
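The 33ns figure follows from simple arithmetic if the hyphen in "3 clock cycles - ~4.5ns" is read as a minus sign, which is an interpretation rather than something the post spells out. A sketch under that assumption (names are illustrative):

```python
CLKFREQ_HZ = 80_000_000   # system clock named in the post
CLOCKS_AVAILABLE = 3      # clocks between "mov outa, addr" and "mov data, ina"
PIN_DELAY_NS = 4.5        # approximate delay quoted in the post

clock_period_ns = 1e9 / CLKFREQ_HZ
window_ns = CLOCKS_AVAILABLE * clock_period_ns - PIN_DELAY_NS

def sram_fast_enough(access_time_ns):
    """True if an SRAM with this access time fits the window without a NOP."""
    return access_time_ns < window_ns

print(window_ns, sram_fast_enough(25), sram_fast_enough(55))
```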

Post Edited (jazzed) : 12/14/2009 5:35:14 PM GMT

Mike Huselton · 2009-05-27 15:42

Hippy,

Hippy said...
Very neat and very impressive. In terms of 32-bit data transfers ( I'm looking at executing code from XMM rather than just getting the data ) - PASM 20 MIPS, LMM 5 MIPS, XMM 2 MIPS which isn't too bad, although that considers instructions streamed in and not any random addressing overhead.

Any update on the preferred algorithm or source code for this topic?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
JMH

jazzed · 2009-05-27 17:14

I've stopped working on the CPLD/SRAM variation of this to focus on a fair trade-off, broader solution.
I expect to get 5 to 5.3MB/s raw with up to 1MB SRAM without a CPLD.

The solution will use 1 Propeller, 2 COGs, and 16 pins. It is physically compact and widely applicable.

The solution will be usable on any Protoboard, Demo board, or (as a carrier board) with any Propeller DIP-40.
FABs are due soon ... Demo code is available, and I will start a new thread when appropriate.

MagIO2 · 2009-05-27 19:10

I'm currently working on some CPLD code. As I can't spend as much time as I'd like to it will take a while to finish this.

hinv · 2009-05-28 10:38

Hi Jazzed,

I am very interested to see what you have come up with. 16 pins for 1MB of SRAM?
Have you taken this to another thread somewhere?

Thanks,
Doug

MagIO2 · 2009-05-28 20:18

http://forums.parallax.com/showthread.php?p=811092

hinv · 2009-05-29 01:32

I am a bit confused about 1 question. Can a high rate (5MB/s+) be done with only 8 pins if there is a common Xin clock?

jazzed · 2009-05-29 03:20

hinv,

Normally one or more control signals are used to coordinate work with the secondary device.
One could use a unique token on an 8 bit bus to start a transfer if there is only one device and the protocol is carefully managed.
Also it is likely that a 5 or 10 MHz Xin clock would not be fast enough to achieve 5MB/s.

MagIO2 · 2009-05-29 06:27

@hinv:
Why not? Your protocol has to be a bit different then.
1. wait-mode: the hardware waits until a dedicated bit is 1 to start the read address cycle
2. if this bit is 1 the rest of the bits are copied to the MSB of the address register
3. with the next clock the MSB (middle significant byte ;o) is copied
4. with the next clock the LSB is copied .... (say a 23 bit address register is enough)
5. with the next clock the MSB of the byte-counter is copied
6. with the next clock the LSB of the byte-counter is copied

7. now the hardware switches to read-mode and simply counts up the address counter and down the byte counter with each clock until the byte counter is 0.
8. go back to 1.

Maybe you can use one more bit of first byte to set the write mode.
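The protocol listed above can be sketched as a small state machine. This Python model is only an illustration under assumptions the post leaves open: bit 7 of the bus is taken as the dedicated start bit, each header byte takes one clock, and the class and method names are made up for the example.

```python
START_BIT = 0x80   # assumption: bit 7 of the 8-bit bus is the dedicated start bit

class BurstReader:
    """Models the external hardware: header in, then a counted read burst."""

    def __init__(self, ram):
        self.ram = ram        # bytes-like external memory contents
        self.header = []      # header bytes collected so far
        self.address = 0      # 23-bit address counter
        self.count = 0        # 16-bit byte counter

    def clock(self, bus_in=0):
        """Advance one clock. Returns a data byte in read mode, else None."""
        if self.count > 0:
            # Read mode: output a byte, count the address up and the counter down.
            data = self.ram[self.address]
            self.address = (self.address + 1) & 0x7FFFFF
            self.count -= 1
            return data
        if not self.header and not (bus_in & START_BIT):
            return None       # wait mode: start bit not seen yet
        self.header.append(bus_in & 0xFF)
        if len(self.header) == 5:
            h = self.header
            # 7 bits from the first byte, then middle and low address bytes.
            self.address = ((h[0] & 0x7F) << 16) | (h[1] << 8) | h[2]
            self.count = (h[3] << 8) | h[4]
            self.header = []
        return None

ram = bytes(range(256))
dev = BurstReader(ram)
for byte in (0x80, 0x00, 0x10, 0x00, 0x03):   # address 0x000010, count 3
    dev.clock(byte)
burst = [dev.clock() for _ in range(3)]
print(burst)
```

After the counter reaches zero the model drops back to waiting for the start bit, so a further `clock()` with an idle bus returns nothing.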

The net transfer rate in this case depends on how often you have to set the address to a different value. It would be perfect for recording measurements until memory is full and then streaming the result to a PC, for example. It would also work fine for showing images on a screen. When doing graphics output (drawing lines across the screen) the performance will rapidly decrease.
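To put rough numbers on that, here is a sketch of the net rate as a function of burst length. It assumes a five-clock header (three address bytes plus two byte-counter bytes, one clock each), a data byte on every following clock, and a raw rate of 5MB/s chosen purely for illustration.

```python
HEADER_CLOCKS = 5   # 3 address bytes + 2 byte-counter bytes, one clock each (assumed)

def net_rate_mb_s(raw_mb_s, burst_bytes):
    """Payload rate once the per-burst header overhead is included."""
    return raw_mb_s * burst_bytes / (burst_bytes + HEADER_CLOCKS)

for burst in (1, 16, 256, 65535):
    print(burst, round(net_rate_mb_s(5.0, burst), 3))
```

Long sequential bursts barely notice the header, while single-byte random accesses spend most of their clocks on addressing.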

kuroneko · 2009-05-31 05:30

Cluso99 said...

loadxmm
 mov outa, addr
 add addr, #4      '<-- nop reqd but can advance addr
 mov data, ina     'AAAAAAAA....
 add outa, #1
 shr data, #24     '000000000000000000000000AAAAAAAA
 mov d2, ina       'BBBBBBBB....
 add outa, #1
 and d2, hFF000000 'BBBBBBBB000000000000000000000000
 mov d3, ina       'CCCCCCCC....
 add outa, #1
 shr d3, #24       '000000000000000000000000CCCCCCCC
 mov d4, ina       'DDDDDDDD.... 

 and d4, hFF000000 'DDDDDDDD000000000000000000000000
 or data, d4       'DDDDDDDD0000000000000000AAAAAAAA
 or d2, d3         'BBBBBBBB0000000000000000CCCCCCCC
 rol d2, #16       '00000000CCCCCCCCBBBBBBBB00000000
 or data, d2       'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA 

 ret

This method adds 2 instructions to XMM, but random is 1 instruction faster. The main disadvantage to my method is that I require the use of SI/SO and the Eeprom pins.

That's 17 instructions + return (68+4 cycles). If you have say 4 spare cogs and your SRAM is good enough you can get away with 43+4 cycles :) The added advantage is that even with 20 address bits (1M) you can place your data byte at P20..P27 and have I2C and serial free.

Edit: added 4 cycle penalty due to byte placement
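The quoted single-cog routine and its cycle count can be checked with a short Python model. It assumes the data byte is on P24..P31 as in the quoted comments and 4 clocks per instruction; `rol` and `loadxmm` are illustrative helpers, and the multi-cog 43+4 figure is not modelled here.

```python
MASK32 = 0xFFFFFFFF
HFF000000 = 0xFF000000

def rol(value, bits):
    """Rotate a 32-bit value left (models ROL)."""
    bits %= 32
    return ((value << bits) | (value >> (32 - bits))) & MASK32

def loadxmm(ina_samples):
    """Combine four INA reads (data byte on P24..P31) into one long."""
    s0, s1, s2, s3 = ina_samples
    data = s0 >> 24          # shr data, #24
    d2 = s1 & HFF000000      # and d2, hFF000000
    d3 = s2 >> 24            # shr d3, #24
    d4 = s3 & HFF000000      # and d4, hFF000000
    data |= d4               # or  data, d4
    d2 |= d3                 # or  d2, d3
    d2 = rol(d2, 16)         # rol d2, #16
    data |= d2               # or  data, d2
    return data

INSTRUCTIONS = 17            # instruction count of the quoted routine, excluding ret
CLOCKS_PER_INSTRUCTION = 4   # assumed
cycles = INSTRUCTIONS * CLOCKS_PER_INSTRUCTION + CLOCKS_PER_INSTRUCTION  # + ret

print(hex(loadxmm([0x11FFFFFF, 0x22ABCDEF, 0x33000000, 0x44123456])), cycles)
```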

Post Edited (kuroneko) : 5/31/2009 5:43:58 AM GMT

Cluso99 · 2009-05-31 05:47

kuroneko: I figure that the overhead to pass the info between cogs and get them synchronised will kill any benefit in access speed for anything other than large blocks. By placing the data D0-7 in P24-31 I can remove any extraneous bits with a single SHR/SHL #24. No requirement to AND and SHR/SHL so it saves an instruction.

kuroneko · 2009-05-31 05:59

Cluso99 said...
kuroneko: I figure that the overhead to pass the info between cogs and get them synchronised will kill any benefit in access speed for anything other than large blocks. By placing the data D0-7 in P24-31 I can remove any extraneous bits with a single SHR/SHL #24. No requirement to AND and SHR/SHL so it saves an instruction.

There is no such overhead. By picking the right cogs they are implicitly synchronised. All the reader cog does is issue the address and pick up the result :)

Cluso99 · 2009-05-31 09:14

Kuroneko: Exactly - you have to pass the address and pick up the result via hub !!!

kuroneko · 2009-05-31 09:31

Cluso99 said...
Kuroneko: Exactly - you have to pass the address and pick up the result via hub !!!

Does it matter if it's faster than your current approach?

Cluso99 · 2009-05-31 09:46

My access is random, so your method has to be slower because I would have to pass the data to another cog via hub to use it. This is much slower than my method.

Your method must use more code as well. We do not have space in ZiCog.

Post Edited (Cluso99) : 5/31/2009 9:51:23 AM GMT

Mike Huselton · 2009-05-31 23:42

So in conclusion - what is the conclusion, exactly?
