Quick Cog-to-Hub transfer
lonesock
Posts: 917
Hi, everybody.
I saw a thread on the use of a counter to auto update a pointer. With 16 clocks between consecutive wrlongs (for example), the PHSx register would be jumping by 16 each time. So, using a stride of 4 longs, I can upload an entire buffer using 4 passes (so the buffer must be a multiple of 16 bytes long, and long aligned).
Here is some proof-of-concept code that actually works for me. It's hard coded to move a 256-byte buffer from the Cog RAM to Hub RAM.
' start Counter B as my address incrementer (Logic ALWAYS)
        mov     ctrb,#%11111
        shl     ctrb,#26
        mov     frqb,#1

' blah blah blah

'' adr_Buf is already set externally
cog_to_hub
        ' setup all 4 passes
        mov     val,#4
        movd    :write_long,#buffer
:one_of_four
        ' sync to the Hub RAM access
        rdlong  tmp,tmp
        ' how many long to move on this pass? (256 bytes / 4)longs / 4 passes
        mov     tmp,#(256 / 4 / 4)
        ' get my starting address right (phsb is incremented 1 per clock, so 16 each Hub access)
        mov     phsb,adr_Hub
        ' write the longs, stride 4...low 2 bits of phsb are ignored
:write_long
        wrlong  0-0,phsb
        add     :write_long,incDest4
        djnz    tmp,#:write_long
        ' go back to where I started, but advanced 1 long
        sub     :write_long,decDestNminus1
        ' offset my Hub pointer by one long per pass
        add     adr_Hub,#4
        ' do all 4 passes
        djnz    val,#:one_of_four
cog_to_hub_ret  ret

{===== PASM initialized variables and parameters =====}
incDest         long    1 << 9
incDest4        long    4 << 9
decDestNminus1  long    (256 / 4 - 1) << 9

{===== my buffer, 256 bytes =====}
buffer          res     256/4
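To make the pass/stride bookkeeping easier to follow, here is a rough Python model of the loop above. It is only a sketch of the address arithmetic, not of real hub timing: it assumes every wrlong lands exactly 16 clocks after the previous one, and the names `cog_to_hub`, `BUF_LONGS` and the hub address `0x1000` are made up for the illustration.

```python
# Model of the four-pass, stride-4 cog-to-hub copy (arithmetic only, no timing).
BUF_LONGS = 256 // 4       # 64 longs in the cog buffer
PASSES = 4                 # one pass per long inside each 16-byte stride
CLOCKS_PER_WRITE = 16      # assumed: synced wrlong + add + djnz

def cog_to_hub(cog_buf, hub, adr_hub):
    """Copy cog_buf (list of longs) into hub (dict: byte address -> long)."""
    for p in range(PASSES):
        phsb = adr_hub + 4 * p      # "add adr_Hub,#4" once per pass
        idx = p                     # dest field ends each pass advanced by 1 long
        for _ in range(BUF_LONGS // PASSES):
            hub[phsb & ~3] = cog_buf[idx]   # wrlong ignores the low 2 address bits
            idx += 4                        # "add :write_long,incDest4"
            phsb += CLOCKS_PER_WRITE        # counter adds FRQB = 1 on every clock
    return hub

cog = list(range(1000, 1000 + BUF_LONGS))   # dummy buffer contents
hub = cog_to_hub(cog, {}, adr_hub=0x1000)
copied = [hub[0x1000 + 4 * i] for i in range(BUF_LONGS)]
print(copied == cog)
```

Each pass touches every fourth long, and the four passes together cover all 64 longs exactly once, which is why the buffer has to be a multiple of 16 bytes.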
Anybody see a way to speed it up?
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Comments
I definitely need to be processing 256 bytes practically simultaneously with multiple cogs as fast as possible now.
I won't know if what I am trying to do will work yet though.
Must be the time of year; I was playing with the same idea, just for getting stuff into the cog (and before someone mentions overlay loaders, they don't do what I want). You can't do anything about the single wasted slot between passes (djnz timing) unless you are prepared to unroll the loop and transfer the buffer manually, which, depending on the size of the buffer, may not be an option. The second wasted slot can be removed if you are prepared to run each pass separately (and can afford the extra code). This also lets you transfer a buffer of any size, provided it is at least 4 longs. The code is based on your example.
It uses the fact that hub long accesses ignore the lower 2 bits, so firstly the hub pointer is decremented by 7, then by 1, followed by 7, 1 etc.
We load in reverse and of course we use the fact that the djnz is replaced.
So it only remains to see if a combination of this could work for writing to hub. I will think about this.
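The "decrement by 7, then by 1" pattern mentioned above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the original loop is not reproduced here, and the starting value (a long-aligned address with its low two bits set to %11) is my assumption about how the pointer would need to be primed for the pattern to step one long at a time.

```python
def hub_long_addresses(start, count):
    """Hub long addresses seen when a pointer steps -7, -1, -7, -1, ..."""
    ptr = start
    seen = []
    for i in range(count):
        seen.append(ptr & ~3)            # hub long accesses ignore the low 2 bits
        ptr -= 7 if i % 2 == 0 else 1    # alternate the two decrements
    return seen

top = 0x0100                             # hypothetical long-aligned hub address
addrs = hub_long_addresses(top + 3, 8)   # assumed start: low bits set to %11
print([hex(a) for a in addrs])
```

Under that assumption every access lands on the next lower long, so the data is walked in reverse with no repeated or skipped addresses.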
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
· Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
Post Edited (Cluso99) : 12/2/2009 7:31:03 AM GMT
Don't get me wrong, there are use cases where the append approach does what I want and I will use it :)
lonesock & kuroneko: I love the use of the counters. I haven't ever used them before. I am having a look again as there may be ways I can use them - thanks
I took the liberty of rearranging your code so I could understand it better (the inline comments confuse/distract me)...hope you don't mind.
I had to make some tweaks to the code to get it to work in this little test program. See the "FIXME" notes in the ASM comments...in particular, I had an indexing problem wherein I had to add 4 to the hub address before entering the loop in order to get the output right. Strange that it works for you without this...
I also modded your code so that I could specify any arbitrary size block transfer (as long as it is divisible by 4). This cost 4 extra longs, but seemed like a worthwhile addition.
Anyway, this little test routine yields the following results:
For large block transfers, the overhead only amounts to about a 5-10% penalty relative to the ideal. Hard to see how you'll do much better than what you've got. I haven't looked at the spin interpreter's longmove routine to see if there are any clever tricks to be had there. Might be.
I'm guessing this is for your speech recognition challenge project. There must be scads of PASM routines people have written to do this; I wish there were some kind of "ASM code snippet repository" for this kind of stuff so we could check out what others have done before.
Post Edited (BR) : 12/2/2009 10:24:47 PM GMT
works exactly as advertised provided the code is sync'ed with the hub window and addr is 4n. The actual problem here is your test code, you dump the array starting with index 1 (when it should be 0, i.e. 0..127).
HTH
I noticed that there's only 1 bit different between the wrlong and rdlong instructions. So just out of curiosity, I made a version that will do block reads and writes. Takes 2 extra longs to implement. Seems to work fine, see attached demo. I thought about adding the read functionality using a mask...only 1 long needed, but it just didn't seem as intuitive to use.
Jonathan, I guess I'm not helping much relative to your original question. But this is a handy little algorithm you've come up with here.
BTW, You can also do
movi wr_mode, #%000010_000 'set to writelong
and
movi wr_mode, #%000010_001 'set to readlong
The easiest approach would be to have the routine automatically switch itself back to read mode at the end, and to set it to write mode with one instruction just before the call whenever a write is wanted.
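For anyone wondering why a single movi is enough, here is a small Python sketch of the instruction word. It assumes the usual Propeller 1 layout (bits 31..23 hold the 6-bit opcode plus the Z, C and R flags, which is the field movi overwrites); the helper names and the example dest/src values are made up for the illustration.

```python
WRLONG_TOP9 = 0b000010_000   # opcode %000010 with Z=0 C=0 R=0 (write to hub)
RDLONG_TOP9 = 0b000010_001   # same opcode with R=1 (read from hub)

def assemble(top9, dest, src, imm=0, cond=0b1111):
    """Pack a 32-bit instruction word from its fields (assumed P1 layout)."""
    return ((top9 & 0x1FF) << 23) | (imm << 22) | (cond << 18) \
        | ((dest & 0x1FF) << 9) | (src & 0x1FF)

def movi(instr, top9):
    """Replace bits 31..23 and keep the lower 23 bits, like PASM's movi."""
    return (instr & 0x007FFFFF) | ((top9 & 0x1FF) << 23)

# Arbitrary example: dest = cog register $040, src = register $1FD.
wr_mode = assemble(WRLONG_TOP9, dest=0x040, src=0x1FD)
rd_mode = movi(wr_mode, RDLONG_TOP9)

differing = wr_mode ^ rd_mode
print(bin(differing).count("1"), hex(differing))
```

Only the R bit changes, so the dest and src fields that the transfer loop has already patched are left alone when the mode is flipped.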
Should I submit this to the PASM tips & tricks thread? Or the prop Wiki? (I'm not sure of the protocol, or who's in charge of those resources.)
Jonathan
-Phil
In this case the buffer starts at $001C, so its size will be limited to $200 - $1C = $1E4 (484) bytes.
-Phil
If you had Cog RAM to spare, you could have 2 equal sized buffers: data & addresses. Pre-load the addresses once on cog load. This assumes that the hub addresses are fixed.
Jonathan
-Phil
BTW, isn't the incboth incorrect...
incboth long 1<<9 | 4 'add 1 to cog address (long) and add 4 to hub address (bytes)
(I think the immediate use of the hub is still addressed as bytes - manual on other laptop)
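Here is a quick Python sketch of what adding `1<<9 | 4` to the instruction would do, assuming the hub address sits in the 9-bit source field as an immediate byte address and the cog address sits in the 9-bit dest field. The starting values ($040 for the cog buffer, $01C for the hub buffer) and the helper names are only examples.

```python
INCBOTH = (1 << 9) | 4        # +1 in the dest field, +4 in the src field
HUB_IMM_LIMIT = 0x200         # a 9-bit immediate can only reach bytes 0..$1FF

def fields(instr):
    """Return (dest, src) of an instruction word's low 18 bits."""
    return (instr >> 9) & 0x1FF, instr & 0x1FF

def max_longs(hub_start):
    """Longs that fit before the src field would carry into dest."""
    return (HUB_IMM_LIMIT - hub_start) // 4

instr = (0x040 << 9) | 0x01C  # hypothetical cog buffer at $040, hub at $01C
steps = []
for _ in range(4):
    steps.append(fields(instr))
    instr += INCBOTH          # one add advances both addresses

print([(hex(d), hex(s)) for d, s in steps])
print(max_longs(0x01C))
```

The cog address moves by one long and the hub address by four bytes per step, and the transfer has to stop before the source field overflows into the dest field.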
Phil: IIRC the compiler re-arranges the DAT code so that it will not likely be in the lower 512 bytes of hub. However, at least homespun or bst (or maybe both) has an option to reserve space after the mandatory $10? longs in hub.
I used the immediate hub address for some code quite a while ago. It's a relatively unknown (or rarely thought of) idea.
As for the tiny hubwrite loop, a
  movi hubwrite,#%000010_000 'set to writelong
or
  movi hubwrite,#%000010_001 'set to readlong
before calling the routine will make it a read or write loop. Below is even better though.
So, here is an improved extended solution (1 extra instruction used per block for read, 2 for write), but I saved 2 (movs & movd)
Now to work out an easy way to force the buffer into the lower 512 bytes of hub. For this we will have to ask the compiler masters, Brad & Michael.
Another option (apart from the hub-below-0x200) would be to let one cog write its data to a fixed hub address. Two displacer cogs will take it from there and distribute it to wherever you want.
write loop (fragment)
displacer loop (fragment)
Post Edited (kuroneko) : 12/4/2009 3:04:07 AM GMT
I know it puts VAR variables somewhere else, but I'm pretty sure the DAT block remains contiguous with the object's Spin code. I could be wrong, though.
-Phil
I'm down to trying to see if I can pick a frqb value such that the counter wraps around and increments by 4 once every 16 clocks, i.e., set frqb := ($FFFF_FFFF+4)/4 and hope that the remainder won't skew the count. I've tried this, but so far it doesn't work. I think I've been staring at it too long; time to stop and try again later.
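A rough Python model of the counter arithmetic may help narrow this down. It rests on two assumptions that I have not checked against the attached code: that the expression is evaluated in 32-bit signed arithmetic (as Spin does), and that PHSB simply accumulates FRQB on every clock, modulo 2^32. The helper names are made up.

```python
def s32(x):
    """Wrap an integer to 32-bit signed, as Spin arithmetic would."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def div_trunc(a, b):
    """Signed division truncating toward zero."""
    q = abs(a) // abs(b)
    return s32(-q if (a < 0) != (b < 0) else q)

def phs_after(frq, clocks, phs=0):
    """Accumulate frq into phs once per clock, modulo 2**32."""
    for _ in range(clocks):
        phs = (phs + frq) & 0xFFFFFFFF
    return phs

wrapped = div_trunc(s32(s32(0xFFFFFFFF) + 4), 4)   # 32-bit signed evaluation
intended = (0xFFFFFFFF + 4) // 4                   # unbounded evaluation

print(wrapped, hex(intended))
print(phs_after(intended, 16), phs_after(intended + 1, 16))
```

In this model the 32-bit expression collapses to 0 (since $FFFF_FFFF is -1), and even the value the formula seems to be aiming for gives no net change over 16 clocks, while adding 1 to it still advances the count by 16 rather than 4. That would be consistent with the approach not working, but treat it as a guess until it is checked on hardware.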
I've attached my code for anyone interested in taking a crack at it. Be prepared for a helping of vexation.
@Cluso99: I've been staring at your overlay code snippet and I'm just not getting it. Is there a thread somewhere that discusses this further?
I now have another power supply for my 2nd laptop (old one died) so I can run both - am now off to see how I can force the compiler to place the buffer in lower hub.
BR: Overlay - use Heater's link. There are examples and descriptions in the code. Heater uses it in the ZiCog. BTW, acknowledgements are in the code, including PhiPi, hippy and others.
Yes, I was looking at the counters and, like you, I couldn't see a way without wasting a pin... and I never seem to have a spare :(
kuroneko, lonesock and PhiPi seem to be the counter masters. (Added PhiPi as I just saw his explanations on another thread - nice, Phil.)
Post Edited (Cluso99) : 12/4/2009 5:38:00 AM GMT
Of course, the downside is that you can't use res to allocate the cog buffer.
See below.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 12/4/2009 6:26:48 AM GMT
-Phil
A list of pointers to the routines, etc come first
DAT's are next
PUB's and PRI's are next
VAR's are placed at the end
So, the buffer needs to be in the first DAT. So use a DAT just for the buffer. Use spin to find the hub address and "poke" it into the PASM routine before loading or pass it via a par parameter.
Nope, that's correct.
Each object consists of
- Method Table
- DAT block (contiguous and not restructured or tampered with in any way)
- Spin Methods
Lather, Rinse, Repeat for all objects.. then
- VAR block. Object by object, with each object's variables sorted in LONG/WORD/BYTE order.
So the DAT block in the top object is as close to the start of the hub as possible without losing the spin method table.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If you always do what you always did, you always get what you always got.
-Phil
If you want to make it minimal, have the top object contain just a DAT section and one method to point to the next sub-object. So the method table will include one spin method and one sub-object. Then you can just behave as normal under that, and the DAT section in your top object can reserve the memory required for you.
-Phil
Yep, just take into account that each spin method and each object instance consumes 4 bytes from that.
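To put numbers on that, here is a small Python sketch of where the top object's DAT block would start and how much room is left below $200. The 16-byte program header and the 4-byte object header are my assumptions about the binary layout, and the function names are made up; only the "4 bytes per method and per object instance" part comes from the post above.

```python
SPIN_HEADER_BYTES = 0x10   # assumed: fixed program header at the start of hub
OBJ_HEADER_BYTES = 4       # assumed: first long of the object's method table
ENTRY_BYTES = 4            # per spin method and per object instance
HUB_IMM_LIMIT = 0x200      # highest reach of a 9-bit immediate hub address

def dat_start(methods, sub_objects):
    """Hub byte address where the top object's DAT block would begin."""
    table = OBJ_HEADER_BYTES + ENTRY_BYTES * (methods + sub_objects)
    return SPIN_HEADER_BYTES + table

def buffer_room(methods, sub_objects):
    """Bytes available between the DAT start and the $200 limit."""
    return HUB_IMM_LIMIT - dat_start(methods, sub_objects)

start = dat_start(methods=1, sub_objects=1)
print(hex(start), buffer_room(methods=1, sub_objects=1))
```

Under these assumptions a top object with one method and one sub-object puts the DAT block at $001C, and each extra method or object instance costs another 4 bytes of buffer space.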