Quick Cog-to-Hub transfer
Hi, everybody.
I saw a thread on the use of a counter to auto update a pointer. With 16 clocks between consecutive wrlongs (for example), the PHSx register would be jumping by 16 each time. So, using a stride of 4 longs, I can upload an entire buffer using 4 passes (so the buffer must be a multiple of 16 bytes long, and long aligned).
Here is some proof-of-concept code that actually works for me. It's hard coded to move a 256-byte buffer from the Cog RAM to Hub RAM.
' start Counter B as my address incrementer (Logic ALWAYS)
              mov       ctrb,#%11111
              shl       ctrb,#26
              mov       frqb,#1

' blah blah blah

'' adr_Hub is already set externally
cog_to_hub
              ' setup all 4 passes
              mov       val,#4
              movd      :write_long,#buffer
:one_of_four
              ' sync to the Hub RAM access
              rdlong    tmp,tmp
              ' how many longs to move on this pass? (256 bytes / 4) longs / 4 passes
              mov       tmp,#(256 / 4 / 4)
              ' get my starting address right (phsb is incremented 1 per clock, so 16 each Hub access)
              mov       phsb,adr_Hub
              ' write the longs, stride 4...low 2 bits of phsb are ignored
:write_long
              wrlong    0-0,phsb
              add       :write_long,incDest4
              djnz      tmp,#:write_long
              ' go back to where I started, but advanced 1 long
              sub       :write_long,decDestNminus1
              ' offset my Hub pointer by one long per pass
              add       adr_Hub,#4
              ' do all 4 passes
              djnz      val,#:one_of_four
cog_to_hub_ret
              ret

{===== PASM initialized variables and parameters =====}
incDest         long    1 << 9
incDest4        long    4 << 9
decDestNminus1  long    (256 / 4 - 1) << 9

{===== my buffer, 256 bytes =====}
buffer          res     256/4
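For anyone who wants to check the access order off-chip, here is a rough Python model of the four passes. It is only a sketch, not Propeller code. It assumes the loop hits every hub window (16 clocks apart), that phsb advances by 1 per clock, and that wrlong ignores the low 2 address bits. The names `cog_to_hub_model` and `HUB_WINDOW` are made up for illustration.

```python
HUB_WINDOW = 16  # assumed clocks between hub accesses for one cog


def cog_to_hub_model(cog_buf, hub, adr_hub):
    """Model the 4-pass, stride-4 copy. `hub` is a list of longs."""
    n = len(cog_buf)
    assert n % 4 == 0 and adr_hub % 4 == 0
    order = []                       # cog indices in the order they get written
    for p in range(4):               # one pass per starting offset
        phsb = adr_hub + 4 * p       # mov phsb,adr_Hub (adr_Hub grows by 4 per pass)
        src = p                      # the wrlong destination field starts 1 long later
        for _ in range(n // 4):
            hub[(phsb & ~3) // 4] = cog_buf[src]   # wrlong, low 2 bits ignored
            order.append(src)
            src += 4                 # add :write_long,incDest4
            phsb += HUB_WINDOW       # counter kept running for one hub window
    return order


cog = list(range(100, 164))          # 64 longs = 256 bytes
hub = [0] * 80
order = cog_to_hub_model(cog, hub, adr_hub=16)
print(order[:4], order[16:20])
```

Pass 0 writes longs 0, 4, 8, ..., pass 1 writes 1, 5, 9, ..., and after all four passes the hub copy is contiguous.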
Anybody see a way to speed it up?
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.
Comments
I definitely need to be processing 256 bytes practically simultaneously with multiple cogs as fast as possible now.
I won't know if what I am trying to do will work yet though.
Must be the time of the year; I was playing with the same idea, just for getting stuff into the cog (and before someone mentions overlay loaders, they don't do what I want). You can't do anything about the single wasted slot between passes (djnz timing) unless you are prepared to unroll the loop and transfer the buffer manually, which, depending on the size of the buffer, may not be an option. The second wasted slot can be removed if you are prepared to run each pass separately (and can afford the extra code). This also gives you the option to transfer any size of buffer, provided it's at least 4 longs. Code is based on your example.
PUB null

CON
  lcnt = 256 / 4                                ' number of longs to transfer (min 4)

' lcnt | pass 0    1    2    3
' 4n+0        n    n    n    n
' 4n+1        n+1  n    n    n
' 4n+2        n+1  n+1  n    n
' 4n+3        n+1  n+1  n+1  n
'
' example lcnt = 6 (4n+2)
' pass 0 transfers 2 longs   0 4
' pass 1 transfers 2 longs   1 5
' pass 2 transfers 1 longs   2
' pass 3 transfers 1 longs   3

DAT
              org       0

              movi      ctrb, #%0_11111_000
              mov       frqb, #1

adr_Hub       long      0

'' adr_Hub is already set externally
cog_to_hub
              movd      :write_long0, #buffer + 0
              movd      :write_long1, #buffer + 1
              movd      :write_long2, #buffer + 2
              movd      :write_long3, #buffer + 3

              ' sync to hub window
              rdlong    tmp, tmp                ' +0 =

              mov       tmp, #(lcnt + 3) / 4    ' +8   see constant section
              mov       phsb, adr_Hub           ' +12
:write_long0
              wrlong    0-0, phsb               ' +0 =
              add       :write_long0, incDest4  ' +8
              djnz      tmp, #:write_long0      ' +12

              add       adr_Hub, #4             ' +20
              mov       tmp, #(lcnt + 2) / 4    ' +24  see constant section
              mov       phsb, adr_Hub           ' +28
:write_long1
              wrlong    0-0, phsb               ' +0 =
              add       :write_long1, incDest4  ' +8
              djnz      tmp, #:write_long1      ' +12

              add       adr_Hub, #4             ' +20
              mov       tmp, #(lcnt + 1) / 4    ' +24  see constant section
              mov       phsb, adr_Hub           ' +28
:write_long2
              wrlong    0-0, phsb               ' +0 =
              add       :write_long2, incDest4  ' +8
              djnz      tmp, #:write_long2      ' +12

              add       adr_Hub, #4             ' +20
              mov       tmp, #(lcnt + 0) / 4    ' +24  see constant section
              mov       phsb, adr_Hub           ' +28
:write_long3
              wrlong    0-0, phsb               ' +0 =
              add       :write_long3, incDest4  ' +8
              djnz      tmp, #:write_long3      ' +12
cog_to_hub_ret
              ret

{===== PASM initialized variables and parameters =====}
incDest4      long      4 << 9
tmp           long      0

{===== my buffer, 256 bytes =====}
buffer        res       lcnt

              fit

DAT
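The per-pass counts in the table work out to (lcnt + 3 - pass) / 4 with integer division. A small Python check of that reading (the helper names `pass_counts` and `pass_indices` are mine, purely for illustration):

```python
def pass_counts(lcnt):
    """Longs moved by passes 0..3 when pass p handles longs p, p+4, p+8, ..."""
    return [(lcnt + 3 - p) // 4 for p in range(4)]


def pass_indices(lcnt):
    """Which buffer longs each pass touches."""
    return [list(range(p, lcnt, 4)) for p in range(4)]


for lcnt in (6, 64):
    print(lcnt, pass_counts(lcnt), pass_indices(lcnt)[:2])
```

For lcnt = 6 this gives 2, 2, 1, 1 longs, matching the example in the comment block, and the counts always add up to lcnt.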
              movd      overlay_copy2,overlay_par       'move cog END address into rdlong instruction
              sub       overlay_par,#1                  'decrement cog End address by 1
              movd      overlay_copy1,overlay_par       'move cog END-1 address into rdlong instruction
              shr       overlay_par,#16                 'extract the overlay## hub END address (remove cog address)
overlay_copy2 rdlong    0-0,overlay_par                 'copy long from hub to cog (hptr ignores last 2 bits!)
              sub       overlay_par,#7                  'decrement hub ptr by 1 long (prev by 1, now by 7)
              sub       overlay_copy2,_0x400            'decrement cog (destination) address by 2
overlay_copy1 rdlong    0-0,overlay_par                 'copy long from hub to cog
              sub       overlay_copy1,_0x400            'decrement cog (destination) address by 2
              djnz      overlay_par,#overlay_copy2      'decrement hub ptr by 1 long (now by 1, next by 7)
It uses the fact that hub long accesses ignore the lower 2 bits, so firstly the hub pointer is decremented by 7, then by 1, followed by 7, 1 etc.
We load in reverse and of course we use the fact that the djnz is replaced.
So it only remains to see if a combination of this could work for writing to hub. I will think about this.
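The alternating 7/1 decrement is easy to check with a little Python model. This is a sketch only. It assumes the starting hub pointer has both low bits set (my reading of "hub END address", not something stated above), and `reverse_load_addresses` is a made-up name.

```python
def hub_long(addr):
    """Hub long accesses ignore the low 2 address bits."""
    return addr & ~3


def reverse_load_addresses(hub_end, count):
    """Long addresses actually read when the pointer drops by 7, 1, 7, 1, ..."""
    ptr = hub_end
    steps = (7, 1)                   # sub #7 after one rdlong, djnz (-1) after the next
    seen = []
    for i in range(count):
        seen.append(hub_long(ptr))
        ptr -= steps[i % 2]
    return seen


addrs = reverse_load_addresses(0x103, 6)
print([hex(a) for a in addrs])
```

Under that assumption every read lands exactly one long below the previous one, even though the pointer itself only ever moves by 7 or 1.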
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
- Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
- Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
- Prop Tools under Development or Completed (Index)
- Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
- Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz  MultiBladeProp is: www.bluemagic.biz/cluso.htm
Post Edited (Cluso99) : 12/2/2009 7:31:03 AM GMT
Don't get me wrong, there are use cases where the append approach does what I want and I will use it :)
lonesock & kuroneko: I love the use of the counters. I haven't ever used them before. I am having a look again as there may be ways I can use them - thanks
I took the liberty of rearranging your code so I could understand it better (the inline comments confuse/distract me)...hope you don't mind.
I had to make some tweaks to the code to get it to work in this little test program. See the "FIXME" notes in the ASM comments. In particular, I had an indexing problem wherein I had to add 4 to the hub address before entering the loop in order to get the output right. Strange that it works for you without this...
I also modded your code so that I could specify any arbitrary size block transfer (as long as it is divisible by 4). This cost 4 extra longs, but seemed like a worthwhile addition.
Anyway, this little test routine yields the following results:
#longs xfred    #ticks    time/ideal
      16           438       1.71
      32           694       1.36
      64          1206       1.18
     128          2230       1.09
     256          4278       1.04
For large block transfers, the overhead only amounts to about a 5-10% penalty relative to the ideal. Hard to see how you'll do much better than what you've got. I haven't looked at the spin interpreter's longmove routine to see if there's any clever tricks to be had there. Might be.
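The ratios are consistent with an ideal of 16 clocks per long. Here is a quick Python check of the table, using only the tick counts posted above. Treating everything beyond 16 clocks per long as "overhead" is my assumption, and it would include whatever the test harness itself costs.

```python
CLOCKS_PER_LONG = 16   # assumed ideal: one long per hub window

measured = {16: 438, 32: 694, 64: 1206, 128: 2230, 256: 4278}

ratios = {n: round(t / (n * CLOCKS_PER_LONG), 2) for n, t in measured.items()}
overheads = {n: t - n * CLOCKS_PER_LONG for n, t in measured.items()}

for n in measured:
    print(n, measured[n], ratios[n], overheads[n])
```

The leftover comes out the same for every block size, so the penalty looks like a fixed setup cost rather than something that grows with the transfer.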
I'm guessing this is for your speech recognition challenge project. There must be scads of PASM routines people have written to do this; I wish there were some kind of "ASM code snippet repository" for this kind of stuff so we could check out what others have done before.
Post Edited (BR) : 12/2/2009 10:24:47 PM GMT
              mov       phsa, addr              ' -4
              wrlong    $, phsa                 ' +0 =
works exactly as advertised, provided the code is sync'ed with the hub window and addr is 4n. The actual problem here is your test code: you dump the array starting with index 1 (when it should be 0, i.e. 0..127).
HTH
I noticed that there's only 1 bit different between the wrlong and rdlong instructions. So just out of curiosity, I made a version that will do block reads and writes. Takes 2 extra longs to implement. Seems to work fine, see attached demo. I thought about adding the read functionality using a mask...only 1 long needed, but it just didn't seem as intuitive to use.
Jonathan, I guess I'm not helping much relative to your original question. But this is a handy little algorithm you've come up with here.
BTW, You can also do
movi wr_mode, #%000010_000 'set to writelong
and
movi wr_mode, #%000010_001 'set to readlong
The easiest approach would be to have the routine automatically switch itself back to read at the end, and to set it to write with one instruction just before the call whenever a write is wanted.
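To see the single-bit difference concretely, here is a small Python sketch. It models movi as replacing the top 9 bits (31..23) of a 32-bit instruction, using the two field values quoted above. The `movi` helper and the sample instruction word are illustrative only.

```python
WRLONG_FIELD = 0b000010_000   # value used with movi for a write
RDLONG_FIELD = 0b000010_001   # value used with movi for a read


def movi(instr, field9):
    """Model of movi: replace bits 31..23 of a 32-bit instruction word."""
    mask = 0x1FF << 23
    return ((instr & ~mask) | ((field9 & 0x1FF) << 23)) & 0xFFFF_FFFF


differing = WRLONG_FIELD ^ RDLONG_FIELD

some_instr = 0x0000_1234              # arbitrary low bits, stand-in for the d/s fields
as_write = movi(some_instr, WRLONG_FIELD)
as_read = movi(some_instr, RDLONG_FIELD)
print(bin(differing), hex(as_write ^ as_read))
```

Only bit 23 of the instruction changes, and the destination and source fields are left alone.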
Should I submit this to the PASM tips & tricks thread? Or the prop Wiki? (I'm not sure of the protocol, or who's in charge of those resources.)
Jonathan
              movs      hubwrite,hubaddr
              movd      hubwrite,bufaddr
              mov       counter,#bufsize
hubwrite      wrlong    0-0,#0-0
              add       hubwrite,incboth
              djnz      counter,#hubwrite

incboth       long      1 << 9 | 4
-Phil
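Here is a rough Python model of what the single add of incboth does to the instruction. It assumes the usual layout with the destination field in bits 17..9 and the source field in bits 8..0. The helper names are made up for illustration.

```python
FIELD = 0x1FF               # both d and s fields are 9 bits wide
INCBOTH = 1 << 9 | 4        # +1 in the d field (cog long), +4 in the s field (hub byte)


def make_fields(cog_addr, hub_addr):
    """Only the d and s fields of the wrlong; opcode bits left out of the model."""
    return ((cog_addr & FIELD) << 9) | (hub_addr & FIELD)


def split(instr):
    return (instr >> 9) & FIELD, instr & FIELD


instr = make_fields(0x040, 0x020)
for _ in range(3):
    instr += INCBOTH        # add hubwrite,incboth
after_three = split(instr)

# If the immediate hub address runs past 511, the carry spills into the d field.
overflow = split(make_fields(0x040, 0x1FC) + INCBOTH)
print(after_three, overflow)
```

After three adds the cog address has moved on by 3 longs and the hub address by 12 bytes. The second case shows why the hub block has to stay inside the 9-bit immediate range.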
CON

  _clkmode      = xtal1 + pll16x
  _xinfreq      = 5_000_000

OBJ

  top : "RealTopLevelObject"

PUB Start

  top.start(@buffer)

DAT

buffer        byte      0[256]
In this case the buffer starts at $001C, so its size will be limited to $200 - $1C bytes.
-Phil
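Spelling out that arithmetic in Python, as a trivial sketch: the $200 limit comes from the 9-bit immediate source field, and $1C is the start address quoted above.

```python
IMMEDIATE_BITS = 9
HUB_LIMIT = 1 << IMMEDIATE_BITS     # $200: first address an immediate cannot reach
BUFFER_START = 0x1C                 # start address quoted in the post

bytes_available = HUB_LIMIT - BUFFER_START
longs_available = bytes_available // 4
print(bytes_available, longs_available)
```

So with this layout the immediate-addressed buffer tops out at 484 bytes, i.e. 121 longs.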
If you had Cog RAM to spare, you could have 2 equal sized buffers: data & addresses. Pre-load the addresses once on cog load. This assumes that the hub addresses are fixed.
              movs      hubwrite,#address_table
              movd      hubwrite,#cog_buffer
              mov       counter,#bufsize
hubwrite      wrlong    0-0,0-0
              add       hubwrite,incboth
              djnz      counter,#hubwrite

incboth       long      1 << 9 | 1
Jonathan
-Phil
BTW, isn't the incboth incorrect...
incboth       long      1 << 9 | 4      'add 1 to cog address (long) and add 4 to hub address (bytes)
(I think the immediate use of the hub is still addressed as bytes - manual on other laptop)
Phil: IIRC the compiler re-arranges the DAT code so that it will not likely be in the lower 512 longs of hub. However, at least homespun or bst (or maybe both) has an option to reserve space after the mandatory $10? longs in hub.
I used the immediate hub address for some code quite a while ago. It's a relatively unknown (or thought of) idea.
As for the tiny hubwrite loop, a
              movi      hubwrite,#%000010_000   'set to writelong
or
              movi      hubwrite,#%000010_001   'set to readlong
before calling the routine will make it a read or write loop. Below is even better though.
So, here is an improved, extended solution (1 extra instruction used per block for read, 2 for write), but I saved 2 (movs & movd):
'NOTE: Requires the hub block to be in first 512 bytes (warning... more complex than you think)
readblock     mov       hubloop,hubread                 'setup to readlong
writeblock    mov       counter,#bufsize
hubloop       wrlong    0-0,0-0                         'setup to readlong/writelong on entry/exit respectively
              add       hubloop,incboth                 'add 1 to cog (longs) and 4 to hub (bytes)
              djnz      counter,#hubloop
              mov       hubloop,hubwrite                'preset to writelong (for next time)
              ret

counter       long      0
incboth       long      1 << 9 | 4                      'add 1 to cog (longs) and 4 to hub (bytes)
hubwrite      wrlong    cog_buffer,#address_table       '\ used as initialisation instruction format
hubread       rdlong    cog_buffer,#address_table       '/
Now to work out an easy way to force the buffer into the lower 512 bytes of hub. For this we will have to ask the compiler masters, Brad & Michael.
Another option (apart from the hub-below-0x200) would be to let one cog write its data to a fixed hub address. Two displacer cogs will take it from there and distribute it to wherever you want.
     |---------------|---------------|---------------|---------------|---------------|---------------|----
cog0 W               W               W               W               W               W               W
cog1 R               W               R               W               R               W               R
cog2 N               R               W               R               W               R               W
write loop (fragment)
              wrlong    0-0, fixed
              add       $-1, dst_plus_1
              djnz      length, #$-2
displacer loop (fragment)
              rdlong    temp, fixed
              wrlong    temp, buffer
              add       buffer, #8
              djnz      half_of_length, #$-3
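A toy Python model of the hand-off may make the scheme easier to follow. It ignores real hub timing entirely and just assumes, as in the diagram, that cog1 picks up every even-numbered long and cog2 every odd one, each stepping its own hub pointer by 8 bytes. The function name `displace` is mine.

```python
def displace(cog_buf, hub, dest):
    """Writer cog posts each long to one fixed slot; two displacers move them on."""
    pointers = {1: dest, 2: dest + 4}    # cog1 starts at dest, cog2 one long later
    for i, value in enumerate(cog_buf):
        fixed = value                    # cog0: wrlong 0-0, fixed
        reader = 1 if i % 2 == 0 else 2  # displacers alternate hub windows
        temp = fixed                     # rdlong temp, fixed
        hub[pointers[reader] // 4] = temp    # wrlong temp, buffer
        pointers[reader] += 8            # add buffer, #8
    return hub


source = list(range(500, 516))           # 16 longs
hub_ram = displace(source, [0] * 32, dest=32)
print(hub_ram[8:24])
```

Each displacer handles half of the longs, and between them they rebuild the buffer contiguously at the destination.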
Post Edited (kuroneko) : 12/4/2009 3:04:07 AM GMT
I know it puts VAR variables somewhere else, but I'm pretty sure the DAT block remains contiguous with the object's Spin code. I could be wrong, though.
-Phil
I'm down to trying to see if I can pick a frqb value such that the counter wraps around and increments by 4 once every 16 clocks, i.e., set frqb := ($FFFF_FFFF+4)/4 and hope that the remainder won't skew the count. I've tried this, but so far it doesn't work. I think I've been staring at it too long; time to stop and try again later.
I've attached my code for anyone interested in taking a crack at it. Be prepared for a helping of vexation.
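A possible explanation, sketched in Python under the assumption that phsb simply adds frqb on every clock (the "always" mode used earlier in the thread) and wraps at 32 bits: over 16 clocks the net change is 16*frqb mod 2^32, which is always a multiple of 16, so a net step of 4 looks unreachable. The helper names are made up.

```python
from math import gcd

MOD = 1 << 32            # phsb is a 32-bit accumulator
WINDOW = 16              # clocks between hub accesses


def phs_step(frq, clocks=WINDOW):
    """Net change of phsb over one hub window, assuming +frq on every clock."""
    return (frq * clocks) % MOD


def reachable(step, clocks=WINDOW):
    """Is there any frq with frq * clocks == step (mod 2^32)?"""
    return step % gcd(clocks, MOD) == 0


print(reachable(4), reachable(16))
print(phs_step(0x4000_0000), phs_step(0x4000_0001))
```

In this model a frq near 2^32/4 gives a net step of 0 or 16 per window, never 4. Modes that gate the accumulation with a pin are outside what this sketch covers.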
@Cluso99: I've been staring at your overlay code snippet and I'm just not getting it. Is there a thread somewhere that discusses this further?
I now have another power supply for my 2nd laptop (old one died) so I can run both - am now off to see how I can force the compiler to place the buffer in lower hub.
BR: Overlay - Use heaters link. There are examples and descriptions in the code. Heater uses it in the ZiCog. BTW acknowledgements are in the code including PhiPi, hippy and others.
Yes, I was looking at the counters and like you, I couldn't see a way without wasting a pin... and I never seem to have a spare :(
kuroneko, lonesock and PhiPi seem to be the counter masters. (added PhiPi as I just saw his explanations on another thread - nice Phil)
Post Edited (Cluso99) : 12/4/2009 5:38:00 AM GMT
DAT

cog           jmp       #cogstart

cogbuf        long      0[bufsize-1]
cogbufend     long      0

cogstart      mov       hubbufend,#bufsize-1
              shl       hubbufend,#2
              add       hubbufend,par

'... transfer

              mov       hubptr,hubbufend        'Start at par + 4 * bufsize - 4.
              mov       cogptr,#cogbufend       'Start at cogbufend == bufsize.
xferlp        wrlong    cogptr,hubptr           'Last wrlong is from cogbuf == 1, to par
              sub       hubptr,#4
              djnz      cogptr,#xferlp
Of course, the downside is that you can't use res to allocate the cog buffer.
See below.
-Phil
Post Edited (Phil Pilgrim (PhiPi)) : 12/4/2009 6:26:48 AM GMT
-Phil
A list of pointers to the routines, etc come first
DAT's are next
PUB's and PRI's are next
VAR's are placed at the end
So, the buffer needs to be in the first DAT. So use a DAT just for the buffer. Use spin to find the hub address and "poke" it into the PASM routine before loading or pass it via a par parameter.
Nope, that's correct.
Each object consists of
- Method Table
- DAT block (contiguous and not restructured or tampered with in any way)
- Spin Methods
Lather, Rinse, Repeat for all objects.. then
- VAR block. Object by object with each objects variables sorted in LONG/WORD/BYTE order.
So the DAT block in the top object is as close to the start of the hub as possible without losing the spin method table.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
If you always do what you always did, you always get what you always got.
-Phil
If you want to make it minimal, have the top object contain just a DAT section and one method to point to the next sub-object. So the method table will include one spin method and one sub-object. Then you can just behave as normal under that, and the DAT section in your top object can reserve the memory required for you.
-Phil
Yep, just take into account that each spin method and each object instance consumes 4 bytes from that.
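As a back-of-envelope check in Python: assuming a 16-byte image header at the start of hub and a 4-byte header at the start of the object's method table (both assumptions on my part), one method plus one sub-object reproduces the $001C start address quoted earlier in the thread. `dat_start` is a hypothetical helper.

```python
IMAGE_HEADER = 0x10      # assumed bytes before the top object starts
OBJECT_HEADER = 4        # assumed first long of the method table
ENTRY_SIZE = 4           # bytes per spin method or sub-object entry


def dat_start(methods, sub_objects):
    """Estimated hub address of the top object's DAT block."""
    return IMAGE_HEADER + OBJECT_HEADER + ENTRY_SIZE * (methods + sub_objects)


minimal = dat_start(methods=1, sub_objects=1)
bigger = dat_start(methods=3, sub_objects=2)
print(hex(minimal), hex(bigger))
```

Every extra method or object instance in the top object pushes the DAT block up by another 4 bytes, which is why the minimal top object is attractive here.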