Assembly Overlay Loader for Cog FAST (renamed & released)
Cluso99
Posts: 18,071
Edited 14 June 2008 - Solved in post below

I am trying to find a way to shift a variable length of longs between (to and from) cog and hub memory while maintaining the "sweet spot". I have looked at the LMM threads.
This code misses the sweet spot by one instruction and so takes 32 cycles instead of 16...
'' loading overlay - going forward
overlay_load
movd load,#0 'initialise cog ptr
nop 'delay for pipeline
load rdlong 0-0,hptr 'read long from hub ram
add load,d_inc 'increment cog pointer
add hptr,#4 'increment hub pointer by 1 long
djnz hlen,#load 'repeat for entire buffer
jmp #0 'go execute the overlay
'' loading overlay - going backwards
overlay_load
movd load,hlen 'initialise cog ptr
nop 'delay for pipeline
load rdlong 0-0,hptr 'read long from hub ram
sub load,d_inc 'decrement buffer pointer
sub hptr,#4 'decrement hub memory pointer
djnz hlen,#load
jmp #0 'jump to overlay (address $000)
hptr long 0-0 'hub ram overlay xx end address
hlen long 0-0 'hub ram overlay xx length
d_inc long $0200 'decrement destination (source) by 1
The same applies when moving data the other way (using wrlong instead of rdlong).
I can take into account whether the pipeline is already loaded when the instruction is fetched, if that helps.
So the challenge: can anyone find a way to replace the two adds (or subs) with a single instruction? It may utilise the djnz instruction.
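
The 16-versus-32 figure can be checked with a small timing model. This is a Python sketch rather than PASM, and it assumes the commonly quoted Propeller figures: a hub window every 16 clocks, 8 clocks for a hub instruction that is already in sync, and 4 clocks for any other instruction (including a taken djnz). The name clocks_per_long is hypothetical.

```python
# Assumed Propeller timing figures (not taken from this thread):
HUB_WINDOW = 16   # clocks between hub access windows for one cog
HUB_OP_SYNCED = 8 # clocks for rdlong/wrlong when it hits its window
INSTR = 4         # clocks for an ordinary instruction or a taken djnz


def clocks_per_long(other_instructions):
    """Clocks per loop pass for one hub op plus N ordinary instructions.

    The next hub op can only start on a hub window, so the busy time of
    one pass is rounded up to a whole number of windows.
    """
    busy = HUB_OP_SYNCED + INSTR * other_instructions
    windows = -(-busy // HUB_WINDOW)  # ceiling division
    return windows * HUB_WINDOW


# rdlong + add + add + djnz : three instructions between hub accesses
print(clocks_per_long(3))
# rdlong + two instructions : fits inside one window
print(clocks_per_long(2))
```

Under these assumptions only two instructions fit between consecutive hub accesses, which is why the pointer update and the loop test have to be folded together.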

FYI - I am trying to build an assembler overlay model, similar to computers in the '70s, when memory was precious.

Post Edited (Cluso99) : 6/14/2008 10:54:12 AM GMT

Comments
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Thanks. I've been looking at this off and on for a few weeks.
I did get a Logic Analyser sampling at 12.5nS
So once setup, you could leave the counters running and just reload the address at the beginning of each transfer. This way you only have the setup cost when you start the cog. This obviously depends on how and when you do the transfer.
*) enforced by the hub access window
Post Edited (kuroneko) : 6/10/2008 5:04:58 AM GMT
I'm trying to decide how to have a table of overlays with their hub addresses and length.
So I've posted the code to get any comments
That would be fantastic, thanks
To show how it works, it adds up 8 longs the start address of which is passed in PAR. If the sum is correct, it lights up one LED, if the timing is correct, another (Hydra gamepad LEDs). Timing obviously depends on the hub access window so there is a 0..15 cycle penalty depending on code location. The source also includes the missed-slot-by-one-instruction version for comparison. For real world usage just get rid of the timing code, only keep the counter setup, sync and workload. If anything is not clear, just ask.
Post Edited (kuroneko) : 6/10/2008 7:26:04 AM GMT
Does anyone know how to code a long so that the value stored by the assembler will be the actual hub address of the assembly routine?
var long @myfunction
That would use the address in hub if you have that long in a DAT section, if I understood you correctly.
kuroneko's use of CTR looks very interesting. Thanks.
CON
OVERLAYSIZE = 256
PUB start
DAT
'At this point, hubaddr points to the beginning of the overlay code in the hub,
'and count contains the number of instructions (must be an odd number) to transfer.
sub count,#1 'cogaddr := overlay + count - 1
mov cogaddr,#overlay
add cogaddr,count
shl count,#2 'hubaddr += (count - 1) * 4
add hubaddr,count
mov overlay,overlay0 'Copy djnz instruction to head of overlay area.
movd xfer1,cogaddr 'Insert cog addresses into rdlong instructions.
sub cogaddr,#1
movd xfer0,cogaddr
jmp #xfer1
xfer0 rdlong 0-0,hubaddr 'Transfer the next instruction.
sub xfer0,_0x400 'Decrement destination address by 2.
sub hubaddr,#7 'Decrement the hub address by 7 (actually 4).
xfer1 rdlong 0-0,hubaddr 'Transfer the next instruction.
sub xfer1,_0x400 'Decrement destination address by 2.
overlay long 0[OVERLAYSIZE] 'Either decrement the hub address and jump back
' to xfer0, or execute the first instruction
' in the overlay.
overlay0 djnz hubaddr,#xfer0
_0x400 long $400
cogaddr res 1
hubaddr res 1
count res 1
-Phil
Update: Corrected some errors in the code.
Post Edited (Phil Pilgrim (PhiPi)) : 6/10/2008 9:59:42 PM GMT
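
To see why decrementing the hub pointer by 7 and then by 1 still walks the hub one long at a time, the address arithmetic of Phil's loop can be modelled as below. This is a Python sketch, not PASM; it relies only on the point made in the listings that rdlong ignores the two low address bits, and the function name hub_read_sequence is hypothetical.

```python
def hub_read_sequence(last_long_addr, count):
    """Long addresses read by the xfer1/xfer0 loop, in order.

    last_long_addr: long-aligned hub address of the final overlay long
                    (where hubaddr points after the setup code).
    count:          number of longs to transfer; must be odd so that the
                    last read is done by xfer1.
    """
    assert count % 2 == 1, "count must be odd"
    assert last_long_addr % 4 == 0, "start address must be long-aligned"
    addr = last_long_addr
    reads = [addr & ~3]          # first pass enters at xfer1
    while len(reads) < count:
        addr -= 1                # djnz hubaddr,#xfer0
        reads.append(addr & ~3)  # xfer0: rdlong ignores the low 2 bits
        addr -= 7                # sub hubaddr,#7
        reads.append(addr & ~3)  # xfer1
    return reads


print([hex(a) for a in hub_read_sequence(0x1010, 5)])
```

Each pair of decrements totals 8 bytes, and the masking of the low bits makes the intermediate value land on the long in between, so every rdlong sees the next lower long.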
Many thanks to kuroneko and Phil Pilgrim (PhiPi) for your code to hit the "sweet spot". For now I wanted to get a sample posted for comments, so I am not hitting the sweet spot yet and each long takes 32 cycles to transfer. The code can also be modified to transfer blocks of data between cog and hub memory.
Kuroneko's option uses the two counters and a pin. Phil's option means the overlay cannot begin at $0 - I am not sure this is a problem anyway.
I was blown away by Kuroneko's use of the counters
Hippy: Yes it depends entirely on the application as to how many loops, etc are actually used. I had a look at the LMM thread while doing this. I liked your concepts.
Here's how to call an overlay...
mov olay,#2 'overlay## =2
jmp #overlay_load
Here's how the overlay loader works...
org _overlay_loader 'WARNING: see fill_it above if moved !!
overlay_next add olay,#1 'inc overlay##
overlay_load mov ptr,olay 'copy overlay## (0 based)
shl ptr,#3 'x8 to offset to correct overlay##
add ptr,par 'add base of hub overlay_table
rdlong hptr,ptr 'get the overlay## hub address
add ptr,#4 'point to overlay## length
rdlong hlen,ptr 'get the overlay## hub length
overlay_copy rdlong 0-0,hptr 'copy longs from hub to cog
add overlay_copy,d_inc 'modify (inc) cog addresses $0200
add hptr,#4 'increment hub pointer by 1 long
djnz hlen,#overlay_copy 'next
movd overlay_copy,#0 'reset cog pointer (for next time)
jmp #0 'execute overlay
overlay_load_ret
overlay_next_ret ret 'for storing a return address/jump (not used yet)
olay long 0-0 'overlay## (0...n-1) number
ptr long 0-0
hptr long 0-0 'hub ram pointer
hlen long 0-0 'overlay length
d_inc long $0000_0200 'increment destination by 1
Comments please...
Post Edited (Cluso99) : 6/11/2008 1:25:44 PM GMT
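
The table lookup in this loader (shift the overlay number left by 3, add PAR, read two consecutive longs) implies an 8-byte table entry holding the hub address followed by the length. A Python sketch of that layout is below; the helper names, the sample addresses and the use of a bytearray as stand-in hub RAM are all assumptions for illustration.

```python
import struct


def write_overlay_table(hub, par, entries):
    """Store (hub_address, length) pairs as two little-endian longs each."""
    for i, (address, length) in enumerate(entries):
        struct.pack_into("<II", hub, par + i * 8, address, length)


def lookup_overlay(hub, par, olay):
    """Mirror of the loader: ptr = (olay << 3) + par, then two rdlongs."""
    ptr = (olay << 3) + par
    (hptr,) = struct.unpack_from("<I", hub, ptr)      # rdlong hptr,ptr
    (hlen,) = struct.unpack_from("<I", hub, ptr + 4)  # rdlong hlen,ptr (after add ptr,#4)
    return hptr, hlen


hub = bytearray(64)                      # stand-in for hub RAM
write_overlay_table(hub, 16, [(0x0100, 10), (0x0200, 20), (0x0300, 5)])
print(lookup_overlay(hub, 16, 2))
```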
This code is a concept for fast overlaying of assembler routines within a COG. It achieves the "sweet spot", loading code in 16 clock cycles per instruction (long), plus an overhead of roughly 36 clocks per overlay. (1 clock = 12.5nS @ 80MHz)
The load section is posted below and the complete file OverlayLoader.spin is attached.
OVERLAY_LOAD mov OVERLAY_START,_djnz0 'Copy djnz instruction to head of overlay area.
movd overlay_copy2,overlay_par 'move cog END address into rdlong instruction
sub overlay_par,#1 'decrement cog End address by 1
movd overlay_copy1,overlay_par 'move cog END-1 address into rdlong instruction
shr overlay_par,#16 'extract the overlay## hub END address (remove cog address)
test overlay_par,#3 wz 'z if length is odd (nz if length is even)
if_z jmp #overlay_odd 'j if length is odd
overlay_copy2 rdlong 0-0,overlay_par 'copy long from hub to cog (hptr ignores last 2 bits!)
sub overlay_par,#7 'decrement hub ptr by 1 long (prev by 1, now by 7)
overlay_odd sub overlay_copy2,_0x400 'decrement cog (destination) address by 2
overlay_copy1 rdlong 0-0,overlay_par 'copy long from hub to cog
sub overlay_copy1,_0x400 'decrement cog (destination) address by 2
' OVERLAY_START '<--- djnz overlay_par,#overlay_copy2 'decrement hub ptr by 1 long (now by 1, next by 7)
Enjoy
OVERLAY_LOAD mov OVERLAY_START,_djnz0 'Copy djnz instruction to head of overlay area.
movd overlay_copy2,overlay_par 'move cog END address into rdlong instruction
sub overlay_par,#1 'decrement cog End address by 1
movd overlay_copy1,overlay_par 'move cog END-1 address into rdlong instruction
shr overlay_par,#16 'extract the overlay## hub END address (remove cog address)
overlay_copy2 rdlong 0-0,overlay_par 'copy long from hub to cog (hptr ignores last 2 bits!)
sub overlay_par,#7 'decrement hub ptr by 1 long (prev by 1, now by 7)
sub overlay_copy2,_0x400 'decrement cog (destination) address by 2
overlay_copy1 rdlong 0-0,overlay_par 'copy long from hub to cog
sub overlay_copy1,_0x400 'decrement cog (destination) address by 2
' OVERLAY_START '<--- djnz overlay_par,#overlay_copy2 'decrement hub ptr by 1 long (now by 1, next by 7)
'^^^ The above instruction is moved in by the loader and is overwritten by the first overlay
' instruction during the final overlay load (rdlong) instruction. (loading is done in reverse)
'
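
Going by the comments in the listing, overlay_par carries two values in one long: movd picks up the cog end address from the low 9 bits, and shr #16 leaves the hub end address. The Python sketch below models that unpacking and the effect of subtracting $400 from an instruction, using the standard PASM layout with the destination field in bits 17..9. The exact packing is an inference from the comments, and all function names are hypothetical.

```python
D_SHIFT = 9          # destination field occupies bits 17..9 of an instruction
FIELD_MASK = 0x1FF   # cog addresses are 9 bits wide


def unpack_overlay_par(value):
    """Split the packed parameter the way the loader appears to."""
    cog_end = value & FIELD_MASK        # movd uses only the low 9 bits
    hub_end = (value >> 16) & 0xFFFF    # shr overlay_par,#16
    return cog_end, hub_end


def movd(instruction, value):
    """Replace the destination field of an instruction, as movd does."""
    cleared = instruction & ~(FIELD_MASK << D_SHIFT)
    return cleared | ((value & FIELD_MASK) << D_SHIFT)


def dest_field(instruction):
    return (instruction >> D_SHIFT) & FIELD_MASK


packed = (0x7FFF << 16) | 0x0F3         # made-up example value
cog_end, hub_end = unpack_overlay_par(packed)
rdlong = movd(0, cog_end)               # opcode bits left as zero for the sketch
print(hex(cog_end), hex(hub_end), hex(dest_field(rdlong - 0x400)))
```

Subtracting $400 steps the destination by 2 because $400 is 2 shifted into the destination field; the same reasoning gives $200 for a step of 1.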