Assembly Overlay Loader for Cog FAST (renamed & released)

Cluso99 · 2008-06-09 23:25

Edited 14 June 2008 - Solved in post below

I am trying to find a way to shift a variable length of longs between (to and from) cog and hub memory while maintaining the "sweet spot". I have looked at the LMM threads.

This code misses the sweet spot by one instruction and so takes 32 cycles instead of 16...

''      loading overlay - going forward
overlay_load
              movd      load,#0                         'initialise cog ptr
              nop                                       'delay for pipeline
load          rdlong    0-0,hptr                        'read long from hub ram
              add       load,d_inc                      'increment cog pointer
              add       hptr,#4                         'increment hub pointer by 1 long
              djnz      hlen,#load                      'repeat for entire buffer              
              jmp       #0                              'go execute the overlay

''      loading overlay - going backwards
overlay_load
              movd      load,hlen                       'initialise cog ptr
              nop                                       'delay for pipeline
load          rdlong    0-0,hptr                        'read long from hub ram
              sub       load,d_inc                      'decrement buffer pointer
              sub       hptr,#4                         'decrement hub memory pointer   
              djnz      hlen,#load
              jmp       #0                              'jump to overlay (address $000)

hptr          long      0-0                             'hub ram overlay xx end address
hlen          long      0-0                             'hub ram overlay xx length
d_inc         long      $0200                           'decrement destination (source) by 1

The same applies when moving data the other way (using wrlong instead of rdlong).
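For readers counting clocks, the miss can be seen by annotating the forward loop. This is only a sketch under stated assumptions: the loop is already synchronised to the cog's hub window (one window every 16 clocks), each non-hub instruction takes 4 clocks, and the copy goes into a hypothetical 8-long buffer `buf` rather than to cog address 0 so the fragment can run on its own. The names `start`, `entry`, `done` and `buf` are illustrative and not from the posted code.

```spin
PUB start(hubAddr)
  cognew(@entry, hubAddr)                               'hubAddr: long-aligned hub source (assumption)

DAT
              org       0
entry         movd      load,#buf                       'cog destination (buf is a hypothetical test buffer)
              mov       hptr,par                        'hub source address passed in PAR
              mov       hlen,#8                         'number of longs to copy
load          rdlong    0-0,hptr                        'hub access; the next window opens 16 clocks after this one
              add       load,d_inc                      '1st instruction after rdlong: still in time
              add       hptr,#4                         '2nd instruction after rdlong: still in time
              djnz      hlen,#load                      '3rd instruction: window just missed, next rdlong stalls
done          jmp       #done                           'park the cog

d_inc         long      $0200                           'adds 1 to the destination field
hptr          res       1
hlen          res       1
buf           res       8
```

With three instructions between hub accesses each pass costs two windows (32 clocks). With only two it would cost one window (16 clocks), which is why one instruction has to go.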

I can take into account that the pipeline is already loaded when the instruction is fetched, if that helps.

So the challenge: can anyone find a way to replace the two adds (or subs) with a single instruction? It may utilise the djnz instruction.

FYI - I am trying to build an assembler overlay model (similar to computers in the '70s, when memory was precious).

Post Edited (Cluso99) : 6/14/2008 10:54:12 AM GMT

Paul Baker · 2008-06-09 23:52

Unfortunately in instances where two pointers must be incremented there isn't a way to squeeze the necessary operations into accessing the hub every 16 cycles. I tried desperately to overcome this when writing the logic analyzer application but found it impossible.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Cluso99 · 2008-06-10 00:02

Paul:

Thanks. I've been looking at this off and on for a few weeks.

I did get a Logic Analyser sampling at 12.5 ns.

Your Logic Analyser inspired me to give it a go! It uses 4 cogs, giving 1880 samples at rates of 12.5 ns, 50 ns, 100 ns and 200 ns, then in 50 ns increments thereafter. It outputs in *.spin Unicode format so you can see the waveforms in the Propeller Tool.

kuroneko · 2008-06-10 00:45

Cluso99 said...
I've been looking at this off and on for a few weeks.

I played around with this a while ago and it is in fact possible. But it comes at a cost (both counters and a link bit). Basically what you do is use PHSB as the hub pointer and adjust it by 4 every 16 cycles*, using CTRA as enabler (i.e. CTRA running in 6.25% DUTY, CTRB in LOGIC A mode).

So once set up, you could leave the counters running and just reload the address at the beginning of each transfer. This way you only have the setup cost when you start the cog. This obviously depends on how and when you do the transfer.

*) enforced by the hub access window
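One possible reading of the counter arrangement described above is sketched below. It shows the counter configuration only and is not kuroneko's actual code: the pin number `LINK_PIN` and the names `start`, `entry`, `halt`, `link_mask` and `duty_frq` are assumptions, and the step that aligns the PHSB update with the hub window is deliberately left out.

```spin
CON
  LINK_PIN = 0                                          'hypothetical pin linking CTRA's output to CTRB's input

PUB start(hubAddr)
  cognew(@entry, hubAddr)                               'hubAddr: long-aligned hub source (assumption)

DAT
              org       0
entry         or        dira,link_mask                  'CTRA can only drive the pin if it is an output
              mov       frqa,duty_frq                   'PHSA overflows once every 16 clocks (6.25% duty)
              mov       frqb,#4                         'PHSB advances by one long each time the pin is high
              movs      ctra,#LINK_PIN                  'APIN of CTRA = link pin
              movs      ctrb,#LINK_PIN                  'APIN of CTRB = link pin
              movi      ctrb,#%0_11010_000              'CTRB: LOGIC A mode (accumulate while pin is high)
              movi      ctra,#%0_00110_000              'CTRA: DUTY single-ended mode
              mov       phsb,par                        'load the hub pointer; it now steps by 4 every 16 clocks
halt          jmp       #halt                           'a real loop would use "rdlong 0-0,phsb" here

link_mask     long      |< LINK_PIN
duty_frq      long      $1000_0000                      '2^32 / 16
```

The point of the arrangement is that the hub pointer is advanced by hardware, so the copy loop no longer needs an instruction for it.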

Post Edited (kuroneko) : 6/10/2008 5:04:58 AM GMT

Cluso99 · 2008-06-10 06:00

Thanks kuroneko. I don't know much about the counter so will have a look later.

I'm trying to decide how to have a table of overlays with their hub addresses and length.

So I've posted the code to get any comments.

Also I am not sure if I want to call the overlay loader or jump to it.

kuroneko · 2008-06-10 06:01

Cluso99 said...
I don't know much about the counter so will have a look later.

I can send you an example if you need one.

Cluso99 · 2008-06-10 06:05

kuroneko,

That would be fantastic, thanks

kuroneko · 2008-06-10 07:19

Cluso99 said...
That would be fantastic ...

Attached.

To show how it works, it adds up 8 longs, the start address of which is passed in PAR. If the sum is correct it lights up one LED; if the timing is correct, another (Hydra gamepad LEDs). Timing obviously depends on the hub access window, so there is a 0..15 cycle penalty depending on code location. The source also includes the missed-slot-by-one-instruction version for comparison. For real-world usage just get rid of the timing code and keep only the counter setup, sync and workload. If anything is not clear, just ask.

Post Edited (kuroneko) : 6/10/2008 7:26:04 AM GMT

Cluso99 · 2008-06-10 10:34

Kuroneko: Thanks very much for the code. I now understand a lot more about the counters. I tried to find a way without sacrificing a pin and I couldn't. I would like to make the overlayer independent, so for the moment I will skip using a pin, but will keep it in mind in case the speed is ultimately required. Thanks once again.

Does anyone know how to code a long so that the value stored by the assembler will be the actual hub address of the assembly routine?

Ale · 2008-06-10 12:53

Something like

myptr         long      @myfunction

that would use the address in HUB if you have that long in a DAT section... if I got you correctly
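A caveat worth illustrating: a `long @label` in a DAT section holds an offset from the start of the object, not an absolute hub address, so the object base still has to be added at run time with Spin's `@@` operator. The sketch below assumes a table of two longs per overlay (address, then length in longs), matching the 8-byte entries discussed in this thread. `NUM_OVERLAYS`, `overlay_table`, `loader` and `overlay0` are hypothetical names.

```spin
CON
  NUM_OVERLAYS = 1                                      'hypothetical number of table entries

PUB start | i
  'Convert each stored offset to an absolute hub address. Do this once only:
  'running it a second time would add the object base twice.
  repeat i from 0 to NUM_OVERLAYS - 1
    overlay_table[i * 2] := @@overlay_table[i * 2]
  cognew(@loader, @overlay_table)                       'table base arrives in PAR

DAT
overlay_table long      @overlay0, 2                    'offset of overlay 0, length in longs

              org       0
loader        jmp       #loader                         'placeholder for the real overlay loader

              org       0
overlay0      nop                                       'placeholder overlay body (2 longs)
              jmp       #overlay0
```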

hippy · 2008-06-10 14:56

In the testing I did, overlay then execute turned out to be quite a lot slower than simply LMM sequential execution. It does of course depend on whether it's just straight linear code or looping. In my cases it turned out better not to overlay: take the hit when looping was involved, for higher gains when it wasn't.

kuroneko's use of CTR looks very interesting. Thanks.

Ken Peterson · 2008-06-10 15:14

Nice solution, kuroneko. This is one for the toolbox!

Paul Baker · 2008-06-10 16:31

Very clever kuroneko. Because the destination field does not support indirection, this technique can only be used to autoincrement the source address (the hub pointer). But only one instruction needs to be trimmed, and this fits the bill.

Phil Pilgrim (PhiPi) · 2008-06-10 17:02

For reading code into the overlay area, you can also take advantage of the trick I used in the reverse LMM. In the code fragment that follows, the DJNZ is used both to decrement the hub address and to loop back. This continues until the first instruction of the overlay overwrites the DJNZ, which automatically begins execution of the overlay code.

CON

  OVERLAYSIZE   = 256

PUB start

DAT

'At this point, hubaddr points to the beginning of the overlay code in the hub,
'and count contains the number of instructions (must be an odd number) to transfer.

              sub       count,#1           'cogaddr := overlay + count - 1
              mov       cogaddr,#overlay
              add       cogaddr,count
              shl       count,#2           'hubaddr += (count - 1) * 4
              add       hubaddr,count
              mov       overlay,overlay0   'Copy djnz instruction to head of overlay area.
              movd      xfer1,cogaddr      'Insert cog addresses into rdlong instructions.
              sub       cogaddr,#1
              movd      xfer0,cogaddr
              jmp       #xfer1

xfer0         rdlong    0-0,hubaddr        'Transfer the next instruction.
              sub       xfer0,_0x400       'Decrement destination address by 2.
              sub       hubaddr,#7         'Decrement the hub address by 7 (actually 4).
xfer1         rdlong    0-0,hubaddr        'Transfer the next instruction.
              sub       xfer1,_0x400       'Decrement destination address by 2.
overlay       long      0[OVERLAYSIZE]     'Either decrement the hub address and jump back
                                           '  to xfer0, or execute the first instruction
                                           '  in the overlay.
              
overlay0      djnz      hubaddr,#xfer0
_0x400        long      $400

cogaddr       res       1
hubaddr       res       1
count         res       1

-Phil

Update: Corrected some errors in the code.

Post Edited (Phil Pilgrim (PhiPi)) : 6/10/2008 9:59:42 PM GMT
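The address arithmetic in Phil's code can be followed with a short trace. The values are hypothetical: a 5-long overlay stored in the hub at $0100..$0113, so after the setup code hubaddr = $0100 + (5 - 1) * 4 = $0110. The trick relies on rdlong ignoring the two low bits of the hub address, so that the -1 from djnz and the -7 from sub each step back by exactly one long.

```spin
'Trace of hubaddr for a hypothetical 5-long overlay at hub $0100..$0113.
'rdlong uses hubaddr with the two low bits ignored.
'
'  instruction               hubaddr    long read     cog destination
'  ------------------------  ---------  ------------  ---------------
'  jmp  #xfer1               $0110
'  xfer1 rdlong              $0110      $0110         overlay+4
'  djnz hubaddr,#xfer0       $010F
'  xfer0 rdlong              $010F      $010C         overlay+3
'  sub  hubaddr,#7           $0108
'  xfer1 rdlong              $0108      $0108         overlay+2
'  djnz hubaddr,#xfer0       $0107
'  xfer0 rdlong              $0107      $0104         overlay+1
'  sub  hubaddr,#7           $0100
'  xfer1 rdlong              $0100      $0100         overlay+0 (replaces the djnz)
'
'Every rdlong is followed by exactly two 4-clock instructions, so each
'transfer fits in one 16-clock hub window.
```

The final transfer has to come from xfer1 so that the newly loaded first instruction is the next thing executed, which is why the count must be odd.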

Cluso99 · 2008-06-11 13:12

Well, here is the Overlay Code (full code attached).

Many thanks to kuroneko and Phil Pilgrim (PhiPi) for your code to hit the "sweet spot". For now I just wanted to get a sample out for comments, so I am not hitting it yet and each long takes 32 cycles to transfer. The code can also be modified to transfer blocks of data between cog and hub memory.
Kuroneko's option uses the two counters and a pin. Phil's option means the overlay cannot begin at $0 - I am not sure this is a problem anyway.

I was blown away by Kuroneko's use of the counters.

Hippy: Yes it depends entirely on the application as to how many loops, etc are actually used. I had a look at the LMM thread while doing this. I liked your concepts.

Here's how to call an overlay...

              mov       olay,#2                         'overlay## =2 
              jmp       #overlay_load

Here's how the overlay loader works...

              org       _overlay_loader                 'WARNING: see fill_it above if moved !!  
                                                         
overlay_next  add       olay,#1                         'inc overlay##
overlay_load  mov       ptr,olay                        'copy overlay## (0 based)
              shl       ptr,#3                          'x8 to offset to correct overlay##
              add       ptr,par                         'add base of hub overlay_table              
              rdlong    hptr,ptr                        'get the overlay## hub address
              add       ptr,#4                          'point to overlay## length
              rdlong    hlen,ptr                        'get the overlay## hub length
overlay_copy  rdlong    0-0,hptr                        'copy longs from hub to cog
              add       overlay_copy,d_inc              'modify (inc) cog addresses $0200
              add       hptr,#4                         'increment hub pointer by 1 long
              djnz      hlen,#overlay_copy              'next
              movd      overlay_copy,#0                 'reset cog pointer (for next time)
              jmp       #0                              'execute overlay              
overlay_load_ret
overlay_next_ret
              ret                                       'for storing a return address/jump   (not used yet)
olay          long      0-0                             'overlay## (0...n-1) number
ptr           long      0-0
hptr          long      0-0                             'hub ram pointer
hlen          long      0-0                             'overlay length
d_inc         long      $0000_0200                      'increment destination by 1

Comments please...

and enjoy...

Post Edited (Cluso99) : 6/11/2008 1:25:44 PM GMT

Cluso99 · 2008-06-14 10:47

Release of OverlayLoader for Cog assembler code.

This code is a concept for fast overlaying of assembler routines within a COG. It achieves the "sweet spot" of 16 clock cycles per instruction (long) loaded, plus an overhead of about 36 clocks per overlay. (1 clock = 12.5 ns at 80 MHz)

The load section is posted below and the complete file OverlayLoader.spin is attached.

OVERLAY_LOAD
              mov       OVERLAY_START,_djnz0            'Copy djnz instruction to head of overlay area.
              movd      overlay_copy2,overlay_par       'move cog END address into rdlong instruction
              sub       overlay_par,#1                  'decrement cog End address by 1
              movd      overlay_copy1,overlay_par       'move cog END-1 address into rdlong instruction
              shr       overlay_par,#16                 'extract the overlay## hub END address (remove cog address)
              test      overlay_par,#3 wz               'z if length is odd (nz if length is even)
        if_z  jmp       #overlay_odd                    'j if length is odd
overlay_copy2 rdlong    0-0,overlay_par                 'copy long from hub to cog   (hptr ignores last 2 bits!)
              sub       overlay_par,#7                  'decrement hub ptr by 1 long (prev by 1, now by 7)
overlay_odd   sub       overlay_copy2,_0x400            'decrement cog (destination) address by 2
overlay_copy1 rdlong    0-0,overlay_par                 'copy long from hub to cog
              sub       overlay_copy1,_0x400            'decrement cog (destination) address by 2 
'----------------------------------------------------------------------------------------------------
OVERLAY_START '<--- djnz  overlay_par,#overlay_copy2    'decrement hub ptr by 1 long (now by 1, next by 7)

Enjoy
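As a rough worked example of the figures quoted in this post (16 clocks per long plus about 36 clocks of overhead), the constants below work out the load time for a hypothetical 100-long overlay. `OVERLAY_LONGS` and the other names are illustrative only, and the 36-clock overhead is the approximate figure given above.

```spin
CON
  CLOCKS_PER_LONG = 16                                  'sweet-spot rate quoted above
  OVERHEAD_CLOCKS = 36                                  'approximate per-overlay overhead quoted above
  OVERLAY_LONGS   = 100                                 'hypothetical overlay size

  LOAD_CLOCKS     = OVERLAY_LONGS * CLOCKS_PER_LONG + OVERHEAD_CLOCKS   '1636 clocks
  SLOW_CLOCKS     = OVERLAY_LONGS * 32                                  '3200 clocks at 32 clocks per long

PUB loadClocks
  return LOAD_CLOCKS
```

At 12.5 ns per clock that is 1636 * 12.5 ns = 20.45 us, against 40 us for the copy loop alone at 32 clocks per long.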

Cluso99 · 2008-08-14 14:45

Here is my latest version. It still catches the 16-clock sweet spot. The setup overhead has been reduced by requiring that an even number of longs be loaded. The C and Z flags are preserved (untouched). Otherwise, operation is as described for the previous version's code.

OVERLAY_LOAD
              mov       OVERLAY_START,_djnz0            'Copy djnz instruction to head of overlay area.
              movd      overlay_copy2,overlay_par       'move cog END address into rdlong instruction
              sub       overlay_par,#1                  'decrement cog End address by 1
              movd      overlay_copy1,overlay_par       'move cog END-1 address into rdlong instruction
              shr       overlay_par,#16                 'extract the overlay## hub END address (remove cog address)
overlay_copy2 rdlong    0-0,overlay_par                 'copy long from hub to cog   (hptr ignores last 2 bits!)
              sub       overlay_par,#7                  'decrement hub ptr by 1 long (prev by 1, now by 7)
              sub       overlay_copy2,_0x400            'decrement cog (destination) address by 2
overlay_copy1 rdlong    0-0,overlay_par                 'copy long from hub to cog
              sub       overlay_copy1,_0x400            'decrement cog (destination) address by 2 
'----------------------------------------------------------------------------------------------------
OVERLAY_START '<--- djnz  overlay_par,#overlay_copy2    'decrement hub ptr by 1 long (now by 1, next by 7)
              '^^^ The above instruction is moved in by the loader and is overwritten by the first overlay
              '    instruction during the final overlay load (rdlong) instruction. (loading is done in reverse)     
'----------------------------------------------------------------------------------------------------