Assembly Overlay Loader for Cog FAST (renamed & released) - Parallax Forums

Cluso99 Posts: 18,069
edited 2008-08-14 14:45 in Propeller 1
Edited 14 June 2008 - Solved in post below

I am trying to find a way to shift a variable length of longs between (to and from) cog and hub memory while maintaining the "sweet spot". I have looked at the LMM threads.

This code misses the sweet spot by one instruction and so takes 32 cycles instead of 16...
''      loading overlay - going forward
overlay_load
              movd      load,#0                         'initialise cog ptr
              nop                                       'delay for pipeline
load          rdlong    0-0,hptr                        'read long from hub ram
              add       load,d_inc                      'increment cog pointer
              add       hptr,#4                         'increment hub pointer by 1 long
              djnz      hlen,#load                      'repeat for entire buffer              
              jmp       #0                              'go execute the overlay

''      loading overlay - going backwards
overlay_load
              movd      load,hlen                       'initialise cog ptr
              nop                                       'delay for pipeline
load          rdlong    0-0,hptr                        'read long from hub ram
              sub       load,d_inc                      'decrement buffer pointer
              sub       hptr,#4                         'decrement hub memory pointer   
              djnz      hlen,#load
              jmp       #0                              'jump to overlay (address $000)

hptr          long      0-0                             'hub ram overlay xx end address
hlen          long      0-0                             'hub ram overlay xx length
d_inc         long      $0200                           'adds/subtracts 1 in the destination field (bit 9)


The same applies when moving data the other way (using wrlong instead of rdlong).

I can take into account whether the pipeline is already loaded when the instruction is fetched, if that helps.

So the challenge: can anyone find a way to replace the two adds (or subs) with a single instruction? It may utilise the djnz instruction.
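
A rough timing model shows why a single instruction makes the difference. It assumes the commonly quoted Propeller 1 figures: the hub window comes around once every 16 clocks, a synchronised rdlong occupies 8 clocks, and every other instruction in the loop (including a taken djnz) takes 4. The function name is made up for illustration, and Python is used here only as a calculator.

```python
import math

HUB_WINDOW = 16      # clocks between hub access slots for one cog
HUB_OP_MIN = 8       # clocks used by rdlong/wrlong once it is in sync (assumed)
COG_OP = 4           # clocks for an ordinary instruction or a taken djnz

def clocks_per_long(other_instructions):
    """Loop period for one hub access plus N ordinary instructions."""
    busy = HUB_OP_MIN + COG_OP * other_instructions
    # The next rdlong cannot start before the following hub window,
    # so the period rounds up to a whole number of windows.
    return math.ceil(busy / HUB_WINDOW) * HUB_WINDOW

print(clocks_per_long(3))   # add, add, djnz
print(clocks_per_long(2))   # one pointer update folded away
```

Under these assumptions the three-instruction loop needs 20 busy clocks and so waits for the second window, while a two-instruction loop fits exactly into one.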

FYI - I am trying to build an assembler overlay model (similar to computers in the 70's, when memory was precious).

Post Edited (Cluso99) : 6/14/2008 10:54:12 AM GMT

Comments

  • Paul Baker Posts: 6,351
    edited 2008-06-09 23:52
    Unfortunately in instances where two pointers must be incremented there isn't a way to squeeze the necessary operations into accessing the hub every 16 cycles. I tried desperately to overcome this when writing the logic analyzer application but found it impossible.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • Cluso99 Posts: 18,069
    edited 2008-06-10 00:02
    Paul:

    Thanks. I've been looking at this off and on for a few weeks.

    I did get a logic analyser sampling at 12.5 ns. Your logic analyser inspired me to give it a go! It uses 4 cogs, giving 1880 samples at rates of 12.5 ns, 50 ns, 100 ns and 200 ns, then in 50 ns increments thereafter. It outputs in *.spin Unicode format so you can see the waveforms in the Propeller Tool.
  • kuroneko Posts: 3,623
    edited 2008-06-10 00:45
    Cluso99 said...
    I've been looking at this off and on for a few weeks.
    I played around with this a while ago and it is in fact possible. But it comes at a cost (both counters and a link bit). Basically what you do is use PHSB as the hub pointer and adjust it by 4 every 16 cycles*, using CTRA as enabler (i.e. CTRA running in 6.25% DUTY, CTRB in LOGIC A mode).

    So once set up, you could leave the counters running and just reload the address at the beginning of each transfer. This way you only pay the setup cost when you start the cog. This obviously depends on how and when you do the transfer.

    *) enforced by the hub access window
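
A toy simulation of the counter arrangement described above, under stated assumptions: FRQA = $1000_0000 gives the 6.25% (1 in 16) duty, FRQB = 4 is the pointer step, and the one-clock pin latency between the two counters is ignored. The function and its parameters are illustrative only, not taken from the attached source.

```python
MASK32 = 0xFFFFFFFF

def simulate_counters(hub_start, clocks, frqa=0x1000_0000, frqb=4):
    """Toy model: CTRA in DUTY mode gates CTRB in LOGIC A mode via a pin."""
    phsa = 0
    phsb = hub_start
    for _ in range(clocks):
        total = phsa + frqa
        pin_high = total > MASK32        # DUTY mode drives the pin with the carry
        phsa = total & MASK32
        if pin_high:                     # LOGIC A accumulates FRQB while the pin is high
            phsb = (phsb + frqb) & MASK32
    return phsb

print(hex(simulate_counters(0x1000, 160)))
```

In this model PHSB advances by one long for every 16 clocks, which is the rate at which a synchronised rdlong consumes addresses.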

    Post Edited (kuroneko) : 6/10/2008 5:04:58 AM GMT
  • Cluso99 Posts: 18,069
    edited 2008-06-10 06:00
    Thanks kuroneko. I don't know much about the counter so will have a look later.

    I'm trying to decide how to have a table of overlays with their hub addresses and length.

    So I've posted the code to get any comments. Also, I am not sure if I want to call the overlay loader or jump to it.
  • kuroneko Posts: 3,623
    edited 2008-06-10 06:01
    Cluso99 said...
    I don't know much about the counter so will have a look later.
    I can send you an example if you need one.
  • Cluso99 Posts: 18,069
    edited 2008-06-10 06:05
    kuroneko,

    That would be fantastic, thanks.
  • kuroneko Posts: 3,623
    edited 2008-06-10 07:19
    Cluso99 said...
    That would be fantastic ...
    Attached.

    To show how it works, it adds up 8 longs the start address of which is passed in PAR. If the sum is correct, it lights up one LED, if the timing is correct, another (Hydra gamepad LEDs). Timing obviously depends on the hub access window so there is a 0..15 cycle penalty depending on code location. The source also includes the missed-slot-by-one-instruction version for comparison. For real world usage just get rid of the timing code, only keep the counter setup, sync and workload. If anything is not clear, just ask.

    Post Edited (kuroneko) : 6/10/2008 7:26:04 AM GMT
  • Cluso99 Posts: 18,069
    edited 2008-06-10 10:34
    Kuroneko: Thanks very much for the code. I now understand a lot more about the counters. I tried to find a way without sacrificing a pin and I couldn't. I would like to make the overlayer independent, so for the moment I will skip using a pin, but will keep it in mind in case the speed is ultimately required. Thanks once again.

    Does anyone know how to code a long so that the value stored by the assembler will be the actual hub address of the assembly routine?
  • Ale Posts: 2,363
    edited 2008-06-10 12:53
    Something like


    var long @myfunction

    that would use the address in HUB if you have that long in a DAT section... if I got you correctly
  • hippy Posts: 1,981
    edited 2008-06-10 14:56
    In the testing I did, overlay then execute turned out to be quite a lot slower than simply LMM sequential execution. It does of course depend on whether it's just straight linear code or looping. In my cases it turned out better to not overlay, take the hit when looping was involved with higher gains when it wasn't.

    kuroneko's use of CTR looks very interesting. Thanks.
  • Ken Peterson Posts: 806
    edited 2008-06-10 15:14
    Nice solution, kuroneko. This is one for the toolbox!
  • Paul Baker Posts: 6,351
    edited 2008-06-10 16:31
    Very clever, kuroneko. Because the destination field does not support indirection, this technique can only be used to autoincrement the source address (the hub pointer). But only one instruction needs to be trimmed, and this fits the bill.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Paul Baker
    Propeller Applications Engineer

    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-06-10 17:02
    For reading code into the overlay area, you can also take advantage of the trick I used in the reverse LMM. In the code fragment that follows, the DJNZ is used both to decrement the hub address and to loop back. This continues until the first instruction of the overlay overwrites the DJNZ, which automatically begins execution of the overlay code.

    CON
    
      OVERLAYSIZE   = 256
    
    PUB start
    
    DAT
    
    'At this point, hubaddr points to the beginning of the overlay code in the hub,
    'and count contains the number of instructions (must be an odd number) to transfer.
    
                  sub       count,#1           'cogaddr := overlay + count - 1
                  mov       cogaddr,#overlay
                  add       cogaddr,count
                  shl       count,#2           'hubaddr += (count - 1) * 4
                  add       hubaddr,count
                  mov       overlay,overlay0   'Copy djnz instruction to head of overlay area.
                  movd      xfer1,cogaddr      'Insert cog addresses into rdlong instructions.
                  sub       cogaddr,#1
                  movd      xfer0,cogaddr
                  jmp       #xfer1
    
    xfer0         rdlong    0-0,hubaddr        'Transfer the next instruction.
                  sub       xfer0,_0x400       'Decrement destination address by 2.
                  sub       hubaddr,#7         'Decrement the hub address by 7 (actually 4).
    xfer1         rdlong    0-0,hubaddr        'Transfer the next instruction.
                  sub       xfer1,_0x400       'Decrement destination address by 2.
    overlay       long      0[OVERLAYSIZE]     'Either decrement the hub address and jump back
                                               '  to xfer0, or execute the first instruction
                                               '  in the overlay.
                  
    overlay0      djnz      hubaddr,#xfer0
    _0x400        long      $400
    
    cogaddr       res       1
    hubaddr       res       1
    count         res       1
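
The alternating decrements can be checked with a short sketch. It assumes only what the comments above state: rdlong ignores the two low address bits, the djnz takes 1 off the hub address after xfer1, and the sub takes 7 off after xfer0. The function name and the start address are made up.

```python
def effective_reads(last_long_addr, count):
    """Hub long addresses touched by alternating -1 (djnz) and -7 (sub) steps."""
    addr = last_long_addr
    reads = []
    for i in range(count):
        reads.append(addr & ~3)          # rdlong ignores the two low address bits
        addr -= 1 if i % 2 == 0 else 7   # djnz after xfer1, sub #7 after xfer0
    return reads

print([hex(a) for a in effective_reads(0x0120, 5)])
```

Each read lands exactly one long below the previous one, so the pair of steps (1 + 7 = 8 bytes) covers two longs per trip round the loop.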
    
    



    -Phil

    Update: Corrected some errors in the code.

    Post Edited (Phil Pilgrim (PhiPi)) : 6/10/2008 9:59:42 PM GMT
  • Cluso99 Posts: 18,069
    edited 2008-06-11 13:12
    Well, here is the overlay code (full code attached).

    Many thanks to kuroneko and Phil Pilgrim (PhiPi) for your code to hit the "sweet spot". For now I wanted to get the code out for comments, so it does not yet hit the sweet spot and takes 32 cycles per long transferred. The code can also be modified to transfer blocks of data between cog and hub memory.
    Kuroneko's option uses the two counters and a pin. Phil's option means the overlay cannot begin at $0 - I am not sure this is a problem anyway.

    I was blown away by Kuroneko's use of the counters.

    Hippy: Yes it depends entirely on the application as to how many loops, etc are actually used. I had a look at the LMM thread while doing this. I liked your concepts.

    Here's how to call an overlay...
                  mov       olay,#2                         'overlay## =2 
                  jmp       #overlay_load
    
    

    Here's how the overlay loader works...
                  org       _overlay_loader                 'WARNING: see fill_it above if moved !!  
                                                             
    overlay_next  add       olay,#1                         'inc overlay##
    overlay_load  mov       ptr,olay                        'copy overlay## (0 based)
                  shl       ptr,#3                          'x8 to offset to correct overlay##
                  add       ptr,par                         'add base of hub overlay_table              
                  rdlong    hptr,ptr                        'get the overlay## hub address
                  add       ptr,#4                          'point to overlay## length
                  rdlong    hlen,ptr                        'get the overlay## hub length
    overlay_copy  rdlong    0-0,hptr                        'copy longs from hub to cog
                  add       overlay_copy,d_inc              'modify (inc) cog addresses $0200
                  add       hptr,#4                         'increment hub pointer by 1 long
                  djnz      hlen,#overlay_copy              'next
                  movd      overlay_copy,#0                 'reset cog pointer (for next time)
                  jmp       #0                              'execute overlay              
    overlay_load_ret
    overlay_next_ret
                  ret                                       'for storing a return address/jump   (not used yet)
    olay          long      0-0                             'overlay## (0...n-1) number
    ptr           long      0-0
    hptr          long      0-0                             'hub ram pointer
    hlen          long      0-0                             'overlay length
    d_inc         long      $0000_0200                      'increment destination by 1
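
The table lookup in the loader (shift the overlay number left by 3, add the table base, read two consecutive longs) implies 8-byte entries holding a hub address followed by a length. A minimal sketch of that layout, assuming little-endian longs as the hub stores them; the helper names and the table values are invented for illustration.

```python
import struct

ENTRY_SIZE = 8   # two longs per overlay: hub address, then length in longs

def build_table(overlays):
    """Pack (hub_address, length_in_longs) pairs as little-endian longs."""
    return b"".join(struct.pack("<II", addr, length) for addr, length in overlays)

def lookup(table, olay):
    """Mirror the loader: offset = olay << 3, then read two consecutive longs."""
    offset = olay << 3
    return struct.unpack_from("<II", table, offset)

table = build_table([(0x0400, 32), (0x0480, 20), (0x04D0, 48)])   # made-up values
print(lookup(table, 2))
```

Keeping each entry at a power-of-two size is what lets the loader find it with a single shl instead of a multiply.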
    
    

    Comments please... and enjoy...


    Post Edited (Cluso99) : 6/11/2008 1:25:44 PM GMT
  • Cluso99 Posts: 18,069
    edited 2008-06-14 10:47
    Release of OverlayLoader for Cog assembler code.

    This code is a concept for fast overlaying of assembler routines within a cog. It achieves the "sweet spot", loading code at 16 clock cycles per instruction (long), plus an overhead of roughly 36 clocks per overlay (1 clock = 12.5 ns @ 80 MHz).
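
To put those figures in perspective, here is the arithmetic for a few overlay sizes. The constants are the ones quoted above (the 36-clock overhead is approximate); the function name and the chosen sizes are just for illustration.

```python
CLOCK_NS = 12.5        # one clock at 80 MHz
CLOCKS_PER_LONG = 16   # one hub window per long
OVERHEAD_CLOCKS = 36   # approximate setup cost quoted in the post

def load_time_ns(longs):
    """Estimated time to load an overlay of the given length, in nanoseconds."""
    return (CLOCKS_PER_LONG * longs + OVERHEAD_CLOCKS) * CLOCK_NS

for n in (16, 64, 256):
    print(n, "longs:", load_time_ns(n) / 1000, "us")
```

For anything beyond a handful of longs the fixed overhead is small next to the 16 clocks per long.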

    The load section is posted below and the complete file OverlayLoader.spin is attached.
    OVERLAY_LOAD
                  mov       OVERLAY_START,_djnz0            'Copy djnz instruction to head of overlay area.
                  movd      overlay_copy2,overlay_par       'move cog END address into rdlong instruction
                  sub       overlay_par,#1                  'decrement cog End address by 1
                  movd      overlay_copy1,overlay_par       'move cog END-1 address into rdlong instruction
                  shr       overlay_par,#16                 'extract the overlay## hub END address (remove cog address)
                  test      overlay_par,#3 wz               'z if length is odd (nz if length is even)
            if_z  jmp       #overlay_odd                    'j if length is odd
    overlay_copy2 rdlong    0-0,overlay_par                 'copy long from hub to cog   (hptr ignores last 2 bits!)
                  sub       overlay_par,#7                  'decrement hub ptr by 1 long (prev by 1, now by 7)
    overlay_odd   sub       overlay_copy2,_0x400            'decrement cog (destination) address by 2
    overlay_copy1 rdlong    0-0,overlay_par                 'copy long from hub to cog
                  sub       overlay_copy1,_0x400            'decrement cog (destination) address by 2 
    '
    OVERLAY_START '<--- djnz  overlay_par,#overlay_copy2    'decrement hub ptr by 1 long (now by 1, next by 7)
    
    

    Enjoy
  • Cluso99 Posts: 18,069
    edited 2008-08-14 14:45
    Here is my latest version. It still catches the 16-clock sweet spot. The setup overhead has been reduced by requiring that an even number of longs be loaded. The C and Z flags are preserved (untouched). Otherwise, operation is as described for the previous version's code.
    OVERLAY_LOAD
                  mov       OVERLAY_START,_djnz0            'Copy djnz instruction to head of overlay area.
                  movd      overlay_copy2,overlay_par       'move cog END address into rdlong instruction
                  sub       overlay_par,#1                  'decrement cog End address by 1
                  movd      overlay_copy1,overlay_par       'move cog END-1 address into rdlong instruction
                  shr       overlay_par,#16                 'extract the overlay## hub END address (remove cog address)
    overlay_copy2 rdlong    0-0,overlay_par                 'copy long from hub to cog   (hptr ignores last 2 bits!)
                  sub       overlay_par,#7                  'decrement hub ptr by 1 long (prev by 1, now by 7)
                  sub       overlay_copy2,_0x400            'decrement cog (destination) address by 2
    overlay_copy1 rdlong    0-0,overlay_par                 'copy long from hub to cog
                  sub       overlay_copy1,_0x400            'decrement cog (destination) address by 2 
    '--------------------------------------------------------------------------------------------------
    OVERLAY_START '<--- djnz  overlay_par,#overlay_copy2    'decrement hub ptr by 1 long (now by 1, next by 7)
                  '^^^ The above instruction is moved in by the loader and is overwritten by the first overlay
                  '    instruction during the final overlay load (rdlong) instruction. (loading is done in reverse)     
    '--------------------------------------------------------------------------------------------------
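
Going by the comments in the loader, overlay_par carries two values in one long: the cog END address in the low bits (movd uses only the low 9 bits of its source) and the hub END address in the upper 16 bits (recovered by shr #16). A sketch of that packing follows; the helper names and example values are hypothetical, and any convention for the low two bits of the hub address is deliberately left out.

```python
def pack_overlay_par(cog_end, hub_end):
    """Hypothetical helper: hub END address in bits 31..16, cog END address in the low bits."""
    assert 0 <= cog_end < 0x200 and 0 <= hub_end < 0x10000
    return (hub_end << 16) | cog_end

def unpack_overlay_par(value):
    """Split the long the way the loader does."""
    cog_end = value & 0x1FF      # movd copies only the low 9 bits of its source
    hub_end = value >> 16        # shr overlay_par,#16
    return cog_end, hub_end

par = pack_overlay_par(0x07F, 0x1233)   # made-up addresses
print(hex(par), unpack_overlay_par(par))
```

Packing both addresses into one long means a caller only has to set a single register before jumping to the loader.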
    
    