Will this work as a fast overlay loader?

lonesock · 2009-12-09 20:15

Hi, everybody.

I have some untested code, but was hoping for some comments from those who have already done this sort of thing. It uses a counter PHSx to auto-increment the Hub address pointer, meaning the overlays themselves need to have the instructions spaced out every 4 longs. You can interleave 4 different overlays, of course, to minimize lost Hub RAM. If this idea doesn't look sound, I don't want to spend a bunch of time on a pre-processor or SPIN code to do the interleaving automatically.
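
The "every 4 longs" spacing follows from the loop timing. A rough PC-side sketch in Python, assuming (this is not shown in the post) that counter B is set up to add FRQB = 1 to PHSB on every system clock, and using the usual Propeller 1 figures of one hub slot per 16 clocks, at least 8 clocks per rdlong, and 4 clocks for add and for a taken djnz:

```python
import math

HUB_WINDOW_CLOCKS = 16   # a cog gets one hub access slot every 16 system clocks
RDLONG_MIN_CLOCKS = 8    # rdlong needs at least 8 clocks once it has its slot
PLAIN_CLOCKS = 4         # add, and djnz when the jump is taken

def loop_clocks(other_instructions=2):
    """Clocks per pass of a loop made of one rdlong plus N plain instructions."""
    busy = RDLONG_MIN_CLOCKS + other_instructions * PLAIN_CLOCKS
    # each pass has to wait for the next hub slot, so round up to a full window
    return math.ceil(busy / HUB_WINDOW_CLOCKS) * HUB_WINDOW_CLOCKS

def hub_stride_longs(frq=1, other_instructions=2):
    """How many longs the PHSB-based hub pointer advances per pass (frq is assumed)."""
    bytes_per_pass = loop_clocks(other_instructions) * frq
    return bytes_per_pass // 4

print(loop_clocks())        # 16
print(hub_stride_longs())   # 4
```

With rdlong/add/djnz the pass takes 16 clocks, so the pointer moves 16 bytes, i.e. 4 longs, per instruction loaded.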

{{
  Jonathan Dummer
  untested overlay code
}}
PUB dummy

DAT
ORG 0

{{
  Inputs
    * overlay_pointer   - Hub location of the start of the overlay code
    * overlay_longs     - number of instructions in this overlay

  The actual overlay instructions must be every 4 longs, because we are
  using a counter to auto increment the Hub Pointer.  So, feel free to
  interlace 4 different overlay functions.  I should probably write some
  SPIN code to do the interleaving, unless it's in a pre-processor which
  would have to run on the PC.

  Warning: Overwrites the Z flag! 
}}
QuadOverlay
        ' Did we already load this?
        cmp overlay_pointer,last_ovr_pointer wz
if_z    jmp #Overlay_Execute
        ' save the address for next time, and find the end location
        mov last_ovr_pointer,overlay_pointer
        mov ovr_jump_return,overlay_longs
        add ovr_jump_return,overlay_address
        mov ovr_jump_return,jump_instruction
        ' start writing to the overlay's cog address
        movd Overlay_Load,overlay_address
        ' and start here in the Hub
        mov phsb,overlay_pointer
Overlay_Load
        rdlong 0-0,phsb
        add Overlay_Load,incDest
        djnz overlay_longs,#Overlay_Load 
Overlay_Execute
        jmpret overlay_address,ovr_jump_return        
QuadOverlay_ret
        ret

{===== PASM Initialized variables, constants, or Parameters =====}
overlay_address         long 0                  ' where in cog to start the overlay            
incDest                 long 512                ' add 1 to the Destination of an instruction                
last_ovr_pointer        long -1                 ' this is guaranteed to not be valid
jump_instruction        jmp #0                  ' We'll add this to the end of the overlay

{===== PASM Scratch Variables =====}
overlay_pointer         res 1                   ' Set this to the hub address of the 1st overlay inst.
overlay_longs           res 1                   ' How many instructions
ovr_jump_return         res 1                   ' the address of the end of this overlay
        
FIT 496

Thanks in advance,
Jonathan

edit: fixed the line "mov phsb,overlay_pointer"

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
lonesock
Piranha are people too.

Post Edited (lonesock) : 12/11/2009 4:53:40 PM GMT

kuroneko · 2009-12-09 23:34

From a quick glance, Overlay_Execute is back to front (src is jump target) and your jump return insertion needs indirection. Otherwise it should work. Having said that, with this amount of code you might as well use the sub #7/djnz approach (PhiPi/Cluso99¹). It has the added benefit that the overlay doesn't have to be interleaved.

¹http://forums.parallax.com/showthread.php?p=730815

Post Edited (kuroneko) : 12/9/2009 11:59:51 PM GMT

lonesock · 2009-12-10 00:33

Thanks, kuroneko.

Half of the code size in the above post is a lazy overlay/execute. This lets me say (the equivalent of) "Overlay & execute Mult_32_32", and call the routine. If this was not the last overlay loaded, then the load is executed; otherwise it just jumps to the overlay and executes it (a total of 6 instructions of overhead to execute: call QuadOvr, cmp address, jmp, jmpret, ret_from_overlay, ret from QuadOvr). So I never have to explicitly track the last overlay function I called. I did not see a similar mechanism in the code that Cluso posted in his overlay thread (though, of course, this is probably because he doesn't need/use that feature [8^). If the lazy-overlay stuff is removed, the code is just 5 instructions (not including the initial set-Hub-address, and set-number-of-longs).

thanks,
Jonathan

kuroneko · 2009-12-10 00:39

Fair enough. I guess it's all down to specific requirements. Although the 4n+m requirement might turn out a bit messy when you have to interleave overlays of different sizes :)
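
To make the different-sizes point concrete, here is a minimal sketch of what a PC-side interleaving pre-processor might do. It is written in Python, and the function name `interleave` and the pad value are hypothetical, not taken from anyone's code in this thread. Overlay k lands on hub longs k, k+4, k+8, ..., and the shorter overlays leave padding behind.

```python
def interleave(overlays, pad=0):
    """Interleave up to four overlays with a stride of 4 longs.

    overlays: list of 1..4 lists of 32-bit instruction values.
    Returns (image, wasted): the combined hub image and the number of pad longs.
    """
    if not 1 <= len(overlays) <= 4:
        raise ValueError("expected 1 to 4 overlays")
    longest = max(len(o) for o in overlays)
    image = [pad] * (4 * longest)
    for lane, overlay in enumerate(overlays):
        for i, value in enumerate(overlay):
            # instruction i of overlay 'lane' sits 4 longs after instruction i-1
            image[4 * i + lane] = value
    wasted = len(image) - sum(len(o) for o in overlays)
    return image, wasted

image, wasted = interleave([[1, 2, 3], [4], [5, 6]])
print(image)   # [1, 4, 5, 0, 2, 0, 6, 0, 3, 0, 0, 0]
print(wasted)  # 6
```

In this sketch the hub start address of overlay k is the image base plus 4*k bytes. Three overlays of 3, 1 and 2 longs occupy 12 hub longs, half of them padding.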

heater · 2009-12-10 00:51

Interesting,

The Phil/Cluso overlay loader is 5 instructions in the start up and then 3 instructions per LONG loaded. Minus one, because the last LONG loaded overwrites the DJNZ.

So if you are getting down to 2 instructions in the start up you are winning.
The Phil/Cluso loader requires an even number of LONGS to be loaded. As yours does not, that's 3 instructions shaved off the loading time.

The Phil/Cluso loader loop is 6 instructions, loading two longs each time. Yours is only 3.

Looks like 7 LONGS of code space saved. Starts to sound very interesting. ZiCog needs all the LONGS it can get.

In the emulation engine I don't think we want the overhead of lazy loads. There are lots of overlays being swapped all the time. It might cost more to do the checking than it saves on not loading occasionally.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

Cluso99 · 2009-12-10 00:58

lonesock:

It has already been done and tested here (heater uses it in ZiCog). Acknowledgements in the code. Any improvements/suggestions appreciated.

Assembly Overlay Loader for Cog FAST: http://forums.parallax.com/showthread.php?p=730815

It can be used as a call or just execute it. It checks if already loaded. The addresses and size are computed and optimised prior so it runs the fastest. The overlay has to be a multiple of 2 longs and code will automatically fill the wasted long if needed.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:

· Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
· Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz MultiBladeProp is: www.bluemagic.biz/cluso.htm

heater · 2009-12-10 01:05

Cluso, looking in OVERLAY_LOAD in ZiCog I don't see any checking for overlay already loaded. Is that a new feature or did we hack that out for ZiCog?

Cluso99 · 2009-12-10 08:25

I was sure I had it there at one stage but maybe I am wrong. It is not hard to do, but for short overlays probably not worth it.

heater · 2009-12-10 19:01

Cluso, I don't think we miss lazy loading in the emulator, even if the loaded chunks were short. I have a feeling that as Z80 code executes it will thrash from overlay to overlay more than it needs the same overlay twice. Perhaps we should experiment to confirm this for some code, BBC BASIC for example, one day.

So the questions are: Does lonesock's overlay loader work? Does it save any execution time? (seems not). Does it save any space in the COG? Is it worth the effort to change given the need for a pre-processor?

Post Edited (heater) : 12/11/2009 5:04:45 PM GMT

lonesock · 2009-12-11 16:50

There is another way to have the loader know how many longs to transfer. If you flag the end of the overlay code in Hub RAM by placing a "long 0", you do not have to set the number of longs to read, and instead of a djnz (which requires an extra 4 clocks to fall through), you simply use an "if_nz jmp #whatever". This will be slower, though, because the final pass where you are reading the stopping 0 value requires 16 clocks for the rdlong, then that is followed by the add and the (falling through) jmp. This would save on a table of overlay lengths, and any overhead associated with reading that table...so whether this would be a net win depends on the surrounding code.
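
A rough way to weigh that trade-off, as a Python sketch rather than a measurement. It assumes each loop pass is hub-synchronised at 16 clocks, that a djnz which falls through costs 8 clocks instead of 4, and that an untaken if_nz jmp costs 4 clocks. `setup_clocks_saved` is a hypothetical stand-in for whatever the surrounding code spends fetching the overlay length.

```python
HUB_PASS = 16  # clocks per hub-synchronised pass: rdlong plus two 4-clock instructions

def counted_loop_clocks(n_longs):
    # n passes; the last djnz falls through and takes 4 clocks more than a taken one
    return n_longs * HUB_PASS + 4

def sentinel_loop_clocks(n_longs):
    # one extra pass to read the terminating 0; the untaken if_nz jmp costs 4 clocks
    return (n_longs + 1) * HUB_PASS

def sentinel_penalty(n_longs, setup_clocks_saved=0):
    """Positive result: the zero-terminated loader is slower by that many clocks."""
    return (sentinel_loop_clocks(n_longs)
            - counted_loop_clocks(n_longs)
            - setup_clocks_saved)

print(sentinel_penalty(10))                         # 12
print(sentinel_penalty(10, setup_clocks_saved=16))  # -4
```

Under these assumptions the zero-terminated loop costs a fixed 12 clocks more per load, so it only comes out ahead if it saves more than that in length-table handling.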

This would mean a few different things:

1 - You would require the return jmp call to be part of the overlay instruction set inside Hub RAM, which in turn requires that the return jump be to a known location. If you have the Overlay code start at the beginning of the cog (right after the ORG 0), then you can terminate all your overlays with "jmp #6" (see requirement 3...it gets incremented by 1).

2 - you will be loading the final 0 to the cog, so you end up wasting one long of storage in the cog, and one long per overlay in the Hub.

3 - Placing the loader at location 0 also means you have to fall through gracefully on the 1st pass. For simplicity, I'll put a jmp to the entry point as the first instruction, and the overlay loader will effectively start at 1 instead of 0, so I update my overlays' final instruction to be "jmp #7".

4 - if you really need a NOP inside the overlay, use something like "if_never jmp #0".

NOTE: there was another typo in my first post (as well as my having the jmpret operands backwards): the line initializing phsb should have read "mov phsb,overlay_pointer".

        jmp #cog_entry_point
Overlay
        ' start writing to the overlay's cog address
        movd Overlay_Load,overlay_address
        ' and start here in the Hub
        mov phsb,overlay_pointer
Overlay_Load
        rdlong 0-0,phsb wz
        add Overlay_Load,incDest
if_nz   jmp #Overlay_Load
        ' OK, execute the overlay (no #: jump to the cog address held in overlay_address)
        jmp overlay_address
Overlay_ret
        ret

Jonathan

kuroneko · 2009-12-12 00:17

I was considering a nop end marker as well, however, what if the overlay code has embedded data which is/must be 0 (unlikely but you know how it is)? You wouldn't want to go out of your way to disguise that. Again, it might just work perfectly well for your requirements.
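
The embedded-zero concern can be seen in a small Python model of the zero-terminated copy loop. This is a PC-side sketch, not PASM; `load_zero_terminated` and `spread` are hypothetical helper names, and hub addresses are counted in longs.

```python
def spread(longs, stride=4):
    """Lay an overlay out in a hub image with one long every 'stride' longs."""
    hub = [0] * (stride * len(longs))
    hub[::stride] = longs
    return hub

def load_zero_terminated(hub, start, stride=4):
    """Copy hub[start], hub[start+stride], ... until a 0 has been copied."""
    cog = []
    index = start
    while True:
        value = hub[index]
        cog.append(value)        # the terminating 0 is stored in the cog as well
        if value == 0:
            return cog
        index += stride

clean = [0xA1, 0xA2, 0xA3, 0]        # three instructions plus the end marker
with_data = [0xA1, 0, 0xA3, 0]       # second long is embedded data equal to 0

print(load_zero_terminated(spread(clean), 0))      # [161, 162, 163, 0]
print(load_zero_terminated(spread(with_data), 0))  # [161, 0]
```

The second load stops at the embedded 0, so the rest of the overlay never reaches the cog.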
