Relative jumps - are they possible?

Dr_Acula · 2012-12-02 17:36

Brainstorming some cache ideas, I am wondering if pasm can do relative jumps - ie jump forward n instructions?

Of course, the compiler can - as described in the manual

Toggle mov dira, Pin 'Set I/O pin to output
xor outa, Pin 'Toggle I/O pin's state
jmp #$-1 'Loop back endlessly
The last instruction, JMP #$-1, causes execution to jump back to the second-to-last instruction (i.e., here minus 1).

But at compile time, my understanding is that the JMP #$-1 will be resolved into the s field as an absolute value.

There are several discussion threads on the forum that seem to suggest that relative jumps are not possible, but I'd just like to make sure.

The reason for asking is I am still pondering various cache models, with the hub used as a first level cache, and external memory as a second level cache, and in theory, a 32 bit flat memory space with pasm code. For cache code, it is not that hard to think of replacing all the call and jmp and ret and jmpret instructions with jumps to a fixed location in a cog that then calls a cache driver. But there are some bits of code that are time critical and which simply have to run at full speed - ie DJNZ loops.

If code is broken up into 2k segments then each segment can be precompiled and loaded in and out of a cog and that will work fine. But if the cache fragments are smaller, say 200 longs, then they need to be able to be loaded into any location in cog. And for that you need relative jumps for small loops within those code fragments.

If relative jumps are not possible, I wonder if there is another solution. Say all code fragments are always 200 longs in length. Most of my pasm code is considerably shorter than that, and indeed, most of my pasm tends to be written in subroutines that are about a page in length. But let's say it is 200. Take off some space for code overhead and maybe you can fit 8 of these in a cog. If there are 8 possible places this code could end up in a cog, what you could do is run the code fragment through the compiler 8 times, padding it with 0,200, 400, 600 dummy longs at the beginning. Then no matter where the cache driver puts this block of code, all the internal jumps and djnz's will all be correct.

The catch is that the entire program is now 8x bigger. But once you add external memory, code size is not so important. And maybe this is a way to run huge pasm programs where individual subroutines can run faster than LMM?

Thoughts would be most appreciated.

localroger · 2012-12-02 17:45

Short answer: No, PASM jumps are always absolute. Another way to handle this would be the way DOS EXE format did, and create a header structure to be read by the loader indicating what locations would need their offsets adjusted for load location. That would be more flexible and less wasteful of memory than the 8 copies idea. But it would also take up a lot of Cog RAM if done in PASM, so you'd have to coordinate an external loader to do the offsetting before the PASM loader in the cog loads the routine for execution.

Phil Pilgrim (PhiPi) · 2012-12-02 17:46

Relative jumps, as you describe them, are not possible in PASM. This is mainly because the program counter is not an accessible register. Otherwise, statements like the following would work:

        add     pc,#8

-Phil

kuroneko · 2012-12-02 17:52

Not to mention relocation of ordinary storage instructions ... So yes, add a map to your code fragment and relocate the relevant instructions before loading them. This came up [thread=135926]before[/thread] ...

CON
  _clkmode = XTAL1|PLL16X
  _xinfreq = 5_000_000

  ofs = 57
  
VAR
  long  buffer[512]

PUB selftest | lcnt

  lcnt := (@tail - @entry) >> 2
  relocate(@entry, lcnt, @tail, ofs)
  longmove(@buffer[ofs], @entry, lcnt)
  
  coginit(cogid, @buffer{0}, 0)
    
PRI relocate(base, lcnt, info, offset) : n

  repeat lcnt
    outb := long[info][n >> 4] >> (n << 1)
    dirb := long[base][n]
    if outb[0]
      dirb[8..0]  += offset
    if outb[1]
      dirb[17..9] += offset
    long[base][n++] := dirb

DAT             org     0

entry           jmpret  outa, #$+1

                mov     dira, mask
                rev     outa, #{32-}8
                waitpeq $, #0

                hubop   $, #%10000_000
                
' initialised data and/or presets

mask            long    $00FF0000

' uninitialised data and/or temporaries

                fit

'                         FEDC BA98 7654 3210
tail            long    %%0000_0000_0002_2011

DAT

potatohead · 2012-12-02 18:00

Seems to me, you could pay a jump penalty by having code segments look at a specific COG LONG, then add to it prior to performing the JMP. Use the indirect JMP instruction, without "#"

Whatever loader brings it in would write the base address of the routine in that long, then the routine either performs the ADD, JMP sequence, or it modifies itself right away, perhaps using the modify instructions as storage, if there aren't too many of them.

Dr_Acula · 2012-12-02 18:16

Great ideas - thanks guys!

@PhiPi, thanks for clarifying the PC cannot be accessed. That does explain quite a few things (I'm used to micros where you can do tricks like push a value onto the call/return stack, and then do a ret and it will jump to that location).

@localroger, that idea of having a list of all internal jumps is very clever. That would work. Just run through the code prior to compilation and work out where all the jumps are and when the cache loads in the code fragment, add offsets to all the s fields of the jump instructions. That list would be part of the code fragment and if jumps are only (say) 10% of the code, that only makes the code 10% bigger. Not 8x bigger. I like it!

@kuroneko - you are way ahead of me! I'll read your code again as I think you are doing something very clever there. In general terms, I am wondering about a style of pasm code where you might declare 'global' cog variables, which stay in cog code forever (perhaps registers, or a small stack, and really common constants like zero), and then you might have global hub variables which take a little longer to load, and then within code fragments they have local variables that are part of the code and are lost when the subroutine finishes. The code might be a little less efficient than standard pasm as some constants might get declared several times.

In general terms, I like the idea of adding all the jump offsets once when the cache fragment is loaded, rather than on the fly with self modifying code that might have to recalculate every djnz loop.

Phil Pilgrim (PhiPi) · 2012-12-02 18:41

At the cost of an extra long per jump and a two-long overhead, you could do something like this:

mycall        call      #jmprel                 'Call relative jump routine.
              long      8                       'Amount to add to PC (offset from mycall+1).

jmprel  
jmprel_ret    add       jmprel,0-0              'jmprel src contains calladdr+1. Add contents of that address to jmprel. 
              jmp       jmprel                  'Jump to the new address.

-Phil

kwinn · 2012-12-02 18:43

@Dr_Acula

AFAIK there are only 512 longs per cog so you would only be able to get 2 blocks of 200 longs into a cog. Did you mean 200 bytes?

For DJNZ loops perhaps it would be simpler and faster to have a "jump to cog address" instruction as one of the LMM instructions and load the loops into the cog along with the LMM code.

kuroneko · 2012-12-02 18:45

Phil Pilgrim (PhiPi) wrote: »

mycall        call      #jmprel     'Call relative jump routine.
              long      8           'Amount to add to PC (offset from mycall+1).

jmprel  
jmprel_ret    add       jmprel,0-0  'jmprel src contains calladdr+1. Add contents of that address to jmprel. 
              jmp       jmprel      'Jump to the new address.

Doesn't work. The add is already fetched before its source field is modified.

pjv · 2012-12-02 18:49

Dr_Acula;

I believe what Phil said is correct.... no inherent relative jumps.

However....... what I have done for loading small pieces of code into arbitrary cog locations is to prepend each 16 lines of code with a single long which has its bits set/reset according to whether the loader running in the cog should or should not apply an offset to the D_Field and/or S-Field. This means there is an inefficiency of 1 long per 16 longs, which is not too bad. Also, I have a very tight loader that uses very few insructions to read the code from hub as it processes the need to relocate and place each instruction. If I remenber correctly, it's in the range of a 5 instruction loop.

What I use this process for is to dynamically load and activate threads into cogs which otherwise are under utilized, and let me run 8 to 16 threads simultaneously in a single cog.

So, while relative jumps per-se seem not possible, the effect you were describing can be achieved by the method I describe. Once loaded, everything runs at full speed.

Cheers,

Peter (pjv)

Phil Pilgrim (PhiPi) · 2012-12-02 18:51

kuroneko wrote:

Doesn't work. The add is already fetched before its source field is modified.

Oh, right. Well, okay then, do it this way:

mycall        call      #jmprel                 'Call relative jump routine.
              long      8                       'Amount to add to PC (offset from mycall+1).

jmprel        nop         
jmprel_ret    add       jmprel_ret,0-0          'jmprel src contains calladdr+1. Add contents of that address to jmprel. 
              jmp       jmprel_ret              'Jump to the new address.

-Phil

Dr_Acula · 2012-12-02 19:01

AFAIK there are only 512 longs per cog so you would only be able to get 2 blocks of 200 longs into a cog. Did you mean 200 bytes?

Yes, sorry, I meant bytes.

Ok, maybe fragments are smaller. Perhaps maximum a page long (I find pasm starts to turn into spaghetti code when longer than a page).

In general terms, I suspect it is more efficient to modify code just once when loaded by the cache driver, rather than on the fly.

Which instructions need to be modified?

Local jumps, calls, ret and djnz
rdlong, wrlong (and their word and byte equivalents)
and probably some others. Actually, most of them, looking at the pasm list!

(calls to functions from within other functions are more complex, as that will need to go out to the cache driver. So jumps and calls are divided into 'local' and 'non local' and the compiler will have to sort that out)

So you might have a few simple routines for the cache loader - eg in this block of code, take instruction n and add offset x to the i field. Another routine for the s field. I think that is the movd and movi instructions. In some ways, the cache loader need not know or care whether the instruction it is modifying is a rdlong or a jmp etc. Just a list of locations of all the instructions that are to be modified and which field is to be modified. The compiler might group them together - here are the locations of all the instructions that need their s fields modified. Do all those. Then do all the d fields etc.

Bill Henning · 2012-12-02 20:14

Hi Dr. Acula,

Unfortunately, as others already showed above, relative jumps are not possible in PASM.

I went through a lot of teeth gnashing (otherwise known as analysis) when I came up with LMM. I actually tried a number of different alternate methods including several variations on paging (fixed and variable sizes and whole cog reloads) and eventually came up with LMM, adding FCACHE and FLIB along the way (propgcc calls FLIB's kernel extensions).

You may find the following timing analysis helpful:

http://forums.parallax.com/showthread.php?134480-Propeller-GCC-Status-Update&p=1036003&viewfull=1#post1036003

Basically, paging methods fall down because in-line code is not any faster paged (and paging linear code leaves less room for loop caching), and explicitly FCACHING loops, combined with highly optimized subroutines for things like the C mem* and str* functions (and float primitives) provide far better performance due to reduced thrashing and better gains from only caching loops that fit in the FCACHE - which also reduced thrashing. See above post for a mathematical analysis.

Hope this helps!

Bill

Dr_Acula wrote: »
Brainstorming some cache ideas, I am wondering if pasm can do relative jumps - ie jump forward n instructions?

Of course, the compiler can - as described in the manual
Toggle mov dira, Pin 'Set I/O pin to output
xor outa, Pin 'Toggle I/O pin's state
jmp #$-1 'Loop back endlessly
The last instruction, JMP #$-1, causes execution to jump back to the second-to-last instruction (i.e., &#8216;here&#8217; minus 1).
But at compile time, my understanding is that the JMP #$-1 will be resolved into the s field as an absolute value.

There are several discussion threads on the forum that seem to suggest that relative jumps are not possible, but I'd just like to make sure.

The reason for asking is I am still pondering various cache models, with the hub used as a first level cache, and external memory as a second level cache, and in theory, a 32 bit flat memory space with pasm code. For cache code, it is not that hard to think of replacing all the call and jmp and ret and jmpret instructions with jumps to a fixed location in a cog that then calls a cache driver. But there are some bits of code that are time critical and which simply have to run at full speed - ie DJNZ loops.

If code is broken up into 2k segments then each segment can be precompiled and loaded in and out of a cog and that will work fine. But if the cache fragments are smaller, say 200 longs, then they need to be able to be loaded into any location in cog. And for that you need relative jumps for small loops within those code fragments.

If relative jumps are not possible, I wonder if there is another solution. Say all code fragments are always 200 longs in length. Most of my pasm code is considerably shorter than that, and indeed, most of my pasm tends to be written in subroutines that are about a page in length. But let's say it is 200. Take off some space for code overhead and maybe you can fit 8 of these in a cog. If there are 8 possible places this code could end up in a cog, what you could do is run the code fragment through the compiler 8 times, padding it with 0,200, 400, 600 dummy longs at the beginning. Then no matter where the cache driver puts this block of code, all the internal jumps and djnz's will all be correct.

The catch is that the entire program is now 8x bigger. But once you add external memory, code size is not so important. And maybe this is a way to run huge pasm programs where individual subroutines can run faster than LMM?

Thoughts would be most appreciated.

Bill Henning · 2012-12-02 20:32

Kuroneko reminded me that I was too pessimistic in my analysis last year - with his timer-assisted code, FCACHE could be loaded in 16 cycles per long - 33.3% faster

Dr_Acula · 2012-12-02 20:55

Thanks Bill for the links and great explanation. I'm still coming to terms with LMM and FCACHE and exactly how it all works.

I guess I'm starting with the simple assumption that if you are writing a program more than 32k it is going to need caching of some sort from external memory. And if you are going to the bother of caching from external ram into hub, is the size of code in a cog (2k) exactly the optimum size? Or is it better to, say, split that cog into to two parts. Or 4 etc. I guess it depends on the type of code that is the source code - whether it is hand coded pasm, or whether it has been created by a higher level language.

Or do I get the prop II FPGA cog emulator board?

Bill Henning · 2012-12-02 21:07

You are most welcome!

Reading my VMCOG thread may help:

http://forums.parallax.com/showthread.php?119725-VMCOG-Virtual-Memory-for-ZiCog-Zog-more-%28VMCOG-0.976-PropCade-TriBlade_2-HYDRA-HX512-XEDODRAM%29&highlight=vmcog

There are a lot of excellent papers out there on multi-level caching; Google is our friend

Finding the optimal page size for a given workload is a non-trivial problem; I took the easy way out by picking 512 bytes, as that matches legacy sector sizes, and fits the Prop's architecture quite nicely.

Smaller page sizes may thrash less, but will cause more cache lookups - cavet emptor - it all depends on the code that is running.

Larger page sizes chew up more of the hub.

Hopefully I'll have time to expand VMCOG for a larger virtual space later this year.

I'll dig out, and post on my site, my original LMM papers, they may help you.

Dr_Acula wrote: »

Thanks Bill for the links and great explanation. I'm still coming to terms with LMM and FCACHE and exactly how it all works.

I guess I'm starting with the simple assumption that if you are writing a program more than 32k it is going to need caching of some sort from external memory. And if you are going to the bother of caching from external ram into hub, is the size of code in a cog (2k) exactly the optimum size? Or is it better to, say, split that cog into to two parts. Or 4 etc. I guess it depends on the type of code that is the source code - whether it is hand coded pasm, or whether it has been created by a higher level language.

Or do I get the prop II FGPA cog emulator board?

Dr_Acula · 2012-12-03 00:32

Great reading that thread - thanks Bill. In your work you talk about the 32k SPI chips and I think you had a model based on 64k. Since then the larger SPI ram chips have come out and I think they change things. All the work I did with parallel 512k ram chips and latches always took just a bit too many propeller pins, and I think you make a good argument for using slower serial SPI ram but with smarter caching to make it appear faster.

I wonder if it is time to revisit your work but with a larger ram model? I don't know the optimum size. 512k? Megabyte(s). Historically I guess it has always been 'why would you possibly want more than x amount of ram', where x is where the chips will be in a few year's time. Maybe to get around this just go for a flat 32 bit memory space?

The clever C boffins got all their code working in these larger memory models. But maybe we can look at other languages too. Spin. Basic. Pure pasm for the hardcore programmers.

Also another issue I came up against when I pushed Catalina in XMM mode to the limit (and GCC) is the downloads take rather a long time. 200k was starting to get tedious. There might be an argument for precompiling blocks of code and putting those on an SD card and loading these blocks in at startup and I wonder what such code might look like and whether it could be tied into something like vmcog?

Have you got some more info about vmcog and your original LMM papers - I'd be interested in reading those.

Relative jumps - are they possible?

Comments