Relative jumps - are they possible?
Dr_Acula
Posts: 5,484
Brainstorming some cache ideas, I am wondering if pasm can do relative jumps - i.e., jump forward n instructions?
Of course, the compiler can - as described in the manual:

    Toggle  mov    dira, Pin          'Set I/O pin to output
            xor    outa, Pin          'Toggle I/O pin's state
            jmp    #$-1               'Loop back endlessly

The last instruction, JMP #$-1, causes execution to jump back to the second-to-last instruction (i.e., here minus 1).
But at compile time, my understanding is that the JMP #$-1 will be resolved into the s field as an absolute value (e.g., if the JMP sits at cog address $007, $-1 assembles to the absolute address $006).
There are several discussion threads on the forum that seem to suggest that relative jumps are not possible, but I'd just like to make sure.
The reason for asking is I am still pondering various cache models, with the hub used as a first-level cache, and external memory as a second-level cache, and in theory, a 32-bit flat memory space with pasm code. For cache code, it is not that hard to think of replacing all the call and jmp and ret and jmpret instructions with jumps to a fixed location in a cog that then calls a cache driver. But there are some bits of code that are time critical and which simply have to run at full speed - i.e., DJNZ loops.
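To make that concrete, a replaced call might look something like this (untested sketch, all names made up - the driver would fault the target in from external memory and jump to it):

    ' where the original code had:    call   #bigsub
    ' the cached fragment instead does (vaddr, lr and driver live at
    ' fixed cog addresses in the kernel; big_va travels with the fragment):
            mov    vaddr, big_va      'pass the routine's flat 32-bit address
            jmpret lr, #driver        'return address lands in lr's s field
    big_va  long   $12345             'made-up address in external memory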
If code is broken up into 2k segments then each segment can be precompiled and loaded in and out of a cog and that will work fine. But if the cache fragments are smaller, say 200 longs, then they need to be able to be loaded into any location in cog. And for that you need relative jumps for small loops within those code fragments.
If relative jumps are not possible, I wonder if there is another solution. Say all code fragments are always 200 longs in length. Most of my pasm code is considerably shorter than that, and indeed, most of my pasm tends to be written in subroutines that are about a page in length. But let's say it is 200. Take off some space for code overhead and maybe you can fit 8 of these in a cog. If there are 8 possible places this code could end up in a cog, what you could do is run the code fragment through the compiler 8 times, padding it with 0, 200, 400, 600, etc. dummy longs at the beginning. Then no matter where the cache driver puts this block of code, all the internal jumps and djnz's will all be correct.
The catch is that the entire program is now 8x bigger. But once you add external memory, code size is not so important. And maybe this is a way to run huge pasm programs where individual subroutines can run faster than LMM?
Thoughts would be most appreciated.
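For the padding trick, each of the 8 builds would just differ in the amount of dummy space up front - something like this for the slot-1 build (untested, 200-long slots assumed):

            org    0
            long   0[200]             'one slot's worth of dummy longs
                                      '(omit for slot 0, use 0[400] for slot 2, etc.)
    frag    mov    count, #10         'fragment proper starts here
    loop    '      ... time-critical work ...
            djnz   count, #loop       '#loop now assembles to the slot-1 address
    count   long   0

The cache driver would skip the padding when it copies the fragment in; only the assembled s and d fields matter.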
Comments
-Phil
Whatever loader brings it in would write the base address of the routine in that long; then the routine either performs the ADD,JMP sequence or modifies itself right away, perhaps using the modify instructions as storage, if there aren't too many of them.
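In code, the ADD,JMP version might look like this (just a sketch, names arbitrary):

    frag    '      ...
            mov    t1, #loop_off      'loop's offset within the fragment (a build-time constant)
            add    t1, base           'absolute cog address = load base + offset
            jmp    t1                 'indirect jump through t1's s field
    base    long   0                  'the loader writes the load address here
    t1      long   0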
@PhiPi, thanks for clarifying that the PC cannot be accessed. That does explain quite a few things (I'm used to micros where you can do tricks like pushing a value onto the call/return stack, then doing a ret, and it will jump to that location).
@localroger, that idea of having a list of all internal jumps is very clever. That would work. Just run through the code prior to compilation and work out where all the jumps are; then, when the cache loads in the code fragment, add offsets to all the s fields of the jump instructions. That list would be part of the code fragment, and if jumps are only (say) 10% of the code, that only makes the code 10% bigger. Not 8x bigger. I like it!
@kuroneko - you are way ahead of me! I'll read your code again as I think you are doing something very clever there. In general terms, I am wondering about a style of pasm code where you might declare 'global' cog variables, which stay in the cog forever (perhaps registers, or a small stack, and really common constants like zero); then you might have global hub variables, which take a little longer to load; and then, within code fragments, local variables that are part of the code and are lost when the subroutine finishes. The code might be a little less efficient than standard pasm, as some constants might get declared several times.
In general terms, I like the idea of adding all the jump offsets once when the cache fragment is loaded, rather than on the fly with self modifying code that might have to recalculate every djnz loop.
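The load-time fixup pass might be as simple as this (untested sketch, my names: fixptr points at a hub list of word offsets, fixcnt holds the list length, base is where the fragment landed in the cog):

    fixloop rdword ofs, fixptr        'next fixup: instruction offset within the fragment
            add    fixptr, #2
            add    ofs, base          'absolute cog address of that instruction
            movd   patch, ofs         'aim the patching ADD at it
            nop                       'gap: a just-modified instruction must not run next
    patch   add    0-0, base          'add the load offset to its s field (bits 0..8)
            djnz   fixcnt, #fixloop

Adding base to the whole long only bumps the s field because the target plus base stays below 512 (both are cog addresses), so nothing carries into bit 9.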
-Phil
AFAIK there are only 512 longs per cog so you would only be able to get 2 blocks of 200 longs into a cog. Did you mean 200 bytes?
For DJNZ loops perhaps it would be simpler and faster to have a "jump to cog address" instruction as one of the LMM instructions and load the loops into the cog along with the LMM code.
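For reference, the classic LMM fetch-execute loop with such a primitive sketched in (hypothetical, not anyone's actual kernel): the stream executes jmp #cogjmp, the kernel reads the next long of the stream as a cog address, and the resident loop jumps back to lmmloop when done.

    lmmloop rdlong ins, pc            'fetch the next instruction from hub
            add    pc, #4
    ins     nop                       'execute it in place
            jmp    #lmmloop

    cogjmp  rdlong jtgt, pc           'operand long in the stream = target cog address
            add    pc, #4
            jmp    jtgt               'run the resident DJNZ loop at full speed
    jtgt    long   0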
I believe what Phil said is correct... no inherent relative jumps.
However... what I have done for loading small pieces of code into arbitrary cog locations is to prepend each 16 lines of code with a single long whose bits are set/reset according to whether the loader running in the cog should or should not apply an offset to the D field and/or S field. This means there is an inefficiency of 1 long per 16 longs, which is not too bad. Also, I have a very tight loader that uses very few instructions to read the code from hub as it relocates and places each instruction. If I remember correctly, it's in the range of a 5-instruction loop.
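In rough outline, the relocating loop looks something like this (a simplified sketch with arbitrary names - the actual loop is much tighter, around 5 instructions as mentioned):

    load    rdlong flags, hubptr      'relocation bit pairs for the next 16 longs
            add    hubptr, #4
            mov    n, #16
    group   rdlong ins, hubptr        'fetch one instruction long
            add    hubptr, #4
            shr    flags, #1 wc       'first bit of the pair: relocate the S field?
     if_c   add    ins, base          'base = cog load address
            shr    flags, #1 wc       'second bit: relocate the D field?
     if_c   add    ins, based         'based = base << 9
            movd   put, dst           'aim the store at the next cog location
            nop                       'self-modify gap
    put     mov    0-0, ins           'place the instruction in the cog
            add    dst, #1
            djnz   n, #group
            jmp    #load              'next group (end test omitted)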
What I use this process for is to dynamically load and activate threads into cogs which are otherwise underutilized, letting me run 8 to 16 threads simultaneously in a single cog.
So, while relative jumps per se seem not possible, the effect you were describing can be achieved by the method I describe. Once loaded, everything runs at full speed.
Cheers,
Peter (pjv)
Yes, sorry, I meant bytes.
Ok, maybe fragments are smaller. Perhaps a page long at most (I find pasm starts to turn into spaghetti code when longer than a page).
In general terms, I suspect it is more efficient to modify code just once when loaded by the cache driver, rather than on the fly.
Which instructions need to be modified?
Local jumps, calls, ret and djnz
rdlong, wrlong (and their word and byte equivalents)
and probably some others. Actually, most of them, looking at the pasm list!
(Calls to functions from within other functions are more complex, as those will need to go out to the cache driver. So jumps and calls are divided into 'local' and 'non-local', and the compiler will have to sort that out.)
So you might have a few simple routines for the cache loader - e.g., in this block of code, take instruction n and add offset x to the d field; another routine for the s field. I think that is what the movd and movs instructions are for. In some ways, the cache loader need not know or care whether the instruction it is modifying is a rdlong or a jmp etc. - just a list of the locations of all the instructions that are to be modified, and which field is to be modified. The compiler might group them together: here are the locations of all the instructions that need their s fields modified - do all those, then all the d fields, etc.
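So each pass reduces to one self-modified instruction per field, and the loader never looks at the opcode (untested sketch; target holds the cog address of the instruction being fixed):

    ' s-field routine:
    spatch  movd   sfix, target
            nop                       'self-modify gap
    sfix    add    0-0, base          'bumps bits 0..8 (the s field)
    ' d-field routine: same, but the offset is pre-shifted into bits 9..17
    dpatch  movd   dfix, target
            nop
    dfix    add    0-0, based         'based = base << 9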
Unfortunately, as others already showed above, relative jumps are not possible in PASM.
I went through a lot of teeth gnashing (otherwise known as analysis) when I came up with LMM. I actually tried a number of different alternate methods, including several variations on paging (fixed and variable sizes and whole-cog reloads), and eventually came up with LMM, adding FCACHE and FLIB along the way (propgcc calls FLIB's routines 'kernel extensions').
You may find the following timing analysis helpful:
http://forums.parallax.com/showthread.php?134480-Propeller-GCC-Status-Update&p=1036003&viewfull=1#post1036003
Basically, paging methods fall down because in-line code is no faster when paged (and paging linear code leaves less room for loop caching). Explicitly FCACHEing loops, combined with highly optimized subroutines for things like the C mem* and str* functions (and float primitives), provides far better performance: only loops that fit in the FCACHE get cached, which greatly reduces thrashing. See the post above for a mathematical analysis.
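For flavour, an FCACHE-style primitive boils down to something like this (a simplified sketch, not my actual kernel code; buffer size and names arbitrary): copy the next cnt longs of the LMM stream into a cog buffer and run them there at full native speed.

    fcache  rdlong cnt, pc            'block length in longs follows the primitive
            add    pc, #4
            movd   fput, #fbuf        'reset the store pointer
    fload   nop                       'self-modify gap after movd / add
    fput    rdlong 0-0, pc            'copy one long of the loop into the buffer
            add    pc, #4
            add    fput, dinc         'advance the destination field
            djnz   cnt, #fload
            jmp    #fbuf              'run the cached loop at full native speed
    dinc    long   1 << 9             'd-field increment
    fbuf    res    32                 'cog buffer for cached loops

The cached block's final instruction jumps back to the LMM loop.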
Hope this helps!
Bill
I guess I'm starting with the simple assumption that if you are writing a program of more than 32k it is going to need caching of some sort from external memory. And if you are going to the bother of caching from external RAM into the hub, is the size of code in a cog (2k) exactly the optimum size? Or is it better to, say, split that into two parts? Or 4, etc. I guess it depends on the type of code that is the source - whether it is hand-coded pasm, or whether it has been created by a higher-level language.
Or do I get the prop II FPGA cog emulator board?
Reading my VMCOG thread may help:
http://forums.parallax.com/showthread.php?119725-VMCOG-Virtual-Memory-for-ZiCog-Zog-more-%28VMCOG-0.976-PropCade-TriBlade_2-HYDRA-HX512-XEDODRAM%29&highlight=vmcog
There are a lot of excellent papers out there on multi-level caching; Google is our friend.
Finding the optimal page size for a given workload is a non-trivial problem; I took the easy way out by picking 512 bytes, as that matches legacy sector sizes, and fits the Prop's architecture quite nicely.
Smaller page sizes may thrash less, but will cause more cache lookups - caveat emptor - it all depends on the code that is running.
Larger page sizes chew up more of the hub.
Hopefully I'll have time to expand VMCOG for a larger virtual space later this year.
I'll dig out, and post on my site, my original LMM papers; they may help you.
I wonder if it is time to revisit your work but with a larger ram model? I don't know the optimum size. 512k? Megabyte(s)? Historically I guess it has always been 'why would you possibly want more than x amount of ram', where x is where the chips will be in a few years' time. Maybe to get around this just go for a flat 32-bit memory space?
The clever C boffins got all their code working in these larger memory models. But maybe we can look at other languages too. Spin. Basic. Pure pasm for the hardcore programmers.
Another issue I came up against when I pushed Catalina in XMM mode to the limit (and GCC) is that the downloads take rather a long time - 200k was starting to get tedious. There might be an argument for precompiling blocks of code, putting them on an SD card, and loading those blocks in at startup; I wonder what such code might look like and whether it could be tied into something like vmcog?
Have you got some more info about vmcog and your original LMM papers? I'd be interested in reading those.