Fast minimum-footprint bytecode interpreter
cgracey
Posts: 14,206
After OPC, Potatohead mentioned a great idea to me: in-line assembly for Spin.
Currently, to execute assembly language, you must write a whole PASM program, launch it with COGNEW, then communicate with it via memory. This is cumbersome and requires a whole extra cog to do any PASM at all.
If the Spin interpreter footprint could be made small by making it modular, there would be room in the cog for assembly to be loaded and executed, all within the context of Spin. Then you could get the benefit of assembly-language speed within your Spin code, not to mention being able to ease into PASM, without having to use a whole separate cog.
The key is making a modular interpreter that loads snippets of interpreter code as needed, based on compact Spin byte codes. Most snippets are under four instructions, so they could load very quickly with RDLONGC. The total size of all snippets could be far in excess of the cog's 512 registers, though, as they reside in hub RAM and are dynamically loaded and executed, as needed.
Anyway, after Potatohead talked to me about this, my head's been on fire and I've been thinking about how to make a small-footprint bytecode interpreter so that in-line assembly could be practical. It turns out the core interpreter only needs ~14 instructions, plus perhaps space for ~32 dynamic instructions, to accommodate potentially-large snippets. And it's actually faster to do this than to branch to one of many cog-resident routines.
'
' Get byte code, look up descriptor long, load code longs, execute code longs, repeat
'
nextbyte  rdbyte   x,ptra++        'get next byte code
          shl      x,#2            'convert byte code to descriptor address
          add      x,descbase
          setptrb  codebase        'point PTRB to code base
          rdlong   x,x             'get long descriptor
          movd     :reps,x         'descriptor[8..0] is number of code longs
          shr      x,#7            'descriptor[15..7] is code offset
          addptrb  x               'point PTRB to code longs
:reps     reps     #1,#1           'ready for fast in-line code load
          setindb  #code           'point INDB to in-line code
          rdlongc  indb++,ptrb++   'load code longs in-line using cached read
          mov      indb++,jmpback  'finish in-line code with jmp back
          shr      x,#9            'descriptor[31..16] is for code use

code      long     0[32]           'dynamic in-line code, loops to nextbyte

descbase  long     $1000           'address of 256 descriptor longs
codebase  long     $1400           'address of code longs

jmpback   jmp      #nextbyte

x         long     0               'variable
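To make the decode concrete, here is a worked example of a table entry. The bytecode value, snippet size, and offset are hypothetical, and it follows the bit fields as commented above; note that the count field [8..0] and the offset field [15..7] overlap at bits 8..7, which is acknowledged as a bug further down the thread (this particular example happens to avoid the collision):

' descriptor for hypothetical bytecode $05: a 3-long snippet at
' byte offset $48 from codebase, with no extra data in the top word:
'
'   descriptor = ($48 << 7) | 3 = $2403
'
' it lives at descbase + ($05 << 2) = $1014, and during decode:
'
'   movd :reps,x    patches the REPS count to 3
'   shr  x,#7       leaves $48 for ADDPTRB
'   shr  x,#9       leaves descriptor[31..16] = 0 for the snippet's use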
This is just the interpreter. To load in-line assembly, it could be $000..$1AF based, with Spin local variables sitting at $1B0..$1BF, or so. The top of memory is the interpreter and snippet space, while the stack RAM gets used for the Spin run-time stack. By exploiting Spin for its scoping and modularity, this framework might be an easy way to write extensive PASM apps.
Here is how in-line assembly could look:
PRI addvars(a, b)

  ASM
        mov     result,a
        add     result,b
  ENDASM
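A slightly larger, hypothetical sketch in the same style, assuming locals can be named from inside the block and that looping is allowed (both points come up in the comments below):

PRI sumlongs(ptr, count) | t

  ASM
        mov     result,#0
:loop   rdlong  t,ptr           'fetch a hub long through a pointer parameter
        add     result,t
        add     ptr,#4          'advance to the next long
        djnz    count,#:loop    'repeat until count is exhausted
  ENDASM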
Comments
It can be a nice addition to Cluso's Debugger.
Thanks
LMM code and inline (overlay) code each have their advantages, which depend upon the actual code being executed. Once you have a loop, inline/overlay is faster; otherwise LMM will usually win.
Perhaps by using tasking, performance could be improved by running two tasks: one reading ahead and fetching the next opcode and overlay, while the other executes the previously fetched and loaded bytecode.
This might even have some use with Peter's Tachyon.
You're right about the bit ranges being wrong. I need to fix that.
To run in-line assembler, I would run a snippet to load the code and set the context, then call the loaded code. After that, I would update the local variables from the copies that the assembler code used.
For short sequences of instructions, it seems faster to load the code in-line and execute it than to determine and execute a branch to somewhere else.
Can the in-line PASM not use the addresses that Spin uses, as needed, to save the moves for results and globals?
Also, perhaps a more complex mixed example of Spin + PASM would help in #1.
In your example above, you access SPIN arguments from PASM. In SPIN 1 the first 9 longs were cached in COG memory; I think it would be worthwhile to set aside 16 longs for local storage. Then, when you execute inline code, the compiler would do a fixup and adjust the D or S offsets to point to the local copy of the variable. This would mean that only the first 16 parameters and local variables would be accessible via inline PASM, which would be a nice limit.
Are you intent on limiting instructions only to those that are non-branching, or are you open to allowing any snippet that can fit in 2^5 instructions to run?
I think the latter makes this feature more useful and functional.
I would also recommend not using descbase the way you're doing it; use the stack RAM instead. You have 256 longs of stack, and stack use alone shouldn't need that much space, so put your descriptor table in stack.
Use SPA for the stack and SPB for descriptor lookup. It only costs 2 instructions to access this, whereas you spend a lot more than that in SPIN 1 to do the same thing.
I'd put descriptors in $80-$FF. If you allocate 16 longs for immediate access in SPIN, then you can put the overflow on the stack. In SPIN 1 you only had 64 opcodes, so having 128 should be useful in SPIN 2.
For single PASM instruction snippets, you get 4 hub windows per "interpreted" instruction, and the same rate appears to hold for up to 4 instructions in the snippet - assuming 1 clock per PASM instruction and if the snippets themselves are quad aligned which should be possible. That is then effectively up to 5, 10, 15, or 20 raw PASM MIPs (for 1, 2, 3, 4 instruction snippet sizes) at 160MHz and the hub RAM could certainly hold a fairly large byte code program and any associated PASM code snippets.
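The arithmetic behind those figures, assuming one hub window every 8 clocks:

' 4 hub windows/bytecode * 8 clocks/window = 32 clocks per bytecode
' 160,000,000 clocks/s / 32 clocks = 5M bytecodes/s
' 1, 2, 3, 4 snippet instructions -> 5, 10, 15, 20 MIPS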
Plus you can always have some instruction codes jumping off to COG-resident functions that are executed more frequently or are more complex, to avoid the extra load overhead once the snippets get bigger. I guess that is how the SPIN part generally works (though I am less familiar with that aspect).
Cool!
It seems to me that because locals are kept inside the cog, they are an easy conduit for assembly code. Globals live in hub RAM and are not directly accessible in cog RAM, so if a local is used to pass a global variable's address to PASM code, that would be about as straightforward as could be.
Using stack RAM for descriptor lookup would only save a few clocks, but would tie up a huge resource (stack RAM).
Fixups wouldn't be necessary, as you could quickly relocate the 16 locals (result, parameters, variables) to fixed addresses which would be directly addressable by PASM code. Afterwards, you would restore them. There would be 16 cycles of overhead for each move in each direction. Not really a big deal, and it makes for a very simple addressing rule: 'result' would always be at, say, $1B0, with parameters and variables following. Your PASM code would load at $000 and could continue, potentially, to $1AF. No rules on what instructions you use (looping, etc.). When you are done, you jump to $1C0, or allow the jump to $1C0 tacked onto the end of your code to execute.
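Drawn as a map, the proposed layout would be (addresses as described above; the code at $1C0 would be the interpreter's restore-and-resume sequence):

' $000..$1AF  in-line PASM, ORG'd at $000
' $1B0        RESULT, copied in from stack RAM, copied back afterward
' $1B1..$1BF  parameters, then local variables, in declaration order
' $1C0        jump target that returns control to the interpreter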
When I say fixups, I mean that variables in a local scope would be automatically address-adjusted at compile time, so you wouldn't need to refer to statically located registers by names such as r14, r15, r11. The registers defined and scoped by DAT should be possible too, but maybe not.
The stack RAM is 256 entries; how do you expect all of it to get used? I looked at the code in the SPIN 1 interpreter: indirect addressing takes self-modifying code and about 4 instructions, but with the stack it only takes 2, without pipeline restrictions.
I'm trying to think of ways to make the SPIN 2 interpreter as fast as possible. The code density of the new instruction set is already 30% better than the first Prop.
Fast is always nice, but this thread's approach makes small quite important too, as that buys room for more PASM, which suddenly gets a LOT faster (not just 30%~50%).
You are right that keeping all the balls in the air, re variable addresses and lifetimes, will be tricky.
The failure modes can get quite subtle here.
I guess another tiny block runs the debugger / COG sniffer?
I'm thinking that the stack RAM will hold the run-time stack, which is subroutine addresses, local variables, and math workspace. I think hard-limiting PUB/PRI's to 16 result/parameters/locals will allow almost 16 call depths, which should be plenty. Better to use stack RAM than cog RAM for that business. The cog RAM stays flexible.
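For reference, the call-depth arithmetic:

' 256 stack-RAM longs / 16 longs per frame = 16 frames,
' less a little for return addresses and math workspace,
' hence "almost 16 call depths"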
One advantage to using part of the stack for lookup is using RETx to cause a jump directly.
I was thinking that it might be nice if the memory were laid out like below; then you could set INDB to jmpback and use the post-decrement forms of INDB and PTRB to read the instructions. In this manner the inline PASM and locals could share a heap. If you want pure SPIN and have a lot of local storage, you get 47 longs, but if you want inline PASM, you can have up to 47 instructions; it's up to the programmer to balance the usage.
result   long  0
locals   long  0[15]
pasm     long  0[32]
jmpback  jmp   #nextbyte
The jmp is put at the very end so that you just back up however many instructions you load, then the jmp acts as a sentinel at the end of memory. It's a little messy, but you could SETINDB to #jmpback, then SETPTRB to the code offset + code size, and back down the reads with PTRB-- and INDB--.
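A sketch of that backwards load, reusing the instructions from Chip's forward version; the decrementing forms of INDB and PTRB are assumed to mirror the incrementing ones:

        setindb #jmpback-1      'start just below the jmp sentinel
        setptrb x               'x = code offset + code size, less one long
:reps   reps    #1,#1           'count patched from the descriptor, as before
        rdlongc indb--,ptrb--   'fill downward toward pasm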
Also, since you use indentation to delineate blocks, why not keep with convention and omit the ENDASM keyword?
As a matter of historical convention, I like the keyword INLINE for inline assembly, it is just neat to me.
Because indentation is a bad convention to follow...
- but ASM is clearer, as it actually states what the language change is, and what is inside the delimiters is clearly ASM.
Clarity helps, especially if English is not your first language.
It could even be PASM..ENDPASM on that basis, then someone knows to reach for the PASM manual.
Your suggestion about using PASM is a good one. In my xbasic interpreter I also allowed inline bytecode assembly, so ASM might be ambiguous: does it refer to Spin bytecodes or to PASM instructions?
Same here.
This 4-byte program loops every 64 clocks, which is pretty quick.
My argument for the use of whitespace as a block delimiter was to ensure the language addition was orthogonal; it looked like it was designed as a block element from day one. Having a closing block keyword just totally throws "conventional" language design into a language which is largely absent of it. Furthermore, you would then have to document exceptions that say "when you put code inside this block construct, whitespace doesn't matter, but everywhere else you must maintain strict order".
It's analogous to "do as I say, not as I do". You can't tell people to simultaneously adhere to strict structuring and then relax that structuring just for one element.
I am aware that PASM within a DAT block is entirely open and has no whitespace restrictions, but there is a clear line drawn between SPIN and PASM at that point, not so when embedded inline.
Embedded PASM is the gateway drug for new programmers; they dip their toe in the pool with easy inline assembler as a stepping stone to writing full-on COG code.
I do feel that inline PASM has much broader implications for the whole language -- it allows very fast and compact code without wasting a single COG. This will allow hybrid drivers for a lot of peripherals, opening the way for the drivers to be resource-free, again multiplying the useful number of COGs.
While on that subject, I feel it would also be nice to expose the task-switching and multi-threading capabilities in SPIN as well. The ability to launch COGLETs using these constructs would be very powerful.
The SPIN interpreter could be written to have yield points within it and allow task switching to occur periodically, supporting perhaps 1-3 COGLETs for some small periodic code, or come to think of it, perhaps the resources can be shared such that between 1 and 4 SPIN interpreters can execute simultaneously per COG.
How can't you tell people?
Seriously, inside the block, it's PASM and you get to do PASM things, which are not SPIN things. Outside the block, it's SPIN, and you get to do SPIN things. Done this way, all the syntax, etc... surrounding PASM is entirely valid within a PASM block, making PASM consistent everywhere it appears.
We aren't merging SPIN and PASM, so much as providing an access right to the metal where needed.
Correct, and I'd make it PASM..ENDPASM to make it clear even to non-English programmers.
It allows direct and predictable cut/paste of PASM (and conditional ASM?), without suddenly finding your paste accidentally triggered an unexpected indent-inferred boundary.
Chip
64 clocks seems wrong to me, so I looked a bit closer. I think you loop only the last 2 bytecode instructions, so you have a lot of POPs without PUSHes. This only works because the POPs return zero and you toggle pin 0. To loop all the instructions, the last bytecode should be 2,$03, and then it takes 96 clocks. And 96 is the best case, when the whole bytecode fits in a quad; if you cross quad boundaries it takes 112 clocks (just add another long before the bytecode to test this).
I like the flexibility of this bytecode interpreter, and I think it runs about 10 times faster on a 160MHz Prop2 than Spin on the Prop 1, but an optimized Spin2 interpreter in a cog would maybe be 20..30 times faster.
Another concern I have is the hub-memory footprint of this bytecode interpreter. I think it will need more than 4 KB of hub RAM for all the Spin functions. The descriptor table alone needs 1 KB, and then it will be hard to share subroutines among the overlays you load. Maybe a few native subroutines, in addition to the interpreter loop in cog RAM, would help to minimize the hub footprint.
Andy
Very astute, Andy! I thought it was running too fast, too, but I errantly attributed it to the RDBYTEC instructions. You're right that the jump distance needs to be 3, not 2. It was a fluke that it was running correctly.
I figure, too, that the hub footprint will be about 4KB. Many little snippets could be combined into general-case snippets, with some conditional execution, but that will slow things down.
I like the idea of having just the bytecode expander/loader in cog RAM, as it leaves lots of memory free for inline assembly that conveniently ORGs to $000.
In fact, it's possible for the compiler to generate the specific snippets from "general case" ones.
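As a hypothetical illustration of that specialization, a general-case snippet might use conditional execution where a compiler-generated one would not (tos here is an assumed top-of-stack register):

' general-case snippet: add or subtract, selected by a flag the expander sets
if_c    add     tos,x
if_nc   sub     tos,x

' compiler-specialized snippet for the add bytecode alone
        add     tos,x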