Byte Code Memory Model - BCMM
localroger
Posts: 3,452
We now have numerous compilers targeting the Large Memory Model, Bill Henning's brilliant trick by which indefinitely large PASM code can be run out of Hub RAM. LMM is great for those situations when Spin isn't fast enough and Cog RAM isn't big enough. But LMM code is also a lot bigger than Spin bytecode, and big code runs even more slowly when it's executed from off-chip external memory where you get no advantage from highly tuned LMM interpreters.
The Bye Code Memory Model is a trick I developed inspired by LMM but geared toward producing small code. It isn't nearly as fast as LMM when run from Hub RAM, but is still about five times faster than Spin; and because it's smaller, it has significant advantages when run from XMM.
I developed the BCMM for a project which I used to call Windmill and I wrote a few blogs about it back when the forum software was updated, but that project keeps morphing (and I keep getting pulled from it for other work duties). That project also has a lot of other features which have nothing to do with the fundamental trick of BCMM, so I thought it might be a good idea to present BCMM on its own. I think it could, among other things, form the basis for a very superior C bytecode interpreter.
As we know PASM instructions target a Source and Destination, both of which can actually be argument sources. There are only 64 fundamental PASM instructions, so let's do something like this:
I leave implementing the stack as an exercise to the reader. This essentially turns the entire PASM command set into a stack machine. How useful is it? Well with about 20 longs of Cog RAM it implements 35 useful instructions pretty much all the math operations. And while there's a lot more overhead there than with most LMM interpreters, there's a lot less than there is in the Spin interpreter much less most workable XMM schemes. And you have over 400 longs of Cog RAM and 192 free byte codes for helper functions.
I'll note here that while my project implements separate m- and r-stacks, one could quite easily code helper functions around a single C-style stack.
But wait, there's more!
Let's assume for the sake of argument that we have 64 longs to blow on a lookup table. (Don't worry, they won't stay blown :-) We could selectively decide whether to use the flags, pop D or push the result, like this:
Bit 31 = Rotate this bit into the instruction R field
Bit 30 = pop D before execution
Bit 29 = restore flags before execution
Of course to make use of this you'd need a table of which instructions work best with which options. Fortunately I've made it for you:
And of course, back in the PASM image
So for about 40 longs of interpreter and a 64 long table, this gets us over 50 fully implemented useful instructions, pretty much the entire PASM function set except for Hub RAM writes and HUBOP. It's still a lot faster than Spin, and not even slowing things down much on top of XMM fetches.
But wait, there's more!
You will, of course, be using all that free Cog RAM for helper functions. And unless you pull Peter Jakacki's (admittedly cool) trick from the Tachyon interpreter, you'll need a jump table. And since your jumps are only 9 bits, you can share the helper jump table with the code index!
The only problem you're likely to have with this is that you might not have 64 helper instructions. (I have 35.) Rather than waste the difference in Cog Longs, it is possible with very little additional code to fold the lookup table into 32 longs:
And that is how you can get a byte code version of nearly the entire PASM instruction set implemented and still have over 400 longs of Cog RAM for those helper functions like PASM multiply, divide, and string manipulation. This has been implemented and tested; it works, and quite well. Executing from Hub RAM it's about five times faster than Spin, and when executing out of most forms of XMM it's a lot faster than LMM simply because it's much more compact.
From this point the sky is the limit. I'm implementing a separate R-stack, but you could alternately create helper functions targeting a C or Spin style memory model. I'm also using the D field of the lookup table to index a separate helper function setup function, so this is what the start of my instr_vector really looks like:
The Bye Code Memory Model is a trick I developed inspired by LMM but geared toward producing small code. It isn't nearly as fast as LMM when run from Hub RAM, but is still about five times faster than Spin; and because it's smaller, it has significant advantages when run from XMM.
I developed the BCMM for a project which I used to call Windmill and I wrote a few blogs about it back when the forum software was updated, but that project keeps morphing (and I keep getting pulled from it for other work duties). That project also has a lot of other features which have nothing to do with the fundamental trick of BCMM, so I thought it might be a good idea to present BCMM on its own. I think it could, among other things, form the basis for a very superior C bytecode interpreter.
As we know PASM instructions target a Source and Destination, both of which can actually be argument sources. There are only 64 fundamental PASM instructions, so let's do something like this:
interpreter rdbyte i_code, i_ptr add i_ptr, #1 ' test i_code, #%1100_0000 wz if_nz jmp do_helper ' shl i_code,#3 'code into upper 6 of 9 bit INSTR or i_code,#%111 'set z,c,r movi :to_instr_cl1,i_code 'write INSTR field ' call #m_pop mov i_src,m_xfer 'pop S call #m_pop 'pop D mov i_dest,m_xfer ' :to_instr_cl1 mov i_dest,i_src wz, wc, wr 'EXECUTE THE TOKEN ' mov m_xfer, i_dest call #m_push ' jmp #interpreter do_helper: 'handle helper tokens jmp #interpreter i_code long 0 i_ptr long 0 i_src long 0 i_dest long 0
I leave implementing the stack as an exercise to the reader. This essentially turns the entire PASM command set into a stack machine. How useful is it? Well with about 20 longs of Cog RAM it implements 35 useful instructions pretty much all the math operations. And while there's a lot more overhead there than with most LMM interpreters, there's a lot less than there is in the Spin interpreter much less most workable XMM schemes. And you have over 400 longs of Cog RAM and 192 free byte codes for helper functions.
I'll note here that while my project implements separate m- and r-stacks, one could quite easily code helper functions around a single C-style stack.
But wait, there's more!
Let's assume for the sake of argument that we have 64 longs to blow on a lookup table. (Don't worry, they won't stay blown :-) We could selectively decide whether to use the flags, pop D or push the result, like this:
Bit 31 = Rotate this bit into the instruction R field
Bit 30 = pop D before execution
Bit 29 = restore flags before execution
interpreter rdbyte i_code, i_ptr add i_ptr, #1 ' test i_code, #%1100_0000 wz if_nz jmp do_helper ' mov tmp, i_code add tmp,#code_index movs :iv_get,tmp nop 'of course you'd put a useful instr here :iv_get mov i_code_desc,0-0 shl i_code,#3 'code into upper 6 of 9 bit INSTR or i_code,#%111 'set z,c,r movi :to_instr_cl1,i_code 'write INSTR field ' call #m_pop mov i_src,m_xfer 'pop S shl i_code_desc,#1 wc 'pop D? if_c call #m_pop 'assuming m_pop doesn't trash C if_c mov i_dest,m_xfer ' shl i_vec_desc,#1 wc 'vector: recover flags? if_c mov i_flags,i_flagx 'recover instead of using default test i_flags,#1 wc 'set flags test i_flags,#2 wz ' :to_instr_cl1 mov i_dest,i_src wz, wc, wr 'EXECUTE THE TOKEN ' muxc i_flagx,#1 'save flags for retrieval muxnz i_flagx,#2 'but mov i_flags,#0 'assume we're not using them ' test i_code,#1 wz 'this still contains R bit if_nz mov m_xfer, i_dest if_nz call #m_push ' jmp #interpreter do_helper: 'handle helper tokens jmp #interpreter i_code long 0 i_code_desc long 0 i_ptr long 0 i_src long 0 i_dest long 0 i_flags long 0 i_flagx long 0
Of course to make use of this you'd need a table of which instructions work best with which options. Fortunately I've made it for you:
con ' ' Native token bit fields of the Instruction Vector ' IV_R_Set = 1 << 31 IV_Pop_D = 1 << 30 IV_Flags = 1 << 29 ' IV_Normal = IV_R_Set + IV_Pop_D '2 args + result IV_Carry = IV_Normal + IV_Flags '2 args + result + use flags IV_1_Arg = IV_R_Set '1 arg + result IV_1_Carry = IV_1_Arg + IV_Flags '1 arg + result + use flags IV_No_R = IV_Pop_D '2 args no result IV_No_R_C = IV_No_R + IV_Flags '2 args no result + use flags ' ' Native token descriptors for each instruction ' IV_RDBYTE = IV_1_Arg IV_RDWORD = IV_1_Arg IV_RDLONG = IV_1_Arg IV_HUBOP = IV_Normal 'not useful ' ' %100-%111 ' not used ' IV_ROR = IV_Normal IV_ROL = IV_Normal IV_SHR = IV_Normal IV_SHL = IV_Normal IV_RCR = IV_Carry IV_RCL = IV_Carry IV_SAR = IV_Normal IV_REV = IV_Normal IV_MINS = IV_Normal IV_MAXS = IV_Normal IV_MIN = IV_Normal IV_MAX = IV_Normal IV_MOVS = IV_Normal IV_MOVD = IV_Normal IV_MOVI = IV_Normal IV_JMP = IV_1_Arg 'not useful IV_AND = IV_Normal IV_ANDN = IV_Normal IV_OR = IV_Normal IV_XOR = IV_Normal IV_MUXC = IV_Carry IV_MUXNC = IV_Carry IV_MUXZ = IV_Carry IV_MUXNZ = IV_Carry IV_ADD = IV_Normal IV_SUB = IV_Normal IV_ADDABS = IV_Normal IV_SUBABS = IV_Normal IV_SUMC = IV_Carry IV_SUMNC = IV_Carry IV_SUMZ = IV_Carry IV_SUMNZ = IV_Carry IV_MOV = IV_Normal 'functions as NIP IV_NEG = IV_1_Arg IV_ABS = IV_1_Arg IV_ABSNEG = IV_1_Arg IV_NEGC = IV_1_Carry IV_NEGNC = IV_1_Carry IV_NEGZ = IV_1_Carry IV_NEGNZ = IV_1_Carry IV_CMP = IV_No_R IV_CMPSX = IV_No_R_C IV_ADDX = IV_Carry IV_CMPX = IV_No_R_C IV_ADDS = IV_Normal IV_SUBS = IV_Normal IV_ADDSX = IV_Carry IV_SUBSX = IV_Carry IV_CMPSUB = IV_Normal IV_DJNZ = IV_1_Arg 'not useful IV_TJNZ = IV_1_Arg 'not useful IV_TJZ = IV_1_Arg 'not useful IV_WAITPEQ = IV_Pop_D IV_WAITPNE = IV_Pop_D IV_WAITCNT = IV_Normal IV_WAITVID = IV_Pop_D
And of course, back in the PASM image
code_index long IV_RDBYTE long IV_RDWORD long IV_RDLONG 'etc.
So for about 40 longs of interpreter and a 64 long table, this gets us over 50 fully implemented useful instructions, pretty much the entire PASM function set except for Hub RAM writes and HUBOP. It's still a lot faster than Spin, and not even slowing things down much on top of XMM fetches.
But wait, there's more!
You will, of course, be using all that free Cog RAM for helper functions. And unless you pull Peter Jakacki's (admittedly cool) trick from the Tachyon interpreter, you'll need a jump table. And since your jumps are only 9 bits, you can share the helper jump table with the code index!
instr_vector long IV_RDBYTE + instr_l_store long IV_RDWORD + instr_l_fetch long IV_RDLONG + instr_rstka 'etc
The only problem you're likely to have with this is that you might not have 64 helper instructions. (I have 35.) Rather than waste the difference in Cog Longs, it is possible with very little additional code to fold the lookup table into 32 longs:
... mov tmp, i_code andn tmp,#%0010_0000 'fold index add tmp,#instr_vector movs :iv_get,tmp test i_code,#%0010_0000 wz 'was index actually folded? :iv_get mov i_code_desc,0-0 if_nz shl i_code_desc,#4 'if so, shift in alternate mod bits ... ... instr_vector long IV_RDBYTE + IV_ADD >> 4 + instr_l_store long IV_RDWORD + IV_SUB >> 4 + instr_l_fetch long IV_RDLONG + IV_ADDABS >> 4 + instr_rstka 'etc.
And that is how you can get a byte code version of nearly the entire PASM instruction set implemented and still have over 400 longs of Cog RAM for those helper functions like PASM multiply, divide, and string manipulation. This has been implemented and tested; it works, and quite well. Executing from Hub RAM it's about five times faster than Spin, and when executing out of most forms of XMM it's a lot faster than LMM simply because it's much more compact.
From this point the sky is the limit. I'm implementing a separate R-stack, but you could alternately create helper functions targeting a C or Spin style memory model. I'm also using the D field of the lookup table to index a separate helper function setup function, so this is what the start of my instr_vector really looks like:
instr_vector long IV_RDBYTE + IV_ADD >> 4 + { } prep_1arg + instr_l_store << 9 long IV_RDWORD + IV_SUB >> 4 + { } prep_1arg + instr_l_fetch << 9 long IV_RDLONG + IV_ADDABS >> 4 + { } prep_1arg + instr_rstka << 9 'etc.
Comments
The PropGCC CMM instruction set is moderately complicated, so the interpreter takes a lot of space (but gives good compression of C code, and allows for any PASM instruction to be represented). Here's an excerpt from the documentation:
Hi localroger!
Welcome to the CMM club! Yours is an interesting implementation - can you post what your "m_push" and "m_pop" functions looks like? I think I get what you're proposing, but this will help me judge how well your BCMM might be adapted for use by C etc. As you point out, it currently looks like it would not be a good fit for C - but a variant of it might well be!
Ross.
REALLY cool; should make for some really compact code. I will watch its evolution with great interest...
(p.s. thanks for the kind words :-) )
I also have an R-stack growing up toward the M-stack and a sanity check for if they collide. Since C is a stack based language the stack machine PASM implementation should work pretty well given a suitable mix of helper functions. You could also easily modify this to treat i_src or i_dest more like a math accumulator, which might agree better with some of the compiler architectures.
I do have one question, wouldn't it speed it up if you inline the push/pop in the interpreter and move to/from the source or destination instead of moving through the transfer location?
C.W.
It looks like the stack pointer could actually reside IN the "to_instr_cl1" instruction.
The actual SP value would be the 9 bits that represent the source, the dest value would be 1 greater, since it represents the next position down the stack.
Since the stack grows down, a PUSH subtracts $201 to the to_instr_cl1 location, this updates both SP and SP-1 in the same operation, likewise a POP adds $201.
With this setup the executed instruction directly accesses the stack instead of going through intermediate locations.
The stack is adjusted AFTER the executed instruction, and is "POPPED" once or twice dependending on if the return value was to be PUSHED.
In the case where the result was to be pushed, the R flag would be set so the SP-1 gets the destination value and the stack is just "POPPED" once.
In the case where the result was not to be pushed, the R flag would be cleared so that SP-1 is not altered and the stack would be "POPPED" twice, this could be done by adding $402 to do it in a single add.
Normal PUSH and POP operations can still be implemented that use a couple of memory locations to be used in helper functions, etc. They would still use the "to_instr_cl1" location and add/subtract $201 so as to keep both source (SP) and dest (SP-1) pointers intact.
Just something to chew on.
C.W.