Byte Code Memory Model - BCMM

localroger · 2012-10-17 08:00

We now have numerous compilers targeting the Large Memory Model, Bill Henning's brilliant trick by which indefinitely large PASM code can be run out of Hub RAM. LMM is great for those situations when Spin isn't fast enough and Cog RAM isn't big enough. But LMM code is also a lot bigger than Spin bytecode, and big code runs even more slowly when it's executed from off-chip external memory where you get no advantage from highly tuned LMM interpreters.

The Bye Code Memory Model is a trick I developed inspired by LMM but geared toward producing small code. It isn't nearly as fast as LMM when run from Hub RAM, but is still about five times faster than Spin; and because it's smaller, it has significant advantages when run from XMM.

I developed the BCMM for a project which I used to call Windmill and I wrote a few blogs about it back when the forum software was updated, but that project keeps morphing (and I keep getting pulled from it for other work duties). That project also has a lot of other features which have nothing to do with the fundamental trick of BCMM, so I thought it might be a good idea to present BCMM on its own. I think it could, among other things, form the basis for a very superior C bytecode interpreter.

As we know PASM instructions target a Source and Destination, both of which can actually be argument sources. There are only 64 fundamental PASM instructions, so let's do something like this:

interpreter   rdbyte    i_code, i_ptr
              add       i_ptr, #1
              '
              test      i_code, #%1100_0000 wz
        if_nz jmp       do_helper
              '
              shl       i_code,#3               'code into upper 6 of 9 bit INSTR
              or        i_code,#%111            'set z,c,r
              movi      :to_instr_cl1,i_code    'write INSTR field
              '
              call      #m_pop                  
              mov       i_src,m_xfer            'pop S
              call      #m_pop                  'pop D  
              mov       i_dest,m_xfer
              '
:to_instr_cl1 mov       i_dest,i_src wz, wc, wr 'EXECUTE THE TOKEN
              '
              mov       m_xfer, i_dest
              call      #m_push
              '
              jmp       #interpreter

do_helper:    'handle helper tokens
              jmp       #interpreter

i_code        long      0
i_ptr         long      0
i_src         long      0
i_dest        long      0

I leave implementing the stack as an exercise to the reader. This essentially turns the entire PASM command set into a stack machine. How useful is it? Well with about 20 longs of Cog RAM it implements 35 useful instructions pretty much all the math operations. And while there's a lot more overhead there than with most LMM interpreters, there's a lot less than there is in the Spin interpreter much less most workable XMM schemes. And you have over 400 longs of Cog RAM and 192 free byte codes for helper functions.

I'll note here that while my project implements separate m- and r-stacks, one could quite easily code helper functions around a single C-style stack.

But wait, there's more!

Let's assume for the sake of argument that we have 64 longs to blow on a lookup table. (Don't worry, they won't stay blown :-) We could selectively decide whether to use the flags, pop D or push the result, like this:

Bit 31 = Rotate this bit into the instruction R field
Bit 30 = pop D before execution
Bit 29 = restore flags before execution

interpreter   rdbyte    i_code, i_ptr
              add       i_ptr, #1
              '
              test      i_code, #%1100_0000 wz
        if_nz jmp       do_helper
              '
              mov       tmp, i_code
              add       tmp,#code_index
              movs      :iv_get,tmp             
              nop                               'of course you'd put a useful instr here
:iv_get       mov       i_code_desc,0-0          
              
              shl       i_code,#3               'code into upper 6 of 9 bit INSTR
              or        i_code,#%111            'set z,c,r
              movi      :to_instr_cl1,i_code    'write INSTR field
              '
              call      #m_pop                  
              mov       i_src,m_xfer            'pop S
              shl       i_code_desc,#1 wc       'pop D?
        if_c  call      #m_pop                  'assuming m_pop doesn't trash C  
        if_c  mov       i_dest,m_xfer
              '
              shl       i_vec_desc,#1 wc        'vector: recover flags?
        if_c  mov       i_flags,i_flagx         'recover instead of using default
              test      i_flags,#1 wc           'set flags
              test      i_flags,#2 wz
              '
:to_instr_cl1 mov       i_dest,i_src wz, wc, wr 'EXECUTE THE TOKEN
              '
              muxc      i_flagx,#1              'save flags for retrieval
              muxnz     i_flagx,#2              'but
              mov       i_flags,#0              'assume we're not using them
              '
              test      i_code,#1 wz            'this still contains R bit              
        if_nz mov       m_xfer, i_dest
        if_nz call      #m_push
              '
              jmp       #interpreter

do_helper:    'handle helper tokens
              jmp       #interpreter

i_code        long      0
i_code_desc   long      0
i_ptr         long      0
i_src         long      0
i_dest        long      0
i_flags       long      0
i_flagx       long      0

Of course to make use of this you'd need a table of which instructions work best with which options. Fortunately I've made it for you:

con
  '
  ' Native token bit fields of the Instruction Vector
  '
  IV_R_Set = 1 << 31
  IV_Pop_D = 1 << 30
  IV_Flags = 1 << 29
  '
  IV_Normal     = IV_R_Set + IV_Pop_D   '2 args + result
  IV_Carry      = IV_Normal + IV_Flags  '2 args + result + use flags
  IV_1_Arg      = IV_R_Set              '1 arg + result
  IV_1_Carry    = IV_1_Arg + IV_Flags   '1 arg + result + use flags
  IV_No_R       = IV_Pop_D              '2 args no result
  IV_No_R_C     = IV_No_R + IV_Flags    '2 args no result + use flags
  '
  ' Native token descriptors for each instruction
  '
  IV_RDBYTE     = IV_1_Arg
  IV_RDWORD     = IV_1_Arg
  IV_RDLONG     = IV_1_Arg
  IV_HUBOP      = IV_Normal     'not useful
  '
  ' %100-%111
  ' not used
  '
  IV_ROR        = IV_Normal
  IV_ROL        = IV_Normal
  IV_SHR        = IV_Normal
  IV_SHL        = IV_Normal
  IV_RCR        = IV_Carry 
  IV_RCL        = IV_Carry 
  IV_SAR        = IV_Normal
  IV_REV        = IV_Normal
  IV_MINS       = IV_Normal
  IV_MAXS       = IV_Normal
  IV_MIN        = IV_Normal
  IV_MAX        = IV_Normal
  IV_MOVS       = IV_Normal
  IV_MOVD       = IV_Normal
  IV_MOVI       = IV_Normal
  IV_JMP        = IV_1_Arg      'not useful
  IV_AND        = IV_Normal
  IV_ANDN       = IV_Normal
  IV_OR         = IV_Normal
  IV_XOR        = IV_Normal
  IV_MUXC       = IV_Carry
  IV_MUXNC      = IV_Carry
  IV_MUXZ       = IV_Carry
  IV_MUXNZ      = IV_Carry
  IV_ADD        = IV_Normal
  IV_SUB        = IV_Normal
  IV_ADDABS     = IV_Normal
  IV_SUBABS     = IV_Normal
  IV_SUMC       = IV_Carry
  IV_SUMNC      = IV_Carry
  IV_SUMZ       = IV_Carry
  IV_SUMNZ      = IV_Carry
  IV_MOV        = IV_Normal     'functions as NIP
  IV_NEG        = IV_1_Arg 
  IV_ABS        = IV_1_Arg
  IV_ABSNEG     = IV_1_Arg
  IV_NEGC       = IV_1_Carry
  IV_NEGNC      = IV_1_Carry
  IV_NEGZ       = IV_1_Carry
  IV_NEGNZ      = IV_1_Carry
  IV_CMP        = IV_No_R       
  IV_CMPSX      = IV_No_R_C
  IV_ADDX       = IV_Carry
  IV_CMPX       = IV_No_R_C
  IV_ADDS       = IV_Normal
  IV_SUBS       = IV_Normal
  IV_ADDSX      = IV_Carry
  IV_SUBSX      = IV_Carry
  IV_CMPSUB     = IV_Normal
  IV_DJNZ       = IV_1_Arg      'not useful
  IV_TJNZ       = IV_1_Arg      'not useful
  IV_TJZ        = IV_1_Arg      'not useful   
  IV_WAITPEQ    = IV_Pop_D 
  IV_WAITPNE    = IV_Pop_D  
  IV_WAITCNT    = IV_Normal
  IV_WAITVID    = IV_Pop_D

And of course, back in the PASM image

code_index    long  IV_RDBYTE
              long  IV_RDWORD
              long  IV_RDLONG
              'etc.

So for about 40 longs of interpreter and a 64 long table, this gets us over 50 fully implemented useful instructions, pretty much the entire PASM function set except for Hub RAM writes and HUBOP. It's still a lot faster than Spin, and not even slowing things down much on top of XMM fetches.

But wait, there's more!

You will, of course, be using all that free Cog RAM for helper functions. And unless you pull Peter Jakacki's (admittedly cool) trick from the Tachyon interpreter, you'll need a jump table. And since your jumps are only 9 bits, you can share the helper jump table with the code index!

instr_vector  long  IV_RDBYTE   + instr_l_store
              long  IV_RDWORD   + instr_l_fetch
              long  IV_RDLONG   + instr_rstka
              'etc

The only problem you're likely to have with this is that you might not have 64 helper instructions. (I have 35.) Rather than waste the difference in Cog Longs, it is possible with very little additional code to fold the lookup table into 32 longs:

              ...
              mov       tmp, i_code
              andn      tmp,#%0010_0000         'fold index 
              add       tmp,#instr_vector
              movs      :iv_get,tmp             
              test      i_code,#%0010_0000 wz   'was index actually folded?
:iv_get       mov       i_code_desc,0-0          
        if_nz shl       i_code_desc,#4          'if so, shift in alternate mod bits
              ...
...

instr_vector
              long  IV_RDBYTE   + IV_ADD >> 4 + instr_l_store
              long  IV_RDWORD   + IV_SUB >> 4 + instr_l_fetch
              long  IV_RDLONG   + IV_ADDABS >> 4 + instr_rstka
              'etc.

And that is how you can get a byte code version of nearly the entire PASM instruction set implemented and still have over 400 longs of Cog RAM for those helper functions like PASM multiply, divide, and string manipulation. This has been implemented and tested; it works, and quite well. Executing from Hub RAM it's about five times faster than Spin, and when executing out of most forms of XMM it's a lot faster than LMM simply because it's much more compact.

From this point the sky is the limit. I'm implementing a separate R-stack, but you could alternately create helper functions targeting a C or Spin style memory model. I'm also using the D field of the lookup table to index a separate helper function setup function, so this is what the start of my instr_vector really looks like:

instr_vector
              long  IV_RDBYTE   + IV_ADD      >> 4 + {
                                } prep_1arg   + instr_l_store   << 9 
              long  IV_RDWORD   + IV_SUB      >> 4 + {
                                } prep_1arg   + instr_l_fetch   << 9    
              long  IV_RDLONG   + IV_ADDABS   >> 4 + {
                                } prep_1arg   + instr_rstka     << 9 
              'etc.

ersmith · 2012-10-17 14:21

That's a neat take on things. That makes at least 3 compressed memory models now: Catalina and PropGCC also have CMMs. I think for both the C compilers a prime motivation was that the compressed instructions should be able to represent all possible PASM instructions, with the more common ones occupying less space. It looks like your scheme trades that requirement off for a simpler (and probably faster) model.

The PropGCC CMM instruction set is moderately complicated, so the interpreter takes a lot of space (but gives good compression of C code, and allows for any PASM instruction to be represented). Here's an excerpt from the documentation:

The CMM Instruction Set
-----------------------
The compressed forms of instructions are byte oriented, and can be
read from any byte boundary. They are of variable length. The first
byte specifies the instruction type and (usually) the destination
register; following bytes fill in the rest of the instruction.

The upper 4 bits of the first byte select the type, and the lower 4
bits typically have the destination register (r0-r15) or a condition code.

MACRO INSTRUCTIONS (see below)
$0m   "macro" instruction; m selects which one

COMMON OPERATIONS (see below)
$1y   register/register instruction; y is destination
$2y   register/4 bit immediate: y is destination
$3y   register/12 bit immediate: y is destination

MISCELLANEOUS
$4y   brw: y is the condition for the branch, next 2 bytes are address
$5y   mvil; y is destination, next 4 bytes are immediate value
$6y   mviw; y is destination, next 2 bytes are unsigned immediate

$7y   brs: y is condition, next byte is a signed offset to the PC
$8y   skip2: y is condition, no more bytes; if condition, pc = pc + 2
$9y   skip3: y is condition, no more bytes; if condition, pc = pc + 3

$Ay   mvib: y is destination, next byte is 8 bit immediate
$By   mvi0: y is destination, sets register y to 0

$Cy   leasp: y is destination, next 2 bytes are offset;
      sets y = sp + offset

XMOV: MOV plus COMMON OPERATION
$Dy   xmov: move instruction followed by register/register operation
$Ey   xmov: move followed by register/4 bit immediate

PACKED NATIVE
$Fy   packed unconditional instruction; next 3 bytes are the remaining
      parts of the instruction. The whole long word looks like:

      1111_eeeI       eee = effects bits(wz,wc,wr) I = immediate bit
      ssss_ssss    sss = source bits (9 in all)
      sddd_dddd    ddd = destination bits (9 in all)
      ddii_iiii    iii = instruction bits (6 in all)

This format was chosen so that the source and destination bits are
laid out in memory the same as for normal PASM instructions, and so
the linker can operate on them the same way. Note that the condition
bits are all 1 for this instruction, which is the same as the $F
selector at the start, so unpacking is a matter of re-arranging the
bits.

If condition bits need to be set on an instruction, use the $0F
"native" macro instruction.


===================================================================
Macro Instructions

The 16 available macro operations and their expansions are given below

$00   no-op
     nop
$01   reserved for break 
      (currently a no-op)
$02   ret: return from subroutine
         mov pc,lr
$03   pushm <byte>: push registers on stack
           mov __TMP0, #<byte>
     call #__LMM_PUSHM
$04   popm <byte>: pop registers from stack
         mov __TMP0, #<byte>
     call #__LMM_PUSHM
$05   popret <byte>: pop registers and return
         mov __TMP0, #<byte>
     call #__LMM_PUSHRET
$06   lcall <word>: call subroutine
         mov __TMP0, #<word>
         call #_LMM_CALL_INDIRECT
$07   mul: multiply
         call #__MULSI
$08   udiv: unsigned divide
         call #_UDIVSI
$09   div: signed divide
         call #_DIVSI
$0A   mvreg <byte>: register to register move
         mov x, y  where byte is $xy
$0B   xmov <byte1> <byte2>: two register to register moves
         mov x, y  where byte1 is $xy
     mov z, w  where byte2 is $zw
$0C   addsp <byte>: add signed 8 bit offset to sp
         add sp, <byte> (note that byte is signed)
$0D   reserved, unused
$0E   fcache <word>: load code to fcache
         this is approximately the same as call #__LMM_FCACHE_LOAD,
     except that 16 bytes instead of a long word follow, and the
     pc is then forced to 32 byte alignment. The next <word> bytes
     after the alignment are loaded into the FCACHE and run from
     there.
$0F    native <long>
         the 32 bits after the $0F byte are executed directly as a
     PASM instruction

===================================================================
Common Operations

The most common 16 instruction forms are compressed specially. Here is
a table of those 16 forms:

    add    0-0,0-0        '' common op 0
    sub    0-0,0-0        '' common op 1
    cmps    0-0,0-0 wz,wc    '' common op 2
    cmp    0-0,0-0 wz,wc    '' common op 3
    and    0-0,0-0        '' common op 4
    andn    0-0,0-0        '' common op 5
    neg    0-0,0-0        '' common op 6
    or    0-0,0-0        '' common op 7
    xor    0-0,0-0        '' common op 8
    shl    0-0,0-0        '' common op 9
    shr    0-0,0-0        '' common op A
    sar    0-0,0-0        '' common op B
    rdbyte    0-0,0-0        '' common op C
    rdlong    0-0,0-0        '' common op D
    wrbyte    0-0,0-0        '' common op E
    wrlong    0-0,0-0        '' common op F

Note that these forms are always executed (condition is if_always) and
the effects flags are the defaults for the instruction, unless
otherwise specified.

The compressed encoding for these is as follows:

$1y $sw: register/register operation:
          opw y, s, where "opw" is the operation selected by w from the
          table above
$2y $sw:
      opw y, #s, where "opw" is selected by w and s is an unsigned 4 bit
          immediate
$3y $ss $ws:
          opw y, #sss, where "opw" is selected by w, and s is a signed 12
          bit constant
NOTE that the encoding of which operation to use is in an unusual place
for the 12 bit case!

The first two of these can also be encoded along with a register/register move:

$Dy $uv $sw: xmov mov u, v  opw y,s
          execute mov u,v, then opw y, s
$Ey $uv $sw: xmov mov u,v   opw y, #s
          execute mov u,v, then opw y, #s

localroger · 2012-10-17 16:00

Yeah a prime motivation for me was to leave as much Cog RAM free for helper functions as possible, because for things like string manipulation PASM helpers are faster than anything and make up for a lot of the ills of slow XMM access. Every PASM instruction in my model is 1 byte because there is no ned to supply source and destination registers/locations. In my helper function set I have two and three byte instructions (some capable of directly addressing 512K because I have 16 instruction with 3 bits of data in the byte code) for memory access and jumps. It would make a poor platform for C but it won't quite be Forth either despite the separate R-stack.

RossH · 2012-10-18 00:44

localroger wrote: »

Yeah a prime motivation for me was to leave as much Cog RAM free for helper functions as possible, because for things like string manipulation PASM helpers are faster than anything and make up for a lot of the ills of slow XMM access. Every PASM instruction in my model is 1 byte because there is no ned to supply source and destination registers/locations. In my helper function set I have two and three byte instructions (some capable of directly addressing 512K because I have 16 instruction with 3 bits of data in the byte code) for memory access and jumps. It would make a poor platform for C but it won't quite be Forth either despite the separate R-stack.

Hi localroger!

Welcome to the CMM club! Yours is an interesting implementation - can you post what your "m_push" and "m_pop" functions looks like? I think I get what you're proposing, but this will help me judge how well your BCMM might be adapted for use by C etc. As you point out, it currently looks like it would not be a good fit for C - but a variant of it might well be!

Ross.

Bill Henning · 2012-10-18 06:00

Hi Roger,

REALLY cool; should make for some really compact code. I will watch its evolution with great interest...

(p.s. thanks for the kind words :-) )

localroger · 2012-10-18 06:00

Sure Ross. It's just your basic Hub RAM stack:

m_push        sub       m_ptr,#4
              wrlong    m_xfer,m_ptr
m_push_ret    ret

' NB the inner interpreter requires that m_pop preserve C
'
m_pop         rdlong    m_xfer,m_ptr    'm_xfer from m_stack
              add       m_ptr,#4
m_pop_ret     ret

I also have an R-stack growing up toward the M-stack and a sanity check for if they collide. Since C is a stack based language the stack machine PASM implementation should work pretty well given a suitable mix of helper functions. You could also easily modify this to treat i_src or i_dest more like a math accumulator, which might agree better with some of the compiler architectures.

ctwardell · 2012-10-18 10:45

This looks really cool.

I do have one question, wouldn't it speed it up if you inline the push/pop in the interpreter and move to/from the source or destination instead of moving through the transfer location?

C.W.

localroger · 2012-10-18 12:16

ctwardell, yes that would be faster but my real push/pops include a sanity check for stack overflow against the R-stack, and they are called from a lot of other places so it saves significant longs to make them routines. I also have a version where I hacked out the XMM drivers and put the stack for that cog in cog RAM; making them routines centralizes that sort of thing.

ctwardell · 2012-10-18 19:15

Thinking a bit about a version with the stack in the COG, growing down from top of available memory.

It looks like the stack pointer could actually reside IN the "to_instr_cl1" instruction.

The actual SP value would be the 9 bits that represent the source, the dest value would be 1 greater, since it represents the next position down the stack.

Since the stack grows down, a PUSH subtracts $201 to the to_instr_cl1 location, this updates both SP and SP-1 in the same operation, likewise a POP adds $201.

With this setup the executed instruction directly accesses the stack instead of going through intermediate locations.

The stack is adjusted AFTER the executed instruction, and is "POPPED" once or twice dependending on if the return value was to be PUSHED.

In the case where the result was to be pushed, the R flag would be set so the SP-1 gets the destination value and the stack is just "POPPED" once.
In the case where the result was not to be pushed, the R flag would be cleared so that SP-1 is not altered and the stack would be "POPPED" twice, this could be done by adding $402 to do it in a single add.

Normal PUSH and POP operations can still be implemented that use a couple of memory locations to be used in helper functions, etc. They would still use the "to_instr_cl1" location and add/subtract $201 so as to keep both source (SP) and dest (SP-1) pointers intact.

Just something to chew on.

C.W.

Byte Code Memory Model - BCMM

Comments