Fast Bytecode Interpreter
cgracey
Posts: 14,152
I've made some progress on the interpreter and it's going to run quite fast.
Here is the main loop:
In case the instruction from the LUT is 2 clocks (not a CALL), the entire bytecode loop is 13 clocks.
Here is what the LUT table looks like:
Each of those longs in the table executes as an instruction, in place of the NOP in the interpreter loop. For most bytecodes, more needs to be accomplished than what can happen in one instruction, so a CALL is placed in the LUT table. The return address is the 'jmp #inst' in the main loop.
Here's what the math operations look like:
Most operations wind up being just one in-lined instruction. The more complex stuff CALLs to routines.
This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.
If the _RET_ is used on a branch instruction, the branch instruction takes precedence, unless the branch doesn't occur. Then, the _RET_ happens. Here are two routines from one of the VGA demos:
As long as the DJNZ is looping, the _RET_ is ignored. When the DJNZ would otherwise fall through, the _RET_ is executed.
Sorry for having made another change. I just realized this was going to make big improvements in code density, so I did it.
What makes things go really fast is the ALTI instruction. Being able to in-line a random instruction has lots of speed and density benefits.
Here is the main loop:
' ' Interpreter loop ' inst rfbyte p 'get bytecode rdlut x,p 'lookup long from lut alti x 'execute long as next instruction nop '(instruction placeholder) jmp #inst 'loop
In case the instruction from the LUT is 2 clocks (not a CALL), the entire bytecode loop is 13 clocks.
Here is what the LUT table looks like:
' ' Bytecode-to-instruction lookup in LUT ' org $200 push neg1 'constant -1 pusha #0 'constant 0 pusha #1 'constant 1 pusha #2 'constant 2 pusha #3 'constant 3 pusha #4 'constant 4 pusha #7 'constant 7 pusha #8 'constant 8 pusha #15 'constant 15 pusha #16 'constant 16 pusha #31 'constant 31 pusha #32 'constant 32 call rd1xx 'read register $1xx call wr1xx 'write register $1xx ops call #op1 'unary operator NOT boolean not call #op1 'unary operator ! bitwise not call #op1 'unary operator - negate call #op1 'unary operator || absolute call #op1 'unary operator >| encode call #op1 'unary operator |< decode call #op1 'unary operator ~ sign extend byte call #op1 'unary operator ~~ sign extened word call #op2 'binary operator AND boolean and call #op2 'binary operator OR boolean or call #op2 'binary operator & bitwise and call #op2 'binary operator | bitwise or call #op2 'binary operator ^ bitwise xor call #op2 'binary operator >> shift right call #op2 'binary operator << shift left call #op2 'binary operator ~> arithmetic shift right call #op2 'binary operator <~ arithmetic shift left call #op2 'binary operator -> rotate right call #op2 'binary operator <- rotate left call #op2 'binary operator >< reverse LSBs call #op2 'binary operator #> limit minimum (signed) call #op2 'binary operator <# limit maximum (signed) call #op2 'binary operator + add call #op2 'binary operator - subtract call #op2 'binary operator * multiply (signed) call #op2 'binary operator / divide (signed) call #op2 'binary operator // remainder (signed) call #op2 'binary operator ** scale (unsigned) call #op2 'binary operator */ fraction (unsigned) call #op1 'SQRT(x) math terms, 1 result call #op1 'LOG(x) call #op1 'EXP(x) call #op2 'NEGBOOL(x,bool) call #op3 'MUXBOOL(x,mux,bool) call #op2 'POLCAR(r,t : x,y) math terms, 2 results call #op2 'CARPOL(x,y : r,t) call #op3 'ROTCAR(x,y,t : x,y)
Each of those longs in the table executes as an instruction, in place of the NOP in the interpreter loop. For most bytecodes, more needs to be accomplished than what can happen in one instruction, so a CALL is placed in the LUT table. The return address is the 'jmp #inst' in the main loop.
Here's what the math operations look like:
' ' Math operators ' op3 popa z 'pop z op2 popa y 'pop y op1 popa x wc, wz 'pop x, set flags altd p,#.op - (ops & $FF) 'point next d to op instruction alti 0 'execute d as next instruction nop '(instruction executes here) _ret_ pusha x 'push x, return to bytecode loop .op call #.not 'NOT unaries not x '! neg x '- abs x '|| topone x '>| decod x '|< call #.sxb '~~ call #.sxw '~ call #.and 'AND binaries call #.or 'OR and x,y '& or x,y '| xor x,y '^ shr x,y '>> shl x,y '<< sar x,y '~> sal x,y '<~ ror x,y '-> rol x,y '<- call #.rev '>< mins x,y '#> maxs x,y '<# add x,y '+ sub x,y '- call #.mul '* call #.div '/ call #.rem '// call #.scl '** call #.fra '*/ call #.sqrt 'SQRT() 1 result call #.log 'LOG() call #.exp 'EXP() call #.negbool 'NEGBOOL(,) call #.muxbool 'MUXBOOL(,,) call #.polcar 'POLCAR(r,t : x,y) 2 results call #.carpol 'CARPOL(x,y : r,t) call #.rotcar 'ROTCAR(x,y,t : x,y) .not if_nz not x,#0 _ret_ not x,#0 .sxb shl x,#24 _ret_ sar x,#24 .sxw shl x,#16 _ret_ sar x,#16 .and if_nz cmp y,#0 wz if_nz not x,#0 ret .or if_z cmp y,#0 wz if_nz not x,#0 ret .rev rev x rol x,y rol x,#1 _ret_ triml x,y .mul testb y,#31 wz abs x abs y qmul x,y .mul2 getqx x if_c_eq_z neg x ret .div testb y,#31 wz abs x abs y qdiv x,y jmp #.mul2 .rem abs x abs y qdiv x,y getqy x if_c neg x ret .scl qmul x,y _ret_ getqy x .fra qfrac x,y _ret_ getqx x .sqrt qsqrt x _ret_ getqx x .log qlog x _ret_ getqx x .exp qexp x _ret_ getqx x .negbool cmp y,#0 wz _ret_ negnz x .muxbool cmp z,#0 wz _ret_ muxnz x,y .polcar qrotate x,y jmp #.results .carpol qvector x,y jmp #.results .rotcar setq y qrotate x,z .results getqy x pusha x _ret_ getqx x
Most operations wind up being just one in-lined instruction. The more complex stuff CALLs to routines.
This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.
If the _RET_ is used on a branch instruction, the branch instruction takes precedence, unless the branch doesn't occur. Then, the _RET_ happens. Here are two routines from one of the VGA demos:
' ' Subroutines ' blank call #hsync 'blank lines xcont m_vi,#0 _ret_ djnz x,#blank hsync xcont m_bs,#0 'horizontal sync xzero m_sn,#1 _ret_ xcont m_bv,#0
As long as the DJNZ is looping, the _RET_ is ignored. When the DJNZ would otherwise fall through, the _RET_ is executed.
Sorry for having made another change. I just realized this was going to make big improvements in code density, so I did it.
What makes things go really fast is the ALTI instruction. Being able to in-line a random instruction has lots of speed and density benefits.
Comments
Seems like Spin2 is going to be a lot more capable than Spin1 was. I often resort to PASM to do most things on P1, might not need as much PASM on P2.
I like the "_ret_ " addition too.
I was thinking 'WAITX #0'.
What about making blank Flash a NOP, which means 0xffffffff ?
That would be 'AUGD #all_ones'. This NOP thing is the biggest problem now.
What happens now, if an errant branch lands on a whole chain of 'AUGD #all_ones' ?
They pretty much are NOPs. The next instruction afterward, though, with a literal D would get the top 27 bits as highs. May or may not be a problem.
Here's the constant portion of the LUT table:
The most common constants are single instructions.
Don't be. Worth it. This exercise is important.
Or, don't bother re-arranging anything and just define NOP to be an alias for OR 0,0.
Hope you're not thinking about changing that...
That's an excellent point. Possibly better would be to trigger a debug interrupt. Instead of a debugger, startup code could set the debug interrupt to call cogstop or whatever other behavior is desirable (cogatn a watchdog cog, for instance).
At which point, yeah, just make NOP an alias for an existing instruction.
PS: Way to go Chip!!! W000000T!
If your program is buggy and runs off into the weeds, those weeds are quite likely to be some other part of your code or your data somewhere, which it will then try to execute.
Could be fixed, but inconvenient...
One thing garryj did is load the top of his cog code with variables initialized to 0. As such, they are just NOPs and skipped over. Can't do this if NOP changes:
I did that once a couple of times, myself.
We just can't do that anymore.
It would have generated the same exact code, and wouldn't have broken when Chip made this change.
In any case, the changes to make it a non-issue are minimal.
_ret_ only works if the internal call-stackpointer is not at bottom, otherwise it's ignored and therfore a NOP.
So as long as you not have done a CALL in your code, the %0000 condition works as a NOP and only inside subroutines it is a _ret_.
Andy