Fast Bytecode Interpreter

I've made some progress on the interpreter and it's going to run quite fast.
Here is the main loop:
In case the instruction from the LUT is 2 clocks (not a CALL), the entire bytecode loop is 13 clocks.
Here is what the LUT table looks like:
Each of those longs in the table executes as an instruction, in place of the NOP in the interpreter loop. For most bytecodes, more needs to be accomplished than what can happen in one instruction, so a CALL is placed in the LUT table. The return address is the 'jmp #inst' in the main loop.
Here's what the math operations look like:
Most operations wind up being just one in-lined instruction. The more complex stuff CALLs to routines.
This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.
If the _RET_ is used on a branch instruction, the branch instruction takes precedence, unless the branch doesn't occur. Then, the _RET_ happens. Here are two routines from one of the VGA demos:
As long as the DJNZ is looping, the _RET_ is ignored. When the DJNZ would otherwise fall through, the _RET_ is executed.
Sorry for having made another change. I just realized this was going to make big improvements in code density, so I did it.
What makes things go really fast is the ALTI instruction. Being able to in-line a random instruction has lots of speed and density benefits.
Here is the main loop:
'
' Interpreter loop
'
inst rfbyte p 'get bytecode
rdlut x,p 'lookup long from lut
alti x 'execute long as next instruction
nop '(instruction placeholder)
jmp #inst 'loop
In case the instruction from the LUT is 2 clocks (not a CALL), the entire bytecode loop is 13 clocks.
Here is what the LUT table looks like:
'
' Bytecode-to-instruction lookup in LUT
'
org $200
push neg1 'constant -1
pusha #0 'constant 0
pusha #1 'constant 1
pusha #2 'constant 2
pusha #3 'constant 3
pusha #4 'constant 4
pusha #7 'constant 7
pusha #8 'constant 8
pusha #15 'constant 15
pusha #16 'constant 16
pusha #31 'constant 31
pusha #32 'constant 32
call rd1xx 'read register $1xx
call wr1xx 'write register $1xx
ops call #op1 'unary operator NOT boolean not
call #op1 'unary operator ! bitwise not
call #op1 'unary operator - negate
call #op1 'unary operator || absolute
call #op1 'unary operator >| encode
call #op1 'unary operator |< decode
call #op1 'unary operator ~ sign extend byte
call #op1 'unary operator ~~ sign extened word
call #op2 'binary operator AND boolean and
call #op2 'binary operator OR boolean or
call #op2 'binary operator & bitwise and
call #op2 'binary operator | bitwise or
call #op2 'binary operator ^ bitwise xor
call #op2 'binary operator >> shift right
call #op2 'binary operator << shift left
call #op2 'binary operator ~> arithmetic shift right
call #op2 'binary operator <~ arithmetic shift left
call #op2 'binary operator -> rotate right
call #op2 'binary operator <- rotate left
call #op2 'binary operator >< reverse LSBs
call #op2 'binary operator #> limit minimum (signed)
call #op2 'binary operator <# limit maximum (signed)
call #op2 'binary operator + add
call #op2 'binary operator - subtract
call #op2 'binary operator * multiply (signed)
call #op2 'binary operator / divide (signed)
call #op2 'binary operator // remainder (signed)
call #op2 'binary operator ** scale (unsigned)
call #op2 'binary operator */ fraction (unsigned)
call #op1 'SQRT(x) math terms, 1 result
call #op1 'LOG(x)
call #op1 'EXP(x)
call #op2 'NEGBOOL(x,bool)
call #op3 'MUXBOOL(x,mux,bool)
call #op2 'POLCAR(r,t : x,y) math terms, 2 results
call #op2 'CARPOL(x,y : r,t)
call #op3 'ROTCAR(x,y,t : x,y)
Each of those longs in the table executes as an instruction, in place of the NOP in the interpreter loop. For most bytecodes, more needs to be accomplished than what can happen in one instruction, so a CALL is placed in the LUT table. The return address is the 'jmp #inst' in the main loop.
Here's what the math operations look like:
'
' Math operators
'
op3 popa z 'pop z
op2 popa y 'pop y
op1 popa x wc, wz 'pop x, set flags
altd p,#.op - (ops & $FF) 'point next d to op instruction
alti 0 'execute d as next instruction
nop '(instruction executes here)
_ret_ pusha x 'push x, return to bytecode loop
.op call #.not 'NOT unaries
not x '!
neg x '-
abs x '||
topone x '>|
decod x '|<
call #.sxb '~~
call #.sxw '~
call #.and 'AND binaries
call #.or 'OR
and x,y '&
or x,y '|
xor x,y '^
shr x,y '>>
shl x,y '<<
sar x,y '~>
sal x,y '<~
ror x,y '->
rol x,y '<-
call #.rev '><
mins x,y '#>
maxs x,y '<#
add x,y '+
sub x,y '-
call #.mul '*
call #.div '/
call #.rem '//
call #.scl '**
call #.fra '*/
call #.sqrt 'SQRT() 1 result
call #.log 'LOG()
call #.exp 'EXP()
call #.negbool 'NEGBOOL(,)
call #.muxbool 'MUXBOOL(,,)
call #.polcar 'POLCAR(r,t : x,y) 2 results
call #.carpol 'CARPOL(x,y : r,t)
call #.rotcar 'ROTCAR(x,y,t : x,y)
.not if_nz not x,#0
_ret_ not x,#0
.sxb shl x,#24
_ret_ sar x,#24
.sxw shl x,#16
_ret_ sar x,#16
.and if_nz cmp y,#0 wz
if_nz not x,#0
ret
.or if_z cmp y,#0 wz
if_nz not x,#0
ret
.rev rev x
rol x,y
rol x,#1
_ret_ triml x,y
.mul testb y,#31 wz
abs x
abs y
qmul x,y
.mul2 getqx x
if_c_eq_z neg x
ret
.div testb y,#31 wz
abs x
abs y
qdiv x,y
jmp #.mul2
.rem abs x
abs y
qdiv x,y
getqy x
if_c neg x
ret
.scl qmul x,y
_ret_ getqy x
.fra qfrac x,y
_ret_ getqx x
.sqrt qsqrt x
_ret_ getqx x
.log qlog x
_ret_ getqx x
.exp qexp x
_ret_ getqx x
.negbool cmp y,#0 wz
_ret_ negnz x
.muxbool cmp z,#0 wz
_ret_ muxnz x,y
.polcar qrotate x,y
jmp #.results
.carpol qvector x,y
jmp #.results
.rotcar setq y
qrotate x,z
.results getqy x
pusha x
_ret_ getqx x
Most operations wind up being just one in-lined instruction. The more complex stuff CALLs to routines.
This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.
If the _RET_ is used on a branch instruction, the branch instruction takes precedence, unless the branch doesn't occur. Then, the _RET_ happens. Here are two routines from one of the VGA demos:
'
' Subroutines
'
blank call #hsync 'blank lines
xcont m_vi,#0
_ret_ djnz x,#blank
hsync xcont m_bs,#0 'horizontal sync
xzero m_sn,#1
_ret_ xcont m_bv,#0
As long as the DJNZ is looping, the _RET_ is ignored. When the DJNZ would otherwise fall through, the _RET_ is executed.
Sorry for having made another change. I just realized this was going to make big improvements in code density, so I did it.
What makes things go really fast is the ALTI instruction. Being able to in-line a random instruction has lots of speed and density benefits.
Comments
Seems like Spin2 is going to be a lot more capable than Spin1 was. I often resort to PASM to do most things on P1, might not need as much PASM on P2.
I like the "_ret_ " addition too.
I was thinking 'WAITX #0'.
What about making blank Flash a NOP, which means 0xffffffff ?
That would be 'AUGD #all_ones'. This NOP thing is the biggest problem now.
What happens now, if an errant branch lands on a whole chain of 'AUGD #all_ones' ?
They pretty much are NOPs. The next instruction afterward, though, with a literal D would get the top 27 bits as highs. May or may not be a problem.
' ' Constants ' con8 rfbyte x _ret_ pusha x con8n rfbyte x or x,_FFFFFF00 _ret_ pusha x con16 rfword x _ret_ pusha x con16n rfword x or x,_FFFF0000 _ret_ pusha x con32 rflong x _ret_ pusha x conexp rfbyte y decod x,y testb y,#5 wc if_c sub x,#1 testb y,#6 wc if_c not x _ret_ pusha x _FFFFFFFF long $FFFFFFFF _FFFFFF00 long $FFFFFF00 _FFFF0000 long $FFFF0000
Here's the constant portion of the LUT table:
push _FFFFFFFF 'constant -1 pusha #0 'constant 0 pusha #1 'constant 1 pusha #2 'constant 2 pusha #3 'constant 3 pusha #4 'constant 4 pusha #7 'constant 7 pusha #8 'constant 8 pusha #15 'constant 15 pusha #16 'constant 16 pusha #31 'constant 31 pusha #32 'constant 32 call #con8 'constant $000000xx call #con8n 'constant $FFFFFFxx call #con16 'constant $0000xxxx call #con16n 'constant $FFFFxxxx call #con32 'constant $xxxxxxxx call #conexp 'constant decod/dec/not
The most common constants are single instructions.
' ' Interpreter loop ' inst rfbyte p 'get bytecode rdlut x,p 'lookup long from lut alti x 'execute long as next instruction nop '(instruction placeholder) rfbyte p 'get bytecode rdlut x,p 'lookup long from lut alti x 'execute long as next instruction nop '(instruction placeholder) rfbyte p 'get bytecode rdlut x,p 'lookup long from lut alti x 'execute long as next instruction nop '(instruction placeholder) rfbyte p 'get bytecode rdlut x,p 'lookup long from lut alti x 'execute long as next instruction nop '(instruction placeholder) jmp #inst 'loop
Don't be. Worth it. This exercise is important.
Or, don't bother re-arranging anything and just define NOP to be an alias for OR 0,0.
Hope you're not thinking about changing that...
That's an excellent point. Possibly better would be to trigger a debug interrupt. Instead of a debugger, startup code could set the debug interrupt to call cogstop or whatever other behavior is desirable (cogatn a watchdog cog, for instance).
At which point, yeah, just make NOP an alias for an existing instruction.
PS: Way to go Chip!!! W000000T!
If your program is buggy and runs off into the weeds, those weeds are quite likely to be some other part of your code or your data somewhere, which it will then try to execute.
Could be fixed, but inconvenient...
One thing garryj did is load the top of his cog code with variables initialized to 0. As such, they are just NOPs and skipped over. Can't do this if NOP changes:
dat 'UsbHostEntry org UsbHostEntry 'RJA ' /* Host common registers hr0 long 0 ' Multi-purpose registers hr1 long 0 hpar1 long 0 ' Routine entry/exit parameters hpar2 long 0 ... etc... ... pkt_tmp long 0 ' Tmp storage for routines that deal with datax packets ' */ ' /* usb_host_start '------------------------------------------------------------------------------ ' The USB host cog. '------------------------------------------------------------------------------ usb_host_start call #load_lut
I did that once a couple of times, myself.
We just can't do that anymore.
dat 'UsbHostEntry org UsbHostEntry 'RJA ' /* Host common registers hr0 nop ' Multi-purpose registers hr1 nop hpar1 nop ' Routine entry/exit parameters hpar2 nop ... etc... ... pkt_tmp nop ' Tmp storage for routines that deal with datax packets ' */ ' /* usb_host_start '------------------------------------------------------------------------------ ' The USB host cog. '------------------------------------------------------------------------------ usb_host_start call #load_lut
It would have generated the same exact code, and wouldn't have broken when Chip made this change.
dat org usb_host_entry jmp #usb_host_start ' Skip the static reg block ' /* Host common registers hr0 long 0 ' Multi-purpose registers hr1 long 0 hpar1 long 0 ' Routine entry/exit parameters hpar2 long 0 ... etc... ... pkt_tmp long 0 ' Tmp storage for routines that deal with datax packets ' */ ' /* usb_host_start '------------------------------------------------------------------------------ ' The USB host cog. '------------------------------------------------------------------------------ usb_host_start call #load_lut
In any case, the changes to make it a non-issue are minimal.
_ret_ only works if the internal call-stackpointer is not at bottom, otherwise it's ignored and therfore a NOP.
So as long as you not have done a CALL in your code, the %0000 condition works as a NOP and only inside subroutines it is a _ret_.
Andy