Interpreter toolkit for P2

ersmith · 2019-04-11 20:09

I've created a simple interpreter toolkit for P2. Source code is at: https://github.com/totalspectrum/p2-jit-tools

It's based on the JIT translation code I used for my P2 ZPU emulator. It comes with a very stripped-down stack-based virtual machine in 4 versions: a very straightforward ("plain") interpreter, an XBYTE-based interpreter, a simple JIT compiler, and an optimizing JIT compiler. There's also a timing program that uses the virtual machine to toggle pin 0 one million times. Times for the various interpreters running the same code are:

plain interpreter:    336_000_321 cycles
xbyte interpreter:    160_000_129 cycles
JIT w. HUB cache:     168_001_865 cycles
JIT w. LUT cache:     120_002_177 cycles
optimized JIT w. HUB:  64_002_363 cycles
optimized JIT w. LUT:  48_002_425 cycles
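
Since the benchmark loop runs one million times, dividing each total by 1_000_000 gives the approximate cost of one loop iteration, and dividing the plain interpreter's total by each of the others gives the relative speedup. A quick Python check of that arithmetic, using only the figures from the table above (the dictionary keys and helper names are just labels chosen for this sketch):

```python
# Cycle counts copied from the timing table above.
cycles = {
    "plain interpreter": 336_000_321,
    "xbyte interpreter": 160_000_129,
    "JIT w. HUB cache": 168_001_865,
    "JIT w. LUT cache": 120_002_177,
    "optimized JIT w. HUB": 64_002_363,
    "optimized JIT w. LUT": 48_002_425,
}

ITERATIONS = 1_000_000                    # loop count used by the timing program
baseline = cycles["plain interpreter"]    # reference point for the speedups


def per_iteration(total):
    """Approximate cycles spent per loop iteration."""
    return total / ITERATIONS


def speedup(total):
    """How many times faster than the plain interpreter."""
    return baseline / total


for name, total in cycles.items():
    print(f"{name:22s} {per_iteration(total):7.1f} cycles/iter  {speedup(total):5.2f}x")
```

The small remainder in each total (a few hundred to a few thousand cycles) is presumably one-time setup outside the loop, which is why the per-iteration figures come out so close to whole numbers.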

The timing program itself looks like:

InitialPC
  byte OP_PUSHIM
  long @startmsg
  byte OP_PRSTR
  
  byte OP_GETCNT
  byte OP_PUSHIM
  long @starttime
  byte OP_STORE

  '' establish loop counter: toggle 1000000 times
  byte OP_PUSHIM
  long 1000000	' loop counter

  ' negate it so we can count up
  byte OP_PUSHIM
  long 0
  byte OP_SWAP
  byte OP_SUB

loop
  byte OP_PUSHIM
  long 0
  byte OP_PINLO

  byte OP_PUSHIM
  long 0
  byte OP_PINHI

  ' increment the (negated) loop counter toward zero
  
  byte OP_PUSHIM
  long 1
  byte OP_ADD
  byte OP_DUP

  byte OP_JNEG
  long loop

exitloop
  ' push elapsed time onto stack
  byte OP_GETCNT
  byte OP_PUSHIM
  long @starttime
  byte OP_LOAD
  byte OP_SUB

  ' print elapsed time
  byte OP_PUSHIM
  long @endmsg
  byte OP_PRSTR
  
  byte OP_PRHEX

  byte OP_PUSHIM
  long @newline
  byte OP_PRSTR
  
  ' end of program
  byte OP_HALT

startmsg
	byte "toggling pin 0 1_000_000 times:"
newline
	byte 13, 10, 0
endmsg
	byte "done. elapsed cycles: 0x", 0

	alignl
starttime  long 0
endtime    long 0
vara	   long 0
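
For readers unfamiliar with the count-up idiom in the loop above, here is a rough Python model of the stack operations. The opcode semantics are assumptions inferred from the listing rather than taken from the toolkit source (in particular, that OP_SUB computes next-on-stack minus top, and that OP_JNEG pops its operand and branches only if it is negative). The pin operations are reduced to a counter, and `run_loop` is a name made up for this sketch.

```python
def run_loop(count):
    """Model the benchmark loop; returns (iterations executed, final stack)."""
    stack = []
    toggles = 0

    # PUSHIM count; PUSHIM 0; SWAP; SUB  -> leaves -count on the stack
    stack.append(count)
    stack.append(0)
    stack[-1], stack[-2] = stack[-2], stack[-1]   # OP_SWAP
    top = stack.pop()                             # OP_SUB (assumed: next - top)
    stack[-1] = stack[-1] - top

    while True:
        stack.append(0)        # PUSHIM 0
        stack.pop()            # PINLO consumes the pin number
        stack.append(0)        # PUSHIM 0
        stack.pop()            # PINHI consumes the pin number
        toggles += 1

        stack.append(1)        # PUSHIM 1
        top = stack.pop()      # OP_ADD
        stack[-1] += top
        stack.append(stack[-1])    # OP_DUP
        if stack.pop() >= 0:       # OP_JNEG (assumed: pop, loop only if negative)
            break

    return toggles, stack


print(run_loop(5))
```

Under these assumptions the counter runs from -count up to 0, so the loop needs no compare instruction: the sign of the counter itself is the exit test.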

cgracey · 2019-04-11 21:10

Looks interesting, Eric.

Could you briefly explain the JIT concept, please? I know it means just-in-time, but what is happening, exactly? You are translating something into code and executing it, right?

jmg · 2019-04-11 21:12

cgracey wrote: »

Looks interesting, Eric.

Could you briefly explain the JIT concept, please? I know it means just-in-time, but what is happening, exactly? You are translating something into code and executing it, right?

and what is the difference between JIT and 'optimized JIT'?

ersmith · 2019-04-11 21:34

The idea behind JIT compilation is that at run time we compile the byte codes into P2 instructions before executing them. So, for example, in the plain or XBYTE interpreter, when we see an OP_ADD instruction the interpreter executes something like:

    mov temp, tos
    rdlong tos, --ptra
    add tos, temp

Instead of executing those instructions right away as XBYTE would do, the just in time compiler appends that instruction sequence to its cache, and only executes it once the cache line is finished (basically when either the cache is full or we branch somewhere else).

When a branch is encountered we look in the cache for a line that starts with the new PC; if it's found we just jump directly to the cache. That's where the speed improvement comes from: we don't have any interpretation overhead for loops that fit in cache; it's just running raw machine code at that point. (There's an extra "trampoline" mechanism to avoid the cache check entirely for jumps from cache to cache.)
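
The translate-once, run-many idea can be sketched in Python. This is only an illustration under simplifying assumptions, not the toolkit's implementation: the five opcodes are a made-up subset, "machine code" is modelled as a list of Python closures, the PC is an index into a list rather than a byte address, and the cache is an unbounded dictionary (so there is no "cache full" case and no trampoline).

```python
# Hypothetical opcode subset, used only for this sketch.
PUSHIM, ADD, DUP, JNEG, HALT = range(5)


def compile_line(program, pc):
    """Translate bytecodes starting at pc until a branch or HALT ends the line."""
    steps = []
    while True:
        op, arg = program[pc]
        pc += 1
        if op == PUSHIM:
            steps.append(lambda s, v=arg: s.append(v))
        elif op == ADD:
            steps.append(lambda s: s.append(s.pop() + s.pop()))
        elif op == DUP:
            steps.append(lambda s: s.append(s[-1]))
        elif op == JNEG:
            target, fallthrough = arg, pc

            def branch_line(s):
                for step in steps:
                    step(s)
                # Pop the tested value and report which PC to run next.
                return target if s.pop() < 0 else fallthrough

            return branch_line
        elif op == HALT:

            def halt_line(s):
                for step in steps:
                    step(s)
                return None          # None means "stop"

            return halt_line


def run(program):
    """Run program through the cache; returns (final stack, lines compiled)."""
    cache = {}       # start PC -> compiled line
    compiles = 0
    stack = []
    pc = 0
    while pc is not None:
        if pc not in cache:          # miss: translate a new line
            cache[pc] = compile_line(program, pc)
            compiles += 1
        pc = cache[pc](stack)        # hit: run the already-translated code
    return stack, compiles


demo = [
    (PUSHIM, -3),    # 0: counter starts negative
    (PUSHIM, 1),     # 1: loop head
    (ADD, None),     # 2
    (DUP, None),     # 3
    (JNEG, 1),       # 4: back to the loop head while negative
    (HALT, None),    # 5
]

print(run(demo))
```

In this run the loop body is executed three times, but the line starting at the loop head is translated only once; every later pass is a cache hit.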

ersmith · 2019-04-11 21:44

jmg wrote: »

and what is the difference between JIT and 'optimized JIT'?

The "regular" JIT is a dead simple translation of the plain and XBYTE interpreters; it copies basically the same instructions that they would execute into the code cache.

The "optimized" JIT compiler does some optimization on the sequences. It can do this because we know where branches are, and know that we can never branch into the middle of a cache line. The main optimization it performs is keeping track of what's on the stack, so we can turn pushes and pops into register moves. It also optimizes the size of moves, using a short immediate (#) instead of a long one (##) when the value fits. So for example the bytecode to drive pin 56 low looks like:

   byte OP_PUSHIM
   long  56
   byte OP_PINLO

In the regular JIT this becomes

    wrlong tos, ptra++
    mov    tos, ##56
    drvl   tos

but in the optimized JIT it's compiled to:

    mov reg02, #56
    drvl  reg02

Because we know that the sequence leaves the stack unchanged at the end, we don't need to push or pop any values. There's an obvious further optimization to:

    drvl #56

but I haven't implemented that yet.
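
To make the stack-tracking idea concrete, here is a small Python sketch that keeps a compile-time model of the stack and emits PASM-like text. It is not the toolkit's code: the function names, the opcode tuples, and the register numbering (starting at reg00) are all invented for illustration, and it only handles sequences that consume what they push within one cache line. The one hardware fact it relies on is that a P2 `#` immediate holds 9 bits (0..511), with `##` needed for anything larger.

```python
def imm(value):
    """Pick the short '#' form when the value fits in 9 bits, else '##'."""
    return f"#{value}" if 0 <= value <= 511 else f"##{value}"


def compile_optimized(ops):
    """Return (emitted instructions, registers still on the virtual stack)."""
    out = []
    vstack = []        # compile-time stack: names of registers holding values
    next_reg = 0       # hypothetical register allocator: just count upward
    for op, arg in ops:
        if op == "PUSHIM":
            reg = f"reg{next_reg:02d}"
            next_reg += 1
            out.append(f"mov {reg}, {imm(arg)}")   # a push becomes a register move
            vstack.append(reg)
        elif op == "PINLO":
            out.append(f"drvl {vstack.pop()}")     # the pop is resolved at compile time
        elif op == "ADD":
            src = vstack.pop()
            out.append(f"add {vstack[-1]}, {src}")
        else:
            raise ValueError(f"opcode not handled in this sketch: {op}")
    return out, vstack


code, leftover = compile_optimized([("PUSHIM", 56), ("PINLO", None)])
print("\n".join(code))
```

An empty `leftover` list at the end of a line corresponds to the "stack unchanged" case described above, where no wrlong/rdlong traffic is needed at all. A real compiler would also have to spill any leftover registers to the actual stack before leaving the cache line.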

David Betz · 2019-04-11 22:35

This is quite cool! I will definitely try it when I convert my ebasic interpreter to use XBYTE.

cgracey · 2019-04-11 22:53

Thanks for the explanation, Eric. Some things to think about there.

Roy Eltham · 2019-04-12 01:29

Very cool Eric!

You are quite the busy P2 guy. Producing piles of useful stuff!
