Here are some more up-to-date results from experimenting with three interpreters for the P2: the zog_p2 ZPU interpreter and two different RISC-V interpreters, a traditional interpreter (riscvemu_p2) and a just-in-time compiler (riscvjit_p2). These are all checked in to GitHub and require spin2cpp 3.6.4 or later to build.
The ZPU interpreter uses the P2 XBYTE feature to directly interpret bytecodes at high speed. It has decent performance all around, although the stack-based nature of the ZPU architecture means that we end up memory bound compared to the CMM and RISC-V interpreters.
The RISC-V interpreter comes in two versions. One (riscvemu_p2) is a traditional interpreter which reads RISC-V opcodes and interprets them one at a time. Because RISC-V opcodes are 4 bytes each, it does not use XBYTE. The other (riscvjit_p2) is a just-in-time compiler, which translates RISC-V instructions (4 bytes each) into PASM instructions. Each RISC-V instruction translates to 1-4 PASM instructions. A small cache of 64 translated instructions is kept in the LUT.
Code with loops that fit in the LUT runs very fast, as we can see from the xxtea and fftbench results. Thanks in part to the hardware multiplier, the JIT-compiled P2 fftbench is actually faster than the compiled P1 one (60964 us versus 138096 us for p1gcc LMM)! However, this breaks down with large loops such as those in Dhrystone, where constantly re-compiling slows the JIT interpreter down to the point that it is slower than the plain interpreter.
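To make the "loops that fit in the LUT" point concrete, here is a toy model of a translation cache. It is only a sketch: the 64-entry size comes from the description above, but the direct-mapped layout, the one-slot-per-instruction assumption, and the name `count_translations` are illustrative assumptions, not how riscvjit_p2 is actually organized.

```python
CACHE_SLOTS = 64  # from the post: 64 translated instructions are cached


def count_translations(loop_len, iterations, slots=CACHE_SLOTS):
    """Count how many times an instruction has to be (re)translated.

    Toy model only. ASSUMPTIONS (not taken from riscvjit_p2):
      * the cache is direct-mapped, indexed by instruction number
      * every translated instruction occupies exactly one slot
      * the loop body is straight-line code executed in order
    """
    cache = [None] * slots          # which instruction each slot holds
    translations = 0
    for _ in range(iterations):
        for insn in range(loop_len):
            slot = insn % slots
            if cache[slot] != insn:  # miss: translate and overwrite the slot
                cache[slot] = insn
                translations += 1
    return translations


if __name__ == "__main__":
    for loop_len in (48, 64, 80):
        n = count_translations(loop_len, iterations=100)
        print(f"loop of {loop_len} instructions: {n} translations")
```

In this model a loop of up to 64 instructions is translated once and then runs entirely from the cache, while a loop of 80 instructions keeps evicting its own start and end on every pass. That is the same qualitative behaviour as the fast xxtea/fftbench results and the slow Dhrystone result, although the real numbers depend on details this sketch ignores.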
I think the data shows that strict stack-based interpreters like the ZPU are hampered by the cost of pushing and popping. OTOH the small bytecodes of the ZPU are well suited to XBYTE interpretation, so the interpreter overhead is low. Register machines like the RISC-V keep a lot of data in COG memory, but instruction decode is more expensive. The JIT approach works very well when loops fit in the LUT, but becomes expensive when they don't.
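The push/pop cost can be illustrated with a toy stack machine that counts stack accesses while evaluating (a + b) * c. This is a sketch under stated assumptions: the tiny instruction set and the name `run_stack_program` are made up for illustration and are not the ZPU instruction set, and a real implementation may cache the top of the stack in registers.

```python
def run_stack_program(program, variables):
    """Run a tiny stack program and count stack memory accesses.

    ASSUMPTION: every push and every pop is one memory access. This is a
    hypothetical machine for illustration, not the ZPU or zog_p2.
    """
    stack = []
    mem_ops = 0
    for op, arg in program:
        if op == "push":
            stack.append(variables[arg])
            mem_ops += 1                      # one write to the stack
        elif op in ("add", "mul"):
            rhs = stack.pop()
            lhs = stack.pop()
            mem_ops += 2                      # two reads (pops)
            stack.append(lhs + rhs if op == "add" else lhs * rhs)
            mem_ops += 1                      # one write (push result)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack[-1], mem_ops


# (a + b) * c as stack code: five instructions
PROGRAM = [("push", "a"), ("push", "b"), ("add", None),
           ("push", "c"), ("mul", None)]

if __name__ == "__main__":
    result, mem_ops = run_stack_program(PROGRAM, {"a": 2, "b": 3, "c": 4})
    print(f"result={result}, stack accesses={mem_ops}")
```

A register machine with a, b and c already in registers needs two instructions (an add and a multiply) and no stack traffic for the same expression, but each of those instructions carries register fields that have to be decoded.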
Benchmark results are reproduced below, along with some PropGCC numbers (from P1) for comparison.
All benchmarks were compiled with gcc for the appropriate architecture (PropGCC for CMM and LMM, RISC-V gcc for RISC-V, and the ZPU little endian toolchain for zog_p2). All were compiled with -Os.
The system clocks for both P1 and P2 are 80 MHz.
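Since the results mix cycle counts and microseconds, a small helper for converting between the two at 80 MHz may be handy. This is just the arithmetic; the function names are my own.

```python
CLOCK_HZ = 80_000_000  # 80 MHz system clock, as stated in the post


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles * 1_000_000 / clock_hz


def us_to_cycles(us, clock_hz=CLOCK_HZ):
    """Convert whole microseconds to a cycle count at the given clock rate."""
    return us * clock_hz // 1_000_000


if __name__ == "__main__":
    # 80 cycles per microsecond at 80 MHz
    print(cycles_to_us(80))
    print(us_to_cycles(1))
```

At this clock rate one microsecond is 80 cycles, so a cycle count divided by 80 gives the time in microseconds.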
---------------- dhrystone -----------------------
speed is dhrystones/second; higher is better
size is code and initialized data size of dhry_1.o + dhry_2.o
p1gcc LMM:     6031/sec    3000 bytes
p1gcc CMM:     2074/sec    1248 bytes
zog_p2:        2133/sec    2295 bytes
riscvemu_p2:   1521/sec    2560 bytes
riscvjit_p2:   1119/sec    2560 bytes
------------------ xxtea -----------------------
speed is cycles to decode; lower is better
size is size of xxtea.o + main.o
p1gcc LMM:      21984 cycles    912 bytes
p1gcc CMM:     362544 cycles    384 bytes
zog_p2:        168964 cycles    623 bytes
riscvemu_p2:   202656 cycles    708 bytes
riscvjit_p2:    59755 cycles    708 bytes
----------------- fftbench -----------------------
speed is microseconds to complete FFT; lower is better
size is size of fftbench.o
p1gcc LMM:     138096 us    2052 bytes
p1gcc CMM:     567318 us     852 bytes
zog_p2:        239650 us    1243 bytes
riscvemu_p2:   232937 us    1536 bytes
riscvjit_p2:    60964 us    1536 bytes
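
For anyone who wants to compare the rows directly, the following sketch turns the numbers above into speed ratios. The data is copied from the three tables; the `speedup` helper and the dictionary layout are my own illustration.

```python
# Benchmark numbers copied from the tables above.
RESULTS = {
    "dhrystone": {  # dhrystones/second, higher is better
        "higher_is_better": True,
        "scores": {"p1gcc LMM": 6031, "p1gcc CMM": 2074, "zog_p2": 2133,
                   "riscvemu_p2": 1521, "riscvjit_p2": 1119},
    },
    "xxtea": {  # cycles, lower is better
        "higher_is_better": False,
        "scores": {"p1gcc LMM": 21984, "p1gcc CMM": 362544, "zog_p2": 168964,
                   "riscvemu_p2": 202656, "riscvjit_p2": 59755},
    },
    "fftbench": {  # microseconds, lower is better
        "higher_is_better": False,
        "scores": {"p1gcc LMM": 138096, "p1gcc CMM": 567318, "zog_p2": 239650,
                   "riscvemu_p2": 232937, "riscvjit_p2": 60964},
    },
}


def speedup(bench, candidate, baseline):
    """Return how many times faster `candidate` is than `baseline`.

    Values above 1.0 mean the candidate is faster, regardless of whether
    the benchmark reports a rate or a time.
    """
    entry = RESULTS[bench]
    cand = entry["scores"][candidate]
    base = entry["scores"][baseline]
    return cand / base if entry["higher_is_better"] else base / cand


if __name__ == "__main__":
    for bench in RESULTS:
        ratio = speedup(bench, "riscvjit_p2", "riscvemu_p2")
        print(f"{bench}: riscvjit_p2 is {ratio:.2f}x riscvemu_p2")
```

With these numbers the JIT is roughly 3.8x faster than the plain RISC-V interpreter on fftbench and 3.4x faster on xxtea, but only about 0.74x as fast on Dhrystone.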