P2 interpreter comparison
ersmith
Here are some more up-to-date results from experimenting with three interpreters for the P2: the zog_p2 ZPU interpreter and two different RISC-V interpreters, a traditional interpreter (riscvemu_p2) and a just-in-time compiler (riscvjit_p2). These are all checked in to GitHub and require spin2cpp 3.6.4 or later to build.
The ZPU interpreter uses the P2 XBYTE feature to directly interpret bytecodes at high speed. It has decent performance all around, although the stack-based nature of the ZPU architecture means that we end up memory bound compared to the CMM and RISC-V interpreters.
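As a rough illustration of what XBYTE provides, here is a minimal C model of the dispatch pattern it implements in hardware. The names (handlers, dispatch, op_halt) are mine, not from zog_p2, and on real hardware XBYTE performs the fetch and indexed jump itself, with no per-opcode interpreter loop:

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of XBYTE-style dispatch: fetch a bytecode and
     * vector through a 256-entry handler table. XBYTE does this in
     * hardware, so the interpreter pays almost nothing per opcode. */
    typedef void (*handler)(void);

    static handler handlers[256];     /* one handler per bytecode value */
    static const uint8_t *pc;         /* bytecode program counter */
    static int running = 1;

    static void op_nop(void)  { }
    static void op_halt(void) { running = 0; }

    static void dispatch(const uint8_t *code)
    {
        pc = code;
        while (running) {
            uint8_t op = *pc++;       /* fetch next bytecode */
            handlers[op]();           /* indexed jump to its handler */
        }
    }

    int main(void)
    {
        static const uint8_t prog[] = { 0, 0, 1 };  /* nop, nop, halt */
        for (int i = 0; i < 256; i++)
            handlers[i] = op_nop;
        handlers[1] = op_halt;
        dispatch(prog);
        puts("halted");
        return 0;
    }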
The RISC-V interpreter comes in two versions. One (riscvemu_p2) is a traditional interpreter which reads RISC-V opcodes and interprets them one at a time; because RISC-V opcodes are 4 bytes each, it does not use XBYTE. The other (riscvjit_p2) is a just-in-time compiler which translates each RISC-V instruction (4 bytes) into 1-4 PASM instructions. A small cache of 64 translated instructions is kept in the LUT. Code with loops that fit in the LUT runs very fast, as we can see from the xxtea and fftbench results. Thanks in part to the hardware multiplier, the interpreted P2 fftbench is actually faster than the compiled P1 one! However, this breaks down with large loops such as Dhrystone's, where constant re-compilation slows the JIT interpreter down.
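To make the LUT-cache behavior concrete, here is a hedged C sketch of a direct-mapped translation cache. The structure and the names (step, translate, run) are illustrative assumptions, not the actual riscvjit_p2 code:

    #include <stdint.h>

    #define LINES 64                  /* riscvjit_p2 keeps 64 translations */

    typedef struct {
        uint32_t tag;                 /* RISC-V PC this line was built from */
        int      valid;
        uint32_t pasm[4];             /* each RISC-V op -> 1-4 PASM words */
    } line_t;

    static line_t cache[LINES];

    /* Stand-in translator: the real JIT emits PASM into the LUT here. */
    static void translate(uint32_t pc, uint32_t *out) { out[0] = pc; }

    /* Stand-in executor: the real interpreter branches into the LUT. */
    static void run(const uint32_t *pasm) { (void)pasm; }

    static void step(uint32_t pc)
    {
        line_t *l = &cache[(pc >> 2) % LINES];  /* direct-mapped by PC */
        if (!l->valid || l->tag != pc) {
            translate(pc, l->pasm);   /* miss: re-translate - the slow path */
            l->tag = pc;              /* that hurts Dhrystone's large loops */
            l->valid = 1;
        }
        run(l->pasm);                 /* hit: cached loops run near PASM speed */
    }

    int main(void)
    {
        for (uint32_t pc = 0; pc < 4096; pc += 4)
            step(pc);                 /* walk a pretend instruction stream */
        return 0;
    }

A loop of up to 64 instructions maps entirely into the cache and never misses after the first pass; a larger loop evicts its own lines on every iteration, which is exactly the Dhrystone failure mode described above.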
I think the data shows that strict stack-based interpreters like the ZPU are hampered by the cost of pushing and popping. OTOH the small bytecodes of the ZPU are well suited to XBYTE interpretation, so the interpreter overhead is low. Register machines like the RISC-V keep a lot of data in COG memory, but instruction decode is more expensive. The JIT approach works very well when loops fit in the LUT, but becomes expensive when they don't.
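As a purely illustrative comparison (not code from either interpreter), here is the same ADD written as a stack-machine handler and as a register-machine handler in C; the stack version makes three memory accesses per ADD, while the register version does more bit-field decoding but keeps its operands in fast storage:

    #include <stdint.h>

    uint32_t stack_mem[256];   /* ZPU-style stack: lives in (slow) hub RAM */
    int      sp;               /* stack pointer */
    uint32_t regs[32];         /* RISC-V registers: live in (fast) COG RAM */

    /* Stack machine ADD: two pops and a push = three memory accesses. */
    void stack_add(void)
    {
        uint32_t b = stack_mem[--sp];
        uint32_t a = stack_mem[--sp];
        stack_mem[sp++] = a + b;
    }

    /* Register machine ADD: extract the R-type fields of the 32-bit
     * opcode (extra decode work), but no memory traffic for operands. */
    void reg_add(uint32_t insn)
    {
        uint32_t rd  = (insn >> 7)  & 0x1f;
        uint32_t rs1 = (insn >> 15) & 0x1f;
        uint32_t rs2 = (insn >> 20) & 0x1f;
        regs[rd] = regs[rs1] + regs[rs2];
    }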
Benchmark results are reproduced below, along with some PropGCC numbers (from P1) for comparison.
All benchmarks were compiled with gcc for the appropriate architecture (PropGCC for CMM and LMM, RISC-V gcc for RISC-V, and the ZPU little endian toolchain for zog_p2). All were compiled with -Os.
The system clocks for both P1 and P2 are 80 MHz.
---------------- dhrystone -----------------------
speed is dhrystones/second; higher is better
size is code and initialized data size of dhry_1.o + dhry_2.o

                  SPEED          SIZE
p1gcc LMM:     6031/sec    3000 bytes
p1gcc CMM:     2074/sec    1248 bytes
zog_p2:        2133/sec    2295 bytes
riscvemu_p2:   1521/sec    2560 bytes
riscvjit_p2:   1119/sec    2560 bytes

------------------ xxtea -----------------------
speed is cycles to decode; lower is better
size is size of xxtea.o + main.o

                      SIZE         SIZE
p1gcc LMM:      21984 cycles    912 bytes
p1gcc CMM:     362544 cycles    384 bytes
zog_p2:        168964 cycles    623 bytes
riscvemu_p2:   202656 cycles    708 bytes
riscvjit_p2:    59755 cycles    708 bytes

----------------- fftbench -----------------------
speed is microseconds to complete FFT; lower is better
size is size of fftbench.o

                  SPEED          SIZE
p1gcc LMM:     138096 us    2052 bytes
p1gcc CMM:     567318 us     852 bytes
zog_p2:        239650 us    1243 bytes
riscvemu_p2:   232937 us    1536 bytes
riscvjit_p2:    60964 us    1536 bytes
Comments
A second-level cache could possibly make the dhrystone results better.
Yes, that's probably a good idea. The ZPU JIT (which was P1 only) had a level 2 cache in HUB memory, so something similar should help even out the riscvjit performance.
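A hedged sketch of that two-level idea (all names and sizes here are hypothetical placeholders, not from any existing interpreter): a small L1 in the LUT backed by a larger L2 in hub RAM, so an L1 miss can often be serviced by copying an already-translated line instead of re-running the JIT.

    #include <stdint.h>

    #define L1_LINES   64    /* LUT: tiny but fastest */
    #define L2_LINES 1024    /* hub RAM: slower but much larger */

    typedef struct {
        uint32_t tag;
        int      valid;
        uint32_t pasm[4];
    } line_t;

    static line_t l1[L1_LINES];
    static line_t l2[L2_LINES];

    static void translate(uint32_t pc, uint32_t *out) { out[0] = pc; } /* stub */

    static line_t *lookup(uint32_t pc)
    {
        line_t *a = &l1[(pc >> 2) % L1_LINES];
        if (a->valid && a->tag == pc)
            return a;                          /* L1 hit: full speed */
        line_t *b = &l2[(pc >> 2) % L2_LINES];
        if (!b->valid || b->tag != pc) {       /* L2 miss: translate once */
            translate(pc, b->pasm);
            b->tag = pc;
            b->valid = 1;
        }
        *a = *b;                               /* refill L1 with a copy - */
        return a;                              /* much cheaper than re-JIT */
    }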
(Does the dhrystone benchmark even exist in P2 SPIN?)
The ZPU interpreter uses the P2 XBYTE feature to directly interpret bytecodes at high speed. It has decent performance all around...
And the XBYTE feature was recently added to support/improve P2 Spin, so they should have similar outcomes, since they use the same lower opcodes and the same sized bytecodes.
One hopes P2 Spin can come in ahead, as Chip has everything under his control.
If by "P2 Spin" you mean Chip's P2 Spin interpreter, then I don't think he's released that yet.
I have ported the FastSpin compiler to P2. It's not quite an apples-to-apples comparison, since we're looking at Spin source code that's been compiled to run directly in hubexec mode rather than C source code being interpreted, but the times for P2 are: (The dhrystone benchmark is dhrystone 1.1, ported via Dave Hein's cspin; the other benchmark runs used dhrystone 2.1, so again it's not an exact comparison, just a rough ballpark.)
Sizes aren't comparable at all, since fastspin doesn't produce .o files and we're linking totally different libraries.