P2 interpreter comparison
ersmith
Here are some more up-to-date results from experimenting with three interpreters for the P2: the zog_p2 ZPU interpreter and two different RISC-V interpreters, a traditional interpreter (riscvemu_p2) and a just-in-time compiler (riscvjit_p2). These are all checked in to GitHub and require spin2cpp 3.6.4 or later to build.
The ZPU interpreter uses the P2 XBYTE feature to directly interpret bytecodes at high speed. It has decent performance all around, although the stack-based nature of the ZPU architecture means that we end up memory bound compared to the CMM and RISC-V interpreters.
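As a rough illustration of what XBYTE provides, here is a minimal C model of the dispatch pattern it implements in hardware. The names (handlers, dispatch, op_halt) are mine, not from zog_p2, and on real hardware XBYTE performs the fetch and indexed jump itself, with no per-opcode interpreter loop:

    #include <stdint.h>
    #include <stdio.h>

    /* Software model of XBYTE-style dispatch: fetch a bytecode and
     * vector through a 256-entry handler table. XBYTE does this in
     * hardware, so the interpreter pays almost nothing per opcode. */
    typedef void (*handler)(void);

    static handler handlers[256];     /* one handler per bytecode value */
    static const uint8_t *pc;         /* bytecode program counter */
    static int running = 1;

    static void op_nop(void)  { }
    static void op_halt(void) { running = 0; }

    static void dispatch(const uint8_t *code)
    {
        pc = code;
        while (running) {
            uint8_t op = *pc++;       /* fetch next bytecode */
            handlers[op]();           /* indexed jump to its handler */
        }
    }

    int main(void)
    {
        static const uint8_t prog[] = { 0, 0, 1 };  /* nop, nop, halt */
        for (int i = 0; i < 256; i++)
            handlers[i] = op_nop;
        handlers[1] = op_halt;
        dispatch(prog);
        puts("halted");
        return 0;
    }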
The RISC-V interpreter comes in two versions. One (riscvemu_p2) is a traditional interpreter which reads RISC-V opcodes and interprets them one at a time; because RISC-V opcodes are 4 bytes each, it does not use XBYTE. The other (riscvjit_p2) is a just-in-time compiler which translates each RISC-V instruction (4 bytes) into 1-4 PASM instructions. A small cache of 64 translated instructions is kept in the LUT. Code with loops that fit in the LUT runs very fast, as we can see from the xxtea and fftbench results. Thanks in part to the hardware multiplier, the interpreted P2 fftbench is actually faster than the compiled P1 one! However, this breaks down with large loops such as Dhrystone's, where constant re-compilation slows the JIT interpreter down.
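To make the LUT-cache behavior concrete, here is a hedged C sketch of a direct-mapped translation cache. The structure and the names (step, translate, run) are illustrative assumptions, not the actual riscvjit_p2 code:

    #include <stdint.h>

    #define LINES 64                  /* riscvjit_p2 keeps 64 translations */

    typedef struct {
        uint32_t tag;                 /* RISC-V PC this line was built from */
        int      valid;
        uint32_t pasm[4];             /* each RISC-V op -> 1-4 PASM words */
    } line_t;

    static line_t cache[LINES];

    /* Stand-in translator: the real JIT emits PASM into the LUT here. */
    static void translate(uint32_t pc, uint32_t *out) { out[0] = pc; }

    /* Stand-in executor: the real interpreter branches into the LUT. */
    static void run(const uint32_t *pasm) { (void)pasm; }

    static void step(uint32_t pc)
    {
        line_t *l = &cache[(pc >> 2) % LINES];  /* direct-mapped by PC */
        if (!l->valid || l->tag != pc) {
            translate(pc, l->pasm);   /* miss: re-translate - the slow path */
            l->tag = pc;              /* that hurts Dhrystone's large loops */
            l->valid = 1;
        }
        run(l->pasm);                 /* hit: cached loops run near PASM speed */
    }

    int main(void)
    {
        for (uint32_t pc = 0; pc < 4096; pc += 4)
            step(pc);                 /* walk a pretend instruction stream */
        return 0;
    }

A loop of up to 64 instructions maps entirely into the cache and never misses after the first pass; a larger loop evicts its own lines on every iteration, which is exactly the Dhrystone failure mode described above.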
I think the data shows that strict stack-based interpreters like the ZPU are hampered by the cost of pushing and popping. OTOH the small bytecodes of the ZPU are well suited to XBYTE interpretation, so the interpreter overhead is low. Register machines like the RISC-V keep a lot of data in COG memory, but instruction decode is more expensive. The JIT approach works very well when loops fit in the LUT, but becomes expensive when they don't.
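As a purely illustrative comparison (not code from either interpreter), here is the same ADD written as a stack-machine handler and as a register-machine handler in C; the stack version makes three memory accesses per ADD, while the register version does more bit-field decoding but keeps its operands in fast storage:

    #include <stdint.h>

    uint32_t stack_mem[256];   /* ZPU-style stack: lives in (slow) hub RAM */
    int      sp;               /* stack pointer */
    uint32_t regs[32];         /* RISC-V registers: live in (fast) COG RAM */

    /* Stack machine ADD: two pops and a push = three memory accesses. */
    void stack_add(void)
    {
        uint32_t b = stack_mem[--sp];
        uint32_t a = stack_mem[--sp];
        stack_mem[sp++] = a + b;
    }

    /* Register machine ADD: extract the R-type fields of the 32-bit
     * opcode (extra decode work), but no memory traffic for operands. */
    void reg_add(uint32_t insn)
    {
        uint32_t rd  = (insn >> 7)  & 0x1f;
        uint32_t rs1 = (insn >> 15) & 0x1f;
        uint32_t rs2 = (insn >> 20) & 0x1f;
        regs[rd] = regs[rs1] + regs[rs2];
    }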
Benchmark results are reproduced below, along with some PropGCC numbers (from P1) for comparison.
All benchmarks were compiled with gcc for the appropriate architecture (PropGCC for CMM and LMM, RISC-V gcc for RISC-V, and the ZPU little endian toolchain for zog_p2). All were compiled with -Os.
The system clocks for both P1 and P2 are 80 MHz.
---------------- dhrystone -----------------------
speed is dhrystones/second; higher is better
size is code and initialized data size of dhry_1.o + dhry_2.o

                  SPEED          SIZE
p1gcc LMM:     6031/sec    3000 bytes
p1gcc CMM:     2074/sec    1248 bytes
zog_p2:        2133/sec    2295 bytes
riscvemu_p2:   1521/sec    2560 bytes
riscvjit_p2:   1119/sec    2560 bytes

------------------ xxtea -----------------------
speed is cycles to decode; lower is better
size is size of xxtea.o + main.o

                      SIZE         SIZE
p1gcc LMM:      21984 cycles    912 bytes
p1gcc CMM:     362544 cycles    384 bytes
zog_p2:        168964 cycles    623 bytes
riscvemu_p2:   202656 cycles    708 bytes
riscvjit_p2:    59755 cycles    708 bytes

----------------- fftbench -----------------------
speed is microseconds to complete FFT; lower is better
size is size of fftbench.o

                  SPEED          SIZE
p1gcc LMM:     138096 us    2052 bytes
p1gcc CMM:     567318 us     852 bytes
zog_p2:        239650 us    1243 bytes
riscvemu_p2:   232937 us    1536 bytes
riscvjit_p2:    60964 us    1536 bytes
Comments
A second-level cache could possibly make the dhrystone results better.
Yes, that's probably a good idea. The ZPU JIT (which was P1 only) had a level 2 cache in HUB memory, so something similar should help even out the riscvjit performance.
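A hedged sketch of that two-level idea (all names and sizes here are hypothetical placeholders, not from any existing interpreter): a small L1 in the LUT backed by a larger L2 in hub RAM, so an L1 miss can often be serviced by copying an already-translated line instead of re-running the JIT.

    #include <stdint.h>

    #define L1_LINES   64    /* LUT: tiny but fastest */
    #define L2_LINES 1024    /* hub RAM: slower but much larger */

    typedef struct {
        uint32_t tag;
        int      valid;
        uint32_t pasm[4];
    } line_t;

    static line_t l1[L1_LINES];
    static line_t l2[L2_LINES];

    static void translate(uint32_t pc, uint32_t *out) { out[0] = pc; } /* stub */

    static line_t *lookup(uint32_t pc)
    {
        line_t *a = &l1[(pc >> 2) % L1_LINES];
        if (a->valid && a->tag == pc)
            return a;                          /* L1 hit: full speed */
        line_t *b = &l2[(pc >> 2) % L2_LINES];
        if (!b->valid || b->tag != pc) {       /* L2 miss: translate once */
            translate(pc, b->pasm);
            b->tag = pc;
            b->valid = 1;
        }
        *a = *b;                               /* refill L1 with a copy - */
        return a;                              /* much cheaper than re-JIT */
    }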
(Does the dhrystone benchmark even exist in P2 SPIN?)
The ZPU interpreter uses the P2 XBYTE feature to directly interpret bytecodes at high speed. It has decent performance all around...
And the XBYTE feature was recently added to support/improve P2 Spin, so they should have similar outcomes, since they use the same lower opcodes and the same sized bytecodes.
One hopes P2 Spin can come in ahead, as Chip has everything under his control.
If by "P2 Spin" you mean Chip's P2 Spin interpreter, then I don't think he's released that yet.
I have ported the FastSpin compiler to P2. It's not quite an apples-to-apples comparison, since we're looking at Spin source code that's been compiled to run directly in hubexec mode rather than C source code being interpreted, but the times for P2 are: (The dhrystone benchmark is dhrystone 1.1, ported via Dave Hein's cspin; the other benchmark runs used dhrystone 2.1, so again it's not an exact comparison, just a rough ballpark.)
Sizes aren't comparable at all, since fastspin doesn't produce .o files and we're linking totally different libraries.