ZPU (and RISCV) emulation for P2 (now with XBYTE)
ersmith
Posts: 6,053
I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application). I've attached a .zip file with my sources here; they're also checked in to https://github.com/totalspectrum/zog. (*Edit*: the .zip file is removed, please see the git repository for current sources).
A few things of interest I found in the process of doing the port:
(1) The interpreter has some kind of weird I/O bug if XBYTE is enabled -- the newlib fflush() function seems to trigger a stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing related bug in my code, but I haven't been able to track it down. For now I've worked around it by using my own simple_printf.c to print results.
(2) The software multiply routine (basically the one Heater and Lonesock did for the P1 ZOG) is often slightly faster than the CORDIC one. It depends on the magnitude of the inputs.
(3) There are a lot of features on P2 that help interpreters, and they can make a big difference. I think the biggest helpers were streaming the opcodes from RAM (XBYTE does this automatically, of course, but on my way to XBYTE I had an intermediate step that used RDFAST) and using the LUT to hold the opcode table. XBYTE itself is pretty nice too, of course.
(4) The performance even with XBYTE is still a lot slower than native code. For xxtea, for example, the native (fastspin compiled) version takes 18327 cycles to do xxtea decode, whereas the XBYTE zog_p2 takes 186556 cycles. Tuning the bytecode to the architecture will probably make a big difference, but ultimately I think a register based virtual machine will outperform a stack based one -- the HUB memory definitely becomes a bottleneck.
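To illustrate point (2): a software multiply's run time can scale with operand magnitude, while a fixed-latency hardware path cannot. This is a generic shift-and-add sketch in C (not the actual P1 ZOG routine, which is hand-written PASM) showing the early-exit behavior:

```c
#include <stdint.h>

/* Generic shift-and-add multiply (not the actual P1 ZOG routine).
 * The loop exits as soon as no multiplier bits remain, so small
 * multipliers finish in fewer iterations -- which is why a software
 * multiply can beat a fixed-latency path for small operands. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {        /* early exit when the multiplier is exhausted */
        if (b & 1)
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;
}
```

With b = 6 the loop runs only three iterations; with a full 32-bit multiplier it runs thirty-two, which is the magnitude dependence described above.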
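To illustrate point (4): in a stack machine every operation touches the memory-resident stack (HUB RAM, in ZOG's case), while a register machine keeps operands in a small register file (COG registers on the P2) and only touches memory on explicit loads and stores. A toy C model of the two dispatch styles, using made-up opcodes rather than ZOG's actual bytecode:

```c
#include <stdint.h>

/* Toy models contrasting the two VM styles (not ZOG's real encoding).
 * The stack VM reads/writes the in-memory stack on every operation;
 * the register VM works out of a small register file. */

enum { OP_PUSH, OP_ADD, OP_HALT };

/* Stack machine: every op is memory traffic on the stack. */
static int32_t stack_vm(const int32_t *code)
{
    int32_t stack[16];
    int sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++; break;          /* 1 write  */
        case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break; /* 2 reads, 1 write */
        case OP_HALT: return stack[sp-1];
        }
    }
}

enum { R_LOADI, R_ADD, R_HALT };

/* Register machine: operands stay in the register file. */
static int32_t reg_vm(const int32_t *code)
{
    int32_t r[4] = {0};
    for (;;) {
        switch (*code++) {
        case R_LOADI: { int d = *code++; r[d] = *code++; break; }
        case R_ADD:   { int d = *code++, s = *code++; r[d] += r[s]; break; }
        case R_HALT:  return r[0];
        }
    }
}
```

Both compute the same result, but the stack version needs a memory access per operand on every arithmetic op, which is where the HUB bottleneck comes from.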
Comments
Is it easy to try XBYTE from HUB, as that is more linear in operation? Might give a clue?
Could be a bug in the fast-skip side of XBYTE?
Makes the SW look good, but the other benefit of HW MUL/DIV is the code size drops, allowing that space to be used for other things.
Is the LUT opcode table 256 words ?
What is the code size of the emulation engine ?
With the new dual-port LUT, I'm thinking of a paired COG that can load bytecodes into the paired LUT from XIP memory.
Do you have a metric on P2 cycles per byte opcode fetched, on average?
i.e. what external memory bandwidth would be needed in order to not slow down the emulation COG?
That's expected, but I think getting to a 10:1 ratio for an emulation engine is pretty good.
If you compare the most compact FPGA ZOG, the P2 version is within 1.57:1, and 5.15:1 for the fastest FPGA.
Are there any spare bytecodes in ZOG at all? One possible scheme could be to use one spare bytecode to stream in native P2 code, allowing fast COG blobs to be mixed in with compact bytecode.
The weird thing about the failure is that it happens only in this specific case (writing data to the UART) and only when the write is invoked from newlib. The data it writes is fine, and it seems to be able to write as much data as we want (so XBYTE has been working fine up to and through the writing routine). I can't rule out an XBYTE bug (since it only happens when XBYTE is enabled) but I think it's probably more likely that either there's a timing bug or I've misunderstood something about XBYTE.

Yes. 1568 bytes right now; that includes 128 bytes of initialization data in HUB. There's certainly room to reduce the COG code size with SKIP. There are also 256 words free in LUT.
I don't know what the average is. The shortest paths are one instruction (with a _ret_ prefix), which with XBYTE executes in 8 cycles.
There's a generic SYSCALL bytecode which we could use for that, I think, and there are 4 opcodes that are marked as illegal instructions; we could re-purpose some of those.
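A sketch of how one escape bytecode could hand off to native code: the interpreter fetches a pointer-sized operand, calls it, and resumes interpreting at the next bytecode. The opcode values here are hypothetical; the real ZPU encoding assigns different numbers to SYSCALL and the illegal instructions.

```c
#include <stdint.h>
#include <string.h>

typedef void (*native_fn)(void);

/* Hypothetical opcode values -- the real ZPU encoding differs. */
enum { OPC_NOP = 0x00, OPC_NATIVE = 0xFD, OPC_HALT = 0xFF };

/* Dispatch loop with an escape opcode: OPC_NATIVE is followed by a
 * pointer-sized operand holding the address of a native routine.  The
 * interpreter calls it, then resumes with the next bytecode. */
static void run(const uint8_t *pc)
{
    for (;;) {
        switch (*pc++) {
        case OPC_NOP:
            break;
        case OPC_NATIVE: {
            native_fn fn;
            memcpy(&fn, pc, sizeof fn);  /* unaligned-safe operand fetch */
            pc += sizeof fn;
            fn();                        /* execute the native blob */
            break;
        }
        case OPC_HALT:
            return;
        }
    }
}
```

On the P2 the "call" step would instead load the blob into COG RAM (or jump to it in HUB exec mode), but the control flow is the same: escape, run native, resume bytecode.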
Eric
Right. Well, the xxtea numbers are in cycles, so the frequency doesn't matter, but the dhrystone numbers assume an 80 MHz clock.
https://github.com/totalspectrum/zog
https://github.com/totalspectrum/riscvemu
The RiscV JIT engine has also received some significant improvements, to the point where emulated RiscV is competitive with native P2 code. Here are recent benchmarks. I've used an 80MHz clock for easy comparison with P1 code:
The result that surprised me most is that riscvjit beat everything else in fftbench. I presume that's because the gcc 8.2 compiler that it's using has some new optimizations over the gcc 6.0 that was used by p2gcc. Conversely, I expected riscvjit to be closer on xxtea than it is; we're probably seeing some effect due to the cache it uses, where some key function is straddling a cache line.
Compression tends to reduce program size by about 25%, so on a 256K program you'd save around 64K (possibly a bit less). There's not much performance difference. On benchmarks the compressed version may even be slightly faster, but it mainly depends on how much expanded code fits in cache, which tends to be the same for compressed and non-compressed instructions.
The repository is at https://github.com/totalspectrum/riscvemu.