I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application). I've attached a .zip file with my sources here; they're also checked in to https://github.com/totalspectrum/zog. (*Edit*: the .zip file has been removed; please see the git repository for current sources.)
A few things of interest I found in the process of doing the port:
(1) The interpreter has some kind of weird I/O bug if XBYTE is enabled -- the newlib fflush() function seems to trigger a stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing-related bug in my code, but I haven't been able to track it down. For now I've worked around it by using my own simple_printf.c to print results.
(2) The software multiply routine (basically the one Heater and Lonesock did for the P1 ZOG) is often slightly faster than the CORDIC one. It depends on the magnitude of the inputs.
(3) There are a lot of features on P2 that help interpreters, and they can make a big difference. I think the biggest helpers were streaming the opcodes from RAM (XBYTE does this automatically, of course, but on my way to XBYTE I had an intermediate step that used RDFAST) and using the LUT to hold the opcode table. XBYTE itself is pretty nice too, of course.
(4) The performance even with XBYTE is still a lot slower than native code. For xxtea, for example, the native (fastspin compiled) version takes 18327 cycles to do xxtea decode, whereas the XBYTE zog_p2 takes 186556 cycles, which is roughly 10 times as many. Tuning the bytecode to the architecture will probably make a big difference, but ultimately I think a register-based virtual machine will outperform a stack-based one -- the HUB memory definitely becomes a bottleneck.
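To make points (2) and (3) a bit more concrete, here is a rough sketch in portable C. It is not the actual ZOG code: the opcode numbers, `soft_mul`, `vm_run`, and the handler names are all made up for illustration, and the real interpreter is written in P2 assembly. The multiply is a generic shift-and-add loop (I'm not claiming it matches the Heater/Lonesock routine), which shows why the run time depends on the magnitude of the inputs. The dispatch loop mimics the general idea of streaming opcode bytes and looking each one up in a table, with a plain array standing in for the LUT.

```c
#include <stdint.h>

/* Shift-and-add multiply (illustrative only). The loop runs once per
 * significant bit of the smaller operand, so small inputs finish early. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    if (a < b) { uint32_t t = a; a = b; b = t; }  /* loop over the smaller value */
    while (b != 0) {
        if (b & 1)
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;  /* low 32 bits of the product */
}

/* Hypothetical opcodes; these are NOT the ZPU opcodes that ZOG uses. */
enum { OP_HALT = 0, OP_PUSH8 = 1, OP_ADD = 2, OP_MUL = 3 };

typedef struct {
    const uint8_t *pc;   /* next byte in the opcode stream */
    uint32_t stack[16];  /* no overflow checks in this sketch */
    int sp;
    int running;
} vm_t;

typedef void (*handler_t)(vm_t *);

static void op_halt(vm_t *vm)  { vm->running = 0; }
static void op_push8(vm_t *vm) { vm->stack[vm->sp++] = *vm->pc++; }
static void op_add(vm_t *vm)
{
    uint32_t b = vm->stack[--vm->sp];
    vm->stack[vm->sp - 1] += b;
}
static void op_mul(vm_t *vm)
{
    uint32_t b = vm->stack[--vm->sp];
    vm->stack[vm->sp - 1] = soft_mul(vm->stack[vm->sp - 1], b);
}

/* Run a bytecode program and return the value left on top of the stack. */
uint32_t vm_run(const uint8_t *code)
{
    handler_t table[256];  /* one handler per opcode byte */
    vm_t vm = { code, {0}, 0, 1 };
    int i;

    for (i = 0; i < 256; i++)
        table[i] = op_halt;  /* unknown opcodes simply stop the VM */
    table[OP_PUSH8] = op_push8;
    table[OP_ADD]   = op_add;
    table[OP_MUL]   = op_mul;

    while (vm.running)
        table[*vm.pc++](&vm);  /* fetch a byte, look it up, jump to its handler */

    return vm.sp > 0 ? vm.stack[vm.sp - 1] : 0;
}
```

For example, the program `{ OP_PUSH8, 2, OP_PUSH8, 3, OP_ADD, OP_PUSH8, 4, OP_MUL, OP_HALT }` computes (2 + 3) * 4. Every push and pop here goes through the `stack` array in memory, which is the kind of traffic that makes me think a register-based VM would do better.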