ZPU (and RISCV) emulation for P2 (now with XBYTE)
ersmith
Posts: 6,053
I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application). I've attached a .zip file with my sources here; they're also checked in to https://github.com/totalspectrum/zog. (*Edit*: the .zip file is removed, please see the git repository for current sources).
A few things of interest I found in the process of doing the port:
(1) The interpreter has some kind of weird I/O bug if XBYTE is enabled -- the newlib fflush() function seems to trigger a stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing related bug in my code, but I haven't been able to track it down. For now I've worked around it by using my own simple_printf.c to print results.
(2) The software multiply routine (basically the one Heater and Lonesock did for the P1 ZOG) is often slightly faster than the CORDIC one. It depends on the magnitude of the inputs.
(3) There are a lot of features on P2 that help interpreters, and they can make a big difference. I think the biggest helpers were streaming the opcodes from RAM (XBYTE does this automatically, of course, but on my way to XBYTE I had an intermediate step that used RDFAST) and using the LUT to hold the opcode table. XBYTE itself is pretty nice too, of course.
(4) The performance even with XBYTE is still a lot slower than native code. For xxtea, for example, the native (fastspin compiled) version takes 18327 cycles to do xxtea decode, whereas the XBYTE zog_p2 takes 186556 cycles. Tuning the bytecode to the architecture will probably make a big difference, but ultimately I think a register based virtual machine will outperform a stack based one -- the HUB memory definitely becomes a bottleneck.
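To illustrate point (2): a software multiply's run time can scale with operand magnitude, while a fixed-latency hardware path cannot. This is a generic shift-and-add sketch in C (not the actual P1 ZOG routine, which is hand-written PASM) showing the early-exit behavior:

```c
#include <stdint.h>

/* Generic shift-and-add multiply (not the actual P1 ZOG routine).
 * The loop exits as soon as no multiplier bits remain, so small
 * multipliers finish in fewer iterations -- which is why a software
 * multiply can beat a fixed-latency path for small operands. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    while (b != 0) {        /* early exit when the multiplier is exhausted */
        if (b & 1)
            result += a;
        a <<= 1;
        b >>= 1;
    }
    return result;
}
```

With b = 6 the loop runs only three iterations; with a full 32-bit multiplier it runs thirty-two, which is the magnitude dependence described above.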
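To illustrate point (4): in a stack machine every operation touches the memory-resident stack (HUB RAM, in ZOG's case), while a register machine keeps operands in a small register file (COG registers on the P2) and only touches memory on explicit loads and stores. A toy C model of the two dispatch styles, using made-up opcodes rather than ZOG's actual bytecode:

```c
#include <stdint.h>

/* Toy models contrasting the two VM styles (not ZOG's real encoding).
 * The stack VM reads/writes the in-memory stack on every operation;
 * the register VM works out of a small register file. */

enum { OP_PUSH, OP_ADD, OP_HALT };

/* Stack machine: every op is memory traffic on the stack. */
static int32_t stack_vm(const int32_t *code)
{
    int32_t stack[16];
    int sp = 0;
    for (;;) {
        switch (*code++) {
        case OP_PUSH: stack[sp++] = *code++; break;          /* 1 write  */
        case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break; /* 2 reads, 1 write */
        case OP_HALT: return stack[sp-1];
        }
    }
}

enum { R_LOADI, R_ADD, R_HALT };

/* Register machine: operands stay in the register file. */
static int32_t reg_vm(const int32_t *code)
{
    int32_t r[4] = {0};
    for (;;) {
        switch (*code++) {
        case R_LOADI: { int d = *code++; r[d] = *code++; break; }
        case R_ADD:   { int d = *code++, s = *code++; r[d] += r[s]; break; }
        case R_HALT:  return r[0];
        }
    }
}
```

Both compute the same result, but the stack version needs a memory access per operand on every arithmetic op, which is where the HUB bottleneck comes from.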
Comments
Is it easy to try XBYTE from HUB, as that is more linear in operation? Might give a clue?
Could be a bug in the fast-skip side of XBYTE?
Makes the SW look good, but the other benefit of HW MUL/DIV is the code size drops, allowing that space to be used for other things.
Is the LUT opcode table 256 words ?
What is the code size of the emulation engine ?
With the new dual-port LUT, I'm thinking of a paired COG that can load bytecodes into the paired LUT from XIP memory.
Do you have a metric on P2 cycles per byte opcode fetched, on average?
i.e. what external memory bandwidth would be needed in order to not slow down the emulation COG?
That's expected, but I think getting to a 10:1 ratio for an emulation engine is pretty good.
If you compare the most compact FPGA ZOG, the P2 version is within 1.57:1, and 5.15:1 for the fastest FPGA.
Are there any spare bytecodes in ZOG at all? One possible scheme could be to use one spare bytecode to stream in native P2 code, allowing fast COG blobs to be mixed in with compact bytecode.
The weird thing about the failure is that it happens only in this specific case (writing data to the UART) and only when the write is invoked from newlib. The data it writes is fine, and it seems to be able to write as much data as we want (so XBYTE has been working fine up to and through the writing routine). I can't rule out an XBYTE bug (since it only happens when XBYTE is enabled) but I think it's probably more likely that either there's a timing bug or I've misunderstood something about XBYTE.

Yes. 1568 bytes right now; that includes 128 bytes of initialization data in HUB. There's certainly room to reduce the COG code size with SKIP. There are also 256 words free in LUT.
I don't know what the average is. The shortest paths are one instruction (with a _ret_ prefix), which with XBYTE executes in 8 cycles.
There's a generic SYSCALL bytecode which we could use for that, I think, and there are 4 opcodes that are marked as illegal instructions; we could re-purpose some of those.
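A sketch of how one escape bytecode could hand off to native code: the interpreter fetches a pointer-sized operand, calls it, and resumes interpreting at the next bytecode. The opcode values here are hypothetical; the real ZPU encoding assigns different numbers to SYSCALL and the illegal instructions.

```c
#include <stdint.h>
#include <string.h>

typedef void (*native_fn)(void);

/* Hypothetical opcode values -- the real ZPU encoding differs. */
enum { OPC_NOP = 0x00, OPC_NATIVE = 0xFD, OPC_HALT = 0xFF };

/* Dispatch loop with an escape opcode: OPC_NATIVE is followed by a
 * pointer-sized operand holding the address of a native routine.  The
 * interpreter calls it, then resumes with the next bytecode. */
static void run(const uint8_t *pc)
{
    for (;;) {
        switch (*pc++) {
        case OPC_NOP:
            break;
        case OPC_NATIVE: {
            native_fn fn;
            memcpy(&fn, pc, sizeof fn);  /* unaligned-safe operand fetch */
            pc += sizeof fn;
            fn();                        /* execute the native blob */
            break;
        }
        case OPC_HALT:
            return;
        }
    }
}
```

On the P2 the "call" step would instead load the blob into COG RAM (or jump to it in HUB exec mode), but the control flow is the same: escape, run native, resume bytecode.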
Eric
Right. Well, the xxtea numbers are in cycles, so the frequency doesn't matter, but the dhrystone numbers assume an 80 MHz clock.
https://github.com/totalspectrum/zog
https://github.com/totalspectrum/riscvemu
The RiscV JIT engine has also received some significant improvements, to the point where emulated RiscV is competitive with native P2 code. Here are recent benchmarks. I've used an 80MHz clock for easy comparison with P1 code:
The result that surprised me most is that riscvjit beat everything else in fftbench. I presume that's because the gcc 8.2 compiler that it's using has some new optimizations over the gcc 6.0 that was used by p2gcc. Conversely, I expected riscvjit to be closer on xxtea than it is; we're probably seeing some effect due to the cache it uses, where some key function is straddling a cache line.
Compression tends to reduce program size by about 25%, so on a 256K program you'd save around 64K (possibly a bit less). There's not much performance difference. On benchmarks the compressed version may even be slightly faster, but it mainly depends on how much expanded code fits in cache, which tends to be the same for compressed and non-compressed instructions.
The repository is at https://github.com/totalspectrum/riscvemu.