ZPU emulation for P2 (now with XBYTE)

I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application). I've attached a .zip file with my sources here; they're also checked in to https://github.com/totalspectrum/zog. A few things of interest I found in the process of doing the port:

(1) The interpreter has a weird I/O bug when XBYTE is enabled -- the newlib fflush() function seems to trigger stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing-related bug in my code, but I haven't been able to track it down. For now I've worked around it by using my own simple_printf.c to print results.

(2) The software multiply routine (basically the one Heater and Lonesock did for the P1 ZOG) is often slightly faster than the CORDIC one. It depends on the magnitude of the inputs.

(3) There are a lot of features on P2 that help interpreters, and they can make a big difference. I think the biggest helpers were streaming the opcodes from RAM (XBYTE does this automatically, of course, but on my way to XBYTE I had an intermediate step that used RDFAST) and using the LUT to hold the opcode table. XBYTE itself is pretty nice too, of course.

(4) Even with XBYTE, performance is still a lot slower than native code (roughly 10x on xxtea). The native (fastspin compiled) version takes 18327 cycles to do an xxtea decode, whereas the XBYTE zog_p2 takes 186556 cycles. Tuning the bytecode to the architecture will probably make a big difference, but ultimately I think a register-based virtual machine will outperform a stack-based one -- the HUB memory definitely becomes a bottleneck.
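
As an illustration of point (2), here is a minimal sketch of a shift-and-add multiply whose run time depends on operand magnitude. It is not the actual ZOG routine: the name `soft_mul32`, the operand swap, and the step counter are assumptions made for the example. The loop stops as soon as the remaining multiplier bits are zero, so small inputs finish in a few iterations, while a hardware unit would presumably take about the same time regardless of the inputs.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to 32 bits, as on a 32-bit machine


def soft_mul32(a, b):
    """Shift-and-add multiply.

    Returns (low 32 bits of a*b, number of loop iterations).
    The iteration count stands in for run time in this sketch.
    """
    a &= MASK32
    b &= MASK32
    # Loop over the smaller operand so the loop ends sooner (an assumption,
    # not necessarily what the real routine does).
    if a < b:
        a, b = b, a
    result = 0
    steps = 0
    while b:                      # early exit once no multiplier bits remain
        if b & 1:
            result = (result + a) & MASK32
        a = (a << 1) & MASK32
        b >>= 1
        steps += 1
    return result, steps


if __name__ == "__main__":
    for x, y in [(3, 5), (1000, 1000), (0xFFFFFFFF, 0xFFFFFFFF)]:
        product, steps = soft_mul32(x, y)
        print(f"{x:#x} * {y:#x} = {product:#x} in {steps} iterations")
```

Small operands need only a handful of iterations and the worst case is 32, which is one way the software routine could win or lose against a fixed-latency multiplier depending on the data.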


Comments

  • Here are xxtea results on both P1 and P2. jmg requested sizes, so I've done that, but note that you can't really get much info from these because they are the total size of the executables including all libraries and any kernel (LMM/CMM have 1K and 2K each for kernel, roughly, and the ZOG binary has at least 1K of interrupt vectors and library routines).
    P1:
    propgcc lmm:   22592 cycles   4380 bytes
    fastspin 3.6:  38064 cycles   3184 bytes
    propgcc cmm:  365136 cycles   3336 bytes
    riscvemu:     497824 cycles   1544 bytes
    zog JIT:      679168 cycles   4516 bytes
    zog 1.6LE:    885696 cycles   4516 bytes
    openspin:    1044640 cycles   1200 bytes
    
    P2:
    fastspin 3.6:  18327 cycles   3744 bytes
    zog_p2:       186556 cycles   4516 bytes
    riscvemu:     243848 cycles   1544 bytes
    
  • Finally, dhrystone numbers for the ZPU:
    Original zog 1.6 on P1:   436 dhrystones/sec
    zog_p2 on P2 w/o XBYTE:  1527 dhrystones/sec
    zog_p2 on P2 w. XBYTE:   1905 dhrystones/sec
    
    ZPUFlex_min (FPGA):     ~2990 dhrystones/sec
    ZPUFlex_fast (FPGA):    ~9840 dhrystones/sec
    
    PropGCC LMM on P1:       6016 dhrystones/sec
    PropGCC CMM on P1:       2070 dhrystones/sec
    
    The ZPUFlex numbers are from http://retroramblings.net.
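
The ratios implied by the xxtea and dhrystone figures above can be worked out with a short script. This is only a convenience sketch: the numbers are copied from the lists above, and the helper name `slowdown_vs` is made up for the example.

```python
# xxtea decode cycle counts on P2, copied from the results above
XXTEA_P2 = {"fastspin 3.6": 18327, "zog_p2": 186556, "riscvemu": 243848}

# dhrystones/sec, copied from the results above
DHRYSTONE = {
    "zog_p2 w/o XBYTE": 1527,
    "zog_p2 w. XBYTE": 1905,
    "ZPUFlex_min": 2990,
    "ZPUFlex_fast": 9840,
}


def slowdown_vs(baseline, table):
    """Return each entry's cycle count divided by the baseline entry's."""
    base = table[baseline]
    return {name: cycles / base for name, cycles in table.items()}


if __name__ == "__main__":
    for name, factor in slowdown_vs("fastspin 3.6", XXTEA_P2).items():
        print(f"xxtea {name:14s} {factor:6.2f}x native")

    xbyte = DHRYSTONE["zog_p2 w. XBYTE"]
    print(f"XBYTE gain:     {xbyte / DHRYSTONE['zog_p2 w/o XBYTE']:.2f}x")
    print(f"ZPUFlex_min:    {DHRYSTONE['ZPUFlex_min'] / xbyte:.2f}x zog_p2")
    print(f"ZPUFlex_fast:   {DHRYSTONE['ZPUFlex_fast'] / xbyte:.2f}x zog_p2")
```

On these numbers the XBYTE interpreter is about 10x slower than native fastspin code on xxtea, and XBYTE gives about a 25% dhrystone improvement over the non-XBYTE build.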
    


  • jmg
    ersmith wrote: »
    I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application).
    Wow, nice collection of stats...
    ersmith wrote: »
    (1) The interpreter has some kind of weird I/O bug if XBYTE is enabled -- the newlib fflush() function seems to trigger a stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing related bug in my code, but I haven't been able to track it down.
    Is it easy to try XBYTE from HUB, as that is more linear in operation? Might give a clue?
    Could be a bug in the fast-skip side of XBYTE?
    ersmith wrote: »
    (2) The software multiply routine (basically the one Heater and Lonesock did for the P1 ZOG) is often slightly faster than the CORDIC one. It depends on the magnitude of the inputs.
    Makes the SW look good, but the other benefit of HW MUL/DIV is that the code size drops, allowing that space to be used for other things.
    ersmith wrote: »
    (3) There are a lot of features on P2 that help interpreters, and they can make a big difference. I think the biggest helpers were streaming the opcodes from RAM (XBYTE does this automatically, of course, but on my way to XBYTE I had an intermediate step that used RDFAST) and using the LUT to hold the opcode table. XBYTE itself is pretty nice too, of course.
    Is the LUT opcode table 256 words?
    What is the code size of the emulation engine?

    With the new dual-port LUT, I'm thinking of a paired COG that can load bytecodes into the paired LUT from XIP memory.
    Do you have a metric for P2 cycles per bytecode fetched, on average?

    i.e. what external memory bandwidth would be needed in order not to slow down the emulation COG?

    ersmith wrote: »
    (4) The performance even with XBYTE is still a lot slower than native code. For xxtea, for example, the native (fastspin compiled) version takes 18327 cycles to do xxtea decode, whereas the XBYTE zog_p2 takes 186556 cycles. Tuning the bytecode to the architecture will probably make a big difference, but ultimately I think a register based virtual machine will outperform a stack based one -- the HUB memory definitely becomes a bottleneck.
    That's expected, but I think getting to a 10:1 ratio for an emulation engine is pretty good.
    If you compare against the most compact FPGA ZPU (ZPUFlex_min), the P2 version is within 1.57 : 1, and about 5.17 : 1 for the fastest FPGA one (ZPUFlex_fast).

    Are there any spare bytecodes in ZOG at all? One possible scheme could be to use one spare bytecode to stream in native P2 code, allowing fast COG blobs to be mixed in with compact bytecode.




  • These numbers are for 80 MHz P2, right?
    Prop Info and Apps: http://www.rayslogic.com/
  • jmg wrote: »
    ersmith wrote: »
    I mentioned this over in the P1 ZOG discussion, but thought it would be relevant here, especially since I've updated the P2 ZOG interpreter to optionally use the XBYTE mechanism. It seems to be working well, and gives a nice performance bump (around 20-30%, depending on the application).
    Wow, nice collection of stats...
    ersmith wrote: »
    (1) The interpreter has some kind of weird I/O bug if XBYTE is enabled -- the newlib fflush() function seems to trigger a stack corruption. It only happens in XBYTE mode, which is odd. My suspicion is that there's a timing related bug in my code, but I haven't been able to track it down.
    Is it easy to try XBYTE from HUB, as that is more linear in operation? Might give a clue?
    Could be a bug in the fast-skip side of XBYTE?
    XBYTE uses the streamer, so I don't think it can be used at the same time as hub exec. Moreover, I don't actually have any skip fields set right now -- I have a separate routine for each opcode. Using skip would allow some code space reduction, but the other P2 interpreter tools have already given us quite a bit of space in the COG compared to P1.

    The weird thing about the failure is that it happens only in this specific case (writing data to the UART) and only when the write is invoked from newlib. The data it writes is fine, and it seems to be able to write as much data as we want (so XBYTE has been working fine up to and through the writing routine). I can't rule out an XBYTE bug (since it only happens when XBYTE is enabled) but I think it's probably more likely that either there's a timing bug or I've misunderstood something about XBYTE.
    Is the LUT opcode table 256 words ?
    Yes.
    What is the code size of the emulation engine ?
    1568 bytes right now; that includes 128 bytes of initialization data in HUB. There's certainly room to reduce the COG code size with SKIP. There are also 256 words free in LUT.
    Do you have a metric for P2 cycles per bytecode fetched, on average?
    i.e. what external memory bandwidth would be needed in order not to slow down the emulation COG?
    I don't know what the average is. The shortest paths are a single instruction (with a _ret_ prefix), which with XBYTE executes in 8 cycles.
    Are there any spare bytecodes in ZOG at all? One possible scheme could be to use one spare bytecode to stream in native P2 code, allowing fast COG blobs to be mixed in with compact bytecode.

    There's a generic SYSCALL bytecode which we could use for that, I think, and there are 4 opcodes that are marked as illegal instructions; we could repurpose some of those.

    Eric
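
The 8-cycle figure above gives an upper bound on how fast bytecodes can be consumed, which partly answers the bandwidth question. A rough sketch, assuming the 80 MHz clock mentioned elsewhere in this thread and one byte per opcode; the function name `peak_bytecode_rate` is made up for the example, and the real average rate is unknown and will be lower because most handlers take longer than the shortest path.

```python
def peak_bytecode_rate(clock_hz, min_cycles_per_bytecode):
    """Upper bound on bytecodes executed per second.

    Assumes every bytecode takes the shortest handler path, which
    overstates the real rate.
    """
    return clock_hz // min_cycles_per_bytecode


if __name__ == "__main__":
    clock_hz = 80_000_000        # assumed P2 clock, taken from this thread
    shortest_path_cycles = 8     # shortest XBYTE handler, per the post above
    rate = peak_bytecode_rate(clock_hz, shortest_path_cycles)
    # One byte per opcode, so bytes/sec equals bytecodes/sec here.
    print(f"peak opcode fetch rate: {rate / 1e6:.1f} MB/s")
```

So an external memory feeding opcodes would need on the order of 10 MB/s at most to keep up with the interpreter at that clock, not counting data accesses made by the bytecodes themselves.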
  • Wow! I'm glad you're using XBYTE, already. I'll make a new release soon to cover the improved SETQ+XBYTE behavior (LSB-based vs MSB-based bytecodes).
  • Rayman wrote: »
    These numbers are for 80 MHz P2, right?

    Right. Well, the xxtea numbers are in cycles, so the frequency doesn't matter, but the dhrystone numbers assume an 80 MHz clock.
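
Since the dhrystone numbers assume 80 MHz, they can be converted to clock cycles per dhrystone iteration, which is frequency-independent like the xxtea figures. A small sketch using the zog_p2 numbers from earlier in the thread; the function name is made up for the example.

```python
def cycles_per_dhrystone(clock_hz, dhrystones_per_sec):
    """Clock cycles spent per dhrystone iteration at the given clock."""
    return clock_hz / dhrystones_per_sec


if __name__ == "__main__":
    clock_hz = 80_000_000  # the clock the dhrystone numbers assume
    for label, score in [("zog_p2 w/o XBYTE", 1527), ("zog_p2 w. XBYTE", 1905)]:
        cycles = cycles_per_dhrystone(clock_hz, score)
        print(f"{label:18s} {cycles:9.0f} cycles/dhrystone")
```

That works out to roughly 52,000 cycles per iteration without XBYTE and roughly 42,000 with it.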
  • jmg
    ersmith wrote: »
    Right. Well, the xxtea numbers are in cycles, so the frequency doesn't matter, but the dhrystone numbers assume an 80 MHz clock.
    Do you have the ZPUFlex_min (FPGA) and ZPUFlex_fast (FPGA) clock speeds for those dhrystone numbers?
