riscvp2: a C and C++ compiler for P2


Comments

  • Personally I’d think of a JIT compiler as slower than an optimized emulator.
  • ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

  • Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...
    I don't see the difference. You could just think of the RISC-V instructions as byte code. They just happen to run natively on some hardware. Here they're JIT compiled to P2 instructions.
  • jmg
    David Betz wrote: »
    Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...
    I don't see the difference. You could just think of the RISC-V instructions as byte code. They just happen to run natively on some hardware. Here they're JIT compiled to P2 instructions.

    Maybe different benchmarks are needed when JIT comes into the mix?
    e.g. something that calls one routine that toggles a pin, then calls another that toggles at a slightly different speed, using different code (see the sketch below).
    The delay before each of those runs would be larger with a JIT compiler than with an interpreter, and an interpreter would be slower than PASM.
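    Along those lines, here is a rough sketch of such a test in C. It assumes the propeller2.h names (_pinnot, _cnt, _waitx) that appear later in this thread and has not been run against any of the compilers; the routine names and delay values are made up for illustration.
    #include <stdio.h>
    #include <propeller2.h>
    
    #define PIN_A 0
    #define PIN_B 1
    
    /* Two different routines, so a JIT has to translate each one separately. */
    void toggle_a(void) { for (int i = 0; i < 10; i++) { _pinnot(PIN_A); _waitx(100); } }
    void toggle_b(void) { for (int i = 0; i < 10; i++) { _pinnot(PIN_B); _waitx(150); } }
    
    int main(void)
    {
        unsigned t0, t1, t2;
    
        t0 = _cnt();
        toggle_a();          /* first call: includes any JIT translation cost */
        t1 = _cnt();
        toggle_b();          /* different code, so it must be translated too */
        t2 = _cnt();
        printf("first call of A: %u cycles, first call of B: %u cycles\n", t1 - t0, t2 - t1);
    
        t0 = _cnt();
        toggle_a();          /* second call: should run from the translation cache */
        t1 = _cnt();
        printf("second call of A: %u cycles\n", t1 - t0);
        return 0;
    }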
  • Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...

    JVM bytecodes are machine instructions, for the "Java Virtual Machine". Java implementations on PCs are emulators for the JVM. These implementations may be either interpreters or JIT compilers (or some mix of the two).

    Similarly, the riscvp2 emulator is a JIT compiler. It literally does compile RISC-V instructions to P2 instructions. It does not interpret those instructions one at a time.
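    To make the distinction concrete, here is a tiny sketch in C (toy opcodes and helpers, nothing to do with riscvp2's actual implementation): an interpreter decodes and dispatches every instruction every time it executes, while a JIT translates a block once, caches the result, and jumps straight into the cached native code on later executions.
    #include <stdint.h>
    #include <stddef.h>
    
    enum { OP_ADD, OP_NOT, OP_HALT };                 /* toy "bytecode" */
    typedef struct { uint8_t op; uint8_t arg; } insn_t;
    
    /* Interpreter: decode and dispatch on every execution of every instruction. */
    static uint32_t interpret(const insn_t *code, uint32_t acc) {
        for (size_t pc = 0; ; pc++) {
            switch (code[pc].op) {
            case OP_ADD:  acc += code[pc].arg; break;
            case OP_NOT:  acc = ~acc;          break;
            case OP_HALT: return acc;
            }
        }
    }
    
    /* JIT (conceptually): translate the block to native code once, cache it,
     * then call the cached native routine directly on later executions. */
    typedef uint32_t (*native_fn)(uint32_t);
    
    static uint32_t translated_block(uint32_t acc) {  /* stand-in for emitted native code */
        return ~(acc + 5);
    }
    static native_fn translate(const insn_t *code) {
        (void)code;                                   /* a real JIT would emit machine code here */
        return translated_block;
    }
    
    static native_fn jit_cache;                       /* one-entry "translation cache" */
    
    static uint32_t run_jit(const insn_t *code, uint32_t acc) {
        if (!jit_cache)                               /* pay the translation cost only once */
            jit_cache = translate(code);
        return jit_cache(acc);                        /* later calls run at native speed */
    }
    
    int main(void) {
        static const insn_t prog[] = { {OP_ADD, 5}, {OP_NOT, 0}, {OP_HALT, 0} };
        return interpret(prog, 1) == run_jit(prog, 1) ? 0 : 1;   /* both give the same result */
    }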
  • Dave Hein wrote: »
    ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

    Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).

    I am working on some generic P2 floating point routines, both 32 and 64 bit, which I hope can be used by all the compilers including p2gcc.
  • Whatever you call it ("emulator", "JIT compiler", "CMM"), what matters ultimately to the user is the performance. So here's another example, toggling a pin as fast as it can. Here's the source code. A lot of it is defines to make it portable to different compilers; once we've all implemented a standard propeller2.h that should simplify things a lot.
    #define PIN 0
    #define COUNT 1000000
    #include <stdio.h>
    #if defined(__riscv) || defined(__propeller2__)
    #include <propeller2.h>
    #define togglepin(x) _pinnot(x)
    #define getcnt(x) _cnt(x)
    #else
    #include <propeller.h>
    #define togglepin(x) OUTA ^= (1<<x)
    #endif
    
    #ifdef __CATALINA__
    #define waitcnt _waitcnt
    #define getcnt _cnt
    #endif
    
    #ifdef __FASTSPIN__
    // work around bug in __builtin_printf
    #undef printf
    #endif
    
    #ifndef getcnt
    #define getcnt() CNT
    #endif
    #ifndef _waitx
    #define _waitx(x) waitcnt(getcnt() + x)
    #endif
    
    void
    toggle(unsigned int n)
    {
        int i;
        do {
            togglepin(PIN);
        } while (--n != 0);
    }
    
    int
    main()
    {
        unsigned long elapsed;
    
        _waitx(10000000);
        printf("toggle test\n");
        elapsed = getcnt();
        toggle(COUNT);
        elapsed = getcnt() - elapsed;
        printf("%u toggles takes %9u cycles\n", COUNT, elapsed);
        for(;;)
            ;
    }
    

    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    
    Command lines used:
    
    fastspin -2 -O2 toggle.c
    p2gcc -o toggle.binary toggle.c
    riscv-none-embed-gcc -T riscvp2_lut.ld -Os -o toggle.elf toggle.c
    catalina -lci -p2 -O3 -C P2_EVAL -C NATIVE toggle.c
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
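    Working the per-iteration cost out of those numbers (treating the small remainder in each as the one-time overhead just described):
    fastspin: (2000107  - 107)  / 1000000 =  2 cycles/toggle  (+ ~107 cycles of FCACHE load)
    riscvp2:  (10003362 - 3362) / 1000000 = 10 cycles/toggle  (+ ~3362 cycles of JIT compilation)
    p2gcc:    (16000024 - 24)   / 1000000 = 16 cycles/toggle
    catalina: (24000447 - 447)  / 1000000 = 24 cycles/toggle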

    The literal assembly code produced by riscvp2 for the toggle function is as follows:

    RISC-V code (output of riscv-none-embed-gcc):
    000025e8 <toggle>:
        25e8:	4781                	li	a5,0
        25ea:	c007a00b          	0xc007a00b
        25ee:	157d                	addi	a0,a0,-1
        25f0:	fd6d                	bnez	a0,25ea <toggle+0x2>
        25f2:	8082                	ret
    
    The "0xc007a00b" instruction is a RISC-V "drvnot" instruction, one of the custom P2 instructions, which the generic RISC-V toolchain doesn't understand. We could easily add support for this to the disassembler in the future.

    The RISC-V code above gets compiled at runtime into:
    00004AE8 F6041E00        mov    $00F,#$000 
    00004AEA FD601E5F        drvnot $00F 
    00004AEE F1841401        sub    $00A,#$001 
    00004AF0 F2181400        cmp    $00A,$000 wcz
    00004AF4 FEE04AEA        loc    ptrb, #00004AEA 
    00004AF8 5E200144 if_nz  calld  pb, #00000144 
    00004AFC FEE04AF2        loc    ptrb, #00004AF2 
    00004B00 FE200144        calld  pb, #00000144 
    
    And the "loc ptrb/ calld" sequence gets further optimized to a "jmp" after it's placed in LUT and executed the first time. So the actual runtime loop looks like:
    here
            drvnot  a5
    	sub	a0, #1
    	cmp	a0, #0 wcz
      if_nz	jmp	#here
    
    Which, sure enough, is 10 cycles/iteration: drvnot, sub, and cmp take 2 cycles each, and the taken branch takes 4. It's definitely not optimal, but OTOH it's better than what some of the other compilers produce.

    For comparison:

    fastspin's version of toggle is actually inlined into main and looks like:
    	loc	pa,	#(@LR__0003-@LR__0001)
    	calla	#FCACHE_LOAD_
    LR__0001
    	rep	@LR__0004, ##1000000
    LR__0002
    	drvnot	#0
    LR__0003
    LR__0004
    

    p2gcc's looks like:
            .global _toggle
    _toggle
    .L2
            xor     OUTA,#1
            djnz    r0,#.L2
            jmp     lr
    
    which is pretty nice code, and would be faster than riscvp2 except for the hubexec penalty.

    Catalina's looks like:
     alignl ' align long
    C_toggle ' <symbol:toggle>
     PRIMITIVE(#PSHM)
     long $400000 ' save registers
    C_toggle_2
     xor OUTA, #1 ' ASGNU4 special BXOR4 special coni
    ' C_toggle_3 ' (symbol refcount = 0)
     mov r22, r2
     sub r22, #1 ' SUBU4 coni
     mov r2, r22 ' CVI, CVU or LOAD
     cmp r22,  #0 wz
     if_nz jmp #C_toggle_2  ' NEU4
    ' C_toggle_1 ' (symbol refcount = 0)
     PRIMITIVE(#POPM) ' restore registers
     PRIMITIVE(#RETN)
    

    I think the PRIMITIVE(#X) macro turns into a "CALL #X".
  • jmg
    ersmith wrote: »
    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
    Nice numbers. That means riscvp2 will not do so well on code that only loops (e.g.) 10 times, as it then costs 10 + 336.2 cycles averaged per iteration (the 3362-cycle compile cost spread over 10 passes).

    Why does fastspin have a 107 clock overhead, vs p2gcc's 24? Is that the cost of the calla #FCACHE_LOAD_?


  • jmg wrote: »
    ersmith wrote: »
    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
    Nice numbers. That means riscvp2 will not do so well on code that only loops (e.g.) 10 times, as it then costs 10 + 336.2 cycles averaged per iteration (the 3362-cycle compile cost spread over 10 passes).
    If the code is only ever called once, then yes, it'll be expensive -- on the other hand code that's only called once is probably not particularly time sensitive.

    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    Why does fastspin have a 107 clock overhead, vs p2gcc's 24? Is that the cost of the calla #FCACHE_LOAD_?

    Yes.
  • jmg
    ersmith wrote: »
    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    That might get tricky on things like serial RX, where the P2 lacks hardware FIFOs, so it will only have one character time from the interrupt until the next character overwrites it.
    Taking an 8M.8.n.2 link (11 bits per character), that's ~1.375us, or about 110 two-cycle opcodes at 160MHz.
    Is there any way to force/pre-direct pre-compiled code into LUT, for example for things like serial interrupts? (Or any interrupt where you want to avoid run-time JIT.)


  • ersmith
    jmg wrote: »
    ersmith wrote: »
    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    That might get tricky on things like serial RX, where the P2 lacks hardware FIFOs, so it will only have one character time from the interrupt until the next character overwrites it.
    Taking an 8M.8.n.2 link (11 bits per character), that's ~1.375us, or about 110 two-cycle opcodes at 160MHz.
    Is there any way to force/pre-direct pre-compiled code into LUT, for example for things like serial interrupts? (Or any interrupt where you want to avoid run-time JIT.)

    There's no interrupt support at the moment, and no way to pre-compile. In general, for code that is that timing sensitive I'd probably run PASM in a COG to handle it. A lot of things could be done in smart pins or with the streamer, too.
  • ersmith wrote: »
    Dave Hein wrote: »
    ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

    Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).
    Last I recall, I was able to get some floating point for Micropython working with p2gcc through some extra hacks, using the floating point library routines provided with Micropython rather than the regular gcc math libs, which are missing in p2gcc.

    See
    http://forums.parallax.com/discussion/comment/1474298/#Comment_1474298

    I did play around beyond this, and other floating point functions besides sqrt also seemed to work (though IIRC one of the tan functions had issues; I'm not sure if that was p2gcc related in any way or just some math lib issue in Micropython itself). None of this was extensively tested; it was just a proof that it could be compiled with this functionality added, and a way to see the size impact.