riscvp2: a C and C++ compiler for P2


Comments

  • Personally I’d think of a jit compiler as slower than an optimized emulator
  • ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

  • Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...
    I don't see the difference. You could just think of the RISC-V instructions as byte code. They just happen to run natively on some hardware. Here they're JIT compiled to P2 instructions.
  • jmg Posts: 14,380
    David Betz wrote: »
    Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...
    I don't see the difference. You could just think of the RISC-V instructions as byte code. They just happen to run natively on some hardware. Here they're JIT compiled to P2 instructions.

    Maybe different benchmarks are needed when JIT comes into the mix ?
    eg Something that calls one routine that toggles a pin, then calls another that toggles at a slightly different speed, using different code, can be used.
    The delay between those running, would be more in a JIT compiler than an interpreter, and an interpreter would be slower than PASM.
  • Rayman wrote: »
    I think the difference is the source... bytecode is not machine instructions...

    JVM bytecodes are machine instructions, for the "Java Virtual Machine". Java implementations on PCs are emulators for the JVM. These implementations may be either interpreters or JIT compilers (or some mix of the two).

    Similarly, the riscvp2 emulator is a JIT compiler. It literally does compile RISC-V instructions to P2 instructions. It does not interpret those instructions one at a time.
  • Dave Hein wrote: »
    ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

    Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).

    I am working on some generic P2 floating point routines, both 32 and 64 bit, which I hope can be used by all the compilers including p2gcc.
  • Whatever you call it ("emulator", "JIT compiler", "CMM"), what ultimately matters to the user is the performance. So here's another example: toggling a pin as fast as it can. Here's the source code. A lot of it is defines to make it portable to different compilers; once we've all implemented a standard propeller2.h that should simplify things a lot.
    #define PIN 0
    #define COUNT 1000000
    #include <stdio.h>
    #if defined(__riscv) || defined(__propeller2__)
    #include <propeller2.h>
    #define togglepin(x) _pinnot(x)
    #define getcnt(x) _cnt(x)
    #else
    #include <propeller.h>
    #define togglepin(x) OUTA ^= (1<<x)
    #endif
    
    #ifdef __CATALINA__
    #define waitcnt _waitcnt
    #define getcnt _cnt
    #endif
    
    #ifdef __FASTSPIN__
    // work around bug in __builtin_printf
    #undef printf
    #endif
    
    #ifndef getcnt
    #define getcnt() CNT
    #endif
    #ifndef _waitx
    #define _waitx(x) waitcnt(getcnt() + x)
    #endif
    
    void
    toggle(unsigned int n)
    {
        do {
            togglepin(PIN);
        } while (--n != 0);
    }
    
    int
    main()
    {
        unsigned long elapsed;
    
        _waitx(10000000);
        printf("toggle test\n");
        elapsed = getcnt();
        toggle(COUNT);
        elapsed = getcnt() - elapsed;
        printf("%u toggles takes %9u cycles\n", COUNT, elapsed);
        for(;;)
            ;
    }
    

    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    
    Command lines used:
    
    fastspin -2 -O2 toggle.c
    p2gcc -o toggle.binary toggle.c
    riscv-none-embed-gcc -T riscvp2_lut.ld -Os -o toggle.elf toggle.c
    catalina -lci -p2 -O3 -C P2_EVAL -C NATIVE toggle.c
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).

    The literal assembly code produced by riscvp2 for the toggle function is as follows:

    RISC-V code (output of riscv-none-embed-gcc):
    000025e8 <toggle>:
        25e8:	4781                	li	a5,0
        25ea:	c007a00b          	0xc007a00b
        25ee:	157d                	addi	a0,a0,-1
        25f0:	fd6d                	bnez	a0,25ea <toggle+0x2>
        25f2:	8082                	ret
    
    The "0xc007a00b" instruction is a RISC-V "drvnot" instruction, one of the custom P2 instructions, which the generic RISC-V toolchain doesn't understand. We could easily add support for this to the disassembler in the future.

    This gets compiled at runtime into:
    00004AE8 F6041E00        mov    $00F,#$000 
    00004AEA FD601E5F        drvnot $00F 
    00004AEE F1841401        sub    $00A,#$001 
    00004AF0 F2181400        cmp    $00A,$000 wcz
    00004AF4 FEE04AEA        loc    ptrb, #00004AEA 
    00004AF8 5E200144 if_nz  calld  pb, #00000144 
    00004AFC FEE04AF2        loc    ptrb, #00004AF2 
    00004B00 FE200144        calld  pb, #00000144 
    
    And the "loc ptrb/ calld" sequence gets further optimized to a "jmp" after it's placed in LUT and executed the first time. So the actual runtime loop looks like:
    here
            drvnot  a5
    	sub	a0, #1
    	cmp	a0, #0 wcz
      if_nz	jmp	#here
    
    Which, sure enough, is 10 cycles/iteration. It's definitely not optimal, but OTOH it's better than what some of the other compilers produce.

    For comparison:

    fastspin's version of toggle is actually inlined into main and looks like:
    	loc	pa,	#(@LR__0003-@LR__0001)
    	calla	#FCACHE_LOAD_
    LR__0001
    	rep	@LR__0004, ##1000000
    LR__0002
    	drvnot	#0
    LR__0003
    LR__0004
    

    p2gcc's looks like:
            .global _toggle
    _toggle
    .L2
            xor     OUTA,#1
            djnz    r0,#.L2
            jmp     lr
    
    which is pretty nice code, and would be faster than riscvp2 except for the hubexec penalty.

    Catalina's looks like:
     alignl ' align long
    C_toggle ' <symbol:toggle>
     PRIMITIVE(#PSHM)
     long $400000 ' save registers
    C_toggle_2
     xor OUTA, #1 ' ASGNU4 special BXOR4 special coni
    ' C_toggle_3 ' (symbol refcount = 0)
     mov r22, r2
     sub r22, #1 ' SUBU4 coni
     mov r2, r22 ' CVI, CVU or LOAD
     cmp r22,  #0 wz
     if_nz jmp #C_toggle_2  ' NEU4
    ' C_toggle_1 ' (symbol refcount = 0)
     PRIMITIVE(#POPM) ' restore registers
     PRIMITIVE(#RETN)
    

    I think the PRIMITIVE(#X) macro turns into a "CALL #X".
  • jmg Posts: 14,380
    ersmith wrote: »
    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
    Nice numbers. That means riscvp2 will not do so well on code that loops (e.g.) 10 times, as it now costs 10 + 336.2 cycles averaged per iteration.

    Why does fastspin have a 107 clock overhead, vs p2gcc's 24 ? Is that the cost of the calla #FCACHE_LOAD_ ?


  • jmg wrote: »
    ersmith wrote: »
    Here are the results for the 4 C compilers I tried on P2:
    fastspin: 1000000 toggles takes   2000107 cycles
    riscvp2:  1000000 toggles takes  10003362 cycles
    p2gcc:    1000000 toggles takes  16000024 cycles
    catalina: 1000000 toggles takes  24000447 cycles
    

    The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
    Nice numbers. That means riscvp2 will not do so well on code that loops (e.g.) 10 times, as it now costs 10 + 336.2 cycles averaged per iteration.
    If the code is only ever called once, then yes, it'll be expensive -- on the other hand code that's only called once is probably not particularly time sensitive.

    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    Why does fastspin have a 107 clock overhead, vs p2gcc's 24 ? Is that the cost of the calla #FCACHE_LOAD_ ?

    Yes.
  • jmg Posts: 14,380
    ersmith wrote: »
    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    That might get tricky on things like serial RX, where the P2 lacks hardware FIFOs, so it will only have one character time from the interrupt until the next character overwrites it.
    Taking an 8M,8,N,2 link, that's ~1.375us, or about 110 two-clock opcodes at 160MHz.
    Is there any way to force/pre-direct pre-compiled code into LUT for example - for things like serial interrupts ? ( or any interrupt, where you want to avoid run-time JIT)


  • ersmith Posts: 4,235
    edited 2019-08-30 - 10:20:31
    jmg wrote: »
    ersmith wrote: »
    If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
    That might get tricky on things like serial RX, where the P2 lacks hardware FIFOs, so it will only have one character time from the interrupt until the next character overwrites it.
    Taking an 8M,8,N,2 link, that's ~1.375us, or about 110 two-clock opcodes at 160MHz.
    Is there any way to force/pre-direct pre-compiled code into LUT for example - for things like serial interrupts ? ( or any interrupt, where you want to avoid run-time JIT)

    There's no interrupt support at the moment, and no way to pre-compile. In general, for code that is that timing sensitive I'd probably run PASM in a COG to handle it. A lot of things could be done in smart pins or with the streamer, too.
  • ersmith wrote: »
    Dave Hein wrote: »
    ersmith wrote: »
    I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
    p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.

    Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).
    Last I recall is that I was able to get some floating point for Micropython working with p2gcc through some extra hacks and using the provided Micropython floating point library routines rather than those provided by the regular gcc math libs which are missing in p2gcc.

    See
    http://forums.parallax.com/discussion/comment/1474298/#Comment_1474298

    I did play around beyond this, and other floating point functions besides sqrt also seemed to work (though IIRC one of the tan functions seemed to have issues; I'm not sure if it was p2gcc-related in any way or just some math-lib issue in Micropython itself). None of this was extensively tested; it was just a proof that it could be compiled with this functionality added, and to see the size impact.
  • How would one go about debugging riscv-p2? I'm using it to build LWIP. My network connection is PPP over serial. Due to difficulties, I tried to run the same code on another platform. That was my PC. It now works perfectly on the PC. The only differences in the code are for system timers and serial port.

    x86_64: ok
    arm 32 bit: ok
    riscv-p2: problems

    How can I help narrow it down between gcc and riscv-p2?

    Some bugs were resolved by eliminating the -Os compiler option. But it still doesn't work.
  • It would be tricky, and this is where having RISC-V translation on the fly makes things harder to debug on the P2 itself. I think you'd almost need to build a special debugger, aware of RISC-V opcodes, on the P2 to be able to step through its assembly code and see the translation to P2 code once you locate roughly where the problem is happening. If you had a real RISC-V board that could run this code, that might help identify whether it is a JIT translation problem somewhere in the P2 or something generic in GCC for RISC-V, if the two systems behave differently. You're probably going to have to learn the RISC-V ISA and then gain a detailed understanding of the translation into P2 code, which is going to be additional work unless you know it already.
  • How would one go about debugging riscv-p2? I'm using it to build LWIP. My network connection is PPP over serial. Due to difficulties, I tried to run the same code on another platform. That was my PC. It now works perfectly on the PC. The only differences in the code are for system timers and serial port.

    I had plans to implement a gdb stub for Risc-V, but haven't gotten to it yet (it might be a good project if you're feeling ambitious). For now I would use printf to try to narrow the problem down; you could, for example, compare results from a known working version and the one that isn't working.

    If you can get it down to a specific function that's going wrong and you think it might be a bug in the JIT compiler itself, you could enable the defines DEBUG_ENGINE, USE_DISASM, and DEBUG_THOROUGH in riscvtrace_p2.spin, do a "make install", and then rebuild the application. To get everything to fit you may also have to disable some other feature like OPTIMIZE_CMP_ZERO by commenting out the appropriate #define. Now when you run you'll get a spew of debug output like:
    =00008252 $9083DBC9 @00008252 =memory chksum
    
    00008252 00000002 FF000041        augs   #00008200 
    00000002 F6040254        mov    $001,#$054 
    
    000080CE 00000001 F603F201        mov    $1F9,$001 
    
    00008254 00000002 FF000057        augs   #0000AE00 
    00000002 F6041E64        mov    $00F,#$064 
    

    The "=xxxx" lines show when there are new cache lines compiled. The =memory chksum is probably of limited utility but shows a checksum of current memory. The other lines show the generated assembly code corresponding to RISC-V instructions (e.g. the RISC-V instruction at 8252 was compiled to "augs #$8200; mov $1, #$54", which amounts to "mov $1, ##$8254"). The RISC-V registers are at the bottom of COG memory, so this is setting up a return address in RISC-V register x1. You can use riscv-objdump to get a disassembly of the RISC-V code to see where addresses like $8252 correspond to RISC-V instructions.
    rogloh wrote: »
    It would be tricky and this is where having RISC-V translation on the fly makes things harder to debug on the P2 itself.
    No more so than with any other virtual machine (like Chip's Spin bytecodes or microPython's bytecodes) on the P2. If anything a bit less, since at least the RISC-V instruction set is well documented and very straightforward. Native debugging on the P2 is not very well supported right now, unfortunately, although we can hope that eventually that will improve.

  • ersmith wrote: »
    For now I would use printf to try to narrow the problem down; you could, for example, compare results from a known working version and the one that isn't working.

    If you can get it down to a specific function that's going wrong and you think it might be a bug in the JIT compiler itself, you could enable the defines DEBUG_ENGINE, USE_DISASM, and DEBUG_THOROUGH in riscvtrace_p2.spin, do a "make install", and then rebuild the application. To get everything to fit you may also have to disable some other feature like OPTIMIZE_CMP_ZERO by commenting out the appropriate #define. Now when you run you'll get a spew of debug output like:

    Thanks, Eric. I have done the printf method and have identified the exact functions that malfunction.

    An update on my status: I looked at the riscvemu tree for hints. I noticed that the Makefile specified -march=rv32im. Adding this option to my LWIP build seems to have resolved most of the issues. The remaining issues might be from lack of memory as the elf file is ~400kB. I don't know right now what the default -march from gcc is, but I'll try to find out.

    I noticed in the micropython thread that there is a non-jit version of the emulator. I tried to run my objects using that, but have not been successful yet. It's probably something with the binary not being loaded in the right place, and maybe not stripping off the p2 code added by riscvp2. It looks like objcopy doesn't do that. It looks like micropython is now built with -march=rv32imc. I'll try that tomorrow.



  • Thanks, Eric. I have done the printf method and have identified the exact functions that malfunction.
    Are you able to share those? Perhaps I could reproduce the problem here.
    An update on my status:  I looked at the riscvemu tree for hints. I noticed that the Makefile specified -march=rv32im.  Adding this option to my LWIP build seems to have resolved most of the issues. The remaining issues might be from lack of memory as the elf file is ~400kB.  I don't know right now what the default -march from gcc is, but I'll try to find out.
    
    Hmmm, that may indicate a bug in the decode of one of the compressed RISC-V instructions. riscvp2 is supposed to support -march=rv32imc, and I think generally that should be the default for most RISC-V gcc compilers. Are you using the riscv-none-embed-gcc that I pointed to in the README? Or are you using a different GCC?
    I noticed in the micropython thread that there is a non-jit version of the emulator. I tried to run my objects using that, but have not been successful yet. It's probably something with the binary not being loaded in the right place, and maybe not stripping off the p2 code added by riscvp2. It looks like objcopy doesn't do that. It looks like micropython is now built with -march=rv32imc. I'll try that tomorrow.
    The non-jit version is at https://github.com/totalspectrum/riscvemu. I have to warn you that I haven't looked at that code for a long time, so it may have bit-rotted (and I'm not sure if it works on Rev B silicon, although it may). [Edit: yes, it has bit-rotted, the newest fastspin throws errors compiling it :(. It does seem to be possible to build the plain emulator, p2emu.binary, if you're careful to specify -march=rv32im for the RISC-V code and you replace "pausems" with "waitms".]

    The major difference from riscvp2 is that you have to build the RISC-V binary and then manually attach it to the emulator using objcopy or something similar. (So don't use -T riscvp2.ld as a linker script, you'll have to create a different one; I think there are some examples in riscvemu Makefiles). riscvemu also doesn't support the compressed instructions (so -march=rv32im only). OTOH debugging is a lot more feasible; the non-JIT emulator builds can be linked with debug.spin to produce a very simple debug interface.
  • I found that the bug only happens with -march=rv32imc -Os and OPTIMIZE_CMP_ZERO on. Am I really lucky or really unlucky? :unamused: I used this compiler: xpack-riscv-none-embed-gcc-8.3.0-1.1-linux-x64.tgz, which I found following the links in the readme.

    One function where a bug occurred was lcp_cilen. Here's a section of riscv assembly.
    LENCIVOID(go->neg_ssnhf) +
       17fb6:	0077d513          	srli	a0,a5,0x7
       17fba:	8909                	andi	a0,a0,2
    	    (go->neg_endpoint? CILEN_CHAR + go->endpoint.length: 0));
       17fbc:	2007f793          	andi	a5,a5,512
    	    LENCIVOID(go->neg_accompression) +
       17fc0:	972a                	add	a4,a4,a0
    	    (go->neg_endpoint? CILEN_CHAR + go->endpoint.length: 0));
       17fc2:	4501                	li	a0,0
       17fc4:	c781                	beqz	a5,17fcc <lcp_cilen+0x6e>
       17fc6:	0726c503          	lbu	a0,114(a3)
       17fca:	050d                	addi	a0,a0,3
    }
       17fcc:	953a                	add	a0,a0,a4
       17fce:	8082                	ret
    

    Trace with all optimizations off:
    00017FB6 00000001 F600140F        mov    $00A,$00F 
    00000001 F04C1407        shr    $00A,#$007 wz
    
    00017FBA 00000001 F50C1402        and    $00A,#$002 wz
    
    00017FBC 00000002 FF000001        augs   #00000200 
    00000002 F50C1E00        and    $00F,#$000 wz
    
    00017FC0 00000001 F1081C0A        add    $00E,$00A wz
    
    00017FC2 00000001 F6041400        mov    $00A,#$000 
    
    00017FC4 00000001 F2181E00        cmp    $00F,$000 wcz
    00000002 FEE17FCC        loc    ptrb, #00017FCC 
    00000002 AE200149 if_z   calld  pb, #00000149 
    00000002 FEE17FC6        loc    ptrb, #00017FC6 
    00000002 FE200149        calld  pb, #00000149 
    
    =00017FCC $1D81523A @00017FCC =memory chksum
    
    00017FCC 00000001 F108140E        add    $00A,$00E wz
    
    00017FCE 00000001 F603F201        mov    $1F9,$001 
    00000001 FD800148        jmp    #00000148 
    00000002 FEE17FD0        loc    ptrb, #00017FD0 
    00000002 FE200149        calld  pb, #00000149 
    

    With OPTIMIZE_CMP_ZERO
    00017FB6 00000001 F600140F        mov    $00A,$00F 
    00000001 F04C1407        shr    $00A,#$007 wz
    
    00017FBA 00000001 F50C1402        and    $00A,#$002 wz
    
    00017FBC 00000002 FF000001        augs   #00000200 
    00000002 F50C1E00        and    $00F,#$000 wz
    
    00017FC0 00000001 F1081C0A        add    $00E,$00A wz
    
    00017FC2 00000001 F6041400        mov    $00A,#$000 
    
    00017FC4 00000002 FEE17FCC        loc    ptrb, #00017FCC ***
    00000002 AE200154 if_z   calld  pb, #00000154 
    00000002 FEE17FC6        loc    ptrb, #00017FC6 
    00000002 FE200154        calld  pb, #00000154 
    
    =00017FC6 $C78B4E7F @00017FC6 =memory chksum
    
    00017FC6 00000001 F603F00D        mov    $1F8,$00D ***
    00000002 FF004000        augs   #00800000 
    00000002 FACC1472        rdbyte $00A,#$072 wz
    
    00017FCA 00000001 F10C1403        add    $00A,#$003 wz  ***
    
    00017FCC 00000001 F108140E        add    $00A,$00E wz
    
    00017FCE 00000001 F603F201        mov    $1F9,$001 
    00000001 FD800153        jmp    #00000153 
    00000002 FEE17FD0        loc    ptrb, #00017FD0 
    00000002 FE200154        calld  pb, #00000154 
    

    The instruction at 17fc0 (972a, add a4,a4,a0)
    is a c_add and goes through the c_mv block. adddata has wz set, but c_mv does not update zcmp_reg; it was still $00F from the "and" @17FBC, so the compare is wrongly optimized away.

    In this case it would be more efficient for the "add" to not update Z. Not sure what would work better overall. The LWIP code and some fixes I tried are on github.
  • Thanks, James! You nailed it, the c_mv case also implemented c_add and I missed the need to update zcmp_reg. Thanks for the bug fix, I've merged your pull request.

    Cheers,
    Eric
  • @SaucySoliton : Are you using a RevA or RevB board?

    In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?
  • At this point in time there are only ~110 Rev A's, and only ~100-200 Rev B's out there so far, from what I recall.
    So roughly a third of the boards are Rev A's.

    IMHO Rev A's still need to be supported, with the obvious restrictions of what cannot be done on Rev A's.
  • jmg Posts: 14,380
    ersmith wrote: »
    @SaucySoliton : Are you using a RevA or RevB board?

    In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?

    Given Rev A will certainly fade, it seems fine to focus on rev B, and just archive the version that works on rev A ?
  • potatohead Posts: 10,067
    edited 2020-01-31 - 03:02:37
    The sooner Rev A gets left behind, the cleaner things will be.

    We can frame 'em, and package up last known Rev A tools, should one somehow be used, and call it a day.
  • potatohead wrote: »
    The sooner Rev A gets left behind, the cleaner things will be.

    We can frame 'em, and package up last known Rev A tools, should one somehow be used, and call it a day.

    Even though I'm stuck with a Rev A board for now, I completely agree. Supporting both seems like a PITA. Archive the final version that works with Rev A and move on :wink:
  • ersmith wrote: »
    @SaucySoliton : Are you using a RevA or RevB board?

    In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?

    RevB. I tried to run on RevA as a test, but it didn't work. I didn't look into it. So mark my vote as don't care.
  • Did you ever resolve video capture differences between rev A and B?
  • potatohead wrote: »
    Did you ever resolve video capture differences between rev A and B?

    I'm still thinking about what is the best way to do video capture.

    Goertzel mode: This is what I used on RevA. The RevB has a nice improvement from adding multiple ADCs together in parallel. But this code uses Goertzel mode in a way that the hardware was not designed for. The problem is that the sinc2 mode for Goertzel added a clock delay to the output. I can compensate for this partially, but not completely. Because the video sampling rate is not an integer division of sysclock, the number of clocks between NCO overflows varies. The compensation offset is fixed. It's especially frustrating because I helped promote sinc on Goertzel. Nevertheless, RevB with a few ADCs in parallel should still be better.

    Scope mode: This is non-optimal with regard to sampling time. The samples must be quantized to the nearest clock cycle; the time uncertainty is a few degrees' worth at the colorburst frequency. It's a lot easier to program, and easier to support s-video/component/RGB. It's harder to run parallel ADCs, since the time to sum bytes together is not insignificant (though maybe it would be insignificant compared to color decoding). Since the streamer writes to hub RAM, it is more difficult for the sampling cog to process the captured samples.