I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.
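A quick way to see the effect of that flag is to print the type sizes; with -m32bit-doubles a double comes back as 4 bytes, the same as a float. A minimal check (just a sketch, compile it with whichever toolchain you're comparing):

    #include <stdio.h>

    int main(void)
    {
        /* with p2gcc's -m32bit-doubles, doubles are stored as 32-bit floats,
           so both of these print 4 */
        printf("sizeof(float)  = %u\n", (unsigned)sizeof(float));
        printf("sizeof(double) = %u\n", (unsigned)sizeof(double));
        return 0;
    }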
I think the difference is the source... bytecode is not machine instructions...
I don't see the difference. You could just think of the RISC-V instructions as byte code. They just happen to run natively on some hardware. Here they're JIT compiled to P2 instructions.
Maybe different benchmarks are needed when JIT comes into the mix ?
eg Something that calls one routine that toggles a pin, then calls another that toggles at a slightly different speed, using different code, can be used.
The delay between those running, would be more in a JIT compiler than an interpreter, and an interpreter would be slower than PASM.
I think the difference is the source... bytecode is not machine instructions...
JVM bytecodes are machine instructions, for the "Java Virtual Machine". Java implementations on PCs are emulators for the JVM. These implementations may be either interpreters or JIT compilers (or some mix of the two).
Similarly, the riscvp2 emulator is a JIT compiler. It literally does compile RISC-V instructions to P2 instructions. It does not interpret those instructions one at a time.
I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).
p2gcc supports 32-bit floats, but it doesn't support 64 bits. p2gcc uses the -m32bit-doubles compiler flag to convert 64-bit floats to 32 bits.
Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).
I am working on some generic P2 floating point routines, both 32 and 64 bit, which I hope can be used by all the compilers including p2gcc.
Whatever you call it ("emulator", "JIT compiler", "CMM"), what matters ultimately to the user is the performance. So here's another example: toggling a pin as fast as it can. Here's the source code. A lot of it is defines to make it portable to different compilers; once we've all implemented a standard propeller2.h that should simplify things a lot.
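A minimal sketch of that kind of toggle benchmark (the real source isn't reproduced here; _pinnot() and _cnt() are assumed names from a propeller2.h-style header, and each compiler spells them a bit differently today, which is what those defines paper over):

    #include <stdio.h>
    #include <propeller2.h>          /* assumed common header */

    #define PIN    56                /* assumed LED pin on the eval board */
    #define LOOPS  1000

    void toggle(int n)
    {
        while (n-- > 0)
            _pinnot(PIN);            /* one pin toggle per iteration */
    }

    int main(void)
    {
        unsigned start = _cnt();     /* read the system cycle counter */
        toggle(LOOPS);
        unsigned elapsed = _cnt() - start;
        printf("%u cycles total, %u cycles/toggle\n", elapsed, elapsed / LOOPS);
        return 0;
    }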
The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
The literal assembly code produced by riscvp2 for the toggle function is as follows:
The "0xc007a00b" instruction is a RISC-V "drvnot" instruction, one of the custom P2 instructions, which the generic RISC-V toolchain doesn't understand. We could easily add support for this to the disassembler in the future.
And the "loc ptrb/ calld" sequence gets further optimized to a "jmp" after it's placed in LUT and executed the first time. So the actual runtime loop looks like:
here
drvnot a5        ' toggle the pin whose number is in a5 (2 cycles)
sub a0, #1       ' decrement the loop counter (2 cycles)
cmp a0, #0 wcz   ' test the counter against zero (2 cycles)
if_nz jmp #here  ' branch back while nonzero (4 cycles when taken)
Which, sure enough, is 10 cycles/iteration. It's definitely not optimal, but OTOH it's better than what some of the other compilers produce.
For comparison:
fastspin's version of toggle is actually inlined into main and looks like:
The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
Nice numbers. That means riscvp2 will not do so well on code that loops (e.g.) 10 times, as now it costs 10 + 336.2 cycles averaged per loop.
Why does fastspin have a 107 clock overhead, vs p2gcc's 24 ? Is that the cost of the calla #FCACHE_LOAD_ ?
If the code is only ever called once, then yes, it'll be expensive -- on the other hand code that's only called once is probably not particularly time sensitive.
If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
That might get tricky on things like serial RX, where the P2 lacks hardware FIFOs, so it will only have one char time from the INT until the next char overwrites it.
Taking an 8M.8.n.2 link, that's ~1.375us (11 bit times at 8 Mbaud), or about 110 two-cycle opcodes at 160MHz.
Is there any way to force/pre-direct pre-compiled code into LUT, for example for things like serial interrupts? (Or any interrupt where you want to avoid run-time JIT.)
There's no interrupt support at the moment, and no way to pre-compile. In general, for code that's timing-sensitive I'd probably run PASM in a COG to handle it. A lot of things could be done in smart pins or with the streamer, too.
Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).
Last I recall, I was able to get some floating point for Micropython working with p2gcc through some extra hacks, using the floating point library routines provided by Micropython rather than the regular gcc math libs, which are missing in p2gcc.
I did play about beyond this, and other floating point functions beyond sqrt also seemed to work (though IIRC one of the tan functions seemed to have issues; I'm not sure if it was p2gcc related in any way or just some math lib issue in Micropython itself). None of this was extensively tested; it was just a proof that it could be compiled with this functionality added, and a way to see the size impact.
How would one go about debugging riscv-p2? I'm using it to build LWIP. My network connection is PPP over serial. Due to difficulties, I tried to run the same code on another platform. That was my PC. It now works perfectly on the PC. The only differences in the code are for system timers and serial port.
x86_64: ok
arm 32 bit: ok
riscv-p2: problems
How can I help narrow it down between gcc and riscv-p2?
Some bugs were resolved by eliminating the -Os compiler option. But it still doesn't work.
It would be tricky and this is where having RISC-V translation on the fly makes things harder to debug on the P2 itself. I think you'd almost need to build a special debugger aware of RISC-V opcodes on the P2 to be able to step through its ASM code and see the translation to P2 code once you locate roughly where the problem is happening. If you had a real RISC-V board that could run this code, that might help identify whether it is a JIT translation problem somewhere in the P2 or something generic in GCC for RISC-V, depending on whether the two systems behave differently. You're probably going to have to learn the RISC-V ISA and then gain a detailed understanding of the translation into P2 code, which is going to be additional work unless you know it already.
How would one go about debugging riscv-p2? I'm using it to build LWIP. My network connection is PPP over serial. Due to difficulties, I tried to run the same code on another platform. That was my PC. It now works perfectly on the PC. The only differences in the code are for system timers and serial port.
I had plans to implement a gdb stub for Risc-V, but haven't gotten to it yet (it might be a good project if you're feeling ambitious). For now I would use printf to try to narrow the problem down; you could, for example, compare results from a known working version and the one that isn't working.
If you can get it down to a specific function that's going wrong and you think it might be a bug in the JIT compiler itself, you could enable the defines DEBUG_ENGINE, USE_DISASM, and DEBUG_THOROUGH in riscvtrace_p2.spin, do a "make install", and then rebuild the application. To get everything to fit you may also have to disable some other feature like OPTIMIZE_CMP_ZERO by commenting out the appropriate #define. Now when you run you'll get a spew of debug output like:
The "=xxxx" lines show when there are new cache lines compiled. The =memory chksum is probably of limited utility but shows a checksum of current memory. The other lines show the generated assembly code corresponding to RISC-V instructions (e.g. the RISC-V instruction at 8252 was compiled to "augs #$8200; mov $1, #$54" which amounts to "mov $1, ##$8254". The RISC-V registers are at the bottom of COG memory, so this is setting up a return address in RISC-V register x1. You can use riscv-objdump to get a disassembly of the RISC-V code to see where the addreses like $8252 correspond to RISC-V instructions.
It would be tricky and this is where having RISC-V translation on the fly makes things harder to debug on the P2 itself.
No more so than with any other virtual machine (like Chip's Spin bytecodes or microPython's bytecodes) on the P2. If anything a bit less, since at least the RISC-V instruction set is well documented and very straightforward. Native debugging on the P2 is not very well supported right now, unfortunately, although we can hope that eventually that will improve.
For now I would use printf to try to narrow the problem down; you could, for example, compare results from a known working version and the one that isn't working.
If you can get it down to a specific function that's going wrong and you think it might be a bug in the JIT compiler itself, you could enable the defines DEBUG_ENGINE, USE_DISASM, and DEBUG_THOROUGH in riscvtrace_p2.spin, do a "make install", and then rebuild the application. To get everything to fit you may also have to disable some other feature like OPTIMIZE_CMP_ZERO by commenting out the appropriate #define. Now when you run you'll get a spew of debug output like:
Thanks, Eric. I have done the printf method and have identified the exact functions that malfunction.
An update on my status: I looked at the riscvemu tree for hints. I noticed that the Makefile specified -march=rv32im. Adding this option to my LWIP build seems to have resolved most of the issues. The remaining issues might be from lack of memory as the elf file is ~400kB. I don't know right now what the default -march from gcc is, but I'll try to find out.
I noticed in the micropython thread that there is a non-jit version of the emulator. I tried to run my objects using that, but have not been successful yet. It's probably something with the binary not being loaded in the right place, and maybe not stripping off the p2 code added by riscvp2. It looks like objcopy doesn't do that. It looks like micropython is now built with -march=rv32imc. I'll try that tomorrow.
Thanks, Eric. I have done the printf method and have identified the exact functions that malfunction.
Are you able to share those? Perhaps I could reproduce the problem here.
An update on my status: I looked at the riscvemu tree for hints. I noticed that the Makefile specified -march=rv32im. Adding this option to my LWIP build seems to have resolved most of the issues. The remaining issues might be from lack of memory as the elf file is ~400kB. I don't know right now what the default -march from gcc is, but I'll try to find out.
Hmmm, that may indicate a bug in the decode of one of the compressed RISC-V instructions. riscvp2 is supposed to support -march=rv32imc, and I think generally that should be the default for most RISC-V gcc compilers. Are you using the riscv-none-embed-gcc that I pointed to in the README? Or are you using a different GCC?
I noticed in the micropython thread that there is a non-jit version of the emulator. I tried to run my objects using that, but have not been successful yet. It's probably something with the binary not being loaded in the right place, and maybe not stripping off the p2 code added by riscvp2. It looks like objcopy doesn't do that. It looks like micropython is now built with -march=rv32imc. I'll try that tomorrow.
The non-jit version is at https://github.com/totalspectrum/riscvemu. I have to warn you that I haven't looked at that code for a long time, so it may have bit-rotted (and I'm not sure if it works on Rev B silicon, although it may). [Edit: yes, it has bit-rotted; the newest fastspin throws errors compiling it. It does seem to be possible to build the plain emulator, p2emu.binary, if you're careful to specify -march=rv32im for the RISC-V code and you replace "pausems" with "waitms".]
The major difference from riscvp2 is that you have to build the RISC-V binary and then manually attach it to the emulator using objcopy or something similar. (So don't use -T riscvp2.ld as a linker script; you'll have to create a different one. I think there are some examples in the riscvemu Makefiles.) riscvemu also doesn't support the compressed instructions (so -march=rv32im only). OTOH debugging is a lot more feasible; the non-JIT emulator builds can be linked with debug.spin to produce a very simple debug interface.
I found that the bug only happens with -march=rv32imc -Os and OPTIMIZE_CMP_ZERO on. Am I really lucky or really unlucky? I used this compiler: xpack-riscv-none-embed-gcc-8.3.0-1.1-linux-x64.tgz, which I found by following the links in the readme.
One function where a bug occurred was lcp_cilen. Here's a section of riscv assembly.
The instruction 17fc0: 972a add a4,a4,a0
is a c_add and goes through the c_mv block. adddata has wz set, but c_mv does not update zcmp_reg, which was still $00F from the "and" @17FBC, so the following compare is wrongly optimized away.
In this case it would be more efficient for the "add" to not update Z. Not sure what would work better overall. The LWIP code and some fixes I tried are on github.
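To make the mechanism concrete, here's a small illustrative sketch in C (the real code is the Spin2 in riscvtrace_p2.spin; names like emit() here are made up). OPTIMIZE_CMP_ZERO remembers which register the Z flag currently reflects so a later compare-with-zero can be dropped; if an instruction sets Z via wz but the tracker isn't updated, a stale entry can drop a compare it shouldn't:

    #include <stdio.h>

    static int zcmp_reg = -1;            /* register the Z flag currently reflects; -1 = unknown */

    static void emit(const char *text) { printf("    %s\n", text); }

    /* shared c_mv/c_add path: the emitted "add ... wz" sets Z from the
       destination, so the tracker must be updated too (the missing step) */
    static void jit_c_add(int rd)
    {
        emit("add rd, rs wz");
        zcmp_reg = rd;
    }

    /* a "compare reg with #0" may be dropped only if Z really reflects reg */
    static void jit_cmp_zero(int reg)
    {
        if (reg == zcmp_reg) {
            emit("(cmp with #0 optimized away)");
            return;
        }
        emit("cmp reg, #0 wcz");
        zcmp_reg = reg;
    }

    int main(void)
    {
        zcmp_reg = 0x00F;    /* set by the "and" at 17fbc */
        jit_c_add(0x00E);    /* the c.add at 17fc0: Z now reflects a4 ($00E) */
        jit_cmp_zero(0x00F); /* with the tracker updated this compare is emitted;
                                without it, it would wrongly be skipped */
        return 0;
    }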
Thanks, James! You nailed it, the c_mv case also implemented c_add and I missed the need to update zcmp_reg. Thanks for the bug fix, I've merged your pull request.
@SaucySoliton : Are you using a RevA or RevB board?
In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?
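For reference, the usual way C code reads that 64-bit count is through the standard RISC-V cycle/cycleh CSRs, re-reading the high word to guard against a rollover in between. This is only a sketch and assumes riscvp2 exposes those CSRs (which is what the ISR is maintaining the upper bits for):

    #include <stdint.h>

    static inline uint64_t cycles64(void)
    {
        uint32_t lo, hi, hi2;
        do {
            asm volatile ("rdcycleh %0" : "=r"(hi));    /* upper 32 bits */
            asm volatile ("rdcycle  %0" : "=r"(lo));    /* lower 32 bits */
            asm volatile ("rdcycleh %0" : "=r"(hi2));   /* re-read to detect carry */
        } while (hi != hi2);
        return ((uint64_t)hi << 32) | lo;
    }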
Given Rev A will certainly fade, it seems fine to focus on rev B, and just archive the version that works on rev A ?
The sooner Rev A gets left behind, the cleaner things will be.
We can frame 'em, and package up last known Rev A tools, should one somehow be used, and call it a day.
Even though I'm stuck with a Rev A board for now, I completely agree. Supporting both seems like a PITA. Archive the final version that works with Rev A and move on.
@SaucySoliton : Are you using a RevA or RevB board?
In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?
RevB. I tried to run on RevA as a test, but it didn't work. I didn't look into it. So mark my vote as don't care.
Did you ever resolve video capture differences between rev A and B?
I'm still thinking about what is the best way to do video capture.
Goertzel mode: This is what I used on RevA. The RevB has a nice improvement from adding multiple ADCs together in parallel. But this code uses Goertzel mode in a way that the hardware was not designed for. The problem is that the sinc2 mode for Goertzel added a clock delay to the output. I can compensate for this partially, but not completely. Because the video sampling rate is not an integer division of sysclock, the number of clocks between NCO overflows varies. The compensation offset is fixed. It's especially frustrating because I helped promote sinc on Goertzel. Nevertheless, RevB with a few ADCs in parallel should still be better.
Scope mode: This is non-optimal with regards to sampling time. The samples must be quantized to the nearest clock cycle. The time uncertainty is a few degrees worth at the colorburst frequency. A lot easier to program. Easier to support s-video/component/RGB. Harder to run parallel ADCs, the time to sum bytes together is not insignificant. Maybe it would be insignificant compared to color decoding. Since the streamer writes to hub ram, it is more difficult for the sampling cog to process the captured samples.
There's no fancy IDE, just command line GCC and G++ (along with the usual GNU binutils). Hooking these up to your IDE of choice should be straightforward.
These are RISC-V toolchains, so by default they build RISC-V binaries. If you use the "riscvp2.ld" linker script (e.g. by giving the "-T riscvp2.ld" command line option to gcc) then they'll build P2 binaries instead. There's also a "riscvp2_lut.ld" linker script that puts the cached P2 code in LUT rather than HUB; this saves a lot of RAM, but at the cost of performance since the instruction cache is so small.
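As a concrete example, a plain C program could be built for the P2 along these lines (a sketch; the compiler is the xpack riscv-none-embed-gcc referenced in the README, and the flags are the ones discussed elsewhere in the thread):

    /* hello.c */
    #include <stdio.h>

    int main(void)
    {
        printf("hello from RISC-V code JIT-compiled on the P2\n");
        return 0;
    }

    /* Build as a P2 binary; riscvp2.ld switches the output from a plain
       RISC-V program to a P2 one, and riscvp2_lut.ld caches JIT output
       in LUT instead of HUB:

       riscv-none-embed-gcc -march=rv32imc -Os -T riscvp2.ld -o hello.elf hello.c
    */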
Thanks, Eric! I just tried the Mac OS X version and it seems to work although I had to do a lot of messing around with the Mac's security settings to get it to run.
Thanks for trying this, David. Not much I can do about the security warnings, I'm afraid; all the binaries are unmodified copies of the official xpack distribution for the Mac.
Comments
Maybe different benchmarks are needed when JIT comes into the mix ?
eg Something that calls one routine that toggles a pin, then calls another that toggles at a slightly different speed, using different code, can be used.
The delay between those running, would be more in a JIT compiler than an interpreter, and an interpreter would be slower than PASM.
JVM bytecodes are machine instructions, for the "Java Virtual Machine". Java implementations on PCs are emulators for the JVM. These implementations may be either interpreters or JIT compilers (or some mix of the two).
Similarly, the riscvp2 emulator is a JIT compiler. It literally does compile RISC-V instructions to P2 instructions. It does not interpret those instructions one at a time.
Sorry, I meant to say that the micropython compiled by p2gcc doesn't have floating point support (I think the p2gcc float libraries don't yet have all the functions micropython needs).
I am working on some generic P2 floating point routines, both 32 and 64 bit, which I hope can be used by all the compilers including p2gcc.
Here are the results for the 4 C compilers I tried on P2:
The theoretical limit on pin toggles in software is 2 cycles/toggle (a REP loop with drvnot inside it). fastspin is able to achieve that by caching the loop in LUT. The other compilers, including riscvp2, all have some loop overhead as well. We also get some sense of the latency here: riscvp2 does have that 3362 extra (which represents the compilation time) compared to only 24 cycles extra for p2gcc (which is probably measurement and subroutine call overhead).
The literal assembly code produced by riscvp2 for the toggle function is as follows:
RISC-V code (output of riscv-none-embed-gcc): The "0xc007a00b" instruction is a RISC-V "drvnot" instruction, one of the custom P2 instructions, which the generic RISC-V toolchain doesn't understand. We could easily add support for this to the disassembler in the future.
This gets compiled at runtime into: And the "loc ptrb/ calld" sequence gets further optimized to a "jmp" after it's placed in LUT and executed the first time. So the actual runtime loop looks like: Which, sure enough, is 10 cycles/iteration. It's definitely not optimal, but OTOH it's better than what some of the other compilers produce.
For comparison:
fastspin's version of toggle is actually inlined into main and looks like:
p2gcc's looks like: which is pretty nice code, and would be faster than riscvp2 except for the hubexec penalty.
Catalina's looks like:
I think the PRIMITIVE(#X) macro turns into a "CALL #X".
Why does fastspin have a 107 clock overhead, vs p2gcc's 24 ? Is that the cost of the calla #FCACHE_LOAD_ ?
If there's code that loops 10 times but it's called many times (e.g. serial output) then it'll be cached and the compilation overhead only happens once, the very first time it's called.
Yes.
Taking an 8M.8.n.2 link, that's ~1.375us (11 bit times at 8 Mbaud), or about 110 two-cycle opcodes at 160MHz.
Is there any way to force/pre-direct pre-compiled code into LUT, for example for things like serial interrupts? (Or any interrupt where you want to avoid run-time JIT.)
There's no interrupt support at the moment, and no way to pre-compile. In general, for code that's timing-sensitive I'd probably run PASM in a COG to handle it. A lot of things could be done in smart pins or with the streamer, too.
See
http://forums.parallax.com/discussion/comment/1474298/#Comment_1474298
I did play about beyond this, and other floating point functions beyond sqrt also seemed to work (though IIRC one of the tan functions seemed to have issues; I'm not sure if it was p2gcc related in any way or just some math lib issue in Micropython itself). None of this was extensively tested; it was just a proof that it could be compiled with this functionality added, and a way to see the size impact.
x86_64: ok
arm 32 bit: ok
riscv-p2: problems
How can I help narrow it down between gcc and riscv-p2?
Some bugs were resolved by eliminating the -Os compiler option. But it still doesn't work.
I had plans to implement a gdb stub for Risc-V, but haven't gotten to it yet (it might be a good project if you're feeling ambitious). For now I would use printf to try to narrow the problem down; you could, for example, compare results from a known working version and the one that isn't working.
If you can get it down to a specific function that's going wrong and you think it might be a bug in the JIT compiler itself, you could enable the defines DEBUG_ENGINE, USE_DISASM, and DEBUG_THOROUGH in riscvtrace_p2.spin, do a "make install", and then rebuild the application. To get everything to fit you may also have to disable some other feature like OPTIMIZE_CMP_ZERO by commenting out the appropriate #define. Now when you run you'll get a spew of debug output like:
The "=xxxx" lines show when there are new cache lines compiled. The =memory chksum is probably of limited utility but shows a checksum of current memory. The other lines show the generated assembly code corresponding to RISC-V instructions (e.g. the RISC-V instruction at 8252 was compiled to "augs #$8200; mov $1, #$54" which amounts to "mov $1, ##$8254". The RISC-V registers are at the bottom of COG memory, so this is setting up a return address in RISC-V register x1. You can use riscv-objdump to get a disassembly of the RISC-V code to see where the addreses like $8252 correspond to RISC-V instructions.
No more so than with any other virtual machine (like Chip's Spin bytecodes or microPython's bytecodes) on the P2. If anything a bit less, since at least the RISC-V instruction set is well documented and very straightforward. Native debugging on the P2 is not very well supported right now, unfortunately, although we can hope that eventually that will improve.
Thanks, Eric. I have done the printf method and have identified the exact functions that malfunction.
An update on my status: I looked at the riscvemu tree for hints. I noticed that the Makefile specified -march=rv32im. Adding this option to my LWIP build seems to have resolved most of the issues. The remaining issues might be from lack of memory as the elf file is ~400kB. I don't know right now what the default -march from gcc is, but I'll try to find out.
I noticed in the micropython thread that there is a non-jit version of the emulator. I tried to run my objects using that, but have not been successful yet. It's probably something with the binary not being loaded in the right place, and maybe not stripping off the p2 code added by riscvp2. It looks like objcopy doesn't do that. It looks like micropython is now built with -march=rv32imc. I'll try that tomorrow.
Hmmm, that may indicate a bug in the decode of one of the compressed RISC-V instructions. riscvp2 is supposed to support -march=rv32imc, and I think generally that should be the default for most RISC-V gcc compilers. Are you using the riscv-none-embed-gcc that I pointed to in the README? Or are you using a different GCC?
The non-jit version is at https://github.com/totalspectrum/riscvemu. I have to warn you that I haven't looked at that code for a long time, so it may have bit-rotted (and I'm not sure if it works on Rev B silicon, although it may). [Edit: yes, it has bit-rotted; the newest fastspin throws errors compiling it. It does seem to be possible to build the plain emulator, p2emu.binary, if you're careful to specify -march=rv32im for the RISC-V code and you replace "pausems" with "waitms".]
The major difference from riscvp2 is that you have to build the RISC-V binary and then manually attach it to the emulator using objcopy or something similar. (So don't use -T riscvp2.ld as a linker script; you'll have to create a different one. I think there are some examples in the riscvemu Makefiles.) riscvemu also doesn't support the compressed instructions (so -march=rv32im only). OTOH debugging is a lot more feasible; the non-JIT emulator builds can be linked with debug.spin to produce a very simple debug interface.
One function where a bug occurred was lcp_cilen. Here's a section of riscv assembly.
Trace with all optimizations off:
With OPTIMIZE_CMP_ZERO
The instruction 17fc0: 972a add a4,a4,a0
is a c_add and goes through the c_mv block. adddata has wz set, but c_mv does not update zcmp_reg, which was still $00F from the "and" @17FBC, so the following compare is wrongly optimized away.
In this case it would be more efficient for the "add" to not update Z. Not sure what would work better overall. The LWIP code and some fixes I tried are on github.
Cheers,
Eric
In general: is anyone using riscvp2 on a RevA? Right now there is an interrupt service routine to maintain the upper 64 bits of the cycle counter, which on RevB may be obtained directly. I'd like to use the RevB method and save some COG RAM, but this will break compatibility with RevA. Does anyone care?
So some 30% are Rev A's.
IMHO Rev A's still need to be supported, with the obvious restrictions of what cannot be done on Rev A's.
Given Rev A will certainly fade, it seems fine to focus on rev B, and just archive the version that works on rev A ?
We can frame 'em, and package up last known Rev A tools, should one somehow be used, and call it a day.
Even though I'm stuck with a Rev A board for now, I completely agree. Supporting both seems like a PITA. Archive the final version that works with Rev A and move on.
RevB. I tried to run on RevA as a test, but it didn't work. I didn't look into it. So mark my vote as don't care.
I'm still thinking about what is the best way to do video capture.
Goertzel mode: This is what I used on RevA. The RevB has a nice improvement from adding multiple ADCs together in parallel. But this code uses Goertzel mode in a way that the hardware was not designed for. The problem is that the sinc2 mode for Goertzel added a clock delay to the output. I can compensate for this partially, but not completely. Because the video sampling rate is not an integer division of sysclock, the number of clocks between NCO overflows varies. The compensation offset is fixed. It's especially frustrating because I helped promote sinc on Goertzel. Nevertheless, RevB with a few ADCs in parallel should still be better.
Scope mode: This is non-optimal with regards to sampling time. The samples must be quantized to the nearest clock cycle. The time uncertainty is a few degrees worth at the colorburst frequency. A lot easier to program. Easier to support s-video/component/RGB. Harder to run parallel ADCs, the time to sum bytes together is not insignificant. Maybe it would be insignificant compared to color decoding. Since the streamer writes to hub ram, it is more difficult for the sampling cog to process the captured samples.
https://github.com/totalspectrum/riscvp2/releases/latest/
There's no fancy IDE, just command line GCC and G++ (along with the usual GNU binutils). Hooking these up to your IDE of choice should be straightforward.
These are RISC-V toolchains, so by default they build RISC-V binaries. If you use the "riscvp2.ld" linker script (e.g. by giving the "-T riscvp2.ld" command line option to gcc) then they'll build P2 binaries instead. There's also a "riscvp2_lut.ld" linker script that puts the cached P2 code in LUT rather than HUB; this saves a lot of RAM, but at the cost of performance since the instruction cache is so small.
Thanks for trying this, David. Not much I can do about the security warnings, I'm afraid; all the binaries are unmodified copies of the official xpack distribution for the Mac.