riscvp2: a C and C++ compiler for P2

ersmith · 2019-07-07 23:55

Update July 31, 2020: I've downloaded specific toolchains for Win32, MacOSX, and Linux, applied the changes to them, and uploaded the binaries to:

https://github.com/totalspectrum/riscvp2/releases/latest

So you can get the final binaries and don't have to build them yourself. Note that these are command line tools only, although since they're standard GCC (with one extra command line option for the P2 link) it should be easy to hook them up to your favorite IDE.

Provided are all the GNU binutils, GCC, and G++, along with newlib libraries. If you get this you'll be all set up for developing applications for both P2 and RISC-V.

Under the hood this compiler works by doing a JIT (just in time) translation from RISC-V code to P2 code. The RISC-V compressed instruction set is supported, so binaries can be reasonably small (but beware, the newlib libraries tend to add a lot of bloat; you'll probably want to use the "nano" version of the libraries). Performance is actually very respectable; on many benchmarks riscvp2 beats all of the other C compilers for the P2 (fastspin, Catalina, and p2gcc). On a substantial "real world" workload (micropython) the riscvp2 compiled binary ended up about 10% slower than the p2gcc compiled one, but was substantially smaller.

AFAIK this is currently the only solution for full C++ on P2.

Original message:

I've posted what I hope are useful instructions for compiling C and C++ applications for P2 using a RISC-V toolchain. The repository is at https://github.com/totalspectrum/riscvp2. The README.md should describe everything you need, including where to find a pre-built RISC-V toolchain. Once your RISC-V toolchain is installed, edit the Makefile to have the correct paths and then do "make install". That should update your RISC-V toolchain so that it can build P2 binaries as well, by just giving the "-T riscvp2.ld" switch to riscv-none-embed-gcc or riscv-none-embed-g++.

"make hello-c++.binary" will build a simple C++ demo application that should run on the P2 Eval board (it works for me, at least!)

Feedback is very welcome. I've been using this kind of setup for a while, so I may well have overlooked some step in the descriptions, or perhaps not made something as clear as it should be. If so, please let me know!

(Tested on Debian Linux. It should work the same for any Linux or Mac OSX. For Windows it *should* work if you have some kind of POSIX system like msys or cygwin installed, but I haven't actually tested it there.)

ersmith · 2019-07-08 01:43

Here's a sample pin toggling program. As compiled it will toggle pin 0 as quickly as possible; there's also a version to toggle pin 56 at visible speeds for testing.

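// simple pin toggle demo
#include <stdint.h>
#include <propeller2.h>

// Change "#if 0" to "#if 1" to toggle pin 56 at a visible speed;
// otherwise pin 0 toggles as fast as the loop can run.
#if 0
#define PIN 56
#define DELAY 40000000   // clock cycles to wait between toggles
#else
#define PIN 0
#endif

void main()
{
    for(;;) {
        _pinnot(PIN);    // toggle the pin
#ifdef DELAY
        _waitx(DELAY);   // only compiled in for the visible-speed version
#endif
    }
}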

I compiled it with:

riscv-none-embed-gcc -specs=nano.specs -T riscvp2_lut.ld -o toggle.elf -Os toggle.c
riscv-none-embed-objcopy -O binary toggle.elf toggle.binary

(This uses the LUT caching version of the JIT compiler, which is the fastest but has a small cache, so best for small programs like this.)

If I've done everything right, the inner loop will compile in the end to two instructions:

loop
    drvnot x15
    jmp    #loop

Running from LUT cache this should take just 6 cycles per loop. The system clock is 160 MHz, so we should get about 27 million toggles per second. I don't have an oscilloscope to check this, but I've attached the binary so interested readers can double check my work.
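For anyone checking this on a scope, the arithmetic can be sketched as below. The function names are made up for illustration; the 160 MHz clock and 6 cycles per loop are the figures from the post. Since each pass through the loop flips the pin once, a full high/low period takes two passes, so the square wave on the pin should be about half the toggle rate.

```c
#include <stdint.h>

/* Pin toggles per second for a loop that takes `cycles_per_loop`
   clocks at `clock_hz` (integer division, so the result is rounded down). */
uint32_t toggles_per_second(uint32_t clock_hz, uint32_t cycles_per_loop)
{
    return clock_hz / cycles_per_loop;
}

/* Frequency of the resulting square wave: one full period needs two toggles. */
uint32_t square_wave_hz(uint32_t clock_hz, uint32_t cycles_per_loop)
{
    return toggles_per_second(clock_hz, cycles_per_loop) / 2u;
}
```

With 160000000 and 6 this gives 26666666 toggles per second, i.e. a square wave of roughly 13.3 MHz.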

You'll also probably notice that the binary is huge, relatively speaking. That's a consequence of the newlib library that comes with most RISC-V toolchains, which for an "embedded" library is incredibly bulky. We'll probably want to port proplib or some other P2 library to RISC-V to get a more compact binary.

ersmith · 2019-07-17 04:39

I've updated my github repository with some improvements and bug fixes. Most notably there was a bug in _sbrk() which caused malloc() to fail randomly; I also made read() work (although characters are not echoed right now).

Has anyone tried this out? If you've tried and failed (or couldn't understand the docs) please let me know -- I would like to improve the installation instructions.

Tubular · 2019-07-17 23:16

I'll at least try and run the binary and confirm the rate today Eric

DavidZemon · 2019-07-18 02:19

Looks really good. I gave it a try and was able to get PropWare to recognize the tools and start attempting compilation. It quickly failed out as there are lots of P1-specific things throughout the codebase, ranging from "-m32bit-doubles" to "CNT" and "__builtin_propeller_rev(x, bits)"

I started fixing some but quickly grew tired. Porting PropWare to P2 is going to require far more than an hour on a weeknight

But, it does seem feasible, and the fact that this is all based on a GCC toolchain means no crazy hacks need to be applied to CMake. It's very promising!

ersmith · 2019-07-18 11:59

Tubular wrote: »

I'll at least try and run the binary and confirm the rate today Eric

Thanks!

ersmith · 2019-07-18 12:01

DavidZemon wrote: »

Looks really good. I gave it a try and was able to get PropWare to recognize the tools and start attempting compilation. It quickly failed out as there are lots of P1-specific things throughout the codebase, ranging from "-m32bit-doubles" to "CNT" and "__builtin_propeller_rev(x, bits)"

I started fixing some but quickly grew tired. Porting PropWare to P2 is going to require far more than an hour on a weeknight

But, it does seem feasible, and the fact that this is all based on a GCC toolchain means no crazy hacks need to be applied to CMake. It's very promising!

Ah, __builtin_propeller_rev reveals something we're missing from propeller2.h -- a way to access the _rev() instruction. We should add that. There are probably a few other things that are operators in Spin2 that we'll want to add as C functions.

CNT of course is now _cnt() on P2.

Thanks for giving this a try. I hope you were able to run something simple like "hello world"?

Cluso99 · 2019-07-18 12:04

There are a few gotchas with condition codes too

ersmith · 2019-07-18 12:24

Cluso99 wrote: »

There are a few gotchas with condition codes too

Sorry, what kind of gotchas with condition codes? Oh, you mean porting P1 assembly code to P2. Yeah, in general there are lots of gotchas there. Even more so if you're using the RISC-V toolchain which only supports RISC-V inline assembly. If you stick with the macros in propeller2.h you should be OK though.

ersmith · 2019-07-21 18:00

Here's a version of loadp2 with ELF file support, which eliminates the need to run riscv-none-embed-objcopy to convert the ELF file to binary before loading to the P2. The source is checked in to my github fork of p2gcc, and also in the spin2gui repository (loadp2 subdirectory).

ersmith · 2019-08-29 19:29

@Rayman asked on another thread:

Sounded like you need Linux for the risc compiler for p2, right?

The gnu-mcu-eclipse RISC-V toolchain that I point to in the README has Windows and MacOSX binaries, and the code to change the RISC-V toolchain to a P2 toolchain is fairly generic (it does use some Unix utilities that should be available for MacOS and would be available for Windows in cygwin or msys). But I haven't tried it myself on Windows.

ersmith · 2019-08-29 19:31

I'll also note here that I've updated the checked-in binary to have some "hardware" floating point support (native code using the CORDIC). Eventually this could serve as an underpinning to emulating the RISC-V floating point instructions, but for now it's just accessed via a syscall. Only the 32 bit float code is hooked up right now, the 64 bit double code is there but it still reports some issues on tests so I haven't enabled it yet.

Rayman · 2019-08-29 20:36

Can you say how this would compare to C++ via GCC (when it comes)?

Also, I'm guessing that inline assembly is not possible. Is that true?
If so, how would one add assembly drivers to a project?

Rayman · 2019-08-29 20:37

I seem to have a MinGW folder, probably from testing GCC... Maybe that would work for this...

David Betz · 2019-08-29 20:37

Rayman wrote: »

Can you say how this would compare to C++ via GCC (when it comes)?

Also, I'm guessing that inline assembly is not possible. Is that true?
If so, how would one add assembly drivers to a project?

Sure you can use inline assembler. The problem is it will be RISC-V assembler.

ersmith · 2019-08-29 21:09

Rayman wrote: »

Can you say how this would compare to C++ via GCC (when it comes)?

"compare" in what sense? It is GCC, so it fully supports C++. Performance wise it's hard to say what a native P2 port of the same version of GCC would be like, but I'd guess that riscvp2 would probably be about 10-20% slower but about 25% smaller in code size.

Also, I'm guessing that inline assembly is not possible. Is that true?
If so, how would one add assembly drivers to a project?

As David mentioned, inline RISC-V assembly works fine. The emulated RISC-V has a number of extended instructions (RISC-V is inherently extensible) which map directly to P2 instructions, and most of those are supported via macros from propeller2.h.

To add assembly drivers to a project, convert them to C code via spin2cpp (or compile them with fastspin to a binary blob) and then use the _coginit() function (from propeller2.h) to start them. That's how I added the USB and VGA drivers to the micropython port.

Rayman · 2019-08-29 21:25

Binary blob sounds interesting... Would that work for all FastSpin languages?

Is there a way to include the binary blob in the C++ source? Need to convert the binary to text to do it?

Rayman · 2019-08-29 21:29

Can you do this with GNU linker?
looking at this: https://csl.name/post/embedding-binary-data/

I assume this comes with GCC, right?

Rayman · 2019-08-29 21:33

Am I seeing this right that you actually are fully compiling the source? So, it would run on an RISCV processor?

By JIT compiling, do you mean you are taking the RISCV binary and converting it to P2 binary in real time as the program runs?

Isn't this an emulator?

ersmith · 2019-08-29 21:51

Rayman wrote: »

Binary blob sounds interesting... Would that work for all FastSpin languages?

Is there a way to include the binary blob in the C++ source? Need to convert the binary to text to do it?

Mainly I was thinking about a PASM driver. In theory you could probably do it for any fastspin language, but by default fastspin produces hubexec and you'd have to do some careful juggling to get that to link together with a C++ program. But if you used --code=cog so it would run in another COG it probably would work.

For a binary blob, there are a number of ways to include it in the program. The GNU linker could be used to link it into a program, or you could use xxd or some similar tool to convert it to a C array of hex bytes. Or if you're using spin2cpp it will (by default) output the DAT section as an array of bytes, and convert the Spin parts to C++ code, so the result can just be compiled as a .c or .c++ file.
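A rough sketch of the xxd route, for illustration only. The array and length names follow what `xxd -i driver.bin` generates for a file called driver.bin; the bytes are placeholders, not a real driver. The launcher is passed in as a function pointer because the exact prototype of _coginit() in propeller2.h is not shown in this thread; `cog_start_fn`, `start_driver` and `host_fake_coginit` are hypothetical names, and the fake launcher exists only so the sketch can be compiled and tried on a PC.

```c
#include <stddef.h>

/* Shape of the output of `xxd -i driver.bin`.
   Placeholder bytes only; a real blob would come from fastspin. */
unsigned char driver_bin[] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};
unsigned int driver_bin_len = 8;

/* Hypothetical launcher signature: takes the code image and a parameter
   pointer, returns a cog number or a negative value on failure.
   On the P2 this role would be filled by a thin wrapper around _coginit(). */
typedef int (*cog_start_fn)(void *code, void *param);

/* Start the embedded driver, refusing to launch an empty blob. */
int start_driver(cog_start_fn start, void *param)
{
    if (start == NULL || driver_bin_len == 0)
        return -1;
    return start(driver_bin, param);
}

/* Host-only stand-in so the sketch runs without P2 hardware:
   remembers which image it was asked to start and pretends cog 1 took it. */
void *last_code_seen;

int host_fake_coginit(void *code, void *param)
{
    (void)param;
    last_code_seen = code;
    return 1;
}
```

Keeping the launcher behind a function pointer means the same file builds on a PC for testing and on the P2 with the real call swapped in.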

ersmith · 2019-08-29 22:02

Rayman wrote: »

Am I seeing this right that you actually are fully compiling the source? So, it would run on an RISCV processor?

riscvp2 is a funny hybrid. It turns a RISC-V compiler into a P2 compiler by basically adding a linker script that combines the RISC-V binary with a P2 JIT compiler. So the .elf file that riscvp2 outputs is a P2 executable, not a RISC-V executable, but it contains a RISC-V executable.

By JIT compiling, do you mean you are taking the RISCV binary and converting it to P2 binary in real time as the program runs?

Isn't this an emulator?

Yes, it does convert the RISC-V binary to P2 at run time. But it is a compiler, not an interpreter -- it caches the compiled code and re-uses it. So there is some latency the first time through any loop (as the RISC-V instructions are compiled to P2 instructions) but subsequent loop iterations run at full hubexec speed (or LUT exec speed, depending on the options you give).
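The compile-once, reuse-afterwards idea can be illustrated with a toy translation cache. This is a generic direct-mapped sketch, not the actual riscvp2 implementation (whose cache layout is not described in this thread); all names are invented, and the "translation" is a placeholder value rather than real P2 code.

```c
#include <stdint.h>

#define CACHE_SLOTS 16u

struct cache_entry {
    uint32_t guest_pc;   /* address of the RISC-V code that was translated */
    int      valid;      /* nonzero once the slot holds a translation */
    uint32_t host_code;  /* stand-in for the translated P2 code */
};

static struct cache_entry cache[CACHE_SLOTS];
unsigned compile_count;  /* how many times the slow path has run */

/* Placeholder for the real translator: just counts calls and
   returns a value derived from the address. */
static uint32_t compile_block(uint32_t guest_pc)
{
    compile_count++;
    return guest_pc ^ 0xA5A5A5A5u;
}

/* Return the translation for guest_pc, compiling only on a cache miss. */
uint32_t fetch_translated(uint32_t guest_pc)
{
    /* Compressed RISC-V code is 2-byte aligned, so drop the low bit. */
    struct cache_entry *e = &cache[(guest_pc >> 1) % CACHE_SLOTS];

    if (!e->valid || e->guest_pc != guest_pc) {
        e->host_code = compile_block(guest_pc);  /* slow path: first visit */
        e->guest_pc  = guest_pc;
        e->valid     = 1;
    }
    return e->host_code;                         /* fast path: reuse */
}
```

Calling fetch_translated() a hundred times for the same address compiles once. Two addresses that land in the same slot evict each other and get recompiled on every switch, which is the toy version of a program that does not fit in the cache.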

Also, the RISC-V processor that is emulated has some custom P2 instructions, so if any of those are used then it won't run on "real" RISC-V hardware (unless someday somebody makes a RISC-V with those custom instructions).

Here's the performance of some compilers on Heater's fft benchmark:

results: time / size of binary loaded to P2

riscvp2:  25245 us  20968 bytes
fastspin: 39835 us  16384 bytes
catalina: 55876 us  27808 bytes
p2gcc:    64765 us  21204 bytes

command lines used:

fastspin -2 -O2 -o fastspin.bin fft_bench.c
catalina -lci -p2 -O3 -C P2_EVAL -C NATIVE -D PROPELLER fft_bench.c
p2gcc -D _cnt=getcnt fft_bench.c
riscv-none-embed-gcc -T riscvp2_lut.ld -specs=nano.specs -Os -o a.elf fft_bench.c

I wouldn't pay too much attention to the binary sizes, those are mostly influenced by the libraries (Catalina and riscvp2 have pretty big libraries, fastspin and p2gcc have minimalistic ones). But the performance numbers show that riscvp2 more than holds its own. The dhrystone benchmark numbers are similar. On some other benchmarks (e.g. micropython) riscvp2 trails p2gcc by a bit, so it isn't always the fastest, but it is definitely competitive. It's also the only compiler at present to have 64 bit double and long long support.
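To put the timing column in relative terms, here is a small helper (the name is made up for illustration) that expresses another compiler's time as a percentage of the riscvp2 time, using the figures from the table above.

```c
#include <stdint.h>

/* Time of `other_us` as a percentage of `base_us`, rounded to the
   nearest whole percent (100 means equal, 200 means twice as long). */
uint32_t relative_time_percent(uint32_t other_us, uint32_t base_us)
{
    return (other_us * 100u + base_us / 2u) / base_us;
}
```

Against riscvp2's 25245 us this gives 158 for fastspin, 221 for Catalina and 257 for p2gcc, i.e. roughly 1.6x, 2.2x and 2.6x the run time on this one benchmark.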

Rayman · 2019-08-29 22:21

Just trying to get a handle on your terminology...

You seem to be calling it an emulator but also a compiler (and not an interpreter).

I think if this were running on a PC, it would be called an emulator, right?

DavidZemon · 2019-08-29 22:26

David Betz wrote: »

Rayman wrote: »

Can you say how this would compare to C++ via GCC (when it comes)?

Also, I'm guessing that inline assembly is not possible. Is that true?
If so, how would one add assembly drivers to a project?

Sure you can use inline assembler. The problem is it will be RISC-V assembler.

Aw crud... that has me a bit depressed. I was hoping that, with some work to change header files around and a few PropGCC-specific macros, I could compile all of PropWare with riscvp2. But since PropWare makes extensive use of inline (Propeller) assembly, porting to riscvp2 would be a significant undertaking - one that definitely isn't worth it so long as a proper GCC/LLVM port is still on the table.

ersmith · 2019-08-29 22:31

It's complicated. There are a number of ways of looking at it:

(1) You could just pretend it's a P2 compiler. After all, it takes C/C++ code and produces a binary that can run on a P2. The binaries it produces are a bit smaller than most P2 compilers, and potentially a little bit slower (although in practice, as shown above, they're often faster!)

(2) You could call it a RISC-V emulator, because it works by translating (at run time) RISC-V instructions into P2 instructions. But that's also a little bit misleading, because the RISC-V it emulates doesn't really exist (it has custom instructions that no actual RISC-V hardware has right now, although if Chip wants to use the RISC-V instruction set in P3 then he could use this as a starting point).

When you say "emulator", especially in the Propeller world, people tend to assume a (slow) interpreter. Since this one uses a JIT compiler, the output does run at full hubexec speed -- it just has some startup latency due to compilation (and really big programs that don't fit in cache do tend to slow down, but that's a rare case). So I've been kind of avoiding the "emulator" word. But yes, it's fair to call it an emulator -- but a very, very fast one.

jmg · 2019-08-29 22:38

ersmith wrote: »

...
When you say "emulator", especially in the Propeller world, people tend to assume a (slow) interpreter. Since this one uses a JIT compiler, the output does run at full hubexec speed -- it just has some startup latency due to compilation (and really big programs that don't fit in cache do tend to slow down, but that's a rare case). So I've been kind of avoiding the "emulator" word. But yes, it's fair to call it an emulator -- but a very, very fast one.

Do you have any comparisons with a compiler that compiles to produce a native P2 binary - that should be the fastest of all speeds ?

In a JIT compiler case, if the P2 can manage this itself (with some time overhead), could a PC also do the same work, and remove that load from the P2, and so create a native P2 file ?
How much larger does that P2 file become ?

ersmith · 2019-08-29 22:39

DavidZemon wrote: »

David Betz wrote: »

Rayman wrote: »

Can you say how this would compare to C++ via GCC (when it comes)?

Also, I'm guessing that inline assembly is not possible. Is that true?
If so, how would one add assembly drivers to a project?

Sure you can use inline assembler. The problem is it will be RISC-V assembler.

Aw crud... that has me a bit depressed. I was hoping that, with some work to change header files around and a few PropGCC-specific macros, I could compile all of PropWare with riscvp2. But since PropWare makes extensive use of inline (Propeller) assembly, porting to riscvp2 would be a significant undertaking - one that definitely isn't worth it so long as a proper GCC/LLVM port is still on the table.

Do the PropWare files have non-inline (probably slower, of course) versions? If so those would just compile. If not, you may want to add them: it'll make your code a lot more portable, maybe potentially to other compilers like fastspin (fastspin only supports a very limited subset of C++ now, but perhaps someday it could handle more). Besides the P2 instruction set changes, things like FCACHE are going to be very different between P2 and P1. Many P2 compilers will skip FCACHE entirely, since it isn't nearly as much of a win over hubexec as it is over LMM.

riscvp2 does have a <propeller2.h> file with macros for a lot of common P2 instructions.
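As an example of what a non-inline version might look like, here is a plain C bit reverse that any of the compilers mentioned should accept. It is only a sketch: the function name is invented, and it assumes the operation wanted is "reverse the low `bits` bits and zero the rest", which may not match the exact semantics of __builtin_propeller_rev or the P2 instruction.

```c
#include <stdint.h>

/* Reverse the low `bits` bits of x (bits is clamped to 32).
   Bits above that range are discarded, so the result is zero-extended.
   ASSUMPTION: this is the behaviour the caller wants; check it against
   the builtin or instruction being replaced. */
uint32_t rev_bits(uint32_t x, unsigned bits)
{
    uint32_t r = 0;
    for (unsigned i = 0; i < bits && i < 32; i++) {
        r = (r << 1) | (x & 1u);  /* move the lowest bit of x onto the end of r */
        x >>= 1;
    }
    return r;
}
```

A loop like this is slower than a single instruction, but it compiles everywhere and can be swapped for a macro later without touching the callers.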

ersmith · 2019-08-29 22:45

jmg wrote: »

ersmith wrote: »

...
When you say "emulator", especially in the Propeller world, people tend to assume a (slow) interpreter. Since this one uses a JIT compiler, the output does run at full hubexec speed -- it just has some startup latency due to compilation (and really big programs that don't fit in cache do tend to slow down, but that's a rare case). So I've been kind of avoiding the "emulator" word. But yes, it's fair to call it an emulator -- but a very, very fast one.

Do you have any comparisons with a compiler that compiles to produce a native P2 binary - that should be the fastest of all speeds ?

I posted it a few messages back. On fft-bench and dhrystone, riscvp2 is the fastest compiler for P2 -- faster than Catalina, p2gcc, and fastspin. On micropython we had some back and forth over whether riscvp2 or p2gcc was faster; I think as it stands now riscvp2 is about 10% slower than the custom p2gcc that @rogloh created, but the binaries are smaller and support floating point (which p2gcc doesn't support yet).

In a JIT compiler case, if the P2 can manage this itself (with some time overhead), could a PC also do the same work, and remove that load from the P2, and so create a native P2 file ?
How much larger does that P2 file become ?

In theory someone could do something like this on the PC; it would be like p2gcc but taking RISC-V input instead of P1 input. It's not a project I'm particularly interested in. In general RISC-V compressed code is 25% smaller than uncompressed RISC-V code, which is in turn roughly 10% smaller than P1 code (the RISC-V instruction set is more compact). Obviously your mileage may vary.
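Chaining those two rough percentages gives a feel for the sizes involved. The helper below is purely illustrative (the name is invented, and the 10000-byte starting size is an arbitrary example, not a measurement).

```c
#include <stdint.h>

/* Size after shrinking `bytes` by `percent_smaller` percent (0..100). */
uint32_t shrink(uint32_t bytes, uint32_t percent_smaller)
{
    return bytes * (100u - percent_smaller) / 100u;
}
```

Starting from a hypothetical 10000 bytes of P1 code, shrink(10000, 10) gives 9000 bytes of uncompressed RISC-V, and shrink(9000, 25) gives 6750 bytes of compressed RISC-V, so by these rules of thumb the compressed code is about a third smaller than the P1 code.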

Rayman · 2019-08-29 22:48

Ok, I think this is more of an emulator than a JIT compiler...
But, I do appreciate the negative connotations of "emulator".
Maybe I'd call it a "real time emulator" or "full speed emulator" or something like that...

ersmith · 2019-08-29 22:52

Rayman wrote: »

Ok, I think this is more of an emulator than a JIT compiler...
But, I do appreciate the negative connotations of "emulator".
Maybe I'd call it a "real time emulator" or "full speed emulator" or something like that...

No, it really is literally a JIT compiler. A JIT ("just in time") compiler is a compiler that translates at run time one instruction set into another. For example, the JVM JIT compiler translates Java bytecode to x86 (or whatever) instructions. That's exactly what riscvp2 does, at run time it compiles the RISC-V instructions to P2 instructions.

Rayman · 2019-08-29 22:59

I think the difference is the source... bytecode is not machine instructions...

Rayman · 2019-08-29 23:04

Another thing to consider is that if you call it an emulator, I know what you are talking about...
