C and C++ for the Propeller 2
ersmith
On Thursday, Aug. 6 at 2pm Pacific time I plan to talk about C and C++ compilers available for the P2. I'll be discussing the following compilers:
Catalina 4.3
- https://sourceforge.net/projects/catalina-c/
fastspin 4.2.7
- https://github.com/totalspectrum/spin2cpp/
p2gcc 0.007 with propeller-elf-gcc 4.6.1 (SimpleIDE version)
- https://github.com/davehein/p2gcc/
riscvp2 (GCC 8.3.0)
- https://github.com/totalspectrum/riscvp2/
The basic outline will be:
- Overview of the 4 compilers
- Advantages and disadvantages of each
- Some benchmark comparisons
- Making code portable between compilers
- Using spin2cpp to convert Spin2 drivers to C code
Obviously, since I'm the author of fastspin and riscvp2, you'll have to take my analysis with a grain of salt; clearly I am biased, but I will try to be fair to Catalina and p2gcc, which have some enthusiastic users.
The benchmarks I was planning to run are:
CoreMark 1.0 (https://www.eembc.org/coremark/)
fftbench (https://github.com/ZiCog/fftbench)
proplisp (https://github.com/totalspectrum/proplisp)
Comments
Edit: updated sizes / times / command lines to reflect RossH's suggestions later in this thread.
This is a small benchmark, so I wouldn't put too much stock into the sizes (they will be dominated by library sizes; both Catalina and riscvp2 have pretty complete libraries, fastspin and p2gcc less so).
Can you let me know what Catalina options you are using when compiling your benchmarks?
Also, can you post the version of fftbench you are using? Version 1.0 works ok, but 1.2 needs some modification to compile with Catalina since it is not strictly ANSI C, and I'd like to make sure we are using the same code.
If I get time (not for your presentation) I'd like to modify it to use Catalina's multi-processing capabilities instead of OMP. It would make a nice demo, since it should speed up the execution time according to the number of available cogs.
Ross.
Looks like our posts crossed - you actually answered my questions before I had asked them!
Why do you use optimization level 3 and not 5 for Catalina? The additional optimization doesn't make a great deal of difference in this case, but it may in some others.
If you use -O5 in this case, the results are:
Also, you can add -ltiny to use the "tiny" version of printf (does this look familiar?):
Again, not a huge difference, but it does highlight the fact that this particular benchmark is including parts of the standard C library, so if some of the compilers are not implementing the full library function it will make a difference in both the file sizes and the execution times.
So the command I finally used (note that you don't need the -D PROPELLER) was:
Ross.
Another thing to point out when benchmarking (at least for Catalina, but I am sure it would be true of some of the other compilers as well) is that the file size has little bearing on the actual memory footprint occupied by the program.
Catalina loads up things like plugins and kernels into cogs, and then re-uses the memory they occupied as stack and heap space where possible. So quoting the file size is not a good indication of the run-time size. On the P1, which had only 32kb, it was crucial to re-use as much Hub RAM as possible, and this complicated the load process (and added to the file sizes) significantly.
While this is now less critical on the P2, Catalina still reclaims memory where possible, so a better indication of the actual memory "footprint" of a program is the sum of the segment sizes (i.e. code, cnst, init, data) and not the final file size, which is often many kb larger.
For instance, if you compile this program as a COMPACT program with Catalina, the code size is almost halved, but that difference is not reflected in the file size.
Ross.
This all sounds very interesting. I've only used Fastspin so far. I've tried Catalina, but I have to admit that I gave up after not being able to run HelloWorld after two hours. It probably would have worked if I had tried harder, but I was too lazy because I'm really happy with Fastspin at the moment.
However, I've seen that there is a debugger for Catalina called "black box" or "black cat". I think this is quite interesting as there seems to be no other real source level debugger for the P2 at the moment. Will you (or Ross if he also participates) talk about debugging in C?
I don't really know how you can simplify "catalina -lc hello_world.c" much further ... but perhaps that's just me!
BlackBox and BlackCat are two different (but similar) source-level debuggers. BlackCat was originally created by Bob Anderson, but with some technical input from me. It is a graphical debugger, but runs only on Windows. I wanted a debugger that worked on all platforms, so I created BlackBox, which is a command-line debugger in the style of gdb.
I don't think I will be able to participate in this discussion - the time doesn't work well for me because I generally have to work during the day. But I will certainly try to catch up with it afterward.
Since the Simple Tools Library has a wealth of support for all kinds of devices that are used for the P1 and could be possibly used on the P2, should there be a c2spin2 conversion program, if that is even possible?
Yea, that seems kind of weird and somehow backwards, but the Simple Tools Library has a lot of stuff covered. Sure would hate to see that go to waste, if everybody is going to switch over to using Spin2.
Ray
A conversion program is great if you just need some working code but it's not a good long term solution.
Spin2 code would be nice for us Propeller-heads but Parallax needs Python code for their education customers and the truth is they are the money makers at the moment.
I think going forward all documentation should include Pseudo-code that defines all the constants needed along with generic functions that show how to use the device.
That would make it easier to customize the code for whatever micro or language someone wants to use.
I haven't used Catalina's debugger, so I'm not really qualified to talk about it.
There's not much point: all of the SimpleTools C code is based on Spin objects (or has Spin object equivalents). And it all has to be ported from P1 to P2. Given that it has to be ported anyway, I think most people will start with the original Spin objects rather than the C code.
I got lots of error messages but was finally able to compile but not to download and run it. (see discussion here). But as I said, I had too little patience, that time. I'll watch the presentation and maybe give it another try.
Oh, yes - I remember. You had a problem with the payload loader. When you are ready to try again, let me know and we'll see if we can track it down. But let's take it up again in the original thread.
I wasn't able to build this with the new llvm based C compiler; while it looks promising, it's still not quite ready to build big applications yet.
CoreMark benchmark results (more iterations/sec is better)
All tests were run at 180 MHz. fastspin, p2gcc, and riscvp2 binaries were run with:
Catalina defaults to 180 MHz, so the -PATCH was unnecessary.
Various quirks:
I had to disable floating point in core_portme.h in order to get the p2gcc version to build. This was probably good anyway, because it made the binaries smaller. I also had to manually copy a file, because p2gcc couldn't find its .s file in a subdirectory.
fastspin produced incorrect results when compiled with -O2, so this is exposing an optimizer bug.
riscvp2 has 2 ways of building: with a 32K instruction cache (the default) or with cache in LUT (riscvp2_lut). The latter is much smaller, but also much slower.
How to build:
Source code is at https://github.com/totalspectrum/coremark.
For riscvp2, edit the propeller/core_portme.mak and do
For fastspin, edit the propeller/core_portme.mak and do
For catalina, use the command line:
catalina -p2 -O5 -lci -ltiny -C NATIVE -I propeller -I . -D 'FLAGS_STR=\"default\"' -D ITERATIONS=0 -D PERFORMANCE_RUN=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c propeller/core_portme.c -o catalina
For p2gcc, you'll need a hack, because it can't find the .s file in propeller/; so link core_portme.c to the current directory and do:
p2gcc -I propeller -I . -D 'FLAGS_STR="default"' -D ITERATIONS=450 -D PERFORMANCE_RUN=1 -D P2GCC=1 core_list_join.c core_main.c core_matrix.c core_state.c core_util.c core_portme.c -o p2gcc.bin
(The ITERATIONS count is limited because above a certain amount of time the "Total ticks" line is printed wrong, probably a bug in p2gcc's printf.)
It would be interesting to hear whether there is now a path to bring the P2 into the Arduino environment with one of these compilers.
Not because this is the best tool (it is not) but because of the immense wealth of code and libraries (and library management) and because it would make it easy to switch to the P2. Programming the ESP32 and P2 with the same tool would be great, for example.
The Arduino environment requires C++, so for now riscvp2 would be the only feasible option.
As a bonus the benchmark was simple enough to translate into Spin easily, and so I could compare the C compilers to Spin2 (PNut and fastspin). All of the C compilers handily beat PNut on this benchmark.
The SPI code I basically just took from the first google hit I got for "bit banged SPI in C"; I had to tweak it a bit of course (changed macros to work with pinh/pinl instructions, etc.). If I were writing it myself from scratch I would definitely use "int" everywhere instead of "unsigned char", which is slower on most compilers (you can see the difference in fastspin's performance in C vs. Spin; Spin only has "int" parameters). OTOH the compiler should adapt to the human's code, not vice-versa, so perhaps it's good to feed sub-optimal code to the compiler.
Results (fewer cycles is better):
If I wanted the fastest possible SPI on P2 I would shift the data left by 24 at the beginning and test the high bit, because I know fastspin can optimize "x & $8000_0000" and "x <<= 1" by using the carry bit, leading to a tighter inner loop and a benchmark score of 49397 cycles. I didn't want to cheat.
I've never looked at those colored blocks ever the same since the VentilatorOS DIR sorting incident
IDK.
That's not entirely cheating, as a good language should have pathways to be a "high level assembler", and if that means the C code becomes less clear but the ASM is faster, to me that is ok.
I have quite a lot of MCU C code exactly like that, where I have to whack the compiler about the head, until the ASM it gives, is half-way-sane.
ie I'd suggest you add a 'crafted fastspin' line, where C clarity is traded for speed. 49397 cycles crafted is faster than 82174 cycles as generic.
Usually that is only done on inner loops, like in this example.
Curious why fastspin (Spin2) at 57589 cycles is different from fastspin (C) at 82174 cycles? Is there an optimization step in your Spin2 compile path that is not yet in the C path?
(1) Subroutine calls are very expensive in Catalina. Is there some way you could modify the code generator to use one of the (many) P2 call instruction variants directly, instead of calling into the kernel for the call and the return?
(2) As mentioned on another thread, pushing/popping multiple registers is cheaper with setq / rdlong patterns. In fact given the overhead of calling the current push_m / pop_m functions, you may find a saving by using these and saving/restoring all of the eligible registers all the time. Individual rdlong and wrlong are expensive.
(3) Catalina is generating qmul / getqx for multiplies, which is fine, but you've spent a lot of effort elsewhere in the code to make it re-entrant and that particular sequence isn't (an interrupt could happen between the qmul and the getqx). Frankly I'm not sure the effort to allow ISRs written in C is worth it for the P2, but OTOH it is a unique feature of Catalina.
(4) mov r22, ##512 is smaller and faster if re-written as "decod r22, #9"
I think I mentioned that the C version uses an "unsigned char" variable rather than "int". That means the compiler has to insert a "zerox" instruction in the inner loop of the C version (C requires that unsigned char be extended to int before doing arithmetic on it). Again, *I* know that int is better for fastspin, and indeed that will probably be true for all compilers, but I got the SPI code from Google and I tried to resist any urges to improve it. For Spin the issue didn't come up, parameters in Spin2 are always LONG.
I can understand the desire to harvest test code, but if one test then skews to use int, because that language cannot support unsigned char, then the benchmarks actually become less useful.
The best benchmarks compare like-with-like.
EDIT: I recall now - it was not SETQ that you can't use in an interrupt routine - it is the SKIP operations. I should indeed rewrite the push_m and pop_m functions to use SETQ!
Ah! Thanks for that - I didn't know that - I will have to disable interrupts.
Good call, thanks.
My approach with Catalina has always been "functionality first, efficiency later" ... but then there always seems to be more interesting functionality to implement!
Had I but world enough and time ...
Different languages are always going to have different code. The Spin2 version is just a sideshow. I'm benchmarking C compilers here, and I thought the Spin2 comparison would be an interesting contrast and a rough order of magnitude comparison of how hubexec and xbyte stack up, but evidently it's a pretty big distraction.
But you do bring up some interesting points. It might be a good idea to have a "fastest implementation" kind of benchmark where people, for example, try to create the fastest possible SPI implementation for their particular language/compiler. We'd have to set out some rules (like no inline assembly) to really test the language itself, but it would provide a good kind of "upper bound" on how good a particular compiler can be. As you say, sometimes in MCU development you really want to squeeze out all the performance you can and tweak the code to be whatever is best for the exact tool you're using.
Roy brought up some points about cross-platform code style so it runs in each compiler. Parallax would like to get more involved, host libraries and lend some backing to Eric's efforts. I suggest we have a meeting before the end of August with those interested in C to discuss some of these details so we get more behind the effort.
I'll schedule that meeting after we get the next couple planned on Zoom: the Free-for-all; SmartPins with Chip, and JonnyMac's "Tao of Spin2 Programming".
Ken Gracey
That's a great idea, Ken. Having everyone on the same call to talk about making standard C libraries/defines would be really good.
A small correction: it's not just "my" efforts, but rather the efforts of everyone involved in making C for the Propeller, and there are a lot of us. Which is why having some coordination would be so helpful.
Noted with much recognition for all contributors!
Ken Gracey