fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...
Heater.
Posts: 21,230
Introducing fft_bench.
This is a simplified version of the heater_fft intended for use as a benchmark when comparing compilers, languages, interpreters or even different MCU's.
Included are both a Spin and a C language version of the FFT algorithm. I have tried to keep them as similar in operation as possible so that we are at least attempting to compare like with like when benchmarking different languages.
Fixed point arithmetic is used. The idea is to for this to be a benchmark algorithm containing a good selection of constructs typically used in MCU applications. To that end it is not making heavy use of strings (or any in fact) or becoming solely a test of subroutine calling like the FIBO test.
The Spin version completes a 1024 point FFT in about a second and a half. I have not adapted the C version to Zog or Catalina yet. Perhaps someone with a few spare minutes could do so. On my PC it completes the same task in 195 micro seconds!!!
Hopefully others might like to contribute other language implementations to this effort.
May the benchmarking commence:)
This is a simplified version of the heater_fft intended for use as a benchmark when comparing compilers, languages, interpreters or even different MCU's.
Included are both a Spin and a C language version of the FFT algorithm. I have tried to keep them as similar in operation as possible so that we are at least attempting to compare like with like when benchmarking different languages.
Fixed point arithmetic is used. The idea is to for this to be a benchmark algorithm containing a good selection of constructs typically used in MCU applications. To that end it is not making heavy use of strings (or any in fact) or becoming solely a test of subroutine calling like the FIBO test.
The Spin version completes a 1024 point FFT in about a second and a half. I have not adapted the C version to Zog or Catalina yet. Perhaps someone with a few spare minutes could do so. On my PC it completes the same task in 195 micro seconds!!!
Hopefully others might like to contribute other language implementations to this effort.
May the benchmarking commence:)
Comments
Something's wrong. Why do I get 0us?
Anyway ... Thanks for posting the benchmark.
Perhaps you have a really fast machine!:)
I just compiled and ran it on a different PC under Debian using:
If I change the optimization to -O3 or -Os the execution time drops to 81us !
Or, could you hack some printfs into time_us() to check that timeval.tv_sec and timeval.tv_usec are being set nicely by gettimeofday().
The astute will notice that the frequency zero magnitude is 1FE instead of 1FF. That is because I'm using shift right by 12 (>> 12) in the butterfly calculation instead of divide by 4096 for the fixed point scaling. These are NOT quite equivalent.
I tried to use the rusage() function to get execution time but was getting zeros for the micro-seconds.
Oh I get it. You are compiling on Windows. Perhaps that is not returning a microsecond resolution time of day.
The benchmark gives
Change c code is attached.
PS: I forgot to mention that time on the propeller is measured in milliseconds!
Thanks, Martin.
I get the same result on a DRACBLADE, using the command line:
So Catalina is around 4 times faster than Spin.
BTW - your hacks are not hacks - the original program is not quite ANSI standard C. I have attached an ANSI C89 compliant version (quite similar to yours). Here is a summary of the changes:
- replace sys/time.h with time.h
- replace gettimeofday() with clock(), which returns type clock_t
- incomplete types cannot be static - ANSI std section 6.9.2
- the %lld format specifier (for long long) is not defined in C89.
None of the changes affect the actual fft algorithm - just the timekeeping. I'd be happy to accept benchmark results from any of the C versionsIf I try to compile my vertion with gcc option -std=c89 or -ansi it throws up on all the // comment markers.
I hate those old style /*...*/ comment separators.
Anyway I'll unify these effortss at some point.
Drat! I forgot about those!
Anybody got any ZOG times?
Your problem may be that (on the PC - which is many hundreds of times faster than a Prop!) the program is too quick for the resolution of clock() - do the fft in a loop 100 times and print the total time for 100 cycles.
I thought it would be instructive to create a version of the fft benchmark that does not use stdio - i.e. which uses only the "built-in" Catalina HMI functions. These are simpler than stdio, and more closely mimic the "fullduplexserial" functions used by the SPIN version (on which they were originally based!). This makes no difference to the benchmark itself - only to the code that prints the results.
The SPIN version was compiled with the Propeller Spin tool, and the Catalina version was compiled with the command:
Here is a comparison of the run-time size (not the file size) of the results (sizes in bytes). So Catalina C is not only nearly 4 times faster than SPIN, it does so using less than 20% more Hub RAM at run time. - i.e. with both C & SPIN you still have around 16k of hub RAM left for more program code.
But a sad day for Zog, he is very sick.
Using Dave Betz latest zog build for the 32MB SDRAM Zog completes the FFT in 6.2 seconds.
BUT the results are wrong!
The DC and nyquist frequencies are correct but the other two are some what wrong. Plus every other frequency is having a few low order bits set.
I can get the correct results by turning off all optimizations but then the execution time is 23.4 seconds!
Well, I would be a bit concerned about Zog getting the results wrong (after all, this is all integer arithmetic!) - but I don't think Zog's speed is all that bad - considering it is running from XMM RAM.
The following results will not be directly comparable, but when I run the program from XMM RAM with Catalina using a caching driver similar to the one Zog uses (but on the DRACBLADE, which will have XMM RAM somewhat slower than Jazzed's card), I get the following times:
- 3.1 seconds using the XMM SMALL mode (data in Hub RAM)
- 10 seconds using the XMM LARGE mode (data in XMM RAM)
Ross.P.S. Is anybody doing a "Big SPIN" version? I'd be interested in comparing these times with that!
Something is seriously wrong with Zog and this benchmark, or perhaps zog and the SDRAM platform.
When I move code around or comment things out it can crash totally.
Again with no optimization it is fine. With optimization it fails in a different way!
This is not good.
These are all compromises - bigger memory = slower, but with the flexibility to move parts of the time critical code into hub if needed. Or even pasm loadable objects if required.
I am sending Zog a getwell card.
I'll try this latest version later today and try to reproduce the ZOG problems.
Where does this show up in the output?
When things go wrong a lot more of the frequency bins are non-zero resulting in a lot more output lines.
Just now this is looking bad. I have three different ZPUs all failing differently. ZOG himself and two versions of ZPU in C one optimized and one not.
Looks like I have to run fft_bench on the real VHDL ZPU under the GHDL simulator to see where we are going wrong.
This might take a while.
A 340Mhz Prop! I want one of those!
fft_bench.c on ZOG on GG SDRAM card, code/data/stack external, 5.7 seconds!
With bit perfect results I might add:-)
That even gives Catalina a run for it's money in such a configuration.:-)
The problems with ZOG and fft_bench are a bit weird all will be revealed when I have more than just this phone to post with.
That's good news, Heater! I wish I had a GG board to do a direct comparison. Hopefully someone else will be able to do one when I release the next version of Catalina with caching XMM driver support.
Ross.
Additionally something in printSpectrum causes trouble - trying to narrow it down now.
The old fft run time was 7.8s.
Can't wait to hear the rest of the ZOG story.
Thanks,
David
In fft_bench there are tables, wx and wy, of "twiddle factors". Yes that's what the FFT literature often calls them. These are basically tables of cos and minus sin for angles 0 to 180 degrees. The FFT algorithm steps through the tables at it works. See, the the FFT rotates through the angles hence "twiddle factors".
We find that the second half of the cos table (wx) is exactly the same values as the first half of the minus sin table (wy). So, being sneaky, I decide to save space. I delete the second half of the cos table and allow the cos lookups to overflow the cos table into the sine table which has the required correct values. This works fine as long as wy follows wx in memory contiguously. In Spin it's just a case of putting the wy lable in the correct place in the table.
BUT, boom! For zpugcc we find that the tables are aligned on LONG boundaries. Even though they are WORD wide tables. That means there is a one WORD gap between them in memory and the whole thing goes bad. Strangely enough this only seemed to happen when switching on optimization, which we need to get any speed out of the thing.
So, here is a new version of fft_bench.c that works for Zog on the SDRAM platform.
Another solution would be to use tables of LONGs but that makes everything bigger and more importantly removes a part of the benchmark, converting between WORDs and LONGs.
Of course relying on memory layout in C, or any language, like that is a really bad idea anyway.
Another solution might have been to declare one big array for the table and have wx and wy as pointers into the correct locations. Turns out that adds 20% to my execution time. Odd since arrays and pointers in C are generaly thought to be interchangeable.
Still this leaves me with another headache, my ZOG virtual machine in C still fails:( Odd because it was the prototype for the PASM version.
Have fun benchmarking.
I'll try and get on the Zog changes again when I have a free minute. I might add fft_bench to the test directory as well.
Thanks! I'll be looking forward to your improvements!
Use #ifdef GCC for compilers that can't handle it.
Was there anything else that you had to do for ZOG? Everything is fine now?
As I said previously I have two arrays of words wx[] and wy[] used as lookup tables.
In order to save a bit of space these arrays run consecutively. The second half of wx overlaps the first half of wy. You can see that clearly in the Spin version.
When you do that in C you are taking a gamble that the two arrays are laid out in memory consecutively by the compiler.
Turns out that zpu-gcc does that when optimizations are switched off. But when optimizing in actually puts the wy array first in memory followed by wx!
BOOM the table lookups fail.
So that piece of sloppy programming on my part just has to go.
There are no changes to Zog due to this issue.