fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...

Heater. · 2011-02-28 05:14

Introducing fft_bench.

This is a simplified version of the heater_fft intended for use as a benchmark when comparing compilers, languages, interpreters or even different MCU's.

Included are both a Spin and a C language version of the FFT algorithm. I have tried to keep them as similar in operation as possible so that we are at least attempting to compare like with like when benchmarking different languages.

Fixed point arithmetic is used. The idea is to for this to be a benchmark algorithm containing a good selection of constructs typically used in MCU applications. To that end it is not making heavy use of strings (or any in fact) or becoming solely a test of subroutine calling like the FIBO test.

The Spin version completes a 1024 point FFT in about a second and a half. I have not adapted the C version to Zog or Catalina yet. Perhaps someone with a few spare minutes could do so. On my PC it completes the same task in 195 micro seconds!!!

Hopefully others might like to contribute other language implementations to this effort.

May the benchmarking commence:)

jazzed · 2011-02-28 06:11

Heater. wrote: »

May the benchmarking commence:)

Something's wrong. Why do I get 0us?

c:\Propeller\fft_bench_v1.0>gcc fft_bench.c -o fft

c:\Propeller\fft_bench_v1.0>fft
fft_bench v1.0
Freq.    Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 0 us

Anyway ... Thanks for posting the benchmark.

Heater. · 2011-02-28 06:27

Jazzed,

Perhaps you have a really fast machine!:)

I just compiled and ran it on a different PC under Debian using:

heater@rsm-51:~/fft_bench_v1.0$ gcc -O0 -o fft_bench fft_bench.c
heater@rsm-51:~/fft_bench_v1.0$ ./fft_bench
fft_bench v1.0
Freq.    Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 191 us

If I change the optimization to -O3 or -Os the execution time drops to 81us !

Or, could you hack some printfs into time_us() to check that timeval.tv_sec and timeval.tv_usec are being set nicely by gettimeofday().

The astute will notice that the frequency zero magnitude is 1FE instead of 1FF. That is because I'm using shift right by 12 (>> 12) in the butterfly calculation instead of divide by 4096 for the fixed point scaling. These are NOT quite equivalent.

I tried to use the rusage() function to get execution time but was getting zeros for the micro-seconds.

Heater. · 2011-02-28 06:30

Jazzed:

c:\Propeller\fft_bench_v1.0>gcc fft_bench.c -o fft

Oh I get it. You are compiling on Windows. Perhaps that is not returning a microsecond resolution time of day.

jazzed · 2011-02-28 08:30

Here is a version with windows support.

c:\Propeller\fft_bench_v1.0>gcc -DWINDOWS fft_bench.c -o fft
c:\Propeller\fft_bench_v1.0>fft
fft_bench v1.0
Freq.    Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 252 us

martinh · 2011-02-28 10:04

Hello, I made a very quick hack to compile the c version of the benchmark with catalina 2.9 and compiled it for my demo board.

catalina fft_bench.c -lci -D PC -D DEMO -D CLOCK

The benchmark gives

fft_bench v1.0

Freq.    Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time:    413 msec

Change c code is attached.

PS: I forgot to mention that time on the propeller is measured in milliseconds!

RossH · 2011-02-28 23:39

martinh wrote: »

Hello, I made a very quick hack to compile the c version of the benchmark with catalina 2.9 and compiled it for my demo board.

Thanks, Martin.

I get the same result on a DRACBLADE, using the command line:

catalina -D DRACBLADE -D LORES_VGA -D CLOCK ANSI_fft_bench.c -lci

So Catalina is around 4 times faster than Spin.

BTW - your hacks are not hacks - the original program is not quite ANSI standard C. I have attached an ANSI C89 compliant version (quite similar to yours). Here is a summary of the changes:

replace sys/time.h with time.h
replace gettimeofday() with clock(), which returns type clock_t
incomplete types cannot be static - ANSI std section 6.9.2
the %lld format specifier (for long long) is not defined in C89.

None of the changes affect the actual fft algorithm - just the timekeeping. I'd be happy to accept benchmark results from any of the C versions

Heater. · 2011-03-01 00:10

Standards, shmandards.

If I try to compile my vertion with gcc option -std=c89 or -ansi it throws up on all the // comment markers.

I hate those old style /*...*/ comment separators.

Anyway I'll unify these effortss at some point.

RossH · 2011-03-01 00:25

Heater. wrote: »

If I try to compile my vertion with gcc option -std=c89 or -ansi it throws up on all the // comment markers.

Drat! I forgot about those!

Anybody got any ZOG times?

Heater. · 2011-03-01 00:44

clock() does not seem to work under linux. I only get zeros returned.

RossH · 2011-03-01 00:45

Hi Heater,

Your problem may be that (on the PC - which is many hundreds of times faster than a Prop!) the program is too quick for the resolution of clock() - do the fft in a loop 100 times and print the total time for 100 cycles.

RossH · 2011-03-01 01:14

All,

I thought it would be instructive to create a version of the fft benchmark that does not use stdio - i.e. which uses only the "built-in" Catalina HMI functions. These are simpler than stdio, and more closely mimic the "fullduplexserial" functions used by the SPIN version (on which they were originally based!). This makes no difference to the benchmark itself - only to the code that prints the results.

The SPIN version was compiled with the Propeller Spin tool, and the Catalina version was compiled with the command:

catalina -lci Catalina_fft_be.c -D DRACBLADE -D CLOCK -D PC

Here is a comparison of the run-time size (not the file size) of the results (sizes in bytes).

+---------+---------+---------+-------+
|Language | runtime | runtime | Time  |
|         | code    | var     | ms    |
+---------+---------+---------+-------+
|SPIN     | 3228    | 8260    | 1465  |
+---------+---------+---------+-------+
|C        | 3800    | 9996    | 413   |
+---------+---------+---------+-------+

So Catalina C is not only nearly 4 times faster than SPIN, it does so using less than 20% more Hub RAM at run time. - i.e. with both C & SPIN you still have around 16k of hub RAM left for more program code.

Heater. · 2011-03-01 02:32

An excellent result Ross.

But a sad day for Zog, he is very sick.

Using Dave Betz latest zog build for the 32MB SDRAM Zog completes the FFT in 6.2 seconds.

BUT the results are wrong!

The DC and nyquist frequencies are correct but the other two are some what wrong. Plus every other frequency is having a few low order bits set.

I can get the correct results by turning off all optimizations but then the execution time is 23.4 seconds!

RossH · 2011-03-01 03:02

Hi Heater,

Well, I would be a bit concerned about Zog getting the results wrong (after all, this is all integer arithmetic!) - but I don't think Zog's speed is all that bad - considering it is running from XMM RAM.

The following results will not be directly comparable, but when I run the program from XMM RAM with Catalina using a caching driver similar to the one Zog uses (but on the DRACBLADE, which will have XMM RAM somewhat slower than Jazzed's card), I get the following times:

3.1 seconds using the XMM SMALL mode (data in Hub RAM)
10 seconds using the XMM LARGE mode (data in XMM RAM)

Ross.

P.S. Is anybody doing a "Big SPIN" version? I'd be interested in comparing these times with that!

Heater. · 2011-03-01 03:14

Seems zog can actualy keep up in some cases:-)

Something is seriously wrong with Zog and this benchmark, or perhaps zog and the SDRAM platform.

When I move code around or comment things out it can crash totally.

Heater. · 2011-03-01 03:42

Hmmm....I just ran fft_bench compiled for zpu under my zpu emulator in C on the PC.

Again with no optimization it is fine. With optimization it fails in a different way!

This is not good.

Dr_Acula · 2011-03-01 05:34

So zog comes in somewhere between xmm 'small' and xmm 'large'

These are all compromises - bigger memory = slower, but with the flexibility to move parts of the time critical code into hub if needed. Or even pasm loadable objects if required.

I am sending Zog a getwell card.

jazzed · 2011-03-01 05:54

I tested the SPIN FFT last week with BigSpin on SDRAM and it took about 7 seconds.
I'll try this latest version later today and try to reproduce the ZOG problems.

Heater. wrote: »

The DC and nyquist frequencies are correct but the other two are some what wrong. Plus every other frequency is having a few low order bits set.

Where does this show up in the output?

Heater. · 2011-03-01 06:11

Jazzed,

here does this show up in the output?

The output display function will print any frequency component that is non-zero. Should normally be 4 lines reflecting the input waveform we but in. First the zero frequency (DC) level. Then the magnitudes of two mid range tones. Then the magnitude of the highest possible frequeny.

When things go wrong a lot more of the frequency bins are non-zero resulting in a lot more output lines.

Just now this is looking bad. I have three different ZPUs all failing differently. ZOG himself and two versions of ZPU in C one optimized and one not.

Looks like I have to run fft_bench on the real VHDL ZPU under the GHDL simulator to see where we are going wrong.

This might take a while.

Dave Hein · 2011-03-01 07:41

I got 344 msec for Spin, but I cheated and ran it under SpinSim. Compariing this to Ross' Spin time makes that the equivalent of a 340 MHz Prop. I got 109 us on the same PC running the C code.

RossH · 2011-03-01 13:51

Dave Hein wrote: »

I got 344 msec for Spin, but I cheated and ran it under SpinSim. Compariing this to Ross' Spin time makes that the equivalent of a 340 MHz Prop. I got 109 us on the same PC running the C code.

A 340Mhz Prop! I want one of those!

Heater. · 2011-03-01 13:55

Yipee! ZOG is now well again. Thank you Dr A.

fft_bench.c on ZOG on GG SDRAM card, code/data/stack external, 5.7 seconds!

With bit perfect results I might add:-)

That even gives Catalina a run for it's money in such a configuration.:-)

The problems with ZOG and fft_bench are a bit weird all will be revealed when I have more than just this phone to post with.

RossH · 2011-03-01 14:12

Heater. wrote: »

Yipee! ZOG is now well again. Thank you Dr A.

fft_bench.c on ZOG on GG SDRAM card, code/data/stack external, 5.7 seconds!

With bit perfect results I might add:-)

That even gives Catalina a run for it's money in such a configuration.:-)

The problems with ZOG and fft_bench are a bit weird all will be revealed when I have more than just this phone to post with.

That's good news, Heater! I wish I had a GG board to do a direct comparison. Hopefully someone else will be able to do one when I release the next version of Catalina with caching XMM driver support.

Ross.

jazzed · 2011-03-01 14:59

Bad news for BigSpin. The latest fft_bench runs in 8.78s with all segments in SDRAM.
Additionally something in printSpectrum causes trouble - trying to narrow it down now.
The old fft run time was 7.8s.

Can't wait to hear the rest of the ZOG story.

David Betz · 2011-03-01 17:19

Heater. wrote: »

The problems with ZOG and fft_bench are a bit weird all will be revealed when I have more than just this phone to post with.

When you get a chance could you commit your changes to SVN on Google Code?

Thanks,
David

jazzed · 2011-03-01 17:31

The BigSpin problems appear to be related to it's currently unfinished state. Guess I have reason to do more work to it now, but i still have other higher priorities. It's hard to tell whether that 8.78s measure is valid - some tests I ran gave 5.9s. More to come later.

Heater. · 2011-03-02 02:07

The rest of the ZOG story....

In fft_bench there are tables, wx and wy, of "twiddle factors". Yes that's what the FFT literature often calls them. These are basically tables of cos and minus sin for angles 0 to 180 degrees. The FFT algorithm steps through the tables at it works. See, the the FFT rotates through the angles hence "twiddle factors".

We find that the second half of the cos table (wx) is exactly the same values as the first half of the minus sin table (wy). So, being sneaky, I decide to save space. I delete the second half of the cos table and allow the cos lookups to overflow the cos table into the sine table which has the required correct values. This works fine as long as wy follows wx in memory contiguously. In Spin it's just a case of putting the wy lable in the correct place in the table.

BUT, boom! For zpugcc we find that the tables are aligned on LONG boundaries. Even though they are WORD wide tables. That means there is a one WORD gap between them in memory and the whole thing goes bad. Strangely enough this only seemed to happen when switching on optimization, which we need to get any speed out of the thing.

So, here is a new version of fft_bench.c that works for Zog on the SDRAM platform.

Another solution would be to use tables of LONGs but that makes everything bigger and more importantly removes a part of the benchmark, converting between WORDs and LONGs.

Of course relying on memory layout in C, or any language, like that is a really bad idea anyway.

Another solution might have been to declare one big array for the table and have wx and wy as pointers into the correct locations. Turns out that adds 20% to my execution time. Odd since arrays and pointers in C are generaly thought to be interchangeable.

Still this leaves me with another headache, my ZOG virtual machine in C still fails:( Odd because it was the prototype for the PASM version.

Have fun benchmarking.

Heater. · 2011-03-02 02:08

Dave Betz,

I'll try and get on the Zog changes again when I have a free minute. I might add fft_bench to the test directory as well.

David Betz · 2011-03-02 03:49

Heater. wrote: »

Dave Betz,

I'll try and get on the Zog changes again when I have a free minute. I might add fft_bench to the test directory as well.

Thanks! I'll be looking forward to your improvements!

jazzed · 2011-03-02 07:04

Heater. wrote: »

The rest of the ZOG story....

BUT, boom! For zpugcc we find that the tables are aligned on LONG boundaries. Even though they are WORD wide tables. That means there is a one WORD gap between them in memory and the whole thing goes bad. Strangely enough this only seemed to happen when switching on optimization, which we need to get any speed out of the thing.

static int16_t wx[] = { 4095, 4094 /* ... */ } __attribute__((packed));
Use #ifdef GCC for compilers that can't handle it.

Was there anything else that you had to do for ZOG? Everything is fine now?

Heater. · 2011-03-03 02:49

I was wrong. The issue with Zog and fft_bench is not to do with alignment.

As I said previously I have two arrays of words wx[] and wy[] used as lookup tables.
In order to save a bit of space these arrays run consecutively. The second half of wx overlaps the first half of wy. You can see that clearly in the Spin version.
When you do that in C you are taking a gamble that the two arrays are laid out in memory consecutively by the compiler.
Turns out that zpu-gcc does that when optimizations are switched off. But when optimizing in actually puts the wy array first in memory followed by wx!
BOOM the table lookups fail.

So that piece of sloppy programming on my part just has to go.

There are no changes to Zog due to this issue.

fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...

Comments