fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...

Dave Hein · 2011-03-03 07:47

Heater,

You could make wx and wy completely independent, which means that you would need to use 256 more words for wx. Or you could make wx and wy one array, and access the wy component as wx[wIndex+256], or define wy as a pointer equal to &wx[256].
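As a rough sketch of the last two options in C: the offset of 256 comes from the suggestion above, while the element type, the total array size, and the helper function names are assumptions made only for illustration.

```c
#include <stdint.h>

#define WY_OFFSET 256               /* wy starts 256 entries into wx, as suggested above */
#define W_TOTAL   (2 * WY_OFFSET)   /* assumed total size, for illustration only */

/* One array holds both tables back to back. */
int16_t wx[W_TOTAL];

/* wy is just a constant pointer into the second part of wx. */
int16_t *const wy = &wx[WY_OFFSET];

/* Access the wy component by adding the offset to the index. */
int16_t wy_by_offset(int wIndex)
{
    return wx[wIndex + WY_OFFSET];
}

/* Access the wy component through the pointer. */
int16_t wy_by_pointer(int wIndex)
{
    return wy[wIndex];
}
```

Both functions read the same element. Whether they compile to the same machine code depends on the compiler and its optimization settings.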

Dave

Heater. · 2011-03-03 08:29

Dave Hein,

I tried making a big array and making wy a pointer. It works fine of course but it slowed the thing down by 20%. That was surprising as I would expect array indexing and pointer offsetting would amount to the same code.

I could skip adding offsets to the pointer and just increment it appropriately. Which is what I do in the latest heater_fft.spin. I have decided though to keep true to the Spin version and stick with array indexing and just have two arrays. It's not so important that the benchmark be small.

Similarly I did not want to use wx[wIndex+256] as that slows things down and is not true to the original intent.

I also discovered that using ints instead of shorts for the w's speeds things up by about 20%. Again I'm sticking with shorts as that type conversion is all part of the test.
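To illustrate where that conversion happens, here is a minimal sketch of a fixed-point scaling step. The function names and the divide by 4096 are assumptions for illustration and not necessarily what fft_bench.c does. The only point is that a 16-bit w has to be widened to 32 bits before the multiply, while a 32-bit w does not.

```c
#include <stdint.h>

/* w stored as a short: each use widens it to 32 bits before the multiply. */
int32_t scale_with_short(int32_t sample, int16_t w)
{
    return (sample * (int32_t)w) / 4096;
}

/* w stored as a 32-bit int: no widening needed, but the table is twice as big. */
int32_t scale_with_int(int32_t sample, int32_t w)
{
    return (sample * w) / 4096;
}
```

Both versions return the same value for the same inputs; any speed difference comes from the extra load and sign-extension work on a given target.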

David Betz · 2011-03-03 08:31

Heater. wrote: »

Dave Hein,

I tried making a big array and making wy a pointer. It works fine of course but it slowed the thing down by 20%. That was surprising as I would expect array indexing and pointer offsetting would amount to the same code.

I could skip adding offsets to the pointer and just increment it appropriately. Which is what I do in the latest heater_fft.spin. I have decided though to keep true to the Spin version and stick with array indexing and just have two arrays. It's not so important that the benchmark be small.

Similarly I did not want to use wx[wIndex+256] as that slows things down and is not true to the original intent.

I also discovered that using ints instead of shorts for the w's speeds things up by about 20%. Again I'm sticking with shorts as that type conversion is all part of the test.

Could you declare both arrays as members of a structure so they wouldn't be reordered?

Heater. · 2011-03-03 08:39

Dave Betz,

I thought about that. I had a suspicion that it may generate more code and slow things down. Not sure, I might have to try it.

But I think I'm sticking with the two big arrays. Using structures here diverges from the Spin version, and deliberately coding the thing to overflow out of one array into another is a pretty gacky style anyway :)

jazzed · 2011-03-03 08:56

I like the one array idea with wy as a pointer to &wx[wystart].

Dave Hein · 2011-03-03 09:09

If x[] and y[] are put in a struct "w" the compiler should generate the same code to access w.x[wIndex] and w.y[wIndex] as it would have for wx[wIndex] and wy[wIndex]. w.x and w.y are known constant addresses, just like wx and wy would be.
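A minimal sketch of that struct layout, assuming 16-bit elements and a table size of 256 (both are illustrative assumptions, as are the names twiddle, read_wx and read_wy):

```c
#include <stddef.h>
#include <stdint.h>

#define W_SIZE 256   /* assumed table size, for illustration only */

/* Struct members are laid out in declaration order, so y always follows x. */
struct twiddle {
    int16_t x[W_SIZE];
    int16_t y[W_SIZE];
};

struct twiddle w;

/* Each access is a fixed base address plus a scaled index. */
int16_t read_wx(int wIndex) { return w.x[wIndex]; }
int16_t read_wy(int wIndex) { return w.y[wIndex]; }

/* The byte offset of y inside the struct is a compile-time constant. */
enum { WY_BYTE_OFFSET = offsetof(struct twiddle, y) };
```

Because the offset of y is known at compile time, the address of w.y is as constant as the address of a separate wy array would be.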

David Betz · 2011-03-03 09:14

Dave Hein wrote: »

If x[] and y[] are put in a struct "w" the compiler should generate the same code to access w.x[wIndex] and w.y[wIndex] as it would have for wx[wIndex] and wy[wIndex]. w.x and w.y are known constant addresses, just like wx and wy would be.

That is what I was thinking as well. Of course, even noticing that they are constant addresses might require some degree of optimization to be turned on.

Heater. · 2011-03-05 11:39

I've just been running fft_bench.c on an XMOS chip. It gives the Prop a sound thrashing I'm afraid.
Using one thread on a core gets the job done in 4.5ms. That's 10 times faster than the heater_fft in PASM and 300 times faster than Zog!!!

If one wants to use all 8 threads on the xcore it slows to 9.5ms

We are going to need the Prop II to keep up with this.

jazzed · 2011-03-05 12:18

Heater. wrote: »

I've just been running fft_bench.c on an XMOS chip. It gives the Prop a sound thrashing I'm afraid.
Using one thread on a core gets the job done in 4.5ms. That's 10 times faster than the heater_fft in PASM and 300 times faster than Zog!!!

If one wants to use all 8 threads on the xcore it slows to 9.5ms

We are going to need the Prop II to keep up with this.

Ugh. Just what I need on a day of questioning the value of certain things!

Heater. · 2011-03-05 12:31

I'm guessing that the Prop II programmed in PASM will be in the race with the similarly priced single-core XMOS when running 8 threads programmed in C. So I think the outlook is good.

C on the Prop is a disaster by comparison whichever way you look at it. Hence, I guess, the reason the "professionals" are said to pass it by.

I'm curious. What other values have you been struggling with today?

David Betz · 2011-03-05 12:34

Heater. wrote: »

I've just been running fft_bench.c on an XMOS chip. It gives the Prop a sound thrashing I'm afraid.
Using one thread on a core gets the job done in 4.5ms. That's 10 times faster than the heater_fft in PASM and 300 times faster than Zog!!!

If one wants to use all 8 threads on the xcore it slows to 9.5ms

We are going to need the Prop II to keep up with this.

When you say PASM do you mean code running in a COG or LMM PASM code?

Heater. · 2011-03-05 12:58

Sorry yes, I should have mentioned that the heater_fft is much the same algorithm as fft_bench but implemented as straight PASM running in a single COG.

jazzed · 2011-03-05 13:04

Heater. wrote: »

I'm curious. What other values have you been struggling with today?

If I had remained in my "professional hole" with my blinders on I would have never considered a Propeller to lift me out (or drop me for that matter).

This other "value thing" can not be revealed until later because of NDA. At least one version of it would fit very nicely in our Key-chain Propeller though. Unfortunately Propeller does not support any of the languages in any form so I would have to re-write something one way or another.

Heater. · 2011-03-05 13:14

I too skipped over the Prop in the ELFA catalog a few times before curiosity got the better of me. Glad it did. All hobby fun aside, I can see many "professional" uses for the Prop, sadly not so much in my current position.

All that talk of NDA makes me even more curious:-)

rod1963 · 2011-03-05 13:22

Dreadful performance on the part of the Prop. Worse, C doesn't work well on the Prop, which is a great way to scare off commercial users when their programmers' skill set can't be leveraged.

I do hope that Chip does correct the Propeller architecture, say in version III, to be more friendly to C or any HLL compiled language for that matter.

RossH · 2011-03-05 15:51

rod1963 wrote: »

Dreadful performance on the part of the Prop. Worse, C doesn't work well on the Prop, which is a great way to scare off commercial users when their programmers' skill set can't be leveraged.

I do hope that Chip does correct the Propeller architecture, say in version III, to be more friendly to C or any HLL compiled language for that matter.

I think this is missing the point. Any reasonably modern microprocessor can easily outperform the Propeller - or the X___ for that matter. If you need a fast processor you'd be silly to choose either of those chips, since neither is exactly a speed demon. You might choose the Prop as a dedicated peripheral controller for its fantastic I/O capability, its astounding flexibility and its unique degree of determinism. You might choose the X___ for some other dedicated task like audio signal processing. You certainly wouldn't (unless you happen to be humanoido) choose to build a supercomputer out of either!

Heater has already demonstrated that a PC absolutely wipes the floor with both the Prop and the X___ in pure speed. But the point of this thread is not to compare the Prop with other chips - it is to compare the various language implementations on the Prop.

What this thread shows so far is that PASM is fastest, C is next fastest, and other languages (e.g. SPIN) are slower. What a surprise! This is in fact exactly the same result you would see on any micro. The time/space tradeoff on the Propeller is what is most interesting - C is (if I am reading heater's post about fft implemented in PASM correctly) about 8 times slower than PASM in this case, and SPIN is about 40 times slower - but Hub-based C and SPIN programs can be 32k (and XMM based programs can be even larger) whereas PASM programs can only be 2k - i.e. you have a classic time/space tradeoff. Is it a good tradeoff? Well, it depends on your application - it's not exactly the same tradeoff you would see on other chips, but that's because the Propeller architecture is so different. Does this make C unusable - of course not!

We all accept that SPIN is perfectly usable on the Propeller (don't we?). This thread demonstrates (yet again) that C is as well. Complaining about the relative performance of either language compared to PASM is futile, since if the program is larger than 2k (actually 496 instructions) you simply cannot implement it in PASM. LMM PASM is about the closest and fastest you could get, but this would only be slightly faster than C (maybe 20% faster? I hope someone tries this and adds it to the list of benchmarks).

So if an application needs more speed than you can achieve with SPIN then use C. If C is not fast enough then you could get a slight improvement by hand-coding it in LMM PASM. If that is still not fast enough (or you don't want to re-write your application in assembly) then you should probably consider using another chip.

Is this decision making process any different than for any other micro? No.

Ross.

Heater. · 2011-03-05 16:14

RossH,

But the point of this thread is not to compare the Prop with other chips - it is to compare the various language implementations on the Prop

Actually in my introductory post I do specifically mention comparison with other MCUs. Note, not micro-processors. Of course when you do that you are in trouble as you are not comparing like with like. Most MCUs are not multicore. If an AVR turned in an fft_bench result the same as a Prop, that misses the point that it can't do anything else at the same time.

The X devices on the other hand are a fair comparison as they are both based on the idea of multi-threaded/multi-core software solutions with minimal hardware dedicated to peripheral functions. "Software Defined Silicon" as the X guys say.

RossH · 2011-03-05 16:25

Heater. wrote: »

RossH,

Actually in my introductory post I do specifically mention comparison with other MCUs. Note, not micro-processors. Of course when you do that you are in trouble as you are not comparing like with like. Most MCUs are not multicore. If an AVR turned in an fft_bench result the same as a Prop, that misses the point that it can't do anything else at the same time.

The X devices on the other hand are a fair comparison as they are both based on the idea of multi-threaded/multi-core software solutions with minimal hardware dedicated to peripheral functions. "Software Defined Silicon" as the X guys say.

Okay - fair enough. I'm not particularly interested in comparing to other MCUs, but I suppose others may be. I was mainly trying to make the point that rod1963's contention that "C doesn't work well on the Prop" doesn't really make sense. He may as well say that "SPIN doesn't work well on the Prop", when in fact many prop users make quite effective use of it every day. It's a matter of using the right tool for the right job.

Ross.

Ariba · 2011-03-05 19:53

RossH wrote: »

Okay - fair enough. I'm not particularly interested in comparing to other MCUs, but I suppose others may be. I was mainly trying to make the point that rod1963's contention that "C doesn't work well on the Prop" doesn't really make sense. He may as well say that "SPIN doesn't work well on the Prop", when in fact many prop users make quite effective use of it every day. It's a matter of using the right tool for the right job.

Ross.

The difference regarding Spin and C is: Spin is known as an interpreted language (even if that is not exactly true).
But people expect C to be compiled to machine code and to reach nearly the performance of native assembly code. That is not possible on the Propeller, because of its architecture. I think that is what people (and I) mean by "C doesn't work well on the Prop". There are different expectations.

If you don't expect C on the Propeller to work like it does on other processors, and just compare the available languages for the Propeller, then Catalina C is one of the best solutions to program a Propeller!
Perhaps I would use it more if I hadn't made my own Basic to LMM compiler. Here are the results of my compiler. The produced LMM code seems to be faster than Catalina's, but hand-optimized LMM code will still be faster.

fft_bench v1.0 
Freq.    Magnitude 
00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit reversal and butterfly 
run time = 168ms

This is with a 6MHz crystal, with a 5MHz crystal (80MHz clock) the result will be 202 ms.
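The 202 ms figure follows from simple clock scaling. A small sketch, assuming the run time is inversely proportional to the system clock and that the 6MHz crystal uses the same 16x multiplier as the 5MHz one (so 96MHz; that multiplier is inferred from the 5MHz/80MHz pairing above, and the function name is made up for illustration):

```c
/* Scale a measured run time to a different clock, assuming run time is
   inversely proportional to clock frequency. */
double scale_runtime_ms(double measured_ms, double measured_mhz, double target_mhz)
{
    return measured_ms * measured_mhz / target_mhz;
}
```

Under those assumptions, scale_runtime_ms(168.0, 96.0, 80.0) comes out at about 201.6, which rounds to the 202 ms quoted.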

Andy

RossH · 2011-03-05 20:17

Ariba wrote: »
The difference regarding Spin and C is: Spin is known as an interpreted language (even if that is not exactly true).
But people expect C to be compiled to machine code and to reach nearly the performance of native assembly code. That is not possible on the Propeller, because of its architecture. I think that is what people (and I) mean by "C doesn't work well on the Prop". There are different expectations.

If you don't expect C on the Propeller to work like it does on other processors, and just compare the available languages for the Propeller, then Catalina C is one of the best solutions to program a Propeller!
Perhaps I would use it more if I hadn't made my own Basic to LMM compiler. Here are the results of my compiler. The produced LMM code seems to be faster than Catalina's, but hand-optimized LMM code will still be faster.
fft_bench v1.0 
Freq.    Magnitude 
00000000 000001FE
000000C0 000001FF
00000140 000001FF
00000200 000001FF
1024 point bit reversal and butterfly 
run time = 168ms
This is with a 6MHz crystal, with a 5MHz crystal (80MHz clock) the result will be 202 ms.

Andy

202ms! Impressive!

Comparing any high level languages to PASM is simply nonsensical when you can't implement a high level language in PASM (at least not for programs larger than 2k). Anyone who has expectations that such a language will be comparable in speed to PASM has not thought through the implications of the Prop architecture.

The numbers you just posted are the types of numbers we actually should be comparing - and it looks like Catalina needs to lift its game! :frown:

Ross.

Bean · 2011-03-05 20:48

Heater,
I'd like to see a much simpler benchmark program.
Perhaps a bubble sort or a shell sort. Something that doesn't take hours to convert to different languages.

Bean

Ariba · 2011-03-05 21:13

RossH wrote: »

...and it looks like Catalina needs to lift its game!

Not sure if this is possible. My compiler works much like PropBasic, which means no stack for function parameters and local variables. Variables are global and are in fact cog registers. Only arrays and strings are in HubRam. I think this makes it faster. My understanding of C is that it has the variables on the stack and that it has to access them relative to the stackpointer or a datapointer from HubRam, which consumes a lot of time. So I would say Catalina is a bit slower but more advanced.

Andy

RossH · 2011-03-05 21:40

Ariba wrote: »

Not sure if this is possible. My compiler works much like PropBasic, which means no stack for function parameters and local variables. Variables are global and are in fact cog registers. Only arrays and strings are in HubRam. I think this makes it faster. My understanding of C is that it has the variables on the stack and that it has to access them relative to the stackpointer or a datapointer from HubRam, which consumes a lot of time. So I would say Catalina is a bit slower but more advanced.

Andy

Hi Andy,

Thanks for the technical info. Yes, it sounds like this will always be faster than Catalina provided the program can fit all its variables in the available cog space. This would be equivalent to Catalina being benchmarked against Zog where Catalina can fit all the local variables in registers (which are cog variables) but Zog has to keep them on the stack.

This just highlights the problems typical of such small benchmark programs. Maybe we need another larger benchmark as well.

Still, there's no harm in having a bit of competition, and this gives Catalina a reasonable target to shoot for (at least for smaller programs!)

Ross.

Heater. · 2011-03-06 05:10

Ariba,

That is an impressive result from your BASIC compiler. Sounds like your approach with variables in COG etc. is what I had in mind for a version of my TINY compiler (see Jack Crenshaw). I have been dreaming about that kind of thing with a more C or Pascal like syntax. Not that I have the time or skill to pull it off this side of eternity.

Bean,

Looking at it, fft_bench does seem a bit large. But it's only 70-odd lines for the functions that have to be timed. The rest one does not have to be so careful about translating. Thing is, I only put this up as a benchmark as I had already got C and Spin versions of the same thing after the great FFT debate. Of course additional benchmarks would be nice, if you can do a sort for us...

Rod1963,

I do hope that Chip does correct the Propeller architecture, say in version III, to be more friendly to C or any HLL compiled language for that matter.

That statement rather worries me. It makes the assumptions that the Prop needs "correcting" and that C is the be-all and end-all of a chip's existence. It worries me because if the Prop is "fixed" enough it will no longer have the qualities of the Prop, it will have mutated into just another MCU with all the complications that involves. Might as well give up and use an X... I am rather hoping that the sheer elegance and simplicity of the Prop and its gorgeously easy assembly language are not compromised just to run C.

Heater. · 2011-03-06 05:53

OK. Breaking my own rule about not making comparisons to micro-processors, only MCUs.
I could not resist running fft_bench on the IGEPv2 board sitting in front of me. It has an ARM Cortex-A8 core running at a gigahertz or so. The whole board is about the size of a Gadget Gangster Propeller card and costs 150 Euro.

Result: 1ms !!

jazzed · 2011-03-06 07:48

Heater. wrote: »

I could not resist running fft_bench on the IGEPv2 board sitting in front of me. It has an ARM CORTEX A8 core ....

Impressive board all around!!!

That definitely provides perspective on what people (me included) should probably not be doing with Propeller for the same price. It makes me feel very silly. Did you try running fft_bench with Dave's spinsim on it?

Dave Hein · 2011-03-06 09:30

jazzed wrote: »

That definitely provides perspective on what people (me included) should probably not be doing with Propeller for the same price. It makes me feel very silly. Did you try running fft_bench with Dave's spinsim on it?

I'm guessing it would take more than 3 seconds under spinsim on that processor based on the results I got on my PC. Of course, he could run it with the -p option where it uses the PASM interpreter in ROM. That would probably take a minute or two to run.

The Prop is not a DSP. It has never been promoted as one, and it clearly can't compete on DSP algorithms against any processor that has a hardware multiplier. Now the Prop 2 running PASM code will be a little more interesting for DSP benchmarks. However, it still will not be much better than a TI 32010 DSP from 25 years ago unless it can break the algorithm up into 8 parallel pieces.

EDIT: I did a search on the TMS32010, and it was substantially slower than I remember it to be. It took 42 msecs to do a 1024-point FFT. I can't recall seeing the execution time for the Prop PASM version, but Heater's earlier comment said it was 10 times slower than 4.5 msecs, so that makes it about the same as the 32010. The Prop 2 will be much faster than this.

RossH · 2011-03-06 14:56

jazzed wrote: »

Impressive board all around!!!

That definitely provides perspective on what people (me included) should probably not be doing with Propeller for the same price. It makes me feel very silly.

Yes, I agree it is an impressive board - but if you are going to insist on comparing a Prop against a single board computer where just the processor chip appears to cost around 10 times the price of a Propeller chip, then of course you are going to end up feeling silly - because it's a silly thing to do!

Ross.

koehler · 2011-03-06 15:34

Heater, thanks for taking the time to do the comparison. Hopefully some of the discussions from the past couple of weeks now make a bit more sense.
This to me seems like a small earthquake in the Prop expectations arena. Who would have thunk it that Basic to LMM would be faster than Catalina? This is most excellent for the community as a whole to drive advancements, and a heads up to Parallax that they really may need to prioritize things such as C, if they want to make a dent with the professional market. Not that they aren't already doing so, but now it is even more evidently a need.

Now maybe someone will try running the FFT under Forth?

jazzed · 2011-03-06 16:07

RossH wrote: »

Yes, I agree it is an impressive board - but if you are going to insist on comparing a Prop against a single board computer where just the processor chip appears to cost around 10 times the price of a Propeller chip, then of course you are going to end up feeling silly - because it's a silly thing to do!

Ross.

Where exactly do you see those price comparisons? According to Heater the whole board is 150 euro.

A Propeller board with the same connectors and 1/10th the performance (if that much) would cost about the same. Of course being a Propeller fan makes up for the disparities :-)
