fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...

RossH · 2011-04-08 05:47

Heater. wrote: »

... Compare to the fft_bench in Catalina C on the Prop at approx 400ms ...

... As we see C on the Prop sucks by a factor of about 20 !!! ...

... Support for C on the Prop is poor.

You probably won't be at all surprised to hear that I completely disagree.

Let's look at this another way ...

The Prop has eight cores, and for this benchmark Catalina is using just one of them to achieve a time of ~400ms. But Catalina can just as happily use all eight cores as easily as just one - so if we were executing an application that could be distributed across all the cores , then the prop time would come down by a factor of eight (i.e. to ~50ms) - but since the ARM is single core, it's time would remain the same.

And the prop less than one quarter the price of the ARM.

And the prop is clocked at less than half the clock speed of the ARM.

So what this means is ... for applications that can take advantage of the Prop's symmetric multiprocessing capabilities, then the Prop is a very cost-effective solution.

This is why I disute the fact that C support on the Prop is poor. It is more of a case that our ability to use the Prop's capabilities in any language is poor - i.e. most of us use the Prop for applications that do not take advantage of the Prop's multiprocessing capabilites.

But you can continue to blame C for your own deficiencies if it makes you feel better.

Ross.

Heater. · 2011-04-08 06:30

RossH,

You probably won't be at all surprised to hear that I completely disagree.

Thank God for that:)

Let's look at this another way ...

If we must.

The Prop has eight cores, and for this benchmark Catalina is using just one of them to achieve a time of ~400ms. But Catalina can just as happily use all eight cores as easily as just one - so if we were executing an application that could be distributed across all the cores , then the prop time would come down by a factor of eight (i.e. to ~50ms) - but since the ARM is single core, it's time would remain the same.

Now that's a sneaky argument. Yes the prop has 8 cores, of course all those C monkeys out there expect their MCU to have ACD, UART, SPI etc etc built in. So in general most of those cores are not available for top level application code but busy being "soft" peripherals.
Besides you slipped that little word "could" in there. The FFT is easily split 2, 4, 8 or more ways most
apps are not.

And the prop less than one quarter the price of the ARM.

True, If I allow the Prop a further head start by dividing the ARM speed by 4 to compensate for the price difference then the Prop is still behind by a factor 5.

And the prop is clocked at less than half the clock speed of the ARM.

Read again, I already accounted for that.

So what this means is ... for applications that can take advantage of the Prop's symmetric multiprocessing capabilities, then the Prop is a very cost-effective solution.

True but most apps can't be parallelized so easily besides as I said for any real world app there are not so many free cores available.

This is why I disute the fact that C support on the Prop is poor. It is more of a case that our ability to use the Prop's capabilities in any language is poor - i.e. most of us use the Prop for applications that do not take advantage of the Prop's multiprocessing capabilites.

Except in odd cases the Prop has no multi-processing capabilities. The general model is:
a) Slow top level app code
b) Surrounded by faster time-critical soft peripherals.

But you can continue to blame C for your own deficiencies if it makes you feel better.

I will, it does:)

To put things on a more level footing I've just discovered I can get an entire 25MHz ARM Cortex M3 MCU dev kit for 9 Euros locally.
So I'll soon be pitting Catalina (and Zog) against that.

davidsaunders · 2011-04-08 21:25

Ok. Most ARM chips run at an average of 1.2MIPS, so you would have to go better than that and get it down to 20MIPS in order to compare its performance against a single Propeller COG. So giving the ARM the benefit of the doubt, devide its score by 10 (18MHz), and we have 70ms. Which is still 5.7 times as fast as the Prop, To improve the comparison, do a benchmark that only uses integer math, we are not trying to compare Apples to Nuclear power plants here.

Heater. · 2011-04-09 00:18

davidsaunders

Most ARM chips run at an average of 1.2MIPS

What?

There is a vast range of ARM chips.
The embedded systems I have been working on in recent years have an ARM920T running at 200MHz which Linux reports as 90 "bogomips".
Our new ARM boards are running at 1GHz.
The worlds mobile phones are using ARMs somewhere in between.
The chip I was proposing as a comparison to the Prop runs at 25Mhz and a complete development board is available for 8 Euros and 91 cents. https://www.elfa.se/elfa3~eu_en/elfa/init.do?item=73-872-39&toc=20989

No wonder the Prop has an up hill struggle.

The challenge around here seems to be:
a) Find a chip of a similar price to compare the Prop against.
b) Find a chip of a similar clock rate to compare the Prop against.

Which is fair enough.

Of course comparing a Prop to a regular MCU is about as futile as comparing an SN7400 to an
Intel core 2 duo but that's not going to stop us:)

To improve the comparison, do a benchmark that only uses integer math, we are not trying to compare Apples to Nuclear power plants here.

The fft_bench is an integer only benchmark.

davidsaunders · 2011-04-09 07:32

I feel that it would be an interesting experiment to create an 68000 emulation on the Propeller, It should be possible to get close to the 1MIPS of the original, though I would not use C.

heater wrote:

The fft_bench is an integer only benchmark.

Sorry about that, I had not looked at the code till just now (primarily because I have not attempted to use a C compiler with the Propeller). Is there any way you could trim that down to fit entirely in a cog (this would be a much better test).

martinh · 2011-04-09 07:40

davidsaunders wrote: »

Is there any way you could trim that down to fit entirely in a cog (this would be a much better test).

Seems to be here:
Heater-s-Fast-Fourier-Transform

davidsaunders · 2011-04-09 08:06

martinh:
By entirely I mean, including all data, no hub access at all.

Heater. · 2011-04-09 08:29

I don't recall how much COG space the PASM FFT takes but I guess one could put the data in COG as well if only a small data set is required. I though 1024 elements was about the minimum to be useful.
Then there is the problem of the sin/cos tables. One could calculate those values on the fly I guess but that would be much slower.

I might leave that as an exercise for the reader:)

martinh · 2011-04-09 08:40

davidsaunders wrote: »

martinh:
By entirely I mean, including all data, no hub access at all.

That will be an useless excercise just to do an artificial benchmark. You cannot solve a real world problem with 2K RAM if you want to store the data structures in the cog.

davidsaunders · 2011-04-09 08:52

martinh wrote:

That will be an useless excercise just to do an artificial benchmark. You cannot solve a real world problem with 2K RAM if you want to store the data structures in the cog.

True, though for many things most of the processing time takes place between hub accesses, and Heaters bench mark has to access HUB mem to often to be representative. Perhaps a better balance, something that mostly works in COG mem, with occasional HUB accesses (maybe once every 32nd HUB window). I do not know what form of algorithm would qualify, that would be equal on other processors. I was going to recommend that someone that has the time implement a DHRYSTONE benchmark in C that fits entirely in COG mem, though that does not address the issue of balance.

I would say that for this purpose it is best to view COG mem as a kind of CPU cache.

Perhaps due Heaters FFT Bench, and a DHRYSTONE, and average the resaults?

Heater. · 2011-04-09 08:56

Like a said I don't recall the COG usage of the FFT but lets say there were 256 LONGs free. You need two longs per data sample, real part and imaginary part.
So already you are limited to 128 samples.
If you want sin and cos in COG as well you need 96 more LONGs. No go.
So then you are down to 64 samples and 48 table entries. Not much use.
Or leave the sin/cos in HUB, they are in the PROM anyway.
I'm not so sure the speed gains would be that great anyway.

davidsaunders · 2011-04-09 08:58

Also I do think that the PropII will likely have performance very close to a similarly clocked ARM (depending on the core used in the ARM) with a single COG.

Heater. · 2011-04-09 09:02

The PASM FFT has most of it's HUB accesses spaced such that the is not time lost waiting for the HUB access window.
I'm pretty sure having data in COG would not speed things very much.

martinh · 2011-04-09 09:07

davidsaunders wrote: »

Perhaps a better balance, something that mostly works in COG mem, with occasional HUB accesses (maybe once every 32nd HUB window).

That is of course a valid argument to treat the cog mem as something similar to a L1 cache (ok, this is a bit a weird comparism and wording).

davidsaunders wrote: »

I do not know what form of algorithm would qualify, that would be equal on other processors. I was going to recommend that someone that has the time implement a DHRYSTONE benchmark in C that fits entirely in COG mem, though that does not address the issue of balance.

You cannot do that with C as it is now (don't know about ICC, but catalina cannot). There is at the moment no option to compile something like a "pure C cog function" everything is done with the large memory model. You can include something which was written in PASM with a simple trick (it is then converted to a C array), but this would be cheating.

By the way that would be a great enhancement to catalina to have this possibility (PropBasic can do it) or to have inline pasm in C or both. No idea if something like that is planned for the future (I did not really search if there is a discussion about that).

davidsaunders · 2011-04-09 09:21

Heater:
Yes your PASM FFT is quite a bit faster, though it would not compile very well on other MCUs

.

RossH · 2011-04-09 16:44

martinh wrote: »

By the way that would be a great enhancement to catalina to have this possibility (PropBasic can do it) or to have inline pasm in C or both. No idea if something like that is planned for the future (I did not really search if there is a discussion about that).

Inline PASM is something I considered, but eventually decided against. Being able to call PASM functions is a much more elegant solution - especially given that Catalina uses register passing for function arguments, and that the Catalina Optimizer can eliminate the overhead of the function call by inlining the code itself. So in most cases there is no overhead to doing things properly.

Being able to generate stand-alone cog programs is not currently planned for Catalina. Apart from giving better results on small benchmark programs, I can't see much use for it. However, if enough others feel there is a call for such a thing, I'm sure someone will attempt it (but not me - I'm not quite that masochistic!).

I have suggested previously that someone might instead like to take the tiny-C compiler (the original 600 line version, not the more recent bloatware

) and use it to generate a compiler for a C-like language specifically for cog programming. This would be a reasonably simple exercise, and would achieve the desired outcome - i.e. the ability to write cog programs in a slightly higher level language than PASM.

Ross.

martinh · 2011-04-09 17:15

RossH wrote: »

Inline PASM is something I considered, but eventually decided against. Being able to call PASM functions is a much more elegant solution - especially given that Catalina uses register passing for function arguments, and that the Catalina Optimizer can eliminate the overhead of the function call by inlining the code itself. So in most cases there is no overhead to doing things properly.

Thanks for the info. That refers to writing some object files by hand? Fair enough, it is easy to do (and you are right it is more elegant from its syntax than some weird inlined code). I simply forgot about that possibility.

RossH wrote: »

Being able to generate stand-alone cog programs is not currently planned for Catalina. Apart from giving better results on small benchmark programs, I can't see much use for it. However, if enough others feel there is a call for such a thing, I'm sure someone will attempt it (but not me - I'm not quite that masochistic!).

My thoughts went more in the direction being able to write a driver or library (like F32, ok that is bad example because that is integrated in catalina) in a high level language like C, not some benchmarks, I can imagine that it is really weird to make the compiler produce pasm output which fits in a cog.

CarlJacobs · 2011-07-20 03:33

After recently posting some stuff in the post http://forums.parallax.com/showthread.php?132995-Empty-Loop-Benchmarks. I decided to have a look at a forth based FFT implementation.

I've attached my source, which compiles (under JDForth) to a total 13620 bytes of code+data. It calculates a 1024 FFT in 1200ms when using pure forth and as fast as 272ms with just the innermost butterfly written as a PASM based forth word. Both versions are included in the attached file with only a minor change required to switch between the two versions.

I wasn't very happy with the speed of the forth-only version, but when analysing it discovered that the majority of the time was spent in getting and setting the values of the local variables. This is something that C compilers do well, but generally indicates that the implementation is not very forth friendly. I considered re-writing the algorithm in a more forth-friendly manner, but don't really have the time to figure out exactly what's happening with the algorithm.

The attached code includes my usual approach to this sort of problem - i.e. write a solution that works in forth, and then convert the minimum amount to PASM to achieve the best overall result.

Enjoy

BTW: I had to add the .txt extension in order to allow it to be uploaded.

Heater. · 2011-07-20 04:54

Very cool. All the unreadability of ASM plus the unreadability of Forth in a single file:)

CarlJacobs · 2011-07-20 05:02

Heater. wrote: »

Very cool. All the unreadability of ASM plus the unreadability of Forth in a single file:)

I agree (somewhat). The un-syntax-highlighted text is hard to read. It doesn't look quite as bad from within JDForth.

Heater. · 2016-03-27 12:17

It's Easter Sunday, got bored, found myself fishing for all the implementations of fft_bench in different languages here. Added them to the fft_bench repository: https://github.com/ZiCog/fftbench

Thanks to everyone who contributed.

Could do with someone contributing instructions as to how to run the BASIC, Forth etc versions, along with how to install and run the compilers.

fft_bench - An MCU benchmark using a simple FFT algorithm in Spin, C, and ...

Comments