Heater's Fast Fourier Transform.

Heater. · 2010-12-28 00:37

Presenting my implementation of the Fast Fourier Transform for the Propeller.

Spec:

1) Contains both Spin and PASM FFTs. Use BST and set some defines to select which is compiled. Or Use the Prop Tool and manually edit out the version you don't want. Just look for "#define..." and "#ifdef..." to see what is what.

2) Performs an FFT on 1024 complex samples. Just set the input signal into array "bx" and set all of array "by" to zero. Output spectrum is the first 512 complex samples in "bx" and "by". The demo/test harness shows how to get real spectrum magnitudes from this.

3) Designed using fixed point arithmetic. Input signal samples should be -4096 to + 4095. Could probably be changed to full 16 bit but I think limiting bits gives a little speed advantage.

4) 1024 point FFT in PASM takes 41ms on an 80MHz Prop. EDIT: Oops...When fed with some random noise rather than a few simple tones execution speed drops to 63ms. The multiply loop I am using has an execution time dependent on data content.

5) Included demo shows how to use it. Generates a few sinusoids at various frequencies as input. Prints the resulting spectrum magnitudes vs frequency.

Background:

This is the happy result of the interesting discussion on the "Fourier Transform For Dummies" thread. This code has been created straight from the mathematical FFT definition and Cooley-Tukey idea. It is not a "port" of any existing FFT source code. Rather it was a challenge to myself to gain understanding of Cooley-Tukey and create my own code without "peeking". As such its accuracy may still be questionable. Initial tests look good though.

A big thank you to Dave Hein for clearing up some details of my understanding of Cooley-Tukey during the "FTFD" thread.

Future:

This is using it's own sin/cos tables and is therefore wasteful of space. I may take the Prop ROM tables into use. So far I have failed to get that right when I tried.

I might extend the fixed point arithmetic to a full 16 bits. Given the nature of most real world inputs this mat not be worth it.

If PASM gurus want to take the challenge of optimizing this it would be great. Just now the PASM is a direct translation of the original Spin version with out much thought to optimization.

What's it for?

Just an academic exercise for me so far.
If it had been ready in time it would have made a "sound to light" unit for my Christmas lights:)

Edit: 2011-01-14 Attached is heater_fft version 1.0, now 25% faster!

Edit 2011-01-14 Attached is heater_fft verion 1.1, yet another 20% or so faster than v1.0. Thanks due to Lonesock for contributing this optimization. 1024 point FFT is down to 25ms with the test data in use.

Edit:: 2011-01-21 Attached is heater_fft version 2.0. Bumped a whole version as the application interface has changed.
This version has fast PASM bit-reversal and magnitude calculation stages.
About 31ms for the entire 1024 FFT and magnitude conversion.
See instructions in heater_fft.spin.

Edit: 2011-01-25 Attached is heater_fft version 2.1
This version makes use of the Props PROM Trig tables thus saving a lot of space.
Sadly it means sacrificing 10% of performance. Down to 34ms on the test demo test data.

RickB · 2010-12-28 15:40

Is this with 1 cog only?

Here are the results with a single core avr at 16 MHz. Written by ChaN.

; 16bit fixed-point FFT performance with MegaAVRs
; (Running at 16MHz/internal SRAM)
;
; Points: Input, Execute, Output, Total: Throughput
; 64pts: .17ms, 2.0ms, 1.2ms, 3.4ms: 19.0kpps
; 128pts: .33ms, 4.6ms, 2.4ms, 7.3ms: 17.5kpps
; 256pts: .66ms, 10.4ms, 4.9ms, 15.9ms: 16.1kpps
; 512pts: 1.3ms, 23.2ms, 9.7ms, 34.2ms: 14.9kpps
; 1024pts: 2.7ms, 51.7ms, 19.4ms, 73.7ms: 13.9kpps

Rick :-)

Heater. · 2010-12-28 17:20

RickB,
That time is for one COG only. It does not include input/decimation or output, only the actual butterfly calculations.

Splitting the thing over two or four COGs would be quite easy.

That AVR code is doing very well.

ericball · 2010-12-28 17:46

The advantage of including your own sine table is you can make it 32 bits and scale it by 2^n rather than 16 bits scaled by 65535.

Heater. · 2010-12-28 23:10

ericball,

I'm not with you. Given that at max we are doing 16 bit arithmetic on 16 bit samples (with 32 bit intermediate results) what is the advantage to having more than 16 bits in the sine tables?

The advantages of having custom tables here is:

1) It's a bit faster than using the ROM tables. I want values of cos and -sin over 180 degrees in 512 steps. Currently scaled to a range of +4095 to -4096. It's quicker to look them up directly.

2) It works. Twice tried to use the PROM tables and the getsin/getcos routines from the Prop manual. Twice I could not get it to work correctly. Just something stupid I did I'm sure but I gave up.

lonesock · 2011-01-03 07:18

Hi, Heater.

Nice work!

Here's a slightly faster version (runs in about 38 ms), mostly due to massaging the code a bit to optimize the hub accesses. I made some tiny changes to the demo harness (only printing non-zero values to the table so I could check by eye that I didn't break anything, rounding to the nearest ms and printing raw clock counts, and the 'wait for done' Spin code is a bit faster too). Unfortunately, these are all little optimizations which will break if you want to change the algorithm, but it was fun to do [8^)

Jonathan

P.S. Btw, for eventual porting to PASM, I always find it helpful to use decrementing loops.

Heater. · 2011-01-04 03:48

Lonesock,

Thank you. I'm still totally amazed at myself that I managed to create an FFT from scratch, if you have followed the story on "Fourier For Dummies" you will see that the workings of the Cooley-Tukey FFT have taken me over twenty years to get a handle on:)

A quick diff shows you have been all over that PASM. I had realized that we were not doing very well on hitting the HUB "sweet spots" and the thing about counting loops downwards only occurred to me when I realized that djnz would be useful.

As it stands we don't seem to get much of a speed up with these tricks though so I'm leaving my PASM as it is, matching the Spin version. Perhaps I'll rewrite the Spin with backward loops and then use matching djnz in the PASM at some point.

I was kind of wondering if my recursive implementation (only exists in C/C++ just now) migh be quicker, on the face of it it has less overheads managing all the loops traded against the overheads of recursive calls. But I'm thinking that with a stack in COG that need not be so great.

VIRAND · 2011-01-10 13:30

Can anyone use this FFT to make a spectrum analyzer demo that uses the demo board mic
and looks something like the microphone_to_vga object?

Like a supersized graphic equalizer display?

Heater. · 2011-01-10 17:03

VIRAND,

Can anyone use this FFT to make a spectrum analyzer demo that uses the demo board mic .... Like a supersized graphic equalizer display?

Yes indeed that is possible and I'd love to see someone do that. Or anything else with it for that matter. I should put an MIT licence statement on it.

You should also check out Ale's spectrum analyser demo with his version of a 1024 point FFT

http://forums.parallax.com/showthread.php?121433-Real-Time-FFT-Is-the-Prop-up-to-it

and

http://propeller.wikispaces.com/FFT

VIRAND · 2011-01-10 21:20

I've tried to get Ale's to work with the mic but he uses an ADC chip I don't have,
and all I've gotten so far is either a blank screen or a thumb shape with the letters FFT on it,
without any bars. I couldn't tell if it was doing anything or hung up.

I think I'll try with yours, but I can't promise anything now.
Hmm. I'm not typing fast enough. I get messages saying I'm not logged in and assume I timed out.

Heater. · 2011-01-11 01:32

VIRAND,

OK, That's great.

What frequency range are you hoping to cover? What display VGA or TV?

Initially I would assume that heater_fft is only good for 10 transforms per second. So working in "real time" is only good for 10K samples per second or 5KHz audio max. I suspect it can do a bit better than that though.

But there is no reason you can't sample for a while at a higher frequency and then when you have 1024 samples do a transform and display the result. There may just be gaps in the input that is sampled.

Trick would be to take 1024 samples in parallel with transforming the previous batch in parallel displaying the batch before that!

Let me know if you need any help or anything changing in the FFT interface.

Cluso99 · 2011-01-11 01:45

Sorry heater - I don't dare look at your code as I may lose myself for a while....
firstly by understanding fft - I never did really understand it (40 years ago)
secondly by trying to increase the performance (this is like raising a red flag to a bull for me)

But a prop spectrum analyser is a great device - way to go Ale.

Heater. · 2011-01-11 02:07

Cluso,

Sorry heater - I don't dare look at your code as I may lose myself for a while....

Yep, contemplating FFT has consumed far too much of my concentration for weeks now.

firstly by understanding fft - I never did really understand it (40 years ago)

Same here, as you have read. But hopefully heater FFT is usable without having to understand the nittygritty of FFT.

secondly by trying to increase the performance

Shame, heater_fft has a lot of scope for improvement as it was written for clarity not speed.

But a prop spectrum analyser is a great device - way to go Ale.

Yep, Ale did a great job there.

lonesock · 2011-01-11 09:06

The code as it stands could be easily sped up by special-casing the 1st pass (where cosine => 1, and sine => 0).

Jonathan

Heater. · 2011-01-11 10:35

Lonesock,

True enough. Hold that thought...

I was just contemplating the fact that currently heater_fft maintains a bunch of indices, b0, b1 and wIndex which get used to access arrays bx, by, wx, wy,. This involves a lot of fiddling around with index and hub_ptr.

Seems to me things would go much faster if instead it just maintained a bunch of pointers like: b0x_hub_ptr, b0y_hub_ptr, b1x_hub_ptr, b1y_hub_ptr, wx_hub_ptr, wy_hub_ptr.
These pointers would be initialized at appropriate places and thereafter just "cranked up and down" by appropriate amounts to get the required array elements.
That would save a lot of index moving, shifting, adding and all that loading of hub_ptr.

Then of course there was your idea count loops downward and use djnz.

Heater. · 2011-01-13 15:23

heater_fft verion 1.0 is attached to the first post.

For V1.0 I have totally rewritten the Spin butterfly method and the PASM implementation to match. They both now use pointers into the data arrays rather than using array indexing. Result is that the PASM version is about 25% faster. I'm seeing 30ms and 56ms where previously took 40ms and 70+ms for a 1024 point transform. Remember execution speed is somewhat dependent on the data due to the optimized multiply it uses.

Now to look at Lonesock's suggested optimization...

lonesock · 2011-01-13 17:11

Here it is. The very first iteration of bloop always has c=1, d=0, so you can drop the multiplication...modified code attached (25ms).

Note: there is one more optimization to be made...halfway through each butterfly c=0,d=-1, so you could do the same thing there, with an addition bit of code, and a cmp + jmp.

Jonathan

Heater. · 2011-01-14 00:34

Lonesock,

Where are you? We seem to be working alternate shifts:)

I just had a restless night because I realized that your suggestion to short circuit the multiplies on the first level would be better done by short circuiting for all the first iterations of the butterfly loop on all levels. Exactly as you have done. I was just too tired to continue.

Anyway this is brilliant, the fft of the test data is now down to 25ms from the original 40ms and the 32ms of version 1.0.

The PASM but longer matches the Spin but that's OK. The Spin is there as a reference for the curious.

As for the similar optimization when halfway through each butterfly what's stopping you

I have to run to work today.

Heater. · 2011-01-14 02:50

I just added heater_fft version 1.1 incorporating Lonesock's optimization to the first post.

25ms per 1024 point FFT and still room for improvement!

Sapieha · 2011-01-14 02:56

Hi Heater.

Nice progress

lonesock · 2011-01-14 09:03

Hi, Heater.

I'm on California time [8^)

Here's the latest version with 3 main changes:

1 - I added a counter to the Spin version to see how many times the optimization kicked in:
....The initial optimization (c=1,d=0) is used ~1024 times, and the new optimization (c=0,d=-1) is used ~512 times

2 - I cleaned up my initial optimization (I had duplicated too much code, and now I also weave the math btwn hub accesses better)

3 - the new optimization is in place...diminishing returns (down to 23 ms), but looking at the exact clock counts, it's still an 8% improvement, and the absolute improvement will be larger if you use > 14-bit numbers for your sin/cos table, or if the user uses larger numbers to be crunched. And it only cost about 8 lines of PASM [8^)

Jonathan

Cluso99 · 2011-01-14 18:59

Quickly approaching 0ms

Heater. · 2011-01-21 13:55

I have attached heater_fft version 2.0 to the first post.

I bumped a whole version as the application interface has changed.
This version includes Lonesock second optimizations and has fast PASM bit-reversal and magnitude calculation stages.
About 31ms for the entire 1024 FFT (bit-reversal, butterfly and magnitude conversion on the demo test data)
See instructions in heater_fft.spin.

siljamicke · 2011-01-22 06:22

@Heater
I'm glad i triggered this excellent program into existance. I'm sorry if me starting my thread got you occupied the last days, but perhaps you can put this baby to rest now, considering you been spending many years trying to understand FFT.
Now i wish i could do the same...

Heater. · 2011-01-22 07:36

Siljamicke,

I'm glad i triggered this excellent program into existance.

Actually it's not your fault.

You see Bean wrote a wonderful document "CORDIC For Dummies" which explained how the CORDIC algorithms work. Complete with Spin code. This is another thing that had eluded my mind for many years so I was very glad to see it. Anyway I believe it was me who then asked for a "Fourier For Dummies" from Bean, wondering if he could do a similar job. Which he has done to some extent.

Well, one thing leads to another and with pushing from Prof Braino and others here we are:)

...perhaps you can put this baby to rest now...

Not quite. No Fourier transform code is complete until it can perform the inverse transform. That is to go from a list of frequency amplitude values to the equivalent signal time series. That is we should be able to synthesis waves out of the frequency components. I have an idea how to do this in a simple way.

Then, after all this effort I should use the FFT in some practical demo. Even if it is just a disco style "Sound To Light" unit from a Prop.

Now i wish i could do the same...

I'm still pondering how to write that document for you and others.

Heater. · 2011-01-22 09:25

I've just been doing a little testing with this.

For example, setting as an input a square wave of amplitude 1023 with frequency 64 it gives the following amplitudes for the fundamental and harmonics:

Freq.   Amplitude   Expected
-----   ---------   --------
   64        1310    1310.93
  192         459     460.33
  320         307     307.58
  448         260     260.76

Where "expected" is the result produced by an FFT on my PC. Not bad hey?

BUT...

I have a little problem thinking about overflow conditions. What actual range of input can we use?

As it stands this fft uses sin and cos values for the "twiddle" factors that are in the range +/-4095. I seems to tolerate input signals up to about +/-1200, for example that square wave.

It seems to be dependent on the shape of the input signal. For example two sines at +/.1023 giving peaks at 2046 seem to work fine.

Anyone have any ideas on this?

Should I reduce the sin/cos resolution.
Does my magnitude calc overflow?
Does it matter?

We seem to be well short of dealing with 16 bit "twiddle factors" and 16 bit signals.

Heater. · 2011-01-25 06:53

New version 2.1 of heater_fft is attached to the first post.

This version makes use of the Props PROM Trig tables thus saving a lot of space. Sadly it means sacrificing 10% of performance. Down to 34ms on the test demo test data.

Unless someone has any neat optimizations this is probably the last version.

Is this thing ready for OBEX?

Jay Kickliter · 2011-03-14 10:03

I read the posts, and skimmed the code, and I'm still not sure if the results give you phase in addition to magnitude? I need to measure power factor where simple peak lag timing might not be enough.

Heater. · 2011-03-15 02:10

Jay,

I read the posts, and skimmed the code..

Brave man:)

...I'm still not sure if the results give you phase in addition to magnitude?

Oh yes. With a little rearrangement.

I need to measure power factor where simple peak lag timing might not be enough.

OK. Sounds quite doable. You might want to elaborate on what exactly you want to do but I'll make some assumptions to get started.

Let's say we have an input signal derived from the mains voltage. That is a sinusoidal wave. It's frequency is 50Hz (because I'm in Europe). It has been sampled with an ADC at a rate of 1024 samples per second. The samples have a range of, say, +/-255.

This might be a bit hard to explain without diagrams but let's try. Thing is whilst the magnitude and phase of the signal is encoded in the output of the FFT it takes a bit of finding:)

Just for a moment I'm going to assume this signal has been sampled in such a way that it starts at it's maximum positive value in the buffer. That is to say it's always in a cos phase relationship in the sample buffer.

That is the so called "real" part of the signal, in buffer bx[] in heater_fft. The "imaginary" part we set to all zeros, buffer by[].

Now, if we take the FFT of this signal (Forget the magnitude calculation for now) what do we get?

The result should be all zeros in the bx[] and by[] arrays except for a peak at bx[50] representing the 50Hz input frequency. (Of course the other elements will have some small values present due to noise). Because our signal is in the "cos" phase the imaginary part of the output, by[50] should be zero.

If we were to change the phase relationship of the sample buffer to the signal such that the samples start at amplitude zero and then proceed downwards, i.e. we now see the sin phase, we should then see the output of the FFT having a peak value at by[50] with bx[50] being zero.

Brilliant, we can detect the phase of the signal by looking at bx[50] and by[50].

The phase is given by:

phase = atan(img / real);

or for the heater_fft looking at 50Hz:

phase = atan(by[50] / bx[50]

OK that's fine but it assumes we have the sampling phase locked to the incomming signal. That's not very nice.

Now you mentioned power factor, so I assume we have another signal coming in that is a measure of the current sampled at the same rate as the voltage, and we want to get the phase relationship with the voltage.

One way to do this is to make two transforms, one for the voltage and one for the current signal. Then use the phase formula above to get the phases of each signal. Then take the difference between those two phases. Done.

BUT there is an easier way:

Put the voltage samples into the "real" part, buffer bx, and put the current samples into the "imaginary" part, buffer by. Now when you take the FFT the the phase difference between current and voltage is given directly by the phase formula.

How to do this with heater_fft?

The fft.butterflies method take a command parameter that has three bits that can cause three operations to occur:
CMD_DECIMATE - Causes the sample buffers to be reordered such that the algorithm works.
CMD_BUTTERFLY - The actual FFT work producing the real and imaginary results in bx[] and by[].
CMD_MAGNITUDE - The magnitude calc.

If we call butterflies with only the decimate and butterfly options the real and imag results will be left in the buffers which can then be used in the phase formula.

Whoa, there's a lot here. I have to try some experiments with this to make sure it is all true:)

Jay Kickliter · 2011-03-15 15:58

Thanks for the tip. I'm now getting calculating power-factor with the Prop that are spot on with the Kill-a-Watt. For all I know, mine might be better since I bet the KaW just does peak timing.

Heater. · 2011-03-16 07:07

Jay,

Wow, that was quick.

I have been thinking about this a bit. Again no idea what your final aims are but here goes:

1) I seem to recall that measuring determining power factor from phase shift is not going to be so accurate if the load is non-linear. For example if it has a half-wave rectifier in it.

2) If we are just looking at the fundamental frequency, 50Hz or 60Hz, what is the effect of having a distortion in the current or voltage. A lot of the power is then in the harmonics and our measurement on the fundamental may be flawed.

These two are related of course.

Imagine, a distortion free sinusoidal voltage is driving a load through a diode rectifier. Clearly the current waveform is grossly distorted. A lot of it now shows up as harmonics. Those harmonics may have different phase shifts than the fundamental. I have no idea how to deal with this.

Heater's Fast Fourier Transform.

Comments