@cgracey said:
Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.
By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.
I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instructions, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways when trying to optimize it. It's slowing the whole FFT down.
Comments
@SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
Sorry if I’m being slow…
More than twice as fast, so far. Mainly from using the FIFO for data fetches and unrolling the inner loop to utilise the Cordic pipeline.
I posted a list of optimizations. I could have forgotten some.
For the same clock frequency, Chip's posted code was 3400-3600 uS. (It's not that variable, I just don't remember precisely and can't test it right now. It would be a bit faster than the inline version.) My FFT alone was 1900 uS. Then I added in bit reversal for comparison to fftbench. The alignment does affect performance by a few percent. Not quite double the speed.
Another performance boost has been unlocked! In many common use cases, the input to the FFT is entirely real data. So half of the input values are zeros. That's the imaginary part of the complex number input. Then we throw away half the output. No more! It is possible to pack all of the real number values in and process them as a complex FFT of half the size. Since the complexity of an FFT is N log(N), by halving N we more than double the speed. But there is some post-processing required which is comparable to another level of butterflies. So the log(N) term is unchanged while the N term is halved. So the time to process the FFT can be halved. That is reflected in the formula here: https://www.fftw.org/speed/
The post-processing does result in the output values being doubled compared to the usual.
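For reference, the even/odd packing and the post-processing step can be sketched with NumPy (an illustration of the standard half-size trick, not the PASM implementation; note it uses the textbook 1/2 scaling, whereas the code described above leaves the outputs doubled):

```python
import numpy as np

def real_fft_via_half_complex(x):
    """FFT of real input x (length N, even) using one complex FFT of
    size N/2 plus O(N) post-processing; returns bins 0..N/2."""
    N = len(x)
    N2 = N // 2
    # Pack even samples into the real part, odd samples into the imaginary part.
    z = x[0::2] + 1j * x[1::2]
    Z = np.fft.fft(z)                   # one complex FFT of half the size
    k = np.arange(N2)
    Zc = np.conj(Z[(-k) % N2])          # Z[N/2 - k], wrapping at k = 0
    Ze = 0.5 * (Z + Zc)                 # spectrum of the even samples
    Zo = -0.5j * (Z - Zc)               # spectrum of the odd samples
    X = Ze + np.exp(-2j * np.pi * k / N) * Zo   # bins 0 .. N/2-1
    nyq = (Ze[0] - Zo[0]).real                  # bin N/2 (Nyquist)
    return np.concatenate([X, [nyq]])

x = np.random.default_rng(1).standard_normal(1024)
assert np.allclose(real_fft_via_half_complex(x), np.fft.rfft(x))
```

The post-processing loop is the "comparable to another level of butterflies" cost: one pass over N/2 bins after the half-size transform.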
fft_bench v1.2.1 for PROPELLER
OpenMP not available on this system
Freq. Magnitude I Q
0, 1024, 1024, 0
192, 1023, 1023, 0
256, 1024, 1024, 0
512, 1024, -1024, 0
1024 point bit-reversal and butterfly run time = 1188 us
clock frequency = 160000000

Based on this test the P2 scored 21.5 MOPS for the real FFT and 24.8 MOPS for the complex FFT.
You just said real goes twice as fast. How come 21.5 vs 24.8 MOPS?
The real FFT halves the operation count, so theoretically it should run twice as fast. In practice it's a really nice speed boost, but not quite double. The MOPS was calculated from the formula on the FFTW page: https://www.fftw.org/speed/ Most systems experience a reduction in MFLOPS when processing real data. It doesn't matter; we just want our desired transform to finish as quickly as possible.
2059 uS complex 1024 point with bitreverse
1188 uS real 1024 point with bitreverse and postprocessing
1022 uS for a 512 point complex fft and non-optimized bitreverse
978 uS for a 512 point complex fft and optimized bitreverse
899 uS for a 512 point complex fft no bitreverse
MOPS went down for the real version because of the extra post-processing, which hasn't been optimized much yet. Those cordic operations take a while.
Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?
Found a solution for a bit reversed counter: Just manually change the bits with an XOR.
        setq    ay2                'rotate (bx,by) by angle
        qrotate ax2,angle          ' %000x...
        xor     angle,##$4000_0000 ' %010x...
        add     i3,#8              ' 2*fftsize*4
        setq    ay4                'rotate (bx,by) by angle
        qrotate ax4,angle
        xor     angle,##$6000_0000 ' %001x...
        mov     ax2,ax1
        setq    by2                'rotate (bx,by) by angle
        qrotate bx2,angle
        mov     ay2,ay1
        xor     angle,##$4000_0000 ' %011x...
        setq    by4                'rotate (bx,by) by angle
        qrotate bx4,angle
        mov     angle,i3
        rev     angle

The full counter and bit reverse runs once per loop. The XORs fit into the 4 butterfly unrolled loop perfectly. Of course, in practice you would store those constants in a long to avoid the AUGS penalty.
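In Python, the same XOR idea looks like this (a generic bit-reversed counter sketch; the PASM above actually keeps a normal counter in i3 and uses REV, with the XOR constants stepping the angle inside the unrolled loop):

```python
def rev_increment(r, width):
    """Advance a bit-reversed counter one step by propagating the
    carry from the top bit downward, toggling bits with XOR."""
    bit = 1 << (width - 1)
    while r & bit:          # mirrored carry: clear set bits from the top down
        r ^= bit
        bit >>= 1
    return r ^ bit          # set the first clear bit found

# Visits indices in bit-reversed order: 0, 4, 2, 6, 1, 5, 3, 7 for width 3.
seq = [0]
for _ in range(7):
    seq.append(rev_increment(seq[-1], 3))
assert seq == [int(f"{i:03b}"[::-1], 2) for i in range(8)]
```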
Posting here to back up my work.
Now a 1024 point real data FFT runs in 982uS ! At 160MHz
I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.
It's not doing 43 MOPS. From the formula on the FFTW page:
1024 point complex data requires 51,200 useful operations. (100%)
512 point complex data requires 23,040 useful operations. (45%)
1024 point real data requires 25,600 useful operations. (50%)
The useful operations counts the mathematical operations in the data path like multiplies and adds. It does not include stuff like loop counters, memory reads, and so on.
Now if you have 1024 points of real number data you certainly can use a 1024 point complex fft to process it. But the CPU will end up doing twice as many operations as compared to an optimal algorithm. The 1024 point real fft is based on a 512 point complex fft, then another level of processing is required. I've made massive gains by optimizing the assembly code. It was also very much worthwhile to use an optimal algorithm that eliminates 50% of the operations required.
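Those counts come straight from the FFTW convention of 5 N log2(N) "useful operations" for a complex transform, and half of that for real input; a quick check against the times reported earlier, as a sketch:

```python
import math

def fft_ops(n, real=False):
    """FFTW's benchmark convention: 5*N*log2(N) useful operations
    for a complex FFT, half of that for real input."""
    ops = 5 * n * math.log2(n)
    return ops / 2 if real else ops

assert fft_ops(1024) == 51200              # 1024-point complex (100%)
assert fft_ops(512) == 23040               # 512-point complex (45%)
assert fft_ops(1024, real=True) == 25600   # 1024-point real (50%)

# MOPS = useful operations / elapsed microseconds
assert abs(fft_ops(1024) / 2059 - 24.8) < 0.1              # complex, 2059 us
assert abs(fft_ops(1024, real=True) / 1188 - 21.5) < 0.1   # real, 1188 us
```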
Oh!!! I had that inverted. Here I was thinking MOPS was a speed measure (Mega Operations Per Second) when it's really a demand measure (Mega OPerationS).
Thanks, Chip!
Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply written out as individual multiplies, or some algorithm doing the complex rotate with 3 multiplies; it's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.
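The 3-multiply complex rotate mentioned here is the Karatsuba-style factoring, for example (a plain-Python sketch, not the CORDIC path the P2 code uses):

```python
def cmul3(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using 3 multiplies instead of 4,
    trading one multiply for extra additions."""
    t1 = ar * br
    t2 = ai * bi
    t3 = (ar + ai) * (br + bi)
    return t1 - t2, t3 - t1 - t2   # (real, imag)

# (1 + 2j) * (3 + 4j) = -5 + 10j
assert cmul3(1, 2, 3, 4) == (-5, 10)
```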
It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop. No need to write the data to hub ram in between. Sadly, that only saved 50uS. 1024 point real fft with ordered iq output takes 760uS at 160MHz. So 486uS at 250MHz. The bit reversal step is really starting to slow things down.
I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance out of one cog.
Yeah, optimizations can go on and on. I was talking with Shannon Mackey the other day and I was telling him that because hardware optimizations exist in the P2 that enable software optimizations, you wind up playing chess instead of checkers, which can be taxing. If we had just simple instructions like AND, OR, ADD, SUB, SHR, SHL, and no CORDIC, it would be more straightforward to write code, but everything would be slow. He pointed out that if the hardware was simple, everyone's software efforts would result in the same solutions. So, it's not simple, but there's potential for lots of optimization. Enough to wear your brain out.
Wonder if the shared LUT would help when using 2 cogs...
Maybe it could.
@SaucySoliton , how is your FFT code these days - have you made any more speed improvements? Did you try using more than one cog? I could do with a 1024 complex to complex FFT and inverse FFT for my dsp library.
Cheers, Bob
IFFT isn't much different than FFT, but I just never got around to it. If this is something you would like, I have some free time right now to work on it. I really need to clean it up and post it to the obex anyway.
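"Not much different" can be made concrete: an inverse FFT is a forward FFT on conjugated data (equivalently, with I and Q swapped) plus a 1/N scale. A NumPy sketch of that identity, for illustration:

```python
import numpy as np

def ifft_via_fft(X):
    """Inverse FFT built from a forward FFT: conjugate, transform,
    conjugate again, and scale by 1/N."""
    return np.conj(np.fft.fft(np.conj(X))) / len(X)

X = np.fft.fft(np.arange(8.0))
assert np.allclose(ifft_via_fft(X), np.arange(8.0))
```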
The ironic thing is that after doing all this work I found another FFT library: https://forums.parallax.com/discussion/comment/1466754/#Comment_1466754 There is a fundamental difference: this library does the bit reversal first (Decimation In Time).
The code in this thread does bit reversal last. (Decimation In Frequency) Usually the type of algorithm doesn't matter. If you are doing FFT filtering, the bit reversal step can be omitted.
How much speed do you need? The 1024 complex FFT is 1015 uS at 160 MHz without bit reversal; with bit reversal it is 1220 uS.
Thank you very much for offering to tidy up and obex the code. That would be great. Speed - I hope to run a software defined radio at 48 ksamples/s. With 1024 iq samples per buffer, that sets the main loop at 21.33 ms, or 42.66 ms, or 63.99 ms - at the expense of more buffers. So the fft times you mention would be acceptable. I aim to use as many cogs as I can to speed things up. I'm writing an inline assembly library - sine generator, modulator and mixer so far. Working on iq to polar and polar to iq at the moment. If I use two cogs with shared lut, it should run at 14 cycles per iq sample.
Could the bit reversal step in the fft be run optionally to suit the application? Likewise windowing?
Best regards, Bob
I'm copying the design of a software radio that first appeared a decade ago. The receiver main filter is calculated in the frequency domain as an fft fast convolution filter:-
1. Convert from time domain to frequency domain with fft
2. Frequency translate from an offset of 12khz to baseband IF of 0 Hz
3. Sideband selection
4. Generate the required bandpass filter coefficients, when user changes bandwidth. Apply Blackman-Harris windowing to this once, rather than the signal path all the time
5. Fft fast convolution filtering
6. Conversion back to time domain with ifft
I made a successful receiver in the language LabView, so reasonably confident about doing it on the P2, providing I reuse buffers as much as possible.
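Steps 1, 5 and 6 together are overlap-save fast convolution. A minimal NumPy sketch of the idea (illustrative only; the block size and filter here are placeholder assumptions, not Bob's actual design):

```python
import numpy as np

def overlap_save(x, h, nfft=1024):
    """Filter x with FIR h by multiplying spectra block by block;
    each nfft-point block yields nfft-len(h)+1 valid output samples."""
    x = np.asarray(x, float)
    M = len(h)
    L = nfft - M + 1                    # valid output samples per block
    H = np.fft.fft(h, nfft)             # filter spectrum, computed once
    T = len(x) + M - 1                  # full convolution length
    nblocks = -(-T // L)                # ceil(T / L)
    xp = np.concatenate([np.zeros(M - 1), x,
                         np.zeros((nblocks - 1) * L + nfft - T)])
    y = np.empty(nblocks * L)
    for b in range(nblocks):
        seg = xp[b * L : b * L + nfft]
        # Circular convolution; only the last L samples are free of wraparound.
        y[b * L : b * L + L] = np.fft.ifft(np.fft.fft(seg) * H).real[M - 1:]
    return y[:T]

rng = np.random.default_rng(0)
x, h = rng.standard_normal(3000), rng.standard_normal(33)
assert np.allclose(overlap_save(x, h, nfft=256), np.convolve(x, h))
```

In the radio, H would be the windowed brick-wall filter spectrum (regenerated only when the user changes bandwidth), and the per-block FFT/IFFT pair is the same transform discussed in this thread.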
I forget whether reordering was used or not, hence my question about keeping it as a discrete method. I'll look that up in the code. I think it probably was. See article https://arrl.org/files/file/Technology/tis/info/pdf/021112qex027.pdf
The code I've written so far is demoed near the bottom of https://forums.parallax.com/discussion/176026/efficiently-processing-continuous-signals#latest I've standardised on buffers that store samples as two longs, like real1, imag1, real2, imag2 - real1024, imag1024. It's best run in Spin Tools, where the Debug Scope windows are fast enough to keep up. Pnut doesn't keep up.
Cheers, Bob
Hi Bob
If you do FFT, then set most of the results to zero, and then do IFFT, isn't this rather inefficient in comparison to directly using a filter that only calculates the desired result(s), like the Goertzel algorithm? You could vary the bandwidth by modifying the block length, as you could use any block length. I wonder what an inverse Goertzel would be; just generate sin and cos?
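For reference, the Goertzel recurrence mentioned here evaluates a single DFT bin in O(N) with one coefficient per bin, so it pays off when only a few bins are wanted; a generic sketch:

```python
import math

def goertzel_power(x, k):
    """Squared magnitude of DFT bin k of sequence x via the Goertzel
    recurrence: one multiply per sample, no FFT required."""
    N = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / N)
    s1 = s2 = 0.0
    for sample in x:
        s1, s2 = sample + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# A full-scale cosine at bin 50 of a 1024-point DFT has |X[50]| = 512.
x = [math.cos(2 * math.pi * 50 * n / 1024) for n in range(1024)]
assert abs(goertzel_power(x, 50) / 512.0**2 - 1) < 1e-6
```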
And a second question: why do you need to do the IFFT at all? Is this not an amplitude modulated signal anyway?
I understand that a complete FFT is interesting for a graphical display, but probably that does not need a high update rate?
Cheers Christof
Hi @"Christof Eb." , good to hear from you - Happy New Year!
Good questions! I am no mathematician, but turn instead to a four part article written by Gerald Youngblood, the CEO of FlexRadio Systems. He claims that using an FFT Fast Convolution Filter gives a far superior filter performance to any FIR filter calculated in the time domain. The desirable filter features are flat top and almost vertical deep filter sides. The fast convolution filter done in the frequency domain is the equivalent of a many, many staged FIR filter in the time domain, taking far less time to compute and with much better filter shape. This is covered in some detail in part 3 of 'sdr for the masses' under 'Fast Convolution Filter Magic'. You can see the 'brick wall' shape of the filter that results. I've been using PC based radios based on this architecture for many years and can vouch for how pleasant they are to use - good quality audio, very good adjacent signal rejection, good signal to noise, absence of ringing even at 100Hz filter width for morse code and so on. I recommend reading the whole series (attached), he's very good at explaining the various design decisions.
Ham radio operators use morse code, lower single sideband, upper single sideband, AM (rarely), FM, and many digital modes. Single sideband is efficient for long distance weak signal performance, since one sideband and the carrier are not transmitted. So sideband selection is trivial to do in the frequency domain.
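As a sketch of that last point (NumPy, with a hypothetical bin layout; not the radio's actual code): with complex IQ data the upper sideband occupies the positive-frequency bins and the lower sideband the negative-frequency bins, so selecting a sideband is just zeroing half the spectrum.

```python
import numpy as np

def select_sideband(iq, upper=True):
    """Keep one sideband of a complex IQ block by zeroing the other
    half of its spectrum (bins above N/2 are negative frequencies)."""
    X = np.fft.fft(iq)
    N = len(X)
    if upper:
        X[N // 2 + 1:] = 0      # drop negative frequencies (lower sideband)
    else:
        X[1:N // 2] = 0         # drop positive frequencies (upper sideband)
    return np.fft.ifft(X)

n = np.arange(1024)
usb = np.exp(2j * np.pi * 10 * n / 1024)     # tone at +10 bins
lsb = np.exp(-2j * np.pi * 25 * n / 1024)    # tone at -25 bins
assert np.allclose(select_sideband(usb + lsb, upper=True), usb)
assert np.allclose(select_sideband(usb + lsb, upper=False), lsb)
```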
You're right, the graphical display doesn't need to be updated more than (say) 10-15 times per second for a smooth appearance.
At the moment, I have no clear idea how much of this design can be squeezed into 500 kbytes, but I'm willing to have a go. The development of an array based dsp library is probably of general interest to any audio programmer or signal analyst, so it's worth doing anyway. The hi-speed techniques used, e.g. dual cog with shared LUT, can be used in other applications. Also, so far, I'm surprised how small the dsp functions are when written in assembly language, which is encouraging.
Cheers, Bob
Oh - and why not develop in Taqoz? I started out with Taqoz, but found having to transfer arrays to and from LabView on the PC a bit tiresome. So I turned to SPIN/PASM.
1. The debug scope windows are so easy to use and essential for debugging the dsp methods
2. The SPIN interpreter has a tiny footprint compared to the Taqoz system. Spin byte code is tiny, and there is the option in the Flexprop IDE to turn it into machine code.
3. The PNUT or Spin Tools IDEs offer single-stepping through assembly code - very handy, no need to guess what's wrong
4. The assemblers flag many more errors than the Taqoz assembler
5. It was high time I learnt how to code in SPIN/PASM, to take advantage of the Obex and the friendly help available.
I prefer and enjoy using Taqoz for some jobs, but not this one.
Cheers, Bob
Thanks @bob_g4bby for the pdfs!
I am always learning in this forum.... So in this type of receiver a large number of bins is used, and also a frequency shift.
When full speed is needed a compiler or assembler is better than Forth, ideally with good debugging possibilities.
Cheers Christof