Shop OBEX P1 Docs P2 Docs Learn Events
1024-point FFT in 79 longs - Page 3 — Parallax Forums

1024-point FFT in 79 longs

13»

Comments

  • RaymanRayman Posts: 15,947
    edited 2024-01-14 23:23

    @SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
    Sorry if I’m being slow…

  • evanhevanh Posts: 17,037

    More than twice as fast, so far. Mainly from using the FIFO for data fetches and unrolling the inner loop to utilise the Cordic pipeline.

  • @Rayman said:
    @SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
    Sorry if I’m being slow…

    I posted a list of optimizations. I could have forgotten some.

    For the same clock frequency, Chip's posted code was 3400-3600 uS. (It's not that variable, I just don't remember precisely and can't test it right now. It would be a bit faster than the inline version.) My FFT alone was 1900 uS. Then I added in bit reversal for comparison to fftbench. The alignment does affect performance by a few percent. Not quite double the speed.

  • Another performance boost has been unlocked! In many common use cases, the input to the FFT is entirely real data. So half of the input values are zeros. That's the imaginary part of the complex number input. Then we throw away half the output. No more! It is possible to pack all of the real number values in and process them as a complex FFT of half the size. Since the complexity of an FFT is N log(N), by halving N we more than double the speed. But there is some post-processing required which is comparable to another level of butterflies. So the log(N) term is unchanged while the N term is halved. So the time to process the FFT can be halved. That is reflected in the formula here: https://www.fftw.org/speed/

    The post-processing does result in the output values being doubled compared to the usual.

    fft_bench v1.2.1 for PROPELLER
    OpenMP not available on this system
    Freq.    Magnitude         I         Q
           0,     1024,     1024,        0
         192,     1023,     1023,        0
         256,     1024,     1024,        0
         512,     1024,    -1024,        0
    1024 point bit-reversal and butterfly run time = 1188 us
    clock frequency = 160000000
    

    Based on this test the P2 scored 21.5 MOPS for the real FFT and 24.8 MOPS for the complex FFT.

  • evanhevanh Posts: 17,037

    You just said real goes twice as fast. How come 21.5 vs 24.8 MOPS?

  • The real FFT theoretically halves the operations and should theoretically run twice as fast. In practice, it's a really nice speed boost, but not quite double. The MOPS was calculated from the formula on the FFTW page. https://www.fftw.org/speed/ Most systems experience a reduction in MFLOPS when processing real data. It doesn't matter, we just want our desired transform to finish as quickly as possible.

    2059 uS complex 1024 point with bitreverse

    1188 uS real 1024 point with bitreverse and postprocessing

    1022 uS for a 512 point complex fft and non-optimized bitreverse

    978 uS for a 512 point complex fft and optimized bitreverse

    899 uS for a 512 point complex fft no bitreverse

    MOPS went down for the real version because of the extra post-processing. The post-processing for the real FFT version hasn't been optimized too much. Those cordic operations take a while.

  • evanhevanh Posts: 17,037
    edited 2024-01-15 22:51

    Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?

  • @cgracey said:
    Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.

    By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.

    I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.

    Found a solution for a bit reversed counter: Just manually change the bits with an XOR.

            setq    ay2             'rotate (bx,by) by angle
            qrotate ax2,angle       ' %000x...
    
            xor angle,##$4000_0000  ' %010x...
            add i3,#8               ' 2*fftsize*4
            setq    ay4             'rotate (bx,by) by angle
            qrotate ax4,angle
    
            xor angle,##$6000_0000  ' %001x...
            mov ax2,ax1
            setq    by2             'rotate (bx,by) by angle
            qrotate bx2,angle
    
            mov ay2,ay1
            xor angle,##$4000_0000  ' %011x...
            setq    by4             'rotate (bx,by) by angle
            qrotate bx4,angle
    
            mov angle,i3
            rev angle
    

    The full counter and bit reverse runs once per loop. The XORs fit into the 4 butterfly unrolled loop perfectly. Of course, in practice you would store those constants in a long to avoid the AUGS penalty.

    Posting here to back up my work. o:) Now a 1024 point real data FFT runs in 982uS ! At 160MHz

  • cgraceycgracey Posts: 14,289

    I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

  • @evanh said:
    Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?

    It's not doing 43 MOPS. From the formula on the FFTW page:

    1024 point complex data requires 51,200 useful operations. (100%)

    512 point complex data requires 23,040 useful operations. (45%)

    1024 point real data requires 25,600 useful operations. (50%)

    The useful operations counts the mathematical operations in the data path like multiplies and adds. It does not include stuff like loop counters, memory reads, and so on.

    Now if you have 1024 points of real number data you certainly can use a 1024 point complex fft to process it. But the CPU will end up doing twice as many operations as compared to an optimal algorithm. The 1024 point real fft is based on the 512 point real fft. Then another level of processing is required. I've made massive gains by optimizing the assembly code. It was also very much worthwhile to use an optimal algorithm that eliminates 50% of the operations required.

  • evanhevanh Posts: 17,037

    Oh!!! I had that inverted. Here I was thinking MOPS was a speed measure (Mega Operations Per Second) when it's really a demand measure (Mega OPerationS).

  • @cgracey said:
    I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

    Thanks, Chip!

    Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply as individual multiplies. Using some algorithm to do a complex rotate with 3 multiplies. It's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.

    It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop. No need to write the data to hub ram in between. Sadly, that only saved 50uS. 1024 point real fft with ordered iq output takes 760uS at 160MHz. So 486uS at 250MHz. The bit reversal step is really starting to slow things down.

    I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance out of one cog.

  • cgraceycgracey Posts: 14,289
    edited 2024-01-19 19:58

    @SaucySoliton said:

    @cgracey said:
    I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

    Thanks, Chip!

    Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply as individual multiplies. Using some algorithm to do a complex rotate with 3 multiplies. It's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.

    It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop. No need to write the data to hub ram in between. Sadly, that only saved 50uS. 1024 point real fft with ordered iq output takes 760uS at 160MHz. So 486uS at 250MHz. The bit reversal step is really starting to slow things down.

    I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance out of one cog.

    Yeah, optimizations can go on and on. I was talking with Shannon Mackey the other day and I was telling him that because hardware optimizations exist in the P2 that enable software optimizations, you wind up playing chess instead of checkers, which can be taxing. If we had just simple instructions like AND, OR, ADD, SUB, SHR, SHL, and no CORDIC, it would be more straightforward to write code, but everything would be slow. He pointed out that if the hardware was simple, everyone's software efforts would result in the same solutions. So, it's not simple, but there's potential for lots of optimization. Enough to wear your brain out.

  • RaymanRayman Posts: 15,947

    Wonder if the shared LUT would help when using 2 cogs...

  • cgraceycgracey Posts: 14,289

    @Rayman said:
    Wonder if the shared LUT would help when using 2 cogs...

    Maybe it could.

  • @SaucySoliton , how is your FFT code these days - have you made any more speed improvements? Did you try using more than one cog? I could do with a 1024 complex to complex FFT and inverse FFT for my dsp library.
    Cheers, Bob

  • SaucySolitonSaucySoliton Posts: 553
    edited 2026-01-08 23:02

    IFFT isn't much different that FFT, but I just never got around to it. If this is something you would like I have some free time right now to work on it. I really need to clean it up and post to obex anyway.

    The ironic thing is after doing all this work I found another FFT library https://forums.parallax.com/discussion/comment/1466754/#Comment_1466754 There is a fundamental difference. This library does the bit reversal first. (Decimation In Time)

    The code in this thread does bit reversal last. (Decimation In Frequency) Usually the type of algorithm doesn't matter. If you are doing FFT filtering, the bit reversal step can be omitted.

    How much speed do you need? The 1024 complex FFT is 1015uS at 160MHz without bit reversal. With bit reversal is 1220uS.

  • bob_g4bbybob_g4bby Posts: 504
    edited 2026-01-09 11:04

    Thank you very much for offering to tidy up and obex the code. That would be great. Speed - I hope to run a software defined radio at 48 ksamples/s. With 1024 iq samples per buffer, that sets the main loop at 21.33 ms, or 42.66ms, or 63.99ms - at the expense of more buffers. So the fft times you mention would be acceptable. I aim to use as many cogs as I can to speed things up. I'm writing an inline assembly library - sine generator, modulator and mixer so far. Working on iq to polar and polar to iq at the moment. If I use two cogs with shared lut, should run at 14 cycles per iq sample.

    Could the bit reversal step in the fft be run optionally to suit the application? Likewise windowing?

    Best regards, Bob

  • bob_g4bbybob_g4bby Posts: 504
    edited 2026-01-09 11:06

    I'm copying the design of a software radio that first appeared a decade ago. The receiver main filter is calculated in the frequency domain as an fft fast convolution filter:-
    1. Convert from time domain to frequency domain with fft
    2. Frequency translate from an offset of 12khz to baseband IF of 0 Hz
    3. Sideband selection
    4. Generate the required bandpass filter coefficients, when user changes bandwidth. Apply Blackman-Harris windowing to this once, rather than the signal path all the time
    5. Fft fast convolution filtering
    6. Conversion back to time domain with ifft

    I made a successful receiver in the language LabView, so reasonably confident about doing it on the P2, providing I reuse buffers as much as possible.

    I forget whether reordering was used or not, hence my question about keeping it as a discrete method. I'll look that up in the code. I think it probably was. See article https://arrl.org/files/file/Technology/tis/info/pdf/021112qex027.pdf

    The code I've written so far is demoed near the bottom of [https://forums.parallax.com/discussion/176026/efficiently-processing-continuous-signals#latest) I've standardised on buffers that store samples as two longs like real1, imag1, real2, imag2 - real1024, imag1024. It's best run in Spin Tools, where the Debug Scope windows are fast enough to keep up. Pnut doesn't keep up.

    Cheers, Bob

  • @bob_g4bby said:
    I'm copying the design of a software radio that first appeared a decade ago. The receiver main filter is calculated in the frequency domain as an fft fast convolution filter:-
    1. Convert from time domain to frequency domain with fft
    2. Frequency translate from an offset of 12khz to baseband IF of 0 Hz
    3. Sideband selection
    4. Generate the required bandpass filter coefficients, when user changes bandwidth. Apply Blackman-Harris windowing to this once, rather than the signal path all the time
    5. Fft fast convolution filtering
    6. Conversion back to time domain with ifft

    I made a successful receiver in the language LabView, so reasonably confident about doing it on the P2, providing I reuse buffers as much as possible.

    I forget whether reordering was used or not, hence my question about keeping it as a discrete method. I'll look that up in the code. I think it probably was. See article https://arrl.org/files/file/Technology/tis/info/pdf/021112qex027.pdf

    The code I've written so far is demoed near the bottom of [https://forums.parallax.com/discussion/176026/efficiently-processing-continuous-signals#latest) I've standardised on buffers that store samples as two longs like real1, imag1, real2, imag2 - real1024, imag1024. It's best run in Spin Tools, where the Debug Scope windows are fast enough to keep up. Pnut doesn't keep up.

    Cheers, Bob

    Hi Bob
    If you do FFT, then set most of the results to zero and then do IFFT, isn't this rather inefficient in comparison to directly using a filter, that does only calculate the desired result(s) like Goertzel algorithm? You could vary bandwidth modifying block length, as you could use any block length. I wonder, what inverse Goertzel would be, just generate sin and cos?
    And second question: Why do you need to do IFFT at all? Is this not amplitude modulated signal anyways?
    I understand that a complete FFT is interesting for a graphical display, but probably that does not need a high update rate?
    Cheers Christof

  • Hi @"Christof Eb." , good to hear from you - Happy New Year!

    Good questions! I am no mathematician, but turn instead to a four part article written by Gerald Youngblood, the CEO of FlexRadio Systems. He claims that using an FFT Fast Convolution Filter gives a far superior filter performance to any FIR filter calculated in the time domain. The desirable filter features are flat top and almost vertical deep filter sides. The fast convolution filter done in the frequency domain is the equivalent of a many, many staged FIR filter in the time domain, taking far less time to compute and with much better filter shape. This is covered in some detail in part 3 of 'sdr for the masses' under 'Fast Convolution Filter Magic'. You can see the 'brick wall' shape of the filter that results. I've been using PC based radios based on this architecture for many years and can vouch for how pleasant they are to use - good quality audio, very good adjacent signal rejection, good signal to noise, absence of ringing even at 100Hz filter width for morse code and so on. I recommend reading the whole series (attached), he's very good at explaining the various design decisions.

    Ham radio operators use morse code, lower single sideband, upper single sideband, AM (rarely), FM, and many digital modes. Single sideband is efficient for long distance weak signal performance, since one sideband and the carrier is not transmitted. So sideband selection is trivial to do in the frequency domain.

    You're right, the graphical display doesn't need to be updated more than (say) 10-15 times per second for a smooth appearance.

    At the moment, I have no clear idea how much of this design can be squeezed into 500 kbytes, but I'm willing to have a go. The development of an array based dsp library is probably of general interest to any audio programmer or signal analyst, so it's worth doing anyway. The hi-speed techniques used, e.g. dual cog with shared LUT, can be used in other applications. Also, so far, I'm surprised how small the dsp functions are when written in assembly language, which is encouraging.

    Cheers, Bob

  • Oh - and why not develop in Taqoz? I started out with Taqoz, but found having to transfer arrays to and from LabView on the PC a bit tiresome. So I turned to SPIN/PASM.
    1. The debug scope windows are so easy to use and essential for debugging the dsp methods
    2. The SPIN interpreter has a tiny footprint compared to the Taqoz system. Spin byte code is tiny / there is the option in Flexprop ide to turn it into machine code.
    3. The PNUT or Spin Tools ide offer single shot through assembly code - very handy, no need to guess what's wrong
    4. The assemblers flag many more errors than the Taqoz assembler
    5. It was high time I learnt how to code in SPIN/PASM, to take advantage of the Obex and the friendly help avaiable.
    I prefer and enjoy using Taqoz for some jobs, but not this one.
    Cheers, Bob

  • Thanks @bob_g4bby for the pdfs!
    I am always learning in this forum.... So in this type of receiver a big number of bins is used and also a frequency shift.
    When full speed is needed a compiler or assembler is better than Forth, ideally with good debugging possibilities.
    Cheers Christof

Sign In or Register to comment.