1024-point FFT in 79 longs

Rayman · 2024-01-14 23:22

@SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
Sorry if I’m being slow…

evanh · 2024-01-15 00:08

More than twice as fast, so far. Mainly from using the FIFO for data fetches and unrolling the inner loop to utilise the Cordic pipeline.

SaucySoliton · 2024-01-15 01:54

@Rayman said:
@SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
Sorry if I’m being slow…

I posted a list of optimizations. I could have forgotten some.

For the same clock frequency, Chip's posted code took 3400-3600 us. (It's not that variable, I just don't remember precisely and can't test it right now. It would be a bit faster than the inline version.) My FFT alone took 1900 us. Then I added in bit reversal for comparison to fftbench. The alignment does affect performance by a few percent. So, not quite double the speed.

SaucySoliton · 2024-01-15 21:24

Another performance boost has been unlocked! In many common use cases, the input to the FFT is entirely real data. So the imaginary half of the complex input is all zeros, and then we throw away half the output. No more! It is possible to pack all of the real values in and process them as a complex FFT of half the size. Since the complexity of an FFT is N log(N), halving N on its own would more than double the speed. But there is some post-processing required, which is comparable to another level of butterflies. So the log(N) term is unchanged while the N term is halved, and the time to process the FFT can be roughly halved. That is reflected in the formula here: https://www.fftw.org/speed/

The post-processing does result in the output values being doubled compared to the usual.
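For anyone following along, here is a minimal Python sketch of the packing trick described above. It is not the P2 code from this thread: `fft` and `real_fft` are hypothetical names, the recursive FFT is just a stand-in for whatever complex FFT you already have, and the input length is assumed to be a power of two.

```python
import cmath

def fft(z):
    """Plain recursive radix-2 complex FFT (length must be a power of two)."""
    n = len(z)
    if n == 1:
        return list(z)
    even = fft(z[0::2])
    odd = fft(z[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def real_fft(x):
    """Bins 0..N/2 of the FFT of real x, using one complex FFT of size N/2."""
    n = len(x)
    half = n // 2
    # Pack: even samples go in the real part, odd samples in the imaginary part.
    z = [complex(x[2 * i], x[2 * i + 1]) for i in range(half)]
    zf = fft(z)
    out = []
    for k in range(half + 1):
        a = zf[k % half]
        b = zf[(half - k) % half].conjugate()
        xe = (a + b) / 2       # spectrum of the even samples
        xo = (a - b) / 2j      # spectrum of the odd samples
        # One extra butterfly-like step combines the two half spectra.
        out.append(xe + cmath.exp(-2j * cmath.pi * k / n) * xo)
    return out
```

Dropping the two divisions by 2 in the post-processing loop would leave every bin doubled. That may be where the doubling mentioned above comes from, but that connection is a guess, not something stated in the post.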

fft_bench v1.2.1 for PROPELLER
OpenMP not available on this system
Freq.    Magnitude         I         Q
       0,     1024,     1024,        0
     192,     1023,     1023,        0
     256,     1024,     1024,        0
     512,     1024,    -1024,        0
1024 point bit-reversal and butterfly run time = 1188 us
clock frequency = 160000000

Based on this test the P2 scored 21.5 MOPS for the real FFT and 24.8 MOPS for the complex FFT.

evanh · 2024-01-15 21:41

You just said real goes twice as fast. How come 21.5 vs 24.8 MOPS?

SaucySoliton · 2024-01-15 22:37

The real FFT theoretically halves the operations and should theoretically run twice as fast. In practice, it's a really nice speed boost, but not quite double. The MOPS was calculated from the formula on the FFTW page. https://www.fftw.org/speed/ Most systems experience a reduction in MFLOPS when processing real data. It doesn't matter, we just want our desired transform to finish as quickly as possible.

2059 us complex 1024 point with bitreverse

1188 us real 1024 point with bitreverse and postprocessing

1022 us for a 512 point complex fft and non-optimized bitreverse

978 us for a 512 point complex fft and optimized bitreverse

899 us for a 512 point complex fft no bitreverse

MOPS went down for the real version because of the extra post-processing, which hasn't been optimized much yet. Those cordic operations take a while.

evanh · 2024-01-15 22:48

Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?

SaucySoliton · 2024-01-17 23:59

@cgracey said:
Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.

By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.

I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instructions, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.

Found a solution for a bit reversed counter: Just manually change the bits with an XOR.

        setq    ay2                     'rotate (ax2,ay2) by angle
        qrotate ax2,angle               ' %000x...

        xor     angle,##$4000_0000      ' %010x...
        add     i3,#8                   ' 2*fftsize*4
        setq    ay4                     'rotate (ax4,ay4) by angle
        qrotate ax4,angle

        xor     angle,##$6000_0000      ' %001x...
        mov     ax2,ax1
        setq    by2                     'rotate (bx2,by2) by angle
        qrotate bx2,angle

        mov     ay2,ay1
        xor     angle,##$4000_0000      ' %011x...
        setq    by4                     'rotate (bx4,by4) by angle
        qrotate bx4,angle

        mov     angle,i3                'reload from the full counter
        rev     angle                   'bit reverse, once per loop

The full counter and bit reverse run once per loop. The XORs fit into the 4-butterfly unrolled loop perfectly. Of course, in practice you would store those constants in longs to avoid the AUGS penalty.
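To check the XOR pattern off-chip, here is a small Python model of it. The names `rev32` and `angles_for_group` are made up for this sketch, and it assumes `i3` is always a multiple of 8 and that each of the four butterflies corresponds to a counter step of 2. That step size is inferred from the XOR constants, not stated in the post.

```python
MASK32 = 0xFFFF_FFFF

def rev32(v):
    """Reverse all 32 bits of v, modelling the P2 REV instruction."""
    return int(format(v & MASK32, "032b")[::-1], 2)

def angles_for_group(i3):
    """Four angles for one pass of the unrolled loop, using only XORs."""
    angle = rev32(i3)          # full counter + bit reverse, once per loop
    out = [angle]              # top bits %000x...
    for delta in (0x4000_0000, 0x6000_0000, 0x4000_0000):
        angle ^= delta         # top bits %010x..., %001x..., %011x...
        out.append(angle)
    return out
```

Under those assumptions, `angles_for_group(i3)` gives the same values as `rev32(i3)`, `rev32(i3 + 2)`, `rev32(i3 + 4)` and `rev32(i3 + 6)`, so three of the four REVs are replaced by single XORs.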

Posting here to back up my work. Now a 1024 point real data FFT runs in 982 us at 160 MHz!

cgracey · 2024-01-18 05:35

I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

SaucySoliton · 2024-01-18 05:52

@evanh said:
Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?

It's not doing 43 MOPS. From the formula on the FFTW page:

1024 point complex data requires 51,200 useful operations. (100%)

512 point complex data requires 23,040 useful operations. (45%)

1024 point real data requires 25,600 useful operations. (50%)

The useful operations counts the mathematical operations in the data path like multiplies and adds. It does not include stuff like loop counters, memory reads, and so on.

Now if you have 1024 points of real number data you certainly can use a 1024 point complex fft to process it. But the CPU will end up doing twice as many operations as an optimal algorithm would. The 1024 point real fft is based on the 512 point complex fft, followed by another level of processing. I've made massive gains by optimizing the assembly code. It was also very much worthwhile to use an optimal algorithm that eliminates 50% of the operations required.
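The numbers above can be reproduced with a few lines of Python. This is only a sketch of the FFTW benchmark convention (5 N log2(N) operations for complex data, half that for real data, divided by the run time in microseconds). The function names are hypothetical.

```python
import math

def useful_ops(n, real_input=False):
    """FFTW benchmark convention: 5*N*log2(N) ops, halved for real input."""
    ops = 5 * n * math.log2(n)
    return ops / 2 if real_input else ops

def mops(n, time_us, real_input=False):
    """Useful operations per microsecond, i.e. millions per second."""
    return useful_ops(n, real_input) / time_us
```

With the timings posted earlier, `mops(1024, 2059)` comes out at about 24.87 and `mops(1024, 1188, real_input=True)` at about 21.55. The real version finishes much sooner but gets credit for only half the operations, which is why its score is lower.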

evanh · 2024-01-18 11:03

Oh!!! I had that inverted. Here I was thinking MOPS was a speed measure (Mega Operations Per Second) when it's really a demand measure (Mega OPerationS).

SaucySoliton · 2024-01-19 06:23

@cgracey said:
I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

Thanks, Chip!

Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply written out as individual multiplies, or some algorithm that does a complex rotate with 3 multiplies. It's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.
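For comparison with a single QROTATE, here is roughly what that looks like in ordinary code. The post doesn't say which 3-multiply algorithm it has in mind. The sketch below shows one common variant next to the textbook 4-multiply form, and the function names are made up for illustration.

```python
def cmul4(a, b, c, d):
    """(a + jb) * (c + jd) the textbook way: 4 multiplies, 2 adds."""
    return a * c - b * d, a * d + b * c

def cmul3(a, b, c, d):
    """Same product with 3 multiplies and 5 adds."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2
```

For a rotation, `c` and `d` would be the cosine and sine of the angle, which is the part the CORDIC hides.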

It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop, with no need to write the data to hub ram in between. Sadly, that only saved 50 us. A 1024 point real fft with ordered iq output takes 760 us at 160 MHz, so 486 us at 250 MHz. The bit reversal step is really starting to slow things down.

I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance as possible out of one cog.

cgracey · 2024-01-19 19:57

@SaucySoliton said:

@cgracey said:
I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.

Thanks, Chip!

Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply written out as individual multiplies, or some algorithm that does a complex rotate with 3 multiplies. It's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.

It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop, with no need to write the data to hub ram in between. Sadly, that only saved 50 us. A 1024 point real fft with ordered iq output takes 760 us at 160 MHz, so 486 us at 250 MHz. The bit reversal step is really starting to slow things down.

I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance as possible out of one cog.

Yeah, optimizations can go on and on. I was talking with Shannon Mackey the other day and I was telling him that because hardware optimizations exist in the P2 that enable software optimizations, you wind up playing chess instead of checkers, which can be taxing. If we had just simple instructions like AND, OR, ADD, SUB, SHR, SHL, and no CORDIC, it would be more straightforward to write code, but everything would be slow. He pointed out that if the hardware was simple, everyone's software efforts would result in the same solutions. So, it's not simple, but there's potential for lots of optimization. Enough to wear your brain out.

Rayman · 2024-01-19 22:58

Wonder if the shared LUT would help when using 2 cogs...

cgracey · 2024-01-20 04:48

@Rayman said:
Wonder if the shared LUT would help when using 2 cogs...

Maybe it could.
