@cgracey said:
Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.
By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.
I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instructions, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways when trying to optimize it. It's slowing the whole FFT down.
Comments
@SaucySoliton did you find a way to make the 1024 point fft faster than what chip posted?
Sorry if I’m being slow…
More than twice as fast, so far. Mainly from using the FIFO for data fetches and unrolling the inner loop to utilise the Cordic pipeline.
I posted a list of optimizations. I could have forgotten some.
For the same clock frequency, Chip's posted code was 3400-3600 uS. (It's not that variable, I just don't remember precisely and can't test it right now. It would be a bit faster than the inline version.) My FFT alone was 1900 uS. Then I added in bit reversal for comparison to fftbench. The alignment does affect performance by a few percent. Not quite double the speed.
Another performance boost has been unlocked! In many common use cases, the input to the FFT is entirely real data. So half of the input values are zeros. That's the imaginary part of the complex number input. Then we throw away half the output. No more! It is possible to pack all of the real number values in and process them as a complex FFT of half the size. Since the complexity of an FFT is N log(N), by halving N we more than double the speed. But there is some post-processing required which is comparable to another level of butterflies. So the log(N) term is unchanged while the N term is halved. So the time to process the FFT can be halved. That is reflected in the formula here: https://www.fftw.org/speed/
The post-processing does result in the output values being doubled compared to the usual.
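For reference, the even/odd packing and the post-processing step can be sketched with NumPy (an illustration of the standard half-size trick, not the PASM implementation; note it uses the textbook 1/2 scaling, whereas the code described above leaves the outputs doubled):

```python
import numpy as np

def real_fft_via_half_complex(x):
    """FFT of real input x (length N, even) using one complex FFT of
    size N/2 plus O(N) post-processing; returns bins 0..N/2."""
    N = len(x)
    N2 = N // 2
    # Pack even samples into the real part, odd samples into the imaginary part.
    z = x[0::2] + 1j * x[1::2]
    Z = np.fft.fft(z)                   # one complex FFT of half the size
    k = np.arange(N2)
    Zc = np.conj(Z[(-k) % N2])          # Z[N/2 - k], wrapping at k = 0
    Ze = 0.5 * (Z + Zc)                 # spectrum of the even samples
    Zo = -0.5j * (Z - Zc)               # spectrum of the odd samples
    X = Ze + np.exp(-2j * np.pi * k / N) * Zo   # bins 0 .. N/2-1
    nyq = (Ze[0] - Zo[0]).real                  # bin N/2 (Nyquist)
    return np.concatenate([X, [nyq]])

x = np.random.default_rng(1).standard_normal(1024)
assert np.allclose(real_fft_via_half_complex(x), np.fft.rfft(x))
```

The post-processing loop is the "comparable to another level of butterflies" cost: one pass over N/2 bins after the half-size transform.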
fft_bench v1.2.1 for PROPELLER
OpenMP not available on this system
Freq. Magnitude I Q
0, 1024, 1024, 0
192, 1023, 1023, 0
256, 1024, 1024, 0
512, 1024, -1024, 0
1024 point bit-reversal and butterfly run time = 1188 us
clock frequency = 160000000

Based on this test the P2 scored 21.5 MOPS for the real FFT and 24.8 MOPS for the complex FFT.
You just said real goes twice as fast. How come 21.5 vs 24.8 MOPS?
The real FFT halves the operation count, so theoretically it should run twice as fast. In practice it's a really nice speed boost, but not quite double. The MOPS was calculated from the formula on the FFTW page: https://www.fftw.org/speed/ Most systems experience a reduction in MFLOPS when processing real data. It doesn't matter; we just want our desired transform to finish as quickly as possible.
2059 uS complex 1024 point with bitreverse
1188 uS real 1024 point with bitreverse and postprocessing
1022 uS for a 512 point complex fft and non-optimized bitreverse
978 uS for a 512 point complex fft and optimized bitreverse
899 uS for a 512 point complex fft no bitreverse
MOPS went down for the real version because of the extra post-processing, which hasn't been optimized much yet. Those cordic operations take a while.
Oh, so the 21.5 is actually 43 MOPS. Why are they halving it?
Found a solution for a bit reversed counter: Just manually change the bits with an XOR.
        setq    ay2                'rotate (bx,by) by angle
        qrotate ax2,angle          ' %000x...
        xor     angle,##$4000_0000 ' %010x...
        add     i3,#8              ' 2*fftsize*4
        setq    ay4                'rotate (bx,by) by angle
        qrotate ax4,angle
        xor     angle,##$6000_0000 ' %001x...
        mov     ax2,ax1
        setq    by2                'rotate (bx,by) by angle
        qrotate bx2,angle
        mov     ay2,ay1
        xor     angle,##$4000_0000 ' %011x...
        setq    by4                'rotate (bx,by) by angle
        qrotate bx4,angle
        mov     angle,i3
        rev     angle

The full counter and bit reverse runs once per loop. The XORs fit into the 4 butterfly unrolled loop perfectly. Of course, in practice you would store those constants in a long to avoid the AUGS penalty.
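In Python, the same XOR idea looks like this (a generic bit-reversed counter sketch; the PASM above actually keeps a normal counter in i3 and uses REV, with the XOR constants stepping the angle inside the unrolled loop):

```python
def rev_increment(r, width):
    """Advance a bit-reversed counter one step by propagating the
    carry from the top bit downward, toggling bits with XOR."""
    bit = 1 << (width - 1)
    while r & bit:          # mirrored carry: clear set bits from the top down
        r ^= bit
        bit >>= 1
    return r ^ bit          # set the first clear bit found

# Visits indices in bit-reversed order: 0, 4, 2, 6, 1, 5, 3, 7 for width 3.
seq = [0]
for _ in range(7):
    seq.append(rev_increment(seq[-1], 3))
assert seq == [int(f"{i:03b}"[::-1], 2) for i in range(8)]
```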
Posting here to back up my work.
Now a 1024 point real data FFT runs in 982uS ! At 160MHz
I had worked on a fast 1024-point FFT a few years ago and got it down to 700us at 250MHz. I hard-coded different parts of the butterfly to speed it up. I attached it here. It's part of a program which does microphone ADC, HDMI display, and FFT.
It's not doing 43 MOPS. From the formula on the FFTW page:
1024 point complex data requires 51,200 useful operations. (100%)
512 point complex data requires 23,040 useful operations. (45%)
1024 point real data requires 25,600 useful operations. (50%)
The useful operations counts the mathematical operations in the data path like multiplies and adds. It does not include stuff like loop counters, memory reads, and so on.
Now if you have 1024 points of real number data you certainly can use a 1024 point complex fft to process it. But the CPU will end up doing twice as many operations as compared to an optimal algorithm. The 1024 point real fft is based on a 512 point complex fft, then another level of processing is required. I've made massive gains by optimizing the assembly code. It was also very much worthwhile to use an optimal algorithm that eliminates 50% of the operations required.
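Those counts come straight from the FFTW convention of 5 N log2(N) "useful operations" for a complex transform, and half of that for real input; a quick check against the times reported earlier, as a sketch:

```python
import math

def fft_ops(n, real=False):
    """FFTW's benchmark convention: 5*N*log2(N) useful operations
    for a complex FFT, half of that for real input."""
    ops = 5 * n * math.log2(n)
    return ops / 2 if real else ops

assert fft_ops(1024) == 51200              # 1024-point complex (100%)
assert fft_ops(512) == 23040               # 512-point complex (45%)
assert fft_ops(1024, real=True) == 25600   # 1024-point real (50%)

# MOPS = useful operations / elapsed microseconds
assert abs(fft_ops(1024) / 2059 - 24.8) < 0.1              # complex, 2059 us
assert abs(fft_ops(1024, real=True) / 1188 - 21.5) < 0.1   # real, 1188 us
```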
Oh!!! I had that inverted. Here I was thinking MOPS was a speed measure (Mega Operations Per Second) when it's really a demand measure (Mega OPerationS).
Thanks, Chip!
Also, the simple version you posted years earlier was one of the things that helped me understand how an fft works. The cordic operations add just the right amount of abstraction. In most code you see a complex multiply written out as individual multiplies, or some algorithm doing the complex rotate with 3 multiplies; it's easy to get lost there. I plan to include the simple version in the library for those who want to try to understand it.
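The 3-multiply complex rotate mentioned here is the Karatsuba-style factoring, for example (a plain-Python sketch, not the CORDIC path the P2 code uses):

```python
def cmul3(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using 3 multiplies instead of 4,
    trading one multiply for extra additions."""
    t1 = ar * br
    t2 = ai * bi
    t3 = (ar + ai) * (br + bi)
    return t1 - t2, t3 - t1 - t2   # (real, imag)

# (1 + 2j) * (3 + 4j) = -5 + 10j
assert cmul3(1, 2, 3, 4) == (-5, 10)
```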
It occurred to me tonight that the loops running bfly4, bfly2, and bfly1 are operating on the same data. So it works to call them all in the same loop. No need to write the data to hub ram in between. Sadly, that only saved 50uS. 1024 point real fft with ordered iq output takes 760uS at 160MHz. So 486uS at 250MHz. The bit reversal step is really starting to slow things down.
I don't really even need to keep optimizing the FFT but the ideas to speed things up keep coming. It's too much fun! I have an idea about how to split the work between multiple cogs. But first I want to squeeze as much performance out of one cog.
Yeah, optimizations can go on and on. I was talking with Shannon Mackey the other day and I was telling him that because hardware optimizations exist in the P2 that enable software optimizations, you wind up playing chess instead of checkers, which can be taxing. If we had just simple instructions like AND, OR, ADD, SUB, SHR, SHL, and no CORDIC, it would be more straightforward to write code, but everything would be slow. He pointed out that if the hardware was simple, everyone's software efforts would result in the same solutions. So, it's not simple, but there's potential for lots of optimization. Enough to wear your brain out.
Wonder if the shared LUT would help when using 2 cogs...
Maybe it could.
@SaucySoliton , how is your FFT code these days - have you made any more speed improvements? Did you try using more than one cog? I could do with a 1024 complex to complex FFT and inverse FFT for my dsp library.
Cheers, Bob
IFFT isn't much different than FFT, but I just never got around to it. If this is something you would like, I have some free time right now to work on it. I really need to clean it up and post it to the obex anyway.
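"Not much different" can be made concrete: an inverse FFT is a forward FFT on conjugated data (equivalently, with I and Q swapped) plus a 1/N scale. A NumPy sketch of that identity, for illustration:

```python
import numpy as np

def ifft_via_fft(X):
    """Inverse FFT built from a forward FFT: conjugate, transform,
    conjugate again, and scale by 1/N."""
    return np.conj(np.fft.fft(np.conj(X))) / len(X)

X = np.fft.fft(np.arange(8.0))
assert np.allclose(ifft_via_fft(X), np.arange(8.0))
```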
The ironic thing is that after doing all this work I found another FFT library: https://forums.parallax.com/discussion/comment/1466754/#Comment_1466754 There is a fundamental difference: this library does the bit reversal first (Decimation In Time).
The code in this thread does bit reversal last. (Decimation In Frequency) Usually the type of algorithm doesn't matter. If you are doing FFT filtering, the bit reversal step can be omitted.
How much speed do you need? The 1024 complex FFT is 1015 uS at 160 MHz without bit reversal; with bit reversal it is 1220 uS.
Thank you very much for offering to tidy up and obex the code. That would be great. Speed - I hope to run a software defined radio at 48 ksamples/s. With 1024 iq samples per buffer, that sets the main loop at 21.33 ms, or 42.66 ms, or 63.99 ms - at the expense of more buffers. So the fft times you mention would be acceptable. I aim to use as many cogs as I can to speed things up. I'm writing an inline assembly library - sine generator, modulator and mixer so far. Working on iq to polar and polar to iq at the moment. If I use two cogs with shared lut, it should run at 14 cycles per iq sample.
Could the bit reversal step in the fft be run optionally to suit the application? Likewise windowing?
Best regards, Bob
I'm copying the design of a software radio that first appeared a decade ago. The receiver main filter is calculated in the frequency domain as an fft fast convolution filter:-
1. Convert from time domain to frequency domain with fft
2. Frequency translate from an offset of 12khz to baseband IF of 0 Hz
3. Sideband selection
4. Generate the required bandpass filter coefficients, when user changes bandwidth. Apply Blackman-Harris windowing to this once, rather than the signal path all the time
5. Fft fast convolution filtering
6. Conversion back to time domain with ifft
I made a successful receiver in the language LabView, so reasonably confident about doing it on the P2, providing I reuse buffers as much as possible.
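Steps 1, 5 and 6 together are overlap-save fast convolution. A minimal NumPy sketch of the idea (illustrative only; the block size and filter here are placeholder assumptions, not Bob's actual design):

```python
import numpy as np

def overlap_save(x, h, nfft=1024):
    """Filter x with FIR h by multiplying spectra block by block;
    each nfft-point block yields nfft-len(h)+1 valid output samples."""
    x = np.asarray(x, float)
    M = len(h)
    L = nfft - M + 1                    # valid output samples per block
    H = np.fft.fft(h, nfft)             # filter spectrum, computed once
    T = len(x) + M - 1                  # full convolution length
    nblocks = -(-T // L)                # ceil(T / L)
    xp = np.concatenate([np.zeros(M - 1), x,
                         np.zeros((nblocks - 1) * L + nfft - T)])
    y = np.empty(nblocks * L)
    for b in range(nblocks):
        seg = xp[b * L : b * L + nfft]
        # Circular convolution; only the last L samples are free of wraparound.
        y[b * L : b * L + L] = np.fft.ifft(np.fft.fft(seg) * H).real[M - 1:]
    return y[:T]

rng = np.random.default_rng(0)
x, h = rng.standard_normal(3000), rng.standard_normal(33)
assert np.allclose(overlap_save(x, h, nfft=256), np.convolve(x, h))
```

In the radio, H would be the windowed brick-wall filter spectrum (regenerated only when the user changes bandwidth), and the per-block FFT/IFFT pair is the same transform discussed in this thread.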
I forget whether reordering was used or not, hence my question about keeping it as a discrete method. I'll look that up in the code. I think it probably was. See article https://arrl.org/files/file/Technology/tis/info/pdf/021112qex027.pdf
The code I've written so far is demoed near the bottom of https://forums.parallax.com/discussion/176026/efficiently-processing-continuous-signals#latest I've standardised on buffers that store samples as two longs, like real1, imag1, real2, imag2 - real1024, imag1024. It's best run in Spin Tools, where the Debug Scope windows are fast enough to keep up. Pnut doesn't keep up.
Cheers, Bob
Hi Bob
If you do FFT, then set most of the results to zero, and then do IFFT, isn't this rather inefficient in comparison to directly using a filter that only calculates the desired result(s), like the Goertzel algorithm? You could vary the bandwidth by modifying the block length, as you could use any block length. I wonder what an inverse Goertzel would be; just generate sin and cos?
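For reference, the Goertzel recurrence mentioned here evaluates a single DFT bin in O(N) with one coefficient per bin, so it pays off when only a few bins are wanted; a generic sketch:

```python
import math

def goertzel_power(x, k):
    """Squared magnitude of DFT bin k of sequence x via the Goertzel
    recurrence: one multiply per sample, no FFT required."""
    N = len(x)
    coeff = 2.0 * math.cos(2.0 * math.pi * k / N)
    s1 = s2 = 0.0
    for sample in x:
        s1, s2 = sample + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# A full-scale cosine at bin 50 of a 1024-point DFT has |X[50]| = 512.
x = [math.cos(2 * math.pi * 50 * n / 1024) for n in range(1024)]
assert abs(goertzel_power(x, 50) / 512.0**2 - 1) < 1e-6
```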
And a second question: why do you need to do the IFFT at all? Is this not an amplitude modulated signal anyway?
I understand that a complete FFT is interesting for a graphical display, but probably that does not need a high update rate?
Cheers Christof
Hi @"Christof Eb." , good to hear from you - Happy New Year!
Good questions! I am no mathematician, but turn instead to a four part article written by Gerald Youngblood, the CEO of FlexRadio Systems. He claims that using an FFT Fast Convolution Filter gives a far superior filter performance to any FIR filter calculated in the time domain. The desirable filter features are flat top and almost vertical deep filter sides. The fast convolution filter done in the frequency domain is the equivalent of a many, many staged FIR filter in the time domain, taking far less time to compute and with much better filter shape. This is covered in some detail in part 3 of 'sdr for the masses' under 'Fast Convolution Filter Magic'. You can see the 'brick wall' shape of the filter that results. I've been using PC based radios based on this architecture for many years and can vouch for how pleasant they are to use - good quality audio, very good adjacent signal rejection, good signal to noise, absence of ringing even at 100Hz filter width for morse code and so on. I recommend reading the whole series (attached), he's very good at explaining the various design decisions.
Ham radio operators use morse code, lower single sideband, upper single sideband, AM (rarely), FM, and many digital modes. Single sideband is efficient for long distance weak signal performance, since one sideband and the carrier are not transmitted. So sideband selection is trivial to do in the frequency domain.
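As a sketch of that last point (NumPy, with a hypothetical bin layout; not the radio's actual code): with complex IQ data the upper sideband occupies the positive-frequency bins and the lower sideband the negative-frequency bins, so selecting a sideband is just zeroing half the spectrum.

```python
import numpy as np

def select_sideband(iq, upper=True):
    """Keep one sideband of a complex IQ block by zeroing the other
    half of its spectrum (bins above N/2 are negative frequencies)."""
    X = np.fft.fft(iq)
    N = len(X)
    if upper:
        X[N // 2 + 1:] = 0      # drop negative frequencies (lower sideband)
    else:
        X[1:N // 2] = 0         # drop positive frequencies (upper sideband)
    return np.fft.ifft(X)

n = np.arange(1024)
usb = np.exp(2j * np.pi * 10 * n / 1024)     # tone at +10 bins
lsb = np.exp(-2j * np.pi * 25 * n / 1024)    # tone at -25 bins
assert np.allclose(select_sideband(usb + lsb, upper=True), usb)
assert np.allclose(select_sideband(usb + lsb, upper=False), lsb)
```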
You're right, the graphical display doesn't need to be updated more than (say) 10-15 times per second for a smooth appearance.
At the moment, I have no clear idea how much of this design can be squeezed into 500 kbytes, but I'm willing to have a go. The development of an array based dsp library is probably of general interest to any audio programmer or signal analyst, so it's worth doing anyway. The hi-speed techniques used, e.g. dual cog with shared LUT, can be used in other applications. Also, so far, I'm surprised how small the dsp functions are when written in assembly language, which is encouraging.
Cheers, Bob
Oh - and why not develop in Taqoz? I started out with Taqoz, but found having to transfer arrays to and from LabView on the PC a bit tiresome. So I turned to SPIN/PASM.
1. The debug scope windows are so easy to use and essential for debugging the dsp methods
2. The SPIN interpreter has a tiny footprint compared to the Taqoz system. Spin byte code is tiny, and there is the option in the Flexprop IDE to turn it into machine code.
3. The PNUT or Spin Tools IDEs offer single-stepping through assembly code - very handy, no need to guess what's wrong
4. The assemblers flag many more errors than the Taqoz assembler
5. It was high time I learnt how to code in SPIN/PASM, to take advantage of the Obex and the friendly help available.
I prefer and enjoy using Taqoz for some jobs, but not this one.
Cheers, Bob
Thanks @bob_g4bby for the pdfs!
I am always learning in this forum.... So in this type of receiver a large number of bins is used, and also a frequency shift.
When full speed is needed a compiler or assembler is better than Forth, ideally with good debugging possibilities.
Cheers Christof