1024-point FFT in 79 longs
cgracey
Posts: 14,134
I just made this and it works fine. It takes a hair over 3ms at 250MHz. It converts 1024 signed word samples to 512 frequency powers.
Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
'**************************** '* Fast Fourier Transform * '**************************** CON samples = $1000 'samples buffer (1024 words) real_imag = samples + 1024*2 'FFT (real,imaginary) buffer (1024*2 longs) powers = real_imag + 1024*2*4 'FFT powers buffer (512 longs) DAT org ' ' Fast Fourier transform ' ' n=1023 ' powers(k) = hypotenus sum samples(n) [cos(nk) - j sin(nk)], k=0-1023 ' n=0 ' ' on entry: samples must hold 1024 signed long samples ' on exit: powers holds 512 unsigned words indicating powers for frequency divisions ' ' ' Load 1024 (real,imaginary) coordinate pairs ' fft setq2 #1024/2-1 'get 1024 word samples into lut rdlong 0,samples_ptr mov ptra,#0 'write samples to real_imag as (sample,0) longs wrfast #0,real_imag_ptr rep @.r,##1024/2 rdlut a1,ptra++ getword a2,a1,#1 signx a1,#15 signx a2,#15 wflong a1 wflong #0 wflong a2 wflong #0 .r ' Perform 1024-point FFT decod i1,#encod(1024 >> 1) mov i2,#1 .loop1 mov i3,#0 mov i4,i1 mov c1,i2 .loop2 encod n,i1 mov angle,i3 shr angle,n rev angle mov ptra,i3 shl ptra,#3 add ptra,real_imag_ptr mov ptrb,i1 shl ptrb,#3 add ptrb,ptra mov c2,i4 sub c2,i3 .loop3 setq #2-1 rdlong a1,ptra setq #2-1 rdlong b1,ptrb qrotate b1,angle qrotate b2,angle getqx b1 getqy b2 getqx n add b2,n getqy n sub b1,n add a1,b1 add a2,b2 setq #2-1 wrlong a1,ptra++ sub a1,b1 sub a1,b1 sub a2,b2 sub a2,b2 setq #2-1 wrlong a1,ptrb++ djnz c2,#.loop3 add i3,i1 add i3,i1 add i4,i1 add i4,i1 djnz c1,#.loop2 shr i1,#1 shl i2,#1 tjnz i1,#.loop1 ' Convert FFT outputs to powers wrfast #0,powers_ptr 'read to store powers mov i1,#0 decod c1,#encod(1024 >> 1) .convert mov ptra,i1 'get (real,imaginary) coordinate pair rev ptra shr ptra,#32 - (encod 1024) - 3 '-3 for two longs per pair add ptra,real_imag_ptr setq #2-1 rdlong a1,ptra qvector a1,a2 'convert pair to polar to get power getqx n wflong n 'store power add i1,#1 _ret_ djnz c1,#.convert ' ' ' Data ' samples_ptr long samples real_imag_ptr long real_imag powers_ptr long powers c1 res 1 c2 res 1 i1 res 1 i2 res 1 i3 res 1 i4 res 1 angle res 1 a1 res 1 a2 res 1 b1 res 1 b2 res 1 n res 1
Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
Comments
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
Thanks for sharing this example.
Reinhard
edit:
Cheers to the rev command.
I wanted to assemble that next, but there is ;-)
I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.
We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.
In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.
Thank you, I live in Germany and I don't know how complicated it is to send.
The internal ADC are modest, ok for a MCU. Tests on the ES1 silicon, had 10~12bit ADC, which varied with supply and pin and MHz.
An external ADC will be needed to highest performance.
P2 can support the external isolated ADCs, which give 78kHz BW and 12~14b useful from a 16b result.
Thank you.
I mean the P2 has enough pins if a very high-performance application is needed.
a flash ADC can also be connected, which loads the data in parallel.
I just have no concrete idea at the moment how it can be synchronized.
A parallel processing, that would be a huge thing.
I think it is possible.
On the road now. PM in about an hour.
I haven't looked at pipelined CORDIC yet. Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between? Eight A's and B's in the inner loop look to be optimal, at first glance. More accurately, anything less looks to be sub-optimal.
Pity there's no SUB2 (D = D - 2*S), nor for other applications an atomic SHR & ZEROX, the latter operation using the spare S bits, similar to BITx.
The core of the inner loop of the FFT has this:
I started staring at that sequence, thinking there must be some simple objective to what it does. I had transcribed this code from some 80386 assembly code I had written 16 years ago, so I was just trying to do it faithfully, without really understanding WHAT it was doing. Actually, I never before REALIZED what it was doing.
Looking at that sequence, my brain started to remember the (X,Y) rotation equations:
x′=xcosθ−ysinθ
y′=ycosθ+xsinθ
That seemed to be maybe what it was doing.
So, I substituted a single CORDIC operation for that whole sequence:
AND IT WORKED!!!!!
So, the CORDIC can really whip the core of the FFT problem. This just leaves some adds and subtracts in the inner loop. Cool!!!!! This can be sped up WAY more than I originally thought.
We could read 16 pairs at a time and then stuff the CORDIC pipeline.
This FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:
512
256
128
64
32
16
8
4
2
1
So, if we hard-coded 16 rotations+adds/subs, the extra operations just wouldn't be saved in the last four runs of 8, 4, 2, and 1. That would speed things way up and not waste much time.
Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.
But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...
The ADC pulls it to its own center, which is the threshold of a particular inverter, and close to VIO/2.
One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:
https://www.edn.com/windowing-functions-improve-fft-results-part-i/
(a1-2*b1, a2-2*b2)
???
A1 and A2 previously had B1 and B2 added to them. So we need to subtract twice now.
Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.
512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512
I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.
All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.
The 'Hanning' window is another name for Hann that we use in scope mode.
There are some free datasets available these days:
https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/
I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?
There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.
Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.
By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.
I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.
Tony, how would you reorder that?
I have it like this now, for readability:
Could we use REP in the inner loop?
We could use REP, but it wouldn't save much time. Also, it would block interrupts.