1024-point FFT in 79 longs

I just made this and it works fine. It takes a hair over 3ms at 250MHz. It converts 1024 signed word samples to 512 frequency powers.
Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
'****************************
'* Fast Fourier Transform *
'****************************
CON samples = $1000 'samples buffer (1024 words)
real_imag = samples + 1024*2 'FFT (real,imaginary) buffer (1024*2 longs)
powers = real_imag + 1024*2*4 'FFT powers buffer (512 longs)
DAT org
'
' Fast Fourier transform
'
' n=1023
' powers(k) = hypotenus sum samples(n) [cos(nk) - j sin(nk)], k=0-1023
' n=0
'
' on entry: samples must hold 1024 signed long samples
' on exit: powers holds 512 unsigned words indicating powers for frequency divisions
'
'
' Load 1024 (real,imaginary) coordinate pairs
'
fft setq2 #1024/2-1 'get 1024 word samples into lut
rdlong 0,samples_ptr
mov ptra,#0 'write samples to real_imag as (sample,0) longs
wrfast #0,real_imag_ptr
rep @.r,##1024/2
rdlut a1,ptra++
getword a2,a1,#1
signx a1,#15
signx a2,#15
wflong a1
wflong #0
wflong a2
wflong #0
.r
' Perform 1024-point FFT
decod i1,#encod(1024 >> 1)
mov i2,#1
.loop1 mov i3,#0
mov i4,i1
mov c1,i2
.loop2 encod n,i1
mov angle,i3
shr angle,n
rev angle
mov ptra,i3
shl ptra,#3
add ptra,real_imag_ptr
mov ptrb,i1
shl ptrb,#3
add ptrb,ptra
mov c2,i4
sub c2,i3
.loop3 setq #2-1
rdlong a1,ptra
setq #2-1
rdlong b1,ptrb
qrotate b1,angle
qrotate b2,angle
getqx b1
getqy b2
getqx n
add b2,n
getqy n
sub b1,n
add a1,b1
add a2,b2
setq #2-1
wrlong a1,ptra++
sub a1,b1
sub a1,b1
sub a2,b2
sub a2,b2
setq #2-1
wrlong a1,ptrb++
djnz c2,#.loop3
add i3,i1
add i3,i1
add i4,i1
add i4,i1
djnz c1,#.loop2
shr i1,#1
shl i2,#1
tjnz i1,#.loop1
' Convert FFT outputs to powers
wrfast #0,powers_ptr 'read to store powers
mov i1,#0
decod c1,#encod(1024 >> 1)
.convert mov ptra,i1 'get (real,imaginary) coordinate pair
rev ptra
shr ptra,#32 - (encod 1024) - 3 '-3 for two longs per pair
add ptra,real_imag_ptr
setq #2-1
rdlong a1,ptra
qvector a1,a2 'convert pair to polar to get power
getqx n
wflong n 'store power
add i1,#1
_ret_ djnz c1,#.convert
'
'
' Data
'
samples_ptr long samples
real_imag_ptr long real_imag
powers_ptr long powers
c1 res 1
c2 res 1
i1 res 1
i2 res 1
i3 res 1
i4 res 1
angle res 1
a1 res 1
a2 res 1
b1 res 1
b2 res 1
n res 1
Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
Comments
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
Thanks for sharing this example.
Reinhard
edit:
Cheers to the rev command.
I wanted to assemble that next, but there is ;-)
I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.
We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.
In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.
Thank you, I live in Germany and I don't know how complicated it is to send.
The internal ADC are modest, ok for a MCU. Tests on the ES1 silicon, had 10~12bit ADC, which varied with supply and pin and MHz.
An external ADC will be needed to highest performance.
P2 can support the external isolated ADCs, which give 78kHz BW and 12~14b useful from a 16b result.
Thank you.
I mean the P2 has enough pins if a very high-performance application is needed.
a flash ADC can also be connected, which loads the data in parallel.
I just have no concrete idea at the moment how it can be synchronized.
A parallel processing, that would be a huge thing.
I think it is possible.
On the road now. PM in about an hour.
I haven't looked at pipelined CORDIC yet. Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between? Eight A's and B's in the inner loop look to be optimal, at first glance. More accurately, anything less looks to be sub-optimal.
Pity there's no SUB2 (D = D - 2*S), nor for other applications an atomic SHR & ZEROX, the latter operation using the spare S bits, similar to BITx.
The core of the inner loop of the FFT has this:
qrotate b1,angle qrotate b2,angle getqx b1 getqy b2 getqx n add b2,n getqy n sub b1,n
I started staring at that sequence, thinking there must be some simple objective to what it does. I had transcribed this code from some 80386 assembly code I had written 16 years ago, so I was just trying to do it faithfully, without really understanding WHAT it was doing. Actually, I never before REALIZED what it was doing.
Looking at that sequence, my brain started to remember the (X,Y) rotation equations:
x′=xcosθ−ysinθ
y′=ycosθ+xsinθ
That seemed to be maybe what it was doing.
So, I substituted a single CORDIC operation for that whole sequence:
setq b2 'rotate (b1,b2) by angle qrotate b1,angle getqx b1 getqy b2
AND IT WORKED!!!!!
So, the CORDIC can really whip the core of the FFT problem. This just leaves some adds and subtracts in the inner loop. Cool!!!!! This can be sped up WAY more than I originally thought.
.loop3 setq #2-1 'read (a1,a2) rdlong a1,ptra setq #2-1 'read (b1,b2) rdlong b1,ptrb setq b2 'rotate (b1,b2) by angle qrotate b1,angle getqx b1 getqy b2 add a1,b1 'write (a1+b1,a2+b2) to (a1,a2) add a2,b2 setq #2-1 wrlong a1,ptra++ sub a1,b1 'write (a1-b1,a2-b2) to (b1,b2) sub a1,b1 sub a2,b2 sub a2,b2 setq #2-1 wrlong a1,ptrb++ djnz c2,#.loop3
We could read 16 pairs at a time and then stuff the CORDIC pipeline.
This FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:
512
256
128
64
32
16
8
4
2
1
So, if we hard-coded 16 rotations+adds/subs, the extra operations just wouldn't be saved in the last four runs of 8, 4, 2, and 1. That would speed things way up and not waste much time.
Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.
But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...
The ADC pulls it to its own center, which is the threshold of a particular inverter, and close to VIO/2.
One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:
https://www.edn.com/windowing-functions-improve-fft-results-part-i/
sub a1,b1 'write (a1-b1,a2-b2) to (b1,b2) sub a1,b1 sub a2,b2 sub a2,b2 setq #2-1 wrlong a1,ptrb++
This seems to be doing(a1-2*b1, a2-2*b2)
???
A1 and A2 previously had B1 and B2 added to them. So we need to subtract twice now.
Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.
512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512
I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.
All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.
The 'Hanning' window is another name for Hann that we use in scope mode.
There are some free datasets available these days:
https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/
I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?
There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.
Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.
By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.
I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.
Tony, how would you reorder that?
I have it like this now, for readability:
.loop3 setq #2-1 'read (bx,by) rdlong bx,ptrb setq by 'rotate (bx,by) by angle qrotate bx,angle setq #2-1 'read (ax,ay) rdlong ax,ptra getqx bx 'get rotated (bx,by) getqy by add ax,bx '(ax,ay) = (ax+bx,ay+by) add ay,by shl bx,#1 '(bx,by) = (ax-bx,ay-by) subr bx,ax shl by,#1 subr by,ay setq #2-1 'write (ax,ay) wrlong ax,ptra++ setq #2-1 'write (bx,by) wrlong bx,ptrb++ djnz c2,#.loop3
Could we use REP in the inner loop?
We could use REP, but it wouldn't save much time. Also, it would block interrupts.