1024-point FFT in 79 longs — Parallax Forums

# 1024-point FFT in 79 longs

Posts: 14,112
edited 2019-12-17 17:45
I just made this and it works fine. It takes a hair over 3ms at 250MHz. It converts 1024 signed word samples to 512 frequency powers.
```'****************************
'*  Fast Fourier Transform  *
'****************************

CON		samples		= \$1000				'samples buffer (1024 words)
real_imag	= samples	+ 1024*2	'FFT (real,imaginary) buffer (1024*2 longs)
powers		= real_imag	+ 1024*2*4	'FFT powers buffer (512 longs)

DAT		org
'
' Fast Fourier transform
'
'			  n=1023
'  powers(k) =  hypotenus sum samples(n) [cos(nk) - j sin(nk)], k=0-1023
'			  n=0
'
' on entry:	samples must hold 1024 signed long samples
' on exit:	powers holds 512 unsigned words indicating powers for frequency divisions
'
'
' Load 1024 (real,imaginary) coordinate pairs
'
fft		setq2	#1024/2-1		'get 1024 word samples into lut
rdlong	0,samples_ptr

mov	ptra,#0			'write samples to real_imag as (sample,0) longs
wrfast	#0,real_imag_ptr

rep	@.r,##1024/2
rdlut	a1,ptra++
getword	a2,a1,#1
signx	a1,#15
signx	a2,#15
wflong	a1
wflong	#0
wflong	a2
wflong	#0
.r

' Perform 1024-point FFT

decod	i1,#encod(1024 >> 1)
mov	i2,#1

.loop1		mov	i3,#0
mov	i4,i1

mov	c1,i2

.loop2		encod	n,i1
mov	angle,i3
shr	angle,n
rev	angle

mov	ptra,i3
shl	ptra,#3

mov	ptrb,i1
shl	ptrb,#3

mov	c2,i4
sub	c2,i3

.loop3		setq	#2-1
rdlong	a1,ptra

setq	#2-1
rdlong	b1,ptrb

qrotate	b1,angle

qrotate	b2,angle

getqx	b1
getqy	b2

getqx	n
getqy	n
sub	b1,n

setq	#2-1
wrlong	a1,ptra++

sub	a1,b1
sub	a1,b1
sub	a2,b2
sub	a2,b2

setq	#2-1
wrlong	a1,ptrb++

djnz	c2,#.loop3

djnz	c1,#.loop2

shr	i1,#1
shl	i2,#1

tjnz	i1,#.loop1

' Convert FFT outputs to powers

wrfast	#0,powers_ptr			'read to store powers
mov	i1,#0
decod	c1,#encod(1024 >> 1)

.convert	mov	ptra,i1				'get (real,imaginary) coordinate pair
rev	ptra
shr	ptra,#32 - (encod 1024) - 3	'-3 for two longs per pair
setq	#2-1
rdlong	a1,ptra

qvector	a1,a2				'convert pair to polar to get power
getqx	n

wflong	n				'store power

_ret_	djnz	c1,#.convert
'
'
' Data
'
samples_ptr	long	samples
real_imag_ptr	long	real_imag
powers_ptr	long	powers

c1		res	1
c2		res	1
i1		res	1
i2		res	1
i3		res	1
i4		res	1
angle		res	1
a1		res	1
a2		res	1
b1		res	1
b2		res	1
n		res	1
```

Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
«13

• Posts: 489
edited 2019-12-17 17:32
WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
Thanks for sharing this example.
Reinhard

edit:
Cheers to the rev command.
I wanted to assemble that next, but there is ;-)
• Posts: 14,112
I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.
• Posts: 12,366
edited 2019-12-17 17:56
Reinhard wrote: »
WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

• Posts: 489

In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.
• Posts: 489
Publison wrote: »
Reinhard wrote: »
WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

Thank you, I live in Germany and I don't know how complicated it is to send.
• Posts: 15,140
Reinhard wrote: »
In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.

The internal ADC are modest, ok for a MCU. Tests on the ES1 silicon, had 10~12bit ADC, which varied with supply and pin and MHz.
An external ADC will be needed to highest performance.
P2 can support the external isolated ADCs, which give 78kHz BW and 12~14b useful from a 16b result.
• Posts: 489

Thank you.
I mean the P2 has enough pins if a very high-performance application is needed.
a flash ADC can also be connected, which loads the data in parallel.
• Posts: 13,711
I wonder how much using two cogs with shared LUT would speed things up...
• Posts: 489
Yes, I've already thought of that.
I just have no concrete idea at the moment how it can be synchronized.
A parallel processing, that would be a huge thing.
I think it is possible.
• Posts: 12,366
Reinhard wrote: »
Publison wrote: »
Reinhard wrote: »
WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

Thank you, I live in Germany and I don't know how complicated it is to send.

• Posts: 2,099
edited 2019-12-18 02:57
cgracey wrote: »
I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.

I haven't looked at pipelined CORDIC yet. Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between? Eight A's and B's in the inner loop look to be optimal, at first glance. More accurately, anything less looks to be sub-optimal.

Pity there's no SUB2 (D = D - 2*S), nor for other applications an atomic SHR & ZEROX, the latter operation using the spare S bits, similar to BITx.
• Posts: 15,043
edited 2019-12-17 22:48
TonyB_ wrote: »
... Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between?
Correct for an 8-cog prop2. For a 16-cog prop2 the tightest is every 16 clocks, with up to seven 2-cycle instructions between.

• Posts: 14,112
edited 2019-12-19 19:29
I just had a huge realization!

The core of the inner loop of the FFT has this:
```		qrotate	b1,angle

qrotate	b2,angle

getqx	b1
getqy	b2

getqx	n
getqy	n
sub	b1,n
```

I started staring at that sequence, thinking there must be some simple objective to what it does. I had transcribed this code from some 80386 assembly code I had written 16 years ago, so I was just trying to do it faithfully, without really understanding WHAT it was doing. Actually, I never before REALIZED what it was doing.

Looking at that sequence, my brain started to remember the (X,Y) rotation equations:

x′=xcosθ−ysinθ
y′=ycosθ+xsinθ

That seemed to be maybe what it was doing.

So, I substituted a single CORDIC operation for that whole sequence:
```		setq	b2		'rotate (b1,b2) by angle
qrotate	b1,angle
getqx	b1
getqy	b2
```

AND IT WORKED!!!!!

So, the CORDIC can really whip the core of the FFT problem. This just leaves some adds and subtracts in the inner loop. Cool!!!!! This can be sped up WAY more than I originally thought.
• Posts: 14,112
edited 2019-12-19 20:29
The inner loop of the FFT now looks like this. Very straightforward:
```.loop3		setq	#2-1		'read (a1,a2)
rdlong	a1,ptra

rdlong	b1,ptrb

setq	b2		'rotate (b1,b2) by angle
qrotate	b1,angle
getqx	b1
getqy	b2

add	a1,b1		'write (a1+b1,a2+b2) to (a1,a2)
setq	#2-1
wrlong	a1,ptra++

sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
sub	a1,b1
sub	a2,b2
sub	a2,b2
setq	#2-1
wrlong	a1,ptrb++

djnz	c2,#.loop3
```
• Posts: 13,711
edited 2019-12-19 19:40
wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx…
• Posts: 14,112
edited 2019-12-19 20:00
Rayman wrote: »
wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx...

We could read 16 pairs at a time and then stuff the CORDIC pipeline.

This FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512
256
128
64
32
16
8
4
2
1

So, if we hard-coded 16 rotations+adds/subs, the extra operations just wouldn't be saved in the last four runs of 8, 4, 2, and 1. That would speed things way up and not waste much time.
• Posts: 13,711
Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...
• Posts: 14,112
Rayman wrote: »
Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...

The ADC pulls it to its own center, which is the threshold of a particular inverter, and close to VIO/2.
• Posts: 13,711
Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/
• Posts: 18,066
```		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
sub	a1,b1
sub	a2,b2
sub	a2,b2
setq	#2-1
wrlong	a1,ptrb++
```
This seems to be doing
(a1-2*b1, a2-2*b2)
???
• Posts: 14,112
Cluso99 wrote: »
```		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
sub	a1,b1
sub	a2,b2
sub	a2,b2
setq	#2-1
wrlong	a1,ptrb++
```
This seems to be doing
(a1-2*b1, a2-2*b2)
???

A1 and A2 previously had B1 and B2 added to them. So we need to subtract twice now.
• Posts: 14,112
Rayman wrote: »
Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/

Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.
• Posts: 14,112
Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.
• Posts: 2,099
cgracey wrote: »
Rayman wrote: »
Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/

Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.

The 'Hanning' window is another name for Hann that we use in scope mode.
• Posts: 13,711
edited 2019-12-20 01:24
Seems we could use this for voice recognition...
There are some free datasets available these days:
• Posts: 2,099
cgracey wrote: »
Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.
• Posts: 14,112
Rayman wrote: »
Seems we could use this for voice recognition...
There are some free datasets available these days:

Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.

By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.

I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.
• Posts: 14,112
TonyB_ wrote: »
cgracey wrote: »
Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.

Tony, how would you reorder that?

I have it like this now, for readability:
```.loop3		setq	#2-1		'read (bx,by)
rdlong	bx,ptrb

setq	by		'rotate (bx,by) by angle
qrotate	bx,angle

rdlong	ax,ptra

getqx	bx		'get rotated (bx,by)
getqy	by

shl	bx,#1		'(bx,by) = (ax-bx,ay-by)
subr	bx,ax
shl	by,#1
subr	by,ay

setq	#2-1		'write (ax,ay)
wrlong	ax,ptra++

setq	#2-1		'write (bx,by)
wrlong	bx,ptrb++

djnz	c2,#.loop3
```
• Posts: 2,099
edited 2019-12-21 21:05
Chip, what you have now is excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

Could we use REP in the inner loop?
• Posts: 14,112
TonyB_ wrote: »
Chip, what you have now is really excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

Could we use REP in the inner loop?

We could use REP, but it wouldn't save much time. Also, it would block interrupts.