1024-point FFT in 79 longs

cgracey · 2019-12-17 15:27

I just made this and it works fine. It takes a hair over 3ms at 250MHz. It converts 1024 signed word samples to 512 frequency powers.

'****************************
'*  Fast Fourier Transform  *
'****************************

CON		samples		= $1000				'samples buffer (1024 words)
		real_imag	= samples	+ 1024*2	'FFT (real,imaginary) buffer (1024*2 longs)
		powers		= real_imag	+ 1024*2*4	'FFT powers buffer (512 longs)

DAT		org
'
' Fast Fourier transform
'
'			  n=1023
'  powers(k) =  hypotenus sum samples(n) [cos(nk) - j sin(nk)], k=0-1023
'			  n=0
'
' on entry:	samples must hold 1024 signed long samples
' on exit:	powers holds 512 unsigned words indicating powers for frequency divisions
'
'
' Load 1024 (real,imaginary) coordinate pairs
'
fft		setq2	#1024/2-1		'get 1024 word samples into lut
		rdlong	0,samples_ptr

		mov	ptra,#0			'write samples to real_imag as (sample,0) longs
		wrfast	#0,real_imag_ptr

		rep	@.r,##1024/2
		rdlut	a1,ptra++
		getword	a2,a1,#1
		signx	a1,#15
		signx	a2,#15
		wflong	a1
		wflong	#0
		wflong	a2
		wflong	#0
.r

' Perform 1024-point FFT

		decod	i1,#encod(1024 >> 1)
		mov	i2,#1


.loop1		mov	i3,#0
		mov	i4,i1

		mov	c1,i2


.loop2		encod	n,i1
		mov	angle,i3
		shr	angle,n
		rev	angle

		mov	ptra,i3
		shl	ptra,#3
		add	ptra,real_imag_ptr

		mov	ptrb,i1
		shl	ptrb,#3
		add	ptrb,ptra

		mov	c2,i4
		sub	c2,i3


.loop3		setq	#2-1
		rdlong	a1,ptra

		setq	#2-1
		rdlong	b1,ptrb

		qrotate	b1,angle

		qrotate	b2,angle

		getqx	b1
		getqy	b2

		getqx	n
		add	b2,n
		getqy	n
		sub	b1,n

		add	a1,b1
		add	a2,b2

		setq	#2-1
		wrlong	a1,ptra++

		sub	a1,b1
		sub	a1,b1
		sub	a2,b2
		sub	a2,b2

		setq	#2-1
		wrlong	a1,ptrb++

		djnz	c2,#.loop3


		add	i3,i1
		add	i3,i1
		add	i4,i1
		add	i4,i1

		djnz	c1,#.loop2


		shr	i1,#1
		shl	i2,#1

		tjnz	i1,#.loop1


' Convert FFT outputs to powers

		wrfast	#0,powers_ptr			'read to store powers
		mov	i1,#0
		decod	c1,#encod(1024 >> 1)

.convert	mov	ptra,i1				'get (real,imaginary) coordinate pair
		rev	ptra
		shr	ptra,#32 - (encod 1024) - 3	'-3 for two longs per pair
		add	ptra,real_imag_ptr
		setq	#2-1
		rdlong	a1,ptra

		qvector	a1,a2				'convert pair to polar to get power
		getqx	n

		wflong	n				'store power

		add	i1,#1
	_ret_	djnz	c1,#.convert
'
'
' Data
'
samples_ptr	long	samples
real_imag_ptr	long	real_imag
powers_ptr	long	powers

c1		res	1
c2		res	1
i1		res	1
i2		res	1
i3		res	1
i4		res	1
angle		res	1
a1		res	1
a2		res	1
b1		res	1
b2		res	1
n		res	1

Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.

Reinhard · 2019-12-17 17:11

WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
Thanks for sharing this example.
Reinhard

edit:
Cheers to the rev command.
I wanted to assemble that next, but there is ;-)

cgracey · 2019-12-17 17:41

I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.

Publison · 2019-12-17 17:56

Reinhard wrote: »

WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

Reinhard · 2019-12-17 18:02

In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.

Reinhard · 2019-12-17 18:05

Publison wrote: »

Reinhard wrote: »

WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

Thank you, I live in Germany and I don't know how complicated it is to send.

jmg · 2019-12-17 18:27

Reinhard wrote: »

In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.

The internal ADC are modest, ok for a MCU. Tests on the ES1 silicon, had 10~12bit ADC, which varied with supply and pin and MHz.
An external ADC will be needed to highest performance.
P2 can support the external isolated ADCs, which give 78kHz BW and 12~14b useful from a 16b result.

Reinhard · 2019-12-17 18:36

Thank you.
I mean the P2 has enough pins if a very high-performance application is needed.
a flash ADC can also be connected, which loads the data in parallel.

Rayman · 2019-12-17 18:54

I wonder how much using two cogs with shared LUT would speed things up...

Reinhard · 2019-12-17 19:03

Yes, I've already thought of that.
I just have no concrete idea at the moment how it can be synchronized.
A parallel processing, that would be a huge thing.
I think it is possible.

Publison · 2019-12-17 19:11

Reinhard wrote: »

Publison wrote: »

Reinhard wrote: »

WOW!!!
With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
I will also need a demo board soon. Testing with the simulator is very tedious.
)

We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
PM me if you are interested.

Thank you, I live in Germany and I don't know how complicated it is to send.

On the road now. PM in about an hour.

TonyB_ · 2019-12-17 20:22

cgracey wrote: »

I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.

I haven't looked at pipelined CORDIC yet. Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between? Eight A's and B's in the inner loop look to be optimal, at first glance. More accurately, anything less looks to be sub-optimal.

Pity there's no SUB2 (D = D - 2*S), nor for other applications an atomic SHR & ZEROX, the latter operation using the spare S bits, similar to BITx.

evanh · 2019-12-17 22:47

TonyB_ wrote: »

... Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between?

Correct for an 8-cog prop2. For a 16-cog prop2 the tightest is every 16 clocks, with up to seven 2-cycle instructions between.

cgracey · 2019-12-19 19:16

I just had a huge realization!

The core of the inner loop of the FFT has this:

		qrotate	b1,angle

		qrotate	b2,angle

		getqx	b1
		getqy	b2

		getqx	n
		add	b2,n
		getqy	n
		sub	b1,n

I started staring at that sequence, thinking there must be some simple objective to what it does. I had transcribed this code from some 80386 assembly code I had written 16 years ago, so I was just trying to do it faithfully, without really understanding WHAT it was doing. Actually, I never before REALIZED what it was doing.

Looking at that sequence, my brain started to remember the (X,Y) rotation equations:

x′=xcosθ−ysinθ
y′=ycosθ+xsinθ

That seemed to be maybe what it was doing.

So, I substituted a single CORDIC operation for that whole sequence:

		setq	b2		'rotate (b1,b2) by angle
		qrotate	b1,angle
		getqx	b1
		getqy	b2

AND IT WORKED!!!!!

So, the CORDIC can really whip the core of the FFT problem. This just leaves some adds and subtracts in the inner loop. Cool!!!!! This can be sped up WAY more than I originally thought.

cgracey · 2019-12-19 19:33

The inner loop of the FFT now looks like this. Very straightforward:

.loop3		setq	#2-1		'read (a1,a2)
		rdlong	a1,ptra

		setq	#2-1		'read (b1,b2)
		rdlong	b1,ptrb

		setq	b2		'rotate (b1,b2) by angle
		qrotate	b1,angle
		getqx	b1
		getqy	b2

		add	a1,b1		'write (a1+b1,a2+b2) to (a1,a2)
		add	a2,b2
		setq	#2-1
		wrlong	a1,ptra++

		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
		sub	a1,b1
		sub	a2,b2
		sub	a2,b2
		setq	#2-1
		wrlong	a1,ptrb++

		djnz	c2,#.loop3

Rayman · 2019-12-19 19:35

wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx…

cgracey · 2019-12-19 19:58

Rayman wrote: »

wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx...

We could read 16 pairs at a time and then stuff the CORDIC pipeline.

This FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512
256
128
64
32
16
8
4
2
1

So, if we hard-coded 16 rotations+adds/subs, the extra operations just wouldn't be saved in the last four runs of 8, 4, 2, and 1. That would speed things way up and not waste much time.

Rayman · 2019-12-19 20:31

Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...

cgracey · 2019-12-19 20:35

Rayman wrote: »

Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

But, don't you need to bias the pin at Vdd/2 for that to work right?
Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...

The ADC pulls it to its own center, which is the threshold of a particular inverter, and close to VIO/2.

Rayman · 2019-12-19 20:45

Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/

Cluso99 · 2019-12-19 20:47

		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
		sub	a1,b1
		sub	a2,b2
		sub	a2,b2
		setq	#2-1
		wrlong	a1,ptrb++

This seems to be doing
(a1-2*b1, a2-2*b2)
???

cgracey · 2019-12-19 20:50

Cluso99 wrote: »
		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
		sub	a1,b1
		sub	a2,b2
		sub	a2,b2
		setq	#2-1
		wrlong	a1,ptrb++
This seems to be doing
(a1-2*b1, a2-2*b2)
???

A1 and A2 previously had B1 and B2 added to them. So we need to subtract twice now.

cgracey · 2019-12-19 20:52

Rayman wrote: »

Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/

Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.

cgracey · 2019-12-19 21:22

Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

TonyB_ · 2019-12-20 00:50

cgracey wrote: »

Rayman wrote: »

Ok, interesting. That's a nice feature.

One other thing...
I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

https://www.edn.com/windowing-functions-improve-fft-results-part-i/

Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.

The 'Hanning' window is another name for Hann that we use in scope mode.

Rayman · 2019-12-20 01:24

Seems we could use this for voice recognition...
There are some free datasets available these days:
https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/

TonyB_ · 2019-12-20 01:34

cgracey wrote: »

Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.

cgracey · 2019-12-20 02:39

Rayman wrote: »

Seems we could use this for voice recognition...
There are some free datasets available these days:
https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/

Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.

By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.

I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.

cgracey · 2019-12-20 02:40

TonyB_ wrote: »

cgracey wrote: »

Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

512 x 1 time
256 x 2 times
128 x 4
64 x 8
32 x 16
16 x 32
8 x 64
4 x 128
2 x 256
1 x 512

I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.

Tony, how would you reorder that?

I have it like this now, for readability:

.loop3		setq	#2-1		'read (bx,by)
		rdlong	bx,ptrb

		setq	by		'rotate (bx,by) by angle
		qrotate	bx,angle

		setq	#2-1		'read (ax,ay)
		rdlong	ax,ptra

		getqx	bx		'get rotated (bx,by)
		getqy	by

		add	ax,bx		'(ax,ay) = (ax+bx,ay+by)
		add	ay,by

		shl	bx,#1		'(bx,by) = (ax-bx,ay-by)
		subr	bx,ax
		shl	by,#1
		subr	by,ay

		setq	#2-1		'write (ax,ay)
		wrlong	ax,ptra++

		setq	#2-1		'write (bx,by)
		wrlong	bx,ptrb++

		djnz	c2,#.loop3

TonyB_ · 2019-12-20 03:22

Chip, what you have now is excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

Could we use REP in the inner loop?

cgracey · 2019-12-20 03:32

TonyB_ wrote: »

Chip, what you have now is really excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

Could we use REP in the inner loop?

We could use REP, but it wouldn't save much time. Also, it would block interrupts.

1024-point FFT in 79 longs

Comments