Shop OBEX P1 Docs P2 Docs Learn Events
1024-point FFT in 79 longs — Parallax Forums

1024-point FFT in 79 longs

cgraceycgracey Posts: 14,206
edited 2019-12-17 17:45 in Propeller 2
I just made this and it works fine. It takes a hair over 3ms at 250MHz. It converts 1024 signed word samples to 512 frequency powers.
'****************************
'*  Fast Fourier Transform  *
'****************************

CON		samples		= $1000				'samples buffer (1024 words)
		real_imag	= samples	+ 1024*2	'FFT (real,imaginary) buffer (1024*2 longs)
		powers		= real_imag	+ 1024*2*4	'FFT powers buffer (512 longs)

DAT		org
'
' Fast Fourier transform
'
'			  n=1023
'  powers(k) =  hypotenus sum samples(n) [cos(nk) - j sin(nk)], k=0-1023
'			  n=0
'
' on entry:	samples must hold 1024 signed long samples
' on exit:	powers holds 512 unsigned words indicating powers for frequency divisions
'
'
' Load 1024 (real,imaginary) coordinate pairs
'
fft		setq2	#1024/2-1		'get 1024 word samples into lut
		rdlong	0,samples_ptr

		mov	ptra,#0			'write samples to real_imag as (sample,0) longs
		wrfast	#0,real_imag_ptr

		rep	@.r,##1024/2
		rdlut	a1,ptra++
		getword	a2,a1,#1
		signx	a1,#15
		signx	a2,#15
		wflong	a1
		wflong	#0
		wflong	a2
		wflong	#0
.r

' Perform 1024-point FFT

		decod	i1,#encod(1024 >> 1)
		mov	i2,#1


.loop1		mov	i3,#0
		mov	i4,i1

		mov	c1,i2


.loop2		encod	n,i1
		mov	angle,i3
		shr	angle,n
		rev	angle

		mov	ptra,i3
		shl	ptra,#3
		add	ptra,real_imag_ptr

		mov	ptrb,i1
		shl	ptrb,#3
		add	ptrb,ptra

		mov	c2,i4
		sub	c2,i3


.loop3		setq	#2-1
		rdlong	a1,ptra

		setq	#2-1
		rdlong	b1,ptrb

		qrotate	b1,angle

		qrotate	b2,angle

		getqx	b1
		getqy	b2

		getqx	n
		add	b2,n
		getqy	n
		sub	b1,n

		add	a1,b1
		add	a2,b2

		setq	#2-1
		wrlong	a1,ptra++

		sub	a1,b1
		sub	a1,b1
		sub	a2,b2
		sub	a2,b2

		setq	#2-1
		wrlong	a1,ptrb++

		djnz	c2,#.loop3


		add	i3,i1
		add	i3,i1
		add	i4,i1
		add	i4,i1

		djnz	c1,#.loop2


		shr	i1,#1
		shl	i2,#1

		tjnz	i1,#.loop1


' Convert FFT outputs to powers

		wrfast	#0,powers_ptr			'read to store powers
		mov	i1,#0
		decod	c1,#encod(1024 >> 1)

.convert	mov	ptra,i1				'get (real,imaginary) coordinate pair
		rev	ptra
		shr	ptra,#32 - (encod 1024) - 3	'-3 for two longs per pair
		add	ptra,real_imag_ptr
		setq	#2-1
		rdlong	a1,ptra

		qvector	a1,a2				'convert pair to polar to get power
		getqx	n

		wflong	n				'store power

		add	i1,#1
	_ret_	djnz	c1,#.convert
'
'
' Data
'
samples_ptr	long	samples
real_imag_ptr	long	real_imag
powers_ptr	long	powers

c1		res	1
c2		res	1
i1		res	1
i2		res	1
i3		res	1
i4		res	1
angle		res	1
a1		res	1
a2		res	1
b1		res	1
b2		res	1
n		res	1


Maybe someone can figure out how to make it faster. I used all the tricks I could think of, but I'm sure more is possible.
«13

Comments

  • ReinhardReinhard Posts: 489
    edited 2019-12-17 17:32
    WOW!!!
    With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
    I will also need a demo board soon. Testing with the simulator is very tedious.
    Thanks for sharing this example.
    Reinhard

    edit:
    Cheers to the rev command.
    I wanted to assemble that next, but there is ;-)
  • cgraceycgracey Posts: 14,206
    I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

    I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.
  • PublisonPublison Posts: 12,366
    edited 2019-12-17 17:56
    Reinhard wrote: »
    WOW!!!
    With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
    I will also need a demo board soon. Testing with the simulator is very tedious.
    )

    We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
    PM me if you are interested.


  • In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.
  • Publison wrote: »
    Reinhard wrote: »
    WOW!!!
    With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
    I will also need a demo board soon. Testing with the simulator is very tedious.
    )

    We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
    PM me if you are interested.

    Thank you, I live in Germany and I don't know how complicated it is to send.
  • jmgjmg Posts: 15,175
    Reinhard wrote: »
    In this context, I have another question. How fast is the internal ADC and what bit width does it have. An external ADC with 16 bit would not be a drama either.

    The internal ADC are modest, ok for a MCU. Tests on the ES1 silicon, had 10~12bit ADC, which varied with supply and pin and MHz.
    An external ADC will be needed to highest performance.
    P2 can support the external isolated ADCs, which give 78kHz BW and 12~14b useful from a 16b result.

  • Thank you.
    I mean the P2 has enough pins if a very high-performance application is needed.
    a flash ADC can also be connected, which loads the data in parallel.
  • RaymanRayman Posts: 14,744
    I wonder how much using two cogs with shared LUT would speed things up...
  • Yes, I've already thought of that.
    I just have no concrete idea at the moment how it can be synchronized.
    A parallel processing, that would be a huge thing.
    I think it is possible.
  • Reinhard wrote: »
    Publison wrote: »
    Reinhard wrote: »
    WOW!!!
    With this demo, the P2 is in the league of digital signal processors. I suspected it, but so far I still lack the experience in p2asm.
    I will also need a demo board soon. Testing with the simulator is very tedious.
    )

    We need to get a board to you ASAP. I can send you my board right now. I can wait for the new run in the next 2-3 weeks.
    PM me if you are interested.

    Thank you, I live in Germany and I don't know how complicated it is to send.

    On the road now. PM in about an hour.



  • TonyB_TonyB_ Posts: 2,193
    edited 2019-12-18 02:57
    cgracey wrote: »
    I wonder how fast mdern DSP's can do a 1024-point FFT. It might be a few microseconds.

    I think the simplest way to speed up this program would be to take bigger bites in the inner loop, so that we load/store more A's and B's and pipeline more CORDIC operations. Right now it's taking ~750,000 clocks to do 10,000 operations. That's 75 clocks per. We could probably get it down to 30 clocks. Then, boost the clock a little more and we'll be under 1ms.

    I haven't looked at pipelined CORDIC yet. Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between? Eight A's and B's in the inner loop look to be optimal, at first glance. More accurately, anything less looks to be sub-optimal.

    Pity there's no SUB2 (D = D - 2*S), nor for other applications an atomic SHR & ZEROX, the latter operation using the spare S bits, similar to BITx.
  • evanhevanh Posts: 16,023
    edited 2019-12-17 22:48
    TonyB_ wrote: »
    ... Is it possible to issue a new command every 8 cycles, with three 2-cycle instructions in between?
    Correct for an 8-cog prop2. For a 16-cog prop2 the tightest is every 16 clocks, with up to seven 2-cycle instructions between.

  • cgraceycgracey Posts: 14,206
    edited 2019-12-19 19:29
    I just had a huge realization!

    The core of the inner loop of the FFT has this:
    		qrotate	b1,angle
    
    		qrotate	b2,angle
    
    		getqx	b1
    		getqy	b2
    
    		getqx	n
    		add	b2,n
    		getqy	n
    		sub	b1,n
    

    I started staring at that sequence, thinking there must be some simple objective to what it does. I had transcribed this code from some 80386 assembly code I had written 16 years ago, so I was just trying to do it faithfully, without really understanding WHAT it was doing. Actually, I never before REALIZED what it was doing.

    Looking at that sequence, my brain started to remember the (X,Y) rotation equations:

    x′=xcosθ−ysinθ
    y′=ycosθ+xsinθ

    That seemed to be maybe what it was doing.

    So, I substituted a single CORDIC operation for that whole sequence:
    		setq	b2		'rotate (b1,b2) by angle
    		qrotate	b1,angle
    		getqx	b1
    		getqy	b2
    

    AND IT WORKED!!!!!

    So, the CORDIC can really whip the core of the FFT problem. This just leaves some adds and subtracts in the inner loop. Cool!!!!! This can be sped up WAY more than I originally thought.
  • cgraceycgracey Posts: 14,206
    edited 2019-12-19 20:29
    The inner loop of the FFT now looks like this. Very straightforward:
    .loop3		setq	#2-1		'read (a1,a2)
    		rdlong	a1,ptra
    
    		setq	#2-1		'read (b1,b2)
    		rdlong	b1,ptrb
    
    		setq	b2		'rotate (b1,b2) by angle
    		qrotate	b1,angle
    		getqx	b1
    		getqy	b2
    
    		add	a1,b1		'write (a1+b1,a2+b2) to (a1,a2)
    		add	a2,b2
    		setq	#2-1
    		wrlong	a1,ptra++
    
    		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
    		sub	a1,b1
    		sub	a2,b2
    		sub	a2,b2
    		setq	#2-1
    		wrlong	a1,ptrb++
    
    		djnz	c2,#.loop3
    
  • RaymanRayman Posts: 14,744
    edited 2019-12-19 19:40
    wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx…
  • cgraceycgracey Posts: 14,206
    edited 2019-12-19 20:00
    Rayman wrote: »
    wonder if you could rdlong ptra[1], rdlong ptrb[1] to get the next values in between qrotate and getqx...

    We could read 16 pairs at a time and then stuff the CORDIC pipeline.

    This FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

    512
    256
    128
    64
    32
    16
    8
    4
    2
    1

    So, if we hard-coded 16 rotations+adds/subs, the extra operations just wouldn't be saved in the last four runs of 8, 4, 2, and 1. That would speed things way up and not waste much time.
  • RaymanRayman Posts: 14,744
    Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

    Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

    But, don't you need to bias the pin at Vdd/2 for that to work right?
    Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...
  • cgraceycgracey Posts: 14,206
    Rayman wrote: »
    Not sure if this is the right place, but I had a question about your FFT demo on the Chat yesterday...

    Looks like you are coupling the microphone input just through a cap to a prop2 pin, as you said.

    But, don't you need to bias the pin at Vdd/2 for that to work right?
    Seems to me the bias level would float around, potentially causing the peaks or valleys to get clipped...

    The ADC pulls it to its own center, which is the threshold of a particular inverter, and close to VIO/2.
  • RaymanRayman Posts: 14,744
    Ok, interesting. That's a nice feature.

    One other thing...
    I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

    https://www.edn.com/windowing-functions-improve-fft-results-part-i/
  • Cluso99Cluso99 Posts: 18,069
    		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
    		sub	a1,b1
    		sub	a2,b2
    		sub	a2,b2
    		setq	#2-1
    		wrlong	a1,ptrb++
    
    This seems to be doing
    (a1-2*b1, a2-2*b2)
    ???
  • cgraceycgracey Posts: 14,206
    Cluso99 wrote: »
    		sub	a1,b1		'write (a1-b1,a2-b2) to (b1,b2)
    		sub	a1,b1
    		sub	a2,b2
    		sub	a2,b2
    		setq	#2-1
    		wrlong	a1,ptrb++
    
    This seems to be doing
    (a1-2*b1, a2-2*b2)
    ???

    A1 and A2 previously had B1 and B2 added to them. So we need to subtract twice now.
  • cgraceycgracey Posts: 14,206
    Rayman wrote: »
    Ok, interesting. That's a nice feature.

    One other thing...
    I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

    https://www.edn.com/windowing-functions-improve-fft-results-part-i/

    Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.
  • cgraceycgracey Posts: 14,206
    Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

    512 x 1 time
    256 x 2 times
    128 x 4
    64 x 8
    32 x 16
    16 x 32
    8 x 64
    4 x 128
    2 x 256
    1 x 512

    I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

    All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.
  • cgracey wrote: »
    Rayman wrote: »
    Ok, interesting. That's a nice feature.

    One other thing...
    I think in real life, you'd want to apply a windowing function to avoid distortions due to the first point not being close to the last point in the set... looks like Hamming window might be popular from this:

    https://www.edn.com/windowing-functions-improve-fft-results-part-i/

    Ah, yes, it never occurred to me, but we need to do that. I'll integrate it into the FFT input gathering. Thanks.

    The 'Hanning' window is another name for Hann that we use in scope mode.
  • RaymanRayman Posts: 14,744
    edited 2019-12-20 01:24
    Seems we could use this for voice recognition...
    There are some free datasets available these days:
    https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/
  • cgracey wrote: »
    Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

    512 x 1 time
    256 x 2 times
    128 x 4
    64 x 8
    32 x 16
    16 x 32
    8 x 64
    4 x 128
    2 x 256
    1 x 512

    I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

    All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

    I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

    There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.
  • cgraceycgracey Posts: 14,206
    Rayman wrote: »
    Seems we could use this for voice recognition...
    There are some free datasets available these days:
    https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/

    Thanks, Rayman. I didn't know such things existed. I wonder how usable it is by us.

    By the way, I did what you said and placed one of the SETQ+RDLONG sequences between the QROTATE and GETQX and it certainly sped things up. I will use that for a simple implementation that is compact.

    I've been playing around with optimizing the FFT today and have made a lot of progress, but when I get to the last iteration set of 1x512, the angle must be recalculated for every sample and I can't fit it into two instruction, as I'd like, in order to get it to flow with the CORDIC. There is a REV needed and two instructions before that. That last iteration set is a special case in a few different ways, when trying to optimize it. It's slowing the whole FFT down.
  • cgraceycgracey Posts: 14,206
    TonyB_ wrote: »
    cgracey wrote: »
    Upon closer examination, the FFT does the following-sized runs of rotations and adds/subs on contiguously-placed coordinate pairs:

    512 x 1 time
    256 x 2 times
    128 x 4
    64 x 8
    32 x 16
    16 x 32
    8 x 64
    4 x 128
    2 x 256
    1 x 512

    I've got a 16-pair CORDIC stuffer coded up, but I noticed it was slowing down when it got to the 8x64 operation.

    All operation sets of size 16x32 and up are each taking only ~68us! So, it's getting 60% of the job done in only 420us. I need to code special cases for the last four operation sets to get the overall time minimized.

    I think most of us would find the 16-pair pipelined CORDIC code interesting, even if not complete. The egg-beater comes in to play here, too. Is there any fixed phase relationship between ptra & ptrb in the inner loop, for 16 x 32 in particular? Or do the bit reversals make the address differences pretty random?

    There are five instructions between wrlong a1,ptra++ and wrlong a1,ptrba++ which could be reduced to three with a few more instructions earlier.

    Tony, how would you reorder that?

    I have it like this now, for readability:
    .loop3		setq	#2-1		'read (bx,by)
    		rdlong	bx,ptrb
    
    		setq	by		'rotate (bx,by) by angle
    		qrotate	bx,angle
    
    		setq	#2-1		'read (ax,ay)
    		rdlong	ax,ptra
    
    		getqx	bx		'get rotated (bx,by)
    		getqy	by
    
    		add	ax,bx		'(ax,ay) = (ax+bx,ay+by)
    		add	ay,by
    
    		shl	bx,#1		'(bx,by) = (ax-bx,ay-by)
    		subr	bx,ax
    		shl	by,#1
    		subr	by,ay
    
    		setq	#2-1		'write (ax,ay)
    		wrlong	ax,ptra++
    
    		setq	#2-1		'write (bx,by)
    		wrlong	bx,ptrb++
    
    		djnz	c2,#.loop3
    
  • TonyB_TonyB_ Posts: 2,193
    edited 2019-12-21 21:05
    Chip, what you have now is excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

    Could we use REP in the inner loop?
  • cgraceycgracey Posts: 14,206
    TonyB_ wrote: »
    Chip, what you have now is really excellent, with the absolute minimum between the WRLONGs. SUBR is such as great instruction, yet so easily overlooked. I was re-ordering the SUBs and it's not worth showing now.

    Could we use REP in the inner loop?

    We could use REP, but it wouldn't save much time. Also, it would block interrupts.
Sign In or Register to comment.