1024-point FFT in 79 longs

bob_g4bby · 2026-02-03 09:44

Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?

A question of scaling the function up? Large enough to retain as much precision as possible, consistent with avoiding overflow when multiplying the signal, sample by sample.

The scaled window is a constant, so it's a DATA block to be loaded into a buffer on start up, I guess.

TonyB_ · 2026-02-03 12:22

@bob_g4bby said:
Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?

The best you could do with the CORDIC QMUL is one sample every 16 cycles. Your Blackman-Harris window length is 1024 samples with 512 different values so the obvious place to store them is LUT RAM. Will successive windows overlap by 512 samples?

Data could be read from hub RAM using RFLONG with 256 results, say, stored temporarily in reg RAM then written back later using fast block write. Alternatively, do fast block read to pre-load 256 samples in reg RAM and write results directly to hub RAM using WFLONG. Or use RFLONG and WRLONG (but the former might stall the later) with no intermediate storage in reg RAM if timing allows for that.

bob_g4bby · 2026-02-03 14:12

Good idea, LUT ram is the best place and like you say, the window is symmetric, so storing half of it is enough.
Yes, the FFT fast convolution filtering that comprises the software defined radio main filter does include an overlap-add feature - see attached pdf

I've got the dsp functions going quite fast as follows:-
1. Move 16 iq pairs from the input buffer in hub ram into a small register array using SETQ #31, RDLONG regarray, ptra++. This takes 1 cycle per long - see here.
2. Preload the cordic engine with 8 inputs from the register array.
3. Read result back from cordic into register array and load the cordic engine with another input - do this 8 times
4. Read the last 8 results from the cordic engine back into register array. For many dsp functions (2) to (4) results in around 9 cycles per cordic result
5. Move 16 iq pairs from register array back to the output buffer in hub ram. This again takes 1 cycle per long.
6. Repeat (1) to (5) 64 times for the whole buffer

Here's an example, with timing based on Chip's example:-

' convert x,y (cartesian) samples in buffin to magnitude, angle (polar) samples in buffout - optionally, buffout can be the same as buffin
' at 320MHz clock, runs about 56.1uS
' result stored as mag1, real1, mag2, real2 etc.
pub xytopol(buffin, buffout) | counter

    org
        push ptra
        mov ptra, buffin
        mov ptrb, buffout
        mov counter, #(sigbuffsize/16)
xypol1
            setq #31
            rdlong array, ptra++
            qvector array, array+1
            nop
            qvector array+2, array+3
            nop
            qvector array+4, array+5
            nop
            qvector array+6, array+7
            nop
            qvector array+8, array+9
            nop
            qvector array+10, array+11
            nop
            qvector array+12, array+13
            nop
            qvector array+14, array+15
            getqx   array
            getqy   array+1
            nop
            qvector array+16, array+17
            getqx   array+2
            getqy   array+3
            nop
            qvector array+18, array+19
            getqx   array+4
            getqy   array+5
            nop
            qvector array+20, array+21
            getqx   array+6
            getqy   array+7
            nop
            qvector array+22, array+23
            getqx   array+8
            getqy   array+9
            nop
            qvector array+24, array+25
            getqx   array+10
            getqy   array+11
            nop
            qvector array+26, array+27
            getqx   array+12
            getqy   array+13
            nop
            qvector array+28, array+29
            getqx   array+14
            getqy   array+15
            nop
            qvector array+30, array+31
            getqx   array+16
            getqy   array+17
            getqx   array+18
            getqy   array+19
            getqx   array+20
            getqy   array+21
            getqx   array+22
            getqy   array+23
            getqx   array+24
            getqy   array+25
            getqx   array+26
            getqy   array+27
            getqx   array+28
            getqy   array+29
            getqx   array+30
            getqy   array+31
            setq #31
            wrlong array, ptrb++
            djnz counter, #xypol1
        pop ptra
        ret

array res    32

    end

TonyB_ · 2026-02-03 16:02

@bob_g4bby said:

' convert x,y (cartesian) samples in buffin to magnitude, angle (polar) samples in buffout - optionally, buffout can be the same as buffin
' at 320MHz clock, runs about 56.1uS
' result stored as mag1, real1, mag2, real2 etc.
pub xytopol(buffin, buffout) | counter

    org
        push ptra
        mov ptra, buffin
        mov ptrb, buffout
        mov counter, #(sigbuffsize/16)
xypol1
            setq #31
            rdlong array, ptra++
            qvector array, array+1
            nop
            qvector array+2, array+3
            nop
            qvector array+4, array+5
            nop
            qvector array+6, array+7
            nop
            qvector array+8, array+9
            nop
            qvector array+10, array+11
            nop
            qvector array+12, array+13
            nop
            qvector array+14, array+15
            getqx   array
            getqy   array+1
            nop
            qvector array+16, array+17
            getqx   array+2
            getqy   array+3
            nop
            qvector array+18, array+19
            getqx   array+4
            getqy   array+5
            nop
            qvector array+20, array+21
            getqx   array+6
            getqy   array+7
            nop
            qvector array+22, array+23
            getqx   array+8
            getqy   array+9
            nop
            qvector array+24, array+25
            getqx   array+10
            getqy   array+11
            nop
            qvector array+26, array+27
            getqx   array+12
            getqy   array+13
            nop
            qvector array+28, array+29
            getqx   array+14
            getqy   array+15
            nop
            qvector array+30, array+31
            getqx   array+16
            getqy   array+17
            getqx   array+18
            getqy   array+19
            getqx   array+20
            getqy   array+21
            getqx   array+22
            getqy   array+23
            getqx   array+24
            getqy   array+25
            getqx   array+26
            getqy   array+27
            getqx   array+28
            getqy   array+29
            getqx   array+30
            getqy   array+31
            setq #31
            wrlong array, ptrb++
            djnz counter, #xypol1
        pop ptra
        ret

array res    32

    end

The NOP's between the QVECTOR's are not necessary, I think. Re window coefficients in LUT RAM, the 3 cycle RDLUT could be an issue for CORDIC pipelining and it might be quicker to keep some of them in reg RAM and swap them in and out as required.

Wuerfel_21 · 2026-02-03 16:27

@TonyB_ said:
The NOP's between the QVECTOR's are not necessary, I think.

Not needed. CORDIC ops automatically insert waitstates so they're spaced a multiple of 8 cycles apart

bob_g4bby · 2026-02-03 18:35

That's very interesting, I'll take the nops out, thanks both! RDLUT issue noted too.

TonyB_ · 2026-02-03 19:27

@bob_g4bby said:
That's very interesting, I'll take the nops out, thanks both! RDLUT issue noted too.

You could put up to 3*7=21 normally two-cycle but effectively zero-cycle instructions between 1st and 8th QVECTOR. I'll be interested to see your windowing code, whenever that may be.

SaucySoliton · 2026-02-27 07:54

I made a huge breakthrough on the bit-reversal that is a usual part of doing an FFT. The bit-reversal code in the last update took 160uS. This new code does digit-reversal in 50uS, when combined with other parts of the FFT.

Radix-4 FFT need digit reversal instead of bit reversal. For radix-4, digits are 2 bits. Thankfully with the P2's bit shuffling instructions it's only 3 extra instructions. This is still a lot of instructions for reading a single sample.

        rdfast  c_8000,ptrr    
        add     i1,#1
        mov     next_ptrr,i1 
        splitw  next_ptrr  *
        rev     next_ptrr
        rol next_ptrr,shift  *
        mergew  next_ptrr  *
        shl      next_ptrr,#3
        add     next_ptrr,real_imag_ptr
        rflong  ax+A
        rflong  ay+A

Let's look at what the digit reversal does.

The XOR trick I used previously won't work for addressing data because we can't assume that the arrays are aligned to their size. The carries would happen at different times depending on the array starting address. If we look at the differences between addresses we see that it follows a regular pattern. The current address can be adjusted to the next address with only a single ADD. After 16 samples there are additional bits rolling over. It's not a big deal. We just need to run the full digit reversal calculation to deal with this. The same method could be applied to bit-reversal as well, but will require more unique step sizes.

Of course the FIFO takes time to read from the hub. This wait time can be filled with the next address calculation or butterfly operations.

Combined with the greater computational efficiency of the radix-4 FFT, a complete 1024 point transform happens in 870uS. That is an unbelievable 25% reduction from the previous 1175uS. All times at 160MHz.

Sorry, flexspin only right now.

TonyB_ · 2026-02-27 14:00

@SaucySoliton said:
I made a huge breakthrough on the bit-reversal that is a usual part of doing an FFT. The bit-reversal code in the last update took 160uS. This new code does digit-reversal in 50uS, when combined with other parts of the FFT.

Combined with the greater computational efficiency of the radix-4 FFT, a complete 1024 point transform happens in 870uS. That is an unbelievable 25% reduction from the previous 1175uS. All times at 160MHz.

Good work!

Re your code:

                setnib  .dit4ret,#%0000,#7    ' set condition to _ret_
'...
                setnib  .dit4ret,#%1111,#7    ' set condition to always

changing the instruction prefix from always to _ret_ is an interesting way of converting plain code into a subroutine. The drawback is that another instruction is needed to reverse the change. Quite often in skip sequences I create a routine by duplicating an instruction but with a _ret_ prefix and which is always skipped. This adds only one long although it does require skipping to be active.

SaucySoliton · 2026-02-27 18:41

@TonyB_ said:

Good work!

Re your code:
                setnib  .dit4ret,#%0000,#7    ' set condition to _ret_
'...
                setnib  .dit4ret,#%1111,#7    ' set condition to always
changing the instruction prefix from always to _ret_ is an interesting way of converting plain code into a subroutine. The drawback is that another instruction is needed to reverse the change. Quite often in skip sequences I create a routine by duplicating an instruction but with a _ret_ prefix and which is always skipped. This adds only one long although it does require skipping to be active.

Thanks!

The call/return happens 64 times. It runs inline 64*3 = 192 times. The called section is 154 instructions so it would be nice to not duplicate that.

if_c   ret       '  would be 2 cycles if not returning,   4 cycles if returning
 _ret_  sub  ay,cy    ' 0 cycles if not returning,   2 cycles if returning

Due to hub and cordic alignment saving 2 cycles might cut 8 cycles off the loop.

I came up with another way of trying not to branch within the loop. The return address for djnz can be patched return to different locations.

            testb   p_fft_flags,#NOBR_BIT wc 
        ' patch the loop return location to reduce time penalty for 
        ' selecting which read method to use.
        ' only 3uS difference.
    if_c    sets    .first_loop_end,#(.seqread_dit - .first_loop_end -1 )&$1ff
    if_nc   sets    .first_loop_end,#(.randread_reorder - .first_loop_end -1)&$1ff
.first_loop
    if_nc   jmp #.randread_reorder


.seqread_dit    setq    #32-1       
                rdlong  ax,ptra++       
                jmp #.bfly1     

.randread_reorder       
...  ' this code mutually exclusive with .seqread
.bfly1
...  ' this code always runs
.first_loop_end djnz    flight_count,#.first_loop

This isn't in the uploaded code. After unrolling the randread function, most of bfly1 became dispersed between fifo reads. It was simpler just to go to 2 separate loops.

bob_g4bby · 2026-02-27 19:01

That's a very worthwhile gain in performance - can't wait to adopt your code. I've been reviewing fft fast convolution filtering and I've made spin code that creates the required bandpass filter impulse response. A small windows based software radio has been modified to supply an analogue iq signal which is converted to 48ks/s signal buffers in the P2 via an HDaudio kit. Spin Tools writes a single debug scope window fast enough to keep up with the signal, which is really useful. I've just written an 8x knobs driver to control signal settings later on.
Much appreciated, bob

1024-point FFT in 79 longs

Comments