1024-point FFT in 79 longs - Page 4 — Parallax Forums

1024-point FFT in 79 longs

Comments

  • bob_g4bby Posts: 535
    edited 2026-02-03 11:23

    Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?

    Is it a question of scaling the function up: large enough to retain as much precision as possible, yet small enough to avoid overflow when multiplying the signal, sample by sample?

    The scaled window is a constant, so it's a DATA block to be loaded into a buffer on start up, I guess.
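One way to explore the scaling question is to model it on a host PC first. The sketch below is Python, not P2 code. It assumes the commonly published 4-term Blackman-Harris coefficients (the spreadsheet values are not shown in the thread) and a hypothetical `FRAC_BITS` of 31, so a coefficient of 1.0 maps to 2**31. The product of a signed 32-bit sample and such a coefficient needs up to 64 bits before it is shifted back down.

```python
import math

N = 1024          # window length used in the thread
FRAC_BITS = 31    # hypothetical choice: coefficients lie in 0 .. 2**31

# Commonly published 4-term Blackman-Harris coefficients (an assumption here).
A0, A1, A2, A3 = 0.35875, 0.48829, 0.14128, 0.01168

def blackman_harris(n, length=N):
    """Symmetric window value at index n, in the range 0.0 .. 1.0."""
    t = 2.0 * math.pi * n / (length - 1)
    return A0 - A1 * math.cos(t) + A2 * math.cos(2 * t) - A3 * math.cos(3 * t)

def to_fixed(w):
    """Scale a 0.0 .. 1.0 coefficient to an integer with FRAC_BITS fraction bits."""
    return min(round(w * (1 << FRAC_BITS)), 1 << FRAC_BITS)

def apply_window(sample, coeff):
    """Multiply a signed 32-bit sample by a fixed-point coefficient.

    The intermediate product can need 64 bits; shifting right by FRAC_BITS
    brings the result back into the signed 32-bit range.
    """
    return (sample * coeff) >> FRAC_BITS

# The scaled table that would be stored as constant data.
window = [to_fixed(blackman_harris(n)) for n in range(N)]
```

Because no coefficient exceeds 2**31, the windowed sample can never be larger in magnitude than the input sample, so the result cannot overflow 32 bits. Choosing 32 fraction bits instead would make the shift equivalent to keeping the upper long of the 64-bit product, at the cost of not being able to represent a coefficient of exactly 1.0.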

  • TonyB_ Posts: 2,268
    edited 2026-02-03 14:48

    @bob_g4bby said:
    Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?

    The best you could do with the CORDIC QMUL is one sample every 16 cycles. Your Blackman-Harris window length is 1024 samples with 512 different values so the obvious place to store them is LUT RAM. Will successive windows overlap by 512 samples?

    Data could be read from hub RAM using RFLONG with 256 results, say, stored temporarily in reg RAM then written back later using fast block write. Alternatively, do fast block read to pre-load 256 samples in reg RAM and write results directly to hub RAM using WFLONG. Or use RFLONG and WRLONG (but the former might stall the latter) with no intermediate storage in reg RAM if timing allows for that.
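The "512 different values" point can be sketched as an index fold: a symmetric 1024-point window satisfies w[n] = w[1023 - n], so only the first half needs storing. The Python below is a host-side model only. `half_table` is a stand-in list rather than real coefficients; on the P2 it would correspond to 512 longs in LUT RAM.

```python
N = 1024
HALF = N // 2

def fold_index(n):
    """Map a sample index 0..N-1 onto the stored half-window index 0..HALF-1."""
    return n if n < HALF else N - 1 - n

# Stand-in table: any sequence will do for demonstrating the lookup.
half_table = list(range(HALF))

def window_coeff(n):
    """Look up the coefficient for sample n from the half-size table."""
    return half_table[fold_index(n)]

# Rebuild the full-length window from the half table.
full = [window_coeff(n) for n in range(N)]
```

This holds for the symmetric form of the window (denominator N - 1). If the spreadsheet used the periodic form (denominator N), the fold would be w[n] = w[N - n] and there would be 513 distinct values, so the table layout is worth checking against the actual data.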

  • bob_g4bby Posts: 535
    edited 2026-02-03 14:37

    Good idea, LUT RAM is the best place and, as you say, the window is symmetric, so storing half of it is enough.
    Yes, the FFT fast convolution filtering that comprises the software defined radio main filter does include an overlap-add feature - see attached pdf.
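For readers without the attached PDF, here is a minimal sketch of textbook overlap-add in Python. It is not necessarily the exact scheme in the PDF. Direct convolution stands in for the FFT, multiply, inverse FFT step, and the filter taps are hypothetical values chosen only for illustration.

```python
def convolve(x, h):
    """Direct linear convolution; stands in for FFT, multiply, inverse FFT."""
    y = [0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_add(signal, h, block_len):
    """Filter `signal` block by block, adding each block's tail into the next."""
    out = [0] * (len(signal) + len(h) - 1)
    for start in range(0, len(signal), block_len):
        block = signal[start:start + block_len]
        # Each filtered block is len(h) - 1 samples longer than its input,
        # so its tail overlaps the start of the following block.
        for k, v in enumerate(convolve(block, h)):
            out[start + k] += v
    return out

signal = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
taps = [1, -1, 2]   # hypothetical filter, for illustration only
filtered = overlap_add(signal, taps, block_len=4)
```

The block-wise result matches filtering the whole signal in one pass, which is the property that lets a fixed-size FFT filter an endless sample stream.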

    I've got the dsp functions going quite fast as follows:-
    1. Move 16 iq pairs from the input buffer in hub ram into a small register array using SETQ #31, RDLONG regarray, ptra++. This takes 1 cycle per long - see here.
    2. Preload the cordic engine with 8 inputs from the register array.
    3. Read result back from cordic into register array and load the cordic engine with another input - do this 8 times
    4. Read the last 8 results from the cordic engine back into register array. For many dsp functions, steps (2) to (4) take around 9 cycles per cordic result
    5. Move 16 iq pairs from register array back to the output buffer in hub ram. This again takes 1 cycle per long.
    6. Repeat (1) to (5) 64 times for the whole buffer
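The preload, overlap and drain pattern in steps (2) to (4) can be modelled on a host PC. The Python below is only a sketch of the ordering: `to_polar` is a floating-point stand-in for one CORDIC conversion, and the model ignores cycle timing and the P2's integer number formats.

```python
import math
from collections import deque

DEPTH = 8  # operations kept in flight, as in steps (2) to (4)

def to_polar(x, y):
    """Stand-in for one CORDIC conversion: (magnitude, angle in radians)."""
    return math.hypot(x, y), math.atan2(y, x)

def process_batch(pairs):
    """Convert one batch of (x, y) pairs using preload / overlap / drain."""
    in_flight = deque()
    results = []
    # Step 2: preload the pipeline with the first DEPTH inputs.
    for x, y in pairs[:DEPTH]:
        in_flight.append(to_polar(x, y))
    # Step 3: read one result back, then issue one new input.
    for x, y in pairs[DEPTH:]:
        results.append(in_flight.popleft())
        in_flight.append(to_polar(x, y))
    # Step 4: drain the results still in the pipeline.
    while in_flight:
        results.append(in_flight.popleft())
    return results

batch = [(3, 4), (0, 1), (-1, 0), (5, 12)] * 4   # 16 illustrative IQ pairs
polar = process_batch(batch)
```

Results come out in the same order the inputs went in, so each one can be written back over the input it came from, which is why one register array can serve as both source and destination.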

    Here's an example, with timing based on Chip's example:-

    ' convert x,y (cartesian) samples in buffin to magnitude, angle (polar) samples in buffout - optionally, buffout can be the same as buffin
    ' at 320 MHz clock, runs in about 56.1 us
    ' result stored as mag1, angle1, mag2, angle2 etc.
    pub xytopol(buffin, buffout) | counter
    
        org
            push ptra
            mov ptra, buffin
            mov ptrb, buffout
            mov counter, #(sigbuffsize/16)
    xypol1
                setq #31
                rdlong array, ptra++
                qvector array, array+1
                nop
                qvector array+2, array+3
                nop
                qvector array+4, array+5
                nop
                qvector array+6, array+7
                nop
                qvector array+8, array+9
                nop
                qvector array+10, array+11
                nop
                qvector array+12, array+13
                nop
                qvector array+14, array+15
                getqx   array
                getqy   array+1
                nop
                qvector array+16, array+17
                getqx   array+2
                getqy   array+3
                nop
                qvector array+18, array+19
                getqx   array+4
                getqy   array+5
                nop
                qvector array+20, array+21
                getqx   array+6
                getqy   array+7
                nop
                qvector array+22, array+23
                getqx   array+8
                getqy   array+9
                nop
                qvector array+24, array+25
                getqx   array+10
                getqy   array+11
                nop
                qvector array+26, array+27
                getqx   array+12
                getqy   array+13
                nop
                qvector array+28, array+29
                getqx   array+14
                getqy   array+15
                nop
                qvector array+30, array+31
                getqx   array+16
                getqy   array+17
                getqx   array+18
                getqy   array+19
                getqx   array+20
                getqy   array+21
                getqx   array+22
                getqy   array+23
                getqx   array+24
                getqy   array+25
                getqx   array+26
                getqy   array+27
                getqx   array+28
                getqy   array+29
                getqx   array+30
                getqy   array+31
                setq #31
                wrlong array, ptrb++
                djnz counter, #xypol1
            pop ptra
            ret
    
    array res    32
    
        end
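As a rough cross-check of the timing quoted in the listing's comments, the arithmetic can be written out. This Python sketch assumes `sigbuffsize` is 1024 IQ pairs, so that the loop runs 64 times as in step (6); that value is not stated explicitly in the post.

```python
CLOCK_HZ = 320_000_000   # clock quoted in the code comment
RUN_TIME_S = 56.1e-6     # run time quoted in the code comment
BATCHES = 64             # loop count from step (6), assumes 1024 IQ pairs
PAIRS_PER_BATCH = 16     # IQ pairs moved per batch
LONGS_PER_PAIR = 2       # magnitude and angle

total_clocks = CLOCK_HZ * RUN_TIME_S
clocks_per_batch = total_clocks / BATCHES
clocks_per_pair = clocks_per_batch / PAIRS_PER_BATCH
clocks_per_long = clocks_per_pair / LONGS_PER_PAIR

print(f"total clocks:     {total_clocks:.0f}")
print(f"clocks per batch: {clocks_per_batch:.1f}")
print(f"clocks per pair:  {clocks_per_pair:.2f}")
print(f"clocks per long:  {clocks_per_long:.2f}")
```

Under that assumption the figures work out to roughly 17,950 clocks in total, about 280 per batch, 17.5 per IQ pair and just under 9 per output long, which is in line with the "around 9 cycles" figure if that is counted per long written.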
    
    
  • TonyB_ Posts: 2,268
    edited 2026-02-03 16:05

    @bob_g4bby said:

    ' convert x,y (cartesian) samples in buffin to magnitude, angle (polar) samples in buffout - optionally, buffout can be the same as buffin
    ' at 320 MHz clock, runs in about 56.1 us
    ' result stored as mag1, angle1, mag2, angle2 etc.
    pub xytopol(buffin, buffout) | counter
    
        ' (remainder of the listing is identical to the post above)
    

    The NOPs between the QVECTORs are not necessary, I think. As for the window coefficients in LUT RAM, the 3-cycle RDLUT could be an issue for CORDIC pipelining and it might be quicker to keep some of them in reg RAM and swap them in and out as required.

  • @TonyB_ said:
    The NOPs between the QVECTORs are not necessary, I think.

    Not needed. CORDIC ops automatically insert wait states so that they are spaced a multiple of 8 cycles apart.

  • That's very interesting, I'll take the nops out, thanks both! RDLUT issue noted too.

  • TonyB_ Posts: 2,268
    edited 2026-02-03 19:28

    @bob_g4bby said:
    That's very interesting, I'll take the nops out, thanks both! RDLUT issue noted too.

    You could put up to 3*7=21 normally two-cycle but effectively zero-cycle instructions between the 1st and 8th QVECTOR. I'll be interested to see your windowing code, whenever that may be.
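The 3*7=21 figure can be spelled out as a small calculation. The Python below is a sketch that takes the 8-clock CORDIC spacing from the thread and assumes the QVECTOR instruction itself occupies 2 clocks of each slot.

```python
HUB_WINDOW_CLOCKS = 8   # CORDIC issue spacing stated in the thread
QVECTOR_CLOCKS = 2      # assumed cost of the QVECTOR instruction itself
FILLER_CLOCKS = 2       # "normally two-cycle" instructions
QVECTORS = 8            # commands in the preload run

# Clocks left over in each 8-clock slot after the QVECTOR has issued.
slack_per_gap = HUB_WINDOW_CLOCKS - QVECTOR_CLOCKS
# Two-cycle instructions that fit into that slack.
fillers_per_gap = slack_per_gap // FILLER_CLOCKS
# Eight commands have seven gaps between them.
gaps = QVECTORS - 1
free_instructions = fillers_per_gap * gaps

print(free_instructions)
```

Those instructions are "effectively zero-cycle" because the cog would otherwise spend the same clocks waiting for its next CORDIC slot.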
