Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?
Is it a question of scaling the function up? The scale factor would need to be large enough to retain as much precision as possible, while still avoiding overflow when multiplying the signal, sample by sample.
The scaled window is a constant, so it's a DATA block to be loaded into a buffer on start up, I guess.
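For what it's worth, here is a rough sketch of the scaling idea in Python, which could also generate the table on the PC instead of a spreadsheet. It assumes 16-bit signed samples and 15 fractional bits for the coefficients, so each product stays inside a signed 32-bit long. `FRAC_BITS`, `blackman_harris`, `scale_window` and `apply_window` are names made up for illustration, and the a0..a3 values are the usual 4-term Blackman-Harris coefficients, which may not match the spreadsheet exactly.

```python
import math

# Standard 4-term Blackman-Harris coefficients (assumed, not taken from the spreadsheet).
A0, A1, A2, A3 = 0.35875, 0.48829, 0.14128, 0.01168

def blackman_harris(n_points):
    """Symmetric Blackman-Harris window as floats between 0 and 1."""
    out = []
    for n in range(n_points):
        x = 2.0 * math.pi * n / (n_points - 1)
        out.append(A0 - A1 * math.cos(x) + A2 * math.cos(2 * x) - A3 * math.cos(3 * x))
    return out

def scale_window(window, frac_bits):
    """Round each coefficient to an integer with frac_bits fractional bits."""
    return [round(w * (1 << frac_bits)) for w in window]

def apply_window(samples, coeffs, frac_bits):
    """Multiply each sample by its coefficient, then shift the scale factor back out."""
    out = []
    for s, c in zip(samples, coeffs):
        product = s * c
        # Guard for the assumed signed 32-bit product.
        assert -(1 << 31) <= product < (1 << 31), "product overflows 32 bits"
        out.append(product >> frac_bits)  # arithmetic shift, keeps the sign
    return out

FRAC_BITS = 15   # assumption: samples are 16-bit signed, so 16 + 15 bits fit in a long
N = 1024
coeffs = scale_window(blackman_harris(N), FRAC_BITS)
windowed = apply_window([32767] * N, coeffs, FRAC_BITS)  # full-scale test input
```

With wider samples, `FRAC_BITS` would have to shrink to match, unless more than 32 bits of the product are kept.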
@bob_g4bby said:
Here's the Blackman-Harris window produced in a spreadsheet - what's the best way of applying it to a signal in the 32 bit integer environment of the P2?
The best you could do with the CORDIC QMUL is one sample every 16 cycles. Your Blackman-Harris window length is 1024 samples with 512 different values so the obvious place to store them is LUT RAM. Will successive windows overlap by 512 samples?
Data could be read from hub RAM using RFLONG with 256 results, say, stored temporarily in reg RAM then written back later using fast block write. Alternatively, do a fast block read to pre-load 256 samples in reg RAM and write results directly to hub RAM using WFLONG. Or use RFLONG and WRLONG (but the former might stall the latter) with no intermediate storage in reg RAM if timing allows for that.
Good idea, LUT RAM is the best place and, as you say, the window is symmetric, so storing half of it is enough.
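The half-table lookup could work along these lines. This is only a Python sketch of the index mirroring, with `half_table` and `coeff_at` as made-up names and a tiny 6-point window standing in for the real 1024-point one.

```python
def half_table(window):
    """Keep only the first half of an even-length symmetric window."""
    assert len(window) % 2 == 0
    return window[: len(window) // 2]

def coeff_at(half, n):
    """Coefficient for sample index n; indices in the second half are mirrored."""
    full_len = 2 * len(half)
    return half[n] if n < len(half) else half[full_len - 1 - n]

full = [1, 5, 9, 9, 5, 1]          # toy symmetric window
half = half_table(full)            # only these values need storing
print([coeff_at(half, n) for n in range(len(full))])  # [1, 5, 9, 9, 5, 1]
```

For the 1024-point window that would mean 512 stored longs, with index `n` above 511 mapped to `1023 - n`.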
Yes, the FFT fast convolution filtering that comprises the software defined radio main filter does include an overlap-add feature - see attached pdf
I've got the dsp functions going quite fast as follows:-
1. Move 16 IQ pairs (32 longs) from the input buffer in hub RAM into a small register array using SETQ #31 followed by RDLONG regarray, ptra++. This takes 1 cycle per long - see here.
2. Preload the CORDIC engine with 8 inputs from the register array.
3. Read a result back from the CORDIC into the register array and load the CORDIC engine with another input; do this 8 times.
4. Read the last 8 results from the CORDIC engine back into the register array. For many DSP functions, steps (2) to (4) take around 9 cycles per CORDIC result.
5. Move the 16 IQ pairs from the register array back to the output buffer in hub RAM. This again takes 1 cycle per long.
6. Repeat (1) to (5) 64 times for the whole buffer.
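The order of operations in steps (1) to (6) can be modelled in a few lines of Python. This is only a sketch of the preload / overlap / drain sequence, not of the hardware: the CORDIC is stood in for by a simple queue, the result is computed at issue time, and `to_polar`, `process_block` and `process_buffer` are hypothetical names.

```python
from collections import deque
import math

def to_polar(pair):
    """Stand-in for a CORDIC operation: (x, y) to (magnitude, angle)."""
    x, y = pair
    return (math.hypot(x, y), math.atan2(y, x))

def process_block(pairs, op, depth=8):
    """Run op over one block using the ordering of steps (2) to (4)."""
    in_flight = deque()   # results come back in the order they were issued
    results = []
    for pair in pairs[:depth]:            # step 2: preload 8 inputs
        in_flight.append(op(pair))
    for pair in pairs[depth:]:            # step 3: read one result, issue one input
        results.append(in_flight.popleft())
        in_flight.append(op(pair))
    while in_flight:                      # step 4: drain the last 8 results
        results.append(in_flight.popleft())
    return results

def process_buffer(buffer, op, block=16):
    """Steps (1), (5) and (6): walk the buffer one 16-pair block at a time."""
    out = []
    for start in range(0, len(buffer), block):
        regs = buffer[start:start + block]    # step 1: block read into registers
        out.extend(process_block(regs, op))   # step 5: block write of the results
    return out

buf = [(i, 1024 - i) for i in range(1024)]
polar = process_buffer(buf, to_polar)         # 64 blocks of 16 pairs
```

The point of the model is just that results leave in the same order the inputs went in, so each result can overwrite the register its input came from.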
Here's an example, with timing based on Chip's example:-
The NOP's between the QVECTOR's are not necessary, I think. Re window coefficients in LUT RAM, the 3 cycle RDLUT could be an issue for CORDIC pipelining and it might be quicker to keep some of them in reg RAM and swap them in and out as required.
@bob_g4bby said:
That's very interesting, I'll take the nops out, thanks both! RDLUT issue noted too.
You could put up to 3*7=21 normally two-cycle but effectively zero-cycle instructions between 1st and 8th QVECTOR. I'll be interested to see your windowing code, whenever that may be.
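The 3*7=21 figure can be checked with a little arithmetic. The sketch below assumes CORDIC commands start 8 cycles apart and that QVECTOR itself takes 2 of those cycles; the variable names are just for illustration.

```python
SLOT_CYCLES = 8      # assumed spacing between successive CORDIC commands
INSTR_CYCLES = 2     # assumed cost of QVECTOR and of each filler instruction
QVECTORS = 8         # commands issued during the preload phase

free_per_gap = (SLOT_CYCLES - INSTR_CYCLES) // INSTR_CYCLES  # instructions that fit per gap
gaps = QVECTORS - 1                                          # gaps between 1st and 8th QVECTOR
max_free_instructions = free_per_gap * gaps
print(max_free_instructions)  # 21
```

Those instructions cost nothing extra because the cog would otherwise be waiting for its next CORDIC slot.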
Here's an example, with timing based on Chip's example:-
```
' convert x,y (cartesian) samples in buffin to magnitude, angle (polar) samples in buffout - optionally, buffout can be the same as buffin
' at 320MHz clock, runs about 56.1uS
' result stored as mag1, angle1, mag2, angle2 etc.
pub xytopol(buffin, buffout) | counter

    org
                push    ptra
                mov     ptra, buffin
                mov     ptrb, buffout
                mov     counter, #(sigbuffsize/16)
xypol1          setq    #31
                rdlong  array, ptra++
                qvector array, array+1
                nop
                qvector array+2, array+3
                nop
                qvector array+4, array+5
                nop
                qvector array+6, array+7
                nop
                qvector array+8, array+9
                nop
                qvector array+10, array+11
                nop
                qvector array+12, array+13
                nop
                qvector array+14, array+15
                getqx   array
                getqy   array+1
                nop
                qvector array+16, array+17
                getqx   array+2
                getqy   array+3
                nop
                qvector array+18, array+19
                getqx   array+4
                getqy   array+5
                nop
                qvector array+20, array+21
                getqx   array+6
                getqy   array+7
                nop
                qvector array+22, array+23
                getqx   array+8
                getqy   array+9
                nop
                qvector array+24, array+25
                getqx   array+10
                getqy   array+11
                nop
                qvector array+26, array+27
                getqx   array+12
                getqy   array+13
                nop
                qvector array+28, array+29
                getqx   array+14
                getqy   array+15
                nop
                qvector array+30, array+31
                getqx   array+16
                getqy   array+17
                getqx   array+18
                getqy   array+19
                getqx   array+20
                getqy   array+21
                getqx   array+22
                getqy   array+23
                getqx   array+24
                getqy   array+25
                getqx   array+26
                getqy   array+27
                getqx   array+28
                getqy   array+29
                getqx   array+30
                getqy   array+31
                setq    #31
                wrlong  array, ptrb++
                djnz    counter, #xypol1
                pop     ptra
                ret
array           res     32
    end
```
The NOPs are not needed: CORDIC ops automatically insert wait states so they're spaced a multiple of 8 cycles apart.
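As a rough sanity check, the timing quoted in the listing's header comment can be turned into cycle counts. Only the 56.1 µs and 320 MHz figures come from the listing; the 64 blocks and 32 longs per block are assumed from the step list, and the variable names are illustrative.

```python
CLOCK_HZ = 320_000_000    # from the header comment
RUN_TIME_S = 56.1e-6      # from the header comment
BLOCKS = 64               # assumed: loop count for the whole buffer
LONGS_PER_BLOCK = 32      # assumed: 16 IQ pairs per block

total_cycles = round(RUN_TIME_S * CLOCK_HZ)
cycles_per_block = total_cycles / BLOCKS
cycles_per_long = cycles_per_block / LONGS_PER_BLOCK

print(total_cycles)       # 17952
print(cycles_per_block)   # 280.5
```

That overall figure includes the block read and write as well as the CORDIC traffic, so it is not directly comparable with a per-result figure for the CORDIC steps alone.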