Multiply, multiply, multiply
tomcrawford
Posts: 1,126
As an exercise, I tested the P2 16 x 16 MUL instruction against the P1 subroutine and the P2 CORDIC. I didn't actually expect to find any error, and of course, didn't.
For every possible pair of values (2**32 total), I multiplied three ways and compared the results. Here is a code fragment.
Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.
Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.
I did not use the build-in serial driver; didn't even use the serial capability of the Smart Pin.
I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.
For every possible pair of values (2**32 total), I multiplied three ways and compared the results. Here is a code fragment.
top QMUL plier, cand 'get CorDIC started mov x, plier 'set up for by-hand mov y, cand call #Multiply 'old prop1 16 x 16 mov byhand, y mov hardware, plier mul hardware, cand 'hardware 16 x 16 GetQY temp 'discard cordic top half GetQX cordic 'lower long cmp byhand, hardware wcz 'see if all same-same if_NE jmp #NotEqual cmp byhand, cordic wcz if_NE jmp #NotEqual drvnot #16 'toggle a scope probe
Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.
Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.
I did not use the build-in serial driver; didn't even use the serial capability of the Smart Pin.
I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.
Comments
I've been MULS for FIR filtering and am working on IIR filtering using QMUL with corrections for signed operation.
The CORDIC of course has the advantage that it can be interleaved with other operations. But if for some reason you can't do that then using MUL to do 32x32 multiplies seems to be the best bet.
I've written a Python script to generate the p2asm for this, each section (apart from first and last) has the form: So the pipeline is used without stalls every other available slot. There is only enough time to do
sign correction for one argument to the multiply, so I make all the coefficients positive and flip add<->sub
as appropriate in the accumulation.
The basic realization per stage is, in pseudo code (ie Python!): Where d1, d2 are the delay state (per stage), a1,a2,b1,b2 are the coefficients.
The scaling between stages is going to be a shift, once I've figured out how to compute the right
shift values for maximum headroom, and there ought to be one more multiply at the end to get
the gain precise (ie all the a0 coefficients are finessed into a series of shifts and this multiply)
You could also try overlapping your cordic operations.
i.e. start a new cordic operation before the last has finished.
Something like this
This should shve ~100 clocks off the total time.
A complex multiply needs 4 signed multiplies that can all execute concurrently, and an add and subtract to tie the parts together.
input x + i y, x2 + i y2
output xx + i yy = (x*x2 - y*y2) + i (x*y2 + y*x2)
I think its about 108 cycles all told. The setup for the signed corrections is done without time penalty as the pipeline runs, so that its pretty close to optimal for a cordic implementation.
I've used this in a FFT routine I've been playing with to characterize the ADC modes, producing output via a DAC pin to my 'scope, with output like:
For reference, the obvious way to implement complex multiplication is this:
(xr + xi*j)*(yi + yi*j) = (xr*yr) + j*(xr*yi + xi*yr) - (xi*yi)
Let A = xr*yr, and let B = xi*yi, since these show up in multiple places later.
Now let C = (xr+xi)*(yr+yi) = xr*yr + xr*yi + xi*yr + xi*yi. The middle two terms are the imaginary part of the result, and the first and last terms are A and B, which we already have and can subtract out.
So, zr + zi*j = (A - B ) + j*(C - A - B ).
The disadvantage is that your multipliers have to be one bit longer than your input numbers; this isn't a problem on the P2 if you're multiplying signed numbers, because QMUL is unsigned, so you already have to manage that extra bit anyway. See Wikipedia's article for more details.
Larger multiplies composed of smaller multiplies can be done similarly: just replace j with 2^32 or 2^16. I briefely mentioned in the other multiplication thread that they could use it for long multiplies.
EDIT: Eliminated smiley faces.
Then multiply is written:
z1.z2 = (x1+iy1)(x2+iy2) = (x1x2-y1y2) + i(x1y2+x2y1)
The problem I see with the Karatsuba on the cordic hardware is that the most streamlined way to do signed
correction on a multiply requires the inputs to the multiply to have their original sign bits intact,
which the additions might break by overflowing. You'd be forced to restrict your signed values
to 31 bits.
Given a 32x32->64 bit unsigned multiply of the form you can make it work as a signed multiply like this: with most of the housekeeping overlapping the pipelined cordic (except a single instruction
to correct the value of hi.)
The alternative requires taking the absolute values and recording the sign of result, requiring a 64 bit
negate at the end and 3 or 4 instructions overhead before feeding the pipeline.
With the complex multiply the gain from 4 to 3 multiplies is only 8 cycles for the pipeline, but the
simplicity of not having to deal with overflows probably wins out.
Another way is to upload a new picture with the "Attach a file" right below the edit box and use that. Once uploaded a small icon of the picture will appear. Hover over it, there will be an "Insert Image" you can click on. Clicking that will add the pitcure's URL to your edit box.
test
Martin