Multiply, multiply, multiply

As an exercise, I tested the P2 16 x 16 MUL instruction against the P1 subroutine and the P2 CORDIC. I didn't actually expect to find any error, and of course, didn't.
For every possible pair of values (2**32 total), I multiplied three ways and compared the results. Here is a code fragment.
Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.
Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.
I did not use the build-in serial driver; didn't even use the serial capability of the Smart Pin.
I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.
For every possible pair of values (2**32 total), I multiplied three ways and compared the results. Here is a code fragment.
top QMUL plier, cand 'get CorDIC started
mov x, plier 'set up for by-hand
mov y, cand
call #Multiply 'old prop1 16 x 16
mov byhand, y
mov hardware, plier
mul hardware, cand 'hardware 16 x 16
GetQY temp 'discard cordic top half
GetQX cordic 'lower long
cmp byhand, hardware wcz 'see if all same-same
if_NE jmp #NotEqual
cmp byhand, cordic wcz
if_NE jmp #NotEqual
drvnot #16 'toggle a scope probe
Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.
Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.
I did not use the build-in serial driver; didn't even use the serial capability of the Smart Pin.
I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.
Comments
I've been MULS for FIR filtering and am working on IIR filtering using QMUL with corrections for signed operation.
The CORDIC of course has the advantage that it can be interleaved with other operations. But if for some reason you can't do that then using MUL to do 32x32 multiplies seems to be the best bet.
I've written a Python script to generate the p2asm for this, each section (apart from first and last) has the form:
' <SOSStage 1 -1.49026988061 1.0 / 1 -1.57250768423 0.91164551065> mov a1, co_a1_2 QMUL a1, d1_2 test d1_2, signbit wz if_z mov a1, #0 add acc, a1r mov d2_1, d1_1 GETQY a2r sub a2r, a2 mov a2, co_a2_2 QMUL a2, d2_2 test d2_2, signbit wz if_z mov a2, #0 sub acc, a2r mov d1_1, acc GETQY b1r sub b1r, b1 mov b1, co_b1_2 QMUL b1, d1_2 test d1_2, signbit wz if_z mov b1, #0 sub acc, b1r shl d1_1, #2 GETQY b2r sub b2r, b2 mov b2, co_b2_2 QMUL b2, d2_2 test d2_2, signbit wz if_z mov b2, #0 add acc, b2r sar acc, #2 ' compensate for gain 3.12377533513 GETQY a1r sub a1r, a1
So the pipeline is used without stalls every other available slot. There is only enough time to dosign correction for one argument to the multiply, so I make all the coefficients positive and flip add<->sub
as appropriate in the accumulation.
The basic realization per stage is, in pseudo code (ie Python!):
acc -= a1 * d1 acc -= a2 * d2 d0 = acc acc += b1 * d1 acc += b2 * d2 d2 = d1 d1 = d0
Where d1, d2 are the delay state (per stage), a1,a2,b1,b2 are the coefficients.The scaling between stages is going to be a shift, once I've figured out how to compute the right
shift values for maximum headroom, and there ought to be one more multiply at the end to get
the gain precise (ie all the a0 coefficients are finessed into a series of shifts and this multiply)
You could also try overlapping your cordic operations.
i.e. start a new cordic operation before the last has finished.
Something like this
QMUL a1, d1_2 mov a1, co_a1_2 test d1_2, signbit wz if_z mov a1, #0 add acc, a1r mov d2_1, d1_1 qmul co_a2_2,d2_2 'overlap GETQY a2r sub a2r, a2 mov a2, co_a2_2 ' QMUL a2, d2_2 'moved before getqy test d2_2, signbit wz if_z mov a2, #0 sub acc, a2r mov d1_1, acc qmul co_b1_2,d1_2 overlap GETQY b1r sub b1r, b1 mov b1, co_b1_2 ' QMUL b1, d1_2 'moved before getqy test d1_2, signbit wz if_z mov b1, #0 sub acc, b1r shl d1_1, #2 qmul co_b2_2,d2_2 'overlap GETQY b2r sub b2r, b2 mov b2, co_b2_2 ' QMUL b2, d2_2 'moved before getqy test d2_2, signbit wz if_z mov b2, #0 add acc, b2r sar acc, #2 ' compensate for gain 3.12377533513 GETQY a1r sub a1r, a1
This should shve ~100 clocks off the total time.
A complex multiply needs 4 signed multiplies that can all execute concurrently, and an add and subtract to tie the parts together.
input x + i y, x2 + i y2
output xx + i yy = (x*x2 - y*y2) + i (x*y2 + y*x2)
compl_mul QMUL x, x2 testb x, #31 wc testb x2, #31 wz mov t1, #0 QMUL x, y2 if_c sub t1, x2 if_z sub t1, x testb y2, #31 wz QMUL y, y2 mov t2, #0 if_c sub t2, y2 if_z sub t2, x QMUL y, x2 testb y, #31 wc mov t3, #0 if_c sub t3, y2 if_z sub t3, y testb x2, #31 wz mov t4, #0 if_c sub t4, x2 if_z sub t4, y GETQY xx ' x * x2 GETQX lo1 add xx, t1 ' sign correct nop GETQY yy ' x * y2 GETQX lo2 add yy, t2 ' sign correct nop GETQY xx2 ' y * y2 GETQX lo3 add xx2, t3 ' sign correct subs lo1, lo3 wc GETQY yy2 ' y * x2 GETQX lo4 add yy2, t4 ' sign correct subsx xx, xx2 adds lo2, lo4 wc addsx yy, yy2 shl xx, #8 ' fix8.24 correct shr lo1, #24 or xx, lo1 shl yy, #8 ' fix8.24 correct shr lo2, #24 or yy, lo2 compl_mul_ret ret
I think its about 108 cycles all told. The setup for the signed corrections is done without time penalty as the pipeline runs, so that its pretty close to optimal for a cordic implementation.I've used this in a FFT routine I've been playing with to characterize the ADC modes, producing output via a DAC pin to my 'scope, with output like:
For reference, the obvious way to implement complex multiplication is this:
(xr + xi*j)*(yi + yi*j) = (xr*yr) + j*(xr*yi + xi*yr) - (xi*yi)
Let A = xr*yr, and let B = xi*yi, since these show up in multiple places later.
Now let C = (xr+xi)*(yr+yi) = xr*yr + xr*yi + xi*yr + xi*yi. The middle two terms are the imaginary part of the result, and the first and last terms are A and B, which we already have and can subtract out.
So, zr + zi*j = (A - B ) + j*(C - A - B ).
The disadvantage is that your multipliers have to be one bit longer than your input numbers; this isn't a problem on the P2 if you're multiplying signed numbers, because QMUL is unsigned, so you already have to manage that extra bit anyway. See Wikipedia's article for more details.
Larger multiplies composed of smaller multiplies can be done similarly: just replace j with 2^32 or 2^16. I briefely mentioned in the other multiplication thread that they could use it for long multiplies.
EDIT: Eliminated smiley faces.
Then multiply is written:
z1.z2 = (x1+iy1)(x2+iy2) = (x1x2-y1y2) + i(x1y2+x2y1)
The problem I see with the Karatsuba on the cordic hardware is that the most streamlined way to do signed
correction on a multiply requires the inputs to the multiply to have their original sign bits intact,
which the additions might break by overflowing. You'd be forced to restrict your signed values
to 31 bits.
Given a 32x32->64 bit unsigned multiply of the form you can make it work as a signed multiply like this:
hi, lo = a * b if a<0 hi -= b if b<0 hi -= a
with most of the housekeeping overlapping the pipelined cordic (except a single instructionto correct the value of hi.)
The alternative requires taking the absolute values and recording the sign of result, requiring a 64 bit
negate at the end and 3 or 4 instructions overhead before feeding the pipeline.
With the complex multiply the gain from 4 to 3 multiplies is only 8 cycles for the pipeline, but the
simplicity of not having to deal with overflows probably wins out.
Another way is to upload a new picture with the "Attach a file" right below the edit box and use that. Once uploaded a small icon of the picture will appear. Hover over it, there will be an "Insert Image" you can click on. Clicking that will add the pitcure's URL to your edit box.
test
Martin