Multiply, multiply, multiply

tomcrawfordtomcrawford Posts: 1,049
edited 2019-01-25 - 16:51:21 in Propeller 2
As an exercise, I tested the P2 16 x 16 MUL instruction against the P1 subroutine and the P2 CORDIC. I didn't actually expect to find any error, and of course, didn't.

For every possible pair of values (2**32 total), I multiplied three ways and compared the results. Here is a code fragment.
top   QMUL  plier, cand     'get CorDIC started
      mov   x, plier        'set up for by-hand
      mov   y, cand      
      call  #Multiply       'old prop1 16 x 16
      mov   byhand, y
      mov   hardware, plier
      mul   hardware, cand      'hardware 16 x 16
      GetQY temp             'discard cordic top half
      GetQX cordic            'lower long
      cmp   byhand, hardware wcz   'see if all same-same
if_NE jmp   #NotEqual
      cmp   byhand, cordic wcz
if_NE jmp   #NotEqual
      drvnot #16              'toggle a scope probe

Each iteration takes 0.75 microseconds; each 64K takes 49 millisecs plus 4.6 millisecs to show progress on a serial LCD. The whole program takes just under an hour.

Notes:
I did completely overlap the CORDIC time with the subroutine. Didn't worry about another cog over-writing my result.

I did not use the build-in serial driver; didn't even use the serial capability of the Smart Pin.

I did modify the P1 subroutine to use REP, and sure enough, it got faster. Two instructions in the tight loop versus three. But it doesn't matter because the instruction is exactly equivalent and takes only 2 cycles.
Re-inventing the wheel is not a waste of time if, when you are done, you understand why it is round.

Comments

  • Excellent! I might have a go at modifying this for MULS at somepoint. I guess there's QSQRT, QDIV, QFRAC too

    I've been MULS for FIR filtering and am working on IIR filtering using QMUL with corrections for signed operation.
  • The spin2gui samples/ directory contains a program called multest.spin2 that does timing tests for 32x32 multiply using a software multiply, CORDIC, and mul instruction. The software multiply has an early exit so it does well for small values. It takes between 28 and 981 cycles. CORDIC is always 70 cycles, and the 32x32 multiply constructed from 16x16 MUL instructions always takes 44 cycles (all of this is running from HUB, so in COG memory presumably the software and MUL implementations could go a bit faster).

    The CORDIC of course has the advantage that it can be interleaved with other operations. But if for some reason you can't do that then using MUL to do 32x32 multiplies seems to be the best bet.
  • The nice thing about IIR digital filters is a regular structure requiring (ideally 32 bit) 4 multiplies per second-order-section, and you can pipeline multiplies every 8 instructions leaving just enough spare instructions for the signed correction and housekeeping. Thus each section is about 1 cordic pipeline delay so you can reuse variables per section.

    I've written a Python script to generate the p2asm for this, each section (apart from first and last) has the form:
    '  <SOSStage 1 -1.49026988061 1.0 / 1 -1.57250768423 0.91164551065>
                    mov     a1, co_a1_2
                    QMUL    a1, d1_2
                    test    d1_2, signbit  wz
            if_z    mov     a1, #0
                    add     acc, a1r
                    mov     d2_1, d1_1
                    GETQY   a2r
                    sub     a2r, a2
                    mov     a2, co_a2_2
                    QMUL    a2, d2_2
                    test    d2_2, signbit  wz
            if_z    mov     a2, #0
                    sub     acc, a2r
                    mov     d1_1, acc
                    GETQY   b1r
                    sub     b1r, b1
                    mov     b1, co_b1_2
                    QMUL    b1, d1_2
                    test    d1_2, signbit  wz
            if_z    mov     b1, #0
                    sub     acc, b1r
                    shl     d1_1, #2
                    GETQY   b2r
                    sub     b2r, b2
                    mov     b2, co_b2_2
                    QMUL    b2, d2_2
                    test    d2_2, signbit  wz
            if_z    mov     b2, #0
                    add     acc, b2r
                    sar     acc, #2  ' compensate for gain 3.12377533513
                    GETQY   a1r
                    sub     a1r, a1
    
    So the pipeline is used without stalls every other available slot. There is only enough time to do
    sign correction for one argument to the multiply, so I make all the coefficients positive and flip add<->sub
    as appropriate in the accumulation.

    The basic realization per stage is, in pseudo code (ie Python!):
        acc -= a1 * d1
        acc -= a2 * d2
        d0 = acc
        acc += b1 * d1
        acc += b2 * d2
        d2 = d1
        d1 = d0
    
    Where d1, d2 are the delay state (per stage), a1,a2,b1,b2 are the coefficients.
    The scaling between stages is going to be a shift, once I've figured out how to compute the right
    shift values for maximum headroom, and there ought to be one more multiply at the end to get
    the gain precise (ie all the a0 coefficients are finessed into a series of shifts and this multiply)
  • @Mark_T
    You could also try overlapping your cordic operations.
    i.e. start a new cordic operation before the last has finished.
    Something like this
                    QMUL    a1, d1_2
                    mov     a1, co_a1_2
                    test    d1_2, signbit  wz
            if_z    mov     a1, #0
                    add     acc, a1r
                    mov     d2_1, d1_1
    		qmul	co_a2_2,d2_2	'overlap
                    GETQY   a2r
                    sub     a2r, a2
                    mov     a2, co_a2_2
          '          QMUL    a2, d2_2	'moved before getqy
                    test    d2_2, signbit  wz
            if_z    mov     a2, #0
                    sub     acc, a2r
                    mov     d1_1, acc
    		qmul	co_b1_2,d1_2	overlap
                    GETQY   b1r
                    sub     b1r, b1
                    mov     b1, co_b1_2
           '         QMUL    b1, d1_2	'moved before getqy
                    test    d1_2, signbit  wz
            if_z    mov     b1, #0
                    sub     acc, b1r
                    shl     d1_1, #2
    		qmul	co_b2_2,d2_2	'overlap
                    GETQY   b2r
                    sub     b2r, b2
                    mov     b2, co_b2_2
            '        QMUL    b2, d2_2	'moved before getqy
                    test    d2_2, signbit  wz
            if_z    mov     b2, #0
                    add     acc, b2r
                    sar     acc, #2  ' compensate for gain 3.12377533513
                    GETQY   a1r
                    sub     a1r, a1
    
    

    This should shve ~100 clocks off the total time. :)



    Melbourne, Australia
  • Mark_TMark_T Posts: 1,975
    edited 2019-01-27 - 01:54:04
    I do - that fragment has 3 cordic ops in flight at the start, a max of 4 in flight at any point. There is a header and tail section around the stages that sets up the first section and finalizes the last section. The excerpt is setting up section N+1 as it finalizes section N of the filter. See the thread: forums.parallax.com/discussion/169636/digital-filtering-approaches-on-the-prop2#latest
  • Talking of multiplies, here's a cordic pipelined complex multiply, using fix8.24 format real and imaginary parts.
    A complex multiply needs 4 signed multiplies that can all execute concurrently, and an add and subtract to tie the parts together.

    input x + i y, x2 + i y2
    output xx + i yy = (x*x2 - y*y2) + i (x*y2 + y*x2)
    compl_mul	QMUL	x, x2
    		testb	x, #31  wc
    		testb	x2, #31  wz
    		mov	t1, #0
    
    		QMUL	x, y2
    	if_c	sub	t1, x2
    	if_z	sub	t1, x
    		testb	y2, #31  wz
    
    		QMUL	y, y2
    		mov	t2, #0
    	if_c	sub	t2, y2
    	if_z	sub	t2, x
    
    		QMUL	y, x2
    		testb	y, #31  wc
    		mov	t3, #0
    	if_c	sub	t3, y2
    	if_z	sub	t3, y
    		testb	x2, #31  wz
    		mov	t4, #0
    	if_c	sub 	t4, x2
    	if_z	sub	t4, y
    		
    
    		GETQY	xx   	 ' x * x2
    		GETQX	lo1
    		add	xx, t1   ' sign correct
    		nop
    
    		GETQY	yy	 ' x * y2
    		GETQX	lo2
    		add	yy, t2   ' sign correct
    		nop
    		
    		GETQY	xx2	 ' y * y2
    		GETQX   lo3
    		add	xx2, t3  ' sign correct
    		subs	lo1, lo3  wc
    		
    		GETQY	yy2	 ' y * x2
    		GETQX	lo4
    		add	yy2, t4  ' sign correct
    		subsx	xx, xx2
    
    		adds	lo2, lo4  wc
    		addsx	yy, yy2
    
    		shl	xx, #8  ' fix8.24 correct
    		shr	lo1, #24
    		or	xx, lo1
    
    		shl	yy, #8  ' fix8.24 correct
    		shr	lo2, #24
    		or	yy, lo2
    
    compl_mul_ret	ret
    
    I think its about 108 cycles all told. The setup for the signed corrections is done without time penalty as the pipeline runs, so that its pretty close to optimal for a cordic implementation.

    I've used this in a FFT routine I've been playing with to characterize the ADC modes, producing output via a DAC pin to my 'scope, with output like: fft.jpg
  • In case anyone wonders, that's about 15dB per vertical division, 129 frequency samples (ie 256 FFT), from ADC running at about 48kSPS (ADC fed with a single sine wave from another pin in DAC dither mode). 2nd harmonic about 60dB down, fundamental 75dB above noise (12 ENOB?). Blackman window.
  • cgraceycgracey Posts: 11,012
    edited 2019-02-03 - 03:59:19
    Neat, Mark_T!
  • Still a little bit of work to go before its an HP35670 FFT analyzer, but its pretty pleasing :)
  • ElectrodudeElectrodude Posts: 1,275
    edited 2019-02-05 - 20:48:20
    I don't have my P2 with me to implement this, but a complex multiply can be done in only three multiplies using Karatsuba multiplication.

    For reference, the obvious way to implement complex multiplication is this:
    (xr + xi*j)*(yi + yi*j) = (xr*yr) + j*(xr*yi + xi*yr) - (xi*yi)

    Let A = xr*yr, and let B = xi*yi, since these show up in multiple places later.

    Now let C = (xr+xi)*(yr+yi) = xr*yr + xr*yi + xi*yr + xi*yi. The middle two terms are the imaginary part of the result, and the first and last terms are A and B, which we already have and can subtract out.

    So, zr + zi*j = (A - B ) + j*(C - A - B ).

    The disadvantage is that your multipliers have to be one bit longer than your input numbers; this isn't a problem on the P2 if you're multiplying signed numbers, because QMUL is unsigned, so you already have to manage that extra bit anyway. See Wikipedia's article for more details.

    Larger multiplies composed of smaller multiplies can be done similarly: just replace j with 2^32 or 2^16. I briefely mentioned in the other multiplication thread that they could use it for long multiplies.

    EDIT: Eliminated smiley faces.
  • ElectrodudeElectrodude Posts: 1,275
    edited 2019-02-06 - 03:26:08
    Also, can't you replace each complex multiply by a twiddle factor in an FFT with a single QROTATE?
  • That's really hard to read since the convention is to call a complex number z = x+iy as its the point (x,y) on the Argand diagram.

    Then multiply is written:
    z1.z2 = (x1+iy1)(x2+iy2) = (x1x2-y1y2) + i(x1y2+x2y1)

    The problem I see with the Karatsuba on the cordic hardware is that the most streamlined way to do signed
    correction on a multiply requires the inputs to the multiply to have their original sign bits intact,
    which the additions might break by overflowing. You'd be forced to restrict your signed values
    to 31 bits.

    Given a 32x32->64 bit unsigned multiply of the form
      hi, lo = a * b
    
    you can make it work as a signed multiply like this:
    hi, lo = a * b
    if a<0
      hi -= b
    if b<0
      hi -= a
    
    with most of the housekeeping overlapping the pipelined cordic (except a single instruction
    to correct the value of hi.)

    The alternative requires taking the absolute values and recording the sign of result, requiring a 64 bit
    negate at the end and 3 or 4 instructions overhead before feeding the pipeline.

    With the complex multiply the gain from 4 to 3 multiplies is only 8 cycles for the pipeline, but the
    simplicity of not having to deal with overflows probably wins out.
  • Would someone please explain the steps to add an image. I click on image and it asks for a url. I would like the individual steps please. I do not profess to be all knowing :):):)
  • You need to know where the file is. Mark's one above is a URL to a picture he's placed on another website.

    Another way is to upload a new picture with the "Attach a file" right below the edit box and use that. Once uploaded a small icon of the picture will appear. Hover over it, there will be an "Insert Image" you can click on. Clicking that will add the pitcure's URL to your edit box.
    "There's no huge amount of massive material
    hidden in the rings that we can't see,
    the rings are almost pure ice."
  • Thanks I will attempt. Welcome the instruction . If I screw up let me know
  • spin2gui%20error.JPGspin2gui%20error.JPG

    test
    659 x 342 - 31K
  • Hey that worked. Thanks for the help

    Martin
Sign In or Register to comment.