Lokking for signed 16 bit multiply and divide in PASM

ManAtWork · 2010-02-04 17:41

Hello,

I'm sure this has been discussed before but I haven't found anything with the forum search. I'm looking for a signed 16*16 multiply and 32:16 division routine in assembler. Well actually I only need the upper 16 bits of the multiplication result and a 16:16 bit divide would also do if I always get 16 significant result bits.·What I'd like to do is a simplified but fast floating point library.

If necessary I could do it on my own but I'm sure somebody has optimized it better.
Any links or hints welcome.

Mike Green · 2010-02-04 18:07

The routines in the Propeller Manual (V1.1) are for unsigned numbers. It's very easy to put a signed -> unsigned wrapper around the unsigned routines. You just keep a copy of the original multiplier and multiplicand, then use the ABS instruction to get an absolute value. After doing the unsigned multiplication, you XOR the multiplier and multiplicand to get the sign of the product which you shift into C. You then can use a NEGC instruction to get the signed result. Division would be similar.

The existing floating point library is pretty fast. Why not use that?

ManAtWork · 2010-02-04 20:18

Hello Mike,

ok, the sign wrapper is the best method? Good to know, I thought there was a more elegant way.

Do you mean the routines in float32.spin? How fast is pretty fast? I have to calculate a PID·control·loop·once every 20 µs, possibly with feed forward, so that means <5µs for one multiplication and one add. I'd like to use floating point format because I think it would be actually faster than fixed point. The reason is that it's hard to foresee the exact range of the input and coefficients. These are "user inputs" and I would have to do a lot of careful shifts and clipping to avoid overflows. And I would have to calculate more bits to achieve the same accuracy when the·scale was·not set optimally. With floating point I don't need to care.

Ok, forget about division. I only need that to precalculate the coefficients. That's not real time and I can use the spin routines there. For the time critical code I think I will take the·source from float32 and modify it to fit my needs. I didn't look at it earlier because I had an aversion against IEEE math. Once I saw a floating point emulation library for the 68k/coldfire processor and it looked awfully complex. There was a lot of code to handle special cases like denormalized numbers, not-a-number and quasi-infinity. Nonsense, nobody needs that except for scientific applications maybe. The popeller code seems to be quite lean and straight forward.

Thanks

Mike Green · 2010-02-04 21:04

How about using the Log / AntiLog routines in the appendix for the multiplication? Assuming that you've already converted the coefficients to Log form, you'd need to do one Log conversion and one AntiLog conversion which would take less than 3us. The additions and handling of signs could easily be done in less than 1us for a total computation time of under 4us.

ManAtWork · 2010-02-05 08:40

Hmm, I could store the float numbers in unpacked format. Multiplication takes two operations per bit with the loop unrolled plus some overhead. So one 16*16 bit multiplication should take aproxximately 2µs (128 cycles + overhead). Shift and add should take less than 1µs. So there's no real benefit using log/antilog conversions. I'll try it out.

ManAtWork · 2010-02-08 14:43

Unfortunatelly, I forgot that the float numbers need to be normalized which takes some additional time. The numexp routine from the Log table example seems to be the fastest way to do that.
BTW, when converting back to integer... What happens if I shift more than 31 bits. For example what does

shr x,b
...
b long 33

do? Is it the same as shr x,#1 or does it actuall clear x? The manual is unclear about that. Would be nice if it was the latter. Otherwise I had to check for underflows. However, FTrunc checks for exponent<-32 so I fear I also have to.

Mike Green · 2010-02-08 14:51

The shift instructions use the least significant 5 bits of the shift count to do the shift. They ignore the rest, so a count of 33 is the same as a count of 1.

The mantissas in floating point operations don't have to be normalized. The normalization is just to maintain the greatest significance. Without the normalization, you've got fixed point arithmetic.

Post Edited (Mike Green) : 2/8/2010 2:56:43 PM GMT

ManAtWork · 2010-02-08 16:33

Hello Mike,

thanks for the quick response. You are right, I don't necessarily have to normalize the P and D inputs because they have less than 16 bits of significance. To·left justify them·doesn't improve·precision. However the·integral part can have more than 16 bits. I'll do at least the first 3 checks (num4 to num2) to improve the worst case. The coefficients are constant and are already stored normalized, of course.

Now I think, I have everything together to write and test the code.

·

Thanks again for your help.

Dave Hein · 2010-02-08 16:46

ManAtWork,

You may be able to use a method of block normalization.· It sounds like from your requirements that you need to use 16-bit multiplications to achieve your loop time constraint.· I believe the unsigned 16-bit multiply routine in the propeller manual uses 208 cycles (3*16 + 4) * 4), which is 2.6 usecs.· That gives you 2.4 usecs·for overhead and other misc stuff.

You would want to treat your coefficients and data as normalized data with values between 1.0 and -1.0.· An integer value of $7fff would represent the maximum positive number of 0.99997 and the maximum negative number could be $8000, which would be -1.0.· However, you would probably need to limit it to $8001, or -0.99997 to prevent problems in the unsigned 16-bit multiply routine.

It sounds like you are summing the products of 4 coefficients with 4 samples of data.· You will have to normalize your coefficients so the sum of the absolute values is less than 1.0 to prevent overflows.· The block of 4 samples should be normalized so the maximum absolute value is less than 1.0.· The normalization factor should be a power of 2 so you can use shifts to implement it.

So the steps are as follows:

1. Get the coefficients and pre-normalize the sum of absolute values to less than 1.0
2. Get four samples and compute the maximum absolute value
3. Normalize the four samples so the maximum absolute value is less than 1.0
4. Compute the sum of products
5. Shift the result down by the sum of the coefficent and data normalization shift values

You can use the code from the first part of the numexp routine to determine the normalization shift value.

Dave
·

Dave Hein · 2010-02-08 16:54

ManAtWork,

If you accumulate the 32-bit product into a 32-bit accumulator you would not need to normalize the data.· However, if you accumulated only the upper 16· bits you would need to normalize.· The easiest and more accurate method would be to use the 32-bit products and accumulator.

Dave

Lokking for signed 16 bit multiply and divide in PASM

Comments