PDA

View Full Version : Fast PASM multiply by semi-static multipliers



Phil Pilgrim (PhiPi)
11-02-2011, 05:46 AM
This first post is merely peripheral to the title subject but lays the groundwork for covering that topic in the second post.

In my Hilbert Transform (http://forums.parallax.com/showthread.php?132977) and FIR2PASM (http://forums.parallax.com/showthread.php?133173[/url) threads, I mentioned and made considerable use of a number representation called Canonical Signed Digits (CSD). This representation is used to minimize the number of steps required to multiply a variable by a constant. For example, to multiply a number by 15 (%01111) requires four adds and three shifts. But you can get the same result from multiplying by 16 (%10000) and subtracting one (%00001), which requires one add, one subtract and one shift. By breaking a number into its positive and negative CSD components (%10000 and %00001 in the previous example), it is possible to minimize the number of steps required to multiply by any constant value by an average of 33%. In fact, the worst case is reduced by half.

Every binary number has a unique CSD representation, such that the total number of "1" bits is minimized, and the bitwise OR of the positive and negative components has no runs of two or more sequential "1" bits. This means that the total number of "1" bits in a 32-bit CSD representation cannot exceed 16. Once a multiplier's CSD representation is found, it's simply a matter of writing an unrolled "loop" of shifts, adds, and subtracts to get a product. For example, x * 15 reduces to:



neg product,x
shl x,#4
add product,x


In the FIR2PASM example, the PASM code to perform the coefficient multiplications is generated by the Perl program that resides on the server. This is possible, since the coefficients never change. So the Perl program can do the CSD conversions and spit out the code.

But what happens if you have constant multipliers that aren't so constant? For example, suppose you have a whole bunch of points in 2D space that need to rotate about the origin by the same angle -- but an angle which can change. Or, similarly, you need to compute the burst-relative IQ color components of a pixel for which you have absolute color burst IQ components and absolute pixel IQ components. In the latter case the color burst components (burst(i) and burst(q) in the following formula) change only once per line, but the pixel components change for every pixel:


I = k ( pixel(i) burst(q) - pixel(q) burst(i) )
Q = -k ( pixel(q) burst(q) + pixel(i) burst(i) )

So, if you have multipliers that change infrequently compared to the number of times they're used, but change nonetheless, is there a way to produce the optimum multiplication code on the fly in the Propeller? Yes, of course there is! But first, you have to compute the CSD coefficients of the new multiplier.

Attached is a program that does the CSD conversion in both Spin and PASM for comparison. Both are based on the following subroutine that I used in the Perl FIR2PASM code (for all you Perl-heads out there):



sub bin2csd {
my $b = shift;
my $orig = $b;
my ($p, $c) = (0, 0);
foreach my $nh (0 .. 31) {
my $i = 31 - $nh;
my $twobits = $b & 3;
if ($twobits == 0) {
if ($c) {
$p |= 0x8000_0000;
$c = 0;
}
} elsif ($twobits == 1) {
$p |= 0x8000_0000 unless $c
} elsif ($twobits == 3) {
$c = 1
}
if ($i) {
$p >>= 1;
$b >>= 1;
}
}
return ($p, $p - $orig)
}


In the next post (tomorrow, hopefully), I'll show how a PASM program can write it's own multiplication subroutine on the fly.

-Phil

Phil Pilgrim (PhiPi)
11-03-2011, 01:08 AM
Attached is the PASM code that produces the multiplication routine and then tests it. It keeps track of how many instruction cycles were required to produce the code and how many cycles each execution of the code requires. From this you can get an idea whether the time savings in execution offsets the time required to create the routine when a new multiplier is required.

Here's sample output from one iteration:



Multiplier = 11111111111111111111010010000011 = -2941
Instruction cycles to produce = 468
Instructions in mult routine = 10

101001 0010 1111 010000001 001111111 neg :product,:mpcd
001011 0011 1111 001111111 000000010 shl :mpcd,#2
100000 0010 1111 010000001 001111111 add :product,:mpcd
001011 0011 1111 001111111 000000101 shl :mpcd,#5
100000 0010 1111 010000001 001111111 add :product,:mpcd
001011 0011 1111 001111111 000000011 shl :mpcd,#3
100000 0010 1111 010000001 001111111 add :product,:mpcd
001011 0011 1111 001111111 000000010 shl :mpcd,#2
100001 0010 1111 010000001 001111111 sub :product,:mpcd
010111 0000 1111 000000000 001001110 ret

-2941 * 6314 = -18569474 = -18569474
-2941 * 4959 = -14584419 = -14584419
-2941 * 786 = -2311626 = -2311626
-2941 * -4498 = 13228618 = 13228618
-2941 * -7132 = 20975212 = 20975212
-2941 * -1667 = 4902647 = 4902647
-2941 * -2037 = 5990817 = 5990817
-2941 * -4724 = 13893284 = 13893284
-2941 * -2319 = 6820179 = 6820179
-2941 * -2214 = 6511374 = 6511374


-Phil

Cluso99
11-03-2011, 04:57 AM
Neat Phil. I know when I do maths in my head, I do things like this on complicated values. (I guess I am an over the hill dying breed)

But in reality, I have never thought about doing it in a micro. This has some really interesting prospects for speeding up programs, particularly the sort you are working on ;)

pjv
11-03-2011, 03:36 PM
Phil;

Very interesting, thanks for posting.

Cheers,

Peter (pjv)

ericball
11-03-2011, 04:47 PM
The generic form of this is Booth's multiplication algorithm http://en.wikipedia.org/wiki/Booth_multiplication_algorithm. For the Propeller the generic form is less efficient than the standard SHL WC, IF_C ADD loop, but if the multiplier is known the shift/add/sub form can be much faster.