Phil Pilgrim (PhiPi)

10-22-2006, 07:07 AM

In a recent thread (http://forums.parallax.com/showthread.php?p=609053) on the CORDIC algorithm, Chip demonstrated his derivation of a quick algorithm for multiplication of any multiplicand by the constant multiplier 0.607252935, rounded up to the nearest 16-bits, or $9B75. By recognizing a repeating bit pattern in the multiplier, he was able to accomplish the feat in only twelve assembly instructions. I thought it might be possible to automate, for any 16-bit constant fraction, the generation of fairly optimal code to perform multiplication by that fraction. This is because the search space, for a certain class of solutions at least, is not terribly large and can be scanned exhaustively for the best solution in a matter of minutes.

The class of solutions I considered encompasses what I call one- and two-tier algorithms. A one-tier algorithm is one in which the multiplicand is shifted repeatedly and either added to or subtracted from the product at each step. A two-tier algorithm is one in which a partial product is formed from the multiplier and multiplicand, as in the one-tier case, after which this partial product is shifted and added to or subtracted from the product. Chip's algorithm, by contrast, is three-tiered: a second partial product was formed by shifting and adding the first one before being used to operate on the final product. My feeling was that, by trying to encompass three-tiered algorithms, the search space would simply be too large to scan exhaustively.

The attached Spin program takes a single variable: a sixteen-bit fraction for which the multiply algorithm is to be derived. It uses TV_wtext to display its progress. At the end, it doesn't actually display any assembler code, but rather a pattern and a mask from which the desired code can be derived. A typical result will look something like this:

DC85 1101110010000101 <- Target

8000 ++0+++00+0000+0+ 16

8000 0-0+++00+0000+0+ 15

8000 00-0-+00+0000+0+ 13

8000 00-00-00+0000+0+ 11

The first line represents the hex and binary expression of the chosen fraction. Each line after that displays a successively better solution to the problem, and consists of: A pattern in hex.

An add/subtract mask.

The number of assembly instructions required to implement the solution.

The assembly program represented by the pattern and mask will first create a subproduct based on the pattern (if the pattern contains more than a single "one"-bit), then shift the subproduct to each successive non-zero position in the mask and perform the indicated operation. For example, the best solution to the above problem would be:

mov t,x

sar t,#3

sub x,t

sar t,#3

sub x,t

sar t,#3

add x,t

sar t,#5

add x,t

sar t,#2

add x,t

Likewise, the best two-tier solution for Chip's constant, would look like this (along with the implied code):

9B75 1001101101110101 <- Target

8800 +00+00+00+0+0000 13

sar x,#1 'x := x * %0.10000

mov t,x 'Copy x to t.

sar t,#4 'Shift it right by four

add x,t 'Add back to x, yielding 0.10001 of x

mov t,x 'Copy .10001 x to t.

sar t,#3 'Shift and add per the derived mask.

add x,t

sar t,#3

add x,t

sar t,#3

add x,t

sar t,#2

add x,t

As you can see, Chip's three-tier solution beats the best two-tier solution by one instruction.

In gauging how many instructions each solution requires, several code templates are pulled into play. The chosen template is determined by how many "one"-bits there are in the pattern, how many non-zero operations are in the mask, and the sign of the first non-zero operation. These templates are all listed in the source code.

Certainly, a lot more could be done with this. No attempt at any heuristic improvement was made. It just generates all the combinations and tests them. But, in a pinch, even that may be useful.

Hopefully, I've haven't made any blunders here, but I haven't given it a whole lot of testing either. So use at your own risk, and test your results! Also, if anyone can improve on the efficiency fo the templates, please let me know, and I'll include them in a new version.

Cheers!

Phil

Update (2006.10.22a): Replaced the uncommented Spin file with an archive containing the same file with comments.

Update (2006.10.22b): Fixed a bug that prevented the algortihm from terminating properly.

Post Edited (Phil Pilgrim (PhiPi)) : 10/23/2006 5:08:23 AM GMT

The class of solutions I considered encompasses what I call one- and two-tier algorithms. A one-tier algorithm is one in which the multiplicand is shifted repeatedly and either added to or subtracted from the product at each step. A two-tier algorithm is one in which a partial product is formed from the multiplier and multiplicand, as in the one-tier case, after which this partial product is shifted and added to or subtracted from the product. Chip's algorithm, by contrast, is three-tiered: a second partial product was formed by shifting and adding the first one before being used to operate on the final product. My feeling was that, by trying to encompass three-tiered algorithms, the search space would simply be too large to scan exhaustively.

The attached Spin program takes a single variable: a sixteen-bit fraction for which the multiply algorithm is to be derived. It uses TV_wtext to display its progress. At the end, it doesn't actually display any assembler code, but rather a pattern and a mask from which the desired code can be derived. A typical result will look something like this:

DC85 1101110010000101 <- Target

8000 ++0+++00+0000+0+ 16

8000 0-0+++00+0000+0+ 15

8000 00-0-+00+0000+0+ 13

8000 00-00-00+0000+0+ 11

The first line represents the hex and binary expression of the chosen fraction. Each line after that displays a successively better solution to the problem, and consists of: A pattern in hex.

An add/subtract mask.

The number of assembly instructions required to implement the solution.

The assembly program represented by the pattern and mask will first create a subproduct based on the pattern (if the pattern contains more than a single "one"-bit), then shift the subproduct to each successive non-zero position in the mask and perform the indicated operation. For example, the best solution to the above problem would be:

mov t,x

sar t,#3

sub x,t

sar t,#3

sub x,t

sar t,#3

add x,t

sar t,#5

add x,t

sar t,#2

add x,t

Likewise, the best two-tier solution for Chip's constant, would look like this (along with the implied code):

9B75 1001101101110101 <- Target

8800 +00+00+00+0+0000 13

sar x,#1 'x := x * %0.10000

mov t,x 'Copy x to t.

sar t,#4 'Shift it right by four

add x,t 'Add back to x, yielding 0.10001 of x

mov t,x 'Copy .10001 x to t.

sar t,#3 'Shift and add per the derived mask.

add x,t

sar t,#3

add x,t

sar t,#3

add x,t

sar t,#2

add x,t

As you can see, Chip's three-tier solution beats the best two-tier solution by one instruction.

In gauging how many instructions each solution requires, several code templates are pulled into play. The chosen template is determined by how many "one"-bits there are in the pattern, how many non-zero operations are in the mask, and the sign of the first non-zero operation. These templates are all listed in the source code.

Certainly, a lot more could be done with this. No attempt at any heuristic improvement was made. It just generates all the combinations and tests them. But, in a pinch, even that may be useful.

Hopefully, I've haven't made any blunders here, but I haven't given it a whole lot of testing either. So use at your own risk, and test your results! Also, if anyone can improve on the efficiency fo the templates, please let me know, and I'll include them in a new version.

Cheers!

Phil

Update (2006.10.22a): Replaced the uncommented Spin file with an archive containing the same file with comments.

Update (2006.10.22b): Fixed a bug that prevented the algortihm from terminating properly.

Post Edited (Phil Pilgrim (PhiPi)) : 10/23/2006 5:08:23 AM GMT