While working on my Oberon compiler for the P2, I need a set of multiply subroutines
(or, even better, a set of templates that the compiler can use to generate inline multiply code).
I know that I can use the CORDIC option, but I was hoping to find something substantially faster
that can be used when you can't overlap the CORDIC operation with other code.
I took a routine posted by ersmith in the "Multiply, multiply, multiply" forum post as the basis
for these routines. See mul.spin, which provides:
32 x 32 bit multiply giving 64 bit result - 36 clock cycles
32 x 32 bit multiply giving 32 bit result - 20 clock cycles
32 x 16 bit multiply giving 32 bit result - 12 clock cycles
Also see mul1.spin, which would provide the following, but needs a half register add instruction:
32 x 32 bit multiply giving 64 bit result - 24 clock cycles
32 x 32 bit multiply giving 32 bit result - 16 clock cycles
32 x 16 bit multiply giving 32 bit result - 8 clock cycles
I don't know the P2 instruction set in detail yet. Are there existing instructions in the P2
that allow add and addx to add only a half register (matching the half size multiplier)?
Or are there other instructions that can achieve this?
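
To make the question concrete, here is a rough C sketch of the decomposition involved. It is not the code from mul.spin or mul1.spin, just an illustration that assumes the routines are built from 16 x 16 bit partial products. mul16 is a hypothetical stand-in for a 16 x 16 bit unsigned hardware multiply, and the other function names are made up for this example. The two cross products have to be added at a 16 bit offset, straddling the low and high result registers, and that is the step a "half register add" would shorten.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16 x 16 -> 32 bit unsigned hardware multiply.
   Only the low 16 bits of each operand are used. */
static uint32_t mul16(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFu) * (b & 0xFFFFu);
}

/* 32 x 32 -> 64 bit unsigned multiply from four 16 x 16 partial products. */
static void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t ll = mul16(a, b);              /* a_lo * b_lo, weight 2^0  */
    uint32_t lh = mul16(a, b >> 16);        /* a_lo * b_hi, weight 2^16 */
    uint32_t hl = mul16(a >> 16, b);        /* a_hi * b_lo, weight 2^16 */
    uint32_t hh = mul16(a >> 16, b >> 16);  /* a_hi * b_hi, weight 2^32 */

    uint32_t low = ll;
    uint32_t high = hh;
    uint32_t t;

    /* Each cross product is split: its low half goes into the top of
       'low', its high half plus the carry goes into 'high'. */
    t = low + (lh << 16);
    high += (lh >> 16) + (t < low);
    low = t;

    t = low + (hl << 16);
    high += (hl >> 16) + (t < low);
    low = t;

    *hi = high;
    *lo = low;
}

/* 32 x 32 -> 32 bit: the a_hi * b_hi product and all carries into the
   high word can be dropped. */
static uint32_t mul32x32_32(uint32_t a, uint32_t b)
{
    return mul16(a, b) + ((mul16(a, b >> 16) + mul16(a >> 16, b)) << 16);
}

/* 32 x 16 -> 32 bit: the multiplier has no high half, so only two
   partial products are needed. */
static uint32_t mul32x16_32(uint32_t a, uint32_t b16)
{
    return mul16(a, b16) + (mul16(a >> 16, b16) << 16);
}
```

In the 64 bit case, each cross product costs a shift, an add, another shift and an add with carry in this sketch, which is why the cycle counts differ so much between the two sets of routines.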
Optimizing the multiply routines makes a big difference to the code generated by the compiler
to support arrays of structures and multi-dimensional arrays. My compiler already knows
how to use shifts for any multiply or divide by a power of 2.
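
As a rough illustration of where these multiplies come from (a sketch only, not my compiler's actual code; log2_exact and element_offset are hypothetical names), indexing a row-major two-dimensional array of records needs one multiply for the row and one for the element size, and only the power of 2 case can be turned into a shift:

```c
#include <stdint.h>

/* Returns log2(n) if n is a power of two, otherwise -1. */
static int log2_exact(uint32_t n)
{
    int shift = 0;
    if (n == 0 || (n & (n - 1)) != 0)
        return -1;
    while ((n >>= 1) != 0)
        shift++;
    return shift;
}

/* Byte offset of element [i][j] in a row-major array with 'ncols'
   columns of 'elem_size'-byte records. */
static uint32_t element_offset(uint32_t i, uint32_t j,
                               uint32_t ncols, uint32_t elem_size)
{
    uint32_t index = i * ncols + j;       /* general multiply for the row */
    int shift = log2_exact(elem_size);

    if (shift >= 0)
        return index << shift;            /* power of 2: shift instead */
    return index * elem_size;             /* otherwise a real multiply */
}
```

Record sizes and column counts are usually small constants, so the 32 x 16 bit case is likely the one that matters most here.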
Thanks in advance.