This is very similar to the 32x32-->64 unsigned multiply routine I use in the Tachyon kernel, except that there I can't afford to unroll it. It is still quite zippy: even with the overhead of the bytecode interpreter it takes from 1us to 11.8us. Here is a quick timing check:
123456 255 LAP UM* LAP .LAP SPACE <D> . 3.400us 31481280 ok
EDIT: just to make it clear that there is a 64-bit result:
12345678 $FF00 LAP UM* LAP .LAP SPACE <D> U. 5.400us 805925859840 ok
1,234,567,890 DUP LAP UM* LAP .LAP SPACE <D> U. 9.600us 1524157875019052100 ok
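For anyone who wants to check the figures above off-chip, here is a minimal sketch of a looped (not unrolled) shift-and-add 32x32-->64 unsigned multiply. It is written in Python purely for illustration and is not the Tachyon kernel code. The names `um_star` and `to_double` are made up for this example, and the early exit once the multiplier runs out of set bits is an assumption about how such a loop might behave, not a statement about why the timings above vary.

```python
MASK32 = 0xFFFFFFFF

def um_star(a, b):
    """Unsigned 32x32 -> 64-bit multiply by shift-and-add.

    Returns (low, high): the two 32-bit halves of the 64-bit product.
    Hypothetical helper for illustration; not the Tachyon UM* source.
    """
    addend = a & MASK32      # multiplicand, shifted left once per pass
    multiplier = b & MASK32  # consumed one bit at a time, LSB first
    product = 0
    while multiplier:        # assumed early exit when no set bits remain
        if multiplier & 1:
            product += addend
        addend <<= 1
        multiplier >>= 1
    return product & MASK32, (product >> 32) & MASK32

def to_double(low, high):
    """Join the two 32-bit halves back into one 64-bit value."""
    return (high << 32) | low

# The three operand pairs from the timing checks above
print(to_double(*um_star(123456, 255)))
print(to_double(*um_star(12345678, 0xFF00)))
print(to_double(*um_star(1234567890, 1234567890)))
```

The loop runs at most 32 passes, and the result is split into two 32-bit halves the same way a double number occupies two stack cells.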
With my next board I will have a Propeller dedicated to running Tachyon, just to be able to follow Peter's advice. Peter, would you please support me? Urgent