Thanks. That's what I thought: the conversions are ultimately done in software functions, not real hardware machine instructions. I had to ask because I have never seen a CPU with such conversion instructions.
As far as I know, only COBOL used BCD, called "PACKED DECIMAL" I believe. I would be very surprised if any modern system uses a DAA instruction anywhere on the planet. Perhaps on board the Voyager probes, but they're out of the solar system now (sort of).
Actually, specialist financial software in bank systems may still be doing this, but only to avoid round-off errors when paying billionaires' bonuses!
Comments
I've tried to imagine all kinds of unary operations that would be useful. Some are practical to implement, some aren't. How about this:
INCPRIM/DECPRIM D - go to next/previous prime number.
Maybe a BLMASK variant?
Create a mask of n bits starting from the MSB? Saves a shift instruction.
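To make the suggestion concrete, here is a minimal Python sketch of what such a variant might compute. The name `msb_mask` and the 32-bit default width are my own assumptions for illustration, not an actual P2 opcode or its documented behaviour.

```python
def msb_mask(n: int, width: int = 32) -> int:
    """Return a width-bit mask with the top n bits set.

    Hypothetical semantics for the proposed variant, not a real opcode.
    """
    if not 0 <= n <= width:
        raise ValueError("n must be between 0 and width")
    low_mask = (1 << n) - 1          # n ones at the LSB end
    return low_mask << (width - n)   # the shift the proposal would save


if __name__ == "__main__":
    print(hex(msb_mask(4)))
```

The second line of the function body is the extra shift that a dedicated instruction would fold in.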
I love it. Don't the crypto cracking guys need instructions like that?
How about:
FIBO fib, n ' Calculate the nth number in the Fibonacci sequence.
That would get one of our benchmarks up to speed.
I would think executive bonuses today were more prone to overflow error than rounding error.
"A High Performance Binary TO BCD Converter for Decimal Multiplication", by Jairaj Bhattacharya, Aman Gupta, Anshul Singh, 2010
Proposed: Delay 1.57ns, Power 557.09nW, Area 1862 um^2, power-delay 874.66E-18
Srihari: Delay 1.85ns, Power 661.39nW, Area 2087 um^2, power-delay 1223.57E-18
ABSTRACT: "Decimal data processing applications have grown exponentially in recent years thereby increasing the need to have hardware support for decimal arithmetic. Binary to BCD conversion forms the basic building block of decimal digit multipliers. This paper presents novel high speed low power architecture for fixed bit binary to BCD conversion which is at least 28% better in terms of power-delay product than the existing designs."
That's interesting. If you maintain all your data in BCD, and do all your math in BCD, there is no conversion issue.
How many cycles for the complete operation?
This code has two div32u, feeds MSB first, and also seems light on loops?
2^32/10000000 = 429.49... (2^32 is a 10-digit BCD result)
The algorithm I gave above in #26 needs one div32u per digit.
It feeds LSB first, so it will need a post-rotate, but it can exit early on 0 if needed.
Which is fastest may depend on the delay of div32u?
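Post #26 is not reproduced here, so the following is only my guess at the general shape of a one-divide-per-digit conversion, written in Python rather than P2 assembly. The function name and the 10-digit default are assumptions. Digits come out least significant first and are fed in at the top nibble, which is why a final shift (the "post-rotate") is needed to right-align the result.

```python
def bin_to_bcd_by_division(value: int, digits: int = 10) -> int:
    """Pack the decimal digits of value into nibbles, one divide per digit."""
    if not 0 <= value < 10 ** digits:
        raise ValueError("value does not fit in the requested digit count")
    top = 4 * (digits - 1)   # bit position of the most significant nibble
    bcd = 0
    produced = 0
    while True:
        value, digit = divmod(value, 10)     # one divide per decimal digit
        bcd = (bcd >> 4) | (digit << top)    # feed the digit in at the top
        produced += 1
        if value == 0:                       # early exit once nothing is left
            break
    # Post-adjust: the digits are left-justified, so move them down.
    return bcd >> (4 * (digits - produced))


if __name__ == "__main__":
    print(hex(bin_to_bcd_by_division(1234)))
```

Small values finish after only a few divides, which is the early-exit advantage mentioned above.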
I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.
I found an 8-bit uC library that I've tested at ~80 bytes and 1800 cycles @ 10 digits, for 32-bit binary to 10-digit BCD.
When the above P2 code is corrected for 10 digits, it will come in at just under half the size and 4.5x the speed (cycle based), or ~45x the speed (MHz adjusted). Interesting.
Correct, it's an 8-digit BCD example.
Not sure if REPS can be nested?
Edited to add eswap4 in v3, which pulls it under the magical 1 us (assuming 200 MHz).
There is no RORNIB instruction, but there is an ESWAP4 that does an endian swap on the nibble order. That could be used after the loop.
Added as v3 above. It shaves a little more off time and size, but I'm unsure of the opcode details of ESWAP4; the data hints it may have a count field?
It's basically a 'move' instruction. What is in S goes into D with nibbles reversed. There are a lot of unary instructions in this D,S format.
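As a rough model of that description (S copied into D with the nibble order reversed), here is a small Python sketch. It only mirrors the wording above for a 32-bit value; the function name is borrowed from the mnemonic and the real opcode encoding is not modelled.

```python
def eswap4(s: int) -> int:
    """Return the 32-bit value s with its eight nibbles in reverse order."""
    s &= 0xFFFFFFFF
    d = 0
    for _ in range(8):
        d = (d << 4) | (s & 0xF)   # lowest nibble of s becomes the next nibble of d
        s >>= 4
    return d


if __name__ == "__main__":
    print(hex(eswap4(0x12345678)))
```

Used once after the conversion loop, this replaces a per-digit rotate inside the loop.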
Optimized for speed it takes only 171 cycles, but 6 cog-longs more:
Andy
Now, that's the way to make it go faster!
What does BCDADJ do? I don't understand the principle of operation here. How can you keep shifting one bit at a time from the value to the answer, and have BCDADJ do anything meaningful to the answer, when its BCD values are nibbles?
That's already larger than the v3 above, and only covers 8 of the 10 BCD digits.
Size and full digit coverage are likely to matter more than speed, as it's already under 1 us.
Chip,
In post #10 there is Verilog code that is part of what BCDADJ would be.
The table is applied to all nibbles.
If the WC was shifted in first and then the adjustment made, it would be even faster again.
See attached diagram for explanation.
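Since the Verilog from post #10 and the diagram are not shown here, this is a hedged Python model of the adjustment as it is described in this thread: every nibble of the word is looked at independently, and 3 is added to any nibble that is 5 or more. The name `bcdadj` and the 8-nibble default are assumptions taken from the discussion, not a confirmed instruction definition.

```python
def bcdadj(word: int, nibbles: int = 8) -> int:
    """Add 3 to every nibble of word that is >= 5, with no carry between nibbles."""
    result = 0
    for i in range(nibbles):
        nib = (word >> (4 * i)) & 0xF
        if nib >= 5:
            nib = (nib + 3) & 0xF   # stays inside its own nibble
        result |= nib << (4 * i)
    return result


if __name__ == "__main__":
    print(hex(bcdadj(0x12345678)))
```

In hardware the loop would disappear, since all nibbles can be adjusted at the same time.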
7 > 12 longs??
The 10-digit version would only be 9 longs and 159 cycles.
With shift in/out incorporated in BCDADJ, even smaller and faster again.
That's like magic. What governs adding the three?
If a nibble is greater than or equal to 5, add 3.
But in that chart it was sometimes applied to nibble1, not nibble0. How come?
After a shift, if any nibble is >= 5, add 3 to that nibble (before the next shift).
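Put together, this is the shift-and-add-3 method often called "double dabble". Below is a minimal Python sketch of the whole conversion, assuming a 32-bit input and a 10-digit result; the function name and parameters are mine. The adjustment runs between shifts, so nothing is adjusted after the last bit has been shifted in.

```python
def bin_to_bcd(value: int, bits: int = 32, digits: int = 10) -> int:
    """Convert an unsigned binary value to packed BCD by shift-and-add-3."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given bit width")
    bcd = 0
    for i in range(bits - 1, -1, -1):        # feed the input MSB first
        adjusted = 0
        for d in range(digits):              # adjust every nibble independently
            nib = (bcd >> (4 * d)) & 0xF
            if nib >= 5:
                nib += 3                     # at most 9 + 3 = 12, fits in a nibble
            adjusted |= nib << (4 * d)
        # Shift left by one and bring in the next input bit at the bottom.
        bcd = (adjusted << 1) | ((value >> i) & 1)
    return bcd


if __name__ == "__main__":
    print(hex(bin_to_bcd(255)))
```

The reason it works: a digit of 5 or more would become 10 or more when doubled by the shift, and adding 3 beforehand (6 after doubling) pushes the excess into the next nibble as a decimal carry.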
And there's no carry from one nibble to the other? If so, that's really simple.
So we would need an instruction that shifts one bit into the left side of the result and then adds 3 to any result nibble > 4, with no carry beyond each nibble?
If that's all it is, we could make a unary instruction that performs that operation in 32 clocks. Handling the extra potential two digits is a pain, though. Are they that important?
Is there a reverse formula for going from BCD to binary?
That's correct, no carry from nibble to nibble.
8 digits would be great in 1 instruction.
I'm not aware of a reverse trick.
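For what it's worth, the same steps can be run backwards: shift right by one bit, then subtract 3 from any nibble that is 8 or more. Each pass halves the BCD value and drops the remainder bit into the binary result. The Python sketch below is my own illustration of that reverse procedure (names and the 32-bit/10-digit defaults are assumptions), not something proposed in this thread as an instruction.

```python
def bcd_to_bin(bcd: int, bits: int = 32, digits: int = 10) -> int:
    """Convert packed BCD to binary by shifting right and subtracting 3."""
    binary = 0
    for _ in range(bits):
        # The bit leaving the BCD value enters the top of the binary result.
        binary = (binary >> 1) | ((bcd & 1) << (bits - 1))
        bcd >>= 1
        fixed = 0
        for d in range(digits):              # undo the add-3 per nibble
            nib = (bcd >> (4 * d)) & 0xF
            if nib >= 8:
                nib -= 3
            fixed |= nib << (4 * d)
        bcd = fixed
    if bcd != 0:
        raise OverflowError("BCD value does not fit in the given bit width")
    return binary


if __name__ == "__main__":
    print(bcd_to_bin(0x4294967295))
```

The check at the end catches BCD inputs that are too large for the chosen binary width.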