BCD = (DAA) Instruction -- Question to Chip

cgracey · 2013-12-09 04:04

Heater. wrote: »

dMajo,

Thanks. That what I thought, the conversions are ultimately done in software functions not real hardware machine instructions. I had to ask because I have never see a CPU with such conversion instructions.

I've tried to imagine all kinds of unary operations that would be useful. Some are practical to implement, some aren't. How about this:

INCPRIM/DECPRIM D - go to next/previous prime number.

ozpropdev · 2013-12-09 04:31

cgracey wrote: »

I've tried to imagine all kinds of unary operations that would be useful. Some are practical to implement, some aren't. How about this:

INCPRIM/DECPRIM D - go to next/previous prime number.

Maybe a BLMASK variant ?
Create mask of n bits starting from MSB? Saves a shift instruction.

Heater. · 2013-12-09 04:47

Chip,

INCPRIM/DECPRIM D - go to next/previous prime number.

I love it. Don't the crypto cracking guys need instructions like that?

How about:

FIBO fib, n ' Calculate the nth number in the Fibonacci sequence.

That would get one of our benchmarks up to speed

Ale · 2013-12-09 04:50

This rolnib instruction solves the problem... now you don't need any BCD-specific instructions...

Mark_T · 2013-12-09 05:23

As far as I know only COBOL used BCD, called "PACKED DECIMAL" I believe. I would be
very surprised if any modern system uses a DAA instruction anywhere on the planet. Perhaps
on board the Voyager probes, but they're out of the solar system now (sort of).

Actually specialist financial software in bank systems may still be doing this, but only to avoid
round-off errors when paying billionaires' bonuses!

Heater. · 2013-12-09 05:30

Mark_T

I would think executive bonuses today were more prone to overflow error than rounding error.

Ramon · 2013-12-09 07:27

Try google "Binary to BCD 180nm"

"A High Performance Binary TO BCD Converter for Decimal Multiplication", by Jairaj Bhattacharya, Aman Gupta, Anshul Singh, 2010

Proposed: Delay 1.57ns, Power 557.09nW, Area 1862 um^2, power-delay 874.66E-18
Srihari: Delay 1.85ns, Power 661.39nW, Area 2087 um^2, power-delay 1223.57E-18

ABSTRACT: "Decimal data processing applications have grown exponentially in recent years thereby increasing the need to have hardware support for decimal arithmetic. Binary to BCD conversion forms the basic building block of decimal digit multipliers. This paper presents novel high speed low power architecture for fixed bit binary to BCD conversion which is at least 28% better in terms of power-delay product than the existing designs."

cgracey · 2013-12-09 10:54

Ramon wrote: »

Try google "Binary to BCD 180nm"

"A High Performance Binary TO BCD Converter for Decimal Multiplication", by Jairaj Bhattacharya, Aman Gupta, Anshul Singh, 2010

Proposed: Delay 1.57ns, Power 557.09nW, Area 1862 um^2, power-delay 874.66E-18
Srihari: Delay 1.85ns, Power 661.39nW, Area 2087 um^2, power-delay 1223.57E-18

ABSTRACT: "Decimal data processing applications have grown exponentially in recent years thereby increasing the need to have hardware support for decimal arithmetic. Binary to BCD conversion forms the basic building block of decimal digit multipliers. This paper presents novel high speed low power architecture for fixed bit binary to BCD conversion which is at least 28% better in terms of power-delay product than the existing designs."

That's interesting. If you maintain all your data in BCD, and do all your math in BCD, there is no conversion issue.

jmg · 2013-12-09 13:22

cgracey wrote: »

Neat! I used the ROLNIB (rotate nibble left) to save one instruction.

How many cycles for the complete operation ?

jmg · 2013-12-09 13:35

cgracey wrote: »

bin2bcd		reps	#8,#6
		        mov	dx,_10_000_000
		div32u	value,dx
		getdivr	value
		getdivq	ax
		rolnib	answer,ax
		div32u	dx,#10
		getdivq	dx
bin2bcd_ret	ret

_10_000_000	long	10_000_000

This code has two div32u, and feeds MSB first, and it also seems light on loops ?
2^32/10000000 = 429.49.. (2^32 is a 10 digit BCD result)

The algorithm I gave above in #26 need one div32u per digit.
It feeds LSB first, so will need a post-rotate, but it can exit early on 0 if needed.

Which is fastest may depend on the delay of div32u ?

cgracey · 2013-12-09 13:39

jmg wrote: »

How many cycles for the complete operation ?

I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.

bin2bcd		reps	#8,#6			'1
		mov	dx,_10_000_000		'1
		div32u	value,dx		'8*1
		getdivr	value			'8*18
		getdivq	ax			'8*1
		rolnib	answer,ax		'8*1
		div32u	dx,#10			'8*1
		getdivq	dx			'8*18
bin2bcd_ret	ret				'4

jmg · 2013-12-09 13:50

cgracey wrote: »

I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.

I find an 8 bit uC library, that I've tested to ~ 80 bytes and 1800 cycles @ 10 digits, for 32 Bin to 10 Digit BCD.

When the above P2 code is corrected for 10 digits, it will come in just under half the size, and 4.5x the speed (cycles based), or ~ 45x the speed(MHz adjusted). Interesting.

ozpropdev · 2013-12-09 16:09

jmg wrote: »

This code has two div32u, and feeds MSB first, and it also seems light on loops ?
2^32/10000000 = 429.49.. (2^32 is a 10 digit BCD result)

The algorithm I gave above in #26 need one div32u per digit.
It feeds LSB first, so will need a post-rotate, but it can exit early on 0 if needed.

Which is fastest may depend on the delay of div32u ?

Correct, It's a 8 digit BCD example.

jmg · 2013-12-09 16:48

There seems to be no rornib so the LSB first code could go something roughly like this, expanded to 10 digits

''v3
bin32bcd10	
         	reps	#8,#4               	'1 
		           mov	dx,#10          ' +1
		  div32u	value,dx        '+8*1
		  getdivr	ax              '+8*18
		  getdivq	value           '+8*1
                  rolnib	answer,ax       '+8*1
		eswap4	answer,answer 	        '+1*1
		div32u	value,dx         	'+1*1    by now <= 42 
		getdivq	ax                   	'+1*18 eg 4
                rolnib	answerU,ax       	'+1*1
		getdivr	ax                	'+1*1  eg 2
                rolnib	answerU,ax       	'+1*1
bin32bcd10_ret	
        	ret                             '+4  = 197cy, 52 Bytes
'v2
bin32bcd10	
         	reps	#8,#5               	'1 
		           mov	dx,#10          ' +1
		  div32u	value,dx        '+8*1
		  getdivr	ax              '+8*18
		  getdivq	value           '+8*1
                  or		answer,ax       '+8*1
		  shr		answer,#4       '+8*1 
		div32u	value,dx         	'+1*1  by now <= 42 
		getdivr	ax                   	'+1*18
                or	answerU,ax          	'+1*1
		shr	answerU,#4           	'+1*1 
		getdivq	ax                	'+1*1
                or	answerU,ax          	'+1*1
		shr	answerU,#24             '+1*1 
bin32bcd10_ret	
        	ret                             '+4  = 207cy, 60 Bytes
'v1 
bin32bcd10	
         	reps	#8,#5               	 '1 
		           mov	dx,#10           ' +1
		  div32u	value,dx         '+8*1
		  getdivr	ax               '+8*18
		  getdivq	value            '+8*1
                  or		answer,ax        '+8*1
		  shr		answer,#4        '+8*1 
         	reps	#2,#5   	         '+1 
		           nop          	 '+1
		  div32u	value,dx         '+2*1
		  getdivr	ax               '+2*18
		  getdivq	value            '+2*1
                  or		answerU,ax       '+2*1
		  shr		answerU,#4       '+2*1 
		shr	answerU,#24              '+1*1 
bin32bcd10_ret	
        	ret                              '+4  = 229cy, 64 Bytes

- not sure if REPS can be nested ?

Edited to add eswap4 in v3, pulls it under the magical 1us, (assuming 200MHz)

cgracey · 2013-12-09 16:55

jmg wrote: »

There seems to be no rornib so the LSB first code could go something roughly like this, expanded to 10 digits

bin32bcd10	
         	reps	#8,#5               	     '1 
		           mov	dx,#10               ' +1
		  div32u	value,dx             '+8*1
		  getdivr	ax                   '+8*18
		  getdivq	value                '+8*1
                  or		answer,ax            '+8*1
		  shr		answer,#4            '+8*1 
         	reps	#2,#5   	             '1 
		           nop          	     '+1
		  div32u	value,dx             '+2*1
		  getdivr	ax                   '+2*18
		  getdivq	value                '+2*1
                  or		answerU,ax           '+2*1
		  shr		answerU,#4           '+2*1 
		shr	answerU,#24                  '+1*1 
bin32bcd10_ret	
        	ret                                  '+4  = 308cy, 64 Bytes

An extended shift could drop the code size on this - not sure if REPS can be nested ?

There is no RORNIB instruction, but there is an ESWAP4 that does an endian swap on the nibble order. That could be used after the loop.

jmg · 2013-12-09 17:17

cgracey wrote: »

There is no RORNIB instruction, but there is an ESWAP4 that does an endian swap on the nibble order. That could be used after the loop.

Added as v3 above. Shaves a little more off time and size. - but I'm unsure on the opcode details of ESWAP4 - data hints it may have a count field ?

cgracey · 2013-12-09 17:59

jmg wrote: »

Added as v3 above. Shaves a little more off time and size. - but I'm unsure on the opcode details of ESWAP4 - data hints it may have a count field ?

It's basically a 'move' instruction. What is in S goes into D with nibbles reversed. There are a lot of unary instructions in this D,S format.

Ariba · 2013-12-09 22:56

cgracey wrote: »
I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.
bin2bcd		reps	#8,#6			'1
		mov	dx,_10_000_000		'1
		div32u	value,dx		'8*1
		getdivr	value			'8*18
		getdivq	ax			'8*1
		rolnib	answer,ax		'8*1
		div32u	dx,#10			'8*1
		getdivq	dx			'8*18
bin2bcd_ret	ret				'4

Optimized for speed it takes only 171 cycles, but 6 cog-longs more:

bin2bcd		reps	#8,#4			'1
		setinda	#dx			'1
		div32u	value,inda++		'8*1
		getdivr	value			'8*18
		getdivq	ax			'8*1
		rolnib	answer,ax		'8*1
bin2bcd_ret	ret				'1  = 171cy

dx  long  10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1

Andy

cgracey · 2013-12-09 22:59

Ariba wrote: »

Optimized for speed it takes only 171 cycles, but 6 cog-longs more:

bin2bcd		reps	#8,#4			'1
		setinda	#dx			'1
		div32u	value,inda++		'8*1
		getdivr	value			'8*18
		getdivq	ax			'8*1
		rolnib	answer,ax		'8*1
bin2bcd_ret	ret				'1  = 171cy

dx  long  10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1

Andy

Now, that's the way to make it go faster!

ozpropdev · 2013-12-09 23:35

If BCDADJ was implemented, an 8 digit result can be achieved in 81 cycles and in 7 longs!

'81 cycles
bin2bcd8		reps	#26,#3
			mov	answeru,#0
			shl	value,#1 wc
			rcl	answer,#1 wc
			BCDADJ   answer
			shl	value,#1 wc
			rcl	answer,#1 wc

cgracey · 2013-12-09 23:43

ozpropdev wrote: »
If BCDADJ was implemented, an 8 digit result can be achieved in 81 cycles and in 7 longs!
'81 cycles
bin2bcd8		reps	#26,#3
			mov	answeru,#0
			shl	value,#1 wc
			rcl	answer,#1 wc
			BCDADJ   answer
			shl	value,#1 wc
			rcl	answer,#1 wc

What does BCDADJ do? I don't understand the principle of operation here. How can you keep shifting one bit at a time from the value to the answer, and have BCDADJ do anything meaningful to the answer, when its BCD values are nibbles?

jmg · 2013-12-10 00:17

Ariba wrote: »

Optimized for speed it takes only 171 cycles, but 6 cog-longs more:

That's already larger than the v3 above, and only covers 8 of the 10 BCD digits.
Size and full digit coverage are likely to matter more than speed, as it's already < 1us

ozpropdev · 2013-12-10 01:23

cgracey wrote: »

What does BCDADJ do? I don't understand the principle of operation here. How can you keep shifting one bit at a time from the value to the answer, and have BCDADJ do anything meaningful to the answer, when its BCD values are nibbles?

Chip
In post #10 there is Verilog code that is part of what BCDADJ would be.
The table is applied to all nibbles.
If the WC was shifted in first then the adjustment made, even faster again.
See attached diagram for explanation.

ozpropdev · 2013-12-10 01:45

jmg wrote: »

That's already larger than the v3 above, and only covers 8 of the 10 BCD digits.
Size and full digit coverage are likely to matter more than speed, as it's already < 1us

7 > 12 longs??
The 10 digit version would only be 9 longs and 159 cycles.
With shift in/out incorporated in BCDADJ ,even smaller and faster again.

cgracey · 2013-12-10 02:01

ozpropdev wrote: »

Chip
In post #10 there is Verilog code that is part of what BCDADJ would be.
The table is applied to all nibbles.
If the WC was shifted in first then the adjustment made, even faster again.
See attached diagram for explanation.

That's like magic. What governs adding the three?

ozpropdev · 2013-12-10 02:05

cgracey wrote: »

That's like magic. What governs adding the three?

If nibble equals or greater than 5 add 3.

cgracey · 2013-12-10 02:08

ozpropdev wrote: »

If nibble equals or greater than 5 add 3.

But in that chart it was sometimes applied to nibble1, not nibble0. How come?

ozpropdev · 2013-12-10 02:18

cgracey wrote: »

But in that chart it was sometimes applied to nibble1, not nibble0. How come?

After a shift if any nibble =>5 add 3 to that nibble.

cgracey · 2013-12-10 02:26

ozpropdev wrote: »

After a shift if any nibble =>5 add 3 to that nibble.

And there's no carry from one nibble to the other? If so, that's really simple.

So we would need an instruction that shifts one bit into the left side of the result and then adds 3 to any result nibble > 4, with no carry beyond each nibble?

If that's all it is, we could make a unary instruction that performs that operation in 32 clocks. Handling the extra potential two digits is a pain, though. Are they that important?

Is there a reverse formula for going from BCD to binary?

ozpropdev · 2013-12-10 02:41

cgracey wrote: »

And there's no carry from one nibble to the other? If so, that's really simple.

So we would need an instruction that shifts one bit into the left side of the result and then adds 3 to any result nibble > 4, with no carry beyond each nibble?

If that's all it is, we could make a unary instruction that performs that operation in 32 clocks. Handling the extra potential two digits is a pain, though. Are they that important?

Is there a reverse formula for going from BCD to binary?

That's correct, no carry from nibble to nibble.
8 digits would be great in 1 instruction.
I'm not aware of a reverse trick.

BCD = (DAA) Instruction -- Question to Chip

Comments