Shop OBEX P1 Docs P2 Docs Learn Events
BCD = (DAA) Instruction -- Question to Chip - Page 2 — Parallax Forums

BCD = (DAA) Instruction -- Question to Chip

2

Comments

  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 04:04
    Heater. wrote: »
    dMajo,

    Thanks. That what I thought, the conversions are ultimately done in software functions not real hardware machine instructions. I had to ask because I have never see a CPU with such conversion instructions.

    I've tried to imagine all kinds of unary operations that would be useful. Some are practical to implement, some aren't. How about this:

    INCPRIM/DECPRIM D - go to next/previous prime number.
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-09 04:31
    cgracey wrote: »
    I've tried to imagine all kinds of unary operations that would be useful. Some are practical to implement, some aren't. How about this:

    INCPRIM/DECPRIM D - go to next/previous prime number.

    Maybe a BLMASK variant ?
    Create mask of n bits starting from MSB? Saves a shift instruction.
  • Heater.Heater. Posts: 21,230
    edited 2013-12-09 04:47
    Chip,
    INCPRIM/DECPRIM D - go to next/previous prime number.
    I love it. Don't the crypto cracking guys need instructions like that? :)

    How about:

    FIBO fib, n ' Calculate the nth number in the Fibonacci sequence.

    That would get one of our benchmarks up to speed :)
  • AleAle Posts: 2,363
    edited 2013-12-09 04:50
    This rolnib instruction solves the problem... now you don't need any BCD-specific instructions...
  • Mark_TMark_T Posts: 1,981
    edited 2013-12-09 05:23
    As far as I know only COBOL used BCD, called "PACKED DECIMAL" I believe. I would be
    very surprised if any modern system uses a DAA instruction anywhere on the planet. Perhaps
    on board the Voyager probes, but they're out of the solar system now (sort of).

    Actually specialist financial software in bank systems may still be doing this, but only to avoid
    round-off errors when paying billionaires' bonuses!
  • Heater.Heater. Posts: 21,230
    edited 2013-12-09 05:30
    Mark_T

    I would think executive bonuses today were more prone to overflow error than rounding error. :)
  • RamonRamon Posts: 484
    edited 2013-12-09 07:27
    Try google "Binary to BCD 180nm"

    "A High Performance Binary TO BCD Converter for Decimal Multiplication", by Jairaj Bhattacharya, Aman Gupta, Anshul Singh, 2010

    Proposed: Delay 1.57ns, Power 557.09nW, Area 1862 um^2, power-delay 874.66E-18
    Srihari: Delay 1.85ns, Power 661.39nW, Area 2087 um^2, power-delay 1223.57E-18

    ABSTRACT: "Decimal data processing applications have grown exponentially in recent years thereby increasing the need to have hardware support for decimal arithmetic. Binary to BCD conversion forms the basic building block of decimal digit multipliers. This paper presents novel high speed low power architecture for fixed bit binary to BCD conversion which is at least 28% better in terms of power-delay product than the existing designs."
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 10:54
    Ramon wrote: »
    Try google "Binary to BCD 180nm"

    "A High Performance Binary TO BCD Converter for Decimal Multiplication", by Jairaj Bhattacharya, Aman Gupta, Anshul Singh, 2010

    Proposed: Delay 1.57ns, Power 557.09nW, Area 1862 um^2, power-delay 874.66E-18
    Srihari: Delay 1.85ns, Power 661.39nW, Area 2087 um^2, power-delay 1223.57E-18

    ABSTRACT: "Decimal data processing applications have grown exponentially in recent years thereby increasing the need to have hardware support for decimal arithmetic. Binary to BCD conversion forms the basic building block of decimal digit multipliers. This paper presents novel high speed low power architecture for fixed bit binary to BCD conversion which is at least 28% better in terms of power-delay product than the existing designs."

    That's interesting. If you maintain all your data in BCD, and do all your math in BCD, there is no conversion issue.
  • jmgjmg Posts: 15,173
    edited 2013-12-09 13:22
    cgracey wrote: »
    Neat! I used the ROLNIB (rotate nibble left) to save one instruction.

    How many cycles for the complete operation ?
  • jmgjmg Posts: 15,173
    edited 2013-12-09 13:35
    cgracey wrote: »
    bin2bcd		reps	#8,#6
    		        mov	dx,_10_000_000
    		div32u	value,dx
    		getdivr	value
    		getdivq	ax
    		rolnib	answer,ax
    		div32u	dx,#10
    		getdivq	dx
    bin2bcd_ret	ret
    
    _10_000_000	long	10_000_000
    

    This code has two div32u, and feeds MSB first, and it also seems light on loops ?
    2^32/10000000 = 429.49.. (2^32 is a 10 digit BCD result)


    The algorithm I gave above in #26 need one div32u per digit.
    It feeds LSB first, so will need a post-rotate, but it can exit early on 0 if needed.

    Which is fastest may depend on the delay of div32u ?
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 13:39
    jmg wrote: »
    How many cycles for the complete operation ?


    I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.
    bin2bcd		reps	#8,#6			'1
    		mov	dx,_10_000_000		'1
    		div32u	value,dx		'8*1
    		getdivr	value			'8*18
    		getdivq	ax			'8*1
    		rolnib	answer,ax		'8*1
    		div32u	dx,#10			'8*1
    		getdivq	dx			'8*18
    bin2bcd_ret	ret				'4
    
  • jmgjmg Posts: 15,173
    edited 2013-12-09 13:50
    cgracey wrote: »
    I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.

    I find an 8 bit uC library, that I've tested to ~ 80 bytes and 1800 cycles @ 10 digits, for 32 Bin to 10 Digit BCD.

    When the above P2 code is corrected for 10 digits, it will come in just under half the size, and 4.5x the speed (cycles based), or ~ 45x the speed(MHz adjusted). Interesting.
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-09 16:09
    jmg wrote: »
    This code has two div32u, and feeds MSB first, and it also seems light on loops ?
    2^32/10000000 = 429.49.. (2^32 is a 10 digit BCD result)


    The algorithm I gave above in #26 need one div32u per digit.
    It feeds LSB first, so will need a post-rotate, but it can exit early on 0 if needed.

    Which is fastest may depend on the delay of div32u ?

    Correct, It's a 8 digit BCD example.
  • jmgjmg Posts: 15,173
    edited 2013-12-09 16:48
    There seems to be no rornib so the LSB first code could go something roughly like this, expanded to 10 digits
    ''v3
    bin32bcd10	
             	reps	#8,#4               	'1 
    		           mov	dx,#10          ' +1
    		  div32u	value,dx        '+8*1
    		  getdivr	ax              '+8*18
    		  getdivq	value           '+8*1
                      rolnib	answer,ax       '+8*1
    		eswap4	answer,answer 	        '+1*1
    		div32u	value,dx         	'+1*1    by now <= 42 
    		getdivq	ax                   	'+1*18 eg 4
                    rolnib	answerU,ax       	'+1*1
    		getdivr	ax                	'+1*1  eg 2
                    rolnib	answerU,ax       	'+1*1
    bin32bcd10_ret	
            	ret                             '+4  = 197cy, 52 Bytes
    'v2
    bin32bcd10	
             	reps	#8,#5               	'1 
    		           mov	dx,#10          ' +1
    		  div32u	value,dx        '+8*1
    		  getdivr	ax              '+8*18
    		  getdivq	value           '+8*1
                      or		answer,ax       '+8*1
    		  shr		answer,#4       '+8*1 
    		div32u	value,dx         	'+1*1  by now <= 42 
    		getdivr	ax                   	'+1*18
                    or	answerU,ax          	'+1*1
    		shr	answerU,#4           	'+1*1 
    		getdivq	ax                	'+1*1
                    or	answerU,ax          	'+1*1
    		shr	answerU,#24             '+1*1 
    bin32bcd10_ret	
            	ret                             '+4  = 207cy, 60 Bytes
    'v1 
    bin32bcd10	
             	reps	#8,#5               	 '1 
    		           mov	dx,#10           ' +1
    		  div32u	value,dx         '+8*1
    		  getdivr	ax               '+8*18
    		  getdivq	value            '+8*1
                      or		answer,ax        '+8*1
    		  shr		answer,#4        '+8*1 
             	reps	#2,#5   	         '+1 
    		           nop          	 '+1
    		  div32u	value,dx         '+2*1
    		  getdivr	ax               '+2*18
    		  getdivq	value            '+2*1
                      or		answerU,ax       '+2*1
    		  shr		answerU,#4       '+2*1 
    		shr	answerU,#24              '+1*1 
    bin32bcd10_ret	
            	ret                              '+4  = 229cy, 64 Bytes
    
    
    

    - not sure if REPS can be nested ?

    Edited to add eswap4 in v3, pulls it under the magical 1us, (assuming 200MHz)
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 16:55
    jmg wrote: »
    There seems to be no rornib so the LSB first code could go something roughly like this, expanded to 10 digits
    bin32bcd10	
             	reps	#8,#5               	     '1 
    		           mov	dx,#10               ' +1
    		  div32u	value,dx             '+8*1
    		  getdivr	ax                   '+8*18
    		  getdivq	value                '+8*1
                      or		answer,ax            '+8*1
    		  shr		answer,#4            '+8*1 
             	reps	#2,#5   	             '1 
    		           nop          	     '+1
    		  div32u	value,dx             '+2*1
    		  getdivr	ax                   '+2*18
    		  getdivq	value                '+2*1
                      or		answerU,ax           '+2*1
    		  shr		answerU,#4           '+2*1 
    		shr	answerU,#24                  '+1*1 
    bin32bcd10_ret	
            	ret                                  '+4  = 308cy, 64 Bytes
    

    An extended shift could drop the code size on this - not sure if REPS can be nested ?

    There is no RORNIB instruction, but there is an ESWAP4 that does an endian swap on the nibble order. That could be used after the loop.
  • jmgjmg Posts: 15,173
    edited 2013-12-09 17:17
    cgracey wrote: »
    There is no RORNIB instruction, but there is an ESWAP4 that does an endian swap on the nibble order. That could be used after the loop.

    Added as v3 above. Shaves a little more off time and size. - but I'm unsure on the opcode details of ESWAP4 - data hints it may have a count field ?
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 17:59
    jmg wrote: »
    Added as v3 above. Shaves a little more off time and size. - but I'm unsure on the opcode details of ESWAP4 - data hints it may have a count field ?

    It's basically a 'move' instruction. What is in S goes into D with nibbles reversed. There are a lot of unary instructions in this D,S format.
  • AribaAriba Posts: 2,690
    edited 2013-12-09 22:56
    cgracey wrote: »
    I calculate 326. This is not something that could be single-cycle, by a long shot. At least, it appears so.
    bin2bcd		reps	#8,#6			'1
    		mov	dx,_10_000_000		'1
    		div32u	value,dx		'8*1
    		getdivr	value			'8*18
    		getdivq	ax			'8*1
    		rolnib	answer,ax		'8*1
    		div32u	dx,#10			'8*1
    		getdivq	dx			'8*18
    bin2bcd_ret	ret				'4
    

    Optimized for speed it takes only 171 cycles, but 6 cog-longs more:
    bin2bcd		reps	#8,#4			'1
    		setinda	#dx			'1
    		div32u	value,inda++		'8*1
    		getdivr	value			'8*18
    		getdivq	ax			'8*1
    		rolnib	answer,ax		'8*1
    bin2bcd_ret	ret				'1  = 171cy
    
    dx  long  10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1
    
    Andy
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 22:59
    Ariba wrote: »
    Optimized for speed it takes only 171 cycles, but 6 cog-longs more:
    bin2bcd		reps	#8,#4			'1
    		setinda	#dx			'1
    		div32u	value,inda++		'8*1
    		getdivr	value			'8*18
    		getdivq	ax			'8*1
    		rolnib	answer,ax		'8*1
    bin2bcd_ret	ret				'1  = 171cy
    
    dx  long  10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1
    
    Andy

    Now, that's the way to make it go faster!
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-09 23:35
    If BCDADJ was implemented, an 8 digit result can be achieved in 81 cycles and in 7 longs!
    '81 cycles
    bin2bcd8		reps	#26,#3
    			mov	answeru,#0
    			shl	value,#1 wc
    			rcl	answer,#1 wc
    			BCDADJ   answer
    			shl	value,#1 wc
    			rcl	answer,#1 wc
    
    
  • cgraceycgracey Posts: 14,151
    edited 2013-12-09 23:43
    ozpropdev wrote: »
    If BCDADJ was implemented, an 8 digit result can be achieved in 81 cycles and in 7 longs!
    '81 cycles
    bin2bcd8		reps	#26,#3
    			mov	answeru,#0
    			shl	value,#1 wc
    			rcl	answer,#1 wc
    			BCDADJ   answer
    			shl	value,#1 wc
    			rcl	answer,#1 wc
    
    

    What does BCDADJ do? I don't understand the principle of operation here. How can you keep shifting one bit at a time from the value to the answer, and have BCDADJ do anything meaningful to the answer, when its BCD values are nibbles?
  • jmgjmg Posts: 15,173
    edited 2013-12-10 00:17
    Ariba wrote: »
    Optimized for speed it takes only 171 cycles, but 6 cog-longs more:

    That's already larger than the v3 above, and only covers 8 of the 10 BCD digits.
    Size and full digit coverage are likely to matter more than speed, as it's already < 1us :)
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-10 01:23
    cgracey wrote: »
    What does BCDADJ do? I don't understand the principle of operation here. How can you keep shifting one bit at a time from the value to the answer, and have BCDADJ do anything meaningful to the answer, when its BCD values are nibbles?

    Chip
    In post #10 there is Verilog code that is part of what BCDADJ would be.
    The table is applied to all nibbles.
    If the WC was shifted in first then the adjustment made, even faster again.
    See attached diagram for explanation.
    789 x 558 - 14K
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-10 01:45
    jmg wrote: »
    That's already larger than the v3 above, and only covers 8 of the 10 BCD digits.
    Size and full digit coverage are likely to matter more than speed, as it's already < 1us :)

    7 > 12 longs??
    The 10 digit version would only be 9 longs and 159 cycles.
    With shift in/out incorporated in BCDADJ ,even smaller and faster again.
  • cgraceycgracey Posts: 14,151
    edited 2013-12-10 02:01
    ozpropdev wrote: »
    Chip
    In post #10 there is Verilog code that is part of what BCDADJ would be.
    The table is applied to all nibbles.
    If the WC was shifted in first then the adjustment made, even faster again.
    See attached diagram for explanation.

    That's like magic. What governs adding the three?
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-10 02:05
    cgracey wrote: »
    That's like magic. What governs adding the three?

    If nibble equals or greater than 5 add 3.
  • cgraceycgracey Posts: 14,151
    edited 2013-12-10 02:08
    ozpropdev wrote: »
    If nibble equals or greater than 5 add 3.

    But in that chart it was sometimes applied to nibble1, not nibble0. How come?
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-10 02:18
    cgracey wrote: »
    But in that chart it was sometimes applied to nibble1, not nibble0. How come?

    After a shift if any nibble =>5 add 3 to that nibble.
  • cgraceycgracey Posts: 14,151
    edited 2013-12-10 02:26
    ozpropdev wrote: »
    After a shift if any nibble =>5 add 3 to that nibble.

    And there's no carry from one nibble to the other? If so, that's really simple.

    So we would need an instruction that shifts one bit into the left side of the result and then adds 3 to any result nibble > 4, with no carry beyond each nibble?

    If that's all it is, we could make a unary instruction that performs that operation in 32 clocks. Handling the extra potential two digits is a pain, though. Are they that important?

    Is there a reverse formula for going from BCD to binary?
  • ozpropdevozpropdev Posts: 2,792
    edited 2013-12-10 02:41
    cgracey wrote: »
    And there's no carry from one nibble to the other? If so, that's really simple.

    So we would need an instruction that shifts one bit into the left side of the result and then adds 3 to any result nibble > 4, with no carry beyond each nibble?

    If that's all it is, we could make a unary instruction that performs that operation in 32 clocks. Handling the extra potential two digits is a pain, though. Are they that important?

    Is there a reverse formula for going from BCD to binary?

    That's correct, no carry from nibble to nibble.
    8 digits would be great in 1 instruction.
    I'm not aware of a reverse trick. :)
Sign In or Register to comment.