Faster number printing without QDIV

proppy · 2025-01-21 20:40

Hello!

I was working on printing numbers. Using the Propeller 2 Assembly Language (PASM2) Manual, I was able to print a number, with the largest unsigned 32-bit value as the test case. I stumbled onto QDIV, and the program works and appears accurate:

 ``` 
' set the receiving pin for input to the P2 microcontroller
RX_PIN = 63
' set the transmission pin for output from the P2 microcontroller
TX_PIN = 62
' set the baud mode to support 2000000 baud
BAUD_MODE = 655367 

dat

' begin the program at address 0
org 0

' Set the clock mode
asmclk

' configure TX smart pin
fltl        #TX_PIN
wrpin       ##(P_ASYNC_TX | P_OE), #TX_PIN
wxpin       ##BAUD_MODE, #TX_PIN
drvl        #TX_PIN

mov div_idx, #0
mov number, ##4_294_967_295
mov tmp, number
mov divisor, ##1_000_000_000

.digit_loop
    mov digit, #0
    
.extract_digit
    cmpsub tmp, divisor  wc      ' Subtract divisor if tmp >= divisor
    if_c add  digit, #1                ' Increment digit if subtraction occurred
    if_c   jmp  #.extract_digit   ' Repeat while tmp >= divisor
        
    add     digit, #"0"       ' Convert to ASCII
    wypin   digit, #TX_PIN
    
.flush                  rdpin     pr2, #TX_PIN         WC      ' check busy flag
        if_c            jmp       #.flush                       ' hold until done
            
            add     div_idx, #1          ' Count the digit just printed
            
            qdiv divisor, #10            ' Start a CORDIC divide: divisor / 10
            getqx divisor                ' Wait for the quotient, the next lower power of ten
            
            cmp     div_idx, #10  wz     ' Stop when all digits are printed
            if_nz   jmp  #.digit_loop     ' Continue until last divisor is reached
            jmp #.done

.done                   

ret

' Constants and Variables
digit res 1
divisor res 1
div_idx res 1
tmp res 1
number res 1
buffer byte "0", 0, 0
```

I thought I was done, but then I read the fine print:
ALU circuit and CORDIC Solver math instructions. The ALU (Arithmetic Logic Unit) instructions perform common math operations in just 2 clock cycles each. The CORDIC (COordinate Rotation DIgital Computer) instructions perform more complicated math operations in 54 clock cycles each.

Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

Wuerfel_21 · 2025-01-21 21:09

@proppy said:
Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

Unless you are a modern JRPG, you are not printing enough numbers per frame in a video game for it to really matter.

You're actually doing two divisions per digit here: One iterative one in extract_digit and the obvious QDIV one. The QDIV here actually just moves divisor through a fixed sequence of powers of ten, so you could replace it with a table lookup. But then you're still doing a division loop that can take longer than a QDIV would (if the digit is 8 or 9).

Also never put initialized data after RES, the krampus will come and eat your socks.

ersmith · 2025-01-21 22:52

You can also convert a number to decimal without doing division at all, using the "double dabble" algorithm (https://en.wikipedia.org/wiki/Double_dabble). Technically this actually converts the number to binary coded decimal, but this is easily printed (just print it as you would a hex number).
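
A rough Python model of the double dabble idea, for anyone who wants to see the steps before writing them in PASM2. This is a sketch and not code from the thread: the names `double_dabble` and `bcd_to_text` are made up for illustration, and the 32-bit default width is an assumption.

```python
def double_dabble(value: int, bits: int = 32) -> int:
    """Convert an unsigned integer to packed BCD using only shifts and adds."""
    if not 0 <= value < (1 << bits):
        raise ValueError("value does not fit in the given bit width")
    digits = len(str((1 << bits) - 1))  # BCD digits needed (10 for 32 bits)
    bcd = 0
    for i in range(bits - 1, -1, -1):
        # Add 3 to every BCD digit that is 5 or more, so that the doubling
        # below carries into the next decimal digit instead of exceeding 9.
        for d in range(digits):
            if ((bcd >> (4 * d)) & 0xF) >= 5:
                bcd += 3 << (4 * d)
        # Shift left by one and bring in the next input bit, MSB first.
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd


def bcd_to_text(bcd: int) -> str:
    """Print packed BCD the same way as hex: one character per nibble."""
    return format(bcd, "x")


if __name__ == "__main__":
    print(bcd_to_text(double_dabble(4_294_967_295)))
```

The inner loop touches every digit on every bit, so the work grows with bits times digits. No division or multiplication is involved.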

TonyB_ · 2025-01-22 00:02

@proppy said:

Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

For 16-bit values when the divisor is a constant and fairly small, you could pre-compute 65536/divisor and use that as S in the MUL D,S instruction.
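
To make the idea concrete, here is a small Python model, offered as a sketch under assumptions rather than as tested P2 code. The quotient is taken from the product by shifting right (by 16 when the constant is 65536/divisor). The helper names and the second constant, which uses a scale of 2^19 instead of 65536, are illustrative additions and not part of the suggestion above.

```python
def div_by_reciprocal(n: int, reciprocal: int, shift: int) -> int:
    """Approximate n // d as (n * reciprocal) >> shift, with reciprocal ~ 2**shift / d."""
    return (n * reciprocal) >> shift


# 65536 / 10 = 6553.6, rounded up to 6554.
RECIP10_16 = 6554
# 2**19 / 10 = 52428.8, rounded up to 52429 (still fits in 16 bits).
RECIP10_19 = 52429


def first_mismatch(reciprocal: int, shift: int, divisor: int = 10, limit: int = 1 << 16):
    """Return the smallest n below limit where the shortcut is wrong, or None."""
    for n in range(limit):
        if div_by_reciprocal(n, reciprocal, shift) != n // divisor:
            return n
    return None


if __name__ == "__main__":
    print(first_mismatch(RECIP10_16, 16))
    print(first_mismatch(RECIP10_19, 19))
```

In this model the 65536-based constant matches true division by 10 for inputs up to 16388 and first goes wrong at 16389, while the 2^19-based constant is exact over the whole 16-bit range and its largest product still fits in 32 bits. Whichever constant is used, it is worth checking the input range it will actually see.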

evanh · 2025-01-22 02:33

@proppy said:

  ...
  add     digit, #"0"       ' Convert to ASCII
  wypin   digit, #TX_PIN
    
.flush                  rdpin     pr2, #TX_PIN         WC      ' check busy flag
        if_c            jmp       #.flush                       ' hold until done
  ...

It's an extra instruction, but this makes use of the smartpin's transmit buffer, which allows the CORDIC and the comport to operate in parallel.

                ...
.txfull
                rqpin   inb, #TX_PIN   wc    ' transmitting? (C high == yes)  *Needed to initiate tx
                testp   #TX_PIN   wz    ' buffer free? (IN high == yes)
if_c_and_nz     jmp     #.txfull    ' wait while Smartpin is both full (nz) and transmitting (c)

                add     digit, #"0"    ' Convert to ASCII
                wypin   digit, #TX_PIN    ' write new byte to Y buffer
                ...

proppy · 2025-01-23 14:04

@Wuerfel_21 said:

Unless you are a modern JRPG, you are not printing enough numbers per frame in a video game for it to really matter.

You're actually doing two divisions per digit here: One iterative one in extract_digit and the obvious QDIV one. The QDIV here actually just moves divisor through a fixed sequence of powers of ten, so you could replace it with a table lookup. But then you're still doing a division loop that can take longer than a QDIV would (if the digit is 8 or 9).

Also never put initialized data after RES, the krampus will come and eat your socks.

I took some time to look into this and was pretty surprised to find that division via hardware, even in later game consoles (as recent as PS4!), has been avoided in many cases. I mostly stick to 2D stuff so that works.

I see what you mean about the division loop. I did think about doing a lookup table but didn't really start to understand arrays of longs until I had the above code completed. I think I'll revisit it now though.

So, I may need to reference the manual again, but is there a reason not to put initialized data after RES? Alignment issues? Thank you in advance!

@ersmith said:
You can also convert a number to decimal without doing division at all, using the "double dabble" algorithm (https://en.wikipedia.org/wiki/Double_dabble). Technically this actually converts the number to binary coded decimal, but this is easily printed (just print it as you would a hex number).

I'm going to give this a shot. Thank you!

@TonyB_ said:

@proppy said:

Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

For 16-bit values when the divisor is a constant and fairly small, you could pre-compute 65536/divisor and use that as S in the MUL D,S instruction.

I'll give this a shot as well. It'll be helpful for my journey learning Propeller Assembly. Thank you!

@evanh said:

@proppy said:

    ...
    add     digit, #"0"       ' Convert to ASCII
    wypin   digit, #TX_PIN
    
.flush                  rdpin     pr2, #TX_PIN         WC      ' check busy flag
        if_c            jmp       #.flush                       ' hold until done
    ...

It's an extra instruction, but this makes use of the smartpin's transmit buffer, which allows the CORDIC and the comport to operate in parallel.

                ...
.txfull
                rqpin   inb, #TX_PIN   wc    ' transmitting? (C high == yes)  *Needed to initiate tx
                testp   #TX_PIN   wz    ' buffer free? (IN high == yes)
if_c_and_nz     jmp     #.txfull    ' wait while Smartpin is both full (nz) and transmitting (c)

                add     digit, #"0"    ' Convert to ASCII
                wypin   digit, #TX_PIN    ' write new byte to Y buffer
                ...

I'm going to look into this. I did see rqpin in the manual once but had no idea why I would use it. I couldn't figure out a reason I wouldn't want "no acknowledge". Thank you!

You all are awesome! I guess I've got more homework to do.

Wuerfel_21 · 2025-01-23 14:51

You shouldn't put data after RES because RES desynchronizes the cog address counter with the actual data being assembled. It is only to be used to reserve space at the end of cog RAM without emitting corresponding padding into hub RAM. There's a longer explanation somewhere on here but I'm writing from my phone in a waiting room. Someone please find and link it.
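
Until someone links the longer explanation, here is a toy Python model of the effect. It is only an illustration under simplifying assumptions: `assemble` is a made-up function, not how any real assembler is written, and it treats everything as longs (the `buffer` in the first post is bytes).

```python
def assemble(items):
    """Toy model of DAT assembly. Returns (symbols, image).

    items: (name, kind, payload) tuples, where kind is "long" (payload is a
    list of values) or "res" (payload is the number of longs to reserve).
    symbols[name] = (cog_address, image_offset); image_offset is None for RES.
    """
    cog_address = 0
    image = []
    symbols = {}
    for name, kind, payload in items:
        if kind == "long":
            symbols[name] = (cog_address, len(image))
            image.extend(payload)          # data is emitted into the image
            cog_address += len(payload)
        elif kind == "res":
            symbols[name] = (cog_address, None)
            cog_address += payload         # counter advances, image does not grow
        else:
            raise ValueError(f"unknown kind: {kind}")
    return symbols, image


symbols, image = assemble([
    ("code", "long", [0] * 4),
    ("digit", "res", 1),
    ("divisor", "res", 1),
    ("buffer", "long", [0x30]),   # initialized data placed after RES
])

if __name__ == "__main__":
    print(symbols["buffer"], len(image))
```

In the model, `buffer` is given cog address 6, but its data is the long at offset 4 of the image. When the image is copied into cog RAM starting at address 0, that long lands at address 4, which is where `digit` lives, and address 6 holds whatever happened to follow in hub RAM.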

I've actually been working on a 3D rendering thing: https://forums.parallax.com/discussion/176083/3d-teapot-demo/p1

evanh · 2025-01-24 04:25

@proppy said:
I'm going to look into this. I did see rqpin in the manual once but had no idea why I would use it. I couldn't figure out a reason I wouldn't want "no acknowledge". Thank you!

I'm not sure if that detail actually matters. I'd just cut'n'pasted from old code. The important part is the reversed order: checking the smartpin status before writing the buffer instead of the other way around. That, and also checking for buffer full as well.

proppy · 2025-02-14 19:23

Hi all!

So I wanted to post what I came away with so far:

This piece of code prints the number. I'm using an array for the place values. It only goes up to the billions, since the largest number that can be stored in a 32-bit variable is 4_294_967_295. I also added some logic for leading zeros.

' set the receiving pin for input to the P2 microcontroller
RX_PIN = 63
' set the transmission pin for output from the P2 microcontroller
TX_PIN = 62
' set the baud mode to support 2000000 baud
BAUD_MODE = 655367 

dat

' begin the program at address 0
org 0

' Set the clock mode
asmclk

' configure TX smart pin
fltl        #TX_PIN
wrpin       ##(P_ASYNC_TX | P_OE), #TX_PIN
wxpin       ##BAUD_MODE, #TX_PIN
drvl        #TX_PIN

'EXAMPLE NUMBERS, uncomment only one to see it print
mov number, ##4_294_967_290
'mov number, ##10
'mov number, ##302
'mov number, ##032
'mov number, ##0_000_000_000

mov ptr, #@place_value  ' Get hub RAM address for the place_value array, store address in ptr
mov leading_zero, #0    ' Clear the flag; it is set to 1 once a nonzero digit has been seen
mov div_idx, #0     ' Set the index of which place value we are on to the start (zero)
mov tmp, number     ' Copy the number to a temporary space in case we need to keep the original value.

.digit_loop
    mov digit, #0   ' Each time we loop, reset the digit counter for printing the individual digits  

.extract_digit

    rdlong divisor, ptr     ' Read the current place value from hub RAM into divisor
    cmpsub tmp, divisor  wc     ' Subtract divisor if tmp is greater than or equal to divisor
    if_c add  digit, #1     ' Increment digit if subtraction occurred
    if_c   jmp  #.extract_digit ' Repeat if tmp is greater than or equal to divisor

    cmp digit, #0 wz            ' Check if the current digit is zero

    if_z cmp leading_zero, #0 wz    ' If the current digit is zero, check if it is a leading zero
    if_z jmp #.flush            ' If the current digit is a leading zero, then skip printing

    if_nz mov leading_zero, #1  ' If the current digit is not a leading zero, set the leading zero flag for future printing

.print_digit
    add     digit, #"0"     ' Convert the digit to ASCII using #"0"
    wypin   digit, #TX_PIN      ' Print the digit to the terminal

.flush  rdpin pr2, #TX_PIN wc       ' check busy flag
        if_c jmp #.flush            ' hold until done

    add     div_idx, #1     ' Move to next divisor in table
    add ptr, #4         ' Move to the address of the next place value

    cmp     div_idx, #10  wz        ' Stop when all digits are printed
    if_nz   jmp  #.digit_loop   ' Continue until last divisor is reached

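    cmp leading_zero, #0 wz     ' Still in the leading-zero state? Then the number was 0
    if_z wypin #"0", #TX_PIN    ' Print the single zero here; jumping back to .print_digit would re-enter the digit loop
.done 
    jmp #.done                  ' Park here so execution does not run into the table below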


' Array of place values
place_value long 1_000_000_000, 100_000_000, 10_000_000, 1_000_000, 100_000, 10_000, 1_000, 100, 10, 1

' Res variables
digit res 1
divisor res 1
div_idx res 1
tmp res 1
number res 1
ptr res 1
leading_zero res 1
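
For comparison, the place-value approach above can be modelled in a few lines of Python. This is a sketch for checking the logic on a PC, not a translation of the PASM2: the names `PLACE_VALUES`, `to_decimal` and `seen_nonzero` are illustrative, and the zero case is handled after the loop, which is one possible way to do it.

```python
PLACE_VALUES = [10**p for p in range(9, -1, -1)]  # 1_000_000_000 down to 1


def to_decimal(number: int) -> str:
    """Decimal text of an unsigned 32-bit value using compare-and-subtract only."""
    if not 0 <= number <= 0xFFFF_FFFF:
        raise ValueError("number must fit in 32 bits")
    out = []
    tmp = number
    seen_nonzero = False              # becomes True once a digit has been emitted
    for divisor in PLACE_VALUES:
        digit = 0
        while tmp >= divisor:         # the job CMPSUB plus the conditional jump do
            tmp -= divisor
            digit += 1
        if digit != 0 or seen_nonzero:
            seen_nonzero = True
            out.append(chr(ord("0") + digit))
    if not seen_nonzero:              # every digit was a leading zero: the value is 0
        out.append("0")
    return "".join(out)


if __name__ == "__main__":
    for n in (4_294_967_290, 10, 302, 32, 0):
        print(to_decimal(n))
```

The inner `while` runs once per unit of each digit, so a value full of 9s costs the most subtractions.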

This second piece of code also prints a number from a variable. The difference is that I'm using the double dabble method mentioned above. This one took a little longer, mostly because of reading the manual and using a lot of commands I had never used until now. It could use some improvements, of course:
1. This does not contain any checks for leading zeros.
2. The printing of the number could be in a loop.
3. Some of the commands in the double dabble loop repeat a bit. I think I can put this in an extra loop, though I'm not sure if this would speed things up at all.

' set the receiving pin for input to the P2 microcontroller
RX_PIN = 63
' set the transmission pin for output from the P2 microcontroller
TX_PIN = 62
' set the baud mode to support 2000000 baud
BAUD_MODE = 655367 

dat

' begin the program at address 0
org 0

' Set the clock mode
asmclk

' configure TX smart pin
fltl        #TX_PIN
wrpin       ##(P_ASYNC_TX | P_OE), #TX_PIN
wxpin       ##BAUD_MODE, #TX_PIN
drvl        #TX_PIN

' Load number 1000 (any value up to 9999 works, since only four BCD digits are handled)
mov number, ##1000  ' binary number to convert
mov bcd, #0     ' Clear BCD result
mov bit_count, #16  ' Process 16 bits

.double_dabble_loop
    ' Check if thousands place needs adjustment
    mov tmp, bcd            ' copy the BCD into tmp since we need to keep the original value of BCD
    shr tmp, #12            ' Isolate thousands place
    and tmp, #$F            ' only keep the lower 4 bits (0-9 in BCD)        
    cmp tmp, #5 wc      ' C is clear when the digit is 5 or more
    if_nc add bcd, ##$3000  ' Add 3 to thousands if >= 5

    ' Check if hundreds place needs adjustment
    mov tmp, bcd            ' copy the BCD into tmp since we need to keep the original value of BCD
    shr tmp, #8         ' Isolate hundreds place
    and tmp, #$F            ' only keep the lower 4 bits (0-9 in BCD)       
    cmp tmp, #5 wc      ' C is clear when the digit is 5 or more
    if_nc add bcd, ##$300   ' Add 3 to hundreds if >= 5

    ' Check if tens place needs adjustment
    mov tmp, bcd            ' copy the BCD into tmp since we need to keep the original value of BCD
    shr tmp, #4         ' Isolate tens place
    and tmp, #$F            ' only keep the lower 4 bits (0-9 in BCD) 
    cmp tmp, #5 wc      ' C is clear when the digit is 5 or more
    if_nc add bcd, ##$30    ' Add 3 to tens if >= 5

    ' Check if ones place needs adjustment
    mov tmp, bcd        ' copy the BCD into tmp since we need to keep the original value of BCD
    and tmp, #$F    ' only keep the lower 4 bits (0-9 in BCD)   
    cmp tmp, #5 wc  ' C is clear when the digit is 5 or more
    if_nc add bcd, #3   ' Add 3 to ones if >= 5

    shl bcd, #1                 ' Shift left and bring next binary bit
    test number, ##%1000000000000000 wc     ' Get most significant bit of 16-bit number
    if_c add bcd, #1                ' Carry bit into BCD
    shl number, #1              ' Shift number left

    djnz bit_count, #.double_dabble_loop        ' Repeat for 16 bits

' --- BCD now contains a properly converted value ---

' Extract and print **thousands** place
mov digit, bcd      ' copy the BCD into digit since we need to keep the original value of BCD
shr digit, #12      ' Isolate thousands place
and digit, #$F      ' only keep the lower 4 bits (0-9 in BCD) 
add digit, #"0"     ' Convert the digit to ASCII using #"0"
wypin digit, #TX_PIN    ' Print the digit to the terminal

.wait_tx0
rdpin pr2, #TX_PIN wc   ' check busy flag
if_c jmp #.wait_tx0 ' hold until done

' Extract and print **hundreds** place
mov digit, bcd      ' copy the BCD into digit since we need to keep the original value of BCD
shr digit, #8       ' Isolate hundreds place
and digit, #$F      ' only keep the lower 4 bits (0-9 in BCD) 
add digit, #"0"     ' Convert the digit to ASCII using #"0"
wypin digit, #TX_PIN    ' Print the digit to the terminal

.wait_tx1
rdpin pr2, #TX_PIN wc   ' check busy flag
if_c jmp #.wait_tx1 ' hold until done

' Extract and print **tens** place
mov digit, bcd      ' copy the BCD into digit since we need to keep the original value of BCD
shr digit, #4       ' Isolate tens place
and digit, #$F      ' only keep the lower 4 bits (0-9 in BCD) 
add digit, #"0"     ' Convert the digit to ASCII using #"0"
wypin digit, #TX_PIN    ' Print the digit to the terminal

.wait_tx2
rdpin pr2, #TX_PIN wc   ' check busy flag
if_c jmp #.wait_tx2 ' hold until done

' Extract and print **ones** place
mov digit, bcd      ' copy the BCD into digit since we need to keep the original value of BCD
and digit, #$F      ' only keep the lower 4 bits (0-9 in BCD) 
add digit, #"0"     ' Convert the digit to ASCII using #"0"
wypin digit, #TX_PIN    ' Print the digit to the terminal

.wait_tx3
rdpin pr2, #TX_PIN wc   ' check busy flag
if_c jmp #.wait_tx3 ' hold until done

' Res variables
number      res 1
bcd         res 1
digit       res 1
tmp         res 1
bit_count   res 1

This was all pretty fun though; there are so many ways to print numbers. I still need to try out the smartpin transmit buffer code. Hopefully this helps anyone who comes after me with a similar question. Thank you all!

Wuerfel_21 · 2025-02-14 20:09

This part can be done better:

.extract_digit
    rdlong divisor, ptr     ' Read the current place value from hub RAM into divisor
    cmpsub tmp, divisor  wc     ' Subtract divisor if tmp is greater than or equal to divisor
    if_c add  digit, #1     ' Increment digit if subtraction occurred
    if_c   jmp  #.extract_digit ' Repeat if tmp is greater than or equal to divisor

RDLONG is fairly slow; you don't want to do it in a loop. Also, in this case you already placed the numbers in cog RAM, so there's no need for it anyway. Use ALTS instead.

.extract_digit
              alts div_idx, #place_value     ' Replace source operand of next instruction with divisor
              cmpsub tmp, 0-0  wc         ' Subtract divisor if tmp is greater than or equal to divisor
        if_c  add  digit, #1            ' Increment digit if subtraction occurred
        if_c  jmp  #.extract_digit  ' Repeat if tmp is greater than or equal to divisor

Also, you can merge the ADD and JMP into an IJNZ (this works because digit will never wrap around to zero)

.extract_digit
              alts div_idx, #place_value     ' Replace source operand of next instruction with divisor
              cmpsub tmp, 0-0  wc            ' Subtract divisor if tmp is greater than or equal to divisor
        if_c  ijnz  digit, #.extract_digit  ' Increment digit and repeat if tmp is greater than or equal to divisor

proppy · 2025-02-15 16:03

Hello @Wuerfel_21,

Thank you for the tips above. I took a look at the manual to understand why I would want to use ALTS. It appears rdlong does indeed take much longer (about 16+ cycles?) compared to alts, which is 2 cycles. The manual actually explains what's going on quite well.

I noticed this line:
cmpsub tmp, 0-0 wc

The 0-0 is not something I am familiar with. I looked through the manual but did not see much about it. But looking at the way it works:
1. 0-0 always evaluates to 0, so it is basically a placeholder, since cmpsub requires two operands.
2. The alts overwrites the source of the next instruction. So the 0-0 for cmpsub becomes the result of the previous line, alts div_idx, #place_value

Mostly just checking my understanding of what is going on. The instruction ijnz is pretty straightforward. Thank you so much for the assist.

Electrodude · 2025-02-15 16:31

By convention, 0-0 is used to indicate that an argument's compile-time value is irrelevant because it gets replaced by something else, in this case by the alts. It doesn't mean anything special to the compiler. I think it's supposed to look like a pair of eyeglasses, to tell you to look out.

Wuerfel_21 · 2025-02-15 16:52

@Electrodude said:
By convention, 0-0 is used to indicate that an argument's compile-time value is irrelevant because it gets replaced by something else, in this case by the alts. It doesn't mean anything special to the compiler. I think it's supposed to look like a pair of eyeglasses, to tell you to look out.

Exactly, just a placeholder value to make it easier to read. For flexspin it does mean something special though: if you used just cmpsub tmp, 0 wc, it would tell you "warning: Second operand to cmpsub is a constant used without #; is this correct? If so, you may suppress this warning by putting -0 after the operand" (any other math expression would also work).
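
A small Python model of what the ALTS plus CMPSUB pair does may help anyone checking their understanding of the placeholder. It is a sketch under assumptions: the cog is modelled as a plain list of 512 registers, the register addresses are hypothetical, and only the S-field substitution of ALTS is modelled.

```python
PLACE_VALUE = 0x100                      # hypothetical cog address of the table
DIV_IDX, TMP, DIGIT = 0x120, 0x121, 0x122  # hypothetical register addresses


def make_cog(number: int) -> list:
    """Build a 512-register cog image holding the table and the number."""
    cog = [0] * 512
    cog[PLACE_VALUE:PLACE_VALUE + 10] = [10**p for p in range(9, -1, -1)]
    cog[TMP] = number
    return cog


def alts(cog, index_reg: int, base: int) -> int:
    """Return the register address that replaces the next instruction's S field."""
    return (base + cog[index_reg]) & 0x1FF


def cmpsub(cog, d: int, s: int) -> int:
    """CMPSUB D,S: if D >= S, subtract S from D and return C = 1, else C = 0."""
    if cog[d] >= cog[s]:
        cog[d] -= cog[s]
        return 1
    return 0


def extract_digit(cog) -> int:
    """Model of the .extract_digit loop for the current div_idx."""
    cog[DIGIT] = 0
    while True:
        s = alts(cog, DIV_IDX, PLACE_VALUE)  # the written 0-0 operand is never read
        if not cmpsub(cog, TMP, s):
            break
        cog[DIGIT] += 1
    return cog[DIGIT]


if __name__ == "__main__":
    cog = make_cog(4_294_967_290)
    print(extract_digit(cog))
```

The point of the model is that `cmpsub` never sees the literal 0 written in the source; it only ever sees the address computed by `alts` on the line before.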
