Faster number printing without QDIV — Parallax Forums

proppy Posts: 9
edited 2025-01-21 20:48 in Propeller 2

Hello!

I was working on printing numbers. Using the Propeller 2 Assembly Language (PASM2) Manual, I was able to print the largest unsigned 32-bit number (4,294,967,295) one decimal digit at a time. I stumbled onto QDIV, and the program works and appears to be very accurate:

```
' set the receiving pin for input to the P2 microcontroller
RX_PIN = 63
' set the transmission pin for output from the P2 microcontroller
TX_PIN = 62
' set the baud mode to support 2,000,000 baud
BAUD_MODE = 655367

dat

' begin the program at address 0
                org     0

' Set the clock mode
                asmclk

' configure TX smart pin
                fltl    #TX_PIN
                wrpin   ##(P_ASYNC_TX | P_OE), #TX_PIN
                wxpin   ##BAUD_MODE, #TX_PIN
                drvl    #TX_PIN

                mov     div_idx, #0
                mov     number, ##4_294_967_295
                mov     tmp, number
                mov     divisor, ##1_000_000_000

.digit_loop
                mov     digit, #0

.extract_digit
                cmpsub  tmp, divisor    wc      ' Subtract divisor if tmp >= divisor
        if_c    add     digit, #1               ' Increment digit if subtraction occurred
        if_c    jmp     #.extract_digit         ' Repeat while tmp >= divisor

                add     digit, #"0"             ' Convert to ASCII
                wypin   digit, #TX_PIN

.flush          rdpin   pr2, #TX_PIN    wc      ' check busy flag
        if_c    jmp     #.flush                 ' hold until done

                add     div_idx, #1             ' Count the digit just printed

                qdiv    divisor, #10            ' Next lower power of ten
                getqx   divisor

                cmp     div_idx, #10    wz      ' Stop when all digits are printed
        if_nz   jmp     #.digit_loop            ' Continue until the last divisor is reached
                jmp     #.done

.done
                jmp     #$                      ' Stay here; there is no caller to RET to

' Constants and Variables
digit           res     1
divisor         res     1
div_idx         res     1
tmp             res     1
number          res     1
buffer          byte    "0", 0, 0
```

I thought I was done, but then I read the fine print:
ALU circuit and CORDIC Solver math instructions. The ALU (Arithmetic Logic Unit) instructions perform common math operations in just 2 clock cycles each. The CORDIC (COordinate Rotation DIgital Computer) instructions perform more complicated math operations in 54 clock cycles each.

Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

Comments

  • Wuerfel_21

    @proppy said:
    Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

    Unless you are a modern JRPG you are not printing enough numbers per frame in a video game for it to really matter :)

    You're actually doing two divisions per digit here: One iterative one in extract_digit and the obvious QDIV one. The QDIV here actually just moves divisor through a fixed sequence of powers of ten, so you could replace it with a table lookup. But then you're still doing a division loop that can take longer than a QDIV would (if the digit is 8 or 9).
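    For illustration, here is a rough C model of the table idea (not PASM2, and not code from this thread). The divisor comes from a fixed table of powers of ten instead of a QDIV, while the inner loop keeps the same subtract-and-count scheme as the CMPSUB/IF_C loop in the first post. The names `POW10` and `u32_to_dec10` are made up for this sketch, and it assumes a fixed 10-digit output with leading zeros, which is what the original loop produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Powers of ten from 10^9 down to 10^0. In PASM2 this would be a small
   table of longs indexed by the digit counter (hypothetical here). */
static const uint32_t POW10[10] = {
    1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
    10000u, 1000u, 100u, 10u, 1u
};

/* Writes exactly 10 ASCII digits (leading zeros included, like the
   original loop) and a terminating NUL into out[0..10]. */
void u32_to_dec10(uint32_t n, char out[11])
{
    assert(out != NULL);
    for (size_t i = 0; i < 10; i++) {
        uint32_t divisor = POW10[i];   /* table lookup instead of QDIV */
        char digit = '0';
        /* Same idea as CMPSUB ... WC / IF_C ADD / IF_C JMP:
           keep subtracting while the remainder is >= divisor. */
        while (n >= divisor) {
            n -= divisor;
            digit++;
        }
        out[i] = digit;
    }
    out[10] = '\0';
}
```

    Calling `u32_to_dec10(4294967295u, buf)` with an 11-byte `buf` fills it with "4294967295". This only removes the QDIV; the inner loop still runs up to nine subtractions per digit, which is the cost mentioned above. On the P2 the table lookup would presumably use ALTS-style indexing by `div_idx`, but that part is an assumption and is not shown here.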

    Also, never put initialized data after RES, or the krampus will come and eat your socks.

  • ersmith Posts: 6,102

    You can also convert a number to decimal without doing division at all, using the "double dabble" algorithm (https://en.wikipedia.org/wiki/Double_dabble). Technically this actually converts the number to binary coded decimal, but this is easily printed (just print it as you would a hex number).
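    A minimal C sketch of double dabble, assuming a 32-bit input and a 64-bit accumulator to hold the ten packed BCD digits (40 bits). The function name `u32_to_bcd` is hypothetical; the code only shows the shift-and-add-3 steps described on the Wikipedia page, not a P2-optimized version.

```c
#include <assert.h>
#include <stdint.h>

/* Double dabble: converts a 32-bit binary value into 10 packed BCD digits
   (40 bits, so a 64-bit accumulator is used). Only shifts, compares and
   adds are used; there is no division anywhere. */
uint64_t u32_to_bcd(uint32_t n)
{
    uint64_t bcd = 0;
    for (int bit = 31; bit >= 0; bit--) {
        /* "Add 3" step: a BCD digit of 5..9 would exceed 9 after the
           doubling shift, so correct it first. 9 + 3 = 12 still fits in
           a nibble, so no carry leaks into the next digit. */
        for (int d = 0; d < 10; d++) {
            uint64_t nibble = (bcd >> (4 * d)) & 0xFu;
            if (nibble >= 5)
                bcd += (uint64_t)3 << (4 * d);
        }
        /* Shift left one place and bring in the next binary bit. */
        bcd = (bcd << 1) | ((n >> bit) & 1u);
    }
    assert(bcd <= 0x4294967295ull);   /* 10 BCD digits always suffice */
    return bcd;
}
```

    `u32_to_bcd(4294967295u)` returns `0x4294967295`, so printing the result as ten hex digits gives the decimal text. This sketch makes no claim about cycle counts on the P2 compared with the QDIV version; as written, the nibble loop runs 10 times for each of the 32 input bits.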

  • TonyB_ Posts: 2,198
    edited 2025-01-22 00:02

    @proppy said:

    Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

    For 16-bit values when the divisor is a constant and fairly small, you could pre-compute 65536/divisor and use that as S in the MUL D,S instruction.
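    A small C model of that idea, assuming a 16x16 to 32-bit unsigned multiply like MUL D,S followed by a right shift to drop the fraction. The helper names `recip`, `div_by_recip` and `first_mismatch` are hypothetical. Two details seem worth checking by brute force: the reciprocal has to be rounded up (with 6553 instead of 6554, even 10/10 comes out as 0), and with a 2^16 scale the quotient is only exact up to a certain input.

```c
#include <assert.h>
#include <stdint.h>

/* Reciprocal k = ceil(2^shift / d). Rounding up matters: rounding down
   makes exact multiples of d come out one too small. k must fit in 16 bits
   to serve as the S operand of a 16x16 multiply such as MUL D,S. */
uint32_t recip(uint32_t d, unsigned shift)
{
    assert(d > 1 && shift < 32);
    uint32_t k = (uint32_t)((((uint64_t)1 << shift) + d - 1) / d);
    assert(k <= 0xFFFFu);
    return k;
}

/* Quotient estimate: 16-bit x times 16-bit k gives a 32-bit product
   (like MUL), then a right shift drops the fractional part. */
uint32_t div_by_recip(uint16_t x, uint32_t k, unsigned shift)
{
    return ((uint32_t)x * k) >> shift;
}

/* Brute-force check: returns the first 16-bit x where the shortcut
   differs from x / d, or 65536 if it is exact for all of them. */
uint32_t first_mismatch(uint32_t d, unsigned shift)
{
    uint32_t k = recip(d, shift);
    for (uint32_t x = 0; x <= 0xFFFFu; x++)
        if (div_by_recip((uint16_t)x, k, shift) != x / d)
            return x;
    return 65536;
}
```

    With divisor 10, `first_mismatch(10, 16)` returns 16389: 65536/10 rounded up is 6554, and 16389 * 6554 >> 16 gives 1639 instead of 1638. Scaling by 2^19 instead (an extra detail, not part of the suggestion above) gives 52429, which still fits a 16-bit MUL operand and is exact for every 16-bit input, at the cost of a shift by 19 rather than simply taking the high word of the product.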

  • evanh Posts: 16,129
    edited 2025-01-22 02:54

    @proppy said:

      ...
      add     digit, #"0"       ' Convert to ASCII
      wypin   digit, #TX_PIN
        
    .flush                  rdpin     pr2, #TX_PIN         WC      ' check busy flag
            if_c            jmp       #.flush                       ' hold until done
      ...
    

    It's an extra instruction, but this makes use of the smartpin's transmit buffer, which allows the CORDIC and the comport to operate in parallel.

                    ...
    .txfull
                    rqpin   inb, #TX_PIN   wc    ' transmitting? (C high == yes)  *Needed to initiate tx
                    testp   #TX_PIN   wz    ' buffer free? (IN high == yes)
    if_c_and_nz     jmp     #.txfull    ' wait while Smartpin is both full (nz) and transmitting (c)
    
                    add     digit, #"0"    ' Convert to ASCII
                    wypin   digit, #TX_PIN    ' write new byte to Y buffer
                    ...
    
  • proppy

    @Wuerfel_21 said:

    Unless you are a modern JRPG you are not printing enough numbers per frame in a video game for it to really matter :)

    You're actually doing two divisions per digit here: One iterative one in extract_digit and the obvious QDIV one. The QDIV here actually just moves divisor through a fixed sequence of powers of ten, so you could replace it with a table lookup. But then you're still doing a division loop that can take longer than a QDIV would (if the digit is 8 or 9).

    Also, never put initialized data after RES, or the krampus will come and eat your socks.

    I took some time to look into this and was pretty surprised to find that hardware division has been avoided in many cases, even in later game consoles (as recent as the PS4!). I mostly stick to 2D stuff, so that works for me.

    I see what you mean about the division loop. I did think about doing a lookup table but didn't really start to understand arrays of longs until I had the above code completed. I think I'll revisit it now, though.

    I may need to reference the manual again, but is there a reason not to put initialized data after RES? Alignment issues? Thank you in advance!

    @ersmith said:
    You can also convert a number to decimal without doing division at all, using the "double dabble" algorithm (https://en.wikipedia.org/wiki/Double_dabble). Technically this actually converts the number to binary coded decimal, but this is easily printed (just print it as you would a hex number).

    I'm going to give this a shot. Thank you!

    @TonyB_ said:

    @proppy said:

    Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?

    For 16-bit values when the divisor is a constant and fairly small, you could pre-compute 65536/divisor and use that as S in the MUL D,S instruction.

    I'll give this a shot as well. It'll be helpful for my journey learning Propeller Assembly. Thank you!

    @evanh said:

    @proppy said:

        ...
        add     digit, #"0"       ' Convert to ASCII
        wypin   digit, #TX_PIN
        
    .flush                  rdpin     pr2, #TX_PIN         WC      ' check busy flag
            if_c            jmp       #.flush                       ' hold until done
        ...
    

    It's an extra instruction, but this makes use of the smartpin's transmit buffer, which allows the CORDIC and the comport to operate in parallel.

                    ...
    .txfull
                    rqpin   inb, #TX_PIN   wc    ' transmitting? (C high == yes)  *Needed to initiate tx
                    testp   #TX_PIN   wz    ' buffer free? (IN high == yes)
    if_c_and_nz     jmp     #.txfull    ' wait while Smartpin is both full (nz) and transmitting (c)
    
                    add     digit, #"0"    ' Convert to ASCII
                    wypin   digit, #TX_PIN    ' write new byte to Y buffer
                    ...
    

    I'm going to look into this. I did see rqpin in the manual once but had no idea why I would use it. I couldn't figure out a reason I wouldn't want "no acknowledge". Thank you!

    You all are awesome! I guess I've got more homework to do.

  • You shouldn't put data after RES because RES desynchronizes the cog address counter with the actual data being assembled. It is only to be used to reserve space at the end of cog RAM without emitting corresponding padding into hub RAM. There's a longer explanation somewhere on here but I'm writing from my phone in a waiting room. Someone please find and link it.
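    To picture the desync, here is a toy C model (not how PNut or flexspin are actually written) of the two counters involved: the cog address each label will have at run time, and the offset where its initial data lands in the hub image. Data directives advance both; RES advances only the cog address. Since the cog loads the image long-for-long starting at cog address 0, data placed after RES ends up at the wrong cog address. The names `assemble` and `example` are hypothetical, and the 30-long code size in `example` is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (not the real PNut/flexspin assembler) of a DAT block.
   DATA lines emit longs into the hub image; RES lines only reserve cog
   addresses and emit nothing. */
enum kind { DATA, RES };

struct line {
    const char *label;
    enum kind kind;
    unsigned longs;            /* size in longs */
};

/* Hypothetical layout echoing the end of the first post: an arbitrary 30
   longs of code, five RES variables, then initialized BYTE data. */
const struct line example[] = {
    { "code",    DATA, 30 },
    { "digit",   RES,  1 },
    { "divisor", RES,  1 },
    { "div_idx", RES,  1 },
    { "tmp",     RES,  1 },
    { "number",  RES,  1 },
    { "buffer",  DATA, 1 },
};

/* Records, for each line, the cog address its label gets and the hub
   offset (in longs) where its data is emitted. The model assumes hub
   offset h is loaded into cog address h, so DATA lines are only
   correct where the two numbers agree. */
void assemble(const struct line *prog, size_t n,
              unsigned cog_addr[], unsigned hub_off[])
{
    unsigned cog = 0, hub = 0;
    for (size_t i = 0; i < n; i++) {
        assert(prog[i].longs > 0);
        cog_addr[i] = cog;
        hub_off[i] = hub;
        cog += prog[i].longs;
        if (prog[i].kind == DATA)
            hub += prog[i].longs;  /* RES advances cog only */
    }
}
```

    Running `assemble` on `example` gives `buffer` a cog address of 35 but a hub offset of 30, so its initial "0" would be loaded into cog address 30, which is where `digit` lives, while code reading `buffer` would look at an address that was never loaded.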

    I've actually been working on a 3D rendering thing: https://forums.parallax.com/discussion/176083/3d-teapot-demo/p1

  • evanh Posts: 16,129
    edited 2025-01-24 06:21

    @proppy said:
    I'm going to look into this. I did see rqpin in the manual once but had no idea why I would use it. I couldn't figure out a reason I wouldn't want "no acknowledge". Thank you!

    I'm not sure if that detail actually matters. I'd just cut'n'pasted from old code. The important part is the reverse order: checking the smartpin status before writing the buffer instead of the other way around, and also checking for buffer full.
