Faster number printing without QDIV
Hello!
I was working on printing numbers. Using the Propeller 2 Assembly Language (PASM2) Manual, I was able to print a number, testing with the largest unsigned 32-bit value. I stumbled onto QDIV, and the program works and appears to be accurate:
```
' set the receiving pin for input to the P2 microcontroller
RX_PIN = 63
' set the transmission pin for output from the P2 microcontroller
TX_PIN = 62
' set the baud mode to support 2000000 baud
BAUD_MODE = 655367

dat
    ' begin the program at address 0
    org 0
    ' Set the clock mode
    asmclk
    ' configure TX smart pin
    fltl #TX_PIN
    wrpin ##(P_ASYNC_TX | P_OE), #TX_PIN
    wxpin ##BAUD_MODE, #TX_PIN
    drvl #TX_PIN

    mov div_idx, #0
    mov number, ##4_294_967_295
    mov tmp, number
    mov divisor, ##1_000_000_000

.digit_loop
    mov digit, #0
.extract_digit
    cmpsub tmp, divisor wc       ' Subtract divisor if num >= divisor
    if_c add digit, #1           ' Increment digit if subtraction occurred
    if_c jmp #.extract_digit     ' Repeat if num >= divisor
    add digit, #"0"              ' Convert to ASCII
    wypin digit, #TX_PIN
.flush
    rdpin pr2, #TX_PIN WC        ' check busy flag
    if_c jmp #.flush             ' hold until done
    add div_idx, #1              ' Move to next divisor in table
    qdiv divisor, #10
    getqx divisor
    cmp div_idx, #10 wz          ' Stop when all digits are printed
    if_nz jmp #.digit_loop       ' Continue until last divisor is reached
    jmp #.done
.done
    ret

' Constants and Variables
digit   res 1
divisor res 1
div_idx res 1
tmp     res 1
number  res 1
buffer  byte "0", 0, 0
```
I thought I was done, but then I read the fine print:
ALU circuit and CORDIC Solver math instructions. The ALU (Arithmetic Logic Unit) instructions perform common math operations in just 2 clock cycles each. The CORDIC (COordinate Rotation DIgital Computer) instructions perform more complicated math operations in 54 clock cycles each.
Right now for this simple example, I wouldn't notice much delay but I would imagine it would be rather slow for a more complex project, like a video game. Is there a faster way to do the division?
Comments
Unless you are a modern JRPG, you are not printing enough numbers per frame in a video game for it to really matter.
You're actually doing two divisions per digit here: One iterative one in `extract_digit` and the obvious QDIV one. The QDIV here actually just moves `divisor` through a fixed sequence of powers of ten, so you could replace it with a table lookup. But then you're still doing a division loop that can take longer than a QDIV would (if the digit is 8 or 9).
Also never put initialized data after RES, the krampus will come and eat your socks.
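
To make the table-lookup suggestion concrete, here is a rough Python model (not P2 code) of a digit loop that walks a table of powers of ten and counts how many subtractions the inner loop performs. The names `PLACE_VALUES` and `digits_by_subtraction` are made up for illustration, and the count is only a stand-in for work done, not a cycle-accurate timing.

```python
# Hypothetical Python model of the table-driven digit loop; not P2 code.
PLACE_VALUES = [10 ** k for k in range(9, -1, -1)]  # 1_000_000_000 down to 1


def digits_by_subtraction(number):
    """Return (10-character decimal string, number of subtractions performed)."""
    assert 0 <= number <= 0xFFFF_FFFF
    tmp = number
    chars = []
    subtractions = 0
    for place in PLACE_VALUES:          # table lookup stands in for QDIV divisor, #10
        digit = 0
        while tmp >= place:             # models CMPSUB tmp, place WC
            tmp -= place
            digit += 1
            subtractions += 1
        chars.append(chr(ord("0") + digit))
    return "".join(chars), subtractions


text, count = digits_by_subtraction(4_294_967_295)
print(text, count)  # prints: 4294967295 57
```

The subtraction count is simply the sum of the decimal digits, which is why a digit of 8 or 9 is the expensive case for the inner loop.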
You can also convert a number to decimal without doing division at all, using the "double dabble" algorithm (https://en.wikipedia.org/wiki/Double_dabble). Technically this actually converts the number to binary coded decimal, but this is easily printed (just print it as you would a hex number).
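
For anyone who wants to see the idea before writing it in PASM2, below is a minimal Python sketch of double dabble for a 32-bit value. It is only a model under stated assumptions: the function name `double_dabble` is invented here, and the 10-digit BCD result needs 40 bits, so on the P2 it would have to span two registers rather than one Python integer.

```python
def double_dabble(value, digits=10):
    """Convert an unsigned 32-bit integer to packed BCD using only shifts and adds."""
    assert 0 <= value <= 0xFFFF_FFFF
    bcd = 0
    for bit in range(31, -1, -1):
        # Before each shift, add 3 to every BCD nibble that holds 5 or more,
        # so that doubling it carries correctly into the next decimal digit.
        for nib in range(digits):
            if ((bcd >> (4 * nib)) & 0xF) >= 5:
                bcd += 3 << (4 * nib)
        # Shift the next binary bit (most significant first) into the BCD value.
        bcd = (bcd << 1) | ((value >> bit) & 1)
    return bcd


# Printing the BCD value as hex shows the decimal digits.
print(format(double_dabble(4_294_967_295), "x"))  # prints: 4294967295
```

No division or multiplication appears anywhere in the loop, which is the whole appeal of the method.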
For 16-bit values when the divisor is a constant and fairly small, you could pre-compute `65536/divisor` and use that as `S` in the `MUL D,S` instruction.
It's an extra instruction, but this makes use of the smartpin's transmit buffer. It allows the CORDIC and the comport to be operated in parallel.
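
Going back to the pre-computed `65536/divisor` suggestion, here is a small Python model of how that could work. It is a sketch under assumptions, not the commenter's code: the function name `div_by_const_16` is made up, the reciprocal is rounded down, and one compare-and-subtract fix-up is added because the rounded-down reciprocal can leave the quotient one too low.

```python
def div_by_const_16(n, d):
    """Divide a 16-bit n by a constant d using multiply, shift and one fix-up."""
    assert 0 <= n <= 0xFFFF and 2 <= d <= 0xFFFF
    recip = 65536 // d          # pre-computed once; fits in 16 bits when d >= 2
    q = (n * recip) >> 16       # models a 16x16 multiply followed by a 16-bit right shift
    r = n - q * d
    if r >= d:                  # the estimate is never too high and at most one too low
        q += 1
        r -= d
    return q, r


print(div_by_const_16(65535, 10))  # prints: (6553, 5)
```

The fix-up step is the kind of thing a single CMPSUB could handle, so the cost stays at a handful of 2-clock instructions instead of a CORDIC operation.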
I took some time to look into this and was pretty surprised to find that division via hardware, even in later game consoles (as recent as PS4!), has been avoided in many cases. I mostly stick to 2D stuff so that works.
I see what you mean about the division loop. I did think about doing a lookup table but didn't really start to understand arrays of longs until I had the above code completed. I think I'll revisit it now though.
So, I may need to reference the manual again but is there a reason not to do that? Alignment issues? Thank you in advance!
I'm going to give this a shot. Thank you!
I'll give this a shot as well. It'll be helpful for my journey learning Propeller Assembly. Thank you!
I'm going to look into this. I did see rqpin in the manual once but had no idea why I would use it. I couldn't figure out a reason I wouldn't want "no acknowledge". Thank you!
You all are awesome! I guess I've got more homework to do.
You shouldn't put data after RES because RES desynchronizes the cog address counter with the actual data being assembled. It is only to be used to reserve space at the end of cog RAM without emitting corresponding padding into hub RAM. There's a longer explanation somewhere on here but I'm writing from my phone in a waiting room. Someone please find and link it.
I've actually been working on a 3D rendering thing: https://forums.parallax.com/discussion/176083/3d-teapot-demo/p1
I'm not sure if that detail actually matters. I'd just cut'n'pasted from old code. The important part is the reverse order of checking smartpin status before writing the buffer instead of the other way around. That and also checking for buffer full as well.
Hi all!
So I wanted to post what I came away with so far:
This piece of code prints the number. I'm using an array for the place values. It only goes up to the billions, since the largest unsigned number that can be stored in a 32-bit variable is 4_294_967_295. I also added some logic for leading zeros.
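
As a generic illustration of the leading-zero logic mentioned here (a sketch, not the poster's code), one common approach is to hold back output until the first non-zero digit, while always emitting the final digit so that zero still prints as "0". The function name `strip_leading_zeros` is made up for this example.

```python
def strip_leading_zeros(digits):
    """Suppress leading zeros but always keep the last digit, so 0 prints as "0"."""
    printing = False
    out = []
    for i, ch in enumerate(digits):
        # Start printing at the first non-zero digit, or at the last digit regardless.
        if ch != "0" or i == len(digits) - 1:
            printing = True
        if printing:
            out.append(ch)
    return "".join(out)


print(strip_leading_zeros("0000000042"))  # prints: 42
```

In assembly the `printing` flag could live in a register or a flag bit that is set once and then left alone for the rest of the digits.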
This second piece of code also prints a number from a variable. The difference is I'm using the double dabble method mentioned. This one took a little longer just because of reading the manual and using a lot of commands I had never used until now. It could use some improvements, of course:
1. This does not contain any checks for leading zeros.
2. The printing of the number could be in a loop.
3. Some of the commands in the double dabble loop repeat a bit. I think I can put this in an extra loop, though I'm not sure if this would speed things up at all.
This was all pretty fun though, there are so many ways to print numbers. I still need to try out the smartpin transmit buffer code. Hopefully this helps anyone who comes after me with similar questions. Thank you all!
This part can be done better:
RDLONG is fairly slow, you don't want to do it in a loop. Also, in this case, you placed the numbers in cog RAM already, so no need for that anyways, use ALTS.
Also, you can merge the ADD and JMP into an IJNZ (this works because `digit` will never wrap around to zero).
Hello @Wuerfel_21,
Thank you for the tips above. I took a look at the manual to understand why I would want to use it. It appears rdlong does indeed take much longer (about 16+ cycles?) compared to alts, which is 2 cycles. The manual actually explains what's going on quite well.
I noticed this line:
cmpsub tmp, 0-0 wc
The 0-0 is not something I am familiar with. I did run through the manual but did not see much about this. But looking at the way it works:
1. 0-0 always evaluates to 0, basically a placeholder since cmpsub requires two operands.
2. The alts overwrites the source of the next command. So 0-0 for cmpsub becomes the result of the previous line, alts div_idx, #place_value
Mostly just checking my understanding of what is going on. The instruction ijnz is pretty straightforward. Thank you so much for the assist.
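
To check that reading of `alts`, here is a toy Python model of the three-instruction loop. It is a sketch under assumptions, not P2 code: the register addresses (`PLACE_VALUE`, `TMP`, `DIV_IDX`, `DIGIT`) are invented, and the `(D + S) & $1FF` behaviour follows my reading of the instruction description for the case where S is a 9-bit immediate.

```python
COG_SIZE = 512                                # cog register addresses are 9 bits
PLACE_VALUE = 0x100                           # hypothetical address of the table
TMP, DIV_IDX, DIGIT = 0x10A, 0x10B, 0x10C     # hypothetical register addresses


def extract_digit(cog):
    """Model the ALTS / CMPSUB / IJNZ loop for one decimal digit."""
    cog[DIGIT] = 0
    while True:
        # alts div_idx, #place_value : S of the next instruction = (D + S) & $1FF
        s_addr = (cog[DIV_IDX] + PLACE_VALUE) & 0x1FF
        # cmpsub tmp, 0-0 wc : the 0-0 is replaced, so cog[s_addr] is the real source
        carry = cog[TMP] >= cog[s_addr]
        if not carry:
            break
        cog[TMP] -= cog[s_addr]
        # if_c ijnz digit, #loop : increment, then jump only if the result is non-zero
        cog[DIGIT] = (cog[DIGIT] + 1) & 0xFFFF_FFFF
        if cog[DIGIT] == 0:
            break
    return cog[DIGIT]


def print_number(number):
    """Run the model over all ten place values and return the digit string."""
    cog = [0] * COG_SIZE
    cog[PLACE_VALUE:PLACE_VALUE + 10] = [10 ** k for k in range(9, -1, -1)]
    cog[TMP] = number
    out = []
    for idx in range(10):
        cog[DIV_IDX] = idx
        out.append(chr(ord("0") + extract_digit(cog)))
    return "".join(out)


print(print_number(4_294_967_295))  # prints: 4294967295
```

The point of the model is that the placeholder operand never reaches the subtraction: the address computed from `div_idx` plus the table base is what gets read.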
By convention, `0-0` is used to indicate that an argument's compile-time value is irrelevant because it gets replaced by something else, in this case by the `alts`. It doesn't mean anything special to the compiler. I think it's supposed to look like a pair of eyeglasses, to tell you to look out.
Exactly, just a placeholder value to make it easier to read. For flexspin it does mean something special though - if you used just `cmpsub tmp, 0 wc` it would tell you `warning: Second operand to cmpsub is a constant used without #; is this correct? If so, you may suppress this warning by putting -0 after the operand` (any other math expression would also work).