Inline ASM: code improvement?
SRLM
Posts: 5,045
Hi all,
I've tried my hand at writing a moderately complex inline assembly function, and the result is below. Unfortunately, it's quite a bit slower than the vanilla C function:
Results (-Os and -mcmm on all tests, converting the number -3258656; the byte counts are the total program size):
C: 23176 bytes, 14800 cycles (-mfcache)
ASM: 23208 bytes, 34592 cycles
C: 23000 bytes, 28400 cycles (-mno-fcache)
So it looks like to get the most benefit I need to get the function into the fcache. I tried the "__attribute__((fcache))" option, but it doesn't like that I use r0 in the code (undefined reference to `__LMM_MVI_r0').
Most of the "odd" things in the code have to do with getting it to work with all optimization options and memory models. I think the biggest problem is that the code doesn't really have any outputs (it's the bytes pointed to by "string" that change, but I can't indicate that...), so at -Os the compiler removes the inline assembly completely if I leave out the volatile.
Any suggestions on improvement for this code?
//Convert a number to its string representation (decimal base).
char * Numbers::Dec(int input_num, char * string){
    char * starting_string = string;
    int count, divisor;

    if(input_num == 0){
        string[0] = '0';
        string[1] = '\0';
        return starting_string;
    }

    if(input_num < 0){
        input_num = -input_num;
        string[0] = '-';
        string++;
    }

    __asm__ volatile(
        "brw #start_ \n\t"

        "divisor_array: "
        "long 1000000000 \n\t"
        "long 100000000 \n\t"
        "long 10000000 \n\t"
        "long 1000000 \n\t"
        "long 100000 \n\t"
        "long 10000 \n\t"
        "long 1000 \n\t"
        "long 100 \n\t"
        "long 10 \n\t"
        "long 1 \n\t"

        "start_: "
        "mvi r0, #divisor_array \n\t"

        "start_digit: "
        "rdlong %[divisor], r0 \n\t"                /* Figure out what the first divisor is. */
        "cmp %[input_num], %[divisor] wc \n\t"
        "if_c add r0, #4 \n\t"
        "if_c brw #start_digit \n\t"

        "main_loop: "
        "rdlong %[divisor], r0 \n\t"                /* Read the current decimal divisor (10, 1000, 10000, etc.)*/
        "mov %[count], #0 \n\t"                     /*Clear counter*/

        "count_digit: "
        "cmp %[input_num], %[divisor] wc \n\t"
        "if_nc sub %[input_num], %[divisor] \n\t"
        "if_nc add %[count], #1 \n\t"
        "if_nc brw #count_digit \n\t"

        "add %[count], #48 \n\t"                    /*Convert to ascii representation*/
        "wrbyte %[count], %[string] \n\t"           /*Store result*/
        "add %[string], #1 \n\t"                    /*increment to next char address*/

        "cmp %[divisor], #1 wz \n\t"                /*Are we at the last digit?*/
        "if_nz add r0, #4 \n\t"
        "if_nz brw #main_loop \n\t"

        "mov %[count], #0 \n\t"                     /*Null terminate */
        "wrbyte %[count], %[string] \n\t"

        : /*outputs (+inputs) */
            [string] "+r" (string)
        : /*inputs */
            [input_num] "r" (input_num),
            [count] "r" (count),
            [divisor] "r" (divisor)
        : /*clobber */
            "r0"
    );

    return starting_string;
}
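On the point about not being able to tell the compiler that the bytes behind "string" change: GCC's extended asm accepts a "memory" clobber, which says the assembly may read or write memory other than its listed operands. The following is only a syntax sketch, not a tested fix for the function above. The helper name store_digit is made up for illustration, and the asm template is left empty so the snippet builds on any GCC target; the PASM would go in its place. Whether the clobber lets you drop volatile at -Os is something to check in the generated code; keeping volatile alongside it is the conservative choice.

```c
/* Hypothetical helper, not from the original post. */
static void store_digit(char *string, char digit)
{
    *string = digit;            /* stand-in for the wrbyte the PASM would do */

    __asm__ volatile(
        ""                      /* real instructions would go here */
        :                       /* no register outputs */
        : "r" (string)          /* the asm only reads the pointer value... */
        : "memory"              /* ...but memory it points to may change */
    );
}
```

With the clobber in place the compiler does not assume that values it cached from memory are still valid after the asm statement.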
Comments
Eric
Mostly, I did this as a learning exercise. Dec is a moderately complex function that is easy to test, and I'm glad that I did because I've found lots of little quirks to keep in mind.
Still, if there are any suggestions for improving it then that would be great. It is one of my speed sensitive functions that could benefit from a PASM speed improvement.
(1) You could declare the divisor_array in C, and pass its address into the inline assembly as a parameter. This would make the code a bit easier to read, and would also get rid of the mvi that was causing you grief in fcache mode.
(2) You could set up the fcache yourself, and then use djnz inside the fcached code. This is moderately complicated (branches all have to be changed to be relative to the start of the fcache region). The easiest way to start would be to begin with the assembly produced by the C equivalent of the above code -- the gcc -S option will create this.
(3) The best bet for a big improvement would be a change of algorithm. As you've seen, the C compiler already produces code that's faster than the PASM you wrote. I'm sure you'll eventually get your PASM faster than the compiler's, but it probably won't be a huge change. But perhaps using some other kind of algorithm (maybe a binary search on the digits? or try multiplication by a fixed point equivalent of 0.1?) would make things much faster.
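As a rough illustration of suggestion (1), here is the same table-and-subtract algorithm written in plain C with the table declared in C. This is a sketch, not the "vanilla C function" that was benchmarked (that code was not posted), and the name dec_by_subtraction is made up. In an inline-assembly version the table's address would be handed to the asm as an operand instead of being loaded with mvi.

```c
#include <stdint.h>

/* Powers of ten, largest first: the same values as divisor_array in the
   posted assembly, but declared in C so the compiler places the table. */
static const uint32_t divisor_array[10] = {
    1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
    10000u, 1000u, 100u, 10u, 1u
};

/* Hypothetical name; plain-C rendering of the posted algorithm. */
char *dec_by_subtraction(int input_num, char *string)
{
    char *starting_string = string;
    uint32_t value = (uint32_t)input_num;
    const uint32_t *divisor = divisor_array;

    if (input_num == 0) {
        string[0] = '0';
        string[1] = '\0';
        return starting_string;
    }
    if (input_num < 0) {
        value = 0u - value;     /* negate as unsigned, safe for INT_MIN too */
        *string++ = '-';
    }

    /* Skip divisors larger than the value so no leading zeros are written. */
    while (value < *divisor)
        divisor++;

    /* One digit per remaining divisor, found by repeated subtraction. */
    for (;;) {
        char count = '0';
        while (value >= *divisor) {
            value -= *divisor;
            count++;
        }
        *string++ = count;
        if (*divisor == 1u)
            break;
        divisor++;
    }
    *string = '\0';
    return starting_string;
}
```

The caller supplies a buffer of at least 12 bytes (sign, ten digits, terminator).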
Eric
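For the "fixed point equivalent of 0.1" idea in (3), one possible shape is sketched below. It uses a well-known shift-and-add approximation of n/10 followed by a correction of at most one, so it needs neither a divide nor a multiply instruction. The names div10 and dec_by_div10 are made up here, and nothing in the thread says whether this would actually beat the table approach on the Propeller; it would need to be measured.

```c
#include <stdint.h>

/* n / 10 using shifts and adds. 0.1 is approximated by a binary fraction;
   the estimate is never too large and is corrected by at most one. */
static uint32_t div10(uint32_t n)
{
    uint32_t q = (n >> 1) + (n >> 2);          /* q = n * 0.75 */
    q += q >> 4;
    q += q >> 8;
    q += q >> 16;                              /* q is now roughly n * 0.8 */
    q >>= 3;                                   /* roughly n * 0.1 */
    uint32_t r = n - ((q << 3) + (q << 1));    /* r = n - q * 10 */
    return q + (r > 9);                        /* bump q if it was one short */
}

/* Hypothetical name. Digits come out least significant first, so they are
   collected in a small buffer and then copied out in reverse. */
char *dec_by_div10(int input_num, char *string)
{
    char tmp[10];                              /* a 32-bit value has at most 10 digits */
    int len = 0;
    char *out = string;
    uint32_t value = (uint32_t)input_num;

    if (input_num < 0) {
        value = 0u - value;                    /* negate as unsigned */
        *out++ = '-';
    }
    do {
        uint32_t q = div10(value);
        uint32_t digit = value - ((q << 3) + (q << 1));   /* value % 10 */
        tmp[len++] = (char)('0' + digit);
        value = q;
    } while (value != 0);

    while (len > 0)
        *out++ = tmp[--len];
    *out = '\0';
    return string;
}
```

Each digit costs a fixed handful of shifts and adds, whereas the subtraction loop runs up to nine times per digit.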