Inline ASM: code improvement?

SRLM · 2013-02-17 19:36

Hi all,

I've tried my hand at writing a moderately complex inline assembly function, and the result is below. Unfortunately, it's quite a bit slower than the vanilla C function:

(All tests were built with -Os and -mcmm, and each converts the number -3258656.)
C:   23176 bytes, 14800 cycles   (-mfcache)
ASM: 23208 bytes, 34592 cycles
C:   23000 bytes, 28400 cycles   (-mno-fcache)

(The byte counts are the total program size.) So it looks like I need to get the function into the fcache to get the most benefit. I tried the "__attribute__((fcache))" option, but it doesn't like that I use r0 in the code (undefined reference to `__LMM_MVI_r0').

Most of the "odd" things in the code have to do with getting it to work with all optimization options and memory models. I think the biggest problem is that the code doesn't really have any outputs (since it's the bytes pointed to by "string" that change, but I can't indicate that...) so in the -Os level the compiler removes the inline assembly completely if I leave out the volatile.

Any suggestions on improvement for this code?

//Convert a number to its string representation (decimal base).
char * Numbers::Dec(int input_num, char * string){
	char * starting_string = string;
	int count, divisor;

	if(input_num == 0){
		string[0] = '0';
		string[1] = '\0';
		return starting_string;
	}
	
	if(input_num < 0){
		input_num = -input_num;
		string[0] = '-';
		string++;
	}
               
	__asm__ volatile(
            "brw	#start_							\n\t"

            "divisor_array: "
            "long	1000000000						\n\t"
			"long	100000000						\n\t"
			"long	10000000						\n\t"
			"long	1000000							\n\t"
			"long	100000							\n\t"
			"long	10000							\n\t"
			"long	1000							\n\t"
			"long	100								\n\t"
			"long	10								\n\t"
			"long	1								\n\t"

			"start_: "
			"mvi	r0, #divisor_array				\n\t"   
			
			"start_digit: "
            "rdlong %[divisor], r0					\n\t" /* Figure out what the first divisor is. */
            "cmp	%[input_num], %[divisor] wc		\n\t"
	"if_c	 add	r0, #4							\n\t"
	"if_c	 brw	#start_digit					\n\t"

            "main_loop: "
            "rdlong %[divisor], r0					\n\t" /* Read the current decimal divisor (10, 1000, 10000, etc.)*/
        
            "mov	%[count], #0					\n\t" /*Clear counter*/
			
			"count_digit: "
			"cmp	%[input_num], %[divisor] wc		\n\t"
	"if_nc   sub	%[input_num], %[divisor]		\n\t"
	"if_nc   add	%[count], #1					\n\t"
	"if_nc   brw    #count_digit					\n\t"
	
			"add	%[count], #48					\n\t" /*Convert to ascii representation*/
			"wrbyte %[count], %[string]				\n\t" /*Store result*/
	
			"add	%[string], #1					\n\t" /*increment to next char address*/
			
			"cmp	%[divisor], #1 wz				\n\t" /*Are we at the last digit?*/
	"if_nz	 add	r0, #4							\n\t"
	"if_nz	 brw	#main_loop						\n\t"
	
			"mov	%[count], #0					\n\t" /*Null terminate */
			"wrbyte %[count], %[string]				\n\t"
         
	: /*outputs (+inputs) */
		[string] "+r" (string)
		
	: /*inputs */
		[input_num] "r" (input_num),
		[count] "r" (count),
		[divisor] "r" (divisor)
	
	: /*clobber */
		"r0"
	);
	
	return starting_string;
}
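
A note on the operand lists: the listing declares `input_num`, `count`, and `divisor` as inputs even though the assembly writes to all three, and nothing tells the compiler that the bytes behind `string` change. In GCC's extended asm, registers that the assembly writes are normally declared as outputs (`"+r"` for read-write values, `"=&r"` for scratch values written before the inputs are finished with), and a `"memory"` clobber says that memory not named in the operands may be modified. Below is a minimal sketch of the operand declarations only. The template is deliberately left empty so that it builds on any GCC target, the C store stands in for the `wrbyte`, and `store_digit` is a hypothetical name rather than anything from this thread. `volatile` probably still belongs there, since the result the caller cares about lives in memory rather than in an output operand.

```c
#include <assert.h>

/* Hypothetical helper: shows operand and clobber declarations only.
   The Propeller instructions would go where the empty template is. */
char *store_digit(int digit, char *string)
{
    char *p = string;
    int count, divisor;            /* scratch values owned by the assembly */

    assert(digit >= 0 && digit <= 9);

    p[0] = (char)('0' + digit);    /* stand-in for the wrbyte in the asm */
    p[1] = '\0';

    __asm__ volatile(
        ""                          /* assembly body omitted in this sketch */
        : /* outputs */
          [string] "+r" (p),        /* read and advanced by the asm */
          [input_num] "+r" (digit), /* read and reduced by the asm */
          [count] "=&r" (count),    /* written before inputs are consumed */
          [divisor] "=&r" (divisor)
        : /* no pure inputs */
        : /* clobbers */
          "memory"                  /* bytes behind the pointer may change */
    );

    (void)count;                    /* scratch values are not used afterwards */
    (void)divisor;
    return string;
}
```

With the operands declared this way the compiler no longer assumes that `input_num`, `count`, and `divisor` keep their values across the assembly, and it will not cache the contents of the string buffer across it.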

ersmith · 2013-02-18 05:03

Why not just use the C version? It'll be easier to maintain, and as you've discovered the C compiler gives pretty decent performance for this sort of code. I'd reserve inline assembly for (a) small sequences that are hard to express in C (manipulating the carry bit or using Propeller specific instructions), or (b) hardware access that is extremely timing critical.

Eric
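
The C version being timed is not shown in the thread. For reference, here is a sketch of what a plain C counterpart using the same repeated-subtraction approach might look like. The name `dec_subtract` and the unsigned handling of the sign are assumptions made for illustration, not the original function.

```c
#include <assert.h>
#include <stddef.h>

/* Powers of ten, largest first; a 32-bit value has at most 10 digits. */
static const unsigned int divisors[10] = {
    1000000000u, 100000000u, 10000000u, 1000000u, 100000u,
    10000u, 1000u, 100u, 10u, 1u
};

/* Hypothetical plain-C counterpart of Numbers::Dec.
   The caller provides at least 12 bytes: sign, 10 digits, terminator. */
char *dec_subtract(int input_num, char *string)
{
    char *p = string;
    unsigned int value;
    int i = 0;

    assert(string != NULL);

    if (input_num == 0) {
        p[0] = '0';
        p[1] = '\0';
        return string;
    }

    if (input_num < 0) {
        *p++ = '-';
        value = 0u - (unsigned int)input_num;  /* magnitude, safe for INT_MIN */
    } else {
        value = (unsigned int)input_num;
    }

    /* Skip leading zeros. value >= 1 here, so this stops at divisor 1 at the latest. */
    while (divisors[i] > value)
        i++;

    /* Count how many times each power of ten fits into what remains. */
    for (; i < 10; i++) {
        char digit = '0';
        while (value >= divisors[i]) {
            value -= divisors[i];
            digit++;
        }
        *p++ = digit;
    }

    *p = '\0';
    return string;
}
```

Unlike the inline assembly version, this sketch takes the magnitude in unsigned arithmetic, so the most negative `int` does not overflow when negated.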

SRLM · 2013-02-18 07:30

Mostly, I did this as a learning exercise. Dec is a moderately complex function that is easy to test, and I'm glad that I did because I've found lots of little quirks to keep in mind.

Still, any suggestions for improving it would be great. It is one of my speed-sensitive functions that could benefit from a PASM speed improvement.

ersmith · 2013-02-20 05:00

A few ideas:

(1) You could declare the divisor_array in C and pass its address into the inline assembly as a parameter. This would make the code a bit easier to read, and would also get rid of the mvi that was causing you grief in fcache mode.

(2) You could set up the fcache yourself, and then use djnz inside the fcached code. This is moderately complicated (branches all have to be changed to be relative to the start of the fcache region). The easiest way to start would be to begin with the assembly produced by the C equivalent of the above code -- the gcc -S option will create this.

(3) The best bet for a big improvement would be a change of algorithm. As you've seen, the C compiler already produces code that's faster than the PASM you wrote. I'm sure you'll eventually get your PASM faster than the compiler's, but it probably won't be a huge change. But perhaps using some other kind of algorithm (maybe a binary search on the digits? or try multiplication by a fixed point equivalent of 0.1?) would make things much faster.

Eric
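
The "multiplication by a fixed point equivalent of 0.1" idea in (3) can be sketched in portable C. For a 32-bit unsigned `n`, `(n * 0xCCCCCCCD) >> 35` computed in 64 bits equals `n / 10` exactly, so each digit costs one multiply instead of a subtraction loop. The names `div10` and `dec_reciprocal` are hypothetical. Whether this beats the subtraction loop on the Propeller is an open question that would need measuring, because the multiply itself has to be done in software there.

```c
#include <stdint.h>

/* n / 10 through a fixed-point reciprocal: 0xCCCCCCCD is ceil(2^35 / 10). */
static uint32_t div10(uint32_t n)
{
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 35);
}

/* Hypothetical variant of Dec. The caller provides at least 12 bytes. */
char *dec_reciprocal(int input_num, char *string)
{
    char tmp[10];                 /* digits, least significant first */
    int len = 0;
    char *p = string;
    uint32_t value;

    if (input_num < 0) {
        *p++ = '-';
        value = 0u - (uint32_t)input_num;   /* magnitude, safe for INT_MIN */
    } else {
        value = (uint32_t)input_num;
    }

    /* Peel off one digit per iteration; do/while also covers value == 0. */
    do {
        uint32_t q = div10(value);
        tmp[len++] = (char)('0' + (value - q * 10u));  /* remainder digit */
        value = q;
    } while (value != 0);

    /* The digits were produced in reverse order, so copy them back to front. */
    while (len > 0)
        *p++ = tmp[--len];

    *p = '\0';
    return string;
}
```

This produces the digits from the least significant end, so it needs a small temporary buffer, whereas the table-driven version writes them in output order directly.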
