Compiling HLL to PASM possible?

SRLM · 2012-03-02 17:16

I would like to write code in a high level language (C/C++, BASIC, etc.) and have it compiled down to assembly. This would then be loaded into a cog and run as native assembly code.

The question: I would like to know which compiler(s) can compile directly to Propeller assembly.

I understand the limitations inherent in doing this (small code size, less efficient than pure assembly, etc.). I'm not looking for any interpreter-type setup or LMM, or advanced language features such as objects, functions, etc. I want to be able to compile down so that I can write my code (math routines to be performed in a fast loop) in the high-level language and not have to worry too much about the assembly code. I'd probably still do some tweaks and modifications, but it would be nice to start with the HLL.

Mike Green · 2012-03-02 17:35

PropBasic can do that. It can produce both PASM and LMM and, because it uses Spin/PASM as the intermediate source code, you can include Spin code that gets passed through to the Spin compiler. Some cogs can run straight compiled PASM, some LMM, and some Spin.

Kye · 2012-03-02 21:44

GCC? Google prop GCC.

Heater. · 2012-03-03 01:43

propgcc can compile C/C++ down to PASM that runs in a COG.

If you look in the propgcc demos you will find a FullDuplexSerial driver written in C that runs in COG and works at 115200 baud.

In some cases you don't even have to do anything special to get code to run in a COG. In the propgcc demos there is a Fast Fourier Transform whose core loop gets loaded into a COG to run at full PASM speed. Have a look for "fcache" when you find the propgcc threads.

This is actually quite amazing. I had always argued that compiling C/C++ to COG was going to be useless, as the code would be too big and there is no stack or indexed addressing etc. to help the C compiler. All in all, more work than it's worth. But the guys did it anyway and it works very well.

SRLM · 2012-03-04 19:03

So, I've been able to review the two options posted, and do a side by side comparison of their features and the code they produce. GCC seemed to be better, mostly because it can do certain optimizations.

[size=+2]PropBasic[/size]

I used PropBasic version 00.01.14 (2011-07-26) to test with. The source code that I used is a simple program to multiply some numbers together, add them, and divide with them. This goes well with the math intensive but no I/O application that I need. It's probably not a good benchmark to use if you're going to be doing complicated serial communication or anything with delays, I/O, etc.

A side note about PropBasic: the syntax is a bit quirky. It requires that your code have only one operator/statement per line. So "num = a+b+c" is out. It's odd, but easy enough to work with.

DEVICE P8X32A, XTAL1, PLL16X
FREQ 80_000_000

num1	VAR LONG
num2	VAR LONG
num3	VAR LONG
result0	VAR LONG

PROGRAM Start

Start:
  DO
    num3 = num3 * num3
  	num2 = num2 * num2
  	num1 = num1 * num1
  	result0 = num1 + num2
  	result0 = result0 + num3
  	result0 = result0 / num3
  LOOP
END

I used the following command to test with:

./PropBasic-bst.linux test.pbas

There don't seem to be any command line options to use. Anyway, that generated the following Spin file:

'{$BST PATH {REMOVED FROM POSTING}}
''  *** COMPILED WITH PropBasic VERSION 00.01.14  July 26, 2011 ***
                                                             '' This program tests the compiler for PropBasic.

                                                             '' Is result a command???



CON                                                          'DEVICE P8X32A, XTAL1, PLL16X
  _ClkMode = XTAL1 + PLL16X                                 

  _XInFreq =   5000000                                       'FREQ 80_000_000


' num1	VAR LONG                                              'num1	VAR LONG

' num2	VAR LONG                                              'num2	VAR LONG

' num3	VAR LONG                                              'num3	VAR LONG

' result0	VAR LONG                                           'result0	VAR LONG


PUB __Program                                                'PROGRAM Start
  CogInit(0, @__Init, @__DATASTART)                         
                                                            
DAT                                                         
                  org           0                           
__Init                                                      
__RAM                                                       
                  mov           dira,__InitDirA             
                  mov           outa,__InitOutA             
                  jmp           #Start                      


Start                                                        'Start:

__DO_1                                                       '  DO

                  mov           __temp1,num3                 '    num3 = num3 * num3
                  mov           __temp2,num3                
                  abs           __temp1,__temp1 WC          
                  muxc          __temp3,#1                  
                  abs           __temp2,__temp2 WC, WZ      
    IF_C          xor           __temp3,#1                  
                  mov           __temp4,#0                  
                  mov           __temp5,#32                 
                  shr           __temp1,#1 WC               
__L0001                                                     
    IF_C          add           __temp4,__temp2 WC          
                  rcr           __temp4,#1 WC               
                  rcr           __temp1,#1 WC               
                  djnz          __temp5,#__L0001            
                  test          __temp3,#1 WZ               
    IF_NZ         neg           __temp4,__temp4             
    IF_NZ         neg           __temp1,__temp1 WZ          
    IF_NZ         sub           __temp4,#1                  
                  mov           num3,__temp1                

                  mov           __temp1,num2                 '  	num2 = num2 * num2
                  mov           __temp2,num2                
                  abs           __temp1,__temp1 WC          
                  muxc          __temp3,#1                  
                  abs           __temp2,__temp2 WC, WZ      
    IF_C          xor           __temp3,#1                  
                  mov           __temp4,#0                  
                  mov           __temp5,#32                 
                  shr           __temp1,#1 WC               
__L0002                                                     
    IF_C          add           __temp4,__temp2 WC          
                  rcr           __temp4,#1 WC               
                  rcr           __temp1,#1 WC               
                  djnz          __temp5,#__L0002            
                  test          __temp3,#1 WZ               
    IF_NZ         neg           __temp4,__temp4             
    IF_NZ         neg           __temp1,__temp1 WZ          
    IF_NZ         sub           __temp4,#1                  
                  mov           num2,__temp1                

                  mov           __temp1,num1                 '  	num1 = num1 * num1
                  mov           __temp2,num1                
                  abs           __temp1,__temp1 WC          
                  muxc          __temp3,#1                  
                  abs           __temp2,__temp2 WC, WZ      
    IF_C          xor           __temp3,#1                  
                  mov           __temp4,#0                  
                  mov           __temp5,#32                 
                  shr           __temp1,#1 WC               
__L0003                                                     
    IF_C          add           __temp4,__temp2 WC          
                  rcr           __temp4,#1 WC               
                  rcr           __temp1,#1 WC               
                  djnz          __temp5,#__L0003            
                  test          __temp3,#1 WZ               
    IF_NZ         neg           __temp4,__temp4             
    IF_NZ         neg           __temp1,__temp1 WZ          
    IF_NZ         sub           __temp4,#1                  
                  mov           num1,__temp1                

                  mov           result0,num1                 '  	result0 = num1 + num2
                  adds          result0,num2                

                                                             '  	result0 = result0 + num3
                  adds          result0,num3                

                  mov           __temp1,result0              '  	result0 = result0 / num3
                  mov           __temp2,num3                
                  mov           __temp4,#0                  
                  abs           __temp1,__temp1 WC          
                  muxc          __temp5,#1                  
                  abs           __temp2,__temp2 WC, WZ      
    IF_Z          mov           __temp1,#0                  
    IF_Z          jmp           #__L0004                    
    IF_C          xor           __temp5,#1                  
                  mov           __temp3,#0                  
                  min           __temp2,#1                  
__L0005                                                     
                  add           __temp3,#1                  
                  shl           __temp2,#1 WC               
    IF_NC         jmp           #__L0005                    
                  rcr           __temp2,#1                  
__L0006                                                     
                  cmpsub        __temp1,__temp2 WC          
                  rcl           __temp4,#1                  
                  shr           __temp2,#1                  
                  djnz          __temp3,#__L0006            
                  test          __temp5,#1 WZ               
    IF_NZ         neg           __temp4,__temp4             
    IF_NZ         neg           __temp1,__temp1             
__L0004                                                     
                  mov           result0,__temp4             

                  jmp           #__DO_1                      '  LOOP
__LOOP_1                                                    

                  mov           __temp1,#0                   'END
                  waitpne       __temp1,__temp1             


'**********************************************************************
__InitDirA       LONG %00000000_00000000_00000000_00000000
__InitOutA       LONG %00000000_00000000_00000000_00000000
_FREQ            LONG 80000000


__remainder
__temp1          RES 1
__temp2          RES 1
__temp3          RES 1
__temp4          RES 1
__temp5          RES 1
__param1         RES 1
__param2         RES 1
__param3         RES 1
__param4         RES 1
__paramcnt       RES 1
num1             RES 1
num2             RES 1
num3             RES 1
result0          RES 1

FIT 492

CON
  LSBFIRST                         = 0
  MSBFIRST                         = 1
  MSBPRE                           = 0
  LSBPRE                           = 1
  MSBPOST                          = 2
  LSBPOST                          = 3

DAT
__DATASTART

I think the compiler did a good job of being faithful to the original code, but I noticed some things:
1. Every source code line is in the .spin file as a comment, which is very helpful.
2. Multiplication and division are done inline, so each additional multiplication consumes 18 longs. They do share temporary variables, however.
3. All variables are stored in cog RAM, and user defined variables use the user defined name.
4. The compiler added the remnants of some serial communication code: three longs at "__RAM" and a constants block.
5. The code is nicely formatted straight from the compiler (although it uses spaces instead of tabs).

[size=+2]Propeller GCC[/size]

I used the most recent (and only) version posted on the GCC downloads page (v0_2_3 from 2012-02-08). The source program I used was the same as the PropBasic one, modified a bit for C.

#if defined(__propeller__)
#include <propeller.h>
#define int32_t int
#define int16_t short int
#else
#endif


int main()
{

	for(;;){
		volatile int num1, num2, num3, result0;
	
		num3 = num3 * num3;
	  	num2 = num2 * num2;
	  	num1 = num1 * num1;
	  	result0 = num1 + num2;
	  	result0 = result0 + num3;
	  	result0 = result0 / num3;
	}
}

I based it off the fft_bench.c demo, which is why it has the various preprocessor statements at the beginning. Note the use of the keyword "volatile" for the int declaration: without it the compiler simply optimized away everything into a simple jump loop.

Anyway, I used the following command to generate the code:

propeller-elf-gcc -Os -S -mcog -mspin test.c

The options do the following:
-Os: optimize code for minimum size
-S: stop after compilation and write the assembly source to a file instead of an object file
-mcog: use the cog memory model (put everything in a single cog)
-mspin: generate the resulting spin file

There is also the -mfcache option, but in this case it did not generate code any differently.

And, when run it generated the following spin code:

'' spin code automatically generated by gcc
CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000
  __clkfreq = 0  '' pointer to clock frequency
 '' adjust STACKSIZE to how much stack your program needs
  STACKSIZE = 256
VAR
  long cog  '' cog that was started up by start method
  long stack[STACKSIZE]
  '' add parameters here
  long param

'' add any appropriate methods below
PUB start
  stop
  cog := cognew(@entry, @param) + 1
PUB stop
  if cog
    cogstop(cog~ - 1)

DAT
	org

entry
r0	mov	sp,PAR
r1	mov	r0,sp
r2	jmp	#_main
r3	long 0
r4	long 0
r5	long 0
r6	long 0
r7	long 0
r8	long 0
r9	long 0
r10	long 0
r11	long 0
r12	long 0
r13	long 0
r14	long 0
lr	long 0
sp	long 0
	'.text
	long
	'global variable	_main
_main
	sub	sp, #16
L_L2
	mov	r5, #4
	add	r5, sp
	mov	r6, #8
	add	r6, sp
	mov	r7, #12
	add	r7, sp
	rdlong	r0, r5
	rdlong	r1, r5
	call	#__MULSI
	wrlong	r0, r5
	rdlong	r0, r6
	rdlong	r1, r6
	call	#__MULSI
	wrlong	r0, r6
	mov	r6, #8
	add	r6, sp
	rdlong	r0, r7
	rdlong	r1, r7
	call	#__MULSI
	wrlong	r0, r7
	mov	r7, #12
	add	r7, sp
	rdlong	r7, r7
	rdlong	r6, r6
	add	r7, r6
	wrlong	r7, sp
	rdlong	r7, sp
	rdlong	r6, r5
	add	r7, r6
	wrlong	r7, sp
	rdlong	r0, sp
	rdlong	r1, r5
	call	#__DIVSI
	wrlong	r0, sp
	jmp	#L_L2
__MASK_0000FFFF	long	$0000FFFF
__TMP0	long	0
__MULSI
	mov	__TMP0,r0
	min	__TMP0,r1
	max	r1,r0
	mov	r0,#0
__MULSI_loop
	shr	r1,#1 wz,wc
  IF_C	add	r0,__TMP0
	add	__TMP0,__TMP0
  IF_NZ	jmp	#__MULSI_loop
__MULSI_ret	ret
__MASK_00FF00FF	long	$00FF00FF
__MASK_0F0F0F0F	long	$0F0F0F0F
__MASK_33333333	long	$33333333
__MASK_55555555	long	$55555555
__CLZSI	rev	r0,#0
__CTZSI	neg	__TMP0,r0
	and	__TMP0,r0 wz
	mov	r0,#0
	IF_Z	mov	r0,#1
	test	__TMP0, __MASK_0000FFFF wz
	IF_Z	add	r0,#16
	test	__TMP0, __MASK_00FF00FF wz
	IF_Z	add	r0,#8
	test	__TMP0, __MASK_0F0F0F0F wz
	IF_Z	add	r0,#4
	test	__TMP0, __MASK_33333333 wz
	IF_Z	add	r0,#2
	test	__TMP0, __MASK_55555555 wz
	IF_Z	add	r0,#1
__CLZSI_ret ret
__DIVR	long	0
__DIVCNT	long	0
__UDIVSI
	mov	__DIVR,r0
	call	#__CLZSI
	neg	__DIVCNT,r0
	mov	r0,r1
	call	#__CLZSI
	add	__DIVCNT,r0
	mov	r0,#0
	cmps	__DIVCNT,#0 wz,wc
  IF_C	jmp	#__UDIVSI_done
	shl	r1,__DIVCNT
	add	__DIVCNT,#1
__UDIVSI_loop
	cmpsub	__DIVR,r1 wz,wc
	addx	r0,r0
	shr	r1,#1
	djnz	__DIVCNT,#__UDIVSI_loop
__UDIVSI_done
	mov	r1,__DIVR
__UDIVSI_ret	ret
__DIVSGN	long	0
__DIVSI	mov	__DIVSGN,r0
	xor	__DIVSGN,r1
	abs	r0,r0 wc
	muxc	__DIVSGN,#1 wc
	abs	r1,r1
	call	#__UDIVSI
	cmps	__DIVSGN,#0 wz,wc
	IF_B	neg	r0,r0
	test	__DIVSGN,#1 wz
	IF_NZ	neg	r1,r1
__DIVSI_ret	ret

Some things that I have noticed about this code:
1. The output lacks suitable comments, and the resultant code is rather difficult to understand. It doesn't use original variable names.
2. It creates a multiplication subroutine. This is slightly less efficient in execution time than putting it inline, but it is vastly more efficient on space.
3. The code stores variables in the hub, not the cog as expected.
4. The -Os option appears to be needed: with no optimization the output code is 192 lines. Interestingly, -O2 gives the same output as -Os.
5. The multiply routine ("__MULSI") is very compact (9 longs). Its run time looks bounded: the loop is only 4 instructions and runs at most 32 times, so at most it will take 32*4 instructions to complete (each PASM instruction normally takes 4 clock cycles). I'm not sure how it works yet though (especially with a sign).
6. The divide routine is a bit more expensive: 51 longs. In its favor, though, the inner loop ("__UDIVSI") is as compact as the multiply loop.
7. GCC isn't very memory-efficient by default: it creates a 256 long hub stack and a 16 long cog stack frame. This could probably be cleaned up manually.
8. It's missing a "FIT" statement at the end.
9. The generated code isn't very well formatted.
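
On points 5 and 6, one way to read the "__MULSI" and "__DIVSI"/"__UDIVSI" listings is as the rough C model below. It is only a sketch based on the PASM shown above, and the function names (mulsi_sketch, clz32_sketch, udivsi_sketch, divsi_sketch) are made up for illustration; they are not part of PropGCC. The key assumptions are that PASM "min" leaves the larger unsigned value in the destination and "max" leaves the smaller one, so the loop only walks the bits of the smaller operand, and that the zero-divisor check is a choice made for the sketch rather than something the listing does.

```c
#include <stdint.h>

/* Hypothetical C model of __MULSI: shift-and-add on unsigned values. */
static uint32_t mulsi_sketch(uint32_t a, uint32_t b)
{
    uint32_t addend = (a > b) ? a : b;  /* "min __TMP0,r1" keeps the larger value */
    uint32_t bits   = (a < b) ? a : b;  /* "max r1,r0" keeps the smaller value    */
    uint32_t product = 0;

    do {
        uint32_t carry = bits & 1u;     /* "shr r1,#1 wz,wc": low bit goes to C   */
        bits >>= 1;
        if (carry)
            product += addend;          /* "IF_C add r0,__TMP0"                   */
        addend += addend;               /* "add __TMP0,__TMP0"                    */
    } while (bits != 0);                /* "IF_NZ jmp #__MULSI_loop"              */

    return product;                     /* low 32 bits of the full product        */
}

/* Count leading zeros; returns 32 for an input of zero. */
static int clz32_sketch(uint32_t x)
{
    int n = 0;
    while (n < 32 && (x & 0x80000000u) == 0) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical C model of __UDIVSI: align the divisor, then compare-subtract. */
static uint32_t udivsi_sketch(uint32_t dividend, uint32_t divisor,
                              uint32_t *remainder)
{
    uint32_t quotient = 0;
    int shift = clz32_sketch(divisor) - clz32_sketch(dividend);

    /* A zero divisor is rejected here; that guard is an assumption of the sketch. */
    if (divisor != 0 && shift >= 0) {
        divisor <<= shift;
        for (int count = shift + 1; count > 0; count--) {
            quotient <<= 1;
            if (dividend >= divisor) {  /* "cmpsub __DIVR,r1" + "addx r0,r0"     */
                dividend -= divisor;
                quotient |= 1u;
            }
            divisor >>= 1;
        }
    }
    *remainder = dividend;
    return quotient;
}

/* Hypothetical C model of __DIVSI: divide magnitudes, then restore the signs. */
static int32_t divsi_sketch(int32_t a, int32_t b, int32_t *remainder)
{
    uint32_t ua = (a < 0) ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = (b < 0) ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t ur;
    uint32_t uq = udivsi_sketch(ua, ub, &ur);

    if ((a < 0) != (b < 0))
        uq = 0u - uq;                   /* quotient is negative if signs differ   */
    if (a < 0)
        ur = 0u - ur;                   /* remainder follows the dividend's sign  */

    *remainder = (int32_t)ur;           /* assumes two's complement conversion    */
    return (int32_t)uq;
}
```

Under this reading, the multiply needs no special sign handling: the low 32 bits of an unsigned product are the same as those of the signed product in two's complement, so passing negative values through unchanged still gives the right answer. The cost is that a negative operand looks like a large unsigned number, so the loop may run the full 32 iterations.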

Next, I tried a slightly modified source:

#if defined(__propeller__)
#include <propeller.h>
#define int32_t int
#define int16_t short int
#else
#endif


int main()
{

	for(;;){
		int num1, num2, num3;
		volatile int result0;
	
		num3 = num3 * num3;
	  	num2 = num2 * num2;
	  	num1 = num1 * num1;
	  	result0 = num1 + num2;
	  	result0 = result0 + num3;
	  	result0 = result0 / num3;
	}
}

Note here that the only variable marked volatile is result0. I compiled with

propeller-elf-gcc -Os -S -mcog -mspin test.c

And got the following output:

'' spin code automatically generated by gcc
CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000
  __clkfreq = 0  '' pointer to clock frequency
 '' adjust STACKSIZE to how much stack your program needs
  STACKSIZE = 256
VAR
  long cog  '' cog that was started up by start method
  long stack[STACKSIZE]
  '' add parameters here
  long param

'' add any appropriate methods below
PUB start
  stop
  cog := cognew(@entry, @param) + 1
PUB stop
  if cog
    cogstop(cog~ - 1)

DAT
	org

entry
r0	mov	sp,PAR
r1	mov	r0,sp
r2	jmp	#_main
r3	long 0
r4	long 0
r5	long 0
r6	long 0
r7	long 0
r8	long 0
r9	long 0
r10	long 0
r11	long 0
r12	long 0
r13	long 0
r14	long 0
lr	long 0
sp	long 0
	'.text
	long
	'global variable	_main
_main
	sub	sp, #4
L_L2
	mov	r1, r7
	mov	r0, r7
	call	#__MULSI
	mov	r7, r0
	mov	r1, r4
	mov	r0, r4
	call	#__MULSI
	mov	r1, r5
	mov	r4, r0
	mov	r0, r5
	call	#__MULSI
	mov	r6, r0
	add	r6, r4
	mov	r5, r0
	mov	r1, r7
	wrlong	r6, sp
	rdlong	r6, sp
	add	r6, r7
	wrlong	r6, sp
	rdlong	r0, sp
	call	#__DIVSI
	wrlong	r0, sp
	jmp	#L_L2
__MASK_0000FFFF	long	$0000FFFF
__TMP0	long	0
__MULSI
	mov	__TMP0,r0
	min	__TMP0,r1
	max	r1,r0
	mov	r0,#0
__MULSI_loop
	shr	r1,#1 wz,wc
  IF_C	add	r0,__TMP0
	add	__TMP0,__TMP0
  IF_NZ	jmp	#__MULSI_loop
__MULSI_ret	ret
__MASK_00FF00FF	long	$00FF00FF
__MASK_0F0F0F0F	long	$0F0F0F0F
__MASK_33333333	long	$33333333
__MASK_55555555	long	$55555555
__CLZSI	rev	r0,#0
__CTZSI	neg	__TMP0,r0
	and	__TMP0,r0 wz
	mov	r0,#0
	IF_Z	mov	r0,#1
	test	__TMP0, __MASK_0000FFFF wz
	IF_Z	add	r0,#16
	test	__TMP0, __MASK_00FF00FF wz
	IF_Z	add	r0,#8
	test	__TMP0, __MASK_0F0F0F0F wz
	IF_Z	add	r0,#4
	test	__TMP0, __MASK_33333333 wz
	IF_Z	add	r0,#2
	test	__TMP0, __MASK_55555555 wz
	IF_Z	add	r0,#1
__CLZSI_ret ret
__DIVR	long	0
__DIVCNT	long	0
__UDIVSI
	mov	__DIVR,r0
	call	#__CLZSI
	neg	__DIVCNT,r0
	mov	r0,r1
	call	#__CLZSI
	add	__DIVCNT,r0
	mov	r0,#0
	cmps	__DIVCNT,#0 wz,wc
  IF_C	jmp	#__UDIVSI_done
	shl	r1,__DIVCNT
	add	__DIVCNT,#1
__UDIVSI_loop
	cmpsub	__DIVR,r1 wz,wc
	addx	r0,r0
	shr	r1,#1
	djnz	__DIVCNT,#__UDIVSI_loop
__UDIVSI_done
	mov	r1,__DIVR
__UDIVSI_ret	ret
__DIVSGN	long	0
__DIVSI	mov	__DIVSGN,r0
	xor	__DIVSGN,r1
	abs	r0,r0 wc
	muxc	__DIVSGN,#1 wc
	abs	r1,r1
	call	#__UDIVSI
	cmps	__DIVSGN,#0 wz,wc
	IF_B	neg	r0,r0
	test	__DIVSGN,#1 wz
	IF_NZ	neg	r1,r1
__DIVSI_ret	ret

This is much better: the output no longer has a bunch of RDLONGs and WRLONGs, and is hence much more efficient. Previously, the main loop was 34 lines (many of which are hub accesses), and now it is 22 lines. The other comments still apply though. Also as before, -mfcache did not change the output code.

[size=+2]Conclusion[/size]

I think I will look into Propeller GCC more. It seems to do a good job for compiling down to efficient Propeller assembly, and it isn't too hard to read the output. I hope that it will be improved over time as well. The PropBasic compiler has a more understandable output, but the inefficient use of cog RAM and the lack of updates (no changes in 8 months) has me worried. Propeller GCC seems to fit my requirements.

Circuitsoft · 2012-03-06 10:32

I don't think the PropGCC output is designed to be easily human-readable. It is intended to be quite efficient, and it can reorder operations if doing so makes the code more efficient. By that token, you can have one line of source code split across several places in the generated assembly. Try compiling:

#include <propeller.h>
#include <stdint.h>

void main()
{
    for(;;)
    {
        int num1, num2, num3;
        volatile int result1, result2, result3;

        result1 = (num1*num1 + num2*num2 + num3*num3)/num1;
        result2 = (num1*num1 + num2*num2 + num3*num3)/num2;
        result3 = (num1*num1 + num2*num2 + num3*num3)/num3;
    }
}

As written, it's quite inefficient, but I'd bet that GCC will make a temporary variable for num1*num1+num2*num2+num3*num3.
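
Written out by hand, the shared temporary would look something like the sketch below. The function name three_results and its pointer parameters are invented for the example, and the inputs are assumed to be non-zero and small enough that the squares do not overflow. Whether GCC actually emits this shape depends on the optimization level.

```c
#include <stdint.h>

/* Hypothetical helper: the sum of squares is computed once and reused,
   which is the temporary the optimizer is expected to introduce. */
static void three_results(int32_t num1, int32_t num2, int32_t num3,
                          int32_t *result1, int32_t *result2, int32_t *result3)
{
    /* Assumes non-zero inputs whose squares fit in 32 bits. */
    int32_t sum = num1 * num1 + num2 * num2 + num3 * num3;

    *result1 = sum / num1;
    *result2 = sum / num2;
    *result3 = sum / num3;
}
```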

Dave Hein · 2012-03-06 14:39

Here's the generated assembly code with -mcog -O2

	.text
	.balign	4
	.global	_main
_main
	sub	sp, #12
	mov	r5, #0
	mov	r1, r5
	mov	r0, r5
	mov	r6, #0
	mov	r4, #0
	call	#__MULSI
	mov	r7, r0
	add	r7, r0
	add	r7, r0
	mov	r1, r6
	mov	r0, r7
	call	#__DIVSI
	mov	r1, r6
	mov	r5, r0
	mov	r0, r7
	call	#__DIVSI
	mov	r6, r0
	mov	r1, r4
	mov	r0, r7
	call	#__DIVSI
	mov	r7, r0
.L2
	mov	r4, #8
	add	r4, sp
	wrlong	r5, r4
	mov	r4, #4
	add	r4, sp
	wrlong	r6, r4
	wrlong	r7, sp
	jmp	#.L2

EDIT: Of course, if you also specify -Wall you will get warnings that num1, num2 and num3 are uninitialized and result1, result2 and result3 are unused.
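
Both warnings come from the test program reading locals that were never set and writing results nothing reads. A minimal sketch of one way around that, while still keeping the optimizer from deleting the loop, is to take the inputs from volatile storage and write the result back to volatile storage, keeping the intermediates as ordinary locals. The mailbox_t type, its field names, and math_step are assumptions made for this example, not anything PropGCC provides, and the inputs are assumed small enough that the squares do not overflow.

```c
#include <stdint.h>

/* Hypothetical mailbox that some other code (for example another cog)
   fills in and reads back; the layout is an assumption of this sketch. */
typedef struct {
    volatile int32_t num1, num2, num3;  /* inputs written elsewhere */
    volatile int32_t result0;           /* output read elsewhere    */
} mailbox_t;

/* One pass of the test loop from earlier in the thread. */
static void math_step(mailbox_t *box)
{
    /* Each volatile input is read exactly once; the copies are plain
       locals that the compiler is free to keep in registers. */
    int32_t n1 = box->num1;
    int32_t n2 = box->num2;
    int32_t n3 = box->num3;

    n3 = n3 * n3;
    n2 = n2 * n2;
    n1 = n1 * n1;

    int32_t result = n1 + n2 + n3;

    /* Guard against dividing by zero; the single volatile write keeps
       the computation from being optimized away. */
    box->result0 = (n3 != 0) ? result / n3 : 0;
}
```

On the Propeller, main would call math_step in its for(;;) loop with a pointer to the shared mailbox.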

SRLM · 2012-03-07 09:43

@Circuitsoft

Even if it reorders operations and breaks up lines of code the compiler still has to make a syntax tree, which has enough information to output useful comments. Some comments are better than none, especially when it changes the logic and order of the operations.

@Dave Hein

If you include the flags "-mspin" and "-S" it should produce the complete assembly code (i.e., code with the support assembly included instead of just the business logic that comes directly from the source).

Dave Hein · 2012-03-07 10:40

SRLM wrote:

If you include the flags "-mspin" and "-S" it should make the complete assembly code

Thanks for the tip. I wasn't aware of the -mspin flag.

EDIT: Oh, I see you mentioned it in your previous post on March 4, and you also posted the output from PropGCC. Sorry I missed that. I usually skim through any post that is more than a dozen lines or so. I should have read it in more detail.
