Compiling HLL to PASM possible?
I would like to write code in a high level language (C/C++, BASIC, etc.) and have it compiled down to assembly. This would then be loaded into a cog and run as native assembly code.
The question: I would like to know which compiler(s) can compile directly to Propeller assembly.
I understand the limitations inherent in doing this (small code size, less efficient than pure assembly, etc.). I'm not looking for any interpreter-type setup or LMM, or advanced language features such as objects, functions, etc. I want to write my code (math routines to be performed in a fast loop) in the high level language, compile it down, and not have to worry too much about the assembly code. I'd probably still do some tweaks and modifications, but it would be nice to start with the HLL.

Comments
If you look in the propgcc demos you will find a FullDuplexSerial driver written in C that runs in COG and works at 115200 baud.
In some cases you don't even have to do anything special to get code to run in a cog. In the propgcc demos there is a Fast Fourier Transform whose core loop gets loaded into a cog to run at full PASM speed. Search for "fcache" when you find the propgcc threads.
This is actually quite amazing. I had always argued that compiling C/C++ to COG was going to be useless, as the code would be too big and there is no stack or indexed addressing etc. to help the C compiler. All in all, more work than it's worth. But the guys did it anyway, and it works very well.
[size=+2]PropBasic[/size]
I used PropBasic version 00.01.14 (2011-07-26) to test with. The source code that I used is a simple program that multiplies some numbers together, adds them, and divides with them. This matches the math-intensive, I/O-free application that I need. It's probably not a good benchmark if you're going to be doing complicated serial communication or anything with delays, I/O, etc.
A side note about PropBasic: the syntax is a bit quirky. It requires that your code have only one operator/statement per line. So "num = a+b+c" is out. It's odd, but easy enough to work with.
[code]
DEVICE P8X32A, XTAL1, PLL16X
FREQ 80_000_000

num1 VAR LONG
num2 VAR LONG
num3 VAR LONG
result0 VAR LONG

PROGRAM Start

Start:
  DO
    num3 = num3 * num3
    num2 = num2 * num2
    num1 = num1 * num1
    result0 = num1 + num2
    result0 = result0 + num3
    result0 = result0 / num3
  LOOP
END
[/code]
I used the following command to test with:
There doesn't seem to be any command line options to use. Anyway, that generated the following Spin file:
[code]
'{$BST PATH {REMOVED FROM POSTING}}
'' *** COMPILED WITH PropBasic VERSION 00.01.14 July 26, 2011 ***
'' This program tests the compiler for PropBasic.
'' Is result a command???

CON                                  'DEVICE P8X32A, XTAL1, PLL16X
  _ClkMode = XTAL1 + PLL16X
  _XInFreq = 5000000                 'FREQ 80_000_000

' num1 VAR LONG                      'num1 VAR LONG
' num2 VAR LONG                      'num2 VAR LONG
' num3 VAR LONG                      'num3 VAR LONG
' result0 VAR LONG                   'result0 VAR LONG

PUB __Program                        'PROGRAM Start
  CogInit(0, @__Init, @__DATASTART)

DAT
        org 0
__Init
__RAM
        mov dira,__InitDirA
        mov outa,__InitOutA
        jmp #Start

Start                                'Start:
__DO_1                               ' DO
        mov __temp1,num3             ' num3 = num3 * num3
        mov __temp2,num3
        abs __temp1,__temp1 WC
        muxc __temp3,#1
        abs __temp2,__temp2 WC, WZ
  IF_C  xor __temp3,#1
        mov __temp4,#0
        mov __temp5,#32
        shr __temp1,#1 WC
__L0001
  IF_C  add __temp4,__temp2 WC
        rcr __temp4,#1 WC
        rcr __temp1,#1 WC
        djnz __temp5,#__L0001
        test __temp3,#1 WZ
  IF_NZ neg __temp4,__temp4
  IF_NZ neg __temp1,__temp1 WZ
  IF_NZ sub __temp4,#1
        mov num3,__temp1

        mov __temp1,num2             ' num2 = num2 * num2
        mov __temp2,num2
        abs __temp1,__temp1 WC
        muxc __temp3,#1
        abs __temp2,__temp2 WC, WZ
  IF_C  xor __temp3,#1
        mov __temp4,#0
        mov __temp5,#32
        shr __temp1,#1 WC
__L0002
  IF_C  add __temp4,__temp2 WC
        rcr __temp4,#1 WC
        rcr __temp1,#1 WC
        djnz __temp5,#__L0002
        test __temp3,#1 WZ
  IF_NZ neg __temp4,__temp4
  IF_NZ neg __temp1,__temp1 WZ
  IF_NZ sub __temp4,#1
        mov num2,__temp1

        mov __temp1,num1             ' num1 = num1 * num1
        mov __temp2,num1
        abs __temp1,__temp1 WC
        muxc __temp3,#1
        abs __temp2,__temp2 WC, WZ
  IF_C  xor __temp3,#1
        mov __temp4,#0
        mov __temp5,#32
        shr __temp1,#1 WC
__L0003
  IF_C  add __temp4,__temp2 WC
        rcr __temp4,#1 WC
        rcr __temp1,#1 WC
        djnz __temp5,#__L0003
        test __temp3,#1 WZ
  IF_NZ neg __temp4,__temp4
  IF_NZ neg __temp1,__temp1 WZ
  IF_NZ sub __temp4,#1
        mov num1,__temp1

        mov result0,num1             ' result0 = num1 + num2
        adds result0,num2
                                     ' result0 = result0 + num3
        adds result0,num3

        mov __temp1,result0          ' result0 = result0 / num3
        mov __temp2,num3
        mov __temp4,#0
        abs __temp1,__temp1 WC
        muxc __temp5,#1
        abs __temp2,__temp2 WC, WZ
  IF_Z  mov __temp1,#0
  IF_Z  jmp #__L0004
  IF_C  xor __temp5,#1
        mov __temp3,#0
        min __temp2,#1
__L0005
        add __temp3,#1
        shl __temp2,#1 WC
  IF_NC jmp #__L0005
        rcr __temp2,#1
__L0006
        cmpsub __temp1,__temp2 WC
        rcl __temp4,#1
        shr __temp2,#1
        djnz __temp3,#__L0006
        test __temp5,#1 WZ
  IF_NZ neg __temp4,__temp4
  IF_NZ neg __temp1,__temp1
__L0004
        mov result0,__temp4
        jmp #__DO_1                  ' LOOP
__LOOP_1
        mov __temp1,#0               'END
        waitpne __temp1,__temp1

'**********************************************************************
__InitDirA      LONG %00000000_00000000_00000000_00000000
__InitOutA      LONG %00000000_00000000_00000000_00000000
_FREQ           LONG 80000000

__remainder
__temp1         RES 1
__temp2         RES 1
__temp3         RES 1
__temp4         RES 1
__temp5         RES 1
__param1        RES 1
__param2        RES 1
__param3        RES 1
__param4        RES 1
__paramcnt      RES 1
num1            RES 1
num2            RES 1
num3            RES 1
result0         RES 1

        FIT 492

CON
  LSBFIRST = 0
  MSBFIRST = 1
  MSBPRE   = 0
  LSBPRE   = 1
  MSBPOST  = 2
  LSBPOST  = 3

DAT
__DATASTART
[/code]
I think the compiler did a good job of being faithful to the original code, but I noticed some things:
1. Every source code line is in the .spin file as a comment, which is very helpful.
2. Multiplication and division are done inline, so each additional multiplication consumes 18 longs. The inline copies do share temporary variables, however.
3. All variables are stored in cog RAM, and user defined variables use the user defined name.
4. The compiler added the remnants of some serial communication code: three longs at "__RAM" and a constants block.
5. The code is nicely formatted straight from the compiler (although it uses spaces instead of tabs).
[size=+2]Propeller GCC[/size]
I used the most recent (and only) version posted on the GCC downloads page (v0_2_3 from 2012-02-08). The source program was the same as the PropBasic one, modified a bit for C.
[code]
#if defined(__propeller__)
#include <propeller.h>
#define int32_t int
#define int16_t short int
#else
#endif

int main()
{
    for(;;){
        volatile int num1, num2, num3, result0;
        num3 = num3 * num3;
        num2 = num2 * num2;
        num1 = num1 * num1;
        result0 = num1 + num2;
        result0 = result0 + num3;
        result0 = result0 / num3;
    }
}
[/code]
I based it on the fft_bench.c demo, which is why it has the various preprocessor statements at the beginning. Note the use of the keyword "volatile" for the int declaration: without it the compiler simply optimized everything away into a bare jump loop.
Anyway, I used the following command to generate the code:
The options do the following:
-Os: optimize code for minimum size
-S: stop after compiling and write assembly source instead of an object file
-mcog: use the cog memory model (put everything in a single cog)
-mspin: generate the resulting spin file
There is also the -mfcache option, but in this case it did not generate code any differently.
And, when run it generated the following spin code:
[code]
'' spin code automatically generated by gcc
CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000
  __clkfreq = 0 '' pointer to clock frequency
  '' adjust STACKSIZE to how much stack your program needs
  STACKSIZE = 256
VAR
  long cog '' cog that was started up by start method
  long stack[STACKSIZE]
  '' add parameters here
  long param
'' add any appropriate methods below
PUB start
  stop
  cog := cognew(@entry, @param) + 1
PUB stop
  if cog
    cogstop(cog~ - 1)
DAT
        org
entry
r0      mov sp,PAR
r1      mov r0,sp
r2      jmp #_main
r3      long 0
r4      long 0
r5      long 0
r6      long 0
r7      long 0
r8      long 0
r9      long 0
r10     long 0
r11     long 0
r12     long 0
r13     long 0
r14     long 0
lr      long 0
sp      long 0
        '.text
        long
        'global variable _main
_main
        sub sp, #16
L_L2
        mov r5, #4
        add r5, sp
        mov r6, #8
        add r6, sp
        mov r7, #12
        add r7, sp
        rdlong r0, r5
        rdlong r1, r5
        call #__MULSI
        wrlong r0, r5
        rdlong r0, r6
        rdlong r1, r6
        call #__MULSI
        wrlong r0, r6
        mov r6, #8
        add r6, sp
        rdlong r0, r7
        rdlong r1, r7
        call #__MULSI
        wrlong r0, r7
        mov r7, #12
        add r7, sp
        rdlong r7, r7
        rdlong r6, r6
        add r7, r6
        wrlong r7, sp
        rdlong r7, sp
        rdlong r6, r5
        add r7, r6
        wrlong r7, sp
        rdlong r0, sp
        rdlong r1, r5
        call #__DIVSI
        wrlong r0, sp
        jmp #L_L2

__MASK_0000FFFF long $0000FFFF
__TMP0  long 0
__MULSI
        mov __TMP0,r0
        min __TMP0,r1
        max r1,r0
        mov r0,#0
__MULSI_loop
        shr r1,#1 wz,wc
  IF_C  add r0,__TMP0
        add __TMP0,__TMP0
  IF_NZ jmp #__MULSI_loop
__MULSI_ret ret

__MASK_00FF00FF long $00FF00FF
__MASK_0F0F0F0F long $0F0F0F0F
__MASK_33333333 long $33333333
__MASK_55555555 long $55555555
__CLZSI rev r0,#0
__CTZSI neg __TMP0,r0
        and __TMP0,r0 wz
        mov r0,#0
  IF_Z  mov r0,#1
        test __TMP0, __MASK_0000FFFF wz
  IF_Z  add r0,#16
        test __TMP0, __MASK_00FF00FF wz
  IF_Z  add r0,#8
        test __TMP0, __MASK_0F0F0F0F wz
  IF_Z  add r0,#4
        test __TMP0, __MASK_33333333 wz
  IF_Z  add r0,#2
        test __TMP0, __MASK_55555555 wz
  IF_Z  add r0,#1
__CLZSI_ret ret

__DIVR  long 0
__DIVCNT long 0
__UDIVSI
        mov __DIVR,r0
        call #__CLZSI
        neg __DIVCNT,r0
        mov r0,r1
        call #__CLZSI
        add __DIVCNT,r0
        mov r0,#0
        cmps __DIVCNT,#0 wz,wc
  IF_C  jmp #__UDIVSI_done
        shl r1,__DIVCNT
        add __DIVCNT,#1
__UDIVSI_loop
        cmpsub __DIVR,r1 wz,wc
        addx r0,r0
        shr r1,#1
        djnz __DIVCNT,#__UDIVSI_loop
__UDIVSI_done
        mov r1,__DIVR
__UDIVSI_ret ret

__DIVSGN long 0
__DIVSI
        mov __DIVSGN,r0
        xor __DIVSGN,r1
        abs r0,r0 wc
        muxc __DIVSGN,#1 wc
        abs r1,r1
        call #__UDIVSI
        cmps __DIVSGN,#0 wz,wc
  IF_B  neg r0,r0
        test __DIVSGN,#1 wz
  IF_NZ neg r1,r1
__DIVSI_ret ret
[/code]
Some things that I have noticed about this code:
1. The output lacks suitable comments, and the resultant code is rather difficult to understand. It doesn't use original variable names.
2. It creates a multiplication subroutine. This is slightly less efficient in execution time than putting it inline, but it is vastly more efficient on space.
3. The code stores variables in the hub, not the cog as expected.
4. The -Os option appears to be needed: with no optimization the output code is 192 lines. Interestingly, -O2 gives the same output as -Os.
5. The multiply routine ("__MULSI") is very compact (9 longs). It is not constant time, though: the loop runs once per significant bit of the smaller operand (compared as unsigned), so at most 32 passes. The loop body is only 4 instructions, so the worst case is about 32*4 = 128 instructions, or roughly 512 clocks at 4 clocks per instruction. I'm not yet sure how it handles a sign.
6. The divide routine is a bit more expensive: 51 longs. In its favor, the loop ("__UDIVSI") is as efficient as the multiply loop.
7. GCC isn't very efficient with memory by default: it creates a 256 long hub stack and a 16 long cog stack frame. This could probably be cleaned up manually.
8. It's missing a "FIT" statement at the end.
9. The generated code isn't very well formatted.
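On item 5, a host-side C model may make the "__MULSI" loop easier to follow. This is only a sketch and not the compiler's own source: the name mulsi_model and the iteration counter are made up for illustration, and it relies on the PASM rule that "min D,S" leaves the larger unsigned value in D while "max D,S" leaves the smaller.

```c
#include <stdint.h>

/* Hypothetical host-side model of the __MULSI loop in the listing above.
 * Returns the low 32 bits of a*b and, optionally, the number of loop passes. */
static uint32_t mulsi_model(uint32_t a, uint32_t b, unsigned *iterations)
{
    uint32_t addend  = (a > b) ? a : b;  /* mov __TMP0,r0 / min __TMP0,r1 */
    uint32_t shifter = (b < a) ? b : a;  /* max r1,r0: the smaller operand */
    uint32_t product = 0;                /* mov r0,#0 */
    unsigned passes  = 0;

    do {
        uint32_t carry = shifter & 1u;   /* shr r1,#1 wz,wc (C = bit shifted out) */
        shifter >>= 1;
        if (carry)
            product += addend;           /* IF_C add r0,__TMP0 */
        addend += addend;                /* add __TMP0,__TMP0 (doubles each pass) */
        passes++;
    } while (shifter != 0);              /* IF_NZ jmp #__MULSI_loop */

    if (iterations)
        *iterations = passes;
    return product;
}
```

Calling mulsi_model((uint32_t)-3, 5, &n) takes three passes and returns the bit pattern of -15. No sign handling is needed because the low 32 bits of a two's-complement product are the same for signed and unsigned operands. The cost is that a negative value looks like a large unsigned one, so two negative operands hit the 32-pass worst case.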
Next, I tried a slightly modified source:
[code]
#if defined(__propeller__)
#include <propeller.h>
#define int32_t int
#define int16_t short int
#else
#endif

int main()
{
    for(;;){
        int num1, num2, num3;
        volatile int result0;
        num3 = num3 * num3;
        num2 = num2 * num2;
        num1 = num1 * num1;
        result0 = num1 + num2;
        result0 = result0 + num3;
        result0 = result0 / num3;
    }
}
[/code]
Note here that the only variable marked volatile is result0. I compiled with
And got the following output:
[code]
'' spin code automatically generated by gcc
CON
  _clkmode = xtal1+pll16x
  _clkfreq = 80_000_000
  __clkfreq = 0 '' pointer to clock frequency
  '' adjust STACKSIZE to how much stack your program needs
  STACKSIZE = 256
VAR
  long cog '' cog that was started up by start method
  long stack[STACKSIZE]
  '' add parameters here
  long param
'' add any appropriate methods below
PUB start
  stop
  cog := cognew(@entry, @param) + 1
PUB stop
  if cog
    cogstop(cog~ - 1)
DAT
        org
entry
r0      mov sp,PAR
r1      mov r0,sp
r2      jmp #_main
r3      long 0
r4      long 0
r5      long 0
r6      long 0
r7      long 0
r8      long 0
r9      long 0
r10     long 0
r11     long 0
r12     long 0
r13     long 0
r14     long 0
lr      long 0
sp      long 0
        '.text
        long
        'global variable _main
_main
        sub sp, #4
L_L2
        mov r1, r7
        mov r0, r7
        call #__MULSI
        mov r7, r0
        mov r1, r4
        mov r0, r4
        call #__MULSI
        mov r1, r5
        mov r4, r0
        mov r0, r5
        call #__MULSI
        mov r6, r0
        add r6, r4
        mov r5, r0
        mov r1, r7
        wrlong r6, sp
        rdlong r6, sp
        add r6, r7
        wrlong r6, sp
        rdlong r0, sp
        call #__DIVSI
        wrlong r0, sp
        jmp #L_L2

__MASK_0000FFFF long $0000FFFF
__TMP0  long 0
__MULSI
        mov __TMP0,r0
        min __TMP0,r1
        max r1,r0
        mov r0,#0
__MULSI_loop
        shr r1,#1 wz,wc
  IF_C  add r0,__TMP0
        add __TMP0,__TMP0
  IF_NZ jmp #__MULSI_loop
__MULSI_ret ret

__MASK_00FF00FF long $00FF00FF
__MASK_0F0F0F0F long $0F0F0F0F
__MASK_33333333 long $33333333
__MASK_55555555 long $55555555
__CLZSI rev r0,#0
__CTZSI neg __TMP0,r0
        and __TMP0,r0 wz
        mov r0,#0
  IF_Z  mov r0,#1
        test __TMP0, __MASK_0000FFFF wz
  IF_Z  add r0,#16
        test __TMP0, __MASK_00FF00FF wz
  IF_Z  add r0,#8
        test __TMP0, __MASK_0F0F0F0F wz
  IF_Z  add r0,#4
        test __TMP0, __MASK_33333333 wz
  IF_Z  add r0,#2
        test __TMP0, __MASK_55555555 wz
  IF_Z  add r0,#1
__CLZSI_ret ret

__DIVR  long 0
__DIVCNT long 0
__UDIVSI
        mov __DIVR,r0
        call #__CLZSI
        neg __DIVCNT,r0
        mov r0,r1
        call #__CLZSI
        add __DIVCNT,r0
        mov r0,#0
        cmps __DIVCNT,#0 wz,wc
  IF_C  jmp #__UDIVSI_done
        shl r1,__DIVCNT
        add __DIVCNT,#1
__UDIVSI_loop
        cmpsub __DIVR,r1 wz,wc
        addx r0,r0
        shr r1,#1
        djnz __DIVCNT,#__UDIVSI_loop
__UDIVSI_done
        mov r1,__DIVR
__UDIVSI_ret ret

__DIVSGN long 0
__DIVSI
        mov __DIVSGN,r0
        xor __DIVSGN,r1
        abs r0,r0 wc
        muxc __DIVSGN,#1 wc
        abs r1,r1
        call #__UDIVSI
        cmps __DIVSGN,#0 wz,wc
  IF_B  neg r0,r0
        test __DIVSGN,#1 wz
  IF_NZ neg r1,r1
__DIVSI_ret ret
[/code]
This is much better: the output no longer has a bunch of RDLONGs and WRLONGs, and is hence much more efficient. Previously, the main loop was 34 lines (many of which are hub accesses), and now it is 22 lines. The other comments still apply though. Also as before, -mfcache did not change the output code.
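One way to apply that observation in a real program, sketched in C under stated assumptions: suppose the inputs arrive through a shared structure rather than uninitialized locals. The struct mailbox layout and the compute_once name are hypothetical and not part of propgcc. Reading each volatile input once into a plain local and writing the result once should leave the compiler free to keep the arithmetic in registers. The zero check is an addition here; the test programs above divide unconditionally.

```c
#include <stdint.h>

/* Hypothetical mailbox shared with other code; field names are illustrative. */
struct mailbox {
    volatile int32_t num1, num2, num3;  /* inputs written by someone else */
    volatile int32_t result0;           /* output read by someone else */
};

/* One pass of the benchmark arithmetic: square each input, sum the squares,
 * then divide by the squared third input, as in the test programs above. */
static void compute_once(struct mailbox *box)
{
    /* Exactly one volatile read per input; a, b, c are ordinary locals. */
    int32_t a = box->num1;
    int32_t b = box->num2;
    int32_t c = box->num3;

    a *= a;
    b *= b;
    c *= c;

    /* Exactly one volatile write. Division by zero is undefined in C,
     * so this sketch stores 0 in that case. */
    box->result0 = (c != 0) ? (a + b + c) / c : 0;
}
```

With inputs 3, 4 and 5 the squares are 9, 16 and 25, so the stored result is 50 / 25 = 2. Large inputs can overflow the signed multiplications, which this sketch does not guard against.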
[size=+2]Conclusion[/size]
I think I will look into Propeller GCC more. It seems to do a good job of compiling down to efficient Propeller assembly, and it isn't too hard to read the output. I hope that it will be improved over time as well. The PropBasic compiler has more understandable output, but the inefficient use of cog RAM and the lack of updates (no changes in 8 months) have me worried. Propeller GCC seems to fit my requirements.
[code]
#include <propeller.h>
#include <stdint.h>

void main()
{
    for(;;) {
        int num1, num2, num3;
        volatile int result1, result2, result3;
        result1 = (num1*num1 + num2*num2 + num3*num3)/num1;
        result2 = (num1*num1 + num2*num2 + num3*num3)/num2;
        result3 = (num1*num1 + num2*num2 + num3*num3)/num3;
    }
}
[/code]
As written, it's quite inefficient, but I'd bet that GCC will make a temporary variable for num1*num1+num2*num2+num3*num3.
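The transformation being bet on is usually called common subexpression elimination. A minimal sketch of what it amounts to, written out by hand in plain C so it can be checked on a host: the struct results type and both function names are invented for this illustration, and nothing here shows what propgcc actually emits.

```c
#include <stdint.h>

/* Hypothetical container for the three results. */
struct results { int32_t r1, r2, r3; };

/* As posted: the sum of squares is spelled out three times. */
static struct results as_written(int32_t n1, int32_t n2, int32_t n3)
{
    struct results out;
    out.r1 = (n1*n1 + n2*n2 + n3*n3) / n1;
    out.r2 = (n1*n1 + n2*n2 + n3*n3) / n2;
    out.r3 = (n1*n1 + n2*n2 + n3*n3) / n3;
    return out;
}

/* Hand-hoisted: the shared sum is computed once and reused, which is the
 * temporary an optimizing compiler would be expected to introduce. */
static struct results hoisted(int32_t n1, int32_t n2, int32_t n3)
{
    int32_t sum = n1*n1 + n2*n2 + n3*n3;  /* three multiplies instead of nine */
    struct results out;
    out.r1 = sum / n1;
    out.r2 = sum / n2;
    out.r3 = sum / n3;
    return out;
}
```

Both versions give the same answers for nonzero inputs. For 3, 4 and 5 the shared sum is 50 and the results are 16, 12 and 10. Writing the temporary by hand costs nothing and removes the dependence on the optimizer, which matters if the code is ever built without -Os or -O2.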
Even if it reorders operations and breaks up lines of code the compiler still has to make a syntax tree, which has enough information to output useful comments. Some comments are better than none, especially when it changes the logic and order of the operations.
@Dave Hein
If you include the flags "-mspin" and "-S" it should produce the complete assembly code (i.e., code with the support assembly included rather than just the business logic from your source).
EDIT: Oh, I see you mentioned it in your previous post on March 4, and you also posted the output from PropGCC. Sorry I missed that. I usually skim through any post that is more than a dozen lines or so. I should have read it in more detail.