Shop OBEX P1 Docs P2 Docs Learn Events
flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler — Parallax Forums

flexspin compiler for P2: Assembly, Spin, BASIC, and C in one compiler

ersmithersmith Posts: 6,051
edited 2020-10-24 23:13 in Propeller 2
Update October 2020: "fastspin" has been renamed to "flexspin" to better fit in to the naming scheme of the FlexProp GUI for Propeller development.

Please go to the spin2cpp release directory for the latest version of flexspin. There is also a simple GUI for Spin and BASIC development on P2 at FlexProp.

Flexspin is a compiler for Spin (and much of Spin2), BASIC, and C which can produce binaries for P1 and P2. FlexProp is a very simple IDE which uses flexspin.

I won't try to keep this post up to date. The links above always have the most recent versions of flexspin and FlexProp. The compiler has beta C language support now. The Spin support is pretty mature, and BASIC is in good shape. Lots of test programs run on the P2 Eval board.

Updated May 18, 2018: Some more bug fixes both to multiple assignment and to PNut compatibility. Now able to build ROM_137PBJ.spin2 correctly (or at least, the same as PNut :)).

Updated May 17, 2018: Fixed a couple of problems that were impacting PNut compatibility, and improved the syntax for multiple assignment (so the parentheses are no longer required in most cases).

Updated May 16, 2018: Updated with a new Beta of fastspin that has support for multiple assignment and functions that return multiple values. I'd also like to point to a simple GUI for fastspin + P2: https://github.com/totalspectrum/spin2gui/releases. It allows you to edit a .spin or .spin2 file, compile it, and download to the P2 hardware (thanks to Dave Hein for his loadp2 program). It's configurable and could (at least in theory) also be used for P1 development, but I haven't tested that since there are already so many P1 GUIs available :). AFAIK the only competitor for spin2gui on P2 is PNut, and at the moment PNut only supports assembly code.

Updated May 10, 2018: My, how time flies... Anyway, now that hardware seems imminent I've gone back and updated fastspin for the v32b FPGA images. fastspin -2 is able to compile all of the samples from Dave Hein's p2asm, including the boot ROM, so the instruction set coverage is pretty good now (but obviously let me know if anything is missing!). The Spin compiler itself is less tested in P2 mode, but I've used it for some emulators and demos on the FPGA. As always, usage is command line only and is just:
fastspin -2 myfile.spin2
This will produce myfile.pasm2 (which is the converted PASM2) and myfile.binary. You can load myfile.binary with PNut, I think, but I use Dave Hein's excellent loadp2 program, included with his p2gcc package.

Updated March 7, 2017: Updated the compiler for the v16a instruction set. The instruction set coding is not tested very thoroughly, but simple programs do compile and run.

Updated May 18: Fixed several bugs reported in the forums. The new binary is attached as fastspin_beta4.zip. Note that fastspin.exe can produce code for either P1 or P2; to get P2 you need to specify the -2 flag.

Updated May 9: Fixed binary output of jumps in COG mode, and updated the push/pop code in the compiler to use postincrement/predecrement mode. The updated binary is attached. Source code, as always, is at github.com/totalspectrum/spin2cpp.

Updated May 7: Fixed a bug in the >< operator (REV works differently on P2) so that fft_bench works now.

Updated May 6: I've attached the beta version of the compiler. The DAT section parser works better (produces the same result as PNut for the inputs I've tested) but is still incomplete; for example, it doesn't understand the fancy ptra++ syntax for read and write. I've fixed a few bugs in the PASM output
too, and added a fastspin specific readme. Usage is pretty simple:
fastspin -2 fibo.spin
produces fibo.p2asm, which can then be loaded by PNut.

*** Original Message ***
Here's an alpha version of the fastspin compiler for P2. fastspin compiles Spin code to PASM; it otherwise acts very much like openspin, but has a few extensions (such as inline assembly between asm...endasm). This version of fastspin has a -2 option to produce P2 code (the output file will be named with a .p2asm extension).

This is labeled an "alpha" version because assembly code inside DAT sections is not always compiled correctly -- the P2 instruction parser is incomplete and buggy. Inline assembly does (mostly) work, because it's passed through to PNut.

Having said all the caveats, fastspin is able to compile simple programs (like the fibonacci demo), and it may be useful for putting together quick demos and tests of the hardware. I compiled and ran fibo like so:
fastspin -2 fibo.spin
PNut_v7 fibo.p2asm
In PNut I selected compile and run, then opened a terminal window to see the output.

Please let me know of any issues you find. I'm still working on the P2 support for DAT sections and hope to have that functional soon.

Eric
«134567121

Comments

  • Heater.Heater. Posts: 21,230
    Is that in a repo somewhere for us Linux users to build?
  • Heater. wrote: »
    Is that in a repo somewhere for us Linux users to build?

    It's in the spin2cpp repo (https://github.com/totalspectrum/spin2cpp). fastspin is just a different front end to spin2cpp, and it's checked into that repo too.
  • Heater.Heater. Posts: 21,230
    Cool, I'll be on it ASAP.
  • Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!
  • rjo__rjo__ Posts: 2,114
    Works here. Very timely!

  • @Eric, I'll have to try fastspin when I get a chance. I haven't done anything on the P2 for a while, and fastspin looks very interesting.

    @David, You didn't add a smiley to your post, but I'm sure you were just kidding. cspin could be used to convert C to Spin, and then compile it with fastspin. However, that would be very limited. We really do need GCC for the P2.
  • David Betz wrote: »
    Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!

    :'( but C++
  • Eric: What sorts of optimization does fastspin do?

  • David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
  • ersmith wrote: »
    David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
    Register allocation is what always seems to stop me from creating register-based virtual machines. I should really spend some time to learn how to do it! :-)

  • jmgjmg Posts: 15,172
    edited 2016-05-05 22:08
    ersmith wrote: »
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    Sometimes simpler and more predictable is easier to manage!

    How would your optimize manage this DivMod requirement ?
            u32c = seed-256;
    	u32x = (u32a * u32b) / u32c;
    	u32y = (u32a * u32b) % u32c;
            return( u32x +  u32y );   // use both results to avoid discard.
    

    Google suggests GCC will use one mul/div, where native div opcode gives both remainder and quotient.

    Prop docs say
    32 x 32 unsigned multiply with 64-bit product
    64 / 32 unsigned divide with 32-bit quotient and 32-bit remainder


    so that's looking like the same opcodes, compilers should be able to optimise the above well ?

    I've just done some tests using GCC on intel D2000, with rather erratic results. Close, but no cigar.
    (Looks more like GCC issue than D2000 ?)

    Tests on GCC give these possible outcomes, (release build) which moves around with editing other code (?!) :
    (debug build gives expected inefficient two copies, but it is quite stable)

    Function use seems more stable, but has bonus push/pop & bonus xor ?
    No idea why clearing the upper-32b before imul is used, but that XOR seems 'mobile' in GCC, sometimes it pops up after imul, which is not where I would want it.
      x86 : 64b div 32b  =>  edx is Remainder, eax is Quotient
     sometime this (in line in main() )
      cb:	8d 8b 00 ff ff ff    	lea    -0x100(%ebx),%ecx   -256
      d1:	89 d8                	mov    %ebx,%eax
      d3:	0f af c1             	imul   %ecx,%eax          
      d6:	31 d2                	xor    %edx,%edx          
      d8:	f7 f1                  	div    %ecx
      da:	01 d0                	add    %edx,%eax
      dc:	e8 fc ff ff ff       	call   Chkmaths
    
    some edits later, and in-line gives
      d7:	8d 8b ff fe ff ff    	lea    -0x101(%ebx),%ecx
      dd:	89 d8                	mov    %ebx,%eax
    [b]  df:	31 d2                	xor    %edx,%edx  [/b]
      e1:	0f af c1             	imul   %ecx,%eax
      e4:	f7 f1          	        div    %ecx
      e6:	01 d0                	add    %edx,%eax
      e8:	e8 fc ff ff ff       	call   Chkmaths
    
    compare exact SAME construct, as function call  
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       6:	31 d2                	xor    %edx,%edx   
       8:	0f af c0             	imul   %eax,%eax 
       b:	55                   	push   %ebp               
       c:	f7 f1                	div    %ecx          
       e:	89 e5                	mov    %esp,%ebp          
      10:	01 d0                	add    %edx,%eax 
      12:	5d                   	pop    %ebp               
      13:	c3                   	ret
    
    to me, the ideal code is a stable 4 lines of assembler:
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       8:	0f af c0             	imul   %eax,%eax 
       c:	f7 f1                	div    %ecx          
      10:	01 d0                	add    %edx,%eax 
    
    
    I'm also unclear if this is mixing signed/unsigned opcodes, or the mnemonics are just lazy..
  • ersmithersmith Posts: 6,051
    edited 2016-05-06 01:32
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues. There is a standard C function (ldiv) that can do both div and mod, though, which helps.
  • jmgjmg Posts: 15,172
    ersmith wrote: »
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    Would be nice to have stable support of this type of use, as it is something of a HLL blindspot.
    ersmith wrote: »
    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues.

    Not sure which GCC intel uses, but it seems to get the big steps right, then drop the ball in the details... much like you say..
    ersmith wrote: »
    There is a standard C function (ldiv) that can do both div and mod, though, which helps.
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod

    P2 should be able to do this in the same number of ASM lines too, I think ?
    I can't find numbers yet on those intel opcode speeds, but they are not likely to be fast at 32MHz sysclks.
    P2 with Cordic delays, may be similar ? / faster ?

  • jmg wrote: »
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin. You could use inline assembly in fastspin though, something like:
    PUB foo(seed, u32a, u32b) | u32c, u32x, u32y, low, high
        u32c := seed - 256
        '' want:u32x := u32a * u32b / u32c
        ''      u32y := u32a * u32b // u32c
        asm
          qmul u32a, u32b
          getqx low
          getqy high
          setq  high
          qdiv  low, u32c
          getqx u32x
          getqy u32y
        endasm
        return u32x + u32y
    
    which produces
    _foo
            sub     arg1, #256
            qmul    arg2, arg3
            getqx   _foo_low
            getqy   _foo_high
            setq    _foo_high
            qdiv    _foo_low, arg1
            getqx   result1
            getqy   _foo_u32y
            add     result1, _foo_u32y
    _foo_ret
            reta
    
  • Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?
  • That said, I'll take it as it is right now. Didn't know simple inline was a part of this. Thanks Ersmith!

  • ersmithersmith Posts: 6,051
    edited 2016-05-06 15:49
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
  • ersmith wrote: »
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
    Given all of the work you've done on spin2cpp, fastspin, and lots of other things, I think it would be hard to support your claim of being "lazy"! :-)

  • Agreed!
  • jmgjmg Posts: 15,172
    ersmith wrote: »
    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).
    Makes sense to me - because ASM does not follow Spin rules, you cannot use Spin rules to exit any ASM area.

    Nice ASM example. How many cycles is that in P2 ?

  • jmg wrote: »
    Nice ASM example. How many cycles is that in P2 ?
    I'm not sure. Does it depend on how many other COGs are using the CORDIC unit? With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).

  • I've updated the first post with a new compiler. Conversion of DAT sections is still incomplete, but at least if output is produced it should be correct.
  • jmgjmg Posts: 15,172
    ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?
    I think the CORDIC interleaves other COGs invisibly, but does take a little time.
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32+4*2 = 200
    ie guess 6 cordic lines need 32c each and 4 non cordic lines are 2c, harder to get an odd number tho ?



  • jmg wrote: »
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32+4*2 = 200
    ie guess 6 cordic lines need 32c each and 4 non cordic lines are 2c, harder to get an odd number tho ?
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.

  • jmgjmg Posts: 15,172
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

  • ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?

    The CORDIC takes just as long, no matter how many cogs are using it. Another cog gets a turn each clock, and if a cog doesn't use its turn, it gets wasted and no other cog can use it.
  • jmgjmg Posts: 15,172
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

  • kwinnkwinn Posts: 8,697
    jmg wrote: »
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

    I would think extended Spin for the P2 is a given once the dust settles.
  • jmg wrote: »
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

    Frankly I'm not too worried about it, but here's the code if you'd like to take a look.

  • jmgjmg Posts: 15,172
    edited 2016-05-07 21:02
    Maybe odd is not as odd as I thought. I assumed everything was paced at Opcode rates, but the P2 can wait to SysCLK granularity, and that seems to be exactly what happens here.

    I find this (bold added) - seems 39c is a normal delay thru CORDIC

    "When a cog issues a CORDIC instruction, it must wait for its hub slot, which is 0..15 clocks away, in order to hand off the command to the CORDIC solver. Thirty-nine clocks later, results will be available via the GETQX and GETQY instructions, which will wait for the results, in case they haven’t arrived yet.

    Because each cog’s hub slot comes around every 16 clocks and the pipeline is 38 clocks long, it is possible to overlap CORDIC commands, where three commands are initially given to the CORDIC solver, and then results are picked up and another command is given, indefinitely, until, at the end, three result are picked up"
Sign In or Register to comment.