fastspin compiler for P2

ersmith Posts: 2,082
Updated March 7, 2017: Updated the compiler for the v16a instruction set. The instruction set coding is not tested very thoroughly, but simple programs do compile and run.

Updated May 18: Fixed several bugs reported in the forums. The new binary is attached as fastspin_beta4.zip. Note that fastspin.exe can produce code for either P1 or P2; to get P2 you need to specify the -2 flag.

Updated May 9: Fixed binary output of jumps in COG mode, and updated the push/pop code in the compiler to use postincrement/predecrement mode. The updated binary is attached. Source code, as always, is at github.com/totalspectrum/spin2cpp.

Updated May 7: Fixed a bug in the >< operator (REV works differently on P2) so that fft_bench works now.

Updated May 6: I've attached the beta version of the compiler. The DAT section parser works better (it produces the same result as PNut for the inputs I've tested) but is still incomplete; for example, it doesn't understand the fancy ptra++ syntax for read and write. I've fixed a few bugs in the PASM output
too, and added a fastspin-specific readme. Usage is pretty simple:
fastspin -2 fibo.spin
produces fibo.p2asm, which can then be loaded by PNut.

*** Original Message ***
Here's an alpha version of the fastspin compiler for P2. fastspin compiles Spin code to PASM; it otherwise acts very much like openspin, but has a few extensions (such as inline assembly between asm...endasm). This version of fastspin has a -2 option to produce P2 code (the output file will be named with a .p2asm extension).

This is labeled an "alpha" version because assembly code inside DAT sections is not always compiled correctly -- the P2 instruction parser is incomplete and buggy. Inline assembly does (mostly) work, because it's passed through to PNut.

Having said all the caveats, fastspin is able to compile simple programs (like the fibonacci demo), and it may be useful for putting together quick demos and tests of the hardware. I compiled and ran fibo like so:
fastspin -2 fibo.spin
PNut_v7 fibo.p2asm
In PNut I selected compile and run, then opened a terminal window to see the output.

Please let me know of any issues you find. I'm still working on the P2 support for DAT sections and hope to have that functional soon.

Eric

Comments

  • Is that in a repo somewhere for us Linux users to build?
  • Heater. wrote: »
    Is that in a repo somewhere for us Linux users to build?

    It's in the spin2cpp repo (https://github.com/totalspectrum/spin2cpp). fastspin is just a different front end to spin2cpp, and it's checked into that repo too.
  • Cool, I'll be on it ASAP.
  • Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!
  • Works here. Very timely!

  • @Eric, I'll have to try fastspin when I get a chance. I haven't done anything on the P2 for a while, and fastspin looks very interesting.

    @David, You didn't add a smiley to your post, but I'm sure you were just kidding. cspin could be used to convert C to Spin, and then compile it with fastspin. However, that would be very limited. We really do need GCC for the P2.
  • David Betz wrote: »
    Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!

    :'( but C++
    David
    PropWare: C++ HAL (Hardware Abstraction Layer) for PropGCC; Robust build system using CMake; Integrated Simple Library, libpropeller, and libPropelleruino (Arduino port); Instructions for Eclipse and JetBrain's CLion; Example projects; Doxygen documentation
  • Eric: What sorts of optimization does fastspin do?

  • David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
  • ersmith wrote: »
    David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
    Register allocation is what always seems to stop me from creating register-based virtual machines. I should really spend some time to learn how to do it! :-)

jmg Posts: 10,455
    edited May 2016
    ersmith wrote: »
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    Sometimes simpler and more predictable is easier to manage!

    How would your optimizer manage this DivMod requirement?
        u32c = seed - 256;
        u32x = (u32a * u32b) / u32c;
        u32y = (u32a * u32b) % u32c;
        return u32x + u32y;   // use both results to avoid discard
    

    Google suggests GCC will use one mul/div, since the native div opcode gives both remainder and quotient.

    Prop docs say
    32 x 32 unsigned multiply with 64-bit product
    64 / 32 unsigned divide with 32-bit quotient and 32-bit remainder


    so that's looking like the same opcodes; compilers should be able to optimise the above well?

    I've just done some tests using GCC on intel D2000, with rather erratic results. Close, but no cigar.
    (Looks more like GCC issue than D2000 ?)

    Tests on GCC (release build) give these possible outcomes, which move around when editing other code (?!):
    (a debug build gives the expected inefficient two copies, but it is quite stable)

    Using a function seems more stable, but has a bonus push/pop and a bonus xor?
    No idea why clearing the upper 32b before imul is used, but that XOR seems 'mobile' in GCC; sometimes it pops up after imul, which is not where I would want it.
      x86 : 64b div 32b  =>  edx is Remainder, eax is Quotient
     sometimes this (inline in main()):
      cb:	8d 8b 00 ff ff ff    	lea    -0x100(%ebx),%ecx   -256
      d1:	89 d8                	mov    %ebx,%eax
      d3:	0f af c1             	imul   %ecx,%eax          
      d6:	31 d2                	xor    %edx,%edx          
      d8:	f7 f1                  	div    %ecx
      da:	01 d0                	add    %edx,%eax
      dc:	e8 fc ff ff ff       	call   Chkmaths
    
    some edits later, the inline version gives:
      d7:	8d 8b ff fe ff ff    	lea    -0x101(%ebx),%ecx
      dd:	89 d8                	mov    %ebx,%eax
      df:	31 d2                	xor    %edx,%edx
      e1:	0f af c1             	imul   %ecx,%eax
      e4:	f7 f1          	        div    %ecx
      e6:	01 d0                	add    %edx,%eax
      e8:	e8 fc ff ff ff       	call   Chkmaths
    
    compare the exact SAME construct as a function call:
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       6:	31 d2                	xor    %edx,%edx   
       8:	0f af c0             	imul   %eax,%eax 
       b:	55                   	push   %ebp               
       c:	f7 f1                	div    %ecx          
       e:	89 e5                	mov    %esp,%ebp          
      10:	01 d0                	add    %edx,%eax 
      12:	5d                   	pop    %ebp               
      13:	c3                   	ret
    
    to me, the ideal code is a stable 4 lines of assembler:
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       8:	0f af c0             	imul   %eax,%eax 
       c:	f7 f1                	div    %ecx          
      10:	01 d0                	add    %edx,%eax 
    
    
    I'm also unclear if this is mixing signed/unsigned opcodes, or if the mnemonics are just lazy.
  • ersmith Posts: 2,082
    edited May 2016
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues. There is a standard C function (ldiv) that can do both div and mod, though, which helps.
  • jmg Posts: 10,455
    ersmith wrote: »
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    Would be nice to have stable support for this type of use, as it is something of an HLL blind spot.
    ersmith wrote: »
    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues.

    Not sure which GCC Intel uses, but it seems to get the big steps right, then drop the ball in the details... much like you say.
    ersmith wrote: »
    There is a standard C function (ldiv) that can do both div and mod, though, which helps.
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod

    P2 should be able to do this in the same number of ASM lines too, I think?
    I can't find numbers yet for those Intel opcode speeds, but they are not likely to be fast at 32MHz sysclks.
    P2, with CORDIC delays, may be similar? Faster?

  • jmg wrote: »
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin. You could use inline assembly in fastspin though, something like:
    PUB foo(seed, u32a, u32b) | u32c, u32x, u32y, low, high
        u32c := seed - 256
        '' want:u32x := u32a * u32b / u32c
        ''      u32y := u32a * u32b // u32c
        asm
          qmul u32a, u32b
          getqx low
          getqy high
          setq  high
          qdiv  low, u32c
          getqx u32x
          getqy u32y
        endasm
        return u32x + u32y
    
    which produces
    _foo
            sub     arg1, #256
            qmul    arg2, arg3
            getqx   _foo_low
            getqy   _foo_high
            setq    _foo_high
            qdiv    _foo_low, arg1
            getqx   result1
            getqy   _foo_u32y
            add     result1, _foo_u32y
    _foo_ret
            reta
    
  • Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?
  • That said, I'll take it as it is right now. Didn't know simple inline was a part of this. Thanks Ersmith!

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin
  • ersmith Posts: 2,082
    edited May 2016
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
  • ersmith wrote: »
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
    Given all of the work you've done on spin2cpp, fastspin, and lots of other things, I think it would be hard to support your claim of being "lazy"! :-)

  • Agreed!
  • jmg Posts: 10,455
    ersmith wrote: »
    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).
    Makes sense to me - because ASM does not follow Spin rules, you cannot use Spin rules to exit any ASM area.

    Nice ASM example. How many cycles is that in P2 ?

  • jmg wrote: »
    Nice ASM example. How many cycles is that in P2 ?
    I'm not sure. Does it depend on how many other COGs are using the CORDIC unit? With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).

  • I've updated the first post with a new compiler. Conversion of DAT sections is still incomplete, but at least if output is produced it should be correct.
  • jmg Posts: 10,455
    ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?
    I think the CORDIC interleaves other COGs invisibly, but does take a little time.
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32 + 4*2 = 200,
    i.e. a guess of 6 cordic lines needing 32c each and 4 non-cordic lines at 2c; harder to get an odd number though?



  • jmg wrote: »
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32+4*2 = 200
    ie guess 6 cordic lines need 32c each and 4 non cordic lines are 2c, harder to get an odd number tho ?
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.

  • jmg Posts: 10,455
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

  • ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?

    The CORDIC takes just as long, no matter how many cogs are using it. Another cog gets a turn each clock, and if a cog doesn't use its turn, it gets wasted and no other cog can use it.
  • jmg Posts: 10,455
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

  • jmg wrote: »
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

    I would think extended Spin for the P2 is a given once the dust settles.
    In science there is no authority. There is only experiment.
    Life is unpredictable. Eat dessert first.
  • jmg wrote: »
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

    Frankly I'm not too worried about it, but here's the code if you'd like to take a look.

  • jmg Posts: 10,455
    edited May 2016
    Maybe odd is not as odd as I thought. I assumed everything was paced at Opcode rates, but the P2 can wait to SysCLK granularity, and that seems to be exactly what happens here.

    I find this (bold added); it seems 39c is the normal delay through the CORDIC:

    "When a cog issues a CORDIC instruction, it must wait for its hub slot, which is 0..15 clocks away, in order to hand off the command to the CORDIC solver. Thirty-nine clocks later, results will be available via the GETQX and GETQY instructions, which will wait for the results, in case they haven’t arrived yet.

    Because each cog's hub slot comes around every 16 clocks and the pipeline is 38 clocks long, it is possible to overlap CORDIC commands, where three commands are initially given to the CORDIC solver, and then results are picked up and another command is given, indefinitely, until, at the end, three results are picked up"