fastspin compiler for P2

ersmith Posts: 2,082
Updated March 7, 2017: Updated the compiler for the v16a instruction set. The instruction set coding is not tested very thoroughly, but simple programs do compile and run.

Updated May 18: Fixed several bugs reported in the forums. The new binary is attached as fastspin_beta4.zip. Note that fastspin.exe can produce code for either P1 or P2; to get P2 you need to specify the -2 flag.

Updated May 9: Fixed binary output of jumps in COG mode, and updated the push/pop code in the compiler to use postincrement/predecrement mode. The updated binary is attached. Source code, as always, is at github.com/totalspectrum/spin2cpp.

Updated May 7: Fixed a bug in the >< operator (REV works differently on P2) so that fft_bench works now.

Updated May 6: I've attached the beta version of the compiler. The DAT section parser works better (it produces the same result as PNut for the inputs I've tested) but is still incomplete; for example, it doesn't understand the fancy ptra++ syntax for read and write. I've fixed a few bugs in the PASM output
too, and added a fastspin-specific readme. Usage is pretty simple:
fastspin -2 fibo.spin
produces fibo.p2asm, which can then be loaded by PNut.

*** Original Message ***
Here's an alpha version of the fastspin compiler for P2. fastspin compiles Spin code to PASM; it otherwise acts very much like openspin, but has a few extensions (such as inline assembly between asm...endasm). This version of fastspin has a -2 option to produce P2 code (the output file will be named with a .p2asm extension).

This is labeled an "alpha" version because assembly code inside DAT sections is not always compiled correctly -- the P2 instruction parser is incomplete and buggy. Inline assembly does (mostly) work, because it's passed through to PNut.

Having said all the caveats, fastspin is able to compile simple programs (like the fibonacci demo), and it may be useful for putting together quick demos and tests of the hardware. I compiled and ran fibo like so:
fastspin -2 fibo.spin
PNut_v7 fibo.p2asm
In PNut I selected compile and run, then opened a terminal window to see the output.

Please let me know of any issues you find. I'm still working on the P2 support for DAT sections and hope to have that functional soon.

Eric

Comments

  • Is that in a repo somewhere for us Linux users to build?
  • Heater. wrote: »
    Is that in a repo somewhere for us Linux users to build?

    It's in the spin2cpp repo (https://github.com/totalspectrum/spin2cpp). fastspin is just a different front end to spin2cpp, and it's checked into that repo too.
  • Cool, I'll be on it ASAP.
  • Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!
  • Works here. Very timely!

  • @Eric, I'll have to try fastspin when I get a chance. I haven't done anything on the P2 for a while, and fastspin looks very interesting.

    @David, You didn't add a smiley to your post, but I'm sure you were just kidding. cspin could be used to convert C to Spin, and then compile it with fastspin. However, that would be very limited. We really do need GCC for the P2.
  • David Betz wrote: »
    Very cool! I think Dave Hein has a program that can translate C to Spin so maybe we don't need a C compiler for P2 after all!

    :'( but C++
    David
    PropWare: C++ HAL (Hardware Abstraction Layer) for PropGCC; Robust build system using CMake; Integrated Simple Library, libpropeller, and libPropelleruino (Arduino port); Instructions for Eclipse and JetBrain's CLion; Example projects; Doxygen documentation
  • Eric: What sorts of optimization does fastspin do?

  • David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
  • ersmith wrote: »
    David Betz wrote: »
    Eric: What sorts of optimization does fastspin do?
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    It doesn't do register assignment at all, it just allocates all local variables in unique COG locations. So large programs can run out of space in COG memory. That's something that can be fixed eventually.
    Register allocation is what always seems to stop me from creating register-based virtual machines. I should really spend some time to learn how to do it! :-)

jmg Posts: 10,455
    edited May 2016
    ersmith wrote: »
    Just pretty basic peephole optimization, constant propagation, and inlining of small functions. On small sequential functions it can produce very good code (even better than GCC, since it's tuned for the Propeller architecture), but it doesn't have any of GCC's sophisticated loop optimizations or common sub-expression elimination, so in practice it usually won't keep up with GCC.

    Sometimes simpler and more predictable is easier to manage!

    How would your optimizer manage this DivMod requirement?
        u32c = seed - 256;
        u32x = (u32a * u32b) / u32c;
        u32y = (u32a * u32b) % u32c;
        return u32x + u32y;   // use both results to avoid discard
    

    Google suggests GCC will use one mul/div, since the native div opcode gives both remainder and quotient.

    Prop docs say
    32 x 32 unsigned multiply with 64-bit product
    64 / 32 unsigned divide with 32-bit quotient and 32-bit remainder


    so that's looking like the same opcodes; compilers should be able to optimise the above well?

    I've just done some tests using GCC on intel D2000, with rather erratic results. Close, but no cigar.
    (Looks more like GCC issue than D2000 ?)

    Tests on GCC (release build) give these possible outcomes, which move around when editing other code (?!):
    (a debug build gives the expected inefficient two copies, but it is quite stable)

    Using a function seems more stable, but has a bonus push/pop and a bonus xor?
    No idea why clearing the upper 32b before imul is used, but that XOR seems 'mobile' in GCC; sometimes it pops up after imul, which is not where I would want it.
      x86 : 64b div 32b  =>  edx is Remainder, eax is Quotient
     sometimes this (inline in main()):
      cb:	8d 8b 00 ff ff ff    	lea    -0x100(%ebx),%ecx   -256
      d1:	89 d8                	mov    %ebx,%eax
      d3:	0f af c1             	imul   %ecx,%eax          
      d6:	31 d2                	xor    %edx,%edx          
      d8:	f7 f1                  	div    %ecx
      da:	01 d0                	add    %edx,%eax
      dc:	e8 fc ff ff ff       	call   Chkmaths
    
    some edits later, the inline version gives:
      d7:	8d 8b ff fe ff ff    	lea    -0x101(%ebx),%ecx
      dd:	89 d8                	mov    %ebx,%eax
      df:	31 d2                	xor    %edx,%edx
      e1:	0f af c1             	imul   %ecx,%eax
      e4:	f7 f1          	        div    %ecx
      e6:	01 d0                	add    %edx,%eax
      e8:	e8 fc ff ff ff       	call   Chkmaths
    
    compare the exact SAME construct as a function call:
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       6:	31 d2                	xor    %edx,%edx   
       8:	0f af c0             	imul   %eax,%eax 
       b:	55                   	push   %ebp               
       c:	f7 f1                	div    %ecx          
       e:	89 e5                	mov    %esp,%ebp          
      10:	01 d0                	add    %edx,%eax 
      12:	5d                   	pop    %ebp               
      13:	c3                   	ret
    
    to me, the ideal code is a stable 4 lines of assembler:
       0:	8d 88 00 ff ff ff    	lea    -0x100(%eax),%ecx   -256
       8:	0f af c0             	imul   %eax,%eax 
       c:	f7 f1                	div    %ecx          
      10:	01 d0                	add    %edx,%eax 
    
    
    I'm also unclear if this is mixing signed/unsigned opcodes, or if the mnemonics are just lazy.
  • ersmith Posts: 2,082
    edited May 2016
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues. There is a standard C function (ldiv) that can do both div and mod, though, which helps.
  • jmg Posts: 10,455
    ersmith wrote: »
    fastspin doesn't have common subexpression elimination yet, which is pretty much required to recognize div/mod combinations.

    Would be nice to have stable support for this type of use, as it is something of an HLL blind spot.
    ersmith wrote: »
    GCC has it in theory, but in practice I've had trouble getting PropGCC to combine a div and mod, even though we've told it that the appropriate functions produce both results. I think PropGCC4 sometimes gets it right, but PropGCC6 has had issues.

    Not sure which GCC Intel uses, but it seems to get the big steps right, then drop the ball in the details... much like you say.
    ersmith wrote: »
    There is a standard C function (ldiv) that can do both div and mod, though, which helps.
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod

    P2 should be able to do this in the same number of ASM lines too, I think?
    I can't find numbers yet for those Intel opcode speeds, but they are not likely to be fast at 32MHz sysclks.
    P2, with CORDIC delays, may be similar? Faster?

  • jmg wrote: »
    I needed to have 32*32 -> 64b then 64b/32b -> Div.Mod
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin. You could use inline assembly in fastspin though, something like:
    PUB foo(seed, u32a, u32b) | u32c, u32x, u32y, low, high
        u32c := seed - 256
        '' want:u32x := u32a * u32b / u32c
        ''      u32y := u32a * u32b // u32c
        asm
          qmul u32a, u32b
          getqx low
          getqy high
          setq  high
          qdiv  low, u32c
          getqx u32x
          getqy u32y
        endasm
        return u32x + u32y
    
    which produces
    _foo
            sub     arg1, #256
            qmul    arg2, arg3
            getqx   _foo_low
            getqy   _foo_high
            setq    _foo_high
            qdiv    _foo_low, arg1
            getqx   result1
            getqy   _foo_u32y
            add     result1, _foo_u32y
    _foo_ret
            reta
    
  • Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?
  • That said, I'll take it as it is right now. Didn't know simple inline was a part of this. Thanks Ersmith!

    Do not taunt Happy Fun Ball! @opengeekorg ---> Be Excellent To One Another SKYPE = acuity_doug
    Parallax colors simplified: http://forums.parallax.com/showthread.php?123709-Commented-Graphics_Demo.spin
  • ersmith Posts: 2,082
    edited May 2016
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
  • ersmith wrote: »
    Spin uses indentation to determine when blocks end. There is no endif - you just unindent. So why do you have an endasm? Why don't you just indent your inline assembly under its asm block?

    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).

    Eric
    Given all of the work you've done on spin2cpp, fastspin, and lots of other things, I think it would be hard to support your claim of being "lazy"! :-)

  • Agreed!
  • jmg Posts: 10,455
    ersmith wrote: »
    So that labels in the inline assembly could start at the left column. PASM code doesn't use indentation the same way Spin does, so I thought it was more prudent to explicitly mark the end of the assembly (it makes parsing easier too). There probably would be some way to stick with just indentation (maybe require labels to start with ":" ) but I'm lazy :).
    Makes sense to me - because ASM does not follow Spin rules, you cannot use Spin rules to exit any ASM area.

    Nice ASM example. How many cycles is that in P2 ?

  • jmg wrote: »
    Nice ASM example. How many cycles is that in P2 ?
    I'm not sure. Does it depend on how many other COGs are using the CORDIC unit? With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).

  • I've updated the first post with a new compiler. Conversion of DAT sections is still incomplete, but at least if output is produced it should be correct.
  • jmg Posts: 10,455
    ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?
    I think the CORDIC interleaves other COGs invisibly, but does take a little time.
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32 + 4*2 = 200,
    i.e. a guess of 6 cordic lines needing 32c each and 4 non-cordic lines at 2c; harder to get an odd number though?



  • jmg wrote: »
    ersmith wrote: »
    With just one COG running and with some totally bogus inputs the whole subroutine call and return takes 201 cycles (this is with hubexec).
    As a rough check, I can get something like 6*32+4*2 = 200
    ie guess 6 cordic lines need 32c each and 4 non cordic lines are 2c, harder to get an odd number tho ?
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.

  • jmg Posts: 10,455
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

  • ersmith wrote: »
    Does it depend on how many other COGs are using the CORDIC unit?

    The CORDIC takes just as long, no matter how many cogs are using it. Another cog gets a turn each clock, and if a cog doesn't use its turn, it gets wasted and no other cog can use it.
  • jmg Posts: 10,455
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

  • jmg wrote: »
    ersmith wrote: »
    Spin doesn't have any way to express 64 bit values, nor does it have unsigned operations, so I don't think you could write this in Spin.
    Wow, really ?
    Given P2 hardware says this
    32 x 32 unsigned multiply
    64 / 32 unsigned divide

    to me, that makes supporting unsigned operations quite important on pretty much any P2 language ?
    Sounds like Spin needs an extension for P2 ?

    I would think extended Spin for the P2 is a given once the dust settles.
    In science there is no authority. There is only experiment.
    Life is unpredictable. Eat dessert first.
  • jmg wrote: »
    ersmith wrote: »
    Yep, that is funny, but it's what I got when I ran it on my DE2-115. Maybe the read from CNT sees an odd cycle because it's picking up the value partway through an instruction? Or there's some kind of hub interaction.
    Sounds worth investigating.
    Can you post the code in another thread, with an empty call as a comparison.
    Chip, or someone else may know the answer?
    If the reads are identical, it should not matter what phase effects are there, provided they are stable.

    Frankly I'm not too worried about it, but here's the code if you'd like to take a look.

  • jmg Posts: 10,455
    edited May 2016
    Maybe odd is not as odd as I thought. I assumed everything was paced at Opcode rates, but the P2 can wait to SysCLK granularity, and that seems to be exactly what happens here.

    I find this (bold added); it seems 39c is the normal delay through the CORDIC:

    "When a cog issues a CORDIC instruction, it must wait for its hub slot, which is 0..15 clocks away, in order to hand off the command to the CORDIC solver. Thirty-nine clocks later, results will be available via the GETQX and GETQY instructions, which will wait for the results, in case they haven’t arrived yet.

    Because each cog's hub slot comes around every 16 clocks and the pipeline is 38 clocks long, it is possible to overlap CORDIC commands, where three commands are initially given to the CORDIC solver, and then results are picked up and another command is given, indefinitely, until, at the end, three results are picked up"