Spin compiler

dfletch · 2008-02-27 09:15

New version of the grammar. Now we're cooking with gas! Lots of changes:

Operator precedence (Thank you Phil!). Bison gives us a neat way to do it using the operators themselves (rather than making IMO rather convoluted rules). Take a look at the new first section. Slick, right

Constants have their own complete set of separate expression productions now.

If / if-else, case, return / abort, quit (Thank you Hippy! And shame on me! What was I smokin?!)

Cognew, string (they need special parser constructs).

Hippy: Re: in-built functions ( lookup, lookdown, strsize, strcomp etc ) .... Can't these be done in a function table? Must they really all appear in the parser? Seems to me, at the parser level, a method call is a method call..

unaries as statements ( var++ , --word[noparse][[/noparse]expr] ) This is a tough one, Hippy. If I just allow the expression as a statement *and* use it as a normal expression, that's a reduce/reduce conflict as far as I understand it. Open to ideas here!

Cheers,

--fletch

UPDATE: Oh, wait, I think I've got a wire crossed on the unary statements. What I did was make something called ExpressionStatement. Any expression can be a statement. Maybe there is no issue here.

UPDATE2: And then I do a test and that behavior is wrong

Obviously only those few expressions specified need to be statement. I'll separate out expressions that can be statements but I still think I'm seeing the reduce/reduce prob.

UPDATE3: Duh, got it. The newline at the end of the statement is enough to avoid conflict. There isn't newlines in expressions, right?

Must sleep. Start Bison work tomorrow.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

Post Edited (dfletch) : 2/27/2008 10:23:47 AM GMT

hippy · 2008-02-27 12:40

Constant-folding is an awkward one in Spin and any language used for bit-twiddling. Consider "var := ( var << 8 ) >> 8"; it's easy to consider that as 'do nothing' whereas its intent is to clear the 8 msb's. It may not be sensible that way, but it works, is valid, and some programmer will use it. We could have a command line switch to turn folding on or off but we'd never know if that had undermined the functionality of others' sub-object code.

Some constant folding and related optimisations can still be done, "var := (1*0) | (2*1)". Such statements aren't that unlikely where constants are currently used to 'conditionally compile' using existing means. Where there's no potential side-effect optimisation is okay.

One advantage of textual bytecode in intermediate files throughout ( with named labels for jumps and so on ) is that it's only the final assembler/linker which has to worry about code size changing. It may mean rebuilding expression evaluation stacks but I don't see that as too much of a problem. I'd prefer extra effort than complicating elsewhere.

A good optimiser between compiler and assembler would mean the compiler didn't have to worry about optimisations at all which simplifies things there, especially with optimising debugging-only if-then-else's and the like. I'd limit the compiler to simple and obvious dead code removal ( nothing after a JMP or RET until the next labelled bytecode instruction ) and leave the rest to an optimiser. At most a peep-hole optimiser to turn NOT/JPF into JPT and remove redundant '*1", ">>0", "|0" etc.

Phil Pilgrim (PhiPi) · 2008-02-27 17:13

Hippy,

I wouldn't consider var := ( var << 8 ) >> 8 a candidate for constant folding — at least as I understand the term — since there are no intermediate operations between two constants. While parsing and generating code, it should be easy enough to make the distinction: you'll have a stack full of operands (paralleling the expression stack in the interpreter) and an operation to emit. Whether or not you emit it will depend on whether the top one (for unary operations) or two operands on the stack are constants or not (i.e. have a constant flag set). If the operands are constants, just replace them with the result of the computation and flag that as a constant; if at least one is not, emit the code, and flag the presumptive result (now the top item on the stack, representing TOS in the interpreter) as a non-constant. This will continue until the expression has been completely parsed and the compiler stack is empty.

Beyond this kind of simple "optimizing", you're probably right about shunting responsibility for optimization down the line. As long as the intermediate code uses symbolic labels (presumptive of a separate linker even further downstream), why not? A separate linker also makes possible the incorporation of relocatable object modules, which could be handy if the compiler turns out to be slow.

-Phil

dfletch · 2008-02-27 17:32

The way I've pictured constant reduction is this: Expressions are just nodes in a tree. Nonterminal branches are followed, applying these rules as it goes. If both children of a binary expression (or the 1 child of a unary expression) are constant and terminal, then the expression is constant and can be reduced. After reduction, the parent expression is evaluated in the same way until root is reached.

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

rokicki · 2008-02-27 17:52

Is this grammar ready for bug reports? Specifically syntax that is handled differently by this and by the Prop compiler?
I have a set of test cases I'd like to throw at it, but I don't want to do it if you don't feel it is ready.

dfletch · 2008-02-27 18:01

rokicki: Well it is only 2 days old and no actual software has been written yet [noparse];)[/noparse]

I'm converting this to Bison tonight. I'll write a *very* quick lexer that doesn't quite handle all the variations in literals. After that, I'm sure we'll have a feedback cycle or two where Bison barfs on my grammar. I'll roll those changes back into the grammar doc. This first rev won't even build a set of objects, it will probably just dump the tree it has parsed. The dump could be used to compare actual output to expected output for test cases. So I think sometime within the next couple of days, or this weekend would be a good time to get me tests. May as well zip them up now though and upload, so they're here when I'm ready.

Thanks!

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

hippy · 2008-02-27 18:21

@ Phil, dfletch : You're both right; constant folding is to replace <const><op><const> with <newconst>. If or what <var><op><const> optimises to is a different matter. I'll put it down to a neuron misfiring

rokicki · 2008-02-27 18:54

Right, I think an excellent first goal is to get the lexer and parser *exactly* right, even before we worry about tree definition or construction. I think just this
will be entertaining, interesting, and challenging enough to keep this thread active for some time.

Getting whitespace handling (between tokens and at the front of lines) exactly right might be---well, challenging.

-tom

rokicki · 2008-02-27 19:55

Just so you know, here is one tricky thing to get right:

con
   decor=3
   x=$decor decor

There are more.

Phil Pilgrim (PhiPi) · 2008-02-27 20:06

Indentation should be easy to deal with. The lexer will need an indentation stack, but that's a minor addition. Thinking in terms of braces and semicolons as lexical markers, here's the procedure:

1. Add a "{" to the token stream after any REPEAT, IF <condition>, etc., and push the number of leading spaces in the current line, plus one, onto the indentation stack.

2. For any line whose indentation is less than the TOS, pop the stack and append a "}" to the token stream, repeating until the current indentation is greater than or equal to the TOS.

3. Insert a ";" into the stream to demarcate the end of each line that doesn't end in "{".

Once this is done, you'll have a line- and indentation-independent token stream ready to parse.

-Phil

Phil Pilgrim (PhiPi) · 2008-02-27 20:13

Tom,

I assume x=$decor decor would raise an error as soon as it hit the "o", since the lexer is expecting either another hex digit or a non-word character. (I hope you weren't suggesting that it be interpreted as x=$dec OR decor.

)

-Phil

rokicki · 2008-02-27 20:18

I hope you guys are planning to support DAT sections by this parser too.

Also, be careful what you treat as "lines", as in the following:

var
   long x
pub go
   if 1==1
{
comment             }x++
     x := {
}1+33
   else
      x--

Finally, we'll have to come up with some story on tabs that is consistent and meaningful.

dfletch · 2008-02-27 20:18

Phil: Brilliant! So in the parser, it it looks more like a traditional language with block start/stop tokens. I wonder if I can avoid sending whitespace to the parser at all using this idea and clever use of lexer states.

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

rokicki · 2008-02-27 20:19

It *is* interpreted as x=$dec OR decor by the current compiler.

Since the current compiler is our standard, *our* compiler *must* do the same thing.

You might argue this is a bug; I think it's just "the way it works" since there's no formal specification of Spin that I am aware of.

We need to accept *precisely* what the current compiler does. Any discrepancies will be support nightmares.

dfletch · 2008-02-27 20:20

rokicki: Yep there's DAT support in the grammar. It was so easy to add compared to the rest I decided just to throw it in there

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

dfletch · 2008-02-27 20:30

Well Flex matches the longest contiguous string. So with regexes:

\$[noparse][[/noparse]0-9|a-f]+
"OR"
[noparse][[/noparse]a-z|_]+ (for cname)

I believe this works out:

$decor decor

$dec <- matches a literal int
OR <- matches literal "OR"
<- whitespace
decor <- cname

I don't see this as a big deal, but I may be redfaced when I actually plug it in

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

jazzed · 2008-02-27 20:31

Seems like producing a C compiler would be about as easy to do as what's being attempted.
Anyone have a pointer to a doc that describes this project? Maybe a wiki page ?
Fletch since you are a regular in this thread, perhaps you can add a pointer to your .sig ?

Best of luck

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
jazzed·... about·living in·http://en.wikipedia.org/wiki/Silicon_Valley

Traffic is slow at times, but Parallax orders·always get here fast 8)

dfletch · 2008-02-27 20:35

Jazzed, this thread is currently the only doc and project home

I'm more than willing to discuss things in real time on IRC if you want to drop by (see my sig). But there is no official project home yet.

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

Phil Pilgrim (PhiPi) · 2008-02-27 20:41

rockiki said...
It *is* interpreted as x=$dec OR decor by the current compiler.

Ugh!, or as W. C. Fields would have said, "How unfortunate!"

In my mind, that's a serious bug in the current compiler. Are we to replicate its warts, then? What happens when Parallax fixes them? I'd favor a more proactive approach to stuff like this to stay ahead of the error-correction curve, otherwise we'll constantly be playing catch-up.

When the new compiler is fussier than the current one and flags more errors, I can see no harm in divergent behavior, since the programmer can simply correct the error to produce a version compatible with both. It's only when neither compiler, or just the current one, flags an error that trouble looms for differing behaviors.

-Phil

rokicki · 2008-02-27 20:42

No, I think producing a C compiler (ANSI C, or a reasonable subset thereof) will be at least 10x harder
than doing the Spin compiler.

Here's some more details on the lexer, with respect to brace-style comments:

' brace-style comments
pub go | x
   { comment }
   x++
   {{ comment }}
   x++
   { comment { } }
   x++
   { comment {{ } } }
   x++
   {{ comment { { {{ }}
   x++
   { comment { }}
   x++
   {{ comment "}}
   ' { comment
   x++

As you can see, comments starting with { terminate with the }, but *counting*
internal braces. Thus "{...}" is a comment as is "{...{}...}", but
"{...{}" is not a complete comment (missing the end brace).

Conversely, comments starting with {{ terminate at the very next }},
*ignoring* any internal braces or doubled braces. Thus "{{} }}" is a
comment, as is "{{ {{ }}" and "{{ } } } }}", but "{{ } }" is *not*
(no doubled brace at the end.)

On matching the longest: that works well and is what the Spin compiler
does, as far as I can tell. (That's how my Spin lexer in Java works.)

rokicki · 2008-02-27 20:57

On expression vs statement: this is almost certainly a parse-time thing; the statement

-x

has very different semantics than the expression

... (-x)

Further,

-x-1

is a valid expression but not a valid statement.

rokicki · 2008-02-27 21:02

> In my mind, that's a serious bug in the current compiler.

Well, it's a matter of opinion. It's certainly not a big deal. Nowhere does the Spin language reference manual say
"tokens are separated by whitespace" or any such.

> When the new compiler is fussier than the current one and flags more errors, I can see no harm in divergent behavior, since the programmer can simply correct the error to

I disagree strongly here. We want programmers to be able to develop using Parallax's IDE and feel confident their code can be used
with this command line compiler, and vice versa. I claim making our compiler work like Parallax's is not that difficult, gives us a great
way to test/validate it, and solves problems for the end user; any divergence from the Parallax compiler will be a source of complaints
and bug reports (and potentially bugs not found until someone *deploys* things in the field!). If we diverge, I can't be sure my Spin
code works in *both* until I've tested it in both. If we do not diverge, if we always generate *identical* binaries for *identical* input,
I know my code will work in both places.

Take home point: I do not want to *split* the propeller community; I want work-alike tools.

rokicki · 2008-02-27 21:12

Here's another one you're not going to like, Phil:

pub go | x
   x + = x
   x * {
           } = x

Assignment operators can have the "operator" and the = sign separated by arbitrary whitespace, comments, etc.

dfletch · 2008-02-27 21:25

Some of these aren't such a big problem I think rokicki. If we use the neat block delimiters that Phil suggested earlier, whitespace and comments can be completely absorbed by the lexer, and complex parsing rules for handling them are not necessary.

Cheers,

--fletch

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Join us in the Un-Official Propeller IRC channel: irc.freenode.net #propeller
Newbies, oldies, programmers, math professors, and everyone in-between welcome!
Propeller IRC howto

hippy · 2008-02-27 21:43

@ Phil : "Are we to replicate its warts" - I think we should aim to, and then add command line switches or whatnot to flag up those things which could be considered dubious. Whether the default is to accept various Spin constructs of the 'better choices' by default is going to be a matter of debate but we can have that later.

One thing we will need is a suite of test programs. Adding a 'fail' statement will be useful during testing; that will suppress any error and generate an error if one didn't occur on that line. Very useful for regression testing things that should fail as well as those which should pass.

@ rokicki : I hadn't spotted that one. Obviously it parses two tokens <op><=> rather than <op=>, which explains := separated generating an "operator expected" or something like that.

For the embedded { } including over new line, I would treat that at a higher level in the GetToken() code of the parser, thus it only ever returns with tokens not white space ( newline is returned ). Not having whitespace tokens passed down can really simplify a parser.

@ Phil : I like your idea of inserting <blockstart> and <blockend> in the token stream to deal with indentation.

@ dfletch : Another one to watch for is "--var++". No PropTool to hand but I believe that only one postfix or prefix operator can be used. Those also include ~, ~~ and ?

jazzed · 2008-02-27 21:46

rokicki said...

Take home point: I do not want to *split* the propeller community; I want work-alike tools.

There should be nothing wrong with producing "warnings" that may clue a developer
into suspect issues (except for the feature cost). Had i not read this thread, I would
have never dreamed of the abberation that spawned this piece of discussion.

I agree with the assertion that a work-alike is a good first goal; if you can do that
then you have basis for enhancements that someone can then use or not.
Let the market speak thereafter.

Defining anything by committee is hard. What do you call a horse designed by committee?

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
jazzed·... about·living in·http://en.wikipedia.org/wiki/Silicon_Valley

Traffic is slow at times, but Parallax orders·always get here fast 8)

Phil Pilgrim (PhiPi) · 2008-02-27 21:49

rockiki said...
Assignment operators can have the "operator" and the = sign separated by arbitrary whitespace, comments, etc.

The first anomaly was annoying enough. This one makes me ill. (I could live with +{...}=, but not + {...} = or + =.) Cripes! Can't we be better than that?

Also, you didn't address my point about playing catch-up to Parallax's own bug eradication efforts, for which your examples are sure candidates. IM<HO, if the junk allowed through the Parallax compiler adversely affects a program's readability, then divergence to meet a higher standard would seem the lesser of two evils.

Nonetheless, Tom, I have to give you an A+ for rooting out these absurd pathologies. They have to be dealt with one way or another! (And it's okay if we disagree on how. I guess one could always write a "lint" preprocessor...)

-Phil

Update: Jazzed may be onto something with the warning idea. I like it! Also Perl has something called use strict, which is a pragma that turns on stricter requirements. This could also work for Spin.

Post Edited (Phil Pilgrim (PhiPi)) : 2/27/2008 9:56:57 PM GMT

CardboardGuru · 2008-02-27 22:09

Phil Pilgrim (PhiPi) said...
In my mind, that's a serious bug in the current compiler. Are we to replicate its warts, then? What happens when Parallax fixes them? I'd favor a more proactive approach to stuff like this to stay ahead of the error-correction curve, otherwise we'll constantly be playing catch-up.

That reminds me of another objective of doing a new compiler.· To do warnings.· The ideal behaviour here (x=$decor·decor) would be to compile and produce exactly the same code as the Parallax Spin compiler does for this code, but also to output a warning suggesting putting some whitespace in there·between "dec" and "or".

The major use cases for warnings though would be for things like:

CMP without WZ or WC
JMP·without #
Self modifying code where the modification and the instruction are too close to each other
Using the <= and >= assignment operators by accident ( e.g. if a<=b rather than if a=<b )

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Help to build the Propeller wiki - propeller.wikispaces.com
Play Defender - Propeller version of the classic game
Prop Room Robotics - my web store for Roomba spare parts in the UK

Post Edited (CardboardGuru) : 2/27/2008 10:23:47 PM GMT

rokicki · 2008-02-27 22:10

Yeah, this sounds like a good middle ground (warnings for some of the oddities).

On the moving target: I don't think the current lexical "oddities" are really bugs, but more
simply artifacts of the way the current parser works. I suspect Parallax would be loathe to
"fix" them because doing so could break working code, and really, there's no point to that.

Even in C it is possible (even without the preprocessor) to write tremendously obfuscated
code by abusing syntactic features (consider trigraphs for instance). Normal people don't
write code this way, but sometimes it does show up; no particular reason to break it, as
long as the existing rules are not themselves dangerous or harmful.

My point on the multiline {}, though, was that you have to be careful *how* you absorb them;
your indentation is the character position of the start of the statement, not the number of
spaces or whatnot preceding the start of the statement. For instance (replacing spaces by
dots for readability):

......{comment
}x++

That statement starts at indentation position *2*. Similarly,

{comment...
.more.comments}x++

That statement starts at indentation position *15*. So you can't just *remove* the comments;
you have to remember the character position of the start of the statement *with* the comments
in place. Not that big of a deal, really.

rokicki · 2008-02-27 22:12

What about tabs in the input source? A starting point might be, what does the Spin compiler do
when *loading* a spin file containing tabs (as opposed to someone hitting a tab key on the
keyboard).

Spin compiler

Comments