Spin compiler - Page 3 — Parallax Forums

Spin compiler

Comments

  • CardboardGuru Posts: 443
    edited 2008-02-27 22:18
    dfletch said...
    Some of these aren't such a big problem I think rokicki. If we use the neat block delimiters that Phil suggested earlier, whitespace and comments can be completely absorbed by the lexer, and complex parsing rules for handling them are not necessary.
    When doing this, though, don't forget to consider that the parser needs to produce error and warning messages that give line and column numbers identical to those in the original source file. Somehow you need to preserve the information of where the tokens were before you dispose of the comments.

  • dfletch Posts: 165
    edited 2008-02-27 22:29
    CardboardGuru: Flex to the rescue! I believe it can pass the line/char numbers of each token to bison without trouble.
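    As a sketch of how that might look (assuming Flex's standard location hooks; "spinparse.hpp" is an illustrative name, not necessarily this project's): %option yylineno plus a YY_USER_ACTION macro can stamp every token with line/column before Bison ever sees it:

    %{
    #include "spinparse.hpp"   /* Bison token/location types; name assumed */
    static int yycolumn = 1;

    /* Runs before every rule's action: record where this token starts and ends. */
    #define YY_USER_ACTION \
        yylloc.first_line   = yylloc.last_line = yylineno; \
        yylloc.first_column = yycolumn; \
        yylloc.last_column  = yycolumn + yyleng - 1; \
        yycolumn += yyleng;
    %}
    %option yylineno

    %%
    \n    { yycolumn = 1; /* yylineno advances automatically */ }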

    I was more worried about losing the comments entirely - the Prop Tool does neat stuff with documentation. It would be cool if this parser could do some of that. Oh well, one thing at a time, eh? :)

    Cheers,

    --fletch

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-02-27 22:30
    Tabs in the input source are certainly a problem, especially if they're mixed interchangeably with leading spaces. Without knowing what the original tab settings were, there's simply no way to determine the correct indentation. The Propeller IDE, upon encountering a tab in a file it loads, replaces it with eight spaces.

    -Phil
  • CardboardGuru Posts: 443
    edited 2008-02-27 22:33
    rokicki said...
    What about tabs in the input source? A starting point might be, what does the Spin compiler do
    when *loading* a spin file containing tabs (as opposed to someone hitting a tab key on the
    keyboard).
    A quick try suggests that the IDE replaces each tab character with 8 spaces when loading a spin file. So the new compiler could just treat tab characters as being the same as 8 spaces, and also give a warning.
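    In Flex terms that could be a one-line rule; this is just a sketch, where `column` and `warn` stand in for whatever column tracking and warning machinery the lexer ends up with:

    \t    { column += 8;   /* the IDE treats each tab as 8 spaces */
            warn(yylineno, "tab character treated as 8 spaces"); }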

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-02-28 03:31
    I almost hate to suggest this, but since braced comments are character-oriented and indentation is line-oriented, might the simplest lexer be two-pass? Pass 1 would deal with the comments and pass 2 with everything else. In pass 1, the source would be treated as one long string; in pass 2, as an array of lines.

    -Phil
  • dfletch Posts: 165
    edited 2008-02-28 05:02
    Phil: Flex supports lexing states. So once the start-comment brace is detected, we can switch into "comment mode", and treat newlines differently than we do normally. When it hits the closing brace, it returns to the default state. So I think this can be done in a single pass.
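    Roughly like this, assuming Spin's { ... } comments can nest (hence the depth counter; drop it if they turn out not to):

    %x COMMENT
    %{
    static int comment_depth = 0;
    %}

    %%
    "{"             { comment_depth = 1; BEGIN(COMMENT); }
    <COMMENT>"{"    { comment_depth++; }
    <COMMENT>"}"    { if (--comment_depth == 0) BEGIN(INITIAL); }
    <COMMENT>\n     { /* newlines still counted for line numbers */ }
    <COMMENT>.      { /* swallow comment text */ }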

    Cheers,

    --fletch

  • dfletch Posts: 165
    edited 2008-02-28 06:50
    First minor snag: you cannot specify "+" and "-" twice in the list of precedence rules (pos/neg and add/sub have different levels). So for these I think I need to encode one of the pairs into the productions and remove it from the precedence rules.

    I hope it doesn't add too many new productions!

    UPDATE: Ugh, and this makes the expressions really ugly. I end up with rather arbitrary "high-level precedence" and "low-level precedence" sets, arranged around the precedence level of add/sub. I think maybe I'll revert to Phil's original suggestion and describe the precedence entirely in the productions. I hate things that slow me down, but I hate ugly hacks worse :)

    UPDATE2: Oh wow, Bison just keeps surprising me :) I can invent a token (e.g. UMINUS, UPLUS), put it in the list of precedence rules, and use %prec UMINUS in the neg expression and %prec UPLUS in the pos expression. Bison figures out the details. :cool:
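    Roughly like this in the .ypp (the levels here are illustrative, not Spin's actual precedence table):

    /* lowest to highest; UPLUS/UMINUS are invented tokens the lexer never returns */
    %left '+' '-'
    %left '*' '/'
    %right UPLUS UMINUS

    %%
    expr : expr '+' expr
         | expr '-' expr
         | expr '*' expr
         | expr '/' expr
         | '+' expr  %prec UPLUS     /* unary plus */
         | '-' expr  %prec UMINUS    /* unary minus */
         | NUMBER
         ;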

  • dfletch Posts: 165
    edited 2008-02-28 10:03
    The next issue revolves around the fact that, to the lexer and parser, constant and variable references look very similar in many circumstances. There are something like 500 conflicts left in the grammar, and I believe a good chunk of them are related to this.

    Stay tuned tomorrow night for a better update.

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-02-28 18:14
    Rather than making a distinction between constants and variables in the lexer, the distinction should be between symbols and literals. Literals can be resolved in the lexer. Symbols need to wait until the first pass through the parser has assigned them a type and inserted them into the symbol table. Because symbol declarations can occur after they are used (e.g. in a later-occurring DAT section), it will not be possible to compile a Spin program in one pass — at least not without an inordinate amount of post-compilation fixup.
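    As a minimal sketch of the symbol-table side of that in C++ (all names invented for illustration): pass 1 records every declaration, so pass 2 can resolve uses that occur before their declarations:

    #include <map>
    #include <stdexcept>
    #include <string>

    enum class SymType { Constant, Variable, DatLabel };

    struct SymbolTable {
        std::map<std::string, SymType> syms;

        // Pass 1: record each declaration, wherever it occurs in the source.
        void declare(const std::string& name, SymType t) { syms[name] = t; }

        // Pass 2: every use must now resolve, even if declared "later" in the file.
        SymType resolve(const std::string& name) const {
            auto it = syms.find(name);
            if (it == syms.end())
                throw std::runtime_error("undefined symbol: " + name);
            return it->second;
        }
    };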

    Nobody said this would be easy! :)

    -Phil
  • dfletch Posts: 165
    edited 2008-02-28 18:35
    Phil, yep, exactly. I think my mistake here was thinking I'd be able to have a separate set of constant-expression productions to help identify constant expressions. After yesterday's discussion of constant reduction (or folding, whatever it's called ;) this seems unnecessary, and I'm now thinking it needs only one set of productions that can be used for both constant and runtime expressions. Deciding whether a particular expression is constant is best left to another pass, after parsing.

    Cheers,

    --fletch

  • dfletch Posts: 165
    edited 2008-02-29 09:53
    Look ma, software! Well, not really: it doesn't parse anything yet. It does build, though :)

    Currently bison and flex are pretty essential if you want to check things out. Building requires make/automake/autoconf/libtool/g++. It builds fine for me in Cygwin with all those tools installed.

    There are still a couple of conflicts in the grammar. The first 950 were totally easy, and then the last 20 were really, really hard ;) The first is something to do with the unary expressions that can be a prefix or a suffix. The other issue is in the assembly part, in a rule called DataData - I'm not sure what's going on there. Any help with these conflicts would be awesome. The main source files are src/spinparse.ypp and src/spinlex.lpp.

    The original spin.lalr now lives in the root of this package, and will be maintained there.

    Tomorrow I'll continue work on the lexer, and I think it will parse its first bit of Spin!

    G'night.

    Cheers,

    (P.S. Sorry 'bout the .txt extension! Attachment manager is not tar.gz friendly!)

    --fletch

  • stevenmess2004 Posts: 1,102
    edited 2008-02-29 10:13
    Pardon my ignorance :) If I understand correctly, this just sets up a system that goes through the code and gives us the different operations/variables/etc.? I take it that this stage spits out what everything is (opcode, variable, constant) and then we write a compiler that accepts this input?
  • deSilva Posts: 2,967
    edited 2008-02-29 10:43
    :) There are many books to read about it... The very first book I read about compiler construction 25 years ago (Aho/Ullman) is still not bad for simple languages such as Spin...
    This one is free: www.oberon.ethz.ch/WirthPubl/CBEAll.pdf
  • stevenmess2004 Posts: 1,102
    edited 2008-02-29 10:57
    Thanks, I'll have a read. :)
  • hippy Posts: 1,981
    edited 2008-02-29 11:59
    That's a very good link from deSilva.

    Compilers are either multi-part, with distinctive processing stages, or single-part, where earlier processing ripples down into later processing. There's not a lot of difference in the end result; earlier parts analyse whatever they analyse and then pass on information about what they have determined and what it means to later parts.

    Lexical analysis is determining what the input stream is, and doesn't really care about what the language used is - "I have 10 cats" results in "<name> <name> <number> <name>", with each <name> or <number> carrying information saying what the name or number was, in some common format. For numbers, 10, $A, %1010 and any other accepted forms are all converted to a single, identical associated value at this stage. Later parts don't care how the number was written, only what the number is.
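    For instance, Spin's three integer notations might be normalised by lexer rules along these lines (a sketch; strip_underscores is a hypothetical helper that removes Spin's '_' digit separators before conversion):

    [0-9][0-9_]*        { yylval.num = strtol(strip_underscores(yytext), NULL, 10);
                          return NUMBER; }
    "$"[0-9a-fA-F_]+    { yylval.num = strtol(strip_underscores(yytext + 1), NULL, 16);
                          return NUMBER; }
    "%"[01_]+           { yylval.num = strtol(strip_underscores(yytext + 1), NULL, 2);
                          return NUMBER; }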

    Syntax analysis is determining that what comes out of lexical analysis is what would be expected. In Spin an assignment can be "<name> := <expression>" where <expression> can be of the form, <expression><operator><expression> or <name> or <number>.

    Thus "10 := cat" fails because <name> is expected first not <number>, "cat := cat 10 cat" fails because 10 isn't an <operator> it's a <number>, but "cat := cat + 10" does pass the test.

    Semantic checking also plays its part here. While <name> is <name> is <name> as defined by whatever a name would be in a particular language ( starts with a letter and followed by any number of letters or digits ), as a <name> on the left of an assignment it has to be the name of a variable, and must be the name of a writeable variable, not the name of a constant, not the name of a read-only variable. What <name> is and means will depend on how the name was defined elsewhere, so the compiler keeps a 'symbol table' of what has been defined and how. When <name> appears, it is looked up in the symbol table and a check is made that it's the right type of name to be used there.

    A single-pass compiler requires that all names are defined before they are used. A multi-pass compiler will read the source once to find all the names used and what they mean, and further passes can then process the source knowing what all names will be, even if they are defined later in the source. Spin is multi-pass, C and Pascal are single pass. Single pass is usually quicker but does mean everything has to be defined before being used, so "PUB main" finishes up at the end of the program. C has the ability to provide prototypes, so things not defined can be used in advance; a prototype indicates, "I will be defining this later, and this is what it will look like".

    When a statement has passed syntax analysis and the semantics are deemed correct, the compiler will have enough information to determine that "cat := cat + 10" means "take the value of variable 'cat', take the value '10', add the two together, put the result in variable 'cat'", and can generate the appropriate code for that - "PUSH cat, PUSH #10, ADD, POP cat" or something similar. This is code generation.

    A further step is optimisation; "cat := cat + 10 + 20" is, without optimisation, "PUSH cat, PUSH #10, ADD, PUSH #20, ADD, POP cat", but the compiler can optimise that to "PUSH cat, PUSH #30, ADD, POP cat", the equivalent of "cat := cat + 30". This optimisation may sometimes be done earlier, as part of the code generation itself.

    The PUSH, ADD, POP instructions are how the code would look for an 'abstract machine', the often imaginary but ideal processor for which it would be easy to generate code. These instructions are then converted to whatever instructions the target processor has that can implement what is required. For Spin there's a close relationship between the native bytecode instruction set and these 'abstract machine' instructions ( thus generating compact code ); for PASM there is much less of a correlation, and a number of PASM instructions would be needed to implement each abstract instruction ( larger code ).

    That's my version of "Compilers 101" :)

  • stevenmess2004 Posts: 1,102
    edited 2008-02-29 12:08
    Thanks. I get the idea. So once the lexer and parser (automatically converted to C code by flex and bison) are done, we just need to write the code generation bit. So that means we are nearly halfway there (not quite :) )
  • Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2008-02-29 12:51
    A few links for those of us who are learning new terms.

    Lexical analysis From Wikipedia, the free encyclopedia

    en.wikipedia.org/wiki/Lexical_analysis



    Bison - The YACC-compatible Parser Generator

    www.cs.princeton.edu/~appel/modern/c/software/bison/bison_toc.html

  • hippy Posts: 1,981
    edited 2008-02-29 13:59
    @ stevenmess2004 : Almost. A pure lexer will tokenise the code, so to take that earlier example of "fred = $decor decor", it will turn that into something like -

    NAME "fred"
    EQUALS
    NUMBER "$DEC"
    OPERATOR "OR"
    NAME "decor"

    The syntax checking will check the sequence "NAME EQUALS NAME OPERATOR NAME" is valid, which in a CON block it would be. The semantic checking would be to check that "fred" were not already defined as something else and that "decor" was defined and defined as another constant which can be used to create a value which "fred" will represent. "Fred" and its associated value would then be added to the symbol table.

    Phil's complaint over accepting "$decor decor" is that the lexer here takes $decor as two tokens, a hex number ($DEC) and an operator ("OR"), even though they blur together. The argument is that, as "OR" cannot be part of a hexadecimal number, it's better to flag that as an error ( require whitespace between the number and operator ), because the compiler may otherwise be misinterpreting what the programmer actually meant but inadvertently got wrong.
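    One way to get Phil's stricter behaviour out of Flex is to lean on longest-match: add an error rule whose pattern is the hex literal followed by a word character that cannot be a hex digit, so it wins whenever the two blur together ( a sketch; strip_underscores and the error hook are stand-ins ):

    "$"[0-9a-fA-F_]+[g-zG-Z]  { yyerror("whitespace required between number and word"); }
    "$"[0-9a-fA-F_]+          { yylval.num = strtol(strip_underscores(yytext + 1), NULL, 16);
                                return NUMBER; }

    On "$decor" the first pattern matches "$deco", which is longer than the "$dec" the plain hex rule would take, so the error fires instead of the text silently splitting into $DEC and OR.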

    Flex and Lex deliver lexical analysis ( tokenising ); Bison, YACC and similar deliver syntax checking ( verifying token sequences are valid ) and may also allow inclusion of semantic checking. Others may require the semantic checking to be written separately or added to what they produce.

    In general though, yes, it's simply a case of adding code generation once lexical analysis and syntax checking are there. Semantic checking ensures that what's written as code makes sense and helps generate the correct code. It is often possible to get a compiler working as a prototype without semantic checking, so one could say half the work would have been done.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-02-29 16:44
    Hippy,

    Your example, cat := cat + 10 + 20, is an interesting case for constant folding. It would fail to optimize under the simple scheme I posited earlier, since normally there would be no operations between two constants: the cat + 10 produces a variable value before the + 20 even comes along. To optimize this requires extra rules in the parser that recognize mathematical identities (e.g. that addition is associative). But such rules also have to be cognizant of the target architecture, in case overflow issues render the identities invalid in some cases.

    -Phil
  • hippy Posts: 1,981
    edited 2008-02-29 18:33
    @ Phil : This is probably how I came up with noting the flaw of optimising away '( cat << 8 ) >> 8'.

    This may more likely fall under peephole optimisation than constant folding of the syntax tree. You're right about the extra rules; they may have to get quite complex and be very specific. That's one of the reasons it can be so hard to debug optimisation, and the main reason I'd say leave optimisations out of a compiler as far as possible and put them in a separate pass, or at least isolate them in a single module which can be easily bypassed.
  • dfletch Posts: 165
    edited 2008-02-29 19:00
    stevenmess2004: Well, hehehehe, "halfway". Remember, in any software project, the first 95% of the time is spent writing code. Then the second 95% of the time is spent debugging ;) I'm guessing we're more like 1/8 to 1/4 of the way there.

    Phil, great point about that expression. With left association and the specified precedence, we get:

    cat := cat + 10 + 20

    cat := ( ( cat + 10 ) + 20 )

    None of those expressions looks constant to the parser. I'm going to need serious help with some of these advanced optimizations; I hope you guys are feeling up to the challenge!
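    Here's a rough sketch of the kind of tree rewrite that would be needed, against a made-up AST node type: reassociate (x + c1) + c2 into x + (c1 + c2) so plain folding can finish the job. (Safe for + on the Propeller, where 32-bit arithmetic wraps, but as Phil says, every identity needs checking against the target.)

    // Illustrative AST node: either a number or an addition.
    struct Node {
        enum Kind { Num, Add } kind;
        long value;          // meaningful when kind == Num
        Node *lhs, *rhs;     // meaningful when kind == Add
    };

    Node* fold(Node* n) {
        if (n->kind != Node::Add) return n;
        n->lhs = fold(n->lhs);
        n->rhs = fold(n->rhs);

        // Reassociate: (x + c1) + c2  ->  x + (c1 + c2)
        if (n->rhs->kind == Node::Num && n->lhs->kind == Node::Add &&
            n->lhs->rhs->kind == Node::Num) {
            n->rhs->value += n->lhs->rhs->value;
            n->lhs = n->lhs->lhs;
        }
        // Plain folding: c1 + c2  ->  c3
        if (n->lhs->kind == Node::Num && n->rhs->kind == Node::Num) {
            n->kind = Node::Num;
            n->value = n->lhs->value + n->rhs->value;
        }
        return n;
    }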

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-02-29 19:20
    dfletch said...
    Remember, in any software project, the first 95% of the time is spent writing code. Then the second 95% of the time is spent debugging.
    I'm a huge believer in debugging as you go. Write a small chunk, test and debug it, write the next chunk, etc. There's a non-linear (and I swear it's exponential) relationship between the size of any new chunk of code and the time it takes to debug it, viz:

    t_debug = k·e^n > k·e^(n1) + k·e^(n2), where the n's are the sizes of the new code chunks, and n = n1 + n2.

    Also, Hippy's admonishment is a sound one and bears repeating: keep the fancy optimization out of the parser.

    -Phil
  • dfletch Posts: 165
    edited 2008-02-29 20:04
    Phil: Yep, I agree to keep optimizations out of the parser proper. What I was suggesting (badly) is that "filters" that do these sorts of optimizations could be bundled with this library... I think an actual compiler should be a separate project, but many users of this library may want various stages of optimization. At the other extreme, someone writing an IDE or a plug-in for one may want the complete unoptimized parse tree! Here's what I'm thinking in C++ code:

    SpinParser p;               // produces the raw parse tree
    ConstantFoldingFilter f;    // optional constant-folding pass

    p.parse("somefile.spin");   // parse the source
    f.apply(p);                 // fold constant expressions in the tree

    This filter is completely optional, but can be used by projects that want optimized output.

    Re: unit tests - yes I agree 100%! This will be one of my first goals after the basic parser/lexer is built.

    BTW: Yay! It builds in Linux too :)

  • dfletch Posts: 165
    edited 2008-02-29 23:04
    Oh yes, I forgot to mention, since everyone is keen on learning how lexers and parsers work (yay! I love projects that encourage learning!): my favorite reference on this subject is the O'Reilly book Lex & Yacc.

    It covers most of the subjects that have come up in this thread. It has a neat example SQL parser in it too, and the index is very well organized. (No, I don't work for O'Reilly, just happy ;) )

  • dfletch Posts: 165
    edited 2008-03-02 18:21
    WOO! It parses some Spin! :)

    After build, do: `cat simpletest.spin | src/SpinParserTest`

    If it parses without errors, not much should happen. If there's a syntax error or an error in the parser making it *think* there's a syntax error, it will complain.

    To see more of what it's doing, do this before ./configure:

    export CXXFLAGS="-DDEBUG_LEXER"

    The parser still doesn't actually do anything, except verify that the tokens coming from the lexer are correct. Next steps are:

    1) Make Flex supply line numbers.

    2) Create a set of objects that are created during parsing (the output tree).

    3) Unit tests.

    Also, I'm wondering: anyone have some nice project space where I could continue working on this under revision control and maybe have a little web space for releases and snapshots? That would be killer! Please private message me or come to IRC to discuss.

    Cheers,

    --fletch

  • rokicki Posts: 1,000
    edited 2008-03-02 18:26
    Project space: I've already offered the SourceForge spinc project for this. If you want to use it, just let me know your SourceForge username and I'll add you as a developer.
  • dfletch Posts: 165
    edited 2008-03-02 18:40
    rokicki: SourceForge has this strange effect on me of destroying any will to actually work on the project. I'm not quite sure what's wrong with me :)

    The other issue is that this is *not* a Spin compiler, so the name is wrong for it. It only supplies the parser feature. Someone else should start the compiler project and link against this lib (or I will in a few weeks when this is stable - assuming nobody else does).

    Cheers,

    --fletch

  • OwenS Posts: 173
    edited 2008-03-02 18:42
    Google Code seems like a lightweight system, with SVN hosting, although I'll admit I've never used it.
  • rokicki Posts: 1,000
    edited 2008-03-03 02:25
    Hmm, that's odd; I've been quite happy with SourceForge. Indeed, I like their ranking of projects and the fact that it seems to be the current sector leader. And I think the intent is that it *does* eventually become a full compiler, so I see no reason not to put it on SourceForge under spinc. But it is your parser, so whatever you decide is fine.
  • rokicki Posts: 1,000
    edited 2008-03-03 03:02
    BTW, with a parser, we are a large part of the way towards using Eclipse as our IDE. Just thought I'd mention that. And with Eclipse, there are a *ton* of really cool features we may be able to support. Indeed, with Eclipse and the interpreter source code released, we may even be able to support single-stepping and debugging on-chip.