Making your own programming language: I have not the slightest clue, so I have only questions
whiteoxe
I've always been a bit clueless as to what compilers do and what the difference is between compiled and interpreted languages, because don't they all end up being compiled? I have googled a few times, things like "what's a compiler" or "what is an interpreted language", but it never really helped my understanding. Maybe I should be more imaginative when typing a Google question.
If you were capable of writing a compiler, as some members of this forum are, could you make your own language? Let's say you wanted to write your own simple BASIC-like language. In fact, so simple it only had two or four keywords you just made up out of your imagination. Let's pick GOTOIT, PRINTIT, IF, THENGOAHEAD.
What steps are needed? Do you write in assembly and then assign a bit of that code to GOTOIT in some way? And do you find the assembly that is used for that particular chip?
Then do you compile to an .exe using a compiler you have written, or an open-source compiler?
Feel free to ignore this if it's too complex to answer easily, or if I'm not making any sense at all.
thx,
whiteoxe.
Comments
I recommend Jack Crenshaw's "Let's Build a Compiler" tutorial. You can read it here: http://compilers.iecc.com/crenshaw/
If you follow along with his code examples you will end up writing a compiler of your own. I even managed to create a compiler that generated PASM for the Propeller from a simple Pascal-like language. Jack does not give you a finished language design that you are going to compile, nor a finished compiler. He encourages you to create your own language syntax.
Jack starts off exactly as you describe, with a very simple language, beginning with the problem of how to parse an expression with only variables and the four arithmetic operators. Then he moves on to assignments, sequential execution, "if", "then", loops, function calls and so on, all in very easy-to-follow steps.
P.S. Jack's examples are in Pascal. I wrote C versions as I read his articles. My finished compiler generated PASM rather than going directly to binary machine code. The PASM was assembled by a Propeller assembler from Cliff Biffle.
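For a flavour of what those first steps look like, here is a minimal sketch in C, in the spirit of Crenshaw's "cradle" but with my own names and output format, not Jack's code: a top-down recursive descent parser for single-digit numbers and the four arithmetic operators, emitting instructions for an imaginary stack machine.

/* Minimal recursive descent expression parser, sketched in C.
   Parses single-digit numbers with + - * / and parentheses and
   emits pseudo-assembly for an imaginary stack machine. */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static int look;                          /* one character of lookahead */

static void advance(void) { look = getchar(); }
static void fail(const char *msg) { fprintf(stderr, "error: %s\n", msg); exit(1); }

static void expression(void);             /* forward declaration */

static void factor(void) {                /* factor = digit | '(' expression ')' */
    if (look == '(') {
        advance();
        expression();
        if (look != ')') fail("expected )");
        advance();
    } else if (isdigit(look)) {
        printf("    PUSH %c\n", look);
        advance();
    } else {
        fail("expected digit or (");
    }
}

static void term(void) {                  /* term = factor { ('*'|'/') factor } */
    factor();
    while (look == '*' || look == '/') {
        int op = look;
        advance();
        factor();
        printf("    %s\n", op == '*' ? "MUL" : "DIV");
    }
}

static void expression(void) {            /* expression = term { ('+'|'-') term } */
    term();
    while (look == '+' || look == '-') {
        int op = look;
        advance();
        term();
        printf("    %s\n", op == '+' ? "ADD" : "SUB");
    }
}

int main(void) {
    advance();
    expression();
    return 0;
}

Feed it "1+2*3" and it prints PUSH 1, PUSH 2, PUSH 3, MUL, ADD. Each grammar rule becomes one function; that is the whole trick of top-down recursive descent, and it is why the technique extends naturally to "if", loops and the rest.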
As you know, computers can only run machine code. But it's possible for a human to hand-assemble a program using an opcode table. Back in the 80s I did that for a microprocessor class, and given enough time and effort it's possible to build programs on the scale of 2-4 KB. Using machine code it is possible to write an assembler and monitor, which allow the programmer to enter code using assembly mnemonics and let the assembler calculate branch offsets. Things start to pick up speed at this point, because the assembler can be used to write much more sophisticated development tools. The programmer would write device drivers to access mass storage, and then write a loader to load and execute binary images. At this point the programmer can use the assembler to write a compiler and save the image to mass storage. From then on the loader loads the compiler, which compiles source into object code and links it into a binary image; the loader can then load and execute the result.
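To make the hand-assembly step concrete, here is a sketch in C of the two aids just mentioned: an opcode table and the branch-offset arithmetic an assembler automates. The mnemonics and encodings are for an invented 8-bit CPU, purely for illustration; real opcode tables come from the chip's data sheet.

/* Hand-assembly aids, sketched in C for an invented 8-bit CPU. */
#include <stdio.h>
#include <string.h>

struct op { const char *mnemonic; unsigned char opcode; int length; };

static const struct op optab[] = {
    { "NOP", 0x00, 1 },  /* no operation                    */
    { "LDA", 0x10, 2 },  /* load accumulator, immediate     */
    { "ADD", 0x20, 2 },  /* add immediate to accumulator    */
    { "BNE", 0x30, 2 },  /* branch if not zero, PC-relative */
    { "HLT", 0xFF, 1 },  /* halt                            */
};

static const struct op *lookup(const char *m) {
    for (size_t i = 0; i < sizeof optab / sizeof optab[0]; i++)
        if (strcmp(optab[i].mnemonic, m) == 0) return &optab[i];
    return NULL;
}

int main(void) {
    /* The chore the assembler removes: a PC-relative branch offset is
       the target address minus the address of the *next* instruction.
       Getting this wrong by hand is easy; an assembler never does. */
    int branch_addr = 0x08, target = 0x02;
    const struct op *bne = lookup("BNE");
    int offset = target - (branch_addr + bne->length);   /* -8 here */
    printf("BNE encodes as %02X %02X (offset %d)\n",
           bne->opcode, offset & 0xFF, offset);
    return 0;
}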
Now, the above is hard work, and most of the time programmers use cross-development tools on an existing environment to create tools for a new environment, skipping the bootstrapping step. But somewhere in the creation of that tool chain, bootstrapping was used. I didn't describe how to actually build a compiler; learning that usually requires one of the classes I mentioned above. It's certainly possible to teach yourself how to do it, but it's more a topic for a book than a forum discussion.
Interpreters are usually written using a compiler. They don't translate a program into machine code the way a compiler does. Instead they read the source code and call functions within the interpreter to perform the required actions. Again, how to write one of these is a topic for a book.
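As a toy illustration of that last point (my own, not from any book): a direct interpreter for two invented statements, borrowing the PRINTIT and GOTOIT keywords from the original question. Each trip through the loop re-reads the text of a statement and dispatches to a function inside the interpreter.

/* Toy direct interpreter for two invented statements. Note that the
   text of every statement is re-examined each time it executes;
   compiling to bytecode is how real systems avoid that cost. */
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static const char *program[] = {     /* a hard-wired "source file" */
    "PRINTIT hello",
    "PRINTIT world",
    "GOTOIT 0",                      /* jump back to line 0         */
};

static void do_printit(const char *arg) { printf("%s\n", arg); }

int main(void) {
    int pc = 0, steps = 0;
    while (pc < 3 && steps++ < 10) { /* cap at 10 steps for the demo */
        const char *line = program[pc++];
        if (strncmp(line, "PRINTIT ", 8) == 0)
            do_printit(line + 8);    /* call a function to do the work */
        else if (strncmp(line, "GOTOIT ", 7) == 0)
            pc = atoi(line + 7);     /* "perform the required action"  */
    }
    return 0;
}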
Of course, and there are even tools to make this a little easier, as it is a common requirement.
e.g.
http://en.wikipedia.org/wiki/Yacc
That is why Jack Crenshaw's "Let's Build a Compiler" is so fantastic. It's written in a clear and simple way. It does not use secret compiler-writers' terminology. The code examples are dead easy to follow. You need only rudimentary programming skills to understand it.
If the only thing you knew about compiler creation was Jack's articles, you would not imagine compiler writing was such an advanced topic. A compiler is just another program you can write, right? It need not even be big; a few hundred lines of C will do it. It need not take teams of people man-years to create.
Of course, that simplicity means any compiler you build from that base knowledge is not going to generate very optimal code. But it's a huge start in understanding, and the techniques Jack shows can be put to good use in many programming situations outside an actual compiler.
Yes, there are tools, parser generators like yacc and many others. I would not suggest their use for anyone wanting a basic idea of how lexing, parsing and compiling work. Those tools hide exactly the thing you want to learn about.
Yacc and friends are not very easy to use unless you are steeped in language-construction lore already. By contrast, writing your own top-down recursive descent parser and code generator in the style described by Jack Crenshaw is very easy to understand, teaches you a lot, need not be very big or long-winded for a simple language, and is very satisfying.
http://dinosaur.compilertools.net/
However, a couple of times I have heard people describing languages they designed and the compilers they built for them, who said they specifically did not use such tools, because writing the parse rules can be very hard and/or the tool is not flexible enough to do what you want in your language's syntax.
Seems to me it goes like this:
1) A language is designed and a compiler written for it.
2) Other languages are designed and compilers written for them.
3) It is noticed that the same work, lexing and parsing, is being done over and over again, common to all these languages and compilers.
4) That lexing and parsing work is abstracted out into tools that do just that.
5) Future languages are designed and built with those tools.
That seems to have been a general outline of language technology progression since Fortran.
Of course, by step 5) people have lost the skill to do 1); they use a tool rather than write their own code. It's rather like using C instead of assembler.
There is also Niklaus Wirth's "Compiler Construction": http://www.inf.ethz.ch/personal/wirth/CompilerConstruction/index.html
Includes source for the compiler.
If you want to really go whole hog, he has written a book called "Project Oberon: The Design of an Operating System, a Compiler, and a Computer".
In it he develops a simple 32-bit RISC processor in Verilog on a Spartan FPGA board, and writes the OS and compiler in Oberon. The nice thing is that he takes the reader through what it takes to bring up a system from scratch. It's quite readable, and so is the Oberon code for the compiler and the graphical OS. All done in under 200 pages.
http://projectoberon.com/
Has sources for everything there.
You of course always have the option of programming directly in Assembly Language.
Maybe a little misinterpretation there ... many of us have had the same impression. Spin is not interpreted line by line. Spin is compiled to bytecode, which is loaded into the Propeller and "interpreted" by a virtual machine in a COG core. The main reason Spin is slower than C is that it is a stack machine, where all code and data are accessed from a stack that lives in HUB RAM. C variants use registers (in addition to the stack), which live in COG RAM and give more efficient local variable access and pointers.
I guess it was good that it could run on different operating systems, but so can C and the plus plus!
The language I found hard, and I only did three languages, was COBOL. I had to do lots more work, coding line after line, for things that VB did with one click, like adding a calendar component; I'm not sure if what I was adding was a library or a DLL or what. VB actually sounds like it runs in its own environment or machine, like Java. That's not something I easily understand when VB is going to run on just Windows machines; maybe it's because the Windows OS changes, and that's why it goes through that CLR thing?
Anyway, thanks again. It's very interesting.
Quite true. To expand a bit: the Propeller tool, and others, are actually compilers. Their output is a binary executable file. However, unlike binaries that contain instructions for some actual processor hardware, they contain "bytecodes": binary instructions for a processor that does not exist in hardware, a "virtual machine". That virtual machine is the bytecode interpreter that runs in a COG, which in turn executes your program's bytecodes. This is only partially the reason Spin is so slow.
An actual hardware stack-based machine may well always be slower than the more common processor architectures with lots of registers, which can be used as fast temporary storage to optimize away many memory accesses.
However, the main slowdown with Spin is not its stack-based nature but the fact that the interpreter has to execute a hundred or so PASM instructions for every bytecode it interprets. One could imagine the Propeller COGs executing those bytecodes directly in hardware, instead of the current instruction set. Then Spin would be a hundred or so times faster.
Perhaps still slower than a register based machine though, as you say.
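For a feel of where those hundred instructions go, here is a hedged sketch in C of a bytecode stack machine's inner loop. The opcodes are invented and it does not mirror the real Spin interpreter, but the shape, fetch, decode, dispatch and stack traffic, is exactly the overhead being described.

/* Minimal bytecode stack machine, sketched in C. Every trip around
   the loop is work a hardware instruction would not have to pay. */
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_MUL, OP_PRINT, OP_HALT };

int main(void) {
    unsigned char code[] = {         /* bytecode for print(1 + 2 * 3) */
        OP_PUSH, 1, OP_PUSH, 2, OP_PUSH, 3,
        OP_MUL, OP_ADD, OP_PRINT, OP_HALT
    };
    int stack[16], sp = 0, pc = 0;

    for (;;) {
        switch (code[pc++]) {        /* fetch and decode */
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_MUL:   sp--; stack[sp - 1] *= stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]);      break;
        case OP_HALT:  return 0;
        }
    }
}

Notice also how dense the code[] array is compared with one 32-bit native instruction per operation; that compactness is the other half of the story, as mentioned below.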
Mike, jazzed corrected the misconception of what an interpreter is. It could, and used to, mean reading text line by line, slowly, as in the original BASIC, and then performing the action: great for simple programs and Wumpus games. But "interpreter" these days, and in the context of the Prop, normally means interpreting intermediate compiled code. The reason Spin produces "interpreted bytecode" is simply that each cog can execute at most 496 instructions. But the hub, which can be read and written much like a peripheral, can be used to read in "bytecodes", each of which represents an instruction to the Spin virtual machine that runs as machine code in the 496 longs of the cog. For instance, my Tachyon Forth runs bytecode from the 32K hub memory, although differently from Spin, and achieves a speed of around 1/10 of machine code, although some of the operations are more complex etc.
Java was hyped a lot back in the day for being able to deliver "applications" over the web. Remember Java "applets" in your browser? That never took off. It was awful :)
However, Java has been immensely popular and is used everywhere on servers, running web sites and banking and God knows what else.
Java has been one of the most used languages for many, many years. Check here, for example, for an interesting look at which languages are "hot": http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
Java became the new COBOL. It's awful :) and no one wants to touch it.
Personally, I think Java has no reason to exist. It does not offer any interesting features over other languages.
If Spin were a truly interpreted language, reading your program source line by line and doing what it says, it would be massively slower than it is now. Also notice that the compiled bytecode binary files are much smaller than your original source code; you would not get as much code into the small space of the Propeller. And the bytecode interpreter itself is very small: it fits in a COG. A real source-level interpreter would be much bigger and would not fit, even in HUB! A Propeller cog executes its 32-bit native instructions from inside the COG, so it can only run programs at full speed that fit in the COG. That's only 496 instructions! Writing a compiler that generates code for that small space is not generally useful.
Those bytecodes can be read from HUB by the interpreter, so you can have much bigger programs sitting in the relatively huge HUB RAM space.
Those bytecodes are much smaller than 32-bit native Propeller instructions, so you can fit a lot more functionality into your programs.
It was part of a programming class, and much of the work followed the lessons in class, so that made it easier.
We had a well-defined syntax to work from, and a very simple theoretical CPU/machine target to compile for, too.
Floating point didn't happen, no includes, macros or conditional compiles.
The project was supposed to be a group effort, and to take the entire semester.
(I was one of the morons doing it alone. And I tossed in some basic code optimisation, too. But then, I didn't have a social life then, either... )
That language was stack-based. Doing the same on a CPU/MCU without a stack... would be painful...
(Meaning; I'm very, very impressed by anyone making a stack-based compiler for the Propeller )
Want to know what the final assignment of that class was?
To write an interpreter for the code and prove that it could run the output from your compiler correctly.
Some of the better compilers had not only code optimisation but also boundary checking, and trapped 'divide by zero' and other errors, too.
(Both as compile-time checks and as code inserted into the program)
It is a nice exercise, though, to try creating a compiler for a new MCU/CPU. Even if you don't finish it, you will learn a lot about the HW.
If the plan is to cross-compile to a MCU with limited resources, then yes, an assembler is a very good tool.
(The compiler should then output a text file with original source included as comments, all in a format the assembler can parse, so that the code can be hand optimised)
If you're building 'from the ground up' on the target platform itself, then a proper assembler is a real necessity.
But for any reasonably large system, with at least a program monitor and some way of transferring a program to it in binary form, the need for an assembler drops off rapidly.
You can call Java awful :) but I call it feeding my family.
Besides web applications, it's also used by Google as the suggested user-land language for Android programming. Google's clean-room re-engineering still resulted in a lawsuit from Oracle. Microsoft's C# is basically a lawsuit-safe Java variant and is used to build GUIs on Windows.
Java's success came from being in the right place at the right time. After JIT-ing was added to the language, its performance was high enough to do something useful. The portability of the JVM ensured that purchasers could get vendors to compete with each other in a way that previous technologies didn't allow. If IBM charged too much for WebSphere, you could call Oracle and ask them for a price against WebSphere, or go open source with Tomcat. That was huge, because the prior vendor lock-in was a huge headache.
@Gadgetman, could you elaborate on that a bit? I can't make any sense of it.
Presumably, if you are transferring binaries to your target, those binaries have been created somehow. Either you have a compiler that directly outputs binary executables, or it generates assembler source which is then assembled into the binary executable by an assembler.
So whether there is an intermediate assembler stage or not has nothing to do with the target system.
I have yet to see a system where the compiler on the host generates assembler source to be assembled on the target.
Really, I'm not following you and may have misunderstood what you wrote.
@Martin_H,
WARNING: Language war imminent.
I have no doubt that Java has proved very useful and productive and feeds the families of many.
What I claim is that it is a pretty poor language. As it does not introduce any new interesting features to the language landscape, one has to wonder why it was devised. Indeed, it was all an accident.
Correct me if I am wrong, but Java was first developed as a language for embedded systems. Believe it or not, it was destined for set-top boxes, TVs and suchlike consumer gizmos.
That failed miserably, of course, because embedded systems, certainly at the time, did not have the resources to run it.
Then the web happened, and specifically the Netscape browser. They wanted a language to deliver some application-style logic to web pages. The dreaded applets.
That failed miserably too because, well, it was awful :), terribly slow, had no integration with the browser, required a Java runtime to be installed, etc, etc. The failure of Java in the browser was one of the biggest software engineering flops in history.
They did not give up.
Somehow the new generation of server software engineers latched onto it. And here we are today.
Java, like COBOL, will be around for a long time to come. Like COBOL providing a living for a generation of programmers into their retirements.
I also studied pfth in comparison to the Zen Forth model.
http://www.ultratechnology.com/efzen.htm
http://www.offete.com/eforth1.html
There are two versions of the Zen Forth document: one is long and tries to talk a lot about Zen; the other is shorter and just gets on with explaining what the Zen Forth model is... a very minimal amount of assembler, with the majority of Forth written in Forth.
I suppose you could get involved with an interpreted Basic or something else, but my own preference is the minimal approach that eForth and Zen Forth try to achieve. For a beginner, why spend any more time and effort coding in Assembler than required?
Probably because it's easier to get anything done in PASM than it is in Forth. Forth is the only so-called high-level language that is harder to work with than assembler, and about as portable. Best of both worlds :)
When the .NET platform came out I moved from VB6 to VB.NET; C# just seemed too difficult at the time.
Fast forward about 7 or 8 years: I had to take a couple of Java classes when considering grad school. After those classes C# was no longer intimidating, and I moved from VB.NET to C# and never looked back.
The point of that is that taking those Java classes was almost like taking C# classes; they are that similar.
As far as books on compilers and interpreters go, I recently got a copy of "Writing Compilers & Interpreters" by Ronald Mak; so far it seems like a pretty good resource.
C.W.
C# is of course basically Java from Microsoft. They pretty much had to do that after Sun sued them over their use of Java.
Oracle continues this tradition by suing Google over their use of Java.
No wonder nobody wants to touch that time bomb.
Thanks for the book recommendation. It sounds like he takes the same approach as Crenshaw: a hand-made top-down recursive descent parser, no lex/yacc nonsense. I must check it out.
* Java handles the conversion from network byte order to host byte order, and vice versa. That eliminates a whole class of bugs you encounter with C++ (see the C sketch after this list).
* Java enforces a reasonably strong abstraction of the underlying hardware with a class library to do most of what you want. This means that a 100% pure Java program can port easily between AIX, Solaris, and Linux. This was probably a tactical error on Sun's part and contributed to applications being ported from Solaris to Linux on x64.
* Java is compiled to bytecodes, then to machine code during execution. It's not as fast as C code, but it's much faster than something like Ruby or Python, although the JIT-ing technique was later applied to other languages (e.g. JavaScript, Jython, Groovy), which has reduced this advantage.
* It had a multi-generational garbage collector which was pretty speedy in multi-threaded server environments. This could defer full GCs and improve performance over other languages.
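To illustrate the first bullet, here is a minimal sketch in C. The htonl() and ntohl() conversions come from the POSIX <arpa/inet.h> header; the value is arbitrary. Forgetting a conversion on either end of a socket is the classic bug Java's I/O streams take off your hands.

/* The byte-order chore Java hides: every multi-byte value crossing
   the network must pass through htonl()/ntohl() in C. */
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void) {
    uint32_t host_value = 0x12345678;
    uint32_t wire_value = htonl(host_value); /* host -> network order */
    uint32_t round_trip = ntohl(wire_value); /* network -> host order */
    printf("host %08x  wire %08x  round trip %08x\n",
           (unsigned)host_value, (unsigned)wire_value, (unsigned)round_trip);
    return 0;
}

On a big-endian host the three values print identically, which is exactly why the bug can hide until the code meets a little-endian machine.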
Ironically, Google's version of Java for Android may yet allow Java to succeed in set-top boxes. Unless Oracle/Sun's litigious nature convinces Google to drop it in favor of JavaScript on Android.
Android Java programming is pretty ugly to me, and I know Java. Qt5 on Android is much easier, but the user interfaces are less developed. I've been looking at AndroidScript, which is basically JavaScript with Android libraries. I noticed there is a multi-platform JavaScript alternative called PhoneGap ... I suspect it's closed source (or something else abusive) because it's the product of a for-profit US corporation with a history of exorbitant pricing (I could change my mind on that after further research ... no time to look today).
An assembler (http://en.wikipedia.org/wiki/Assembly_language) translates mnemonics and labels into the machine code (ones and zeros) of a particular CPU.
A compiler (http://en.wikipedia.org/wiki/Compiler) usually translates a high-level language into the machine code of a particular CPU.
Interpreters (http://en.wikipedia.org/wiki/Interpreter_%28computing%29) scan and execute the program directly, translating each statement into a sequence of one or more subroutine calls that are already compiled into machine code.
Bytecode interpreters (http://en.wikipedia.org/wiki/Interpreter_%28computing%29#Bytecode_interpreters) are part compiler and part interpreter. Generally the high-level language is compiled into bytecodes that are then executed by subroutines in the interpreter.
(A text mode output should not be the only output choice)
What I mean is that if you have a cross-compiler and a way of transferring the compiled code to a system, you usually don't need an assembler, unless you intend to hand-optimise the code after compilation.
Sure, an assembler is an essential tool if you're bootstrapping a system, but frankly, I hope no one really has to do that these days.
(Starting with a completely blank system, entering the first hand-coded monitor using switches, then slowly adding to it. There's a reason why the first 'additional' code used to be a serial transfer program or something to read off a storage device. I tried the serial route on an 8086 trainer with 7-segment LEDs, a HEX keypad and a very unforgiving monitor. A programming assignment... it sucked big time... )
Of course, to be able to design a decent compiler for any given platform, you essentially need the same knowledge needed to write good assembly code for it.
(Good understanding of the MCU/CPU instruction set, intimate knowledge of the surrounding HW, and any OS APIs that exist.)
I'm just happy that the compiler-writing assignment I had never required us to handle a 'PRINT' command...
(This is actually a bit harder than it sounds. Ideally, the OS or whatever ROM exists in the system has some sort of API for this, or you'll need to add a whole lot of code, memory handling and variables.)
Compilers tend to have lots of stages: lexing, parsing, compiling, optimizing, code generation, assembling, whatever. Between all these stages are various different representations of the program, including assembly language.
Of course you almost never notice any of this, as it is all pipelined through in a way you would never see.
GCC and LLVM/Clang certainly have an assembler stage. My compiler did :)