Not sure exactly which thread you're referring to (there are so many!) but I don't think I would have simply said that size never matters (I'm sure you'll correct me if I'm wrong there) - in some cases, of course, size matters on the Propeller, since (without XMM) we only have 32KB to play around in.
I think I do recall saying that size doesn't matter for some Prop users - i.e. those who only use the Prop for small applications. But in other cases, size does matter.
What I said most recently was that the commonly expressed belief - that LMM C would be too large to be useful because it was of the order of four times the size of bytecoded C (an approximation I also used to quote) - was incorrect.
In practice, even without optimization, programs coded in LMM languages should only be about 2 times the size of equivalent programs coded in languages that use SPIN byte codes.
With good optimization this can be even less - some of the simplest optimizations are simply impossible to achieve when using SPIN byte codes. Actually, after seeing some of the byte code that the SPIN compiler actually generates, I may revise my estimate of the overhead of LMM programs down even further!
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
RossH said...
In practice, even without optimization, programs coded in LMM languages should only be about 2 times the size of equivalent programs coded in languages that use SPIN byte codes.
I'd be delighted to be proven wrong, but I'm not convinced of that, as Spin bytecode uses some clever tricks: numbers take 1 to 5 bytes; variable and stack accesses and function and method calls take 2 or 3 bytes. If LMM doesn't embed any operands in the opcode, all of these become 8 bytes; otherwise, 4 or 8 bytes.
It does depend on the particular code and sequence required to achieve something, but LMM is stuck with its 4-byte granularity where Spin can go to one byte. This is why I suggested three levels of C: 8-bit bytecode, 16-bit thumb-style and 32-bit LMM. If the Prop had thumb-style expansion hardware or purpose-built opcodes, that would be superb for high-speed LMM-style execution with a reasonable code footprint, but it doesn't.
32-bit LMM is, however, only a real issue for hub RAM. With suitable XMM memory, optimisation is IMO more about improving speed through code change than reducing footprint. If programs are 8 times larger I don't think it's unreasonable, nor really a problem, once it's accepted that XMM has to be used for all but trivial C programs.
That C most of the time likely requires XMM and uses I/O resources is a more fundamental problem, and may make alternatives look more appealing, but there's not a lot of choice - the Prop simply isn't designed for small C code with fast execution. LMM is essential for fast execution, that's a fact of life, and everything then follows from that.
Some things are really efficient in Spin. The expression x++ only requires two bytes and it doesn't use the stack. The variable x has to be within the first 8 long locations in the VAR section or in the list of local variables. On the other hand, the expression x := x + 1 could take 8 bytes if the relative address of x is 128 or greater. And it pushes and pops from the stack three times.
It's true that LMM is stuck with 4-byte granularity, whereas SPIN byte codes have 1-byte granularity, but when you look at actual examples you realize this doesn't mean the size ratio of LMM:SPIN will be anything like 4:1. Even with a non-optimizing compiler the ratio could be much closer to 2:1. I may have my byte count wrong in some of the following cases, but you'll get the general idea ...
First consider loading and saving constant operands. When using immediate mode, LMM can load a byte (actually 9 bits) using 4 bytes. To do the same in SPIN takes 2 bytes (3 bytes if the operand requires 9 bits, but that's quite rare so we'll ignore it). So the ratio of LMM:SPIN is 2:1.
Loading 2-byte operands (such as hub addresses on the Prop I) takes 8 bytes in LMM, whereas SPIN takes 3 bytes. Ratio of LMM:SPIN is 8:3.
Loading 4-byte operands (e.g. integers and floats) takes 8 bytes with LMM, whereas SPIN takes 5 bytes. Ratio of LMM:SPIN is 8:5.
So all cases are fairly close to 2:1 - SPIN wins out on some, LMM wins out on some. Figuring out the exact ratio depends on the distribution of 1-, 2- and 4-byte loads, but overall I'd say the ratio of LMM:SPIN to load constant operands is probably going to be fairly close to 2:1.
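To make that arithmetic concrete, here's a small sketch that weights the three cases above. The 60/25/15 mix of 1-, 2- and 4-byte constant loads is my assumption, purely for illustration - real code will have a different distribution:

```c
/* Weighted average of the per-load byte costs quoted above.
   The 60/25/15 mix of 1-, 2- and 4-byte constant loads is an
   assumed distribution, for illustration only. */
double load_ratio(void)
{
    static const double lmm_bytes[]  = { 4.0, 8.0, 8.0 };  /* LMM cost  */
    static const double spin_bytes[] = { 2.0, 3.0, 5.0 };  /* SPIN cost */
    static const double mix[]        = { 0.60, 0.25, 0.15 };

    double lmm = 0.0, spin = 0.0;
    for (int i = 0; i < 3; i++) {
        lmm  += mix[i] * lmm_bytes[i];
        spin += mix[i] * spin_bytes[i];
    }
    return lmm / spin;
}
```

Under this (assumed) mix the ratio comes out at roughly 2.07 - i.e. very close to the 2:1 figure.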
Next consider opcodes that actually do something - like ADD. The SPIN opcodes generally work on the stack (i.e. the instruction itself doesn't have any operands) and therefore only take one byte, whereas an LMM ADD always takes 4 bytes. But in SPIN the operands have to be loaded on the stack beforehand, whereas in LMM the ADD instruction can include one of the operands - so you have to compare the whole sequence of load-and-add in both cases.
To load and add two 1-byte constant operands takes 8 bytes in LMM, whereas in SPIN it takes 5 bytes. Ratio LMM:SPIN is 8:5.
To load and add two 2-byte operands takes 20 bytes in LMM, whereas in SPIN it takes 7 bytes. Ratio LMM:SPIN is 20:7.
To load and add two 4-byte operands takes 20 bytes in LMM, whereas in SPIN it takes 11 bytes. Ratio LMM:SPIN is 20:11.
Again, all around 2:1. Sometimes better, sometimes worse - but overall the ratio will hover around 2:1. And again, to get a more accurate estimate we'd need to know the distributions of the various operations and the various operand sizes.
So far we have only loaded and operated on operands. We haven't actually saved the results anywhere yet. In SPIN, the results are calculated on the stack, and to save them anywhere useful will generally require another opcode - i.e. at least 1 more byte, but more likely 3 if we want to save to an arbitrary hub RAM location. But in LMM we can actually have nearly the whole cog to use as storage - so in 75% (or more) of cases we have already finished (i.e. the LMM ADD instruction has left the result in a cog location that represents the register variable we wanted to save it in anyway). Even if the variable is not in a register, there is actually quite a good chance that its address is (more about this below), so in most of the remaining cases it is only another 4 bytes to save the result, with only a few cases requiring the full 12 bytes to save to an arbitrary hub location.
Taking this into account, the average ratio of LMM:SPIN to do a load-add-save sequence could in fact end up less than 2:1.
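The expected cost of the save step can be sketched the same way. The 75% figure is from above; the 20%/5% split of the remaining cases is my assumption, for illustration only:

```c
/* Expected extra bytes LMM spends saving a result. The 75%
   "already in the right cog register" figure is from the post;
   the 20%/5% split of the remaining cases is an assumption. */
double lmm_save_cost(void)
{
    return 0.75 * 0.0     /* result landed in the target register  */
         + 0.20 * 4.0     /* address already in a register: 1 store */
         + 0.05 * 12.0;   /* full store to an arbitrary hub address */
}
```

That works out to about 1.4 extra bytes per save on average - less than the roughly 3 bytes SPIN typically needs to store to an arbitrary hub location - which is why the combined load-add-save ratio can dip below 2:1.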
Also, now we start to get to the point where LMM can do things that SPIN cannot - like re-use common subexpressions, or keep in temporary registers any values that the user (or the compiler) determines will be needed again later. The savings here can be quite dramatic - most code uses the same variable multiple times in a code block, and there are also often common expressions and other code fragments that can be evaluated once and then re-used many times - not just intermediate arithmetic results, but also the addresses of variables. This is easy to do in LMM, but very hard in SPIN because SPIN simply doesn't have any local cog space to keep such things hanging around - instead, it has to recalculate them each time it needs them. Maybe a SPIN-based compiler could optimize things to use the first 8 long locations in the VAR section as "temporaries" - but LMM programs don't have just 8 such locations - they can have as many of them as they want (for example, Catalina has 24 general purpose registers, and 3 or 4 special purpose ones that are used for this purpose. It could have more, but there is a law of diminishing returns here that means after a certain point the cog space is more valuable as something else).
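As a small illustration of the kind of reuse meant here (my own example, not Catalina output): in the function below, the address of s->count is needed three times. An LMM compiler can compute that address once into a cog register and reuse it for every access, whereas the SPIN interpreter recomputes the effective address for each bytecode that touches it:

```c
/* Hypothetical example: the common subexpression is the
   address of s->count, used three times in one short block. */
struct counter { int count; int limit; };

int bump(struct counter *s)
{
    if (s->count < s->limit)      /* access 1 */
        s->count = s->count + 1;  /* accesses 2 and 3 */
    return s->count;
}
```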
Of course, the real situation is (as always) more complex. As you point out, SPIN has some clever tricks and some one-byte opcodes that don't require any operands (or, as Dave points out, operate on some special locations very efficiently) - so again we'd have to look at actual code to see the distribution of such instructions. But overall, I maintain that the ratio of LMM:SPIN will remain around 2:1 - it could even be less than 2:1 in some cases.
Note that I am not being specific about Catalina here - the same applies to any LMM program, whether hand crafted or compiled. While compiled LMM will generally be less efficient, that's a compiler issue, and not an LMM issue.
We already have a couple of 4-byte granularity C compilers, and I believe Parallax is working on a 1-byte granularity C compiler, so we might soon be able to actually compare them.
However, I can't see a real case for also developing a 2-byte granularity C compiler (which you call "thumb-style", and which has also been called a "Compact Code Model" or "CMM" in another thread) - I don't think it will end up with a code size dramatically less than the 4-byte (LMM) compiler. My gut feeling is that you may get somewhere between a 10 - 25% code space improvement over LMM. But given that the Prop II could well be out before such a compiler, it just doesn't seem worth the effort.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
You say Parallax is working on a byte granularity C compiler? I'm assuming this uses some sort of virtual machine (the SPIN VM maybe)? Or are they adding byte-length instructions to P2? I guess Parallax has the advantage that they can change the Propeller instruction set to fit their compiler at least for new versions of the chip.
Are we wanting to optimise for fast execution of C programs, or are we optimizing for minimal code size?
When it comes to a speed race I believe LMM will always win, compiler generated or done by hand. (except for some special cases where overlays would help)
In all other cases one should use Zog and be happy :)
That is unless you want to target a C compiler at some other byte/word code for fun. Who would be sad enough to do that?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
I think it would be useful to develop a set of benchmarks that would allow us to compare compilers. There are some applications that require speed, and other applications that require code compaction. A "Hello World" benchmark would be good. It would show how much space is required for the minimum size program.
It would be good to have implementations that use the standard library, and other implementations that do not use a library and are self-contained. A self-contained Spin "Hello World" is about 150 bytes. Of course, a Spin-only serial driver is limited to 19,200 baud.
David: Yes, but a "Hello world" benchmark is totally useless to someone trying to use the Prop for a quad copter or bot balancing on two wheels.
This has been a problem with benchmarks since the beginning of the idea. For example, Dhrystone was a famous benchmark for many years, but it is heavily focused on string operations. Well, if you are making a PID loop, who cares about string operations?
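For anyone who hasn't met the term: a PID loop is a proportional-integral-derivative feedback controller - a few multiplies and adds per step, and no string handling at all. A minimal sketch of one control step might look like this (the struct layout is mine, and any gains you'd plug in are arbitrary illustration values, not tuned for a real plant):

```c
/* One step of a minimal PID controller: sum of proportional,
   integral and derivative terms on the setpoint error. */
typedef struct {
    float kp, ki, kd;   /* controller gains */
    float integral;     /* accumulated error */
    float prev_error;   /* error from the previous step */
} pid_state;

float pid_step(pid_state *p, float setpoint, float measured, float dt)
{
    float error = setpoint - measured;
    p->integral += error * dt;
    float derivative = (error - p->prev_error) / dt;
    p->prev_error = error;
    return p->kp * error + p->ki * p->integral + p->kd * derivative;
}
```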
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
heater, I think most people realize that benchmarks are flawed. However, they are helpful in understanding the performance of various compilers and techniques. I don't know what a PID loop is, but it might be useful to have a PID loop benchmark for someone interested in doing that.
Are we wanting to optimise for fast execution of C programs, or are we optimizing for minimal code size?
We want it all of course :) No doubt that LMM C has value with a bigger memory footprint.
I just want the real size difference between Spin and LMM C on one or more programs so the issue can be at least partially put to rest.
Hello World is fine for measuring printing, all else being equal (same FullDuplexSerial cog).
There are other cases where different code is created and a good metric is harder to establish.
See http://forums.parallax.com/showthread.php?p=898937 for Parallax's original proposal for a 1-byte granularity C/SPIN hybrid language - I believe this (or something similar) is being developed. Not sure on its current progress.
@heater,
Ideally of course we want both performance and small code size. But as many people point out, the architecture of the Prop means we generally can't have both. However, we're not talking about speed at all in this thread. I don't think anyone questions that LMM is many times faster than byte code. The point I was making, which seems to have stirred the hornets, was that an LMM implementation of an algorithm should be able to achieve code sizes that are only around 2 times larger than a byte code implementation (at least when using the SPIN byte codes - not sure about Zog).
@jazzed,
We could have closure on this argument if you guys would just stop arguing
Seriously, though - I agree we should benchmark - but we would have to benchmark hand-crafted LMM against hand-crafted byte codes. Otherwise, all you are comparing is the compiler technologies, which are so different that the results would have very little to do with LMM vs byte codes. Also, there is a problem in finding a suitable benchmark - a "hello, world" benchmark tells you very little other than how much startup code is required and how big the C stdio library routines are (and there is no byte-code equivalent to that anyway). I have tried various standard C benchmarks (Whetstone, Dhrystone, etc.) - but I have not yet found one that anything other than Catalina can compile and run. However, I have not tried Zog for this - maybe that's the way to go.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
I'm sorry, but if your Hello World program binary is too big for simple Propeller hardware and the Spin program is not, LMM C loses. Simple; no compiler arguments necessary :)
I'm not suggesting that such a simple program makes a 32KB+ binary, but it doesn't take much to push applications into the "oops, it's just too big" category. There are many solutions for the just-too-big problem, but they would not include the simple solutions using this wonderful little chip.
I would agree with you if that were the case. Fortunately, it is not - the smaller of the Catalina "Hello, world" demo programs takes about 450 bytes of code space. This does not include loading the drivers (e.g. a display driver), because that would presumably be equivalent no matter what implementation you use. However, it does include stuff that you also shouldn't really include if you are just trying to do a direct LMM to byte code comparison.
For example, it includes setup code to process command line arguments (something that none of the other compilers you want to compare it against can even do!). It also sets up the infrastructure designed for general purpose invocation of various plugins.
Neither of these are really necessary in such a trivial program, but Catalina includes them because in trivial programs the additional code size overhead simply doesn't matter, but in most non-trivial programs they are required.
Just looking at the "main" function itself might be a better comparison. In Catalina for "Hello, World" it is 11 longs (44 bytes). It could be less if I wanted it to be. For example, I pass an extra parameter to the string output function to indicate which cursor you want to use - but again, using multiple cursors is not something the other compilers probably even support. But when you're dealing with such low numbers, every single long makes a substantial difference.
The point is that if I wanted to make Catalina perform well in such a trivial benchmark, I could do so. Looking at the code, I could easily reduce the "main" function to 5 LMM instructions - say 20 bytes.
Is 20 bytes more than twice as long as the equivalent byte-coded program? Quite likely. But is this a sensible comparison to use as a benchmark? Of course not!
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
Ross, I am familiar with the thread on PMC. I posted several times to that thread encouraging Parallax to make PMC ANSI compliant. The last post in that thread summarizes Parallax's decision on the subject. Basically, they decided to work on improvements to the Prop Tool instead. PMC is the last item on their list.
They haven't released a new version of the Prop Tool yet, so I don't think they've done much work on PMC.
I didn't quite understand your comment about the Dhrystone benchmark. Are you saying that you could not get it to compile under Catalina? I looked at the Wikipedia entry on Dhrystone, and it seems like the main concern is that compiler vendors added special modes that favor the Dhrystone test suite. The other concern is that it emphasizes string manipulation a bit. With that said, it still seems like the Dhrystone benchmark would provide useful information.
Since that thread there have been a few other indications that Parallax is proceeding (or at least intending to proceed) with something along the lines of PMC - but I'm not sure of their progress or schedule.
As to Whetstone and Dhrystone - just the opposite. I could only get them to compile and run under Catalina. You could see if you could get Dhrystone and Whetstone to compile under your C to SPIN translator - but modifying the code in any way to do so is a big no-no when benchmarking. That's probably why compiler writers "tune" their compilers instead.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
I would agree with you if that were the case. Fortunately, it is not - the smaller of the Catalina "Hello, world" demo programs takes about 450 bytes of code space.
So out of the box, without lots of excuses, it's 450 bytes vs 150 bytes?
It's probably not fair, but that sounds like more than 2x bigger to me :)
I'll do a fair comparison later.
Cheers,
--Steve
Edit: I see Dave has started a new thread. Good thing.
Ross, thanks for the explanation. I need to get back to CSPIN development so that it is more complete. I use it for my own stuff, but I know what works and what doesn't. Maybe a good milestone would be to get it to a point where it compiles the Dhrystone benchmark without any modifications.
Have you posted the results you got with Catalina on the Whetstone and Dhrystone benchmarks? It would be interesting to see how your various optimization modes affect the results.
We can revisit this when there is another compiler - any other compiler, using any technology - that can do anything even remotely like what Catalina can do.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
We can revisit this when there is another compiler - any other compiler, using any technology - that can do anything even remotely like what Catalina can do.
Ross.
Yes, it is unlikely that there will ever be a completely valid comparison of Spin byte code to LMM C. Meanwhile, I invite everyone competent with Spin/PASM and C to use Catalina (or ICC, for that matter) and find out their own truth. A casual observer will never know.
Some interesting results are beginning to emerge over in the Compiler Benchmarks thread.
The Dhrystone 1.1 benchmark is currently the best general-purpose benchmark program we have for C on the Propeller, since this program was specifically designed to contain a 'representative' distribution of assignments, control statements and procedure/function calls. It should therefore also make a good 'representative' test case for the new Catalina Optimizer.
At the highest optimization level (-O3), the optimizer reduces the code size of Dhrystone by 13%, and also increases the Dhrystone performance by 13%.
The Catalina Optimizer is in final phase testing now, and will be available shortly.
The first official release of the Catalina Code Optimizer is now available. Here is a quick summary:
Seamless integration with the Catalina C compiler (version 2.6 or later)
Compatible with Code::Blocks
Compatible with the BlackBox and BlackCat debuggers.
Three optimization levels (single pass, double pass, or double pass with automatic function inlining)
Linux and Windows binaries provided
Optimizes C user programs as well as C library code, and also hand-coded LMM PASM functions
Works with all Catalina memory models (LMM, EMM, XMM)
Typically reduces code size by 10 to 15% (more in some cases!)
Typically improves code performance by 5% to 15% (more in some cases!)
Note that the Catalina Code Optimizer is NOT included with Catalina. The cost is $US25.00 (or equivalent), which includes one year's worth of free updates.
Been too busy for a few weeks to post here, but I noticed today that while I have been away, Catalina has gone from around 100 downloads to nearly 500 downloads (350+ Windows, 70+ Linux) :freaked:
So I thought I'd better post something to let you all know that I am still around and still supporting it. In fact, I have recently found and fixed a few bugs in the Catalina code generator, as well as a few bugs within LCC itself. I will release a new version sometime soon.
But don't let that stop you downloading the current version of Catalina - the fixes will be incorporated in a patch release that can be applied over the top of any existing Catalina 2.6 installation.
For those of you who have purchased the Catalina Optimizer, I will email you the patches shortly. Other Catalina users should keep an eye on the Catalina SourceForge site where I will eventually post the patches.
Catalina Release 2.7 is available. This is a priority patch release that contains bug fixes only. It is currently only available to purchasers of the Catalina Optimizer. The contents of this patch release will eventually be incorporated in a future "full" release of Catalina for other users - check the Catalina sourceforge site for further details.
The main additions/changes in this release were as follows:
Fixed some bugs in the LCC Catalina code generator.
An update to the 'propeller_icc.h' include file to fix an error in the msleep macro, and also to make it unnecessary to edit the file to switch between the Catalina compiler and the ICC compiler (this is now detected automatically).
Fixed a bug with the PropTerminal HMI failing to detect the PC mouse - this driver should always assume the mouse is present.
Fixed a bug with the LCC preprocessor (cpp) that could cause it to run out of memory and crash.
Fixed a problem with the LCC -I command line option (affected Linux only).
Just a small question regarding Catalina on the hydra;
I tried your compiler with quite good success in combination with Code::Blocks, and have to say: great work... it works perfectly.
But I did not find a way of using BMPs for graphic interfaces - maybe it's a lack of skill, but I thought asking wouldn't be a problem.
So is there a way of printing BMPs on the TV/VGA with Catalina, and if yes, how?
Glad you like the compiler. Catalina provides no graphics libraries, but it would certainly be possible to build a graphics library that loads as a plugin (i.e. runs in one or more cogs) to work in conjunction with a bitmapped driver and provide graphics primitives (analogous to the Parallax "graphics" driver).
This would essentially mimic what the current "HMI" plugin does for text functions (i.e. provide a set of higher level and device-independent text functions not supported by the underlying video drivers). I've done something like this with my high-resolution "virtual" graphics driver (described in this thread: http://forums.parallax.com/showthread.php?t=105213), but it's not ready for release. I'll see if I can clean it up - but I already have quite a backlog of Catalina work, so I can't promise it anytime soon.
I'm not sure what you mean by "printing" - if you needed to get a copy of a bitmapped screen you would have to interrogate the Hub RAM screen buffer (which is tile based) and then convert it to a format that you could save and print elsewhere. Quite feasible, but again Catalina has no built-in support for this. Also, note that you will probably have to do this "a line at a time", since the Prop doesn't have enough RAM to store a full screen as a bitmap - even a simple monochrome low-resolution screen (e.g. 640x480x1) can take more than the entire Hub RAM (32KB).
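The hub RAM arithmetic behind that last point is easy to check:

```c
/* Bytes needed to hold an entire w x h screen at 1 bit per pixel. */
int mono_screen_bytes(int w, int h)
{
    return (w * h) / 8;
}
```

mono_screen_bytes(640, 480) gives 38400 bytes - comfortably more than the 32768 bytes of hub RAM - hence the line-at-a-time approach.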
I have a small question about Catalina + Gamepad. Is it supported?
If yes, how do I use it?
I tried to find it in the manual but didn't see anything specific.
Do you mean a Hydra gamepad driver for Catalina? The answer is that there isn't one - but the NES gamepad is a very simple device, so writing a Catalina plugin should be fairly straightforward. But in fact the gamepad is so simple that using a whole cog just for such a driver might be a bit of a waste - you may be better off simply doing it in C - see John Abshier's "NES controller object" in the OBEX for an example of such a driver written in SPIN. It should be fairly easy to translate this to C.
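To sketch why it's so simple: the NES pad is essentially a parallel-in/serial-out shift register - you pulse the latch line, then read the data pin and pulse the clock eight times. Leaving out the Prop-specific pin twiddling (OUTA/INA), the step that packs the eight sampled bits into a byte is just this (the function name and bit ordering here are illustrative, not from any existing driver):

```c
/* Pack 8 sampled data-line bits into one button byte, bit 0 first.
   Sampling the bits themselves means pulsing latch once, then
   reading the data pin and pulsing clock eight times; that part
   is hardware-specific and omitted here. */
unsigned char nes_pack(const int bits[8])
{
    unsigned char buttons = 0;
    for (int i = 0; i < 8; i++)
        buttons |= (unsigned char)((bits[i] & 1) << i);
    return buttons;
}
```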
Adding more device drivers to Catalina is next on my list of things to do once I complete the current batch of outstanding work (discussed here: http://forums.parallax.com/showthread.php?t=125734). I've finished the new smaller file system library, the new multi-cog/multi-thread library, and the new 2-phase loader. Now I'm working on the new "tiny" library. After this will come some additional device drivers.
However, a gamepad driver should not take long, so I'll see what I can do in the next week or so.
Wow thanks for your help Ross;
But first I will give John Abshier's "NES controller object" a try and see what I can do myself (at least I want to learn something too), but if I can't get it running I will ask for your help again.
Comments
Not sure exactly which thread you're referring to (there are so many!) but I don't think I would have simply said that size never matters (I'm sure you'll correct me if I'm wrong there
I think I do recall saying that size doesn't matter for some Prop users - i.e. those who only to use the Prop for small applications. But in other cases, size does matter.
What I said most recently was that the commonly expressed belief that LMM C would be too large to be useful because it was of the order of four times the size of bytecoded C (an approximation I also used to quote) - was incorrect.
In practice, even without optimization, programs coded in LMM languages should only be about 2 times the size of equivalent programs coded in languages that use SPIN byte codes.
With good optimization this can be even less - some of the simplest optimizations are simply impossible to achieve when using SPIN byte codes. Actually, after seeing some of the byte code that the SPIN compiler actually generates, I may revise my estimate of the overhead of LMM programs down even further!
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
It does depend on the particular code and sequence required to achieve something but LMM is stuck with its 4 byte granularity where Spin can go to one byte. This is why I suggested three levels of C; 8-bit bytecode, 16-bit thumb-style and 32-bit LMM. If Prop had thumb-style expansion hardware or purposed opcodes that would be superb for high speed LMM-style execution with reasonable code footprint but it doesn't.
32-bit LMM is however only a real issue for Hub ram. With suitable XMM memory, optimisation is IMO more about improving speed through code change than reducing footprint. If programs are 8 times larger I don't think it's unreasonable nor really a problem once it's accepted that XMM has to be used for all but trivial C programs.
That C most times likely requires XMM and uses I/O resources is a more fundamental problem and may make alternatives look more appealing but there's not a lot of choice - the Prop simply isn't designed for small C code with fast execution. LMM is essential for fast execution, that's a fact of life, everything then follows from that.
Dave
It's true that LMM is stuck with 4-byte granularity, whereas SPIN byte codes have 1-byte granularity, but when you look at actual example you realize this doesn't mean the size ratio of LMM:SPIN will be anything like 4:1. Even with a non-optimizing compiler the ratio could be much closer to 2:1. I may have my byte count wrong in some of the following cases, but you'll get the general idea ...
First consider loading and saving constant operands. When using immediate mode, LMM can load a byte (actually 9 bits) using 4 bytes. To do the same in SPIN takes 2 bytes (3 bytes if the operand requires 9 bits, but that's quite rare so we'll ignore it). So the ratio of LMM:SPIN is 2:1.
Loading 2-byte operands (such as hub addresses on the Prop I) takes 8 bytes in in LMM, whereas SPIN take 3 bytes. Ratio of LMM:SPIN is 8:3.
Loading 4-byte operands (e.g. integers and floats) takes 8 bytes with LMM, whereas SPIN takes 5 bytes. Ratio of LMM:SPIN is 8:5.
So all cases are fairly close to 2:1 - SPIN wins out on some, LMM wins out on some. Figuring out the exact ratio depends on the distribution of 1,2 and 4 byte loads, but overall I'd say the ratio of LMM:SPIN to load constant operands is probably going to be fairly close to 2:1.
Next consider opcodes that actually do something - like ADD. The SPIN opcodes generally work on the stack (i.e. the instruction itself doesn't have any operands) and therefore only takes one byte, whereas an LMM ADD always takes 4 bytes. But in SPIN the operands had to be loaded on the stack beforehand, whereas in LMM the ADD instruction can include one of the operands - so you have to compare the whole sequence of load-and-add in both cases.
To load and add two 1-byte constant operands takes 8 bytes in LMM, whereas in SPIN it takes 5 bytes. Ratio LMM:SPIN is 8:5.
To load and add two 2-byte operands takes 20 bytes in LMM, whereas in SPIN it takes 7 bytes. Ratio LMM:SPIN is 20:7.
To load and add two 4-byte operands takes 20 bytes in LMM, whereas in SPIN it takes 11 bytes. Ratio LMM:SPIN is 20:11.
Again, all around 2:1. Sometimes better, sometimes worse - but overall the ratio will hover around 2:1. And again, to get a more accurate estimate we'd need to know the distributions of the various operations and the various operand sizes.
So far we have only loaded and operated on operands. We haven't actually saved the results anywhere yet. In SPIN, the results are calculated on the stack, and to save them anywhere useful will generally require another opcode - i.e. at least 1 more byte, but more likely 3 if we want to save to an arbitrary hub RAM location. But in LMM we can actually have nearly the whole cog to use as storage - so in 75% (or more) of cases we have already finished (i.e. the LMM ADD instruction has left the result in a cog location that represents the register variable we wanted to save it in anyway). Even if the variable is not in a register, there is actually quite a good chance that it's address is (more about this below), so in most of the remaining cases it is only another 4 bytes to save the result, with only a few cases requiring a full 12 bytes required to save to an arbitrary hub location.
Taking this into account, the average ratio of LMM:SPIN to do a load-add-save sequence could in fact end up less than 2:1.
Also, now we start to get to the point where LMM can do things that SPIN cannot - like re-use common subexpressions, or keep in temporary registers any values that the user (or the compiler) determine will be needed again later. The savings here can be quite dramatic - most code uses the same variable multiple times times in a code block, and there are also often common expressions and other code fragments that can be evaluated once and then re-used many times - not just intermediate arithmetic results, but also the addresses of variables. This is easy to do in LMM, but very hard in SPIN because SPIN simply doesn't have any local cog space to keep such things hanging around - instead, it has to recalulate them each time it needs them. Maybe a SPIN-based compiler could optimize things to use the first 8 long locations in the VAR section as "temporaries" - but LMM programs don't have just 8 such locations - they can have as many of them as they want (for example, Catalina has 24 general purpose registers, and 3 or 4 special purpose ones that are used for this purpose. It could have more, but there is a law of diminishing returns here that means after a certain point the cog space is more valuable as something else).
Of course, the real situation is (as always) more complex. As you point out, SPIN has some clever tricks and some one-byte opcodes that don't require any operands (or, as Dave points out, operate on some special locations very efficiently), so again we'd have to look at actual code to see the distribution of such instructions - but overall, I maintain that the ratio of LMM:SPIN will remain around 2:1 - it could even be less than 2:1 in some cases.
Note that I am not being specific about Catalina here - the same applies to any LMM program, whether hand crafted or compiled. While compiled LMM will generally be less efficient, that's a compiler issue, and not an LMM issue.
We already have a couple of 4-byte granularity C compilers, and I believe Parallax is working on a 1-byte granularity C compiler, so we might soon be able to actually compare them.
However, I can't see a real case for also developing a 2-byte granularity C compiler (which you call "thumb-style", and which has also been called a "Compact Code Model" or "CMM" in another thread) - I don't think it will end up with a code size dramatically less than the 4-byte (LMM) compiler. My gut feeling is that you may get somewhere between a 10% and 25% code space improvement over LMM. But given that the Prop II could well be out before such a compiler, it just doesn't seem worth the effort.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
We have a few LMM C compilers.
We have a C to Spin translator that Dave built.
The idea is to take some reasonable C source files, compile them with LMM C and C to Spin and check the output.
Is this a reasonable test? If so, what would be good demonstration source?
We should keep it small so that the limited ICC version can also be tested.
Hello World may be a good starting point.
Is there something more significant that Spin natively supports?
I would like to see closure on the LMM C -vs- Spin program size argument. Any takers?
@David Betz:
As far as we know, Parallax will end up using something like the C to Spin translator.
There was a thread on the question.
Cheers,
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
Are we wanting to optimize for fast execution of C programs, or are we optimizing for minimal code size?
When it comes to a speed race I believe LMM will always win, compiler generated or done by hand. (except for some special cases where overlays would help)
In all other cases one should use Zog and be happy :)
That is unless you want to target a C compiler at some other byte/word code for fun. Who would be sad enough to do that?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
It would be good to have implementations that used the standard library, and other implementations that did not use a library and are self-contained. A self-contained Spin "Hello World" is about 150 bytes. Of course, a Spin-only serial driver is limited to 19,200 baud.
Dave
This has been a problem with benchmarks since the beginning of the idea. For example, Dhrystone was a famous benchmark for many years, but it is heavily focused on string operations. Well, if you are making a PID loop, who cares about string operations?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Dave
I just want the real size difference between Spin and LMM C on one or more programs so the issue can be at least partially put to rest.
HelloWorld is fine for measuring printing all else being equal (same FullDuplexSerial COG).
There are other cases where different code is created and a good metric is harder to establish.
Cheers,
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
See http://forums.parallax.com/showthread.php?p=898937 for Parallax's original proposal for a 1-byte granularity C/SPIN hybrid language - I believe this (or something similar) is being developed. Not sure on its current progress.
@heater,
Ideally of course we want both performance and small code size. But as many people point out, the architecture of the Prop means we generally can't have both. However, we're not talking about speed at all in this thread. I don't think anyone questions that LMM is many times faster than byte code. The point I was making that seems to have stirred the hornets was that an LMM implementation of an algorithm should be able to achieve code sizes that are only around 2 times larger than a byte code implementation (at least when using the SPIN byte codes - not sure about Zog).
@jazzed,
We could have closure on this argument if you guys would just stop arguing
Seriously, though - I agree we should benchmark - but we would have to benchmark hand-crafted LMM against hand-crafted byte codes. Otherwise, all you are comparing is the compiler technologies, which are so different that the results would have very little to do with LMM vs byte codes. Also, there is a problem in finding a suitable benchmark - a "hello, world" benchmark tells you very little other than how much startup code is required and how big the C stdio library routines are (and there is no byte-code equivalent to that anyway). I have tried various standard C benchmarks (Whetstone, Dhrystone etc.) - but I have not yet found one that anything other than Catalina can compile and run. However, I have not tried Zog for this - maybe that's the way to go.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
I'm sorry, but if your Hello World program binary is too big for simple Propeller hardware and the Spin program is not, LMM C loses. Simple; no compiler arguments necessary :)
I'm not suggesting that such a simple program makes a 32 KB+ binary, but it doesn't take much to push applications into the "oops, it's just too big" category. There are many solutions to the just-too-big problem, but they would not include the simple solutions using this wonderful little chip.
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
I would agree with you if that were the case. Fortunately, it is not - the smaller of the Catalina "Hello, world" demo programs takes about 450 bytes of code space. This does not include loading the drivers (e.g. a display driver), because that would presumably be equivalent no matter what implementation you use. However, it does include stuff that you also shouldn't really include if you are just trying to do a direct LMM to byte code comparison.
For example, it includes setup code to process command line arguments (something that none of the other compilers you want to compare it against can even do!). It also sets up the infrastructure designed for general purpose invocation of various plugins.
Neither of these are really necessary in such a trivial program, but Catalina includes them because in trivial programs the additional code size overhead simply doesn't matter, but in most non-trivial programs they are required.
Just looking at the "main" function itself might be a better comparison. In Catalina for "Hello, World" it is 11 longs (44 bytes). It could be less if I wanted it to be. For example, I pass an extra parameter to the string output function to indicate which cursor you want to use - but again, using multiple cursors is not something the other compilers probably even support. But when you're dealing with such low numbers, every single long makes a substantial difference.
The point is that if I wanted to make Catalina perform well in such a trivial benchmark, I could do so. Looking at the code, I could easily reduce the "main" function to 5 LMM instructions - say 20 bytes.
Is 20 bytes more than twice the size of the equivalent byte-coded program? Quite likely. But is this a sensible comparison to use as a benchmark? Of course not!
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
They haven't released a new version of the Prop Tool yet, so I don't think they've done much work on PMC.
I didn't quite understand your comment about the Dhrystone benchmark. Are you saying that you could not get it to compile under Catalina? I looked at the Wikipedia entry on Dhrystone, and it seems like the main concern is that compiler vendors added special modes that favor the Dhrystone test suite. The other concern is that it emphasizes string manipulation a bit. With that said, it still seems like the Dhrystone benchmark would provide useful information.
Dave
Since that thread there have been a few other indications that Parallax is proceeding (or at least intending to proceed) with something along the lines of PMC - but I'm not sure of their progress or schedule.
As to Whetstone and Dhrystone - just the opposite. I could only get them to compile and run under Catalina. You could see if you could get Dhrystone and Whetstone to compile under your C to SPIN translator - but modifying the code in any way to do so is a big no-no when benchmarking. That's probably why compiler writers "tune" their compilers instead.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
It's probably not fair, but that sounds like more than 2x bigger to me :)
I'll do a fair comparison later.
Cheers,
--Steve
Edit: I see Dave has started a new thread. Good thing.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
Post Edited (jazzed) : 7/25/2010 1:42:09 AM GMT
Have you posted the results you got with Catalina on the Whetstone and Dhrystone benchmarks? It would be interesting to see how your various optimization modes affect the results.
Dave
You are again comparing apples to iPods.
We can revisit this when there is another compiler - any other compiler, using any technology - that can do anything even remotely like what Catalina can do.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
Cheers,
--Steve
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Pages: Propeller JVM
Some interesting results are beginning to emerge over in the Compiler Benchmarks thread.
The Dhrystone 1.1 benchmark is currently the best general purpose benchmark program we have for C on the Propeller, since this program was specifically designed to contain a 'representative' distribution of assignments, control statements and procedure/function calls. It should therefore also make a good 'representative' test case for the new Catalina Optimizer.
At the highest optimization level (-O3), the optimizer reduces the code size of Dhrystone by 13%, and also increases the Dhrystone performance by 13%.
The Catalina Optimizer is in final phase testing now, and will be available shortly.
Ross.
The first official release of the Catalina Code Optimizer is now available. Here is a quick summary:
Note that the Catalina Code Optimizer is NOT included with Catalina. The cost is $US25.00 (or equivalent), which includes one year's worth of free updates.
Please PM or email me for further details.
Ross.
Been too busy for a few weeks to post here, but I noticed today that while I have been away, Catalina has gone from around 100 downloads to nearly 500 downloads (350+ Windows, 70+ Linux) :freaked:
So I thought I'd better post something to let you all know that I am still around and still supporting it. In fact, I have recently found and fixed a few bugs in the Catalina code generator, as well as a few bugs within LCC itself. I will release a new version sometime soon.
But don't let that stop you downloading the current version of Catalina - the fixes will be incorporated in a patch release that can be applied over the top of any existing Catalina 2.6 installation.
For those of you who have purchased the Catalina Optimizer, I will email you the patches shortly. Other Catalina users should keep an eye on the Catalina SourceForge site where I will eventually post the patches.
Ross.
The main additions/changes in this release were as follows:
I tried your compiler with quite good success in combination with Code::Blocks, and have to say: great work - it works perfectly.
But I did not find a way to use BMPs for graphic interfaces - maybe it's a lack of skill, but I thought asking wouldn't be a problem.
So, is there a way of displaying BMPs on the TV/VGA with Catalina, and if yes, how?
Thank you for your help,
wuut
Glad you like the compiler. Catalina provides no graphics libraries, but it would certainly be possible to build a graphics library that loads as a plugin (i.e. runs in one or more cogs) to work in conjunction with a bitmapped driver and provide graphics primitives (analogous to the Parallax "graphics" driver).
This would essentially mimic what the current "HMI" plugin does for text functions (i.e. provide a set of higher level and device-independent text functions not supported by the underlying video drivers). I've done something like this with my high-resolution "virtual" graphics driver (described in this thread: http://forums.parallax.com/showthread.php?t=105213), but it's not ready for release. I'll see if I can clean it up - but I already have quite a backlog of Catalina work, so I can't promise it anytime soon.
I'm not sure what you mean by "printing" - if you needed to get a copy of a bitmapped screen, you would have to interrogate the hub RAM screen buffer (which is tile based) and then convert it to a format that you could save and print elsewhere. Quite feasible, but again Catalina has no built-in support for this. Also, note that you will probably have to do this "a line at a time", since the Prop doesn't have enough RAM to store a full screen as a bitmap - even a simple monochrome low resolution screen (e.g. 640x480x1) can take more than the entire hub RAM (32 KB).
Ross.
I have a small question about Catalina + Gamepad. Is it supported?
If yes, how do I use it?
I tried to find it in the manual but didn't see anything specific.
Also the forum search didn't help.
Hope to hear something soon.
wuut
Do you mean a Hydra gamepad driver for Catalina? The answer is that there isn't one - but the NES gamepad is a very simple device, so writing a Catalina plugin should be fairly straightforward. But in fact the gamepad is so simple that using a whole cog just for such a driver might be a bit of a waste - you may be better off simply doing it in C - see John Abshier's "NES controller object" in the OBEX for an example of such a driver written in SPIN. It should be fairly easy to translate this to C.
Adding more device drivers to Catalina is next on my list of things to do once I complete the current batch of outstanding work (discussed here: http://forums.parallax.com/showthread.php?t=125734). I've finished the new smaller file system library, the new multi-cog/multi-thread library, and the new 2-phase loader. Now I'm working on the new "tiny" library. After this will come some additional device drivers.
However, a gamepad driver should not take long, so I'll see what I can do in the next week or so.
Ross.
But first I will give John Abshier's "NES controller object" a try and see what I can do myself (at least I want to learn something too).
greetings wuut
ps: god i love this community
Ross.
P.S. Since you can no longer edit the title of an existing thread, from now on I'm going to create a new thread for each Catalina release.