The Secret Sauce: how to make more compact C programs...
ImageCraft
Posts: 348
One criticism of Propeller C using LMM is that it's less compact than the interpreted Spin code. Not that anyone has done any benchmarks, but I digress. However, there are several things in our compiler that would balance things out a bit:
- simple optimizations: Our compiler does algebraic simplification, constant folding, etc., so "i = 10 + 5 * 2;" compiles down to a single constant load rather than runtime arithmetic, and "if (0) ...." generates no code at all (see the sketch after this list).
- common subexpression elimination: the base compiler computes repeated occurrences of the same expression within a straight-line code block once, without recalculating them.
- register allocation: local variables are kept in registers as much as possible, so "i = b + c;" is usually just two instructions: "mov Ri,Rb; add Ri,Rc". I have not studied the Spin stack machine, but this would be "push b; push c; add; pop i" in a traditional stack machine, so depending on how the variables are encoded, there may not be much code saving either way.
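For example, here is a rough sketch of what these passes buy you (illustrative only, not literal ICC output):

    /* Rough sketch of the front-end optimizations (illustrative only). */
    int demo(int b, int c)
    {
        int i;
        if (0)                       /* dead code: the branch is removed     */
            i = b * c;
        i = b + c;                   /* b, c, i kept in registers, so this   */
                                     /* is just: mov Ri,Rb / add Ri,Rc       */
        return i + (10 + 5 * 2);     /* 10 + 5 * 2 is folded to 20 at        */
                                     /* compile time; no runtime arithmetic  */
    }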
To top it off, we have two additional technologies that we can roll out in the higher end versions:
- MIO global optimizer. This analyzes whole functions and does things like constant propagation, common subexpression elimination, etc.
- Code Compressor. This is a post-link optimization where we look for duplicate code sequences in the output code and turn them into subroutines. The program runs a little slower, but on some programs we get as much as 35% code savings.
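The effect is roughly like this source-level analogy (the actual pass works on the linked machine code, not on C source):

    /* Before: two functions contain the same statement sequence. */
    void f1(volatile int *port) { *port |= 0x01; *port &= ~0x02; *port |= 0x04; }
    void f2(volatile int *port) { *port |= 0x01; *port &= ~0x02; *port |= 0x04; }

    /* After: the duplicate sequence is factored into one subroutine and
       the call sites become short calls -- smaller code, slightly slower. */
    static void shared_seq(volatile int *port) { *port |= 0x01; *port &= ~0x02; *port |= 0x04; }
    void f1_compressed(volatile int *port) { shared_seq(port); }
    void f2_compressed(volatile int *port) { shared_seq(port); }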
While the last two major optimizations are not in the compiler yet, the base technologies have been in use since 2002 for the Code Compressor and 2006 for the MIO, so they are well proven. We can add them to Propeller C if there is demand.
So I hope this balances things out a little.
// richard
Comments
It would be nice to see some kind of size/speed benchmark for a realistically large program.
(I may do this myself for this Chess program I've started working on).
I tried a simple test case ...
ICC reports 2176 bytes, but let's not be unfair about the LMM interpreter overhead and take away 2KB: 152 bytes. In comparison, Spin managed it all in 36 bytes (4:1).
We're still comparing apples and oranges, so how about adding an extra "i := NothingMuch" to each and just looking at the cost of that: 4 bytes added to Spin, 22 bytes added to ICC (5.5:1).
Let's try a simple factorial ...
124 bytes for ICC, 60 for Spin (2:1).
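For concreteness, the factorial I have in mind is along these lines (a sketch; the exact code measured isn't shown here):

    /* Plausible sketch of the factorial test (not the exact code measured). */
    int Factorial(int n)
    {
        if (n <= 1)
            return 1;
        return n * Factorial(n - 1);
    }

    int main(void)
    {
        int i = Factorial(10);
        return i;
    }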
At 2:1 it's looking good to me, and this is a very limited and unfair analysis. On top of that there is the speed advantage, which offsets larger code size; there are more optimisations which can be applied; and code size becomes less important with the Prop Mk II.
Of course it would be possible to add optimisation and dead code removal to Spin, and there's already constant folding if the programmer remembers to use the constant() function!
One thing I have just noticed, though, is that ICC may not necessarily be reporting the correct code size: changing to "i = Factorial(Factorial(10));" produced no apparent increase in size.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
fsrw.c). If someone wants to experiment with it, and they have ImageCraft, I'd be delighted to send it out.
That would be a nice comparison of size.
Of course, since I don't have ImageCraft C (yet), I can't test it and see how well it works.
If ImageCraft wanted, they could use this code as an example (with appropriate modifications and perhaps
testing).
I will admit I "optimized" the C both to minimize the amount of Spin bytecodes generated and to also
generate the Spin I wanted it to generate, so much of it does not look like idiomatic C. (For instance,
*all* vars are declared at the very top of every function.) But it runs fine on Linux.
(Indeed, I used the C version along with a Perl script to generate random function calls, and the Linux FAT driver on a loopback file system, to verify that the program was correct; I would do an operation through fsrw.c, then verify the data was what it should be using the Linux loopback mount, and so on and so forth.)
Not sure if I've misunderstood, but you can download the ICC compiler now, and you get unrestricted free use of it for 45 days; after that it stays free but with a generated-code size limit.
www.imagecraft.com/devtools_Propeller.html
Seems that .spin does get some advantage in size from built-ins in the ROM code, though.
C also gets some advantages from the LMM kernel code, but not much other than primitives.
@rokicki ... PM me and I'll try it out.
If you don't want to post code here, I'll respect that.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
are much higher priority than the Prop. The Prop is funtime for me, and that's all gone away for the
moment.
And if I release the fsrw.c as an ImageCraft library, I want to do it right, and I just don't have the time
to do it right.
-tom
This has fsrw.c and the ctospin.pl program. You'll need Perl. Usage is
perl ctospin.pl <fsrw.c >fsrw.spin
You should get the identical fsrw.spin I distribute.
http://tomas.rokicki.com/fat16work/fsrw-in-c.zip
Oops, I just realized. It's in C, but it uses C++ header naming
conventions, so g++ will compile it but gcc won't. Just make the
obvious changes to the header. (I generally use g++ as my
"C" compiler which is stupid, I know.) The only changes though
are just to change the three header files from <cXXXX> to
<XXXX.h>.
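For instance (using <cstdio> and friends as examples; substitute whichever three headers fsrw.c actually pulls in):

    /* g++ accepts the C++-style headers ...                          */
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    /* ... for plain gcc (or ImageCraft), change them to the C forms: */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>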
To compile with g++ unmodified:
g++ -O5 -o fsrw fsrw.c
It has a bunch of stuff at the end that lets me script it from Perl;
none of that is important for ImageCraft and you can strip it.
It's all enclosed in /* IGNORE */.
And please, no comments on the *horrible* ctospin.pl file.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
is always my username at gmail.com.
Also included are fsrw_speed_tv20_sd12.binary and fsrw_speed_vga16_sd12.binary from spin.
The default hardware configurations for these are indicated by <dev><number> in the filenames.
The FSRW.binary is functionally equivalent to fsrw_speed_vga16_sd12.binary.
Some items of interest:
The resulting C binary is about 3.2 times bigger than the equivalent Spin binary.
Using printf instead of the vgaText_str... calls (the Spin equivalents) costs about 2KB.
The vgaText library has some features that bloat the result as does other code in libcprop.a.
The C stack allocation is also non-trivial.
FSRW.binary                   18196
fsrw_speed_vga16_sd12.binary   5764
A performance increase can be seen in the write/read test, but the improvement is partly obscured
by the required SD SPI access time.
C version write/read ticks:    1_699_756_304 / 404_779_408
Spin version write/read ticks: 1_748_974_672 / 429_584_926
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
1. If you strip it down, eliminate the printf stuff, and just write a minimal program to open a file,
read it, and close it, how much does that come to in size? (A sketch of what I mean follows this list.)
2. Making it use sdspiqasm assembly for the block driver would make the speed much faster
(and probably reduce the size somewhat). How hard is it to cognew from ImageCraft C?
3. The fsrw speed test is not really "fair" because it freads/fwrites large blocks (to avoid the
spin overhead of byte-by-byte); the real win of this code is that suddenly pgetc() and
things like that will be tons faster.
4. I'll take a look at the source and see how difficult it would be to reintegrate your changes
so that we can have one source base (in simple-C) that works both with ImageCraft
and also generates the fsrw spin stuff, so people can go either way. And then we can
push forward on adding multiple file support, directory support, and so on.
5. We should look carefully at the output and see where much of the size is coming from,
and see if some small recoding might recover a lot of that.
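For item 1, something like this is what I have in mind, assuming the C entry points mirror the fsrw Spin object's mount/popen/pread/pclose (the prototypes below are guesses; check fsrw.c for the real signatures):

    /* Minimal size-test sketch: open a file, read it, close it.
       Prototypes are assumed to mirror the fsrw Spin object; adjust to
       whatever fsrw.c actually declares. */
    int mount(int basepin);
    int popen(const char *name, char mode);
    int pread(void *buf, int count);
    int pclose(void);

    int main(void)
    {
        char buf[512];
        int  n;

        mount(0);                         /* SD card on the base pin     */
        popen("TEST.TXT", 'r');           /* open for reading            */
        while ((n = pread(buf, sizeof(buf))) > 0)
            ;                             /* just read; no output at all */
        pclose();
        return 0;
    }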
Congratulations; that happened *very* quickly! Very impressive.
Thanks Jazzed!
Tomas, if we get a clean version running that meets your approval, and with your approval, I will add it to the compiler and make it part of the standard checkbox libraries.
Thanks!
re: Rayman
Unfortunately, it looks like a 2X size increase may be optimistic :-( OTOH, the normal speed increase should be 5X or so, and as I said, a lot of our customers use 32K micros for production code, with tons of units out in the field using that class of micro, so if you think in those terms, 30K is not so bad!
// the preceding brought to you by a serial commaist, who forgets to end a sentence from time to time, again.
The optimal LMM instruction is a move immediate into a register or register to register: one long. For Spin that's most likely four bytes, so 1:1 would be theoretically possible. When it comes to branching, though, ICC loses out, needing two longs while Spin can get away with two or three bytes (4:1). That's where my gut feeling of between 3:1 and 2:1 comes from.
It is possible to improve code density quite considerably for the calls into the C LMM kernel, but that comes at the cost of extra execution-time overhead. Jumps into the kernel, which take two longs (eight bytes), can be compacted to one long and one word (six bytes), a 25% saving. Near-50% savings can be had in the way registers are handled on stack calls. Constant loading takes 12 bytes and can be trimmed to 6 or 8 bytes, a 50% to 33% saving. All of those would push things closer to the 2:1 ratio, perhaps better.
The code generated is fine for fastest execution speed, but not optimal for maximum code density. Because such optimisations impact execution speed, I'd expect the solution to be a choice of compiling for speed or for density, perhaps selectable per code region, though that would complicate and enlarge the LMM kernel.
There is scope for near doubling the speed of execution using Phil Pilgrim's Reverse-LMM mechanism. That could be the optimal high-speed, high density solution, although speed freaks will always prefer fastest execution at the cost of code density.
Added: Then there's Thumb-style, 16-bit PASM instructions. That carries a lot more execution-speed overhead but could nearly double code density.
Post Edited (hippy) : 6/19/2008 8:23:58 PM GMT
Unfortunately, I didn't stick very close to a "sharing" model, but there should be very little work to make it like that.
The posted .zip has two files: fsrw.c and fsrw_printf.c. The latter will be much closer to your original; the former has printfs replaced by vgaText_str, etc. function calls, or left commented out. A variant of the sdspiqasm.spin code is already being used natively, started by the built-in cognew_native function.
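The flavour of that substitution, roughly (the vgaText_str prototype below is an assumption; see vga_text.c in the zip for the real one):

    /* Sketch of the printf -> vgaText_str substitution (prototype assumed). */
    void vgaText_str(const char *s);

    void report_ok(void)
    {
        /* printf("read ok\r\n"); */     /* printf pulls in roughly 2KB        */
        vgaText_str("read ok\r\n");      /* direct driver call is much smaller */
    }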
@Richard,
I think vga_text.c should be broken into a few pieces: 1) very basic start/stop and a rudimentary printc/putchar; 2) a file with Spin compatibility functions; and 3) convenience accessors like setxy, etc. Also, propmacro.s should be separated pretty much into a function per file. Doing this would allow the static library feature to compile/link only the minimally required objects.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
In your opinion, let's say I want to implement an "optimize for code size" mode: would it be better to optimize the LMM kernel and the associated codegen (not sure about reversing the LMM code...) or just aim for an alternate solution with a Thumb-style LMM?
The thing is, at the next update I am planning to move some support functions into another cog (the CLIB cog). So the LMM kernel cog is relatively empty, and I can see fitting Thumb kernel calls into the same cog. It may just fit...
re: Jazzed
Thanks...
Going to Thumb-style LMM gets that down to 50 bytes, a saving of 60%.
It should be possible to insert the extra step of LMM optimisation into the compilation sequence and then link with a suitable LMM kernel. I've only looked at the feasibility of the optimisation, not at the impact it would have on the kernel itself.
Added (posts crossed in the ether): On updating ICC I'd go for a three-way choice: as-is; a compression of the current codegen plus a kernel update; and the Thumb-style version, which would be a much bigger effort.
I suppose it really depends on what the C target market is: fastest C code, or C instead of Spin with comparable speed and best code density. I can see the benefits of all the alternatives; leave the developer to pick and choose what they want. Simple compression and a slight loss of speed will suit people who run up against memory limits; after that it's Thumb-style or moving to the Prop Mk II.
Post Edited (hippy) : 6/20/2008 12:56:07 AM GMT
I guess setting expectations is the important thing right now. 3X Spin may sound bad, but it's 5X faster (someone needs to do some good benchmarking there too...), and how many people are really pushing the Prop I PROGRAM (vs. data) memory?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
In terms of "market", I think I phrased that badly. I meant more in terms of what developers choosing C will be looking for. Will people come to C simply because they don't want to use Spin and would be happy with a C that delivers comparable speed to Spin, or will they look to C as the intermediate between Spin and PASM, for the speed benefit and because C's an easy coding option?
I guess that will only be determined in the fullness of time.
End-of-the-day analysis: if ICC draws in a satisfied user base then everyone's happy; if it doesn't, ICC will have to accept that or change and enhance in light of how things go.
I know some of my comments may sound a little negative, but they aren't meant that way. I, and I believe others, are trying to get to grips with where C fits, how it compares, how to even quantify a comparison, and how suitable C is for the various types of people who may be looking at using it.
So firstly, I hope we all see that having a C toolset is a good thing for the Propeller. IMHO, there is just no way for Propeller to gain traction in the commercial embedded space without a C toolset.
1. I can get by with Spin
2. I can't fathom PASM!
3. I know any potential employer would want me to program in C
4. For my hobby, I might want extra speed
5. Though I've not hit any memory limits using Spin, I would worry that using C might push me closer to them
6. I very-much like the sound of giving the speed vs. code-size choice to the developer via optimisation switches
Hope that helps.
{ot}ICC7Prop @ <$100 := soon?{/ot}
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
BTW: I type as I'm thinking, so please don't take any offense at my writing style
The decision is made. We will need to make some very minor changes, e.g. pad the output with a string as a reminder that it's for non-commercial use only, change the webpage, and voila. Probably next Monday.
... but shh... don't tell anyone yet. You know, we want to keep this as a sekrit.
Here is a performance comparison from a Beta version of ICC.
The comparison is not complete, but it delivers some general ideas.
Spin    PASM       ICC [1]
25us    200ns      1.04us   toggle (bit toggle for measure) while loop cycle [2] time
40us    420ns      2.1us    toggle while loop set by if-else loop with ndx & 1
50us    520ns [3]  2.9us    toggle while set by if-else loop with ndx & 1 + increment
40us    420ns      3.8us    function call bit toggle while loop cycle time
26s     800ms      14s      Fibo29

[1] ImageCraft C default LMM unrolled by 4 with 80MHz operation.
[2] A cycle is a transition from state 1 to state 2 and back to state 1.
    Some duty cycles are asymmetrical.
[3] Asymmetrical comparison: if_z, if_nz, then increment.
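For reference, Fibo29 is assumed here to be the usual doubly-recursive benchmark, along these lines (a sketch; the exact timing harness isn't shown):

    /* Assumed shape of the Fibo29 benchmark (doubly-recursive fibo). */
    int fibo(int n)
    {
        if (n < 2)
            return n;
        return fibo(n - 1) + fibo(n - 2);
    }

    /* Fibo29 = fibo(29), timed around the call. */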
Some performance enhancement ideas:
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔