Sorry, Dave - I was specifically talking about CSPIN with the current pnut Spin interpreter. With the pnut interpreter, 100% ANSI compliance is theoretically possible (it's theoretically possible with a Turing machine!) but completely impractical. If you used another interpreter or "tweaked" the existing one, then it would become practical, and I also agree you should be able to do better (performance-wise) than Spin - speeds around twice as fast as Spin should be achievable. However, it will be difficult to achieve this at Spin code sizes. In fact, I'd go so far as to say "impossible" - so I now look forward to being proven wrong.
I don't understand why you think running C with the current Spin interpreter is impractical. A good C compiler would produce code that would be as efficient as that produced by the Spin compiler. PBASE and VBASE could be used as index registers, and the concept of dynamically relocatable objects wouldn't be needed for C. The method table could be scrapped, and functions would be called by just manipulating PCURR. Spin natively supports signed int, unsigned short and unsigned char. Signed short and signed char can be supported with the sign extension opcodes, though they would be less efficient than unsigned short and char. unsigned int would require extra operations, but I think those could be easily handled.
Well, I'm happy to be proven wrong, but I don't agree with some of your premises. Spin bytecodes are not general purpose - they were designed by a very clever man with a very specific goal in mind - to be very efficient at encoding the Spin language. Not the C language.
It is also not really true to say that the Spin compiler produces efficient code. The Spin compiler does almost no optimization of Spin programs at all. The only complexity of Spin is in the rich set of expressions and operators (which oddly enough are very C-like). But it does not do even very simple expression optimization - you even have to tell Spin when an expression evaluates to a constant if you want it to evaluate it at compile-time rather than run-time!
Instead, Spin relies heavily on the programmer writing efficient code. In a way, the Spin language limits you to writing programs that can be efficiently represented in terms of Spin byte codes - i.e. the programmer does most of the hard work, not the compiler.
Writing a C compiler that would do the same job effectively would be ... well ... not impossible - but certainly impractical.
Compression of LMM code is an interesting idea, and as jazzed mentioned it's one that the GCC team has discussed as well. It's definitely a worthwhile goal, and I'm glad you're looking into it Ross. I'm curious to know how you approached it. I can think of four feasible ways, but perhaps there are others:
(1) Compress blocks of code using an algorithm such as zip, and decompress them at run time. This would function a lot like XMM mode -- there would be a cache of decompressed blocks that the code would actually run out of. I think this method has the potential to have the best compression, but managing inter-block branches would be a bit of a pain (there'd have to be a table mapping uncompressed addresses to the beginning of compressed blocks).
(2) Use a compressed encoding for all instructions (like jazzed suggested). Offers decent compression, but reduces the available instruction space and may make some operations awkward.
(3) Have a mode bit and toggle between "normal PASM" and "compressed PASM" instructions. There could either be an explicit mode toggle instruction (which in normal PASM could just be setting a bit in a register) or jump instructions could toggle mode based on something like the bottom bit of the address. Has the advantage that the user can explicitly choose the performance required for functions (compressing less frequently called functions).
(4) Define the instruction set such that the full PASM case is a subset of compressed PASM, e.g. if the high bit of the 16 bit compressed instruction is set then fetch the next 16 bits and turn it into a full PASM instruction, or conversely if certain bits of a full PASM instruction are clear then interpret it as 2 compressed instructions. This way compression granularity can be at a sub-function level.
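A fetch loop for method (4) might look something like the sketch below. The high-bit convention, field widths, and names are invented purely to illustrate the escape mechanism, not any actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of method (4): each 16-bit fetch is either a compressed
 * instruction or an escape to a full 32-bit PASM instruction.
 * The bit layout here is hypothetical. */
typedef struct {
    uint32_t opcode;
    int      is_compressed;
} decoded_t;

static decoded_t fetch_next(const uint16_t *mem, size_t *pc)
{
    decoded_t d;
    uint16_t  w = mem[(*pc)++];

    if (w & 0x8000) {
        /* High bit set: combine with the next 16 bits to rebuild a
         * full 32-bit instruction. */
        d.opcode        = ((uint32_t)(w & 0x7FFF) << 16) | mem[(*pc)++];
        d.is_compressed = 0;
    } else {
        /* Otherwise the word itself is a compressed instruction. */
        d.opcode        = w;
        d.is_compressed = 1;
    }
    return d;
}
```

Because the escape is decided per instruction, the compiler can freely mix compressed and full instructions within a single function - which is the sub-function granularity mentioned above.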
It sounds like you've chosen some variant of method (4), is that correct?
Eric
Hi Eric,
I considered (1), but decided against it. It would be ok if you had cached XMM RAM (i.e. you could unpack as you load into the cache) but if you have XMM then code size is not really a problem anyway, and you're better off with "traditional" LMM. I can't see this solution being useful on a "bare bones" Propeller.
I use some techniques similar to (2), but Jazzed's solution is not quite workable "as is", and to make it workable means you would most likely miss the 3 hub cycle "sweet spot" he is trying to achieve. I think 4 hub cycles may be possible, and that may be fast enough to be useful - but I also have doubts about whether this technique by itself can reduce code sizes enough to make it worthwhile. But I could be wrong on this, so I hope someone picks this one up and runs with it - it's a very clean and simple solution.
Your (3) and (4) are both getting closer to the mark - I do switch in and out of LMM PASM - that's why I call it a "hybrid" kernel. I am also incorporating some FCACHE techniques, so I can also switch in and out of COG PASM. This is still in the experimental stage.
The fact is that none of these is the whole answer individually - it's all a matter of getting the right balance. I don't think I have it quite yet, but I'm getting a bit closer.
I have a fairly good range of actual results for Spin vs C using CMM and LMM now, so I thought I'd post them:
FIBO
Size is the size of the fibo function in bytes.
Speed is number of executions of fibo(26) in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 26 1.00
CMM 46 1.23 CMM is 1.2x speed, 1.8x size of Spin
LMM 100 1.96 LMM is 2.0x speed, 3.8x size of Spin
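For reference, the fibo benchmark is presumably the classic doubly-recursive Fibonacci function (the thread doesn't show the source, so this is an assumed reconstruction):

```c
/* Assumed form of the "fibo" benchmark - the classic doubly-recursive
 * Fibonacci function; fibo(0)=0, fibo(1)=1. */
unsigned fibo(unsigned n)
{
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);   /* two recursive calls per level */
}
```

The benchmark is dominated by call/return overhead, which is why interpreter dispatch cost shows up so strongly in the speed column.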
DHRYSTONE
Size is the total size of the ProcX and FuncX functions in bytes.
Speed is number of executions of dhrystone in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 522 4740
CMM 972 6290 CMM is 1.3x speed, 1.9x size of Spin
LMM 1816 18770 LMM is 4.0x speed, 3.5x size of Spin
GRAPHICS_DEMO
Size is the total size of ALL non-PASM graphics code in bytes.
Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
[B]SIZE SPEED NOTES[/B]
SPIN 2500 4.17
CMM 4704 4.55 CMM is 1.1x speed, 1.9x size of Spin
LMM 9168 4.55 LMM is 1.1x speed, 3.7x size of Spin
XXTEA
Size is the size of the btea function in bytes.
Speed is the number of executions of btea in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 326 770
CSPIN 324 812 CSPIN is 1.1x speed, 1.0x size of Spin
ZOG 369 750 ZOG is 1.0x speed, 1.1x size of Spin
CMM 534 2083 CMM is 2.7x speed, 1.6x size of Spin
LMM 1092 7143 LMM is 9.3x speed, 3.3x size of Spin
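For context, btea is the standard public-domain XXTEA routine by Wheeler and Needham, which these ports are presumably all based on. The reference implementation encodes a block of n 32-bit words in place when n > 1 and decodes when n < -1:

```c
#include <stdint.h>

#define DELTA 0x9e3779b9
#define MX (((z>>5^y<<2) + (y>>3^z<<4)) ^ ((sum^y) + (key[(p&3)^e] ^ z)))

/* Public-domain XXTEA (Wheeler & Needham). |n| is the block length
 * in 32-bit words; n > 1 encodes, n < -1 decodes, both in place. */
void btea(uint32_t *v, int n, uint32_t const key[4])
{
    uint32_t y, z, sum;
    unsigned p, rounds, e;

    if (n > 1) {                        /* encode */
        rounds = 6 + 52 / n;
        sum = 0;
        z = v[n - 1];
        do {
            sum += DELTA;
            e = (sum >> 2) & 3;
            for (p = 0; p < (unsigned)n - 1; p++) {
                y = v[p + 1];
                z = v[p] += MX;
            }
            y = v[0];
            z = v[n - 1] += MX;
        } while (--rounds);
    } else if (n < -1) {                /* decode */
        n = -n;
        rounds = 6 + 52 / n;
        sum = rounds * DELTA;
        y = v[0];
        do {
            e = (sum >> 2) & 3;
            for (p = n - 1; p > 0; p--) {
                z = v[p - 1];
                y = v[p] -= MX;
            }
            z = v[n - 1];
            y = v[0] -= MX;
            sum -= DELTA;
        } while (--rounds);
    }
}
```

It is a good code-density benchmark precisely because it is a dense tangle of shifts, XORs, adds and indexed loads - no single instruction dominates.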
TEST_STDIO_FS
Size is the size of all application and file system code in bytes.
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
[B]SIZE SPEED NOTES[/B]
SPIN n/a n/a No equivalent Spin program.
CMM 21k 1990
LMM 38k 2520 LMM is 1.3x speed, 1.8x size of CMM
FILETEST
Size is the size of all application and file system code in bytes.
Speed is not meaningful for this program, since it is interactive.
[B]SIZE SPEED NOTES[/B]
SPIN 12k n/a estimate (based on info provided by Dave Hein)
CMM 21k n/a CMM is 1.8x size of Spin
LMM 37k n/a LMM is 3.1x size of Spin
Conclusions
The code sizes are very consistent - CMM code is 1.6 to 1.9 times the size of Spin - but often half the size of the equivalent LMM program. While the speed results vary more widely, it is clear that although CMM is much slower than LMM, it is still substantially faster than Spin.
CMM seems to me to offer a reasonable compromise between code size and speed.
The most interesting case also happens to be the one where there are statistics for the most implementations - i.e. XXTEA. In this case, we have results not only for Spin, but also for two "byte coded" implementations of C - i.e. Zog and CSPIN. Not surprisingly, these are all quite similar in terms of size and speed - but CMM really stands out here, offering 2.7 times the speed at a cost of only 1.6 times the size.
The other interesting results are for the GRAPHICS_DEMO and TEST_STDIO_FS programs - in both cases CMM allows programs to easily run on a "bare-bones" Propeller that simply won't fit when using LMM. And in both these cases the CMM version executes at very nearly the same speed as the LMM version anyway. Why would you use LMM in such cases?
Thanks for posting your size and speed comparisons. One thing I notice is that it seems that ZOG actually does generate more compact code than LMM. That had been debated but it seems the code density *is* better at least for that one benchmark program.
Your CMM model sounds interesting. Can you post a description of your compressed instruction set?
Thanks for posting your size and speed comparisons. One thing I notice is that it seems that ZOG actually does generate more compact code than LMM. That had been debated but it seems the code density *is* better at least for that one benchmark program.
Your CMM model sounds interesting. Can you post a description of your compressed instruction set?
Thanks,
David
Yes, I found the Zog stats I couldn't find earlier. My recollection was wrong - it seems to have been the performance of Zog, and not the code size, that was a bit disappointing (note that things may have improved since - I'm not sure what Zog is up to these days). Initial hopes were that it might be substantially faster than Spin (perhaps even able to challenge LMM) but in fact it was slightly slower. For LMM of course, it is exactly the opposite - the performance is good, but the code size is ruinous.
I plan to post the whole CMM code generator and kernel as part of Catalina 3.7 soon.
I'm really mostly interested in the instruction set you've designed. Do you have a description of that?
Hi David,
Not a final one - I keep making changes. But here is a bird's eye view of the current state of play:
There are 32 16-bit instructions that map directly to Prop instructions.
There are 32 16-bit instructions that map to kernel "primitives" (stack operations, floating point operations etc).
There are 64 32-bit instructions - some map directly to Prop instructions, and some map to kernel "primitives".
All addressing is 24 bit.
The kernel includes the normal LMM loop, FCACHE capabilities, and also the ability to execute single arbitrary PASM instructions. I'm still working on giving the code generator the ability to use these features.
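As a rough illustration only (Catalina's actual bit layout isn't published here), a dispatch over an instruction set shaped like that description might classify each 16-bit fetch by a 7-bit major opcode - 32 + 32 + 64 = 128 values. Every name and field width below is hypothetical:

```c
#include <stdint.h>

/* Hypothetical classification for an instruction set with 32 16-bit
 * PASM-mapped opcodes, 32 16-bit kernel primitives, and 64 32-bit
 * forms. A 7-bit major opcode is one way to carve up 128 values. */
typedef enum {
    MAP_PASM16,     /* 16-bit, maps directly to a Prop instruction */
    MAP_PRIM16,     /* 16-bit, maps to a kernel primitive          */
    MAP_LONG32      /* 32-bit form: fetch another 16 bits          */
} cmm_class_t;

static cmm_class_t classify(uint16_t w)
{
    unsigned op = w >> 9;               /* hypothetical 7-bit major opcode */
    if (op < 32)  return MAP_PASM16;    /* opcodes 0..31   */
    if (op < 64)  return MAP_PRIM16;    /* opcodes 32..63  */
    return MAP_LONG32;                  /* opcodes 64..127 */
}
```

The 24-bit addressing mentioned above would then live in the operand fields of the 32-bit forms.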
I have a fairly good range of actual results for Spin vs C using CMM and LMM now, so I thought I'd post them:
These results look very encouraging! It'll be great to expand the range of options available for programmers to choose from.
Just for reference I thought I'd include a few PropGCC beta results as well, although I haven't had a chance to compile all the benchmarks. That's unfortunate, because I suspect that the compute intensive benchmarks below skew the results in GCC's favor -- on graphics and file I/O benchmarks the advantages of FCACHE will be much lower and so speed-wise at least GCC will be more comparable to the Catalina LMM. I must confess that I'm surprised by how much faster GCC is than Spin -- I was expecting more like a 5x difference.
I've used the -Os option to propgcc to optimize for size, and used LMM mode for all tests.
FIBO
Size is the size of the fibo function in bytes.
Speed is number of executions of fibo(26) in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 26 1.00
CMM 46 1.23 CMM is 1.2x speed, 1.8x size of Spin
LMM 100 1.96 LMM is 2.0x speed, 3.8x size of Spin
GCC 92 10.78 GCC is 10.7x speed, 3.5x size of Spin
DHRYSTONE
Size is the total size of the ProcX and FuncX functions in bytes.
Speed is number of executions of dhrystone in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 522 4740
CMM 972 6290 CMM is 1.3x speed, 1.9x size of Spin
LMM 1816 18770 LMM is 4.0x speed, 3.5x size of Spin
GCC 1612 69230 GCC is 14.6x speed, 3.1x size of Spin
XXTEA
Size is the size of the btea function in bytes.
Speed is the number of executions of btea in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 326 770
CSPIN 324 812 CSPIN is 1.1x speed, 1.0x size of Spin
ZOG 369 750 ZOG is 1.0x speed, 1.1x size of Spin
CMM 534 2083 CMM is 2.7x speed, 1.6x size of Spin
LMM 1092 7143 LMM is 9.3x speed, 3.3x size of Spin
GCC 700 38226 GCC is 53.0x speed, 2.1x size of Spin
I also compiled xxtea for some other architectures (using gcc) just to see how other architectures code sizes compare:
x86: 459
arm: 520
arm-thumb: 424
Spin and Zog are still champions on the xxtea size, and CMM is very competitive with those other architectures!
The PropGCC results also show that FCACHE makes an enormous difference on calculation intensive programs (like these 3). I expect that the combination of FCACHE and CMM will produce a very interesting size/speed performance point.
My thoughts on a compressed instruction set were to try to make something that was a strict subset of PASM, with an escape to the full set. Something I've been toying with is a 12 bit compressed instruction form, with 4 bits each for instruction, src, and dest. That would work well for GCC (which has 16 registers), and would also allow us to embed the short instructions in the regular PASM bitstream with the condition flags all set to 0 -- in other words, the short forms of instructions would just be NO-OPs in the normal Prop1 instruction set, and the long forms would be the same as they are now. The short forms would thus occupy 16 bits each (but would have to come in pairs, see below) and the bits would look like:
iidd ddss ss00 00ii iidd ddss ss00 00ii
so that if 32 bit aligned they would look like IF_NEVER PASM instructions.
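In other words, the interpreter could use the Prop1 condition field (bits 21:18 of an instruction, where IF_NEVER is 0000) as the discriminator. A minimal sketch of just that check - the short-form field extraction per the iidd-ddss layout above is omitted:

```c
#include <stdint.h>

/* The Prop1 condition field occupies bits 21:18 of an instruction,
 * and IF_NEVER is 0000. Under the proposed encoding, cond == 0 means
 * "this long holds two short-form instructions"; anything else is an
 * ordinary full PASM instruction. */
static int is_short_pair(uint32_t insn)
{
    unsigned cond = (insn >> 18) & 0xF;   /* Prop1 condition field */
    return cond == 0;                     /* IF_NEVER -> short pair */
}
```

This is what makes the scheme backward compatible: existing long PASM never has cond == 0 unless it is a deliberate no-op.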
The advantage is that the CMM interpreter would then be able to run regular LMM code (albeit a bit slower), and changes to the tools to support compressed instructions could be made gradually.
The disadvantages are:
(1) obviously, only 16 instructions -- although one would be an escape to another set of 16
(2) due to the little-endian nature of the Prop1 and the placement of the condition bits we would have to read 32 bits at a time before being able to decode the 2 16 bit short forms; if we wanted to get fancy, though, this would actually allow us to use 14 bits per instruction rather than 12 bits, for a much larger range of compressed instructions
In practice, though, the compiler generates a huge number of mov, movi, add, addi, cmp, branch, and call instructions, so compressing those and the 6 variants of hub read/write would be pretty big wins. Pretty much any other instructions are gravy
The little endian thing is a bit of a pain, since it does defeat some optimizations (when a single short instruction is followed by a long instruction we would have to encode both as long, since instruction decode only works on 32 bit boundaries). Another option would be to give up strict binary compatibility and rotate the words by 16 (making them mixed endian); then the interpreter could correctly detect short vs. long PASM at odd addresses, at the cost of having to rotate the long PASM (including any long PASM in FCACHE blocks).
Anyway, those are my thoughts, but so far they're all theoretical. It sounds like Ross has actually implemented and tested his scheme, which makes it much more valuable!
Eric
Hi Eric,
Yes, the ability to FCACHE whole chunks of these small benchmark programs gives GCC really impressive results ... on the benchmark programs.
But GCC code sizes are still LMM code sizes - i.e. around twice as big as CMM code sizes, and often approaching four times as large as Spin. And at least on the Prop 1, code size is a critical issue. Not much point having great performance results if you can't take advantage of them because your program won't fit in Hub RAM.
But don't despair - GCC is welcome to adopt Catalina's CMM kernel once I finalize it .
In practice, though, the compiler generates a huge number of mov, movi, add, addi, cmp, branch, and call instructions, so compressing those and the 6 variants of hub read/write would be pretty big wins. Pretty much any other instructions are gravy
The little endian thing is a bit of a pain, since it does defeat some optimizations (when a single short instruction is followed by a long instruction we would have to encode both as long, since instruction decode only works on 32 bit boundaries). Another option would be to give up strict binary compatibility and rotate the words by 16 (making them mixed endian); then the interpreter could correctly detect short vs. long PASM at odd addresses, at the cost of having to rotate the long PASM (including any long PASM in FCACHE blocks).
Anyway, those are my thoughts, but so far they're all theoretical. It sounds like Ross has actually implemented and tested his scheme, which makes it much more valuable!
Eric
Hi Eric,
Yes, you're pretty much spot on here. I use a couple of tricks you maybe haven't thought of yet, but then you've probably got a few up your sleeve that I haven't thought of.
Ross.
EDIT: Snipped the post I quoted - I don't actually use the technique of Eric's in the part I snipped out, and I didn't want to mislead anyone. It is the rest of the post that is spot on!
I've found the first example where the CMM version of a C program is larger than the LMM version!
Here is the C program:
#include <propeller.h>

#define BIT 15              // bit number to toggle - this is for the C3!
#define BITMASK (1<<BIT)

void main(void) {
    unsigned count = CNT + CLKFREQ;
    DIRA |= BITMASK;
    OUTA &= BITMASK;
    while (1) {
        OUTA ^= BITMASK;
        WAITCNT(count, CLKFREQ);
    }
}
LMM program size = 80 bytes
CMM program size = 84 bytes!
Back to the drawing board!
Ross.
P.S. yes, yes, I know - the Spin version of this program is only 40 bytes! And no doubt the GCC version is about the same. Don't remind me!
Ok a few tweaks, and this program is now 72 bytes when using CMM. Not a huge saving, maybe - but at least the CMM version is back to being smaller than the LMM version!
Just made a tweak to the CMM code generator that reduces code sizes by over 10% and increases performance by similar amounts in many cases. For example, the updated results for XXTEA are shown below.
XXTEA
Size is the size of the btea function in bytes.
Speed is the number of executions of btea in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 326 770
CSPIN 324 812 CSPIN is 1.1x speed, 1.0x size of Spin
ZOG 369 750 ZOG is 1.0x speed, 1.1x size of Spin
CMM 480 2325 CMM is 3.0x speed, 1.5x size of Spin
LMM 1092 7143 LMM is 9.3x speed, 3.3x size of Spin
In this case, C compiled for CMM now runs three times as fast as Spin, for a code size increase of only 50%. And it is much smaller than LMM.
I also compiled xxtea for some other architectures (using gcc) just to see how other architectures code sizes compare:
x86: 459
arm: 520
arm-thumb: 424
I think Catalina might beat those code sizes by the time I'm done. For instance, the code size above for XXTEA does not take into account any savings to be made by the Catalina Optimizer (which I haven't updated to know about CMM yet).
More updated results with the new code generator ... concentrating now on programs that simply can't be run with LMM (note I seem to have slipped a decimal place in my previous numbers for GRAPHICS DEMO!):
GRAPHICS_DEMO
Size is the total size of ALL non-PASM graphics code in bytes.
Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
[B]SIZE SPEED NOTES[/B]
SPIN 2500 41.7
CMM 4536 45.5 CMM is 1.1x speed, 1.8x size of Spin
LMM 9168 45.5 LMM is 1.1x speed, 3.7x size of Spin
TEST_STDIO_FS
Size is the size of all application and file system code in bytes.
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
[B]SIZE SPEED NOTES[/B]
SPIN n/a n/a No equivalent Spin program.
CMM 19k 2500
LMM 38k 2520 LMM is 1.0x speed, 2.0x size of CMM
In both these cases, CMM is faster than Spin, as fast as LMM, and half the size of LMM.
You could expect the same to be true of any LMM-based compiler for these complex examples - i.e. CMM programs are half the size, but run at the same speed.
Just a quick update - no more significant size or speed improvements, but the Blackbox source level C debugger now works with CMM programs. The next step is to get the Optimizer working, since that should help a bit with both size and performance - and then I'll be ready for a first release.
Catalina 3.7 is nearly ready for posting - I just have to package it up.
Both the BlackBox source level C debugger and the Optimizer are now fully functional in CMM mode. The Optimizer reduces CMM code sizes and also increases performance slightly, so I thought I'd post the latest Spin/CMM/LMM comparison figures:
FIBO
Size is the size of the fibo function in bytes.
Speed is number of executions of fibo(26) in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 26 1.00
CMM 46 1.33 CMM is 1.3x speed, 1.8x size of Spin
LMM 100 1.96 LMM is 2.0x speed, 3.8x size of Spin
DHRYSTONE
Size is the total size of the ProcX and FuncX functions in bytes.
Speed is number of executions of dhrystone in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 522 4740
CMM 918 6820 CMM is 1.4x speed, 1.8x size of Spin
LMM 1816 18770 LMM is 4.0x speed, 3.5x size of Spin
GRAPHICS_DEMO
Size is the total size of ALL non-PASM graphics code in bytes.
Speed is complete demo cycles (10 screen redraws each cycle) in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 2500 4.17
CMM 4376 4.55 CMM is 1.1x speed, 1.8x size of Spin
LMM 9168 4.55 LMM is 1.1x speed, 3.7x size of Spin
XXTEA
Size is the size of the btea function in bytes.
Speed is the number of executions of btea in 10 seconds.
[B]SIZE SPEED NOTES[/B]
SPIN 326 770
CSPIN 324 812 CSPIN is 1.1x speed, 1.0x size of Spin
ZOG 369 750 ZOG is 1.0x speed, 1.1x size of Spin
CMM 470 2083 CMM is 2.7x speed, 1.4x size of Spin
LMM 1092 7143 LMM is 9.3x speed, 3.3x size of Spin
TEST_STDIO_FS
Size is the size of all application and file system code in bytes.
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
[B]SIZE SPEED NOTES[/B]
SPIN n/a n/a No equivalent Spin program.
CMM 18k 2500
LMM 38k 2520 LMM is 1.0x speed, 2.1x size of CMM
FILETEST
Size is the size of all application and file system code in bytes.
Speed is not meaningful for this program, since it is interactive.
[B]SIZE SPEED NOTES[/B]
SPIN 12k n/a estimate (based on info provided by Dave Hein)
CMM 19k n/a CMM is 1.6x size of Spin
LMM 37k n/a LMM is 3.1x size of Spin
Conclusions
While CMM is much slower than LMM, it is substantially faster than Spin - 33% faster even on the very simplest benchmark program (FIBO), and 2.7 times as fast on more complex programs (XXTEA).
But the real reason for using CMM is that it allows useful C programs to be written on a bare Propeller - achieving code sizes that are simply unachievable with an LMM C compiler. With the Optimizer, CMM code sizes are only 1.4 to 1.8 times the size of the equivalent Spin code, and generally less than half the size of the equivalent LMM code.
I will be including the CMM version of the Optimizer free with Catalina 3.7 - I can't really justify charging for it since the scope for code optimization is far less for CMM than for LMM.
Comments
You see the family resemblence, big, slow and prefers to sleep undisturbed for a long time.
Well, I'm happy to be proven wrong, but I don't agree with some of your premises. Spin bytecodes are not general purpose - they were designed by a very clever man with a very specific goal in mind - to be very efficient at encoding the Spin language. Not the C language.
It is also not really true to say that the Spin compiler produces efficient code. The Spin compiler does almost no optimization of Spin programs at all. The only complexity of Spin is in the rich set of expressions and operators (which oddly enough are very C-like). But it does not do even very simple expression optimization - you even have to tell Spin when an expression evaluates to a constant if you want it to evaluate it at compile-time rather than run-time!
Instead, Spin relies heavily on the programmer writing efficient code. In a way, the Spin language limitsyou to writing programs that can be efficiently represented in terms of Spin byte codes - i.e. the programmer does most of the hard work, not the compiler.
Writing a C compiler that would do the same job effectively would be ... well ... not impossible - but certainly impractical.
Ross.
(1) Compress blocks of code using an algorithm such as zip, and decompress them at run time. This would function a lot like XMM mode -- there would be a cache of decompressed blocks that the code would actually run out of. I think this method has the potential to have the best compression, but managing inter-block branches would be a bit of a pain (there'd have to be a table mapping uncompressed addresses to the beginning of compressed blocks).
(2) Use a compressed encoding for all instructions (like jazzed suggested). Offers decent compression, but reduces the available instruction space and may make some operations awkward.
(3) Have a mode bit and toggle between "normal PASM" and "compressed PASM" instructions. There could either be an explicit mode toggle instruction (which in normal PASM could be just be setting a bit in a register) or jump instructions could toggle mode based on something like the bottom bit of the address. Has the advantage that the user can explicitly choose the performance required for functions (compressing less frequently called functions).
(4) Define the instruction set such that the full PASM case is a subset of compressed PASM, e.g. if the high bit of the 16 bit compressed instruction is set then fetch the next 16 bits and turn it into a full PASM instruction, or conversely if certain bits of a full PASM instruction are clear then interpret it as 2 compressed instructions. This way compression granularity can be at a sub-function level.
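For illustration, a rough sketch of how method (4)'s escape-bit decode might work. The field widths and the escape convention here are purely assumptions for the sake of the example, not anyone's actual encoding:

```c
#include <stdint.h>

/* Hypothetical method (4) fetch: a 16-bit slot whose high bit, when set,
   means "combine with the next 16 bits to form a full 32-bit PASM
   instruction". Otherwise the slot is a short compressed instruction. */

#define ESCAPE_BIT 0x8000u

/* Returns the number of 16-bit slots consumed (1 for a short form,
   2 for a full instruction), writing the decoded value to *out. */
static int fetch_instruction(const uint16_t *stream, uint32_t *out)
{
    uint16_t first = stream[0];
    if (first & ESCAPE_BIT) {
        /* escape: this slot plus the next form a full 32-bit instruction */
        *out = ((uint32_t)(first & 0x7FFFu) << 16) | stream[1];
        return 2;
    }
    /* short form: 15 usable bits of compressed instruction */
    *out = first;
    return 1;
}
```

The appeal of this style is exactly the sub-function granularity mentioned above: the decoder decides per instruction, so hot and cold code can be freely mixed.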
It sounds like you've chosen some variant of method (4), is that correct?
Eric
Hi Eric,
I considered (1), but decided against it. It would be ok if you had cached XMM RAM (i.e. you could unpack as you load into the cache) but if you have XMM then code size is not really a problem anyway, and you're better off with "traditional" LMM. I can't see this solution being useful on a "bare bones" Propeller.
I use some techniques similar to (2), but Jazzed's solution is not quite workable "as is", and to make it workable means you would most likely miss the 3 hub cycle "sweet spot" he is trying to achieve. I think 4 hub cycles may be possible, and that may be fast enough to be useful - but I also have doubts about whether this technique by itself can reduce code sizes enough to make it worthwhile. But I could be wrong on this, so I hope someone picks this one up and runs with it - it's a very clean and simple solution.
Your (3) and (4) are both getting closer to the mark - I do switch in and out of LMM PASM - that's why I call it a "hybrid" kernel. I am also incorporating some FCACHE techniques, so I can also switch in and out of COG PASM. This is still in the experimental stage.
The fact is that none of these is the whole answer individually - it's all a matter of getting the right balance. I don't think I have it quite yet, but I'm getting a bit closer.
Ross.
FIBO
Speed is number of executions of fibo(26) in 10 seconds.
Speed is number of executions of dhrystone in 10 seconds.
Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
XXTEA
Speed is the number of executions of btea in 10 seconds.
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
Speed is not meaningful for this program, since it is interactive.
Conclusions
The code sizes are very consistent - CMM code is 1.6 to 1.9 times the size of Spin - but often half the size of the equivalent LMM program. While the speed results vary more widely, it is clear that although CMM is much slower than LMM, it is still substantially faster than Spin.
CMM seems to me to offer a reasonable compromise between code size and speed.
The most interesting case also happens to be the one where there are statistics for the most implementations - i.e. XXTEA. In this case, we have results not only for Spin, but also for two "byte coded" implementations of C - i.e. Zog and CSPIN. Not surprisingly, these are all quite similar in terms of size and speed - but CMM really stands out here, offering 2.7 times the speed at a cost of only 1.6 times the size.
The other interesting results are for the GRAPHICS_DEMO and TEST_STDIO_FS programs - in both cases CMM allows programs to easily run on a "bare-bones" propeller that simply won't fit when using LMM. And in both these cases the CMM version executes at very nearly the same speed as the LMM version anyway. Why would you use LMM in such cases?
Ross.
Thanks for posting your size and speed comparisons. One thing I notice is that it seems that ZOG actually does generate more compact code than LMM. That had been debated but it seems the code density *is* better at least for that one benchmark program.
Your CMM model sounds interesting. Can you post a description of your compressed instruction set?
Thanks,
David
Yes, I found the Zog stats I couldn't find earlier. My recollection was wrong - it seems to have been the performance of Zog, and not the code size, that was a bit disappointing (note that things may have improved since - I'm not sure what Zog is up to these days). Initial hopes were that it might be substantially faster than Spin (perhaps even able to challenge LMM) but in fact it was slightly slower. For LMM of course, it is exactly the opposite - the performance is good, but the code size is ruinous.
I plan to post the whole CMM code generator and kernel as part of Catalina 3.7 soon.
Ross.
I'm really mostly interested in the instruction set you've designed. Do you have a description of that?
Hi David,
Not a final one - I keep making changes. But here is a bird's eye view of the current state of play:
- There are 32 16-bit instructions that map directly to Prop instructions.
- There are 32 16-bit instructions that map to kernel "primitives" (stack operations, floating point operations etc).
- There are 64 32-bit instructions - some map directly to Prop instructions, and some map to kernel "primitives".
- All addressing is 24 bit.
- The kernel includes the normal LMM loop and FCACHE capabilities, as well as the ability to execute single arbitrary PASM instructions. I'm still working on adding support for these features to the code generator.
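For illustration, here is one way an opcode space with those counts (32 + 32 short forms, 64 long forms, 24-bit addresses) could be partitioned. This layout is purely an assumption for the sketch - the actual Catalina encoding may differ:

```c
#include <stdint.h>

/* Illustrative split of a 7-bit opcode space (128 opcodes total):
   0x00-0x1F -> 32 short (16-bit) instructions mapped to Prop instructions
   0x20-0x3F -> 32 short (16-bit) kernel primitives
   0x40-0x7F -> 64 long (32-bit) instructions
   This is NOT the published Catalina encoding - just one plausible layout. */

typedef enum { SHORT_PROP, SHORT_PRIMITIVE, LONG_FORM } cmm_class;

static cmm_class classify(uint8_t opcode7)
{
    if (opcode7 & 0x40) return LONG_FORM;        /* upper half: long forms */
    return (opcode7 & 0x20) ? SHORT_PRIMITIVE    /* second quarter */
                            : SHORT_PROP;        /* first quarter */
}

/* 24-bit addressing: mask an address into the range such a kernel uses */
static uint32_t cmm_addr(uint32_t a) { return a & 0x00FFFFFFu; }
```

The nice property of a split like this is that the kernel can determine instruction length from a few bits of the first byte, before decoding anything else.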
Ross.
Just for reference I thought I'd include a few PropGCC beta results as well, although I haven't had a chance to compile all the benchmarks. That's unfortunate, because I suspect that the compute intensive benchmarks below skew the results in GCC's favor -- on graphics and file I/O benchmarks the advantages of FCACHE will be much lower, and so speed-wise at least GCC will be more comparable to the Catalina LMM. I must confess that I'm surprised by how much faster GCC is than Spin -- I was expecting more like a 5x difference.
I've used the -Os option to propgcc to optimize for size, and used LMM mode for all tests.
FIBO
Speed is number of executions of fibo(26) in 10 seconds.
Speed is number of executions of dhrystone in 10 seconds.
XXTEA
Speed is the number of executions of btea in 10 seconds.
I also compiled xxtea for some other architectures (using gcc) just to see how other architectures code sizes compare:
Spin and Zog are still champions on the xxtea size, and CMM is very competitive with those other architectures!
The PropGCC results also show that FCACHE makes an enormous difference on calculation intensive programs (like these 3). I expect that the combination of FCACHE and CMM will produce a very interesting size/speed performance point.
Eric
My thoughts on a compressed instruction set were to try to make something that was a strict subset of PASM, with an escape to the full set. Something I've been toying with is a 12 bit compressed instruction form, with 4 bits each for instruction, src, and dest. That would work well for GCC (which has 16 registers), and would also allow us to embed the short instructions in the regular PASM bitstream with the condition flags all set to 0 -- in other words, the short forms of instructions would just be NO-OPs in the normal Prop1 instruction set, and the long forms would be the same as they are now. The short forms would thus occupy 16 bits each (but would have to come in pairs, see below), with the bits arranged so that if 32 bit aligned they would look like IF_NEVER PASM instructions.
The advantage is that the CMM interpreter would then be able to run regular LMM code (albeit a bit slower), and changes to the tools to support compressed instructions could be made gradually.
The disadvantages are:
(1) obviously, only 16 instructions -- although one would be an escape to another set of 16
(2) due to the little-endian nature of the Prop1 and the placement of the condition bits we would have to read 32 bits at a time before being able to decode the 2 16 bit short forms; if we wanted to get fancy, though, this would actually allow us to use 14 bits per instruction rather than 12 bits, for a much larger range of compressed instructions
In practice, though, the compiler generates a huge number of mov, movi, add, addi, cmp, branch, and call instructions, so compressing those and the 6 variants of hub read/write would be pretty big wins. Pretty much any other instructions are gravy.
The little endian thing is a bit of a pain, since it does defeat some optimizations (when a single short instruction is followed by a long instruction we would have to encode both as long, since instruction decode only works on 32 bit boundaries). Another option would be to give up strict binary compatibility and rotate the words by 16 (making them mixed endian); then the interpreter could correctly detect short vs. long PASM at odd addresses, at the cost of having to rotate the long PASM (including any long PASM in FCACHE blocks).
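To make the condition-bit trick above concrete, here is a rough sketch. On the Prop1 the condition field occupies bits 21:18 of a 32-bit instruction, and a condition of 0 (IF_NEVER) makes the instruction a no-op when executed natively; an interpreter could therefore treat such a long as a pair of short forms. The 14-bit field split below is just one possible arrangement, not a settled design:

```c
#include <stdint.h>

/* Prop1 condition field: bits 21:18. Zero means IF_NEVER (native no-op),
   which an interpreter can repurpose to flag "two short instructions". */
#define COND_MASK 0x003C0000u

static int is_short_pair(uint32_t insn)
{
    return (insn & COND_MASK) == 0;   /* IF_NEVER: safe to reinterpret */
}

/* Extract two hypothetical 14-bit short instructions from the 28 bits
   left over once the 4 condition bits are excluded. */
static void split_pair(uint32_t insn, uint16_t *a, uint16_t *b)
{
    uint32_t low    = insn & 0x0003FFFFu;          /* bits 17:0  */
    uint32_t high   = (insn >> 22) & 0x000003FFu;  /* bits 31:22 */
    uint32_t bits28 = (high << 18) | low;          /* 28 payload bits */
    *a = (uint16_t)(bits28 & 0x3FFFu);
    *b = (uint16_t)((bits28 >> 14) & 0x3FFFu);
}
```

Note this sketch sidesteps the little-endian ordering issue by decoding a whole 32-bit long at once, which is exactly why the pairs-only constraint arises.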
Anyway, those are my thoughts, but so far they're all theoretical. It sounds like Ross has actually implemented and tested his scheme, which makes it much more valuable!
Eric
Hi Eric,
Yes, the ability to FCACHE whole chunks of these small benchmark programs gives GCC really impressive results ... on the benchmark programs.
But GCC code sizes are still LMM code sizes - i.e. around twice as big as CMM code sizes, and often approaching four times as large as Spin. And at least on the Prop 1, code size is a critical issue. Not much point having great performance results if you can't take advantage of them because your program won't fit in Hub RAM.
But don't despair - GCC is welcome to adopt Catalina's CMM kernel once I finalize it .
Ross.
Hi Eric,
Yes, you're pretty much spot on here. I use a couple of tricks you maybe haven't thought of yet, but then you've probably got a few up your sleeve that I haven't thought of.
Ross.
EDIT: Snipped the post I quoted - I don't actually use the technique of Eric's in the part I snipped out, and I didn't want to mislead anyone. It is the rest of the post that is spot on!
LOL. That's up to Eric.
Eric's solution is an incremental change for GCC where another is probably not.
I've found the first example where the CMM version of a C program is larger than the LMM version!
Here is the C program:
LMM program size = 80 bytes
CMM program size = 84 bytes!
Back to the drawing board!
Ross.
P.S. yes, yes, I know - the Spin version of this program is only 40 bytes! And no doubt the GCC version is about the same. Don't remind me!
Ok a few tweaks, and this program is now 72 bytes when using CMM. Not a huge saving, maybe - but at least the CMM version is back to being smaller than the LMM version!
Ross.
Just made a tweak to the CMM code generator that reduces code sizes by over 10% and increases performance by similar amounts in many cases. For example, the updated results for XXTEA are shown below.
XXTEA
Speed is the number of executions of btea in 10 seconds.
In this case, C compiled for CMM now beats Spin speeds by a factor of 300%, for a code size increase of only 50%. And it is much smaller than LMM.
I think Catalina might beat those code sizes by the time I'm done. For instance, the code size above for XXTEA does not take into account any savings to be made by the Catalina Optimizer (which I haven't updated to know about CMM yet).
Ross.
GRAPHICS_DEMO
Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
TEST_STDIO_FS
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
In both these cases, CMM is faster than Spin, as fast as LMM, and half the size of LMM.
You could expect the same to be true of any LMM-based compiler for these complex examples - i.e. CMM programs are half the size, but run at the same speed.
Ross.
Just a quick update - no more significant size or speed improvements, but the Blackbox source level C debugger now works with CMM programs. The next step is to get the Optimizer working, since that should help a bit with both size and performance - and then I'll be ready for a first release.
Ross.
Catalina 3.7 is nearly ready for posting - I just have to package it up.
Both the BlackBox source level C debugger and the Optimizer are now fully functional in CMM mode. The Optimizer reduces CMM code sizes and also increases performance slightly, so I thought I'd post the latest Spin/CMM/LMM comparison figures:
FIBO
Speed is number of executions of fibo(26) in 10 seconds.
Speed is number of executions of dhrystone in 10 seconds.
Speed is complete demo cycles (10 screen redraws each cycle) in 10 seconds.
XXTEA
Speed is the number of executions of btea in 10 seconds.
Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
Speed is not meaningful for this program, since it is interactive.
Conclusions
While CMM is much slower than LMM, it is substantially faster than Spin - 33% faster even on the very simplest benchmark program (FIBO), and 270% faster on more complex programs (XXTEA).
But the real reason for using CMM is that it allows useful C programs to be written on a bare Propeller - achieving code sizes that are simply unachievable with an LMM C compiler. With the Optimizer, CMM code sizes are only 1.4 to 1.8 times the size of the equivalent Spin code, and generally less than half the size of the equivalent LMM code.
I will be including the CMM version of the Optimizer free with Catalina 3.7 - I can't really justify charging for it since the scope for code optimization is far less for CMM than for LMM.
Ross.
Nice progress.