C/Spin benchmarks? - Page 4 — Parallax Forums


Comments

  • Heater.Heater. Posts: 21,230
    edited 2012-07-17 20:23
    Ah, you found Zog's little brother:)

    You see the family resemblance: big, slow, and prefers to sleep undisturbed for a long time.
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-07-17 20:28
    RossH wrote: »
    Sorry, Dave - I was specifically talking about CSPIN with the current pnut Spin interpreter. With the pnut interpreter, 100% ANSI compliance is theoretically possible (it's theoretically possible with a Turing machine!) but completely impractical. If you used another interpreter or "tweaked" the existing one, then it would become practical, and I also agree you should be able to do better (performance-wise) than Spin - getting speeds around twice as fast as Spin should be achievable. However, it will be difficult to achieve this at Spin code sizes. In fact, I'd go so far as to say "impossible" - so I now look forward to being proven wrong :smile:
    .
    I don't understand why you think running C with the current Spin interpreter is impractical. A good C compiler would produce code as efficient as that produced by the Spin compiler. PBASE and VBASE could be used as index registers, and the concept of dynamically relocatable objects wouldn't be needed for C. The method table could be scrapped, and functions would be called by just manipulating PCURR. Spin natively supports signed int, unsigned short and unsigned char. Signed short and signed char can be supported with the sign-extension opcodes, though they would be less efficient than unsigned short and char. Unsigned int would require extra operations, but I think those could be easily handled.
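The extra cost of signed short/char support that Dave mentions amounts to a sign-extension step after an unsigned fetch. A minimal, purely illustrative C sketch (not actual interpreter code):

```c
#include <stdint.h>

/* Hypothetical illustration only: a Spin-style interpreter fetches
   words and bytes unsigned, so a C "signed short" needs an explicit
   sign-extension step after the fetch -- the extra work Dave refers to. */
int32_t sign_extend16(uint32_t fetched_word)
{
    return (int32_t)(int16_t)(fetched_word & 0xFFFF);
}
```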
  • RossHRossH Posts: 5,511
    edited 2012-07-17 22:01
    Dave Hein wrote: »
    A good C compiler would produce code that would be as efficient as that produced by the Spin compiler.

    Well, I'm happy to be proven wrong, but I don't agree with some of your premises. Spin bytecodes are not general purpose - they were designed by a very clever man with a very specific goal in mind - to be very efficient at encoding the Spin language. Not the C language.

    It is also not really true to say that the Spin compiler produces efficient code. The Spin compiler does almost no optimization of Spin programs at all. The only complexity of Spin is in the rich set of expressions and operators (which oddly enough are very C-like). But it does not do even very simple expression optimization - you even have to tell Spin when an expression evaluates to a constant if you want it to evaluate it at compile-time rather than run-time!

    Instead, Spin relies heavily on the programmer writing efficient code. In a way, the Spin language limits you to writing programs that can be efficiently represented in terms of Spin bytecodes - i.e. the programmer does most of the hard work, not the compiler.

    Writing a C compiler that would do the same job effectively would be ... well ... not impossible - but certainly impractical.

    Ross.
  • ersmithersmith Posts: 6,096
    edited 2012-07-18 05:31
    Compression of LMM code is an interesting idea, and as jazzed mentioned it's one that the GCC team has discussed as well. It's definitely a worthwhile goal, and I'm glad you're looking into it Ross. I'm curious to know how you approached it. I can think of four feasible ways, but perhaps there are others:

    (1) Compress blocks of code using an algorithm such as zip, and decompress them at run time. This would function a lot like XMM mode -- there would be a cache of decompressed blocks that the code would actually run out of. I think this method has the potential to have the best compression, but managing inter-block branches would be a bit of a pain (there'd have to be a table mapping uncompressed addresses to the beginning of compressed blocks).

    (2) Use a compressed encoding for all instructions (like jazzed suggested). Offers decent compression, but reduces the available instruction space and may make some operations awkward.

    (3) Have a mode bit and toggle between "normal PASM" and "compressed PASM" instructions. There could either be an explicit mode-toggle instruction (which in normal PASM could just be setting a bit in a register), or jump instructions could toggle the mode based on something like the bottom bit of the address. Has the advantage that the user can explicitly choose the performance required for functions (compressing less frequently called functions).

    (4) Define the instruction set such that the full PASM case is a subset of compressed PASM, e.g. if the high bit of the 16-bit compressed instruction is set then fetch the next 16 bits and turn it into a full PASM instruction, or conversely if certain bits of a full PASM instruction are clear then interpret it as 2 compressed instructions. This way compression granularity can be at a sub-function level.

    It sounds like you've chosen some variant of method (4), is that correct?

    Eric
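Method (4) can be sketched as a decoder whose escape bit widens a 16-bit form into a 32-bit one. Everything below (field widths, the escape convention) is invented for illustration; it is not PropGCC or Catalina code:

```c
#include <stdint.h>

typedef struct {
    uint32_t insn;    /* decoded instruction bits */
    int      is_long; /* 1 if this was a full 32-bit instruction */
} decoded_t;

/* If the high bit of the 16-bit word is set, fetch the next 16 bits
   and combine them into a full instruction; otherwise the word is a
   single compressed instruction (method (4) above). */
decoded_t fetch_insn(const uint16_t *stream, unsigned *pc)
{
    decoded_t d;
    uint16_t w = stream[(*pc)++];
    if (w & 0x8000) {
        d.insn = ((uint32_t)(w & 0x7FFF) << 16) | stream[(*pc)++];
        d.is_long = 1;
    } else {
        d.insn = w;
        d.is_long = 0;
    }
    return d;
}
```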
  • RossHRossH Posts: 5,511
    edited 2012-07-18 17:22
    ersmith wrote: »
    Compression of LMM code is an interesting idea, and as jazzed mentioned it's one that the GCC team has discussed as well. It's definitely a worthwhile goal, and I'm glad you're looking into it Ross. I'm curious to know how you approached it. I can think of four feasible ways, but perhaps there are others:

    (1) Compress blocks of code using an algorithm such as zip, and decompress them at run time. This would function a lot like XMM mode -- there would be a cache of decompressed blocks that the code would actually run out of. I think this method has the potential to have the best compression, but managing inter-block branches would be a bit of a pain (there'd have to be a table mapping uncompressed addresses to the beginning of compressed blocks).

    (2) Use a compressed encoding for all instructions (like jazzed suggested). Offers decent compression, but reduces the available instruction space and may make some operations awkward.

    (3) Have a mode bit and toggle between "normal PASM" and "compressed PASM" instructions. There could either be an explicit mode-toggle instruction (which in normal PASM could just be setting a bit in a register), or jump instructions could toggle the mode based on something like the bottom bit of the address. Has the advantage that the user can explicitly choose the performance required for functions (compressing less frequently called functions).

    (4) Define the instruction set such that the full PASM case is a subset of compressed PASM, e.g. if the high bit of the 16-bit compressed instruction is set then fetch the next 16 bits and turn it into a full PASM instruction, or conversely if certain bits of a full PASM instruction are clear then interpret it as 2 compressed instructions. This way compression granularity can be at a sub-function level.

    It sounds like you've chosen some variant of method (4), is that correct?

    Eric

    Hi Eric,

    I considered (1), but decided against it. It would be ok if you had cached XMM RAM (i.e. you could unpack as you load into the cache), but if you have XMM then code size is not really a problem anyway, and you're better off with "traditional" LMM. I can't see this solution being useful on a "bare bones" Propeller.

    I use some techniques similar to (2), but Jazzed's solution is not quite workable "as is", and to make it workable means you would most likely miss the 3 hub cycle "sweet spot" he is trying to achieve. I think 4 hub cycles may be possible, and that may be fast enough to be useful - but I also have doubts about whether this technique by itself can reduce code sizes enough to make it worthwhile. But I could be wrong on this, so I hope someone picks this one up and runs with it - it's a very clean and simple solution.

    Your (3) and (4) are both getting closer to the mark - I do switch in and out of LMM PASM - that's why I call it a "hybrid" kernel. I am also incorporating some FCACHE techniques, so I can also switch in and out of COG PASM. This is still in the experimental stage.

    The fact is that none of these is the whole answer individually - it's all a matter of getting the right balance. I don't think I have it quite yet, but I'm getting a bit closer.

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-07-22 01:09
    I have a fairly good range of actual results for Spin vs C using CMM and LMM now, so I thought I'd post them:

    FIBO
    Size is the size of the fibo function in bytes.
    Speed is number of executions of fibo(26) in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN      26        1.00
    CMM       46        1.23      CMM is 1.2x speed, 1.8x size of Spin
    LMM      100        1.96      LMM is 2.0x speed, 3.8x size of Spin
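For reference, the fibo benchmark is conventionally the naive doubly recursive function shown below (an assumption - the exact Catalina test source isn't reproduced in this thread); it mostly measures call/return overhead:

```c
/* Naive recursive Fibonacci -- the usual form of the fibo benchmark.
   (Assumed; the thread does not show the actual test source.) */
unsigned fibo(unsigned n)
{
    if (n < 2)
        return n;
    return fibo(n - 1) + fibo(n - 2);
}
```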
    
    DHRYSTONE
    Size is the total size of the ProcX and FuncX functions in bytes.
    Speed is number of executions of dhrystone in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     522        4740
    CMM      972        6290      CMM is 1.3x speed, 1.9x size of SPIN
    LMM     1816       18770      LMM is 4.0x speed, 3.5x size of Spin
    
    GRAPHICS_DEMO
    Size is the total size of the ALL non-PASM graphics code in bytes.
    Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
        SIZE          SPEED       NOTES
    
    SPIN     2500        4.17
    CMM      4704        4.55      CMM is 1.1x speed, 1.9x size of Spin
    LMM      9168        4.55      LMM is 1.1x speed, 3.7x size of Spin
    

    XXTEA
    Size is the size of the btea function in bytes.
    Speed is the number of executions of btea in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     326        770       
    CSPIN    324        812       CSPIN is 1.1x speed, 1.0x size of Spin 
    ZOG      369        750       ZOG   is 1.0x speed, 1.1x size of Spin
    CMM      534       2083       CMM   is 2.7x speed, 1.6x size of Spin
    LMM     1092       7143       LMM   is 9.3x speed, 3.3x size of Spin
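The btea function benchmarked here is Wheeler and Needham's public-domain XXTEA block cipher; the standard reference version is shown below (the source actually compiled for these tests may differ in minor details):

```c
#include <stdint.h>

#define DELTA 0x9e3779b9
#define MX (((z>>5^y<<2) + (y>>3^z<<4)) ^ ((sum^y) + (key[(p&3)^e] ^ z)))

/* XXTEA: encodes n > 1 words of v in place; decodes when n < -1. */
void btea(uint32_t *v, int n, uint32_t const key[4])
{
    uint32_t y, z, sum;
    unsigned p, rounds, e;
    if (n > 1) {                      /* encode */
        rounds = 6 + 52 / n;
        sum = 0;
        z = v[n - 1];
        do {
            sum += DELTA;
            e = (sum >> 2) & 3;
            for (p = 0; p < (unsigned)(n - 1); p++) {
                y = v[p + 1];
                z = v[p] += MX;
            }
            y = v[0];
            z = v[n - 1] += MX;
        } while (--rounds);
    } else if (n < -1) {              /* decode */
        n = -n;
        rounds = 6 + 52 / n;
        sum = rounds * DELTA;
        y = v[0];
        do {
            e = (sum >> 2) & 3;
            for (p = n - 1; p > 0; p--) {
                z = v[p - 1];
                y = v[p] -= MX;
            }
            z = v[n - 1];
            y = v[0] -= MX;
            sum -= DELTA;
        } while (--rounds);
    }
}
```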
    
    TEST_STDIO_FS
    Size is the size of all application and file system code in bytes.
    Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
        SIZE          SPEED       NOTES
    
    SPIN     n/a        n/a        No equivalent Spin program. 
    CMM      21k        1990      
    LMM      38k        2520       LMM is 1.3x speed, 1.8x size of CMM
    
    FILETEST
    Size is the size of all application and file system code in bytes.
    Speed is not meaningful for this program, since it is interactive.
        SIZE          SPEED       NOTES
    
    SPIN      12k        n/a       estimate (based on info provided by Dave Hein)
    CMM       21k        n/a       CMM is 1.8x size of Spin
    LMM       37k        n/a       LMM is 3.1x size of Spin
    

    Conclusions

    The code sizes are very consistent - CMM code is 1.6 to 1.9 times the size of Spin - but often half the size of the equivalent LMM program. While the speed results vary more widely, it is clear that although CMM is much slower than LMM, it is still substantially faster than Spin.

    CMM seems to me to offer a reasonable compromise between code size and speed.

    The most interesting case also happens to be the one where there are statistics for the most implementations - i.e. XXTEA. In this case, we have results not only for Spin, but also for two "byte coded" implementations of C - i.e. Zog and CSPIN. Not surprisingly, these are all quite similar in terms of size and speed - but CMM really stands out here, offering 2.7 times the speed at a cost of only 1.6 times the size.

    The other interesting results are for the GRAPHICS_DEMO and TEST_STDIO_FS programs - in both cases CMM allows programs to easily run on a "bare-bones" propeller that simply won't fit when using LMM. And in both these cases the CMM version executes at very nearly the same speed as the LMM version anyway. Why would you use LMM in such cases?

    Ross.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-07-22 07:57
    Hi Ross,

    Thanks for posting your size and speed comparisons. One thing I notice is that it seems that ZOG actually does generate more compact code than LMM. That had been debated but it seems the code density *is* better at least for that one benchmark program.

    Your CMM model sounds interesting. Can you post a description of your compressed instruction set?

    Thanks,
    David
  • RossHRossH Posts: 5,511
    edited 2012-07-22 15:46
    David Betz wrote: »
    Hi Ross,

    Thanks for posting your size and speed comparisons. One thing I notice is that it seems that ZOG actually does generate more compact code than LMM. That had been debated but it seems the code density *is* better at least for that one benchmark program.

    Your CMM model sounds interesting. Can you post a description of your compressed instruction set?

    Thanks,
    David

    Yes, I found the Zog stats I couldn't find earlier. My recollection was wrong - it seems to have been the performance of Zog, and not the code size, that was a bit disappointing (note that things may have improved since - I'm not sure what Zog is up to these days). Initial hopes were that it might be substantially faster than Spin (perhaps even able to challenge LMM) but in fact it was slightly slower. For LMM, of course, it is exactly the opposite - the performance is good, but the code size is ruinous.

    I plan to post the whole CMM code generator and kernel as part of Catalina 3.7 soon.

    Ross.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-07-22 15:48
    RossH wrote: »
    I plan to post the whole CMM code generator and kernel as part of Catalina 3.7 soon.

    Ross.

    I'm really mostly interested in the instruction set you've designed. Do you have a description of that?
  • RossHRossH Posts: 5,511
    edited 2012-07-22 17:24
    David Betz wrote: »
    I'm really mostly interested in the instruction set you've designed. Do you have a description of that?

    Hi David,

    Not a final one - I keep making changes. But here is a bird's eye view of the current state of play:
    • There are 32 16-bit instructions that map directly to Prop instructions.
    • There are 32 16-bit instructions that map to kernel "primitives" (stack operations, floating point operations etc).
    • There are 64 32-bit instructions - some map directly to Prop instructions, and some map to kernel "primitives".
    • All addressing is 24 bit.
    • The kernel includes the normal LMM loop and also FCACHE capabilities, plus the ability to execute single arbitrary PASM instructions. I'm still working on adding support for these features to the code generator.
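One possible way to carve up the opcode space Ross describes is sketched below. The 7-bit field position and the ordering are invented for illustration; Catalina's real CMM encoding is not published in this thread:

```c
#include <stdint.h>

/* Hypothetical decode of Ross's outline: 32 short Prop-mapped ops,
   32 short kernel primitives, and 64 long (32-bit) forms, selected
   by a leading opcode field. All field positions are assumptions. */
enum insn_class { SHORT_PROP, SHORT_PRIM, LONG_FORM };

enum insn_class classify(uint16_t first_word)
{
    unsigned op = first_word >> 9;   /* hypothetical 7-bit opcode field */
    if (op < 32)
        return SHORT_PROP;           /* maps directly to a Prop instruction */
    if (op < 64)
        return SHORT_PRIM;           /* maps to a kernel primitive */
    return LONG_FORM;                /* 32-bit form: consumes a second word */
}
```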
    Ross.
  • ersmithersmith Posts: 6,096
    edited 2012-07-22 20:09
    RossH wrote: »
    I have a fairly good range of actual results for Spin vs C using CMM and LMM now, so I thought I'd post them:
    These results look very encouraging! It'll be great to expand the range of options available for programmers to choose from.

    Just for reference I thought I'd include a few PropGCC beta results as well, although I haven't had a chance to compile all the benchmarks. That's unfortunate, because I suspect that the compute intensive benchmarks below skew the results in GCC's favor -- on graphics and file I/O benchmarks the advantages of FCACHE will be much lower and so speed wise at least GCC will be more comparable to the Catalina LMM. I must confess that I'm surprised by how much faster GCC is than Spin -- I was expecting more like a 5x difference.

    I've used the -Os option to propgcc to optimize for size, and used LMM mode for all tests.

    FIBO
    Size is the size of the fibo function in bytes.
    Speed is number of executions of fibo(26) in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN      26        1.00
    CMM       46        1.23      CMM is  1.2x speed, 1.8x size of Spin
    LMM      100        1.96      LMM is  2.0x speed, 3.8x size of Spin
    GCC       92       10.78      GCC is 10.7x speed, 3.5x size of Spin
    
    DHRYSTONE
    Size is the total size of the ProcX and FuncX functions in bytes.
    Speed is number of executions of dhrystone in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     522        4740
    CMM      972        6290      CMM is  1.3x speed, 1.9x size of SPIN
    LMM     1816       18770      LMM is  4.0x speed, 3.5x size of Spin
    GCC     1612       69230      GCC is 14.6x speed, 3.1x size of Spin
    


    XXTEA
    Size is the size of the btea function in bytes.
    Speed is the number of executions of btea in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     326        770       
    CSPIN    324        812       CSPIN is  1.1x speed, 1.0x size of Spin 
    ZOG      369        750       ZOG   is  1.0x speed, 1.1x size of Spin
    CMM      534       2083       CMM   is  2.7x speed, 1.6x size of Spin
    LMM     1092       7143       LMM   is  9.3x speed, 3.3x size of Spin
    GCC      700      38226       GCC   is 53.0x speed, 2.1x size of Spin
    

    I also compiled xxtea for some other architectures (using gcc) just to see how other architectures code sizes compare:
    x86:       459
    arm:       520
    arm-thumb: 424
    

    Spin and Zog are still champions on the xxtea size, and CMM is very competitive with those other architectures!

    The PropGCC results also show that FCACHE makes an enormous difference on calculation intensive programs (like these 3). I expect that the combination of FCACHE and CMM will produce a very interesting size/speed performance point.

    Eric
  • ersmithersmith Posts: 6,096
    edited 2012-07-22 20:35
    David Betz wrote: »
    I'm really mostly interested in the instruction set you've designed. Do you have a description of that?

    My thoughts on a compressed instruction set were to try to make something that was a strict subset of PASM, with an escape to the full set. Something I've been toying with is a 12 bit compressed instruction form, with 4 bits each for instruction, src, and dest. That would work well for GCC (which has 16 registers), and would also allow us to embed the short instructions in the regular PASM bitstream with the condition flags all set to 0 -- in other words, the short forms of instructions would just be NO-OPs in the normal Prop1 instruction set, and the long forms would be the same as they are now. The short forms would thus occupy 16 bits each (but would have to come in pairs, see below) and the bits would look like:
        iidd ddss ss00 00ii   iidd ddss ss00 00ii
    
    so that if 32-bit aligned they would look like IF_NEVER PASM instructions.
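The trick relies on the Prop1 condition field (bits 21:18), where all zeros means IF_NEVER, i.e. a no-op. A sketch of the test an interpreter could apply (the field position is standard Prop1; the packing itself is Eric's proposal, not shipping code):

```c
#include <stdint.h>

/* Prop1 puts the condition field in bits 21:18; a value of 0 (IF_NEVER)
   makes the instruction a no-op, so no live PASM ever has cond == 0.
   Under the proposal above, such a long could safely be reinterpreted
   as a pair of packed short-form instructions instead. */
int holds_short_pair(uint32_t insn)
{
    return ((insn >> 18) & 0xFu) == 0;
}
```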

    The advantage is that the CMM interpreter would then be able to run regular LMM code (albeit a bit slower), and changes to the tools to support compressed instructions could be made gradually.

    The disadvantages are:

    (1) obviously, only 16 instructions -- although one would be an escape to another set of 16
    (2) due to the little-endian nature of the Prop1 and the placement of the condition bits we would have to read 32 bits at a time before being able to decode the 2 16 bit short forms; if we wanted to get fancy, though, this would actually allow us to use 14 bits per instruction rather than 12 bits, for a much larger range of compressed instructions

    In practice, though, the compiler generates a huge number of mov, movi, add, addi, cmp, branch, and call instructions, so compressing those and the 6 variants of hub read/write would be pretty big wins. Pretty much any other instructions are gravy

    The little endian thing is a bit of a pain, since it does defeat some optimizations (when a single short instruction is followed by a long instruction we would have to encode both as long, since instruction decode only works on 32 bit boundaries). Another option would be to give up strict binary compatibility and rotate the words by 16 (making them mixed endian); then the interpreter could correctly detect short vs. long PASM at odd addresses, at the cost of having to rotate the long PASM (including any long PASM in FCACHE blocks).

    Anyway, those are my thoughts, but so far they're all theoretical. It sounds like Ross has actually implemented and tested his scheme, which makes it much more valuable!

    Eric
  • RossHRossH Posts: 5,511
    edited 2012-07-22 23:38
    ersmith wrote: »
    Just for reference I thought I'd include a few PropGCC beta results as well, although I haven't had a chance to compile all the benchmarks. That's unfortunate, because I suspect that the compute intensive benchmarks below skew the results in GCC's favor -- on graphics and file I/O benchmarks the advantages of FCACHE will be much lower and so speed wise at least GCC will be more comparable to the Catalina LMM. I must confess that I'm surprised by how much faster GCC is than Spin -- I was expecting more like a 5x difference.
    ...

    The PropGCC results also show that FCACHE makes an enormous difference on calculation intensive programs (like these 3). I expect that the combination of FCACHE and CMM will produce a very interesting size/speed performance point.
    Eric

    Hi Eric,

    Yes, the ability to FCACHE whole chunks of these small benchmark programs gives GCC really impressive results ... on the benchmark programs.

    But GCC code sizes are still LMM code sizes - i.e. around twice as big as CMM code sizes, and often approaching four times as large as Spin. And at least on the Prop 1, code size is a critical issue. Not much point having great performance results if you can't take advantage of them because your program won't fit in Hub RAM.

    But don't despair - GCC is welcome to adopt Catalina's CMM kernel once I finalize it :lol:.

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-07-22 23:47
    ersmith wrote: »

    <snip>

    In practice, though, the compiler generates a huge number of mov, movi, add, addi, cmp, branch, and call instructions, so compressing those and the 6 variants of hub read/write would be pretty big wins. Pretty much any other instructions are gravy

    The little endian thing is a bit of a pain, since it does defeat some optimizations (when a single short instruction is followed by a long instruction we would have to encode both as long, since instruction decode only works on 32 bit boundaries). Another option would be to give up strict binary compatibility and rotate the words by 16 (making them mixed endian); then the interpreter could correctly detect short vs. long PASM at odd addresses, at the cost of having to rotate the long PASM (including any long PASM in FCACHE blocks).

    Anyway, those are my thoughts, but so far they're all theoretical. It sounds like Ross has actually implemented and tested his scheme, which makes it much more valuable!

    Eric

    Hi Eric,

    Yes, you're pretty much spot on here. I use a couple of tricks you maybe haven't thought of yet, but then you've probably got a few up your sleeve that I haven't thought of.

    Ross.

    EDIT: Snipped the post I quoted - I don't actually use the technique of Eric's in the part I snipped out, and I didn't want to mislead anyone. It is the rest of the post that is spot on!
  • jazzedjazzed Posts: 11,803
    edited 2012-07-23 10:26
    RossH wrote: »
    But don't despair - GCC is welcome to adopt Catalina's CMM kernel once I finalize it :lol:.

    LOL. That's up to Eric.
    Eric's solution is an incremental change for GCC where another is probably not.
  • RossHRossH Posts: 5,511
    edited 2012-07-24 02:19
    Blast! Too clever for myself!

    I've found the first example where the CMM version of a C program is larger than the LMM version!

    Here is the C program:
    #include <propeller.h>
    
    #define BIT 15 // bit number to toggle - this is for the C3!
    
    #define BITMASK (1<<BIT)
    
    void main(void) {
    
       unsigned count  = CNT+CLKFREQ;
    
       DIRA |= BITMASK;
       OUTA &= BITMASK;
    
       while (1) {
          OUTA ^= BITMASK;
          WAITCNT(count, CLKFREQ);
       }
    }
    

    LMM program size = 80 bytes
    CMM program size = 84 bytes!

    Back to the drawing board!

    Ross.

    P.S. yes, yes, I know - the Spin version of this program is only 40 bytes! And no doubt the GCC version is about the same. Don't remind me!
  • RossHRossH Posts: 5,511
    edited 2012-07-25 04:52
    RossH wrote: »
    LMM program size = 80 bytes
    CMM program size = 84 bytes!

    Ok a few tweaks, and this program is now 72 bytes when using CMM. Not a huge saving, maybe - but at least the CMM version is back to being smaller than the LMM version!

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-07-26 04:12
    All,

    Just made a tweak to the CMM code generator that reduces code sizes by over 10% and increases performance by similar amounts in many cases. For example, the updated results for XXTEA are shown below.

    XXTEA
    Size is the size of the btea function in bytes.
    Speed is the number of executions of btea in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     326        770       
    CSPIN    324        812       CSPIN is  1.1x speed, 1.0x size of Spin 
    ZOG      369        750       ZOG   is  1.0x speed, 1.1x size of Spin
    CMM      480       2325       CMM   is  3.0x speed, 1.5x size of Spin
    LMM     1092       7143       LMM   is  9.3x speed, 3.3x size of Spin
    

    In this case, C compiled for CMM now runs three times as fast as Spin, for a code size increase of only 50%. And it is much smaller than LMM.

    ersmith wrote: »

    I also compiled xxtea for some other architectures (using gcc) just to see how other architectures code sizes compare:
    x86:       459
    arm:       520
    arm-thumb: 424
    

    I think Catalina might beat those code sizes by the time I'm done. For instance, the code size above for XXTEA does not take into account any savings to be made by the Catalina Optimizer (which I haven't updated to know about CMM yet).

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-07-27 02:27
    More updated results with the new code generator ... concentrating now on programs that simply can't be run with LMM (note I seem to have slipped a decimal place in my previous numbers for GRAPHICS_DEMO!):

    GRAPHICS_DEMO
    Size is the total size of the ALL non-PASM graphics code in bytes.
    Speed is complete screen redraws in 10 seconds. Note that the times are for single buffering, since the program won't fit when using double buffering and LMM.
        SIZE          SPEED       NOTES
    
    SPIN     2500        41.7
    CMM      4536        45.5      CMM is 1.1x speed, 1.8x size of Spin
    LMM      9168        45.5      LMM is 1.1x speed, 3.7x size of Spin
    

    TEST_STDIO_FS
    Size is the size of all application and file system code in bytes.
    Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
        SIZE          SPEED       NOTES
    
    SPIN     n/a        n/a        No equivalent Spin program. 
    CMM      19k        2500      
    LMM      38k        2520       LMM is 1.0x speed, 2.0x size of CMM
    

    In both these cases, CMM is faster than Spin, as fast as LMM, and half the size of LMM.

    You could expect the same to be true of any LMM-based compiler for these complex examples - i.e. CMM programs are half the size, but run at the same speed.

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-08-02 05:08
    All,

    Just a quick update - no more significant size or speed improvements, but the Blackbox source level C debugger now works with CMM programs. The next step is to get the Optimizer working, since that should help a bit with both size and performance - and then I'll be ready for a first release.

    Ross.
  • RossHRossH Posts: 5,511
    edited 2012-08-13 05:27
    All,

    Catalina 3.7 is nearly ready for posting - I just have to package it up.

    Both the BlackBox source level C debugger and the Optimizer are now fully functional in CMM mode. The Optimizer reduces CMM code sizes and also increases performance slightly, so I thought I'd post the latest Spin/CMM/LMM comparison figures:

    FIBO
    Size is the size of the fibo function in bytes.
    Speed is number of executions of fibo(26) in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN      26        1.00
    CMM       46        1.33      CMM is 1.3x speed, 1.8x size of Spin
    LMM      100        1.96      LMM is 2.0x speed, 3.8x size of Spin
    
    DHRYSTONE
    Size is the total size of the ProcX and FuncX functions in bytes.
    Speed is number of executions of dhrystone in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     522        4740
    CMM      918        6820      CMM is 1.4x speed, 1.8x size of SPIN
    LMM     1816       18770      LMM is 4.0x speed, 3.5x size of Spin
    
    GRAPHICS_DEMO
    Size is the total size of the ALL non-PASM graphics code in bytes.
    Speed is complete demo cycles (10 screen redraws each cycle) in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     2500        4.17
    CMM      4376        4.55      CMM is 1.1x speed, 1.8x size of Spin
    LMM      9168        4.55      LMM is 1.1x speed, 3.7x size of Spin
    

    XXTEA
    Size is the size of the btea function in bytes.
    Speed is the number of executions of btea in 10 seconds.
        SIZE          SPEED       NOTES
    
    SPIN     326        770       
    CSPIN    324        812       CSPIN is 1.1x speed, 1.0x size of Spin 
    ZOG      369        750       ZOG   is 1.0x speed, 1.1x size of Spin
    CMM      470       2083       CMM   is 2.7x speed, 1.4x size of Spin
    LMM     1092       7143       LMM   is 9.3x speed, 3.3x size of Spin
    
    TEST_STDIO_FS
    Size is the size of all application and file system code in bytes.
    Speed is number of lines that can be written to a file in 10 seconds. Note that to get the file write times I had to remove the file reading code, since otherwise the program won't fit when using LMM.
        SIZE          SPEED       NOTES
    
    SPIN     n/a        n/a        No equivalent Spin program. 
    CMM      18k        2500      
    LMM      38k        2520       LMM is 1.0x speed, 2.1x size of CMM
    
    FILETEST
    Size is the size of all application and file system code in bytes.
    Speed is not meaningful for this program, since it is interactive.
        SIZE          SPEED       NOTES
    
    SPIN      12k        n/a       estimate (based on info provided by Dave Hein)
    CMM       19k        n/a       CMM is 1.6x size of Spin
    LMM       37k        n/a       LMM is 3.1x size of Spin
    

    Conclusions

    While CMM is much slower than LMM, it is substantially faster than Spin - 33% faster even on the very simplest benchmark program (FIBO), and 2.7 times as fast on more complex programs (XXTEA).

    But the real reason for using CMM is that it allows useful C programs to be written on a bare Propeller - achieving code sizes that are simply unachievable with an LMM C compiler. With the Optimizer, CMM code sizes are only 1.4 to 1.8 times the size of the equivalent Spin code, and generally less than half the size of the equivalent LMM code.

    I will be including the CMM version of the Optimizer free with Catalina 3.7 - I can't really justify charging for it since the scope for code optimization is far less for CMM than for LMM.

    Ross.
  • SapiehaSapieha Posts: 2,964
    edited 2012-08-13 05:36
    Hi Ross.

    Nice progress.
    RossH wrote: »
    All,

    Catalina 3.7 is nearly ready for posting - I just have to package it up.

    Both the BlackBox source level C debugger and the Optimizer are now fully functional in CMM mode. The Optimizer reduces CMM code sizes and also increases performance slightly, so I thought I'd post the latest Spin/CMM/LMM comparison figures:

    FIBO
    Size is the size of the fibo function in bytes.
    Speed is the number of executions of fibo(26) in 10 seconds, normalized so that Spin = 1.00.
            [B]SIZE          SPEED       NOTES[/B]
    
    SPIN      26        1.00
    CMM       46        1.33      CMM is 1.3x speed, 1.8x size of Spin
    LMM      100        1.96      LMM is 2.0x speed, 3.8x size of Spin
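For context, the FIBO kernel is the classic doubly-recursive Fibonacci function. A minimal C sketch, assuming the usual formulation (the benchmark's actual source may differ in detail):

```c
/* Sketch of the FIBO benchmark kernel - assumed to be the usual naive
 * doubly-recursive form, which stresses call/return overhead above all. */
int fibo(int n)
{
    if (n < 2)
        return n;                       /* fibo(0) = 0, fibo(1) = 1 */
    return fibo(n - 1) + fibo(n - 2);   /* two recursive calls per level */
}
```

Because the body does almost nothing but call itself, FIBO mostly measures function-call overhead, which is why it shows the smallest gap between the interpreted and LMM models.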
    
    DHRYSTONE
    Size is the total size of the ProcX and FuncX functions in bytes.
    Speed is the number of executions of dhrystone in 10 seconds.
            [B]SIZE          SPEED       NOTES[/B]
    
    SPIN     522        4740
    CMM      918        6820      CMM is 1.4x speed, 1.8x size of Spin
    LMM     1816       18770      LMM is 4.0x speed, 3.5x size of Spin
    
    GRAPHICS_DEMO
    Size is the total size of all non-PASM graphics code in bytes.
    Speed is the number of complete demo cycles (10 screen redraws per cycle) in 10 seconds.
             [B]SIZE          SPEED       NOTES[/B]
    
    SPIN     2500        4.17
    CMM      4484        4.55      CMM is 1.1x speed, 1.8x size of Spin
    LMM      9168        4.55      LMM is 1.1x speed, 3.7x size of Spin
    
