Some multi-language benchmarks — Parallax Forums

Some multi-language benchmarks

ersmith Posts: 5,900
edited 2015-04-15 08:38 in Propeller 1
It's been a while since the last benchmark thread, and a lot of things have happened since then. So I thought it might be interesting to test the performance of various compilers and interpreters that are in active development.

Of course there's always a debate about what makes a good benchmark, and benchmarks can be very misleading. But they can give a very rough picture of the performance of different solutions, for situations where performance matters. In other situations, of course, other things like ease of use, interactivity, documentation, and availability of help may matter more. So caveat emptor.

I'll follow up this post with a few basic benchmarks: toggling a pin, the Fibonacci benchmark, and some kind of compute-intensive benchmark like an FFT. Heater's fft_bench would be ideal, but I don't think there's a Forth version of it and my Forth skills are not up to porting it, so perhaps some other compute-intensive task might be used instead -- any suggestions? So far I've tested Spin, C (Catalina and PropGCC), Basic (PropBasic), Forth (PropForth and Tachyon Forth) and PASM. If your favorite language/compiler is missing, please feel free to try it out -- these benchmarks are pretty straightforward.

Comments

  • ersmith Posts: 5,900
    edited 2016-10-02 13:47
    Here's timing for a simple benchmark that toggles an output pin 1000 times. This is a very simple task, of course, but it is indicative of a given language's looping and bit manipulation speed. The time taken to do it varies over 2 orders of magnitude.

    The theoretical limit of this is about 4000 cycles (assuming 1 instruction per toggle). In PASM one could get close to this by unrolling the loop. I've chosen in the implementations below to just adopt the most natural looping construct for the languages in question, so for the PASM I've counted the time for a basic djnz loop with one toggle per loop, plus the timing and subroutine call overhead.

    I'm not very experienced with FORTH, and was unable to find an easy way to toggle a pin other than explicitly doing pinhi/pinlo. This means the loop only runs 500 times for the FORTH implementations, giving them somewhat of an advantage over the other ones (which run the loop 1000 times). So the FORTH results are skewed slightly in FORTH's favor; on the other hand my relative inexperience with FORTH probably means that I'm missing some obvious way to improve the loop.

    Compilers tested:
    PropGCC CMM preview
    Catalina 3.8
    PropBasic 1.14 (appears to give the same results as PropBasic 1.27)
    Tachyon Forth as of 9/30/2012
    PropForth 5.0
    fastspin 3.2.0

    Here are the results in number of cycles to toggle a pin 1000 times (source code is attached to this post):
    PASM:                   8020 cycles
    PropBasic (COG mode):   8024
    GCC -Os -mcog:          8044
    fastspin:               8192
    GCC -Os -mlmm:         16416
    PropBasic (LMM):       48176
    Catalina LMM:         176912
    Tachyon Forth:        440912
    GCC -Os -mcmm:        448832
    Catalina CMM -O3:     561360
    Spin:                 966664
    PropForth:           2600704 cycles
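
    The cycle budget behind these numbers can be checked with a little arithmetic. This is a minimal Python sketch, assuming 4 clocks per ordinary Propeller 1 instruction (including a taken djnz); the measured values are copied from the table above, and the variable names are purely illustrative.

```python
# Rough cycle budget for the 1000-toggle loop on the Propeller 1.
# Assumption: 4 clocks per ordinary instruction (including a taken djnz).
CLOCKS_PER_INSTRUCTION = 4
TOGGLES = 1000

# Fully unrolled loop: one xor per toggle.
unrolled_limit = TOGGLES * CLOCKS_PER_INSTRUCTION

# Natural loop: one xor plus one djnz per toggle.
djnz_loop = TOGGLES * 2 * CLOCKS_PER_INSTRUCTION

# Measured cycle counts copied from the results table.
measured = {
    "PASM": 8020,
    "GCC -Os -mlmm": 16416,
    "Spin": 966664,
    "PropForth": 2600704,
}

# Cycles left over for timing and call overhead in the PASM version.
pasm_overhead = measured["PASM"] - djnz_loop

# Slowdown of each entry relative to PASM.
slowdown = {name: cycles / measured["PASM"] for name, cycles in measured.items()}

print(unrolled_limit, djnz_loop, pasm_overhead)
for name, factor in slowdown.items():
    print(f"{name:15s} {factor:7.1f}x")
```

    Under that assumption the djnz loop alone accounts for 8000 of the 8020 PASM cycles, and the slowest entry is a few hundred times slower than PASM, which matches the "2 orders of magnitude" spread.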
    
  • ersmith Posts: 5,900
    edited 2016-10-02 13:52
    Here's another benchmark: the good old Fibonacci numbers. This is a simple test of a recursive function, and basically measures function call overhead.

    This test doesn't have as many compilers as the toggle test. PropBasic doesn't really support recursive functions (technically it does in LMM mode, but without local variables recursive functions are not really practical). PropForth accepted a recursive definition, but did not produce correct results. That may be the fault of my own inexperience with Forth rather than anything else. Tachyon Forth did allow recursion, but after fibo(10) the timing results became very suspicious, and after fibo(11) the answers came out wrong, presumably because of stack overflow. To add a bit of margin of error I've printed the results for fibo(8) only.

    I've also included the size of the fibonacci function itself. Only the fibo() function is measured, not the runtime library or timing code. I've also added tests of Catalina's SMALL memory model and GCC's -mxmm memory model, both of which use external RAM and were tested on a C3 board.

    Absolute size and speed for fibo program; time is in cycles to compute fibo(8), size is in bytes (for both, smaller is better):
                        Time     Size
    PASM                 3940     80
    GCC -Os -mcog        5444    100
    GCC -Os -mlmm       10992     84
    fastspin            32128    104
    GCC -Os -mxmm       57552     84
    GCC -Os -mcmm       58800     26
    Catalina LMM        65392    100
    Catalina CMM -O3   102960     46
    Tachyon Forth      106784     18
    Spin               137360     25
    Catalina SMALL     299632    100
    

    Relative Speed and Size for fibo(8) compared to SPIN
    (higher speed is better, lower size is better)
                    Speed    Size
    TACHYON FORTH    1.29    0.72
    Spin             1.00    1.00
    GCC -mcmm        2.34    1.04
    Catalina CMM     1.33    1.84
    PASM            34.86    3.20
    GCC -mlmm       12.50    3.36
    GCC -mcog       25.23    4.00
    GCC -mxmm        2.39    4.00
    Catalina LMM     2.10    4.00
    Catalina SMALL   0.46    4.00
    fastspin         4.28    4.16
    

    Code is attached.
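
    For readers without the attachment, here is a minimal Python sketch of the benchmark and of how the relative figures can be derived. It assumes the conventional recursive definition (fibo(0) = 0, fibo(1) = 1); the attached sources are not reproduced here, so treat the function body as an assumption. The cycle counts are copied from the table above.

```python
calls = 0  # counts how many times fibo() is entered


def fibo(n):
    """Naive recursive Fibonacci, as commonly used for call-overhead tests."""
    global calls
    calls += 1
    if n < 2:
        return n
    return fibo(n - 1) + fibo(n - 2)


result = fibo(8)

# Cycle counts for fibo(8), copied from the absolute table.
cycles = {"PASM": 3940, "GCC -Os -mcog": 5444, "Tachyon Forth": 106784, "Spin": 137360}

# Average cost of one call, and speed relative to Spin (higher is faster).
cycles_per_call = {name: c / calls for name, c in cycles.items()}
relative_speed = {name: cycles["Spin"] / c for name, c in cycles.items()}

print(result, calls)
for name in cycles:
    print(f"{name:15s} {cycles_per_call[name]:8.1f} cycles/call  {relative_speed[name]:6.2f}x Spin")
```

    Since almost all of the work is calling and returning, dividing the cycle count by the number of calls gives a rough per-call cost for each implementation.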
  • ersmith Posts: 5,900
    edited 2016-10-02 13:55
    Here are some results for Heater's FFT_BENCH benchmark. I don't have
    results for PropBasic, Tachyon Forth, or PropForth (yet... hopefully
    some interested readers will take up the challenge of porting the
    benchmark!).

    This time I have posted total size in bytes of the executable,
    including run time libraries, although I have told the C compilers to
    use the smallest libraries they have for I/O (-ltiny). This makes for
    an interesting comparison of HUB memory used.

    Notes:

    (1) The PASM code size includes the size of the Spin wrapper and test
    harness.

    (2) The GCC results are reported with -Os (optimize for size) and -O2
    (optimize for speed). Strangely, in this particular benchmark the -Os
    code ends up bigger, probably due to -O2 being able to inline and then
    eliminate some functions that are only called once.

    (3) The Catalina LMM results are compiled with -ltiny -lci; Catalina CMM
    adds -C COMPACT -O3.

    (4) Size for Catalina is calculated as code size + cnst size + init
    size, as that more accurately reflects HUB memory available. Size for
    the others is the total size of the .binary file downloaded, and so
    for GCC it includes the run-time kernel, which is 1940 bytes in LMM
    mode and 1892 bytes in CMM mode.

    (5) The JDForth result is as quoted in
    http://forums.parallax.com/showthread.php?129972-fft_bench-An-MCU-benchmark-using-a-simple-FFT-algorithm-in-Spin-C-and-.../page6. The
    size was not specified there, and I don't have JDForth to reproduce it.

    Results sorted by speed:
                 Time (ms)    Total Size (bytes)
    PASM             25          4480
    GCC -O2          47          7288
    GCC CMM -O2      96          5768
    GCC -Os         148          7292
    fastspin        171          5136
    Catalina LMM    348          7488
    GCC CMM -Os     537          5460
    Catalina CMM    763          4536
    JDForth        1200          5428
    Spin           1465          3244
    

    Results sorted by size:
                 Time (ms)    Total Size (bytes)
    Spin           1465          3244
    PASM             25          4480
    Catalina CMM    763          4536
    fastspin        171          5136
    JDForth        1200          5428
    GCC CMM -Os     537          5460
    GCC CMM -O2      96          5768
    GCC -O2          47          7288
    GCC -Os         148          7292
    Catalina LMM    348          7488
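
    Note (4) means the GCC sizes are not directly comparable to the Catalina ones. A minimal Python sketch of subtracting the kernel sizes quoted there follows; it assumes the plain "GCC -O2" and "GCC -Os" rows were built in LMM mode, which the tables do not state explicitly.

```python
# Run-time kernel sizes for GCC, from note (4).
KERNEL_BYTES = {"LMM": 1940, "CMM": 1892}

# (memory model, total .binary size in bytes) from the results tables.
# Assumption: the plain GCC rows are LMM builds.
gcc_results = {
    "GCC -O2": ("LMM", 7288),
    "GCC -Os": ("LMM", 7292),
    "GCC CMM -O2": ("CMM", 5768),
    "GCC CMM -Os": ("CMM", 5460),
}

# Size of everything except the kernel.
without_kernel = {
    name: total - KERNEL_BYTES[model] for name, (model, total) in gcc_results.items()
}

for name, size in without_kernel.items():
    print(f"{name:12s} {size} bytes excluding kernel")
```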
    
  • Heater. Posts: 21,230
    edited 2012-10-01 14:30
    Oh good. I can restate my challenge: "Anyone out there up to porting fft_bench to Forth?" The Spin version is supposed to be the "mother" of all other implementations. With that we can fill in the benchmark results.

    There is no way I am going to live long enough to understand Forth well enough to do this task.

    I was not really thinking of benchmarking when I made that request; I was just curious to see how on earth it might look.
  • Bean Posts: 8,129
    edited 2012-10-01 14:45
    PropBasic would be quite a bit faster if you use

    thepin = ~thepin

    instead of

    TOGGLE thepin

    because TOGGLE sets the pin direction to output each time.

    I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

    Bean
  • jmg Posts: 15,140
    edited 2012-10-01 15:23
    Bean wrote: »
    I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

    Yes, those are impressive.

    I presume the generated PASM can be manually tweaked and included for an even smaller result, in those cases where the effort is warranted.

    E.g. PinToggle looks so close in size that the speed is likely to be the same, but the next example has a 1.3816x speed and 1.25x size delta. Pretty good, but a possible candidate for manual optimization.


    General comment:
    On benchmarks I think the size should include all resources used, otherwise novices might think Spin/Forth are tiny.
    (On a Prop 2, Spin has a different loaded cost, and opens compacted/expanded Spin options.)
  • Humanoido Posts: 5,770
    edited 2012-10-01 15:28
    From an academic perspective, it depends on the method used to toggle a pin. For example, SPIN can be at the top of the list for the fastest pin toggle with a "nano time (1-100-ns) pulsar program" built up from a Cog's counter register that can toggle at every clock, giving a 12.6-ns wide pulse. One thousand cycles comes out to 12.6-microseconds. It’s interesting to note that with a 6.25 MHz crystal the base is a mere 10-us. As each counter module also has its own phase-locked loop (PLL), which can be used to synthesize frequencies up to 128 MHz, perhaps a small program could take this concept even further.
  • RossH Posts: 5,336
    edited 2012-10-01 16:56
    Hi Eric,

    You mention Catalina "5.8" instead of "3.8". Finally, for consistency it would be good to add code size to these results as well.

    EDIT: That's strange - code just appeared. Thanks.
    Ross.
  • ersmith Posts: 5,900
    edited 2012-10-01 16:59
    Bean wrote: »
    PropBasic would be quite a bit faster if you use

    thepin = ~thepin

    instead of

    TOGGLE thepin
    Thanks. I've updated the post with the new version. With that change PropBasic in COG mode moves up just behind GCC in COG mode.
    I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

    Sure. The -S option to propeller-elf-gcc causes it to emit assembly language (in GAS syntax instead of PASM, but the two are very close). Here is the code for the toggle function:
            .text
            .balign 4
            .global _toggle
    _toggle
            mov     r7, r0
            cmps    r0, #0 wz,wc
            add     r7, #1
            mov     r5, .LC0
            IF_B  mov       r7,#1
            '' loop_start register r7 level #1
            jmp     #.L2
    .L3
            xor OUTA,r5
    .L2
            djnz    r7,#.L3
            jmp     lr
            .balign 4
    .LC0
            long    65536
    
    
    There's a bit of fussing around at the start to make sure the parameter is positive and hence suitable for djnz, which accounts for the slightly longer time for GCC versus PASM.

    In LMM mode GCC does not generate djnz, so its loop is a bit bigger (using a cmp and branch). The loop does fit into the FCACHE area in the COG and so executes without LMM overhead, which explains why GCC's LMM time is close to PropBasic's COG time.
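
    To see why the "count + 1, then djnz" arrangement toggles exactly count times, here is a minimal Python model of the listing above. It is only a simulation sketch: register names follow the listing, the function name is illustrative, and 32-bit wraparound is ignored.

```python
def toggle(count, outa=0, mask=65536):
    """Model of the generated _toggle routine; returns (final OUTA, toggles done)."""
    r7 = count + 1      # mov r7, r0 / add r7, #1
    if count < 0:       # cmps r0, #0 / IF_B mov r7, #1
        r7 = 1
    toggles = 0
    while True:         # jmp #.L2 enters the loop at the djnz
        r7 -= 1         # djnz r7, #.L3: decrement first...
        if r7 == 0:     # ...and fall through once the counter hits zero
            break
        outa ^= mask    # .L3: xor OUTA, r5
        toggles += 1
    return outa, toggles


print(toggle(3))     # three toggles leave the pin bit set
print(toggle(0))     # zero or negative counts never touch OUTA
print(toggle(-5))
```

    The extra mov/cmps/add at entry run once per call, not once per iteration, which is consistent with the small fixed gap between the GCC and PASM timings.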
  • ersmith Posts: 5,900
    edited 2012-10-01 17:02
    RossH wrote: »
    You mention Catalina "5.8" instead of "3.8". Finally, for consistency it would be good to add code size to these results as well.

    Oops, sorry -- thanks for catching that typo! I've fixed it in the original post.

    I guess I can update the toggle code to include size, although frankly for such a trivial function none of the languages should emit very much code.

    Eric
  • RossH Posts: 5,336
    edited 2012-10-01 17:05
    ersmith wrote: »
    Oops, sorry -- thanks for catching that typo! I've fixed it in the original post.

    I guess I can update the toggle code to include size, although frankly for such a trivial function none of the languages should emit very much code.

    Eric

    Yes, fair point.
  • ersmith Posts: 5,900
    edited 2012-10-01 17:15
    jmg wrote: »
    On benchmarks I think the size should include all resources used, otherwise novices might think Spin/Forth are tiny.
    (On a Prop 2, Spin has a different loaded cost, and opens compacted/expanded Spin options.)

    That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels. In practice once you've decided to use a given language/compiler the kernel and run time library overhead are basically fixed, and the cost of the program itself is incremental to that. Also, of course, the kernel may be stored in ROM (for Spin) or EEPROM (for some of the other languages) and so not really affect the HUB RAM availability.

    For now I've chosen just to focus on the "incremental" code size of the program or function being benchmarked, but of course comparing total program sizes for various tasks would also be an interesting benchmark, and it would certainly be interesting to compare those -- feel free to post your results!
  • RossH Posts: 5,336
    edited 2012-10-01 19:23
    ersmith wrote: »
    That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels. In practice once you've decided to use a given language/compiler the kernel and run time library overhead are basically fixed, and the cost of the program itself is incremental to that. Also, of course, the kernel may be stored in ROM (for Spin) or EEPROM (for some of the other languages) and so not really affect the HUB RAM availability.

    For now I've chosen just to focus on the "incremental" code size of the program or function being benchmarked, but of course comparing total program sizes for various tasks would also be an interesting benchmark, and it would certainly be interesting to compare those -- feel free to post your results!

    I agree this is tough to handle properly, but I also agree that Eric has taken the best approach. I have also had the view for a while that for high-level languages what we should be counting is just the high-level language code size - or, to be more specific, the amount of Hub RAM that has to be dedicated to holding the application program that is to be executed. This is where languages differ considerably, and this is the statistic most useful to potential users of those languages when comparing one against another (although they may not at first realize it, especially if they look simply at the binary file size).

    I could be convinced to also include const data size, since this often represents code trade-offs (a good example is jump tables often used to implement switch statements in C) - but you would have to separate this stuff out from real const data (e.g. constant strings) that are generally language independent - and this is often quite difficult. So including const data understates the difference between languages, and excluding it overstates the difference - but (on balance) I think excluding it is going to be more accurate.

    I also think the kernel size should be excluded (with a clarification that should be noted below) as should anything else that gets loaded into Cog RAM and thereafter does not consume Hub RAM - i.e. plugins or drivers. On the Propeller (especially the Prop v1) the resource that everyone runs out of first is Hub RAM. Most programs do not use all the available Cog RAM, or all the available cogs. This is especially true now because most C code can be loaded using various multi-phase load techniques that load up all the Cog RAM before loading the Hub RAM with the application code to be executed (with Catalina this is possible even on a bare-bones Prop v1, provided you have a 64kb EEPROM - not sure about GCC, but I think it may be the same). Again, including this stuff understates the difference between languages, and excluding it may overstate the difference, but again on balance I think excluding it is going to give you a more accurate indication of the relative sizes of using different languages.

    The one case where kernel (or driver, or plugin) size can make a difference is when a copy of the cog code has to be kept at run time to be dynamically loaded into cogs - here, Spin has a clear advantage when it comes to the kernel because its kernel is built into the ROM, whereas other languages typically have to keep a copy in Hub RAM. However, this type of dynamic loading is probably rare enough that it can simply be footnoted - including it for all cases would be quite misleading to most people (who will never write programs that use this capability).

    Ross.
  • jmg Posts: 15,140
    edited 2012-10-01 19:53
    ersmith wrote: »
    That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels.

    Oops, I did not write that very clearly, I was not expecting ONE number, but rather more columns added, that show the base-resource-load. - and a % full of each resource would be useful too.

    That way, the code size itself remains very clear, which is the main objective, but the added columns also make it clear what else is going on in the background, to actually run the reported test.

    I'm partly thinking of someone new to all this, reading a benchmark report, and trying to get their head around the details.
  • jmg Posts: 15,140
    edited 2012-10-01 19:58
    RossH wrote: »
    So including const data understates the difference between languages, and excluding it overstates the difference - but (on balance) I think excluding it is going to be more accurate.

    The risk with excluding stuff is that one 'mental scaling' exercise readers will do is "How many copies of this can this chip run?"
    If a resource that was used is excluded, that estimate will be badly skewed.

    That's why I prefer a multi-column report, as you cannot hope to squash all this into a single number.
  • Ariba Posts: 2,682
    edited 2012-10-01 20:03
    Toggle and FIBO are easy to implement but do not have much practical impact.
    On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:

    Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.

    Many applications need that, for example SPI displays, ADCs, serial RAM and so on. And the code is a good mix of hub variable access, subroutine calls and bit handling.

    In Spin, one possible version looks like this:
    CON
     DO  = 0     'pins
     CLK = 1
     CS  = 2
    
    VAR
     byte array[512]
     long i, time
    
    pub Main
     outa[CS]  := 1
     dira[CS]  := 1
     dira[DO]  := 1
     dira[CLK] := 1
    
     repeat i from 0 to 511
       array[i] := i & 255
    
     time := cnt
    
     outa[CS] := 0
     repeat i from 0 to 511
       spiout(array[i])
     outa[CS] := 1
    
     time := cnt-time
    ' print time
    
    pub spiout(value)
     value ><= 8            'MSB first
     repeat 8               '8 bits
       outa[DO] := value
       outa[CLK] := 1
       value >>= 1
       outa[CLK] := 0
    
    The results for this Spin code are:
    11_732_816 clocks
    code size: 96 bytes

    Andy
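
    The bit order produced by spiout can be checked off-target. This is a minimal Python model, assuming Spin's `><` reverses the low 8 bits and clears the rest, and that assigning to a single outa pin uses only bit 0 of the value; the function name spiout_bits is illustrative and it returns the DO levels instead of driving pins.

```python
def spiout_bits(value):
    """Return the sequence of DO levels the Spin spiout routine would emit."""
    # value ><= 8 : reverse the low 8 bits (other bits are cleared)
    value = int(format(value & 0xFF, "08b")[::-1], 2)
    sent = []
    for _ in range(8):
        sent.append(value & 1)  # outa[DO] := value drives DO from bit 0
        value >>= 1             # value >>= 1 moves the next bit into place
    return sent


print(spiout_bits(0x80))  # most significant bit goes out first
print(spiout_bits(0x12))
```

    Reversing once up front lets the inner loop use a plain right shift, so the data still leaves MSB first.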
  • Peter Jakacki Posts: 10,193
    edited 2012-10-01 20:56
    Ariba wrote: »
    Toggle and FIBO are easy to implement but do not have much practical impact.
    On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:

    Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.

    Many applications need that, for example SPI displays, ADCs, serial RAM and so on. And the code is a good mix of hub variable access, subroutine calls and bit handling.

    Andy
    I agree absolutely: when I looked at the benchmarks they didn't really prove anything in the real world, where the bottleneck is always the I/O. There's a real need to be able to attain the necessary I/O speeds for things to update smoothly, and some things only work if you can run at that speed too. Here is my initial version of Ariba's Spin code, just as I would do it.
    0    |< CONSTANT #DO
    1    |< CONSTANT #CLK
    2    |< CONSTANT #CS
    
    TABLE array #512 ALLOT
    
    pub Main
        [SPIO]            \ Load SPIO module
        #CLK 0 COGREG!        \ Setup I/O masks to be bitbashed by SPI module
        #DO 1 COGREG! 
        8  3 COGREG!        \ set number of bits to use for SPIO
        #CS    OUTSET
        #DO    OUTCLR
        #CLK    OUTCLR
        array #512 
          ADO
          I C@ $FF AND I C!
          LOOP
        CNT@
        #CS OUTCLR
        array #512
          ADO
          I C@ RUNMOD DROP   \ run and discard result
          LOOP
        #CS OUTSET    
        CNT@ SWAP - $400A .NUM
        ;
    

    When I run it:
    Main 387,088 ok
    

    The code size here includes the array and constants so that means besides the 512 bytes that there are another 104 bytes used in code etc. The size of Main is actually 71 bytes long including the call to the formatted print word .NUM.
    HERE ' Main - .DEC 0071 ok

    CODE @$5581 - bytes added = 0616 and 10879 bytes free
    NAMES @$226C - bytes added = 0039 and 1564 bytes free
  • Ariba Posts: 2,682
    edited 2012-10-01 23:01
    So Tachyon is 30 times faster than pure Spin here!
    Is this SPIO module done in Forth or is it a PASM routine inside the Forth-Kernel ?
    For sure if Spin uses a PASM SPI routine, then it is also much faster, but then it is more a PASM benchmark result.

    Andy
  • Peter Jakacki Posts: 10,193
    edited 2012-10-01 23:19
    Ariba wrote: »
    So Tachyon is 30 times faster than pure Spin here!
    Is this SPIO module done in Forth or is it a PASM routine inside the Forth-Kernel ?
    For sure if Spin uses a PASM SPI routine, then it is also much faster, but then it is more a PASM benchmark result.

    Andy

    Well then you must discount all references to PASM in any other benchmark! The SPI module is just 10 PASM instructions that help support fast I/O but since space is limited in the kernel (it doesn't use a separate PASM cog) there are 16 longs reserved for special PASM helper modules for repetitive functions. The whole point of Tachyon vs traditional Forths is to take into account the architecture of the Propeller and how it is typically used. There is also the SHROUT instruction which minimizes stack pushing and popping and so without the perfectly valid SPI module even this is what we get:
    TACHYON
    
    0    |< CONSTANT #DO
    1    |< CONSTANT #CLK
    2    |< CONSTANT #CS
    
    
    TABLE array #512 ALLOT
    
    pub spiout ( value --  )
        #24 REV
        #DO SWAP
        #CLK 0 COGREG!
        8  
          FOR 
          SHROUT        \ Shift right the next lsb through the output ( mask data -- mask data/2 )
          CLOCK CLOCK
           NEXT
        2DROP
        ;
            
    pub Main
        #CLK    OUTCLR
        #DO    OUTCLR
        #CS    OUTSET
        array #512 
          ADO
          I C@ $FF AND I C!
          LOOP
        CNT@
        #CS OUTCLR
        array #512
          ADO
          I C@ spiout
          LOOP
        #CS OUTSET    
        CNT@ SWAP - $400A .NUM
        ;
    
    END
    
    
    CODE   @$4E04  - bytes added = 0619 and 12796 bytes free
    NAMES  @$2673  - bytes added = 0049 and 2595 bytes free
    
     Main 1,672,880 ok
    
    
  • Ariba Posts: 2,682
    edited 2012-10-02 00:10
    Well then you must discount all references to PASM in any other benchmark! The SPI module is just 10 PASM instructions that help support fast I/O but since space is limited in the kernel (it doesn't use a separate PASM cog) there are 16 longs reserved for special PASM helper modules for repetitive functions. The whole point of Tachyon vs traditional Forths is to take into account the architecture of the Propeller and how it is typically used. There is also the SHROUT instruction which minimizes stack pushing and popping and so without the perfectly valid SPI module even this is what we get....
    Yes, I see what you mean.
    If this SPI module runs in the same cog as the kernel and can be replaced at runtime with other short PASM routines, then I also see this as a valid Tachyon benchmark result (but not a valid Forth result) - it's just a special feature of Tachyon, like the cache routines in GCC. I wish Spin could do this also (OK, something like that is doable with Spin Interpreter patching at runtime, but with a lot of overhead).

    Andy
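
    To put the three SPI results posted so far side by side, here is a minimal Python sketch that works out cycles per byte and the speedup over Spin. The cycle counts are the ones reported above; the labels are just for illustration.

```python
BYTES_SENT = 512

# Total clocks reported for sending the 512-byte array.
cycles = {
    "Spin": 11_732_816,
    "Tachyon (SPIO module)": 387_088,
    "Tachyon (Forth spiout)": 1_672_880,
}

per_byte = {name: total / BYTES_SENT for name, total in cycles.items()}
speedup = {name: cycles["Spin"] / total for name, total in cycles.items()}

for name in cycles:
    print(f"{name:24s} {per_byte[name]:9.1f} cycles/byte  {speedup[name]:5.1f}x Spin")
```

    With the PASM helper module Tachyon comes out about 30 times faster than Spin, and with the spiout word written in Forth it is still about 7 times faster.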
  • CarlJacobs Posts: 8
    edited 2012-10-02 01:23
    A Forth version of FFT was published in http://forums.parallax.com/showthread.php?129972 (on the last page) in July last year.
  • Bean Posts: 8,129
    edited 2012-10-02 03:59
    ersmith wrote: »
    There's a bit of fussing around at the start to make sure the parameter is positive and hence suitable for djnz, which accounts for the slightly longer time for GCC versus PASM.

    Wow, that is really cool that the compiler is "smart" enough to rearrange things to use DJNZ. I'm impressed.

    Bean
  • Heater. Posts: 21,230
    edited 2012-10-02 04:06
    CarlJacobs,
    A Forth version of FFT was published

    Good grief so it was. And I even replied to you. It had totally slipped my mind.
  • Heater. Posts: 21,230
    edited 2012-10-02 04:29
    Above Peter said:
    Well then you must discount all references to PASM in any other benchmark!

    I totally agree: in a language benchmark, the guts of the benchmark should not rely on parts written in a language that is not the subject of the benchmark.

    I think we are in the right room for an argument here:)

    Looking at the fft_bench in Forth we see that a great speed up can be achieved by coding the work of the inner most loop in PASM instructions. The same can be seen in the SPI code in this thread.

    I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.

    "Ah", you might say, "In that case GCC is not allowed to use FCACHE in LMM code to secretly stash instructions into the kernel of the LMM interpretter where they run much faster".

    I would disagree and say that FCACHE is perfectly acceptable because:
    1) ALL the source code of the benchmark is still written in C.
    2) The FCACHEed parts are still compiled to instructions by the compiler, not written by the author.
    3) It's automatic, the compiler has decided what and how to FCACHE.
    4) In Java and such they have "Just In Time" compilation which compiles Java byte codes to native machine instructions on the fly at run time. I don't think benchmarkers ever rejected that idea.

    Anyway we can now add fft_bench to the results on this thread.
  • Peter Jakacki Posts: 10,193
    edited 2012-10-02 05:31
    Heater. wrote: »
    I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.
    If it's possible to cheat to get ahead then do it I say! :) Of course we are talking about processors and not ethics. The Prop has an instruction that flips bits around for you that would take a small subroutine on other processors. Is that cheating? You betcha! and it's good design. Same with Tachyon if I may say so myself in that the virtual machine instructions are designed in such a way that might not be common but are really useful.

    As for PASM being disqualified, of course you couldn't really disqualify it if it is part of the instruction set or library, even without having to delve into inline PASM; otherwise Spin and every other language would be disqualified, since they are all machine code at the core.

    I purposely avoid having an assembler in Forth as that defeats the purpose of programming in Forth in the first place but well chosen instructions that avoid inline PASM are pursued and implemented as need dictates. Unfortunately 496 longs do limit what you can implement as pure VM instructions so that is why the bare-bones SPI instruction is paged into the cog as needed.
  • Martin_H Posts: 4,051
    edited 2012-10-02 06:21
    Rather than cycles, elapsed time might be an interesting metric. That way you could run a few benchmarks on the Arduino and see how fast the Propeller is. The Arduino's digitalWrite function is supposed to be pretty slow and it is used all the time, so Prop GCC might beat it even on a single thread. On a multi-thread benchmark it should do even better.

    Besides speed there's code compactness, and it looks like the bytecode languages (Spin and Tachyon Forth) win here. I suppose this should be obvious from the buzz I've been hearing about C on the Propeller, but you're effectively reducing hub RAM by a third by using C in LMM. For the cog model it's pretty close to PASM.
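
    Converting the reported cycle counts into elapsed time only needs the system clock frequency. Here is a minimal Python sketch, assuming an 80 MHz clock (the thread does not state the clock used for the tests, so treat that figure as an assumption); the cycle counts are taken from the pin-toggle table.

```python
CLOCK_HZ = 80_000_000  # assumption: Propeller 1 running at 80 MHz


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock frequency."""
    return cycles * 1_000_000 / clock_hz


# Cycle counts for 1000 pin toggles, from the toggle results.
toggle_cycles = {"PASM": 8020, "GCC -Os -mlmm": 16416, "Spin": 966664}

for name, cycles in toggle_cycles.items():
    print(f"{name:15s} {cycles_to_us(cycles):10.2f} us")
```

    Elapsed times like these could then be compared directly with measurements from another board, provided the same number of toggles is timed there.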
  • Heater. Posts: 21,230
    edited 2012-10-02 06:34
    Peter Jakacki,
    If it's possible to cheat to get ahead then do it I say!

    I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.

    However when it comes to benchmarks it's just not cricket:)
    The Prop has an instruction that flips bits around for you that would take a small subroutine on other processors. Is that cheating? You betcha! and it's good design.
    So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?
    As for PASM being disqualified, of course you couldn't really disqualify it if it is part of the instruction set or library, even without having to delve into inline PASM; otherwise Spin and every other language would be disqualified, since they are all machine code at the core.

    Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code. The whole point of these higher level languages is to be portable between architectures, so the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.

    Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.

    So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.

    Did I say we were in the right room for an argument?:)
  • Heater.Heater. Posts: 21,230
    edited 2012-10-02 06:39
    Martin_H,
    ..you're effectively reducing hub ram by a third by using C in LMM.

    That is true, and it is the price you pay for speed. However, both Catalina and propgcc support a compressed memory model (CMM) where the C is compiled to a more compact bytecode-like binary rather than raw 32-bit instructions. The resulting code is somewhat faster than Spin and only a bit bigger, as far as I can tell.
  • ersmithersmith Posts: 5,900
    edited 2012-10-02 07:08
    Heater. wrote: »
    Looking at the fft_bench in Forth we see that a great speed up can be achieved by coding the work of the inner most loop in PASM instructions. The same can be seen in the SPI code in this thread.

    I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.
    I think there's a big difference between the Tachyon SPI example and the JDForth version of FFT. In the SPI case it's using a builtin function of the language -- kind of like calling memcpy() in C, which may be implemented in assembly but is a standard C library function, so it's legit. Of course using the inline PASM in the FFT demo is a different kettle of fish, and I think should be disqualified.

    Is anyone up to porting the JDForth version of FFT to Tachyon and/or PropForth? I started to take a stab at it, but my Forth-fu is weak and it's likely that even if I succeeded it would not be a very good port.
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2012-10-02 07:13
    @Heater says:
    I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.

    I remember a really old movie about Christopher Columbus and how he was sitting at a dinner table asking the person next to him if they could make an egg stand up straight without any supports or anything. Well they couldn't but Chris showed em how by smacking the egg onto the table so that it "sat flat" by itself. Now that is cheating but having an optimized instruction set is sensible and despite the reference to cheating is anything but.

    However when it comes to benchmarks it's just not cricket:)

    So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?

    So the problem is the benchmark and expecting everything to fit to it. Would it be a good idea for Chip to leave a lot of these "non-standard" instructions out for portability or standards sake?????

    Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code in. The whole point of these higher level languages is to be portable between architectures; the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.

    I don't ever expect to port many applications I write for the Prop as the Prop is rather unique. Come on, you really expect Prop OBEX software to run on a PC!?

    Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.

    Oh, what a tiny cage some people make for themselves :)

    So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.

    May Tachyon never be standard Forth and given the nature of the Prop I very much doubt it could ever be anyway.

    Did I say we were in the right room for an argument?:)

    Absolutely, but this room is a little dry for a good argument, it needs something to wet the whistle!