Some multi-language benchmarks

ersmith · 2012-10-01 13:43

It's been a while since the last benchmark thread, and a lot of things have happened since then. So I thought it might be interesting to test the performance of various compilers and interpreters that are in active development.

Of course there's always a debate about what makes a good benchmark, and benchmarks can be very misleading. But they can give a very rough picture of the performance of different solutions, for situations where performance matters. In other situations, of course, other things like ease of use, interactivity, documentation, and availability of help may matter more. So caveat emptor.

I'll follow up this post with a few basic benchmarks: toggling a pin, the Fibonacci benchmark, and some kind of compute-intensive benchmark like an FFT. Heater's fft_bench would be ideal, but I don't think there's a Forth version of it and my Forth skills are not up to porting it, so perhaps some other compute-intensive task might be used instead -- any suggestions? So far I've tested Spin, C (Catalina and PropGCC), Basic (PropBasic), Forth (PropForth and Tachyon Forth) and PASM. If your favorite language/compiler is missing, please feel free to try it out -- these benchmarks are pretty straightforward.

ersmith · 2012-10-01 13:47

Here's timing for a simple benchmark that toggles an output pin 1000 times. This is a very simple task, of course, but it is indicative of a given language's looping and bit manipulation speed. The time taken to do it varies over 2 orders of magnitude.

The theoretical limit of this is about 4000 cycles (assuming 1 instruction per toggle). In PASM one could get close to this by unrolling the loop. I've chosen in the implementations below to just adopt the most natural looping construct for the languages in question, so for the PASM I've counted the time for a basic djnz loop with one toggle per loop, plus the timing and subroutine call overhead.
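As a rough sanity check on those numbers, here is a small sketch of the cycle budget. It assumes the usual Propeller 1 cost of 4 clocks per instruction, and the two-instruction loop body (one toggle plus the djnz) is an assumption based on the description above, not taken from the attached source.

```python
# Rough cycle budget for 1000 pin toggles on a Propeller 1.
# Assumption: every instruction costs 4 clocks (no hub access, no stalls).
CLOCKS_PER_INSTRUCTION = 4
TOGGLES = 1000

# Fully unrolled: one toggle instruction per toggle.
unrolled_cycles = TOGGLES * CLOCKS_PER_INSTRUCTION

# Simple loop: one toggle instruction plus one djnz per iteration.
looped_cycles = TOGGLES * 2 * CLOCKS_PER_INSTRUCTION

print(unrolled_cycles)
print(looped_cycles)
```

Under these assumptions the djnz loop alone accounts for 8000 cycles, which would leave only a handful of cycles for timing and call overhead.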

I'm not very experienced with FORTH, and was unable to find an easy way to toggle a pin other than explicitly doing pinhi/pinlo. This means the loop only runs 500 times for the FORTH implementations, giving them somewhat of an advantage over the other ones (which run the loop 1000 times). So the FORTH results are skewed slightly in their favor; on the other hand, my relative inexperience with FORTH probably means that I'm missing some obvious way to improve the loop.

Compilers tested:
PropGCC CMM preview
Catalina 3.8
PropBasic 1.14 (appears to give the same results as PropBasic 1.27)
Tachyon Forth as of 9/30/2012
PropForth 5.0
fastspin 3.2.0

Here are the results in number of cycles to toggle a pin 1000 times (source code is attached to this post):

PASM:                   8020 cycles
PropBasic (COG mode):   8024
GCC -Os -mcog:          8044
fastspin:               8192
GCC -Os -mlmm:         16416
PropBasic (LMM):       48176
Catalina LMM:         176912
Tachyon Forth:        440912
GCC -Os -mcmm:        448832
Catalina CMM -O3:     561360
Spin:                 966664
PropForth:           2600704 cycles

ersmith · 2012-10-01 13:52

Here's another benchmark: the good old Fibonacci numbers. This is a simple test of a recursive function, and basically measures function call overhead.

This test doesn't have as many compilers as the toggle test. PropBasic doesn't really support recursive functions (technically it does in LMM mode, but without local variables recursive functions are not really practical). PropForth accepted a recursive definition, but did not produce correct results. That may be the fault of my own inexperience with Forth rather than anything else. Tachyon Forth did allow recursion, but after fibo(10) the timing results became very suspicious, and after fibo(11) the answers came out wrong, presumably because of stack overflow. To add a bit of margin of error I've printed the results for fibo(8) only.
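To show why this benchmark is dominated by call overhead, here is a sketch of the usual naive recursive definition with a call counter added. The base cases (fibo(0) = 0, fibo(1) = 1) and the counter are assumptions for illustration; the attached sources may differ in detail.

```python
# Naive recursive Fibonacci with a call counter.
# Assumption: fibo(0) = 0 and fibo(1) = 1, as in the common definition.
calls = 0

def fibo(n):
    """Return the n-th Fibonacci number, counting every invocation."""
    global calls
    calls += 1
    if n < 2:
        return n
    return fibo(n - 1) + fibo(n - 2)

result = fibo(8)
print(result, calls)
```

With this definition fibo(8) needs 67 calls to produce a single small number, so almost all of the time goes into calling and returning.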

I've also included the size of the fibonacci function itself. Only the fibo() function is measured, not the runtime library or timing code. I've also added tests of Catalina's SMALL memory model and GCC's -mxmm memory model, both of which use external RAM and were tested on a C3 board.

Absolute size and speed for fibo program; time is in cycles to compute fibo(8), size is in bytes (for both, smaller is better):

                    Time     Size
PASM                 3940     80
GCC -Os -mcog        5444    100
GCC -Os -mlmm       10992     84
fastspin            32128    104
GCC -Os -mxmm       57552     84
GCC -Os -mcmm       58800     26
Catalina LMM        65392    100
Catalina CMM -O3   102960     46
Tachyon Forth      106784     18
Spin               137360     25
Catalina SMALL     299632    100

Relative Speed and Size for fibo(8) compared to SPIN
(higher speed is better, lower size is better)

                Speed    Size
TACHYON FORTH    1.29    0.72
Spin             1.00    1.00
GCC -mcmm        2.34    1.04
Catalina CMM     1.33    1.84
PASM            34.86    3.20
GCC -mlmm       12.50    3.36
GCC -mcog       25.23    4.00
GCC -mxmm        2.39    4.00
Catalina LMM     2.10    4.00
Catalina SMALL   0.46    4.00
fastspin         4.28    4.16
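The relative figures are just the absolute ones divided against the Spin row. A minimal sketch that reproduces a few of the rows (only a subset of the table is copied here):

```python
# Reproduce a few rows of the relative table from the absolute one.
# Values are (time in cycles, size in bytes), copied from the absolute table.
absolute = {
    "Spin": (137360, 25),
    "PASM": (3940, 80),
    "GCC -mcog": (5444, 100),
    "Tachyon Forth": (106784, 18),
}

spin_time, spin_size = absolute["Spin"]

def relative(name):
    """Return (relative speed, relative size) compared to Spin."""
    time, size = absolute[name]
    # Speed is inverted (Spin time / time) so that higher means faster.
    return round(spin_time / time, 2), round(size / spin_size, 2)

for name in absolute:
    print(name, relative(name))
```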

Code is attached.

ersmith · 2012-10-01 13:52

Here are some results for Heater's FFT_BENCH benchmark. I don't have
results for PropBasic, Tachyon Forth, or PropForth (yet... hopefully
some interested readers will take up the challenge of porting the
benchmark!).

This time I have posted total size in bytes of the executable,
including run time libraries, although I have told the C compilers to
use the smallest libraries they have for I/O (-ltiny). This makes for
an interesting comparison of HUB memory used.

Notes:

(1) The PASM code size includes the size of the Spin wrapper and test
harness.

(2) The GCC results are reported with -Os (optimize for size) and -O2
(optimize for speed). Strangely, in this particular benchmark the LMM -Os
code ends up slightly bigger, probably due to -O2 being able to inline and
then eliminate some functions that are only called once.

(3) The Catalina LMM results are compiled with -ltiny -lci; Catalina CMM
adds -C COMPACT -O3.

(4) Size for Catalina is calculated as code size + cnst size + init
size, as that more accurately reflects HUB memory available. Size for
the others is the total size of the .binary file downloaded, and so
for GCC it includes the run-time kernel, which is 1940 bytes in LMM
mode and 1892 bytes in CMM mode.

(5) The JDForth result is as quoted in
http://forums.parallax.com/showthread.php?129972-fft_bench-An-MCU-benchmark-using-a-simple-FFT-algorithm-in-Spin-C-and-.../page6. The
size was not specified there, and I don't have JDForth to reproduce it.

Results sorted by speed:

             Time (ms)    Total Size (bytes)
PASM             25          4480
GCC -O2          47          7288
GCC CMM -O2      96          5768
GCC -Os         148          7292
fastspin        171          5136
Catalina LMM    348          7488
GCC CMM -Os     537          5460
Catalina CMM    763          4536
JDForth        1200          5428
Spin           1465          3244

Results sorted by size:

             Time (ms)    Total Size (bytes)
Spin           1465          3244
PASM             25          4480
Catalina CMM    763          4536
fastspin        171          5136
JDForth        1200          5428
GCC CMM -Os     537          5460
GCC CMM -O2      96          5768
GCC -O2          47          7288
GCC -Os         148          7292
Catalina LMM    348          7488
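Since note (4) gives the GCC kernel sizes, the GCC totals can be split into kernel and non-kernel bytes. This is only a sketch, and it assumes that the GCC rows without "CMM" in their name are LMM builds.

```python
# Split the GCC totals into kernel and non-kernel bytes using note (4).
# Assumption: the rows without "CMM" in their name are LMM builds.
KERNEL_BYTES = {"LMM": 1940, "CMM": 1892}

gcc_totals = {
    "GCC -O2": ("LMM", 7288),
    "GCC -Os": ("LMM", 7292),
    "GCC CMM -O2": ("CMM", 5768),
    "GCC CMM -Os": ("CMM", 5460),
}

# Bytes left once the run-time kernel is subtracted from the .binary size.
without_kernel = {
    name: total - KERNEL_BYTES[mode]
    for name, (mode, total) in gcc_totals.items()
}

for name, size in without_kernel.items():
    print(name, size)
```

The remainder still includes the C library and test harness, so it is not directly comparable to the Catalina figures, which are calculated differently.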

Heater. · 2012-10-01 14:30

Oh good. I can restate my challenge: "Anyone out there up to porting fft_bench to Forth?" The Spin version is supposed to be the "mother" of all other implementations. With that we can fill in the benchmark results.

There is no way I am going to live long enough to understand Forth well enough to do this task.

I was not really thinking of benchmarking when I made that request; I was just curious to see how on earth it might look.

Bean · 2012-10-01 14:45

PropBasic would be quite a bit faster if you use

thepin = ~thepin

instead of

TOGGLE thepin

because TOGGLE sets the pin direction to output each time.

I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

Bean

jmg · 2012-10-01 15:23

Bean wrote: »

I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

Yes, those are impressive.

I presume the generated PASM can be manually tweaked and included for an even smaller result, for those cases where the effort is warranted.

E.g. the pin toggle looks so close in size that speed is likely to be the same, but the next example has a 1.3816x speed and 1.25x size delta. Pretty good, but a possible candidate for manual optimization.

General comment:
On benchmarks I think the size should include all resources used, otherwise novices might think Spin/Forth are tiny.
(On a Prop 2, Spin has a different loaded cost, and opens up compacted/expanded Spin options.)

Humanoido · 2012-10-01 15:28

From an academic perspective, it depends on the method used to toggle a pin. For example, SPIN can be at the top of the list for the fastest pin toggle with a "nano time (1-100-ns) pulsar program" built up from a Cog's counter register that can toggle at every clock, giving a 12.6-ns wide pulse. One thousand cycles comes out to 12.6 microseconds. It's interesting to note that with a 6.25 MHz crystal the base is a mere 10 us. As each counter module also has its own phase-locked loop (PLL), which can be used to synthesize frequencies up to 128 MHz, perhaps a small program could take this concept even further.

RossH · 2012-10-01 16:56

Hi Eric,

You mention Catalina "5.8" instead of "3.8". Finally, for consistency it would be good to add code size to these results as well.

EDIT: That's strange - code just appeared. Thanks.
Ross.

ersmith · 2012-10-01 16:59

Bean wrote: »

PropBasic would be quite a bit faster if you use

thepin = ~thepin

instead of

TOGGLE thepin

Thanks. I've updated the post with the new version. With that change PropBasic in COG mode moves up just behind GCC in COG mode.

I like to see that the GCC -Os -mcog option gets so close to the PASM code. Is there any way to post the PASM code generated?

Sure. The -S option to propeller-elf-gcc causes it to emit assembly language (in GAS syntax instead of PASM, but the two are very close). Here is the code for the toggle function:

        .text
        .balign 4
        .global _toggle
_toggle
        mov     r7, r0
        cmps    r0, #0 wz,wc
        add     r7, #1
        mov     r5, .LC0
        IF_B  mov       r7,#1
        '' loop_start register r7 level #1
        jmp     #.L2
.L3
        xor OUTA,r5
.L2
        djnz    r7,#.L3
        jmp     lr
        .balign 4
.LC0
        long    65536

There's a bit of fussing around at the start to make sure the parameter is positive and hence suitable for djnz, which accounts for the slightly longer time for GCC versus PASM.
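To make that start-up fussing concrete, here is a small model of the control flow in the listing above. It is a sketch only: the function name is made up for illustration and 32-bit register wrap-around is ignored.

```python
def toggle_count(n):
    """Model the generated loop; return how many times OUTA gets toggled."""
    r7 = n + 1          # mov r7, r0  /  add r7, #1
    if n < 0:           # cmps r0, #0 wz,wc  /  IF_B mov r7, #1
        r7 = 1
    toggles = 0
    while True:         # jmp #.L2 enters the loop at the djnz
        r7 -= 1         # djnz decrements first...
        if r7 == 0:     # ...and falls through once the counter reaches zero
            break
        toggles += 1    # .L3: xor OUTA, r5
    return toggles

print(toggle_count(1000), toggle_count(0), toggle_count(-5))
```

Pre-incrementing the counter and clamping negative values to 1 means a zero or negative argument falls straight through the djnz without toggling at all.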

In LMM mode GCC does not generate djnz, so its loop is a bit bigger (using a cmp and branch). The loop does fit into the FCACHE area in the COG and so executes without LMM overhead, which explains why GCC's LMM time is close to PropBasic's COG time.

ersmith · 2012-10-01 17:02

RossH wrote: »

You mention Catalina "5.8" instead of "3.8". Finally, for consistency it would be good to add code size to these results as well.

Oops, sorry -- thanks for catching that typo! I've fixed it in the original post.

I guess I can update the toggle code to include size, although frankly for such a trivial function none of the languages should emit very much code.

Eric

RossH · 2012-10-01 17:05

ersmith wrote: »

Oops, sorry -- thanks for catching that typo! I've fixed it in the original post.

I guess I can update the toggle code to include size, although frankly for such a trivial function none of the languages should emit very much code.

Eric

Yes, fair point.

ersmith · 2012-10-01 17:15

jmg wrote: »

On benchmarks I think the size should include all resources used, otherwise novices might think Spin/Forth are tiny.
(On a Prop 2, Spin has a different loaded cost, and opens up compacted/expanded Spin options.)

That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels. In practice once you've decided to use a given language/compiler the kernel and run time library overhead are basically fixed, and the cost of the program itself is incremental to that. Also, of course, the kernel may be stored in ROM (for Spin) or EEPROM (for some of the other languages) and so not really affect the HUB RAM availability.

For now I've chosen just to focus on the "incremental" code size of the program or function being benchmarked, but of course comparing total program sizes for various tasks would also be an interesting benchmark -- feel free to post your results!

RossH · 2012-10-01 19:23

ersmith wrote: »

That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels. In practice once you've decided to use a given language/compiler the kernel and run time library overhead are basically fixed, and the cost of the program itself is incremental to that. Also, of course, the kernel may be stored in ROM (for Spin) or EEPROM (for some of the other languages) and so not really affect the HUB RAM availability.

For now I've chosen just to focus on the "incremental" code size of the program or function being benchmarked, but of course comparing total program sizes for various tasks would also be an interesting benchmark -- feel free to post your results!

I agree this is tough to handle properly, but I also agree that Eric has taken the best approach. I have also had the view for a while that for high-level languages what we should be counting is just the high-level language code size - or, to be more specific, the amount of Hub RAM that has to be dedicated to holding the application program that is to be executed. This is where languages differ considerably, and this is the statistic most useful to potential users of those languages when comparing one against another (although they may not at first realize it, especially if they look simply at the binary file size).

I could be convinced to also include const data size, since this often represents code trade-offs (a good example is jump tables often used to implement switch statements in C) - but you would have to separate this stuff out from real const data (e.g. constant strings) that is generally language independent - and this is often quite difficult. So including const data understates the difference between languages, and excluding it overstates the difference - but (on balance) I think excluding it is going to be more accurate.

I also think the kernel size should be excluded (with a clarification that should be noted below) as should anything else that gets loaded into Cog RAM and thereafter does not consume Hub RAM - i.e. plugins or drivers. On the Propeller (especially the Prop v1) the resource that everyone runs out of first is Hub RAM. Most programs do not use all the available Cog RAM, or all the available cogs. This is especially true now because most C code can be loaded using various multi-phase load techniques that load up all the Cog RAM before loading the Hub RAM with the application code to be executed (with Catalina this is possible even on a bare-bones Prop v1, provided you have a 64kb EEPROM - not sure about GCC, but I think it may be the same). Again, including this stuff understates the difference between languages, and excluding it may overstate the difference, but again on balance I think excluding it is going to give you a more accurate indication of the relative sizes of using different languages.

The one case where kernel (or driver, or plugin) size can make a difference is when a copy of the cog code has to be kept at run time to be dynamically loaded into cogs - here, Spin has a clear advantage when it comes to the kernel because its kernel is built into the ROM, whereas other languages typically have to keep a copy in Hub RAM. However, this type of dynamic loading is probably rare enough that it can simply be footnoted - including it for all cases would be quite misleading to most people (who will never write programs that use this capability).

Ross.

jmg · 2012-10-01 19:53

ersmith wrote: »

That's a tough one. You're right that in the real world the resources required for the kernel matter. On the other hand, for tiny benchmarks like these the size of the benchmark code is dwarfed by the size of the kernels.

Oops, I did not write that very clearly. I was not expecting ONE number, but rather more columns added that show the base resource load. A % full of each resource would be useful too.

That way, the code size itself remains very clear, which is the main objective, but the added columns also make it clear what else is going on in the background, to actually run the reported test.

I'm partly thinking of someone new to all this, reading a benchmark report, and trying to get their head around the details.

jmg · 2012-10-01 19:58

RossH wrote: »

So including const data understates the difference between languages, and excluding it overstates the difference - but (on balance) I think excluding it is going to be more accurate.

The risk with excluding stuff is that one 'mental scaling' exercise readers will do is "How many copies of this can this chip run?"
If a resource that was used is excluded, that estimate will be badly skewed.

That's why I prefer a multi-column report, as you cannot hope to squash all this into a single number.

Ariba · 2012-10-01 20:03

Toggle and FIBO are easy to implement but do not have much practical impact.
On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:

Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.

Many applications need that, for example SPI displays, ADCs, serial RAM and so on. And the code is a good mix of hub variable access, subroutine calls and bit handling.

In Spin, a possible implementation looks like this:

CON
 DO  = 0     'pins
 CLK = 1
 CS  = 2

VAR
 byte array[512]
 long i, time

pub Main
 outa[CS]  := 1
 dira[CS]  := 1
 dira[DO]  := 1
 dira[CLK] := 1

 repeat i from 0 to 511
   array[i] := i & 255

 time := cnt

 outa[CS] := 0
 repeat i from 0 to 511
   spiout(array[i])
 outa[CS] := 1

 time := cnt-time
' print time

pub spiout(value)
 value ><= 8            'MSB first
 repeat 8               '8 bits
   outa[DO] := value
   outa[CLK] := 1
   value >>= 1
   outa[CLK] := 0

The results for this Spin code are:
11_732_816 clocks
code size: 96 bytes
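For readers who don't know Spin, here is a host-side model of what spiout puts on the wire. It is a sketch for illustration only: the pins are simulated, and the assumption is that the receiver samples DO on each rising CLK edge.

```python
def reverse_bits(value, bits=8):
    """Reverse the lowest `bits` bits, like Spin's `value ><= 8`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

def spiout(value):
    """Return the DO levels seen on each rising CLK edge, in order."""
    value = reverse_bits(value & 0xFF)   # MSB first
    sampled = []
    for _ in range(8):
        do = value & 1        # outa[DO] := value (only bit 0 reaches the pin)
        sampled.append(do)    # outa[CLK] := 1, receiver samples DO here
        value >>= 1           # value >>= 1
        # outa[CLK] := 0
    return sampled

print(spiout(0xB1))
```

Reversing the byte once up front lets the loop shift right and always output bit 0, which is cheaper in Spin than testing a moving mask on every bit.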

Andy

Peter Jakacki · 2012-10-01 20:56

Ariba wrote: »

Toggle and FIBO are easy to implement but do not have much practical impact.
On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:

Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.

Many applications need that, for example SPI displays, ADCs, serial RAM and so on. And the code is a good mix of hub variable access, subroutine calls and bit handling.

Andy

I agree absolutely. When I looked at the benchmarks, they didn't really prove anything about the real world, where the bottleneck is always the I/O. There's a real need to be able to attain the necessary I/O speeds for things to update smoothly, and some things only work if you can run at that speed too. Here is my initial version of Ariba's Spin code, just as I would do it.

0    |< CONSTANT #DO
1    |< CONSTANT #CLK
2    |< CONSTANT #CS

TABLE array #512 ALLOT

pub Main
    [SPIO]            \ Load SPIO module
    #CLK 0 COGREG!        \ Setup I/O masks to be bitbashed by SPI module
    #DO 1 COGREG! 
    8  3 COGREG!        \ set number of bits to use for SPIO
    #CS    OUTSET
    #DO    OUTCLR
    #CLK    OUTCLR
    array #512 
      ADO
      I C@ $FF AND I C!
      LOOP
    CNT@
    #CS OUTCLR
    array #512
      ADO
      I C@ RUNMOD DROP   \ run and discard result
      LOOP
    #CS OUTSET    
    CNT@ SWAP - $400A .NUM
    ;

When I run it:

Main 387,088 ok

The code size here includes the array and constants, so besides the 512 bytes of the array there are another 104 bytes used in code etc. Main itself is actually 71 bytes long, including the call to the formatted print word .NUM.
HERE ' Main - .DEC 0071 ok

CODE @$5581 - bytes added = 0616 and 10879 bytes free
NAMES @$226C - bytes added = 0039 and 1564 bytes free

Ariba · 2012-10-01 23:01

So Tachyon is 30 times faster than pure Spin here!
Is this SPIO module done in Forth or is it a PASM routine inside the Forth kernel?
For sure, if Spin uses a PASM SPI routine then it is also much faster, but then it is more of a PASM benchmark result.

Andy

Peter Jakacki · 2012-10-01 23:19

Ariba wrote: »

So Tachyon is 30 times faster than pure Spin here!
Is this SPIO module done in Forth or is it a PASM routine inside the Forth kernel?
For sure, if Spin uses a PASM SPI routine then it is also much faster, but then it is more of a PASM benchmark result.

Andy

Well then you must discount all references to PASM in any other benchmark! The SPI module is just 10 PASM instructions that help support fast I/O, but since space is limited in the kernel (it doesn't use a separate PASM cog) there are 16 longs reserved for special PASM helper modules for repetitive functions. The whole point of Tachyon vs traditional Forths is to take into account the architecture of the Propeller and how it is typically used. There is also the SHROUT instruction, which minimizes stack pushing and popping, so even without the perfectly valid SPI module this is what we get:

TACHYON

0    |< CONSTANT #DO
1    |< CONSTANT #CLK
2    |< CONSTANT #CS


TABLE array #512 ALLOT

pub spiout ( value --  )
    #24 REV
    #DO SWAP
    #CLK 0 COGREG!
    8  
      FOR 
      SHROUT        \ Shift right the next lsb through the output ( mask data -- mask data/2 )
      CLOCK CLOCK
       NEXT
    2DROP
    ;
        
pub Main
    #CLK    OUTCLR
    #DO    OUTCLR
    #CS    OUTSET
    array #512 
      ADO
      I C@ $FF AND I C!
      LOOP
    CNT@
    #CS OUTCLR
    array #512
      ADO
      I C@ spiout
      LOOP
    #CS OUTSET    
    CNT@ SWAP - $400A .NUM
    ;

END

CODE   @$4E04  - bytes added = 0619 and 12796 bytes free
NAMES  @$2673  - bytes added = 0049 and 2595 bytes free

 Main 1,672,880 ok
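Putting the three SPI results posted so far side by side, as a quick sketch (the clock counts are copied from the posts above; the per-bit figure simply divides by 512 * 8 and so includes loop and chip-select overhead):

```python
# Clock counts reported in this thread for sending 512 bytes over SPI.
clocks = {
    "Spin": 11_732_816,
    "Tachyon (SPIO module)": 387_088,
    "Tachyon (SHROUT loop)": 1_672_880,
}

BITS_SENT = 512 * 8

for name, total in clocks.items():
    speedup = clocks["Spin"] / total   # how many times faster than Spin
    per_bit = total / BITS_SENT        # average clocks spent per bit sent
    print(f"{name}: {speedup:.1f}x Spin, {per_bit:.1f} clocks per bit")
```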

Ariba · 2012-10-02 00:10

Peter Jakacki wrote: »

Well then you must discount all references to PASM in any other benchmark! The SPI module is just 10 PASM instructions that help support fast I/O, but since space is limited in the kernel (it doesn't use a separate PASM cog) there are 16 longs reserved for special PASM helper modules for repetitive functions. The whole point of Tachyon vs traditional Forths is to take into account the architecture of the Propeller and how it is typically used. There is also the SHROUT instruction, which minimizes stack pushing and popping, so even without the perfectly valid SPI module this is what we get....

Yes, I see what you mean.
If this SPI module runs in the same cog as the kernel and can be replaced at runtime with other short PASM routines, then I also see this as a valid Tachyon benchmark result (but not a valid Forth result) - it's just a special feature of Tachyon, like the cache routines in GCC. I wish Spin could do this also (OK, something like that is doable with Spin Interpreter patching at runtime, but with a lot of overhead).

Andy

CarlJacobs · 2012-10-02 01:23

A Forth version of FFT was published in http://forums.parallax.com/showthread.php?129972 (on the last page) in July last year.

Bean · 2012-10-02 03:59

ersmith wrote: »

There's a bit of fussing around at the start to make sure the parameter is positive and hence suitable for djnz, which accounts for the slightly longer time for GCC versus PASM.

Wow, that is really cool that the compiler is "smart" enough to rearrange things to use DJNZ. I'm impressed.

Bean

Heater. · 2012-10-02 04:06

CarlJacobs,

A Forth version of FFT was published

Good grief so it was. And I even replied to you. It had totally slipped my mind.

Heater. · 2012-10-02 04:29

Above Peter said:

Well then you must discount all references to PASM in any other benchmark!

I totally agree. In a language benchmark the guts of the benchmark should not rely on parts written in a language that is not the subject of the benchmark.

I think we are in the right room for an argument here:)

Looking at the fft_bench in Forth we see that a great speed up can be achieved by coding the work of the inner most loop in PASM instructions. The same can be seen in the SPI code in this thread.

I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.

"Ah", you might say, "In that case GCC is not allowed to use FCACHE in LMM code to secretly stash instructions into the kernel of the LMM interpreter where they run much faster".

I would disagree and say that FCACHE is perfectly acceptable because:
1) ALL the source code of the benchmark is still written in C.
2) The FCACHEed parts are still compiled to instructions by the compiler, not written by the author.
3) It's automatic, the compiler has decided what and how to FCACHE.
4) In Java and such they have "Just In Time" compilation which compiles Java byte codes to native machine instructions on the fly at run time. I don't think benchmarkers ever rejected that idea.

Anyway we can now add fft_bench to the results on this thread.

Peter Jakacki · 2012-10-02 05:31

Heater. wrote: »

I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.

If it's possible to cheat to get ahead then do it I say!

Of course we are talking about processors and not ethics. The Prop has an instruction that flips bits around for you that would take a small subroutine on other processors. Is that cheating? You betcha! And it's good design. The same goes for Tachyon, if I may say so myself, in that the virtual machine instructions are designed in a way that might not be common but is really useful.

As for PASM being disqualified, of course you couldn't really disqualify it if it is part of the instruction set or library, even without having to delve into inline PASM; otherwise Spin and every other language would be disqualified, since they are all machine code at the core.

I purposely avoid having an assembler in Forth as that defeats the purpose of programming in Forth in the first place but well chosen instructions that avoid inline PASM are pursued and implemented as need dictates. Unfortunately 496 longs do limit what you can implement as pure VM instructions so that is why the bare-bones SPI instruction is paged into the cog as needed.

Martin_H · 2012-10-02 06:21

Rather than cycles, elapsed time might be an interesting metric. That way you could run a few benchmarks on the Arduino and see how fast the Propeller is. The Arduino's digitalWrite function is supposed to be pretty slow and it is used all the time, so Prop GCC might beat it even on a single thread. On a multi-thread benchmark it should do even better.
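Converting the posted cycle counts to elapsed time is simple once a clock frequency is fixed. A minimal sketch, assuming an 80 MHz system clock (5 MHz crystal with the 16x PLL); the thread does not actually state which clock the measurements used.

```python
# Convert cycle counts to elapsed time.
# Assumption: 80 MHz system clock; the clock used for the measurements
# is not stated in the thread.
CLOCK_HZ = 80_000_000

def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Elapsed time in microseconds for a given cycle count."""
    return cycles * 1_000_000 / clock_hz

# A few toggle benchmark figures quoted earlier in the thread.
for name, cycles in [("PASM", 8020), ("GCC -Os -mlmm", 16416), ("Spin", 966664)]:
    print(f"{name}: {cycles_to_us(cycles):.2f} us")
```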

Besides speed there's code compactness, and it looks like the bytecode languages (Spin and Tachyon Forth) win here. I suppose this should be obvious from the buzz I've been hearing about C on the Propeller, but you're effectively reducing hub RAM by a third by using C in LMM. For the cog model it's pretty close to PASM.

Heater. · 2012-10-02 06:34

Peter Jakacki,

If it's possible to cheat to get ahead then do it I say!

I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.

However when it comes to benchmarks it's just not cricket:)

The Prop has an instruction that flips bits around for you that would take a small subroutine on other processors. Is that cheating? You betcha! And it's good design.

So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?

As for PASM being disqualified, of course you couldn't really disqualify it if it is part of the instruction set or library, even without having to delve into inline PASM; otherwise Spin and every other language would be disqualified, since they are all machine code at the core.

Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code in. The whole point of these higher level languages is to be portable between architectures; the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.

Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.
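For illustration, this is roughly what the portable "small subroutine" looks like when there is no single bit-reverse instruction to lean on. It is a sketch, not the code from any version of fft_bench; the FFT use (reordering indices) is the usual reason such a routine appears in this kind of benchmark.

```python
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of value, one bit per iteration."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

# Bit-reversed index order for an 8-point FFT (3 index bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)
```

The loop costs several operations per bit, which is why a hardware bit-reverse instruction gives such a visible advantage to code that is allowed to use it.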

So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.

Did I say we were in the right room for an argument?:)

Heater. · 2012-10-02 06:39

Martin_H,

..you're effectively reducing hub ram by a third by using C in LMM.

That is true, and it is the price you pay for speed. However both Catalina and propgcc support a compressed memory model (CMM) where the C is compiled to a more compact bytecode-like binary rather than raw 32 bit instructions. The resulting code is somewhat faster than Spin and only a bit bigger, as far as I can tell.

ersmith · 2012-10-02 07:08

Heater. wrote: »

Looking at the fft_bench in Forth we see that a great speed up can be achieved by coding the work of the inner most loop in PASM instructions. The same can be seen in the SPI code in this thread.

I now understand why guys have been touting the speed of Forth for decades, they have been cheating:) This is akin to using inline assembler in C.

I think there's a big difference between the Tachyon SPI example and the JDForth version of FFT. In the SPI case it's using a builtin function of the language -- kind of like calling memcpy() in C, which may be implemented in assembly but is a standard C library function, so it's legit. Of course using the inline PASM in the FFT demo is a different kettle of fish, and I think should be disqualified.

Is anyone up to porting the JDForth version of FFT to Tachyon and/or PropForth? I started to take a stab at it, but my Forth-fu is weak and it's likely that even if I succeeded it would not be a very good port.

Peter Jakacki · 2012-10-02 07:13

@Heater says:
I don't disagree with that. At least when building your applications, if you have to cheat to get the job done then you have to. If you find yourself cheating too much, though, it might be a sign that you have the wrong approach, wrong language, wrong hardware architecture, wrong processor or something. Cheating has a price in making things overly complex, in destroying portability, maintainability etc.

I remember a really old movie about Christopher Columbus and how he was sitting at a dinner table asking the person next to him if they could make an egg stand up straight without any supports or anything. Well they couldn't but Chris showed em how by smacking the egg onto the table so that it "sat flat" by itself. Now that is cheating but having an optimized instruction set is sensible and despite the reference to cheating is anything but.

However when it comes to benchmarks it's just not cricket:)

So it does and I use that instruction in the PASM version of fft_bench. I do not use it in the C version of fft_bench simply because there is no C language equivalent operator or standard function/macro. I could use inline assembler to make use of that instruction but that is no longer C, it's not portable, it's not in the spirit of the benchmark. Can't remember what happened in the Spin version now, does Spin have a bit reverse operator?

So the problem is the benchmark and expecting everything to fit to it. Would it be a good idea for Chip to leave a lot of these "non-standard" instructions out for portability or standards sake?????

Here I disagree. Yes, it's all machine code at the core, but if you are comparing languages you should only be using the language under test to write the benchmark source code in. The whole point of these higher level languages is to be portable between architectures; the benchmarks should be portable too. Forth code should be runnable on my PC as well as on the Prop, for example, as C is.

I don't ever expect to port many applications I write for the Prop as the Prop is rather unique. Come on, you really expect Prop OBEX software to run on a PC!?

Now that is a bit of a grey area, if Forth or C had a bit reverse operation defined as a macro or word in a standard library I would say it's OK to use it, even if that results in a single instruction on some machines or a long subroutine on others. Standards are there to aid portability of source so I accept that this can happen.

Oh, what a tiny cage some people make for themselves

So, if for example you have a special Forth word for bit reversal in your Forth engine, that is OK with me as long as you have a similar Forth engine that runs the same benchmark code on my PC or elsewhere. Of course that Forth may not be "standard" Forth anymore.

May Tachyon never be standard Forth and given the nature of the Prop I very much doubt it could ever be anyway.

Did I say we were in the right room for an argument?:)

Absolutely, but this room is a little dry for a good argument, it needs something to wet the whistle!
