Is anyone up to porting the JDForth version of FFT to Tachyon and/or PropForth?
Now you have me puzzled. Why does it need "porting"?
I know almost nothing about Forth and I only looked at the FFT briefly before I started feeling dizzy, but it has a Forth-only (no PASM) option; in fact that is the default in the source given.
So if it needs porting, either the source or target Forth is not actually Forth.
Would the real Forth please stand up?:)
P.S.
OK, just had another look. It has a "butterfly" word defined in PASM and I guess the PASM syntax may be different. But it has a "butterfly-slow" word in Forth that can be used instead, and the PASM version deleted.
Worse, it has a "butterflyinit" word defined in PASM that has no Forth alternative.
So the problem is the benchmark, and expecting everything to fit it.
Yes. And that's the nature of benchmarks. The arguments over this are probably even older than me:) In general the argument is that benchmarks do not reflect real world situations, they do not exploit whatever special features some architecture has, etc etc etc.
All true. But if there are no rules to the game, the comparisons become impossible to draw any meaning from anyway.
Come on, you really expect Prop OBEX software to run on a PC!?
Certainly, why not? I believe a lot of it does under spinsim. One would just have to kit out your PC with some I/O and one would have a nice fast Propeller. And big:)
Absolutely, but this room is a little dry for a good argument, it needs something to wet the whistle!
Excuse me for a minute while I check what's in the fridge:)
I've updated the third post with some results for FFT_BENCH. It's quite interesting, as once again there is a huge range in speed (PASM is almost 59x faster than Spin) and a somewhat smaller range in sizes. I've posted the total size of the benchmark this time, including run time libraries and kernel, so it gives an indication of how much room a complete project takes in each language. Spin is still the king here, but Catalina's CMM is nipping at its heels and the PropGCC CMM results are not too bad either. I suspect the various Forth interpreters may be able to do even better, but I don't have results for them either (only the speed result for JDForth).
Toggle and FIBO are easy to implement but have little practical relevance.
On the Propeller we use a lot of bit banging drivers so my proposal for a practical benchmark is:
Output an array of 512 bytes with a bit-banging SPI-Out routine as fast as possible.
That sounds like an excellent benchmark as well. Thanks for the suggestion!
Here's a C version:
// simple bit banging benchmark; send 512 bytes out an SPI bus
// as fast as possible
#include <stdio.h>
#include <stdint.h>
#include <propeller.h>

#define D0PIN  0
#define CLKPIN 1
#define CSPIN  2

#define D0MASK  (1 << D0PIN)
#define CLKMASK (1 << CLKPIN)
#define CSMASK  (1 << CSPIN)

void
spiout(uint8_t data)
{
    unsigned int datamask = 0x80; /* send MSB first */

    while (datamask > 0) {
        if ((data & datamask)) {
            OUTA |= D0MASK;
        } else {
            OUTA &= ~D0MASK;
        }
        OUTA |= CLKMASK;
        datamask = datamask >> 1;
        OUTA &= ~CLKMASK;
    }
}

uint8_t array[512];

int
main(void)
{
    int i;
    unsigned int time;

    // initialize data array
    for (i = 0; i < 512; i++) {
        array[i] = (i & 0xff);
    }

    // set up pins
    OUTA |= CSMASK;
    DIRA |= CSMASK;
    DIRA |= D0MASK;
    DIRA |= CLKMASK;

    // now send data
    time = CNT;
    for (i = 0; i < 512; i++) {
        spiout(array[i]);
    }
    time = CNT - time;
    printf("elapsed cycles = %u\n", time);
    for (;;)
        ;
}
Results for PropGCC (compiled with -Os except where noted) are: The sizes here are only of this code (no library or kernel) and with the printing and final for loop commented out.
There is probably a smaller range in sizes because fft_bench uses two tables of 16-bit sine and cosine values that take up one and a half kilobytes.
Actually the later heater_fft in PASM uses the ROM tables to save that space at the cost of some speed.
This file should be renamed fft_bench.spin I think.
The one case where kernel (or driver, or plugin) size can make a difference is when a copy of the cog code has to be kept at run time to be dynamically loaded into cogs - here, Spin has a clear advantage when it comes to the kernel because its kernel is built into the ROM, whereas other languages typically have to keep a copy in Hub RAM.
...
However, this type of dynamic loading is probably rare enough that it can simply be footnoted - including it for all cases would be quite misleading to most people (who will never write programs that use this capability).
I guess that depends on the choices people want to make.
As one of our resident C haters once noted, making the Propeller into a single core device is not very desirable. My take is, there are many solutions available in a single chip with a single core and peripherals, for half the price, at least twice the LMM performance, more code space, and little programming grief.
Propeller multi-core ability is a strength that should be easily accessible at any time in the program life cycle, not a footnote.
I guess that depends on the choices people want to make.
As one of our resident C haters once noted, making the Propeller into a single core device is not very desirable. My take is, there are many solutions available in a single chip with a single core and peripherals, for half the price, at least twice the LMM performance, more code space, and little programming grief.
Propeller multi-core ability is a strength that should be easily accessible at any time in the program life cycle, not a footnote.
Are you saying that there is no value in a scheme that allows a single COG to run the main program and work to be distributed over other COGs using either LMM, CMM, or even COG C? There is no value in XMM solutions that don't allow XMM to run on every COG at the same time?
Are you saying that there is no value in a scheme that allows a single COG to run the main program and work to be distributed over other COGs using either LMM, CMM, or even COG C? There is no value in XMM solutions that don't allow XMM to run on every COG at the same time?
Propeller has multiple cores. In Spin we can make any core run PASM or start/stop another Spin function at any time. Only having half of that ability in C takes away the Propeller's advantage, and makes it less useful relative to other solutions.
XMMC, etc. are useful because they allow the Propeller to grow beyond its natural code limits (anyone who writes Propeller programs runs into that problem). In GCC, XMMC etc. are more useful because we can tell functions to live in HUB memory and run at normal speed. Multi-core XMM would be horribly slow and thus less valuable.
Propeller has multiple cores. In Spin we can make any core run PASM or start/stop another Spin function at any time. Only having half of that ability in C takes away the Propeller's advantage, and makes it less useful relative to other solutions.
XMMC, etc. are useful because they allow the Propeller to grow beyond its natural code limits (anyone who writes Propeller programs runs into that problem). In GCC, XMMC etc. are more useful because we can tell functions to live in HUB memory and run at normal speed. Multi-core XMM would be horribly slow and thus less valuable.
It seems to me that almost every Spin program I've looked at has one main program that launches COGs to run drivers or maybe small Spin programs. I have seen very few Spin programs that make use of fully concurrent execution. It seems to me that XMM just makes it possible for that main program to be bigger and more complex without interfering with the ability to launch smaller pieces of code to run concurrently in other COGs. I guess the various Forth systems allow more uniform concurrent operation in some cases across multiple chips. The OpenMP support Eric recently added further improves the ability to do fine-grained parallelism. In other words, I don't think Spin has any ability that isn't also shared by other languages on the Propeller and it is missing some that others have.
I totally agree; in a language benchmark, the guts of the benchmark should not rely on parts written in a language that is not the subject of the benchmark.
This needs a little care, else you reject all library calls.
I'd agree that what should be excluded is manually-coded-in-line, but even that should be allowed if clearly stated.
ie I would certainly be interested to know where the middle ground of some in-line ASM can place you.
Many users would want to use C, and only go to PASM as really needed, and there, modifying generated PASM is likely to be the preferred optimize pathway.
Any decent language shows you the ASM generated, and some careful re-phrase can often improve that, to the point the jump to manual ASM is not needed.
If the "written in a language that is not the subject of the benchmark" rule is applied pedantically, then code generated automatically by that language is allowed to be included: the user did not write it in another language.
In the Forth example, it certainly is a feature that fast SPI is available, but another number that should be included is how much it costs - ie the size of the 'native code area'; it is so small, it is pretty much single-shot use.
Again, we come back to needing multiple columns, in the reports, that do show all the resource used.
This needs a little care, else you reject all library calls.
According to many, if the library functions are in the standards documents of the language then they are part of the language and can be used. So for example in C, using strcmp() to compare strings in a benchmark would be acceptable even if its implementation is in assembler on the platform under test. You are not writing anything in the benchmark code that is outside the standard language definition.
...generated automatically...
Lots of possibilities there, if you can get the code generation to be quick enough to make any savings.
For example, potentially the Forth run time could compile the Forth source of the benchmark into machine code and then execute and time that. Good trick if you can pull it off:)
In the Forth example, it certainly is a feature that fast SPI is available
I don't believe it is. I assume there are such things as standards for the Forth language, for example ANSI X3.215-1994 http://www.forth.org/svfig/Win32Forth/DPANS94.txt No mention of fast SPI in there. For a general language benchmark useful across platforms it should not be included.
But as we are more interested in what works to do practical things on the Prop we let that slide. Which does not mean it's OK to do it in inline PASM. If it's a normal part of the supplied Forth run time I'm happy.
For example: that FFT in Forth has a good chunk of its functionality implemented in PASM. That's just not right.
Anyway, looking at the whole thing briefly, I might think it would be easier to understand if it were all done in PASM:)
Well that's the thing. Why are we doing this benchmarking? What actually do we want to test and compare? Speed, code size, ease of use, portability, applicability to different problem domains all come into the picture. And I guess we all have different priorities when assessing the results and will therefore dispute this and that outcome ad nauseam.
Still I think, if teacher sets a problem to be solved in language A but you hand in a solution where the main guts of the problem is solved in language B that totally misses the point of the exercise and your homework will be marked down even if the solution works:)
Well that's the thing. Why are we doing this benchmarking? What actually do we want to test and compare? Speed, code size, ease of use, portability, applicability to different problem domains all come into the picture.
The best things to include are the numbers, (the rest is intangible), and that is why multiple numbers are needed.
Still I think, if teacher sets a problem to be solved in language A but you hand in a solution where the main guts of the problem is solved in language B that totally misses the point of the exercise and your homework will be marked down even if the solution works:)
Not by any teacher I know. Especially if the student explained why B was used instead of A.
Reminds me of a case of a Maths Major, taking a computer science paper.
Given a problem the Prof has always solved iteratively, she knew there was actually a related maths proof, and coded the solution using that.
She smashed all previous speed records.
The point of benchmarks is to compare and evaluate many different approaches to a problem.
In a Prop, that range of choice is wider than in most other chips.
What? So an essay written in French would have been OK for a German language assignment?
When studying matrices in maths I would not expect an assignment solved using other techniques to get a good grade, as you are supposed to be demonstrating your mastery of matrices.
Anyway, we have to agree to disagree on a few points and just get back to the benchmarking.
I have just spent the whole evening and early hours of the morning trying to get my head around getting GCC to automatically parallelize the fft_bench code using its newfound OpenMP support. If I ever get it working that would mean:
1) The resulting code could split its workload over 1, 2, 4 or 8 cogs.
2) The expected (hoped-for) performance gain is maybe up to 6 times max, taking into account the overheads of scheduling the cogs and the fact that it is not possible to parallelize the entire process.
3) Exactly the same code should run on 1 or many cogs with no change.
Anyway, this is giving me a severe headache; it turns out that arranging to dynamically split up that FFT work is not so simple and needs a little rearranging of the way things are done. Just now I have made the first baby step of splitting the work into two and still getting correct results in the output.
If (big if) I succeed I will expect a parallelized Forth version to appear, even with inline ASM.
Since you use the optimizer for GCC, you should also use it for Catalina. When you add -O3 the code size for Catalina goes down to 4376.
But I'm afraid you are still not comparing "apples with apples" in all cases. For instance, the library functions included by the C versions are much more sophisticated than the ones used by the Spin version (and probably the Forth version as well - not sure about that one).
Getting a true cross-language comparison is difficult. For example:
If you also add -ltiny the code size goes down to 2612.
If you also add -Dprintf=t_printf the code size goes down to 2184.
If you also add -Dprintf=trivial_printf the code size goes down to 1952.
... etc ...
This is why I think that when doing this type of benchmarking, you should only count the benchmark application code size itself, not the kernel size, the library function size etc - these do not compare well across languages.
Since you use the optimizer for GCC, you should also use it for Catalina. When you add -O3 the code size for Catalina goes down to 4376.
I did use -O3 for Catalina CMM. Perhaps you missed the note, but the size reported for Catalina included code, cnst, and init. So for Catalina CMM (with -O3) the size is 2612 (code) + 104 (cnst) + 1820 (init) = 4536 bytes. I realize that there's some dispute over whether the constant data should be included or not, but all the other languages have it too, so it is (somewhat) apples to apples. Perhaps I've misunderstood what the "init" section is, but I assumed it was the initial data for the arrays (which is fairly substantial in size, and all of the languages have it).
Perhaps a code size (only) comparison would be useful as well. I invite contributions from readers :-).
And yes, the C runtime library is much more sophisticated than for example what the FullDuplexSerial function provides, which is why I compiled all the C examples with -ltiny. To be fair, there are some features of all the language runtimes (e.g. the receive functions in FullDuplexSerial) that aren't used in the fft_bench demo.
I did use -O3 for Catalina CMM. Perhaps you missed the note, but the size reported for Catalina included code, cnst, and init. So for Catalina CMM (with -O3) the size is 2612 (code) + 104 (cnst) + 1820 (init) = 4536 bytes. I realize that there's some dispute over whether the constant data should be included or not, but all the other languages have it too, so it is (somewhat) apples to apples. Perhaps I've misunderstood what the "init" section is, but I assumed it was the initial data for the arrays (which is fairly substantial in size, and all of the languages have it).
Perhaps a code size (only) comparison would be useful as well. I invite contributions from readers :-).
And yes, the C runtime library is much more sophisticated than for example what the FullDuplexSerial function provides, which is why I compiled all the C examples with -ltiny. To be fair, there are some features of all the language runtimes (e.g. the receive functions in FullDuplexSerial) that aren't used in the fft_bench demo.
Eric
Ok - yes, I missed the note. Thanks.
Ross.
P.S. Actually, thinking about it more - if the point is to benchmark across languages, then including stuff that is common to all languages is just diluting the differences anyway - so in this case (as I have argued before) I think const data sizes should be excluded. But much of this problem will vanish when we start looking at larger programs anyway. These small programs are not really a good basis for real-world comparison, precisely because of such problems.
I recompiled the JDForth version of FFT and got the following program size: 5428 bytes.
My reasoning for including PASM is consistent with my use of JDForth. I developed JDForth because I was writing an application that left the Propeller struggling to cope. I was writing more and more PASM and decided there must be a better way. JDForth was the result with the majority of the application being written in Forth, with snippets of PASM to boost the performance of critical sections of the code. Even after this change I still had 4 cogs with dedicated high speed PASM drivers.
In reality I don't think the given FFT algorithm is very forth friendly, but I have neither the time nor requirement to understand and write the algorithm in a forth friendly manner. Surely, if a piece of code is going to be used as a reasonably platform independent (ie free of assembly) benchmark, then it should be written in the style of the language as opposed to the current forth-interpretation-of-the-C as it stands.
In the case of the FFT example the run time dropped from 1200ms to 140ms with the inclusion of the PASM, but to me that points to a weakness in the underlying implementation of the algorithm. (Obviously not for C, but definitely so for Forth.)
In the end, my original application was simply a poor fit for the Propeller. I gained the required speed but ran out of RAM. Not in terms of program space, but I was writing to SD card memory and ran out of RAM for storing FIFO buffers. I have moved the application to another processor (with it all being written in C) and use about 128k bytes of RAM dedicated to the FIFO buffers to manage SD card latency. That amount of buffering was required to keep up with the incoming data stream while allowing for the inevitable delays that occur when deleting blocks on the SD card.
To me, JDForth on the Propeller was great as a compromise between code size vs application speed vs the ability to add snippets of assembly language. I enjoyed working with the Propeller and this forum is one that I still frequently visit, but the Propeller does not suit my applications. For the curious, I now use the STM32F4xx family of embedded processors, BUT they are in a different league of complexity and not a processor for the beginner. The one strength that the Propeller undeniably has is deterministic operation, and for this reason alone I would consider its use in preference to a CPLD if at all possible. I would also use the Propeller as a user interface coprocessor (to add keyboard + video + sound).
In reality I don't think the given FFT algorithm is very forth friendly, but I have neither the time nor requirement to understand and write the algorithm in a forth friendly manner.
Redesigning the algorithm in a language friendly manner is not in the spirit of the fft benchmark.
I am sure there are better ways to write the FFT algorithm in different languages. Heck, there are probably better ways to write it in C. Certainly there are faster implementations out there. That particular code sequence is what I came up with after finally getting to understand the FFT, written from scratch from my understanding of the maths, perhaps a naive implementation.
However, none of that matters; forget that it is even an FFT. It was suggested as a benchmark a while back as it contains a good selection of loops and arithmetic/logical operations, array lookup, etc, representative of what people do with MCUs. There is no floating point, for example. As such, any implementation of that benchmark should follow the spirit of the original and perform the same operations in the same sequence. The aim of fft_bench is not to produce the correct FFT result but to test how a language/processor can do that sequence of operations. After all, you could just skip the calculations and print the expected output:)
This is why I think that when doing this type of benchmarking, you should only count the benchmark application code size itself, not the kernel size, the library function size etc - these do not compare well across languages.
Yes, I'd agree, and my preference is to list multiple numbers: Code Itself, Kernel Size, Library Functions included, as then it is clear how each portion contributes, and what the total resource needed is, as well.
The classic MAP file from a linker, has this sort of info.
I'm squarely with Heater: the use of inline assembly in a language speed benchmark invalidates the result, and it is no longer a benchmark of that specific language. It may be valid during usage of that language, but it doesn't help improve the language implementation.
For example, prior to the addition of just-in-time compilation Java was unbelievably slow. But Java has a native-code calling facility which could be used to improve specific hot spots in a program by calling routines written in C. Sun, to their credit, wasn't satisfied with this technique and added JITting to improve the core language and retain code portability.
I'm squarely with Heater: the use of inline assembly in a language speed benchmark invalidates the result, and it is no longer a benchmark of that specific language. It may be valid during usage of that language, but it doesn't help improve the language implementation.
I would agree user-coded inline assembly needs a special mention, and belongs in a 'mixed language' benchmark.
Where it gets murkier is if the language compiler generates the in-line asm, under a user directive, or via optimise passes.
Where it gets murkier is if the language compiler generates the in-line asm, under a user directive, or via optimise passes.
I don't understand.
1) That is what traditional compilers, like C compilers, do. Convert source text into machine instructions. No problem there.
Or
2) Do you mean something like a language that normally converts source to byte codes, e.g. Spin, but can be told to insert raw machine instructions as well when needed.
Again, no problem with that; as long as the source text is actually the language under test (Spin in this case), who cares how the compiler or run time gets the job done.
In fact propgcc is a good example of this. I would say that propgcc in "normal" LMM mode is not producing native Propeller instructions but rather something that needs a run-time kernel to execute. Hence being 3 or 4 times slower than compiling to COG code. However, propgcc has the FCACHE mechanism, which will detect small self-contained loops and functions and decide to compile them into real native code that is loaded into the COG and run at full speed. Is that cheating? Not in my mind; the source text is still regular C.
This, FCACHE, by the way, is how propgcc manages to run fft_bench in 47ms versus Catalina's 348ms.
2) Do you mean something like a language that normally converts source to byte codes, e.g. Spin, but can be told to insert raw machine instructions as well when needed.
Yes, that is a good example.
In the Prop, this is an option most others lack, but it also has caveats...
Again, no problem with that; as long as the source text is actually the language under test (Spin in this case), who cares how the compiler or run time gets the job done.
In fact propgcc is a good example of this. I would say that propgcc in "normal" LMM mode is not producing native Propeller instructions but rather something that needs a run-time kernel to execute. Hence being 3 or 4 times slower than compiling to COG code. However, propgcc has the FCACHE mechanism, which will detect small self-contained loops and functions and decide to compile them into real native code that is loaded into the COG and run at full speed. Is that cheating? Not in my mind; the source text is still regular C.
This, FCACHE, by the way, is how propgcc manages to run fft_bench in 47ms versus Catalina's 348ms.
What you need to do when this occurs, is state what went into the cache or COG.
The Prop has a relatively small PASM workspace, so this will not scale the way a novice might think.
Remember, Benchmarks can be read by someone who has never seen a prop before.
Now you have me puzzled. Why does it need "porting"?
....
Would the real Forth please stand up?:)
....
Worse it has a "butterflyinit" word that is defined in PASM that has no Forth alternative.
Not to claim to speak for Forth in general or anything....
PORTING means doing the stuff that takes advantage of the hardware architecture. This is what the kernel does; it is the hardware abstraction layer. But most microcontrollers have limited hardware.....
REAL Forth in this case would be the kernel, which may or may not have hardware which optimizes the function you want to use. In the case of the Prop, the kernel can do whatever the Prop can do.
"butterflyinit" - Somebody smarter than me (maybe mindrobots) would have to write it (in high-level Forth). And it would be slightly slower than what your C compiler generated. Then somebody else smart (not me) would optimize it in assembler, which would be about as fast as or faster than the compiler output, depending on the skills of the compiler writer and the Forth optimizer.
If nlordi worked on it, you might have some forth to compare to another skilled implementation; he actually knows something about FFT.
Sal doesn't do benchmarks, except [current version] to [last version]. Otherwise it's not apples to apples.
If we ask Rick and Peter to take a stab at it...Maybe we could see propforth to Tachyon. But its all about how much fun it is.
If we ask Rick and Peter to take a stab at it...Maybe we could see propforth to Tachyon. But its all about how much fun it is.
Unless I had a need for it then it would be no fun at all. Benchmarks can be useful but it's hard to compare apples with pineapples even though they are both "apples". There are also interpretations being imposed on what is and what is not allowed, which is what marketing companies do when they want their product to come out on top. I do agree though that inline assembler is not the pure source language, but it's okay if the function is built in, such as in libraries etc, as long as the programmer doesn't need to delve into assembler. However, in real-world applications you do what you need to do.
Anyway, these isolated benchmarks don't do anything to get me excited but if the benchmark was a real and complete application then it would present a complete picture of how good the implementation is. Such an application would not only crunch data but juggle access to real world devices etc. However it turns out that many of my Forth applications have just as much diagnostics built-in as there is actual application but even then it is very compact. One of my very early Forth applications was for POS terminals and despite the hardware limitations of the time the units could not be matched by anything else in terms of speed and resource economy even with more powerful hardware and more memory (I know).
Indeed, "porting" is all about moving a program from one environment to another and fixing up its I/O to make use of the new "ports" in the new environment. I take that to mean real hardware ports or I/O provided by an OS. Of course porting can also involve other changes.
We have used the term "porting" here to refer to re-implementing a benchmark algorithm in a different language, which I'm sure was not the original intent of the term.
Now there is the thing. fft_bench does not use any special hardware except what it needs to print to the screen and time itself. Everything else should be just regular Forth code as I might see on a PC version or whatever. That printing and timing is what may need "porting" from one Forth kernel to another but I would assume that is trivial.
Turns out the "porting" problem with the Forth FFT is that bits of it are not written in Forth but assembler.
Comments
Now you have me puzzled. Why does it need "porting"?
I know almost nothing about Forth and I only looked at the FFT briefly before I started feeling dizzy but It has a Forth only (no PASM) option, in fact that is default in the source given.
So if it needs porting ether the source or target Forth source is not actually Forth.
Would the real Forth please stand up?:)
P.S.
OK just had another look. It has a "butterfly" word defined in PASM and I guess the PASM syntax may be different. But it has a "buttefly-slow" in Forth that can be used instead and the PASM version deleted.
Worse it has a "butterflyinit" word that is defined in PASM that has no Forth alternative.
Yes. And that's the nature of benchmarks. The arguments over this are probably even older than me:) In general the argument is that benchmarks do not reflect real world situations, they do not exploint whatever special features some architecture has etc etc etc.
All true, But if there are no rules to the game the comparisons become impossible to draw any meaning from anyway.
Certainly, why not? I belive a lot of it does under spinsim. One would just have to kit out your PC with some I/O and one would have a nice fast Propeller. And big:)
Excuse me for a minute while I check what's in the fridge:)
Here's a C version:
Results for PropGCC (compiled with -Os except where noted) are: The sizes here are size only of this code (no library or kernel) and with the printing and final for loop commented out.
There is probably a smaller range in sizes because fft_bench uses two tables of 16 bit cos and sine that take up one and half killo bytes.
Actually the later heater_fft in PASM uses the ROM tables to save that space at the cost of some speed.
This file should be renamed fft_bench.spin I think.
I guess that depends on the choices people want to make.
As one of our resident C haters once noted, making propeller into a single core device is not very desirable. My take is, there are many solutions available a in single chip having a single core and peripherals for half the price, at least twice the LMM performance, more code space, and little programming grief.
Propeller multi-core ability is a strength that should be easily accessible at any time in the program life cycle, not a footnote.
Propeller has multiple cores. In Spin we can make any core run PASM or start/stop another Spin function at anytime. Only having half of that ability in C takes away propeller advantage, and makes it less useful relative to other solutions.
XMMC, etc... are useful because it allows propeller to grow beyond it's natural code limits (anyone who writes propeller programs runs into that problem). In GCC XMMC, etc... are more useful because we can tell functions to live in HUB memory and run at normal speed. Multi-core XMM would be horribly slow and thus less valuable.
This needs a little care, else you reject all library calls.
I'd agree that what should be excluded is manually-coded-in-line, but even that should be allowed if clearly stated.
i.e. I would certainly be interested to know where the middle ground of some inline ASM can place you.
Many users would want to use C, and only go to PASM as really needed, and there, modifying generated PASM is likely to be the preferred optimize pathway.
Any decent language shows you the ASM generated, and some careful re-phrase can often improve that, to the point the jump to manual ASM is not needed.
If "written in a language that is not the subject of the benchmark" is applied pedantically, then code generated automatically by that language is allowed to be included: the user did not write it in another language.
In the forth example, it certainly is a feature that fast SPI is available, but another number that should be included is how much it costs - i.e. the size of the 'native code area'. It is so small that it is pretty much single-shot use.
Again, we come back to needing multiple columns in the reports that show all the resources used.
According to many, if the library functions are in the standards documents of the language then they are part of the language and can be used. So for example in C, using strcmp() to compare strings in a benchmark would be acceptable even if its implementation is in assembler on the platform under test. You are not writing anything in the benchmark code that is outside the standard language definition.
Lots of possibilities there, if you can get the code generation quick enough to make any savings.
For example, potentially the Forth run time could compile the Forth source of the benchmark into machine code and then execute and time that. Good trick if you can pull it off:)
I don't believe it is. I assume there are such things as standards for the Forth language - for example ANSI X3.215-1994 http://www.forth.org/svfig/Win32Forth/DPANS94.txt No mention of fast SPI in there. For a general language benchmark useful across platforms it should not be included.
But as we are more interested in what works to do practical things on the Prop we let that slide. Which does not mean it's OK to do it in inline PASM. If it's a normal part of the supplied Forth run time I'm happy.
For example: that FFT in Forth has a good chunk of functionality implemented in PASM. That's just not right.
Anyway, looking at the whole thing briefly, I might think it would be easier to understand if it were all done in PASM:)
So it should just go into the benchmarks with a comment and two entries, one in the Forth column and one in the PASM column?
On a Prop one is really interested in both 'drop in code', and a more Prop-tuned solution.
Still I think, if a teacher sets a problem to be solved in language A but you hand in a solution where the main guts of the problem is solved in language B, that totally misses the point of the exercise and your homework will be marked down even if the solution works:)
The best things to include are the numbers, (the rest is intangible), and that is why multiple numbers are needed.
Not by any teacher I know. Especially if the student explained why B was used instead of A.
Reminds me of a case of a Maths Major, taking a computer science paper.
Given a problem the Prof has always solved iteratively, she knew there was actually a related maths proof, and coded the solution using that.
She smashed all previous speed records.
The point of benchmarks is to compare and evaluate many different approaches to a problem.
In a Prop, that range of choice is wider than in most other chips.
What? So an essay written in French would have been OK for a German language assignment?
When studying matrices in maths I would not expect an assignment solved using other techniques to get a good grade, as you are supposed to be demonstrating your mastery of matrices.
Anyway, we have to agree to disagree on a few points and just get back to the benchmarking.
I have just spent the whole evening and early hours of the morning trying to get my head around getting GCC to automatically parallelize the fft_bench code using its newfound OpenMP support. If I ever get it working that would mean:
1) The resulting code could split its workload over 1, 2, 4 or 8 cogs.
2) The expected (hoped-for) performance gain is maybe up to 6 times max, taking into account the overheads of scheduling the cogs and the fact that it is not possible to parallelize the entire process.
3) Exactly the same code should run on 1 or many cogs with no change.
Anyway, this is giving me a severe headache; it turns out that arranging to dynamically split up that FFT work is not so simple and needs a little rearranging of the way things are done. Just now I have made the first baby step of splitting the work in two and still get correct results in the output.
If (big if) I succeed, I will expect a parallelized Forth version to appear, even with inline ASM.
Since you use the optimizer for GCC, you should also use it for Catalina. When you add -O3 the code size for Catalina goes down to 4376.
But I'm afraid you are still not comparing "apples with apples" in all cases. For instance, the library functions included by the C versions are much more sophisticated than the ones used by the Spin version (and probably the Forth version as well - not sure about that one).
Getting a true cross-language comparison is difficult. For example:
If you also add -ltiny the code size goes down to 2612.
If you also add -Dprintf=t_printf the code size goes down to 2184.
If you also add -Dprintf=trivial_printf the code size goes down to 1952.
... etc ...
This is why I think that when doing this type of benchmarking, you should only count the size of the benchmark application code itself, not the kernel size, the library function size, etc. - these do not compare well across languages.
Ross.
Perhaps a code size (only) comparison would be useful as well. I invite contributions from readers :-).
And yes, the C runtime library is much more sophisticated than for example what the FullDuplexSerial function provides, which is why I compiled all the C examples with -ltiny. To be fair, there are some features of all the language runtimes (e.g. the receive functions in FullDuplexSerial) that aren't used in the fft_bench demo.
Eric
Ok - yes, I missed the note. Thanks.
Ross.
P.S. Actually, thinking about it more - if the point is to benchmark across languages, then including stuff that is common to all languages just dilutes the differences anyway - so in this case (as I have argued before) I think const data sizes should be excluded. But much of this problem will vanish when we start looking at larger programs anyway. These small programs are not really a good basis for real-world comparison, precisely because of such problems.
My reasoning for including PASM is consistent with my use of JDForth. I developed JDForth because I was writing an application that left the Propeller struggling to cope. I was writing more and more PASM and decided there must be a better way. JDForth was the result with the majority of the application being written in Forth, with snippets of PASM to boost the performance of critical sections of the code. Even after this change I still had 4 cogs with dedicated high speed PASM drivers.
In reality I don't think the given FFT algorithm is very forth friendly, but I have neither the time nor requirement to understand and write the algorithm in a forth friendly manner. Surely, if a piece of code is going to be used as a reasonably platform independent (ie free of assembly) benchmark, then it should be written in the style of the language as opposed to the current forth-interpretation-of-the-C as it stands.
In the case of FFT example the run time dropped from 1200ms to 140ms by inclusion of the PASM, but to me that points in the direction of weakness in the underlying implementation of the algorithm. (Obviously not for C, but definitely so for forth).
In the end, my original application was simply a poor fit for the Propeller. I gained the required speed but ran out of RAM. Not in terms of program space, but I was writing to SD Card memory and ran out of RAM for storing FIFO buffers. I have moved the application to another processor (with it all being written in C) and use about 128k bytes of RAM dedicated to the FIFO buffers to manage SD card latency. The amount of buffering was required to keep up with the incoming data stream while allowing for the inevitable delays that occur when deleting blocks on the SD card.
To me, JDForth on the Propeller was great as a compromise between code size vs application speed vs the ability to add snippets of assembly language. I enjoyed working with the Propeller and this forum is one that I still frequently visit, but the Propeller does not suit my applications. For the curious, I now use the STM32F4xx family of embedded processors, BUT they are in a different league of complexity and not a processor for the beginner. The one strength that the Propeller undeniably has is deterministic operation, and for this reason alone I would consider its use in preference to a CPLD if at all possible. I would also use the Propeller as a user interface coprocessor (to add keyboard + video + sound).
Redesigning the algorithm in a language friendly manner is not in the spirit of the fft benchmark.
I am sure there are better ways to write the FFT algorithm in different languages. Heck, there are probably better ways to write it in C. Certainly there are faster implementations out there. That particular code sequence is what I came up with after finally getting to understand the FFT, written from scratch from my understanding of the maths - perhaps a naive implementation.
However, none of that matters; forget that it is even an FFT. It was suggested as a benchmark a while back as it contains a good selection of loops and arithmetic/logical operations, array lookups, etc., representative of what people do with MCUs. There is no floating point, for example. As such, any implementation of that benchmark should follow the spirit of the original and perform the same operations in the same sequence. The aim of fft_bench is not to produce the correct FFT result but to test how a language/processor can do that sequence of operations. After all, you could just skip the calculations and print the expected output:)
Wow, now that is quite a leap you have made there, worthy of a prize. French ? German ?
Yes, I'd agree, and my preference is to list multiple numbers: Code Itself, Kernel Size, Library Functions included, as then it is clear how each portion contributes, and what the total resource needed is, as well.
The classic MAP file from a linker, has this sort of info.
For example, prior to the addition of just-in-time compilation, Java was unbelievably slow. But Java has a native code calling facility which could be used to improve specific hot spots in a program by calling routines written in C. Sun, to their credit, wasn't satisfied with this technique and added JITing to improve the core language and retain code portability.
I would agree user-coded inline assembly needs a special mention, and belongs in a 'mixed language' benchmark.
Where it gets murkier is if the language compiler generates the in-line asm, under a user directive, or via optimise passes.
I don't understand.
1) That is what traditional compilers like C do. Convert source text into machine instructions. No Problem there.
Or
2) Do you mean something like a language that normally converts source to byte codes, e.g. Spin, but can be told to insert raw machine instructions as well when needed?
Again no problem with that, as long as the source text is actually the language under test, Spin in this case, who cares how the compiler or run time gets the job done.
In fact propgcc is a good example of this. I would say that propgcc in "normal" LMM mode is not producing native Propeller instructions but rather something that needs a run-time kernel to execute - hence being 3 or 4 times slower than compiling to COG code. However, propgcc has the FCACHE mechanism, which will detect small self-contained loops and functions and decide to compile them into real native code that is loaded into the COG and run at full speed. Is that cheating? Not in my mind; the source text is still regular C.
This, FCACHE, by the way, is how propgcc manages to run fft_bench in 47ms versus Catalina's 348ms.
Yes, that is a good example.
In the Prop, this is an option most others lack, but it also has caveats...
What you need to do when this occurs, is state what went into the cache or COG.
The Prop has a relatively small PASM workspace, so this will not scale how a novice might think.
Remember, Benchmarks can be read by someone who has never seen a prop before.
Not to claim to speak for forth in general of anything....
PORTING means doing the stuff that takes advantage of the hardware architecture. This is what the kernel does; it is the hardware abstraction layer. But most microcontrollers have limited hardware.....
REAL forth in this case would be the kernel, which may or may not have hardware which optimizes the function you want to use. In the case of the prop, the kernel can do whatever the prop can do.
"butterflyinit" - Somebody smarter than me (maybe mindrobots) would have to write it (in high-level forth). And it would be slightly slower than what your C compiler generated. Then somebody else smart (not me) would optimize it in assembler, which would be about as fast as or faster than the compiler output, depending on the skills of the compiler writer and the forth optimizer.
If nlordi worked on it, you might have some forth to compare to another skilled implementation; he actually knows something about FFT.
Sal doesn't do benchmarks, except [current version] to [last version]. Otherwise it's not apples to apples.
If we ask Rick and Peter to take a stab at it...Maybe we could see propforth to Tachyon. But its all about how much fun it is.
Unless I had a need for it then it would be no fun at all. Benchmarks can be useful but it's hard to compare apples with pineapples even though they are both "apples". There are also interpretations being imposed on what is and what is not allowed, which is what marketing companies do when they want their product to come out on top. I do agree though that inline assembler is not the pure source language, but it's okay if the function is built in, such as in libraries etc., as long as the programmer doesn't need to delve into assembler. However in real-world applications you do what you need to do.
Anyway, these isolated benchmarks don't do anything to get me excited but if the benchmark was a real and complete application then it would present a complete picture of how good the implementation is. Such an application would not only crunch data but juggle access to real world devices etc. However it turns out that many of my Forth applications have just as much diagnostics built-in as there is actual application but even then it is very compact. One of my very early Forth applications was for POS terminals and despite the hardware limitations of the time the units could not be matched by anything else in terms of speed and resource economy even with more powerful hardware and more memory (I know).
Yes, FCACHE is great for achieving good results on small benchmark programs.
Ross.
Indeed, "porting" is all about moving a program from one environment to another and fixing up its I/O to make use of the new "ports" in the new environment. I take that to mean real hardware ports or I/O provided by an OS. Of course porting can also involve other changes.
We have used the term "porting" here to refer to re-implementing a benchmark algorithm in a different language, which I'm sure was not the original intent of the term.
Now there is the thing. fft_bench does not use any special hardware except what it needs to print to the screen and time itself. Everything else should be just regular Forth code as I might see on a PC version or whatever. That printing and timing is what may need "porting" from one Forth kernel to another but I would assume that is trivial.
Turns out the "porting" problem with the Forth FFT is that bits of it are not written in Forth but assembler.
Sal is very wise.