Yes, FCACHE is great for achieving good results on small benchmark programs
Oh yeah:)
Now that I have let that little "secret" out of the bag I'm surprised no one has been demanding that the GCC results be posted for the case when FCACHE is turned off and only LMM code is used.
For all the Forth guys that don't see the fun in "porting" FFT bench from one Forth version to another I do sympathize. It's a shame though because as far as I can see everything is already done in Forth, except for some small initialization sequence. I have no idea why that is done that way as it hardly impacts performance as far as I can tell.
So how about it Eric?
Hi Heater,
I don't think we need to do that - FCACHE is a perfectly legitimate technique. I use it in the CMM version of Catalina, and I plan to add it to the LMM version as well in future. It's just that the benefits to be gained from FCACHE are massively over-stated by the type of trivial benchmark programs we are currently using.
By the way - much the same problem occurred when fast caches (sometimes hundreds of kilobytes or even megabytes in size) were added to microprocessors - benchmark programs immediately had to be made much larger and more complex to avoid the whole program simply being "cached" and giving unrealistically fast results. That's what led to the death of many of the popular small benchmark programs like Whetstone, Dhrystone, etc. - they are simply too small to give meaningful results on modern micros.
It just needs to be noted that many "normal" user programs will not get any benefit at all from using a compiler that can use FCACHE. And of course, there will be some that will get significant benefits.
Anyway, I know it gives the GCC team a boost to be able to come first on something, so I'm ok with leaving it in.

Ross.
I don't think we need to do that - FCACHE is a perfectly legitimate technique. I use it in the CMM version of Catalina, and I plan to add it to the LMM version as well in future.
Good stuff.
It's just that the benefits to be gained from FCACHE are massively over-stated by the type of trivial benchmark programs we are currently using.
Ahhggg!!! ...I'm gutted. FFT "trivial" he says. I sweated blood over that thing for ages, not to mention the decades it took me to get to understand how the algorithm works in the first place. "Trivial" .. mutter.. mutter..sob.
...much the same problem occurred when fast caches (sometimes hundreds of kilobytes or even megabytes in size) were added to microprocessors
Even small caches. I remember being very surprised years ago when I discovered that the traditional method of unrolling loops to gain speed could actually have no benefit, or even slow you down, on some early Pentium or AMD I was using. Then it hit me that these new-fangled chips were cached and unrolling a loop could spill out into slow memory.
Then the world of optimization gets a whole new twist: arranging your code and data so as to be working in cache without "thrashing". E.g. traversing 2D arrays in the right column/row order.
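To make the traversal-order point concrete, here is a minimal C sketch (my own illustration, not code from this thread): both functions add up the same 64x64 grid, but one walks memory in the order C stores it and the other strides across it.

```c
#include <stdint.h>

#define ROWS 64
#define COLS 64

/* C stores 2D arrays row by row, so grid[r][0..COLS-1] are adjacent in
 * memory. Walking in that order stays within each cache line; walking
 * column-first jumps COLS elements between accesses and, on a large
 * enough array, misses the cache on almost every access. */
static int32_t grid[ROWS][COLS];

void fill_grid(void) {
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            grid[r][c] = r + c;
}

int32_t sum_row_major(void) {      /* cache-friendly order */
    int32_t sum = 0;
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];
    return sum;
}

int32_t sum_col_major(void) {      /* same result, strided access */
    int32_t sum = 0;
    for (int c = 0; c < COLS; c++)
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];
    return sum;
}
```

Both functions return the same total; only the memory access pattern differs, which is exactly the kind of thing a wall-clock benchmark picks up on a cached machine.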
It just needs to be noted that many "normal" user programs will not get any benefit at all from using a compiler that can use FCACHE. And of course, there will be some that will get significant benefits.
And that's why we need the non-FCACHE benchmark results here.
It's just that the benefits to be gained from FCACHE are massively over-stated by the type of ~~trivial~~ elegantly written and superbly efficient, but very small benchmark programs we are currently using.
I just wanted to comment that the PropBasic version would be much faster as:
i = 1000
DO
thepin = ~thepin
LOOP i
It should be close to the PASM speed.
That is the problem with benchmarks, there are so many different ways to do a seemingly simple task that it is hard to judge a language from ANY benchmark program.

Bean
Now that I have let that little "secret" out of the bag I'm surprised no one has been demanding that the GCC results be posted for the case when FCACHE is turned off and only LMM code is used.
So how about it Eric?
You just have to add -mno-fcache to the command line to turn it off. With -O2 and -mno-fcache the fft_bench runs in 120 ms instead of 47 ms. Quite a bit slower, but still almost 3x faster than any of the other non-PASM solutions.

Eric
Now here's the thing: fft_bench does not use any special hardware except what it needs to print to the screen and time itself. Everything else should be just regular Forth code as I might see on a PC version or whatever. That printing and timing is what may need "porting" from one Forth kernel to another, but I would assume that is trivial.
The other little "gotcha" is that Forth isn't really standardized, or at least none of the Propeller implementations (that I'm aware of) adhere to standards. They all use the same syntax, but the words implemented vary, as do details like whether the interpreter is case sensitive or not. Which is not to knock Forth -- there are other "language families" (like LISP) that generally share this problem, and people can still get significant work done in the Forth of their choice. The interactivity of PropForth and Tachyon Forth is a very cool feature.
The other little "gotcha" is that Forth isn't really standardized, or at least none of the Propeller implementations (that I'm aware of) adhere to standards. They all use the same syntax, but the words implemented vary, as do details like whether the interpreter is case sensitive or not.
...
The interactivity of PropForth and Tachyon Forth is a very cool feature.
I agree this is a pain point to adopting the Forth language. One appeal of a HLL is code and skills reuse. With C I can often find an algorithm already written (e.g. inverse kinematics), or write it on a PC with a set of unit tests to ensure it is correct. Porting it elsewhere is usually just making sure that I didn't code any integer size dependencies. Forth has the possibility of achieving this, which standardization would help.
However, I can see Forth's appeal. It does allow skills reuse, and the interactivity makes it much easier to build embedded programs with built in command parsers. When coupled with a wireless link you get remote commands plus integrated debugging which is a nice feature.
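On the integer-size point above: one common guard when writing C meant to move between a PC and an MCU is to code against the exact-width types from <stdint.h>, so the algorithm behaves the same whether int is 16 bits (common on small-MCU compilers) or 32 bits (the Propeller, PCs). A small invented example (crc16_update is my own illustration, not from the thread):

```c
#include <stdint.h>

/* CRC-16/CCITT update for one byte. The explicit uint16_t casts after
 * each shift make the result independent of the platform's native int
 * width, which is the kind of size dependency that otherwise bites
 * when porting from a PC to a small target. 0x1021 is the CCITT
 * polynomial. */
uint16_t crc16_update(uint16_t crc, uint8_t byte) {
    crc ^= (uint16_t)((uint16_t)byte << 8);
    for (int i = 0; i < 8; i++) {
        if (crc & 0x8000u)
            crc = (uint16_t)((crc << 1) ^ 0x1021u);
        else
            crc = (uint16_t)(crc << 1);
    }
    return crc;
}
```

Because every intermediate value is forced back to 16 bits, a unit test written on a PC validates exactly the behaviour the MCU will show.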
I just wanted to comment that the PropBasic version would be much faster as:
i = 1000
DO
thepin = ~thepin
LOOP i
It should be close to the PASM speed.
Thanks, Bean. It's been a long time since I programmed in Basic, I don't think we had DO/LOOP back in my day :-). I've updated the source code and timings in the second post. PropBasic is now basically identical to the PASM speed -- very impressive!
I agree with your comment about it being hard to judge a language from any benchmark program. For one thing, there are very important factors like ease of use and ease of learning that are hard to quantify. For another, even a small variation in source code can produce quite a big difference in performance for a specific language (as we've seen), so no one benchmark can possibly quantify the speed differences between languages. That's why we need more benchmarks :-).

Eric
Turns out the "porting" problem with the Forth FFT is that bits of it are not written in Forth but assembler.
That's the whole idea; it's not the "problem" but rather "the process". High level Forth lets one test and set up all the functions, quickly. Then it's easy to identify bottlenecks via interactive testing. Then we can optimize just the bottlenecks in assembler. So Forth is all about getting to the assembler parts, and ensuring that only the tiniest bit is in assembler.
Of course, one has to do high level optimization via code design before low level optimization in assembler before it makes any sense, and often folks skip the design optimization and jump right to coding, or worse jump to low level optimization before the application is even ironed out. And then wonder why the code is a mess and doesn't work. But I see this just as frequently on professional C projects, perhaps more frequently. This is why I advise against travel by air, or getting sick in general.
All true, and all doable in C and many other languages as well.
Presumably this "process" did not occur when the Forth FFT was made as there is no Forth equivalent of that PASM initialization sequence like there is for the main guts of the algorithm.
Presumably this "process" did not occur when the Forth FFT was made as there is no Forth equivalent of that PASM initialization sequence like there is for the main guts of the algorithm.
You'd have to talk to Nick about that. As I understand it, it works as-is. Whether it needs to be optimized for speed or efficiency is usually left to the person that has the specific need.
Everything else should be just regular Forth code as I might see on a PC version or whatever.
This is the disconnect. "Standard" Forth means "tailored to whatever way I want to implement Forth on this special piece of hardware." This is because microcontrollers are not, and should not try to be, generic Windows PCs. If you want the PC version, get the PC version. If you want the Prop, maybe consider getting the Prop version. This is the trade-off for such a tiny, efficient system: the USER has to be aware of the differences in the hardware abstraction layer, rather than relying on the compiler and unknown writers of library files to have "gotten it right".
The USER also has to understand the problem that's being addressed, we must read the data sheet for the part we want to use. WE must understand the algorithm we are trying to write. We DON'T want to have the compiler second guess what we did, and try to fix our mistakes for us.
[For the benefit of spectators] These are the design decisions for Forth, at least for PropForth. This does not mean to imply that one way is better or more useful than another, just that the approach and goals are different, and therefore the benchmarks etc. don't always allow the direct comparison that is implied by a single-number result. Whether you look from a PC/OS/C perspective or a micro-controller/sub-OS/Forth perspective, according to your needs, makes a difference on what the same data tells you.
"Standard" forth means "tailored to whatever way I want to implement forth on this special piece of hardware."
Are you talking in riddles? I read that as "standard Forth is whatever I want to do", ergo there is no standard.
Now I do agree that all processors, MCUs, etc. are different, and input/output and other hardware will have to be tailored to suit the particular instance. "Ported", as we say.
BUT.
Since the beginning of time, well 1954 when FORTRAN was born, all high level languages have supported a few basic things: Manipulation of data with add, subtract, multiply, divide. Program structuring with sequence, selection and iteration. Functions or Procedures. Array access.
I believe there is a core of Forth that supports all of those things and is "standardized". As it happens the FFT benchmark only needs that core to operate.
The only parts that need "tailoring" or "porting" as we normally say are the means of printing the results and the means of timing the execution.
Is it really so hard to move this thing from Forth to Forth? If so, Forth is basically useless as a language, i.e. a means of communication of ideas.
From what I can tell there is a standard Forth that was developed by ANSI in 1994. ANSI Forth provides for portability from one machine to another. However, it appears that there are certain features of ANSI Forth that result in the code being big and slow. Because of this there are many incompatible implementations of Forth that generate smaller and faster code, but programs cannot be ported from one to another. I would expect that each machine would have its own dialect of Forth, but it appears the Prop has two or three incompatible dialects.
You mention "P1V" at the top. Does that mean you've modified the interpreter in ROM and can therefore only use it on an FPGA? Or can anyone use it in place of the normal interpreter (understanding, of course, that some hub space is wasted)?

http://forums.parallax.com/showthread.php/160754-Latest-P1V-git-files-Unscrambled-ROMs-Faster-Spin-Interpreter?p=1326276&viewfull=1#post1326276
I've updated the first few posts with results from fastspin (my Spin to PASM/LMM compiler). I've also run fastspin on Andy's SPI test, with the following results:
Good progress.
If this can do PASM or LMM, shouldn't this have two speed lines, one for each ?
I'm sorry, I wasn't very clear -- I meant that fastspin produces LMM PASM code. That's because the fastspin front end only does LMM. Pure PASM (COG) mode is available through the spin2cpp front end. There isn't much difference in speed. fastspin (which is the same as spin2cpp --code=hub --asm --binary) gives the 107,136 cycles quoted above, whereas spin2cpp --code=cog --asm --binary gives 106,532 cycles. They're so close because the compiler uses FCACHE in LMM mode to load small loops (like the SPI transmit loop) into COG memory.
Here's the main loop (load and transmit) from the SPI demo, as output by spin2cpp --code=cog (the fastspin output will be similar, but obscured a bit by the FCACHE loading code). Note that the spiout() function has been inlined, because it's only called from one place in the demo. The code comments have also lost sync (to a degree) with the assembly code because of optimization, although I think it's not too bad; the only one that really jumps out at me is that the "outa[D0] := value" statement should go before the muxc.
.... They're so close because the compiler uses FCACHE in LMM mode to load small loops (like the SPI transmit loop) into COG memory.
Ah, right, very clever.
So the corner case is more when something does not fit into FCACHE, or when some cache swapping has to occur?
I don't know if I'd call it a corner case... the default is that everything runs out of hub (with the usual LMM rdlong/execute interpreter). When the compiler sees a loop that will fit into the FCACHE area (<64 longs, with no HUB subroutine calls) then it modifies the code to insert a call to load the loop into FCACHE and then jump to it. Those kinds of loops run at full COG speed, about 4x faster than LMM.
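A rough C illustration of that rule (the functions are my own invention; the under-64-longs and no-hub-call criteria are as described above):

```c
#include <stdint.h>

/* This loop body compiles to a handful of PASM instructions and calls
 * nothing, so an FCACHE-capable compiler can copy the compiled loop
 * into COG RAM and run it at native speed instead of fetching each
 * instruction through the LMM rdlong/execute interpreter. */
uint32_t checksum(const uint8_t *buf, int len) {
    uint32_t sum = 0;
    for (int i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

static volatile uint32_t log_total;   /* stand-in side effect */
static void log_byte(uint8_t b) { log_total += b; }

/* If the loop calls out to hub-resident code (imagine log_byte() being
 * a real out-of-line routine rather than this stub), it no longer
 * qualifies for FCACHE and keeps executing via the LMM interpreter. */
uint32_t checksum_logged(const uint8_t *buf, int len) {
    uint32_t sum = 0;
    for (int i = 0; i < len; i++) {
        log_byte(buf[i]);
        sum += buf[i];
    }
    return sum;
}
```

Both functions compute the same checksum; the only difference the FCACHE rule cares about is whether the loop is self-contained.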