I'll be particularly interested in how your implementation of Bill Henning's FCACHE idea works out. As you may know, Catalina never used FCACHE because (apart from trivial benchmarks) I could never see how it would assist in a real-world program - but that may just be me. However, I always thought it was an interesting idea, and I'll be keen to see how it works out.
Ross.
Thanks Ross.
The "team" is doing all the nice work. I'm mostly just testing and reporting it. Heater's fft_bench has benefited from FastCache dramatically because internal loops can be rolled in. The dhrystone benchmark benefits less. Optimzations are ongoing. I've promised not to publish real numbers yet.
(as of last Thursday I am officially on the team)
One issue is how difficult it will be to get GCC to identify nested loops and leaf functions, and turn them into FCACHE'd native code.
The other is the painful tradeoff between kernel helper primitives and keeping the FCACHE area as large as possible.
Next will be trying various incarnations of FLIB
(The "FastCache" name is for marketing reasons; it will sound better than "FCACHE" to purchasing agents reading press releases.)
That's great news for the Parallax GCC effort, Bill!
I struggled with exactly the issues you raise, and never came to a satisfactory conclusion - now I will have the luxury of letting someone else do the hard work and just leveraging off the final answer!
Ross.
"FastCache" or any other name than FCACHE is an improvement. That "F" always
had the wrong connotation for me:)
The fact that FastCache can fetch fairly large junks of code into COG from
HUB (or external memory) and run them at full PASM speed has some interesting
consequences that I have never considered before. See FASTCACHE CONSEQUENCES
below.
RossH,
I used to have the same opinion. Looked to me as if caching a few instructions
into COG for the benefit of a few very small tight loops was not really going
to have much overall effect, except for some trivial benchmarks as you say and
simple string copying loops etc.
However, as you see Prop GCC has managed to cache a fairly large non-trivial
chunk of code in the FFT and accelerate it "dramatically". This changes my mind
entirely. In fact I would say FastCache has now become extremely important.
FASTCACHE CONSEQUENCES:
Imagine you want to write a driver or other functionality that:
1) Needs to be fast, but not so fast that you want to do it in PASM. Besides, you
may want it to be portable.
2) Is small enough to fit into FastCache, say 200 to 300 instructions, so that it can be compiled to native PASM and run at max speed in a COG.
So now you can:
1) In your normal LMM C program, start another COG running some function. It
starts out as just more normal LMM code with its own new LMM kernel.
2) That function (or one called from there) is small enough and loopy enough
that it immediately gets FastCached and runs at native speed, forever.
BINGO! You have just started a COG running C compiled to native PASM without
having to do any special compilation or build steps for that code. You
have not had to start it up with any different steps in your main
program. It just does it totally transparently.
Looks like what has happened with the FFT is almost there.
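A minimal C sketch of that scenario (the cogstart()-style launcher is assumed here for illustration; its name and signature are not necessarily the actual Prop GCC API):

#include <stddef.h>
#include <stdint.h>

/* Assumed helper: start a C function in a fresh COG with its own LMM
   kernel. The real Prop GCC entry point may differ. */
extern int cogstart(void (*func)(void *), void *arg,
                    void *stack, size_t stacksize);

static uint32_t worker_stack[64];
static volatile uint32_t ticks;

/* Small and loopy: the whole body is one tight loop, so the LMM kernel
   can FastCache it on entry and the COG then runs at native speed. */
static void worker(void *arg)
{
    (void)arg;
    for (;;)
        ticks++;        /* stand-in for a driver's polling loop */
}

int main(void)
{
    cogstart(worker, NULL, worker_stack, sizeof(worker_stack));
    for (;;)
        ;               /* the main program carries on in LMM as usual */
}

Once worker()'s loop is FastCached, the per-instruction LMM fetch overhead is gone for that COG.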
There are some downsides to this idea:
1) It is a perhaps wasteful way to get native code running in a COG
as it will be carrying the baggage of a now dead LMM kernel in the COG.
2) Also, how would the author know that the intended function is indeed FastCached
as intended at build time? How does he know when his function has become too
big to cache? Does the compiler output anything to indicate that? (Indeed, that
feedback might be required if FastCache is expected to be used even normally.)
Aside: How cool would it be if the Prop II could execute LMM code without a
software LMM kernel? That is to say, if it had some instruction(s) that started a
hardware LMM execution mode, together with instructions for those few LMM pseudo-ops
required (load immediate, etc.). Then the FastCache could be almost the entire 496
COG locations!
In summary, the fact that the FFT can be turbocharged by FastCache shows that it has a great deal of potential even for non-trivial benchmarks and real application code.
Heater,
Hmmm - this is a long way from being demonstrated yet. There is a reason why many of the small benchmark programs (like Dhrystone) are no longer used - it is because caching techniques like FCACHE made the relationship between benchmark performance and real-world application performance almost completely meaningless.
It is fairly simple to demonstrate that in quite trivial programs FCACHE does not improve performance at all - in fact, a dumb compiler that insists on FCACHEing any code that just happens to look like the call to a small function, or a tight loop (but which is only executed a few times) could very often result in reduced program performance.
But the reality is actually worse - as Bill alluded to earlier (and as I have also mentioned elsewhere) - in some cases FCACHE can reduce performance even if your compiler is very smart and only FCACHEs code that really will benefit from it. This is because every long you reserve for FCACHE space is a long that is pretty much wasted for programs that don't do much FCACHEing. And in an LMM kernel, longs are so precious for so many other purposes that adopting FCACHE at all (especially a large FCACHE) is always going to result in compromises elsewhere in the LMM kernel.
Conclusion? For some programs (including many of our favorite benchmarks), FCACHE will make a real difference. For instance, if your program is just over 496 longs, using FCACHE might mean it can still run at near cog PASM speed instead of at LMM PASM speed. But in other cases (and especially in larger programs) using FCACHE may make no difference, or may even be detrimental.
I suppose if it was a compile-time option to have an FCACHE at all then that would be a good solution. Even better would be if you could choose at compile time the size of the FCACHE - you could then tune the size of the FCACHE individually for each program.
Ross.
How about a cache mode for XMM where hub RAM is used as a second-level cache?
C.W.
We have that for XMM. The size is selectable as 2K, 4K, or 8K.
The cache line length is selectable as a power of 2, from 4 to 128 bytes.
You can change these and the board type without recompiling if you like.
The driver can be linked into the GCC binary and "over-loaded" by the propeller loader.
All you need is an external memory driver binary and a board description like the one below. More details of this and the external memory API will come later.
[ssf80]
clkfreq: 80000000
clkmode: XTAL1+PLL16X
baudrate: 115200
rxpin: 31
txpin: 30
tvpin: 12 # only used if TV_DEBUG is defined
cache-driver: ssf_cache.dat
cache-size: 8K
cache-param1: 0
cache-param2: 0
It is a compile time option.
1) It is a perhaps wasteful way to get native code running in a COG as it will be carrying the baggage of a now dead LMM kernel in the COG.
The LMM/XMM VM kernel remains intact. You can separately start/stop a full COG's worth of GCC COG code (or GCC LMM code, or GCC GAS code, or even PASM code) on demand if you like.
Neato! You can enable it to give you good benchmark results, and then disable it for real programs.
Or the real-world program could be tuned to take advantage of the FCACHE by breaking large loops up into multiple small tight loops that fit in the FCACHE. This is done all the time for processors that have hardware caches that are limited in size.
Ross.
Thanks Ross!
Well, back in '06 I was already sure that FCACHE and FLIB were necessary for really good performance - which is why they were part of LMM from the beginning.
I know it's a tough sell - because it is definitely not easy to implement right in compilers; and it is counter-intuitive - after all, at first blush it seems to make more sense to add lots of helper functions.
The reason that does not work too well is basically the hub access speed limitation. With a four-way unrolled loop, you get one LMM instruction every 20 clock cycles, which is kind of a sweet spot (at 80 MHz that's roughly 4 MIPS of LMM execution, versus 20 MIPS for native PASM). You really hit diminishing returns after that.
Now if you can load loops - even better, nested loops - into the FCACHE, you run at native speed within it, and due to the looping, the overhead of loading the cache is negligible.
The only downside is loops not being able to call library functions... which is where FLIB comes in.
For loops that need to call library functions such as floating point routines or string functions, a separately loaded full-PASM-speed library is the answer.
Therefore, the bigger the FCACHE and FLIB areas, the closer to native performance.
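To see why 20 cycles per LMM instruction is the floor, here is the kernel's fetch-execute job modelled in C, purely for illustration (the real kernel is a handful of PASM instructions, four-way unrolled):

#include <stdint.h>

/* Illustrative model only - "execute" stands in for the COG slot
   the fetched opcode actually runs from in a real PASM kernel. */
typedef void (*exec_fn)(uint32_t opcode);

void lmm_loop(const uint32_t *pc, exec_fn execute)
{
    for (;;) {
        uint32_t instr = *pc++;  /* rdlong from hub: the ~20-cycle cost */
        execute(instr);          /* the fetched native instruction runs once */
    }
}

FCACHE replaces that per-instruction rdlong with one bulk copy of the whole loop into COG RAM, after which each instruction costs its native 4 cycles.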
LOL!
It was 'F' for consistency: FJMP, FCALL, FRET, etc. (Far Jump, Far Call, Far Ret); the 'F' prefix was to distinguish LMM primitives from native instructions.
I really should have pushed its use more, but I was too busy with Morpheus, etc.
I agree
Exactly. Think SPI driver on demand. I2C driver. No cogs wasted. Also works for in-line assembly.
The current cog-only mode helps with that.
If you are very careful writing code for it, it should be possible to get it not to use a hub stack at all - which would be a big win.
You will be able to take a look at the assembly source code generated by the compiler; also Eric is working on getting GCC to generate FCACHE whenever it is possible (and appropriate). He has made good progress already!
Basically, any loop that does not call library functions is a candidate for FCACHE; and FLIB will help with some library code.
Anything with printf() will of course never fit in FCACHE - it's impossible! (There: by the magic of the forums, now someone will manage to fit printf...)
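For instance (illustrative code only, not from the toolchain): the first function below is a natural FCACHE candidate, while the second never will be:

#include <stdio.h>
#include <stdint.h>

/* A good FCACHE candidate: a self-contained tight loop with no library
   calls, so the whole body can be copied into COG RAM and run natively. */
uint32_t checksum(const uint8_t *buf, int len)
{
    uint32_t sum = 0;
    int i;
    for (i = 0; i < len; i++)
        sum = (sum << 1) + buf[i];
    return sum;
}

/* Not a candidate: printf() drags in far too much code to ever fit
   in a few hundred longs of COG RAM. */
void hex_dump(const uint8_t *buf, int len)
{
    int i;
    for (i = 0; i < len; i++)
        printf("%02x ", buf[i]);
}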
An "LMM" mode would possible for Propeller 3, it would be very cool indeed, but fair warning, Chip is not to be distracted from Prop2 work ... I was almost make to walk the plank for taking up too much of his time at UPEW (mostly kidding).
I really like the command line switch to use / not use FCACHE... it will finally let us test the theory :-)
Properly implemented FCACHE, with FLIB, should not reduce performance, as straight in-line code cannot really benefit from FCACHE.
The real benefit comes when loops are pulled into the cache, or small leaf functions that loop a lot.
My past experiments indicated that best results would be obtained with dedicating at least half the cog memory (I used 256 longs) as an FCACHE, more if possible.
For some loops it is best to also have an "FLIB" of leaf functions that are called in the innermost loop - i.e. float, string, etc. leaf functions.
When FCACHE is not used for loops, it can of course be loaded with a large FLIB of kernel helper functions!
I've even considered launching slave cogs to execute part of the loop (range of indexes) in parallel!
(Oops! Wife is giving me grief for using after-work "family time" to post. I am in trouble....)
I agree with this. For a while I contemplated the idea of having a dynamically generated FLIB as part of each program - i.e. instead of having FCACHE and a fixed FLIB, the compiler would pull out candidate functions (or parts of functions) that would benefit from being run in a dedicated cog, and put those in one (or more) FLIB cogs - then your main program calls the FLIB cogs at appropriate points in the normal program execution.
This would give you all the benefits of both FCACHE and FLIB - but without the FCACHE loading overhead (which no-one ever seems to take into account), and also without being stuck with a fixed set of FLIB functions that includes things that most programs don't need.
However, this kind of thing is well beyond the capability of current compilers ... I think.
Ross.
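A minimal sketch of what the calling side of such an FLIB cog might look like, assuming a hub-RAM mailbox convention (the layout and names are invented for illustration; this is not Catalina's or Prop GCC's actual mechanism):

#include <stdint.h>

/* Hypothetical hub-RAM mailbox shared with an FLIB cog. */
typedef struct {
    volatile uint32_t func;    /* which FLIB routine to run; 0 = idle */
    volatile uint32_t arg1;
    volatile uint32_t arg2;
    volatile uint32_t result;
} flib_mailbox_t;

/* Main-program side: post a request, spin until the FLIB cog answers. */
uint32_t flib_call(flib_mailbox_t *m, uint32_t fn, uint32_t a, uint32_t b)
{
    m->arg1 = a;
    m->arg2 = b;
    m->func = fn;              /* writing func last starts the call */
    while (m->func != 0)       /* the FLIB cog clears func when done */
        ;
    return m->result;
}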
That is a fascinating idea; I always thought of FLIB as being statically generated... I will think about it.
"well beyond the capability of current compilers ... I think"
Come on, you should have said it is impossible! Then someone would have pulled it off via forum magic...
All kidding aside, it would be possible, but not easy - i.e. a lot of work.
Has anyone counted the cycles and done the math on this "loading of code" overhead vs. how many times around the loop the cached code goes?
Obviously one execution of the cached code is a loser.
How about twice around the loop? My gut tells me that may already be a winner. Is there something I'm missing?
In the case of the FFT, the load time must be tens of microseconds, while the execution time, running to over 100 ms, is halved or quartered. No contest: FastCache wins.
It is probably just easier to write the code and see.
Ah, the scientific method. Wait, is that theory first then experiment, or experiment first?
Can't do either here at the moment.
It's been a loooong time since I did this exercise, but from memory I think it takes more than twice as long just to load an FCACHEd chunk of code as it does to completely execute the same code using LMM. So FCACHEing loops is only beneficial if you can guarantee the loop will be executed three times or more. In cases where there are multiple exit conditions from the loop, it is generally impossible for the code generator to tell in advance whether or not it will be beneficial to FCACHE the loop - and if it just blindly FCACHEs every candidate loop in the program, then it is quite possible that some parts of an FCACHEd program may in fact run many times slower than a non-FCACHEd program.
Also keep in mind that the bigger the FCACHE, the bigger the potential speedup if the compiler gets it right - but also the bigger the potential slowdown if the compiler gets it wrong. However, there is one class of programs that does benefit unambiguously from FCACHE - i.e. those programs which exhibit very predictable loop behaviour - such as most simple benchmarks.
However, I could easily be misremembering some of the details here, or my original analysis may have been faulty - which is why I suggested simply writing the code to check. After all, the whole mechanism is only a few instructions long.
Ross.
Something like: __attribute__ ((fastcache))
C.W.
I am sure you are right. Three times around the code sounds like a safe bet for performance gains with FastCache.
In general the compiler must be unable to know whether a loop will go around three times or not. That could depend on some input value whose run-time value the compiler cannot know. Or it could depend on some complicated previous calculation we cannot expect a compiler to fully analyze.
Seems we need some pragmas so that the programmer can suggest that caching a loop or function would be a good idea.
TI uses "#pragma MUST_ITERATE(min, max, mult)" to tell the compiler the minimum and maximum number of times a loop may iterate, and the multiple of the loop count. MUST_ITERATE(10,,2) says that the loop will iterate at least 10 times, and will always iterate a multiple of 2 times. Maybe we could use a pragma similar to that.
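Neither annotation below exists in Prop GCC; purely hypothetically, combining C.W.'s attribute with Dave's TI-style pragma might look like:

#include <stdint.h>

/* HYPOTHETICAL syntax - not supported by Prop GCC */
__attribute__ ((fastcache))        /* C.W.'s suggestion */
void copy_longs(uint32_t *dst, const uint32_t *src, int n)
{
    int i;
    /* hypothetical TI-style hint: at least 10 trips, a multiple of 2 */
#pragma MUST_ITERATE(10,,2)
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}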
I actually did that when writing a couple of magazine articles in the past... which were accepted, but I did not get part 3 done in time. I'll put them up on my site.
Eric also verified the cycles before he added FCACHE support.
Basically, even without Phil's reverse-loader, twice through the loop ties, anything more is a win.
As almost all loops execute more than twice, it is difficult for FCACHE not to win...
That works too!
So FCACHEing loops is only beneficial if you can guarantee the loop will be executed three times or more.
No, actually what you need is for the average case to be at least 3 trips (well, maybe 2 point something), not a guaranteed 3 trips. If you do it once, and then, on a separate load, go around 100 times, you've still gained.
Pretty good memory Ross!
Worst case, an FCACHE loader takes 32 cycles per long.
A decent FCACHE loader (2x unrolled) takes 24 cycles per long, but will load an extra long approx. 50% of the time.
With 4-way unroll, non-hub LMM instructions execute in 20 cycles.
We can ignore hub instructions in this simplified analysis, as they would also be correspondingly slower in pure PASM.
To load an FCACHE segment N instructions long then takes approximately
24*N cycles
Let's say 90% of the instructions in it take 4 cycles, and the other 10% take 20 cycles (hub instructions, allowing for latency to next hub window)
Let's consider this a single loop, executed X times. The "win" will be much bigger for nested loops.
Time to load: approx. 24*N
Time to execute once in cache: 0.9*N*4 + 0.1*N*20
Tc = 24N + X*(0.9N*4 + 0.1N*20) ' total execution time of FCACHED loop
Ti = X*(0.9N*20+0.1N*32) ' total execution time if run as straight line LMM
Let's simplify...
Tc = 24N +X(3.6N + 2N) = 24N + X(5.6N)
Tc = 24N + 5.6N*X
For LMM
Ti = X*(0.9N*20+0.1N*32) ' total execution time if run as straight line LMM
Ti = X(18N+3.2N)
Ti = X(21.2N)
Ti = 21.2NX
For PASM
Ta = X*(0.9*N*4+0.1*N*20) ' picking average 20 cycles for hub op, same as above
Ta = X*(3.6N + 2N)
Ta = X*5.6N
FCACHED execution speed approximation for a single loop N instructions long executed X times:
Tc = 24N + 5.6*X*N
LMM execution speed approximation for a single loop N instructions long executed X times:
Ti = 21.2NX
So let's do some analysis
X=2 ' worst possible case
N=10 ' small loop, minimizes benefits
Ti = 21.2 * 10 * 2 = 212*2 = 424 cycles
Tc = 24*10 + 5.6*2*10 = 240 + 56*2 = 240 + 112 = 352 cycles
Ta = 2*5.6*10 = 20*5.6 = 112 cycles
424 cycles for pure LMM
352 cycles for LMM + FCACHE
112 cycles for PASM
X=10
N=100
Ti = 21.2*100*10 = 21200 cycles
Tc = 24*100 + 5.6*100*10 = 2400+5600 = 8000 cycles
Ta = 10*5.6*100 = 5600
21200 cycles for pure LMM
8000 cycles for LMM+FCACHE
5600 cycles for pure PASM
X = 10
N = 200
Ti = 21.2*200*10 = 42400
Tc = 24*200+5.6*200*10 = 4800+11200 = 16000
Ta = 10*5.6*200 = 11200
42400 cycles for pure LMM
16000 cycles for LMM+FCACHE
11200 cycles for pure PASM
X = 50
N = 200
Ti = 21.2*200*50 = 212000
Tc = 24*200 +5.6*200*50 = 4800+56000 = 60800
Ta = 50*5.6*200 = 56000
212000 for pure LMM
60800 for LMM+FCACHE <--- only takes 8.6% longer than pure pasm! >90% of PASM performance
56000 for pure PASM
Numbers would be much better for nested loops of course
Ok, now that I have shown some concrete numbers, back to the math...
Ti = 21.2NX
Tc = 24N+5.6XN
Ta = 5.6XN
Ti will linearly increase with both X and N
Tc shows that the higher the number of loop passes, the smaller the impact of loading the cache
Tc/Ta gives the overhead of LMM+FCACHE over pure PASM, and clearly the 24N load overhead becomes relatively smaller as both X and N increase
The analysis can be expanded to cover nested loops fairly easily
let o be the number of instructions in the outer loop body
let i be the number of instructions in the inner loop body
N = o + i
let j be the number of times the inner loop is executed per outer-loop pass
Ti = 21.2X(o+ij) (per outer pass, o+ij instructions execute, though only N = o+i longs need loading)
Tc = 24(o+i) + 5.6X(o+ij)
Ta = 5.6X(o+ij)
I will only run one analysis of this case :-)
X = 20
o = 200
i = 56
j = 20
Ti = 21.2(o+ij)X
Ti = 21.2(200+56*20)*20 = 21.2(200+1120)*20 = 559680
Tc = 24(o+i) + 5.6X(o+ij) = 24(200+56) + 5.6*20*(200+56*20)
Tc = 6144 + 112*(200+1120) = 6144 + 147840 = 153984
Ta = 5.6*20*(200+56*20) = 147840
Tc/Ta = 1.042
LMM+FCACHE in this case achieves about 96% of pure PASM speed
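The single-loop model above is easy to re-run. Here is a throwaway C program (my own, using Bill's constants: 24 cycles per long to load, 5.6N cycles per cached/PASM pass, 21.2N cycles per LMM pass) that reproduces the worked numbers:

#include <stdio.h>

int main(void)
{
    /* the {N, X} cases worked through above */
    static const double cases[][2] = {
        {10, 2}, {100, 10}, {200, 10}, {200, 50}
    };
    int i;

    for (i = 0; i < 4; i++) {
        double N  = cases[i][0], X = cases[i][1];
        double Ti = 21.2 * N * X;            /* pure LMM */
        double Tc = 24.0 * N + 5.6 * N * X;  /* LMM + FCACHE */
        double Ta = 5.6 * N * X;             /* pure PASM */
        printf("N=%3.0f X=%2.0f: LMM=%7.0f FCACHE=%7.0f PASM=%7.0f (Tc/Ta=%.3f)\n",
               N, X, Ti, Tc, Ta, Tc / Ta);
    }
    return 0;
}

Its output matches the figures above: 424/352/112, 21200/8000/5600, 42400/16000/11200, and 212000/60800/56000.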
Dave, your comment combined with C.W.'s and Heater's suggests something like:
#FastCache
nested loops
#CacheEnd
If the cache block is too big to fit, the compiler can emit a warning!
This should make it easier for Eric to FCACHE nested loops, as he would have hints of the boundaries.
Comments?
Jim