
Propeller GCC Status Update

jazzed Posts: 11,803
edited 2011-10-13 08:52 in Propeller 1
A new Propeller GCC Update has been posted.

Comments

  • Publison Posts: 12,366
    edited 2011-09-13 10:42
    Thanks Steve. You and the team are doing an awesome job!

    Jim
  • RossH Posts: 5,517
    edited 2011-09-13 15:57
    Nice work, Jazzed.

    I'll be particularly interested in how your implementation of Bill Henning's FCACHE idea works out. As you may know, Catalina never used FCACHE because (apart from trivial benchmarks) I could never see how it would assist in a real-world program - but that may just be me. However, I always thought it was an interesting idea, and I'll be keen to see how it works out.

    Ross.
  • jazzed Posts: 11,803
    edited 2011-09-13 16:15
    Thanks Jim.

    Thanks Ross.

    The "team" is doing all the nice work. I'm mostly just testing and reporting it. Heater's fft_bench has benefited from FastCache dramatically because internal loops can be rolled in. The dhrystone benchmark benefits less. Optimzations are ongoing. I've promised not to publish real numbers yet.
  • Bill Henning Posts: 6,445
    edited 2011-09-13 16:54
    The whole team has done good work; and Eric has started implementing FCACHE and early results are very promising!

    (as of last Thursday I am officially on the team)

    One issue is how difficult it will be to get GCC to identify nested loops and leaf functions, and turn them into FCACHE'd native code.

    The other is the painful tradeoff between kernel helper primitives and keeping the FCACHE area as large as possible.

    Next will be trying various incarnations of FLIB :)

    (The "FastCache" name is for marketing reasons; it will sound better than "FCACHE" to purchasing agents reading press releases.)
    RossH wrote: »
    Nice work, Jazzed.

    I'll be particularly interested in how your implementation of Bill Henning's FCACHE idea works out. As you may know, Catalina never used FCACHE because (apart from trivial benchmarks) I could never see how it would assist in a real-world program - but that may just be me. However, I always thought it was an interesting idea, and I'll be keen to see how it works out.

    Ross.
  • RossH Posts: 5,517
    edited 2011-09-13 20:18
    ...
    (as of last Thursday I am officially on the team)
    ...

    That's great news for the Parallax GCC effort, Bill!
    ...
    One issue is how difficult it will be to get GCC to identify nested loops and leaf functions, and turn them into FCACHE'd native code.

    The other is the painful tradeoff between kernel helper primitives and keeping the FCACHE area as large as possible.
    ...
    I struggled with exactly the issues you raise, and never came to a satisfactory conclusion - now I will have the luxury of letting someone else do the hard work and just leveraging off the final answer! :smile:

    Ross.
  • Heater. Posts: 21,230
    edited 2011-09-14 03:05
    It's amazing how fast Prop GCC is progressing.

    "FastCache" or any other name than FCACHE is an improvement. That "F" always had the wrong connotation for me :)

    The fact that FastCache can fetch fairly large chunks of code into COG from HUB (or external memory) and run them at full PASM speed has some interesting consequences that I have never considered before. See FASTCACHE CONSEQUENCES below.


    RossH,
    Catalina never used FCACHE because (apart from trivial benchmarks) I could never see how it would assist in a real-world program

    I used to have the same opinion. It looked to me as if caching a few instructions into COG for the benefit of a few very small tight loops was not really going to have much overall effect, except for some trivial benchmarks, as you say, and simple string-copying loops etc.

    However, as you see, Prop GCC has managed to cache a fairly large non-trivial chunk of code in the FFT and accelerate it "dramatically". This changes my mind entirely. In fact, I would say FastCache has now become extremely important.

    FASTCACHE CONSEQUENCES:

    Imagine you want to write a driver or other functionality that:
    1) Needs to be fast, but not so fast that you would write it in PASM. Besides, you may want it to be portable.
    2) Is small enough to fit into FastCache, say 200 to 300 instructions, so that it can be compiled to native PASM and run at max speed in a COG.

    So now you can:
    1) In your normal LMM C program, start another COG running some function. It starts as just normal LMM code with its own new LMM kernel.
    2) That function (or one called from there) is small enough and loopy enough that it immediately gets FastCached and runs at native speed, forever.

    BINGO! You have just started a COG running C compiled to native PASM without having to do any special compilation or build steps for that code. You have not had to start it up with any different steps in your main program. It just happens, totally transparently.

    Looks like what has happened with the FFT is almost there.
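
    As a concrete sketch of the idea (assuming a cogstart()-style launcher like the one in PropGCC's propeller.h; the worker function, shared counter, and stack size here are just illustrative):

        #include <propeller.h>          /* assumed to provide cogstart() */

        #define STACK_LONGS 64
        static unsigned int worker_stack[STACK_LONGS];

        static volatile unsigned int counter;  /* shared via hub RAM */

        /* Small, loopy, no library calls: a natural FastCache candidate.
           Once the new COG's LMM kernel caches it, it runs at native speed. */
        static void worker(void *arg)
        {
            volatile unsigned int *p = (volatile unsigned int *)arg;
            for (;;)
                (*p)++;
        }

        int main(void)
        {
            /* Start worker() on another COG as ordinary LMM C code;
               no special compilation or build step is needed for it. */
            cogstart(worker, (void *)&counter, worker_stack, sizeof(worker_stack));
            for (;;)
                ;                       /* main program keeps its own COG */
        }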

    There are some downsides to this idea:

    1) It is a perhaps wasteful way to get native code running in a COG, as it will be carrying the baggage of a now-dead LMM kernel in the COG.

    2) Also, how would the author know at build time that the intended function is indeed FastCached as intended? How does he know when his function has become too big to cache? Does the compiler output anything to indicate that? (Indeed, that feedback might be required if FastCache is expected to be used even normally.)

    Aside: How cool would it be if the Prop II could execute LMM code without a software LMM kernel? That is to say, if it had some instruction(s) that started a hardware LMM execution mode, together with the few LMM pseudo-ops required (load immediate etc). Then the FastCache could be almost the entire 496 COG locations!

    In summary, the fact that FFT can be turbo charged by FastCache shows that it has a great deal of potential even for non-trivial benchmarks and real application code.
  • RossH Posts: 5,517
    edited 2011-09-14 05:46
    Heater. wrote: »
    In summary, the fact that FFT can be turbo charged by FastCache shows that it has a great deal of potential even for non-trivial benchmarks and real application code.

    Heater,

    Hmmm - this is a long way from being demonstrated yet. There is a reason why many of the small benchmark programs (like Dhrystone) are no longer used - it is because caching techniques like FCACHE made the relationship between benchmark performance and real-world application performance almost completely meaningless.

    It is fairly simple to demonstrate that in quite trivial programs FCACHE does not improve performance at all - in fact, a dumb compiler that insists on FCACHEing any code that just happens to look like a call to a small function, or a tight loop (but one which is only executed a few times) could very often result in reduced program performance.

    But the reality is actually worse - as Bill alluded to earlier (and as I have also mentioned elsewhere) - in some cases FCACHE can reduce performance even if your compiler is very smart and only FCACHEs code that really will benefit from it. This is because every long you reserve for FCACHE space is a long that is pretty much wasted for programs that don't do much FCACHEing. And in an LMM kernel, longs are so precious for so many other purposes that adopting FCACHE at all (especially a large FCACHE) is always going to result in compromises elsewhere in the LMM kernel.

    Conclusion? For some programs (including many of our favorite benchmarks), FCACHE will make a real difference. For instance, if your program is just over 496 longs, using FCACHE might mean it can still run at near cog PASM speed instead of at LMM PASM speed. But in other cases (and especially in larger programs) using FCACHE may make no difference, or may even be detrimental.

    I suppose if it was a compile-time option to have an FCACHE at all then that would be a good solution. Even better would be if you could choose at compile time the size of the FCACHE - you could then tune the size of the FCACHE individually for each program.

    Ross.
  • ctwardell Posts: 1,716
    edited 2011-09-14 06:43
    How about a cache mode for XMM where Hub ram is used as a second level cache?

    C.W.
  • jazzed Posts: 11,803
    edited 2011-09-14 08:07
    ctwardell wrote: »
    How about a cache mode for XMM where Hub ram is used as a second level cache?

    C.W.
    We have that for XMM. The size is selectable as 2K, 4K, or 8K.
    The cache line length is selectable as a power of 2 from 4 to 128 bytes.

    You can change these and the board type without recompiling if you like.
    The driver can be linked into the GCC binary and "over-loaded" by the propeller loader.

    All you need is an external memory driver binary and a board description like this:
    [ssf80]
        clkfreq: 80000000
        clkmode: XTAL1+PLL16X
        baudrate: 115200
        rxpin: 31
        txpin: 30
        tvpin: 12   # only used if TV_DEBUG is defined
        cache-driver: ssf_cache.dat
        cache-size: 8K
        cache-param1: 0
        cache-param2: 0
    
    More details of this and the external memory API will come later.

    RossH wrote: »
    I suppose if it was a compile-time option to have an FCACHE at all then that would be a good solution.
    It is a compile-time option.

    Heater. wrote: »
    1) It is a perhaps wasteful way to get native code running in a COG
    as it will be carrying the baggage of a now dead LMM kernel in the COG.
    The LMM/XMM VM kernel remains intact. You can separately start/stop a full COG's worth of GCC COG code (or GCC LMM code, or GCC GAS code, or even PASM code) on demand if you like.
  • RossH Posts: 5,517
    edited 2011-09-14 15:37
    jazzed wrote: »
    RossH wrote: »
    I suppose if it was a compile-time option to have an FCACHE at all then that would be a good solution.

    It is a compile-time option.

    Neato! You can enable it to give you good benchmark results, and then disable it for real programs. :smile:

    Ross.
  • Dave Hein Posts: 6,347
    edited 2011-09-14 15:54
    RossH wrote: »
    Neato! You can enable it to give you good benchmark results, and then disable it for real programs. :smile:
    Or the real-world program could be tuned to take advantage of the FCACHE by breaking large loops up into multiple small tight loops that fit in the FCACHE. This is done all the time for processors that have hardware caches that are limited in size.
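
    For example, a sketch of that transformation - splitting one large, call-free loop into two passes ("loop fission") so each body is small enough to be cached; the arithmetic stages are just placeholders:

        /* One big loop: the combined body may be too large for FCACHE. */
        void process_all(int *buf, int n)
        {
            int i;
            for (i = 0; i < n; i++) {
                /* ...stage 1: long stretch of inline filtering code... */
                buf[i] = (buf[i] * 3) >> 2;
                /* ...stage 2: long stretch of inline scaling code... */
                buf[i] = buf[i] + (buf[i] >> 4);
            }
        }

        /* Split into two tight, call-free loops: each pass can now fit
           in FCACHE and run at near-native speed, at the cost of
           scanning buf twice. */
        void process_all_split(int *buf, int n)
        {
            int i;
            for (i = 0; i < n; i++)
                buf[i] = (buf[i] * 3) >> 2;       /* stage 1 only */
            for (i = 0; i < n; i++)
                buf[i] = buf[i] + (buf[i] >> 4);  /* stage 2 only */
        }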
  • jazzed Posts: 11,803
    edited 2011-09-14 16:00
    RossH wrote: »
    Neato! You can enable it to give you good benchmark results, and then disable it for real programs. :smile:

    Ross.
    Just responding to your suggestion ;-) None of the previously published performance results included FastCache.
  • Bill Henning Posts: 6,445
    edited 2011-09-14 20:22
    RossH wrote: »
    That's great news for the Parallax GCC effort, Bill!

    Thanks Ross!
    RossH wrote: »
    I struggled with exactly the issues you raise, and never came to a satisfactory conclusion - now I will have the luxury of letting someone else do the hard work and just leveraging off the final answer! :smile:

    Ross.

    Well, back in '06 I was already sure that FCACHE and FLIB were necessary for really good performance - which is why they were part of LMM from the beginning.

    I know it's a tough sell - because it is definitely not easy to implement right in compilers, and it is counter-intuitive - after all, at first blush it seems to make more sense to add lots of helper functions.

    The reason that does not work too well is basically the hub access speed limitation. With a four-way unrolled loop, you get one LMM instruction every 20 clock cycles, which is kind of a sweet spot; you really hit diminishing returns after that.
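
    Conceptually, the LMM kernel is just the fetch-execute loop below (a C sketch of the idea, not the actual unrolled PASM; hub_read_long() and execute() are stand-ins for the rdlong and the in-place execution):

        #include <stdint.h>

        extern uint32_t hub_read_long(uint32_t addr);  /* stands in for rdlong */
        extern void     execute(uint32_t instruction); /* runs in-cog, native speed */

        /* Each pass fetches one native instruction from hub RAM and executes
           it inside the COG.  The hub read dominates the cost, which is why
           the real kernel unrolls this loop four ways to reach roughly 20
           clocks per LMM instruction. */
        void lmm_kernel(uint32_t pc)
        {
            for (;;) {
                uint32_t instruction = hub_read_long(pc);
                pc += 4;
                execute(instruction);
            }
        }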

    Now if you can load loops - even better, nested loops - into the FCACHE, you run at native speed within it, and due to the looping, the overhead of loading the cache is negligible.

    The only downside is loops not being able to call library functions... which is where FLIB comes in.

    For loops that need to call library functions such as floating point routines or string functions, a separately loaded, full-PASM-speed library is the answer :)

    Therefore, the bigger the FCACHE and FLIB areas, the closer to native performance.
  • Bill Henning Posts: 6,445
    edited 2011-09-14 20:40
    Heater. wrote: »
    It's amazing how fast Prop GCC is progressing.

    "FastCache" or any other name than FCACHE is an improvement. That "F" always
    had the wrong connotation for me:)

    LOL!

    It was 'F' for consistency: FJMP, FCALL, FRET (Far Jump, Far Call, Far Return) - the 'F' prefix was to distinguish LMM primitives from native instructions.
    Heater. wrote: »
    The fact that FastCache can fetch fairly large chunks of code into COG from HUB (or external memory) and run them at full PASM speed has some interesting consequences that I have never considered before. See FASTCACHE CONSEQUENCES below.

    I really should have pushed its use more; but I was too busy with Morpheus etc :)
    Heater. wrote: »
    RossH,

    I used to have the same opinion. It looked to me as if caching a few instructions into COG for the benefit of a few very small tight loops was not really going to have much overall effect, except for some trivial benchmarks, as you say, and simple string-copying loops etc.

    However, as you see, Prop GCC has managed to cache a fairly large non-trivial chunk of code in the FFT and accelerate it "dramatically". This changes my mind entirely. In fact, I would say FastCache has now become extremely important.

    I agree :)
    Heater. wrote: »
    FASTCACHE CONSEQUENCES:

    Imagine you want to write a driver or other functionality that:
    1) Needs to be fast, but not so fast that you would write it in PASM. Besides, you may want it to be portable.
    2) Is small enough to fit into FastCache, say 200 to 300 instructions, so that it can be compiled to native PASM and run at max speed in a COG.

    So now you can:
    1) In your normal LMM C program, start another COG running some function. It starts as just normal LMM code with its own new LMM kernel.
    2) That function (or one called from there) is small enough and loopy enough that it immediately gets FastCached and runs at native speed, forever.

    BINGO! You have just started a COG running C compiled to native PASM without having to do any special compilation or build steps for that code. You have not had to start it up with any different steps in your main program. It just happens, totally transparently.

    Looks like what has happened with the FFT is almost there.

    Exactly. Think SPI driver on demand. I2C driver. No cogs wasted. Also works for in-line assembly.
    Heater. wrote: »
    There are some downsides to this idea:

    1) It is a perhaps wasteful way to get native code running in a COG, as it will be carrying the baggage of a now-dead LMM kernel in the COG.

    The current cog-only mode helps with that.

    If you are very careful writing code for it, it should be possible to get it not to use a hub stack at all - which would be a big win.
    Heater. wrote: »
    2) Also, how would the author know at build time that the intended function is indeed FastCached as intended? How does he know when his function has become too big to cache? Does the compiler output anything to indicate that? (Indeed, that feedback might be required if FastCache is expected to be used even normally.)

    You will be able to take a look at the assembly source code generated by the compiler; also, Eric is working on getting GCC to generate FCACHE code whenever it is possible (and appropriate). He has made good progress already!

    Basically, any loop that does not call library functions is a candidate for FCACHE; FLIB will help with some library code as well.

    Anything with printf() will of course never fit in FCACHE - it's impossible! (There: by the magic of the forums, now someone will manage to fit printf...)
    Heater. wrote: »
    Aside: How cool would it be if the Prop II could execute LMM code without a software LMM kernel? That is to say, if it had some instruction(s) that started a hardware LMM execution mode, together with the few LMM pseudo-ops required (load immediate etc). Then the FastCache could be almost the entire 496 COG locations!

    In summary, the fact that FFT can be turbo charged by FastCache shows that it has a great deal of potential even for non-trivial benchmarks and real application code.

    An "LMM" mode would possible for Propeller 3, it would be very cool indeed, but fair warning, Chip is not to be distracted from Prop2 work ... I was almost make to walk the plank for taking up too much of his time at UPEW (mostly kidding).
  • Bill Henning Posts: 6,445
    edited 2011-09-14 20:48
    RossH wrote: »
    Heater,

    Hmmm - this is a long way from being demonstrated yet. There is a reason why many of the small benchmark programs (like Dhrystone) are no longer used - it is because caching techniques like FCACHE made the relationship between benchmark performance and real-world application performance almost completely meaningless.

    It is fairly simple to demonstrate that in quite trivial programs FCACHE does not improve performance at all - in fact, a dumb compiler that insists on FCACHEing any code that just happens to look like a call to a small function, or a tight loop (but one which is only executed a few times) could very often result in reduced program performance.

    But the reality is actually worse - as Bill alluded to earlier (and as I have also mentioned elsewhere) - in some cases FCACHE can reduce performance even if your compiler is very smart and only FCACHEs code that really will benefit from it. This is because every long you reserve for FCACHE space is a long that is pretty much wasted for programs that don't do much FCACHEing. And in an LMM kernel, longs are so precious for so many other purposes that adopting FCACHE at all (especially a large FCACHE) is always going to result in compromises elsewhere in the LMM kernel.

    Conclusion? For some programs (including many of our favorite benchmarks), FCACHE will make a real difference. For instance, if your program is just over 496 longs, using FCACHE might mean it can still run at near cog PASM speed instead of at LMM PASM speed. But in other cases (and especially in larger programs) using FCACHE may make no difference, or may even be detrimental.

    I suppose if it was a compile-time option to have an FCACHE at all then that would be a good solution. Even better would be if you could choose at compile time the size of the FCACHE - you could then tune the size of the FCACHE individually for each program.

    Ross.

    I really like the command line switch to use / not use FCACHE... it will finally let us test the theory :-)

    Properly implemented FCACHE, with FLIB, should not reduce performance, as straight-line code cannot really benefit from FCACHE.

    The real benefit comes when loops are pulled into the cache, or small leaf functions that loop a lot.

    My past experiments indicated that best results would be obtained with dedicating at least half the cog memory (I used 256 longs) as an FCACHE, more if possible.

    For some loops it is best to also have an "FLIB" of leaf functions that are called in the innermost loop - i.e. float, string, etc. leaf functions.

    When FCACHE is not used for loops, it can of course be loaded with a large FLIB of kernel helper functions!
  • Bill Henning Posts: 6,445
    edited 2011-09-14 20:52
    Exactly!

    I've even considered launching slave cogs to execute part of the loop (range of indexes) in parallel!

    (Oops! Wife is giving me grief for using after-work "family time" to post. I am in trouble....)
    Dave Hein wrote: »
    Or the real-world program could be tuned to take advantage of the FCACHE by breaking large loops up into multiple small tight loops that fit in the FCACHE. This is done all the time for processors that have hardware caches that are limited in size.
  • RossH Posts: 5,517
    edited 2011-09-14 22:16
    Bill Henning wrote: »
    For some loops it is best to also have an "FLIB" of leaf functions that are called in the innermost loop - i.e. float, string, etc. leaf functions.

    When FCACHE is not used for loops, it can of course be loaded with a large FLIB of kernel helper functions!

    I agree with this. For a while I contemplated the idea of having a dynamically generated FLIB as part of each program - i.e. instead of having FCACHE and a fixed FLIB, the compiler would pull out candidate functions (or parts of functions) that would benefit from being run in a dedicated cog, and put those in one (or more) FLIB cogs - then your main program calls the FLIB cogs at appropriate points in the normal program execution.

    This would give you all the benefits of both FCACHE and FLIB - but without the FCACHE loading overhead (which no-one ever seems to take into account), and also without being stuck with a fixed set of FLIB functions that includes things that most programs don't need.

    However, this kind of thing is well beyond the capability of current compilers ... I think.
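
    A minimal sketch of what calling such an FLIB cog might look like (assuming a simple hub-RAM mailbox; the struct layout and operation codes are purely illustrative):

        #include <stdint.h>

        /* Hub-RAM mailbox shared between the caller and a dedicated FLIB cog.
           The caller posts an operation code plus arguments; the FLIB cog
           spins on 'op', runs the requested leaf function at native speed,
           writes 'result', and clears 'op' to signal completion. */
        typedef struct {
            volatile uint32_t op;       /* 0 = idle, nonzero = request */
            volatile uint32_t arg1;
            volatile uint32_t arg2;
            volatile uint32_t result;
        } flib_mailbox_t;

        enum { FLIB_IDLE = 0, FLIB_FMUL = 1 };   /* illustrative op codes */

        static flib_mailbox_t mbox;

        /* Caller side: a "remote call" into the FLIB cog. */
        uint32_t flib_call(uint32_t op, uint32_t a, uint32_t b)
        {
            mbox.arg1 = a;
            mbox.arg2 = b;
            mbox.op   = op;             /* writing 'op' last starts the request */
            while (mbox.op != FLIB_IDLE)
                ;                       /* wait for the FLIB cog to finish */
            return mbox.result;
        }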

    Ross.
  • Heater. Posts: 21,230
    edited 2011-09-15 03:58
    Has anyone counted the cycles and done the math on this "loading of code overhead" vs "how many times around the loop the cached code goes"?
    Obviously one execution of the cached code is a loser.
    How about twice around the loop? My gut tells me that may already be a winner. Is there something I'm missing?
    In the case of the FFT, the load time must be tens of microseconds and the execution time, running to over 100 ms, is halved or quartered. No contest: FastCache wins.
  • RossH Posts: 5,517
    edited 2011-09-15 04:01
    Heater. wrote: »
    Has anyone counted the cycles and done the math on this "loading of code overhead" vs "how many times around the loop the cached code goes"?
    Obviously one execution of the cached code is a loser.
    How about twice around the loop? My gut tells me that may already be a winner. Is there something I'm missing?
    In the case of the FFT, the load time must be tens of microseconds and the execution time, running to over 100 ms, is halved or quartered. No contest: FastCache wins.

    It is probably just easier to write the code and see.
  • Heater. Posts: 21,230
    edited 2011-09-15 04:11
    Ah, the scientific method. Wait, is that theory first then experiment or experiment first?
    Can't do either here at the moment.
  • RossH Posts: 5,517
    edited 2011-09-15 05:51
    Heater. wrote: »
    Ah, the scientific method. Wait, is that theory first then experiment or experiment first?
    Can't do either here at the moment.

    It's been a loooong time since I did this exercise, but from memory I think it takes more than twice as long just to load an FCACHEd chunk of code as it does to completely execute the same code using LMM. So FCACHEing loops is only beneficial if you can guarantee the loop will be executed three times or more. In cases where there are multiple exit conditions from the loop, it is generally impossible for the code generator to tell in advance whether it will be beneficial to FCACHE the loop - and if it just blindly FCACHEs every candidate loop in the program, then it is quite possible that some parts of an FCACHEd program may in fact run many times slower than a non-FCACHEd program.

    Also keep in mind that the bigger the FCACHE, the bigger the potential speedup if the compiler gets it right - but also the bigger the potential slowdown if the compiler gets it wrong. However, there is one class of programs that does benefit unambiguously from FCACHE - i.e. those programs which exhibit very predictable loop behaviour - such as most simple benchmarks :smile:.

    However, I could easily be misremembering some of the details here, or my original analysis may have been faulty - which is why I suggested simply writing the code to check. After all, the whole mechanism is only a few instructions long.

    Ross.
  • ctwardell Posts: 1,716
    edited 2011-09-15 06:03
    Would it be useful to use attributes to control FastCache use?

    Something like: __attribute__ ((fastcache))
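
    For instance (GCC's attribute syntax is real, but the "fastcache" attribute itself is only a suggestion at this point):

        #include <stdint.h>

        /* Hypothetical: ask the compiler to keep this hot leaf routine
           FCACHE-resident.  The attribute name is just the proposal above,
           not something Prop GCC currently implements. */
        __attribute__((fastcache))
        void copy_longs(uint32_t *dst, const uint32_t *src, int n)
        {
            while (n-- > 0)
                *dst++ = *src++;    /* tight, call-free loop */
        }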

    C.W.
  • Heater. Posts: 21,230
    edited 2011-09-15 06:06
    I am sure you are right. Three times around the code sounds like a safe bet for performance gains with FastCache.
    In general, the compiler cannot know whether a loop will go around three times or not. That could depend on some input value whose run-time value the compiler cannot know, or on some complicated previous calculation that we cannot expect a compiler to fully analyze.
    Seems we need some pragmas so that the programmer can suggest that caching a loop or function would be a good idea.
  • Dave Hein Posts: 6,347
    edited 2011-09-15 07:02
    TI uses "#pragma MUST_ITERATE(min, max, mult)" to tell the compiler the minimum and maximum times a loop may iterate, and the multiple of the loop count. MUST_ITERATE(10,,2) says that the loop will iterate at least 10 times, and will always iterate a multiple of 2 times. Maybe we could use a pragma similar to that.
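
    In TI's compilers the pragma goes immediately before the loop; a Prop GCC equivalent might look like this (illustrative only - Prop GCC does not support such a pragma):

        /* The hint promises at least 8 iterations, always a multiple of 4,
           so the compiler knows FCACHEing this loop will pay for itself. */
        void scale_buffer(int *buf, int n, int gain)
        {
            int i;
            #pragma MUST_ITERATE(8,,4)
            for (i = 0; i < n; i++)
                buf[i] *= gain;
        }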
  • Bill Henning Posts: 6,445
    edited 2011-09-15 09:08
    RossH wrote: »
    I agree with this. For a while I contemplated the idea of having a dynamically generated FLIB as part of each program - i.e. instead of having FCACHE and a fixed FLIB, the compiler would pull out candidate functions (or parts of functions) that would benefit from being run in a dedicated cog, and put those in one (or more) FLIB cogs - then your main program calls the FLIB cogs at appropriate points in the normal program execution.

    That is a fascinating idea; I always thought of FLIB as being statically generated... I will think about it.
    RossH wrote: »
    This would give you all the benefits of both FCACHE and FLIB - but without the FCACHE loading overhead (which no-one ever seems to take into account), and also without being stuck with a fixed set of FLIB functions that includes things that most programs don't need.

    However, this kind of thing is well beyond the capability of current compilers ... I think.

    Ross.

    "well beyond the capability of current compilers ... I think"

    Come on, you should have said it is impossible! Then someone would have pulled it off via forum magic...

    All kidding aside, it would be possible, but not easy - i.e. a lot of work.
  • Bill Henning Posts: 6,445
    edited 2011-09-15 09:10
    Heater. wrote: »
    Has anyone counted the cycles and done the math on this "loading of code overhead" vs "how many times around the loop the cached code goes"?
    Obviously one execution of the cached code is a loser.
    How about twice around the loop? My gut tells me that may already be a winner. Is there something I'm missing?
    In the case of the FFT, the load time must be tens of microseconds and the execution time, running to over 100 ms, is halved or quartered. No contest: FastCache wins.

    I actually did that when writing a couple of magazine articles in the past... which were accepted, but I did not get part 3 done in time. I'll put them up on my site.

    Eric also verified the cycles before he added FCACHE support.

    Basically, even without Phil's reverse-loader, twice through the loop ties; anything more is a win.

    As almost all loops execute more than twice, it is difficult for FCACHE not to win...
  • Bill Henning Posts: 6,445
    edited 2011-09-15 09:14
    RossH wrote: »
    It is probably just easier to write the code and see.

    That works too!
  • kenr707 Posts: 151
    edited 2011-09-15 09:47
    RossH wrote: »
    So FCACHEing loops is only beneficial if you can guarantee the loop will be executed three times or more.

    No, actually what you need is that the average case be at least 3 trips (well, maybe 2 point something), not a guaranteed 3 trips. If you do it once, and then, on a separate load, go around 100 times, you've still gained.
  • Bill Henning Posts: 6,445
    edited 2011-09-15 10:09
    RossH wrote: »
    It's been a loooong time since I did this exercise, but from memory I think it takes more than twice as long just to load an FCACHEd chunk of code as it does to completely execute the same code using LMM. So FCACHEing loops is only beneficial if you can guarantee the loop will be executed three times or more. In cases where there are multiple exit conditions from the loop, it is generally impossible for the code generator to tell in advance whether it will be beneficial to FCACHE the loop - and if it just blindly FCACHEs every candidate loop in the program, then it is quite possible that some parts of an FCACHEd program may in fact run many times slower than a non-FCACHEd program.

    Also keep in mind that the bigger the FCACHE, the bigger the potential speedup if the compiler gets it right - but also the bigger the potential slowdown if the compiler gets it wrong. However, there is one class of programs that does benefit unambiguously from FCACHE - i.e. those programs which exhibit very predictable loop behaviour - such as most simple benchmarks :smile:.

    However, I could easily be misremembering some of the details here, or my original analysis may have been faulty - which is why I suggested simply writing the code to check. After all, the whole mechanism is only a few instructions long.

    Ross.

    Pretty good memory Ross!

    Worst case FCACHE loader is 32 cycles per long.

    Decent case FCACHE loader (2x unrolled) is 24 cycles per long, but will load an extra long approx. 50% of the time.

    With 4-way unroll, non-hub LMM instructions execute in 20 cycles.

    We can ignore hub instructions in this simplified analysis, as they would also be correspondingly slower in pure PASM.

    To load an FCACHE segment N instructions long then takes approximately

    24*N cycles

    Let's say 90% of the instructions in it take 4 cycles, and the other 10% take 20 cycles (hub instructions, allowing for latency to next hub window)

    Let's consider this a single loop, executed X times. The "win" will be much bigger for nested loops.

    Time to load: approx. 24*N

    Time to execute once in cache: 0.9*N*4 + 0.1*N*20

    Tc = 24N + X*(0.9N*4 + 0.1N*20) ' total execution time of FCACHED loop

    Ti = X*(0.9N*20+0.1N*32) ' total execution time if run as straight line LMM

    Let's simplify...

    Tc = 24N + X(3.6N + 2N) = 24N + X(5.6N)

    Tc = 24N + 5.6N*X

    For LMM

    Ti = X*(0.9N*20+0.1N*32) ' total execution time if run as straight line LMM
    Ti = X(18N+3.2N)
    Ti = X(21.2N)
    Ti = 21.2NX

    For PASM

    Ta = X*(0.9*N*4+0.1*N*20) ' picking average 20 cycles for hub op, same as above

    Ta = X*(3.6N + 2N)

    Ta = X*5.6N

    FCACHED execution speed approximation for a single loop N instructions long executed X times:

    Tc = 24N + 5.6*X*N

    LMM execution speed approximation for a single loop N instructions long executed X times:

    Ti = 21.2NX

    So let's do some analysis

    X=2 ' worst possible case
    N=10 ' small loop, minimizes benefits

    Ti = 21.2 * 10 * 2 = 212*2 = 424 cycles

    Tc = 24*10 + 5.6*2*10 = 240 + 56*2 = 240 + 112 = 352 cycles

    Ta = 2*5.6*10 = 20*5.6 = 112 cycles

    424 cycles for pure LMM
    352 cycles for LMM + FCACHE
    112 cycles for PASM


    X=10
    N=100

    Ti = 21.2*100*10 = 21200 cycles
    Tc = 24*100 + 5.6*100*10 = 2400+5600 = 8000 cycles
    Ta = 10*5.6*100 = 5600

    21200 cycles for pure LMM
    8000 cycles for LMM+FCACHE
    5600 cycles for pure PASM


    X = 10
    N = 200

    Ti = 21.2*200*10 = 42400
    Tc = 24*200+5.6*200*10 = 4800+11200 = 16000
    Ta = 10*5.6*200 = 11200

    42400 cycles for pure LMM
    16000 cycles for LMM+FCACHE
    11200 cycles for pure PASM


    X = 50
    N = 200

    Ti = 21.2*200*50 = 212000
    Tc = 24*200 +5.6*200*50 = 4800+56000 = 60800
    Ta = 50*5.6*200 = 56000

    212000 for pure LMM
    60800 for LMM+FCACHE <--- only takes 8.6% longer than pure pasm! >90% of PASM performance
    56000 for pure PASM


    Numbers would be much better for nested loops of course

    Ok, now that I have shown some concrete numbers, back to the math...

    Ti = 21.2NX
    Tc = 24N+5.6XN
    Ta = 5.6XN

    Ti will linearly increase with both X and N
    Tc shows the higher the number of loops, the smaller the impact of loading the cache
    Tc/Ta computes the overhead of LMM+FCACHE over pure PASM, and clearly the overhead of 24N becomes smaller as both X and N increase

    The analysis can be expanded to cover nested loops fairly easily

    let o be the outer loop
    let i be the inner loop
    N=o+i
    let j be the number of times the inner loop is executed

    Ti = 21.2NX = 21.2(o+ij)X
    Ti = 21.2X(o+ij)

    Tc = 24(o+i) + 5.6X(o+ij)

    Ta = 5.6X(o+ij)

    I will only run one analysis of this case :-)

    X = 20
    o = 200
    i = 56
    j = 20

    Ti = 21.2(o+ij)X
    Ti = 21.2(200+56*20)*20 = 21.2(200+1120)*20 = 559680

    Tc = 24(o+i) + 5.6X(o+ij) = 24(200+56) + 5.6*20*(200+56*20)
    Tc = 6144 + 112*(200+1120) = 6144 + 147840 = 153984

    Ta = 5.6*20*(200+56*20) = 112*1320 = 147840

    Tc/Ta = 1.042

    LMM+FCACHE in this case achieves about 96% of pure PASM speed
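
    The same model in C, for anyone who wants to re-check the arithmetic or try other loop sizes (a sketch; the constants are the ones defined above):

        #include <stdio.h>

        /* Single-loop timing model: N instructions, executed X times.
           24 cycles/long to load FCACHE, 21.2 cycles/LMM instruction,
           5.6 cycles/instruction average in native PASM (90% at 4 cycles,
           10% hub ops at 20 cycles). */
        static void model(double n, double x)
        {
            double ti = 21.2 * n * x;            /* pure LMM */
            double tc = 24.0 * n + 5.6 * n * x;  /* LMM + FCACHE */
            double ta = 5.6 * n * x;             /* pure PASM */
            printf("N=%3.0f X=%2.0f  LMM=%7.0f  FCACHE=%7.0f  PASM=%7.0f  FCACHE/PASM=%.3f\n",
                   n, x, ti, tc, ta, tc / ta);
        }

        int main(void)
        {
            model(10, 2);     /* worst case:  424 /   352 /   112 cycles */
            model(100, 10);   /*            21200 /  8000 /  5600 cycles */
            model(200, 50);   /*           212000 / 60800 / 56000 cycles */
            return 0;
        }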
  • Bill Henning Posts: 6,445
    edited 2011-09-15 10:16
    Dave Hein wrote: »
    TI uses "#pragma MUST_ITERATE(min, max, mult)" to tell the compiler the minimum and maximum times a loop may iterate, and the multiple of the loop count. MUST_ITERATE(10,,2) says that the loop will iterate at least 10 times, and will always iterate a multiple of 2 times. Maybe we could use a pragma similar to that.

    Dave, your comment combined with C.W.'s and Heater's suggests something like:

        #FastCache

        ... nested loops ...

        #CacheEnd

    If the cache block is too big to fit, the compiler can emit a warning!

    This should make it easier for Eric to FCACHE nested loops, as he would have hints of the boundaries.