LMM execution speeds w/ XMM
escher
Posts: 138
in Propeller 1
I noticed on this propellergcc documentation page it's stated that:
Select code can run as LMM even in XMM-type memory modes.
What actually differentiates "select code"? I am extremely interested in both the large memory model speed as well as the eXternal memory model size, and would like to take advantage of both. My use-case is arcade game code that could be both a) complex requiring fast execution, and b) extensive requiring external storage.
Comments
Because a single-loop fetcher misses the sweet spot for hub access, some use an unrolled loop, i.e. it fetches and executes, then fetches and executes, and so on before jumping back to the top of the loop. This saves a hub loop (16 clocks) for every extra unrolled fetch after the first. But remember, every unrolled fetch still takes, IIRC, one 16-clock hub loop, so your program is still 4x slower than native cog code. And IIRC jumps and calls take 2 fetch/execute cycles, so that's 8x slower. The normal single loop takes 32 clocks per instruction, so that is 8x slower as well.
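A quick sanity check of those figures, as a rough sketch only. It assumes a native cog instruction takes 4 clocks (which the 4x figure implies) and one hub window every 16 clocks; the function names are made up for illustration.

```c
#include <stdio.h>

/* Clock counts as described in the post above (Propeller 1). */
enum {
    NATIVE_CLOCKS = 4,   /* assumed: a typical instruction running directly in a cog */
    HUB_WINDOW    = 16   /* one hub access slot per cog every 16 clocks */
};

/* Clocks spent per LMM instruction when each fetch/execute step
   costs `hub_windows` hub slots. */
unsigned lmm_clocks(unsigned hub_windows)
{
    return hub_windows * HUB_WINDOW;
}

/* Slowdown relative to native cog code. */
unsigned lmm_slowdown(unsigned hub_windows)
{
    return lmm_clocks(hub_windows) / NATIVE_CLOCKS;
}

/* Print the three cases discussed in the post. */
void print_lmm_slowdowns(void)
{
    printf("unrolled fetch:          %ux slower\n", lmm_slowdown(1));
    printf("jump/call (2 cycles):    %ux slower\n", lmm_slowdown(2));
    printf("single-loop fetch (32c): %ux slower\n", lmm_slowdown(2));
}
```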
Now, into the mix comes FCACHE, which loads blocks of code (into hub if I have this correct; I've never used it).
I wrote a fast overlay loader that loads blocks of code into COG, where they execute. The loader hits the sweet spot for the whole load. It's great if you want to organise your code, especially if you have lots of little subroutines that run lots of loops: each routine executes at full cog speed, but you pay the loading overhead.
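To get a feel for when that loading overhead pays off, here is a back-of-the-envelope cost model. It is only a sketch under my own assumptions: the loader copies one long per 16-clock hub window, LMM runs at its best-case 16 clocks per instruction, native code takes 4 clocks per instruction, and fixed call overhead is ignored. None of these names come from a real loader.

```c
/* Assumed costs, in clocks (see the assumptions stated above). */
#define NATIVE_CLOCKS     4UL   /* per instruction once it is in the cog */
#define LMM_CLOCKS       16UL   /* per instruction, best-case unrolled LMM */
#define LOAD_CLOCKS_LONG 16UL   /* per long copied hub -> cog */

/* Cost of running a block of `instrs` instructions `passes` times as LMM. */
unsigned long lmm_cost(unsigned long instrs, unsigned long passes)
{
    return LMM_CLOCKS * instrs * passes;
}

/* Cost of loading the block into the cog once, then running it natively. */
unsigned long overlay_cost(unsigned long instrs, unsigned long passes)
{
    return LOAD_CLOCKS_LONG * instrs + NATIVE_CLOCKS * instrs * passes;
}

/* Nonzero when the overlay is cheaper than plain LMM under this model. */
int overlay_wins(unsigned long instrs, unsigned long passes)
{
    return overlay_cost(instrs, passes) < lmm_cost(instrs, passes);
}
```

Under these assumptions a block that runs only once is cheaper as plain LMM, while anything that loops at least twice already comes out ahead as an overlay.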
Hope this helps.
Actually, FCACHE loads the code into COG memory, so it's similar to your overlay loader but done automatically by the compiler for loops. So, for example, code like:
[code sample not preserved]
gets turned by the compiler into something like:
[generated code not preserved]
(I'm showing fastspin generated code here rather than PropGCC but the principle is the same). The LMM_FCACHE_LOAD function copies some number of instructions from HUB into COG memory, starting at LMM_FCACHE_START. If the correct instructions are already in COG then it returns immediately. Once in COG memory the loop runs at full speed, avoiding the 4x or 8x LMM penalty.
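The load-or-return-immediately behaviour described above can be modelled on a PC roughly like this. It is a host-side sketch, not the real LMM_FCACHE_LOAD: the cache size, the function name and the way a hit is detected (same hub address and length) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FCACHE_LONGS 64                 /* hypothetical cache size, in longs */

static uint32_t fcache[FCACHE_LONGS];   /* stands in for COG memory at LMM_FCACHE_START */
static const uint32_t *cached_src;      /* hub address of the block currently cached */
static size_t cached_len;               /* length of that block, in longs */

/* Model of an FCACHE load.
   Returns  1 if the block had to be copied into the cache,
            0 if it was already there (nothing copied),
           -1 if it does not fit and must run as ordinary LMM code. */
int fcache_load_model(const uint32_t *hub_src, size_t longs)
{
    if (longs > FCACHE_LONGS)
        return -1;
    if (hub_src == cached_src && longs == cached_len)
        return 0;                       /* hit: return immediately */
    memcpy(fcache, hub_src, longs * sizeof fcache[0]);
    cached_src = hub_src;
    cached_len = longs;
    return 1;
}
```

The point of the model is that a loop re-entered many times pays the copy only on the first entry, as long as nothing else has evicted it in between.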
FCACHE works in both LMM and XMM modes, but in XMM mode there's a lot less space free in COG memory so only small loops will fit into the FCACHE.
You can put an __attribute__((section(".hubtext"))) on a function to declare that it should go in HUB memory instead of external memory; similarly there's __attribute__((section(".hubdata"))) for data. There are HUBTEXT and HUBDATA defines in propeller.h to save you a bit of typing. Obviously, if you overuse these directives you will run out of hub space.
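For the arcade-game case in the original question, that might look something like the sketch below. The function and table names are invented for illustration, and the `__PROPELLER__` / `__propeller__` check is an assumption about what the toolchain predefines; the fallback defines only exist so the same file also builds on a PC.

```c
#include <stdint.h>

#if defined(__PROPELLER__) || defined(__propeller__)  /* assumed predefined by PropGCC */
#include <propeller.h>          /* provides HUBTEXT and HUBDATA */
#else
#define HUBTEXT                 /* host build: no special placement */
#define HUBDATA
#endif

/* Small table kept in hub RAM so the hot routine below does not
   have to go out to external memory for its data. */
static HUBDATA uint8_t sprite_row_mask[4] = { 0x81, 0x42, 0x24, 0x18 };

/* Hot inner routine: placed in hub memory rather than external memory. */
HUBTEXT uint32_t mix_pixels(uint32_t acc, unsigned row)
{
    return acc ^ sprite_row_mask[row & 3u];
}

/* Cold, bulky code: left unmarked, so in an XMM build it goes
   wherever the memory model puts code by default. */
uint32_t draw_title_screen(void)
{
    uint32_t acc = 0;
    unsigned row;

    for (row = 0; row < 4; ++row)
        acc = (acc << 8) | (mix_pixels(0, row) & 0xFFu);
    return acc;
}
```

The idea is to mark only the handful of routines the frame loop really depends on and leave menus, level data and the like unmarked, since hub space is the scarce resource.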