LMM execution speeds w/ XMM
in Propeller 1
I noticed on this propellergcc documentation page it's stated that:
Select code can run as LMM even in XMM-type memory modes.
What actually distinguishes "select code"? I am very interested in both the speed of the Large Memory Model (LMM) and the capacity of the eXternal Memory Model (XMM), and would like to take advantage of both. My use case is arcade game code that could be both a) complex, requiring fast execution, and b) extensive, requiring external storage.
Comments
Because a simple single-instruction LMM loop misses the sweet spot for hub access, some kernels use an unrolled loop, i.e. it fetches and executes, then fetches and executes, and so on before jumping back to the top of the loop. This saves a hub window (16 clocks) for every extra unrolled fetch after the first. But remember, every unrolled fetch still takes, IIRC, one 16-clock hub window, versus 4 clocks for a native cog instruction, so your program is still 4x slower. And IIRC jumps and calls take 2 fetch/execute cycles, so those are 8x slower. The plain single-instruction loop takes 32 clocks per instruction, so that is 8x slower as well.
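The figures above can be turned into a rough back-of-the-envelope model. The sketch below is plain C meant to run on a PC, not Propeller code. It assumes 4 clocks for a native cog instruction, 32 clocks for the first fetch/execute in the loop, and 16 clocks for each additional unrolled one, as quoted above. The function names are made up for illustration.

```c
#include <assert.h>

#define NATIVE_CLOCKS 4   /* clocks for an ordinary cog instruction */
#define HUB_WINDOW    16  /* clocks between hub access slots for one cog */

/* Clocks for one pass through an LMM loop unrolled `unroll` times.
   Assumption taken from the post: the first fetch/execute costs two hub
   windows (32 clocks); each extra unrolled one costs a single window. */
unsigned lmm_clocks_per_pass(unsigned unroll)
{
    assert(unroll >= 1);
    return 2 * HUB_WINDOW + (unroll - 1) * HUB_WINDOW;
}

/* Average slowdown relative to native cog code, as a ratio. */
double lmm_slowdown(unsigned unroll)
{
    return (double)lmm_clocks_per_pass(unroll) / (unroll * NATIVE_CLOCKS);
}
```

Under these assumptions `lmm_slowdown(1)` is 8.0, `lmm_slowdown(4)` is 5.0 and `lmm_slowdown(8)` is 4.5, so the average approaches 4x as the loop is unrolled further but never gets below it. Jumps and calls are not modelled here.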
Added to the mix, there is FCACHE, which loads blocks of code (into hub, if I have this correct; I've never used it).
I wrote a fast overlay loader that loads blocks of code into COG memory, where they execute. The loader hits the hub sweet spot for the whole load. It's great if you are willing to organise your code around it, especially if you have lots of little subroutines that run many loop iterations: each routine executes at full cog speed, but you pay the loading overhead.
Hope this helps.
Actually FCACHE loads the code into COG memory, so it's similar to your overlay loader but done automatically by the compiler for loops. So for example code like:
    #include <propeller.h>

    #define pin 1
    #define delay 1000

    void shift_out(unsigned val)
    {
        unsigned tim, i;

        tim = _CNT + delay;
        for (i = 0; i < 32; i++) {
            if (val & 1) {
                OUTA |= (1<<pin);
            } else {
                OUTA &= ~(1<<pin);
            }
            val >>= 1;
            waitcnt(tim);
            tim += delay;
        }
    }
gets turned by the compiler into something like:

    _shift_out
            mov     COUNT_, #3
            call    #pushregs_
            mov     local01, arg01
            mov     local02, cnt
            add     local02, imm_1000_
            mov     local03, #32
            call    #LMM_FCACHE_LOAD
            long    (@@@LR__0002-@@@LR__0001)
    LR__0001
            shr     local01, #1 wc
     if_c   or      outa, #2
     if_nc  andn    outa, #2
            mov     arg01, local02
            waitcnt arg01, #0
            add     local02, imm_1000_
            djnz    local03, #LMM_FCACHE_START + (LR__0001 - LR__0001)
    LR__0002
            mov     sp, fp
            call    #popregs_
    _shift_out_ret
            call    #LMM_RET
(I'm showing fastspin generated code here rather than PropGCC but the principle is the same). The LMM_FCACHE_LOAD function copies some number of instructions from HUB into COG memory, starting at LMM_FCACHE_START. If the correct instructions are already in COG then it returns immediately. Once in COG memory the loop runs at full speed, avoiding the 4x or 8x LMM penalty.
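To make the "returns immediately if already loaded" behaviour concrete, here is a toy model in plain C that runs on a PC. It is not the real LMM kernel, which is PASM living inside the cog. `fcache_load`, `FCACHE_WORDS` and the tracking variables are hypothetical names, and identifying the cached block by its hub address and length is only an assumption about how such a check could work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FCACHE_WORDS 64                    /* hypothetical size of the cache area */

static uint32_t fcache[FCACHE_WORDS];      /* stands in for COG memory at LMM_FCACHE_START */
static const uint32_t *cached_src = NULL;  /* hub block currently held in the cache */
static unsigned cached_len = 0;
unsigned fcache_copies = 0;                /* counts real copies, for demonstration only */

/* Copy `len` longs from "hub" into the cache unless they are already there.
   Returns 1 if the block is ready to run from the cache, 0 if it does not fit. */
int fcache_load(const uint32_t *hub_code, unsigned len)
{
    assert(hub_code != NULL);
    if (len > FCACHE_WORDS)
        return 0;                          /* too big for the cache in this model */
    if (hub_code == cached_src && len == cached_len)
        return 1;                          /* hit: nothing to copy, return at once */
    memcpy(fcache, hub_code, len * sizeof(uint32_t));
    cached_src = hub_code;
    cached_len = len;
    fcache_copies++;
    return 1;
}
```

Calling `fcache_load` twice in a row with the same block copies it only once, which is why a function like `shift_out` that is called repeatedly pays the load cost mainly on its first call, as long as no other cached loop evicts it in between.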
FCACHE works in both LMM and XMM modes, but in XMM mode there's a lot less space free in COG memory so only small loops will fit into the FCACHE.
You can put an __attribute__((section(".hubtext"))) on a function to declare that it should go in HUB memory instead of external memory; similarly there's __attribute__((section(".hubdata"))) for data. There are HUBTEXT and HUBDATA defines in propeller.h to save you a bit of typing. Obviously, if you overuse these directives you will run out of hub space.
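As a rough sketch of how this might look for the arcade-game use case, the snippet below keeps one hot routine and a small table in hub memory using the HUBTEXT and HUBDATA defines mentioned above. The function and table names are invented for illustration. The fallback `#define`s exist only so the same file also compiles on a PC, and checking `__propeller__` / `__PROPELLER__` to detect the Propeller toolchain is an assumption.

```c
#include <assert.h>

#if defined(__propeller__) || defined(__PROPELLER__)
#include <propeller.h>      /* provides HUBTEXT and HUBDATA */
#endif
#ifndef HUBTEXT
#define HUBTEXT             /* no-op when not building for the Propeller */
#endif
#ifndef HUBDATA
#define HUBDATA             /* no-op when not building for the Propeller */
#endif

#define NUM_SPRITES 4

/* Small, frequently read table: ask for it to live in hub RAM. */
HUBDATA int sprite_speed[NUM_SPRITES] = { 1, 2, 3, 4 };

/* Hot per-frame routine: marked so it is placed in hub memory. */
HUBTEXT int advance_sprite(int index, int x)
{
    assert(index >= 0 && index < NUM_SPRITES);
    return x + sprite_speed[index];
}

/* Rarely called code is left unmarked, so in an XMM build it can stay
   in external memory. */
int title_screen_score_bonus(int level)
{
    return level * 100;
}
```

The idea is to mark only the handful of routines that run every frame and leave menus, level data and other bulky code unmarked, since hub space is the scarce resource.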