LMM execution speeds w/ XMM

escher · 2019-11-01 14:16

I noticed on this propellergcc documentation page it's stated that:

Select code can run as LMM even in XMM-type memory modes.

What actually differentiates "select code"? I am extremely interested in both the large memory model speed as well as the eXternal memory model size, and would like to take advantage of both. My use-case is arcade game code that could be both a) complex requiring fast execution, and b) extensive requiring external storage.

Cluso99 · 2019-11-05 20:11

LMM and XMM models use a small loop that fetches each instruction into the COG and executes it there, inside the loop. Special instructions are used to handle jumps and calls.
Because a simple one-instruction fetch loop misses the sweet spot in the hub access timing, some implementations use an unrolled loop, i.e. it fetches and executes, then fetches and executes, and so on, before jumping back to the top of the loop. This saves a hub loop (16 clocks) for every extra unrolled fetch after the first. But remember, every unrolled fetch still takes, IIRC, one 16-clock hub loop, so your program is still 4x slower than native COG code. And IIRC jumps and calls take 2 fetch/execute cycles, so that's 8x slower. The normal single loop takes 32 clocks per instruction, so that is 8x slower as well.
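To make the arithmetic above concrete, here is a minimal sketch in C that turns the quoted clock counts into slowdown factors. It assumes an ordinary COG instruction takes 4 clocks; the constant and function names are made up for illustration, and the clock counts are the approximate ("IIRC") figures from this post, not measurements.

```c
/* Approximate clock counts taken from the post above; not measured values. */
enum {
    NATIVE_CLOCKS = 4,   /* assumed: one ordinary COG instruction */
    HUB_WINDOW    = 16,  /* one hub loop, as used by an unrolled fetch */
    SINGLE_LOOP   = 32   /* simple fetch/execute loop that misses a window */
};

/* Slowdown relative to native COG code for a given clocks-per-instruction. */
unsigned lmm_slowdown(unsigned clocks_per_instruction)
{
    return clocks_per_instruction / NATIVE_CLOCKS;
}
```

With these numbers, lmm_slowdown(HUB_WINDOW) gives 4 for an unrolled fetch, and both lmm_slowdown(SINGLE_LOOP) and lmm_slowdown(2 * HUB_WINDOW) (a jump or call) give 8.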
Now, into the mix there is FCACHE that loads blocks of code (into hub if I have this correct - I’ve never used it).
I wrote a fast overlay loader that loads blocks of code into COG, where it executes. The loader hits the sweet spot for the total load. It's great if you want to organise your code, especially if you have lots of little subroutines that loop a lot: the routine executes at full COG speed, but you have the loading overhead.
Hope this helps.

ersmith · 2019-11-05 20:41

Cluso99 wrote: »

Now, into the mix there is FCACHE that loads blocks of code (into hub if I have this correct - I’ve never used it).
I wrote a fast overlay loader that loads blocks of code into COG, where it executes. The loader hits the sweet spot for the total load. It's great if you want to organise your code, especially if you have lots of little subroutines that loop a lot: the routine executes at full COG speed, but you have the loading overhead.
Hope this helps.

Actually FCACHE loads the code into COG memory, so it's similar to your overlay loader but done automatically by the compiler for loops. So for example code like:

#include <propeller.h>

#define pin 1
#define delay 1000

void shift_out(unsigned val)
{
    unsigned tim, i;
    tim = _CNT + delay;
    for (i = 0; i < 32; i++) {
        if (val & 1) {
            OUTA |= (1<<pin);
        } else {
            OUTA &= ~(1<<pin);
        }
        val >>= 1;
        waitcnt(tim);
        tim += delay;
    }
}

gets turned by the compiler into something like:

_shift_out
        mov     COUNT_, #3
        call    #pushregs_
        mov     local01, arg01
        mov     local02, cnt
        add     local02, imm_1000_
        mov     local03, #32
        call    #LMM_FCACHE_LOAD
        long    (@@@LR__0002-@@@LR__0001)
LR__0001
        shr     local01, #1 wc
 if_c   or      outa, #2
 if_nc  andn    outa, #2
        mov     arg01, local02
        waitcnt arg01, #0
        add     local02, imm_1000_
        djnz    local03, #LMM_FCACHE_START + (LR__0001 - LR__0001)
LR__0002
        mov     sp, fp
        call    #popregs_
_shift_out_ret
        call    #LMM_RET

(I'm showing fastspin-generated code here rather than PropGCC output, but the principle is the same.) The LMM_FCACHE_LOAD function copies a block of instructions from HUB into COG memory, starting at LMM_FCACHE_START; the size of the block is given by the long that follows the call. If the correct instructions are already in COG then it returns immediately. Once in COG memory the loop runs at full speed, avoiding the 4x or 8x LMM penalty.
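The "returns immediately if already loaded" behaviour can be modelled on a host machine with a few lines of C. This is only a simplified sketch of the idea, not the actual LMM kernel code (which is written in PASM): the cache size, the names, and the -1 return for an oversized block are all assumptions made for illustration. In practice the compiler decides at build time whether a loop fits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FCACHE_LONGS 64  /* hypothetical cache size, in 32-bit longs */

static uint32_t fcache[FCACHE_LONGS];      /* stands in for the COG cache area */
static const uint32_t *fcache_src = NULL;  /* HUB block currently cached */
static size_t fcache_len = 0;              /* number of longs currently cached */

/* Returns 1 if a copy was performed, 0 on a cache hit, -1 if the block
 * does not fit (an assumption of this model, see the note above). */
int fcache_load(const uint32_t *hub_code, size_t longs)
{
    if (longs > FCACHE_LONGS)
        return -1;
    if (hub_code == fcache_src && longs == fcache_len)
        return 0;                          /* already in the cache: no copy */
    memcpy(fcache, hub_code, longs * sizeof fcache[0]);
    fcache_src = hub_code;
    fcache_len = longs;
    return 1;
}
```

Calling fcache_load twice in a row with the same block copies only the first time, which is why a function that is called repeatedly with the same cached loop pays the load cost once.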

FCACHE works in both LMM and XMM modes, but in XMM mode there's a lot less space free in COG memory so only small loops will fit into the FCACHE.

ersmith · 2019-11-05 20:46

escher wrote: »

I noticed on this propellergcc documentation page it's stated that:

Select code can run as LMM even in XMM-type memory modes.

What actually differentiates "select code"? I am extremely interested in both the large memory model speed as well as the eXternal memory model size, and would like to take advantage of both. My use-case is arcade game code that could be both a) complex requiring fast execution, and b) extensive requiring external storage.

You can put an __attribute__((section(".hubtext"))) on a function to declare that it should go in HUB memory instead of external memory; similarly there's __attribute__((section(".hubdata"))) for data. There are HUBTEXT and HUBDATA defines in propeller.h to save you a bit of typing. Obviously, if you overuse these directives you will run out of hub space.
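As a rough sketch of how that might look in game code: the function and variable names below are hypothetical, and the __PROPELLER__ check plus the empty fallback defines are assumptions added only so the example also compiles with a host compiler. On the Propeller, HUBTEXT and HUBDATA come from propeller.h as described above.

```c
#if defined(__PROPELLER__)   /* assumed to be predefined by PropGCC */
#include <propeller.h>       /* provides HUBTEXT and HUBDATA */
#endif
#ifndef HUBTEXT
#define HUBTEXT              /* host fallback: no special section */
#endif
#ifndef HUBDATA
#define HUBDATA              /* host fallback: no special section */
#endif

/* Hot data: requested to live in HUB RAM rather than external memory. */
static HUBDATA int sprite_x[4] = {10, 20, 30, 40};

/* Hot path: requested to live in HUB RAM so it runs at LMM speed. */
HUBTEXT int move_sprites(int dx)
{
    int i, sum = 0;
    for (i = 0; i < 4; i++) {
        sprite_x[i] += dx;   /* shift every sprite by dx */
        sum += sprite_x[i];
    }
    return sum;              /* sum of the new x positions */
}

/* Cold path: no attribute, so an XMM build leaves it in external memory. */
int title_screen_items(void)
{
    return 3;
}
```

The idea is to mark only the per-frame inner routines this way and leave menus, level setup and similar code in external memory, since hub space is limited.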
