FlexSpin: is mixed HUB/LUT exec possible? (again)

ManAtWork · 2024-01-14 18:37

The subject is almost identical to this thread, but the application and requirements are different this time, so I have started a new thread.

I'm about to implement the EtherCAT master, which will be strongly connected to the Ethernet PHY driver. The driver already uses two cogs, so to avoid wasting too many cogs I plan to use the sender cog of the Ethernet driver to also do most of the protocol handling for EtherCAT. The receiver cog needs to be ready for an incoming packet at any time. Sending and receiving of packets overlap, so we need full duplex and at least two cogs.

Bit-banging the RMII protocol requires maximum performance and needs to be coded in assembler. Parsing the command queue and building the Ethernet frames has to deal with a lot of data structures, containers and buffers. So coding this in C would be most convenient.

So my plan is to place the sender part of the Ethernet driver permanently in LUT RAM and call it from the C code. I also need some cog RAM/registers for scratch space and at least one parameter to pass to the assembler code. This time I don't need interrupts and the assembler code doesn't use the CORDIC unit.

Is this possible? I think I have to disable the FCACHE feature or at least limit it to the LUT space I do not use.

evanh · 2024-01-14 19:40

There is a nice attribute on function definitions for fixed assignment of a function to either cogRAM __attribute__((cog)) or lutRAM __attribute__((lut)). I assume it takes away from the available Fcache space but with sparing use it should be a solution. Funnily, it would be the perfect way to effect an ISR in C too.

eg:

uint32_t  pinwait( const unsigned pingroup, const uint32_t mode, int settling ) __attribute__((cog))
{
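    // pingroup: base pin in bits 5..0, number of additional pins in the bits above
    const unsigned  basepin = pingroup & 0x3f, pins = basepin + (pingroup >> 6) + 1;    // pins = one past the last pin
    unsigned  pin;
    uint32_t  diff1 = 0, diff2 = 0, sample;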

    _rdpin( pingroup );
    _wrpin( pingroup, mode );

    do {
        while( !_pinr( basepin ) );    // wait for next decimation

        for( pin = basepin; pin < pins; pin++ )
        {
            sample = _rdpin( pin ) << 5;    // 27-bit to 32-bit translation for clean rotary summing
            sample -= diff1;
            diff1 += sample;
            sample -= diff2;
            diff2 += sample;
        }
    } while( --settling );

    return sample >> 10; // sinc3 x 7 (128 period) + 5 (translation) - 10 = 16-bit samples
}

ManAtWork · 2024-01-15 09:15

Ah, that's very interesting. But it doesn't help in my case. I need the C code to stay in hub RAM and the assembler code in LUT RAM, as it normally does. I'm just looking for a way to protect my assembler code from being overwritten by the FCACHE feature. And I'm looking for the documentation which hopefully grants some free cog registers which are not used by compiled code.

Anyway... I think the best method is to start coding and don't try to optimize too much at the beginning. I'll simply use 3 cogs. That makes everything a lot easier because I don't need to modify the ASM part each time I change the data structures, which would be error-prone. It also doesn't restrict the use of DEBUG or printf() statements, which is handy during development.

When I'm done I'll know exactly how large the C and assembler code is and how many registers I need. Then I can decide which option I choose:
1. Translate the C code manually to assembler so that everything (the sender PHY and the protocol part) runs homogeneously together in one cog without any tricks. This is a lot more work but would give the best performance. If there is not enough LUT/COG RAM I could also place the less time-critical parts in hub RAM.
2. Try to use mixed hub and LUT execution. Advantage: more readable and maintainable code. Disadvantage: not portable to other compilers, restricted use of DEBUG/printf().

evanh · 2024-01-15 10:06

Well, a C wrapped inline pasm will work that way. Ie:

int  pinwait( int parameter ) __attribute__((cog))
{
    int  result;

    __asm {
        ... insert pasm code here ...
    }
    return result;
}

There might be a second way to do the above attributing to pure pasm too. There is supposed to be a way to compile a pure pasm section that was originally for the purpose of loading into its own cog. But it could possibly be repurposed for this too.

@ManAtWork said:
Anyway... I think the best method is to start coding and don't try to optimize too much at the beginning. I'll simply use 3 cogs.

... When I'm done I'll know exactly how large the C and assembler code is and how many registers I need. Then I can decide which option I choose:

Yep, that sounds wise.

ManAtWork · 2024-01-15 10:29

Yes I know, inline pasm is loaded to LUT RAM. This is perfectly OK for short code snippets. But I just want to avoid that it's re-loaded each time I call the driver because the code is relatively long and I call it quite often. Instead, it should stay there.

Wuerfel_21 · 2024-01-15 10:33

Inline ASM is loaded into low cog ram, LUT is left alone.

evanh · 2024-01-15 11:58

No, when it's attributed like I've done, the pasm is permanently loaded in cogRAM or lutRAM as specified. It is not Fcached at all then.

You can even make it an __asm const {} and it still gets placed in the specified RAM. It's just left unoptimised then.

ersmith · 2024-01-15 13:57

Functions marked __attribute__((cog)) or __attribute__((lut)) are permanently kept in the respective local memories.

ManAtWork · 2024-01-15 20:37

Ah, thanks Eric. Now that you've mentioned it I found it right away in the General doc. Chapter "Functions in COG or LUT memory":

Normally functions are placed in HUB memory, because there is a lot more of that. However, it is possible to force some functions to be placed in the chip’s internal memory, where they will execute much more quickly.
...
To put a method into COG memory, place a special comment {++cog} after the PUB or PRI declaration of the method. (...) Similarly, to put the method into LUT memory (on the P2 only, obviously) then use the comment {++lut}.

and Chapter "Register usage - P2":

Most of COG RAM is used by the compiler, except that $1e0-$1ef are left free for application use. COG RAM from $00 to $ff is used for FCACHE, and so when you are sure no FCACHE is in use you may use this for scratch. The first 16 registers of LUT memory ($200 to $20f) are always left free. The remainder of the first half of LUT ($210 to $2ff) are used for functions that the user explicitly asks to put in LUT. If no such functions exist they are free for user use.
The second half of LUT memory (from $300 to $3ff) may be used by compiler internal functions, and so should not be used by assembly code.
ptra is used for the stack pointer. Applications should avoid using it. pa is used internally for fcache loading. Applications may use it as a temporary variable, but be aware that any code execution which may trigger an fcache load (e.g. any loop or subroutine call) may trash its value.
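As a rough aid for reading the quoted register map, the ranges can be written down as a small lookup in plain C. This is only a sketch: the enum and function names are made up for illustration, the ranges are copied from the quoted text, and everything the text does not call out as free is lumped into one "reserved" bucket.

```c
#include <stdint.h>

/* Regions of the cog/LUT address space, as described in the quoted
   "Register usage - P2" text. All names here are hypothetical. */
typedef enum {
    REGION_FCACHE,     /* $000-$0ff: FCACHE; scratch only when no FCACHE is in use */
    REGION_APP_COG,    /* $1e0-$1ef: left free for application use */
    REGION_APP_LUT,    /* $200-$20f: always left free */
    REGION_LUT_FUNCS,  /* $210-$2ff: functions explicitly placed in LUT */
    REGION_RESERVED,   /* rest of cog RAM and $300-$3ff: treat as off limits */
    REGION_INVALID     /* outside the $000-$3ff cog/LUT range */
} region_t;

/* Classify a cog/LUT register address according to the table above. */
region_t classify_addr(uint32_t addr)
{
    if (addr <= 0x0ff) return REGION_FCACHE;
    if (addr >= 0x1e0 && addr <= 0x1ef) return REGION_APP_COG;
    if (addr <= 0x1ff) return REGION_RESERVED;
    if (addr <= 0x20f) return REGION_APP_LUT;
    if (addr <= 0x2ff) return REGION_LUT_FUNCS;
    if (addr <= 0x3ff) return REGION_RESERVED;
    return REGION_INVALID;
}
```

Read this way, scratch registers for a LUT-resident driver would come from $1e0-$1ef or $200-$20f, and the driver code itself would have to fit into $210-$2ff.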

So I think that's all I need. So if I put my assembler code inside a __asm { } block and that in turn inside a function marked as __attribute__((lut)) it should be placed permanently in LUT RAM. I think I could even use function arguments and local variables to pass parameters and return values, right?
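A minimal sketch of that pattern might look like the following. The function tx_round_up4() is purely hypothetical and stands in for the real sender code, which is not shown in this thread. The attribute placement is copied from the examples above, and the sketch assumes that FlexC defines the __FLEXC__ macro and lets inline assembly refer to arguments and locals by name. The #else branch is only there so the same logic can be checked on a host compiler.

```c
#include <stdint.h>

/* Hypothetical helper: round a frame length up to a multiple of 4 bytes. */
#ifdef __FLEXC__   /* assumption: defined by FlexC */
uint32_t tx_round_up4(uint32_t len) __attribute__((lut))
{
    uint32_t r;             /* local variable that carries the result */

    __asm {
        mov   r, len        /* argument referenced by name */
        add   r, #3
        andn  r, #3         /* clear the two low bits */
    }
    return r;
}
#else
/* Host fallback with the same behaviour, for testing without a P2. */
uint32_t tx_round_up4(uint32_t len)
{
    return (len + 3u) & ~3u;
}
#endif
```

The C code in hub RAM would then call tx_round_up4(n) like any other function, with the argument and return value passed the usual way.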

evanh · 2024-01-15 21:35

@ManAtWork said:
... I could even use function arguments and local variables to pass parameters and return values, right?

Yes.
