LLVM Backend for Propeller 2 - Page 12 — Parallax Forums

LLVM Backend for Propeller 2


Comments

  • Rayman Posts: 14,553

    I was assuming this P2 Clang can also compile C files, but don't actually know that for sure...
    the compile line looks like this:
    clang -Os -ffunction-sections -fdata-sections -fno-jump-tables --target=p2 -Dprintf=__simple_printf -DP2_TARGET_MHZ=160 -o ack_core_lifecycle.o -c ./src/ack_core_lifecycle.c

    Is that OK?

  • That error is a result of some LLVM intrinsic operation not being implemented. There are WAY too many for me to be able to test all and many are not even needed for P2, so I haven't hit this one before. Can you post that source file and compile command, so I can add in the required feature and test?

    Yes, clang can compile C or C++. It's fairly parallel to gcc in terms of functionality and can often be used as a transparent replacement. The above command is perfectly valid.

  • Rayman Posts: 14,553

    I think Amazon would come after me if I posted their code for ACK...
    Guess that's a problem...

    Funny, when I google searched for "ack_core_lifecycle.c", all I get is my post in this forum.

  • Ah... it looks like it's trying to load a 1 bit value from a global address and then zero extend it, which I don't think I have support for. I will try to fix it open-loop and let you test it. Hopefully it won't take too many iterations...

  • Rayman Posts: 14,553

    Thanks, if you're willing to fix, I'm willing to test!
    There are 61 more C files in this source though...

  • I was able to recreate the issue with the following LLVM IR snippet, so now I should be able to fix and test it. I should have something shortly.

    @val = external global i1
    
    define i32 @main() {
        %1 = load i1, i1* @val
        %2 = zext i1 %1 to i32
        ret i32 %2
    }
    
  • @Rayman please pull and rebuild p2llvm, should be fixed.

  • Rayman Posts: 14,553

    Taking forever to compile... I should probably set this up on a faster computer... Wonder if more cores would help Visual Studio compile this faster...

  • Not sure about Visual Studio, but I would also look into making the build incremental. That's what I do, and it takes only 10-15 s on my Mac mini to compile in this change. I'm not sure how that needs to be set up in Visual Studio on Windows, though.

  • Rayman Posts: 14,553
    edited 2022-03-05 23:52

    Do I just need to copy over the new "bin" folder?

    Did that, but still get an error:

    1>EXEC : fatal error : error in backend: Cannot select: 0x1a2459a28e8: i32,ch = load<(dereferenceable load (s8) from @sg_ackModuleInitiatedFactoryReset), zext from i1> 0x1a2458f60c8, 0x1a24599c7b8, undef:i32
    1>  0x1a24599c7b8: i32 = P2GAWRAPPER TargetGlobalAddress:i32<i1* @sg_ackModuleInitiatedFactoryReset> 0
    1>    0x1a245992098: i32 = TargetGlobalAddress<i1* @sg_ackModuleInitiatedFactoryReset> 0
    1>  0x1a24599ab10: i32 = undef
    

    Looks similar, but not exactly the same...

  • Rayman Posts: 14,553
    edited 2022-03-06 00:03

    Looks like this is the line giving it trouble:
    static bool sg_ackModuleInitiatedFactoryReset = false;

    If I change it to "uint32_t", without the "static" the problem shifts to a different variable, so I guess that fixes it.
    Is it trying to compact the bools into a single bit to save memory?
    That's what it sounds like...
    Seems like "static" has something to do with it too...

    Actually, just removing the "static", leaving the "bool", makes the error move on to something else...

  • Looks like it's the same error.

    Yeah the frontend is trying to make bools 1 bit values in LLVM, which are not handled perfectly by the backend since P2 doesn't have any 1 bit data types, so I need to wrap things with sign or zero extensions. Static is likely messing with it because it's turning it into a global variable where it might otherwise be optimizing something out.

    I'll keep playing with it to see if I can recreate it again.
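
    As a rough illustration of the pattern described above, here is a minimal C sketch. The names `sg_flag`, `set_flag`, and `read_flag` are hypothetical stand-ins, not taken from the ACK source. Returning a `bool` global as an `int` is the kind of code that makes the frontend load the flag and zero-extend it; whether a given p2llvm build hits the "Cannot select" error on this exact snippet is an assumption.

```c
#include <stdbool.h>

/* Hypothetical flag with the same shape as the variable named in the error. */
static bool sg_flag = false;

/* Writes the flag so the compiler cannot treat it as a constant. */
void set_flag(void)
{
    sg_flag = true;
}

/* Reading a bool and returning it as an int forces a load followed by a
   widening (zero extension) of the 1-bit value to 32 bits. */
int read_flag(void)
{
    return sg_flag;
}
```

    Changing the variable's type to `uint32_t`, as tried earlier in the thread, sidesteps the widening because the value is already a full 32-bit word.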

  • Was trying to get this code to run but it crashes the P2.

    void SendSD(int drive, const BYTE* buff, unsigned int bc)
    {
        int i = 0;
        const BYTE *b;
        int x = 0;
        int mosi = (Pins[drive] >> 16) & 0xff;
        int clk = (Pins[drive] >> 8) & 0xff;
        int j = 0;
        int t = bc;
    
        b = buff;
        asm volatile ("rdfast #0, %[b]\n"
                      "mov %[i], %[t]\n"
                 ".Li: mov %[j], #8\n"
                      "rfbyte %[x]\n"
                      "shl %[x], #24\n"
                 ".Lj: shl %[x], #1 wc\n"
                 "if_c drvh %[mosi]\n"
                "if_nc drvl %[mosi]\n"
                      "drvh %[clk]\n"
                      "nop\n"
                      "drvl %[clk]\n"
                      "djnz %[j], #.Lj\n"
                      "djnz %[i], #.Li\n"
                      :[i]"+r"(i), [j]"+r"(j), [x]"+r"(x), [b]"+r"(b)
                      :[mosi]"r"(mosi), [clk]"r"(clk), [t]"r"(t));
    
    }
    

    If I remove the RDFAST and RFBYTE and just use RDBYTE %[x], %[b], it works fine.

    Also could use "outc" instruction.

    Mike

  • evanh Posts: 15,848
    edited 2022-04-01 18:08

    RDFAST requires cogexec. It'll crash any hubexec. My inline code relies on Fcaching putting it in cogRAM or lutRAM. ie: Disabling the optimiser with -O0 will break my code.

    PS: And it's not just the use of the FIFO that I need cogexec for either. The SPI clock lead timing heavily relies on the precise instruction relationship within the inner loop. Further reading - https://forums.parallax.com/discussion/comment/1537246/#Comment_1537246

  • Yeah, that’s correct. You’ll need to run that function in cog mode to use the streamer. I don’t have good examples ready of that, but I’ve done it several times and it works well. The best way would be to have a function caching mechanism to move that function to the cog (all that space is unused except for GP registers), but I haven’t yet figured out how to make that work.

    I’m going to be out of the country for the next week but can come up with an example for you. Basically you just need to start the function in a new cog using coginit with the appropriate mode and provide it a stack.

  • @evanh ,

    The instruction page says: Begin new fast hub read via FIFO. D[31] = no wait, D[13:0] = block size in 64-byte units (0 = max), S[19:0] = block start address.

    Nowhere does it say the program is in cog memory or that cog memory is used.

    Mike

  • evanh Posts: 15,848
    edited 2022-04-01 18:19

    Hehe, yes, that could be clearer. FIFO is one piece of hardware that can be purposed for three different uses. Either it feeds the Streamer, or it feeds hubexec, or it feeds RFxxxx/WFxxxx.

    So, what happens is a RDFAST will redirect the FIFO while hubexec is fetching instructions from the FIFO. Hubexec just executes whatever data the redirection grabs. To the FIFO, hubexec is a series of hidden RFLONG. On that note, the Streamer is also a series of hidden RFxxxx/WFxxxx.

  • @evanh ,

    Ok, I read the secondary documentation and it basically says the FIFO is being used to cache instructions and can't be used for other things.

    So flexspin is seeing the RDFAST and stuffing the code in cog memory?

    Mike

  • evanh Posts: 15,848

    @iseries said:
    So flexspin is seeing the RDFAST and stuffing the code in cog memory?

    No, the optimiser is seeing __asm {} and knows to do that. Which is why disabling the optimiser will kill my code.

  • @evanh ,

    So if I have 4k of assembly code it won't work either?

    Mike

  • evanh Posts: 15,848

    Lol, get your own cog!

  • evanh Posts: 15,848

    Oh, there is also the const qualifier for __asm that will force the pasm to stay in hubRAM, truly inline.

  • I think the optimizer has a problem.

    This code does not work. It confuses r0 with r2 or something.

    #include <propeller.h>
    #include <stdio.h>
    #include <string.h>
    
    int getSec(void);
    
    int i,j;
    int b;
    unsigned int *d;
    
    int main(int argc, char** argv)
    {
        unsigned char *b1, *b2;
    
        i = getSec();
        waitms(5000);
        j = getSec();
        b = j - i;
        printf("Start: %d, End: %d, Time: %ds\n", i, j, b);
    
        printf("Done\n");
    
        while (1)
        {
            waitms(500);
        }
    }
    
    int getSec()
    {
        unsigned int upper, lower, results;
        unsigned int freq;
    
        freq = _clkfreq;
    
        asm volatile ("getct %[u] wc\n"
             "getct %[l]\n"
             "setq %[u]\n"
             "qdiv %[l], %[f]\n"
             "getqx %[r]\n"
             :[u]"+r"(upper), [l]"+r"(lower), [r]"+r"(results)
             :[f]"r"(freq));
    
        return results;
    }
    

    Mike
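
    For reference, the inline assembly in getSec() above appears intended to divide the 64-bit system counter (upper and lower halves from the two GETCT instructions) by the clock frequency. A host-side C sketch of that arithmetic, assuming that reading of the code, looks like this. The function name `ticks_to_seconds` is hypothetical, and the sketch models only the division, not the register allocation problem being reported.

```c
#include <stdint.h>

/* Host-side model of the division the inline assembly sets up:
   {upper:lower} / freq, keeping the low 32 bits of the quotient.
   freq must be nonzero. */
uint32_t ticks_to_seconds(uint32_t upper, uint32_t lower, uint32_t freq)
{
    uint64_t ticks = ((uint64_t)upper << 32) | lower;
    return (uint32_t)(ticks / freq);
}
```

    With a 160 MHz clock, 800,000,000 ticks corresponds to 5 seconds, which is a handy value to compare against when checking the output on hardware.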

  • Here is another example of assembly code gone wrong:

    #include <propeller.h>
    #include <stdio.h>
    #include <string.h>
    
    int getSec(void);
    void SendSDX(int, const BYTE* , unsigned int);
    
    unsigned int Pinx[2];
    unsigned char buff[100];
    int i,j;
    int b;
    unsigned int *d;
    
    int main(int argc, char** argv)
    {
        unsigned char *b1, *b2;
    
        i = getSec();
        wait(5000);
        j = getSec();
        b = j - i;
        printf("Start: %d, End: %d, Time: %ds\n", i, j, b);
    
        SendSDX(0, buff, 1);
    
        printf("Done\n");
    
        while (1)
        {
            wait(500);
        }
    }
    
    int getSec()
    {
        unsigned int upper, lower, results;
        unsigned int freq;
    
        freq = _clkfreq;
    
        asm volatile ("getct %[u] wc\n"
             "getct %[l]\n"
             "setq %[u]\n"
             "qdiv %[l], %[f]\n"
             "getqx %[r]\n"
             :[u]"+r"(upper), [l]"+r"(lower), [r]"+r"(results)
             :[f]"r"(freq));
    
        return results;
    }
    
    __attribute__ ((section ("lut"), cogtext, no_builtin("memset")))
    void SendSDX(int drive, const BYTE* buff, unsigned int bc)
    {
        int i;
        unsigned int x;
        int mosi = (Pinx[drive] >> 16) & 0xff;
        int clk = (Pinx[drive] >> 8) & 0xff;
        int j;
    
        asm volatile (
                "rdfast #0, %[b]\n"
                "mov   %[i], %[t]\n"
        ".Lxi:   mov   %[j], #8\n"
                "rfbyte %[x]\n"
                "shl %[x], #24\n"
        ".Lxj:   shl %[x], #1  wc\n"
        "if_c    drvh %[mosi]\n"
        "if_nc   drvl %[mosi]\n"
                "drvh %[clk]\n"
                "nop\n"
                "drvl %[clk]\n"
                "nop\n"
                "djnz %[j], #.Lxj\n"
                "djnz %[i], #.Lxi\n"
                :[i]"+r"(i), [j]"+r"(j), [x]"+r"(x), [b]"+r"(buff)
                :[mosi]"r"(mosi), [clk]"r"(clk), [t]"r"(bc));
    }
    

    Disassembled the above function:

    00000200 <SendSDX>:
         200: 28 06 64 fd            setq #3
         204: 61 a1 67 fc            wrlong r0, ptra++
         208: 02 a0 67 f0            shl r0, #2 <drive int index>
         20c: 1f 00 00 ff            augs #31
         210: 90 a7 07 f6            mov r3, #400   <base Pinx>
         214: d0 a7 03 f1            add r3, r0 
         218: d3 a1 03 fb            rdlong r0, r3  <drive wiped out for Pins>
         21c: d0 a7 03 f6            mov r3, r0 <mosi>
         220: 10 a6 47 f0            shr r3, #16    <mosi>
         224: ff a6 07 f5            and r3, #255   <mosi>
         228: 08 a0 47 f0            shr r0, #8 <clk>
         22c: ff a0 07 f5            and r0, #255   <clk>
         230: d1 01 78 fc            rdfast #0, r1  <buff>
         234: d2 a1 03 f6            mov r0, r2 <Pins wiped out for i>
         238: 08 a4 07 f6            mov r2, #8 <j>
         23c: 10 a6 63 fd            rfbyte r3  <mosi wiped out for x>
         240: 18 a6 67 f0            shl r3, #24    <x>
         244: 01 a6 77 f0            shl r3, #1 wc <x>
         248: 59 a6 63 cd       if_c     drvh r3    <x should be mosi>
         24c: 58 a6 63 3d       if_nc    drvl r3    <x should be mosi>
         250: 59 a0 63 fd            drvh r0    <i should be clk>
         254: 00 00 00 00   nop
         258: 58 a0 63 fd            drvl r0    <i should be clk>
         25c: 00 00 00 00   nop
         260: f8 a5 6f fb            djnz r2, #-8   <j>
         264: f4 a1 6f fb            djnz r0, #-12  <i>
         268: 28 06 64 fd            setq #3
         26c: 5f a1 07 fb            rdlong r0, --ptra  
         270: 2e 00 64 fd            reta   
    

    Instead of allocating new registers for i, j, and x it is wiping out existing ones.

    Mike

  • @iseries I got back from vacation yesterday so I'll start wrapping my head around this and see what the issues are. First and foremost though, there's currently no support for caching arbitrary functions into the LUT or COG RAM space (I'm working through a few different ideas for how to implement it but it will take me some time), so __attribute__ ((section ("lut"), cogtext, no_builtin("memset"))) is only useful for functions the compiler expects to be in the LUT RAM (a specific list of runtime library functions it might generate calls for, like 64 bit or signed division functions, for example). For an arbitrary function, it will load that function into the LUT on cog boot, but it won't actually call it, it will just call the hub address where the function was originally stored.

    My immediate guess for why new registers aren't being allocated is that it doesn't see i/j/x being used anywhere else, so it assumes they are not needed and just re-uses registers. If you need registers in inline assembly, I would actually use r0, r1, etc., and add them to the clobber list in the asm statement, and then the compiler will correctly handle allocating registers outside the statement as well. Maybe marking the variables i/j/x as volatile would work too, but I'm not sure; I'd need to try it.

    For caching functions, I'm thinking to have the compiler generate the call to an intermediate function (that should be inlined), place the address of the real function to call into PB, then have that function actually load the cached function from PB (if not already loaded) into COG ram, and then actually call it. This will allow for a relatively dynamic use of COG ram across different cogs. I'll probably limit the cache to just 1 function that takes up the full cog ram for simplicity, otherwise I'd need to track where functions are located, how often they've been used, etc, and build a full caching mechanism that will probably remove any performance gain of a cached function.
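
    The single-slot caching idea described above can be sketched on a host machine as follows. This is only a model under stated assumptions: `call_cached`, `cache_loads`, `twice`, and `plus_one` are hypothetical names, the "load" is just a pointer assignment standing in for copying code into COG RAM, and nothing here reflects how p2llvm actually implements (or will implement) the feature.

```c
#include <stddef.h>

typedef int (*cached_fn)(int);

/* Hypothetical single-slot cache state: which function is "resident". */
static cached_fn resident = NULL;
static int load_count = 0;

/* Trampoline: load the target if it is not already resident, then call it. */
int call_cached(cached_fn target, int arg)
{
    if (resident != target) {
        resident = target;   /* stands in for copying the function into COG RAM */
        load_count++;
    }
    return resident(arg);
}

/* Number of times a function had to be (re)loaded into the slot. */
int cache_loads(void)
{
    return load_count;
}

/* Two example functions to exercise the cache. */
int twice(int x) { return 2 * x; }
int plus_one(int x) { return x + 1; }
```

    Calling the same function repeatedly costs one load, while alternating between two functions reloads on every call, which is the trade-off of keeping only one slot instead of tracking usage.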

  • The first example you gave appears to be solved by marking the operands you read and write as early-clobber operands: [u]"+&r"(upper) instead of [u]"+r"(upper). See https://stackoverflow.com/questions/15819794/when-to-use-earlyclobber-constraint-in-extended-gcc-inline-assembly.

    I haven't tested it because I don't have a P2 board nearby, but the disassembly looks correct. I'm guessing that will also solve the issues with the second one, but I'll need to actually try it.

  • iseries Posts: 1,490
    edited 2022-04-12 12:36

    This early clobber stuff is giving me a headache.

    Using a register and then adding them to the list leads to hard to read code, might as well use a variable.

    If I use the early clobber syntax the compiler generates bad code.

    void SendSD(int drive, const BYTE* buff, unsigned int bc)
    {
        int mosi = (Pins[drive] >> 16) & 0xff;
        int clk = (Pins[drive] >> 8) & 0xff;
        int i,j,x;
    
        asm volatile ("mov %[i], %[bc]\n"
                 ".Li: mov %[j], #8\n"
                      "rdbyte %[x], %[b]\n"
                      "shl %[x], #24\n"
                 ".Lj: shl %[x], #1 wc\n"
                 "if_c drvh %[mosi]\n"
                "if_nc drvl %[mosi]\n"
                      "drvh %[clk]\n"
                      "waitx #2\n"
                      "drvl %[clk]\n"
                      "waitx #2\n"
                      "djnz %[j], #.Lj\n"
                      "add %[b], #1\n"
                      "djnz %[i], #.Li\n"
                      :[b]"+r"(buff), [i]"+&r"(i), [j]"+&r"(j), [x]"+&r"(x)
                      :[mosi]"r"(mosi), [clk]"r"(clk), [bc]"r"(bc));
    
    }
    

    This generated code fails to save the registers and thus contaminates them.

    000074c4 <SendSD>:
        74c4: 28 02 64 fd            setq #1
        74c8: 61 a1 67 fc            wrlong r0, ptra++
        74cc: 28 06 64 fd            setq #3
        74d0: 61 a7 67 fc            wrlong r3, ptra++
        74d4: 02 a0 67 f0            shl r0, #2 
        74d8: 50 02 00 ff            augs #592
        74dc: 6c a7 07 f6            mov r3, #364   
        74e0: d0 a7 03 f1            add r3, r0 
        74e4: d3 a1 03 fb            rdlong r0, r3  
        74e8: d0 a7 03 f6            mov r3, r0 
        74ec: 10 a6 47 f0            shr r3, #16    
        74f0: ff a6 07 f5            and r3, #255   
        74f4: 08 a0 47 f0            shr r0, #8 
        74f8: ff a0 07 f5            and r0, #255   
        74fc: d2 a9 03 f6            mov r4, r2 
        7500: 08 aa 07 f6            mov r5, #8 
        7504: d1 ad c3 fa            rdbyte r6, r1  
        7508: 18 ac 67 f0            shl r6, #24    
        750c: 01 ac 77 f0            shl r6, #1 wc
        7510: 59 a6 63 cd       if_c   drvh r3  
        7514: 58 a6 63 3d       if_nc drvl r3   
        7518: 59 a0 63 fd            drvh r0    
        751c: 1f 04 64 fd            waitx #2   
        7520: 58 a0 63 fd            drvl r0    
        7524: 1f 04 64 fd            waitx #2   
        7528: f8 ab 6f fb            djnz r5, #-8
        752c: 01 a2 07 f1            add r1, #1 
        7530: f3 a9 6f fb            djnz r4, #-13
        7534: 28 06 64 fd            setq #3
        7538: 5f a7 07 fb            rdlong r3, --ptra  
        753c: 28 02 64 fd            setq #1
        7540: 5f a1 07 fb            rdlong r0, --ptra  
        7544: 2e 00 64 fd            reta   
    

    I was trying to use the same assembly that I used in FlexC, but that's not working. The way that FlexC passes parameters is different from clang.

    Mike

  • With early clobber, you basically just need to follow the rule that if an output operand gets written to before ALL input operands have been read for the last time (meaning the remaining instructions only perform writes), then that operand needs to be marked as early-clobber. That's how I understand it, anyway; I might be wrong on some detail.

    Which registers aren't getting saved/getting overwritten? I can't actually test this example since I don't have the supporting code, but from the disassembly it looks okay.
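
    To make the early-clobber rule above concrete, here is a small host-side C model rather than real inline assembly. It assumes a toy register file and a hypothetical two-instruction sequence (`mov out, a` then `add out, b`). If the allocator is allowed to put the output in the same register as input `b`, the early write destroys `b` before it is read, which is the hazard the `&` modifier prevents.

```c
/* Toy register file; the indices play the role of r0, r1, ... */
static int regs[4];

/* Models the sequence
       mov out, a
       add out, b
   where the output is written by the first instruction,
   before input b has been read. */
int mov_then_add(int out, int a, int b, int a_val, int b_val)
{
    regs[a] = a_val;
    regs[b] = b_val;
    regs[out] = regs[a];    /* mov out, a : early write to the output */
    regs[out] += regs[b];   /* add out, b : b is read only now */
    return regs[out];
}
```

    With distinct registers, `mov_then_add(2, 0, 1, 3, 4)` gives the intended sum; with the output sharing a register with `b`, `mov_then_add(1, 0, 1, 3, 4)` adds the clobbered value to itself instead.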

  • The assembly is only saving 4 registers when there are 7 registers being used. So R4, R5, R6 are not saved if you look at the assembly code above.

    Anyway, I went back to marking all of them as clobber so I could get correct assembly code. I still need to fix getSec(), as this function does not work currently.

    Mike

  • The following snippet is saving things correctly:

        74c4: 28 02 64 fd            setq #1
        74c8: 61 a1 67 fc            wrlong r0, ptra++
        74cc: 28 06 64 fd            setq #3
        74d0: 61 a7 67 fc            wrlong r3, ptra++
    

    saves r0, r1 with one fast block move, then saves r3, r4, r5, r6 with another block move. r2 is never actually written to so it skips saving it.
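
    The block-move behaviour described above (SETQ #N followed by WRLONG transfers N + 1 consecutive registers) can be modelled on a host with a short C sketch. The function name `block_save` and the array-based stack are assumptions made for illustration only.

```c
#include <stdint.h>

/* Models "setq #n" followed by "wrlong r<first>, ptra++":
   n + 1 consecutive registers are written to the stack, and the
   stack index advances by the same amount. Returns the new index. */
int block_save(const uint32_t *regs, int first, int setq_n,
               uint32_t *stack, int sp)
{
    for (int i = 0; i <= setq_n; i++) {
        stack[sp++] = regs[first + i];
    }
    return sp;
}
```

    Applying it as in the prologue, `block_save(regs, 0, 1, ...)` followed by `block_save(regs, 3, 3, ...)` stores r0, r1 and then r3 through r6, six registers in total, with r2 skipped.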
