Comments
I was assuming this P2 Clang can also compile C files, but don't actually know that for sure...
the compile line looks like this:
clang -Os -ffunction-sections -fdata-sections -fno-jump-tables --target=p2 -Dprintf=__simple_printf -DP2_TARGET_MHZ=160 -o ack_core_lifecycle.o -c ./src/ack_core_lifecycle.c
Is that OK?
That error is a result of some LLVM intrinsic operation not being implemented. There are WAY too many for me to be able to test all and many are not even needed for P2, so I haven't hit this one before. Can you post that source file and compile command, so I can add in the required feature and test?
Yes, clang can compile C or C++. It's fairly parallel to gcc in terms of functionality and can often be used as a transparent replacement. The above command is perfectly valid.
I think Amazon would come after me if I posted their code for ACK...
Guess that's a problem...
Funny, when I google searched for "ack_core_lifecycle.c", all I get is my post in this forum.
Ah... it looks like it's trying to load a 1 bit value from a global address and then zero extend it, which I don't think I have support for. I will try to fix it open-loop and let you test it. Hopefully it won't take too many iterations...
Thanks, if you're willing to fix, I'm willing to test!
There are 61 more C files in this source though...
I was able to recreate the issue with the following LLVM IR snippet, so now I should be able to fix and test. I should have something shortly.
@Rayman please pull and rebuild p2llvm, should be fixed.
Taking forever to compile... I should probably set this up on a faster computer... Wonder if more cores would help Visual Studio compile this faster...
Not sure about Visual Studio, but I would also look into making the build incremental. That's what I do, and it takes only 10-15s on my Mac mini to compile in this change. I'm not sure how that needs to be set up in Visual Studio on Windows, though.
Do I just need to copy over the new "bin" folder?
Did that, but still get an error:
Looks similar, but not exactly the same...
Looks like this is the line giving it trouble:
static bool sg_ackModuleInitiatedFactoryReset = false;
If I change it to "uint32_t" without the "static", the problem shifts to a different variable, so I guess that fixes it.
Is it trying to compact the bools into a single bit to save memory?
That's what it sounds like...
Seems like "static" has something to do with it too...
Actually, just removing the "static", leaving the "bool", makes the error move on to something else...
Looks like it's the same error.
Yeah the frontend is trying to make bools 1 bit values in LLVM, which are not handled perfectly by the backend since P2 doesn't have any 1 bit data types, so I need to wrap things with sign or zero extensions. Static is likely messing with it because it's turning it into a global variable where it might otherwise be optimizing something out.
I'll keep playing with it to see if I can recreate it again.
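For reference, here is a minimal sketch of the pattern being discussed, written as plain C that should build with any host compiler. Only sg_ackModuleInitiatedFactoryReset comes from the thread; the uint32_t flag and the three helper functions are hypothetical names added for illustration. The snippet does not verify the 1-bit load described above, it only shows the two source forms side by side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Form from the thread: a file-scope static bool.
   Per the discussion above, this is the kind of global that ends up
   as a 1-bit value in LLVM and then needs a zero extension. */
static bool sg_ackModuleInitiatedFactoryReset = false;

/* Hypothetical workaround variant: a non-static 32-bit flag. */
uint32_t g_factoryResetFlag = 0;

/* Hypothetical helper: set both flags. */
void mark_factory_reset(void)
{
    sg_ackModuleInitiatedFactoryReset = true;
    g_factoryResetFlag = 1u;
}

/* Reading the bool and widening it to int is where the zero extension happens. */
int factory_reset_pending_bool(void)
{
    return sg_ackModuleInitiatedFactoryReset ? 1 : 0;
}

/* Same query against the 32-bit flag; no 1-bit type is involved. */
int factory_reset_pending_u32(void)
{
    return g_factoryResetFlag != 0u;
}
```

Both accessors return the same answer at the C level; the difference is only in what the frontend hands to the backend.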
Was trying to get this code to run but it crashes the P2.
If I remove the RDFAST and RFBYTE and just use RDBYTE %[x], %[b] it works fine.
Also could use "outc" instruction.
Mike
RDFAST requires cogexec. It'll crash any hubexec. My inline code relies on Fcaching putting it in cogRAM or lutRAM. ie: Disabling the optimiser with -O0 will break my code.
PS: And it's not just the use of the FIFO that I need cogexec for either. The SPI clock lead timing heavily relies on the precise instruction relationship within the inner loop. Further reading - https://forums.parallax.com/discussion/comment/1537246/#Comment_1537246
Yeah that’s correct. You’ll need to run that function in cog mode to use the streamer. I don’t have good examples ready of that but I’ve done it several times and it works well. The best way would be to have a function caching mechanism to move that function to the cog (all that space is unused except for GP registers), but I haven’t yet figured out how to make that work.
I’m going to be out of the country for the next week but can come up with an example for you. Basically you just need to start the function in a new cog using coginit with the appropriate mode and provide it a stack.
@evanh ,
The instruction page says: Begin new fast hub read via FIFO. D[31] = no wait, D[13:0] = block size in 64-byte units (0 = max), S[19:0] = block start address.
Nowhere does it say the program is in cog memory or that cog memory is used.
Mike
Hehe, yes, that could be clearer. FIFO is one piece of hardware that can be purposed for three different uses. Either it feeds the Streamer, or it feeds hubexec, or it feeds RFxxxx/WFxxxx.
So, what happens is a RDFAST will redirect the FIFO while hubexec is fetching instructions from the FIFO. Hubexec just executes whatever data the redirection grabs. To the FIFO, hubexec is a series of hidden RFLONG. On that note, the Streamer is also a series of hidden RFxxxx/WFxxxx.
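The sharing described above can be pictured with a toy model in plain C. This is only a sketch under loud assumptions: the Fifo struct and the fifo_rdfast, fifo_rflong and hubexec_fetch functions are hypothetical names, addresses are in longs rather than bytes, and nothing here is cycle-accurate or taken from the P2 documentation. It only shows why an instruction fetch after a redirect returns data instead of code.

```c
#include <stddef.h>
#include <stdint.h>

#define HUB_LONGS 16

/* Toy model: one FIFO read position shared by every consumer. */
typedef struct {
    const uint32_t *hub; /* simulated hub RAM, addressed in longs */
    size_t next;         /* index of the next long the FIFO will deliver */
} Fifo;

/* Models RDFAST: point the FIFO at a new start address. */
static void fifo_rdfast(Fifo *f, size_t start_long)
{
    f->next = start_long % HUB_LONGS;
}

/* Models RFLONG: take the next long from the FIFO and advance. */
static uint32_t fifo_rflong(Fifo *f)
{
    uint32_t value = f->hub[f->next];
    f->next = (f->next + 1) % HUB_LONGS;
    return value;
}

/* Models a hubexec instruction fetch: a hidden RFLONG on the same FIFO. */
static uint32_t hubexec_fetch(Fifo *f)
{
    return fifo_rflong(f);
}
```

If the first half of hub holds "code" and the second half holds "data", calling fifo_rdfast(&f, 8) between two hubexec_fetch calls makes the next fetch return a data long, which is the crash scenario described above.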
@evanh ,
Ok, I read the secondary documentation and it basically says the FIFO is being used to cache instructions and can't be used for other things.
So flexspin is seeing the RDFAST and stuffing the code in cog memory?
Mike
No, the optimiser is seeing __asm {} and knows to do that. Which is why disabling the optimiser will kill my code.
@evanh ,
So if I have 4k of assembly code it won't work either?
Mike
Lol, get your own cog!
Oh, there is also the const qualifier for __asm that will force the pasm to stay in hubRAM, truly inline.
I think the optimizer has a problem.
This code does not work. It confuses r0 with r2 or something.
Mike
Here is another example of assembly code gone wrong:
Disassembled the above function:
Instead of allocating new registers for i, j, and x it is wiping out existing ones.
Mike
@iseries I got back from vacation yesterday so I'll start wrapping my head around this and see what the issues are. First and foremost though, there's currently no support for caching arbitrary functions into the LUT or COG RAM space (I'm working through a few different ideas for how to implement it but it will take me some time), so __attribute__ ((section ("lut"), cogtext, no_builtin("memset"))) is only useful for functions the compiler expects to be in the LUT RAM (a specific list of runtime library functions it might generate calls for, like 64 bit or signed division functions, for example). For an arbitrary function, it will load that function into the LUT on cog boot, but it won't actually call it, it will just call the hub address where the function was originally stored.
My immediate guess for why new registers aren't being allocated is that it doesn't see i/j/x being used anywhere else, so it assumes they are not needed and just re-uses registers. If you need registers in inline assembly, I would actually use r0, r1, etc, and add them to the clobber list in the asm statement, and then the compiler will correctly handle allocating registers outside the statement as well. Maybe marking the variables i/j/x as volatile would work too, but I'm not sure, I'd need to try it.
For caching functions, I'm thinking to have the compiler generate the call to an intermediate function (that should be inlined), place the address of the real function to call into PB, then have that function actually load the cached function from PB (if not already loaded) into COG ram, and then actually call it. This will allow for a relatively dynamic use of COG ram across different cogs. I'll probably limit the cache to just 1 function that takes up the full cog ram for simplicity, otherwise I'd need to track where functions are located, how often they've been used, etc, and build a full caching mechanism that will probably remove any performance gain of a cached function.
The first example you gave appears to be solved by marking the operands you read and write as early-clobber operands: [u]"+&r"(upper) instead of [u]"+r"(upper). See https://stackoverflow.com/questions/15819794/when-to-use-earlyclobber-constraint-in-extended-gcc-inline-assembly.
I haven't tested it because I don't have a P2 board nearby, but the disassembly looks correct. I'm guessing that will also solve the issues with the second one, but I'll need to actually try it.
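The single-function cache idea from the post above could be sketched on a host machine roughly like this. Everything here is hypothetical: call_cached stands in for the inlined intermediate function, the function pointer argument plays the role of the address placed in PB, and load_into_cog only records which function is "resident" instead of copying anything into COG RAM. It is a sketch of the bookkeeping, not the p2llvm implementation.

```c
#include <stddef.h>

typedef int (*cached_fn)(int);

/* Hypothetical single-slot cache state: the function currently "in COG RAM". */
static cached_fn loaded_fn = NULL;
static unsigned load_count = 0;

/* Stand-in for copying the function body from hub RAM into COG RAM. */
static void load_into_cog(cached_fn fn)
{
    loaded_fn = fn;
    load_count++;
}

/* Intermediate call: load only if the function is not already resident,
   then call it. On real hardware the call would target the COG copy. */
static int call_cached(cached_fn fn, int arg)
{
    if (loaded_fn != fn) {
        load_into_cog(fn);
    }
    return fn(arg);
}

/* Two example functions competing for the single slot. */
static int twice(int x) { return 2 * x; }
static int plus_one(int x) { return x + 1; }
```

Repeated calls to the same function pay the load cost once; alternating between twice and plus_one reloads every time, which is the trade-off of keeping only one slot.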
This early clobber stuff is giving me a headache.
Using registers and then adding them to the clobber list leads to hard-to-read code; I might as well use a variable.
If I use the early clobber syntax the compiler generates bad code.
This generated code fails to save the registers and thus contaminates them.
I was trying to use the same assembly that I used in FlexC but that's not working. The way that FlexC passes parameters is different from clang.
Mike
With early clobber, basically just need to follow the rule that if an input operand gets written to before ALL input operands have been read for the last time (meaning, the only remaining instructions will only perform writes), then the operand needs to be marked as early-clobber. That's how I understand it, anyway, I might be wrong with some detail.
Which registers aren't getting saved/getting overwritten? I can't actually test this example since I don't have the supporting code, but from the disassembly it looks okay.
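A small host-side illustration of the early-clobber rule, assuming GCC or Clang extended asm on x86 (not the P2 backend, which this does not exercise). The function name add_early_clobber is hypothetical. The output is written by the first instruction while input b is still unread, so the output gets the & modifier; without it the compiler would be allowed to put out and b in the same register. On other targets the sketch falls back to plain C.

```c
/* Returns a + b. */
static unsigned add_early_clobber(unsigned a, unsigned b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned out;
    __asm__("movl %1, %0\n\t"  /* out = a: out is written before b is read */
            "addl %2, %0"      /* out += b: b must still hold its value here */
            : "=&r"(out)       /* & = early clobber: never share a register with an input */
            : "r"(a), "r"(b)
            : "cc");           /* addl changes the flags */
    return out;
#else
    /* Fallback for non-x86 hosts: no inline assembly involved. */
    return a + b;
#endif
}
```

A call such as add_early_clobber(3, 4) is expected to give 7 either way; the modifier only matters for which registers the compiler is allowed to pick.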
The assembly is only saving 4 registers when there are 7 registers being used. So R4, R5, R6 are not saved if you look at the assembly code above.
Anyway, I went back to marking all of them as clobber so I could get correct assembly code. I need to fix getsec() as this function does not work currently.
Mike
The following snippet is saving things correctly:
It saves r0, r1 with one fast block move, then saves r3, r4, r5, r6 with another block move. r2 is never actually written to so it skips saving it.