Ideally it would be that, which could save 3 instructions and 6 clocks vs what was actually compiled in the code above. That can add up when there are function calls with lots of arguments being set up, including immediates. I think it is done this way so that the stack is adjusted atomically in one go when ready to make the call, in case prior arguments and locals that are referenced off the SP are needed while building the new arguments. But it's very inefficient in code size IMO, and this is where having a BP could come in handy.
I sort of wonder if the called code's prologue and epilogue code could be best done in COGRAM/LUTRAM so that the impact of the extra code size overhead is reduced. We could easily pass a register count to save in PA with something like this:
called function prologue: callpa #5, enter_handler
called function epilogue: callpa #5, leave_handler
Then in COG/LUT the enter_handler could save and block write the saved registers. The problem would be when there are gaps in the registers to be saved/restored. But it could simply be brute forced to just save them anyway as it's only a single clock per extra register saved rather than dealing with masks and looking for ranges to save etc.
The leave_handler could potentially also pop the return address from the stack and return to the caller directly bypassing the need for a RETA at the end of the function.
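To make the brute-force idea concrete, here is a minimal host-side C model of it, not P2 code. It assumes the saved registers are the first `count` entries of a contiguous range and that the count arrives the way PA would carry it; the names `regs`, `stack_mem` and `sp` are hypothetical and only exist for this sketch.

```c
#include <stdint.h>
#include <string.h>

#define NUM_REGS 32

static uint32_t regs[NUM_REGS];   /* hypothetical model of the register file */
static uint32_t stack_mem[256];   /* hypothetical model of the hub stack */
static unsigned sp;               /* stack index, grows upward in this model */

/* Model of enter_handler: block-save the first `count` registers.
   Registers in the range that the function never touches (the "gaps")
   are saved anyway, which costs one extra slot each but needs no mask.
   No bounds checking is done in this sketch. */
static void enter_handler(unsigned count)
{
    memcpy(&stack_mem[sp], regs, count * sizeof regs[0]);
    sp += count;
}

/* Model of leave_handler: block-restore the same range. */
static void leave_handler(unsigned count)
{
    sp -= count;
    memcpy(regs, &stack_mem[sp], count * sizeof regs[0]);
}
```

Calling `enter_handler(5)` on entry and `leave_handler(5)` on exit mirrors the `callpa #5` pair above; popping the return address inside the leave handler is not modelled here.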
Just added a bunch of the previously unused P2 instructions to the P2LLVM code in P2InstrInfo.td. A few commented instructions at the bottom are still TBD as they are a little more involved. I've also added a couple of extra instruction formats for P2InstrFormats.td like P2InstNOARGS for the few instructions which take no arguments, as the code doesn't seem to have that yet. If anyone needs these for their own use, they are copied here for now, but beware they are not all tested so if there are bit errors below it could assemble to bad code.
Here are new classes for P2InstrFormats.td (one for no arguments, one for special MODCZ arguments)
class P2InstNOARGS<bits<28> op, dag outs, dag inputs, string asmstr> : P2Inst<21, outs, !con(inputs, (ins P2Cond:$cc)), !strconcat("$cc\t", asmstr)> {
bits<4> cc;
let Inst{31-28} = cc;
let Inst{27-0} = op;
let TSFlags{5} = 0;
let TSFlags{6} = 0;
let TSFlags{7} = 0;
let TSFlags{10-8} = 0; // s is always operand 0
let TSFlags{13-11} = 0; // d is always operand 0
let TSFlags{16-14} = 0; // n is always operand 0
}
class P2InstCZ4C4Z<bits<7> op, bits<9> s, dag outs, dag inputs, string asmstr> :
P2Inst<20, outs, !con(inputs, (ins P2Cond:$cc, P2Effect:$cz)), !strconcat("$cc\t", asmstr, " $cz")> {
bits<20> n20;
bits<4> cc;
bits<2> cz;
bits<4> cccc;
bits<4> zzzz;
let Inst{31-28} = cc;
let Inst{27-21} = op;
let Inst{20-19} = cz;
let Inst{18-17} = 0b10;
let Inst{16-13} = cccc;
let Inst{12-9} = zzzz;
let Inst{8-0} = s;
let TSFlags{5} = 0;
let TSFlags{6} = 0;
let TSFlags{7} = 0;
let TSFlags{10-8} = 0; // s is always operand 0
let TSFlags{13-11} = 0; // d is always operand 0
let TSFlags{16-14} = 0; // n is always operand 0
}
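As a cross-check of the bit layout in P2InstCZ4C4Z, the same field positions can be packed by hand in C. This is only a sketch that mirrors the `let Inst{...}` lines above; the function name `encode_cz4c4z` is made up for illustration and is not part of P2LLVM.

```c
#include <stdint.h>

/* Pack the fields into the bit positions that P2InstCZ4C4Z assigns:
   cc 31-28, op 27-21, cz 20-19, fixed 0b10 in 18-17,
   cccc 16-13, zzzz 12-9, s 8-0. */
static uint32_t encode_cz4c4z(uint32_t cc, uint32_t op, uint32_t cz,
                              uint32_t cccc, uint32_t zzzz, uint32_t s)
{
    return ((cc   & 0xFu)  << 28) |
           ((op   & 0x7Fu) << 21) |
           ((cz   & 0x3u)  << 19) |
           (0x2u           << 17) |
           ((cccc & 0xFu)  << 13) |
           ((zzzz & 0xFu)  <<  9) |
           (s & 0x1FFu);
}
```

Comparing the output of a helper like this against the bytes that llvm-objdump shows is one quick way to spot a misplaced field.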
@Rayman said:
@rogloh Nice! It is just inline assembly where this could go wrong, right? Think so.
Yes. C doesn't use these as there is no pattern defined that uses these instructions. The disassembler and assembler should be able to take these and work with these formats. I have changed the tab spacing a little, which works with the rest of my code changes but may upset the output format slightly for these in isolation in the objdump until the remainder of the code changes I am working on get applied, but that's cosmetic only. Also, the code above has been fixed for some typo errors. It's now building ok on my Mac.
Update: I just tried a few of these new instructions for sanity (ignore the modcz stuff which I'm still messing with):
Kind of expected the without case to crash and burn, but seemed to work just like the one with the fixes...
Maybe it takes a few of those operations to break it, or maybe this was just too simple an example, with nothing really happening after the modulus operation besides serial output?
BTW: I still can't figure out how to create libp2.a and the other .a files... Do these compile together with llvm? Or, are they a separate make command?
Strange. Are you sure you did the testing correctly and have the C modulus operation working without the extra #if 0 change in P2ExpandPseudos.cpp as mentioned in https://forums.parallax.com/discussion/comment/1572365/#Comment_1572365 and https://forums.parallax.com/discussion/comment/1572366/#Comment_1572366 ? Possibly if the code has optimized the modulus code out it would succeed but I found without those changes, compilation of the modulus operation in C would assert LLVM and this also included generating the library code as well which is where I first found it.
BTW: I still can't figure out how to create libp2.a and the other .a files... Do these compile together with llvm? Or, are they a separate make command?
If you use the build.py script in the top level folder of the p2llvm repository code then they can be built after the P2 variant of LLVM is built. If you have built your P2 variant of LLVM independently without using build.py, like you did in other ways for Windows specifically, then they won't be produced. On Linux you should really just be using the python build script to do everything for you. I know I did have some problems out of the box with where the propeller*.h include files were stored/available, and I needed to move some around so the library build would complete without problems. This might only have been the very first time, and after that it was okay; I can't recall the specifics. It's possible I just hadn't set up the include path in the installer correctly, if that was also required to be done; not sure.
Update: For sanity I just recompiled without the #if 0 in the QUREM code in P2ExpandPseudos.cpp and tried to build your hello2.c above. It certainly still caused a crash in the compiler. However I just found that with optimizations disabled (using the -O0 setting), it didn't assert. So that's likely what you have different in your setup. Try having the -Os or -O1 optimization level setting enabled and see if it crashes.
@rogloh ok, yeah didn’t set any optimization that can recall so probably was default…
That brings up what O should be using…. Probably O1 at least right? Unless for troubleshooting?
So .a is a separate compile step. That is useful info. Can try that now.
Can’t really go back to without your fixes on that one machine. But have a couple more machines want to build on and can test optimization effect there…
Update: just thought of a use for MOVBYTS instruction (already coded but not linked to any pattern). LLVM has an endian swap intrinsic which may be able to make use of it.
2. Clang C/C++ Builtins
If you are writing C or C++ code and want to ensure Clang emits the most efficient machine-level instruction (like BSWAP on x86 or REV on ARM), use the following Clang built-ins:
__builtin_bswap16(uint16_t x)
__builtin_bswap32(uint32_t x)
__builtin_bswap64(uint64_t x)
I just figured out how to change the compiler pattern selection for the 32 bit byte swap operation to make use of the P2 MOVBYTS capability.
Here's the code being compiled as a C function using the intrinsic without the change:
#include <stdint.h>

uint32_t __attribute__((noinline)) test(uint32_t x, uint32_t y)
{
    uint32_t z = x % y;
    return __builtin_bswap32(z);
}
The __builtin_bswap16 form also uses it: it simply shifts the 32 bit register up by 16 bits first, then does the 32 bit MOVBYTS, which effectively zero extends the result in the register, so nothing else is needed. Right now the 64 bit byte swap form is not implemented and will still generate a lot of code, but I expect it could instead be done with two 32 bit swaps plus an intermediate register for a register exchange, so probably 5 P2 instructions. Here's the 64 bit byte swap code, and it looks really ugly. It seems it's using 32 bit masks and shifting to do things byte by byte, which is very inefficient.
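The two reductions described here can be checked on a host machine with plain C. This is a sketch of the arithmetic only, not of the code P2LLVM emits; the helper names `bswap16_via32` and `bswap64_via32` are made up for illustration, and `__builtin_bswap32` stands in for the single MOVBYTS.

```c
#include <stdint.h>

/* 16-bit swap through the 32-bit swap: move the value into the top half
   first, so the swapped bytes land in the low half and the upper half
   of the result is zero. */
static uint32_t bswap16_via32(uint16_t x)
{
    return __builtin_bswap32((uint32_t)x << 16);
}

/* 64-bit swap as two 32-bit swaps with the halves exchanged. */
static uint64_t bswap64_via32(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    return ((uint64_t)__builtin_bswap32(lo) << 32) | __builtin_bswap32(hi);
}
```

Both helpers should agree with `__builtin_bswap16` and `__builtin_bswap64` for every input, which makes them easy to compare against whatever the compiler generates.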
Or, something that needs to be added at some point?
It does support it as I understand, or at least there are some driver files and FATFS stuff present in the C library. Take a look in the source folder area (p2llvm/libc/drivers/*) and you'll see SDDriver.c sdmmc.c and sdmmc.h and a diskio.c file. I haven't tried it out though.
@rogloh said:
It does support it as I understand, or at least there are some driver files and FATFS stuff present in the C library. Take a look in the source folder area (p2llvm/libc/drivers/*) and you'll see SDDriver.c sdmmc.c and sdmmc.h and a diskio.c file. I haven't tried it out though.
Which is where I, temporarily at the time, hacked in the smartpin clock input select code when the distance is greater than three pins. It is necessarily convoluted. It mostly just takes up space.
@Rayman said:
@rogloh Any idea if compiler flags would be the same for C++ as they are for C.
Would really love to program in C++ if not bloated too much…
I suspect there's a bunch of additional C++ flags available using LLVM. Whether you require them or not for your project is something you'll need to figure out. Try without first and see if it breaks, then debug. Usual stuff.
@Rayman said:
@evanh do you maybe then have the incantation needed to use it?
My custom library changes may have commented them out as the FatFS stuff was colliding with MicroPython's. You'll need to be able to build your own libc.a and libp2.a libraries (e.g. with Linux), and make the stock one. I believe that the normal driver build for libc.a had all the following files activated in p2llvm/libc/drivers/CMakeLists.txt before I shrunk it:
As far as using it goes I think it's almost automatically included via the InitIO chain if setup correctly. You might need to call mount() first. See the sdcard.c file in libc/stdio/ source area and also look for uses of the _InitIO symbol in the tree which is called at startup. Hopefully you can figure out how it works. Look in driver.h (two files) as well for more insights.
Also there are C++ examples (non-SD) in the examples folder, and the corresponding Makefile may help determine any extra or default C++ flags.
The problem with C++ will be there is no STL library currently built for this LLVM port, so while you can use C++ syntax you can't use the C++ libraries unless you find/port one yourself.
I've almost got all the P2 instructions coded up now and all the aliases are in as well. When finished it should allow P2LLVM to be a full working assembler for the P2, as well as a C compiler supporting full inline assembly. There are a handful of remaining issues left to sort out that I'm still working on:
MODCZ uses a slightly different encoding format, and either references "registers" as operands from 0-$f for each 4 bit argument, or alias names like _set, _clr and _nc_or_z etc. Still adding all that logic and capability.
TESTP[n] and TESTB[n] get encoded with different rules based on their effects flags (WC/WZ) being different to others despite the regular opcode portions colliding with DRV/DIR/FLT/OUT/BIT instructions which are only allowed to use WCZ or no effects. Also TESTP[n]/TESTB[n] accept ANDC/ANDZ/ORC/ORZ/XORC/XORZ unlike the other instructions. This still has to be implemented. Nikita's codebase hadn't finished this part yet so I'm looking into how to extend it to do this. A few other instructions also ideally need to have effects flags enforced where they can/must only use a subset of the flags like GETBRK & GETRND. Update: I found a way and TESTB[n]/TESTP[n] is now coded.
For completeness I should add the two unallocated P2 instructions marked as "empty", so the disassembler won't crash if it ever sees this opcode value anywhere. All P2 instructions would be present then and running llvm-objdump on random data should no longer assert. Update: empty instructions are added.
Further work:
In the disassembled output we could try to identify symbols for relative jumps in addition to the absolute jumps which were already done, and also show the effective address to help identify address targets for DJNZ, TJZ and other relative branches. Some of this capability may already be available generically in llvm-objdump and just need to be enabled somewhere.
I've already tried to cleanup the listing file a bit to output in a column aligned format so a tab character was added to separate the mnemonic from the arguments instead of a space, however I'd also quite like to have a way to get rid of the 0x hex prefixes and move to a "$" prefix like we already use in PASM2 syntax, but there are issues with parsing this if you want to feed the output assembly code back into the compiler (eg. if you compile to assembly with -S and then assemble the output file separately). It may make sense to have a special flag to control whether the assembly listing gets printed out containing a $ or 0x prefix. The problem is the parser also needs to read in the same format, so it would need similar flags for that as well to enable this. I believe there are some attribute capabilities in LLVM that could control this type of thing, so I'm still looking into that part. Eg. passing --disassembler-options in llvm-objdump.
Further patterns to make use of SETNIB/SETBYTE/SETWORD/GETNIB/GETBYTE/GETWORD instructions where possible instead of shifting and masking to extract/set sub fields in registers.
Add patterns to use TESTB/BITH/BITL if single bits are being tested or set/cleared in a register instead of anding/oring with a large sized (augs) mask for example. That is a common operation to do in C and it would make sense to make use of these P2 instructions.
Other P2 specific optimizations identified that can reduce code. I've seen conditional code cases where some register value is simply copied into itself for one condition while another value is moved into it for the opposite condition, and that's something that could be removed in an optimization pass IMO. Like how the relative JMP #0 was removed.
Also I believe that if we could prevent the generation of TJNZ and any other relative branches in hubexec code via a special optional flag it should be possible to make use of my external PSRAM code caching solution that was already working with flexspin if we compile to ASM first then process the assembly code to look for any external memory "far" jumps and recompile with my caching stuff included. We would only need to leave either the PA or PB register available for this because either a CALLPA or CALLPB needs to be used for performing the far jump indirectly. The external memory caching code would need to execute from LUTRAM or COGRAM. I recall previously it was only using about 25% of the LUTRAM for containing my external memory code. Right now in P2LLVM the LUT gets used for some builtin functions but could probably be shared with this external memory code, still leaving most of COG RAM free for any general Fcache use, apart from some state that needs to be maintained for managing external memory (<32 longs IIRC), so hopefully 432 longs or so free for a really large Fcache area. Some of the builtin code that was placed into LUTRAM is of questionable benefit whenever it just jumps out to execute hubexec code anyway. High speed memory copy/fills make sense to run from LUT and probably the shift/divide/multiply stuff for 64 bit integers. However for any straight line code, running from LUT doesn't speed up too much vs hubexec anyway and I don't believe we should put floating point code routines in there at all.
Also, think you have to do the submodule init and submodule update before running build.py.
Also, should probably copy over @rogloh updates to LLVM and P2 target after that and before build as well...
@Rayman said:
@rogloh Is libp2++ the C++ version of the C libp2 library?
If so, does this mean C++ compiler won't work right until similar fixes are made there?
I don't have any changes for libp2++. That seems to just be a header file for some C++ wrapper classes to control Smartpins. Nothing else is in that folder. Maybe Nikita had plans to put more C++ stuff in there over time but it seems it's just for control of Smartpins at the moment. If you write only standalone classes yourself or otherwise use third party code that doesn't require the C++ library classes and methods, I'd expect the Clang/LLVM toolchain should still work with the P2, but it will be rather limited. For full C++ support you need to link in an implementation of a standard library. P2LLVM should presumably already include one according to https://releases.llvm.org/14.0.0/projects/libcxx/docs/ReleaseNotes.html.
Apparently you just need to compile your code with the standard libc++ like this (though I've not tried it):
@Rayman said:
@rogloh do you have a example of how to do inline assembly?
Yeah this example below.
You need to ensure that \n is added at the end of every line being assembled and follow it with three colons before closing the parenthesis for the overall "asm" block.
The colons at the end separate those register parameters that are output to, read from, or are otherwise clobbered. In this case below "cc" = flags and "r" = register, "=r" means writes to register, "+r" means reads/modifies/writes. There is also "i" used for immediate constants. If multiple parameters are read or written etc, they are separated by commas between the colons.
You will probably want to look up the "GCC Extended Asm" syntax to follow it fully. The name preceding the "+r" or "r" strings is the name used in the square brackets with the % symbol in the assembly code, while the name following the "r" or "+r" that is given in the parentheses, e.g. (timeout), is the C variable name used in the function's locals that will source this register initially. In this case it was kept the same, although I don't expect it needs to be.
As I've learned more about LLVM I've added a few more things to my changes for P2LLVM.
For the relative branches/calls using 20 bit immediates (JMP/CALL), and using 9 bit immediates (TJNZ/TJZ etc), I've enabled printing a target address after applying the relative offset to the next program counter address (the absolute called addresses were already being resolved to symbol names where applicable). This really helps navigate a disassembled listing. They get printed in the comment string that follows the instruction. E.g <331b0> below is the target address for the TJNZ. I'd like to do something similar with AUGD/AUGS but it's trickier given they span different instructions. If those could also resolve to printing some global symbol name in the different segments of the ELF file it would be wonderful.
The disassembled output now prints immediates in hex or in decimal correctly according to the --print-hex-imm option to llvm-objdump, previously it was printing in decimal only. Unfortunately decimal is still the default for LLVM as --no-print-hex-imm gets used until overridden. I think there is a push for later versions of LLVM to flip the default to use hex immediates. However the resolved target addresses are still conveniently printed in hex.
The parsing code also accepts $ prefixes correctly now for hex integers and registers (e.g. you can mov $3,4 or mov r0,$3 or mov r2, #$a1 etc, all are accepted). You can still use the "0x" prefix instead of $. It currently doesn't support % or %% immediates though when parsing numbers. The percentage character prefix is also meant to work for binary constants, but isn't right now for some weird reason I will need to look into, and is probably related to evaluating expressions which may not be very comprehensive at this time. For the %% combo, it's possible that I might be able to first look for two percentage signs and then try to convert the incoming decimal value into quaternary base numbers (base 4, like %%3210->0b11100100) but some of the integer parsing is in the common codebase, not the P2 target specific files, so TBD on that.
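For the %% case, the conversion itself is simple once the digits after the prefix are available as text. The following is only a sketch of that arithmetic in C, assuming the caller has already stripped the two percent signs; `parse_quaternary` is a hypothetical helper and not something that exists in LLVM or P2LLVM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Parse a string of base-4 digits (the text after a "%%" prefix).
   Returns false for an empty string, a digit outside 0-3, or more than
   16 digits (which would not fit in 32 bits). */
static bool parse_quaternary(const char *digits, uint32_t *out)
{
    uint32_t value = 0;
    int count = 0;

    for (; *digits != '\0'; ++digits) {
        if (*digits < '0' || *digits > '3')
            return false;
        if (++count > 16)
            return false;
        /* Each quaternary digit contributes exactly two bits. */
        value = (value << 2) | (uint32_t)(*digits - '0');
    }
    if (count == 0)
        return false;
    *out = value;
    return true;
}
```

With the example from above, the digits 3210 give 0b11100100, i.e. 0xE4.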
I've added the extra ANDC/ANDZ/ORC/ORZ/XORC/XORZ flags to the TESTP/TESTPN/TESTB/TESTBN instructions and code to encode and decode them specially just for these instructions.
Once I'm happy this is a consistently working feature set I'll post the updated files here, or ideally put it on github if I can get my act together.
EDIT: actually that could take a while, so here are the current set of file changes to be copied into P2LLVM source if you want the extra P2 instructions and stuff I've done to date. NOTE: not everything is 100% tested.
I've been able to add some more Patterns to the TableGen stuff for P2LLVM that take advantage of the P2 capabilities.
So far I'm using the following P2 instructions:
FLE, FLES, FGE, FGES - min/max on registers with other registers or immediate values
replaces code like this
if (x>y) x=y;
with
FLE x,y
and code like this
if (x<y) x=y;
with
FGE x,y
Works the same for signed values as well using FLES and FGES instead
BITL, BITH, BITNOT - individual bit set/clear/invert
replaces
reg &= ~(1<<n); with BITL reg, n
reg |= (1<<n); with BITH reg, n
reg ^= (1<<n); with BITNOT reg, n
In theory a group of consecutive bits (up to 16 total) could be changed in one go using
ranges of bits identified in the top 4 bits of the 9 bit index, but this is a lot of
extra work figuring it out for infrequent returns, so I'm leaving that out for now.
GETNIB, SETNIB - extracting/setting nibbles in registers
replaces
reg2 = (reg >> 4n) & $f; where n=0-7
with GETNIB reg2, reg, #n
and
reg2 = (reg2 & ~($f << 4n)) | ((reg & $f) << 4n); where n=0-7
with SETNIB reg2, reg, #n
GETBYTE, SETBYTE - extracting/setting bytes in registers
(same thing done as for GETNIB/SETNIB but for bytes)
GETWORD, SETWORD - extracting/setting words in registers
(same thing done as for GETNIB/SETNIB but for words)
ROLNIB, ROLBYTE, ROLWORD
replaces
D = (D << 4) | ((S >> 4n) & 15)
or
D = (D << 4) + ((S >> 4n) & 15)
with
ROLNIB D, S, #0 , for n=0, as well as n=1-7 with other right shifts of S
similar for bytes/words
ANDN
replaces
reg = reg & bigval;
with
ANDN reg, #val9bits
if val9bits fully fits in 9 bits and val9bits = ~bigval, saving the AUGS
I did notice in some cases that the optimizer's re-ordering of instructions can change the use of these patterns so maybe some more tweaking is needed to try to force its use more often.
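For reference while checking the patterns, the C expressions listed above can be written as small host-side models. These are sketches of the intended arithmetic as described in this post, not a definition of the P2 instructions, and the lower-case function names are made up for illustration.

```c
#include <stdint.h>

/* FLE / FGE: clamp d to at most / at least s (unsigned forms). */
static uint32_t fle(uint32_t d, uint32_t s) { return d > s ? s : d; }
static uint32_t fge(uint32_t d, uint32_t s) { return d < s ? s : d; }

/* GETNIB: extract nibble n (0-7) of s. */
static uint32_t getnib(uint32_t s, unsigned n)
{
    return (s >> (4 * n)) & 0xFu;
}

/* SETNIB: replace nibble n of d with the low nibble of s. */
static uint32_t setnib(uint32_t d, uint32_t s, unsigned n)
{
    return (d & ~(0xFu << (4 * n))) | ((s & 0xFu) << (4 * n));
}

/* ROLNIB: shift d left by one nibble and bring in nibble n of s. */
static uint32_t rolnib(uint32_t d, uint32_t s, unsigned n)
{
    return (d << 4) | getnib(s, n);
}

/* ANDN: AND with the inverse of s, so a small s can stand in for a
   large mask and avoid the AUGS. */
static uint32_t andn(uint32_t d, uint32_t s) { return d & ~s; }
```

For example, `andn(reg, 0xFF)` gives the same result as `reg & 0xFFFFFF00`, which is the case where the 9 bit immediate saves an AUGS.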
Here's the set of patterns being used.
// these patterns are not yet 100% validated
// --- Signed Maximum ---
def : Pat<(smax P2GPR:$rs1, P2GPR:$rs2), (FGESrr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
def : Pat<(smax P2GPR:$rs1, imm:$rs2), (FGESri P2GPR:$rs1, imm:$rs2, always, noeff)>;
// --- Signed Minimum ---
def : Pat<(smin P2GPR:$rs1, P2GPR:$rs2), (FLESrr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
def : Pat<(smin P2GPR:$rs1, imm:$rs2), (FLESri P2GPR:$rs1, imm:$rs2, always, noeff)>;
// --- Unsigned Maximum ---
def : Pat<(umax P2GPR:$rs1, P2GPR:$rs2), (FGErr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
def : Pat<(umax P2GPR:$rs1, imm:$rs2), (FGEri P2GPR:$rs1, imm:$rs2, always, noeff)>;
// --- Unsigned Minimum ---
def : Pat<(umin P2GPR:$rs1, P2GPR:$rs2), (FLErr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
def : Pat<(umin P2GPR:$rs1, imm:$rs2), (FLEri P2GPR:$rs1, imm:$rs2, always, noeff)>;
// Define a multiclass for Nibble patterns
// n: Index (0-7)
// s: Shift amount (n * 4)
// m: The AND mask for GET (15)
// sm: The AND mask for SET that clears the target nibble (0xFFFFFFFF ^ (15 << s))
multiclass NibblePats<int n, int s, int m, bits<32> sm> {
// Pattern for GETNIB: (src >> shift) & 15
def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
(GETNIBrr P2GPR:$src, (i32 n), always)>;
// Pattern for SETNIB: (D & mask) | ((S & 15) << shift)
// Note: $src is tied to $D via Constraints in the instruction definition
def : Pat<(or (and P2GPR:$D_in, (i32 sm)),
(shl (and P2GPR:$S, (i32 15)), (i32 s))),
(SETNIBrr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
}
// Pattern for ROLNIB: (D << 4) | (S & 15)
def : Pat<(or (shl P2GPR:$D_in, (i32 4)),
(and P2GPR:$S, (i32 0xf))),
(ROLNIBrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
// Pattern for ROLNIB: (D << 4) + (S & 15)
def : Pat<(add (shl P2GPR:$D_in, (i32 4)),
(and P2GPR:$S, (i32 0xf))),
(ROLNIBrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
// Usage: defm : NibblePats<Index, Shift, GetMask, SetMask>
defm : NibblePats<0, 0, 15, 0xFFFFFFF0>;
defm : NibblePats<1, 4, 15, 0xFFFFFF0F>;
defm : NibblePats<2, 8, 15, 0xFFFFF0FF>;
defm : NibblePats<3, 12, 15, 0xFFFF0FFF>;
defm : NibblePats<4, 16, 15, 0xFFF0FFFF>;
defm : NibblePats<5, 20, 15, 0xFF0FFFFF>;
defm : NibblePats<6, 24, 15, 0xF0FFFFFF>;
defm : NibblePats<7, 28, 15, 0x0FFFFFFF>;
// Special case for Nibble 0 (no shift)
def : Pat<(and P2GPR:$src, (i32 15)), (GETNIBrr P2GPR:$src, (i32 0), always)>;
multiclass BytePats<int n, int s, int m, bits<32> sm> {
def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
(GETBYTErr P2GPR:$src, (i32 n), always)>;
def : Pat<(or (and P2GPR:$D_in, (i32 sm)),
(shl (and P2GPR:$S, (i32 255)), (i32 s))),
(SETBYTErr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
}
// Pattern for ROLBYTE: (D << 8) | (S & 0xff)
def : Pat<(or (shl P2GPR:$D_in, (i32 8)),
(and P2GPR:$S, (i32 0xff))),
(ROLBYTErr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
// Pattern for ROLBYTE: (D << 8) + (S & 0xff)
def : Pat<(add (shl P2GPR:$D_in, (i32 8)),
(and P2GPR:$S, (i32 0xff))),
(ROLBYTErr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
defm : BytePats<0, 0, 255, 0xFFFFFF00>;
defm : BytePats<1, 8, 255, 0xFFFF00FF>;
defm : BytePats<2, 16, 255, 0xFF00FFFF>;
defm : BytePats<3, 24, 255, 0x00FFFFFF>;
// Special case for Byte 0 (no shift)
def : Pat<(and P2GPR:$src, (i32 255)), (GETBYTErr P2GPR:$src, (i32 0), always)>;
multiclass WordPats<int n, int s, int m, bits<32> sm> {
def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
(GETWORDrr P2GPR:$src, (i32 n), always)>;
def : Pat<(or (and P2GPR:$D_in, (i32 sm)),
(shl (and P2GPR:$S, (i32 0xffff)), (i32 s))),
(SETWORDrr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
}
// Pattern for ROLWORD: (D << 16) | (S & 0xffff)
def : Pat<(or (shl P2GPR:$D_in, (i32 16)),
(and P2GPR:$S, (i32 0xffff))),
(ROLWORDrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
// Pattern for ROLWORD: (D << 16) + (S & 0xffff)
def : Pat<(add (shl P2GPR:$D_in, (i32 16)),
(and P2GPR:$S, (i32 0xffff))),
(ROLWORDrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
defm : WordPats<0, 0, 0xffff, 0xFFFF0000>;
defm : WordPats<1, 16, 0xffff, 0x0000FFFF>;
// Special case for Word 0 (no shift)
//def : Pat<(and P2GPR:$src, (i32 0xffff)), (GETWORDrr P2GPR:$src, (i32 0), always)>;
def ToInv9BitImm : SDNodeXForm<imm, [{
uint32_t val = (uint32_t)N->getZExtValue();
uint32_t invVal = ~val & 0x1FF; // Mask to exactly 9 bits
return CurDAG->getTargetConstant(invVal, SDLoc(N), MVT::i32);
}]>;
def immANDN : PatLeaf<(i32 imm), [{
// Check if the bitwise NOT of this 32-bit value fits in 9 bits
uint32_t val = (uint32_t)N->getZExtValue();
return isUInt<9>(~val);
}]>;
// Pattern: (and Reg, Imm) -> (ANDNri Reg, ~Imm) when ~Imm fits in 9 bits
let AddedComplexity = 500 in {
def : Pat<(and i32:$src, (i32 immANDN:$imm)),
(ANDNri i32:$src, (ToInv9BitImm immANDN:$imm), always, noeff)>;
}
def imm_bit_mask : PatLeaf<(imm), [{
uint32_t val = (uint32_t)N->getZExtValue();
// Check if exactly one bit is CLEAR (and all others are SET)
return isPowerOf2_64(~val);
}]>;
def to_bit_idx : SDNodeXForm<imm, [{
uint32_t val = ~((uint32_t)N->getZExtValue());
return CurDAG->getTargetConstant(Log2_32(val), SDLoc(N), MVT::i32);
}]>;
def imm_power_of_2 : PatLeaf<(imm), [{
return isPowerOf2_64((uint32_t)N->getZExtValue());
}]>;
def to_log2 : SDNodeXForm<imm, [{
return CurDAG->getTargetConstant(Log2_32(N->getZExtValue()), SDLoc(N), MVT::i32);
}]>;
let AddedComplexity = 500 in {
// Pattern: Register &= ~(1<<n)
def : Pat<(and i32:$src, imm_bit_mask:$imm),
(BITLri i32:$src, (to_bit_idx $imm), always, noeff)>;
// Pattern: Register |= (1<<n)
def : Pat<(or i32:$src, imm_power_of_2:$imm),
(BITHri i32:$src, (to_log2 $imm), always, noeff)>;
// Pattern: Register ^= (1<<n)
def : Pat<(xor i32:$src, imm_power_of_2:$imm),
(BITNOTri i32:$src, (to_log2 $imm), always, noeff)>;
}
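The immediate predicates and transforms above can be sanity-checked outside of TableGen with a small C model. This is a sketch that mirrors the checks in immANDN, imm_bit_mask, imm_power_of_2 and their transforms; the `sel_*` helper names are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

static bool is_pow2(uint32_t v) { return v != 0 && (v & (v - 1)) == 0; }

static unsigned log2_u32(uint32_t v)
{
    unsigned n = 0;
    while (v >>= 1)
        ++n;
    return n;
}

/* BITH / BITNOT case: the mask has exactly one bit set. */
static bool sel_bith(uint32_t imm, unsigned *bit)
{
    if (!is_pow2(imm))
        return false;
    *bit = log2_u32(imm);
    return true;
}

/* BITL case: the AND mask has exactly one bit clear. */
static bool sel_bitl(uint32_t imm, unsigned *bit)
{
    return sel_bith(~imm, bit);
}

/* ANDN case: the inverted AND mask fits in a 9 bit immediate. */
static bool sel_andn(uint32_t imm, uint32_t *imm9)
{
    uint32_t inv = ~imm;
    if (inv > 0x1FFu)
        return false;
    *imm9 = inv;
    return true;
}
```

Note that a mask such as 0xFFFFFFF7 satisfies both the BITL and the ANDN check in this model, so with equal AddedComplexity it may be worth confirming which pattern the selector actually picks.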
I'd like to add TESTB as well as the instructions that can set Z on the result being followed by the conditional branch rather than separately testing the result register against 0 as an additional step. We may be able to utilize MUXQ as well for a certain pattern in C code.
Here's the BITH/BITL/BITNOT/ANDN working with this silly C snippet:
Just some musings on this...
I've used the forum as a notebook at times. Both for laying out the reasons/steps and for later reference.
Update: I just tried a few of these new instructions for sanity (ignore the modcz stuff which I'm still messing with):
    asm volatile (
        //"modc _nc_and_z wc\n"
        //"modz _nc_and_nz wz\n"
        //"modcz $3,$5 wcz\n"
        "addpix r0, r2\n"
        "mulpix r0, r2\n"
        "mulpix r0, r2\n"
        "allowi\n"
        "stalli\n"
        "trgint1\n"
        "setluts #4\n"
        "setpat r3,r5\n"
        "fblock r3,#4\n"
        "cmpm r3, #4\n"
        "rczl r3 wc\n"
        "sca r3,r19\n"
        "negc r3, #22 wc\n"
        "jint #3\n"
        "testn r3,r4\n"
...and I get this disassembly:
@rogloh Got llvm compiled on Ubuntu, first without and then with the fixes from post#687 from here https://forums.parallax.com/discussion/169862/micropython-for-p2/p23
Tested with a hello2.c file that includes a modulus operation.
Kind of expected the without case to crash and burn, but seemed to work just like the one with the fixes...
Maybe it takes a few of those operations to break it, or maybe this was just too simple an example, with nothing really happening after the modulus operation besides serial output?
BTW: I still can't figure out how to create libp2.a and the other .a files... Do these compile together with llvm? Or, are they a separate make command?
Strange. Are you sure you did the testing correctly and have the C modulus operation working without the extra #if 0 change in P2ExpandPseudos.cpp as mentioned in https://forums.parallax.com/discussion/comment/1572365/#Comment_1572365 and https://forums.parallax.com/discussion/comment/1572366/#Comment_1572366 ? Possibly if the code has optimized the modulus code out it would succeed but I found without those changes, compilation of the modulus operation in C would assert LLVM and this also included generating the library code as well which is where I first found it.
If you use the build.py script in the top level folder of the p2llvm repository then the libraries can be built after the P2 variant of LLVM is built. If you have built your P2 variant of LLVM independently without using build.py, like you did in other ways for Windows specifically, then they won't be produced. On Linux you should really just use the python build script to do everything for you. I know I did have some problems out of the box with where the propeller*.h include files were stored/available, and I needed to move some around so the library build would complete without problems. This might have been only the very first time and after that it was okay; I can't recall specifics. It's possible I just hadn't set up the include path in the installer correctly, if that was also required, not sure.
Update: For sanity I just recompiled without the #if 0 in the QUREM code in P2ExpandPseudos.cpp and tried to build your hello2.c above. It certainly still caused a crash in the compiler. However I just found that with optimizations disabled (using the -O0 setting), it didn't assert. So that's likely what you have different in your setup. Try having the -Os or -O1 optimization level setting enabled and see if it crashes.
@rogloh ok, yeah didn’t set any optimization that can recall so probably was default…
That brings up what O should be using…. Probably O1 at least right? Unless for troubleshooting?
So .a is a separate compile step. That is useful info. Can try that now.
Can’t really go back to without your fixes on that one machine. But have a couple more machines want to build on and can test optimization effect there…
Further to this
I just figured out how to change the compiler pattern selection for the 32 bit byte swap operation to make use of the P2 MOVBYTS capability.
Here's the code being compiled as a C function using the intrinsic without the change:
    uint32_t __attribute__((noinline)) test(uint32_t x, uint32_t y)
    {
        uint32_t z = x % y;
        return __builtin_bswap32(z);
    }
and the listing
    00000a00 <test>:
         a00: 28 04 64 fd    setq #2
         a04: 61 a1 67 fc    wrlong r0, ptra++
         a08: d1 a1 13 fd    qdiv r0, r1
         a0c: 19 a0 63 fd    getqy r0
         a10: d0 a3 03 f6    mov r1, r0
         a14: 08 a2 47 f0    shr r1, #8
         a18: 7f 00 00 ff    augs #$fe00 >> 9
         a1c: 00 a3 07 f5    and r1, #$100
         a20: d0 a5 03 f6    mov r2, r0
         a24: 18 a4 47 f0    shr r2, #$18
         a28: d2 a3 43 f5    or r1, r2
         a2c: d0 a5 03 f6    mov r2, r0
         a30: 18 a4 67 f0    shl r2, #$18
         a34: 08 a0 67 f0    shl r0, #8
         a38: 80 7f 00 ff    augs #$ff0000 >> 9
         a3c: 00 a0 07 f5    and r0, #0
         a40: d0 a5 43 f5    or r2, r0
         a44: d1 a5 43 f5    or r2, r1
         a48: d2 df 03 f6    mov r31, r2
         a4c: 28 04 64 fd    setq #2
         a50: 5f a1 07 fb    rdlong r0, --ptra
         a54: 2e 00 64 fd    reta
And here's the new much smaller code with MOVBYTS generation enabled now:
    00000a00 <test>:
         a00: 61 a1 67 fc    wrlong r0, ptra++
         a04: d1 a1 13 fd    qdiv r0, r1
         a08: 19 a0 63 fd    getqy r0
         a0c: 1b a0 ff f9    movbyts r0, #$1b
         a10: d0 df 03 f6    mov r31, r0
         a14: 5f a1 07 fb    rdlong r0, --ptra
         a18: 2e 00 64 fd    reta
The __builtin_bswap16 also uses it: it simply shifts the 32 bit register up by 16 bits first and then does the 32 bit MOVBYTS, which effectively zero extends the result in the register, so nothing else is needed. Right now the 64 bit byte swap form is not implemented and will still generate a lot of code, but I expect it could instead be done with two 32 bit swaps plus an intermediate register for a register exchange, so probably 5 P2 instructions. Here's the 64 bit byte swap code, and it looks really ugly. It seems it's using 32 bit masks and shifting to do things byte by byte, very inefficient.
    00000a00 <test>:
         a00: 28 16 64 fd    setq #$b
         a04: 61 a1 67 fc    wrlong r0, ptra++
         a08: d1 a1 13 fd    qdiv r0, r1
         a0c: 19 a0 63 fd    getqy r0
         a10: 00 a2 07 f6    mov r1, #0
         a14: 38 a6 07 f6    mov r3, #$38
         a18: d3 a5 03 f6    mov r2, r3
         a1c: a0 02 c0 fd    calla #\__lshrdi3
         a20: ee ad 03 f6    mov r6, r30
         a24: ef af 03 f6    mov r7, r31
         a28: 28 a8 07 f6    mov r4, #$28
         a2c: d4 a5 03 f6    mov r2, r4
         a30: a0 02 c0 fd    calla #\__lshrdi3
         a34: ee b1 03 f6    mov r8, r30
         a38: ef b3 03 f6    mov r9, r31
         a3c: 7f 00 00 ff    augs #$fe00 >> 9
         a40: 00 b1 07 f5    and r8, #$100
         a44: 00 b2 07 f5    and r9, #0
         a48: d6 b1 43 f5    or r8, r6
         a4c: d7 b3 43 f5    or r9, r7
         a50: 18 aa 07 f6    mov r5, #$18
         a54: d5 a5 03 f6    mov r2, r5
         a58: a0 02 c0 fd    calla #\__lshrdi3
         a5c: ee b5 03 f6    mov r10, r30
         a60: ef b7 03 f6    mov r11, r31
         a64: 80 7f 00 ff    augs #$ff0000 >> 9
         a68: 00 b4 07 f5    and r10, #0
         a6c: 00 b6 07 f5    and r11, #0
         a70: 08 a4 07 f6    mov r2, #8
         a74: a0 02 c0 fd    calla #\__lshrdi3
         a78: ee ad 03 f6    mov r6, r30
         a7c: ef af 03 f6    mov r7, r31
         a80: 00 80 7f ff    augs #$ff000000 >> 9
         a84: 00 ac 07 f5    and r6, #0
         a88: 00 ae 07 f5    and r7, #0
         a8c: da ad 43 f5    or r6, r10
         a90: db af 43 f5    or r7, r11
         a94: d8 ad 43 f5    or r6, r8
         a98: d9 af 43 f5    or r7, r9
         a9c: 04 02 c0 fd    calla #\__ashldi3
         aa0: ee b1 03 f6    mov r8, r30
         aa4: ef b3 03 f6    mov r9, r31
         aa8: 00 b0 07 f5    and r8, #0
         aac: ff b2 07 f5    and r9, #$ff
         ab0: d5 a5 03 f6    mov r2, r5
         ab4: 04 02 c0 fd    calla #\__ashldi3
         ab8: ee b5 03 f6    mov r10, r30
         abc: ef b7 03 f6    mov r11, r31
         ac0: 00 b4 07 f5    and r10, #0
         ac4: 7f 00 00 ff    augs #$fe00 >> 9
         ac8: 00 b7 07 f5    and r11, #$100
         acc: d8 b5 43 f5    or r10, r8
         ad0: d9 b7 43 f5    or r11, r9
         ad4: d3 a5 03 f6    mov r2, r3
         ad8: 04 02 c0 fd    calla #\__ashldi3
         adc: ee b1 03 f6    mov r8, r30
         ae0: ef b3 03 f6    mov r9, r31
         ae4: d4 a5 03 f6    mov r2, r4
         ae8: 04 02 c0 fd    calla #\__ashldi3
         aec: ee a1 03 f6    mov r0, r30
         af0: ef a3 03 f6    mov r1, r31
         af4: 00 a0 07 f5    and r0, #0
         af8: 80 7f 00 ff    augs #$ff0000 >> 9
         afc: 00 a2 07 f5    and r1, #0
         b00: d0 b1 43 f5    or r8, r0
         b04: d1 b3 43 f5    or r9, r1
         b08: da b1 43 f5    or r8, r10
         b0c: db b3 43 f5    or r9, r11
         b10: d6 b1 43 f5    or r8, r6
         b14: d7 b3 43 f5    or r9, r7
         b18: d8 dd 03 f6    mov r30, r8
         b1c: d9 df 03 f6    mov r31, r9
         b20: 28 16 64 fd    setq #$b
         b24: 5f a1 07 fb    rdlong r0, --ptra
         b28: 2e 00 64 fd    reta
The changes needed for the 16 or 32 bit byte swaps in P2InstrInfo.td are:
    let Constraints = "$s1 = $d", s_num = 2 in {
      def MOVBYTSr : P2InstIDS<0b100111111, 0b0, (outs P2GPR:$d), (ins P2GPR:$s1, P2GPR:$s), "movbyts\t$d, $s">;
      def MOVBYTSi : P2InstIDS<0b100111111, 0b1, (outs P2GPR:$d), (ins P2GPR:$s1, i8imm:$s), "movbyts\t$d, $s">;
    }

    // Map the bswap node (i32 type) to the MOVBYTS instruction
    def : Pat<(bswap i32:$src), (MOVBYTSi P2GPR:$src, 0x1b, always)>;
and also a one line edit in P2ISelLowering.cpp to mark "BSWAP" as Legal instead of Expand.
    setOperationAction(ISD::BSWAP, MVT::i32, Legal);
Now need to figure out what else can be optimized for the P2...
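To see why the #$1b selector gives a byte swap, here is a small host-side C model of MOVBYTS, assuming the documented behaviour that each 2 bit field of S picks which source byte of D lands in the corresponding result byte. The helper names (`movbyts`, `bswap16_p2`, `bswap64_p2`) are just for this sketch; the 64 bit version only illustrates the "two 32 bit swaps plus an exchange" idea and is not what the compiler generates today.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVBYTS D,S: result byte i = D.byte[ S[2i+1:2i] ]. */
static uint32_t movbyts(uint32_t d, uint32_t sel)
{
    assert(sel <= 0xFFu);                      /* only S[7:0] is used */
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t src = (sel >> (2 * i)) & 3u;  /* source byte index for result byte i */
        out |= ((d >> (8 * src)) & 0xFFu) << (8 * i);
    }
    return out;
}

/* 32 bit swap: selector %00_01_10_11 = $1b reverses the byte order. */
static uint32_t bswap32_p2(uint32_t x) { return movbyts(x, 0x1Bu); }

/* 16 bit swap: shift up by 16 first, so the swapped result comes out zero extended. */
static uint32_t bswap16_p2(uint32_t x) { return movbyts(x << 16, 0x1Bu); }

/* 64 bit swap: swap each half, then exchange the halves. */
static uint64_t bswap64_p2(uint64_t x)
{
    uint32_t lo = (uint32_t)x;
    uint32_t hi = (uint32_t)(x >> 32);
    return ((uint64_t)bswap32_p2(lo) << 32) | bswap32_p2(hi);
}
```

For example `bswap32_p2(0x11223344)` gives 0x44332211 and `bswap16_p2(0x1234)` gives 0x3412.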
UPDATE: just added these four to P2InstrInfo.td which enable FGE, FLE, FGES, FLES to be used with 32 bit values.
These need to be added to P2ISelLowering.cpp:
    setOperationAction(ISD::SMIN, MVT::i32, Legal);
    setOperationAction(ISD::SMAX, MVT::i32, Legal);
    setOperationAction(ISD::UMIN, MVT::i32, Legal);
    setOperationAction(ISD::UMAX, MVT::i32, Legal);
Trying to remember something…
Does this Clang include uSD support already ?
Or, something that needs to be added at some point?
It does support it as I understand, or at least there are some driver files and FATFS stuff present in the C library. Take a look in the source folder area (p2llvm/libc/drivers/*) and you'll see SDDriver.c sdmmc.c and sdmmc.h and a diskio.c file. I haven't tried it out though.
@rogloh Any idea if compiler flags would be the same for C++ as they are for C.
Would really love to program in C++ if not bloated too much…
/me goes looking ... found this - https://github.com/ne75/p2llvm/blob/master/libc/drivers/sdmmc.c
Which is clearly copied from Flexspin's driver. Whoever ported it has added this comment on line 131:
Which is where I, temporarily at the time, hacked in the smartpin clock input select code when the distance is greater than three pins. It is necessarily convoluted. It mostly just takes up space.
@evanh do you maybe then have the incantation needed to use it?
I have no idea of anything about LLVM. Never touched it.
The only reason I got stuck into Flexspin is because I had Eric holding my hand.
I suspect there's a bunch of additional C++ flags available using LLVM. Whether you require them or not for your project is something you'll need to figure out. Try without first and see if it breaks, then debug. Usual stuff.
My custom library changes may have commented them out as the FatFS stuff was colliding with MicroPython's. You'll need to be able to build your own libc.a and libp2.a libraries (e.g. with Linux), and make the stock one. I believe that the normal driver build for libc.a had all the following files activated in p2llvm/libc/drivers/CMakeLists.txt before I shrunk it:
    add_library(drivers OBJECT
        SimpleSerial.c
        FdSerial.c
        terminal.c
        memory.c
        SDDriver.c
        diskio.c
        ff.c
        ffunicode.c
        sdmmc.c
    )
As far as using it goes I think it's almost automatically included via the InitIO chain if set up correctly. You might need to call mount() first. See the sdcard.c file in the libc/stdio/ source area and also look for uses of the _InitIO symbol in the tree, which is called at startup. Hopefully you can figure out how it works. Look in driver.h (two files) as well for more insights.
Also there are C++ examples (non-SD) in the examples folder, and the corresponding Makefile may help determine any extra or default C++ flags.
The problem with C++ will be there is no STL library currently built for this LLVM port, so while you can use C++ syntax you can't use the C++ libraries unless you find/port one yourself.
Maybe STL is unneeded bloat anyway?
I've almost got all the P2 instructions coded up now and all the aliases are in as well. When finished it should allow P2LLVM to be a full working assembler for the P2, as well as a C compiler supporting full inline assembly. There are a handful of remaining issues left to sort out that I'm still working on:
Further work:
In the disassembled output we could try to identify symbols for relative jumps in addition to the absolute jumps which were already done, and also show the effective address to help identify address targets for DJNZ, TJZ and other relative branches. Some of this capability may already be available generically in llvm-objdump and just need to be enabled somewhere.
I've already tried to cleanup the listing file a bit to output in a column aligned format so a tab character was added to separate the mnemonic from the arguments instead of a space, however I'd also quite like to have a way to get rid of the 0x hex prefixes and move to a "$" prefix like we already use in PASM2 syntax, but there are issues with parsing this if you want to feed the output assembly code back into the compiler (eg. if you compile to assembly with -S and then assemble the output file separately). It may make sense to have a special flag to control whether the assembly listing gets printed out containing a $ or 0x prefix. The problem is the parser also needs to read in the same format, so it would need similar flags for that as well to enable this. I believe there are some attribute capabilities in LLVM that could control this type of thing, so I'm still looking into that part. Eg. passing --disassembler-options in llvm-objdump.
Further patterns to make use of SETNIB/SETBYTE/SETWORD/GETNIB/GETBYTE/GETWORD instructions where possible instead of shifting and masking to extract/set sub fields in registers.
Add patterns to use TESTB/BITH/BITL if single bits are being tested or set/cleared in a register instead of anding/oring with a large sized (augs) mask for example. That is a common operation to do in C and it would make sense to make use of these P2 instructions.
Other P2 specific optimizations identified that can reduce code. I've seen conditional code cases where some register value is simply copied into itself for one condition while another value is moved into it for the opposite condition, and that's something that could be removed in an optimization pass IMO. Like how the relative JMP #0 was removed.
Also I believe that if we could prevent the generation of TJNZ and any other relative branches in hubexec code via a special optional flag it should be possible to make use of my external PSRAM code caching solution that was already working with flexspin if we compile to ASM first then process the assembly code to look for any external memory "far" jumps and recompile with my caching stuff included. We would only need to leave either the PA or PB register available for this because either a CALLPA or CALLPB needs to be used for performing the far jump indirectly. The external memory caching code would need to execute from LUTRAM or COGRAM. I recall previously it was only using about 25% of the LUTRAM for containing my external memory code. Right now in P2LLVM the LUT gets used for some builtin functions but could probably be shared with this external memory code, still leaving most of COG RAM free for any general Fcache use, apart from some state that needs to be maintained for managing external memory (<32 longs IIRC), so hopefully 432 longs or so free for a really large Fcache area. Some of the builtin code that was placed into LUTRAM is of questionable benefit whenever it just jumps out to execute hubexec code anyway. High speed memory copy/fills make sense to run from LUT and probably the shift/divide/multiply stuff for 64 bit integers. However for any straight line code, running from LUT doesn't speed up too much vs hubexec anyway and I don't believe we should put floating point code routines in there at all.
Think figured out that the build scripts are meant for a Mac...
The path to Clang is a Mac path. Kind of a pain to fix...
Also, had to install Clang. Guess that comes with Mac.
Can fix here in build.py:
    # setup cmake command
    cmake_cmd = [
        'cmake',
        '-G',
        'Unix Makefiles',
        '-DCMAKE_OSX_ARCHITECTURES=arm64',
        '-DLLVM_INSTALL_UTILS=true',
        '-DCMAKE_CXX_COMPILER=/usr/bin/clang++',
Also, think you have to do the submodule init and submodule update before running build.py.
Also, should probably copy over @rogloh updates to LLVM and P2 target after that and before build as well...
@rogloh Is libp2++ the C++ version of the C libp2 library?
If so, does this mean C++ compiler won't work right until similar fixes are made there?
I don't have any changes for libp2++. That seems to just be a header file for some C++ wrapper classes to control Smartpins. Nothing else is in that folder. Maybe Nikita had plans to put more C++ stuff in there over time but it seems it's just for control of Smartpins at the moment. If you write only standalone classes yourself or otherwise use third party code that don't require the C++ library classes and methods I'd expect the Clang/LLVM toolchain should still work with the P2 but it will be rather limited. For full C++ support you need to link in an implementation for a standard library. P2LLVM should presumably already include one according to https://releases.llvm.org/14.0.0/projects/libcxx/docs/ReleaseNotes.html.
Apparently you just need to compile your code with the standard libc++ like this (though I've not tried it):
clang++ -stdlib=libc++ my_file.cpp
Give it a go and see if it works...?
Tried to build on two more Ubuntu VMs. One worked and one didn’t….
The one that didn't actually crashes during compile.
Kind of strange because thought did them the same way…. Well, try try again
But also think will try on Mac mini to see if get a better experience.
@rogloh do you have a example of how to do inline assembly?
Yeah this example below.
You need to ensure that \n is added at the end of every line being assembled, and follow the assembly text with three colon-separated sections before closing the parenthesis of the overall "asm" block.
The colons at the end separate those register parameters that are output to, read from, or are otherwise clobbered. In this case below "cc" = flags and "r" = register, "=r" means writes to register, "+r" means reads/modifies/writes. There is also "i" used for immediate constants. If multiple parameters are read or written etc, they are separated by commas between the colons.
You will probably want to look up the "GCC Extended Asm" syntax to follow it fully. The name in square brackets preceding the "+r" or "r" string is the name used with the % symbol in the assembly code, while the name given in parentheses after the "r" or "+r", e.g. (timeout), is the C variable from the function's locals that will source this register initially. In this case the two names were kept the same, although I don't expect they need to be.
    static int wait_ready(    // 0:Card Busy, 1:Card Ready
        uint32_t timeout,     // 100/250/500 ms interval in sysclock ticks
        sd_config_t *bus)
    {
        unsigned PIN_CLK = bus->clk;
        uint32_t m_se2;
        register int ready;

        drvl_(PIN_CLK);    // enable CLK smartpin
        wypin(PIN_CLK, -1);
        timeout += _cnt();
        m_se2 = 0b110<<6 | bus->d0;    // trigger on high level - card ready
        ready = 0;    // Busy
        asm volatile (
            "setse2 #0\n"
            "setse2 %[m_se2]\n"
            "setq %[timeout]\n"
            "waitse2 wc\n"
            "wrnc %[ready]\n"
            : [ready] "+r" (ready)
            : [timeout] "r" (timeout), [m_se2] "r" (m_se2)
            : "cc"
        );
        return ready;
    }
@rogloh Thanks! Don't think I'd ever have figured that out ...
As I've learned more about LLVM I've added a few more things to my changes for P2LLVM.
For the relative branches/calls using 20 bit immediates (JMP/CALL), and using 9 bit immediates (TJNZ/TJZ etc), I've enabled printing a target address after applying the relative offset to the next program counter address (the absolute called addresses were already being resolved to symbol names where applicable). This really helps navigate a disassembled listing. They get printed in the comment string that follows the instruction. E.g <331b0> below is the target address for the TJNZ. I'd like to do something similar with AUGD/AUGS but it's trickier given they span different instructions. If those could also resolve to printing some global symbol name in the different segments of the ELF file it would be wonderful.
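The target calculation described above amounts to sign extending the offset field and adding it to the address of the following instruction. A minimal C sketch, assuming hubexec code with 4 byte instructions; the `scale` parameter is my own assumption to cover the difference between offsets counted in bytes and offsets counted in longs (typically 1 for the 20 bit form and 4 for the 9 bit form in hub memory), and the function names are hypothetical, not taken from the P2LLVM sources.

```c
#include <assert.h>
#include <stdint.h>

/* Sign extend the low `bits` bits of `field` to a 32 bit signed value. */
static int32_t sign_extend(uint32_t field, unsigned bits)
{
    assert(bits >= 1 && bits <= 31);
    uint32_t sign = 1u << (bits - 1);
    field &= (1u << bits) - 1u;
    return (int32_t)(field ^ sign) - (int32_t)sign;
}

/* Branch target = address of the next instruction + scaled signed offset. */
static uint32_t rel_target(uint32_t pc, uint32_t field, unsigned bits, uint32_t scale)
{
    int32_t offset = sign_extend(field, bits) * (int32_t)scale;
    return pc + 4u + (uint32_t)offset;
}
```

With these assumptions a 9 bit field of $1FF (that is, -1) on an instruction at $1000 resolves back to $1000, i.e. a branch to itself.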
The disassembled output now prints immediates in hex or in decimal correctly according to the --print-imm-hex option to llvm-objdump; previously it was printing in decimal only. Unfortunately decimal is still the default for LLVM as --no-print-imm-hex gets used until overridden. I think there is a push for later versions of LLVM to flip the default to use hex immediates. However the resolved target addresses are still conveniently printed in hex.
The parsing code also accepts $ prefixes correctly now for hex integers and registers (e.g. you can mov $3,4 or mov r0,$3 or mov r2, #$a1 etc, all are accepted). You can still use the "0x" prefix instead of $. It currently doesn't support % or %% immediates though when parsing numbers. The percentage character prefix is also meant to work for binary constants, but isn't right now for some weird reason I will need to look into, and is probably related to evaluating expressions which may not be very comprehensive at this time. For the %% combo, it's possible that I might be able to first look for two percentage signs and then try to convert the incoming decimal value into quaternary base numbers (base 4, like %%3210->0b11100100) but some of the integer parsing is in the common codebase, not the P2 target specific files, so TBD on that.
I've added the extra ANDC/ANDZ/ORC/ORZ/XORC/XORZ flags to the TESTP/TESTPN/TESTB/TESTBN instructions and code to encode and decode them specially just for these instructions.
    {"", 0x0}, {"wz", 0x1}, {"wc", 0x2}, {"wcz", 0x3},
    {"andz", 0x4}, {"andc", 0x5},
    {"orz", 0x6}, {"orc", 0x7},
    {"xorz", 0x8}, {"xorc", 0x9}
Also added all these condition flags aliases:
    // aliases
    {"if_nz_and_nc", 0x1}, {"if_gt", 0x1}, {"if_a", 0x1}, {"if_00", 0x1},
    {"if_z_and_nc", 0x2},
    {"if_ge", 0x3}, {"if_0x", 0x3},
    {"if_nz_and_c", 0x4}, {"if_10", 0x4},
    {"if_ne", 0x5}, {"if_x0", 0x5},
    {"if_z_ne_c", 0x6}, {"if_diff", 0x6},
    {"if_nz_or_nc", 0x7}, {"if_not_11", 0x7},
    {"if_z_and_c", 0x8}, {"if_11", 0x8},
    {"if_z_eq_c", 0x9}, {"if_same", 0x9},
    {"if_e", 0xa}, {"if_x1", 0xa},
    {"if_z_or_nc", 0xb}, {"if_not_10", 0xb},
    {"if_lt", 0xc}, {"if_b", 0xc}, {"if_1x", 0xc},
    {"if_nz_or_c", 0xd}, {"if_not_01", 0xd},
    {"if_z_or_c", 0xe}, {"if_le", 0xe}, {"if_be", 0xe}, {"if_not_00", 0xe}
I also added all the other alias strings for the MODCZ arguments which are recognized now and also printed out in code like this.
Here they are
    {"_clr", 0x0},
    {"_nc_and_nz", 0x1}, {"_nz_and_nc", 0x1}, {"_gt", 0x1},
    {"_nc_and_z", 0x2}, {"_z_and_nc", 0x2},
    {"_nc", 0x3}, {"_ge", 0x3},
    {"_c_and_nz", 0x4}, {"_nz_and_c", 0x4},
    {"_nz", 0x5}, {"_ne", 0x5},
    {"_c_ne_z", 0x6}, {"_z_ne_c", 0x6},
    {"_nc_or_nz", 0x7}, {"_nz_or_nc", 0x7},
    {"_c_and_z", 0x8}, {"_z_and_c", 0x8},
    {"_c_eq_z", 0x9}, {"_z_eq_c", 0x9},
    {"_z", 0xa}, {"_e", 0xa},
    {"_nc_or_z", 0xb}, {"_z_or_nc", 0xb},
    {"_c", 0xc}, {"_lt", 0xc},
    {"_c_or_nz", 0xd}, {"_nz_or_c", 0xd},
    {"_c_or_z", 0xe}, {"_z_or_c", 0xe}, {"_le", 0xe},
    {"_set", 0xf},
    }
Once I'm happy this is a consistently working feature set I'll post the updated files here, or ideally put it on github if I can get my act together.
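A quick way to read the 4 bit values in the condition and MODCZ alias tables above is as a truth table indexed by the C and Z flags: bit number (C*2 + Z) of the value says whether that flag combination passes. That is where names like if_00, if_10 and if_x1 come from. A small C sketch of that reading (the function name `cond_passes` is hypothetical, just for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true if 4 bit condition `cc` is satisfied for the given C and Z flags. */
static bool cond_passes(uint32_t cc, bool c, bool z)
{
    assert(cc <= 0xFu);
    unsigned idx = ((unsigned)c << 1) | (unsigned)z;  /* 0..3 selects one truth-table bit */
    return ((cc >> idx) & 1u) != 0;
}
```

For example 0xa (if_e, if_x1) passes whenever Z is set regardless of C, and 0x4 (if_10) passes only for C=1, Z=0.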
EDIT: actually that could take a while, so here are the current set of file changes to be copied into P2LLVM source if you want the extra P2 instructions and stuff I've done to date. NOTE: not everything is 100% tested.
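On the %% quaternary idea mentioned earlier in this post (reinterpreting an already parsed decimal value as base 4 digits), this is roughly what I have in mind, sketched in plain C. It is not in the posted files; `quaternary_from_decimal` is a hypothetical helper, and it assumes the common lexer has already turned the digits after %% into a decimal integer.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the decimal digits of `dec` as base 4 digits.
 * Returns the 32 bit value, or -1 if a digit is above 3 or there are more than 16 digits. */
static int64_t quaternary_from_decimal(uint64_t dec)
{
    uint64_t value = 0;
    unsigned shift = 0;
    do {
        unsigned digit = (unsigned)(dec % 10u);
        if (digit > 3u || shift >= 32u)
            return -1;                       /* not a valid 32 bit quaternary constant */
        value |= (uint64_t)digit << shift;   /* each base 4 digit is 2 bits */
        shift += 2;
        dec /= 10u;
    } while (dec != 0);
    return (int64_t)value;
}
```

So `quaternary_from_decimal(3210)` returns 0xE4, matching %%3210 -> 0b11100100. One caveat of this approach is that leading zeros are lost by the decimal parse, which doesn't change the value but would matter if digit count were ever checked.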
I've been able to add some more Patterns to the TableGen stuff for P2LLVM that take advantage of the P2 capabilities.
So far I'm using the following P2 instructions:
FLE, FLES, FGE, FGES - min/max on registers with other registers or immediate values
replaces code like this
if (x>y) x=y;
with
FLE x,y
and code like this
if (x<y) x=y;
with
FGE x,y
Works the same for signed values as well using FLES and FGES instead
BITL, BITH, BITNOT - individual bit set/clear/invert
replaces
reg &= ~(1<<n); with BITL reg, n
reg |= (1<<n); with BITH reg, n
reg ^= (1<<n); with BITNOT reg, n
In theory a group of consecutive bits (up to 16 total) could be changed in one go using
ranges of bits identified in the top 4 bits of the 9 bit index, but this is a lot of
extra work figuring it out for infrequent returns, so I'm leaving that out for now.
GETNIB, SETNIB - extracting/setting nibbles in registers
replaces
reg2 = (reg >> 4n) & $f; where n=0-7
with GETNIB reg2, reg, #n
and
reg2 = (reg2 & ~($f << 4n)) | ((reg & $f) << 4n)
with SETNIB reg2, reg, #n
GETBYTE, SETBYTE - extracting/setting bytes in registers
(same thing done as for GETNIB/SETNIB but for bytes)
GETWORD, SETWORD - extracting/setting words in registers
(same thing done as for GETNIB/SETNIB but for words)
ROLNIB, ROLBYTE, ROLWORD
replaces
D = (D << 4) | ((S >> 4n) & 15)
or
D = (D << 4) + ((S >> 4n) & 15)
with
ROLNIB D, S, #0 , for n=0, as well as n=1-7 with other right shifts of S
similar for bytes/words
ANDN
replaces
reg = reg & bigval;
with
ANDN reg, #val9bits
if val9bits fully fits in 9 bits and val9bits = ~bigval, saving the AUGS
I did notice in some cases that the optimizer's re-ordering of instructions can change the use of these patterns so maybe some more tweaking is needed to try to force its use more often.
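For reference, here is a host-side C model of what each of the instructions in the list above computes, written as the C idiom the compiler has to recognise. This is only a sketch of the intended semantics as I describe them above (function names are lowercase versions of the mnemonics, chosen for this example), not code from P2LLVM.

```c
#include <assert.h>
#include <stdint.h>

/* FLE/FGE: unsigned clamp; FLES/FGES: signed clamp. */
static uint32_t fle(uint32_t d, uint32_t s)  { return d > s ? s : d; }  /* if (x>y) x=y; */
static uint32_t fge(uint32_t d, uint32_t s)  { return d < s ? s : d; }  /* if (x<y) x=y; */
static int32_t  fles(int32_t d, int32_t s)   { return d > s ? s : d; }
static int32_t  fges(int32_t d, int32_t s)   { return d < s ? s : d; }

/* BITL/BITH/BITNOT: clear, set, or invert a single bit n (0..31). */
static uint32_t bitl(uint32_t d, unsigned n)   { return d & ~(1u << (n & 31u)); }
static uint32_t bith(uint32_t d, unsigned n)   { return d |  (1u << (n & 31u)); }
static uint32_t bitnot(uint32_t d, unsigned n) { return d ^  (1u << (n & 31u)); }

/* GETNIB/SETNIB/ROLNIB on nibble n (0..7). */
static uint32_t getnib(uint32_t s, unsigned n)
{
    assert(n < 8);
    return (s >> (4u * n)) & 0xFu;
}
static uint32_t setnib(uint32_t d, uint32_t s, unsigned n)
{
    assert(n < 8);
    return (d & ~(0xFu << (4u * n))) | ((s & 0xFu) << (4u * n));
}
static uint32_t rolnib(uint32_t d, uint32_t s, unsigned n)
{
    return (d << 4) | getnib(s, n);
}

/* ANDN: D & ~S, so a mask whose complement fits in 9 bits needs no AUGS. */
static uint32_t andn(uint32_t d, uint32_t s) { return d & ~s; }
```

The byte and word variants follow the same shape with 8 or 16 bit fields. As an example of the ANDN case, `x & 0xfffffe31` is the same as `andn(x, 0x1ce)`, and 0x1ce fits in a 9 bit immediate.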
Here's the set of patterns being used.
    // these patterns are not yet 100% validated

    // --- Signed Maximum ---
    def : Pat<(smax P2GPR:$rs1, P2GPR:$rs2), (FGESrr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
    def : Pat<(smax P2GPR:$rs1, imm:$rs2), (FGESri P2GPR:$rs1, imm:$rs2, always, noeff)>;

    // --- Signed Minimum ---
    def : Pat<(smin P2GPR:$rs1, P2GPR:$rs2), (FLESrr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
    def : Pat<(smin P2GPR:$rs1, imm:$rs2), (FLESri P2GPR:$rs1, imm:$rs2, always, noeff)>;

    // --- Unsigned Maximum ---
    def : Pat<(umax P2GPR:$rs1, P2GPR:$rs2), (FGErr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
    def : Pat<(umax P2GPR:$rs1, imm:$rs2), (FGEri P2GPR:$rs1, imm:$rs2, always, noeff)>;

    // --- Unsigned Minimum ---
    def : Pat<(umin P2GPR:$rs1, P2GPR:$rs2), (FLErr P2GPR:$rs1, P2GPR:$rs2, always, noeff)>;
    def : Pat<(umin P2GPR:$rs1, imm:$rs2), (FLEri P2GPR:$rs1, imm:$rs2, always, noeff)>;

    // Define a multiclass for Nibble patterns
    // n: Index (0-7)
    // s: Shift amount (n * 4)
    // m: The AND mask for GET (15)
    // sm: The inverted OR mask for SET (0xFFFFFFFF ^ (15 << s))
    multiclass NibblePats<int n, int s, int m, bits<32> sm> {
      // Pattern for GETNIB: (src >> shift) & 15
      def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
                (GETNIBrr P2GPR:$src, (i32 n), always)>;
      // Pattern for SETNIB: (D & mask) | ((S & 15) << shift)
      // Note: $src is tied to $D via Constraints in the instruction definition
      def : Pat<(or (and P2GPR:$D_in, (i32 sm)), (shl (and P2GPR:$S, (i32 15)), (i32 s))),
                (SETNIBrr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
    }

    // Pattern for ROLNIB: (D << 4) | (S & 15)
    def : Pat<(or (shl P2GPR:$D_in, (i32 4)), (and P2GPR:$S, (i32 0xf))),
              (ROLNIBrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
    // Pattern for ROLNIB: (D << 4) + (S & 15)
    def : Pat<(add (shl P2GPR:$D_in, (i32 4)), (and P2GPR:$S, (i32 0xf))),
              (ROLNIBrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;

    // Usage: defm : NibblePats<Index, Shift, GetMask, SetMask>
    defm : NibblePats<0, 0, 15, 0xFFFFFFF0>;
    defm : NibblePats<1, 4, 15, 0xFFFFFF0F>;
    defm : NibblePats<2, 8, 15, 0xFFFFF0FF>;
    defm : NibblePats<3, 12, 15, 0xFFFF0FFF>;
    defm : NibblePats<4, 16, 15, 0xFFF0FFFF>;
    defm : NibblePats<5, 20, 15, 0xFF0FFFFF>;
    defm : NibblePats<6, 24, 15, 0xF0FFFFFF>;
    defm : NibblePats<7, 28, 15, 0x0FFFFFFF>;

    // Special case for Nibble 0 (no shift)
    def : Pat<(and P2GPR:$src, (i32 15)), (GETNIBrr P2GPR:$src, (i32 0), always)>;

    multiclass BytePats<int n, int s, int m, bits<32> sm> {
      def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
                (GETBYTErr P2GPR:$src, (i32 n), always)>;
      def : Pat<(or (and P2GPR:$D_in, (i32 sm)), (shl (and P2GPR:$S, (i32 255)), (i32 s))),
                (SETBYTErr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
    }

    // Pattern for ROLBYTE: (D << 8) | (S & 0xff)
    def : Pat<(or (shl P2GPR:$D_in, (i32 8)), (and P2GPR:$S, (i32 0xff))),
              (ROLBYTErr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
    // Pattern for ROLBYTE: (D << 8) + (S & 0xff)
    def : Pat<(add (shl P2GPR:$D_in, (i32 8)), (and P2GPR:$S, (i32 0xff))),
              (ROLBYTErr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;

    defm : BytePats<0, 0, 255, 0xFFFFFF00>;
    defm : BytePats<1, 8, 255, 0xFFFF00FF>;
    defm : BytePats<2, 16, 255, 0xFF00FFFF>;
    defm : BytePats<3, 24, 255, 0x00FFFFFF>;

    // Special case for Byte 0 (no shift)
    def : Pat<(and P2GPR:$src, (i32 255)), (GETBYTErr P2GPR:$src, (i32 0), always)>;

    multiclass WordPats<int n, int s, int m, bits<32> sm> {
      def : Pat<(and (srl P2GPR:$src, (i32 s)), (i32 m)),
                (GETWORDrr P2GPR:$src, (i32 n), always)>;
      def : Pat<(or (and P2GPR:$D_in, (i32 sm)), (shl (and P2GPR:$S, (i32 0xffff)), (i32 s))),
                (SETWORDrr P2GPR:$D_in, P2GPR:$S, (i32 n), always)>;
    }

    // Pattern for ROLWORD: (D << 16) | (S & 0xffff)
    def : Pat<(or (shl P2GPR:$D_in, (i32 16)), (and P2GPR:$S, (i32 0xffff))),
              (ROLWORDrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;
    // Pattern for ROLWORD: (D << 16) + (S & 0xffff)
    def : Pat<(add (shl P2GPR:$D_in, (i32 16)), (and P2GPR:$S, (i32 0xffff))),
              (ROLWORDrr P2GPR:$D_in, P2GPR:$S, (i32 0), always)>;

    defm : WordPats<0, 0, 0xffff, 0xFFFF0000>;
    defm : WordPats<1, 16, 0xffff, 0x0000FFFF>;

    // Special case for Word 0 (no shift)
    //def : Pat<(and P2GPR:$src, (i32 0xffff)), (GETWORDrr P2GPR:$src, (i32 0), always)>;

    def ToInv9BitImm : SDNodeXForm<imm, [{
      uint32_t val = (uint32_t)N->getZExtValue();
      uint32_t invVal = ~val & 0x1FF; // Mask to exactly 9 bits
      return CurDAG->getTargetConstant(invVal, SDLoc(N), MVT::i32);
    }]>;

    def immANDN : PatLeaf<(i32 imm), [{
      // Check if the bitwise NOT of this 32-bit value fits in 9 bits
      uint32_t val = (uint32_t)N->getZExtValue();
      return isUInt<9>(~val);
    }]>;

    // Pattern: (and Reg, Imm) -> (ANDNri Reg, ~Imm) when ~Imm fits in 9 bits
    let AddedComplexity = 500 in {
      def : Pat<(and i32:$src, (i32 immANDN:$imm)),
                (ANDNri i32:$src, (ToInv9BitImm immANDN:$imm), always, noeff)>;
    }

    def imm_bit_mask : PatLeaf<(imm), [{
      uint32_t val = (uint32_t)N->getZExtValue();
      // Check if exactly one bit is CLEAR (and all others are SET)
      return isPowerOf2_64(~val);
    }]>;

    def to_bit_idx : SDNodeXForm<imm, [{
      uint32_t val = ~((uint32_t)N->getZExtValue());
      return CurDAG->getTargetConstant(Log2_32(val), SDLoc(N), MVT::i32);
    }]>;

    def imm_power_of_2 : PatLeaf<(imm), [{
      return isPowerOf2_64((uint32_t)N->getZExtValue());
    }]>;

    def to_log2 : SDNodeXForm<imm, [{
      return CurDAG->getTargetConstant(Log2_32(N->getZExtValue()), SDLoc(N), MVT::i32);
    }]>;

    let AddedComplexity = 500 in {
      // Pattern: Register &= ~(1<<n)
      def : Pat<(and i32:$src, imm_bit_mask:$imm),
                (BITLri i32:$src, (to_bit_idx $imm), always, noeff)>;
      // Pattern: Register |= (1<<n)
      def : Pat<(or i32:$src, imm_power_of_2:$imm),
                (BITHri i32:$src, (to_log2 $imm), always, noeff)>;
      // Pattern: Register ^= (1<<n)
      def : Pat<(xor i32:$src, imm_power_of_2:$imm),
                (BITNOTri i32:$src, (to_log2 $imm), always, noeff)>;
    }
I'd like to add TESTB as well as the instructions that can set Z on the result being followed by the conditional branch rather than separately testing the result register against 0 as an additional step. We may be able to utilize MUXQ as well for a certain pattern in C code.
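The immediate checks used by the ANDN and BITL/BITH/BITNOT patterns are easy to try out on a host machine. Below is a plain C restatement of the predicates and transforms, as a sketch only: the function names are made up here, and the real code uses LLVM's isUInt<9>, isPowerOf2_64 and Log2_32 helpers as shown in the patterns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* immANDN: does the complement of the AND mask fit in a 9 bit immediate? */
static bool fits_andn_imm(uint32_t mask) { return (uint32_t)~mask <= 0x1FFu; }

/* ToInv9BitImm: the immediate actually encoded in the ANDN instruction. */
static uint32_t andn_imm(uint32_t mask) { return ~mask & 0x1FFu; }

/* imm_power_of_2: exactly one bit set (BITH / BITNOT candidates). */
static bool is_single_bit_set(uint32_t v) { return v != 0 && (v & (v - 1u)) == 0; }

/* imm_bit_mask: exactly one bit clear (BITL candidate). */
static bool is_single_bit_clear(uint32_t mask) { return is_single_bit_set(~mask); }

/* to_log2 / to_bit_idx: index of the single set bit of a one-hot value. */
static unsigned bit_index(uint32_t one_hot)
{
    assert(is_single_bit_set(one_hot));
    unsigned n = 0;
    while ((one_hot >>= 1) != 0)
        n++;
    return n;
}
```

For the masks in the C snippet that follows, `fits_andn_imm(0xfffffe31)` is true with `andn_imm` giving $1ce, and `~(1<<23)` is a single-bit-clear mask with index 23.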
Here's the BITH/BITL/BITNOT/ANDN working with this silly C snippet:
    {
        uint32_t crc = 0;
        uint32_t test = *(uint32_t *)0x323;
        crc += test & 0xfffffe31;
        crc*=3;
        crc &= ~(1<<23);
        crc+=test;
        crc |= (1<<22);
        crc+=test;
        crc ^= (1<<12);
    }
here's the listing
And ultimately I'd like to see this:
just become a
rdlong r0, ##$323 (2 instructions)
This sounds great @rogloh
Putting your version on GitHub sounds like great idea.
Guess that would be a fork.
But @n_ermosh did post a few months ago …. Maybe he would take pull request. If can figure that out…