Ideally it should be that, which would save 3 instructions and 6 clocks vs what was actually compiled in the code above. That can add up when there are function calls with lots of arguments being set up, including immediates. I think it is done this way so that the stack is adjusted atomically, in one go, when the call is ready to be made, in case prior arguments and locals referenced off the SP are needed while building the new arguments. But it's very inefficient in code size IMO, and this is where having a BP could come in handy.
I sort of wonder if the called code's prologue and epilogue could best be done in COGRAM/LUTRAM, so that the impact of the extra code size overhead is reduced. We could easily pass the number of registers to save in PA with something like this:
called function prologue: callpa #5, enter_handler
called function epilogue: callpa #5, leave_handler
Then in COG/LUT the enter_handler could save the registers with a single block write. The problem would be when there are gaps in the set of registers to be saved/restored. But it could simply be brute forced to save the whole range anyway, as it's only a single clock per extra register saved, rather than dealing with masks and looking for ranges to save etc.
The leave_handler could potentially also pop the return address from the stack and return to the caller directly bypassing the need for a RETA at the end of the function.
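To make the brute-force idea a bit more concrete, here is a minimal C model of what the two handlers would do. It is only a sketch under assumptions: `regs`, `stack`, `sp`, `NUM_REGS`, `FIRST_SAVED` and `STACK_LONGS` are made-up names for illustration, the stack is assumed to grow upward, and the real thing would be a handful of PASM2 instructions in COG/LUT RAM rather than C.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NUM_REGS    32  /* hypothetical size of the register file */
#define FIRST_SAVED 0   /* hypothetical index of the first callee-saved register */
#define STACK_LONGS 64  /* hypothetical stack size in longs */

static uint32_t regs[NUM_REGS];     /* models the registers r0, r1, ... */
static uint32_t stack[STACK_LONGS]; /* models the stack (assumed to grow upward) */
static unsigned sp;                 /* index of the next free stack long */

/* Models "callpa #count, enter_handler": copy `count` contiguous registers
   to the stack in one block, gaps included, and adjust SP once. */
static void enter_handler(unsigned count)
{
    assert(FIRST_SAVED + count <= NUM_REGS && sp + count <= STACK_LONGS);
    memcpy(&stack[sp], &regs[FIRST_SAVED], count * sizeof regs[0]);
    sp += count;
}

/* Models "callpa #count, leave_handler": restore the same block and
   release the stack space. */
static void leave_handler(unsigned count)
{
    assert(count <= sp);
    sp -= count;
    memcpy(&regs[FIRST_SAVED], &stack[sp], count * sizeof regs[0]);
}
```

Calling `enter_handler(5)` on entry and `leave_handler(5)` on exit brings the first five registers back intact even if the function only dirtied some of them, which is the trade-off being suggested: a few wasted transfers in exchange for no mask handling.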
Just added a bunch of the previously unused P2 instructions to the P2LLVM code in P2InstrInfo.td. A few commented-out instructions at the bottom are still TBD as they are a little more involved. I've also added a couple of extra instruction formats to P2InstrFormats.td, like P2InstNOARGS for the few instructions which take no arguments, as the code doesn't seem to have that yet. If anyone needs these for their own use, they are copied here for now, but beware they are not all tested, so if there are bit errors below they could assemble to bad code.
Here are the new classes for P2InstrFormats.td (one for no arguments, one for the special MODCZ arguments):
class P2InstNOARGS<bits<28> op, dag outs, dag inputs, string asmstr> :
    P2Inst<21, outs, !con(inputs, (ins P2Cond:$cc)), !strconcat("$cc\t", asmstr)> {
    bits<4> cc;

    let Inst{31-28} = cc;
    let Inst{27-0} = op;

    let TSFlags{5} = 0;
    let TSFlags{6} = 0;
    let TSFlags{7} = 0;
    let TSFlags{10-8} = 0;  // s is always operand 0
    let TSFlags{13-11} = 0; // d is always operand 0
    let TSFlags{16-14} = 0; // n is always operand 0
}
class P2InstCZ4C4Z<bits<7> op, bits<9> s, dag outs, dag inputs, string asmstr> :
    P2Inst<20, outs, !con(inputs, (ins P2Cond:$cc, P2Effect:$cz)), !strconcat("$cc\t", asmstr, " $cz")> {
    bits<20> n20;
    bits<4> cc;
    bits<2> cz;
    bits<4> cccc;
    bits<4> zzzz;

    let Inst{31-28} = cc;
    let Inst{27-21} = op;
    let Inst{20-19} = cz;
    let Inst{18-17} = 0b10;
    let Inst{16-13} = cccc;
    let Inst{12-9} = zzzz;
    let Inst{8-0} = s;

    let TSFlags{5} = 0;
    let TSFlags{6} = 0;
    let TSFlags{7} = 0;
    let TSFlags{10-8} = 0;  // s is always operand 0
    let TSFlags{13-11} = 0; // d is always operand 0
    let TSFlags{16-14} = 0; // n is always operand 0
}
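Since the warning above is about possible bit errors, one cheap sanity check is to pack the same fields by hand and compare against what the assembler emits. The C sketch below simply mirrors the `Inst{...}` assignments of the two classes. The function names `encode_noargs` and `encode_cz4c4z` are hypothetical helpers, not part of P2LLVM, and the field values used with them are illustrative and not checked against the P2 instruction set.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors P2InstNOARGS: cc in bits 31-28, fixed opcode in bits 27-0. */
static uint32_t encode_noargs(uint32_t cc, uint32_t op)
{
    assert(cc < 16 && op < (1u << 28));
    return (cc << 28) | op;
}

/* Mirrors P2InstCZ4C4Z: cc 31-28, op 27-21, cz 20-19, constant 0b10 in
   bits 18-17, cccc 16-13, zzzz 12-9, s 8-0. */
static uint32_t encode_cz4c4z(uint32_t cc, uint32_t op, uint32_t cz,
                              uint32_t cccc, uint32_t zzzz, uint32_t s)
{
    assert(cc < 16 && op < 128 && cz < 4);
    assert(cccc < 16 && zzzz < 16 && s < 512);
    return (cc << 28) | (op << 21) | (cz << 19) | (0x2u << 17)
         | (cccc << 13) | (zzzz << 9) | s;
}
```

For example, `encode_cz4c4z(0xF, 0x6B, 3, 3, 5, 0x6F)` packs to `0xFD7C6A6F`, which can then be compared against the word shown in the objdump for the corresponding line of inline assembly.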
@Rayman said:
@rogloh Nice! It is just inline assembly where this could go wrong, right? Think so.
Yes. C doesn't use these, as there is no pattern defined that uses these instructions. The disassembler and assembler should be able to take these and work with these formats. I have changed the tab spacing a little, which works with the rest of my code changes but may upset the output format slightly for these in isolation in the objdump until the remainder of the code changes I am working on gets applied, but that's cosmetic only. Also, the code above has been fixed for some typos. It's now building ok on my Mac.
Just some musings on this...
I've used the forum as a notebook at times. Both for laying out the reasons/steps and for later reference.
Update: I just tried a few of these new instructions for sanity (ignore the modcz stuff which I'm still messing with):
asm volatile (
    //"modc _nc_and_z wc\n"
    //"modz _nc_and_nz wz\n"
    //"modcz $3,$5 wcz\n"
    "addpix r0, r2\n"
    "mulpix r0, r2\n"
    "mulpix r0, r2\n"
    "allowi\n"
    "stalli\n"
    "trgint1\n"
    "setluts #4\n"
    "setpat r3,r5\n"
    "fblock r3,#4\n"
    "cmpm r3, #4\n"
    "rczl r3 wc\n"
    "sca r3,r19\n"
    "negc r3, #22 wc\n"
    "jint #3\n"
    "testn r3,r4\n"
...and I get this disassembly: