I think the fundamental issue is that there are very good tools already available for the P2, and people don't realize just how good they are, or even what's out there.
For example, I've heard lots of people saying how cool it will be to eventually have a self-hosted development environment. But @"Peter Jakacki" has had a self-hosted development environment for P2 since literally before there was silicon. I'm constantly amazed at all of the things he does with FORTH on the P2, but people don't seem to be aware of it.
People have been waiting for Chip's interpreter to run Spin code on the P2. But fastspin has been able to do that for a long time now.
People have wanted a PropTool-like editor to work with P2 tools, but there are already several solutions that can do that.
There are plenty more examples. I don't want to pick on anyone -- I've been guilty of overlooking good existing P2 tools myself. And sometimes people just want to experiment and write their own tools. There's nothing wrong with that; but if you're just looking for a tool to use for work, there's a good chance that something already available for the P2 can do at least part of what you want. If you want to program the P2, jump right in!
I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
Perhaps you missed the benchmark I posted above, but the RISC-V toolchain compiled P2 binary runs the Dhrystone benchmark more than twice as fast as all of the P2 "native" compilers, including p2gcc, Catalina, and fastspin, all of which use hubexec (for that matter riscvp2 uses hubexec too, once the RISC-V opcodes are translated).
That just shows that the GCC optimizer more than makes up for the not-so-large recompiler overhead. Not to bash on fastspin and Catalina, but they just don't have the same level of optimization work put into them. A well-made native LLVM or GCC backend would be fastest, but who's going to invest lots of time to write that? Until Parallax decides to fund development or someone decides to do it because reasons, the RISC-V emulation seems like a good solution.
(I wonder how P2 compares to actual RISC-V hardware...)
But as anyone who has ever used a JIT recompiling emulator for some old machine on their PC can tell you, the real issue isn't steady-state hot-loop performance, but recompilation delay. I don't have a P2, so I can't test it myself, but try testing the different compilers for latency when some code is executed for the first time (for example: wait until a pin reads high, then call a function that drives another pin high, and measure the delay using a digital scope or some PASM code in another cog). That may be an issue for people with a hard realtime requirement, which IIRC is one of the Prop's selling points.
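Just to make the idea concrete, here's a minimal, untested sketch of such a first-execution latency test in C. The pin numbers are arbitrary, and the _pinr()/_pinh() helpers are assumed to come from a propeller2.h like fastspin's; substitute whatever your toolchain actually provides:

#include <propeller2.h>   // assumed: provides _pinr() and _pinh()

#define TRIGGER_PIN 16    // arbitrary pin choices for this sketch
#define ECHO_PIN    17

// Deliberately called for the FIRST time only after the trigger fires,
// so a JIT recompiler pays its translation cost inside the measured interval.
void respond(void)
{
    _pinh(ECHO_PIN);      // drive the echo pin high
}

void main(void)
{
    while (_pinr(TRIGGER_PIN) == 0)
        ;                 // busy-wait for the trigger pin to read high
    respond();            // measure trigger->echo delay on a scope, or
                          // timestamp both edges from another cog
}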
In the end, C performance doesn't even matter that much. Beginners and the education crowd don't really care too much, and advanced users will want to write critical code in PASM anyway.
PropWare's example code for mounting a FAT32 partition on an SD card, reading the contents of one file, and writing those contents to another file, would probably be a very good way to compare the performance of some procedural code across different C++ compilers. PropWare's FAT32 library is written entirely in C++, with the only PASM being the low-level SPI bit banging routines.
The test could be done with other FAT32 libraries as well, such as Kye's FAT engine, but PropWare's version uses the least amount of assembly. This isn't necessarily a good thing for end-users, but it is a good thing if you're trying to compare compiler performance.
Of course... running this test with PropWare will require a fair bit of work to get PropWare building under p2gcc and riscv. Maybe in a couple weeks...
@rogloh: why are you re-inventing the wheel here? I mean sure, try it with a different compiler, but why not at least start with my existing micropython port to the P2 (https://github.com/totalspectrum/micropython)? Trust me, just because I happened to compile it with a RISC-V compiler doesn't give it cooties.
I solved the serial buffer problem by running serial code in another COG (see BufferSerial.spin2, translated by spin2cpp into BufferSerial.c). There's code for VGA and USB support (also running in separate COGs). There are some P2 support objects (modpyb.c) which shouldn't be too hard to port to p2gcc. In fact I plan to convert them to use the standard propeller2.h header file that was discussed elsewhere in the forum.
Ok. I only just started on the serial issue yesterday, and I had been given the impression that what was available for P2 Micropython so far was purely a binary, hence the need to look into PropGCC-based tools in the first place. I'll take a look at what you've done on this when I can, so thanks for the link to your source code.
If this effort goes further, it should be very interesting to see how performance compares between risc-v translation and native code. From what I've seen so far by disassembling the output, I get the feeling the key issue with native P1 COG-model code translated to equivalent P2 code via the p2gcc tools, but now running in hub-exec mode, is just how frequently it performs rdlong/wrlong operations to load and store data; those compete with, and stall, the hub-exec pipeline. It also may not use register optimisations as fully as a truly optimized port of GCC to the P2 would. Perhaps your risc-v + JIT approach gains performance over natively compiled P1 code in exactly that area, especially if you are also caching code/data in the LUT. It still seems counterintuitive that code translated from another instruction set should run faster than native code doing the same work, but I guess it comes down to how limited the native compiler implementation really is versus what you are doing, along with the JIT component possibly eliminating some unnecessary code.
So what you're saying is that a RISC-V program for the P2 runs faster than a PASM program on the P2. Well, I guess the instruction set on the P2 is not very efficient, and only a small subset of the instructions should be used and the rest avoided.
Now I'm really confused.
I just want to compile my existing C programs for the P1 on a P2 and have them run at the processor speed so my timings are correct for that platform.
So what you're saying is that a RISC-V program for the P2 runs faster than a PASM program on the P2. Well, I guess the instruction set on the P2 is not very efficient, and only a small subset of the instructions should be used and the rest avoided.
Well, if you meant me, iseries, I'm certainly not saying that. In fact, if that is ever true, it must be because the risc-v compiler implementation is far superior to that of the native one. It would pretty much make a mockery of any real P2 C compiler if that were the case ALL the time. For now, at least according to some of ersmith's benchmarks, there appears to be a window in time where the P1 compiler tools, translated to P2, may not generate code as efficient as risc-v opcode translation plus extras. But I'd also like to see this for myself with micropython. It still seems strange to me too. Maybe PropGCC was never very good at outputting register-optimised code?
I just want to compile my existing C programs for the P1 on a P2 and have them run at the processor speed so my timings are correct for that platform.
Compare the different options for yourself. If there is some timing-critical area, then add checks around it to verify the accuracy of each run. It takes work to implement and analyse, but there is fun and satisfaction in the results of verification too. And it can be ported back to the Prop1, and also kept as a backup for later checks when there might be other questions.
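For example, a minimal sketch of such a check under PropGCC, assuming propeller.h's CNT and CLKFREQ macros and a made-up 1 ms budget:

#include <stdio.h>
#include <propeller.h>   // assumed: provides the CNT and CLKFREQ macros

// Bracket the timing-critical region with system counter reads and flag
// any run that exceeds its cycle budget.
void timed_section(void)
{
    unsigned int start = CNT;

    /* ... timing-critical code under test ... */

    unsigned int elapsed = CNT - start;
    if (elapsed > CLKFREQ / 1000)        // example budget: 1 ms of clocks
        printf("overrun: %u ticks\n", elapsed);
}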
Eric, I am both fascinated and encouraged by your results. This is absolutely fantastic. I cannot wait to try micropython, and also to take a look under the hood at how the RISC-V JIT is working.
Well done Eric
I was looking at the PASM code generated by the PropGCC tool, and I see quite a lot of cases where function prologues and epilogues are candidates for performance improvements to further take advantage of the P2. I'm wondering if it would be possible to post-process the spin2 output generated by Dave Hein's s2pasm to add some further P2-specific optimisations to help speed things up.
For example, consider this simple C function...
void mp_load_method(mp_obj_t base, qstr attr, mp_obj_t *dest) {
    DEBUG_OP_printf("load method %p.%s\n", base, qstr_str(attr));

    mp_load_method_maybe(base, attr, dest);

    if (dest[0] == MP_OBJ_NULL) {
        // no attribute/method called attr
        if (MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_TERSE) {
            mp_raise_msg(&mp_type_AttributeError, "no such attribute");
        } else {
            // following CPython, we give a more detailed error message for type objects
            if (mp_obj_is_type(base, &mp_type_type)) {
                nlr_raise(mp_obj_new_exception_msg_varg(&mp_type_AttributeError,
                    "type object '%q' has no attribute '%q'",
                    ((mp_obj_type_t*)MP_OBJ_TO_PTR(base))->name, attr));
            } else {
                nlr_raise(mp_obj_new_exception_msg_varg(&mp_type_AttributeError,
                    "'%s' object has no attribute '%q'",
                    mp_obj_get_type_str(base), attr));
            }
        }
    }
}
It generates the following PASM2, after first compiling to P1 PASM with propeller-elf-gcc and then translating to PASM2 with s2pasm:
Notice all the wrlongs and rdlongs at the beginning and end of this assembled C function: they slow down the P2 by running each push/pop in its own hub window cycle (and competing with hub exec too). Perhaps there is some scope to recognize these fixed blocks of code, which are repeated in many places throughout the spin2 output files, and compress them down to something that runs faster. See the concept below... I'm thinking this could be simple text-file line-translation processing, not necessarily done at the P1/P2 assembler level, because symbols still get used later to compute the addresses; so if some code is removed or changed in these blocks, it shouldn't have a negative effect on the final P2 assembly/link step. You would just need a new temporary register r15, contiguous with r14, which gets patched with the LR (register $1f6) before the block transfer happens. That extra register does not exist in the VM at this point.
The prologue, as generated:

    sub     sp, #4
    wrlong  r12, sp
    sub     sp, #4
    wrlong  r13, sp
    sub     sp, #4
    wrlong  r14, sp
    sub     sp, #4
    wrlong  lr, sp
    sub     sp, #12

becomes:

    sub     sp, #16         ' 16 is computed by number of longs pushed * 4
    mov     r15, lr         ' r15 is a new temporary link register for this transfer
    setq    #3              ' 3 is (4 transfers - 1)
    wrlong  r12, sp         ' r12 is the first register mentioned in the original code (can vary)
    sub     sp, #12         ' same as the last line of the normal code

... and the epilogue:

    add     sp, #12
    rdlong  lr, sp
    add     sp, #4
    rdlong  r14, sp
    add     sp, #4
    rdlong  r13, sp
    add     sp, #4
    rdlong  r12, sp
    add     sp, #4

becomes (basically the reverse of the above sequence):

    add     sp, #12
    setq    #3
    rdlong  r12, sp
    mov     lr, r15         ' restore real link register from temp r15
    add     sp, #16
I don't think this ever adds any extra instructions; even when only one register plus the link register are preserved/restored (ahead of calling another function that could clobber them), it should take the same number of instructions. And in many cases more than just the 3 registers shown above are preserved, gaining further advantage. For example, when I grepped minimal Micropython code for these patterns, I found the following situations:
Registers preserved    Cases
      6 + LR               0
      5 + LR              83
      4 + LR              21
      3 + LR              47
      2 + LR              52
      1 + LR              53
There's potentially a lot of cycles (and some code space) to be saved here if this approach works, especially whenever functions get called in loops, which happens a lot in C. Can such SETQ burst transfers still work in hub exec mode, or do they conflict with the normal pipeline reads, I wonder? I'm hoping it's doable, but I don't know yet, and I'm not sure of any other side effects that might preclude doing this.
One thing that is different is that the register save order gets reversed, but I don't think this is an issue unless it is somehow relied upon in a stack-frame situation by nested functions. In this case it only seems to be saving registers it can re-use, not function parameters, so I think it is okay. Worst case, we could just flip the register order in the VM to compensate.
If this works it could be some nice low hanging fruit.
Roger.
The LMM mode in s2pasm will convert the LMM PUSHM and POPRET instructions into SETQ/RDLONG/WRLONG instructions. There was a slight improvement in code speed.
Yes Brian, using PTRA for SP would be handy but I know that is a far more extensive change.
Thanks Dave, I haven't been able to get LMM going yet, but I might give it another try. IIRC, I had issues compiling/linking properly last time I used it, but I might have been doing something incorrectly.
Does anyone know if the register re-use/optimization features in GCC depend on the particular CPU port (propeller-elf-gcc), or are they more of a generic capability of the compiler itself whenever a general register set is available to select from? You'd think such optimisations could be common across "typical" platforms. I'm no compiler expert, but one day I might try to take a look at that source code too. I've heard it's rather complex, so I've shied away from it so far.
@"Dave Hein" Your p2gcc tools are actually pretty good at what they do but in finally getting Micropython to compile natively for P2 with some tweaks for scalability and alignment I have found a fatal issue with the linking stage that causes a Micropython crash during execution. This crash is memory layout and link order dependent and therefore somewhat randomized when code is recompiled with different sizes making it much harder to pinpoint and debug. Luckily I encountered a stable enough setup where I could crash it on demand and then debug it from there. It turns out that the crashing function's code in hub memory was actually being clobbered when a particular uninitialized data structure member was written at runtime. This data structure's symbol was identified as a uninitialized data symbol whose address gets incorrectly mapped to the address of the first module that ever references when linked using the p2link utility, and it is not stored at the correct hub address in the module where it is defined as it should be.
The best way to describe it is to provide the simplest example that triggers it. You just need two source files: one defining uninitialized data, and the other referencing it.
data.c:
int blank[1000]; // uninitialized array
main.c:
#include <stdio.h>
#include <unistd.h>
extern int blank[1000];
void clear()
{
    for (int i = 0; i < 1000; i++)
        blank[i] = 0;
}

void main(void)
{
    sleep(1);       // add some delay for terminal to be ready
    puts("Clearing");
    clear();
    puts("Done");
    while (1);
}
If you compile and link in the order where the module holding the uninitialized data precedes any module that references it, it works after downloading to the P2.
But if you link the two files in the other order, it hangs: the code in main.c gets overwritten, because the symbol for "blank" gets mapped into the code space of main.o, the module where it is first referenced, rather than to its real allocated data space in data.o, where it is defined.
Is this bug fixable? If it were, I believe Micropython would finally run stably and natively on the P2: everything else I've tested seems to work okay when regular initialized data is accessed, and I can already run quite a few Python things in limited cases where the corrupted functions don't get called. It is only uninitialized data that has this issue during linking, and it makes everything unreliable whenever uninitialized data blocks are referenced from the C code.
I've attached the files in a zip so you can easily reproduce this. It includes a trivial Makefile that works if you copy it into a test folder under the p2gcc folder and already have the tools in your path (or just modify it accordingly). I've also logged the linker output and disassembly to ".good" and ".bad" files. You can see where the "blank" variable is placed during linking; in the bad case it is at the wrong address.
Looking further at your tools, this is the area of the p2link.c code that seems to be doing the damage. It simply takes the first module it encounters that references the uninitialized data (OTYPE_UNINIT_DATA) as the actual address of that data, which is wrong.
<snipped from p2link.c:MergeGlobalVariables(prev_num)>
...
for (i = prev_num; i < numsym; i++)
{
    if (symtype[i] == OTYPE_UNINIT_DATA)
    {
        j = FindSymbol(symname[i], OTYPE_INIT_DATA);
        if (j < 0) j = FindSymbol(symname[i], OTYPE_GLOBAL_FUNC);
        if (j < 0) j = FindSymbol(symname[i], OTYPE_UNINIT_DATA);
        if (debugflag)
        {
            printf("Found variable %s at %8.8x with type = %d\n", symname[i], symvalue[i], symtype[j]);
            printf("i = %d, j = %d\n", i, j);
        }
        // Use address from previous object if found
        if (j >= 0 && j != i)
        {
            if (debugflag)
            {
                printf("Use address from entry %d at address %8.8x\n", j, symvalue[j]);
                printf("Changing %8.8x to %8.8x\n", symvalue[i], symvalue[j]);
            }
            symvalue[i] = symvalue[j];
        }
    }
}
I really hope there is a way to resolve this bug, so Micropython can work natively on the P2 for comparison with the other methods of getting it to run, such as risc-v.
Cheers,
Roger.
Update:
As a temporary workaround/possible hack, I have modified the p2asm.c tool from this:
if (s->scope == SCOPE_GLOBAL_COMM)
{
    if (debugflag) printf("GLOBAL COMM %8.8x %s\n", s->value2, s->name);
    WriteObjectEntry(OTYPE_UNINIT_DATA, s->objsect, s->value2, s->name);
}

to this:

if (s->scope == SCOPE_GLOBAL_COMM)
{
    if (debugflag) printf("GLOBAL COMM %8.8x %s\n", s->value2, s->name);
    WriteObjectEntry(OTYPE_INIT_DATA, s->objsect, s->value2, s->name);
}
and it seems to help. Only the module defining the uninitialized variable matches the GLOBAL_COMM scope; the modules that merely reference it match GLOBAL_UNDE instead, so this may be a way to differentiate them. I'm not sure if there are other side effects in doing this, but my related crash has stopped for now.
I looked at it a bit, and the problem is that blank[] is not allocated any space in main.c because of the "extern" keyword used when declaring it. When I link uninitialized variables, I use the address from the first object file. If main.o is the first object module, I use the dummy address 0x400 for blank. There is code at 0x400, so it gets clobbered when blank is initialized. If you remove the "extern" keyword, it will work fine.
I believe your work-around forces the address in data.c to be used. However, this will generate a warning if an initialized version of the variable actually exists in another object file. To fix this properly, I need to create another symbol type for symbols that don't have space allocated. These would then be linked to either uninitialized or initialized data. It will probably be a while before I get to it, but your work-around should do in the meantime. The other possibility would be to remove the "extern" keyword, or add "-D extern" to the command line when compiling.
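For what it's worth, a hypothetical sketch of that fix inside MergeGlobalVariables(), reusing the symtype/symname/symvalue/FindSymbol names from the snippet quoted above (OTYPE_EXTERN_REF is an invented name for the new no-space-allocated symbol type):

// Sketch only: references that own no storage get their own symbol type
// and are always resolved against a module that actually allocates space.
for (i = prev_num; i < numsym; i++)
{
    if (symtype[i] != OTYPE_EXTERN_REF) continue;
    j = FindSymbol(symname[i], OTYPE_INIT_DATA);
    if (j < 0) j = FindSymbol(symname[i], OTYPE_UNINIT_DATA);
    if (j >= 0)
        symvalue[i] = symvalue[j];   // use the defining module's address
    else
        printf("Unresolved external symbol %s\n", symname[i]);
}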
Thanks for looking at this issue, Dave. My workaround still helps for now until a full fix goes in. Header files can declare variables with "extern" so that other modules can access a .c file's variables, which is exactly what Micropython's mpstate.[ch] files do. Ideally there should be no reason for external variables to be allocated in the modules that access them, when they are only declared through included header files and defined in another module. So I agree: another symbol type (or flag), propagated from the assembler to the linker via the .o file, is needed to resolve this issue.
By the way, I'd like to say your p2gcc tools are very useful -- thanks for coming up with them.
I did look more at the prolog/epilog stuff, and over in the Micropython thread I mentioned that in COG mode I could speed up some benchmarking code by about 10%, and also reduce its size by about 10%, when I swapped out the prologs/epilogs for P2-native burst transfers instead of saving/restoring registers one by one, which is far less efficient.
I'd used some Perl/sed scripts that identified these frequently used blocks of code and replaced them with the faster versions, but it might be something that s2pasm could do at some point to optimise its output for the P2. I know the LMM mode was meant to do this too, but I found s2pasm and p2asm don't work right in LMM mode: they complain about missing FCACHE stuff (probably just needs some real VM support in a different prefix.spin2 file), and the assembler doesn't like the format that propeller-elf-gcc outputs in its LMM mode, with all those ".compress default" lines.
I'm thinking there may be scope to improve performance even further by identifying more native P2 optimizations and incorporating them in s2pasm. Getting rid of all the extra constant pointer loads (when -mcog is used) and converting them into simple AUGS + RDLONG instructions would be great too, saving further hub cycles.
By the way, the major data alignment bug fix (which I resolved in s2pasm) was to put in this replacement function, so the data bytes correspond to where the C code thinks they are:
void ProcessZero(void)
{
    int len;

    sscanf(&buffer1[6], "%d", &len);
    if (len & 1)
        fprintf(outfile, "\tbyte\t0%s", NEW_LINE);
    if (len & 2)
        fprintf(outfile, "\tword\t0%s", NEW_LINE);
    len &= ~3;
    if (len)
        fprintf(outfile, "\tlong\t0[%d]%s", len/4, NEW_LINE);
}
I suspect the same thing should probably be done to ProcessComm() as well, though for the .comm directive GCC supplies an extra alignment argument, which could (should?) be used too. I haven't hit a problem with that one (yet), unlike the big one I first encountered with the .zero directive, which crashed code regularly.
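Something along these lines, perhaps; purely a sketch that reuses the buffer1/outfile/NEW_LINE conventions from ProcessZero above and guesses at a ".comm symbol,size,alignment" line format, so the real p2asm internals may well differ:

// Hypothetical sketch only: reserve .comm data with its exact byte size
// (padded up to the requested alignment) instead of rounding everything
// to longs.
void ProcessComm(void)
{
    char name[128];
    int len, align;

    // assumed directive format: ".comm symbol,size,alignment"
    int n = sscanf(&buffer1[6], " %127[^,],%d,%d", name, &len, &align);
    if (n < 2) return;                       // malformed directive
    if (n < 3) align = 4;                    // no alignment given; default to long
    len = (len + align - 1) & ~(align - 1);  // pad size up to the alignment
    fprintf(outfile, "%s%s", name, NEW_LINE);
    if (len & 1)
        fprintf(outfile, "\tbyte\t0%s", NEW_LINE);
    if (len & 2)
        fprintf(outfile, "\tword\t0%s", NEW_LINE);
    len &= ~3;
    if (len)
        fprintf(outfile, "\tlong\t0[%d]%s", len/4, NEW_LINE);
}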
There shouldn't be any FCACHE references in LMM mode, since p2gcc specifies -mno-fcache when it uses -mlmm. It appears you are using your own script files, so maybe you need to add -mno-fcache to them. However, if the compiler is generating .compress directives, that would cause a problem.
I wasn't aware of the data alignment problem. I suppose there could be an issue if the exact size is required. I was assuming I could get by with using longs for bytes and words, even though it does waste some memory.
Aha, that was it Dave. I didn't have the -mno-fcache option enabled in my Makefile. I've added it, and defined __TMP0 as a long in prefix.spin2 (for some reason I also needed that register name put in there), and it now fully compiles and links in LMM mode, even though s2pasm is rather "chatty". The build crashes my Micropython though, for some reason; I'll have to debug it further when I get some time. [UPDATE]: I know why it is crashing: I had reversed the register order, and I need to undo that to match how registers are meant to be pushed/popped in LMM mode.
The data alignment was a problem for me basically because sizeof(structure) in C didn't know about the extra padding longs being added to its elements, so code started reading from incorrect offsets in the data structure. It might have been with an array of structs, I can't recall precisely now, but this change fixed it.
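To illustrate the failure mode (a made-up example, not the actual Micropython structure):

#include <stdio.h>

// GCC lays this out as: tag at offset 0, len at offset 2, sizeof == 4.
// If the assembler instead reserves a full long for each member, the data
// really occupies 8 bytes per element, but compiled code still indexes
// table[i] at i*4 bytes -- so every element past the first is misread.
struct rec {
    char  tag;
    short len;
};

struct rec table[3];

void main(void)
{
    printf("sizeof(struct rec) = %u\n", (unsigned)sizeof(struct rec));
    printf("&table[1] - &table[0] = %u bytes\n",
           (unsigned)((char *)&table[1] - (char *)&table[0]));
}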
Just got MP to work in LMM mode, but I'm somewhat surprised: compiled and run in LMM mode, Micropython performs worse than my (optimised) COG mode, and the image size is about the same. It seems COG mode is still the way to go, at least for now.
                                             LMM mode    COG mode
Testing 1 additions per loop over 10s
  Count:                                       302567      393664
  Count:                                       302547      393670
  Count:                                       302547      393669
Testing 2 additions per loop over 10s
  Count:                                       223201      299834
  Count:                                       223196      299827
  Count:                                       223195      299826
Testing 3 additions per loop over 10s
  Count:                                       176357      239221
  Count:                                       176352      239216
  Count:                                       176352      239215
Testing 10! calculations per loop over 10s
  Count:                                        13014       15461
  Count:                                        13014       15461
  Count:                                        13014       15461
I found an optimization I could apply to the p2gcc compilation sequence. It first identifies this frequent pattern in COG-mode output:
        rdlong  r1, ##.LC8
        ...
.LC8
        long    _mp_type_NameError
and uses this method instead:
        mov     r1, ##_mp_type_NameError
I applied it to the Micropython assembly files and found it improved the benchmarks slightly compared to the above. I was hoping for a larger gain, but it was at least a gain.
Testing 1 additions per loop over 10s
Count: 409829
Count: 409804
Count: 409803
Testing 2 additions per loop over 10s
Count: 310543
Count: 310535
Count: 310534
Testing 3 additions per loop over 10s
Count: 246596
Count: 246590
Count: 246590
Testing 10! calculations per loop over 10s
Count: 15729
Count: 15730
Count: 15729
I've made up a bash script for this optimization. It's probably not particularly robust, and I suspect you could break it if you tried hard enough; for now it is more of a proof of concept for testing the optimization. I run it after the s2pasm step on the .spin2 files, but ideally this step could be done inside s2pasm, where it could be more aware of comment blocks etc. For now I still keep the original labels defined in the code, such as .LC8 above; in theory you could remove those lines too and gain some additional code space. An exercise for the reader...
#!/bin/bash
# A simple shell script to replace constant address lookups in .spin2 files with loading
# the symbol value directly into registers, to optimize translated P1 code in COG mode to
# make use of more P2 capabilities.
# July 21 2019 rogloh
#cleanup any remaining temporary/intermediate files kept around for debug, before beginning
rm -f tempsedfile
rm -f $1.labels
rm -f $1.lines
rm -f $1.sedfile1
rm -f $1.sedfile2
#extract set of constant labels used with rdlong to a file
grep ", ##\.LC[0-9]*" $1 | grep rdlong | grep -o "\.LC[0-9]*" | sort | uniq > $1.labels
#locate where each label is defined and save the symbol from the next line as a sed replacement string
for label in `cat $1.labels` ; do x=`grep -A 1 -w "^\$label" $1 | grep -v "^\." | grep -w long | cut -f2- -d"g" | grep -v "," ` && x="s/##$label[^0-9]/{__SYMBOL__PATCH__} ##${x#"${x%%[![:space:]]*}"}"/g && echo "$x" >> tempsedfile ; done
#only continue if any replacements were found
if [ -f tempsedfile ]; then
    #get rid of excess carriage returns
    sed -e "s:\r/:/:" tempsedfile > $1.sedfile1
    #run sed to replace label with symbol in source file
    sed -f $1.sedfile1 $1 > $1.out1
    #find all lines of the file where all these special symbol patches were applied
    grep -n "__SYMBOL__PATCH__" $1.out1 | cut -f1 -d":" > $1.lines
    #continue if any changes were made
    if [ -f $1.lines ]; then
        #create another sed file with final rdlong substitutions on each line
        for line in `cat $1.lines` ; do echo "$line"s:rdlong:mov: >> $1.sedfile2; done
        #preserve original file (for debug)
        cp $1 $1.orig
        #for each line of the file substitute back a mov instruction instead of the rdlong operation
        sed -f $1.sedfile2 $1.out1 > $1
    fi
fi
#clean up all intermediate files, can comment out lines below to debug
#rm -f $1.lines
#rm -f $1.labels
#rm -f $1.orig
#rm -f $1.sedfile1
#rm -f $1.sedfile2
I think further optimizations start to get a lot harder to apply automatically like I did above, and ideally a true native P2 port of GCC would really help here.
For example, if either PTRA or PTRB could be used as the stack pointer, that would be useful, as it would allow direct indexed addressing.
When you look at the disassembly listings there is plenty of this sort of code being generated by propeller-elf-gcc:
        mov     r2, #4
        mov     r3, #60
        add     r2, sp
        add     r3, sp
        rdlong  r7, r2          ; read local variable into r7 from long[sp+4]
        rdlong  r3, r3          ; read local variable into r3 from long[sp+60]
On the P2 those six instructions could be simplified to just these two, saving two instructions per local accessed. That type of code is used very frequently for reading and writing locals, so it saves a lot of instruction memory too:
        rdlong  r7, ptra[1]
        rdlong  r3, ptra[15]
On the P2 you could use one pointer register for the stack, and have the other free as a temporary register for general pointer+offset memory dereferencing, e.g. to access C structure members. That could work nicely. I suspect this change buys the most bang per buck compared to the emitted P1 code.
Also, now that we have SETQ burst transfers, I think it would be reasonable to increase the number of registers from 15 to 31. That may enable more register optimisations in the compiler, with less chance of running out of registers, and it wouldn't add much overhead if you ever have to save/restore 8-15 registers instead of just 0-7; it only ever adds one extra hub cycle in that case. So many registers could have been prohibitively expensive on the P1, with its sequential hub transfers moving only one long per hub cycle, so I can see why just 15 registers were used there, but on the P2 I expect it could make more sense.
@"Dave Hein" I've sent you a pull request on GitHub to add ELF file loading to loadp2. I've only tested it on files produced by the RISC-V linker, but it only uses generic ELF structures (indeed it's copied from @"David Betz" 's PropLoader) so it should work for any future P2 ELF format as well.
Eric Smith (aka ersmith) has taken over maintenance of loadp2. The new repository for loadp2 is at https://github.com/totalspectrum/loadp2 . All versions from 0.014 and on will be in the new repository. Thanks Eric for taking this over.
Thanks for writing loadp2, Dave! It's an incredibly useful tool.