Can't Wait for PropGCC on the P2?


Comments

  • I think the fundamental issue is that there are very good tools already available for the P2, and people don't realize just how good they are (or how much is out there).

    For example, I've heard lots of people saying how cool it will be to eventually have a self-hosted development environment. But @Peter Jakacki has had a self-hosted development environment for P2 since literally before there was silicon. I'm constantly amazed at all of the things he does with FORTH on the P2, but people don't seem to be aware of it.

    People have been waiting for Chip's interpreter to run Spin code on the P2. But fastspin has been able to do that for a long time now.

    People have wanted a PropTool-like editor to work with P2 tools, but there are already several solutions that can do that.

    There are plenty more examples. I don't want to pick on anyone -- I've been guilty of overlooking some good existing tools for the P2 too. And sometimes people just want to experiment and write their own tools. There's nothing wrong with that, but if you're just looking for a tool to use for work, there's a good chance that something already available for the P2 can do at least part of what you want. If you want to program for the P2, jump right in!
  • ersmith wrote: »
    Wuerfel_21 wrote: »
    I think there's likely a good bit of performance to be gained from a P2-specific compressed format or just native Hubexec.
    Perhaps you missed the benchmark I posted above, but the RISC-V toolchain compiled P2 binary runs the Dhrystone benchmark more than twice as fast as all of the P2 "native" compilers, including p2gcc, Catalina, and fastspin, all of which use hubexec (for that matter riscvp2 uses hubexec too, once the RISC-V opcodes are translated).
    That just shows that the GCC optimizer more than makes up for the not-so-large recompiler overhead. Not to bash on fastspin and Catalina, but they just don't have the same level of optimization work put into them. A well-made native LLVM or GCC backend would be fastest, but who's going to invest lots of time to write that? Until Parallax decides to fund development or someone decides to do it because reasons, the RISC-V emulation seems like a good solution.
    (I wonder how P2 compares to actual RISC-V hardware...)
    But as anyone who has ever used a JIT recompiling emulator for some old machine on their PC can tell, the real issue isn't actual hot-loop performance, but recompilation delay. I don't have a P2, so I can't test that, but try testing the different compilers for latency when some code is executed for the first time. (like, wait until a pin reads high, then call a function that drives another pin high and measure the delay using a digital scope or some PASM code in another cog). That may be an issue for people with a hard realtime requirement, which IIRC is one of the Prop selling points.

    In the end, C performance doesn't even matter that much. Beginners and the education crowd don't really care too much and advanced users will want to write critical code in PASM, anyways.
  • PropWare's example code for mounting a FAT32 partition on an SD card, reading the contents of one file, and writing those contents to another file, would probably be a very good way to compare the performance of some procedural code across different C++ compilers. PropWare's FAT32 library is written entirely in C++, with the only PASM being the low-level SPI bit banging routines.

    The test could be done with other FAT32 libraries as well, such as Kye's FAT engine, but PropWare's version uses the least amount of assembly. This isn't necessarily a good thing for end-users, but it is a good thing if you're trying to compare compiler performance.

    Of course... running this test with PropWare will require a fair bit of work to get PropWare building under p2gcc and riscv. Maybe in a couple weeks...
    David
  • rogloh Posts: 1,145
    edited 2019-07-11 - 00:41:29
    ersmith wrote: »
    @rogloh : why are you re-inventing the wheel here? I mean sure, try it with a different compiler, but why not at least start with my existing micropython port to the P2 (https://github.com/totalspectrum/micropython). Trust me, just because I happened to compile it with a RISC-V compiler doesn't give it cooties :).

    I solved the serial buffer problem by running serial code in another COG (see BufferSerial.spin2, translated by spin2cpp into BufferSerial.c). There's code for VGA and USB support (also running in separate COGs). There are some P2 support objects (modpyb.c) which shouldn't be too hard to port to p2gcc. In fact I plan to convert them to use the standard propeller2.h header file that was discussed elsewhere in the forum.

    Ok. I only just started on the serial issue in particular yesterday and had been given the impression that what had been available for P2 Micropython so far was purely a binary, hence the need for looking into PropGCC based tools in the first place. I'll try to also take a look at what you've done on this too when I can, so thanks for the link to your source code.

    If this effort goes further it should be very interesting to see how the performance compares between risc-v translation and native code. From what I've seen so far by disassembling code, I get the feeling that the key issue with native P1 COG model code translated to equivalent P2 code by the p2gcc tools, but now running in hub mode, is how frequently it does rdlong/wrlong operations to fetch and store data. Those compete with and stall the hub exec pipeline. It also may not use register optimisations as fully as a truly optimized port of GCC to the P2 would. Perhaps your own risc-v + JIT approach gains performance over the native P1 compiled code in that area, especially if you are also caching code/data in the LUT. It still seems counterintuitive that translated code from another instruction set should run faster than native code doing the same work, but I guess it comes down to how limited the native compiler implementation really is vs what you are doing, along with the JIT component that might be eliminating some unnecessary code.
  • So what you're saying is that a RISC-V program for the P2 runs faster than a PASM program on the P2. Well I guess the instruction set on the P2 is not very efficient and that only a small subset of the instructions should be used and the rest avoided.

    Now I'm really confused.

    I just want to compile my existing C programs for the P1 on a P2 and have them run at the processor speed so my timings are correct for that platform.

    Mike
  • iseries wrote: »
    So what you're saying is that a RISC-V program for the P2 runs faster than a PASM program on the P2. Well I guess the instruction set on the P2 is not very efficient and that only a small subset of the instructions should be used and the rest avoided.

    Well if you meant me iseries, I'm certainly not saying that. In fact if that is ever true it must be because the compiler implementation for risc-v is far superior to that of the native code. It would pretty much make a mockery of any real P2 C compiler if that were the case ALL the time. For now at least, according to some of ersmith's benchmarks, there appears to be a window in time where the P1 compiler tools, with their output translated to P2, may not generate code as efficient as risc-v opcode translation plus extras. But I'd also like to see this for myself with micropython. It still seems strange to me too. Maybe PropGCC was never very good at outputting register optimised code?
    I just want to compile my existing C programs for the P1 on a P2 and have them run at the processor speed so my timings are correct for that platform.
    So do I.
  • Compare the different options for yourself. If there is some timing critical area then add checks around it to verify the accuracy of each run. It takes work to implement and analyse but there is fun and satisfaction in the results of verification too. And it can be ported back to the Prop1 and also kept as a backup for later checks when there might be other questions.
  • Eric, I am both fascinated and encouraged by your results. This is absolutely fantastic. I cannot wait to try micropython, and also take a look under the hood at how the RISC-V JIT is working.
    Well done Eric :smiley: :smiley: :)
  • If it is faster and smaller, maybe RISC-V would be the go-to byte-code interpreter for fastspin?

    curious

    Mike
  • rogloh Posts: 1,145
    edited 2019-07-12 - 01:55:34
    I was looking at the PASM code generated by the PropGCC tool and see quite a lot of cases where function prologues and epilogues might be candidates for performance improvements to further take advantage of the P2. I'm sort of wondering if it would be possible to post-process the spin2 output that is generated by Dave Hein's s2pasm, to add some further P2 specific optimisations to help speed things up.

    E.g. consider this simple C function:
    void mp_load_method(mp_obj_t base, qstr attr, mp_obj_t *dest) {
        DEBUG_OP_printf("load method %p.%s\n", base, qstr_str(attr));
    
        mp_load_method_maybe(base, attr, dest);
    
        if (dest[0] == MP_OBJ_NULL) {
            // no attribute/method called attr
            if (MICROPY_ERROR_REPORTING == MICROPY_ERROR_REPORTING_TERSE) {
                mp_raise_msg(&mp_type_AttributeError, "no such attribute");
            } else {
                // following CPython, we give a more detailed error message for type objects
                if (mp_obj_is_type(base, &mp_type_type)) {
                    nlr_raise(mp_obj_new_exception_msg_varg(&mp_type_AttributeError,
                        "type object '%q' has no attribute '%q'",
                        ((mp_obj_type_t*)MP_OBJ_TO_PTR(base))->name, attr));
                } else {
                    nlr_raise(mp_obj_new_exception_msg_varg(&mp_type_AttributeError,
                        "'%s' object has no attribute '%q'",
                        mp_obj_get_type_str(base), attr));
                }
            }
        }
    }
    

    It generated the following PASM2 after first compiling it to P1 PASM using propeller-elf-gcc and then translating that to PASM2 with s2pasm:
        .global _mp_load_method
    _mp_load_method
        sub sp, #4
        wrlong  r12, sp
        sub sp, #4
        wrlong  r13, sp
        sub sp, #4
        wrlong  r14, sp
        sub sp, #4
        wrlong  lr, sp
        sub sp, #12
        mov r12, r0
        mov r0, r1
        mov r14, r2
        mov r13, r1
        calld   lr,#_qstr_str
        mov r7, sp
        add r7, #4
        rdlong  temp, ##.LC55
        wrlong  temp, sp
        wrlong  r12, r7
        add r7, #4
        wrlong  r0, r7
        calld   lr,#_DEBUG_printf
        mov r0, r12
        mov r1, r13
        mov r2, r14
        calld   lr,#_mp_load_method_maybe
        rdlong  r7, r14
        cmps    r7, #0 wcz
        IF_E  rdlong    r0, ##.LC56
        IF_E  rdlong    r1, ##.LC57
        IF_E  calld lr,#_mp_raise_msg
        add sp, #12
        rdlong  lr, sp
        add sp, #4
        rdlong  r14, sp
        add sp, #4
        rdlong  r13, sp
        add sp, #4
        rdlong  r12, sp
        add sp, #4
        jmp lr
    

    Notice all the wrlongs and rdlongs at the beginning and end of this compiled C function. Each push/pop runs in its own hub window (and competes with hub exec too), which slows the P2 down. Perhaps there might be some scope to recognize these fixed blocks of code, which are repeated throughout the spin2 output files in many places, and compress them down to something that runs faster. See the concept below... I'm thinking this could be simple line-by-line text processing of the file, with nothing done at the P1/P2 assembler level. Symbols are still resolved to addresses later, so removing or changing code in these blocks shouldn't have a negative effect on the final P2 assembly/link step. You would just need a new temporary register r15, contiguous with r14, which gets loaded with LR (register $1f6) before the block transfer happens. That extra register does not exist in the VM at this point.
        sub sp, #4                        sub sp, #16    ' 16 is computed by number of longs pushed * 4
        wrlong  r12, sp    becomes        mov r15, lr    ' r15 is a new temporary link register for this transfer
        sub sp, #4        -------->       setq #3        ' 3 is (4 transfers - 1)
        wrlong  r13, sp                   wrlong r12, sp ' r12 is first register mentioned in original code (can vary)
        sub sp, #4                        sub sp, #12    ' same as last line of normal code
        wrlong  r14, sp
        sub sp, #4
        wrlong  lr, sp
        sub sp, #12
    
        ... and ...
    
        add sp, #12                      add sp, #12     ' basically reverse the above sequence
        rdlong lr, sp      becomes       setq #3
        add sp, #4        -------->      rdlong r12, sp
        rdlong  r14, sp                  mov lr, r15     ' restore real link register from temp r15
        add sp, #4                       add sp, #16
        rdlong  r13, sp                  
        add sp, #4
        rdlong  r12, sp
        add sp, #4
    
    

    I don't think this ever adds any extra instructions. Even if only one register and the link register are preserved/restored before another function is called within this function that could clobber them, I think it should take the same number of instructions. Also in many cases there are more than just 3 registers preserved, gaining further advantage. For example, in minimal Micropython code, when I grepped for such patterns I found the following situations:
    Registers preserved                  Cases
    6 + LR                                 0      
    5 + LR                                83
    4 + LR                                21
    3 + LR                                47
    2 + LR                                52
    1 + LR                                53
    

    There are potentially a lot of cycles (and some code space) that could be saved here if that approach works, especially whenever functions get called in loops, which happens a lot in C. Can such setq burst transfers still work in hub exec mode, or do they conflict with the normal pipeline reading, I wonder? I'm hoping it might be doable but don't know yet, and I'm not sure of any other side effects that may preclude doing this.

    One thing that is different is the register save order gets reversed but I don't think this is an issue unless it is somehow used in a stack frame type situation by nested functions. But in this case it only seems to be saving registers it can re-use, not function parameters, so I think it is okay. Worst case we could just flip the register order in the VM to compensate for this.
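    A minimal sketch of the arithmetic behind the prologue rewrite above, assuming the saved registers are contiguous and end at r14 so that the proposed r15 sits directly after them. The names plan_burst and print_burst_prologue are made up for illustration and are not part of s2pasm or p2gcc.

```c
#include <assert.h>
#include <stdio.h>

/* Constants for a burst prologue that saves `nregs` general registers
   plus the link register in one SETQ + WRLONG transfer. */
struct burst_prologue {
    int sp_adjust;   /* bytes reserved on the stack for the saved registers */
    int setq_count;  /* SETQ operand: number of longs transferred, minus one */
};

struct burst_prologue plan_burst(int nregs)
{
    struct burst_prologue p;
    assert(nregs >= 1);
    p.sp_adjust = (nregs + 1) * 4;  /* +1 long for LR, copied into r15 */
    p.setq_count = nregs;           /* (nregs + 1) longs, minus one */
    return p;
}

/* Prints the rewritten prologue. first_reg is the lowest saved register;
   the block must run up to r14 (assumption taken from the example above). */
void print_burst_prologue(int first_reg, int nregs, int frame_bytes)
{
    struct burst_prologue p = plan_burst(nregs);
    assert(first_reg + nregs == 15);
    printf("\tsub\tsp, #%d\n", p.sp_adjust);
    printf("\tmov\tr15, lr\n");
    printf("\tsetq\t#%d\n", p.setq_count);
    printf("\twrlong\tr%d, sp\n", first_reg);
    if (frame_bytes > 0)
        printf("\tsub\tsp, #%d\n", frame_bytes);  /* locals, unchanged */
}
```

    Calling print_burst_prologue(12, 3, 12) should reproduce the five-instruction sequence from the 3 + LR example. The epilogue would use the same two constants in reverse.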

    If this works it could be some nice low hanging fruit.
    Roger.
  • Maybe the 'sp' register could be substituted with 'ptra' and the order adjusted to suit the PTRx ++/-- and index options.
  • The LMM mode in s2pasm will convert the LMM PUSHM and POPRET instructions into SETQ/RDLONG/WRLONG instructions. There was a slight improvement in code speed.
  • rogloh Posts: 1,145
    edited 2019-07-14 - 11:57:04
    Yes Brian, using PTRA for SP would be handy but I know that is a far more extensive change.

    Thanks Dave, I haven't been able to get LMM going as yet but I might give it another try. I had issues compiling/linking properly last time I used it IIRC but I might have been doing something incorrectly.

    Does anyone know if register re-use/optimization features in GCC depend on the particular CPU port (propeller-elf-gcc) or is it more of a common/generic capability of the compiler itself whenever a general register set is available for it to select from? You'd sort of think it might be possible for such optimisations to be common across "typical" platforms. I'm no compiler expert but one day I might try to take a look at that source code too. I've heard it's a rather complex thing so I've sort of shied away from it.
  • rogloh Posts: 1,145
    edited 2019-07-14 - 10:26:16
    @Dave Hein Your p2gcc tools are actually pretty good at what they do, but in finally getting Micropython to compile natively for P2 with some tweaks for scalability and alignment I have found a fatal issue with the linking stage that causes a Micropython crash during execution. This crash is memory layout and link order dependent and therefore somewhat randomized when code is recompiled with different sizes, making it much harder to pinpoint and debug. Luckily I encountered a stable enough setup where I could crash it on demand and then debug it from there. It turns out that the crashing function's code in hub memory was actually being clobbered when a particular uninitialized data structure member was written at runtime. This data structure's symbol was identified as an uninitialized data symbol. When linked using the p2link utility, its address gets incorrectly mapped to an address in the first module that references it, instead of the correct hub address in the module where it is defined.

    Best way to describe it for you is to provide the simplest example that triggers it. To make this happen you just need two source files, one defining uninitialized data and the other one referencing it.

    data.c:
    int blank[1000]; // uninitialized array
    

    main.c:
    #include <stdio.h>
    #include <unistd.h>
    
    extern int blank[1000];
    
    void clear()
    {
        for (int i=0; i<1000; i++)
            blank[i]=0;
    }
    void main(void)
    {
        sleep(1); // add some delay for terminal to be ready
        puts("Clearing");
        clear();
        puts("Done");
        while(1);
    }
    

    If you compile and link in this order, where the module holding the uninitialized data precedes any module that references it, it works after downloading to the P2:
    propeller-elf-gcc -S -mcog -std=c99 main.c
    propeller-elf-gcc -S -mcog -std=c99 data.c
    s2pasm -p ../lib/prefix.spin2 main.s
    s2pasm -p ../lib/prefix.spin2 data.s
    p2asm -o -c -hub main.spin2 
    p2asm -o -c -hub data.spin2 
    p2link -d -L ../lib prefix.o p2start.o data.o main.o libc.a > link.good
    p2dump -dis -data a.out > code.good
    loadp2 -PATCH -t a.out
    ( Entering terminal mode at 115200 bps.  Press Ctrl-] to exit. )
    Clearing
    Done
    

    But if you link these two files in the other order it hangs. The code being executed in main.c gets overwritten because the symbol for "blank" gets mapped to part of the code space in the main.o module, where it is first referenced, not to its real allocated data space in the data.o module.
    propeller-elf-gcc -S -mcog -std=c99 main.c
    propeller-elf-gcc -S -mcog -std=c99 data.c
    s2pasm -p ../lib/prefix.spin2 main.s
    s2pasm -p ../lib/prefix.spin2 data.s
    p2asm -o -c -hub main.spin2 
    p2asm -o -c -hub data.spin2 
    p2link -d -L ../lib prefix.o p2start.o main.o data.o libc.a > link.bad
    p2dump -dis -data a.out > code.bad
    loadp2 -PATCH -t a.out
    ( Entering terminal mode at 115200 bps.  Press Ctrl-] to exit. )
    Clearing
    

    Is this bug possibly fixable? If it was I believe Micropython would finally run stably and natively on the P2 as the rest of it I've tested seems to work okay when regular initialized data is accessed and I can already run quite a few Python things in limited cases where these corrupted functions don't get called. It is just uninitialized data that has this issue during linking and this makes everything unreliable whenever referencing uninitialized data blocks in the C code.

    I've put these files in the attached zip so you can easily reproduce this. It includes a trivial Makefile that works if you copy it into a test folder under the p2gcc folder and already have the tools in your path (or just modify accordingly). I've also logged the output of the linker and disassembly to ".good" and ".bad" files. You can see where this "blank" variable is stored in the linking, and it is at the wrong address in the bad case.

    When I looked further at your tools, this is the area of the p2link.c code that I think seems to be doing the damage. It just chooses the first module it encounters that is referencing the uninitialized data (OTYPE_UNINIT_DATA) as the actual address of the uninitialized data, which is wrong.
    <snipped from p2link.c:MergeGlobalVariables(prev_num)>
    ...
        for (i = prev_num; i < numsym; i++)
        {
            if (symtype[i] == OTYPE_UNINIT_DATA)
            {
                j = FindSymbol(symname[i], OTYPE_INIT_DATA);
                if (j < 0) j = FindSymbol(symname[i], OTYPE_GLOBAL_FUNC);
                if (j < 0) j = FindSymbol(symname[i], OTYPE_UNINIT_DATA);
                if (debugflag)
                {
                    printf("Found variable %s at %8.8x with type = %d\n", symname[i], symvalue[i], symtype[j]);
                    printf("i = %d, j = %d\n", i, j);
                }
                // Use address from previous object if found
                if (j >= 0 && j != i)
                {
                    if (debugflag)
                    {
                        printf("Use address from entry %d at address %8.8x\n", j, symvalue[j]);
                        printf("Changing %8.8x to %8.8x\n", symvalue[i], symvalue[j]);
                    }
                    symvalue[i] = symvalue[j];
                }
            }
    

    I really hope there is a way to resolve this bug so Micropython can work natively on the P2 for comparison with the other methods of getting it to run such as risc-v.

    Cheers,
    Roger.

    Update:
    As a temporary workaround/possible hack I have modified the p2asm.c tool from this:
    if (s->scope == SCOPE_GLOBAL_COMM)
    {
        if (debugflag) printf("GLOBAL COMM %8.8x %s\n", s->value2, s->name);
        WriteObjectEntry(OTYPE_UNINIT_DATA, s->objsect, s->value2, s->name);
    }
    

    to this:
    if (s->scope == SCOPE_GLOBAL_COMM)
    {
        if (debugflag) printf("GLOBAL COMM %8.8x %s\n", s->value2, s->name);
        WriteObjectEntry(OTYPE_INIT_DATA, s->objsect, s->value2, s->name);
    }
    

    and it seems to help. Only the module defining the uninitialized variable seems to match the scope of GLOBAL_COMM, the other modules only referencing it match GLOBAL_UNDE instead. So this may be a way to differentiate it. Not sure if there are other side-effects in doing this. But my related crash stopped anyway for now.
  • Thanks for posting the code. I'll look into it.
  • I looked at it a bit, and the problem is that blank[] is not allocated any space in main.c because of the "extern" keyword that is used when declaring it. When I link uninitialized variables I use the address from the first object file. If main.o is the first object module then I use the dummy address 0x400 for blank. There is code at 0x400, so it gets clobbered when initializing blank. If you remove the "extern" keyword it will work fine.

    I believe your work-around forces the address in data.c to be used. However, this will generate a warning if an initialized version of the variable actually exists in another object file. To fix this I need to create another symbol type for symbols that don't have space allocated. These would then need to be linked to either uninitialized or initialized data. It will probably be a while before I get to it, but your work-around should work in the meantime. The other possibility would be to remove the "extern" keyword or add a "-D extern" to the command line when compiling.
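    One way the extra symbol type described above might look, as a hedged sketch only. The type SYM_REF and the functions find_definition and resolve_refs are hypothetical names chosen for illustration; none of this is taken from p2link.c. The idea is that an entry with no storage never supplies an address, so link order stops mattering.

```c
#include <assert.h>
#include <string.h>

/* SYM_REF marks a symbol that is only referenced (declared "extern"),
   so no space was allocated for it in that object file. */
enum sym_type { SYM_REF, SYM_UNINIT_DATA, SYM_INIT_DATA };

struct sym {
    const char *name;
    enum sym_type type;
    unsigned addr;
};

/* Index of the entry that should supply the address for `name`,
   or -1 if no object file actually defines it. */
int find_definition(const struct sym *tab, int n, const char *name)
{
    int best = -1;
    assert(tab != NULL && n >= 0);
    for (int i = 0; i < n; i++) {
        if (strcmp(tab[i].name, name) != 0)
            continue;
        if (tab[i].type == SYM_REF)
            continue;                 /* has no storage, never a candidate */
        if (tab[i].type == SYM_INIT_DATA)
            return i;                 /* an initialized definition wins */
        if (best < 0)
            best = i;                 /* first uninitialized definition */
    }
    return best;
}

/* Point every reference at its definition, whatever the link order. */
void resolve_refs(struct sym *tab, int n)
{
    for (int i = 0; i < n; i++) {
        if (tab[i].type != SYM_REF)
            continue;
        int j = find_definition(tab, n, tab[i].name);
        if (j >= 0)
            tab[i].addr = tab[j].addr;
    }
}
```

    With a table holding main.o's reference to blank (at the dummy address 0x400) followed by data.o's definition, resolve_refs would overwrite the dummy address with the one from data.o, and the same happens if the two entries are swapped.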
  • rogloh Posts: 1,145
    edited 2019-07-19 - 02:22:04
    Thanks for looking at this issue Dave. Yeah my workaround still helps me for now until a full fix is put in there. Some header files can declare variables with "extern" allowing other modules to access their .c file's variables, as Micropython's mpstate.[ch] files were doing. There should be no reason that external variables need to be placed into the code that accesses them when they are only being declared through included header files and defined in another module. So I agree with you that ideally another symbol type (or flag) is required between the assembler and linker stages (propagated via the .o file) to resolve this issue.

    By the way I'd like to say your P2GCC tools are so very useful and thanks for coming up with them. :smile:

    I did look more at the prolog/epilog stuff and over in the Micropython thread I mentioned I found that in COG mode I could speed up some benchmarking code by 10% or so and also reduce the size by 10% when I swapped out the prologs/epilogs with P2 native burst transfers instead of saving/restoring registers one by one which is far less efficient.

    I'd used some PERL/SED scripts that identified these frequently used blocks of code and replaced them with the faster versions, but it might be something that s2pasm could do at some point to optimise its output for the P2. I know the LMM mode was meant to be able to do this too, but I found s2pasm and p2asm don't work right in LMM mode. They complain about missing FCACHE stuff (probably just needs some real VM support for that in a different prefix.spin2 file) and the assembler doesn't like the format that propeller-elf-gcc outputs in its lmm output mode with all those ".compress default" lines.

    I'm thinking there may be scope to improve performance even further by identifying more native P2 optimizations and incorporating them in s2pasm. Getting rid of all these extra constant pointer loads (when -mcog is used) and converting into simple AUGS + RDLONG instructions would be great too to save further hub cycles.

    By the way the major data alignment bug fix (which I resolved in s2pasm) was to put in this replacement function so the data bytes correspond to where the C code thinks they are:
    void ProcessZero(void)
    {
        int len;
        sscanf(&buffer1[6], "%d", &len);
        if (len & 1)
            fprintf(outfile, "\tbyte\t0%s", NEW_LINE);
        if (len & 2)
            fprintf(outfile, "\tword\t0%s", NEW_LINE);
        len &= ~3;
        if (len)
            fprintf(outfile, "\tlong\t0[%d]%s", len/4, NEW_LINE);
    }
    

    I suspect the same thing should probably be done to ProcessComm() as well, though for that .comm directive GCC has an extra alignment argument, which could/(should?) be used too. But I didn't encounter a problem with that one (yet) unlike the big one I first hit with the .zero directive that crashed code regularly.

    Cheers,
    Roger.
  • There shouldn't be any FCACHE references in LMM mode since p2gcc specifies -mno-fcache when it uses -mlmm. It appears that you are using your own script files, so maybe you need to add -mno-fcache to it. However, if the compiler is generating .compress directives that would cause a problem.

    I wasn't aware of the data alignment problem. I suppose there could be an issue if the exact size is required. I was assuming I could get by with using longs for bytes and words even though it does waste some memory.
  • rogloh Posts: 1,145
    edited 2019-07-19 - 03:40:31
    Aha, that was it Dave. I didn't have the -mno-fcache option enabled in my Makefile. I've added it and defined __TMP0 long in prefix.spin2 (for some reason I also needed that register name put in there), and it now fully compiles and links in LMM mode, even though s2pasm is rather "chatty". The build crashes my Micropython though for some reason. I'll have to debug it further if I get some time. [UPDATE]: I know why it is crashing: I've reversed the register order and I need to undo that again to match how registers are meant to be pushed/popped in LMM mode.

    The data alignment was a problem for me basically because sizeof(structure) in C didn't know about these extra padding longs you were adding to its elements and it started reading from incorrect offsets in the data structure. It might have been with an array of structs, can't recall precisely now, but this change fixed it.
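    The size bookkeeping behind the replacement ProcessZero() can be checked in isolation. This is a small sketch, with split_zero and emitted_size as made-up helper names, that mirrors the same byte/word/long split and confirms the emitted data is exactly as long as the .zero directive asked for, which is what sizeof() in the C code expects.

```c
#include <assert.h>

/* How a ".zero len" directive is split into byte, word and long items. */
struct zero_split {
    int bytes;  /* 0 or 1 single byte  */
    int words;  /* 0 or 1 16-bit word  */
    int longs;  /* count of 32-bit longs */
};

struct zero_split split_zero(int len)
{
    struct zero_split s;
    assert(len >= 0);
    s.bytes = (len & 1) ? 1 : 0;   /* odd length needs one byte */
    s.words = (len & 2) ? 1 : 0;   /* bit 1 set needs one word */
    s.longs = (len & ~3) / 4;      /* the rest is whole longs */
    return s;
}

/* Total number of bytes the split occupies. */
int emitted_size(struct zero_split s)
{
    return s.bytes + 2 * s.words + 4 * s.longs;
}
```

    For every non-negative len, emitted_size(split_zero(len)) equals len, so no padding is added that the compiler does not know about. A 7-byte .zero, for example, becomes one byte, one word and one long.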
  • rogloh Posts: 1,145
    edited 2019-07-19 - 03:59:08
    Just got MP to work in LMM mode but I'm sort of surprised. If I compile and run it with LMM mode I get worse performance for Micropython compared to my (optimised) COG mode and the image size is about the same. Seems the COG mode is still the way to go, at least for now.
    LMM mode                                    COG mode                                            
    Testing 1 additions per loop over 10s       Testing 1 additions per loop over 10s
    Count:  302567                              Count:  393664
    Count:  302547                              Count:  393670
    Count:  302547                              Count:  393669
    Testing 2 additions per loop over 10s       Testing 2 additions per loop over 10s
    Count:  223201                              Count:  299834
    Count:  223196                              Count:  299827
    Count:  223195                              Count:  299826
    Testing 3 additions per loop over 10s       Testing 3 additions per loop over 10s
    Count:  176357                              Count:  239221
    Count:  176352                              Count:  239216
    Count:  176352                              Count:  239215
    Testing 10! calculations per loop over 10s  Testing 10! calculations per loop over 10s
    Count:  13014                               Count:  15461
    Count:  13014                               Count:  15461
    Count:  13014                               Count:  15461
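
    To put a number on the gap between the two columns, the counts can be compared directly. A small sketch, with percent_gain as a made-up helper name:

```c
#include <assert.h>

/* Percentage by which `faster` exceeds `slower` (both are loop counts
   taken over the same 10 s interval). */
double percent_gain(long faster, long slower)
{
    assert(slower > 0);
    return 100.0 * (double)(faster - slower) / (double)slower;
}
```

    Using the first line of each test, percent_gain(393664, 302567) comes out at roughly 30% for the single-addition loop, and percent_gain(15461, 13014) at just under 19% for the 10! test.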
    
  • rogloh Posts: 1,145
    edited 2019-07-24 - 03:54:47
    I found an optimization I could apply to the p2gcc compilation sequence that first identifies this frequent pattern in cog mode:
        rdlong  r1, ##.LC8
    
       ...
    
    .LC8
        long    _mp_type_NameError
    

    and uses this method instead:
        mov  r1, ##_mp_type_NameError
    

    I applied it to the Micropython assembly files and found it improved benchmarks slightly compared to the above. Was hoping for a higher gain but it was at least a gain.
    Testing 1 additions per loop over 10s
    Count:  409829
    Count:  409804
    Count:  409803
    Testing 2 additions per loop over 10s
    Count:  310543
    Count:  310535
    Count:  310534
    Testing 3 additions per loop over 10s
    Count:  246596
    Count:  246590
    Count:  246590
    Testing 10! calculations per loop over 10s
    Count:  15729
    Count:  15730
    Count:  15729
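
    To put rough numbers on the three sets of results, here is a minimal sketch that computes the relative gains from the first "Count" of each run. The helper name `pct_gain` is made up for illustration, and the percentages are truncated (not rounded) to one decimal because only integer arithmetic is used.

```shell
#!/bin/bash
# Hypothetical helper: percentage gain from old to new, truncated to one decimal.
# Assumes new >= old and old > 0; uses integer arithmetic only.
pct_gain() {
    local old=$1 new=$2
    local tenths=$(( (new - old) * 1000 / old ))
    printf '%d.%d%%\n' $(( tenths / 10 )) $(( tenths % 10 ))
}

# First "Count" of each run from the tables above: LMM, COG, patched COG.
while read -r name lmm cog patched; do
    echo "$name: COG vs LMM +$(pct_gain "$lmm" "$cog"), patched vs COG +$(pct_gain "$cog" "$patched")"
done <<'EOF'
1-addition 302567 393664 409829
2-additions 223201 299834 310543
3-additions 176357 239221 246596
10-factorial 13014 15461 15729
EOF
```

    By this arithmetic, COG mode comes out roughly 19 to 36 percent ahead of LMM mode, and the rdlong-to-mov patch adds roughly 2 to 4 percent on top of that.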
    

    I've put together a bash script for this optimization. It's probably not particularly robust, and I suspect you could break it easily if you tried; for now it is more of a proof of concept for testing the optimization. I run it on the .spin2 files after the s2pasm step, but ideally this would be done inside s2pasm, where it could be more aware of comment blocks etc. For now I still keep the original label definitions in the code, such as .LC8 above; in theory you could remove those lines too and gain some additional code space. An exercise for the reader... :smile:
    #!/bin/bash
    
    # A simple shell script to replace constant address lookups in .spin2 files with loading
    # the symbol value directly into registers, to optimize translated P1 code in COG mode to
    # make use of more P2 capabilities.
    # July 21 2019  rogloh
    
    #cleanup any remaining temporary/intermediate files kept around for debug, before beginning
    rm -f tempsedfile
    rm -f $1.labels
    rm -f $1.lines
    rm -f $1.sedfile1
    rm -f $1.sedfile2
    
    #extract set of constant labels used with rdlong to a file
    grep ", ##\.LC[0-9]*" $1 | grep rdlong | grep -o "\.LC[0-9]*" | sort | uniq > $1.labels
    
    #locate where each label is defined and save the symbol from the next line as a sed replacement string
    
    for label in `cat $1.labels` ; do  x=`grep -A 1 -w "^\$label" $1 | grep -v "^\." | grep -w long | cut -f2- -d"g" | grep -v "," ` && x="s/##$label[^0-9]/{__SYMBOL__PATCH__} ##${x#"${x%%[![:space:]]*}"}"/g && echo "$x" >> tempsedfile ; done
    
    #only continue if any replacements were found
    if [ -f tempsedfile ]; then
    
    #get rid of excess carriage returns
        sed -e "s:\r/:/:" tempsedfile > $1.sedfile1
    
    #run sed to replace label with symbol in source file
        sed -f $1.sedfile1 $1 > $1.out1
    
    #find all lines of the file where all these special symbol patches were applied
        grep -n "__SYMBOL__PATCH__"  $1.out1 | cut -f1 -d":" > $1.lines
    
    #continue if any changes were made
        if [ -s $1.lines ]; then
    
    #create another sed file with final rdlong substitutions on each line
            for line in `cat $1.lines` ; do echo "$line"s:rdlong:mov: >> $1.sedfile2; done
    
    #preserve original file (for debug)
            cp $1 $1.orig
    
    #for each line of the file substitute back a mov instruction instead of the rdlong operation 
            sed -f $1.sedfile2 $1.out1 > $1
        fi
    fi
    
    #clean up all intermediate files, can comment out lines below to debug
    #rm -f $1.lines
    #rm -f $1.labels
    #rm -f $1.orig
    #rm -f $1.sedfile1
    #rm -f $1.sedfile2
    

    Update: fixed bug in script
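
    For comparison, the same idea can be sketched as a two-pass awk filter wrapped in a bash function. This is not the script above and not part of p2gcc or s2pasm; the name `lc_to_mov` is hypothetical, and it assumes each `.LCn` label sits alone at the start of a line and is immediately followed by a `long` line holding a single symbol. It writes the patched text to stdout and leaves the input file untouched.

```shell
#!/bin/bash
# Hypothetical helper, not part of p2gcc or s2pasm.
# Usage: lc_to_mov file.spin2 > file.patched.spin2
lc_to_mov() {
    # The file is given to awk twice: once to collect labels, once to patch.
    awk '
        NR == FNR {
            # Pass 1: map each .LCn label to the single symbol in the long after it.
            sub(/\r$/, "")                      # tolerate CRLF line endings
            if (pending != "" && NF == 2 && $1 == "long" &&
                $2 ~ /^[A-Za-z_][A-Za-z0-9_]*$/)
                sym[pending] = $2
            pending = ($0 ~ /^\.LC[0-9]+[ \t]*$/) ? $1 : ""
            next
        }
        $1 == "rdlong" && match($0, /##\.LC[0-9]+/) {
            # Pass 2: patch only rdlong lines whose label was resolved in pass 1.
            label = substr($0, RSTART + 2, RLENGTH - 2)
            if (label in sym) {
                sub(/rdlong/, "mov")
                sub(/##\.LC[0-9]+/, "##" sym[label])
            }
        }
        { print }
    ' "$1" "$1"
}
```

    Labels whose `long` line holds several values or a non-symbol are left alone, so their rdlong instructions stay unchanged.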
  • rogloh Posts: 1,145
    edited 2019-07-21 - 13:57:51
    I think further optimizations get a lot harder to apply automatically the way I did above; ideally a true native P2 port of GCC would help here.

    For example if either PTRA or PTRB could be used as the stack pointer, this would be useful as it would allow indexed addressing directly.

    When you look at the disassembly listings there is plenty of this sort of code being generated by propeller-elf-gcc:
        mov r2, #4
        mov r3, #60
        add r2, sp
        add r3, sp
        rdlong  r7, r2  ; read local variable into r7 at long[sp+4]
        rdlong  r3, r3  ; read local variable into r3 at long[sp+60]
    
    On a P2 those 6 instructions could be reduced to just these two, saving two instructions per local accessed. That kind of code is used very frequently for reading and writing locals, so it would save a lot of instruction memory too:
        rdlong r7, ptra[1]
        rdlong r3, ptra[15]
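
    The mapping from the P1-style byte offset to the PTRA index is just a division by the long size. A minimal sketch of that arithmetic follows; `ptra_rdlong` is a made-up helper name, and the index limits of -32 to 31 are an assumption about the short (non-augmented) PTRx index form that should be checked against the P2 documentation.

```shell
#!/bin/bash
# Hypothetical helper: print the P2 form of "rdlong reg, long[sp + offset]",
# assuming PTRA holds the stack pointer and indexes are scaled by 4 for longs.
ptra_rdlong() {
    local reg=$1 offset=$2
    if (( offset % 4 != 0 )); then
        echo "offset $offset is not long-aligned" >&2
        return 1
    fi
    local index=$(( offset / 4 ))
    # Assumed range of the short index form; larger offsets need another encoding.
    if (( index < -32 || index > 31 )); then
        echo "index $index is outside the assumed short-form range" >&2
        return 1
    fi
    echo "rdlong $reg, ptra[$index]"
}

ptra_rdlong r7 4
ptra_rdlong r3 60
```

    Offsets that fail either check would still need the longer add-then-read sequence or an augmented index.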
    
    On the P2 you could use one pointer register for the Stack and have the other one free as a temporary register for general pointer+offset memory dereferencing purposes to access C structure members etc. That could work nicely. I suspect this change probably buys you the most bang per buck compared to the P1 code emitted.

    Also, now that we have SETQ burst transfers, I think it would be reasonable to increase the number of registers from 15 to 31. This may enable more register optimisations in the compiler, with less chance of running out of registers, and it wouldn't add much overhead if you ever have to save/restore 8-15 registers instead of just 0-7; that case only ever adds one extra hub cycle. Having so many registers could have been prohibitively expensive on the P1, with its sequential hub transfers of only one long per hub cycle, so I can see why only 15 registers were used there, but for the P2 I expect it could make more sense.

  • Roger, you have made some interesting suggestions. I think these would be good to consider if Parallax ever pursues developing GCC for the P2.
  • @Dave Hein I've sent you a pull request on GitHub to add ELF file loading to loadp2. I've only tested it on files produced by the RISC-V linker, but it only uses generic ELF structures (indeed it's copied from @David Betz 's PropLoader) so it should work for any future P2 ELF format as well.
  • Dave Hein Posts: 5,932
    edited 2019-07-27 - 17:56:09
    Eric Smith (aka ersmith) has taken over maintenance of loadp2. The new repository for loadp2 is at https://github.com/totalspectrum/loadp2 . All versions from 0.014 and on will be in the new repository. Thanks Eric for taking this over.
  • Dave Hein wrote: »
    Eric Smith (aka ersmith) has taken over maintenance of loadp2. The new repository for loadp2 is at https://github.com/totalspectrum/loadp2 . All versions from 0.014 and on will be in the new repository. Thanks Eric for taking this over.

    Thanks for writing loadp2, Dave! It's an incredibly useful tool.
  • Yes, thank you very much indeed. I've even been using the -t option by itself for a stand-alone terminal.
  • Yes, thank you very much indeed. I've even been using the -t option by itself for a stand-alone terminal.
    What a simple yet elegant "other use" of loadp2! Thank you.