LLVM Backend for Propeller 2

Comments

  • Speaking of which, are you sure you only want to allocate 16 GPRs? With how big the COG/LUT RAMs are, there's surely space for more (and from experience, 16 can get really cramped in some situations)
  • Yes Wuerfel_21, I would expect if the setq burst transfer method can be used to save/restore registers then increasing the register space should improve performance at only a minor expense of one clock per extra register needing to be saved or restored. The 16 regs with p2gcc is certainly far too cramped when some get used for parameter passing and local use etc, so something like at least 32 registers like the AVR uses could quite easily perform better. The P1 implementation where p2gcc came from originally had a big penalty to save/restore these registers, but the P2 should not have this with setq and the register count could increase.
  • rogloh wrote: »
    Yes Wuerfel_21, I would expect if the setq burst transfer method can be used to save/restore registers then increasing the register space should improve performance at only a minor expense of one clock per extra register needing to be saved or restored. The 16 regs with p2gcc is certainly far too cramped when some get used for parameter passing and local use etc, so something like at least 32 registers like the AVR uses could quite easily perform better. The P1 implementation where p2gcc came from originally had a big penalty to save/restore these registers, but the P2 should not have this with setq and the register count could increase.

    More registers is not always better. I experimented quite a lot with the number of registers for Catalina on the P1, expecting this would be true - but while 16 was too few, it turned out that 32 was too many - the overhead in both space and time of saving and loading them was actually detrimental to performance. I found 24 was the best compromise. Of course, this will vary from compiler to compiler.
  • rogloh Posts: 2,678
    edited 2020-08-05 - 13:41:11
    RossH I would think that would be true if you always have to save/restore a fixed set of registers. If you only need to save the number of registers you actually have to clobber inside the function maybe this larger potential register count increase would become less detrimental, and code space is certainly not an issue with the setq transfer approach, as the instruction count doesn't vary with the number of regs transferred on the P2. It would do if you need to save/load registers sequentially however like the P1 did. It's an interesting problem to try to optimize.
  • @n_ermosh, I haven't tested it yet... but I was able to complete a build :)
    I've installed it locally and assembled an RPM which is now uploaded to https://david.zemon.name/downloads/LLVM-12.0.0git-Linux.rpm
    I tried building a .deb, but there is a CMake/CPack variable missing (CPACK_PACKAGE_CONTACT or CPACK_DEBIAN_PACKAGE_MAINTAINER), which stops the packaging process.
  • Regarding GP register count, changing it is extremely simple, maybe 5 lines of code in the compiler. This is what I love about LLVM and propeller together. LLVM is designed to be flexible to match the architecture, and our architecture is flexible as well, meaning with little effort we can do a lot. I picked 16 registers arbitrarily. Going to 32 only reduces the code space you have for cog-based code/programs, but doesn’t present any real complexity. Could even do 64 if desired. The prologue/epilogue inserter will only push/pop registers that get clobbered inside the function, as you said @rogloh.
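    To make the "only push/pop what gets clobbered" idea concrete, here is a minimal host-side sketch in C. It is not the p2llvm prologue/epilogue inserter; `NUM_GPRS`, `count_saved` and `burst_span` are hypothetical names, and saving the whole span between the lowest and highest clobbered register is just one assumed way to keep the save contiguous for a burst transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_GPRS 32u  /* register count discussed in the thread; hypothetical constant name */

/* Count the registers a prologue would have to save when it only pushes
 * the ones the function body clobbers (one bit per register in the mask). */
static unsigned count_saved(uint32_t clobber_mask)
{
    unsigned n = 0;
    for (unsigned r = 0; r < NUM_GPRS; r++) {
        if (clobber_mask & (UINT32_C(1) << r))
            n++;
    }
    return n;
}

/* A single burst transfer covers a contiguous register range, so one option
 * (an assumption, not necessarily what p2llvm does) is to save the span from
 * the lowest to the highest clobbered register. Returns the span length and
 * stores the first register index in *first; returns 0 for an empty mask. */
static unsigned burst_span(uint32_t clobber_mask, unsigned *first)
{
    assert(first != NULL);
    if (clobber_mask == 0)
        return 0;
    unsigned lo = 0, hi = NUM_GPRS - 1;
    while (!(clobber_mask & (UINT32_C(1) << lo))) lo++;  /* lowest set bit */
    while (!(clobber_mask & (UINT32_C(1) << hi))) hi--;  /* highest set bit */
    *first = lo;
    return hi - lo + 1;
}
```

    For a function that clobbers r4, r5 and r9, `count_saved` gives 3 while `burst_span` gives a 6-register span starting at r4, so a contiguous burst may save a few registers it does not strictly need to.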

    @DavidZemon glad to hear it! Let me know if you are able to get anything building and loading
  • I'll switch this over to Docker once you have that working, but in the meantime, we have .rpm and .tar.gz artifacts for Linux published on TeamCity now: https://ci.zemon.name/project/P2llvm?guest=1
  • Took a look at your builds and it looks like they're missing clang and lld--did you enable those projects when configuring the build? I made some changes to add P2 as a target and more changes will come, so we can't use the out-of-the-box clang (yet).

    I'll update once the docker is working--the intention is for it to just be a build system for linux systems and not a run environment. But it looks like you were able to get the project building on linux so that's good.
  • n_ermosh wrote: »
    Took a look at your builds and it looks like they're missing clang and lld--did you enable those projects when configuring the build? I made some changes to add P2 as a target and more changes will come, so we can't use the out-of-the-box clang (yet).

    I'll update once the docker is working--the intention is for it to just be a build system for linux systems and not a run environment. But it looks like you were able to get the project building on linux so that's good.

    Interesting. I'll have to play around with the CMake options some more. I think the quotes are getting parsed a bit weird... I don't have the problem locally.
  • Well, it took A LOT of different tries to get it working... but it finally is :lol:. Turns out, no escaping is needed at all in the CMake options text box on TeamCity... the whole text box must be passed in as one big string into something, because not even the semicolon needed escaping/quoting. Anyway, artifacts are live now :)
  • rogloh wrote: »
    RossH I would think that would be true if you always have to save/restore a fixed set of registers. If you only need to save the number of registers you actually have to clobber inside the function maybe this larger potential register count increase would become less detrimental, and code space is certainly not an issue with the setq transfer approach, as the instruction count doesn't vary with the number of regs transferred on the P2. It would do if you need to save/load registers sequentially however like the P1 did. It's an interesting problem to try to optimize.

    Yes, I did those experiments on the P1. I should redo them on the P2 - the answer might be different. But then I would have to re-write the code generator :(

    However, you still have to save/restore all registers on an interrupt, so the argument still holds. More registers is not necessarily always better.

    Of course, not all compilers allow an arbitrary C function to be used as an interrupt handler as Catalina does. Swings and roundabouts!
  • On P2 if you use setq / rdlong and setq / wrlong to save and restore registers, the extra cost is literally just 1 cycle per register with no additional code overhead (the setq says how many to save). So yeah, more registers is almost certainly a win on P2. It's probably one of the reasons riscvp2 performs so well despite the JIT compilation overhead: with 32 (well, really only 31) registers to play with memory traffic gets reduced, and memory load/stores are pretty expensive.
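    A rough way to see the difference in code size is to count instructions, as in this sketch. The function names are hypothetical, cycle counts are deliberately left out, and the only hardware detail assumed is that setq takes the number of longs minus one.

```c
#include <assert.h>

/* Instructions needed to save n registers one at a time (P1 style):
 * one hub write per register. */
static unsigned seq_save_insns(unsigned n)
{
    return n;
}

/* Instructions needed with a setq + wrlong burst (P2 style): a fixed pair
 * regardless of n, or nothing at all when there is nothing to save. */
static unsigned burst_save_insns(unsigned n)
{
    return n ? 2u : 0u;
}

/* Operand to give setq when transferring n longs (assumed to be n - 1). */
static unsigned setq_operand(unsigned n)
{
    assert(n >= 1);
    return n - 1;
}
```

    With 32 registers the sequential approach needs 32 instructions per save while the burst needs 2, which is why the code-space cost stays flat as the register file grows.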
  • However, you still have to save/restore all registers on an interrupt, so the argument still holds.

    True, you would need to save all the registers you actually use in the interrupt routine yes. For C based ISRs this would be the full set, for PASM based ISRs hopefully somewhat less, or maybe none if you can choose a different register range to use for that code. It's a trade off between general performance and C code interrupt latency. On a P2 the interrupt latency might not always be the first priority, of course in some cases it could be. Not all C code written by people will use interrupt routines or have them coded in C, but those applications that do may desire less latency and people may be happy to run the general code a little less efficiently perhaps. Hopefully if we have an option to write ISRs in PASM and fully control which registers need to be saved that could at least provide a path to reduce the latency further.

    It's a classic trade off to choose the optimum solution if there even is one. :smile:

  • It can be built with gcc. Not recommended as clang compiles faster. My build dir is 50GB. :astonished: And binaries are absurdly large.
    file ../build/bin/clang-12 
    ../build/bin/clang-12: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=x, with debug_info, not stripped
    [x@localhost p2_dev_tests]$ ls -sh  ../build/bin/clang-12 
    1.9G ../build/bin/clang-12
    

    I have test12.c running. That's basically a hello world. Some printfs go missing or are clobbered.

    p2.ld is not in the new repository. There was a bunch of good stuff in llvm-propeller2/p2_dev_tests/ . That repository is gone.
  • @SaucySoliton, you should get smaller binaries if you add -D CMAKE_BUILD_TYPE=Release to the CMake step.
  • n_ermosh Posts: 107
    edited 2020-08-06 - 02:09:42
    It can be built with gcc. Not recommended as clang compiles faster. My build dir is 50GB. :astonished: And binaries are absurdly large.

    Yes, when building with all the debug stuff the build is massive. My installed binaries totaled 11GB... You can build the release version by adding the flag "-DCMAKE_BUILD_TYPE=Release" to the cmake configuration step to make the build <1G.

    p2.ld is not in the new repository. There was a bunch of good stuff in llvm-propeller2/p2_dev_tests/ . That repository is gone.

    Put the dev tests back into the new repo, and added p2.ld into libp2--my bad.
  • @DavidZemon getting docker going has been a bit of a challenge--it ends up using a ton of memory and crashing. I'm trying it again with the release version of the build. Although if you got it building on linux, then there's not nearly as much need for a docker image. I'll still put one up once I'm done, but it will only build release versions for linux.
  • n_ermosh wrote: »
    @DavidZemon getting docker going has been a bit of a challenge--it ends up using a ton of memory and crashing. I'm trying it again with the release version of the build. Although if you got it building on linux, then there's not nearly as much need for a docker image. I'll still put one up once I'm done, but it will only build release versions for linux.

    I ran into the same problem when I forgot to set the build type to release... I have 32 GB of RAM, and even when set to -j1 (or no -j flag) it STILL ate all of my RAM. As soon as I switched back to "Release", it built fine.
  • also @SaucySoliton regarding "Some printfs go missing or are clobbered.", can you upload the resulting binary, and also the output you were getting that was clobbered?
  • Something in recent LLVM commits changed how the rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.
  • n_ermosh Posts: 107
    edited 2020-08-07 - 17:44:08
    I've fixed that issue so now the basic examples in p2_dev_tests should all run fine. Everything is currently in the dev branches, so use those. I also added @ersmith's spi test to there, I think it's a good reference for performance. I've some ideas for improving performance that I'll begin to implement. I also expanded to 32 registers instead of 16. I'll also keep adding more instructions and get the rest of the special registers encoded. More to come...
  • Very interesting...
    Can this do inline PASM2 assembly?
    Or, can it launch a binary assembly blob into a new cog?
  • Rayman wrote: »
    Very interesting...
    Can this do inline PASM2 assembly?
    Or, can it launch a binary assembly blob into a new cog?

    Yes you can do inline assembly. For any unimplemented instructions (or if you want condition modifiers or effect flags, which the parser doesn't support yet), you can always do something like
    #define storecnt()      __asm__ volatile (".long 0xfd63ec1a ;  .long 0xfc67ec10" : : : )
    
    but I'm working on getting everything added in, eventually.

    I don't have a library function written to launch a binary blob, but you can just use the coginit instruction directly and pass the d/s fields for the mode and the location of the blob and copy it in. What will take a bit more time is starting a compiled function in cogexec mode rather than hubexec mode, but I have ideas for how to do that.
  • Great. Sounds like you are on it...
  • Also turns out that reading/writing named registers is super easy. The latest code (dev branch) now lets you write statements like
    DIRA |= 1 << pin;
    
    for compatibility with older propgcc code.
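    For anyone porting propgcc-style code, a small sketch of pin helpers built on that statement follows. `DIRA` is declared here as an ordinary variable only so the snippet compiles on a host (an assumption; on the target the named register would be used instead), and `pin_output` / `pin_input` are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for the DIRA named register (assumption: on the target
 * the compiler provides DIRA itself, so this definition would be dropped). */
static volatile uint32_t DIRA;

/* Make `pin` an output by setting its direction bit. */
static void pin_output(unsigned pin)
{
    assert(pin < 32);
    /* Shift an unsigned 1: shifting a signed 1 into bit 31 is undefined in C. */
    DIRA |= UINT32_C(1) << pin;
}

/* Make `pin` an input by clearing its direction bit. */
static void pin_input(unsigned pin)
{
    assert(pin < 32);
    DIRA &= ~(UINT32_C(1) << pin);
}
```

    The unsigned constant only matters for pin 31, but it costs nothing and keeps the helper well defined for the whole port.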

    Somewhat unrelated, but does the P2 have a P1-style waitcnt() function that waits until the clock reaches a specific value? I know there's waitx, but that's not going to be as helpful for tightly timed loops (things like the below snippet, which I've used hundreds of times)
    int t = CNT;
    while(1) {
        //do stuff
        waitcnt(t += loop_period);
    }
    
  • Okay, what did I do wrong? I was trying to build p2llvm on my Mac mini and got the following error:
    dbetz@Davids-Mac-mini-2 build % cmake -G "Unix Makefiles" -DLLVM_ENABLE_PROJECTS="lld;clang" -DCMAKE_INSTALL_PREFIX=/opt/p2llvm
    CMake Warning:
      No source or binary directory provided.  Both will be assumed to be the
      same as the current working directory, but note that this warning will
      become a fatal error in future CMake releases.
    
    
    CMake Error: The source directory "/Users/dbetz/Dropbox/p2/p2llvm/llvm-project/build" does not appear to contain CMakeLists.txt.
    Specify --help for usage, or press the help button on the CMake GUI.
    
    I got this command line from the README.md file.
  • Add ../llvm to the end of the cmake line (no quotes needed).
  • n_ermosh wrote: »
    Somewhat unrelated, but does the P2 have a P1-style waitcnt() function that waits until the clock reaches a specific value? I know there's waitx, but that's not going to be as helpful for tightly timed loops (things like the below snippet, which I've used hundreds of times)
    int t = CNT;
    while(1) {
        //do stuff
        waitcnt(t += loop_period);
    }
    

    Yes, ADDCT1 D,#0 + WAITCT1 (or any of the other CT events). Not very nice for WAITCNT implementation though, since this blocks other usages of CT1
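    Whichever instruction ends up doing the waiting, the loop pattern itself can be checked on a host. The sketch below simulates a free-running 32-bit counter and waits for a target with a wrap-safe comparison; `sim_counter`, `sim_read`, `reached` and `sim_wait_until` are hypothetical names, and this models only the arithmetic of `t += loop_period`, not the behaviour of WAITCT1 or the P1's waitcnt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated free-running 32-bit counter (hypothetical stand-in for the
 * hardware system counter; it advances by `step` on every read). */
typedef struct {
    uint32_t now;
    uint32_t step;
} sim_counter;

static uint32_t sim_read(sim_counter *c)
{
    assert(c != NULL);
    c->now += c->step;          /* unsigned arithmetic wraps modulo 2^32 */
    return c->now;
}

/* Wrap-safe "has the counter reached target yet?" test. The signed view of
 * the difference is non-negative once `now` is at or past `target`, provided
 * the two are less than 2^31 ticks apart. (The unsigned-to-signed conversion
 * is implementation-defined in C; two's complement is assumed.) */
static int reached(uint32_t now, uint32_t target)
{
    return (int32_t)(now - target) >= 0;
}

/* Poll the simulated counter until it reaches `target`;
 * returns the counter value that was observed at that point. */
static uint32_t sim_wait_until(sim_counter *c, uint32_t target)
{
    uint32_t now;
    do {
        now = sim_read(c);
    } while (!reached(now, target));
    return now;
}
```

    Because each target is computed from the previous target rather than from the time the wait returned, the period does not drift even when the counter wraps past zero.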

    Also unrelated, but have you thought about what you're going to do with PTRA,PA,PB? These are neat because you can LOC an address into them (a relative one at that, too) and of course PTRB can do the nice addressing modes.
  • Maybe PTRA could be a stack pointer, PTRB could be a general pointer to access args/locals off the stack frame or LUT RAM given it has offset capabilities? PA/PB could be temporary regs you can also load into PTRB as needed if they also offer LOC functions for computing addresses of relocatable/relative constant data in executable areas? Maybe PA or PB can be used when calls are made using CALLD to act as a link register in leaf functions instead of CALLA? With auto increment/decrement capabilities PTRB could also be useful as a character pointer in some functions when parsing strings etc.

    There are probably several good uses of those registers to consider to help maximize performance.
  • @David Betz what @rogloh said, you need to point cmake to where the llvm source is, which is ../llvm. I’ll fix the instructions.

    Currently, I use ptra as the stack pointer and pa as a scratch register whenever I need to offset ptra without changing it, for things like
    mov pa, ptra
    sub pa, #4
    wrlong r0, pa
    

    Now I know that I can just wrlong r0 directly to ptra with an offset by using the special immediate, but there are a few places where that’s not super straightforward to implement. The solution is to add an optimization pass before machine code generation to replace that structure above with the direct write to ptra+offset, but one thing at a time...
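    A toy version of such a pass, written in plain C rather than against LLVM's APIs, might look like the following. All types and names are hypothetical, the offset is kept as a plain byte count, and the sketch assumes pa is not read again after the wrlong, which a real pass would have to prove.

```c
#include <assert.h>
#include <stddef.h>

/* Toy instruction representation; every name here is hypothetical and
 * unrelated to LLVM's MachineInstr classes. */
typedef enum { OP_MOV, OP_SUB_IMM, OP_WRLONG, OP_WRLONG_PTRA_OFF, OP_NOP } opcode;
typedef enum { REG_R0, REG_R1, REG_PA, REG_PTRA } reg;

typedef struct {
    opcode op;
    reg    a;     /* MOV, SUB_IMM: destination; WRLONG*: value register */
    reg    b;     /* MOV: source; WRLONG: address register; else unused */
    int    imm;   /* SUB_IMM: amount; WRLONG_PTRA_OFF: byte offset from ptra */
} insn;

/* Rewrite   mov pa, ptra / sub pa, #k / wrlong rX, pa
 * into a single write of rX at ptra - k, leaving two NOP placeholders.
 * Assumes pa is dead after the wrlong and that k fits the real offset
 * encoding; neither is checked here. Returns the number of rewrites. */
static size_t fold_ptra_offsets(insn *code, size_t n)
{
    size_t hits = 0;
    assert(code != NULL || n == 0);
    for (size_t i = 0; i + 2 < n; i++) {
        insn *m = &code[i], *s = &code[i + 1], *w = &code[i + 2];
        if (m->op == OP_MOV && m->a == REG_PA && m->b == REG_PTRA &&
            s->op == OP_SUB_IMM && s->a == REG_PA &&
            w->op == OP_WRLONG && w->b == REG_PA && w->a != REG_PA) {
            insn folded = { OP_WRLONG_PTRA_OFF, w->a, REG_PTRA, -s->imm };
            *m = folded;
            s->op = OP_NOP;
            w->op = OP_NOP;
            hits++;
            i += 2;   /* skip the two instructions just consumed */
        }
    }
    return hits;
}
```

    Run over the three-instruction sequence shown above, this leaves one offset write of r0 at ptra minus 4 plus two placeholders that a later cleanup step could delete.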

    I don’t currently have a use for ptrb or pb. My thought was to use ptrb as a lut ram pointer and push/pop callee saved registers to ptrb into lut ram and auto increment/decrement.

    I haven’t looked too much into the LOC instruction or what it does—sounds like it just loads a 20 bit address into ptra/b/pa/pb without an intermediate augs that you would need for a normal move? The relative addressing seems useful, though the only PC relative stuff I do is jumps right now, which can already take a full 20 bit address.

    There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.