Speaking of which, are you sure you only want to allocate 16 GPRs? With how big the COG/LUT RAMs are, there's surely space for more (and from experience, 16 can get really cramped in some situations).
Yes Wuerfel_21, I would expect that if the setq burst-transfer method can be used to save/restore registers, then increasing the register space should improve performance at only a minor expense of one clock per extra register saved or restored. The 16 regs with p2gcc are certainly far too cramped once some get used for parameter passing, locals, etc., so something like at least 32 registers, as the AVR uses, could quite easily perform better. The P1 implementation that p2gcc originally came from paid a big penalty to save/restore these registers, but the P2 should not, thanks to setq, so the register count could increase.
More registers is not always better. I experimented quite a lot with the number of registers for Catalina on the P1, expecting this would be true - but while 16 was too few, it turned out that 32 was too many - the overhead in both space and time of saving and loading them was actually detrimental to performance. I found 24 was the best compromise. Of course, this will vary from compiler to compiler.
RossH, I would think that would be true if you always have to save/restore a fixed set of registers. If you only need to save the registers you actually clobber inside the function, a larger register count becomes less detrimental, and code space is certainly not an issue with the setq transfer approach, since on the P2 the instruction count doesn't vary with the number of registers transferred. It would if you had to save/load registers sequentially, as the P1 did. It's an interesting problem to try to optimize.
@n_ermosh, I haven't tested it yet... but I was able to complete a build
I've installed it locally and assembled an RPM which is now uploaded to https://david.zemon.name/downloads/LLVM-12.0.0git-Linux.rpm
I tried building a .deb, but there is a CMake/CPack variable missing (CPACK_PACKAGE_CONTACT or CPACK_DEBIAN_PACKAGE_MAINTAINER), which stops the packaging process.
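For what it's worth, a minimal sketch of the CPack settings that should unblock .deb generation (the maintainer string below is a placeholder; these lines would go before include(CPack) in the top-level CMakeLists.txt, or be passed as -D cache options):

set(CPACK_PACKAGE_CONTACT "Maintainer Name <maintainer@example.com>") # placeholder contact
set(CPACK_DEBIAN_PACKAGE_MAINTAINER "${CPACK_PACKAGE_CONTACT}") # reused for the Debian field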
Regarding GP register count, changing it is extremely simple, maybe 5 lines of code in the compiler. This is what I love about LLVM and propeller together. LLVM is designed to be flexible to match the architecture, and our architecture is flexible as well, meaning with little effort we can do a lot. I picked 16 registers arbitrarily. Going to 32 only reduces the code space you have for cog-based code/programs, but doesn’t present any real complexity. Could even do 64 if desired. The prologue/epilogue inserter will only push/pop registers that get clobbered inside the function, as you said @rogloh.
@DavidZemon glad to hear it! Let me know if you are able to get anything building and loading
I'll switch this over to Docker once you have that working, but in the meantime, we have .rpm and .tar.gz artifacts for Linux published on TeamCity now: https://ci.zemon.name/project/P2llvm?guest=1
Took a look at your builds and it looks like they're missing clang and lld--did you enable those projects when configuring the build? I made some changes to add P2 as a target, and more changes will come, so we can't use the out-of-the-box clang (yet).
I'll update once the docker is working--the intention is for it to just be a build system for linux systems and not a run environment. But it looks like you were able to get the project building on linux so that's good.
Interesting. I'll have to play around with the CMake options some more. I think the quotes are getting parsed a bit weird... I don't have the problem locally.
Well, it took A LOT of different tries to get it working... but it finally is. Turns out, no escaping is needed at all in the CMake options text box on TeamCity... the whole text box must be passed in as one big string into something, because not even the semicolon needed escaping/quoting. Anyway, artifacts are live now.
@rogloh Yes, I did those experiments on the P1. I should redo them on the P2 - the answer might be different. But then I would have to re-write the code generator.
However, you still have to save/restore all registers on an interrupt, so the argument still holds. More registers is not necessarily always better.
Of course, not all compilers allow an arbitrary C function to be used as an interrupt handler as Catalina does. Swings and roundabouts!
On P2, if you use setq/rdlong and setq/wrlong to save and restore registers, the extra cost is literally just 1 cycle per register with no additional code overhead (the setq says how many to transfer). So yeah, more registers is almost certainly a win on P2. It's probably one of the reasons riscvp2 performs so well despite the JIT compilation overhead: with 32 (well, really only 31) registers to play with, memory traffic gets reduced, and memory loads/stores are pretty expensive.
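A sketch of what that looks like in PASM2 (assuming ptra is the hub stack pointer and r0..r15 are contiguous cog registers; this is illustrative, not p2llvm's actual prologue/epilogue):

setq    #15             ' 15+1 = 16 longs in one burst
wrlong  r0, ptra        ' save r0..r15 to hub starting at ptra

setq    #15
rdlong  r0, ptra        ' restore r0..r15 in one burst

Saving fewer registers only changes the setq immediate, so code size stays constant and the time scales at roughly one clock per extra register.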
True, you would need to save all the registers you actually use in the interrupt routine, yes. For C-based ISRs this would be the full set; for PASM-based ISRs hopefully somewhat less, or maybe none if you can choose a different register range for that code. It's a trade-off between general performance and C interrupt latency. On a P2 the interrupt latency might not always be the first priority, though in some cases it could be. Not all C code people write will use interrupt routines or have them coded in C, but applications that do may want lower latency, and their authors may be happy to run the general code a little less efficiently. Hopefully, if we get an option to write ISRs in PASM and fully control which registers need saving, that could at least provide a path to reducing latency further.
It's a classic trade-off to choose the optimum solution, if there even is one.
It can be built with gcc. Not recommended as clang compiles faster. My build dir is 50GB. And binaries are absurdly large.
Yes, when building with all the debug stuff the build is massive. My installed binaries totaled 11 GB... you can build the release version by adding the flag "-DCMAKE_BUILD_TYPE=Release" to the cmake configuration step, which brings the build under 1 GB.
@DavidZemon getting docker going has been a bit of a challenge--it ends up using a ton of memory and crashing. I'm trying it again with the release version of the build. Although if you got it building on linux, there's not nearly as much need for a docker image. I'll still put one up once I'm done, but it will only build release versions for linux.
I ran into the same problem when I forgot to set the build type to release... I have 32 GB of RAM, and even when set to -j1 (or no -j flag) it STILL ate all of my RAM. As soon as I switched back to "Release", it built fine.
I have test12.c running. That's basically a hello world. Some printfs go missing or are clobbered.

p2.ld is not in the new repository. There was a bunch of good stuff in llvm-propeller2/p2_dev_tests/, but that repository is gone.

Put the dev tests back into the new repo, and added p2.ld into libp2--my bad.

Also, @SaucySoliton, regarding "Some printfs go missing or are clobbered.": can you upload the resulting binary, and also the output you were getting that was clobbered?
Something in recent LLVM commits changed how the rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.
I've fixed that issue, so now the basic examples in p2_dev_tests should all run fine. Everything is currently in the dev branches, so use those. I also added @ersmith's SPI test there; I think it's a good reference for performance. I have some ideas for improving performance that I'll begin to implement. I also expanded to 32 registers instead of 16. I'll also keep adding more instructions and get the rest of the special registers encoded. More to come...
Very interesting...
Can this do inline PASM2 assembly?
Or, can it launch a binary assembly blob into a new cog?
Yes you can do inline assembly. For any unimplemented instructions (or if you want condition modifiers or effect flags, which the parser doesn't support yet), you can always do something like
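(for example, emitting a raw instruction word from inline assembly; the 32-bit value below is a made-up placeholder, not a real P2 encoding:)

asm(".long 0x12345678"); // placeholder: raw encoding of the desired instruction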
but I'm working on getting everything added in, eventually.
I don't have a library function written to launch a binary blob, but you can copy the blob in and then use the coginit instruction directly, passing the d/s fields for the mode and the location of the blob. What will take a bit more time is starting a compiled function in cogexec mode rather than hubexec mode, but I have ideas for how to do that.
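A minimal sketch of that approach (hypothetical helper; assumes the blob is already in hub RAM, that GCC-style inline-asm operand constraints are supported, and uses coginit mode %01_0000 = #16, i.e. load the blob P1-style into any free cog):

// Hypothetical: launch a hub-resident PASM2 blob in a new cog.
void cognew_blob(void *code, void *par) {
    asm volatile("setq %0\n"         // value lands in the new cog's ptra
                 "coginit #16, %1"   // #16 = %01_0000: load into any free cog, start at cog addr 0
                 : : "r"(par), "r"(code));
}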
Also, it turns out that reading/writing named registers is super easy. The latest code (dev branch) now lets you do statements like

DIRA |= 1 << pin;

for compatibility with older propgcc code.
Somewhat unrelated, but does P2 have a P1-style waitcnt() function, waiting until the clock value reaches a specific value? I know there's waitx, but that's not going to be as helpful for tightly timed loops (things like the snippet below, which I've used hundreds of times):

int t = CNT;
while (1) {
    // do stuff
    waitcnt(t += loop_period);
}
Okay, what did I do wrong? I was trying to build p2llvm on my Mac mini and got the following error:
dbetz@Davids-Mac-mini-2 build % cmake -G "Unix Makefiles" -DLLVM_ENABLE_PROJECTS="lld;clang" -DCMAKE_INSTALL_PREFIX=/opt/p2llvm
CMake Warning:
No source or binary directory provided. Both will be assumed to be the
same as the current working directory, but note that this warning will
become a fatal error in future CMake releases.
CMake Error: The source directory "/Users/dbetz/Dropbox/p2/p2llvm/llvm-project/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
Re waitcnt(): yes, ADDCT1 D,#0 + WAITCT1 (or any of the other CT events). Not very nice for a waitcnt() implementation though, since this blocks other usages of CT1.
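A sketch of how a P1-style waitcnt() could be built on that pair (hypothetical helper; assumes inline-asm operand constraints as above, and note it claims CT1 for itself):

// Hypothetical waitcnt(): block until the system counter reaches target.
static inline void waitcnt(unsigned target) {
    asm volatile("addct1 %0, #0\n"   // set CT1 event target (adds 0, so target is unchanged)
                 "waitct1"           // stall until the counter reaches it
                 : : "r"(target));
}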
Also unrelated, but have you thought about what you're going to do with PTRA, PA, and PB? These are neat because you can LOC an address into them (a relative one at that, too), and of course PTRB can do the nice addressing modes.
Maybe PTRA could be the stack pointer, and PTRB a general pointer for accessing args/locals off the stack frame (or LUT RAM), given its offset capabilities. PA/PB could be temporary registers that you can also load into PTRB as needed, since they offer LOC for computing addresses of relocatable/relative constant data in executable areas. Maybe PA or PB could act as a link register in leaf functions when calls are made using CALLD, instead of CALLA. With its auto-increment/decrement capabilities, PTRB could also be useful as a character pointer in functions that parse strings, etc.
There are probably several good uses of those registers to consider to help maximize performance.
@"David Betz" what @rogloh said, you need to point cmake to where the llvm source is, which is ../llvm. I’ll fix the instructions.
Currently, I use ptra as the stack pointer and pa as a scratch register whenever I need to offset ptra without changing it, for things like

mov     pa, ptra        ' copy the stack pointer
sub     pa, #4          ' apply the offset
wrlong  r0, pa          ' store r0 at sp - 4
Now I know that I can just wrlong r0 directly to ptra with an offset by using the special immediate, but there are a few places where that's not super straightforward to implement. The solution is to add an optimization pass before machine code generation that replaces the structure above with the direct write to ptra+offset, but one thing at a time...
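That is, the three instructions above should collapse into a single write using a PTRx index expression (scaled by 4 for long accesses):

wrlong  r0, ptra[-1]    ' write r0 to ptra - 4 without modifying ptra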
I don't currently have a use for ptrb or pb. My thought was to use ptrb as a LUT RAM pointer and push/pop callee-saved registers through it into LUT RAM, with auto-increment/decrement.
I haven't looked too much into the LOC instruction or what it does--sounds like it just loads a 20-bit address into ptra/b/pa/pb without the intermediate augs that you would need for a normal mov? The relative addressing seems useful, though the only PC-relative stuff I do is jumps right now, which can already take a full 20-bit address.
There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.