I don’t currently have a use for ptrb or pb. My thought was to use ptrb as a lut ram pointer and push/pop callee saved registers to ptrb into lut ram and auto increment/decrement.
I've been thinking that it might be neat to use it to pass the C++ this pointer (or just the first pointer argument, in general) around, so functions operating on structured data could take advantage of the cool addressing modes. Not sure how big the benefit there is though.
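For illustration (a sketch, with a placeholder field offset): with the object pointer sitting in ptrb, a member load becomes a single indexed hub read:
rdlong r0, ptrb[3] ' fetch the fourth long field of *this in one instruction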
I haven’t looked too much into the LOC instruction or what it does—sounds like it just loads a 20 bit address into ptra/b/pa/pb without the intermediate AUGS you would need for a normal move? The relative addressing seems useful, though the only PC-relative stuff I do right now is jumps, which can already take a full 20 bit address.
Yes, that's what it does. It can also do absolute addresses, so you could even use it to load a 20 bit immediate in some cases.
There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.
Here are some, @n_ermosh:
If you know a C function doesn't call any other functions, it is a leaf function. This means it can be called differently by those who call it. Instead of pushing the return address on the stack you could have the return address automatically loaded into say PB, using "CALLD PB, #address" instead of "CALLA #address". Then when this leaf function returns you can use JMP PB (4 clocks) instead of RETA (11-18 clocks) to return. A lot faster.

Of course this means you need to know if a function calls another function to be able to identify it as a leaf. And function pointers that call leaf functions may need different handling too. They might need to call one instruction earlier which just POPs the stack into PB before the leaf function really begins. That's probably one way to handle that issue. Accessing local/parameter data needs to accommodate this slight stack difference too in leaf functions. I don't know if/how easy this is to customise LLVM for things like that.

Another option might be for the called function (not the caller) to save the link register on the stack only if needed, then you can always use CALLD form. It's worth looking into good ways to speed up the prolog/epilogs and keep them small at the same time.
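For illustration, the two call shapes side by side (a sketch; func and leaf_func are placeholder labels):

' calling a normal (non-leaf) function
calla #func ' pushes the return address onto the hub stack via PTRA
' ... the callee eventually returns with:
reta ' pops the return address (11-18 clocks)

' calling a known leaf function
calld pb, #leaf_func ' return address goes straight into PB, no stack traffic
' ... the leaf callee returns with:
jmp pb ' 4 clocks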
LOC will probably only be useful for holding constant (readonly) data buried inside executable code that needs to support relocation. If your read only data is always being stored at other fixed addresses it may not be as useful.
PTRB could be useful for a "register" variable in a function if used as a pointer that can automatically increment/decrement. Unfortunately there is only one of these resources though because PTRA is the SP (which makes sense). Both it and PTRA can also be used to read indexed values like parameters or locals off the hub stack relative to some base (if it fits the +/- 31 long range). Like this:
rdlong r0,ptra[-3] ' get long at [SP-12]
rdlong r1,ptra[-2] ' get long at [SP-8]
add r0, r1 ' etc
This is probably the best use I can think of, giving the most bang per buck with the P2 hardware capabilities. If you don't use autoinc/autodec or offsets, any other register can be used to access hub/LUT and PTRA/PTRB are not needed. If you want to use it to access LUT for your saved registers, that would be faster than individual hub reads/writes (LUT is 3 clocks to read, 2 to write), but maybe not if you can instead use setq bursts to save registers in HUB, depending on the number saved. Writing to LUT RAM is limited to 512 locations before other handling will be needed. That is a lot, but you would still need to handle the overflow case for recursion, which may complicate things, and it may make debugging a little trickier to follow if the saved registers are not on the hub stack but inside the LUT RAM. Weigh that one up too.
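For comparison, saving and restoring a group of registers on the hub stack with setq bursts could look roughly like this (a sketch; the count of 8 registers is illustrative):

setq #7 ' 7+1 longs follow
wrlong r0, ptra ' burst-write r0..r7 to hub starting at [ptra]
add ptra, #32 ' advance the SP past the 8 saved longs
' ... function body ...
sub ptra, #32 ' step back to the saved block
setq #7
rdlong r0, ptra ' burst-read r0..r7 back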
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
Thanks for your help!
The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?
A few places (which I'm working on documenting...).
llvm-project/llvm/lib/Target/P2 contains all the backend code for the Propeller. There's quite a bit going on here; I'd take a look at docs/Notes and Thoughts.md, which has links to a few good references.
clang/lib/Driver/ToolChains/ contains setup for the full toolchain. It doesn't do much yet other than create a p2 target for clang.
clang/lib/Basic/Targets/ contains the custom info for P2 specific code (like named registers, etc).
lld/ELF/Arch contains the code that performs relocation for the linker.
There were a few other folders that were edited to register P2 as a target and provide some global info for various parts of llvm, but most of the work is done by everything in the lib/Target/P2 folder.
also @SaucySoliton regarding "Some printfs go missing or are clobbered.", can you upload the resulting binary, and also the output you were getting that was clobbered?
Something in recent LLVM commits changed how the rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.
Adding an extra zero to the end might be a workaround.
printf("i as an int = %d, \0", i);
printf("i as an uint = %u, \0", i);
printf("i as a hex = %x\n\0", i);
Despite my best efforts to only change one character and keep the same length, the compiler always seems to change the ordering of the strings.
So given that adding \0 seemed to fix it, I was looking for a possible buffer overrun. I eventually got this, which looked like a pretty clean buffer overrun.
printf("i as an int = %d, XXXXXX", i);
printf("i as an uint = %u, YYYYYY", i);
printf("i as a hex = %x\n ZZZZZZ", i);
waitx(_CLOCKFREQ/10);
00000000 69 20 61 73 20 61 6e 20 69 6e 74 20 3d 20 32 31 |i as an int = 21|
00000010 34 37 34 38 33 36 33 37 2c 20 58 58 58 58 58 58 |47483637, XXXXXX|
00000020 38 cb 07 fb 39 c9 07 fb 3a c7 07 fb 3b c5 07 fb |8...9...:...;...|
00000030 3c c3 07 fb 3d c1 07 fb 30 f0 87 f1 2e 69 20 61 |<...=...0....i a|
00000040 73 20 61 6e 20 69 6e 74 20 3d 20 32 31 34 37 34 |s an int = 21474|
00000050 38 33 36 33 38 2c 20 58 58 58 58 58 58 38 cb 07 |83638, XXXXXX8..|
Except that the overrun didn't correspond to anything that came after the strings in question. A quick search finds the source of the data.
So main has the wrong string pointers in it. Also has the wrong registers. The offset is 420 bytes (0xA14) but that is not constant and it's not always the same between both calls. There are no other changes in main. The registers for the printf calls do not change.
Interesting that the trailing 0 was a workaround, but I actually found the root cause and fixed it in the dev branch (not yet merged to master as I'm finishing up another feature right now). I don't understand why the strings are being re-ordered, but at least now the relocation is keeping track of it properly and updating the reference to the correct address in rodata. Once I finish up the feature I'm currently working on (integrating the C standard lib, propeller standard lib, and linker scripts into clang directly), I'll merge to master and tag it as the next version "release".
It took me a little while to learn how to use branches. The string issue seems to be fixed now.
fft_bench v1.0 for PROPELLER
Freq. Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 12742 us
clock frequency = 180000000
fft_bench.elf is 8140 bytes
It's not using the rev instruction, so there would be an easy speed gain from that. I used to get 15 ms on this, which leads me to think that the change to 32 registers is only in the dev branch right now.
results: time / size of binary loaded to P2
riscvp2_lut: 8119 us 32412 bytes
fastspin: 13130 us 25180 bytes
p2gcc: 22740 us 20584 bytes
catalina: 24666 us 23552 bytes
@SaucySoliton I'm really liking the look of that LLVM generated fft_bench.elf file size in comparison with the binary image sizes of other compilers.
Was this elf file inclusive of all the serial IO stuff needed for the benchmark to run etc or were there other parts required that would have increased the entire binary size further?
Your gain from 15 ms to 12.7 ms is a good one with the increased number of registers. As prologs and epilogs get optimized I expect it should help boost it further. Any chance you might zip and post the disassembled listing of this benchmark here? We might find other optimisation candidates looking at it.
Unfortunately I still can't build LLVM; it only reaches 95% even if I skip the errors, and Clang seems to have some incompatibilities with the Apple Xcode header files on my Mac that prevent complete compilation. I'll probably need to upgrade to a new OS X and deal with that nightmare to fix it, and I'm not ready for that yet.
The LLVM version isn't built with full C libraries. Just a lightweight printf. The elf file is complete. fft_bench.o is about 5kB. It would be good to find out how much of the size numbers above are standard libraries. And with the size of the P2 memory, is 10kB a big deal?
I'll need to go out of my way to post the listing before Monday.
Thanks Saucy. Well, I don't think 10kB is a big deal at all. It was interesting to just compare the size with the other compilers to see how things are improving. I'd certainly expect the LLVM version's output size to be smaller than what p2gcc produces in the end, and if it isn't then it is certainly not optimal and we have a problem. p2gcc adds extra overhead in some cases from having fewer registers as well as not making full use of useful P2 instructions where it can to save space.
No problem if that listing is troublesome to post at the moment. Don't worry about it.
@SaucySoliton I merged your PR (and fixed a couple of merge conflicts). I was also able to build your fft benchmark (with a few changes to the defines). I've added __p2llvm__ as a standard define, so for these portable programs, let's start using that moving forward.
Regarding block transfers with setq. My implementation currently generates this:
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
mov pa, ptra
sub pa, #64 // get pointer to start of stack block
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
And it doesn't work. Does setq need to come directly before the corresponding wrlong? With how the stack allocation is working right now, I need to reserve all the space for it first before spilling any registers, so I can't use the auto-increment capabilities of ptra. Does anything about this seem obviously wrong? Obviously auto-incrementing would be better; it would save a lot of extra instructions, but I need to think about how to make that work in the llvm framework.
Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
RIP okay, that's going to take a little bit of effort. I'll table this for now for more important things on the todo list. I have ideas, will take me some time to implement them though.
I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
It shouldn't. Maybe you fixed other stuff with the compiler at the same time? I found that changing the optimizer level affected how many lines it sends. -O2 and above didn't work at all? I haven't looked into what is happening there yet.
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
mov pa, ptra
sub pa, #64 // get pointer to start of stack block
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
you could use this:
mov pa, ptra // get pointer to start of stack block
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
Note that pa remains unchanged after the transfer, I guess that is what you want for preserving registers and restoring the stack later perhaps? Like a block pointer. It's worth spending the time getting a good fast sequence here given it will be used when calling many functions.
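The matching restore at function exit could then be something like this (a sketch, assuming the same 16-long frame with the 10 saved registers at its base):

sub ptra, #64 ' release the frame; ptra again points at the saved block
setq #9 ' 9+1 longs follow
rdlong r0, ptra ' burst-read r0..r9 back from the hub stack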
One other thing @n_ermosh , in this particular example above it looks like you were reserving 16 longs on the stack but only saving 10 registers. So does this mean you are planning on always allocating a fixed space for saving up to maximum number of 16 working registers? If so, this will of course eat up stack space faster.
I found p2gcc seemed to allocate its internal working register use sequentially, which makes saving the number of registers easy and the stack pointer could be adjusted by the smallest amount needed each time. I don't know what LLVM tracks there to help you only save what is needed and consume the least amount of stack space required but I'd expect it is doable somehow.
@rogloh, In that example, the remaining 6 longs are being used for local variables; it was just a snippet I grabbed from one of my examples. I'm only allocating what's needed for every function (as computed by the magical llvm internals). It will take a bit of work to rework how the prologue is inserted to make the above work, but I think it might be possible. Currently, here's the rough order of things that generates my code above; llvm separately calls the functions that perform 1, 2, and 3 at various stages of compilation:
1. at function entry, insert prologue:
add ptra, #size
2. spill callee saved registers to the stack:
figure out the number of longs to save and which registers to start/end at, then insert those instructions:
setq #n_regs
wrlong r0, first_reg_frame_index
The frame index is currently still an abstract memory reference, and its exact offset (from the new stack pointer) is determined by llvm and not by my code.
3. Eliminate the frame index by copying ptra into pa, subtracting, and replacing the frame_index with pa:
mov pa, ptra
sub pa, #frame_index_offset
wrlong r0, pa
The resulting code ends up looking like:
add ptra, #size
setq #n_regs
mov pa, ptra
sub pa, #size
wrlong r0, pa
I have an idea for how to work around this to make it look more like your example, without doing weird band-aid things, but I need to study for my pilot's exam this week; hopefully I'll be able to get back to this soon and implement some more things from the todo list. I'll keep an eye on this thread if people have ideas though.
edit: it's actually super straightforward to change this to do a block transfer plus auto-increment PTRA. I might be able to get it done this weekend.
Turns out it was easy. The latest commits in the dev branch now use setq plus auto-incrementing reads/writes to ptra to push/pop registers in function calls, and manual stack allocation (for local variables, etc.) happens after that. Should see some performance improvements now. One more thing crossed off the list!
Do you expect you'll be able to use the indexed register hub access, e.g. rdlong r3, ptra[-20], for accessing locals or any arguments allocated on the stack with just a single instruction when they are not otherwise maintained in register variables? There are distance limits with this, but maybe the first 32 items below the stack pointer location could be accessible this way for a performance improvement too. The compiler would of course need to know when this faster form of hub memory access is applicable. I imagine trying to support alloca() later might be a problem unless the second pointer, ptrb, is used as a frame pointer, ideally pointing between locals and arguments to give maximal access range, but that should not be a priority at this stage.
I do know the AVR chips also use this type of memory access method with GCC code and have similar distance limits too with LDD (load indirect with displacement) instructions. See below for the Y+n form with LDD reads. For AVR "n" is limited to 63 bytes while P2 would have limits of 32 bytes, words or longs below the current stack pointer depending on the read/write size which is still useful.
int main(void) {
6c: cf 93 push r28
6e: df 93 push r29
70: cd b7 in r28, 0x3d ; 61
72: de b7 in r29, 0x3e ; 62
74: 29 97 sbiw r28, 0x09 ; 9
76: 0f b6 in r0, 0x3f ; 63
78: f8 94 cli
7a: de bf out 0x3e, r29 ; 62
7c: 0f be out 0x3f, r0 ; 63
7e: cd bf out 0x3d, r28 ; 61
volatile int foo;
volatile char bar[7];
while (1) {
PORTB = foo;
80: 88 85 ldd r24, Y+8 ; 0x08
82: 99 85 ldd r25, Y+9 ; 0x09
84: 88 bb out 0x18, r24 ; 24
PORTB = bar[3];
86: 8c 81 ldd r24, Y+4 ; 0x04
88: 88 bb out 0x18, r24 ; 24
8a: fa cf rjmp .-12 ; 0x80 <main+0x14>
@rogloh it should be pretty easy to implement as well. Since stack slots are always referred to by an arbitrary frame index (stack pointer +/- an offset), and replaced at the end with actual registers, it should be pretty easy to add logic that checks the size/alignment of the offset, then chooses to either use the indexed ptra mode, or just save ptra to a register and subtract (like I currently do), or maybe insert augs to increase the possible offset distance. There are probably a few corner cases where this could break something, but I'll cross that bridge. My next task is to test out dynamic memory allocation and see how that works and what will happen.
One major aspect of LLVM I haven't even touched yet is that you can give it hints about the cost of executing a particular instruction. This is then used internally for scheduling as well as optimization to pick the appropriate instructions, without any effort from me other than just defining the costs. This can be used to prefer storing things in registers (which it already does), but also, interestingly, for non-blocking instructions like qdiv or qmul it can be told to place the getqx/getqy later if it doesn't need the result immediately, so that the CORDIC can run in the background until complete.
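For example (a sketch with placeholder registers), independent work can be scheduled between starting the CORDIC operation and fetching its result:

qmul r0, r1 ' start an unsigned 32x32 multiply in the CORDIC
add r2, r3 ' independent instructions overlap with the CORDIC
shl r4, #2
getqx r0 ' low 32 bits of the product, fetched only once needed
getqy r1 ' high 32 bits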
Sounds good. If you do use AUGS (you may need to in some cases) it could require several instructions to read offsets from the stack pointer. Eg.
mov pa, ptra ' ptra is SP
sub pa, ##nnnn ' uses AUGS once nnnn exceeds 511
rdlong r3, pa
This creates 4 instructions (counting the AUGS) instead of 1 for each local access, which, as well as being somewhat slower, can quickly bloat the code. Eg. it burns 16 bytes to read something and maybe another 16 bytes elsewhere to write the result back.
I also sort of wonder if any of the ALT instructions can be used somehow with AUGS to save an instruction but am not sure there.
If you can make use of the hardware multiplier that would be good too. Unfortunately the really fast multiply is limited to 16 bit operands, but maybe you can still make good use of it in internal calculations where possible if you know the inputs will fit.
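For example (a sketch), when both operands are known to fit in 16 bits:

mul r0, r1 ' fast unsigned 16x16 multiply, 32-bit result in r0
' otherwise fall back to the CORDIC for a full 32x32 multiply:
qmul r0, r1
getqx r0 ' low 32 bits of the product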
The other P2 things to try to leverage are the bit operations, addsx etc for 64 bit stuff, maybe byte accesses getbyte/setbyte if this is detectable in the code, DJZ loop stuff, any small skip sequences where possible and good use of condition flag execution.
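For instance, a 64-bit add is just two instructions (a sketch, with placeholder register pairs holding the low/high longs):

add lo0, lo1 wc ' add the low longs, carry into C
addx hi0, hi1 ' add the high longs plus the carry (addsx where signed flags are wanted)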
Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction when there's a large offset, effectively doing:
augs #nnnnn
rdlong r3, ptra[nnnnn]
Condition flag execution is a big one that I'm not yet set up to do. However, I bet there's a post-compilation optimizer pass that can be written to convert small loops or branches into conditional execution and insert skips and such. LLVM has a nice structure that lets you write passes for each function at various stages of compilation and do whatever you want to the function, on top of everything LLVM does. I have one currently called "expand pseudo instructions" which converts pseudo instructions I define (that don't have an equivalent P2 instruction, but I'm telling LLVM one will exist) into a small series of real instructions.
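For instance (a sketch, with placeholder registers), a small if/else can collapse into conditionally executed instructions instead of a branch:

cmp r0, r1 wcz ' compare, setting C and Z
if_c mov r2, r0 ' take r0 when it is below r1 (unsigned)
if_nc mov r2, r1 ' otherwise take r1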
Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction for when there's a large offset. effectively doing:
augs #nnnnn
rdlong r3, ptra[nnnnn]
If you can actually do that it would be very good. But you may need to test that idea out first, because RDLONG addresses are encoded differently and so it may not work the same as you'd assume. I'm guessing that it won't work but would be very happy to be proven wrong there. EDIT: okay I read page 61 and you might be correct! That's very efficient with only a second instruction needed over the "rdlong r0, ptra[10]" form.
For loops and branches, the conditional execution should be very common with z as well as c flags for typical greater/less than comparisons for example. TJZ and TJNZ should come in very useful too.
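For example (a sketch, with placeholder registers and labels), a zero test folds into a single test-and-jump, and a counted loop closes with one decrement-and-jump:

tjz r0, #done ' jump to done if r0 is zero, replacing a cmp #0 plus conditional jmp
djnz count, #loop ' decrement count and jump back to loop while it is non-zero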
I'm currently working out dynamic memory allocation, but next on my list is optimizing loops. Defining a pattern to match comparison to 0 should be pretty easy to save an instruction. Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance, as any really high performance code can always be written in assembly (although I think an end user should never have to touch assembly unless doing something really, really specialized). Lots to do...
Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance
Absolutely. Once it's (fully) functional, folks can start using it to write libraries while you simultaneously enhance performance. The optimization rabbit hole is never ending....
For anyone building anything using cmake: please forget that gnu make exists. It's highly counter-productive. Use ninja everywhere. It's the de facto modern build tool. Once you experience its performance, going back to gnu make will sound and feel ridonkulous. make tries to be a general purpose tool, and its code base and performance bear that out. ninja tries to do one job only and does it extremely well. `cmake -G Ninja` all the way. Thank me later. Yes, you need to install ninja.