I don’t currently have a use for ptrb or pb. My thought was to use ptrb as a lut ram pointer and push/pop callee saved registers to ptrb into lut ram and auto increment/decrement.
I've been thinking that it might be neat to use it to pass the C++ this pointer (or just the first pointer argument, in general) around, so functions operating on structured data could take advantage of the cool addressing modes. Not sure how big the benefit there is though.
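For what it's worth, a tiny sketch of how that could look (the three-long struct layout and the register names here are just made up): if ptrb carried the this pointer, field accesses become single indexed reads/writes, with the index scaled by the access size:
rdlong  r0, ptrb[0]         ' load this->x  (offset 0)
rdlong  r1, ptrb[2]         ' load this->z  (offset 8; rdlong scales the index by 4)
add     r0, r1
wrlong  r0, ptrb[1]         ' store into this->y (offset 4)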
I haven’t looked too much into the LOC instruction or what it does—sounds like it just loads a 20 bit address into ptra/b/pa/pb without an intermediate augs that you would need for a normal move? The relative addressing seems useful, though the only PC relative stuff I do is jumps right now, which can already take a full 20 bit address.
Yes, that's what it does. It can also do absolute addresses, so you could even use it to load a 20 bit immediate in some cases.
There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.
Here's some, @n_ermosh :
If you know a C function doesn't call any other functions, it is a leaf function. This means it can be called differently by those who call it. Instead of pushing the return address on the stack you could have the return address automatically loaded into say PB, using "CALLD PB, #address" instead of "CALLA #address". Then when this leaf function returns you can use JMP PB (4 clocks) instead of RETA (11-18 clocks) to return. A lot faster. Of course this means you need to know if a function calls another function to be able to identify it as a leaf. And function pointers that call leaf functions may need different handling too. They might need to call one instruction earlier which just POPs the stack into PB before the leaf function really begins. That's probably one way to handle that issue. Accessing local/parameter data needs to accommodate this slight stack difference too in leaf functions. I don't know if/how easy this is to customise LLVM for things like that. Another option might be for the called function (not the caller) to save the link register on the stack only if needed, then you can always use CALLD form. It's worth looking into good ways to speed up the prolog/epilogs and keep them small at the same time.
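A minimal sketch of that convention (the label and register names are invented):
' caller of a known leaf function:
        calld   pb, #leaf_fn    ' return address lands in PB; nothing is pushed
' the leaf itself:
leaf_fn
        ' ...body, must not call anything or clobber PB...
        jmp     pb              ' 4-clock return instead of an 11-18 clock RETA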
LOC will probably only be useful for holding constant (readonly) data buried inside executable code that needs to support relocation. If your read only data is always being stored at other fixed addresses it may not be as useful.
PTRB could be useful for a "register" variable in a function if used as a pointer that can automatically increment/decrement. Unfortunately there is only one of these resources though because PTRA is the SP (which makes sense). Both it and PTRA can also be used to read indexed values like parameters or locals off the hub stack relative to some base (if it fits the +/- 31 long range). Like this:
rdlong r0,ptra[-3] ' get long at [SP-12]
rdlong r1,ptra[-2] ' get long at [SP-8]
add r0, r1 ' etc
This is probably the best bang-per-buck use of the P2 hardware capabilities I can think of. If you don't use autoinc/autodec or offsets, any other register can be used to access hub/LUT and PTRA/PTRB are not needed. If you want to use it to access LUT for your saved registers, that would be faster than individual hub reads/writes (3 clocks to read/2 to write in LUT), but maybe not so if you can instead use setq bursts to save registers in HUB, depending on the number saved. Writing to LUT RAM will be limited to 512 locations before other handling will be needed. That is a lot, but you would still need to handle the overflow case for recursion, which may make things complicated, and it may make debugging a little trickier to follow if the saved registers are not on the hub stack but inside the LUT RAM. Weigh that one up too.
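A minimal sketch of the two options, assuming WRLUT accepts PTRB expressions (my reading of the instruction set docs) and using made-up register names:
        wrlut   r4, ptrb++      ' save one register per instruction into LUT RAM
        wrlut   r5, ptrb++
        wrlut   r6, ptrb++
' versus a single burst to the hub stack:
        setq    #2              ' 2+1 = 3 longs
        wrlong  r4, ptra        ' block-write r4..r6 to the hub stack in one go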
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
Thanks for your help!
The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?
A few places (which I'm working on documenting...).
llvm-project/llvm/lib/Target/P2 contains all the backend code for propeller. There's quite a bit going on here, I'd take a look at docs/Notes and Thoughts.md which has links to a few good references.
clang/lib/Driver/ToolChains/ contains setup for the full toolchain. doesn't do much yet other than create a p2 target for clang.
clang/lib/Basic/Targets/ contains the custom info for P2 specific code (like named registers, etc).
lld/ELF/Arch contains the code that performs relocation for the linker.
There were a few other folders that were edited to register P2 as a target and provide some global info for various parts of llvm, but most of the work is done by everything in the lib/Target/P2 folder.
also @SaucySoliton regarding "Some printfs go missing or are clobbered.", can you upload the resulting binary, and also the output you were getting that was clobbered?
Something in recent LLVM commits changes how rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.
Adding an extra zero to the end might be a workaround.
printf("i as an int = %d, \0", i);
printf("i as an uint = %u, \0", i);
printf("i as a hex = %x\n\0", i);
Despite my best efforts to only change one character and keep the same length, the compiler always seems to change the ordering of the strings.
So given that adding \0 seemed to fix it, I was looking for a possible buffer overrun. I eventually got this, which looked like a pretty clean buffer overrun.
printf("i as an int = %d, XXXXXX", i);
printf("i as an uint = %u, YYYYYY", i);
printf("i as a hex = %x\n ZZZZZZ", i);
waitx(_CLOCKFREQ/10);
00000000  69 20 61 73 20 61 6e 20 69 6e 74 20 3d 20 32 31  |i as an int = 21|
00000010  34 37 34 38 33 36 33 37 2c 20 58 58 58 58 58 58  |47483637, XXXXXX|
00000020  38 cb 07 fb 39 c9 07 fb 3a c7 07 fb 3b c5 07 fb  |8...9...:...;...|
00000030  3c c3 07 fb 3d c1 07 fb 30 f0 87 f1 2e 69 20 61  |<...=...0....i a|
00000040  73 20 61 6e 20 69 6e 74 20 3d 20 32 31 34 37 34  |s an int = 21474|
00000050  38 33 36 33 38 2c 20 58 58 58 58 58 58 38 cb 07  |83638, XXXXXX8..|
Except that the overrun didn't correspond to anything that came after the strings in question. A quick search finds the source of the data:
a10: 38 cb 07 fb     rdlong r5, #312
a14: 39 c9 07 fb     rdlong r4, #313
a18: 3a c7 07 fb     rdlong r3, #314
a1c: 3b c5 07 fb     rdlong r2, #315
a20: 3c c3 07 fb     rdlong r1, #316
a24: 3d c1 07 fb     rdlong r0, #317
a28: 30 f0 87 f1     sub ptra, #48
a2c: 2e 00 64 fd     reta
Disassembly of section .text:
00000400 <main>:
460: f9 c1 63 fc     wrlong r0, ptrb
464: 05 00 00 ff     augs #5
468: c5 c1 07 f6     mov r0, #453  $bc5 .str    | 468: f7 c1 07 f6     mov r0, #503  $bf7 .str
46c: 05 00 00 ff     augs #5
470: de c3 07 f6     mov r1, #478  $bde .str.1  | 470: 10 c4 07 f6     mov r2, #16   $a10 expecting 0bb4 <.str.1>
474: 05 00 00 ff     augs #5
478: f8 c5 07 f6     mov r2, #504  $bf8 .str.2  | 478: 2a c6 07 f6     mov r3, #42   $a2a expecting bce <.str.2>
47c: 96 98 00 ff     augs #39062
480: 00 c7 07 f6     mov r3, #256
484: f8 f3 03 f6     mov ptrb, ptra
So main has the wrong string pointers in it. Also has the wrong registers. The offset is 420 bytes (0xA14) but that is not constant and it's not always the same between both calls. There are no other changes in main. The registers for the printf calls do not change.
Interesting that the trailing 0 was a workaround, but I actually found the root cause and fixed it in the dev branch (not yet merged to master as I'm finishing up another feature right now). I don't understand why the strings are being re-ordered, but at least now the relocation is keeping track of it properly and updating the reference to the correct address in rodata. Once I finish up the feature I'm currently working on (integrating the C standard lib, propeller standard lib, and linker scripts into clang directly), I'll merge to master and tag it as the next version "release".
It took me a little while to learn how to use branches. The string issue seems to be fixed now.
fft_bench v1.0 for PROPELLER
Freq. Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 12742 us
clock frequency = 180000000
fft_bench.elf is 8140 bytes
It's not using the rev instruction, so there would be an easy speed gain from that. I used to get 15 ms on this, which leads me to think that the change to 32 registers is only in the dev branch right now.
results: time / size of binary loaded to P2
riscvp2_lut: 8119 us 32412 bytes
fastspin: 13130 us 25180 bytes
p2gcc: 22740 us 20584 bytes
catalina: 24666 us 23552 bytes
@SaucySoliton I'm really liking the look of that LLVM generated fft_bench.elf file size in comparison with the binary image sizes of other compilers.
Was this elf file inclusive of all the serial IO stuff needed for the benchmark to run etc or were there other parts required that would have increased the entire binary size further?
Your gain from 15 ms to 12.7 ms is a good one with the increased number of registers. As prologs and epilogs get optimized I expect it should help boost it further. Any chance you might zip and post the disassembled listing of this benchmark here? We might find other optimisation candidates looking at it.
Unfortunately I still can't build LLVM, it only reaches up to 95% if I skip the errors and Clang seems to have some incompatibilities with the Apple Xcode header files on my Mac preventing complete compilation. I'll probably need to upgrade to a new OS X and deal with that nightmare to fix it, and I'm not ready for that yet.
The LLVM version isn't built with full C libraries. Just a lightweight printf. The elf file is complete. fft_bench.o is about 5kB. It would be good to find out how much of the size numbers above are standard libraries. And with the size of the P2 memory, is 10kB a big deal?
I'll need to go out of my way to post the listing before Monday.
Thanks Saucy. Well I don't think 10kB is a big deal at all. It was interesting to just compare the size with the other compilers to see how things are improving. I'd certainly expect LLVM version output size should be smaller than what p2gcc does in the end, and if it wasn't then it is certainly not optimal and we have a problem. p2gcc adds extra overhead in some cases from having less registers as well as not making full use of useful P2 instructions where it can to save space.
No problem if that listing is troublesome to post at the moment. Don't worry about it.
@SaucySoliton I merged your PR (and fixed a couple of merge conflicts). I was also able to build your fft benchmark (with a few changes to the defines). I've added __p2llvm__ as a standard define, so for these portable programs, let's start using that moving forward.
Regarding block transfers with setq. My implementation currently generates this:
add ptra, #64    // reserve 16 longs on the stack
setq #9          // setup to write 9+1 longs
mov pa, ptra
sub pa, #64      // get pointer to start of stack block
wrlong r0, pa    // write r0 to that pointer and then the next 9 sequential registers
And it doesn't work. Does setq need to come directly before the corresponding wrlong? With how the stack allocation is working right now, I need to reserve all the space for it first before spilling any registers, so I can't use the auto-increment capabilities of ptra. Does anything about this seem obviously wrong? Obviously auto-incrementing would be better (it would save a lot of extra instructions), but I need to think about how to make that work in the llvm framework.
Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
RIP okay, that's going to take a little bit of effort. I'll table this for now for more important things on the todo list. I have ideas, will take me some time to implement them though.
I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
fft_bench v1.0 for PROPELLER
Freq.    Magnitude
00000000 000001fe
00000001 0000015f
00000002 000000ae
00000003 000000f8
00000004 00000043
...
000001fb 00000005
000001fd 00000002
000001ff 00000001
1024 point bit-reversal and butterfly run time = 11902 us
clock frequency = 180000000
but it did seem to speed it up a bit.
It shouldn't. Maybe you fixed other stuff with the compiler at the same time? I found that changing the optimizer level affected how many lines it sends. -O2 and above didn't work at all? I haven't looked into what is happening there yet.
add ptra, #64    // reserve 16 longs on the stack
setq #9          // setup to write 9+1 longs
mov pa, ptra
sub pa, #64      // get pointer to start of stack block
wrlong r0, pa    // write r0 to that pointer and then the next 9 sequential registers
you could use this:
mov pa, ptra     // get pointer to start of stack block
add ptra, #64    // reserve 16 longs on the stack
setq #9          // setup to write 9+1 longs
wrlong r0, pa    // write r0 to that pointer and then the next 9 sequential registers
Note that pa remains unchanged after the transfer, I guess that is what you want for preserving registers and restoring the stack later perhaps? Like a block pointer. It's worth spending the time getting a good fast sequence here given it will be used when calling many functions.
One other thing @n_ermosh , in this particular example above it looks like you were reserving 16 longs on the stack but only saving 10 registers. So does this mean you are planning on always allocating a fixed space for saving up to maximum number of 16 working registers? If so, this will of course eat up stack space faster.
I found p2gcc seemed to allocate its internal working register use sequentially, which makes saving the number of registers easy and the stack pointer could be adjusted by the smallest amount needed each time. I don't know what LLVM tracks there to help you only save what is needed and consume the least amount of stack space required but I'd expect it is doable somehow.
@rogloh, in that example the remaining 6 longs are being used for local variables; it was just a snippet I grabbed from one of my examples. I'm only allocating what's needed for every function (as computed by the magical llvm internals). It will take a bit of work to rework how the prologue is inserted to make the above work, but I think it might be possible. Currently, here's the rough order of things that generates my code above; llvm separately calls the functions that perform steps 1, 2, and 3 at various stages of compilation:
1. at function entry, insert prologue:
add ptra, #size
2. spill callee saved registers to the stack:
figure out the number of longs to save and which registers the block starts/ends with, then insert those instructions:
setq #n_regs
wrlong r0, first_reg_frame_index
The frame index is currently still an abstract memory reference, and its exact offset (from the new stack pointer) is determined by llvm and not by my code.
3. Eliminate the frame index by storing ptra to pa, subtracting, and replacing the frame_index with pa
move pa, ptra
sub pa, #frame_index_offset
wrlong r0, pa
The resulting code ends up looking like:
add ptra, #size
setq #n_regs
mov pa, ptra
sub pa, #size
wrlong r0, pa
I have an idea for how to work around this to make it look more like your example, without doing weird band-aid things, but I need to study for my pilot's exam this week, hopefully will be able to get back to this soon and implement some more things from the todo list. I'll keep an eye on this thread if people have ideas though.
edit: it's actually super straightforward to change this to do a block transfer plus auto-increment PTRA. I might be able to get it done this weekend.
Turns out it was easy. The latest commits in the dev branch now use setq/auto-incrementing reads/writes to ptra to push/pop registers in function calls, and manual stack allocation (for local variables, etc.) happens after it. Should see some performance improvements now. One more thing crossed off the list!
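A minimal sketch of what such a prologue/epilogue could look like (the register count and local-area size are invented, and it assumes Rev B/C silicon, where a setq block transfer advances PTRA by the whole block):
' prologue: push 4 callee-saved registers, then reserve locals
        setq    #3              ' 3+1 = 4 longs
        wrlong  r4, ptra++      ' burst-write r4..r7, ptra advances past the block
        add     ptra, #24       ' reserve space for locals
' epilogue: release locals, then pop the registers
        sub     ptra, #24
        setq    #3
        rdlong  r4, --ptra      ' burst-read r4..r7 back, ptra moves back past the block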
Do you expect you'll be able to use the indexed register hub access e.g., rdlong r3, ptra[-20] for accessing locals or any arguments allocated on the stack with just a single instruction when they are not otherwise maintained in register variables? There are distance limits with this but maybe the first 32 items below the stack pointer location might be accessible this way for a performance improvement too. The compiler would of course need to know when this faster form of hub memory access is applicable. I imagine trying to support alloca() later might be a problem unless the second ptrb is used as a frame pointer perhaps and ideally pointing between locals and arguments to give maximal access range, but that should not be a priority at this stage.
I do know the AVR chips also use this type of memory access method with GCC code and have similar distance limits too with LDD (load indirect with displacement) instructions. See below for the Y+n form with LDD reads. For AVR "n" is limited to 63 bytes while P2 would have limits of 32 bytes, words or longs below the current stack pointer depending on the read/write size which is still useful.
int main(void) {
  6c: cf 93        push r28
  6e: df 93        push r29
  70: cd b7        in r28, 0x3d   ; 61
  72: de b7        in r29, 0x3e   ; 62
  74: 29 97        sbiw r28, 0x09 ; 9
  76: 0f b6        in r0, 0x3f    ; 63
  78: f8 94        cli
  7a: de bf        out 0x3e, r29  ; 62
  7c: 0f be        out 0x3f, r0   ; 63
  7e: cd bf        out 0x3d, r28  ; 61
    volatile int foo;
    volatile char bar[7];
    while (1) {
        PORTB = foo;
  80: 88 85        ldd r24, Y+8   ; 0x08
  82: 99 85        ldd r25, Y+9   ; 0x09
  84: 88 bb        out 0x18, r24  ; 24
        PORTB = bar[3];
  86: 8c 81        ldd r24, Y+4   ; 0x04
  88: 88 bb        out 0x18, r24  ; 24
  8a: fa cf        rjmp .-12      ; 0x80 <main+0x14>
@rogloh it should be pretty easy to implement as well. Since stack slots are always referred to by an arbitrary frame index (stack pointer +/- an offset), and replaced at the end with actual registers, it should be pretty easy to implement logic that will check the size/alignment of the offset, then choose to either use the indexed ptra mode, or just save ptra to a register and subtract (like I currently do), or maybe insert augs to increase the possible offset distance. There's probably a few corner cases where this could break something, but I'll cross that bridge. My next task is to test out dynamic memory allocation and see how that works and what will happen.
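For reference, a sketch of the first two lowering choices (the offsets and register names are placeholders; the AUGS-extended variant is discussed a few posts further down):
' offset fits the PTRA index form: one instruction
        rdlong  r3, ptra[-5]    ' long at [SP-20]
' generic fallback for any offset
        mov     pa, ptra
        sub     pa, ##1000      ' assembler emits AUGS for the large literal
        rdlong  r3, pa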
One major aspect of LLVM I haven't even touched yet is that you can give it hints to the cost of executing a particular instruction. This is then used internally for scheduling as well as optimization to pick the appropriate instructions to use, without any effort from me other than just defining the costs. This can be used to prefer storing things in registers (which it already does), but also interestingly for non-blocking instructions like qdiv or qmul, it can be told to place the getqx/getqy later if it doesn't need the result immediately so that the CORDIC can run in the background until complete.
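As a rough illustration of the kind of schedule such a cost model could produce (the register names and interleaved work are invented):
        qmul    r0, r1          ' start a 32x32 multiply in the CORDIC
        rdlong  r2, ptra[-1]    ' unrelated work while the CORDIC runs in the background
        add     r2, #7
        getqx   r0              ' collect the low 32 bits of the product only when needed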
Sounds good. If you do use AUGS (you may need to in some cases) it could require several instructions to read offsets from the stack pointer. Eg.
mov pa, ptra      ' ptra is SP
sub pa, ##nnnn    ' uses AUGS once nnnn exceeds 511
rdlong r3, pa
This creates 4 instructions instead of 1 for each local access, which, as well as being somewhat slower, can quickly bloat the code. E.g. it burns 16 bytes to read something and maybe another 16 bytes elsewhere to write the result back.
I also sort of wonder if any of the ALT instructions can be used somehow with AUGS to save an instruction but am not sure there.
If you can make use of the hardware multiplier that would be good too. Unfortunately the really fast multiply is limited to 16 bit operands, but maybe you can still make good use of it in internal calculations where possible if you know the inputs will fit.
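A small sketch of that choice (register names are placeholders):
        mul     r0, r1          ' 2-clock 16x16 multiply: r0 = r0[15:0] * r1[15:0]
' when the operands may not fit in 16 bits, fall back to the CORDIC:
        qmul    r0, r1          ' full 32x32 unsigned multiply
        getqx   r0              ' low 32 bits of the 64-bit product (GETQY for the high half)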
The other P2 things to try to leverage are the bit operations, addsx etc for 64 bit stuff, maybe byte accesses getbyte/setbyte if this is detectable in the code, DJZ loop stuff, any small skip sequences where possible and good use of condition flag execution.
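For example, the 64-bit add mentioned above is just a two-instruction pattern (lo/hi register pairs assumed):
        add     a_lo, b_lo  wc  ' add the low longs, capture the carry
        addx    a_hi, b_hi      ' add the high longs plus carry (addsx to get signed flags)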
Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction for when there's a large offset. effectively doing:
augs #nnnnn
rdlong r3, ptra[nnnnn]
condition flag execution is a big one that I'm not yet set up to do. However, I bet there's a post-compilation optimizer pass that can be written to convert small loops or branches into conditional execution and insert skips and such. LLVM has a nice structure that lets you write passes for each function at various stages of compilation and do whatever you want to the function, on top of everything LLVM does. I have one currently called "expand pseudo instructions" which converts pseudo instructions I define (that don't have an equivalent P2 instruction, but I'm telling LLVM one will exist) into a small series of real instructions.
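As a concrete (and simplified) illustration of that pass, a made-up MOVI32 pseudo instruction for loading a 32-bit constant could expand into an AUGS/MOV pair, since AUGS supplies the upper 23 bits for the next instruction's 9-bit immediate:
' pseudo:  MOVI32 r0, ##$12345678
        augs    #$91A2B         ' upper 23 bits of $12345678
        mov     r0, #$78        ' lower 9 bits of $12345678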
If you can actually do that it would be very good. But you may need to test that idea out first, because RDLONG addresses are encoded differently and so it may not work the same as you'd assume. I'm guessing that it won't work but would be very happy to be proven wrong there. EDIT: okay I read page 61 and you might be correct! That's very efficient with only a second instruction needed over the "rdlong r0, ptra[10]" form.
For loops and branches, the conditional execution should be very common with z as well as c flags for typical greater/less than comparisons for example. TJZ and TJNZ should come in very useful too.
cmp r1,r2 wz
if_nz jmp #loop
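And for the compare-against-zero case, TJNZ folds the test and branch into a single instruction (a sketch, loop label assumed):
        tjnz    r1, #loop       ' loop while r1 is nonzero, no separate cmp needed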
I'm currently working out dynamic memory allocation, but next on my list is optimizing loops. Defining a pattern to match comparison to 0 should be pretty easy to save an instruction. Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance, as any really high performance code can always be written in assembly (although I think an end user should never have to touch assembly unless doing something really, really specialized). Lots to do...
Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance
Absolutely. Once it's (fully) functional, folks can start using it to write libraries while you simultaneously enhance performance. The optimization rabbit hole is never ending....
For anyone building anything using cmake: please forget that gnu make exists. It's highly counter-productive. Use ninja everywhere. It's the de facto modern build tool. Once you experience its performance, going back to gnu make will sound and feel ridonkulous. make tries to be a general purpose tool, and its code base and performance bear that out. ninja tries to do one job only and does it extremely well. `cmake -G Ninja` all the way. Thank me later. Yes, you need to install ninja.