I don’t currently have a use for ptrb or pb. My thought was to use ptrb as a lut ram pointer and push/pop callee saved registers to ptrb into lut ram and auto increment/decrement.
I've been thinking that it might be neat to use it to pass the C++ this pointer (or just the first pointer argument, in general) around, so functions operating on structured data could take advantage of the cool addressing modes. Not sure how big the benefit there is though.
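For illustration (a sketch, with a placeholder field offset): with the object pointer sitting in ptrb, a member load becomes a single indexed hub read:
rdlong r0, ptrb[3] ' fetch the fourth long field of *this in one instruction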
I haven’t looked too much into the LOC instruction or what it does—sounds like it just loads a 20 bit address into ptra/b/pa/pb without the intermediate AUGS you would need for a normal move? The relative addressing seems useful, though the only PC-relative stuff I do right now is jumps, which can already take a full 20 bit address.
Yes, that's what it does. It can also do absolute addresses, so you could even use it to load a 20 bit immediate in some cases.
There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.
Here are some, @n_ermosh:
If you know a C function doesn't call any other functions, it is a leaf function. This means it can be called differently by those who call it. Instead of pushing the return address on the stack you could have the return address automatically loaded into say PB, using "CALLD PB, #address" instead of "CALLA #address". Then when this leaf function returns you can use JMP PB (4 clocks) instead of RETA (11-18 clocks) to return. A lot faster.

Of course this means you need to know if a function calls another function to be able to identify it as a leaf. And function pointers that call leaf functions may need different handling too. They might need to call one instruction earlier which just POPs the stack into PB before the leaf function really begins. That's probably one way to handle that issue. Accessing local/parameter data needs to accommodate this slight stack difference too in leaf functions. I don't know if/how easy this is to customise LLVM for things like that.

Another option might be for the called function (not the caller) to save the link register on the stack only if needed, then you can always use CALLD form. It's worth looking into good ways to speed up the prolog/epilogs and keep them small at the same time.
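For illustration, the two call shapes side by side (a sketch; func and leaf_func are placeholder labels):

' calling a normal (non-leaf) function
calla #func ' pushes the return address onto the hub stack via PTRA
' ... the callee eventually returns with:
reta ' pops the return address (11-18 clocks)

' calling a known leaf function
calld pb, #leaf_func ' return address goes straight into PB, no stack traffic
' ... the leaf callee returns with:
jmp pb ' 4 clocks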
LOC will probably only be useful for holding constant (readonly) data buried inside executable code that needs to support relocation. If your read only data is always being stored at other fixed addresses it may not be as useful.
PTRB could be useful for a "register" variable in a function if used as a pointer that can automatically increment/decrement. Unfortunately there is only one of these resources though because PTRA is the SP (which makes sense). Both it and PTRA can also be used to read indexed values like parameters or locals off the hub stack relative to some base (if it fits the +/- 31 long range). Like this:
rdlong r0,ptra[-3] ' get long at [SP-12]
rdlong r1,ptra[-2] ' get long at [SP-8]
add r0, r1 ' etc
This is probably the best use I can think of, giving the most bang per buck with the P2 hardware capabilities. If you don't use autoinc/autodec or offsets, any other register can be used to access hub/LUT and PTRA/PTRB are not needed. If you want to use it to access LUT for your saved registers, that would be faster than individual hub reads/writes (LUT is 3 clocks to read, 2 to write), but maybe not if you can instead use setq bursts to save registers in HUB, depending on the number saved. Writing to LUT RAM is limited to 512 locations before other handling will be needed. That is a lot, but you would still need to handle the overflow case for recursion, which may complicate things, and it may make debugging a little trickier to follow if the saved registers are not on the hub stack but inside the LUT RAM. Weigh that one up too.
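For comparison, saving and restoring a group of registers on the hub stack with setq bursts could look roughly like this (a sketch; the count of 8 registers is illustrative):

setq #7 ' 7+1 longs follow
wrlong r0, ptra ' burst-write r0..r7 to hub starting at [ptra]
add ptra, #32 ' advance the SP past the 8 saved longs
' ... function body ...
sub ptra, #32 ' step back to the saved block
setq #7
rdlong r0, ptra ' burst-read r0..r7 back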
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
add the "../llvm" on the end of the cmake line (no quotes needed)
Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...
Thanks for your help!
The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?
A few places (which I'm working on documenting...).
llvm-project/llvm/lib/Target/P2 contains all the backend code for the Propeller. There's quite a bit going on here; I'd take a look at docs/Notes and Thoughts.md, which has links to a few good references.
clang/lib/Driver/ToolChains/ contains setup for the full toolchain. It doesn't do much yet other than create a p2 target for clang.
clang/lib/Basic/Targets/ contains the custom info for P2 specific code (like named registers, etc).
lld/ELF/Arch contains the code that performs relocation for the linker.
There were a few other folders that were edited to register P2 as a target and provide some global info for various parts of llvm, but most of the work is done by everything in the lib/Target/P2 folder.
also @SaucySoliton regarding "Some printfs go missing or are clobbered.", can you upload the resulting binary, and also the output you were getting that was clobbered?
Something in recent LLVM commits changed how the rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.
Adding an extra zero to the end might be a workaround.
printf("i as an int = %d, \0", i);
printf("i as an uint = %u, \0", i);
printf("i as a hex = %x\n\0", i);
Despite my best efforts to only change one character and keep the same length, the compiler always seems to change the ordering of the strings.
So given that adding \0 seemed to fix it, I was looking for a possible buffer overrun. I eventually got this, which looked like a pretty clean buffer overrun.
printf("i as an int = %d, XXXXXX", i);
printf("i as an uint = %u, YYYYYY", i);
printf("i as a hex = %x\n ZZZZZZ", i);
waitx(_CLOCKFREQ/10);
00000000 69 20 61 73 20 61 6e 20 69 6e 74 20 3d 20 32 31 |i as an int = 21|
00000010 34 37 34 38 33 36 33 37 2c 20 58 58 58 58 58 58 |47483637, XXXXXX|
00000020 38 cb 07 fb 39 c9 07 fb 3a c7 07 fb 3b c5 07 fb |8...9...:...;...|
00000030 3c c3 07 fb 3d c1 07 fb 30 f0 87 f1 2e 69 20 61 |<...=...0....i a|
00000040 73 20 61 6e 20 69 6e 74 20 3d 20 32 31 34 37 34 |s an int = 21474|
00000050 38 33 36 33 38 2c 20 58 58 58 58 58 58 38 cb 07 |83638, XXXXXX8..|
Except that the overrun didn't correspond to anything that came after the strings in question. A quick search finds the source of the data.
So main has the wrong string pointers in it. Also has the wrong registers. The offset is 420 bytes (0xA14) but that is not constant and it's not always the same between both calls. There are no other changes in main. The registers for the printf calls do not change.
Interesting that the trailing 0 was a workaround, but I actually found the root cause and fixed it in the dev branch (not yet merged to master as I'm finishing up another feature right now). I don't understand why the strings are being re-ordered, but at least now the relocation is keeping track of it properly and updating the reference to the correct address in rodata. Once I finish up the feature I'm currently working on (integrating the C standard lib, propeller standard lib, and linker scripts into clang directly), I'll merge to master and tag it as the next version "release".
It took me a little while to learn how to use branches. The string issue seems to be fixed now.
fft_bench v1.0 for PROPELLER
Freq. Magnitude
00000000 000001fe
000000c0 000001ff
00000140 000001ff
00000200 000001ff
1024 point bit-reversal and butterfly run time = 12742 us
clock frequency = 180000000
fft_bench.elf is 8140 bytes
It's not using the rev instruction, so there would be an easy speed gain from that. I used to get 15 ms on this, which leads me to think that the change to 32 registers is only in the dev branch right now.
results: time / size of binary loaded to P2
riscvp2_lut: 8119 us 32412 bytes
fastspin: 13130 us 25180 bytes
p2gcc: 22740 us 20584 bytes
catalina: 24666 us 23552 bytes
@SaucySoliton I'm really liking the look of that LLVM generated fft_bench.elf file size in comparison with the binary image sizes of other compilers.
Was this elf file inclusive of all the serial IO stuff needed for the benchmark to run etc or were there other parts required that would have increased the entire binary size further?
Your gain from 15 ms to 12.7 ms is a good one with the increased number of registers. As prologs and epilogs get optimized I expect it should help boost it further. Any chance you might zip and post the disassembled listing of this benchmark here? We might find other optimisation candidates looking at it.
Unfortunately I still can't build LLVM; it only reaches 95% even if I skip the errors, and Clang seems to have some incompatibilities with the Apple Xcode header files on my Mac that prevent complete compilation. I'll probably need to upgrade to a new OS X and deal with that nightmare to fix it, and I'm not ready for that yet.
The LLVM version isn't built with full C libraries. Just a lightweight printf. The elf file is complete. fft_bench.o is about 5kB. It would be good to find out how much of the size numbers above are standard libraries. And with the size of the P2 memory, is 10kB a big deal?
I'll need to go out of my way to post the listing before Monday.
Thanks Saucy. Well, I don't think 10kB is a big deal at all. It was interesting to just compare the size with the other compilers to see how things are improving. I'd certainly expect the LLVM version's output size to be smaller than what p2gcc produces in the end, and if it isn't then it is certainly not optimal and we have a problem. p2gcc adds extra overhead in some cases from having fewer registers as well as not making full use of useful P2 instructions where it can to save space.
No problem if that listing is troublesome to post at the moment. Don't worry about it.
@SaucySoliton I merged your PR (and fixed a couple of merge conflicts). I was also able to build your fft benchmark (with a few changes to the defines). I've added __p2llvm__ as a standard define, so for these portable programs, let's start using that moving forward.
Regarding block transfers with setq. My implementation currently generates this:
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
mov pa, ptra
sub pa, #64 // get pointer to start of stack block
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
And it doesn't work. Does setq need to come directly before the corresponding wrlong? With how the stack allocation is working right now, I need to reserve all the space for it first before spilling any registers, so I can't use the auto-increment capabilities of ptra. Does anything about this seem obviously wrong? Obviously auto-incrementing would be better; it would save a lot of extra instructions, but I need to think about how to make that work in the llvm framework.
Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
RIP okay, that's going to take a little bit of effort. I'll table this for now for more important things on the todo list. I have ideas, will take me some time to implement them though.
I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
It shouldn't. Maybe you fixed other stuff with the compiler at the same time? I found that changing the optimizer level affected how many lines it sends. -O2 and above didn't work at all? I haven't looked into what is happening there yet.
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
mov pa, ptra
sub pa, #64 // get pointer to start of stack block
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
you could use this:
mov pa, ptra // get pointer to start of stack block
add ptra, #64 // reserve 16 longs on the stack
setq #9 // setup to write 9+1 longs
wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
Note that pa remains unchanged after the transfer, I guess that is what you want for preserving registers and restoring the stack later perhaps? Like a block pointer. It's worth spending the time getting a good fast sequence here given it will be used when calling many functions.
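The matching restore at function exit could then be something like this (a sketch, assuming the same 16-long frame with the 10 saved registers at its base):

sub ptra, #64 ' release the frame; ptra again points at the saved block
setq #9 ' 9+1 longs follow
rdlong r0, ptra ' burst-read r0..r9 back from the hub stack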
One other thing @n_ermosh , in this particular example above it looks like you were reserving 16 longs on the stack but only saving 10 registers. So does this mean you are planning on always allocating a fixed space for saving up to maximum number of 16 working registers? If so, this will of course eat up stack space faster.
I found p2gcc seemed to allocate its internal working register use sequentially, which makes saving the number of registers easy and the stack pointer could be adjusted by the smallest amount needed each time. I don't know what LLVM tracks there to help you only save what is needed and consume the least amount of stack space required but I'd expect it is doable somehow.
@rogloh, In that example, the remaining 6 longs are being used for local variables; it was just a snippet I grabbed from one of my examples. I'm only allocating what's needed for every function (as computed by the magical llvm internals). It will take a bit of work to rework how the prologue is inserted to make the above work, but I think it might be possible. Currently, here's the rough order of things that generates my code above; llvm separately calls the functions that perform 1, 2, and 3 at various stages of compilation:
1. at function entry, insert prologue:
add ptra, #size
2. spill callee saved registers to the stack:
figure out the number of longs to save and which registers to start/end at, then insert those instructions:
setq #n_regs
wrlong r0, first_reg_frame_index
The frame index is currently still an abstract memory reference, and its exact offset (from the new stack pointer) is determined by llvm and not by my code.
3. Eliminate the frame index by copying ptra into pa, subtracting, and replacing the frame_index with pa:
mov pa, ptra
sub pa, #frame_index_offset
wrlong r0, pa
The resulting code ends up looking like:
add ptra, #size
setq #n_regs
mov pa, ptra
sub pa, #size
wrlong r0, pa
I have an idea for how to work around this to make it look more like your example, without doing weird band-aid things, but I need to study for my pilot's exam this week; hopefully I'll be able to get back to this soon and implement some more things from the todo list. I'll keep an eye on this thread if people have ideas though.
edit: it's actually super straightforward to change this to do a block transfer plus auto-increment PTRA. I might be able to get it done this weekend.
Turns out it was easy. The latest commits in the dev branch now use setq plus auto-incrementing reads/writes to ptra to push/pop registers in function calls, and manual stack allocation (for local variables, etc.) happens after that. Should see some performance improvements now. One more thing crossed off the list!
Do you expect you'll be able to use the indexed register hub access, e.g. rdlong r3, ptra[-20], for accessing locals or any arguments allocated on the stack with just a single instruction when they are not otherwise maintained in register variables? There are distance limits with this, but maybe the first 32 items below the stack pointer location could be accessible this way for a performance improvement too. The compiler would of course need to know when this faster form of hub memory access is applicable. I imagine trying to support alloca() later might be a problem unless the second pointer, ptrb, is used as a frame pointer, ideally pointing between locals and arguments to give maximal access range, but that should not be a priority at this stage.
I do know the AVR chips also use this type of memory access method with GCC code and have similar distance limits too with LDD (load indirect with displacement) instructions. See below for the Y+n form with LDD reads. For AVR "n" is limited to 63 bytes while P2 would have limits of 32 bytes, words or longs below the current stack pointer depending on the read/write size which is still useful.
int main(void) {
6c: cf 93 push r28
6e: df 93 push r29
70: cd b7 in r28, 0x3d ; 61
72: de b7 in r29, 0x3e ; 62
74: 29 97 sbiw r28, 0x09 ; 9
76: 0f b6 in r0, 0x3f ; 63
78: f8 94 cli
7a: de bf out 0x3e, r29 ; 62
7c: 0f be out 0x3f, r0 ; 63
7e: cd bf out 0x3d, r28 ; 61
volatile int foo;
volatile char bar[7];
while (1) {
PORTB = foo;
80: 88 85 ldd r24, Y+8 ; 0x08
82: 99 85 ldd r25, Y+9 ; 0x09
84: 88 bb out 0x18, r24 ; 24
PORTB = bar[3];
86: 8c 81 ldd r24, Y+4 ; 0x04
88: 88 bb out 0x18, r24 ; 24
8a: fa cf rjmp .-12 ; 0x80 <main+0x14>
@rogloh it should be pretty easy to implement as well. Since stack slots are always referred to by an arbitrary frame index (stack pointer +/- an offset), and replaced at the end with actual registers, it should be pretty easy to add logic that checks the size/alignment of the offset, then chooses to either use the indexed ptra mode, or just save ptra to a register and subtract (like I currently do), or maybe insert augs to increase the possible offset distance. There are probably a few corner cases where this could break something, but I'll cross that bridge. My next task is to test out dynamic memory allocation and see how that works and what will happen.
One major aspect of LLVM I haven't even touched yet is that you can give it hints about the cost of executing a particular instruction. This is then used internally for scheduling as well as optimization to pick the appropriate instructions, without any effort from me other than just defining the costs. This can be used to prefer storing things in registers (which it already does), but also, interestingly, for non-blocking instructions like qdiv or qmul it can be told to place the getqx/getqy later if it doesn't need the result immediately, so that the CORDIC can run in the background until complete.
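For example (a sketch with placeholder registers), independent work can be scheduled between starting the CORDIC operation and fetching its result:

qmul r0, r1 ' start an unsigned 32x32 multiply in the CORDIC
add r2, r3 ' independent instructions overlap with the CORDIC
shl r4, #2
getqx r0 ' low 32 bits of the product, fetched only once needed
getqy r1 ' high 32 bits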
Sounds good. If you do use AUGS (you may need to in some cases) it could require several instructions to read offsets from the stack pointer. Eg.
mov pa, ptra ' ptra is SP
sub pa, ##nnnn ' uses AUGS once nnnn exceeds 511
rdlong r3, pa
This creates 4 instructions (counting the AUGS) instead of 1 for each local access, which, as well as being somewhat slower, can quickly bloat the code. Eg. it burns 16 bytes to read something and maybe another 16 bytes elsewhere to write the result back.
I also sort of wonder if any of the ALT instructions can be used somehow with AUGS to save an instruction but am not sure there.
If you can make use of the hardware multiplier that would be good too. Unfortunately the really fast multiply is limited to 16 bit operands, but maybe you can still make good use of it in internal calculations where possible if you know the inputs will fit.
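For example (a sketch), when both operands are known to fit in 16 bits:

mul r0, r1 ' fast unsigned 16x16 multiply, 32-bit result in r0
' otherwise fall back to the CORDIC for a full 32x32 multiply:
qmul r0, r1
getqx r0 ' low 32 bits of the product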
The other P2 things to try to leverage are the bit operations, addsx etc for 64 bit stuff, maybe byte accesses getbyte/setbyte if this is detectable in the code, DJZ loop stuff, any small skip sequences where possible and good use of condition flag execution.
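For instance, a 64-bit add is just two instructions (a sketch, with placeholder register pairs holding the low/high longs):

add lo0, lo1 wc ' add the low longs, carry into C
addx hi0, hi1 ' add the high longs plus the carry (addsx where signed flags are wanted)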
Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction when there's a large offset, effectively doing:
augs #nnnnn
rdlong r3, ptra[nnnnn]
Condition flag execution is a big one that I'm not yet set up to do. However, I bet there's a post-compilation optimizer pass that can be written to convert small loops or branches into conditional execution and insert skips and such. LLVM has a nice structure that lets you write passes for each function at various stages of compilation and do whatever you want to the function, on top of everything LLVM does. I have one currently called "expand pseudo instructions" which converts pseudo instructions I define (that don't have an equivalent P2 instruction, but I'm telling LLVM one will exist) into a small series of real instructions.
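For instance (a sketch, with placeholder registers), a small if/else can collapse into conditionally executed instructions instead of a branch:

cmp r0, r1 wcz ' compare, setting C and Z
if_c mov r2, r0 ' take r0 when it is below r1 (unsigned)
if_nc mov r2, r1 ' otherwise take r1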
Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction for when there's a large offset. effectively doing:
augs #nnnnn
rdlong r3, ptra[nnnnn]
If you can actually do that it would be very good. But you may need to test that idea out first, because RDLONG addresses are encoded differently and so it may not work the same as you'd assume. I'm guessing that it won't work but would be very happy to be proven wrong there. EDIT: okay I read page 61 and you might be correct! That's very efficient with only a second instruction needed over the "rdlong r0, ptra[10]" form.
For loops and branches, the conditional execution should be very common with z as well as c flags for typical greater/less than comparisons for example. TJZ and TJNZ should come in very useful too.
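For example (a sketch, with placeholder registers and labels), a zero test folds into a single test-and-jump, and a counted loop closes with one decrement-and-jump:

tjz r0, #done ' jump to done if r0 is zero, replacing a cmp #0 plus conditional jmp
djnz count, #loop ' decrement count and jump back to loop while it is non-zero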
I'm currently working out dynamic memory allocation, but next on my list is optimizing loops. Defining a pattern to match comparison to 0 should be pretty easy to save an instruction. Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance, as any really high performance code can always be written in assembly (although I think an end user should never have to touch assembly unless doing something really, really specialized). Lots to do...
Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance
Absolutely. Once it's (fully) functional, folks can start using it to write libraries while you simultaneously enhance performance. The optimization rabbit hole is never ending....
For anyone building anything using cmake: please forget that gnu make exists. It's highly counter-productive. Use ninja everywhere. It's the de facto modern build tool. Once you experience its performance, going back to gnu make will sound and feel ridonkulous. make tries to be a general purpose tool, and its code base and performance bear that out. ninja tries to do one job only and does it extremely well. `cmake -G Ninja` all the way. Thank me later. Yes, you need to install ninja.