Shop OBEX P1 Docs P2 Docs Learn Events
LLVM Backend for Propeller 2 - Page 3 — Parallax Forums

LLVM Backend for Propeller 2

1356713

Comments

  • n_ermosh wrote: »
    I don’t currently have a use for ptrb or pb. My thought was to use ptrb as a lut ram pointer and push/pop callee saved registers to ptrb into lut ram and auto increment/decrement.
    I've been thinking that it might be neat to use it to pass the C++ this pointer (or just the first pointer argument, in general) around, so functions operating on structured data could take advantage of the cool addressing modes. Not sure how big the benefit there is though.
    n_ermosh wrote: »
    I haven’t looked too much into the LOC instruction or what it does—sounds like it just loads a 20 bit address into ptra/b/pa/pb without an intermediate augs that you would need for a normal move? The relative addressing seems useful, though the only PC relative stuff I do is jumps right now, which can already take a full 20 bit address.
    Yes, that's what it does. It can also do absolute adresses, so you could even use it to load a 20 bit immediate in some cases.
  • roglohrogloh Posts: 5,171
    edited 2020-08-09 06:42
    n_ermosh wrote: »
    There’s a lot of cool stuff that can be done. If you guys and gals see ways things can be optimized and sped up (either in the initial generation of code or in a pass towards the end of compilation) I’m all ears.
    Here's some, @n_ermosh :

    If you know a C function doesn't call any other functions, it is a leaf function. This means it can be called differently by those who call it. Instead of pushing the return address on the stack you could have the return address automatically loaded into say PB, using "CALLD PB, #address" instead of "CALLA #address". Then when this leaf function returns you can use JMP PB (4 clocks) instead of RETA (11-18 clocks) to return. A lot faster. Of course this means you need to know if a function calls another function to be able to identify it as a leaf. And function pointers that call leaf functions may need different handling too. They might need to call one instruction earlier which just POPs the stack into PB before the leaf function really begins. That's probably one way to handle that issue. Accessing local/parameter data needs to accommodate this slight stack difference too in leaf functions. I don't know if/how easy this is to customise LLVM for things like that. Another option might be for the called function (not the caller) to save the link register on the stack only if needed, then you can always use CALLD form. It's worth looking into good ways to speed up the prolog/epilogs and keep them small at the same time.

    LOC will probably only be useful for holding constant (readonly) data buried inside executable code that needs to support relocation. If your read only data is always being stored at other fixed addresses it may not be as useful.

    PTRB could be useful for a "register" variable in a function if used as a pointer that can automatically increment/decrement. Unfortunately there is only one of these resources though because PTRA is the SP (which makes sense). Both it and PTRA can also be used to read indexed values like parameters or locals off the hub stack relative to some base (if it fits the +/- 31 long range). Like this:

    rdlong r0,ptra[-3] ' get long at [SP-12]
    rdlong r1,ptra[-2] ' get long at [SP-8]
    add r0, r1 ' etc

    This is probably a good use I can think of that gives best bang per buck with the P2 hardware capabilities. If you don't use autoinc/autodec or offsets, any other register can be used to access hub/LUT and PTRA/PTRB are not needed. If you want to use it to access LUT for your saved registers that would be faster than individual hub read/writes (3 clocks to read/2 to write in LUT), but maybe not so if you instead can use setq bursts to have registers saved in HUB depending on the number saved. Writing to LUTRAM will be limited to 512 locations before other handling will needed. It is a lot but you would still need to handle the overflow case for recursion which may make things complicated and it may make debugging a little trickier to follow if the saved registers are not on the hub stack but inside the LUT RAM. Weigh that one up too.


  • rogloh wrote: »
    add the "../llvm" on the end of the cmake line (no quotes needed)
    Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...

    Thanks for your help!

  • David Betz wrote: »
    rogloh wrote: »
    add the "../llvm" on the end of the cmake line (no quotes needed)
    Okay, I see. When I view the README.md file from GitHub the end of the command line is cut off. It would be nice if they would give some indication that that was happening. Anyway, when I used the full command line the build started. We'll see if it finishes...

    Thanks for your help!
    The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?

  • David Betz wrote: »
    The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?

    A few places (which I'm working on documenting...).

    llvm-project/llvm/lib/Target/P2 contains all the backend code for propeller. There's quite a bit going on here, I'd take a look at docs/Notes and Thoughts.md which has links to a few good references.

    clang/lib/Driver/ToolChains/ contains setup for the full toolchain. doesn't do much yet other than create a p2 target for clang.

    clang/lib/Basic/Targets/ contains the custom info for P2 specific code (like named registers, etc).

    lld/ELF/Arch contains the code that performs relocation for the linker.

    There were a few other folders that were edited to register P2 as a target and provide some global info for various parts of llvm, but most of the work is done by everything in the lib/Target/P2 folder.
  • n_ermosh wrote: »
    David Betz wrote: »
    The build completed successfully. Can someone point me to where in the source tree the P2-specific code is located?

    A few places (which I'm working on documenting...).

    llvm-project/llvm/lib/Target/P2 contains all the backend code for propeller. There's quite a bit going on here, I'd take a look at docs/Notes and Thoughts.md which has links to a few good references.

    clang/lib/Driver/ToolChains/ contains setup for the full toolchain. doesn't do much yet other than create a p2 target for clang.

    clang/lib/Basic/Targets/ contains the custom info for P2 specific code (like named registers, etc).

    lld/ELF/Arch contains the code that performs relocation for the linker.

    There were a few other folders that were edited to register P2 as a target and provide some global info for various parts of llvm, but most of the work is done by everything in the lib/Target/P2 folder.
    Thanks for the roadmap. That's very helpful.

  • n_ermosh wrote: »
    also @SaucySoliton regarding "Some printfs go missing or are clobbered.", can you upload the resulting binary, and also the output you were getting that was clobbered?
    n_ermosh wrote: »
    Something in recent LLVM commits changes how rodata section is relocated, which is leading to all sorts of issues when dealing with strings. I'm still tracking down the exact bug but will hopefully fix it soon.

    Adding an extra zero to the end might be a workaround.
    printf("i as an int = %d, \0", i);
    printf("i as an uint = %u, \0", i);
    printf("i as a hex = %x\n\0", i);

    Despite my best efforts to only change one character and keep the same length, the compiler always seems to change the ordering of the strings.

    So given that adding \0 seemed to fix it, I was looking for a possible buffer overrun. I eventually got this, which looked like a pretty clean buffer overrun.
            printf("i as an int = %d, XXXXXX", i);
            printf("i as an uint = %u, YYYYYY", i);
            printf("i as a hex = %x\n ZZZZZZ", i);
            waitx(_CLOCKFREQ/10);
    
    00000000  69 20 61 73 20 61 6e 20  69 6e 74 20 3d 20 32 31  |i as an int = 21|
    00000010  34 37 34 38 33 36 33 37  2c 20 58 58 58 58 58 58  |47483637, XXXXXX|
    00000020  38 cb 07 fb 39 c9 07 fb  3a c7 07 fb 3b c5 07 fb  |8...9...:...;...|
    00000030  3c c3 07 fb 3d c1 07 fb  30 f0 87 f1 2e 69 20 61  |<...=...0....i a|
    00000040  73 20 61 6e 20 69 6e 74  20 3d 20 32 31 34 37 34  |s an int = 21474|
    00000050  38 33 36 33 38 2c 20 58  58 58 58 58 58 38 cb 07  |83638, XXXXXX8..|
    
    Except that the overrun didn't correspond to anything that came after the strings in question. A quick search finds the source of the data.
         a10: 38 cb 07 fb  		rdlong r5, #312			     a10: 38 cb 07 fb  		rdlong r5, #312
         a14: 39 c9 07 fb  		rdlong r4, #313			     a14: 39 c9 07 fb  		rdlong r4, #313
         a18: 3a c7 07 fb  		rdlong r3, #314			     a18: 3a c7 07 fb  		rdlong r3, #314
         a1c: 3b c5 07 fb  		rdlong r2, #315			     a1c: 3b c5 07 fb  		rdlong r2, #315
         a20: 3c c3 07 fb  		rdlong r1, #316			     a20: 3c c3 07 fb  		rdlong r1, #316
         a24: 3d c1 07 fb  		rdlong r0, #317			     a24: 3d c1 07 fb  		rdlong r0, #317
         a28: 30 f0 87 f1  		sub ptra, #48			     a28: 30 f0 87 f1  		sub ptra, #48
         a2c: 2e 00 64 fd  		reta				     a2c: 2e 00 64 fd  		reta
    
    Disassembly of section .text:					Disassembly of section .text:
    
    00000400 <main>:						00000400 <main>:
         460: f9 c1 63 fc  		wrlong r0, ptrb			     460: f9 c1 63 fc  		wrlong r0, ptrb
         464: 05 00 00 ff  		augs #5				     464: 05 00 00 ff  		augs #5
         468: c5 c1 07 f6  		mov r0, #453	$bc5  .str    |	     468: f7 c1 07 f6  		mov r0, #503     $bf7  .str
         46c: 05 00 00 ff  		augs #5				     46c: 05 00 00 ff  		augs #5
         470: de c3 07 f6  		mov r1, #478	$bde .str.1   |	     470: 10 c4 07 f6  		mov r2, #16      $a10 expecting 0bb4 <.str.1>
         474: 05 00 00 ff  		augs #5				     474: 05 00 00 ff  		augs #5
         478: f8 c5 07 f6  		mov r2, #504	$bf8 .str.2   |	     478: 2a c6 07 f6  		mov r3, #42      $a2a expecting  bce <.str.2>
         47c: 96 98 00 ff  		augs #39062			     47c: 96 98 00 ff  		augs #39062
         480: 00 c7 07 f6  		mov r3, #256			     480: 00 c7 07 f6  		mov r3, #256
         484: f8 f3 03 f6  		mov ptrb, ptra			     484: f8 f3 03 f6  		mov ptrb, ptra
    
    So main has the wrong string pointers in it. Also has the wrong registers. The offset is 420 bytes (0xA14) but that is not constant and it's not always the same between both calls. There are no other changes in main. The registers for the printf calls do not change.
  • Interesting that the trailing 0 was a workaround, but I actually found the root cause and fixed it in the dev branch (not yet merged to master as I'm finishing up another feature right now). I don't understand why the strings are being re-ordered, but at least now the relocation is keeping track of it properly and updating the reference the correct address in rodata. once I finish up the feature i'm currently working on (integrating the c standard lib, propeller standard lib, and linker scripts into clang directly), I'll merge to master and tag it as the next version "release".
  • It took me a little while to learn how to use branches. The string issue seems to be fixed now.
    fft_bench v1.0 for PROPELLER
    Freq.    Magnitude
    00000000 000001fe
    000000c0 000001ff
    00000140 000001ff
    00000200 000001ff
    1024 point bit-reversal and butterfly run time = 12742 us
    clock frequency = 180000000
    
    fft_bench.elf is 8140 bytes

    It's not using the rev instruction. So there would be an easy speed gain from that. I used to get 15mS on this which leads me to think that the change to 32 registers is only in the dev branch right now.
    results:       time / size of binary loaded to P2
    
    riscvp2_lut:  8119 us  32412 bytes
    fastspin:    13130 us  25180 bytes
    p2gcc:       22740 us  20584 bytes
    catalina:    24666 us  23552 bytes
    
    http://forums.parallax.com/discussion/comment/1502265/#Comment_1502265
  • roglohrogloh Posts: 5,171
    edited 2020-08-15 03:30
    @SaucySoliton I'm really liking the look of that LLVM generated fft_bench.elf file size in comparison with the binary image sizes of other compilers.

    Was this elf file inclusive of all the serial IO stuff needed for the benchmark to run etc or were there other parts required that would have increased the entire binary size further?

    Your gain from 15us to 12.7us is a good one with the increased number of registers. As prologs and epilogs get optimized I expect it should help boost it further. Any chance you might zip and post the disassembled listing of this benchmark here? We might find other optimisation candidates looking at it.

    Unfortunately I still can't build LLVM, it only reaches up to 95% if I skip the errors and Clang seems to have some incompatibilities with the Apple Xcode header files on my Mac preventing complete compilation. I'll probably need to upgrade to a new OS X and deal with that nightmare to fix it, and I'm not ready for that yet.
  • @rogloh

    The LLVM version isn't built with full C libraries. Just a lightweight printf. The elf file is complete. fft_bench.o is about 5kB. It would be good to find out how much of the size numbers above are standard libraries. And with the size of the P2 memory, is 10kB a big deal?


    I'll need to go out of my way to post the listing before Monday.
  • Thanks Saucy. Well I don't think 10kB is a big deal at all. It was interesting to just compare the size with the other compilers to see how things are improving. I'd certainly expect LLVM version output size should be smaller than what p2gcc does in the end, and if it wasn't then it is certainly not optimal and we have a problem. p2gcc adds extra overhead in some cases from having less registers as well as not making full use of useful P2 instructions where it can to save space.

    No problem if that listing is troublesome to post at the moment. Don't worry about it.
  • @SaucySoliton I merged your PR (and fixed a couple of merge conflicts). I was also able to build your fft benchmark (with a few changes to the defines. I've added __p2llvm__ as a standard define, so for these portable programs, lets start using that moving forward).

    Regarding block transfers with setq. My implementation currently generates this:
    add ptra, #64 // reserve 16 longs on the stack
    setq #9 // setup to write 9+1 longs 
    mov pa, ptra
    sub pa, #64 // get pointer to start of stack block
    wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
    

    And it doesn't work. does setq need to come directly before the corresponding wrlong? With how the stack allocation is working right now, I need to reserve all the space for it first before spilling any registers, so I can't use the auto-increment capabilities of ptra. Does anything about this seem obviously wrong? Obviously auto-incrementing would be better, it would save a lot of extra instructions, but I need to think about how to make that work in the llvm framework.
  • Wuerfel_21Wuerfel_21 Posts: 4,511
    edited 2020-08-15 19:11
    n_ermosh wrote: »
    @SaucySoliton I merged your PR (and fixed a couple of merge conflicts). I was also able to build your fft benchmark (with a few changes to the defines. I've added __p2llvm__ as a standard define, so for these portable programs, lets start using that moving forward).

    Regarding block transfers with setq. My implementation currently generates this:
    add ptra, #64 // reserve 16 longs on the stack
    setq #9 // setup to write 9+1 longs 
    mov pa, ptra
    sub pa, #64 // get pointer to start of stack block
    wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
    

    And it doesn't work. does setq need to come directly before the corresponding wrlong? With how the stack allocation is working right now, I need to reserve all the space for it first before spilling any registers, so I can't use the auto-increment capabilities of ptra. Does anything about this seem obviously wrong? Obviously auto-incrementing would be better, it would save a lot of extra instructions, but I need to think about how to make that work in the llvm framework.

    Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting
  • I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?
    fft_bench v1.0 for PROPELLER
    Freq.    Magnitude
    00000000 000001fe
    00000001 0000015f
    00000002 000000ae
    00000003 000000f8
    00000004 00000043
    00000005 00000171
    00000006 00000155
    00000007 00000032
    00000008 000000a2
    00000009 00000027
    0000000a 000000cd
    0000000b 000000a7
    0000000c 00000016
    0000000d 00000039
    0000000e 00000018
    0000000f 00000017
    00000011 00000014
    00000012 00000012
    00000013 00000026
    00000014 0000000d
    00000015 00000057
    00000016 0000005c
    00000017 0000000f
    00000018 00000036
    00000019 0000000e
    0000001a 0000004e
    0000001b 00000043
    0000001c 00000009
    0000001d 00000019
    0000001e 0000000b
    0000001f 0000000b
    00000021 0000000a
    00000022 00000009
    00000023 00000014
    00000024 00000007
    00000025 00000031
    00000026 00000035
    00000027 00000009
    00000028 00000020
    00000029 00000008
    0000002a 00000030
    0000002b 0000002a
    0000002c 00000006
    0000002d 00000010
    0000002e 00000006
    0000002f 00000007
    00000031 00000007
    00000032 00000006
    00000033 0000000e
    00000034 00000005
    00000035 00000022
    00000036 00000025
    00000037 00000006
    00000038 00000017
    00000039 00000006
    0000003a 00000022
    0000003b 0000001f
    0000003c 00000004
    0000003d 0000000c
    0000003e 00000005
    0000003f 00000005
    
    00000041 00000005
    00000042 00000005
    00000043 0000000a
    00000044 00000003
    00000045 0000001a
    00000046 0000001c
    00000047 00000004
    00000048 00000011
    00000049 00000004
    0000004a 0000001b
    0000004b 00000018
    0000004c 00000003
    0000004d 00000009
    0000004e 00000003
    0000004f 00000004
    00000051 00000004
    00000052 00000004
    00000053 00000008
    00000054 00000003
    00000055 00000015
    00000056 00000017
    00000057 00000003
    00000058 0000000e
    00000059 00000003
    0000005a 00000016
    0000005b 00000014
    0000005c 00000002
    0000005d 00000008
    0000005e 00000002
    0000005f 00000003
    00000061 00000003
    00000062 00000003
    00000063 00000006
    00000064 00000002
    00000065 00000011
    00000066 00000013
    00000067 00000003
    00000068 0000000c
    00000069 00000003
    0000006a 00000013
    0000006b 00000011
    0000006c 00000002
    0000006d 00000006
    0000006e 00000002
    0000006f 00000003
    00000071 00000002
    00000072 00000003
    00000073 00000005
    00000074 00000002
    00000075 0000000f
    00000076 00000010
    00000077 00000002
    00000078 0000000a
    00000079 00000002
    0000007a 00000011
    0000007b 0000000f
    0000007c 00000002
    0000007d 00000005
    0000007e 00000002
    0000007f 00000002
    00000081 00000002
    00000082 00000002
    
    00000083 00000005
    00000084 00000001
    00000085 0000000d
    00000086 0000000e
    00000087 00000002
    00000088 00000009
    00000089 00000002
    0000008a 0000000f
    0000008b 0000000d
    0000008c 00000001
    0000008d 00000005
    0000008e 00000001
    0000008f 00000002
    00000091 00000002
    00000092 00000002
    00000093 00000004
    00000094 00000001
    00000095 0000000c
    00000096 0000000d
    00000097 00000002
    00000098 00000008
    00000099 00000002
    0000009a 0000000d
    0000009b 0000000c
    0000009c 00000001
    0000009d 00000004
    0000009e 00000001
    0000009f 00000002
    000000a1 00000002
    000000a2 00000002
    000000a3 00000003
    000000a4 00000001
    000000a5 0000000a
    000000a6 0000000c
    000000a7 00000001
    000000a8 00000007
    000000a9 00000001
    000000aa 0000000c
    000000ab 0000000b
    000000ac 00000001
    000000ad 00000004
    000000ae 00000001
    000000af 00000001
    000000b1 00000001
    000000b2 00000002
    000000b3 00000003
    000000b4 00000001
    000000b5 0000000a
    000000b6 0000000a
    000000b7 00000001
    000000b8 00000007
    000000b9 00000001
    000000ba 0000000b
    000000bb 0000000a
    000000bc 00000001
    000000bd 00000003
    000000be 00000001
    000000bf 00000001
    000000c1 00000001
    000000c2 00000001
    000000c3 00000003
    000000c4 00000001
    
    000000c5 00000008
    000000c6 0000000a
    000000c7 00000001
    000000c8 00000006
    000000c9 00000001
    000000ca 0000000a
    000000cb 00000009
    000000cc 00000001
    000000cd 00000003
    000000ce 00000001
    000000cf 00000001
    000000d1 00000001
    000000d2 00000001
    000000d3 00000002
    000000d4 00000001
    000000d5 00000008
    000000d6 00000009
    000000d7 00000001
    000000d8 00000006
    000000d9 00000001
    000000da 00000009
    000000db 00000008
    000000dc 00000001
    000000dd 00000003
    000000de 00000001
    000000df 00000001
    000000e1 00000001
    000000e2 00000001
    000000e3 00000002
    000000e5 00000008
    000000e6 00000008
    000000e7 00000001
    000000e8 00000005
    000000e9 00000001
    000000ea 00000009
    000000eb 00000008
    000000ed 00000002
    000000ee 00000001
    000000ef 00000001
    000000f1 00000001
    000000f2 00000001
    000000f3 00000002
    000000f5 00000007
    000000f6 00000008
    000000f7 00000001
    000000f8 00000005
    000000f9 00000001
    000000fa 00000008
    000000fb 00000007
    000000fd 00000002
    000000fe 00000001
    000000ff 00000001
    00000101 00000001
    00000102 00000001
    00000103 00000002
    00000105 00000007
    00000106 00000008
    00000107 00000001
    00000108 00000004
    00000109 00000001
    0000010a 00000007
    0000010b 00000007
    
    0000010d 00000002
    0000010e 00000001
    0000010f 00000001
    00000111 00000001
    00000112 00000001
    00000113 00000002
    00000115 00000007
    00000116 00000007
    00000117 00000001
    00000118 00000004
    00000119 00000001
    0000011a 00000007
    0000011b 00000006
    0000011d 00000002
    0000011e 00000001
    0000011f 00000001
    00000121 00000001
    00000122 00000001
    00000123 00000002
    00000125 00000007
    00000126 00000007
    00000127 00000001
    00000128 00000004
    00000129 00000001
    0000012a 00000006
    0000012b 00000006
    0000012d 00000002
    0000012e 00000001
    0000012f 00000001
    00000131 00000001
    00000132 00000001
    00000133 00000002
    00000135 00000006
    00000136 00000007
    00000137 00000001
    00000138 00000003
    00000139 00000001
    0000013a 00000007
    0000013b 00000005
    0000013d 00000002
    0000013e 00000001
    0000013f 00000001
    00000141 00000001
    00000142 00000001
    00000143 00000002
    00000145 00000006
    00000146 00000007
    00000147 00000001
    00000148 00000004
    00000149 00000001
    0000014a 00000007
    0000014b 00000005
    0000014d 00000002
    0000014e 00000001
    0000014f 00000001
    00000151 00000001
    00000152 00000001
    00000153 00000002
    00000155 00000006
    00000156 00000007
    00000157 00000001
    00000158 00000003
    
    00000159 00000001
    0000015a 00000006
    0000015b 00000005
    0000015d 00000002
    0000015e 00000001
    0000015f 00000001
    00000161 00000001
    00000163 00000002
    00000165 00000006
    00000166 00000007
    00000167 00000001
    00000168 00000003
    00000169 00000001
    0000016a 00000006
    0000016b 00000005
    0000016d 00000002
    0000016e 00000001
    0000016f 00000001
    00000171 00000001
    00000173 00000002
    00000175 00000006
    00000176 00000006
    00000177 00000001
    00000178 00000003
    00000179 00000001
    0000017a 00000005
    0000017b 00000005
    0000017d 00000002
    0000017e 00000001
    0000017f 00000001
    00000181 00000001
    00000183 00000002
    00000185 00000006
    00000186 00000006
    00000187 00000001
    00000188 00000003
    00000189 00000001
    0000018a 00000005
    0000018b 00000005
    0000018d 00000002
    0000018e 00000001
    0000018f 00000001
    00000191 00000001
    00000193 00000002
    00000195 00000005
    00000196 00000006
    00000197 00000001
    00000198 00000003
    00000199 00000001
    0000019a 00000006
    0000019b 00000005
    0000019d 00000002
    0000019e 00000001
    0000019f 00000001
    000001a1 00000001
    000001a3 00000002
    000001a5 00000005
    000001a6 00000006
    000001a7 00000001
    000001a8 00000003
    000001a9 00000001
    000001aa 00000005
    
    000001ab 00000005
    000001ad 00000002
    000001ae 00000001
    000001af 00000001
    000001b1 00000001
    000001b3 00000002
    000001b5 00000005
    000001b6 00000006
    000001b7 00000001
    000001b8 00000003
    000001b9 00000001
    000001ba 00000005
    000001bb 00000005
    000001bd 00000002
    000001be 00000001
    000001bf 00000001
    000001c1 00000001
    000001c3 00000002
    000001c5 00000005
    000001c6 00000006
    000001c7 00000001
    000001c8 00000003
    000001c9 00000001
    000001ca 00000005
    000001cb 00000005
    000001cd 00000002
    000001ce 00000001
    000001cf 00000001
    000001d1 00000001
    000001d3 00000002
    000001d5 00000005
    000001d6 00000006
    000001d7 00000001
    000001d8 00000003
    000001d9 00000001
    000001da 00000005
    000001db 00000005
    000001dd 00000002
    000001de 00000001
    000001df 00000001
    000001e1 00000001
    000001e3 00000002
    000001e5 00000005
    000001e6 00000006
    000001e7 00000001
    000001e8 00000003
    000001e9 00000001
    000001ea 00000005
    000001eb 00000004
    000001ed 00000002
    000001ee 00000001
    000001ef 00000001
    000001f1 00000001
    000001f3 00000002
    000001f5 00000005
    000001f6 00000005
    000001f7 00000001
    000001f8 00000003
    000001f9 00000001
    000001fa 00000005
    000001fb 00000005
    000001fd 00000002
    
    000001ff 00000001
    1024 point bit-reversal and butterfly run time = 11902 us
    clock frequency = 180000000
    

    but it did seem to speed it up a bit.
  • Wuerfel_21 wrote: »

    Yes, in most cases SETQ/SETQ2 needs to come directly before the instruction it's affecting

    RIP okay, that's going to take a little bit of effort. I'll table this for now for more important things on the todo list. I have ideas, will take me some time to implement them though.
  • n_ermosh wrote: »
    I added the rev instruction and ran the fft_bench test, I get a much larger output now. I haven't looked into how the benchmark works--why would changing the reverse bit method change what's being run?

    It shouldn't. Maybe you fixed other stuff with the compiler at the same time? I found that changing the optimizer level affected how many lines it sends. -O2 and above didn't work at all? I haven't looked into what is happening there yet.
  • I guess what I'm most curious about is what is the expected output when it runs correctly?
  • To fix this:
    add ptra, #64 // reserve 16 longs on the stack
    setq #9 // setup to write 9+1 longs 
    mov pa, ptra
    sub pa, #64 // get pointer to start of stack block
    wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
    

    you could use this:
    mov pa, ptra // get pointer to start of stack block
    add ptra, #64 // reserve 16 longs on the stack
    setq #9 // setup to write 9+1 longs 
    wrlong r0, pa // write r0 to that pointer and then the next 9 sequential registers
    

    Note that pa remains unchanged after the transfer, I guess that is what you want for preserving registers and restoring the stack later perhaps? Like a block pointer. It's worth spending the time getting a good fast sequence here given it will be used when calling many functions.
  • One other thing @n_ermosh , in this particular example above it looks like you were reserving 16 longs on the stack but only saving 10 registers. So does this mean you are planning on always allocating a fixed space for saving up to maximum number of 16 working registers? If so, this will of course eat up stack space faster.

    I found p2gcc seemed to allocate its internal working register use sequentially, which makes saving the number of registers easy and the stack pointer could be adjusted by the smallest amount needed each time. I don't know what LLVM tracks there to help you only save what is needed and consume the least amount of stack space required but I'd expect it is doable somehow.
  • n_ermoshn_ermosh Posts: 290
    edited 2020-08-16 01:44
    @rogloh, In that example, the remaining 6 longs are being used for local variables; it was just a snippet I grabbed from one of my examples. I'm only allocating what's needed for every function (as computed by the magical llvm internals). It will take a bit of work to rework how the prologue is inserted to make the above work, but I think it might be possible. Currently, here's the rough order of things happening that generates my code above. llvm is separately calling the functions that perform 1, 2, and 3 in various steps of compilation

    1. at function entry, insert prologue:
    add ptra, #size

    2. spill callee saved registers to the stack:
    figure out # of longs to save, which registers to start/end add, and insert those instructions:
    setq #n_regs
    wrlong r0, first_reg_frame_index

    Frame index is currently still an abstract memory reference, and it's exact offset (from the new stack pointer) is determined by llvm and not by my code.

    3. Eliminate the frame index by storing ptra to pa, subtracting, and replacing the frame_index with pa
    move pa, ptra
    sub pa, #frame_index_offset
    wrlong r0, pa

    The resulting code ends up looking like:

    add ptra, #size
    setq #n_regs
    mov pa, ptra
    sub pa, #size
    wrlong r0, pa

    I have an idea for how to work around this to make it look more like your example, without doing weird band-aid things, but I need to study for my pilot's exam this week, hopefully will be able to get back to this soon and implement some more things from the todo list. I'll keep an eye on this thread if people have ideas though.

    edit: it's actually super straightforward to change this to do a block transfer plus auto-increment PTRA. I might be able to get it done this weekend.
  • turns out it was easy. latest commits in dev branch now use setq/auto-incrementing reads/writes to ptra to push/pop registers in function calls, and manual stack allocation (for local variables, etc) happens after it. Should see some performance improvements now. One more thing crossed off the list!
  • That's great n_ermosh!

    Do you expect you'll be able to use the indexed register hub access e.g., rdlong r3, ptra[-20] for accessing locals or any arguments allocated on the stack with just a single instruction when they are not otherwise maintained in register variables? There are distance limits with this but maybe the first 32 items below the stack pointer location might be accessible this way for a performance improvement too. The compiler would of course need to know when this faster form of hub memory access is applicable. I imagine trying to support alloca() later might be a problem unless the second ptrb is used as a frame pointer perhaps and ideally pointing between locals and arguments to give maximal access range, but that should not be a priority at this stage.

    I do know the AVR chips also use this type of memory access method with GCC code and have similar distance limits too with LDD (load indirect with displacement) instructions. See below for the Y+n form with LDD reads. For AVR "n" is limited to 63 bytes while P2 would have limits of 32 bytes, words or longs below the current stack pointer depending on the read/write size which is still useful.
    int main(void) {
      6c:   cf 93           push    r28
      6e:   df 93           push    r29
      70:   cd b7           in      r28, 0x3d       ; 61
      72:   de b7           in      r29, 0x3e       ; 62
      74:   29 97           sbiw    r28, 0x09       ; 9
      76:   0f b6           in      r0, 0x3f        ; 63
      78:   f8 94           cli
      7a:   de bf           out     0x3e, r29       ; 62
      7c:   0f be           out     0x3f, r0        ; 63
      7e:   cd bf           out     0x3d, r28       ; 61
            volatile int foo;
            volatile char bar[7];
    
            while (1) {
                            PORTB = foo;
      80:   88 85           ldd     r24, Y+8        ; 0x08
      82:   99 85           ldd     r25, Y+9        ; 0x09
      84:   88 bb           out     0x18, r24       ; 24
                            PORTB = bar[3];
      86:   8c 81           ldd     r24, Y+4        ; 0x04
      88:   88 bb           out     0x18, r24       ; 24
      8a:   fa cf           rjmp    .-12            ; 0x80 <main+0x14>
    
  • @rogloh it should be pretty easy to implement as well. Since stack slots are always referred to by an arbitrary frame index (stack pointer +/- an offset), and replaced at the end with actual registers, it should be pretty easy to implement logic that will check the size/alignment of the offset, then choose to either use the indexed ptra mode or just save ptra to a register and subtract (like I currently do), or maybe insert augs to increase the possible offset distance. There's probably a few corner cases where this could break something, but I'll cross that bridge. My next task is to test out dynamic memory allocation and see how that work and what will happen.

    One major aspect of LLVM I haven't even touched yet is that you can give it hints to the cost of executing a particular instruction. This is then used internally for scheduling as well as optimization to pick the appropriate instructions to use, without any effort from me other than just defining the costs. This can be used to prefer storing things in registers (which it already does), but also interestingly for non-blocking instructions like qdiv or qmul, it can be told to place the getqx/getqy later if it doesn't need the result immediately so that the CORDIC can run in the background until complete.
  • roglohrogloh Posts: 5,171
    edited 2020-08-17 02:38
    Sounds good. If you do use AUGS (you may need to in some cases) it could require several instructions to read offsets from the stack pointer. Eg.
    mov pa, ptra ' ptra is SP
    sub pa, ##nnnn  ' uses AUGS once nnnn exceeds 511
    rdlong r3, pa
    

    this creates 4 instructions instead of 1 for each local access which while also somewhat slower can start to quickly bloat the code. Eg. it burns 16 bytes to read something and maybe another 16 bytes elsewhere to write the result back.

    I also sort of wonder if any of the ALT instructions can be used somehow with AUGS to save an instruction but am not sure there.

    If you can make uses of the hardware multiplier that would be good too. Unfortunately the really fast multiply is limited to 16 bit operands but maybe you can still make good use of it in internal calculations where possible if you know the inputs will fit.

    The other P2 things to try to leverage are the bit operations, addsx etc for 64 bit stuff, maybe byte accesses getbyte/setbyte if this is detectable in the code, DJZ loop stuff, any small skip sequences where possible and good use of condition flag execution.
  • Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction for when there's a large offset. effectively doing:
    augs #nnnnn
    rdlong r3, ptra[nnnnn]
    

    condition flag execution is a big one that I'm not yet set up to do. However, I bet there's a post-compilation optimizer pass that can be written to convert small loops or branches into conditional execution and insert skips and such. LLVM has a nice structure that lets you write passes for each function at various stages of compilation and do whatever you want to the function, on top of everything LLVM does. I have one currently called "expand pseudo instructions" which converts pseudo instructions I define (that don't have an equivalent P2 instruction, but I'm telling LLVM one will exist) into a small series of real instructions.
  • roglohrogloh Posts: 5,171
    edited 2020-08-17 05:43
    n_ermosh wrote: »
    Can't I just use augs to set the index of ptra to be a large offset? Page 61 of the datasheet, unless I'm not understanding how that works? That should just add 1 instruction for when there's a large offset. effectively doing:
    augs #nnnnn
    rdlong r3, ptra[nnnnn]
    

    If you can actually do that it would be very good. But you may need to test that idea out first, because RDLONG addresses are encoded differently and so it may not work the same as you'd assume. I'm guessing that it won't work but would be very happy to be proven wrong there. EDIT: okay I read page 61 and you might be correct! :smile: That's very efficient with only a second instruction needed over the "rdlong r0, ptra[10]" form.

    For loops and branches, the conditional execution should be very common with z as well as c flags for typical greater/less than comparisons for example. TJZ and TJNZ should come in very useful too.
          cmp r1,r2 wz
    if_nz jmp #loop
    
  • I'm currently working out dynamic memory allocation, but next on my list is optimizing loops. Defining a pattern to match comparison to 0 should be pretty easy to save an instruction. Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance, as any really high performance code can always be written in assembly (although I think an end user should never have to touch assembly unless doing something really, really specialized). Lots to do...
  • n_ermosh wrote: »
    Although, I think my overarching philosophy for writing optimizations (for now) is to focus on compiler functionality and completeness rather than on performance

    Absolutely. Once it's (fully) functional, folks can start using it to write libraries while you simultaneously enhance performance. The optimization rabbit hole is never ending....
  • For anyone building anything using cmake: please forget that gnu make exists. It's highly counter-productive. Use ninja everywhere. It's the defacto modern build tool. Once you experience its performance, going back to gnu make will sound and feel ridonkulous. make tries to be a general purpose tool, and its code base and performance bears that fact. ninja tries to do one job only and does it extremely well.`cmake -G Ninja` all the way. Thank me later :) Yes, you need to install ninja.
Sign In or Register to comment.