@ersmith said:
You might have better luck with riscvp2, which is full GCC and already has support for putting code into flash memory. Although given the specs of LVGL it may not need external memory at all?
Ugh. Want to use native P2 code not RISCV instructions.
How much are you writing in assembly though? If you're worried about performance, riscvp2 does eventually compile the code to native P2, just at runtime rather than ahead of time. Its CoreMark score is around 52 iterations/sec at 180 MHz (when run from HUB or flash; results are the same either way) vs around 37 iterations/sec for flexspin and 17 for an older Catalina.
Yeah, I'll probably not use assembly; more a mix of C and SPIN2. But it's mainly the integration with other SPIN2 stuff like my memory driver that flexspin handles so well and compels me to use it. Also, if I do run into problems I can debug PASM2 instructions easily as I'm very familiar with those, while I'm a complete newbie with respect to RISC-V and don't really want to have to learn that just to fix code issues I may encounter on a P2.
I do actually have an application in mind for this if I can get it working, hence my appetite for fixing these issues. My car has a 9 inch LCD screen with a spare video input for reversing cameras which I don't use, and it'd be cool to feed it a P2's video output with a nice GUI showing some automotive stuff like TPMS sensor data and any other CAN accessible data I might like to see. A P2 board with external RAM, such as the Edge, might be useful to control all this. It'd mostly just need my PSRAM driver for the external memory code, some other IO pins for accessing HW, and a composite video output, so it's quite doable. A lot of HUB RAM would be freed up for a screen buffer if I can place this LVGL code into external RAM, and LVGL is seemingly very capable of displaying fancy dials and other presentable GUI widgets in real time. I could use something else such as a Pico or ESP device but obviously prefer to use a P2. Now I'm still not sure about performance, but for a high speed P2 I'm still optimistic (although maybe a little less so since I saw all that heap activity in the disassembled code for some functions).
I was remiss earlier in not mentioning Catalina, which already has external memory support too. Although I'd definitely be very happy to have you continue to work on an external memory solution for flexspin, which would be awesome but if you need something working sooner then Catalina may do the job.
True. I may look at that at some point, but again it's the features of flexspin like mixing SPIN2 and C objects which are compelling in this case.
The reason I still have hope is that with some changes that make flexspin happier (removing inline and const, and renaming some variables) I've been able to get the error list down to just the same kind of incompatible assignment type error, and if you do end up finding a general fix for that issue it might potentially resolve all of them (with any luck).
I'll continue looking into that. On that note, do you think you could re-post your current code? The all.zip that you posted up-thread was missing the actual lvgl directory.
I'll post the files I modified with a github link to the 9.5 tree of LVGL which IIRC forms that sub-directory. Probably much smaller that way; otherwise it's ~700MB. Will edit this post here with the ZIP once I've sorted it out for you.
EDIT: Just added projchanges.zip and instructions within its included readme.txt file for getting the other files.
I do have one other strange error left though which is some internal error now reported in the C startup code added automatically at the end - not part of my project list but something flexspin is adding. Strangely it is reported on line 24 and the file only goes up to line 23. ./Users/roger/Applications/spin2cpp/include/libsys/c_startup.c:24: error: Internal error, unknown type 114 passed to IsArrayType
When the error comes at the end of the file like that it usually means it's some internal issue not really associated with any source code (and the source file name is probably bogus too). The particular error is complaining about trying to find the type, and 114 is an AST_USING, so it's probably upset about a struct __using statement somewhere. It might be related somehow to the other struct errors, hard to say.
Ok. That was the last one, so I'll wait until the other stuff is hopefully working before trying to address that. It only showed up recently.
So with some tweaks today I was able to get LVGL building on a P2 with n_ermosh's P2 LLVM compiler. For reference, this dummy build (basic LVGL code but no real application) consumes about 322kB of HUB RAM for its code segment and about 14kB of data, plus it would still need some space left for a (partial) frame buffer and a heap+stack. It would be pushing it to fit all this into HUB unless the frame buffer is of lower resolution (like 320x200). One possibility to improve this would be to keep the frame buffer(s) in external memory and just transfer portions to/from PSRAM as the screen is rendered into a partial frame buffer in HUB RAM. I read somewhere that LVGL needs a local frame buffer of at least 10% of the screen size to still work ok. So that might be a possibility for me if a flexspin build running LVGL from external RAM can't be achieved. It would allow higher bit depths too.
It'll be interesting to compare how much HUB RAM the flexspin code will need for this same build if we can fix the compile errors so it completes. I compiled with the -Os option on LLVM, so it's meant to be optimized for smaller size, and uncalled functions are meant to be removed. The full list of LVGL files in this particular build is attached so we can compare later. I also cut the internally allocated free memory area down to 32kB from 64kB, otherwise the BSS use would be an extra 32kB and the default-sized heap then collides with the default stack. I need to work out how to shrink those a bit; the default stack size is 48kB and probably way too big.
Another interesting data point below. LLVM compiled CoreMark code running on the P2 is certainly faster than what we saw before (now 41 per second), although it fails the benchmark's CRC check for some reason, so the generated code may well have a bug of some type somewhere.
We were seeing about 30 per second with optimizations enabled under flexspin before, and slower with my external RAM scheme (18 per second with safe? optimizations enabled). Same 160MHz P2 clock on each.
2K performance run parameters for coremark.
[0]ERROR! list crc 0x4af3 - should be 0xe714
[0]ERROR! matrix crc 0xcb74 - should be 0x1fd7
[0]ERROR! state crc 0xec26 - should be 0x8e3a
CoreMark Size : 666
Total ticks : 2022828112
Total time (secs): 12
Iterations/Sec : 41
Iterations : 500
Compiler version : GCCClang 14.0.0 (https://github.com/ne75/llvm-project.git 72a9bb1ef2656d9953d1f41a8196d425ff2ab0b1)
Compiler flags : -Os
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0x4af3
[0]crcmatrix : 0xcb74
[0]crcstate : 0xec26
[0]crcfinal : 0x0ebd
Errors detected
Another data point is that both toolchains create different total code/data segment sizes. Flexspin's generated code size is a bit smaller than LLVM's, 34052 vs 41852 bytes, but it also needs some extra data space used for its method tables etc (2900 vs 2372 bytes):
LLVM derived CoreMark application:
Sections:
Idx Name Size VMA Type
0 00000000 00000000
1 .text 0000a37c 00000000 TEXT -> $a37c = 41852 bytes
2 .rodata 000005d8 0000a380 DATA
3 .data 000000d0 0000a958 DATA -> $d0+$5d8 = 1496 bytes of preallocated data
4 .bss 0000036c 0000aa28 BSS -> $36c = 876 bytes of zeroed data (so total data = 2372)
5 .heap 0000bffc 0000ad94 BSS
6 .stack 0000c000 00070000 BSS
Flexspin derived CoreMark application:
$0000 code begins
$8504 code ends (code len=34052 bytes)
$8508 data starts
$9058 data ends (data len=2900 bytes)
So I retested CoreMark with all optimizations disabled under LLVM and it didn't fail the CRC result check any more, but the CoreMark rate fell back to 13 iterations/second. So the LLVM code generation bug must be down to some particular optimization breaking the code. This makes sense, as I also found that when I printed out some intermediate variable results, a comparison between the flexspin and LLVM generated images got further before any differences were seen. So some important code seems to be getting broken under LLVM with the -Os option passed to the compiler. That's likely to be a big problem if I wanted to try to use it for LVGL until that is figured out and resolved, because compiling LVGL under LLVM with optimizations off made the image size exceed HUB RAM once a cache and heap were allocated as well.
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 3577857752
Total time (secs): 22
Iterations/Sec : 13
Iterations : 300
Compiler version : GCCClang 14.0.0 (https://github.com/ne75/llvm-project.git 72a9bb1ef2656d9953d1f41a8196d425ff2ab0b1)
Compiler flags : -Os
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0xe714
[0]crcmatrix : 0x1fd7
[0]crcstate : 0x8e3a
[0]crcfinal : 0x5275
Correct operation validated. See README.md for run and reporting rules.
This bug will be in one of the differences in the code functions below (optimized code on left, unoptimized on right). Not sure which one without examining the code thoroughly, and I don't want to dig into it that much right now. I guess I could try to apply each optimization one at a time to find when it breaks, if that is possible with LLVM. I'd need to get the list of optimizations first.
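LLVM does support narrowing this down per pass rather than per -O level: clang accepts `-mllvm -opt-bisect-limit=N`, which runs only the first N optimization passes and prints which pass each index corresponds to. A binary search over N finds the first pass that reproduces the CRC failure. A sketch of that search, with the actual build/run/check step stubbed out (the cut-off value 57 here is purely made up for illustration):

```shell
#!/bin/sh
# Binary-search for the first LLVM pass index whose optimization breaks CoreMark.
# In real use, is_bad() would compile with
#   clang -Os -mllvm -opt-bisect-limit=$1 ...
# then load the image on the P2, run it, and check whether the CRCs validate.
is_bad() { [ "$1" -ge 57 ]; }   # stub predicate: pretend pass #57 is the culprit

lo=0                            # known good (no passes run; CRCs pass)
hi=512                          # known bad (enough passes to reproduce the failure)
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(((lo + hi) / 2))
    if is_bad "$mid"; then hi=$mid; else lo=$mid; fi
done
echo "first failing opt-bisect-limit: $hi"
```

Each probe is one compile-and-run cycle, so the culprit pass falls out in about log2(N) builds instead of one per optimization.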
UPDATE: The -O1 optimization level also causes the CRC error, but it boosted performance from 13 to 42 iterations per second (more than a 3x boost alone!).
2K performance run parameters for coremark.
[0]ERROR! list crc 0x2170 - should be 0xe714
[0]ERROR! matrix crc 0x4d15 - should be 0x1fd7
[0]ERROR! state crc 0x84d4 - should be 0x8e3a
CoreMark Size : 666
Total ticks : 1196990704
Total time (secs): 7
Iterations/Sec : 42
ERROR! Must execute for at least 10 secs for a valid result!
Iterations : 300
Compiler version : GCCClang 14.0.0 (https://github.com/ne75/llvm-project.git 72a9bb1ef2656d9953d1f41a8196d425ff2ab0b1)
Compiler flags : -Os
Memory location : STACK
seedcrc : 0xe9f5
[0]crclist : 0x2170
[0]crcmatrix : 0x4d15
[0]crcstate : 0x84d4
[0]crcfinal : 0x4511
Errors detected
I retested CoreMark with all optimizations disabled under LLVM and it didn't fail the CRC result check any more but the CoreMark rate fell back to 13 iterations/second.
Promising, but I don't think you can compare coremark results unless it validates correctly - one of the reasons for using coremark is that it self-validates so you know the results are an accurate representation. Also, coremark results are generally expressed as floating point - it is not clear from the output whether they are being truncated to integers during the calculation or during the printing, but either one will skew the results and also make it difficult to compare code sizes, since omitting floating point will generate smaller code.
I retested CoreMark with all optimizations disabled under LLVM and it didn't fail the CRC result check any more but the CoreMark rate fell back to 13 iterations/second.
Promising, but I don't think you can compare coremark results unless it validates correctly - one of the reasons for using coremark is that it self-validates so you know the results are an accurate representation.
Yes, I would agree fully. I still want to fix this problem if I can, as LLVM certainly does look promising as well, but without optimizations enabled it's definitely not as fast as it could be. Its full linker could probably easily be coaxed to place nominated code at upper RAM addresses. The code it generates does include tjz and tjnz however, which would have to change to a test followed by a conditional branch as separate instructions. The rest seems pretty clean. It uses the CALLA/RETA syntax for function calls everywhere, which is different to how flexspin does it.
Also, coremark results are generally expressed as floating point - it is not clear from the output whether they are being truncated to integers during the calculation or during the printing, but either one will skew the results and also make it difficult to compare code sizes, since omitting floating point will generate smaller code.
Ross.
Both were generated without floating point enabled in the command line args so it should be a reasonable comparison. That is why they are printing integers.
/* Configuration : HAS_FLOAT
Define to 1 if the platform supports floating point.
*/
#ifndef HAS_FLOAT
#define HAS_FLOAT 0
#endif
....
#if HAS_FLOAT
ee_printf("Total time (secs): %f\n", time_in_secs(total_time));
if (time_in_secs(total_time) > 0)
ee_printf("Iterations/Sec : %f\n",
default_num_contexts * results[0].iterations
/ time_in_secs(total_time));
#else
ee_printf("Total time (secs): %d\n", time_in_secs(total_time));
if (time_in_secs(total_time) > 0)
ee_printf("Iterations/Sec : %d\n",
default_num_contexts * results[0].iterations
/ time_in_secs(total_time));
#endif
Found one suspicious thing in the LLVM optimized code. This CoreMark bug may relate to the CORDIC QMUL. For some reason in this optimized code snippet, it is potentially executing a GETQX before a QMUL operation. This doesn't look right to me unless it's trying to flush the CORDIC somehow first, but that is seemingly never done in the unoptimized code, which always puts the QMUL before the GETQX unless it needs to directly feed the next QMUL from a prior CORDIC result, which doesn't appear to be happening here.
Perhaps the LLVM code generator needs to be told not to reorder CORDIC instruction pairs/groups like "QMUL" followed by "GETQX/Y", if that is not already done; perhaps it's trying to pipeline it somehow and it's going awry. The CoreMark matrix multiply operation is definitely wrong when I print out the resulting matrix values and compare with flexspin's results, and its CRC obviously fails then too.
Tagging @n_ermosh as well in case it helps.
EDIT: I'm guessing this might be happening because if you do something like:
QMUL r1, r2
GETQX r3
then the compiler's optimizer thinks it could reorder the instruction affecting the r3 destination register independently because it doesn't know it actually (indirectly) still depends on r1 & r2 from the QMUL. Maybe that needs to be somehow specified in the instruction definitions if it's not already being done.
@rogloh said:
Found one suspicious thing in the LLVM optimized code. This CoreMark bug may relate to the CORDIC QMUL. For some reason in this optimized code snippet, it is potentially executing a GETQX before a QMUL operation. This doesn't look right to me unless it's trying to flush the CORDIC somehow first, but that is seemingly never done in the unoptimized code, which always puts the QMUL before the GETQX unless it needs to directly feed the next QMUL from a prior CORDIC result, which doesn't appear to be happening here.
Perhaps the LLVM code generator needs to be told not to reorder CORDIC instruction pairs/groups like "QMUL" followed by "GETQX/Y", if that is not already done; perhaps it's trying to pipeline it somehow and it's going awry. The CoreMark matrix multiply operation is definitely wrong when I print out the resulting matrix values and compare with flexspin's results, and its CRC obviously fails then too.
On a general point, is it possible for compilers to be told about relative hub RAM timings? In the code above, the first write of wrlong r3, ptra++ takes 9 cycles with 6 waits after wrlong r0, ptra++ so mov r4, r1 and mov r5, r0 could be shifted up, saving four cycles. Also, the write in a read-modify-write is always the worst case of 10 cycles with 7 waits so add r8, #1 and cmp r8, r5 wcz could be moved above wrword r9, r10 saving four more cycles.
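A sketch of the first reordering being suggested, using the register names from the quoted code (the surrounding code is hypothetical, and the cycle figures are the hub-window timings described above):

```asm
        wrlong  r0, ptra++      ' first write hits a hub window
        mov     r4, r1          ' hoisted: executes during the ~6 wait
        mov     r5, r0          ' cycles before the next hub window
        wrlong  r3, ptra++      ' second write now stalls ~4 fewer cycles
```

The same idea applies to the read-modify-write case: sliding the add/cmp pair above the wrword fills otherwise-dead wait cycles.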
I've just fixed this and pushed the fix to GitHub. p2asm now generates the same code as PNut and the Propeller Tool for the LOC instruction.