C and C++ for the Propeller 2


Comments

  • ersmith wrote: »
    We'd have to set out some rules (like no inline assembly) to really test the language itself

    This is a good point, but brings up a second point. Benchmarking is all about comparing languages so that you can make an educated decision on what to use, right? But if two languages both support inline-assembly, then benchmarking may be seen as pointless - they're both equally fast. At which point, the next most interesting topic is "how easy is it to use inline assembly?"
    So perhaps, alongside the code for each language in this benchmark, you also show a snippet of what it looks like to encapsulate that benchmark in a function, where the internals of that function are inline assembly. Then you might as well include benchmark numbers for that as well, just to show off any difference in overhead between the languages.
    I suggest this due in no small part to my bias for PropGCC and fcache over Spin... but that doesn't change the fact that I think it's a worthy point. Where PropGCC shines relative to Spin, in being able to do inline assembly instead of needing a second cog, FastSpin will do even better (I think... right? I think the inline assembly syntax is much simpler than GCC's... right?).
  • If we can get agreement on a decent set of intrinsics/builtins and predefined constants/etc. for the C/C++ compilers to have/use for P2 (and P1), then inline assembly becomes less needed. Especially with good optimizing compilers.
    You will still need it for some cases where timing is critical or when doing something non-standard.

    It would be even nicer if we could have the same form of inline assembly in all the compilers, but that's not feasible really. They could be made pretty close (some even identical) with macros, which is worth exploring.
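    As a rough sketch of that macro idea (untested; the __FLEXC__/__GNUC__ guards and the P2_MOVBYTS name are illustrative assumptions, and it presumes a compiler whose inline assembly accepts PASM), one per-compiler wrapper might look like:
    #if defined(__FLEXC__)
      /* fastspin/flexcc style: a plain __asm block that can reference C locals */
      #define P2_MOVBYTS(var, sel)  do { __asm { movbyts var, sel } } while (0)
    #elif defined(__GNUC__)
      /* GCC extended-asm style: same instruction, operands passed via constraints */
      #define P2_MOVBYTS(var, sel)  __asm__("movbyts %0, %1" : "+r"(var) : "r"(sel))
    #endif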
  • DavidZemon wrote: »
    ersmith wrote: »
    We'd have to set out some rules (like no inline assembly) to really test the language itself

    This is a good point, but brings up a second point. Benchmarking is all about comparing languages so that you can make an educated decision on what to use, right? But if two languages both support inline-assembly, then benchmarking may be seen as pointless - they're both equally fast.
    I don't think I'd agree about that: surely the speed of the core language still matters, at least to the degree that it determines how much inline assembly you need. At one extreme, if the compiler is good enough you'll hardly ever need inline assembly. At the other extreme if you're writing nearly everything in assembly then you might as well not have a high level language!
    At which point, the next most interesting topic is "how easy is it to use inline assembly?"
    Yes, that's definitely another dimension along which one could compare languages, although it's not a benchmark per se.
    I suggest this due in no small part to my bias for PropGCC and fcache over Spin... but that doesn't change the fact that I think it's a worthy point. Where PropGCC shines relative to Spin, in being able to do inline assembly instead of needing a second cog, FastSpin will do even better (I think... right? I think the inline assembly syntax is much simpler than GCC's... right?).

    I do think fastspin's inline assembly is easier to use. For example I'm working on a version of the benchmark that tries to go as fast as possible by reading 4 bytes at a time and swapping them, and for that a little bit of inline assembly helps:
    #if defined(USE_FASTSPIN_INLINE)
      __asm {
        movbyts SPIData, #0b00011011
      }
    #elif defined(USE_GCC_INLINE)
      __asm__ ( "movbyts %0, #0b00011011" : "=r"(SPIData) : : );
    #else
      SPIData <<= 24;
      SPIData |= ((rawData >> 8) & 0xff) << 16;
      SPIData |= ((rawData >> 16) & 0xff) << 8;
      SPIData |= rawData >> 24;
    #endif
    
  • Roy Eltham wrote: »
    If we can get agreement on a decent set of intrinsics/builtins and predefined constants/etc. for the C/C++ compilers to have/use for P2 (and P1), then inline assembly becomes less needed. Especially with good optimizing compilers.
    You will still need it for some cases where timing is critical or when doing something non-standard.
    Definitely. I think we should probably re-visit propeller2.h. For one thing, Chip has changed a number of the Spin names so they no longer match what's done in C; whether that matters or not is an open question I guess. For another, there are a number of useful PASM features (like byte swapping and pixel blending operations) that don't have propeller2.h equivalents.
    It would be even nicer if we could have the same form of inline assembly in all the compilers, but that's not feasible really. They could be made pretty close (some even identical) with macros, which is worth exploring.

    That's an interesting idea. I'm not sure if it's feasible or not. Is anyone up for trying?
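    As a rough sketch (the function name is made up, not an existing propeller2.h entry), a byte-swap intrinsic in the GCC style might look something like:
    /* Hypothetical intrinsic wrapper for a PASM operation with no C equivalent.
       movbyts with selector %00011011 moves byte 3,2,1,0 into 0,1,2,3, i.e. a
       full byte reversal. */
    static inline unsigned int _byteswap32(unsigned int x)
    {
        __asm__("movbyts %0, #0b00011011" : "+r"(x));
        return x;
    }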
  • @ersmith, yep, all good points :)
  • Cluso99 Posts: 18,069
    On P1, code size was critical and meant C wasn’t a good fit without XMM.

    P2 is very much different. Programs written for the P2 in a HLL are not so critical for speed or size because, IMHO, drivers will almost always be done in PASM and operate in their own cog.

    What I have found with both P1 and P2 is that there is mostly only a single big program (one cog) that ties the little cog drivers together. It is this big program that is much better done in a HLL, and this is what we’re talking about. Here the language of choice is more to do with the writer than anything else. Speed and code size are a much lower priority here. Spin2 (pnut/proptool) will give you slower but smaller code, whereas C etc. (catalina, fastspin, gcc, llvm, basic, python or any other) is more the writer’s choice.
    Fastspin shines because it can do a number of languages in one compiler, can mix inline pasm, and compiles spin2 (not interpreted).

    It’s a nice objective to get all the C compilers to be as similar as possible, and I do hope they will be, but there will always be differences. At least we have a choice, which is much better than having only one.

    For those who remember the P1, we were asking for years to get @@@ and #define (conditional assembly), and we were flatly told by Parallax there would be no PropTool additions. Hence Brad and Mark wrote bst and homespun. We then had these extras, and many of us used them to great advantage even though there was no support from Parallax. We still use them even though neither is supported, because they are basically bug free. Did you know Catalina uses homespun with a couple of mods?

    Anyway, I see we are in a good place, excepting my own dislike for pnut/proptool for P2.
  • rogloh Posts: 5,787
    edited 2020-08-08 08:29
    Eric, just watched the recorded presentation on YT. Thanks for putting in the time to provide a good comparison of what is out there in C tools for the P2. At one point you mentioned the extensions I added for p2gcc, so I will include more about that here, showing what is involved.

    Firstly, I was able to reproduce your numbers for p2gcc with the fft_bench test code in my own setup. It turns out it ran at 252MHz instead of 180MHz due to my prefix.o default (loadp2 freq patching never likes working for me), but when I scaled down accordingly I got pretty much the same number, ~22.8us, and the code size was 20660 bytes, which is close too. Then, for my own interest, I ran it through the custom p2gcc flow I use with MicroPython instead (shown below) to see if that would help, and retried. It only improved slightly to ~21.7us after again scaling from 252MHz to 180MHz (so a 5% boost), and the image size was 20600 bytes. For this benchmark I found there were only two cases where my prolog & epilog optimization helped, so that probably explains the modest improvement. That part probably optimizes better for more deeply nested functions called from a loop, such as recursive stuff. There are also some other optimizations it can do which replace constant lookups with directly loading the constant values into the registers, making use of the P2's ## where possible. Not sure if the benchmark needed much of that; I didn't check for it.
    propeller-elf-gcc -I. -I../.. -Ibuild -Wall -std=c99   -mcog -S -mno-fcache -m32bit-doubles  -Os -D__P2__ -DNDEBUG -S -MD -o fft_bench.spin1 fft_bench.c
    #Prolog/epilog replacements
    cp fft_bench.spin1 fft_bench.spin0
    perl -i -0p prolog8 fft_bench.spin0
    perl -i -0p prolog9 fft_bench.spin0
    perl -i -0p prolog10 fft_bench.spin0
    perl -i -0p prolog11 fft_bench.spin0
    perl -i -0p prolog12 fft_bench.spin0
    perl -i -0p prolog13 fft_bench.spin0
    perl -i -0p prolog14 fft_bench.spin0
    perl -i -0p epilog8 fft_bench.spin0
    perl -i -0p epilog9 fft_bench.spin0
    perl -i -0p epilog10 fft_bench.spin0
    perl -i -0p epilog11 fft_bench.spin0
    perl -i -0p epilog12 fft_bench.spin0
    perl -i -0p epilog13 fft_bench.spin0
    perl -i -0p epilog14 fft_bench.spin0
    sed -f transformsbackwards.sed fft_bench.spin0 > fft_bench.spin
    #Translate SPIN1 to SPIN2
    s2pasm -p/Users/roger/Code/p2gcc/lib/prefix.spin2 fft_bench.spin
    #Optimize
    ./optimize fft_bench.spin2
    #Assemble 
    p2asm -c -o -hub fft_bench.spin2
    #Link
    p2link -o fft_bench.binary -L /Users/roger/Code/p2gcc/lib prefix.o p2start.o puts.o printf.o longlong.o clock.o stdio.a string.a fft_bench.o
    #Load P2
    ~/Downloads/flexgui-2/bin/loadp2 -CHIP fft_bench.binary -t -b 115200
    

    Here's the output I got when running this code with my optimising extensions (Note: P2 at 252MHz, scale accordingly):
    ( Entering terminal mode.  Press Ctrl-] to exit. )
    fft_bench v1.0 for PROPELLER
    Freq.    Magnitude
    00000000 000001FE
    000000C0 000001FF
    00000140 000001FF
    00000200 000001FF
    1024 point bit-reversal and butterfly run time = 15503 us
    clock frequency = 252000000
    

    If anyone wants to look at these extras, I have included them here in the zip file. I believe that for some of this to work you first need to reverse the order of the r0-r15 registers as laid out in the prefix.spin2 file, and then update any lib source files that access registers directly by number (like setjmp.spin2, cognew2.spin2, spi.spin2) with the updated register address offsets, and recompile the library. This reversal then allows the efficient use of setq transfers, which read/write only in ascending address order, for speeding up prologs and epilogs that save/restore different numbers of registers to hub. There's probably some effective way to bundle all the perl stuff into just one script, but I'm not a Perl guy; it's the first time I've ever used it.
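    For anyone curious, here is a rough illustration only of the pattern the rewritten prologs/epilogs rely on (the register range, count and GCC-style wrapper are assumptions for illustration, not actual p2gcc output): a setq immediately before wrlong turns it into a burst write of consecutive cog registers to ascending hub addresses, instead of one wrlong per register.
    #include <stdint.h>

    /* Illustrative burst save of a run of consecutive cog registers to hub. */
    static inline void burst_save_r8_r14(uint32_t *stack)
    {
        __asm__ volatile(
            "setq   #6\n\t"        /* transfer 6+1 = 7 longs            */
            "wrlong r8, %0"        /* burst-write r8..r14 starting here */
            :
            : "r"(stack)
            : "memory");
    }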
  • rogloh Posts: 5,787
    edited 2020-08-08 07:48
    Another point. I think Ken was asking about which compiler to get behind. I could see Eric's reluctance there and I understand it too. To me anyway there still probably isn't a very good P2 solution yet for professional C developers who likely want to use something like GCC. It also feels like if you pick something that exists now and commit yourself to it alone you might be dooming the P2 to never getting a good C compiler. I don't think any of us want that.

    p2gcc: a gcc-based compiler with reasonable performance, but it does not make complete use of P2 instructions that could speed things up even further, and it is stuck with only 16 regs, which I know can slow things down. It needs additional custom assembly/linker/translation tools to complete the build. There is no new development/support on this toolchain any more, it has incomplete libraries, and any further compiler improvements from later GCC versions will not be obtained here. It works, but it is somewhat hobbled.

    riscv gcc: again gcc based, supporting a good set of libraries, and it has good performance for the most part; fabulous work by Eric to get it all to work. The downside (for me anyway) is that it decouples the developer from the generated code and uses a different ISA from the P2, which makes it difficult to debug or to look at any disassembled listing files without understanding the intermediate RISC-V ISA. You mostly need to consider it as running inside a black box that mainly Eric knows about. It would seem ridiculous to suggest to your developers that they need to learn about RISC-V to translate C to P2 assembly. It offers some code compression too, which is beneficial, but it also needs the JIT compiler and its own variable overhead at runtime and hub memory footprint. Perhaps right now it might be the only current way to get much C++ working, depending on what p2gcc can do there (I've not tried any C++ with p2gcc and don't know what is missing, probably a lot).

    The native GCC9.1 based effort started previously by @ntosme2 seems to have ceased at the end of last year which is a real pity as it might have given us the modern toolchain upgrades and native P2 capabilities we all really wanted.

    Very much hoping the LLVM effort @n_ermosh is putting in now may make some inroads. It isn't GCC but could be acceptable for many developers, though I wonder if something like MicroPython will compile easily with LLVM or will need changes in order to get it to work. I didn't get much back when googling LLVM/Clang and MicroPython use. Most MicroPython devs would seem to be building with GCC.

    Fastspin and Catalina are unfortunately going to come up short for developers, due to their own limitations.

    I see this current (hopefully temporary) state of affairs as frustrating because I expect that in the short term it can and will negatively impact sales/deployments of P2's for many of those embedded applications depending on C. E.g. consider some robotics applications where embedded C code is already extensively used and a P2 is considered for use somewhere. From a HW perspective the P2 may be an ideal chip for some of these applications, with its flexible Smartpins and analog stuff etc., but perhaps it won't be used because the existing codebase won't easily port over or be generated by the "typical" toolchain and therefore needs a lot of workarounds to get it to run. This already frustrates me, and I'm not even Parallax, who will depend on lots of P2 sales. I keep saying this, but the AVR was a big success because it could use a free GCC toolchain, making it much easier for developers to get behind it. No need for custom or expensive tools. I'm convinced of this.

    In terms of making decisions based on benchmarks, I think that is probably not the ideal way to do it. Usability and compatibility are probably going to be more important for many people, especially if the performance differences are less than 2x or so. Good debug capabilities for C are simply expected these days, and people will be writing code that is more modern than what C89 offered back in the day. C++ compatibility is also going to be desirable, though personally I'm not a fan of C++ even though I spent many years enduring it. I found it's one of those things where the more you use it and learn it, the less you like it, particularly debugging it. It's only gotten more complicated since then too. :smile:

    In terms of ultimate benchmark performance, I'd expect a proper full native P2 port of a modern optimising C compiler should always be faster than a RISC-V JIT based approach if both run from HUB. There cannot be a case where a different instruction set to the P2's, still ultimately compiled to P2 assembly, runs faster than native P2 code doing the same work. It might be smaller, yes, but it cannot be faster, unless its compiled code runs from LUT RAM instead of HUB RAM and gains an improvement there by essentially being cached; however, that is a different execution model and not really hub-exec. The RISC-V JIT model for compiling C code to native code is probably benefitting right now mainly because we currently only have a somewhat mediocre compiler for P2 that emits more native code than it could/should. A proper native C compiler for P2 should resolve this; we just don't have it yet.

    So maybe in the short term, the focus on a more complete set of libraries and generic header files makes sense. With any luck that work can be used by whatever C approach for P2 wins out in the end.
  • rogloh wrote: »
    Very much hoping the LLVM effort @n_ermosh is putting in now may make some inroads. It isn't GCC but could be acceptable for many developers, though I wonder if something like MicroPython will compile easily with LLVM or will need changes in order to get it to work.

    Clang/LLVM is pretty much equivalent to GCC in terms of features. Clang supports pretty much all the GCC extensions (even the weird ones like computed goto) and many developers are familiar with it from it being used by Android's NDK and all the Apple stuff.
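    For anyone unfamiliar with it, computed goto (the "weird" extension mentioned above) looks like the sketch below; it's the pattern interpreters such as MicroPython use for bytecode dispatch when the compiler supports it:
    /* Minimal computed-goto example (GNU C extension, accepted by GCC and Clang). */
    static int dispatch(int op, int acc)
    {
        static void *table[] = { &&do_inc, &&do_dec };   /* label addresses */
        goto *table[op & 1];                             /* indirect jump   */
    do_inc:
        return acc + 1;
    do_dec:
        return acc - 1;
    }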
  • rogloh wrote: »
    E.g. consider some robotics applications where embedded C code is already extensively used and a P2 is considered for use somewhere. From a HW perspective the P2 may be an ideal chip for some of these applications, with its flexible Smartpins and analog stuff etc., but perhaps it won't be used because the existing codebase won't easily port over or be generated by the "typical" toolchain and therefore needs a lot of workarounds to get it to run. This already frustrates me, and I'm not even Parallax, who will depend on lots of P2 sales. I keep saying this, but the AVR was a big success because it could use a free GCC toolchain, making it much easier for developers to get behind it. No need for custom or expensive tools. I'm convinced of this.

    This is exactly why I started the llvm effort. I'm hoping more people can help me support it and continue to improve it, but if not, I intend to (when I can) continue to support it until it's a complete and usable tool. Everyone I talk to about the P2 and what it can do is amazed and really wants it in their projects, but immediately drops the idea because of a lack of toolchain support in favor of ATSAM, AVR, or Infineon's XMC chips.

  • Wuerfel_21 Posts: 5,053
    edited 2020-08-08 20:19
    Another interesting thing: What about overlay or XMM support?

    If I'm not mistaken, running code from something fast like HyperRAM would be in the same performance range as LMM on P1 (~18 cycles per instruction) even with no loops and a fairly pessimistic estimate of memory speed (and that code won't be requested ahead of time). (And obviously the performance is the same as native in hot loops)

    Do any of the current P2 compilers support overlays or XMM?
  • rogloh Posts: 5,787
    edited 2020-08-09 02:19
    n_ermosh wrote: »
    This is exactly why I started the llvm effort. I'm hoping more people can help me support it and continue to improve it, but if not, I intend to (when I can) continue to support it until it's a complete and usable tool.
    Great effort so far. If you stick at it you'll get there and let the P2 fully shine with C. I'll use it as soon as I can get my system upgraded (I got 95% of it compiled with build errors disabled, but there are still nasty remaining compatibility issues with my version of the Clang libraries, relating to enable_if templates/SFINAE stuff in header files, preventing it from completing, and I can't fix those).
    Everyone I talk to about the P2 and what it can do is amazed and really wants it in their projects, but immediately drops the idea because of a lack of toolchain support in favor of ATSAM, AVR, or Infineon's XMC chips.
    As a C developer it's very easy to believe this.
    Wuerfel_21 wrote: »
    Another interesting thing: What about overlay or XMM support?

    If I'm not mistaken, running code from something fast like HyperRAM would be in the same performance range as LMM on P1 (~18 cycles per instruction) even with no loops and a fairly pessimistic estimate of memory speed (and that code won't be requested ahead of time). (And obviously the performance is the same as native in hot loops)

    Do any of the current P2 compilers support overlays or XMM?
    Yes, reading individual instructions from HyperRAM or HyperFlash could work, but we are probably talking about 1-2 MIPs, so just similar to a 20 MIPs P1 with LMM as you say, even though the P2 is way more capable and can get clocked ~8x faster in terms of MIPs. For better performance on a P2, ideally we could use something slightly different this time vs LMM/XMM. We sort of need a cache or overlay area we can execute hub-exec code from. Given the P2 supports relocatable code, it could be amenable to running executable code from different HUB addresses, which is helpful here for large programs. Data sections would probably need to be kept at fixed addresses, but with 512kB and HyperRAM for even larger areas that is less of a problem.

    Another model might be to try to read in smaller fragments of code directly into LUT RAM from HyperRAM using setq2 bursting and run from there at higher speed. I don't know how well that would perform in the general case.

    In both cases branching obviously needs to be handled very differently to potentially trigger bringing in new code as required. All of this stuff would of course need tool support which is not there yet on P2 unless we can hack p2gcc's original support for LMM somehow.
  • rogloh wrote: »
    Yes, reading individual instructions from HyperRAM or HyperFlash could work, but we are probably talking about 1-2 MIPs, so just similar to a 20 MIPs P1 with LMM as you say, even though the P2 is way more capable and can get clocked ~8x faster in terms of MIPs. For better performance on a P2, ideally we could use something slightly different this time vs LMM/XMM. We sort of need a cache or overlay area we can execute hub-exec code from. Given the P2 supports relocatable code, it could be amenable to running executable code from different HUB addresses, which is helpful here for large programs. Data sections would probably need to be kept at fixed addresses, but with 512kB and HyperRAM for even larger areas that is less of a problem.
    Yes, that's what I was thinking. Bigger chunks of code that are loaded into memory on demand. Anything GCC-based should be able to handle compiling overlays already, but you need to handle loading them yourself. This also means you can't put stuff like virtual functions in there (it could be worked around by implementing your own vtable: parsing the headers and generating the tables and some inline functions to call through it). I don't think rodata sections used by the overlays need a fixed address; LOC is there for a reason :)
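    A very rough sketch of that on-demand overlay idea (the ext_read driver, addresses and sizes here are made up for illustration, not from any existing P2 library):
    #include <stdint.h>

    /* Assumed external-memory read routine (e.g. a HyperRAM driver call). */
    extern void ext_read(uint32_t ext_addr, void *hub_dst, uint32_t len);

    #define OVERLAY_BASE ((void *)0x60000)   /* fixed hub buffer, address illustrative */
    #define OVERLAY_SIZE 8192

    typedef int (*overlay_entry_t)(int);

    static uint32_t resident_overlay = 0xFFFFFFFFu;

    /* Load an overlay from external memory (if not already resident) and
       call its entry point; real tool support would generate these stubs. */
    static int call_overlay(uint32_t ext_addr, int arg)
    {
        if (resident_overlay != ext_addr) {
            ext_read(ext_addr, OVERLAY_BASE, OVERLAY_SIZE);
            resident_overlay = ext_addr;
        }
        return ((overlay_entry_t)OVERLAY_BASE)(arg);
    }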
  • RossH Posts: 5,462
    Wuerfel_21 wrote: »
    Another interesting thing: What about overlay or XMM support?

    If I'm not mistaken, running code from something fast like HyperRAM would be in the same performance range as LMM on P1 (~18 cycles per instruction) even with no loops and a fairly pessimistic estimate of memory speed (and that code won't be requested ahead of time). (And obviously the performance is the same as native in hot loops)

    Do any of the current P2 compilers support overlays or XMM?

    Catalina has overlays. I am still waiting to see if XMM is going to be necessary on the P2.

    It is worth pointing out that (except as a cool novelty) XMM didn't get much use on the P1, and if we had had overlays and multi-memory model support on the P1 (which Catalina now does on both the P1 and P2) then XMM would very likely never have seen any use at all.
  • RossH Posts: 5,462
    rogloh wrote: »
    Fastspin and Catalina are unfortunately going to come up short for developers, due to their own limitations.

    I don't think this is true of either of them. However, I don't feel obliged to argue the point, and it is probably fair to say that most users just want a gcc-based C compiler. This may be because they think that porting gcc will magically bring all the "other" benefits (other gcc-based languages and applications - even Linux!). But the reality is that most of that will remain just a dream on the Propeller.
  • RossH wrote: »
    Wuerfel_21 wrote: »
    Another interesting thing: What about overlay or XMM support?

    If I'm not mistaken, running code from something fast like HyperRAM would be in the same performance range as LMM on P1 (~18 cycles per instruction) even with no loops and a fairly pessimistic estimate of memory speed (and that code won't be requested ahead of time). (And obviously the performance is the same as native in hot loops)

    Do any of the current P2 compilers support overlays or XMM?

    Catalina has overlays. I am still waiting to see if XMM is going to be necessary on the P2.

    It is worth pointing out that (except as a cool novelty) XMM didn't get much use on the P1, and if we had had overlays and multi-memory model support on the P1 (which Catalina now does on both the P1 and P2) then XMM would very likely never have seen any use at all.
    Does Catalina for P2 support CMM?

  • ersmith Posts: 6,053
    edited 2020-08-09 12:57
    Does Catalina for P2 support CMM?

    It does, but the performance/size tradeoff is much worse on P2 than on P1. On P2 COMPACT mode (Catalina's CMM) is about 10x slower than NATIVE, while offering only a modest size saving, at least on the benchmarks I tried it on. On the other hand at least Catalina does offer the option of code compression. For some size constrained applications Catalina may be a good choice.
  • Wuerfel_21 wrote: »
    Do any of the current P2 compilers support overlays or XMM?

    It would be relatively easy to add XMM support to riscvp2, since it's already designed around an instruction cache. It's on my to-do list, but I'm not sure how much demand there would be. As Ross pointed out, XMM wasn't widely used on the P1, even though P1 had very constrained memory.
  • It is worth pointing out that (except as a cool novelty) XMM didn't get much use on the P1, and if we had had overlays and multi-memory model support on the P1 (which Catalina now does on both the P1 and P2) then XMM would very likely never have seen any use at all.
    Since I use SimpleIDE, the way I remember it, SimpleIDE had functional XMM for a while.

    Just as I was starting to use it and getting more familiar with what could be done with XMM, Parallax disabled XMM in SimpleIDE. Not sure why they decided to do that. But that would not be the last drastic move they made with SimpleIDE. It seems that if they had kept it alive, more people would have started to make use of it.

    Since the P2 has an accessory board with extra memory, I wonder if XMM could make some functional use of that memory. So, instead of implementing CMM just on the 512KB of hub RAM, what could XMM do with some accessory board memory?

    Ray
  • David Betz Posts: 14,516
    edited 2020-08-09 17:56
    Rsadeika wrote: »
    Since I use SimpleIDE, the way I remember it, SimpleIDE had functional XMM for a while.

    Just as I was starting to use it and getting more familiar with what could be done with XMM, Parallax disabled XMM in SimpleIDE. Not sure why they decided to do that.
    I think one big problem is that Parallax didn't have any boards that supported XMM except the C3 and that board was discontinued fairly early on. You could use XMM with the EEPROM on any Propeller board but the EEPROM is so small that it didn't help much.

  • I thought that you could use XMM with the microSD card, if I am not mistaken.

    Ray
  • RossH Posts: 5,462
    ersmith wrote: »
    Does Catalina for P2 support CMM?

    It does, but the performance/size tradeoff is much worse on P2 than on P1. On P2 COMPACT mode (Catalina's CMM) is about 10x slower than NATIVE, while offering only a modest size saving, at least on the benchmarks I tried it on. On the other hand at least Catalina does offer the option of code compression. For some size constrained applications Catalina may be a good choice.

    The code compression is about 50%, but I accept that the performance is not great. It is pretty much a straight port of the P1 code to the P2, and has not been optimized yet to take advantage of the P2 features.

    Since we are not as "squeezed" for RAM on the P2 as we were on the P1, optimizing CMM is on my "to do" list, but not very high!
  • RossH Posts: 5,462
    David Betz wrote: »
    Rsadeika wrote: »
    Since I use SimpleIDE, the way I remember it, SimpleIDE had functional XMM for a while.

    Just as I was starting to use it and getting more familiar with what could be done with XMM, Parallax disabled XMM in SimpleIDE. Not sure why they decided to do that.
    I think one big problem is that Parallax didn't have any boards that supported XMM except the C3 and that board was discontinued fairly early on. You could use XMM with the EEPROM on any Propeller board but the EEPROM is so small that it didn't help much.

    Yes, you can certainly use it with EEPROM, and most boards had at least 64k. I think it was the performance of XMM on the P1 that led Parallax to drop it - it was so slow it made the chip itself look bad.

  • Rsadeika wrote: »
    I thought that you could use XMM with the microSD card, if I am not mistaken.

    Ray
    Yes, you can. It is extremely slow, far slower than using a SPI flash chip which is what we intended. The C3 was the original development platform for PropGCC and it had a SPI flash chip as well as a SPI SRAM chip. XMM worked pretty well on that board. MicroSD support even worked on the C3 but, as I said, it was much slower because of needing to do full sector reads.

  • RossH wrote: »
    David Betz wrote: »
    Rsadeika wrote: »
    Since I use SimpleIDE, the way I remember it, SimpleIDE had functional XMM for a while.

    Just as I was starting to use it and getting more familiar with what could be done with XMM, Parallax disabled XMM in SimpleIDE. Not sure why they decided to do that.
    I think one big problem is that Parallax didn't have any boards that supported XMM except the C3 and that board was discontinued fairly early on. You could use XMM with the EEPROM on any Propeller board but the EEPROM is so small that it didn't help much.

    Yes, you can certainly use it with EEPROM, and most boards had at least 64k. I think it was the performance of XMM on the P1 that led Parallax to drop it - it was so slow it made the chip itself look bad.
    Yes, you can use it with EEPROM but there wasn't much of a code space advantage using XMM in EEPROM vs. CMM in hub RAM and it was slower. Actually, there was another non-Parallax board that supported XMM: the DNA board. I don't think it is available anymore either though.

  • RossH wrote: »
    I think it was the performance of XMM on the P1 that led Parallax to drop it - it was so slow it made the chip itself look bad.
    I suppose so. A compressed instruction set is one feature that didn't make it into P2 that would have made a big difference. The code density of PASM code isn't that good so even 512k runs out quickly for people who want to do graphics applications. I suspect it's fine for anything I intend to do though and I think fastspin native code generation should be quite adequate.

  • RossH Posts: 5,462
    edited 2020-08-17 07:18
    RossH wrote: »
    ersmith wrote: »
    @RossH : I've been looking at compiler output in some detail as part of doing these benchmarks, and I think there's some low hanging fruit for further optimizing Catalina:

    (2) As mentioned on another thread, pushing/popping multiple registers is cheaper with setq / rdlong patterns. In fact given the overhead of calling the current push_m / pop_m functions, you may find a saving by using these and saving/restoring all of the eligible registers all the time. Individual rdlong and wrlong are expensive.
    There is a problem here. I will have to try and recall the details, but I found you can't do things the obvious way if you want to implement interrupts. I decided in the end that I would rather have Catalina completely functional on the P2, and sacrifice efficiency. But again, I might revisit it. I originally had two mechanisms for saving and loading registers - the slow one that worked with interrupts, and the fast one that didn't. But then I found you can't always tell in advance which code will need interrupts, and it all became too difficult to continue to support both, so I have (for the moment) stuck with the slow but safe way.

    EDIT: I recall now - it was not SETQ that you can't use in an interrupt routine - it is the SKIP operations. I should indeed rewrite the push_m and pop_m functions to use SETQ!

    Hi @ersmith

    Thought you might be interested in an update on this. I have now tried three ways of saving/loading multiple registers. No devastating improvements, but the results for the fft_bench.c benchmark are interesting:
    1. Using individual rdlong/wrlong, but only on registers that are used: 24631 us. This is the method Catalina currently uses.
    2. Using instruction skipping, but only on registers that are used: 23489 us
    3. Using "Fast Block Moves", but saving all registers every time: 23179 us

    So, not as much difference as you might have expected. The reason is that while using individual rdlong/wrlong instructions is much slower, Catalina generally only has to save a few registers in each function call, not all of them. And the additional stack space the program needs to save/restore all registers on each function call is potentially quite horrendous - it may not be worth the modest 5% performance improvement.

    For that reason, the instruction skipping method is interesting - it is nearly as fast as using fast block moves, but it doesn't incur the additional stack usage. The main problem with this method is that it cannot be used in a program that uses interrupts.

    In the next release of Catalina, I will probably add a couple of new Catalina symbols that you can define on the command line:
    1. A FAST_SAVE_RESTORE symbol, which enables fast block moves which will speed things up but at the expense of increased stack space, and
    2. A NO_INTERRUPTS symbol, which enables optimizations like instruction skipping which will speed things up but which cannot be used in programs that use interrupts.

    I expect more significant savings may come from simply disabling register passing of parameters. This made good sense on the P1 where stack operations were expensive, but is possibly only wasting time and code space on the P2. However, that will take longer to implement.

    Ross.
  • @RossH
    If you can allocate your registers to be saved in increasing order within the functions needing them, then you might be able to get the best of both worlds, with setq burst transfers of only the registers that have been allocated and need to be preserved, not the worst-case full set each time a function is called. I believe this is what p2gcc does; it had different prologs and epilogs generated depending on how many registers were used. I'm not sure if your LCC compiler's register allocation model would do that or not. If it could, it might be worth a shot. This 5% speedup is certainly useful if there are no downsides, and the boost might be even higher if only the smaller subset of registers gets saved.

    Are you sure you can't use interrupts with skipping? I read this from the P2 manual, which seems to indicate it might be possible at least in PASM ISRs, but I guess it may still have problems if you try to start a new skip sequence in C coded ISRs. Not sure there.
    "Within a skipping sequence, a CALL/CALLPA/CALLPB that is not skipped will execute all its nested subroutines normally, with the skipping sequence resuming after the returning RET/_RET_. This allows subroutines to be skipped or entirely executed without affecting the top-level skip sequence. As well, an interrupt service routine will execute normally during a skipping sequence, with the skipping sequence resuming upon its completion."
    
  • RossH Posts: 5,462
    rogloh wrote: »
    @RossH
    If you can allocate your registers to be saved in increasing order within the functions needing them, then you might be able to get the best of both worlds, with setq burst transfers of only the registers that have been allocated and need to be preserved, not the worst-case full set each time a function is called. I believe this is what p2gcc does; it had different prologs and epilogs generated depending on how many registers were used. I'm not sure if your LCC compiler's register allocation model would do that or not. If it could, it might be worth a shot. This 5% speedup is certainly useful if there are no downsides, and the boost might be even higher if only the smaller subset of registers gets saved.
    I thought of a similar solution - i.e. dividing the set of 24 registers into - say - 4 groups of 6 registers each, and saving the group of registers using fast block moves if any part of the group is used. Most functions would only use one or perhaps two groups of registers, so there is quite a potential saving there. I may try something like that when I get more time.
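    Just to make the grouping idea concrete, here is a trivial sketch (group size and count as described above; the bitmask representation is only an assumption about how a compiler might track used registers):
    #include <stdint.h>

    #define NUM_REGS   24
    #define GROUP_SIZE 6

    /* Return a bitmask of which 6-register groups contain at least one used
       register; each selected group would then be saved with one SETQ burst. */
    static unsigned groups_to_save(uint32_t used_regs_mask)
    {
        unsigned groups = 0;
        for (int g = 0; g < NUM_REGS / GROUP_SIZE; g++) {
            uint32_t group_mask = ((1u << GROUP_SIZE) - 1u) << (g * GROUP_SIZE);
            if (used_regs_mask & group_mask)
                groups |= 1u << g;
        }
        return groups;
    }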
    Are you sure you can't use interrupts with skipping? I read this from the P2 manual, which seems to indicate it might be possible at least in PASM ISRs, but I guess it may still have problems if you try to start a new skip sequence in C coded ISRs. Not sure there.
    "Within a skipping sequence, a CALL/CALLPA/CALLPB that is not skipped will execute all its nested subroutines normally, with the skipping sequence resuming after the returning RET/_RET_. This allows subroutines to be skipped or entirely executed without affecting the top-level skip sequence. As well, an interrupt service routine will execute normally during a skipping sequence, with the skipping sequence resuming upon its completion."
    

    That says you can interrupt a skip, which is not the same thing as using skip in an interrupt. I'm looking at the top of page 18 in the Parallax Google Doc:
    Skipping only works outside of interrupt service routines; i.e. in main code.

    Catalina allows arbitrary C functions as interrupt service routines. I believe it may be unique in doing so. So, unless I am reading that wrong, I can't use skip instructions in C code without an awful lot of messing about to detect whether the function is used as an interrupt routine or not. And if it is a library function, you may not know this at compile time.

  • I thought of a similar solution - i.e. dividing the set of 24 registers into - say - 4 groups of 6 registers each, and saving the group of registers using fast block moves if any part of the group is used. Most functions would only use one or perhaps two groups of registers, so there is quite a potential saving there. I may try something like that when I get more time.
    Exactly. This is how gcc does it on the AVR. Registers are carved out and get used for different purposes.
    https://www.nongnu.org/avr-libc/user-manual/FAQ.html#faq_reg_usage