Catalina - ANSI C and Lua for the Propeller 1 & 2 - Page 20 — Parallax Forums

Catalina - ANSI C and Lua for the Propeller 1 & 2


Comments

  • evanhevanh Posts: 16,583

    Good stuff. I get the impression that 3.5 minutes is a substantial speed up?

  • roglohrogloh Posts: 5,924

    @RossH said:
    Small benchmark programs are fine, but as a real-world test to make sure the HyperRAM is now working 100% on my P2 Eval board, I recompiled Catalina itself to run on the P2 Eval at 300 MHz. Catalina uses the lower part of the HyperRAM as code and data space, and the upper part as an SD card sector cache, so it really does exercise nearly all of it.

    In regular P2 setups, 300 MHz was typically around where HyperRAM used to max out, IIRC. It gets difficult to go much faster without resorting to special efforts like optimizing your board layout for skew minimization or using special cooling. You can push it higher at certain frequencies, but the gaps start to really widen.

    Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte rate reads. That mode certainly used to contain timing gaps over the P2 operating frequency range, which would show up to the user as unreliable reads at some clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.

  • I've had the P2EVAL HyperRAM working at my 340MHz-adjacent overclocks, 300 flat should certainly be fine (in /2 mode)

  • RossHRossH Posts: 5,622

    @evanh said:
    Good stuff. I get the impression that 3.5 minutes is a substantial speed up?

    It's a combination of various improvements added in the recent release. In total, self-hosted Catalina is now about twice the speed it used to be. Of course, the biggest chunk of that comes simply from increasing the clock speed to 300 MHz.

  • RossHRossH Posts: 5,622

    @rogloh said:

    Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte rate reads. That mode certainly used to contain timing gaps over the P2 operating frequency range, which would show up to the user as unreliable reads at some clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.

    Yes, I didn't realize it was experimental. It works much more reliably on the P2 Edge, but I might disable it there as well. Disabling it seems to make very little overall difference - perhaps because I use a Hub RAM cache, which means slowing down the PSRAM does not slow down the overall performance by as much.

    Seems I just got lucky, which is very unusual for me! :)

  • roglohrogloh Posts: 5,924
    edited 2025-07-07 04:31

    Depending on your cache row size, HyperRAM+driver latency is still probably very dominant even at half transfer speed.

    There is no need to reduce the rate on the PSRAM for the P2 Edge. Being twice as wide as the HyperRAM breakout, the 16-bit-wide PSRAM already clocks data at half the speed to achieve Sysclk (byte) transfer rates. HyperRAM uses a DDR clock, but being 8 bits wide it still needs to transfer data at 2x the speed for the same bandwidth that the PSRAM can achieve. That is what stresses the timing.

  • evanhevanh Posts: 16,583
    edited 2025-07-07 06:20

    To clarify what Roger just said, the QPI mode of the 8-pin PSRAM chips is Single Data Rate, so it can't go faster than a sysclock/2 transfer rate. There is no experimental fast read mode for the on-board P2 Edge EC32 PSRAMs.

  • RossHRossH Posts: 5,622

    Ok, that's clear. Thanks @rogloh and @evanh.

  • RossHRossH Posts: 5,622
    edited 2025-07-09 02:46

    Been doing some more benchmarking of the self-hosted version of Catalina, running at 300 MHz on the P2 Eval board with the HyperRAM add-on board.

    It seems quite consistent at compiling about 100 SLOC per minute:

    • othello (~500 SLOC) - 5 mins => 100 SLOC/min
    • startrek (~2200 SLOC) - 22 mins => 100 SLOC/min
    • chimaera (~5500 SLOC) - 66 mins => 83 SLOC/min
    • dbasic (~13500 SLOC) - 130 mins => 104 SLOC/min
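    The rates in the list above are just SLOC divided by minutes. A minimal sketch of that arithmetic follows; the function name `sloc_per_min` is hypothetical and is not part of Catalina.

```c
#include <assert.h>

/* Compile throughput in source lines per minute, truncated to a whole
 * number. Hypothetical helper for illustration only. */
static unsigned sloc_per_min(unsigned sloc, unsigned minutes)
{
    assert(minutes > 0);          /* avoid division by zero */
    return sloc / minutes;
}
```

    With the figures quoted above, `sloc_per_min(5500, 66)` gives 83 and `sloc_per_min(13500, 130)` gives 103 (about 103.8 before truncation, which the list rounds to 104).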

    Dumbo Basic (dbasic) is currently the largest Catalina demo, and is fairly easy to compile on the P2 in a single command because it consists of only 3 source files (however, see the note below). In total SLOC it is actually about the same size as Catalina itself, but compiling that on the P2 would be a much more cumbersome process because Catalina consists of about half a dozen separate components and nearly 100 separate source files. So while I can definitely say that self-compiling Catalina on a Propeller 2 is feasible, I doubt anyone will ever be crazy enough to actually do it - not even me! :)

    Note: While compiling dbasic, I discovered a couple of issues with the self-hosted version of Catalina which mean dbasic won't compile with release 8.7. I have fixed the issues and it will compile with the next release. But since even at 300 MHz it takes around 2 hours to compile, I don't think I need to be in a hurry - I doubt anyone besides me is going to bother doing it! :)

    EDIT: Changes to fix the compilation of dbasic using the Propeller 2 self-hosted version of Catalina have been pushed to GitHub. SourceForge users will have to wait for the next release.

    Ross.

  • evanhevanh Posts: 16,583

    Are the files on an SD card? Would the filesystem performance be a factor?

  • RossHRossH Posts: 5,622

    @evanh said:
    Are the files on an SD card? Would the filesystem performance be a factor?

    Yes, all files are on the SD card, so the speed of the card is a factor. However, I use a write-back sector cache in PSRAM with the cache flushed at the end of each compilation step, so it is not as significant as you might expect.

    Ross.
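    To illustrate the general idea of a write-back sector cache that is flushed at the end of a step, here is a minimal sketch. It is not Catalina's implementation: the direct-mapped layout, the sizes, the in-memory `card` array standing in for the SD card, and every function name are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  512
#define CARD_SECTORS 64   /* size of the simulated card (assumption) */
#define CACHE_SLOTS  8    /* direct-mapped cache slots (assumption) */

static uint8_t  card[CARD_SECTORS][SECTOR_SIZE]; /* stand-in for the SD card */
static unsigned card_reads, card_writes;         /* physical transfers issued */

struct slot {
    uint32_t sector;
    int      valid, dirty;
    uint8_t  data[SECTOR_SIZE];
};
static struct slot cache[CACHE_SLOTS];

static void card_read(uint32_t s, uint8_t *buf)
{
    memcpy(buf, card[s], SECTOR_SIZE);
    card_reads++;
}

static void card_write(uint32_t s, const uint8_t *buf)
{
    memcpy(card[s], buf, SECTOR_SIZE);
    card_writes++;
}

/* Return the slot for sector s. On a miss, write back a dirty occupant
 * first, then (if load is nonzero) fetch the sector from the card. */
static struct slot *get_slot(uint32_t s, int load)
{
    struct slot *sl = &cache[s % CACHE_SLOTS];
    assert(s < CARD_SECTORS);
    if (sl->valid && sl->sector == s)
        return sl;                            /* hit */
    if (sl->valid && sl->dirty)
        card_write(sl->sector, sl->data);     /* evict: write back */
    if (load)
        card_read(s, sl->data);
    sl->sector = s;
    sl->valid  = 1;
    sl->dirty  = 0;
    return sl;
}

static void cache_read(uint32_t s, uint8_t *buf)
{
    memcpy(buf, get_slot(s, 1)->data, SECTOR_SIZE);
}

/* Whole-sector write: only the cached copy changes until a flush or eviction. */
static void cache_write(uint32_t s, const uint8_t *buf)
{
    struct slot *sl = get_slot(s, 0);
    memcpy(sl->data, buf, SECTOR_SIZE);
    sl->dirty = 1;
}

/* Push every dirty sector to the card, e.g. at the end of a compilation step. */
static void cache_flush(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].valid && cache[i].dirty) {
            card_write(cache[i].sector, cache[i].data);
            cache[i].dirty = 0;
        }
    }
}
```

    In this sketch, repeated writes to the same sector (as happens with FAT and directory updates) cost one physical write at flush time instead of one per update, which is why the raw card speed matters less than it otherwise would.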

  • evanhevanh Posts: 16,583
    edited 2025-07-09 08:44

    Following on from the read/write speed tester exercise (https://forums.parallax.com/discussion/comment/1565955/#Comment_1565955), I've now ported the file access tester as well:

    $ catalina -f200M -p2 -lcx -lmc file-accesstest.c
    Catalina Compiler 8.7
    
    code = 48604 bytes
    cnst = 1228 bytes
    init = 648 bytes
    data = 828 bytes
    file = 61472 bytes
    

    Sandisk Extreme 32 GB (2017)

       clkfreq = 200000000   clkmode = 0x10009f7
    
       File size = 512 bytes
     Delete 200 files ... Duration 6909 ms, 28.95 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ... Duration 9137 ms, 5.472 files/s
     Verify 50 files ... Duration 271 ms, 184.7 files/s
     Delete 50 files ... Duration 3283 ms, 15.23 files/s
    
     10 sec pause for card recovery ..........
     Create 200 files ... Duration 39171 ms, 5.106 files/s
     Verify 200 files ... Duration 2172 ms, 92.07 files/s
     Delete 200 files ... Duration 14141 ms, 14.14 files/s
    
  • evanhevanh Posts: 16,583
    edited 2025-07-09 08:36

    Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:

    Catalina using built-in 1-bit SPI mode driver:

       clkfreq = 300000000   clkmode = 0x1000ef7
    
       File size = 512 bytes
     Delete 200 files ... Duration 6054 ms, 33.04 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ... Duration 8622 ms, 5.799 files/s
     Verify 50 files ... Duration 214 ms, 233.6 files/s
     Delete 50 files ... Duration 3258 ms, 15.35 files/s
    
     10 sec pause for card recovery ..........
     Create 200 files ... Duration 37026 ms, 5.402 files/s
     Verify 200 files ... Duration 1798 ms, 111.3 files/s
     Delete 200 files ... Duration 13904 ms, 14.38 files/s
    

    Flexspin using plug-in 1-bit SPI mode driver (Copy of the built-in driver that I don't have to mount differently):

       clkfreq = 300000000   clkmode = 0x1000efb
    SDHC 25 MHz selected
    SPI clock ratio = sysclock/12
     /sd1 handle = ed8c ... mounted
    
       File size = 512 bytes
     Delete 200 files ... Duration 2394 ms, 83.55 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ... Duration 957 ms, 52.24 files/s
     Verify 50 files ... Duration 57 ms, 882.3 files/s
     Delete 50 files ... Duration 241 ms, 207.7 files/s
    
     10 sec pause for card recovery ..........
     Create 200 files ... Duration 4123 ms, 48.51 files/s
     Verify 200 files ... Duration 634 ms, 315.5 files/s
     Delete 200 files ... Duration 1608 ms, 124.4 files/s
    

    Flexspin using plug-in 4-bit SD mode driver (With lazy CMD12 feature):

       clkfreq = 300000000   clkmode = 0x1000efb
     /sd1 handle = 118f8 ...  SETDIV  SD clock-divider set to sysclock/3 (100.0 MHz)
    
       File size = 512 bytes
     Delete 200 files ... Duration 517 ms, 387.2 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ... Duration 434 ms, 115.2 files/s
     Verify 50 files ... Duration 40 ms, 1262 files/s
     Delete 50 files ... Duration 195 ms, 256.7 files/s
    
     10 sec pause for card recovery ..........
     Create 200 files ... Duration 1755 ms, 113.9 files/s
     Verify 200 files ... Duration 278 ms, 718.3 files/s
     Delete 200 files ... Duration 1264 ms, 158.2 files/s
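    The files/s figures in these listings follow from the file count and the duration in milliseconds, and the same numbers give the ratio between drivers. A small sketch of both calculations follows; the function names are hypothetical and are not taken from the test program.

```c
#include <assert.h>

/* Throughput in files per second from a file count and a duration in ms. */
static double files_per_sec(unsigned files, unsigned duration_ms)
{
    assert(duration_ms > 0);
    return files * 1000.0 / duration_ms;
}

/* How many times faster "other" is than "baseline" for the same workload. */
static double speedup(unsigned baseline_ms, unsigned other_ms)
{
    assert(other_ms > 0);
    return (double)baseline_ms / other_ms;
}
```

    Taking the "Create 200 files" durations at 300 MHz from the listings above (37026 ms for Catalina, 4123 ms for the Flexspin SPI driver, 1755 ms for the Flexspin 4-bit SD driver), `speedup()` gives roughly 9x and 21x respectively.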
    
  • evanhevanh Posts: 16,583
    edited 2025-07-09 08:47

    PS: I've found one SD card that seems not to be compatible with Catalina's driver. It creates the first file on the card but then returns from fopen() without a file handle. Consistently. The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.

     CID decode:  ManID=03   OEMID=SD  Name=SN64G
      Ver=8.0   Serial=8ab989e1   Date=2021-2
    

    eg:

       clkfreq = 200000000   clkmode = 0x10009f7
    
       File size = 512 bytes
     Delete 200 files ... Duration 7177 ms, 27.87 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ...  fopen() copy049.bin for writing failed!   errno = 0: Error 0
    
    
  • RossHRossH Posts: 5,622

    @evanh said:
    The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.

    No idea what TRIMming is, but I have a 32 GB Sandisk Extreme that works fine.

    Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.

  • @evanh said:
    Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:

    Catalina using built-in 1-bit SPI mode driver:
    [...]

    Flexspin using plug-in 1-bit SPI mode driver (Copy of the built-in driver that I don't have to mount differently):
    [...]

    Wow, that's a whole order of magnitude difference in write speed there.

  • RossHRossH Posts: 5,622
    edited 2025-07-09 13:32

    @evanh said:
    Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:

    Changing sysclock speed won't make SD access much faster, since Catalina's SD card driver will simply adjust the SD timings to compensate for the clock speed.

    A bigger difference might be due to the file system code used. Catalina uses DOSFS. Do you know what file system code flexspin uses?

    I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.

    Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.

    Here is the Catalina command I used to do that (this is for the P2 Evaluation board with the HyperRAM add-on, by default on pins 32-48):

    catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT

    Note the addition of -C NO_LUT - this is required because your test program uses the LUT itself, so we have to tell Catalina it cannot do so. I then used the build_utilities command to build a suitable XMM loader - just enter the command and follow the prompts. Then load the program using payload, specifying it should use the XMM loader just built:

    payload -i -q1 XMM file-accesstest.bin

    Please do whatever the equivalent is for flexspin, and post those times.

    Ross.

  • @RossH said:
    A bigger difference might be due to the file system code used. Catalina uses DOSFS. Do you know what file system code flexspin uses?

    Flexspin uses FatFs - Generic FAT Filesystem Module R0.14b, with some choice modifications, along the lines of

    static DWORD ld_dword (const BYTE* ptr) /* Load a 4-byte little-endian word */
    {
        #ifdef __propeller2__
        return *((DWORD*)ptr);
        #else
        DWORD rv;
    
        rv = ptr[3];
        rv = rv << 8 | ptr[2];
        rv = rv << 8 | ptr[1];
        rv = rv << 8 | ptr[0];
        return rv;
        #endif
    }
    

    (because we know that on flexspin/P2, you can just cast an arbitrarily aligned char* to long* and read the correct value from that)
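    For comparison, on targets where an unaligned pointer cast is not safe, the same load is usually written either byte by byte or with `memcpy`. This is a generic sketch and not code from FatFs or flexspin; the function names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte-by-byte load: correct for any alignment and any host byte order. */
static uint32_t ld_dword_bytes(const uint8_t *ptr)
{
    return (uint32_t)ptr[0]
         | (uint32_t)ptr[1] << 8
         | (uint32_t)ptr[2] << 16
         | (uint32_t)ptr[3] << 24;
}

/* memcpy load: correct for any alignment, but it only yields the
 * little-endian value when the host itself is little-endian. */
static uint32_t ld_dword_memcpy(const uint8_t *ptr)
{
    uint32_t v;
    assert(ptr != NULL);
    memcpy(&v, ptr, sizeof v);
    return v;
}
```

    Compilers commonly reduce the `memcpy` form to a single load where the hardware allows it, which gets close to the direct cast without relying on target-specific behaviour.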

  • evanhevanh Posts: 16,583
    edited 2025-07-09 18:58

    @RossH said:

    @evanh said:
    The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.

    No idea what TRIMming is, but I have a 32 GB Sandisk Extreme that works fine.

    Yip, my older 32GB Sandisk Extreme works perfectly with Catalina. And I've tested a 64GB Adata, 64 GB Kingston and a 128 GB Samsung. All work fine. So no issue with capacity or cluster sizes or the likes.

    TRIMming is a block discard mechanism where SSDs can be informed of what to erase in bulk as it becomes unneeded rather than just when overwrites occur. I don't see it being utilised by the Prop2 as it adds more code and overheads to both the driver and filesystem layers. The point being that that particular card has new features that none of my other cards have.

    Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.

    All cards have been formatted multiple times with Gparted on Linux long before ever being used with the Prop2. I use no other tool. And yes, I formatted the 64 GB Sandisk again to ensure that the card was clean.

  • evanhevanh Posts: 16,583
    edited 2025-07-09 20:19

    @RossH said:

    @evanh said:
    Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:

    I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.

    Shouldn't that be at the driver level, below the filesystem? BTW, that's pretty much what the lazy CMD12 achieves. It's not an actual cache but achieves the effect of one by eliminating many of the inter-transactional latencies.

    Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.

    I compared apples with apples. Yes, it was to compare filesystem performance, not compiling nor external RAM.

    PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.

  • evanhevanh Posts: 16,583
    edited 2025-07-09 20:10

    @RossH said:
    catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT

    payload -i -q1 XMM file-accesstest.bin

    I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.

    There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.

    I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.

  • RossHRossH Posts: 5,622

    @evanh said:

    Shouldn't that be at the driver level, below the filesystem?

    It sits between the low level SD driver and DOSFS, and so I have to recompile stdio to enable it. Will do this when I get time.

    PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.

    I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui). In any case, there's not much point in comparisons since it will not be able to run any of the larger C programs. Also, I expect that when I enable caching, the SD card performance difference will not be much of an issue. But if it is, I'll have a look at your low level SD driver and turn it into a Catalina plugin that can be selected at compile time. This is usually pretty straightforward.

  • RossHRossH Posts: 5,622

    @evanh said:

    @RossH said:
    catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT

    payload -i -q1 XMM file-accesstest.bin

    I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.

    Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.

    There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.

    Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?

    I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.

    Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).

  • evanhevanh Posts: 16,583
    edited 2025-07-10 10:32

    @RossH said:

    @evanh said:

    @RossH said:
    catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT

    payload -i -q1 XMM file-accesstest.bin

    I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.

    Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.

    Yep.

    There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.

    Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?

    Yep, both are 86144 bytes.

    I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.

    Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).

    We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.

    Text files use just a lone LF, but for most programs stdout emits both a CR and an LF as a pair when \n is specified, which is what's needed for a regular terminal.

  • evanhevanh Posts: 16,583
    edited 2025-07-10 10:29

    @RossH said:
    I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui).

    Flexspin is the compiler binary. It compiles all the supported languages: Basic (FlexBasic?), C (FlexC), Spin1 and Spin2.

    Flexprop is the optional GUI. I've not used it.

  • RossHRossH Posts: 5,622

    @evanh said:

    We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.

    I don't remember saying that, but I may possibly have suggested adding a VT when using the Parallax serial terminal, which is also a bit of an odd beast.

    Text files use just a lone LF, but for most programs stdout emits both a CR and an LF as a pair when \n is specified, which is what's needed for a regular terminal.

    This is optional in Catalina - you can enable it by adding -C CR_ON_LF to the compile command.
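    As a rough illustration of what "CR on LF" translation means, the sketch below inserts a CR before each LF that does not already have one. It is not how Catalina implements -C CR_ON_LF; the function `expand_lf` and its behaviour for existing CRLF pairs are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, inserting '\r' before every '\n' that is not already
 * preceded by '\r'. Returns the number of characters written, excluding
 * the terminating NUL. dst_size must be large enough for the result. */
static size_t expand_lf(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;
    char prev = '\0';

    assert(dst_size > 0);
    for (; *src != '\0'; src++) {
        if (*src == '\n' && prev != '\r') {
            assert(n + 1 < dst_size);   /* room for CR plus the NUL */
            dst[n++] = '\r';
        }
        assert(n + 1 < dst_size);       /* room for this char plus the NUL */
        dst[n++] = *src;
        prev = *src;
    }
    dst[n] = '\0';
    return n;
}
```

    A terminal that expects CRLF then sees "a\r\nb\r\n" whether the program wrote "a\nb\n" or already used CRLF.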

  • RossHRossH Posts: 5,622

    @evanh said:

    Yep, both are 86144 bytes.

    That looks correct. Is your P2 Evaluation board a rev B? I have not tested the HyperRAM on anything else.

    You may need to adjust the HYPER_DELAY_RAM (in target/p2/P2EVAL.inc) for other boards. Try 9.

  • evanhevanh Posts: 16,583

    Oh, right, yep, that fixes it, I needed a value of 11 for 300 MHz. :D And thanks for the directions.

    catalina -f300M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT -C CR_ON_LF
    Catalina Compiler 8.7
    
    code = 63460 bytes
    cnst = 1268 bytes
    init = 652 bytes
    data = 828 bytes
    file = 132256 bytes
    

    payload -p /dev/ttyUSB0 -i -q1 XMM file-accesstest.bin

       clkfreq = 300000000   clkmode = 0x1000efb
    
       File size = 512 bytes
     Delete 200 files ... Duration 42769 ms, 4.676 files/s
    
     10 sec pause for card recovery ..........
     Create 50 files ... Duration 30272 ms, 1.652 files/s
     Verify 50 files ... Duration 1761 ms, 28.40 files/s
     Delete 50 files ... Duration 4491 ms, 11.13 files/s
    
     10 sec pause for card recovery ..........
     Create 200 files ... Duration 138419 ms, 1.445 files/s
     Verify 200 files ... Duration 14490 ms, 13.80 files/s
     Delete 200 files ... Duration 24282 ms, 8.237 files/s
    

    BTW: I note the sysclock mode has changed with these extra compile options. The least significant four bits went from %0111 (no added capacitance) to %1011 (15 pF added).
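    The clkmode values printed in these listings can be unpacked with a few shifts and masks. The sketch below assumes the Propeller 2 HUBSET clock-mode layout %0000_000E_DDDD_DDMM_MMMM_MMMM_PPPP_CCSS as I understand it from the P2 documentation, and a 20 MHz crystal on the board; both should be checked against the official documents. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of a P2 clock mode word (layout is an assumption, see text). */
struct p2_clkmode {
    unsigned pll_on;    /* E:    1 = PLL enabled                    */
    unsigned xi_div;    /* D+1:  crystal input divider              */
    unsigned vco_mul;   /* M+1:  VCO multiplier                     */
    unsigned post_div;  /* PPPP: 15 -> VCO/1, else VCO/((P+1)*2)    */
    unsigned cc;        /* CC:   crystal pin mode / loading caps    */
    unsigned ss;        /* SS:   clock source select                */
};

static struct p2_clkmode decode_clkmode(uint32_t mode)
{
    struct p2_clkmode m;
    unsigned p = (mode >> 4) & 0xF;

    m.pll_on   = (mode >> 24) & 1;
    m.xi_div   = ((mode >> 18) & 0x3F) + 1;
    m.vco_mul  = ((mode >> 8) & 0x3FF) + 1;
    m.post_div = (p == 15) ? 1 : (p + 1) * 2;
    m.cc       = (mode >> 2) & 3;
    m.ss       = mode & 3;
    return m;
}

/* PLL output frequency for a given crystal frequency in Hz. */
static uint32_t pll_freq_hz(uint32_t xtal_hz, struct p2_clkmode m)
{
    assert(m.pll_on);
    return (uint32_t)((uint64_t)xtal_hz * m.vco_mul / m.xi_div / m.post_div);
}
```

    Under those assumptions, 0x1000ef7 and 0x1000efb both decode to 20 MHz x 15 = 300 MHz and differ only in the CC field (%01 versus %10), which matches the change from %0111 to %1011 noted above; 0x10009f7 decodes to 20 MHz x 10 = 200 MHz.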

  • evanhevanh Posts: 16,583
    edited 2025-07-10 15:30

    Ah, I've now disabled the experimental HYPER_FASTREAD as well. It's just more reliable at sysclock/2. The only kit I'd be happy to use sysclock/1 is the Swap (ex P2 Stamp) module that Rayman has dived headlong into. It has the HyperRAM placed tight against the Prop2 - https://forums.parallax.com/discussion/comment/1567691/#Comment_1567691

    HYPER_DELAY_RAM = 12 , for 300 MHz.

    Also, I found that, in previous experiments, leaving the clock pin as unregistered gave the widest functioning bands when selecting a delay value (I called it the "rxlag" compensation) - https://forums.parallax.com/discussion/comment/1561512/#Comment_1561512

    HYPER_UNREGCLK = 1

    Verified sysclock/1 is clean on the Swap module - https://forums.parallax.com/discussion/comment/1566267/#Comment_1566267

    EDIT: Huh, that Swap testing at sysclock/1 found registered clock was best. Oh well, so much for transferring the findings from the Eval board and SD cards. The difference was minimal anyway.

  • RossHRossH Posts: 5,622

    @evanh said:
    Ah, I've now disabled the experimental HYPER_FASTREAD as well.

    Yes, these updates for the P2 Evaluation board were going to be in the next release. I have already updated GitHub, but not SourceForge.

    Glad you got it working. I'll post the results when I have had a chance to enable the write-back cache on this program.

    Ross.
