@RossH said:
Small benchmark programs are fine, but as a real-world test to make sure the HyperRAM is now working 100% on my P2 Eval board, I recompiled Catalina itself to run on the P2 Eval at 300MHz. Catalina uses the lower part of the HyperRAM as code and data space, and the upper part as an SD card sector cache, so it really does exercise nearly all of it.
In regular P2 setups, 300MHz was typically around where HyperRAM used to max out, IIRC. It gets difficult to go much faster without resorting to special efforts like optimizing your board layout to minimize skew, or special cooling. You can push it higher at certain frequencies, but the gaps start to really widen.
Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte-rate reads, which certainly used to contain timing gaps over the P2 operating frequency range; these show up to the user as data not being read reliably at different clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.
@evanh said:
Good stuff. I get the impression that 3.5 minutes is a substantial speed up?
It's a combination of various improvements added in the recent release. In total, self-hosted Catalina is now about twice the speed it used to be. Of course, the biggest chunk of that is just from increasing the clock speed to 300MHz.
Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte-rate reads, which certainly used to contain timing gaps over the P2 operating frequency range; these show up to the user as data not being read reliably at different clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.
Yes, I didn't realize it was experimental. It works much more reliably on the P2 Edge, but I might disable it there as well. Disabling it seems to make very little overall difference - perhaps because I use a Hub RAM cache, which means slowing down the PSRAM does not slow down the overall performance by as much.
Seems I just got lucky, which is very unusual for me!
Depending on your cache row size, HyperRAM+driver latency is still probably very dominant even at half transfer speed.
There is no need to reduce the rate on the PSRAM for the P2 Edge. Being twice as wide as the HyperRAM breakout, the 16-bit PSRAM already clocks data at half the speed to achieve Sysclk (byte) transfer rates. HyperRAM uses a DDR clock, but being 8 bits wide it still needs to transfer data at 2x the speed for the same bandwidth the PSRAM can achieve. That is what stresses the timing.
To clarify what Roger just said, the QPI mode of the 8-pin PSRAM chips is Single Data Rate, so it can't go faster than a sysclock/2 transfer rate. There is no experimental fast read mode for the on-board P2Edge EC32 PSRAMs.
Been doing some more benchmarking of the self-hosted version of Catalina, running at 300MHz on the P2 Eval board with the HyperRAM add-on board.
It seems quite consistent at compiling about 100 SLOC per minute:
othello (~500 SLOC) - 5 mins => 100 SLOC/min
startrek (~2200 SLOC) - 22 mins => 100 SLOC/min
chimaera (~5500 SLOC) - 66 mins => 83 SLOC/min
dbasic (~13500 SLOC) - 130 mins => 104 SLOC/min
Dumbo Basic (dbasic) is currently the largest Catalina demo, and is fairly easy to compile on the P2 in a single command because it consists of only 3 source files (however, see the note below). In total SLOC it is actually about the same size as Catalina itself, but compiling that on the P2 would be a much more cumbersome process because Catalina consists of about half a dozen separate components and nearly 100 separate source files. So while I can definitely say that self-compiling Catalina on a Propeller 2 is feasible, I doubt anyone will ever be crazy enough to actually do it - not even me!
Note: While compiling dbasic, I discovered a couple of issues with the self-hosted version of Catalina which mean dbasic won't compile with release 8.7. I have fixed the issues and it will compile with the next release. But since even at 300MHz it takes around 2 hours to compile, I don't think I need to be in a hurry - I doubt anyone besides me is going to bother doing it!
EDIT: Changes to fix the compilation of dbasic using the Propeller 2 self-hosted version of Catalina have been pushed to GitHub. SourceForge users will have to wait for the next release.
@evanh said:
Are the files on an SD card? Would the filesystem performance be a factor?
Yes, all files are on the SD card, so the speed of the card is a factor. However, I use a write-back sector cache in PSRAM with the cache flushed at the end of each compilation step, so it is not as significant as you might expect.
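For anyone unfamiliar with the idea, a write-back sector cache can be sketched as below. This is a hypothetical minimal version (direct-mapped, with a simulated SD backend standing in for the real low-level driver), not Catalina's actual code; the point is that repeated writes to a hot sector cost only one SD write, deferred until eviction or an explicit flush:

```c
#include <stdint.h>
#include <string.h>

#define SECTOR 512
#define SLOTS   64
#define DISK_SECTORS 1024

/* Simulated SD card standing in for the real low-level driver. */
static uint8_t disk[DISK_SECTORS][SECTOR];
static int sd_write_count;

static void sd_read(uint32_t lba, uint8_t *buf)        { memcpy(buf, disk[lba], SECTOR); }
static void sd_write(uint32_t lba, const uint8_t *buf) { memcpy(disk[lba], buf, SECTOR); sd_write_count++; }

/* Direct-mapped write-back cache: one slot per (lba % SLOTS). */
typedef struct {
    uint32_t lba;
    int valid, dirty;
    uint8_t data[SECTOR];
} slot_t;

static slot_t cache[SLOTS];

static slot_t *lookup(uint32_t lba)
{
    slot_t *s = &cache[lba % SLOTS];
    if (!s->valid || s->lba != lba) {
        if (s->valid && s->dirty)
            sd_write(s->lba, s->data);   /* write back only on eviction */
        sd_read(lba, s->data);
        s->lba = lba;
        s->valid = 1;
        s->dirty = 0;
    }
    return s;
}

void cache_read(uint32_t lba, uint8_t *buf)
{
    memcpy(buf, lookup(lba)->data, SECTOR);
}

void cache_write(uint32_t lba, const uint8_t *buf)
{
    slot_t *s = lookup(lba);
    memcpy(s->data, buf, SECTOR);
    s->dirty = 1;                        /* SD write deferred */
}

/* Flush all dirty sectors, e.g. at the end of a compilation step. */
void cache_flush(void)
{
    for (int i = 0; i < SLOTS; i++)
        if (cache[i].valid && cache[i].dirty) {
            sd_write(cache[i].lba, cache[i].data);
            cache[i].dirty = 0;
        }
}
```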
PS: I've found one SD card that seems not to be compatible with Catalina's driver. It creates the first file on the card but then returns from fopen() without a file handle. Consistently. The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
@evanh said:
The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
No idea what TRIMming is, but I have a 32GB Sandisk Extreme that works fine.
Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.
@evanh said:
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
Changing sysclock speed won't make SD access much faster, since Catalina's SD card driver will simply adjust the SD timings to compensate for the clock speed.
A bigger difference might be due to the file system code used. Catalina uses DOSFS. Do you know what file system code flexspin uses?
I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.
Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.
Here is the Catalina command I used to do that (this is for the P2 Evaluation board with the HyperRAM add-on, by default on pins 32-48):
catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT
Note the addition of -C NO_LUT - this is required because your test program uses the LUT itself, so we have to tell Catalina it cannot do so. I then used the build_utilities command to build a suitable XMM loader - just enter the command and follow the prompts. Then load the program using payload, specifying it should use the XMM loader just built:
payload -i -q1 XMM file-accesstest.bin
Please do whatever the equivalent is for flexspin, and post those times.
@evanh said:
The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
No idea what TRIMming is, but I have a 32GB Sandisk Extreme that works fine.
Yip, my older 32GB Sandisk Extreme works perfectly with Catalina. And I've tested a 64GB Adata, a 64GB Kingston and a 128GB Samsung. All work fine. So no issue with capacity or cluster sizes or the like.
TRIMming is a block discard mechanism where SSDs can be informed of what to erase in bulk as it becomes unneeded rather than just when overwrites occur. I don't see it being utilised by the Prop2 as it adds more code and overheads to both the driver and filesystem layers. The point being that that particular card has new features that none of my other cards have.
Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.
All cards have been formatted multiple times with Gparted on Linux long before ever being used with the Prop2. I use no other tool. And yes, I formatted the 64 GB Sandisk again to ensure that that card was clean.
@evanh said:
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.
Shouldn't that be at the driver level, below the filesystem? BTW, that's pretty much what the lazy CMD12 achieves. It's not an actual cache but achieves the effect of one by eliminating many of the inter-transactional latencies.
Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.
I compared apples with apples. Yes, it was to compare filesystem performance, not compiling nor external RAM.
PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Shouldn't that be at the driver level, below the filesystem?
It sits between the low level SD driver and DOSFS, and so I have to recompile stdio to enable it. Will do this when I get time.
PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.
I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui). In any case, there's not much point in comparisons since it will not be able to run any of the larger C programs. Also, I expect that when I enable caching, the SD card performance difference will not be much of an issue. But if it is, I'll have a look at your low level SD driver and turn it into a Catalina plugin that can be selected at compile time. This is usually pretty straightforward.
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.
Yep.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?
Yep, both are 86144 bytes.
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).
We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.
Text files use just a lone LF, but most programs' stdout emits both a CR and LF as a pair when given \n, which is what's needed for a regular terminal.
@RossH said:
I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui).
Flexspin is the compiler binary. It compiles all the supported languages: Basic (FlexBasic?), C (FlexC), Spin1 and Spin2.
Flexprop is the optional GUI. I've not used it.
We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.
I don't remember saying that, but I may possibly have suggested adding a VT when using the Parallax serial terminal, which is also a bit of an odd beast.
Text files use just a lone LF, but most programs' stdout emits both a CR and LF as a pair when given \n, which is what's needed for a regular terminal.
This is optional in Catalina - you can enable it by adding -C CR_ON_LF to the compile command.
BTW: I note the sysclock mode has changed with these extra compile options. The lowest four bits went from %0111 (no added capacitance) to %1011 (15 pF added).
Ah, I've now disabled the experimental HYPER_FASTREAD as well. It's just more reliable at sysclock/2. The only kit I'd be happy to use sysclock/1 on is the Swap (ex P2 Stamp) module that Rayman has dived headlong into. It has the HyperRAM placed tight against the Prop2 - https://forums.parallax.com/discussion/comment/1567691/#Comment_1567691
EDIT: Huh, that Swap testing at sysclock/1 found registered clock was best. Oh well, so much for transferring the findings from the Eval board and SD cards. The difference was minimal anyway.
Comments
I've had the P2EVAL HyperRAM working at my 340MHz-adjacent overclocks, 300 flat should certainly be fine (in /2 mode)
Ok, that's clear. Thanks @rogloh and @evanh.
Following on from the read/write speed tester exercise - https://forums.parallax.com/discussion/comment/1565955/#Comment_1565955 I've now ported the file access tester as well:
Sandisk Extreme 32 GB (2017)
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
Catalina using built-in 1-bit SPI mode driver:
Flexspin using plug-in 1-bit SPI mode driver (Copy of the built-in driver that I don't have to mount differently):
Flexspin using plug-in 4-bit SD mode driver (With lazy CMD12 feature):
Wow, that's a whole order of magnitude difference in write speed there.
Flexspin uses FatFs - Generic FAT Filesystem Module R0.14b - with some choice modifications (because we know that on flexspin/P2 you can just cast an arbitrarily aligned char* to long* and read the correct value from that).
payload -p /dev/ttyUSB0 -i -q1 XMM file-accesstest.bin
That looks correct. Is your P2 Evaluation board a rev B? I have not tested the HyperRAM on anything else.
You may need to adjust the HYPER_DELAY_RAM (in target/p2/P2EVAL.inc) for other boards. Try 9.
Oh, right, yep, that fixes it, I needed a value of 11 for 300 MHz.
And thanks for the directions.
HYPER_DELAY_RAM = 12, for 300 MHz.
Also, I found that, in previous experiments, leaving the clock pin as unregistered gave the widest functioning bands when selecting a delay value (I called it the "rxlag" compensation) - https://forums.parallax.com/discussion/comment/1561512/#Comment_1561512
HYPER_UNREGCLK = 1
Verified sysclock/1 is clean on the Swap module - https://forums.parallax.com/discussion/comment/1566267/#Comment_1566267
Yes, these updates for the P2 Evaluation board were going to be in the next release. I have already updated GitHub, but not SourceForge.
Glad you got it working. I'll post the results when I have had a chance to enable the write-back cache on this program.
Ross.