@RossH said:
Small benchmark programs are fine, but as a real-world test to make sure the HyperRAM is now working 100% on my P2 Eval board, I recompiled Catalina itself to run on the P2 Eval at 300MHz. Catalina uses the lower part of the HyperRAM as code and data space, and the upper part as an SD card sector cache, so it really does exercise nearly all of it.
In regular P2 setups, 300MHz was typically around where HyperRAM used to max out, IIRC. It gets difficult to go much faster without resorting to special efforts like optimizing your board layout to minimize skew, or special cooling. You can push it higher at certain frequencies, but the gaps start to really widen.
Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte-rate reads, which certainly used to contain timing gaps over the P2 operating frequency range; these show up to the user as data not being read reliably at different clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.
@evanh said:
Good stuff. I get the impression that 3.5 minutes is a substantial speed up?
It's a combination of various improvements added in the recent release. In total, self-hosted Catalina is now about twice the speed it used to be. Of course, the biggest chunk of that is just from increasing the clock speed to 300MHz.
Also, you mentioned you had set HYPER_FASTREAD to 1 before, so that would have enabled the "experimental" Sysclk/1 byte-rate reads, which certainly used to contain timing gaps over the P2 operating frequency range; these show up to the user as data not being read reliably at different clock speeds. Choosing Sysclk/2 transfers halves the transfer rate, but gives more options to tweak the read delays in between and should help make reads more reliable.
Yes, I didn't realize it was experimental. It works much more reliably on the P2 Edge, but I might disable it there as well. Disabling it seems to make very little overall difference - perhaps because I use a Hub RAM cache, which means slowing down the PSRAM does not slow down the overall performance by as much.
Seems I just got lucky, which is very unusual for me!
Depending on your cache row size, HyperRAM+driver latency is still probably very dominant even at half transfer speed.
There is no need to reduce the rate on the PSRAM for the P2 Edge. Being twice as wide as the HyperRAM breakout, the 16-bit PSRAM already clocks data at half the speed to achieve Sysclk (byte) transfer rates. HyperRAM uses a DDR clock, but being 8 bits wide it still needs to transfer data at 2x the speed for the same bandwidth the PSRAM can achieve. That is what stresses the timing.
To clarify what Roger just said, the QPI mode of the 8-pin PSRAM chips is Single Data Rate, so it can't go faster than a sysclock/2 transfer rate. There is no experimental fast read mode for the on-board P2Edge EC32 PSRAMs.
Been doing some more benchmarking of the self-hosted version of Catalina, running at 300MHz on the P2 Eval board with the HyperRAM add-on board.
It seems quite consistent at compiling about 100 SLOC per minute:
othello (~500 SLOC) - 5 mins => 100 SLOC/min
startrek (~2200 SLOC) - 22 mins => 100 SLOC/min
chimaera (~5500 SLOC) - 66 mins => 83 SLOC/min
dbasic (~13500 SLOC) - 130 mins => 104 SLOC/min
Dumbo Basic (dbasic) is currently the largest Catalina demo, and is fairly easy to compile on the P2 in a single command because it consists of only 3 source files (however, see the note below). In total SLOC it is actually about the same size as Catalina itself, but compiling that on the P2 would be a much more cumbersome process because Catalina consists of about half a dozen separate components and nearly 100 separate source files. So while I can definitely say that self-compiling Catalina on a Propeller 2 is feasible, I doubt anyone will ever be crazy enough to actually do it - not even me!
Note: While compiling dbasic, I discovered a couple of issues with the self-hosted version of Catalina which mean dbasic won't compile with release 8.7. I have fixed the issues and it will compile with the next release. But since even at 300MHz it takes around 2 hours to compile, I don't think I need to be in a hurry - I doubt anyone besides me is going to bother doing it!
EDIT: Changes to fix the compilation of dbasic using the Propeller 2 self-hosted version of Catalina have been pushed to GitHub. SourceForge users will have to wait for the next release.
@evanh said:
Are the files on an SD card? Would the filesystem performance be a factor?
Yes, all files are on the SD card, so the speed of the card is a factor. However, I use a write-back sector cache in PSRAM with the cache flushed at the end of each compilation step, so it is not as significant as you might expect.
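For anyone unfamiliar with the idea, a write-back sector cache can be sketched as below. This is a hypothetical minimal version (direct-mapped, with a simulated SD backend standing in for the real low-level driver), not Catalina's actual code; the point is that repeated writes to a hot sector cost only one SD write, deferred until eviction or an explicit flush:

```c
#include <stdint.h>
#include <string.h>

#define SECTOR 512
#define SLOTS   64
#define DISK_SECTORS 1024

/* Simulated SD card standing in for the real low-level driver. */
static uint8_t disk[DISK_SECTORS][SECTOR];
static int sd_write_count;

static void sd_read(uint32_t lba, uint8_t *buf)        { memcpy(buf, disk[lba], SECTOR); }
static void sd_write(uint32_t lba, const uint8_t *buf) { memcpy(disk[lba], buf, SECTOR); sd_write_count++; }

/* Direct-mapped write-back cache: one slot per (lba % SLOTS). */
typedef struct {
    uint32_t lba;
    int valid, dirty;
    uint8_t data[SECTOR];
} slot_t;

static slot_t cache[SLOTS];

static slot_t *lookup(uint32_t lba)
{
    slot_t *s = &cache[lba % SLOTS];
    if (!s->valid || s->lba != lba) {
        if (s->valid && s->dirty)
            sd_write(s->lba, s->data);   /* write back only on eviction */
        sd_read(lba, s->data);
        s->lba = lba;
        s->valid = 1;
        s->dirty = 0;
    }
    return s;
}

void cache_read(uint32_t lba, uint8_t *buf)
{
    memcpy(buf, lookup(lba)->data, SECTOR);
}

void cache_write(uint32_t lba, const uint8_t *buf)
{
    slot_t *s = lookup(lba);
    memcpy(s->data, buf, SECTOR);
    s->dirty = 1;                        /* SD write deferred */
}

/* Flush all dirty sectors, e.g. at the end of a compilation step. */
void cache_flush(void)
{
    for (int i = 0; i < SLOTS; i++)
        if (cache[i].valid && cache[i].dirty) {
            sd_write(cache[i].lba, cache[i].data);
            cache[i].dirty = 0;
        }
}
```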
PS: I've found one SD card that seems not to be compatible with Catalina's driver. It creates the first file on the card but then returns from fopen() without a file handle. Consistently. The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
@evanh said:
The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
No idea what TRIMming is, but I have a 32GB Sandisk Extreme that works fine.
Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.
@evanh said:
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
Changing sysclock speed won't make SD access much faster, since Catalina's SD card driver will simply adjust the SD timings to compensate for the clock speed.
A bigger difference might be due to the file system code used. Catalina uses DOSFS. Do you know what file system code flexspin uses?
I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.
Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.
Here is the Catalina command I used to do that (this is for the P2 Evaluation board with the HyperRAM add-on, by default on pins 32-48):
catalina -f200M -p2 -lcx -lmc file-accesstest.c -C P2_EVAL -C LARGE -C NO_LUT
Note the addition of -C NO_LUT - this is required because your test program uses the LUT itself, so we have to tell Catalina it cannot do so. I then used the build_utilities command to build a suitable XMM loader - just enter the command and follow the prompts. Then load the program using payload, specifying it should use the XMM loader just built:
payload -i -q1 XMM file-accesstest.bin
Please do whatever the equivalent is for flexspin, and post those times.
@evanh said:
The card in question is my newest Sandisk Extreme with all the modern bells and whistles, including TRIMming. No other card I have supports the TRIM feature.
No idea what TRIMming is, but I have a 32GB Sandisk Extreme that works fine.
Yip, my older 32GB Sandisk Extreme works perfectly with Catalina. And I've tested a 64GB Adata, a 64GB Kingston and a 128GB Samsung. All work fine. So no issue with capacity or cluster sizes or the like.
TRIMming is a block discard mechanism where SSDs can be informed of what to erase in bulk as it becomes unneeded rather than just when overwrites occur. I don't see it being utilised by the Prop2 as it adds more code and overheads to both the driver and filesystem layers. The point being that that particular card has new features that none of my other cards have.
Perhaps try reformatting the card. I always use the SD Association format program (https://www.sdcard.org/downloads/formatter/) because I have had a few problems with other formatting programs, especially the Windows one.
All cards have been formatted multiple times with Gparted on Linux long before ever being used with the Prop2. I use no other tool. And yes, I formatted the 64 GB Sandisk again to ensure that that card was clean.
@evanh said:
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
I'll also have a look at your programs and add the code required to enable the write-back sector cache. That will both speed things up and tell me whether it is worth updating Catalina's low-level SD card driver or not.
Shouldn't that be at the driver level, below the filesystem? BTW, that's pretty much what the lazy CMD12 achieves. It's not an actual cache but achieves the effect of one by eliminating many of the inter-transactional latencies.
Finally, Catalina itself (which is the program I would really like to speed up!) is way too large to execute from Hub RAM - it executes from PSRAM, which will also change the performance. So a more useful test is to have your program do the same.
I compared apples with apples. Yes, it was to compare filesystem performance, not compiling nor external RAM.
PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Shouldn't that be at the driver level, below the filesystem?
It sits between the low level SD driver and DOSFS, and so I have to recompile stdio to enable it. Will do this when I get time.
PS: I'm happy to check out other memory models for your interest but there's no Flexspin equivalents to compare to. On the Prop2, Flexspin only has native hubexec along with some use of "Fcache" for tight loops being temporarily loaded into cogRAM. Inline assembly can specify to use Fcache too.
I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui). In any case, there's not much point in comparisons since it will not be able to run any of the larger C programs. Also, I expect that when I enable caching, the SD card performance difference will not be much of an issue. But if it is, I'll have a look at your low level SD driver and turn it into a Catalina plugin that can be selected at compile time. This is usually pretty straightforward.
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).
I get nothing. No errors, no prints of anything coming from file-accesstest. It doesn't seem to be running. The download happens and payload goes into "interactive" mode, but that's all.
Odd. Does your P2 Evaluation board have the HyperRAM on pins 32-48? That should be all it needs.
Yep.
There is a XMM.bin file and SRAM.bin file in the working directory - produced by build_utilities.
Yes, that's correct. In this case they will be the same (they are not the same on all platforms). Did you specify P2_EVAL, HYPER and 64 in answer to the build_utilities prompts?
Yep, both are 86144 bytes.
I'd previously not been using payload ... so tried the known working prior compile and using payload to download - That works as expected, albeit with broken new line handling.
Your program uses CR as a line terminator, which is unusual (are you a Mac user? - that's the only platform I know of that used a lone CR as a line terminator). The C standard specifies LF (\n) as the line terminator, and Windows traditionally used CRLF (\r\n), which payload can accommodate by simply ignoring the CR (this is what the -q1 option does).
We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.
Text files use just a lone LF, but most programs' stdout emits both a CR and LF as a pair when given \n, which is what's needed for a regular terminal.
@RossH said:
I don't use flexspin (or should that be flex C? I've never been quite sure what the relationship is between flexspin, flexprop, flexc and flexgui).
Flexspin is the compiler binary. It compiles all the supported languages: Basic (FlexBasic?), C (FlexC), Spin1 and Spin2.
Flexprop is the optional GUI. I've not used it.
We had that conversation, you suggested I use vertical tab to emit a line scroll, and it worked for a regular terminal. But for payload it doesn't work.
I don't remember saying that, but I may possibly have suggested adding a VT when using the Parallax serial terminal, which is also a bit of an odd beast.
Text files use just a lone LF, but most programs' stdout emits both a CR and LF as a pair when given \n, which is what's needed for a regular terminal.
This is optional in Catalina - you can enable it by adding -C CR_ON_LF to the compile command.
BTW: I note the sysclock mode has changed with these extra compile options. The lowest four bits went from %0111 (no added capacitance) to %1011 (15 pF added).
Ah, I've now disabled the experimental HYPER_FASTREAD as well. It's just more reliable at sysclock/2. The only kit I'd be happy to use sysclock/1 on is the Swap (ex P2 Stamp) module that Rayman has dived headlong into. It has the HyperRAM placed tight against the Prop2 - https://forums.parallax.com/discussion/comment/1567691/#Comment_1567691
EDIT: Huh, that Swap testing at sysclock/1 found registered clock was best. Oh well, so much for transferring the findings from the Eval board and SD cards. The difference was minimal anyway.
Comments
I've had the P2EVAL HyperRAM working at my 340MHz-adjacent overclocks, 300 flat should certainly be fine (in /2 mode)
Ok, that's clear. Thanks @rogloh and @evanh.
Following on from the read/write speed tester exercise - https://forums.parallax.com/discussion/comment/1565955/#Comment_1565955 I've now ported the file access tester as well:
Sandisk Extreme 32 GB (2017)
Comparison using same Sandisk 32 GB card and at 300 MHz sysclock:
Catalina using built-in 1-bit SPI mode driver:
Flexspin using plug-in 1-bit SPI mode driver (Copy of the built-in driver that I don't have to mount differently):
Flexspin using plug-in 4-bit SD mode driver (With lazy CMD12 feature):
Wow, that's a whole order of magnitude difference in write speed there.
Flexspin uses FatFs - Generic FAT Filesystem Module R0.14b - with some choice modifications (because we know that on flexspin/P2 you can just cast an arbitrarily aligned char* to long* and read the correct value from that).
payload -p /dev/ttyUSB0 -i -q1 XMM file-accesstest.bin
That looks correct. Is your P2 Evaluation board a rev B? I have not tested the HyperRAM on anything else.
You may need to adjust the HYPER_DELAY_RAM (in target/p2/P2EVAL.inc) for other boards. Try 9.
Oh, right, yep, that fixes it, I needed a value of 11 for 300 MHz.
And thanks for the directions.
HYPER_DELAY_RAM = 12, for 300 MHz.
Also, I found that, in previous experiments, leaving the clock pin as unregistered gave the widest functioning bands when selecting a delay value (I called it the "rxlag" compensation) - https://forums.parallax.com/discussion/comment/1561512/#Comment_1561512
HYPER_UNREGCLK = 1
Verified sysclock/1 is clean on the Swap module - https://forums.parallax.com/discussion/comment/1566267/#Comment_1566267
Yes, these updates for the P2 Evaluation board were going to be in the next release. I have already updated GitHub, but not SourceForge.
Glad you got it working. I'll post the results when I have had a chance to enable the write-back cache on this program.
Ross.