I can't reproduce your results exactly - that is probably due to the use of a different type of SD card. However, the important thing here is how much using the cache speeds things up.
These results use a SanDisk 32GB Extreme, at 300MHz, in Catalina NATIVE mode:
As expected, using the cache speeds things up quite dramatically, and shows that this is a much more significant way to improve performance than just improving the low level SD card driver speed.
One other thing to note - you MUST format the SD card between each test, otherwise the tests are not using the same sectors on the SD card - and this makes a MASSIVE difference to the speeds - SD card manufacturers seem to cheat by making the first sectors much faster than subsequent sectors - presumably to give good results on simple speed tests.
Still working on the LARGE mode testing - I can't seem to get consistent results, which indicates I may have a bug with the cache in LARGE mode.
@RossH said:
One other thing to note - you MUST format the SD card between each test, otherwise the tests are not using the same sectors on the SD card - and this makes a MASSIVE difference to the speeds - SD card manufacturers seem to cheat by making the first sectors much faster than subsequent sectors - presumably to give good results on simple speed tests.
I would argue the opposite then. If there is something the cards are doing to make it faster at the start then messing it up plenty before the real test would be in order. Otherwise you'd be falling for a cheat.
@RossH said:
One other thing to note - you MUST format the SD card between each test, otherwise the tests are not using the same sectors on the SD card - and this makes a MASSIVE difference to the speeds - SD card manufacturers seem to cheat by making the first sectors much faster than subsequent sectors - presumably to give good results on simple speed tests.
I would argue the opposite then. If there is something the cards are doing to make it faster at the start then messing it up plenty before the real test would be in order. Otherwise you'd be falling for a cheat.
I suppose you could start with writing a known number of files to the card, rather than an empty card. The main thing is that it be redone exactly the same way before each test (and using the same formatting and file writing program), otherwise you may be using different clusters (in different sections of the card) for different tests - and you will get different results.
Even so, if you are using different file system implementations (e.g. DOSFS vs FatFS) they may allocate clusters to files differently, so you may still get differences. But it is probably fairest to start with an empty card, or at least write a test that utilizes more of the card than just a few sectors (in this case, about 200 - see next post).

For details on why the format program matters, see https://www.sdcard.org/press/thoughtleadership/the-sd-memory-card-formatterhow-this-handy-tool-solves-your-memory-card-formatting-needs/
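As an illustration of the pre-writing idea above, a minimal sketch might recreate the same set of files with the same program after every format, so each run starts from an identical cluster layout. The file count, size and names here are arbitrary placeholders, not anything the test program actually uses:

#include <stdio.h>
#include <string.h>

/* Pre-populate a freshly formatted card with a fixed, repeatable set of
   files so every benchmark run starts from the same cluster layout.
   The file count, size and names are arbitrary placeholders.           */
#define PREFILL_FILES 20
#define PREFILL_BYTES 4096

int prefill_card(void)
{
    static unsigned char block[PREFILL_BYTES];
    char name[32];
    FILE *f;
    int i;

    memset(block, 0xAA, sizeof(block));          /* fixed fill pattern   */
    for (i = 0; i < PREFILL_FILES; i++) {
        sprintf(name, "FILL%02d.DAT", i);
        f = fopen(name, "wb");
        if (f == NULL)
            return -1;
        fwrite(block, 1, sizeof(block), f);
        fclose(f);
    }
    return 0;
}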
This test program only uses about 200 sectors (204 I think), so any cache size of 100KB or so would give the same results - which would easily fit in Hub RAM. But my program uses PSRAM for the cache, so the cache could be up to 16MB. A smaller cache (say 64KB or 32KB) would still give good - but slightly slower - results.
That's a page of vague fear-mongering. There's not one reason why given there. Although there is a hint about size boundaries. GParted has used 1 MB partitioning boundaries forever. Ie: First partition starts at 1 MB offset by default.
@evanh said:
That's a page of vague fear-mongering. There's not one reason why given there. Although there is a hint about size boundaries. GParted has used 1 MB partitioning boundaries forever. Ie: First partition starts at 1 MB offset by default.
I always now use the SD Association format program, but out of interest I re-ran your program using a blank 32GB SanDisk Extreme SD card formatted with different programs - the SD Association one, the Windows one, and the Linux one (gparted). All different. The differences were generally small (between 5% and 10%), but the SD Association was fastest, then gparted, then Windows.
Huh, there is a Linux edition, and it looks to be quite new - https://www.sdcard.org/downloads/sd-memory-card-formatter-for-linux/

I note it has a --discard option. I coded a full card discard for some of my cards not long ago - to see if that made any difference to testing speed, but it didn't. Probably because it only matters after the card has been written 100%. No FAT32-on-SDXC option though, lame.
Well, I've learnt a good place to put these manually added binaries is in /usr/local/bin. It was, surprisingly, a totally empty directory. There is meant to be a ~/.local/bin as well but, although the directory already existed, the default path doesn't search it.
Disk /dev/sdg: 29.72 GiB, 31914983424 bytes, 62333952 sectors
Which is exactly 30436.5 MiB.
SDA formatted:
Device Boot Start End Sectors Size Id Type
/dev/sdg1 8192 62333951 62325760 29.7G c W95 FAT32 (LBA)
GParted formatted:
Device Boot Start End Sectors Size Id Type
/dev/sdg1 2048 62332927 62330880 29.7G b W95 FAT32
GParted formatted, with additional LBA flag set:
Device Boot Start End Sectors Size Id Type
/dev/sdg1 2048 62332927 62330880 29.7G c W95 FAT32 (LBA)
GParted formatted, LBA flag set, no alignment, 4 MB unused at beginning:
Device Boot Start End Sectors Size Id Type
/dev/sdg1 8192 62333951 62325760 29.7G c W95 FAT32 (LBA)
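To relate those start sectors to the alignment discussion: assuming the usual 512-byte logical sectors reported by fdisk, a start of 2048 is a 1 MiB offset and 8192 is a 4 MiB offset, which matches the GParted and SDA starting points shown above. A quick check:

#include <stdio.h>

/* Convert partition start sectors to byte offsets, assuming the usual
   512-byte logical sectors reported by fdisk.                          */
int main(void)
{
    unsigned long starts[2];
    int i;

    starts[0] = 2048;    /* GParted default: 1 MiB alignment */
    starts[1] = 8192;    /* SDA formatter:   4 MiB alignment */

    for (i = 0; i < 2; i++)
        printf("start sector %lu -> %lu MiB offset\n",
               starts[i], starts[i] * 512UL / (1024UL * 1024UL));
    return 0;
}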
Ha! I found the issue that was giving me such inconsistent results when using the cache in XMM programs!
I have to thank you for your test program, @evanh - it drew my attention to a piece of horribly inefficient code in my cache functions. It bamboozled me for a while because the effect on NATIVE programs is so small as to be almost unnoticeable - but in XMM programs (which execute from PSRAM and therefore execute more slowly) it was inefficient enough that in some cases it made the cached version of the function execute SLOWER than the non-cached version!
Still testing, but rewriting just a couple of lines has so far improved the performance of the self-hosted version of Catalina (which executes from PSRAM) by another 20%!
However, it does make me wonder what other silly things I have yet to find!
I've done some benchmarking on my new SD card cache, and the results are looking pretty good.
Here are the latest times for compiling some Catalina demo programs, using the self-hosted version of Catalina running at 300MHz on the P2 Eval board equipped with the HyperRAM add-on:
hello.c (~5 SLOC) - 3 mins
othello.c (~500 SLOC) - 4 mins
startrek.c (~2200 SLOC) - 18 mins
chimaera.c (~5500 SLOC) - 55 mins
dbasic.c (~13500 SLOC) - 104 mins
Compile times are now between 2 and 3 times faster than the initial release. I always wanted to get down to 3 minutes for a simple "Hello, World!" type C program, and this version finally achieves that.
All compiles done using Catalina's Quick Build option - which in hindsight was perhaps not the best name to choose!
The improved cache will be included in the next release.
That's a nice improvement but I'm kind of surprised it's not better than this. If hello.c has 5 lines of code do you know how many lines of additional header files are being parsed?
Have you looked into what sort of cache hit rate you are getting and what proportion of the 3 mins is just waiting to read/write external memory? I guess it would be possible to instrument your cache driver to spit out hits/miss ratios during a compile and the average time taken for a hit and a miss.
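A minimal sketch of that kind of instrumentation - cache_lookup(), cache_insert() and sd_read_sector() here are hypothetical stand-ins for the real cache and SD driver routines, not Catalina's actual API:

#include <stdio.h>

/* Hypothetical stand-ins for the real cache and SD driver routines. */
extern int  cache_lookup(unsigned long lba, unsigned char *buf);
extern void cache_insert(unsigned long lba, const unsigned char *buf);
extern int  sd_read_sector(unsigned long lba, unsigned char *buf);

static unsigned long cache_hits, cache_misses;

int cached_read_sector(unsigned long lba, unsigned char *buf)
{
    if (cache_lookup(lba, buf)) {           /* sector already cached    */
        cache_hits++;
        return 0;
    }
    cache_misses++;                         /* slow path: read the card */
    if (sd_read_sector(lba, buf) != 0)
        return -1;
    cache_insert(lba, buf);
    return 0;
}

void report_cache_stats(void)
{
    unsigned long total = cache_hits + cache_misses;
    if (total != 0)
        printf("cache: %lu hits, %lu misses (%.1f%% hit rate)\n",
               cache_hits, cache_misses,
               100.0 * (double)cache_hits / (double)total);
}

Printing the counters at the end of a compile (or periodically) would show how much of the overall time is really spent waiting on the card.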
With the old 20-33MHz DOS machines I used to use back in the day running Turbo Pascal and Turbo C etc, compiling a hello world type of program would maybe only take like 10-30secs or so (just roughly guessing from my old memory). It certainly was a lot less than 3mins. I'd imagine a 300MHz P2 should be somewhere in the vicinity of that depending on the cache performance.
Is most of the time looking up data structures, or blocked doing file IO, or from lots of cache misses? It'd be interesting to know where the bottleneck is if you wanted to speed it up further.
Update: Actually if you have a 70% cache hit rate and a cache miss takes 1us to resolve (approx), then the resulting MIP rate drops from 150MIPs max (@300MHz) down to about 1MIP for each miss. So 30% of the instructions you run at ~1MIP and 70% you run at 150MIPs. Avg instruction rate is then 100 instructions serviced in 30/1us+70/150 microseconds or 100/30.467us = 3.28MIPs. This is ~10% of the ballpark of a slow old PC which I guess is more in line with your recent results - although this is with a RISC vs CISC ISA so still not really a proper comparison.
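Spelling that estimate out as a small C calculation - the 70% hit rate, 1us miss penalty and 150 MIPS peak are the assumed round numbers above, not measured values:

#include <stdio.h>

int main(void)
{
    double peak_mips = 150.0;   /* ~2 clocks per instruction at 300MHz */
    double miss_us   = 1.0;     /* assumed external-memory miss cost   */
    double hit_rate  = 0.70;    /* assumed cache hit rate              */
    double time_us;

    /* Time to service 100 instructions: each miss stalls ~1us,
       hits run at the peak rate.                                      */
    time_us = (1.0 - hit_rate) * 100.0 * miss_us
            + hit_rate * 100.0 / peak_mips;

    printf("%.3f us per 100 instructions -> %.2f MIPS\n",
           time_us, 100.0 / time_us);    /* ~30.47us -> ~3.28 MIPS     */
    return 0;
}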
@rogloh said:
That's a nice improvement but I'm kind of surprised it's not better than this. If hello.c has 5 lines of code do you know how many lines of additional header files are being parsed?
About 200.
Have you looked into what sort of cache hit rate you are getting and what proportion of the 3 mins is just waiting to read/write external memory? I guess it would be possible to instrument your cache driver to spit out hits/miss ratios during a compile and the average time taken for a hit and a miss.
Yes, the cache hit rate is good. In most of these programs, the cache is large enough to hold all the sectors used, and it is a "write-back" cache, so in fact the hit rate approaches 100%. I have not measured the time taken to actually read/write from PSRAM, since there is probably no way I could improve it much even if I found out that was a significant bottleneck.
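For anyone following along, a write-back sector cache along these general lines might look roughly like the sketch below. It is illustrative only - a direct-mapped cache held in a hub array, whereas Catalina's real cache lives in PSRAM - but it shows why the hit rate approaches 100%: writes only mark a cached sector dirty, so nothing touches the card again until the cache is flushed.

#include <string.h>

#define SECTOR_SIZE 512
#define CACHE_LINES 256               /* 256 * 512 bytes = 128KB cache  */

/* Hypothetical direct-mapped, write-back sector cache (illustrative). */
typedef struct {
    unsigned long lba;                /* which sector this line holds   */
    int           valid;
    int           dirty;              /* modified since last write-back */
    unsigned char data[SECTOR_SIZE];
} cache_line_t;

static cache_line_t cache[CACHE_LINES];

extern int sd_read_sector (unsigned long lba, unsigned char *buf);
extern int sd_write_sector(unsigned long lba, const unsigned char *buf);

static cache_line_t *find_line(unsigned long lba)
{
    cache_line_t *line = &cache[lba % CACHE_LINES];
    if (line->valid && line->lba == lba)
        return line;                              /* hit              */
    if (line->valid && line->dirty)
        sd_write_sector(line->lba, line->data);   /* evict dirty line */
    sd_read_sector(lba, line->data);              /* miss: fill line  */
    line->lba   = lba;
    line->valid = 1;
    line->dirty = 0;
    return line;
}

int cache_read(unsigned long lba, unsigned char *buf)
{
    memcpy(buf, find_line(lba)->data, SECTOR_SIZE);
    return 0;
}

int cache_write(unsigned long lba, const unsigned char *buf)
{
    cache_line_t *line = find_line(lba);
    memcpy(line->data, buf, SECTOR_SIZE);
    line->dirty = 1;                  /* defer the actual SD write      */
    return 0;
}

void cache_flush(void)                /* write all dirty lines back     */
{
    int i;
    for (i = 0; i < CACHE_LINES; i++)
        if (cache[i].valid && cache[i].dirty) {
            sd_write_sector(cache[i].lba, cache[i].data);
            cache[i].dirty = 0;
        }
}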
With the old 20-33MHz DOS machines I used to use back in the day running Turbo Pascal and Turbo C etc, compiling a hello world type of program would maybe only take like 10-30secs or so (just roughly guessing from my old memory). It certainly was a lot less than 3mins. I'd imagine a 300MHz P2 should be somewhere in the vicinity of that depending on the cache performance.
Even with a cache hit rate of 100%, you still have to ultimately read and write hundreds of kilobytes - even megabytes (the core LCC program is nearly 2MB) - from the SD card to compile even a 5 line program. Doing that from an SD card is going to take more time than doing it from a hard disk.
Then there is the fact that some of the work is done using Catalyst's scripting capabilities, which are quite slow because (for example) they have to reboot the Propeller and reload Catalyst (and then possibly also Lua) to execute each and every line of script. It's clunky, but it works.
Overall, I'd guess that somewhere between 30 seconds and 1 minute of every compilation time - even for just a 5 line program - is taken up by things that are just not that easy to speed up. The cache (once I got it working right) was actually a very easy win.
Is most of the time looking up data structures, or blocked doing file IO, or from lots of cache misses? It'd be interesting to know where the bottleneck is if you wanted to speed it up further.
Much of the compilation time is spent in the C preprocessor (cpp). The one LCC uses is grindingly slow. I have thought about replacing it - there are open source ones that would be faster. That might save another 15 seconds or so.
Another big issue is the way Catalina does linking - the C library is not actually compiled to binary - it is just translated to PASM. That means every compilation - even for a 5 line program - is actually concatenating and then assembling a large chunk of the standard C library. This design decision probably looks a little crazy, but at the time it was not - in the very early days of the Propeller 1 there was simply no other choice. There was not even a publicly documented executable format for Propeller programs, and no tools available to use it if there had been - i.e. no intermediate object formats and no linker! Even Spin's executable format had to be reverse engineered. Luckily, there were some clever people around at the time who did that, and who produced alternative Spin tools that expanded on the Parallax offerings. The very first version of Catalina actually spat out the complete program in a combination of Spin and PASM, and then used the Parallax Spin tool to compile/assemble the final result. Catalina soon outgrew the Parallax tools, but luckily by that time there were alternatives (bstc, HomeSpun, OpenSpin etc). On the P2, Catalina no longer uses any Parallax or Spin tools at all, since someone (Dave Hein) finally produced a proper assembler.
Finally, there is the insurmountable issue that because these programs are too large to fit in Hub RAM, they have to execute from external RAM. So the PSRAM is not only being used as an SD cache - it is also being used to serve up the executable code itself. I also cache that, of course - but even so, executing code from external RAM can be anywhere from 2 to 10 times slower than executing the same code from Hub RAM.
Update: Actually if you have a 70% cache hit rate and a cache miss takes 1us to resolve (approx), then the resulting MIP rate drops from 150MIPs max (@300MHz) down to about 1MIP for each miss. So 30% of the instructions you run at ~1MIP and 70% you run at 150MIPs. Avg instruction rate is then 100 instructions serviced in 30/1us+70/150 microseconds or 100/30.467us = 3.28MIPs. This is ~10% of the ballpark of a slow old PC which I guess is more in line with your recent results - although this is with a RISC vs CISC ISA so still not really a proper comparison.
I will of course look at further speed improvements, but from here on in every improvement will be harder work for less benefit. And I have other things I'd rather be doing.
@rogloh said:
That's a nice improvement but I'm kind of surprised it's not better than this. If hello.c has 5 lines of code do you know how many lines of additional header files are being parsed?
About 200.
Have you looked into what sort of cache hit rate you are getting and what proportion of the 3 mins is just waiting to read/write external memory? I guess it would be possible to instrument your cache driver to spit out hits/miss ratios during a compile and the average time taken for a hit and a miss.
Yes, the cache hit rate is good. In most of these programs, the cache is large enough to hold all the sectors used, and it is a "write-back" cache, so in fact the hit rate approaches 100%. I have not measured the time taken to actually read/write from PSRAM, since there is probably no way I could improve it much even if I found out that was a significant bottleneck.
At 300MHz my rough rule of thumb is about 1us of time to read in a "small" block of data like 32 bytes or less for a transfer. If your transfer sizes are larger, then add another 2*transfer size/sysclk microseconds to the service time to account for the extra transfer time. If you write a custom memory driver that directly couples your cache to the external memory interface and avoids the COG service polling overhead you can speed it up a bit, but you'll then lose the ability to easily share the memory with multiple COGs (or will have to invent your own memory driver sharing scheme, which likely slows it down again as well).
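As a sketch, that rule of thumb could be written as follows - the 1us base and the 32-byte threshold are the approximations above, with sysclk given in MHz:

/* Rough PSRAM request service time: ~1us for a small transfer
   (32 bytes or less), plus roughly 2 * bytes / sysclk_mhz extra
   microseconds for larger transfers. Approximations only.        */
double psram_service_time_us(unsigned int bytes, double sysclk_mhz)
{
    double t = 1.0;                       /* base service latency  */
    if (bytes > 32)
        t += 2.0 * bytes / sysclk_mhz;    /* extra streaming time  */
    return t;
}

At 300MHz that works out to about 1us for a 32-byte line and roughly 4.4us for a full 512-byte sector.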
With the old 20-33MHz DOS machines I used to use back in the day running Turbo Pascal and Turbo C etc, compiling a hello world type of program would maybe only take like 10-30secs or so (just roughly guessing from my old memory). It certainly was a lot less than 3mins. I'd imagine a 300MHz P2 should be somewhere in the vicinity of that depending on the cache performance.
Even with a cache hit rate of 100%, you still have to ultimately read and write hundreds of kilobytes - even megabytes (the core LCC program is nearly 2Mb) - from the SD card to compile even a 5 line program. Doing that from an SD card is going to take more time than doing it from a hard disk.
Yeah, it will need to be read into PSRAM/HyperRAM initially. If the tools get swapped frequently during overall compilation that will add up.
Then there is the fact that some of the work is done using Catalyst's scripting capabilities, which are quite slow because (for example) they have to reboot the Propeller and reload Catalyst (and then possibly also Lua) to execute each and every line of script. It's clunky, but it works.
Sounds like it.
Overall, I'd guess that somewhere between 30 seconds and 1 minute of every compilation time - even for just a 5 line program - is taken up by things that are just not that easy to speed up. The cache (once I got it working right) was actually a very easy win.
Is most of the time looking up data structures, or blocked doing file IO, or from lots of cache misses? It'd be interesting to know where the bottleneck is if you wanted to speed it up further.
Much of the compliation time is spent in the C preprocessor (cpp). The one LCC uses is grindingly slow. I have thought about replacing it - there are open source ones that would be faster. That might save another 15 seconds or so.
Another big issue is the way Catalina does linking - the C library is not actually compiled to binary - it is just translated to PASM. That means every compilation - even for a 5 line program - is actually concatenating and then assembling a large chunk of the standard C library. This design decision probably looks a little crazy, but at the time it was not - in the very early days of the Propeller 1 there was simply no other choice. There was not even a publicly documented executable format for Propeller programs, and no tools available to use it if there had been - -i.e. no intermediate object formats and no linker! Even Spin's executable format had to be reverse engineered. Luckily, there were some clever people around at the time who did that, and who produced alternative Spin tools that expanded on the Parallax offerings. The very first version of Catalina actually spat out the complete program in a combination of Spin and PASM, and then used the Parallax Spin tool to compile/assemble the final result. Catalina soon outgrew the Parallax tools, but luckily by that time there were alternatives (bstc, HomeSpun, OpenSpin etc). On the P2, Catalina no longer uses any Parallax or Spin tools at all, since someone (Dave Hein) finally produced a proper assembler.
Yeah I agree, additional steps like that would have to slow it down.
Finally, there is the insurmountable issue that because these programs are too large to fit in Hub RAM, they have to execute from external RAM. So the PSRAM is not only being used as an SD cache - it also being used to serve up the executable code itself. I also cache that, of course - but even so, executing code from external RAM can be anywhere from 2 to 10 times slower than executing the same code from Hub RAM.
Yeah the external memory operation does slow it down a fair bit especially during cache misses.
I will of course look at further speed improvements, but from here on in every improvement will be harder work for less benefit. And I have other things I'd rather be doing.
Comments
Excellent results for sure. How much RAM is used by the cache?
Nice! Which also explains why you don't have it always compiled in.
Oh, yeah, from 64 GB upwards it shoves you with exFAT. That's a fail.
Reformatted to FAT32 with GParted: LBA flag set, no alignment, 16 MB unused at beginning:
Same behaviour with the Samsung 128 GB. SDA formatted:
And redone to FAT32:
I love making progress like that.
Fair enough.