New xmm modes for RamPage2
Rayman
Posts: 14,826
David Betz just provided a new xmm driver for RamPage2. I'm testing it out right now and it seems to work great.
RamPage2 comprises two SQI flash chips in parallel for an 8-bit data bus and two SQI sram chips also sharing the data bus.
There is also a uSD socket. RamPage2 directly mounts to Quickstart or Propeller Platform boards.
I'm still figuring PropGCC out, but I think I know enough to make some speed comparisons...
I think the "Dhrystone" demo is best for this, but I've seen others using the fibo demo.
It's not clear to me what optimizations to use when making speed comparisons...
Here, I'm using -O0 (no optimizations) and -mno-fcache (fcache makes small routines faster by bringing them inside a cog).
Here are the preliminary results, comparing to C3.
Mode Dhrystones/second (bigger is better)
lmm 2818
cmm 543
RamPage2:
Mode Dhrystones/second (bigger is better)
xmmc 515
xmm-single 335
xmm-split 310
C3:
Mode Dhrystones/second (bigger is better)
xmmc 197
xmm-single 59
xmm-split 80
So, I'm happy to find RamPage2 significantly faster than C3. It ought to be, since it uses a lot more pins...
Also, xmmc speed is on par with cmm, which is very good because that means it's faster than Spin
and that one can start with cmm and get an idea of how fast it would run with RamPage2.
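To put rough numbers on "significantly faster", the ratios can be worked out from the figures above. This is only a quick Python sketch: the dictionary layout and the names are made up for illustration, and the values are just the -O0, no-fcache results already listed.

```python
# Dhrystones/second from the preliminary -O0, no-fcache runs listed above.
rampage2 = {"xmmc": 515, "xmm-single": 335, "xmm-split": 310}
c3 = {"xmmc": 197, "xmm-single": 59, "xmm-split": 80}
cmm = 543

def speedups(fast, slow):
    """Return the fast/slow ratio per mode, rounded to two decimals."""
    return {mode: round(fast[mode] / slow[mode], 2) for mode in fast}

ratios = speedups(rampage2, c3)
xmmc_vs_cmm = round(rampage2["xmmc"] / cmm, 2)

if __name__ == "__main__":
    for mode, r in ratios.items():
        print(f"{mode}: RamPage2 is {r}x C3")
    print(f"RamPage2 xmmc runs at {xmmc_vs_cmm}x the cmm speed")
```

That works out to roughly 2.6x for xmmc, 5.7x for xmm-single and 3.9x for xmm-split, with RamPage2 xmmc at about 95% of cmm speed.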
Comments
I'd go for Os or O3 for benchmarking. Or whatever works best. After all if you are racing a car you put your foot down, right?
fcache? Well, why not? Grown-up micros have had caches for ages. For a lot of code they get you a speed boost. For some code they can be detrimental.
SimpleIDE only shows me Os, O0, O1, O2...
Os is O3
BTW, C3 XMMC performance is handicapped by half cache. You should use C3F for valid comparisons.
Looks like there's room for improvement in the RP2 driver...
Would one always want to use C3F for xmmc mode?
Is the regular C3 set the other way so that xmm-split and xmm-single work right?
Yes, you would always use C3F with xmmc mode. Otherwise, as Steve said, you'll only be using half of the cache. With the C3 driver the cache is split in half, with half used for caching SRAM and half used for caching flash. Since xmmc only uses flash, it only uses half the cache. In C3F mode the entire cache is used for flash, since SRAM is not supported.
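To illustrate why losing half the cache can hurt, here is a toy simulation. It is not the real C3 cache driver: the direct-mapped layout, the 8K total size, the 64-byte line size and the 6 KB loop are all assumptions picked only to make the effect visible.

```python
def count_misses(cache_bytes, line_bytes, addresses):
    """Count misses in a toy direct-mapped cache for a stream of byte addresses."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one tag per cache line slot
    misses = 0
    for addr in addresses:
        line = addr // line_bytes    # which memory line holds this address
        slot = line % n_lines        # direct-mapped placement
        if tags[slot] != line:
            tags[slot] = line        # simulate fetching the line from flash
            misses += 1
    return misses

# Hypothetical workload: a 6 KB loop of 4-byte instructions, run 10 times.
LINE = 64
loop = [a for _ in range(10) for a in range(0, 6 * 1024, 4)]

full = count_misses(8 * 1024, LINE, loop)   # whole cache used for flash (C3F-like)
half = count_misses(4 * 1024, LINE, loop)   # only half usable for flash (C3-like)

if __name__ == "__main__":
    print("misses with full cache:", full)
    print("misses with half cache:", half)
```

In this made-up case the loop fits in the full cache but not in half of it, so the half-cache run keeps evicting and refetching lines on every pass. Real code and the real driver will behave differently in detail.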
606 with RP2 xmm-split
599 with RP2 xmm-single
1098 with RP2 xmmc
Update:
lmm mode also ~doubles but cmm improves almost 4X:
5813 with lmm
2030 with cmm
So, cmm is then ~3X faster than xmm-split this way and ~2X faster than xmmc.
I'm not 100% sure where Spin speed fits into this... Have to see if anybody did Dhrystone in Spin...
cache-param1: 0x000706
it was this:
cache-param1: 0
BTW: why is cache-size = "512 +8k" instead of just "8k" ?
Goes from 1098 to 1151... about 5%. Guess I should see where C3F is with these optimization settings...
C3F still wins, but it's very close:
1151 for RP2X in xmmc mode
1171 for C3F in xmmc mode
Clearly this is unacceptable. But Ross did teach me that in xmmc mode the cache helps so much that
flash read speed starts to not matter at some point. Still, I think I can make this better, or at least beat C3.
RP2 is 4X faster in single and split modes. This is where I'm very interested in running...
Are you running the p2test or default branch compiler?
Also, how does your make file look?
I have this:
DEFINES = -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=2000
CFLAGS = -Os -mfcache $(DEFINES)
I get dhrystone 2.2 numbers like this with the default branch:
C3 LMM 5854
C3 CMM 2285
C3 XMMC 380
C3F XMMC 1196
SSF XMMC 1131
EEPROM XMMC 1191
According to that, we're totally wasting time
SSF is the implementation of my original Dual QuadSPI Flash idea.
A real test however is with running David's ebasic.
The program like this:
Gives results like this (smaller is better):
EEPROM 376080
C3 127104
C3F 127104 (odd.)
SSF 43792
A program like this:
Gives results like this (smaller is better):
EEPROM 10 seconds
C3 5 seconds
C3F 3 seconds
SSF 2 seconds
Maybe I can find an ebasic version of fibo...
Strange we get different numbers... I'm using the code in the demos folder.
Not using makefile, using a batch file like this:
Actually, I think we are getting similar results, only a slight difference because you used 2000 instead of 3000 passes.
But, your numbers are almost the same I get with -Os and with fcache.
In modes where the cost of fetching instructions from memory is significant, Os often produces better results than O2 or O3. This will be especially true for XMM, but applies to LMM and CMM as well.
Eric
Thanks Eric. We'll have to test that.
There is a compiler that goes to -O11
Sorry, I just woke up:)
More seriously, GCC has a ton of optimization options. The -O1, -O2 and -O3 options are shorthand for selecting different sets of them. So you could effectively make your own -O4 by selecting the optimization switches yourself on the command line.
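If you want to see which switches each level actually turns on, GCC can report them with -Q --help=optimizers. Below is a small Python sketch that diffs two such reports. The helper names are made up for this example, and I'm assuming the PropGCC build accepts the same query as stock GCC (pass its compiler name via the gcc argument if so).

```python
import subprocess

def enabled_optimizers(report):
    """Extract the -f flags marked [enabled] from 'gcc -Q --help=optimizers' output."""
    flags = set()
    for line in report.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].startswith("-f") and parts[-1] == "[enabled]":
            flags.add(parts[0])
    return flags

def query(level, gcc="gcc"):
    """Run the compiler and return its optimizer report for one -O level."""
    cmd = [gcc, "-Q", f"-O{level}", "--help=optimizers"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def extra_flags(low_report, high_report):
    """Flags enabled at the higher level but not at the lower one."""
    return sorted(enabled_optimizers(high_report) - enabled_optimizers(low_report))

if __name__ == "__main__":
    # Show what -O3 adds on top of -O2 for whatever gcc is on the PATH.
    print("\n".join(extra_flags(query("2"), query("3"))))
```

Any flag in that list can also be given by hand on top of a lower level, which is the "make your own -O4" idea.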