New xmm modes for RamPage2

Rayman Posts: 14,665
edited 2013-04-22 22:44 in Propeller 1
David Betz just provided a new xmm driver for RamPage2. I'm testing it out right now, and it seems to work great.

RamPage2 comprises two SQI flash chips in parallel for an 8-bit data bus, plus two SQI SRAM chips that share the same data bus.
There is also a uSD socket. RamPage2 directly mounts to Quickstart or Propeller Platform boards.

I'm still figuring PropGCC out, but I think I know enough to make some speed comparisons...
I think the "Dhrystone" demo is best for this, but I've seen others using the fibo demo.

It's not clear to me what optimizations to use when making speed comparisons...
Here, I'm using -O0 (no optimizations) and -mno-fcache (fcache speeds up small routines by loading them into COG memory).

Here are the preliminary results, comparing to C3.

Hub memory (LMM/CMM, independent of the memory board):

Mode         Dhrystones/second (bigger is better)
lmm          2818
cmm           543

RamPage2:

Mode         Dhrystones/second (bigger is better)
xmmc          515
xmm-single    335
xmm-split     310

C3:

Mode         Dhrystones/second (bigger is better)
xmmc          197
xmm-single     59
xmm-split      80

So, I'm happy to find RamPage2 significantly faster than C3. It ought to be, since it uses a lot more pins...
Also, xmmc speed is on par with cmm, which is very good: that means it's faster than Spin,
and that you can start in cmm and get an idea of how fast the code would run from RamPage2.
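
For reference, here is a rough sketch of the kind of CNT-based timing loop these benchmarks boil down to. It is only an illustration, not the actual Dhrystone or fibo source; fibo() is a stand-in workload, and CNT and CLKFREQ come from PropGCC's propeller.h:

    #include <stdio.h>
    #include <propeller.h>

    /* Stand-in workload; any function compiled for the memory model under test will do. */
    static unsigned fibo(unsigned n)
    {
        return (n < 2) ? n : fibo(n - 1) + fibo(n - 2);
    }

    int main(void)
    {
        unsigned start   = CNT;              /* free-running system counter */
        unsigned result  = fibo(20);
        unsigned elapsed = CNT - start;      /* ticks at CLKFREQ Hz */

        printf("fibo(20) = %u in %u ticks (%u us)\n",
               result, elapsed, elapsed / (CLKFREQ / 1000000));
        return 0;
    }

Dhrystone does the same thing with a fixed number of passes and reports passes per second; the only thing that changes between lmm/cmm/xmmc/xmm-single/xmm-split is the -m flag the code is compiled with.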

Comments

  • David Betz Posts: 14,516
    edited 2013-04-21 13:32
    Congratulations! And thanks for finding that bug in my rampage2 cache driver!
  • Heater. Posts: 21,230
    edited 2013-04-21 13:40
    Rayman,

    I'd go for -Os or -O3 for benchmarking. Or whatever works best. After all, if you're racing a car you put your foot down, right?

    fcache? Well, why not? Grown-up micros have had caches for ages. For a lot of code they get you a speed boost; for some code they can be detrimental.
  • Rayman Posts: 14,665
    edited 2013-04-21 14:32
    Is there an O3?

    SimpleIDE only shows me Os, O0, O1, O2...
  • jazzed Posts: 11,803
    edited 2013-04-21 15:07
    Rayman wrote: »
    Is there an O3?

    SimpleIDE only shows me Os, O0, O1, O2...

    Os is O3

    BTW, C3 XMMC performance is handicapped because it only gets half the cache. You should use C3F for valid comparisons.
  • Rayman Posts: 14,665
    edited 2013-04-21 15:53
    You're right, "C3F" is significantly faster... Might be faster than RP2 then...
    Looks like there's room for improvement in the RP2 driver...
  • David Betz Posts: 14,516
    edited 2013-04-21 16:20
    Rayman wrote: »
    You're right, "C3F" is significantly faster... Might be faster than RP2 then...
    Looks like there's room for improvement in the RP2 driver...
    C3F of course only works in xmmc mode so you need to compare that with xmmc on the RamPage2. Also, the RamPage2 driver is currently using a four way cache with the tags in hub memory. The C3F driver uses a direct mapped cache with the tags in COG memory. You can change the RamPage2 driver to a direct mapped cache by playing with the cache configuration parameters in the rampage2x.cfg file.
    ' $08: geometry: (way_width << 24) | (index_width << 16) | (offset_width << 8)
    
    This is what goes in cache-param1 in the .cfg file. Set way_width to 0 for a direct mapped cache. The widths are the number of bits in each field. To make it work like C3F you should change the param1 value to 0x000706.
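
    To illustrate how that geometry word is interpreted, here is a sketch based only on the comment above (it is not the actual cache driver code):

    #include <stdio.h>

    /* Decode a cache geometry word of the form
       (way_width << 24) | (index_width << 16) | (offset_width << 8).
       Each width is a number of bits, so every size is a power of two. */
    static void describe_geometry(unsigned param)
    {
        unsigned way_width    = (param >> 24) & 0xff;  /* log2(ways); 0 = direct mapped */
        unsigned index_width  = (param >> 16) & 0xff;  /* log2(cache lines per way)     */
        unsigned offset_width = (param >>  8) & 0xff;  /* log2(bytes per cache line)    */

        unsigned ways       = 1u << way_width;
        unsigned lines      = 1u << index_width;
        unsigned line_bytes = 1u << offset_width;

        printf("%u way(s) x %u line(s) x %u bytes = %u bytes total\n",
               ways, lines, line_bytes, ways * lines * line_bytes);
    }

    For a direct mapped cache, way_width is 0, so ways comes out to 1 and the total is just lines times line size.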
  • Rayman Posts: 14,665
    edited 2013-04-21 16:43
    I don't really follow that, but let me ask this:
    Would one always want to use C3F for xmmc mode?
    Is the regular C3 set the other way so that xmm-split and xmm-single work right?
  • David Betz Posts: 14,516
    edited 2013-04-21 17:09
    Rayman wrote: »
    I don't really follow that, but let me ask this:
    Would one always want to use C3F for xmmc mode?
    Is the regular C3 set the other way so that xmm-split and xmm-single work right?
    Did you try setting cache-param1 to 0x000706 (direct mapped, 128 cache lines, 64 bytes per cache line)?

    Yes, you would always use C3F with xmmc mode. Otherwise, as Steve said, you'll only be using half of the cache. With the C3 driver the cache is split in half, with half used for caching SRAM and half used for caching flash. Since xmmc only uses flash, it only uses half the cache. In C3F mode the entire cache is used for flash since SRAM is not supported.
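
    As a rough illustration of what "direct mapped, 128 cache lines, 64 bytes per cache line" means for a lookup (this is just the usual tag/index/offset split, not the actual driver code):

    /* With offset_width = 6 and index_width = 7, an external-memory address
       splits into three fields:
         bits  5..0  - offset within the 64-byte cache line
         bits 12..6  - which of the 128 cache lines the address maps to
         bits 31..13 - tag, stored with the line to detect a hit            */
    #define OFFSET_WIDTH 6
    #define INDEX_WIDTH  7

    static unsigned line_offset(unsigned addr) { return addr & ((1u << OFFSET_WIDTH) - 1); }
    static unsigned line_index(unsigned addr)  { return (addr >> OFFSET_WIDTH) & ((1u << INDEX_WIDTH) - 1); }
    static unsigned line_tag(unsigned addr)    { return addr >> (OFFSET_WIDTH + INDEX_WIDTH); }

    Every address maps to exactly one line, so there is no way selection to do; splitting the cache between flash and SRAM, as the plain C3 driver does, just leaves xmmc with half the lines.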
  • Rayman Posts: 14,665
    edited 2013-04-21 17:53
    Tried it... made it slower... Was 516 Dhry/s and now 147...
  • Rayman Posts: 14,665
    edited 2013-04-21 18:04
    Just ran Dhrystone demo with -Os and -mfcache options... Speed approximately doubled:

    606 with RP2 xmm-split
    599 with RP2 xmm-single
    1098 with RP2 xmmc

    Update:
    lmm mode also ~doubles but cmm improves almost 4X:

    5813 with lmm
    2030 with cmm

    So, cmm is then ~3X faster than xmm-split this way and ~2X faster than xmmc.

    I'm not 100% sure where Spin speed fits into this... Have to see if anybody did Dhrystone in Spin...
  • David Betz Posts: 14,516
    edited 2013-04-21 18:08
    Rayman wrote: »
    Tried it... made it slower... Was 516 Dhry/s and now 147...
    This is with xmmc?
  • Rayman Posts: 14,665
    edited 2013-04-21 18:45
    Yes. Is this right:
    cache-param1: 0x000706

    it was this:
    cache-param1: 0

    BTW: why is cache-size = "512 +8k" instead of just "8k" ?
  • David Betz Posts: 14,516
    edited 2013-04-21 18:48
    Smile! I gave you the wrong information. I didn't even read my own help text above about the cache geometry settings. Try 0x00070600. Sorry about that!
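
    That value lines up with the geometry comment quoted earlier: the fields sit at bits 24, 16 and 8, so a direct mapped cache with 128 lines of 64 bytes per line packs as

    (0 << 24) | (7 << 16) | (6 << 8) = 0x00070600

    That is 2^7 = 128 lines of 2^6 = 64 bytes, 8 KB in total, which matches the 8k part of the cache-size setting. By the same formula, the earlier 0x000706 lands the widths one byte too low, describing a single 128-byte line, which would explain the big slowdown.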
  • Rayman Posts: 14,665
    edited 2013-04-21 18:57
    I'm in -Os with -fcache mode right now and that setting increases Dhry/s, but only a bit.
    Goes from 1098 to 1151... maybe 10%. Guess I should see where C3F is with these optimization settings...

    C3F still wins, but it's very close:

    1151 for RP2X in xmmc mode
    1171 for C3F in xmmc mode

    Clearly this is unacceptable :) But, Ross did teach me that in xmmc mode the cache helps so much that
    flash read speed starts to not matter at some point. Still, I think I can make this better, or at least beat C3.

    RP2 is 4X faster in single and split modes. This is where I'm very interested in running...
  • David Betz Posts: 14,516
    edited 2013-04-21 19:24
    Rayman wrote: »
    I'm in -Os with -fcache mode right now and that setting increases Dhry/s, but only a bit.
    Goes from 1098 to 1151... maybe 10%. Guess I should see where C3F is with these optimization settings...

    C3F still wins, but it's very close:

    1151 for RP2X in xmmc mode
    1171 for C3F in xmmc mode

    Clearly this is unacceptable :) But, Ross did teach me that in xmmc mode the cache helps so much that
    flash read speed starts to not matter at some point. Still, I think I can make this better, or at least beat C3.

    RP2 is 4X faster in single and split modes. This is where I'm very interested in running...
    Is this using cache-param1 set to 0x00070600? If so, you may be seeing the difference between having the tags in hub memory vs. COG memory. There is a version of the C3 cache code that uses tags in hub memory as well and that might have different performance characteristics. Also, the code in cache_common.spin can probably be optimized.
  • jazzed Posts: 11,803
    edited 2013-04-21 21:48
    Rayman,

    Are you running the p2test or default branch compiler?
    Also, how does your make file look?

    I have this:
    DEFINES = -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=2000
    CFLAGS = -Os -mfcache $(DEFINES)

    I get dhrystone 2.2 numbers like this with the default branch:

    C3 LMM 5854
    C3 CMM 2285
    C3 XMMC 380
    C3F XMMC 1196
    SSF XMMC 1131
    EEPROM XMMC 1191

    According to that, we're totally wasting time :)

    SSF is the implementation of my original Dual QuadSPI Flash idea.

    A real test, however, is running David's ebasic.

    A program like this:
    10 A = CNT()
    20 B = CNT()
    30 C = B - A
    40 print C
    


    Gives results like this (smaller is better):

    EEPROM 376080
    C3 127104
    C3F 127104 (odd.)
    SSF 43792
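
    (Those are raw CNT deltas, so they are in system-clock ticks. Assuming the usual 80 MHz Propeller clock, and that is an assumption on my part, 376080 ticks is roughly 4.7 ms, 127104 roughly 1.6 ms, and 43792 roughly 0.55 ms between the two CNT() calls.)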


    A program like this:
    10 print "Count to 100"
    20 for n = 0 to 100
    30  if n mod 10 = 0 then
    40    print
    50  end if
    60  print n; " ";
    70 next n
    

    Gives results like this (smaller is better):

    EEPROM 10 seconds
    C3 5 seconds
    C3F 3 seconds
    SSF 2 seconds
  • Rayman Posts: 14,665
    edited 2013-04-22 03:29
    I'm running with the latest and greatest p2test branch...

    Maybe I can find an ebasic version of fibo...

    Strange we get different numbers... I'm using the code in the demos folder.
    Not using a makefile; using a batch file like this:
    propeller-elf-gcc.exe -c dry.c -o dry1.o -Os -m%MemoryMode% -mfcache -I . -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=3000 -fno-exceptions

    propeller-elf-gcc.exe -c dry.c -o dry2.o -Os -m%MemoryMode% -mfcache -I . -DPASS2 -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=3000 -fno-exceptions

    propeller-elf-gcc.exe -Os -m%MemoryMode% -mfcache -o a.out dry1.o dry2.o

    Actually, I think we are getting similar results, with only a slight difference because you used 2000 passes instead of 3000.
    Your numbers are almost the same as I get with -Os and fcache.
  • ersmith Posts: 6,054
    edited 2013-04-22 10:53
    jazzed wrote: »
    Os is O3
    No, actually O3 is even more aggressive optimization for speed than O2, whereas Os is similar to O2 but optimizing for size.

    In modes where the cost of fetching instructions from memory is significant, Os often produces better results than O2 or O3. This will be especially true for XMM, but applies to LMM and CMM as well.

    Eric
  • jazzed Posts: 11,803
    edited 2013-04-22 17:07
    ersmith wrote: »
    No, actually O3 is even more aggressive optimization for speed than O2, whereas Os is similar to O2 but optimizing for size.

    Thanks Eric. We'll have to test that.
  • Rayman Posts: 14,665
    edited 2013-04-22 17:22
    There's no O4, is there?
  • Heater. Posts: 21,230
    edited 2013-04-22 22:44
    -O4 is only for the experienced. It totally rearranges your code, effectively causing it to run backwards, thus simulating faster-than-light execution. You get the results before you put in the data. This is only useful for debugging, as you can run the program in reverse with wrong results and step back to the point of the bug that's causing it.

    There is a compiler that goes to -O11

    Sorry, I just woke up:)

    More seriously: GCC has a ton of optimization options. The -O1, -O2, -O3 options are shorthand that selects different sets of optimization switches. So you could effectively make your own -O4 by selecting the optimization switches yourself on the command line, for example:
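
    (Just an illustration, not a recommendation: -funroll-loops and -finline-functions are ordinary GCC switches, nothing Propeller-specific.)

    propeller-elf-gcc -O2 -mxmmc -mfcache -funroll-loops -finline-functions -c dry.c -o dry.o

    On a cached XMM target, though, bigger code can easily be slower code, which is exactly why Eric suggested -Os, so measure each switch before keeping it.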