New xmm modes for RamPage2 — Parallax Forums

New xmm modes for RamPage2

Rayman Posts: 14,826
edited 2013-04-22 22:44 in Propeller 1
David Betz just provided a new xmm driver for RamPage2. I'm testing it out right now and it seems to work great.

RamPage2 comprises two SQI flash chips in parallel for an 8-bit data bus and two SQI sram chips also sharing the data bus.
There is also a uSD socket. RamPage2 directly mounts to Quickstart or Propeller Platform boards.

I'm still figuring PropGCC out, but I think I know enough to make some speed comparisons...
I think the "Dhrystone" demo is best for this, but I've seen others using the fibo demo.

It's not clear to me what optimizations to use when making speed comparisons...
Here, I'm using -O0 (no optimizations) and -mno-fcache (fcache makes small routines faster by bringing them inside a cog).

Here are the preliminary results, comparing to C3.

Mode Dhrystones/second (bigger is better)

lmm 2818
cmm 543

Rampage2:
Mode Dhrystones/second (bigger is better)

xmmc 515
xmm-single 335
xmm-split 310

C3:
Mode Dhrystones/second (bigger is better)

xmmc 197
xmm-single 59
xmm-split 80

So, I'm happy to find RamPage2 significantly faster than C3. It ought to be, since it uses a lot more pins...
Also, xmmc speed is on par with cmm, which is very good because that means it's faster than Spin
and that one can start with cmm and get an idea of how fast it would run with RamPage2.

Comments

  • David Betz Posts: 14,516
    edited 2013-04-21 13:32
    Congratulations! And thanks for finding that bug in my rampage2 cache driver!
  • Heater. Posts: 21,230
    edited 2013-04-21 13:40
    Rayman,

    I'd go for Os or O3 for benchmarking. Or whatever works best. After all if you are racing a car you put your foot down, right?

    fcache? Well, why not? Grown-up micros have had caches for ages. For a lot of code they give you a speed boost. For some code they can be detrimental.
  • Rayman Posts: 14,826
    edited 2013-04-21 14:32
    Is there an O3?

    SimpleIDE only shows me Os, O0, O1, O2...
  • jazzed Posts: 11,803
    edited 2013-04-21 15:07
    Rayman wrote: »
    Is there an O3?

    SimpleIDE only shows me Os, O0, O1, O2...

    Os is O3

    BTW, C3 XMMC performance is handicapped by half cache. You should use C3F for valid comparisons.
  • Rayman Posts: 14,826
    edited 2013-04-21 15:53
    You're right, "C3F" is significantly faster... Might be faster than RP2 then...
    Looks like there's room for improvement in the RP2 driver...
  • David Betz Posts: 14,516
    edited 2013-04-21 16:20
    Rayman wrote: »
    You're right, "C3F" is significantly faster... Might be faster than RP2 then...
    Looks like there's room for improvement in the RP2 driver...
    C3F of course only works in xmmc mode so you need to compare that with xmmc on the RamPage2. Also, the RamPage2 driver is currently using a four way cache with the tags in hub memory. The C3F driver uses a direct mapped cache with the tags in COG memory. You can change the RamPage2 driver to a direct mapped cache by playing with the cache configuration parameters in the rampage2x.cfg file.
    ' $08: geometry: (way_width << 24) | (index_width << 16) | (offset_width << 8)
    
    This is what goes in cache-param1 in the .cfg file. Set way_width to 0 for a direct mapped cache. The widths are the number of bits in each field. To make it work like C3F you should change the param1 value to 0x000706.
  • Rayman Posts: 14,826
    edited 2013-04-21 16:43
    I don't really follow that, but let me ask this:
    Would one always want to use C3F for xmmc mode?
    Is the regular C3 set the other way so that xmm-split and xmm-single work right?
  • David Betz Posts: 14,516
    edited 2013-04-21 17:09
    Rayman wrote: »
    I don't really follow that, but let me ask this:
    Would one always want to use C3F for xmmc mode?
    Is the regular C3 set the other way so that xmm-split and xmm-single work right?
    Did you try setting cache-param1 to 0x000706 (direct mapped, 128 cache lines, 64 bytes per cache line)?

    Yes, you would always use C3F with xmmc mode. Otherwise, as Steve said, you'll only be using half of the cache. With the C3 driver the cache is split in half, with half used for caching SRAM and half used for caching flash. Since xmmc only uses flash, it only uses half the cache. In C3F mode the entire cache is used for flash since SRAM is not supported.
  • Rayman Posts: 14,826
    edited 2013-04-21 17:53
    Tried it... made it slower... Was 516 Dhry/s and now 147...
  • Rayman Posts: 14,826
    edited 2013-04-21 18:04
    Just ran Dhrystone demo with -Os and -mfcache options... Speed approximately doubled:

    606 with RP2 xmm-split
    599 with RP2 xmm-single
    1098 with RP2 xmmc

    Update:
    lmm mode also ~doubles but cmm improves almost 4X:

    5813 with lmm
    2030 with cmm

    So, cmm is then ~3X faster than xmm-split this way and 2X faster than xmmc.

    I'm not 100% sure where Spin speed fits into this... Have to see if anybody did Dhrystone in Spin...
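
The ratios quoted in the post above can be checked with a couple of lines of C. This is only an illustrative sketch: the helper name `speedup` is hypothetical, and the inputs are the Dhrystones/second figures reported in the thread for -Os with -mfcache.

```c
/* Ratio of two benchmark scores (Dhrystones/second, bigger is better).
   'speedup' is a hypothetical helper name used for illustration only. */
double speedup(double faster, double slower)
{
    return faster / slower;
}
```

With the reported numbers, `speedup(2030, 606)` (cmm vs. RP2 xmm-split) comes out at roughly 3.3 and `speedup(2030, 1098)` (cmm vs. RP2 xmmc) at roughly 1.8, in line with the "~3X" and "2X" estimates.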
  • David Betz Posts: 14,516
    edited 2013-04-21 18:08
    Rayman wrote: »
    Tried it... made it slower... Was 516 Dhry/s and now 147...
    This is with xmmc?
  • Rayman Posts: 14,826
    edited 2013-04-21 18:45
    yes. Is this right:
    cache-param1: 0x000706

    it was this:
    cache-param1: 0

    BTW: why is cache-size = "512 +8k" instead of just "8k" ?
  • David Betz Posts: 14,516
    edited 2013-04-21 18:48
    Smile! I gave you the wrong information. I didn't even read my own help text above about the cache geometry settings. Try 0x00070600. Sorry about that!
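
To see why the two values differ, here is a minimal sketch of the packing described in the driver comment quoted earlier, `(way_width << 24) | (index_width << 16) | (offset_width << 8)`. The function names are hypothetical and this is not the driver's code; it only assumes the fields are decoded exactly as that comment describes.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the three field widths (each a number of bits) into one geometry
   word, following the layout in the quoted driver comment. The function
   name is an assumption made for this example. */
uint32_t cache_geometry(unsigned way_width, unsigned index_width,
                        unsigned offset_width)
{
    assert(way_width < 256 && index_width < 256 && offset_width < 256);
    return ((uint32_t)way_width << 24) |
           ((uint32_t)index_width << 16) |
           ((uint32_t)offset_width << 8);
}

/* Total cache bytes implied by a geometry word:
   ways * lines * bytes per line = 2^(way + index + offset widths). */
uint32_t cache_bytes(uint32_t geometry)
{
    unsigned way_width = (geometry >> 24) & 0xff;
    unsigned index_width = (geometry >> 16) & 0xff;
    unsigned offset_width = (geometry >> 8) & 0xff;
    assert(way_width + index_width + offset_width < 32);
    return (uint32_t)1 << (way_width + index_width + offset_width);
}
```

Under that assumption, `cache_geometry(0, 7, 6)` gives 0x00070600 (direct mapped, 128 lines of 64 bytes, 8192 bytes in total), whereas 0x000706 would decode as index_width 0 and offset_width 7, i.e. a single 128-byte line. That would be consistent with the slowdown reported with the first value.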
  • Rayman Posts: 14,826
    edited 2013-04-21 18:57
    I'm in -Os with -mfcache mode right now and that setting increases Dhry/s, but only a bit.
    Goes from 1098 to 1151... about 5%. Guess I should see where C3F is with these optimization settings...

    C3F still wins, but it's very close:

    1151 for RP2X in xmmc mode
    1171 for C3F in xmmc mode

    Clearly this is unacceptable :) But, Ross did teach me that in xmmc mode the cache helps so much that
    flash read speed starts to not matter at some point. Still, I think I can make this better, or at least beat C3.

    RP2 is 4X faster in single and split modes. This is where I'm very interested in running...
  • David Betz Posts: 14,516
    edited 2013-04-21 19:24
    Rayman wrote: »
    I'm in -Os with -mfcache mode right now and that setting increases Dhry/s, but only a bit.
    Goes from 1098 to 1151... about 5%. Guess I should see where C3F is with these optimization settings...

    C3F still wins, but it's very close:

    1151 for RP2X in xmmc mode
    1171 for C3F in xmmc mode

    Clearly this is unacceptable :) But, Ross did teach me that in xmmc mode the cache helps so much that
    flash read speed starts to not matter at some point. Still, I think I can make this better, or at least beat C3.

    RP2 is 4X faster in single and split modes. This is where I'm very interested in running...
    Is this using cache-param1 set to 0x00070600? If so, you may be seeing the difference between having the tags in hub memory vs. COG memory. There is a version of the C3 cache code that uses tags in hub memory as well and that might have different performance characteristics. Also, the code in cache_common.spin can probably be optimized.
  • jazzed Posts: 11,803
    edited 2013-04-21 21:48
    Rayman,

    Are you running the p2test or default branch compiler?
    Also, how does your make file look?

    I have this:
    DEFINES = -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=2000
    CFLAGS = -Os -mfcache $(DEFINES)

    I get dhrystone 2.2 numbers like this with the default branch:

    C3 LMM 5854
    C3 CMM 2285
    C3 XMMC 380
    C3F XMMC 1196
    SSF XMMC 1131
    EEPROM XMMC 1191

    According to that, we're totally wasting time :)

    SSF is the implementation of my original Dual QuadSPI Flash idea.

    A real test however is with running David's ebasic.

    A program like this:
    10 A = CNT()
    20 B = CNT()
    30 C = B - A
    40 print C
    


    Gives results like this (smaller is better):

    EEPROM 376080
    C3 127104
    C3F 127104 (odd.)
    SSF 43792


    A program like this:
    10 print "Count to 100"
    20 for n = 0 to 100
    30  if n mod 10 = 0 then
    40    print
    50  end if
    60  print n; " ";
    70 next n
    

    Gives results like this (smaller is better):

    EEPROM 10 seconds
    C3 5 seconds
    C3F 3 seconds
    SSF 2 seconds
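
The first set of ebasic results above appears to be raw CNT deltas, i.e. system clock ticks between two statements. A small sketch for turning those into time follows; the helper name `ticks_to_us` is hypothetical, and the 80 MHz clock used in the usage note is an assumption, since the thread does not state the clock frequency.

```c
/* Convert a CNT tick delta to microseconds.
   clock_hz is the system clock frequency in Hz; it is not stated in the
   thread, so the caller has to supply it (80 MHz is a common assumption). */
double ticks_to_us(unsigned long ticks, double clock_hz)
{
    return (double)ticks * 1e6 / clock_hz;
}
```

Assuming an 80 MHz clock, the EEPROM figure of 376080 ticks is about 4701 µs, the C3 figure of 127104 ticks about 1589 µs, and the SSF figure of 43792 ticks about 547 µs.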
  • Rayman Posts: 14,826
    edited 2013-04-22 03:29
    I'm running with the latest and greatest p2test branch...

    Maybe I can find an ebasic version of fibo...

    Strange we get different numbers... I'm using the code in the demos folder.
    Not using makefile, using a batch file like this:
    propeller-elf-gcc.exe -c dry.c -o dry1.o -Os -m%MemoryMode% -mfcache -I . -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=3000 -fno-exceptions

    propeller-elf-gcc.exe -c dry.c -o dry2.o -Os -m%MemoryMode% -mfcache -I . -DPASS2 -Dprintf=__simple_printf -DMSC_CLOCK -DINTEGER_ONLY -DFIXED_NUMBER_OF_PASSES=3000 -fno-exceptions

    propeller-elf-gcc.exe -Os -m%MemoryMode% -mfcache -o a.out dry1.o dry2.o
    

    Actually, I think we are getting similar results, with only a slight difference because you used 2000 instead of 3000 passes.
    But your numbers are almost the same as the ones I get with -Os and with fcache.
  • ersmith Posts: 6,092
    edited 2013-04-22 10:53
    jazzed wrote: »
    Os is O3
    No, actually O3 is even more aggressive optimization for speed than O2, whereas Os is similar to O2 but optimizing for size.

    In modes where the cost of fetching instructions from memory is significant, Os often produces better results than O2 or O3. This will be especially true for XMM, but applies to LMM and CMM as well.

    Eric
  • jazzed Posts: 11,803
    edited 2013-04-22 17:07
    ersmith wrote: »
    No, actually O3 is even more aggressive optimization for speed than O2, whereas Os is similar to O2 but optimizing for size.

    Thanks Eric. We'll have to test that.
  • Rayman Posts: 14,826
    edited 2013-04-22 17:22
    There's no O4 is there?
  • Heater. Posts: 21,230
    edited 2013-04-22 22:44
    -O4 is only for the experienced. It totally rearranges your code, effectively causing it to run backwards, thus simulating faster-than-light execution. You get the results before you put in the data. This is only useful for debugging, as you can run the program in reverse with wrong results and step back to the point of the bug that's causing it.

    There is a compiler that goes to -O11

    Sorry, I just woke up:)

    More seriously, GCC has a ton of optimization options. The -O1, -O2 and -O3 options are shorthand that selects different sets of optimization options. So you could effectively make your own -O4 by selecting the optimization switches yourself on the command line.