Spin vs. C Speed - Page 2 — Parallax Forums

  • Here are some results of a simple SPI master coded in C and built using SimpleIDE v1.0.2 (RC2) with optimization set to "-O2 Speed":

    CMM: 1.33 Mbps
    LMM: 1.33 Mbps
    COG: 1.43 Mbps

    I'm surprised to see no difference between the CMM and LMM results and am also surprised to see the COG model results so slow.

    These results are much better than the 14 Kbps I was getting with Spin, but still disappointing.

    They're all essentially the same because the inner loop will be running out of COG memory in all 3 cases. SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal. That said, I wouldn't expect sustained rates much above 2.5 Mbps at 80 MHz. It's going to take at least 4 instructions (16 cycles) to output a bit, so the upper limit is 5 Mbps, but in practice loop overhead and fetching data from memory are going to make the sustained rate quite a bit slower.

    Are you able to share your code (Spin or C)? We may be able to suggest some improvements.
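As a rough check of the ceiling quoted above, the arithmetic can be sketched in C. This is only an illustration: the function name and parameters are made up here, and it assumes every instruction in the bit loop takes 4 clock cycles, as most Propeller 1 instructions do.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on the bit rate when each bit costs a fixed number of
 * instructions. All names here are illustrative, not from any library. */
uint32_t max_bit_rate_bps(uint32_t clock_hz,
                          uint32_t instructions_per_bit,
                          uint32_t cycles_per_instruction)
{
    assert(instructions_per_bit > 0 && cycles_per_instruction > 0);
    return clock_hz / (instructions_per_bit * cycles_per_instruction);
}
```

With an 80 MHz clock, 4 instructions per bit gives 5 Mbps, and doubling the per-bit cost to 8 instructions gives 2.5 Mbps, which is in line with the sustained-rate estimate.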
  • Peter Jakacki Posts: 10,193
    edited 2016-08-17 12:59
    I tend to agree with the figures that are being quoted as they are much the same as I find with Tachyon where the SPI loop is in cog memory as an "instruction". My SPI code runs at around 4MHz and for fills I can run sustained at around 3MHz but for reading bytes from memory and sending out over SPI the speed is around 1.48Mbps.

    Here is a one-liner I did just to test a 512 byte block of hub memory being sent.
    BUFFERS 512 LAP ADO I C@ SPIWRB DROP LOOP LAP .LAP 2.769ms ok
    

    I can do a bit better by reading longs and sending 32-bits over SPI with this test
    BUFFERS 512 LAP ADO I @ SPIWR SPIWR SPIWR SPIWR DROP 4 +LOOP LAP .LAP 1.898ms ok
    
    That works out at 2.158Mbps sustained but for a fill it can do better again with (once again using longs):
    0 128 LAP FOR SPIWR SPIWR SPIWR SPIWR NEXT DROP LAP .LAP 1.385ms ok
    
    Which is just about 3Mbps sustained. ( each SPIWR rotates the data MSB first 8 bits so 4 ops for 32-bits with the original data afterwards)
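The "rotate 8 bits, four times per long" behaviour described above can be modelled on a host in C. This is a hypothetical model of the description only, not Tachyon's actual SPIWR code; the function name is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical model of one byte-wide SPI write: return the byte that
 * would be shifted out (the current top byte, so MSB first) and rotate
 * the 32-bit value left by 8 bits. */
uint8_t spi_write_byte_model(uint32_t *data)
{
    uint8_t out = (uint8_t)(*data >> 24);   /* top byte goes out first */
    *data = (*data << 8) | (*data >> 24);   /* rotate left by 8 bits */
    return out;
}
```

Four calls send the long most significant byte first and leave the original value in place, which is why the Forth loops above only need a single DROP at the end.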
  • We will never be assimilated. :)
  • ErNa Posts: 1,752

    Yes, once indoctrinated, most people find it difficult to accept new ideas or different ways of doing things. Imagine if people were indoctrinated in "Forth" first!! :)
    You are absolutely right! That's what I often point out: if HP had targeted the HP45 at pupils and not at scientists, TI would not have won the calculator battle and RPN would rule!

  • MJB Posts: 1,235
    @Peter - looks like you are wasting your time here - this is a 'C' thread ...
    I will copy your snippets to my Tachyon document
  • MJB wrote: »
    @Peter - looks like you are wasting your time here - this is a 'C' thread ...
    I will copy your snippets to my Tachyon document

    Oh, I think it's admirable. I do the same thing with JetBrains IDEs here at work :) It has worked... I've converted a few. I think the same can be said for Peter.
  • Heater. Posts: 21,230
    How is JetBrains even comparable to Tachyon or any other Forth we see here?

    Forgive me if I am wrong but JetBrains is closed source. It has no place in my world.

    Meanwhile Tachyon is a language and run time system, available to all.



  • Heater. wrote: »
    How is JetBrains even comparable to Tachyon or any other Forth we see here?

    Forgive me if I am wrong but JetBrains is closed source. It has no place in my world.

    Meanwhile Tachyon is a language and run time system, available to all.

    Oh, I don't mean to compare Tachyon and JetBrains. I'm simply comparing Peter's evangelical attitude toward Tachyon with my attitude to all things JetBrains.
  • Heater. Posts: 21,230
    I guess we all have our flags to wave.

    Me, I wave the flag for anything open source and usable by anybody with no strings attached.

    But, that's just me.

  • Heater. wrote: »
    I guess we all have our flags to wave.

    Me, I wave the flag for anything open source and usable by anybody with no strings attached.

    But, that's just me.

    Not all of their products are open source, but some of them are: IntelliJ IDEA Community Edition. Full list of repos: https://github.com/JetBrains
  • Heater. Posts: 21,230
    OK. Not bad.
  • Peter Jakacki Posts: 10,193
    edited 2016-08-17 22:13
    I wasn't trying to push Forth but just confirming really that SPI's speed in a PASM loop and not unrolled or using counters is around 4Mbps but drops down to around 50% or less when writing from hub RAM plus overheads. If we dedicate a cog and counter as an SPI engine we can definitely do better but the thread is about a "quick and dirty" SPI routine which is what caught my eye :)

  • ersmith wrote: »
    SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal.

    Huh. So that means if I'm using the PropGCC bundled with the SimpleIDE download, I'm getting less than optimal results?

    How do I go about fixing that? I assume I should clone and make the github repository, is that correct?
  • ersmith Posts: 6,039
    edited 2016-08-18 19:24
    scandoslav wrote: »
    ersmith wrote: »
    SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal.

    Huh. So that means if I'm using the PropGCC bundled with the SimpleIDE download, I'm getting less than optimal results?
    Correct.
    How do I go about fixing that? I assume I should clone and make the github repository, is that correct?

    That's one way. Another way is to grab a prebuilt binary from @DavidZemon's web page.
  • Oh, I meant to post that this morning :)
    Here's all the download links: http://david.zemon.name/PropWare/RelatedLinks.xhtml
  • It seems like it's time for Parallax to post an update to PropGCC/SimpleIDE. Who's the best person to contact at Parallax to convince them that an update is needed? I know there have been a few bug fixes since the last update, and it would be nice to get those into the official version on the Parallax website.
  • Dave Hein wrote: »
    It seems like it's time for Parallax to post an update to PropGCC/SimpleIDE. Who's the best person to contact at Parallax to convince them that an update is needed? I know there have been a few bug fixes since the last update, and it would be nice to get those into the official version on the Parallax website.
    There is a bit of a problem with that. There are a few of the Simple Library functions that don't work properly with the newest PropGCC build. We will have to track down why and either fix the library or PropGCC before we can release a new SimpleIDE/PropGCC package.

  • I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?
  • Dave Hein wrote: »
    I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?
    Thanks for the offer. I'll try to get the list from Andy.

  • David Betz wrote: »
    Dave Hein wrote: »
    I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?
    Thanks for the offer. I'll try to get the list from Andy.

    Obviously the list belongs in GitHub's issue tracker for the Simple-Libraries repo: https://github.com/parallaxinc/Simple-Libraries/issues

    If there are issues that don't exist there, I'd ask that either Andy add them, or simply copy/paste them here on the forums and let the community do the work of typing them up in the issue tracker. Then, as Dave Hein suggests, we can all tackle them quickly.
  • DavidZemon Posts: 2,973
    edited 2016-08-18 22:37
    And it could be really helpful to create a "milestone" for the release of a new SimpleIDE package, and include within that milestone any tickets that must be completed prior to the new release.

    The ability to determine which issues are impeding a release is more rare though. It's one of the few parts of this that can not be easily crowd-sourced. Either you (David Betz), Andy, or someone else "in the know" will have to do that milestone labeling. Then the crowd can pick and choose from the tickets within that milestone which ones to tackle.
  • DavidZemon wrote: »
    Obviously the list belongs in GitHub's issue tracker for the Simple-Libraries repo: https://github.com/parallaxinc/Simple-Libraries/issues
    The issues listed at the GitHub site are as follows:

    - Please add to ColorPal library
    - Renaming this GitHub to something more appropriate
    - TOGGLE command bug in SimpleTools.h
    - wavplayer.c needs optimization check
    - sscani documented but missing
    - Simple Libraries appear to need lib*.a rebuilds for propeller-gcc default branch.
    - Doc SimpleText readStr() needs CR to complete entry.
    - Library Header simpletext.h Needs Print Format Specifier Descriptions
    - ee_put_float appears to only work when executed before other put actions
    - Other-Cog resources only support LMM/CMM

    I don't think any of these issues are related to using the latest PropGCC code.

  • Dave Hein wrote: »
    I don't think any of these issues are related to using the latest PropGCC code.
    I don't think it's any of these issues. I think Parallax considers the problem to be with the new GCC and not with the Simple Libraries so there wouldn't be an issue posted here. On the other hand, the issue could be posted to the propeller-gcc project. I'll see if I can track down the issues and get them posted there.

  • DavidZemon Posts: 2,973
    edited 2016-08-19 03:31
    Okay folks... PropWare is supertastic. I wrote the following test code:
    #include <PropWare/hmi/output/printer.h>
    #include <PropWare/gpio/pin.h>
    #include <PropWare/serial/spi/spi.h>
    
    static const PropWare::Port::Mask MOSI_MASK = PropWare::Port::P15;
    static const PropWare::Port::Mask SCLK_MASK = PropWare::Port::P14;
    
    int main () {
        const PropWare::Pin mosi(MOSI_MASK, PropWare::Port::OUT);
        const PropWare::Pin sclk(SCLK_MASK, PropWare::Port::OUT);
        const PropWare::Pin cs(PropWare::Port::P11, PropWare::Port::OUT);
        cs.high();
        PropWare::SPI spi = PropWare::SPI::get_instance();
        waitcnt(10 * MILLISECOND + CNT);
    
        spi.set_mosi(mosi.get_mask());
        spi.set_sclk(sclk.get_mask());
    
        const unsigned int start = CNT;
        cs.low();
        // The first parameter, 0, is the first address in the block. The second parameter is the number of bytes to send.
        spi.shift_out_block_msb_first_fast(0, 32 * 1024);
        cs.high();
        const unsigned int totalRuntime = PropWare::Utility::measure_time_interval(start);
    
    
        pwOut << "Runtime: " << totalRuntime << "us\n";
        return 0;
    }
    

    In CMM mode, this results in a code size of 4,892 bytes and total size of 5,200 bytes. Those numbers include the CMM interpreter.

    The output is "Runtime: 78679us"

    32*8*1024/(78679*10^(-6)) = 3331816 = 3.3 Mbps. Confirmed with my Saleae:
    spi_runtime.png

    So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!
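The throughput figure above follows directly from the byte count and the measured runtime. A minimal C sketch of that conversion, with a made-up helper name and assuming the runtime is reported in microseconds as in the output shown:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a transfer size and a measured runtime into bits per second.
 * 64-bit math avoids overflow in the intermediate product. */
uint64_t throughput_bps(uint32_t bytes, uint32_t elapsed_us)
{
    assert(elapsed_us > 0);
    return ((uint64_t)bytes * 8u * 1000000u) / elapsed_us;
}
```

Calling `throughput_bps(32 * 1024, 78679)` reproduces the 3,331,816 bps figure, truncated to a whole number.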
  • DavidZemon wrote: »
    Okay folks... PropWare is supertastic. I wrote the following test code:
    #include <PropWare/hmi/output/printer.h>
    #include <PropWare/gpio/pin.h>
    #include <PropWare/serial/spi/spi.h>
    
    static const PropWare::Port::Mask MOSI_MASK = PropWare::Port::P15;
    static const PropWare::Port::Mask SCLK_MASK = PropWare::Port::P14;
    
    int main () {
        const PropWare::Pin mosi(MOSI_MASK, PropWare::Port::OUT);
        const PropWare::Pin sclk(SCLK_MASK, PropWare::Port::OUT);
        const PropWare::Pin cs(PropWare::Port::P11, PropWare::Port::OUT);
        cs.high();
        PropWare::SPI spi = PropWare::SPI::get_instance();
        waitcnt(10*MILLISECOND + CNT);
    
        spi.set_mosi(mosi.get_mask());
        spi.set_sclk(sclk.get_mask());
    
        volatile unsigned int start = CNT;
        cs.low();
        // The first parameter, 0, is the first address in the block. The second parameter is the number of bytes to send.
        spi.shift_out_block_msb_first_fast(0, 32*1024);
        cs.high();
        volatile unsigned  int totalRuntime = PropWare::Utility::measure_time_interval(start);
    
        
        pwOut << "Runtime: " << totalRuntime << "us\n";
        return 0;
    }
    

    In CMM mode, this results in a code size of 4,892 bytes and total size of 5,200 bytes. Those numbers include the CMM interpreter.

    The output is "Runtime: 78679us"

    32*8*1024/(78679*10^(-6)) = 3331816 = 3.3 Mbps. Confirmed with my Saleae:
    spi_runtime.png

    So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!
    Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

  • David Betz wrote: »
    Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

    I found that gcc sometimes inlined the declaration of start into the function call that measures the time. It shouldn't, since CNT is volatile... but it did. I haven't tested that in a long time, so I don't have a repeatable test case at the moment.
  • DavidZemon wrote: »
    David Betz wrote: »
    Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

    I found that gcc sometimes inlined the declaration of start into the function call that measures the time. It shouldn't, since CNT is volatile... but it did. I haven't tested that in a long time, so I don't have a repeatable test case at the moment.
    If you find a case where it fails please let us know. That is a bug that should certainly be fixed!

  • David Betz wrote: »
    DavidZemon wrote: »
    David Betz wrote: »
    Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

    I found that gcc sometimes inlined the declaration of start into the function call that measures the time. It shouldn't, since CNT is volatile... but it did. I haven't tested that in a long time, so I don't have a repeatable test case at the moment.
    If you find a case where it fails please let us know. That is a bug that should certainly be fixed!

    I tried my original code snippet as well as a couple of other examples, and everything seems to be behaving as expected. Not sure what my original issue was... perhaps I did something wrong, or perhaps it's a bug that has already been fixed. In any case, I updated my example code above to use "const unsigned int" instead of "volatile unsigned int".
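The timing pattern discussed in this exchange (read the volatile counter once into an ordinary local, do the work, then subtract) can be sketched in portable C. The `fake_cnt` variable is a hypothetical stand-in for the Propeller's CNT register so that the sketch builds on a host; it is not part of PropGCC or PropWare.

```c
#include <stdint.h>

/* Hypothetical stand-in for the free-running CNT register. It is declared
 * volatile so every read really happens and is never reordered away. */
volatile uint32_t fake_cnt;

/* Take one snapshot of the counter. The caller can keep the result in a
 * plain (non-volatile) local, because only the source needs to be volatile. */
uint32_t read_counter(void)
{
    return fake_cnt;
}

/* Unsigned subtraction yields the correct interval even if the 32-bit
 * counter wrapped around once between the two snapshots. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}
```

Because the counter itself is volatile, the snapshot does not need to be, which is consistent with the switch to "const unsigned int" above.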
  • Peter Jakacki Posts: 10,193
    edited 2016-08-19 08:33
    DavidZemon wrote: »
    The output is "Runtime: 78679us"

    32*8*1024/(78679*10^(-6)) = 3331816 = 3.3 Mbps. Confirmed with my Saleae:
    spi_runtime.png

    So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!

    "famed" Tachyon Forth hey? :)

    Seeing as you are using a block function rather than individual calls (or instructions, in my case), I will also use a block function. Normally I do a block move from and to virtual memory in the SD card, but here is one that just uses hub memory:
    [BSPIWR] 0 $8000 LAP RUNMOD LAP .LAP 78.644ms ok
    
    Just a tiny tad faster than the spi.shift_out_block_msb_first_fast function but also just confirming what I was saying about SPI on the Prop, that this is about its limit in software alone. Since I only got back home sitting down to coffee and strudel to test this out now, I will attach some timings back into this post a little later. (Done, 2.4us/byte or 3.33333Mbps)

    BLOCK SPI.jpg

    BTW, [BSPIWR] loads the function as a COG MODULE where I have a small area reserved for these functions in the Tachyon cog memory that are invoked with RUNMOD. The setup masks aren't shown as they do not affect the actual timings.

    BTW again, high five DZ dude, love a challenge :)

  • The original thread question was about Spin vs. C speed. We've seen that C++ and Tachyon can both use special libraries to implement SPI at 3.3 Mbps, but this doesn't necessarily tell us much about the language speeds. What happens if you have to implement a custom bit-banging protocol to control a board? Which language should you choose then?

    I submit that the answer is Spin, compiled with fastspin. Here's a Spin version (pure Spin, no PASM!) of the SPI protocol:
    CON
      _clkmode = xtal1 + pll16x
      _clkfreq = 80_000_000
     DO  = 0     'pins
     CLK = 1
     CS  = 2
    
    OBJ
      fds : "FullDuplexSerial.spin"
    
    VAR
     byte array[512]
    
    pub Main | i, time, ptr
    
     '' start up the serial port
     fds.start(31, 30, 0, 115200)
    
     outa[CS]  := 1
     dira[CS]  := 1
     dira[DO]  := 1
     dira[CLK] := 1
    
     repeat i from 0 to 511
       array[i] := i & 255
    
     time := cnt
     ptr := @array[0]
     outa[CS] := 0
     repeat 512
       spiout(byte[ptr])
       ptr++
     outa[CS] := 1
    
     time := cnt-time
    ' print time
     fds.str(string("elapsed time: "))
     fds.dec(time)
     fds.str(string(" cycles", 13, 10))
     repeat
     
    pub spiout(value)
     value ><= 8            'MSB first
     repeat 8               '8 bits
       outa[DO] := value
       outa[CLK] := 1
       value >>= 1
       outa[CLK] := 0
    

    Compile it with the just released fastspin 3.1.0 using:
    fastspin spitest.spin
    
    and load the resulting spitest.binary using the tool of your choice, to get:
    elapsed time: 98848 cycles
    
    That's for 512 bytes (4,096 bits) at 80 MHz, so the speed is 4096 × 80,000,000 / 98,848 ≈ 3,314,988 bps, or about 3.3 Mbps. That is slightly slower than the PropWare and Tachyon versions, but note that there is absolutely no PASM, inline or otherwise, needed and no custom libraries: we're just directly bit-banging.
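For comparison with the C side of the thread, the same bit order can be sketched in plain C. This is a host-side simulation under stated assumptions: the pin variables and the capture logic are hypothetical stand-ins for OUTA writes and for a receiver sampling on the rising clock edge, and nothing here was compiled with PropGCC or timed.

```c
#include <stdint.h>

/* Hypothetical pin state so the sketch runs on a host. */
static unsigned data_pin;
static unsigned clock_pin;
static uint8_t captured;   /* bits a receiver would sample on rising edges */

static void set_clock(unsigned level)
{
    if (level && !clock_pin)   /* rising edge: receiver samples the data pin */
        captured = (uint8_t)((captured << 1) | data_pin);
    clock_pin = level;
}

/* Shift one byte out MSB first: set data, raise the clock, lower the clock.
 * This is the same bit order as the Spin spiout routine. */
void spi_out(uint8_t value)
{
    for (int bit = 7; bit >= 0; bit--) {
        data_pin = (value >> bit) & 1u;
        set_clock(1);
        set_clock(0);
    }
}

/* The last eight bits seen by the simulated receiver. */
uint8_t last_captured(void)
{
    return captured;
}
```

After `spi_out(0xA5)`, the simulated receiver holds 0xA5, which confirms the bits go out most significant first.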