Spin vs. C Speed

ersmith · 2016-08-17 12:19

Sal Ammoniac wrote: »

Here are some results of a simple SPI master coded in C and built using SimpleIDE v1.0.2 (RC2) with optimization set to "-O2 Speed":

CMM: 1.33 Mbps
LMM: 1.33 Mbps
COG: 1.43 Mbps

I'm surprised to see no difference between the CMM and LMM results and am also surprised to see the COG model results so slow.

These results are much better than the 14 Kbps I was getting with Spin, but still disappointing.

They're all essentially the same because the inner loop will be running out of COG memory in all 3 cases. SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal. That said, I wouldn't expect sustained rates much above 2.5 Mbps at 80 MHz. It's going to take at least 4 instructions (16 cycles) to output a bit, so the upper limit is 5 Mbps, but in practice loop overhead and fetching data from memory are going to make the sustained rate quite a bit slower.
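
As a rough illustration of that upper-limit arithmetic, here is a minimal C sketch. The function name max_bit_rate_bps is made up for this example, and the fixed cost per instruction (4 cycles, as implied by "4 instructions (16 cycles)") is an assumption that ignores hub access and loop overhead.

```c
#include <stdint.h>

/* Hypothetical helper: best-case bit rate when every bit costs a fixed
 * number of instructions and every instruction costs a fixed number of
 * clock cycles. Real loops are slower because of loop overhead and
 * memory fetches. */
uint32_t max_bit_rate_bps(uint32_t clock_hz,
                          uint32_t instructions_per_bit,
                          uint32_t cycles_per_instruction)
{
    uint32_t cycles_per_bit = instructions_per_bit * cycles_per_instruction;
    return clock_hz / cycles_per_bit;
}
```

With an 80 MHz clock, 4 instructions per bit and 4 cycles per instruction, max_bit_rate_bps(80000000, 4, 4) returns 5000000, which is the 5 Mbps ceiling quoted above.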

Are you able to share your code (Spin or C)? We may be able to suggest some improvements.

Peter Jakacki · 2016-08-17 12:52

I tend to agree with the figures that are being quoted as they are much the same as I find with Tachyon where the SPI loop is in cog memory as an "instruction". My SPI code runs at around 4MHz and for fills I can run sustained at around 3MHz but for reading bytes from memory and sending out over SPI the speed is around 1.48Mbps.

Here is a one-liner I did just to test a 512 byte block of hub memory being sent.

BUFFERS 512 LAP ADO I C@ SPIWRB DROP LOOP LAP .LAP 2.769ms ok

I can do a bit better by reading longs and sending 32-bits over SPI with this test

BUFFERS 512 LAP ADO I @ SPIWR SPIWR SPIWR SPIWR DROP 4 +LOOP LAP .LAP 1.898ms ok

That works out at 2.158Mbps sustained but for a fill it can do better again with (once again using longs):

0 128 LAP FOR SPIWR SPIWR SPIWR SPIWR NEXT DROP LAP .LAP 1.385ms ok

Which is just about 3Mbps sustained. (Each SPIWR rotates the data out MSB first, 8 bits at a time, so it takes 4 ops for 32 bits, and the original data is back in place afterwards.)
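
For anyone who wants to check these figures, the conversion from a LAP time to a sustained bit rate can be sketched in C as below. The helper name throughput_bps is hypothetical, and it assumes the elapsed time is given in whole microseconds.

```c
#include <stdint.h>

/* Hypothetical helper: sustained throughput in bits per second for
 * `bytes` bytes transferred in `elapsed_us` microseconds.
 * The caller must pass a non-zero elapsed time. */
uint64_t throughput_bps(uint32_t bytes, uint32_t elapsed_us)
{
    uint64_t bits = (uint64_t)bytes * 8u;
    return bits * 1000000u / elapsed_us;
}
```

Feeding in 512 bytes with the three timings above (2769, 1898 and 1385 microseconds) gives roughly 1.48, 2.16 and 2.96 Mbps respectively.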

Dave Hein · 2016-08-17 13:05

We will never be assimilated.

ErNa · 2016-08-17 15:11

Peter Jakacki wrote: »

Yes, once indoctrinated, most people find it difficult to accept new ideas or different ways of doing things. Imagine if people were indoctrinated in "Forth" first!!

You are absolutely right! That's what I often point to: if HP had targeted the HP-45 at pupils and not at scientists, TI would not have won the battle on calculators and RPN would be governing!

MJB · 2016-08-17 16:04

Peter Jakacki wrote: »
I tend to agree with the figures that are being quoted as they are much the same as I find with Tachyon where the SPI loop is in cog memory as an "instruction". My SPI code runs at around 4MHz and for fills I can run sustained at around 3MHz but for reading bytes from memory and sending out over SPI the speed is around 1.48Mbps.

Here is a one-liner I did just to test a 512 byte block of hub memory being sent.
BUFFERS 512 LAP ADO I C@ SPIWRB DROP LOOP LAP .LAP 2.769ms ok
I can do a bit better by reading longs and sending 32-bits over SPI with this test
BUFFERS 512 LAP ADO I @ SPIWR SPIWR SPIWR SPIWR DROP 4 +LOOP LAP .LAP 1.898ms ok
That works out at 2.158Mbps sustained but for a fill it can do better again with (once again using longs):
0 128 LAP FOR SPIWR SPIWR SPIWR SPIWR NEXT DROP LAP .LAP 1.385ms ok
Which is just about 3Mbps sustained. ( each SPIWR rotates the data MSB first 8 bits so 4 ops for 32-bits with the original data afterwards)

@Peter - looks like you are wasting your time here - this is a 'C' thread ...
I will copy your snippets to my Tachyon document

DavidZemon · 2016-08-17 16:47

MJB wrote: »

@Peter - looks like you are wasting your time here - this is a 'C' thread ...
I will copy your snippets to my Tachyon document

Oh, I think it's admirable. I do the same thing with JetBrains IDEs here at work

it has worked... I've converted a few. I think the same can be said for Peter.

Heater. · 2016-08-17 17:51

How is JetBrains even comparable to Tachyon or any other Forth we see here?

Forgive me if I am wrong but JetBrains is closed source. It has no place in my world.

Meanwhile Tachyon is a language and run time system, available to all.

DavidZemon · 2016-08-17 18:30

Heater. wrote: »

How is JetBrains even comparable to Tachyon or any other Forth we see here?

Forgive me if I am wrong but JetBrains is closed source. It has no place in my world.

Meanwhile Tachyon is a language and run time system, available to all.

Oh, I don't mean to compare Tachyon and JetBrains. I'm simply comparing Peter's evangelical attitude toward Tachyon with my attitude toward all things JetBrains.

Heater. · 2016-08-17 19:13

I guess we all have our flags to wave.

Me, I wave the flag for anything open source and usable by anybody with no strings attached.

But, that's just me.

DavidZemon · 2016-08-17 19:30

Heater. wrote: »

I guess we all have our flags to wave.

Me, I wave the flag for anything open source and usable by anybody with no strings attached.

But, that's just me.

Not all of their products are open source, but some of them are: IntelliJ IDEA Community Edition. Full list of repos: https://github.com/JetBrains

Heater. · 2016-08-17 19:34

OK. Not bad.

Peter Jakacki · 2016-08-17 22:12

I wasn't trying to push Forth, just confirming that SPI speed in a PASM loop (not unrolled or using counters) is around 4Mbps, but drops to around 50% or less when writing from hub RAM, plus overheads. If we dedicate a cog and counter as an SPI engine we can definitely do better, but the thread is about a "quick and dirty" SPI routine, which is what caught my eye.

scandoslav · 2016-08-18 03:12

ersmith wrote: »

SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal.

Huh. So that means if I'm using the PropGCC bundled with the SimpleIDE download, I'm getting less than optimal results?

How do I go about fixing that? I assume I should clone and make the github repository, is that correct?

ersmith · 2016-08-18 19:24

scandoslav wrote: »

ersmith wrote: »

SimpleIDE has a fairly old version of the PropGCC compiler, so the code generation is less than optimal.

Huh. So that means if I'm using the PropGCC bundled with the SimpleIDE download, I'm getting less than optimal results?

Correct.

How do I go about fixing that? I assume I should clone and make the github repository, is that correct?

That's one way. Another way is to grab a prebuilt binary from @DavidZemon's web page.

DavidZemon · 2016-08-18 19:55

Oh, I meant to post that this morning

Here's all the download links: http://david.zemon.name/PropWare/RelatedLinks.xhtml

Dave Hein · 2016-08-18 20:05

It seems like it's time for Parallax to post an update to PropGCC/SimpleIDE. Who's the best person to contact at Parallax to convince them that an update is needed? I know there have been a few bug fixes since the last update, and it would be nice to get those into the official version on the Parallax website.

David Betz · 2016-08-18 20:06

Dave Hein wrote: »

It seems like it's time for Parallax to post an update to PropGCC/SimpleIDE. Who's the best person to contact at Parallax to try to convince that an update is needed? I know there have been a few bug fixes since the last update, and it would be nice to get those into the official version on the Parallax website.

There is a bit of a problem with that. There are a few of the Simple Library functions that don't work properly with the newest PropGCC build. We will have to track down why and either fix the library or PropGCC before we can release a new SimpleIDE/PropGCC package.

Dave Hein · 2016-08-18 20:26

I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?

David Betz · 2016-08-18 20:33

Dave Hein wrote: »

I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?

Thanks for the offer. I'll try to get the list from Andy.

DavidZemon · 2016-08-18 22:31

David Betz wrote: »

Dave Hein wrote: »

I'm sure there are several people on the forum that would look into these issues and fix them in a few days or weeks. Is there a list of issues posted somewhere?

Thanks for the offer. I'll try to get the list from Andy.

Obviously the list belongs in GitHub's issue tracker for the Simple-Libraries repo: https://github.com/parallaxinc/Simple-Libraries/issues

If there are issues that don't exist there, I'd ask that either Andy add them, or simply copy/paste them here on the forums and let the community do the work of typing them up in the issue tracker. Then, as Dave Hein suggests, we can all tackle them quickly.

DavidZemon · 2016-08-18 22:34

And it could be really helpful to create a "milestone" for the release of a new SimpleIDE package, and include within that milestone any tickets that must be completed prior to the new release.

The ability to determine which issues are impeding a release is rarer, though. It's one of the few parts of this that cannot be easily crowd-sourced. Either you (David Betz), Andy, or someone else "in the know" will have to do that milestone labeling. Then the crowd can pick and choose from the tickets within that milestone which ones to tackle.

Dave Hein · 2016-08-18 23:18

DavidZemon wrote: »

Obviously the list belongs in GitHub's issue tracker for the Simple-Libraries repo: https://github.com/parallaxinc/Simple-Libraries/issues

The issues listed at the GitHub site are as follows:

- Please add to ColorPal library
- Renaming this GitHub to something more appropriate
- TOGGLE command bug in SimpleTools.h
- wavplayer.c needs optimization check
- sscani documented but missing
- Simple Libraries appear to need lib*.a rebuilds for propeller-gcc default branch.
- Doc SimpleText readStr() needs CR to complete entry.
- Library Header simpletext.h Needs Print Format Specifier Descriptions
- ee_put_float appears to only work when executed before other put actions
- Other-Cog resources only support LMM/CMM

I don't think any of these issues are related to using the latest PropGCC code.

David Betz · 2016-08-19 00:00

Dave Hein wrote: »

DavidZemon wrote: »

Obviously the list belongs in GitHub's issue tracker for the Simple-Libraries repo: https://github.com/parallaxinc/Simple-Libraries/issues

The issues listed at the GitHub site are as follows:

- Please add to ColorPal library
- Renaming this GitHub to something more appropriate
- TOGGLE command bug in SimpleTools.h
- wavplayer.c needs optimization check
- sscani documented but missing
- Simple Libraries appear to need lib*.a rebuilds for propeller-gcc default branch.
- Doc SimpleText readStr() needs CR to complete entry.
- Library Header simpletext.h Needs Print Format Specifier Descriptions
- ee_put_float appears to only work when executed before other put actions
- Other-Cog resources only support LMM/CMM

I don't think any of these issues are related to using the latest PropGCC code.

I don't think it's any of these issues. I think Parallax considers the problem to be with the new GCC and not with the Simple Libraries so there wouldn't be an issue posted here. On the other hand, the issue could be posted to the propeller-gcc project. I'll see if I can track down the issues and get them posted there.

DavidZemon · 2016-08-19 02:05

Okay folks... PropWare is supertastic. I wrote the following test code:

#include <PropWare/hmi/output/printer.h>
#include <PropWare/gpio/pin.h>
#include <PropWare/serial/spi/spi.h>

static const PropWare::Port::Mask MOSI_MASK = PropWare::Port::P15;
static const PropWare::Port::Mask SCLK_MASK = PropWare::Port::P14;

int main () {
    const PropWare::Pin mosi(MOSI_MASK, PropWare::Port::OUT);
    const PropWare::Pin sclk(SCLK_MASK, PropWare::Port::OUT);
    const PropWare::Pin cs(PropWare::Port::P11, PropWare::Port::OUT);
    cs.high();
    PropWare::SPI spi = PropWare::SPI::get_instance();
    waitcnt(10 * MILLISECOND + CNT);

    spi.set_mosi(mosi.get_mask());
    spi.set_sclk(sclk.get_mask());

    const unsigned int start = CNT;
    cs.low();
    // The first parameter, 0, is the first address in the block. The second parameter is the number of bytes to send.
    spi.shift_out_block_msb_first_fast(0, 32 * 1024);
    cs.high();
    const unsigned int totalRuntime = PropWare::Utility::measure_time_interval(start);


    pwOut << "Runtime: " << totalRuntime << "us\n";
    return 0;
}

In CMM mode, this results in a code size of 4,892 bytes and total size of 5,200 bytes. Those numbers include the CMM interpreter.

The output is "Runtime: 78679us"

32*8*1024/(78679*10^(-6)) = 3331816 bps = 3.3 Mbps. Confirmed with my Saleae.

So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!
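
The timing in this test comes from two readings of the free-running 32-bit CNT counter. As a sketch of that idea in plain C (this is not PropWare's actual implementation; the name elapsed_us and its parameters are hypothetical), unsigned subtraction keeps the result correct even if the counter wraps once between the two readings, which happens roughly every 54 seconds at 80 MHz.

```c
#include <stdint.h>

/* Hypothetical helper, not part of PropWare: microseconds elapsed between
 * two readings of a free-running 32-bit tick counter. Unsigned subtraction
 * is performed modulo 2^32, so a single wrap between the readings is
 * handled automatically. */
uint32_t elapsed_us(uint32_t start_ticks, uint32_t end_ticks, uint32_t clock_hz)
{
    uint32_t ticks = end_ticks - start_ticks;          /* modulo 2^32 */
    return (uint32_t)((uint64_t)ticks * 1000000u / clock_hz);
}
```

For example, 6294320 ticks at 80 MHz come out as 78679 microseconds, matching the runtime reported above.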

David Betz · 2016-08-19 02:36

DavidZemon wrote: »
Okay folks... PropWare is supertastic. I wrote the following test code:
#include <PropWare/hmi/output/printer.h>
#include <PropWare/gpio/pin.h>
#include <PropWare/serial/spi/spi.h>

static const PropWare::Port::Mask MOSI_MASK = PropWare::Port::P15;
static const PropWare::Port::Mask SCLK_MASK = PropWare::Port::P14;

int main () {
    const PropWare::Pin mosi(MOSI_MASK, PropWare::Port::OUT);
    const PropWare::Pin sclk(SCLK_MASK, PropWare::Port::OUT);
    const PropWare::Pin cs(PropWare::Port::P11, PropWare::Port::OUT);
    cs.high();
    PropWare::SPI spi = PropWare::SPI::get_instance();
    waitcnt(10*MILLISECOND + CNT);

    spi.set_mosi(mosi.get_mask());
    spi.set_sclk(sclk.get_mask());

    volatile unsigned int start = CNT;
    cs.low();
    // The first parameter, 0, is the first address in the block. The second parameter is the number of bytes to send.
    spi.shift_out_block_msb_first_fast(0, 32*1024);
    cs.high();
    volatile unsigned  int totalRuntime = PropWare::Utility::measure_time_interval(start);

    
    pwOut << "Runtime: " << totalRuntime << "us\n";
    return 0;
}
In CMM mode, this results in a code size of 4,892 bytes and total size of 5,200 bytes. Those numbers include the CMM interpreter.

The output is "Runtime: 78679us"

32*8*1024/(78679*10^(-6)) = 3331816 = 3.3 Mbps. Confirmed with my Saleae:

So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!

Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

DavidZemon · 2016-08-19 03:01

David Betz wrote: »
Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.

I found that gcc sometimes inlined the declaration of start in the function call to measure time. It shouldn't, since CNT is volatile... But it did. I haven't tested that in a long time, so I don't have a repeatable test case at the moment.

David Betz · 2016-08-19 03:07

DavidZemon wrote: »
David Betz wrote: »
Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.
I found that gcc sometimes inlined the declaration of start in the function call to measure time. It shouldn't since CNT is volatile... But it did. Haven't tested that in a long time so I don't have repeatable test case at the moment

If you find a case where it fails please let us know. That is a bug that should certainly be fixed!

DavidZemon · 2016-08-19 03:32

David Betz wrote: »
DavidZemon wrote: »
David Betz wrote: »
Why do you use volatile when declaring start and totalRuntime? They aren't accessed by other COGs.
I found that gcc sometimes inlined the declaration of start in the function call to measure time. It shouldn't since CNT is volatile... But it did. Haven't tested that in a long time so I don't have repeatable test case at the moment
If you find a case where it fails please let us know. That is a bug that should certainly be fixed!

I tried my original code snippet as well as a couple of other examples, and everything seems to be behaving as expected. Not sure what my original issue was... perhaps I did something wrong, perhaps it's a bug that has already been fixed? In any case, I updated my example code above to use "const unsigned int" instead of "volatile unsigned int".

Peter Jakacki · 2016-08-19 04:35

DavidZemon wrote: »

The output is "Runtime: 78679us"

32*8*1024/(78679*10^(-6)) = 3331816 = 3.3 Mbps. Confirmed with my Saleae:

So, I do believe that's one benchmark where PropWare is faster than the famed Tachyon Forth! And that's without writing a special function to shift blocks out 32-bits at a time!

"famed" Tachyon Forth hey?

Seeing as you are using a block function rather than individual calls (or instructions, in my case), I will also use a block function. Normally I do a block move from and to virtual memory in the SD card, but here is one that just uses hub memory:

[BSPIWR] 0 $8000 LAP RUNMOD LAP .LAP 78.644ms ok

Just a tiny tad faster than the spi.shift_out_block_msb_first_fast function, but also just confirming what I was saying about SPI on the Prop: this is about its limit in software alone. Since I only just got back home and sat down to coffee and strudel to test this out, I will attach some timings to this post a little later. (Done: 2.4us/byte, or 3.33333Mbps)

[Attached image: BLOCK SPI.jpg]

BTW, [BSPIWR] loads the function as a COG MODULE where I have a small area reserved for these functions in the Tachyon cog memory that are invoked with RUNMOD. The setup masks aren't shown as they do not affect the actual timings.

BTW again, high five DZ dude, love a challenge

ersmith · 2016-08-19 18:29

The original thread question was about Spin vs. C speed. We've seen that C++ and Tachyon can both use special libraries to implement SPI at 3.3 Mbps, but this doesn't necessarily tell us much about the language speeds. What happens if you have to implement a custom bit-banging protocol to control a board? Which language should you choose then?

I submit that the answer is Spin, compiled with fastspin. Here's a Spin version (pure Spin, no PASM!) of the SPI protocol:

CON
  _clkmode = xtal1 + pll16x
  _clkfreq = 80_000_000
 DO  = 0     'pins
 CLK = 1
 CS  = 2

OBJ
  fds : "FullDuplexSerial.spin"

VAR
 byte array[512]

pub Main | i, time, ptr

 '' start up the serial port
 fds.start(31, 30, 0, 115200)

 outa[CS]  := 1
 dira[CS]  := 1
 dira[DO]  := 1
 dira[CLK] := 1

 repeat i from 0 to 511
   array[i] := i & 255

 time := cnt
 ptr := @array[0]
 outa[CS] := 0
 repeat 512
   spiout(byte[ptr])
   ptr++
 outa[CS] := 1

 time := cnt-time
' print time
 fds.str(string("elapsed time: "))
 fds.dec(time)
 fds.str(string(" cycles", 13, 10))
 repeat
 
pub spiout(value)
 value ><= 8            'MSB first
 repeat 8               '8 bits
   outa[DO] := value
   outa[CLK] := 1
   value >>= 1
   outa[CLK] := 0

Compile it with the just released fastspin 3.1.0 using:

fastspin spitest.spin

and load the resulting spitest.binary using the tool of your choice, to get:

elapsed time: 98848 cycles

That's for 512 bytes (4096 bits) at 80 MHz, so the speed is 4096 * 80000000 / 98848 = 3314988 bps, or about 3.3 Mbps -- slightly slower than the PropWare and Tachyon versions, but note that there is absolutely no PASM, inline or otherwise, needed and no custom libraries -- we're just directly bit-banging.
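
For readers more at home in C, the bit-banging technique in spiout (reverse the byte once, then shift out the least significant bit each time) can be sketched as follows. This is a desktop simulation, not Propeller code: the outa variable, set_pin, reverse8 and last_captured are all stand-ins invented for this example so the logic can be checked without hardware.

```c
#include <stdint.h>

enum { DO_PIN = 0, CLK_PIN = 1 };

/* Simulated output register standing in for the Propeller's OUTA. */
static uint32_t outa;

/* Byte assembled by a simulated SPI receiver that samples DO on each
 * rising edge of CLK, shifting bits in MSB first. */
static uint8_t captured;

/* Drive one simulated pin with the lowest bit of `level`, like a 1-bit
 * outa[pin] assignment in Spin, and sample DO on a rising CLK edge. */
static void set_pin(int pin, uint32_t level)
{
    uint32_t old = outa;
    outa = (outa & ~(1u << pin)) | ((level & 1u) << pin);
    if (pin == CLK_PIN && !(old & (1u << CLK_PIN)) && (outa & (1u << CLK_PIN)))
        captured = (uint8_t)((captured << 1) | ((outa >> DO_PIN) & 1u));
}

/* Reverse the low 8 bits of v, mirroring Spin's "value ><= 8". */
static uint32_t reverse8(uint32_t v)
{
    uint32_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (r << 1) | (v & 1u);
        v >>= 1;
    }
    return r;
}

/* Send one byte MSB first: after the reversal, the original MSB sits in
 * bit 0, so shifting right walks through the byte from MSB to LSB. */
void spiout(uint32_t value)
{
    value = reverse8(value);
    for (int i = 0; i < 8; i++) {
        set_pin(DO_PIN, value);   /* present the next data bit */
        set_pin(CLK_PIN, 1);      /* rising edge: receiver samples */
        value >>= 1;
        set_pin(CLK_PIN, 0);      /* falling edge */
    }
}

/* Byte seen by the simulated receiver after the last 8 clock pulses. */
uint8_t last_captured(void)
{
    return captured;
}
```

Calling spiout(0x96) and then last_captured() yields 0x96, which confirms that the receiver sees the byte MSB first.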
