Spin speed up

EE351 · 2015-03-08 12:38

Hi all,

I have a general question about spin. I'm using an enc28j60 driver in a project and while it is fast, I need to squeeze a little more speed out of it. I'm noticing the spi portion of the driver is written in pasm, but the higher level stuff is written in spin. Here is my question:

I'm noticing there are a lot of nested pub methods being called, sometimes 3 to 4 levels deep. Some of these methods are only 1 or 2 lines each. Would the spin code execute faster if these methods are consolidated into the first method that calls them (presumably at the expense of more memory being used and less readable code)?

Also, I'm using the clock from the enc28j60 to drive the prop at 100mhz. Any insight would be helpful!

kwinn · 2015-03-08 12:47

Consolidating the code to reduce the number of calls/returns will speed up the code.
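
A minimal sketch of what consolidating means in practice. The method names (sum_nested, add_one, accumulate, sum_inlined) are hypothetical and are not taken from the enc28j60 driver; they only show a two-level call chain being folded into its caller.

```spin
VAR
  long total

PUB main
  sum_nested(100)                ' total = 5050, reached through two extra calls per iteration
  sum_inlined(100)               ' total = 5050, with no method calls inside the loop
  repeat                         ' idle so the cog keeps running

PUB sum_nested(n) | i
  total := 0
  repeat i from 1 to n
    add_one(i)                   ' call level 1

PRI add_one(x)
  accumulate(x)                  ' call level 2, a one-line wrapper

PRI accumulate(x)
  total += x

PUB sum_inlined(n) | i
  total := 0
  repeat i from 1 to n
    total += i                   ' same work, wrapper bodies moved into the caller
```

Both methods leave the same value in total; the inlined version simply skips the call and return work on every pass through the loop.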

Duane Degn · 2015-03-08 12:55

Yes, Spin is painfully slow (struck out) wonderful. Moving the small methods inline should help a bit.

You can of course time how long code takes to execute with:

  time := -cnt
  ' do stuff
  time += cnt

The variable "time" should hold the time it took to execute the code plus a small overhead. You can measure the overhead by having no code in the "do stuff" section and then subtracting that overhead from the final value. If you just want to see which code runs faster then you don't really need to worry about the overhead.
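
Put together, the measurement might look like the sketch below. The clock settings (5 MHz crystal with a 16x PLL), the variable names and the loop being timed are all assumptions for illustration; use your own clock configuration and the code you actually want to time.

```spin
CON
  _clkmode = xtal1 + pll16x      ' assumed clock setup: 5 MHz crystal x 16 = 80 MHz
  _xinfreq = 5_000_000

VAR
  long overhead                  ' ticks measured with nothing between the two timing lines
  long elapsed                   ' ticks for the code under test, with the overhead removed
  long dummy

PUB main | time, i

  ' Measure the fixed cost of the two timing lines themselves.
  time := -cnt
  time += cnt
  overhead := time

  ' Time the code of interest (a placeholder loop here).
  time := -cnt
  repeat i from 1 to 100
    dummy += i
  time += cnt
  elapsed := time - overhead

  repeat                         ' idle; report overhead and elapsed with your own debug output
```

Dividing elapsed by clkfreq converts the tick count to seconds.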

JonnyMac · 2015-03-08 18:25

If you want to be really fussy about it -- as I am -- subtract 544 from cnt before adding it into your elapsed time variable; this will remove the overhead of the call (with nothing between the two lines you will get 0)

  ticks := -cnt
  ' code to time
  ticks += cnt - 544

Whoops... as soon as I post I see that Duane mentioned the overhead. He's right; if you're just looking for a delta between two bits of code, the overhead is irrelevant.

If you have the space, "unrolled" code tends to run faster as you're not adding the loop overhead into the time required.

Phil Pilgrim (PhiPi) · 2015-03-08 18:34

Duane Degn wrote:

Yes, Spin is painfully slow.

Painfully? Come on, Duane! As a RAD (rapid application development) language and as a glue language for PASM, Spin still ranks #1 with me! And -- really -- it's not all THAT slow!

-Phil

JonnyMac · 2015-03-08 19:18

Spin still ranks #1 with me!

Me, too!

Bob Lawrence (VE1RLL) · 2015-03-08 19:29

re:Spin still ranks #1 with me!

Me three

Duane Degn · 2015-03-08 19:45

Phil Pilgrim (PhiPi) wrote: »

Painfully? Come on, Duane!

That was my frustration about my lack of progress with my hexapod code speaking.

I still love Spin. It's by far my favorite language to program with.

I should have qualified "painfully." While Spin was fast enough to get my hexapod to walk relatively smoothly, I find it's not fast enough to run all the calculations needed when giving the robot a more complex gait (walking while tilted). Fortunately I have several free cogs so I can spread the work among multiple cogs each with a F32 coprocessor. I was griping since Spin's limited speed is making the program more complex than I would prefer.

Though honestly, I'm not sure if the bottleneck is Spin or the sheer number of equations I'm asking F32 to perform.

I'm a huge fan of Spin.

BTW, if any of you haven't seen the latest video of my hexapod, make sure to check it out (post #73 of the above link). IMO, it's still pretty cool even though it's not working as well as I'd like. Here's a direct link to the YouTube video.

EE351 · 2015-03-08 20:04

Thanks to everyone who replied!

The extra speed is needed on the receiving side (the data being sent to the prop). I'll try consolidating (unrolling) the methods first to see if that gives me the extra speed. If this doesn't work, I still have 2 cogs free and I may make an attempt to move some of the higher level stuff to pasm.

kwinn · 2015-03-08 20:07

That's a nice hexapod Duane, I really like the blinking eyes.

Have you looked at doing those equations in fixed instead of floating point, and simplifying them by looking for common sub-equations and such? Going from floating to fixed point can reduce calculation time by a factor of 10 to 20, and calculating common sub equations once and using the result can also make a big difference.

Phil Pilgrim (PhiPi) · 2015-03-08 21:16

Duane Degn wrote:

Though honestly, I'm not sure if the bottleneck is Spin or the sheer number of equations I'm asking F32 to perform.

I'm guessing that the bottleneck is your presumed reliance on floating-point when properly-scaled integer computations would do the job just as well -- and a lot faster. In the course I'm teaching, I won't let my students even close to F32 or its ilk. It's totally unnecessary.

-Phil
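
As a rough illustration of what "properly-scaled integer computations" can mean, here is a minimal sketch that multiplies a length by a cosine using only 32-bit integers. The scale factor of 1000 and the example values are assumptions chosen for readability; they do not come from the hexapod code.

```spin
CON
  SCALE = 1_000                  ' every value is stored as thousandths

PUB main | len_mm, cos30, x
  len_mm := 75_500               ' 75.5 mm stored as thousandths (assumed example value)
  cos30  := 866                  ' cos(30 degrees) is about 0.866, stored as thousandths
  x := (len_mm * cos30) / SCALE  ' 65_383, i.e. about 65.383 mm
```

The intermediate product (65_383_000) has to fit in a signed 32-bit long, so the scale factor limits how large the operands can be.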

Cluso99 · 2015-03-08 21:52

Bob Lawrence (VE1RLL) wrote: »

re:Spin still ranks #1 with me!

Me three

+1 = 4

kwinn · 2015-03-09 06:55

Cluso99 wrote: »

+1 = 4

Plus another one := 5

Dave Hein · 2015-03-09 07:49

EE351, I don't know what your code looks like, but if your high level code is getting one byte at a time from the enc28j60 driver you will be limited to something like 100 kbps. That's only 12.5 KBytes/second. At least that's the kind of performance I've seen with a tight loop in Spin that copies one byte at a time into a buffer when calling the FDS RX method.

You can improve the performance if you do block copies using BYTEMOVE. I've been able to improve the transfer speed by a factor of 10 by doing block copies instead of single-byte copies.
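
A minimal sketch of the two approaches, assuming the data is already sitting in a hub RAM buffer. The buffer names and the size are hypothetical.

```spin
CON
  BUF_SIZE = 1000

VAR
  byte src[BUF_SIZE]             ' hypothetical source buffer, e.g. filled by a PASM driver
  byte dst[BUF_SIZE]             ' hypothetical destination buffer

PUB main | i

  ' One byte per pass: the Spin interpreter runs the loop body BUF_SIZE times.
  repeat i from 0 to BUF_SIZE - 1
    dst[i] := src[i]

  ' Block copy: a single Spin statement moves the whole buffer.
  bytemove(@dst, @src, BUF_SIZE)

  repeat                         ' idle
```

Both versions leave dst holding the same data; the difference is how many Spin statements the interpreter has to execute to get there.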

EE351 · 2015-03-09 09:04

Dave Hein wrote: »

EE351, I don't know what your code looks like, but if your high level code is getting one byte at a time from the enc28j60 driver you will be limited to something like 100 kbps. That's only 12.5 KBytes/second. At least that's the kind of performance I've seen with a tight loop in Spin that copies one byte at a time into a buffer when calling the FDS RX method.

You can improve the performance if you do block copies using BYTEMOVE. I've been able to improve the transfer speed by a factor of 10 by doing block copies instead of single-byte copies.

No, it's a bit faster than that. According to wireshark, the transfer rate is in the neighborhood of about 40kbytes per second. I'll post the code later when I get home from work.

Dave Hein · 2015-03-09 09:17

I may have not made myself clear. PASM can easily handle 40kbytes per second, but Spin cannot if it is handling it one byte at a time. However, if you can move data as blocks Spin can handle higher rates by using bytemove.

EE351 · 2015-03-09 09:23

Dave Hein wrote: »

I may have not made myself clear. PASM can easily handle 40kbytes per second, but Spin cannot if it is handling it one byte at a time. However, if you can move data as blocks Spin can handle higher rates by using bytemove.

I should have been more clear also. As far as I can tell, the routine that reads the incoming data packet is written in assembly; however, the code that sets the registers prior to the read is written in spin. This is where I noticed all the nested methods that were called in order to retrieve an incoming udp packet.

Dave Hein · 2015-03-09 11:32

OK, well if the Spin code is just setting registers it doesn't seem like there should be a speed issue. I did run a speed test on the following loop

  repeat i from 1 to 1000
    byte[ptr1++] := byte[ptr2++]

and the result is about 42kbyte/second, so I thought that is what you were referring to with 40kbytes/second. I do know in a loop that called the FDS RX method it runs about 12kbytes/second. And with FDS the driver code is in PASM and is capable of higher rates. It's just the Spin code that is the bottleneck.

I'm looking forward to seeing the code that you post later today. Maybe then I'll understand where the problem is.

EE351 · 2015-03-09 17:12

Dave Hein wrote: »
OK, well if the Spin code is just setting registers it doesn't seem like there should be a speed issue. I did run a speed test on the following loop
  repeat i from 1 to 1000
    byte[ptr1++] := byte[ptr2++]
and the result is about 42kbyte/second, so I thought that is what you were referring to with 40kbytes/second. I do know in a loop that called the FDS RX method it runs about 12kbytes/second. And with FDS the driver code is in PASM and is capable of higher rates. It's just the Spin code that is the bottleneck.

I'm looking forward to seeing the code that you post later today. Maybe then I'll understand where the problem is.

Here are the files. I think the Pub method get_frame could probably benefit from some optimization. Any insight you can provide would be greatly appreciated!

UDP_enc28j60_gece_demo_2.zip

Duane Degn · 2015-03-10 12:49

kwinn wrote: »

Have you looked at doing those equations in fixed instead of floating point, and simplifying them by looking for common sub-equations and such? Going from floating to fixed point can reduce calculation time by a factor of 10 to 20, and calculating common sub equations once and using the result can also make a big difference.

Phil Pilgrim (PhiPi) wrote: »

I'm guessing that the bottleneck is your presumed reliance on floating-point when properly-scaled integer computations would do the job just as well -- and a lot faster.

kwinn and Phil,

To avoid hijacking this thread too much, I replied to your comments in my hexapod thread and in this thread about trigonometry with Spin.

From my tests so far, F32 is the clear winner. I hope you will let me know if there's something wrong with my testing method.

BTW, I retract my "painfully slow" remark. No one is a bigger fan of Spin than myself. Here's my YouTube playlist as proof.

Dave Hein · 2015-03-10 13:17

I looked at the code last night, but I didn't have any insight to offer, so I didn't comment last night. There's a lot of Spin code there, and I can't really say where the biggest slowdown is. It looks like Duane may have identified something with the floating point usage. If the floating point tweaks aren't sufficient then you'll probably have to add some debug prints and measure cycle times to determine where things are slowing down.

Duane Degn · 2015-03-10 13:58

Dave Hein wrote: »

It looks like Duane may have identified something with the floating point usage.

I don't think my issues with a math bottleneck are related to the issues EE351 is having.

I take it this code is intended to display animation data?

I looked at the code and I didn't see anything obviously wrong. It looks like buffers are being moved with longmove. I didn't see any ways of increasing the code's speed.

I wonder if there's a way of accessing the data locally faster than it could be transmitted with the ethernet connection. I also wonder if there's a faster ethernet device. It looks like the one being used has a single data line. Aren't there parallel ethernet modules which use more than one data line? I haven't done many ethernet projects myself but I thought I saw parallel modules mentioned somewhere on the forum.

EE351 · 2015-03-10 19:57

Duane Degn wrote: »

I don't think my issues with a math bottleneck are related to the issues EE351 is having.

I take it this code is intended to display animation data?

I looked at the code and I didn't see anything obviously wrong. It looks like buffers are being moved with longmove. I didn't see any ways of increasing the code's speed.

I wonder if there's a way of accessing the data locally faster than it could be transmitted with the ethernet connection. I also wonder if there's a faster ethernet device. It looks like the one being used has a single data line. Aren't there parallel ethernet modules which use more than one data line? I haven't done many ethernet projects myself but I thought I saw parallel modules mentioned somewhere on the forum.

This project takes incoming E1.31 data packets and transfers that data to the GE Color Effects (GECE) pixel strings. Originally, I was reading the Ethernet buffers directly, but due to the slow refresh rate of the GECEs, they couldn't keep up (the data in the Ethernet buffer was getting overwritten before the string could be fully updated with new data). That's why I have the pixel buffers.

I could switch to a faster Ethernet chip, but that means additional costs. The enc28j60 modules can be bought for about $3 (including shipping).

The data transfer is pretty much one way (out from the e1.31 program into the propeller). That's why I was looking at the get_frame method of the enc28j60 driver to see if it could be optimized. First I'll try unrolling this code to see if it boosts the speed.

Wossname · 2015-03-11 08:57

Threads like these make me want to learn Spin more and more. I've been coding Props exclusively in PASM for years now, and I know literally *nothing* about Spin at all besides how to call coginit() with my PASM params

I guess something in my brain stem objects to spending $10 on a powerful beast of a microcontroller and then running a cycle-hungry bytecode language on it. My projects tend to be high-throughput signalling or RF applications (because I find them interesting) and those tend to require raw PASM to get the job done.

However, I hereby resolve to learn Spin at least to a basic level. Then I can build robots!

ManAtWork · 2015-03-11 10:44

Hi EE351,

I had exactly the same problem some years ago. I decided to optimize the ENC28J60 driver by
a) reducing the depth of spin method calls
b) adding a new PASM function "dbl_out()"
I found out that often there are two consecutive calls to spi_out(). Replacing them with one call to dbl_out() considerably reduces the overhead. Overall, I got a performance boost of approximately a factor of two.

I removed the CRC feature because I didn't need it (no TCP/IP), but it should be easy to add that, again.

EE351 · 2015-03-11 13:28

ManAtWork wrote: »

Hi EE351,

I had exactly the same problem some years ago. I decided to optimize the ENC28J60 driver by
a) reducing the depth of spin method calls
b) adding a new PASM function "dbl_out()"
I found out that often there are two consecutive calls to spi_out(). Replacing them with one call to dbl_out() considerably reduces the overhead. Overall, I got a performance boost of approximately a factor of two.

I removed the CRC feature because I didn't need it (no TCP/IP), but it should be easy to add that, again.

Thank you for sharing this, I will give your code a try.
