Spin speed up — Parallax Forums

EE351 Posts: 81
edited 2015-03-11 13:28 in Propeller 1
Hi all,

I have a general question about spin. I'm using an enc28j60 driver in a project and while it is fast, I need to squeeze a little more speed out of it. I'm noticing the spi portion of the driver is written in pasm, but the higher level stuff is written in spin. Here is my question:

I'm noticing there are a lot of nested pub methods being called, sometimes 3 to 4 levels deep. Some of these methods are only 1 or 2 lines each. Would the spin code execute faster if these methods are consolidated into the first method that calls them (presumably at the expense of more memory being used and less readable code)?

Also, I'm using the clock from the enc28j60 to drive the prop at 100 MHz. Any insight would be helpful!

Comments

  • kwinn Posts: 8,697
    edited 2015-03-08 12:47
    Consolidating the code to reduce the number of calls/returns will speed up the code.
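A minimal sketch of what consolidating looks like, using made-up method names (`add_nested`, `add_level2`, `add_level3`, `add_inlined`) rather than anything from the enc28j60 driver. Every call and return costs interpreter time, so the inlined version should do the same work with two fewer call/return pairs:

```spin
VAR
  long  total

PUB main
  add_nested(5)                 ' total becomes 5, via three nested calls
  add_inlined(5)                ' total becomes 10, via a single call

PRI add_nested(value)
  ' Level 1 only forwards to level 2
  add_level2(value)

PRI add_level2(value)
  ' Level 2 only forwards to level 3
  add_level3(value)

PRI add_level3(value)
  ' The one line of real work
  total += value

PRI add_inlined(value)
  ' Same effect with the helper bodies folded into the caller
  total += value
```

How much this saves depends on how often the nested path runs, so it is worth timing both versions before restructuring a whole driver.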
  • Duane Degn Posts: 10,588
    edited 2015-03-08 12:55
    Yes, Spin is [struck through: painfully slow] wonderful. Moving the small methods inline should help a bit.

    You can of course time how long code takes to execute with:
      time := -cnt
      ' do stuff
      time += cnt
    

    The variable "time" should hold the time it took to execute the code plus a small overhead. You can measure the overhead by leaving the "do stuff" section empty, then subtract it from the final value. If you just want to see which code runs faster, you don't really need to worry about the overhead.
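The calibrate-then-subtract idea could be sketched as follows. The method `do_stuff` and the variables `overhead` and `elapsed` are placeholders for illustration, not part of any driver:

```spin
VAR
  long  overhead, elapsed

PUB main
  ' Calibrate: nothing between the two lines, so this is pure overhead
  overhead := -cnt
  overhead += cnt

  ' Measure the code of interest
  elapsed := -cnt
  do_stuff
  elapsed += cnt

  ' Net system clock ticks, still including the call to do_stuff itself
  elapsed -= overhead

PRI do_stuff | i, x
  ' Hypothetical workload standing in for the code being timed
  repeat i from 1 to 100
    x += i
```

Because the subtraction is done modulo 2^32, the result stays valid across a single rollover of cnt, provided the timed section is shorter than one full wrap of the counter.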
  • JonnyMac Posts: 9,186
    edited 2015-03-08 18:25
    If you want to be really fussy about it -- as I am -- subtract 544 from cnt before adding it into your elapsed time variable; this will remove the overhead of the call (with nothing between the two lines you will get 0)
      ticks := -cnt
      ' code to time
      ticks += cnt - 544
    


    Whoops... as soon as I post I see that Duane mentioned the overhead. He's right; if you're just looking for a delta between two bits of code, the overhead is irrelevant.

    If you have the space, "unrolled" code tends to run faster as you're not adding the loop overhead into the time required.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2015-03-08 18:34
    Duane Degn wrote:
    Yes, Spin is painfully slow.
    Painfully? Come on, Duane! As a RAD (rapid application development) language and as a glue language for PASM, Spin still ranks #1 with me! And -- really -- it's not all THAT slow! :)

    -Phil
  • JonnyMac Posts: 9,186
    edited 2015-03-08 19:18
    Spin still ranks #1 with me!


    Me, too!
  • Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2015-03-08 19:29
    re:Spin still ranks #1 with me!

    Me three
  • Duane Degn Posts: 10,588
    edited 2015-03-08 19:45
    Painfully? Come on, Duane!

    That was my frustration about my lack of progress with my hexapod code speaking.

    I still love Spin. It's by far my favorite language to program with.

    I should have qualified "painfully." While Spin was fast enough to get my hexapod to walk relatively smoothly, I find it's not fast enough to run all the calculations needed when giving the robot a more complex gait (walking while tilted). Fortunately I have several free cogs so I can spread the work among multiple cogs each with a F32 coprocessor. I was griping since Spin's limited speed is making the program more complex than I would prefer.

    Though honestly, I'm not sure if the bottleneck is Spin or the sheer number of equations I'm asking F32 to perform.

    I'm a huge fan of Spin.

    BTW, if any of you haven't seen the latest video of my hexapod, make sure to check it out (post #73 of the above link). IMO, it's still pretty cool even though it's not working as well as I'd like. Here's a direct link to the youtube video.
  • EE351 Posts: 81
    edited 2015-03-08 20:04
    Thanks to everyone who replied!

    The extra speed is needed on the receiving side (the data being sent to the prop). I'll try consolidating (unrolling) the methods first to see if that gives me the extra speed. If this doesn't work, I still have 2 cogs free and I may make an attempt to move some of the higher level stuff to pasm.
  • kwinn Posts: 8,697
    edited 2015-03-08 20:07
    That's a nice hexapod Duane, I really like the blinking eyes.

    Have you looked at doing those equations in fixed instead of floating point, and simplifying them by looking for common sub-equations and such? Going from floating to fixed point can reduce calculation time by a factor of 10 to 20, and calculating common sub equations once and using the result can also make a big difference.
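As a rough illustration of the fixed-point approach, here is a sketch of a 16.16 multiply in plain Spin. The names `fx_mul`, `FRAC_BITS`, and `ONE` are invented for this example, and whether this beats a floating-point cog in a particular program is something to measure rather than assume:

```spin
CON
  FRAC_BITS = 16
  ONE       = 1 << FRAC_BITS    ' 1.0 in 16.16 fixed point

VAR
  long  result

PUB main
  ' 1.5 * 2.25: $1_8000 times $2_4000 gives $3_6000, which is 3.375
  result := fx_mul(ONE + ONE / 2, 2 * ONE + ONE / 4)

PRI fx_mul(a, b) : product
  ' ** yields the upper 32 bits of the 64-bit signed product, * the lower 32.
  ' Bits 16..47 of that 64-bit product are the 16.16 result.
  product := ((a ** b) << FRAC_BITS) | ((a * b) >> (32 - FRAC_BITS))
```

The result is only meaningful while the true product fits in 16.16, so the scale factor has to be chosen with the expected range of values in mind.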
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2015-03-08 21:16
    Duane Degn wrote:
    Though honestly, I'm not sure if the bottleneck is Spin or the sheer number of equations I'm asking F32 to perform.
    I'm guessing that the bottleneck is your presumed reliance on floating-point when properly-scaled integer computations would do the job just as well -- and a lot faster. In the course I'm teaching, I won't let my students even close to F32 or its ilk. It's totally unnecessary.

    -Phil
  • Cluso99 Posts: 18,069
    edited 2015-03-08 21:52
    re:Spin still ranks #1 with me!

    Me three
    +1 = 4 ;)
  • kwinn Posts: 8,697
    edited 2015-03-09 06:55
    Cluso99 wrote: »
    +1 = 4 ;)

    Plus another one := 5
  • Dave Hein Posts: 6,347
    edited 2015-03-09 07:49
    EE351, I don't know what your code looks like, but if your high level code is getting one byte at a time from the enc28j60 driver you will be limited to something like 100 kbps. That's only 12.5 KBytes/second. At least that's the kind of performance I've seen with a tight loop in Spin that copies one byte at a time into a buffer when calling the FDS RX method.

    You can improve the performance if you do block copies using BYTEMOVE. I've been able to improve the transfer speed by a factor of 10 by doing block copies instead of single-byte copies.
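The difference between the two copy styles can be sketched like this. The buffer names and `BUF_SIZE` are assumptions for the example; in a real program the source would be whatever buffer the driver fills:

```spin
CON
  BUF_SIZE = 256

VAR
  byte  src[BUF_SIZE]
  byte  dst[BUF_SIZE]

PUB main
  copy_bytewise
  copy_block

PRI copy_bytewise | i
  ' One interpreted loop iteration per byte
  repeat i from 0 to BUF_SIZE - 1
    dst[i] := src[i]

PRI copy_block
  ' One built-in call moves the whole buffer
  bytemove(@dst, @src, BUF_SIZE)
```

With bytemove the per-byte work happens inside the interpreter's own routine instead of in interpreted bytecode, which is where the speedup comes from.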
  • EE351 Posts: 81
    edited 2015-03-09 09:04
    Dave Hein wrote: »
    EE351, I don't know what your code looks like, but if your high level code is getting one byte at a time from the enc28j60 driver you will be limited to something like 100 kbps. That's only 12.5 KBytes/second. At least that's the kind of performance I've seen with a tight loop in Spin that copies one byte at a time into a buffer when calling the FDS RX method.

    You can improve the performance if you do block copies using BYTEMOVE. I've been able to improve the transfer speed by a factor of 10 by doing block copies instead of single-byte copies.

    No, it's a bit faster than that. According to wireshark, the transfer rate is in the neighborhood of about 40kbytes per second. I'll post the code later when I get home from work.
  • Dave Hein Posts: 6,347
    edited 2015-03-09 09:17
    I may have not made myself clear. PASM can easily handle 40kbytes per second, but Spin cannot if it is handling it one byte at a time. However, if you can move data as blocks Spin can handle higher rates by using bytemove.
  • EE351 Posts: 81
    edited 2015-03-09 09:23
    Dave Hein wrote: »
    I may have not made myself clear. PASM can easily handle 40kbytes per second, but Spin cannot if it is handling it one byte at a time. However, if you can move data as blocks Spin can handle higher rates by using bytemove.
    I should have been more clear also. As far as I can tell, the routine that reads the incoming data packet is written in assembly; however, the code that sets the registers prior to the read is written in spin. This is where I noticed all the nested methods that were called in order to retrieve an incoming udp packet.
  • Dave Hein Posts: 6,347
    edited 2015-03-09 11:32
    OK, well if the Spin code is just setting registers it doesn't seem like there should be a speed issue. I did run a speed test on the following loop
      repeat i from 1 to 1000
        byte[ptr1++] := byte[ptr2++]
    
    and the result is about 42kbyte/second, so I thought that is what you were referring to with 40kbytes/second. I do know in a loop that called the FDS RX method it runs about 12kbytes/second. And with FDS the driver code is in PASM and is capable of higher rates. It's just the Spin code that is the bottleneck.

    I'm looking forward to seeing the code that you post later today. Maybe then I'll understand where the problem is.
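One way such a speed test might be put together is sketched below, combining the loop above with the cnt timing trick from earlier in the thread. The method name `copy_rate`, the buffers, and `COUNT` are assumptions made for the example:

```spin
CON
  COUNT = 1000

VAR
  byte  src[COUNT]
  byte  dst[COUNT]
  long  rate

PUB main
  rate := copy_rate(@dst, @src)

PRI copy_rate(ptr1, ptr2) : bytes_per_sec | i, ticks
  ticks := -cnt
  repeat i from 1 to COUNT
    byte[ptr1++] := byte[ptr2++]
  ' 544-tick correction quoted earlier in the thread
  ticks += cnt - 544
  ' Ticks per byte is ticks / COUNT, so bytes per second is clkfreq over that
  bytes_per_sec := clkfreq / (ticks / COUNT)
```

The figure ends up in `rate`; printing it needs whichever serial object the project already uses. The number scales with the system clock, so results are only comparable at the same clkfreq.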
  • EE351 Posts: 81
    edited 2015-03-09 17:12
    Dave Hein wrote: »
    OK, well if the Spin code is just setting registers it doesn't seem like there should be a speed issue. I did run a speed test on the following loop
      repeat i from 1 to 1000
        byte[ptr1++] := byte[ptr2++]
    
    and the result is about 42kbyte/second, so I thought that is what you were referring to with 40kbytes/second. I do know in a loop that called the FDS RX method it runs about 12kbytes/second. And with FDS the driver code is in PASM and is capable of higher rates. It's just the Spin code that is the bottleneck.

    I'm looking forward to seeing the code that you post later today. Maybe then I'll understand where the problem is.

    Here are the files. I think the Pub method get_frame could probably benefit from some optimization. Any insight you can provide would be greatly appreciated!

    UDP_enc28j60_gece_demo_2.zip
  • Duane Degn Posts: 10,588
    edited 2015-03-10 12:49
    kwinn wrote: »
    Have you looked at doing those equations in fixed instead of floating point, and simplifying them by looking for common sub-equations and such? Going from floating to fixed point can reduce calculation time by a factor of 10 to 20, and calculating common sub equations once and using the result can also make a big difference.
    I'm guessing that the bottleneck is your presumed reliance on floating-point when properly-scaled integer computations would do the job just as well -- and a lot faster.

    kwinn and Phil,

    To avoid hijacking this thread too much, I replied to your comments in my hexapod thread and in this thread about trigonometry with Spin.

    From my tests so far, F32 is the clear winner. I hope you will let me know if there's something wrong with my testing method.

    BTW, I retract my "painfully slow" remark. No one is a bigger fan of Spin than myself. Here's my YouTube playlist as proof.
  • Dave Hein Posts: 6,347
    edited 2015-03-10 13:17
    I looked at the code last night, but I didn't have any insight to offer, so I didn't comment last night. There's a lot of Spin code there, and I can't really say where the biggest slowdown is. It looks like Duane may have identified something with the floating point usage. If the floating point tweaks aren't sufficient then you'll probably have to add some debug prints and measure cycle times to determine where things are slowing down.
  • Duane Degn Posts: 10,588
    edited 2015-03-10 13:58
    Dave Hein wrote: »
    It looks like Duane may have identified something with the floating point usage.

    I don't think my issues with a math bottleneck are related to the issues EE351 is having.

    I take it this code is intended to display animation data?

    I looked at the code and I didn't see anything obviously wrong. It looks like buffers are being moved with longmove. I didn't see any ways of increasing the code's speed.

    I wonder if there's a way of accessing the data locally faster than it could be transmitted with the ethernet connection. I also wonder if there's a faster ethernet device. It looks like the one being used has a single data line. Aren't there parallel ethernet modules which use more than one data line? I haven't done many ethernet projects myself but I thought I saw parallel modules mentioned somewhere on the forum.
  • EE351 Posts: 81
    edited 2015-03-10 19:57
    Duane Degn wrote: »
    I don't think my issues with a math bottleneck are related to the issues EE351 is having.

    I take it this code is intended to display animation data?

    I looked at the code and I didn't see anything obviously wrong. It looks like buffers are being moved with longmove. I didn't see any ways of increasing the code's speed.

    I wonder if there's a way of accessing the data locally faster than it could be transmitted with the ethernet connection. I also wonder if there's a faster ethernet device. It looks like the one being used has a single data line. Aren't there parallel ethernet modules which use more than one data line? I haven't done many ethernet projects myself but I thought I saw parallel modules mentioned somewhere on the forum.

    This project takes incoming e1.31 data packets and transfers that to the GE color effects pixel strings. Originally, I was reading the Ethernet buffers directly, but due to the slow refresh rate of the GECEs, they couldn't keep up (the data in the Ethernet buffer was getting overwritten before the string could be fully updated with new data). That's why I have the pixel buffers.

    I could switch to a faster Ethernet chip, but that means additional costs. The enc28j60 modules can be bought for about $3 (including shipping).

    The data transfer is pretty much one way (out from the e1.31 program into the propeller). That's why I was looking at the get_frame method of the enc28j60 driver to see if it could be optimized. First I'll try unrolling this code to see if it boosts the speed.
  • Wossname Posts: 174
    edited 2015-03-11 08:57
    Threads like these make me want to learn Spin more and more. I've been coding Props exclusively in PASM for years now, and I know literally *nothing* about Spin at all besides how to call coginit() with my PASM params :)

    I guess something in my brain stem objects to spending $10 on a powerful beast of a microcontroller and then running a cycle-hungry bytecode language on it. My projects tend to be high-throughput signalling or RF applications (because I find them interesting) and those tend to require raw PASM to get the job done.

    However, I hereby resolve to learn Spin at least to a basic level. Then I can build robots!
  • ManAtWork Posts: 2,178
    edited 2015-03-11 10:44
    Hi EE351,

    I had exactly the same problem some years ago. I decided to optimize the ENC28J60 driver by
    a) reducing the depth of spin method calls
    b) adding a new PASM function "dbl_out()"
    I found out that often there are two consecutive calls to spi_out(). Replacing that by one call to dbl_out() considerably reduces the overhead. Overall, I got a performance boost by approx. a factor of two.

    I removed the CRC feature because I didn't need it (no TCP/IP), but it should be easy to add that, again.
  • EE351 Posts: 81
    edited 2015-03-11 13:28
    ManAtWork wrote: »
    Hi EE351,

    I had exactly the same problem some years ago. I decided to optimize the ENC28J60 driver by
    a) reducing the depth of spin method calls
    b) adding a new PASM function "dbl_out()"
    I found out that often there are two consecutive calls to spi_out(). Replacing that by one call to dbl_out() considerably reduces the overhead. Overall, I got a performance boost by approx. a factor of two.

    I removed the CRC feature because I didn't need it (no TCP/IP), but it should be easy to add that, again.

    Thank you for sharing this, I will give your code a try.