Spin speed up
EE351
Posts: 81
Hi all,
I have a general question about Spin. I'm using an ENC28J60 driver in a project and while it is fast, I need to squeeze a little more speed out of it. I'm noticing the SPI portion of the driver is written in PASM, but the higher-level stuff is written in Spin. Here is my question:
I'm noticing there are a lot of nested PUB methods being called, sometimes 3 to 4 levels deep. Some of these methods are only 1 or 2 lines each. Would the Spin code execute faster if these methods were consolidated into the first method that calls them (presumably at the expense of more memory being used and less readable code)?
Also, I'm using the clock from the ENC28J60 to drive the Prop at 100 MHz. Any insight would be helpful!
Comments
You can of course time how long code takes to execute with:
The variable "time" should hold the time it took to execute the code plus a small overhead. You can measure the overhead by having no code in the "do stuff" section and subtracting the overhead from the final value. If you just want to see which code runs faster then you don't really need to worry about the overhead.
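A minimal sketch of this kind of measurement in Spin, assuming the system counter cnt is used as the time base. The names "time" and "do stuff" come from the post above; overhead, scratch, measure and do_stuff are made up for illustration.

```spin
VAR
  long time        ' elapsed system clocks for the code under test
  long overhead    ' cost of the measurement itself (hypothetical name)
  long scratch     ' dummy variable so do_stuff has some work to do

PUB measure
  ' Time an empty section first to find the measurement overhead.
  overhead := cnt
  overhead := cnt - overhead

  ' Time the real code, then remove the overhead.
  time := cnt
  do_stuff
  time := cnt - time - overhead

PRI do_stuff | i
  ' Placeholder for the code being timed.
  repeat i from 0 to 9
    scratch += i
```

The result is in system clock ticks, so at 100 MHz one tick is 10 ns. The subtraction stays valid across a counter rollover as long as the timed section is shorter than one full wrap of the 32-bit counter.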
Whoops... as soon as I post I see that Duane mentioned the overhead. He's right; if you're just looking for a delta between two bits of code, the overhead is irrelevant.
If you have the space, "unrolled" code tends to run faster as you're not adding the loop overhead into the time required.
-Phil
Me, too!
Me three
That was my frustration about my lack of progress with my hexapod code speaking.
I still love Spin. It's by far my favorite language to program with.
I should have qualified "painfully." While Spin was fast enough to get my hexapod to walk relatively smoothly, I find it's not fast enough to run all the calculations needed when giving the robot a more complex gait (walking while tilted). Fortunately I have several free cogs so I can spread the work among multiple cogs each with a F32 coprocessor. I was griping since Spin's limited speed is making the program more complex than I would prefer.
Though honestly, I'm not sure if the bottleneck is Spin or the sheer number of equations I'm asking F32 to perform.
I'm a huge fan of Spin.
BTW, If any of you haven't seen the latest video of my hexapod, make sure and check it out (post #73 of the above link). IMO, it's still pretty cool even though it's not working as well as I'd like. Here's a direct link to the youtube video.
The extra speed is needed on the receiving side (the data being sent to the Prop). I'll try consolidating (unrolling) the methods first to see if that gives me the extra speed. If this doesn't work, I still have 2 cogs free and I may make an attempt to move some of the higher-level stuff to PASM.
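To make "consolidating" concrete, here is a rough sketch of the idea, not code from the ENC28J60 driver. All names (regs, read_nested, read_reg, read_raw, read_flat) are hypothetical, and a byte array stands in for the real register access.

```spin
VAR
  byte regs[32]    ' hypothetical stand-in for hardware register access

PUB read_nested(addr) : value
  ' Three levels of calls for one byte; each level adds call/return overhead.
  value := read_reg(addr)

PRI read_reg(addr) : value
  value := read_raw(addr & $1F)

PRI read_raw(index) : value
  value := regs[index]

PUB read_flat(addr) : value
  ' Same work with the helper bodies folded into the caller.
  value := regs[addr & $1F]
```

Both public methods return the same value; read_flat just avoids two extra method calls per byte. Timing each version is the way to check whether the saving matters in a given driver.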
Have you looked at doing those equations in fixed instead of floating point, and simplifying them by looking for common sub-equations and such? Going from floating to fixed point can reduce calculation time by a factor of 10 to 20, and calculating common sub-equations once and reusing the result can also make a big difference.
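As a rough illustration of the fixed-point idea, here is a sketch of a multiply in 16.16 format using Spin's integer operators. The 16.16 format and the names ONE, fmul and demo are assumptions chosen for the example, not anything taken from the hexapod code or F32.

```spin
CON
  ONE = 1 << 16          ' 1.0 in 16.16 fixed point (assumed format)

PUB fmul(a, b) : product
  ' 16.16 multiply: keep the middle 32 bits of the 64-bit product.
  ' ** returns the upper 32 bits, * returns the lower 32 bits.
  product := ((a ** b) << 16) | ((a * b) >> 16)

PUB demo : y | x, k
  x := 3 * ONE / 2       ' 1.5
  k := 2 * ONE           ' 2.0
  y := fmul(x, k)        ' 3.0, i.e. 3 * ONE
```

Values have to stay within the range of a signed 16.16 number, so the scaling needs some thought for each equation. A sub-expression that appears several times can be computed once into a local variable and reused.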
-Phil
Plus another one := 5
You can improve the performance if you do block copies using BYTEMOVE. I've been able to improve the transfer speed by a factor of 10 by doing block copies instead of single-byte copies.
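A small sketch of the difference, with hypothetical buffer and method names (src_buf, dst_buf, copy_slow, copy_fast); it is not taken from the driver.

```spin
CON
  BUF_SIZE = 64

VAR
  byte src_buf[BUF_SIZE]
  byte dst_buf[BUF_SIZE]

PUB copy_slow | i
  ' One interpreted loop pass per byte.
  repeat i from 0 to BUF_SIZE - 1
    dst_buf[i] := src_buf[i]

PUB copy_fast
  ' One built-in call copies the whole block:
  ' bytemove(destination address, source address, byte count)
  bytemove(@dst_buf, @src_buf, BUF_SIZE)
```

Both methods leave dst_buf with the same contents. The block copy only helps where the data is already sitting in hub RAM; it does not speed up the SPI transfer itself.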
No, it's a bit faster than that. According to Wireshark, the transfer rate is in the neighborhood of 40 kbytes per second. I'll post the code later when I get home from work.
I'm looking forward to seeing the code that you post later today. Maybe then I'll understand where the problem is.
Here are the files. I think the PUB method get_frame could probably benefit from some optimization. Any insight you can provide would be greatly appreciated!
UDP_enc28j60_gece_demo_2.zip
kwinn and Phil,
To avoid hijacking this thread too much, I replied to your comments in my hexapod thread and in this thread about trigonometry with Spin.
From my tests so far, F32 is the clear winner. I hope you will let me know if there's something wrong with my testing method.
BTW, I retract my "painfully slow" remark. No one is a bigger fan of Spin than myself. Here's my YouTube playlist as proof.
I don't think my issues with a math bottleneck are related to the issues EE351 is having.
I take it this code is intended to display animation data?
I looked at the code and I didn't see anything obviously wrong. It looks like buffers are being moved with longmove. I didn't see any ways of increasing the code's speed.
I wonder if there's a way of accessing the data locally faster than it could be transmitted with the ethernet connection. I also wonder if there's a faster ethernet device. It looks like the one being used has a single data line. Aren't there parallel ethernet modules which use more than one data line? I haven't done many ethernet projects myself but I thought I saw parallel modules mentioned somewhere on the forum.
This project takes incoming E1.31 data packets and transfers the data to the GE Color Effects (GECE) pixel strings. Originally, I was reading the Ethernet buffers directly, but due to the slow refresh rate of the GECEs, they couldn't keep up (the data in the Ethernet buffer was getting overwritten before the string could be fully updated with new data). That's why I have the pixel buffers.
I could switch to a faster Ethernet chip, but that means additional costs. The enc28j60 modules can be bought for about $3 (including shipping).
The data transfer is pretty much one way (out from the E1.31 program into the Propeller). That's why I was looking at the get_frame method of the ENC28J60 driver to see if it could be optimized. First I'll try unrolling this code to see if it boosts the speed.
I guess something in my brain stem objects to spending $10 on a powerful beast of a microcontroller and then running a cycle-hungry bytecode language on it. My projects tend to be high-throughput signalling or RF applications (because I find them interesting) and those tend to require raw PASM to get the job done.
However, I hereby resolve to learn Spin at least to a basic level. Then I can build robots!
I had exactly the same problem some years ago. I decided to optimize the ENC28J60 driver by
a) reducing the depth of Spin method calls
b) adding a new PASM function "dbl_out()"
I found out that there are often two consecutive calls to spi_out(). Replacing those with one call to dbl_out() considerably reduces the overhead. Overall, I got a performance boost of approx. a factor of two.
I removed the CRC feature because I didn't need it (no TCP/IP), but it should be easy to add that again.
Thank you for sharing this. I will give your code a try.