Spin Speed, Why So Slow?
Miner_with_a_PIC
Posts: 123
I have noticed that Spin routines run very slowly. Here is an example:
repeat
  outa[15] := 1
  outa[15] := 0
On a scope the output has a period of 284 us with a 5 MHz clock, no PLL, and only one cog running; that is, one pass through the repeat loop takes about 1420 clock cycles. Compare this to its assembly counterpart:
:test_loop
        mov     outa, #0
        nop
        mov     outa, testmask
        jmp     #:test_loop
This has a measured period of 3.24 us, or 16 clock cycles (12 if you exclude the nop).
My understanding is that the cog running Spin retrieves bytecodes from main memory via the hub and uses its locally loaded interpreter to process them. Hub access is slow, at 7 to 22 clock cycles, but why is Spin so much slower than that? Each simple Spin command in the example should take at most 22 clock cycles to fetch via the hub and 4 clock cycles to process; instead the average is about 470 clock cycles each. Does anyone have the scoop as to why this is so? My main concern is that in speed-intensive applications, Spin's speed constraints always plague my progress, even when Spin is only there to direct/load the ASM code.
Thanks in advance...
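For anyone who wants to check the cycle counts quoted in this post, here is a rough Python sketch of the arithmetic (period times clock frequency). Treating the Spin loop as three statements (two pin writes plus the jump back to repeat) is an assumption made to reproduce the "470 cycles each" figure; the names are only illustrative.

```python
# Convert a measured period into clock cycles: cycles = period * clock frequency.
CLOCK_HZ = 5_000_000  # 5 MHz crystal, no PLL (from the post)

def period_to_cycles(period_s, clock_hz=CLOCK_HZ):
    """Return the number of system clock cycles in one period."""
    return period_s * clock_hz

spin_cycles = period_to_cycles(284e-6)   # Spin repeat loop
pasm_cycles = period_to_cycles(3.24e-6)  # PASM loop, including the nop

# Assumption: the Spin loop counts as three statements
# (two pin writes plus the implicit jump back to repeat).
SPIN_STATEMENTS = 3
cycles_per_statement = spin_cycles / SPIN_STATEMENTS

print(f"Spin loop: {spin_cycles:.0f} cycles")
print(f"PASM loop: {pasm_cycles:.1f} cycles")
print(f"Per Spin statement: {cycles_per_statement:.0f} cycles")
print(f"Spin/PASM ratio: {spin_cycles / pasm_cycles:.0f}x")
```

On these figures the Spin loop comes out a little under 90 times slower than the PASM loop, cycle for cycle.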
Comments
It is slow because the interpreter grabs the tokenized command, decodes it, and then has to jump around a bit (within the interpreter) to figure out which pin, what value, and what to do.
There are projects (I have not explored them, but I am sure others will pipe in with info) that allow you to place a more optimized version of the Spin interpreter into a cog. I think there was a significant speed increase, but nothing earth-shattering.
If you want the speed, use PASM. If space is an issue because of the 512-long limit, use a Large Memory Model (LMM) system. I think that runs at about an eighth the speed of PASM (someone correct me please).
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
April, 2008: when I discovered the answers to all my micro-computational-botherations!
Some of my objects:
MCP3X0X ADC Driver - Programmable Schmitt inputs, frequency reading, and more!
Simple Propeller-based Database - Making life easier and more readable for all your EEPROM storage needs.
String Manipulation Library - Don't allow strings to be the bane of the Propeller, bend them to your will!
Fast Inter-Propeller Comm - Fast communication between two propellers (1.37MB/s @100MHz)!
Post Edited (Bobb Fwed) : 1/13/2010 11:00:00 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
*Peter*
Spin is actually quite fast. A Spin-only serial driver can run at over 19.2 kBaud with a system clock of 80 MHz. For higher speeds, you can use any of the LMM (Large Memory Model) interpreters and a matching C compiler like Catalina or ImageCraft's. These have much lower execution overhead, at the expense of significantly increased memory requirements. You could also use one of the Forth interpreters, which also have lower execution overhead, or you could try Bean's new PropBasic compiler, which compiles a variant of SX/B to assembly language.
Most programs spend most of their execution time in very small areas of the program. If you're optimizing for speed, you can get major speedups by converting just those small areas into assembly. The same is true for Spin programs. The kicker is that the small areas involved are often not the ones you think will benefit from the optimization. The combination of Spin and assembly is actually quite good with Spin heavily optimized for space and assembly being quite fast, but bulky.
Here we see 7 instructions in 12 bytes.
As Bobb says, for each one of those instructions the interpreter has to fetch the bytecodes, increment its program counter, decode the instruction, and do the operation. That may involve fetching some operand and writing out some result.
If you really want to see what goes on, you would have to look at the actual interpreter code, which has been posted here somewhere.
Bear in mind that the interpreter was written to fit into a cog, so it is probably not optimized for speed, although I believe Cluso has produced a version that is a tad faster.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
There seems to be a trade-off between interpreter size and speed. My understanding is that the Prop2 will have more main memory (256K?) and perhaps more cog memory as well, which could accommodate a larger interpreter and faster Spin processing. How much faster I am unsure, but based on Bobb's input it doesn't seem like very much. For now I look forward to programming assembly-based solutions to gain a foothold on speed, while using Spin for projects that can tolerate slower speeds.
Don't forget, it is not a case of selecting Spin or PASM on a project-by-project basis. Rather, use PASM for the parts of your project that really need it. Most things in OBEX, for example, are Spin/PASM mixtures.
It also helps that PASM is about the simplest assembly language to work with that I have ever come across.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Check out the Propeller Wiki and contribute if you can.
This is a little OT but I can't help myself :)
The counters are practically free, you get two per cog and they work pretty much the same whether called from Spin or PASM...
AND consider the waitcnt function... amazing time resolution, and it doesn't seem affected by Spin at all... except for the first few thousand clocks.
Processors are processors... but the Prop is a controller through and through. When there is something better, I'd like an email to my PropPhone.
Rich
www.parallax.com/tabid/766/Default.aspx
Massimo
@rjo_ >> Good news regarding the waitcnt function, as it is perhaps one of the most used pieces of code in Spin. I believe 381 clock cycles is the minimum wait for Spin and 9 for assembly, provided the waitcnt occurs right after the add as it does in the examples.
@Roy >> This is awesome news as 1 clock cycle per instruction was a selling feature for the SX line of chips. I also noted the ambitious 160 MHz clock speed, 3 cycle jumps and special jmps that when coded correctly would take only 1 cycle. This Parallax team are really a great bunch of folks, very clever, open and willing to try new things rather than get stuck in a rut.
@Mike and Heater >> So far the webinars haven't gone into as much detail answering my question as you guys did, many thanks for sharing. Where would I get a copy of the BST Spin compiler?
@Peter >> The concern was more related to a ratio of performance, where clock cycle to clock cycle comparison is more important than the speed the chip is running. The ratio is very nearly 100X faster for ASM over Spin. For many applications where human interaction is involved spin's speed is overkill but when interacting with hardware like ADC, DACs, EEPROMs or fast serial (>19,200 bps) using spin often becomes a speed limiter.
www.parallax.com/tabid/442/Default.aspx
I realized my mistake after I posted, when I later read 5 MHz, no PLL, etc. It was rather early in the morning for me at the time (and also now) and I needed a coffee, so I put it down to that. I know, though, that there are trade-offs with byte interpreters, memory limitations, and slow hub access, the latter being a real killer as instructions have to be read in as unaligned bytes.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
*Peter*
propeller.wikispaces.com/
and don't forget to check the sticky in the forum and the obex...
Massimo
If you look at the webinar video done by Jeff Martin on Dec 12th, you will see a program very similar to the one you have posted, in both PASM and Spin. However, in Jeff's program he set the PLL and clock frequency correctly and thus achieved an 80 MHz clock, not the 20 kHz internal RC oscillator in slow mode.
The program is shown at about minute 14 of the video; the oscilloscope output for the Spin code is shown at minute 15, and the PASM output around minute 16.
From the webinar, the PASM code produces a square wave with a period of 200 nanoseconds, while the Spin code produces a square wave with a period of 22 microseconds.
So these numbers are QUITE a bit faster than what you have quoted, due to the 80 MHz clock rate.
So Spin is not REALLY slow; it is fast in comparison to MANY other microcontrollers. It is only slow when compared to PASM, which is itself much faster than MANY microcontrollers.
Spin is fast enough to implement an RS232 UART that can function at up to 19200 bps, which is quite good for a high-level language.
Is a hare slow? Not when you compare it to humans... but yes when compared to cheetahs.
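To put the webinar figures and the original post on the same footing, the periods can be converted to clock cycles, which do not depend on the clock rate. This is only a sketch in Python: it assumes the webinar loop and the loop in the original post do roughly the same work, which the thread does not confirm, and the variable names are made up for illustration.

```python
WEBINAR_CLOCK_HZ = 80_000_000  # 80 MHz system clock used in the webinar

def period_to_cycles(period_s, clock_hz):
    """Clock cycles per loop = loop period * clock frequency."""
    return period_s * clock_hz

webinar_pasm_cycles = period_to_cycles(200e-9, WEBINAR_CLOCK_HZ)  # PASM square wave
webinar_spin_cycles = period_to_cycles(22e-6, WEBINAR_CLOCK_HZ)   # Spin square wave

# Original post: about 1420 cycles per Spin loop, measured at 5 MHz.
# Assumption: the same loop would need the same cycle count at 80 MHz.
ORIGINAL_SPIN_CYCLES = 1420
original_spin_period_at_80mhz_us = ORIGINAL_SPIN_CYCLES / WEBINAR_CLOCK_HZ * 1e6

print(f"Webinar PASM loop: {webinar_pasm_cycles:.0f} cycles")
print(f"Webinar Spin loop: {webinar_spin_cycles:.0f} cycles")
print(f"Webinar Spin/PASM ratio: {webinar_spin_cycles / webinar_pasm_cycles:.0f}x")
print(f"Original Spin loop at 80 MHz: {original_spin_period_at_80mhz_us:.2f} us")
```

Counted in cycles, the two Spin measurements land in the same ballpark (about 1760 versus about 1420 per loop), so the clock rate changes the wall-clock period but not the Spin-to-PASM ratio.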
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Samuel
www.RobotBASIC.com
VAR
  long a
PUB start | i
  repeat i from 1 to 1000000
    a := i + i
Catalina does this in 3 seconds and Spin takes 16 seconds. There are essentially four operations executed in the loop.
1. Add i to i and store in a
2. increment i
3. compare i to 1000000
4. jump to beginning of loop if not greater than 1000000
At 20 MOPS this would take 0.2 seconds. The Spin code executes the 4 million operations in 16 seconds, which works out to 0.25 MOPS. That's a ratio of 80 to 1.
Some of the time goes to accessing hub memory. The loop consists of 8 bytes of Spin code. The constant value of 1000000 occupies 3 of the 8 bytes of code. The variable "i" must be read three times and written once per loop. The variable "a" is written once per loop. Therefore, the total number of hub accesses is 13 per loop. Assuming an average of 16 cycles per hub access, this works out to about 200 cycles per loop.
Catalina takes 3 seconds to run, which is 240 cycles per loop, so it is fairly efficient. Spin uses 1,280 cycles per loop. Subtracting out the memory accesses leaves a little over 1,000 cycles per loop, which is about 63 cog instructions per operation. That seems like a reasonable number, but there must be ways to speed it up.
I have heard a bit about BST, and I'm interested in getting more information on it. Can someone run a benchmark for BST using the code listed above? I would be interested in seeing how it compares to the Spin interpreter. It would also be interesting to see how PASM compares with this simple loop.
Is there a webpage or thread that describes BST? It seems like all the information is scattered around in separate threads, and I can't find a simple overview of it.
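The estimate above can be reproduced with a few lines of Python. This is a sketch under the post's own assumptions: an 80 MHz system clock (implied by the 20 MOPS figure but not stated), 16 cycles per hub access on average, and 4 clocks per cog instruction. The variable names are illustrative only.

```python
CLOCK_HZ = 80_000_000            # assumption: 80 MHz system clock
LOOPS = 1_000_000                # repeat i from 1 to 1000000
OPS_PER_LOOP = 4                 # add/store, increment, compare, jump
PEAK_OPS_PER_S = CLOCK_HZ // 4   # 20 MOPS: one cog instruction per 4 clocks

total_ops = LOOPS * OPS_PER_LOOP
ideal_seconds = total_ops / PEAK_OPS_PER_S        # time at 20 MOPS
spin_mops = total_ops / 16 / 1e6                  # Spin needs 16 seconds
speed_ratio = (PEAK_OPS_PER_S / 1e6) / spin_mops  # peak vs. Spin

# Hub accesses per loop: 8 code bytes + 3 reads of i + 1 write of i + 1 write of a
hub_accesses = 8 + 3 + 1 + 1
hub_cycles = hub_accesses * 16                    # 16 cycles per access (assumed average)

def cycles_per_loop(seconds):
    """Clock cycles spent on each pass through the loop."""
    return seconds * CLOCK_HZ // LOOPS

spin_cycles = cycles_per_loop(16)
catalina_cycles = cycles_per_loop(3)
interpreter_cycles = spin_cycles - hub_cycles
instructions_per_op = interpreter_cycles / 4 / OPS_PER_LOOP

print(ideal_seconds, spin_mops, speed_ratio)
print(hub_accesses, hub_cycles)
print(spin_cycles, catalina_cycles, interpreter_cycles, instructions_per_op)
```

Using the unrounded 1,072 interpreter cycles gives 67 cog instructions per operation; rounding down to 1,000 cycles gives the "about 63" quoted in the post.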
Dave
If you want to speed it up, write it in PASM.
bst is part of a small suite of tools for those of us lucky enough not to use Windows. Aside from some very basic optional optimisations, it has a bit-for-bit compatible Spin compiler. I don't think you'll spot a speed increase over the Parallax compiler.
www.fnarfbargle.com/bst.html
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.
As far as operating systems, I tend to use the one that gets the job done quickest. In some cases that's Linux, and in other cases that's Windows. I even use a Mac sometimes when my daughter needs help with her MacBook.
Dave
I am new to PASM, so the code below may contain syntax/flow errors and/or not be the optimum (shortest) solution. It seems to take 5 instructions of 4 clock cycles each, so that would be 20 clock cycles per loop.
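DAT
              org     0
              mov     a, #0                   ' a is the loop counter (i in the Spin version)
loop_the_loop
              add     a, #1                   ' next i
              mov     b, a
              shl     b, #1                   ' b := i + i
              cmp     a, upper_limit wz       ' Z is set when a equals upper_limit
        if_nz jmp     #loop_the_loop
              'place more code here or stop your cog
a             long    0
b             long    0
upper_limit   long    1_000_000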
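As a rough check on what the 20-cycle estimate means in wall-clock terms, here is a small Python sketch. It assumes an 80 MHz system clock and that all five instructions in the loop take 4 clocks each; the 16 second and 3 second figures are the Spin and Catalina timings reported earlier in the thread. Nothing here was measured on hardware.

```python
CLOCK_HZ = 80_000_000        # assumption: 80 MHz system clock
CYCLES_PER_INSTRUCTION = 4   # assumption: every instruction in the loop takes 4 clocks
INSTRUCTIONS_PER_LOOP = 5    # add, mov, shl, cmp, jmp
LOOPS = 1_000_000

cycles_per_loop = INSTRUCTIONS_PER_LOOP * CYCLES_PER_INSTRUCTION
pasm_seconds = cycles_per_loop * LOOPS / CLOCK_HZ

SPIN_SECONDS = 16       # reported earlier in the thread
CATALINA_SECONDS = 3    # reported earlier in the thread

print(f"PASM estimate: {cycles_per_loop} cycles/loop, {pasm_seconds:.2f} s total")
print(f"Estimated speedup over Spin: {SPIN_SECONDS / pasm_seconds:.0f}x")
print(f"Estimated speedup over Catalina: {CATALINA_SECONDS / pasm_seconds:.0f}x")
```

If the estimate holds, the PASM loop would finish in about a quarter of a second.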
Personally I'd get into PASM. It's great. If you don't want to get that low level, I'd have a look at Bean's PropBasic. It's a high level language that directly generates PASM.
Andy
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
        mov     b, two_mil
halt    jmp     #halt
two_mil long    2_000_000
It still seems to me that a Spin-to-PASM compiler would be very useful. It would provide at least a 4X speedup over interpreted Spin. Some extensions could be added to the Spin language to force some variables into cog RAM instead of hub RAM. This would allow compiled Spin to approach the speed of PASM.
Dave