Spin Speed, Why So Slow?

Miner_with_a_PIC · 2010-01-13 22:41

I have noticed that Spin routines run very slowly. Here is an example:

repeat
  outa[15] := 1
  outa[15] := 0

The scope shows a period of 284 us with a clock frequency of 5 MHz (no PLL) and only one cog running, so each pass through the repeat loop takes about 1420 clock cycles. Compare this to its assembly counterpart:

:test_loop

mov outa,#0
nop
mov outa,testmask
jmp #:test_loop

This has a measured period of 3.24 us, or 16 clock cycles (12 if you exclude the nop).

My understanding is that the cog running Spin retrieves bytes from main memory via the hub and uses its locally loaded interpreter to process these commands. Hub access is slow, 7 to 22 clock cycles, but why is Spin still so very slow? Each simple Spin command in the example should take at most 22 clock cycles to retrieve via the hub and 4 clock cycles to process; instead the average is about 470 clock cycles each. Does anyone have the scoop as to why this is so? My main concern is that in speed-intensive applications Spin's speed constraint always plagues my progress, even when it is only there to direct/load the ASM code.
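A minimal Python sketch of the arithmetic in the question, for anyone who wants to re-run it with their own scope readings. The helper name `period_to_cycles` is made up for illustration, and counting the loop as three Spin statements (the repeat plus two assignments) simply follows the averaging used in the question.

```python
CLOCK_HZ = 5_000_000  # 5 MHz crystal, no PLL, as stated in the question

def period_to_cycles(period_us, clock_hz=CLOCK_HZ):
    """Convert a measured period in microseconds to system clock cycles."""
    return period_us * 1e-6 * clock_hz

spin_cycles = period_to_cycles(284.0)  # scoped period of the Spin loop
pasm_cycles = period_to_cycles(3.24)   # scoped period of the PASM loop

print("Spin cycles per loop:", round(spin_cycles))
print("PASM cycles per loop:", round(pasm_cycles, 1))
print("Average cycles per Spin statement:", round(spin_cycles / 3))
print("Spin/PASM ratio:", round(spin_cycles / pasm_cycles))
```

With the numbers quoted above this gives about 1420 and 16.2 cycles per loop, roughly 473 cycles per Spin statement, and a ratio of about 88 to 1.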

Thanks in advance...

Bobb Fwed · 2010-01-13 22:54

This is the age-old trade-off between high-level and low-level programming. High level gives you easy-to-read structure, logic, and general interface, but at a severe price (in speed).
The reason it is so slow is that the interpreter grabs the tokenized command, decodes it, and has to jump around a bit (in the interpreter) to figure out what pin, what value, and what to do.

There are projects (I have not explored them, but I am sure others will pipe in with info) that allow you to place a more optimized version of the spin interpreter into a cog. I think there was a significant speed increase, but not earth shattering.

If you want the speed, use PASM. If you have an issue with space because of the 512 long limit, use a Large Memory Model (LMM) system. I think you can do that at about an eighth the speed of PASM (someone correct me please).

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
April, 2008: when I discovered the answers to all my micro-computational-botherations!

Some of my objects:
MCP3X0X ADC Driver - Programmable Schmitt inputs, frequency reading, and more!
Simple Propeller-based Database - Making life easier and more readable for all your EEPROM storage needs.
String Manipulation Library - Don't allow strings to be the bane of the Propeller, bend them to your will!
Fast Inter-Propeller Comm - Fast communication between two propellers (1.37MB/s @100MHz)!

Post Edited (Bobb Fwed) : 1/13/2010 11:00:00 PM GMT

Peter Jakacki · 2010-01-13 22:59

Have you enabled the crystal oscillator and PLL? Otherwise you will be running at around 12MHz on the internal RC clock.

  _CLKMODE = XTAL1 + PLL16X

  _XINFREQ = 5_000_000

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
*Peter*

Mike Green · 2010-01-13 23:03

The following is a compiler listing from the BST Spin compiler for the code fragment you showed. Note that there are 7 Spin instructions involved in the loop. That averages out to 40us per instruction and 12 bytes to fetch. With a 5MHz clock and 4 cycles per instruction plus an average of 16 cycles per byte fetched, that's 0.8us per instruction and about 3us per byte. That's roughly 48 assembly instructions per Spin instruction, not bad for a general purpose stack-based interpreter.

6                      repeat
Addr : 0018: Label0000
7                         outa[15] :=1
Addr : 0018:             36  : Constant 2 $00000001
Addr : 0019:          38 0F  : Constant 1 Bytes - 0F 
Addr : 001B:          3D B4  : Register [Bit] op OUTA Write
8                         outa[15] :=0
Addr : 001D:             35  : Constant 1 $00000000
Addr : 001E:          38 0F  : Constant 1 Bytes - 0F 
Addr : 0020:          3D B4  : Register [Bit] op OUTA Write
Addr : 0022: Label0001
Addr : 0022: JMP Label0000
Addr : 0022:          04 74  : Jmp 0018 -12  
Addr : 0024: Label0002
Addr : 0024:             32  : Return
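The per-instruction estimate above can be reproduced with a short Python sketch. All inputs come from this thread (284 us loop period, 7 Spin instructions, 12 bytes, 4 cycles per cog instruction, an assumed average of 16 cycles per hub byte fetch); the variable names are only illustrative.

```python
CLOCK_HZ = 5_000_000       # 5 MHz, no PLL
LOOP_PERIOD_US = 284.0     # scoped period from the original question
SPIN_INSTRUCTIONS = 7      # bytecode instructions in the listing
BYTES_FETCHED = 12         # bytes of bytecode in the loop
CYCLES_PER_PASM = 4        # typical cog instruction
CYCLES_PER_HUB_BYTE = 16   # assumed average cost of one hub byte fetch

us_per_cycle = 1e6 / CLOCK_HZ
us_per_pasm = CYCLES_PER_PASM * us_per_cycle
us_per_hub_byte = CYCLES_PER_HUB_BYTE * us_per_cycle

# Average wall-clock time per Spin instruction
us_per_spin = LOOP_PERIOD_US / SPIN_INSTRUCTIONS

# Upper estimate: treat the whole time as cog instructions
pasm_per_spin_upper = us_per_spin / us_per_pasm

# Lower estimate: subtract the time spent fetching bytecode from the hub first
fetch_us = BYTES_FETCHED * us_per_hub_byte
pasm_per_spin_lower = (LOOP_PERIOD_US - fetch_us) / us_per_pasm / SPIN_INSTRUCTIONS

print(round(us_per_spin, 1), round(pasm_per_spin_lower), round(pasm_per_spin_upper))
```

Depending on whether the hub fetch time is subtracted, this lands at roughly 44 to 51 cog instructions per Spin instruction, the same ballpark as the "roughly 48" quoted above.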

Spin is actually quite fast. A Spin-only serial driver can run at over 19.2KBaud with a system clock of 80MHz. For higher speeds, you can use any of the LMM (Large Memory Model) interpreters and a matching C compiler like Catalina or ImageCraft's. That has very much lower overhead for execution speed at the expense of significantly increased memory requirements. You could also use one of the Forth interpreters which also have lower execution overhead or you could try Bean's new PropBasic compiler which compiles a variant of SX/B to assembly language.

Most programs spend most of their execution time in very small areas of the program. If you're optimizing for speed, you can get major speedups by converting just those small areas into assembly. The same is true for Spin programs. The kicker is that the small areas involved are often not the ones you think will benefit from the optimization. The combination of Spin and assembly is actually quite good with Spin heavily optimized for space and assembly being quite fast, but bulky.

heater · 2010-01-13 23:07

If you compile that repeat loop with BST you can get a listing of the byte codes it results in which look like this:

2                        repeat
Addr : 0018: Label0002
3                          outa[15] :=1
Addr : 0018:             36  : Constant 2 $00000001
Addr : 0019:          38 0F  : Constant 1 Bytes - 0F 
Addr : 001B:          3D B4  : Register [Bit] op OUTA Write
4                          outa[15] :=0
Addr : 001D:             35  : Constant 1 $00000000
Addr : 001E:          38 0F  : Constant 1 Bytes - 0F 
Addr : 0020:          3D B4  : Register [Bit] op OUTA Write
Addr : 0022: Label0003
Addr : 0022: JMP Label0002
Addr : 0022:          04 74  : Jmp 0018 -12  
Addr : 0024: Label0004
Addr : 0024:             32  : Return

Here we see 7 instructions in 12 bytes.

As Bobb says, for each one of those instructions the interpreter has to fetch the byte codes, increment a program counter, decode the thing, and do the operation. That may involve fetching some operand and writing out some result.

If you really want to see what goes on you would have to have a look at the actual interpreter code, which has been posted here somewhere.

Bear in mind that the interpreter has been written to fit into a COG so it is probably not optimized for speed. Although I believe Cluso has produced a version that is a tad faster.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.

heater · 2010-01-13 23:11

Blimey, Mike you had the same idea.

Miner_with_a_PIC · 2010-01-13 23:36

Heater, it's okay... I needed to read that information twice (actually several times) to get a better understanding anyway. I very much appreciate everyone's input on this; it appears that the interpreter lives up to its name, expending many of its clock cycles decoding (interpreting) high-level instructions into smaller, more manageable pieces.

There seems to be a trade-off between interpreter size and speed. My understanding is that the Prop2 will have increased main memory (256K?), perhaps the cog memory size will also be increased; an opportunity for a larger interpreter + faster spin processing may be accommodated on the Prop2. How much faster I am unsure, but it doesn't seem like very much based on Bobb's inputs. For now I look forward to programming assembly based solutions to gain a foothold on speed while using spin for projects that require slower speeds.

heater · 2010-01-13 23:55

I'm sure the Prop II will have the same COG memory. The COG instructions contain 9-bit operand fields. Changing that would be a major redesign of the whole architecture. However, it will have many other speed-ups, so this becomes less of an issue anyway.

Don't forget, it is not a case of selecting Spin or PASM on a project-by-project basis. Rather, use PASM for the parts of your project that really need it. Most things in OBEX, for example, are Spin/PASM mixtures.

Also helps that PASM is about the simplest assembly language to work with I have ever come across.

Roy Eltham · 2010-01-14 05:34

The Prop 2 will have 1 cycle instruction throughput though, instead of 4 cycles, so Spin will run faster clock for clock on the Prop 2.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Check out the Propeller Wiki and contribute if you can.

rjo_ · 2010-01-14 06:11

Miner,

This is a little OT but I can't help myself :)

The counters are practically free, you get two per cog and they work pretty much the same whether called from Spin or PASM...
AND consider the waitcnt function... amazing time resolution, and it doesn't seem affected by Spin at all... except for the first few thousand clocks.

Processors are processors... but the Prop is a controller through and through. When there is something better, I'd like an email to my PropPhone.

Rich

rjo_ · 2010-01-14 06:12

Better make that a text message... I don't know how to read emails with my PropPhone :)

max72 · 2010-01-14 08:25

Check the PASM webinar. The first example is pin toggling, spin vs Pasm..

www.parallax.com/tabid/766/Default.aspx

Massimo

Miner_with_a_PIC · 2010-01-14 17:52

@max72 >> Thanks for sharing that link; I am embarrassed to say that I did not realize that the webinar resource was available. I have spent perhaps the last 3 hours watching videos that have helped immensely with filling in the gaps, answering questions that I had and informing me of what is to come. I intend on watching most if not all that are available.

@rjo_ >> Good news regarding the waitcnt function, as it is perhaps one of the most used pieces of code in Spin... I believe 381 clock cycles is the minimum wait for Spin and 9 for assembly, provided the waitcnt occurs right after the add as it does in the examples.

@Roy >> This is awesome news as 1 clock cycle per instruction was a selling feature for the SX line of chips. I also noted the ambitious 160 MHz clock speed, 3 cycle jumps and special jmps that when coded correctly would take only 1 cycle. This Parallax team are really a great bunch of folks, very clever, open and willing to try new things rather than get stuck in a rut.

@Mike and Heater >> So far the webinars didn't go into as much detail answering my question as you guys did; many thanks for sharing. Where would I get a copy of the BST Spin compiler?

@Peter >> The concern was more related to a ratio of performance, where clock cycle to clock cycle comparison is more important than the speed the chip is running. The ratio is very nearly 100X faster for ASM over Spin. For many applications where human interaction is involved spin's speed is overkill but when interacting with hardware like ADC, DACs, EEPROMs or fast serial (>19,200 bps) using spin often becomes a speed limiter.

Mike Green · 2010-01-14 18:02

There's a link for BST (and other Propeller tools and documentation) on the Propeller Downloads page (of course):
www.parallax.com/tabid/442/Default.aspx

Peter Jakacki · 2010-01-14 18:12

Hi JT,

I realized my mistake after I posted, as I later read 5 MHz, no PLL, etc. It was rather early in the morning for me at the time (and also now) and I needed a coffee, so I put it down to that. I know, though, that there are trade-offs with byte interpreters, memory limitations, and slow hub access, the latter being a real killer as instructions have to be read in as unaligned bytes.

Miner_with_a_PIC · 2010-01-14 18:34

Peter, as you can see from my request for the painfully obvious above (the location of the BST compiler), we all have our moments, except I have no excuse, as I have had enough caffeine to give a small elephant a coronary.

max72 · 2010-01-14 21:16

Another nice place to browse around is:

propeller.wikispaces.com/

and don't forget to check the sticky in the forum and the obex...

Massimo

SamMishal · 2010-01-15 12:22

I am sure all the good people who replied have explained this.... but here is my two cents worth....

If you look at the webinar video done by Jeff Martin on Dec 12th you will see a very similar program to the one you have posted, both in PASM and SPIN. However, in Jeff's program he enabled the PLL and set the clock frequency to get an 80 MHz clock, not the 5 MHz (crystal, no PLL) of your measurement.

The program is shown at about minute 14 of the video; the oscilloscope output for the SPIN code is shown at minute 15, and the PASM output around minute 16.

From the webinar, the PASM code produces a square wave with a period of 200 nanoseconds, while the SPIN code produces a square wave with a period of 22 microseconds.

So these numbers are QUITE a bit faster than what you have quoted....due to the 80 MHz clocking rate.
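To separate the effect of the clock rate from the effect of the interpreter, the two sets of measurements in this thread can be converted to clock cycles. The Python sketch below assumes the webinar figures were taken at 80 MHz and the original question's figures at 5 MHz, as stated in the posts; the function name is illustrative only.

```python
def period_to_cycles(period_s, clock_hz):
    """Convert a period in seconds to system clock cycles."""
    return period_s * clock_hz

# Webinar figures quoted above, 80 MHz clock
webinar_pasm = period_to_cycles(200e-9, 80e6)
webinar_spin = period_to_cycles(22e-6, 80e6)

# Figures from the original question, 5 MHz clock
question_pasm = period_to_cycles(3.24e-6, 5e6)
question_spin = period_to_cycles(284e-6, 5e6)

print("webinar:  PASM", round(webinar_pasm), "cycles, Spin", round(webinar_spin), "cycles")
print("question: PASM", round(question_pasm), "cycles, Spin", round(question_spin), "cycles")
print("webinar Spin/PASM ratio:", round(webinar_spin / webinar_pasm))
```

In cycles the two measurements are of the same order (about 16 cycles for PASM in both cases, and about 1760 versus 1420 for Spin), so the faster wall-clock numbers come from the clock rate, while the Spin-to-PASM ratio stays around 100 to 1.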

So SPIN is not REALLY slow... it is fast in comparison to MANY other microcontrollers. It is only slow when compared to PASM, which is also much faster than MANY microcontrollers.

SPIN is fast enough to implement an RS-232 UART that can function up to 19200 bps, which is quite good for a high-level language.

Is a hare slow? Not when you compare it to humans... but yes when compared to cheetahs.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Samuel

www.RobotBASIC.com

Dave Hein · 2010-01-28 21:35

I was looking at the Catalina thread, and I saw the benchmark for the following loop:

VAR
  long a
PUB start | i
  repeat i from 1 to 1000000
    a := i + i

Catalina does this in 3 seconds and Spin takes 16 seconds. There are essentially four operations executed in the loop:

1. Add i to i and store in a
2. increment i
3. compare i to 1000000
4. jump to beginning of loop if not greater than 1000000

At 20 MOPS this would take 0.2 seconds. The Spin code executes the 4 million operations in 16 seconds, which works out to 0.25 MOPS. That's a ratio of 80 to 1.
Some of the time is spent accessing hub memory. The loop consists of 8 bytes of Spin code. The constant value of 1000000 takes up 3 of the 8 bytes of code. The variable "i" must be read three times and written once per loop. The variable "a" is written once per loop. Therefore, the total number of hub accesses is 13 per loop. Assuming an average of 16 cycles per hub access, this works out to about 200 cycles per loop.

Catalina takes 3 seconds to run, which is 240 cycles per loop. So it is fairly efficient. Spin uses 1,280 cycles per loop. Subtracting out the memory accesses leaves a little over 1,000 cycles per loop, which is about 63 cog instructions per operation. That seems like a reasonable number, but there must be ways to speed it up.
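The cycle budget above can be checked with a few lines of Python. This is only a sketch under the assumptions stated in the post (80 MHz clock, 13 hub accesses per loop at an average of 16 cycles each, 4 cycles per cog instruction); the constant names are made up for illustration.

```python
CLOCK_HZ = 80_000_000
LOOPS = 1_000_000
OPS_PER_LOOP = 4
HUB_ACCESSES_PER_LOOP = 13      # 8 bytecode bytes + 3 reads and 1 write of i + 1 write of a
CYCLES_PER_HUB_ACCESS = 16      # assumed average
CYCLES_PER_COG_INSTRUCTION = 4

def cycles_per_loop(run_seconds):
    """Clock cycles spent on each loop iteration for a given total run time."""
    return run_seconds * CLOCK_HZ / LOOPS

spin_cycles = cycles_per_loop(16)
catalina_cycles = cycles_per_loop(3)
hub_cycles = HUB_ACCESSES_PER_LOOP * CYCLES_PER_HUB_ACCESS

# Millions of operations per second achieved by Spin
spin_mops = LOOPS * OPS_PER_LOOP / 16 / 1e6

# Cog instructions left per operation after removing hub access time
cog_instr_per_op = (spin_cycles - hub_cycles) / OPS_PER_LOOP / CYCLES_PER_COG_INSTRUCTION

print(spin_cycles, catalina_cycles, hub_cycles, spin_mops, cog_instr_per_op)
```

Without rounding the intermediate values this gives 1280 and 240 cycles per loop, 208 hub cycles, 0.25 MOPS, and about 67 cog instructions per operation, close to the 63 obtained above from the rounded 1,000-cycle figure.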

I have heard some about BST, and I'm interested in getting more information on it. Can someone run a benchmark for BST using the code listed above? I would be interested in seeing how it compares to the Spin interpreter. It would also be interesting to see how PASM compares with this simple loop.

Is there a webpage or thread that describes BST? It seems like all the information is scattered around in separate threads, and I can't find a simple overview of it.

Dave

BradC · 2010-01-28 22:00

Dave Hein said...
Catalina takes 3 seconds to run, which is 240 cycles per loop. So it is fairly efficient. Spin uses 1,280 cycles per loop. Subtracting out the memory accesses leaves a little over 1,000 cycles per loop, which is about 63 Cog instructions per operation. That seems like a reasonable number, but there must be ways to speed it up.

If you want to speed it up, write it in PASM.

Dave Hein said...

I have heard some about BST, and I'm interested in getting more information on it. Can someone run a benchmark for BST using the code listed above? I would be interested in seeing how it compares to the Spin interpreter. It would also be interesting to see how PASM compares with this simple loop.

bst is part of a small suite of tools for those of us lucky enough not to use Windows. Aside from some very basic optional optimisations, it has a bit-for-bit compatible Spin compiler. I don't think you'll spot a speed increase over the Parallax compiler.

Dave Hein said...

Is there a webpage or thread that describes BST? It seems like all the information is scattered around in separate threads, and I can't find a simple overview of it.

www.fnarfbargle.com/bst.html

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.

Dave Hein · 2010-01-28 22:47

Sorry, I thought BST was a compiler that produced Propeller machine code. I have written lots of assembly in the past, but I haven't tackled PASM yet. Anyhow, that would defeat the purpose of writing in a high-level language. Does a Spin compiler exist that produces PASM or Propeller machine code? It seems like it would be a useful tool.

As far as operating systems, I tend to use the one that gets the job done quickest. In some cases that's Linux, and in other cases that's Windows. I even use a Mac sometimes when my daughter needs help with her MacBook.

Dave

Mike Green · 2010-01-28 23:00

There is no Spin compiler that produces native Propeller code. It's not really practical without a very sophisticated optimizer because the native instruction set doesn't support a lot of the data and control structures needed for a language like Spin. The limited cog memory space is also a profound limitation for compiling high level code (like Spin). Bean and JonnyMac's PropBasic is a compiler that produces PASM. It's based on SX/B and accepts a restricted subset of Basic ... still very useful. The language and compiler are still works in progress and very helpful for very simple and moderately simple programs where speed is important, thus PASM is needed.

Miner_with_a_PIC · 2010-01-28 23:01

Dave >> "It would also be interesting to see how PASM compares with this simple loop."

I am new to PASM, so the code below may contain syntax/flow errors and may not be the optimum (shortest) solution. The loop takes 5 instructions at 4 clock cycles each, so that would be 20 clock cycles per loop.

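DAT
              org     0

              mov     a, #0                 ' a is the loop counter

loop_the_loop
              add     a, #1                 ' next value of the counter
              mov     b, a
              shl     b, #1                 ' b := a + a
              cmp     a, upper_limit  wz    ' Z is set once a reaches the limit
        if_nz jmp     #loop_the_loop

              'place more code here or stop your cog

a             long    0
b             long    0
upper_limit   long    1000000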

BradC · 2010-01-28 23:01

Dave Hein said...

Does a Spin compiler exist that produces PASM or Propeller machine code? It seems like it would be a useful tool.

Personally I'd get into PASM. It's great. If you don't want to get that low level, I'd have a look at Bean's PropBasic. It's a high level language that directly generates PASM.

Ariba · 2010-01-29 00:29

Here is my Assembly version with timing calculation:

      org  0

      mov  a,#0
      mov  i,#0
      mov  count,OneM
loop  mov  t1,i
      add  t1,i
      add  a,t1
      add  i,#1
      djnz count,#loop
halt  jmp  #halt

OneM  long 1_000_000
a     res  1
i     res  1
count res  1
t1    res  1

' 5 Instructions in loop = 20 clockcycles = 250ns
' 250ns * 1000000 = 0.25 seconds total
' 16 s / 0.25 s = 64 times faster than Spin

Andy

Bill Henning · 2010-01-29 00:34

      org  0

      mov  a,#0
      mov  i,#0
      mov  count,OneM
loop  mov  a,i
      add  a,i
      add  i,#1
      djnz count,#loop
halt  jmp  #halt

OneM  long 1_000_000
a     res  1
i     res  1
count res  1
t1    res  1

' 4 Instructions in loop = 16 clockcycles = 200ns
' 200ns * 1000000 = 0.20 seconds total
' 16 s / 0.20 s = 80 times faster than Spin

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system

Ariba · 2010-01-29 01:29

Ah, yes. I did not read it carefully. Perhaps because overwriting a every time does not make much sense in real code.

Bill Henning · 2010-01-29 01:45

Exactly... a human could optimize it to:

      mov  a,two_mil
halt  jmp  #halt

two_mil long 2_000_000

Dave Hein · 2010-01-29 03:50

The assembly cycles are similar to, or the same as, my estimate of 0.2 seconds from my initial post. However, to fairly compare Spin and PASM the variables should be in hub RAM. This would add 7 to 22 cycles for each memory access. If "i" is read once and written once per loop, and "a" is written once per loop, the total number of cycles would increase from 4 to between 25 and 70. This would be quite a bit slower, but still much faster than Spin.
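One reading of that range, sketched in Python: three hub accesses at 7 to 22 cycles each, plus 4 cycles of other work per loop. The 4-cycle remainder is an assumption chosen because it reproduces the 25 to 70 figures; the run times assume the 80 MHz clock and the 16-second Spin result quoted earlier in the thread.

```python
CLOCK_HZ = 80_000_000
LOOPS = 1_000_000
SPIN_SECONDS = 16.0

HUB_ACCESSES = 3               # read i, write i, write a
HUB_MIN, HUB_MAX = 7, 22       # cycles per hub access, as quoted in the thread
OTHER_CYCLES = 4               # assumption: remaining non-hub work per loop

best_cycles = HUB_ACCESSES * HUB_MIN + OTHER_CYCLES
worst_cycles = HUB_ACCESSES * HUB_MAX + OTHER_CYCLES

best_seconds = best_cycles * LOOPS / CLOCK_HZ
worst_seconds = worst_cycles * LOOPS / CLOCK_HZ

print(best_cycles, worst_cycles)
print(best_seconds, worst_seconds)
print(round(SPIN_SECONDS / worst_seconds), round(SPIN_SECONDS / best_seconds))
```

Under these assumptions the million-iteration loop would take roughly 0.31 to 0.88 seconds, which is about 18 to 51 times faster than the 16 seconds measured for Spin.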

It still seems to me that a Spin-to-PASM compiler would be very useful. It would provide at least a 4X speedup over interpreted Spin. Some extensions could be added to the Spin language to force some variables into cog RAM instead of hub RAM. This would allow compiled Spin to approach the speed of PASM.

Dave

Mike Green · 2010-01-29 03:57

@Dave - To fairly compare Spin and PASM, you have to include the startup overhead for the PASM code which is substantial. Alternatively, you could have a kernel in each cog used for PASM that would copy blocks of code as needed from hub to cog and transfer control to the copied code, sort of like an overlay loader.

Bobb Fwed · 2010-01-29 17:13

Now why does SPIN use more power than PASM? Do hub operations draw more power?
