
Spin speed up

richaj45 Posts: 179
edited 2012-01-10 10:44 in Propeller 1
Hello:

I was wondering what it would take to speed up Spin?

Maybe a new compiler that generates a "Large Memory Model" program, or maybe using multiple cogs to speed up interpretation of the current byte codes.

What do people think?

rich

Comments

  • Heater. Posts: 21,230
    edited 2012-01-02 15:27
    A faster processor, Prop II is on the way:)

    I don't think having Spin generate LMM code is desirable, especially not for the Prop I. The beauty of Spin is that it produces compact byte code, allowing you to squeeze more functionality into the limited space. LMM code is much bigger. Besides, if you want the speed of LMM we now have Catalina and propgcc.

    Using multiple COGs may not give you much boost, whilst at the same time wasting COGs. In my efforts at creating a multi-cog Z80 emulator (basically byte-sized instructions) some time ago I could not get much performance out of it. Perhaps I did not try hard enough, but the overheads of passing data and control back and forth between COGs were a killer then.
  • richaj45 Posts: 179
    edited 2012-01-02 17:42
    Thanks for the feedback.

    Sounds like it is a dead end subject.

    cheers,
    rich
  • Dave Hein Posts: 6,347
    edited 2012-01-02 19:21
    You can get the most out of Spin by structuring statements to use the smallest number of Spin bytecodes. As an example, use X++ instead of X += 1 or X := X + 1. The compiler makes no attempt at optimization, and it will generate Spin bytecodes exactly as the statement is written. The first 8 stack variables and the first 8 longs in a VAR section are accessed with a single bytecode, so they should be used as much as possible. When addressing arrays, avoid the zero index; that is, use X instead of X[0]. There are a number of other tricks and tips that have been written up somewhere.
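
    For illustration, a small sketch along those lines (untested; the names here are just placeholders):

      VAR
        long counter, total, scale        ' keep the busiest globals within the first 8 VAR longs
        long scratch[16]                  ' bulkier data can follow

      PUB tick | t                        ' the first 8 locals also get the compact bytecodes
        counter++                         ' fewer bytecodes than counter := counter + 1
        t := scratch                      ' same as scratch[0], but skips the index bytecodes
        total := t * scale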

    There have been a few attempts at speeding up the Spin interpreter. I believe the speed improvements have been around 25% to 33%. In theory, it would be possible to write a Spin compiler that generates LMM code. This would run quite a bit faster -- around 4 times as fast or faster. Of course, this would also require more hub RAM for program space. Another approach would be to convert Spin source to C source, and then compile it with Catalina or PropGCC. However, if you would go this route you might as well just program in C.
  • Cluso99 Posts: 18,069
    edited 2012-01-03 00:25
    I did a faster Spin interpreter and gained around a 20-25% improvement, with the maths showing the best improvement. With LMM, this could be improved somewhat. I used a bytecode vector table in hub to achieve the biggest improvement, and that freed space in the cog to unravel the twisted code, making it much faster in sections. At the time I did not have enough info about where to spend the time to achieve better performance.
    However, with that speedup, plus the fact that I regularly overclock from 80MHz to 104MHz (a 30% improvement on all 8 cores) and that 108MHz is most likely stable too (35%), a nice increase can be had.
  • Heater. Posts: 21,230
    edited 2012-01-03 01:33
    If I understand correctly the Prop II will not have Spin interpreter in ROM. It will be added to the RAM image by the compiler. That means that a speed up such as Cluso's dispatch table in RAM could become the norm for a new improved interpreter. Who knows what else could be done with that?
  • prof_braino Posts: 4,313
    edited 2012-01-03 04:51
    If you can use Forth, propforth is noticeably faster. It can also be optimized in assembler at the bottlenecks.
    Functions can be loaded into and removed from memory dynamically, so the 32k design of the prop that is commonly listed as a "drawback" is not much of an issue.
    It can run from SD card and can execute from scripts on the fly, so there is no real limit to program size, as long as a given module can fit in the available memory.
    And it's interactive, so development iterations are quicker. But using Forth is considered "cheating" since it's so easy.
  • Humanoido Posts: 5,770
    edited 2012-01-03 05:26
    richaj45 wrote: »
    Hello: I was wondering what it would take to speed up Spin? What do people think...rich

    I've seen where Spin processing is over 100 times faster with or without a clock using the following formula:

    X = NP where
    N = Number of Propeller chips
    P = Propeller chip
    X = SPIN Language Speed Increase

    If the number of Propeller chips is 100, then
    X = 100P for an increase of 100 times the processing speed of a single Propeller.
    The gain is from multiple SPIN instances of parallel thinking, not the overhead.
    You may want to figure in multiple instances of Cogs by using 8X instead of X.
    http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain
  • richaj45 Posts: 179
    edited 2012-01-03 06:46
    Cluso99:

    Can I get a copy of the Spin speedup you did?

    Here is a what-if question:
    If a COG were changed so that its memory space was 1024 longs instead of 512, would that have helped to speed up Spin?

    cheers,
    rich
  • Heater. Posts: 21,230
    edited 2012-01-03 08:33
    richaj45,
    If a COG were changed so that its memory space was 1024 longs instead of 512

    But how would you do that? The 512 limit isn't just that that's all the RAM it has; it's built into the instruction set with the 9-bit src and dst fields. So it's not a simple case of just tacking on more RAM.
  • Dave Hein Posts: 6,347
    edited 2012-01-03 08:58
    Extending the size of the cog memory has been discussed before. There are several ways to get around the 9-bit src/dst field size, such as paging, relative addressing, index registers, etc.
  • jazzed Posts: 11,803
    edited 2012-01-03 11:22
    Having an LMM-like Spin interpreter would be useful for various reasons. An LMM-like interpreter would allow for using high density Spin and would be faster of course although single threaded (the slow interpreter could still be used for multi-cog spin threads). Having an open source Spin compiler would enable many more interesting things like 32 bit Spin programs, etc....
  • Heater. Posts: 21,230
    edited 2012-01-03 11:48
    Jazzed,
    "...would be faster...".

    I'm not sure I'm with you. Do you mean compile Spin to LMM?
    In which case the Spin interpreter becomes an LMM kernel, faster but bigger.
    Or do you mean an LMM Spin interpreter running Spin byte codes? Which I might have guessed would be slower.
  • Cluso99 Posts: 18,069
    edited 2012-01-03 13:44
    jazzed wrote: »
    Having an LMM-like Spin interpreter would be useful for various reasons. An LMM-like interpreter would allow for using high density Spin and would be faster of course although single threaded (the slow interpreter could still be used for multi-cog spin threads). Having an open source Spin compiler would enable many more interesting things like 32 bit Spin programs, etc....

    Yes, an open-sourced Spin compiler (and PASM compiler) would have opened a lot of things up for developing the prop. Even with the problems with the PropTool source, it would have helped things along.

    Spin Interpreter:

    The addition of LMM to the Spin mix (as in moving the relatively unused sections to LMM), plus the decode vector improvement, would lead to some considerable gains. Unfortunately, more than 512 longs will not happen, even with banking, although I agree it is a shame. However, the Prop II does introduce a usable FIFO which can be used as the stack in the Spin interpreter. This would provide quite an improvement. But the whole Prop II will improve Spin significantly, and the interpreter will be soft, so it can be extended.

    Richaj45: See my signature below in the tools section. There is a link to the Faster Spin Interpreter thread. It includes the source, plus a better writeup of the spin interpreter byte codes.
  • jazzed Posts: 11,803
    edited 2012-01-03 14:57
    Heater. wrote: »
    Jazzed,
    "...would be faster...".

    I'm not sure I'm with you. Do you mean compile Spin to LMM?
    In which case the Spin interpreter becomes an LMM kernel, faster but bigger.
    Or do you mean an LMM Spin interpreter running Spin byte codes? Which I might have guessed would be slower.

    "LMM-like" ... I.E. overlay.

    Byte-codes can be pre-loaded a long at a time, most interpreter routines kept pre-loaded, less-used routines loaded on demand, plus other tricks like writing routines that don't have tons of NOPs the way the current interpreter's CZ encodings do, etc.


    Ray, if you just point to one interpreter file as your de facto standard, we could all start testing with it.
  • Dave Hein Posts: 6,347
    edited 2012-01-03 15:38
    I looked at various ways of speeding up the interpreter when we were investigating Big Spin -- http://forums.parallax.com/showthread.php?128921-Big-Spin-is-it-still-a-pipedream&highlight=bigspin . I tried jump tables, LMM and dual-cog implementations. I profiled the bytecodes that were used, and found that about 8 or 9 bytecodes accounted for about 50% of the opcodes that were executed. This looked promising, and I thought that we could optimize a small number of the opcodes to get a significant gain. However, I was only able to get about a 33% improvement after optimizing a large number of the opcodes. The LMM overhead on the unoptimized opcodes was dragging the performance down.

    If a cog had about 200 or 300 more longs of memory we could see some real improvement -- possibly by a factor of 2. The 496 longs of memory is just a bit too tight to allow for an efficient implementation.
  • JasonDorie Posts: 1,930
    edited 2012-01-03 15:49
    I've found on many occasions that SPIN itself is quite fast, but the way I use it is slowing it down (essentially what Dave Hein mentions in post #4).

    I've written a couple of I2C drivers for sensors that initially weren't fast enough, but by timing and changing the SPIN source I was able to double or triple the execution speed. A 20% speedup in the interpreter pales in comparison to what we could expect from an optimizing SPIN compiler.

    One of the things on my to-do list is to write and time a whole bunch of different expressions so I can get a feel for how fast they actually are (as well as which forms of things are faster and by how much) and document the result. I also want to get a feel for how much overhead there is in calling a method in an object vs doing something inline, etc. I've been surprised a number of times at how fast I was able to make SPIN code go when I really wanted to.
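
    Something like this rough sketch is what I have in mind for the timing (untested; subtract the cost of an empty measurement so only the expression itself is counted):

      PUB TimeMul3(y) : cycles | t0, overhead, x
        ' cost of the measurement itself (the two cnt reads plus the subtraction)
        t0 := cnt
        overhead := cnt - t0

        ' now time the expression under test
        t0 := cnt
        x := y * 3
        cycles := cnt - t0 - overhead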

    Jason
  • Cluso99 Posts: 18,069
    edited 2012-01-03 15:58
    Steve, IIRC the latest code I posted was at the end of my thread and on OBEX. Will check later as I'm out now.

    There were certainly more improvements to be made, and a mix of LMM and overlays would improve things, as more space allows better speed on highly used functions. I still had space and was at this point when I stopped development.

    One improvement I thought of was to save the pushes and pops between bytecodes, but this requires compiler mods to take advantage of it and breaks compatibility. There may be other ways to do this.

    I would think that reading longs would not help, because the overhead to check this would defeat any benefit and use valuable space.
  • Cluso99 Posts: 18,069
    edited 2012-01-03 16:10
    JasonDorie wrote: »
    I've found on many occasions that SPIN itself is quite fast, but the way I use it is slowing it down (essentially what Dave Hein mentions in post #4).

    I've written a couple of I2C drivers for sensors that initially weren't fast enough, but by timing and changing the SPIN source I was able to double or triple the execution speed. A 20% speedup in the interpreter pales in comparison to what we could expect from an optimizing SPIN compiler.

    One of the things on my to-do list is to write and time a whole bunch of different expressions so I can get a feel for how fast they actually are (as well as which forms of things are faster and by how much) and document the result. I also want to get a feel for how much overhead there is in calling a method in an object vs doing something inline, etc. I've been surprised a number of times at how fast I was able to make SPIN code go when I really wanted to.

    Jason

    How true. Unfortunately it makes the code less readable. An optimising compiler would do much of this, though. And every 20% helps too.

    BTW, I am noticing problems editing posts with my Xoom. Anyone else noticed this?
  • jazzed Posts: 11,803
    edited 2012-01-03 18:20
    Cluso99 wrote: »
    One improvement I thought of was to save the pushes and pops between bytecodes, but this requires compiler mods to take advantage of it and breaks compatibility. There may be other ways to do this.
    Yes, and an open source compiler is in the works, but it feels something like "Waiting for Godot".
    Cluso99 wrote: »
    I would think that reading longs would not help, because the overhead to check this would defeat any benefit and use valuable space.
    I was thinking that several bytecodes are single bytes, and reading a long costs no more than reading a byte, so reading 4 bytes one at a time would take approx. 800ns instead of approx. 200ns at 80MHz ... surely the accounting could be done in much less than 12 instructions.
  • Cluso99 Posts: 18,069
    edited 2012-01-04 23:31
    Steve: Maybe it could be done in 12 instructions, but remember we will not have 12 longs to spare, or some other bytecode will slow down. Still, it would definitely be worthwhile to try it.

    My latest version of the Spin Interpreter is 260C_007 posted here
    http://forums.parallax.com/showthread.php?104276-Spin-Interpreter-Faster Post #40 and the bytecodes are explained in the file in Post #37

    It can be traced with my Zero Footprint debugger (see the Tools link in my signature for a link).
  • T Chap Posts: 4,223
    edited 2012-01-06 12:48
    Is there a compilation of SPIN optimizing tricks?

    Priorities are first to reduce program size, speed is second.
  • JasonDorie Posts: 1,930
    edited 2012-01-06 15:02
    One good one is to use constant() wherever you can. For example, if you have this line of code:

    result := (65536 / 32) * value

    It will compile to fewer instructions (and go faster) if you do this:

    result := constant(65536 / 32) * value

    The compiler makes no attempt at optimization, so even seemingly obvious things like that get ignored by it. Anywhere in your code that you're doing math with constant inputs may be able to take advantage of that.

    Also, just packing expressions results in faster / smaller code. IE, use this:

    X := ((Input * 5) + Center) / Scale

    Instead of this:

    X := Input * 5
    X += Center
    X /= Scale


    And last, don't call a function if you can do the code inline. It makes things less readable, but the overhead of a function call isn't zero, so writing larger block functions helps.
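
    As a rough illustration (made-up names, not timed), the second form pays the call and return bytecode overhead on every use:

      PUB ScaleInline(v) : r
        r := ((v * 5) + 128) / 16       ' one packed expression, no call overhead

      PUB ScaleByCall(v) : r
        r := (Mul5(v) + 128) / 16       ' extra bytecodes to push the argument, call and return

      PRI Mul5(v) : r
        r := v * 5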
  • T Chap Posts: 4,223
    edited 2012-01-06 15:35
    JasonDorie wrote: »
    One good one is to use constant() wherever you can. For example, if you have this line of code:

    result := (65536 / 32) * value

    It will compile to fewer instructions (and go faster) if you do this:

    result := constant(65536 / 32) * value

    The compiler makes no attempt at optimization, so even seemingly obvious things like that get ignored by it. Anywhere in your code that you're doing math with constant inputs may be able to take advantage of that.

    Also, just packing expressions results in faster / smaller code. IE, use this:

    X := ((Input * 5) + Center) / Scale

    Instead of this:

    X := Input * 5
    X += Center
    X /= Scale


    And last, don't call a function if you can do the code inline. It makes things less readable, but the overhead of a function call isn't zero, so writing larger block functions helps.


    That is a neat trick, never heard of it. Thanks

    PUB contest | xxx
        'xxx := constant(65536 / 32) * xxx    '4 longs
        xxx := (65536 / 32) * xxx     '5 longs
    
  • JasonDorie Posts: 1,930
    edited 2012-01-06 16:53
    Using operators is faster when possible too. For example, if you want an absolute value, use ||Value instead of if( Value < 0 ) Value := -Value.

    That one is kind of obvious, but Spin has operators for clamping (#> and <#), bit translation (lookup & lookdown), and a few other things that aren't particularly common. Also, in Spin generally the number of operands is the deciding factor, whereas in C or ASM the complexity of the operators matters more. So, in C I might do a multiply by 3 by doing:

    X := X + (X<<1)

    ...because that would be faster. In Spin, it'll be slower than just doing the multiply, and using X *= 3 is even faster still.
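
    For reference, the operator forms mentioned above look like this (an untimed sketch):

      PUB Condition(raw) : v
        v := ||raw                      ' absolute value with a single operator
        v := 0 #> v <# 255              ' clamp to 0..255 with the limit operators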
  • kuroneko Posts: 3,623
    edited 2012-01-06 22:47
    JasonDorie wrote: »
    So, in C I might do a multiply by 3 by doing:

    X := X + (X<<1)

    ...because that would be faster. In Spin, it'll be slower than just doing the multiply, and using X *= 3 is even faster still.
    Note that x *= 3 has only an advantage when x is not among the first 8 variables. And x += x << 1 is always fastest in this case. So in the end it nearly always boils down to trial and error.
  • JasonDorie Posts: 1,930
    edited 2012-01-07 00:35
    Interesting. I had always been told that the operand count was pretty much the limiting factor, but I decided to actually *time* this stuff to see how long it actually takes. Here's how many cycles a few simple operations took, and the overhead of performing the timing itself has been removed:
      'Vars in the first 8 longs
      x := y * 3                    '1264
      x := y / 3                    '1264
      x *= 3                        '1184
      x := x + (x<<1)               '1024
      x := y + (y<<1)               '1024
      x += (x<<1)                   ' 944                        
      x += (y<<1)                   ' 944                        
    
      x #>= 1000                    '624 each, 1248 for both lines
      x <#= 2000
      
      x := 1000 #> x <# 2000        '1088
    
      'Adding vars in the next 8 longs (p,q,r,s)
      x += (s<<1)                   '1056                        
      q += (s<<1)                   '1168                        
      x := q * 3                    '1376
      p := q * 3                    '1488
    

    I have also been told that there was an advantage to using the first 8 longs in an object, but I never knew how much - each variable access not in the first 8 longs adds about 100 cycles.
  • Dave Hein Posts: 6,347
    edited 2012-01-07 04:58
    It may be worth writing a Spin optimizer after all. It's interesting that no one has ever documented the number of cycles for each Spin bytecode. But then again, an official Spin bytecode language has never been defined either.

    An optimizer could be written as a pre-processor that converts statements into more efficient forms. If it was smart enough, it could identify when inefficient variables are used multiple times in a statement and copy them into a temporary variable that is defined within the first 8 stack longs.
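
    By way of illustration, the kind of rewrite such a pre-processor might perform could look like this (a hand-written sketch with invented names, not generated output):

      VAR
        long a, b, c, d, e, f, g, h     ' the first 8 VAR longs get the compact bytecodes
        long heading                    ' the 9th long needs the longer form

      PUB Unoptimized : r
        r := heading * heading + heading      ' three of the slower accesses

      PUB Optimized : r | t                   ' t lands in the first 8 stack longs
        t := heading                          ' one slower access
        r := t * t + t                        ' the rest use the compact bytecode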
  • T Chap Posts: 4,223
    edited 2012-01-07 05:51
    I think it would be useful to have a thread dedicated to optimized SPIN methods that included comparison results, like size, and if possible cycles.

    Here is another obvious example:
    PUB contest | x
         'x := x+1     '4 longs
          x++           '3 longs
    

    Using min or max limits can cause problems with post ++ or --; the limits don't work with these.

    I would be interested to find out how you would 'time' the code.
  • Dave Hein Posts: 6,347
    edited 2012-01-07 08:31
    I ran the code that was posted above through spinsim to measure the number of cycles for each Spin bytecode. I first ran it as "spinsim test.binary -l" to get the bytecode listing. Then I ran it using the PASM interpreter in ROM with "spinsim test.binary -l -p". I wrote a small program to locate the top of the interpreter loop, and printed out the number of cycles per loop. The merged listing is shown below.

    The number of cycles is shown at the beginning of the line. The hex numbers after "Cog 0:" are the bytecode address, value on the top of the stack and the instruction bytecodes.

    The Spin bytecode mnemonics have the following meaning
    ldllc Load Long Local Compact
    stllc Store Long Local Compact
    ldlip Load Long Immediate Packed
    exllc Execute Long Local Compact
    ldll  Load Long Local
    stll  Store Long Local
    ldwi  Load Word Immediate
    
    pub main | x, y, t3, t4, t5, t6, t7, p, q, r, s
     'Vars in the first 8 longs
      x := y * 3                    '1264
     160 Cog 0: 00ac 00000000 - 0018 68              ldllc $8
     176 Cog 0: 00b0 00000000 - 0019 37 21           ldlip $3
     768 Cog 0: 00b4 00000003 - 001b f4              mul
     160 Cog 0: 00b0 00000000 - 001c 65              stllc $4
    
      x := y / 3                    '1264
     160 Cog 0: 00ac 00000000 - 001d 68              ldllc $8
     176 Cog 0: 00b0 00000000 - 001e 37 21           ldlip $3
     768 Cog 0: 00b4 00000003 - 0020 f6              div
     160 Cog 0: 00b0 00000000 - 0021 65              stllc $4
    
      x *= 3                        '1184
     176 Cog 0: 00ac 00000000 - 0022 37 21           ldlip $3
    1008 Cog 0: 00b0 00000003 - 0024 66 54           exllc $4 mul 
    
      x := x + (x<<1)               '1024
     160 Cog 0: 00ac 00000000 - 0026 64              ldllc $4
     160 Cog 0: 00b0 00000000 - 0027 64              ldllc $4
     128 Cog 0: 00b4 00000000 - 0028 36              ldli1
     208 Cog 0: 00b8 00000001 - 0029 e3              shl
     208 Cog 0: 00b4 00000000 - 002a ec              add
     160 Cog 0: 00b0 00000000 - 002b 65              stllc $4
    
      x := y + (y<<1)               '1024
     160 Cog 0: 00ac 00000000 - 002c 68              ldllc $8
     160 Cog 0: 00b0 00000000 - 002d 68              ldllc $8
     128 Cog 0: 00b4 00000000 - 002e 36              ldli1
     208 Cog 0: 00b8 00000001 - 002f e3              shl
     208 Cog 0: 00b4 00000000 - 0030 ec              add
     160 Cog 0: 00b0 00000000 - 0031 65              stllc $4
    
      x += (x<<1)                   ' 944                        
     160 Cog 0: 00ac 00000000 - 0032 64              ldllc $4
     128 Cog 0: 00b0 00000000 - 0033 36              ldli1
     208 Cog 0: 00b4 00000001 - 0034 e3              shl
     448 Cog 0: 00b0 00000000 - 0035 66 4c           exllc $4 add 
    
      x += (y<<1)                   ' 944                        
     160 Cog 0: 00ac 00000000 - 0037 68              ldllc $8
     128 Cog 0: 00b0 00000000 - 0038 36              ldli1
     208 Cog 0: 00b4 00000001 - 0039 e3              shl
     448 Cog 0: 00b0 00000000 - 003a 66 4c           exllc $4 add 
    
      x #>= 1000                    '624 each, 1248 for both lines
     176 Cog 0: 00ac 00000000 - 003c 39 03 e8        ldwi 1000
     448 Cog 0: 00b0 000003e8 - 003f 66 44           exllc $4 minlim 
    
      x <#= 2000
     176 Cog 0: 00ac 00000000 - 0041 39 07 d0        ldwi 2000
     448 Cog 0: 00b0 000007d0 - 0044 66 45           exllc $4 maxlim 
    
      x := 1000 #> x <# 2000        '1088
     176 Cog 0: 00ac 00000000 - 0046 39 03 e8        ldwi 1000
     160 Cog 0: 00b0 000003e8 - 0049 64              ldllc $4
     208 Cog 0: 00b4 000003e8 - 004a e4              minlim
     176 Cog 0: 00b0 000003e8 - 004b 39 07 d0        ldwi 2000
     208 Cog 0: 00b4 000007d0 - 004e e5              maxlim
     160 Cog 0: 00b0 000003e8 - 004f 65              stllc $4
    
      'Adding vars in the next 8 longs (p,q,r,s)
      x += (s<<1)                   '1056                        
     272 Cog 0: 00ac 00000000 - 0050 cc 2c           ldll $2c
     128 Cog 0: 00b0 00000000 - 0052 36              ldli1
     208 Cog 0: 00b4 00000001 - 0053 e3              shl
     448 Cog 0: 00b0 00000000 - 0054 66 4c           exllc $4 add 
    
      q += (s<<1)                   '1168                        
     272 Cog 0: 00ac 00000000 - 0056 cc 2c           ldll $2c
     128 Cog 0: 00b0 00000000 - 0058 36              ldli1
     208 Cog 0: 00b4 00000001 - 0059 e3              shl
     560 Cog 0: 00b0 00000000 - 005a ce 24 4c        exll $24 add 
    
      x := q * 3                    '1376
     272 Cog 0: 00ac 00000000 - 005d cc 24           ldll $24
     176 Cog 0: 00b0 00000000 - 005f 37 21           ldlip $3
     768 Cog 0: 00b4 00000003 - 0061 f4              mul
     160 Cog 0: 00b0 00000000 - 0062 65              stllc $4
    
      p := q * 3                    '1488
     272 Cog 0: 00ac 00000000 - 0063 cc 24           ldll $24
     176 Cog 0: 00b0 00000000 - 0065 37 21           ldlip $3
     768 Cog 0: 00b4 00000003 - 0067 f4              mul
     272 Cog 0: 00b0 00000000 - 0068 cd 20           stll $20
    
      x := x+1     '4 longs
     160 Cog 0: 00ac 00000000 - 006a 64              ldllc $4
     128 Cog 0: 00b0 00000000 - 006b 36              ldli1
     208 Cog 0: 00b4 00000001 - 006c ec              add
     160 Cog 0: 00b0 00000001 - 006d 65              stllc $4
    
      x++           '3 longs
     336 Cog 0: 00ac 00000000 - 006e 66 2e           exllc $4 postinc 
    
     208 Cog 0: 00ac 00000000 - 0070 32              ret
     208 Cog 0: 0074 00000032 - fff9 3f 89           ldreg $1e9
         Cog 0: 0078 00000000 - fffb 21              cogstop
    
  • Publison Posts: 12,366
    edited 2012-01-07 09:27
    T Chap wrote: »
    Is there a compilation of SPIN optimizing tricks?

    Priorities are first to reduce program size, speed is second.

    Maybe the closest one would be Phil's "Tips and Traps" in the Propeller Stickies:

    http://forums.parallax.com/showthread.php?84031-Propeller-Tricks-amp-Traps-(Last-update-21-June-2007)&p=575479#post575479