Spin speed up
Hello:
I was wondering what it would take to speed up Spin?
Maybe a new compiler that generates a "Large Memory Model" program or maybe use multiple cogs to speed up interpretation of the current byte codes.
What do people think>
rich
I was wondering what it would take to speed up Spin?
Maybe a new compiler that generates a "Large Memory Model" program or maybe use multiple cogs to speed up interpretation of the current byte codes.
What do people think>
rich

Comments
I don't think having Spin generate LLM code is desirable, especially not for the Prop I. The beauty of Spin is that it produces compact byte code allowing you to squeeze more functionality into the limited space. LLM code is much bigger. Besides if you want the Speed of LMM we now have Catalina and propgcc.
Using multiple COGS may not give you much boost whilst at the same time wasting COGs. In my efforts at creating a multi-cog Z80 emulator (Basically byte sized instructions) some time ago I could not get performance out of it. Perhaps I did not try hard enough but the overheads of passing data and control back and forth between COGs were a killer then.
Sounds like it is a dead end subject.
cheers,
rich
There have been a few attempts at speeding up the Spin interpreter. I believe the speed improvements have been around 25% to 33%. In theory, it would be possible to write a Spin compiler that generates LMM code. This would run quite a bit faster -- around 4 times as fast or faster. Of course, this would also require more hub RAM for program space. Another approach would be to convert Spin source to C source, and then compile it with Catalina or PropGCC. However, if you would go this route you might as well just program in C.
However, with that speedup and the fact I overclock from 80MHz to 104MHz regularly (30% improvement on all 8 cores) and that 108MHz is most likely stable too (35%), a nice increase can be had.
Functions can be loaded and eliminated dynamically from memory, so the 32k design of the prop that is commonly listed as a "drawback" is not much of an issue.
It can run from SD card, and can execute from scripts on the fly, so there is no real limit to program size, as long as a given module can fit in the available memeory.
And it interactive, so development iterations are quicker. But, using forth is considered "cheating" since its so easy.
I've seen where Spin processing is over 100 times faster with or without a clock using the following formula:
X = NP where
N = Number of Propeller chips
P = Propeller chip
X = SPIN Language Speed Increase
If the number of Propeller chips is 100, then
X = 100P for an increase of 100 times the processing speed of a single Propeller.
The gain is from multiple SPIN instances of parallel thinking, not the overhead.
You may want to figure in multiple instances of Cogs by using 8X instead of X.
http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain
Can i get a copy of the Spin speedup you did?
Here is a what-if questions:
If a COG were changed so that it's memory space was 1024 longs instead of 512 would that have helped to speedup Spin?
cheers,
rich
But how would you do that? The 512 limit isn't just that that's all it has. It's built into the instruction set with the 9 bit src and dst fields. So not a simple case of just tacking on more RAM.
"...would be faster...".
I'm not sure I'm with you. Do you mean compile Spin to LMM?
In which case the Spin interpreter become an LMM kernel, faster but bigger.
Or do you mean an LMM Spin interpreter runing Spin byte codes? Which I might have guessed would be slower.
Yes, an open sourced spin compiler (and pasm compiler) would have opened a lot of things up for developing the prop. Even with the problems with PropTool source, it would have helped the march.
Spin Interpreter:
The addition of LMM to the spin mix (as in moving the relatively unused sections to LMM), plus the decode vector improvement, would lead to some considerable gains. Unfortunately 512 longs will not happen although I agree it is a shame, with banking. However, the PropII does introduce a usable fifo which can be used as the stack in the spin interpreter. This would provide quite an improvement. But the whole PropII will improve spin significantly, and it will be soft, so it can be extended.
Richaj45: See my signature below in the tools section. There is a link to the Faster Spin Interpreter thread. It includes the source, plus a better writeup of the spin interpreter byte codes.
"LMM-like" ... I.E. overlay.
Byte-codes can get pre-loaded a long at a time, most interpreter routines pre-loaded, less used routines loaded on demand, other tricks like making routines that don't have tons of nops as in the current interpreter's CZ encodings, etc....
Ray, if you just point to one interpreter file as your defacto standard we could all start testing with it.
If a cog had about 200 or 300 more longs of memory we could see some real improvement -- possibly by a factor of 2. The 496 longs of memory is just a bit to tight to allow for an efficient implementation.
I've written a couple of I2C drivers for sensors that initially weren't fast enough, but by timing and changing the SPIN source I was able to double or triple the execution speed. A 20% speedup in the interpreter pales in comparison to what we could expect from an optimizing SPIN compiler.
One of the things on my to-do list is to write and time a whole bunch of different expressions so I can get a feel for how fast they actually are (as well as which forms of things are faster and by how much) and document the result. I also want to get a feel for how much overhead there is in calling a method in an object vs doing something inline, etc. I've been surprised a number of times at how fast I was able to make SPIN code go when I really wanted to.
Jason
there were certainly more improvements to be made and a mix of lmm and overlays would improvve things as more space allows better speed on highly used functions. i still had space and was at this point when i stopped dev.
one improvement i thought was to save the push and pops between bytecodes but this requires compiler mods to take advantage and breaks compatibility. there maybe other ways to do this.
i would think that reading llongs would not help because the overhead t-'o check this would deffeat any benefit and use valuable space.
how true. unfortunately it makes the code less readable. an optimising compiler would do much of this though. and every 20% helps too.
btw i am noticing problems editting posts with my xoom. anyone else noticed this?
I was thinking that several bytecodes are single byte and reading a long costs no more than reading a byte and reading 4 bytes one at a time would take app. 800ns instead of app. 200ns at 80MHz ... surely the accounting could be done in much less than 12 instructions.
My latest version of the Spin Interpreter is 260C_007 posted here
http://forums.parallax.com/showthread.php?104276-Spin-Interpreter-Faster Post #40 and the bytecodes are explained in the file in Post #37
It can be traced with my Zero Footprint debugger (see the Tools link in my signature for a link).
Priorities are first to reduce program size, speed is second.
result := (65536 / 32) * value
It will compile to fewer instructions (and go faster) if you do this:
result := constant(65536 / 32) * value
The compiler makes no attempt at optimization, so even seemingly obvious things like that get ignored by it. Anywhere in your code that you're doing math with constant inputs may be able to take advantage of that.
Also, just packing expressions results in faster / smaller code. IE, use this:
X := ((Input * 5) + Center) / Scale
Instead of this:
X := Input * 5
X += Center
X /= Scale
And last, don't call a function if you can do the code inline. It makes things less readable, but the overhead of a function call isn't zero, so writing larger block functions helps.
That is a neat trick, never heard of it. Thanks
PUB contest | xxx 'xxx := constant(65536 / 32) * xxx '4 longs xxx := (65536 / 32) * xxx '5 longsThat one is kind of obvious, but Spin has operators for clamping (#> and <#), bit translation (lookup & lookdown), and a few other things that aren't particularly common. Also, in Spin generally the number of operands is the deciding factor, whereas in C or ASM the complexity of the operators matters more. So, in C I might do a multiply by 3 by doing:
X := X + (X<<1)
...because that would be faster. In Spin, it'll be slower than just doing the multiply, and using X *= 3 is even faster still.
I have also been told that there was an advantage to using the first 8 longs in an object, but I never knew how much - each variable access not in the first 8 longs adds about 100 cycles.
An optimizer could be written as a pre-processor that converts statements into more efficient forms. If it was smart enough, it could identify when inefficient variables are used multiple times in a statement and copy them in a temporary variable that is defined within the first 8 stack longs.
Here is another obvious example:
PUB contest | x 'x := x+1 '4 longs x++ '3 longsUsing min or max limits can cause problems for post ++ or --, limits don't work with these.
I would be interested to find out how you would 'time' the code.
The number of cycles is shown at the beginning of the line. The hex numbers after "Cog 0:" are the bytecode address, value on the top of the stack and the instruction bytecodes.
The Spin bytecode mnemonics have the following meaning
pub main | x, y, t3, t4, t5, t6, t7, p, q, r, s 'Vars in the first 8 longs x := y * 3 '1264 160 Cog 0: 00ac 00000000 - 0018 68 ldllc $8 176 Cog 0: 00b0 00000000 - 0019 37 21 ldlip $3 768 Cog 0: 00b4 00000003 - 001b f4 mul 160 Cog 0: 00b0 00000000 - 001c 65 stllc $4 x := y / 3 '1264 160 Cog 0: 00ac 00000000 - 001d 68 ldllc $8 176 Cog 0: 00b0 00000000 - 001e 37 21 ldlip $3 768 Cog 0: 00b4 00000003 - 0020 f6 div 160 Cog 0: 00b0 00000000 - 0021 65 stllc $4 x *= 3 '1184 176 Cog 0: 00ac 00000000 - 0022 37 21 ldlip $3 1008 Cog 0: 00b0 00000003 - 0024 66 54 exllc $4 mul x := x + (x<<1) '1024 160 Cog 0: 00ac 00000000 - 0026 64 ldllc $4 160 Cog 0: 00b0 00000000 - 0027 64 ldllc $4 128 Cog 0: 00b4 00000000 - 0028 36 ldli1 208 Cog 0: 00b8 00000001 - 0029 e3 shl 208 Cog 0: 00b4 00000000 - 002a ec add 160 Cog 0: 00b0 00000000 - 002b 65 stllc $4 x := y + (y<<1) '1024 160 Cog 0: 00ac 00000000 - 002c 68 ldllc $8 160 Cog 0: 00b0 00000000 - 002d 68 ldllc $8 128 Cog 0: 00b4 00000000 - 002e 36 ldli1 208 Cog 0: 00b8 00000001 - 002f e3 shl 208 Cog 0: 00b4 00000000 - 0030 ec add 160 Cog 0: 00b0 00000000 - 0031 65 stllc $4 x += (x<<1) ' 944 160 Cog 0: 00ac 00000000 - 0032 64 ldllc $4 128 Cog 0: 00b0 00000000 - 0033 36 ldli1 208 Cog 0: 00b4 00000001 - 0034 e3 shl 448 Cog 0: 00b0 00000000 - 0035 66 4c exllc $4 add x += (y<<1) ' 944 160 Cog 0: 00ac 00000000 - 0037 68 ldllc $8 128 Cog 0: 00b0 00000000 - 0038 36 ldli1 208 Cog 0: 00b4 00000001 - 0039 e3 shl 448 Cog 0: 00b0 00000000 - 003a 66 4c exllc $4 add x #>= 1000 '624 each, 1248 for both lines 176 Cog 0: 00ac 00000000 - 003c 39 03 e8 ldwi 1000 448 Cog 0: 00b0 000003e8 - 003f 66 44 exllc $4 minlim x <#= 2000 176 Cog 0: 00ac 00000000 - 0041 39 07 d0 ldwi 2000 448 Cog 0: 00b0 000007d0 - 0044 66 45 exllc $4 maxlim x := 1000 #> x <# 2000 '1088 176 Cog 0: 00ac 00000000 - 0046 39 03 e8 ldwi 1000 160 Cog 0: 00b0 000003e8 - 0049 64 ldllc $4 208 Cog 0: 00b4 000003e8 - 004a e4 minlim 176 Cog 0: 00b0 000003e8 - 004b 39 07 d0 ldwi 2000 208 Cog 0: 00b4 000007d0 - 004e e5 maxlim 160 Cog 0: 00b0 000003e8 - 004f 65 stllc $4 'Adding vars in the next 8 longs (p,q,r,s) x += (s<<1) '1056 272 Cog 0: 00ac 00000000 - 0050 cc 2c ldll $2c 128 Cog 0: 00b0 00000000 - 0052 36 ldli1 208 Cog 0: 00b4 00000001 - 0053 e3 shl 448 Cog 0: 00b0 00000000 - 0054 66 4c exllc $4 add q += (s<<1) '1168 272 Cog 0: 00ac 00000000 - 0056 cc 2c ldll $2c 128 Cog 0: 00b0 00000000 - 0058 36 ldli1 208 Cog 0: 00b4 00000001 - 0059 e3 shl 560 Cog 0: 00b0 00000000 - 005a ce 24 4c exll $24 add x := q * 3 '1376 272 Cog 0: 00ac 00000000 - 005d cc 24 ldll $24 176 Cog 0: 00b0 00000000 - 005f 37 21 ldlip $3 768 Cog 0: 00b4 00000003 - 0061 f4 mul 160 Cog 0: 00b0 00000000 - 0062 65 stllc $4 p := q * 3 '1488 272 Cog 0: 00ac 00000000 - 0063 cc 24 ldll $24 176 Cog 0: 00b0 00000000 - 0065 37 21 ldlip $3 768 Cog 0: 00b4 00000003 - 0067 f4 mul 272 Cog 0: 00b0 00000000 - 0068 cd 20 stll $20 x := x+1 '4 longs 160 Cog 0: 00ac 00000000 - 006a 64 ldllc $4 128 Cog 0: 00b0 00000000 - 006b 36 ldli1 208 Cog 0: 00b4 00000001 - 006c ec add 160 Cog 0: 00b0 00000001 - 006d 65 stllc $4 x++ '3 longs 336 Cog 0: 00ac 00000000 - 006e 66 2e exllc $4 postinc 208 Cog 0: 00ac 00000000 - 0070 32 ret 208 Cog 0: 0074 00000032 - fff9 3f 89 ldreg $1e9 Cog 0: 0078 00000000 - fffb 21 cogstopMaybe the closest one would be Phil's "Tips and Traps" in the Propeller Stickies:
http://forums.parallax.com/showthread.php?84031-Propeller-Tricks-amp-Traps-(Last-update-21-June-2007)&p=575479#post575479