Spin speed up
richaj45
Posts: 179
Hello:
I was wondering what it would take to speed up Spin?
Maybe a new compiler that generates a "Large Memory Model" program or maybe use multiple cogs to speed up interpretation of the current byte codes.
What do people think?
rich
Comments
I don't think having Spin generate LMM code is desirable, especially not for the Prop I. The beauty of Spin is that it produces compact byte code, allowing you to squeeze more functionality into the limited space. LMM code is much bigger. Besides, if you want the speed of LMM we now have Catalina and propgcc.
Using multiple COGs may not give you much boost whilst at the same time wasting COGs. In my efforts at creating a multi-cog Z80 emulator (basically byte-sized instructions) some time ago, I could not get much performance out of it. Perhaps I did not try hard enough, but the overheads of passing data and control back and forth between COGs were a killer then.
Sounds like it is a dead end subject.
cheers,
rich
There have been a few attempts at speeding up the Spin interpreter. I believe the speed improvements have been around 25% to 33%. In theory, it would be possible to write a Spin compiler that generates LMM code. This would run quite a bit faster -- around 4 times as fast or faster. Of course, this would also require more hub RAM for program space. Another approach would be to convert Spin source to C source, and then compile it with Catalina or PropGCC. However, if you would go this route you might as well just program in C.
However, with that speedup, plus the fact that I regularly overclock from 80MHz to 104MHz (a 30% improvement on all 8 cores) and that 108MHz is most likely stable too (35%), a nice increase can be had.
Functions can be loaded and eliminated dynamically from memory, so the 32k design of the prop that is commonly listed as a "drawback" is not much of an issue.
It can run from SD card, and can execute scripts on the fly, so there is no real limit to program size, as long as a given module can fit in the available memory.
And it's interactive, so development iterations are quicker. But using Forth is considered "cheating" since it's so easy.
I've seen where Spin processing is over 100 times faster with or without a clock using the following formula:
X = NP where
N = Number of Propeller chips
P = Propeller chip
X = SPIN Language Speed Increase
If the number of Propeller chips is 100, then
X = 100P for an increase of 100 times the processing speed of a single Propeller.
The gain is from multiple SPIN instances of parallel thinking, not the overhead.
You may want to figure in multiple instances of Cogs by using 8X instead of X.
http://forums.parallax.com/showthread.php?124495-Fill-the-Big-Brain
Can i get a copy of the Spin speedup you did?
Here is a what-if question:
If a COG were changed so that its memory space was 1024 longs instead of 512, would that have helped to speed up Spin?
cheers,
rich
But how would you do that? The 512-long limit isn't just a matter of how much RAM is attached; it's built into the instruction set with the 9-bit src and dst fields. So it's not a simple case of just tacking on more RAM.
"...would be faster...".
I'm not sure I'm with you. Do you mean compile Spin to LMM?
In which case the Spin interpreter becomes an LMM kernel: faster but bigger.
Or do you mean an LMM Spin interpreter running Spin byte codes? Which I might have guessed would be slower.
Yes, an open sourced spin compiler (and pasm compiler) would have opened a lot of things up for developing the prop. Even with the problems with PropTool source, it would have helped the march.
Spin Interpreter:
The addition of LMM to the Spin mix (as in moving the relatively unused sections to LMM), plus the decode vector improvement, would lead to some considerable gains. Unfortunately, more than 512 longs will not happen, even with banking, although I agree it is a shame. However, the Prop II does introduce a usable FIFO which can be used as the stack in the Spin interpreter. This would provide quite an improvement. But the whole Prop II will improve Spin significantly, and it will be soft, so it can be extended.
Richaj45: See my signature below in the tools section. There is a link to the Faster Spin Interpreter thread. It includes the source, plus a better writeup of the spin interpreter byte codes.
"LMM-like" ... I.E. overlay.
Byte-codes can get pre-loaded a long at a time, most interpreter routines pre-loaded, less used routines loaded on demand, other tricks like making routines that don't have tons of nops as in the current interpreter's CZ encodings, etc....
Ray, if you just point to one interpreter file as your defacto standard we could all start testing with it.
If a cog had about 200 or 300 more longs of memory we could see some real improvement -- possibly by a factor of 2. The 496 longs of memory is just a bit too tight to allow for an efficient implementation.
I've written a couple of I2C drivers for sensors that initially weren't fast enough, but by timing and changing the SPIN source I was able to double or triple the execution speed. A 20% speedup in the interpreter pales in comparison to what we could expect from an optimizing SPIN compiler.
One of the things on my to-do list is to write and time a whole bunch of different expressions so I can get a feel for how fast they actually are (as well as which forms of things are faster and by how much) and document the result. I also want to get a feel for how much overhead there is in calling a method in an object vs doing something inline, etc. I've been surprised a number of times at how fast I was able to make SPIN code go when I really wanted to.
Jason
there were certainly more improvements to be made, and a mix of lmm and overlays would improve things, as more space allows better speed on highly used functions. i still had space and was at this point when i stopped dev.
one improvement i thought of was to save the pushes and pops between bytecodes, but this requires compiler mods to take advantage and breaks compatibility. there may be other ways to do this.
i would think that reading longs would not help because the overhead to check this would defeat any benefit and use valuable space.
how true. unfortunately it makes the code less readable. an optimising compiler would do much of this though. and every 20% helps too.
btw i am noticing problems editing posts with my xoom. anyone else noticed this?
I was thinking that several bytecodes are single byte, and reading a long costs no more than reading a byte: reading 4 bytes one at a time takes approx. 800ns at 80MHz, versus approx. 200ns for a single long read ... surely the accounting could be done in much less than 12 instructions.
My latest version of the Spin Interpreter is 260C_007 posted here
http://forums.parallax.com/showthread.php?104276-Spin-Interpreter-Faster Post #40 and the bytecodes are explained in the file in Post #37
It can be traced with my Zero Footprint debugger (see the Tools link in my signature for a link).
Priorities are first to reduce program size, speed is second.
result := (65536 / 32) * value
It will compile to fewer instructions (and go faster) if you do this:
result := constant(65536 / 32) * value
The compiler makes no attempt at optimization, so even seemingly obvious things like that get ignored by it. Anywhere in your code that you're doing math with constant inputs may be able to take advantage of that.
Also, just packing expressions results in faster / smaller code. IE, use this:
X := ((Input * 5) + Center) / Scale
Instead of this:
X := Input * 5
X += Center
X /= Scale
And last, don't call a function if you can do the code inline. It makes things less readable, but the overhead of a function call isn't zero, so writing larger block functions helps.
That is a neat trick, never heard of it. Thanks
That one is kind of obvious, but Spin has operators for clamping (#> and <#), table lookup (lookup & lookdown), and a few other things that aren't particularly common. Also, in Spin generally the number of operands is the deciding factor, whereas in C or ASM the complexity of the operators matters more. So, in C I might do a multiply by 3 by doing:
X := X + (X<<1)
...because that would be faster. In Spin, it'll be slower than just doing the multiply, and using X *= 3 is even faster still.
I have also been told that there was an advantage to using the first 8 longs in an object, but I never knew how much - each variable access not in the first 8 longs adds about 100 cycles.
An optimizer could be written as a pre-processor that converts statements into more efficient forms. If it was smart enough, it could identify when inefficient variables are used multiple times in a statement and copy them in a temporary variable that is defined within the first 8 stack longs.
Here is another obvious example:
Using the min (#>) or max (<#) limit operators can cause problems with post ++ or --; the limits don't work with these.
I would be interested to find out how you would 'time' the code.
The number of cycles is shown at the beginning of the line. The hex numbers after "Cog 0:" are the bytecode address, value on the top of the stack and the instruction bytecodes.
The Spin bytecode mnemonics have the following meaning
Maybe the closest one would be Phil's "Tips and Traps" in the Propeller Stickies:
http://forums.parallax.com/showthread.php?84031-Propeller-Tricks-amp-Traps-(Last-update-21-June-2007)&p=575479#post575479