SPIN Faster
william chan
Posts: 1,326
Hi All,
Are there any tricks or secrets to speed up SPIN execution?
For example,
Can we declare some variables to use COG memory instead of HUB memory to speed things up?
Thanks.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
Comments
Probably the only way to speed up Spin code is to minimize the number of basic operations and to write code carefully and efficiently. The general rule for optimization is that you don't know where the savings might be until you complete the program, run it with real data, and measure where the execution time actually goes. Often your guess about where the savings lie is completely wrong.
command's execution time by looking at the propeller rom assembly for each function.
And if you captured that all in a document, it may be useful to the rest of us :)
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
jazzed ... about living in http://en.wikipedia.org/wiki/Silicon_Valley
Traffic is slow at times, but Parallax orders always get here fast 8)
Are you saying that local variables declared implicitly using the "|" sign after the method name would be using COG memory?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
Can you give us an example?
I need this to improve my spin-adc's speed, which is only about 78 samples per second for a 9-bit ADC.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
Don't you think so?
Or
Can I use OUTB to replace "i"?
I believe OUTB can be accessed directly without waiting for hub access.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
As an aside, for maximum readability, any use of these registers should make it clear that you are not using them for their normal purpose, something like:
A little awkward, but it's perfectly clear you are not using that register for setting the pins on port B. Note that this technique will likely be a little slower than using the direct name.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Post Edited (Paul Baker (Parallax)) : 3/12/2008 7:12:32 AM GMT
It should speed up your code if you substitute OUTA like you said, or CTRA; you'd be using the cog RAM. Try it.
The PHSx registers have shadow registers which may muck some things up. Time for me to have a look at the interpreter to see if it will be a problem.
When I changed the code to Mike's code
repeat noofloops
  i += !(outa[feedback_pin] := !ina[input_pin])
it actually improved slightly to 80 samples per second.
But when I replaced "i" with INB or OUTB or FRQA it actually became about 9% slower !
Go figure !
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
The only practical way to optimise is to try it and time it or study the ROM interpreter code. Examining the bytecode gives a first-pass guess; the less bytecode, the fewer hub accesses needed to execute it but that doesn't factor in that some bytecode may take longer than others. There are some general rules; the fewer pushes and/or pops the better, loading constants is quicker than loading variables but that's not the whole picture.
Programmers take it for granted that 'a+=b' is quicker than 'a:=a+b' and 'a~' is quicker than 'a:=0' but how many people have actually checked that is so ? For all cases of 'a' ?
Another problem is that there are so many different ways things could be done and shorter source doesn't necessarily mean fastest execution. Let's rewrite that
repeat noofloops
  i += !(outa[feedback_pin] := !ina[input_pin])
Assuming that bit access to Cog registers is slower than accessing them directly as a whole. How about ...
Is that faster ? I have no idea.
My experience is that the quickest access to variables is when they are parameters or local (stack), then if they are global (VAR) and slowest access is to those in DAT.
If "i" could be placed in Cog memory you might see a speed-up but 30% would seem to be quite optimistic. It assumes that 30% of your code's time is spent on hub access for this one variable, whereas many more hub accesses are needed just to execute the code. Hub access time is only a small part of interpreting the Spin bytecode.
Optimisation is really a matter of shaving one or two execution cycles off something which takes hundreds of execution cycles. The largest gains usually come from optimising the whole, not just one part of the whole.
That 224736 is the count I got back for 1000 repeats of 'nothing'. I did all testing with longs.
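A count like this can be gathered by sampling the system counter before and after a loop. What follows is only a minimal sketch, assuming a 5 MHz crystal with the 16x PLL; the method name `TimeLoop` and the statement under test are hypothetical and are not taken from the test code actually used for these figures:

```spin
CON
  _clkmode = xtal1 + pll16x       ' assumed clock setup: 5 MHz crystal x 16 = 80 MHz
  _xinfreq = 5_000_000

PUB TimeLoop : elapsed | i
  '' Returns the number of system clocks taken by 1000 repeats
  '' of the statement under test (loop overhead included).
  elapsed := cnt                  ' sample the system counter before the loop
  repeat 1000
    i := 0                        ' statement under test; swap in the code to measure
  elapsed := cnt - elapsed        ' clocks elapsed, including loop overhead
```

Subtracting the count for 1000 repeats of 'nothing' from the returned value leaves the cost of the statement itself. The result would still need to be sent to a display or serial object to be read.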
The "Near" variables are those which are within the first 32 longs of the start of their base ( object or method ), the "Far" are further away ( eg, put an array before them ).
Method locals ( and parameters ) are more often than not "Near" variables in most commonly written code. VAR and DAT variables are luck of the draw. Put most used variables first and variables before arrays to make them faster.
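The ordering advice might look like this in a VAR block. This is a sketch only; the variable names and the array size are made up for illustration, and whether the ordering pays off in a given program would have to be timed:

```spin
VAR
  long  count                     ' hypothetical: most-used variable, declared first
  long  index                     ' hypothetical: other scalars next
  long  samples[64]               ' hypothetical array, declared after the scalars

PUB Sum : total
  '' Adds up the array; count and index sit ahead of samples in the VAR block.
  count := 64
  repeat index from 0 to count - 1
    total += samples[index]
```

Had `samples` been declared first, `count` and `index` would sit 64 longs further from the start of the object's variables.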
Now for some more interesting results which show normally perceived wisdom to be wrong and reveal many attempts to improve bit-banged I/O speed by use of ~ and ~~ to have been completely counter-productive ...
Assigning 1 or -1 takes the same time as assigning 0, ~~ takes the same time as ~.
For fastest bit-banged I/O, set the pin number as a constant ( but not always ! ) and choose the optimal pin number, assign 0 or 1 ( or -1 ) and don't use ~ or ~~.
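Put together, that advice suggests a bit-banging loop along these lines. It is a sketch under the assumptions above, not a measured result; the constant name `PIN`, the method name `Pulse`, and the choice of pin 0 are hypothetical:

```spin
CON
  PIN = 0                         ' hypothetical pin number, fixed as a constant

PUB Pulse
  dira[PIN] := 1                  ' make the pin an output
  repeat
    outa[PIN] := 1                ' plain assignment rather than outa[PIN]~~
    outa[PIN] := 0                ' plain assignment rather than outa[PIN]~
```

Timing both forms on the actual hardware is the only way to confirm which is faster for a given pin.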
So two questions -
Why do so many people use ~ and ~~ rather than assignment ?
Why did no one think to benchmark these before ?
Post Edited (hippy) : 3/12/2008 6:25:38 PM GMT
-Phil
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
to me it looks like great effort for small gains of speed staying in SPIN.
and i think examining the interpreter code is three times more effort than porting your code to assembler directly
i have done this for an IO-expander using shiftregisters with latches
first i programmed it in spin and when it was running in spin i ported the bitbanging to assembler
i took a look at your code. There is not much bitbanging at all
my opinion: good project to start learning Propeller-Assembler !
regards
Stefan
Do the SPIN opcodes tell us why it is slower to use COG local registers?
Can the spin compiler be improved to make accessing COG local registers faster than local vars?
Stefan,
I am still trying to learn propeller assembly, but still stuck at the low end of the curve.
I don't know why, even though I consider myself an expert in SX assembly.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
The Spin compiler is not the determining factor. The choices made in the development of the Spin interpreter determine
which kind of access is faster and by how much, and the interpreter in ROM can't be changed. You're free to modify it now,
though, and you can load the modified version into a cog and execute it like any other assembly program. There's not much
advantage in doing what you suggest: only a few locations are accessible that way without side effects (from being a control
register), and there's not that much to be gained from accessing cog registers from Spin anyway (because the cost to access hub
memory over cog memory is so small and is eaten up by the overhead of fetching and interpreting the bytecodes from hub memory).
In other words, if Spin isn't fast enough, do it in assembly language. The time critical parts of your code are probably pretty small and
likely would translate to a small, straightforward assembly routine.
Steven,
Can you make changes to the spin interpreter to read the cog registers directly?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
2) learn ASM
3) read the counters lab, I posted the link in your other thread...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
E3 = Thought
http://folding.stanford.edu/ - Donating some CPU/GPU downtime just might lead to a cure for cancer! My team stats.
There could be quite an amount of work involved in that I suspect, and it wouldn't be as simple as 'read the registers directly'.
Even before 128MHz things can start failing. The highest practical clock speed for 'normal circumstances' would seem to be 100MHz. More info on the Wiki at -
propeller.wikispaces.com/Oscillator
The Spin interpreter in the Propeller chip is in what is known as masked ROM. It cannot be upgraded.
Like I said, you're free to make your own version of the Spin interpreter which can be run in a cog like any other assembly program.
It's just that it can't be built into the chip.
RinksCustoms,
"Overclocking" the Propeller to 128MHz is not safe. The chip will not work reliably at most temperatures and supply voltages at that speed. Read the datasheet. There's a graph on page 31. You'll need to keep the supply voltage at the high end of its range and you'll need to refrigerate the chip.
Why is toggling certain IO pins (P0 and P1) faster than toggling other IO pins?
It shouldn't be the case.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
Possibly, but it would be really hard until we get our own compiler. If some of the things like the post clear and post set operators are no quicker than doing it another way we could take them out and put something else in instead. However, like I said, we can't do this without a new compiler to support it. It would also become a pain in the neck to maintain.
1. You need a "compiler" to make changes to the SPIN interpreter?
2. How many longs does the Spin interpreter take up?
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my
2. It takes 496 longs.