SPIN Faster

william chan · 2008-03-12 01:08

Hi All,

Are there any tricks or secrets to speed up SPIN execution?

For example,

Can we declare some variables to use COG memory instead of HUB memory to speed things up?

Thanks.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.fd.com.my
www.mercedes.com.my

Mike Green · 2008-03-12 01:27

Spin can't use cog memory for variables. Pretty much the whole cog is used for the interpreter and its internal data. There are some special op-codes to allow reference to the first couple of local variables / parameters of a method. I'm not sure how much speed savings there'd be, but there's clearly a space advantage to using local variables and the first few global variables in an object.

Probably the only way to speed up Spin code would be to minimize the number of basic operations and just to write code carefully and efficiently. The general rule for optimization is that you don't know where the savings might be until you complete the program and try it with real data and bookkeeping as to execution time. Often where you think savings might be is completely wrong.

jazzed · 2008-03-12 02:39

Outside of Mike's useful "general rule for optimization" observation, you could find a spin
command's execution time by looking at the propeller rom assembly for each function.
And if you captured that all in a document, it may be useful to the rest of us :)

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
jazzed ... about living in http://en.wikipedia.org/wiki/Silicon_Valley

Traffic is slow at times, but Parallax orders always get here fast 8)

william chan · 2008-03-12 02:50

Mike,

Are you saying that local variables declared implicitly using the "|" sign after the method name would be using COG memory?

Mike Green · 2008-03-12 03:10

No. The local variables are placed into the stack which is in hub memory. My comment referred to some special Spin byte codes for referring to the first few stack locations in the current stack frame.

william chan · 2008-03-12 03:47

What is the special way (opcode) to access the first local variable in a method?
Can you give us an example?
I need this to improve my spin-adc's speed, which is only about 78 samples per second for a 9-bit ADC.

Mike Green · 2008-03-12 04:20

In your case, there's very little advantage in using this optimization since you only reference one variable which is already local and you have less than 8 longs of local variables (including the parameters). Any improvement would have to come from simplifying (optimizing) the code itself and there's very little code to optimize. Here's one example that's clearly shorter, but may not differ significantly in speed from what you wrote:

PUB adc | i
  dira[feedback_pin]~~
  repeat
    i := 0
    repeat noofloops
      i += !(outa[feedback_pin] := !ina[input_pin])
    sample := i

william chan · 2008-03-12 04:40

But if "i" can be placed in COG memory, it should loop about 30% faster since the cog need not wait for hub access.
Don't you think so?

Or

Can I use OUTB to replace "i"?
I believe OUTB can be accessed directly without waiting for hub access.

Paul Baker · 2008-03-12 06:54

Actually, yes, you could use OUTB as a cog-located variable. Of course this will lead to different behavior on the next gen, but that's not important now. For that matter INB is available, and if you are not using the counters, four more are available: FRQA, FRQB, PHSA, PHSB.

As an aside, for maximum readability, using these registers should be done in a way that makes it clear you are not using them for their normal purpose, something like:

CON
  LOCAL_I = 5 'this is the special purpose index to OUTB (list is on p. 305 of manual)

  repeat SPR[LOCAL_I] from 0 to 4
    a += SPR[LOCAL_I]

A little awkward, but it's perfectly clear you are not using that register for setting the pins on port B. Note that this technique will likely be a little slower than using the direct name.

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer

Parallax, Inc.

Post Edited (Paul Baker (Parallax)) : 3/12/2008 7:12:32 AM GMT

Sleazy - G · 2008-03-12 07:03

The OUTA/B port is locally defined BUT globally accessible, all cogs can access it at any time, and the pins are "OR'ed"?

It should speed up your code if you substitute OUTA like you said, or CTRA; you'd be using the cog RAM. Try it.

stevenmess2004 · 2008-03-12 07:13

Paul Baker (Parallax) said...
Actually, yes, you could use OUTB as a cog-located variable. Of course this will lead to different behavior on the next gen, but that's not important now. For that matter INB is available, and if you are not using the counters, four more are available: FRQA, FRQB, PHSA, PHSB.
As an aside, for maximum readability, using these registers should be done in a way that makes it clear you are not using them for their normal purpose, something like:

The PHSx registers have shadow registers which may muck some things up. Time for me to have a look at the interpreter to see if it will be a problem.

william chan · 2008-03-12 08:21

The most unexpected thing has happened.

When I changed the code to Mike's code

repeat noofloops
  i += !(outa[feedback_pin] := !ina[input_pin])

it actually improved slightly to 80 samples per second.

But when I replaced "i" with INB or OUTB or FRQA it actually became about 9% slower !

Go figure !

hippy · 2008-03-12 12:01

Optimising execution time with Spin is incredibly difficult. Because it's an interpreter nearly everything comes in a form of "do this, with that" and timing varies depending as to what "that" is. It doesn't follow that because "that" is a Cog register it will be any faster than something else, in fact, as demonstrated, it can be slower.

The only practical way to optimise is to try it and time it or study the ROM interpreter code. Examining the bytecode gives a first-pass guess; the less bytecode, the fewer hub accesses needed to execute it but that doesn't factor in that some bytecode may take longer than others. There are some general rules; the fewer pushes and/or pops the better, loading constants is quicker than loading variables but that's not the whole picture.

Programmers take it for granted that 'a+=b' is quicker than 'a:=a+b' and 'a~' is quicker than 'a:=0', but how many people have actually checked that is so ? For all cases of 'a' ?

Another problem is that there are so many different ways things could be done and shorter source doesn't necessarily mean fastest execution. Let's rewrite that

repeat noofloops
  i += !(outa[feedback_pin] := !ina[input_pin])

Assuming that bit access to Cog registers is slower than accessing them directly as a whole. How about ...

inaset := |< input_pin
outaclr := ! ( outaset := |< feedback_pin )
repeat noofloops
  if INA & inaset
    OUTA &= outaclr
    i += 1
  else
    OUTA |= outaset
    i += 0

Is that faster ? I have no idea.

My experience is that the quickest access to variables is when they are parameters or local (stack), then if they are global (VAR) and slowest access is to those in DAT.

hippy · 2008-03-12 12:15

william chan said...
But if "i" can be placed in COG memory, it should loop about 30% faster since the cog need not wait for hub access.
Don't you think so?

If "i" could be placed in Cog memory you might see a speed-up but 30% would seem to be quite optimistic. It assumes that 30% of your code's time is because of the hub access for this one variable whereas there will be many more hub accesses to execute the code. Hub access time is just a small part of interpreting of the Spin bytecode.

Optimisation is really a matter of shaving one or two execution cycles off something which takes hundreds of execution cycles. The largest gains usually come from optimising the whole, not just one part of the whole.

hippy · 2008-03-12 18:19

I did some benchmarking ...

PUB Main | counter

  tv.Start( TV_PIN )

  counter := CNT
  repeat 1000
    x++
  counter := CNT - counter
  
  tv.Dec( ( counter - 224736 ) / 1000 )

That 224736 is the count I got back for 1000 repeats of 'nothing'. I did all testing with longs.

                  .----------.----------.----------.
                  |   x++    |  x += 1  | x := x+1 |
                  |----------|----------|----------|
Near VAR/Local    |   336    |   576    |   656    |
OUTB              |   416    |   672    |   784    |
Near DAT          |   448    |   688    |   880    |
Far VAR/DAT/Local |   464    |   704    |   912    |
OUTB[31..0]       |   816    |  1072    |  1584    |
                  `----------^----------^----------'

The "Near" variables are those which are within the first 32 longs of the start of their base ( object or method ), the "Far" are further away ( eg, put an array before them ).

Method locals ( and parameters ) are more often than not "Near" variables in most commonly written code. VAR and DAT variables are luck of the draw. Put most used variables first and variables before arrays to make them faster.
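
The ordering advice can be sketched with a VAR block like the one below. The variable and method names and the array size are invented for illustration, and the "near"/"far" comments only restate the measurements in the table rather than anything checked against the interpreter source. Everything is declared as a long so that declaration order alone decides the layout.

```spin
VAR
  long  sampleCount             ' hot variable: declared first, so it sits near the start of VAR
  long  history[64]             ' large array, declared after the hot variable
  long  lastError               ' declared after the array, so it lands "far" from the base

PUB Bump
  sampleCount++                 ' access to a "near" variable
  lastError := 0                ' access to a "far" variable (slower, going by the table)
```

Moving `lastError` above `history` would put both scalars inside the first 32 longs.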

Now for some more interesting results which shows normally perceived wisdom to be wrong and reveals many attempts to improve bit-banged I/O speed by use of ~ and ~~ to have been completely counter-productive ...

            .----------------.----------------.
            |  OUTB[x] := 0  |    OUTB[x]~    |
            |----------------|----------------|
OUTB[ 0 ]   |      560       |      608       |
OUTB[ 1 ]   |      560       |      608       |
OUTB[ 2 ]   |      608       |      656       |
OUTB[ 3 ]   |      608       |      656       |
OUTB[ 4 ]   |      608       |      656       |
OUTB[ 5 ]   |      576       |      624       |
OUTB[ var ] |      592       |      640       |
            `----------------^----------------'

Assigning 1 or -1 takes the same time as assigning 0, ~~ takes the same time as ~.

For fastest bit-banged I/O, set the pin number as a constant ( but not always ! ) and choose the optimal pin number, assign 0 or 1 ( or -1 ) and don't use ~ or ~~.
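
A minimal sketch of that recommendation, assuming a hypothetical constant `PULSE_PIN` and a made-up method name. The figures above were taken on OUTB; this sketch uses OUTA so it drives a real pin, and whether OUTA times identically was not measured here.

```spin
CON
  PULSE_PIN = 0                 ' hypothetical pin number, fixed at compile time

PUB Pulse
  dira[PULSE_PIN] := 1          ' make the pin an output
  outa[PULSE_PIN] := 1          ' drive high with a plain assignment instead of ~~
  outa[PULSE_PIN] := 0          ' drive low with a plain assignment instead of ~
```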

So two questions -

Why do so many people use ~ and ~~ rather than assignment ?
Why did no one think to benchmark these before ?

Post Edited (hippy) : 3/12/2008 6:25:38 PM GMT

Phil Pilgrim (PhiPi) · 2008-03-12 18:28

hippy said...
Why do so many people use ~ and ~~ rather than assignment ?

The bigger question is: If they're slower, why don't they compile to := 0/1/-1 instead of whatever it is they compile to? Is it because ~ and ~~ produce less code, trading speed for compactness?

-Phil

hippy · 2008-03-12 18:35

No, "OUTA[0]~" uses four bytes, same as "OUTA[0]:=0". The Compiler doesn't optimise so it will give people what they ask for.


0018         35              S3       PUSH     #0
0019         3D D4 18                 USING    OUTA[] POSTCLR
001C         32                       RETURN

0018         35              S3       PUSH     #0
0019         35                       PUSH     #0
001A         3D B4                    POP      OUTA[]
001C         32                       RETURN

Phil Pilgrim (PhiPi) · 2008-03-12 19:24

Hmm, I guess the real question, then, is why there's a special POSTCLR bytecode at all, if the assignment sequence can accomplish the same end more quickly and with no more code space. It would seem that outa[0]~ could just compile to outa[0] := 0, since they mean the same thing.

-Phil

Paul Baker · 2008-03-12 22:28

stevenmess2004 said...

The PHSx registers have shadow registers which may muck some things up. Time for me to have a look at the interpreter to see if it will be a problem.

I haven't tested this, but I think the shadow register only comes into play while the counters are running.

StefanL38 · 2008-03-12 22:39

Hello William,

To me it looks like a great effort for small gains in speed while staying in SPIN.

And I think examining the interpreter code is three times more effort than porting your code to assembler directly.

I have done this for an I/O expander using shift registers with latches.

First I programmed it in Spin, and once it was running in Spin I ported the bit-banging to assembler.

I took a look at your code. There is not much bit-banging at all.
My opinion: a good project to start learning Propeller assembler!

regards

Stefan

william chan · 2008-03-13 00:56

My tests and Hippy's tests confirm that using a local var like "i" is much faster than using COG local registers like FRQA or OUTB.

Do the SPIN opcodes tell us why it is slower to use COG local registers?

Can the spin compiler be improved to make accessing COG local registers faster than local vars?

Stefan,

I am still trying to learn Propeller assembly, but I'm still stuck at the low end of the curve.
I don't know why, even though I consider myself an expert in SX assembly.

Mike Green · 2008-03-13 01:17

The Spin interpreter's source code is now available. An examination of that will explain what's faster or slower and why.

The Spin compiler is not the determining factor. The choices made in the development of the Spin interpreter are what determine
which kind of access is faster and by how much, and the interpreter is in ROM and can't be changed. You're free to modify it now,
and you can load the modified version into a cog and execute it like any other assembly program. There's not much advantage in
doing what you suggest: only a few locations are accessible that way without side effects (from being a control
register), and there's not that much to be gained from accessing cog registers from Spin anyway (because the cost to access hub memory
over cog memory is so small and is eaten up by overhead in fetching and interpreting the bytecodes from hub memory).

In other words, if Spin isn't fast enough, do it in assembly language. The time critical parts of your code are probably pretty small and
likely would translate to a small, straightforward assembly routine.

stevenmess2004 · 2008-03-13 01:17

I've had a quick look, and I think the problem with the cog registers is that they are not read in place but are pushed onto the stack and then popped from it. Combined with the special opcodes for quick access to the first few VAR and local variables, I think this is what causes the difference (also, the special opcodes need only one opcode while the cog registers need two).

william chan · 2008-03-13 01:36

Can the spin interpreter in the propeller chip be easily upgraded?

Steven,

Can you make changes to the spin interpreter to read the cog registers directly?

RinksCustoms · 2008-03-13 02:21

1) You could "Overclock" the chip to 128MHz safely, at about 145MHz, things start SPINing out of control, lol

2) learn ASM

3) read the counters lab, I posted the link in your other thread...

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
E³ = Thought

http://folding.stanford.edu/ - Donating some CPU/GPU downtime just might lead to a cure for cancer! My team stats.

hippy · 2008-03-13 02:39

william chan said...
Can the spin interpreter in the propeller chip be easily upgraded?

Steven, Can you make changes to the spin interpreter to read the cog registers directly?

There could be quite an amount of work involved in that I suspect, and it wouldn't be as simple as 'read the registers directly'.

RinksCustoms said...
1) You could "Overclock" the chip to 128MHz safely, at about 145MHz, things start SPINing out of control, lol

Even before 128MHz things can start failing. The highest practical clock speed for 'normal circumstances' would seem to be 100MHz. More info on the Wiki at -

propeller.wikispaces.com/Oscillator

Mike Green · 2008-03-13 03:04

william,
The Spin interpreter in the Propeller chip is in what is known as masked ROM. It cannot be upgraded.

Like I said, you're free to make your own version of the Spin interpreter which can be run in a cog like any other assembly program.
It's just that it can't be built into the chip.

RinksCustoms,
"Overclocking" the Propeller to 128MHz is not safe. The chip will not work reliably at most temperatures and supply voltages at that speed. Read the datasheet. There's a graph on page 31. You'll need to keep the supply voltage at the high end of its range and you'll need to refrigerate the chip.

william chan · 2008-03-13 07:07

Hippy,

Why is toggling certain I/O pins (P0 and P1) faster than toggling other I/O pins?
It shouldn't be the case.

stevenmess2004 · 2008-03-13 07:22

william chan said...
Can the spin interpreter in the propeller chip be easily upgraded?

Since we now have the code we can do whatever we like with the spin interpreter. It would be loaded just like another assembly object. However, it will take up room in the hub since we can't write over the interpreter in the ROM.

william chan said...
Can you make changes to the spin interpreter to read the cog registers directly?

Possibly, but it would be really hard until we get our own compiler. If some of the things like the post clear and post set operators are no quicker than doing it another way we could take them out and put something else in instead. However, like I said, we can't do this without a new compiler to support it. It would also become a pain in the neck to maintain.

william chan · 2008-03-13 07:25

I don't quite understand...

1. You need a "compiler" to make changes to the SPIN interpreter?

2. How many longs does spin interpreter take up?

stevenmess2004 · 2008-03-13 08:16

1. Basically yes. The problem is that we would need to change the meaning of some of the opcodes. If you change the meaning of an opcode in the interpreter, then you also need to change its meaning in the compiler.

2. It takes 496 longs.
