An easy way to optimize Spin code speed

RichardF · 2007-06-28 12:45

The following object times, in ticks of the system cnt register (clock cycles), how long a group of code takes to run. You can test different ways of achieving the same result to see which Spin commands are most efficient. It is an educational tool that can make you appreciate the many shortcut math and binary operators the Spin language provides. For example:

"num := num + 1" takes 656 clock cycles
"num++" takes 336 clock cycles; twice as fast!

This code group takes 1744 clock cycles to complete:
z := x - y
if z < 500
  z := 500
The same thing using z := x - y #> 500 takes 1072 clock cycles, a big difference.
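
For reference, here is a minimal sketch of the two forms side by side. The method names are hypothetical and the cycle counts quoted above are not re-measured here; it only shows that the two bodies compute the same thing.

```spin
PUB clampLong(x, y) : z
  ' Long form: subtract, then raise the result to 500 if it came out lower
  z := x - y
  if z < 500
    z := 500

PUB clampShort(x, y) : z
  ' Short form: #> is Spin's "limit minimum" operator. It binds more
  ' loosely than -, so this evaluates as (x - y) #> 500
  z := x - y #> 500
```

Either body can be pasted into the timed section of the object below to compare them.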

This is a simple application, but it is very useful if you play around with it. Thanks to rokicki for getting me started using cnt loops.

'CodeTimes.spin
{Check a group of code for how many clock cycles it takes to complete}
CON
  _clkmode  = xtal1 + pll16x                    ' use crystal x 16
  _xinfreq  = 5_000_000
  LCD_PIN   = 27                                ' for Parallax 4x20 serial LCD on A0
  LCD_BAUD  = 19_200
  LCD_LINES = 4
OBJ
  lcd : "debug_lcd"
VAR
  long stack1[20], codeCnt
PUB main
  if lcd.start(LCD_PIN, LCD_BAUD, LCD_LINES)    ' start lcd
    lcd.cursor(0)                               ' cursor off
    lcd.backLight(true)                         ' backlight on (if available)
    lcd.cls                                     ' clear the lcd
    lcd.str(string("Code Count:", 13))

  cognew(deltaCount, @stack1)
  repeat
    updateLcd(codeCnt)

PUB deltaCount | curCnt, newCnt, deltaCnt
  curCnt := cnt

  'insert code to time here

  newCnt := cnt
  deltaCnt := newCnt - curCnt
  codeCnt := deltaCnt - 368                     ' 368 is clock cycles to complete loop with no code inserted.

PRI updateLcd(value)
  lcd.gotoxy(12, 0)
  lcd.decf(value, 8)                            ' print right-justified decimal value
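
The hard-coded 368 is the measured cost of an empty timing span. A possible variation, sketched here as an assumption rather than as part of the object above, measures that overhead at run time instead. The names `measure`, `overhead`, `t0`, `t1` and `num` are made up for the example, and the two passes may still differ by a few clocks.

```spin
VAR
  long overhead, codeCnt

PUB measure | t0, t1, num
  num := 0

  ' Calibration pass: nothing between the two cnt reads, so the
  ' difference is the fixed cost of the measurement itself
  t0 := cnt
  t1 := cnt
  overhead := t1 - t0

  ' Timed pass: the same two reads with the code under test in between
  t0 := cnt
  num++                                         ' code to time goes here
  t1 := cnt
  codeCnt := t1 - t0 - overhead
```

Because the subtraction is done in 32-bit arithmetic, it still gives the right difference if cnt rolls over between the two reads.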

CardboardGuru · 2007-06-28 13:41

RichardF said...

"num := num + 1" takes 656 clock cycles
"num++" takes 336 clock cycles; twice as fast!

The second also takes up half as much memory: 2 bytes (one two-byte instruction) vs. 4 bytes (four one-byte instructions).

You could add a facility to compare code sizes into this little app. The address of the start of variable space is held in location $0008, so by comparing this with the empty case, you can print out how many bytes have been used by the code.
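
A rough sketch of that idea, assuming the value at $0008 is a 16-bit word. `EMPTY_VAR_BASE` is a hypothetical constant: its value here is only a placeholder, and you would replace it with the number printed by a build that has no code in the timed section.

```spin
CON
  EMPTY_VAR_BASE = 0                            ' placeholder: fill in the value seen with no code inserted

PUB varBase : addr
  ' Read the word at hub address $0008 (start of variable space)
  addr := word[$0008]

PUB codeBytes : n
  ' Bytes added by the inserted code, relative to the empty case
  n := varBase - EMPTY_VAR_BASE
```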

Jonathan · 2007-06-28 15:37

Thanks for sharing, handy little tool!

Jonathan

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.madlabs.info - Home of the Hydrogen Fuel Cell Robot

rokicki · 2007-06-28 17:42

This kind of stuff is fun to do. I would (of course) prefer it if the *compiler* would do it,
as it is easy to take this too far and start generating totally unreadable code with huge
expressions (not to mention dependencies on order of execution, which even when
understood are best separated into separate functions.) But we know that's not going
to happen, at least not from Parallax.

I was going to write a command-line standalone cross-platform Spin compiler with
two modes: optimize, and 100%-equivalent. (The sourceforge project is called spincc.)
But so far I haven't found a round tuit.

Anyway, you have to be impressed at how tight the Spin interpreter is. 336 cycles to
do foo++!

Note that some variables (the first seven locals, and the first eight parameters, and the
first 8 (I think) long variables) are accessed somewhat faster than the other ones.
Also, remember you *always* get a RESULT variable even if you don't use it, so that's
the way you get the eighth fast local. When I was working on the fat16 code, I shaved
off quite a few bytes by making sure I only used seven local variables (this meant I
reused some, perhaps in a way that threatened the code's readability.)

Anytime you see [0] you can just eliminate it and end up with faster tighter code.
(This goes for a[0] as well as a[0].) If I remember correctly a[i][j] is faster and
shorter than a[i+j], however.
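
A small illustration of the [0] point, using a hypothetical array `buf`. In Spin an array name with no index refers to its first element, so both assignments below write to the same long. The speed and size difference is the claim above, not something shown by this sketch.

```spin
VAR
  long buf[8]

PUB demo
  buf[0] := 5                                   ' explicit index 0
  buf := 5                                      ' same element, index omitted
```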

There's a ton more little tricks you can do. It's quite fun.

CardboardGuru · 2007-06-28 17:56

rokicki said...

Anyway, you have to be impressed at how tight the Spin interpreter is. 336 cycles to
do foo++!

Depends what you mean by tight. Yes, it is slow, but it's also interpreting a bytecode system with 255 single-byte operations, some of which have sub-operations stored in subsequent bytes. And it's doing that in a cog with space for only 496 instructions/longs. Less than 2 instructions per operation in the interpreter. That's tight! Perhaps it has to page in the code to do a particular instruction from ROM.

rokicki · 2007-06-28 18:05

When I said "tight" I meant---fast, impressive. 336 cycles is only 84 instructions not even counting hub access, and
we know there are at least three hub accesses. That's pretty good to do all the dispatch, read, write, and operation
itself. I think the spin interpreter is an amazing piece of work.
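
The arithmetic behind that figure, as a small sketch. It assumes 4 clocks per cog instruction, which is what 336 / 84 implies, and the method names are made up for the example.

```spin
CON
  CLOCKS_PER_INSTRUCTION = 4                    ' assumed: typical cog instruction time

PUB clocksToInstructions(clocks) : n
  ' 336 clocks -> 84 instruction slots
  n := clocks / CLOCKS_PER_INSTRUCTION

PUB clocksToTenthsOfMicroseconds(clocks) : t
  ' When clkfreq is 80_000_000 (5 MHz crystal x 16, as in CodeTimes.spin),
  ' 336 clocks -> 42, i.e. 4.2 microseconds
  t := clocks * 10 / (clkfreq / 1_000_000)
```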

Last night I demo'ed my Prop bot. Every bit of code on the bot (right now) is in Spin (although it does use the
counters for A/D). No need for assembly at all to do an amazing amount of stuff.

RichardF · 2007-06-28 23:16

Thank you for sharing your thoughts. I have a lot of things I want to try to optimize the interpreter. I am fascinated with the ctrx registers and the fact that the a and b counters run in the background. With 8 cogs that is 16 independent counters running that can access 32 I/O pins! All for $12.95! Amazing. I bought one of the first KayPros running CP/M. How far we have come in 30 years.
Richard

mirror · 2007-06-29 02:58

I think it would be good to have a collection of Spin "tricks" somewhere.

I loved the "num := num + 1" versus "num++" comparison. It's interesting looking at some of the Parallax-written code; they seem to understand the issues. It does seem that the more you can stuff onto a line, the fewer clock cycles it'll use. It seems that having assignments and comparisons on the same line is faster, but it does sometimes make the logical flow a bit more tricky. This line from FullDuplexSerial is a beauty:

repeat until (rxbyte := rxcheck) => 0 or (cnt - t) / (clkfreq / 1000) > ms

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
It's not all that hard to count the number of grains of sand on the beach. The hardest part is making a start - after that it just takes time. Mirror - 15 May 2007

mirror · 2007-06-29 05:56

I often find I do something like:

val := startVal 
repeat 
  func(val) 
  val += IncAmount 
  if val => endVal 
    val -= (endVal - startVal)

So if startVal := 1 and endVal := 11 and IncAmount := 3 then we'll get:
func(1), func(4), func(7), func(10), func(3), func(6), func(9), func(2), func(5), func(8), func(1) etc

I guess I can reduce it to:

val := startVal 
repeat 
  func(val) 
  if (val += IncAmount) => endVal 
    val -= (endVal - startVal)

But, can I do better? I guess what I want is a modulo wrap with a base of something other than 0.
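
One way such a wrap could be written, as a sketch only: shift the range down so it starts at 0, apply Spin's modulus operator //, then shift it back up. The method name `nextVal` is hypothetical, and it assumes startVal =< val < endVal and incAmount => 0. Whether it actually beats the if-based version is something to check with the timing object rather than assume.

```spin
PUB nextVal(val, startVal, endVal, incAmount) : r
  ' Move the range to start at 0, wrap with //, then move it back
  r := startVal + (val - startVal + incAmount) // (endVal - startVal)
```

With startVal = 1, endVal = 11 and incAmount = 3, stepping from 10 gives 1 + (12 // 10) = 3 and stepping from 8 gives 1 + (10 // 10) = 1, matching the sequence listed above.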
