An easy way to optimize Spin code speed
RichardF
Posts: 168
The following object times, in changes of the system cnt register, how long a group of code takes to run. You can test different ways of accomplishing the same objective to see which Spin commands are most efficient. It is an educational tool that can make you appreciate the many shortcut math and binary operators provided in the Spin language. For example:
"num := num + 1" takes 656 clock cycles
"num++" takes 336 clock cycles; twice as fast!
This code group takes 1744 clock cycles to complete:
z := x - y
if z < 500
  z := 500
The same thing using z := x - y #> 500 takes 1072 clock cycles, a big difference.
This is a simple application but has a lot of usefulness if you play around with it. Thanks to rokicki for getting me started using cnt loops.
'CodeTimes.spin
{Check a group of code for how many clock cycles it takes to complete}

CON
  _clkmode = xtal1 + pll16x                      ' use crystal x 16
  _xinfreq = 5_000_000

  LCD_PIN   = 27                                 ' for Parallax 4x20 serial LCD on A0
  LCD_BAUD  = 19_200
  LCD_LINES = 4

OBJ
  lcd : "debug_lcd"

VAR
  long stack1[20], codeCnt

PUB main
  if lcd.start(LCD_PIN, LCD_BAUD, LCD_LINES)     ' start lcd
    lcd.cursor(0)                                ' cursor off
    lcd.backLight(true)                          ' backlight on (if available)
    lcd.cls                                      ' clear the lcd
    lcd.str(string("Code Count:", 13))
  cognew(deltaCount, @stack1)
  repeat
    updateLcd(codeCnt)

PUB deltaCount | curCnt, newCnt, deltaCnt
  curCnt := cnt
  'insert code to time here
  newCnt := cnt
  deltaCnt := newCnt - curCnt
  codeCnt := deltaCnt - 368                      '368 is clock cycles to complete loop with no code inserted.

PRI updateLcd(value)
  lcd.gotoxy(12, 0)
  lcd.decf(value, 8)                             ' print right-justified decimal value
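As a sketch of how the harness might be used, here is deltaCount with the limit-minimum example from the top of the post dropped into the timed slot. The values given to x and y are arbitrary assumptions for illustration; Spin does not zero local variables other than the result, so they are set explicitly before timing starts.

```spin
PUB deltaCount | curCnt, newCnt, deltaCnt, x, y, z
  x := 1_000                       ' arbitrary test values (assumption)
  y := 700
  curCnt := cnt
  z := x - y #> 500                ' code under test: difference limited to a minimum of 500
  newCnt := cnt
  deltaCnt := newCnt - curCnt
  codeCnt := deltaCnt - 368        ' subtract the empty-harness overhead quoted in the post
```

To compare alternatives, swap the single line between the two cnt reads for the if-based version and note the new count on the LCD.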
Comments
The second also takes up half as much memory. 2 bytes (one two byte instruction) vs 4 bytes (4 one byte instructions).
You could add a facility to compare code sizes into this little app. The address of the start of variable space is held in location $0008, so by comparing this with the empty case, you can print out how many bytes have been used by the code.
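A rough sketch of that idea, reusing the lcd object from CodeTimes.spin. Both names are made up for illustration: EMPTY_VBASE is a hypothetical constant that would have to be filled in by hand with the value read when the timing slot is empty, and showCodeSize is a hypothetical method name.

```spin
CON
  EMPTY_VBASE = 0                  ' hypothetical: replace with the value read with no code inserted

PUB showCodeSize | vbase
  vbase := word[$0008]             ' start of variable space, per the comment above
  lcd.gotoxy(0, 1)                 ' second LCD line
  lcd.str(string("Code Bytes:"))
  lcd.decf(vbase - EMPTY_VBASE, 8) ' growth in bytes relative to the empty case
```

The difference may include any alignment padding the compiler adds, so small changes in code size might not show up one byte at a time.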
Jonathan
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.madlabs.info - Home of the Hydrogen Fuel Cell Robot
as it is easy to take this too far and start generating totally unreadable code with huge
expressions (not to mention dependencies on order of execution, which even when
understood are best separated into separate functions.) But we know that's not going
to happen, at least not from Parallax.
I was going to write a command-line standalone cross-platform Spin compiler with
two modes: optimize, and 100%-equivalent. (The sourceforge project is called spincc.)
But so far I haven't found a round tuit.
Anyway, you have to be impressed at how tight the Spin interpreter is. 336 cycles to
do foo++!
Note that some variables (the first seven locals, and the first eight parameters, and the
first 8 (I think) long variables) are accessed somewhat faster than the other ones.
Also, remember you *always* get a RESULT variable even if you don't use it, so that's
the way you get the eighth fast local. When I was working on the fat16 code, I shaved
off quite a few bytes by making sure I only used seven local variables (this meant I
reused some, perhaps in a way that threatened the code's readability.)
Anytime you see [0] you can just eliminate it and end up with faster tighter code.
(This goes for a[0] as well as a[0].) If I remember correctly a[ i ][ j ] is faster and
shorter than a[i+j], however.
There's a ton more little tricks you can do. It's quite fun.
"336 cycles to do foo++!"
Depends what you mean by tight. Yes it is slow, but it's also interpreting a bytecode system with 255 single byte operations, some of which have sub-operations stored in subsequent bytes. And it's doing that in a cog with space for only 496 instructions/longs. Less than 2 instructions per operation in the interpreter. That's tight! Perhaps it has to page in the code to do a particular instruction from ROM.
we know there are at least three hub accesses. That's pretty good to do all the dispatch, read, write, and operation
itself. I think the spin interpreter is an amazing piece of work.
Last night I demo'ed my Prop bot. Every bit of code on the bot (right now) is in Spin (although it does use the
counters for A/D). No need for assembly at all to do an amazing amount of stuff.
Richard
I loved the "num := num + 1" versus "num++" comparison. It's interesting looking at some of the Parallax-written code; they seem to understand the issues. It does seem that the more you can stuff onto a line, the fewer clock cycles it'll use. It seems that having assignments and comparisons on the same line is faster, but it does sometimes make the logical flow a bit more tricky. This line from FullDuplexSerial is a beauty:
repeat until (rxbyte := rxcheck) => 0 or (cnt - t) / (clkfreq / 1000) > ms
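For comparison, the same wait could be spelled out over several lines, roughly as below. This is only a sketch: the method name rxtimeLong is made up, and it assumes the method lives inside FullDuplexSerial, where rxcheck is available and returns a negative value when no byte is waiting.

```spin
PUB rxtimeLong(ms) : rxbyte | t
  t := cnt                                   ' remember the starting count
  repeat
    rxbyte := rxcheck                        ' poll for a byte
    if rxbyte => 0                           ' got one: stop waiting
      quit
    if (cnt - t) / (clkfreq / 1000) > ms     ' timed out: stop waiting
      quit
```

Whether the one-line form really is faster would have to be measured, for example by dropping each version into the timing object from the first post.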
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
It's not all that hard to count the number of grains of sand on the beach. The hardest part is making a start - after that it just takes time. Mirror - 15 May 2007
So if startVal := 1 and endVal := 11 and IncAmount := 3 then we'll get:
func(1), func(4), func(7), func(10), func(3), func(6), func(9), func(2), func(5), func(8), func(1) etc
I guess I can reduce it to:
But, can I do better? I guess what I want is a modulo wrap with a base of something other than 0.
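One possible way to write such a wrap, offered as a sketch rather than the poster's own solution: shift the value down by startVal, take the remainder, then shift it back up. The method name nextWrapped is hypothetical, and the wrap length of endVal - startVal is inferred from the sequence above, which never reaches 11. It assumes i starts within range and incAmount is not negative, since Spin's // is a signed remainder.

```spin
PUB nextWrapped(i, startVal, endVal, incAmount)
  ' advance i by incAmount and wrap into startVal .. endVal - 1
  return (i - startVal + incAmount) // (endVal - startVal) + startVal
```

With startVal = 1, endVal = 11 and incAmount = 3, starting from 1 this gives 4, 7, 10, 3, 6, 9, 2, 5, 8, 1, which matches the list above.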