An easy way to optimize Spin code speed - Parallax Forums

An easy way to optimize Spin code speed

RichardF Posts: 168
edited 2007-06-29 05:56 in Propeller 1
The following object times, in ticks of the system counter (cnt) register, how long it takes to run a group of code. You can test different ways of accomplishing the same objective to see which Spin commands are most efficient. It is an educational tool that can make you appreciate the many shortcut math and binary operators provided in the Spin language. For example:

"num := num + 1" takes 656 clock cycles
"num++" takes 336 clock cycles; twice as fast!

This code group takes 1744 clock cycles to complete:
z := x - y
if z < 500
  z := 500
The same thing using z := x - y #> 500 takes 1072 clock cycles, a big difference.

This is a simple application, but it is quite useful if you play around with it. Thanks to rokicki for getting me started using cnt loops.

'CodeTimes.spin
{Check a group of code for how many clock cycles it takes to complete}
CON
  _clkmode = xtal1 + pll16x                    ' use crystal x 16
  _xinfreq = 5_000_000
  LCD_PIN   = 27                               ' for Parallax 4x20 serial LCD on A0
  LCD_BAUD  = 19_200
  LCD_LINES = 4
OBJ
  lcd : "debug_lcd"
VAR
  long stack1[20], codeCnt
PUB main
  if lcd.start(LCD_PIN, LCD_BAUD, LCD_LINES)   ' start lcd
    lcd.cursor(0)                              ' cursor off
    lcd.backLight(true)                        ' backlight on (if available)
    lcd.cls                                    ' clear the lcd
    lcd.str(string("Code Count:", 13))

  cognew(deltaCount, @stack1)
  repeat
    updateLcd(codeCnt)

PUB deltaCount | curCnt, newCnt, deltaCnt
  curCnt := cnt

  'insert code to time here

  newCnt := cnt
  deltaCnt := newCnt - curCnt
  codeCnt := deltaCnt - 368                    ' 368 is clock cycles to complete loop with no code inserted.

PRI updateLcd(value)
  lcd.gotoxy(12, 0)
  lcd.decf(value, 8)                           ' print right-justified decimal value
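
As a rough sketch of how the timed region might be filled in, here is a version of deltaCount with the limit-minimum example from above dropped in. It is meant to replace deltaCount inside CodeTimes.spin, so it relies on that object's codeCnt variable. The extra locals x, y and z and their sample values are assumptions made for illustration, and 368 is simply the empty-loop overhead reported in the post, so it may need re-measuring on your own setup.

```spin
PUB deltaCount | curCnt, newCnt, deltaCnt, x, y, z
  x := 1_000                        ' sample inputs (hypothetical values)
  y := 700
  curCnt := cnt                     ' snapshot the system counter before the code under test

  z := x - y #> 500                 ' code under test: subtract, then limit to a minimum of 500

  newCnt := cnt                     ' snapshot the counter again afterwards
  deltaCnt := newCnt - curCnt
  codeCnt := deltaCnt - 368         ' subtract the empty-loop overhead quoted in the post
```

Swapping the single line under test for the three-line if version should let you compare the two counts on the LCD.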

Comments

  • CardboardGuru Posts: 443
    edited 2007-06-28 13:41
    RichardF said...

    "num := num + 1" takes 656 clock cycles
    "num++" takes 336 clock cycles; twice as fast!

    The second also takes up half as much memory: 2 bytes (one two-byte instruction) vs 4 bytes (four one-byte instructions).

    You could add a facility to compare code sizes into this little app. The address of the start of variable space is held in location $0008, so by comparing this with the empty case, you can print out how many bytes have been used by the code.
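
    A minimal sketch of that idea, assuming (as stated above) that the word at hub address $0008 holds the start of variable space. The names EMPTY_VBASE, varSpaceStart and codeBytes are hypothetical, and EMPTY_VBASE is a placeholder you would replace with the value you observe when no test code is inserted.

```spin
CON
  EMPTY_VBASE = 0                   ' placeholder: value read with no test code inserted

PUB varSpaceStart : addr
  addr := word[$0008]               ' read the 16-bit value stored at hub address $0008

PUB codeBytes : n
  n := varSpaceStart - EMPTY_VBASE  ' growth relative to the empty case, in bytes
```

    Because these helper methods are present in both the empty and the test build, their own size should cancel out of the difference.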
  • Jonathan Posts: 1,023
    edited 2007-06-28 15:37
    Thanks for sharing, handy little tool!

    Jonathan

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.madlabs.info - Home of the Hydrogen Fuel Cell Robot
  • rokicki Posts: 1,000
    edited 2007-06-28 17:42
    This kind of stuff is fun to do. I would (of course) prefer it if the *compiler* would do it,
    as it is easy to take this too far and start generating totally unreadable code with huge
    expressions (not to mention dependencies on order of execution, which even when
    understood are best separated into separate functions.) But we know that's not going
    to happen, at least not from Parallax.

    I was going to write a command-line standalone cross-platform Spin compiler with
    two modes: optimize, and 100%-equivalent. (The sourceforge project is called spincc.)
    But so far I haven't found a round tuit.

    Anyway, you have to be impressed at how tight the Spin interpreter is. 336 cycles to
    do foo++!

    Note that some variables (the first seven locals, and the first eight parameters, and the
    first 8 (I think) long variables) are accessed somewhat faster than the other ones.
    Also, remember you *always* get a RESULT variable even if you don't use it, so that's
    the way you get the eighth fast local. When I was working on the fat16 code, I shaved
    off quite a few bytes by making sure I only used seven local variables (this meant I
    reused some, perhaps in a way that threatened the code's readability.)

    Anytime you see [0] you can just eliminate it and end up with faster tighter code.
    (This goes for a[0] as well as a[0].) If I remember correctly a[ i ][ j ] is faster and
    shorter than a[i+j], however.

    There's a ton more little tricks you can do. It's quite fun.
  • CardboardGuru Posts: 443
    edited 2007-06-28 17:56
    "Anyway, you have to be impressed at how tight the Spin interpreter is. 336 cycles to
    do foo++!"

    Depends what you mean by tight. Yes, it is slow, but it's also interpreting a bytecode system with 255 single-byte operations, some of which have sub-operations stored in subsequent bytes. And it's doing that in a cog with space for only 496 instructions/longs. That's less than 2 instructions per operation in the interpreter. That's tight! Perhaps it has to page in the code to do a particular instruction from ROM.
  • rokicki Posts: 1,000
    edited 2007-06-28 18:05
    When I said "tight" I meant fast, impressive. 336 cycles is only 84 instructions not even counting hub access, and
    we know there are at least three hub accesses. That's pretty good to do all the dispatch, read, write, and operation
    itself. I think the spin interpreter is an amazing piece of work.

    Last night I demo'ed my Prop bot. Every bit of code on the bot (right now) is in Spin (although it does use the
    counters for A/D). No need for assembly at all to do an amazing amount of stuff.
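
    The cycles-to-instructions arithmetic above can be sketched as a small helper object. The method names are hypothetical; the clock settings are copied from CodeTimes.spin (5 MHz crystal x 16 = 80 MHz), and the 4-clocks-per-instruction figure is the usual cost of a Propeller 1 assembly instruction, ignoring hub waits.

```spin
CON
  _clkmode = xtal1 + pll16x         ' same clock setup as CodeTimes.spin
  _xinfreq = 5_000_000

PUB cyclesToInstructions(cycles) : n
  n := cycles / 4                   ' e.g. 336 cycles / 4 = 84 instructions

PUB cyclesToNanoseconds(cycles) : ns
  ' scale by 1000 first to keep integer precision; overflows for cycle counts above about 2 million
  ns := cycles * 1000 / (clkfreq / 1_000_000)   ' e.g. 336 cycles at 80 MHz = 4200 ns
```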
  • RichardF Posts: 168
    edited 2007-06-28 23:16
    Thank you for sharing your thoughts. I have a lot of things I want to try to optimize the interpreter. I am fascinated with the ctrx registers and the fact that the A and B counters run in the background. With 8 cogs, that is 16 independent counters running that can access 32 I/O pins! All for $12.95! Amazing. I bought one of the first KayPros running CP/M. How far we have come in 30 years.
    Richard
  • mirror Posts: 322
    edited 2007-06-29 02:58
    I think it would be good to have a collection of Spin "tricks" somewhere.

    I loved the "num := num + 1" versus "num++" comparison. It's interesting looking at some of the Parallax-written code; they seem to understand the issues. It does seem that the more you can stuff onto a line, the fewer clock cycles it'll use. It seems that having assignments and comparisons on the same line is faster, but it does sometimes make the logical flow a bit more tricky. This line from FullDuplexSerial is a beauty:

    repeat until (rxbyte := rxcheck) => 0 or (cnt - t) / (clkfreq / 1000) > ms

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    It's not all that hard to count the number of grains of sand on the beach. The hardest part is making a start - after that it just takes time. Mirror - 15 May 2007
  • mirror Posts: 322
    edited 2007-06-29 05:56
    I often find I do something like:
    val := startVal 
    repeat 
      func(val) 
      val += IncAmount 
      if val => endVal 
        val -= (endVal - startVal)
    

    So if startVal := 1 and endVal := 11 and IncAmount := 3 then we'll get:
    func(1), func(4), func(7), func(10), func(3), func(6), func(9), func(2), func(5), func(8), func(1) etc

    I guess I can reduce it to:
    val := startVal 
    repeat 
      func(val) 
      if (val += IncAmount) => endVal 
        val -= (endVal - startVal)
    

    But, can I do better? I guess what I want is a modulo wrap with a base of something other than 0.
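
    One possible way to express a wrap with a non-zero base is to shift the value down to a zero base, apply Spin's remainder operator (//), and shift back up. This is only a sketch: wrapDemo and the empty func are placeholders, the values are the ones from the example above, and it assumes IncAmount is not negative and val never drops below startVal. Whether it is actually faster than the if version is something to measure with the timing object, since it trades a compare for a division.

```spin
PUB wrapDemo | val, startVal, endVal, incAmount
  startVal := 1
  endVal := 11
  incAmount := 3
  val := startVal
  repeat
    func(val)
    ' step, then wrap into the range startVal..endVal-1
    val := startVal + (val - startVal + incAmount) // (endVal - startVal)

PRI func(value)
  ' placeholder for the real work
```

    With these values the sequence is the same as before: 1, 4, 7, 10, 3, 6, 9, 2, 5, 8, 1 and so on.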