
Code size: Spin Vs. Assembly ?

Rayman Posts: 14,162
edited 2007-12-01 11:56 in Propeller 1
Ok, so assembly is 80 to 200 times faster than Spin (so says deSilva). But what about the relative size of the code to do the same things?

It appears to me that SPIN compiles to very small code, perhaps comparable in size to the equivalent assembly code. But I haven't looked at this in any detail.

Anybody looked at this?

Comments

  • Martin Hebel Posts: 1,239
    edited 2007-11-30 14:01
    Spin doesn't 'compile' at all. Tokens representing the Spin program are stored in main RAM. The Cog's interpreter fetches the next token, and processes it using 80 to 200 ASM instructions to perform the specified operation.

    This is one reason why interpreted languages can be argued to be better than compiled ones. It takes very little room to store the tokenized version of the program.

    -Martin

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    SelmaWare Solutions - StampPlot GUI for controllers, XBee and Propeller Application Boards

    Southern Illinois University Carbondale, Electronic Systems Technologies
  • hippy Posts: 1,981
    edited 2007-11-30 14:42
    I'd say Spin is compiled, at least in the sense that it turns source code into instruction-set opcodes for the Spin virtual machine. In the same way, we would still call a compiler a compiler even if the instruction set of a processor is actually interpreted by microcode. Conversely, of course, we could argue that there's no such thing as compilation for a microcoded processor either :)

    I wouldn't really call the Spin bytecode "tokens" either, not in the same way that the BASIC Stamp and other home computers of the '80s tokenised programs.

    Anyway, back to the issue; Spin size versus Assembly size ... I don't think there's any easy comparison to be had in general.

    Spin bytecode is quite efficiently optimised, so common operations like pushing zero or one to the stack take a single byte, and popping it into some global or local (stack) variable is also a byte; that's a 2:1 improvement, and more so when moving a number over 511, which requires two longs in Assembler but can be as short as three bytes in Spin. Then again it depends on the value of the number and where it's going; the advantage moves between the two (see the sketch at the end of this post).

    There are some specialised opcodes which make Case and Repeat very space efficient, but then one would probably use different techniques in Assembler than in Spin. The compiler doesn't optimise, so an unnoticed 'poor choice' of coding style can have an impact on both space and execution time.

    Trying to compare at the opcode level is near impossible (they are very different architectures); only in terms of particular tasks could a more realistic comparison be made, IMO.

    Spin appears to be more compact than Assembler, but I'm not sure if that's simply because it has 32KB instead of 2KB to reside in. My gut feeling is that they both achieve the same code density overall.
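
    As a rough sketch of the large-literal point (the names and byte counts here are an illustration, not measured from an actual compiler listing), compare the two forms:

    ' Spin: a literal over 511 still compiles to just a few bytes of bytecode
    PUB setMask | mask
      mask := 1000                  ' push literal, pop into the local - a handful of bytes

    ' Assembly: 1000 won't fit in a 9-bit #immediate, so it costs a mov
    ' plus a long to hold the constant - two longs (8 bytes) of cog RAM
    DAT
            org     0
    entry   mov     maskReg, bigVal
    here    jmp     #here               ' park the cog; this is only a size sketch
    bigVal  long    1000
    maskReg res     1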
  • Ken Peterson Posts: 806
    edited 2007-11-30 15:25
    Has anyone ever "compiled" a set of guidelines for manually optimizing SPIN?

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    The more I know, the more I know I don't know. Is this what they call Wisdom?
  • Rayman Posts: 14,162
    edited 2007-11-30 15:43
    I would like to see an optimization guide too! I had to find fast enough code by trial and error for my wav player...
  • hippy Posts: 1,981
    edited 2007-12-01 04:54
    The problem there is determining what is optimal. The smallest code size or the shortest, simplest-looking instruction sequence may not necessarily mean the fastest execution. My theory has been that shorter bytecode sequences are likely faster, because the work is mainly being done in-cog rather than spread over a longer sequence of bytecodes.

    Things like var++ and var~ look to be more efficient and faster than var+=1 and var:=0, because the latter push then pop while the former work with reference to a pointer, so there should be a lot less interaction with hub memory going on. Accessing byteArray[ i ] may be faster than using 'pointers' in Spin ( byte[ @byteArray + i ] ) because there are specific opcodes to do indexing from an offset. (A combined sketch of these idioms follows at the end of this post.)

    The compiler really is non-optimising, so one instant saving is to get used to using 'constant'.

    var := var + "A" - "0"

    will do three pushes, an add, a subtract and a pop.

    var += constant( "A" - "0" )

    is a lot more efficient.

    Likewise if using CON values to control flow ...

    if PIN_NUMBER => 28

    if constant( PIN_NUMBER => 28 )
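
    Pulling those together, here is a minimal sketch of the idioms above in one place (the CON/VAR names and the method are invented purely for illustration; the relative costs are as argued above, not measured):

    CON
      PIN_NUMBER = 30

    VAR
      long count
      byte byteArray[32]

    PUB demo(i) | c
      count++                          ' in-place increment: no push-then-pop round trip
      count~                           ' in-place clear, likewise
      c := byteArray[i]                ' indexed access rather than byte[ @byteArray + i ]
      c += constant("A" - "0")         ' folded to a single literal at compile time
      if constant(PIN_NUMBER => 28)    ' also folded at compile time
        count += c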
  • Harrison. Posts: 484
    edited 2007-12-01 06:36
    Thanks for the reminder on using the constant() compiler pseudo-function. I saved about 37 longs in my TCP stack just by doing that. I also saved a few longs by removing unneeded bitmasks in long-to-byte conversions.
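
    A minimal sketch of the kind of redundant mask being described (the names are invented; this assumes the point is that storing into a byte-sized variable truncates to the low 8 bits anyway):

    VAR
      long rxLong
      byte checksum

    PUB store
      checksum := rxLong & $FF     ' the mask costs extra bytecode ...
      checksum := rxLong           ' ... but the byte-sized destination truncates anyway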
  • deSilva Posts: 2,967
    edited 2007-12-01 11:56
    I posted this as an example half a year ago:
    ----
    deSilva said...

    It is not widely known that SPIN allows fully recursive calls! This can be emulated within an assembly program by installing an ad-hoc stack mechanism.
    It will be instructive anyhow to have a look at the many "patches" (self-modified instructions). Note that a "patch" must never span a JMPRET (aka CALL), as the code has to stay re-entrant!

    The speed-up is about 40x, which is not so overwhelming compared to the general speed-up from SPIN to hand-made assembly (about 80x according to my experience); this reveals a very efficient stack management within SPIN!

    This innocent looking piece of SPIN.....
    PUB spinFibo(n)
      if n>2
         return spinFibo(n-1)+ spinFibo(n-2)
      else
         return 1
    


    ... has thus created this "assembly-monster":
    DAT
    fiboasm
    ' PAR shall contain a reference to 2 longs
    '  [ 0 ] Argument for fibo (0: result ready)
    '  [ 1 ] Result
     
        mov a, #$1ff
        add a, cnt
        waitcnt a,#0    ' save energy while idling
        rdlong a, par
        tjz a,#fiboasm
      ' organize a stack
        mov stackP, #stack
        
        jmpret retaddr, #fibo   ' result = fibo(a)
    
       ' result available
        mov a, par
        add a, #4
        wrlong result, a
        mov a,#0
        wrlong a, par
        jmp #fiboasm
    
    fibo
    ' if a<3 return 1
        cmps a, #3   wc
        mov result, #1
        if_c jmp retaddr
    
        
        add stackP, #1   ' points to the LAST USED entry
        movd :f1, stackP
        add stackP, #1       
        movd :f2, stackP  
    :f1 mov 0-0, retaddr ' push return address 
    :f2 mov 0-0, a       ' push argument
    
        sub a, #1 
        jmpret retaddr, #fibo ' call fibo(a-1)
    
        movs :f3, stackP 
        movd :f4, stackP 
    :f3 mov a, 0-0      ' get argument
                        '... and substitute by result
    :f4 mov 0-0, result                    
    
        sub a, #2
        jmpret retaddr, #fibo  ' call fibo(a-2)         
    
    ' add both results
        movs :f5, stackP
        sub stackP, #1
    :f5 add result, 0-0  
        movs :f6, stackP   ' return to caller
        sub stackP, #1     ' adjust stack
    :f6 jmp 0-0    
    
    
    retaddr  res 1
    result res 1
    a  res 1
    
    ' The stack runs from lower to higher addresses; stackP always points to the last used entry!
    stackP res 0    ' a little bit over-optimized :)
    stack res 100     ' ... or as long as it will go
    





    ----
    If you are interested in the general timing without trying it yourself:
    fibo(29) needed:
    26 sec with SPIN
    1.8 sec with PHP on my mid-range Windows Notebook
    800 ms with the above posted piece of code
    30 ms with a very efficient FORTH Implementation on my mid-range Windows Notebook
    ----
    BTW: I am well aware that there are simple algorithms to compute the n-th Fibonacci number in O(1) - this is obviously not the point :)


    Space efficiency is a matter of what has been "anticipated" in the design of the SPIN interpreter.
    Recursive calls of procedures (or just ordinary calls of procedures in the first place) have been anticipated.
    Procedure variables (pointers to methods) have not...

    There was a nice example the other day: "How to compute the parity of a word?" In assembly this is just ONE machine instruction; the SPIN code is awkward (see the sketch at the end of this post)...

    In this FIBONACCI example we need:

    36 bytes for the SPIN byte code
    152 bytes for machine code


    This is most likely an extreme situation; however, note that you have to augment your assembly program with many goodies (multiplication, division, searching, ...) that are freely available in the SPIN COG... Such things add to its size...

    A most important consideration is how to work with algorithmically really complex programs... Their maximum size can be 8k machine instructions using a clever swapping technique; realistically it will be no more than 4k instructions...

    This will also be the bottleneck for future C programs, until they provide SD-card overlay swapping...
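
    As a sketch of the parity example mentioned above (an illustration, not code from that thread): presumably the single instruction is TEST with the WC effect, which leaves the parity of the ANDed result in C, so one instruction plus a mask constant does the whole job, while Spin has to fold the bits down by hand.

    DAT
            org     0
    parity  test    value, allbits  wc   ' C := parity of all 32 bits of value
            muxc    result, #1           ' copy C into bit 0 of result
    here    jmp     #here                ' park the cog; this is only a sketch
    allbits long    $FFFFFFFF
    value   long    $DEADBEEF
    result  long    0

    PUB parityOf(x)
      x ^= x >> 16                       ' fold the upper half onto the lower half ...
      x ^= x >> 8
      x ^= x >> 4
      x ^= x >> 2
      x ^= x >> 1
      return x & 1                       ' ... until bit 0 holds the parity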