Suggestion for improving Spin performance — Parallax Forums


Dave Hein Posts: 6,347
edited 2010-02-06 15:17 in Propeller 1
Anybody who has worked with Spin discovers at some point that it runs much slower than a similar program written in assembly. One reason is that the program and variables are stored in hub memory. Another reason is that the interpreter must execute many instructions to decode and execute each Spin instruction.

Some applications are speed-critical, and they will not run fast enough in Spin. The obvious solution is to convert the speed-critical portions of the code to assembly. This is fine for small pieces of code, but it can become unwieldy for large code segments.

My suggestion is to create new block designators that will cause the Spin code to be compiled into assembly. These new Spin blocks can only be run in cogs without the Spin interpreter, just like the current use of assembly code.

I am suggesting the following new block designators:

CVAR - Define variables in cog memory
CDAT - Define initialized data in cog memory
CPRI - Define a private method block that is compiled into assembly
CPUB - Define a public method block that is compiled into assembly

Using this method we could write code that looks like standard Spin code, but runs almost as fast as hand-written assembly code.

Dave

Comments

  • Oldbitcollector (Jeff) Posts: 8,091
    edited 2010-02-02 19:25
    >CPUB
    >CPRI

    I like the way you think!

    This almost sounds like something which the new Propeller BASIC might be able to accomplish
    with some modification. Since they've already done BASIC to ASSEMBLY, how much work
    would it take to do SPIN to ASSEMBLY?

    OBC

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    New to the Propeller?

    Visit the: The Propeller Pages @ Warranty Void.
  • Bill Henning Posts: 6,445
    edited 2010-02-02 20:13
    Argh! Feature requests already... even if it is one that was already in the back of my mind...

    Actually it would be a fair bit of work, as Spin requires a totally different parser, and uses indentation for blocks.

    I could do a Spin->LMM compiler, but I simply don't have time to do one at the moment. It is on the "TODO" list, but nowhere near the top.

    What would be easier, once PropellerBasic is released, would be for one of the existing compilers (Sphinx, bst, homespun) to be modified to emit PropellerBasic source code / PASM.
    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
    Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
    Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
    Las - Large model assembler Largos - upcoming nano operating system
  • Bill Henning Posts: 6,445
    edited 2010-02-02 20:15
    Great idea in theory.

    Unfortunately there is no spare space in Chip's interpreter, and not enough in Brad's - plus it would need new Spin op codes.

    Perhaps there might be a way to squeeze a very minimal LMM interpreter into Brad's - but it is very difficult, if not impossible.
  • Dave Hein Posts: 6,347
    edited 2010-02-02 20:42
    Actually there is no need to change the interpreter. Only the compiler needs to change. The compiler would generate the same output as if someone used embedded assembly in a DAT block. The main advantage is that the "assembly" code is written in Spin.
    I played around with converting Spin to assembly, and it is fairly straightforward. Based on what I've done in a few hours I think the whole compiler could be done in a few weeks.
    Dave
  • Mike Green Posts: 23,101
    edited 2010-02-02 21:00
    Dave,
    CVAR and CDAT are too simple for this sort of use since you have to somehow associate the cog storage with a particular block of code that may include one or more methods. CPUB isn't really needed since none of these routines can be called from outside the containing object. They can't even be called except from within a running cog. You could allow CVAR and CDAT to be used within a DAT block and they'd simply translate to RES or LONG / WORD / BYTE statements. CPRI could also occur within a DAT block and would translate into the appropriate instructions. Recursion wouldn't be allowed and local variables would be allocated as RES statements.
  • Bill Henning Posts: 6,445
    edited 2010-02-02 21:12
    David,

    I would love it if you could pull it off in a few weeks - so would everyone on the forums.

    I am afraid you may not be familiar with the limitations of COGs - namely that there are only 496 locations in the cogs that can be used to hold instructions or data.

    It might be possible to write a Spin subset compiler in a few weeks, one that did not use LMM and was limited to 496 longs for code+data; however, that would be of limited use.

    Back in 2006 I came up with the "Large Memory Model" (LMM for short) to work around that limitation, however it does cause a performance hit.

    It is entirely possible to write a Spin to LMM compiler - everyone would love to see one - however I am certain you would find that it would take more than a few weeks.

    PropBasic chose to go for the limited memory COG-only model for now, and the conversion from SX/B to PropBasic was much faster than the approach I am taking with PropellerBasic - which is trying to generate optimized LMM code. Mind you, I also have not had much time to work on my compiler lately due to hardware product development.

    Post Edited (Bill Henning) : 2/2/2010 9:17:12 PM GMT

  • Dave Hein Posts: 6,347
    edited 2010-02-02 21:47
    Bill, I'm fully aware of the limitations of a cog. I also know about your invention of the LMM and how it is used to extend the program and variable space. If you re-read my first post you will see no mention of LMM in it. I am only proposing extending the embedded assembly feature in a DAT block by using compiled Spin instead of assembly.

    As far as being of "limited use", it would be useful in applications that use one or more cogs to run interpreted Spin, and one or more cogs running assembly code. I think that would cover most of the current Propeller applications.

    I agree that LMM is a great feature, but it may not be needed in all applications. That being said, compiled Spin could be extended to use LMM in the future.

    Mike, I was thinking that a standard technique for doing remote method calls could be developed so that CPUB methods could be called from another cog. The Spin floating point object uses a technique for doing remote method calls.

    There is no reason that compiled Spin couldn't use hub variables and a stack. This would allow for recursive methods.

    CVAR and CDAT may have to be grouped in some way so they are associated with specific CPUB and CPRI methods. We may need another keyword to delineate CVAR, CDAT, CPUB and CPRI blocks that correspond to one cog image.

    Dave
  • Bill Henning Posts: 6,445
    edited 2010-02-02 23:19
    David,

    Sorry, I misinterpreted your original post.

    Now that I understand what you mean I agree that it would be useful, as it would allow pretty fast small cog programs with the same syntax.

    Perhaps something like:

    COG label | cog var names
    .. spin code to be translated to PASM

    would be a good syntax?

    Then it would look like a semi-normal Spin function.

  • Cluso99 Posts: 18,069
    edited 2010-02-03 02:10
    @Dave: Nice ideas. Here goes...

    The compilers could be modified (presuming there is enough benefit to get Brad and/or Michael to do it) to do what you say. However, there may be a way to do it without modification - I will need to think it through.

    I wrote a Spin interpreter, based on Chip's code. It is published (see Tool link in my signature). I was getting a speed gain of 20-25% and there was now space available because I relocated the bytecode decoding to hub RAM. Unfortunately, in my quest for even more speed, I modified the code and introduced a bug which I never tracked down. This could be a reason to find and fix the bug.

    Off the top of my head: a Spin routine could hand off to a cog routine (in a different cog, of course) and wait for a reply, just like we do for FullDuplexSerial etc. This way, there is no change required to the interpreter, compiler or anything else. It is really no different from what we do now, just that it sounds simpler. Comments???



    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Links to other interesting threads:

    · Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
    · Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
    · Prop Tools under Development or Completed (Index)
    · Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
    · Prop OS: SphinxOS, PropDos, PropCmd
    · Search the Propeller forums (uses advanced Google search)
    My cruising website is: www.bluemagic.biz  MultiBlade Props: www.cluso.bluemagic.biz
  • Dave Hein Posts: 6,347
    edited 2010-02-03 05:07
    I'm not sure how this could be implemented without modifying the Spin compilers. We could use a preprocessor that converts a Spin file into a Spin/Asm file. This could then be fed to the current Spin compilers. That's the approach I'm currently taking to check out the concept. Ultimately, it would be good to include it in the IDE. The assembly part could be completely hidden from the Spin programmer, or he could select an option to view the generated assembly.
    The technique I'm suggesting should provide a dramatic speedup. Assuming that Spin code is about 80 times slower than assembly, I would expect Spin compiled to assembly to run about 40 times faster than interpreted Spin.
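
A quick check of the arithmetic behind this estimate, as a sketch only. The 80x figure is the assumption stated in the post; the function name and the idea of describing compiled code as "N times slower than hand-written assembly" are illustrative, not measurements.

```python
def speedup_over_spin(spin_vs_asm: float, compiled_vs_asm: float) -> float:
    """Estimated speedup of compiled Spin over interpreted Spin.

    spin_vs_asm:     how many times slower interpreted Spin is than assembly
    compiled_vs_asm: how many times slower compiled Spin is than assembly
    """
    if spin_vs_asm <= 0 or compiled_vs_asm <= 0:
        raise ValueError("ratios must be positive")
    return spin_vs_asm / compiled_vs_asm


if __name__ == "__main__":
    # Assumed ratio from the post: Spin is about 80x slower than assembly.
    for slowdown in (1, 2, 4):
        gain = speedup_over_spin(80, slowdown)
        print(f"{slowdown}x slower than assembly: {gain:.0f}x faster than Spin")
```

Under the 80x assumption, a 40x gain corresponds to compiled code running at about half the speed of hand-written assembly.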
    BTW Bill, your suggestion for defining a COG block specifier sounds good. However, it would be useful to have some form of COG VAR and DAT blocks. Maybe the term COG could be used to turn on the assembly mode, and HUB would put it back into the interpreter mode. Or maybe the keywords could be ASM and INT.
    Dave
  • Bill Henning Posts: 6,445
    edited 2010-02-03 05:14
    I for one would love to see a Spin interpreter that could run HUGE programs because it ran the code from XMM...

    I'd suggest code, var, dat in XMM, local vars + stack in hub.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2010-02-03 07:13
    Here's the rub, as I see it: Compiling directly into a cog will severely limit the size of any compiled Spin programs. I think that a given routine will compile to more assembly instructions than most people realize. So going to LMM or XMM seems like a possible alternative. But you'll still have to emulate a stack machine to avoid restrictions on nesting and expression complexity; and in order to avoid slowdowns in basic things like multiplication, those will have to be implemented by routines within the LMM/XMM emulator cog, entailing some level of bytecode interpretation. The likely outcome is that a given operation will have to fetch far more instructions from the hub to execute than it would have in the Spin interpreter. I could be wrong, but my guess is that there will still be a speedup but that it will be marginal, at best.

    -Phil
  • BradC Posts: 2,601
    edited 2010-02-03 07:27
    It really sounds like what you are suggesting/after is Bean's PropBasic compiler, but with a SPIN syntax instead of Basic.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Life may be "too short", but it's the longest thing we ever do.
  • RossH Posts: 5,519
    edited 2010-02-03 10:29
    @Phil,

    Completely agree - I actually did some estimates but then didn't post them because I thought it would sound too much like I was spruiking Catalina. I think you would be lucky to end up with even half the longs available for use by compiled SPIN code in each cog, and you might expect to achieve a speed increase of about four times interpreted SPIN speeds - but only in special cases (such as small, tight loops). For most small functions any speed improvement you get from compiling the code is most likely going to be outweighed by the calling, argument passing and stack manipulation overheads.

    Also it is VERY DIFFICULT to write a compiler to do even the very trivial type of code optimizations that most programmers can easily do by eye. Compilers basically fall into two groups - incredibly stupid or incredibly complex (and often still quite stupid). Which means the number of SPIN statements you could compile into PASM and fit in a cog is likely to be small - perhaps less than a page of SPIN code - and you only have 8 cogs available unless you are also going to include a dynamic cog loader (more time and space lost!). Also none of the interpreted SPIN methods would be accessible from the compiled SPIN, and you may not even be able to call other compiled SPIN methods without some inter-cog communication code (more time and space lost!).

    Also there are some SPIN statements that would consume quite a bit of cog space if included (e.g. STRCOMP, the various FILL and MOVE statements, not to mention LOOKUP/LOOKDOWN). Of course you might arrange to only include the code for these if they are used - but that in itself is not trivial. Or you could hive all the support code off to a separate "library" cog - but that not only consumes yet another cog, it also takes more overhead in inter-cog communications (more time and space lost!).

    It hardly seems worth it. Why not just use PropBasic? Or Catalina? (had to get a plug in there somewhere!)

    Ross.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Catalina - a FREE C compiler for the Propeller - see Catalina
  • mpark Posts: 1,305
    edited 2010-02-03 12:11
    Have you guys looked at Bob Anderson's Augmented Assembly Code tool? I haven't used it yet but it looks very cool. If/else, case, loops, indexing, wow!
    ''      loop
    ''        sum += kk
    ''        x[kk] = sum
    ''        kk += 1
    ''        exitif kk > 9
    ''      endloop
    
    


    With a few tweaks, AAC could look just like Spin, but it would be the subset of Spin that translates easily to PASM!
  • RossH Posts: 5,519
    edited 2010-02-03 13:18
    Hi mpark,

    Yes, for a limited subset of SPIN Dave's idea is feasible. And AAC would be a good starting point. You'd still lose cog space in the "glue" required, but not too much if you didn't have to communicate with other cogs or support all the expensive SPIN functions. And of course you only have LONG data types (no strings, and you can simulate bytes and words, but they each still take a long).

    But personally I think Bob's AAC is better positioned as an advanced PASM, not a cut-down SPIN.

    Ross.

  • Dave Hein Posts: 6,347
    edited 2010-02-03 16:47
    Bill, assembled Spin could implement LMM, or even XMM. However, that would be a future project. My primary interest is to increase speed for code that will fit in a cog. Increasing program size can come later.

    Phil, you are correct that assembly code will take more space than interpreted code. I don’t have a good handle on how much more space it will need, but I’m guessing 2X to 4X more. Many applications don’t use all the cogs, so one approach would be to hand off processing from one cog to the next. This would allow for full utilization of all 16 Kbytes of cog memory.
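
To put numbers on the space argument, here is a rough budget sketch. It relies on the Propeller 1 figures mentioned in the thread (8 cogs, 496 usable longs per cog out of 512). Reading "2X to 4X more space" as a byte-for-byte expansion over Spin bytecode is an assumption, and the function names are illustrative.

```python
COGS = 8                  # cogs on a Propeller 1
LONGS_PER_COG = 512       # cog RAM, in 32-bit longs
SPECIAL_REGISTERS = 16    # top 16 longs are special-purpose registers
BYTES_PER_LONG = 4


def total_cog_bytes(cogs: int = COGS) -> int:
    """Raw cog RAM across the given number of cogs."""
    return cogs * LONGS_PER_COG * BYTES_PER_LONG


def usable_longs(cogs: int = COGS) -> int:
    """Longs left for code and data after the special registers."""
    return cogs * (LONGS_PER_COG - SPECIAL_REGISTERS)


def bytecode_equivalent_bytes(expansion: float, cogs: int = 1) -> float:
    """Bytes of Spin bytecode whose compiled form would fill the usable
    longs, if compiled code is `expansion` times larger (an assumption)."""
    return usable_longs(cogs) * BYTES_PER_LONG / expansion


if __name__ == "__main__":
    print("total cog RAM (bytes):", total_cog_bytes())
    print("usable longs per cog: ", usable_longs(1))
    for expansion in (2, 4):
        print(f"{expansion}x expansion: about "
              f"{bytecode_equivalent_bytes(expansion):.0f} bytes of bytecode per cog")
```

The estimate ignores data, temporaries, and any runtime support routines, all of which would reduce the room left for compiled code.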

    Another approach would be overlays, which would work well for code that spins in small loops most of the time. The third approach is LMM, which introduces a 2X or 3X slowdown, but has a lot of flexibility.

    I intend to support hub variables as well as cog variables. This will include supporting the stack. Some basic routines, such as multiplication, will probably be duplicated in each cog. Other routines that are rarely used may exist in a single cog, and they would be remotely called from other cogs. Or it may make sense to load these routines in an overlay area, but that would be a future enhancement.

    BradC, I have used Bean’s Basic compiler on the SX, and it works very well. I haven’t tried PropBasic yet, but I expect that it will be just as great on the Propeller. Many people use Spin, and I’m just trying to improve the performance of Spin programs.

    Ross, I agree that it will be tough to generate highly optimized assembly, but I think I can get within a factor of 4X the execution time of hand-written assembly code. This should provide at least a 20X speedup over interpreted Spin code.

    For function calls I intend to use COG variables for CPRI methods. CPUB methods will use an external HUB stack so they will be compatible with interpreted Spin methods. Calling CPRI methods should be fairly efficient. I also plan on allowing inter-cog function calls. There would be a small amount of code in each cog that monitors a calling queue, which would execute the requested function and then return a result in a response queue. I’ve worked with multiprocessor systems in the past, and this is fairly easy to implement.
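
The calling-queue idea can be modeled on a PC as a sketch. Here a Python thread stands in for a cog and `queue.Queue` stands in for whatever hub-memory mailbox a real implementation would poll. The names `cog_server` and `remote_call`, the request tuple layout, and the `None` stop signal are all hypothetical.

```python
import queue
import threading


def cog_server(functions, calls, replies):
    """Model of the small dispatch loop each cog would run."""
    while True:
        request = calls.get()            # block until a call arrives
        if request is None:              # None is this sketch's stop signal
            break
        call_id, name, args = request
        replies.put((call_id, functions[name](*args)))


def remote_call(calls, replies, call_id, name, *args):
    """Post a request, then wait for the matching response."""
    calls.put((call_id, name, args))
    reply_id, result = replies.get()
    if reply_id != call_id:
        raise RuntimeError("response does not match request")
    return result


if __name__ == "__main__":
    calls, replies = queue.Queue(), queue.Queue()
    table = {"add": lambda a, b: a + b, "square": lambda x: x * x}
    worker = threading.Thread(target=cog_server, args=(table, calls, replies))
    worker.start()
    print(remote_call(calls, replies, 1, "add", 2, 3))
    print(remote_call(calls, replies, 2, "square", 7))
    calls.put(None)                      # shut the model cog down
    worker.join()
```

The caller blocks until the reply arrives, which matches the "hand off and wait" pattern discussed earlier in the thread. On real hardware the cost of polling the queue in hub memory would add to every call.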

    Some of the intrinsic functions won’t take much space, such as STRCOMP, FILL and MOVE. Each of these could be implemented inline in less than 10 longs, or they could be called as CPRI functions with 4 or 5 longs per call. LOOKUP and LOOKDOWN would most likely be callable CPRI functions. And yes, these functions would only be included if they are actually used.
  • Bill Henning Posts: 6,445
    edited 2010-02-03 17:53
    I think a good way to start would be to only compile "leaf" functions - ones that cannot call any other functions, except possibly other COG functions, which would be in-lined if they fit.

    Most of the speedup would be from low level tight "leaf" functions anyway!

    COG leafname | local1, arr[10], ... , localN

    could then directly be launched from COGNEW, and when it exited it would simply do:

    COGID myid
    COGSTOP myid

    By not allowing it to call any other functions, and only supporting locals as defined above, it would be VERY useful for coding array manipulation, list searching, graphics functions, etc.

    Basically, with the roughly 100 µs cog startup, it would allow for easy-to-write, very fast helper functions.
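
A rough way to see when such a helper pays off, as a sketch. If the interpreted version takes `t` microseconds and the compiled version is `s` times faster, launching a cog wins when `t / s + startup < t`, that is, when `t > startup * s / (s - 1)`. The 100 µs startup figure is the one quoted above; the speedup values are assumptions, and the model ignores parameter passing and the time the caller spends waiting for the result.

```python
def break_even_spin_time_us(startup_us: float, speedup: float) -> float:
    """Interpreted run time above which launching a compiled helper wins."""
    if speedup <= 1:
        raise ValueError("speedup must be greater than 1")
    return startup_us * speedup / (speedup - 1)


def helper_time_us(spin_time_us: float, startup_us: float, speedup: float) -> float:
    """Total time for the helper: cog startup plus the compiled run time."""
    return startup_us + spin_time_us / speedup


if __name__ == "__main__":
    for speedup in (2, 20, 40):
        limit = break_even_spin_time_us(100, speedup)
        print(f"speedup {speedup}x: worthwhile above {limit:.1f} us of Spin time")
```

With a large speedup the break-even point sits just above the startup cost itself, so anything that takes noticeably longer than 100 µs in interpreted Spin would be a candidate.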

    If limited to leaf functions, and not allowing calling other functions (other than in-lining other leaf functions if they fit), it would be much easier to write, and incredibly useful.

    Also, by limiting as such, there is no need for stack handling, or a lot of other cruft that would slow the code down - you might approach 1/2 speed of OK quality pasm code.

    Adding a form of the cog data table you suggested would also be easier if it "belonged" to the COG function block... perhaps something like:

    COG myfn | c, i
        repeat i from 0 to 9
           c += tbl1(i) ' sorry code messed up the square brackets, so i replaced them
        wrlong c, par
        table tbl1, v1...v10
    
    



    and to sum a vector in the hub

    ' ptr points to 3 longs, length of vector, pointer to long vector, pointer where long result is to be placed
    COG sumvec|sum,ptr,len,i
        len = par(0)
        ptr = par(1)
        sum = 0
        repeat i from 0 to len-1
            sum += long[ptr]
            ptr += 4
        wrlong sum, par(2) ' sorry code messed up the square brackets, so i replaced them
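
For readers unfamiliar with the `par` convention, here is a PC-side model of what the proposed `sumvec` block would compute. It is a sketch, not generated code: hub memory is modeled as a dictionary keyed by byte address, and the addresses in the example are made up.

```python
def sumvec(hub: dict, par: int) -> None:
    """Sum a vector of longs, following the three-long parameter block.

    hub maps long-aligned byte addresses to 32-bit values.
    par is the address of: length, pointer to vector, pointer to result.
    Assumes length >= 1.
    """
    length = hub[par]              # par(0): number of longs to sum
    ptr = hub[par + 4]             # par(1): address of the first element
    result_addr = hub[par + 8]     # par(2): where the sum is written
    total = 0
    for _ in range(length):
        total = (total + hub[ptr]) & 0xFFFFFFFF   # longs wrap at 32 bits
        ptr += 4                                  # advance one long
    hub[result_addr] = total


if __name__ == "__main__":
    hub = {0x100: 1, 0x104: 2, 0x108: 3, 0x10C: 4,   # the vector
           0x200: 4, 0x204: 0x100, 0x208: 0x300,     # parameter block
           0x300: 0}                                 # result slot
    sumvec(hub, 0x200)
    print(hub[0x300])
```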
    
    



    Post Edited (Bill Henning) : 2/3/2010 6:05:56 PM GMT
  • Cluso99 Posts: 18,069
    edited 2010-02-03 18:09
    I have already written a fast overlay loader (see the Tools link in my signature). Heater uses it in ZiCog.

    The "leaf" idea should not do a cognew, but rather cause the cog to load a "leaf" overlay and execute it. Obviously it should check to see it is not loaded already. This way, you do not have to load a whole 496 long cog, but rather just the code in the routine to be executed. When the cog is not executing it would just go into a wait for job cycle, monitoring hub.

  • Dave Hein Posts: 6,347
    edited 2010-02-05 17:01
    I made an initial attempt at writing a Spin compiler that converts Spin to assembly. I call the program spinasm. It looks for CPUB and CVAR blocks, and generates assembly instructions on a line-by-line basis. The standard PUB, VAR and other blocks are copied to the output file with no changes.

    One problem I encountered is that hub variable names are not accessible to assembly instructions. I had to build a list of offsets for each hub variable, which is combined with the par register at run time. spinasm currently handles some basic operators and it understands "if" statements and some variations of the "repeat" statement. It can call other CPUB routines that take a single parameter. It can also read and write byte, word and long hub variables.
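
One way such an offset list could be built is sketched below. This is an assumption about the approach, not spinasm's actual code. It relies on the Propeller Manual's description of VAR layout, in which longs are stored first, then words, then bytes, regardless of declaration order. The `hub_offsets` name and the tuple format are hypothetical.

```python
SIZES = {"long": 4, "word": 2, "byte": 1}


def hub_offsets(declarations):
    """Map each VAR name to its byte offset from the start of VAR space.

    declarations: list of (type, name, count) tuples in source order.
    At run time the generated assembly would add the offset to the base
    address passed in through par.
    """
    offsets = {}
    offset = 0
    for kind in ("long", "word", "byte"):      # layout order, not source order
        for var_type, name, count in declarations:
            if var_type == kind:
                offsets[name] = offset
                offset += SIZES[kind] * count
    return offsets


if __name__ == "__main__":
    var_block = [("byte", "flag", 1), ("long", "count", 1),
                 ("word", "rate", 1), ("long", "buf", 4)]
    for name, off in hub_offsets(var_block).items():
        print(f"{name}: par + {off}")
```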

    I wrote a test program that transmits serial data on pin 30. The program hello.spin uses a Spin routine to transmit the data. It can run up to a baud rate of 19,200, and fails at baud rates of 38,400 and above.

    hello1.spin is almost identical to hello.spin, except that I changed some blocks to CPUB and CVAR, and I changed the call to cognew to be consistent with assembly instead of Spin. I compiled hello1.spin to hello2.spin. The program works fairly well, except that it prints "rello World" instead of "Hello World". I haven't tracked down why the first character is "r" instead of "H".

    The generated assembly code is relatively inefficient. It performs a lot of unnecessary moves to and from temporary registers, and conditional jumps are really ugly. However, I should be able to improve that and get an improvement in speed and a reduction in program size.

    As far as Spin speed improvement, this should be quite dramatic. I have run hello2.spin at a baud rate of 115,200 without a problem. It should be able to run much higher than that. I'll run a few benchmarks to see what the speed improvement is.

    Dave


    Post Edited (Dave Hein) : 2/5/2010 6:17:57 PM GMT
  • mparkmpark Posts: 1,305
    edited 2010-02-05 18:11
    Awesome! This could actually work!
  • Bill HenningBill Henning Posts: 6,445
    edited 2010-02-05 18:24
    Nice work!
  • Dave HeinDave Hein Posts: 6,347
    edited 2010-02-06 15:17
    I found the problem in hello.spin. I was incrementing the write index before storing the character in the circular buffer. The spinasm routine ran fast enough to read the stale character in that slot, which was an "r", before the "H" was written into the buffer.
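    For anyone hitting the same race, a minimal sketch of the safe ordering, written as a small object fragment. The names and buffer size are hypothetical and this is not the code from hello.spin; it also leaves out the buffer-full check for brevity.

```spin
CON
  BUFSIZE = 16                          ' hypothetical size, must be a power of two

VAR
  long  head                            ' write index, advanced only by the writer
  long  tail                            ' read index, advanced only by the reader
  byte  buffer[BUFSIZE]

PUB put(c)
  buffer[head] := c                     ' 1. store the character
  head := (head + 1) & (BUFSIZE - 1)    ' 2. only then advance the write index

PUB get : c
  repeat while tail == head             ' wait until the writer has published a byte
  c := buffer[tail]
  tail := (tail + 1) & (BUFSIZE - 1)
```

    If the two statements in put are swapped, a fast reader can see the new index and fetch whatever was previously in that slot.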

    I ran the following benchmark routine in both spin and spinasm.

    PUB spinbench | i
      repeat i from 1 to 1000000
        a := i + i

    The spin routine took 18.8 seconds, and the spinasm routine took 0.8 seconds. That is 23.5 times faster. For the spinasm routine, the variable "a" was in hub memory, and "i" was in cog memory. I think with some optimization I should be able to get this up to 40 times faster in the future.

    I added a couple of intrinsic functions -- waitcnt and bytemove. The intrinsic functions are pretty easy to write, and they are only included if they are called by the user code.
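    One possible way to time the Spin version of this loop with the system counter is sketched below. The clock settings (5 MHz crystal, 16x PLL) and the elapsed_ms variable are assumptions and not necessarily how the numbers above were measured.

```spin
CON
  _clkmode = xtal1 + pll16x             ' assumed: 80 MHz system clock
  _xinfreq = 5_000_000

VAR
  long  a
  long  elapsed_ms                      ' hypothetical result variable

PUB main | i, t0
  t0 := cnt                             ' system counter at start
  repeat i from 1 to 1000000
    a := i + i
  ' valid while the run takes fewer than 2^31 clock ticks (about 26.8 s at 80 MHz)
  elapsed_ms := (cnt - t0) / (clkfreq / 1000)
  repeat                                ' keep the cog alive so the result can be reported
```

    Dividing the two measured times by the 1,000,000 iterations gives roughly 18.8 microseconds per iteration for spin and 0.8 microseconds for spinasm.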

    Dave