Suggestion for improving Spin performance
Dave Hein
Posts: 6,347
Anybody who has worked with Spin discovers at some point that it runs much slower than a similar program written in assembly. One reason is that the program and variables are stored in hub memory. Another reason is that the interpreter must execute many instructions to decode and execute each Spin instruction.
Some applications are speed-critical, and they will not run fast enough in Spin. The obvious solution is to convert the speed-critical portions of the code to assembly. This is fine for small pieces of code, but it can become unwieldy for large code segments.
My suggestion is to create new block designators that will cause the Spin code to be compiled into assembly. These new Spin blocks can only be run in cogs without the Spin interpreter, just like the current use of assembly code.
I am suggesting the following new block designators:
CVAR - Define variables in cog memory
CDAT - Define initialized data in cog memory
CPRI - Define a private method block that is compiled into assembly
CPUB - Define a public method block that is compiled into assembly
Using this method we could write code that looks like standard Spin code, but runs almost as fast as hand-written assembly code.
Dave
Comments
>CPRI
I like the way you think!
This almost sounds like something which the new Propeller BASIC might be able to accomplish
with some modification. Since they've already done BASIC to ASSEMBLY, how much work
would it take to do SPIN to ASSEMBLY?
OBC
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
New to the Propeller?
Visit the: The Propeller Pages @ Warranty Void.
Actually it would be a fair bit of work, as Spin requires a totally different parser, and uses indentation for blocks.
I could do a Spin->LMM compiler, but I simply don't have time to do one at the moment. It is on the "TODO" list, but nowhere near the top.
What would be easier, once PropellerBasic is released, would be for one of the existing compilers (sphinx, bst, homespun) to be modified to emit PropellerBasic source code / PASM.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com E-mail: mikronauts _at_ gmail _dot_ com 5.0" VGA LCD in stock!
Morpheus dual Prop SBC w/ 512KB kit $119.95, Mem+2MB memory/IO kit $89.95, both kits $189.95 SerPlug $9.95
Propteus and Proteus for Propeller prototyping 6.250MHz custom Crystals run Propellers at 100MHz
Las - Large model assembler Largos - upcoming nano operating system
Unfortunately there is no spare space in Chip's interpreter, and not enough in Brad's - plus it would need new Spin op codes.
Perhaps there might be a way to squeeze a very minimal LMM interpreter into Brad's - but it is very difficult, if not impossible.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I played around with converting Spin to assembly, and it is fairly straightforward. Based on what I've done in a few hours, I think the whole compiler could be done in a few weeks.
Dave
CVAR and CDAT are too simple for this sort of use since you have to somehow associate the cog storage with a particular block of code that may include one or more methods. CPUB isn't really needed since none of these routines can be called from outside the containing object. They can't even be called except from within a running cog. You could allow CVAR and CDAT to be used within a DAT block and they'd simply translate to RES or LONG / WORD / BYTE statements. CPRI could also occur within a DAT block and would translate into the appropriate instructions. Recursion wouldn't be allowed and local variables would be allocated as RES statements.
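To make that translation concrete, here is a rough Python sketch of how a compiler pass might turn CVAR/CDAT-style declarations into PASM data statements. The function name `emit_cog_data` and the input tuples are made up for illustration and are not part of any existing tool. The one Propeller rule it leans on is that RES lines go at the end of the cog image, after all initialized data.

```python
def emit_cog_data(cdat, cvar):
    """Return PASM source lines for cog-resident data.

    cdat: list of (name, size, value), size being "long", "word" or "byte"
    cvar: list of (name, count) for uninitialized cog variables
    Both input formats are hypothetical stand-ins for parsed CDAT/CVAR blocks.
    """
    lines = []
    # Initialized data first: these become LONG / WORD / BYTE statements.
    for name, size, value in cdat:
        if size not in ("long", "word", "byte"):
            raise ValueError(f"unknown size: {size}")
        lines.append(f"{name:<12} {size:<5} {value}")
    # Uninitialized variables last: RES has to follow all other cog data.
    for name, count in cvar:
        lines.append(f"{name:<12} {'res':<5} {count}")
    return lines


if __name__ == "__main__":
    for line in emit_cog_data([("mask", "long", "$FF")],
                              [("count", 1), ("buf", 10)]):
        print(line)
```

Local variables of a CPRI method could go through the same RES path, which is one reason recursion would not work in this scheme.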
I would love it if you could pull it off in a few weeks - so would everyone on the forums.
I am afraid you may not be familiar with the limitations of COGs - namely that there are only 496 locations in the cogs that can be used to hold instructions or data.
It might be possible to write a Spin subset compiler in a few weeks, one that did not use LMM and was limited to 496 longs for code+data, however that would be of limited use.
Back in 2006 I came up with the "Large Memory Model" (LMM for short) to work around that limitation, however it does cause a performance hit.
It is entirely possible to write a Spin to LMM compiler - everyone would love to see one - however I am certain you would find that it would take more than a few weeks.
PropBasic chose to go for the limited memory COG-only model for now, and the conversion from SX/B to PropBasic was much faster than the approach I am taking with PropellerBasic - which is trying to generate optimized LMM code. Mind you, I also have not had much time to work on my compiler lately due to hardware product development.
As far as being of "limited use", it would be useful in applications that use one or more cogs to run interpreted Spin, and one or more cogs running assembly code. I think that would cover most of the current propeller applications.
I agree that LMM is a great feature, but it may not be needed in all applications. That being said, compiled Spin could be extended to use LMM in the future.
Mike, I was thinking that a standard technique for doing remote method calls could be developed so that CPUB methods could be called from another cog. The Spin floating point object uses a technique for doing remote method calls.
There is no reason that compiled Spin couldn't use hub variables and a stack. This would allow for recursive methods.
CVAR and CDAT may have to be grouped in some way so they are associated with specific CPUB and CPRI methods. We may need another keyword to delineate CVAR, CDAT, CPUB and CPRI blocks that correspond to one cog image.
Dave
Sorry, I mis-interpreted your original post.
Now that I understand what you mean I agree that it would be useful, as it would allow pretty fast small cog programs with the same syntax.
Perhaps something like:
COG label | cog var names
.. spin code to be translated to PASM
would be a good syntax?
Then it would look like a semi-normal Spin function.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
The compilers could be modified (presuming there is enough benefit to get Brad and/or Michael to do it) to do what you say. However, there may be a way to do it without modification - I will need to think it through.
I wrote a spin interpreter, based on Chip's code. It is published (see Tool link in my signature). I was getting a speed gain of 20-25% and there was now space available because I relocated the bytecode decoding to hub ram. Unfortunately, in my quest for even more speed, I modified the code and introduced a bug which I never tracked down. This could be a reason to find and fix the bug.
Off the top of my head: a Spin routine hands off to a cog routine (in a different cog, of course) and waits for a reply (just like we do for FullDuplexSerial etc.). This way, there is no change required to the interpreter, compiler or anything else. It is really no different from what we do now, just that it sounds simpler. Comments???
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade,·RamBlade,·SixBlade, website
· Single Board Computer:·3 Propeller ICs·and a·TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc;· Terminals·VT100 etc; (Index) ZiCog (Z80) , MoCog (6809)·
· Prop OS: SphinxOS·, PropDos , PropCmd··· Search the Propeller forums·(uses advanced Google search)
My cruising website is: ·www.bluemagic.biz·· MultiBlade Props: www.cluso.bluemagic.biz
The technique I'm suggesting should provide a dramatic speedup. Assuming that interpreted Spin is about 80 times slower than assembly, I would expect Spin compiled to assembly to run about 40 times faster than interpreted Spin.
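The 40x figure implies that the compiled code runs at about half the speed of hand-written assembly. A quick Python sketch of that arithmetic follows. The function name is made up for illustration, and the 80x ratio is only the rough assumption stated above, not a measurement.

```python
def expected_speedup(interp_slowdown, compiled_slowdown):
    """Speedup of compiled Spin over interpreted Spin.

    Both arguments are execution-time ratios relative to hand-written
    assembly (e.g. 80 means "takes 80 times as long as assembly").
    """
    if compiled_slowdown <= 0:
        raise ValueError("compiled_slowdown must be positive")
    return interp_slowdown / compiled_slowdown


if __name__ == "__main__":
    # 80x is the assumed cost of interpreted Spin relative to assembly.
    for factor in (1, 2, 4):
        speedup = expected_speedup(80, factor)
        print(f"compiled code {factor}x slower than assembly -> "
              f"{speedup:.0f}x faster than interpreted Spin")
```

The estimate is only as good as the two ratios fed into it, so less efficient generated code lowers the result proportionally.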
BTW Bill, your suggestion for defining a COG block specifier sounds good. However, it would be useful to have some form of COG VAR and DAT blocks. Maybe the term COG could be used to turn on the assembly mode, and HUB would put it back into the interpreter mode. Or maybe the keywords could be ASM and INT.
Dave
I'd suggest code, var, dat in XMM, local vars + stack in hub.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
-Phil
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Life may be "too short", but it's the longest thing we ever do.
Completely agree - I actually did some estimates but then didn't post them because I thought it would sound too much like I was spruiking Catalina. I think you would be lucky to end up with even half the longs available for use by compiled SPIN code in each cog, and you might expect to achieve a speed increase of about four times interpreted SPIN speeds - but only in special cases (such as small, tight loops). For most small functions any speed improvement you get from compiling the code is most likely going to be outweighed by the calling, argument passing and stack manipulation overheads.
Also it is VERY DIFFICULT to write a compiler to do even the very trivial type of code optimizations that most programmers can easily do by eye. Compilers basically fall into two groups - incredibly stupid or incredibly complex (and often still quite stupid). Which means the number of SPIN statements you could compile into PASM and fit in a cog is likely to be small - perhaps less than a page of SPIN code - and you only have 8 cogs available unless you are also going to include a dynamic cog loader (more time and space lost!).
Also none of the interpreted SPIN methods would be accessible from the compiled SPIN, and you may not even be able to call other compiled SPIN methods without some inter-cog communication code (more time and space lost!).
Also there are some SPIN statements that would consume quite a bit of cog space if included (e.g. STRCOMP, the various FILL and MOVE statements, not to mention LOOKUP/LOOKDOWN). Of course you might arrange to only include the code for these if they are used - but that in itself is not trivial. Or you could hive all the support code off to a separate "library" cog - but that not only consumes yet another cog, it also takes more overhead in inter-cog communications (more time and space lost!).
It hardly seems worth it. Why not just use PropBasic? Or Catalina? (had to get a plug in there somewhere!)
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Catalina - a FREE C compiler for the Propeller - see Catalina
With a few tweaks, AAC could look just like Spin, but it would be the subset of Spin that translates easily to PASM!
Yes, for a limited subset of SPIN Dave's idea is feasible. And AAC would be a good starting point. You'd still lose cog space in the "glue" required, but not too much if you didn't have to communicate with other cogs or support all the expensive SPIN functions. And of course you only have LONG data types (no strings, and you can simulate bytes and words, but they each still take a long).
But personally I think Bob's AAC is better positioned as an advanced PASM, not a cut-down SPIN.
Ross.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Phil, you are correct that assembly code will take more space than interpreted code. I don’t have a good handle on how much more space it will need, but I’m guessing 2X to 4X more. Many applications don’t use all the cogs, so one approach would be to hand off processing from one cog to the next. This would allow for full utilization of all 16 Kbytes of cog memory.
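For reference, the 16 Kbyte figure is the total across all eight cogs. The Python sketch below works out that total alongside the somewhat smaller budget left when only the 496 general-purpose longs per cog are counted, and when one cog is kept for the Spin interpreter. Keeping exactly one cog for interpreted Spin is an assumption made purely for the example.

```python
COGS = 8               # cogs on the Propeller
LONGS_PER_COG = 512    # full cog RAM, including the special-purpose registers
USABLE_LONGS = 496     # longs available for code and data in each cog
BYTES_PER_LONG = 4


def cog_bytes(cogs, longs_per_cog):
    """Total cog memory in bytes for the given number of cogs."""
    return cogs * longs_per_cog * BYTES_PER_LONG


if __name__ == "__main__":
    print("raw total:        ", cog_bytes(COGS, LONGS_PER_COG))      # 16384
    print("usable total:     ", cog_bytes(COGS, USABLE_LONGS))       # 15872
    # Assumption for illustration: one cog stays with the Spin interpreter.
    print("usable, 7 cogs:   ", cog_bytes(COGS - 1, USABLE_LONGS))   # 13888
```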
Another approach would be overlays, which would work well for code that spins in small loops most of the time. The third approach is LMM, which introduces a 2X or 3X slowdown, but has a lot of flexibility.
I intend to support hub variables as well as cog variables. This will include supporting the stack. Some basic routines, such as multiplication, will probably be duplicated in each cog. Other routines that are rarely used may exist in a single cog, and they would be remotely called from other cogs. Or it may make sense to load these routines in an overlay area, but that would be a future enhancement.
BradC, I have used Bean’s Basic compiler on the SX, and it works very well.· I haven’t tried PropBasic yet, but I expect that it will be just as great on the Propeller.· Many people use Spin, and I’m just trying to improve the performance of Spin programs.
Ross, I agree that it will be tough to generate highly optimized assembly, but I think I can get within a factor of 4X of the execution time of hand-written assembly code. This should provide at least a 20X speedup over interpreted Spin code.
For function calls I intend to use COG variables for CPRI methods. CPUB methods will use an external HUB stack so they will be compatible with interpreted Spin methods. Calling CPRI methods should be fairly efficient. I also plan on allowing inter-cog function calls. There would be a small amount of code in each cog that monitors a calling queue, which would execute the requested function and then return a result in a response queue. I’ve worked with multiprocessor systems in the past, and this is fairly easy to implement.
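The calling-queue and response-queue idea can be modelled in a few lines of Python. This is only a sketch of the control flow: a thread stands in for a cog, `queue.Queue` objects stand in for mailboxes in hub memory, and the names `cog_service_loop` and `remote_call` are invented for the example.

```python
import queue
import threading


def cog_service_loop(functions, calls, responses):
    """Model of the small dispatcher that would run in each cog.

    Takes (name, args) requests from the calling queue, runs the requested
    function and puts its result on the response queue. A request of None
    stops the loop (a stand-in for stopping the cog).
    """
    while True:
        request = calls.get()
        if request is None:
            break
        name, args = request
        responses.put(functions[name](*args))


def remote_call(calls, responses, name, *args):
    """Post a request and block until the serving 'cog' replies."""
    calls.put((name, args))
    return responses.get(timeout=5)


if __name__ == "__main__":
    calls, responses = queue.Queue(), queue.Queue()
    table = {"add": lambda a, b: a + b}
    worker = threading.Thread(target=cog_service_loop,
                              args=(table, calls, responses))
    worker.start()
    print(remote_call(calls, responses, "add", 2, 3))  # prints 5
    calls.put(None)   # shut the service loop down
    worker.join()
```

On real hardware the caller would poll a hub location instead of blocking on a queue, and a single response slot only works if one caller is outstanding at a time.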
Some of the intrinsic functions won’t take much space, such as STRCOMP, FILL and MOVE. Each of these could be implemented inline in less than 10 longs, or they could be called as CPRI functions with 4 or 5 longs per call. LOOKUP and LOOKDOWN would most likely be callable CPRI functions. And yes, these functions would only be included if they are actually used.
Most of the speedup would be from low level tight "leaf" functions anyway!
COG leafname | local1, arr[10], ... , localN
could then directly be launched from COGNEW, and when it exited it would simply do:
COGID myid
COGSTOP myid
By not allowing it to call any other functions, only supporting locals as defined above, it would be VERY useful for coding array manipulation, list searching, graphics functions etc
Basically, with the roughly 100uS cog start up, it would allow for easy to write very fast helper functions.
If limited to leaf functions, and not allowing calling other functions (other than in-lining other leaf functions if they fit), it would be much easier to write, and incredibly useful.
Also, by limiting as such, there is no need for stack handling, or a lot of other cruft that would slow the code down - you might approach 1/2 speed of OK quality pasm code.
Adding a form of the cog data table you suggested would also be easier if it "belonged" to the COG function block... perhaps something like:
and to sum a vector in the hub
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Post Edited (Bill Henning) : 2/3/2010 6:05:56 PM GMT
The "leaf" idea should not do a cognew, but rather cause the cog to load a "leaf" overlay and execute it. Obviously it should check to see it is not loaded already. This way, you do not have to load a whole 496 long cog, but rather just the code in the routine to be executed. When the cog is not executing it would just go into a wait for job cycle, monitoring hub.
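A toy Python model of that load-on-demand behaviour is sketched below. The class `OverlayCog` and its fields are hypothetical, and Python callables stand in for the PASM overlay images. The point is only the check that skips the copy when the requested leaf is already resident.

```python
class OverlayCog:
    """Toy model of a cog that loads leaf overlays on demand."""

    def __init__(self, overlays):
        self.overlays = overlays   # overlay id -> callable (stands in for PASM code)
        self.loaded_id = None      # overlay currently resident in the cog
        self.load_count = 0        # number of times an overlay was copied in

    def run(self, overlay_id, *args):
        if overlay_id not in self.overlays:
            raise KeyError(f"unknown overlay: {overlay_id}")
        # Copy the overlay in only if a different one is resident.
        if overlay_id != self.loaded_id:
            self.loaded_id = overlay_id
            self.load_count += 1
        return self.overlays[overlay_id](*args)


if __name__ == "__main__":
    cog = OverlayCog({"sum": lambda xs: sum(xs), "max": lambda xs: max(xs)})
    print(cog.run("sum", [1, 2, 3]))   # 6, first load
    print(cog.run("sum", [4, 5]))      # 9, no reload needed
    print(cog.run("max", [4, 5]))      # 5, second load
    print(cog.load_count)              # 2
```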
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
One problem I encountered is that hub variable names are not accessible to assembly instructions. I had to build a list of offsets for each hub variable, which is combined with the par register at run time. spinasm currently handles some basic operators and it understands "if" statements and some variations of the "repeat" statement. It can call other CPUB routines that take a single parameter. It can also read and write byte, word and long hub variables.
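The offset-table approach might look roughly like the Python sketch below. It is not spinasm's actual code. The function names are invented, and it assumes that par holds the base address of the object's variable block and that variables are laid out longs first, then words, then bytes.

```python
SIZES = {"long": 4, "word": 2, "byte": 1}


def build_offsets(variables):
    """Map each hub variable name to its byte offset from the base address.

    variables: list of (name, type, count) in declaration order.
    Layout assumption: all longs, then all words, then all bytes, which
    keeps every item naturally aligned.
    """
    for name, vtype, _count in variables:
        if vtype not in SIZES:
            raise ValueError(f"{name}: unknown type {vtype}")
    offsets = {}
    offset = 0
    for wanted in ("long", "word", "byte"):
        for name, vtype, count in variables:
            if vtype == wanted:
                offsets[name] = offset
                offset += SIZES[vtype] * count
    return offsets


def hub_address(par, offsets, name):
    """Run-time address: base passed in par plus the compile-time offset."""
    return par + offsets[name]


if __name__ == "__main__":
    table = build_offsets([("flag", "byte", 1), ("a", "long", 1),
                           ("buf", "byte", 16), ("rate", "word", 1)])
    print(table)
    print(hex(hub_address(0x1000, table, "buf")))
```

With an offset known at compile time, each access costs one add to form the address plus the rdbyte/rdword/rdlong or matching write instruction.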
I wrote a test program that transmits serial data on pin 30. The program hello.spin uses a Spin routine to transmit the data. It can run up to a baud rate of 19,200, and fails at baud rates of 38,400 and above.
hello1.spin is almost identical to hello.spin, except that I changed some blocks to CPUB and CVAR, and I changed the call to cognew to be consistent with assembly instead of Spin. I compiled hello1.spin to hello2.spin. The program works fairly well, except that it prints "rello World" instead of "Hello World". I haven't tracked down why the first character is "r" instead of "H".
The generated assembly code is relatively inefficient. It performs a lot of unnecessary moves to and from temporary registers, and conditional jumps are really ugly. However, I should be able to improve that and get an improvement in speed and a reduction in program size.
As far as Spin speed improvement, this should be quite dramatic. I have run hello2.spin at a baud rate of 115,200 without a problem. It should be able to run much higher than that. I'll run a few benchmarks to see what the speed improvement is.
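To put the baud rates in perspective, the Python sketch below converts each one into the number of system clock ticks available per bit. The clock frequency is not given in this post, so 80 MHz is assumed here as a typical Propeller setup. The function name is made up for the example.

```python
def ticks_per_bit(clkfreq, baud):
    """System clock ticks available to send one serial bit."""
    if baud <= 0:
        raise ValueError("baud must be positive")
    return clkfreq // baud


if __name__ == "__main__":
    CLKFREQ = 80_000_000  # assumption: 80 MHz system clock
    for baud in (19_200, 38_400, 115_200):
        print(f"{baud:>7} baud: {ticks_per_bit(CLKFREQ, baud)} ticks per bit")
```

Under that assumption the per-bit budget shrinks from 4166 ticks at 19,200 baud to 694 ticks at 115,200 baud, so the bit loop has to run about six times faster at the higher rate.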
Dave
Post Edited (Dave Hein) : 2/5/2010 6:17:57 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
I ran the following benchmark routine in both spin and spinasm.
PUB spinbench | i
  repeat i from 1 to 1000000
    a := i + i
The spin routine took 18.8 seconds, and the spinasm routine took 0.8 seconds. That makes spinasm 23.5 times faster. For the spinasm routine, the variable "a" was in hub memory, and "i" was in cog memory. I think with some optimization I should be able to get this up to 40 times faster in the future.
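The same measurements can be broken down per loop iteration, which is easier to compare against instruction timings. The short Python sketch below does that and also shows the run time a 40x speedup would require. The helper name is invented, and the only inputs are the two timings and the loop count quoted above.

```python
def per_iteration_us(total_seconds, iterations):
    """Average time per loop iteration in microseconds."""
    return total_seconds / iterations * 1e6


if __name__ == "__main__":
    ITERATIONS = 1_000_000
    SPIN_SECONDS = 18.8      # measured, interpreted Spin
    SPINASM_SECONDS = 0.8    # measured, spinasm output

    print(f"spin:    {per_iteration_us(SPIN_SECONDS, ITERATIONS):.1f} us per iteration")
    print(f"spinasm: {per_iteration_us(SPINASM_SECONDS, ITERATIONS):.1f} us per iteration")
    print(f"speedup: {SPIN_SECONDS / SPINASM_SECONDS:.1f}x")
    # Run time needed to reach the 40x goal on the same benchmark.
    print(f"40x target: {SPIN_SECONDS / 40:.2f} s")
```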
I added a couple of intrinsic functions -- waitcnt and bytemove. The intrinsic functions are pretty easy to write, and they are only included if they are called by the user code.
Dave