performance SPIN/ASM
Guyvo
Hi props,
I wrote my first prop program to experiment a bit with the things I learned from the manual and deSilva's tutorial. I have attached my small program to this post.
Basically, it is a small check comparing performance between SPIN and ASM.
The SPIN loop is:
repeat index from 1 to ADDMAX
The asm loop is:
:loop   DJNZ    fill,#:loop
        WRLONG  tmp,par
Results below:
SPIN:  10 seconds for 1M ADD/DEC
ASM:   10 seconds for 200M ADD/DEC
The difference between SPIN and ASM is a factor of 200, which sounds like a lot to me. I am probably missing something, but what?
PS
The printing to the VGA screen is kept outside the timed loop.
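To make the quoted figures easier to check, here is a minimal Python sketch of the arithmetic. The loop counts and the 10 seconds come from the post; the 80 MHz system clock is an assumption (a common Propeller setup), and the function names are only illustrative.

```python
# Figures quoted in the post; the clock frequency is an assumption.
SECONDS = 10
SPIN_LOOPS = 1_000_000
ASM_LOOPS = 200_000_000
ASSUMED_CLOCK_HZ = 80_000_000  # assumed: 5 MHz crystal with the 16x PLL


def loops_per_second(loops, seconds):
    """Average loop iterations completed per second."""
    return loops / seconds


def clocks_per_loop(loops, seconds, clock_hz):
    """Average system clocks spent on one loop iteration."""
    return clock_hz * seconds / loops


speed_ratio = loops_per_second(ASM_LOOPS, SECONDS) / loops_per_second(SPIN_LOOPS, SECONDS)
asm_clocks = clocks_per_loop(ASM_LOOPS, SECONDS, ASSUMED_CLOCK_HZ)
spin_clocks = clocks_per_loop(SPIN_LOOPS, SECONDS, ASSUMED_CLOCK_HZ)

print(f"ASM/SPIN speed ratio: {speed_ratio:.0f}")
print(f"ASM clocks per loop:  {asm_clocks:.0f}")
print(f"SPIN clocks per loop: {spin_clocks:.0f}")
```

Under that clock assumption the assembly loop comes out at 4 clocks per pass, which is what a single taken DJNZ costs, so the ASM side is running at the hardware limit for this loop. The SPIN side works out to roughly 800 clocks per pass.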
Comments
If I remember correctly, Chip mentioned a performance difference of 100 to 200, depending on what is done.
Spin is after all interpreted.
Not only do all interpreters have these kinds of performance issues, but Spin is further slowed down by needing to load tokens and variables from HUB RAM almost constantly, so the fact that it gives this performance is actually very good and a testament to Chip's genius and code-tweaking abilities.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Don't visit my new website...
Just a bit surprised that it is a factor of 100 to 200. I'd expect 50 at the most.
Of course, if you use counters/timers for this kind of thing, it will probably be more accurate. But my intention was to see how the SPIN performance compares with the assembler code.
Cheers
Guy
FPForth has a speed-down of 30, which is excellent in my opinion.
My general experience is that hand-translating SPIN to machine code gives a speed-up of 80.
However, there are many additional shortcuts in machine code as well as possible optimizations in SPIN.
E.g., SPIN performs * and / on a signed 32-bit basis, whereas in many cases unsigned 16-bit arithmetic, which is nearly 3 times faster, will suffice.
http://forums.parallax.com/showthread.php?p=659790
At the end of the posting are some comparative figures, also with a PC.
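As a rough illustration of why narrower arithmetic is cheaper on a chip without a hardware multiplier, here is a Python sketch of a plain shift-and-add multiply that counts its loop passes. This is not the Spin interpreter's actual routine; the function name and the test operands are made up for the example, and it ignores the sign handling a signed multiply would add on top.

```python
def shift_add_multiply(a, b, bits):
    """Multiply two unsigned integers of the given width by shift-and-add.

    Returns (product, steps), where steps is the number of loop passes.
    """
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    product = 0
    steps = 0
    for _ in range(bits):
        if b & 1:          # add the shifted multiplicand when this bit is set
            product += a
        a <<= 1            # next bit position of the multiplicand
        b >>= 1            # consume one multiplier bit
        steps += 1
    return product, steps


p16, steps16 = shift_add_multiply(1234, 567, 16)
p32, steps32 = shift_add_multiply(1234, 567, 32)
print(p16, steps16)
print(p32, steps32)
```

Both calls give the same product when the operands fit in 16 bits, but the 32-bit version makes twice as many passes. Any sign fix-up before and after a signed multiply would widen the gap further, which is at least consistent with the "nearly 3 times" figure, though this sketch does not measure it.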
It does work as a nice form of compression just by the nature of it.
Tokens take up much less space than 32-bit ASM code. :)
But yes, Spin is really slow in contrast.
deSilva said... You would normally expect a speed-down of 10 to 20 for a reasonably designed VM.
A lot depends upon what instruction set the VM is interpreting, and that is then offset by the code density of the bytecode. If all Spin bytecodes were 32-bits that would undoubtedly cut instruction decode times, but reduce the maximum size of Spin programs. Take away the stack and make every variable global and that would improve execution speed also. Limit depth of method calling and that would help too. Say goodbye to all those things which make Spin so easy to use and you can have faster execution. It's all balance.
There's an implication here that either Spin bytecodes aren't very well designed, the interpreter is inefficient, or the Cogs are not suited to the task, and I wouldn't particularly agree with any of that. To suggest a "reasonably designed VM" would do better is to suggest the Spin VM isn't, although I'm not sure where any comparison derives from, or where any expectations of the amount of execution deterioration come from.
I am sure there are things which make the VM slower than it potentially could be - we know of the potential bottleneck of hub access - and the VM has to do a lot of interacting with the Hub, but does that really make the VM inefficient, badly designed, or overly slow ? Not really, that's just the reality of having to implement Spin on that architecture. And let's not forget the entire Spin Interpreter fits inside 496 longs.
I haven't done any speed comparisons between Spin and Assembler or against any other reference, but I think I'd want to see exact code so we can see exactly what is being compared. A ball-park figure is fine but that is only a rough guide, and in reality it's often overall application performance ( and that's not necessarily the same as execution speed ) which is far more important than how efficient the VM is. If that wasn't the case the BS1 would never have been the success it has been.
It's interesting to look at the difference between "repeat var from 1 to 1000" and "repeat 1000", and this shows exactly why we need to be comparing like with like when trying to determine execution speed deterioration ...
Looking at "repeat while --var > 0" is perhaps even more interesting, as it shows considerable inefficiency. That's down to lack of compiler optimisation though ...
Post Edited (hippy) : 9/29/2007 5:13:10 PM GMT
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
The more I know, the more I know I don't know.· Is this what they call Wisdom?
As hippy indicated, your ratios can vary wildly depending on the examples you pick to compare. The DJNZ instruction is a particularly poor choice for a comparison because it's optimized for an N to 1 loop. Admittedly, the REPEAT N is also optimized for this. That's why developing fair and useful benchmarks is difficult.
The reference values come from VMs all over the world: Java, P-Code,....
I did not at all intimate that Propeller, SPIN, and SPIN bytecode can be better implemented (although they most likely can, but not much).
During the time Frohf was working out his concept for FPForth, I tried out many possible concepts, and they all came out with a very high overhead. The main reason was the combination of HUB access and missing stack support. When the stack is located in the HUB, the situation is even worse. Limited space for interpreter code adds to the overhead as well (calls rather than unrolled instructions).
So the performance of SPIN interpretation can be explained (of course it can!) - and I have no suggestion how to improve it, short of changing the hardware and the language
This is why I gave the above reference. It is a program where SPIN does extremely well (speed-down of only 40!!)
Post Edited (deSilva) : 9/29/2007 7:12:54 PM GMT
- "Synthetic Benchmarks", that try to capture a typical user profile (e.g., database accesses, number crunching,...)
- "Natural Benchmarks", where you just take the algorithm you have to run, and implement it
-- in different languages (SPIN, Forth, Assembler)
-- or even on different machines .
Synthetic benchmarks are highly political; natural benchmarks are not.
I hear what you say, and can understand the sentiment. I'd be quite interested to hear from Chip & Co as to how the Propeller evolved; as a set of fast Cogs with Spin then added, or first and foremost as a multi-tasking Spin execution environment. I suspect it was Cogs which were the main focus but influenced from the Spin perspective, but I could be entirely wrong.
There will always be two extremes, the perfect VM instruction set coupled with the perfect architecture to run the VM on, and, at the other end, terribly mismatched instruction sets and hardware. A VM is usually there to deliver something the underlying hardware cannot deliver as easily so there will always be some mismatch or no need to use a VM at all.
I'm intrigued as to how people view the Propeller - As a souped-up set of eight Basic Stamps running a new language, or as eight super-quick microcontrollers with a free interpreter thrown in ? My perception flip-flops between the two depending on what application I am thinking of. That it's both is what I find so impressive about Chip's creation.
With the Propeller there seem to be two diverse goals, the efficient Cogs and an easy-to-use high-level language, and that's where any VM deficiency originates. To move towards a better Cog architecture for Spin or to move Spin towards the hardware would damage one or the other, and what we have seems to be pretty close to an optimal balance.
The BasicStamp has attracted a lot of engineers without much knowledge of computer science; in my opinion, that was down to:
- the packaging ("add a battery and a serial plug - that's all")
- the lure of BASIC ("What can be easier?")
As the Propeller has passed up both of those virtues, I am still in doubt....
I had a short discussion the other day with someone asking whether he should try assembly language with the SX (for speed) or BASIC with the Stamp (for safety and ease of use). The prospect of having the best of both with the Propeller led to this reaction: "But this is much too complicated!"
In 1987 I built a control and management system, and my interest in the Propeller is related to building a new version of that system.
I am testing a few processors.
As for the Propeller, it will probably lose out to the others, and only because of the very small memory per COG.
For this processor to have any chance in control systems, each COG must have at least 8-16 KB of memory, and the HUB need not have more than twice the memory of one COG.
That would make it possible to program whatever interpreter one wanted.
To this I would add some features that I have already raised here on the forum:
1. Fast serial communication in both directions between 2 Propellers.
2. Better D/A and A/D conversion capabilities.
Here I must say that I would not expect an A/D converter in it, only a simple D/A, since with that it is possible to build an A/D with just one comparator.
3. Cooperation between COGs without having to go through the HUB.
Otherwise it is only a TOY processor.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Nothing is impossible, there are only different degrees of difficulty.
Sapieha
Post Edited (Sapieha) : 9/29/2007 10:49:25 PM GMT
Lack of on-chip AD is something I do miss from other processors.
I don't see high-speed serial between Propellers being too much of an obstacle, but it perhaps depends how fast you need to go.
It still makes me smile to think that I am doing all that I want to, and things I never expected to, on a $12 USD chip !
If you look at its power, it is far beyond a TOY.
If you look at the amount of memory, regrettably it is one.
Hippy
You write "I don't see high-speed serial between Propellers being too much of an obstacle, but it perhaps depends how fast you need to go." That would open up the full possibility of several Propellers cooperating on synchronized tasks in the system.
My system has run for 20 years and I must upgrade it. My customers will want a similarly stable system.
Sapieha
I need the propeller for educational purposes only - it is fine for it. The only problem is, that my students have also to know how the "real world" works, and so we have to extend to AVRs and ARMs....
As this forum is dedicated (is this a pun?) to the propeller, I am hesitant to elaborate further on what I - similar to Sapieha - feel as strong shortcomings of the Prop for any "serious" application.
But no, why should I be hesitant? One of my favourite authors is Wilhelm Busch. He once wrote:
"Ist der Ruf erst ruiniert,
lebt sich's g
I think SPIN is rather slow compared to machine code, but for most applications only 20% or so of your code actually needs to run that fast. For the rest of it, it is nice to have a high level language to work with in a rapid development environment like the Prop Tool.
I have to agree that 32K for program and video RAM is a bit tight, and I wouldn't be a bit surprised if people start running into the 256K ceiling pretty quick with the next version. Give them memory and they'll use it!
I need to go back and re-read those large model threads again..
@ Ken : I agree that Video RAM can be a major consumer of resources, and user-defined characters as well, even in text mode. I suppose it's a notable problem on the Propeller because few other micros support video quite so well or as easily.
@ BradC : My only concern with using the Large Memory Model entirely is that it limits code in hub to just 8K Assembler instructions, and after that another layer has to be added to bring in code from external EEPROM. I've done almost zero Assembler with the Propeller so cannot judge how its code efficiency compares against other micros, but I can see 8K being quite a limitation. The Mark-II with 128KB (?) will allow 32K instructions, which should help there.
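The 8K and 32K figures follow from every Propeller assembly instruction being one 32-bit long. A small Python sketch of that arithmetic, assuming the whole hub RAM were given over to code (in practice data and buffers take their share), with the 128 KB size being the post's own guess:

```python
BYTES_PER_INSTRUCTION = 4  # one Propeller assembly instruction is a 32-bit long


def max_instructions(hub_bytes):
    """Upper bound on instructions that fit if all of hub RAM held code."""
    return hub_bytes // BYTES_PER_INSTRUCTION


current_limit = max_instructions(32 * 1024)    # 32 KB hub RAM today
guessed_limit = max_instructions(128 * 1024)   # 128 KB is a guess from the post

print(current_limit, guessed_limit)
```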
For comparison, I mashed out a quick assembly program to do the same thing, on screen as well. Seeing the two run is a good rough visual indicator of the speed difference. Of course, this is just loops, decisions and addition, but it does feature 2-3 hub ops per loop in the ASM part. The ASM code could be faster, with fewer HUB ops, however the SPIN program might easily be written faster too. Was just a "what does it really look like?" kind of thing.
By way of reality check, I fired up the Atari, running at 1.7 MHz, to write a quick and dirty assembly language comparison on the 6502. It does this task in a little under 3 minutes in assembly language. If some of the DMA is turned off, that drops to just over 2.5. The BASIC version is not even worth discussing... The assembly code, in this case, is actually an apples to apples comparison, as the character mode driver used duplicates the memory operations required for both. One byte per on-screen character, incremented to display digits in sequence. The 6502 code is smaller by a factor of three, however.
The SPIN program gets this done in about 5.5 minutes. Propeller assembly takes only about 6 seconds! SPIN + counters really is like programming at assembly language speeds on older CPUs. Against the 6502 assembly it's really only a factor of two, minus the impact the counters would have on many tasks. In this task, SPIN runs at roughly 2 percent of assembly speed on the Propeller.
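For anyone who wants the ratios spelled out, here is a quick Python sketch using the timings as quoted. The 6502 time is rounded up to a flat 3 minutes, which is an approximation of "a little under 3 minutes"; the names are just for the example.

```python
# Timings as quoted in the post, in seconds.
SPIN_SECONDS = 5.5 * 60
PASM_SECONDS = 6
M6502_SECONDS = 3 * 60  # approximation: the post says "a little under 3 minutes"


def slowdown(slow_seconds, fast_seconds):
    """How many times longer the slower run takes."""
    return slow_seconds / fast_seconds


spin_vs_pasm = slowdown(SPIN_SECONDS, PASM_SECONDS)
spin_vs_6502 = slowdown(SPIN_SECONDS, M6502_SECONDS)
m6502_vs_pasm = slowdown(M6502_SECONDS, PASM_SECONDS)

print(f"SPIN vs Propeller assembly: {spin_vs_pasm:.0f}x slower")
print(f"SPIN vs 6502 assembly:      {spin_vs_6502:.1f}x slower")
print(f"6502 vs Propeller assembly: {m6502_vs_pasm:.0f}x slower")
```

These are single hand-timed runs of one small program, so they are best read as a rough visual indicator rather than a benchmark.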
A big model program doing the same thing would be somewhere between the two. Anyone have even a prototype kernel written? I don't think I've seen anybody post a large memory model program. Is this true?
Re: word size. / toy
IMHO, this Propeller is a bit unbalanced in terms of its RAM. Lots of things can get done, but one has to work for it. Some potential is off the table for lack of RAM. V2 will not have this problem. IMHO, the core design is solid and quite capable. I would not characterize this chip as a toy. The only thing it suffers from is high expectations, due to its robust on-chip feature set. Meeting those is often more a matter of thinking than would otherwise be the case for other CPUs. Many other designs require more baggage in terms of support too. I'm not sure there is a solution where so little extra hardware is required for so many tasks.
Word size is a love hate affair with me. The conditional instruction set really allows one to pack a lot of functionality into few instructions. This I really like. The more I apply it, the smaller programs become.
Indexing is a PITA. At this point in my learning, I'm not sure if it's me wanting to program a prop like other CPU's or not. More indirect addressing would be nice to have.
Multiple shift / rotate in one instruction is excellent, but I do find myself choosing to waste bits rather than engage in more aggressive bit manipulation. The best overall balance seems to be to keep data in the HUB, where byte, word and long access is an option. Given that, the faster HUB access in the V2 will mitigate most of this, as well as power faster large memory model programs. Likely to end up a non issue in the end.
It's only a major factor right now because of the smaller RAM size in the current prop. It's often tempting to leverage COG memory because it's there, and it's needed. 32K in the HUB, but there is also 16K in the COGs! Perhaps large model programs could bank this in and out for better use of all the RAM, depending on how many COGs have to be executing something. Either way, I've been exploring ways to leverage unused COG memory to my advantage.
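The 16K figure can be checked from the cog layout: 8 cogs, each with 512 longs of RAM, the top 16 of which are special-purpose registers. A short Python sketch of that sum (the constant names are only illustrative):

```python
COGS = 8
LONGS_PER_COG = 512        # each cog has 512 longs of its own RAM
BYTES_PER_LONG = 4
SPECIAL_REGISTERS = 16     # the top 16 longs are special-purpose registers

total_cog_bytes = COGS * LONGS_PER_COG * BYTES_PER_LONG
usable_longs_per_cog = LONGS_PER_COG - SPECIAL_REGISTERS

print(total_cog_bytes)        # total cog RAM across the chip, in bytes
print(usable_longs_per_cog)   # general-purpose longs available in one cog
```

The per-cog result is the same 496 longs mentioned earlier in the thread as the space the whole Spin interpreter has to fit into.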
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
Post Edited (potatohead) : 9/30/2007 4:33:52 PM GMT
I use the Prop within its limitations as I use my bike.
But I am NOT one of the "dedicated"
"If the only tool in your box is a hammer, every job starts looking like a nail." <- can't remember who said this, but it fits.
I, however, find the Propeller an interesting little micro which a hobbyist can use very easily for a variety of tasks. Is it appropriate for every application? Certainly not. No single micro is appropriate for every application. That is why there are hundreds of processors and microcontrollers out there to choose from. I do believe, however, that there are some serious applications for which the propeller would be very well suited. Anyone who has tried to make a processor do a range of different tasks all at once should appreciate the multi-processing architecture of the Propeller. I personally appreciate the very well written development tool and the availability of a repository of community supported objects. My biggest gripe with the current propeller is its lack of program space and the inability to expand it.
To call it a joke would be to say that it's been over-sold, or that Parallax says it will do things that it really can't. I don't think that is the case. They're selling it for what it is, and if our fellow forum contributors are blowing its capability out of proportion, then this is not the fault of the Propeller, or of Parallax. It is what it is.
Then there's the same, but instead of indexing displayb, using byte[constant]
A toy brings joy to less or more aged children, and can be most useful in many circumstances. The sequence is
"toy" -> "gimmick" -> "tool"
That was the path of the PC and of the telephone.
My main concern is the same as yours: program memory. There are tons of C code I would like to run. A mega128 can! Most C programs will not instantaneously profit from a 32-bit architecture, but only "grow" in footprint. I am very eager to see the C compiler. I am even prepared to pay $199 for it... I think that is the limit.
Post Edited (deSilva) : 9/30/2007 5:56:50 PM GMT
With regards to your points...
(A) SPIN is MUCH faster than Basic stamps
(B) Due to its capabilities, we keep forgetting... but the propeller is a MICROCONTROLLER
(C) Depends on what you are doing. For a controller, it's OK. For a general-purpose computer, you are right.
Bill
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
www.mikronauts.com - a new blog about microcontrollers
I am back now, and will dust off my code
Bill
FJUMP somelabel
Umm... how do you encode this in the spin assembler?
JMP kernel#fjump
long address
It is NOT fun stuffing address into that long, nor is it even simple to figure out what address is!
This is why I started writing a large model macro assembler last year... before I got diverted by an additional client who needed a box designed post-haste.
I was actually working on the demos yesterday, and will continue working today; after which I will continue with my assembler.