Spin Interpreter - Faster???

Since Chip has so kindly published the Spin Interpreter I was having a look at the code - a fantastic achievement to squash this code into a cog.
Now I have a FAST assembler overlay loader I am itching to try it out on something worthwhile and I thought I might have a go at trying to increase the speed of the Spin Interpreter.
However, I don't know how to go about invoking my (yet-to-start) version of the Interpreter in place of the existing one.
Chip: Do I have your Okay??? I just want to replace the interpreter (on the fly, so to speak) and not touch any Spin code/compiler.

Comments
The first step would be to get a copy of the ROM Spin Interpreter running from RAM for a single sand-boxed program, with its own stack, variable space and a specific Spin method to execute. That way you can use the ROM Interpreter and Spin for TV or serial to monitor the actions of the RAM Interpreter running in another Cog executing its code.
You should be able to get what's needed from here on mostly working while running the RAM Interpreter in this sand-boxed environment.
The big hurdle IMO is getting CogInit and CogNew to work. I cannot recall what the full Spin bytecode sequence is for launching a separate Cog Spin method ( and I never really understood what was going on there nor got it implemented ) but I'm sure it calls Spin bytecode in ROM up at $Fxxx or something like that. Somehow you're going to have to subvert that process or it will launch the ROM interpreter not the RAM Interpreter.
After that you need to make the RAM Interpreter completely replace the ROM Interpreter. In principle that can be something like ...
PUB Main | running
  if not running
    running := true
    CogInit( 0, @RamInterpreter, $0004 )
  <rest of code>
The RAM Interpreter will start up, initialise to exactly as the ROM Interpreter starts up ( object base, var base, start PC etc ), begin executing PUB Main, avoid getting stuck in a forever re-initialising loop because 'running' is true ( local stack vars aren't zeroed except on Reset ) and then execute the rest of the code as if it were the ROM Interpreter from then on. There might be some tweaking and zeroing needed because the stack area will have changed in the process of getting to the RAM Interpreter but that might not matter.
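The guard logic above can be modelled in a few lines; this is a sketch (Python, hypothetical names, not actual Spin) of why the re-launch doesn't loop forever: the 'running' local lives in the stack area, which is not re-zeroed when the RAM Interpreter restarts Main.

```python
# Sketch of the re-entry guard: the RAM interpreter restarts Main from the
# top, but because the stack area is not re-zeroed, the 'running' local
# survives and the relaunch is skipped on the second pass.

def main(stack):
    # 'running' is a local that lives in the (persistent) stack area
    if not stack.get("running"):
        stack["running"] = True
        return "coginit"    # first pass: relaunch Cog 0 with the RAM interpreter
    return "rest-of-code"   # second pass: carry on under the RAM interpreter

stack = {}                        # zeroed only on Reset
assert main(stack) == "coginit"       # ROM interpreter's pass
assert main(stack) == "rest-of-code"  # RAM interpreter's pass
```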
CogNews and CogInits should start up new Cogs executing the RAM interpreter just as would have happened for the ROM Interpreter.
The final thing to do is to move the RAM interpreter into a sub-object so a normal Spin Program can use the RAM Interpreter simply by specifying it as an object and calling that to switch from ROM to RAM interpretation ( replace the CogInit with a call to RamInterpreter.Start ). Debugging the sand-boxed environment is probably easier if this is not done until late in the day.
The two crucial issues IMO are being able to start up the initial RAM interpreter using the Program Header at $0004 and then achieving the same for CogNew and CogInit, plus adding your LMM and overlay handling and the changes that will bring.
The one thing you don't have to worry about is the effect including all this extra code will have on the Spin program you're ultimately going to be running ( providing you don't overwrite the RAM interpreter itself ) as that will just be a black hole in memory as far as the rest of the program is concerned; RAM for stack and variable use will be allocated beyond that hole.
Two things which may be of some help : My own Spin VM effort ...
http://forums.parallax.com/showthread.php?p=696792
And my Spin Bytecode Disassembler ...
http://forums.parallax.com/showthread.php?p=665019
Plus there's also GEAR and other tools out there as well.
Post Edited (hippy) : 6/15/2008 3:43:57 AM GMT
propeller.wikispaces.com/Cracking+Open+the+Propeller+Chip
OBC
Edit:
Bah! Hippy beat me to it! :) What he said...
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
New to the Propeller?
Getting started with the Protoboard? - Propeller Cookbook 1.4
Updates to the Cookbook are now posted to: Propeller.warrantyvoid.us
Got an SD card? - PropDOS
Need a part? Got spare electronics? - The Electronics Exchange
I was thinking, maybe I could modify the Booter to wait and just load from serial (and prevent reset from the Prop Tool) and then launch my version of the interpreter. That way everything would be the same except the Interpreter. Your thoughts??? (I realise you didn't have the benefit of the Interpreter/Booter/Runner release. Oh... and I decoded the encoding.)
"Download this object and put these N lines at the start of your Main method and watch execution speed increase" has instant appeal whereas having to alter hardware or the download process doesn't.
I don't see any major problems except in the area of CogNew and CogInit, and once there's a modifiable RAM Interpreter running which can grow beyond 496 instructions it should be possible to add execution tracing and other tools to see exactly what does go on in those cases. Having the actual source code, as you note, is a great help.
Baggers
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
http://www.propgfx.co.uk/forum/ home of the PropGFX Lite
This should help you on your way. It shows the steps taken to transparently replace the ROM Interpreter with a RAM interpreter, RamVm_Step5.spin shows how clean the final version is and how transparent it really is.
I didn't bother with handling CogNew and CogInit, so only a single-Cog program can run at the current time except when sand-boxed; I used an LED to show it was working. Don't forget to change the LED_PIN and the XINFREQ!
Note I've subtly changed the declaration of the running flag to be a parameter of the Main method. This makes it easier to describe the changes needed and allows any Main method to have any local variables left exactly as they were.
You'll probably want to start working from RamVm_Step6.spin. At present this runs the 'number++' method in its own Cog, but the CogNew which starts that launches the ROM Interpreter; you need to change the RAM interpreter so it launches the RAM Interpreter instead.
PS : Yes, I did some tests to prove it is the RAM Interpreter executing the Spin code and not the ROM Interpreter
To get the code density required, Chip uses a lot of multiple and consecutive IF_C_AND_NZ type tests to choose one of four execution outcomes, these all add extra and unnecessary cycles to execution. With a better, faster bytecode decoder and dispatcher it would be possible to stream-line the important cases at the cost of extra code size and having to LMM interpret / overlay some bytecode handlers.
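The trade-off described above can be illustrated with a toy model (Python, not PASM; opcode names are made up): chained two-flag conditional tests make every outcome pay for all the checks, while a precomputed table indexed by the decoded bits replaces the chain with one lookup, at the cost of table space.

```python
def cascade(c, nz):
    # mirrors consecutive IF_C_AND_NZ-style tests: each of the four
    # outcomes is selected by walking the whole chain of conditionals
    if c and nz:
        return "op3"
    if c and not nz:
        return "op2"
    if not c and nz:
        return "op1"
    return "op0"

# dispatcher alternative: one indexed lookup replaces the chain
TABLE = ["op0", "op1", "op2", "op3"]

def dispatch(c, nz):
    return TABLE[(c << 1) | nz]

# both decoders agree on all four flag combinations
for c in (0, 1):
    for nz in (0, 1):
        assert cascade(c, nz) == dispatch(c, nz)
```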
There is unlikely to be a definitive general case 'this is better than the other' outcome because it's going to be swings and roundabouts.
What I'm particularly interested in is not so much speed improvements, but getting it working in the first place. That opens up a world of possibilities including extending the Spin language in third-party compilers ( function pointers ), supporting source-level debugging and even allowing compiled C to be mixed with Spin. Floating point could be natively supported in Spin and even interrupts could be added !
What about each single Cog being able to run multiple Spin threads ? A Propeller could have as many 'Spin Cogs' as it needed although obviously each would run slower.
Post Edited (hippy) : 6/15/2008 2:37:01 PM GMT
Seems we are on similar paths so here's my take...
Many are criticising the Prop for having no C (separate thread) and that has been answered, but at a price hobbyists object to (not professionals). Open source C is another complaint. They could all be addressed by others making it compile to Spin bytecode, which would be far more attractive if...
We could speed up the Interpreter significantly. Yes, it is fast already, but not compared to native PASM - the performance of the Propeller has been greatly underestimated by the masses. I believe this is definitely achievable. My overlayer hits the "sweet spot". Your LMM works great. If we can characterise the bytecodes we can make the most used and code-efficient ones resident, and overlay the rest - don't load if already resident.
A second thought would be for the main primary Spin code to use 2 co-operating Spin interpreter Cogs for the one Spin code. I have a concept in mind.
I also fail to see the issue (problem) with Spin, but that's yet another topic - and I love the PASM. The RISC code has huge advantages! And as you say, once open it can be expanded. Let's push the Spin bytecode and its interpretation to the limit.
Hippy... are you interested in a joint challenge... you've done a lot more work than me on this and I see we are on similar pages
To get started, I'd like to just drop Chip's code into a spin file and load it up. When I am certain it is working properly I can start the work. In just the first 20-odd lines of Chip's code it is squashed to save bytes (great code, but execution time suffers) and this follows all the way through. I am thoroughly amazed he squashed this code into about 490 longs - Congratulations
There are gains in better dispatching but I'm not sure they are that great overall. That's not to say don't try or give up, you're going to learn a lot just by understanding the interpreter and experimenting. As you say it's great code by Chip, a phenomenal achievement getting it into 496 longs.
I do like your idea of caching bytecode handlers, and I'm guessing dual-Cog is some sort of pre-fetch which sounds interesting.
One thing I've found playing with the interpreter to prove it was running from ROM is that there are things going on which are not immediately clear to me. Remove the 'id' variable and the early 'cogid id', the only two places 'id' is referenced and it stops working. Why, I have no idea. It feels like there may be some un-commented position dependency in the code, but then I haven't actually studied it ! Not sure what sort of difficulties that is going to throw up when modifying Chip's code.
Unfortunately I've got myself in deep with other projects and couldn't take on another ( this was a worthwhile aside though ), so while I'm happy to share what advice and experience I have I wouldn't want to get involved in a joint venture at this time.
At the moment I wouldn't really be interested in going beyond making CogNew and CogInit launch the RAM interpreter. I have an idea there which if it works I'll let you know.
Keep the project going and open on the forum though because there are others who have been interested in dynamic objects and so on and there may be some common ground so they may want to help.
FYI, this is what I'm using to prove it's RAM not ROM running. It slows down all waitcnt, but makes CLKSET unavailable ...
j8            if_nc_and_z  call    #popyx          'clkset
'             if_nc_and_z  clkset  x
'             if_nc_and_z  wrlong  y,#$0000
'             if_nc_and_z  wrbyte  x,#$0004
              if_c_or_nz   call    #popx           'pop parameter
              if_nc_and_nz cogstop x               'cogstop
              if_c_and_z   lockret x               'lockret
              if_c_and_nz  sub     x,CNT
              if_c_and_nz  shl     x,#2
              if_c_and_nz  add     x,CNT
              if_c_and_nz  waitcnt x,#0            'waitcnt
              jmp          #loop
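The waitcnt stretch in the patch above (sub / shl #2 / add on CNT) works out arithmetically like this Python sketch: the remaining delay is recovered by subtracting the current counter, quadrupled by the shift, then rebased, so every Spin waitcnt takes four times as long.

```python
def stretch(target, cnt, factor=4):
    # target: the counter value the Spin code asked to wait for
    # cnt: current value of the CNT register when the bytecode runs
    remaining = target - cnt        # sub x, CNT
    remaining *= factor             # shl x, #2  (multiply by 4)
    return cnt + remaining          # add x, CNT (rebase on 'now')

cnt = 1_000
target = cnt + 500                  # caller wanted a 500-tick wait
new_target = stretch(target, cnt)
assert new_target - cnt == 2_000    # the wait is now 4x longer
```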
At it again I see :) You mentioned adding a debugger. Do you have a strategy?
Had to set the clkfreq back to 5M for my display ... don't know why.
Here's the final proof of the pudding, Step 8, a normal Spin program with the RAM interpreter completely replacing the ROM interpreter. Again there's some serious hacking in the interpreter itself to get the magic numbers right so it's not a finished product.
That's me done on the 'doing things' front.
Best idea so far ... ?
Here's another thing I have thought of which could be very useful; inline PASM. That could either be LMM interpreted ( easy ) or overlaid into the interpreter ( slightly harder ) and run full-speed. Now we have full access to the interpreter execution environment we can subvert existing Spin commands to do whatever we want, we've made the interpreter extensible, for example ...
CogInit( -1, @MyPasm, @MyDataBlock )
The -1 can mean run the code at MyPasm inline and the second is a parameter which gets put into a 'virtual PAR' register that MyPasm code can use, it could be a 32-bit number or a pointer to data block.
I can think of all sorts of uses for that like high-speed bit-banged serial output where fast execution is wanted but using another Cog isn't.
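A sketch (Python, all names hypothetical) of how the interpreter's CogInit handler could trap the -1 cog id proposed above: run the code inline and pass the third argument through a 'virtual PAR' register, while normal cog ids behave as before.

```python
def run_inline(code, par):
    # stand-in for LMM-interpreting or overlaying the PASM at 'code';
    # here it is simply a Python callable receiving the virtual PAR
    return code(par)

def coginit(cog_id, code, arg, start_real_cog=None):
    if cog_id == -1:
        # extension: execute inline in the interpreter's own Cog
        return run_inline(code, arg)
    # normal behaviour: launch a real Cog
    return start_real_cog(cog_id, code, arg)

# CogInit( -1, @MyPasm, @MyDataBlock ) analogue:
result = coginit(-1, lambda par: par * 2, 21)
assert result == 42
```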
Want to know what the base address of the object you're executing in is ? Easy, ask the interpreter, it knows, maybe a result := long[ 1 ] or something. Want to play with the stack ? Easy when you have access to the stack pointer, just need a mechanism to read and write it.
A third-party compiler or pre-processor can hide all the ugliness of how it has to be coded in Spin and one day Spin itself may even have macros which hide the actual implementation ...
#define Inline( adr, arg ) CogInit( -1, adr, arg )
Post Edited (hippy) : 6/15/2008 6:43:32 PM GMT
some of your modifications are looking really cool. sorry to have doubted you :) nice work!!! keep it up
Baggers.
I am certain I can get the code executing much faster - it's just the way to do it. So for now I'll just get Chip's code working faster (in sections, not overall) and not concern myself with the resident and overlay sections. It will be easy to drop in a debugger later (and I probably need that as I go anyway).
Later I will need some help in characterising the bytecodes (frequency of use) to work out which routines must be resident. I do not want there to be much difference between using the normal Interpreter and the two versions I want (a single-Cog version and a dual-Cog version).
If we open up the bytecode and it's FAST (with extensibility) then C, Pascal or whatever can be done by others into bytecode.
Until tonight then....
To do that is going to require analysing where the currently inexplicable dependencies are. I found a couple but still had problems ( note how in my hacking, for every instruction added I removed exactly one existing instruction to maintain exact code size ).
Once everything is LMM it's then easy to change things, enhance and subvert the instruction set and optimise execution without worrying about limitations, and then when happy start winding things back in to the core interpreter to enhance speed, deciding what should be core, LMM or overlay.
Getting that LMM engine version would also be a good reference design to put into the 'public domain' ( would it need Chip's / Parallax's permission ? ) for others to use as the basis of changing the interpreter in the way they might want to. I'd say it should be your first milestone goal.
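At its core, the LMM engine under discussion is just a fetch-execute loop over hub memory ( the classic 'rdlong ins,pc / add pc,#4 / execute' ). This Python model (with a made-up two-op instruction set) shows the shape of it; real LMM executes genuine PASM longs, of course.

```python
def lmm_run(hub):
    # hub: list of (op, arg) pairs standing in for 32-bit PASM longs
    pc, acc = 0, 0
    while True:
        op, arg = hub[pc]       # rdlong ins, pc
        pc += 1                 # add pc, #4 (one long per list slot here)
        if op == "add":         # 'execute ins'
            acc += arg
        elif op == "jmp":       # jumps just reload pc in the loop
            pc = arg
        elif op == "halt":
            return acc

program = [("add", 2), ("jmp", 3), ("add", 99), ("add", 40), ("halt", 0)]
assert lmm_run(program) == 42   # slot 2 is skipped by the jump
```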
The best way of determining frequency of Spin bytecode use is perhaps to twist the arm of someone at Parallax to get the PropTool to generate that information, other than that it means having to walk a generated .eeprom and determine it oneself ( not impossible but a fair bit of work ) or you could use the first version of the RAM Interpreter to do that profiling on actually executing code, which is perhaps even better than any walking of a static program as long as it's given a fair workout.
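Run-time profiling of the kind suggested above can be as simple as a counter the RAM Interpreter bumps in its dispatch loop; a Python sketch of ranking bytecodes by execution frequency (the opcode stream here is made up):

```python
from collections import Counter

def rank_bytecodes(executed_bytecodes):
    # count every bytecode the interpreter dispatched; the hottest
    # handlers are candidates to stay resident in the Cog, the long
    # tail can live in overlays
    counts = Counter(executed_bytecodes)
    return [op for op, _ in counts.most_common()]

trace = [0x05, 0x64, 0x05, 0xEC, 0x05, 0x64]   # hypothetical trace
ranking = rank_bytecodes(trace)
assert ranking[0] == 0x05                       # hottest bytecode first
```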
Experience should pre-judge that method call and returns will be quite frequent, along with pushes then pops, then if-else and the various repeats. Less used ( or less necessary to run fast anyway ) will be CogNews and CogInits, LookUps and LookDowns, Case and some of the less common math operators.
One thing which did come out in earlier debate with deSilva was that the best Interpreter / VM implementations will be those which can pre-configure themselves at compile time to suit the program to be run. If that program doesn't need to use an 'equals comparison', why even generate the code to do that, let alone make it fast, even if most other programs may identify that as one of the priority bytecodes which needs to be handled fast. As before though, what appears most important under static analysis isn't necessarily what is most important at run time.
Failing compile time configuration of the interpreter, dynamically configuring at run-time is the next best option, but then there's the overhead of handling that needed without impacting the speed gains being sought.
It's a fascinating field, and I'm sure you're going to have hours ( maybe decades ) of fun
It would have taken me ages to get the interpreter loading from RAM. I just changed the xtal and pin parameters and bingo, my LED flashes (pin 24). Just to check, I loaded Chip's interpreter into RAM and got a faster LED.
Now for the fun....
I wouldn't expect a converted and hammered into shape translation to LMM to work first time and I would be expecting debugging it to be such a nightmare that it would be easier to throw it away and start again. That suggests it's not worth wasting that effort to start with.
A different approach would be to code up the LMM engine and add the dispatcher ( which is what I did when I started writing my SpinVm ) then start bringing in small parts of Chip's code to make bytecodes work. The only problem there is in untangling Chip's code, because it does look like it could be a case of pulling on a thread only to find that the entire bale of knotted cotton is dragged in as well - that's not a criticism of Chip's code, just the reality of refactoring any code, and this is dense and complicated code which needed to be tangled to work within the constraints.
So if that's not going to work so well, it's perhaps back to how I did it with SpinVm which was clean room development. Implement what's needed regardless of how Chip did it. That means understanding what the bytecode does rather than understanding how Chip's code works.
It may be that it's easier to take SpinVm and alter that rather than start from Chip's code. The best compromise being 'start again from nothing', build the LMM kernel, then pull in bytecode handlers from SpinVm and apply Chip's optimisation tricks. Method call, stack frame marking, return and abort should all be updated to work how Chip does it because I simply hammered it into shape until it worked and it's not run-time equivalent. CogNew and CogInit need implementing.
There's no reason you shouldn't write your own Spin interpreter from the ground up or take a mix-and-match approach to what I've suggested, but that will be a longer development process, although one you'll be 100% in control of, will completely understand, and can more easily mould to what you want.
Essential to that is understanding the bytecodes at their high level, so I'd recommend knocking up some short test programs to see what bytecode is generated and get comfortable with what programming in bytecode would be like. That's also good knowledge for if you do move on to writing compilers which generate Spin bytecode. I'd recommend doing that anyway as it gives an understanding as to why the ROM interpreter and my SpinVm are as they are.
Development can be incremental, so you don't need to get it all working at once. AFAIR, I started with repeat, if-else, then moved on to simple assignments, more complex assignments, then method and object handling ( call, return, abort ) towards the end. The more you get working the more momentum it builds.
You are free to do it anyway you want so this is really just my perspective. If I were getting into it again, I'd rewrite the LMM kernel first and probably quite differently to how I did back then based on experience gained, then pull in the SpinVm code modified to work with the new kernel. This time I'd make sure I had a kernel which could jump and call between LMM blocks and entirely forget about being fast or size efficient to start with. I'd modify the kernel to do what was needed rather than try and make the LMM code fit what I had; that turned out to be a restricting approach. I ended up with too much needing to be within the Cog memory and you'll want as much of that free as possible for overlays.
What I found, and the same applies with altering Chip's code, is that once the Cog is full, it's very hard to break out of that situation while ensuring the change will work without adverse effects. Key to everything is getting the foundation and the approach to use right.
This has got my imagination fired up and I'm enthused, and as my main Propeller project is a translator from Basic / C-subset / Pascal to Spin, turning it into a Spin bytecode compiler does have a big appeal. I cannot promise anything yet but I am re-assessing how much of this I can help with.
I am not much of a compiler expert, although I wrote many cross-assemblers in the mid 70's onwards running on my ICL computer. I much prefer the assembler level programs although I have written commercial VB3-6.
So I am going to leave any higher level to others - just provide the speed and extensibility (hooks). Sometime I will need Chip's OK and so until then will keep my code off the forum (this isn't cleanroom stuff so it's Parallax's copyright).
That said, what I am going to do is pull a section of Chip's code out to allow room for my overlay loader and place the section(s) removed into overlays. Then I can start to unravel the code into separate pieces. This will decrease the speed but hopefully keep the code intact as far as consistency goes. Then I will be able to profile code, hopefully via the forum, to work out what needs to be resident. Also I can speed up the overlay sections (although I know they will be slower by being overlaid).
That's my plan.
I will not post until Chip gives me the OK (you there Chip???)
Hippy - see www.bluemagic.biz if you want me to send you a copy direct.
Your code to start the RamInterpreter does a great job.
I have now discovered PASD
Now that I have the space to play with I have come up with some interesting ideas to make it faster!!
1) Fall back to the earlier stage where ROM Spin launches the RAM interpreter, this allows ROM Spin to run TV Driver, your own debugging loop and debugging Cogs in ROM Spin while letting the RAM interpreter work on the code it's given. This is the way I debugged SpinVm.
Any Cog launched from ROM Spin will give a ROM executing Cog, any Cog launched from within the RAM interpreted code will give another RAM interpreting Cog.
2) Launch immediately into RAM interpreter but remove the hack in CogNew/CogInit which starts a new Cog. Without the hack any new Cog launched will be ROM Spin. Thus your main RAM Spin program can launch a ROM Spin Cog for debugging which watches your main program execute.
My opinion is that (2) is more complicated and limiting than (1), especially if a catastrophic bug occurs. It is a small step backwards but I'd go that way then move forward once you're happy with what you have.
I wasn't entirely successful with getting PASD running to debug my VM kernels but that could just be me. As I wasn't sure exactly how PASD worked ( didn't have time to study it and, quite possibly wrongly, didn't feel I could trust it ), I wrote my own routines and did debugging mainly by dumping the entire VM Cog to RAM and handshaking to do single stepping. It was somewhat tortuous and long-winded at times but did work.
I didn't think about the cognew hack.
I thought about modifying (and did try) shifting out the debug code on subsequent loads but that didn't work. Maybe PASD is looking for something directly in RAM.
I freed up quite a bit of space by overlaying the lower and upper codes, and keeping the memory and push/pop codes resident. From looking at the ROM diagrams, Chip used all the space, but in RAM we have plenty available. I am going to reduce bytecode decoding time by holding the jump tables (which include some decoding info) in a separate overlay (which is actually never loaded). This will speed up every instruction and save a bit of space also, at the detriment of 256 longs of RAM. There are other gains to be had by placing some of the code inline and hitting sweet spots in others. My biggest hurdle is being able to see I haven't changed any of the outcomes of the code, because I don't have any Spin code to verify the Interpreter against.
Will keep you posted
Unable to test currently as I am away
I have improved the execution speed in a number of areas (particularly in the bytecode decoding and native assembler execution). I am using a hub based translation array to decode the bytecodes and decide what routines need to be executed. Also trying where possible to hit the hub "sweet spots".
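The hub-based translation array mentioned above can be modelled as a 256-entry table mapping each bytecode directly to its handler plus some pre-decoded flags, so one hub read replaces a chain of per-bytecode tests. A Python sketch with entirely hypothetical table entries:

```python
# 256-entry decode table: one slot per possible bytecode value,
# each holding (handler_address, pre-decoded flags)
DECODE = [(0, 0)] * 256
DECODE[0x05] = (100, 0b01)    # hypothetical: handler at 100, flags 01
DECODE[0x64] = (200, 0b10)    # hypothetical: handler at 200, flags 10

def decode(bytecode):
    # one indexed read yields both where to go and how to get there
    handler, flags = DECODE[bytecode]
    return handler, flags

assert decode(0x64) == (200, 0b10)
assert decode(0x00) == (0, 0)     # unpopulated entries fall through
```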
Will keep you posted - expect another week before I am home again. Will be looking for others to help test.
Question: I am logging (with my own debugger) each spin bytecode executed for results comparison with the modified and unmodified RamInterpreter. I plan on then dropping both into Excel and comparing the results for errors (as they must be identical). Is there anything I should be outputting besides dcall...dcurr and x...adr registers ?
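Comparing the two logs can also be automated rather than diffed by eye in Excel; this sketch (Python, hypothetical trace format of (bytecode, register-snapshot) tuples) reports the first point where the modified and unmodified interpreters diverge:

```python
def first_divergence(rom_trace, ram_trace):
    # walk both traces in lockstep and report the first mismatch
    for i, (a, b) in enumerate(zip(rom_trace, ram_trace)):
        if a != b:
            return i, a, b
    if len(rom_trace) != len(ram_trace):
        # one trace ended early - divergence at the shorter length
        return min(len(rom_trace), len(ram_trace)), None, None
    return None   # traces identical

rom = [(0x05, 10), (0x64, 11), (0xEC, 12)]
ram = [(0x05, 10), (0x64, 99), (0xEC, 12)]
assert first_divergence(rom, ram) == (1, (0x64, 11), (0x64, 99))
assert first_divergence(rom, rom) is None
```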
This following line makes the RamInterpreter relocatable (i.e. position-independent within the Spin source code)
PUB Start
  k_0000_XXXX := @RamInterpreter ^ $0000_F004   'setup hub address of RamInterpreter before loading
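The XOR in that line appears to be a reversible patch: since XOR is its own inverse, storing the real address XORed against a known constant means a second XOR with the same constant recovers the address at load time. A Python sketch (the constant is from the post; the address is hypothetical):

```python
K = 0x0000_F004   # the constant used in the RamInterpreter source

def patch(address):
    # k_0000_XXXX := @RamInterpreter ^ $0000_F004
    return address ^ K

def unpatch(stored):
    # XORing with the same constant again restores the original value
    return stored ^ K

addr = 0x0000_7000                    # hypothetical hub address
assert unpatch(patch(addr)) == addr   # round-trips exactly
```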
http://forums.parallax.com/showthread.php?p=665019
That will read and disassemble .eeprom files and intersperse source code from the associated .spin file so it's quite easy to write a method with some sample Spin and relate that to what's actually generated. That's how I learned what I know. There may be other tools, perhaps better.
I think you have to call all the methods to ensure everything gets disassembled and to keep the source code in synch with the disassembly. Play with the 'view' options to see the bytecode in more understandable high-level forms down to raw bytecode meaning. It's not perfect but does reasonably well on simple, single Spin files - its original purpose was to help me understand the bytecode generation.
It's interesting to see how Chip has optimised the bytecode instruction set, especially for Repeat, LookUp, LookDown and Case, including dedicated instructions which take a bit more understanding than other bytecode. Other than those it's pretty much like any other stack-based machine code.
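"Pretty much like any other stack-based machine code" can be made concrete with a minimal Python model of the push/operate/pop style; the opcodes here are invented for illustration, not the real Spin encoding:

```python
def run(code):
    # code: list of (opcode, argument) pairs for a toy stack machine
    stack = []
    for op, arg in code:
        if op == "push":
            stack.append(arg)
        elif op == "add":
            # binary operators pop two operands, push one result
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "ret":
            # return the value on top of the stack
            return stack.pop()

# roughly what 'return 2 + 3' might compile to on such a machine
assert run([("push", 2), ("push", 3), ("add", 0), ("ret", 0)]) == 5
```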
My conventions on instruction naming etc are different to those used by GEAR and others, and quite probably we're all way off the mark from how Chip refers to them. If you're familiar with assembly languages it should all hopefully be understandable. Treat the bytecode as an actual instruction set and don't worry that it's interpreted was my approach.
By the way, I have removed the overlays I added and have managed to squeeze in a debugger to boot (only clkset is disabled). It's past bedtime so I will look at your bytecode disassembler later. Thanks again.
The RamInterpreter is smaller and running faster but is not yet ready for full testing and release.
Other related threads:
Hacking Spin Interpreter Cog Ram
http://forums.parallax.com/showthread.php?p=739430
Spin Bytecode Disassembler
http://forums.parallax.com/showthread.php?p=665019
What's in the Initialisation Section
http://forums.parallax.com/showthread.php?p=738220
Update: The code is 30+ longs shorter, which has allowed me to shoehorn in all kinds of debugging - my own and also PASD. It should be much faster and I have used some of the code space to speed things up even more. Just in case you are wondering how I am doing it - I have an external hub RAM decoder (which Chip didn't have space for).
Future: I am thinking that an inline style ASM might be in order using small overlays into the unused area and utilising the unused bytecode $3C. Any comments??
Problem: I am trying to set up an object using a DAT style so that I can get the Interpreter to execute the bytecodes I want and then output (already done) the various pointers and stacks and variables. I can then compare these results to the real interpreter. I guess that what I am looking to do is a Spin object in DAT by defining bytes, words, and longs. Obviously the first data will need to look like a real object, and within the object I will need to create stack space - does any of this make sense?
Attached: Updated documents on Spin Bytecodes.