Spin Interpreter ( Source Code )
hippy
Posts: 1,981
For your Christmas delectation and delight; one Spin Interpreter written in PASM.
I'd hoped to get further than I have, but there's only so many hours in a day, the coding elves have packed up tools and run off claiming to have some other job they need to do, and I seem to have just five longs of code space left with a fair chunk still to implement, but here it is anyway
It runs a reasonable amount of code but there are some notable absences ( a += b, a++, and no subroutine calling ).
It's become very clear that I need a redesign as there's some 'trick' I'm not seeing. I'm currently using 80 longs as lookup / jump vector table which if I can reclaim most of that it might be enough to squeeze it into 496 longs, but I'm still doubtful. I've a long way to go before I can match the skill of Chip's programming. In the meantime I'll probably resort to overlays.
For now though, back to the mince pies and alcohol.
I'd hoped to get further than I have, but there's only so many hours in a day, the coding elves have packed up tools and run off claiming to have some other job they need to do, and I seem to have just five longs of code space left with a fair chunk still to implement, but here it is anyway
It runs a reasonable amount of code but there are some notable absences ( a += b, a++, and no subroutine calling ).
It's become very clear that I need a redesign as there's some 'trick' I'm not seeing. I'm currently using 80 longs as lookup / jump vector table which if I can reclaim most of that it might be enough to squeeze it into 496 longs, but I'm still doubtful. I've a long way to go before I can match the skill of Chip's programming. In the meantime I'll probably resort to overlays.
For now though, back to the mince pies and alcohol.
Comments
This is a Spin bytecode Interpreter
In its own right I'm not sure what use it is, but l can probably think of something when I'm more awake. Being able to single-step, breakpoint and debug bytecode should have its uses and the porting possibilities appeal to me; PICmicro / AVR / ARM / PC etc. With a bit of time-slicing ( and obviously loss of speed ) it should be possible to emulate eight cogs on a single CPU. I can see still see advantages to running Spin code even if the hardware isn't Propeller and doesn't support all the hardware attributes.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
Clean-room implementation from scratch lest anyone thinks I'm simply trying to decode the ROM ( no offence taken ).
The post Christmas haze has lifted so time for version two. This one makes extensive use of overlays which really does free ones mind from trying to work within Cog memory constraints, and it's now easier to say what it doesn't do rather than does ...
* No calls to Object Methods
* No Abort / Abort trapping
* No LookUp / LookDown
* No SPR bit or bits access, INA[noparse][[/noparse]0], INA[noparse][[/noparse]0..7]
* No CogInit / CogNew
* Random numbers aren't random
For comparison with the genuine Parallax Spin Interpreter; size is approximately 3700 bytes, 925 longs.
Approximate benchmark timings (seconds/mm:ss), calling a subroutine to increment a long in hub memory 10 million times ...
170 / 02:50 - Genuine Parallax Spin Interpreter
480 / 08:00 - Using overlays
225 / 03:45 - Without using overlays
220 / 03:40 - Fetch loop optimised without using overlays
So it's at best 30% slower than Parallax's ( vice-versa, Parallax is 23% faster ). It's in the right ballpark so I'm very happy with that.
One interesting pursuit of this effort of yours would be to see if by deletion of a few of the lesser used operators to free up some space, if a substantial speeding up of remaining operators is possible by optimizing for speed rather than space.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Paul Baker
Propeller Applications Engineer
Parallax, Inc.
If I can match the speed of the ROM interpreter I may come asking for a job
I think the furthest I'm going to go is the incrementing loop and a stopwatch. Someone else can worry about hyper-optimisation and appreciate the sterling work Chip's done. I'm more than happy with what I've got, but obviously will optimise where I see an opportunity.
If I had to set a 'success criteria' on speed, I'd say bit-banging 19200 baud serial.
The way I'm doing it ( looked-up jump table ) there'd be no gain. Handlers can either be fixed in Cog or overlayed in as needed so there's obviously a speed advantage which can be given to most commonly used opcodes. I was quite surprised that with everything overlayed it was only half the speed, I'd expected worse.
There are gains to be had in un-rolling code and focused optimisations but I'm trying to keep it as clean as possible for maintenance at present.
One interesting comparison would be, instead of overlay loading and executing, to perform LMM-style interpretation of the overlay code. That would benefit non-looped overlays but the looped ones would suffer; I can however see an easy way to do both.
A favour to ask ...
As it's Christmas ( playing the "season to be giving presents" card ); you don't happen to have an example line of code which generates the $3C opcode you'd be prepared to share ? Try as hard as I can, just cannot seem to generate one. It would be much appreciated.
170 / 02:50 - Genuine Parallax Spin Interpreter
480 / 08:00 - Using overlays
225 / 03:45 - Without using overlays
220 / 03:40 - Fetch loop optimised without using overlays
Current Development Version having undergone some tweaking ...
170 / 02:50 - Genuine Parallax Spin Interpreter
450 / 07:30 - Using overlays, load and execute
395 / 06:35 - Using overlays, interpret LMM-style when possible
170 / 02:50 - Favoured in-cog handlers, overlays interpreted LMM-style when possible
So a significant saving using LMM-style interpretation rather than load and execute overlays. Load and execute takes 14% longer than LMM and I already do demand-sized loads for speed. Something significant perhaps for everyone looking at LMM-style implementations.
The real hoot though is getting execution speed to match the ROM Interpreter !
It's perhaps not a fair comparison as I have all the opcodes used in the benchmark as favoured opcodes held in-cog for maximum performance, particularly the three subroutine related opcodes ( Frame / Call / Return ) and the Cog is once again near full so some of that will have to move back to overlay.
One thing I hadn't appreciated ( coming from a background of much slower micros ) is that it doesn't really amount to saving much by optimising away a single instruction, and conversely even an extra unnecessary one left in does little damage, not even over 10 million loops. In this case optimizing to hit hub access sweet spots is the key to fast execution.
The current footprint is down to around 2800 bytes, 700 longs. A new release will be along sometime soon.
I'm in the wrong OS so cannot determine footprint or benchmarks at present but at one point it was running my increment loop faster than the ROM interpreter does. I'll give details later.
Everything working except for whatever the missing $3C opcode is used with, and I hit a brick wall with CogNew / CogInit. That uses an unidentified sub-opcode which needs further investigation.
If the PropTool generated a suitable symbol table / listing file it would be possible to do high-level Spin debugging, statement stepping, breakpoints with variable content viewing and change ( like VB6 and other similar development tools ). Background Debugging Mode would be something very worthwhile Parallax could add to the Prop II's extra ROM, a Spin Interpreter designed for this purpose. Just a listing file and leaving the Propeller community to it would be a good second best.
Two side-tracks were multi-tasking ( running two or more Spin Programs on the same cog ) and embedding other bytecode interpreters within this one. Both worked well. The bottom line is that it should be relatively easy to emulate any arbitrary bytecode architecture. Likewise any other processor instruction set when cycle timing isn't important.
LCC, an ANSI C compiler, looks to have a fairly well defined bytecode / VM so I may consider that next, or maybe a PICmicro/AVR emulator so their compilers can be used. I might also take a look at rewriting the BS1/PICAXE emulator in PASM.
Bug found while investigating "Where does an ABORT trap to when it occurs in a method being used to form a parameter for a method " ? The answer is; it doesn't continue after the method being called, but falls out the method that call was attempted within. No, I can't phrase it better !!!
This, using Parallax ROM Spin Interpreter, prints ...
Started
Trap Start
Abort
Finished
Some may have expected "Trap Finish" to have been executed but it isn't.