Propeller "Thumb" Code
hippy
Posts: 1,981
Potential criticisms of the Propeller are that Spin is interpreted quite slowly, Cogs have limited (496) instruction capacity, and Large Memory Models are limited to just 8K long instructions without using additional, external storage.
One solution would be a better tuned Virtual Machine instruction set which would offer faster execution than Spin and give better code density over LMM.
The challenge is in creating a VM instruction set which is both useful and easy/fast to decode and execute. One could, in theory at least, emulate any other instruction set, but being closely aligned to the native instruction set should give the best performance.
One solution to this is ARM Thumb which effectively compresses the native instruction set into a smaller representation.
I started by considering a 16-bit instruction set with a Motorola 6800-style architecture using two accumulators, a source/destination register and the rest defining the opcode. Further thinking led to the notion of an 8-bit register/immediate and a 7-bit lookup table which defines the Propeller instruction to use. The instructions in the lookup table specify where to put the register bits ( usually as the source register / immediate value ). The assembler would determine which 128 native instructions are used and determine what the VM opcodes would be. This would allow all VM instructions access to the conditional execution and effect flags.
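To make the proposed format concrete, here is a minimal decode sketch in Python. The exact field layout is an assumption (the post doesn't pin down bit positions): bit 15 as the extended-opcode flag, bits 14-8 as the 7-bit lookup-table index, bits 7-0 as the 8-bit register/immediate.

```python
# Hypothetical decode of the proposed 16-bit VM opcode.
# Assumed field layout (not fixed by the post):
#   bit 15    : extended-opcode flag
#   bits 14-8 : 7-bit index into the native-instruction lookup table
#   bits 7-0  : 8-bit register number / immediate value

def decode(vm_opcode):
    extended = (vm_opcode >> 15) & 0x1
    table_index = (vm_opcode >> 8) & 0x7F
    operand = vm_opcode & 0xFF
    return extended, table_index, operand
```

With this layout, opcode $42A5 would select lookup-table entry $42 and carry operand $A5; the native instruction fetched from the table then says where the operand bits get inserted.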
This represents a VM processor of 256 registers ( mapped to $100-$1FF, which also allows easy access to the special function registers ), a 128-entry native instruction lookup table ( could be held as words ), leaving 128-196 instructions for the VM implementation. The VM interpretation should be quite light and fast with most of the effort in the Assembler determining the opcode table mapping. Instruction capacity is doubled to 16K over 32-bit LMM.
The so far unused bit of the VM instruction would indicate extended opcodes; JMP, CALL, RET and so on. Adding Skip on Carry/Zero etc would save on too rapidly using up the opcode mapping table ( the Assembler could choose mapping or skipping as needs dictate ). There's also no reason that there couldn't be a 'the following long is a native Propeller instruction' or a means to dynamically switch between native and VM interpretation.
The initial longs of the VM object code image would include the native instruction lookup table and pre-initialisation values for registers. This would be of variable size by insisting the first instruction is a jump to the first executable opcode immediately after that block.
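A loader for that variable-size header could be as simple as the sketch below. The encoding of the initial jump is hypothetical; the only property the post relies on is that its target marks where the header (lookup table plus register presets) ends and executable opcodes begin.

```python
# Sketch of splitting a VM image whose first instruction is a jump
# past the header block. The low-bits-as-target encoding is an
# assumption for illustration only.

def load_image(longs):
    entry = longs[0] & 0xFFFF       # hypothetical: jump target in low bits
    header = longs[1:entry]         # lookup table + register presets
    code = longs[entry:]            # first executable opcode onwards
    return header, code
```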
Has anyone done anything in this direction already, or any comments or views on the concept ?
Comments
Have you "scoped" out the possible speed of the interpreter? It sounds like the instructions would not be any faster than those of Spin and not particularly more compact. The main advantage would be to be able to make use of the ARM Thumb programming tools that are available both for free and for a fee.
Hadn't thought of using actual ARM tools ( no idea how similar the Prop and ARM are ). Shouldn't be too hard building the tools from scratch ( famous last words ).
However this should not be the central consideration. The main concern of VMs is that they have to be tuned to do something useful WRT the application domain, be it string handling, floating point math, or whatever.
In fact I don't know how long a SPIN bytecode takes ( 400 ticks is my best guess = 80 machine instructions ). This also matches fine with the observation that a given SPIN program can be transformed to a ( much longer ) equivalent assembly program that runs 50 to 100 times faster.
So following Hippy's way we can expect a "PropThumb" program to run 4 to 6 times faster than SPIN. This also matches fine with the observation that programs in Frohf's FORTH run 3 to 4 times faster than equivalent SPIN.
But I think just emulating machine instructions well has a too limited effect. The "power" of a single machine instruction is just too low to justify the emulation overhead.
So what I am thinking of for some time is to have truly different VMs in each COG: A very slow general purpose VM (=SPIN), some special VMs (a la FLOAT32), a JIT compiler for SPIN Bytecode,....
I have to say that just trying it opens one's eyes to just how hard it is to produce an efficient VM and VM instruction set which is also usable. It's also a good exercise in optimising Propeller Assembler.
I would guess the main speed advantage over the Spin VM would be not having to maintain the HUB based stack. You'd be keeping temporary variables in the COG RAM rather than HUB RAM.
OK for hand crafted ASM, but all but impossible to write a proper compiler for it without reintroducing a stack and losing the advantage.
There is no "main advantage". There are many effects all adverse to high speed code interpretation. As Hippy has put it well: ".. just trying it opens one's eyes..."
I've finally arrived at a VM opcode format which is an 8-bit index into a native opcode lookup table and an 8-bit operand which gets inserted into that native instruction. The looked-up native opcode itself defines how to insert the operand with the bottom 8-bits being the address of the desired handler within the cog. The VM itself could have handlers for Cog or Hub-based stack so both are easily supported.
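The fetch/lookup/dispatch cycle described here can be sketched in Python. Everything below is a toy model with hypothetical names and field positions: the table entries stand in for 32-bit Propeller instructions, their bottom 8 bits select the handler, and an assumed bits-15-8 field stands in for the destination register that the handler patches the operand against.

```python
# Toy model of the final VM opcode format: an 8-bit index into a
# native-opcode table plus an 8-bit operand. The looked-up "native
# instruction" carries its handler address in its bottom 8 bits.
# All encodings here are illustrative assumptions, not the real PASM.

class ThumbVM:
    def __init__(self, table, code):
        self.table = table        # native-opcode lookup table
        self.code = code          # 16-bit VM instructions
        self.regs = [0] * 256     # the 256 user registers
        self.pc = 0

    def step(self):
        op = self.code[self.pc]
        self.pc += 1
        index, operand = (op >> 8) & 0xFF, op & 0xFF
        native = self.table[index]
        handler = HANDLERS[native & 0xFF]   # low 8 bits pick the handler
        handler(self, native, operand)

# Two toy handlers standing in for in-cog handler routines:
def h_load_imm(vm, native, operand):        # reg[dst] = operand
    dst = (native >> 8) & 0xFF              # hypothetical dst field
    vm.regs[dst] = operand

def h_add(vm, native, operand):             # reg[dst] += reg[operand]
    dst = (native >> 8) & 0xFF
    vm.regs[dst] += vm.regs[operand]

HANDLERS = {0: h_load_imm, 1: h_add}
```

The point of the scheme is that the inner loop does no real decoding at all: one table read and one jump, with the assembler having done the hard work of building the table.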
I wouldn't particularly view my Thumb VM as an alternative to Spin but as a more densely packed LMM using the native instruction set. One particular addition has been the introduction of register indirection to avoid self modifying code. A nice thing about it is that it can either represent native Propeller code or any own-created instruction set which maps to it.
One question I am wondering about is how deterministic should any VM be ? Should there be a fixed length VM instruction cycle or simply as fast as possible ? There's technically no reason it shouldn't be capable of both, dynamically switched as required at run-time. With WAITCNT supported and a fairly fast VM determinism may not really be that important, and a VM the wrong thing to use if that's what's wanted.
I've got the concept of a fast workable Thumb VM solved to the extent I'm happy with it, but ( like others ) I've run into the problem of needing the tools to implement it.
I've spent the last week or so rewriting my Parsing Engine to handle Macros, #Define and Conditional Processing properly, as well as both line-based and end-of-line-is-whitespace source, so hopefully that gives a solid framework to build tools upon, assemblers and compilers.
The big challenge for the Thumb VM is creating the Native Opcode Lookup Table, but hopefully it won't be too bad. Anyway, attached is what I have so far by way of documentation, theory and prototype code. It's not even seen the PropTool compiler, so expect errors ...
Post Edited (hippy) : 12/20/2007 2:29:56 PM GMT
Proof of Concept Thumb VM implementation attached. Not comprehensively tested but it runs the embedded Thumb VM Program which proves the most common functionality, including branch and subroutine call and return. It increments a value in a fixed location in hub memory, Tv_Text displays it. Currently uses an in-Cog subroutine call stack but external stack is easy enough to add and it shouldn't have too much impact on speed.
A casual stopwatch test suggests it's delivering just under 1MIPS at 80MHz. That gives PASM around a 20:1 speed advantage. For "mov a,b" and similar PASM has around a 12:1 advantage.
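The 20:1 figure is consistent with the Propeller's instruction timing: most native instructions take 4 clocks, so an 80MHz cog executes about 20 MIPS, and a 20:1 interpretation overhead lands right at the ~1 MIPS observed.

```python
# Sanity check on the stopwatch figures above.
clock_hz = 80_000_000
pasm_mips = clock_hz / 4 / 1e6   # most native instructions take 4 clocks
vm_mips = pasm_mips / 20         # reported ~20:1 PASM advantage
# pasm_mips -> 20.0, vm_mips -> 1.0
```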
Thumb VM is approximately three times slower than LMM, but there's no reason Thumb VM cannot support native 32-bit instructions LMM-style. However, FCACHE etc isn't possible because most of the cog is used up by the stack, native opcode expansions and 256 user registers.
The large register set comes from looking at it as a virtual "microcontroller" rather than as an instruction set designed to support C or other HLL. It really is "Super Cog" code rather than anything else.
I'm currently working on a subset-of-C compiler ( more like Spin in a C-style rather than full C ) and I haven't decided what instruction set to target, but I may go for this, so there could be some usable tools coming along at some point.
For example: Turbo Pascal compiles in a way that every variable and argument ends up in the stack. This is way simpler than using registers. OTOH Turbo C uses registers when possible (and when forced).
This is a constraint I somehow tried to address in my LMM, together with a way of reserving space in the stack. I am thinking of an LMM for bigger code sizes ( I think you moved the page in the propeller wiki, no ? so you know my ideas ). If you make it only fit the HUB RAM, a limited stack like the hardware stack in the PICs is a brilliant idea.
A fit-for-all-cases will be hardly possible.
Quite a few things added, but not all tested. Changes have mainly been to address the failings which prevent its use as a more general architecture as a HLL target.
Optimised the instruction Fetch routine and gained around 10% speed improvement despite other code changes which are detrimental to fastest execution. Missing a rdbyte slot really can have a significant effect; this is a case in point. Seems to be quite close to 1MIPS now.
Supports stack in Cog or in Hub. Seems about a 3% decline in speed when using hub stack. Also added Push and Pop plus Load From Stack and Put To Stack so parameters can be accessed on the stack relative to top of stack. Can switch between cog and hub stacks dynamically. Both pc and sp now exist in User Register space so multiple stacks, stack frames and other clever tricks can be performed. It's not a pure stack-based VM but can be used that way.
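The stack operations described here can be modelled in a few lines. This is an illustrative sketch, not the PASM implementation: names like `peek`/`poke` are invented stand-ins for "Load From Stack" and "Put To Stack", and the backing `mem` list stands in for either cog or hub RAM. Keeping `sp` in user-register space is what allows swapping it to get multiple stacks or stack frames.

```python
# Toy model of the VM's stack operations. Offsets for Load From
# Stack / Put To Stack are taken relative to the top of stack,
# so call parameters can be read without popping them.

class VmStack:
    def __init__(self, mem, sp=0):
        self.mem = mem    # stands in for cog or hub RAM
        self.sp = sp      # lives in user-register space in the real VM

    def push(self, value):
        self.mem[self.sp] = value
        self.sp += 1

    def pop(self):
        self.sp -= 1
        return self.mem[self.sp]

    def peek(self, offset):          # "Load From Stack", 0 = top
        return self.mem[self.sp - 1 - offset]

    def poke(self, offset, value):   # "Put To Stack"
        self.mem[self.sp - 1 - offset] = value
```

Switching between cog and hub stacks dynamically then amounts to swapping which memory (and which `sp` register) the handlers use.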
Added Load Word Immediate and Load Long Immediate to save on wasting registers holding pre-defined constants.
Sequences of native 32-bit Propeller instructions can be executed LMM-style for speed critical code sections. The VM can also be made deterministic so it runs each VM instruction with fixed timing ( or multiples of a fixed timing cycle ). That's only really there 'because I could', timed bit-banging would probably best be done with waitcnt and let the VM run as fast as possible as with Spin.
Added a Break instruction to support debugging, breakpoints and single-stepping.
The VM PASM code is now pre-initialised by Spin rather than from data in the executable image. This is to save cog code space. The VM has proven itself to be quite easily extensible, but the added VM code has created quite a lot of bloat. Ideally the VM would include instruction handlers only for what it needs and the best way is to build the VM PASM code dynamically in the Assembler itself.
Next thing is to get the VM executable bit-banging serial to see how well it does and how fast it can go, then some mechanism to interact with Spin Objects, and after that I think it's going to be on to trying to re-create a Spin interpreter.
Post Edited (hippy) : 12/22/2007 2:13:08 PM GMT
Right!
It's interesting to see how much of a VM design rests on good or bad decisions made early on when the consequences aren't always appreciated. I was lucky to have fallen upon the lookup table to convert instructions which then gave a fairly efficient means of decoding and dispatching and has made it easy to add new VM specific opcodes; on the other hand I have severely limited myself with the idea of it being register-based and lost half of cog memory that way. I could map those registers to hub memory but then it loses its speed advantage. As always, it's a case of one or the other, rarely both !
Extending a VM always becomes some kind of fudge, kludge or hack, and it's become fairly clear that there's no perfect VM, just VMs better suited to some architectures than others. I'm quite surprised ( and "chuffed" in Brit speak, "pleased with myself" ) that it came out so well as a middle-ground VM. How practical or usable it is I have no idea, but hopefully it's useful for others thinking of writing VMs. For my own future purposes I believe I need a stack-based VM tailored to what I need but this has been a good learning experience.
It seems clear to me that there are two extremes of VM ( and real world architectures ); those that use an entirely register based system, and those which are stack-oriented. I think I knew that already, but it's been interesting and fun to 'discover the truth' in the doing it.
There is more to say about that.. Much more even that would fit into one book...
In fact it is a (not so simple) optimization consideration, in two dimensions:
(A) Number of operands (="addresses") in the instruction: 0, 1, 2, or 3?
(B) Number of high access speed memory cells: 0, 1, small N (<16), large N (>256)
These are however largely influenced by instruction and data cache size, and the relation of high speed memory access to low speed memory access (e.g. Propeller: 1: 16)
A true stack machine is (A=0, B=0) with two 1-address exceptions ("push" and "pop")
Minor influence comes from the "programming style": It will make a difference whether you are happy with a linear vector and two nested loops or not... Unlimited nesting (as in loops, formulas, routine calls, multi-dimensional arrays) is the reason why HLL-compilers use intermediate stack code. It is a standard task of code generators to transform that efficiently into any register architecture, but the idea sticks: Why not leave it as is? This was one of the roots of FORTH (and the HP desktop calculators)...
The answer is simple: In many (most) cases it is just stupid to push and pop things. But it leaves the problem of optimizing (A = program length) and (B = run time speed) still open.
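A tiny example makes the trade-off visible. Evaluating (a + b) * c on a 0-address stack machine takes five instructions, every operand going through push/pop even when a register machine would already have it at hand; that traffic is exactly the overhead referred to above. The mini-evaluator below is an illustrative sketch, not anyone's actual VM.

```python
# (a + b) * c on a hypothetical 0-address stack machine: only "push"
# carries an address, the arithmetic ops work purely on the stack.

def run_stack(program, vars):
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(vars[arg[0]])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

prog = [("push", "a"), ("push", "b"), ("add",),
        ("push", "c"), ("mul",)]
```

A 2-address register machine could do the same in two instructions (add, mul) once the values are in registers, which is why code generators flatten the stack form back out.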
At the time when more programmers used machine code, even "human factors" influenced architecture. The VAX 11 was a machine famous for its programmer friendliness. Though this sounds stupid, it has its good reasons in reducing the costly number of coding errors...
As I said: More than just one book