TrimSpin

Dave Hein · 2012-07-17 15:29

I have been trying out an idea about speeding up the Spin interpreter by supporting a subset of the Spin bytecodes. This allows the interpreter to be optimized for a smaller number of bytecodes. The attached file contains a program that demonstrates the results that I have so far. The TrimSpin interpreter executes random code almost twice as fast as the standard interpreter. Some functions, such as BYTEMOVE or the multiply operator run only slightly faster since they use similar code that the standard interpreter uses. The STRSIZE instruction runs about 2.5 times faster because it uses a dedicated routine instead of sharing it with the STRCOMP code.

My motivation in doing this is mostly to see if it's possible to develop a virtual machine on the Prop that runs faster than the current Spin VM and is as compact, or even more compact. This might be useful for improving the peformance of Spin, or maybe it could be used for C.

jmg · 2012-07-17 15:47

So the primary target for this would be Prop 2, given spin is in ROM in Prop 1 ?

Phil Pilgrim (PhiPi) · 2012-07-17 16:48

jmg wrote:

So the primary target for this would be Prop 2, given spin is in ROM in Prop 1 ?

Not necessarily. Just because there's a bytecode interpreter in ROM, doesn't mean you have to use that one. You can use any interpreter you want.

Dave, I would be interested in finding out which bytecodes were eliminated, though, and what might have to be changed in some Spin programs to accommodate.

Thanks,
-Phil

Dave Hein · 2012-07-17 17:07

The following features are not supported in TrimSpin,
and must be avoided in Spin programs that are executed by TrimSpin.

VAR variables cannot be accessed directly, but can by accessed through
pointers, such as BYTE[@var_variable].

Only the first 8 local stack variables can be directly accessed. However, all
local variables can be accessed through pointers, such as WORD[@loc_variable].

Indexed access can only be used with the BYTE, WORD and LONG operators,
such as WORD[ptr][index].

The operator-assignment operators can only be used on long variables, such
as x++ and x *= 10. They can not be used on byte or word variables.

The loop index used with the repeat-from-to instruction must be a long
variable.

An array of objects is not supported.

Registers can only be accessed as longs. Bit modes are not supported for
registers, and registers can not be used with extended operators, such as
++ and |=.

BYTEMOVE does not handle the case where the source and destination buffers
overlap, and the source address is greater than the destination address.

The following operations are not supported:
  |<    >|    ^^    ~X    X~    ~~X   X~~   ?X    X?
  |<=   >|=   ^^=   <=    >=    <>=   ===   =<=   =>=

The following Spin instructions are not supported:
ABORT     BYTEFILL  LONGFILL  WORDFILL  WORDMOVE  LONGMOVE
CASE      COGINIT   COGNEW    LOCKCLR   LOCKRET   LOCKSET
LOOKDOWN  LOOKDOWNZ LOOKUP    LOOKUPZ   WAITCNT   WAITPEQ
WAITPNE   WAITVID

Dave Hein · 2012-07-17 17:21

jmg wrote: »

So the primary target for this would be Prop 2, given spin is in ROM in Prop 1 ?

P2 will have almost 4 times as much hub RAM as P1, so I think P1 is the primary target. For the most part TrimSpin is just an experiment to see how fast a reduced-instruction interpreter could run. I think I could get it to run even faster if I didn't restrict myself to the Spin bytecodes, but then I would need to modify the compiler to generate new codes.

Phil Pilgrim (PhiPi) · 2012-07-17 17:31

Dave,

A lot of what was eliminated could be restored by a Spin preprocessor that would convert unsupported Spin expressions to something that is supported. But some exclusions present insurmountable difficulties, especially COGNEW -- hard to live without that one!

-Phil

Dave Hein · 2012-07-17 20:38

Yes, COGNEW, COGINIT, WAITCNT and the lock operators would be nice to have. I could add more instructions, but it was becoming difficult to open up cog space without impacting peformance. If we only had an extra Kbyte of cog RAM we could probably implement all of the interpreter functions without a speed penalty. I think P2 will suffer from not having more cog memory.

jmg · 2012-07-18 14:32

The idea of an 'Elastic Spin' is great, but I'm less certain about a subset that removes functionality, from the get-go.

If you are going to do the housekeeping to construct and transport a 'NewSpin', then a smarter approach I would suggest would be to 'smarten up' the Spin compiler, so it creates both a NewSpin image, and the user bytecodes.

This would be multipass and optimising, both very cheap on a PC.
eg If your code never used a Spin byte code, then that could be removed and a new smaller custom NewSpin built, and that could then free room, to allow some (small) In-line user functions ( effectively user Byte Codes)

This would be a simple 'scan and trim' algorithm, but it would run multi-pass until two passes gave the same size.

It could also report how many times each Bytecode resource was called, to show the user possible candidates for more trim.

Also, a user could know they will use most of the Spin byte codes, but prefer 'small' on some, in order to allow 'fast' on others.

The idea is that you never get a result that is not compatible with present Spin, but you can craft programs what would be much faster, each using a project specific subset of Spin, and you can free up space for user byte-codes, and/or faster (but larger) NewSpin Bytecodes.

The ideal is to always have a ~100% full COG, but filled with truly useful stuff.

msrobots · 2012-07-18 15:56

Jmg,

this is a interesting thought. Optimizing the interpreter for the program loaded. We have to load Spin-Interpreter on Prop2 anyways - so why not integrate a custom one. Not a easy task I guess, but fore sure worth a try.

A multi-stage loader like catalina has will be useful anyways to save/recover hubspace used to load any Spin-Interpreter...

Enjoy!

Mike

Dave Hein · 2012-07-18 16:58

I thought about something similar to that. The interpreter would be built by reading the binary file, and adding the interpreter code for each opcode that was used. The entry in the jump table would be changed from unsupp to the label for that opcode. That would work as long as the program limited itself to a small number of instructions.

msrobots · 2012-07-18 22:12

wouldn't that be so in most of the cases? I think I never used all of them at all. But everybody uses 'his own' subset. Interesting.

Enjoy!

Mike

Dave Hein · 2012-07-19 07:26

jmg and Mike, I thought more about the idea of a custom-built interpreter, and I think that's the right approach. A custom-built interpreter could be grafted to any binary that uses a subset of the Spin bytecodes and improve the performance. Spinsim contains routines that disassembles bytecodes and profiles them. I could use those routines to write a program that post-processes a binary produced by the Spin tools or BST. I'll look into that.

Cluso99 · 2012-07-19 15:43

Very interesting concepts.

msrobots · 2012-07-19 18:15

Dave,

and somehow tell you the less used codes to optimize your spin to reduce the total number ...

cool.

Enjoy!

Mike

Humanoido · 2012-07-27 00:09

Remarkable and brilliant project, Dave.
The regular Spin is 2K. Does TrimSpin free up any memory?

Circuitsoft · 2012-08-10 13:52

Could you fit an LMM interpreter into TrimSpin? Could that allow C code to directly call Spin code and vice versa?

Cluso99 · 2012-08-10 15:48

circuitsoft: Neat idea! The LMM execution unit can be as small as 4 longs and can even live in shaddow cog ram (see my zero footprint debugger). So almost certainly this is possible. I know Dave has implemented LMM in his version of the interpreter. Therefore this is certainly possible. And add my faster interpreter (uses IIRC 256 hub longs for the vector table) and we could get a very interesting mix.

Dave Hein · 2012-08-10 19:23

I have looked at using LMM to speed up the interpreter by moving the lesser used instructions to hub RAM, and optimizing the more frequenctly used instructions. I think I was able to get about a 33% improvement using this technique along with the vector table. My goal for TrimSpin was to get a 2x improvement in speed, which I was able to do for programs using a subset of Spin. However, it will be difficult to do this for Spin programs that use a large part of the instruction set.

Circuitsoft · 2012-08-10 19:41

There was already a discussion on possibly writing a pre-processor to replace unimplemented operations with implemented equivalent sequences. What I really want is to directly call C/LMM-ASM from spin and vice versa. Maybe one of the unimplemented opcodes could be repurposed for calling LMM code?

Dave Hein · 2012-08-10 20:24

I posted an object in the OBEX a couple of years ago that allows Spin programs to call LMM code. It's call SpinLMM, and it's located at http://obex.parallax.com/objects/635/ . SpinLMM uses the one Spin bytecode that is not used by the Spin interpreter. I believe the unused bytecode is $3C. There are a number of issues with interfacing to C LMM code. The stack grows in different directions between Spin and C. C LMM uses a more complex LMM interpreter than the one in SpinLMM. The C LMM code must be linked to a specific address. It's possible to do, but it would difficult.

TrimSpin

Comments