Spin micro-JIT
rokicki
Has anyone thought hard about a Spin micro-JIT that compiles each statement to actual COG
assembly (not LMM, actual low-level ASM itself)?
I think this would be very powerful, even if you do it in a super-simple way. Each statement
compiles to a given sequence of assembly instructions, and you simply lay them out in the
COG space linearly. When you run out of instructions, you empty the cog and do it again.
This means that any Spin code which compiles to something that actually fits in the cog runs at full
assembly-language speed. Otherwise, it runs at compile+ASM-execute speed.
I think a bytecode-to-PASM translator wouldn't be horribly difficult, although it might take a fairly
large table somewhere (think ROM). Statement-to-statement linking isn't too bad either.
Of course, we'd keep the current Spin model that all variables and such live in main memory
(otherwise the "memory model" starts to fail). But intermediate values in Spin could
use temp registers in the cog rather than go through the hub.
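To make that concrete, here is roughly what I'd expect the JIT to emit for a single statement such as var := 1 + 2 (an untested sketch only; t0 and var_addr are just names I've picked for a cog temp and a patched-in hub address):

        mov     t0, #1           ' load literal 1 into a cog temp register
        add     t0, #2           ' fold in literal 2
        wrlong  t0, var_addr     ' store the result back to the hub variable

t0       long   0                ' scratch register for intermediate values
var_addr long   0                ' hub address of var, patched in by the JIT

Each bytecode would map to a template like this; the JIT just copies templates into cog RAM one after another and patches in the addresses and literals.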
I actually wanted to prototype this using a subset of Basic, but I haven't had the time.
Comments
You could JIT to hub and then execute LMM-style, but it would be easier to do that external to the PropTool and build a complete LMM version of the program to execute.
Program size will soar tremendously, but it might be useful (especially for Prop II), and I'm also wondering if that's a means of getting execution speed up for a 'Spin' program running from external EEPROM or SD card?
There'd be some control over speed-versus-size; one can keep "Call #Pop" calls or convert them to inline "rdlong x,sp + sub sp,#4" sequences.
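To illustrate, the two forms would look something like this (x and sp being whatever the kernel calls its scratch register and stack pointer):

        call    #Pop             ' kernel-call form: Pop leaves the popped long in x

versus the inline form:

        rdlong  x, sp            ' fetch the value at the stack pointer from hub
        sub     sp, #4           ' drop it from the stack

The first costs one long per pop but goes through the kernel; the second is faster but costs two longs each time.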
It's enough of a spark that a Spin bytecode to LMM converter might be my next project, especially as the LCC stuff requires writing an assembler to convert PASM to LMM, so that's one half of the Spin -> PASM -> LMM conversion done.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.com
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again
BTW: I type as I'm thinking, so please don't take any offence at my writing style
By choosing numbers which can be optimised into a single byte, the bytecode density improves even further, but that seemed a tad unfair given that I'm trying to get some realistic figures rather than prove bytecode better and so forth. Results are never entirely realistic or fair, but this is hopefully ball-park; gains will go up and down depending on the complete code involved.
For comparison, I'd likely call ImageCraft C "Kernel Call Only PASM", which matches the roughly 4:1 density observed, though LMM-C seems to achieve better density than that in practice on real programs. I've not taken into account that massive optimisation can be applied as one moves towards PASM, and I have assumed a stack-based architecture rather than a register-based one throughout.
I recall the Fibonacci demo for JDForth was stated as 40%/60% larger than the Spin code (apologies to Carl if I've got that wrong), which puts it, I'd guess, in the "16-Bit threaded Code" category, and optimised.
Once one goes towards pure PASM the code size starts to increase quite rapidly. If
we assume a 'program statement' which does anything meaningful takes 8 bytes with
bytecode ( var := 1+2 ), with 32KB the program capacity should come out like -
My criticism of ICC has been based on the limited memory capacity of the Propeller Mk I. Is that fair? Hard to tell, because it depends on how many bytes an actual 'statement' uses and what the complete program is. If the above holds true then ICC does better than I have given it credit for.
The "16-Bit threaded" code is what I've also suggested would be the balance between Bytecode and Kernel Call PASM, sacrificing some speed for better code density.
Looked at the opposite way; how many bytes needed for a 1,000 statement program -
If that "how many 8-byte statements fit in hub" is applied for the 256KB of the Propeller Mk II then we get the following -
That to me suggests that the Prop Mk II will be extremely usable no matter what programming language is chosen.
I've never been a big fan of the LMM model (sorry to all those out there who believe in it). My belief is that its primary failing is that not enough is achieved per instruction. I think that this is the main reason why the LMM model sees the level of code bloat that it does.
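For anyone who hasn't seen it, the overhead I mean comes from the LMM fetch-execute loop, which in its simplest published (un-unrolled) form looks roughly like this:

lmm_loop
        rdlong  instr, lmm_pc    ' fetch the next native instruction from hub
        add     lmm_pc, #4       ' advance the LMM program counter
instr   nop                      ' the fetched instruction executes here
        jmp     #lmm_loop        ' back round for the next one

lmm_pc  long    0                ' hub address of the LMM code image

Every user instruction carries three kernel instructions plus a hub access of overhead; unrolling the loop helps the speed, but not the one-long-per-instruction code size.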
In another thread I mentioned that my "spin" project uses 7 cogs and has 817 longs free - so I knew that code bloat was going to be a massive threat to its translation. All blocks have now been translated and are compiling, but not yet fully debugged. Initial indications are that the code will now require only 5 cogs and still have about 310 longs free. So as the code has grown, the size increase over Spin has dropped to about 25%. I've still had to work pretty hard at optimising, but this was my expectation of what would eventually be possible with JDForth!
I need to add an extra code block to compare JDForth with Hippy's examples:
Now I did cheat a bit, and translate your example directly. If the two numbers pushed were unique 16-bit literals, then the real equivalent is:
But, it is amazing how often "common constants" are used in programming, so the real result achieved by JDForth is somewhere between these two extremes.
As I write more code in JDForth, I've found other sequences of words that appear so often that they are just begging to be optimised. I was surprised (even though I should not have been) at how often the sequence 4 + appeared in code. (4 + adds 4 to the item on the top of the stack.) I also didn't have 4 as a common constant! Before optimising, the code looked like this:
but after adding the word 4+ into the kernel, it looks like this:
The kernel grew by 1 long in the process and is now:
Sorry if this all sounds like I'm beating my own drum, but there is a point that I'm trying to make.
Any analysis of an execution model needs to consider code at the highest level as well. Adding two numbers is fine, but communicating with an SD card or sending UDP packets over Ethernet is the real bottom line in building embedded products. Real consideration needs to be given as to how the code will grow as the project grows. I know that last statement is very "touchy-feely", but that is the reality of being a programmer. We are all artists, after all.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Carl Jacobs
JDForth - Forth to Spin Compiler http://www.jacobsdesign.com.au/software/jdforth/jdforth.php
Includes: FAT16 support for SD cards. Bit-bash Serial at 2M baud. 32-bit floating point maths. Fib(28) in 0.86 seconds. ~3x faster than spin, ~40% larger than spin.
Indeed this is where it all becomes 'swings and roundabouts' and makes it hard to do any entirely fair and accurate comparisons at this subset level. Choose particular number values and one can improve and/or disadvantage on a whim.
There is probably another (it's used in assembler, so I don't know if it is relevant in Spin/Forth or whatever): +$200. It adds 1 to the destination address in assembler.
Have you picked up on Hippy's use (and now mine) of the lower hub RAM <$200 for rd/wr byte/word/long, format wrlong cog,#hub? I posted a sample of loading/storing utilising the sweet spot by adding $204, which increments both the cog and ram address simultaneously.
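In case anyone hasn't seen that trick, a minimal sketch (the hub address, count and register names are made up for the example): because a hub address below $200 fits in the 9-bit source field, the rdlong itself can be stepped through hub and cog at the same time by self-modifying it:

loader  rdlong  buf, #$10        ' read a long from hub address $10 into cog register buf
        add     loader, incr     ' self-modify: +$200 steps the cog destination, +4 steps the hub source
        djnz    count, #loader   ' repeat for the remaining longs

incr    long    $204             ' $200 (dest +1) plus 4 (source +4)
count   long    16               ' number of longs to move (example value)
buf     res     16               ' cog registers that receive the data

The plain +$200 mentioned just above is the same idea but steps only the cog destination field.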