Spin micro-JIT
rokicki
Has anyone thought hard about a Spin micro-JIT that compiles each statement to actual COG
assembly (not LMM, actual low-level ASM itself)?
I think this would be very powerful, even if you do it in a super-simple way. Each statement
compiles to a given sequence of assembly instructions, and you simply lay them out in the
COG space linearly. When you run out of instructions, you empty the cog and do it again.
This means that any Spin code which compiles to something that actually fits in the cog runs at full
assembly-language speed. Otherwise, it runs at compile+ASM-execute speed.
I think a bytecode-to-PASM translator wouldn't be horribly difficult, although it might take a fairly
large table somewhere (think ROM). Statement-to-statement linking isn't too bad either.
Of course, we'd keep the current Spin model that all variables and such live in main memory
(otherwise the "memory model" starts to fail). But intermediate values in Spin could
use temp registers in the cog rather than go through the hub.
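To make that concrete, here is roughly what I'd expect the JIT to emit for a single statement such as var := 1 + 2 (an untested sketch only; t0 and var_addr are just names I've picked for a cog temp and a patched-in hub address):

        mov     t0, #1           ' load literal 1 into a cog temp register
        add     t0, #2           ' fold in literal 2
        wrlong  t0, var_addr     ' store the result back to the hub variable

t0       long   0                ' scratch register for intermediate values
var_addr long   0                ' hub address of var, patched in by the JIT

Each bytecode would map to a template like this; the JIT just copies templates into cog RAM one after another and patches in the addresses and literals.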
I actually wanted to prototype this using a subset of Basic, but I haven't had the time.
Comments
You could JIT to hub and then execute LMM-style, but it would be easier to do that external to the PropTool and build a complete LMM version of the program to execute.
Program size will soar tremendously, but it might be useful (especially for Prop II), and I'm also wondering if that's a means of getting execution speed up for a 'Spin' program running from external EEPROM or SD card?
There'd be some control over speed-versus-size; one can keep "Call #Pop" calls or convert them to inline "rdlong x,sp + sub sp,#4" sequences.
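To illustrate, the two forms would look something like this (x and sp being whatever the kernel calls its scratch register and stack pointer):

        call    #Pop             ' kernel-call form: Pop leaves the popped long in x

versus the inline form:

        rdlong  x, sp            ' fetch the value at the stack pointer from hub
        sub     sp, #4           ' drop it from the stack

The first costs one long per pop but goes through the kernel; the second is faster but costs two longs each time.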
It's enough of a spark that a Spin bytecode to LMM converter might be my next project, especially as the LCC stuff requires writing an assembler to convert PASM to LMM, so that's one half of the Spin -> PASM -> LMM conversion done.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.com
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again
BTW: I type as I'm thinking, so please don't take any offence at my writing style
By choosing numbers which can be optimised into a single byte, the bytecode density improves even further, but that seemed a tad unfair given that I'm trying to get some realistic figures rather than prove bytecode better and so forth. Results are never entirely realistic or fair, but this is hopefully ball-park; gains will go up and down depending on the complete code involved.
For comparison, I'd likely call ImageCraft C "Kernel Call Only PASM", which matches the roughly 4:1 density observed, though LMM-C seems to achieve better density than that in practice on real programs. I've not taken into account that massive optimisation can be applied as one moves towards PASM, and I have assumed a stack-based architecture rather than a register-based one throughout.
I recall the Fibonacci demo for JDForth was stated as 40%/60% larger than the Spin code (apologies to Carl if I've got that wrong), which puts it, I'd guess, in the "16-Bit threaded Code" category, and optimised.
Once one goes towards pure PASM the code size starts to increase quite rapidly. If
we assume a 'program statement' which does anything meaningful takes 8 bytes with
bytecode ( var := 1+2 ), with 32KB the program capacity should come out like -
My criticism of ICC has been based on the limited memory capacity of the Propeller Mk I. Is that fair? Hard to tell, because it depends on how many bytes an actual 'statement' uses and what the complete program is. If the above holds true then ICC does better than I have given it credit for.
The "16-Bit threaded" code is what I've also suggested would be the balance between Bytecode and Kernel Call PASM, sacrificing some speed for better code density.
Looked at the opposite way; how many bytes needed for a 1,000 statement program -
If that "how many 8-byte statements fit in hub" is applied for the 256KB of the Propeller Mk II then we get the following -
That to me suggests that the Prop Mk II will be extremely usable no matter what programming language is chosen.
I've never been a big fan of the LMM model (sorry to all those out there who believe in it). My belief is that its primary failing is that not enough is achieved per instruction. I think that this is the main reason why the LMM model sees the level of code bloat that it does.
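For anyone who hasn't seen it, the overhead I mean comes from the LMM fetch-execute loop, which in its simplest published (un-unrolled) form looks roughly like this:

lmm_loop
        rdlong  instr, lmm_pc    ' fetch the next native instruction from hub
        add     lmm_pc, #4       ' advance the LMM program counter
instr   nop                      ' the fetched instruction executes here
        jmp     #lmm_loop        ' back round for the next one

lmm_pc  long    0                ' hub address of the LMM code image

Every user instruction carries three kernel instructions plus a hub access of overhead; unrolling the loop helps the speed, but not the one-long-per-instruction code size.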
In another thread I mentioned that my "spin" project uses 7 cogs and has 817 longs free - so I knew that code bloat was going to be a massive threat to its translation. All blocks have now been translated and are compiling, but not yet fully debugged. Initial indications are that the code will now require only 5 cogs and still have about 310 longs free. So as the code has grown, the size increase over Spin has dropped to about 25%. I've still had to work pretty hard at optimising, but this was my expectation of what would eventually be possible with JDForth!
I need to add an extra code block to compare JDForth with Hippy's examples:
Now I did cheat a bit, and translate your example directly. If the two numbers pushed were unique 16-bit literals, then the real equivalent is:
But, it is amazing how often "common constants" are used in programming, so the real result achieved by JDForth is somewhere between these two extremes.
As I write more code in JDForth, I've found other sequences of words that appear so often that they are just begging to be optimised. I was surprised (even though I should not have been) at how often the sequence 4 + appeared in code. (4 + adds 4 to the item on the top of the stack.) I also didn't have 4 as a common constant! Before optimising, the code looked like this:
but after adding the word 4+ into the kernel, it looks like this:
The kernel grew by 1 long in the process and is now:
Sorry if this all sounds like I'm beating my own drum, but there is a point that I'm trying to make.
Any analysis of an execution model needs to consider code at the highest level as well. Adding two numbers is fine, but communicating with an SD card or sending UDP packets over Ethernet is the real bottom line in building embedded products. Real consideration needs to be given as to how the code will grow as the project grows. I know that last statement is very "touchy-feely", but that is the reality of being a programmer. We are all artists, after all.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Carl Jacobs
JDForth - Forth to Spin Compiler http://www.jacobsdesign.com.au/software/jdforth/jdforth.php
Includes: FAT16 support for SD cards. Bit-bash Serial at 2M baud. 32-bit floating point maths. Fib(28) in 0.86 seconds. ~3x faster than spin, ~40% larger than spin.
Indeed this is where it all becomes 'swings and roundabouts' and makes it hard to do any entirely fair and accurate comparisons at this subset level. Choose particular number values and one can improve and/or disadvantage on a whim.
There is probably another (it's used in assembler, so I don't know if it is relevant in Spin/Forth or whatever): +$200. It adds 1 to the destination address in assembler.
Have you picked up on Hippy's use (and now mine) of the lower hub RAM <$200 for rd/wr byte/word/long, format wrlong cog,#hub? I posted a sample of loading/storing utilising the sweet spot by adding $204, which increments both the cog and ram address simultaneously.
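In case anyone hasn't seen that trick, a minimal sketch (the hub address, count and register names are made up for the example): because a hub address below $200 fits in the 9-bit source field, the rdlong itself can be stepped through hub and cog at the same time by self-modifying it:

loader  rdlong  buf, #$10        ' read a long from hub address $10 into cog register buf
        add     loader, incr     ' self-modify: +$200 steps the cog destination, +4 steps the hub source
        djnz    count, #loader   ' repeat for the remaining longs

incr    long    $204             ' $200 (dest +1) plus 4 (source +4)
count   long    16               ' number of longs to move (example value)
buf     res     16               ' cog registers that receive the data

The plain +$200 mentioned just above is the same idea but steps only the cog destination field.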