
Large Memory Model (LMM) VM

hippy Posts: 1,981
edited 2008-01-13 15:52 in Propeller 1
Another week, another Virtual Machine. This is a simple Large Memory Model execution engine. Nothing fancy, just linear execution of LMM programs with registers in-Cog, and a stack which can be in-Cog or in Hub memory. No overlays or 'load then execute' blocks. Not designed to be compatible with anyone else's idea of what an LMM VM should look like.

Primary goal was to be able to hand code LMM programs in the Propeller Tool without that being too complex.

Execution speed ratio of PASM versus LMM is around 1:6 ( PASM versus SPIN is around 1:42 ) for the simple benchmarks I tested. The ratio gets better as more native instructions are executed relative to calls into the VM.
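
For anyone wondering what hand-coded LMM source looks like, here is a minimal sketch of the general idea only; names such as r0, result_ptr and LMM_Jump are placeholders rather than the VM's actual interface. Ordinary PASM longs sit in hub and are executed one at a time by the fetch loop, and a branch is done by jumping into a kernel primitive which reloads the LMM program counter from the long that follows.

lmm_code      mov     r0,#0                 ' plain PASM longs, stored in hub
lmm_loop      add     r0,#1
              wrlong  r0,result_ptr         ' result_ptr holds a hub address, set up elsewhere
              jmp     #LMM_Jump             ' a branch is a call into the Cog-resident kernel...
              long    @lmm_loop             ' ...which reloads pc from this long ( remember the
                                            ' Propeller Tool's @ is object-relative, so a real
                                            ' program has to allow for the object base )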

Comments

  • deSilva Posts: 2,967
    edited 2008-01-03 08:11
    Hippy, can you give an example of the 1:42 PASM/SPIN bench? This is either an untuned PASM program or a very SPIN-friendly algorithm...
  • simonl Posts: 866
    edited 2008-01-03 10:15
    deSilva: The benchmark tests are in the Spin code Hippy's posted ;-)

  • Ale Posts: 2,363
    edited 2008-01-03 13:24
    hippy,

    When you move a constant to a register you always use 3 longs. I thought (well, for my LMM) that a good idea would be to use mov/movd/movs or movi/shl depending on the size of the number to load; the compiler would take care of that.
    I also thought that *IF* a jump to a kernel routine would land at the end of a cache line, the compiler could add NOPs to push it to the beginning of the next cache line, to avoid losing the relation between a constant and the routine that needs it.
    If you have some longs left in your VM, you can merge the address and the register into one long (tjz, djnz, etc.), leaving the condition field at 0000 so it is treated as a NOP if necessary.

    Just some thoughts I had for mine that may help you :-). Nice work!!! (You've got 2 VMs and I still haven't released mine... well, soon.)
  • hippy Posts: 1,981
    edited 2008-01-03 16:51
    Benchmarking is an area where I think everyone needs flame-retardant underwear, and I'm no expert in the field. I simply choose something which gives a rough idea or feel and leave it at that. There's no intent to skew results; I know they likely will be, but I don't really care, I'm just after ballpark figures.

    In the benchmarks within version 001 I simply increment a long in hub until it hits one hundred million, as fast as I can, and for the second benchmark set I do that increment within a subroutine. The benchmarks are skewed towards PASM/LMM because Spin has to load the long, increment it and store it back, while PASM/LMM keeps the number in Cog, increments it and only writes to hub.
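
    For reference, here's a hedged PASM sketch along the lines of benchmark 3 as described above ( label names and the PAR-passed result address are illustrative assumptions, not the exact code in the archive ), started with something like cognew(@bench_entry, @hubCounter) from Spin:

    bench_entry   mov     value,#0
    bench_loop    add     value,#1              ' the count stays in Cog
                  wrlong  value,par             ' but every result is written to hub
                  cmp     value,target    wz    ' reached 100,000,000 yet?
            if_nz jmp     #bench_loop
    bench_done    jmp     #bench_done           ' park the Cog when finished

    target        long    100_000_000
    value         res     1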

    Making the benchmarks more similar, so PASM/LMM also load, increment and store, gives a very different picture; LMM gains, but Spin is the big winner ...

    Benchmark Timing      mm:ss   sec   ratios

    1 - Spin timing test  15:40   940   1:23.5
    2 - LMM  timing test  02:40   160   1:4
    3 - PASM timing test  00:40    40   1:1

    4 - Spin timing test  28:00  1680   1:28
    5 - LMM  timing test  05:00   300   1:5
    6 - PASM timing test  01:00    60   1:1

    One may wonder how PASM:LMM can be 1:4 when that's the theoretical maximum without calls into any VM, and there are a lot of those going on with LMM here?

    The answer there, I believe, is that the benchmark doesn't so much advantage LMM as disadvantage PASM; the load, increment and store misses the Hub access sweet spots.

    It certainly highlights the difficulties of trying to fairly benchmark two very different languages, Spin and PASM/LMM, and even between PASM and LMM. It sure is difficult to say in any absolute terms how much slower Spin or LMM is than PASM.


    @ Ale : Thanks for the tips. The best results come from using a compiler / assembler which is bright enough to make things easy for the VM itself. I particularly like your idea of optimising constant loads. As always, there are so many different possibilities, with pros and cons depending on the actual application. I think this is the key message for all VMs; what may be suitable for one task is not always best suited to another.
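
    To illustrate the constant-load idea, a minimal sketch with made-up encodings ( LMM_LoadLong stands for a hypothetical kernel primitive that copies the long following it in the instruction stream into a fixed register; it isn't the actual kernel interface ):

                  mov     r0,#200               ' 1 long  : the value fits in a 9-bit immediate
                  mov     r1,#480               ' 2 longs : 9-bit value plus a shift
                  shl     r1,#16                '           gives 480 << 16
                  movi    r2,#%1_0000_0000      ' 1 long  : patches only bits 31..23 of r2
                  jmp     #LMM_LoadLong         ' otherwise: have the kernel fetch the full
                  long    $DEAD_BEEF            '           32-bit constant that follows

    A compiler or assembler can simply pick whichever form is shortest for each value.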

    One conclusion I've come to is that fastest execution speed isn't necessarily the prime objective when lots of Hub access is involved as with a VM, but it's hard to quantify or put to the test.

    PS : Three VMs; Spin, LMM and Thumb :-) but they are all much of a muchness. Good luck with your own ( and the same for everyone else ). I can tell you it sure does feel good to get something off paper and running. Once one's done, the rest seem to tumble out of the bag.

  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-13 03:59
    Hippy,

    As an acid-test benchmark, I believe one must compare LMM and PASM without hub-access instructions. Here's the tightest loop in your VM:

    LMM_Fetch       rdlong  :Opc,pc         ' fetch the next LMM long from hub into :Opc
                    add     pc,#4           ' advance the LMM program counter
    :Opc            nop                     ' the fetched instruction executes here
                    jmp     #LMM_Fetch      ' then go fetch the next one

    Because the rdlong misses its appointment with the hub each time around, it has to resynchronize, which means each loop takes 32 clocks. That's eight instruction times, leaving a best-case hubless execution ratio of 1:8. A 1:6 ratio is simply not possible without the equivalent PASM program being hobbled by non-optimum hub accesses.
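
    Spelling that out (the hub grants each Cog an access window every 16 system clocks, and in the steady state the rdlong arrives 4 clocks after a window):

    rdlong  :Opc,pc      waits 12 clocks, then 8 for the access  = 20 clocks
    add     pc,#4                                                =  4 clocks
    :Opc    (the fetched instruction)                            =  4 clocks
    jmp     #LMM_Fetch                                           =  4 clocks
                                                          total  = 32 clocks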

    As Bill Henning points out in his germinal LMM post, one could inline a bunch more rdlong/add/nop triads to the loop to boost efficiency; but in the end, that jmp will always add an extra hub cycle.
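
    For illustration, a four-deep unroll of that fetch loop looks roughly like this (purely a sketch, with the clock counts in the comments):

    LMM_Fetch       rdlong  :Op0,pc         ' 8 clocks when it hits its hub window
                    add     pc,#4           ' 4 clocks
    :Op0            nop                     ' 4 clocks -- 16 in all, so the next
                    rdlong  :Op1,pc         '   rdlong lands on the next window
                    add     pc,#4
    :Op1            nop
                    rdlong  :Op2,pc
                    add     pc,#4
    :Op2            nop
                    rdlong  :Op3,pc
                    add     pc,#4
    :Op3            nop
                    jmp     #LMM_Fetch      ' the jmp breaks the 16-clock rhythm, so the
                                            ' first rdlong of the next pass waits again

    With N triads inlined a pass costs 16*N + 16 clocks, so the per-instruction cost approaches the theoretical 16 clocks (1:4) as N grows, but never quite reaches it.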

    -Phil
  • deSilva Posts: 2,967
    edited 2008-01-13 05:00
    Ha, Phil, now it's becoming a little bit more tangible :-)

    When analysing ("looking through", to be honest) some PASM programs I find a ratio of 12 COG instructions to one HUB instruction, and very few JMPRETs; but adding DJNZs and other JMPs, these add up to even more than the HUB quota, maybe 2 per 12 instructions.

    This is difficult to count, as you have to consider the number of loop cycles, and viable results can only be established by a simulator. There are PASM programs which have a much lower HUB quota... The HUB instructions are generally sync-tuned, but let's assume they take an average of 14 ticks (3.5 standard instructions).

    12 PASM instructions will thus take 11*4 + 1*14 = 58 ticks
    12 LMM instructions will take 12*32 + 2*20 = 424 ticks (I have counted 5 extra instructions for the LMM "pseudocode")

    This is the ratio to be expected: LMM/PASM = 424/58 ≈ 7.3

    This leaves ample room for manipulation: more JMPs and JMPRETs will penalise the LMM, while more HUB instructions will penalise PASM, as the LMM gets them "for free".

  • hippy Posts: 1,981
    edited 2008-01-13 15:52
    Phil Pilgrim (PhiPi) said...
    Because the rdlong misses its appointment with the hub each time around, it has to resynchronize, which means each loop takes 32 clocks.

    Thanks Phil. I really should do some reading up on Propeller instruction timings.

    The interesting thing is that when utilising some LMM handling ( using that loop ) in my other SpinVM, the penalty didn't seem to be as bad as I was expecting compared to when no LMM was used.

    Calls into the kernel from this loop automatically have some 'spare cycles' to work in which would otherwise be wasted, so a kernel call could add no noticeable overhead, or show less overhead than it actually has. That fits with what I've observed, and is also what makes it so very hard to determine how much less efficient LMM is than raw PASM in practice.

    That also explains why Thumb-style LMM ( where a fetched 16-bit wordcode is converted to a PASM instruction ) using the same loop is no less efficient than 32-bit LMM: it utilises those very same wasted cycles, so each gets its next hub access at the same rate regardless of the extra intervening processing.

    So, if maximum speed is required, use 32-bit LMM and unroll that loop; if it's not required, use 16-bit Thumb-style LMM without unrolling the loop, which doubles code capacity with no loss of speed over 32-bit LMM.
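
    To show why the wordcode decode rides in those otherwise-wasted clocks, here is a heavily simplified sketch ( not the actual Thumb VM: it pretends each wordcode is simply the Cog address of a pre-expanded 32-bit instruction and ignores operand patching ):

    Thumb_Fetch     rdword  t,pc            ' fetch the 16-bit wordcode from hub
                    movs    :get,t          ' wordcode = Cog address of a prebuilt op
                    add     pc,#2           ' ( also spaces out the self-modify )
    :get            mov     :Opc,0-0        ' copy that 32-bit instruction into place
                    nop                     ' let the copy take effect
    :Opc            nop                     ' execute it
                    jmp     #Thumb_Fetch    ' six Cog ops plus the rdword = 32 clocks,
                                            ' the same as the simple 32-bit loop

    The three decode instructions simply replace the clocks the 32-bit loop spends waiting for its hub window, which is why the word-at-a-time fetch costs nothing extra.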