Large Memory Model (LMM) VM
hippy
Posts: 1,981
Another week, another Virtual Machine. This is a simple Large Memory Model executing engine. Nothing fancy, just linear execution of LMM programs with registers held in-Cog and the stack in either Cog or Hub memory. No overlay or 'load then execute' blocks. Not designed to be compatible with anyone else's idea of what an LMM VM should look like.
Primary goal was to be able to hand code LMM programs in the Propeller Tool without that being too complex.
Execution speed ratio of PASM versus LMM is around 1:6 ( PASM versus SPIN is around 1:42 ) for the simple benchmarks I tested. The ratio gets better as more native instructions are executed compared against calls into the VM.
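For a flavour of what hand-coded LMM looks like in the Propeller Tool, here's a minimal sketch in the style of Bill Henning's original scheme; the LMM_JUMP kernel entry and the register names are illustrative assumptions rather than this VM's actual conventions:

' LMM program image, fetched long-by-long from hub by the kernel;
' count and hub_ptr are assumed to be registers defined in the kernel cog,
' and how @lmm_loop gets fixed up to a hub address depends on the VM.
lmm_demo      mov     count, #0           ' ordinary PASM, executed natively
lmm_loop      add     count, #1
              wrlong  count, hub_ptr      ' hub access, also executed natively
              jmp     #LMM_JUMP           ' branch via an ( assumed ) kernel primitive,
              long    @lmm_loop           '   with the target carried in the next long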
Comments
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Cheers,
Simon
www.norfolkhelicopterclub.co.uk
You'll always have as many take-offs as landings, the trick is to be sure you can take-off again ;-)
BTW: I type as I'm thinking, so please don't take any offense at my writing style
When you move a constant to a register you always use 3 longs. I thought (well, for my LMM) that a good idea would be to use mov/movd/movs and movi/shl depending on the size of the number to load; the compiler would take care of choosing which.
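Something like the following, for instance -- the register name r0 and the example values are just illustrations of the idea, not hippy's or a compiler's actual output:

              mov     r0, #257                ' value fits in 9 bits: one long

              mov     r0, #%1_0010_1100       ' any 9-bit value shifted left:
              shl     r0, #14                 '   two longs via mov/shl

              mov     r0, #%0_1101_0110       ' 18-bit value: two longs --
              movd    r0, #%1_0010_1100       '   movd fills bits 17..9 over the low 9

' movi fills bits 31..23 in the same way, and mov/movd/movi can be combined;
' only the fully general 32-bit constant still needs the three-long form
' ( or an inline constant fetched by the kernel ).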
I also thought that *IF* a jump to a kernel routine would land at the end of a cache line, the compiler could add nops to push it to the beginning of the next cache line, to avoid losing the relation between the constant and the routine that needs it.
If you have some longs left in your VM, you can merge the address and the register into one long for things like tjz, djnz, etc. (leave the condition field at %0000 so the long is treated as a nop if necessary, in the case of conditional execution).
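For what it's worth, the packing could look something like this -- the selector value and the exact field use are assumptions, not a worked-out format:

CON
  PSEUDO_DJNZ = %000001                   ' assumed pseudo-op selector, carried in the opcode field

DAT
' Condition bits 21..18 are left %0000, so the long is a nop if ever executed
' natively; DEST ( bits 17..9 ) carries the register and SRC ( bits 8..0 )
' the branch target in longs.
pseudo_op     long    (PSEUDO_DJNZ << 26) | (3 << 9) | 25    ' "djnz register 3, to long 25"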
Just some thoughts I had for mine that may help you. Nice work!!! (You've got 2 VMs and I still haven't released mine... well, soon.)
In the benchmarks within version 001 I simply increment a long in hub until it hits one hundred million, as fast as I can; for the second benchmark set that increment is done within a subroutine. The benchmarks are skewed towards PASM/LMM as Spin has to load the long, increment and store, while PASM/LMM keeps the number in Cog, increments and only writes to hub.
Making the benchmarks more similar, so PASM/LMM both load, increment and store, gives a very different picture; LMM gains but Spin is the big winner ...
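The two inner loops amount to something like this ( register names and structure are mine, not the version 001 source; note how few non-hub instructions separate the rdlong and wrlong in the second loop ):

DAT
              org     0
entry         mov     hub_ptr, par        ' hub address of the shared counter, passed via PAR

' Style 1 ( version 001 ): counter kept in cog, only the result written out
bench1        add     count, #1
              wrlong  count, hub_ptr
              cmp     count, limit wz
  if_nz       jmp     #bench1

' Style 2: load, increment, store -- the variant that is fairer to Spin
bench2        rdlong  count, hub_ptr
              add     count, #1
              wrlong  count, hub_ptr
              cmp     count, limit wz
  if_nz       jmp     #bench2

done          jmp     #done               ' ( the two loops are shown back to back purely for comparison )

count         long    0
hub_ptr       long    0
limit         long    100_000_000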
One may wonder how PASM:LMM can be 1:4 when that's the theoretical maximum without any calls into the VM, and there are a lot of those going on with LMM here?
The answer, I believe, is that the benchmark doesn't so much advantage LMM as disadvantage PASM; the load, increment and store misses the Hub access sweet spots.
It certainly highlights the difficulties of trying to fairly benchmark two very different languages, Spin and PASM/LMM, and even between PASM and LMM. It sure is difficult to say in any absolute terms how much slower Spin or LMM is than PASM.
@Ale: Thanks for the tips. Best results do come from using a compiler / assembler which is bright enough to be able to make things easy for the VM itself. I particularly like your idea of optimising constant loads. As always, so many different possibilities with pros and cons depending on the actual application. I think this is the key message for all VMs; what may be suitable for one task is not always best suited to another.
One conclusion I've come to is that fastest execution speed isn't necessarily the prime objective when lots of Hub access is involved as with a VM, but it's hard to quantify or put to the test.
PS: Three VMs; Spin, LMM and Thumb, but they are all of a muchness. Good luck with your own ( and the same for everyone else ). I can tell you it sure does feel good to get something off paper and running. Once one's done the rest seem to tumble out of the bag.
Post Edited (hippy) : 1/3/2008 4:56:46 PM GMT
As an acid-test benchmark, I believe one must compare LMM and PASM without hub-access instructions. Here's the tightest loop in your VM:
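( Presumably the canonical LMM fetch-execute loop, going by the rdlong/add/nop and jmp referred to below; the register names here are assumed. )

fetch         rdlong  instr, pc           ' fetch the next LMM long from hub
              add     pc, #4              ' advance the LMM program counter
instr         nop                         ' the fetched instruction executes here
              jmp     #fetch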
Because the rdlong misses its appointment with the hub each time around, it has to resynchronize, which means each loop takes 32 clocks. That's eight instruction times, leaving a best-case hubless execution ratio of 1:8. A 1:6 ratio is simply not possible without the equivalent PASM program being hobbled by non-optimum hub accesses.
As Bill Henning points out in his germinal LMM post, one could inline a bunch more rdlong/add/nop triads to the loop to boost efficiency; but in the end, that jmp will always add an extra hub cycle.
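For reference, the unrolled form looks roughly like this ( four-way unrolling shown, register names are mine ); each rdlong/add/nop triad fetches one LMM long and only the closing jmp pays the extra hub resynchronisation:

fetch         rdlong  i0, pc              ' one rdlong/add/nop triad per LMM long
              add     pc, #4
i0            nop
              rdlong  i1, pc
              add     pc, #4
i1            nop
              rdlong  i2, pc
              add     pc, #4
i2            nop
              rdlong  i3, pc
              add     pc, #4
i3            nop
              jmp     #fetch              ' the jmp still costs a missed hub window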
-Phil
When analysing ("looking through", to be honest) some PASM programs, I find a ratio of around 12 COG instructions to one HUB instruction, and very few JMPRETs; but adding DJNZs and other JMPs, these add up to even more than the HUB quota, maybe 2 per 12 instructions.
This is difficult to count, as you have to consider the number of loop cycles, and viable results can only be established by a simulator. There are PASM programs which have a much lower HUB quota... The HUB instructions are generally sync-tuned, but let's assume they take an average of 14 ticks (3.5 standard instructions).
12 PASM instructions will thus take 11*4 + 1*14 = 58 ticks
12 LMM instructions will take 12*32 + 2*20 = 424 ticks (I have counted 5 extra instructions, i.e. 20 extra ticks, per LMM "pseudocode" call)
This is the ratio to be expected: LMM/PASM = 424/58 ≈ 7.3
This leaves ample room for manipulation: more JMPs and JMPRETs will count against the LMM, more HUB instructions will count against PASM, as the LMM gets them "for free".
Post Edited (deSilva) : 1/13/2008 5:12:07 AM GMT
Thanks Phil. I really should do some reading up on Propeller instruction timings.
The interesting thing is that when utilising some LMM handling ( using that loop ) in my other SpinVM, the penalty didn't seem to be as bad as I was expecting compared to when no LMM was used.
Calls into the kernel from this loop automatically have some 'spare cycles' to work in which would otherwise be wasted, so a kernel call could add no noticeable overhead, or show less overhead than it actually has. That fits with what I've observed, and is also what makes it so very hard to determine how much less efficient LMM is than raw PASM in practice.
That also explains why Thumb-style LMM ( where a fetched 16-bit wordcode is converted to a PASM instruction ) built on the same loop is no less efficient than 32-bit LMM: it utilises those very same wasted cycles, so each gets the next hub access at the same rate regardless of the extra intervening processing.
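Purely as an illustration ( this is not hippy's wordcode format, and the 6-bit template index / 5-bit register field split is an assumption ), a Thumb-style fetch step might expand each word by picking a 32-bit template from a cog table and patching a field before executing it, with the decode work happening in cycles that would otherwise be spent waiting for the hub:

DAT
              org     0
fetch         rdword  wc, pc              ' fetch the 16-bit wordcode from hub
              add     pc, #2              ' advance the LMM program counter by one word
              mov     t1, wc
              shr     t1, #10             ' assumed: top 6 bits select a template
              add     t1, #template_tab
              movs    :load, t1           ' point :load at the chosen template
              mov     t2, wc
              and     t2, #$1F            ' assumed: low 5 bits select a register...
              add     t2, #regfile        ' ...within a block of cog registers
:load         mov     instr, 0-0          ' copy the 32-bit template...
              movs    instr, t2           ' ...and patch its source field
              nop                         ' pipeline spacer before executing it
instr         nop                         ' expanded instruction executes here
              jmp     #fetch

template_tab  add     acc, 0-0            ' example template: acc += register
              mov     acc, 0-0            ' example template: acc := register
                                          ' ( table truncated )
acc           long    0
pc            long    0
wc            long    0
t1            long    0
t2            long    0
regfile       long    0[32]               ' the wordcode-addressable registers

A real kernel would presumably tune the decode against the hub window; this sketch is only meant to show the shape of it.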
So, if maximum speed is required, use 32-bit LMM and unroll that loop; if it's not, use 16-bit Thumb-style LMM without unrolling the loop, which doubles code capacity with no loss of speed over non-unrolled 32-bit LMM.