A tighter VM loop for LMMs — Parallax Forums


Phil Pilgrim (PhiPi) Posts: 23,514
edited 2008-06-09 17:35 in Propeller 1
The now-"traditional" way to implement a large memory model (LMM) virtual machine (VM) involves the following code (ref: Bill Henning, hippy, et al.):

              [b]org[/b]       0
_vm           [b]mov[/b]       pc,[b]par[/b]

_vm_lp        [b]rdlong[/b]    _xeq,pc
              [b]add[/b]       pc,#4
_xeq          [b]nop[/b]
              [b]jmp[/b]       #_vm_lp




Because the rdlong is always one instruction too late to meet its appointment with the hub, the loop executes in 32 clocks, making the LMM:PASM speed ratio 1:8. As Bill Henning points out, this could be improved by inlining more rdlong/add/nop triads; but the jmp back to the beginning will always lose synchronization with the hub and cost an extra 16 clocks. He comments further that an autoincrement facility could eliminate the add and tighten the loop, if only such a facility existed in the Propeller.
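The clock arithmetic behind those ratios can be jotted down in a few lines (a back-of-envelope sketch, not thread code; the 16-clock hub window per cog and 4-clock instruction time are Propeller 1 characteristics):

```python
# Clock budget for the traditional LMM loop. Each cog gets one hub access
# slot every 16 system clocks; an ordinary cog instruction takes 4 clocks.
HUB_WINDOW = 16
NATIVE = 4

# The rdlong arrives one instruction after its hub slot, so it stalls
# until the next slot: each fetched LMM instruction spans two hub windows.
traditional = 2 * HUB_WINDOW

# If every hub slot were hit, one window per fetched instruction:
ideal = HUB_WINDOW

print(traditional // NATIVE)   # 8 -> the 1:8 LMM:PASM ratio
print(ideal // NATIVE)         # 4 -> the 1:4 target
```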

This got me thinking about other ways to eliminate a step in the VM loop, thus reducing it to 16 clocks from 32. My first thought was to use phsa for the program counter. In a 16-clock loop, it would increment 16 times. This would mean that the machine instructions being emulated would have to be put into RAM on four-long intervals. The gaps in between instructions could be filled with other instruction threads, but this kind of interleaving would be an absolute mess. And this does not even begin to deal with the emulation of jmps. So that idea was quickly abandoned.

The other option is to combine the loop back with an increment of the program counter. The only instruction available to modify a register and jump in one instruction cycle is djnz. Not only does this increment in the wrong direction, it does it by one, not four. But wait! There are two factors in our favor here:
  • The source address for a rdlong doesn't have to be a multiple of four. The instruction merely ignores the last two bits.
  • This is a virtual machine, after all. We can make instructions run backwards in memory if we want.
With these two principles in mind, I came up with the following VM loop:

              [b]org[/b]       0
 _vmr         [b]mov[/b]       pc,[b]par[/b]
              [b]jmp[/b]       #_go
              
_xeq1         [b]nop[/b]
              [b]rdlong[/b]    _xeq0,pc
              [b]sub[/b]       pc,#7
_xeq0         [b]nop[/b]
_go           [b]rdlong[/b]    _xeq1,pc
              [b]djnz[/b]      pc,#_xeq1




Each time around the loop, two instructions are executed, pc gets decremented by eight, and 32 clocks transpire. This makes the LMM:PASM speed ratio 1:4, instead of 1:8. Now pc doesn't decrement by four at each step. It alternately decrements by one and seven. But what matters is that pc[15..2] decrements at a constant rate, which it does here.
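That constant-rate claim is easy to sanity-check. In this sketch (Python, not thread code) the fetched long address is pc with its two low bits ignored, exactly as rdlong treats it:

```python
# Simulate the reversed loop's program counter: djnz subtracts 1, then
# sub pc,#7 subtracts 7, alternating. The address the rdlong actually
# uses is pc & ~3, since rdlong ignores the two low address bits.

def fetch_addresses(start_pc, n):
    pc = start_pc
    addrs = []
    for i in range(n):
        addrs.append(pc & ~3)              # long address seen by rdlong
        pc -= 1 if i % 2 == 0 else 7       # djnz by 1, then sub by 7
    return addrs

# Starting long-aligned, the fetched address steps down by 4 every fetch:
print(fetch_addresses(100, 4))   # [100, 96, 92, 88]
```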

There are a couple of implications of this loop:
  • The emulated code has to run backwards in memory. Unless one wants to write it that way, some sort of preprocessor will be necessary to reformat normal-looking code. A compiler, of course, is one example of a preprocessor.
  • VM subroutines that get called by the emulated code will have to keep pc in phase with the 1-and-7 decrementing. The simplest way to do this would be to andn pc,#3 and always return to _go.
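The second bullet's re-phasing rule amounts to clearing pc's two low bits, which andn pc,#3 does in one instruction. A trivial sketch of why that restores phase regardless of which socket the kernel call came from:

```python
# andn pc,#3 clears the two low address bits, putting pc back on a long
# boundary so the 1-and-7 decrement pattern can restart cleanly at _go.

def realign(pc):
    return pc & ~3   # andn pc,#3

# pc may end in ...11 (after the djnz socket) or already be aligned
# (after the sub pc,#7 socket); either way realign() is long-aligned:
print(realign(99), realign(92))   # 96 92
```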
My intent is to write a simple preprocessor for this VM and make it available on the web for people to try.

-Phil

Comments

  • deSilva Posts: 2,967
    edited 2008-01-13 18:48
    Phil, this is ingenious!
    Though you should check this:
    Phil Pilgrim (PhiPi) said...
    * The source address for a rdlong doesn't have to be a multiple of four. The instruction merely ignores the last two bits.
    I doubt this statement... Parallax has always emphasized that this kind of addressing yields undefined results. I can very well imagine that the two LSBs - which will be used for the word and byte extraction - can have side effects also with RDLONG.....
  • hippy Posts: 1,981
    edited 2008-01-13 18:48
    @ Phil : Excellent and inspired thinking.

    It would be possible to just reverse the entire table of LMM code at runtime if it were linear in a contiguous block and also 'reverse' any addresses when first called or used within the LMM code. That would allow Reversed-LMM to be more easily written within the PropTool.

    The overhead of reversing addresses when needed would be minimal compared to the gains of hitting hub access sweet spots, and the entire LMM code reversal is a one-time per boot-up process. If LMM jump destinations were +/- offsets rather than absolute, that would minimise the overhead further.

    If the LMM code is reversed at run time it can be written back to the Eeprom, making it a once-per-download occurrence with no extra overhead on subsequent boot-ups. It's entirely possible to determine if code is running from Eeprom ( power-on/F11 ) or Ram ( F10 ) and behave appropriately in each case.
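    Hippy's boot-time reversal can be sketched in a few lines (Python with hypothetical helper names; the real thing would be PASM or Spin operating on hub RAM):

```python
# Reverse a contiguous block of LMM longs once at boot, and map forward
# hub addresses into the reversed block. Helper names are illustrative.

def reverse_lmm_block(longs):
    """Reverse the order of a contiguous block of LMM longs."""
    return longs[::-1]

def reverse_address(addr, base, size):
    """Map a forward long-address into the reversed block.
    base = block start in bytes, size = block length in bytes."""
    return base + (size - 4) - (addr - base)

base, size = 0x20, 16                           # a 4-long block, for example
print(hex(reverse_address(0x20, base, size)))   # 0x2c: first long -> last slot
print(hex(reverse_address(0x2c, base, size)))   # 0x20: and vice versa
```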
  • hippy Posts: 1,981
    edited 2008-01-13 18:55
    @ deSilva : I haven't read any Parallax statements on the lsb's but I've not seen any problems caused by them not both being zero for rdlong in practice. That doesn't mean there aren't potential problems.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-13 18:57
    deSilva,

    I'm going by the original Assembly Language Instruction Set table which states, for rdlong, "Read main memory long S[15..2] into D." I've tested the loop on some simple code, and it works.

    -Phil
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-13 19:01
    Hippy,

    The VM I plan to implement would be both relocatable and reentrant, so all jumps and calls would be relative. I also plan to implement short branches, which would not entail calls outside of the VM loop.

    -Phil
  • hippy Posts: 1,981
    edited 2008-01-14 01:12
    This looks very promising. I modified my LmmVm to do this Reverse LMM, and it was interesting to compare the two - Version 002 = traditional, version 003 = Reversed.

    No change in "LmmTest1", but "LmmTest2" deteriorated from 4m 20s to 4m 40s. That's probably because there are few sequential native instructions and more calls into the kernel. Hub access sweet spots probably got shifted and missed for "LmmTest2". Neither are fair tests, and there's not enough here to compare the two.

    I rewrote "Test1" to add four add's and four sub's so functionality was unchanged but there were more contiguous native instructions to execute ....

                     m:ss  sec  ratio
    
    Traditional LMM  7:20  440  1:7.4    1 x rdlong + 1 x jmp
    Unrolled LMM     5:00  300  1:5      4 x rdlong + 1 x jmp
    Reversed LMM     4:40  280  1:4.7
    Native PASM      1:00   60  1:1
    
    
    



    A 30% improvement there over traditional LMM and getting very close to that 1:4 nirvana.

    Added : It also performed slightly better in this specific test than the traditional LMM with an unrolled loop.

    Post Edited (hippy) : 1/14/2008 4:21:49 AM GMT
  • cgracey Posts: 14,133
    edited 2008-01-14 01:34
    deSilva said...
    Phil, this is ingenious!
    Though you should check this:
    Phil Pilgrim (PhiPi) said...
    * The source address for a rdlong doesn't have to be a multiple of four. The instruction merely ignores the last two bits.
    I doubt this statement... Parallax has always emphasized that this kind of addressing yields undefined results. I can very well imagine that the two LSBs - which will be used for the word and byte extraction - can have side effects also with RDLONG.....
    This is okay! In a RDLONG/WRLONG the two address LSBs are ignored. In a RDWORD/WRWORD only one address LSB is ignored. In a RDBYTE/WRBYTE all address bits are used. Of course, bits 31..16 are always ignored in every case, as there's only 64KB of main memory.
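    Chip's rules reduce to a simple masking model (the function is just an illustration; the masks are exactly as he states them):

```python
# Effective hub address per access size on the Propeller 1:
# long accesses ignore address bits 1..0, word accesses ignore bit 0,
# byte accesses use all bits; bits 31..16 are ignored everywhere (64KB).

def effective_hub_address(addr, size):
    addr &= 0xFFFF          # only 64KB of main memory
    if size == 4:           # rdlong / wrlong
        return addr & ~3
    if size == 2:           # rdword / wrword
        return addr & ~1
    return addr             # rdbyte / wrbyte

print(effective_hub_address(99, 4))   # 96
print(effective_hub_address(99, 2))   # 98
print(effective_hub_address(99, 1))   # 99
```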

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • cgracey Posts: 14,133
    edited 2008-01-14 01:37
    Phil, that's ingenious, all right! Good job!

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔


    Chip Gracey
    Parallax, Inc.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-14 02:49
    Thanks, guys! There's more to come. It may take a couple days, though...

    -Phil
  • hippy Posts: 1,981
    edited 2008-01-14 14:46
    I've done some more testing with my LmmVm ...

                     m:ss  sec  ratio
    
    Traditional LMM  7:20  440  1:7.4    1 x rdlong + 1 x jmp
    Unrolled LMM     5:40  340  1:5.7    2 x rdlong + 1 x jmp
    Unrolled LMM     5:20  320  1:5.4    3 x rdlong + 1 x jmp
    Unrolled LMM     5:00  300  1:5      4 x rdlong + 1 x jmp
    Unrolled LMM     4:40  280  1:4.7    5 x rdlong + 1 x jmp
    Unrolled LMM     4:40  280  1:4.7    6 x rdlong + 1 x jmp
    Reversed LMM     4:40  280  1:4.7
    Native PASM      1:00   60  1:1
    
    
    



    So some general observations on LMM which I believe may be useful ...

    Always unroll a traditional LMM loop. Even one un-rolling gives a massive speed improvement for minimal extra code use ( three longs ).

    Unrolling is a matter of diminishing returns. It eats up more code space and is limited by how many contiguous native instructions there are between calls into kernel. The ideal amount of unrolling depends on the contiguous nature of native instructions being interpreted.

    Reversed LMM delivers performance as good as any unrolled LMM. That's to be expected, since it never misses a 'hub access sweet spot'. It uses a fixed amount of code space, less than a 2 x rdlong unrolled loop. There is no "how much unrolling is ideal?" question to answer.

    Inefficiencies in calls to the kernel have an impact on overall performance. Hub access sweet spots may get missed there. The more calls, the greater the impact. The greater the frequency of kernel calls, the closer every implementation comes to traditional unrolled LMM performance.

    Choice of LMM branch addressing mechanism can have a major effect. Absolute addresses usually need adjusting ( can be done at compile time for traditional LMM ), and are more complicated for reversed LMM. Relative +/- addressing is faster to execute but not easy to write using the PropTool; needs an "@" symbol for current hub address like "$" for current cog address to allow relative offsets to be calculated, "long @-@Label". Third-party tools can overcome this lack.

    Using the PropTool to write LMM code necessitates overheads to make life easier for the LMM coder. Third party tools would minimise such overheads. The overheads appear to be the same regardless of LMM type.

    The 1:4 nirvana for LMM will never be reached in a real-world program, but can be brought close. Efficient calls into the kernel are the key there. The test program used for these benchmarks is not ideal as it does its own hub access which derails smooth running.

    Debugging reversed LMM kernel calls can be a PITA; descending addresses and not long aligned. Working out when non-aligned addresses have to be corrected or can simply be added to or subtracted from takes some thought and paper-working. Always aligning could add unnecessary overhead and waste code space.

    It is not easy to produce a single source code LMM implementation which can have compile-time or run-time selection of traditional / unrolled and reversed LMM; "add pc" needs to become "sub pc" in numerous places. Kernel developing and debugging with traditional LMM and then switching to reversed LMM is easy and would probably be preferable. No conditional compilation supported by the PropTool though so third-party tools or extensive error-prone manual editing need to be used.

    Reversed LMM may be complicated for the LMM developer, but not for the end-user. The end-user should see no difference in what they have to do to use and code in LMM regardless of implementation.

    Reversed LMM has an overhead in needing to reverse the LMM code, either at run-time, or at compile time using third-party tools. Run-time reversal of LMM code is not excessively time consuming nor code space hungry, can be coded in PASM for maximum speed. Trying to write LMM code in reverse order using the PropTool is unnatural and hard, but could be done.

    All-in, from my perspective, Traditional LMM is easier for kernel development, reversed LMM better for delivery. Phil deserves heaps of praise for his innovative thinking and solution.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-14 19:09
    Wow, hippy, you've really put a lot of work into this! Thanks for the thorough analysis!

    The alignment issues for the reverse VM are something I'm glad you brought up. I was planning on mixing longs and words in the VM code, so the preprocessor will have to be sure to start on a long boundary and always issue words in even multiples.

    The absolute vs. relative addressing issue is a complex one. Choosing to use relative addressing seemed like a no-brainer at first, since you can't determine absolute hub addresses at compile time, and the VM would have to add the object offset to get the target address. OTOH, each jump uses up an extra long for the target offset. This could be eliminated for absolute jumps by using a cog-resident address table. Then you'd encode the address table index into the destination field of the jmp (actually a jmpret ... nr) to the VM jmp handler (_jmp):

               [b]jmp[/b]     target  [i]becomes[/i]
               [b]jmpret[/b]  target_reg,#_jmp [b]nr[/b]
               ...
    target_reg [b]word[/b] @target    'This line goes into the cog space within the VM.
    
    
    


    Of course, no more than 512 target addresses can be accommodated this way. For that reason, it won't work well for relative jumps, since every relative address could be different, and every jmp would thus require a different table entry. Also, indexing the table and unpacking adjacent words adds to the VM overhead.
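    The packing Phil describes can be sketched as follows (Python for illustration; the destination field occupying bits 17..9 is the Propeller 1 instruction layout, and the two-words-per-long table is as described above):

```python
# Encode a 9-bit jump-table index into an instruction's destination field,
# and unpack word-sized target addresses stored two per cog long.

def pack_table_index(index):
    """Place a table index (0..511) into the destination field, bits 17..9."""
    assert 0 <= index < 512
    return index << 9

def unpack_word_entry(cog_long, index):
    """Entry `index` from a table packed two words per long:
    even index -> low word, odd index -> high word."""
    return (cog_long >> 16) if (index & 1) else (cog_long & 0xFFFF)

table_long = (0x1234 << 16) | 0x0ABC            # two packed word addresses
print(hex(unpack_word_entry(table_long, 0)))    # 0xabc  (even: low word)
print(hex(unpack_word_entry(table_long, 1)))    # 0x1234 (odd: high word)
```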

    Short branches, OTOH, are easy:

               [b]br[/b]      target   [i]becomes[/i]
    :here      [b]add[/b]     _pc,#@target-@:here    'for forward branches
    :here      [b]sub[/b]     _pc,#@:here-@target    'for backward branches
    
    '[b]Note:[/b] Can't use [b]$[/b] here, since it's relative to the last [b]org[/b].
    
    
    


    This provides a branch range of ±128 instructions.

    One advantage to using a jump table is that it becomes possible for a pre-processor to optimize between jmps and brs. This is because each type has the same compiled length. So a decision about which type to use won't affect decisions that have already been made in the code above it.

    A final note about traditional vs. backward VMs: With a traditional VM, preprocessing could be done with a good macro facility. That's simply not possible with the reverse VM unless the reversal is done at runtime. Also, to accommodate hyperthreading (two processes running in parallel by taking turns), the reverse VM has to be unrolled to four fetches, decrementing _pc0 by 7 and 1, and _pc1 by 4 and 4.

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 1/14/2008 7:51:39 PM GMT
  • hippy Posts: 1,981
    edited 2008-01-14 19:35
    You gave me a +30% improvement in execution speed and pointed out the major flaw in my simple unrolled loop, so I thought it was only fair to help return the favour. :)

    I think VMs and LMMs are going to become increasingly useful and important as time goes on. I actually see LMM as an integral part of any generic VM now, as well as in their own right. A 4 MIPS-upwards LMM on a 20 MIPS core is quite impressive.

    Relative addressing isn't too bad in my implementation because I replace "jmp #dst" with a "jmp #LMM_jmp" with a word/long following holding the hub address and that would simply need converting to +/-. It does need an extra hub access though to fetch it. This is where I lose efficiency I think in my LMM implementation.

    I'm not sure about using a jump table as it eats into available cog memory and limits user register use and number of jumps. Ultimately every LMM compiler should generate its own LMM kernel tailored to what it needs so it can choose 'the best strategy' for any given source code. That's possibly a bit of a way off though !

    As we head towards 'optimal performance' it becomes easier to look at each individual issue in isolation and weigh up the pros and cons of each possibility.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-14 20:19
    Hippy,

    In your VM code you have a constant value of $0010 - 4 assigned to k_CogBp_Minus_4. This is added to the @addrs to get the real hub address. I take it that this value is application-specific, depending on where the object is actually loaded, right? (As you can see, I'm still dithering about the relative address thing! :) )

    As to the jump table, I agree that it could take up some precious space. However, I see some offsetting considerations:
    • It can't take up more than 256 longs, since the word index is limited to 0 - 511.
    • Since the user code is executed from the hub, much more room is available in the cog, anyway, than with PASM code (assuming the stack is in the hub).
    • A jump table would free the user from having to make the br/jmp decision for each jump in his program. The preprocessor could do the optimizing, since all addresses are known after the first pass, prior to any optimization.
    • The jump table could contain absolute addresses, since the object offset can be added to each one just once. (Of course, this is negated by the unpacking overhead.)
    • With a jump table, there'd be no advantage to implementing djnz, tjz, and tjnz as special cases. They could simply be preprocessed to:
             [b]sub[/b]     reg,#1 [b]wz[/b]  'DJNZ
      [b]if_nz jmpret[/b]  table_index,#_jmp [b]nr[/b]
    
             [b]test[/b]    reg [b]wz[/b]  'TJZ
      [b]if_z  jmpret[/b]  table_index,#_jmp [b]nr[/b]
    
             [b]test[/b]    reg [b]wz[/b]  'TJNZ
      [b]if_nz jmpret[/b]  table_index,#_jmp [b]nr[/b]
    
    
    


    Of course, implementing the latter point disturbs the zero flag, but that's a pretty minor thing.

    Here's a thought for the jump table: Have two jump processors, one for even table (word) indices and one for odd table indices. One of them will need to shift the table entry and the other one won't. The preprocessor can make a simple determination whether to assign an even or odd address, based on how often the associated label is referenced. (This isn't optimum, of course, since a static count may not reflect the dynamic reality.)

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 1/14/2008 8:36:24 PM GMT
  • hippy Posts: 1,981
    edited 2008-01-14 21:05
    you have a constant value of $0010 - 4 assigned to k_CogBp_Minus_4 ...

    Yes, when using "long @Label" the address put into hub memory is offset to the base of the object it's in, and that starts at $0010 for the top-level object / main program. It would need to change if the LMM is in a sub-object.

    The -4 is because an 'add #4' follows 'add k_CogBp_Minus_4'; it was easier to do it that way than add the $0010 and jump round the second addition.

    I've not written much LMM which uses loops or jumps/calls to other LMM so I'm not an expert ( and also the overhead hasn't been much of a concern ). When I did do it, it was before I'd done LmmVm and it was a nightmare. I like the idea of embedding the offset in the jmp/jmpret, +/-256 longs. Maybe do that, and force a different/more complex jump when needed. Hopefully there will only be a few of the latter. I'll have a think on the options.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-15 00:52
    The devil is certainly in the details. Implementing a jump table is harder than I thought it would be. Here are some options for jmp #label:

    1. mov _pc, _table_reg

    Advantages: Fast, inline, and uses only one instruction.
    Disadvantages: Inefficient use of cog RAM, since each table entry is a long. Can't be used with reverse VM, since there's no way to tell what the last two bits of _pc were without adding a bunch more instructions.

    2. add _pc, _table_reg

    Advantages: Fast, inline, uses only one instruction, and can be used with reverse VM.
    Disadvantages: Requires one long table entry for each jump, instead of one for each target label => more entries.

    3. jmpret _table_reg, #_jmp0 nr 'or _jmp1 for high word in table entry

    Advantages: Can use word-packed jump tables, since it executes external VM code to do the unpacking.
    Disadvantages: Using with any unrolled VM loop entails rolling back the PC and issuing another rdlong just to access the destination field (which contains the register address). This is because you don't know which of the multiple VM-loop "sockets" the jump to _jmpn issued from.

    The reversed-VM code for the third option would look something like this:

    _jmp0   add     _pc,#4            'Repoint to the instruction.
            rdlong  _inst0,_pc        'Reread it. 
            shr     _inst0,#9         'Get the jump table address (destination field) into source position.
            xor     _inst0,_mod2mov   'Convert the instruction to:  mov _pc,reg
            nop                       'Wish there was something useful to put here.
    _inst0  nop                       'Load _pc with jump address.
    '       shr     _pc,#16            [i]Used only in _jmp1 to access high word.[/i]
            jmp     #_go
    
            ...
    
    _mod2mov long   (%1010000_0000_0100_100011111 ^ _pc) << 9
    
    
    


    That's a lot of screwing around (7 or 8 instructions, including a hub op) to do a jump. OTOH, by keeping the user jmp code confined to a single long, it allows the preprocessor to optimize out most of the jumps by changing them to relative branches. So it may actually save time in the long run.

    -Phil
  • hippy Posts: 1,981
    edited 2008-01-15 05:28
    I think I've found a problem with trying to embed a cog register address inside a native
    instruction call into the kernel. For example, to load a cog register with a 32-bit constant
    I currently use ...

        jmp    #LMM_Load
        long   <reg>
        long   <constant>
    
    
    



    better is ...

        jmpret <reg>,#LMM_Load NR           ' or ... long LMM_CALL | LMM_Load | <reg> << 9
        long   <constant>
    
    
    



    When LMM_Load gets executed, in the original form ( <reg> in the next long ) the pc already
    points to that long, so a simple rdlong fetches it and the <reg> can be put where necessary.

    Trying to extract the <reg> from the instruction is easy in the traditional LMM when
    not unrolled: it's in cog register _xeq where the jmp was executed. If there are two
    or more _xeq's ( as with unrolled LMM and reversed LMM ), when one hits LMM_Load
    one doesn't know which _xeq to retrieve <reg> from without checking to see what
    pc holds.

    It may be that the rdlong doesn't hit a sweet spot so the extra overhead may be no
    greater. Testing the pc would require using C or Z which means having to preserve
    those flags ( avoided so far ) or some hackery ( untested ) to get round that ...

          mov    tmp,pc
          and    tmp,#1
          djnz   tmp,#:Skip
          mov    _xeq1,_xeq0
    :Skip movd   :Opc,_xeq0
          nop
    :Opc  rdlong 0-0,pc
    
    



    I'm going to have to do some further testing.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-15 05:55
    Using jmpret <reg>,#LMM_op nr with unrolled code, I haven't found a way around backing up and rereading the command from the hub to retrieve the destination field. I just reread your code. Now I see what you did. Very clever!

    One issue I see with having long constants embedded in the code is that making the statement that uses it conditional is harder. That's because the "condition bits" in the constant might not all be zero, which is necessary to make the constant a nop. Of course, you could have 16 different LMM_load commands, one for each condition code, so you could leave the four CC bits zero in the constant and fill them in in the VM; but that would be stretching things!
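    The condition-bit concern can be stated concretely: on the Propeller 1 the condition field sits in bits 21..18, and only a long with all four bits zero (if_never) is guaranteed to behave as a nop. A quick sketch (the bit positions come from the P1 instruction format, not from this thread):

```python
# An inline long constant is only safe to fall through as code if its
# condition field (bits 21..18) is 0000, i.e. "if_never".

def is_safe_inline_constant(value):
    """True if `value`, executed as a P1 instruction, would never execute."""
    return (value >> 18) & 0xF == 0

print(is_safe_inline_constant(0x0003FFFF))   # True: condition bits all zero
print(is_safe_inline_constant(0x00040000))   # False: would execute
```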

    I think my approach would be similar to the jump table idea: Just put the constants above #511 into cog registers and load them from there. (Of course, I never was planning to put the stack in the cog, so there seems like there'd be room aplenty. But I might get fooled.)

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 1/15/2008 6:53:58 AM GMT
  • Ariba Posts: 2,685
    edited 2008-01-15 06:49
    I also made my own LMM kernel (traditional, 4 times unrolled) with an Assembler for it. All is in a very early state and not ready to release.

    Hippy, I also use a LoadLong instruction that loads the next long into a dedicated register, but your idea with a Destination Register is very smart. Here is some code that should do it:
          sub    pc,#4
          rdlong xeq0,pc      'reread the instruction
          shr    xeq0,#9      'Dst->Src
          movd   :Opc,xeq0    'Src! to Dst of Opc
          add    pc,#4
    :Opc  rdlong 0-0,pc
          add    pc,#4        'Skip long
    
    



    Phil:
    Phil Pilgrim (PhiPi) said...

    Of course, implementing the latter point disturbs the zero flag, but that's a pretty minor thing.
    First I was thinking similarly, but then I changed my LMM core to not affect any flags. That makes it much easier for the application code.

    Andy
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-15 06:55
    Ariba,

    How do you make your LoadLong conditional?

    -Phil
  • Ym2413a Posts: 630
    edited 2008-01-15 07:10
    This is truly interesting.
    I personally like the idea of the VM program residing anywhere in memory, ready to be loaded and run.
    Addresses and native code that are relocatable in memory open the door for all types of complex programs.

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Share the knowledge: propeller.wikispaces.com
    Lets make some music: www.andrewarsenault.com/hss

  • Ariba Posts: 2,685
    edited 2008-01-15 07:15
    My Loadlong is not conditional (the only one of my LMM instructions that isn't), or better: it's conditional only if you use a long value < 18 bits. I never had a problem with this and have coded a lot with LMM in the last weeks.
    If really necessary I would simply use a conditional Fjump or Fbranch before the LoadLong (Fbranch is my 1-long jump instruction with a jump destination of +/-1024 addresses). The Fjump also uses the next long, but with an absolute address in this long. That's no problem because the Prop II also has no more than 18 bits for the address.

    I use the Cog Ram for a little Stack and a lot of registers and a cache, so I don't want to have jump or constant tables in it.

    Andy
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-15 08:21
    Hippy,

    You might want to recheck your code fragment that tests pc[1..0]. In your movd, _xeq0[src] (not _xeq0[dst]) gets copied to :Opc[dst]. Here's something that might work:

    _ld     [b]mov[/b]      _scr,_pc
            [b]and[/b]      _scr,#3
            [b]tjz[/b]      _scr,#$+2
            [b]mov[/b]      _xeq0,_xeq1
            [b]shr[/b]      _xeq0,#9
            [b]movd[/b]     :ld,_xeq0
            [b]nop[/b]
    :ld     [b]rdlong[/b]   0-0,_pc
            [b]sub[/b]      _pc,#4
            [b]andn[/b]     _pc,#3
            [b]jmp[/b]      #_go
    
    
    


    Here's another version (longer code, quicker execution):

    _ld    [b]mov[/b]       _scr,_pc
           [b]and[/b]       _scr,#3
           [b]tjz[/b]       _scr,#:use1
           
    :use0  [b]shr[/b]       _xeq0,#9
           [b]movd[/b]      :ld0,_xeq0
           [b]nop[/b]
    :ld0   [b]rdlong[/b]    0-0,_pc
           [b]sub[/b]       _pc,#4
           [b]jmp[/b]       #_xeq0+1
    
    :use1  [b]shr[/b]       _xeq1,#9
           [b]movd[/b]      :ld1,_xeq1
           [b]nop[/b]
    :ld1   [b]rdlong[/b]    0-0,_pc
           [b]sub[/b]       _pc,#4
           [b]jmp[/b]       #_xeq1+1
    
    
    


    -Phil
  • hippy Posts: 1,981
    edited 2008-01-15 17:03
    Thanks Phil. I also like the "longer code" solution. How many times the code has to be included
    determines the choice between short/slow and longer/fast there.

    I have gone back to the traditional LMM because I still have a mental block with descending pc
    addresses. The same problem occurs with an unrolled loop, more so because the pc is always
    long aligned. The solution there is not to use two "add pc,#4" but "add pc,#7" then "add pc,#1"
    as with the reversed LMM which allows _xeq0 and _xeq1 as last executed native opcode to be
    determined. So you had a 'double-whammy' of a good idea.

    On handling conditionals, I do that by a conditional skip of the next multi-instruction LMM code
    which is quite a lot of overhead. My philosophy is to make LMM easy to write and leave getting
    efficiency down to the LMM writer; if they pre-load a register with a long they can use normal
    conditional register 'mov' later at maximum LMM speed. If they want an easy life, they have
    to suffer the inefficiency consequences.

    I think it has to be accepted that LMM isn't PASM so there always will be some inefficiency but
    as efficient as possible is the ultimate goal. With the right tools that becomes so much easier.

    I'm going to put some serious effort into seeing if I can create a single source to develop the
    LMM kernel using traditional LMM and dynamically change that to reversed LMM at run-time.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-15 17:28
    Here's a load long that's shorter and quicker still:

    _ld     [b]mov[/b]      _scr0,_pc
            [b]mov[/b]      _scr1,_pc
            [b]sub[/b]      _pc,#4
            [b]and[/b]      _scr0,#3
            [b]tjnz[/b]     _scr0,#:use1
    
    :use0   [b]xor[/b]      _xeq0,_fix
            [b]jmp[/b]      #_xeq0
    
    :use1   [b]xor[/b]      _xeq1,_fix
            [b]jmp[/b]      #_xeq1
            ...
    _fix    [b]long[/b]     (%010111_0001_1111 ^ %000010_0010_1111) << 18 | (_scr1 ^ _ld)
    
    
    



    It works by converting the jmpret <reg>,#_ld nr at _xeq0/1 to a rdlong <reg>,_scr1 in situ and then jumping there (assuming I've got the bits right). For your forward VM change the sub _pc,#4 to an add.
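    The XOR-patching idea generalizes: if the jmpret and the desired rdlong differ only in fixed bits, one precomputed constant converts the one into the other while leaving the destination field ( <reg> ) alone. A sketch with illustrative, not actual, Propeller encodings:

```python
# Patch instruction A into instruction B by XOR, preserving the 9-bit
# destination field (bits 17..9). The field values below are made up for
# illustration; only the mechanism matches the post.

DEST_MASK = 0x1FF << 9

def make_fix(a, b):
    """XOR constant converting A to B everywhere outside the dest field."""
    return (a & ~DEST_MASK) ^ (b & ~DEST_MASK)

JMPRET = 0x5C000000 | (0x123 << 9) | 0x0F0   # hypothetical encoding
RDLONG = 0x08000000 | (0x123 << 9) | 0x0A5   # same dest, different op/src

fix = make_fix(JMPRET, RDLONG)
patched = JMPRET ^ fix
print(patched == RDLONG)   # True: the dest field survived the patch
```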

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 1/15/2008 7:34:22 PM GMT
  • hippy Posts: 1,981
    edited 2008-01-16 02:25
    Hmmmm .... Using relative branching with the offset in the unused destination reg of the LMM call
    jmp instruction, I'm getting notable deterioration over having the destination address in
    the following long. This is using the traditional LMM with 2 x rdlong.

    Once again it's LMM code size versus execution speed and a delicate balance of exactly what
    LMM code there is.
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-16 03:20
    For relative branching, I'm just using add _pc,#offset and sub _pc,#offset: no LMM call necessary. This works for a range of +/-128 instructions. I'm still going with the jump table for longer jumps, which keeps the instruction lengths of branches and jumps the same (i.e. a single long). This is important for me, because it allows the preprocessor to replace jumps with relative branches wherever it can without things moving around in the process. Hopefully, the number of long jumps will be so minimal that the jump table can be kept small.
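    The +/-128 figure follows from the 9-bit immediate of add/sub (a quick check, assuming byte offsets and 4-byte LMM instructions):

```python
# add/sub pc,#offset carries a 9-bit immediate (0..511 bytes); at 4 bytes
# per LMM instruction that reaches about 127 instructions either way.

MAX_IMMEDIATE = 511
BYTES_PER_INSTR = 4

def reachable_by_short_branch(delta_instructions):
    return abs(delta_instructions) * BYTES_PER_INSTR <= MAX_IMMEDIATE

print(MAX_IMMEDIATE // BYTES_PER_INSTR)   # 127
print(reachable_by_short_branch(127))     # True
print(reachable_by_short_branch(200))     # False
```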

    -Phil
  • hippy Posts: 1,981
    edited 2008-01-16 03:50
    I forgot about add/sub pc,#offset :)

    I went for +/-512 (separate LMM handlers for forward and backward jumps), which could be +/-640 if +/-128 are handled by add/sub pc,#offset.
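    A sketch of how that works, assuming (as hippy describes) the 9-bit destination field of the LMM call jmp carries an instruction-count offset and the direction comes from which handler is invoked — names here are illustrative, not the actual kernel's:

```python
# Model of a VM-mediated relative branch: the 9-bit *dest* field of the
# LMM "call" jmp holds an instruction-count offset, and separate
# forward/backward handlers in the kernel supply the sign.
FIELD_BITS = 9
per_handler = 1 << FIELD_BITS      # 512 distinct offsets per direction

def branch_target(pc, offset, forward=True):
    """Hub byte address after a VM-mediated branch of `offset` instructions."""
    assert 0 <= offset < per_handler
    return pc + 4 * offset if forward else pc - 4 * offset

# +/-512 via the handlers; if the +/-128 neighborhood is already covered
# by add/sub pc,#imm, biasing the handlers past it stretches total reach
# to about +/-640 instructions, as noted above.
assert per_handler == 512
assert per_handler + 128 == 640
```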

    So many different ways things can be done :) :)
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-16 04:16
    Most interesting! I hadn't thought about using a VM-mediated relative branch. The extended range would definitely help to reduce my jump table size!

    -Phil
  • Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2008-01-16 20:52
    Here are a couple of benchmarks for RevLMM vs. Spin. In the first one, a pin is toggled using subroutine calls: a call for high, followed by a call for low, ad infinitum:

    [b]CON[/b]
    
      [b]_clkmode[/b] = [b]xtal[/b]1 + [b]pll[/b]16x
      [b]_xinfreq[/b] = 5_000_000
    
    [b]VAR[/b]
    
      [b]word[/b] stack[512]
    
    [b]OBJ[/b]
    
      vm : "lmm_vm"
      
    [b]PUB[/b] start
    
      stack[0] := @@0
      stack[1] := 512
      stack[2] := @my_prog
      vm.start(@stack)
    
      [b]dira[/b]~~
      [b]repeat[/b]
        go_hi
        go_lo
    
    [b]PRI[/b] go_hi
    
      [b]outa[/b]~~
    
    [b]PRI[/b] go_lo
    
      [b]outa[/b]~
        
    [b]DAT[/b]
            
                  [b]jmp[/b]       vm#_ret                   'Return from lo. 
    lo            [b]mov[/b]       [b]outa[/b],#0                   'Start of lo: lower pin 0.
    
                  [b]jmp[/b]       vm#_ret                   'Return from hi.
    hi            [b]mov[/b]       [b]outa[/b],_0x0000_0001         'Start of hi: raise pin 0.
    
                  [b]jmpret[/b]    jmp_tbl_00,vm#_jmp_lo [b]nr[/b]  'Loop back to loop.
                  [b]jmpret[/b]    jmp_tbl_01,vm#_call_lo [b]nr[/b] 'Lower pin.
    loop          [b]jmpret[/b]    jmp_tbl_00,vm#_call_hi [b]nr[/b] 'Raise pin.
                 
                  [b]mov[/b]       [b]dira[/b],#1                   'Start of prog: set pin 0 to output.
    
                  [b]org[/b]       $1ed                      'Addr of tables in VM cog.
    _0x0000_0001  [b]long[/b]      1                         'Literal table.
                  
    jmp_tbl_00    [b]word[/b]      @loop,@hi                 'Jump table.
    jmp_tbl_01    [b]word[/b]      @lo,0
                  
    my_prog       [b]word[/b]      2,1                       'Jump table size, Literal table size.
    
    
    


    Here's a scope trace of the output: RevLMM in yellow, Spin in blue. LMM is faster by about 8.5:1. This demonstrates the performance hit entailed by too many kernel calls.

    (Scope trace image: attachment.php?attachmentid=51478)

    Things improve when the real work is done by inline code. In this example the pin is toggled in a simple loop: no subroutine calls and a quick relative branch back to the beginning:

    [b]CON[/b]
    
      [b]_clkmode[/b] = [b]xtal[/b]1 + [b]pll[/b]16x
      [b]_xinfreq[/b] = 5_000_000
    
    [b]VAR[/b]
    
      [b]word[/b] stack[512]
    
    [b]OBJ[/b]
    
      vm : "lmm_vm"
      
    [b]PUB[/b] start
    
      stack[0] := @@0
      stack[1] := 512
      stack[2] := @my_prog
      vm.start(@stack)
    
      [b]dira[/b]~~
      [b]repeat[/b]
        [b]outa[/b]~~
        [b]outa[/b]~
    
    [b]DAT[/b]
            
                  [b]add[/b]       vm#_pc,#12
                  [b]mov[/b]       [b]outa[/b],#0
    loop          [b]mov[/b]       [b]outa[/b],#1             
                  [b]mov[/b]       [b]dira[/b],#1                   'Start of prog: set pin 0 to output.
    
                  [b]org[/b]       $1ed                      'Addr of tables in VM cog.
    _0x0000_0001  [b]long[/b]      1                         'Literal table.
                  
    jmp_tbl_00    [b]word[/b]      @loop,@hi                 'Jump table.
    jmp_tbl_01    [b]word[/b]      @lo,0
                  
    my_prog       [b]word[/b]      2,1                       'Jump table size, Literal table size.
    
    
    


    Here's the scope trace, showing the oft-cited performance ratio of 29:1:

    (Scope trace image: attachment.php?attachmentid=51479)
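    The reversed listing order above can be modeled with a few lines of Python: the fetch loop decrements the pc each fetch (as djnz would), so the program executes bottom-up, and adding to the pc branches back up the execution order. Opcode names and the pc bookkeeping here are stand-ins, not the real kernel's:

```python
# Minimal model of a reversed-LMM fetch loop: pc steps *downward* through
# hub memory, so the program is listed in reverse, and "add_pc 12" lands
# back on the loop label, giving a tight three-instruction toggle loop.

program = [                 # hub byte addresses 0,4,8,12 (listing order)
    ("add_pc", 12),         # addr 0:  relative branch back to the loop
    ("outa", 0),            # addr 4:  lower the pin
    ("outa", 1),            # addr 8:  loop: raise the pin
    ("dira", 1),            # addr 12: executed first (highest address)
]

def run(steps):
    pc, levels = 12, []
    for _ in range(steps):
        op, arg = program[pc // 4]
        pc -= 4                   # reversed LMM: pc decrements after each fetch
        if op == "add_pc":
            pc += arg             # a relative branch is still just an add
        elif op == "outa":
            levels.append(arg)
    return levels

print(run(10))   # pin levels seen: [1, 0, 1, 0, 1, 0]
```

    Note that the branch still works out to a plain add on the pc, which is why the 29:1 ratio holds even across the loop boundary.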

    -Phil

    Post Edited (Phil Pilgrim (PhiPi)) : 1/16/2008 9:08:48 PM GMT
  • Jasper_M Posts: 222
    edited 2008-01-17 21:47
    WOW! This technique can also be used in GFX drivers, to get scanline buffers etc. to Cog RAM. I'll prolly use this in my next GFX driver if/when I decide to make one.