Stupid LMM tricks
ersmith
These are some kind of half-baked ideas about LMM (Large Memory Model) for the Prop1.
The traditional LMM loop as first proposed by Bill Henning looks like:
LMM_JMP
        rdlong pc, pc              ' fall through
LMM_Loop
        rdlong instr, pc
        add    pc, #4
instr   nop
        jmp    #LMM_Loop

It just misses a hub window, so it effectively takes 8 COG instructions to "interpret" one instruction from HUB (the rdlong takes at least 8 cycles, and more often it will take another 12 cycles as we wait for the window to come around again). Branches are implemented as a "JMP #LMM_JMP" followed by the long address to jump to.
Over the years there have been various discussions on how to improve this, ranging from the obvious (unroll the loop) to stashing some functions in COG (FCACHE) to, I think, some tricky backwards loops using djnz.
What I'm interested in in this thread are ways to use the 3 "wasted" instruction times in the small, simple LMM loop (the 12 cycles spent waiting for the hub window to come around) for something useful.
For example, consider:
LMM_JMP
        mov    pc, arg
LMM_Loop
        rdlong instr, pc
        add    pc, #4
instr   nop
        mov    temp, instr
        andn   temp, cmask
        tjnz   temp, #LMM_Loop
decompress_instr
        mov    arg, instr          ' save original data
        shr    instr, #23          ' isolate COG address
        and    arg, amask          ' isolate data
        jmp    instr               ' go run it; routine must exit with "jmp #LMM_Loop"

cmask   long   $FF80FFFF
amask   long   $0000FFFF
This uses some of the "wasted" time to test for an interesting class of instructions (ones with the IF_NEVER condition) and re-purposes them to call arbitrary COG addresses, with a 16-bit parameter placed in "arg" as a bonus. For example, a branch in traditional LMM looks like:
        jmp    #LMM_JMP
        long   @newpc

but in this revised format it could look like:
        long   (LMM_JMP<<23) + @newpc

This might be slightly slower than the original (I haven't checked the hub windows to see), but it'll certainly save space; it's like a "CMM-light" and is very easy to implement.
Comments
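A minimal sketch of a trace-recording loop (wrap handling at the end of the cache is omitted, and everything other than tracecache and TRACE_SIZE is an assumed name):

LMM_Loop
        rdlong instr, pc
        add    pc, #4
        movd   t_wr, traceptr      ' aim the store at the next trace slot
        add    traceptr, #1        ' wrap at tracecache+TRACE_SIZE omitted
t_wr    mov    0-0, instr          ' record the fetched instruction...
instr   nop                        ' ...then execute it
        jmp    #LMM_Loop

traceptr long  tracecache          ' next free slot in the trace cache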
Here the TRACE_SIZE long words starting at "tracecache" are filled up with instructions as they are executed.
This could be interesting for debugging -- it gives you a window into what was happening before a certain point.
But there's another possibility as well: if you see a jump that goes backwards to an instruction that's already in the trace, then potentially one could jump directly into the trace cache and from that point run the loop at full COG speed (rather than at the 1/8 speed of LMM).
There are some tricky bits to handle there; for example, a branch that isn't taken ends up in the trace cache as "if_x jmp #LMM_JMP" followed by "long @addr" (which will be a no-op as long as addr fits in 18 bits, since the condition field then reads IF_NEVER). When re-running this, LMM_JMP would have to know whether it was called from the trace cache (in which case the new PC has to be picked up out of COG memory) or from HUB. But that's solvable in principle: instead of "jmp #LMM_JMP" use "jmpret RA, #LMM_JMP" and then check whether RA contains "nextinstr" (in which case we were running from HUB and can fetch the new PC via rdlong pc, pc) or something in the trace cache (in which case we need to use RA to find the new pc).
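Sketched out, the check might look like this (untested; RA is a spare register, and nextinstr is the label right after the instr slot in the loop):

LMM_JMP
        and    RA, #$1FF           ' jmpret writes only the low 9 bits
        cmp    RA, #nextinstr wz   ' did the jmpret come from the LMM loop?
if_z    rdlong pc, pc              ' yes: the new PC follows inline in HUB
if_z    jmp    #LMM_Loop
        movs   getpc, RA           ' no: RA points at the "long @addr"
        nop                        '     recorded in the trace cache
getpc   mov    pc, 0-0             ' fetch the new PC from COG
        jmp    #LMM_Loop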
Being able to quickly find which jumps go back into the trace cache is also tricky. But we can catch an interesting class by comparing "tracestart" with the new pc in LMM_JMP; if they match we're going to the start of the trace cache, otherwise we call RESTART_TRACE and start a new trace. That way every simple loop ends up detectable.
This is all kind of a half-baked idea, but it seems like it might have potential to unlock a sort of "automatic" fcache, so I thought I'd throw it out for discussion. I'd be interested in others' thoughts. Has anyone else tried anything along these lines?
Eric
In my XMM thingy I'm working on, jumping within the current sector (128 longs) is achieved using "movs vm_lmmpc,#123". (The downside is that jumping to another sector has to go through a fairly long routine, and if the sector isn't cached you get some 30K cycles (an estimate; I haven't measured) of waiting for the SD card.)
Back on topic, you could avoid adding complexity to the jump pseudo-op by directly jumping to the relevant position in the trace and avoiding inline argument reads in the loop (by moving them to a register before the loop)
Speaking of loading registers, has anyone had this idea before? (I don't think so, but who knows?)
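Roughly: pack the register number and the value into a single long that the LMM loop decodes. As a sketch (the exact field layout, register number in the low 9 bits and value in the upper 23, is just one possible choice):

' a "load register" long, encoded as (value << 9) | regnum
' (note the value bits land on the condition field; hence the
' no-op caveat raised below)
loadreg
        movd   lr_st, instr        ' low 9 bits name the register
        shr    instr, #9           ' upper 23 bits are the value
lr_st   mov    0-0, instr          ' store the value into the register
        jmp    #LMM_Loop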
EDIT: now that I think about it, if I keep all my registers below $100, I can even load 24-bit values (or below $80 for 25-bit values, and so on).
Figuring out a way to hide the latency of external memory is definitely interesting. I've thought sometimes that with an emulator (e.g. of another processor) you could translate on the fly as the data is read in from the card. Compiling code from, say, Z80 to Propeller is slow, but so is the card read, so you could use two cogs. Have one cog read the data, and as soon as a byte is ready the other cog can fetch it and compile it in parallel with the next read. At the end you have two buffers: the 512 byte raw sector and a larger (maybe 2K?) buffer of translated instructions which can then be directly executed. Translating is expensive, but the idea is that because the read is so slow and the translation is in parallel it's almost free. And then the runtime becomes really simple and fast.
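As a sketch of the cog-to-cog handoff (the mailbox layout, a hub sequence counter plus a byte buffer, is an assumption):

' reader cog: after each byte arrives from the card
        wrbyte data, bufptr        ' publish the byte first...
        add    bufptr, #1
        add    seq, #1
        wrlong seq, seqaddr        ' ...then bump the sequence count
' translator cog: wait for the next byte, then compile it
wait    rdlong t, seqaddr
        cmp    t, seq wz           ' anything new yet?
if_z    jmp    #wait
        add    seq, #1
        rdbyte data, rdptr
        add    rdptr, #1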
Ah, but that means adding complexity to the compiler. I was actually hoping to keep the compiler simple (so it still just emits a simple "jmp #lmm_jmp" or something similar) and have the caching happen at runtime. I think it's feasible; I've got something like that working in my RiscV JIT compiler, and was thinking about how to do it for LMM, which is a much simpler interpreter. But the P1 instruction set isn't quite as friendly for this kind of thing as some others.
That's nice. You could run into a problem with conditional execution, though, if the immediate doesn't look like a no-op (conditional bits set to 0, i.e. IF_NEVER).
I wonder if there's a way to emulate the P2 AUGS instruction on P1? Probably; we would have to check a flag in the LMM loop, though, and translate the instruction from something like "add x, #N" to "add x, temp", where "temp" is a register set to a combination of the 23 bits set in AUGS and the 9 bits from the instruction.
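As a sketch (untested; here the 23-bit AUGS payload is assumed to arrive as a separate long following the marker, since the 16-bit "arg" of the scheme above is too narrow):

do_augs                            ' reached from the IF_NEVER decoder
        rdlong augval, pc          ' payload long holds the upper 23 bits
        add    pc, #4
        shl    augval, #9
        mov    augflag, #1
        jmp    #LMM_Loop

' and in the fetch loop, before executing the instruction:
        tjz    augflag, #instr     ' no AUGS pending: just run it
        mov    temp, instr
        and    temp, #$1FF         ' 9-bit immediate from the instruction
        or     temp, augval        ' combine into the full 32-bit value
        movs   instr, #temp        ' re-point the source field at temp
        andn   instr, immbit       ' clear the immediate bit
        mov    augflag, #0
instr   nop
        jmp    #LMM_Loop

immbit  long   1 << 22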
Maybe instead of an LMM direct interpretation loop we need to do a "just in time" compilation of LMM to COG code. That could handle things like AUGS relatively easily...
Recently I had an idea. Perhaps you might think about it.
I did a fast overlay loader, and to implement it I loaded the longs in reverse so that I could catch every hub cycle (i.e. every 16 clocks). Each hub cycle only allows 2 instructions between the rdlongs.
So here's the idea...
Use an LMM loop that operates in reverse, i.e. the next instruction is normally at the next lower address.
To make this work, the assembler must take the code and reverse it. This could be accomplished by an external program that reverses the source code before it is assembled.
Your thoughts?
I think I briefly mentioned something like this in the first post. I don't really like it, myself -- it imposes a lot of complexity on the compiler (having to reverse the whole program) and the gain isn't that great compared to what you get from just unrolling the LMM loop. Even in the best case you're still going to pay a 4x speed penalty for LMM.
Ideally I'd like to get rid of that 4x penalty entirely, at least for frequently used code. Fcache is one way to do that -- at compile time you generate code to load small loops into COG and then run them from there. But not many compilers implement fcache; PropGCC and fastspin are the only ones I know of. For hand-written assembly programming it's a pain to have to write loops in fcache form.
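The load stub itself is short; a sketch of the usual shape (all names here are made up, and the size handling is simplified):

FCACHE_LOAD
        rdlong count, pc           ' long after the call: loop size in bytes
        movd   fc_wr, #fcache      ' copy destination: start of fcache area
        add    pc, #4
        shr    count, #2           ' convert the byte count to longs
fc_wr   rdlong 0-0, pc             ' copy the loop body into COG...
        add    pc, #4
        add    fc_wr, dst1         ' ...stepping the destination field
        djnz   count, #fc_wr
        jmp    #fcache             ' run it at full COG speed; the copied
                                   ' code ends with a jump back to LMM_Loop
dst1    long   1 << 9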
Fcache as traditionally implemented also misses some cases where loops involving subroutine calls would fit in COG memory, but the compiler can't prove that. Neither compiler will cache a loop that involves subroutine calls (except that PropGCC does have a hack to cache small recursive functions that only call themselves).
The trace cache idea seems tantalizingly close to having the LMM interpreter implement the fcache automatically at run time. But as I said it's only half-baked right now, and I wonder if I'm missing something...
This is really cool, and I started to think about whether anything like this would be applicable to LMM. I'm not sure if it's worth it yet though.
-Phil
(1) Well, it has to be done once *per compiler*. There are a plethora of compilers for P1 (and even for P2 now). "Per compiler" also includes each individual assembly language project that uses an LMM (a fair number of PASM programs end up with an LMM loop of some kind to handle slow code).
(2) Has anyone actually implemented a reverse LMM in an actual system (not a test, but a real language/compiler that uses one)? I haven't heard of one, but I'd be curious.
(3) The incremental gain over unrolling the LMM loop lots of times probably doesn't make it worth it. Asymptotically the unrolled LMM loop will approach the ideal performance.
I've actually gone ahead and made the fastspin LMM code somewhat parameterizable, so as to test some of this theory. Below are some results on the xxtea benchmark (xxtea.spin), giving the cycles to decrypt the sample string. It's only one benchmark, so of course it needs to be taken with a grain of salt, but it's a reasonable test for LMM because it does involve some heavy computation.
lmm=trace is a proto-tracing LMM that only records the trace and does nothing with it; basically it's to see what the recording overhead looks like (not bad).
lmm=slow is the dead simple LMM posted above, not unrolled at all.
lmm=4x is the default fastspin LMM, which has the loop unrolled 4 times.
lmm=8x has the loop unrolled 8 times, and lmm=16x has it unrolled 16 times
The last three are presented with fcache disabled (first) and then enabled.
As you can see, the gains from unrolling the loop are diminishing; unrolling it 4 times is a big win, doing it 16 times is probably not worth it; the sweet spot looks to be around 8x unrolling. I would guess that a perfect LMM loop would still clock in at over 90000 cycles, since branches are going to interfere with the hub rotation.
fcache makes an enormous difference, far more than unrolling the loop. So there's definitely still a lot of room for improvement in LMM, if we can find some way to avoid fetching instructions from HUB (which means some kind of caching).
Even though fcache is pretty well understood theoretically, it's not widely deployed in P1 languages, mostly because it's a pain to implement. So it'd be nice to figure out an alternative. Some of the theory may be applicable to other kinds of interpreters too. For comparison, the default Spin bytecode interpreter takes 1044640 cycles to interpret that program, so it's about 10x slower than unrolled LMM and 30x slower than the fcached version.
-Phil
That would actually work, but the instructions would need to be 4 longs apart (more for hubops)
Can we run code from INA?
Highly impractical, would need all 32 pins (or P1V with both ports, but there we have better ways of speeding up LMM)
If you'd like, although I think it will be quite a pain (but perhaps I just don't have my head wrapped around it). I've attached a .zip file with xxtea.spin (original source), xxtea_pasm.spin (result after compilation with fastspin --lmm=slow, the simplest LMM loop), xxtea.lst (listing file for xxtea_pasm.spin), and xxtea.binary. The numbers you'll get when running this will be a little slower than I posted earlier; I realized that fastspin was still using "add pc, #XXX" for short relative branches, and not all LMM schemes can support this, so it's disabled now for everything other than --lmm=orig. That should be all the source code you'd need to experiment with xxtea, although as compiler output it's not very readable. The LMM stuff is near the start of xxtea (it begins with COG resident code, then goes to LMM HUB code at the label "hubentry").
I've pushed the revised fastspin source code into my github repository on the "experiment/lmm" branch. This version of fastspin has a new option "--lmm=xxx", where "xxx" may be one of "orig" (the original fastspin LMM), "slow" (the really simple one), or "trace" (still very much under construction, and not working at the moment). The source code for the LMM loops is in sys/lmm_xxx.spin. To add a new one you may also need to change the jump output code; I think the default settings will work for most LMM modes, but if you need something fancy, the code for emitting the jumps starts near line 752, at the comment that says "handle jumps in LMM mode".
I haven't tried building fastspin on Windows lately, so I don't know if that still works. I tend to cross compile on a Linux machine (Debian Stretch).
If you do decide to tackle this, let me know if I can help with something. I'm convinced it's a bit of a dead end; even in the absolute best case the reversed LMM loop is going to be no more than 2x faster than the slowest LMM loop, or around 80000 cycles, which is still a lot slower than any of the fcache cases. But I guess for programs with large loops that don't fit in fcache it might be worth a go.
Regards,
Eric
Indeed. CALL/RET will require a bit of work, for example, as would any kind of indirect jump table.
-Phil
What I may do is just cut the big btea routine and play with that. First going forward with a traditional LMM loop, and then try reversing it. That should be easier and test the concept without too much effort. So, I'll give that a go.
However, with LMM, it needs to just decrement the hub address by 4, which is not possible with a djnz instruction.
So it's not possible to do a reverse-loop LMM. And now I recall looking at this years ago and finding the same result.
However, I did try a simpler kind of caching LMM and that actually worked well. The caching algorithm is really simple: in LMM_JUMP we check for a backwards branch, and if it's short enough that the code from the branch target to where we are now will fit in the COG memory cache, we copy it in there and jump to it. There's also some work to handle CALL/RET from cache and to patch LMM_JUMP executing from cache so it goes back to the cache (and then avoids the LMM_JUMP routine completely). But the performance bump is pretty nice. Below are some figures for xxtea and for the fftbench benchmark:
Remember that the "lmm=orig" case with a cache involves FCACHE: the compiler has to analyze the code, find the loops, and emit code to load them into COG automatically. The "lmm=cache" does not require this; instead we do that analysis and load at run time, so it will work with any compiler (or none, if you hand write your LMM PASM code). There's certainly a significant overhead if you don't give it a cache, or if the main program loops don't fit in cache. But if they do fit they can pretty much run at full speed.
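The branch-direction test at the top of LMM_JUMP looks conceptually like this (a sketch, not the attached code; the names are assumptions):

LMM_JUMP
        rdlong newpc, pc           ' fetch the branch target
        add    pc, #4
        mov    temp, pc
        sub    temp, newpc wc      ' how far back is the target?
if_b    jmp    #take_jump          ' forward branch: don't cache it
        cmp    temp, maxcache wc   ' would target..here fit in the cache?
if_ae   jmp    #take_jump          ' too big: just branch normally
        ' copy HUB[newpc..pc) into the COG cache (a loop much like an
        ' fcache loader), patch cached jumps, then run from the cache
take_jump
        mov    pc, newpc
        jmp    #LMM_Loop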
I've attached the code for the automatically caching LMM. It's also checked in to the experiment/lmm branch of fastspin.
There's also still room for improvement. As @Wuerfel_21 pointed out copying from HUB to COG backwards can avoid the hub miss penalty, and the cache loading code doesn't do that yet. There may also be some other tricks that would improve performance.
Another funny possibility is using both cog counters and one I/O pin. One counter would simply divide sysclk by 16.
Even in the best case that won't give much improvement over an unrolled LMM loop, will it? An ideal LMM loop executes one instruction every 16 cycles. An 8x unrolled LMM loop takes an extra hub delay slot on top of that, so it works out to 18 cycles per instruction (8x16 + 16 = 144 cycles for 8 instructions); a 16x unrolled loop is 17 cycles per instruction.
A caching LMM can do much, much better, as the experiments above indicate. In the ideal case you get down to 4 cycles per instruction, a 4x speedup over even a perfect non-caching LMM.