Stupid LMM tricks
ersmith
These are some kind of half-baked ideas about LMM (Large Memory Model) for the Prop1.
The traditional LMM loop as first proposed by Bill Henning looks like:
LMM_JMP
        rdlong pc, pc              ' fall through
LMM_Loop
        rdlong instr, pc
        add    pc, #4
instr   nop
        jmp    #LMM_Loop

It just misses a hub window, so it effectively takes 8 COG instructions to "interpret" one instruction from HUB (the rdlong takes at least 8 cycles, and more often it will take another 12 cycles as we wait for the window to come around again). Branches are implemented as a "JMP #LMM_JMP" followed by the long address to jump to.
Over the years there have been various discussions on how to improve this, ranging from the obvious (unroll the loop) to stashing some functions in COG (FCACHE) to, I think, some tricky backwards loops using djnz.
What I'm interested in in this thread are ways to use the 3 "wasted" instruction times in the small, simple LMM loop (the 12 cycles spent waiting for the hub window to come around) for something useful.
For example, consider:
LMM_JMP
        mov    pc, arg
LMM_Loop
        rdlong instr, pc
        add    pc, #4
instr   nop
        mov    temp, instr
        andn   temp, cmask
        tjnz   temp, #LMM_Loop
decompress_instr
        mov    arg, instr          ' save original data
        shr    instr, #23          ' isolate COG address
        and    arg, amask          ' isolate data
        jmp    instr               ' go run it; routine must exit with "jmp #LMM_Loop"

cmask   long   $FF80FFFF
amask   long   $0000FFFF
This uses some of the "wasted" time to test for an interesting class of instructions (ones with the IF_NEVER condition) and re-purposes them to call arbitrary COG addresses, with a 16-bit parameter placed in "arg" as a bonus. For example, a branch in traditional LMM looks like:
        jmp    #LMM_JMP
        long   @newpc

but in this revised format it could look like:
        long   (LMM_JMP<<23) + @newpc

This might be slightly slower than the original (I haven't checked the hub windows to see), but it'll certainly save space; it's like a "CMM-light" and is very easy to implement.
Comments
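A minimal sketch of a trace-recording loop (wrap handling at the end of the cache is omitted, and everything other than tracecache and TRACE_SIZE is an assumed name):

LMM_Loop
        rdlong instr, pc
        add    pc, #4
        movd   t_wr, traceptr      ' aim the store at the next trace slot
        add    traceptr, #1        ' wrap at tracecache+TRACE_SIZE omitted
t_wr    mov    0-0, instr          ' record the fetched instruction...
instr   nop                        ' ...then execute it
        jmp    #LMM_Loop

traceptr long  tracecache          ' next free slot in the trace cache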
Here the TRACE_SIZE long words starting at "tracecache" are filled up with instructions as they are executed.
This could be interesting for debugging -- it gives you a window into what was happening before a certain point.
But there's another possibility as well: if you see a jump that goes backwards to an instruction that's already in the trace, then potentially one could jump directly into the trace cache and from that point run the loop at full COG speed (rather than at the 1/8 speed of LMM).
There are some tricky bits to handle there; for example, a branch that isn't taken ends up in the trace cache as "if_x jmp #LMM_JMP" followed by "long @addr" (which will be a no-op as long as addr fits in 18 bits, since the condition field then reads IF_NEVER). When re-running this, LMM_JMP would have to know whether it was called from the trace cache (in which case the new PC has to be picked up out of COG memory) or from HUB. But that's solvable in principle: instead of "jmp #LMM_JMP" use "jmpret RA, #LMM_JMP" and then check whether RA contains "nextinstr" (in which case we were running from HUB and can fetch the new PC via rdlong pc, pc) or something in the trace cache (in which case we need to use RA to find the new pc).
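Sketched out, the check might look like this (untested; RA is a spare register, and nextinstr is the label right after the instr slot in the loop):

LMM_JMP
        and    RA, #$1FF           ' jmpret writes only the low 9 bits
        cmp    RA, #nextinstr wz   ' did the jmpret come from the LMM loop?
if_z    rdlong pc, pc              ' yes: the new PC follows inline in HUB
if_z    jmp    #LMM_Loop
        movs   getpc, RA           ' no: RA points at the "long @addr"
        nop                        '     recorded in the trace cache
getpc   mov    pc, 0-0             ' fetch the new PC from COG
        jmp    #LMM_Loop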
Being able to quickly find which jumps go back into the trace cache is also tricky. But we can catch an interesting class by comparing "tracestart" with the new pc in LMM_JMP; if they match we're going to the start of the trace cache, otherwise we call RESTART_TRACE and start a new trace. That way every simple loop ends up detectable.
This is all kind of a half-baked idea, but it seems like it might have potential to unlock a sort of "automatic" fcache, so I thought I'd throw it out for discussion. I'd be interested in others' thoughts. Has anyone else tried anything along these lines?
Eric
In my XMM thingy I'm working on, jumping within the current sector (128 longs) is achieved using "movs vm_lmmpc,#123". (The downside is that jumping to another sector has to go through a fairly long routine, and if the sector isn't cached you get some 30K cycles (an estimate; I haven't measured) of waiting for the SD card.)
Back on topic, you could avoid adding complexity to the jump pseudo-op by directly jumping to the relevant position in the trace and avoiding inline argument reads in the loop (by moving them to a register before the loop)
Speaking of loading registers, has anyone had this idea before? (I don't think so, but who knows?)
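Roughly: pack the register number and the value into a single long that the LMM loop decodes. As a sketch (the exact field layout, register number in the low 9 bits and value in the upper 23, is just one possible choice):

' a "load register" long, encoded as (value << 9) | regnum
' (note the value bits land on the condition field; hence the
' no-op caveat raised below)
loadreg
        movd   lr_st, instr        ' low 9 bits name the register
        shr    instr, #9           ' upper 23 bits are the value
lr_st   mov    0-0, instr          ' store the value into the register
        jmp    #LMM_Loop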
EDIT: now that I think about it, if I keep all my registers below $100, I can even load 24-bit values (or below $80 for 25-bit values, and so on).
Figuring out a way to hide the latency of external memory is definitely interesting. I've thought sometimes that with an emulator (e.g. of another processor) you could translate on the fly as the data is read in from the card. Compiling code from, say, Z80 to Propeller is slow, but so is the card read, so you could use two cogs. Have one cog read the data, and as soon as a byte is ready the other cog can fetch it and compile it in parallel with the next read. At the end you have two buffers: the 512 byte raw sector and a larger (maybe 2K?) buffer of translated instructions which can then be directly executed. Translating is expensive, but the idea is that because the read is so slow and the translation is in parallel it's almost free. And then the runtime becomes really simple and fast.
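As a sketch of the cog-to-cog handoff (the mailbox layout, a hub sequence counter plus a byte buffer, is an assumption):

' reader cog: after each byte arrives from the card
        wrbyte data, bufptr        ' publish the byte first...
        add    bufptr, #1
        add    seq, #1
        wrlong seq, seqaddr        ' ...then bump the sequence count
' translator cog: wait for the next byte, then compile it
wait    rdlong t, seqaddr
        cmp    t, seq wz           ' anything new yet?
if_z    jmp    #wait
        add    seq, #1
        rdbyte data, rdptr
        add    rdptr, #1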
Ah, but that means adding complexity to the compiler. I was actually hoping to keep the compiler simple (so it still just emits a simple "jmp #lmm_jmp" or something similar) and have the caching happen at runtime. I think it's feasible; I've got something like that working in my RiscV JIT compiler, and was thinking about how to do it for LMM, which is a much simpler interpreter. But the P1 instruction set isn't quite as friendly for this kind of thing as some others.
That's nice. You could run into a problem with conditional execution, though, if the immediate doesn't look like a no-op (conditional bits set to 0, i.e. IF_NEVER).
I wonder if there's a way to emulate the P2 AUGS instruction on P1? Probably; we would have to check a flag in the LMM loop, though, and translate the instruction from something like "add x, #N" to "add x, temp", where "temp" is a register set to a combination of the 23 bits set in AUGS and the 9 bits from the instruction.
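As a sketch (untested; here the 23-bit AUGS payload is assumed to arrive as a separate long following the marker, since the 16-bit "arg" of the scheme above is too narrow):

do_augs                            ' reached from the IF_NEVER decoder
        rdlong augval, pc          ' payload long holds the upper 23 bits
        add    pc, #4
        shl    augval, #9
        mov    augflag, #1
        jmp    #LMM_Loop

' and in the fetch loop, before executing the instruction:
        tjz    augflag, #instr     ' no AUGS pending: just run it
        mov    temp, instr
        and    temp, #$1FF         ' 9-bit immediate from the instruction
        or     temp, augval        ' combine into the full 32-bit value
        movs   instr, #temp        ' re-point the source field at temp
        andn   instr, immbit       ' clear the immediate bit
        mov    augflag, #0
instr   nop
        jmp    #LMM_Loop

immbit  long   1 << 22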
Maybe instead of an LMM direct interpretation loop we need to do a "just in time" compilation of LMM to COG code. That could handle things like AUGS relatively easily...
Recently I had an idea. Perhaps you might think about it.
I did a fast overlay loader, and to implement it I loaded the longs in reverse so that I could catch every hub cycle (i.e. every 16 clocks). Each hub cycle only allows 2 instructions between the rdlongs.
So here's the idea...
Use an LMM loop that operates in reverse, i.e. the next instruction is normally at the next lower address.
To make this work, the assembler must take the code and reverse it. This could be accomplished by an external program that reverses the source code before it is assembled.
Your thoughts?
I think I briefly mentioned something like this in the first post. I don't really like it, myself -- it imposes a lot of complexity on the compiler (having to reverse the whole program) and the gain isn't that great compared to what you get from just unrolling the LMM loop. Even in the best case you're still going to pay a 4x speed penalty for LMM.
Ideally I'd like to get rid of that 4x penalty entirely, at least for frequently used code. Fcache is one way to do that -- at compile time you generate code to load small loops into COG and then run them from there. But not many compilers implement fcache; PropGCC and fastspin are the only ones I know of. For hand-written assembly programming it's a pain to have to write loops in fcache form.
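The load stub itself is short; a sketch of the usual shape (all names here are made up, and the size handling is simplified):

FCACHE_LOAD
        rdlong count, pc           ' long after the call: loop size in bytes
        movd   fc_wr, #fcache      ' copy destination: start of fcache area
        add    pc, #4
        shr    count, #2           ' convert the byte count to longs
fc_wr   rdlong 0-0, pc             ' copy the loop body into COG...
        add    pc, #4
        add    fc_wr, dst1         ' ...stepping the destination field
        djnz   count, #fc_wr
        jmp    #fcache             ' run it at full COG speed; the copied
                                   ' code ends with a jump back to LMM_Loop
dst1    long   1 << 9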
Fcache as traditionally implemented also misses some cases where loops involving subroutine calls would fit in COG memory, but the compiler can't prove that. Neither compiler will cache a loop that involves subroutine calls (except that PropGCC does have a hack to cache small recursive functions that only call themselves).
The trace cache idea seems tantalizingly close to having the LMM interpreter implement the fcache automatically at run time. But as I said it's only half-baked right now, and I wonder if I'm missing something...
This is really cool, and I started to think about whether anything like this would be applicable to LMM. I'm not sure if it's worth it yet though.
-Phil
(1) Well, it has to be done once *per compiler*. There are a plethora of compilers for P1 (and even for P2 now). "Per compiler" also includes each individual assembly language project that uses an LMM (a fair number of PASM programs end up with an LMM loop of some kind to handle slow code).
(2) Has anyone actually implemented a reverse LMM in an actual system (not a test, but a real language/compiler that uses one)? I haven't heard of one, but I'd be curious.
(3) The incremental gain over unrolling the LMM loop lots of times probably doesn't make it worth it. Asymptotically the unrolled LMM loop will approach the ideal performance.
I've actually gone ahead and made the fastspin LMM code somewhat parameterizable, so as to test some of this theory. Below are some results on the xxtea benchmark (xxtea.spin), giving the cycles to decrypt the sample string. It's only one benchmark, so of course it needs to be taken with a grain of salt, but it's a reasonable test for LMM because it does involve some heavy computation.
lmm=trace is a proto-tracing LMM that only records the trace and does nothing with it; basically it's to see what the recording overhead looks like (not bad).
lmm=slow is the dead simple LMM posted above, not unrolled at all.
lmm=4x is the default fastspin LMM, which has the loop unrolled 4 times.
lmm=8x has the loop unrolled 8 times, and lmm=16x has it unrolled 16 times
The last three are presented with fcache disabled (first) and then enabled.
As you can see, the gains from unrolling the loop are diminishing; unrolling it 4 times is a big win, doing it 16 times is probably not worth it; the sweet spot looks to be around 8x unrolling. I would guess that a perfect LMM loop would still clock in at over 90000 cycles, since branches are going to interfere with the hub rotation.
fcache makes an enormous difference, far more than unrolling the loop. So there's definitely still a lot of room for improvement in LMM, if we can find some way to avoid fetching instructions from HUB (which means some kind of caching).
Even though fcache is pretty well understood theoretically, it's not widely deployed in P1 languages, mostly because it's a pain to implement. So it'd be nice to figure out an alternative. Some of the theory may be applicable to other kinds of interpreters too. For comparison, the default Spin bytecode interpreter takes 1044640 cycles to interpret that program, so it's about 10x slower than unrolled LMM and 30x slower than the fcached version.
-Phil
That would actually work, but the instructions would need to be 4 longs apart (more for hubops)
Can we run code from INA?
Highly impractical, would need all 32 pins (or P1V with both ports, but there we have better ways of speeding up LMM)
If you'd like, although I think it will be quite a pain (but perhaps I just don't have my head wrapped around it). I've attached a .zip file with xxtea.spin (original source), xxtea_pasm.spin (result after compilation with fastspin --lmm=slow, the simplest LMM loop), xxtea.lst (listing file for xxtea_pasm.spin), and xxtea.binary. The numbers you'll get when running this will be a little slower than I posted earlier; I realized that fastspin was still using "add pc, #XXX" for short relative branches, and not all LMM schemes can support this, so it's disabled now for everything other than --lmm=orig. That should be all the source code you'd need to experiment with xxtea, although as compiler output it's not very readable. The LMM stuff is near the start of xxtea (it begins with COG resident code, then goes to LMM HUB code at the label "hubentry").
I've pushed the revised fastspin source code into my github repository on the "experiment/lmm" branch. This version of fastspin has a new option "--lmm=xxx", where "xxx" may be one of "orig" (the original fastspin LMM), "slow" (the really simple one), or "trace" (still very much under construction, and not working at the moment). The source code for the LMM loops is in sys/lmm_xxx.spin. To add a new one you may also need to change the jump output code; I think the default settings will work for most LMM modes, but if you need something fancy, the code for emitting the jumps starts near line 752, at the comment that says "handle jumps in LMM mode".
I haven't tried building fastspin on Windows lately, so I don't know if that still works. I tend to cross compile on a Linux machine (Debian Stretch).
If you do decide to tackle this, let me know if I can help with something. I'm convinced it's a bit of a dead end; even in the absolute best case the reversed LMM loop is going to be no more than 2x faster than the slowest LMM loop, or around 80000 cycles, which is still a lot slower than any of the fcache cases. But I guess for programs with large loops that don't fit in fcache it might be worth a go.
Regards,
Eric
Indeed. CALL/RET will require a bit of work, for example, as would any kind of indirect jump table.
-Phil
What I may do is just cut the big btea routine and play with that. First going forward with a traditional LMM loop, and then try reversing it. That should be easier and test the concept without too much effort. So, I'll give that a go.
However, with LMM, it needs to just decrement the hub address by 4, which is not possible with a djnz instruction.
So it's not possible to do a reverse-loop LMM. And now I recall looking at this years ago and finding the same result.
However, I did try a simpler kind of caching LMM and that actually worked well. The caching algorithm is really simple: in LMM_JUMP we check for a backwards branch, and if it's short enough that the code from the branch target to where we are now will fit in the COG memory cache, we copy it in there and jump to it. There's also some work to handle CALL/RET from cache and to patch LMM_JUMP executing from cache so it goes back to the cache (and then avoids the LMM_JUMP routine completely). But the performance bump is pretty nice. Below are some figures for xxtea and for the fftbench benchmark:
Remember that the "lmm=orig" case with a cache involves FCACHE: the compiler has to analyze the code, find the loops, and emit code to load them into COG automatically. The "lmm=cache" does not require this; instead we do that analysis and load at run time, so it will work with any compiler (or none, if you hand write your LMM PASM code). There's certainly a significant overhead if you don't give it a cache, or if the main program loops don't fit in cache. But if they do fit they can pretty much run at full speed.
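The branch-direction test at the top of LMM_JUMP looks conceptually like this (a sketch, not the attached code; the names are assumptions):

LMM_JUMP
        rdlong newpc, pc           ' fetch the branch target
        add    pc, #4
        mov    temp, pc
        sub    temp, newpc wc      ' how far back is the target?
if_b    jmp    #take_jump          ' forward branch: don't cache it
        cmp    temp, maxcache wc   ' would target..here fit in the cache?
if_ae   jmp    #take_jump          ' too big: just branch normally
        ' copy HUB[newpc..pc) into the COG cache (a loop much like an
        ' fcache loader), patch cached jumps, then run from the cache
take_jump
        mov    pc, newpc
        jmp    #LMM_Loop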
I've attached the code for the automatically caching LMM. It's also checked in to the experiment/lmm branch of fastspin.
There's also still room for improvement. As @Wuerfel_21 pointed out copying from HUB to COG backwards can avoid the hub miss penalty, and the cache loading code doesn't do that yet. There may also be some other tricks that would improve performance.
Another funny possibility is using both cog counters and one I/O pin. One counter would simply divide sysclk by 16.
Even in the best case that won't give much improvement over an unrolled LMM loop, will it? An ideal LMM loop executes one instruction every 16 cycles. An 8x unrolled LMM loop takes an extra hub delay slot on top of that, so it works out to 18 cycles per instruction (8x16 + 16 = 144 cycles for 8 instructions); a 16x unrolled loop is 17 cycles per instruction.
A caching LMM can do much, much better, as the experiments above indicate. In the ideal case you get down to 4 cycles per instruction, a 4x speedup over even a perfect non-caching LMM.