P2 Taqoz 2.8: Study about an Inline Compiler - Speed, Stack and Inline Assembly
Perhaps this is interesting for somebody.
In this thread https://forums.parallax.com/discussion/175066/question-about-register-indirection-alti#latest
two ideas were discussed that might be ways to achieve a faster Forth.
A) Write a compiler that does not call Forth primitives but inlines their assembler code. To some degree this is possible within the existing structures of Taqoz. The speed gain comes from the instruction cache, which needs some cycles to become effective and does not help when there are many calls, jumps or hub data reads. Because the jumps for control structures work differently in Forth and in assembler, implementing them would need extra work.
B) Avoid (LUT) stack handling, which involves several movs and a LUT operation each time. This would need a complete rewrite of the Forth system.
Using the simple Fibonacci benchmark, I tried different degrees of inlined code, and assembler that avoids stack movements. In addition, the results for different types of variables are given. (Used: https://forums.parallax.com/discussion/174970/p2-taqoz-v2-8-lutlongs-value-type-variables-locals-now-in-lut-and-faster#latest )
Code that uses variables is much more readable than pure stack Forth, but it is roughly 10 times slower than stack Forth (between about 7.6 and 11.9 times in the results below, depending on the type of variable). (For me assembler is sometimes easier to read than Forth that uses a lot of swap over rot...)
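To make the contrast between the two styles concrete, here is a small Python model of the benchmark. It is not Taqoz code. It assumes the benchmark is an iterative Fibonacci with 46 loop passes, which is only inferred from the result 1836311903 = F(46). The word sequence swap over + and the function names fibo_stack and fibo_vars are illustrative assumptions, not the words used in the actual tests.

```python
# Toy model, not Taqoz: the same Fibonacci loop written "stack style"
# and "variable style". n = 46 is an assumption inferred from the
# benchmark result 1836311903.

def fibo_stack(n):
    """Forth-like version: all state lives on a data stack."""
    stack = [0, 1]                                     # ( a b )
    for _ in range(n):
        stack[-1], stack[-2] = stack[-2], stack[-1]    # swap  ( b a )
        stack.append(stack[-2])                        # over  ( b a b )
        top = stack.pop()                              # +     ( b a+b )
        stack[-1] += top
    stack.pop()                                        # drop  ( a )
    return stack.pop()


def fibo_vars(n):
    """Variable-style version: named values instead of stack shuffling."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    print(fibo_stack(46), fibo_vars(46))
```

Both functions compute the same value. The stack version needs three stack words per pass, while the variable version reads more naturally. In Taqoz the variable version is the slow one, because every variable access costs extra work.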
The reference for speed is hubexec assembler. Cog assembler is a bit more than twice as fast.
Pure stack Forth is slower than hubexec assembler by a factor of 5.9.
If the Forth code is inlined (without the loop instruction), the run time drops from factor 5.9 to factor 4.6 relative to hubexec assembler. The code size penalty is a factor of 4. This would be idea A. If an adapted loop structure is part of the inlined block, the run time drops from factor 5.9 to factor 2.1.
If inline assembler is used, frequent switching back and forth between Forth and assembler eats up the advantage of minimised stack movement. Only putting the complete loop in assembler brings the big advantage.
So all in all:
1. If speed is important in Forth, do not use any type of variables but use stack words.
2. Inlining the assembler code of Taqoz Forth words within the Taqoz system instead of calling them (idea A) is of low merit if the loop command is not part of the inlined code. So the inlining compiler must have its own working control structures. Then a speed gain of about factor 2.8 (from 590 % down to 210 % of hub assembler) is possible using the existing stack structure. The code size penalty is factor 4!
3. Inlined assembler code that uses registers instead of the stack must also include the loop instructions.
fiboAsm: 1836311903 1176 100 pc to HubAssembler 38 bytes
fiboAsmFor: 1836311903 1185 100 pc to HubAssembler
fiboCog: 1836311903 496 42 pc to HubAssembler with djnz
fiboForth: 1836311903 6944 590 pc to HubAssembler 26 bytes
fiboInForth: 1836311903 5432 461 pc to HubAssembler 100 bytes inline forth commands without loop command
fiboInForForth: 1836311903 2496 210 pc to HubAssembler loop command is included in inlined block
fiboInCacheForth: 1836311903 4712 397 pc to HubAssembler inline forth commands and cache them
fiboValues: 1836311903 82888 7048 pc to HubAssembler
fiboLocals: 1836311903 60080 5108 pc to HubAssembler 48 bytes
fiboLut: 1836311903 76696 6521 pc to HubAssembler
fiboGlobals: 1836311903 52840 4493 pc to HubAssembler
fiboStAbc: 1836311903 5065 427 pc to HubAssembler 44 bytes asm to avoid stack movement, loop in forth
fiboStAbcd: 1836311903 1409 119 pc to HubAssembler 44 bytes avoid stack movement, complete loop in hub asm
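The factors quoted in this post can be re-derived from the raw figures in the results list above. The following Python sketch is only a convenience for checking the arithmetic. It copies the second number of a few entries as the timing figure, and the helper name factor is an assumption made for this example.

```python
# Timing figures copied from the results list (second number of each entry).
# The unit is whatever the benchmark printed; only ratios are used here.
timings = {
    "fiboAsm": 1176,
    "fiboCog": 496,
    "fiboForth": 6944,
    "fiboInForth": 5432,
    "fiboInForForth": 2496,
    "fiboLocals": 60080,
    "fiboStAbcd": 1409,
}


def factor(slow, fast):
    """Return how many times longer entry `slow` takes than entry `fast`."""
    return timings[slow] / timings[fast]


if __name__ == "__main__":
    pairs = [
        ("fiboForth", "fiboAsm"),          # stack Forth vs. hubexec assembler
        ("fiboAsm", "fiboCog"),            # hubexec vs. cog assembler
        ("fiboForth", "fiboInForth"),      # gain from inlining without the loop
        ("fiboForth", "fiboInForForth"),   # gain from inlining including the loop
        ("fiboLocals", "fiboForth"),       # cost of locals vs. stack Forth
        ("fiboStAbcd", "fiboAsm"),         # register-style loop vs. hubexec assembler
    ]
    for slow, fast in pairs:
        print(f"{slow} / {fast} = {factor(slow, fast):.2f}")
```

By this arithmetic, inlining without the loop gains only about a quarter, while inlining including the loop gains a factor of nearly 2.8 over plain stack Forth.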