P2 Taqoz 2.8: Study about an Inline Compiler - Speed, Stack and Inline Assembly — Parallax Forums

# P2 Taqoz 2.8: Study about an Inline Compiler - Speed, Stack and Inline Assembly

Posts: 1,043
edited 2023-01-23 16:45 in Forth

Hi,
perhaps this is interesting for somebody.
Two ideas have been discussed that might be ways to achieve a faster Forth:
A) Make a compiler that does not call Forth primitives but inlines their assembler code. To some degree this would be possible within the structures of Taqoz. The speed gain comes from the instruction cache, which needs some cycles to become effective and which does not help when there are many calls, jumps or hub data reads. As the jumps for control structures work differently in Forth and in assembler, implementing them would need extra work.
B) Avoid (LUT) stack handling, which involves several movs and a LUT operation each time. This would need a complete rewrite of the Forth system.

Using the simple Fibonacci benchmark, I tried different degrees of inlined code and assembler that avoids stack movements. In addition, the results for different types of variables are given. (Used: https://forums.parallax.com/discussion/174970/p2-taqoz-v2-8-lutlongs-value-type-variables-locals-now-in-lut-and-faster#latest )

Code that uses variables is much more readable than pure stack Forth, but it costs roughly a factor of 10 in speed compared with stack Forth (about 8 to 12 in the results below, depending on the variable type). (For me, assembler is sometimes easier to read than Forth that uses a lot of swap over rot...)

The speed reference is hubexec assembler. Cog assembler is more than twice as fast (factor 2.4).
Pure stack Forth is slower than hubexec by a factor of 5.9.
If the Forth code is inlined (without the loop instruction), the run time drops from 5.9 to 4.6 times that of hubexec, a speedup of about 1.3. The code size penalty is a factor of 4 (100 bytes instead of 26). This would be idea A. If an adapted loop structure is part of the inlined block, the run time drops from 5.9 to 2.1 times that of hubexec, a speedup of about 2.8.

If inline assembler is used, frequent switching back and forth between Forth and assembler eats up the advantage of minimised stack movement. Only putting the complete loop in assembler brings the big advantage.

So all in all:
1. If speed is important in Forth, do not use any type of variables; use stack words.
2. Inlining the assembler code of Taqoz Forth words instead of calling them (idea A) is of low merit if the loop command is not part of the inlined code. So the inlining compiler must have its own working control structures. Then a speed gain of about a factor of 2.8 is possible using the existing stack structure. The code size penalty is a factor of 4!
3. Inlined assembler code that uses registers instead of the stack must also include the loop instructions.

```
fiboAsm:          1836311903 1176     100 pc to HubAssembler 38 bytes
fiboAsmFor:       1836311903 1185     100 pc to HubAssembler
fiboCog:          1836311903 496       42 pc to HubAssembler with djnz
fiboForth:        1836311903 6944     590 pc to HubAssembler 26 bytes
fiboInForth:      1836311903 5432     461 pc to HubAssembler 100 bytes inline forth commands without loop command
fiboInForForth:   1836311903 2496     210 pc to HubAssembler loop command is included in inlined block
fiboInCacheForth: 1836311903 4712     397 pc to HubAssembler inline forth commands and cache them
fiboValues:       1836311903 82888   7048 pc to HubAssembler
fiboLocals:       1836311903 60080   5108 pc to HubAssembler 48 bytes
fiboLut:          1836311903 76696   6521 pc to HubAssembler
fiboGlobals:      1836311903 52840   4493 pc to HubAssembler
fiboStAbc:        1836311903 5065     427 pc to HubAssembler 44 bytes asm to avoid stack movement, loop in forth
fiboStAbcd:       1836311903 1409     119 pc to HubAssembler 44 bytes avoid stack movement, complete loop in hub asm
```
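The factors quoted in the text can be rechecked from the second column of the results. The following is a small Python sketch (not Taqoz code) that only does this arithmetic; the dictionary and the helper name `ratio` are made up for illustration, and the unit of the timing column is not stated in the post.

```python
# Timings copied from the second column of the results above.
# The unit is not given in the post, so only ratios are computed.
timings = {
    "fiboAsm": 1176,
    "fiboCog": 496,
    "fiboForth": 6944,
    "fiboInForth": 5432,
    "fiboInForForth": 2496,
    "fiboValues": 82888,
    "fiboLocals": 60080,
    "fiboLut": 76696,
    "fiboGlobals": 52840,
}


def ratio(slow, fast):
    """Return how many times longer `slow` takes than `fast`."""
    return timings[slow] / timings[fast]


if __name__ == "__main__":
    print("stack Forth vs hubexec asm:  ", round(ratio("fiboForth", "fiboAsm"), 1))
    print("inlined, no loop vs hubexec: ", round(ratio("fiboInForth", "fiboAsm"), 1))
    print("inlined with loop vs hubexec:", round(ratio("fiboInForForth", "fiboAsm"), 1))
    print("gain of inlined with loop:   ", round(ratio("fiboForth", "fiboInForForth"), 1))
    print("hubexec asm vs cog asm:      ", round(ratio("fiboAsm", "fiboCog"), 1))
    for name in ("fiboValues", "fiboLut", "fiboLocals", "fiboGlobals"):
        print(name, "vs stack Forth:", round(ratio(name, "fiboForth"), 1))
```

The variable-based versions come out between roughly 8 and 12 times slower than stack Forth, which is where the "factor 10" estimate comes from.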

• Posts: 1,043

Update:
It took me a while to find out how an inline compiler can be done in Taqoz. Actually it is more of a macro assembler. It works in the following steps:
1. Rewrite Forth words as (HUB) code words using the parameter stack of Taqoz. This is the code word for "+":

```
code In+ ( n1 n2 -- n1+n2 )
add     b,a ' add top of stack into the second item
sub     $18,#1 ' DROP depth
mov     a,b
mov     b,c
mov     c,d
rdlut   d,--ptrb
ret
end
```

The code must end with a standalone ret. This step is not necessary if the word already exists in this form.

2. Now you can combine such words in a syntax similar to Forth.
```
inCode inLoop ( n 0 1 -- n x fibo )
inl: TUCK
inl: In+
inNEXT: c \ use register c for djnz loop
endIn
```

These words are just copied into the new code-word, not called. inNEXT: compiles djnz.

3. Use the word as a Forth word.
```
: fiboInLoop ( n -- f )  \ Fibonacci series, returns the last result
0 1
inLoop  ( n 0 1 -- n x fibo ) \ does the calculation
rot 2drop \ get rid of stack values
;
```
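To make the data flow of the three steps easier to follow, here is a toy Python model (not Taqoz code). It assumes what the listings suggest: the top four stack items live in registers a (top) to d, deeper items are spilled to LUT RAM, "+" adds a into b and then drops, and inNEXT: behaves like djnz on register c. The class `RegStack`, its methods and the function names are hypothetical, and the push direction of the LUT pointer is an assumption.

```python
MASK32 = 0xFFFFFFFF  # registers are modelled as 32-bit unsigned values


class RegStack:
    """Toy model: a..d hold the top four stack items, a list stands in for LUT RAM."""

    def __init__(self):
        self.a = self.b = self.c = self.d = 0
        self.lut = []    # assumed spill area, addressed via ptrb on the real chip
        self.depth = 0

    def push(self, x):
        self.lut.append(self.d)                    # assumed: spill d to LUT
        self.d, self.c, self.b = self.c, self.b, self.a
        self.a = x & MASK32
        self.depth += 1

    def drop(self):
        self.depth -= 1                            # sub depth,#1
        self.a, self.b, self.c = self.b, self.c, self.d   # mov a,b / mov b,c / mov c,d
        self.d = self.lut.pop()                    # rdlut d,--ptrb

    def tuck(self):
        """( x1 x2 -- x2 x1 x2 )"""
        x1, x2 = self.b, self.a
        self.push(x2)                # registers now hold ( x1 x2 x2 )
        self.b, self.c = x1, x2      # rearrange to ( x2 x1 x2 )

    def plus(self):
        """( n1 n2 -- n1+n2 ): add into the second item, then drop the top."""
        self.b = (self.b + self.a) & MASK32
        self.drop()


def in_loop(st):
    """Model of inLoop: TUCK, +, then a djnz-style loop on register c."""
    if st.c == 0:
        raise ValueError("a djnz-style loop needs a count of at least 1")
    while True:
        st.tuck()
        st.plus()
        st.c = (st.c - 1) & MASK32   # decrement the loop counter ...
        if st.c == 0:                # ... and fall through when it reaches zero
            break


def fibo_in_loop(n):
    """Model of fiboInLoop: push n 0 1, run the loop, keep what `rot 2drop` leaves."""
    st = RegStack()
    for x in (n, 0, 1):
        st.push(x)
    in_loop(st)
    return st.b   # rot 2drop removes the counter and the top item, leaving x


if __name__ == "__main__":
    print(fibo_in_loop(46))
```

In this model, a loop count of 46 reproduces the benchmark result 1836311903, and the stack depth is the same before and after each loop pass, which is what allows the counter to stay in register c.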

Is it worth the trouble?

```
fiboForth:        1836311903 6944     590 pc to HubAssembler 26 bytes
fiboTuckForth:    1836311903 8017     676 pc to HubAssembler 24 bytes TUCK is in HUB
fiboTuck+Forth:   1836311903 6592     560 pc to HubAssembler
fiboInLoop:       1836311903 2168     184 pc to HubAssembler 12+40 bytes inLoop is in HUB
fiboInLoopCache:  1836311903 1728     145 pc to HubAssembler inLoop is cached in COG
```

So this is faster than plain Taqoz by a factor of 6944/2168 = 3.2, and it reaches about half the speed of hub assembler code (1176/2168 = 0.54). It uses the newly defined Forth83 word TUCK, which is equivalent to SWAP followed by OVER.
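The relation between TUCK, SWAP and OVER can be checked with a tiny Python model of a data stack (not Taqoz code). The list-based stack and the function names are only for illustration; the top of the stack is the end of the list.

```python
def swap(stack):
    """( x1 x2 -- x2 x1 )"""
    return stack[:-2] + [stack[-1], stack[-2]]


def over(stack):
    """( x1 x2 -- x1 x2 x1 )"""
    return stack + [stack[-2]]


def tuck(stack):
    """( x1 x2 -- x2 x1 x2 )"""
    return stack[:-2] + [stack[-1], stack[-2], stack[-1]]


if __name__ == "__main__":
    s = [7, 0, 1]
    print(tuck(s))
    print(over(swap(s)))
```

Both calls give the same stack, so a single TUCK saves one word per loop pass compared with SWAP OVER.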

• Posts: 394

That's a very clever speed improvement - I like it!

• Posts: 394
edited 2023-03-13 14:45

@"Christof Eb." I think your source code got corrupted at step 2 above - it's got 'In' in various places it shouldn't have e.g. inCode, inNEXT, endIn

• Posts: 1,043

Hi Bob,
have a look into the fiboTestsK.fth in the zip. In there you will find the definitions of the new words containing "in".
Unfortunately many existing words of Taqoz are not suitable to be inlined, because they end with a jump instead of a separate ret.
Christof

• Posts: 394

Aah - the code displayed is just an extract from the zip file - OK. Thanks for the caution about not using words that end in a jump. I see this being used mainly on new stuff, so that's not a big deal, just remember to end all your words in 'ret'