P2 Taqoz V2.8: csForth -- Cached Call Sequences for Speed of Innermost Loops

Christof Eb. · 2023-08-09 15:49

Hi,
Taqoz is already fast: 160ms for an empty loop of 1,000,000 iterations. ( 1_000_000 FOR NEXT )
The question was whether we can be faster if we avoid HUB accesses. The inner interpreter doNext loads every wordcode and has to decide what to do with it. If the current wordcode is a "CODE word", then the wordcode is the address of an assembler routine, which is simply called.
"CODE words" are all wordcodes < $17FF and can reside in COG RAM, LUT RAM or HUB RAM. From COG RAM or LUT RAM they execute very fast.
It turns out that the most basic words are implemented in Taqoz as "CODE words". So the idea is to cache a sequence of call instructions in LUT and to call this sequence. (A subroutine-threaded Forth here.) To make this possible, the list must consist only of "CODE words" together with a special loop statement.
Fortunately my local variables are "CODE words" too, so we can use them here as constants in place of literals.
(This requires the version _BOOT_P2.BIX from Taqoz.zip at https://sourceforge.net/projects/tachyon-forth/files/TAQOZ/binaries/ .)

At the moment we have implemented csFor....NEXT and csBegin....csUntil.
( no if...else...then )

What does csBegin do? It works at run time on the compiled wordcodes.
It adds the assembler opcode of "call" to each of the wordcodes and places these sums into the cache as a sequence, until it finds the word csUntil, which is just a marker for the end of the loop. It then appends some code that performs the function of UNTIL, followed by a "ret". So now the cache holds a sequence of valid assembler instructions which can be jumped to. The instruction pointer PTRA is advanced to the word after csUntil.
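
To make the steps concrete, here is a rough model of csBegin in Python (not Taqoz code, and not the real implementation, which is written in assembler). The opcode constants, the DROP wordcode and the csUntil wordcode are taken from the "see" and ".lutCache" listings further down in this post; the function name, the default cache address and the range check are made up for this sketch.

```python
CALL_ABS   = 0xFDA0_0000  # "call #addr" with an empty address field (from the .lutCache dump)
JMP_IF_Z   = 0xAD80_0000  # "if_z jmp #addr" with an empty address field
RET        = 0xFD64_002D  # "ret"
DROP       = 0x008D       # wordcode of DROP as annotated in the dump
CS_UNTIL   = 0x2062       # wordcode of csUntil as shown by "see"
CODE_LIMIT = 0x17FF       # "CODE words" are wordcodes below this value


def build_until_cache(wordcodes, start, cache_addr=0x2D9, check=True):
    """Model of csBegin: return (cache longs, index of the word after csUntil).

    wordcodes  -- compiled 16-bit wordcodes of the definition
    start      -- index of the first wordcode after csBegin
    cache_addr -- address of the first cache long (assumed, taken from the dump)
    check      -- illustration only; the real compiler performs no checks
    """
    cache = []
    i = start
    while wordcodes[i] != CS_UNTIL:           # csUntil only marks the end of the loop
        w = wordcodes[i]
        if check and w >= CODE_LIMIT:
            raise ValueError(f"${w:04X} is not a CODE word")
        cache.append(CALL_ABS + w)            # "call" opcode plus wordcode
        i += 1
    cache.append(CALL_ABS + DROP)             # drop the flag, which sets the zero flag
    cache.append(JMP_IF_Z + cache_addr)       # flag was zero: run the loop again
    cache.append(RET)                         # leave the cached sequence
    return cache, i + 1                       # interpreter continues after csUntil


# Loop body of csEmptyLoop2 as decompiled: c> +>a a> b> = csUntil a>
body = [0x02CB, 0x02BD, 0x02BF, 0x02C5, 0x00F2, 0x2062, 0x02BF]
cache, next_index = build_until_cache(body, 0)
for long in cache:
    print(f"${long >> 16:04X}_{long & 0xFFFF:04X}")
```

Fed with the loop body of csEmptyLoop2, this reproduces the eight longs shown by .lutCache.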

An empty csFor....NEXT is, after preparation of the cache, just the loading of a counter register, an empty call sequence and a djnz executing from LUT memory, so it takes 20ms @200MHz.
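
As a quick sanity check of these times, here is the arithmetic in Python (not Taqoz code). It assumes the 20ms refers to the same 1,000,000 iterations as the plain FOR NEXT loop at the top of this post; the function name is made up for this sketch.

```python
CLOCK_HZ = 200_000_000   # 200MHz system clock, as stated in the post
ITERATIONS = 1_000_000   # assumption: same loop count as the plain FOR NEXT example


def cycles_per_iteration(time_ms, clock_hz=CLOCK_HZ, iterations=ITERATIONS):
    """Convert a total loop time in milliseconds into clock cycles per iteration."""
    total_cycles = time_ms * clock_hz // 1000
    return total_cycles / iterations


print(cycles_per_iteration(20))    # cached csFor....NEXT -> 4.0
print(cycles_per_iteration(160))   # plain FOR NEXT       -> 32.0
```

Under that assumption the cached empty loop costs about 4 clocks per iteration, against about 32 for the interpreted one.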

An example:

: csEmptyLoop2 {: , a# limit# 1# -- } \ declare uninitialized locals
   1 to_1# \ a local constant
   0 to_a# \ a local variable
   1_000_000 to_limit# \ a local constant
   csBegin
      1# +to_a#
   a# limit# = csUntil
   a# . CRLF \ print the final count
; forgetLocals

see csEmptyLoop2 
{
TAQOZ#   see csEmptyLoop2

1B5D7: pub csEmptyLoop2
0B808: AE32     cLoc
0B80A: 1801     1
0B80C: 02C7     >c
0B80E: 1800     0
0B810: 02BB     >a
0B812: 0083     := 131 \ incomplete
0B818: 02C1     >b
0B81A: 030B     csBegin
0B81C: 02CB     c>
0B81E: 02BD     +>a
0B820: 02BF     a>
0B822: 02C5     b>
0B824: 00F2     =
0B826: 2062     csUntil
0B828: 02BF     a>
0B82A: 2B6E     .
0B82C: 20AE     CRLF
0B82E: AE5D     cEndloc ;
      ( 40  bytes )
}

lap emptyLoop2 lap .lap     
lap csEmptyLoop2 lap .lap    

{
TAQOZ# lap emptyLoop2 lap .lap      --- 1000000
280,058,648 cycles= 1,400,293,240ns @200MHz ok
TAQOZ# lap csEmptyLoop2 lap .lap     --- 1000000
157,044,824 cycles= 785,224,120ns @200MHz ok
TAQOZ# 1,40000 785 / . --- 178  ok
}
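
The numbers in the lap output above can be cross-checked with a few lines of Python (not Taqoz code). The cycle counts are copied from the output; the helper name is made up for this sketch.

```python
CLOCK_HZ = 200_000_000  # 200MHz, as printed by .lap


def cycles_to_ns(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into nanoseconds at the given clock."""
    return cycles * 1_000_000_000 // clock_hz


forth_cycles = 280_058_648   # lap emptyLoop2
cs_cycles = 157_044_824      # lap csEmptyLoop2

forth_ns = cycles_to_ns(forth_cycles)
cs_ns = cycles_to_ns(cs_cycles)
speedup = forth_cycles / cs_cycles

print(forth_ns)             # 1400293240
print(cs_ns)                # 785224120
print(round(speedup, 2))    # 1.78
print(140000 // 785)        # 178, the integer division done at the TAQOZ prompt
```

With 1,000,000 passes through the loop this is roughly 280 cycles per iteration for the normal loop and 157 for the cached one.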

.lutCache
{
TAQOZ# .lutCache ---
217 $FDA0_02CB
218 $FDA0_02BD
219 $FDA0_02BF
220 $FDA0_02C5
221 $FDA0_00F2
222 $FDA0_008D call DROP sets zero flag
223 $AD80_02D9 if_z jmp #startLutCache
224 $FD64_002D ret
}

You can see nicely that the calls in the lutCache are just the sum of $FDA0_0000 and the wordcode as displayed by the decompiler.
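
The same check can be done the other way round. The Python sketch below (not Taqoz code) splits each long of the dump into opcode part and address part, using the fact that the P2 call and jmp instructions with an absolute address keep it in the low 20 bits. It also assumes that the numbers 217...224 printed by .lutCache are LUT indices, and LUT RAM sits at $200...$3FF in the cog's execution address space. The function name is made up for this sketch.

```python
ADDR_MASK = 0xF_FFFF   # low 20 bits: absolute address field of call/jmp
LUT_BASE = 0x200       # LUT RAM starts at $200 in cog execution space


def split_long(long):
    """Return (opcode with empty address field, address field)."""
    addr = long & ADDR_MASK
    return long - addr, addr


dump = {
    217: 0xFDA0_02CB,
    218: 0xFDA0_02BD,
    219: 0xFDA0_02BF,
    220: 0xFDA0_02C5,
    221: 0xFDA0_00F2,
    222: 0xFDA0_008D,
    223: 0xAD80_02D9,
}

for index, long in dump.items():
    opcode, addr = split_long(long)
    print(f"{index} ${opcode:08X} ${addr:04X}")

# The jmp target should be the first cache entry: LUT index 217 -> $2D9
jmp_opcode, jmp_target = split_long(dump[223])
print(hex(LUT_BASE + 217), hex(jmp_target))
```

Both values printed in the last line are 0x2d9, so the if_z jmp really goes back to the start of the cached sequence.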

The speed gain is about a factor of 1.8 (280,058,648 / 157,044,824 cycles ≈ 1.78).

The relative run times for the Fibonacci benchmark are:
cog-assembler : hub-assembler : csForth : Forth
42% : 100% : 302% : 590%

One nice thing is that the code stays compact (16-bit words instead of 32-bit longs), because the cache is reused.

Comments, suggestions welcome!
Have Fun Christof

Christof Eb. · 2023-08-13 16:47

Added short constants, which must be in the range 0...511 (as opposed to 0...1023 in Taqoz).
Added if...else...then.
So now this tiny just-in-time compiler seems to have grown into something that can be useful, I think.
Added "codeWords", which displays all words that can be used inside the loop. Quite a lot: about 260.
You can add your own code words in LUT.
Those which reside in COG or LUT will give the speed.

TRACE/UNTRACE should not be used, as the tracing registers are used by the cache code.
You have to make sure that you do not use threaded Forth words inside the loop, because their wordcodes will be treated as data for calculating jump destinations. ELSE works this way. As the tiny JIT compiler is written in assembler and meant to be fast, there are no checks.

Have fun!
Christof