P2 Taqoz V2.8: csForth -- Cached Call Sequences for Speed of Innermost Loops
Hi,
Taqoz is already fast: 160ms for an empty loop of 1,000,000 iterations ( 1_000_000 FOR NEXT ).
The question was whether we can be faster if we avoid HUB accesses. The inner interpreter doNext loads every wordcode and has to decide what to do with it. If the current wordcode is a "CODE word", then this word is the address of an assembler routine, which is simply called.
"CODE words" are all wordcodes < $17FF and can reside in COG RAM, LUT RAM or HUB RAM. From COG RAM or LUT RAM they execute very fast.
It turns out that the most basic words in Taqoz are implemented as "CODE words". So the idea is to cache a sequence of call instructions in LUT and to call this sequence (a subroutine-threaded Forth here). To make this possible, the list must consist only of "CODE words" together with a special loop statement.
Fortunately my local variables are "CODE words" too, so we can use these here as constants for literals.
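The rule that only wordcodes below $17FF are "CODE words" can be sketched in a few lines of Python (not Taqoz code). The wordcodes are copied from the decompiler listing of csEmptyLoop2 further down; the helper name is_code_word is hypothetical and only serves the illustration.

```python
# Sketch only: classify wordcodes by the "< $17FF" rule stated above.
CODE_WORD_LIMIT = 0x17FF

def is_code_word(wordcode):
    """Hypothetical helper: True if the wordcode counts as a CODE word."""
    return wordcode < CODE_WORD_LIMIT

# (wordcode, name) pairs taken from the decompiler listing in this post.
listing = [
    (0x1801, "1"),        # literal 1 as shown by the decompiler
    (0x02C7, ">c"),       # local variable access
    (0x02CB, "c>"),
    (0x02BD, "+>a"),
    (0x00F2, "="),
    (0x2062, "csUntil"),  # end-of-loop marker
    (0x2B6E, "."),
]

for wordcode, name in listing:
    kind = "CODE word" if is_code_word(wordcode) else "not a CODE word"
    print(f"{wordcode:04X} {name:8s} {kind}")
```

In this listing the literal 1 (wordcode 1801) is not below $17FF, while the accessors for the locals are, which is why the locals stand in for literals inside the cached loop.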
(Mandatory for this is the use of the version _BOOT_P2.BIX in Taqoz.zip in https://sourceforge.net/projects/tachyon-forth/files/TAQOZ/binaries/ . )
At the moment we have implemented csFor....NEXT and csBegin....csUntil.
( no if...else...then )
What does csBegin do? It works at run time with the compiled codes.
It adds the opcode of the assembler instruction "call" to each of the wordcodes and places these sums into the cache as a sequence, until it finds the word csUntil, which is just a marker for the end of the loop. It then appends some code that performs the function of UNTIL, and a "ret" at the end. So now the cache holds a sequence of valid assembler instructions which can be jumped to. The instruction pointer PTRA is advanced to the word after csUntil.
After the cache has been prepared, an empty csFor....NEXT is just the loading of a counter register, an empty call sequence and a djnz executing from LUT memory, so it takes 20ms @200MHz.
An example:
: csEmptyLoop2 {: , a# limit# 1# -- } \ declare uninitialized locals
    1 to_1# \ a local constant
    0 to_a# \ a local variable
    1_000_000 to_limit# \ a local constant
    csBegin
        1# +to_a#
        a# limit# =
    csUntil
;
forgetLocals

see csEmptyLoop2
{
TAQOZ# see csEmptyLoop2
1B5D7: pub csEmptyLoop2
0B808: AE32     cLoc
0B80A: 1801     1
0B80C: 02C7     >c
0B80E: 1800     0
0B810: 02BB     >a
0B812: 0083     := 131 \ incomplete
0B818: 02C1     >b
0B81A: 030B     csBegin
0B81C: 02CB     c>
0B81E: 02BD     +>a
0B820: 02BF     a>
0B822: 02C5     b>
0B824: 00F2     =
0B826: 2062     csUntil
0B828: 02BF     a>
0B82A: 2B6E     .
0B82C: 20AE     CRLF
0B82E: AE5D     cEndloc ;
      ( 40 bytes )
}

lap emptyLoop2 lap .lap
lap csEmptyLoop2 lap .lap
{
TAQOZ# lap emptyLoop2 lap .lap --- 1000000
280,058,648 cycles= 1,400,293,240ns @200MHz ok
TAQOZ# lap csEmptyLoop2 lap .lap --- 1000000
157,044,824 cycles= 785,224,120ns @200MHz ok
TAQOZ# 1,40000 785 / . --- 178 ok
}

.lutCache
{
TAQOZ# .lutCache ---
217 $FDA0_02CB
218 $FDA0_02BD
219 $FDA0_02BF
220 $FDA0_02C5
221 $FDA0_00F2
222 $FDA0_008D call DROP sets zero flag
223 $AD80_02D9 if_z jmp #startLutCache
224 $FD64_002D ret
}
It can be nicely seen that the calls in the lutCache are just the sum of $FDA0_0000 + the wordcode as displayed by the decompiler.
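To make the arithmetic concrete, here is a minimal Python sketch (not the real assembler routine) of what csBegin does with the wordcodes between csBegin and csUntil. The function name build_cache is hypothetical. The three trailing longs are copied verbatim from the .lutCache dump rather than encoded, so the jump target is only valid for a cache that starts at LUT index 217. The range check is an addition of the sketch; it is not taken from the actual implementation.

```python
# Sketch only: rebuild the .lutCache contents from the decompiled wordcodes.
CALL_BASE = 0xFDA0_0000        # value added to every wordcode (from the dump)
CODE_WORD_LIMIT = 0x17FF       # CODE words are wordcodes below this value
CS_UNTIL = 0x2062              # wordcode of csUntil in the decompiler listing

# Loop epilogue, copied verbatim from the dump (assumes cache start at 217).
UNTIL_EPILOGUE = [
    0xFDA0_008D,               # call DROP, sets zero flag
    0xAD80_02D9,               # if_z jmp #startLutCache
    0xFD64_002D,               # ret
]

def build_cache(wordcodes):
    """Hypothetical helper: turn a wordcode list into cached call longs."""
    cache = []
    for wordcode in wordcodes:
        if wordcode == CS_UNTIL:           # marker for the end of the loop
            return cache + UNTIL_EPILOGUE
        if wordcode >= CODE_WORD_LIMIT:    # check added for this sketch only
            raise ValueError(f"{wordcode:04X} is not a CODE word")
        cache.append(CALL_BASE + wordcode)
    raise ValueError("no csUntil found")

# Wordcodes following csBegin in the listing: c> +>a a> b> = csUntil
loop_body = [0x02CB, 0x02BD, 0x02BF, 0x02C5, 0x00F2, 0x2062]

for index, long_value in enumerate(build_cache(loop_body), start=217):
    print(f"{index} ${long_value >> 16:04X}_{long_value & 0xFFFF:04X}")
```

The printed lines should match the eight entries of the dump, index 217 to 224.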
The gain in speed is about a factor of 1.8.
The relation of run times for the Fibonacci benchmark is:
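The factor can be checked from the two lap measurements. The following Python sketch only redoes the arithmetic on the numbers printed in this post (cycle counts, 200MHz clock, 1,000,000 iterations); nothing in it is measured anew.

```python
# Sketch only: recompute the speed gain from the lap output in this post.
CLOCK_HZ = 200_000_000
ITERATIONS = 1_000_000

forth_cycles = 280_058_648     # lap emptyLoop2
cs_cycles = 157_044_824        # lap csEmptyLoop2

def cycles_to_ns(cycles):
    """Convert a cycle count to nanoseconds at CLOCK_HZ."""
    return cycles * 1_000_000_000 // CLOCK_HZ

speedup = forth_cycles / cs_cycles

print("Forth   :", cycles_to_ns(forth_cycles), "ns,",
      round(forth_cycles / ITERATIONS), "cycles per iteration")
print("csForth :", cycles_to_ns(cs_cycles), "ns,",
      round(cs_cycles / ITERATIONS), "cycles per iteration")
print("speedup :", round(speedup, 2))
```

Per iteration this is roughly 280 cycles for the threaded loop against roughly 157 cycles for the cached call sequence.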
cog-assembler : hub-assembler : csForth : Forth
42% : 100% : 302% : 590%
One nice thing is that the code (words instead of longs) is still compact, as the cache is reused.
Comments, suggestions welcome!
Have Fun Christof
Comments
Added short constants, which must be in the range 0...511 (as opposed to 0...1023 in Taqoz).
Added if...else...then.
So now this tiny just-in-time compiler seems to have grown into something which can be useful, I think.
Added "codeWords", which displays all words that can be used inside the loop. Quite a lot: about 260.
You can add your own code words in LUT.
Those that reside in COG or LUT will give the speed.
TRACE/UNTRACE should not be used, as the tracing registers are used.
You have to be sure that you do not use threaded Forth words, because they will be interpreted as markers used to calculate jump destinations. ELSE works this way. As the tiny JIT compiler is written in assembler and meant to be fast, there are no checks.
Have fun!
Christof