Pondering about another Forth for P2

Christof Eb. · 2025-02-11 11:05

So while I do appreciate Taqoz, because it is optimized for P2, fast, compact and complete, I nevertheless ponder and wonder, if there could be an alternate way.

Motivations
1. It would be nice, if we could use drivers, others have written and placed in OBEX, so an integration into FlexProp together with SPIN or C files would be nice. Forth as a script language inside FlexProp.
2. Because Taqoz holds both stacks in LUT, it is not really suitable to do classical PAUSE multitasking. You have to swap loads of longs in and out of LUT and COG ram.
3. It would be nice to have an ANSI Forth. Because of portability and documentation.
4. I wonder, if the reason that Peter did not use certain elements of P2, was because they came late in P2's development? For example it seems natural to use PUSHA and POPA to implement the data stack.

So one direction I am considering is porting a Forth that is written in C. Unfortunately my first candidate, the great ESP32Forth by Brad Nelson, was a dead end, because it uses elements of C++ and is not documented well enough for me to find workarounds.
Out of pForth, Atlast, YAFFA, lbForth, ceForth and cForth, the last one, cForth by Mitch Bradley, seems attractive because it is very complete. So far I have not understood how to bring its many source files into a structure that can be compiled with FlexProp.

Another idea is to make a Forth compiler for the FlexProp nucode machine. As this uses only one stack, this is not straightforward and would mean a lot of overhead for each subroutine level. I am not sure whether you can mix nucode with native code.

A totally different thing would be a new direct threaded Forth implementation:
1. Use direct threaded or subroutine threaded code, that can inline small core words like Forth09 https://colorcomputerarchive.com/repo/Documents/Manuals/Programming/Forth09 (D.P. Johnson).pdf . Inlining is limited by a size limit per word. IP is the program counter. Code is machine code. The idea would be, that this enables the micro cache of P2 to chime in. Downside is low code density.
2. Use PTRA for data stack (POPA, PUSHA), only TOS and NOS (2nd) in cog ram. Use PTRB for return stack (CALLB, RETB). So the stacks are in HUB Ram and can be switched easily.
3. As this machine would not be capable of calling high level code of other languages (at least I don't know how) perhaps a ServerCog could be used.
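
A minimal C sketch of point 2, to show why stacks in hub memory make a PAUSE-style switch cheap: only the top two cells are cached, so a switch spills two cells and swaps one pointer. The names `Task`, `task_init`, `task_switch` and the size `STACK_CELLS` are hypothetical, a plain C array stands in for hub RAM, and the return stack (PTRB) would be handled the same way.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_CELLS 64            /* hypothetical per-task stack depth */

/* Per-task state. Only the stack pointer needs saving, because the
   stack contents already live in memory (hub RAM on the P2). */
typedef struct {
    int32_t  cells[STACK_CELLS];  /* data stack, standing in for hub RAM */
    int32_t *sp;                  /* saved stack pointer (role of PTRA)  */
} Task;

/* State of the running task: TOS and NOS cached in "registers". */
static int32_t  tos, nos;
static int32_t *sp;               /* next free cell, grows upward like ptra++ */
static Task    *current;

static void push(int32_t x)
{
    assert(sp < current->cells + STACK_CELLS);
    *sp++ = nos;                  /* spill old NOS, the job PUSHA would do */
    nos = tos;
    tos = x;
}

static int32_t pop(void)
{
    assert(sp > current->cells);  /* catches stack underflow */
    int32_t x = tos;
    tos = nos;
    nos = *--sp;                  /* refill NOS, the job POPA would do */
    return x;
}

/* Prepare a task so that task_switch() can resume it with an empty stack. */
static void task_init(Task *t)
{
    t->cells[0] = 0;              /* placeholder NOS */
    t->cells[1] = 0;              /* placeholder TOS */
    t->sp = t->cells + 2;
}

/* Cooperative switch: spill the two cached cells, then swap one pointer. */
static void task_switch(Task *next)
{
    if (current) {
        assert(sp + 2 <= current->cells + STACK_CELLS);
        *sp++ = nos;
        *sp++ = tos;
        current->sp = sp;
    }
    current = next;
    sp  = next->sp;
    tos = *--sp;
    nos = *--sp;
}
```

After `task_init()` on two tasks, values pushed in one task are untouched by pushes and pops in the other, and come back in order once `task_switch()` returns to the first task.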

Comments? Ideas?

Cheers, Christof

Wuerfel_21 · 2025-02-11 11:32

@"Christof Eb." said:
4. I wonder, if the reason that Peter did not use certain elements of P2, was because they came late in P2's development? For example it seems natural to use PUSHA and POPA to implement the data stack.

PUSHA and POPA are just aliases for WRLONG x,ptra++ and RDLONG x,--ptra. These are capital-S Slow instructions (3..10 cycles and 9..16 cycles, respectively), so there's your motivation to use the LUT instead.

Maciek · 2025-02-11 15:17

Just two comments, probably of not much use but:

porting -> this might help to translate the code: codeconverter.io (won't help with understanding, unfortunately)
new, P2 dedicated Forth -> can you commit hundreds of hours to its development (and that's just for starters, until the feasibility study is complete)? And hundreds more when the answer is a YES?

JonnyMac · 2025-02-11 16:06

I nevertheless ponder and wonder, if there could be an alternate way.

Have you seen the book?
-- https://www.amazon.com/Irreducible-Complexity-Discovery-Chen-Hanson-Ting/dp/1096059789

I have a passing interest in Forth and I thought this might be helpful.

Christof Eb. · 2025-02-12 08:03

@JonnyMac said:

I nevertheless ponder and wonder, if there could be an alternate way.

Have you seen the book?
-- https://www.amazon.com/Irreducible-Complexity-Discovery-Chen-Hanson-Ting/dp/1096059789

I have a passing interest in Forth and I thought this might be helpful.

Thank you Jon!
Dr. Ting created several generations of his eForth together with updated versions of his book and kindly donated these packages to the public in his later years. It is very valuable for me, because the explanations are very helpful. I am trying to use his ceForth_33, which is written in C and which you can find here: https://sites.google.com/view/forth-books/home/forth-books/dr-tings-collection15 . For your "passing interest" I can recommend reading "Inside F83". F83 was/is a very fine Forth.
I have used his eP32 VHDL version which is fascinating because it shows, that you can base a Forth on a very small kernel.

@Maciek said:
Just two comments, probably of not much use but:

porting -> this might help to translate the code: codeconverter.io (won't help with understanding, unfortunately)

new, P2 dedicated Forth -> can you commit hundreds of hours to its development (and that's just for starters, until the feasibility study is complete)? And hundreds more when the answer is a YES?

Oh, oh, you are very right here.... But as written above, a Forth kernel can be very much smaller than Peter's. Also I hope to be able to use, for example, the file system from the C side. In comparison to Taqoz, things can be simplified. For example, I have added a block-file system on SD card to the eP32 VHDL Forth, which is very much simpler than Peter's compromise, and you can still do nearly anything with the simple things.

At the moment I try to experiment with ports of Forths written in C/C++. Within the Arduino world it is no big thing to port the mighty ESP32Forth for example to ARM.

Cheers Christof

Christof Eb. · 2025-02-12 08:15

@Wuerfel_21 said:

@"Christof Eb." said:
4. I wonder, if the reason that Peter did not use certain elements of P2, was because they came late in P2's development? For example it seems natural to use PUSHA and POPA to implement the data stack.

PUSHA and POPA are just aliases for WRLONG x,ptra++ and RDLONG x,--ptra. These are capital-S Slow instructions (3..10 cycles and 9..16 cycles, respectively), so there's your motivation to use the LUT instead.

Yes, data access in HUB is a burden. And I am well aware, that Peter's Forth is very well optimised for P2, he found very good compromises. It is also astonishing, that he found a sufficient size for the stacks. They have never overflowed in my code unless there was a bug. The only problem is as said, that task switch is very slow.

Wuerfel_21 · 2025-02-12 19:32

If the stack isn't close-to-full, simply swapping it in/out of LUT should come out faster than having it in hub to begin with unless you're switching tasks only to run a few instructions and switch back.
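
The trade-off can be put into a small cost model. This is only a sketch: every cycle count is a caller-supplied assumption rather than a measured P2 value, the function names are hypothetical, and it assumes that the outgoing and incoming task each have `depth` longs on the stack.

```c
#include <assert.h>

/* Cycles for one time slice when the stack lives in LUT and is
   swapped out and back in around the slice. */
static long lut_slice_cycles(long ops, long depth, long lut_op, long swap_per_long)
{
    return 2 * depth * swap_per_long + ops * lut_op;
}

/* Cycles for one time slice when the stack lives in hub RAM. */
static long hub_slice_cycles(long ops, long hub_op)
{
    return ops * hub_op;
}

/* Smallest number of stack operations per slice for which the LUT
   variant is no slower than the hub variant; -1 if it never is. */
static long break_even_ops(long depth, long lut_op, long hub_op, long swap_per_long)
{
    assert(depth >= 0 && lut_op > 0 && hub_op > 0 && swap_per_long > 0);
    if (hub_op <= lut_op)
        return -1;
    long swap = 2 * depth * swap_per_long;
    long gain = hub_op - lut_op;          /* cycles saved per stack operation */
    return (swap + gain - 1) / gain;      /* ceiling division */
}
```

With purely illustrative numbers (16 longs deep, 2 cycles per LUT access, 10 per hub access, 2 per swapped long), `break_even_ops(16, 2, 10, 2)` gives 8, so under those assumptions a slice with fewer than 8 stack operations favours the hub stack and a longer one favours LUT.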

Christof Eb. · 2025-02-14 11:11

I have now spent days trying to get one of ESP32Forth, lbForth, ceForth33 and pForth running. I managed to get all of them to compile OK, but none of them worked with FlexProp. No low-hanging fruit....

Then I decided to start my own "P2CCForth" written in C. I once read that every "Forther" sooner or later will want to create their own system. :-)
And this is FUN! Within one day I had some first primitive words and an inner interpreter working. It works with a switch-case structure, which is compiled into jmprel. It can already run the Fibonacci benchmark, and it is just as fast as the YAFFA that I have used for years on ESP32. Speed comparison:
P2CCForth : TaQoz : C_with_FCACHE = 1:4:24
The structure is planned in a way that inline assembler should be possible and you can always resort to create additional "primitive" words, which are written as C functions and therefore can also have inline assembler.
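
To illustrate what a switch-based inner interpreter looks like, here is a minimal C sketch that runs a recursive Fibonacci word. It is not the P2CCForth source: the opcode names, the token layout and the function `run` are assumptions made for this example, and there are no stack overflow checks, so it is only meant for small arguments.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_LIT, OP_DUP, OP_SWAP, OP_ADD, OP_SUB, OP_LESS,
       OP_0BRANCH, OP_CALL, OP_EXIT, OP_HALT };

/* : fib ( n -- f )  dup 2 < if exit then
                     dup 1 - recurse  swap 2 - recurse  + ;        */
static const int32_t fib_code[] = {
    /*  0 */ OP_CALL, 3, OP_HALT,              /* entry: call fib, stop  */
    /*  3 */ OP_DUP, OP_LIT, 2, OP_LESS,
    /*  7 */ OP_0BRANCH, 10, OP_EXIT,          /* n < 2: return n        */
    /* 10 */ OP_DUP, OP_LIT, 1, OP_SUB, OP_CALL, 3,
    /* 16 */ OP_SWAP, OP_LIT, 2, OP_SUB, OP_CALL, 3,
    /* 22 */ OP_ADD, OP_EXIT
};

/* Inner interpreter: fetch a token, dispatch through one switch. */
static int32_t run(const int32_t *code, int32_t arg)
{
    int32_t ds[64]; int dsp = 0;               /* data stack   */
    int32_t rs[64]; int rsp = 0;               /* return stack */
    int32_t ip = 0, a, b;

    ds[dsp++] = arg;
    for (;;) {
        switch (code[ip++]) {
        case OP_LIT:  ds[dsp++] = code[ip++]; break;
        case OP_DUP:  a = ds[dsp - 1]; ds[dsp++] = a; break;
        case OP_SWAP: a = ds[dsp - 1]; ds[dsp - 1] = ds[dsp - 2]; ds[dsp - 2] = a; break;
        case OP_ADD:  b = ds[--dsp]; ds[dsp - 1] += b; break;
        case OP_SUB:  b = ds[--dsp]; ds[dsp - 1] -= b; break;
        case OP_LESS: b = ds[--dsp]; ds[dsp - 1] = (ds[dsp - 1] < b) ? -1 : 0; break;
        case OP_0BRANCH:                        /* jump if flag is zero   */
            a = code[ip++];
            if (ds[--dsp] == 0) ip = a;
            break;
        case OP_CALL: rs[rsp++] = ip + 1; ip = code[ip]; break;
        case OP_EXIT: ip = rs[--rsp]; break;
        case OP_HALT: return ds[--dsp];
        default:      assert(!"unknown opcode"); return 0;
        }
    }
}
```

Calling `run(fib_code, 10)` returns 55. A new primitive is one more `case`, which is also where a call into a C function could go.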

Some features, that I want to implement:

Fixed sizes. They simplify things very much, because you directly know where to find things. For example, I want to go for a 20-byte name length, so only the parameter field is variable length in the dictionary.
Use SD storage with large 16kB (or a bit more?) fixed size Blocks: Compiled Dictionary Overlays, one as App-Overlay and one as Tools-Overlay. Source Text Blocks.
Ready for PAUSE cooperative multitasking, 8cog multitasking, local variables.
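
The fixed-size idea can be sketched in C as follows. Only the 20-byte name field comes from the list above; the `Header` layout, the separate `params` area, the table sizes and the functions `dict_add` and `dict_find` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_LEN  20              /* fixed name field, as proposed above */
#define MAX_WORDS 256             /* hypothetical table sizes            */
#define MAX_CELLS 4096

/* Every header has the same size, so header i sits at a known place. */
typedef struct {
    char     name[NAME_LEN];      /* zero padded, no terminator needed   */
    uint16_t flags;               /* hypothetical, e.g. an immediate bit */
    uint16_t param_cells;         /* length of the parameter field       */
    uint32_t param_start;         /* index of its first cell in params[] */
} Header;

static Header  headers[MAX_WORDS];
static int32_t params[MAX_CELLS]; /* the only variable-length part       */
static int     n_words, n_cells;

/* Append a word; returns its index or -1 if it does not fit. */
static int dict_add(const char *name, const int32_t *body, int len)
{
    size_t n = strlen(name);
    assert(body != NULL || len == 0);
    if (n > NAME_LEN || len < 0 || n_words >= MAX_WORDS || n_cells + len > MAX_CELLS)
        return -1;

    Header *h = &headers[n_words];
    memset(h->name, 0, NAME_LEN);
    memcpy(h->name, name, n);
    h->flags       = 0;
    h->param_cells = (uint16_t)len;
    h->param_start = (uint32_t)n_cells;
    if (len > 0)
        memcpy(&params[n_cells], body, (size_t)len * sizeof(int32_t));
    n_cells += len;
    return n_words++;
}

/* Search newest first, so that a redefinition hides the older word. */
static int dict_find(const char *name)
{
    char   key[NAME_LEN] = {0};
    size_t n = strlen(name);
    if (n > NAME_LEN)
        return -1;
    memcpy(key, name, n);
    for (int i = n_words - 1; i >= 0; i--)
        if (memcmp(headers[i].name, key, NAME_LEN) == 0)
            return i;
    return -1;
}
```

Because each name is padded to the full field, lookup is a fixed-length `memcmp` per entry and no length byte or pointer chasing is needed.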

bob_g4bby · 2025-02-14 11:33

@"Christof Eb" that sounds very good! So a X4 speed increase on Taqoz sounds useful

Christof Eb. · 2025-02-15 06:21

@bob_g4bby said:
@"Christof Eb" that sounds very good! So a X4 speed increase on Taqoz sounds useful

No, no, it's 4 times SLOWER. :-)
Taqoz is very well optimized (over years!) for P2 and completely written in assembler; you cannot hope to beat that with something written in C. Its strength, on the other hand, will be the ability to use features that are accessible via FlexC, for example the virtual file system.

If you are looking for a way to gain speed in Taqoz, then you might have a look at: https://forums.parallax.com/discussion/175463/p2-taqoz-v2-8-csforth-cached-call-sequences-for-speed-of-innermost-loops#latest With this method, you can get a speed improvement for Taqoz code of a factor of 1.8 or so in some cases.