Quick Gains converting P1 code to P2 code - pfth example — Parallax Forums


mindrobots Posts: 6,506
edited 2014-03-08 19:17 in Propeller 2
Since I finally got it working, I can report on some of the low-hanging-fruit gains from migrating P1 code to P2 architecture features. Were I more experienced, these would have been quicker and easier gains.

I started with Dave Hein's wonderful pfth.spin, which he had done a quick fixup on to get it onto the P2, and made the following changes:

1) Dave's bit-banged serial I/O was converted to use SERIN/SEROUT
2) the Forth return stack was changed from a HUBRAM structure to AUXRAM, based off PTRY
3) the Forth data stack was changed from a HUBRAM structure to AUXRAM, based off PTRX

Before the changes, pfth used 487 longs of COGRAM and, with the .fth files I loaded, had 16286 bytes free in the dictionary.

After my changes, it used 434 longs of COGRAM and, loading the same .fth files, had 17314 bytes free in the dictionary.

The net gain was 53 longs of COGRAM and 1028 extra bytes in the dictionary, plus the serial I/O will go pretty much as fast as your FTDI can handle - I was running near 1 Mbit without problems.

EDIT1: There was a stack problem.
EDIT2: OK, the stack problem is fixed!!
I've attached my hacked version if anyone wants to play with it. It is currently set to run serial I/O at 230400 baud. (It still needs cleaning up.)

There are other optimizations that can be made - using the P2 multiply and divide instructions is an obvious one - others will have to wait as I learn more about the P2.

I need to benchmark pfth running on the P1 versus the P2, and also the two different P2 implementations. I would think some speed is picked up by using PTRX/PTRY for the stacks, since each push/pop is now one instruction instead of two.

Thoughts:
256 longs is a lot of stack space for a Forth - if there were four PTRs into AUXRAM (PTRW/PTRX/PTRY/PTRZ), you could use it for stacks AND an I/O buffer. Tasking gives you more opportunities to use this space for stacks!
The SERA/SERB are wonderfully powerful, save a lot of code space in the COG, and are really simple to use.
Having the AUXRAM is another great benefit for saving space and increasing speed.
There's a lot more to learn and play with!! :smile:

Thanks for the P2, Chip!!!

Comments

  • Dave Hein Posts: 6,347
    edited 2014-03-06 20:51
    I think the problem with underflows is that the data stack will wrap around to the return stack, which will clobber the return address. The unmodified version of pfth initializes the data stack to a depth of 4. This allows for the underflows to occur without crashing the interpreter. After the interpreter processes a line it will check for an underflow and reset the stack pointer back to a depth of 4.
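    The guard-depth trick Dave describes can be sketched off-target. This is a model of the behavior in Python, not pfth's actual code; the depth of 4 and the reset-after-each-line policy are taken from his description:

```python
# Sketch (not pfth's code): a data stack pre-seeded with guard cells so that
# underflows consume zeros instead of clobbering whatever lies below the
# stack in memory (e.g. a return address on an adjacent return stack).

class DataStack:
    GUARD_DEPTH = 4

    def __init__(self):
        self.cells = [0] * self.GUARD_DEPTH  # guard cells absorb underflow

    def push(self, v):
        self.cells.append(v)

    def pop(self):
        # An underflow dips into the guard cells; only an extreme
        # underflow (past all guards) would return the fallback 0 here.
        return self.cells.pop() if self.cells else 0

    def reset_if_underflowed(self):
        # Called after each interpreted line, per Dave's description:
        # check for underflow and reset the pointer back to depth 4.
        if len(self.cells) < self.GUARD_DEPTH:
            self.cells = [0] * self.GUARD_DEPTH

s = DataStack()
s.pop(); s.pop()          # two underflows eat into the guard cells
s.reset_if_underflowed()  # stack restored to depth 4
```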
  • mindrobots Posts: 6,506
    edited 2014-03-07 07:33
    Thanks, Dave!

    I found the stack problem and fixed it. Now the Spin code should be a fully functional pfth interpreter for the P2, with my changes as outlined above.
  • Dave Hein Posts: 6,347
    edited 2014-03-07 11:18
    I'll give it a try when I have a chance. It will be interesting running a time test on the P1, the original P2 version and your modified version. I expect your modified version should be quite a bit faster since every Forth word accesses the stack.
  • mindrobots Posts: 6,506
    edited 2014-03-07 12:29
    Dave Hein wrote: »
    I'll give it a try when I have a chance. It will be interesting running a time test on the P1, the original P2 version and your modified version. I expect your modified version should be quite a bit faster since every Forth word accesses the stack.

    I'll have to see if Peter's iterative FIBO works on pfth (it should). Then I can get some numbers.

    Since we have PUSH/POP instructions now, the call/ret sequences can be removed and the PUSH/POP can just be put inline which should speed it up even more.

    I see now why people write their own forths or start hacking forth kernels! :smile: Your code has been a big help to me for brushing up on my PASM and getting back into this. Thanks!
  • mindrobots Posts: 6,506
    edited 2014-03-07 13:31
    Quick and dirty benchmarks. I added a getcnt word to pfth that just grabs the current counter value and puts it on the stack. Here are the crude results using Peter's iterative fibo benchmark:

    : fibo 0 1 rot 0 over if do over + swap loop else 2drop then drop ;
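    For anyone following along without a board, here's a host-side model (plain Python stepping the same stack operations, assuming standard rot/over/do...loop semantics) of what that definition leaves on the stack:

```python
# A Python model of Peter's iterative fibo word, one stack op per line:
# : fibo 0 1 rot 0 over if do over + swap loop else 2drop then drop ;

def fibo(stack):
    stack.append(0)                      # 0
    stack.append(1)                      # 1
    stack.append(stack.pop(-3))          # rot   (bring n to the top)
    stack.append(0)                      # 0     (loop start index)
    stack.append(stack[-2])              # over  (copy n as the loop limit)
    if stack.pop():                      # if    (consumes the copied n)
        index = stack.pop()              # do    ( limit index -- )
        limit = stack.pop()
        while index < limit:             # loop body: over + swap
            stack.append(stack[-2])                    # over
            stack.append(stack.pop() + stack.pop())    # +
            stack[-1], stack[-2] = stack[-2], stack[-1]  # swap
            index += 1
    else:
        stack.pop(); stack.pop()         # 2drop (n was 0)
    stack.pop()                          # drop  (discard fib(n-1))
    return stack

print(fibo([26])[-1])                    # prints 121393 = fib(26)
```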

    classic pfth on the DE0 Nano running the Feb 6 emulation:
    getcnt 26 fibo . getcnt swap - . cr
    149016 ticks

    pfth with the SERA and AUXRAM stack changes (same Nano, same emulation):
    getcnt 26 fibo . getcnt swap - . cr
    94808 ticks

    The stack changes allowed it to run in 63% of the time of the "classic" version - since Forth is so stack-heavy, this is an important gain.

    I haven't run this on a P1 yet. I guess for now, a tick to tick comparison is meaningful.


    EDIT: Some additional numbers. I went through and changed a number of calls to the pop and push routines into inline POPX/PUSHX instructions, since there's no code penalty and you get rid of the CALL/RET overhead. This version is attached below.

    Here are the numbers for this additional modification:

    getcnt 26 fibo . getcnt swap - . cr
    91656

    down to 61.5% now - oh, well!
  • David Betz Posts: 14,511
    edited 2014-03-08 13:48
    It would be interesting to see the results of running some other benchmark on pfth. The iterative fibo is essentially useless as a benchmark because it is really just a simple loop. What you want is something that more closely approximates what a real program would do. The recursive fibo was a bit better in that regard because it contained lots of function calls, which most real programs also contain. I suppose even defining a "multiply" word would better approximate this, since it would call a word on every iteration of the loop rather than just using the intrinsic multiply operator that probably expands to inline code. How do pfth and Tachyon compare if you replace the * operator with an invocation of this word?

    : mult * ;

    Of course, maybe these Forth systems are smart enough to inline this word. I'm not really sure how much optimization Forth implementations do these days.
  • mindrobots Posts: 6,506
    edited 2014-03-08 15:08
    David Betz wrote: »
    It would be interesting to see the results of running some other benchmark on pfth. The iterative fibo is essentially useless as a benchmark because it is really just a simple loop. What you want is something that more closely approximates what a real program would do. The recursive fibo was a bit better in that regard because it contained lots of function calls, which most real programs also contain. I suppose even defining a "multiply" word would better approximate this, since it would call a word on every iteration of the loop rather than just using the intrinsic multiply operator that probably expands to inline code. How do pfth and Tachyon compare if you replace the * operator with an invocation of this word?

    : mult * ;

    Of course, maybe these Forth systems are smart enough to inline this word. I'm not really sure how much optimization Forth implementations do these days.

    Hi David,

    I think this was a valid benchmark for apples-to-apples comparisons of the changes that were made to the stack implementations. Neither this Forth (nor any others that I recall) has a mechanism to inline code - that would pretty much break the Forth model of a threaded language.

    In most Forths, there is a set of core words that are implemented in the processor's native language. All other words are built on these or on Forth words that have been built before them.

    In the case of the fibo definition:
    : fibo 0 1 rot 0 over if do over + swap loop else 2drop then drop ;

    0 pushes a number on the stack
    1 likewise
    rot executes the rot word which was previously defined as: : rot 2 roll ; (here, 2 is pushed on the stack and roll is a core word)
    0 pushes 0 to the stack
    over executes the over word previously defined as : over 1 pick ;
    if executes the code that was compiled into the fibo definition by the immediate word if when fibo was defined: : if _jz compile, here 2 allot ; immediate
    ...and so on

    As you can see, when fibo is executed, each piece of its definition is in essence a Forth "function call", because the pieces have been threaded together to make up the fibo definition. So while it appears to be a simple loop, it compiles to much more than that, and the Forth interpreter is busy pushing and popping the data stack and return stack as it runs through this thread, either executing (ultimately) the core functions or threading through other words that have been invoked, as rot and over have been above.

    So based on how the forth inner interpreter (the forth VM) works this is a great benchmark for seeing how fundamental VM changes impact the speed of overall program execution. When you look at the inner interpreter and the core words, they are mostly stack operations.
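    That threading model can be shown with a toy inner interpreter. This is a Python model (the word names here are invented for illustration, and pfth's real inner loop is PASM, not this), but the dispatch structure is the same:

```python
# A toy model of a threaded-code inner interpreter: a compiled word is a
# list of "execution tokens", and the inner loop does nothing but fetch a
# token and dispatch through it.

stack = []

# Core words: implemented in the "native" language (Python here, PASM in pfth).
def w_dup():
    stack.append(stack[-1])

def w_add():
    stack.append(stack.pop() + stack.pop())

def execute(word):
    # The inner interpreter: thread through the definition, dispatching
    # each token in turn.
    for token in word:
        if isinstance(token, list):
            execute(token)   # non-core word: another layer of threading,
        else:                # i.e. the "function call" discussed above
            token()          # core word: straight to native code

double = [w_dup, w_add]      # : double dup + ;
quad   = [double, double]    # : quad double double ;  (a thread of threads)

stack.append(5)
execute(quad)
print(stack.pop())           # prints 20
```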

    The next change I was looking at was using PTRA/PTRB as program-counter pointers for the inner interpreter, since most of what it does is run through the threads, indirectly executing the word pointed to from a thread list. This could potentially speed up the inner interpreter, but as the dictionary is currently structured, I really need pushx/y and popx/y to work with PTRA/B (which they don't), and I also need a way to do an indirect jump via PTRA/B, which I don't think I can do. As the architecture of the P2 and pfth exist right now, I think the inner interpreter is about as fast as it can get.

    I need these instructions:

    pushx/y ptra
    popx/y ptra
    and something like jmp *ptra, where I can jump to the address stored at the location ptra points to (I don't think that exists, or I don't know how to code it). I'm still looking, because that is where the big speed gain is.

    Back to your point about function calls, for a forth, anything that causes a non core word to be executed is essentially a function call since it invokes another layer of threaded code to start execution on the VM. The execution of a core word is basically in-lining code since it just causes a jump to the PASM code defining that core word.

    In pfth at least, * is a core word, so defining it as : mult * ; would cause any execution of mult to execute the mult thread, which would then execute the * core word. So you do end up adding a layer of function execution. I could change + to add, define : add + ;, and run the fibo benchmarks again, I guess.

    *sorry if anyone got lost in the interpreter description, I've been looking at it so much I can see it in my head but may not be able to put it into words.
  • Dave Hein Posts: 6,347
    edited 2014-03-08 19:01
    I don't think any of the Prop Forth interpreters is smart enough to recognize that mult contains only a core word, and would optimize out the function call. Some of the professional Forth interpreters (i.e., the ones that you have to pay for) may be smart enough to do that. The Fast Forth interpreter that I dabbled with a few months ago defined all the core words as simple compiled words, similar to the mult example. If I ever get back to tinkering with Fast I would probably add the ability to recognize words containing a single core word, and just substitute the core word instead.

    Actually I'm thinking about porting the Fast interpreter to P2 just to see how much faster it would be. The problem with both the pfth and PropForth interpreters is that the inner loop reads an execution token from hub ram, and then it reads a cog jump address from hub ram. This double read causes a hub stall. Maybe a cached read would reduce the number of times it stalls. The inner loops of the Fast and Tachyon interpreters only do a single read.
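    The double-read versus single-read difference can be illustrated with a trivial tally. This Python sketch only counts hub accesses per dispatched word (the token names and addresses are made up; it doesn't model the stall timing at all):

```python
# Hypothetical model: count hub-RAM reads per dispatched word.
# pfth-style: the thread holds execution tokens, and each token must be
# dereferenced again in hub RAM to get the cog jump address (2 reads/word).
# Fast/Tachyon-style: the thread holds the dispatch target directly
# (1 read/word), avoiding the second, stall-prone hub access.

def pfth_style_reads(thread, xt_table):
    reads = 0
    for token in thread:
        reads += 1                 # hub read 1: fetch the execution token
        _target = xt_table[token]
        reads += 1                 # hub read 2: fetch the cog jump address
    return reads

def fast_style_reads(thread):
    reads = 0
    for _target in thread:
        reads += 1                 # the token IS the jump target
    return reads

xt_table = {"dup": 0x10, "add": 0x14, "exit": 0x18}  # invented addresses
thread = ["dup", "add", "dup", "add", "exit"]

print(pfth_style_reads(thread, xt_table))   # prints 10
print(fast_style_reads(thread))             # prints 5
```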
  • David Betz Posts: 14,511
    edited 2014-03-08 19:17
    mindrobots wrote: »
    Hi David,

    I think this was a valid benchmark for apples-to-apples comparisons of the changes that were made to the stack implementations. Neither this Forth (nor any others that I recall) has a mechanism to inline code - that would pretty much break the Forth model of a threaded language.

    In most Forths, there is a set of core words that are implemented in the processor's native language. All other words are built on these or on Forth words that have been built before them.

    In the case of the fibo definition:
    : fibo 0 1 rot 0 over if do over + swap loop else 2drop then drop ;

    0 pushes a number on the stack
    1 likewise
    rot executes the rot word which was previously defined as: : rot 2 roll ; (here, 2 is pushed on the stack and roll is a core word)
    0 pushes 0 to the stack
    over executes the over word previously defined as : over 1 pick ;
    if executes the code that was compiled into the fibo definition by the immediate word if when fibo was defined: : if _jz compile, here 2 allot ; immediate
    ...and so on

    As you can see, when fibo is executed, each piece of its definition is in essence a Forth "function call", because the pieces have been threaded together to make up the fibo definition. So while it appears to be a simple loop, it compiles to much more than that, and the Forth interpreter is busy pushing and popping the data stack and return stack as it runs through this thread, either executing (ultimately) the core functions or threading through other words that have been invoked, as rot and over have been above.

    So based on how the forth inner interpreter (the forth VM) works this is a great benchmark for seeing how fundamental VM changes impact the speed of overall program execution. When you look at the inner interpreter and the core words, they are mostly stack operations.

    The next change I was looking at was using PTRA/PTRB as program-counter pointers for the inner interpreter, since most of what it does is run through the threads, indirectly executing the word pointed to from a thread list. This could potentially speed up the inner interpreter, but as the dictionary is currently structured, I really need pushx/y and popx/y to work with PTRA/B (which they don't), and I also need a way to do an indirect jump via PTRA/B, which I don't think I can do. As the architecture of the P2 and pfth exist right now, I think the inner interpreter is about as fast as it can get.

    I need these instructions:

    pushx/y ptra
    popx/y ptra
    and something like jmp *ptra, where I can jump to the address stored at the location ptra points to (I don't think that exists, or I don't know how to code it). I'm still looking, because that is where the big speed gain is.

    Back to your point about function calls, for a forth, anything that causes a non core word to be executed is essentially a function call since it invokes another layer of threaded code to start execution on the VM. The execution of a core word is basically in-lining code since it just causes a jump to the PASM code defining that core word.

    In pfth at least, * is a core word, so defining it as : mult * ; would cause any execution of mult to execute the mult thread, which would then execute the * core word. So you do end up adding a layer of function execution. I could change + to add, define : add + ;, and run the fibo benchmarks again, I guess.

    *sorry if anyone got lost in the interpreter description, I've been looking at it so much I can see it in my head but may not be able to put it into words.
    Thanks for the explanation. I'm familiar with the way a traditional Forth works. I just thought that the ones for the Propeller did more to inline core words but maybe I was mistaken. Your proposed modifications sound interesting. I'm looking forward to hearing about how they improve performance.

    Thanks,
    David