Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt!
Peter Jakacki
I've been a bit of a late starter with any P2 code but I've cobbled together a kernel that I have been using to learn a bit more about the P2 instruction set and the way PNut compiles code. Although I am not up and running yet it may be mostly a matter of porting much of the high level byte code across and adjusting to suit PNut. I've also found that some of the little condition code tricks we used on P1 don't work the same on P2.
Here is some high level bytecode compiled in PNut that I have used for a simple test:
This is how it would look if compiled by Forth normally
: DEMO
  BEGIN
    $0D EMIT $0A EMIT
    GETCNT GETCNT SWAP - PRTNUM
    $20 EMIT
    $21 BEGIN DUP EMIT 1+ DUP $7E = UNTIL DROP
  AGAIN
;
and this is how it is actually constructed to compile inside of PNut
        orgh
        byte    "Tachyon",0
Tachyon byte    _BYTE/4,$0D,EMIT/4,_BYTE/4,$0A,EMIT/4
        byte    _GETCNT/4,_GETCNT/4,SWAP/4,MINUS/4
        byte    PRTNUM/4,_BYTE/4,$20,EMIT/4
        byte    _BYTE/4,$21
lbl0    byte    DUP/4,EMIT/4,INC/4
        byte    DUP/4,_BYTE/4,$7E,_EQ/4,_UNTIL/4,lbl2-lbl0
lbl2    byte    DROP/4,_AGAIN/4,lbl1-Tachyon
lbl1
and this is part of the output I get:
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
So between the two _GETCNTs and the stacking it takes $31 (49) cycles, which at 50MHz and 2 clocks/instruction (IIRC) is 49 x 2 / 50MHz = 1.96us. That doesn't look too bad considering I am only testing functionality and won't optimize it until it says "ok".
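The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch that reuses the numbers quoted in the post (49 cycles, 2 clocks per cycle "IIRC", 50MHz); the function name `elapsed_us` is made up for illustration.

```python
# Assumptions taken from the post: 49 counted cycles, 2 clocks each, 50 MHz clock.
CYCLES = 0x31            # 49, the difference printed between the two _GETCNTs
CLOCKS_PER_CYCLE = 2     # "2 clocks/instruction IIRC" in the post
CLOCK_HZ = 50_000_000    # 50 MHz clock

def elapsed_us(cycles, clocks_per_cycle=CLOCKS_PER_CYCLE, clock_hz=CLOCK_HZ):
    """Convert a cycle count into microseconds."""
    return cycles * clocks_per_cycle * 1_000_000 / clock_hz

print(f"{elapsed_us(CYCLES):.2f} us")
```

If the counter actually ticks once per clock rather than once per two clocks, pass `clocks_per_cycle=1` and the figure halves.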
This is the bytecode interpreter loop:
doNEXT  rdbyte  instr,PTRA++    'read byte code instruction
        shl     instr,#2 wc
        jmp     instr           'execute the code by directly indexing the first 256 longs in cog
BTW, the byte-aligned addresses necessitate the extra step of shifting the bytecode value left by 2 to get the correct address to jump to. They also mess up the PNut source, as I have to use /4 after every bytecode reference. But I will work with what I've got until it is up and running.
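For readers who want to see the fetch, shift, jump cycle in isolation, here is a toy Python model of it. It is a sketch, not the Tachyon kernel: the handler addresses, the `HALT` word and the dictionary used as a jump table are all invented for illustration. The point it shows is why each bytecode is stored as `address/4` and shifted left by 2 before the jump.

```python
# Toy model of the doNEXT loop: fetch a byte, shift left by 2, jump.
# Handler addresses and the HALT word are hypothetical, not from the kernel.

def op_byte(state, code):
    """_BYTE: push the inline literal that follows the bytecode."""
    state["stack"].append(code[state["ip"]])
    state["ip"] += 1

def op_emit(state, code):
    """EMIT: pop a character code and append it to the output."""
    state["out"].append(chr(state["stack"].pop()))

def op_halt(state, code):
    """HALT: model-only word that stops the loop."""
    return False

# Byte-aligned handler addresses, so each bytecode is address/4.
_BYTE, EMIT, HALT = 0x10, 0x14, 0x18
HANDLERS = {_BYTE: op_byte, EMIT: op_emit, HALT: op_halt}

def run(bytecode, handlers, max_steps=100):
    """Execute bytecode until a handler returns False or max_steps is hit."""
    state = {"ip": 0, "stack": [], "out": []}
    for _ in range(max_steps):
        instr = bytecode[state["ip"]]          # rdbyte instr,PTRA++
        state["ip"] += 1
        addr = instr << 2                      # shl instr,#2
        if handlers[addr](state, bytecode) is False:   # jmp instr
            break
    return state

program = bytes([_BYTE // 4, 0x4F, EMIT // 4,
                 _BYTE // 4, 0x4B, EMIT // 4,
                 HALT // 4])
result = run(program, HANDLERS)
print("".join(result["out"]))
```

With long-addressed cog memory the `// 4` in the program and the shift in `run` would both disappear, which is the cleanup the post is looking forward to.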
Comments
jmprel x
opcode, that will be good for case-table jumps like this.
I guess that can jump to any of LUT or COG ?
(and makes another strong case for my LUT:COG ordering, to avoid gaps in jumps )
RDFAST #0,startbyteaddress
Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0 = infinite). To make wrapping work, startbyteaddress must be long-aligned.
WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.
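The wrapping rule described above can be pictured with a small Python model. This is a sketch of the behaviour as stated in this post only (64-byte blocks, 0 meaning no wrap, long-aligned start when wrapping); the class name `FastReader` and its methods are hypothetical, and nothing here models FIFO timing.

```python
class FastReader:
    """Toy model of RDFAST/RFBYTE sequential reads with optional wrapping."""

    BLOCK = 64  # bytes per block, per the description in the post

    def __init__(self, hub, start, blocks=0):
        # Wrapping needs a long-aligned start address, as stated in the post.
        if blocks and start % 4:
            raise ValueError("start must be long-aligned for wrapping")
        self.hub = hub
        self.start = start
        self.length = blocks * self.BLOCK   # 0 means never wrap
        self.offset = 0

    def rfbyte(self):
        """Return the next byte and advance, wrapping if a block count was set."""
        value = self.hub[self.start + self.offset]
        self.offset += 1
        if self.length and self.offset == self.length:
            self.offset = 0
        return value

hub = bytes(range(256))                 # stand-in for hub RAM
reader = FastReader(hub, start=64, blocks=1)
values = [reader.rfbyte() for _ in range(66)]
print(values[0], values[63], values[64])
```

In the model a reader created with `blocks=0` simply keeps walking forward through `hub`; real hardware limits and timing are outside what this sketch covers.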
Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?
I have been suffering the pain of "adocumentation": I see all these wonderful instructions in the summary but am not at all sure what they do. Some of these have changed from P2-hot and the descriptions of many others are buried in a myriad of tangled posts. But as I said I will work with what I've got to see what I can do, although I am looking forward to the new image with long-addressed cog memory etc.
As for RDFAST, I will have to think about how I can use this feature, although I do want to achieve functionality first so that I can have an interactive development and test environment with an SD filesystem. Once I write an inline assembler I can then play with these enhancements and get a feel for what will work. Also, I think that thanks to hubexec there will of course be no problem in having PASM code definitions, and also in having PASM mixed with bytecode.
Overall I'm pumped even though I don't expect silicon for a good year, so here's looking at making this a good year!
BTW, I have my kernel mostly running now after which I will add the high level bytecode.
We should be able to get a number of interpreters working fast!
And with the LUT for extra code it should permit the interpreters to perform fast too!
This is all true.
But does execution after a RDFAST continue immediately (in which case a too-early RFBYTE would need to block) or does RDFAST wait until the FIFO begins to fill from the hub before continuing?
PS: Also, I think the streamer is a separate DMA engine from the FIFO's DMA engine. They can both run concurrently.
Not to be too deterred I am now working on a version that gets rid of the vector table and compiles 16-bit addresses in place of the bytecode. So that means a Forth instruction can jump to code anywhere in the first 64k, be that cog, lut, or hub. So high level definitions such as colon defs will have a CALL to the colon interpreter to stack the IP and load it up with the new address. All this is rather similar to a more conventional 16-bit (address) Forth as we now have a much larger code space to work from as we also have hubexec. Code outside of 64k can still be called by jumping via an instruction in the first 64k or I may just insist that code is long aligned so that I can address 256k of code directly with a 16-bit word. So all definitions are CODE definitions by default. This should make for a pretty snappy but still compact Forth.
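A rough Python model of this "word-code" scheme may help. It is a sketch under stated assumptions: the addresses, the `EXIT` word and the dictionary lookup that tells colon definitions from code words are all invented here (the post instead has each colon definition start with a CALL to the colon interpreter). It shows the two ideas from the paragraph: stacking the IP when a colon definition is entered, and how long alignment stretches a 16-bit word to cover 256k.

```python
EXIT = 0x0000   # hypothetical address of the word that ends a colon definition
DUP, PLUS = 0x0100, 0x0104          # hypothetical code-word addresses
DOUBLE, QUAD = 0x4000, 0x4010       # hypothetical colon-definition addresses

def do_dup(data):
    data.append(data[-1])

def do_plus(data):
    data.append(data.pop() + data.pop())

PRIMS = {DUP: do_dup, PLUS: do_plus}
COLONS = {
    DOUBLE: [DUP, PLUS, EXIT],          # : DOUBLE DUP + ;
    QUAD:   [DOUBLE, DOUBLE, EXIT],     # : QUAD DOUBLE DOUBLE ;
}

def run(entry, prims, colons, data):
    """Run the colon definition at `entry` and return the data stack."""
    rstack = []                         # saved (body, index) pairs: the stacked IP
    body, ip = colons[entry], 0
    while True:
        addr = body[ip]                 # fetch the next 16-bit address
        ip += 1
        if addr == EXIT:
            if not rstack:
                return data
            body, ip = rstack.pop()     # resume the caller
        elif addr in colons:            # colon definition: stack IP, load new one
            rstack.append((body, ip))
            body, ip = colons[addr], 0
        else:                           # code word: run it directly
            prims[addr](data)

def long_aligned_target(word):
    """Byte address reached by a 16-bit word if all code is long-aligned."""
    return word << 2

print(run(QUAD, PRIMS, COLONS, [3]))
print(hex(long_aligned_target(0xFFFF)))
```

The last line illustrates the long-aligned option: the largest 16-bit word reaches byte address $3FFFC, just under 256k, instead of stopping at 64k.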
I may have some time later tonight to fire up the changes I have been making.
EDIT: My demo code works!!!
There is a list of labels available in pnut.exe. IIRC it's something like Ctl-G to switch to that view.
vs BST listing for the equivalent:
I can't see how I can use RDFAST though as I can't really make use of sequential access.
At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun. I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.
I am convinced that Forth will make learning assembly on the Propeller 2 much easier... by offering interactive exploration of the architecture.
RDFAST releases once it has data in the FIFO. That way, RFBYTE/RFWORD/RFLONG never wait for anything.
The Forth environment lends itself to test out code easily as parameters can just be put on the stack and the results printed out interactively. That also includes timing the operation as well being able to see I/O effects with SPLAT.
Now, even with the changes in the new image we are expecting anytime, I still expect to be up and running by next week, or sooner.
BTW, I just tested this modified version of the kernel and it works really well. I can even include my debug subroutines as regular "Forth" words.
@Chip: I know some have been wanting the return stack wider, but what I am interested in is a deeper stack. I find that 32 levels is more than sufficient. Would this be possible?
To get a stack that deep, we'd have to forget about flops and move to a dedicated SRAM.
I can see the desire for lut-based stack operators.
Are you needing all those stack levels after your interpreter CALL?
A curious thing with subroutine threading is that I've tried using direct assembly calls for routines in hubexec and there is no noticeable performance gain over interpreting the 16-bit addresses using the runtime interpreter. So why use assembly, since it takes twice as much memory? I even replaced the FOR NEXT with DJNZ but it's much the same.
Running this code results in a pulse period of 4.8us (@50MHz). Calling that as an assembly routine, utilizing calls to the Forth words and DJNZ for looping, results in a pulse period of 4.4us.
Of course we could code such a simple loop as pure assembly rather than as calls to Forth words and the stack etc, and this will certainly be the case for quite a few functions, but it seems that there is not much reason to worry about this method for kernel words. Forgoing calls to the Forth kernel and coding that same I/O toggle routine as we normally would in assembly (and without REP), doing exactly the same thing, results in a pulse period of 580ns.
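To put those three pulse periods side by side, here is a short Python sketch that converts them to clock counts at the 50MHz quoted above and compares them. Only the three timings come from the post; the labels and helper name are made up.

```python
CLOCK_MHZ = 50   # clock frequency quoted in the post

# Pulse periods in nanoseconds, as reported in the post.
PERIODS_NS = {
    "word-code via interpreter": 4800,
    "assembly calling Forth words": 4400,
    "pure assembly": 580,
}

def clocks(period_ns, clock_mhz=CLOCK_MHZ):
    """Number of clocks in one pulse period."""
    return period_ns * clock_mhz // 1000

for name, ns in PERIODS_NS.items():
    ratio = PERIODS_NS["word-code via interpreter"] / ns
    print(f"{name:30s} {clocks(ns):4d} clocks  ({ratio:.2f}x vs interpreted)")
```

Seen this way, calling the Forth words from assembly saves about 20 clocks per period, while dropping the stack and the calls altogether is roughly 8 times faster.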
Being able to mix reusable assembly subroutines with regular Forth code will make a big difference to Tachyon as I no longer need to worry about special opcodes such as SPI etc or even the RUNMODs.
The demo "word-code" that I am using:
I'm not able to decipher what stacking levels are in use there. Was it intended to help answer Chip's question of whether there is a strong case of having the LUT for stacking or not?
Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.
C.W.
Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!
Are there any other sleeper purposes for address sensitivity?
P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
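As a way of picturing the proposal, the address test could look like the Python sketch below. It models only what is said in this thread: stack pointers in $200..$3FF select the lut for CALLA/CALLB/RETA/RETB, while PUSHA/PUSHB/POPA/POPB stay plain hub accesses. The function names are hypothetical and this is not a description of any actual silicon behaviour.

```python
LUT_FIRST, LUT_LAST = 0x200, 0x3FF   # address range named in the post

def call_stack_target(pointer):
    """Memory a CALLA/CALLB/RETA/RETB stack access would use under the proposal."""
    return "lut" if LUT_FIRST <= pointer <= LUT_LAST else "hub"

def push_pop_target(pointer):
    """PUSHA/PUSHB/POPA/POPB are WRLONG/RDLONG cases, so they always use hub."""
    return "hub"

for ptr in (0x1FF, 0x200, 0x3FF, 0x400):
    print(hex(ptr), call_stack_target(ptr), push_pop_target(ptr))
```

Keeping the push and pop path untouched is what leaves hub $200..$3FF reachable for ordinary reads and writes.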