Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt!

Peter Jakacki · 2015-10-05 09:04


I've been a bit of a late starter with any P2 code, but I've cobbled together a kernel that I have been using to learn a bit more about the P2 instruction set and the way PNut compiles code. Although I am not up and running yet, it may be mostly a matter of porting much of the high level bytecode across and adjusting it to suit PNut. I've also found that some of the little condition code tricks we used on P1 don't work the same on P2.

Here is some high level bytecode compiled in PNut that I have used for a simple test:

This is how it would look if compiled by Forth normally

: DEMO 
   BEGIN 
    $0D EMIT $0A EMIT GETCNT GETCNT SWAP - PRTNUM $20 EMIT
    $21 BEGIN DUP EMIT 1+ DUP $7E = UNTIL DROP
   AGAIN
;

and this is how it is actually constructed to compile inside of PNut

orgh

	byte	"Tachyon",0
Tachyon
	byte	_BYTE/4,$0D,EMIT/4,_BYTE/4,$0A,EMIT/4
	byte	_GETCNT/4,_GETCNT/4,SWAP/4,MINUS/4
	byte	PRTNUM/4,_BYTE/4,$20,EMIT/4
	byte	_BYTE/4,$21
lbl0	byte	DUP/4,EMIT/4,INC/4
	byte	DUP/4,_BYTE/4,$7E,_EQ/4,_UNTIL/4,lbl2-lbl0
lbl2	byte	DROP/4,_AGAIN/4,lbl1-Tachyon
lbl1

and this is part of the output I get:

00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}
00000031 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}

So between the two _GETCNTs and stacking it takes $31 (49) cycles, which at 50MHz and 2 clocks/instruction (IIRC) is 1.96us. That doesn't look too bad considering I am only testing functionality, and I won't optimize it until it says "ok".
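The arithmetic behind that figure can be checked with a few lines of Python. This is only a sketch: it assumes, as the post does (flagged "IIRC" there), that each count corresponds to one 2-clock instruction cycle. The function and constant names are made up for illustration.

```python
# Convert a GETCNT delta into elapsed time.
# Assumption taken from the post: one count = one instruction cycle = 2 clocks.
CLOCK_HZ = 50_000_000
CLOCKS_PER_COUNT = 2

def elapsed_us(counts, clock_hz=CLOCK_HZ, clocks_per_count=CLOCKS_PER_COUNT):
    """Return the elapsed time in microseconds for a counter delta."""
    return counts * clocks_per_count / clock_hz * 1e6

delta = 0x31  # the value printed by the demo (49 decimal)
print(f"{delta} counts -> {elapsed_us(delta):.2f} us")
```

If the counter actually ticks once per clock rather than once per instruction, pass `clocks_per_count=1` and the result halves to 0.98us.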

This is the bytecode interpreter loop:

doNEXT
 		rdbyte	instr,PTRA++		'read byte code instruction
		shl	instr,#2 wc
		jmp	instr			'execute the code by directly indexing the first 256 longs in cog

BTW, the byte-aligned addresses necessitate the extra step of shifting the bytecode value by 2 to get the correct address to jump to. They also mess up the PNut source, as I have to use /4 after every bytecode reference. But I will work with what I've got until it is up and running.
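For readers who want to see the dispatch idea in isolation, here is a toy Python model of a bytecode inner interpreter running the inner loop of DEMO. It is not the Tachyon kernel: the opcode numbers are hypothetical (in the post they are cog addresses divided by 4), and the UNTIL operand is assumed to be a backward distance, which is how `lbl2-lbl0` reads in the listing.

```python
# Toy model of a bytecode (token-threaded) inner interpreter.
# Opcode numbers are hypothetical; in the post they are cog addresses / 4.

class VM:
    def __init__(self, code):
        self.code = code      # bytecode held in "hub" memory
        self.ip = 0           # instruction pointer (PTRA in the post)
        self.stack = []       # data stack
        self.out = []         # characters sent by EMIT
        self.running = True

    def fetch(self):
        """Read the next byte and advance IP, like rdbyte instr,PTRA++."""
        value = self.code[self.ip]
        self.ip += 1
        return value

def op_byte(vm):  vm.stack.append(vm.fetch())        # inline 8-bit literal
def op_emit(vm):  vm.out.append(chr(vm.stack.pop()))
def op_dup(vm):   vm.stack.append(vm.stack[-1])
def op_inc(vm):   vm.stack[-1] += 1
def op_drop(vm):  vm.stack.pop()
def op_halt(vm):  vm.running = False                 # added so the model can stop

def op_eq(vm):
    b, a = vm.stack.pop(), vm.stack.pop()
    vm.stack.append(-1 if a == b else 0)             # Forth-style true/false flag

def op_until(vm):
    back = vm.fetch()                                # assumed: backward distance
    if vm.stack.pop() == 0:
        vm.ip -= back                                # flag false: loop again

# The vector table: the opcode byte indexes straight into it.
TABLE = [op_byte, op_emit, op_dup, op_inc, op_eq, op_until, op_drop, op_halt]
BYTE, EMIT, DUP, INC, EQ, UNTIL, DROP, HALT = range(len(TABLE))

def run(code):
    vm = VM(code)
    while vm.running:
        TABLE[vm.fetch()](vm)                        # doNEXT: fetch, then jump
    return "".join(vm.out)

# $21 BEGIN DUP EMIT 1+ DUP $7E = UNTIL DROP
LOOP = [DUP, EMIT, INC, DUP, BYTE, 0x7E, EQ, UNTIL]
CODE = [BYTE, 0x21] + LOOP + [len(LOOP) + 1, DROP, HALT]

print(run(CODE))
```

The printed string runs from `!` to `}`, matching the character run in the output above. In this model the table index is the opcode itself, so nothing needs shifting, which is roughly what long addressing in cog space would give the real kernel.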

evanh · 2015-10-05 09:14

Note: Chip has just changed back to long addressing granularity within Cog space so that one will be sorted upon next image dump.

Peter Jakacki · 2015-10-05 09:16

evanh wrote: »

Note: Chip has just changed back to long addressing granularity within Cog space so that one will be sorted upon next image dump.

Yes, I saw mention of it earlier and I agree with this method of addressing cog memory in general.

jmg · 2015-10-05 20:12

There is also a
jmprel x
opcode, that will be good for case-table jumps like this.
I guess that can jump to any of LUT or COG ?

(and makes another strong case for my LUT:COG ordering, to avoid gaps in jumps )

Cluso99 · 2015-10-05 21:44

This will be an interesting development Peter!

cgracey · 2015-10-05 23:55

There's another thing you can do:

RDFAST #0,startbyteaddress

Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode, itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0= infinite). To make wrapping work, startbyteaddress must be long-aligned.

WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.
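The wrapping behaviour described above can be modelled in a few lines of Python. This is a behavioural sketch only, built from the description in this post: the class name `FastReader` and its methods are invented for illustration, hub memory is a plain `bytes` object, and raising an error on a misaligned wrapping start address is an assumption of the model, not something the hardware is stated to do.

```python
# Behavioural sketch of RDFAST/RFBYTE as described in the post.
# FastReader and its API are hypothetical; no timing or FIFO depth is modelled.

BLOCK_BYTES = 64

class FastReader:
    def __init__(self, hub, start, blocks=0):
        """Model 'RDFAST #blocks,start'. blocks=0 means never wrap."""
        if blocks and start % 4:
            # Model-only check: the post says wrapping needs a long-aligned start.
            raise ValueError("start address must be long-aligned to wrap")
        self.hub = hub
        self.start = start
        self.limit = blocks * BLOCK_BYTES   # bytes before wrapping, 0 = infinite
        self.offset = 0

    def rfbyte(self):
        """Model RFBYTE: return the next contiguous byte."""
        value = self.hub[self.start + self.offset]
        self.offset += 1
        if self.limit and self.offset == self.limit:
            self.offset = 0                 # wrap back to the start address
        return value

hub = bytes(range(256))

wrapping = FastReader(hub, start=64, blocks=1)
first_pass = [wrapping.rfbyte() for _ in range(BLOCK_BYTES)]
after_wrap = wrapping.rfbyte()
print(first_pass[0], first_pass[-1], after_wrap)
```

With one 64-byte block starting at address 64, the reader returns bytes 64 through 127 and then starts again at 64.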

Electrodude · 2015-10-06 00:10

cgracey wrote: »

There's another thing you can do:

RDFAST #0,startbyteaddress

Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode, itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0= infinite). To make wrapping work, startbyteaddress must be long-aligned.

WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

Peter Jakacki · 2015-10-06 00:19

cgracey wrote: »

There's another thing you can do:

RDFAST #0,startbyteaddress

Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode, itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0= infinite). To make wrapping work, startbyteaddress must be long-aligned.

WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.

I have been suffering the pain of "adocumentation": I see all these wonderful instructions in the summary but am not at all sure what they do. Some of these have changed from P2-hot, and the descriptions of many others are buried in a myriad of tangled posts. But as I said, I will work with what I've got to see what I can do, although I am looking forward to the new image with long addressed cog memory etc.

As for RDFAST, I will have to think about how I can use this feature, although I do want to achieve functionality first so that I can have an interactive development and test environment with SD filesystem. Once I write an inline assembler I can then play with these enhancements and get a feel for what will work. Also, thanks to hubexec there should be no problem having PASM code definitions, or even PASM mixed with bytecode.

Overall I'm pumped even though I don't expect silicon for a good year, so here's looking at making this a good year!

BTW, I have my kernel mostly running now after which I will add the high level bytecode.

Cluso99 · 2015-10-06 00:32

cgracey wrote: »

There's another thing you can do:

RDFAST #0,startbyteaddress

Once you do that, 'RFBYTE D (WC,WZ)' can be used to read contiguous bytes, starting from startbyteaddress. RFBYTE means 'read fast byte' and it always takes 2 clocks. RDFAST initiates the read-fast mode. This doesn't work with hub exec, because hub exec uses the RDFAST mode, itself. That first D/# term in RDFAST tells how many 64-byte blocks to read before wrapping back to startbyteaddress (0= infinite). To make wrapping work, startbyteaddress must be long-aligned.

WRFAST works the same way, and uses WFBYTE, WFWORD, WFLONG.

WOW! This is an interesting set of instructions helped by the egg-beater.

We should be able to get a number of interpreters working fast!
And with the LUT for extra code it should permit the interpreters to perform fast too!

jmg · 2015-10-06 00:46

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

cgracey · 2015-10-06 04:17

jmg wrote: »

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

This is all true.

Electrodude · 2015-10-06 04:36

jmg wrote: »

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

cgracey wrote: »

jmg wrote: »

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

This is all true.

But does execution after a RDFAST continue immediately (in which case a too-early RFBYTE would need to block) or does RDFAST wait until the FIFO begins to fill from the hub before continuing?

evanh · 2015-10-06 06:00

As a guess I'd say RDFAST never blocks. It just initiates the FIFO operation in the same manner as HubExec.

PS: Also, I think the streamer is a separate DMA engine from the FIFO's DMA engine. They can both run concurrently.

evanh · 2015-10-06 06:04

And presumably the streamer has priority over both the FIFO and HubOps for Hub accesses.

Peter Jakacki · 2015-10-06 08:02

Now that I've had a bit of a play I can see that PNut is limiting me somewhat in being able to massage the code into the best areas whereas I relied on BST with its features and listing to help me with the P1.

Not to be too deterred I am now working on a version that gets rid of the vector table and compiles 16-bit addresses in place of the bytecode. So that means a Forth instruction can jump to code anywhere in the first 64k, be that cog, lut, or hub. So high level definitions such as colon defs will have a CALL to the colon interpreter to stack the IP and load it up with the new address. All this is rather similar to a more conventional 16-bit (address) Forth as we now have a much larger code space to work from as we also have hubexec. Code outside of 64k can still be called by jumping via an instruction in the first 64k or I may just insist that code is long aligned so that I can address 256k of code directly with a 16-bit word. So all definitions are CODE definitions by default. This should make for a pretty snappy but still compact Forth.

doNEXT
 		rdword	instr,PTRA++		'read word code instruction address
		jmp	instr			'execute the code by directly indexing the first 64k in cog/lut/hub
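The address reach mentioned above is easy to tabulate. The sketch below only restates the arithmetic: a 16-bit wordcode used as a byte address reaches 64k, and the same word used as a long index (scaled by 4) reaches 256k. The helper names are made up for illustration.

```python
# How far a 16-bit wordcode can reach, depending on how it is scaled.
WORD_BITS = 16

def reach_bytes(scale):
    """Bytes of code space addressable when each wordcode step is 'scale' bytes."""
    return (1 << WORD_BITS) * scale

def target_address(wordcode, scale):
    """Byte address a given wordcode would jump to under that scaling."""
    return wordcode * scale

for name, scale in (("byte-addressed", 1), ("long-aligned", 4)):
    print(f"{name}: {reach_bytes(scale) // 1024}k, "
          f"highest target ${target_address(0xFFFF, scale):05X}")
```

The trade-off is that long-aligned targets may need padding between definitions, which the post accepts in exchange for the larger directly callable code space.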

I may have some time later tonight to fire up the changes I have been making.

EDIT: My demo code works!!!

Cluso99 · 2015-10-06 09:00

Peter,
There is a list of labels available in pnut.exe. IIRC it's something like Ctl-G to switch to that view.

mindrobots · 2015-10-06 09:10

Ctl-M for the other listing format.

Peter Jakacki · 2015-10-06 13:15

Yeah, but it's not exactly the same as an actual listing though. I mean do these values really tell me much?

TYPE: 55   VALUE: 1300011D   NAME: _AND
TYPE: 55   VALUE: 13800125   NAME: _ANDN
TYPE: 55   VALUE: 1400012D   NAME: _OR
TYPE: 55   VALUE: 14800135   NAME: _XOR
TYPE: 55   VALUE: 1500013D   NAME: _SHR

vs BST listing for the equivalent:

73D4(0035) AF 61 BF 60 | _AND                    and        tos+1,tos
73D8(0036) 11 00 7C 5C |                         jmp        #DROP
73DC(0037) AF 61 BF 64 | _ANDN                   andn        tos+1,tos
73E0(0038) 11 00 7C 5C |                         jmp        #DROP
73E4(0039) AF 61 BF 68 | _OR                     or        tos+1,tos
73E8(003A) 11 00 7C 5C |                         jmp        #DROP
73EC(003B) AF 61 BF 6C | _XOR                    xor        tos+1,tos
73F0(003C) 11 00 7C 5C |                         jmp        #DROP
73F4(003D) AF 61 BF 28 | _SHR                    shr        tos+1,tos
73F8(003E) 11 00 7C 5C |                         jmp        #DROP

Peter Jakacki · 2015-10-06 17:18

Using words instead of bytecodes makes more sense for the P2 as we can directly address the first 64k of memory, or 256k as code once Chip changes the cog to long addressing and I also align high level defs to longs. As for timing, I am finding that the words "FOR 1234 DROP NEXT" take under 1us/loop @160MHz, which is faster than the P1, although admittedly the P1 does execute this in 3.4us/loop at 80MHz. There is room for improving these figures, though, as I haven't really made use of any special P2 features yet. There will be plenty of other speed gains simply because we can have more PASM instructions, plus other things.

I can't see how I can use RDFAST though as I can't really make use of sequential access.

At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun. I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.

Cluso99 · 2015-10-06 23:19

Peter Jakacki wrote: »

.....
At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun.

Nice!

I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.

This would be really cool !!!

LoopyByteloose · 2015-10-07 01:40

Happy to see Tachyon Forth migrating to the Propeller 2. I have been looking forward to this for quite a while.

I am convinced that Forth will make learning assembly on the Propeller 2 much easier... by offering interactive exploration of the architecture.

cgracey · 2015-10-07 02:35

Electrodude wrote: »

jmg wrote: »

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

cgracey wrote: »

jmg wrote: »

Electrodude wrote: »

Does RDFAST block until the streamer is ready or does the first RFBYTE block if it's not ready?

IIRC there is a small FIFO, so the first read has to wait for a slot (of course) but thereafter linear-data is slot-free. Random data is not helped much, but designs that used a short Skip approach should benefit.

This is all true.

But does execution after a RDFAST continue immediately (in which case a too-early RFBYTE would need to block) or does RDFAST wait until the FIFO begins to fill from the hub before continuing?

RDFAST releases once it has data in the FIFO. That way, RFBYTE/RFWORD/RFLONG never wait for anything.

Peter Jakacki · 2015-10-07 03:37

Cluso99 wrote: »

Peter Jakacki wrote: »

.....
At this rate it won't be too long before I am ready to release Tachyon Explorer for the P2 and then we can have some fun.

Nice!

I could also include a P2 assembler so assembly CODE definitions can be created and tested interactively.

This would be really cool !!!

The Forth environment lends itself to testing code easily, as parameters can just be put on the stack and the results printed out interactively. That also includes timing the operation as well as being able to see I/O effects with SPLAT.

Now even with the changes in the new image we are expecting anytime, I still expect to be up and running by next week, or sooner.

Peter Jakacki · 2015-10-07 04:08

Now that I have played with a "wordcode" kernel I am looking at a subroutine threaded interpreter. The current method is to jump to a routine and then jump back explicitly to the runtime interpreter. Calling the routine and having it return to whatever called it is now possible with P2, although it only has an 8 level return stack. That leaves me the option at compile time of compiling 16-bit wordcode addresses to be read by the runtime interpreter, or compiling call instructions instead: as every high level routine is entered as assembly code anyway, there is then no need to interpret 16-bit wordcode. So the subroutine threaded method takes up twice as much memory but also runs faster.

doNEXT 		rdword	instr,PTRA++		'read word code instruction address
		call	instr
		jmp	#doNEXT

BTW, just tested this modified version of the kernel and it works really well, I can even include my debug sub-routines as regular "Forth" words

@Chip: I know some have been wanting the return stack wider but what I am interested in is a deeper stack. I find that 32 levels is more than sufficient, would this be possible?

evanh · 2015-10-07 05:05

Grr, it's bed-bugs I tell you!

cgracey · 2015-10-07 05:50

Peter Jakacki wrote: »
Now that I have played with a "wordcode" kernel I am looking at a subroutine threaded interpreter. The current method is to jump to a routine and then jump back explicitly to the runtime interpreter. Calling the routine and having it return to whatever called it is now possible with P2, although it only has an 8 level return stack. That leaves me the option at compile time of compiling 16-bit wordcode addresses to be read by the runtime interpreter, or compiling call instructions instead: as every high level routine is entered as assembly code anyway, there is then no need to interpret 16-bit wordcode. So the subroutine threaded method takes up twice as much memory but also runs faster.
doNEXT 		rdword	instr,PTRA++		'read word code instruction address
		call	instr
		jmp	#doNEXT
BTW, just tested this modified version of the kernel and it works really well, I can even include my debug sub-routines as regular "Forth" words

@Chip: I know some have been wanting the return stack wider but what I am interested in is a deeper stack. I find that 32 levels is more than sufficient, would this be possible?

To get a stack that deep, we'd have to forget about flops and move to a dedicated SRAM.

I can see the desire for lut-based stack operators.

Are you needing all those stack levels after your interpreter CALL?

Peter Jakacki · 2015-10-07 05:56

Just thinking, really, in case I needed to go deeper. Lut-based operators would certainly do it, as I take it that you mean the lut would be used for stack space. The thing is that the instruction pointer or IP, which in reality is PTRA, needs to be stacked whenever I enter a new "colon" definition, that is, another routine that is made up of word-codes. So having general stack operators would be nice.

Peter Jakacki · 2015-10-07 06:59

SUBROUTINE THREADING VS CALLING DIRECTLY

The curious thing with subroutine threading is that I've tried using direct assembly calls for routines in hubexec and there is no noticeable performance gain over interpreting the 16-bit addresses using the runtime interpreter. So why use assembly, since it takes twice as much memory? I even replaced the FOR NEXT with DJNZ but it's much the same.

Running this code results in a pulse period of 4.8us (@50MHz):

word	_1,_16,FOR,DUP,DUP,OUTSET,OUTCLR,forNEXT,DROP

Now calling that as an assembly routine utilizing calls to the Forth words and DJNZ for looping results in a pulse period of 4.4us.

MYIO		call	#_1
		mov	R2,#16
IOLP 		call	#DUP
 		call	#DUP
 		call	#OUTSET
 		call	#OUTCLR
 		djnz	R2,@IOLP
		jmp	#DROP

Of course we could code such a simple loop as pure assembly rather than as calls to Forth words and the stack, and this will certainly be the case for quite a few functions, but it seems there is not much reason to worry about this method for kernel words. Now, forgoing calls to the Forth kernel and coding the same I/O toggle routine as we normally would in assembly (but without REP) results in a pulse period of 580ns.

FASTIO          mov	X,#1
		or	DIRA,X
      		mov	R2,#16
FIOLP		or	OUTA,X
		andn	OUTA,X
		djnz	R2,@FIOLP
		ret
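To put the three pulse periods quoted above on a common footing, the following Python sketch converts them to system clocks at 50MHz and compares them. It uses only the figures from this post; the dictionary labels are informal descriptions, not names from the kernel.

```python
# Convert the measured pulse periods (at 50MHz) into system clocks per period.
CLOCK_MHZ = 50

def clocks_per_period(period_ns, clock_mhz=CLOCK_MHZ):
    """Number of system clocks in one pulse period."""
    return period_ns * clock_mhz / 1000

measurements_ns = {
    "wordcode via runtime interpreter": 4800,
    "assembly calls to Forth words": 4400,
    "pure assembly loop (FASTIO)": 580,
}

baseline = measurements_ns["wordcode via runtime interpreter"]
for label, period_ns in measurements_ns.items():
    clocks = clocks_per_period(period_ns)
    speedup = baseline / period_ns
    print(f"{label}: {clocks:.0f} clocks, {speedup:.1f}x vs wordcode")
```

On these numbers, calling the Forth words from assembly saves under 10% over interpreting wordcode, while the pure assembly loop is roughly eight times faster, which is the point being made about where assembly pays off.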

Being able to mix reusable assembly subroutines with regular Forth code will make a big difference to Tachyon as I no longer need to worry about special opcodes such as SPI etc or even the RUNMODs.

The demo "word-code" that I am using:

Tachyon
        word	hello
	word	CR	       ' use direct call to "newline" rather than this code -> '_WORD,$0D,EMIT,_WORD,$0A,EMIT
	word	_8,FOR,_WORD,"*",EMIT,forNEXT
	word	_WORD,$21
lbl0	word	DUP,EMIT,INC
	word	DUP,_WORD,$7E,_EQ,_UNTIL,lbl2-lbl0
lbl2	word	DROP
' 	word	_0,_WORD,$0800,DUMP
 	word	_GETCNT,stksub,_GETCNT,SWAP,MINUS,space,PRTNUM
	word	_LONG
	long	1_000_000
	word	FOR,_WORD,$1234,DROP,forNEXT
'lbl3	word	DEC,DUP,_0,_EQ,_UNTIL,lbl4-lbl3
'lbl4	word	DROP
 	word	_GETCNT,_NOP,_GETCNT,SWAP,MINUS,space,PRTNUM
' 	word	_1,_16,FOR,DUP,DUP,OUTSET,OUTCLR,forNEXT,DROP     ' replace this loop with FASTIO for testing
	word	FASTIO
 	word    CR	         ' Replace -> '_WORD,$0D,EMIT,_WORD,$0A,EMIT
 	word	_AGAIN,lbl1-Tachyon
lbl1

evanh · 2015-10-07 10:47

Peter,
I'm not able to decipher what stacking levels are in use there. Was it intended to help answer Chip's question of whether there is a strong case of having the LUT for stacking or not?

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

ctwardell · 2015-10-07 15:25

This is looking really good Peter.

C.W.

cgracey · 2015-10-07 18:05

evanh wrote: »

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
