Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt!

mindrobots · 2015-10-07 18:32

cgracey wrote: »

PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

Is there a hub $200..$3FF anymore? With HUB moved to $400, I thought there wasn't any address overlap in the latest memory map.

jmg · 2015-10-07 18:42

cgracey wrote: »

Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub.

I'm not quite following, are you talking about moving the stack by PC address ?
What about a case where the LUT is all already allocated ?
Using LUT as stack has obvious merits, but making it a hidden by-pc choice seems likely to surprise. Code that is moved from HUB to LUT changes more than expected.

cgracey · 2015-10-07 18:54

This is all it is, and it has nothing to do with hub exec:

When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
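A minimal Python sketch of the rule as described here, where the PTRx value alone picks the stack memory and the pointer step. The function names are made up for illustration, and nothing in it models the clock counts.

```python
# Sketch only: models the address test described in the post, not real hardware.
LUT_FIRST, LUT_LAST = 0x200, 0x3FF   # range quoted in the post

def stack_target(ptr):
    """Return (memory, step) that CALLx/RETx would use for a given PTRx value."""
    if LUT_FIRST <= ptr <= LUT_LAST:
        return ("lut", 1)    # lut case: PTRx changes by +-1
    return ("hub", 4)        # hub case: PTRx changes by +-4

def matches_bit_pattern(ptr):
    """The same test written as the 20-bit pattern %00000000001xxxxxxxxx."""
    return ((ptr & 0xFFFFF) >> 9) == 1

if __name__ == "__main__":
    for p in (0x1FF, 0x200, 0x3FF, 0x400):
        print(hex(p), stack_target(p), matches_bit_pattern(p))
```

The two tests agree for every 20-bit pointer value, which is why a single bit-pattern compare is enough to steer the access.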

jmg · 2015-10-07 19:04

cgracey wrote: »

This is all it is, and it has nothing to do with hub exec:

When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.

I'm still not seeing how that protects the LUT from accidental clobbering ?
A change in destination area is quite significant, not something that should be buried ?
ie I think these need explicit opcodes, not a hidden operational flip-flop.

David Betz · 2015-10-07 19:10

cgracey wrote: »

evanh wrote: »

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

Seairth · 2015-10-07 19:41

cgracey wrote: »

evanh wrote: »

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

If you add PTRA/B variants of RDLUT/WRLUT, then you can get the equivalent push/pop functionality.

As for cog ram, I don't know that much can be done about that. On the other hand, push/pop is not as important there since the stack is already located in directly addressable registers.

cgracey · 2015-10-07 19:49

David Betz wrote: »

cgracey wrote: »

evanh wrote: »

Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

Are there any other sleeper purposes for address sensitivity?

P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.

It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRx++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on the PTRx range being within lut, or not. That would be pretty easy.

We would have these discrete instructions which would use hub or lut, based on the PTRx address:

PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB

Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
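A rough Python model of the lut push/pop pair mentioned above ('WRLUT data,PTRx++' and 'RDLUT data,--PTRx'), using only the lower 9 bits of the pointer as the lut address. The class and method names are hypothetical; this is a sketch of the proposal, not a description of shipped silicon.

```python
# Sketch only: a 512-long lut with post-increment push and pre-decrement pop.
class LutStack:
    def __init__(self, ptr=0x200):
        self.lut = [0] * 512      # lut modelled as 512 longs
        self.ptr = ptr            # stands in for PTRA or PTRB

    def push(self, data):
        """Models 'WRLUT data,PTRx++': write, then increment the pointer."""
        self.lut[self.ptr & 0x1FF] = data & 0xFFFFFFFF
        self.ptr += 1

    def pop(self):
        """Models 'RDLUT data,--PTRx': decrement the pointer, then read."""
        self.ptr -= 1
        return self.lut[self.ptr & 0x1FF]

if __name__ == "__main__":
    s = LutStack()
    s.push(0x1234)
    s.push(0xCAFEF00D)
    print(hex(s.pop()), hex(s.pop()), hex(s.ptr))
```

Because push post-increments and pop pre-decrements, the pointer always rests on the next free slot, so a pop that discards its data is just a pointer decrement.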

cgracey · 2015-10-07 19:59

I think adding PTRx expressions to RDLUT/WRLUT is a no-brainer, in any case.

Cluso99 · 2015-10-07 20:07

Chip,
WOW! Love the idea.

I know I keep pushing, but really if the instructions were always longs we would only require 18-bits of addressing, not 20-bits. This would free up 2-bits in those large instructions.

At least I need to understand some of the callx/retx/pushx/popx instructions because you caught me unawares about the pushx/popx being wrlong/rdlong instructions acting on the hub.

May I suggest that we get a new FPGA release out first, and that we continue this discussion in a new separate thread rather than here on Peter's thread. I think there are huge advantages in being able to use the LUT for the stack.

jmg · 2015-10-07 20:15

David Betz wrote: »

Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

On top of that, there is the common ASM trick of CALL, then a Table Jump, then RET
- if the RET is now in a different memory space, the Pointer has moved....
The idea of optional LUT as Stack is good but I think it needs explicit instructions.
When this breaks, it will be very hard to find.

Peter Jakacki · 2015-10-08 00:52

I've been making some really good progress on this subroutine threaded version, it is looking really schmick now. Also, rather than performing relative branching with IF ELSE WHILE UNTIL AGAIN etc, these words are now combined internally into just two definitions that read an absolute branch address. So ELSE and AGAIN are unconditional jumps whereas IF WHILE UNTIL are conditional jumps. Internally ELSE and AGAIN are just referred to as GOTO.

In Tachyon P1 when we went to emit a single character it always required 4 bytes of code either as _BYTE,$0D,XCALL,xEMIT or XCALL,xPRTSTR,$0D,00 and it looked like we needed 6 bytes to do the same in P2 but this has dropped back down to 4 bytes with a simple PRTSTR,$0D since the $0D is compiled as a 16-bit word which is in fact the string $0D,00.
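To illustrate the last point, here is a small Python check that a 16-bit word holding $0D, stored low byte first as the Propeller does, has the same bytes as the string $0D,00. The helper name is made up for the example.

```python
import struct

def compile_word(value):
    """Pack a 16-bit value low byte first, as a 'word' would sit in hub memory."""
    return struct.pack("<H", value & 0xFFFF)

if __name__ == "__main__":
    word_bytes = compile_word(0x0D)
    string_bytes = bytes([0x0D, 0x00])    # one character plus null terminator
    print(word_bytes.hex(), string_bytes.hex(), word_bytes == string_bytes)
```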

Here's the main demo plus a few other bits as it is compiled in PNut. It looks a lot cleaner without the relative branch calculations.

dat
 		orgh
Tachyon
       	word	PRTSTR
       	byte	"Tachyon Forth R16 for the P2 V0.1 151008 ",0
	word	CR,_8,STARS                                          		' simple subroutine call
	word	CR,_WORD,$21,_WORD,$5D,FOR,DUP,EMIT,INC,forNEXT,DROP		' display ASCII using FOR NEXT
	word	CR,_WORD,$21
lbl0	word	DUP,EMIT,INC,DUP,_WORD,$7E,_EQ,_UNTIL,lbl0,DROP			' display ASCII using BEGIN UNTIL
 	word	CR,_LONG
	long	$CAFEF00D							' test Forth versions of number printing
	word	DUP,PRTHEX,SPACE,DUP,PRTBYTE,SPACE
	word	DUP,PRTWORD,SPACE,PRTLONG,SPACE
	word	_GETCNT,_WORD,10000						' time a simple 10000 FOR $1234 DROP NEXT loop
	word	FOR,_WORD,$1234,DROP,forNEXT
	word	_GETCNT,SWAP,MINUS,SPACE,PRINT					' display results
 	word	_GETCNT,_NOP,_GETCNT,SWAP,MINUS,SPACE,PRINT 			' display timing for a NOP (plus GETCNTs)
 	word	_GETCNT,_GETCNT,SWAP,MINUS,SPACE,PRINT				' display timing for GETCNTs themselves
	word	SLOWIO,SUBIO,FASTIO						' toggle some pins with various methods
	word	_0,_WORD,$100,HDUMP						' dump some hub  memory
 	word	CR,GOTO,Tachyon

CR	call	#COLON
	word	PRTSTR,$0D,PRTSTR,$0A,EXIT

STARS	call	#COLON
	word	FOR,PRTSTR,"*",forNEXT,EXIT

FASTIO          mov	X,#1
		or	DIRA,X
      		mov	R2,#16
FIOLP		or	OUTA,X
		andn	OUTA,X
		djnz	R2,@FIOLP
		ret

At present I am converting and integrating a lot of the bytecode from Tachyon P1 after which I will be able to switch on full console interaction. Not long now!

Peter Jakacki · 2015-10-16 02:31

Now that I have Tachyon up and running for the P2 I need to go back into the kernel and optimize some stuff. I am already using the LUT for the four stacks as well as maintaining top of stack copies in cog ram. The kernel is much more simplified compared to the P1 in that all code is compiled as an address and all addresses are assumed to be code, whether in cog, lut, or hub. So all high-level code does a "call #COLON" then has 16-bit sub-routine threaded word-code addresses to "interpret" which is surprisingly compact. Compiling these word-codes as a full CALL takes twice as much memory but does not seem to run much faster.

The code space is only 64k at the moment but at least all of it is code space and that's a lot of code space. I would only need to align code entry on longs to increase this range up to 256k. The dictionary now has the luxury of being able to expand without having to resort to special compacting techniques as was done for P1 using the EEPROM. Also the dictionary is a bit simpler in that the header does not need to know what type of code it is anymore, it just has a 16-bit pointer and no vector table of course.
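The 64k and 256k figures follow from the pointer width and the entry alignment. A quick Python sketch of that arithmetic (the function name is hypothetical):

```python
def code_reach(pointer_bits, alignment_bytes):
    """Bytes addressable when each pointer value selects an aligned entry point."""
    return (1 << pointer_bits) * alignment_bytes

if __name__ == "__main__":
    print(code_reach(16, 1) // 1024, "KB with byte-granular 16-bit addresses")
    print(code_reach(16, 4) // 1024, "KB if code entries are long aligned")
```

With long alignment the 16-bit word-code would be treated as a long index, so the compiled code stays the same size while the reach grows fourfold.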

After I test Tachyon a bit more I will write an inline PASM style assembler that can read from SD so I can test out even more and expand the networking capabilities too.

I'd like to integrate the serial transmit with the receive all in one cog rather than transmit from the main cog so has anyone got any tips in regards to writing a full-duplex driver using interrupts? Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?

jmg · 2015-10-16 03:24

Peter Jakacki wrote: »

Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?

FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.

Peter Jakacki · 2015-10-16 03:37

jmg wrote: »

Peter Jakacki wrote: »

Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?

FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.

I picked up a tube of W25Q80s a couple of years ago in anticipation back then for P2 but I can't seem to find anything about SPI Flash booting which I guess is probably not implemented as of yet.

potatohead · 2015-10-16 03:40

It's not. Likely next stage of the design. For that, we need booter, crypt, etc... Gotta settle the basics first. Right now, it's all small as it goes through various revision to settle bugs in the instructions, form, addressing, etc... Chip has rewritten the basic support and sample programs, what 4 times?

I suspect once the bugs end, and we are all feeling happy, the next stage will kick off.

Forth is in WAY early. Nice job guys! (talking about Pfth too)

Peter Jakacki · 2015-10-16 03:55

Thanks, it'll be nice when I can boot from SPI as then I have a real chance to test out the filesystem and networking for real but until then I still have testing and optimizing to do. When I'm happy enough with it I will post the source, just as a regular text document though via my dropbox links.

This bit of a screen grab dumps the temporary compilation area and so shows the 16-bit words that are compiled for that same line, then a couple of stack manipulations etc. The 0014 at the end of the dump is the EXIT that is appended automatically (plus one extra at the moment)

Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151015.0000
----------------------------------------------------------------
%11010011 $1AF0 1234 HERE $20 DUMPW
1C00: 0087 00D3 0087 1AF0 0087 04D2 1207 0087    ................
1C10: 0020 1164 0014 0014 0000 0000 0000 0000     .d.............
 ok
.S  [0000.04D2 0000.1AF0 0000.00D3 0000.0003]  ok
SWAP OVER + .S  [0000.1FC2 0000.04D2 0000.00D3 0000.0003]  ok
ROT .S  [0000.00D3 0000.1FC2 0000.04D2 0000.0003]  ok
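The stack listings above can be cross-checked with a small Python model of SWAP, OVER, + and ROT. It assumes .S prints the top of stack first, which is what the listing suggests; the helper names are my own.

```python
# Sketch only: the stack is a Python list with the top of stack at the end.
def swap(s): s[-1], s[-2] = s[-2], s[-1]
def over(s): s.append(s[-2])
def rot(s):  s.append(s.pop(-3))          # third item moves to the top
def add(s):
    b = s.pop()
    a = s.pop()
    s.append((a + b) & 0xFFFFFFFF)        # 32-bit wraparound

def show(s):
    """Format like .S, top of stack first."""
    return " ".join(f"{v >> 16:04X}.{v & 0xFFFF:04X}" for v in reversed(s))

if __name__ == "__main__":
    stack = [0x0003, 0b11010011, 0x1AF0, 1234]
    print(show(stack))
    swap(stack); over(stack); add(stack)
    print(show(stack))
    rot(stack)
    print(show(stack))
```

The sum $1AF0 + 1234 comes out as $1FC2, matching the second .S line.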

Peter Jakacki · 2015-10-17 02:57

In developing this kernel further, which btw is working very nicely, I'd like to optimize some cog code but I can't seem to find an easy way to do cog indexed or indirect addressing. With the LUT we have RDLUT and WRLUT where we can specify an address register, same as we do for RDLONG/WRLONG etc. Now SETS and SETD still require two instructions or nops between the set and the instruction. Then there is ALTDS which is badly in need of some macros but looks like it could do the job although in a somewhat unwieldy fashion. It's also a pity that SETB/CLRB doesn't span addresses for bits>31 as that would be very useful for a whole number of things including accessing the ports as one.

So, have I missed something? Is there an easy way to address cog memory indirectly?

BTW, if someone wants to play with Tachyon "as is" just let me know as I don't need "bug" reports yet

Here's a little test loop that flashes the leds on the DE2 with patterns from memory. It runs faster if key2 is pressed and exits if key3 is pressed.

0 BEGIN DUP W@ $FFFF PBCLR PBSET 30 PIN@ IF #100 ELSE #20 THEN ms 2 + 31 PIN@ 0= UNTIL

ozpropdev · 2015-10-17 06:13

Peter
Have a look at the ALTDS instruction

 ALTDS D,S/# 'modify D according to bits in S and possibly replace next instruction's CCCCOOOOOOOCZI / DDDDDDDDD / SSSSSSSSS fields.

 In ALTDS, S provides the following pattern: %RRR_DDD_SSS

 %RRR: (101 allows instruction substitution)
 000 = don't affect D's CCCCOOOOOOOCZI field
 001 = don't affect D's CCCCOOOOOOOCZI field, cancel write for next instruction
 010 = decrement D's OOOOOOOCZ field
 011 = increment D's OOOOOOOCZ field
 100 = use D's OOOOOOOCZ field as the result register for the next instruction (separate from D)
 101 = use D's CCCCOOOOOOOCZI field as next instruction's CCCCOOOOOOOCZI field
 110 = use D's OOOOOOOCZ field as the result register for the next instruction, decrement D's OOOOOOOCZ field
 111 = use D's OOOOOOOCZ field as the result register for the next instruction, increment D's OOOOOOOCZ field

 %DDD
 000 = don't affect D's DDDDDDDDD field
 001 = copy D's SSSSSSSSS field into its DDDDDDDDD field
 010 = decrement D's DDDDDDDDD field
 011 = increment D's DDDDDDDDD field
 100 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction
 101 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, copy D's SSSSSSSSS field into its DDDDDDDDD field
 110 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, decrement D's DDDDDDDDD field
 111 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, increment D's DDDDDDDDD field

 %SSS
 000 = don't affect D's SSSSSSSSS field
 001 = copy D's DDDDDDDDD field into its SSSSSSSSS field
 010 = decrement D's SSSSSSSSS field
 011 = increment D's SSSSSSSSS field
 100 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction
 101 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, copy D's DDDDDDDDD field into its SSSSSSSSS field
 110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
 111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field

This should do the trick.
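As a reading aid for the table, a short Python sketch that splits an S value into its %RRR_DDD_SSS groups and looks up the %SSS meaning. The descriptions are copied from the list above; the function and dictionary names are hypothetical, and whether a given pattern gives the indirect access asked about is not confirmed here.

```python
# Sketch only: decodes the 9-bit ALTDS S pattern %RRR_DDD_SSS from the table.
SSS_MEANING = {
    0b000: "don't affect D's S field",
    0b001: "copy D's D field into its S field",
    0b010: "decrement D's S field",
    0b011: "increment D's S field",
    0b100: "use D's S field as next instruction's S field",
    0b101: "use D's S field as next S field, copy D's D field into its S field",
    0b110: "use D's S field as next S field, decrement D's S field",
    0b111: "use D's S field as next S field, increment D's S field",
}

def decode_altds(s):
    """Return the (RRR, DDD, SSS) groups from the low 9 bits of S."""
    return (s >> 6) & 7, (s >> 3) & 7, s & 7

if __name__ == "__main__":
    rrr, ddd, sss = decode_altds(0b000_000_111)
    print(f"RRR={rrr:03b} DDD={ddd:03b} SSS={sss:03b}: {SSS_MEANING[sss]}")
```

Going by the table, a pattern such as %000_000_111 would substitute the next instruction's source from D's S field and then step that field, which reads like a post-incrementing index.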

Peter Jakacki · 2015-10-17 09:34

I was hoping for a simpler single "instruction" rather than an ALTDS instruction seeing we already have the RD/WR type of instructions for lut and hub.. Also does it actually modify the next instruction in memory or just in the pipeline which is what I think it must do?

It would help too if actual examples were given, as it doesn't look clean and simple at all.

Cluso99 · 2015-10-18 00:02

Yes, it just modifies the ALU input in the pipeline.
Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.

Peter Jakacki · 2015-10-18 01:03

PTR addressing modes only work with hub read/write otherwise they are just a register like any other so that a MOV will read or write to the PTR register rather than use it indirectly. No, what's needed is a simple RDCOG and WRCOG instruction, surely that's not hard to implement and works in the same manner as RDLUT and WRLUT.

The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how this was intended to be used as immediately it looks kludgey when you try to use it. At least proper macros could hide most of this kludge but most of the time I just need a simple RDCOG/WRCOG.

potatohead · 2015-10-18 01:33

Wrcog is mov. This design has the same limitation P1 does in that the pipeline requires modification to get indexing and indirection done.

The usual way is to maintain a pointer, add, setd, sets, etc...

Altds compresses this for some additional speed, in that it can work directly with the target instruction data.

We don't have the advanced pipeline to work otherwise. Cog ops of this kind take a couple instructions as a result.

That is the why, in any case. Maybe Chip has a better answer in the wings.

Peter Jakacki · 2015-10-18 02:08

That's not correct though as MOV moves from register or immediate to register whereas a RDCOG would use the source indirectly just as RDLUT and also as RDLONG does etc. So if the source pointed to a register "myio" which held an address that pointed to INB then RDCOG X,myio would be equivalent to MOV X,INB.
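The difference being described can be shown with a toy Python model of cog RAM. RDCOG is the instruction proposed in this post, not an existing one, and the register addresses below are placeholders picked for the example.

```python
# Sketch only: cog RAM as 512 longs; all addresses here are made-up placeholders.
X, MYIO, INB = 0x010, 0x011, 0x1F0

def mov(cog, d, s):
    """MOV D,S: copy the contents of register S into register D."""
    cog[d] = cog[s]

def rdcog(cog, d, s):
    """Proposed RDCOG D,S: register S holds the address of the register to read."""
    cog[d] = cog[cog[s] & 0x1FF]

if __name__ == "__main__":
    cog = [0] * 512
    cog[INB] = 0x0000BEEF     # pretend pin states
    cog[MYIO] = INB           # myio holds the address of INB
    rdcog(cog, X, MYIO)
    print(hex(cog[X]))        # same value MOV X,INB would fetch
```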

@Chip: How hard would it be to add RDCOG and WRCOG in the same manner?

potatohead · 2015-10-18 02:18

It is the indirection, indexing that takes the time. It can happen in hubexec because the instruction comes from a different place. In cogex, it's all timing bound.

Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.

The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.

Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.

I miss those pointers too.

There is actually a discussion on this in the older design thread.

Peter Jakacki · 2015-10-18 07:22

Current word list for Tachyon as I continue testing etc. I've looked at what I need for pin I/O and so I've put in PINSET/PINCLR/PININP/PIN@ which accepts bits 0 to 63 to span both ports plus I have PASET/PACLR PBSET/PBCLR to write a 32-bit mask to the ports as well as sets the direction to output. I just need to clean up some more bugs and I should be able to release this source if anyone is interested.

  Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151016.1800
----------------------------------------------------------------
  ok
  ok
WORDS
1D3E WORDS          1D1A WTAB           1D00 CTYPE          001F DUP
0022 OVER           0019 DROP           0018 2DROP          0029 SWAP
002D ROT            001C NIP            0028 BOUNDS         000C STREND
0175 0              0174 1              0173 2              0172 3
0171 4              016F 5              016E 6              016D 7
016C 8              016B 9              0169 0D             0176 ON
0176 TRUE           0176 -1             0167 BL             0168 16
0175 FALSE          0175 OFF            0037 1+             0034 1-
0036 2+             0033 2-             0031 +              0030 -
00D1 DO             00DB LOOP           00D8 +LOOP          00D2 FOR
00E3 NEXT           0085 I              003E INVERT         0040 AND
0042 ANDN           0044 OR             0046 XOR            004C ROL
004E ROR            0048 SHR            004A SHL            0050 2/
0052 2*             0054 REV            0056 MASK           005A >N
005B >B             005C 9BITS          0060 0=             0060 NOT
005E =              0064 <>             0068 >              006F C@
0071 W@             0073 @              0075 C+!            0077 C!
006C C@++           0079 W+!            007B W!             007D +!
007F !              00F7 UM*            00F3 *              00F1 ABS
0039 -NEGATE        003A ?NEGATE        003C NEGATE         00A7 PINSET
00AD PINCLR         00B3 PININP         00BE SHROUT         00BB SHRINP
1C86 DIRX           1C92 OUTX           1C9E INX            0091 PA@
0093 PB@            0095 PA!            0097 PB!            0099 DACLR
009B DBCLR          009D PASET          009E DASET          00A0 PBSET
00A1 DBSET          00A3 PACLR          00A5 PBCLR          00B5 PIN@
0004 RESET          0012 0EXIT          0014 EXIT           0016 NOP
0017 3DROP          001E ?DUP           0024 3RD            0026 4TH
00C1 CALL           00C5 JUMP           00E0 BRANCH>        00E7 >R
00EA R>             014E >L             00D6 L>             015D !SP
0165 !RP            0163 !LP            0161 !BP            1A6D DEPTH
0133 COG@           0138 COGREG         013A COG!           013F COGID
00EF @REG           0114 EMIT           097C CR             019F SPACE
0117 CNT@           1A49 ERROR          114E QD             1156 DUMP
117A DUMPW          11E0 COGDUMP        110E .BYTE          111E .WORD
112E .LONG          13B5 .              12B5 @PAD           12C7 HOLD
12D9 >CHAR          136B #>             1307 <#             131B #
135B #S             1387 <D>            13F3 PRINT$         1393 LEN$
13CB U.             1407 .DEC           0918 U/             0908 U/MOD
0922 */             0066 0<             0067 <              098A U<
1064 WITHIN         108A ?EXIT          10B4 ERASE          10B8 FILL
011E CMOVE          109A <CMOVE         10CC ms             0950 HEX
0944 DECIMAL        0936 BINARY         1603 GETWORD        161F SEARCH
0C00 NUMBER         1A75 .S             127B HERE           1C7C temp
1C72 accept         0EB2 names          0ED8 CREATEWORD     0F78 CREATE
0F8C ALLOT          1038 [COMPILE]      1000 pub            0ECA pri
0FE0 :              1010 ;              0E5C NFA'           0E76 '
0D30 IF             0D4C ELSE           0D7E THEN           0D7E ENDIF
0CC2 BEGIN          0CD4 UNTIL          0D16 AGAIN          0D30 WHILE
0CEE REPEAT         1B51 \              1B51 ''             1B75 (
1BA5 {              0016 }              1B99 IFNDEF         1B8B IFDEF
0DB8 "              0DC8 ."             10E2 (.")           19AE UNKNOWN
19EC NOTFOUND        ok

BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles.

pub fibo ( n -- f )         1+ 0 1 ROT FOR OVER + SWAP NEXT NIP ;
46 LAP fibo LAP .LAP SPACE . 8672 1836311903 ok

With OVER + SWAP made into a single operation:

46 LAP fibo LAP .LAP 3248 ok

So if the P2 were to run at 160MHz that would take 20.3us for a fibo(46)
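For anyone wanting to check the numbers, a Python sketch that mirrors the stack steps of the fibo definition and converts cycle counts to time. It assumes FOR runs the loop body n+1 times after the leading 1+, and that cells are 32 bits; the function names are my own.

```python
def fibo(n):
    """Mirror of: 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP"""
    lower, upper = 0, 1                    # stack after '0 1', upper is top
    for _ in range(n + 1):                 # '1+' then FOR ... NEXT
        lower, upper = (upper + lower) & 0xFFFFFFFF, lower   # OVER + SWAP
    return upper                           # NIP drops the item under the top

def cycles_to_us(cycles, clock_hz=160_000_000):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles * 1_000_000 / clock_hz

if __name__ == "__main__":
    print(fibo(46))
    print(cycles_to_us(3248), "us optimized,", cycles_to_us(8672), "us unoptimized")
```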
Since I now have deep and efficient stacks I could also try out the recursive algorithm later.

Peter Jakacki · 2015-10-18 15:48

This simple one-liner is all I need to test launching tasks in cogs. It just flashes the DE2's leds at different rates so that 10 leds are blinking and my main cog is still talking to the console while this is happening. So that means there must be 11 cogs available on the DE2?

: LEDTASK   COGID #32 + BEGIN DUP HIGH DUP 3 SHL ms DUP LOW DUP 3 SHL ms AGAIN ;
16 1 DO ' LEDTASK I TASK ! LOOP

BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.

3.FF80 80 DUMPL
FF80: 0000.0000 0000.0000 0000.0000 0000.0000    ................
FF90: 0000.0000 0000.0000 0000.0000 0000.0000    ................
FFA0: 0000.0000 0000.0000 0000.0000 0000.0000    ................
FFB0: 0000.0000 0000.0000 0000.0000 0000.0000    ................
FFC0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
FFD0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
FFE0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
FFF0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................ ok

potatohead · 2015-10-18 16:07

That last one is a big oops! You might want to make a thread for Chip when he gets back from his travels this week.

I thought the DE2 was good for 12 cogs.

Peter Jakacki · 2015-10-19 02:53

In-between expanding Tachyon I am also revisiting functional code and seeing what I can do to optimize it. So it is with the FILL and ERASE function which was relying on high-level wordcode calling a CMOVE function. This works but was taking 2114016 cycles (13.2ms @160MHz) for 64k so I looked at trying to do this with SETQ and WRBYTE but it doesn't seem suited to that. So instead I tried the WRFAST/REP/WFBYTE sequence which seems to have dropped the fill to a much more respectable 2 cycles/byte so that 64k completes in 131200 cycles (820us @160MHz).

To improve that I could work out if I can use words or longs in which case it will be much faster again at 205us for 64k fill (edit, which I have just tried with LFILL and confirmed)

Do any of you P2 PASM gurus have any tips and tricks you'd like to share?

' wrfast rep code -> $6.0000 $1.0000 LAP ERASE LAP DECIMAL .LAP 131200 ok (2 clks/byte) =  820us/64K
 ' ( addr cnt -- )
ERASE	call	#PUSHACC
' ( addr cnt fillch -- )
FILL 	wrfast	tos1,tos2
	rep	@.L0,tos1
	wfbyte	tos
.L0    	jmp	#DROP3

jmg · 2015-10-19 03:03

Peter Jakacki wrote: »

Do any of you P2 PASM gurus have any tips and tricks you'd like to share?

The code you have looks good.

Do you mean around the 32/16/8 bit decision ?
Do you want this run time or compile time choice ?

If speed matters over size, you could cover most of the fill using longs for speed, and I think Chip still allows any alignment, so wr longs can start on any address (may have a start-cost ?) so you do longs, then check if an addendum is needed. As that is only 0, 1, 2 or 3 bytes, a byte fill is probably ok.
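A sketch in Python of that split, with the bulk done as longs and a 0 to 3 byte tail. It assumes the fill character is replicated into all four bytes of each long, and it ignores any alignment start-cost; the names are hypothetical.

```python
def split_fill(count):
    """Return (whole longs, leftover bytes) for a byte count."""
    return count >> 2, count & 3

def fill(mem, addr, count, ch):
    """Fill count bytes of a bytearray from addr: longs first, then the byte tail."""
    longs, tail = split_fill(count)
    pattern = bytes([ch & 0xFF]) * 4            # fill byte replicated into a long
    for i in range(longs):                      # stands in for the long writes
        mem[addr + 4 * i: addr + 4 * i + 4] = pattern
    start = addr + 4 * longs
    for i in range(tail):                       # at most 3 single-byte writes
        mem[start + i] = ch & 0xFF

if __name__ == "__main__":
    mem = bytearray(16)
    fill(mem, 1, 10, 0xAA)
    print(split_fill(10), mem.hex())
```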

Peter Jakacki · 2015-10-19 03:10

Yes, I did try a long fill and it works well but byte/word/long are normally decided at runtime due to the wide range of parameters that will be thrown at it.

' $6.0000 $1.0000 0 LAP LFILL LAP DECIMAL .LAP 32864 ( 205us/64k )
' LFILL ( addr bytes ch -- )
LFILL	shr	tos1,#2
 	wrfast	tos1,tos2
	rep	@.L0,tos1
	wflong	tos
.L0    	jmp	#DROP3

The wfbyte/wfword/wflong are great candidates though for a simple SETS as the source field of the instruction controls the word size. So I will have a look at this but I don't want to make the runtime word too complicated either in trying to divide the operation up into mixed operations. Anyway I can use discrete FILL/WFILL/LFILL words but that's a waste or I could modify the source field each time perhaps. It's fun playing with the code at this level too.
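The three 64k fill timings quoted in these posts can be put side by side with a little Python arithmetic. The cycle counts are the ones reported above; the helper names are made up.

```python
CLOCK_HZ = 160_000_000      # clock rate assumed in the posts
FILL_BYTES = 0x1_0000       # 64k

def clocks_per_byte(cycles, nbytes=FILL_BYTES):
    return cycles / nbytes

def microseconds(cycles, clock_hz=CLOCK_HZ):
    return cycles * 1_000_000 / clock_hz

if __name__ == "__main__":
    reported = [("CMOVE-based", 2_114_016), ("WFBYTE", 131_200), ("WFLONG", 32_864)]
    for name, cycles in reported:
        print(f"{name:12s} {clocks_per_byte(cycles):6.2f} clks/byte "
              f"{microseconds(cycles):9.1f} us")
```

The long version works out to about half a clock per byte, i.e. the same 2 clocks per write as the byte version but moving four bytes each time.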
