PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
Are there hub $200..$3FF anymore? With HUB moved to $400, I thought there wasn't any address overlap in the latest memory map.
Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub.
I'm not quite following, are you talking about moving the stack by PC address ?
What about a case where the LUT is all already allocated ?
Using LUT as stack has obvious merits, but making it a hidden by-pc choice seems likely to surprise. Code that is moved from HUB to LUT changes more than expected.
This is all it is, and it has nothing to do with hub exec:
When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
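To make the proposal concrete, here is a minimal Python model of it. It is only a sketch of the behaviour described above, not actual hardware semantics: the `Cog` class and its method names are made up for illustration, and the choice to test the range on the already-decremented pointer during a return is an assumption.

```python
# Toy model of the proposed address-sensitive CALLA/RETA (illustration only).
LUT_FIRST, LUT_LAST = 0x200, 0x3FF   # PTRx values that select the lut


class Cog:
    def __init__(self):
        self.lut = [0] * 512          # 512 longs of lut
        self.hub = {}                 # sparse hub memory: byte address -> long
        self.ptra = 0

    def calla(self, return_addr):
        """Store the return address through PTRA, then bump PTRA."""
        if LUT_FIRST <= self.ptra <= LUT_LAST:
            self.lut[self.ptra & 0x1FF] = return_addr
            self.ptra += 1            # lut is long-addressed: +1
        else:
            self.hub[self.ptra] = return_addr
            self.ptra += 4            # hub is byte-addressed: +4

    def reta(self):
        """Step PTRA back, then fetch the return address."""
        probe = self.ptra - 1         # assumption: range test uses the decremented value
        if LUT_FIRST <= probe <= LUT_LAST:
            self.ptra = probe
            return self.lut[self.ptra & 0x1FF]
        self.ptra -= 4
        return self.hub[self.ptra]


cog = Cog()
cog.ptra = 0x200                      # stack lives in the lut
cog.calla(0x1234)
print(hex(cog.ptra), hex(cog.reta()))
```

The same two methods serve a hub stack when PTRA starts outside $200..$3FF; only the step size and the memory touched change.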
I'm still not seeing how that protects the LUT from accidental clobbering ?
A change in destination area is quite significant, not something that should be buried ?
ie I think these need explicit opcodes, not a hidden operational flip-flop.
Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.
Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!
Are there any other sleeper purposes for address sensitivity?
P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
As for cog ram, I don't know that much can be done about that. On the other hand, push/pop is not as important there since the stack is already located in directly addressable registers.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.
It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRx++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on whether the PTRx value is within the lut range or not. That would be pretty easy.
We would have these discrete instructions which would use hub or lut, based on the PTRx address:
PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB
Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
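The POPA concern raised earlier, and how discrete PUSHx/POPx instructions would resolve it, can be sketched in Python. This is a toy model under stated assumptions: the function names are hypothetical, hub memory is a sparse dictionary, and the range test on the decremented pointer is a guess at how the hardware might decide.

```python
# Toy model: why POPA-as-RDLONG misses a return address that CALLA put in the lut.
LUT_RANGE = range(0x200, 0x400)

lut = [0] * 512
hub = {}                               # sparse hub memory: byte address -> long


def calla(ptra, return_addr):
    """lut-aware call: returns the updated PTRA."""
    if ptra in LUT_RANGE:
        lut[ptra & 0x1FF] = return_addr
        return ptra + 1
    hub[ptra] = return_addr
    return ptra + 4


def popa_as_rdlong(ptra):
    """POPA as a plain RDLONG alias: always reads hub, always steps by 4."""
    ptra -= 4
    return hub.get(ptra, 0), ptra


def popa_discrete(ptra):
    """Proposed discrete POPA: picks lut or hub from the PTRA value."""
    if (ptra - 1) in LUT_RANGE:        # assumption: test the decremented pointer
        ptra -= 1
        return lut[ptra & 0x1FF], ptra
    ptra -= 4
    return hub.get(ptra, 0), ptra


ptra = calla(0x200, 0xBEEF)            # return address now sits in lut[0]
print(popa_as_rdlong(ptra))            # reads hub instead: wrong value
print(popa_discrete(ptra))             # reads lut: the value CALLA stored
```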
I know I keep pushing, but really if the instructions were always longs we would only require 18-bits of addressing, not 20-bits. This would free up 2-bits in those large instructions.
At least I need to understand some of the callx/retx/pushx/popx instructions because you caught me unawares about the pushx/popx being wrlong/rdlong instructions acting on the hub.
May I suggest that we get a new FPGA release out first, and that we continue this discussion in a new separate thread rather than here on Peter's thread. I think there are huge advantages in being able to use the LUT for the stack.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
On top of that, is the common ASM trick of CALL, then a Table Jump then RET
- if the RET is now in a different memory space, the Pointer has moved....
The idea of optional LUT as Stack is good but I think it needs explicit instructions.
When this breaks, it will be very hard to find.
I've been making some really good progress on this subroutine threaded version, it is looking really schmick now. Also, rather than performing relative branching with IF ELSE WHILE UNTIL AGAIN etc., these words are now combined internally into just two definitions that read an absolute branch address. So ELSE and AGAIN are unconditional jumps whereas IF, WHILE and UNTIL are conditional jumps. Internally ELSE and AGAIN are just referred to as GOTO.
In Tachyon P1 when we went to emit a single character it always required 4 bytes of code either as _BYTE,$0D,XCALL,xEMIT or XCALL,xPRTSTR,$0D,00 and it looked like we needed 6 bytes to do the same in P2 but this has dropped back down to 4 bytes with a simple PRTSTR,$0D since the $0D is compiled as a 16-bit word which is in fact the string $0D,00.
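The reason the 16-bit $0D doubles as a string is byte order: the Propeller stores words little-endian, so the low byte comes first and the zero high byte acts as the terminator. A small Python check, where the word-code value used for PRTSTR is a made-up placeholder:

```python
import struct

PRTSTR = 0x0123            # hypothetical word-code address, for illustration only


def compile_words(*values):
    """Lay out 16-bit word-codes little-endian, as they would sit in hub memory."""
    return b"".join(struct.pack("<H", v) for v in values)


code = compile_words(PRTSTR, 0x0D)
print(len(code), code[2:])     # 4 bytes in total; the last two are the string "\r\0"
```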
Here's the main demo plus a few other bits as it is compiled in PNut. It looks a lot cleaner without the relative branch calculations.
dat
        orgh
Tachyon
        word    PRTSTR
        byte    "Tachyon Forth R16 for the P2 V0.1 151008 ",0
        word    CR,_8,STARS ' simple subroutine call
        word    CR,_WORD,$21,_WORD,$5D,FOR,DUP,EMIT,INC,forNEXT,DROP ' display ASCII using FOR NEXT
        word    CR,_WORD,$21
lbl0    word    DUP,EMIT,INC,DUP,_WORD,$7E,_EQ,_UNTIL,lbl0,DROP ' display ASCII using BEGIN UNTIL
        word    CR,_LONG
        long    $CAFEF00D ' test Forth versions of number printing
        word    DUP,PRTHEX,SPACE,DUP,PRTBYTE,SPACE
        word    DUP,PRTWORD,SPACE,PRTLONG,SPACE
        word    _GETCNT,_WORD,10000 ' time a simple 10000 FOR $1234 DROP NEXT loop
        word    FOR,_WORD,$1234,DROP,forNEXT
        word    _GETCNT,SWAP,MINUS,SPACE,PRINT ' display results
        word    _GETCNT,_NOP,_GETCNT,SWAP,MINUS,SPACE,PRINT ' display timing for a NOP (plus GETCNTs)
        word    _GETCNT,_GETCNT,SWAP,MINUS,SPACE,PRINT ' display timing for GETCNTs themselves
        word    SLOWIO,SUBIO,FASTIO ' toggle some pins with various methods
        word    _0,_WORD,$100,HDUMP ' dump some hub memory
        word    CR,GOTO,Tachyon
CR      call    #COLON
        word    PRTSTR,$0D,PRTSTR,$0A,EXIT
STARS   call    #COLON
        word    FOR,PRTSTR,"*",forNEXT,EXIT
FASTIO  mov     X,#1
        or      DIRA,X
        mov     R2,#16
FIOLP   or      OUTA,X
        andn    OUTA,X
        djnz    R2,@FIOLP
        ret
At present I am converting and integrating a lot of the bytecode from Tachyon P1 after which I will be able to switch on full console interaction. Not long now!
Now that I have Tachyon up and running for the P2 I need to go back into the kernel and optimize some stuff. I am already using the LUT for the four stacks as well as maintaining top of stack copies in cog ram. The kernel is much more simplified compared to the P1 in that all code is compiled as an address and all addresses are assumed to be code, whether in cog, lut, or hub. So all high-level code does a "call #COLON" then has 16-bit sub-routine threaded word-code addresses to "interpret" which is surprisingly compact. Compiling these word-codes as a full CALL takes twice as much memory but does not seem to run much faster.
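As a rough picture of "all addresses are assumed to be code", here is a tiny Python model of a subroutine-threaded inner loop. It is a sketch only: the addresses are invented, Python callables stand in for machine code, and a list stands in for the word-codes that follow a "call #COLON". The real kernel's mechanics are not shown in the post.

```python
# Toy subroutine-threaded interpreter (illustrative addresses and names).
DUP, PLUS, EIGHT = 0x0010, 0x0012, 0x0014      # "machine code" words
DOUBLE, MAIN = 0x2000, 0x2010                  # high-level (colon) words

CODE = {
    DUP: lambda s: s.append(s[-1]),
    PLUS: lambda s: s.append(s.pop() + s.pop()),
    EIGHT: lambda s: s.append(8),
    DOUBLE: [DUP, PLUS],                       # word-codes after "call #COLON"
    MAIN: [EIGHT, DOUBLE, DOUBLE],
}


def execute(addr, code, stack):
    """Every word-code is an address; what lives there decides how it runs."""
    target = code[addr]
    if callable(target):                       # machine code: just run it
        target(stack)
    else:                                      # colon definition: thread through it
        for word_addr in target:
            execute(word_addr, code, stack)    # returning here plays the role of EXIT


def run(entry):
    stack = []
    execute(entry, CODE, stack)
    return stack


print(run(MAIN))
```

Each high-level word costs one 16-bit entry per call, which is where the compactness mentioned above comes from.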
The code space is only 64k at the moment but at least all of it is code space and that's a lot of code space. I would only need to align code entry on longs to increase this range up to 256k. The dictionary now has the luxury of being able to expand without having to resort to special compacting techniques as was done for P1 using the EEPROM. Also the dictionary is a bit simpler in that the header does not need to know what type of code it is anymore, it just has a 16-bit pointer and no vector table of course.
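The 64k and 256k figures follow directly from the pointer width and the entry alignment. A quick Python check of that arithmetic, with hypothetical `encode`/`decode` helpers showing what long-aligned entry points would mean for a 16-bit word-code:

```python
def code_space(pointer_bits, alignment):
    """Bytes reachable by a pointer of the given width at the given alignment."""
    return (1 << pointer_bits) * alignment


def encode(byte_addr):
    """Hypothetical: store a long-aligned entry address in 16 bits."""
    if byte_addr % 4:
        raise ValueError("entry point must be long-aligned")
    return byte_addr >> 2


def decode(word_code):
    return word_code << 2


print(code_space(16, 1) // 1024, "KB with byte-aligned entries")
print(code_space(16, 4) // 1024, "KB with long-aligned entries")
```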
After I test Tachyon a bit more I will write an inline PASM style assembler that can read from SD so I can test out even more and expand the networking capabilities too.
I'd like to integrate the serial transmit with the receive all in one cog rather than transmit from the main cog so has anyone got any tips in regards to writing a full-duplex driver using interrupts? Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.
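For the common subset described above, the frame sent to the flash is just a command byte followed by a 24-bit address, most significant byte first. A minimal Python sketch of building that frame; the 0x03 "read data" opcode is the usual one for these parts (the W25Q series included), but check the datasheet for the device actually fitted, and note that the function name is made up:

```python
READ_DATA = 0x03   # plain (not fast) 1-bit read opcode; verify against the datasheet


def read_command(address):
    """Build the 4-byte header for a plain SPI flash read: opcode + 24-bit address."""
    if not 0 <= address < (1 << 24):
        raise ValueError("address must fit in 24 bits")
    return bytes([READ_DATA]) + address.to_bytes(3, "big")


print(read_command(0x012345).hex())
```

After these four bytes are clocked out, the device streams data for as long as chip select stays low.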
I picked up a tube of W25Q80s a couple of years ago in anticipation back then for P2 but I can't seem to find anything about SPI Flash booting which I guess is probably not implemented as of yet.
It's not. Likely next stage of the design. For that, we need booter, crypt, etc... Gotta settle the basics first. Right now, it's all small as it goes through various revision to settle bugs in the instructions, form, addressing, etc... Chip has rewritten the basic support and sample programs, what 4 times?
I suspect once the bugs end, and we are all feeling happy, the next stage will kick off.
Forth is in WAY early. Nice job guys! (talking about Pfth too)
Thanks, it'll be nice when I can boot from SPI as then I have a real chance to test out the filesystem and networking for real but until then I still have testing and optimizing to do. When I'm happy enough with it I will post the source, just as a regular text document though via my dropbox links.
This bit of a screen grab dumps the temporary compilation area and so shows the 16-bit words that are compiled for that same line, then a couple of stack manipulations etc. The 0014 at the end of the dump is the EXIT that is appended automatically (plus one extra at the moment)
Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151015.0000
----------------------------------------------------------------
%11010011 $1AF0 1234 HERE $20 DUMPW
1C00: 0087 00D3 0087 1AF0 0087 04D2 1207 0087 ................
1C10: 0020 1164 0014 0014 0000 0000 0000 0000 .d.............
ok
.S [0000.04D2 0000.1AF0 0000.00D3 0000.0003] ok
SWAP OVER + .S [0000.1FC2 0000.04D2 0000.00D3 0000.0003] ok
ROT .S [0000.00D3 0000.1FC2 0000.04D2 0000.0003] ok
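The stack listings above can be reproduced with a few lines of Python, which also confirms the literals (%11010011 is $00D3, decimal 1234 is $04D2). This is only a model: it assumes .S prints the top four items with the top of stack on the left, and the helper names are made up.

```python
# Model of the session above; Python lists keep the top of stack at the end.
def swap(s):
    s[-1], s[-2] = s[-2], s[-1]


def over(s):
    s.append(s[-2])


def plus(s):
    s.append((s.pop() + s.pop()) & 0xFFFFFFFF)


def rot(s):
    s.append(s.pop(-3))            # third item moves to the top


def show(s, depth=4):
    """Format like .S: top of stack first, each long as HHHH.LLLL."""
    items = reversed(s[-depth:])
    return "[" + " ".join(f"{v >> 16:04X}.{v & 0xFFFF:04X}" for v in items) + "]"


stack = [0x0003, 0b11010011, 0x1AF0, 1234]
print(show(stack))
swap(stack); over(stack); plus(stack)
print(show(stack))
rot(stack)
print(show(stack))
```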
In developing this kernel further (which, btw, is working very nicely) I'd like to optimize some cog code, but I can't seem to find an easy way to do cog indexed or indirect addressing. With the LUT we have RDLUT and WRLUT where we can specify an address register, same as we do for RDLONG/WRLONG etc. Now SETS and SETD still require two instructions or nops between the set and the instruction. Then there is ALTDS, which is badly in need of some macros but looks like it could do the job, although in a somewhat unwieldy fashion. It's also a pity that SETB/CLRB doesn't span addresses for bits>31 as that would be very useful for a whole number of things including accessing the ports as one.
So, have I missed something? Is there an easy way to address cog memory indirectly?
BTW, if someone wants to play with Tachyon "as is" just let me know as I don't need "bug" reports yet
Here's a little test loop that flashes the leds on the DE2 with patterns from memory; if key2 is pressed it runs faster, and it exits if key3 is pressed.
0 BEGIN DUP W@ $FFFF PBCLR PBSET 30 PIN@ IF #100 ELSE #20 THEN ms 2 + 31 PIN@ 0= UNTIL
ALTDS D,S/# 'modify D according to bits in S and possibly replace next instruction's CCCCOOOOOOOCZI / DDDDDDDDD / SSSSSSSSS fields.
In ALTDS, S provides the following pattern: %RRR_DDD_SSS
%RRR: (101 allows instruction substitution)
000 = don't affect D's CCCCOOOOOOOCZI field
001 = don't affect D's CCCCOOOOOOOCZI field, cancel write for next instruction
010 = decrement D's OOOOOOOCZ field
011 = increment D's OOOOOOOCZ field
100 = use D's OOOOOOOCZ field as the result register for the next instruction (separate from D)
101 = use D's CCCCOOOOOOOCZI field as next instruction's CCCCOOOOOOOCZI field
110 = use D's OOOOOOOCZ field as the result register for the next instruction, decrement D's OOOOOOOCZ field
111 = use D's OOOOOOOCZ field as the result register for the next instruction, increment D's OOOOOOOCZ field
%DDD
000 = don't affect D's DDDDDDDDD field
001 = copy D's SSSSSSSSS field into its DDDDDDDDD field
010 = decrement D's DDDDDDDDD field
011 = increment D's DDDDDDDDD field
100 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction
101 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, copy D's SSSSSSSSS field into its DDDDDDDDD field
110 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, decrement D's DDDDDDDDD field
111 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, increment D's DDDDDDDDD field
%SSS
000 = don't affect D's SSSSSSSSS field
001 = copy D's DDDDDDDDD field into its SSSSSSSSS field
010 = decrement D's SSSSSSSSS field
011 = increment D's SSSSSSSSS field
100 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction
101 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, copy D's DDDDDDDDD field into its SSSSSSSSS field
110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field
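Since examples were asked for, here is at least how the S pattern splits up, as a small Python helper. It only decodes the %RRR_DDD_SSS control value according to the list above; it does not model what the cog then does, and the function name is made up.

```python
def decode_altds_pattern(s):
    """Split an ALTDS S value into its three 3-bit control fields."""
    s &= 0x1FF                         # only the low 9 bits carry the pattern
    return {
        "RRR": (s >> 6) & 0b111,       # result-register / instruction-field control
        "DDD": (s >> 3) & 0b111,       # D-field control
        "SSS": s & 0b111,              # S-field control
    }


# %000_111_000: substitute D's D field into the next instruction, then increment it
print(decode_altds_pattern(0b000_111_000))
```

Per the table, a pattern such as %000_111_000 would give an auto-incrementing destination index, and %000_000_111 the same for the source.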
I was hoping for a simpler single "instruction" rather than an ALTDS instruction, seeing we already have the RD/WR type of instructions for lut and hub. Also, does it actually modify the next instruction in memory, or just in the pipeline, which is what I think it must do?
It would help too if actual examples were given; it doesn't look clean and simple at all.
Yes, it just modifies the ALU input in the pipeline.
Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.
PTR addressing modes only work with hub read/write; otherwise they are just registers like any other, so a MOV will read or write the PTR register itself rather than use it indirectly. No, what's needed is a simple RDCOG and WRCOG instruction. Surely that's not hard to implement, and it would work in the same manner as RDLUT and WRLUT.
The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how this was intended to be used as immediately it looks kludgey when you try to use it. At least proper macros could hide most of this kludge but most of the time I just need a simple RDCOG/WRCOG.
That's not correct though as MOV moves from register or immediate to register whereas a RDCOG would use the source indirectly just as RDLUT and also as RDLONG does etc. So if the source pointed to a register "myio" which held an address that pointed to INB then RDCOG X,myio would be equivalent to MOV X,INB.
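The difference between MOV and the requested RDCOG/WRCOG is one extra level of lookup. A Python model of the intended semantics, where RDCOG and WRCOG are the proposed (not existing) instructions and every register address below is a made-up placeholder:

```python
# Cog register file as a flat list of 512 longs (addresses are illustrative).
X, MYIO, INB = 0x010, 0x011, 0x1F0


def mov(regs, d, s):
    """MOV D,S: copy register S into register D."""
    regs[d] = regs[s]


def rdcog(regs, d, s):
    """Proposed RDCOG D,S: S holds the address of the register to read."""
    regs[d] = regs[regs[s] & 0x1FF]


def wrcog(regs, d, s):
    """Proposed WRCOG D,S: write D to the register whose address is in S."""
    regs[regs[s] & 0x1FF] = regs[d]


regs = [0] * 512
regs[INB] = 0xA5          # pretend pin state
regs[MYIO] = INB          # myio holds an address that points to INB
rdcog(regs, X, MYIO)      # same effect as MOV X,INB
print(hex(regs[X]))
```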
@Chip: How hard would it be to add RDCOG and WRCOG in the same manner?
It is the indirection, indexing that takes the time. It can happen in hubexec because the instruction comes from a different place. In cogex, it's all timing bound.
Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.
The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.
Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.
I miss those pointers too.
There is actually a discussion on this in the older design thread.
Current word list for Tachyon as I continue testing etc. I've looked at what I need for pin I/O and so I've put in PINSET/PINCLR/PININP/PIN@ which accepts bits 0 to 63 to span both ports plus I have PASET/PACLR PBSET/PBCLR to write a 32-bit mask to the ports as well as sets the direction to output. I just need to clean up some more bugs and I should be able to release this source if anyone is interested.
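Spanning both ports with one pin number comes down to a port select plus a bit mask. A Python sketch of that mapping, assuming pins 0..31 live on port A and 32..63 on port B (the function name is hypothetical):

```python
def pin_to_port(pin):
    """Map a pin number 0..63 to (port, 32-bit mask)."""
    if not 0 <= pin <= 63:
        raise ValueError("pin must be in 0..63")
    port = "A" if pin < 32 else "B"
    return port, 1 << (pin & 31)


print(pin_to_port(30))     # a port A pin
print(pin_to_port(32))     # first pin on port B
```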
BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles.
pub fibo ( n -- f ) 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP ;
46 LAP fibo LAP .LAP SPACE . 8672 1836311903 ok
With a OVER + SWAP made into a single operation:
46 LAP fibo LAP .LAP 3248 ok
So if the P2 were to run at 160MHz that would take 20.3us for a fibo(46)
Since I now have deep and efficient stacks I could also try out the recursive algorithm later.
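For anyone who does not read Forth, the iterative fibo above translates almost word for word into Python. The two variables stand for the top two stack items; the sketch assumes FOR ... NEXT runs its body exactly the counted number of times, which matches the printed result.

```python
def fibo(n):
    """Python mirror of: 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP"""
    second, top = 0, 1                         # 0 1 on the stack, count is n+1
    for _ in range(n + 1):                     # 1+ ... FOR
        second, top = second + top, second     # OVER + SWAP
    return top                                 # NIP drops the item under the top


print(fibo(46))
```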
This simple one-liner is all I need to test launching tasks in cogs. It just flashes the DE2's leds at different rates so that 10 leds are blinking and my main cog is still talking to the console while this is happening. So that means there must be 11 cogs available on the DE2?
: LEDTASK COGID #32 + BEGIN DUP HIGH DUP 3 SHL ms DUP LOW DUP 3 SHL ms AGAIN ;
16 1 DO ' LEDTASK I TASK ! LOOP
BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.
In-between expanding Tachyon I am also revisiting functional code and seeing what I can do to optimize it. So it is with the FILL and ERASE function which was relying on high-level wordcode calling a CMOVE function. This works but was taking 2114016 cycles (13.2ms @160MHz) for 64k so I looked at trying to do this with SETQ and WRBYTE but it doesn't seem suited to that. So instead I tried the WRFAST/REP/WFBYTE sequence which seems to have dropped the fill to a much more respectable 2 cycles/byte so that 64k completes in 131200 cycles (820us @160MHz).
To improve that I could work out if I can use words or longs in which case it will be much faster again at 205us for 64k fill (edit, which I have just tried with LFILL and confirmed)
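The timing figures quoted here are easy to sanity-check. A short Python calculation, assuming the 160MHz clock mentioned above; the 2 clocks per long for the long fill is an inference from the 205us figure, not something measured in the post.

```python
CLOCK_HZ = 160_000_000     # assumed P2 clock from the post


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds."""
    return cycles * 1_000_000 / clock_hz


def fill_cycles(units, clocks_per_unit=2):
    """Loop cost only: one WFBYTE/WFWORD/WFLONG per unit, overhead ignored."""
    return units * clocks_per_unit


print(cycles_to_us(2114016) / 1000, "ms for the CMOVE-based 64k fill")
print(cycles_to_us(131200), "us for the WRFAST/REP/WFBYTE 64k fill")
print(131200 - fill_cycles(65536), "cycles of overhead beyond 2 clocks/byte")
print(cycles_to_us(fill_cycles(65536 // 4)), "us for a 64k fill done as longs")
```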
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
' wrfast rep code -> $6.0000 $1.0000 LAP ERASE LAP DECIMAL .LAP 131200 ok (2 clks/byte) = 820us/64K
' ( addr cnt -- )
ERASE call #PUSHACC
' ( addr cnt fillch -- )
FILL wrfast tos1,tos2
rep @.L0,tos1
wfbyte tos
.L0 jmp #DROP3
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
The code you have looks good.
Do you mean around the 32/16/8 bit decision ?
Do you want this run time or compile time choice ?
If speed matters over size, you could cover most of the fill using longs for speed, and I think Chip still allows any alignment, so long writes can start on any address (there may be a start-up cost?). So you do longs, then check if a remainder is needed. As that is only 0, 1, 2 or 3 bytes, a byte fill is probably OK.
Yes, I did try a long fill and it works well but byte/word/long are normally decided at runtime due to the wide range of parameters that will be thrown at it.
The wfbyte/wfword/wflong are great candidates though for a simple SETS as the source field of the instruction controls the word size. So I will have a look at this but I don't want to make the runtime word too complicated either in trying to divide the operation up into mixed operations. Anyway I can use discrete FILL/WFILL/LFILL words but that's a waste or I could modify the source field each time perhaps. It's fun playing with the code at this level too.
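The "longs first, then a 0 to 3 byte remainder" idea can be sketched in Python over a bytearray standing in for hub memory. It is only a model of the split, relying on the statement above that long writes may start at any alignment; the function name is made up.

```python
def fill(mem, addr, count, value):
    """Fill `count` bytes at `addr`: bulk as longs, remainder as bytes."""
    value &= 0xFF
    longs, tail = divmod(count, 4)
    pattern = bytes([value]) * 4               # the byte replicated into a long
    for i in range(longs):                     # the WFLONG-style bulk pass
        start = addr + 4 * i
        mem[start:start + 4] = pattern
    tail_start = addr + 4 * longs
    for i in range(tail):                      # 0..3 leftover bytes
        mem[tail_start + i] = value
    return longs, tail


hub = bytearray(16)
print(fill(hub, 1, 7, 0xAA), hub.hex())
```

Whether the extra decision is worth it over discrete FILL/WFILL/LFILL words depends on how often short or odd-length fills occur.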
If you add PTRA/B variants of RDLUT/WRLUT, then you can get the equivalent push/pop functionality.
WOW! Love the idea.
Have a look at the ALTDS instruction
This should do the trick.
It would help too if actual examples were given; it doesn't look clean and simple at all.
Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.
The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how it was intended to be used? It looks kludgey as soon as you try to use it. At least proper macros could hide most of this kludge, but most of the time I just need a simple RDCOG/WRCOG.
The usual way is to maintain a pointer, add, setd, sets, etc...
Altds compresses this for some additional speed, in that it can work directly with the target instruction data.
We don't have the advanced pipeline to work otherwise. Cog ops of this kind take a couple instructions as a result.
That is the why, in any case. Maybe Chip has a better answer in the wings.
@Chip: How hard would it be to add RDCOG and WRCOG in the same manner?
Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.
The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.
Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.
I miss those pointers too.
There is actually a discussion on this in the older design thread.
Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151016.1800 ---------------------------------------------------------------- ok ok WORDS 1D3E WORDS 1D1A WTAB 1D00 CTYPE 001F DUP 0022 OVER 0019 DROP 0018 2DROP 0029 SWAP 002D ROT 001C NIP 0028 BOUNDS 000C STREND 0175 0 0174 1 0173 2 0172 3 0171 4 016F 5 016E 6 016D 7 016C 8 016B 9 0169 0D 0176 ON 0176 TRUE 0176 -1 0167 BL 0168 16 0175 FALSE 0175 OFF 0037 1+ 0034 1- 0036 2+ 0033 2- 0031 + 0030 - 00D1 DO 00DB LOOP 00D8 +LOOP 00D2 FOR 00E3 NEXT 0085 I 003E INVERT 0040 AND 0042 ANDN 0044 OR 0046 XOR 004C ROL 004E ROR 0048 SHR 004A SHL 0050 2/ 0052 2* 0054 REV 0056 MASK 005A >N 005B >B 005C 9BITS 0060 0= 0060 NOT 005E = 0064 <> 0068 > 006F C@ 0071 W@ 0073 @ 0075 C+! 0077 C! 006C C@++ 0079 W+! 007B W! 007D +! 007F ! 00F7 UM* 00F3 * 00F1 ABS 0039 -NEGATE 003A ?NEGATE 003C NEGATE 00A7 PINSET 00AD PINCLR 00B3 PININP 00BE SHROUT 00BB SHRINP 1C86 DIRX 1C92 OUTX 1C9E INX 0091 PA@ 0093 PB@ 0095 PA! 0097 PB! 0099 DACLR 009B DBCLR 009D PASET 009E DASET 00A0 PBSET 00A1 DBSET 00A3 PACLR 00A5 PBCLR 00B5 PIN@ 0004 RESET 0012 0EXIT 0014 EXIT 0016 NOP 0017 3DROP 001E ?DUP 0024 3RD 0026 4TH 00C1 CALL 00C5 JUMP 00E0 BRANCH> 00E7 >R 00EA R> 014E >L 00D6 L> 015D !SP 0165 !RP 0163 !LP 0161 !BP 1A6D DEPTH 0133 COG@ 0138 COGREG 013A COG! 013F COGID 00EF @REG 0114 EMIT 097C CR 019F SPACE 0117 CNT@ 1A49 ERROR 114E QD 1156 DUMP 117A DUMPW 11E0 COGDUMP 110E .BYTE 111E .WORD 112E .LONG 13B5 . 12B5 @PAD 12C7 HOLD 12D9 >CHAR 136B #> 1307 <# 131B # 135B #S 1387 <D> 13F3 PRINT$ 1393 LEN$ 13CB U. 
1407 .DEC 0918 U/ 0908 U/MOD 0922 */ 0066 0< 0067 < 098A U< 1064 WITHIN 108A ?EXIT 10B4 ERASE 10B8 FILL 011E CMOVE 109A <CMOVE 10CC ms 0950 HEX 0944 DECIMAL 0936 BINARY 1603 GETWORD 161F SEARCH 0C00 NUMBER 1A75 .S 127B HERE 1C7C temp 1C72 accept 0EB2 names 0ED8 CREATEWORD 0F78 CREATE 0F8C ALLOT 1038 [COMPILE] 1000 pub 0ECA pri 0FE0 : 1010 ; 0E5C NFA' 0E76 ' 0D30 IF 0D4C ELSE 0D7E THEN 0D7E ENDIF 0CC2 BEGIN 0CD4 UNTIL 0D16 AGAIN 0D30 WHILE 0CEE REPEAT 1B51 \ 1B51 '' 1B75 ( 1BA5 { 0016 } 1B99 IFNDEF 1B8B IFDEF 0DB8 " 0DC8 ." 10E2 (.") 19AE UNKNOWN 19EC NOTFOUND ok
BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles.
pub fibo ( n -- f ) 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP ;
46 LAP fibo LAP .LAP SPACE . 8672 1836311903 ok
With OVER + SWAP made into a single operation:
46 LAP fibo LAP .LAP 3248 ok
So if the P2 were to run at 160MHz, that would take 20.3us for a fibo(46).
Since I now have deep and efficient stacks, I could also try out the recursive algorithm later.
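For anyone checking the numbers, here is a small Python sketch that mirrors the stack shuffling of the Forth word and redoes the cycle-to-time arithmetic. It is a model of the algorithm only, not Tachyon code; the 160MHz clock is the hypothetical figure from the post above, and it assumes FOR ... NEXT runs the body exactly as many times as the count on the stack.

```python
def fibo(n):
    """Iterative Fibonacci following '1+ 0 1 ROT FOR OVER + SWAP NEXT NIP'."""
    second, top = 0, 1                 # the '0 1' pushed before the loop
    for _ in range(n + 1):             # '1+' makes the loop run n+1 times
        # OVER + SWAP: ( s t -- s+t s )
        second, top = second + top, second
    return top                         # NIP drops 'second', leaving 'top'

CLOCK_HZ = 160_000_000                 # assumed clock, as in the post

def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock."""
    return cycles * 1_000_000 / clock_hz

print(fibo(46))                        # 1836311903, matching the Tachyon output
print(round(cycles_to_us(3248), 1))    # 20.3 (us) for the fused version
print(round(cycles_to_us(8672), 1))    # 54.2 (us) for the unoptimized version
```

On these figures, fusing OVER + SWAP into one operation is worth roughly a 2.7x speedup (8672 / 3248).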
: LEDTASK COGID #32 + BEGIN DUP HIGH DUP 3 SHL ms DUP LOW DUP 3 SHL ms AGAIN ; 16 1 DO ' LEDTASK I TASK ! LOOP
BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.
3.FF80 80 DUMPL
FF80: 0000.0000 0000.0000 0000.0000 0000.0000 ................
FF90: 0000.0000 0000.0000 0000.0000 0000.0000 ................
FFA0: 0000.0000 0000.0000 0000.0000 0000.0000 ................
FFB0: 0000.0000 0000.0000 0000.0000 0000.0000 ................
FFC0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF ................
FFD0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF ................
FFE0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF ................
FFF0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF ................ ok
I thought the DE2 was good for 12 cogs.
To improve that, I could work out whether I can use words or longs, in which case it would be much faster again, at 205us for a 64k fill (edit: I have just tried this with LFILL and confirmed it).
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
' wrfast rep code -> $6.0000 $1.0000 LAP ERASE LAP DECIMAL .LAP 131200 ok (2 clks/byte) = 820us/64K
' ( addr cnt -- )
ERASE   call #PUSHACC
' ( addr cnt fillch -- )
FILL    wrfast tos1,tos2
        rep @.L0,tos1
        wfbyte tos
.L0     jmp #DROP3
Do you mean around the 32/16/8 bit decision ?
Do you want this run time or compile time choice ?
If speed matters over size, you could cover most of the fill using longs, and I think Chip still allows any alignment, so long writes can start on any address (there may be a start-up cost?). So you do the longs, then check whether a remainder is needed. As that is only 0, 1, 2 or 3 bytes, a byte fill is probably OK for it.
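The longs-then-remainder idea can be sketched in Python as below. This is only a model over a bytearray, not PASM; the name fill_longs_then_bytes is hypothetical, and it takes the post's assumption that long writes may start at any byte address.

```python
def fill_longs_then_bytes(mem, addr, count, ch):
    """Fill count bytes at addr with ch: whole longs first, then 0..3 tail bytes.

    Returns (longs, tail) so the split can be inspected.
    """
    ch &= 0xFF
    longs, tail = divmod(count, 4)          # bulk in longs, remainder in bytes
    pattern = bytes([ch]) * 4               # the fill byte replicated into a long
    for i in range(longs):
        start = addr + 4 * i
        mem[start:start + 4] = pattern      # one "long write" (any alignment assumed)
    tail_start = addr + 4 * longs
    for i in range(tail):                   # at most 3 single-byte writes
        mem[tail_start + i] = ch
    return longs, tail

hub = bytearray(16)                         # stand-in for a slice of hub RAM
print(fill_longs_then_bytes(hub, 1, 7, 0xAA))   # (1, 3): one long, three bytes
print(hub.hex())
```

Because the tail is never more than three bytes, the byte loop adds a small, bounded cost on top of the long fill.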
' $6.0000 $1.0000 0 LAP LFILL LAP DECIMAL .LAP 32864 ( 205us/64k )
' LFILL ( addr bytes ch -- )
LFILL   shr tos1,#2
        wrfast tos1,tos2
        rep @.L0,tos1
        wflong tos
.L0     jmp #DROP3
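As a sanity check on the two LAP results, this Python sketch works out the per-unit cost and the time for a 64K fill. The 160MHz clock is the hypothetical figure used earlier in the thread, and the cycle counts are the measured values quoted above; the helper name fill_stats is made up for illustration.

```python
CLOCK_MHZ = 160            # assumed clock, as used earlier in the thread
FILL_BYTES = 0x10000       # 64K fill, the $1.0000 count in the tests

def fill_stats(cycles, unit_bytes):
    """Return (clocks per unit written, microseconds for the whole fill)."""
    units = FILL_BYTES // unit_bytes
    return cycles / units, cycles / CLOCK_MHZ

byte_clks, byte_us = fill_stats(131200, 1)   # FILL using wfbyte
long_clks, long_us = fill_stats(32864, 4)    # LFILL using wflong

print(f"wfbyte: {byte_clks:.2f} clks/byte, {byte_us:.0f}us")
print(f"wflong: {long_clks:.2f} clks/long, {long_us:.0f}us")
```

Both loops come out at about 2 clocks per write, so the 4x gain from LFILL comes purely from moving four bytes per write.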
The wfbyte/wfword/wflong instructions are great candidates for a simple SETS, though, as the source field of the instruction controls the word size. So I will have a look at this, but I don't want to make the runtime word too complicated by trying to divide the fill up into mixed operations. I could use discrete FILL/WFILL/LFILL words, but that's a waste, or I could perhaps modify the source field each time. It's fun playing with the code at this level too.