PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
Are there hub $200..$3FF anymore? With HUB moved to $400, I thought there wasn't any address overlap in the latest memory map.
Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub.
I'm not quite following, are you talking about moving the stack by PC address ?
What about a case where the LUT is all already allocated ?
Using LUT as stack has obvious merits, but making it a hidden by-pc choice seems likely to surprise. Code that is moved from HUB to LUT changes more than expected.
This is all it is, and it has nothing to do with hub exec:
When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
This is all it is, and it has nothing to do with hub exec:
When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
I'm still not seeing how that protects the LUT from accidental clobbering ?
A change in destination area is quite significant, not something that should be buried ?
ie I think these need explicit opcodes, not a hidden operational flip-flop.
Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.
Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!
Are there any other sleeper purposes for address sensitivity?
P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.
Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!
Are there any other sleeper purposes for address sensitivity?
P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
As for cog ram, I don't know that much can be done about that. On the other hand, push/pop is not as important there since the stack is already located in directly addressable registers.
Chip,
Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.
Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!
Are there any other sleeper purposes for address sensitivity?
P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.
It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRX++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on the PTRx range being within lut, or not. That would be pretty easy.
We would have these discrete instructions which would use hub or lut, based on the PTRx address:
PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB
Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
I know I keep pushing, but really if the instructions were always longs we would only require 18-bits of addressing, not 20-bits. This would free up 2-bits in those large instructions.
At least I need to understand some of the callx/retx/pushx/popx instructions because you caught me unawares about the pushx/popx being wrlong/rdlong instructions acting on the hub.
May I suggest that we get a new FPGA release out first, and that we continue this discussion in a new separate thread rather than here on Peter's thread. I think there are huge advantages in being able to use the LUT for the stack.
Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
On top of that, is the common ASM trick of CALL, then a Table Jump then RET
- if the RET is now in a different memory space, the Pointer has moved....
The idea of optional LUT as Stack is good but I think it needs explicit instructions.
When this breaks, it will be very hard to find.
I've been making some really good progress on this subroutine threaded version, it is looking really schmick now. Also now rather than performing relative branching with IF ELSE WHILE UNTIL AGAIN etc these words are now combined internally into just two definitions that read an absolute branch. So ELSE and AGAIN are unconditionally jumps whereas IF WHILE UNTIL are conditional jumps. Internally ELSE and AGAIN are just referred to as GOTO
In Tachyon P1 when we went to emit a single character it always required 4 bytes of code either as _BYTE,$0D,XCALL,xEMIT or XCALL,xPRTSTR,$0D,00 and it looked like we needed 6 bytes to do the same in P2 but this has dropped back down to 4 bytes with a simple PRTSTR,$0D since the $0D is compiled as a 16-bit word which is in fact the string $0D,00.
Here's the main demo plus a few other bits as it is compiled in PNut. It looks a lot cleaner without the relative branch calculations.
dat
orgh
Tachyon
word PRTSTR
byte "Tachyon Forth R16 for the P2 V0.1 151008 ",0
word CR,_8,STARS ' simple subroutine call
word CR,_WORD,$21,_WORD,$5D,FOR,DUP,EMIT,INC,forNEXT,DROP ' display ASCII using FOR NEXT
word CR,_WORD,$21
lbl0 word DUP,EMIT,INC,DUP,_WORD,$7E,_EQ,_UNTIL,lbl0,DROP ' display ASCII using BEGIN UNTIL
word CR,_LONG
long $CAFEF00D ' test Forth versions of number printing
word DUP,PRTHEX,SPACE,DUP,PRTBYTE,SPACE
word DUP,PRTWORD,SPACE,PRTLONG,SPACE
word _GETCNT,_WORD,10000 ' time a simple 10000 FOR $1234 DROP NEXT loop
word FOR,_WORD,$1234,DROP,forNEXT
word _GETCNT,SWAP,MINUS,SPACE,PRINT ' display results
word _GETCNT,_NOP,_GETCNT,SWAP,MINUS,SPACE,PRINT ' display timing for a NOP (plus GETCNTs)
word _GETCNT,_GETCNT,SWAP,MINUS,SPACE,PRINT ' display timing for GETCNTs themselves
word SLOWIO,SUBIO,FASTIO ' toggle some pins with various methods
word _0,_WORD,$100,HDUMP ' dump some hub memory
word CR,GOTO,Tachyon
CR call #COLON
word PRTSTR,$0D,PRTSTR,$0A,EXIT
STARS call #COLON
word FOR,PRTSTR,"*",forNEXT,EXIT
FASTIO mov X,#1
or DIRA,X
mov R2,#16
FIOLP or OUTA,X
andn OUTA,X
djnz R2,@FIOLP
ret
At present I am converting and integrating a lot of the bytecode from Tachyon P1 after which I will be able to switch on full console interaction. Not long now!
Now that I have Tachyon up and running for the P2 I need to go back into the kernel and optimize some stuff. I am already using the LUT for the four stacks as well as maintaining top of stack copies in cog ram. The kernel is much more simplified compared to the P1 in that all code is compiled as an address and all addresses are assumed to be code, whether in cog, lut, or hub. So all high-level code does a "call #COLON" then has 16-bit sub-routine threaded word-code addresses to "interpret" which is surprisingly compact. Compiling these word-codes as a full CALL takes twice as much memory but does not seem to run much faster.
The code space is only 64k at the moment but at least all of it is code space and that's a lot of code space. I would only need to align code entry on longs to increase this range up to 256k. The dictionary now has the luxury of being able to expand without having to resort to special compacting techniques as was done for P1 using the EEPROM. Also the dictionary is a bit simpler in that the header does not need to know what type of code it is anymore, it just has a 16-bit pointer and no vector table of course.
After I test Tachyon a bit more I will write an inline PASM style assembler that can read from SD so I can test out even more and expand the networking capabilities too.
I'd like to integrate the serial transmit with the receive all in one cog rather than transmit from the main cog so has anyone got any tips in regards to writing a full-duplex driver using interrupts? Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.
Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.
I picked up a tube of W25Q80s a couple of years ago in anticipation back then for P2 but I can't seem to find anything about SPI Flash booting which I guess is probably not implemented as of yet.
It's not. Likely next stage of the design. For that, we need booter, crypt, etc... Gotta settle the basics first. Right now, it's all small as it goes through various revision to settle bugs in the instructions, form, addressing, etc... Chip has rewritten the basic support and sample programs, what 4 times?
I suspect once the bugs end, and we are all feeling happy, the next stage will kick off.
Forth is in WAY early. Nice job guys! (talking about Pfth too)
Thanks, it'll be nice when I can boot from SPI as then I have a real chance to test out the filesystem and networking for real but until then I still have testing and optimizing to do. When I'm happy enough with it I will post the source, just as a regular text document though via my dropbox links.
This bit of a screen grab dumps the temporary compilation area and so shows the 16-bit words that are compiled for that same line, then a couple of stack manipulations etc. The 0014 at the end of the dump is the EXIT that is appended automatically (plus one extra at the moment)
Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151015.0000
----------------------------------------------------------------
%11010011 $1AF0 1234 HERE $20 DUMPW
1C00: 0087 00D3 0087 1AF0 0087 04D2 1207 0087 ................
1C10: 0020 1164 0014 0014 0000 0000 0000 0000 .d.............
ok
.S [0000.04D2 0000.1AF0 0000.00D3 0000.0003] ok
SWAP OVER + .S [0000.1FC2 0000.04D2 0000.00D3 0000.0003] ok
ROT .S [0000.00D3 0000.1FC2 0000.04D2 0000.0003] ok
In developing this kernel further which btw is working very nicely I'd like to optimize some cog code but I can't seem to find an easy way to do cog indexed or indirect. With the LUT we have RDLUT and WRLUT where we can specify an address register, same as we do for RDLONG/WRLONG etc. Now SETS and SETD still require two instructions or nops between the set and the instruction. Then there is ALTDS which is badly in need of some macros but looks like it could do the job although in a somewhat unwieldly fashion. It's also a pity that SETB/CLRB doesn't span addresses for bits>31 as that would be very useful for a whole number of things including accessing the ports as one.
So, have I missed something? Is there an easy way to address cog memory indirectly?
BTW, if someone wants to play with Tachyon "as is" just let me know as I don't need "bug" reports yet
Here's a little test loop that flashes the leds on the DE2 with pattens from memory and if key2 is pressed then it runs faster and exits if key3 is pressed
0 BEGIN DUP W@ $FFFF PBCLR PBSET 30 PIN@ IF #100 ELSE #20 THEN ms 2 + 31 PIN@ 0= UNTIL
ALTDS D,S/# 'modify D according to bits in S and possibly replace next instruction's CCCCOOOOOOOCZI / DDDDDDDDD / SSSSSSSSS fields.
In ALTDS, S provides the following pattern: %RRR_DDD_SSS
%RRR: (101 allows instruction substitution)
000 = don't affect D's CCCCOOOOOOOOOCZI field
001 = don't affect D's CCCCOOOOOOOOOCZI field, cancel write for next instruction
010 = decrement D's OOOOOOOCZ field
011 = increment D's OOOOOOOCZ field
100 = use D's OOOOOOOCZ field as the result register for the next instruction (separate from D)
101 = use D's CCCCOOOOOOOCZI field as next instruction's CCCCOOOOOOOCZI field
110 = use D's OOOOOOOCZ field as the result register for the next instruction, decrement D's OOOOOOOCZ field
111 = use D's OOOOOOOCZ field as the result register for the next instruction, increment D's OOOOOOOCZ field
%DDD
000 = don't affect D's DDDDDDDDD field
001 = copy D's SSSSSSSSS field into its DDDDDDDDD field
010 = decrement D's DDDDDDDDD field
011 = increment D's DDDDDDDDD field
100 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction
101 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, copy D's SSSSSSSSS field into its DDDDDDDDD field
110 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, decrement D's DDDDDDDDD field
111 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, increment D's DDDDDDDDD field
%SSS
000 = don't affect D's SSSSSSSSS field
001 = copy D's DDDDDDDDD field into its SSSSSSSSS field
010 = decrement D's SSSSSSSSS field
011 = increment D's SSSSSSSSS field
100 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction
101 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, copy D's DDDDDDDDD field into its SSSSSSSSS field
110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field
I was hoping for a simpler single "instruction" rather than an ALTDS instruction seeing we already have the RD/WR type of instructions for lut and hub.. Also does it actually modify the next instruction in memory or just in the pipeline which is what I think it must do?
Would help to if actual examples were given, it doesn't look clean and simple at all.
Yes, it just modifies the ALU input in the pipeline.
Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.
PTR addressing modes only work with hub read/write otherwise they are just a register like any other so that a MOV will read or write to the PTR register rather than use it indirectly. No, what's needed is a simple RDCOG and WRCOG instruction, surely that's not hard to implement and works in the same manner as RDLUT an WRLUT.
The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how this was intended to be used as immediately it looks kludgey when you try to use it. At least proper macros could hide most of this kludge but most of the time I just need a simple RDCOG/WRCOG.
That's not correct though as MOV moves from register or immediate to register whereas a RDCOG would use the source indirectly just as RDLUT and also as RDLONG does etc. So if the source pointed to a register "myio" which held an address that pointed to INB then RDCOG X,myio would be equivalent to MOV X,INB.
@Chip: How hard would it be to add RDCOG and WRCOG in the same manner?
It is the indirection, indexing that takes the time. It can happen in hubexec because the instruction comes from a different place. In cogex, it's all timing bound.
Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.
The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.
Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.
I miss those pointers too.
There is actually a discussion on this in the older design thread.
Current word list for Tachyon as I continue testing etc. I've looked at what I need for pin I/O and so I've put in PINSET/PINCLR/PININP/PIN@ which accepts bits 0 to 63 to span both ports plus I have PASET/PACLR PBSET/PBCLR to write a 32-bit mask to the ports as well as sets the direction to output. I just need to clean up some more bugs and I should be able to release this source if anyone is interested.
BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles.
pub fibo ( n -- f ) 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP ;
46 LAP fibo LAP .LAP SPACE . 8672 1836311903 ok
With a OVER + SWAP made into a single operation:
46 LAP fibo LAP .LAP 3248 ok
So if the P2 were to run at 160MHz that would take 20.3us for a fibo(46)
Since I now have deep and efficient stacks I could also try out the recursive algorithm later.
This simple one-liner is all I need to test launching tasks in cogs. It just flashes the DE2's leds at different rates so that 10 leds are blinking and my main cog is still talking to the console while this is happening. So that means there must be 11 cogs available on the DE2?
: LEDTASK COGID #32 + BEGIN DUP HIGH DUP 3 SHL ms DUP LOW DUP 3 SHL ms AGAIN ;
16 1 DO ' LEDTASK I TASK ! LOOP
BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.
In-between expanding Tachyon I am also revisiting functional code and seeing what I can do to optimize it. So it is with the FILL and ERASE function which was relying on high-level wordcode calling a CMOVE function. This works but was taking 2114016 cycles (13.2ms @160MHz) for 64k so I looked at trying to do this with SETQ and WRBYTE but it doesn't seem suited to that. So instead I tried the WRFAST/REP/WFBYTE sequence which seems to have dropped the fill to a much more respectable 2 cycles/byte so that 64k completes in 131200 cycles (820us @160MHz).
To improve that I could work out if I can use words or longs in which case it will be much faster again at 205us for 64k fill (edit, which I have just tried with LFILL and confirmed)
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
' wrfast rep code -> $6.0000 $1.0000 LAP ERASE LAP DECIMAL .LAP 131200 ok (2 clks/byte) = 820us/64K
' ( addr cnt -- )
ERASE call #PUSHACC
' ( addr cnt fillch -- )
FILL wrfast tos1,tos2
rep @.L0,tos1
wfbyte tos
.L0 jmp #DROP3
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
The code you have looks good.
Do you mean around the 32/16/8 bit decision ?
Do you want this run time or compile time choice ?
If speed matters over size, you could cover most of the fill using longs for speed, and I think Chip still allows any alignment, so wr longs can start on any address (may have a start-cost ?) so you do longs, then check if an addendum is needed. As that is only 0.1.2.3, a byte fill is probably ok.
Yes, I did try a long fill and it works well but byte/word/long are normally decided at runtime due to the wide range of parameters that will be thrown at it.
The wfbyte/wfword/wflong are great candidates though for a simple SETS as the source field of the instruction controls the word size. So I will have a look at this but I don't want to make the runtime word too complicated either in trying to divide the operation up into mixed operations. Anyway I can use discrete FILL/WFILL/LFILL words but that's a waste or I could modify the source field each time perhaps. It's fun playing with the code at this level too.
Comments
Are there hub $200..$3FF anymore? With HUB moved to $400, I thought there wasn't any address overlap in the latest memory map.
What about a case where the LUT is all already allocated ?
Using LUT as stack has obvious merits, but making it a hidden by-pc choice seems likely to surprise. Code that is moved from HUB to LUT changes more than expected.
When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
A change in destination area is quite significant, not something that should be buried ?
ie I think these need explicit opcodes, not a hidden operational flip-flop.
If you add PTRA/B variants of RDLUT/WRLUT (and here), then you can get the equivalent push/pop functionality.
As for cog ram, I don't know that much can be done about that. On the other hand, push/pop is not as important there since the stack is already located in directly addressable registers.
Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.
It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRX++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on the PTRx range being within lut, or not. That would be pretty easy.
We would have these discrete instructions which would use hub or lut, based on the PTRx address:
PUSHA D/#
PUSHB D/#
POPA D
POPB D
CALLA D/#/@
CALLB D/#/@
RETA
RETB
Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
WOW! Love the idea.
I know I keep pushing, but really if the instructions were always longs we would only require 18-bits of addressing, not 20-bits. This would free up 2-bits in those large instructions.
At least I need to understand some of the callx/retx/pushx/popx instructions because you caught me unawares about the pushx/popx being wrlong/rdlong instructions acting on the hub.
May I suggest that we get a new FPGA release out first, and that we continue this discussion in a new separate thread rather than here on Peter's thread. I think there are huge advantages in being able to use the LUT for the stack.
- if the RET is now in a different memory space, the Pointer has moved....
The idea of optional LUT as Stack is good but I think it needs explicit instructions.
When this breaks, it will be very hard to find.
In Tachyon P1 when we went to emit a single character it always required 4 bytes of code either as _BYTE,$0D,XCALL,xEMIT or XCALL,xPRTSTR,$0D,00 and it looked like we needed 6 bytes to do the same in P2 but this has dropped back down to 4 bytes with a simple PRTSTR,$0D since the $0D is compiled as a 16-bit word which is in fact the string $0D,00.
Here's the main demo plus a few other bits as it is compiled in PNut. It looks a lot cleaner without the relative branch calculations.
At present I am converting and integrating a lot of the bytecode from Tachyon P1 after which I will be able to switch on full console interaction. Not long now!
The code space is only 64k at the moment but at least all of it is code space and that's a lot of code space. I would only need to align code entry on longs to increase this range up to 256k. The dictionary now has the luxury of being able to expand without having to resort to special compacting techniques as was done for P1 using the EEPROM. Also the dictionary is a bit simpler in that the header does not need to know what type of code it is anymore, it just has a 16-bit pointer and no vector table of course.
After I test Tachyon a bit more I will write an inline PASM style assembler that can read from SD so I can test out even more and expand the networking capabilities too.
I'd like to integrate the serial transmit with the receive all in one cog rather than transmit from the main cog so has anyone got any tips in regards to writing a full-duplex driver using interrupts? Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
So you can start with that.
Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.
I suspect once the bugs end, and we are all feeling happy, the next stage will kick off.
Forth is in WAY early. Nice job guys! (talking about Pfth too)
This bit of a screen grab dumps the temporary compilation area and so shows the 16-bit words that are compiled for that same line, then a couple of stack manipulations etc. The 0014 at the end of the dump is the EXIT that is appended automatically (plus one extra at the moment)
So, have I missed something? Is there an easy way to address cog memory indirectly?
BTW, if someone wants to play with Tachyon "as is" just let me know as I don't need "bug" reports yet
Here's a little test loop that flashes the leds on the DE2 with pattens from memory and if key2 is pressed then it runs faster and exits if key3 is pressed
Have a look at the ALTDS instruction This should to the trick
Would help to if actual examples were given, it doesn't look clean and simple at all.
Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.
The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how this was intended to be used as immediately it looks kludgey when you try to use it. At least proper macros could hide most of this kludge but most of the time I just need a simple RDCOG/WRCOG.
The usual way is to maintain a pointer, add, setd, sets, etc...
Altds compresses this for some additional speed, in that it can work directly with the target instruction data.
We don't have the advanced pipeline to work otherwise. Cog ops of this kind take a couple instructions as a result.
That is the why, in any case. Maybe Chip has a better answer in the wings.
@Chip: How hard would it be to add RDCOG and WRCOG in the same manner?
Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.
The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.
Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.
I miss those pointers too.
There is actually a discussion on this in the older design thread.
BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles. With a OVER + SWAP made into a single operation: So if the P2 were to run at 160MHz that would take 20.3us for a fibo(46)
Since I now have deep and efficient stacks I could also try out the recursive algorithm later.
BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.
I thought the DE2 was good for 12 cogs.
To improve that I could work out if I can use words or longs in which case it will be much faster again at 205us for 64k fill (edit, which I have just tried with LFILL and confirmed)
Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
Do you mean around the 32/16/8 bit decision ?
Do you want this run time or compile time choice ?
If speed matters over size, you could cover most of the fill using longs for speed, and I think Chip still allows any alignment, so wr longs can start on any address (may have a start-cost ?) so you do longs, then check if an addendum is needed. As that is only 0.1.2.3, a byte fill is probably ok.
The wfbyte/wfword/wflong are great candidates though for a simple SETS as the source field of the instruction controls the word size. So I will have a look at this but I don't want to make the runtime word too complicated either in trying to divide the operation up into mixed operations. Anyway I can use discrete FILL/WFILL/LFILL words but that's a waste or I could modify the source field each time perhaps. It's fun playing with the code at this level too.