Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt! - Page 2 — Parallax Forums

Tachyon Forth for P2 -FAT32+WIZnet- Now Smartpins - wOOt!


Comments

  • cgracey wrote: »
    PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

    Are there hub $200..$3FF anymore? With HUB moved to $400, I thought there wasn't any address overlap in the latest memory map.

  • jmgjmg Posts: 15,175
    cgracey wrote: »
    Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub.
    I'm not quite following, are you talking about moving the stack by PC address ?
    What about a case where the LUT is all already allocated ?
    Using LUT as stack has obvious merits, but making it a hidden by-pc choice seems likely to surprise. Code that is moved from HUB to LUT changes more than expected.
  • cgraceycgracey Posts: 14,222
    edited 2015-10-07 18:56
    This is all it is, and it has nothing to do with hub exec:

    When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
  • jmgjmg Posts: 15,175
    cgracey wrote: »
    This is all it is, and it has nothing to do with hub exec:

    When CALLx/RETx execute and PTRx = %00000000001xxxxxxxxx or $200..$3FF, the lut is used for stack data, instead of the hub. And PTRx changes by +-1, not +-4. In these cases, CALLx is just two clocks and RETx is three, which are both a lot faster than hub accesses and are deterministic, as well.
    I'm still not seeing how that protects the LUT from accidental clobbering ?
    A change in destination area is quite significant, not something that should be buried ?
    ie I think these need explicit opcodes, not a hidden operational flip-flop.

  • cgracey wrote: »
    evanh wrote: »
    Chip,
    Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

    Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

    Are there any other sleeper purposes for address sensitivity?

    P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
    Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

  • cgracey wrote: »
    evanh wrote: »
    Chip,
    Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

    Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

    Are there any other sleeper purposes for address sensitivity?

    P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.

    If you add PTRA/B variants of RDLUT/WRLUT (and here), then you can get the equivalent push/pop functionality.

    As for cog ram, I don't know that much can be done about that. On the other hand, push/pop is not as important there since the stack is already located in directly addressable registers.
  • cgraceycgracey Posts: 14,222
    edited 2015-10-07 19:55
    David Betz wrote: »
    cgracey wrote: »
    evanh wrote: »
    Chip,
    Another stacking variant that might be more palatable to all is for the CALLA/B, PUSHA/B instructions being able to stack to CogRAM and/or LUTRAM. Would that be feasible? I guess that cuts off a chunk of Hub addresses though.

    Wow! Make CALLA/CALLB/RETA/RETB address sensitive, just like cog/hub instruction fetching is, so that addresses $200..$3FF use the lut, instead of the hub. I really like that, because it means we don't need a whole extra set of instructions and pointers to use lut for stack. Excellent!

    Are there any other sleeper purposes for address sensitivity?

    P.S. This scheme would work only for CALLA/CALLB/RETA/RETB, since they are dedicated instructions that could easily be redirected. PUSHA/PUSHB/POPA/POPB are actually cases of WRLONG/RDLONG, though, and to redirect them, as well, would mean that hub $200..$3FF would not be reachable for R/W. So, this would only work for calls and returns.
    Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?

    Yes, because PUSHA/PUSHB/POPA/POPB are actually WRLONG/RDLONG, only CALLA/CALLB/RETA/RETB could have this lut functionality. Maybe it's too complicated, in that sense, as it would make people suppose that PUSHx and POPx would work, too. To do a pop without caring about the data, you'd just 'SUB PTRx,#1', while a push would be 'WRLUT data,PTRx' plus 'ADD PTRx,#1'.

    It would be easy to support PTRx expressions for RDLUT/WRLUT, where only the lower 9 bits are used for lut address. So, a pop from lut would be 'RDLUT data,--PTRx' and a push would be 'WRLUT data,PTRX++'. If we made discrete PUSHx/POPx instructions which weren't just aliases for WRLONG/RDLONG, we could handle this better by invoking either WRLONG/RDLONG or WRLUT/RDLUT, depending on the PTRx range being within lut, or not. That would be pretty easy.

    We would have these discrete instructions which would use hub or lut, based on the PTRx address:

    PUSHA D/#
    PUSHB D/#
    POPA D
    POPB D
    CALLA D/#/@
    CALLB D/#/@
    RETA
    RETB

    Meanwhile RDBYTE/RDWORD/RDLONG and WRBYTE/WRWORD/WRLONG would always access hub, only.
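The mismatch David Betz raises, and the discrete-instruction fix proposed above, can be sketched as a toy Python model. This is only an illustration under stated assumptions: the function names, the sparse `hub` dictionary, the value 0 for unwritten hub longs, and the choice to test the pre-decremented PTRA for the lut range are all hypothetical, not the actual P2 logic.

```python
# Toy model: CALLA redirected to lut when PTRA is in $200..$3FF,
# compared with POPA as a plain RDLONG alias and as a discrete instruction.
# Hypothetical sketch only; none of this is the real P2 implementation.
lut = [0] * 512   # lut longs, indexed by the low 9 bits of PTRA
hub = {}          # sparse hub memory: byte address -> long


def calla(ptra, return_addr):
    """Proposed CALLA: lut write and +1 when PTRA is in $200..$3FF, else hub and +4."""
    if 0x200 <= ptra <= 0x3FF:
        lut[ptra & 0x1FF] = return_addr
        return ptra + 1
    hub[ptra] = return_addr
    return ptra + 4


def popa_as_rdlong(ptra):
    """POPA as an alias of RDLONG: always hub, always -4."""
    ptra -= 4
    return hub.get(ptra, 0), ptra   # unwritten hub reads as 0 (assumption)


def popa_discrete(ptra):
    """Discrete POPA: chooses lut or hub from the PTRA range (assumed rule)."""
    if 0x200 <= ptra - 1 <= 0x3FF:
        ptra -= 1
        return lut[ptra & 0x1FF], ptra
    ptra -= 4
    return hub.get(ptra, 0), ptra


ptra = calla(0x200, 0x1234)        # return address lands in lut, PTRA becomes $201
wrong, _ = popa_as_rdlong(ptra)    # reads hub $1FD, not the lut entry
right, _ = popa_discrete(ptra)     # reads the lut entry written by CALLA
print(hex(wrong), hex(right))
```

The RDLONG-style pop steps PTRA by 4 and looks in hub, so it never sees the value CALLA stored in lut; the range-aware pop does.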
  • cgraceycgracey Posts: 14,222
    I think adding PTRx expressions to RDLUT/WRLUT is a no-brainer, in any case.
  • Cluso99Cluso99 Posts: 18,069
    Chip,
    WOW! Love the idea.

    I know I keep pushing, but really if the instructions were always longs we would only require 18-bits of addressing, not 20-bits. This would free up 2-bits in those large instructions.

    At least I need to understand some of the callx/retx/pushx/popx instructions because you caught me unawares about the pushx/popx being wrlong/rdlong instructions acting on the hub.

    May I suggest that we get a new FPGA release out first, and that we continue this discussion in a new separate thread rather than here on Peter's thread. I think there are huge advantages in being able to use the LUT for the stack.
  • jmgjmg Posts: 15,175
    edited 2015-10-07 20:24
    David Betz wrote: »
    Umm... Doesn't that mean that if I have PTRA pointing into LUT and do a CALLA to some function and that function does a POPA to get its return address, it will fetch the wrong value?
    On top of that, there is the common ASM trick of CALL, then a table jump, then RET
    - if the RET is now in a different memory space, the Pointer has moved....
    The idea of optional LUT as Stack is good but I think it needs explicit instructions.
    When this breaks, it will be very hard to find.

  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-08 00:54
    I've been making some really good progress on this subroutine-threaded version; it is looking really schmick now. Also, rather than performing relative branching with IF ELSE WHILE UNTIL AGAIN etc., these words are now combined internally into just two definitions that read an absolute branch address. So ELSE and AGAIN are unconditional jumps whereas IF WHILE UNTIL are conditional jumps. Internally ELSE and AGAIN are just referred to as GOTO :)

    In Tachyon P1, emitting a single character always required 4 bytes of code, either as _BYTE,$0D,XCALL,xEMIT or XCALL,xPRTSTR,$0D,00. It looked like we needed 6 bytes to do the same in P2, but this has dropped back down to 4 bytes with a simple PRTSTR,$0D, since the $0D is compiled as a 16-bit word which is in fact the string $0D,00.
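Why a 16-bit $0D doubles as a one-character string can be checked with a short Python sketch. It assumes little-endian storage of the word in hub memory; the variable names are illustrative only.

```python
import struct

# The word following PRTSTR is the 16-bit value $000D. Stored little-endian
# (assumed), it is the byte pair $0D,$00: a CR followed by a zero terminator.
word = struct.pack("<H", 0x000D)

# Reading it as a zero-terminated string yields just the CR character.
text = word.split(b"\x00", 1)[0]

# Two bytes for the PRTSTR word-code plus two for the string word: 4 in total.
total_bytes = 2 + len(word)
print(word, text, total_bytes)
```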

    Here's the main demo plus a few other bits as it is compiled in PNut. It looks a lot cleaner without the relative branch calculations.
    dat
     		orgh
    Tachyon
           	word	PRTSTR
           	byte	"Tachyon Forth R16 for the P2 V0.1 151008 ",0
    	word	CR,_8,STARS                                          		' simple subroutine call
    	word	CR,_WORD,$21,_WORD,$5D,FOR,DUP,EMIT,INC,forNEXT,DROP		' display ASCII using FOR NEXT
    	word	CR,_WORD,$21
    lbl0	word	DUP,EMIT,INC,DUP,_WORD,$7E,_EQ,_UNTIL,lbl0,DROP			' display ASCII using BEGIN UNTIL
     	word	CR,_LONG
    	long	$CAFEF00D							' test Forth versions of number printing
    	word	DUP,PRTHEX,SPACE,DUP,PRTBYTE,SPACE
    	word	DUP,PRTWORD,SPACE,PRTLONG,SPACE
    	word	_GETCNT,_WORD,10000						' time a simple 10000 FOR $1234 DROP NEXT loop
    	word	FOR,_WORD,$1234,DROP,forNEXT
    	word	_GETCNT,SWAP,MINUS,SPACE,PRINT					' display results
     	word	_GETCNT,_NOP,_GETCNT,SWAP,MINUS,SPACE,PRINT 			' display timing for a NOP (plus GETCNTs)
     	word	_GETCNT,_GETCNT,SWAP,MINUS,SPACE,PRINT				' display timing for GETCNTs themselves
    	word	SLOWIO,SUBIO,FASTIO						' toggle some pins with various methods
    	word	_0,_WORD,$100,HDUMP						' dump some hub  memory
     	word	CR,GOTO,Tachyon
    
    CR	call	#COLON
    	word	PRTSTR,$0D,PRTSTR,$0A,EXIT
    
    STARS	call	#COLON
    	word	FOR,PRTSTR,"*",forNEXT,EXIT
    
    FASTIO          mov	X,#1
    		or	DIRA,X
          		mov	R2,#16
    FIOLP		or	OUTA,X
    		andn	OUTA,X
    		djnz	R2,@FIOLP
    		ret
    

    At present I am converting and integrating a lot of the bytecode from Tachyon P1 after which I will be able to switch on full console interaction. Not long now!
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-16 02:35
    Now that I have Tachyon up and running for the P2 I need to go back into the kernel and optimize some stuff. I am already using the LUT for the four stacks as well as maintaining top-of-stack copies in cog ram. The kernel is much simpler than the P1's in that all code is compiled as an address and all addresses are assumed to be code, whether in cog, lut, or hub. So all high-level code does a "call #COLON" followed by 16-bit subroutine-threaded word-code addresses to "interpret", which is surprisingly compact. Compiling these word-codes as a full CALL takes twice as much memory but does not seem to run much faster.

    The code space is only 64k at the moment but at least all of it is code space and that's a lot of code space. I would only need to align code entry on longs to increase this range up to 256k. The dictionary now has the luxury of being able to expand without having to resort to special compacting techniques as was done for P1 using the EEPROM. Also the dictionary is a bit simpler in that the header does not need to know what type of code it is anymore, it just has a 16-bit pointer and no vector table of course.

    After I test Tachyon a bit more I will write an inline PASM style assembler that can read from SD so I can test out even more and expand the networking capabilities too.

    I'd like to integrate the serial transmit with the receive all in one cog rather than transmit from the main cog, so has anyone got any tips on writing a full-duplex driver using interrupts? Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?






  • jmgjmg Posts: 15,175
    Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
    FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
    So you can start with that.
    Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.

  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-16 03:37
    jmg wrote: »
    Is there a standard SPI Flash chip to use that we can use with the FPGAs so that we can reboot it like a regular setup?
    FWIR, pretty much all SPI flash parts have a common-subset of 1-bit mode, not-fast, and 24 bit address.
    So you can start with that.
    Most also do QuadSPI, and some have QuadSPI DDR/Dual edge, and some have a lock-mode where you can load that [Cmd+24b] address in Quad-mode.
    I picked up a tube of W25Q80s a couple of years ago in anticipation back then for P2 but I can't seem to find anything about SPI Flash booting which I guess is probably not implemented as of yet.

  • potatoheadpotatohead Posts: 10,261
    edited 2015-10-16 03:42
    It's not. Likely the next stage of the design. For that, we need booter, crypt, etc... Gotta settle the basics first. Right now, it's all small as it goes through various revisions to settle bugs in the instructions, form, addressing, etc... Chip has rewritten the basic support and sample programs, what, 4 times?

    I suspect once the bugs end, and we are all feeling happy, the next stage will kick off.

    Forth is in WAY early. Nice job guys! (talking about Pfth too)



  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-16 03:59
    Thanks, it'll be nice when I can boot from SPI as then I have a real chance to test out the filesystem and networking for real but until then I still have testing and optimizing to do. When I'm happy enough with it I will post the source, just as a regular text document though via my dropbox links.

    This bit of a screen grab dumps the temporary compilation area and so shows the 16-bit words that are compiled for that same line, then a couple of stack manipulations etc. The 0014 at the end of the dump is the EXIT that is appended automatically (plus one extra at the moment)
    Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151015.0000
    ----------------------------------------------------------------
    %11010011 $1AF0 1234 HERE $20 DUMPW
    1C00: 0087 00D3 0087 1AF0 0087 04D2 1207 0087    ................
    1C10: 0020 1164 0014 0014 0000 0000 0000 0000     .d.............
     ok
    .S  [0000.04D2 0000.1AF0 0000.00D3 0000.0003]  ok
    SWAP OVER + .S  [0000.1FC2 0000.04D2 0000.00D3 0000.0003]  ok
    ROT .S  [0000.00D3 0000.1FC2 0000.04D2 0000.0003]  ok
    
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-17 05:23
    In developing this kernel further (which, btw, is working very nicely) I'd like to optimize some cog code, but I can't seem to find an easy way to do cog indexed or indirect addressing. With the LUT we have RDLUT and WRLUT where we can specify an address register, same as we do for RDLONG/WRLONG etc. Now SETS and SETD still require two instructions or nops between the set and the instruction. Then there is ALTDS, which is badly in need of some macros but looks like it could do the job, although in a somewhat unwieldy fashion. It's also a pity that SETB/CLRB doesn't span addresses for bits>31, as that would be very useful for a whole number of things including accessing the ports as one.

    So, have I missed something? Is there an easy way to address cog memory indirectly?

    BTW, if someone wants to play with Tachyon "as is" just let me know as I don't need "bug" reports yet :)

    Here's a little test loop that flashes the leds on the DE2 with patterns from memory; if key2 is pressed it runs faster, and it exits if key3 is pressed
    0 BEGIN DUP W@ $FFFF PBCLR PBSET 30 PIN@ IF #100 ELSE #20 THEN ms 2 + 31 PIN@ 0= UNTIL
    


  • Peter
    Have a look at the ALTDS instruction
     ALTDS D,S/# 'modify D according to bits in S and possibly replace next instruction's CCCCOOOOOOOCZI / DDDDDDDDD / SSSSSSSSS fields.
    
     In ALTDS, S provides the following pattern: %RRR_DDD_SSS
    
     %RRR: (101 allows instruction substitution)
     000 = don't affect D's CCCCOOOOOOOCZI field
     001 = don't affect D's CCCCOOOOOOOCZI field, cancel write for next instruction
     010 = decrement D's OOOOOOOCZ field
     011 = increment D's OOOOOOOCZ field
     100 = use D's OOOOOOOCZ field as the result register for the next instruction (separate from D)
     101 = use D's CCCCOOOOOOOCZI field as next instruction's CCCCOOOOOOOCZI field
     110 = use D's OOOOOOOCZ field as the result register for the next instruction, decrement D's OOOOOOOCZ field
     111 = use D's OOOOOOOCZ field as the result register for the next instruction, increment D's OOOOOOOCZ field
    
     %DDD
     000 = don't affect D's DDDDDDDDD field
     001 = copy D's SSSSSSSSS field into its DDDDDDDDD field
     010 = decrement D's DDDDDDDDD field
     011 = increment D's DDDDDDDDD field
     100 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction
     101 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, copy D's SSSSSSSSS field into its DDDDDDDDD field
     110 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, decrement D's DDDDDDDDD field
     111 = use D's DDDDDDDDD field as the DDDDDDDDD field for the next instruction, increment D's DDDDDDDDD field
    
     %SSS
     000 = don't affect D's SSSSSSSSS field
     001 = copy D's DDDDDDDDD field into its SSSSSSSSS field
     010 = decrement D's SSSSSSSSS field
     011 = increment D's SSSSSSSSS field
     100 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction
     101 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, copy D's DDDDDDDDD field into its SSSSSSSSS field
     110 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, decrement D's SSSSSSSSS field
     111 = use D's SSSSSSSSS field as the SSSSSSSSS field for the next instruction, increment D's SSSSSSSSS field
    
    This should do the trick
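To make the %RRR_DDD_SSS pattern easier to read, here is a minimal Python sketch that splits the 9-bit S value into its three fields and reports whether each field is substituted into the next instruction. The function names are made up for illustration, and the sketch only mirrors the table above; it does not model the instruction itself.

```python
def decode_altds(pattern):
    """Split a 9-bit ALTDS S pattern %RRR_DDD_SSS into (rrr, ddd, sss)."""
    if not 0 <= pattern <= 0x1FF:
        raise ValueError("pattern must fit in 9 bits")
    rrr = (pattern >> 6) & 0b111
    ddd = (pattern >> 3) & 0b111
    sss = pattern & 0b111
    return rrr, ddd, sss


def substitutes_next(field):
    """In the table, codes %100..%111 feed the field to the next instruction."""
    return field >= 0b100


# %000_100_000: leave R and S alone, use D's DDDDDDDDD field as the
# destination of the next instruction (an indirect destination).
rrr, ddd, sss = decode_altds(0b000_100_000)
print(rrr, ddd, sss, substitutes_next(ddd))
```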

  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-17 09:43
    I was hoping for a simpler single "instruction" rather than an ALTDS instruction, seeing we already have the RD/WR type of instructions for lut and hub. Also, does it actually modify the next instruction in memory or just in the pipeline, which is what I think it must do?

    It would help too if actual examples were given; it doesn't look clean and simple at all.
  • Cluso99Cluso99 Posts: 18,069
    Yes, it just modifies the ALU input in the pipeline.
    Perhaps the pointers PTRA & PTRB might help you. Not sure how they actually work in a normal instruction.
  • PTR addressing modes only work with hub read/write; otherwise they are just a register like any other, so a MOV will read or write the PTR register itself rather than use it indirectly. No, what's needed is a simple RDCOG and WRCOG instruction; surely that's not hard to implement, and it would work in the same manner as RDLUT and WRLUT.

    The ALTDS is a rather versatile but nonetheless kludgey instruction, not at all in keeping with the rest of the P2 instruction set. As I mentioned before, where are the examples of how this was intended to be used? It immediately looks kludgey when you try to use it. At least proper macros could hide most of this kludge, but most of the time I just need a simple RDCOG/WRCOG.
  • potatoheadpotatohead Posts: 10,261
    edited 2015-10-18 02:08
    Wrcog is mov. This design has the same limitation p1 does in that the pipeline requires modification to get indexing and indirection done.

    The usual way is to maintain a pointer, add, setd, sets, etc...

    Altds compresses this for some additional speed, in that it can work directly with the target instruction data.

    We don't have the advanced pipeline to work otherwise. Cog ops of this kind take a couple instructions as a result.

    That is the why, in any case. Maybe Chip has a better answer in the wings.

  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-18 02:10
    That's not correct though as MOV moves from register or immediate to register whereas a RDCOG would use the source indirectly just as RDLUT and also as RDLONG does etc. So if the source pointed to a register "myio" which held an address that pointed to INB then RDCOG X,myio would be equivalent to MOV X,INB.
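The difference between MOV and the proposed RDCOG can be sketched with a toy register file in Python. RDCOG is only a suggested instruction here, and the register addresses below are arbitrary placeholders chosen for the illustration, not real P2 addresses.

```python
# Toy cog register file: 512 longs. Addresses are arbitrary placeholders.
regs = [0] * 512
INB, MYIO, X = 0x100, 0x010, 0x011


def mov(d, s):
    """MOV D,S: copy the contents of register S into register D."""
    regs[d] = regs[s]


def rdcog(d, s):
    """Hypothetical RDCOG D,S: register S holds the address of the register to read."""
    regs[d] = regs[regs[s] & 0x1FF]


regs[INB] = 0xDEADBEEF   # pretend pin-input state
regs[MYIO] = INB         # myio holds the address of INB

rdcog(X, MYIO)           # indirect read through myio
via_rdcog = regs[X]

mov(X, INB)              # direct read
via_mov = regs[X]
print(hex(via_rdcog), hex(via_mov))
```

Both reads return the same value, which is the equivalence described above: RDCOG X,myio behaves like MOV X,INB when myio holds the address of INB.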

    @Chip: How hard would it be to add RDCOG and WRCOG in the same manner?
  • potatoheadpotatohead Posts: 10,261
    edited 2015-10-18 02:21
    It is the indirection, indexing that takes the time. It can happen in hubexec because the instruction comes from a different place. In cogex, it's all timing bound.

    Yes, it is register to register in the cog, which is why wrcog turns into a self modify. On the hot chip, the pipe was deeper, more complex, and the cog ram was ported more too, I believe.

    The pointers for cog use required depth and sophistication absent in this simpler design. I believe the options were to slow down the cog with a delay in the critical path, extend the pipe, or add ports to cog ram, likely a combination.

    Other things got stripped, like waitvid, the pll, and more general purpose, simple things put in place. Hubexec lost cache, and has the fifo, which can also smooth the hub access.

    I miss those pointers too.

    There is actually a discussion on this in the older design thread.

  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-18 08:12
    Current word list for Tachyon as I continue testing etc. I've looked at what I need for pin I/O, so I've put in PINSET/PINCLR/PININP/PIN@, which accept bits 0 to 63 to span both ports, plus PASET/PACLR and PBSET/PBCLR, which write a 32-bit mask to the ports as well as set the direction to output. I just need to clean up some more bugs and I should be able to release this source if anyone is interested.
      Parallax Propeller 2 .:.:--TACHYON--:.:. Forth V10151016.1800
    ----------------------------------------------------------------
      ok
      ok
    WORDS
    1D3E WORDS          1D1A WTAB           1D00 CTYPE          001F DUP
    0022 OVER           0019 DROP           0018 2DROP          0029 SWAP
    002D ROT            001C NIP            0028 BOUNDS         000C STREND
    0175 0              0174 1              0173 2              0172 3
    0171 4              016F 5              016E 6              016D 7
    016C 8              016B 9              0169 0D             0176 ON
    0176 TRUE           0176 -1             0167 BL             0168 16
    0175 FALSE          0175 OFF            0037 1+             0034 1-
    0036 2+             0033 2-             0031 +              0030 -
    00D1 DO             00DB LOOP           00D8 +LOOP          00D2 FOR
    00E3 NEXT           0085 I              003E INVERT         0040 AND
    0042 ANDN           0044 OR             0046 XOR            004C ROL
    004E ROR            0048 SHR            004A SHL            0050 2/
    0052 2*             0054 REV            0056 MASK           005A >N
    005B >B             005C 9BITS          0060 0=             0060 NOT
    005E =              0064 <>             0068 >              006F C@
    0071 W@             0073 @              0075 C+!            0077 C!
    006C C@++           0079 W+!            007B W!             007D +!
    007F !              00F7 UM*            00F3 *              00F1 ABS
    0039 -NEGATE        003A ?NEGATE        003C NEGATE         00A7 PINSET
    00AD PINCLR         00B3 PININP         00BE SHROUT         00BB SHRINP
    1C86 DIRX           1C92 OUTX           1C9E INX            0091 PA@
    0093 PB@            0095 PA!            0097 PB!            0099 DACLR
    009B DBCLR          009D PASET          009E DASET          00A0 PBSET
    00A1 DBSET          00A3 PACLR          00A5 PBCLR          00B5 PIN@
    0004 RESET          0012 0EXIT          0014 EXIT           0016 NOP
    0017 3DROP          001E ?DUP           0024 3RD            0026 4TH
    00C1 CALL           00C5 JUMP           00E0 BRANCH>        00E7 >R
    00EA R>             014E >L             00D6 L>             015D !SP
    0165 !RP            0163 !LP            0161 !BP            1A6D DEPTH
    0133 COG@           0138 COGREG         013A COG!           013F COGID
    00EF @REG           0114 EMIT           097C CR             019F SPACE
    0117 CNT@           1A49 ERROR          114E QD             1156 DUMP
    117A DUMPW          11E0 COGDUMP        110E .BYTE          111E .WORD
    112E .LONG          13B5 .              12B5 @PAD           12C7 HOLD
    12D9 >CHAR          136B #>             1307 <#             131B #
    135B #S             1387 <D>            13F3 PRINT$         1393 LEN$
    13CB U.             1407 .DEC           0918 U/             0908 U/MOD
    0922 */             0066 0<             0067 <              098A U<
    1064 WITHIN         108A ?EXIT          10B4 ERASE          10B8 FILL
    011E CMOVE          109A <CMOVE         10CC ms             0950 HEX
    0944 DECIMAL        0936 BINARY         1603 GETWORD        161F SEARCH
    0C00 NUMBER         1A75 .S             127B HERE           1C7C temp
    1C72 accept         0EB2 names          0ED8 CREATEWORD     0F78 CREATE
    0F8C ALLOT          1038 [COMPILE]      1000 pub            0ECA pri
    0FE0 :              1010 ;              0E5C NFA'           0E76 '
    0D30 IF             0D4C ELSE           0D7E THEN           0D7E ENDIF
    0CC2 BEGIN          0CD4 UNTIL          0D16 AGAIN          0D30 WHILE
    0CEE REPEAT         1B51 \              1B51 ''             1B75 (
    1BA5 {              0016 }              1B99 IFNDEF         1B8B IFDEF
    0DB8 "              0DC8 ."             10E2 (.")           19AE UNKNOWN
    19EC NOTFOUND        ok
    

    BTW, here's a quick iterative Fibonacci test. fibo(46) takes 8672 cycles unoptimized which compares with Tachyon P1 which takes 10,768 cycles.
    pub fibo ( n -- f )         1+ 0 1 ROT FOR OVER + SWAP NEXT NIP ;
    46 LAP fibo LAP .LAP SPACE . 8672 1836311903 ok
    
    With OVER + SWAP made into a single operation:
    46 LAP fibo LAP .LAP 3248 ok
    
    So if the P2 were to run at 160MHz that would take 20.3us for a fibo(46)
    Since I now have deep and efficient stacks I could also try out the recursive algorithm later.
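For readers who don't speak Forth, the stack effect of the fibo word can be mirrored in Python. This is a sketch that assumes FOR ... NEXT runs its body exactly as many times as the count on the stack; the names `second`, `top` and `cycles_to_us` are illustrative.

```python
def fibo(n):
    """Mirror of: 1+ 0 1 ROT FOR OVER + SWAP NEXT NIP"""
    second, top = 0, 1                       # the "0 1" pushed before ROT
    for _ in range(n + 1):                   # "1+" makes the loop count n+1
        second, top = second + top, second   # OVER + SWAP
    return top                               # NIP drops the second item


def cycles_to_us(cycles, mhz):
    """Convert a cycle count to microseconds at the given clock in MHz."""
    return cycles / mhz


print(fibo(46), cycles_to_us(3248, 160))
```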
  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-18 15:52
    This simple one-liner is all I need to test launching tasks in cogs. It just flashes the DE2's leds at different rates so that 10 leds are blinking and my main cog is still talking to the console while this is happening. So that means there must be 11 cogs available on the DE2?
    : LEDTASK   COGID #32 + BEGIN DUP HIGH DUP 3 SHL ms DUP LOW DUP 3 SHL ms AGAIN ;
    16 1 DO ' LEDTASK I TASK ! LOOP
    

    BTW, I noticed that the debug interrupt vectors are mirrored on every 256k boundary rather than at the end of 1M.
    3.FF80 80 DUMPL
    FF80: 0000.0000 0000.0000 0000.0000 0000.0000    ................
    FF90: 0000.0000 0000.0000 0000.0000 0000.0000    ................
    FFA0: 0000.0000 0000.0000 0000.0000 0000.0000    ................
    FFB0: 0000.0000 0000.0000 0000.0000 0000.0000    ................
    FFC0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
    FFD0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
    FFE0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................
    FFF0: FABB.FFFF FABB.FFFF FABB.FFFF FABB.FFFF    ................ ok
    
  • That last one is a big oops! You might want to make a thread for Chip when he gets back from his travels this week.

    I thought the DE2 was good for 12 cogs.





  • Peter JakackiPeter Jakacki Posts: 10,193
    edited 2015-10-19 02:59
    In-between expanding Tachyon I am also revisiting functional code and seeing what I can do to optimize it. So it is with the FILL and ERASE function which was relying on high-level wordcode calling a CMOVE function. This works but was taking 2114016 cycles (13.2ms @160MHz) for 64k so I looked at trying to do this with SETQ and WRBYTE but it doesn't seem suited to that. So instead I tried the WRFAST/REP/WFBYTE sequence which seems to have dropped the fill to a much more respectable 2 cycles/byte so that 64k completes in 131200 cycles (820us @160MHz).

    To improve that I could work out if I can use words or longs in which case it will be much faster again at 205us for 64k fill (edit, which I have just tried with LFILL and confirmed)

    Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
    ' wrfast rep code -> $6.0000 $1.0000 LAP ERASE LAP DECIMAL .LAP 131200 ok (2 clks/byte) =  820us/64K
     ' ( addr cnt -- )
    ERASE	call	#PUSHACC
    ' ( addr cnt fillch -- )
    FILL 	wrfast	tos1,tos2
    	rep	@.L0,tos1
    	wfbyte	tos
    .L0    	jmp	#DROP3
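The quoted figures can be cross-checked with a few lines of Python. The 160 MHz clock is the one assumed in the post; the helper name is illustrative.

```python
CLOCK_HZ = 160_000_000   # clock rate assumed in the post
FILL_BYTES = 0x1_0000    # 64k fill


def to_us(cycles):
    """Convert a cycle count to microseconds at CLOCK_HZ."""
    return cycles * 1_000_000 / CLOCK_HZ


cmove_cycles = 2_114_016   # FILL via high-level word-code calling CMOVE
fast_cycles = 131_200      # WRFAST/REP/WFBYTE version

cmove_us = to_us(cmove_cycles)                # about 13.2 ms
fast_us = to_us(fast_cycles)                  # 820 us
overhead = fast_cycles - 2 * FILL_BYTES       # cycles beyond 2 clocks per byte
speedup = cmove_cycles / fast_cycles
print(cmove_us, fast_us, overhead, round(speedup, 1))
```

So the measurement is consistent with 2 clocks per byte plus a small fixed overhead.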
    

  • jmgjmg Posts: 15,175
    edited 2015-10-19 03:04
    Do any of you P2 PASM gurus have any tips and tricks you'd like to share?
    The code you have looks good.

    Do you mean around the 32/16/8 bit decision ?
    Do you want this run time or compile time choice ?

    If speed matters over size, you could cover most of the fill using longs for speed. I think Chip still allows any alignment, so long writes can start on any address (there may be a start-cost?). So you do longs, then check if an addendum is needed. As that is only 0, 1, 2 or 3 bytes, a byte fill is probably ok.
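The longs-plus-addendum split can be sketched in Python as below. This models only the arithmetic of the suggestion on a byte buffer; the function name and the bytearray standing in for hub memory are assumptions, and it says nothing about actual P2 alignment costs.

```python
def fill(buf, addr, count, ch):
    """Fill count bytes at addr with ch: whole longs first, then a 0..3 byte tail."""
    longs, tail = divmod(count, 4)
    byte = bytes([ch & 0xFF])

    # Bulk of the fill, one long (4 identical bytes) at a time.
    for i in range(longs):
        start = addr + 4 * i
        buf[start:start + 4] = byte * 4

    # Addendum: at most 3 leftover bytes, written individually.
    tail_start = addr + 4 * longs
    for i in range(tail):
        buf[tail_start + i] = ch & 0xFF
    return longs, tail


memory = bytearray(16)               # stand-in for a slice of hub memory
print(fill(memory, 1, 7, 0xAA))      # one long plus a 3-byte tail
```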

  • Yes, I did try a long fill and it works well but byte/word/long are normally decided at runtime due to the wide range of parameters that will be thrown at it.
    ' $6.0000 $1.0000 0 LAP LFILL LAP DECIMAL .LAP 32864 ( 205us/64k )
    ' LFILL ( addr bytes ch -- )
    LFILL	shr	tos1,#2
     	wrfast	tos1,tos2
    	rep	@.L0,tos1
    	wflong	tos
    .L0    	jmp	#DROP3
    

    The wfbyte/wfword/wflong are great candidates though for a simple SETS as the source field of the instruction controls the word size. So I will have a look at this but I don't want to make the runtime word too complicated either in trying to divide the operation up into mixed operations. Anyway I can use discrete FILL/WFILL/LFILL words but that's a waste or I could modify the source field each time perhaps. It's fun playing with the code at this level too.