SKIPF, EXECF & XBYTE (renamed)

Cluso99 · 2019-03-06 05:49

I am using SKIPF to decode and execute the mathops in my P2 spin interpreter and I am getting varying clock cycles...

Here is the pasm code being tested...

math_00
                getct   c1
                skipf   skipover                        ' skip over instruction(s)
'               skip    skipover                        ' skip over instruction(s)
'------------------------------------------------------------------------------------------------------------  
                                        '$F0=AND bool   ' %0000_0000_0011_1111_1110_1111_1111_0000
                                        '$F2=OR  bool   ' %0000_0000_0011_1111_1011_1111_1111_0000
math_F0                 CMP     x,#0       wz 'f=0,t=-1 '                                        *
                        MUXNZ   x,masklong              '                                       *
                        CMP     y,#0       wz 'f=0,t=-1 '                                      *
                        MUXNZ   y,masklong              '                                     *
'------------------------------------------------------------------------------------------------------------  
math_E0                 ROR     x,y     '$E0=ROR        ' %0000_0000_0011_1111_1111_1111_1110_1111
                        ROL     x,y     '$E1=ROL        ' %0000_0000_0011_1111_1111_1111_1101_1111
                        SHR     x,y     '$E2=SHR        ' %0000_0000_0011_1111_1111_1111_1011_1111
                        SHL     x,y     '$E3=SHL        ' %0000_0000_0011_1111_1111_1111_0111_1111
                        FGES    x,y     '$E4=FGES       ' %0000_0000_0011_1111_1111_1110_1111_1111
                        FLES    x,y     '$E5=FLES       ' %0000_0000_0011_1111_1111_1101_1111_1111
                        NEG     x,y     '$E6=NEG        ' %0000_0000_0011_1111_1111_1011_1111_1111
                        NOT     x,y     '$E7=NOT        ' %0000_0000_0011_1111_1111_0111_1111_1111
                        AND     x,y     '$E8=AND        ' %0000_0000_0011_1111_1110_1111_1111_1111
                        ABS     x,y     '$E9=ABS        ' %0000_0000_0011_1111_1101_1111_1111_1111
                        OR      x,y     '$EA=OR         ' %0000_0000_0011_1111_1011_1111_1111_1111
                        XOR     x,y     '$EB=XOR        ' %0000_0000_0011_1111_0111_1111_1111_1111
                        ADD     x,y     '$EC=ADD        ' %0000_0000_0011_1110_1111_1111_1111_1111
                        SUB     x,y     '$ED=SUB        ' %0000_0000_0011_1101_1111_1111_1111_1111
                        SAR     x,y     '$EE=SAR        ' %0000_0000_0011_1011_1111_1111_1111_1111
                        REV     x       '$EF=REV        ' %0000_0000_0011_0111_1111_1111_1111_1111
                        ENCOD   x,y     '$F1=ENCOD      ' %0000_0000_0010_1111_1111_1111_1111_1111
                        DECOD   x,y     '$F3=DECOD      ' %0000_0000_0001_1111_1111_1111_1111_1111
'------------------------------------------------------------------------------------------------------------  
                getct   c2
                RET

This is the table. Each long holds the skip bits in b31:10 and the cog/LUT address in b9:0.

vector_table
'----------------------------------------------------------------------------------------------------------------------------                                  
  long  %0000_0000_0011_1111_1111_1111_1110_1111 <<10 +math_00   '$E0 ROR       1st -> 2nd  b       ->     ->=  rotate right                         
  long  %0000_0000_0011_1111_1111_1111_1101_1111 <<10 +math_00   '$E1 ROL       1st <- 2nd  b       <-     <-=  rotate left                          
  long  %0000_0000_0011_1111_1111_1111_1011_1111 <<10 +math_00   '$E2 SHR       1st >> 2nd  b       >>     >>=  shift right                          
  long  %0000_0000_0011_1111_1111_1111_0111_1111 <<10 +math_00   '$E3 SHL       1st << 2nd  b       <<     <<=  shift left                           
  long  %0000_0000_0011_1111_1111_1110_1111_1111 <<10 +math_00   '$E4 FGES      1st #> 2nd  b       #>     #>=  limit minimum (signed)               
  long  %0000_0000_0011_1111_1111_1101_1111_1111 <<10 +math_00   '$E5 FLES      1st <# 2nd  b       <#     <#=  limit maximum (signed)               
  long  %0000_0000_0011_1111_1111_1011_1111_1111 <<10 +math_00   '$E6 NEG       - 1st       unary   -      -    negate                               
  long  %0000_0000_0011_1111_1111_0111_1111_1111 <<10 +math_00   '$E7 NOT       ! 1st       unary   !      !    bitwise not                          
  long  %0000_0000_0011_1111_1110_1111_1111_1111 <<10 +math_00   '$E8 AND       1st & 2nd   b       &      &=   bitwise and                          
  long  %0000_0000_0011_1111_1101_1111_1111_1111 <<10 +math_00   '$E9 ABS       ABS( 1st )  unary   ||     ||   absolute                             
  long  %0000_0000_0011_1111_1011_1111_1111_1111 <<10 +math_00   '$EA OR        1st | 2nd   b       |      |=   bitwise or                           
  long  %0000_0000_0011_1111_0111_1111_1111_1111 <<10 +math_00   '$EB XOR       1st ^ 2nd   b       ^      ^=   bitwise xor                          
  long  %0000_0000_0011_1110_1111_1111_1111_1111 <<10 +math_00   '$EC ADD       1st + 2nd   b       +      +=   add                                  
  long  %0000_0000_0011_1101_1111_1111_1111_1111 <<10 +math_00   '$ED SUB       1st - 2nd   b       -      -=   subtract                             
  long  %0000_0000_0011_1011_1111_1111_1111_1111 <<10 +math_00   '$EE SAR       1st ~> 2nd  b       ~>     ~>=  shift arithmetic right               
  long  %0000_0000_0011_0111_1111_1111_1111_1111 <<10 +math_00   '$EF REV       1st >< 2nd  b       ><     ><=  reverse bits (neg y first)           
  long  %0000_0000_0011_1111_1110_1111_1111_0000 <<10 +math_00   '$F0 AND bool  1st AND 2nd b       AND         boolean and                          
  long  %0000_0000_0010_1111_1111_1111_1111_1111 <<10 +math_00   '$F1 ENCOD     >| 1st      unary   >|     >|   encode (0-32)                        
  long  %0000_0000_0011_1111_1011_1111_1111_0000 <<10 +math_00   '$F2 OR  bool  1st OR 2nd  b       OR          boolean or                           
  long  %0000_0000_0001_1111_1111_1111_1111_1111 <<10 +math_00   '$F3 DECOD     |< 1st      unary   |<     |<   decode                               
  long  %0000_0000_0000_0001_1111_1111_1111_1100 <<10 +math_00   '$F4 MPY       1st * 2nd   b       *      *=   multiply, return lower half (signed) 
  long  %0000_0000_0000_0001_1111_1111_1111_0011 <<10 +math_00   '$F5 MPY_MSW   1st ** 2nd  b       **     **=  multiply, return upper half (signed) 
  long  %0000_0000_0000_0001_1111_1111_1100_1111 <<10 +math_00   '$F6 DIV       1st / 2nd   b       /      /=   divide, return quotient (signed)     
  long  %0000_0000_0000_0001_1111_1111_0011_1111 <<10 +math_00   '$F7 MOD       1st // 2nd  b       //     //=  divide, return remainder (signed)    
  long  %0000_0000_0000_0001_1111_1100_1111_1111 <<10 +math_00   '$F8 SQRT      ^^ 1st      unary   ^^     ^^   square root                          
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$F9 LT        1st < 2nd   b       <           test below (signed)                  
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$FA GT        1st > 2nd   b       >           test above (signed)                  
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$FB NE        1st <> 2nd  b       <>          test not equal                       
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$FC EQ        1st == 2nd  b       ==          test equal                           
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$FD LE        1st =< 2nd  b       =<          test below or equal (signed)         
  long  %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00   '$FE GE        1st => 2nd  b       =>          test above or equal (signed)         
  long  %0000_0000_0000_0000_0111_1111_1111_1111 <<10 +math_00   '$FF NOT bool  NOT 1st     unary   NOT    NOT  boolean not                          
'----------------------------------------------------------------------------------------------------------------------------
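
To make the layout concrete, here is a rough Python sketch of how one table long is packed and unpacked. The address $040 for math_00 is a made-up placeholder (the real value depends on where the routine assembles), and the helper names are only for illustration.

```python
ADDR_BITS = 10                       # b9:0 hold the cog/LUT address
ADDR_MASK = (1 << ADDR_BITS) - 1
SKIP_BITS = 32 - ADDR_BITS           # b31:10 leave room for 22 skip bits

def pack_entry(skip_pattern, address):
    """Build one table long: skip pattern in b31:10, address in b9:0."""
    if skip_pattern >> SKIP_BITS:
        raise ValueError("skip pattern does not fit in 22 bits")
    if address & ~ADDR_MASK:
        raise ValueError("address does not fit in 10 bits")
    return (skip_pattern << ADDR_BITS) | address

def unpack_entry(entry):
    """Split a table long back into (skip_pattern, address)."""
    return entry >> ADDR_BITS, entry & ADDR_MASK

MATH_00 = 0x040                      # hypothetical address of math_00
ROR_PATTERN = 0b0000_0000_0011_1111_1111_1111_1110_1111   # $E0 row of the table

entry = pack_entry(ROR_PATTERN, MATH_00)
print(f"${entry:08X}", unpack_entry(entry) == (ROR_PATTERN, MATH_00))
```

The range check is the useful part: anything beyond 22 skip bits would be lost when shifted left by 10.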

And the results

P2 Mathop testing v001g
op: 00000000  time: 0000000C  xya: 71111777 00000004 00003333  NC NZ
op: 00000001  time: 0000000C  xya: 11117777 00000004 00003333   C NZ
op: 00000002  time: 0000000A  xya: 01111777 00000004 00003333   C NZ
op: 00000003  time: 0000000A  xya: 11117770 00000004 00003333   C NZ
op: 00000004  time: 0000000A  xya: 11117770 00000004 00003333   C NZ
op: 00000005  time: 0000000C  xya: 00000004 00000004 00003333   C NZ
op: 00000006  time: 0000000C  xya: FFFFFFFC 00000004 00003333   C NZ
op: 00000007  time: 0000000C  xya: FFFFFFFB 00000004 00003333   C NZ
op: 00000008  time: 0000000C  xya: 00000000 00000004 00003333   C NZ
op: 00000009  time: 0000000C  xya: 00000004 00000004 00003333   C NZ
op: 0000000A  time: 0000000A  xya: 00000004 00000004 00003333   C NZ
op: 0000000B  time: 0000000A  xya: 00000000 00000004 00003333   C NZ
op: 0000000C  time: 0000000A  xya: 00000004 00000004 00003333   C NZ
op: 0000000D  time: 0000000C  xya: 00000000 00000004 00003333   C NZ
op: 0000000E  time: 0000000C  xya: 00000000 00000004 00003333   C NZ
op: 0000000F  time: 0000000C  xya: 00000000 00000004 00003333   C NZ
...

Note the time for most single instruction code is "C" = 12 clocks.
But some of the instructions take only "A" = 10 clocks.
And these results are always the same.
Why???

BTW when using skip rather than skipf, they always take $30 = 48 clocks.

And WOW! What a feature

Next to try XBYTE

rogloh · 2019-03-06 06:40

Are there sufficient instructions executed outside of the mathops to complete the 32 potential SKIPs before this code fragment is re-entered using the next opcode? Just wondering, as the rest of the code is not shown. Looks like you would need at least 10 more instructions to be executed after DECOD x,y. Maybe your interpreter is too optimized??? LOL.

Also interesting to see that math ops 2,3,4 and A=2+8, B=3+8, C=4+8 have the same timing behavior, each consuming one less instruction...is that significant?

Cluso99 · 2019-03-06 07:09

rogloh wrote: »

Are there sufficient instructions executed outside of the mathops to complete the 32 potential SKIPs before this code fragment is re-entered using the next opcode? Just wondering, as the rest of the code is not shown. Looks like you would need at least 10 more instructions to be executed after DECOD x,y. Maybe your interpreter is too optimized??? LOL.

Also interesting to see that math ops 2,3,4 and A=2+8, B=3+8, C=4+8 have the same timing behavior, each consuming one less instruction...is that significant?

Full code was attached

No idea why they would be different.

I believe once the resulting skip bits are all 0, then the skip is complete. But they are all 0 in order to execute the final getct.

AJL · 2019-03-06 07:54

SKIP cancels the instructions, but they still take two cycles per cancelled instruction.
SKIPF jumps the PC by up to 8 instructions eliminating the time taken to execute them; only 7 instructions can be skipped for free.
So, 8 or more skipped instructions in a row will still consume some cycles to update the PC even though nothing comes of it.

mathop 0:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 4 and ROR is executed: 2 cycles
Skip 8: 2 cycles
Skip 8: 2 cycles
Skip 1 and GETCT is executed: 2 cycles

Total 12 cycles

mathop 1:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 5 and ROL is executed: 2 cycles
Skip 8: 2 cycles
Skip 8: 2 cycles
GETCT is executed: 2 cycles

Total 12 cycles

mathop 2:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 6 and SHR is executed: 2 cycles
Skip 8: 2 cycles
Skip 6 and GETCT is executed: 2 cycles

Total 10 cycles

mathop 3:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 7 and SHL is executed: 2 cycles
Skip 8: 2 cycles
Skip 5 and GETCT is executed: 2 cycles

Total 10 cycles

mathop 4:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled and Skip 8(?): 2 cycles
Skip 8: 2 cycles
FGES is executed: 2 cycles
Skip 8: 2 cycles
Skip 4 and GETCT is executed: 2 cycles

Total 10 cycles

Does this help?

Edit: something screwy here, 8 != 10, and 10 != 12

Second Edit: perhaps for mathop 4 the Skip first instruction and Skip 8 merge. That would bring it into line with the rest, and I missed the constant 2 cycle cost for SKIPF in the first place.
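
The accounting above can be checked with a small Python model. It is inferred from this description and fitted to the posted log, so treat it as a sketch and not a statement of what the silicon does. The assumptions: SKIPF itself costs one 2-clock slot, the instruction right after SKIPF always occupies a slot (cancelled if its bit is 1), and every later slot can leap over at most 7 skipped instructions before executing one, so 8 skips in a row burn a slot as a NOP. The function names are made up for illustration.

```python
BODY_LEN = 22    # instructions between "skipf" and "getct c2" in math_00

def slots_until(pattern, target):
    """Count 2-clock slots after SKIPF until instruction `target` executes."""
    if (pattern >> target) & 1:
        raise ValueError("target instruction is itself skipped")
    slots = 1                        # first instruction is already in the pipeline
    if target == 0:
        return slots
    pc = 1
    while True:
        run = 0
        while run < 8 and (pattern >> (pc + run)) & 1:
            run += 1
        slots += 1
        if run == 8:                 # eight skips in a row: this slot is a NOP
            pc += 8
            continue
        pc += run                    # leap over up to 7 skipped instructions
        if pc == target:
            return slots
        pc += 1                      # a non-skipped instruction executed

def skipf_clocks(op):
    """Modelled clocks from getct c1 to getct c2 for mathop `op` (0..15)."""
    pattern = ((1 << BODY_LEN) - 1) & ~(1 << (4 + op))
    return 2 * (1 + slots_until(pattern, BODY_LEN))   # +1 slot for SKIPF itself

def skip_clocks():
    """Plain SKIP: every instruction, skipped or not, costs 2 clocks."""
    return 2 * (1 + BODY_LEN + 1)

# "time" column of the log for op 0..F
MEASURED = [0xC, 0xC, 0xA, 0xA, 0xA, 0xC, 0xC, 0xC,
            0xC, 0xC, 0xA, 0xA, 0xA, 0xC, 0xC, 0xC]

print([skipf_clocks(op) for op in range(16)] == MEASURED, skip_clocks())
```

Under these assumptions the model reproduces all 16 logged times, and the 48 clocks seen with plain SKIP.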

Cluso99 · 2019-03-06 09:09

Thanks AJL.
I was just working thru the sequence too. I noticed there is an effective 2 clock overhead for the skipf instruction to cancel the first instruction.
Makes determinism a little difficult to calculate, but I guess that's the penalty for the gain.

AJL · 2019-03-06 09:28

Cluso99 wrote: »

Thanks AJL.
I was just working thru the sequence too. I noticed there is an effective 2 clock overhead for the skipf instruction to cancel the first instruction.
Makes determinism a little difficult to calculate, but I guess that's the penalty for the gain.

Considering that it looks like you have about 10 spare skip slots, you might rearrange your table, with some 'early exit' points added. By strategically introducing extra RET instructions that are skipped when not needed you might achieve it.

It might be possible to normalise your execution times at around 12-14 cycles.

Regards,
Anthony.

cgracey · 2019-03-06 10:01

Here is some interpreter code which selectively skips among 19 instructions to realize 24 different functions which would have taken 149 instructions to code discretely. Neglecting the XBYTE jumpaddress+skipfpattern long for each of the 24 functions, this amounts to a code compaction ratio of almost 8-to-1, allowing the Spin2 interpreter to fit entirely within cog RAM:

'
' Setup hub variable
'
var_bas		rfvar	ad		'a b c d e f g h i j k l m n o p q r			a: setup byte[pbase + rfvar]
var_pop		popa	ad		'| | | | | | | | | | | | | | | | | | s t u		b: setup byte[vbase + rfvar]
					'							c: setup byte[dbase + rfvar]
		shl	x,#2		'| | | | | | | | | | | | | | | p q r | | u		d: setup byte[pbase + rfvar][pop index]
		shl	x,#1		'| | | | | | | | | j k l | | | | | | | t |		e: setup byte[vbase + rfvar][pop index]
		add	ad,x		'| | | d e f | | | j k l | | | p q r s t u		f: setup byte[dbase + rfvar][pop index]
					'							g: setup word[pbase + rfvar]
		add	ad,pbase	'a | | d | | g | | j | | m | | p | | | | |		h: setup word[vbase + rfvar]
		add	ad,vbase	'| b | | e | | h | | k | | n | | q | | | |		i: setup word[dbase + rfvar]
		add	ad,dbase	'| | c | | f | | i | | l | | o | | r | | |		j: setup word[pbase + rfvar][pop index]
					'							k: setup word[vbase + rfvar][pop index]
var_ptr		mov	ad,x		'| | | | | | | | | | | | | | | | | | | | | v w x	l: setup word[dbase + rfvar][pop index]
		popa	x		'| | | d e f | | | j k l | | | p q r s t u v w x	m: setup long[pbase + rfvar]
					'							n: setup long[vbase + rfvar]
		mov	rd,rd_long	'| | | | | | | | | | | | m n o p q r | | u | | x	o: setup long[dbase + rfvar]
		mov	wr,wr_long	'| | | | | | | | | | | | m n o p q r | | u | | x	p: setup long[pbase + rfvar][pop index]
	_ret_	mov	sz,#31		'| | | | | | | | | | | | m n o p q r | | u | | x	q: setup long[vbase + rfvar][pop index]
					'							r: setup long[dbase + rfvar][pop index]
		mov	rd,rd_byte	'a b c d e f | | | | | |             s |   v |		s: setup byte[pop base][pop index]
		mov	wr,wr_byte	'a b c d e f | | | | | |             s |   v |		t: setup word[pop base][pop index]
	_ret_	mov	sz,#7		'a b c d e f | | | | | |             s |   v |		u: setup long[pop base][pop index]
					'							v: setup byte[pop address]
		mov	rd,rd_word	'            g h i j k l               t     w		w: setup word[pop address]
		mov	wr,wr_word	'            g h i j k l               t     w		x: setup long[pop address]
	_ret_	mov	sz,#15		'            g h i j k l               t     w

The only time an extra 2-clock NOP is inserted into the pipeline is for the g/h/i cases. I figured that byte and long variables were going to be more commonly accessed than word variables, so word variable accesses suffered the 2-clock penalty. Otherwise, it's like every instruction executes in order, with no timing gaps.

Here is the XBYTE table for these 24 functions:

		long	var_bas	|       %000_111_11_110_111_10 << 10	'51	setup byte[pbase + rfvar]
		long	var_bas	|       %000_111_11_101_111_10 << 10	'52	setup byte[vbase + rfvar]
		long	var_bas	|       %000_111_11_011_111_10 << 10	'53	setup byte[dbase + rfvar]

		long	var_bas	|       %000_111_01_110_011_10 << 10	'54	setup byte[pbase + rfvar][pop index]
		long	var_bas	|       %000_111_01_101_011_10 << 10	'55	setup byte[vbase + rfvar][pop index]
		long	var_bas	|       %000_111_01_011_011_10 << 10	'56	setup byte[dbase + rfvar][pop index]

		long	var_bas	|   %000_111_111_11_110_111_10 << 10	'57	setup word[pbase + rfvar]
		long	var_bas	|   %000_111_111_11_101_111_10 << 10	'58	setup word[vbase + rfvar]
		long	var_bas	|   %000_111_111_11_011_111_10 << 10	'59	setup word[dbase + rfvar]

		long	var_bas	|   %000_111_111_01_110_001_10 << 10	'5A	setup word[pbase + rfvar][pop index]
		long	var_bas	|   %000_111_111_01_101_001_10 << 10	'5B	setup word[vbase + rfvar][pop index]
		long	var_bas	|   %000_111_111_01_011_001_10 << 10	'5C	setup word[dbase + rfvar][pop index]

		long	var_bas	|           %000_11_110_111_10 << 10	'5D	setup long[pbase + rfvar]
		long	var_bas	|           %000_11_101_111_10 << 10	'5E	setup long[vbase + rfvar]
		long	var_bas	|           %000_11_011_111_10 << 10	'5F	setup long[dbase + rfvar]

		long	var_bas	|           %000_01_110_010_10 << 10	'60	setup long[pbase + rfvar][pop index]
		long	var_bas	|           %000_01_101_010_10 << 10	'61	setup long[vbase + rfvar][pop index]
		long	var_bas	|           %000_01_011_010_10 << 10	'62	setup long[dbase + rfvar][pop index]

		long	var_pop	|        %000_111_01_111_011_0 << 10	'63	setup byte[pop base][pop index]
		long	var_pop	|    %000_111_111_01_111_001_0 << 10	'64	setup word[pop base][pop index]
		long	var_pop	|            %000_01_111_010_0 << 10	'65	setup long[pop base][pop index]

		long	var_ptr	|                  %000_111_00 << 10	'66	setup byte[pop address]
		long	var_ptr	|              %000_111_111_00 << 10	'67	setup word[pop address]
		long	var_ptr	|                      %000_00 << 10	'68	setup long[pop address]
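
One way to see why only the g/h/i cases pick up the extra NOP is to look for long runs of 1s in the skip patterns. The Python sketch below does that for the non-indexed byte, word and long entries, copied from the table. It assumes the rule described earlier in the thread (8 or more skips in a row cost a slot), and the helper name is invented for the example.

```python
# Skip patterns copied from the table (rightmost bit = first instruction).
PATTERNS = {
    0x51: "000_111_11_110_111_10",        # byte[pbase + rfvar]
    0x52: "000_111_11_101_111_10",        # byte[vbase + rfvar]
    0x53: "000_111_11_011_111_10",        # byte[dbase + rfvar]
    0x57: "000_111_111_11_110_111_10",    # word[pbase + rfvar]
    0x58: "000_111_111_11_101_111_10",    # word[vbase + rfvar]
    0x59: "000_111_111_11_011_111_10",    # word[dbase + rfvar]
    0x5D: "000_11_110_111_10",            # long[pbase + rfvar]
    0x5E: "000_11_101_111_10",            # long[vbase + rfvar]
    0x5F: "000_11_011_111_10",            # long[dbase + rfvar]
}

def longest_skip_run(bits):
    """Length of the longest run of consecutive 1 bits in a pattern string."""
    pattern = int(bits.replace("_", ""), 2)
    best = run = 0
    while pattern:
        run = run + 1 if pattern & 1 else 0
        best = max(best, run)
        pattern >>= 1
    return best

# Entries whose longest run reaches 8 should pick up a NOP slot.
needs_nop = sorted(code for code, bits in PATTERNS.items()
                   if longest_skip_run(bits) >= 8)
print([f"{code:02X}" for code in needs_nop])
```

Only 57, 58 and 59 are flagged. The word entries have to skip both the long and the byte MOV groups to reach the word group at the end, which is what pushes their runs to 8 or more.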

Cluso99 · 2019-03-06 10:36

Chip,
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!

The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!

cgracey · 2019-03-06 10:47

Cluso99 wrote: »

Chip,
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!

The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!

Yeah, go with what works in the P2. You should find that you don't need the old JMPRET instruction, after all.

cgracey · 2019-03-06 10:54

Cluso99,

If you do this, you can drop the execution time to only 8 clocks per bytecode, total, using XBYTE:

ROR_    _RET_        ROR     x,y     '$E0=ROR        ' %0
ROL_    _RET_        ROL     x,y     '$E1=ROL        ' %0
SHR_    _RET_        SHR     x,y     '$E2=SHR        ' %0
SHL_    _RET_        SHL     x,y     '$E3=SHL        ' %0
FGES_   _RET_        FGES    x,y     '$E4=FGES       ' %0
FLES_   _RET_        FLES    x,y     '$E5=FLES       ' %0
NEG_    _RET_        NEG     x,y     '$E6=NEG        ' %0
NOT_    _RET_        NOT     x,y     '$E7=NOT        ' %0
AND_    _RET_        AND     x,y     '$E8=AND        ' %0
ABS_    _RET_        ABS     x,y     '$E9=ABS        ' %0
OR_     _RET_        OR      x,y     '$EA=OR         ' %0
XOR_    _RET_        XOR     x,y     '$EB=XOR        ' %0
ADD_    _RET_        ADD     x,y     '$EC=ADD        ' %0
SUB_    _RET_        SUB     x,y     '$ED=SUB        ' %0
SAR_    _RET_        SAR     x,y     '$EE=SAR        ' %0
REV_    _RET_        REV     x       '$EF=REV        ' %0
ENCOD_  _RET_        ENCOD   x,y     '$F1=ENCOD      ' %0
DECOD_  _RET_        DECOD   x,y     '$F3=DECOD      ' %0

AJL · 2019-03-06 10:54

Cluso99 wrote: »

Chip,
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!

The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!

You might want to check your table as the skip patterns for operations F9 to FE are identical.
You also need to skip all unneeded operations up to the RET; it looks like you might not be doing that, as all operations from F4 onward are performing every operation from subtract onward, some of them even more.

Hope it helps,

Anthony.

Cluso99 · 2019-03-06 11:29

cgracey wrote: »

Cluso99 wrote: »

Chip,
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!

The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!

Yeah, go with what works in the P2. You should find that you don't need the old JMPRET instruction, after all.

No. I don't need JMPRET but to do so required a major change. I still think it's a shame we didn't have an equivalent in P2 to make conversion easier.
We also missed having the DIRx/OUTx/INx mapped into the same register space.
We have what we have, and that's great anyway!

Cluso99 · 2019-03-06 11:34

cgracey wrote: »

Cluso99,

If you do this, you can drop the execution time to only 8 clocks per bytecode, total, using XBYTE:

ROR_    _RET_        ROR     x,y     '$E0=ROR        ' %0
ROL_    _RET_        ROL     x,y     '$E1=ROL        ' %0
SHR_    _RET_        SHR     x,y     '$E2=SHR        ' %0
SHL_    _RET_        SHL     x,y     '$E3=SHL        ' %0
FGES_   _RET_        FGES    x,y     '$E4=FGES       ' %0
FLES_   _RET_        FLES    x,y     '$E5=FLES       ' %0
NEG_    _RET_        NEG     x,y     '$E6=NEG        ' %0
NOT_    _RET_        NOT     x,y     '$E7=NOT        ' %0
AND_    _RET_        AND     x,y     '$E8=AND        ' %0
ABS_    _RET_        ABS     x,y     '$E9=ABS        ' %0
OR_     _RET_        OR      x,y     '$EA=OR         ' %0
XOR_    _RET_        XOR     x,y     '$EB=XOR        ' %0
ADD_    _RET_        ADD     x,y     '$EC=ADD        ' %0
SUB_    _RET_        SUB     x,y     '$ED=SUB        ' %0
SAR_    _RET_        SAR     x,y     '$EE=SAR        ' %0
REV_    _RET_        REV     x       '$EF=REV        ' %0
ENCOD_  _RET_        ENCOD   x,y     '$F1=ENCOD      ' %0
DECOD_  _RET_        DECOD   x,y     '$F3=DECOD      ' %0

I cannot do this as I am also popping before, and pushing the result after, but thanks anyway

I will likely be using XBYTE, but for now I need to insert some debug steps which affects the xbyte operation.

cgracey · 2019-03-06 11:37

Cluso99 wrote: »

...We also missed having the DIRx/OUTx/INx mapped into the same register space.

Do you mean that their addresses have changed?

Cluso99 · 2019-03-06 11:37

AJL wrote: »

Cluso99 wrote: »

Chip,
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!

The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!

You might want to check your table as the skip patterns for operations F9 to FE are identical.
You also need to skip all unneeded operations up to the RET; it looks like you might not be doing that, as all operations from F4 onward are performing every operation from subtract onward, some of them even more.

Hope it helps,

Anthony.

F9-FE: Yes. The difference is done elsewhere in the code, at least at this time anyway.
F4+: Yes, I hadn't got that far when I posted the code. It now jumps to a different address.

Cluso99 · 2019-03-08 02:28

cgracey wrote: »

Cluso99 wrote: »

...We also missed having the DIRx/OUTx/INx mapped into the same register space.

Do you mean that their addresses have changed?

Yes. For the P1 Spin Interpreter on P2 I have to re-map these registers (or re-compile).
Hindsight is wonderful

Cluso99 · 2019-03-08 04:19

Chip,

I've run out of 22 skip bits

So I need some more info please

Code in COG/LUT:
I presume when the call #\routine is not skipped, the skip bits are saved, the called routine executes, and when it returns the skip bits are restored?
Also, I presume I cannot use another skipf in the routine?

'all code in cog/lut...

                skipf           #$xxxx
                ...                                     'some instructions skipped. some not
                call            #\routine               'maybe skipped or not
                ...                                     'this instruction may be skipped or not
                ...                                     'some instructions skipped. some not 


routine         ...                                     'no instructions affected by skip
                ...
        _RET_   ...
                RET             {wcz}

I presume when a jmp/djnz/etc #routine (relative jump) is not skipped, the routine will now execute using the remaining skip bits?

'all code in cog/lut...

                skipf           #$xxxx
                ...                                     'some instructions skipped. some not
                jmp             #routine                'maybe skipped or not
                ...                                     'this instruction may be skipped or not
                ...                                     'some instructions skipped. some not 


routine         ...                                     'these instructions are affected by skip
                ...

Cluso99 · 2019-03-08 04:24

Also I presume a skipf #0 will cancel any currently executing skipf, provided the skipf #0 is not itself skipped?

cgracey · 2019-03-08 04:40

Cluso99 wrote: »

Also I presume a skipf #0 will cancel any currently executing skipf, provided the skipf #0 is not itself skipped?

Correct.

Read that section of the Google Doc. I explained it all very carefully in there. It's almost too much to remember, offhand.

cgracey · 2019-03-08 04:41

Cluso99 wrote: »

Chip,

I've run out of 22 skip bits

So I need some more info please

Code in COG/LUT:
I presume when the call #\routine is not skipped, the skip bits are saved, the called routine executes, and when it returns the skip bits are restored?
Also, I presume I cannot use another skipf in the routine?

'all code in cog/lut...

                skipf           #$xxxx
                ...                                     'some instructions skipped. some not
                call            #\routine               'maybe skipped or not
                ...                                     'this instruction may be skipped or not
                ...                                     'some instructions skipped. some not 


routine         ...                                     'no instructions affected by skip
                ...
        _RET_   ...
                RET             {wcz}

I presume when a jmp/djnz/etc #routine (relative jump) is not skipped, the routine will now execute using the remaining skip bits?

'all code in cog/lut...

                skipf           #$xxxx
                ...                                     'some instructions skipped. some not
                jmp             #routine                'maybe skipped or not
                ...                                     'this instruction may be skipped or not
                ...                                     'some instructions skipped. some not 


routine         ...                                     'these instructions are affected by skip
                ...

That all sounds correct.

Cluso99 · 2019-03-08 05:25

Thanks Chip. That is what I gleaned from the docs

ersmith · 2019-03-08 12:16

Cluso99 wrote: »

I will likely be using XBYTE, but for now I need to insert some debug steps which affects the xbyte operation.

I used XBYTE in my zpu interpreter, and to debug I provided an alternate implementation which mimics xbyte in software, something like:

*Edit*: fixed the order of rfbyte and getptr below

        rdfast #0, initial_pc
next_instruction
        rfbyte pa
        getptr pb
        push #next_instruction
        rdlut temp, pa    ' depends on opcode table being at $0 in LUT
        execf temp

Then I could insert whatever debugging I wanted in the next_instruction loop, but since it's using the same structure as XBYTE I could easily toggle the hardware XBYTE on and off with an #ifdef.
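
For anyone who wants to reason about that loop on a PC first, here is a rough Python model of the same fetch and decode steps. It does not run any handlers, so operand bytes are not consumed. The LUT entry and the names (run_bytecodes, hook) are invented for the example. The only hardware facts it relies on are the ones in the loop above plus EXECF taking the jump address from bits 9:0 and the skip pattern from bits 31:10.

```python
ADDR_MASK = 0x3FF    # execf takes the jump address from bits 9:0

def run_bytecodes(code, lut, hook=None):
    """Walk `code` the way the loop above does, without executing handlers."""
    trace = []
    ptr = 0                        # stands in for the FIFO read pointer
    while ptr < len(code):
        pa = code[ptr]             # rfbyte pa
        ptr += 1
        pb = ptr                   # getptr pb (pointer is already past the opcode)
        entry = lut[pa]            # rdlut temp, pa
        step = (pa, pb, entry & ADDR_MASK, entry >> 10)   # what execf would use
        if hook:
            hook(step)             # any debugging goes here
        trace.append(step)
    return trace

lut = [0] * 256
lut[0xE0] = (0b1110_1111 << 10) | 0x040    # made-up entry: skip bits + address

for pa, pb, address, skip in run_bytecodes(bytes([0xE0, 0xE0]), lut):
    print(f"op ${pa:02X}  pb={pb}  jump=${address:03X}  skip=%{skip:022b}")
```

The hook parameter plays the role of the debug code dropped into next_instruction: it sees each bytecode before dispatch without changing the structure of the loop.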

ersmith · 2019-03-08 12:21

While we're on the subject of XBYTE, I had a question for @cgracey . I've been trying to use SETQ2 to select a temporary alternate opcode table for the next byte, but can't seem to get it to work reliably. I suspect it's something funny in my work flow. When exactly does the temporary table established by SETQ2 get restored to the default one? What happens if we do a SETQ2 while executing a bytecode set up by a previous SETQ2? I think it's acting like the second SETQ2 is being dropped, but I could be wrong.

Thanks,
Eric

cgracey · 2019-03-08 12:28

ersmith wrote: »

While we're on the subject of XBYTE, I had a question for @cgracey . I've been trying to use SETQ2 to select a temporary alternate opcode table for the next byte, but can't seem to get it to work reliably. I suspect it's something funny in my work flow. When exactly does the temporary table established by SETQ2 get restored to the default one? What happens if we do a SETQ2 while executing a bytecode set up by a previous SETQ2? I think it's acting like the second SETQ2 is being dropped, but I could be wrong.

Thanks,
Eric

"_RET_ SETQ2 {#}D" only affects the next XBYTE, then subsequent XBYTE operations revert to the main configuration that was established by the prior "_RET_ SETQ {#}D". Using multiple "_RET_ SETQ2 {#}D" in a row will keep things in the alternate mode, until a RET without a SETQ2 executes.

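The rule above can be written down as a tiny state machine. This Python sketch only models which table each successive bytecode is looked up in, following the description in this post. It is not derived from the hardware, and the function name is made up.

```python
def tables_used(ends_with_setq2):
    """ends_with_setq2[i] is True if bytecode i finishes with "_RET_ SETQ2".

    Returns, per bytecode, which table its XBYTE lookup uses.
    """
    used = []
    use_alt = False                  # the first lookup uses the main configuration
    for wants_alt in ends_with_setq2:
        used.append("alt" if use_alt else "main")
        use_alt = wants_alt          # SETQ2 affects only the next XBYTE
    return used

# bytecode 1 requests the alternate table once; bytecodes 3 and 4 chain it
print(tables_used([False, True, False, True, True, False, False]))
```

Bytecodes 2, 4 and 5 come out as "alt" and everything else as "main": a single SETQ2 buys one alternate lookup, and chaining keeps the alternate table until a return happens without one.
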
ersmith · 2019-03-08 12:47

"_RET_ SETQ2 {#}D" only affects the next XBYTE, then subsequent XBYTE operations revert to the main configuration that was established by the prior "_RET_ SETQ {#}D". Using multiple "_RET_ SETQ2 {#}D" in a row will keep things in the alternate mode, until a RET without a SETQ2 executes.

That's what I thought would happen, but it doesn't seem to be working for me. I need to look more closely, but it seems like the second SETQ2 is being ignored. I have a software emulation of XBYTE (similar to the loop I posted above, but with an additional temporary offset for the LUT table that gets used for the lookup and then reset before the execf) and that works as I expect, but when I switch to XBYTE it doesn't work.

Cluso99 · 2019-03-08 15:59

Thanks Eric. I’ve spent a lot of time recoding the sections to use the skipf concept. Wow it saves a lot of code, and execution time too. I’m not up to testing the new code yet.

ersmith · 2019-03-08 17:56

"_RET_ SETQ2 {#}D" only affects the next XBYTE, then subsequent XBYTE operations revert to the main configuration that was established by the prior "_RET_ SETQ {#}D". Using multiple "_RET_ SETQ2 {#}D" in a row will keep things in the alternate mode, until a RET without a SETQ2 executes.

Never mind, I think I found it: my original XBYTE emulation was doing:

getptr pb
rfbyte pa

but in fact the hardware does it in the other order:

rfbyte pa
getptr pb

so I was using the wrong value for pb in a few places. The docs actually do show it this way, I just didn't read them carefully enough. Sorry for the false alarm!

cgracey · 2019-03-08 21:30

ersmith wrote: »
"_RET_ SETQ2 {#}D" only affects the next XBYTE, then subsequent XBYTE operations revert to the main configuration that was established by the prior "_RET_ SETQ {#}D". Using multiple "_RET_ SETQ2 {#}D" in a row will keep things in the alternate mode, until a RET without a SETQ2 executes.

Never mind, I think I found it: my original XBYTE emulation was doing:
getptr pb
rfbyte pa
but in fact the hardware does it in the other order:
rfbyte pa
getptr pb
so I was using the wrong value for pb in a few places. The docs actually do show it this way, I just didn't read them carefully enough. Sorry for the false alarm!

Periodic alarm is therapeutic. Keeps things circulating.

Cluso99 · 2019-03-09 04:11

I thought I'd post my skipf routines for my P2 SpinInterpreter of P1 here.

Moved to my P1 Spin Interpreter for P2 thread here
forums.parallax.com/discussion/comment/1466681/#Comment_1466681

Cluso99 · 2019-03-09 04:36

-moved-

Cluso99 · 2019-03-09 04:36

-moved-
