SKIPF, EXECF & XBYTE (renamed)
Cluso99
Posts: 18,069
I am using SKIPF to decode and execute the mathops in my P2 spin interpreter and I am getting varying clock cycles...
Here is the pasm code being tested...
math_00         getct   c1
                skipf   skipover                ' skip over instruction(s)
'               skip    skipover                ' skip over instruction(s)
'------------------------------------------------------------------------------------------------------------
'$F0=AND bool                                   ' %0000_0000_0011_1111_1110_1111_1111_0000
'$F2=OR bool                                    ' %0000_0000_0011_1111_1011_1111_1111_0000
math_F0         CMP     x,#0    wz  'f=0,t=-1   ' *
                MUXNZ   x,masklong              ' *
                CMP     y,#0    wz  'f=0,t=-1   ' *
                MUXNZ   y,masklong              ' *
'------------------------------------------------------------------------------------------------------------
math_E0         ROR     x,y     '$E0=ROR        ' %0000_0000_0011_1111_1111_1111_1110_1111
                ROL     x,y     '$E1=ROL        ' %0000_0000_0011_1111_1111_1111_1101_1111
                SHR     x,y     '$E2=SHR        ' %0000_0000_0011_1111_1111_1111_1011_1111
                SHL     x,y     '$E3=SHL        ' %0000_0000_0011_1111_1111_1111_0111_1111
                FGES    x,y     '$E4=FGES       ' %0000_0000_0011_1111_1111_1110_1111_1111
                FLES    x,y     '$E5=FLES       ' %0000_0000_0011_1111_1111_1101_1111_1111
                NEG     x,y     '$E6=NEG        ' %0000_0000_0011_1111_1111_1011_1111_1111
                NOT     x,y     '$E7=NOT        ' %0000_0000_0011_1111_1111_0111_1111_1111
                AND     x,y     '$E8=AND        ' %0000_0000_0011_1111_1110_1111_1111_1111
                ABS     x,y     '$E9=ABS        ' %0000_0000_0011_1111_1101_1111_1111_1111
                OR      x,y     '$EA=OR         ' %0000_0000_0011_1111_1011_1111_1111_1111
                XOR     x,y     '$EB=XOR        ' %0000_0000_0011_1111_0111_1111_1111_1111
                ADD     x,y     '$EC=ADD        ' %0000_0000_0011_1110_1111_1111_1111_1111
                SUB     x,y     '$ED=SUB        ' %0000_0000_0011_1101_1111_1111_1111_1111
                SAR     x,y     '$EE=SAR        ' %0000_0000_0011_1011_1111_1111_1111_1111
                REV     x       '$EF=REV        ' %0000_0000_0011_0111_1111_1111_1111_1111
                ENCOD   x,y     '$F1=ENCOD      ' %0000_0000_0010_1111_1111_1111_1111_1111
                DECOD   x,y     '$F3=DECOD      ' %0000_0000_0001_1111_1111_1111_1111_1111
'------------------------------------------------------------------------------------------------------------
                getct   c2
                RET

This is the table comprised of b31:10 for the skip bits and b9:0 for the cog/lut address
vector_table
'----------------------------------------------------------------------------------------------------------------------------
    long %0000_0000_0011_1111_1111_1111_1110_1111 <<10 +math_00  '$E0 ROR       1st -> 2nd    b      ->   ->=  rotate right
    long %0000_0000_0011_1111_1111_1111_1101_1111 <<10 +math_00  '$E1 ROL       1st <- 2nd    b      <-   <-=  rotate left
    long %0000_0000_0011_1111_1111_1111_1011_1111 <<10 +math_00  '$E2 SHR       1st >> 2nd    b      >>   >>=  shift right
    long %0000_0000_0011_1111_1111_1111_0111_1111 <<10 +math_00  '$E3 SHL       1st << 2nd    b      <<   <<=  shift left
    long %0000_0000_0011_1111_1111_1110_1111_1111 <<10 +math_00  '$E4 FGES      1st #> 2nd    b      #>   #>=  limit minimum (signed)
    long %0000_0000_0011_1111_1111_1101_1111_1111 <<10 +math_00  '$E5 FLES      1st <# 2nd    b      <#   <#=  limit maximum (signed)
    long %0000_0000_0011_1111_1111_1011_1111_1111 <<10 +math_00  '$E6 NEG       - 1st         unary  -    -    negate
    long %0000_0000_0011_1111_1111_0111_1111_1111 <<10 +math_00  '$E7 NOT       ! 1st         unary  !    !    bitwise not
    long %0000_0000_0011_1111_1110_1111_1111_1111 <<10 +math_00  '$E8 AND       1st & 2nd     b      &    &=   bitwise and
    long %0000_0000_0011_1111_1101_1111_1111_1111 <<10 +math_00  '$E9 ABS       ABS( 1st )    unary  ||   ||   absolute
    long %0000_0000_0011_1111_1011_1111_1111_1111 <<10 +math_00  '$EA OR        1st | 2nd     b      |    |=   bitwise or
    long %0000_0000_0011_1111_0111_1111_1111_1111 <<10 +math_00  '$EB XOR       1st ^ 2nd     b      ^    ^=   bitwise xor
    long %0000_0000_0011_1110_1111_1111_1111_1111 <<10 +math_00  '$EC ADD       1st + 2nd     b      +    +=   add
    long %0000_0000_0011_1101_1111_1111_1111_1111 <<10 +math_00  '$ED SUB       1st - 2nd     b      -    -=   subtract
    long %0000_0000_0011_1011_1111_1111_1111_1111 <<10 +math_00  '$EE SAR       1st ~> 2nd    b      ~>   ~>=  shift arithmetic right
    long %0000_0000_0011_0111_1111_1111_1111_1111 <<10 +math_00  '$EF REV       1st >< 2nd    b      ><   ><=  reverse bits (neg y first)
    long %0000_0000_0011_1111_1110_1111_1111_0000 <<10 +math_00  '$F0 AND bool  1st AND 2nd   b      AND       boolean and
    long %0000_0000_0010_1111_1111_1111_1111_1111 <<10 +math_00  '$F1 ENCOD     >| 1st        unary  >|   >|   encode (0-32)
    long %0000_0000_0011_1111_1011_1111_1111_0000 <<10 +math_00  '$F2 OR bool   1st OR 2nd    b      OR        boolean or
    long %0000_0000_0001_1111_1111_1111_1111_1111 <<10 +math_00  '$F3 DECOD     |< 1st        unary  |<   |<   decode
    long %0000_0000_0000_0001_1111_1111_1111_1100 <<10 +math_00  '$F4 MPY       1st * 2nd     b      *    *=   multiply, return lower half (signed)
    long %0000_0000_0000_0001_1111_1111_1111_0011 <<10 +math_00  '$F5 MPY_MSW   1st ** 2nd    b      **   **=  multiply, return upper half (signed)
    long %0000_0000_0000_0001_1111_1111_1100_1111 <<10 +math_00  '$F6 DIV       1st / 2nd     b      /    /=   divide, return quotient (signed)
    long %0000_0000_0000_0001_1111_1111_0011_1111 <<10 +math_00  '$F7 MOD       1st // 2nd    b      //   //=  divide, return remainder (signed)
    long %0000_0000_0000_0001_1111_1100_1111_1111 <<10 +math_00  '$F8 SQRT      ^^ 1st        unary  ^^   ^^   square root
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$F9 LT        1st < 2nd     b      <         test below (signed)
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$FA GT        1st > 2nd     b      >         test above (signed)
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$FB NE        1st <> 2nd    b      <>        test not equal
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$FC EQ        1st == 2nd    b      ==        test equal
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$FD LE        1st =< 2nd    b      =<        test below or equal (signed)
    long %0000_0000_0000_0000_1000_0011_1111_1111 <<10 +math_00  '$FE GE        1st => 2nd    b      =>        test above or equal (signed)
    long %0000_0000_0000_0000_0111_1111_1111_1111 <<10 +math_00  '$FF NOT bool  NOT 1st       unary  NOT  NOT  boolean not
'----------------------------------------------------------------------------------------------------------------------------

And the results
P2 Mathop testing v001g
op: 00000000  time: 0000000C  xya: 71111777 00000004 00003333  NC NZ
op: 00000001  time: 0000000C  xya: 11117777 00000004 00003333  C  NZ
op: 00000002  time: 0000000A  xya: 01111777 00000004 00003333  C  NZ
op: 00000003  time: 0000000A  xya: 11117770 00000004 00003333  C  NZ
op: 00000004  time: 0000000A  xya: 11117770 00000004 00003333  C  NZ
op: 00000005  time: 0000000C  xya: 00000004 00000004 00003333  C  NZ
op: 00000006  time: 0000000C  xya: FFFFFFFC 00000004 00003333  C  NZ
op: 00000007  time: 0000000C  xya: FFFFFFFB 00000004 00003333  C  NZ
op: 00000008  time: 0000000C  xya: 00000000 00000004 00003333  C  NZ
op: 00000009  time: 0000000C  xya: 00000004 00000004 00003333  C  NZ
op: 0000000A  time: 0000000A  xya: 00000004 00000004 00003333  C  NZ
op: 0000000B  time: 0000000A  xya: 00000000 00000004 00003333  C  NZ
op: 0000000C  time: 0000000A  xya: 00000004 00000004 00003333  C  NZ
op: 0000000D  time: 0000000C  xya: 00000000 00000004 00003333  C  NZ
op: 0000000E  time: 0000000C  xya: 00000000 00000004 00003333  C  NZ
op: 0000000F  time: 0000000C  xya: 00000000 00000004 00003333  C  NZ
...

Note the time for most single instruction code is "C" = 12 clocks.
But some of the instructions take only "A" = 10 clocks.
And these results are always the same.
Why???
BTW, when using skip rather than skipf, they always take $30 = 48 clocks.
And WOW! What a feature! Next to try XBYTE.
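For readers who want to check the table layout described above (skip bits in b31:10, cog/LUT address in b9:0), here is a small Python sketch of how one table long could be packed and unpacked. The function names and the `MATH_00` address are hypothetical, chosen only for illustration; the real address of `math_00` depends on where the code is assembled.

```python
# Sketch of the table entry layout: skip bits in b31:10, cog/LUT address in b9:0.
# pack_entry/unpack_entry and MATH_00 are illustrative names, not from the original code.

SKIP_SHIFT = 10                      # skip pattern starts at bit 10
ADDR_MASK = (1 << SKIP_SHIFT) - 1    # low 10 bits hold the cog/LUT address
SKIP_MASK = (1 << 22) - 1            # 22 skip bits fit in b31:10


def pack_entry(skip_bits: int, addr: int) -> int:
    """Mirror of `long %pattern <<10 +math_00` in the table."""
    if not 0 <= skip_bits <= SKIP_MASK:
        raise ValueError("skip pattern must fit in 22 bits")
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError("address must fit in 10 bits")
    return (skip_bits << SKIP_SHIFT) + addr


def unpack_entry(entry: int) -> tuple[int, int]:
    """Split a table long back into (skip_bits, addr)."""
    return entry >> SKIP_SHIFT, entry & ADDR_MASK


MATH_00 = 0x040  # hypothetical cog address of math_00
ror_entry = pack_entry(0b0000_0000_0011_1111_1111_1111_1110_1111, MATH_00)

skip_bits, addr = unpack_entry(ror_entry)
print(f"skip={skip_bits:022b} addr=${addr:03X}")
```

This also makes the limit visible: with a 10-bit address in the same long, only 22 skip bits are available per entry.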
Comments
Also interesting to see that math ops 2,3,4 and A=2+8, B=3+8, C=4+8 have the same timing behavior, each consuming one less instruction...is that significant?
No idea why they would be different.
I believe once the resulting skip bits are all 0, then the skip is complete. But they are all 0 in order to execute the final getct.
SKIPF advances the PC by up to 8 instructions at a time, eliminating the time taken to execute the skipped ones; only 7 instructions in a row can be skipped for free.
So, 8 or more skipped instructions in a row will still consume some cycles to update the PC even though nothing comes of it.
mathop 0:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 4 and ROR is executed: 2 cycles
Skip 8: 2 cycles
Skip 8: 2 cycles
Skip 1 and GETCT is executed: 2 cycles
Total 12 cycles
mathop 1:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 5 and ROL is executed: 2 cycles
Skip 8: 2 cycles
Skip 8: 2 cycles
GETCT is executed: 2 cycles
Total 12 cycles
mathop 2:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 6 and SHR is executed: 2 cycles
Skip 8: 2 cycles
Skip 6 and GETCT is executed: 2 cycles
Total 10 cycles
mathop 3:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled: 2 cycles
Skip 7 and SHL is executed: 2 cycles
Skip 8: 2 cycles
Skip 5 and GETCT is executed: 2 cycles
Total 10 cycles
mathop 4:
SKIPF: 2 cycles
First instruction after the SKIPF is cancelled and Skip 8(?): 2 cycles
Skip 8: 2 cycles
FGES is executed: 2 cycles
Skip 8: 2 cycles
Skip 4 and GETCT is executed: 2 cycles
Total 10 cycles
Does this help?
Edit: something screwy here, 8 != 10, and 10 != 12
Second Edit: perhaps for mathop 4 the Skip first instruction and Skip 8 merge. That would bring it into line with the rest, and I missed the constant 2 cycle cost for SKIPF in the first place.
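The cycle counting above can be turned into a small Python model. This is only a sketch under assumptions taken from this thread, not a statement of how the silicon works: every executed or cancelled instruction costs 2 clocks, the first instruction after SKIPF is cancelled rather than fast-skipped if its bit is set, and every 8th consecutive skip is cancelled instead of being free. The function names are made up for illustration. Under those assumptions the model gives the same 12/12/10/10/10... pattern as the measured results for $E0..$EF, and 48 clocks for plain SKIP.

```python
# Assumed timing model for the test routine (inferred from the thread, not from hardware docs):
#   - every executed or cancelled instruction takes 2 clocks
#   - the instruction right after SKIPF is already in the pipeline, so if it is
#     skipped it is cancelled (2 clocks) rather than jumped over
#   - after that, up to 7 skips in a row are free; the 8th in a row is cancelled

CLOCKS_PER_SLOT = 2
BODY_LEN = 22  # instructions between SKIPF and the closing GETCT


def skipf_clocks(pattern: int, body_len: int = BODY_LEN) -> int:
    """Clocks measured between `getct c1` and `getct c2` when SKIPF is used."""
    clocks = CLOCKS_PER_SLOT              # the SKIPF instruction itself
    run = 0                               # consecutive free skips so far
    for i in range(body_len + 1):         # body slots plus the closing GETCT
        if not (pattern >> i) & 1:
            clocks += CLOCKS_PER_SLOT     # instruction executes
            run = 0
        elif i == 0 or run == 7:
            clocks += CLOCKS_PER_SLOT     # cancelled: first slot, or 8th skip in a row
            run = 0
        else:
            run += 1                      # fast skip, no cost
    return clocks


def skip_clocks(body_len: int = BODY_LEN) -> int:
    """With plain SKIP every slot costs 2 clocks, whether skipped or executed."""
    return CLOCKS_PER_SLOT * (1 + body_len + 1)


# Patterns for $E0..$EF: all 22 bits set except the bit of the wanted instruction.
patterns = [0x3FFFFF & ~(1 << (4 + n)) for n in range(16)]
times = [skipf_clocks(p) for p in patterns]
print(" ".join(f"{t:X}" for t in times))
print(skip_clocks())
```

In this model the 10-clock cases are exactly the ones where the run of skips after the executed instruction is 7 or fewer, so no extra cancelled slot is needed before the final GETCT.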
I was just working thru the sequence too. I noticed there is an effective 2 clock overhead for the skipf instruction to cancel the first instruction.
Makes determinism a little difficult to calculate, but I guess that's the penalty for the gain.
Considering that it looks like you have about 10 spare skip slots, you might rearrange your table, with some 'early exit' points added. By strategically introducing extra RET instructions that are skipped when not needed you might achieve it.
It might be possible to normalise your execution times at around 12-14 cycles.
Regards,
Anthony.
The only time an extra 2-clock NOP is inserted into the pipeline is for the g/h/i cases. I figured that byte and long variables were going to be more commonly accessed than word variables, so word variable accesses suffered the 2-clock penalty. Otherwise, it's like every instruction executes in order, with no timing gaps.
Here is the XBYTE table for these 24 functions:
Your sample is pretty much what I have. Even just using skip only takes 48 clocks, but skipf reduces even further to 10/12 clocks. This is simply amazing!
The whole mathop interpreter routine shrinks to ~48 instructions (plus the table of 32 longs). The original P1 ROM mathops takes ~104 instructions.
That's a huge improvement!!!
Yeah, go with what works in the P2. You should find that you don't need the old JMPRET instruction, after all.
If you do this, you can drop the execution time to only 8 clocks per bytecode, total, using XBYTE:
You might want to check your table as the skip patterns for operations F9 to FE are identical.
You also need to skip all unneeded operations up to the RET; it looks like you might not be doing that, as all operations from F4 onward are performing every operation from subtract onward, some of them even more.
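A quick way to spot the duplicated patterns is to group the table entries by skip pattern. The Python sketch below uses the $F4..$FF patterns exactly as posted; the helper name `duplicate_patterns` is made up for illustration.

```python
from collections import defaultdict

# Skip patterns for $F4..$FF, copied from the posted vector table.
TABLE = {
    0xF4: 0b0000_0000_0000_0001_1111_1111_1111_1100,  # MPY
    0xF5: 0b0000_0000_0000_0001_1111_1111_1111_0011,  # MPY_MSW
    0xF6: 0b0000_0000_0000_0001_1111_1111_1100_1111,  # DIV
    0xF7: 0b0000_0000_0000_0001_1111_1111_0011_1111,  # MOD
    0xF8: 0b0000_0000_0000_0001_1111_1100_1111_1111,  # SQRT
    0xF9: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # LT
    0xFA: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # GT
    0xFB: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # NE
    0xFC: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # EQ
    0xFD: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # LE
    0xFE: 0b0000_0000_0000_0000_1000_0011_1111_1111,  # GE
    0xFF: 0b0000_0000_0000_0000_0111_1111_1111_1111,  # NOT bool
}


def duplicate_patterns(table: dict[int, int]) -> list[list[int]]:
    """Return groups of opcodes that share the same skip pattern."""
    groups = defaultdict(list)
    for opcode, pattern in sorted(table.items()):
        groups[pattern].append(opcode)
    return [ops for ops in groups.values() if len(ops) > 1]


dupes = duplicate_patterns(TABLE)
for ops in dupes:
    print("same pattern:", " ".join(f"${op:02X}" for op in ops))
```

Identical patterns are not necessarily wrong if the entries also differ in their target address, but in the posted table all of them point at math_00, so they would run the same instructions.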
Hope it helps,
Anthony.
No. I don't need JMPRET but to do so required a major change. I still think it's a shame we didn't have an equivalent in P2 to make conversion easier.
We also missed having the DIRx/OUTx/INx mapped into the same register space.
We have what we have, and that's great anyway!
I will likely be using XBYTE, but for now I need to insert some debug steps which affects the xbyte operation.
Do you mean that their addresses have changed?
F4+: Yes, I hadn't got that far when I posted the code. It now jumps to a different address.
Yes. For the P1 Spin Interpreter on P2 I have to re-map these registers (or re-compile).
Hindsight is wonderful
I've run out of 22 skip bits
So I need some more info please
Code in COG/LUT:
I presume when the call #\routine is not skipped, the skip bits are saved, the called routine executes, and when it returns the skip bits are restored?
Also, I presume I cannot use another skipf in the routine?
I presume when a jmp/djnz/etc #routine (relative jump) is not skipped, the routine will now execute using the remaining skip bits?
Correct.
Read that section of the Google Doc. I explained it all very carefully in there. It's almost too much to remember, offhand.
That all sounds correct.
I used XBYTE in my zpu interpreter, and to debug I provided an alternate implementation which mimics xbyte in software, something like:
*Edit*: fixed the order of rfbyte and getptr below
Then I could insert whatever debugging I wanted in the next_instruction loop, but since it's using the same structure as XBYTE I could easily toggle the hardware XBYTE on and off with an #ifdef.
Thanks,
Eric
"_RET_ SETQ2 {#}D" only affects the next XBYTE, then subsequent XBYTE operations revert to the main configuration that was established by the prior "_RET_ SETQ {#}D". Using multiple "_RET_ SETQ2 {#}D" in a row will keep things in the alternate mode, until a RET without a SETQ2 executes.
That's what I thought would happen, but it doesn't seem to be working for me. I need to look more closely, but it seems like the second SETQ2 is being ignored. I have a software emulation of XBYTE (similar to the loop I posted above, but with an additional temporary offset for the LUT table that gets used for the lookup and then reset before the execf) and that works as I expect, but when I switch to XBYTE it doesn't work.
Never mind, I think I found it: my original XBYTE emulation was doing: but in fact the hardware does it in the other order: so I was using the wrong value for pb in a few places. The docs actually do show it this way, I just didn't read them carefully enough. Sorry for the false alarm!
Periodic alarm is therapeutic. Keeps things circulating.
Moved to my P1 Spin Interpreter for P2 thread here
forums.parallax.com/discussion/comment/1466681/#Comment_1466681