If I understand correctly, this 8-bit penalty only occurs if you encounter 9 or more consecutive ones in the mask. If that's the kind of block skip that's being performed, it seems like a standard branch would be a better fit anyhow.
I think that is the cost for each 8-bit crossing, i.e. it cannot do 32 wide due to logic delays, so it instead does 4 lots of 8, which is 3 boundaries to cross, no matter what bit pattern is contained there.
Skip of all but the last opcode (skip31) would be 3 skips + 1 opcode timing.
This is pretty cool ( but I would say that, as I've suggested a SKIP before )
There was a DSP, the AL3101, designed for audio applications whose only flow control opcode was "SKIP condition N". Because you need a constant sample rate for audio, the DSP had a fixed clock rate which was 1024 times the audio sample rate. The DSP executed all 1024 instructions in the program memory in a loop at 1 cycle per instruction. The only conditional execution was the SKIP opcode which tested the flags and did a skip of between 1 and 255 instructions. The effect of the SKIP was to execute those intervening instructions as NOPs. Like that, you had absolute determinism in your timings.
It's a technique that could be applied to more general cases where you must get something done in a fixed time.
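The AL3101 scheme above can be sketched in software. This is a hypothetical model (the opcode names and flag handling are made up for illustration), but it shows the key property: a SKIP merely turns the next N slots into NOPs, so every pass through the 1024-slot program costs exactly the same number of cycles regardless of which branches are "taken".

```python
# Model of an AL3101-style fixed-rate loop: every pass executes all 1024
# slots at one cycle each; SKIP turns the next N instructions into NOPs.

def run_pass(program, flags):
    """Execute one pass; returns (cycles, list of ops actually executed)."""
    cycles = 0
    executed = []
    skip_remaining = 0
    for op in program:
        cycles += 1                      # every slot costs exactly one cycle
        if skip_remaining > 0:
            skip_remaining -= 1          # skipped slot behaves as a NOP
            continue
        if op[0] == "SKIP":              # ("SKIP", condition_name, N)
            if flags.get(op[1], False):
                skip_remaining = op[2]
        else:
            executed.append(op)
    return cycles, executed

# Pad a tiny program out to 1024 slots with NOPs, as the DSP would.
program = [("SKIP", "z", 2), ("ADD",), ("MUL",), ("OUT",)]
program += [("NOP",)] * (1024 - len(program))

c1, e1 = run_pass(program, {"z": True})   # condition met: ADD/MUL skipped
c2, e2 = run_pass(program, {"z": False})  # condition not met
assert c1 == c2 == 1024                   # absolute determinism in timing
```

Either way the flags fall, the pass takes 1024 cycles, which is exactly the determinism the post describes.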
Here's an example of how you can use SKIP without going insane.
Chip, I just read through that this evening. It's pretty darn clean. The use of constants makes a ton of sense.
I just want to point out, the little operators, ways we can specify data are powerful. The shift operators position the constants nicely, and the OR operators make for a clean, sane data table spec. Love it.
Looks like a big win to me. I know I can explain this to someone fairly quickly. We should be able to include some examples, perhaps just this one, maybe a driver-related one or two, in the more verbose new-user docs to come, and get people to the right place on this quickly.
Like anything really powerful, yeah one can make a mess. But, one can make something beautiful too.
Frankly, I find this easier to digest than I do a complicated set of branches, jumps, snippets and conditionals. In terms of vertical space, so important to see flow, it's hard to beat that body of code for what it does.
I realized that '_RET_ SKIP' was not working, due to the simultaneous jump. I got that fixed and tried out a bigger priority encoder and shifter, in order to be able to skip 15 instructions at once. It stuck out like a sore thumb in the timing analysis. So, I put things back to 7 instructions getting skipped. That's what we're going to have to live with.
It's still a potent option. And programmers do have lots of others, should that somehow end up a bottleneck.
It's 8x faster than just cancelling instructions. And it only even happens if there are more than seven 1's in a row. Most code can be jiggered around so that it never even happens. And in case you want to execute just one lone instruction amid many, there's always ALTI, which lets you execute a register as the next instruction.
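A guess at the penalty rule implied here (and by the timing notes later in the thread, this is an assumption, not a hardware spec): runs of up to 7 consecutive skipped instructions are free, and each full group of 8 within a longer run costs 2 extra clocks. A quick sketch:

```python
# Estimate the extra clocks incurred by long runs of 1s in a skip mask,
# assuming runs of <= 7 are free and each full group of 8 costs 2 clocks.
# This rule is inferred from the thread, not from silicon documentation.

def skip_penalty(mask):
    extra = 0
    run = 0
    while mask:
        if mask & 1:
            run += 1                 # extend the current run of skips
        else:
            extra += (run // 8) * 2  # close out the run, charge boundaries
            run = 0
        mask >>= 1
    extra += (run // 8) * 2          # charge any trailing run
    return extra

assert skip_penalty(0b1111111) == 0    # 7 in a row: free
assert skip_penalty(0b11111111) == 2   # 8 in a row: one boundary, +2 clocks
assert skip_penalty(0b0110110) == 0    # short scattered runs cost nothing
```

As the post says, most code can be arranged so that no run exceeds seven 1's and the penalty never triggers.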
Easy to live with and way better than nothing!
Chip,
Just a thought. Since this makes some 1 clock instructions, and only if easy to do...
Are the if_xxx instructions where the conditional is not met (i.e. the instruction is not executed) capable of being 1 clock?
If so, could there be a cog setting to enable these 1 clock instructions? I am thinking for determinism they would need to be kept to 2 clocks (hence the default) but if the user wanted speed instead then it could be cog enabled.
In a similar vein, are any of the ALT/AUG/modifier instructions capable of being 1 clock instructions?
Forget it. Just realised that this would really upset the pipeline code.
I'm not sure what your objection to lambda functions is. Maybe they're more awkward in C++ than they were in LISP.
Mostly readability. If you issue a callback, inside a few nested if's, and then pass an inline lambda function, it just makes the code harder to follow. I prefer a function pointer, or a derived interface unless it's something really simple.
As I said, maybe lambda in C++ is less elegant than it was in LISP. I have never tried the C++ version.
No, I think he's shifting the mask on each instruction fetch. The cost is the depth of the mux for testing the LSBs of the mask. This means you only hit the problem if there is a long (>7?) consecutive run of skips in those LSBs.
One thing that concerns me about this development is that we're adding hardware to optimize one particular kind of program (an interpreter). I actually think that if higher Spin performance is a major goal then we should think more out of the box and implement a real Spin compiler (oh wait, we have that already:)) or a JIT bytecode interpreter. No matter what hardware tricks we perform, interpreting the instructions in a loop over and over is always going to be slower than running COG instructions in that same loop. So translating the Spin codes into PASM at run time, for example, would be a much bigger win than adding more hardware support for the interpreter.
Heck, if you really think that byte codes are that important, why not build the byte codes into the hardware? That will improve performance much more than any specialized instructions for the interpreter.
I'm not saying "skip" is wrong or a bad idea, although it does have all kinds of potential side effects that make it a minefield for the programmer. But the overall process, where the focus is on optimizing one particular way of implementing one application, seems to me misguided.
What I really liked about P1 was how simple and elegant it was. I see P2 getting further and further away from that ideal. It seems like a classic second system effect.
Sorry, I guess I'm just venting here. I'm sure Chip will do whatever he wants, and it'll turn out the way it does. His track record so far is good, so I hope I'm wrong. Most of all I hope we get working silicon sometime this year...
Look at these snippets which read constants. I've harmonized them by making things like 'wz' consistent, even though it's not always needed:
con_bx  rfbyte  x       wz
        setcz   x       wc,wz
        decod   x
if_c    sub     x,#1
if_nz   not     x
_ret_   pusha   x

con_bp  rfbyte  x       wz
_ret_   pusha   x

con_bn  rfbyte  x       wz
if_nz   not     x
_ret_   pusha   x

con_wp  rfword  x       wz
_ret_   pusha   x

con_wn  rfword  x       wz
if_nz   not     x
_ret_   pusha   x

con_lg  rflong  x
_ret_   pusha   x
Those add up to 18 longs.
They can all be collapsed into 8 longs using SKIP:
con_x   rfbyte  x       wz      'bx bp bn  |  |  |
        rfword  x       wz      ' |  |  | wp wn  |
        rflong  x               ' |  |  |  |  | lg
        setcz   x       wc,wz   'bx  |  |  |  |  |
        decod   x               'bx  |  |  |  |  |
if_c    sub     x,#1            'bx  |  |  |  |  |
if_nz   not     x               'bx  | bn  | wn  |
_ret_   pusha   x               'bx bp bn wp wn lg
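The collapse can be modelled in a few lines of Python. This is an illustrative sketch, not hardware-accurate: each variant's mask enables a subset of the eight shared rows (a 0 bit executes, a 1 bit skips, LSB first), using the %-patterns from the CON section later in the thread.

```python
# Model of collapsing the six constant decoders into one body: each
# variant's skip mask selects which of the eight shared rows execute.

rows = ["rfbyte", "rfword", "rflong", "setcz", "decod", "sub", "not", "pusha"]

masks = {                      # values mirror the CON table in this thread
    "con_bx": 0b00000110,
    "con_bp": 0b01111110,
    "con_bn": 0b00111110,
    "con_wp": 0b01111101,
    "con_wn": 0b00111101,
    "con_lg": 0b01111011,
}

def executed_rows(mask):
    """Rows whose mask bit is 0 execute; rows with a 1 bit are skipped."""
    return [r for i, r in enumerate(rows) if not (mask >> i) & 1]

assert executed_rows(masks["con_bp"]) == ["rfbyte", "pusha"]
assert executed_rows(masks["con_bn"]) == ["rfbyte", "not", "pusha"]
assert executed_rows(masks["con_lg"]) == ["rflong", "pusha"]
```

Each mask reproduces exactly the column of the comment table above, which is how 18 longs become 8.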
The SKIP bit patterns to replicate the original functions (for use in the bytecode interpreter loop) are defined in the CON section further down.
Neat, but that example seems counterproductive to your argument for SKIP. You've saved 4 longs at best, and less if you are doing the same lookup technique as your earlier example. While the first block of code is longer, it is definitely easier to read/understand/maintain/modify. With the skip approach, you've added quite a bit of complexity to the code for minor savings.
I'd call 18 -> 8 longs/instructions a good saving.
The code to jump & skip will be a common routine, so they do not take any more longs.
But, they will all take longer to execute by about 8 clocks??? (setup + skip instruction plus at least 2 skip overs)
Is that worth 10 longs??? I am not convinced in this example.
The setup and skip will be common, and therefore effectively free.
Still checking on the skip overs but these appear to also be free???
BTW I can easily see how this works. It's not obscure at all. In fact I find it easier to read what the various bytecodes will do.
I'd call 18 -> 8 longs/instructions a good saving.
You're leaving out the skip instruction itself, plus code and data to calculate the skip mask. That may be non-trivial. Chip's example isn't apples to apples because it doesn't include all the code.
The code to jump & skip will be a common routine, so they do not take any more longs.
It's a common routine but it still needs to be there, i.e. yes it does take up space. We'd need to see just how complicated the skip mask calculation is to estimate how much overhead there is.
For example, IIRC the original Tachyon interpreter executes bytecodes very quickly because each bytecode is basically the address of the routine implementing it. The outer loop for such an interpreter is extremely small and fast, and skip is unlikely to be of much help.
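The address-threaded style attributed to Tachyon here can be sketched directly; Python callables stand in for cog addresses. This is a conceptual sketch only, with made-up routine names.

```python
# Address-threaded dispatch: each "bytecode" is simply the address (here,
# a Python callable) of the routine implementing it, so the interpreter's
# entire outer loop is just fetch-and-jump.

stack = []

def push1(): stack.append(1)
def push2(): stack.append(2)
def add():   stack.append(stack.pop() + stack.pop())

program = [push1, push2, add]   # the program *is* a list of routine "addresses"

for word in program:            # the whole outer loop: fetch, then jump
    word()

assert stack == [3]
```

With no decode step at all, there's little left for SKIP to accelerate, which is the point being made.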
CON con_bx = %00000110 << 9 ' SKIP patterns for con_x
con_bp = %01111110 << 9
con_bn = %00111110 << 9
con_wp = %01111101 << 9
con_wn = %00111101 << 9
con_lg = %01111011 << 9
offset_byte = %110 << 9 ' SKIP patterns for mem_rw
offset_word = %101 << 9
offset_long = %011 << 9
base_pbase = %110 << 12
base_vbase = %101 << 12
base_dbase = %011 << 12
read_byte = %0000_0110_1111 << 15
read_byte_indx = %0000_0110_0110 << 15
read_word = %0000_0101_1111 << 15
read_word_indx = %0000_0101_0100 << 15
read_long = %0000_0011_1111 << 15
read_long_indx = %0000_0011_0010 << 15
write_byte = %0000_1111_1111 << 15
write_byte_indx = %0000_1111_0110 << 15
write_word = %0010_1111_1111 << 15
write_word_indx = %0010_1111_0100 << 15
write_long = %0110_1111_1111 << 15
write_long_indx = %0110_1111_0010 << 15
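The shifted CON fields above occupy disjoint bit ranges, so a LUT entry is built by simply ORing a snippet address with the skip fields. A sketch (the mem_rw cog address used here is made up for illustration):

```python
# How the CON values compose: snippet address in the low 9 bits, skip
# fields ORed into disjoint ranges above it, per the thread's description.

mem_rw      = 0x040                      # hypothetical cog address of the snippet
offset_word = 0b101 << 9                 # values copied from the CON section
base_vbase  = 0b101 << 12
read_long   = 0b0000_0011_1111 << 15

entry = mem_rw | offset_word | base_vbase | read_long

snippet_addr = entry & 0x1FF             # 9 LSBs: where to CALL
skip_pattern = entry >> 9                # remaining bits: the SKIP mask

assert snippet_addr == mem_rw
assert skip_pattern == (offset_word | base_vbase | read_long) >> 9
```

Because the fields never overlap, the OR is lossless and the loop can split address from mask with one AND and one shift.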
DAT org 'cog space
'
'
' Interpreter loop
'
rep #1,#8 'pre-stuff stack with loop address
push #loop 'all uncalled _ret_'s will jump to loop
loop rfbyte p 'get program bytecode
rdlut i,p 'lookup long in lut
altd i 'get snippet address from 9 LSBs
push #0 'push snippet address
shr i,#9 'shift right to get 23-bit skip pattern
_ret_ skip i 'set skip pattern and execute snippet
'
'
' Constant snippets
'
con_n1 _ret_ pusha _FFFFFFFF
con_0 _ret_ pusha #0
con_1 _ret_ pusha #1
con_2 _ret_ pusha #2
con_3 _ret_ pusha #3
con_4 _ret_ pusha #4
con_7 _ret_ pusha #7
con_8 _ret_ pusha #8
con_15 _ret_ pusha #15
con_16 _ret_ pusha #16
con_31 _ret_ pusha #31
con_32 _ret_ pusha #32
con_x   rfbyte  x       wz      'bx bp bn  |  |  |
        rfword  x       wz      ' |  |  | wp wn  |
        rflong  x               ' |  |  |  |  | lg
        setcz   x       wc,wz   'bx  |  |  |  |  |
        decod   x               'bx  |  |  |  |  |
if_c    sub     x,#1            'bx  |  |  |  |  |
if_nz   not     x               'bx  | bn  | wn  |
_ret_   pusha   x               'bx bp bn wp wn lg
'
'
' Memory read/write snippet
'
mem_rw  rfbyte  m               'ob  |  |              offset
        rfword  m               ' | ow  |
        rflong  m               ' |  | ol
        add     m,pbase         'bp  |  |              base
        add     m,vbase         ' | bv  |
        add     m,dbase         ' |  | bd
        popa    x               ' | ib iw il           index
        shl     x,#1            ' |  | iw  |
        shl     x,#2            ' |  |  | il
        add     m,x             ' | ib iw il
        rdbyte  x,m             'rb  |  |              read
        rdword  x,m             ' | rw  |
        rdlong  x,m             ' |  | rl
_ret_   pusha   x               'rb rw rl
        popa    x               'wb ww wl              write
_ret_   wrbyte  x,m             'wb  |  |
_ret_   wrword  x,m             ' | ww  |
_ret_   wrlong  x,m             ' |  | wl
'
'
' Bytecode lookup values
'
org $200 'lut space
rw_table long con_n1
long con_0
long con_1
long con_2
long con_3
long con_4
long con_7
long con_8
long con_15
long con_16
long con_31
long con_32
long con_x |con_bx
long con_x |con_bp
long con_x |con_bn
long con_x |con_wp
long con_x |con_wn
long con_x |con_lg
long mem_rw |offset_byte|base_pbase|read_byte
long mem_rw |offset_word|base_pbase|read_byte
long mem_rw |offset_long|base_pbase|read_byte
long mem_rw |offset_byte|base_vbase|read_byte
long mem_rw |offset_word|base_vbase|read_byte
long mem_rw |offset_long|base_vbase|read_byte
long mem_rw |offset_byte|base_dbase|read_byte
long mem_rw |offset_word|base_dbase|read_byte
long mem_rw |offset_long|base_dbase|read_byte
long mem_rw |offset_byte|base_pbase|read_byte_indx
long mem_rw |offset_word|base_pbase|read_byte_indx
long mem_rw |offset_long|base_pbase|read_byte_indx
long mem_rw |offset_byte|base_vbase|read_byte_indx
long mem_rw |offset_word|base_vbase|read_byte_indx
long mem_rw |offset_long|base_vbase|read_byte_indx
long mem_rw |offset_byte|base_dbase|read_byte_indx
long mem_rw |offset_word|base_dbase|read_byte_indx
long mem_rw |offset_long|base_dbase|read_byte_indx
long mem_rw |offset_byte|base_pbase|read_word
long mem_rw |offset_word|base_pbase|read_word
long mem_rw |offset_long|base_pbase|read_word
long mem_rw |offset_byte|base_vbase|read_word
long mem_rw |offset_word|base_vbase|read_word
long mem_rw |offset_long|base_vbase|read_word
long mem_rw |offset_byte|base_dbase|read_word
long mem_rw |offset_word|base_dbase|read_word
long mem_rw |offset_long|base_dbase|read_word
long mem_rw |offset_byte|base_pbase|read_word_indx
long mem_rw |offset_word|base_pbase|read_word_indx
long mem_rw |offset_long|base_pbase|read_word_indx
long mem_rw |offset_byte|base_vbase|read_word_indx
long mem_rw |offset_word|base_vbase|read_word_indx
long mem_rw |offset_long|base_vbase|read_word_indx
long mem_rw |offset_byte|base_dbase|read_word_indx
long mem_rw |offset_word|base_dbase|read_word_indx
long mem_rw |offset_long|base_dbase|read_word_indx
long mem_rw |offset_byte|base_pbase|read_long
long mem_rw |offset_word|base_pbase|read_long
long mem_rw |offset_long|base_pbase|read_long
long mem_rw |offset_byte|base_vbase|read_long
long mem_rw |offset_word|base_vbase|read_long
long mem_rw |offset_long|base_vbase|read_long
long mem_rw |offset_byte|base_dbase|read_long
long mem_rw |offset_word|base_dbase|read_long
long mem_rw |offset_long|base_dbase|read_long
long mem_rw |offset_byte|base_pbase|read_long_indx
long mem_rw |offset_word|base_pbase|read_long_indx
long mem_rw |offset_long|base_pbase|read_long_indx
long mem_rw |offset_byte|base_vbase|read_long_indx
long mem_rw |offset_word|base_vbase|read_long_indx
long mem_rw |offset_long|base_vbase|read_long_indx
long mem_rw |offset_byte|base_dbase|read_long_indx
long mem_rw |offset_word|base_dbase|read_long_indx
long mem_rw |offset_long|base_dbase|read_long_indx
long mem_rw |offset_byte|base_pbase|write_byte
long mem_rw |offset_word|base_pbase|write_byte
long mem_rw |offset_long|base_pbase|write_byte
long mem_rw |offset_byte|base_vbase|write_byte
long mem_rw |offset_word|base_vbase|write_byte
long mem_rw |offset_long|base_vbase|write_byte
long mem_rw |offset_byte|base_dbase|write_byte
long mem_rw |offset_word|base_dbase|write_byte
long mem_rw |offset_long|base_dbase|write_byte
long mem_rw |offset_byte|base_pbase|write_byte_indx
long mem_rw |offset_word|base_pbase|write_byte_indx
long mem_rw |offset_long|base_pbase|write_byte_indx
long mem_rw |offset_byte|base_vbase|write_byte_indx
long mem_rw |offset_word|base_vbase|write_byte_indx
long mem_rw |offset_long|base_vbase|write_byte_indx
long mem_rw |offset_byte|base_dbase|write_byte_indx
long mem_rw |offset_word|base_dbase|write_byte_indx
long mem_rw |offset_long|base_dbase|write_byte_indx
long mem_rw |offset_byte|base_pbase|write_word
long mem_rw |offset_word|base_pbase|write_word
long mem_rw |offset_long|base_pbase|write_word
long mem_rw |offset_byte|base_vbase|write_word
long mem_rw |offset_word|base_vbase|write_word
long mem_rw |offset_long|base_vbase|write_word
long mem_rw |offset_byte|base_dbase|write_word
long mem_rw |offset_word|base_dbase|write_word
long mem_rw |offset_long|base_dbase|write_word
long mem_rw |offset_byte|base_pbase|write_word_indx
long mem_rw |offset_word|base_pbase|write_word_indx
long mem_rw |offset_long|base_pbase|write_word_indx
long mem_rw |offset_byte|base_vbase|write_word_indx
long mem_rw |offset_word|base_vbase|write_word_indx
long mem_rw |offset_long|base_vbase|write_word_indx
long mem_rw |offset_byte|base_dbase|write_word_indx
long mem_rw |offset_word|base_dbase|write_word_indx
long mem_rw |offset_long|base_dbase|write_word_indx
long mem_rw |offset_byte|base_pbase|write_long
long mem_rw |offset_word|base_pbase|write_long
long mem_rw |offset_long|base_pbase|write_long
long mem_rw |offset_byte|base_vbase|write_long
long mem_rw |offset_word|base_vbase|write_long
long mem_rw |offset_long|base_vbase|write_long
long mem_rw |offset_byte|base_dbase|write_long
long mem_rw |offset_word|base_dbase|write_long
long mem_rw |offset_long|base_dbase|write_long
long mem_rw |offset_byte|base_pbase|write_long_indx
long mem_rw |offset_word|base_pbase|write_long_indx
long mem_rw |offset_long|base_pbase|write_long_indx
long mem_rw |offset_byte|base_vbase|write_long_indx
long mem_rw |offset_word|base_vbase|write_long_indx
long mem_rw |offset_long|base_vbase|write_long_indx
long mem_rw |offset_byte|base_dbase|write_long_indx
long mem_rw |offset_word|base_dbase|write_long_indx
long mem_rw |offset_long|base_dbase|write_long_indx
Here's the loop:
loop rfbyte p 'get program bytecode
rdlut i,p 'lookup long in lut
altd i 'get snippet address from 9 LSBs
push #0 'push snippet address
shr i,#9 'shift right to get 23-bit skip pattern
_ret_ skip i 'set skip pattern and execute snippet
That takes 15 clocks to get into the snippets, with SKIP already running. Every bytecode branches to a snippet with a unique 23-bit SKIP field.
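The loop's split of each LUT entry into an address plus a skip mask can be modelled as follows. Assumptions, per the thread: snippet address in bits 8..0, skip pattern in the bits above; the snippet bodies here are Python stand-ins, not real PASM.

```python
# Software model of the dispatch loop: extract the 9-bit snippet address
# and the skip pattern from one LUT entry, then run the snippet with the
# skip pattern gating each of its instructions.

def dispatch(lut_entry, snippet_table):
    addr = lut_entry & 0x1FF          # altd/push: 9-bit snippet address
    skip = lut_entry >> 9             # shr i,#9: the skip pattern
    body = snippet_table[addr]        # list of instruction callbacks
    for i, instr in enumerate(body):  # a 1 bit in the pattern skips row i
        if not (skip >> i) & 1:
            instr()

out = []
snippets = {0x020: [lambda: out.append("rfbyte"),   # hypothetical snippet
                    lambda: out.append("rfword"),
                    lambda: out.append("pusha")]}

dispatch(0x020 | (0b010 << 9), snippets)   # pattern %010: skip the middle row
assert out == ["rfbyte", "pusha"]
```

One table entry thus selects both where to go and which instructions run once there.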
loop rfbyte p 'get program bytecode 5-18
rdlut i,p 'lookup long in lut 2
altd i 'get snippet address from 9 LSBs 2
push #0 'push snippet address 2
shr i,#9 'shift right to get 23-bit skip pattern 2
_ret_ skip i 'set skip pattern and execute snippet 2
------
overhead = 15-28
con_x   rfbyte  x       wz      'bx bp bn  |  |  |      2  2  2  0  0  0
        rfword  x       wz      ' |  |  | wp wn  |      0  0  0  2  2  0
        rflong  x               ' |  |  |  |  | lg      0  0  0  0  0  2
        setcz   x       wc,wz   'bx  |  |  |  |  |      2  0  0  0  0  0
        decod   x               'bx  |  |  |  |  |      2  0  0  0  0  0
if_c    sub     x,#1            'bx  |  |  |  |  |      2  0  0  0  0  0
if_nz   not     x               'bx  | bn  | wn  |      2  0  2  0  2  0
_ret_   pusha   x               'bx bp bn wp wn lg      2  2  2  2  2  2
                                                       -- -- -- -- -- --
                                                       12  4  6  4  6  4
So, in the above, there is no execution penalty for these small skips. If there are more than 8 skips in a row, add an extra 2 clocks for each set of 8.
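Those column totals are easy to check in software: with no long runs, each executed instruction costs 2 clocks and each skipped one costs 0. (Mask values copied from the CON section of the thread; the 2-clocks-per-instruction figure is the one quoted above.)

```python
# Verify the per-variant clock totals (12 4 6 4 6 4) from the skip masks,
# assuming 2 clocks per executed instruction and 0 per skipped one.

masks = {   # 0 = execute, 1 = skip (LSB = first of the 8 con_x rows)
    "bx": 0b00000110, "bp": 0b01111110, "bn": 0b00111110,
    "wp": 0b01111101, "wn": 0b00111101, "lg": 0b01111011,
}

def clocks(mask, rows=8):
    return 2 * sum(1 for i in range(rows) if not (mask >> i) & 1)

assert [clocks(masks[k]) for k in ("bx", "bp", "bn", "wp", "wn", "lg")] \
       == [12, 4, 6, 4, 6, 4]
```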
Shouldn't the _ret_ instructions be +3 clocks for reloading the pipeline? Also, I think it's if there are no more than 7 skips in a row, then you don't get the two clock penalty.
Clever! I like how you use the _ret_ and the pre-stuffed call stack.
Putting this in context to the discussion of processing constants earlier, it's reasonable to argue that the masked version really only takes 8 longs since you were still going to take up 6 longs in the LUT for the jump table either way.
This '_RET_' is an implied return that uses code %0000 of the conditional field. This means NOP cannot be $00000000, anymore, as it would become a RET instruction. Having implied RET's saves lots of instructions, especially in a situation like this where you have lots of snippets of code that get CALL'd. Each _RET_ saves two clocks and a long.
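A sketch of the encoding point being made here, assuming the 4-bit condition field sits in the top bits of the instruction word (an assumption based on this thread, not an encoding reference): once code %0000 is reassigned to _RET_, an all-zero word decodes as a return rather than a NOP.

```python
# Model of the condition-field change: cond %0000 now means "implied
# return", so $00000000 can no longer serve as NOP.

COND_SHIFT = 28                       # assumed position of the cond field

def cond_meaning(instr_word):
    cond = (instr_word >> COND_SHIFT) & 0xF
    return "_ret_" if cond == 0 else "if_cond(%s)" % bin(cond)

assert cond_meaning(0x0000_0000) == "_ret_"   # $00000000 can't be NOP anymore
assert cond_meaning(0xF000_0000) != "_ret_"   # %1111 = always execute
```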
So the _RET_ is free!
That can't be right, unless the _RET_ flag is being processed at the first stage of the pipeline. But that would have to come with a lot of caveats and edge cases! So, I'm assuming that the _RET_ is still being processed when the "host" instruction is executed. In which case...
You are only saving the cost of the additional RET instruction (i.e. "saves two clocks and a long"). But _RET_ should still cause the current pipelined instructions to be cancelled and the popped branch to be loaded, just like a RET instruction would. Looking at the instruction listing, RET costs 4 clocks. Presumably, that's 1 for executing and 3 for reloading the pipeline. In the case of something like "_ret_ skip #0", the SKIP will take two cycles to execute and the _ret_ will cause 3 additional cycles for reloading the pipeline.
No. I believe there is no clock penalty.
Remember, the fetch of the pop isn't from a cog register/memory, but from an internal register, so it can be gated straight to the instruction counter logic as well as to the cog address register.
If clock 1 fetches the instruction with _RET_, then clock 2 fetches the D and S operands. By clock 2 we also know the type of instruction. Provided it is not a jump type such as DJNZ, the RET address can be used as the next fetch address for clock 3, and at the same time gated to the increment logic for PC++. So for normal instructions, the RET will be fetching the instruction at the return address concurrently with the 2-clock ALU stage (clocks 3 and 4) of the _RET_ instruction.
In the DJNZ and similar, in clocks 3 & 4, the DJNZ will be assumed to be taken as usual, so it will fetch its jump-taken address (no ret). If it is determined during the DJNZ ALU execution that the jump is not to be taken, the pipe will be flushed and the PC effectively replaced with the _RET_ address. So again, it will cost the usual DJNZ-not-taken clocks (I think it's a 2 clock penalty?).
@Cluso99, take a look at Chip's main loop. He's performing a PUSH only two instructions beforehand. If _RET_ was processed early in the pipeline, it would POP before the PUSH occurred!
loop rfbyte p 'get program bytecode 5-18
rdlut i,p 'lookup long in lut 2
altd i 'get snippet address from 9 LSBs 2
push #0 'push snippet address 2
shr i,#9 'shift right to get 23-bit skip pattern 2
_ret_ skip i 'set skip pattern and execute snippet 2
------
overhead = 15-28
I can see this loop being able to be utilised in all sorts of interpreters, as well as in special lookup tables for executing special routines.
Therefore, what about 1 new instruction "execute p" as follows...
loop rfbyte p 'get program bytecode 5-18
execute p 2
------
overhead = 7-20
pipeline clocks for "execute p" would become:
1: rd "execute p" instruction
2: rd lut address "p" into internal 32-bit register "i" (maybe the same as the SETQ2 register??)
3: forward i[8:0] to PC and cog address so that this clock fetches the next instruction (ie the _RET_ goto)
load internal skip register with i[31:9] as skip[22:0] & set skip valid
4:
5: optional: wr i[31:0] to cog register "p"
6:
By directly forwarding "i" (= the contents of lut "p") into the return address and skip register, the push/pop is not required,
meaning neither are altd i and shr i,#9.
The "execute p" instruction effectively performs the following in 2 clocks (saving 8 clocks every bytecode)...
rdlut i,p 'lookup long in lut 2
altd i 'get snippet address from 9 LSBs 2
push #0 'push snippet address 2
shr i,#9 'shift right to get 23-bit skip pattern 2
_ret_ skip i 'set skip pattern and execute snippet 2
------
10
This would save 8 clocks overhead for every bytecode (down from 15-28).
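The arithmetic behind that claim, using the per-instruction figures quoted above:

```python
# Five 2-clock instructions folded into one 2-clock "execute p".
replaced = {"rdlut": 2, "altd": 2, "push": 2, "shr": 2, "ret_skip": 2}
saving = sum(replaced.values()) - 2          # "execute p" itself takes 2
assert saving == 8
assert (15 - saving, 28 - saving) == (7, 20) # overhead drops to 7-20 clocks
```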
You're leaving out the skip instruction itself, plus code and data to calculate the skip mask.
Though it can be used dynamically, the easy, best-case time and density savings are pretty obvious when the masks are known in advance. And that we have a LUT in this design is a big win. Code executing from there is like HUB code, in that the primary COG RAM is used for registers and code. Stuffing the skip patterns into LUT does for code what a LUT does for pixels.
Will be interesting to see how people end up using this dynamically.
As presented by Chip, it's really more like "case" for PASM than anything else. I can easily see people using a spreadsheet to optimize time / space critical COG code.
One thing that concerns me about this development is that we're adding hardware to optimize one particular kind of program (an interpreter). I actually think that if higher Spin performance is a major goal then we should think more out of the box and implement a real Spin compiler (oh wait, we have that already:)) or a JIT bytecode interpreter. No matter what hardware tricks we perform, interpreting the instructions in a loop over and over is always going to be slower than running COG instructions in that same loop.
Yes, and if we take a look at the more general case uses here, it's clear SKIP will deliver gains in COG-centric code overall, not just interpreter code. If it were a more specific case, like the instruction Cluso presented, that would actually center on interpreters, and your statement here would be a lot stronger. (Not that Cluso has a bad idea there -- it's just a great example of the dynamic we are discussing.)
Chip did it general purpose, potentially trading the very peak gains possible, for broad applicability. It's worth doing on that basis. Again, driver code that is COG centric will benefit from this. One super easy case to see is video / pixel crunching COG drivers. Both of those are modal, resolution, depth, including optional dynamically generated and displayed elements.
SKIP masks, if dynamic, can be generated during blanking times, and or with HUB code, keeping the time critical things in COG. Otherwise, SKIP patterns can be used to account for a broad range of resolutions, depths, display characteristics. While a lot of display uses will be simple, bitmap, tile, sprite, this hardware can do a ton more on a lot less overall HUB RAM when a dynamic display is part of the picture.
Edit: One case here, with video specifically, involves the NCO based timing. Deriving timing mathematically with a PLL is a lot easier. It's often just a divide and go, maybe rounding a little. The NCO comes with the need to manage jitter and or phase for similar display consistency across different modes, depths and timings. SKIP can be put to great use here. Responsive video drivers are harder and require more code / constants. SKIP will improve on the kinds of drivers people can build and the dynamic capabilities they may offer. Work out all the necessary cases, tabulate them with SKIP, and the end result is going to be compact, easy to USE. In some cases, easy to modify, add to, where warranted.
Doing these things mathematically and or with various overlays, both offer flexibility. But, it's either hard, or slow, and or take code space that must be managed, or leaves the user of the driver code in a position where they should commit to a mode for optimal performance. SKIP, applied well and in a way similar to what we are seeing Chip do with the interpreter, should get the driver closer to looking like hardware, able to respond dynamically.
Had this been in the P1 in some form, many video drivers would have been either one COG, or more responsive than they ended up being.
I don't see this feature as a general PASM case tool. Could be, depending on the programmer, but Chip is right. "Ninja level."
SKIP is for maximizing the hardware access and or presenting library / driver code to others that can perform flexibly and optimally without having to manage so many constants and or conditional code builds. On that point, this presents conditional code in a compelling way when contrasted with #IfDef type meta constructs. IMHO, of course. It has the advantage of being dynamic as well, should a developer choose to offer such to parallel uses of their code. (Have to avoid "downstream" here, as that's often a single processor view, and not inclusive on multi-processor environments.)
In the case of our SPIN interpreter, making sure it occupies a single COG makes a ton of sense. When people fire off concurrent instances of SPIN, it's going to be nice to see them work mostly like they do on P1. This is a multi-processor point of view optimization.
SKIP viewed from a single processor point of view can seem excessive and or troublesome. It can't be interrupted, etc... But, SKIP from a multi-processor point of view makes a lot of sense as it's generally going to be compartmentalized and purpose based when used. Most downstream and or parallel use cases won't involve getting into it, meaning the hard work of making sure it's used well and debugged is worth doing. A whole group of us could work something out, and once done, it's done! Super useful PASM objects. No different than what we did on P1, just more potent and flexible options are possible now.
For comparison, interrupts, or events as we are gonna end up calling them on the P2, work in a similar way. From a single processor point of view, they are trouble, and we've all identified why and how that's true. All valid. However, P2 is a concurrent multi-processor, and those events are processor unique. We can compartmentalize difficult code. Once that is done, reuse by others is very significantly improved over what we would expect in a single processor, or multi-core scenario lacking the COG memory isolation and concurrency P2 has.
This is a non trivial difference, and worth keeping in mind, IMHO.
Frankly, I find this easier to digest than I do a complicated set of branches, jumps, snippets and conditionals. In terms of vertical space, so important to see flow, it's hard to beat that body of code for what it does.
It's 8x faster than just cancelling instructions. And it only even happens if there are more than seven 1's in a row. Most code can be jiggered around so that it never even happens. And in case you want to execute just one lone instruction amid many, there's always ALTI, which lets you execute a register as the next instruction.
Chip,
Just a thought. Since this makes some 1 clock instructions, and only if easy to do...
Are the if_xxx instructions where the conditional is not met (ie instruction not executed) capable of being 1 clock?
If so, could there be a cog setting to enable these 1 clock instructions? I am thinking for determinism they would need to be kept to 2 clocks (hence the default) but if the user wanted speed instead then it could be cog enabled.
In a similar vein, are any of the ALT/AUG/modifier instructions capable of being 1 clock instructions?
Forget it. Just realised that this would really upset the pipeline code.
No, I think he's shifting the mask on each instruction fetch. The cost is the depth of the mux for testing the LSBs of the mask. This means you only hit the problem if there is a long (>7?) consecutive run of skips in those LSBs.
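To make that cost model concrete, here's a rough Python sketch (my own illustration of the mechanism described above, not Chip's actual logic) of stepping through a skip mask: runs of up to seven consecutive 1s are free, and each further group of 8 costs an extra 2 clocks, per the figures quoted in this thread.

```python
def skip_clocks(mask, exec_clocks=2, crossing_clocks=2):
    """Estimate clocks to step through a skip mask (LSB first).

    Executed instructions (mask bit 0) cost exec_clocks each.
    A run of skipped instructions (mask bit 1) is free up to 7 long;
    each further group of 8 costs crossing_clocks, per the thread's
    "2 extra clocks per set of 8 skips in a row" estimate.
    """
    total = 0
    run = 0                      # current run of consecutive skips
    for bit in mask:             # mask as an iterable of 0/1, LSB first
        if bit:
            run += 1
        else:
            total += crossing_clocks * (run // 8)  # boundary penalty
            run = 0
            total += exec_clocks                   # instruction executes
    total += crossing_clocks * (run // 8)          # trailing skip run
    return total

# 7 skips in a row: no penalty; the following instruction still costs 2.
print(skip_clocks([1] * 7 + [0]))    # -> 2
# skip31: 31 skips then one opcode = 3 crossings + 1 opcode, 6 + 2 clocks.
print(skip_clocks([1] * 31 + [0]))   # -> 8
```

This also matches the "skip31 = 3 skips + 1 opcode timing" figure quoted earlier, if each 8-bit boundary crossing costs one 2-clock step.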
Heck, if you really think that byte codes are that important, why not build the byte codes into the hardware? That will improve performance much more than any specialized instructions for the interpreter.
I'm not saying "skip" is wrong or a bad idea, although it does have all kinds of potential side effects that make it a minefield for the programmer. But the overall process, where the focus is on optimizing one particular way of implementing one application, seems to me misguided.
What I really liked about P1 was how simple and elegant it was. I see P2 getting further and further away from that ideal. It seems like a classic second system effect.
Sorry, I guess I'm just venting here. I'm sure Chip will do whatever he wants, and it'll turn out the way it does. His track record so far is good, so I hope I'm wrong. Most of all I hope we get working silicon sometime this year...
Eric
Those add up to 18 longs.
They can all be collapsed into 8 longs using SKIP:
Here are the SKIP bit patterns to replicate the original functions (for use in the bytecode interpreter loop):
I'd call 18 -> 8 longs/instructions a good saving.
The code to jump & skip will be a common routine, so they do not take any more longs.
But, they will all take longer to execute by about 8 clocks??? (setup + skip instruction plus at least 2 skip overs)
Is that worth 10 longs??? I am not convinced in this example.
The setup and skip will be common, and therefore effectively free.
Still checking on the skip overs but these appear to also be free???
BTW I can easily see how this works. It's not obscure at all. In fact I find it easier to read what the various bytecodes will do.
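On the "clean, sane data table" point, the shift-and-OR style generalizes nicely. Here's a hypothetical Python sketch of building such skip patterns (the slot names and the bytecode are invented for illustration; this is not Chip's actual table):

```python
# Hypothetical snippet layout: name each instruction slot in the shared
# snippet, then OR together the slots a given bytecode must SKIP over.
SLOT = {name: i for i, name in enumerate(
    ["rdlut", "add_x", "sub_x", "shl_x", "shr_x", "wrlut", "push", "pop"])}

def skip_mask(*skipped):
    """OR together one skip bit per named slot (shift positions the bit)."""
    mask = 0
    for name in skipped:
        mask |= 1 << SLOT[name]
    return mask

# e.g. a made-up "add" bytecode skips everything except rdlut/add_x/push:
ADD_MASK = skip_mask("sub_x", "shl_x", "shr_x", "wrlut", "pop")
print(f"{ADD_MASK:08b}")   # -> 10111100
```

Each bytecode's pattern is then just one readable line in the table, which is the same property being praised in the PASM source.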
It's a common routine but it still needs to be there, i.e. yes it does take up space. We'd need to see just how complicated the skip mask calculation is to estimate how much overhead there is.
For example, IIRC the original Tachyon interpreter executes bytecodes very quickly because each bytecode is basically the address of the routine implementing it. The outer loop for such an interpreter is extremely small and fast, and skip is unlikely to be of much help.
Eric
Here's the loop:
That takes 15 clocks to get into the snippets, with SKIP already running. Every bytecode branches to a snippet with a unique 23-bit SKIP field.
Is this timing correct?
So, in the above, there is no execution penalty for these small skips. If there are more than 8 skips in a row, add an extra 2 clocks for each set of 8.
Shouldn't the _ret_ instructions be +3 clocks for reloading the pipeline? Also, I think it's if there are no more than 7 skips in a row, then you don't get the two clock penalty.
Clever! I like how you use the _ret_ and the pre-stuffed call stack.
Putting this in the context of the earlier discussion of processing constants, it's reasonable to argue that the masked version really only takes 8 longs, since you were still going to take up 6 longs in the LUT for the jump table either way.
Sequence: Do this, then do that, then do...
Selection: If this do that else do the other...
Iteration: Keep doing this until that happens....
With SKIP it looks like Chip has squished Sequence and Selection down to a single instruction!
Can we also put loops in a block of potentially skipped instructions? That would be a hat trick!
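As a toy illustration of that collapse (plain Python standing in for PASM, with made-up micro-ops), one straight-line sequence plus a per-case skip mask gives you selection with no branches at all:

```python
# One shared straight-line sequence of micro-ops; each "bytecode" just
# supplies a skip mask saying which ops to pass over as NOPs.
def run(x, ops, mask):
    for i, op in enumerate(ops):
        if not (mask >> i) & 1:   # bit clear -> execute, bit set -> skip
            x = op(x)
    return x

OPS = [lambda x: x + 1,    # slot 0: increment
       lambda x: x * 2,    # slot 1: double
       lambda x: x - 3,    # slot 2: subtract 3
       lambda x: -x]       # slot 3: negate

INC_THEN_NEG = 0b0110      # skip slots 1 and 2
DOUBLE_SUB3  = 0b1001      # skip slots 0 and 3

print(run(10, OPS, INC_THEN_NEG))  # (10 + 1) negated -> -11
print(run(10, OPS, DOUBLE_SUB3))   # 10 * 2 - 3 -> 17
```

Same instruction stream, two masks, two different behaviors: sequence and selection in one mechanism. (Whether loops can live inside a skipped block on real silicon is the open question above.)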
That can't be right, unless the _RET_ flag is being processed at the first stage of the pipeline. But that would have to come with a lot of caveats and edge cases! So, I'm assuming that the _RET_ is still being processed when the "host" instruction is executed. In which case...
You are only saving the cost of the additional RET instruction (i.e. "saves two clocks and a long"). But _RET_ should still cause the current pipelined instructions to be cancelled and the popped branch address to be loaded, just like a RET instruction would have caused. Looking at the instruction listing, RET costs 4 clocks. Presumably, that's 1 for executing and 3 for reloading the pipeline. In the case of something like "_ret_ skip #0", the SKIP will take two cycles to execute and the _ret_ will cause 3 additional cycles for reloading the pipeline.
You can just do a DECODE and then a SKIP, right?
Nope. Totally different use cases.
Remember, the fetch of the pop isn't from a cog register/memory, but from an internal register, so it can be gated straight to the instruction counter logic as well as to the cog address register.
If clock 1 fetches the instruction with _RET_, then clock 2 fetches the D and S operands. Also, by clock 2 we know the type of instruction. Provided it is not a jump type such as DJNZ, the RET address can be used as the next fetch address for clock 3, and at the same time gated to the increment logic for PC++. So for normal instructions, the RET will be fetching the instruction at the return address concurrently with the 2 clock ALU stage (clocks 3 and 4) of the _RET_ instruction.
In the DJNZ and similar cases, in clocks 3 & 4, the DJNZ will be assumed to be taken as usual, so it will fetch its jump-taken address (no ret). If it is determined during the DJNZ ALU execution that the jump is not to be taken, the pipe will be flushed and the PC effectively replaced with the _RET_ address. So again, it will cost the DJNZ not-taken clocks (I think it's a 2 clock penalty?).
So in all cases, as Chip said, the _RET_ is free!
I could see this execution loop being used in all sorts of interpreters, as well as in special lookup tables for executing special routines.
Therefore, what about 1 new instruction "execute p" as follows... This would save 8 clocks overhead for every bytecode (down from 15-28).
Though it can be used dynamically, the easy, best-case time and density savings are pretty obvious when the masks are known in advance. And that we have a LUT in this design is a big win. Code executing from there is like HUB code, in that the primary COG RAM is used for registers and code. Stuffing the skip patterns into the LUT does for code what a LUT does for pixels.
Will be interesting to see how people end up using this dynamically.
As presented by Chip, it's really more like "case" for PASM than anything else. I can easily see people using a spreadsheet to optimize time / space critical COG code.
Yes, and if we take a look at the more general case uses here, it's clear SKIP will deliver gains in COG centric code overall, not just interpreter code. If it were a more specific case, like the instruction Cluso presented, one that would actually center on interpreters, then your statement here would be a lot stronger. (not that Cluso has a bad idea there -- it's just a great example of this dynamic we are discussing)
Chip did it general purpose, potentially trading the very peak gains possible for broad applicability. It's worth doing on that basis. Again, driver code that is COG centric will benefit from this. One super easy case to see is video / pixel crunching COG drivers. Both of those are modal across resolution and depth, including optional dynamically generated and displayed elements.
SKIP masks, if dynamic, can be generated during blanking times, and or with HUB code, keeping the time critical things in COG. Otherwise, SKIP patterns can be used to account for a broad range of resolutions, depths, display characteristics. While a lot of display uses will be simple, bitmap, tile, sprite, this hardware can do a ton more on a lot less overall HUB RAM when a dynamic display is part of the picture.
Edit: One case here, with video specifically, involves the NCO based timing. Deriving timing mathematically with a PLL is a lot easier. It's often just a divide and go, maybe rounding a little. The NCO comes with the need to manage jitter and or phase for similar display consistency across different modes, depths and timings. SKIP can be put to great use here. Responsive video drivers are harder and require more code / constants. SKIP will improve on the kinds of drivers people can build and the dynamic capabilities they may offer. Work out all the necessary cases, tabulate them with SKIP, and the end result is going to be compact, easy to USE. In some cases, easy to modify, add to, where warranted.
Doing these things mathematically and or with various overlays both offer flexibility. But it's either hard, or slow, and or takes code space that must be managed, or leaves the user of the driver code in a position where they should commit to a mode for optimal performance. SKIP, applied well and in a way similar to what we are seeing Chip do with the interpreter, should get the driver closer to looking like hardware, able to respond dynamically.
Had this been in the P1 in some form, many video drivers would have been either one COG, or more responsive than they ended up being.
I don't see this feature as a general PASM case tool. Could be, depending on the programmer, but Chip is right. "Ninja level."
SKIP is for maximizing the hardware access and or presenting library / driver code to others that can perform flexibly and optimally without having to manage so many constants and or conditional code builds. On that point, this presents conditional code in a compelling way when contrasted with #IfDef type meta constructs. IMHO, of course. It has the advantage of being dynamic as well, should a developer choose to offer such to parallel uses of their code. (Have to avoid "downstream" here, as that's often a single processor view, and not inclusive on multi-processor environments.)
In the case of our SPIN interpreter, making sure it occupies a single COG makes a ton of sense. When people fire off concurrent instances of SPIN, it's going to be nice to see them work mostly like they do on P1. This is a multi-processor point of view optimization.
SKIP viewed from a single processor point of view can seem excessive and or troublesome. It can't be interrupted, etc... But SKIP from a multi-processor point of view makes a lot of sense, as it's generally going to be compartmentalized and purpose-based when used. Most downstream and or parallel use cases won't involve getting into it, meaning the hard work of making sure it's used well and debugged is worth doing. A whole group of us could work something out, and once done, it's done! Super useful PASM objects. No different than what we did on P1; just more potent and flexible objects are possible now.
For comparison, interrupts, or events as we are gonna end up calling them on the P2, work in a similar way. From a single processor point of view, they are trouble, and we've all identified why and how that's true. All valid. However, P2 is a concurrent multi-processor, and those events are processor unique. We can compartmentalize difficult code. Once that is done, reuse by others is very significantly improved over what we would expect in a single processor, or multi-core scenario lacking the COG memory isolation and concurrency P2 has.
This is a non trivial difference, and worth keeping in mind, IMHO.