Hmm... That's the thing: no interrupts, but event driven. If a thread can wait on a pin or whatever, it effectively becomes an interrupt handler. Except that when the event fires and the thread continues, it has no effect on the execution of other threads. After all, there is no context to save, it has its own, and it does not steal execution time. Determinism is maintained.
The one hiccup that might be hard to avoid is that some thread will need to talk to cog RAM. That will cause a brief stall.
As I said, the advantages of WAITxx are that the chip consumes less power while waiting and that there is a bit less latency than polling.
Given the greater speed of the P2 I'm prepared to accept polling in this auto-threaded code if WAITs don't fit there.
If you want low power just stop the threads and then WAIT.
Not sure about the video waits though I have yet to ever use or think about them.
With 500k gates of synthesized logic, I don't think power consumption will be as rational or predictable as it is with Prop I. You will probably only get a 60% power reduction in a WAITCNT.
Ok good:) We don't have to worry about power.
I forgot that another advantage of WAITCNT over polling is removing timing jitter when bit-banging a serial protocol.
If a thread accessing the HUB causes jitter in its friends, a WAITCNT would help combat it.
Is this important?
And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB.
This seems like it would be a big problem for compilers. You wouldn't be able to assume that those registers were available all the time so the compiler would have to be able to generate code differently depending on whether it could use those registers and those registers are a big part of the added benefit of P2. Any chance of getting a separate copy of these for each thread? Or maybe this feature would be mostly useful for code written in PASM where it is possible to manage use of those registers manually.
I would go with four threads and a 16-slot, 32-bit table initially containing the value #0, giving thread zero 100%. This allows a nice extended slicing map.
That would work too, I guess it depends on the cost of 4 vs 8 in Silicon and Speed.
It does make sense to pack the config into 32 bits, so 4 threads : 16 slots is one fit, or 8 threads : 8 slots + 8 flags, or 16 threads : 8 slots.
Having more time slots is nice, as it allows finer allocation of resources, and one great side effect of a skewed allocation, like 15/16 & 1/16, is that you can overclock by 16/15 and get full-speed operation, AND have full debug access from the 'background' thread.
i.e. you get debug almost for free.
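To make the skewed allocation concrete, here is a minimal sketch of one way a 16-slot map of 2-bit thread IDs could pack into a single long, using jmg's 15/16 + 1/16 split; the label, the %% quaternary literal and the slot ordering are assumptions for illustration, not a defined P2 format:

' 16 slots x 2-bit thread ID = 32 bits, one slot consumed per clock (ordering assumed).
' Fifteen slots run thread 0 at nearly full speed; every 16th slot runs thread 1,
' the 'background' debug thread.
slotmap     long    %%1000_0000_0000_0000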
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
You could square up the timing with a WAITCNT now and then, but it's probably not worth doing.
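To illustrate Chip's "square up the timing" suggestion against the bit-banged serial example above, here is a minimal P1-style transmit loop; bit_period, txtarget, txmask, txdata and bitcount are placeholder names, not code from this thread:

        mov     txtarget, cnt           ' anchor the first bit edge to 'now'
        add     txtarget, bit_period
:bit    waitcnt txtarget, bit_period    ' park until this bit's edge, schedule the next one
        shr     txdata, #1  wc          ' next data bit into C
        muxc    outa, txmask            ' drive the pin a fixed two instructions after the edge
        djnz    bitcount, #:bit         ' any hub-access jitter is re-absorbed every bit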
These are all the waits there are: WAITVID, WAITCNT, WAITPEQ, WAITPNE. Are they so important?
For hardware timing, yes, WAITs are important.
A WAITxx opcode effectively removes the thread from the pipeline candidates: it is a single opcode, and it gates the increment of the PC on another event.
If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed?
Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1-clock granularity.
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use.
That caveat is fine for Debug use, as the Debug handler will always be in ASM, and small.
I guess that's not surprising. As you say, the extra threads can be used for hand-coded assembler or the entire COG can be dedicated to handling multiple devices. It would let us cram even more "soft peripherals" into a single Propeller chip!
David,
I had already been wondering if a compiler would ever use those IND/PTR things?
Surely they are of no use to code compiled to LMM as they only index COG memory?
Then if you are writing C for native in-COG code you are not really fishing for maximum speed.
For hardware timing, yes, WAITs are important.
A WAITxx opcode effectively removes the thread from the pipeline candidates: it is a single opcode, and it gates the increment of the PC on another event.
If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed?
Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1-clock granularity.
Too messy. Best to have four full pipes instead and get all the benefits. The pipeline is only four stages long so it isn't that much more compared to having special case for the WAITs.
Having now read the other replies I suppose a two instruction polling loop is not so bad really.
Hmmm, code that works in testing but breaks badly with the flip of a single config bit ... Lots of traps for young players in the Prop2.
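For reference, the two-instruction polling loop being weighed against WAITPEQ would look something like this (rxmask is a placeholder for the pin bit being watched):

:poll   test    rxmask, ina  wz         ' sample the pin
  if_z  jmp     #:poll                  ' spin in this task's own time slots until it goes high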
I believe INDA and INDB are used for cog memory, and PTRA and PTRB are used for hub memory. From what I can tell, it looks like the machine description for GCC includes index registers, auto-increment and decrement and index-offset limitations. So it seems like GCC will be able to use the index registers. However, one of the PTR registers may be used by the LMM/XMM interpreter, so there may be only one PTR register available to the user program.
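For a sense of what the compiler would be mapping onto, these are the addressing forms in question, as they appear in Chip's SHA-256 code later in the thread; treating RDLONG with an auto-incrementing PTRA as available is an assumption here:

        rdlong  x, ptra++               ' hub read at PTRA, then PTRA advances by 4 (array walk)
        rdbytec y, ptrb++               ' cached hub byte read at PTRB, then PTRB advances by 1
        mov     x, inda++               ' cog-register read via INDA, then INDA advances by 1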
Wow! I have been out for the day and missed a fantastic discussion.
BTW Chip did you miss my SD boot idea or is it out of the question? (post #252)
I like the 4 tasks using 1 in 4 clocks and am quite happy to not be able to use waitcnts and perhaps a few other instructions. I really don't see multi-tasking in a video cog because we are always short on time and space, so waitvid isn't a problem. I realise we have quad-long fetches and much faster instructions, but I expect we will just find extra things to do in this time.
Now for a later P2B we will be asking for those multi-threads to also have their own cog memory too, excepting a small window of common cog RAM for inter-task comms.
Dave: LMM is not going to be able to use the REPS instruction and perhaps some others.
On pedward's suggestion, I've modified the SHA-256 and added HMAC into it. I also made it byte-level, so it can hash/HMAC any size strings. It's 229 longs:
'************************
'* SHA-256 + HMAC *
'* (byte-level) *
'************************
org
sha_256 setf #%0_1111_0000 'configure movf for sbyte0 -> {dbyte3,dbyte2,dbyte1,dbyte0,dbyte3,...}
call #init_hash 'init hash, clear hmac mode, reset byte count
'
'
' Command loop
'
sha_command rdlong x,ptra 'wait for command (%cc_nnnnnnnnnnnnn_ppppppppppppppppp)
tjz x,#sha_command
cachex 'invalidate cache for fresh rdbytec's
setptrb x 'get byte pointer into ptrb
mov count,x 'get byte count
shl count,#2
shr count,#2+17
add count,#1
shr x,#32-2 'get command (0 = terminate)
djz x,#begin_hmac '1 = begin hmac, bits[16..0] = @key (64 bytes)
djz x,#hash_bytes '2 = hash bytes, bits[16..0] = @message (n+1 bytes), bits[29..17] = n (0..8191)
djz x,#read_hash '3 = read hash, bits[16..0] = @hashbuffer (32 bytes)
'
'
' Terminate
'
terminate wrlong zero,ptra 'clear command to signal done
cogid zero 'get cog (d=0 in case fuses not yet hidden)
cogstop zero 'stop cog
'
'
' Begin hmac
'
begin_hmac call #end_hash 'end any hash in progress
mov count,#64 'get and hash ipad key
:ipad rdbytec x,ptrb++
xor x,#$36
call #hash_byte '(last iteration triggers hash_block)
djnz count,#:ipad
reps #16,#2 'save opad key
setinds opad_key,w
mov indb,inda++
xor indb++,opad
mov hmac,#1 'set hmac mode
sha_done wrlong zero,ptra 'clear command to signal done
jmp #sha_command 'get next command
'
'
' Hash bytes
'
hash_bytes rdbytec x,ptrb++ 'hash bytes
call #hash_byte
djnz count,#hash_bytes
jmp #sha_done
'
'
' Read hash
'
read_hash tjz hmac,#:not 'if not hmac, output hash
call #end_hash 'hmac, end current hash
reps #16,#1 'get opad key into w[0..15]
setinds w,opad_key
mov indb++,inda++
call #hash_block 'hash opad key
reps #8,#1 'get hashx[0..7] into w[0..7]
setinds w,hashx
mov indb++,inda++
movd hash_byte,#w+8 'account for opad key and hashx bytes
mov bytes,#64+32
:not call #end_hash 'end current hash
setinda hashx 'store hashx[0..7] at pointer
mov count,#8
:out reps #4,#2
mov x,inda++
rol x,#8
wrbyte x,ptrb++
djnz count,#:out
jmp #sha_done
'
'
' End hash
'
end_hash mov length,bytes 'get message length in bits
shl length,#3
mov x,#$80 'hash end-of-message $80 byte
:fill call #hash_byte '(may trigger hash_block)
mov x,bytes 'until at last 8 bytes of block, hash $00 bytes
and x,#$3F
cmp x,#$38 wz
mov x,#$00
if_nz jmp #:fill
mov count,#8 'hash eight length bytes
:len cmp count,#4 wz
if_z mov x,length '($00 for first 4 bytes, then length)
rol x,#8
call #hash_byte '(last iteration triggers hash_block)
djnz count,#:len
reps #8,#1 'save hash[0..7] into hashx[0..7]
setinds hashx,hash
mov indb++,inda++
init_hash reps #8,#1 'copy hash_init[0..7] into hash[0..7]
setinds hash,hash_init
mov indb++,inda++
mov hmac,#0 'clear hmac mode
mov bytes,#0 'reset byte count
init_hash_ret
end_hash_ret ret
'
'
' Hash byte - add byte to w[0..15] and hash block if full
'
hash_byte movf w,x 'add byte to w[0..15] as byte[3..0]
add bytes,#1 'increment byte count
test bytes,#$03 wz 'every 4th byte, increment w pointer
if_z add hash_byte,d0
test bytes,#$3F wz 'every 64th byte, reset w pointer
if_z movd hash_byte,#w
if_z call #hash_block 'every 64th byte, hash block
hash_byte_ret ret
'
'
' Hash Block - first extend w[0..15] into w[16..63] to generate schedule
'
hash_block reps #48,@:sch 'i = 16..63
setinds w+16,w+16-15+7 'indb = @w[i], inda = @w[i-15+7]
setinda --7 's0 = (w[i-15] -> 7) ^ (w[i-15] -> 18) ^ (w[i-15] >> 3)
mov indb,inda--
mov x,indb
rol x,#18-7
xor x,indb
ror x,#18
shr indb,#3
xor indb,x
add indb,inda 'w[i] = s0 + w[i-16]
setinda ++14 's1 = (w[i-2] -> 17) ^ (w[i-2] -> 19) ^ (w[i-2] >> 10)
mov x,inda
mov y,x
rol y,#19-17
xor y,x
ror y,#19
shr x,#10
xor x,y
add indb,x 'w[i] = s0 + w[i-16] + s1
setinda --5 'w[i] = s0 + w[i-16] + s1 + w[i-7]
:sch add indb++,inda
' Load variables from hash
reps #8,#1 'copy hash[0..7] into a..h
setinds a,hash
mov indb++,inda++
' Do 64 hash iterations on variables
reps #64,@:itr 'i = 0..63
setinds k+0,w+0 'indb = @k[i], inda = @w[i]
mov x,g 'ch = (e & f) ^ (!e & g)
xor x,f
and x,e
xor x,g
mov y,e 's1 = (e -> 6) ^ (e -> 11) ^ (e -> 25)
rol y,#11-6
xor y,e
rol y,#25-11
xor y,e
ror y,#25
add x,y 't1 = ch + s1
add x,indb++ 't1 = ch + s1 + k[i]
add x,inda++ 't1 = ch + s1 + k[i] + w[i]
add x,h 't1 = ch + s1 + k[i] + w[i] + h
mov y,c 'maj = (a & b) ^ (b & c) ^ (c & a)
and y,b
or y,a
mov h,c
or h,b
and y,h
mov h,a 's0 = (a -> 2) ^ (a -> 13) ^ (a -> 22)
rol h,#13-2
xor h,a
rol h,#22-13
xor h,a
ror h,#22
add y,h 't2 = maj + s0
mov h,g 'h = g
mov g,f 'g = f
mov f,e 'f = e
mov e,d 'e = d
mov d,c 'd = c
mov c,b 'c = b
mov b,a 'b = a
add e,x 'e = e + t1
mov a,x 'a = t1 + t2
:itr add a,y
' Add variables back into hash
reps #8,#1 'add a..h into hash[0..7]
setinds hash,a
add indb++,inda++
hash_block_ret ret
'
'
' Defined data
'
zero long 0
d0 long 1 << 9
opad long $36363636 ^ $5C5C5C5C
hash_init long $6A09E667, $BB67AE85, $3C6EF372, $A54FF53A, $510E527F, $9B05688C, $1F83D9AB, $5BE0CD19 'fractionals of square roots of primes 2..19
k long $428A2F98, $71374491, $B5C0FBCF, $E9B5DBA5, $3956C25B, $59F111F1, $923F82A4, $AB1C5ED5 'fractionals of cube roots of primes 2..311
long $D807AA98, $12835B01, $243185BE, $550C7DC3, $72BE5D74, $80DEB1FE, $9BDC06A7, $C19BF174
long $E49B69C1, $EFBE4786, $0FC19DC6, $240CA1CC, $2DE92C6F, $4A7484AA, $5CB0A9DC, $76F988DA
long $983E5152, $A831C66D, $B00327C8, $BF597FC7, $C6E00BF3, $D5A79147, $06CA6351, $14292967
long $27B70A85, $2E1B2138, $4D2C6DFC, $53380D13, $650A7354, $766A0ABB, $81C2C92E, $92722C85
long $A2BFE8A1, $A81A664B, $C24B8B70, $C76C51A3, $D192E819, $D6990624, $F40E3585, $106AA070
long $19A4C116, $1E376C08, $2748774C, $34B0BCB5, $391C0CB3, $4ED8AA4A, $5B9CCA4F, $682E6FF3
long $748F82EE, $78A5636F, $84C87814, $8CC70208, $90BEFFFA, $A4506CEB, $BEF9A3F7, $C67178F2
'
'
' Undefined data
'
hmac res 1
bytes res 1
count res 1
length res 1
opad_key res 16
hash res 8
hashx res 8
w res 64
a res 1
b res 1
c res 1
d res 1
e res 1
f res 1
g res 1
h res 1
x res 1
y res 1
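For reference, a rough sketch of how another cog might drive the command loop above, inferred from the rdlong/wrlong handshake at sha_command and sha_done; the names mailbox, msg_ptr, msg_len and cmd_hash are placeholders, and the calling cog's own setup is assumed:

' Build %cc_nnnnnnnnnnnnn_ppppppppppppppppp: cc = 2 ('hash bytes'),
' bits[29..17] = byte count - 1, bits[16..0] = hub address of the message.
        mov     cmd, msg_len
        sub     cmd, #1
        shl     cmd, #17
        or      cmd, msg_ptr
        or      cmd, cmd_hash           ' command code in the top two bits
        wrlong  cmd, mailbox            ' mailbox = the hub long the SHA cog reads via ptra
:wait   rdlong  cmd, mailbox  wz        ' the cog writes 0 back when the command is done
  if_nz jmp     #:wait

cmd_hash    long    2 << 30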
RDxxxx/WRxxxx will work on all the I/O registers - don't worry.
I think the 8 executable registers at $1F8..$1FF are too much trouble to set up for regular I/O write blocking and special writing to make them useful as instruction locations. They only represent 1/64th of the executable memory, anyway.
To add something like this may not take more than one day, and it would add maybe several hours to the synthesis work, at this point, at $175/hr.
For this to work, you would have to avoid using instructions like WAITxxx or REPS that either stall or mess with the pipeline. A stall would just be ugly, with respect to other tasks, but instructions that toy with the pipeline would wreak havoc. 'Just some stuff you'd need to take into consideration when programming multiple tasks. And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB. Memory accesses would cause brief stalls. The cache wouldn't mind, though.
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
We've got P1 code that uses the WHOP (Waitvid Hand Off Point) successfully. (Kurenko was successful doing this) That's not polling, more like synchronization. The key thing is the waitvid latch isn't really used. A similar technique could apply here, though it would be complex. Deffo manual PASM, but possible to do video and have the threads anyway. Just fire off the waitvid after synching up, then it does its thing without stalling execution. Another waitvid instruction simply won't be executed by any COG thread, unless there is some compelling event requiring a major change.
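A bare-bones sketch of that hand-off idea, adapted to the polling theme of this thread; the handoff and lineticks values, and the assumption that the WAITVID returns almost immediately once the shifter is about to go idle, are illustration only, not the actual driver:

:sync   mov     t, cnt                  ' poll CNT rather than stalling the whole cog
        sub     t, handoff
        cmps    t, #0  wc
  if_c  jmp     #:sync                  ' not at the hand-off point yet
        waitvid colors, pixels          ' shifter is just going idle: this returns right away
        add     handoff, lineticks      ' schedule the next hand-off point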
Chip,
I cannot see the complete code on my phone here but that tasksw looks really sweet.
Now that you have a context switching mechanism is there a simple way to get task switch to happen automatically on every instruction? So two tasks would be able to run at half normal rate each. No overhead of having to read and execute a tasksw instruction. To keep it simple there would be no priority mechanism.
In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.
I wish I had thought about this earlier, because it might have been somewhat trivial to have an array of 8 program counters and z/c flags that could be switched among. Man, that's pretty compelling! Ask yourself this: if instructions floated through the pipeline that each represented a different pc/z/c, would it matter, as long as appropriate pc/z/c's were updated at the end of each instruction? Would the registers care? I don't think so, but it would take a little consideration to know for sure.
In context to the WAITxxx commands, it would be nice to have a version that executes TASKSW if it was to block. The idea is that you yield control if you were to block. When the task returns to that instruction, it continues to yield if blocked. This is how you would handle blocking in traditional threading, you yield control if you were to waste cycles. The caveat is that it won't be cycle accurate, but if it's WAITVID, perhaps the data could be buffered and handed off. WAITCNT wouldn't be accurate, no way around that. WAITPxx could potentially be buffered, time sensitivity not so important.
It kinda gets into a bunch of specialist exceptions that make the use case more narrow.
The WAITs are special cases, and to avoid stalls, you would need to duplicate the PC + WAIT state engine once per thread.
Once you have done that, it does not matter so much if there are four pipes, or tag bits on the contents, whichever actually works with the smallest silicon.
Four pipes is likely to have fewer surprises, but it is starting to sound silicon-costly?
With four pipes comes four ALU's (HUGE area), so this is out of the question. We wouldn't need them, anyway, to get 99% of the functional equivalence by just mux'ing PC/Z/C's.
The trouble with blocking, which means attempting to re-execute the same instruction on the next time slot for the same task, is that the pipe already has, potentially, other instructions in it that belong to that same task, intermingled with other tasks' instructions. This would mean all kinds of pipeline reconstruction would have to be done, which would not be worth doing. Better to make polling options for instructions that otherwise stall the pipe. The pipeline is like a freight train that only goes one way.
Certainly better than nothing though.
WAITVID dst,src NR WC - polling version, use C to return the wait status, does not actually wait
Actually, it could be generic:
WAITxxxx dst,src NR WC - polling version, use C to return the wait status, does not actually wait
Edit: Just saw Bill's post. Yeah, seconded.
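If a polled form like that existed, use in a multitasked cog might look like this; the NR WC encoding is exactly the proposal above and purely hypothetical, and the C polarity and whether the polled form also latches the data are open questions:

:vid    waitvid colors, pixels  nr,wc   ' poll only: C reports whether the shifter could take data
  if_nc jmp     #:vid                   ' not ready yet -- only this task's slots are spent spinning
        waitvid colors, pixels          ' ready now, so this should return without stalling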
It's deja vu all over again:
http://forums.parallax.com/showthread.php?106059&p=746960&viewfull=1#post746960
and the discussion following.
-Phil
I think we're all on a big Merry-Go-Round, or something. When, exactly, will the Chinese be taking over?
I think that "Merry-Go-Round" deeds for You to fresh up Yours ideas !!