Propeller II - Page 11 — Parallax Forums

Propeller II


Comments

  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 04:03
    Heater. wrote: »
    Hmm.. That's the thing, no interrupts but event driven. If a thread can wait on a pin or whatever, it effectively becomes an interrupt handler. Except that when the event fires and the thread continues, it has no effect on the execution of other threads. After all, there is no context to save (it has its own) and it does not steal execution time. Determinism is maintained.

    The one hiccup that might be hard to avoid is that some thread will need to talk to cog ram. That will cause a brief stall.
  • Heater.Heater. Posts: 21,230
    edited 2012-08-11 04:06
    As I said, the advantages of WAITxx are that the chip consumes less power while waiting and has a bit less latency than polling.
    Given the greater speed of the P2, I'm prepared to accept polling in this auto-threaded code if WAITs don't fit there.
    If you want low power, just stop the threads and then WAIT.
    Not sure about the video waits, though; I have yet to use or even think about them.
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 04:10
    Heater. wrote: »
    As I said, the advantages of WAITxx are that the chip consumes less power while waiting and has a bit less latency than polling.
    Given the greater speed of the P2, I'm prepared to accept polling in this auto-threaded code if WAITs don't fit there.
    If you want low power, just stop the threads and then WAIT.
    Not sure about the video waits, though; I have yet to use or even think about them.

    With 500k gates of synthesized logic, I don't think power consumption will be as rational or predictable as it is with Prop I. You will probably only get a 60% power reduction in a WAITCNT.
  • Heater.Heater. Posts: 21,230
    edited 2012-08-11 04:24
    Ok good:) We don't have to worry about power.
    I forgot that another advantage of WAITCNT over polling is removing timing jitter when bit-banging a serial protocol.
    If a thread accessing the HUB causes jitter in its friends, a WAITCNT would help combat it.
    Is this important?
  • David BetzDavid Betz Posts: 14,516
    edited 2012-08-11 04:29
    cgracey wrote: »
    And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB.
    This seems like it would be a big problem for compilers. You wouldn't be able to assume that those registers were always available, so the compiler would have to generate code differently depending on whether it could use them, and those registers are a big part of the added benefit of P2. Any chance of getting a separate copy of these for each thread? Or maybe this feature would be mostly useful for code written in PASM, where it is possible to manage use of those registers manually.
  • jmgjmg Posts: 15,182
    edited 2012-08-11 04:39
    evanh wrote: »
    I would go four threads with a 16 slot 32 bit table initially containing #0 value giving thread zero 100%. This allows a nice extended slicing map.

    That would work too, I guess it depends on the cost of 4 vs 8 in Silicon and Speed.
    It does make sense to pack config to 32 bits, so 4 Thread : 16 slots is one fit, or 8 threads, 8 slots + 8 flags, or 16 threads, 8 slots

    Having more time slots is nice, as it allows finer allocation of resources. One great side effect of a skewed allocation, like 15/16 & 1/16, is that you can overclock by 16/15 and get full-speed operation AND have full debug access from the 'background' thread.
    ie you get debug almost for free.
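The slot-table idea in this exchange can be modelled on a host PC. The following is only a sketch in Python, not anything from the chip: the field order (slot 0 in the low bits), the 2-bits-per-slot packing, and all function names are assumptions made for illustration.

```python
# Hypothetical model of a 16-slot, 4-thread time-slice table packed into one
# 32-bit long (2 bits per slot). Field order and all names are assumptions.

SLOTS = 16
BITS_PER_SLOT = 2


def pack_slots(assignments):
    """Pack 16 thread numbers (0..3) into a 32-bit value, slot 0 in bits 1..0."""
    if len(assignments) != SLOTS:
        raise ValueError("need exactly 16 slot assignments")
    value = 0
    for slot, thread in enumerate(assignments):
        if not 0 <= thread <= 3:
            raise ValueError("thread number must be 0..3")
        value |= thread << (slot * BITS_PER_SLOT)
    return value


def thread_for_clock(table, clock):
    """Return which thread owns the time slot used on a given clock."""
    slot = clock % SLOTS
    return (table >> (slot * BITS_PER_SLOT)) & 0b11


def share(table, thread):
    """Fraction of all time slots that belong to one thread."""
    hits = sum(1 for s in range(SLOTS) if thread_for_clock(table, s) == thread)
    return hits / SLOTS


# An all-zero table gives thread 0 every slot, as evanh suggests.
default_table = pack_slots([0] * SLOTS)

# jmg's skewed split: 15/16 for the main thread, 1/16 for a debug thread.
debug_table = pack_slots([0] * 15 + [1])
```

With `debug_table`, `share(debug_table, 0)` is 15/16, so raising the clock by a factor of 16/15 would bring the main thread back to its original instruction rate while the debug thread keeps its one slot in sixteen.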
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 04:46
    David Betz wrote: »
    This seems like it would be a big problem for compilers. You wouldn't be able to assume that those registers were always available, so the compiler would have to generate code differently depending on whether it could use them, and those registers are a big part of the added benefit of P2. Any chance of getting a separate copy of these for each thread? Or maybe this feature would be mostly useful for code written in PASM, where it is possible to manage use of those registers manually.

    Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-coded assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 04:47
    Heater. wrote: »
    Ok good:) We don't have to worry about power.
    I forgot that another advantage of WAITCNT over polling is removing timing jitter when bit-banging a serial protocol.
    If a thread accessing the HUB causes jitter in its friends, a WAITCNT would help combat it.
    Is this important?

    You could square up the timing with a WAITCNT now and then, but it's probably not worth doing.
  • jmgjmg Posts: 15,182
    edited 2012-08-11 04:54
    cgracey wrote: »
    These are all the waits there are: WAITVID, WAITCNT, WAITPEQ, WAITPNE. Are they so important?

    For hardware timing, yes, WAITs are important.
    A WAITxx opcode effectively removes the thread from the pipeline candidates, as it is a single opcode, and it paces an INC of PC, on another event.
    If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed ?

    Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1CLK granularity.
  • jmgjmg Posts: 15,182
    edited 2012-08-11 04:56
    cgracey wrote: »
    Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-coded assembly use.

    That caveat is fine for Debug use, as the Debug handler will always be in ASM, and small.
  • David BetzDavid Betz Posts: 14,516
    edited 2012-08-11 05:18
    cgracey wrote: »
    Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-coded assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
    I guess that's not surprising. As you say, the extra threads can be used for hand-coded assembler or the entire COG can be dedicated to handling multiple devices. It would let us cram even more "soft peripherals" into a single Propeller chip!
  • Heater.Heater. Posts: 21,230
    edited 2012-08-11 05:24
    David,
    I had already been wondering if a compiler would ever use those IND/PTR things?
    Surely they are of no use to code compiled to LMM as they only index COG memory?
    Then if you are writing C for native in COG code you are not really fishing for maximum speed.
  • evanhevanh Posts: 16,086
    edited 2012-08-11 05:50
    jmg wrote: »
    For hardware timing, yes, WAITs are important.
    A WAITxx opcode effectively removes the thread from the pipeline candidates, as it is a single opcode, and it paces an INC of PC, on another event.
    If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed ?

    Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1CLK granularity.

    Too messy. Best to have four full pipes instead and get all the benefits. The pipeline is only four stages long so it isn't that much more compared to having special case for the WAITs.

    Having now read the other replies I suppose a two instruction polling loop is not so bad really.

    Hmmm, code that works in testing but breaks badly with the flip of a single config bit ... Lots of traps for young players in the Prop2.

    Certainly better than nothing though.
  • Dave HeinDave Hein Posts: 6,347
    edited 2012-08-11 06:05
    Heater. wrote: »
    I had already been wondering if a compiler would ever use those IND/PTR things?
    Surely they are of no use to code compiled to LMM as they only index COG memory?
    Then if you are writing C for native in COG code you are not really fishing for maximum speed.
    I believe INDA and INDB are used for cog memory, and PTRA and PTRB are used for hub memory. From what I can tell, it looks like the machine description for GCC includes index registers, auto-increment and decrement and index-offset limitations. So it seems like GCC will be able to use the index registers. However, one of the PTR registers may be used by the LMM/XMM interpreter, so there may be only one PTR register available to the user program.
  • Cluso99Cluso99 Posts: 18,069
    edited 2012-08-11 06:47
    Wow! I have been out for the day and missed a fantastic discussion.

    BTW Chip did you miss my SD boot idea or is it out of the question? (post #252)

    I like the 4 tasks using 1 in 4 clocks and am quite happy not to be able to use waitcnts and perhaps a few other instructions. I really don't see being able to multi-task in a video cog, because we are always short on time and space, so waitvid isn't a problem. I realise we have quad-long fetches and much faster instructions, but I expect we will just find extra things to do in this time.

    Now for a later P2B we will be asking for those multi-threads to also have their own cog memory too, excepting a small window of common cog ram for inter-task comms ;)

    Dave: LMM is not going to be able to use the REPS instruction and perhaps some others.
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 06:52
    On pedward's suggestion, I've modified the SHA-256 and added HMAC into it. I also made it byte-level, so it can hash/HMAC any size strings. It's 229 longs:
    '************************
    '*    SHA-256 + HMAC	*
    '*     (byte-level)	*
    '************************
    
    		org
    
    sha_256		setf	#%0_1111_0000	'configure movf for sbyte0 -> {dbyte3,dbyte2,dbyte1,dbyte0,dbyte3,...}
    
    		call	#init_hash	'init hash, clear hmac mode, reset byte count
    '
    '
    ' Command loop
    '
    sha_command	rdlong	x,ptra		'wait for command (%cc_nnnnnnnnnnnnn_ppppppppppppppppp)
    		tjz	x,#sha_command
    
    		cachex			'invalidate cache for fresh rdbytec's
    
    		setptrb	x		'get byte pointer into ptrb
    
    		mov	count,x		'get byte count
    		shl	count,#2
    		shr	count,#2+17
    		add	count,#1
    
    		shr	x,#32-2		'get command (0 = terminate)
    
    		djz	x,#begin_hmac	'1 = begin hmac, bits[16..0] = @key (64 bytes)
    		djz	x,#hash_bytes	'2 = hash bytes, bits[16..0] = @message (n+1 bytes), bits[29..17] = n (0..8191)
    		djz	x,#read_hash	'3 = read hash,  bits[16..0] = @hashbuffer (32 bytes)
    '
    '
    ' Terminate
    '
    terminate	wrlong	zero,ptra	'clear command to signal done
    
    		cogid	zero		'get cog (d=0 in case fuses not yet hidden)
    		cogstop	zero		'stop cog
    '
    '
    ' Begin hmac
    '
    begin_hmac	call	#end_hash	'end any hash in progress
    
    		mov	count,#64	'get and hash ipad key
    :ipad		rdbytec	x,ptrb++
    		xor	x,#$36
    		call	#hash_byte	'(last iteration triggers hash_block)
    		djnz	count,#:ipad
    
    		reps	#16,#2		'save opad key
    		setinds	opad_key,w
    		mov	indb,inda++
    		xor	indb++,opad
    
    		mov	hmac,#1		'set hmac mode
    
    sha_done	wrlong	zero,ptra	'clear command to signal done
    		jmp	#sha_command	'get next command
    '
    '
    ' Hash bytes
    '
    hash_bytes	rdbytec	x,ptrb++	'hash bytes
    		call	#hash_byte
    		djnz	count,#hash_bytes
    
    		jmp	#sha_done
    '
    '
    ' Read hash
    '
    read_hash	tjz	hmac,#:not	'if not hmac, output hash
    
    
    		call	#end_hash	'hmac, end current hash
    
    		reps	#16,#1		'get opad key into w[0..15]
    		setinds	w,opad_key
    		mov	indb++,inda++
    
    		call	#hash_block	'hash opad key
    
    		reps	#8,#1		'get hashx[0..7] into w[0..7]
    		setinds	w,hashx
    		mov	indb++,inda++
    
    		movd	hash_byte,#w+8	'account for opad key and hashx bytes
    		mov	bytes,#64+32
    
    
    :not		call	#end_hash	'end current hash
    
    		setinda	hashx		'store hashx[0..7] at pointer
    		mov	count,#8
    :out		reps	#4,#2
    		mov	x,inda++
    		rol	x,#8
    		wrbyte	x,ptrb++
    		djnz	count,#:out
    
    		jmp	#sha_done
    '
    '
    ' End hash
    '
    end_hash	mov	length,bytes	'get message length in bits
    		shl	length,#3
    
    		mov	x,#$80		'hash end-of-message $80 byte
    :fill		call	#hash_byte	'(may trigger hash_block)
    		mov	x,bytes		'until at last 8 bytes of block, hash $00 bytes
    		and	x,#$3F
    		cmp	x,#$38	wz
    		mov	x,#$00
    	if_nz	jmp	#:fill
    
    		mov	count,#8	'hash eight length bytes
    :len		cmp	count,#4  wz
    	if_z	mov	x,length	'($00 for first 4 bytes, then length)
    		rol	x,#8
    		call	#hash_byte	'(last iteration triggers hash_block)
    		djnz	count,#:len
    
    		reps	#8,#1		'save hash[0..7] into hashx[0..7]
    		setinds	hashx,hash
    		mov	indb++,inda++
    
    init_hash	reps	#8,#1		'copy hash_init[0..7] into hash[0..7]
    		setinds	hash,hash_init
    		mov	indb++,inda++
    
    		mov	hmac,#0		'clear hmac mode
    		mov	bytes,#0	'reset byte count
    init_hash_ret
    end_hash_ret	ret
    '
    '
    ' Hash byte - add byte to w[0..15] and hash block if full
    '
    hash_byte	movf	w,x		'add byte to w[0..15] as byte[3..0]
    
    		add	bytes,#1	'increment byte count
    
    		test	bytes,#$03  wz	'every 4th byte, increment w pointer
    	if_z	add	hash_byte,d0
    
    		test	bytes,#$3F  wz	'every 64th byte, reset w pointer
    	if_z	movd	hash_byte,#w
    
    	if_z	call	#hash_block	'every 64th byte, hash block
    
    hash_byte_ret	ret
    '
    '
    ' Hash Block - first extend w[0..15] into w[16..63] to generate schedule
    '
    hash_block	reps	#48,@:sch	'i = 16..63
    		setinds	w+16,w+16-15+7	'indb = @w[i], inda = @w[i-15+7]
    
    		setinda	--7		's0 = (w[i-15] -> 7) ^ (w[i-15] -> 18) ^ (w[i-15] >> 3)
    		mov	indb,inda--
    		mov	x,indb
    		rol	x,#18-7
    		xor	x,indb
    		ror	x,#18
    		shr	indb,#3
    		xor	indb,x
    
    		add	indb,inda	'w[i] = s0 + w[i-16]
    
    		setinda	++14		's1 = (w[i-2] -> 17) ^ (w[i-2] -> 19) ^ (w[i-2] >> 10)
    		mov	x,inda
    		mov	y,x
    		rol	y,#19-17
    		xor	y,x
    		ror	y,#19
    		shr	x,#10
    		xor	x,y
    
    		add	indb,x		'w[i] = s0 + w[i-16] + s1
    
    		setinda	--5		'w[i] = s0 + w[i-16] + s1 + w[i-7]
    :sch		add	indb++,inda
    
    
    ' Load variables from hash
    
    		reps	#8,#1		'copy hash[0..7] into a..h
    		setinds	a,hash
    		mov	indb++,inda++
    
    
    ' Do 64 hash iterations on variables
    
    		reps	#64,@:itr	'i = 0..63
    		setinds	k+0,w+0		'indb = @k[i], inda = @w[i]
    
    		mov	x,g		'ch = (e & f) ^ (!e & g)
    		xor	x,f
    		and	x,e
    		xor	x,g
    
    		mov	y,e		's1 = (e -> 6) ^ (e -> 11) ^ (e -> 25)
    		rol	y,#11-6
    		xor	y,e
    		rol	y,#25-11
    		xor	y,e
    		ror	y,#25
    
    		add	x,y		't1 = ch + s1
    		add	x,indb++	't1 = ch + s1 + k[i]
    		add	x,inda++	't1 = ch + s1 + k[i] + w[i]
    		add	x,h		't1 = ch + s1 + k[i] + w[i] + h
    
    		mov	y,c		'maj = (a & b) ^ (b & c) ^ (c & a)
    		and	y,b
    		or	y,a
    		mov	h,c
    		or	h,b
    		and	y,h
    
    		mov	h,a		's0 = (a -> 2) ^ (a -> 13) ^ (a -> 22)
    		rol	h,#13-2
    		xor	h,a
    		rol	h,#22-13
    		xor	h,a
    		ror	h,#22
    
    		add	y,h		't2 = maj + s0
    
    		mov	h,g		'h = g
    		mov	g,f		'g = f
    		mov	f,e		'f = e
    		mov	e,d		'e = d
    		mov	d,c		'd = c
    		mov	c,b		'c = b
    		mov	b,a		'b = a
    
    		add	e,x		'e = e + t1
    
    		mov	a,x		'a = t1 + t2
    :itr		add	a,y
    
    
    ' Add variables back into hash
    
    		reps	#8,#1		'add a..h into hash[0..7]
    		setinds	hash,a
    		add	indb++,inda++
    
    hash_block_ret	ret
    '
    '
    ' Defined data
    '
    zero		long	0
    d0		long 	1 << 9
    
    opad		long	$36363636 ^ $5C5C5C5C
    
    hash_init	long	$6A09E667, $BB67AE85, $3C6EF372, $A54FF53A, $510E527F, $9B05688C, $1F83D9AB, $5BE0CD19	'fractionals of square roots of primes 2..19
    
    k		long	$428A2F98, $71374491, $B5C0FBCF, $E9B5DBA5, $3956C25B, $59F111F1, $923F82A4, $AB1C5ED5	'fractionals of cube roots of primes 2..311
    		long	$D807AA98, $12835B01, $243185BE, $550C7DC3, $72BE5D74, $80DEB1FE, $9BDC06A7, $C19BF174
    		long	$E49B69C1, $EFBE4786, $0FC19DC6, $240CA1CC, $2DE92C6F, $4A7484AA, $5CB0A9DC, $76F988DA
    		long	$983E5152, $A831C66D, $B00327C8, $BF597FC7, $C6E00BF3, $D5A79147, $06CA6351, $14292967
    		long	$27B70A85, $2E1B2138, $4D2C6DFC, $53380D13, $650A7354, $766A0ABB, $81C2C92E, $92722C85
    		long	$A2BFE8A1, $A81A664B, $C24B8B70, $C76C51A3, $D192E819, $D6990624, $F40E3585, $106AA070
    		long	$19A4C116, $1E376C08, $2748774C, $34B0BCB5, $391C0CB3, $4ED8AA4A, $5B9CCA4F, $682E6FF3
    		long	$748F82EE, $78A5636F, $84C87814, $8CC70208, $90BEFFFA, $A4506CEB, $BEF9A3F7, $C67178F2
    '
    '
    ' Undefined data
    '
    hmac		res	1
    bytes		res	1
    count		res	1
    length		res	1
    
    opad_key	res	16
    
    hash		res	8
    hashx		res	8
    
    w		res	64
    
    a		res	1
    b		res	1
    c		res	1
    d		res	1
    e		res	1
    f		res	1
    g		res	1
    h		res	1
    
    x		res	1
    y		res	1
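For anyone wanting to cross-check results from the cog against a PC, here is a hedged host-side sketch in Python. It is not part of Chip's listing. It packs a command long in the `%cc_nnnnnnnnnnnnn_ppppppppppppppppp` layout described in the comments, and it recomputes HMAC-SHA-256 the way the listing appears to: the 64-byte key XOR `$36` is hashed ahead of the message, then the key XOR `$5C` is hashed ahead of the inner digest. The function names are made up for illustration.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 block size in bytes; begin_hmac reads a 64-byte key


def pack_command(command, pointer, length=1):
    """Build a command long: 2-bit command, 13-bit n = length - 1, 17-bit pointer.

    Note that an all-zero long is what the cog treats as "no command yet".
    """
    n = length - 1
    if not 0 <= command <= 3:
        raise ValueError("command must be 0..3")
    if not 0 <= n <= 8191:
        raise ValueError("length must be 1..8192")
    if not 0 <= pointer < (1 << 17):
        raise ValueError("pointer must fit in 17 bits")
    return (command << 30) | (n << 17) | pointer


def unpack_command(word):
    """Split a command long into (command, byte count, pointer)."""
    return word >> 30, ((word >> 17) & 0x1FFF) + 1, word & 0x1FFFF


def hmac_sha256_by_hand(key, message):
    """HMAC-SHA-256 for a key of exactly 64 bytes, built from two plain hashes."""
    if len(key) != BLOCK_SIZE:
        raise ValueError("this sketch only handles 64-byte keys")
    ipad_key = bytes(b ^ 0x36 for b in key)
    # The listing derives the opad key from the ipad key with $36 ^ $5C.
    opad_key = bytes(b ^ 0x36 ^ 0x5C for b in ipad_key)
    inner = hashlib.sha256(ipad_key + message).digest()
    return hashlib.sha256(opad_key + inner).digest()
```

Comparing `hmac_sha256_by_hand(key, msg)` with `hmac.new(key, msg, hashlib.sha256).digest()` from the Python standard library should give identical bytes, which makes either one usable as a reference for the 32 bytes returned by command 3.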
    
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-08-11 08:32
    Thanks Chip
    cgracey wrote: »
    RDxxxx/WRxxxx will work on all the I/O registers - don't worry.

    I think the 8 executable registers at $1F8..$1FF are too much trouble to set up for regular I/O write blocking and special writing to make them useful as instruction locations. They only represent 1/64th of the executable memory, anyway.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-08-11 08:42
    Sounds great to me, well worth some documented limitations.
    cgracey wrote: »
    That would be very clean!

    To add something like this may not take more than one day, and it would add maybe several hours to the synthesis work, at this point, at $175/hr.

    For this to work, you would have to avoid using instructions like WAITxxx or REPS that either stall or mess with the pipeline. A stall would just be ugly, with respect to other tasks, but instructions that toy with the pipeline would wreak havoc. 'Just some stuff you'd need to take into consideration when programming multiple tasks. And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB. Memory accesses would cause brief stalls. The cache wouldn't mind, though.
  • Bill HenningBill Henning Posts: 6,445
    edited 2012-08-11 08:47
    How about:

    WAITVID dst,src NR WC - polling version, use C to return the wait status, does not actually wait

    Actually, it could be generic:

    WAITxxxx dst,src NR WC - polling version, use C to return the wait status, does not actually wait
    cgracey wrote: »
    Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-coded assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
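To make the "polling version" idea concrete, here is a small Python model of what such instructions might report. It is a sketch under assumptions: the thread does not say which polarity C would take, so C = true for "condition met" is a guess, and the function names are hypothetical. The pin comparisons follow the Prop I WAITPEQ/WAITPNE convention of masking the pins and comparing with the state value.

```python
MASK32 = 0xFFFFFFFF


def poll_waitpeq(pins, state, mask):
    """Assumed C flag for a polled WAITPEQ: true when (pins & mask) == state."""
    return (pins & mask) == state


def poll_waitpne(pins, state, mask):
    """Assumed C flag for a polled WAITPNE: true when (pins & mask) != state."""
    return (pins & mask) != state


def poll_waitcnt(cnt, target):
    """Assumed C flag for a polled WAITCNT.

    A polling loop can easily miss the exact clock on which cnt == target,
    so this model reports true once the 32-bit counter has reached or passed
    the target, using a wrap-safe difference.
    """
    return ((cnt - target) & MASK32) < 0x80000000
```

The WAITCNT case shows why a polled form cannot simply reuse an equality test: `poll_waitcnt(5, 0xFFFFFFF0)` is true because the counter has wrapped past the target, whereas an exact match would already have been missed.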
  • potatoheadpotatohead Posts: 10,261
    edited 2012-08-11 08:48
    Excellent!

    Re: WAITVID being pollable.

    We've got P1 code that uses the WHOP (Waitvid Hand Off Point) successfully. (Kurenko was successful doing this.) That's not polling, more like synchronization. The key thing is the waitvid latch isn't really used. A similar technique could apply here, though it would be complex. Deffo manual PASM, but possible to do video and have the threads anyway. Just fire off the waitvid after synching up, then it does its thing without stalling execution. Another waitvid instruction simply won't be executed by any COG thread, unless there is some compelling event requiring a major change.

    Edit: Just saw Bill's post. Yeah, seconded.
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2012-08-11 09:11
    cgracey wrote:
    heater wrote:
    Chip,
    I cannot see the complete code on my phone here but that tasksw looks really sweet.
    Now that you have a context switching mechanism is there a simple way to get task switch to happen automatically on every instruction? So two tasks would be able to run at half normal rate each. No overhead of having to read and execute a tasksw instruction. To keep it simple there would be no priority mechanism.
    In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.
    I wish I had thought about this earlier, because it might have been somewhat trivial to have an array of 8 program counters and z/c flags that could be switched among. Man, that's pretty compelling! Ask yourself this: if instructions floated through the pipeline that each represented a different pc/z/c, would it matter, as long as appropriate pc/z/c's were updated at the end of each instruction? Would the registers care? I don't think so, but it would take a little consideration to know for sure.

    It's deja vu all over again:

    and the discussion following. :)

    -Phil
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 09:23

    I think we're all on a big Merry-Go-Round, or something. When, exactly, will the Chinese be taking over?
  • SapiehaSapieha Posts: 2,964
    edited 2012-08-11 09:34
    Hi Chip.

    I think you need that "Merry-Go-Round" to freshen up your ideas!!

    cgracey wrote: »
    I think we're all on a big Merry-Go-Round, or something. When, exactly, will the Chinese be taking over?
  • pedwardpedward Posts: 1,642
    edited 2012-08-11 11:17
    In the context of the WAITxxx commands, it would be nice to have a version that executes TASKSW if it were to block. The idea is that you yield control if you were to block. When the task returns to that instruction, it continues to yield if blocked. This is how you would handle blocking in traditional threading: you yield control rather than waste cycles. The caveat is that it won't be cycle accurate, but if it's WAITVID, perhaps the data could be buffered and handed off. WAITCNT wouldn't be accurate, no way around that. WAITPxx could potentially be buffered, time sensitivity not so important.

    It kinda gets into a bunch of specialist exceptions that make the use case more narrow.
  • Invent-O-DocInvent-O-Doc Posts: 768
    edited 2012-08-11 13:50
    This stuff is all cool, but does it make sense to make too many changes this late in the game?
  • KyeKye Posts: 2,200
    edited 2012-08-11 15:12
    Chip's likely not to make any changes. He just entertains them. Only the boot loader is up for suggestions right now. :)

    Thanks,
  • jmgjmg Posts: 15,182
    edited 2012-08-11 16:50
    evanh wrote: »
    Too messy. Best to have four full pipes instead and get all the benefits. The pipeline is only four stages long so it isn't that much more compared to having special case for the WAITs.

    The WAITs are special cases, and to avoid stalls, you would need to duplicate the PC+Wait state engine Thread Times.
    Once you have done that, it does not matter so much if there are four pipes, or tag bits on the contents, whichever actually works, with smallest silicon.
    Four pipes are likely to have fewer surprises, but it is starting to sound costly in silicon?
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 18:04
    jmg wrote: »
    The WAITs are special cases, and to avoid stalls, you would need to duplicate the PC+Wait state engine Thread Times.
    Once you have done that, it does not matter so much if there are four pipes, or tag bits on the contents, whichever actually works, with smallest silicon.
    Four pipes are likely to have fewer surprises, but it is starting to sound costly in silicon?

    With four pipes comes four ALU's (HUGE area), so this is out of the question. We wouldn't need them, anyway, to get 99% of the functional equivalence by just mux'ing PC/Z/C's.
  • cgraceycgracey Posts: 14,239
    edited 2012-08-11 18:11
    pedward wrote: »
    In the context of the WAITxxx commands, it would be nice to have a version that executes TASKSW if it were to block. The idea is that you yield control if you were to block. When the task returns to that instruction, it continues to yield if blocked. This is how you would handle blocking in traditional threading: you yield control rather than waste cycles. The caveat is that it won't be cycle accurate, but if it's WAITVID, perhaps the data could be buffered and handed off. WAITCNT wouldn't be accurate, no way around that. WAITPxx could potentially be buffered, time sensitivity not so important.

    It kinda gets into a bunch of specialist exceptions that make the use case more narrow.

    The trouble with blocking, which means attempting to re-execute the same instruction on the next time slot for the same task, is that the pipe already has, potentially, other instructions in it that belong to that same task, intermingled with other tasks' instructions. This would mean all kinds of pipeline reconstruction would have to be done, which would not be worth doing. Better to make polling options for instructions that otherwise stall the pipe. The pipeline is like a freight train that only goes one way.
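The scheme discussed in this thread (one instruction per time slot, a separate PC/Z/C per task, shared registers, and polling loops instead of stalls) can be sketched as a toy interpreter in Python. This is only an illustration under heavy assumptions: the four-stage pipeline is not modelled at all, and the miniature instruction set (`add`, `poll`, `jmp_nc`, `jmp`) is invented for the example.

```python
class Task:
    """Per-task state: only the program counter and flags are private."""

    def __init__(self, program):
        self.program = program
        self.pc = 0
        self.z = False
        self.c = False


def run(tasks, shared, clocks):
    """Execute one instruction per clock, rotating through the tasks in order."""
    for clock in range(clocks):
        task = tasks[clock % len(tasks)]
        op, *args = task.program[task.pc]
        next_pc = task.pc + 1
        if op == "add":        # add an immediate to a shared register
            reg, amount = args
            shared[reg] = shared.get(reg, 0) + amount
        elif op == "poll":     # polled wait: set C from a shared "pin", never stall
            (reg,) = args
            task.c = shared.get(reg, 0) != 0
        elif op == "jmp_nc":   # jump if this task's own C flag is clear
            (target,) = args
            if not task.c:
                next_pc = target
        elif op == "jmp":
            (next_pc,) = args
        else:
            raise ValueError("unknown op: " + op)
        task.pc = next_pc


# Two-instruction polling loop, then count one event and park.
WAITER = [("poll", "pin"), ("jmp_nc", 0), ("add", "events", 1), ("jmp", 3)]
# A neighbour that keeps counting regardless of what the waiter is doing.
COUNTER = [("add", "ticks", 1), ("jmp", 0)]
```

Running `run([Task(WAITER), Task(COUNTER)], shared, 8)` with `shared = {"pin": 0}` leaves the waiter spinning in its two-instruction loop while the counter still advances on every one of its slots, which is the property a stalling WAITxx would break.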