Hmm... That's the thing: no interrupts, but event driven. If a thread can wait on a pin or whatever, it effectively becomes an interrupt handler. Except that when the event fires and the thread continues, it has no effect on the execution of other threads. After all, there is no context to save, it has its own, and it does not steal execution time. Determinism is maintained.
The one hiccup that might be hard to avoid is that some thread will need to talk to cog RAM. That will cause a brief stall.
As I said, the advantages of WAITxx are that the chip consumes less power while waiting and that there is a bit less latency than polling.
Given the greater speed of the P2 I'm prepared to accept polling in this auto-threaded code if WAITs don't fit there.
If you want low power just stop the threads and then WAIT.
Not sure about the video waits though I have yet to ever use or think about them.
With 500k gates of synthesized logic, I don't think power consumption will be as rational or predictable as it is with Prop I. You will probably only get a 60% power reduction in a WAITCNT.
Ok good:) We don't have to worry about power.
I forgot that another advantage of WAITCNT over polling is removing timing jitter when bit-banging a serial protocol.
If a thread accessing the HUB causes jitter in its friends, a WAITCNT would help combat it.
Is this important?
And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB.
This seems like it would be a big problem for compilers. You wouldn't be able to assume that those registers were available all the time so the compiler would have to be able to generate code differently depending on whether it could use those registers and those registers are a big part of the added benefit of P2. Any chance of getting a separate copy of these for each thread? Or maybe this feature would be mostly useful for code written in PASM where it is possible to manage use of those registers manually.
I would go with four threads and a 16-slot, 32-bit table initially containing the value #0, giving thread zero 100%. This allows a nice extended slicing map.
That would work too, I guess it depends on the cost of 4 vs 8 in Silicon and Speed.
It does make sense to pack the config into 32 bits, so 4 threads : 16 slots is one fit, or 8 threads : 8 slots + 8 flags, or 16 threads : 8 slots.
Having more time slots is nice, as it allows finer allocation of resources, and one great side effect of a skewed allocation, like 15/16 & 1/16, is that you can overclock by 16/15 and get full-speed operation, AND have full debug access from the 'background' thread.
i.e. you get debug almost for free.
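To make the skewed allocation concrete, here is a minimal sketch of one way a 16-slot map of 2-bit thread IDs could pack into a single long, using jmg's 15/16 + 1/16 split; the label, the %% quaternary literal and the slot ordering are assumptions for illustration, not a defined P2 format:

' 16 slots x 2-bit thread ID = 32 bits, one slot consumed per clock (ordering assumed).
' Fifteen slots run thread 0 at nearly full speed; every 16th slot runs thread 1,
' the 'background' debug thread.
slotmap     long    %%1000_0000_0000_0000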
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
You could square up the timing with a WAITCNT now and then, but it's probably not worth doing.
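To illustrate Chip's "square up the timing" suggestion against the bit-banged serial example above, here is a minimal P1-style transmit loop; bit_period, txtarget, txmask, txdata and bitcount are placeholder names, not code from this thread:

        mov     txtarget, cnt           ' anchor the first bit edge to 'now'
        add     txtarget, bit_period
:bit    waitcnt txtarget, bit_period    ' park until this bit's edge, schedule the next one
        shr     txdata, #1  wc          ' next data bit into C
        muxc    outa, txmask            ' drive the pin a fixed two instructions after the edge
        djnz    bitcount, #:bit         ' any hub-access jitter is re-absorbed every bit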
These are all the waits there are: WAITVID, WAITCNT, WAITPEQ, WAITPNE. Are they so important?
For hardware timing, yes, WAITs are important.
A WAITxx opcode effectively removes the thread from the pipeline candidates: it is a single opcode, and it gates the increment of the PC on another event.
If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed?
Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1-clock granularity.
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use.
That caveat is fine for Debug use, as the Debug handler will always be in ASM, and small.
I guess that's not surprising. As you say, the extra threads can be used for hand-coded assembler or the entire COG can be dedicated to handling multiple devices. It would let us cram even more "soft peripherals" into a single Propeller chip!
David,
I had already been wondering if a compiler would ever use those IND/PTR things?
Surely they are of no use to code compiled to LMM as they only index COG memory?
Then if you are writing C for native in-COG code you are not really fishing for maximum speed.
For hardware timing, yes, WAITs are important.
A WAITxx opcode effectively removes the thread from the pipeline candidates: it is a single opcode, and it gates the increment of the PC on another event.
If the pipeline can feed a set of incrementing PCs, it should be able to latch that single opcode until the next one is needed?
Ideally, that WAIT (defer INC of PC) condition sampling will happen every clock, even in an 8..16-way sliced system, to give 1-clock granularity.
Too messy. Best to have four full pipes instead and get all the benefits. The pipeline is only four stages long so it isn't that much more compared to having special case for the WAITs.
Having now read the other replies I suppose a two instruction polling loop is not so bad really.
Hmmm, code that works in testing but breaks badly with the flip of a single config bit ... Lots of traps for young players in the Prop2.
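For reference, the two-instruction polling loop being weighed against WAITPEQ would look something like this (rxmask is a placeholder for the pin bit being watched):

:poll   test    rxmask, ina  wz         ' sample the pin
  if_z  jmp     #:poll                  ' spin in this task's own time slots until it goes high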
I believe INDA and INDB are used for cog memory, and PTRA and PTRB are used for hub memory. From what I can tell, it looks like the machine description for GCC includes index registers, auto-increment and decrement and index-offset limitations. So it seems like GCC will be able to use the index registers. However, one of the PTR registers may be used by the LMM/XMM interpreter, so there may be only one PTR register available to the user program.
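For a sense of what the compiler would be mapping onto, these are the addressing forms in question, as they appear in Chip's SHA-256 code later in the thread; treating RDLONG with an auto-incrementing PTRA as available is an assumption here:

        rdlong  x, ptra++               ' hub read at PTRA, then PTRA advances by 4 (array walk)
        rdbytec y, ptrb++               ' cached hub byte read at PTRB, then PTRB advances by 1
        mov     x, inda++               ' cog-register read via INDA, then INDA advances by 1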
Wow! I have been out for the day and missed a fantastic discussion.
BTW Chip did you miss my SD boot idea or is it out of the question? (post #252)
I like the 4 tasks using 1 in 4 clocks and am quite happy to not be able to use waitcnts and perhaps a few other instructions. I really don't see multi-tasking in a video cog because we are always short on time and space, so waitvid isn't a problem. I realise we have quad-long fetches and much faster instructions, but I expect we will just find extra things to do in this time.
Now for a later P2B we will be asking for those multi-threads to also have their own cog memory too, excepting a small window of common cog RAM for inter-task comms.
Dave: LMM is not going to be able to use the REPS instruction and perhaps some others.
On pedward's suggestion, I've modified the SHA-256 and added HMAC into it. I also made it byte-level, so it can hash/HMAC any size strings. It's 229 longs:
'************************
'* SHA-256 + HMAC *
'* (byte-level) *
'************************
org
sha_256 setf #%0_1111_0000 'configure movf for sbyte0 -> {dbyte3,dbyte2,dbyte1,dbyte0,dbyte3,...}
call #init_hash 'init hash, clear hmac mode, reset byte count
'
'
' Command loop
'
sha_command rdlong x,ptra 'wait for command (%cc_nnnnnnnnnnnnn_ppppppppppppppppp)
tjz x,#sha_command
cachex 'invalidate cache for fresh rdbytec's
setptrb x 'get byte pointer into ptrb
mov count,x 'get byte count
shl count,#2
shr count,#2+17
add count,#1
shr x,#32-2 'get command (0 = terminate)
djz x,#begin_hmac '1 = begin hmac, bits[16..0] = @key (64 bytes)
djz x,#hash_bytes '2 = hash bytes, bits[16..0] = @message (n+1 bytes), bits[29..17] = n (0..8191)
djz x,#read_hash '3 = read hash, bits[16..0] = @hashbuffer (32 bytes)
'
'
' Terminate
'
terminate wrlong zero,ptra 'clear command to signal done
cogid zero 'get cog (d=0 in case fuses not yet hidden)
cogstop zero 'stop cog
'
'
' Begin hmac
'
begin_hmac call #end_hash 'end any hash in progress
mov count,#64 'get and hash ipad key
:ipad rdbytec x,ptrb++
xor x,#$36
call #hash_byte '(last iteration triggers hash_block)
djnz count,#:ipad
reps #16,#2 'save opad key
setinds opad_key,w
mov indb,inda++
xor indb++,opad
mov hmac,#1 'set hmac mode
sha_done wrlong zero,ptra 'clear command to signal done
jmp #sha_command 'get next command
'
'
' Hash bytes
'
hash_bytes rdbytec x,ptrb++ 'hash bytes
call #hash_byte
djnz count,#hash_bytes
jmp #sha_done
'
'
' Read hash
'
read_hash tjz hmac,#:not 'if not hmac, output hash
call #end_hash 'hmac, end current hash
reps #16,#1 'get opad key into w[0..15]
setinds w,opad_key
mov indb++,inda++
call #hash_block 'hash opad key
reps #8,#1 'get hashx[0..7] into w[0..7]
setinds w,hashx
mov indb++,inda++
movd hash_byte,#w+8 'account for opad key and hashx bytes
mov bytes,#64+32
:not call #end_hash 'end current hash
setinda hashx 'store hashx[0..7] at pointer
mov count,#8
:out reps #4,#2
mov x,inda++
rol x,#8
wrbyte x,ptrb++
djnz count,#:out
jmp #sha_done
'
'
' End hash
'
end_hash mov length,bytes 'get message length in bits
shl length,#3
mov x,#$80 'hash end-of-message $80 byte
:fill call #hash_byte '(may trigger hash_block)
mov x,bytes 'until at last 8 bytes of block, hash $00 bytes
and x,#$3F
cmp x,#$38 wz
mov x,#$00
if_nz jmp #:fill
mov count,#8 'hash eight length bytes
:len cmp count,#4 wz
if_z mov x,length '($00 for first 4 bytes, then length)
rol x,#8
call #hash_byte '(last iteration triggers hash_block)
djnz count,#:len
reps #8,#1 'save hash[0..7] into hashx[0..7]
setinds hashx,hash
mov indb++,inda++
init_hash reps #8,#1 'copy hash_init[0..7] into hash[0..7]
setinds hash,hash_init
mov indb++,inda++
mov hmac,#0 'clear hmac mode
mov bytes,#0 'reset byte count
init_hash_ret
end_hash_ret ret
'
'
' Hash byte - add byte to w[0..15] and hash block if full
'
hash_byte movf w,x 'add byte to w[0..15] as byte[3..0]
add bytes,#1 'increment byte count
test bytes,#$03 wz 'every 4th byte, increment w pointer
if_z add hash_byte,d0
test bytes,#$3F wz 'every 64th byte, reset w pointer
if_z movd hash_byte,#w
if_z call #hash_block 'every 64th byte, hash block
hash_byte_ret ret
'
'
' Hash Block - first extend w[0..15] into w[16..63] to generate schedule
'
hash_block reps #48,@:sch 'i = 16..63
setinds w+16,w+16-15+7 'indb = @w[i], inda = @w[i-15+7]
setinda --7 's0 = (w[i-15] -> 7) ^ (w[i-15] -> 18) ^ (w[i-15] >> 3)
mov indb,inda--
mov x,indb
rol x,#18-7
xor x,indb
ror x,#18
shr indb,#3
xor indb,x
add indb,inda 'w[i] = s0 + w[i-16]
setinda ++14 's1 = (w[i-2] -> 17) ^ (w[i-2] -> 19) ^ (w[i-2] >> 10)
mov x,inda
mov y,x
rol y,#19-17
xor y,x
ror y,#19
shr x,#10
xor x,y
add indb,x 'w[i] = s0 + w[i-16] + s1
setinda --5 'w[i] = s0 + w[i-16] + s1 + w[i-7]
:sch add indb++,inda
' Load variables from hash
reps #8,#1 'copy hash[0..7] into a..h
setinds a,hash
mov indb++,inda++
' Do 64 hash iterations on variables
reps #64,@:itr 'i = 0..63
setinds k+0,w+0 'indb = @k[i], inda = @w[i]
mov x,g 'ch = (e & f) ^ (!e & g)
xor x,f
and x,e
xor x,g
mov y,e 's1 = (e -> 6) ^ (e -> 11) ^ (e -> 25)
rol y,#11-6
xor y,e
rol y,#25-11
xor y,e
ror y,#25
add x,y 't1 = ch + s1
add x,indb++ 't1 = ch + s1 + k[i]
add x,inda++ 't1 = ch + s1 + k[i] + w[i]
add x,h 't1 = ch + s1 + k[i] + w[i] + h
mov y,c 'maj = (a & b) ^ (b & c) ^ (c & a)
and y,b
or y,a
mov h,c
or h,b
and y,h
mov h,a 's0 = (a -> 2) ^ (a -> 13) ^ (a -> 22)
rol h,#13-2
xor h,a
rol h,#22-13
xor h,a
ror h,#22
add y,h 't2 = maj + s0
mov h,g 'h = g
mov g,f 'g = f
mov f,e 'f = e
mov e,d 'e = d
mov d,c 'd = c
mov c,b 'c = b
mov b,a 'b = a
add e,x 'e = e + t1
mov a,x 'a = t1 + t2
:itr add a,y
' Add variables back into hash
reps #8,#1 'add a..h into hash[0..7]
setinds hash,a
add indb++,inda++
hash_block_ret ret
'
'
' Defined data
'
zero long 0
d0 long 1 << 9
opad long $36363636 ^ $5C5C5C5C
hash_init long $6A09E667, $BB67AE85, $3C6EF372, $A54FF53A, $510E527F, $9B05688C, $1F83D9AB, $5BE0CD19 'fractionals of square roots of primes 2..19
k long $428A2F98, $71374491, $B5C0FBCF, $E9B5DBA5, $3956C25B, $59F111F1, $923F82A4, $AB1C5ED5 'fractionals of cube roots of primes 2..311
long $D807AA98, $12835B01, $243185BE, $550C7DC3, $72BE5D74, $80DEB1FE, $9BDC06A7, $C19BF174
long $E49B69C1, $EFBE4786, $0FC19DC6, $240CA1CC, $2DE92C6F, $4A7484AA, $5CB0A9DC, $76F988DA
long $983E5152, $A831C66D, $B00327C8, $BF597FC7, $C6E00BF3, $D5A79147, $06CA6351, $14292967
long $27B70A85, $2E1B2138, $4D2C6DFC, $53380D13, $650A7354, $766A0ABB, $81C2C92E, $92722C85
long $A2BFE8A1, $A81A664B, $C24B8B70, $C76C51A3, $D192E819, $D6990624, $F40E3585, $106AA070
long $19A4C116, $1E376C08, $2748774C, $34B0BCB5, $391C0CB3, $4ED8AA4A, $5B9CCA4F, $682E6FF3
long $748F82EE, $78A5636F, $84C87814, $8CC70208, $90BEFFFA, $A4506CEB, $BEF9A3F7, $C67178F2
'
'
' Undefined data
'
hmac res 1
bytes res 1
count res 1
length res 1
opad_key res 16
hash res 8
hashx res 8
w res 64
a res 1
b res 1
c res 1
d res 1
e res 1
f res 1
g res 1
h res 1
x res 1
y res 1
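For reference, a rough sketch of how another cog might drive the command loop above, inferred from the rdlong/wrlong handshake at sha_command and sha_done; the names mailbox, msg_ptr, msg_len and cmd_hash are placeholders, and the calling cog's own setup is assumed:

' Build %cc_nnnnnnnnnnnnn_ppppppppppppppppp: cc = 2 ('hash bytes'),
' bits[29..17] = byte count - 1, bits[16..0] = hub address of the message.
        mov     cmd, msg_len
        sub     cmd, #1
        shl     cmd, #17
        or      cmd, msg_ptr
        or      cmd, cmd_hash           ' command code in the top two bits
        wrlong  cmd, mailbox            ' mailbox = the hub long the SHA cog reads via ptra
:wait   rdlong  cmd, mailbox  wz        ' the cog writes 0 back when the command is done
  if_nz jmp     #:wait

cmd_hash    long    2 << 30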
RDxxxx/WRxxxx will work on all the I/O registers - don't worry.
I think the 8 executable registers at $1F8..$1FF are too much trouble to set up for regular I/O write blocking and special writing to make them useful as instruction locations. They only represent 1/64th of the executable memory, anyway.
To add something like this may not take more than one day, and it would add maybe several hours to the synthesis work, at this point, at $175/hr.
For this to work, you would have to avoid using instructions like WAITxxx or REPS that either stall or mess with the pipeline. A stall would just be ugly, with respect to other tasks, but instructions that toy with the pipeline would wreak havoc. 'Just some stuff you'd need to take into consideration when programming multiple tasks. And you'd have to avoid resource conflicts, like who's using INDA/INDB/PTRA/PTRB. Memory accesses would cause brief stalls. The cache wouldn't mind, though.
Those INDA/INDB/PTRA/PTRB registers are all critical-path, so there's no time to mux more of them. This multitasking would be strictly for hand-code assembly use. If I could make WAITVID poll-able, you could easily do a keyboard, mouse, video terminal in one cog.
We've got P1 code that uses the WHOP (Waitvid Hand Off Point) successfully. (Kurenko was successful doing this) That's not polling, more like synchronization. The key thing is the waitvid latch isn't really used. A similar technique could apply here, though it would be complex. Deffo manual PASM, but possible to do video and have the threads anyway. Just fire off the waitvid after synching up, then it does its thing without stalling execution. Another waitvid instruction simply won't be executed by any COG thread, unless there is some compelling event requiring a major change.
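A bare-bones sketch of that hand-off idea, adapted to the polling theme of this thread; the handoff and lineticks values, and the assumption that the WAITVID returns almost immediately once the shifter is about to go idle, are illustration only, not the actual driver:

:sync   mov     t, cnt                  ' poll CNT rather than stalling the whole cog
        sub     t, handoff
        cmps    t, #0  wc
  if_c  jmp     #:sync                  ' not at the hand-off point yet
        waitvid colors, pixels          ' shifter is just going idle: this returns right away
        add     handoff, lineticks      ' schedule the next hand-off point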
Chip,
I cannot see the complete code on my phone here but that tasksw looks really sweet.
Now that you have a context switching mechanism is there a simple way to get task switch to happen automatically on every instruction? So two tasks would be able to run at half normal rate each. No overhead of having to read and execute a tasksw instruction. To keep it simple there would be no priority mechanism.
In fact it would be nice for the task switch to happen after every instruction time even if the instruction has not finished. Then multiple tasks could be waiting on different events, pin or time or vid.
I wish I had thought about this earlier, because it might have been somewhat trivial to have an array of 8 program counters and z/c flags that could be switched among. Man, that's pretty compelling! Ask yourself this: if instructions floated through the pipeline that each represented a different pc/z/c, would it matter, as long as appropriate pc/z/c's were updated at the end of each instruction? Would the registers care? I don't think so, but it would take a little consideration to know for sure.
In context to the WAITxxx commands, it would be nice to have a version that executes TASKSW if it was to block. The idea is that you yield control if you were to block. When the task returns to that instruction, it continues to yield if blocked. This is how you would handle blocking in traditional threading, you yield control if you were to waste cycles. The caveat is that it won't be cycle accurate, but if it's WAITVID, perhaps the data could be buffered and handed off. WAITCNT wouldn't be accurate, no way around that. WAITPxx could potentially be buffered, time sensitivity not so important.
It kinda gets into a bunch of specialist exceptions that make the use case more narrow.
The WAITs are special cases, and to avoid stalls, you would need to duplicate the PC + WAIT state engine once per thread.
Once you have done that, it does not matter so much if there are four pipes, or tag bits on the contents, whichever actually works with the smallest silicon.
Four pipes is likely to have fewer surprises, but it is starting to sound silicon-costly?
With four pipes comes four ALU's (HUGE area), so this is out of the question. We wouldn't need them, anyway, to get 99% of the functional equivalence by just mux'ing PC/Z/C's.
The trouble with blocking, which means attempting to re-execute the same instruction on the next time slot for the same task, is that the pipe already has, potentially, other instructions in it that belong to that same task, intermingled with other tasks' instructions. This would mean all kinds of pipeline reconstruction would have to be done, which would not be worth doing. Better to make polling options for instructions that otherwise stall the pipe. The pipeline is like a freight train that only goes one way.
Certainly better than nothing though.
WAITVID dst,src NR WC - polling version, use C to return the wait status, does not actually wait
Actually, it could be generic:
WAITxxxx dst,src NR WC - polling version, use C to return the wait status, does not actually wait
Edit: Just saw Bill's post. Yeah, seconded.
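If a polled form like that existed, use in a multitasked cog might look like this; the NR WC encoding is exactly the proposal above and purely hypothetical, and the C polarity and whether the polled form also latches the data are open questions:

:vid    waitvid colors, pixels  nr,wc   ' poll only: C reports whether the shifter could take data
  if_nc jmp     #:vid                   ' not ready yet -- only this task's slots are spent spinning
        waitvid colors, pixels          ' ready now, so this should return without stalling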
It's deja vu all over again:
http://forums.parallax.com/showthread.php?106059&p=746960&viewfull=1#post746960
and the discussion following.
-Phil
I think we're all on a big Merry-Go-Round, or something. When, exactly, will the Chinese be taking over?
I think that "Merry-Go-Round" deeds for You to fresh up Yours ideas !!