Propeller II

evanh · 2012-08-26 04:41

Chip did say it's so easy it's pretty much a done thing as long as it's usable, ie: He was asking if we could handle having Hub reads incurring at least one clock cycle stall per read.

The answer was a yes, it's usable, as far as I could tell.

Heater. · 2012-08-26 05:01

It seemed quite straight forward. Effectively performing a JMPRET after every instruction automatically.
I was wondering though. How do the threads get set up initially and how do you get into and out of this thread slicing mode?

Dave Hein · 2012-08-26 05:07

Heater, P1 is a very RISC-like processor with something like 62 simple instructions plus a few hub instructions that are a little different. P2 builds on P1 with the addition of many special instructions. To fit them in with the existing instructions Chip had to use the immediate and conditional write bits for other purposes. The potential x86 baggage bridge was crossed a long time ago in the P2 design phase. Adding a TASKSW instruction in the mix of the new P2 instructions isn't that big of a deal.

Heater. · 2012-08-26 05:20

Dave,

Yes the Prop is very lean and regular. All instructions being 32 bit, all but a few taking the same execution time, all using the same conditional execution flags the same way etc. That's why many of us like it so much. And that's why I worry about it growing "warts", having odd special cases and so on.

From what Chip posted earlier we see that the TASKSW instruction is already in there and he gave an example code showing how it can be used. All good stuff.

The next step that I suggested was the possibility to basically do the TASKSW automatically after every instruction. That would minimize threads latency in response to events, maximize execution speed for threadead code and reduce code size.

Chip said it was easy to do but we don't know if he wants to pursue it at this late stage.

evanh · 2012-08-26 05:26

Heater. wrote: »

It seemed quite straight forward. Effectively performing a JMPRET after every instruction automatically.
I was wondering though. How do the threads get set up initially and how do you get into and out of this thread slicing mode?

Ah, I wouldn't equate slicing to switching. Though, he's prolly using the same context registers I guess, which actually makes much better use of them. Not sure about conflicting use ...

Anyway, to answer the question, slicing is in reality always on. But there will be an additional special register to hold the slicing config. The default value in this config register will be #0, which means fetch thread zero on every cycle, which, in turn, is the same as a single threaded processor.

The encoding of the config register will be (for four threaded model):

Bits 0 and 1 are binary encoding of thread executed for first time slice.
Bits 2 and 3 are binary encoding of thread executed for second time slice.
And so on to 16 slices (32 bit register) before looping back to start of the register.

Heater. · 2012-08-26 05:45

evanh,

"Slicing", "switching", you might have to elaborate on the differences. The XMOS devices do a similar thing, up to 8 threads can be executing at the same time, an instruction from one, then an instruction from the next and so on in a round robin fashion. They refere to this as "thread scheduling" and it's all done in hardware.

The approach you outlined seems quite sound, slicing always on and a control register. No special instructions.

Dave Hein · 2012-08-26 06:01

Heater. wrote: »

The next step that I suggested was the possibility to basically do the TASKSW automatically after every instruction. That would minimize threads latency in response to events, maximize execution speed for threadead code and reduce code size.

OK, I misunderstood. Actually I think you would have better latency control by using an explicit TASKSW instruction. If you automatically switch threads after every instruction you could have a situation where all the other threads are executing a rdlong instruction at the same time versus other times where all the other threads are executing a single-cycle instruction. There would be a 56-cycle latency in one case and 7 cycles in the other case.

Let's say that each task requires a maximum latency of 210 cycles. With an explicit TASKSW instruction you would need to limit each task to 210/7 = 30 cycles. This would be hard to guarantee with an implicit automatic TASKSW.

evanh · 2012-08-26 06:02

Heater. wrote: »

"Slicing", "switching", you might have to elaborate on the differences.

Fair call. I guess it's a usage thing. Task switching is normally associated with whole processor context switching, as in swapping to the stack. I know this isn't the case on the Prop2 and that there is a small group remapping going on instead.

Threading involves having duplicate contexts, including the general register set, in hardware. This then subdivides into time sliced execution and prioritised execution.

The Prop2 TASKSW instruction sits in the middle between task switching and hardware threading. Even computers are never black and white!

evanh · 2012-08-26 06:38

Dave Hein wrote: »

Let's say that each task requires a maximum latency of 210 cycles. With an explicit TASKSW instruction you would need to limit each task to 210/7 = 30 cycles. This would be hard to guarantee with an implicit automatic TASKSW.

EDIT: Excuse me while I have a go at rewording my babble ...

Yep, given the limitation of the implementation, generalised use of slicing will not be viable. It will be best used as one supervisory task (Doing *all* the Hub accesses) and up to three subordinates. A factor in this thinking is that slicing offers no improvement in MIPS and therefore doesn't offer much toward number crunching activities, ie: Slicing is primarily intended for improving latency. In other words, a typical use is for packaging drivers into sets for the purpose of reducing the number of Cogs used in soft devices.

May even be able to use the TASKSW instruction in conjunction, dunno.

potatohead · 2012-08-26 09:24

Re: Intel syndrome.

Yeah, that was actually on the table early on, then extending the instruction set became a reality given how Chip likes to add sub-systems to the CPU to get stuff done in parallel with instructions. We have crossed that line, but...

There could be lessons learned with P2. I suspect quite a few of them. Ideally, those are all good. Maybe a refactor for P3, as the next jump in scale really might not be micro-controller anymore. Could be CPU at that point. I sure don't know and dare not even suggest that's warranted. Way too early. Just practicing one look ahead as I like to do, that's all.

IMHO, if the stall issue is sorted out, I think this feature is in the Prop philosophy as it would be very regular and straightforward to use. With the clog that's on the table right now, it looks like what it is, a late term hack that's a great idea, killer feature, pinned due to where it was realized in the product life cycle. Deffo a P3 thing, because it's the right idea. Funny too, after all those early discussions, none of us ever quite got there.

jmg · 2012-08-26 15:06

evanh wrote: »

Anyway, to answer the question, slicing is in reality always on. But there will be an additional special register to hold the slicing config. The default value in this config register will be #0, which means fetch thread zero on every cycle, which, in turn, is the same as a single threaded processor.

The encoding of the config register will be (for four threaded model):

Bits 0 and 1 are binary encoding of thread executed for first time slice.
Bits 2 and 3 are binary encoding of thread executed for second time slice.
And so on to 16 slices (32 bit register) before looping back to start of the register.

This type of hardware slicing (as opposed to software switching) is an ideal fit to the Prop, as there are no interrupts to complicate this.
You do need to share the COG resource amongst what the slices do, but that is what makes it such a good fit.
It means you can now 'fill up' a COG, instead of having a lot of silicon doing nothing but wait...

I would study whether x8 Slice, x8 slot model was better, or x4 Slice, x16 slots

What appeals on a x8 x8 is then a Prop 2 COG, can slice 8 ways to match a Prop 1 COGs speed, and provided your code actually fits, you could swallow a whole Prop1 into a single Prop 2 cog

If x8 was deemed to give too little Code RAM, then two Cogs at x4x16 would swallow a Prop 1.

This obviously has serious marketing leverage.

Heater. · 2012-08-27 06:11

Dave Hein,

Actually I think you would have better latency control by using an explicit TASKSW instruction.

No, I do not think so. Explanation below.

If you automatically switch threads after every instruction you could have a situation where all the other threads are executing a rdlong instruction at the same time versus other times where all the other threads are executing a single-cycle instruction.

Yes you could but it does not matter because:

1) Software TASKSW

Imagine we have 4 thread running using the TASKSW to cooperatively switch. No hardware scheduling.

Imagine that those threads have sequences of up to, say, 10 instructions between every TASKSW instruction.

Now imagine thread 1 is in a loop polling for some event, a CNT value or pin change etc, in that loop he does a TASKSW to give threads 2, 3, 4 time to run. Whilst in that TASKSW the other threads are run, each one of them taking up to 10 instructions to release control again. That is to say thread 1 has to wait a max of 40 odd instructions before he gets control again and can service his event. Depending on what the other threads are doing that is a random latency in response time for thread 1 from a hand full of instructions up to 40 or so.

As I said before one can reduce this "blockage" time and improve a threads latency in response to events only by peppering the code with more TASKSW instructions, which in the extreme double code space and halves execution time.

2) Hardware thread scheduling

If the COG is scheduling a different threads instruction to be run every instruction time obviously in our imaginary case above there is no more random blocking from a few instructions time up to 40 instructions time. A thread in a polling loop will be able to start running 4 instructions after he sees his event condition.

Now, there is the issue of those pesky HUB ops like RDLONG that stall everything. In the face of the huge and random thread stalls described above for software threading I think this is a non-issue for hardware scheduling which improves our situation dramatically.

cgracey · 2012-08-28 15:32

The time slicing is now working in the Prop II. Here is some code and a screenshot of it running at 200MHz. Note that when all four tasks are running, jumps only take one clock, as there's no same-thread instructions in the pipeline to cancel. In this example, each task loops every 8 clocks. Were any of these tasks to run solely, they would take 5 clocks (1 for the NOTP and 4 for the JMP). This way, they take just two clocks each.

PUB go

  coginit(0,@tasks,0)

DAT

tasks			org

			jmp	#t0			'task0 starts at $000 (normal operation)
			jmp	#t1			'task1 starts at $001 (if used)
			jmp	#t2			'task2 starts at $002 (if used)
			jmp	#t3			'task3 starts at $003 (if used)


t0			settask	tasklong		'task0 enables more tasks

t0_			notp	#0			'task0
			jmp	#t0_

t1			notp	#1			'task1
			jmp	#t1

t2			notp	#2			'task2
			jmp	#t2

t3			notp	#3			'task3
			jmp	#t3


tasklong		long	%%3210321032103210	'task list is 16 x 2 bits, rotates right by 2

As I got into coding the ROM monitor, I realized that time slicing would make things much nicer, but I had abandoned the idea earlier because it seemed that I would exacerbate a critical path through the Z and C flag circuits. After thinking about it, though, it occurred to me that most of the flag maintenance could be done in the next cycle and I would only need to add one new layer of multiplexing to make it happen. This was really hard to think about, at first, because I didn't realize that MULTIPLE program counters may need updating on the same clock. Now, I've got to write some test code to be sure it's all straight.

Thanks for pushing for this, Everyone. I had already added a POLLVID instruction, so we should be able to make a complete VGA, USB mouse and keyboard driver in one cog.

erco · 2012-08-28 15:50

cgracey wrote: »

I had already added a POLLVID instruction, so we should be able to make a complete VGA, USB mouse and keyboard driver in one cog.

Now THAT is impressive. Way to go Chip!

jmg · 2012-08-28 15:58

cgracey wrote: »

The time slicing is now working in the Prop II. Here is some code and a screenshot of it running at 200MHz. Note that when all four tasks are running, jumps only take one clock, as there's no same-thread instructions in the pipeline to cancel. In this example, each task loops every 8 clocks. Were any of these tasks to run solely, they would take 5 clocks (1 for the NOTP and 4 for the JMP). This way, they take just two clocks each.

WOW !! Now that is seriously impressive!!!

Can you code another one that has differing NOPs packed in each thread, and has uneven Slice allocates, so they all run at different rates ?
That would make the scope more obviously show async thread operation.

One immediate example I can see is Live Debug, which could take one slot of 1/16, and leave 15:16 for the main task, which could either slightly overclock, or have less slack on WAITxx - so a third example with this weighting would be great.

What is the effect of a WAITxx ? - or a SNDSER opcode in one slice ?

Is 4 x 16 tasks, a more efficient fit than 8 x 10 ?

The appeal of a 8 way slice, is then a single COG can 'swallow' a whole Prop1 (code size permitting), but I guess swallow of 8 tasks into 2 COGs is still very impressive (and probably a better code balance )

Sapieha · 2012-08-28 15:59

Hi Chip.

Now only 2 instructions to Rotate Register's trough PIN --- AND we have very fast SERIAL IN-OUT that are suitable for USB emulation !!

cgracey wrote: »
The time slicing is now working in the Prop II. Here is some code and a screenshot of it running at 200MHz. Note that when all four tasks are running, jumps only take one clock, as there's no same-thread instructions in the pipeline to cancel. In this example, each task loops every 8 clocks. Were any of these tasks to run solely, they would take 5 clocks (1 for the NOTP and 4 for the JMP). This way, they take just two clocks each.
PUB go

  coginit(0,@tasks,0)

DAT

tasks            org

            jmp    #t0            'task0 starts at $000 (normal operation)
            jmp    #t1            'task1 starts at $001 (if used)
            jmp    #t2            'task2 starts at $002 (if used)
            jmp    #t3            'task3 starts at $003 (if used)


t0            settask    tasklong        'task0 enables more tasks

t0_            notp    #0            'task0
            jmp    #t0_

t1            notp    #1            'task1
            jmp    #t1

t2            notp    #2            'task2
            jmp    #t2

t3            notp    #3            'task3
            jmp    #t3


tasklong        long    %%3210321032103210    'task list is 16 x 2 bits, rotates right by 2
As I got into coding the ROM monitor, I realized that time slicing would make things much nicer, but I had abandoned the idea earlier because it seemed that I would exacerbate a critical path through the Z and C flag circuits. After thinking about it, though, it occurred to me that most of the flag maintenance could be done in the next cycle and I would only need to add one new layer of multiplexing to make it happen. This was really hard to think about, at first, because I didn't realize that MULTIPLE program counters may need updating on the same clock. Now, I've got to write some test code to be sure it's all straight.

Thanks for pushing for this, Everyone. I had already added a POLLVID instruction, so we should be able to make a complete VGA, USB mouse and keyboard driver in one cog.

Bill Henning · 2012-08-28 16:00

REALLY nice work Chip!

Makes P2 much more powerful; and allows combining many peripherals into each cog.

cgracey wrote: »
The time slicing is now working in the Prop II. Here is some code and a screenshot of it running at 200MHz. Note that when all four tasks are running, jumps only take one clock, as there's no same-thread instructions in the pipeline to cancel. In this example, each task loops every 8 clocks. Were any of these tasks to run solely, they would take 5 clocks (1 for the NOTP and 4 for the JMP). This way, they take just two clocks each.
PUB go

  coginit(0,@tasks,0)

DAT

tasks			org

			jmp	#t0			'task0 starts at $000 (normal operation)
			jmp	#t1			'task1 starts at $001 (if used)
			jmp	#t2			'task2 starts at $002 (if used)
			jmp	#t3			'task3 starts at $003 (if used)


t0			settask	tasklong		'task0 enables more tasks

t0_			notp	#0			'task0
			jmp	#t0_

t1			notp	#1			'task1
			jmp	#t1

t2			notp	#2			'task2
			jmp	#t2

t3			notp	#3			'task3
			jmp	#t3


tasklong		long	%%3210321032103210	'task list is 16 x 2 bits, rotates right by 2
As I got into coding the ROM monitor, I realized that time slicing would make things much nicer, but I had abandoned the idea earlier because it seemed that I would exacerbate a critical path through the Z and C flag circuits. After thinking about it, though, it occurred to me that most of the flag maintenance could be done in the next cycle and I would only need to add one new layer of multiplexing to make it happen. This was really hard to think about, at first, because I didn't realize that MULTIPLE program counters may need updating on the same clock. Now, I've got to write some test code to be sure it's all straight.

Thanks for pushing for this, Everyone. I had already added a POLLVID instruction, so we should be able to make a complete VGA, USB mouse and keyboard driver in one cog.

cgracey · 2012-08-28 16:16

jmg wrote: »

Can you code another one that has differing NOPs packed in each thread, and has uneven Slice allocates, so they all run at different rates ?
That would make the scope more obviously show async thread operation.

I changed the TASKLONG value:

PUB go

  coginit(0,@tasks,0)

DAT

tasks			org

			jmp	#t0			'task0 starts at $000 (normal operation)
			jmp	#t1			'task1 starts at $001 (if used)
			jmp	#t2			'task2 starts at $002 (if used)
			jmp	#t3			'task3 starts at $003 (if used)


t0			settask	tasklong		'task0 enables more tasks

t0_			notp	#0			'task0
			jmp	#t0_

t1			notp	#1			'task1
			jmp	#t1

t2			notp	#2			'task2
			jmp	#t2

t3			notp	#3			'task3
			jmp	#t3


tasklong		long	%%3010201020102010	'task list is 16 x 2 bits, rotates right by 2

Note that task0 now gets every other time slot, and twice as many as task1, but it doesn't run twice as fast because much of its pipeline gets thrown away when its JMP executes. You can also see that task3, which gets every 16th time slot executes a loop in 160ns, or 32 clocks - 1 for the NOTP and 1 for the JMP.

cgracey · 2012-08-28 16:33

jmg wrote: »

What is the effect of a WAITxx ? - or a SNDSER opcode in one slice ?

Is 4 x 16 tasks, a more efficient fit than 8 x 10 ?

The SNDSER and RCVSER already have pollable forms, which won't tie things up.

WAITCNT can be gotten around with the new SUBCNT instruction, which has a CMPCNT version:

'
'
' transmit chr (x)
'
tx		getcnt	time		'get the initial time

		setb	x,#8		'set stop bit
		shl	x,#1		'insert start bit

		mov	y,#10		'ready for start + data(8) + stop bits

:loop		add	time,period	'add bit period to time

:wait		cmpcnt	time	wc	'loop until bit period elapsed (accommodates time slicing)
	if_c	jmp	#:wait

		shr	x,#1	wc	'shift out next bit to send
		setpc	tx_pin		'write to tx pin

		djnz	y,#:loop	'next bit

tx_ret		ret

This serial output routine can run as a single-cycle task.

As far as 8 tasks go, I think it might be overkill and would slow things down another degree. Anyone REALLY want EIGHT?

pedward · 2012-08-28 16:41

This per-instruction threading is *in addition* to the existing TASKSW mechanism?

jmg · 2012-08-28 16:47

cgracey wrote: »

As far as 8 tasks go, I think it might be overkill and would slow things down another degree. Anyone REALLY want EIGHT?

It was more psychological than a hard requirement.
Things like 8 COGs, 8 Tasks = 64 slices, and being able to swallow a whole Prop 1 into a Single Prop 2 cog (with size caveats)

If it has a speed cost, then obviously that moves it (very quickly) down the list, but I would certainly code and test it.

A 16 slice granularity has appeal, and that would drop to 10 (or maybe 11?) on a 8 way task.

The other resource is code, 8 way slices mean an average of 64 opcodes/ each of 8 slices, vs 128 opcodes / each of 4
- but it seems you can do a lot in 64 opcodes in a Prop2

edit: eg that Tx Example, looks to need just 11 opcodes, so still has room for FIFO buffers.

I recall doing a design exercise on a uC from China, with a core 3 way hard sliced, and being REALLY annoyed they forgot to allow a 2:1 weighting of just 2 cores.

Bill Henning · 2012-08-28 16:54

If 8 tasks slowed things down by more than the time consumed by them, then definitely stay with 4 tasks.

The advantage to 8 tasks is cases where eight simple peripherals would fit in a cog with very clear simple code like your tx sample - ie four full duplex serial ports; however this is NOT worth it if there is a speed penalty.

cgracey wrote: »
The SNDSER and RCVSER already have pollable forms, which won't tie things up.

WAITCNT can be gotten around with the new SUBCNT instruction, which has a CMPCNT version:
'
'
' transmit chr (x)
'
tx		getcnt	time		'get the initial time

		setb	x,#8		'set stop bit
		shl	x,#1		'insert start bit

		mov	y,#10		'ready for start + data(8) + stop bits

:loop		add	time,period	'add bit period to time

:wait		cmpcnt	time	wc	'loop until bit period elapsed (accommodates time slicing)
	if_c	jmp	#:wait

		shr	x,#1	wc	'shift out next bit to send
		setpc	tx_pin		'write to tx pin

		djnz	y,#:loop	'next bit

tx_ret		ret
This serial output routine can run as a single-cycle task.

As far as 8 tasks go, I think it might be overkill and would slow things down another degree. Anyone REALLY want EIGHT?

cgracey · 2012-08-28 16:56

pedward wrote: »

This per-instruction threading is *in addition* to the existing TASKSW mechanism?

Correct.

jmg · 2012-08-28 17:02

cgracey wrote: »

tasklong long %%3010201020102010 'task list is 16 x 2 bits, rotates right by 2

Note that task0 now gets every other time slot, and twice as many as task1, but it doesn't run twice as fast because much of its pipeline gets thrown away when its JMP executes. You can also see that task3, which gets every 16th time slot executes a loop in 160ns, or 32 clocks - 1 for the NOTP and 1 for the JMP.

{paste} Note that when all four tasks are running, jumps only take one clock, as there's no same-thread instructions in the pipeline to cancel.

Just to clarify, that task0 not doubled, is not an inter-slice effect, but a own-slice effect ? and it is a result of the pasted comment.

So a thread with 2 or 3 NOPs would become slice-proportional again, as it avoids pipeline effects.
This also means users should spread their slices, rather than pack them ?

jmg · 2012-08-28 17:04

pedward wrote: »

This per-instruction threading is *in addition* to the existing TASKSW mechanism?

Yes, this is great, deterministic, hard time-slice, rather than software hand-over time sharing.

cgracey · 2012-08-28 17:14

jmg wrote: »

Just to clarify, that task0 not doubled, is not an inter-slice effect, but a own-slice effect ? and it is a result of the pasted comment.

So a thread with 2 or 3 NOPs would become slice-proportional again, as it avoids pipeline effects.
This also means users should spread their slices, rather than pack them ?

Right. Whenever a JMP occurs, anything in the pipeline which belongs to that task must be cancelled, as those instructions are not going to execute, due to the change in the program counter. That task0 didn't double was because every other instruction in the pipe belonged to it, so that when its JMP executed, the second instruction after it (another's was in-between) got thrown away, pulling it back towards single-thread performance.

It's true that if you can spread your tasks apart time-wise by having all four run in sequence, nothing with ever get thrown away. As long as no task runs more often than every fourth clock, there's no waste. Makes me want to see about 8.

pedward · 2012-08-28 17:25

I guess if you did 8 and had a 4 bit field, 8*4=32 so you would be able to assign 8 threads in a given order. However, with 4 you get to weight the threads very flexibly.

I guess, from the code, that once you hit the SETTASK instruction, that COG forever executes those 4 tasks? Is there anyway to break out of that? I assume it implements a round-robin approach with the scheduling based on the tasklong?

So lemme guess:

The program starts, the first 4 jmps are JMPRET locations to store the instruction pointers of the tasks. You call SETTASK and it iterates through the list of tasks *once*, then drops through and you execute from the top with a JMP instruction?

jmg · 2012-08-28 17:27

pedward wrote: »

I guess, from the code, that once you hit the SETTASK instruction, that COG forever executes those 4 tasks? Is there anyway to break out of that?

Reload the SetTask(Slice) register ?

jmg · 2012-08-28 17:40

pedward wrote: »

I guess if you did 8 and had a 4 bit field, 8*4=32 so you would be able to assign 8 threads in a given order. However, with 4 you get to weight the threads very flexibly.

? Weighting would done be the same as shown, a simple "circular rotate and choose", not an additional look-up.

32 bits allows 16 slots to choose 1 of 4, or it can allow 10 slots to chose 1 of 8, or even 11 slots, to chose 1:8 in 10, and 1:4 in 11,
ie the 11th slice has 2 bits, so bit 33 is inferred as 0 or 1.

Only wrinkle in this, is if someone wanted a perfectly balanced resource slice per task, you would use 80% of horsepower.
Not sure if that is a big issue ? A Shift modulus would solve that, but it is more register bits...

There might be a case where someone wanted an exact spread of the CPU over 3,5,6,7 tasks ? N/16 or N/8 does not quite allow that, but a Shift modulus does.

Edit: If this was important, the 2 spare bits could encode a shift modulus, to allow :
00-> Mod 6 - Supports 1/2 1/3 N/6 slices
10-> Mod 8 - Supports 1/2 1/4 N/8 slices
11-> Mod 10 - Supports 1/2 1/5 N/10 slices ( choice of 8 slices supported )
One more for 7 or 9 ? - Perhaps 7, as someone might want 7 equal threads, but they cannot
balance 9, and N/10 is there for finer granularity cases ?
10-> Mod 7 - Supports N/7 slices

I see debug as an important use of this, and with a 10 or 11 slot, you need 10% or 9% overclock, to give the same average opcode bandwidth AND include your debug stub. (which could be live-watch of a selected variable list, for one example )

pedward · 2012-08-28 18:02

With 8 slots:

4 x 2 slots -- 01230123
3 x 1 slot 1 x 4 slots -- 01020304
5 x 1 slot 1 x 3 slots -- 01023045
6 x 1 slot 1 x 2 slots -- 01230456
8 x 1 slot -- 01234567

The only options which give you even slicing are 3 x 1 x 4, 4 x 2, 6 x 1 x 2, and 8 x 1.

Propeller II

Comments