Preemptive Multithreading Example

cgracey · 2014-03-11 04:56

Here is a switcher I wrote to test the T3LOAD/T3SAVE instructions, which load and save all of task3's state data, plus DCACHE status and INDA/INDB. This switcher also saves and restores task3's LIFO stack and the WIDE registers, for a pretty thorough program context. It also remaps 16 blocks of 16 registers, so that each of the 16 threaded programs has its own register pool from $000..$00F.

At 80MHz, a complete thread switch takes about 750ns. This switcher changes contexts every 10us to run the 16 programs in a round-robin pattern.

' Preemptive Multithreading Demo
' - runs 16 threaded programs
' - use F11 to download


DAT		orgh	$380			'pad to $E00
		org				'cog assembler mode


' 16 blocks of 16 regsiters addressable from $000..$00F by each threaded program

		long	0[16*16]		'(nops, initially)


' Thread switcher

		setwide	$1E8			'locate wide registers from $1E8..$1EF
		getcnt	time			'get initial time


:loop		setmap	thread_cnt,#%00_100_100	'remap registers for next program thread

		locptra	#context0		'get address of next 3-wide context
		mov	x,thread_cnt
		mul	x,#3<<5
		addptra	x

		rdwideq	ptra[0]			'load task3 state data
		t3load

		rdwideq	ptra[1]			'load task3 lifo stack without affecting dcache status
		tpush	$1EB,#3
		tpush	$1EA,#3
		tpush	$1E9,#3
		tpush	$1E8,#3

		rdwideq	ptra[2]			'load wides without affecting dcache status

		settask	thread_on		'give task3 15/16 slots

		add	time,time_delta		'wait for switch time
		passcnt	time

		settask	thread_off		'give task0 16/16 slots, task3 will empty from pipeline before next task0 executes

		wrwide	ptra[2]			'task3 dormant now, save wides

		tpop	$1E8,#3			'save task3 lifo stack
		tpop	$1E9,#3
		tpop	$1EA,#3
		tpop	$1EB,#3
		wrwide	ptra[1]

		t3save				'save task3 state data
		wrwide	ptra[0]

		incmod	thread_cnt,#15		'next program thread
		jmp	#:loop


' Thread switcher variables

time		long	0
time_delta	long	80_000_000/100_000	'context switches occur at 100KHz, threads run for ~9.25us, switches take ~0.75us
thread_cnt	long	0
thread_on	long	%%3333333333333330	'task3 gets 15 of 16 slots
thread_off	long	%%0000000000000000	'task0 gets 16 of 16 slots, task3 is starved of slots
x		long	0


' Program thread contexts - 3 wides per context

		orgh				'hub assembler mode
		orgw				'align to next wide

		'	t3load/t3save			lifo		wide
context0	long	@program0>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context1	long	@program1>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context2	long	@program2>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context3	long	@program3>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context4	long	@program4>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context5	long	@program5>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context6	long	@program6>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context7	long	@program7>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context8	long	@program8>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context9	long	@program9>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextA	long	@programA>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextB	long	@programB>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextC	long	@programC>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextD	long	@programD>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextE	long	@programE>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextF	long	@programF>>2,0,0,0,0,0,0,0,	0[8],		0[8]


' Each of the following programs has virtual uninterrupted use of:
'
'	registers 0..15
'	the wide registers at $1E8, including dcache function
'	inda/indb
'	all task3 resources
'
' Use of big math circuits should be encapsulated within tlock/tfree instructions.

program0	notp	#0
		jmp	@program0

program1	notp	#1
		jmp	@program1

program2	notp	#2
		jmp	@program2

program3	notp	#3
		jmp	@program3

program4	notp	#4
		jmp	@program4

program5	notp	#5
		jmp	@program5

program6	notp	#6
		jmp	@program6

program7	notp	#7
		jmp	@program7

program8	notp	#8
		jmp	@program8

program9	notp	#9
		jmp	@program9

programA	notp	#10
		jmp	@programA

programB	notp	#11
		jmp	@programB

programC	notp	#12
		jmp	@programC

programD	notp	#13
		jmp	@programD

programE	notp	#14
		jmp	@programE

programF	notp	#15
		jmp	@programF

ozpropdev · 2014-03-11 05:01

Looks great!

Bill Henning · 2014-03-11 05:18

Chip,

That looks great - nice and simple, easy to read.

cgracey · 2014-03-11 05:25

Bill Henning wrote: »

Chip,

That looks great - nice and simple, easy to read.

I'm changing the RDWIDEQ instruction so that it waits for three more clocks, allowing us to get rid of the WAITs. I'll just change that in the code listing now... done. I also got rid of the '#0,' write mask in the WRWIDE instructions, since I'll have the assembler be smart about that.

Bill Henning · 2014-03-11 05:33

Even better - that removes the need for wait #2's

Bill Henning · 2014-03-11 05:41

Here is a version of Chip's switcher, modified for cooperative multi-threading:

' Cooperative Multithreading Demo (slight modification of preemptive demo), untested
' - runs 16 threaded programs
' - use F11 to download

CON     
                YIELD = 1

DAT		orgh	$380			'pad to $E00
		org				'cog assembler mode


' 16 blocks of 16 regsiters addressable from $000..$00F by each threaded program

		long	0[16*16]		'(nops, initially)


' Thread switcher

		setwide	$1E8			'locate wide registers from $1E8..$1EF

:loop		setmap	thread_cnt,#%00_100_100	'remap registers for next program thread

		locptra	#context0		'get address of next 3-wide context
		mov	x,thread_cnt
		mul	x,#3<<5
		addptra	x

		rdwideq	ptra[0]			'load task3 state data
		t3load

		rdwideq	ptra[1]			'load task3 lifo stack without affecting dcache status
		tpush	$1EB,#3
		tpush	$1EA,#3
		tpush	$1E9,#3
		tpush	$1E8,#3

		rdwideq	ptra[2]			'load wides without affecting dcache status

                mov         t3request, #0           ' release YIELD, moved after Chip's correct comment

		settask	thread_on		'give task3 15/16 slots

                JZ            t3request,@$          ' wait for request

		settask	thread_off		'give task0 16/16 slots, task3 will empty from pipeline before next task0 executes

		wrwide	ptra[2]			'task3 dormant now, save wides

		tpop	$1E8,#3			'save task3 lifo stack
		tpop	$1E9,#3
		tpop	$1EA,#3
		tpop	$1EB,#3
		wrwide	ptra[1]

		t3save				'save task3 state data
		wrwide	ptra[0]

		incmod	thread_cnt,#15		'next program thread
		jmp	#:loop


' Thread switcher variables

t3request long 0

thread_cnt	long	0
thread_on	long	%%3333333333333330	'task3 gets 15 of 16 slots
thread_off	long	%%0000000000000000	'task0 gets 16 of 16 slots, task3 is starved of slots
x		long	0


' Program thread contexts - 3 wides per context

		orgh				'hub assembler mode
		orgw				'align to next wide

		'	t3load/t3save			lifo		wide
context0	long	@program0>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context1	long	@program1>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context2	long	@program2>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context3	long	@program3>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context4	long	@program4>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context5	long	@program5>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context6	long	@program6>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context7	long	@program7>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context8	long	@program8>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context9	long	@program9>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextA	long	@programA>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextB	long	@programB>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextC	long	@programC>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextD	long	@programD>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextE	long	@programE>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextF	long	@programF>>2,0,0,0,0,0,0,0,	0[8],		0[8]


' Each of the following programs has virtual uninterrupted use of:
'
'	registers 0..15
'	the wide registers at $1E8, including dcache function
'	inda/indb
'	all task3 resources
'
' Use of big math circuits should be encapsulated within tlock/tfree instructions.

program0	notp	#0
                tcheck  t3request,#YIELD
		jmp	@program0

program1	notp	#1
                tcheck  t3request,#YIELD
		jmp	@program1

program2	notp	#2
                tcheck  t3request,#YIELD
		jmp	@program2

program3	notp	#3
                tcheck  t3request,#YIELD
		jmp	@program3

program4	notp	#4
                tcheck  t3request,#YIELD
		jmp	@program4

program5	notp	#5
                tcheck  t3request,#YIELD
		jmp	@program5

program6	notp	#6
                tcheck  t3request,#YIELD
		jmp	@program6

program7	notp	#7
                tcheck  t3request,#YIELD
		jmp	@program7

program8	notp	#8
                tcheck  t3request,#YIELD
		jmp	@program8

program9	notp	#9
                tcheck  t3request,#YIELD
		jmp	@program9

programA	notp	#10
                tcheck  t3request,#YIELD
		jmp	@programA

programB	notp	#11
                tcheck  t3request,#YIELD
		jmp	@programB

programC	notp	#12
                tcheck  t3request,#YIELD
		jmp	@programC

programD	notp	#13
                tcheck  t3request,#YIELD
		jmp	@programD

programE	notp	#14
                tcheck  t3request,#YIELD
		jmp	@programE

programF	notp	#15
                tcheck  t3request,#YIELD
		jmp	@programF

Cluso99 · 2014-03-11 05:47

Chip,
Looks really impressive!

cgracey · 2014-03-11 05:55

Bill Henning wrote: »

Here is a version of Chip's switcher, modified for cooperative multi-threading:

Bill, I think you forgot to put the 'JZ t3request,@$' instruction before the 'MOV t3request,#0'.

Bill Henning · 2014-03-11 06:05

OOPS!

Thanks, you are right of course.

That should teach me about trying to modify code before coffee...

cgracey wrote: »

Bill, I think you forgot to put the 'JZ t3request,@$' instruction before the 'MOV t3request,#0'.

cgracey · 2014-03-11 06:11

Bill Henning wrote: »

OOPS!

Thanks, you are right of course.

That should teach me about trying to modify code before coffee...

I made your changes and it runs as expected.

Bill Henning · 2014-03-11 06:16

Thanks!

Now the people who prefer cooperative threading can easily do so.

I am really looking forward to all the neat features we can implement with TCHECK.

rogloh · 2014-03-11 06:22

cgracey wrote: »

I made your changes and it runs as expected.

Great effort Chip. All proceeding very nicely right now. It will pay dividends down the track. :thumb:

ctwardell · 2014-03-11 06:36

We're going to be like kids in a candy store!

C.W.

cgracey · 2014-03-11 06:39

Bill Henning wrote: »

OOPS!

Thanks, you are right of course.

That should teach me about trying to modify code before coffee...

Actually, it needs to be a little different than what either of us thought. You'd want to clear the 'request' register BEFORE starting the thread by giving it time slots. Then, you'd wait for it to go non-zero. When it's non-zero, then you switch threads. The THALT will be cleared next time around.

Bill Henning · 2014-03-11 06:48

You are absolutely right!

I fixed the example, and I am currently drinking a BIG cup of strong coffee...

cgracey wrote: »

Actually, it needs to be a little different than what either of us thought. You'd want to clear the 'request' register BEFORE starting the thread by giving it time slots. Then, you'd wait for it to go non-zero. When it's non-zero, then you switch threads. The THALT will be cleared next time around.

Cluso99 · 2014-03-11 07:17

ctwardell wrote: »

We're going to be like kids in a candy store!

C.W.

Couldn't agree more!

There is certainly going to be a lot to try with the coming release

Rayman · 2014-03-11 11:54

Can one combine this multithreading with hubexec mode?
I mean, can these 16 threads be running code from the hub?

Bill Henning · 2014-03-11 12:00

Yes

Rayman wrote: »

Can one combine this multithreading with hubexec mode?
I mean, can these 16 threads be running code from the hub?

Rayman · 2014-03-11 12:08

Thanks Bill. Also, is the choice of 16 for # threads in this example is arbitrary, right?
Is there any limit on # of threads this way? I think the other way has a limit of 4 threads...

BTW: Why did Chip pick task#3 for this?

Bill Henning · 2014-03-11 12:12

You are welcome.

The number of threads is arbitrary... thousands of tiny threads are possible.

Only a single task supports saving its whole state, and I am guessing that Chip chose the last one so that 1&2 could theoretically run cog drivers.

Personally, I expect I will always run the scheduler/debugger in #0, not use 1&2, and use #3 for threads - by not using #1&#2, #3 will get much more benefit from the single dcache line and the four icache lines.

But in case we run out of tasks, it is nice to have the capability to run drivers in 1&2 (sharing the scheduler/thread cog)

Rayman wrote: »

Thanks Bill. Also, is the choice of 16 for # threads in this example is arbitrary, right?
Is there any limit on # of threads this way? I think the other way has a limit of 4 threads...

BTW: Why did Chip pick task#3 for this?

pjv · 2014-03-11 12:14

Hi Bill;

Say, I'm curious how fast that will run (exclusive of each task's own code) if you had 4 cooperative tasks running at once.

Cheers,

Peter (pjv)

Bill Henning · 2014-03-11 12:36

There would be a fixed overhead of 1/16 cycles for the scheduler task, after that Chip's thread save/restore takes 750ns @ 80Mhz, which translates to 300ns @ 200Mhz

It would depend on how often a context switch occurred.

Note that faster context saves are possible if you don't save the LIFO or the WIDE for the thread (in case you don't need to)

Time to schedule = (200*1/16) overhead + num_context_switches * .3us

So if you had 1000 context switches in a second, 300us for context switches, plus 65,500us for the scheduler idling... total of 6.58% overhead

46.71MIPS/thread assuming each thread used 100% of its cycles

If two of the threads only used 10% of their cycles, the other two would run a at about 80MIPS each.

Four hardware tasks would run at 50MIPS/task

To me, the more interesting case is being able to run tons of slower threads (so many threads can wait on sockets & USB endpoints)

pjv wrote: »

Hi Bill;

Say, I'm curious how fast that will run (exclusive of each task's own code) if you had 4 cooperative tasks running at once.

Cheers,

Peter (pjv)

Bill Henning · 2014-03-11 12:40

FYI,

If someone wanted, they could run cooperative threads only in task#3

If they did that, by having a "YIELD" subroutine in the cog memory for task#3 they would not need the scheduler thread, and would only take the hit of thread state saving/restoring, avoiding the 1/16 cycle cost of the scheduler.

Such theads would not however be able to pre-emptive.

What I like about Chip's implementation is that a developer can use cooperative or pre-emptive threading - whichever is more appropriate to the problem the developer is solving.

potatohead · 2014-03-11 13:02

Yes, I like that too.

Wow! Now, that whole discussion was painful, but the result is nice and clean. Well done all around.

And, the niceness of the PASM syntax for data really shines here, just saying.

pjv · 2014-03-11 13:02

Hi Bill;

Perhaps I should as the question with a specific simple example:

How fast could 4 independent cooperative threads run where each thread is a simple XOR outa,BitN toggling a different port bit. With and without the optional saves you mentioned.

Cheers,

Peter (pjv)

Bill Henning · 2014-03-11 13:38

With Chip's full state save, at 200Mhz, cooperative

.3us per fask switch, ~30ns per task... .33us total... approx. 3M task swaps per sec

With minimal state save (only PC etc wide, no LIFO, no WIDE save) approximately 50 cycles for task swap, 6 for task

approx. 16M+ task swaps per sec

So hooking up an LED to each thread...

with full state save: ~.38Mhz square wave per LED

with minimal state save: ~2Mhz square wave per LED

with four hardware tasks using REPD with a count set at 2^32 - 1

~6.25Mhz square wave per LED

Cheating version:

Putting the four XOR's into a REPD block set to 2^32 - 1 repeats:

~25Mhz square wave per LED

save/restore time dominates the non-cheating versions.

pjv · 2014-03-11 13:58

Bill Thanks;

But a little thick in the head here..... does your statement "square wave per LED" mean that all 4 LED pins are toggling with ~ 2.6 uSec high, then 2.6 uSec low for the full save case, and ~ 0.5 uSec high with 0.5 uSec low for the minimal case ?

I want to be sure to not mis-interpret your statements as that can be easily done in these cases.

Cheers,

Peter (pjv)

Bill Henning · 2014-03-11 14:02

Yep, each LED toggling independently.

FYI, the above numbers were quick off the top of my head approximations (except the cheating case); once we have the new FPGA image it will be possible to try out all sorts of schemes (albeit at 80Mhz)

pjv wrote: »

Bill Thanks;

But a little thick in the head here..... does your statement "square wave per LED" mean that all 4 LED pins are toggling with ~ 2.6 uSec high, then 2.6 uSec low for the full save case, and ~ 0.5 uSec high with 0.5 uSec low for the minimal case ?

I want to be sure to not mis-interpret your statements as that can be easily done in these cases.

Cheers,

Peter (pjv)

Rayman · 2014-03-11 16:39

Having this Preemptive Multithreading opens the door to running an OS like Android, right?

Bill Henning · 2014-03-11 16:48

Not without some XLMM type kernel - 256KB is not nearly enough for Android.

For P3, I am sure there will be some kind of execute from DDR2/3/?, but it is waaaaaay too much to try to squeeze into P2.

An LMM-like kernel for the SDRAM on the P2 module is possible, trading a performance hit for much larger programs and a simple OS (uCLinux?)

Rayman wrote: »

Having this Preemptive Multithreading opens the door to running an OS like Android, right?

Rayman · 2014-03-11 18:12

I wonder if you could swap memory in and out from SDRAM to HUB for each of these threads...
Maybe that would take too long, but if you could, you could then execute from a vast pool of memory...

Preemptive Multithreading Example

Comments