Shop OBEX P1 Docs P2 Docs Learn Events
Preemptive Multithreading Example — Parallax Forums

Preemptive Multithreading Example

cgraceycgracey Posts: 14,151
edited 2014-03-17 15:06 in Propeller 2
Here is a switcher I wrote to test the T3LOAD/T3SAVE instructions, which load and save all of task3's state data, plus DCACHE status and INDA/INDB. This switcher also saves and restores task3's LIFO stack and the WIDE registers, for a pretty thorough program context. It also remaps 16 blocks of 16 registers, so that each of the 16 threaded programs has its own register pool from $000..$00F.

At 80MHz, a complete thread switch takes about 750ns. This switcher changes contexts every 10us to run the 16 programs in a round-robin pattern.
' Preemptive Multithreading Demo
' - runs 16 threaded programs
' - use F11 to download


DAT		orgh	$380			'pad to $E00
		org				'cog assembler mode


' 16 blocks of 16 regsiters addressable from $000..$00F by each threaded program

		long	0[16*16]		'(nops, initially)


' Thread switcher

		setwide	$1E8			'locate wide registers from $1E8..$1EF
		getcnt	time			'get initial time


:loop		setmap	thread_cnt,#%00_100_100	'remap registers for next program thread

		locptra	#context0		'get address of next 3-wide context
		mov	x,thread_cnt
		mul	x,#3<<5
		addptra	x

		rdwideq	ptra[0]			'load task3 state data
		t3load

		rdwideq	ptra[1]			'load task3 lifo stack without affecting dcache status
		tpush	$1EB,#3
		tpush	$1EA,#3
		tpush	$1E9,#3
		tpush	$1E8,#3

		rdwideq	ptra[2]			'load wides without affecting dcache status

		settask	thread_on		'give task3 15/16 slots

		add	time,time_delta		'wait for switch time
		passcnt	time

		settask	thread_off		'give task0 16/16 slots, task3 will empty from pipeline before next task0 executes

		wrwide	ptra[2]			'task3 dormant now, save wides

		tpop	$1E8,#3			'save task3 lifo stack
		tpop	$1E9,#3
		tpop	$1EA,#3
		tpop	$1EB,#3
		wrwide	ptra[1]

		t3save				'save task3 state data
		wrwide	ptra[0]

		incmod	thread_cnt,#15		'next program thread
		jmp	#:loop


' Thread switcher variables

time		long	0
time_delta	long	80_000_000/100_000	'context switches occur at 100KHz, threads run for ~9.25us, switches take ~0.75us
thread_cnt	long	0
thread_on	long	%%3333333333333330	'task3 gets 15 of 16 slots
thread_off	long	%%0000000000000000	'task0 gets 16 of 16 slots, task3 is starved of slots
x		long	0


' Program thread contexts - 3 wides per context

		orgh				'hub assembler mode
		orgw				'align to next wide

		'	t3load/t3save			lifo		wide
context0	long	@program0>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context1	long	@program1>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context2	long	@program2>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context3	long	@program3>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context4	long	@program4>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context5	long	@program5>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context6	long	@program6>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context7	long	@program7>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context8	long	@program8>>2,0,0,0,0,0,0,0,	0[8],		0[8]
context9	long	@program9>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextA	long	@programA>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextB	long	@programB>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextC	long	@programC>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextD	long	@programD>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextE	long	@programE>>2,0,0,0,0,0,0,0,	0[8],		0[8]
contextF	long	@programF>>2,0,0,0,0,0,0,0,	0[8],		0[8]


' Each of the following programs has virtual uninterrupted use of:
'
'	registers 0..15
'	the wide registers at $1E8, including dcache function
'	inda/indb
'	all task3 resources
'
' Use of big math circuits should be encapsulated within tlock/tfree instructions.

program0	notp	#0
		jmp	@program0

program1	notp	#1
		jmp	@program1

program2	notp	#2
		jmp	@program2

program3	notp	#3
		jmp	@program3

program4	notp	#4
		jmp	@program4

program5	notp	#5
		jmp	@program5

program6	notp	#6
		jmp	@program6

program7	notp	#7
		jmp	@program7

program8	notp	#8
		jmp	@program8

program9	notp	#9
		jmp	@program9

programA	notp	#10
		jmp	@programA

programB	notp	#11
		jmp	@programB

programC	notp	#12
		jmp	@programC

programD	notp	#13
		jmp	@programD

programE	notp	#14
		jmp	@programE

programF	notp	#15
		jmp	@programF

«13

Comments

  • ozpropdevozpropdev Posts: 2,792
    edited 2014-03-11 05:01
    Looks great! :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 05:18
    Chip,

    That looks great - nice and simple, easy to read.
  • cgraceycgracey Posts: 14,151
    edited 2014-03-11 05:25
    Chip,

    That looks great - nice and simple, easy to read.


    I'm changing the RDWIDEQ instruction so that it waits for three more clocks, allowing us to get rid of the WAITs. I'll just change that in the code listing now... done. I also got rid of the '#0,' write mask in the WRWIDE instructions, since I'll have the assembler be smart about that.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 05:33
    Even better - that removes the need for wait #2's
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 05:41
    Here is a version of Chip's switcher, modified for cooperative multi-threading:
    ' Cooperative Multithreading Demo (slight modification of preemptive demo), untested
    ' - runs 16 threaded programs
    ' - use F11 to download
    
    CON     
                    YIELD = 1
    
    DAT		orgh	$380			'pad to $E00
    		org				'cog assembler mode
    
    
    ' 16 blocks of 16 regsiters addressable from $000..$00F by each threaded program
    
    		long	0[16*16]		'(nops, initially)
    
    
    ' Thread switcher
    
    		setwide	$1E8			'locate wide registers from $1E8..$1EF
    
    :loop		setmap	thread_cnt,#%00_100_100	'remap registers for next program thread
    
    		locptra	#context0		'get address of next 3-wide context
    		mov	x,thread_cnt
    		mul	x,#3<<5
    		addptra	x
    
    		rdwideq	ptra[0]			'load task3 state data
    		t3load
    
    		rdwideq	ptra[1]			'load task3 lifo stack without affecting dcache status
    		tpush	$1EB,#3
    		tpush	$1EA,#3
    		tpush	$1E9,#3
    		tpush	$1E8,#3
    
    		rdwideq	ptra[2]			'load wides without affecting dcache status
    
                    mov         t3request, #0           ' release YIELD, moved after Chip's correct comment
    
    		settask	thread_on		'give task3 15/16 slots
    
                    JZ            t3request,@$          ' wait for request
    
    		settask	thread_off		'give task0 16/16 slots, task3 will empty from pipeline before next task0 executes
    
    		wrwide	ptra[2]			'task3 dormant now, save wides
    
    		tpop	$1E8,#3			'save task3 lifo stack
    		tpop	$1E9,#3
    		tpop	$1EA,#3
    		tpop	$1EB,#3
    		wrwide	ptra[1]
    
    		t3save				'save task3 state data
    		wrwide	ptra[0]
    
    		incmod	thread_cnt,#15		'next program thread
    		jmp	#:loop
    
    
    ' Thread switcher variables
    
    t3request long 0
    
    thread_cnt	long	0
    thread_on	long	%%3333333333333330	'task3 gets 15 of 16 slots
    thread_off	long	%%0000000000000000	'task0 gets 16 of 16 slots, task3 is starved of slots
    x		long	0
    
    
    ' Program thread contexts - 3 wides per context
    
    		orgh				'hub assembler mode
    		orgw				'align to next wide
    
    		'	t3load/t3save			lifo		wide
    context0	long	@program0>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context1	long	@program1>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context2	long	@program2>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context3	long	@program3>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context4	long	@program4>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context5	long	@program5>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context6	long	@program6>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context7	long	@program7>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context8	long	@program8>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    context9	long	@program9>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextA	long	@programA>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextB	long	@programB>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextC	long	@programC>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextD	long	@programD>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextE	long	@programE>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    contextF	long	@programF>>2,0,0,0,0,0,0,0,	0[8],		0[8]
    
    
    ' Each of the following programs has virtual uninterrupted use of:
    '
    '	registers 0..15
    '	the wide registers at $1E8, including dcache function
    '	inda/indb
    '	all task3 resources
    '
    ' Use of big math circuits should be encapsulated within tlock/tfree instructions.
    
    program0	notp	#0
                    tcheck  t3request,#YIELD
    		jmp	@program0
    
    program1	notp	#1
                    tcheck  t3request,#YIELD
    		jmp	@program1
    
    program2	notp	#2
                    tcheck  t3request,#YIELD
    		jmp	@program2
    
    program3	notp	#3
                    tcheck  t3request,#YIELD
    		jmp	@program3
    
    program4	notp	#4
                    tcheck  t3request,#YIELD
    		jmp	@program4
    
    program5	notp	#5
                    tcheck  t3request,#YIELD
    		jmp	@program5
    
    program6	notp	#6
                    tcheck  t3request,#YIELD
    		jmp	@program6
    
    program7	notp	#7
                    tcheck  t3request,#YIELD
    		jmp	@program7
    
    program8	notp	#8
                    tcheck  t3request,#YIELD
    		jmp	@program8
    
    program9	notp	#9
                    tcheck  t3request,#YIELD
    		jmp	@program9
    
    programA	notp	#10
                    tcheck  t3request,#YIELD
    		jmp	@programA
    
    programB	notp	#11
                    tcheck  t3request,#YIELD
    		jmp	@programB
    
    programC	notp	#12
                    tcheck  t3request,#YIELD
    		jmp	@programC
    
    programD	notp	#13
                    tcheck  t3request,#YIELD
    		jmp	@programD
    
    programE	notp	#14
                    tcheck  t3request,#YIELD
    		jmp	@programE
    
    programF	notp	#15
                    tcheck  t3request,#YIELD
    		jmp	@programF
    
    
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-11 05:47
    Chip,
    Looks really impressive!
  • cgraceycgracey Posts: 14,151
    edited 2014-03-11 05:55
    Here is a version of Chip's switcher, modified for cooperative multi-threading:


    Bill, I think you forgot to put the 'JZ t3request,@$' instruction before the 'MOV t3request,#0'.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 06:05
    OOPS!

    Thanks, you are right of course.

    That should teach me about trying to modify code before coffee...
    cgracey wrote: »
    Bill, I think you forgot to put the 'JZ t3request,@$' instruction before the 'MOV t3request,#0'.
  • cgraceycgracey Posts: 14,151
    edited 2014-03-11 06:11
    OOPS!

    Thanks, you are right of course.

    That should teach me about trying to modify code before coffee...


    I made your changes and it runs as expected.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 06:16
    Thanks!

    Now the people who prefer cooperative threading can easily do so.

    I am really looking forward to all the neat features we can implement with TCHECK.
  • roglohrogloh Posts: 5,786
    edited 2014-03-11 06:22
    cgracey wrote: »
    I made your changes and it runs as expected.

    Great effort Chip. All proceeding very nicely right now. It will pay dividends down the track. :thumb:
  • ctwardellctwardell Posts: 1,716
    edited 2014-03-11 06:36
    We're going to be like kids in a candy store!

    C.W.
  • cgraceycgracey Posts: 14,151
    edited 2014-03-11 06:39
    OOPS!

    Thanks, you are right of course.

    That should teach me about trying to modify code before coffee...


    Actually, it needs to be a little different than what either of us thought. You'd want to clear the 'request' register BEFORE starting the thread by giving it time slots. Then, you'd wait for it to go non-zero. When it's non-zero, then you switch threads. The THALT will be cleared next time around.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 06:48
    You are absolutely right!

    I fixed the example, and I am currently drinking a BIG cup of strong coffee...
    cgracey wrote: »
    Actually, it needs to be a little different than what either of us thought. You'd want to clear the 'request' register BEFORE starting the thread by giving it time slots. Then, you'd wait for it to go non-zero. When it's non-zero, then you switch threads. The THALT will be cleared next time around.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-11 07:17
    ctwardell wrote: »
    We're going to be like kids in a candy store!

    C.W.
    Couldn't agree more!

    There is certainly going to be a lot to try with the coming release ;)
  • RaymanRayman Posts: 14,640
    edited 2014-03-11 11:54
    Can one combine this multithreading with hubexec mode?
    I mean, can these 16 threads be running code from the hub?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 12:00
    Yes :)
    Rayman wrote: »
    Can one combine this multithreading with hubexec mode?
    I mean, can these 16 threads be running code from the hub?
  • RaymanRayman Posts: 14,640
    edited 2014-03-11 12:08
    Thanks Bill. Also, is the choice of 16 for # threads in this example is arbitrary, right?
    Is there any limit on # of threads this way? I think the other way has a limit of 4 threads...

    BTW: Why did Chip pick task#3 for this?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 12:12
    You are welcome.

    The number of threads is arbitrary... thousands of tiny threads are possible.

    Only a single task supports saving its whole state, and I am guessing that Chip chose the last one so that 1&2 could theoretically run cog drivers.

    Personally, I expect I will always run the scheduler/debugger in #0, not use 1&2, and use #3 for threads - by not using #1&#2, #3 will get much more benefit from the single dcache line and the four icache lines.

    But in case we run out of tasks, it is nice to have the capability to run drivers in 1&2 (sharing the scheduler/thread cog)
    Rayman wrote: »
    Thanks Bill. Also, is the choice of 16 for # threads in this example is arbitrary, right?
    Is there any limit on # of threads this way? I think the other way has a limit of 4 threads...

    BTW: Why did Chip pick task#3 for this?
  • pjvpjv Posts: 1,903
    edited 2014-03-11 12:14
    Hi Bill;

    Say, I'm curious how fast that will run (exclusive of each task's own code) if you had 4 cooperative tasks running at once.

    Cheers,

    Peter (pjv)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 12:36
    There would be a fixed overhead of 1/16 cycles for the scheduler task, after that Chip's thread save/restore takes 750ns @ 80Mhz, which translates to 300ns @ 200Mhz

    It would depend on how often a context switch occurred.

    Note that faster context saves are possible if you don't save the LIFO or the WIDE for the thread (in case you don't need to)

    Time to schedule = (200*1/16) overhead + num_context_switches * .3us

    So if you had 1000 context switches in a second, 300us for context switches, plus 65,500us for the scheduler idling... total of 6.58% overhead

    46.71MIPS/thread assuming each thread used 100% of its cycles

    If two of the threads only used 10% of their cycles, the other two would run a at about 80MIPS each.

    Four hardware tasks would run at 50MIPS/task

    To me, the more interesting case is being able to run tons of slower threads (so many threads can wait on sockets & USB endpoints)
    pjv wrote: »
    Hi Bill;

    Say, I'm curious how fast that will run (exclusive of each task's own code) if you had 4 cooperative tasks running at once.

    Cheers,

    Peter (pjv)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 12:40
    FYI,

    If someone wanted, they could run cooperative threads only in task#3

    If they did that, by having a "YIELD" subroutine in the cog memory for task#3 they would not need the scheduler thread, and would only take the hit of thread state saving/restoring, avoiding the 1/16 cycle cost of the scheduler.

    Such theads would not however be able to pre-emptive.

    What I like about Chip's implementation is that a developer can use cooperative or pre-emptive threading - whichever is more appropriate to the problem the developer is solving.
  • potatoheadpotatohead Posts: 10,261
    edited 2014-03-11 13:02
    Yes, I like that too.

    Wow! Now, that whole discussion was painful, but the result is nice and clean. Well done all around.

    And, the niceness of the PASM syntax for data really shines here, just saying.
  • pjvpjv Posts: 1,903
    edited 2014-03-11 13:02
    Hi Bill;

    Perhaps I should as the question with a specific simple example:

    How fast could 4 independent cooperative threads run where each thread is a simple XOR outa,BitN toggling a different port bit. With and without the optional saves you mentioned.

    Cheers,

    Peter (pjv)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 13:38
    With Chip's full state save, at 200Mhz, cooperative

    .3us per fask switch, ~30ns per task... .33us total... approx. 3M task swaps per sec

    With minimal state save (only PC etc wide, no LIFO, no WIDE save) approximately 50 cycles for task swap, 6 for task

    approx. 16M+ task swaps per sec

    So hooking up an LED to each thread...

    with full state save: ~.38Mhz square wave per LED

    with minimal state save: ~2Mhz square wave per LED

    with four hardware tasks using REPD with a count set at 2^32 - 1

    ~6.25Mhz square wave per LED

    Cheating version:

    Putting the four XOR's into a REPD block set to 2^32 - 1 repeats:

    ~25Mhz square wave per LED

    save/restore time dominates the non-cheating versions.
  • pjvpjv Posts: 1,903
    edited 2014-03-11 13:58
    Bill Thanks;

    But a little thick in the head here..... does your statement "square wave per LED" mean that all 4 LED pins are toggling with ~ 2.6 uSec high, then 2.6 uSec low for the full save case, and ~ 0.5 uSec high with 0.5 uSec low for the minimal case ?

    I want to be sure to not mis-interpret your statements as that can be easily done in these cases.

    Cheers,

    Peter (pjv)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 14:02
    Yep, each LED toggling independently.

    FYI, the above numbers were quick off the top of my head approximations (except the cheating case); once we have the new FPGA image it will be possible to try out all sorts of schemes (albeit at 80Mhz)
    pjv wrote: »
    Bill Thanks;

    But a little thick in the head here..... does your statement "square wave per LED" mean that all 4 LED pins are toggling with ~ 2.6 uSec high, then 2.6 uSec low for the full save case, and ~ 0.5 uSec high with 0.5 uSec low for the minimal case ?

    I want to be sure to not mis-interpret your statements as that can be easily done in these cases.

    Cheers,

    Peter (pjv)
  • RaymanRayman Posts: 14,640
    edited 2014-03-11 16:39
    Having this Preemptive Multithreading opens the door to running an OS like Android, right?
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-11 16:48
    Not without some XLMM type kernel - 256KB is not nearly enough for Android.

    For P3, I am sure there will be some kind of execute from DDR2/3/?, but it is waaaaaay too much to try to squeeze into P2.

    An LMM-like kernel for the SDRAM on the P2 module is possible, trading a performance hit for much larger programs and a simple OS (uCLinux?)
    Rayman wrote: »
    Having this Preemptive Multithreading opens the door to running an OS like Android, right?
  • RaymanRayman Posts: 14,640
    edited 2014-03-11 18:12
    I wonder if you could swap memory in and out from SDRAM to HUB for each of these threads...
    Maybe that would take too long, but if you could, you could then execute from a vast pool of memory...
Sign In or Register to comment.