Preemptive Multithreading Example
cgracey
Here is a switcher I wrote to test the T3LOAD/T3SAVE instructions, which load and save all of task3's state data, plus DCACHE status and INDA/INDB. This switcher also saves and restores task3's LIFO stack and the WIDE registers, for a pretty thorough program context. It also remaps 16 blocks of 16 registers, so that each of the 16 threaded programs has its own register pool from $000..$00F.
At 80MHz, a complete thread switch takes about 750ns. This switcher changes contexts every 10us to run the 16 programs in a round-robin pattern.
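For reference, a quick breakdown of those figures (my arithmetic, not Chip's):

\[ \frac{0.75\,\mu s}{10\,\mu s} = 7.5\%\ \text{of the time spent switching}, \qquad 16 \times 10\,\mu s = 160\,\mu s\ \text{round-robin period} \]

so each of the 16 program threads gets a ~9.25us slice of the cog once every 160us.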
' Preemptive Multithreading Demo
' - runs 16 threaded programs
' - use F11 to download

DAT             orgh    $380                    'pad to $E00
                org                             'cog assembler mode

' 16 blocks of 16 registers addressable from $000..$00F by each threaded program

                long    0[16*16]                '(nops, initially)

' Thread switcher

                setwide $1E8                    'locate wide registers from $1E8..$1EF
                getcnt  time                    'get initial time

:loop           setmap  thread_cnt,#%00_100_100 'remap registers for next program thread

                locptra #context0               'get address of next 3-wide context
                mov     x,thread_cnt
                mul     x,#3<<5
                addptra x

                rdwideq ptra[0]                 'load task3 state data
                t3load

                rdwideq ptra[1]                 'load task3 lifo stack without affecting dcache status
                tpush   $1EB,#3
                tpush   $1EA,#3
                tpush   $1E9,#3
                tpush   $1E8,#3

                rdwideq ptra[2]                 'load wides without affecting dcache status

                settask thread_on               'give task3 15/16 slots

                add     time,time_delta         'wait for switch time
                passcnt time

                settask thread_off              'give task0 16/16 slots, task3 will empty from pipeline before next task0 executes

                wrwide  ptra[2]                 'task3 dormant now, save wides

                tpop    $1E8,#3                 'save task3 lifo stack
                tpop    $1E9,#3
                tpop    $1EA,#3
                tpop    $1EB,#3
                wrwide  ptra[1]

                t3save                          'save task3 state data
                wrwide  ptra[0]

                incmod  thread_cnt,#15          'next program thread

                jmp     #:loop

' Thread switcher variables

time            long    0
time_delta      long    80_000_000/100_000      'context switches occur at 100KHz, threads run for ~9.25us, switches take ~0.75us
thread_cnt      long    0
thread_on       long    %%3333333333333330      'task3 gets 15 of 16 slots
thread_off      long    %%0000000000000000      'task0 gets 16 of 16 slots, task3 is starved of slots
x               long    0

' Program thread contexts - 3 wides per context

                orgh                            'hub assembler mode
                orgw                            'align to next wide

'                       t3load/t3save                   lifo    wide

context0        long    @program0>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context1        long    @program1>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context2        long    @program2>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context3        long    @program3>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context4        long    @program4>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context5        long    @program5>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context6        long    @program6>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context7        long    @program7>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context8        long    @program8>>2,0,0,0,0,0,0,0,     0[8],   0[8]
context9        long    @program9>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextA        long    @programA>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextB        long    @programB>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextC        long    @programC>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextD        long    @programD>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextE        long    @programE>>2,0,0,0,0,0,0,0,     0[8],   0[8]
contextF        long    @programF>>2,0,0,0,0,0,0,0,     0[8],   0[8]

' Each of the following programs has virtual uninterrupted use of:
'
'   registers 0..15
'   the wide registers at $1E8, including dcache function
'   inda/indb
'   all task3 resources
'
' Use of big math circuits should be encapsulated within tlock/tfree instructions.

program0        notp    #0
                jmp     @program0

program1        notp    #1
                jmp     @program1

program2        notp    #2
                jmp     @program2

program3        notp    #3
                jmp     @program3

program4        notp    #4
                jmp     @program4

program5        notp    #5
                jmp     @program5

program6        notp    #6
                jmp     @program6

program7        notp    #7
                jmp     @program7

program8        notp    #8
                jmp     @program8

program9        notp    #9
                jmp     @program9

programA        notp    #10
                jmp     @programA

programB        notp    #11
                jmp     @programB

programC        notp    #12
                jmp     @programC

programD        notp    #13
                jmp     @programD

programE        notp    #14
                jmp     @programE

programF        notp    #15
                jmp     @programF
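For readers who don't follow PASM, here is a rough, compile-and-run C sketch of the switcher's control flow only (not P2 code; thread_ctx, load_context, save_context and run_slice are made-up stand-ins for the rdwideq/t3load/tpush and tpop/t3save/wrwide sequences above):

#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 16

typedef struct {
    uint32_t task_state[8];     /* models the t3load/t3save wide            */
    uint32_t lifo[8];           /* models the saved 4-level LIFO stack wide */
    uint32_t wides[8];          /* models the wide registers at $1E8..$1EF  */
} thread_ctx;

static thread_ctx context[NUM_THREADS];         /* like context0..contextF  */

/* placeholder bodies; on the P2 these are the rdwideq/t3load/tpush and
   tpop/wrwide/t3save sequences in the listing above                        */
static void load_context(thread_ctx *c) { (void)c; }
static void save_context(thread_ctx *c) { (void)c; }
static void run_slice(int i)            { printf("thread %X runs ~9.25us\n", i); }

int main(void)
{
    int i = 0;
    for (int n = 0; n < 2 * NUM_THREADS; n++) { /* two round-robin passes   */
        load_context(&context[i]);  /* remap registers and restore thread i */
        run_slice(i);               /* settask thread_on ... passcnt time   */
        save_context(&context[i]);  /* settask thread_off, then save state  */
        i = (i + 1) % NUM_THREADS;  /* incmod thread_cnt,#15                */
    }
    return 0;
}

The only structural points it captures are the per-thread context array and the incmod-style round-robin index; all the interesting work happens in the real load/save sequences.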
Comments
That looks great - nice and simple, easy to read.
I'm changing the RDWIDEQ instruction so that it waits for three more clocks, allowing us to get rid of the WAITs. I'll just change that in the code listing now... done. I also got rid of the '#0,' write mask in the WRWIDE instructions, since I'll have the assembler be smart about that.
Looks really impressive!
Bill, I think you forgot to put the 'JZ t3request,@$' instruction before the 'MOV t3request,#0'.
Thanks, you are right of course.
That should teach me about trying to modify code before coffee...
I made your changes and it runs as expected.
Now the people who prefer cooperative threading can easily have it.
I am really looking forward to all the neat features we can implement with TCHECK.
Great effort Chip. All proceeding very nicely right now. It will pay dividends down the track. :thumb:
C.W.
Actually, it needs to be a little different than what either of us thought. You'd want to clear the 'request' register BEFORE starting the thread by giving it time slots. Then, you'd wait for it to go non-zero. When it's non-zero, then you switch threads. The THALT will be cleared next time around.
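To see the ordering Chip describes in one place, here is a small, runnable C analogy that uses an atomic flag in place of the 'request' register (the names request and threaded_program are mine, not P2 code, and a pthread stands in for the task3 thread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int request = 1;              /* plays the role of the 'request' register; */
                                            /* starts non-zero, so the thread waits      */

static void *threaded_program(void *arg)    /* stands in for a task3 program thread      */
{
    (void)arg;
    for (int pass = 0; pass < 3; pass++) {
        while (atomic_load(&request) != 0)  /* wait until the switcher gives us slots    */
            ;
        /* ...do some work... */
        atomic_store(&request, 1);          /* cooperative yield: raise the request flag */
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, threaded_program, NULL);

    for (int sw = 0; sw < 3; sw++) {
        atomic_store(&request, 0);          /* clear 'request' BEFORE starting the thread */
        while (atomic_load(&request) == 0)  /* wait for non-zero, like JZ t3request,@$    */
            ;
        printf("thread yielded, doing switch #%d\n", sw);
        /* the real switcher would now take task3's slots away and save/restore context   */
    }
    pthread_join(t, NULL);
    return 0;
}

The key point is the ordering: the flag is cleared before the thread gets its slots, so a yield can never be lost.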
I fixed the example, and I am currently drinking a BIG cup of strong coffee...
There is certainly going to be a lot to try with the coming release
I mean, can these 16 threads be running code from the hub?
Is there any limit on # of threads this way? I think the other way has a limit of 4 threads...
BTW: Why did Chip pick task#3 for this?
The number of threads is arbitrary... thousands of tiny threads are possible.
Only a single task supports saving its whole state, and I am guessing that Chip chose the last one so that 1&2 could theoretically run cog drivers.
Personally, I expect I will always run the scheduler/debugger in #0, not use 1&2, and use #3 for threads - by not using #1, #3 will get much more benefit from the single dcache line and the four icache lines.
But in case we run out of tasks, it is nice to have the capability to run drivers in 1&2 (sharing the scheduler/thread cog)
Say, I'm curious how fast that will run (exclusive of each task's own code) if you had 4 cooperative tasks running at once.
Cheers,
Peter (pjv)
It would depend on how often a context switch occurred.
Note that faster context saves are possible if you don't save the LIFO or the WIDEs for the thread (in cases where you don't need them)
Time to schedule = (200 MIPS * 1/16) for the scheduler's idle slot + num_context_switches * 0.3us
So if you had 1000 context switches in a second, that's 300us for context switches plus 62,500us for the scheduler idling... a total of about 6.28% overhead
46.71 MIPS/thread, assuming each thread used 100% of its cycles
If two of the threads only used 10% of their cycles, the other two would run at about 80 MIPS each.
Four hardware tasks would run at 50 MIPS/task
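To make the arithmetic behind those estimates explicit, here is a throwaway C calculation under the same assumptions (a 200 MIPS cog, the scheduler idling in 1/16 of the slots, ~0.3us per full context switch, 1000 switches per second, 4 busy threads); its outputs land in the same ballpark as the figures above, which are rough estimates anyway:

#include <stdio.h>

int main(void)
{
    const double mips        = 200e6;       /* cog instruction rate             */
    const double sched_share = 1.0 / 16;    /* task0's (idle scheduler's) slots */
    const double switch_s    = 0.3e-6;      /* one full context switch          */
    const double switches    = 1000;        /* context switches per second      */
    const int    threads     = 4;

    double idle_frac   = sched_share;               /* 1/16 s = 62,500us        */
    double switch_frac = switches * switch_s;       /* 1000 * 0.3us = 300us     */
    double overhead    = idle_frac + switch_frac;   /* fraction of each second  */

    /* while they run, the threads share task3's 15/16 of the slots             */
    double per_thread = mips * (1 - sched_share) * (1 - switch_frac) / threads;

    printf("scheduling overhead: %.2f%%\n", overhead * 100);    /* 6.28%        */
    printf("per-thread rate: %.1f MIPS\n",  per_thread / 1e6);  /* ~46.9 MIPS   */
    return 0;
}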
To me, the more interesting case is being able to run tons of slower threads (so many threads can wait on sockets & USB endpoints)
If someone wanted, they could run cooperative threads only in task#3
If they did that, with a "YIELD" subroutine in the cog memory for task#3, they would not need the scheduler thread and would only take the hit of thread state saving/restoring, avoiding the 1/16 cycle cost of the scheduler.
Such threads would not, however, be pre-emptive.
What I like about Chip's implementation is that a developer can use cooperative or pre-emptive threading - whichever is more appropriate to the problem the developer is solving.
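As a host-side analogy only (POSIX ucontext on Linux, nothing to do with P2 instructions), this is roughly what "switch only when the code calls YIELD itself" looks like; the entire cost is the save/restore inside swapcontext, with no scheduler consuming a time slot:

#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, prog_ctx;
static char prog_stack[64 * 1024];

static void yield(void)                       /* the "YIELD" subroutine        */
{
    swapcontext(&prog_ctx, &main_ctx);        /* save our state, resume caller */
}

static void threaded_program(void)            /* a cooperative thread          */
{
    for (int pass = 0; pass < 3; pass++) {
        printf("thread pass %d\n", pass);     /* ...do some work...            */
        yield();                              /* give up the cog voluntarily   */
    }
}

int main(void)
{
    getcontext(&prog_ctx);
    prog_ctx.uc_stack.ss_sp   = prog_stack;
    prog_ctx.uc_stack.ss_size = sizeof prog_stack;
    prog_ctx.uc_link          = &main_ctx;    /* come back here when it ends   */
    makecontext(&prog_ctx, threaded_program, 0);

    for (int i = 0; i < 4; i++)               /* resume it until it finishes   */
        swapcontext(&main_ctx, &prog_ctx);
    return 0;
}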
Wow! Now, that whole discussion was painful, but the result is nice and clean. Well done all around.
And, the niceness of the PASM syntax for data really shines here, just saying.
Perhaps I should ask the question with a specific, simple example:
How fast could 4 independent cooperative threads run where each thread is a simple XOR outa,BitN toggling a different port bit. With and without the optional saves you mentioned.
Cheers,
Peter (pjv)
0.3us per task switch, ~30ns per task... 0.33us total... approx. 3M task swaps per sec
With minimal state save (only the PC etc. wide, no LIFO, no WIDE save), approximately 50 cycles for the task swap, plus 6 for the task itself
approx. 16M+ task swaps per sec
So hooking up an LED to each thread...
with full state save: ~0.38MHz square wave per LED
with minimal state save: ~2MHz square wave per LED
with four hardware tasks using REPD with a count set at 2^32 - 1:
~6.25MHz square wave per LED
Cheating version:
Putting the four XOR's into a REPD block set to 2^32 - 1 repeats:
~25MHz square wave per LED
Save/restore time dominates the non-cheating versions.
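Reconstructing the arithmetic behind those square-wave figures (my reading: each thread toggles its pin once per turn, and two toggles make one full period):

\[ f_{LED} \approx \frac{\text{swaps per second}}{4\ \text{threads} \times 2\ \text{toggles per period}} \]

which gives 3,000,000 / 8 ≈ 0.38MHz per LED for the full-save case and 16,000,000 / 8 = 2MHz per LED for the minimal-save case.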
But a little thick in the head here..... does your statement "square wave per LED" mean that all 4 LED pins are toggling with ~ 2.6 uSec high, then 2.6 uSec low for the full save case, and ~ 0.5 uSec high with 0.5 uSec low for the minimal case ?
I want to be sure to not mis-interpret your statements as that can be easily done in these cases.
Cheers,
Peter (pjv)
FYI, the above numbers were quick off the top of my head approximations (except the cheating case); once we have the new FPGA image it will be possible to try out all sorts of schemes (albeit at 80MHz)
For P3, I am sure there will be some kind of execute from DDR2/3/?, but it is waaaaaay too much to try to squeeze into P2.
An LMM-like kernel for the SDRAM on the P2 module is possible, trading a performance hit for much larger programs and a simple OS (uCLinux?)
Maybe that would take too long, but if you could, you could then execute from a vast pool of memory...