Just to finish off this thread - the P1-style lock simulation works as expected, and I now have Catalina's multi-threading support working properly on the P2.
Here is a more sophisticated multi-threading demo - this program runs 5 multi-threaded kernel cogs (4 started dynamically) and then 50 threads. The threads wander around between the kernel cogs, moving themselves from cog to cog randomly. As usual, this program is compiled for the P2 EVAL board, serial interface, 230400 baud.
Impressive. How granular / quick is this thread switching, to get a feel for where it could be used, and where it could/should not be used ?
Context switching takes around 20 instructions, plus the overhead of saving and re-loading 32 registers from Hub RAM each time.
These figures are for LMM execution mode. The overhead is higher for compact mode, but will be lower for Native mode. In all modes, the overhead of loading/saving the registers is constant and unavoidable, but is of course much less on the P2 than on the P1 (because of the fast block moves).
So the cost of a context switch is quite high. You don't want to do it more often than you have to. For testing, I do it every 100 instructions (to help shake out the bugs!) but in a production program you would probably set this value much higher to reduce the overall overhead. BTW, each thread can set its own number of instructions to be executed before the next context switch - if you need more processor time (relative to the other tasks) you can set this number higher. If you need less, you set it lower.
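The per-thread switch quantum described above can be sketched roughly like this. This is a toy model only; `thread_t`, `tick` and the field names are illustrative, not Catalina's actual API:

```c
#include <assert.h>

/* Toy model: each thread carries its own context-switch quantum. */
typedef struct thread {
    struct thread *next;   /* circular list of threads on this cog */
    int quantum;           /* instructions to execute per slice    */
    int remaining;         /* instructions left in current slice   */
} thread_t;

/* Conceptually consulted once per executed instruction: keep running
   the current thread until its slice expires, then switch. */
static thread_t *tick(thread_t *current) {
    if (--current->remaining > 0)
        return current;              /* no switch yet */
    current->remaining = current->quantum;
    return current->next;            /* context switch */
}
```

A compute-heavy thread would simply set a larger quantum than its neighbours to get proportionally more processor time.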
On the P2, the mechanism will probably change to being time-sliced - but I may also introduce some kind of "priority" system to allow for different threads to get more or less processor time.
Looking forward to this... If you add time-slicing, pre-emption and priorities, this could become a good foundation for a nice little RTOS on the P2. Priority-based thread scheduling can obviously introduce thread starvation, but sometimes that is exactly what the programmer wants and can carefully code for (or against). A mix of some cogs running C code under an RTOS kernel and other dedicated cogs doing more I/O-specific processing could work out quite nicely on the P2 in real-time data-processing applications, particularly once you become cog-limited and can no longer dedicate a separate cog to each task, so the processing resources have to be shared.
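The starvation trade-off mentioned above shows up in even the smallest strict-priority scheduler. A hypothetical sketch (nothing from Catalina):

```c
#include <assert.h>

/* Strict-priority pick: the highest-priority ready thread always wins,
   so a lower-priority thread can be starved indefinitely while any
   higher-priority thread stays ready. */
static int pick(const int *prio, const int *ready, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (ready[i] && (best < 0 || prio[i] > prio[best]))
            best = i;
    return best;   /* thread index to run, or -1 if none ready */
}
```

Thread 2 here only ever runs while thread 1 is blocked, which is sometimes exactly the behaviour wanted for background work.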
As some people expressed an interest in seeing the pasm, I have attached a P2 multi-threading demo, this time compiled in p2 "native" mode, with the binary, source code and pasm listing included. Good luck with reading the pasm!
It is the same program as described previously, which starts 50 threads which wander around randomly between 5 cogs. As usual, it uses a serial interface running at 230400 baud.
The main difference is that in native mode the context switching between threads is done using P2 interrupts. I am still contemplating how and/or whether to add a more complex priority-based scheduling system. After seeing the overheads that even this very simple "round-robin" scheduling scheme imposes, I am now leaning towards not complicating things any further.
However, I will at least add the ability to set a C function as an interrupt handler. The context switching uses interrupt 3, leaving interrupts 1 and 2 available for other purposes if required.
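A C-level handler registration along those lines might look like the following. The names here (`set_isr`, `isr_table`) are entirely hypothetical, not a confirmed Catalina API; the sketch just reflects the stated constraint that interrupt 3 is reserved for context switching:

```c
#include <assert.h>
#include <stddef.h>

typedef void (*isr_t)(void);

/* Vectors 1 and 2 are available to user code; vector 3 is reserved
   for the scheduler's context switching. */
static isr_t isr_table[4];

static int set_isr(int vector, isr_t handler) {
    if (vector < 1 || vector > 2)
        return -1;                 /* reject vector 3 and out-of-range */
    isr_table[vector] = handler;
    return 0;
}
```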
Ross, I don't think I can find it. Where is it? There's a lot of stuff in there. I'm looking at the list file, by the way.
I've been finding, to my consternation, that it is challenging to code the P2 for max-possible performance. The reason is that there can be lots of subtleties in implementation, intertwined with problem redefinition, in order to exploit possible tricks to get big speed gains.
This code used to be one instruction shorter, but I had to change it to accommodate the way SETQ+RDLONG+PTRx works on the next silicon:
'
'
' a: ABORT result C=0
' b: ABORT x C=0
' c: RETURN result C=1
' d: RETURN x C=1
'
' repeat
' ptr = dbase (point to current stack)
' ptr[-5] --> z (top of lower stack)
' ptr[-4] --> pbase | flags
' ptr[-3] --> vbase
' ptr[-2] --> dbase (lower stack pointer)
' ptr[-1] --> y (bytecode return pointer)
' (ptr[-0] --> x) (first pass only if ABORT/RETURN result, overwrites x with result)
' ptr -= 5 (point to top of lower stack)
' while ABORT && !try
'
' if push result
' ++ptr <-- x (x is current)
' else
' x = z (x is current)
'
'
retn mov ptra,dbase 'a b c d ptra points to dbase
sub ptra,#5*4 'a b c d
setq #6-1 'a | c | pop z/pbase/vbase/dbase/y/x (1st pass only)
setq #5-1 '| b | d pop z/pbase/vbase/dbase/y
rdlong z,ptra 'a b c d ptra points to top of stack after pop
ror pbase,#2 ' get !try into msb
if_nc tjns pbase,#retn ' if abort and !try, return again
shl pbase,#2 wc ' restore pbase, push result?
if_c wrlong x,++ptra ' if so, x holds result, push x
if_nc mov x,z ' if not, get z (top of stack) into x
_ret_ rdfast #0,y ' start new bytecode read
This is the code that does ABORT/RETURN in Spin2. This just about killed me and if I go away from it for a bit, I need to relearn it, but it has lots of folded efficiencies in it.
You can see from the skip pattern, on entry, that it will only overwrite 'x' if 'result' is being passed, which is the case for ABORT/RETURN with no argument expressed. In the case of ABORT, where it might do multiple stack pops, subsequent loops will always execute 'setq #5-1', leaving 'x' intact. In the case of ABORT/RETURN with an argument expressed, 'x' is never overwritten, but is the last value on the (old) math stack.
Then, I racked my brain for a long time trying to figure out how to efficiently deal with the two LSBs of pbase, which need to wind up cleared after being polled. The ROR and SHL do this, with the TJNS serving as the poll and branch.
Mainly, the SETQ+RDLONG make the stack-pop operation very efficient.
All this took a long time to work out, but in the end it's fast and small.
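As a cross-check of one reading of the ROR/SHL/TJNS sequence, here is the same bit-juggling modelled in C (the flag naming is an interpretation of the listing, so treat it as an assumption): ROR #2 parks the two LSBs in the top bits, TJNS polls the new MSB, and SHL #2 WC restores pbase with both LSBs cleared while dropping the old bit 0 into C.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t pbase;  /* restored value, two LSBs now cleared      */
    int msb;         /* old bit 1: what TJNS polls after the ROR  */
    int c;           /* old bit 0: lands in C from SHL #2 WC      */
} pbase_flags_t;

static pbase_flags_t poll_and_clear(uint32_t pbase) {
    pbase_flags_t f;
    uint32_t r = (pbase >> 2) | (pbase << 30);  /* ROR pbase,#2 */
    f.msb   = (r >> 31) & 1;
    f.c     = pbase & 1;
    f.pbase = r << 2;   /* SHL #2: original bits [31:2] restored, LSBs 0 */
    return f;
}
```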
Are you looking for the interrupt service/context switch code? It is a function called NMM_isr at line 729. All you really need to know is that TP always points to the currently executing thread, which is part of a circular linked list of thread blocks, used to save and restore the thread state.
Some of the code in the ISR is to do with internal "housekeeping" (handling thread terminations, etc.). I may be able to move some of that out to be done elsewhere. But the biggest overhead is of course the saving and loading of 30 registers to hub RAM - and this is pretty much unavoidable given the way the LCC C compiler works. LCC is not entirely either stack or frame based - it also relies on general-purpose registers. In this instance, 24 of these registers are directly available for use by C code (r0 - r23) and the others are support registers (stack and frame pointers, etc.). I experimented with having both more and fewer general-purpose registers in the early days of Catalina, but 24 seems to be about the "sweet spot" for the LCC compiler. You can get by with fewer, but you suffer a significant hit to execution speed, and the code generator doesn't seem able to make use of any more.
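The thread-block arrangement described above can be pictured roughly like this. A sketch only: `thread_block_t` and `NUM_REGS` are illustrative names, and the copy loops stand in for the P2's SETQ block moves.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 30   /* r0..r23 plus the support registers */

typedef struct thread_block {
    struct thread_block *next;  /* circular linked list; TP walks it */
    uint32_t regs[NUM_REGS];    /* register save area in hub RAM     */
} thread_block_t;

/* Conceptual ISR body: save the cog registers of the current thread,
   advance to the next thread in the ring, reload its registers. */
static thread_block_t *context_switch(thread_block_t *tp,
                                      const uint32_t *regs_in,
                                      uint32_t *regs_out) {
    for (int i = 0; i < NUM_REGS; i++)
        tp->regs[i] = regs_in[i];    /* one SETQ+WRLONG on the P2 */
    tp = tp->next;
    for (int i = 0; i < NUM_REGS; i++)
        regs_out[i] = tp->regs[i];   /* one SETQ+RDLONG on the P2 */
    return tp;
}
```

The per-register copy cost is fixed, which is why the quantum between switches, not the switch itself, is the main tuning knob.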
Not a great thing to hear that peak performance is so hard to reach... we hope the P2 has not become an i860: powerful, but very difficult to program to the peak...
I've wondered that, too, but on the other hand, when some architecture affords advantages by special features, max performance means putting those features to use. The Prop1 is simpler to program to max performance because the ceiling of possibility is lower. Here we have something much more like chess than checkers. This can get a lot more stressful because so much more is possible. On a simple architecture, you can quickly feel like you've made the best implementation without too much effort, and move forward with confidence. Prop2 takes longer to achieve that feeling with, but what's possible is much greater.
Interesting about needing to swap out register sets. On the P2-Hot, we had register remapping, where you could select which set of actual registers would be accessed at, say, addresses $000..$01F. You could set up register addresses %0_0000_xxxx to remap to %y_yyyy_xxxx, where your ISR would only need to change the %y_yyyy value. In the current case of the P2, one extra clock per register is not too bad, especially compared to what the Prop1 would have required, which would have been 8x more clocks per register, at least.
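For concreteness, taking the %0_0000_xxxx pattern literally (a 4-bit register offset with a 5-bit bank field), the remapping arithmetic works out as below. `remap` is just an illustration of the address math, not an actual instruction:

```c
#include <assert.h>
#include <stdint.h>

/* %0_0000_xxxx addresses are redirected to %y_yyyy_xxxx, so an ISR
   only has to change the 5-bit %y_yyyy bank field to get a fresh
   register set. Other addresses pass through untouched. */
static uint16_t remap(uint16_t addr, uint16_t y_yyyy) {
    if (addr > 0xF)
        return addr;                       /* outside the remapped window */
    return (uint16_t)((y_yyyy << 4) | addr);
}
```

Switching a whole register set is then one field update instead of 2×N hub transfers.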
Register remapping would certainly be very useful for co-routines and interrupt handling. Maybe on the P3?
I've just had my first look. Search for "bith" in the .lst file. There are three occurrences: one in the NMM_isr that Ross mentioned, one in GET_KERNEL_LOCK and one in GET_POOL_LOCK.
I'll study your code - there are sure to be some tricks there I can steal!