And it is so clean too! Somehow, I pictured something considerably more complex. Leave it to Chip to boil it down to the bare nubs. Heater really got me stoked about it after posting up his realization that the core unit of determinism is the COG, and that it is what impacts the strong reuse features. I personally did not visualize this very well at all.
Heh... "What happens in the COG, stays in the COG."
**my earlier comments gladly withdrawn, and Heater, good on you for seeing it, and advocating it --deffo the right move. Being able to have this discussion play out is special. How can this chip NOT be good? Seriously! I am very wary of things that would break the simple utility of the COGs, because that is a strength that does not have many equals out there. Worth keeping, as a good differentiator is important to overcome "it's just like X, so why not just use X?" kinds of decisions. --and it resonates with the idea of software peripherals, because the barrier to implementing them is low. Clearly, this impacts neither, for a nice net gain, period.
4 or 8... My gut says 4 is best, particularly if speed is at issue. But I'm really torn over the ability to overclock a little and get debug for basically free.
Would the speed penalty be incurred for any number of tasks, even with fewer than 4 running, if it's blown out to 8 total?
If that's not too big of a deal, maybe 8 is the far better choice, because then we get that "look inside the COG" missing from P1, and get it really cheap!
Re: Hardware changes, costs, etc... IMHO, this one is "chasing the window" a little, but good grief! It's a big jump that already seems like it should be there. It's half the chip without it, as both a net throughput and code size gain resulted without a lot of complexity added. Damn good call Chip.
@Martin: If it does, it won't be much. And that's expensive no doubt, but it's worth it because it's not just an increment, or add on, but a core thing that very fully exploits all the work done so far, without many downsides. Honestly, those of us, myself included, who believe delays are growing very expensive, likely thought it would not turn around so quickly. Given what we were just shown, it's worth it. The P2 would be half the chip otherwise.
That's a good point:
In a 4T16S case, whilst a debug stub can use as little as 1/16 of CPU time, it swallows 25% of possible user threads (you can debug 1/2/3 threads).
In an 8T10S case (1-10 slices, 1-8 threads), the debug stub can use as little as 1/10 of time, and swallows 12.5% of threads (you can debug 1/2/3/4/5/6/7 threads).
I think an overclock of 10% is not much more of a reach than an overclock of 6.25% - well, not for a test-bench case - to give the same average bandwidth if the user needs that.
Or you could enable the Debug, 1:1, and know you have a 10% testing margin.
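To make jmg's trade-off concrete, here is a minimal Python sketch of the debug-overhead arithmetic above. The function name and structure are mine, purely for illustration; the numbers follow the 4T16S and 8T10S cases described in the post.

```python
# Quick check of the debug-overhead arithmetic above (an illustration,
# not chip behavior): a debug stub occupies one task plus some slices.

def debug_overhead(total_tasks, total_slices, debug_slices=1):
    """Return (fraction of CPU time, fraction of tasks) taken by a
    one-task debug stub given the task/slice configuration."""
    time_share = debug_slices / total_slices
    task_share = 1 / total_tasks
    return time_share, task_share

# 4 tasks / 16 slices: stub takes as little as 1/16 of time, but 1/4 of tasks
print(debug_overhead(4, 16))   # (0.0625, 0.25)

# 8 tasks / 10 slices: 1/10 of time, only 1/8 of tasks
print(debug_overhead(8, 10))   # (0.1, 0.125)
```

The 10% vs 6.25% overclock figures in the post are exactly these time shares: the clock margin needed to give user threads the same average bandwidth they would have had without the stub.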
This doesn't have any impact on the timeline - yet. If I can get the testing done tonight, this will slip into the next synthesis workday without a ripple.
The reason 8 tasks would be slower is that it would add another layer of muxing. I don't know that it wouldn't meet timing - it very well may. As someone pointed out, though, time-slice allocation starts to get inelegant very quickly, compared to the simplicity of 4 tasks and 16 slices in a single long.
I hope this really does slide in nicely with the schedule else I'm in big trouble. Ken will be mad, there will be a terrible family feud, the Parallax empire will collapse, we won't get our Prop IIs, and everyone will blame me!
Anyone REALLY want EIGHT?
I don't have an opinion on this. Eight sounds like a natural number to aim for. Given the code space available 8 might be too much, unless all the threads run the same code of course. If 8 slows things down then probably not. If 8 threatens the dev schedule then definitely not. Perhaps it's better to just thoroughly test what you have now.
Thanks for pushing for this, everyone.
My pleasure:) I'm really looking forward to Christmas now.
I'm wondering:
1) What happens when someone changes the tasklong value and does that settask again? Is it really so that we can dynamically reorganize the scheduling or does something bad happen?
2) How on earth are the GCC guys going to manage this in C code?
You can do another SETTASK whenever you want. Those bits you write to it just keep rotating right by two bits every time an instruction completes. The two LSBs are always used to mux the next program counter, which issues the next instruction address. You can rewrite those 32 bits whenever you want, without ill effect.
I also added a 'JMPTASK D/#n,#mask' which sets all the tasks' program counters in the 4-bit mask to the D/#n value. It also flushes the pipeline of any instructions in the affected tasks. This way, you can cleanly direct one or more tasks to a new address. Good for herd management.
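The rotating scheduling word and JMPTASK, as Chip describes them, can be sketched in a few lines of Python. This is a software model of the described behavior only - the class and method names are mine, and details like register widths come from the post, not a datasheet:

```python
# A minimal software model of the SETTASK scheduling word described above.
# Illustration only; names and details follow the post, not hardware docs.

class TaskScheduler:
    def __init__(self, tasklong):
        self.word = tasklong & 0xFFFFFFFF   # 32-bit word = 16 two-bit slices
        self.pc = [0, 1, 2, 3]              # one program counter per task

    def next_task(self):
        """The two LSBs select which task's PC issues the next instruction;
        the word then rotates right by two bits."""
        task = self.word & 0b11
        self.word = ((self.word >> 2) | ((self.word & 0b11) << 30)) & 0xFFFFFFFF
        return task

    def jmptask(self, addr, mask):
        """JMPTASK-style: point every task whose bit is set in the
        4-bit mask at addr (pipeline flush not modeled here)."""
        for t in range(4):
            if mask & (1 << t):
                self.pc[t] = addr

# %%3210321032103210 in quaternary: tasks 0..3 each get every 4th slot
sched = TaskScheduler(int("3210" * 4, 4))
print([sched.next_task() for _ in range(8)])   # [0, 1, 2, 3, 0, 1, 2, 3]

sched.jmptask(0x40, 0b1110)    # herd tasks 3..1 to a new address
print(sched.pc)                # [0, 64, 64, 64]
```

Note how the quaternary literal reads MSB-first but is consumed from the LSBs, so `%%3210...` actually runs task 0 first - the same word used in Chip's example below.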
I got the register remapping working with time-slicing. Instead of using INDA to select the bank you're in, it uses the task number:
PUB go

  coginit(0, @pgm, 0)

DAT             org

pgm
pin             long    0                       '4 sets of 4 regs
period          long    100                     'each set appears at $000..$003
time            long    0                       '(these are like nop's to task 0)
                long    0

                long    1
                long    200
                long    0
                long    0

                long    2
                long    400
                long    0
                long    0

                long    3
                long    800
                long    0
                long    0

                setmap  #%1_010_010             'map 4 sets of 4 regs, time sliced
                jmptask #taskstart,#%1110       'herd tasks 3..1 to taskstart
                settask tasklong                'set tasks 3..1 in motion

taskstart       getcnt  time                    'start of group task, get time
:loop           add     time,period             'add period for next target
:wait           cmpcnt  time wc                 'check time target
        if_c    jmp     #:wait                  'if below target, wait
                notp    pin                     'flip pin
                jmp     #:loop                  'loop

tasklong        long    %%3210321032103210
This was a quick compile on the FPGA that is only running at 20MHz. You can see how four tasks are each tracking separate time targets to flip their pins by.
I'm getting lost. That mapping/herding thing might need some detail explanation.
You can see four sets of four data longs at the start of the program. When the tasks run, they all run the same code, but each sees its own set of four data longs in registers 0..3. Task0 makes the shortest period on pin0, while task3 makes the longest period on pin3. It's because they each access a unique set of four data longs, based on their task number. When task0 is running, actual registers 0..3 appear in 0..3. When task1 is running, 4..7 appear in 0..3. When task2 is running, 8..11 appear in 0..3. When task3 is running, 12..15 appear in 0..3. It changes every clock.
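The address translation Chip describes can be sketched as a tiny Python function. This is just a model of the mapping rule stated in the post (four banked registers per task, everything else shared); the function name and the treatment of unmapped addresses are my assumptions:

```python
# A sketch of the task-indexed register remapping described above: each task
# sees its own bank of four longs at cog addresses 0..3. Illustration only.

def remap(addr, task, mapped_regs=4):
    """Translate a cog register address for a given task. Addresses below
    mapped_regs are banked per task; higher addresses are assumed shared."""
    if addr < mapped_regs:
        return task * mapped_regs + addr
    return addr

# Task0 reads 0..3 as actual 0..3, task1 as 4..7, task3 as 12..15
print([remap(a, 1) for a in range(4)])   # [4, 5, 6, 7]
print([remap(a, 3) for a in range(4)])   # [12, 13, 14, 15]
print(remap(20, 2))                      # 20 (unmapped registers shared)
```

Because the selector is the task number rather than INDA, the translation can change every clock with zero overhead as instructions from different tasks flow through the pipeline.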
Actually, that was going to be my next question: how do identical threads get different parameters so they can operate on different pins and data, etc.?
I saw the four sets of four longs, but how does this mapping occur? With that setmap, I guess, but what do all the bits mean there?
Then we have the jmptask, which forces all specified tasks to some address, OK(?),
and the settask to get them all running to some schedule.
I think I'm more lost: where are all the program counters hiding? With JMPRET and TASKSW they are saved/restored to COG longs which we define in our code. Now I'm not sure I understand the initial threading example from earlier today.
How do hub stalls interact with other threads? Does it delay when the switch is made, or can instructions in the other threads execute while one thread is stalled on a hub access?
And there's the symmetry again. Wow! Before, one could write a small piece of code and have lots of COGs run it. Now one can do that in a COG, and have the threads all run it.
@Heater, I think he means COG address translation. COG address 0-3 for TASK 0, is mapped to real addresses 0-3, TASK 1 mapped to real addresses 4 - 7, TASK 2 mapped to 8 - 11, etc...
If other threads can run while a thread is stalled, that would make the P2 even faster than a single-threaded design. I believe that would require dedicated instruction pipelines for each thread. As I said, this is dangerously close to implementing an interrupt handler. A thread could be stalled on a WAITPEQ, and once the condition is satisfied it resumes running. That's basically an interrupt.
Good point, but if one thread is waiting on a condition and the other threads are running normally, then when the condition becomes true and the waiting thread starts to run, nothing is being interrupted. Also there is no context switching going on as with a regular interrupt. In the XMOS and others they call this "event" driven, as it is so different from an old-fashioned interrupt.
As it happens, as far as I understand, WAITPEQ and other waits will stall all threads until the condition is true. One has to have a polling loop around a test for the condition in order to keep all the other threads running. That is why Chip mentioned the POLVID (or whatever it is called), as previously there was no way to do a polling loop instead of a WAITVID.
Now, not having WAITxxx is not really a loss. The polling loops can be very tight, with low latency (especially as we have a faster chip anyway). WAITxxx would give lower power consumption, but for a single-threaded COG it doesn't save you much power compared to P1, and anyway in threaded code the other threads would be running, so WAITxxx saves no power.
Yes, I understand. My earlier questions were rhetorical. Adding an instruction pipeline to each thread would be too big of a change at this point in the P2 development. In fact, I'm quite surprised that Chip is even considering adding the slicer at this point. There is still the issue of hub stalls. Any slice that is executing hub instructions will generate unpredictable delays in the other slices. Also, the slice with the hub access will have to factor in the time used by the other threads. Instead of doing 6 (or is it now 7) non-hub instructions between rdlong instructions, a slice can only do 1 instruction without incurring another 8-cycle delay.
If each slice did have its own instruction pipeline it would be possible for the other slices to execute while a slice is waiting on a hub operation. This could keep the cog busy 100% of the time rather than stalling on hub accesses. It may be something to consider for P3.
There would only be one ALU, but multiple instruction pipelines, so the cog RAM would remain the same. This allows a slice's pipeline to move while another slice's pipeline is stalled. With a single pipeline all of the slices get stalled.
This time-slicing can be thought of as a vacuum sucking up instructions pointed to by various program counters, then running them through the pipeline, single-file. Each instruction's task ID is known as it travels through the pipeline, so that appropriate program counters and flag sets can be selected and updated as each instruction executes. If any of those instructions does a WAITxxx, the whole pipeline gets stalled, so it's important to code polling loops whenever possible and avoid stalls. What cannot be gotten around is the RDxxxx/WRxxxx stalls. Whatever task executes those instructions will cause the whole pipe to stall for up to 8 clocks, potentially. The CMPCNT instruction exists, though, so that you can repeatedly poll to find out if you've reached or surpassed a target time. That will be key to real-time coding.
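The CMPCNT polling pattern Chip recommends - the same loop used in the four-pin example - can be sketched in Python. This simulates one task's behavior only (function name and counter model are mine, for illustration); the point is that the task tests a target rather than blocking, so other tasks' instructions keep flowing through the shared pipeline:

```python
# A sketch of the CMPCNT-style polling loop described above: instead of a
# blocking WAITCNT, repeatedly compare a free-running counter against a
# target time, then advance the target. Illustration only.

def run_polling_task(start_cnt, period, toggles):
    """Simulate one task toggling a pin every 'period' counts;
    return the counter values at which the pin flipped."""
    cnt = start_cnt
    target = start_cnt + period
    pin = 0
    events = []
    while len(events) < toggles:
        cnt += 1                  # counter advances every clock
        if cnt >= target:         # CMPCNT-style reached-target test
            pin ^= 1              # flip the pin
            events.append(cnt)
            target += period      # next target is drift-free: old target
    return events                 # plus period, not "now" plus period

print(run_polling_task(0, 100, 3))   # [100, 200, 300]
```

Adding the period to the previous target (rather than to the current count) is what keeps the output jitter-free even if a poll lands a little late - the same reason the PASM example does `add time,period` instead of re-reading the counter.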
I'm sorry the example with SETMAP was a little confusing. The prior example WAS as simple as it looked, by the way. The SETMAP example is a little much to understand without some explanation of what SETMAP does.
So, even with multiple pipelines, each task would have to await its turn at the cog memory trough, the same as cogs have to wait for a hub slot, right? At least this would ensure determinism for the non-stalled tasks. (I'm definitely not suggesting further delays to do it that way, though. )
You can see four sets of four data longs at the start of the program. When the tasks run, they all run the same code, but each sees its own set of four data longs in registers 0..3.
Very clever.
I can see one small difference which puzzles me:
In #1247, you have four Jumps, which are the ORG.JMP's where each Task starts from, when launched.
but in #1277, there are no jumps ?
Where exactly did execution start from ?
Does this mean the two modes are exclusive ?
- ie Register mapping displaces Task.ORG.JMP , and you then have to use only jmptask #taskstart,#%1110 ?
What about managing a mix of Task.ORG.JMP (separate code blocks) and a couple running common code, with split R0..R3 ?
I asked him about that last night. I don't know if the current model is as such, but when he wrote the first example code the first 4 longs of memory were instruction vectors that would cause a jump to the appropriate address. The thread's program counters were initialized to 0,1,2,3 respectively and it is the programmer's responsibility to ensure that vector table has appropriate instructions.
Now he added a new instruction to initialize the internal PC registers, so I suspect he has done away with the vector table.
Any slice that is executing hub instructions will generate unpredictable delays in the other slices. Also, the slice with the hub access will have to factor in the time used by the other threads. Instead of doing 6 (or is it now 7) non-hub instructions between a rdlong instruction, a slice can only do 1 instruction without incurring another 8-cycle delay.
Not quite; the available window is MOD 8 clocks, so you can do as many opcodes as your slice allocation gives, and provided it still aligns on a clock multiple of 8, you avoid a bonus delay.
The User, or some automation software, could choose combinations of NOP/code shuffle and Slice to keep MOD 8 for those cases where this was critically important.
If I have this right, I think a 'magic' case exists where one thread is allocated 1:8 slices, then every opcode in that thread can be hub-slot-aligned, and it does not matter when it does hub access. I guess this auto-syncs on the first hub-access, thereafter it is locked. ( of course, other threads cannot get hub access without disturbing this, but you can easily code so normal run data streaming is from one thread. )
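jmg's MOD-8 argument is easy to check with a little arithmetic. The sketch below assumes, as the thread does, that a hub slot comes around every 8 clocks and a mis-aligned hub instruction waits for the next window; the function name and the specific clock numbers are mine:

```python
# A sketch of the MOD-8 hub-window alignment argument above: a hub slot
# recurs every 8 clocks, so a hub instruction issued off-window waits for
# the next multiple of 8. Working assumption from the thread, not a spec.

def hub_stall(clock):
    """Clocks spent waiting if a hub instruction is issued at 'clock'."""
    return (-clock) % 8

# Issued exactly on an 8-clock boundary: no bonus delay
print(hub_stall(16))   # 0

# Issued one clock late: wait 7 clocks for the next window
print(hub_stall(17))   # 7

# The 'magic' 1:8 case: a task owning 1 of 8 slices executes every 8th
# clock, so after its first hub access syncs to the window, every later
# hub access in that task stays aligned.
first_hub = 17 + hub_stall(17)                                     # syncs at 24
print(all(hub_stall(first_hub + 8 * n) == 0 for n in range(10)))   # True
```

This is why the NOP/code-shuffle trick works: any combination of slice allocation and padding that keeps successive hub instructions a multiple of 8 clocks apart avoids the bonus delay entirely.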
If each slice did have its own instruction pipeline it would be possible for the other slices to execute while a slice is waiting on a hub operation. This could keep the cog busy 100% of the time rather than stalling on hub accesses. It may be something to consider for P3.
Sounds like a lot of silicon. There are plenty of other designer solutions before you need to add silicon, and there is always the choice of using another COG for really time-paranoid tasks.
Oh my God, you have only gone and done it!
Incredible, I only suggested hardware-scheduling on the 11th Aug, http://forums.parallax.com/showthread.php?141706-Propeller-II&p=1117188&viewfull=1#post1117188 without ever daring to dream it might happen. But in about 14 working days it has been discussed, designed, implemented and mostly tested. Along with all the other stuff you are up to. Very impressive.
Is CLKSET controlled the same way as on Propeller 1 (1x, 2x, 4x, 8x, 16x), or by an N-counter? That could give better granularity for controlling the core speed.
That's great.
By the way I have moved my Christmas Day to whichever day I have the first Prop II in my hands. Whilst I'm at it I'll rename it to Chipmas Day:)
...but only if it works.
It multiplies from 1 to 16 (1x, 2x, 3x,... 16x).
Thanks.
So as I see it, I can use any value from 1 to 16, odd ones too - that is very good.
Is "0" reserved to set it to the internal frequency?
This is exciting stuff!! The register mapping is very slick!
Is a task able to find out its task number similar to how you can find out which COG you are running on?
Thanks!