Let's blow this HUB access determinism completely!
Heater.
Good, that got your attention.
I'd like to make a proposal for optimizing the HUB access arbitration that seems to blow COG-HUB determinism away completely but at the end of the day might give us the determinism we crave. It's surely not a new idea but perhaps I have a very slightly different take on it that changes things dramatically.
Firstly, a statement of the problem. Currently a Propeller has multiple processors that from time to time require access to a shared RAM. They cannot all do that at the same time, so the Propeller has the HUB arbiter that grants access to COGs on a round-robin basis. The upshot is that a COG requiring access may get it immediately, or it may have to wait for the HUB to cycle around giving time to all the other COGs first. It has to wait even if there are no other COGs that need HUB access!
The problem, obviously, is that this is grossly wasteful of potential execution time and HUB bandwidth. Even if only a single COG is running at all, it has to wait for the HUB to go around a whole turn before it can continue any useful work.
That round-robin HUB arbiter may have been acceptable with 8 COGs and when nobody cared about performance. They were programming in Spin anyway so clearly performance was not an issue. But it starts to look very poor with the 16 COGs of the proposed new PII design. Especially when you consider multiple of those COGs all running LMM code, compiled from C, from the nice big spacious shared RAM we have now. Even worse if we get hubexec.
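To make the cost concrete, here is a minimal C sketch of the current round-robin window as I understand it (the cog ID and loop counts are made-up values purely for illustration). A lone COG issuing back-to-back hub operations still stalls for a whole turn of the hub between accesses:

    #include <stdio.h>

    #define NCOGS 16    /* cog count of the proposed new design */

    /* Classic round-robin arbiter: the hub window visits cog 0..15 in turn,
     * one window per hub cycle, whether or not that cog wants access. */
    int main(void)
    {
        int window = 0;   /* which cog owns the current hub window */
        int waited = 0;   /* windows our cog has spent stalled     */
        int my_cog = 3;   /* the only active cog in this example   */

        for (int access = 0; access < 4; access++) {
            /* stall until the rotating window reaches our cog */
            while (window != my_cog) {
                window = (window + 1) % NCOGS;
                waited++;
            }
            printf("access %d granted, %d windows wasted so far\n", access, waited);
            window = (window + 1) % NCOGS;  /* the window moves on regardless */
        }
        return 0;
    }

After the first access the window has moved past our COG, so every subsequent back-to-back hub operation eats another 15 wasted windows even though nobody else wants them.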
The proposed optimization is obvious and no doubt not original:
1) Introduce a First In First Out queue (FIFO) with 16 elements.
2) When a COG requires shared memory access, hits a RDLONG for example, its COG ID is entered into the tail of the FIFO.
3) For every shared memory access window the HUB arbiter removes the COG ID at the head of the FIFO and grants that COG access.
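Here is a minimal C sketch of that queue, just to pin down the mechanics (the 16-entry depth and the use of COG IDs come from the proposal above; the function names are my own invention):

    #define NCOGS 16

    /* One entry per cog: a cog can have at most one outstanding request,
     * so 16 slots are always enough and the queue can never overflow. */
    static int fifo[NCOGS];
    static int head, tail, count;

    /* A COG hits a RDLONG (for example): its COG ID goes on the tail. */
    void hub_request(int cog_id)
    {
        fifo[tail] = cog_id;
        tail = (tail + 1) % NCOGS;
        count++;
    }

    /* Every shared memory access window: pop the head and grant that COG
     * access. Returns -1 if no COG is waiting (the window goes unused). */
    int hub_grant(void)
    {
        if (count == 0)
            return -1;
        int cog_id = fifo[head];
        head = (head + 1) % NCOGS;
        count--;
        return cog_id;
    }

Since each COG can only have one entry queued at a time, a newly queued COG has at most 15 entries ahead of it, which is where the worst-case latency in the points below comes from.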
What does this give us?
1) Shared memory bandwidth use is maximized. All possible memory access windows are used all the time. (Assuming, of course, there is some COG that wants to use them).
2) COG execution speed is maximized. A COG waiting for access gets it as soon as possible. No more waiting for the HUB to come around to the right position even when no other COG wants its slot.
3) Maximum shared memory access latency is maintained. COGS never have to wait for more than 15 other COGs in front of them to get their access.
4) Simplicity. No complicated HUB slot table planning or mooching/pairing priority schemes for the user to worry about.
5) Determinism. We still have a known upper bound on the time it will take a hub access to execute, and it is the same for all COGs, in all places, for everyone.
Is this even possible?
I'm no chip designer but it seems to me the number of gates required to implement this queue mechanism is not very big.
I can imagine that there are issues with timing. Entering things into a queue, adjusting the head and tail pointers, removing things, etc. This all probably takes some extra clock cycles.
"Ah", you might say "That's no good. All that complex decision making takes clocks and time and we don't have the time."
My response to that is: Well OK, so what? Let the arbiter have those clocks and time. It may well be that although it is making decisions two or three or four times slower than a simple round-robin scheme the gains accrued, as described above, outweigh any losses there.
Does this blow the sacred determinism?
Not really. As noted above we still have a fixed, known upper bound on HUB access latency. We can't exactly tune our PASM code to HUB access windows. Meh, who cares?
For those who really crave LMM or hubexec speed they have a choice. Limit your application to using only a single COG! Or perhaps 8, or whatever you like.
Do I expect any of this to happen?
Not really. Just idly speculating about possibilities.
Comments
Otherwise, I agree with you in general. My work-conserving arbiter suggestion (see my blog entry) does the same thing, sort of. Like your suggestion, it always guarantees that a cog will get access at least every 16 clock cycles. Instead of a queue, however, it is still working in a round-robin fashion, but simply skipping over all cogs that are not asserting a request at that time. This guarantees that a cog will be serviced on every clock cycle, like your suggestion does.
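Here is a minimal C sketch of that work-conserving variant, under my reading of the description (a round-robin pointer that simply skips cogs with no pending request; the identifiers are mine):

    #define NCOGS 16

    /* Work-conserving round-robin: starting from the cog after the last one
     * granted, scan forward and grant the first cog with a pending request.
     * 'request' is a 16-bit mask of cogs currently asserting a hub request.
     * Returns the granted cog ID, or -1 if nobody is requesting. */
    int grant_next(unsigned request, int last_granted)
    {
        for (int i = 1; i <= NCOGS; i++) {
            int cog = (last_granted + i) % NCOGS;
            if (request & (1u << cog))
                return cog;
        }
        return -1;
    }

Because the scan restarts just past the last grant, a requesting cog can be passed over at most once by each of the other 15 cogs, which is where the at-least-every-16-cycles guarantee comes from.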
But. I don't think you're going to be able to crack the determinism nut.
If it happens too many times the chip explodes and my whole scheme is in tatters.
We have to think on this....
What we need is this thing that looks a bit like a one-toothed gear wheel which rotates around giving each core a slot to write to the FIFO....
So the request priority would be a combination of the two counters, and would be (RequestCount << LastUsedBits) | LastUsedCount. LastUsedBits is the number of bits in the last-used counter. The last-used counter will saturate at the maximum value that it can hold.
I wrote a simple simulator of this last week, and it seems to work quite well. In my simulator I also allow for cogs to either be in the deterministic mode or in the mooch mode so that determinism can still be maintained when needed. Here's a sample output from the simulator. The Rate in the table above is the maximum number of cycles between hub requests, where the hub requests are generated by (rand() % Rate).
In this example, cog 0 is in mooch mode, and all of the other cogs are in deterministic mode. Cog 0 only has an average of 1.2 stalls/access. Cog 1 has the same rate as cog 0, but sees an average of 12 stalls/access because it is in the deterministic mode. The remaining cogs see hub stalls that are close to the expected value of 7.5 stalls/access for random sporadic hub requests.
This hub arbitration scheme would require 15 comparators cascaded in stages of 8, 4, 2 and 1 comparator. The width of the comparators would be 4+LastUsedBits. For the example I posted above I used 5 bits for the last-used counters, for a total of 9 bits for the comparators.
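For reference, here is a minimal C rendering of that selection as I read it (the shift/OR priority formula and the 5-bit saturating last-used counter are taken from the posts above; the loop over cogs is a software stand-in for the cascaded 8/4/2/1 comparator stages, and the identifiers are mine):

    #define NCOGS          16
    #define LAST_USED_BITS 5                            /* width used in the example */
    #define LAST_USED_MAX  ((1u << LAST_USED_BITS) - 1) /* counter saturates here    */

    /* Priority as given above: (RequestCount << LastUsedBits) | LastUsedCount. */
    unsigned priority(unsigned request_count, unsigned last_used_count)
    {
        if (last_used_count > LAST_USED_MAX)
            last_used_count = LAST_USED_MAX;            /* saturate */
        return (request_count << LAST_USED_BITS) | last_used_count;
    }

    /* Pick the requesting cog with the highest priority.
     * Returns -1 if no cog is requesting on this cycle. */
    int pick_winner(const unsigned request_count[NCOGS],
                    const unsigned last_used_count[NCOGS],
                    unsigned requesting_mask)
    {
        int best = -1;
        unsigned best_prio = 0;

        for (int cog = 0; cog < NCOGS; cog++) {
            if (!(requesting_mask & (1u << cog)))
                continue;
            unsigned p = priority(request_count[cog], last_used_count[cog]);
            if (best < 0 || p > best_prio) {
                best = cog;
                best_prio = p;
            }
        }
        return best;
    }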
Was this in Verilog?
Yes, of course, determinism is blown totally out of the water!
Nice full back-flip for someone who earlier was worried about breaking Objects with some form of USER SET Slot lottery?!
(just because they might turn it on..)
See the numbers in #8 for how much of a lottery this now is.
Hmm, one day ago it was you who claimed to care; now, it seems, not at all?
Is this more 'humour', or a troll? Hard to tell.
I guess that's the way with non-technical weightings, they can move in the breeze.
Sounds cool.
Can users get the desired full determinism control, in determinism mode, by a set of ReLoad and Slot mappings?
A fixed 16:1 is not determinism at all, for someone whose design needs 1:15 or 1:13, or whose design needs a Slot Rate fixed at (say) 40MHz. Those designs need a better solution.
Sounds encouraging.
Does anybody know if this simulation in C can be made real in HDL?
Can we skip the distinction between "mooch" mode and "deterministic" mode? In my scheme as outlined above all COGs are equal in the race for HUB access.
Isn't there a flaw here: "...accesses can be ordered by maintaining counters that indicate the number of cycles since the last access."
When things kick off all such counters are zero. So when two or more COGs hit the HUB for the first time, simultaneously, who wins?
What is worth what?
That is what this is. Challenging the assumptions sometimes works really well when all of them are challenged. I am reading with interest.
It just gets funnier, now even the word "deterministic" is being scrubbed from the lexicon !!
Instead all COGs are equal in the race for HUB access. Place your bets!!!
Who cares about hard real time anyway.
This is not totally, 100%, a technical issue.
There are trade-offs and value judgements all down the line.
Engineers do not work in a vacuum devoid of all the unquantifiables they cannot measure.
Having worked in the world of "hard real-time" for most of my career from process control, to jet engine management, to avionics I take it very seriously.
The scheme, as outlined, provides a known upper bound on the latency of any COG wanting HUB access.
That is your hard "real-time". That is the very definition of "real-time".
Perhaps that upper bound is too long for you. Perhaps you would like to tweak things for special cases to something lower.
Tough.
Having said all that, perhaps also, what I am suggesting here is not actually realizable in silicon.
Tough for me in that case.
What is so wrong with the slot table approach?
There is no proven need for all cogs to be 'equal'; if that were true we should have a memory and pin lottery as well.
C.W.
jmg, I just noticed what you said there.
Well, what do you call it when two entities vie for the same resource at the same time?
That's right, it's called a "race". Or perhaps a "fight". And yes exactly "place your bets".
The whole debate about shared RAM access is exactly about that race.
We can resolve that "race" by random chance. Or by giving victory to the guy with the biggest/lowest COG ID or whatever.
Very true. It does not. Deliberately.
The assertion here is that "all COGS will be equal"
The slot table approach breaks that assertion.
I'm not sure what "proven" would mean. But the past 6 or 8 years history of the Propeller chip demonstrate that it's a damn good idea.
I don't get the, oft repeated, "memory and pin lottery" idea. By making "all COGS equal" we are achieving the same as saying "all pins are equal" or "all memory is equal". Makes life much easier to think of things that way.
The primary issue is if that is a valid assertion. If all cogs don't do equal work, why do they need equal resources?
If it is a problem for cogs to not have equal access to the hub, why should they be able to use varying amounts of pins and memory? I don't see how controlling the frequency of access to memory should be any different than controlling how much memory a given cog has allocated.
C.W.
That isn't completely correct. The reason is that the hub operation has to be executed to be placed in the queue.
So now you can no longer do other operations in between hub operations and execute the next hub operation just in time for the slot.
In other words you can no longer be synchronized to the hub.
C.W.
As soon as you start global planning of which process gets which amount of time you are into the same world as having to decide which process on your RTOS has greater priority, or which interrupt has greater priority. And then when you want to mix and match code from here and there you have to re-analyse all of that.
If that is the world you want to live in, why not get a super quick STM32 ARM chip (they are really fast and cheap now) and be happy? So far I don't see anyone using dynamic memory allocation on the Propeller. Think "malloc". People expect that all memory usage is known at compile time. Either the code/data fits or it does not.
Similarly my execution speed should be known at compile time. Not modulated by whatever is happening at run time. True.
So what?
Nope, anything that has floating point averages on cycles and high levels of jitter is NOT hard real time at all.
Deterministic only in the very loosest sense. 'Not over 16' - just like any 35c uC with interrupts can say 'not over X cycles'.
All those examples you loved to quote about fixed cycle objects, now broken, in this 'scheme'.
A maximum time is only part of the specification, but I suspect you do know that, & this is not serious.
So you used to be able to do useful work in between hub operations without affecting the timing of the next hub operation.
This scheme forces you to wait in line instead of having a standing reservation.
C.W.
I'll play along.
Nonsense, you can put ANY COGid in any slot. All COGs are plainly equal, and the user has absolute time control.
The user gets to choose what work each COG does, but that is what designers always do.
Remove that absolute time control, and the Prop becomes just like those processors you grumble about.
Your 'scheme' has actually delivered what you claimed earlier to oppose.
A mere ceiling on response and ship-loads of Jitter.
In the world of "hard real-time" what is important is your interaction with the external world. Nobody need know or care what goes on inside the "black box" as long as the results come out with the required timing tolerances.
The interface between the Propeller "black box" and the world is via the pins which are in turn driven by the COGs.
So. If everything inside the box, hub memory exchanges, can be made to happen within known upper bounds on time, then the COGs can do their real-world, real-time interfacing predictably as required.
No problem.
There is no requirement to hit HUB RAM locations on the nano second. No one can measure it and no one will know about it.
Now I am sure you are simply trolling, and not serious.
Go and look at the many quoted examples of where objects are paced to/by hub access - the ones you earlier claimed not to want to break.
Those certainly DO expect clock-precise HUB RAM allocations, and you can certainly measure the outcome if it fails to do so.
Now, of course, I could craft a design where some COGs could choose a Primary allocation that was cycle deterministic, and other cogs could choose a Secondary allocation that was 'who-ever-wants-it', and that would not break those objects.
Oh wait, I already have that, in Table Mapping.
The Secondary Allocation can match your SuperCOG0, or it could match this new Quasi-FIFO scheme.