Let's blow this HUB access determinism completely!
Heater.
Good, that got your attention.
I'd like to make a proposal for optimizing the HUB access arbitration that seems to blow COG-HUB determinism away completely but at the end of the day might give us the determinism we crave. It's surely not a new idea but perhaps I have a very slightly different take on it that changes things dramatically.
Firstly, a statement of the problem. Currently a Propeller has multiple processors that from time to time require access to a shared RAM. They cannot all do that at the same time, so the Propeller has the HUB arbiter that grants access to COGs on a round-robin basis. The upshot is that a COG requiring access may get it immediately, or it may have to wait for the HUB to cycle around giving time to all the other COGs first. It has to wait even if there are no other COGs that need HUB access!
The problem, obviously, is that this is grossly wasteful of potential execution time and HUB bandwidth. Even if only a single COG is running at all, it has to wait for the HUB to go around a whole turn before it can continue any useful work.
That round-robin HUB arbiter may have been acceptable with 8 COGs and when nobody cared about performance. They were programming in Spin anyway so clearly performance was not an issue. But it starts to look very poor with the 16 COGs of the proposed new PII design. Especially when you consider multiple of those COGs all running LMM code, compiled from C, from the nice big spacious shared RAM we have now. Even worse if we get hubexec.
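To make the cost concrete, here is a minimal C sketch of the current round-robin window as I understand it (the cog ID and loop counts are made-up values purely for illustration). A lone COG issuing back-to-back hub operations still stalls for a whole turn of the hub between accesses:

    #include <stdio.h>

    #define NCOGS 16    /* cog count of the proposed new design */

    /* Classic round-robin arbiter: the hub window visits cog 0..15 in turn,
     * one window per hub cycle, whether or not that cog wants access. */
    int main(void)
    {
        int window = 0;   /* which cog owns the current hub window */
        int waited = 0;   /* windows our cog has spent stalled     */
        int my_cog = 3;   /* the only active cog in this example   */

        for (int access = 0; access < 4; access++) {
            /* stall until the rotating window reaches our cog */
            while (window != my_cog) {
                window = (window + 1) % NCOGS;
                waited++;
            }
            printf("access %d granted, %d windows wasted so far\n", access, waited);
            window = (window + 1) % NCOGS;  /* the window moves on regardless */
        }
        return 0;
    }

After the first access the window has moved past our COG, so every subsequent back-to-back hub operation eats another 15 wasted windows even though nobody else wants them.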
The proposed optimization is obvious and no doubt not original:
1) Introduce a First In First Out queue (FIFO) with 16 elements.
2) When a COG requires shared memory access, hits a RDLONG for example, its COG ID is entered into the tail of the FIFO.
3) For every shared memory access window the HUB arbiter removes the COG ID at the head of the FIFO and grants that COG access.
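Here is a minimal C sketch of that queue, just to pin down the mechanics (the 16-entry depth and the use of COG IDs come from the proposal above; the function names are my own invention):

    #define NCOGS 16

    /* One entry per cog: a cog can have at most one outstanding request,
     * so 16 slots are always enough and the queue can never overflow. */
    static int fifo[NCOGS];
    static int head, tail, count;

    /* A COG hits a RDLONG (for example): its COG ID goes on the tail. */
    void hub_request(int cog_id)
    {
        fifo[tail] = cog_id;
        tail = (tail + 1) % NCOGS;
        count++;
    }

    /* Every shared memory access window: pop the head and grant that COG
     * access. Returns -1 if no COG is waiting (the window goes unused). */
    int hub_grant(void)
    {
        if (count == 0)
            return -1;
        int cog_id = fifo[head];
        head = (head + 1) % NCOGS;
        count--;
        return cog_id;
    }

Since each COG can only have one entry queued at a time, a newly queued COG has at most 15 entries ahead of it, which is where the worst-case latency in the points below comes from.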
What does this give us?
1) Shared memory bandwidth use is maximized. All possible memory access windows are used all the time. (Assuming, of course, there is some COG that wants to use them).
2) COG execution speed is maximized. A COG waiting for access gets it as soon as possible. No more waiting for the HUB to come around to the right position even when no other COG wants its slot.
3) Maximum shared memory access latency is maintained. COGS never have to wait for more than 15 other COGs in front of them to get their access.
4) Simplicity. No complicated HUB slot table planning or mooching/pairing priority schemes for the user to worry about.
5) Determinism. We still have a known upper bound on the time it will take a hub access to execute, and it is the same for all COGs, in all places, for everyone.
Is this even possible?
I'm no chip designer but it seems to me the number of gates required to implement this queue mechanism is not very big.
I can imagine that there are issues with timing. Entering things into a queue, adjusting the head and tail pointers, removing things, etc. This all probably takes some extra clock cycles.
"Ah", you might say "That's no good. All that complex decision making takes clocks and time and we don't have the time."
My response to that is: Well OK, so what? Let the arbiter have those clocks and time. It may well be that although it is making decisions two or three or four times slower than a simple round-robin scheme the gains accrued, as described above, outweigh any losses there.
Does this blow the sacred determinism?
Not really. As noted above we still have a fixed, known upper bound on HUB access latency. We can't exactly tune our PASM code to HUB access windows. Meh, who cares?
For those who really crave LMM or hubexec speed they have a choice. Limit your application to using only a single COG! Or perhaps 8, or whatever you like.
Do I expect any of this to happen?
Not really. Just idly speculating about possibilities.
Comments
Otherwise, I agree with you in general. My work-conserving arbiter suggestion (see my blog entry) does the same thing, sort of. Like your suggestion, it always guarantees that a cog will get access at least every 16 clock cycles. Instead of a queue, however, it is still working in a round-robin fashion, but simply skipping over all cogs that are not asserting a request at that time. This guarantees that a cog will be serviced on every clock cycle, like your suggestion does.
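Here is a minimal C sketch of that work-conserving variant, under my reading of the description (a round-robin pointer that simply skips cogs with no pending request; the identifiers are mine):

    #define NCOGS 16

    /* Work-conserving round-robin: starting from the cog after the last one
     * granted, scan forward and grant the first cog with a pending request.
     * 'request' is a 16-bit mask of cogs currently asserting a hub request.
     * Returns the granted cog ID, or -1 if nobody is requesting. */
    int grant_next(unsigned request, int last_granted)
    {
        for (int i = 1; i <= NCOGS; i++) {
            int cog = (last_granted + i) % NCOGS;
            if (request & (1u << cog))
                return cog;
        }
        return -1;
    }

Because the scan restarts just past the last grant, a requesting cog can be passed over at most once by each of the other 15 cogs, which is where the at-least-every-16-cycles guarantee comes from.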
But. I don't think you're going to be able to crack the determinism nut.
If it happens too many times the chip explodes and my whole scheme is in tatters.
We have to think on this....
What we need is this thing that looks a bit like a one-toothed gear wheel which rotates around giving each core a slot to write to the FIFO....
So the request priority would be a combination of the two counters, and would be (RequestCount << LastUsedBits) | LastUsedCount. LastUsedBits is the number of bits in the last-used counter. The last-used counter will saturate at the maximum value that it can hold.
I wrote a simple simulator of this last week, and it seems to work quite well. In my simulator I also allow for cogs to either be in the deterministic mode or in the mooch mode so that determinism can still be maintained when needed. Here's a sample output from the simulator. The Rate in the table above is the maximum number of cycles between hub requests, where the hub requests are generated by (rand() % Rate).
In this example, cog 0 is in mooch mode, and all of the other cogs are in deterministic mode. Cog 0 only has an average of 1.2 stalls/access. Cog 1 has the same rate as cog 0, but sees an average of 12 stalls/access because it is in the deterministic mode. The remaining cogs see hub stalls that are close to the expected value of 7.5 stalls/access for random sporadic hub requests.
This hub arbitration scheme would require 15 comparators cascaded in stages of 8, 4, 2 and 1 comparator. The width of the comparators would be 4+LastUsedBits. For the example I posted above I used 5 bits for the last-used counters, for a total of 9 bits for the comparators.
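For reference, here is a minimal C rendering of that selection as I read it (the shift/OR priority formula and the 5-bit saturating last-used counter are taken from the posts above; the loop over cogs is a software stand-in for the cascaded 8/4/2/1 comparator stages, and the identifiers are mine):

    #define NCOGS          16
    #define LAST_USED_BITS 5                            /* width used in the example */
    #define LAST_USED_MAX  ((1u << LAST_USED_BITS) - 1) /* counter saturates here    */

    /* Priority as given above: (RequestCount << LastUsedBits) | LastUsedCount. */
    unsigned priority(unsigned request_count, unsigned last_used_count)
    {
        if (last_used_count > LAST_USED_MAX)
            last_used_count = LAST_USED_MAX;            /* saturate */
        return (request_count << LAST_USED_BITS) | last_used_count;
    }

    /* Pick the requesting cog with the highest priority.
     * Returns -1 if no cog is requesting on this cycle. */
    int pick_winner(const unsigned request_count[NCOGS],
                    const unsigned last_used_count[NCOGS],
                    unsigned requesting_mask)
    {
        int best = -1;
        unsigned best_prio = 0;

        for (int cog = 0; cog < NCOGS; cog++) {
            if (!(requesting_mask & (1u << cog)))
                continue;
            unsigned p = priority(request_count[cog], last_used_count[cog]);
            if (best < 0 || p > best_prio) {
                best = cog;
                best_prio = p;
            }
        }
        return best;
    }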
Was this in Verilog?
Yes, of course, determinism is blown totally out of the water!
Nice full back-flip for someone who earlier was worried about breaking Objects with some form of USER SET Slot lottery?!
(just because they might turn it on..)
See the numbers in #8 for how much of a lottery this now is.
Hmm, one day ago it was you who claimed to care; now, it seems, not at all?
Is this more 'humour', or a troll? Hard to tell.
I guess that's the way with non-technical weightings, they can move in the breeze.
Sounds cool.
Can users get the desired full determinism control, in determinism mode, by a set of ReLoad and Slot mappings?
A fixed 16:1 is not determinism at all, for someone whose design needs 1:15 or 1:13, or whose design needs a Slot Rate fixed at (say) 40MHz. Those designs need a better solution.
Sounds encouraging.
Does anybody know if this simulation in C can be made real in HDL?
Can we skip the distinction between "mooch" mode and "deterministic" mode? In my scheme as outlined above all COGs are equal in the race for HUB access.
Isn't there a flaw here: "...accesses can be ordered by maintaining counters that indicate the number of cycles since the last access."
When things kick off all such counters are zero. So when two or more COGs hit the HUB for the first time, simultaneously, who wins?
What is worth what?
That is what this is. Challenging the assumptions sometimes works really well when all of them are challenged. I am reading with interest.
It just gets funnier, now even the word "deterministic" is being scrubbed from the lexicon !!
Instead all COGs are equal in the race for HUB access. Place your bets!!!
Who cares about hard real time anyway.
This is not totally, 100%, a technical issue.
There are trade-offs and value judgements all down the line.
Engineers do not work in a vacuum devoid of all the unquantifiables they cannot measure.
Having worked in the world of "hard real-time" for most of my career from process control, to jet engine management, to avionics I take it very seriously.
The scheme, as outlined, provides a known upper bound on the latency of any COG wanting HUB access.
That is your hard "real-time". That is the very definition of "real-time".
Perhaps that upper bound is too long for you. Perhaps you would like to tweak things for special cases to something lower.
Tough.
Having said all that, perhaps also, what I am suggesting here is not actually realizable in silicon.
Tough for me in that case.
What is so wrong with the slot table approach?
There is no proven need for all cogs to be 'equal'; if that were true we should have a memory and pin lottery as well.
C.W.
jmg, I just noticed what you said there.
Well, what do you call it when two entities vie for the same resource at the same time?
That's right, it's called a "race". Or perhaps a "fight". And yes exactly "place your bets".
The whole debate about shared RAM access is exactly about that race.
We can resolve that "race" by random chance. Or by giving victory to the guy with the biggest/lowest COG ID or whatever.
Very true. It does not. Deliberately.
The assertion here is that "all COGS will be equal"
The slot table approach breaks that assertion.
I'm not sure what "proven" would mean. But the past 6 or 8 years history of the Propeller chip demonstrate that it's a damn good idea.
I don't get the, oft repeated, "memory and pin lottery" idea. By making "all COGS equal" we are achieving the same as saying "all pins are equal" or "all memory is equal". Makes life much easier to think of things that way.
The primary issue is if that is a valid assertion. If all cogs don't do equal work, why do they need equal resources?
If it is a problem for cogs to not have equal access to the hub, why should they be able to use varying amounts of pins and memory? I don't see how controlling the frequency of access to memory should be any different than controlling how much memory a given cog has allocated.
C.W.
That isn't completely correct. The reason is that the hub operation has to be executed to be placed in the queue.
So now you can no longer do other operations in between hub operations and execute the next hub operation just in time for the slot.
In other words you can no longer be synchronized to the hub.
C.W.
As soon as you start global planning of which process gets which amount of time you are into the same world as having to decide which process on your RTOS has greater priority, or which interrupt has greater priority. And then when you want to mix and match code from here and there you have to re-analyse all of that.
If that is the world you want to live in, why not get a super quick STM32 ARM chip (they are really fast and cheap now) and be happy? So far I don't see anyone using dynamic memory allocation on the Propeller. Think "malloc". People expect that all memory usage is known at compile time. Either the code/data fits or it does not.
Similarly my execution speed should be known at compile time. Not modulated by whatever is happening at run time. True.
So what?
Nope, anything that has floating point averages on cycles and high levels of jitter is NOT hard real time at all.
Deterministic only in the very loosest sense. 'Not over 16' - just like any 35c uC with interrupts can say 'not over X cycles'.
All those examples you loved to quote about fixed cycle objects, now broken, in this 'scheme'.
A maximum time is only part of the specification, but I suspect you do know that, & this is not serious.
So you used to be able to do useful work in between hub operations without affecting the timing of the next hub operation.
This scheme forces you to wait in line instead of having a standing reservation.
C.W.
I'll play along.
Nonsense, you can put ANY COGid in any slot. All COGs are plainly equal, and the user has absolute time control.
The user gets to choose what work each COG does, but that is what designers always do.
Remove that absolute time control, and the Prop becomes just like those processors you grumble about.
Your 'scheme' has actually delivered what you claimed earlier to oppose.
A mere ceiling on response and ship-loads of Jitter.
In the world of "hard real-time" what is important is your interaction with the external world. Nobody need know or care what goes on inside the "black box" as long as the results come out with the required timing tolerances.
The interface between the Propeller "black box" and the world is via the pins which are in turn driven by the COGs.
So. If everything inside the box, hub memory exchanges, can be made to happen within known upper bounds on time, then the COGs can do their real-world, real-time interfacing predictably as required.
No problem.
There is no requirement to hit HUB RAM locations on the nano second. No one can measure it and no one will know about it.
Now I am sure you are simply trolling, and not serious.
Go and look at the many quoted examples of where objects are paced to/by hub access - the ones you earlier claimed not to want to break.
Those certainly DO expect clock-precise HUB RAM allocations, and you can certainly measure the outcome if it fails to do so.
Now, of course, I could craft a design where some COGs could choose a Primary allocation that was cycle deterministic, and other cogs could choose a Secondary allocation that was 'who-ever-wants-it', and that would not break those objects.
Oh wait, I already have that, in Table Mapping.
The Secondary Allocation can match your SuperCOG0, or it could match this new Quasi-FIFO scheme.