
P2 - New Instruction Ideas, Discussions and Requests


Comments

  • bartgranthambartgrantham Posts: 83
    edited 2014-03-27 15:15
    Gets us to individually setting the hungry mode per COG, with the assumption that we know it gets X cycles minimum. You could flag all the cogs, or just one, etc...

    Why not just make it the default since there's nothing except gain to be had?
    Completely agreed. This discussion is a lot like the tasking one. We talked about a ton of stuff, all of which boiled down to a couple of key features in the silicon.

    I strongly disagree with explicit hub access management, so it pains me to suggest the following, but I'm going to anyways... how about a global "hub access" register similar to the task scheduling register. It would have 8 3-bit slots, and you could program which of the 8 cogs has access to the hub on any given hub cycle. Make it 8 4-bit slots and the top bit can be reserved for a future 16-cog version. This would probably be straightforward to implement, simple to manage in code, and it mirrors how task switching works. But let's be clear though: I hate this idea, I'd much rather it be handled automatically in silicon. :)
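    For illustration, a minimal C model of what such a schedule register could look like, assuming the 8 x 4-bit layout suggested above (the type name, helper functions, and default fill are my own invention, not anything proposed in the thread):

    #include <stdint.h>

    /* Hypothetical 32-bit hub-schedule register: 8 fields of 4 bits each.
       Field k names the cog that owns hub cycle (k mod 8); the top bit of
       each field is reserved for a future 16-cog part. */
    typedef uint32_t hub_sched_t;

    /* Power-up default: slot k belongs to cog k, i.e. today's 1:8 behaviour. */
    static hub_sched_t hub_sched_default(void)
    {
        hub_sched_t sched = 0;
        for (int slot = 0; slot < 8; slot++)
            sched |= (hub_sched_t)slot << (slot * 4);
        return sched;
    }

    /* Which cog owns a given hub cycle under this schedule? */
    static int hub_sched_owner(hub_sched_t sched, uint32_t hub_cycle)
    {
        return (int)((sched >> ((hub_cycle & 7) * 4)) & 0xF);
    }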
  • roglohrogloh Posts: 5,642
    edited 2014-03-27 15:36
    I strongly disagree with explicit hub access management, so it pains me to suggest the following, but I'm going to anyways... how about a global "hub access" register similar to the task scheduling register. It would have 8 3-bit slots, and you could program which of the 8 cogs has access to the hub on any given hub cycle. Make it 8 4-bit slots and the top bit can be reserved for a future 16-cog version. This would probably be straightforward to implement, simple to manage in code, and it mirrors how task switching works. But let's be clear though: I hate this idea, I'd much rather it be handled automatically in silicon. :)

    I think I hate that idea even more than you do. :smile: Having a global hub access register like this would be a nightmare to manage if any COG can control it once you start sharing software from different developers. It wouldn't work. We should not do this.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-03-27 15:55
    Personally, I think all we really need is the ability for a cog to use otherwise unused hub cycles, perhaps shared round robin the way Chip suggested.
    Even with a round robin method you will need a way to prioritize multiple hub requests on the same cycle, and the requests will have to be queued up in some order.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 16:01
    Agreed - but I was going to let Chip worry about that :)
    Dave Hein wrote: »
    Even with a round robin method you will need a way to prioritize multiple hub requests on the same cycle, and the requests will have to be queued up in some order.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-27 16:46
    The scheme needs to be simple to implement, and no cog should be disadvantaged except by itself specifically.

    So, in essence...
    1. The powerup default is as now, each cog# gets its own slot# which is precisely 1 in 8. (ie cog0=slot0, etc)
    2. Each cog may request earlier/additional slots from unused slots.
      • Perhaps further discussion is needed on how to allocate these "unused" slots between "green" cogs.
    3. A "donor" cog could grant its slot to a "donee" cog with or without "priority"
      • "Priority" means that the "donor" cog gives the "donee" priority to its slot first, then itself.
      • The "donor" must execute the instruction to "donate" unless it is not active.
      • Typically the donor and donee cogs will be spaced by 4 slots.
        • But it is possible for 4 cogs to work in quads, spaced by 2 slots, or whatever the programmer decides.
        • Remember, this only works for specialised co-operating programs, and is organised between themselves
        • ie no other cog can interfere with this scheme
      • Why? Because I see some uses where special software can make enormous gains using this scheme.
        • Again I say, it does not impact any other software/cogs !
    A couple of comments to previous posts...

    I don't see the requirement for special allocation of "unused" slots. I mean there is no need to remember which cogs had last use of any slots. We shouldn't get too hung up over who gets what extra unused slots. It would take a pretty smart program to hog all the unused slots for itself - we don't have to worry about viruses, do we???

    I actually see the main requirement being that a "green" cog can get its hub access earlier than it would if it has to wait for its own slot. Typically it will then not require its own slot, so that slot goes back into the pool.

    No cog can be disadvantaged by this scheme! The worst case is that a cog gets its own slot every 8 slots, unless it specifically chooses to give priority to another cog.

    IMHO (if I may be so bold) we have two things to discuss...

    (a) How do we allocate the "unused" slots to "green" cogs
    (b) How do we "donate" slots with/without "priority".

    I will leave both of these for another post.
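    To make the donor/donee idea in point 3 concrete, here is a small C model of a single pairing (my own illustration of that description, ignoring unused-slot sharing; the names and structure are assumptions, not a proposed implementation):

    /* One donor/donee pair.  With "priority", the donee is offered the
       donor's slot before the donor itself. */
    typedef struct {
        int donor;     /* cog that executed the (hypothetical) donate instruction */
        int donee;     /* cog receiving the extra hub window */
        int priority;  /* 1 = donee is checked before the donor on that slot */
    } donation_t;

    /* Who wins hub cycle 'slot' (0..7)?  want[c] != 0 means cog c is
       requesting the hub this cycle.  Returns the winning cog, or -1 if
       the slot goes unused. */
    static int donor_slot_winner(int slot, const donation_t *d, const int want[8])
    {
        int owner = slot & 7;                 /* default 1:8 owner of this slot */
        if (d == NULL || owner != (d->donor & 7))
            return want[owner] ? owner : -1;  /* no donation affects this slot */

        int first  = d->priority ? d->donee : d->donor;
        int second = d->priority ? d->donor : d->donee;
        if (want[first])  return first;
        if (want[second]) return second;
        return -1;                            /* neither cog needs it: slot is free */
    }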
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-27 17:10
    Green cogs:

    ie A cog that can utilise "unused" slots. (Note I am ignoring donor/donee slots for this post)

    I have two alternate ideas for this.

    (a) simple...
    SETSLOT #n
    where n=0 (default) cannot use "unused" slots
    n=1 can use any "unused" slots

    (b) slightly more complex...
    SETSLOT D/#
    where D/# = nnnnnnnn where each "n" represents a bit per slot (b0=slot0, b1=slot1, etc)
    if n=1, then this cog can use that slot if it is "unused".

    Unused slot allocation...
    Slot  0 1 2 3 4 5 6 7     
    ----------------------
          0 1 2 3 4 5 6 7    | Allocation priority to cogs
          7 6 5 4 3 2 1 0    |
          6 5 4 3 2 1 0 7    V
          5 4 3 2 1 0 7 6
          4 3 2 1 0 7 6 5
          3 2 1 0 7 6 5 4
          2 1 0 7 6 5 4 3
          1 0 7 6 5 4 3 2
    
    The above gives priority to cogs that just missed their slots.
    Would this give the best performance boost ???
    Or would this be better (by offsetting by 4)...
    Slot  0 1 2 3 4 5 6 7     
    ----------------------
          0 1 2 3 4 5 6 7    | Allocation priority to cogs
          4 3 2 1 0 7 6 5    |
          3 2 1 0 7 6 5 4    V
          2 1 0 7 6 5 4 3
          1 0 7 6 5 4 3 2
          7 6 5 4 3 2 1 0    
          6 5 4 3 2 1 0 7    
          5 4 3 2 1 0 7 6
    
    BTW, it really only matters when more than 1 cog bids for a free slot.
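    The first table above, combined with the SETSLOT D/# mask of option (b), might be modelled in C roughly as follows (names and types are my own reading of the proposal, not a silicon design):

    #include <stdint.h>

    /* green_mask[c] is the SETSLOT D/# value: bit k set means cog c may take
       slot k when it is unused.  request[c] != 0 means cog c wants the hub
       this cycle.  Unused slots go to the cog whose own slot passed most
       recently, per the first table. */
    static int green_slot_winner(int slot, const uint8_t green_mask[8],
                                 const int request[8])
    {
        slot &= 7;
        if (request[slot])
            return slot;                        /* the owner always wins its own slot */
        for (int rank = 1; rank < 8; rank++) {  /* owner passed: slot is "unused" */
            int cog = (slot - rank + 8) & 7;    /* cog that just missed its slot goes first */
            if (request[cog] && ((green_mask[cog] >> slot) & 1))
                return cog;
        }
        return -1;                              /* nobody bid: the slot stays idle */
    }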
  • bartgranthambartgrantham Posts: 83
    edited 2014-03-27 17:26
    Cluso99 wrote: »
    [*]Remember, this only works for specialised co-operating programs, and is organised between themselves
    [*]ie no other cog can interfere with this scheme

    This just seems so complex. Donor/donee bandwidth? Can I have multiple donor/donee pairs/sets? Will it only be 2/4 or can 3 cogs share? If this idea requires cogs to coordinate and it's based on cog id, won't the "rendezvous" code just be boilerplate?
    Cluso99 wrote: »
    No cog can be disadvantaged by this scheme! The worst case it that a cog gets its own slot every 8 slots, unless it specifically chooses to give priority to another cog.

    Again, why not just have all cogs start in "hungry" mode where they have priority based on LRU? An 8-way pile up means all 8 cogs get even access. A 4-way pile-up means 25% bandwidth for each. If only one cog is asking for hub access it'll get it on every slot... until the next slot when a hub-sipper asks for access, and they'll trade off until the hub-sipper stops accessing the hub. Dave Hein's post above gives a scheme, which I think is equivalent to LRU and could probably be straightforward to implement:
    Dave Hein wrote: »
    A fairly simple way to arbitrate the hub would be to weight a hub request by the distance from the cog's hub-slot
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 17:52
    Cluso99 wrote: »
    Green cogs:

    ie A cog that can utilise "unused" slots. (Note I am ignoring donor/donee slots for this post)

    I have two alternate ideas for this.

    (a) simple...
    SETSLOT #n
    where n=0 (default) cannot use "unused" slots
    n=1 can use any "unused" slots

    That is what I have been pushing all along.

    KISS principle.

    Anything more complex (for "green") adds needless logic, complexity etc., and I have yet to see a technical argument for needing that complexity.

    (re donor/donee pairing... I can see potential uses, and I am for it if it does not complicate logic too much. Other priority schemes - I am against.)
  • potatoheadpotatohead Posts: 10,261
    edited 2014-03-27 17:53
    What a mess!

    Agreed with Bill, and really conservative. I would have only what we're calling GREEN now, and nothing more.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 17:56
    1) All cogs must start in legacy "every 8th clock I get the hub" mode in order to support deterministic drivers. This is a given.

    2) Cogs should be able to switch to a "hungry" / "recycle" / "green" mode where they don't have to wait for their next turn if there is an unused slot available.

    Chip suggested a simple "round robin" for deciding which of multiple "hungry" cogs gets the next available spare slot.

    I like Chip's round robin assignment of unused slots; it is fair and does not mess up determinism for deterministic cogs.

    3) Ray has made a good case for a "donor/donee" scheme where one cog in a pair could get guaranteed 4 cycle hub access for fast-response deterministic drivers. Note this does not increase bandwidth, but does cut latency in half.

    4) I have yet to see a technical argument showing how anything more complex than the above would actually improve performance.

    This just seems so complex. Donor/donee bandwidth? Can I have multiple donor/donee pairs/sets? Will it only be 2/4 or can 3 cogs share? If this idea requires cogs to coordinate and it's based on cog id, won't the "rendezvous" code just be boilerplate?

    Again, why not just have all cogs start in "hungry" mode where they have priority based on LRU? An 8-way pile up means all 8 cogs get even access. A 4-way pile-up means 25% bandwidth for each. If only one cog is asking for hub access it'll get it on every slot... until the next slot when a hub-sipper asks for access, and they'll trade off until the hub-sipper stops accessing the hub. Dave Hein's post above gives a scheme, which I think is equivalent to LRU and could probably be straightforward to implement:
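    A minimal C sketch of the round-robin spare-slot assignment described in points 2 and 3 above (my own illustration; the single rotating pointer 'last' is an assumption, since the thread does not specify how the round robin is tracked):

    /* hungry[c] != 0 means cog c has opted into green/hungry mode;
       request[c] != 0 means cog c wants the hub this cycle.  The next spare
       slot goes to the first requesting hungry cog after the last winner. */
    static int spare_slot_winner(int *last, const int hungry[8], const int request[8])
    {
        for (int i = 1; i <= 8; i++) {
            int cog = (*last + i) & 7;
            if (hungry[cog] && request[cog]) {
                *last = cog;          /* rotate the pointer past the winner */
                return cog;
            }
        }
        return -1;                    /* no hungry cog is asking: slot stays unused */
    }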
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 17:59
    potatohead wrote: »
    What a mess!

    Agreed with Bill, and really conservative. I would have, what GREEN we are calling it now? Only.

    Thanks.

    Yep, I am currently trying to decide between "GREEN" or "RECYCLING".... lol .... some people thought "HUNGRY" was too aggressive.

    Besides... recycling all those poor wasted hub slots is the "green" environmentally friendly thing to do :)
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-27 18:04
    This just seems so complex. Donor/donee bandwidth? Can I have multiple donor/donee pairs/sets? Will it only be 2/4 or can 3 cogs share? If this idea requires cogs to coordinate and it's based on cog id, won't the "rendezvous" code just be boilerplate?
    Actually it's quite simple. It requires cogs to coordinate because these will likely be paired pieces of software. And we don't want to interfere with any other cogs.
    Again, why not just have all cogs start in "hungry" mode where they have priority based on LRU? An 8-way pile up means all 8 cogs get even access. A 4-way pile-up means 25% bandwidth for each. If only one cog is asking for hub access it'll get it on every slot... until the next slot when a hub-sipper asks for access, and they'll trade off until the hub-sipper stops accessing the hub.
    Absolutely not. This would break our current determinism. We require the current 1:8 to be the default.
    Dave Hein's post above gives a scheme, which I think is equivalent to LRU and could probably be straightforward to implement:
    No, it's not equivalent to LRU. That requires registers to store which cogs got slots last. We just have to pick a default priority scheme and live with it. Better than not having any scheme at all and wasting all those unused slots.
  • potatoheadpotatohead Posts: 10,261
    edited 2014-03-27 18:10
    I think the modes should be:

    0 = compliant

    1 = mooch

    Hey man, can you spare a cycle?

    :)
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 18:12
    Lol!

    +1
    potatohead wrote: »
    I think the modes should be:

    0 = compliant

    1 = mooch

    Hey man, can you spare a cycle?

    :)
  • bartgranthambartgrantham Posts: 83
    edited 2014-03-27 18:25
    1) All cogs must start in legacy "every 8th clock I get the hub" mode in order to support deterministic drivers. This is a given.

    2) Cogs should be able to switch to a "hungry" / "recycle" / "green" mode where they don't have to wait for their next turn if there is an unused slot available.

    Why not the other way around? Imagine reading this in a manual: "Your cog will get hub access at 1/8 the system clock. Unless you include this instruction, in which case it will go up to 8 times faster with no penalty to the other cogs' hub access." My very first thought would be "if there's no negative consequence, why do I have to explicitly ask for this huge performance gain?" Shouldn't the burden of asking for deterministic timing be on the code that relies on a chip behavior that is no longer a bottleneck?
    Chip suggested simple "round robbin" for which of multiple "hungry" cogs gets the next available spare slot.

    This is even simpler, but it would have greater latency for the access than LRU (but is still faster than no hub slot sharing).

    "green" / "recycling" / "hungry" ... none are particularly descriptive. Considering the huge volume of jargon that has been coined in this forum over the past few years, making sure things have concise and descriptive names is probably a good idea.
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-27 19:16
    Why not the other way around? Imagine reading this in a manual: "Your cog will get hub access at 1/8 the system clock. Unless you include this instruction, in which case it will go up to 8 times faster with no penalty to the other cogs' hub access." My very first thought would be "if there's no negative consequence, why do I have to explicitly ask for this huge performance gain?" Shouldn't the burden of asking for deterministic timing be on the code that relies on a chip behavior that is no longer a bottleneck?
    No. The current way is what is expected. There is no guaranteed extra performance. And there have been massive arguments against even permitting slot sharing. Please don't rock the boat ;)
  • Cluso99Cluso99 Posts: 18,069
    edited 2014-03-27 19:40
    1) All cogs must start in legacy "every 8th clock I get the hub" mode in order to support deterministic drivers. This is a given.

    2) Cogs should be able to switch to a "hungry" / "recycle" / "green" mode where they don't have to wait for their next turn if there is an unused slot available.

    Chip suggested a simple "round robin" for deciding which of multiple "hungry" cogs gets the next available spare slot.

    I like Chip's round robin assignment of unused slots; it is fair and does not mess up determinism for deterministic cogs.

    3) Ray has made a good case for a "donor/donee" scheme where one cog in a pair could get guaranteed 4 cycle hub access for fast-response deterministic drivers. Note this does not increase bandwidth, but does cut latency in half.

    4) I have yet to see a technical argument showing how anything more complex than the above would actually improve performance.
    Bill,
    Even with the simplest "green" instruction, Chip still has to provide a method of determining which cog will get the next free slot from those who request it. That is what the table is about.

    You may have noted I provided 2 examples, one where the "unused" slot is allocated to the cog who would gain the most benefit from it (ie just missed {or used} its own slot) and the other was to place them 4 slots apart. I am not sure of what is the best method.

    The more complicated instruction would merely gate this cog's request for an unused slot with the available slots. It's not complicated, and it could perhaps permit us to better control who gets available slots. Trying to think of possible future benefits for little silicon.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-27 19:52
    Ray,

    Basically, it does not matter to me how the spare slots are shared, as long as they are available to be used :)

    I am just trying to keep it as simple as possible, so it uses as few gates as possible - in order that it makes it into P2.

    Chip has been really good about boiling things down, so I am sure he will have an easy mechanism.
    Cluso99 wrote: »
    Bill,
    Even with the simplest "green" instruction, Chip still has to provide a method of determining which cog will get the next free slot from those who request it. That is what the table is about.

    You may have noted I provided 2 examples, one where the "unused" slot is allocated to the cog who would gain the most benefit from it (ie just missed {or used} its own slot) and the other was to place them 4 slots apart. I am not sure of what is the best method.

    The more complicated instruction would merely gate this cog's request for an unused slot with the available slots. It's not complicated, and it could perhaps permit us to better control who gets available slots. Trying to think of possible future benefits for little silicon.
  • Bob Lawrence (VE1RLL)Bob Lawrence (VE1RLL) Posts: 1,720
    edited 2014-03-27 20:18
    Yep, I am currently trying to decide between "GREEN" or "RECYCLING".... lol

    borrow

    loan

    share
  • bartgranthambartgrantham Posts: 83
    edited 2014-03-28 07:35
    Cluso99 wrote: »
    No. The current way is what is expected. There is no guaranteed extra performance. And there has been massive arguments against even permitting slot sharing. Please don't rock the boat ;)

    I remember, I contributed to the arguments in the big thread 2 1/2 months ago. :) I also remember having an instinctual revulsion to Bill's arguments for hub slot reuse, but ultimately being convinced. The thing that clinched it for me at the time was restating in my own head what the contract with the programmer would be:
    “Initial hub access will take between 1 and 8 cycles, after which you can rely on every access thereafter taking exactly 8 cycles.”

    With hungry mode (I prefer “LRU hub access”) the contract would be:

    “Hub access will take between 1 and 8 cycles. No guarantees are made about the timing of subsequent access except that it will take 8 or fewer cycles.”

    I know it sounds like I'm undervaluing the huge body of existing code that relies on hub window timing, but how much of that code is actually relying on the hub window to slow it down? I recall seeing a lot of code that was cycle-counted so that it could run as fast as possible, but that's different than using hub access to regulate the timing.

    You guys know far more about the existing body of code to be able to gauge the value of backwards compatibility in this case, but strictly from a performance perspective it seems like having the "turbo bandwidth" mode not be the default will feel like a pointless quirk for people new to the platform.
  • Heater.Heater. Posts: 21,230
    edited 2014-03-28 08:12
    The argument against "turbo bandwidth" mode is that if your object uses a COG and requires turbo bandwidth, and my object uses a COG and requires turbo bandwidth, then when some unsuspecting other user fetches our objects into his program it will fail mysteriously.

    That has to be so because we can't get more bandwidth than there physically is.

    Thus "turbo bandwidth" destroys that happy timing determinism, that confidence that objects can be mixed and matched without any worries about them stealing time from each other.

    Personally I think the argument has a lot of weight. As much weight as the rejection of interrupts. Interrupts introduce exactly the same kind of timing dependencies and surprises.
  • cgraceycgracey Posts: 14,134
    edited 2014-03-28 08:18
    I worry about this kind of scenario:

    Someone writes some code to perform some task that must be done within a certain, perhaps cyclical, amount of time. He uses hub slots not being used by other cogs to gain some speed. He's got it working just fine.

    He adds some other object, or cog program, which, in some occasional mode of operation, uses lots of hub slots, depriving his other cog of slots it needed to make timing.

    Now he's got a mess that will need diagnosing. If he never had use of those extra slots, he never would have gotten into trouble, because performance would have been static, all along, without any variance, no matter what other cogs were up to.

    That scenario shows the danger of making hub sharing a default operating mode.

    Of course, many people aren't even used to programming with real-time considerations, so they wouldn't even know any of this variance might matter - until they got caught in what they didn't realize was a subtle trap.


    ADDED: Looks like Heater was thinking the same things at the same time.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-03-28 08:36
    After reading the comments on hub-slot sharing I've modified my proposal a bit.

    - cogs start up using their own hub-slot, and must explicitly run the turbo instruction to enable using other slots
    - a non-turbo cog always gets guaranteed hub access during its own hub-slot
    - if a cog is not accessing the hub during its slot time the hub-slot is automatically available to other cogs
    - access to unused hub slots is granted in a round-robin or first-come, first-served manner
    - multiple hub requests in the same cycle are ordered by (cogid - cnt) & 7

    I like the term "turbo" to describe hub-slot sharing, so I'll use that name until someone comes up with a better name. I also like the idea of using a round-robin scheme, but there does need to be a method to prioritize simultaneous hub requests. I'm proposing using the order given by "(cogid - cnt) & 7". This method ensures that cogs do not have a higher priority based solely on their cogid.

    One major difference between this proposal and my previous one is that turbo cogs no longer have dedicated access to their hub slots. This should not be important to turbo cogs. Non-turbo cogs do have dedicated access to their own hub slots, and can only access the hub during their own hub slot.

    It is true that if you depend on the turbo mode to achieve a certain latency you may have problems running multiple turbo cogs. This type of thing is encountered all the time in multi-tasking systems. In a multi-tasking system you need to add up the worst case latency of each task to determine if the overall latency requirements are met. I think most applications that would use the turbo mode would enable it for only one cog that does not have a latency requirement.

    My understanding is that the P2 is not only targeted at hobbyists, but is meant to be used for commercial applications as well. I think the turbo mode will be of great use for commercial applications, and professional programmers should have no problem working with it.
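    A compact C model of that tie-break (my own sketch; 'cnt' stands for the hub-slot counter and the function name is invented): the key (cogid - cnt) & 7 rotates every slot, so no cog wins merely because it has a low ID.

    #include <stdint.h>

    /* Order simultaneous "turbo" requests by (cogid - cnt) & 7; the cog with
       the smallest key wins.  turbo_request[c] != 0 means turbo cog c wants
       the hub this cycle. */
    static int turbo_winner(uint32_t cnt, const int turbo_request[8])
    {
        int winner = -1;
        unsigned best = 8;                       /* smaller key = higher priority */
        for (int cog = 0; cog < 8; cog++) {
            if (!turbo_request[cog])
                continue;
            unsigned key = ((unsigned)cog - cnt) & 7;
            if (key < best) {
                best = key;
                winner = cog;
            }
        }
        return winner;                           /* -1 if no turbo cog is requesting */
    }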
  • RamonRamon Posts: 484
    edited 2014-03-28 09:05
    Heater, Chip,

    Please treat the programmer like an adult who is supposed to know that if he pushes all the hub access to just one cog, then it cannot be used by two cogs (this is by definition: "turbo bandwidth" pushes all hub accesses to be used by ONE cog).
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-28 09:38
    Chip, Heater:

    That concern is overblown.

    Consider:

    "Bob" writes a new top level Spin object. He pulls in three objects, each of which starts three cogs. OOPS.

    "Smith" pulls in five objects that in aggregate use 300KB of hub. OOPS.

    "Jones" waitpeq's on a pin that will never change, for the wrong state. OOPS.

    How are the above scenarios different from slot recycling?

    There are about a billion other possible scenarios where the programmer passes the limits.

    Any system has limits.

    Programmers have to be aware of the limits.

    If they don't Read The Fine Manual, and blindly pull random code together without knowing what they are doing, they can hardly complain if it does not work as expected.

    The same argument, that more than one hungry object might give results the users don't expect, could be used to justify:

    - restricting objects to launching only one cog
    - limiting objects to using only a tiny fraction of the hub.

    Note: We should be targeting programmers & developers, not rank beginners, as beginners will start with a subset of features and work their way up.

    With P2 coming on a module, and being surface mount, and more expensive than the P1, most beginners will start with a P1 anyway - at least in education.

    (it would also be easy to have Obex scan objects, and throw a warning if there are any "SETHUB RECYCLE" instructions in them)
  • whickerwhicker Posts: 749
    edited 2014-03-28 10:24
    Getting rid of the round-robin hub access of the cog destroys the core concept of what a Propeller series chip is.

    Green-washing the terminology doesn't help the case either, because I doubt the people opposed to this idea are from a subclass that responds favorably to those terms.
  • bartgranthambartgrantham Posts: 83
    edited 2014-03-28 10:29
    cgracey wrote: »
    I worry about this kind of scenario:

    ...programmer gets him/herself into trouble...

    Now he's got a mess that will need diagnosing. If he never had use of those extra slots, he never would have gotten into trouble, because performance would have been static, all along, without any variance, no matter what other cogs were up to.

    That scenario shows the danger of making hub sharing a default operating mode.

    Despite my arguments above, I agree with this concern. I can imagine the countless threads of "Why is my code half as fast when I use it with this other object?" Making it an explicit thing you have to ask for may or may not cut down on the debugging pain (I can imagine advanced developers getting tripped up by this). An explicit instruction also gives tools the ability to warn about performance and the mixing of objects that use this feature, which would really help.

    Still, I've seen a few APIs with quirks like that which would dramatically speed up the code, and people often don't use them because they don't realize they're there. Most developers aren't expecting dramatic speed increases just by flipping a switch, so the feature might fly under their radar. Without understanding the emphasis on determinism in the design, it's not intuitive why the system would leave performance on the table by default.

    Chip, IIRC you said you had a thought about how it would work but you weren't sure if you could fit it in, so this whole discussion might be academic anyways. :)
  • potatoheadpotatohead Posts: 10,261
    edited 2014-03-28 10:43
    I like the term "mooch" for these concerns. Turbo implies a sure thing. The only sure thing is round robin hub access.

    Mooch implies "if possible", and nobody really counts on a moocher. So it is a "nice to have, but might not happen" mindset, which encourages fail/fall-back object authoring.
  • Bill HenningBill Henning Posts: 6,445
    edited 2014-03-28 10:48
    whicker,

    You seem to misunderstand.

    The default behavior would still be round robin, every 8 clocks. A cog's regular every-8-clock slot could not be taken away.

    The change would allow one or two cogs to benefit from hub access slots that would not be used by the cogs they were assigned to, and this mode could only be entered with an explicit instruction.

    The benefit is that compiled code would run significantly faster, as would virtual machines (emulators) WITHOUT AFFECTING deterministic behavior of other cogs at all.
    whicker wrote: »
    Getting rid of the round-robin hub access of the cog destroys the core concept of what a Propeller series chip is.

    Green-washing the terminology doesn't help the case either, because I doubt the people opposed to this idea are from a subclass that responds favorably to those terms.
  • potatoheadpotatohead Posts: 10,261
    edited 2014-03-28 10:49
    Frankly, if we don't do it at all, I'm fine with that too. If we get any more complex than normal/mooch modes, I'm opposed anyway.