Propeller II update - BLOG

Roy Eltham · 2013-12-06 14:00

I was against the slot sharing from the start, and I still am. Also, since we now have double the read/write bandwidth because of the 8 long wide stuff, there is even more reason NOT to do it.

David, I was pretty sure everyone in the world had heard Rush's "Tom Sawyer" at least a dozen times. http://www.youtube.com/watch?v=ANiaZvdGO8U

Heater. · 2013-12-06 14:03

jmg,

And yet those terrible, user biting devices, sell in HUGE volumes, the Prop can only dream about?

Yes they do.

And the more the Prop is made to look and behave like one of them, the less reason anyone has to choose the Prop. Parallax could never compete on price or mind share, and the Prop would not have a "unique selling point" to distinguish it from the herd.

Users are not as dumb as you imagine.

Sure I am. Oh wait..:)

To limit it, on some 'Protect the novices' mantra, is frankly quite self-defeating.

We keep talking about "novices" here. I don't think this is right. Even the most professional and turbo-charged engineering genius might rather be using his guru powers on sorting out his own engineering challenges rather than wasting his time and skills sorting out problems that are not of his creation.

His tools should just work.

T Chap · 2013-12-06 14:09

You guys are speaking in terms of hypothetical possibilities of what might go wrong, when there is empirical evidence of what WILL be the case and of how such slot problems get resolved.

Newbie: I can't get the superdupervideo object to work,

Guru: that object steals/borrows/other term slots, are you sure that you are not having a problem of a slot conflict?

Newbie: oh ok, I see that, thanks, working now.

Every day for many years now there are newbies and gurus alike trying to figure out why something isn't working as it "should". The answer is always "aha!" Solving problems is part of the whole process and is a large part of the learning experience.

jmg · 2013-12-06 14:09

potatohead wrote: »

Lastly, can you guys wanting this feature comment on slot sharing vs the general case wide open variable slots and the potential impact on the higher performance cases you envision? In other words, does it really have to be whole hog, or is there a very basic, common sense step that can be taken to get a net gain overall?

I want to get on board with that. Help me. I'm flat out opposed to the general case, but the pairing is something I'm flirting with because we could do it and still say, "works no matter what" with very few qualifiers, which again I consider EXTREMELY important to the overall positioning of this product for high adoption rates, which it seriously needs to do.

I've not seen any suggestions of non-co-operating slot handling?

Certainly slot stealing is a bad thing.

To me making technical decisions on what surprises may be imagined is nebulous - smarter is to use firm quantifiable things like determinism and bandwidth.

Determinism is important, and there are co-operating slot-allocation designs, that are 100% deterministic.

Designers simply choose the minimum bandwidth their code requires. The slot manager hardware delivers that.

Threading already works a lot like this now, and I do not see anyone asking for threading to be removed, just because someone may allocate threads wrong ?

potatohead · 2013-12-06 14:12

Heater just made a GREAT point.

Strong differentiators matter. "Works no matter what" is a strong differentiator regarding the COGs, to Roy's point. Dilute the product vision and you dilute the reasons why people would even take a look at it.

Secondly, ease of use matters to EVERYBODY. Don't think it doesn't. In fact, that is the single most compelling reason to consider new products in most cases. Spec sheets will say all kinds of stuff. What really matters is to get them to visualize themselves using your product, and having a more favorable experience while doing so, which speaks back to strong differentiators. "me too" type broad coverage to hit some peak cases isn't worth it, due to the dilution of value perception in the majority sweet spot cases, IMHO.

David Betz · 2013-12-06 14:12

Roy Eltham wrote: »

I was against the slot sharing from the start, and I still am. Also, since we now have double the read/write bandwidth because of the 8 long wide stuff, there is even more reason NOT to do it.

David, I was pretty sure everyone in the world had heard Rush's "Tom Sawyer" at least a dozen times. http://www.youtube.com/watch?v=ANiaZvdGO8U

Okay, now I've heard at least part of it once. I can't say it sounded at all familiar though. I guess I must have been spending too much time playing Colossal Cave! :-)

potatohead · 2013-12-06 14:15

To me making technical decisions on what surprises may be imagined is nebulous - smarter is to use firm quantifiable things like determinism and bandwidth.

What? You seem to do that a lot. "Smarter to", "Always better to", etc...

You are claiming that maximizing ONE product attribute at the EXPENSE of other very strong attributes is worth it. I think you need to post up some support for that idea, like use cases.

Let's hear it.

Most of us know what the "works no matter what" attribute of the Propeller design is worth. If we go your way, it's a "me too" type chip that's just a different sort of mess from what the other guys are selling, and worse, it's new, and more worse, there is only one supplier, etc...

And it's not imagined. We know what using a Propeller is like, and we know what using other devices is like. We have good solid data, where you have a strong desire to maximize one feature at the cost of other darn good ones and you've not really made a case for it outside of the fact that you would prefer it.

...which incidentally is "always smarter" somehow?

Doesn't matter whether or not we are talking about a passive type system or an active one. The problem remains. Somebody will maximize one or two COGS, then the others aren't so useful. And that's the tease a lot of us don't like, because once somebody is there, we don't have good answers for them.

On the other hand, if the thing is built to deliver 8 cores that always perform, users never get there, do they? And we've got much better answers.

Better, if they do need a huge amount of throughput or "bandwidth" (never did like that term in this context), they can use two of the things, or employ a dedicated CPU for that task.

I'm having trouble envisioning a case where we just have to have a cog or two maxing out the HUB, leaving the others either:

seriously under performing

, or

causing the maxed out COG performance to vary considerably

, where

it's needed to sell bazillions of chips.

Can you post up a case or two JMG?

David Betz · 2013-12-06 14:21

This may sound like a dumb idea, but maybe this can be solved by requiring a COG to donate its hub slot to another COG explicitly. In other words, my OBEX object that needs two hub slots to work will run in one COG and launch another COG that will do nothing except donate its hub slot back to my main COG and then go to sleep. That way there are never more COGs running than there are free hub slots. The user of that object will know right away whether it will work or not, because its documentation says it requires two COGs, and if only one is available it can't be used. Of course, this wastes a COG that could be used for a task that doesn't require a hub slot, but at least it provides a way to guarantee that a collection of objects will work as advertised as long as there are enough COGs available. This is the only guarantee we have now with P1.
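A minimal sketch of the bookkeeping this scheme implies, in Python purely for illustration. The names `ObjectSpec` and `fits` are hypothetical; the only assumptions are the ones in the post, namely that every extra hub slot is paid for with one sleeping donor COG and that the chip has 8 COGs.

```python
from dataclasses import dataclass

TOTAL_COGS = 8  # one hub slot per COG, as on P1


@dataclass
class ObjectSpec:
    """Hypothetical object description: worker COGs plus extra hub slots wanted."""
    name: str
    worker_cogs: int
    extra_slots: int = 0  # each extra slot costs one donor COG that just sleeps

    @property
    def cogs_required(self):
        # Donor COGs count against the COG budget like any other COG.
        return self.worker_cogs + self.extra_slots


def fits(objects, total_cogs=TOTAL_COGS):
    """True if every object can launch. Slots cannot be oversubscribed,
    because a slot is only ever obtained by occupying a COG."""
    return sum(o.cogs_required for o in objects) <= total_cogs
```

With this model a "two slot" video object simply documents itself as needing two COGs, and the check is the same COG count users already do on P1.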

Dave Hein · 2013-12-06 14:29

One way to do this is to let each cog define whether it needs a dedicated slot or not. When a cog starts running it will have a dedicated hub slot. If the cog doesn't require a dedicated slot it could set a bit to release its slot to the bus arbiter. When a cog needs to access the hub it must request it from the arbiter, and if no other cogs have requested the hub it will be granted access immediately. If other cogs have requested the hub, its request will be added to a FIFO queue.

This gives the best of both worlds. If you need a dedicated slot you can keep your dedicated slot. If you don't need a dedicated slot then other cogs can share the hub more efficiently.
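A toy model of such an arbiter, in Python for illustration only. It assumes one reading of the proposal: a cog that keeps its dedicated slot is served only in that slot, and slots that have been released are handed to the oldest queued request. The class and method names are hypothetical.

```python
from collections import deque

NUM_COGS = 8


class HubArbiter:
    """Toy model: dedicated cogs keep their fixed slot, released slots serve a FIFO."""

    def __init__(self):
        self.dedicated = [True] * NUM_COGS  # every cog starts with its own slot
        self.pending_dedicated = set()      # dedicated cogs waiting for their slot
        self.queue = deque()                # requests from cogs that released their slot

    def release_slot(self, cog):
        """The cog sets its 'no dedicated slot needed' bit."""
        self.dedicated[cog] = False

    def request(self, cog):
        """The cog asks for hub access."""
        if self.dedicated[cog]:
            self.pending_dedicated.add(cog)
        elif cog not in self.queue:
            self.queue.append(cog)

    def tick(self, cycle):
        """Return the cog granted hub access on this clock, or None."""
        owner = cycle % NUM_COGS
        if self.dedicated[owner]:
            if owner in self.pending_dedicated:
                self.pending_dedicated.discard(owner)
                return owner
            return None  # a dedicated slot is never handed to another cog
        if self.queue:
            return self.queue.popleft()  # oldest request is served first
        return None
```

In this reading, timing for a dedicated cog is unchanged, while the cogs that released their slots see a latency that depends on how many of them are queued.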

potatohead · 2013-12-06 14:29

I was against the slot sharing from the start, and I still am. Also, since we now have double the read/write bandwidth because of the 8 long wide stuff, there is even more reason NOT to do it.

+1

Yeah. Me too.

Heater. · 2013-12-06 14:35

jmg,

You are looking at a lower level of "determinism" than is under discussion here.

The determinism we are after is that if a user selects objects A, B and C from OBEX or elsewhere he knows 100% that they will work together in his program and he does not ever have to worry about timing interactions between them. EVER.

As soon as there is the possibility that more than one of A, B and C needs as many HUB slots as it can get, that high level determinism is blown.

Threading already works a lot like this now, and I do not see anyone asking for threading to be removed, just because someone may allocate threads wrong ?

Threading is a totally different case.

Whatever weirdness goes on inside a COG is totally confined to that COG. The "atomic unit" of reusable code here is not the thread but the COG program, or the object that contains it.

...and there are co-operating slot-allocation designs, that are 100% deterministic.

Now you have my attention. If an object or some C library used two COGs and if those two COGs could trade their HUB slots between themselves, and only between themselves, that isolates the issue totally from the rest of the world.

Sounds like a very rare scenario though.

KC_Rob · 2013-12-06 14:41

potatohead wrote: »

I think you need to post up some support for that idea, like use cases.

Yes! Use cases, with specifics, so objective comparisons can be made! Not only in this specific instance but in general. We've seen way too little of that.

Jim Fouch · 2013-12-06 14:45

My thought on the idea of Slot Sharing being a problem....

Hub memory is also something that MUST be shared between more than one COG. There has to be some responsibility on the person who picks objects from the OBEX to make sure the combined memory usage is less than the Prop has. I see no difference in the Slot Sharing requirements. I don't think it is asking too much of a user to coordinate the requirements of the objects they want to put into a project.

I could see requiring that objects in the OBEX that use Slot Sharing be well documented accordingly.

Heater. · 2013-12-06 14:47

David Betz,

Yes, that is the extreme example of a two COG object or library unit that can do what it likes with its share of HUB slots without any possible conflict with other objects. It is extreme to totally waste a COG though!

Dave Hein,
That still does not get around the fundamental objection that random objects A and B may both require as many HUB slots as possible and won't work together because they can't both have them.

COGs fighting over HUB is much like interrupts fighting over CPU time. If it's not there it's not there.

Bill Henning · 2013-12-06 14:53

NOTE: I DO NOT ADVOCATE SLOT STEALING!!!! That would remove determinism, and is in the "don't even think it" category.

Usage case:

For arguments sake, let's say COG#1 is running, and COG#1 only gets its own slot.

hubexec running gcc code will have - for arguments sake - two hub references in every 8 longs.

To execute that window of eight longs will take 24 clock cycles (3x8) best case.

Now let's assume a peaceful world of happy cooperation, where unused slots go free to any cog that needs a slot. Note I am not even talking about yielding a slot, merely unused slots.

Let's say out of the other 7 cogs, two are not using their slot.

Happiness!

Now the propgcc code might execute in 8 clock cycles. Frankly, on average, more like 12 clock cycles.

But 2x-3x speedup is nothing to sneeze at.

Note - no loss of determinism to the other cogs.

The hubexec cog can just make use of unused slots to run faster. If two cogs were running hubexec, they'd get to share the surplus.

I don't think there is any real need for anything fancier than this.

This way, EVERY cog gets its guaranteed slot.

hubexec cogs can vacuum up any table scraps.

This does not undermine determinism, just speed up hubexec code (Spin, GCC, etc).

It may be useful to have a

SETSLOT "DETERMINISTIC | HUNGRY"

In deterministic mode, a cog can only get its own slots, and may not use other cogs slots. The hub goes merrily around every 8 cycles. If it leaves a slot unused, a HUNGRY cog can use it.

In HUNGRY mode, a cog gets its own guaranteed slot, and shares spare slots with other hungry cogs. It is only guaranteed 1/8 of the slots, but may get more.
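The two modes can be sketched as a small Python model that counts hub grants over one rotation. This is only an illustration under stated assumptions: the function name is hypothetical, and dealing spare slots round-robin among hungry cogs is an assumption, since the post only says they "share" the surplus.

```python
NUM_COGS = 8


def grants_per_rotation(wants_own_slot, hungry):
    """Count hub grants per cog over one 8-clock hub rotation.

    wants_own_slot[i]: True if cog i uses its own slot in this rotation.
    hungry: set of cog ids in HUNGRY mode (always ready to take a spare slot).
    """
    grants = [0] * NUM_COGS
    takers = sorted(hungry)
    next_taker = 0
    for slot in range(NUM_COGS):
        if wants_own_slot[slot] or slot in hungry:
            grants[slot] += 1  # the owner always gets its guaranteed slot
        elif takers:
            # The slot went unused, so a hungry cog picks it up.
            grants[takers[next_taker % len(takers)]] += 1
            next_taker += 1
    return grants
```

For the usage case above (cog 1 running hubexec in HUNGRY mode, two of the other seven cogs idle), the model gives cog 1 three slots per rotation while every DETERMINISTIC cog still sees exactly its own one.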

I am sure I've seen this (or extremely similar) proposal posted before, but there have been too many messages, so no idea where.

Then, we can leave more complex schemes for P3 :-)

Personally, I like being able to set priorities, yielding etc., and don't see the Obex issue as being so serious - however the above may be a good compromise for the first P2 series microcontroller.

KC_Rob wrote: »

Yes! Use cases, with specifics, so objective comparisons can be made! Not only in this specific instance but in general. We've seen way too little of that.

Bill Henning · 2013-12-06 14:54

removed accidental self quote - forum would not delete it

Dave Hein · 2013-12-06 14:55

Heater. wrote: »

That still does not get around the fundamental objection that random objects A and B may both require as many HUB slots as possible and won't work together because they can't both have them.

Yes, that is a problem that exists in multiprocessor environments. It can be even worse because A and B may work together most of the time, but occasionally their peak bus usage overlaps, and there's a glitch. The developer of an object would need to specify the bus usage for the object, so that others could ensure that all the objects will work together.

Heater. · 2013-12-06 14:58

Let me get this straight...

Last I heard a COG will be able to read or write from HUB eight LONGs in one go when it gets its HUB slot. Eight for God's sake!

Now presumably it will take a bunch of instructions to create that 8 LONGs worth of data to write or to consume 8 longs that are read.

Looks to me as if being able to read or write 8 LONGs twice per HUB cycle is totally pointless. There is no time in between to do anything with them!

This HUB slot sharing perhaps buys you nothing. Am I missing something here?

Dave Hein · 2013-12-06 14:59

Bill, if slot stealing is made optional it would allow developers to take full advantage of the bus. If your cog needs a dedicated slot, you can keep your dedicated slot -- period.

Heater, the problem with a dedicated slot is that you need to structure your PASM code to access the hub exactly on 8-cycle boundaries. Slot stealing would eliminate this, but at the cost of a loss of determinism. Many programs do not require determinism in their timing.
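The alignment cost can be made concrete with a small calculation, sketched here in Python. It assumes a simplified model in which cog n owns the hub on every clock where `clock % 8 == n`; the function name is hypothetical and real instruction timing is not modelled.

```python
HUB_PERIOD = 8  # clocks per full hub rotation, one slot per cog


def stall_cycles(clock, cog_id, period=HUB_PERIOD):
    """Clocks a cog must wait at `clock` before its fixed slot comes around."""
    return (cog_id - clock) % period
```

A hub access issued exactly on the cog's slot costs no wait, while one issued a clock too late waits almost a full rotation, which is why PASM loops get hand-tuned to the 8-cycle boundary.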

Bill Henning · 2013-12-06 15:03

Heater, take a peek at #3766

It would allow hubexec code to use unused hub cycles, see the example I posted, avoiding most stalls waiting for the hub. That is a good use case.

Video drivers don't need it; with 800MB/sec hub, that is more than 3x what is needed for 1080p30hz, for sprites, overlays, more cogs can be thrown at the same buffer - each with 800MB/sec.
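The "more than 3x" figure can be checked with a few lines of Python. The 800MB/sec number is taken from the post; the 32 bits per pixel colour depth is an assumption, since the post does not state one.

```python
def video_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Raw frame buffer read rate for a given video mode."""
    return width * height * fps * bytes_per_pixel


HUB_BYTES_PER_SECOND = 800_000_000  # figure quoted in the post

# 1080p at 30Hz, assuming 4 bytes (32 bits) per pixel
needed = video_bytes_per_second(1920, 1080, 30, 4)
headroom = HUB_BYTES_PER_SECOND / needed

print(f"needed: {needed / 1e6:.1f} MB/s, headroom: {headroom:.2f}x")
```

At a lower colour depth the headroom would be correspondingly larger.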

Heater. wrote: »

Let me get this straight...

Last I heard a COG will be able to read or write from HUB eight LONGs in one go when it gets its HUB slot. Eight for God's sake!

Now presumably it will take a bunch of instructions to create that 8 LONGs worth of data to write or to consume 8 longs that are read.

Looks to me as if being able to read or write 8 LONGs twice per HUB cycle is totally pointless. There is no time in between to do anything with them!

This HUB slot sharing perhaps buys you nothing. Am I missing something here?

Bill Henning · 2013-12-06 15:05

Dave Hein wrote: »

Bill, if slot stealing is made optional it would allow developers to take full advantage of the bus. If your cog needs a dedicated slot, you can keep your dedicated slot -- period.

Heater, the problem with a dedicated slot is that you need to structure your PASM code to access the hub exactly on 8-cycle boundaries. Slot stealing would eliminate this, but at the cost of a loss of determinism. Many programs do not require determinism in their timing.

(sshhh!!! here is a secret: I support voluntarily YIELDing a slot, then only getting unused cycles... which is effectively the same as stealing, as by planning and yielding you can guarantee more slots)

T Chap · 2013-12-06 15:05

Does any object currently exist that will occupy the P2 OBEX? Or is the P2 OBEX going to be built from scratch once released (or from modified P1 objects)? If from scratch, certainly some scheme could be found by that time to be incorporated into all P2 objects so that some arbitration of slots can be implemented. Possibly a dedicated variable in hub that is decremented for each cog that needs a slot, so that any future cog must check upon launch to be sure it will have enough slots to run. Then, even if you tried to run cogs that are all going to compete for slots, you get an error on cog launch if there is a conflict. No time lost debugging. If it's not there, you find a workaround.
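A rough Python sketch of that launch-time check. The class and exception names are hypothetical, and a real implementation would be a shared hub variable updated atomically rather than a Python object.

```python
class SlotBudgetError(RuntimeError):
    """Raised at launch when a cog asks for more slots than remain."""


class SlotBudget:
    """Stand-in for the shared hub variable that tracks free hub slots."""

    def __init__(self, total_slots=8):
        self.free = total_slots

    def launch(self, name, slots_needed):
        """Reserve slots for a cog at launch, or fail immediately."""
        if slots_needed > self.free:
            raise SlotBudgetError(
                f"{name} needs {slots_needed} slots, only {self.free} free"
            )
        self.free -= slots_needed
        return self.free
```

The point of the design is that a conflict shows up as an error at cog launch instead of as a timing glitch to be debugged later.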

Sapieha · 2013-12-06 15:07

Hi Dave Hein.

In one of my old proposals I suggested a variable slot length.

If any COG sets its flag ("I don't use it"), the slot rotation shrinks from 8 to 8 - 1, and so on.

Maybe it is time to look more on that possibility.

Dave Hein wrote: »

One way to do this is to let each cog define whether it needs a dedicated slot or not. When a cog starts running it will have a dedicated hub slot. If the cog doesn't require a dedicated slot it could set a bit to release its slot to the bus arbiter. When a cog needs to access the hub it must request it from the arbiter, and if no other cogs have requested the hub it will be granted access immediately. If other cogs have requested the hub, its request will be added to a FIFO queue.

This gives the best of both worlds. If you need a dedicated slot you can keep your dedicated slot. If you don't need a dedicated slot then other cogs can share the hub more efficiently.

Heater. · 2013-12-06 15:23

Bill,

OK hubexec is the first compelling use case I think we have seen here to justify this variable slot timing complexity.

Not sure if I'm convinced, let's see...

Cluso99 · 2013-12-06 15:33

Ken Gracey wrote: »

Just spoke with Chip and he says that maybe this weekend he'll have hubexec working. I'm sure he'll figure out the right way to do it and will report back with his results.

Now I'm on the "more features bandwagon" for P2, too. Who'd have thought that would ever happen?* This is especially true since I got a few messages from Bill Henning and David Betz about how much more function and performance large languages (like Propeller C) could work in P2 with hubexec. And beyond performance, they mentioned the GCC design cost...now they're really speaking my language.

Ken Gracey

Yes, such a radical (and fairly simple) improvement has come out of the "Holiday weekend/week" that it begs the exploration/review of a few of the other things to come from this exchange. The P2 will be radically improved by magnitudes with these changes and the markets significantly widen.

I cannot speak for GCC, but I do know HUBEXEC mode will give another dynamic improvement.

Really glad to have you on side !

*Still not interested in DDRAM support that was born over Thanksgiving Holiday while Chip slept twice.

No. We fleshed that out. It was fun while it lasted, and I am sure it gave Chip a real "buzz", but in the end there are far too many caveats, including longevity of supply of the DDR2 chips. But this was the opener to the removal of the DAC bus, and to the 256KB hub, resulting in WIDE transfers to double the hub/cog bandwidth. So it was well worth the trip over the w/e.

Cluso99 · 2013-12-06 15:38

ctwardell wrote: »

How about in honor of Canada, and Rush of course, call it YYZ...

Now that I'm listening to it, the pace of YYZ fits the idea pretty well too!

C.W.

Even better...
YMMV #nnnnnnnnn

Bill Henning · 2013-12-06 15:39

LOL!!! That's PERFECT! It fits the functionality!

Cluso99 wrote: »

Even better...
YMMV #nnnnnnnnn

potatohead · 2013-12-06 16:15

LMFAO!! Well played.

Cluso99 · 2013-12-06 16:36

SETSLOT
SETSLOT is, as I understand it, quite simple. It uses otherwise unused bandwidth, so really it's a no-brainer. If you don't want to tell everyone, that's fine. But don't miss the opportunity.
This is a simple implementation...
(a) Each COG can YIELD (other cog takes priority) or GIFT (this cog has priority) its slot to another COG
SETSLOT #0_0_y_g_ccc
(b) Each COG can receive other COG(s) YIELD/GIFT slot(s), and/or accept any AVAILABLE slots
SETSLOT #r_a_0_0_000

What are you guys missing, aside from a bunch of naysayers using scaremongering tactics to derail a simple, powerful option that utilises otherwise "unused" power of the P2 silicon???

YIELD or GIFT a slot to another specific cog:

The donated slot must be specifically donated by a cog to another.
YIELD means it gives priority for its slot to the other specified cog, and
GIFT means it keeps priority for its own slot, handing it to the other specified cog only when it does not require it itself.

Only the "donor" cog can yield or gift its own slot specifically to another cog !

RECEIVE a YIELDed or GIFTed slot(s) from (an)other cog(s):

A cog can specifically enable itself to receive any specifically donated slot(s) to itself from other cog(s) by setting the "r" bit.

This use of YIELD, GIFT, RECEIVE would require sw (objects) to be written as cog pairs/triplets/etc and specifically enable their use.
No other cog(s) would be impacted by this change in slot behaviour !!!

What is not to like about specifically sharing bandwidth between co-operating cogs ???

UNUSED SLOTS:
Any cog can specifically enable ( "ACCEPT" ) the use of any unused slot by setting the "a" bit.

These are unused slots. No guarantee can be made about the availability of any unused slots.
"User beware: Other cogs may also be vying for these free slots, or other cogs may be slot intensive."

With the above caveat, why would you deprive an advanced user, who specifically and totally sets up his cog structure to take advantage of this,
the opportunity to squeeze out the last drop of available power from his P2 setup ??? Remember, it must be specifically enabled !!!
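The SETSLOT operand described above can be sketched as a 7-bit field, here in Python for illustration. The bit positions are an assumption read off the `#r_a_y_g_ccc` notation in the post (most significant bit first); nothing here reflects an actual P2 instruction encoding, and the function names are hypothetical.

```python
def encode_setslot(receive=False, accept=False, yield_=False, gift=False, target_cog=0):
    """Pack the proposed SETSLOT flags as r_a_y_g_ccc (assumed bit order)."""
    if not 0 <= target_cog <= 7:
        raise ValueError("target_cog must be 0..7")
    return (
        (int(receive) << 6)   # r: receive slots donated to this cog
        | (int(accept) << 5)  # a: accept any unused slot
        | (int(yield_) << 4)  # y: other cog takes priority on my slot
        | (int(gift) << 3)    # g: I keep priority, other cog gets leftovers
        | target_cog          # ccc: the cog my slot is donated to
    )


def decode_setslot(value):
    """Unpack a value produced by encode_setslot."""
    return {
        "receive": bool((value >> 6) & 1),
        "accept": bool((value >> 5) & 1),
        "yield_": bool((value >> 4) & 1),
        "gift": bool((value >> 3) & 1),
        "target_cog": value & 0b111,
    }
```

A cooperating pair would then be one cog executing the equivalent of `encode_setslot(yield_=True, target_cog=3)` and cog 3 executing `encode_setslot(receive=True)`.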

jmg · 2013-12-06 16:38

Bill Henning wrote: »

It would allow hubexec code to use unused hub cycles, see the example I posted, avoiding most stalls waiting for the hub. That is a good use case.

Yes, Sounds good to me.
