P2 - New Instruction Ideas, Discussions and Requests

Bill Henning · 2014-03-29 19:49

I think having a way of testing (forcing mooch off) would be extremely helpful. I think I already suggested using a "Lock #31" as a global mooch disable for testing purposes.

Personally, I plan on using mooching to speed up compiled business logic / hmi code and VM's, where the worst consequence is running a bit slower.

potatohead · 2014-03-29 20:19

See, that is where I would put it too Bill. That's the most compelling case. Drivers and other things are the least compelling case.

On a COG basis:

We've got drivers, which should not mooch. These also see the most reuse.

There are specific purpose things, which may benefit from mooching. They have moderate to low reuse, which means they have low to moderate potential conflict, if they even conflict. It's not like every single thing needs the max mooching.

Compute COGs of various kinds. Probably won't benefit from mooching. High reuse potential.

What else? VM's? Those would likely benefit from mooching. They could conflict with themselves and special purpose things, but I don't know that they would be used where strict timing is needed either. Thoughts on this case? Could see significant reuse.

HUBEXEC

Sprite engine, game loop, GUI, Business logic, etc.... Could be compiled or PASM. They are all good mooch cases, and they all would interact with drivers, which are not good mooch cases, because we don't depend on moochers. They could conflict with specific purpose things, depending.

Drivers? SD card, etc... Not sure on this one.

Packaged multi-purpose drivers could easily be HUBEXEC, and also are not always going to be strict COG code, which has reuse implications in addition to potential failures. These may be tempting to mooch, but we shouldn't.

XMM EXEC

You all know it's gonna happen, which is precisely why Bill is always talking about VM's.

The VM these depend on would benefit from mooching. Reuse is high, if the XMM EXEC demand is high. Really big programs.

What I'm not seeing, guys, is massive failure cases. I am seeing niche failure cases, and a few potentially common cases with moderate failure probability. A whole lot of those come with problems the user would have likely had anyway, with mooch being a kludge fix at best.

The key here is we don't mooch in drivers. Really.

I second the NO MOOCH case, however we set it. However, I would not declare it for testing, instead declaring it compliant, or reliable mode, something like that. Maybe just no mooch, or normal. Compliant, etc...

Perhaps Chip can cram this into COGID, like he did the fuses, thus not generating another instruction.

Heater. · 2014-03-29 20:20

potatohead,

Now, the other failure case is generalized in the post above. What's your answer to that question?

I'm not sure I can identify that use case in the above.

All these failure cases generalize to the same thing. There isn't enough bandwidth to do what two or more things want to do. Things that I may have picked from OBEX or elsewhere and assumed would "just work", as they do on the P1.

Personally, I plan on using mooching to speed up compiled business logic / hmi code and VM's, where the worst consequence is running a bit slower.

I like this approach. Why would I not want my Z80 emulator to run as fast as possible? For example. If it happens to not get that full speed due to the presence of a moocher it's not a big deal. (Assuming the other moocher is also not so time critical of course).

How can we allow moochers but at the same time avoid the potential mysterious random failures due to timing conflicts I have alluded to? Users of mooching objects need to know in advance that they may have issues.

Oddly this mooching business introduces a "slop" in timing familiar to users of Linux and such systems. The system tries to get as much performance for your code as possible, but with no guarantees. A kind of slop Chip has railed against in the past.

RossH · 2014-03-29 20:23

Heater. wrote: »

My counter argument is that two such functionalities, that require the extra bandwidth provided by mooching, cannot be used together in a single Propeller. There simply isn't the bandwidth available. This introduces incompatibilities between objects we might want to use together due to timing issues that do not exist in the P1 or the semantics of Spin/PASM.
This has an impact on the ease of code sharing and reuse we have with the Propeller. Much akin to the difficulties interrupts cause for code reuse in regular processors.

These new timing issues and breaking the Spin/PASM semantics worries me. It seems to not worry many others. Chip will decide.
These arguments are not restricted to Spin/PASM of course. They are a change in the existing "contract" that Spin offers programmers regarding timing.

I think this is a powerful argument against general-purpose mooching, Heater. The contract the Propeller makes with its users is a much stronger contract than most other micros make, and this is one of the Prop's major attractions. It includes, for instance "deterministic timing", but there are other implicit things such as "no cog is special" (at least after boot-time) and "no pin is special" (ditto).

Parallax should think long and hard before it breaks that contract on the P2. As you point out, it would make the whole concept of the OBEX frustrating and unworkable for beginners. Programs that work fine with one instance of a class of objects (e.g. a video driver) will fail utterly (or worse, occasionally and mysteriously) with another instance even if the second instance looks like a drop-in replacement!

However, Cluso and I were recently discussing special-case mooching - I haven't read all the posts on this stuff, so apologies if this has already been discussed to death. It is essentially the idea of allowing "multi-cog" clusters to mooch - but only between cogs in the same cluster. Cogs outside the cluster are guaranteed to get the current deterministic behaviour (unless they are themselves in another multi-cog cluster).

The classic example on the P1 of such clusters would be the more complex video drivers. Catalina has video drivers that require up to 5 co-operating cogs (e.g. for high resolution VGA with vector graphics) - let's call this a "5 cog cluster". On the P2, cogs in such a cluster could "donate" their resources (hub access slots, mainly) to other cogs in the same cluster.

The simplest example would be a 2-cog cluster where one of the cogs uses all the hub access slots of both cogs (essentially doubling its HUBEXEC speed), and the other cog essentially does nothing (or at least nothing that requires hub access - it could of course perform other tasks on behalf of the cluster).

While the "cluster" concept need not be formalized in hardware, this would be advantageous - requiring a cog to specifically donate its resources to the cluster it belongs to would prevent accidental cross-cluster dependencies.

This preserves the essence of Prop's basic contract with its users, which is now converted from the P1's promise of "no surprises for any other cogs", to a P2 promise of "no surprises for any other cogs outside the cluster".

Ross.

Heater. · 2014-03-29 20:29

potatohead,

We've got drivers, which should not mooch. These also see the most reuse.

Why do you single out drivers? This sloppy timing affects all code the same.

Consider:

I have some large HUB exec code performing some calculation that takes a long time. Say an FFT. It needs to do that FFT every 20ms, say. At which time some other part of the system is requiring the result.

To meet the performance demands I let my FFT mooch. That might shave enough ms off to get the job done. All is well.

Until I want to use some other mooching object. Boom one or both fails at random.

As a consumer of objects I need to be made aware of such potential failures ASAP. At compile time preferably. Especially if I'm used to the P1's determinism.

P.S. FFT may not be a good example here as it is actually small enough to fit in COG. The point is the same: it's not just driver code that is impacted.

Dave Hein · 2014-03-29 20:31

Though I think hub-slot sharing would be beneficial to P2 I give up trying to promote it. It's doubtful that Chip would implement it anyway, so I'm done wasting my time talking about it.

Heater. · 2014-03-29 20:35

Ross,

Parallax should think long and hard before it breaks that contract on the P2

Yes, exactly. (Are we agreeing again? It must be a full moon or something)

I can of course see the benefit of the performance offered by "mooching". I'm all for it if that "break in contract" sticks out like a sore thumb to users.

Have to think about that "clustering" idea. Interesting. At first sight it seems complicated. I'd like to see instructions removed from the PII rather than added. Anything that smells like a "mode" has me worried.

potatohead · 2014-03-29 20:51

I'm OK with mooch not making it in there myself. Just to make that clear.

I am genuinely attracted to the compiled code case. No matter how I think of it, this one is compelling. And it might make something like XMM EXEC viable, practical instead of just possible too.

Re: Complex video drivers.

On P2, we aren't going to have those in the same way we do P1.

First, waitvid is double buffered. This helps A LOT. Many tight timing cases are designed away, leaving more robust display signals. Then we have waitvid capable of handling a lot more data than P1 waitvid currently does. Often, for a whole lot of very common and desirable display cases, ONE waitvid can do an entire scan line worth of data!

Second, the overall speed of P2 is much better matched to HDTV display requirements. We can get text, bitmaps, tiles and other basic arrangements in a single COG. P1 has to run COGS in parallel due to a lot of things. We don't have to do that on P2. That said, dynamic displays will still be complicated and they will still benefit from multiple COGS, however those could also benefit from a mooch and fail / fall back nicely enough. We had those same kinds of fall backs and failures on P1 when there were not enough COGS to generate display data, and in the case of high resolution, a failure when there aren't enough too. Mooch doesn't change that at all.

P2 is fast enough to deliver scan lines and signals all the way up to 1080i without mooching. So the core basis for video is a non-mooch case out of the box.

Third, the display renders out of AUX RAM. This very seriously improves waitvid in that it's always got data, and there isn't a hard fail when something else takes too long to fetch data, like there is in the case of a P1 type display doing a bitmap or text, for example.

One artifact of this is we can also adjust pixel clocks easily! It's stupid simple to go from 256 pixels / scan line to 1000 pixels on a scan line on a TV display, and that works right now at FPGA speeds. One only needs to adjust the code filling the AUX RAM.

Basically, this potential mooch failure case goes away on P2 with the waitvid design changes. We may experience it on very complex 3D type drivers, or where somebody is seriously playing tricks, but that's not a reuse case generally anyway. An example of such a trick might be mixed mode displays where multiple waitvids do happen per scanline. Could be something like a 4 color high resolution display with a lower resolution full color window in it, for example. Another one might be textures. We've yet to author serious code yet. When we get to polys, that's going to benefit from mooching. Lots of data fetch, math, etc... all of which can happen in parallel. Honestly, that one is as compelling as compiled code is.

RossH · 2014-03-29 20:53

Heater. wrote: »

Yes, exactly. (Are we agreeing again? It must be a full moon or something)

I know - it's a worry, isn't it? It's almost enough to make me think I must be wrong after all!

Heater. wrote: »

Have to think about that "clustering" idea. Interesting. At first sight it seems complicated. I'd like to see instructions removed from the PII rather than added. Anything that smells like a "mode" has me worried.

I'm sure Chip could easily do it in hardware - essentially just one more instruction (a "donate" instruction). And who's going to notice one more amongst the zillions of new ones we already have on the P2?

Ross.

Bill Henning · 2014-03-29 21:00

Guys,

I have to make this point again, because there is much talk of failure cases due to not having enough (mooched) bandwidth for drivers.

THAT IS FLATLY IMPOSSIBLE.

The RDWIDE loop Chip showed is the MAXIMUM POSSIBLE ATTAINABLE BANDWIDTH. MOOCH CANNOT IMPROVE ON THAT.

A single cog, single or multi-tasked, CANNOT use more than one long per clock cycle, and a WIDE has eight longs. Mooching cannot change this.

The only thing mooching can change is the latency to the next hub access slot.

This means WRxxx can be faster, RDxxxx can be faster, the RDxxxx cache can be reloaded faster.

This will mainly benefit manipulating non-sequential hub data, or "random seeks".

Drivers mostly deal with sequential bursts, where mooching will not help.

Even if P2 runs only at 160MHz, there will be 20M hub slots per second.

Show me ANY drivers need more *RANDOM HUB ACCESS* than that.

Note that that access is the same as cog memory access on the P1!!!!!
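As a rough check of the slot arithmetic above, here is a minimal Python sketch. It assumes 8 cogs served in strict rotation with one hub slot per clock, which matches the figures quoted in this thread. The function name is made up for illustration, and the P1 comparison uses assumed typical figures (80 MHz clock, 4 clocks per cog instruction) as one reading of the remark about P1 cog memory access.

```python
NUM_COGS = 8  # assumption: 8 cogs share the hub in strict rotation


def hub_slots_per_second(clock_hz: int, num_cogs: int = NUM_COGS) -> int:
    """Hub slots one cog is guaranteed per second without mooching."""
    return clock_hz // num_cogs


# P2 at the 160 MHz figure used in the post.
p2_slots = hub_slots_per_second(160_000_000)

# Assumed typical P1 figures: 80 MHz clock, 4 clocks per cog instruction.
p1_cog_instructions = 80_000_000 // 4

print(p2_slots, p1_cog_instructions)
```

Under these assumptions both numbers come to 20 million per second, which is the comparison being drawn.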

Compiled code, VM's, complex sprite graphics etc., can greatly benefit from mooching. But those have "soft" harmless failure cases.

Hard deterministic drivers DO NOT NEED mooch.

Think about it.

Seriously.

The "potential problems" are being HUGELY blown out of proportion.

potatohead · 2014-03-29 21:03

This sloppy timing affects all code the same.

No it doesn't.

There is code that has to meet a deadline and there is code that doesn't. Where there is some sort of very regular, recurring activity, the determinism is what makes a Prop rock hard. We all know that. Where something has to happen at a specific time, or before an amount of time has passed, it rocks hard too.

When does this happen? Drivers mostly. That's why I singled them out.

User interaction code, business logic, GUI, and other general case code just needs to run fast. And where it needs to be accurate, we have lots of other things getting in the way of that. HUBEXEC isn't the same as COG code in this way.

Bill's comments seconded.

Which leaves us the case of bigger programs, compiled code, which is a very compelling case.

RossH · 2014-03-29 21:03

potatohead wrote: »

Re: Complex video drivers.

On P2, we aren't going to have those in the same way we do P1.

...

I agree about video drivers - what takes 5 cogs on the P1 should only take 1 cog (or perhaps 2 cogs) on the P2.

But that's missing the point - on the P1 we now routinely do heaps of things that would have been considered "out of the question" when the P1 was on the drawing board.

I'm sure (or at least I hope!) the same will be true of the P2 - but if I could predict now exactly what it would be that we are going to be struggling to use the P2 for a few years from now, I'm sure Chip would add an instruction for it so we didn't need to!

Bill Henning · 2014-03-29 21:07

Clustering, priorities etc greatly complicate the hardware required to implement mooch, for no discernible benefit, and also waste unused slots that could go to a good purpose.

I have yet to see a technical argument showing a benefit from them.

I can see a slight potential benefit to the "paired" guaranteed 4 cycle mode, but even that is very limited.

If simple to implement, round robin sharing of unused slots (as Chip proposed) is not a bad idea.

The even/odd mooching I proposed some messages back is not terrible.

ANYTHING COMPLICATED I AM AGAINST.

potatohead · 2014-03-29 21:08

Yes, and that's gonna happen (pushing it) on the P2, but you know what that will be like video wise?

Let's talk multi-COG delivering 4K pixel display capability in a micro-controller. (and yes, I'll attempt it at some point, and I think it's totally possible too, provided the 4K display has YCbCr inputs...) Even doing that won't benefit from mooch as Bill said. Moving bursts of pixels already happens at peak speed today, sans mooch.

Now, the sprite list or GUI code associated with that 4K display? Deffo moochable.

That's pushing the P2, and it's insane! Pushing the P1 to get XGA made a lot of sense. Perhaps a 4K display will too, I don't know, but I do know the need to push it is very significantly diminished over what the need is on P1.

Bill Henning · 2014-03-29 21:14

Nicely said.

And it points out that NOT having mooch will significantly limit us.

Right off the bat, not having mooch will *DELIBERATELY THROW AWAY* a LOT of VM/compiled code performance, to no benefit.

That is why I will keep fighting for mooch, at every message against it (*1)

Sheesh

Note 1: When I find time, and wifey does not notice

potatohead wrote: »

Yes, and that's gonna happen on the P2, but you know what that will be like video wise?

Let's talk multi-COG delivering 4K pixel display capability in a micro-controller. (and yes, I'll attempt it at some point, and I think it's totally possible too, provided the 4K display has YCbCr inputs...) Even doing that won't benefit from mooch as Bill said. Moving bursts of pixels already happens at peak speed today, sans mooch.

Now, the sprite list or GUI code associated with that 4K display? Deffo moochable.

That's pushing the P2, and it's insane! Pushing the P1 to get XGA made a lot of sense. Perhaps a 4K display will too, I don't know, but I do know the need to push it is very significantly diminished over what the need is on P1.

RossH · 2014-03-29 21:19

Bill Henning wrote: »

ANYTHING COMPLICATED I AM AGAINST.

I think that's kind of Heater's (and my) point - we are saying essentially the same thing, but perhaps from a different perspective.

Anything that breaks the contract between the P2 and the user is undesirable, since it complicates things for the user, and may lead to mysterious timing problems.

But mooching would be fine provided you can find a way to do it that does not do that.

Ross.

potatohead · 2014-03-29 21:26

I'm against the complexity too. No schemes, registers, lists, etc... A COG is either 0=compliant, or 1=moocher.

You know, I wonder whether or not the "who gets the mooched cycle?" can't be a round robin of the active mooching COGS? If there is one moocher, it gets all it can mooch. If there are two moochers, they get 'em in turns. If there are three, then it's 012012... etc.

Is that simple in terms of chip logic?
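To make the round-robin idea concrete, here is a small Python simulation. It is only a sketch under stated assumptions, not a description of any actual chip logic: the hub slot owner rotates one cog per clock, cogs listed in `idle_cogs` never use their own slot, and a moocher always wants every spare slot. The function and parameter names are hypothetical.

```python
def allocate_slots(num_cogs, idle_cogs, moochers, num_clocks):
    """Return which cog gets the hub on each clock (None = slot unused).

    Assumptions: slot ownership rotates one cog per clock; an owner in
    idle_cogs gives up its slot; spare slots go to moochers in turn.
    """
    order = sorted(moochers)
    next_moocher = 0
    grants = []
    for clock in range(num_clocks):
        owner = clock % num_cogs
        if owner not in idle_cogs:
            grants.append(owner)  # owner uses its own slot as usual
        elif order:
            grants.append(order[next_moocher])  # spare slot is mooched
            next_moocher = (next_moocher + 1) % len(order)
        else:
            grants.append(None)  # nobody mooches, slot is wasted
    return grants


# Three moochers (0, 1, 2) sharing the spare slots of five idle cogs.
print(allocate_slots(8, {3, 4, 5, 6, 7}, [0, 1, 2], 8))
```

In this model non-moochers keep exactly the slots they had before, and only slots that would otherwise go unused are handed out, in 012012... order.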

I am not sure I understand what is needed for XMM EXEC to make sense. Does this benefit from mooching, on the COG that's doing the XMM EXEC in the first place? Bill? Come on, I know you've had to have explored this...

BTW, a buddy of mine has one of those cheap, early 4K displays. And it's got YCbCr on it...

Just sayin' I'll have to mooch it away from him for a while someday when we've got real chips. Who knows? Probably the FPGA could do it now multi-cog. Sheesh. We just don't have multi-cog DAC emulation. If we did, there are already some component video tricks I would love to try.

Bill Henning · 2014-03-29 21:27

I should have been clearer... I meant anything that makes mooching complicated.

I truly believe that the concerns are not really warranted, and are blown way out of proportion.

I have yet to see a valid technical example of a problem. I see a lot of hand waving about potential problems if people abuse it. Never mind that it is not really abusable, as mooching *cannot* be used to get more bandwidth.

It does not break "the contract".

PLEASE PROVIDE EXAMPLES OF "MYSTERIOUS TIMING PROBLEMS" !!!!!

When I advocate, or oppose, a suggestion *I PROVIDE SOUND TECHNICAL REASONING*

I can be convinced with sound technical arguments, backed by examples and proofs.

I hate anything that looks like FUD, and will oppose hand waving "might cause a problem" arguments as they are NOT scientific or reasonable.

One poorly written object clobbering the hub somewhere is FAR WORSE than anything mooch could do. AND HAPPENS EVERY DAY.

Yet no one is proposing getting rid of objects that use the hub...

Seriously people.

Make logical arguments, backed up by well reasoned technical arguments. Not FUD.

Leave FUD to the politicians. They do it so well

EDIT:

Just to make sure I am not misunderstood. I *WELCOME* technical objections from everyone. But I expect them to be backed up, and argued on the technical merits, instead of "it might cause problems". Show examples. Or at least provide strong technical support for the position taken. Don't just wave your hands.

RossH wrote: »

I think that's kind of Heater's (and my) point - we are saying essentially the same thing, but perhaps from a different perspective.

Anything that breaks the contract between the P2 and the user is undesirable, since it complicates things for the user, and may lead to mysterious timing problems.

But mooching would be fine provided you can find a way to do it that does not do that.

Ross.

Bill Henning · 2014-03-29 21:28

Actually that is exactly what Chip was proposing when I was discussing "HUNGRY" with him. I liked it.

potatohead wrote: »

I'm against the complexity too. No schemes, registers, lists, etc... A COG is either 0=compliant, or 1=moocher.

You know, I wonder whether or not the "who gets the mooched cycle?" can't be a round robin of the active mooching COGS? If there is one moocher, it gets all it can mooch. If there are two moochers, they get 'em in turns. If there are three, then it's 012012... etc.

Is that simple in terms of chip logic?

I am not sure I understand what is needed for XMM EXEC to make sense. Does this benefit from mooching, on the COG that's doing the XMM EXEC in the first place? Bill? Come on, I know you've had to have explored this...

Heater. · 2014-03-29 21:33

@Bill,

I have to make this point again, because there is much talk of failure cases due to not having enough (mooched) bandwidth for drivers.

THAT IS FLATLY IMPOSSIBLE.

That sounds wrong to me. Defying Amdahl's law and physics itself.

The argument is that mooching gets a COG more bandwidth to HUB for increased performance. Sounds desirable.

That bandwidth has to come from somewhere.

Clearly 1 moocher benefits when it can "suck up" bandwidth left over by other COGs.

8 moochers that require the extra kick of mooching can not work together. If they could then why ever have any non-mooching mode?

Hard deterministic drivers DO NOT NEED mooch.

Perhaps. But large hub exec'ed code with hard real-time requirements, which could be in the order of ms upwards might need to.

@Potatohead,

No it doesn't.

Yes it does.

The extra bandwidth of mooching does not come out of thin air. If it did mooching would be the normal and only mode as I said above.

@Bill,

Keep at it. Your arguments have nearly sold me on the idea. Not because it's "FLATLY IMPOSSIBLE" for mooching to cause problems when mixing and matching code, which is physically impossible, but because you have almost convinced me that such problems are going to be extremely rare in practice.

If only there were a way to know with confidence when there is or is not a problem.

I agree the clustering business seems overly complex.

potatohead · 2014-03-29 21:38

The argument is that mooching gets a COG more bandwidth to HUB for increased performance.

That isn't the argument though.

Without mooching, a COG gets its chance at the HUB every time its turn comes up on the round about. That COG can ask for a QUAD, WIDE, LONG, BYTE, WORD, whatever.

So a bite is taken, and the COG gets what it asked for. If it doesn't ask for much, it doesn't get much. The cycle gets used. It can ask for the max RDWIDE and the cycle gets used. Either way, it gets used.

Asking for the max takes up all the time. It can't mooch because it's busy dealing with the full payload it asked for and by the time it's ready, its normally assigned HUB cycle time has come up again. Mooch or not, the behavior is the same.

Now a moocher COG takes a smaller bite of data from the HUB. Say a WORD. And it needs another WORD.

If it's not mooching, then it has to wait the full round robin time for that word. But if it is mooching, and another COG fails to utilize its cycle, then the moocher can get another WORD right then. Or, if the COG asking for the max suddenly wants to nibble on bytes, it can mooch cycles to get those bytes fetched more quickly, etc...

That's what mooching is.

Essentially, "Hey man, can you spare a cycle?"
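The latency effect described in this post can be sketched in a few lines of Python. This is a simplified model under assumptions, not the real arbitration logic: hub slot ownership rotates one cog per clock, cogs in `idle_cogs` leave their slot unused, there is a single moocher, and the time the cog spends processing data is ignored. The function name is hypothetical.

```python
def clocks_until_next_access(cog, now, num_cogs, idle_cogs, mooching):
    """Clocks a cog waits, starting at clock `now`, for its next hub access."""
    for wait in range(num_cogs):
        owner = (now + wait) % num_cogs
        if owner == cog:
            return wait  # the cog's own guaranteed slot has come round
        if mooching and owner in idle_cogs:
            return wait  # a spare slot turned up first and is mooched
    raise AssertionError("own slot always recurs within num_cogs clocks")


# Cog 0 used its slot at clock 0 and wants another WORD at clock 1.
print(clocks_until_next_access(0, 1, 8, {4}, mooching=False))
print(clocks_until_next_access(0, 1, 8, {4}, mooching=True))
```

With cog 4 idle, the non-mooching cog waits for its own slot to come round again, while the moocher is served as soon as cog 4's unused slot arrives. The amount of data per access is the same either way; only the wait changes.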

Bill Henning · 2014-03-29 21:49

Heater. wrote: »

@Bill,

I have to make this point again, because there is much talk of failure cases due to not having enough (mooched) bandwidth for drivers.

THAT IS FLATLY IMPOSSIBLE.

That sounds wrong to me. Defying Amdahl's law and physics itself.

Well, compared to optimized code it is impossible.

A cog, single tasking or not, can only use 8 longs per 8 clocks, even with Chip's REPS WIDE loops.

So if drivers are written for maximum performance with RDWIDE/WRWIDE, they cannot get extra bandwidth from mooching, as they cannot use more than 8 longs per 8 clocks.

Now poorly written drivers, that RDBYTE each byte separately, would get a speed boost, but that's crappy code.
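The bandwidth ceiling argued here can be checked with a short Python sketch. It assumes 8 cogs in strict rotation at the 160 MHz clock mentioned earlier in the thread, and one hub access per guaranteed slot with no mooching. The table of access sizes and the function name are illustrative; a WIDE is taken as eight longs, as stated above.

```python
NUM_COGS = 8  # assumption: strict 8-cog hub rotation

# Bytes moved per hub access; a WIDE is eight 4-byte longs.
BYTES_PER_ACCESS = {"byte": 1, "word": 2, "long": 4, "wide": 32}


def guaranteed_bytes_per_second(access: str, clock_hz: int) -> int:
    """Hub bandwidth for one cog using only its own slots."""
    slots = clock_hz // NUM_COGS
    return slots * BYTES_PER_ACCESS[access]


for access in BYTES_PER_ACCESS:
    print(access, guaranteed_bytes_per_second(access, 160_000_000))
```

Under these assumptions a WIDE loop already moves four bytes, that is one long, per clock, so spare slots have nothing to add to it, while byte-at-a-time access reaches 1/32 of that figure from its guaranteed slots alone.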

Heater. wrote: »

The argument is that mooching gets a COG more bandwidth to HUB for increased performance. Sounds desirable.

That bandwidth has to come from somewhere.

Actually the benefit is the lower latency, so that more dcache refills, WRxxxx, and non-cached RDxxxx can happen. Which is the usage pattern of VM's and compiled code, which do not use maximum bandwidth WIDE loops, but exhibit a more random access pattern, thus benefiting greatly from lower latency.

Due to the 8-long WIDE hub bus and the clock frequency, no matter what is done, nor how, a cog cannot read/write the hub faster than 1 long per clock, due to the design of cogs. Therefore mooch or not, the maximum bandwidth is achieved by Chip's RD/WR WIDE loops. End of story, without a complete chip re-design.

(Now on a P3, with hub and cog memory re-organized in 8 long lines, as Ray was pushing for, is a different story, but we can argue about that after the P2 is in our hot little hands)

Heater. wrote: »

Clearly 1 moocher benefits when it can "suck up" bandwidth left over by other COGs.

8 moochers that require the extra kick of mooching can not work together. If they could then why ever have any non-mooching mode?

Perhaps. But large hub exec'ed code with hard real-time requirements, which could be in the order of ms upwards might need to.

The extra bandwidth of mooching does not come out of thin air. If it did mooching would be the normal and only mode as I said above.

Why on earth would 8 cogs turn on mooch? I see no reason why anyone logical would do that. Might as well try to protect newbies from filling the hub with null, or stopping their own cog.

It would still be faster than not having mooch (assuming round robin) as most cogs tend to not use most of their hub cycles, but worst case, it would degenerate to 1 in 8

For the hard real time guaranteed timing, don't rely on mooch - it's nuts to do that as it cannot be deterministic. Throw it in a hard real time cog.

Heater, please think about it. Mooching gets you lower latency, the extra bandwidth is a side effect, and far less than what tight WIDE code can get. What I referred to was the impossibility of getting more bandwidth than Chip's RDWIDE/WRWIDE 1 long per clock.

EDIT:

Re hubexec 1ms "hard" real time.

On a P2, I think that qualifies as soft real time

1ms @ 160MHz = 160,000 cycles = 20,000 hub slots without mooching. Plenty for 1ms grain events, even with hubexec, unless you need more than 160k cycles to process the event.

Heater. wrote: »

@Bill,

Keep at it. Your arguments have nearly sold me on the idea. Not because it's "FLATLY IMPOSSIBLE" for mooching to cause problems when mixing and matching code,

Not what I said/meant. Not possible to get more bandwidth with mooch than Chip's 1 long per clock WIDE example, which is what I meant.

Heater. wrote: »

which is physically impossible, but because you have almost convinced me that such problems are going to be extremely rare in practice.

If only there were a way to know with confidence when there is or is not a problem.

Unlikely to generate problems frequently, and even then, only with poorly written code.

Other poorly written code will generate tons more problems

Heater. wrote: »

I agree the clustering business seems overly complex.

Heater. · 2014-03-29 21:57

Bill,

I do like a debate about the pros and cons of technical solutions. But please don't suggest that opposing views are "illogical" or "FUD".

Mooching provides more bandwidth to HUB for a COG for increased performance. If not why have it? Very desirable I must say.

Bandwidth is limited. 8 moochers that need the performance won't get it. Amdahl's law makes that certain. Ergo there are now timing dependencies between COGs that would not exist otherwise. The logic is sound.

As in all engineering decisions there are trade-offs to be made. Those in favour of mooching value the potential performance gains for a COG over the loss of timing independence of COGs. Those who worry about mooching value the timing independence.

What Chip has to decide is whether the performance gains of a COG are of sufficient benefit to outweigh the break in semantics of the device.

I must say that you have almost convinced me that the issues I worry about are unlikely in the majority of applications.

potatohead · 2014-03-29 22:03

As in all engineering decisions there are trade-offs to be made. Those in favour of mooching value the potential performance gains for a COG over the loss of timing independence of COGs. Those who worry about mooching value the timing independence.

Yes, and there are those of us who also see that the cases where the performance benefits are favored strongly are also not likely to be timing dependent. (this took me quite some time, to be honest)

That's precisely why I was talking through various cases. Was looking for a strong correlation between the kinds of HUB transactions that mooch would improve on and timing dependencies.

It's hard to find them.

I do like a debate about the pros and cons of technical solutions. But please don't suggest that opposing views are "illogical" or "FUD".

Seconded. Some of the aspects of this kind of debate involve people, and those aspects are not always technical.

Heater. · 2014-03-29 22:13

Bill,

Why on earth would 8 cogs turn on mooch? I see no reason why anyone logical would do that.

8 might be a bit extreme but I can imagine the exact same code being run on say 4 cogs. Don't ask me what, say some funky high speed communications driver that I want 4 links of. All of which need "mooching speed". (XMOS xlink protocol perhaps).

How does mooching fare for those 4 comms channels? Do I find that I can test one and it works fast enough but when I have 4 it runs out of steam? Or are we still good?

Edit:

Re hubexec 1ms "hard" real time.

On a P2, I think that qualifies as soft real time

Not at all. "hard real-time" says nothing about how long that time deadline is. Could be a microsecond, a millisecond, a second... a year. The only point is that if you miss it you are dead.

My code that takes a millisecond with mooching may take 1.5 without, oops...missed deadline.

Bill Henning · 2014-03-29 22:17

Heater,

Heater. wrote: »

Bill,

I do like a debate about the pros and cons of technical solutions. But please don't suggest that opposing views are "illogical" or "FUD".

With all due respect - and I do respect you and your opinions - suggesting that an improvement not be put in, because it may cause some hypothetical adverse issues if misused, without presenting plausible cases, is dangerously close to FUD.

Warning that it may cause potential problems is fine, but you have been fighting pretty hard NOT to have it put in. Thus I had to counter that

Heater. wrote: »

Mooching provides more bandwidth to HUB for a COG for increased performance. Very desirable I must say.

That additional bandwidth in reality is due to the lowered latency, and while it does show up as extra bandwidth, that is a side effect of the lower latency.

Heater. wrote: »

Bandwidth is limited. 8 moochers that need the performance won't get it. Amdahl's law makes that certain. Ergo there are now timing dependencies between COGs that would not exist otherwise. The logic is sound.

*IF* it was reasonable to have eight moochers expect extra performance, the logic would be sound.

I posit it is NOT reasonable.

There are only timing dependencies if the people writing the code ignore the warnings in the manual, or do not sufficiently get familiar with the system they are trying to program as a critical system.

I don't know of a single well run project where such issues would not be exposed during a code review, and the guilty party making bad assumptions would not get a little talk from the project manager.

Now as a matter of interest, if someone was dumb enough to turn on mooching in all eight cogs, the overall performance would go up, as there is no way that all cogs would use most of their hub bandwidth. But no competent engineer would rely on that.
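The disagreement above can be put into numbers with a toy model. This is only a sketch under loud assumptions: the function name, the even split of unused slots among moochers, and the utilization figures are all hypothetical and are not taken from the P2 design. It also counts slots only and ignores the latency effect mentioned earlier in this post.

```python
# Toy model only: the sharing rule is an assumption, not the P2 hub design.
# The hub rotation has 8 slots, one owned by each cog.

def slots_per_moocher(own_use, moochers):
    """Average hub slots per rotation that each moocher ends up with.

    own_use[i] is the fraction of its own slot that cog i uses (0.0 to 1.0).
    moochers is the set of cog indices with mooch enabled.
    Assumes unused slots are split evenly among the moochers.
    """
    spare = sum(1.0 - u for u in own_use)
    share = spare / len(moochers) if moochers else 0.0
    return {cog: own_use[cog] + share for cog in sorted(moochers)}


# One saturated moocher, seven cogs that use a quarter of their slots.
one = slots_per_moocher([1.0] + [0.25] * 7, {0})

# Four saturated moochers, four light users.
four = slots_per_moocher([1.0] * 4 + [0.25] * 4, {0, 1, 2, 3})

# Eight saturated moochers: nothing is left over to share.
eight = slots_per_moocher([1.0] * 8, set(range(8)))

print(one, four, eight, sep="\n")
```

In this model a moocher never drops below its own slot, which matches the point that non-moochers are unaffected, while the extra it gets shrinks as more saturated moochers compete, which is the dependency Heater is describing.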

Heater. wrote: »

As in all engineering decisions there are trade-offs to be made. Those in favour of mooching value the potential performance gains for a COG over the loss of timing independence of COGs. Those who worry about mooching value the timing independence.

I posit that there is no loss of timing independence with competently written code.

And I am not concerned with incompetently written code, as those would also do other stupid things like clobbering the hub, not allocating enough stack, etc.

Heater. wrote: »

What Chip has to decide is: are the performance gains of a COG occasionally of sufficient benefit to outweigh the break in semantics of the device?

Heater, again you are using loaded terms, incorrectly, in an attempt to influence.

"the performance gains of a COG occasionally" - denigrates the potential performance gain, which Dave has shown to be 30% for Dhrystone as one data point.

"break in semantics" - I consider this factually incorrect, as if "mooch" is not enabled, there is no change, and even if there is, the timing can only change for cogs that use mooch.

Heater. wrote: »

I must say that you have almost convinced me that the issues I worry about are unlikely in the majority of applications.

Thanks.

I will be amazed if any problems arise from mooch from well written applications.

Heck, I would fall out of my chair if any problems were caused by you using mooch.

potatohead · 2014-03-29 22:19

Wouldn't that be a case of depending on the moocher Heater?

To depend on the moocher and fail, those 4 COGs would have to be requesting lots of little bits of data from the HUB, and they would be needing to get it all done more quickly than they would by hitting their normal HUB cycle access schedules, and they would need to be running at the same time, and the other COGs in the system would have to be hitting their cycles hard enough to make the mooching unproductive.

But we don't depend on the moocher. That's got to be baked in to the dialog from the start.

Seems like a pathological case to me.

Something like a 4K display also serving as a 4 port USB hub, with keyboard, mouse, etc...

Bill Henning · 2014-03-29 22:28

Ok, let's look at 4 cogs.

Are you telling me that you would not test the four link case if you needed four links?

Also, I cannot see the need for mooch for drivers, as it could not get the driver any more hub bandwidth than a driver using WIDE's.

Now regarding XMOS links...

As I recall, they come in two wire and five wire flavors.

1) A two wire link is essentially a differential link, slightly bit stuffed, with a 7.5ns inter-symbol delay

With hardware SERDES and clever code, this should be possible. Off the top of my head, receive 10 symbols, look up and done. 7.5ns inter-symbol delay, assuming it correlates with bit period, implies a 133MHz symbol rate, however it could be asynchronous, in which case SERDES won't help. In any case, either not possible, or the 10 symbols drop the 8 bit data rate to about 13MB/sec, trivial to handle without mooch.

2) A five wire link, which has four symbols per token (byte?), and P2 won't be able to handle it (well, *maybe* with really clever code if XFER handles 8 bit transfers).

Even then, only about 32MB/sec... where is the need for mooch?

Sorry, your proposed example is nowhere near enough to need more than the 640MB/sec hub bandwidth available to a non-mooching cog.
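As a rough check of the arithmetic above, here is a minimal sketch in Python. Every figure is the one quoted in this post (7.5ns inter-symbol delay, 10 symbols per byte, about 32MB/sec for the five wire link, 640MB/sec per cog); none of them is taken from an XMOS or P2 datasheet, and the variable names are made up for illustration.

```python
# Back-of-envelope check of the link estimates in the post above.
# All inputs are the poster's figures, not datasheet values.

INTER_SYMBOL_DELAY_NS = 7.5    # quoted inter-symbol delay
SYMBOLS_PER_BYTE = 10          # "receive 10 symbols" per 8 bit value
FIVE_WIRE_MB_S = 32            # poster's five wire estimate, taken as given
COG_HUB_MB_S = 160 * 4         # 640MB/sec for a non-mooching cog

# 1000 ns per microsecond / delay = symbols per microsecond = MHz
symbol_rate_mhz = 1000 / INTER_SYMBOL_DELAY_NS

# bytes per microsecond is the same number as MB/sec
two_wire_mb_s = symbol_rate_mhz / SYMBOLS_PER_BYTE

for name, rate in (("two wire", two_wire_mb_s), ("five wire", FIVE_WIRE_MB_S)):
    share = rate / COG_HUB_MB_S
    print(f"{name}: {rate:.1f}MB/sec, {share:.1%} of one cog's hub bandwidth")
```

Either way the link rate is a small fraction of what one cog already gets from the hub, which is the point being made.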

Heater. wrote: »

Bill,

8 might be a bit extreme but I can imagine the exact same code being run on say 4 cogs. Don't ask me what, say some funky high speed communications driver that I want 4 links of. All of which need "mooching speed". (XMOS xlink protocol perhaps).

How does mooching fare for those 4 comms channels? Do I find that I can test one and it works fast enough but when I have 4 it runs out of steam? Or are we still good?

I guess it's pretty obvious I've played with Xmos

I have their latest "Pi" board (it is a pretty cute board), and about half a dozen of their older dev boards.

I hate the restrictions on XC, on sharing memory, and the way the I/O mapping is done.

I do like the way you can automatically fire out a long as eight strobed nibbles. Or get eight nibbles strobed in and read it as a long. Neat.

I find the P2 far more fun even in FPGA emulation.

Bill Henning · 2014-03-29 22:33

Quick calculation:

4K display = 3840x2160@30Hz (ignoring blanking times), officially 330MHz dot clock

= 248.8MB/sec @ 8bpp

Cog bandwidth at 160MHz

160*4 = 640MB/sec

Conclusion:

Assuming we can drive the dot clock to 330MHz, a 4K display can be done on a single cog, no sweat. Even at 16bpp.

WITHOUT MOOCH
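The same calculation as a small Python sketch, for anyone who wants to try other resolutions or depths. It uses only the figures from this post (3840x2160 at 30Hz, blanking ignored, 160*4 = 640MB/sec per cog); the function and constant names are just for illustration.

```python
# Pixel data rate for the 4K example, blanking ignored as in the post.

WIDTH, HEIGHT, REFRESH_HZ = 3840, 2160, 30
COG_HUB_MB_S = 160 * 4   # 640MB/sec figure quoted for one cog at 160MHz


def display_mb_per_sec(bits_per_pixel):
    """Bytes of pixel data needed per second, in MB/sec (1MB = 10**6 bytes)."""
    bytes_per_frame = WIDTH * HEIGHT * bits_per_pixel // 8
    return bytes_per_frame * REFRESH_HZ / 1e6


for bpp in (8, 16):
    need = display_mb_per_sec(bpp)
    verdict = "fits" if need < COG_HUB_MB_S else "does not fit"
    print(f"{bpp}bpp: {need:.1f}MB/sec, {verdict} in {COG_HUB_MB_S}MB/sec")
```

At 16bpp the requirement roughly doubles to just under 500MB/sec, still inside the 640MB/sec figure.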

Edit: The question is... will the TV's take a 4K signal over component video or VGA?

Heater. · 2014-03-29 22:40

Bill,

Thus I had to counter that

Of course. And I had to counter that

Whilst I'm all in favour of people RTFMing and being "competent", to be honest I think it's an unrealistic expectation. 99% of users are not going to invest the time to become familiar with the nearly 500 instructions in the PII and its mass of other complex features. Mooching may well be one thing they skip over. Most users will be incompetent most of the time.

Well run projects with dedicated and skilled teams are one thing. The mass of the world's programmers cannot operate like that.

One should care very much about those "incompetents". They are the majority. They are potential customers. The chip should help them.

Heater, again you are using loaded terms, incorrectly, in an attempt to influence...."the performance gains of a COG occasionally" - denigrates the potential performance gain

A fair point. I retract the "occasionally".

"break in semantics" - I consider this factually incorrect, as if "mooch" is not enabled, there is no change, and even if there is, the timing can only change for cogs that use mooch.

I think it's factually sound. If COGs can throw the switch and thus influence each other's timing, that is a different semantic or "contract" as RossH put it.

Heck, I would fall out of my chair if any problems were caused by you using mooch.

Ha! I take that as a challenge. I will henceforth lobby Chip to include the mooching feature so that I can take it up.

Looks like you have won with that
