P2 - New Instruction Ideas, Discussions and Requests

ctwardell · 2014-03-28 15:25

Heater. wrote: »

This is a non sequitur. My arguments have been all about objects and drivers and such, in whatever language, that reside within a COG playing well with other code in a different COG.

Yes, but the reason you give is the possible timing issues, those same timing issues can happen in a cog running multiple tasks.

Heater. wrote: »

This is not true. I don't see how those two things are related (someone please correct me if I am wrong here)

Yes, my mistake, I meant the preemptive multitasking.

Heater. wrote: »

It's not a case of removing all those "bad things". We don't have them yet. It's a case of not introducing new bad things.

That's my point. We already have "bad things" from an absolute determinism point of view, so why such a strong stand against this one?

C.W.

evanh · 2014-03-28 15:36

ctwardell wrote: »

Yes, but the reason you give is the possible timing issues, those same timing issues can happen in a cog running multiple tasks.

I haven't really tried to step through the race-like conditions that might be of concern here but this might be one of those "too much rope" situations.

Slot pairing seemed a pretty good idea I thought.

Bill Henning · 2014-03-28 15:39

Heater,

All cogs can count on a hub cycle every eight cycles.

That's ALL they can count on.

A global disable for testing (or the paranoid) is a good idea. I support it.

Banning objects that mooch from Obex is a good idea. I support it.

I frankly don't see the theoretical problem cases you are postulating as being an issue.

Zog, ZiCog, MotoCog would all greatly benefit from mooch.

I cannot understand your stance against using otherwise unused slots.

With the two items above (global disable, banning from Obex) I do not believe you can provide a valid argument against mooch.

Bill Henning · 2014-03-28 15:41

Slot pairing is the next step after simple recycling.

Slot pairing would allow deterministic 4 cycle hub latency.

Personally, I would use "mooching" more, for compiled code.

I think that people are missing the main point - slot recycling does NOT increase the bandwidth a cog has to the hub, it just reduces latency. WIDE's already max out the bandwidth.
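A rough way to see the latency point is a toy arbiter, sketched here in Python. Everything in it is an assumption made for illustration, not the actual P2 logic: one hub slot per clock, slot n owned by cog n % 8, a single mooching cog, and the made-up names `run` and `gaps`. The owner always wins its own slot, so the only thing that changes with mooching is how long the moocher waits.

```python
NUM_COGS = 8  # assumed: one hub slot per clock, slot n belongs to cog n % 8


def run(cycles, gaps, moocher=None):
    """Toy hub arbiter (illustrative model only, not the real P2 design).

    gaps maps cog id -> clocks of work between one hub access being granted
    and the next request; cogs not listed never touch the hub.
    moocher is the one cog allowed to use slots their owner leaves idle.
    Returns cog id -> list of wait times (clocks from request to grant).
    """
    next_req = {cog: 0 for cog in gaps}  # clock at which each request becomes pending
    waits = {cog: [] for cog in gaps}
    for clock in range(cycles):
        owner = clock % NUM_COGS
        if owner in next_req and next_req[owner] <= clock:
            winner = owner        # the owner always wins its own slot
        elif moocher in next_req and next_req[moocher] <= clock:
            winner = moocher      # the slot would otherwise go unused
        else:
            continue
        waits[winner].append(clock - next_req[winner])
        next_req[winner] = clock + 1 + gaps[winner]
    return waits


def mean(xs):
    return sum(xs) / len(xs)


gaps = {0: 2, 1: 5, 3: 11}          # hypothetical workloads for cogs 0, 1 and 3
plain = run(4000, gaps)             # nobody mooches
mooch = run(4000, gaps, moocher=0)  # cog 0 mooches

print("cog 0 mean wait, no mooch:", mean(plain[0]))
print("cog 0 mean wait, mooching:", mean(mooch[0]))
print("cog 1 timing unchanged:", plain[1] == mooch[1])
print("cog 3 timing unchanged:", plain[3] == mooch[3])
```

In this model the non-mooching cogs see exactly the same grant times either way, because they are only ever served in their own slots.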

evanh wrote: »

I haven't really tried to step through the race-like conditions that might be of concern here but this might be one of those "too much rope" situations.

Slot pairing seemed a pretty good idea I thought.

evanh · 2014-03-28 15:46

Bill, I was mostly responding to the request from C.W. to have multiple MOOCHing Cogs, as per post #83

Heater. · 2014-03-28 15:48

ctwardell,

Yes, but the reason you give is the possible timing issues, those same timing issues can happen in a cog running multiple tasks.

No doubt tasks within a COG can mess with each other's timing. My argument is all about COG to COG timing.

We already have "bad things" from an absolute determinism point of view,...

No we don't. On a P1 a COG cannot mess with the timing of another COG.

...so why such a strong stand against this one?

A very good question.

Simply because I want you to be able to incorporate my driver into your project and for you to be confident that there will be no time stolen from your code, or vice versa, that will cause your program to fail in unexpected and random ways.

Bill Henning · 2014-03-28 15:50

Evan,

"most people..." was not directed at you

Only reason I was responding was the "too much rope"... I disagree with that part. Leaving such potential performance gain for compiled code / vm's on the table due to unreasonable fears blows my mind. (again, not directed at you)

evanh wrote: »

Bill, I was mostly responding to the request from C.W. to have multiple MOOCHing Cogs, as per post #83

Bill Henning · 2014-03-28 15:53

Mooch CANNOT steal time from your driver.

A Mooching cog CANNOT steal time from any cog!!! and cannot take the 1/8 cycle even from another mooching cog.

You have not shown any case where mooch could cause "your program to fail in unexpected and random ways"

You are using inflammatory and inaccurate language - mooching does not "steal" any time/slots, it uses slots that are otherwise unused.

Heater. wrote: »

Simply because I want you to be able to incorporate my driver into your project and for you to be confident that there will be no time stolen from your code, or vice versa, that will cause your program to fail in unexpected and random ways.

Dave Hein · 2014-03-28 15:56

Heater, I don't understand the resistance to mooching. If you don't want to use it then don't. Please let those that want the benefits of mooching use it. If you're afraid that mooching will cause problems with OBEX objects we can prohibit mooched objects from the OBEX.

Heater. · 2014-03-28 16:02

Bill,

With the two items above (global disable, banning from Obex) I do not believe you can provide a valid argument against mooch.

Well, yes you are right.

My entire argument is about a community of people sharing code. I've heard it's quite popular recently. Linux and all its billions of applications is one example. The Propeller and OBEX and the Prop community is another.

My case is all about the frictionless reuse of code from wherever it came. So that we can all build things easily, quickly and often in advance of our own capabilities and knowledge. Plug and play, no surprises.

So there is the value judgement. If that "community" of code and coders matters to you then so do the problems that slot sharing will cause. If it is not so important to you then, well, who cares let's cause chaos.

Banning from OBEX is not the thing. Spin objects come from other places all the time. And what about libraries in other languages, and so on?

As I said somewhere here already. It's Chip's call.

evanh · 2014-03-28 16:05

Heater does have a point, people will use it, even a singular variant, inappropriately.

However, I don't see that as the end of world. Others will rewrite the messes.

Bill Henning · 2014-03-28 16:14

Heater,

the point is, by banning mooching objects from Obex:

- people can share all the non-mooching objects they want
- obex will not spread mooching objects, which is what you want
- obex objects cannot "interfere" with each other due to mooching, which is what you want

Therefore, the above achieves all your stated aims.

Regarding libraries, same stance can be taken for Obex libraries - no mooching.

I have zero objections to banning mooching from obex, for all languages.

The community code/spirit with Obex lives the way you want.

Please note, just like I should not be able to force you to use mooching, you should not be able to force me NOT to use it.

I would be on your side - against mooching - if it could actually take hub cycles from cogs. But it cannot.

There are TONS of ways for obex objects to interfere with each other - say multiple instances of the same object improperly sharing DAT segments - that are far worse than any theoretical issues about a mooching object failing because of another mooching object. These occur far more frequently already.

Heater. wrote: »

Bill,

Well, yes you are right.

My entire argument is about a community of people sharing code. I've heard it's quite popular recently. Linux and all its billions of applications is one example. The Propeller and OBEX and the Prop community is another.

My case is all about the frictionless reuse of code from wherever it came. So that we can all build things easily, quickly and often in advance of our own capabilities and knowledge. Plug and play, no surprises.

So there is the value judgement. If that "community" of code and coders matters to you then so do the problems that slot sharing will cause. If it is not so important to you then, well, who cares let's cause chaos.

Banning from OBEX is not the thing. Spin objects come from other places all the time. And what about libraries in other languages, and so on?

As I said somewhere here already. It's Chip's call.

Heater. · 2014-03-28 16:28

I had this weird idea. Bear with me this is a bit sideways...

What if...what if COG 0 got every even HUB cycle, 0, 2, 4, 6, 8...

COG 1 gets every other odd cycle, 1, 5, 9...
COG 2 in turn gets half of the cycles that are left
And so on.

In this way COG 0 gets half the possible HUB bandwidth. COG 1 gets a quarter, COG 2 gets an eighth...and so on.

This scheme would probably satisfy pretty much most cases where a maximal performance COG is required whilst still working with other cogs on various tasks. Whilst at the same time enforcing that COG timing independence I campaign for.

Of course the designations of COG 0, 1, 2, 3 need not be their actual COG IDs but rather assigned at run time. Like thread cycling is done.
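The halving scheme can be written down as a slot-to-cog mapping. This is only a sketch of one reading of the idea: cogs are numbered consecutively by priority rank (rank 0 gets half the slots, rank 1 a quarter, and so on), the last rank absorbs whatever is left over, and `slot_owner` is a made-up name.

```python
from collections import Counter


def slot_owner(slot, num_cogs=8):
    """Priority rank that owns a hub slot under the halving scheme.

    Rank k owns the slots whose binary form ends in exactly k one-bits:
    rank 0 gets 0, 2, 4, ...; rank 1 gets 1, 5, 9, ...; rank 2 gets 3, 11, ...
    The last rank takes every remaining slot (an assumption, since the
    halving cannot go on forever with a fixed number of cogs).
    """
    rank = 0
    while slot & 1 and rank < num_cogs - 1:
        slot >>= 1
        rank += 1
    return rank


# Owners of the first 16 slots, and the share each rank gets over 256 slots.
print([slot_owner(n) for n in range(16)])
print(sorted(Counter(slot_owner(n) for n in range(256)).items()))
```

Because the mapping depends only on the slot number, each rank sees a fixed, repeating pattern of slots regardless of what the other cogs do, which is the timing independence being argued for.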

Just thinking ...

Heater. · 2014-03-28 16:39

Bill,
So "mooching" is banned from OBEX and all public repositories in all languages. The "good stuff" is under the counter. Great.

There are TONS of ways for obex objects to interfere with each other - say multiple instances of the same object improperly sharing DAT segments - that are far worse than any theoretical issues about a mooching object failing because of another mooching object. These occur far more frequently already.

Yes indeed.

That is why sharing memory between processes should also be banned. Tony Hoare worked all this out decades ago with his Communicating Sequential Processes model (CSP) as implemented in OCCAM on the Transputer and more recently on the XMOS chips with XC. No sharing of RAM and no sharing of time. Process independence makes things predictable and reliable.

Bill Henning · 2014-03-28 16:39

Interesting.

I've been thinking about how to share unused slots fairly.

How about:

Even cogs can only mooch unused even slots.

Odd cogs can only mooch unused odd slots.

I thought of this while trying to figure out how to hide the time it would take to make a "fair" decision as to who can get access to the otherwise unused slots.

It occurred to me that it is incredibly unlikely that a cog could consistently make use of all potentially moochable cycles, as a cog cannot do very useful work using all back to back hub cycles, and even if it could, Chip's special wide loop maxes out the hub bandwidth.

So I propose:

SETHUB #0|1 wc

where 0 is the default, "NORMAL" eight cycle cog

and 1 is a cog requesting mooching.

C = 1 if mooch was enabled for the cog successfully

C = 0 if mooch was not possible (due to global disable, or another cog mod 2 already is mooching).

This would allow:

- testing with mooching disabled
- one even mooching cog
- one odd mooching cog
- the odd/even mooches could not EVER interfere with each other
- the odd/even mooches could not EVER interfere with any other cog
- better utilization of spare slots than just allowing a single mooch per P2
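As a sanity check on the rules above, here is a small Python model of the proposed arbitration. It is a sketch under stated assumptions, not Verilog and not a real instruction definition: `MoochArbiter`, `sethub` and `grant` are invented names, slot n is assumed to belong to cog n % 8, and returning C = 0 for a `SETHUB #0` is a guess, since the proposal only defines C for the mooch request.

```python
class MoochArbiter:
    """Toy model of the even/odd mooch proposal (names are hypothetical)."""

    def __init__(self, global_enable=True):
        self.global_enable = global_enable
        self.moocher = {0: None, 1: None}  # slot parity -> cog currently mooching

    def sethub(self, cog, mode):
        """Model of 'SETHUB #0|1 wc'; returns the C flag."""
        parity = cog % 2
        if mode == 0:                       # back to the normal eight-cycle cog
            if self.moocher[parity] == cog:
                self.moocher[parity] = None
            return 0                        # assumption: C is cleared here
        if not self.global_enable:
            return 0                        # global disable
        if self.moocher[parity] not in (None, cog):
            return 0                        # another cog mod 2 already mooches
        self.moocher[parity] = cog
        return 1

    def grant(self, slot, requests):
        """Cog that gets hub slot `slot`, given the set of cogs wanting the hub."""
        owner = slot % 8
        if owner in requests:
            return owner                    # guaranteed 1/8 share, never taken away
        moocher = self.moocher[slot % 2]    # even slots to the even moocher, odd to odd
        if moocher is not None and moocher in requests:
            return moocher
        return None                         # slot goes unused


arb = MoochArbiter()
print(arb.sethub(2, 1), arb.sethub(4, 1), arb.sethub(3, 1))
print(arb.grant(6, {2, 3}), arb.grant(5, {2, 3}), arb.grant(2, {2, 3}))
```

In this model an even moocher and an odd moocher never compete for the same slot, and an owner that wants its slot always gets it.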

Better yet, by making the even/odd distinction, future P2+ could have more involved, "fairer" arbitration of who can get spare slots.

Best of all... this should be trivial in Verilog.

Your thoughts?

FYI, this bears resemblance to what you wrote, however this still guarantees 1/8 to every cog, and allows two mooches that cannot interfere with each other.

Heater. wrote: »

I had this weird idea. Bear with me this is a bit sideways...

What if...what if COG 0 got every even HUB cycle, 0, 2, 4, 6, 8...

COG 1 gets every other odd cycle, 1, 5, 9...
COG 2 in turn gets half of the cycles that are left
And so on.

In this way COG 0 gets half the possible HUB bandwidth. COG 1 gets a quarter, COG 2 gets an eighth...and so on.

This scheme would probably satisfy pretty much most cases where a maximal performance COG is required whilst still working with other cogs on various tasks. Whilst at the same time enforcing that COG timing independence I campaign for.

Of course the designations of COG 0, 1, 2, 3 need not be their actual COG id's but rather assigned at run time. Like thread cycling is done.

Just thinking ...

Bill Henning · 2014-03-28 16:42

Heater,

At least you acknowledge it's good stuff :-)

FYI,

I disagree with banning sharing memory between processes. It can be a very high performance data conduit. But we do need locks... I hope Chip bumps the number of locks to 32 (or at least 16)

Heater. wrote: »

Bill,
So "mooching" is banned from OBEX and all public repositories in all languages. The "good stuff" is under the counter. Great.

Yes indeed.

That is why sharing memory between processes should also be banned. Tony Hoare worked all this out decades ago with his Communicating Sequential Processes model (CSP) as implemented in OCCAM on the Transputer and more recently on the XMOS chips with XC. No sharing of RAM and no sharing of time. Process independence makes things predictable and reliable.

Cluso99 · 2014-03-28 16:52

heater: Far worse than all this slot sharing is hub ram sharing. In your eyes, the P2 should be banned because hub ram is not protected between cogs. Any cog can corrupt hub ram.
Get over it - let us make informed decisions whether and when to use it. You can globally disable it, but for me (and others) at least let us have the performance gains this gets us. </rant>

Heater. · 2014-03-28 16:54

Bill,

As soon as we talk about introducing new instructions to do this my brain shuts down. As far as I can see the P2 already has 400 op codes too many! Better not to think about it.

Yes, I do get the idea about the "good stuff". Just trying to avoid the bad stuff:)

I disagree with banning sharing memory between processes. It can be a very high performance data conduit.

But you are not disagreeing at all.

A high speed channel between processes is in no way the same as sharing memory.

Heater. · 2014-03-28 17:05

Cluso,

I do appreciate your desire for slot sharing. I'm only presenting the issues it can cause and my gut feeling about its impact on code reuse.

I don't have to "get over it" as it has not come to pass yet. From what Chip has said he may go for it or not. It's his call.

If it does come to pass I can live with it.

jmg · 2014-03-28 21:25

Bill Henning wrote: »

Whichever way takes less logic and less of Chip's time

The advantage to using XFER is the engine to do it in the background.

I did not mean instead of any XFER, but more as an optional/config means to double the Quad bandwidth.
Doing an OCT mode in the shifter is easy enough, but the pin-mapping may be a step too far, hence the idea of allowing the join of two Quad SerDes by 'locking' them in step to support 8-wide moves.
This has no pin-map changes, but it does consume both serdes.

potatohead · 2014-03-29 17:51

About this mooching... Got your coffee or tea handy? Good. Let's go!

Today, I thought about use cases. Perhaps we should go through a few of those and discuss the implications.

I'll write about the video driver to start off. I'm good at these and have experienced every failure mode there is, including actually damaging an old TV with a P1 and some poorly placed sync signals!

What are the possible failure modes?

1. Failure to generate the right signal at the right timing on the right pins given whatever clock speed the Propeller 2 is running at.

2. Failure to display the graphics in whatever graphics buffer makes the most sense.

3. In the case of dynamically drawn displays, those including sprites and other real time display tricks, failure to get all the data needed when it's needed exactly. Scan line time frames are fairly short.

4. Memory conflicts.

5. Communication conflicts.

Mooching really won't impact failure case #1. The P2 waitvid is advanced in a few basic ways compared to the waitvid in P1. It can operate on much more data per waitvid, and the way pixels and resolutions are expressed allows for graceful failure in a lot of cases that would corrupt the display on P1. A common case was too many or not enough PLLA cycles per pixel resulting in bad scan line timing and a corrupt display, BTW. The other was failure to reach the next waitvid in time, resulting in a glitched or corrupt display.

The single most important thing we can do to improve reuse and prevent failure overall on P2 is to make the drivers parametric so they calculate the signal timings and other basic constants they use from simple input: display type, resolutions, color depth, clock speed, etc...

Failure case #2 involves getting pixels from either the HUB or external memory and into the COG for waitvid to display. At the FPGA speeds, mooching will significantly improve HDTV type bitmap or tile drivers. Users could see failure in the form of some pixels not being displayed, sparkles and blank areas, etc... if there are not enough cycles to move the buffer from storage to COG. Additionally, the time required to get pixels from say, SDRAM to the COG determines "fill rate" time available for other programs to write to the buffer in order to display changing graphics, not just a static image.

Failure case #3 is related and a bit more involved. Mooching can help get more pixels into the COG for display, however at real chip clock speeds, doing this won't be mandatory. We can and will get HDTV displays running without the need for mooching. Mooching can improve some other metrics, like number of sprites per line, etc... but those all have graceful failure modes. Things simply won't appear, or will be partially displayed. This leaves the user free to make choices and see the impact of the mooched cycles. In general, well written driver code will perform very nicely without mooching, and won't fail because of it.

The single most important thing we can do to promote reuse is to use a multi-language friendly parameter passing scheme and take as much of the complication out of allocating a memory buffer as we can. I like the method Dave Hein recently put here where we use a single jmp instruction to hop over the parameters and mailboxes placed at the front of the COG image. With this method, we can poke data in from SPIN or C, if we want to, and or we can use a mailbox with relative ease. It's no more complicated than figuring out what the right pin, clock speed, etc... setups are for the signals.

Mooching could have some failure. However, it's most likely to be a graceful or soft failure.

It could result in a user expecting some display capabilities they don't actually see in practice given too much mooching is going on with other programs, but this will also be easily seen and the user is very highly likely to have options. One example of this is how many graphics COGS a user may apply to the problem of dynamically displaying objects, or simply displaying fewer objects. It's going to work either way, leaving them to push it, if they find they can, or want to.

At the real chip clock speeds, failures due to excessive mooching aren't likely to manifest in any but the most demanding and complex video schemes. Reuse is most impacted by some reasonable coding to pass parameters and or calculations to derive constants to make display setup and change easier. Like a user wants to show characters, then a bitmap, then something from a sprite list, etc...

Failure case #4 involves two kinds of conflicts. One is more than one program writing to a buffer or list area, corrupting the display of objects. Display signal is baked in as I've mentioned before, leaving us with what is displayed. If the users do not synchronize their buffer use, they may well see unexpected graphics. Mooching isn't a factor here.

A related thing may be fill rate improvements I have described in case #3 where having a process finish early may well allow another one more access time, which would improve fill rate due to display data scanning rates and transfer from buffer to COG rates improved. Failures are soft failures. Not hard ones where no display is seen, but more like some data in the display not seen or not seen consistently. Debuggable.

It's worth noting that these problems on P1 nearly always resulted in a failed display itself, not just pixels getting mangled up. The former is difficult to troubleshoot. The latter can be, but does leave the user free to look at the display and make choices as opposed to seeing nothing and having to deal with signal timing.

Case number 5 was a common display problem on P1 when authoring advanced displays. Most often it resulted in pixel corruption, not display corruption, but it could result in display corruption if the driver was not written in a way that prioritized the signal over the pixels. On P2, we've got Port D, various flags in HUB memory, locks and other things we can use to communicate with. Mooching really isn't a factor here either. Things optionally getting done sooner may free up options, but it's not going to cause failures.

All in all, mooching is a clear net gain, and might not even be needed! At the real speeds, I doubt it will be, unless....

Somebody wants to make a multi-driver package. In this scenario tasking would be used to draw a video display from a simple buffer, perhaps read a mouse, keyboard, etc... Mooching could impact this one, depending on how it was written, due to the fact that we use polling for video operations, not the double buffered waitvid, which acts more like a latch.

In this case, mooch could be used to provide another feature. Say a no mooch driver can do video, scan a bitmap display, overlay a mouse pointer, read a keyboard, read a mouse and offer some basic sound capability, maybe add in SD card functions through some kind of task overlay done with the task 3 and its swap in / out capability or HUBEXEC. Whatever.

Video is polled in this kind of a model, as are other input devices. It's possible that mooching could come to be necessary for the whole package to work, and failures would increase with this kind of driver package in a mooch capable world.

The single most important thing we could do to prevent this is to author the driver sans mooch, and either simply do not offer extras with mooch turned on, or have those extras fail gracefully.

Summary: Mooch is a nice addition to most video driver tasks, and it does not come with a failure mode that didn't already exist to some degree. Very advanced drivers may well depend on mooch, but they are unlikely to be reused anyway, since advanced drivers tend to be specific to a target application.

The video driver most susceptible to mooching-type failures is the multi-purpose "all in one" driver package, which needs to ensure it offers all the capabilities advertised sans the mooching. From there users could crank it up, but would always be able to count on the core capability set. At real chip clock speeds, the core capability set will be awesome enough that many will be perfectly happy to run with it.

Unlike P1, which offered almost enough video capability to make people happy, P2 does this easily for a ton of common use cases.

Reuse is more impacted by basic coding techniques, math calcs to ensure stable signals across a wide array of setups, sane buffer allocations that are simple and easy, etc... Mooch really doesn't impact these possible failure cases.

Overall mooch is a clear win.

Anyone care to break down the other primary use case; compiled code?

A simplistic example of application code might be a game loop. Game loops need to finish each video frame if the graphics are dynamically drawn and linked to the various actions, movements, and such found in a video game. It's fairly easy to write a game loop to fail nicely should it miss a frame too.

For this simple case, mooch is a win. Nobody will be reusing game loops without coding up a bunch of stuff to connect it all together again in the new context. For a game running just a tad slow, skipping frames, turning mooch on would very likely improve it to the point where it doesn't skip frames, leaving the user with a good experience.

Reuse isn't really a consideration here, but failure modes are. The single most important thing we can do to prevent failure is ensure drivers perform to spec without mooching. Developers can build their game how they want to, and they could choose how to use mooch and fail gracefully. One example might be game modes, where easy and medium do not stress the chip at all, but hard wants to make it tough on the player. A mooch would simply enhance the hard setting, which would be written to do everything it can.

Mooch in and of itself is only a net gain here with few downsides that I can see. Not being able to depend on it impacts a lot of basic decisions! For driver development, it turns out that the basic considerations (for video and similar things at least) needed to ensure excellent performance, and to eliminate failure cases or reduce them to easily handled soft failures, are the same cases where mooch would not contribute anything meaningful.

Would anybody care to comment on dynamic control systems, perhaps in tandem with compiled code?

Think through one and identify where potential failure modes exist and what impacts them, then factor in mooch as I did. We all should do this for some things we know about and understand better what the impact of mooch can be.

I really don't see it changing a whole lot of what I do and how reusable those things may be. The big reuse impact and failure avoidance comes from structuring the code in specific ways to basically design the failure modes out, or reduce them to soft failures that can be easily dealt with, e.g. partial object display, or pixels not ending up in a buffer on time. Choices would be to display fewer things, reduce resolution, switch to a slower display, which allows more time, etc... all of which are improved by better, smarter driver code and not by mooch.

@Heater: It is this thinking through I just shared that convinced me having a simple mooch option is only an upside in most of the cases I'm familiar with.

If it gets more complex, like hub priority schemes, pairing, etc... then we end up with more reuse problems and more potential failure modes associated with those features, but a passive cycle mooch isn't at issue so much as well designed driver code is.

And here is the thing: Well designed driver code is a problem for us mooch or not! Adding mooch doesn't improve on that in a material way, leaving us with the same basic problems whether or not we have mooch.

So I changed sides on that basis. For the things I've done, mooching would not have brought me any new problems, and it rarely, if ever, would have augmented ones I had anyway.

I want a walk through on some technical level as I have done here for some use cases where mooch would be a problem.

Sapieha · 2014-03-29 18:13

Hi All.

Most of the things I read on this thread about Round-Robin access to HUB are not applicable.

The only thing that can meet with success is 7 rings of Round-Robin, switched so that each one is one COG shorter.

0-1-2-3-4-5-6-7 ---- 0-1-2-3-4-5-6 ---- 0-1-2-3-4-5 ---- 0-1-2-3-4 ---- 0-1-2-3 ---- 0-1-2 ---- 0-1 ---- 0 ---

and 2 instructions for COG ----
RRON
RROFF
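One way to read the rings above is that the hub rotation covers only cogs 0 to N-1 for a chosen ring size N. The Python sketch below generates the slot order for such a ring. This interpretation is an assumption, `slot_sequence` and `worst_case_wait` are invented names, and the RRON/RROFF instructions are not modelled since their exact behaviour is not spelled out.

```python
from itertools import cycle, islice


def slot_sequence(ring_size, length):
    """First `length` hub slot owners when only cogs 0..ring_size-1 rotate."""
    if not 1 <= ring_size <= 8:
        raise ValueError("ring_size must be between 1 and 8")
    return list(islice(cycle(range(ring_size)), length))


def worst_case_wait(ring_size):
    """Clocks a cog waits if it just missed its slot (one slot per clock assumed)."""
    return ring_size - 1


for size in range(8, 0, -1):
    print(size, slot_sequence(size, 10), "worst wait:", worst_case_wait(size))
```

A shorter ring gives each remaining cog a larger, still fixed, share of hub slots, at the cost of shutting the excluded cogs out of the rotation entirely.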

potatohead · 2014-03-29 18:31

I don't like that one. COGS would be written with the assumption that X number of COGS would be out of the round robin scheme, and that definitely would create the kinds of problems we all want to avoid.

Dave Hein · 2014-03-29 19:12

I really don't understand any of the arguments against mooching. If your code is time critical then don't mooch. It seems like the single-cog mooching algorithm proposed by Bill should be doable in a small amount of logic. I hope that Chip is seriously considering this so we can try it out on the FPGA. If it turns out to be as horrible as the naysayers claim then it should be removed. Chip please let us at least try it out on the FPGA.

Bill Henning · 2014-03-29 19:17

potatohead,

Nice analysis, but only oriented towards video. I'll briefly address video, then go on to the cases where I find mooching even more helpful.

1) Video

Video can be broken down into a few different usage cases

1A) Video Display Refresh Engine - ie just drawing a bitmap from the hub, or SDRAM, or making a display from tiles

Mooching does not help here, nor is it needed. With Chip's latest loop, an RDWIDE can be transferred to AUX or cog memory in 8 clock cycles, synced with the hub.

Even if the driver wanted to mooch, it could not transfer more than 8 longs (a wide) in 8 cycles... so mooching is moot for this case.

1B) Video Drawing Engine - drawing into a bitmap

As you clearly pointed out, games are already used to skipping frames when a new frame could not be generated in time. With page flipping, and pages in sdram, such games would degrade gracefully if mooching was turned off, but could handle higher frame rates with mooching on.

A clear win for mooching, and the downside is the lower frame rate you would get if you did not mooch. To me, that means no down side.

1C) Dynamically Drawn Screens

Generally, such screens draw a scan line, which the video circuit displays while the next scan line (or block of lines) is built.

If there is not enough time, usually objects are left off once the renderer is out of time - so the failure case is missing some sprites, or more likely flickering sprites.

Mooching would help draw more objects, but as you point out, it is hardly a critical failure mode if some graphics "sparkle".

The "failure mode" is the same as if there was no mooching, so there is no real down side!

P2 Limitations

One of the largest remaining P2 limitations is the single line of data cache.

This will mean a great deal of thrashing when there is more than one task in a cog accessing the hub frequently, as each access by a different task to data in the hub will re-load the dcache, quite possibly reducing hub access to non-cached speed.

Mooching will help greatly here!

Examples:

2) Tasks

Mooching will help greatly here, as the cache reloads can occur as soon as there is an available slot to recycle

Two or more tasks in a cog will run a lot smoother with mooching enabled, should they need the hub bandwidth, as the dcache will re-load in less than 8 cycles.

Any argument that the tasks should not count on extra mooched hub access is ridiculous, as anyone who worries about that should never use the RDxxxxC cached reads.

Anyone using RDxxxxC, then complaining about not being able to count on extra speed due to mooch is being a hypocrite.

Compiled hubexec code

With all the stack access compiled C code does, along with global, array, and other data access, mooch will help compiled code greatly, as the single line of dcache will be constantly re-loaded. BIG speed win here.

Again, if anyone wants to complain about not being able to rely on extra mooched hub cycles, they should never use RDxxxxC.

Virtual Machines (byte code interpreters)

Basically the same situation as compiled code, the single line of dcache limits performance, and mooch will alleviate it to a great extent.

Again, if anyone wants to complain about not being able to rely on extra mooched hub cycles, they should never use RDxxxxC.

More than one cog using mooch

I posited a simple "even number cogs can only mooch even slots, odd number cogs can only mooch odd slots" that could provide for two mooches that cannot possibly interfere with each other.

That is a good backup solution, however I would not be at all surprised if Chip could do a "round robin" allocation of extra slots fairly easily.

Executive Summary

- Mooching is a very big win
- The "risks" are minimal

Ramon · 2014-03-29 19:18

Sapieha wrote: »

0-1-2-3-4-5-6-7 ---- 0-1-2-3-4-5-6 ---- 0-1-2-3-4-5 ---- 0-1-2-3-4 ---- 0-1-2-3 ---- 0-1-2 ---- 0-1 ---- 0 ---

I think that "0-1-2-3" would be very useful for testing HUB execution in the 4 cogs DE2-115. (and " --- 0 ---" for the 1 cog DE0-nano).

Heater. · 2014-03-29 19:27

@Potatohead,

Good mooching...I mean morning, it's 5.11am here. I put the kettle on especially.

All of your analysis boils down to the fact that mooching, extra bandwidth available for some functionality, is a "clear win"...for the functionality that needs it. In your examples, that is video functionality.

This is obvious at a cursory glance, is it not?

My counter argument is that two such functionalities, that require the extra bandwidth provided by mooching, cannot be used together in a single Propeller. There simply isn't the bandwidth available. This introduces incompatibilities between objects we might want to use together due to timing issues that do not exist in the P1 or the semantics of Spin/PASM.
This has an impact on the ease of code sharing and reuse we have with the Propeller. Much akin to the difficulties interrupts cause for code reuse in regular processors.
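
The bandwidth point can be put as simple arithmetic. The sketch below models the argument, not the real hardware: it assumes 8 hub slots per rotation, one owned slot per cog, and demands expressed in whole slots per rotation. The function name and the example numbers are hypothetical.

```python
SLOTS_PER_ROTATION = 8  # assumed: 8 cogs, one hub slot each per rotation


def demands_fit(demands):
    """Return True if every cog can get the slots it needs each rotation.

    demands[i] is the number of hub slots per rotation that cog i needs,
    counting its own guaranteed slot. Anything above 1 has to be mooched
    from slots that other cogs leave unused.
    """
    if len(demands) > SLOTS_PER_ROTATION:
        raise ValueError("more cogs than hub slots")
    return sum(demands) <= SLOTS_PER_ROTATION
```

For example, one object that needs 5 slots per rotation fits alongside three ordinary cogs (`demands_fit([5, 1, 1, 1])` is true), but two such objects can never run together (`demands_fit([5, 5])` is false), no matter how idle the rest of the chip is.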

These new timing issues and the breaking of the Spin/PASM semantics worry me. They seem not to worry many others. Chip will decide.
These arguments are not restricted to Spin/PASM of course. They are a change in the existing "contract" that Spin offers programmers regarding timing.

@Sapieha,

Your suggested scheme for configurable HUB slot distribution is much better than my musings earlier. I don't like either of them though.

potatohead · 2014-03-29 19:38

Yes Bill, oriented toward video, and was hoping others would toss in. We are doing this to help the "don't understand" part of things. And to vet it. Perhaps there really is a problem! I think some use case discussion would ferret that right out.

Personally, the use case I like the most is for compiled code. That one is very compelling. Most of the trouble scenarios were framed around drivers and/or comms interaction between processes.

Honestly, the only one I can think of is the case where there are two dependent processes, and one of them needs to run on a time schedule, keyed to some event. It would accept data from other data generation / aggregation type processes, which might not get it done. So somebody turns mooch on instead of refactoring to hit the goal. Refactors could include using another COG to better parallelize the problem, or it could be a rearranging of code to make sure the pipeline works better, hub cycles, etc...

So now they have it working with a speed injection obtained from the untrustworthy moocher!

Then they decide to do something like increase their specs on a driver, turning on a mode that comes much closer to consuming the maximum non-mooch per COG cycle allocation and their speed boost goes away, triggering failure.

In this scenario, not having mooch would have forced them to refactor, or simply not meet their requirement or finish the project. "it's impossible" kind of thing, or "too much work" kind of thing.

We've all had that happen on P1, and the answer was to refactor and parallelize to get it done. That taught all of us lots of things we know now that new users won't.

So here's the question: Won't they have to learn that anyway?

In the case of no mooch, they would have to learn it without seeing the benefit of some increased speed. In the case of mooch, they might not learn it and something works, but when it doesn't, don't they have to learn it anyway?

And this is why I really liked mooch. Take that same question and frame it up with something like "Turbo" and the answer remains the same, but the connotation and expectations surrounding that answer change considerably.

In the case of mooch, it's a clear bandaid masking what is otherwise some basic thing they should be doing, but aren't, or aren't yet. In the case of turbo, it's not a clear bandaid, because it's turbo! Turbos are supposed to go faster!

Mooching works until it doesn't, but there isn't really an expectation of it working consistently. Turbo does come with an expectation of working, otherwise why have turbo? But turbo would also work, until it doesn't.

The difference is in the expectations. Mooch would more frequently result in, "I knew better, but..." and Turbo would more frequently result in, "It should have, but..." and there is a whole world of difference there when it comes to the work needed to follow through and refactor to make it work solid.

Heater. · 2014-03-29 19:38

Dave,

I really don't understand any of the arguments against mooching. If your code is time critical then don't mooch.

It's not about me not mooching in my code if I don't like it. If all the code I run is mine I can handle mooching.

It's about code in your object, that mooches, and Bill's object, that mooches, that I mix up in my program only then to find the program fails at random due to the lack of bandwidth to support both. Thanks guys.

...we can try it out on the FPGA. If it turns out to be as horrible as the naysayers claim then it should be removed.

Only problem is people will try it out and may well say it's easy and great; the issues I'm alluding to won't show up until the PII has hundreds of objects in OBEX and elsewhere and thousands of confused users.

potatohead · 2014-03-29 19:44

@heater, yes.

I did identify one that is at issue, and that's the multi-purpose driver crammed into a COG that depends on mooch. That one is a valid potential hard failure mode when more than one is present, or some other tasks require most, or all of their HUB cycles. If somebody builds one of those, they really need to build it non mooch.

That one isn't a clear win, and it comes down to expectations if built to depend on mooching. We don't depend on moochers, right?

Now, the other failure case is generalized in the post above. What's your answer to that question?
