Proof of benefits has been asked for - some examples have already been given. How about proving how failures can occur with standard OBEX objects if no objects using slot sharing are permitted in the OBEX?
Now, I have already given an example of USB FS where hub slot sharing may be required. Currently, using 80MHz FPGA P2 cog code, I cannot perform USB FS. Multiple cogs do not actually help if you cannot read the bits inline! And this was using single-clocked instructions of the old P2. I am hoping to get a helper instruction from Chip.
Just so I don't mislead: the overclocked P1 did do USB FS, with a lot of non-compliance and, IIRC, 5 cogs.
I am hoping to avoid a lot of this non-compliance, but I don't expect to be fully compliant.
Now, what is the harm if I can write an object that gets 2x slot accesses by taking 2 cogs (paired cogs, as in 1 & 9, etc.) and using a slot-sharing scheme?
There is a simple Test metric here: the Peak Burst Rate possible for each of 8/16/32 bits.
There are many designs that need to burst capture 8/16/32 bits, and the REP opcode gives a zero overhead loop.
That makes the slot-spacing the primary limiter on Peak Burst Rate bandwidth.
What are the Peak Burst Rates for Burst Read / Write for each of 8/16/32-bit transfers, with and without Slot management?
Another class of Burst Capture wants to also capture the trigger - I think 2 COGS are needed here.
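Since Peak Burst Rate here is just transfer width over slot spacing, the metric is easy to tabulate. A quick Python sketch (the 200MHz sysclk, the 1:16 stock spacing, and the 1:8 paired-slot spacing are my illustrative assumptions, not chip specs; one 8/16/32-bit element is assumed to move per hub window inside a REP loop):

```python
# Peak Burst Rate = one element per hub window, windows every `slot_spacing` clocks.
# Assumed figures (not chip specs): 200MHz sysclk, 1:16 stock rotation,
# 1:8 when a cog also uses its partner's slot.

def peak_burst_mbps(width_bits, sysclk_mhz=200, slot_spacing=16):
    """Peak burst bandwidth in MB/s for back-to-back REP'd hub transfers."""
    windows_per_us = sysclk_mhz / slot_spacing   # hub windows per microsecond
    return windows_per_us * (width_bits / 8)     # bytes per window -> MB/s

for width in (8, 16, 32):
    stock = peak_burst_mbps(width, slot_spacing=16)
    paired = peak_burst_mbps(width, slot_spacing=8)
    print(f"{width:2}-bit: {stock:5.1f} MB/s stock, {paired:5.1f} MB/s with slot sharing")
```

At those assumed numbers, pairing simply doubles each row - which is the whole point of the metric: once REP removes loop overhead, slot spacing is the only variable left.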
These issues of parallel processing, shared memory, and maximizing performance/throughput are as old as computing itself. Gene Amdahl even has a law in his name describing the problem.
There is no "one size fits all" solution.
The good old, brain-dead-stupid round-robin approach is quite OK for 99% of anything anyone will want to do with the Propeller, whilst at the same time being easy to use.
I'm trying to remember... Do we still have Quad read/writes to the HUB? Isn't that enough?
Even that only reaches 50% of the possible bandwidth, and if Writing TO the HUB, you also need code to build the Quad.
The methods being discussed can always be configured to match the present lower bandwidth (so you lose nothing).
What they add (at minimal silicon cost) is the ability to use less power for the same BW as now, and to allow unused cogs to not represent wasted bandwidth.
Some of us disagree with the Propeller concept that all COGS are the same and the COG code always works no matter what the other COGS are doing.
The rest of us aren't trying to solve for anything! We understand the value in that concept and will refactor or parallelize problems instead of breaking a core attribute of the Propeller.
Uhh, no, that's a complete mischaracterization of what is being proposed. Just because you disagree with it does not mean you are right.
As I asked before, when you demanded that everyone interested in it prove that it won't break X,Y,Z.
With just the basic, and probably most used implementation of Core0 (Primary) using Core8 (Donor), please explain how this breaks anything in Cores 1-7, and 9-15 ?
And again, we're assuming the programmer is NOT an idiot, and loading objects needing hub access into Core8 (Donor).
As someone pointed out as well, with this feature, no one NEEDS to expend the skull sweat to somehow thread this between 2 Cores.
While it's great that some have, requiring everyone to do so, and to do so by copying & tweaking or hand-rolling their own, seems rather bassackwards.
I don't really have a horse in this race as a newbie. However the logic seems pretty sound, especially as for all the groaning and FUD, I haven't seen ANYONE actually point out a FAILURE to even the simple request I have up above.
There is also this idea that the chip won't compete unless we break the core Propeller concept, and the need for larger, faster programs is the most frequently cited reason. With 16 COGS and enough RAM in the HUB to run multiple meaningful programs at the same time, plus augment them with fast PASM drivers / helpers, we have more room than anyone really expected to make it all work very nicely.
I agree the current capabilities are fantastic.
However, these repeated allegations that 'the sky is falling' and the Prop's raison d'être is being sacrificed are, in this case, seemingly false.
Until someone actually shows this, not just says so.
This isn't even close to changing the core of the Propeller; heck, not even the core of an apple.
You have a Prop project using 9-10 Cores, with 6 sitting there powered down.
You need more bandwidth/lower latency, for some reason.
Without this feature, you may or may not be able to figure out some wacky 2, 3, 6 multi-processing ball of string.
With this feature, you can get your primary Core the resources it needs and complete it, without affecting any of your other Cores.
Realistically, it's then just as fair to say that those few who have managed to herd some cats with a ball of string simply want everyone else to be forced to do it that way, instead of what appears to be an intrinsically simpler way.
Once we have this one out there, making money to fund the future, seriously discussing how we might improve on the robust and proven round robin access makes sense. We can write code, simulate, tweak, etc... and get it to a place where we know it makes sense, or does not.
Preserve the awesome, by shipping it as soon as we can.
People seem to be mindlessly repeating this, ad nauseam.
OK, what is the Date we are looking at for completing this WITHOUT this suggestion?
And, what is the Date we are looking at for completing this WITH the suggestion?
No one here knows.
Chip can probably give a good guesstimate, and that's it.
Considering Chip has ALREADY OPINED that this would most likely be a "not too difficult" (can't recall exact words) or some such effort, I guess people here are better informed than him.
To those crying FUD, no. It is a value judgement, and the value of COG code always working no matter what the other COGS are doing is a very high value thing. Some of us don't agree, and that is the basis for the discussion, which I support. Right now, we have very little real consensus on how to implement a more complex scheme, nor do we have it on the potential impacts.
I'm not inclined to support any of it, until those two happen, and for it to happen we will need to do some exploring, testing, etc... none of which makes sense on this design or time table.
On the next one? Yeah. Bring it. I personally think we should blow it out a little, target a real System On A Chip, no OS required, and have the HUB be external RAM with the MMU logic needed to operate on very large amounts of memory. Maybe package it all together, or make a single reference design that can be produced, whatever...
At that scale, and with some goals aimed at being more than a micro-controller, the schemes may well make a ton of sense.
Actually, no, it's basically FUD.
Neither you nor any of the other naysayers have shown where it would fail, even in its most basic implementation as per above.
Neither you nor anyone else has shown the smallest sliver of data that adding this will ACTUALLY affect any sort of tapeout, timeline, etc.
Ergo, these ARE value judgements based on no proven facts whatsoever, but on lots of hyperbole and downright "I want it now" personal desires.
IF Chip were to come out and say that this feature has a distinct possibility of delaying them making their next shuttle run, I am pretty confident that everyone wanting this feature would voice their opinion to DROP IT for the sake of the project.
However, as the Prop has always pushed the boundaries, when you hit the Core limit, what do you do then?
You have to resort to the ball of string multi-core solution. And even with that, you are just as likely to get diminishing returns.
If Chip were to come out and say this feature is less than significant to implement, and in reality has little bearing on the overall tapeout of the chip; seeing as it is voluntary to use, and the OBEX would quite likely have a separate section of hub-sharing Objects to keep stupid newbies like me from making those mistakes, would you still be against it?
If so, why?
(Had to make that sort of a leading question, since I know you're the shy, retiring type.)
These issues of parallel processing, shared memory, and maximizing performance/throughput are as old as computing itself. Gene Amdahl even has a law in his name describing the problem.
There is no "one size fits all" solution.
The good old, brain-dead-stupid round-robin approach is quite OK for 99% of anything anyone will want to do with the Propeller, whilst at the same time being easy to use.
As I stated in a prior post, the solution I offer should only require (assuming Chip uses a 4-bit counter for tracking) the addition of a 4-bit mask and AND gate in the hub-selection circuitry, as well as a very small amount of logic to set the mask in the cog management circuitry. It doesn't even require any additional instructions to enable/disable anything. As I said, it's so simple that Chip could add it right now and no one would even know it's there! I bet that adding this wouldn't even slow him down in getting out an FPGA image.
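For anyone who wants to see what that mask does, here is a minimal Python model of the hub-selection idea as I read it (the mask encodings are my guesses at the intent, not Chip's actual design): the 4-bit scan counter is ANDed with the mask before it selects a cog.

```python
def hub_rotation(mask, clocks=16):
    """Cog granted the hub window on each clock: 4-bit scan counter AND mask."""
    return [counter & mask for counter in range(clocks)]

# Mask 0b1111: stock 1:16 rotation, every cog gets its own slot.
print(hub_rotation(0b1111))
# Mask 0b0111: slots 8-15 fold back onto cogs 0-7, so each low cog
# also inherits its +8 partner's slot (the pairing scheme).
print(hub_rotation(0b0111))
```

With mask 0b0111 the rotation becomes 0..7, 0..7: every low cog now sees a window every 8 clocks, and the high cogs (the donors) see none. No extra counters are needed, which is why the silicon cost is claimed to be so small.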
Because this TopCog scan control is very simple to do, I coded up a couple of choices in verilog -> FPGA to give some real numbers.
As expected, neither has any upper-speed impact, as a small 4-bit counter, no matter how it reloads, runs >> 200MHz.
Design S) used a Compare-at-Write to set the TopCog value.
TopCog starts at 0(RST) and takes the highest-activated-cog number when running.
Design P) takes a Boolean from each COG, which represents CogIsOn
TopCog is then a simple priority encoder on that set of 16 flags.
In both cases, enabling all 16 cogs gives exactly the present 1:16 slot assignment.
Both update on scan reload: when Scan=0, it reloads to TopCog.
Both use very small amounts of logic: P uses 27 LUT4s and S uses 10 LUT4s (& some of this is path-test logic).
S reports 422.476MHz @ P&R and P reports 389.408MHz
I prefer P, as that follows TopCog on a live basis, whilst a Compare (S) can only ever increase until reset.
To me, P wins on the basis of giving the user more control than S.
Designers can even margin test - they can design with 8 COGS and then stress-test at 16, to see what actually happens at lower possible bandwidths.
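For anyone without an FPGA handy, Design P is small enough to model in a few lines of Python (my reading of the description above, not the actual Verilog): TopCog is a priority encode of the 16 CogIsOn flags, and the scan counter reloads to TopCog whenever it reaches 0.

```python
def top_cog(cog_is_on):
    """Design P: priority-encode the highest cog whose CogIsOn flag is set."""
    return max((i for i, on in enumerate(cog_is_on) if on), default=0)

def scan_sequence(cog_is_on, steps=12):
    """Scan counter: counts down each step, reloading to TopCog at 0."""
    seq, scan = [], top_cog(cog_is_on)
    for _ in range(steps):
        seq.append(scan)
        scan = scan - 1 if scan > 0 else top_cog(cog_is_on)
    return seq

flags = [True] * 6 + [False] * 10      # only cogs 0-5 running
print(top_cog(flags))                  # highest active cog
print(scan_sequence(flags, steps=12))  # 1:6 rotation instead of 1:16
```

With all 16 flags set, TopCog is 15 and the rotation is the stock one; with only 6 cogs running, every cog's slot comes round in 6 steps instead of 16, which is the bandwidth/power win being claimed.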
People seem to be mindlessly repeating this, ad nauseam.
OK, what is the Date we are looking at for completing this WITHOUT this suggestion?
And, what is the Date we are looking at for completing this WITH the suggestion?
No one here knows.
I can give a partial answer.
Whilst others were posting, I've coded up a couple of Verilog options for Auto-sense of TopCog (see above for more details). The code is very simple (and now done).
There is no operational speed impact, so this code can slot in to replace the present fixed scan-counter.
Logic cost is a single 16-input priority encoder, and just one is needed, not one per COG.
IF Chip were to come out and say that this feature has a distinct possibility of delaying them making their next shuttle run, I am pretty confident that everyone wanting this feature would voice their opinion to DROP IT for the sake of the project.
Yes of course.
To me this is an issue of a balanced design. We have doubled the pins, MHz, instructions per clock and cogs, but by doubling the cogs the service interval has effectively gone backwards, and this is one thing that would help offset that.
Also, for a lot of practical designs, I imagine there will be a very busy cog 0 and quite a few spare cogs. This mechanism would make it a whole lot easier and more user-friendly to extend than finding a way to parallelize the code and add a layer for shuffling the data around between cogs.
As I asked before, when you demanded that everyone interested in it prove that it won't break X,Y,Z.
With just the basic, and probably most used implementation of Core0 (Primary) using Core8 (Donor), please explain how this breaks anything in Cores 1-7, and 9-15 ?
And again, we're assuming the programmer is NOT an idiot, and loading objects needing hub access into Core8 (Donor).
There is your answer. With the P1 scheme, a COG is a COG. When we promote some COGS and demote others, now a COG isn't a COG. That's changing one of the very basic things that define what a Propeller is.
Considering Chip has ALREADY OPINED that this would most likely be a "not too difficult" (can't recall exact words) or some such effort, I guess people here are better informed than him.
That's been said before. Tasks were one such thing, and they ended up taking a very long time to resolve. There have been others in the prior P2 design that were "easy" but the detail resolution of them took considerable time.
So far as I've seen, and I just went through and read a ton of the earlier discussion, the only thing we've got out of this discussion is discussion. Altering how the HUB behaves never actually happened, and we've a few statements from Chip on how it might look, but that's it.
To be frank, you guys had a supporter for the idea of mooch and perhaps the pairing now that we have more COGS. But... no. Not anymore. Sorry.
**** it.
I'm now against all of it, always. P1 style HUB or nothing.
Actually, it wasn't the processor threading that was hard to resolve, but rather the integration with hubexec that followed. The two together had issues.
The argument against has finally moved to "it may impact the timeline". But remember, those against originally opposed on disguised FUD - warnings about how it would break everything. When they lost that argument, they moved to implementation time.
My slot mechanisms have always come with implementation time/silicon/power caveats.
Implementation ideas were not even able to be discussed because of FUD. A number of ideas have been proposed, some simple, some more complex. This is a discussion that really needs to be continued from months ago, together with some hindsight.
I'm just trying to compute how long a typical potatohead post would be if all of it were that font size!
-Phil
Phil, you also have an important project underway for Parallax and Bueno Systems. Do we have to tell you to stay off the forums a bit so you can get that job done?
Some benefits of slot reassignment (irrespective of implementation):
* Increased hub slots give higher bandwidth
* Increased hub slots reduce latency - faster code
Mooching gives higher bandwidth and/or reduced latency. It can only affect other mooching cogs.
But mooching is likely more complex to implement than some other options.
Mathops would most likely still only work with the original slot.
Cog pairing should be quite simple in its most basic form. It gives 2x hub slots, equally spaced. It should therefore be simple to permit 2x mathops in parallel, based on 1 on the original slot and 1 on the added slot. This increases bandwidth 2x and/or halves latency. Additionally, it should permit 2x mathop power (parallel, staggered by 8 clocks = 4 instructions).
This implementation would be totally deterministic, and could not affect other cogs. Up to 8 pairs could be implemented.
An expansion to this would be the sharing of the pair of slots between both cogs, with an increase in silicon complexity. Both cogs could still be deterministic, but timing could be more difficult to calculate.
While other more flexible slot mapping could be implemented, they would be more complex in silicon.
On this basis, I am inclined to propose the following to minimise complexity...
1. Cog pairs 0-8, 1-9, etc. can share slots
Simple: Cog 0 gets both 0 & 8 slots, Cog 8 gets no slots
Complex: Cog 0 gets both slots as priority, cog 8 can use both slots if Cog 0 does not require them. There are a couple of alternative ways this could work.
2. Mooching (uses any unused slots)
Simple: Only Cog 0 can mooch
Complex: Any cog(s) can mooch. This requires considerable complexity to determine priorities for allocation.
As a simplification...
Cog pairs operating in 1. above cannot mooch
Cog pairs operating in 1. above do not donate unused slots to mooching
Mid-complex: Cog 0 mooches only half the slots, and Cog 8 mooches the other half of the slots.
Of course all these modes must be enabled. Default is as per original - all cogs get 1:16 slots.
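The determinism claim for the Simple pairing mode in 1. above is easy to check with a toy model (my sketch, not proposed silicon): the 1:16 rotation is left untouched, and the only change is that slot 8 is granted to Cog 0 instead of Cog 8, so every other cog's hub window lands on exactly the same clock either way.

```python
def slot_owner(clock, pair_0_8=False):
    """Owner of the hub window at a given clock under the stock 1:16 rotation."""
    slot = clock % 16
    if pair_0_8 and slot == 8:
        return 0                # Cog 0 takes donor Cog 8's slot
    return slot

normal = [slot_owner(t) for t in range(16)]
paired = [slot_owner(t, pair_0_8=True) for t in range(16)]

# Cogs 1-7 and 9-15 see identical window timing in both modes:
unchanged = [c for c in range(16) if c not in (0, 8)]
print(all(normal.index(c) == paired.index(c) for c in unchanged))  # True
```

Cog 0's timing improves (two windows per rotation, 8 clocks apart) and Cog 8 loses hub access entirely; nothing else moves, which is why this mode cannot affect the other cogs.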
There is your answer. With the P1 scheme, a COG is a COG. When we promote some COGS and demote others, now a COG isn't a COG. That's changing one of the very basic things that define what a Propeller is.
That's not an answer.
Where is the Core "Promotion"/"Demotion" coming from?
In the most basic implementation, we, the Programmer, have Core 0 run a program, taking resources (hub access) from Core 9.
Core 9 can do something, just not use any hub access/resources.
You, the Programmer, KNOW this.
If you DON'T, then you didn't read the doc that came with the object, or read the inline code comments.
That potential for failure is in everything, from coding, to walking while texting, etc, etc.
We haven't banned cell phones because people are stupid enough to cross the street while texting.
But you are basically saying in effect, that in this case we should....
And, you still haven't given a PROOF for the following:
"As I asked before, when you demanded that everyone interested in it prove that it won't break X,Y,Z.
With just the basic, and probably most used implementation of Core0 (Primary) using Core8 (Donor), please explain how this breaks anything in Cores 1-7, and 9-15 ?
And again, we're assuming the programmer is NOT an idiot, and loading objects needing hub access into Core8 <9> (Donor)."
That's been said before. Tasks were one such thing, and they ended up taking a very long time to resolve. There have been others in the prior P2 design that were "easy" but the detail resolution of them took considerable time.
That's meaningless, because Chip has already stated this appears to be a semi-trivial task to complete.
So far as I've seen, and I just went through and read a ton of the earlier discussion, the only thing we've got out of this discussion is discussion. Altering how the HUB behaves never actually happened, and we've a few statements from Chip on how it might look, but that's it.
And that was under the old PT Titanic version, right?
There was enough 'other' stuff being discussed, and issues popping up that it likely was overshadowed.
To be frank, you guys had a supporter for the idea of mooch and perhaps the pairing now that we have more COGS. But... no. Not anymore. Sorry.
**** it.
I'm now against all of it, always. P1 style HUB or nothing.
Well, you're certainly allowed your opinion.
However to date, you haven't given anything factual on it being some giant time sink that will imperil the new P16.
You've also not given anything factual on how it will break determinism and ruin the ethos of the Prop.
The other/my side has given more than enough close to factual data on how it will not imperil the P16 from tapeout.
We've also given several as of yet unchallenged implementation methods which show it will not break determinism on other Cores.
Personally, to paraphrase Chip himself, the R&D time is minimal, and since he is even CONSIDERING it, I think that makes it clear Chip does not see it as BREAKING determinism or HIS VISION of the product.
ANONYMOUS has a saying, "We are not your personal Army".
Parallax should have one, like "We are not your personal uController".
Demanding that anyone, including Parallax bend to your personal desire is business suicide.
This is a uController at heart, not a religion.
Maybe take a month off and come back to see where we are and what Chip/Ken's decisions are for the P16.
Yes, I agree. There is no point in raising even modest suggestions because as soon as you do you get the usual "pile on" of various people with the agenda of steering the discussion towards their favored approach - again. So we just keep going over the same old ground.
In fact, I think I'll stop reading this thread, and just wait until the P16X32B actually arrives.
I think discussion is still valuable, it doesn't have to affect the critical path. Discussion tends to precede the better decisions.
I tend to value this forum for its technical discussion and range of views, above hollering.
In some ways it'd be good to get the current beast shipped, so we can get back to the main game (proposing and discussing stuff)
I agree.
Ultimately, what is unknown, is when will this product be ready for final tapeout.
In the other thread, someone opined that it could easily be 6-12 months before we see, optimally, a completed P16.
This sounds right from what I know of the R&D to foundry process.
With some of the simpler implementations suggested, and from Chip's own comments, it would appear as though this is simple compared to a lot of other suggestions and is not expected to have an impact.
Not sure how many times Chip has to pop in and repeat himself to drive this home to some people.
I get the impression people think if they keep pushing and hollerin', they're going to get a chip in their hands in 3-4 months.....
If you break the fixed round-robin hub-access order, what you'll end up with is utter chaos. Different objects, written by different programmers, will have different "requirements", each competing with the others for hub slots. I cannot see any way that order and determinism could ever prevail in such a scenario.
Phil, you also have an important project underway for Parallax and Bueno Systems. Do we have to tell you to stay off the forums a bit so you can get that job done?
My work here is done. Thank you, gentlemen!
-Phil
P.S. Ken, it's started raining again, with small-craft advisories. I'm back on task, I promise!
Personally I think Chip should create a private forum and invite productive people who can actually help him in finishing his project and creating a nice product. At this point, I am certain he knows just exactly which people can help to bring his project to fruition.
You're assuming that Chip needs our "help." I do not believe he does.
I am not assuming that he needs anyone's help; however, I do believe that he likes creative input from knowledgeable people. Chip and jmg had a nice discussion going, where it seemed that jmg had some valid points that Chip acknowledged.
Koehler, being a newbie has nothing to do with anything. No worries on that.
One statement Chip made on this was something on the order of "I suspect this needs to be between consenting objects", and, after a considerable discussion, another favoring mooch as an option due to its passive nature.
Those suggest some of this current discussion may be viable.
Ken put a schedule out that allowed time to get this more conservative design done in a sort of end-of-year timeframe. I seriously doubt anyone is thinking otherwise. There are no four-week expectations that have any relevance.
The core difference of opinion lies along simple lines, with simple and consistent being weighed against more complex, and potentially inconsistent.
I value the former more highly than the latter, and am not alone in that.
In general, I also see the desire for a fast processor with lots of peripherals defined by software as being in conflict with a symmetric, concurrent multi-processor, which the P1 currently is.
I'm withdrawing my support for any of this because I think the more favorable alignment to how P1 users do things is optimal for a lot of reasons, none of which I have any interest in hashing out yet again. When the next design starts, yeah. That is the time to really blow it out and take a few great steps.
The idea that we always know what COG code does no matter what other COGS are doing is a really powerful, simple, effective one, and it's binary. Every scheme put here has cases where that is no longer true, some to an extreme.
If we give up that idea, I also think it should be very fully developed, not some niche case kludge, and I think the best way to do that is in the context of a new design where doing that is a core assumption, not an add on, or exception.
I'm open to that, but not on this design where doing that was not a built in assumption.
You're assuming that Chip needs our "help." I do not believe he does.
-Phil
Chip does a great job, but it's a huge undertaking in breadth and depth. There are at least half a dozen instances I can think of where external help has been clearly beneficial.
You yourself have contributed very important corrections for things Chip has overlooked, that probably wouldn't have been noticed otherwise.
I understand the call to ship something. It can't happen straight away, so the question that needs an articulated answer from the community is what should we expend forum energy on in the mean time that might be beneficial.
Comments
There is a simple Test metric here, which is the Peak Burst Rate possible for each of 8/16/32 bit.
There are many designs that need to burst capture 8/16/32 bits, and the REP opcode gives a zero overhead loop.
That makes the slot-spacing the primary limiter on Peak Burst Rate bandwidth.
What are the Peak Burst Rates for Burst Read / Write for each of 8/16/32 bit writes, & with/without Slot management. ?
Another class of Burst Capture wants to also capture the trigger - I think 2 COGS are needed here.
Get me it, Now!
These issues of parallel processing and shared memory and maximizing performance/through put are as old as computing itself. Gene Amdahl even has a law in his name describing the problem.
There is no "one size fits all" solution.
The good old, brain dead stupid, round-robin approach is quite OK for 99% of anything anyone will want to do with the Propeller. Whilst at the same time making it easy to use.
Even that only reaches 50% of the possible bandwidth, and if Writing TO the HUB, you also need code to build the Quad.
The methods being discussed can always be configured to match the present lower bandwidth (so you lose nothing).
What they add (at minimal silicon cost) is the ability to use less power for the same BW as now, and to allow unused cogs to not represent wasted bandwidth.
??
Uhh, no, thats an entire mischaracterization of what is being proposed. Just because you disagree with it does not mean you are right.
As I asked before, when you demanded that everyone interested in it prove that it won't break X,Y,Z.
With just the basic, and probably most used implementation of Core0 (Primary) using Core8 (Donor), please explain how this breaks anything in Cores 1-7, and 9-15 ?
And again, we're assuming the programmer is NOT an idiot, and loading objects needing hub access into Core8 (Donor).
As someone pointed out as well, with this feature, no one NEEDS to expend the skull sweat to somehow thread this between 2 Cores.
While its great that someone/s have, requiring everyone to do so, and do so by copying & tweaking, or hand-rolling their own, seems rather bassackwards .
I don't really have a horse in this race as a newbie. However the logic seems pretty sound, especially as for all the groaning and FUD, I haven't seen ANYONE actually point out a FAILURE to even the simple request I have up above.
I agree the current capabilities are fantastic.
However, these repeated allegations that 'the sky is falling' and the Prop's raison d'etre is being sacrificed are, in this case, seemingly false.
Until someone actually shows this, not just says so.
THIS. http://en.wikipedia.org/wiki/Hyperbole
This isn't even close to change the core of the Propeller, heck, not even the core of an apple.
You have a Prop project using 9-10 Cores, with 6 sitting there powered down.
You need more bandwidth/lower latency, for some reason.
Without this feature, you may/may not be able to figure out some wacky 2,3,6 multi-processing ball of string.
With this feature, you can get your primary Core the resources it needs and complete it, without affecting any of your other Cores.
Realistically, its then just as fair to say that those few who have managed to herd some cats with a ball of string simply want everyone else to be forced to do it that way, instead of what appears to be an intrinsically simpler way.
People seem to be mindlessly repeating this, ad neaseum.
OK, what is the Date we are looking at for completing this WITHOUT this suggestion?
And, what is the Date we are looking at for completing this WITH the suggestion?
No one here knows.
Chip can probably give a good guestimate, and thats it.
Considering Chip has ALREADY OPINED that this would most likely be a "not to difficult" (can't recall exact words) or some such effort, I guess people here are better informed than him.
Actually, No its basically FUD.
Neither you or any of the others naysayers have shown where it would fail even in its most basic implementation as per above.
Neither you or anyone else has shown the smallest sliver of data that adding this will ACTUALLY affect any sort of tapeout, timeline, etc.
Ergo, these ARE value judgements based on no proven facts whatsoever, but lots of hyperbole, and down right "I want it now" personal desires.
IF Chip were to come out and say that this feature has a distinct possibility of delaying them making their next shuttle run, I am pretty confident that everyone wanting this feature would voice their opinion to DROP IT for the sake of the project.
However, as the Prop always pushed the boundaries, when you hit the Core limit what do you do then.
You have to resort to the ball of string multi-core solution. And even with that, you are just as likely to get diminishing returns.
If Chip were to come out and say this feature is less than significant to implement, and in reality has little bearing on the overall tapeout of the chip; seeing as it is voluntary to use, and the OBEX would quite likely have a seperate section of hub-sharing Objects to avoid stupid newbies like me from making those mistakes, would you still be against it?
If so, why?
(Had to make that sort of a leading question since I know you're the shy, retiring type )
Because this TopCog scan control is very simple to do, I coded up a couple of choices in verilog -> FPGA to give some real numbers.
As expected, neither has any upper speed impact, as a small 4 bit counter, no matter how it reloads, runs >> 200MHz
Design S) used a Compare-at-Write to set the TopCog value.
TopCog starts at 0(RST) and takes the highest-activated-cog number when running.
Design P) takes a Boolean from each COG, which represents CogIsOn
TopCog is then a simple priority encoder on those set of 16 flags.
In both cases, enabling Cog 16, gives exactly the present 1:8 slot assignment.
Both update on scan reload, when Scan=0 reloads to TopCog.
Both use very small amounts of logic, P uses 27 LUT4 and S uses 10 LUT4. (& some of this is path-test logic)
S reports 422.476MHz @ P&R and P reports 389.408MHz
I prefer P, as that follows the TopCog on a live basis - whilst a Compare (S) can only ever increase until reset.
To me, P wins on the basis of giving the user more control than S.
Designers can even margin test - they can design with 8 COGS and then stress-test at 16, to see what actually happens at lower possible bandwidths.
I can give a partial answer
Whilst others were posting, I've coded up a couple of Verilog options for Auto-sense of TopCog (see above for more details). The code is very simple (and now done).
There is no operational speed impact, so this code can slot in to replace the present fixed scan-counter.
Logic cost is a single 16ip priority encoder & just one is needed, not one per COG.
Yes of course.
To me this is an issue of a balanced design. We have doubled the pins, MHz, instructions per clock and cogs, but by doubling the cogs the service interval has effectively gone backwards, and this is one thing that would help offset that.
Also for a lot of practical designs I imagine there will be a very busy cog 0 and quite a few spare cogs. This mechanism would make it a whole lot easier / user friendly to extend, than finding a way to parallel the code and add a layer for shuffling the data around between cogs.
There is your answer. With the P1 scheme, a COG is a COG. When we promote some COGS and demote others, now a COG isn't a COG. That's changing one of the very basic things that define what a Propeller is.
That's been said before. Tasks were one such thing, and they ended up taking a very long time to resolve. There have been others in the prior P2 design that were "easy" but the detail resolution of them took considerable time.
So far as I've seen, and I just went through and read a ton of the earlier discussion, the only thing we've got out of this discussion is discussion. Altering how the HUB behaves never actually happened, and we've a few statements from Chip on how it might look, but that's it.
To be frank, you guys had a supporter for the idea of mooch and perhaps the pairing now that we have more COGS. But... no. Not anymore. Sorry.
**** it.
I'm now against all of it, always. P1 style HUB or nothing.
You can put a nice word there, or a bad word there! All up to you.
I choose "ship"
My slot mechanisms have always come with implementation time/silicon/power caveats.
Implementation ideas were not even able to be discussed because of FUD. A number of ideas have been proposed, some simple, some more complex. This is a discussion that really needs to be continued from months ago, together with some hindsight.
-Phil
Phil, you also have an important project underway for Parallax and Bueno Systems. Do we have to tell you to stay off the forums a bit so you can get that job done?
Ken Gracey
Some benefits of slot reassignment (irrespective of implementation)
* Increased hub slots gives higher bandwidth
* Increased hub slots reduces latency - faster code
Mooching gives higher bandwidth and/or reduced latency. It can only affect other mooching cogs.
But mooching is likely more complex to implement than some other options.
Mathops would most likely still only work with the original slot.
Cog pairing should be quite simple in its most basic form. Gives 2x hub slots, equally spaced. Should therefore be simple to permit 2x mathops in parallel, 1 based on the original slot and 1 on the added slot. Increases bandwidth 2x and/or halves latency. Additionally should permit 2x mathop power (parallel, staggered by 8 clocks = 4 instructions).
This implementation would be totally deterministic, and could not affect other cogs. Up to 8 pairs could be implemented.
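As a rough illustration of why this stays deterministic, here is a hypothetical Python sketch (not silicon) of the slot table: basic pairing just reassigns the donor cog's fixed slot in the 16-slot rotation to its partner (cogs n and n+8), so the paired cog gets two equally spaced slots per rotation and no other cog's slot moves:

```python
def slot_owners(pairs=None):
    """Return the owning cog for each of the 16 hub slots in one
    rotation. Default is plain round-robin (slot i -> cog i).
    With pairing, the donor's slot is granted to its partner."""
    owners = list(range(16))           # P1-style round robin
    for primary, donor in (pairs or []):
        # paired cogs are 8 apart, e.g. 0 & 8, 1 & 9, ...
        assert donor == (primary + 8) % 16
        owners[donor] = primary        # donor gives its slot away
    return owners

# Unpaired: every cog gets exactly 1 slot per 16-slot rotation.
base = slot_owners()

# Cog 0 paired with donor cog 8: cog 0 now owns slots 0 and 8,
# i.e. two slots per rotation spaced 8 clocks apart, halving its
# worst-case hub latency. Cogs 1-7 and 9-15 are untouched.
paired = slot_owners(pairs=[(0, 8)])
```

Because the table is fixed at setup, every cog's hub timing is still known at compile time; only the donor cog knowingly gives up its slot.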
An expansion to this would be the sharing of the pair of slots between both cogs, with an increase in silicon complexity. Both cogs could still be deterministic, but timing could be more difficult to calculate.
While other more flexible slot mapping could be implemented, they would be more complex in silicon.
On this basis, I am inclined to propose the following to minimise complexity...
I think discussion is still valuable, it doesn't have to affect the critical path. Discussion tends to precede the better decisions.
I tend to value this forum for its technical discussion and range of views, above hollering.
In some ways it'd be good to get the current beast shipped, so we can get back to the main game (proposing and discussing stuff)
That's not an answer.
Where is the Core "Promotion"/"Demotion" coming from?
We, the Programmer, are at the most basic implementation, having Core 0 run a program and taking resources (hub access) from Core 9.
Core 9 can do something, just not use any hub access/resource.
You, the Programmer, KNOW this.
If you DON'T, then you didn't read the doc that came with the object, or read the inline code comments.
That potential for failure is in everything, from coding, to walking while texting, etc, etc.
We haven't banned cell phones because people are stupid enough to cross the street while texting.
But you are basically saying in effect, that in this case we should....
And, you still haven't given a PROOF for the following:
"As I asked before, when you demanded that everyone interested in it prove that it won't break X,Y,Z.
With just the basic, and probably most used implementation of Core0 (Primary) using Core8 (Donor), please explain how this breaks anything in Cores 1-7, and 9-15 ?
And again, we're assuming the programmer is NOT an idiot, and loading objects needing hub access into Core8 <9> (Donor)."
That's meaningless, because Chip has already stated this appears to be a semi-trivial task to complete.
Are you somehow better informed than him?
And that was under the old PT Titanic version, right?
There was enough 'other' stuff being discussed, and issues popping up that it likely was overshadowed.
Well, you're certainly allowed your opinion.
However to date, you haven't given anything factual on it being some giant time sink that will imperil the new P16.
You've also not given anything factual on how it will break determinism and ruin the ethos of the Prop.
The other/my side has given more than enough close to factual data on how it will not imperil the P16 from tapeout.
We've also given several as of yet unchallenged implementation methods which show it will not break determinism on other Cores.
Personally, to paraphrase the Chip himself, the R&D time is minimal, and since he is even CONSIDERING it, I think that makes it clear Chip does not see it as BREAKING determinism or HIS VISION of the product.
ANONYMOUS has a saying, "We are not your personal Army".
Parallax should have one, like "We are not your personal uController".
Demanding that anyone, including Parallax bend to your personal desire is business suicide.
This is a uController at heart, not a religion.
Maybe take a month off and come back to see where we are and what Chip/Ken's decisions are for the P16.
Yes, I agree. There is no point in raising even modest suggestions because as soon as you do you get the usual "pile on" of various people with the agenda of steering the discussion towards their favored approach - again. So we just keep going over the same old ground.
In fact, I think I'll stop reading this thread, and just wait until the P16X32B actually arrives.
Ross.
I agree.
Ultimately, what is unknown, is when will this product be ready for final tapeout.
In the other thread, someone opined that it could easily be 6-12 months before we see, optimally, a completed P16.
This sounds right from what I know of the R&D to foundry process.
With some of the simpler implementations suggested, and from Chip's own comments, it would appear as though this is simple compared to a lot of other suggestions, and is not expected to be impacting.
Not sure how many times Chip has to pop in and repeat himself to drive this home to some people.
I get the impression people think if they keep pushing and hollarin', they're going to get a chip in their hands in 3-4 months.....
My work here is done. Thank you, gentlemen!
-Phil
P.S. Ken, it's started raining again, with small-craft advisories. I'm back on task, I promise!
-Phil
I am not assuming that he needs anyones help, however I do believe that he likes creative input from knowledgable people. Chip and jmg had a nice discussion going, where it seemed that jmg had some valid points that Chip acknowledged.
Koehler, being a newbie has nothing to do with anything. No worries on that.
One statement Chip made on this was something on the order of, "I suspect this needs to be between consenting objects", and after a considerable discussion, another favoring mooch as an option, due to its passive nature.
Those suggest some of this current discussion may be viable.
Ken put a schedule out that allowed for time to get this more conservative design in a sort of end-of-year timeframe. I seriously doubt anyone is thinking otherwise. There are no four-week expectations that have any relevance.
The core difference of opinion lies along simple lines, with simple and consistent being weighed against more complex, and potentially inconsistent.
I value the former more highly than the latter, and am not alone in that.
In general, I also see the desire for a fast processor with lots of peripherals defined by software being in conflict with a symmetric, concurrent multi-processor, which the P1 currently is.
I'm withdrawing my support for any of this because I think the more favorable alignment to how P1 users do things is optimal for a lot of reasons, none of which I have any interest in hashing out yet again. When the next design starts, yeah. That is the time to really blow it out and take a few great steps.
The idea that we always know what COG code does no matter what other COGS are doing is a really powerful, simple, effective one, and it's binary. Every scheme put here has cases where that is no longer true, some to an extreme.
If we give up that idea, I also think it should be very fully developed, not some niche case kludge, and I think the best way to do that is in the context of a new design where doing that is a core assumption, not an add on, or exception.
I'm open to that, but not on this design where doing that was not a built in assumption.
Chip does a great job, but it's a huge undertaking in breadth and depth. There are at least half a dozen instances I can think of where external help has been clearly beneficial.
You yourself have contributed very important corrections for things Chip has overlooked, that probably wouldn't have been noticed otherwise.
'Edit' ?
I understand the call to ship something. It can't happen straight away, so the question that needs an articulated answer from the community is what should we expend forum energy on in the mean time that might be beneficial.