"pecial fast access, turbo, low latency, high priority mode (SFATLLHPM) "
A wha?
Uhh, say that 5 times fast!
"What happens when a user picks two drivers out of OBEX that happen to both need SFATLLHPM?"
Of course, there is only so much bandwidth available through the hub, so one way or the other, there's a bottleneck. But at the bare minimum, it could still get its 1/8th share. I just thought it could potentially be a bit faster (but never slower than currently) when a cog has to shuffle around a lot of data (video bitmap swaps).
Actually, I think it could be pretty cool if the mode could be switched on the fly at run time. That way, the majority of code could run deterministically based around hub timing only when it needs precisely-timed data access, then if the program needs to try to move data as quickly as possible, it could switch modes, and so on.
potatohead, isn't there an instance where data access is more limited by hub bandwidth than cog processing power? I see no desire to force two cogs to do something that might be handled by just one.
markaeric: Well, having a software-switchable "turbo" mode may well be cool and is probably technically doable. BUT we have to think of the user here. How would anyone know which of the objects they want to use need "turbo" mode, and when? What if two or three selected objects all use turbo mode, how would they interact? What about the program I'm writing that randomly does weird things when some other object I'm using goes turbo?
It basically brings us back to the situation of having interrupts with high and low priority handling as on many processors. To deal with that is a nightmare when mixing and matching bits of code. It either requires an operating system to prioritize and schedule things for the user, which always makes things slower, or the user has to think very hard about that turbo switch and the needs of all the objects he wants to use.
In one fell swoop such a turbo switch would remove a great deal of the simplicity of programming with the Propeller. Unless you can think of a way to implement it such that you can guarantee that a COG with its hands on the switch can NEVER impact the determinism of other COGs, I'm dead against it.
I think you'll find such an implementation is very hard, probably impossible.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Well, the trade off is not knowing the state of the Propeller.
Right now, the way it is, means sometimes having to apply more than one COG to a COG - HUB throughput problem. Interestingly, there are only a few instructions between HUB access windows, meaning compute is a very real limitation in a large number of cases. Given how Prop II is being built the same way, it's going to hold true at the higher speed, with only the scale of the problem being different. What I'm saying here is that it's often a compute problem, not a COG - HUB throughput problem.
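A rough back-of-the-envelope sketch of that compute limit, in Python. The figures are assumptions for a Propeller I and are not stated in this post: an 80 MHz clock, a hub window every 16 clocks per cog, 4 clocks per ordinary instruction, and at least 8 clocks for a hub instruction that hits its window without waiting.

```python
# Assumed Propeller I figures (not stated in this post):
CLOCK_HZ = 80_000_000      # system clock
HUB_WINDOW_CLOCKS = 16     # one hub window per cog every 16 clocks
INSTR_CLOCKS = 4           # typical non-hub instruction
HUB_OP_MIN_CLOCKS = 8      # hub instruction that hits its window without waiting


def spare_instructions(window=HUB_WINDOW_CLOCKS,
                       hub_op=HUB_OP_MIN_CLOCKS,
                       instr=INSTR_CLOCKS):
    """Ordinary instructions that fit between two back-to-back hub accesses."""
    return (window - hub_op) // instr


def peak_longs_per_second(clock_hz=CLOCK_HZ, window=HUB_WINDOW_CLOCKS):
    """Upper bound on longs one cog can move if it never misses a window."""
    return clock_hz // window


print(spare_instructions())        # 2
print(peak_longs_per_second())     # 5000000
```

With these assumed numbers, a cog that wants every window has room for only two other instructions per long moved, so the loop body rather than the hub is often what sets the pace.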
(IMHO, bandwidth is not the right term, but I could be wrong on that)
However, let's say it's strictly a throughput problem, and that a Propeller's access windows could vary based on number of COGs, or some other thing, like you have proposed. For any one bit of code, that could yield some speed increase. The cost would be rarely being able to use other people's code without significant changes and/or planning being required.
This is an extremely high cost!! (As Heater just wrote shorter than I did)
Three things about Propellers that are proving to be well worth the trouble:
1. Deterministic behavior. The state of a Propeller is always known. Code on a COG will always run the same way no matter what else the chip is doing. There are other nice elements to this, but I'm focusing on this one, as the low cost of code re-use on the Propeller is one of its primary differentiators and gives it an advantage over other designs.
2. Symmetry. A Propeller is a Propeller and a COG is a COG. Again, this goes right back to code re-use cost. If you have a Propeller, you can run your code on it, and you can run your code on any COG as well, with no worries about it being a specific COG, or combination of COGs, and without worrying about other code running on other COGs.
3. Do it in software to the maximum degree possible, exposing only key enabling functions in silicon, leaving the rest to the programmer. Secondly, do so with a minimum of external components. Thirdly, keep those additional components as simple and consistent as possible.
That's the secret sauce right there!
These things are why a Propeller is a Propeller. Where most CPUs operate with interrupts, Propellers use COGs to get the same kinds of things done, and do so with a minimum of code changes being required to re-use code objects.
It's very attractive to consider "tweaks" for speed. In almost every case discussed here, those tweaks would diminish one of the core things above. Where those things are diminished, the Propeller loses its rapid development attributes, and/or hardware simplicity, without gaining enough speed to warrant doing it in the first place.
Where they would not do that, they are going to end up in Propeller II. :)
Go look at the OBEX, and the body of code to be found here in the forum, and on key contributors' web sites.
Let's say you have some complex servo control system running on a few COG's, and you decide you need a display to debug, or to prompt the user, or.... whatever it is. Chances are, objects have been written for that.
You can go download them, or write them, and add them to your project code without having to worry about them disturbing the timing on code you've already written.
Adding a TV or VGA display to a project is as simple as adding a few resistors to the board, and running the display object. Generating that display won't actually disturb what you are doing, and I don't know of any other MCU where that's both possible and practical to do.
A COG is just a COG, and what one COG is doing does not impact the other ones in any way, unless the programmer codes the COGs to interact. Once you start that display, it appears like "hardware" does to your other program code. The display can be scanning the electron beam and reading HUB memory at the same time your code is moving a servo and updating display memory with coordinates.
Almost no, and often NO changes are needed to do these kinds of things on a Propeller, where they are almost always necessary on other MCUs, unless one is using a pre-programmed library where a kernel handles these things within limits. If you want to use something new, not part of that library, you've got to get in and do kernel level programming to make it possible, more often than not.
The Propeller user will, more often than not, just read the object RAM requirements, inputs and outputs, and add it to the project, with few worries.
From there, you can just call the methods exposed by those objects and continue on with what you were doing before, not having to make sure interrupts all match up, or that the new programs don't consume too much time, which messes with your servos, or some other thing.
Losing that is the cost for special case tweaks, and most Propeller users wouldn't want to pay that price.
If there were a turbo mode, it would only apply to the cogs that set it on. Other cogs would NOT have anything changed.
The way it would work is that if a cog was in turbo mode, if another cog was not using the slot, a turbo cog could use that clock cycle for a transfer. So, a speed improvement without penalty to other cogs.
Therefore, suppose we have a hub access every 8 clocks, 1 for each cog (this is what Chip has suggested will be the case - currently it is 1 in 16). Now, if 7 cogs are not requiring hub access and a turbo cog requires 8 accesses, it will get all 8 clock accesses - an 8 times improvement! If one other cog requires access, the priority cog will get 7 clock accesses.
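To make those numbers concrete, here is a minimal Python sketch of the arbitration as described. It is only a model of the proposal, not anything Chip has specified: it assumes each clock is owned by one cog in round-robin order, a busy owner always keeps its slot, and the turbo cog wants the hub on every clock. The function name `turbo_grants` is made up for illustration.

```python
def turbo_grants(busy_cogs, turbo_cog=0, n_cogs=8, rotations=1):
    """Count hub accesses the turbo cog receives.

    Hypothetical arbitration: each clock is owned by one cog in round-robin
    order. A busy owner always keeps its slot; an idle owner's slot goes to
    the turbo cog.
    """
    busy = set(busy_cogs) - {turbo_cog}
    grants = 0
    for clock in range(n_cogs * rotations):
        owner = clock % n_cogs
        if owner not in busy:
            grants += 1
    return grants


print(turbo_grants(busy_cogs=[]))        # 8: all seven other cogs idle
print(turbo_grants(busy_cogs=[3]))       # 7: one other cog uses its slot
```

The floor is always the cog's own slot: with all seven other cogs busy the same function returns 1, which is the normal 1-in-8 share.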
What is the downside...
Determinism is lost. Yes, but remember you do not have to use it!!!
Runs faster. Yes, nice!!!
Symmetry. So what!!! All cogs are the same. We don't always use counters.
Obex compatibility. What's this???
What is the upside..
Can get a cog to have priority hub access, so it runs faster
Usually one main cog doing a lot of processing that requires this faster hub access
No detriment to the other cogs
More than one cog can be turbo and the spare slots would be shared on a first come first served basis
Now, I know the PropII will have twice the hub access (1 in 8). Also, it will have quad-long access per clock and possibly block moves or at least auto-incrementing. However, this may be too late anyway, and I would not like to delay Prop II any further.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔ Links to other interesting threads:
Cluso, no, no, we are not missing the point at all.
It is very clear that if a COG does not need its HUB access window then the slot could be used by some other COG for a potential speed gain. This is a very simple and tempting idea.
What we ARE doing is making a different value judgment about the pros and cons of doing this.
As I tried to say, and Potatohead I think said better, as soon as you do this you have destroyed timing determinism and thereby blown the simplicity of using the Propeller away. We are just making the judgment that losing determinism and the resulting simplicity in the general case is not worth the possible speed gains of a few special cases.
By the way, it's not clear to me that the COG that gets the "turbo" slots benefits much anyway. If the raw speed of that COG depends on having free slots then what happens when it finds itself running in an application that has few free slots? It never had any timing determinism to start with and now it does not have so much speed gain either.
Result: It's impossible to make any rigorous statements about the performance of that turbo'ed COG. To determine if it meets the requirements of any given application one has to understand what all the other COGs in the app are doing. Then one has to risk the whole house of cards failing when some new feature is added to the app that requires a new COG or otherwise eats HUB slots.
Our opinion is that turboing COGs as described is a bad trade off of general predictability vs marginal speed gains in some cases.
Actually I think I just noticed that the flaw in your argument, Cluso, starts at the beginning with the initial assumption:
"If there were a turbo mode, it would only apply to the cogs that set it on."
On the face of it that seems to be obviously true. I suggest it is not. At least not directly...
Let's say a turbo-enabled COG driver out of OBEX is advertised as providing some service at a peak speed of 20MBytes/second (whatever that service might be). So I build it into an application that requires that service at that speed. Everything goes fine because the rest of my app leaves enough free slots for that turbo COG to run fast.
Now I add some new feature that eats HUB slots, perhaps only occasionally. BOOM, my application now has random failures occurring as the turbo COG is starved of slots occasionally. Perhaps I only discover this after the system has been out in the field for months and does something weird due to a rare external stimulus at the wrong moment.
So whilst "turbo mode" is only applied to one COG, its effects are applied to the whole application and even to the whole system that the Propeller is operating in.
Not good.
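Here is that failure scenario as a small Python sketch. Everything in it is hypothetical: the requirement of 5 hub accesses per 8-clock rotation is invented for illustration, and the model assumes the other cogs use every one of their own slots.

```python
def grants_per_rotation(other_busy_cogs, n_cogs=8):
    """Slots per rotation left for one turbo cog: its own plus every idle one."""
    return n_cogs - len(set(other_busy_cogs))


def meets_spec(required_grants, other_busy_cogs):
    """True if the turbo driver still gets the hub accesses it was advertised with."""
    return grants_per_rotation(other_busy_cogs) >= required_grants


REQUIRED = 5                      # hypothetical: driver needs 5 accesses per rotation

app_v1 = {1, 2, 3}                # three other cogs use their hub slots
app_v2 = app_v1 | {4}             # a new feature adds one more hub user

print(meets_spec(REQUIRED, app_v1))   # True
print(meets_spec(REQUIRED, app_v2))   # False
```

Nothing in the turbo driver changed between the two versions. Only the rest of the application did, and that is enough to break it.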
heater: You are trying to run with the lowest common denominator. All designers must understand the product at hand, and use features to their benefit. Simply not adding something because others may have trouble using it is not a reason to not provide it.
Anyway the point is moot as I think Chip has decided long ago.
PS How did this thread get revived again?
The hub (probably?) would have an 8 queue buffer, which cogs could send requests to (a cog can only send 1 at a time, obviously). The cogs not operating in "turbo" mode would automatically get added to that queue on every n/8 cycles as usual, so code can still be written completely deterministically. So again, if a cog is NOT run in turbo mode, it would NOT be affected in any way by anything any other cog is doing.
Heater/potatohead, you're completely correct that "turbo" can't guarantee hub access more than 1/8 sysClock, but it's certainly possible. This of course should be a consideration to the programmer. Not all applications would benefit from it, but I think that some could.
I also think that in reality, most major applications will probably only use a limited number of unmodified objects from the OBEX, as they are mostly designed to run in their own cog, which as many here know, can easily deplete all available cogs. Concessions would have to be made, and objects that normally run in their own cog might need to be combined, as people do now.
I have a dream - a dream of a Propeller X with access to hub memory every cycle, and interrupts that can quickly launch a cog to run an ISR routine, and then shut down again, waiting to run another ISR. A form of LMM might be the closest thing to accomplishing this - perhaps there could be in-cog ROM with such a program.
Not sure why or how this thread woke up again but this is a great debate so I'd like to continue.
It might look like I'm "run[ning] with the lowest common denominator" but that is not my motivation at all.
Time for a story...
I used to do a lot of work in the avionics industry. Sometimes writing but mostly testing all kinds of systems from Rolls-Royce engine management to Boeing 777 Primary Flight Computers. We had development methods and tools and even languages designed to ensure correct behavior of those systems. One language would even spit out a report after compiling a module of exactly how much of its allotted time slot it took to execute. For a whole application of many modules it would report on the time usage of every part and you could be sure that those 10ms or 100ms execution slots were NEVER going to be exceeded. Life was good. One could mix and match modules without worrying about them tripping each other up. If there was a possibility of that happening the compiler would tell you in advance.
All hell broke loose on one project where they had decided to use Ada instead. All of a sudden no one had any idea how long anything took to execute. In one of the final builds I discovered that the code was quite often consuming 95% of its allotted time. No one could make a concrete statement that it would NEVER exceed 100% thereby causing failure. Not to mention the design requirement was for only 50% CPU usage!
The result of all this is that I'm very much attached to the Propeller's timing determinism. As far as I know there are only two generally available systems that make the timing guarantees in the face of multiple tasks that the Propeller does, the other one being, dare I say it, XMOS. They are about to release a compiler that also does timing analysis. So I don't want to see this feature being given up easily.
In general purpose CPU's all is aimed at maximizing throughput in the general case. So they grow pipelines which change your execution speed depending on which way your jumps go. They grow caches which changes your execution speed depending on where your data might be in memory. They implement prioritized interrupts which change your execution speed depending on whatever else is going on in the system. Then to make all this manageable they put a general purpose OS on top at which point all bets are off as you now have no idea when or even if you are ever going to be scheduled.
Great, overall throughput is optimized, but now, for example, my Linux sound system splutters and coughs. A few hardy developers have to spend years getting "real time" stuff to sort of function.
Bottom line is that if I was happy with a general purpose and unpredictable processor I could use many others besides the Prop. The Prop has this "unique selling point", as they say, that it should not give up.
markaeric, "So again, if a cog is NOT run in turbo mode, it would NOT be affected in any way by anything any other cog is doing"
This is simply NOT true, as I said before. If my non-turbo COG or COGs are actually my application, and my application requires the turbo COG to be able to operate as fast as possible, then it follows that my application COGs MUST leave enough HUB slots free. If not, the turbo COG slows down and the application fails.
In summary the turbo-COG now dictates what can be in my application if I need it to zoom along.
In the general case I can't be sure that the turbo-COG can deliver its turbo performance, as it depends on my app. Therefore I have to assume it has a lesser, normal, performance, and therefore there is no point in turbo-mode.
Currently the use of an object only requires being aware of how many COGs it uses, its memory requirements, its pin usage and possibly its clock requirements. All of which are inescapable (unless you totally separate COGs into separate CPUs). Adding this indeterminate timing problem into the mix makes things much harder to manage.
With this "turbo" feature, none of the current functionality is gone. It's just that hub memory intensive tasks could potentially be sped up under the correct conditions. The beauty is that some cogs could use it if they wanted to, while others didn't. Let's use the graphics.spin driver as an example: the 'copy' and 'clear' functions might benefit from having access to unused hub windows, potentially speeding up those functions. But as you suggest, since there is no guarantee of bandwidth to cogs in "turbo" mode, there are ways it could pose a problem. But that's not to say that you *can't* program it completely deterministically. If you program four cogs to only need hub access once every 8 clocks, you could know for certain that one cog in turbo mode would be guaranteed hub access 4 out of every 8 clocks (and they could be all consecutive if it was the last cog launched!).
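A small Python sketch of that last case, under stated assumptions: 8 slots per rotation, four cogs that each use their own fixed slot every time, three cogs that never touch the hub, and one turbo cog that takes every slot the four fixed cogs leave. The helper names are invented for illustration.

```python
def turbo_slot_pattern(fixed_cogs, turbo_cog, n_cogs=8):
    """Clock indices within one rotation at which the turbo cog gets the hub."""
    fixed = set(fixed_cogs) - {turbo_cog}
    return [slot for slot in range(n_cogs) if slot not in fixed]


def longest_wait(pattern, n_cogs=8):
    """Largest gap, in clocks, between consecutive grants (wrapping around)."""
    gaps = [(b - a) % n_cogs or n_cogs
            for a, b in zip(pattern, pattern[1:] + pattern[:1])]
    return max(gaps)


pattern = turbo_slot_pattern(fixed_cogs=[0, 1, 2, 3], turbo_cog=7)
print(pattern)                 # [4, 5, 6, 7]
print(longest_wait(pattern))   # 5
```

The pattern is fixed only as long as the other four cogs really do use every one of their slots. If the fixed cogs sat in slots 0, 2, 4 and 6 instead, the turbo cog would still get 4 of 8 but spread out, with a longest wait of 2 clocks rather than 5.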
That's the thing, as a user of objects created by others I want them to work as advertised no matter what stupid stuff I get up to in my app :)
I don't want to have to analyze every piece of code for its timing dependencies.
This ease of re-usability is key to the Propeller, which relies on software to create peripherals that would otherwise be implemented in silicon in regular MCUs.
Clearly there are cases where a turbo would work and get you some performance gain and perhaps make things possible that were not possible otherwise.
I'm just leaning towards wanting predictability all the time rather than the speed gain occasionally.
Fair enough. Objects available to the public should never use "turbo" mode. The only ones that should are ones written "from scratch", custom to the particular application. We could get this all sorted by Prop 3.
I can't wait for this device. I've got a product planned that will take great advantage of its IO prowess! Before I learned of all the upcoming features of the P2, I was looking at Cypress's PSoCs - but they're overly complex, and I have heard there are supposedly various bugs.
I've noticed that the Prop has features which one may not value highly on a first flip through its spec sheet. Features that are not missed until one is trying to do something easy on some other chip and it turns out to be harder than it should be.
This timing issue is such a feature in my mind. It has subtle consequences that are not appreciated at first and can have a huge impact on your end results.
Another is the regularity of the I/O pins. Pretty much all pins are equal on the Prop. Apart from a few necessary service pins. This gets taken for granted until you run up against a chip where this is not so. All of a sudden you are finding that there are pins dedicated to certain functions, or pins with different current drive capabilities, or pins that can only be set as input or output together in banks of 4 or 8 or 16, or pins that can be driven from one CPU core on the chip but not others. Baahhh ! I've got all these pins but I can't use them how I want to, give me my Prop back!
I'm sure I have found many other such Prop features, subtle enough that I can't remember them just now... :)
The sad part about all this is that those who buy into an MCU because of its sales "bullet points":
32 I/O pins - check.
xxxMIPs - check
C compiler - check
1000 levels of nested prioritized interrupts - check
Other useless junk - check
tend to skip over the Propeller and consequently miss out on what a wonderfully refreshing architecture it is.
There are a coupla other features worth mentioning here.
There are no bugs in the chip. A propeller does exactly what it is asked to do every time.
The HUB-COG interaction as it is right now allows a user to balance throughput / compute * COGs however they see fit to do, and all users' code will operate with all other users' code, given the sum of the code to be used does not exceed the overall raw capability of the Propeller.
To express Heater's point another way, let's say several COGs are used to get the throughput needed. Doesn't that consume COGs, just like a "turbo" COG would consume COGs, in that the others would have to be limited in what they do to permit the "turbo" one to run?
Sure it does.
So then, just use more COG's, the end product being the same, without the ugly costs associated with kludges in silicon!
The very real effect is still the loss of COG's no matter how one looks at it. There is no free lunch in these things.
Special case tweaks really are kludges. Here's the thing about kludges. They scale just like the good stuff does, meaning the size of the problems that come with kludges scales as well. Where one person would be required to resolve them, scale up the chip, and now a team is required to resolve them.
The net productivity of the chip is diminished, while arguably the overall capability isn't increased in like kind!
That's why the cost is so high!
Propeller users get this because they've been programming and building with Props for a while. Users of other CPUs don't get this, as they've never, ever been exposed to it.
If you look at MCUs where they do these kinds of things, kludges, they end up with lots of chips in a series a, b, c, whatever... and people become selectors instead of builders. Flipping through the catalog means finding that set of kludges that matches your special case, but for this one little difference... if that difference gets attention, another chip gets added to the catalog.
What occurs then is madness over time as some chips in the series continue to get made, others don't, and everybody everywhere has this massive set of things to sort out on every project and worse, over the life cycle of the project, where sometimes things get revised, and or re-kludged, and there is your team effort right there.
For the entire life of the Propeller I, there will be just the Propeller I.
Either your scope of work fits into the chip, or it does not. If it does, then you have no worries for a very, very long time. If it doesn't, the fact that the chip is largely software powered means you can apply grey matter to the problem and potentially still get the chip to perform. If so, great! It will then always perform.
...and unlike most other designs, all other users can benefit from your design effort, without significantly impacting their own.
This is powerful stuff, and well worth the bump in cost for the Propeller, particularly for small to mid sized runs.
There is another element here as well.
When people are spending their time as selectors, they need this chip and two of that one, and another one over here, and damn! That last one reached end of life last year, so we will just have to KLUDGE around this other one because we've got lots of them...
You get the point, I am sure.
So far, we have seen the Propeller chip exceed many boundaries, some of which surprised the designer of it! As we all apply our thoughts to the chip, innovation occurs. Because all the code works on all the chips, and because the interaction between code bodies is the minimum possible, we all benefit from that innovation to the maximum extent possible.
Code can be shared freely, meaning skills acquired with it can be shared freely as well. If one's skill is locked into very specific selections of things, that cannot be shared freely, resulting in less overall innovation value for all users of that series of chips.
These are not insignificant things!
It is tempting to modify Propellers to get "our" case to work, but each modification has a cost. There are production costs, debug costs, inventory costs, code re-use costs, and innovation costs, as detailed above. For each member of a series, these costs multiply into the mess we have today.
Propeller II will behave the same way. There will be a Propeller, and it will operate as Prop I did + the innovations - kludges, to scale up the design to enable a greater scope of tasks.
Again, either that task will fit, or it will not, or one can use software to make it fit. All others, where the code efforts are shared, benefit right away, and can build on that to max out the chip and get full use value from it in their designs.
If Prop I were kludged like you say, then what happens when Prop II addresses some of this? What will happen is a kludge of a kludge because the code re-use value is such an attractive thing, that it will be tempting to just keep kludging for this case and that, and pretty soon the chip runs hot, is buggy, has multiple variations, and it's just not a Propeller anymore.
If you want that kind of speed, then use a few COGs to do it. That's the right way to break the problem down. Most of us here continue to learn and improve on doing that, and it's pretty amazing what can be done as soon as you start thinking in parallel, and leverage the deterministic timing attributes.
potatohead: This would not be a kludge. It would be to use bandwidth which is otherwise not being used. And it IS a current bottleneck. It would take silicon. How much? I don't know as I don't do these things. Is it worth it? Don't know, but from a potential user's point of view, ABSOLUTELY for me.
heater: I am sorry, I don't buy your argument against. It is innovation. It does not have to be used. The user gets to decide. It is just the same as the multithreading that Chip intends to add. Currently, I don't think there is any use for me, but who knows - I don't have a crystal ball.
BTW, what if you only have 4 cogs running? Those 4 hub cycles are wasted. You cannot just throw another cog at the problem.
Chip has largely addressed the issues of hub-cog speed, but I think the PropII will go into new areas never thought of before, and it is likely to become a bottleneck again.
Anyway, FWIW I predict the PropII will suffer from a shortage of cogs because everything else seems to be so fantastic we will just push the envelope so much further. I would rather pay a little more for 16 cogs. Anyway, there will always be shortcomings.
I'm going to have to balk on not calling that change a kludge.
If symmetry and determinism are broken, then we don't have a Propeller. We have something that runs the same instructions as a Propeller does, but we don't have the set of assumptions in play surrounding those instructions. It does not act like a Propeller.
The reason why things are easy, consistent and productive is because we have symmetry and determinism.
There isn't a free lunch on this. Truth is, the silicon has a max throughput and compute. Those things don't change. How it is presented to us can change, and that presentation impacts the overall cost of computing.
It's hard for me to characterize something as "innovation" when the net cost is higher after having done it.
Right now we have any code, any cog.
What that means is I can ask a question about how something is done, get the answer, know it's absolutely going to work as advertised, without having to either abstract the question to avoid sharing all my project details, or overwhelm others and myself with those details.
That goes away, if we break determinism and symmetry.
Now, here's the other case I have for that being a kludge.
Has anyone demonstrated that we would get more out of the Propeller chip given that scenario? I don't think anyone can, because the performance one gets from a Propeller is a function of how well thought out the problem case is. Seems to me, using several COGs is no different than only using one COG, but doing so in a way that denies the other COGs from running as they otherwise could be.
Boil all that down, and what is being asked for is to make thinking through a problem case easier, while raising the cost of using solutions already found and published!
That all feels like one of those "well, let's just do this now, and we can come back and fix it later" things. Classic kludge!
There is always time to do it right the first time, so why not simply do that then? For this problem case, the answer is to use more COGs, and if that means saturating the chip, then it means the problem case exceeds the Propeller overall, or multiple Propellers are needed, or Propeller + some other thing is needed, but it doesn't mean breaking the Propeller.
Put really simply, unless a net gain can be shown that exceeds that possible right now, using the cogs in tandem, making a "turbo" COG option is a kludge.
Kludges scale too, like I wrote before. There will be code changes for Prop II. Those are needed to scale out the chip, and that's fine. However, the fundamental things are not changing, because they were done right the first time. In other words, not kludged.
Some realizations have occurred that show us where efficiencies can be had, without kludging, and we will see the product of those in Prop II.
No matter what scale the Propeller is at, there will always be this dilemma. Given that there are plenty of other MCU designs that do not have the qualities a Propeller does, I submit endangering those qualities with a kludge, without also demonstrating a net increase in throughput and compute, not otherwise attainable, doesn't make sense.
Edit: So there is the code reuse and authoring cost being low right now. That's real.
There is also the silicon cost. If the simple round-robin state machine is broken, a larger, hotter, more buggy one will have to replace it. My guess, given that Chip tends to like to release finished chips sans bugs, is the cost would be heat and time to market. The product will have more test cases to vet, meaning we don't see Prop II sooner, and when we do see it, it will run hotter, and cost more to code for.
I basically agree with Cluso99 in that his "turbo" mode would provide a simple solution where you need higher cog <-> hub throughput and are willing to dedicate two or more cogs to the solution and where it would be substantially more difficult to solve the problem using the 2nd cog with code running in it. There's no real cog to cog communications and it's not always straightforward to use two cogs to nearly double the throughput of a single cog.
I understand what Cluso99 wants, and I was a proponent of it a while back, but I think the problem was largely solved. I believe what was decided was to transfer up to 8 longs per hub access (though I don't know how to confirm this without reading another 30-page thread).
There still may be higher needs than that, but I think mike hit on a point: what I would like to see is some sort of cog2cog communication within the chip. Of course, it can be done by using some of the IO lines as a data bus, but an inter-cog communication grid would be better and not waste those precious pins.
I would rather have a completely deterministic chip as heater and potatohead pointed out...but I would rather have it SOON. If the question was posed about the port B prop1 with the PROP2 being this far out, I think more people would have voted for the Port B prop.....but I digress ;^)
potatohead: now that you are using another avatar, would you mind if I used the SGI bug?
I have to side with heater and potatohead on this one, I'm afraid. While we all agree that "turbo mode" would affect determinism only for the cogs that use it, the more sinister downside is its effect on object/cog compartmentalization. An object should always behave the same, independently of what "mode" other cogs (whose code someone else may have written) are operating in. Anyone who writes an object needs to know that its performance won't be compromised by hidden side effects. Therefore, one might say, "So don't use turbo mode." And I would say, "Okay, but then no one would use turbo mode. So why have it at all?"
Of course, as was pointed out, this is merely a hypothetical argument, the hub timing having been decided long ago.
While Chip has said that each hub window will be several Longs wide (ok, that sounded weird), I believe it's only one address, so all the longs would therefore have to be consecutive in the rd/wr?
Oh wow.....a prop II discussion I can participate in (most are over my head).
I have a question for this hypothetical discussion on a "turbo" mode.
How would this actually work?
This is my take on the subject......if the regular code takes four cogs, and the "turbo" code takes 4 cogs, there would be little to no benefit. Here is why (as I see it).
If you split the regular code into reads of 0-2-4-6 then the timing of that code takes a hit.
If the turbo code does a sequential use of the hub for 4-5-6-7 then it has to wait (round robin style) for its turn again. Where have you helped? Sure, while it has the hub, it's screaming, only to be put on hold for 4 hub ticks (was going to say cycles...but that doesn't fit).
Then the "turbo" code is playing catch up. I don't see where it would be balanced. It almost seems like interrupt action if you ask me.
Now inter-cog communication, I can see that would help a lot.
Just my view.....but I'm blind in one eye, and can't see out the other.
As the basis of the mode was described in the following:
"The hub (probably?) would have an 8 queue buffer, which cogs could send requests to (a cog can only send 1 at a time, obviously). The cogs not operating in "turbo" mode would automatically get added to that queue on every 1/8 clock cycles as usual [edit: in their specific order cog 0-7], so code can still be written completely deterministically. So again, if a cog is NOT run in turbo mode, it would NOT be affected in any way by anything any other cog is doing."
As you mentioned, the programmer needs to take into account some basic ground rules, or else you get either no additional benefit, or even undesirable operation as has been pointed out. So let's go over the performance we're going to see in the next gen prop based on what we know:
First off, according to the current prop2 specs, each cog will have a bandwidth of 32-bits * 4 every sysClock/8, for a total of 240MB/sec @120MHz (Wait, am I doing something wrong?! That looks pretty nice!). Multiply that by the number of cogs (8), and you get total hub bandwidth of 1.92GB(!) per sec.
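For anyone who wants to double-check that arithmetic, here is a rough sketch in Python (not Propeller code). It only uses the figures quoted above: 120MHz, one hub window per cog every 8 clocks, and 4 longs per window. The constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the hub bandwidth figures quoted in the post.
SYS_CLOCK_HZ = 120_000_000   # 120 MHz, as quoted
NUM_COGS = 8
CLOCKS_PER_WINDOW = 8        # one hub window per cog every 8 clocks
LONGS_PER_WINDOW = 4         # quad-long transfer per window
BYTES_PER_LONG = 4           # a long is 32 bits


def per_cog_bandwidth(sys_clock_hz=SYS_CLOCK_HZ):
    """Bytes per second one cog can move in plain round-robin mode."""
    windows_per_second = sys_clock_hz // CLOCKS_PER_WINDOW
    return windows_per_second * LONGS_PER_WINDOW * BYTES_PER_LONG


def total_hub_bandwidth(sys_clock_hz=SYS_CLOCK_HZ):
    """Bytes per second across all cogs combined."""
    return per_cog_bandwidth(sys_clock_hz) * NUM_COGS


print(per_cog_bandwidth() / 1e6, "MB/s per cog")
print(total_hub_bandwidth() / 1e9, "GB/s total")
```

So the 240MB/sec per cog and 1.92GB/sec total do follow from those assumptions, provided all four longs really move in a single window.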
With that said, any given cog could never have more than the maximum hub bandwidth, and in reality would usually have significantly less. It would also never have more or less than 240MB/sec in non-turbo. Now, the number of cogs running in "non-turbo", and the order in which they're running (cog #), will dictate how the turbo access window looks. Since non-turbo cogs get placed in their natural order on the 8-queue request buffer, the window could look funky, but without necessarily creating any problems.
The primary goal is to carefully give a particular cog larger window access to hub ram instead of having to assign additional cogs to perform part of the same task when greater hub bandwidth is required. Here's an example of where it could be useful: Dedicating a cog as an external RAM controller (we got the pins!). Data can be transferred very quickly by requesting most (if not all) cogs to go in turbo mode, and not generate requests to the queue. This would allow the memory controller cog full access to hub ram for transfer. In practice, this particular scenario can't occur because the pins couldn't keep up without the use of a huge external bus. Though with ~1/4 of hub access, max transfer could probably occur.
I wonder if the four-longs-per-transfer is made possible by the four-port cog RAM? If so, it seems reasonable that all four longs could be written in the same cycle. 128 bits is a really wide hub/cog data path, though.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
8x8 color 80 Column NTSC Text Object
Safety Tip: Life is as good as YOU think it is!
It basically brings us back to the situation of having interrupts with high and low priority handling as on many processors. To deal with that is a nightmare when mixing and matching bits of code. It either requires an operating system to prioritize and schedule things for the user, which always makes things slower, or the user has to think very hard about that turbo switch and the needs of all the objects he wants to use.
In one fell swoop such a turbo switch would remove a great deal of the simplicity of programming with the Propeller. Unless you can think of a way to implement it such that you can guarantee that a COG with its hands on the switch can NEVER impact the determinism of other COGs, I'm dead against it.
I think you'll find such an implementation is very hard, probably impossible.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
For me, the past is not over yet.
Right now, the way it is, means sometimes having to apply more than one COG to a COG - HUB throughput problem. Interestingly, there are only a few instructions between HUB access windows, meaning compute is a very real limitation in a large number of cases. Given how Prop II is being built the same way, it's going to hold true at the higher speed, with only the scale of the problem being different. What I'm saying here is that it's often a compute problem, not a COG - HUB throughput problem.
(IMHO, bandwidth is not the right term, but I could be wrong on that)
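To put a rough number on "only a few instructions between HUB access windows", here is a small Python sketch for the current Propeller. The 16-clock window is the Prop I figure; the 4-clock ordinary instruction and the 8-clock best-case hub instruction are assumptions taken from memory of the Prop I documentation, so treat the result as an estimate. The names are made up for illustration.

```python
# Rough count of ordinary instructions that fit between back-to-back hub accesses.
HUB_WINDOW_CLOCKS = 16   # Prop I: each cog gets one hub window every 16 clocks
INSTR_CLOCKS = 4         # assumption: typical non-hub instruction
HUB_OP_MIN_CLOCKS = 8    # assumption: hub instruction that hits its window exactly


def free_instructions_per_window(window=HUB_WINDOW_CLOCKS,
                                 hub_op=HUB_OP_MIN_CLOCKS,
                                 instr=INSTR_CLOCKS):
    """Ordinary instructions that fit between two consecutive hub operations
    without missing the next window."""
    return (window - hub_op) // instr


print(free_instructions_per_window())
```

Under those assumptions a cog that wants every one of its windows has room for about two other instructions per access, which is why the limit is so often compute rather than hub throughput.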
However, let's say it's strictly a throughput problem, and that the Propeller's access windows could vary based on the number of COGs, or some other thing, like you have proposed. For any one bit of code, that could yield some speed increase. The cost would be rarely being able to use other people's code without significant changes and/or planning being required.
This is an extremely high cost!! (As Heater just wrote shorter than I did)
Three things about Propellers that are proving to be well worth the trouble:
1. Deterministic behavior. The state of a Propeller is always known. Code on a COG will always run the same way no matter what else the chip is doing. There are other nice elements to this, but I'm focusing on this one as the code re-use cost being low on Propeller is one of its primary differentiators that give it an advantage over other designs.
2. Symmetry. A Propeller is a Propeller and a COG is a COG. Again, this goes right back to code re-use cost. If you have a Propeller, you can run your code on it, and you can run your code on any COG as well, with no worries about it being a specific COG, or combination of COGs, and without worrying about other code running on other COGs.
3. Do it in software to the maximum degree possible, exposing only key enabling functions in silicon, leaving the rest to the programmer. Secondly, do so with a minimum of external components. Thirdly, keep those additional components as simple and consistent as possible.
That's the secret sauce right there!
These things are why a Propeller is a Propeller. Where most CPUs operate with interrupts, Propellers use COGs to get the same kinds of things done, and do so with a minimum of code changes being required to re-use code objects.
It's very attractive to consider "tweaks" for speed. In almost every case discussed here, those tweaks would diminish one of the core things above. Where those things are diminished, the Propeller loses its rapid development attributes, and/or hardware simplicity, without gaining enough speed to warrant doing it in the first place.
Where they would not do that, they are going to end up in Propeller II. :)
Go look at the OBEX, and the body of code to be found here in the forum, and on key contributors' web sites.
Let's say you have some complex servo control system running on a few COGs, and you decide you need a display to debug, or to prompt the user, or.... whatever it is. Chances are, objects have been written for that.
You can go download them, or write them, and add them to your project code without having to worry about them disturbing the timing on code you've already written.
Adding a TV or VGA display to a project is as simple as adding a few resistors to the board, and running the display object. Generating that display won't actually disturb what you are doing, and I don't know of any other MCU where that's both possible and practical to do.
A COG is just a COG, and what one COG is doing does not impact the other ones in any way, unless the programmer codes the COGs to interact. Once you start that display, it appears like "hardware" does to your other program code. The display can be scanning the electron beam and reading HUB memory at the same time your code is moving a servo and updating display memory with coordinates.
Almost no, and often NO changes are needed to do these kinds of things on a Propeller, where they are almost always necessary on other MCUs, unless one is using a pre-programmed library where a kernel handles these things within limits. If you want to use something new, not part of that library, you've got to get in and do kernel level programming to make it possible, more often than not.
The Propeller user will, more often than not, just read the object RAM requirements, inputs and outputs, and add it to the project, with few worries.
From there, you can just call the methods exposed by those objects and continue on with what you were doing before, not having to make sure interrupts all match up, or that the new programs don't consume too much time, which messes with your servos, or some other thing.
Losing that is the cost for special case tweaks, and most Propeller users wouldn't want to pay that price.
Post Edited (potatohead) : 12/28/2009 8:25:23 AM GMT
If there were a turbo mode, it would only apply to the cogs that set it on. Other cogs would NOT have anything changed.
The way it would work is that if a cog was in turbo mode, if another cog was not using the slot, a turbo cog could use that clock cycle for a transfer. So, a speed improvement without penalty to other cogs.
Therefore, suppose we have a hub access every 8 clocks, 1 for each cog (this is what Chip has suggested will be the case - currently it is 1 in 16). Now, if 7 cogs are not requiring hub access and a turbo cog requires 8 accesses, it will get all 8 clock accesses - an 8 times improvement! If one other cog requires access, the priority cog will get 7 clock accesses.
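Here is a toy model of that slot-borrowing rule in Python, just to make the counting concrete. It is only a sketch of the proposal as described in this post, not anything Chip has specified: each of the 8 clocks in a rotation belongs to one cog, and a single turbo cog picks up any slot whose owner is idle. The function name and parameters are invented for illustration.

```python
NUM_COGS = 8


def accesses_per_rotation(turbo_cog, busy_cogs):
    """Count hub accesses per cog over one 8-clock rotation.

    turbo_cog: index of the single cog running in the hypothetical turbo mode.
    busy_cogs: set of non-turbo cogs that use their own slot this rotation.
    The turbo cog always uses its own slot and also takes every idle slot.
    """
    counts = [0] * NUM_COGS
    for slot in range(NUM_COGS):
        owner = slot                      # slot N belongs to cog N
        if owner == turbo_cog or owner in busy_cogs:
            counts[owner] += 1            # the owner keeps its guaranteed slot
        else:
            counts[turbo_cog] += 1        # idle slot goes to the turbo cog
    return counts


print(accesses_per_rotation(0, set()))        # nobody else wants the hub
print(accesses_per_rotation(0, {3}))          # one other cog uses its slot
print(accesses_per_rotation(0, {1, 2, 3, 4, 5, 6, 7}))  # everyone is busy
```

In this model the non-turbo cogs always keep their one slot per rotation, while the turbo cog's share swings anywhere from 8 down to 1 depending on what the others are doing.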
What is the downside...
What is the upside..
Now, I know the PropII will have twice the hub access (1 in 8). Also, it will have quad-long access per clock and possibly block moves or at least auto-incrementing. However, this may be too late anyway, and I would not like to delay Prop II any further.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Links to other interesting threads:
· Home of the MultiBladeProps: TriBlade, RamBlade, SixBlade, website
· Single Board Computer: 3 Propeller ICs and a TriBladeProp board (ZiCog Z80 Emulator)
· Prop Tools under Development or Completed (Index)
· Emulators: CPUs Z80 etc; Micros Altair etc; Terminals VT100 etc; (Index) ZiCog (Z80), MoCog (6809)
· Search the Propeller forums (uses advanced Google search)
My cruising website is: www.bluemagic.biz MultiBladeProp is: www.bluemagic.biz/cluso.htm
It is very clear that if a COG does not need its HUB access window then the slot could be used by some other COG for a potential speed gain. This is a very simple and tempting idea.
What we ARE doing is making a different value judgment about the pros and cons of doing this.
As I tried to say, and Potatohead I think said better, as soon as you do this you have destroyed timing determinism and thereby blown the simplicity of using the Propeller away. We are just making the judgment that losing determinism and the resulting simplicity in the general case is not worth the possible speed gains of a few special cases.
By the way, it's not clear to me that the COG that gets the "turbo" slots benefits much anyway. If the raw speed of that COG depends on having free slots then what happens when it finds itself running in an application that has few free slots? It never had any timing determinism to start with and now it does not have so much speed gain either.
Result: It's impossible to make any rigorous statements about the performance of that turbo'ed COG. To determine if it meets the requirements of any given application one has to understand what all the other COGs in the app are doing. Then one has to risk the whole house of cards failing when some new feature is added to the app that requires a new COG or otherwise eats HUB slots.
Our opinion is that turboing COGs as described is a bad trade-off of general predictability vs marginal speed gains in some cases.
"If there were a turbo mode, it would only apply to the cogs that set it on."
On the face of it that seems to be obviously true. I suggest it is not. At least not directly...
Let's say a turbo enabled COG driver out of OBEX is advertised as providing some service at a peak speed of 20MBytes/second (whatever that service might be). So I build it into an application that requires that service at that speed. Everything goes fine because the rest of my app leaves enough free slots for that turbo COG to run fast.
Now I add some new feature that eats HUB slots, perhaps only occasionally. BOOM, my application now has random failures occurring as the turbo COG is starved of slots occasionally. Perhaps I only discover this after the system has been out in the field for months and does something weird due to a rare external stimulus at the wrong moment.
So whilst "turbo mode" is only applied to one COG, its effects are applied to the whole application and even to the whole system that the Propeller is operating in.
Not good.
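The failure mode above can be sketched in a few lines of Python. This is purely illustrative and assumes the same hypothetical rule discussed in this thread (8 slots per rotation, the turbo COG gets its own slot plus whatever the other COGs leave idle). The "slots needed" figure stands in for whatever the driver's advertised speed works out to; the names and numbers are invented.

```python
def starved_rotations(slots_needed, busy_per_rotation, num_cogs=8):
    """Return the indices of rotations in which the turbo COG falls short.

    slots_needed: hub slots per rotation the driver needs to hit its
                  advertised speed (hypothetical figure).
    busy_per_rotation: for each rotation, how many OTHER COGs used their slot.
    """
    starved = []
    for index, busy in enumerate(busy_per_rotation):
        available = num_cogs - busy   # own slot plus the idle slots of others
        if available < slots_needed:
            starved.append(index)
    return starved


# The app normally keeps 3 other COGs on the hub, so a driver needing 5 slots is fine.
print(starved_rotations(5, [3, 3, 3, 3, 3, 3]))
# A new feature occasionally wakes up one or two more COGs.
print(starved_rotations(5, [3, 3, 4, 3, 5, 3]))
```

The first run reports no starved rotations; the second reports the two rotations where the extra COGs woke up. Nothing in the driver changed between the two runs, only the rest of the application.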
Anyway the point is moot as I think Chip has decided long ago.
PS How did this thread get revived again?
The hub (probably?) would have an 8 queue buffer, which cogs could send requests to (a cog can only send 1 at a time, obviously). The cogs not operating in "turbo" mode would automatically get added to that queue on every n/8 cycles as usual, so code can still be written completely deterministically. So again, if a cog is NOT run in turbo mode, it would NOT be affected in any way by anything any other cog is doing.
Heater/potatohead, you're completely correct that "turbo" can't guarantee hub access more than 1/8 sysClock, but it's certainly possible. This of course should be a consideration to the programmer. Not all applications would benefit from it, but I think that some could.
I also think that in reality, most major applications will probably only use a limited number of unmodified objects from the OBEX, as they are mostly designed to run in their own cog, which, as many here know, can easily deplete all available cogs. Concessions would have to be made, and objects normally run in their own cog might need to be combined, as people do now.
I have a dream - a dream of a Propeller X with access to hub memory every cycle, and interrupts that can quickly launch a cog to run an ISR, and then shut down again, waiting to run another ISR. A form of LMM might be the closest thing to accomplishing this - perhaps there could be in-cog ROM with such a program.
It might look like I'm "run[ning] with the lowest common denominator" but that is not my motivation at all.
Time for a story...
I used to do a lot of work in the avionics industry. Sometimes writing but mostly testing all kinds of systems from Rolls-Royce engine management to Boeing 777 Primary Flight Computers. We had development methods and tools and even languages designed to ensure correct behavior of those systems. One language would even spit out a report after compiling a module of exactly how much of its allotted time slot it took to execute. For a whole application of many modules it would report on the time usage of every part and you could be sure that those 10ms or 100ms execution slots were NEVER going to be exceeded. Life was good. One could mix and match modules without worrying about them tripping each other up. If there was a possibility of that happening the compiler would tell you in advance.
All hell broke loose on one project where they had decided to use Ada instead. All of a sudden no one had any idea how long anything took to execute. In one of the final builds I discovered that the code was quite often consuming 95% of its allotted time. No one could make a concrete statement that it would NEVER exceed 100%, thereby causing failure. Not to mention the design requirement was for only 50% CPU usage!
The result of all this is that I'm very much attached to the Propeller's timing determinism. As far as I know there are only two generally available systems that make the timing guarantees in the face of multiple tasks that the Propeller does, the other one being, dare I say it, XMOS. They are about to release a compiler that also does timing analysis. So I don't want to see this feature being given up easily.
In general purpose CPUs all is aimed at maximizing throughput in the general case. So they grow pipelines which change your execution speed depending on which way your jumps go. They grow caches which change your execution speed depending on where your data might be in memory. They implement prioritized interrupts which change your execution speed depending on whatever else is going on in the system. Then to make all this manageable they put a general purpose OS on top, at which point all bets are off as you now have no idea when or even if you are ever going to be scheduled.
Great, overall throughput is optimized, but now, for example, my Linux sound system splutters and coughs. A few hardy developers have to spend years getting "real time" stuff to sort of function.
Bottom line is that if I was happy with a general purpose and unpredictable processor I could use many others besides the Prop. The Prop has this "unique selling point", as they say, that it should not give up.
Post Edited (heater) : 12/28/2009 12:31:05 PM GMT
This is simply NOT true, as I said before. If my non-turbo COG or COGs are actually my application, and my application requires the turbo COG to be able to operate as fast as possible, then it follows that my application COGs MUST leave enough HUB slots free. If not, the turbo COG slows down and the application fails.
In summary the turbo-COG now dictates what can be in my application if I need it to zoom along.
In the general case I can't be sure that the turbo-COG can deliver its turbo performance, as it depends on my app. Therefore I have to assume it has a lesser, normal, performance, and therefore there is no point in turbo-mode.
Currently the use of an object only requires being aware of how many COGs it uses, its memory requirements, its pin usage and possibly its clock requirements. All of which are inescapable (unless you totally separate COGs into separate CPUs). Adding this indeterminate timing problem into the mix makes things much harder to manage.
That's the thing, as a user of objects created by others I want them to work as advertised no matter what stupid stuff I get up to in my app :)
I don't want to have to analyze every piece of code for its timing dependencies.
This ease of re-usability is key to the Propeller, which relies on software to create peripherals that would otherwise be implemented in silicon in regular MCUs.
Clearly there are cases where a turbo would work and get you some performance gain and perhaps make things possible that were not possible otherwise.
I'm just leaning towards wanting predictability all the time rather than the speed gain occasionally.
I can't wait for this device. I've got a product planned that will take great advantage of its IO prowess! Before I learned of all the upcoming features of the P2, I was looking at Cypress's PSoCs - but they're overly complex, and I have heard about there supposedly being various bugs.
This timing issue is such a feature in my mind. It has subtle consequences that are not appreciated at first and can have a huge impact on your end results.
Another is the regularity of the I/O pins. Pretty much all pins are equal on the Prop. Apart from a few necessary service pins. This gets taken for granted until you run up against a chip where this is not so. All of a sudden you are finding that there are pins dedicated to certain functions, or pins with different current drive capabilities, or pins that can only be set as input or output together in banks of 4 or 8 or 16, or pins that can be driven from one CPU core on the chip but not others. Baahhh ! I've got all these pins but I can't use them how I want to, give me my Prop back!
I'm sure I have found many other such Prop features, subtle enough that I can't remember them just now... :)
The sad part about all this is that those who buy into an MCU because of its sales "bullet points":
32 I/O pins - check.
xxxMIPs - check
C compiler - check
1000 levels of nested prioritized interrupts - check
Other useless junk - check
tend to skip over the Propeller and consequently miss out on what a wonderfully refreshing architecture it is.
There are no bugs in the chip. A Propeller does exactly what it is asked to do every time.
The HUB-COG interaction as it is right now allows a user to balance throughput / compute * COGs however they see fit, and all users' code will operate with all other users' code, given the sum of the code to be used does not exceed the overall raw capability of the Propeller.
To express Heater's point another way, let's say several COGs are used to get the throughput needed. Doesn't that consume COGs, just like a "turbo" COG would consume COGs, in that the others would have to be limited in what they do to permit the "turbo" one to run?
Sure it does.
So then, just use more COGs, the end product being the same, without the ugly costs associated with kludges in silicon!
The very real effect is still the loss of COGs no matter how one looks at it. There is no free lunch in these things.
Special case tweaks really are kludges. Here's the thing about kludges. They scale just like the good stuff does, meaning the size of the problems that come with kludges scales as well. Where one person would be required to resolve them, scale up the chip, and now a team is required to resolve them.
The net productivity of the chip is diminished, while arguably the overall capability isn't increased in like kind!
That's why the cost is so high!
Propeller users get this because they've been programming and building with Props for a while. Users of other CPUs don't get this, as they've never, ever been exposed to it.
If you look at MCUs where they do these kinds of things, kludges, they end up with lots of chips in a series a, b, c, whatever... and people become selectors instead of builders. Flipping through the catalog means finding that set of kludges that matches your special case, but for this one little difference... if that difference gets attention, another chip gets added to the catalog.
What occurs then is madness over time as some chips in the series continue to get made, others don't, and everybody everywhere has this massive set of things to sort out on every project and worse, over the life cycle of the project, where sometimes things get revised, and/or re-kludged, and there is your team effort right there.
For the entire life of the Propeller I, there will be just the Propeller I.
Either your scope of work fits into the chip, or it does not. If it does, then you have no worries for a very, very long time. If it doesn't, the fact that the chip is largely software powered means you can apply grey matter to the problem and potentially still get the chip to perform. If so, great! It will then always perform.
...and unlike most other designs, all other users can benefit from your design effort, without significantly impacting their own.
This is powerful stuff, and well worth the bump in cost for the Propeller, particularly for small to mid sized runs.
There is another element here as well.
When people are spending their time as selectors, they need this chip and two of that one, and another one over here, and damn! That last one reached end of life last year, so we will just have to KLUDGE around this other one because we've got lots of them...
You get the point, I am sure.
So far, we have seen the Propeller chip exceed many boundaries, some of which surprised the designer of it! As we all apply our thoughts to the chip, innovation occurs. Because all the code works on all the chips, and because the interaction between code bodies is the minimum possible, we all benefit from that innovation to the maximum extent possible.
Code can be shared freely, meaning skills acquired with it can be shared freely as well. If one's skill is locked into very specific selections of things, that cannot be shared freely, resulting in less overall innovation value for all users of that series of chips.
These are not insignificant things!
It is tempting to modify Propellers to get "our" case to work, but each modification has a cost. There are production costs, debug costs, inventory costs, code re-use costs, and innovation costs, as detailed above. For each member of a series, these costs multiply into the mess we have today.
Propeller II will behave the same way. There will be a Propeller, and it will operate as Prop I did + the innovations - kludges, to scale up the design to enable a greater scope of tasks.
Again, either that task will fit, or it will not, or one can use software to make it fit. All others, where the code efforts are shared, benefit right away, and can build on that to max out the chip and get full use value from it in their designs.
If Prop I were kludged like you say, then what happens when Prop II addresses some of this? What will happen is a kludge of a kludge because the code re-use value is such an attractive thing, that it will be tempting to just keep kludging for this case and that, and pretty soon the chip runs hot, is buggy, has multiple variations, and it's just not a Propeller anymore.
If you want that kind of speed, then use a few COGs to do it. That's the right way to break the problem down. Most of us here continue to learn and improve on doing that, and it's pretty amazing what can be done as soon as you start thinking in parallel, and leverage the deterministic timing attributes.
Post Edited (potatohead) : 12/28/2009 7:43:48 PM GMT
heater: I am sorry, I don't buy your argument against. It is innovation. It does not have to be used. The user gets to decide. It is just the same as the multithreading that Chip intends to add. Currently, I don't think there is any use for me, but who knows - I don't have a crystal ball.
BTW, what if you only have 4 cogs running? Those 4 hub cycles are wasted. You cannot just throw another cog at the problem.
Chip has largely addressed the issues of hub-cog speed, but I think the PropII will go into new areas never thought of before, and it is likely to become a bottleneck again.
Anyway, FWIW I predict the PropII will suffer from a shortage of cogs because everything else seems to be so fantastic we will just push the envelope so much further. I would rather pay a little more for 16 cogs. Anyway, there will always be shortcomings.
If symmetry and determinism are broken, then we don't have a Propeller. We have something that runs the same instructions as a Propeller does, but we don't have the set of assumptions in play surrounding those instructions. It does not act like a Propeller.
The reason why things are easy, consistent and productive is because we have symmetry and determinism.
There isn't a free lunch on this. Truth is, the silicon has a max throughput and compute. Those things don't change. How it is presented to us can change, and that presentation impacts the overall cost of computing.
It's hard for me to characterize something as "innovation" when the net cost is higher after having done it.
Right now we have any code, any cog.
What that means is I can ask a question about how something is done, get the answer, know it's absolutely going to work as advertised, without having to either abstract the question to avoid sharing all my project details, or overwhelm others and myself with those details.
That goes away, if we break determinism and symmetry.
Now, here's the other case I have for that being a kludge.
Has anyone demonstrated that we would get more out of the Propeller chip given that scenario? I don't think anyone can, because the performance one gets from a Propeller is a function of how well thought out the problem case is. Seems to me, using several COGs is no different than only using one COG, but doing so in a way that denies the other COGs from running as they otherwise could be.
Boil all that down, and what is being asked for is to make thinking through a problem case easier, while raising the cost of using solutions already found and published!
That all feels like one of those "well, let's just do this now, and we can come back and fix it later" things. Classic kludge!
There is always time to do it right the first time, so why not simply do that then? For this problem case, the answer is to use more COGs, and if that means saturating the chip, then it means the problem case exceeds the Propeller overall, or multiple Propellers are needed, or Propeller + some other thing is needed, but it doesn't mean breaking the Propeller.
Put really simply, unless a net gain can be shown that exceeds that possible right now, using the cogs in tandem, making a "turbo" COG option is a kludge.
Kludges scale too, like I wrote before. There will be code changes for Prop II. Those are needed to scale out the chip, and that's fine. However, the fundamental things are not changing, because they were done right the first time. In other words, not kludged.
Some realizations have occurred that show us where efficiencies can be had, without kludging, and we will see the product of those in Prop II.
No matter what scale the Propeller is at, there will always be this dilemma. Given that there are plenty of other MCU designs that do not have the qualities a Propeller does, I submit endangering those qualities with a kludge, without also demonstrating a net increase in throughput and compute, not otherwise attainable, doesn't make sense.
Edit: So there is the code reuse and authoring cost being low right now. That's real.
There is also the silicon cost. If the simple round-robin state machine is broken, a larger, hotter, more buggy one will have to replace it. My guess, given that Chip tends to like to release finished chips sans bugs, is the cost would be heat and time to market. The product will have more test cases to vet, meaning we don't see Prop II sooner, and when we do see it, it will run hotter, and cost more to code for.
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Propeller Wiki: Share the coolness!
8x8 color 80 Column NTSC Text Object
Safety Tip: Life is as good as YOU think it is!
Post Edited (potatohead) : 12/29/2009 12:00:30 AM GMT
I basically agree with Cluso99 in that his "turbo" mode would provide a simple solution where you need higher cog <-> hub throughput and are willing to dedicate two or more cogs to the solution and where it would be substantially more difficult to solve the problem using the 2nd cog with code running in it. There's no real cog to cog communications and it's not always straightforward to use two cogs to nearly double the throughput of a single cog.
I understand what Cluso99 wants, and I was a proponent of it a while back, but I think the problem was largely solved. I believe what was decided was to transfer up to 8 longs per hub access (though I don't know how to confirm this without reading another 30-page thread).
There still may be higher needs than that, but I think Mike hit on a point: what I would like to see is some sort of cog2cog communication within the chip. Of course, it can be done by using some of the IO lines as a data bus, but an intercog communication grid would be better and not waste those precious pins.
I would rather have a completely deterministic chip as heater and potatohead pointed out...but I would rather have it SOON. If the question was posed about the port B prop1 with the PROP2 being this far out, I think more people would have voted for the Port B prop.....but I digress ;^)
potatohead: now that you are using another avatar, would you mind if I used the SGI bug?
Of course, as was pointed out, this is merely a hypothetical argument, the hub timing having been decided long ago.
-Phil
Since there are now 96+ I/Os, intercog comms could use some of those new pins. There may be other possibilities too.
I have a question for this hypothetical discussion on a "turbo" mode.
How would this actually work?
This is my take on the subject......if the regular code takes four cogs, and the "turbo" code takes 4 cogs, there would be little to no benefit. Here is why (as I see it).
If you split the regular code into reads of 0-2-4-6 then the timing of that code takes a hit.
If the turbo code does a sequential use of the hub for 4-5-6-7 then it has to wait (round robin style) for its turn again. Where have you helped? Sure, while it has the hub, it's screaming, only to be put on hold for 4 hub ticks (was going to say cycles...but that doesn't fit).
Then the "turbo" code is playing catch up. I don't see where it would be balanced. It almost seems like interrupt action if you ask me.
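To put rough numbers on the slot patterns above, here is a minimal sketch that only counts hub slots in one 8-slot rotation. The function name `slot_profile` and the idea of "owning" a set of slots are assumptions made for illustration; nothing here models real Propeller hardware or whether a cog could actually issue requests back to back.

```python
def slot_profile(owned_slots, slots_per_rotation=8):
    """Return (accesses per rotation, longest run of idle slots between accesses).

    owned_slots: slot numbers (0..slots_per_rotation-1) that one task may use.
    This is a hypothetical counting model, not a description of the real hub.
    """
    slots = sorted(set(owned_slots))
    if not slots:
        raise ValueError("need at least one slot")
    if slots[0] < 0 or slots[-1] >= slots_per_rotation:
        raise ValueError("slot number out of range")

    idle_runs = []
    for i, slot in enumerate(slots):
        nxt = slots[(i + 1) % len(slots)]
        # Distance to the next owned slot, wrapping into the next rotation.
        distance = (nxt - slot) % slots_per_rotation or slots_per_rotation
        idle_runs.append(distance - 1)
    return len(slots), max(idle_runs)


print(slot_profile([4, 5, 6, 7]))  # sequential "turbo" block
print(slot_profile([0, 2, 4, 6]))  # interleaved slots
print(slot_profile([3]))           # one ordinary cog
```

Under this counting, both four-slot patterns get the same number of accesses per rotation; they differ only in how long the task sits idle between bursts.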
Now inter-cog communication, I can see that would help a lot.
Just my view.....but I'm blind in one eye, and can't see out the other,
James L
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
James L
Partner/Designer
Lil Brother SMT Assembly Services
Are you addicted to technology or Micro-controllers..... then checkout the forums at Savage Circuits. Learn to build your own Gizmos!
The basis of the mode was described as follows:
"The hub (probably?) would have an 8 queue buffer, which cogs could send requests to (a cog can only send 1 at a time, obviously). The cogs not operating in "turbo" mode would automatically get added to that queue on every 1/8 clock cycles as usual [edit: in their specific order cog 0-7], so code can still be written completely deterministically. So again, if a cog is NOT run in turbo mode, it would NOT be affected in any way by anything any other cog is doing."
As you mentioned, the programmer needs to take into account some basic ground rules, or else you get either no additional benefit, or even undesirable operation as has been pointed out. So let's go over the performance we're going to see in the next gen prop based on what we know:
First off, according to the current prop2 specs, each cog will have a bandwidth of 32-bits * 4 every sysClock/8, for a total of 240MB/sec @120MHz (Wait, am I doing something wrong?! That looks pretty nice!). Multiply that by the number of cogs (8), and you get total hub bandwidth of 1.92GB(!) per sec.
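The arithmetic behind those figures can be checked with a few lines. This is only a sketch using the numbers quoted in the post (120 MHz, 8 cogs, 4 longs per hub window); the constant and function names are made up for illustration.

```python
SYS_CLOCK_HZ = 120_000_000  # 120 MHz system clock, as quoted above
NUM_COGS = 8                # one hub window per cog every 8 clocks
LONGS_PER_WINDOW = 4        # "32-bits * 4" per window, as quoted above
BYTES_PER_LONG = 4


def per_cog_bandwidth(sys_clock_hz=SYS_CLOCK_HZ):
    """Bytes per second one cog can move if it uses every one of its windows."""
    windows_per_second = sys_clock_hz // NUM_COGS
    return windows_per_second * LONGS_PER_WINDOW * BYTES_PER_LONG


def total_hub_bandwidth(sys_clock_hz=SYS_CLOCK_HZ):
    """Bytes per second across all cogs combined."""
    return per_cog_bandwidth(sys_clock_hz) * NUM_COGS


print(per_cog_bandwidth() / 1e6, "MB/s per cog")
print(total_hub_bandwidth() / 1e9, "GB/s total")
```

With the quoted inputs this gives 240 MB/s per cog and 1.92 GB/s in total, so the numbers in the post are consistent with each other.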
With that said, any given cog could never have more than the maximum hub bandwidth, and in reality would usually have significantly less. It would also never have more or less than 240MB/sec in non-turbo mode. The number of cogs running in "non-turbo" mode, and their order (cog #), will dictate how the turbo access window looks: since non-turbo cogs get placed in their natural order in the 8-queue request buffer, the window could look funky, though without necessarily creating any problems.
The primary goal is to carefully give a particular cog a larger access window to hub RAM instead of having to assign additional cogs to perform part of the same task when greater hub bandwidth is required. Here's an example of where it could be useful: dedicating a cog as an external RAM controller (we got the pins!). Data can be transferred very quickly by requesting most (if not all) of the other cogs to go into turbo mode and not generate requests to the queue. This would allow the memory controller cog full access to hub RAM for the transfer. In practice, this particular scenario can't occur because the pins couldn't keep up without the use of a huge external bus. Though with ~1/4 of hub access, max transfer could probably occur.
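One way to picture the access window is a toy schedule for a single 8-slot rotation. Everything here is an assumption for illustration: the three mode names, the rule that a non-turbo cog always keeps its own slot, and the rule that freed slots are handed round-robin to turbo cogs that are actually requesting. It is a sketch of the idea as described, not of any real or planned hub logic.

```python
from itertools import cycle

# Hypothetical mode labels, invented for this sketch.
NORMAL = "normal"              # deterministic: always gets its own slot
TURBO_ACTIVE = "turbo_active"  # turbo mode, requesting hub access
TURBO_IDLE = "turbo_idle"      # turbo mode, not generating requests


def rotation_schedule(modes):
    """Return the cog that owns each of the 8 slots (None if unused).

    modes: list of 8 mode labels, indexed by cog number.
    """
    if len(modes) != 8:
        raise ValueError("expected one mode per cog (8 cogs)")
    for mode in modes:
        if mode not in (NORMAL, TURBO_ACTIVE, TURBO_IDLE):
            raise ValueError(f"unknown mode: {mode!r}")

    active = [cog for cog, mode in enumerate(modes) if mode == TURBO_ACTIVE]
    takers = cycle(active) if active else None

    schedule = []
    for slot, mode in enumerate(modes):
        if mode == NORMAL:
            schedule.append(slot)          # non-turbo cog is never displaced
        elif takers is not None:
            schedule.append(next(takers))  # freed slot goes to a turbo requester
        else:
            schedule.append(None)          # nobody wants this slot
    return schedule


# Memory-controller case: cog 0 requests, every other cog yields.
print(rotation_schedule([TURBO_ACTIVE] + [TURBO_IDLE] * 7))

# Mixed case: normal cogs 0, 2, 4, 6 keep their slots; cog 1 picks up the rest.
print(rotation_schedule([NORMAL, TURBO_ACTIVE, NORMAL, TURBO_IDLE,
                         NORMAL, TURBO_IDLE, NORMAL, TURBO_IDLE]))
```

In this model a non-turbo cog's slot never moves, which is the property the quoted description relies on; the shape of the turbo window depends entirely on which cogs stay in normal mode.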
-Phil