2) In most cases, only the overall "business logic" or "HMI" app would use "hungry" mode.
Plus, there is a potential power saving, from allowing this "HMI" app to apply the "hurry-up-then-wait" approach.
It can finish, and get to whatever low power wait modes the P2 will have, sooner.
I see a case where I could use 2 co-operating cogs, where 1 cog specifically "donates" its slot to the co-operating cog if that cog requires it. The donation would be enabled by the donor cog and would be in two forms... donate to cog x if I don't use the slot, and donate to cog x and, if it does not use it, I can use it (submissive).
This way I can guarantee the master cog can get 200% hub cycles (donate submissive). This is then deterministic for the master cog.
Why? I may have a program or driver that requires a very quick response. By deliberately coding a pair of cogs, I can control this - this is something I currently cannot do. And there is no impact on the other 6 cogs. In fact, using this concept, I could write all cog code to donate submissively to 1 or more masters. So I could have 1 master cog capable of using all 8 slots per hub cycle.
In all cases, the user must deliberately code for this, but do not deny him the opportunity to do this for some irrationally perceived notion that it is bad.
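Cluso's two donation modes can be sketched as a per-slot arbitration rule. This is a toy Python model, not P2 hardware; the function name, mode names, and cog numbering are purely illustrative:

```python
def allocate_slot(slot_owner, wants, mode, donor, master):
    """Decide who gets one hub slot in a hypothetical donor/master pair.

    slot_owner: cog id that natively owns this slot
    wants:      set of cog ids wanting a hub access this slot
    mode:       'plain'      -> donate only when the donor is idle
                'submissive' -> master has first claim on the donor's slot
    """
    if slot_owner == donor:
        if mode == 'submissive':
            # Master claims first; donor uses its own slot only if master doesn't.
            if master in wants:
                return master
            return donor if donor in wants else None
        # Plain mode: donor keeps its slot whenever it wants it.
        if donor in wants:
            return donor
        return master if master in wants else None
    # All other cogs keep their own slots untouched.
    return slot_owner if slot_owner in wants else None

DONOR, MASTER = 1, 0
# Submissive mode: master claims its own slot plus the donor's,
# i.e. up to 2 of every 8 slots (the "200% hub cycles" case).
got = [allocate_slot(owner, {MASTER, DONOR}, 'submissive', DONOR, MASTER)
       for owner in range(8)]
print(got.count(MASTER))  # 2: its own slot plus the donor's
```

Note that in either mode the other six cogs never see a change, which is the deterministic property being claimed.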
Perhaps another way of donating slots could be done as follows.
Suppose a 2-bit field existed that selects the frequency of slot donation:
DMODE0 %00 = 1/8 slot time (normal/default) - NO slot donation
DMODE1 %01 = 1/16 slot time (every 2nd slot is ALWAYS donated to a HM cog)
DMODE2 %10 = 1/32 slot time (every 2nd, 3rd and 4th slot is ALWAYS donated to a HM cog)
DMODE3 %11 = 1/8 slot time (donate ALL free slots to a HM cog) - "open season" donation mode
DMODE1 & DMODE2 would still give some determinism to the donating cog and also guarantee slots to the HM cog.
This would balance/stabilize the slot allocation.
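The DMODE frequencies work out as simple modular arithmetic on the donor's hub turns (a cog's own turn comes once per 8 clocks, so keeping every 2nd turn gives the 1/16 rate, every 4th gives 1/32). A quick sketch; the names and encoding are illustrative only:

```python
def dmode_keep(dmode, turn):
    """Does the donor keep its hub turn number `turn` (0, 1, 2, ...)?

    DMODE0: keep every turn (no donation)            -> 1/8 slot time
    DMODE1: keep every 2nd turn, donate the rest     -> 1/16 slot time
    DMODE2: keep every 4th turn, donate 2nd,3rd,4th  -> 1/32 slot time
    DMODE3: donate every free turn ("open season")
    """
    period = {0: 1, 1: 2, 2: 4}.get(dmode)
    if period is None:   # DMODE3: nothing is kept unconditionally
        return False
    return turn % period == 0

# Over 8 consecutive turns, how many does the donor keep in each mode?
kept = {m: sum(dmode_keep(m, t) for t in range(8)) for m in range(4)}
print(kept)  # {0: 8, 1: 4, 2: 2, 3: 0}
```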
Modes to prevent unused slots from being used by hungry cogs are counter-productive and useless.
It comes down to designer control, again.
I can think of (rare) test-coverage and development cases where someone may want to 'bookmark' slots early in development, knowing they will be needed, or needed in some usage cases, and want to ensure that does not produce a (late) change in behaviour.
It even gives enough Designer Control, to allow those worrying about Designer control, to protect themselves from themselves.
Cluso's voluntary donation to a paired cog for guaranteed extra bandwidth is potentially very useful (assuming it fits)
That is what Chip's table (slightly old) says is possible. Looks good to me.
Quite a while ago, I also suggested a 'floor' on the donation, to allow (low) bandwidth minimums to be met in a simple manner. Not sure where that is in the mix?
A floor means you can guarantee just under double from a volunteer, and the volunteer gives most, but not quite all, of their slots in highest-demand cases.
i.e. The difference between 95% and 97.5% is not large, but 2.5% to 0 is quite a change.
I guess if HW support for a floor is too complex, it can be managed less simply in SW?
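The floor arithmetic can be checked in a couple of lines: if the volunteer keeps 1 of every F of its own turns, it retains 1/F of its native 12.5% share and the master collects the rest. F = 8 below is just an example value, not a proposed encoding:

```python
def bandwidth_split(floor_keep_every):
    """Hub-bandwidth shares for a donor that keeps one of every
    `floor_keep_every` of its own turns. Each cog natively owns
    1/8 of all hub slots."""
    own_share = 1 / 8
    donor = own_share / floor_keep_every        # guaranteed minimum (the "floor")
    master = own_share + (own_share - donor)    # its own slot plus the donated part
    return donor, master

donor, master = bandwidth_split(8)   # donor keeps 1 in 8 of its turns
# Donor keeps ~1.56% of total hub bandwidth; master gets just under
# double its normal share (23.44% rather than the full 25%).
print(donor, master)
```

This is the "just under double" trade: the master loses a sliver of the theoretical maximum, but the volunteer's bandwidth never drops to zero.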
Cluso's voluntary donation to a paired cog for guaranteed extra bandwidth is potentially very useful (assuming it fits)
Bill, as I wrote a few posts earlier, PERHAPS you are right (with this explanation #4629), and so I disagree here with Cluso and, based on your further posts, I am surprised you don't also.
Cluso, by coding an object that uses 2 cogs, it's enough that one of them doesn't use any hub slot; the other then automatically has all the unused slots, guaranteeing double the bandwidth (the same applies for more cogs). I see no reason to over-complicate that.
If this route has to be taken, then a hungry cog that can benefit from unused slots, and in no way can steal others', is more than enough. The cog would have a local setup to request hungry mode (disabled by default), and a global hungry-disable could be available. That's all.
I would also like to stop this debate at last, until Chip shows his thoughts/ideas on it. I would prefer that you all, clever guys, move on (instead of wasting silicon) to something perhaps more useful, e.g. hw-assisted USART/SPI/I2C master/slave modes, eventually supporting 4-bit SPI for flash/SD use.
I would also like to stop this debate at last, until Chip shows his thoughts/ideas on it. I would prefer that you all, clever guys, move on (instead of wasting silicon) to something perhaps more useful, e.g. hw-assisted USART/SPI/I2C master/slave modes, eventually supporting 4-bit SPI for flash/SD use.
Yes, SerDes and Counters are still awaiting a full reveal.
Chip did talk a bit about QuadSPI, where pin-map is probably the most complex part.
I would also like to stop this debate at last, until Chip shows his thoughts/ideas on it. I would prefer that you all, clever guys, move on (instead of wasting silicon) to something perhaps more useful, e.g. hw-assisted USART/SPI/I2C master/slave modes, eventually supporting 4-bit SPI for flash/SD use.
Many ideas have been floated on SERDES design and implementation. Same for HUB SLOT allocation.
Chip ultimately will sift thru these suggestions and come up with something that delivers on the broad
needs of the user.
I don't see how any of this is "wasting silicon" .
HUB EXEC mode was spawned from ideas bounced around this forum, clearly not a waste.
I believe Chip already has some ideas on SERDES in the works.
I would also like to stop this debate at last, until Chip shows his thoughts/ideas on it
Me too. I think all the important things have been exchanged and there is some good dialog for Chip to consider. Given time, shuttle, the last few features, Chip is going to have to decide what is worth what.
FWIW: The attempts to marginalize valid concerns as "fear", etc... aren't doing any real good.
Again, I'm not opposed to this feature, but there are concerns. Some of those have been addressed in the discussion, WHICH IS WHY WE HAVE THE DISCUSSION.
Modes to prevent unused slots from being used by hungry cogs are counter-productive and useless.
Cluso's voluntary donation to a paired cog for guaranteed extra bandwidth is potentially very useful (assuming it fits)
Bill
I agree with Cluso's suggestion.
The idea behind the "modes" was to appease the "determinists" out there.
If I can get my hands on "free slots" by any method, that would "make my day"
Chip said that no hubslot stealing is implemented yet (see here).
I thought the increase from Quads to Wide hub reads was the solution that replaced the hubstealing.
I fear that implementing a hungry/greedy mode may take another month or two, and may only speed up non-optimized code by 20-30%. Optimized single-thread code already runs at 100% PASM speed, or you can just run critical code in cog mode.
I thought the increase from Quads to Wide hub reads was the solution that replaced the hubstealing
Andy,
In the scenarios that I can think of where "free slots" would significantly boost performance, none of them
benefit from WIDE reads/writes, as most of the processing is on non-consecutive LONGs (limited by the 32-bit operations).
The problem is highlighted further in multi-tasking apps where tasks are fighting for the one slot.
Also using RDWIDE,SETWIDE in multiple tasks becomes tricky as only one task can use it at a time.
You keep saying "weird timing interaction between cogs" etc., trying to kill the use of ***UNUSED SPARE HUB SLOTS***.
Please provide a valid example where this could happen.
No example coming, just an observation:
There is a finite amount of bandwidth to the hub. We can't get away from that fact. This is currently shared equally among COGs. In this way the timing of code in different COGs is totally independent. This decoupling is a big feature of the Prop, I believe.
The "hungry" proposal is to allow at least one COG use HUB bandwidth that may be unused by the other COGs. We note that this does not affect those other 7 COGs. They continue in isolation as usual.
That "hungry" COG can now run faster if there is bandwidth available. It will run slower if there is not. The hungry COG has a timing dependency on the rest of the system.
You point out that for many uses that is OK. We get more speed if available and we are not worried about timing accuracy in our big hungry COG code.
So far so good. I like the idea. As I said I would be quite happy for that JavaScript interpreter to be "hungry".
BUT:
The weird timing interactions are:
I write a hungry program that runs at rate X. "Great" I think and proceed to rely on X for the code I create for the other 7 COGs. BOOM it does not work.
I write a hungry program that expects to run at rate X, but it does not, because the rest of my code is written as HUB execute and saturates the HUB bandwidth.
The whole idea is predicated on not caring if the "hungry" COG can run at full speed or not. Which I agree is a reasonable approach in many situations. The "weird timing interactions" will occur when people get to rely on that full speed for some reason.
So. If this is done at all it needs to be documented:
1) This code may not run at the speed you think it should.
2) Its speed may vary depending on the current load presented by the rest of the COGs.
3) YMMV.
As Potatohead said this is not about Spin or OBEX it's about expectations. Suitably bold face documentation may satisfy that.
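Heater's point about rate X can be shown with a toy model: a single hungry cog's share of hub slots is its own 1/8 plus whatever the other seven cogs happen to leave free, so it swings with system load. The load numbers below are purely illustrative:

```python
def hungry_rate(other_cog_loads):
    """Fraction of all hub slots one hungry cog can get, given each
    other cog's load (0.0 = fully idle .. 1.0 = using every slot).
    The hungry cog always gets its own 1/8, plus whatever is left free.
    """
    free = sum((1 - load) / 8 for load in other_cog_loads)
    return 1 / 8 + free

# Lightly loaded system: the hungry cog sees most of the hub.
light = hungry_rate([0.1] * 7)
# Seven cogs in hubexec saturating their own slots: back to 1/8.
heavy = hungry_rate([1.0] * 7)
print(light, heavy)
```

The same program therefore runs at very different speeds in the two cases, which is exactly the timing dependency being warned about in points 1-3 above.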
Modes to prevent unused slots from being used by hungry cogs are counter-productive and useless.
Yes indeed.
Cluso's voluntary donation to a paired cog for guaranteed extra bandwidth is potentially very useful (assuming it fits)
This whole thing about "paired cogs" and "slot donation" etc. is perhaps the straw of complexity that will break the camel's back. Both for Chip to design it and for users to ever understand how to program the thing. I feel we should not spend time and effort adding any more complexity that only 0.001% of dedicated Propeller fans will ever understand, let alone use.
ctwardell,
This seems like spreading FUD to me. ... It is very clear that nobody is advocating that a cog can steal slots from another cog.
Not at all. Labelling a statement as "fearful" is not a constructive form of debate. I have said nothing about "stealing" hub slots yet. That is a whole other can of worms that I refuse to even think about.
How fast does a "hungry COG" run if the other seven COGs are executing directly from HUB?
I ask because I can imagine someone firing up their C compiler and building all their code from scratch, using all COGs and execute from HUB. Seems like a natural thing to do.
Chip has already suggested that execute from HUB will be the natural starting point for most. Which makes sense if you are writing in C and don't want to worry about that tiny COG code size limit.
This whole thing about "paired cogs" and "slot donation" etc. is perhaps the straw of complexity that will break the camel's back. Both for Chip to design it and for users to ever understand how to program the thing. I feel we should not spend time and effort adding any more complexity that only 0.001% of dedicated Propeller fans will ever understand, let alone use.
Yes indeed - in reply to Bill's "Modes to prevent unused slots from being used by hungry cogs are counter-productive and useless."
In general I agree, but it may have some use in tuning. Consider the case where there is a cog that, due to the nature of its task, uses either all or none of its slots; let's say this is a repeating pattern of 0.25 seconds of using them all, then 0.25 seconds of using none of them. This could be a HUBEXEC task that does something at regular intervals, so I don't think it is a terribly contrived case. Then say we have a hungry task that is providing a GUI interface. The GUI may experience visible jitter due to the "rhythmic" availability of extra slots. This would be a case where being able to have a cog NOT share any slots might be beneficial.
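The "rhythmic" case is easy to model: the hungry GUI cog's available bandwidth becomes a square wave at the donor's burst period. A toy model, with all numbers illustrative:

```python
def gui_bandwidth(t, period=0.25):
    """Hub share seen by a hungry GUI cog when one other cog alternates
    `period` seconds of using all its slots with `period` of using none.
    t is elapsed time in seconds."""
    donor_busy = int(t / period) % 2 == 0
    # Its own slot only while the donor is busy; its own plus the
    # donor's while the donor is idle.
    return 1 / 8 if donor_busy else 2 / 8

# Sample the GUI cog's bandwidth every 0.1 s for one second:
samples = [gui_bandwidth(t / 10) for t in range(10)]
print(samples)
```

The GUI's effective bandwidth toggles between 12.5% and 25% four times a second, which is the kind of visible jitter described above.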
Not at all. Labelling a statement as "fearful" is not a constructive form of debate. I have said nothing about "stealing" hub slots yet. That is a whole other can of worms that I refuse to even think about.
Your post that I was responding to included the statement "As such it is very important that the actions of one COG do not affect the timing of others" in regard to "off-the-shelf" UART, SPI, I2C code, etc. To me that implies that the so-called hungry cog can somehow impact those items; the only way it could do that would be by stealing cycles, which again isn't on the table. The reference to the "other" can of worms is a sly way of painting the current suggestion of slot sharing as a can of worms without directly saying it.
How fast does a "hungry COG" run if the other seven COGs are executing directly from HUB?
I ask because I can imagine someone firing up their C compiler and building all their code from scratch, using all COGs and execute from HUB. Seems like a natural thing to do.
Chip has already suggested that execute from HUB will be the natural starting point for most. Which makes sense if you are writing in C and don't want to worry about that tiny COG code size limit.
Yes, you've pointed out a case where slot sharing will be of very little benefit. Unless some of the cogs are running hub code that spends a lot of time in tight loops that fit in the cache each cog will be using the vast majority of its own slots.
Since SPIN is going to use HUBEXEC the use of HUBEXEC will be pretty high, so this is a valid concern. I do think that having ALL 8 cogs doing HUBEXEC code would be a rare case, and having just one cog donating slots would provide some performance benefit.
How fast does a "hungry COG" run if the other seven COGs are executing directly from HUB?
I ask because I can imagine someone firing up their C compiler and building all their code from scratch, using all COGs and execute from HUB. Seems like a natural thing to do.
I think this is not a problem, because the hungry option will not be under direct user control but filtered by the compiler (C/Spin/...). Rule: the compiler will never start more than one cog in hungry hubexec, period. As I understood it, direct control over hungry mode will be available only for direct (hand-written) PASM programming.
In general I agree, but it may have some use in tuning. Consider the case where there is a cog that, due to the nature of its task, uses either all or none of its slots; let's say this is a repeating pattern of 0.25 seconds of using them all, then 0.25 seconds of using none of them. This could be a HUBEXEC task that does something at regular intervals, so I don't think it is a terribly contrived case. Then say we have a hungry task that is providing a GUI interface. The GUI may experience visible jitter due to the "rhythmic" availability of extra slots. This would be a case where being able to have a cog NOT share any slots might be beneficial.
By doing so you'll deliberately slow down the other cog. This will end with someone taking somebody else's code written for a cog, inserting only the instruction that gives away 1/4 of its slots (the easy way), and bang, the cog has issues (and its owner is seeking help on the forums).
If you are a coder smart/clever enough to need this functionality, you can instead adapt the same code so that hub access happens in 1 window out of four. By doing so you have acquired knowledge of someone else's code, understood how it works, and can see whether this is doable while avoiding issues. The end result is the same, even without a sharing (donation) option.
If you are writing from scratch, then again you do not need a sharing option, because the code can be written not to use all of the slots. If you are not able to accomplish that, take your time to study first, to understand how it works, and in the meantime stay out of hungry modes.
Hasn't the hungry COG issue been beaten to death already? I keep looking in this thread for exciting news from Chip but get mired in a zillion messages about hungry mode. Any chance we could take a break until Chip weighs in on this? He certainly has a lot of input in here already, suggesting a myriad of alternatives.
Very easy. Who is the author of the compilers? Perhaps Chip for Spin and David for GCC? Two persons are enough to accomplish that.
The same for the Prop ... many here are debating what they want and how to do it ... in the end the silicon's contents will be the result of one person's will, won't they?
Sorry guys, I don't have time for individual replies for everyone today; I have to get the BOM and announcements done for my first Propeller based add-on for the Raspberry Pi :-) It also works as a stand-alone Prop board. (once the web page is done, I'll start a thread in the P1 forum for those who are interested)
Andy:
"I fear that implementing a hungry/greedy mode may take another month or two, and may only speed up not optimized code by 20..30%. Optimized single thread code runs already with 100% PASM speed, or you can just let run critical code in cog mode."
I do not want any delays either, which is why I was asking only for the simplest case (hungry cogs can use spare slots, no prioritization, no pairing, dead simple); based on earlier messages, this would be quick for Chip to implement. We could then explore the prioritization etc AFTER SERDES.
More on the performance benefits below.
Heater:
I understand your concerns, but do not share them. There are many other limits (pixel clock, waits etc) that users also have to respect.
See above - I was suggesting the simplest implementation for now.
More below.
dMajo:
I like your proposed rule!
If the default is that the compiler/linker only allows one hungry cog (unless over-ridden by an "advanced" option) then one cog could freely use whatever spare slots are available, without impacting "normal" cogs, and no one has to worry about two hungry cogs contending for extra hub slots.
David:
Discussions are good; Chip makes the final call for what he has time to put in.
ALL:
My overriding goal is to make the P2 the most competitive and fastest it can be, while being consistent with the April shuttle run.
That is why I have been suggesting the simplest HUNGRY concept.
I consider the worries about some hypothetical user possibly getting confused by running multiple HUNGRY cogs and not getting the performance he expects to be overblown, as it can be addressed by warnings.
Any developer who reads the warnings on HUNGRY, and proceeds to ignore them, and not test the final app deserves what he gets. Sorry if that sounds cruel.
dMajo's suggestion is a good one, and while it cannot be perfectly enforced, compilers and linkers can definitely detect the presence of "SETMODE_HUNGRY" in more than one binary blob. An advanced option could disable this.
More advanced priorities etc will be great in P3 - while I think Chip may have time for the simple scheme, a complicated scheme may cause delays.
I think there is a misunderstanding out there.
HUNGRY mode is NOT about more bandwidth for cogs. With RDOCT/WROCT, a cog can read/write 8 longs in 8 clock cycles (plus maybe an overhead before it hits its stride and locks to the hub). Donated, or spare slots will not help at all with raw bandwidth.
HUNGRY helps greatly with latency, by potentially providing more random access slots to the hub within the normal 8 cycle window.
This would not help display drivers.
It would GREATLY help Spin, compiled C/C++/whatever code, other VM's, other interpreters.
People should not be writing cycle-accurate deterministic code in hubexec or any interpreted code!
The biggest difference will be to VM's, followed by compiled hubexec code.
The four line icache does not help at all with random hub data access.
The one line dcache also does not help with random hub data access. (four line would be a little better for non-random access)
Using slots would make a big difference. Potentially 4x+ faster (assuming three hub accesses + icache line fetch within 8 cycles)
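The "4x+" figure is straight slot counting: one 8-clock hub rotation normally carries a single random access for a cog, and each donated slot adds one more. A trivial sketch of that arithmetic (the cycle numbers come from the 8-cog rotation; the function is illustrative):

```python
def hub_accesses_per_rotation(extra_slots):
    """Random hub accesses a cog can complete in one 8-cycle hub
    rotation: its own slot plus any donated ones."""
    return 1 + extra_slots

# Normal cog: 1 access per rotation. With 3 donated slots it can fit 4
# (e.g. three data accesses plus an icache line fetch), hence "4x+".
speedup = hub_accesses_per_rotation(3) / hub_accesses_per_rotation(0)
print(speedup)  # 4.0
```

Note this models latency (random accesses per window), not raw bandwidth, matching the distinction drawn above: RDOCT/WROCT already saturates bandwidth, so spare slots only help the random-access case.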
Are you guys REALLY against PropGCC, Spin, JavaScript, Forth and all other VMs running much faster?????
Out of borderline case worries, which can be addressed with warnings?
I suggest we start with:
- (compiler checked) ONE hungry cog at a time (without advanced mode enabled, perhaps only in PASM options allow more than one HUNGRY cog)
- the single hungry cog can use any spare cycles
We cannot default to "other cogs may not use my spare hub cycles" as that would require a developer to modify every single object he uses, and waste a LOT of developer time.
If it does not take too much time to implement, a "DISABLEHUNGRY" hub instruction could be used for testing behaviour of hungry cogs without spare cycles.
Later, after the shuttle run, we could explore priorities etc for spare slots.
Donating slots to a specific cog would allow more control, but is not necessary for the first pass.
Even the simple first pass would be a HUGE win, and greatly improve Spin/C/etc performance, and make P2 more competitive.
Now I am going back to working on marketing material for my latest product
Very easy. Who is the author of the compilers? Perhaps Chip for Spin and David for GCC? Two persons are enough to accomplish that.
The same for the Prop ... many here are debating what they want and how to do it ... in the end the silicon's contents will be the result of one person's will, won't they?
Actually, Eric Smith did most of the GCC porting. In any case, PropGCC is mostly GCC which is a standard code base that we don't really want to touch any more than we have to. Our contribution has been to write a backend for the compiler, assembler, and linker to generate code for the Propeller and providing library and runtime support. As Heater says, it isn't really possible to guarantee that only one "hungry" COG is started. That would require a full dynamic analysis of the entire program which would be particularly difficult with C because each module is compiled separately.
If we do this, it needs to be bonehead simple. I propose the same as Bill. Hungry is modal per COG, default is off. Override mode per P2 chip for "no hungry regardless" to allow for testing and binary blobs that may not be modifiable. (God I hate that we will have those, but they are necessary...)
Expectations need docs, use cases, etc... I think we have some great discussion going and I personally will work on those, because that is my primary concern.
That's it, no complex modes, pairing, etc... Not needed really.
I like the main program running quickly use case. I don't think much of multiple hungry COGS, but then again, there are a lot of timing dynamics in the P2 we are going to have to get sorted for people. Work and writing code and docs to be done.
I've said what I need to. I really do not think much of the crappy characterizations, nor of the authoritative, general statements made during the course of the discussion. I really do like the basic analysis Bill did. (thank you)
This simple case has resolved for me. I think it can and needs to be framed properly and I think it can be.
Finally, I was asked about what "do it right" means.
It means not reaching for the speed throttle when the reality is properly expressing the problem as both the timing problem and parallel problem it really should be on a Propeller.
Early on, many of us took some time to realize what a multi-processor means. Chip boiled it down to timing, and that boils down to thinking through and planning how things need to work. That is likely the biggest disconnect people will have with a feature like this, because it will be seen as an out from the harder multi-COG interaction and coordination we all know makes the chip sing for us.
Cheers all! I'm not going to comment on this again, unless I see additional complexity being advocated. We really don't need it.
It is not possible to enforce that "only one COG hungry rule".
I'm going to assume that my language provides a function/method/macro/intrinsic to allow me to use whatever hungry instructions I need. Like we have for many other hardware-specific features.
If I write a function that contains a "HUNGRY" instruction the compiler cannot tell which COG will end up running that function. Or how many running COGS will call the function that contains that "HUNGRY" statement. Or actually if that HUNGRY statement ever gets executed in the whole execution life of the program.
Static analysis by a compiler cannot determine if that HUNGRY is used, or used more than once, at run time.
Such a rule would require hardware support. Think of it like claiming a lock. And please God let's not go there.
This way I can guarantee the master cog can get 200% hub cycles (donate submissive). This is then deterministic for the master cog.
Why? I may have a program or driver that requires a very quick response. By deliberately coding a pair of cogs, I can control this - this is something I currently cannot do. And there is no impact on the other 6 cogs. In fact, using this concept, I could write all cog code to donate submissively to 1 or more masters. So I could have 1 master cog capable of using all 8 slots per hub cycle.
In all cases, the user must deliberately code for this, but do not deny him the opportunity to do this for some irrationally perceived notion that it is bad.
Chip has those cases covered in post #3156: http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223254&viewfull=1#post1223254
In this case it looks like the master cog would use setting 010 and the submissive donor would use 100.
C.W.
With so many postings it becomes "information overload" sometimes.
Thanks C.W. for the memory JOG!
Yes, looks good, if that is still current on how it works. ( A few days is a long time in P2 iteration-land. )
Cluso by coding an object that uses 2 cogs its enough that one of them don't use any hub slot and automatically the other have all the unused slots thus having the double bandwidth guaranteed (the same apply for more cogs). I see no reason to-over complicate that.
If this route has to be taken, then a hungry cog that can benefit from unused slots, and in no way can steal others', is more than enough. The cog would have a local setup to request hungry mode (disabled by default), and a global hungry-disable could be available. That's all.
I would also like this debate to stop, at least until Chip shows his thoughts/ideas on it, and I'd prefer that all you clever guys move on (without wasting silicon) to something perhaps more useful, e.g. HW-assisted USART/SPI/I2C master/slave modes, possibly supporting 4-bit SPI for flash/SD use.
The "default" should be "any can use unused slots" - this is needed to avoid having to modify Obex objects.
Yes, SerDes and Counters are still awaiting a full reveal.
Chip did talk a bit about QuadSPI, where pin-map is probably the most complex part.
Many ideas have been floated on SERDES design and implementation. Same for HUB SLOT allocation.
Chip ultimately will sift thru these suggestions and come up with something that delivers on the broad needs of the user.
I don't see how any of this is "wasting silicon".
HUB EXEC mode was spawned from ideas bounced around this forum, clearly not a waste.
I believe Chip already has some ideas on SERDES in the works.
Me too. I think all the important things have been exchanged and there is some good dialog for Chip to consider. Given time, shuttle, the last few features, Chip is going to have to decide what is worth what.
FWIW: The attempts to marginalize valid concerns as "fear", etc... aren't doing any real good.
Again, I'm not opposed to this feature, but there are concerns. Some of those have been addressed in the discussion, WHICH IS WHY WE HAVE THE DISCUSSION.
Bill
I agree with Cluso's suggestion.
The idea behind the "modes" was to appease the "determinists" out there.
If I can get my hands on "free slots" by any method, that would "make my day"
Cheers
Brian
I thought the increase from Quads to Wide hub reads was the solution that replaced the hub stealing.
I fear that implementing a hungry/greedy mode may take another month or two, and may only speed up unoptimized code by 20..30%. Optimized single-thread code already runs at 100% PASM speed, or you can just run critical code in cog mode.
Andy
Andy,
In the scenarios I can think of where "free slots" would significantly boost performance, none benefit from WIDE reads/writes, as most of the processing is on non-consecutive LONGs (limited by the 32-bit operations).
The problem is highlighted further in multi-tasking apps where tasks are fighting for the one slot.
Also using RDWIDE,SETWIDE in multiple tasks becomes tricky as only one task can use it at a time.
Brian
There is a finite amount of bandwidth to the hub. We can't get away from that fact. It is currently shared equally among COGs. In this way the timing of code in different COGs is totally independent. This decoupling is a big feature of the Prop, I believe.
The "hungry" proposal is to allow at least one COG to use HUB bandwidth that may be unused by the other COGs. Note that this does not affect those other 7 COGs. They continue in isolation as usual.
That "hungry" COG can now run faster if there is bandwidth available. It will run slower if there is not. The hungry COG has a timing dependency on the rest of the system.
You point out that for many uses that is OK. We get more speed if available and we are not worried about timing accuracy in our big hungry COG code.
So far so good. I like the idea. As I said I would be quite happy for that JavaScript interpreter to be "hungry".
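The mechanics described above can be sketched as a toy simulation (my own illustration, not anything from the thread; the 50% demand figure and the simplified slot model are assumptions):

```python
import random

NUM_COGS = 8
WINDOWS = 1000  # simulated 8-cycle hub windows

def simulate(hungry_cog=None, demand=0.5, seed=1):
    """Toy model of the round-robin hub: in each window every cog owns
    exactly one slot. With probability `demand` the owner uses it;
    otherwise the slot goes unused, unless a single hungry cog is
    configured to pick up the spare."""
    random.seed(seed)  # identical demand pattern for every run
    accesses = [0] * NUM_COGS
    for _ in range(WINDOWS):
        for owner in range(NUM_COGS):
            if random.random() < demand:
                accesses[owner] += 1        # owner uses its own slot
            elif hungry_cog is not None and hungry_cog != owner:
                accesses[hungry_cog] += 1   # hungry cog takes the spare
    return accesses

normal = simulate()
hungry = simulate(hungry_cog=0)
print("cog 0 accesses without hungry mode:", normal[0])
print("cog 0 accesses with hungry mode:   ", hungry[0])
print("other 7 cogs unchanged:            ", normal[1:] == hungry[1:])
```

Note the two properties being debated: the hungry cog's access count rises (and varies with the others' demand), while the other seven cogs see exactly the slots they would have seen anyway.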
BUT:
The weird timing interactions are:
I write a hungry program that runs at rate X. "Great" I think and proceed to rely on X for the code I create for the other 7 COGs. BOOM it does not work.
I write a hungry program that expects to run at rate X, but it does not, because the rest of my code is written as HUB execute and saturates the HUB bandwidth.
The whole idea is predicated on not caring if the "hungry" COG can run at full speed or not. Which I agree is a reasonable approach in many situations. The "weird timing interactions" will occur when people get to rely on that full speed for some reason.
So. If this is done at all it needs to be documented:
1) This code may not run at the speed you think it should.
2) Its speed may vary depending on the current load presented by the rest of the COGs.
3) YMMV.
As Potatohead said this is not about Spin or OBEX it's about expectations. Suitably bold face documentation may satisfy that.
Yes indeed. This whole thing about "paired cogs" and "slot donation" etc is perhaps the straw of complexity that will break the camel's back. Both for Chip to design it and for users to ever understand how to program the thing. I feel we should not spend time and effort to add any more complexity that only 0.001% of dedicated Propeller fans will ever understand, let alone use.
ctwardell, Not at all. Labelling a statement as "fearful" is not a constructive form of debate. I have said nothing about "stealing" hub slots yet. That is a whole other can of worms that I refuse to even think about.
How fast does a "hungry COG" run if the other seven COGs are executing directly from HUB?
I ask because I can imagine someone firing up their C compiler and building all their code from scratch, using all COGs and execute from HUB. Seems like a natural thing to do.
Chip has already suggested that execute from HUB will be the natural starting point for most. Which makes sense if you are writing in C and don't want to worry about that tiny COG code size limit.
Chip already has these concepts in what he offered up for review: http://forums.parallax.com/showthread.php/125543-Propeller-II-update-BLOG?p=1223254&viewfull=1#post1223254
Hyperbole? Statements like that are what I consider spreading fear and doubt.
In general I agree, but it may have some use in tuning. Consider the case where a cog, due to the nature of its task, uses either all or none of its slots; let's say in a repeating pattern of 0.25 seconds using them all, 0.25 seconds using none. This could be a HUBEXEC task that does something at regular intervals, so I don't think it is a terribly contrived case. Then say we have a hungry task that is providing a GUI interface. The GUI may experience visible jitter due to the "rhythmic" availability of extra slots. This would be a case where being able to have a cog NOT share any slots might be beneficial.
Your post that I was responding to included the statement "As such it is very important that the actions of one COG do not affect the timing of others" in regard to "off-the-shelf" UART, SPI, I2C code, etc., to me that implies that the so called hungry cog can somehow impact those items, the only way it could do that would be by stealing cycles, which again isn't on the table. The reference to "other" can of worms is a sly way of painting the current suggestion of slot sharing as a can of worms without directly saying it.
C.W.
Yes, you've pointed out a case where slot sharing will be of very little benefit. Unless some of the cogs are running hub code that spends a lot of time in tight loops that fit in the cache, each cog will be using the vast majority of its own slots.
Since SPIN is going to use HUBEXEC the use of HUBEXEC will be pretty high, so this is a valid concern. I do think that having ALL 8 cogs doing HUBEXEC code would be a rare case, and having just one cog donating slots would provide some performance benefit.
C.W.
By doing so you'll deliberately slow down the other cog. This will end with someone taking code written for a cog and inserting only the instruction to give away 1/4 of its slots (the easy way). Bang, the cog has issues (and its owner is seeking help on the forums).
If you are so smart/clever a coder that you need this functionality, you can adapt the same code so that hub access happens in 1 window out of four. By doing so you will have acquired knowledge of someone else's code, understood how it works, and seen whether this is doable while avoiding issues. The end result is the same, even without a sharing (donation) option.
If you are writing from scratch, then again you do not need a sharing option, because the code can be written to not use all of the slots. If you are not able to accomplish that, take your time to study first, to understand how it works, and in the meantime stay out of hungry modes.
They made many posts to reach 500,000 posts even if they don't have much to say.
Very easy. Who is the author of the compilers? Perhaps Chip for Spin and David for GCC? Two persons are enough to accomplish that.
The same goes for the Prop ... many here are debating what they want and how to do it ... in the end the silicon contents will be the result of one person's will, won't they?
-Tor
Andy:
"I fear that implementing a hungry/greedy mode may take another month or two, and may only speed up unoptimized code by 20..30%. Optimized single-thread code already runs at 100% PASM speed, or you can just run critical code in cog mode."
I do not want any delays either, which is why I was asking only for the simplest case (hungry cogs can use spare slots, no prioritization, no pairing, dead simple); based on earlier messages, this would be quick for Chip to implement. We could then explore the prioritization etc AFTER SERDES.
More on the performance benefits below.
Heater:
I understand your concerns, but do not share them. There are many other limits (pixel clock, waits etc) that users also have to respect.
See above - I was suggesting the simplest implementation for now.
More below.
dMajo:
I like your proposed rule!
If the default is that the compiler/linker only allows one hungry cog (unless over-ridden by an "advanced" option) then one cog could freely use whatever spare slots are available, without impacting "normal" cogs, and no one has to worry about two hungry cogs contending for extra hub slots.
David:
Discussions are good; Chip makes the final call for what he has time to put in.
ALL:
My overriding goal is to make the P2 the most competitive and fastest it can be, while being consistent with the April shuttle run.
That is why I have been suggesting the simplest HUNGRY concept.
I consider the worries about some hypothetical user possibly getting confused by running multiple HUNGRY cogs and not getting the performance he expects to be overblown, as it can be addressed by warnings.
Any developer who reads the warnings on HUNGRY, and proceeds to ignore them, and not test the final app deserves what he gets. Sorry if that sounds cruel.
dMajo's suggestion is a good one, and while it cannot be perfectly enforced, compilers and linkers can definitely detect the presence of "SETMODE_HUNGRY" in more than one binary blob. An advanced option could disable this.
More advanced priorities etc will be great in P3 - while I think Chip may have time for the simple scheme, a complicated scheme may cause delays.
I think there is a misunderstanding out there.
HUNGRY mode is NOT about more bandwidth for cogs. With RDOCT/WROCT, a cog can read/write 8 longs in 8 clock cycles (plus maybe an overhead before it hits its stride and locks to the hub). Donated, or spare slots will not help at all with raw bandwidth.
HUNGRY helps greatly with latency, by potentially providing more random access slots to the hub within the normal 8 cycle window.
This would not help display drivers.
It would GREATLY help Spin, compiled C/C++/whatever code, other VM's, other interpreters.
People should not be writing cycle-accurate deterministic code in hubexec or any interpreted code!
The biggest difference will be to VM's, followed by compiled hubexec code.
The four line icache does not help at all with random hub data access.
The one line dcache also does not help with random hub data access. (four line would be a little better for non-random access)
Using slots would make a big difference. Potentially 4x+ faster (assuming three hub accesses + icache line fetch within 8 cycles)
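The latency argument can be put in numbers with a small sketch (my own illustration; the specific slot phases chosen are assumptions, not the actual P2 scheme):

```python
# Average wait, in clock cycles, from a hub request to the next usable
# slot, assuming requests arrive uniformly at any phase of the 8-cycle
# hub window. usable_slots is the set of window phases the cog may use.
def avg_wait(usable_slots):
    waits = [min((s - phase) % 8 for s in usable_slots)
             for phase in range(8)]
    return sum(waits) / 8

print(avg_wait({0}))           # one owned slot per window   -> 3.5
print(avg_wait({0, 4}))        # plus one donated slot       -> 1.5
print(avg_wait({0, 2, 4, 6}))  # plus three spare slots      -> 0.5
```

Raw sequential bandwidth is unchanged (a slot is still a slot), but random-access latency drops sharply as more phases become usable, which is exactly the hubexec/VM case described above.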
Are you guys REALLY against PropGCC, Spin, JavaScript, Forth and all other VM's running much faster?
Out of borderline case worries, which can be addressed with warnings?
I suggest we start with:
- (compiler-checked) ONE hungry cog at a time (unless advanced mode is enabled; perhaps only PASM options would allow more than one HUNGRY cog)
- the single hungry cog can use any spare cycles
We cannot default to "other cogs may not use my spare hub cycles" as that would require a developer to modify every single object he uses, and waste a LOT of developer time.
If it does not take too much time to implement, a "DISABLEHUNGRY" hub instruction could be used for testing behaviour of hungry cogs without spare cycles.
Later, after the shuttle run, we could explore priorities etc for spare slots.
Donating slots to a specific cog would allow more control, but is not necessary for the first pass.
Even the simple first pass would be a HUGE win, and greatly improve Spin/C/etc performance, and make P2 more competitive.
Now I am going back to working on marketing material for my latest product
Expectations need docs, use cases, etc... I think we have some great discussion going and I personally will work on those, because that is my primary concern.
That's it, no complex modes, pairing, etc... Not needed really.
I like the main program running quickly use case. I don't think much of multiple hungry COGS, but then again, there are a lot of timing dynamics in the P2 we are going to have to get sorted for people. Work and writing code and docs to be done.
I've said what I need to. I really do not think much of the crappy characterizations, nor of those authoritative, general statements made during the course of the discussion. I really do like the basic analysis Bill did. (thank you)
This simple case has resolved for me. I think it can and needs to be framed properly and I think it can be.
Finally, I was asked about what "do it right" means.
It means not reaching for the speed throttle when the reality is properly expressing the problem as both the timing problem and parallel problem it really should be on a Propeller.
Early on, many of us took some time to realize what a multi-processor means. Chip boiled it down to timing, and that boils down to thinking through and planning how things need to work. That is likely the biggest disconnect people will have with a feature like this, because it will be seen as an out, an escape from the harder multi-COG interaction and coordination we all know makes the chip sing for us.
Cheers all! I'm not going to comment on this again, unless I see additional complexity being advocated. We really don't need it.
I'm going to assume that my language provides a function/method/macro/intrinsic to allow me to use whatever hungry instructions I need. Like we have for many other hardware-specific features.
If I write a function that contains a "HUNGRY" instruction the compiler cannot tell which COG will end up running that function. Or how many running COGS will call the function that contains that "HUNGRY" statement. Or actually if that HUNGRY statement ever gets executed in the whole execution life of the program.
Static analysis by a compiler cannot determine if that HUNGRY is used, or used more than once, at run time.
Such a rule would require hardware support. Think of it like claiming a lock. And please God let's not go there.