By extending the complexity slightly...
COGINIT(9, hubaddr/ptraddr, "PAIRED") where paired=1 (default=0=off)
(Cog 9 starts and is loaded, then set to NO HUB SLOTS)
COGINIT(1, hubaddr/ptraddr, "PAIRED")
(Cog 1 can only start if that Cog 9 is either not running or running in paired mode)
(Cog 9 starts and is loaded, then set to NO HUB SLOTS)
- yes, run with Hub released to another COG, is supported in what I have done.
The same command that does 'set to NO HUB SLOTS' can allocate a 4b HubID to determine who gets the released slot.
( id=Self=me)
It can easily be the cog-pair if you wish, but I've coded it so it can be any COG-ID.
That means much more than 2:1 gains are possible ie 100MHz is now possible, on 2 COGs.
Introducing more complexity for more benefits...
By extending the complexity...
COGINIT (9, hubaddr/ptraddr, "SUBMISSIVE")
COGINIT (1, hubaddr/ptraddr, "PAIRED")
Cog 1 will get priority to both cog 1 & 9 slots. If either are not required, Cog 9 can have them.
Yes, I earlier tried this "If either are not required" coded, and the problem there is that it becomes a fetch-time decision.
Sim Results say this design approach is best avoided.
The other design kills this one in this respect. 12.5MHz doubled is nothing compared to what we had before. Seems to me that holds no matter which COGs are going to be used together on this one.
Doubled is doubled...
Assuming Chip implements Hub Exec and sticks with Quads then having the extra opposing slot (0-8, 1-9, etc.) should let you keep the cog busy during straight line code and have reduced wait to fill the cache during branches. It also reduces the time cost of hitting the hub for data due to the reduced latency.
It is about maximizing this chip, not what it can do compared to the previous incarnation of the P2.
C.W.
Precisely. The old P2 is "dead in the water" at least for now. No point in comparing anything.
Slot sharing (by numerous methods) can improve cog performance, both deterministic and non-deterministic. If it's simple and easy to implement, hopefully Chip will do it. There are too many uses not to do it just because it is not liked.
Some of my P1 apps have run out of cogs, and in others I have had cogs to burn, so to speak. But most of my apps could have used extra speed in at least one cog, and they would have benefited from some form of slot sharing, because more hub accesses mean both more hub bandwidth and reduced latency. These cogs were not deterministic, but an average performance increase would be readily seen.
So it is all about big, fast programs. If so, I agree and that needs to be put out there as a "is it worth it?" discussion, IMHO. Byte code, and big programs in general would benefit. But then again, several big programs and or byte codes can be running at the same time too. And that's one thing SPIN programmers on P1 will do frequently, to avoid PASM, if nothing else.
With the round robin, any combination of that always works consistently. Any combination of those programs can be used together and they all perform the same. Without it, we can't say any combination of those programs can be used together. Or if we can, they may not perform the same. The more of them there are, the more likely this is to be true.
Which leaves the concept of "one big fast program" with lots of supporting peripherals. That's a compelling one.
What is worth what? That's the real resolution as I see it.
Regarding examples, Jazzed produced a nice thing, using some code from Cluso I believe, that sampled pins using a small routine launched in up to 4 COGS. Each COG ran the same code, got its COGID, then wrote its sample to a buffer, resulting in a high sample rate.
Video is HUB access needy, and it's needy in terms of taking buffer data from the HUB and pushing it through WAITVID. That's the primary case. Using COGS together for this one is tricky, due to more than one COG having to output a signal using the same pins. We won't be doing that one on this design due to the DACS, unless it's all software, or we connect the pins electrically.
We should ask people what is worth what on that one too. Color depth or resolution or more of both and how much is enough?
It's also needy in terms of dynamically drawn objects being assembled into buffers. Sprites, tiles, etc... A similar approach is used. Fire off lots of COGS, they get their IDs, or take a parameter, and they work on buffers well ahead of the signal cog. HUB fetch, mask, fetch, write is often needed. This one works nicely across COGS.
Etc...
I'm not going to generate more detail than that at present. It's not needed. And my head is into some other thing at the moment.
My point here isn't to fully define the argument, nor stand for "the other side", but to highlight the goal of "maximizing the chip" may well be seen differently, and that's where the conflict is, and it's not an entirely technical thing, and I think I've made that point beyond FUD, etc...
Edit: Propalzer, or something like that. I used it once to analyze logic states on my Apple 2. Very cool.
I want it resolved. Given the sentiment surrounding "the beast", and man I really loved the beast too, and the overall response from Chip being conservative and practical, I'm not inclined to push this one at all. I'm also not convinced the HUB throughput and latency is the key to success, though it is compelling.
Finally, I did feel it worth it to expand on the differences some, just so people are talking more. I did put the big blob of text there, and it got taken badly. Wasn't my intent, and people know that now.
The alternative proposals (more complex) are to do some form of more generic slot mapping.
As I see it, there are two possibilities (while conserving the default 1:16).
(a) Keep the 16 slots and add a cog# table for each slot
(b) Extend this to 32 slots and add a cog# table for each slot
Both these methods start off with 16 slots numbered 0-15 (or 32 with 0-31) and the table is filled with 0..15 (or 0..15 + 0..15).
This is likely the only real way we can keep decoding fast.
For now, just use 16 slots...
If I want to do slot pairing (simplest) my example pair of 1 & 9 means my table gets "1" placed in both slots "1" & "9".
Then, as the added benefit, I could then presumably load slots 1 & 5 & 9 & 13 with "1" each and cog 1 would now get 4:16 (1:4) slots (4x hub access).
Of course, now cogs 5 & 9 & 13 do not get slots unless we further add an "if unused" case/table, which is more complex.
Mooching other than a single cog (unless you do 2 cogs with even/odd slot access), complicates things further, due to a "if then elseif then elseif then" scenario. More logic can of course decode this quicker.
Maybe in this case, what we need is a double-layer table, with the first layer being the cog# to get access to the slot, and the second layer being the alternate cog# to be offered the slot if the primary cog does not require it. By default the first layer is set on a one-to-one slot-to-cog basis, and the second layer is set to "none" as a special value.
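The double-layer table can be sketched in a few lines of Python. This is a simulation sketch only - the names and the grant rule are one reading of the proposal, not actual hardware:

```python
# Simulation sketch of the double-layer slot table (names illustrative).
NSLOTS = 16

primary   = list(range(NSLOTS))   # layer 1: slot n -> cog n by default (1:16)
secondary = [None] * NSLOTS       # layer 2: alternate cog, "none" by default

def grant(clk, wants_hub):
    """Cog granted the hub this clock; wants_hub = cog IDs requesting access."""
    slot = clk % NSLOTS
    if primary[slot] in wants_hub:
        return primary[slot]
    return secondary[slot]        # may be None: the slot simply goes unused

# Example: cog 1 takes priority on slots 1 and 9; cog 9 gets either if unused.
primary[9]   = 1
secondary[1] = 9
secondary[9] = 9
```

With the pair 1 & 9 set up this way, cog 1 keeps priority on both slots and cog 9 only gets a slot that cog 1 doesn't want - exactly the "if unused" case that makes this the more complex option.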
That (a) is pretty much exactly what my code does - it only adds a TopUsedCog scan option, and includes a UsesHub boolean.
Only instead of storing this as a central table, each COG owns its own entry, of 5 bits, in each COG Config register, but it behaves as 16 x 4 bit values when 'seen' by the Scanner State engine.
(see my earlier posts, and the test bench for the 100MHz access this design allows)
12.5MHz is a low value on a 200MHz device; 25MHz is better, but given the HW can easily allow the user a choice of up to 100MHz, it would be silly to overlook that.
As already mentioned, fetch-time decisions are best avoided. ie terms like "if the primary cog does not require it" change the code from outside the critical path, to likely impacting the final speed.
One idea mentioned was to limit the total number of slots to the highest cog allocated, or alternatively to the number of cogs allocated. Anything other than 16, or a multiple thereof, has implications for normal obex programs, and this has to be avoided.
But there are other methods as has been discussed. While I have a few minutes, here is some further input...
Simpler single slot table:
When starting a cog, it should check to ensure its own slot is available. If not, then it cannot start (because it cannot load itself from hub, because it has no slots).
So if it can start, it could then either
(a) do nothing - its slot is allocated to itself
(b) store an alternate cog# in its slot (donates its slot to another known cog)
If it donates its slot to another cog, it has no hub access, and if it tries it will wait forever! It must use another method to communicate.
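The donation rule above, and the wait-forever hazard, can be modelled like this - a sketch of the scheme as described, not real silicon:

```python
# Single slot table: entry n names the cog that receives slot n.
slot_owner = list(range(16))      # default (a): every cog keeps its own slot

def donate(cog, to_cog):
    """Option (b): donate this cog's slot to another known cog."""
    slot_owner[cog] = to_cog

def has_hub_access(cog):
    # A cog named nowhere in the table gets no slot: any hub access by it
    # would wait forever, so it must communicate by some other method.
    return cog in slot_owner

donate(9, 1)                      # cog 9 gives its slot to cog 1
```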
jmg: Have to go out. But I am missing the TopUsedCog meaning.
That's a priority encoder that auto-adjusts the scan to match the TopUsedCog ID.
If you want 16, simply use COG16. So it is operationally a safe superset, but it also allows faster/simpler cases too.
When starting a cog, it should check to ensure its own slot is available. If not, then it cannot start (because it cannot load itself from hub, because it has no slots).
In the code I have done, this never happens, as the COG loads its own slot register, and that defaults to Self.
(That was one reason I moved from a central table)
So if it can start, it could then either
(a) do nothing - its slot is allocated to itself
(b) store an alternate cog# in its slot (donates its slot to another known cog)
If it donates its slot to another cog, it has no hub access, and if it tries it will wait forever! It must use another method to communicate.
I have (a) as the default, and only this COG can change its slot register.
It can be coded to donate under timer/flag control.
I see that as safer, and more atomic than a common table, which anyone can 'have at'.
If you want to get slots from unused COGS, they need a tiny (automated?) stub that Starts -> Sets Slot & HubUsed -> Sleeps forever.
Only the smallest number of COGs need to be running to manage the design, saving power.
The SysCLK can likely also be lower, saving further power.
So it is all about big, fast programs. If so, I agree and that needs to be put out there as a "is it worth it?" discussion, IMHO.
Yes, I agree - in fact, I would go further. We should have a discussion about whether this supposed need for "big, fast" programs exists at all on the Propeller. For me, the whole point of the Propeller is as a deterministic symmetric multiprocessor with an abundance of flexible I/O pins, in a package suitable for low-power and embedded applications. You simply can't get that anywhere else. The "big, fast programs" use case actually seems more like an edge case.
And yes, I understand the argument about "but we can have this as well because it costs virtually nothing to add ... so why not do it?". But the reality is that nothing costs nothing. Quite the contrary, in fact - this one wins the trifecta by adding significant time, cost and risk.
Ross.
Actually Ross, I think that in fact the contrary is true... It is minimal time, cost and negligible risk!
And nearly all prop versions I have built have had one larger program that could definitely benefit from more processing power, and is not deterministic. I think this is the biggest benefit we get from this... the ability to do both the big processor and the deterministic simple parallel drivers.
That's a priority encoder that auto-adjusts the scan to match the TopUsedCog ID.
If you want 16, simply use COG16. So it is operationally a safe superset, but it also allows faster/simpler cases too.
OK. Unfortunately I am against this because it changes the operation of all cogs.
For example, if the top cog is 10, then there are only 11 slots allocated, and the default gives 1:11 slots to each cog, and that affects the original determinism because now the default cogs may run faster or slower. Why? Because a driver can be written for the default 1:16 and so is coded with enough instructions between hub accesses to just catch a slot every 16 clocks. Now, because the hub comes early, the slot is missed and the code has to wait for the next slot, meaning a slot every 2x11=22 clocks. Therefore the code is slower.
The slot defaults must remain to be 16 (or a multiple of 16) to maintain deterministic code. I would not like users to be forced to start cog 15 (not 16!) to maintain a 16 clock loop.
So just keep the original 16 slot loop, and vary who gets each slot.
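The slower-code effect above is just ceiling arithmetic; a quick check of the 1:11 example (a worked model of the argument, nothing more):

```python
def effective_period(loop_clocks, slot_period):
    """Clocks between hub accesses for code that is ready loop_clocks after
    its last access, when slots arrive every slot_period clocks."""
    slots_waited = -(-loop_clocks // slot_period)   # ceiling division
    return slots_waited * slot_period

# A driver tuned to just catch a slot every 16 clocks:
effective_period(16, 16)   # 16: deterministic, as written for the default
effective_period(16, 11)   # 22: the early slot is missed, 2x11 clocks now
```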
In the code I have done, this never happens, as the COG loads its own slot register, and that defaults to Self.
(That was one reason I moved from a central table)
I have (a) as the default, and only this COG can change its slot register.
It can be coded to donate under timer/flag control.
I see that as safer, and more atomic than a common table, which anyone can 'have at'.
If you want to get slots from unused COGS, they need a tiny (automated?) stub that Starts -> Sets Slot & HubUsed -> Sleeps forever.
Sleeps forever ~= cogstop
Yes, I can work with this. It has the advantage over the simplified paired cogs that it can work with the other cog to dynamically share the slots.
Another advantage is that more than 1 slot can be donated.
So I could have a video driver cog and a games cog. By jointly timing both cogs, the video cog could be given both (or more) hub slots when filling the display buffer, and the games cog could get all the hub slots at other times.
It's easy to implement with a single instruction SETSLOT n which sets this cog's slot to cog n, where n=0-15 (which may be me and is the default).
This method does not allow for utilising any unused slots, or having priority slots. While I would like one level for a secondary cog table (if the primary cog did not require its slot), I can live without it. However, I think Bill might have an issue without mooching.
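The SETSLOT idea is small enough to model directly. The instruction name comes from the post; the register behaviour here is an assumption:

```python
NCOGS = 16
slot_reg = list(range(NCOGS))   # each cog's slot register defaults to self

def setslot(cog, n):
    """SETSLOT n, executed by `cog`: point this cog's slot at cog n."""
    assert 0 <= n < NCOGS
    slot_reg[cog] = n           # only the owning cog changes its own entry

VIDEO, GAMES = 1, 2
setslot(GAMES, VIDEO)           # filling the display buffer: video gets both
setslot(GAMES, GAMES)           # afterwards the games cog takes its slot back
```

This matches the video/games example: by jointly timing both cogs, the donation can flip back and forth under program control.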
The slot defaults must remain to be 16 (or a multiple of 16) to maintain deterministic code. I would not like users to be forced to start cog 15 (not 16!) to maintain a 16 clock loop.
So just keep the original 16 slot loop, and vary who gets each slot.
??
It is easy to have a global control for 16, and why penalize those designs that need less than 16, when it is so easy to set 16 any time someone wants it ?
My design can deliver 100MHz HUB BW, using just 2-3 COGs, simple and under full user control, whilst your idea of locking to 16 forces COGs to be used to give bandwidth - both clumsy and higher power - and the locked constraint means it cannot cover all mappings.
On my design the simple benchmark is easy with 3/2/3 COG alternate, for 100MHz interleave - but with fixed 16, you have gaps - eg 7 cogs + 7 cogs + 1 control cannot deliver an evenly spaced 100% to each channel, so it fails the simplest application test.
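The 7+7+1 gap claim is easy to verify: even spacing for a channel on a fixed wheel needs the channel's slot count to divide the wheel size. A simple divisibility check, assuming "evenly spaced" means a constant slot interval:

```python
def evenly_spaced(wheel_slots, channel_slots):
    """Can channel_slots be placed at a constant interval around the wheel?"""
    return wheel_slots % channel_slots == 0

evenly_spaced(16, 8)   # True: every 2nd slot
evenly_spaced(16, 7)   # False: 7 slots on a 16 wheel always leave gaps
```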
Sleeps forever ~= cogstop
Yes, I can work with this. It has the advantage over the simplified paired cogs that it can work with the other cog to dynamically share the slots.
Another advantage is that more than 1 slot can be donated.
So I could have a video driver cog and a games cog. By jointly timing both cogs, the video cog could be given both (or more) hub slots when filling the display buffer, and the games cog could get all the hub slots at other times.
It's easy to implement with a single instruction SETSLOT n which sets this cog's slot to cog n, where n=0-15 (which may be me and is the default).
Yes, either an instruction, or a 5 bit field in a Config register.
I designed 5 bits, to allow a 4b CogID and a flag for UsesHUB - that allows a (master) COG to trigger a timed release of BW, and then reclaim it. The COG can run, and not use a HUB Slot.
In my benchmark case, it releases via that 5th bit to give 100MHz to 2 COGS, for burst R/W then can flip to get 67MHz for post burst W/R.
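The 5-bit field packs down to one shift and a mask (layout as described: bit 4 = UsesHUB, bits 3..0 = the 4b CogID; the packing helpers themselves are illustrative):

```python
def pack(uses_hub, hub_id):
    """Build the 5-bit per-COG config field: UsesHUB flag + 4b CogID."""
    assert 0 <= hub_id < 16
    return (int(uses_hub) << 4) | hub_id

def unpack(field):
    return bool(field >> 4), field & 0xF

release = pack(False, 3)   # run without a slot, releasing BW to cog 3
```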
This method does not allow for utilising any unused slots, or having priority slots. While I would like one level for a secondary cog table (if the primary cog did not require its slot), I can live without it. However, I think Bill might have an issue without mooching.
Sure, having more control would be nice, but any fetch-time decision is on the wrong side of the scanner, and impacts the operate speed.
State-engine control of the HUB slot on the D-side of the state-engine FFs has no speed cost, provided the State Engine can reach 200MHz - with a 4-bit counter/state logic, that is relatively easy.
It is easy to have a global control for 16, and why penalize those designs that need less than 16, when it is so easy to set 16 any time someone wants it ?
My design can deliver 100MHz HUB BW, using just 2-3 COGs, simple and under full user control, whilst your idea of locking to 16 forces COGs to be used to give bandwidth - both clumsy and higher power - and the locked constraint means it cannot cover all mappings.
On my design the simple benchmark is easy with 3/2/3 COG alternate, for 100MHz interleave - but with fixed 16, you have gaps - eg 7 cogs + 7 cogs + 1 control cannot deliver an evenly spaced 100% to each channel, so it fails the simplest application test.
While I agree with your arguments, it breaks being able to use all objects without consideration.
This is the single biggest objection to slot sharing by others, and here I share their concern. That is why I don't want to break it, and hence why the 16 slot (or multiple) must remain. None of the other propositions break determinism like this does.
Err which part of OPTIONAL do you not understand ? The user has control. The rules are very simple.
A wry smile is needed at claims of "break determinism" when my design can do both 16 and 100%/ 100MHz with determinism in my test case, and yet your no-choice fails determinism in my test case.
Actually Ross, I think that in fact the contrary is true... It is minimal time, cost and negligible risk!
And nearly all prop versions I have built have had one larger program that could definitely benefit from more processing power, and is not deterministic. I think this is the biggest benefit we get from this... the ability to do both the big processor and the deterministic simple parallel drivers.
No, I don't think so. I agree that your solution may be simple enough to be considered (at least the simpler variant may be), and also that the cost may be relatively low - but it is not zero, and it does add complexity, and so it does bear risk, and it will delay the P16X32B. How much? Well, only Chip could really say with any degree of accuracy at all.
At least your proposal preserves determinism, and does not lead to potentially incompatible objects in the OBEX, as some of the other proposals do. But I still question whether it is worthwhile - the resulting Propeller will not be a "big processor" by modern standards (even with this feature). All you are effectively doing is turning a 16 cog chip into a chip with fewer usable cogs, but where some cogs execute faster - but only if they make a lot of Hub accesses!
But we may be better off just simplifying the chip and just going for a faster overall clock speed - this would then benefit all objects and all programs, not just ones that use Hubexec.
So I am questioning whether any slot sharing is worthwhile. I think I'd rather just have a simpler and faster chip ... and get it sooner! But increasingly, I am thinking we don't need any improvements at all over the basic P16X32B, given the faster clock speed, more pins, more cogs and more Hub RAM this chip already promises over the P1.
For me, the whole point of the Propeller is as a deterministic symmetric multiprocessor with an abundance of flexible I/O pins, in a package suitable for low-power and embedded applications. You simply can't get that anywhere else.
I agree with this, which is why I take the trouble to trial Verilog designs to improve the HUB bandwidth.
12.5MHz is really quite low, on a 200MHz fSys device sold as "a deterministic symmetric multiprocessor with an abundance of flexible I/O pins"
Where do you get the 12.5MHz from?
EDIT: Ok I assume you are simply saying that each cog gets a hub access every 16 clocks, so 200MHz divided by 16. But even with quite simple LMM techniques and RDQUAD, we should be able to get close to 25MHz per cog - around 2.5 times faster than the P1. And we have sixteen cogs available!
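For reference, these figures come straight from the slot arithmetic (a trivial helper, numbers only):

```python
def hub_rate_mhz(sysclk_mhz, slots_owned=1, wheel=16):
    """Hub transfers per second (MHz) for a cog owning slots_owned of the
    wheel's 16 slots, one transfer per slot."""
    return sysclk_mhz * slots_owned / wheel

hub_rate_mhz(200)                 # 12.5: the default 1:16 on a 200MHz clock
hub_rate_mhz(200, slots_owned=2)  # 25.0: a cog holding two of the 16 slots
```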
Cogs can be started and stopped at any time. That means the COGS in use at any moment may have non-sequential IDs. There can well be gaps. In the extreme, something like COG0 and COG15 could be running and none of the ones in between.
What does "TopUsedCog" mean then?
Normally we don't need to know the IDs of any COGs. We just fly with whatever we are allocated. I'd like that to remain true.
EDIT: Ok I assume you are simply saying that each cog gets a hub access every 16 clocks, so 200MHz divided by 16. But even with quite simple LMM techniques and RDQUAD, we should be able to get close to 25MHz per cog - around 2.5 times faster than the P1. And we have sixteen cogs available!
Do we need 200 MHz (2.5 times more frequency clock than the P1 at 80 MHz) to get it 2.5 times "faster" ?
Whether these advanced programs/objects are made available to the masses likely depends on those who argue against these features. Currently, those against do not want any of these objects made available (via OBEX) for fear of the unknown (no tangible technical evidence has been provided by those against).
I am quite happy not to provide my Objects this way, if it means getting access to these features.
The bus needed for full pin flexibility is too expensive at this 180nm process physics.
If we keep COGS equal, it means doing a COGINIT or COGRUN X to put the code where the pins are, and it means doing multi-COG video tricks to the same pins will require a physical connect on the PCB.
Non PLL DAC ops sans WAITVID work with any pin as expected.
As I see it, we have two camps representing two overlapping views of how the P1+ would be used. One camp is concerned that putting hub access under program control (assuming it is indeed easily implementable) will lead to overuse and a breaking of the easy determinism and independence of library objects. The other camp is concerned that the absence of this feature will lead to applications for the P1+ being out of reach due to insufficient hub data throughput and/or insufficient cog throughput due to the bottleneck of hub access.
In one case, we're mostly looking at objects individually, somewhat in isolation. In the other case, we're mostly looking at overall programs and the global assignment of cogs to functions and allocating hub access globally. Both are legitimate goals, but won't overlap very much.
I wonder whether we can add some support to compiled programs to facilitate this. The simplest thing would be to mark all objects as to whether they: 1) make use of dynamic allocation of hub access slots or not; 2) require a fixed 1:16 cog to hub access ratio. We may be able to come up with some standards for objects to specify the conditions under which they will run properly when hub access slots are allocated dynamically. We may end up with one or more objects that do hub management.
I like the idea of having a 4-bit cog number register for each hub access slot with these being initialized to the corresponding cog numbers on a reset so we have the fixed 1:16 relationship as a default. Once you allow dynamic hub access allocation, I don't see a strong need to enforce good behavior in the hardware. I would be happy with an instruction that takes a cog number and a bit mask and sets the hub access slots corresponding to the bit mask so the specified cog uses them (and leaves the others alone). That way, a cog could dynamically change its own slot usage or change another cog's usage. There's risk that a program could hang or not function properly, but the default would be what's expected and we could create library routines to manage this correctly.
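Phil's mask instruction might look like this in a behavioural model (the names and the 16-bit mask convention are assumptions):

```python
slot_table = list(range(16))   # reset default: slot n -> cog n, the 1:16 rule

def set_slots(cog, mask):
    """Point every slot whose mask bit is set at `cog`; leave the rest alone."""
    for slot in range(16):
        if mask & (1 << slot):
            slot_table[slot] = cog

set_slots(1, (1 << 1) | (1 << 9))   # give cog 1 both slot 1 and slot 9
```

Because untouched slots keep their prior owners, the power-on state is exactly the fixed 1:16 relationship, and any cog can reprogram its own usage or another cog's - with the hang risk, as noted, left to library routines to manage.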
Actually Ross, I think that in fact the contrary is true... It is minimal time, cost and negligible risk!
And nearly all prop versions I have built have had one larger program that could definitely benefit from more processing power, and is not deterministic. I think this is the biggest benefit we get from this... the ability to do both the big processor and the deterministic simple parallel drivers.
Why not then just use the unused hub windows from other cogs (hungry or mooch or whatever it is called)? This will give you more bandwidth (if available) in a non-deterministic way, and at the same time warrant that every started cog can deterministically use its own slot.
I'm pretty sure that simplicity is no longer with us because of the fast DAC's with 4 pins assigned to each cog.
C.W.
This is one more reason to discard the TopUsedCog or Power2 idea. Many times the pin function is determined by hw requirements, in order to simplify the pcb routing. As a consequence this will then determine which cog must be used. I do not want to be forced to use different pins because the naturally selected/paired cog has the wrong hub bandwidth due to its ID.
Ah, ok, my test code can work at a pair level if you want it to.
Possibly, but I tend to ignore non-technical posts, and focus on where the delay and logic cost impacts are, in some test Verilog.
Your other post :
As you say, this is fully deterministic.
FWIW I am happy with "simple pairing".
The alternative proposals (more complex) are to do some form of more generic slot mapping.
As I see it, there are two possibilities (while conserving the default 1:16).
(a) Keep the 16 slot and add a cog# table for each slot
(b) Extend this to 32 slots and add a cog# table for each slot
Both these methods start off with 16 slots numbered 0-15 (or 32 with 0-31) and the table is filled with 0..15 (or 0..15 + 0..15).
This is likely the only real way we can keep decoding fast.
For now, just use 16 slots...
If I want to do slot pairing (simplest) my example pair of 1 & 9 means my table gets "1" placed in both slots "1" & "9".
Then, as the added benefit, I could then presumably load slots 1 & 5 & 9 & 13 with "1" each and cog 1 would now get 4:16 (1:4) slots (4x hub access).
Of course, now cogs 5 & 9 & 13 do not get slots unless we further add an "if unused" case/table, which is more complex.
Mooching by anything other than a single cog (unless you do 2 cogs with even/odd slot access) complicates things further, due to an "if then elseif then elseif then" scenario. More logic can of course decode this quicker.
Maybe in this case, what we need is a double-layer table, with the first layer being the cog# to get access to the slot, and the second layer being the alternate cog# to be offered the slot if the primary cog does not require it. The default is the first layer set on a one-to-one slot-to-cog basis, and the second layer set to "none" as a special value.
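To make the double-layer idea concrete, here is a minimal Python model of that scanner decision - my own sketch, not real silicon, and the names (`grant`, `wants_hub`, `NONE`) are mine. Layer one names the primary cog for each of the 16 slots; layer two names an optional alternate offered the slot when the primary does not want it:

```python
# Hypothetical model of the proposed two-layer slot table (names are mine).
NONE = None  # special value: no alternate cog for this slot

def make_default_tables():
    # Reset default: slot n belongs to cog n, no alternates (the 1:16 case).
    primary = list(range(16))
    secondary = [NONE] * 16
    return primary, secondary

def grant(slot, primary, secondary, wants_hub):
    """Return the cog granted this slot, given which cogs want hub access."""
    p = primary[slot]
    if wants_hub[p]:
        return p
    s = secondary[slot]
    if s is not NONE and wants_hub[s]:
        return s
    return p  # nobody takes it; the slot nominally stays with the primary

primary, secondary = make_default_tables()
# Pairing example from the text: slots 1, 5, 9 and 13 all point at cog 1,
# giving cog 1 a 4:16 (1:4) share of hub windows.
for slot in (1, 5, 9, 13):
    primary[slot] = 1
wants = [False] * 16
wants[1] = True
assert [grant(s, primary, secondary, wants) for s in (1, 5, 9, 13)] == [1, 1, 1, 1]
```

Note that the `wants_hub` test is exactly the fetch-time decision objected to later in the thread - in hardware that check sits in the critical path, which is why the simpler donate-at-configure-time scheme avoids it.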
That (a) is pretty much exactly what my code does - it only adds a TopUsedCog scan option, and includes a UsesHub boolean.
Only instead of storing this as a central table, each COG owns its own entry of 5 bits in its COG Config register, but it behaves as 16 x 4-bit values when 'seen' by the Scanner State engine.
(see my earlier posts, and test bench for 100MHz access this design allows )
12.5MHz is a low value on a 200MHz device; 25MHz is better, but given the HW can easily allow the user a choice of up to 100MHz, it would be silly to overlook that.
As already mentioned, fetch-time decisions are best avoided. ie terms like "if the primary cog does not require it" change the code from outside the critical path, to likely impacting the final speed.
But there are other methods as has been discussed. While I have a few minutes, here is some further input...
Simpler single slot table:
When starting a cog, it should check to ensure its own slot is available. If not, then it cannot start (because it cannot load itself from hub, because it has no slots).
So if it can start, it could then either
(a) do nothing - its slot is allocated to itself
(b) store an alternate cog# in its slot (donates its slot to another known cog)
If it donates its slot to another cog, it has no hub access, and if it tries it will wait forever! It must use another method to communicate.
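A small sketch of options (a) and (b) above, modelled on the per-cog register scheme described elsewhere in the thread (each cog owns its own slot entry rather than writing a central table). This is my own illustrative model - the class and method names are hypothetical, not Propeller instructions:

```python
# Hypothetical model: each cog owns a small register naming which cog its
# hub slot is donated to. Default is itself - option (a). Donating it -
# option (b) - leaves this cog with NO hub access until it is restored.
class Cog:
    def __init__(self, cog_id):
        self.id = cog_id
        self.slot_owner = cog_id  # (a) default: keep our own slot

    def donate(self, other_id):
        # (b) give our slot away; any hub op by us would now wait forever
        self.slot_owner = other_id

def slot_grant(cogs, slot):
    # The scanner simply reads each cog's register - no fetch-time decision.
    return cogs[slot].slot_owner

cogs = [Cog(i) for i in range(16)]
cogs[9].donate(1)                 # cog 9 donates its slot to cog 1
assert slot_grant(cogs, 9) == 1   # slot 9 now serves cog 1
assert slot_grant(cogs, 1) == 1   # cog 1 also keeps its own slot
```

Because only the owning cog writes its own register, the scheme stays atomic in the way described below - no other cog can 'have at' a shared table.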
That's a priority encoder that auto-adjusts the scan to match the TopUsedCog ID.
If you want 16, simply use COG16. So it is operationally a safe superset, but it also allows faster/simpler cases too.
In the code I have done, this never happens, as the COG loads its own slot register, and that defaults to Self.
(That was one reason I moved from a central table)
I have (a) as the default, and only this COG can change its slot register.
It can be coded to donate under timer/flag control.
I see that as safer, and more atomic than a common table, which anyone can 'have at'.
If you want to get slots from unused COGS, they need a tiny (automated?) stub that Starts -> Sets Slot & HubUsed -> Sleeps forever.
Only the smallest number of COGs need to be running to manage the design, saving power.
The SysCLK can likely also be lower, saving further power.
Yes, I agree - in fact, I would go further. We should have a discussion about whether this supposed need for "big, fast" programs exists at all on the Propeller. For me, the whole point of the Propeller is as a deterministic symmetric multiprocessor with an abundance of flexible I/O pins, in a package suitable for low-power and embedded applications. You simply can't get that anywhere else. The "big, fast programs" use case actually seems more like an edge case.
And yes, I understand the argument about "but we can have this as well because it costs virtually nothing to add ...so why not do it?". But the reality is that nothing costs nothing. Quite the contrary, in fact - this one wins the trifecta by adding significant time, cost and risk.
Ross.
And nearly all prop versions I have built have had one larger program that could definitely benefit from more processing power, and is not deterministic. I think this is the biggest benefit we get from this... the ability to do both the big processor and the deterministic simple parallel drivers.
For example, if the top cog is 10, then there are only 11 slots allocated, and the default gives 1:11 slots to each cog and that affects the original determinism because now the default cogs may run faster or slower. Why? Because a driver can be written for the default 1:16 and so is coded with enough instructions between hub slots to just catch each 16 clocks. Now, because the hub comes early, the slot is missed and the code has to wait for the next slot, meaning a slot every 2x11=22 clocks. Therefore the code is slower.
The slot default must remain 16 (or a multiple of 16) to maintain deterministic code. I would not like users to be forced to start cog 15 (not 16!) just to maintain a 16-clock loop.
So just keep the original 16 slot loop, and vary who gets each slot.
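The determinism argument above can be checked with a few lines of arithmetic. This is my own worked version (function names are mine): a driver tuned to do 16 clocks of work between hub slots just catches every slot on a 16-slot rotation, but on an 11-slot rotation (TopUsedCog = 10) it misses its slot and waits, ending up at one slot per 2x11 = 22 clocks:

```python
# Model a cog whose slot comes around every `rotation` clocks (phase 0).
def next_slot(now, rotation, my_phase=0):
    # Clock of the next hub window for our cog at or after `now`.
    wait = (my_phase - now) % rotation
    return now + wait

def loop_period(rotation, work_clocks=16):
    # One driver iteration: work_clocks of work, then wait for our window.
    t = work_clocks
    return next_slot(t, rotation)

assert loop_period(16) == 16   # tuned loop just catches each slot
assert loop_period(11) == 22   # slot arrives early, is missed; wait to 2*11
```

So code written against the default 1:16 timing silently slows down on any other rotation length - which is exactly why the 16-slot wheel (or a multiple of it) has to stay.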
Yes, I can work with this. It has the advantage over the simplified paired cogs that it can work with the other cog to dynamically share the slots.
Another advantage is that more than 1 slot can be donated.
So I could have a video driver cog and a games cog. By jointly timing both cogs, the video cog could be given both (or more) hub slots when filling the display buffer, and the games cog could get all the hub slots at other times.
It's easy to implement with a single instruction SETSLOT n, which sets this cog's slot to cog n where n=0-15 (which may be me, and is the default).
This method does not allow for utilising any unused slots, or having priority slots. While I would like one level for a secondary cog table (if the primary cog did not require its slot), I can live without it. However, I think Bill might have issue without mooching.
??
It is easy to have a global control for 16, and why penalize those designs that need less than 16, when it is so easy to set 16 any time someone wants it ?
My design can deliver 100MHz HUB BW using just 2-3 COGS, simple and under full user control, whilst your idea of locking to 16 forces COGs to be used just to give bandwidth - both clumsy and higher power - and the locked constraint means it cannot cover all mappings.
On my design the simple benchmark is easy with 3/2/3 COG alternate, for 100MHz interleave - but with fixed 16, you have gaps - eg 7 cogs + 7 cogs + 1 control cannot deliver an evenly spaced 100% to each channel, so it fails the simplest application test.
I designed 5 bits, to allow 4b CogID and a flag for UsesHUB - that allows a (master) COG to trigger, then timed release BW, and then reclaim it. The COG can run, and not use a HUB Slot.
In my benchmark case, it releases via that 5th bit to give 100MHz to 2 COGS, for burst R/W then can flip to get 67MHz for post burst W/R.
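The bandwidth figures being traded in this exchange all fall out of one ratio. Here is a back-of-envelope check (my own helper, assuming a 200MHz fSys and one hub window per slot in the rotation) reproducing the 12.5MHz default, the 100MHz two-slot interleave, and the ~67MHz three-slot case mentioned above:

```python
# Hub access rate for a cog owning `slots_owned` of `rotation_len` slots,
# assuming one hub window per slot per pass (my simplification).
def hub_mhz(fsys_mhz, rotation_len, slots_owned=1):
    return fsys_mhz * slots_owned / rotation_len

assert hub_mhz(200, 16) == 12.5             # default 1:16 round robin
assert hub_mhz(200, 2) == 100.0             # 2-slot rotation, alternating cogs
assert round(hub_mhz(200, 3), 1) == 66.7    # 3-slot rotation, "67MHz" case
```

The same helper shows the fixed-16 objection: a cog can only ever see multiples of 12.5MHz (200/16) there, so figures like 67MHz or an evenly spaced split across 7+7+1 cogs simply have no mapping.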
Sure, having more control would be nice, but any fetch-time decision is on the wrong side of the scanner, and impacts the operate speed.
State-engine control of the HUB slot on the D-side of the State Engine FFs has no speed cost, provided the State Engine can reach 200MHz - on a 4-bit counter/state logic, that is relatively easy.
This is the single biggest objection to slot sharing by others, and here I share their concern. That is why I don't want to break it, and hence why the 16 slot (or multiple) must remain. None of the other propositions break determinism like this does.
Err which part of OPTIONAL do you not understand ? The user has control. The rules are very simple.
A wry smile is needed at claims of "break determinism" when my design can do both 16 and 100% / 100MHz with determinism in my test case, and yet your no-choice design fails determinism in my test case.
No, I don't think so. I agree that your solution may be simple enough to be considered (at least the simpler variant may be), and also that the cost may be relatively low - but it is not zero, and it does add complexity, and so it does bear risk, and it will delay the P16X32B. How much? Well, only Chip could really say with any degree of accuracy at all.
At least your proposal preserves determinism, and does not lead to potentially incompatible objects in the OBEX, as some of the other proposals do. But I still question whether it is worthwhile - the resulting Propeller will not be a "big processor" by modern standards (even with this feature). All you are effectively doing is turning a 16 cog chip into a chip with fewer usable cogs, but where some cogs execute faster - but only if they make a lot of Hub accesses!
But we may be better off just simplifying the chip and just going for a faster overall clock speed - this would then benefit all objects and all programs, not just ones that use Hubexec.
So I am questioning whether any slot sharing is worthwhile. I think I'd rather just have a simpler and faster chip ... and get it sooner! But increasingly, I am thinking we don't need any improvements at all over the basic P16X32B, given the faster clock speed, more pins, more cogs and more Hub RAM this chip already promises over the P1.
Ross.
I agree with this, which is why I take the trouble to trial Verilog designs to improve the HUB bandwidth.
12.5MHz is really quite low, on a 200MHz fSys device sold as "a deterministic symmetric multiprocessor with an abundance of flexible I/O pins"
Where do you get the 12.5MHz from?
EDIT: Ok I assume you are simply saying that each cog gets a hub access every 16 clocks, so 200MHz divided by 16. But even with quite simple LMM techniques and RDQUAD, we should be able to get close to 25MHz per cog - around 2.5 times faster than the P1. And we have sixteen cogs available!
Ross.
Cogs can be started and stopped at any time. That means the COGS in use at any moment may have non-sequential IDs. There can well be gaps. In the extreme, something like COG0 and COG15 could be running and none of the ones in between.
What does "TopUsedCog" mean then?
Normally we don't need to know the IDs of any COGs. We just fly with whatever we are allocated. I'd like that to remain true.
I'm pretty sure that simplicity is no longer with us because of the fast DAC's with 4 pins assigned to each cog.
C.W.
I may have missed the memo on that little detail.
All I can say is:
"Oh dear".
I think it has been mentioned multiple times, but here is a reference that points in that direction:
http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip?p=1257696&viewfull=1#post1257696
C.W.
Do we need 200 MHz (2.5 times the clock frequency of the P1 at 80 MHz) to get it 2.5 times "faster"?
The bus needed for full pin flexibility is too expensive at this 180nm process physics.
If we keep COGS equal, it means doing a COGINIT or COGRUN X to put the code where the pins are, and it means doing multi-COG video tricks to the same pins will require a physical connect on the PCB.
Non PLL DAC ops sans WAITVID work with any pin as expected.
In one case, we're mostly looking at objects individually, somewhat in isolation. In the other case, we're mostly looking at overall programs and the global assignment of cogs to functions and allocating hub access globally. Both are legitimate goals, but won't overlap very much.
I wonder whether we can add some support to compiled programs to facilitate this. The simplest thing would be to mark all objects as to whether they: 1) make use of dynamic allocation of hub access slots or not; 2) require a fixed 1:16 cog to hub access ratio. We may be able to come up with some standards for objects to specify the conditions under which they will run properly when hub access slots are allocated dynamically. We may end up with one or more objects that do hub management.
I like the idea of having a 4-bit cog number register for each hub access slot with these being initialized to the corresponding cog numbers on a reset so we have the fixed 1:16 relationship as a default. Once you allow dynamic hub access allocation, I don't see a strong need to enforce good behavior in the hardware. I would be happy with an instruction that takes a cog number and a bit mask and sets the hub access slots corresponding to the bit mask so the specified cog uses them (and leaves the others alone). That way, a cog could dynamically change its own slot usage or change another cog's usage. There's risk that a program could hang or not function properly, but the default would be what's expected and we could create library routines to manage this correctly.
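A sketch of the instruction proposed above - names (`reset_slots`, `set_slots`) and the mask convention are mine, not a real Propeller opcode. On reset the per-slot cog registers hold the identity mapping (the fixed 1:16 default); the instruction takes a cog number and a 16-bit mask, points every masked slot at that cog, and leaves the rest alone:

```python
# Hypothetical model of the proposed per-slot cog-number registers.
def reset_slots():
    # Power-on default: slot n -> cog n, i.e. the familiar 1:16 rotation.
    return list(range(16))

def set_slots(table, cog, mask):
    # Point every slot whose mask bit is set at `cog`; leave others alone.
    for slot in range(16):
        if mask & (1 << slot):
            table[slot] = cog
    return table

table = reset_slots()
set_slots(table, 3, 0b0000_0010_0000_0010)   # slots 1 and 9 -> cog 3
assert table[1] == 3 and table[9] == 3
assert table[0] == 0 and table[2] == 2       # untouched slots keep defaults
```

As the post says, nothing here enforces good behavior - a cog can steal another cog's slots and hang it - but the reset state is safe, and a library routine could wrap `set_slots` with whatever policy checks are wanted.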
Why not then just use the unused hub windows from other cogs (hungry or mooch or whatever it is called)? This will give you more bandwidth (if available) in a non-deterministic way, and at the same time will guarantee that every started cog can use its own slot deterministically.
This is one reason more to discard the TopUsedCog or Power2 idea. Many times the pin function is determined by hardware requirements, in order to simplify the PCB routing. As a consequence, this will determine which cog must be used. I do not want to be forced to use different pins because the naturally selected/paired cog has the wrong hub bandwidth due to its ID.