YACSS - Yet another cog sharing suggestion.
kwinn
Posts: 8,697
I wasn't too wild about the super cog idea, since that really kills the "all cogs are (almost) equal" symmetry of the Prop, and there seems to be quite a bit of resistance to the cog slot table suggestion. So how about what I think of as the "fall through" method of hub sharing?
Each cog gets a slot in round robin order just like the P1 does it now.
Each cog has a bit that indicates if it wants a hub access.
If the cog does not want to access hub the slot falls through to the next cog (in round robin order) that wants a hub access.
This maintains cog equality, makes maximum use of hub bandwidth, and is very simple to understand and implement. This would also be simple to combine with a cog slot table to guarantee determinism for those cogs that need it.
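A minimal sketch of the fall-through rule described above, in Python rather than hardware. The function name, the 16-cog count and the request pattern are illustrative assumptions, not anything taken from Chip's design.

```python
NUM_COGS = 16  # assumed cog count for the P2 under discussion

def fall_through_grant(slot_owner, wants_hub):
    """Return the cog granted this hub slot, or None if no cog wants one.

    slot_owner: the cog whose round-robin turn it is (0..15).
    wants_hub:  16 booleans, one "wants hub access" bit per cog.
    """
    for offset in range(NUM_COGS):
        cog = (slot_owner + offset) % NUM_COGS
        if wants_hub[cog]:
            return cog  # owner if it wants the slot, else the next requester
    return None

# Hypothetical load: only cogs 3 and 9 are requesting hub access.
requests = [cog in (3, 9) for cog in range(NUM_COGS)]
grants = [fall_through_grant(slot, requests) for slot in range(NUM_COGS)]
print(grants.count(3), grants.count(9))  # 10 6
```

With this request pattern cog 3 ends up with 10 of the 16 slots and cog 9 with 6, because a requester inherits every idle slot between the previous requester and itself. No slot goes unused, but the extra slots are not split evenly.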
Comments
Not from where I'm sitting.
There are variations and some sub-sets, but pretty much everyone has suggested some form of slot-sharing.
They vary only in how they are coded.
The widest support seems to be for the most flexible, which is no real surprise.
Clearly, ideas that lock to a certain COG are less flexible (as is locking to only a very slow 1:16).
Given the FPGA P&R shows a flexible solution is also easily fast enough, that also makes an easy choice.
As this is re-mapping COGs, it needs to be combined with a Table to make any sense.
In that form it is really just a variant on the Second-Cog choice :
Instead of a second-table-read, the second-choice COG is now chosen on "slot falls through to the next cog"
"next cog" is a little hard to nail down, but you can priority encode the 'Waiting on Slot' bits, and feed that in.
For fun, I've coded that, and it reports P&R of similar speeds to a Dual Entry table. So it can be done, with a Table.
That makes the highest COG, first in the second choice queue, which could tend to starve the lowest cog in the list of bonus slots.
(or vice versa if the encoder is LSB biased)
However, they are bonus slots, so there is still the primary table to allocate minimum Mapping.
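A rough Python model of the priority-encoded second choice described above. The function names and the 16-bit 'Waiting on Slot' mask layout are assumptions for illustration; this is not the Verilog mentioned in the post.

```python
def priority_encode_msb(waiting_mask):
    """Index of the highest set bit in the 16-bit waiting mask, or None."""
    if waiting_mask == 0:
        return None
    return waiting_mask.bit_length() - 1

def grant(primary_cog, waiting_mask):
    """The primary table entry wins if that cog is waiting.

    Otherwise the slot becomes a bonus slot and the MSB-biased priority
    encoder picks the highest-numbered waiting cog.
    """
    if waiting_mask & (1 << primary_cog):
        return primary_cog
    return priority_encode_msb(waiting_mask)

# Hypothetical case: primary table is plain round robin (slot n -> cog n),
# and cogs 2 and 12 are both waiting on every slot.
waiting = (1 << 12) | (1 << 2)
grants = [grant(slot, waiting) for slot in range(16)]
print(grants.count(12), grants.count(2))  # 15 1
```

Cog 2 still gets its one primary slot, but every bonus slot goes to cog 12. That is the starvation effect on the lower cog, and it reverses if the encoder is LSB biased.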
Pros : Saves needing data values in the second-choice table
Pros : Is smarter than a fixed allocate of all 'spare' slots to just one COG
Cons : Loses the all cogs are (almost) equal, and loses the user ability to define any COG allocates.
Cons : Is more of a lottery, but 2nd-choice plans all tend to be lotteries on the bonus slots, and the control boolean can disable 2nd-choice, to restore determinism where that matters.
Cons : More serious: if someone has carefully paced a slot spacing on COG N, to lock with a code loop, they do not want that arriving early, on a random basis.
ie an important feature of a 2nd-choice table is not only who is on it, but also who is not allowed random slots.
Comes back to the common chestnut of user control: remove that, and there is nearly always a case that bites you.
Perhaps "fall through" was a poor choice of name for it. Spin or rotate through might have been better. What I pictured was more like the current hub selection diagram, but with 16 cogs in a circle and the hub selector switch stepping around to each one that wanted a hub access. The selector would rotate in one direction only, skipping cogs that did not need an access. If a cog needed an access (as indicated by the slot table) at a fixed interval the table slot selection would override the "Spin through" selection.
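One possible reading of the rotating-selector picture, sketched in Python. The names are hypothetical, and it is an assumption here that the selector resumes from whichever cog was granted last, including after a table override.

```python
def rotate_through(last_granted, wants_hub, table_cog=None):
    """Pick the next cog for the hub selector.

    last_granted: cog the selector stopped at on the previous slot.
    wants_hub:    one boolean per cog.
    table_cog:    optional slot-table entry for this slot; if that cog
                  wants the hub, it overrides the rotating choice.
    """
    n = len(wants_hub)
    if table_cog is not None and wants_hub[table_cog]:
        return table_cog
    for step in range(1, n + 1):  # rotate one way only, skipping idle cogs
        cog = (last_granted + step) % n
        if wants_hub[cog]:
            return cog
    return None

def simulate(wants_hub, num_slots, start=15):
    """Run the selector with no table overrides and list the grants."""
    last, grants = start, []
    for _ in range(num_slots):
        cog = rotate_through(last, wants_hub)
        grants.append(cog)
        if cog is not None:
            last = cog
    return grants

wants = [cog in (1, 5, 6) for cog in range(16)]
print(simulate(wants, 6))  # [1, 5, 6, 1, 5, 6]
```

With three requesters each cog gets every third slot, so the slot spacing seen by any one cog depends entirely on how many others are asking at that moment.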
Ah, yes, that approach is what I call hot-brick or skipping stone algorithm, and it is easy to say, but actually logically complex to implement. Not impossible, but effectively it is a state engine with (roughly) 16 priority encoders.
I did code a Similar 16 x 16 hot-brick variant, when I was looking at hub-scan choices.
It has 20 input lines, 4 output lines, and 272 lines of equations.
Note that totally smashes any determinism on what it controls, and the Slot rate is highly volatile, from 1-16 - it's basically playing whack-a-mole.
The logic area of this would be larger than a simple change of Table (and some pairing designs do not even double the entries; instead they split the table, eg 32x splits to 16x/16x). Those have almost no extra cost.
That (playing whack-a-mole) doesn't sound good. I pictured it more like two independent tables where the second table consists of unused slots from the first table, and the first table has absolute priority. Have to sleep on this a bit. 04:00 so I need to hit the sack for a while. Lots to do tomorrow morning.
If this is limited to the secondary mapping, where they are bonus slots anyway, it might be tolerable (if made optional).
I'm not sure you gain much over a more direct mapping of the second choice ?
However, there is still the serious issue of a possible early-grant to a COG in the primary list.
That surprise is simply not there, in a dual-table case, as the user selects who is in what slot table.
All the slot sharing ideas do this. Heater's idea at least has the merit of ..
1) addressing the only use case that anyone seems to be able to identify for a fast cog ... i.e. one "SuperCog" executing a high-level language, and fifteen "normal" cogs implementing soft-peripherals.
2) maintaining both absolute (i.e chip wide) and individual (i.e. cog wide) determinism for 15 of the 16 cogs, relaxing determinism only for the "SuperCog" - which doesn't need it anyway.
3) requiring no complex tables or additional instructions.
4) making it possible to "plug and play" with cog objects (as we do on the P1).
Ross.
All ?? Err, totally wrong, but you seem strangely unable to remember any other use cases provided, or the reasons why favouring any single COGid is a problem.
Oh wait, isn't your Catalina "executing a high-level language" ?
There are many other users who will be operating PASM in all those other cogs, and they WILL want control over them.
The 'executing a high-level language' task is in many cases, the least demanding of accurate slot placement.
It is usually the lowest level stuff that needs the most control.
Here is another recent real example, of the problem of fixed Slots, from P1 land
http://forums.parallax.com/showthread.php/118166-1-pin-PS2-Keyboard-amp-1-pin-TV-40x25-Released-on-OBEX?p=1265694&viewfull=1#post1265694
The funniest part of this is that SuperCog0 is a subset of the Table solution, so anyone who wants only that can be given two magic SuperCog0 init values, and they need remember nothing else.
Anyone else, facing a real world issue like the one in the link above, can enter a value to fix it.
SuperCog0 also needs the slowest and most system-complex component in the Table design, yet delivers the least flexible 'solution'.
Correct, and supercog0 also fails to solve the DAC-COG-PIN issues.
Yes and no. The top table has a set of priority CogIDs, but if you only select IDs for 2nd table not in the top table, you miss a minimum level of access.
I think the 'unwanted extra' effect mentioned above could be managed with another boolean, that says 'I want in on the bonus pool' - users can then select which COGids are in the second pool, and they can have IDs in both if they want.
Those COGs demanding fixed access rates would not set their WantBonus flags. Software centric COGS would.
The table can be smaller, but two settings are needed to properly control it. Nibbles: N x 4 and Booleans : 16 x 1
Given this is larger in Logic, and has more COG connections to the AllocBlock, I'd still favour the simpler table with choice of dual.
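A small sketch of the 'I want in on the bonus pool' boolean described above, assuming one WantBonus bit per cog held in a 16-bit mask. The function name and the MSB-biased pick are illustrative assumptions only.

```python
def grant_with_bonus_pool(primary_cog, waiting_mask, want_bonus_mask):
    """Grant the slot to the primary cog, else to a cog in the bonus pool.

    primary_cog:     table entry (a nibble) for the current slot.
    waiting_mask:    16 bits, set for cogs waiting on a hub slot.
    want_bonus_mask: 16 bits, set for cogs that opted in to bonus slots.
    """
    if waiting_mask & (1 << primary_cog):
        return primary_cog
    pool = waiting_mask & want_bonus_mask  # only opted-in cogs compete
    if pool == 0:
        return None
    return pool.bit_length() - 1  # assumed MSB-biased choice

# Hypothetical case: the primary cog 4 is idle; cog 7 needs a fixed rate
# and has not set WantBonus, cog 2 is software-centric and has.
waiting = (1 << 7) | (1 << 2)
want_bonus = 1 << 2
print(grant_with_bonus_pool(4, waiting, want_bonus))  # 2
```

Cog 7 is waiting but never receives an early grant, because it stays out of the pool; it only gets the slots the table gives it.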
Ross.
Move it to the hub and give it its own port to access hub memory.
Attach the video hardware to it so it can drive the display.
Optimize the instruction set for this use or just go with an ARM-based processor.
Have this processor load and start the cogs.
Maybe even add floating point and an MMU to it so we can run Linux.
Then replace the video hardware in the cogs with general purpose shifters for serial use.
However that's for the Propeller version 3, or 4 or ...
Right now I want the KISS Propeller 2 as described in the "Ship It" thread.
The reason I ask is that as I've been reading the discussions about hub slot reallocation, super cogs, super duper cogs, etc., I keep noticing that the reasons for "needing" all that complexity seem to be to increase the bandwidth between the cogs and the hub, as well as speeding up cog to cog transfers. If the P2 has the ability to transfer longs between cogs directly without the hub then that will no longer be a problem. I do know that Chip is working on the ability to copy four longs between the hub and cogs during a transfer window.
If that's the case I see little need or purpose in increasing the bandwidth between the hub and the cogs. It's all fine and good to increase the rate, but if you can't process the data transferred in any meaningful way then there's no point in increasing the data rate.
Agree 100% - in fact, I made much the same point just this morning in the SuperCog thread.
In all the hullabaloo over all these massively complex hub slot sharing schemes, people seem to have lost sight of the fact that in most cases the natural solution (given the Propeller architecture) is that if you need two hub slots you could just use two cogs - if we had a better cog-to-cog communications method!
Ross.
I somehow missed seeing any massively complex hub slot sharing schemes - got any links ?
The ones I have coded are a few lines of Verilog, you clearly must be talking of something else, otherwise you would not have used the words massively complex hub slot sharing schemes ?
Hardly Genius to waste a Shipload of valuable silicon, just to solve what can be fixed with a few lines of Verilog.
- but I'll admit on P1 there was no other choice offered. Hopefully, the next release avoids repeating that.
Yup, better cog-to-cog communications method would be good, no matter what else is in there.
I'm guessing there will be something in the Smart Pins for this, for modest speeds.
I disagree 70%....... Yes, the ability to transfer data between cogs would reduce the need to increase hub bandwidth somewhat. Even if it was only the ability of a cog to communicate with the cog to the left of me and the cog to the right of me it would be a big help.
Problem is that most of the memory is in the hub, and the program that coordinates everything (the master) is also in the hub. Regardless of whether it is a spin program being interpreted by a cog, or a C program using hubex, that cog needs hub bandwidth to run the program.
The rest of the cogs that are performing as I/O devices or little slave processors (slaves) also need enough bandwidth to transfer the data that the master needs to the hub. The master cannot fetch data from the slaves. It's a balancing act, and a fixed allocation will not work in all cases, nor will having a super cog.
The bandwidth has to be divvied up so that each cog gets the hub bandwidth it needs. This is done by the programmer. In the P1 it is done by using multiple cogs when one is not enough. This is wasteful of cogs when all that is needed is a little more hub bandwidth. Fine when we have extra cogs, but better to be able to assign hub bandwidth where it is needed for those times when you do not have an extra cog to do what we do now.
I would hardly call a 5 bit register, a 5 bit counter, and a 32x8 look up table "massively complex". The hardware would be trivial, but the flexibility it gives certainly could be considered massive when compared to the hardware it took to create it.
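To give a feel for how little hardware that is, here is a Python model of a 5-bit counter stepping through a 32x8 table. The post does not spell out how the 8 bits or the 5-bit register are used, so the packing (primary cog in the low nibble, secondary cog in the high nibble) and the use of the register as a wrap point are assumptions made only for this sketch.

```python
class HubSlotTable:
    """32-entry, 8-bit-wide slot table stepped by a 5-bit counter.

    ASSUMPTIONS (not from the thread): each entry packs a primary cog in
    the low nibble and a secondary cog in the high nibble, and the 5-bit
    register holds the last table index used before the counter wraps.
    """

    def __init__(self, entries, last_index=31):
        if len(entries) != 32 or not all(0 <= e <= 0xFF for e in entries):
            raise ValueError("need 32 entries of 8 bits each")
        if not 0 <= last_index <= 31:
            raise ValueError("last_index must fit in 5 bits")
        self.entries = list(entries)
        self.last_index = last_index
        self.counter = 0

    def step(self, waiting_mask):
        """Advance one hub slot and return the granted cog, or None."""
        entry = self.entries[self.counter]
        primary, secondary = entry & 0xF, entry >> 4
        # 5-bit counter: wrap to 0 after the index held in the register.
        self.counter = 0 if self.counter == self.last_index else self.counter + 1
        if waiting_mask & (1 << primary):
            return primary
        if waiting_mask & (1 << secondary):
            return secondary
        return None

# Hypothetical table: every slot is cog 0's, with cog 1 as second choice.
table = HubSlotTable([0x10] * 32, last_index=3)
print(table.step(0b10), table.step(0b11))  # 1 0
```

The per-slot work is one table read and two bit tests, and there is only one copy of it for the whole chip.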
I'm a glutton for punishment this weekend I guess, so here goes.
Good thing Chip isn't reading this...
I think the overall concept of symmetry is one that can produce incredible things in life, etc.
However, symmetry is not the end all, be all, nor is it in any way particularly better/worse than asymmetry in life.
Interestingly, the universe is asymmetrical.
One does not achieve balance through symmetry or asymmetry alone.
Symmetry in the Prop makes a subset of problems easier to solve.
It also makes a subset of problems harder to solve.
We used to have symmetry in networking, called TokenRing.
It's beautiful when operating within its domain, and rather than breaking down under load, will fail in a more graceful manner.
However, it fails to scale, quite horribly.
We have asymmetry in networking, called CDMA. It is the opposite in almost every way, except that it seems to perform/scale quite well.
I see the same apparent behavior in the new Prop, as a symmetrical device.
As a device with a small domain (P1), it's managed to accomplish what it was designed for at the time, and then some.
With the current P2, it seems as though it is reaching a level wherein the failings of symmetry are starting to rear their heads.
At some point, trying to remain wedded to the calling of symmetry-only, is going to lead to the need for substantially more logic and/or power than many potential asymmetrical designs, if it doesn't already.
If logic and power were free, that wouldn't be a problem so much.
However they are not free, so staying true to that paradigm is going to either cost more in some fashion, or reach a scaling limit.
I submit that just as in life/reality, adding asymmetry to the Prop's symmetry paradigm can give better, faster, cheaper gains than the alternative.
Hubsharing is asymmetrical by its nature.
Allowing -some- asymmetry into the Prop can make the domain field it operates in larger/more inclusive of both symmetrical/asymmetrical domains, more cost effective, and overall, a more useful, balanced entity.
/navel-gazing
The claims that introduction of hubsharing will break the Prop's symmetry are factually true.
However, that is all that is true.
Case 1
Factually, if I'm a current large customer who has a product using 8 Cogs and I move to the P2 which hypothetically has hubsharing, how does the now broken symmetry impact me?
It doesn't.
"But "symmetry" is broken!!!" say the Faithful.
"I don't give a rat's a**, it works fine for me" says the large customer
Case 2
Factually, if I'm a current large customer who has a product using 8 Cogs and I move to the P2 which hypothetically has hubsharing, AND I decide to add a feature that uses hubsharing on 2-3 of those unused Cores, how does the now broken symmetry impact me?
It doesn't.
"But "symmetry" is broken!!!" say the Faithful.
"I don't give a rat's a**, I've added a feature that I couldn't get working with conjoined Cores, and it works fine for me" says the large customer
Case 3
Factually, if I'm a current large customer who has a product using 8 Cogs and I move to the P2 which hypothetically has hubsharing, AND I decide to add a feature that uses hubsharing on 2-3 of those unused Cores, AND THEN I decide to add yet another feature that uses hubsharing on 2-3 more of those unused Cores, how does the now broken symmetry impact me?
It doesn't.
"But "symmetry" is broken!!!" say the Faithful.
"I don't give a rat's a**, I've added a feature that I couldn't get working with conjoined Cores, and then done it again out of spite, and it works fine for me" says the large customer
Case 3*
Factually, if I'm a current hobbyist who has a doo-dad using 8 Cogs and I move to the P2 which hypothetically has hubsharing, AND I decide to add a feature that uses hubsharing on 9 unused Cores, how does the now broken symmetry impact me?
It doesn't.
"But "symmetry" is broken!!!" say the Faithful.
"I know, and thats why my doo-dad don't work no more" says the small hobbyist
*OK, a bit tongue in cheek.
But I look at headers/docs before I #include them in stuff I do. I'm just funny that way though.
Yup, 100% correct.
.. and that Counter replaces a 4 bit one, for an even smaller incremental cost, and this needs only a single copy, as it is not COG duplicated.
Methinks I am detecting a note of exasperation here. Does this mean you are in favor of a method of divvying up hub bandwidth that allows symmetry if that is what you want, or asymmetry if that is your desire. Or are you opposed?
A response of "in favor" or "opposed" are the only acceptable responses. ;-)
Heh, not exasperated at all now, though maybe that's what gave me that out of the box sideways moment.
I am strongly in favor of pushing hard for the inclusion of the asymmetry of hubsharing, UNLESS there is a financial imperative on Parallax's side that precludes ANY delay, for any reason.
And so far, outside of forum angst, I haven't seen anything from Ken that intimates such.
Though I'm sure he is far more frustrated than most here, since he has to keep all the balls in the air.
I couldn't agree more with your case examples. Slot-sharing has no impact on those who don't want it, but it gives a huge benefit to those who do.
And the implementation of hub slots is actually quite simple - in fact way simpler than quad transfers, hubexec, multi-tasking (now dead), and many of the other bits/instructions that got added to the older P2 version.
I think you should keep "sharing" separate from "mooching". Mooching impacts no-one, but sharing impacts everyone, since it requires co-operation from the cogs that are losing their slots - which in turn dictates what other objects you can run at the same time.
Ross.
Yes, and as an example of that flexibility I have a use case where a secondary slot should be allocated via table (not FCFS variants)
Primary slots are allocated to HW Critical COGs, where either bandwidth, or slot granularity matter, but that granularity case does not mean the COG will use every one of the possible slots.
Key point: If you know what the Primary COG slot 'use hit rate' is, you can allocate the Secondary slot, and know what the secondary BW of that will be.
eg Taking the 14x example given before for USB+PAL timing, that points to a x2 Primary alloc, but of low actual use %
If those x2 are secondary allocated to two SW COGS, each of those can get 25% of full BW, and the primary COG does not even know they are there. If Quad read is there, that is full HUB Bandwidth, for both, for read-code.
( The other methods of secondary COG alloc discussed above, lack the control of who gets the slot.)
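The arithmetic behind that key point can be written down directly. This is a sketch under stated assumptions: the function name is made up, and the 50% primary share is picked here only because it is one allocation that reproduces the 25% figure; the thread does not give the actual table layout.

```python
def secondary_bandwidth(primary_share, primary_hit_rate, secondary_cogs=1):
    """Fraction of total hub bandwidth each secondary cog can expect.

    primary_share:    fraction of all hub slots given to the primary cog.
    primary_hit_rate: fraction of its slots the primary cog actually uses.
    secondary_cogs:   number of secondary cogs the primary's slots are
                      split between (each listed on an equal number).
    """
    unused = primary_share * (1.0 - primary_hit_rate)
    return unused / secondary_cogs

# ASSUMED numbers: the primary cog holds half of all slots, almost never
# uses them, and the slots are split between two software cogs.
print(secondary_bandwidth(0.5, 0.0, secondary_cogs=2))  # 0.25
```

As the primary cog's hit rate rises, the secondary figure falls in proportion, which is why knowing that hit rate lets you predict the secondary bandwidth in advance.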
That's another tick in favour of the simpler, dual table design.
This example has two COGS at full SW speed and one COG at the lowest granularity/jitter setting, all possible via the simple table.
Yup, and the example above, is just one of how it gives a huge benefit to those who do.
User control can make the Prop Sing.
I agree, sharing (although I think “assigning” or “scheduling” would be more accurate) should be kept separate from mooching. I also agree that mooching has some benefits, and having both would be fantastic. This is how I see the two concepts working. Feel free to correct me if I am misunderstanding either one.
Hub slot assigning would be used to guarantee that a cog gets access to the hub when it needs it. The simplest method of doing this seems to be a hub access slot table, which is very simple to implement, and has a very low silicon cost.
Mooching allows cogs to make use of hub access slots that are not needed by other cogs. I can think of several ways to implement this, but it would probably be more complex to implement and more costly in silicon than hub slot assigning.
Combining the two would add another level of complexity and silicon cost, but it may be worth it for the added flexibility and hub bandwidth gained. Chip would have to decide that.
If I had to choose between the two I would choose the hub slot assignment table for the control it provides.
Paise da lawd 'n pass the buttahmilk.
Hobbyists speak now or forever hold your peace if you actually read anything ;-)
No, it isn't.
If a cog donates one or more of its hub slot to another cog (which is essentially what all the table-based schemes do) then there are objects that will operate differently - if at all - when loaded into the donating cog.
Mooching, on the other hand, is guaranteed to only use slots that are unused. But in fact even mooching could still lead to some timing differences in overall program behavior - but it seems unlikely it would have anywhere near the impact of the table-based schemes.
Ross.
If you like singers who "stutter" occasionally!