Any interest in (more than 8) cog chip?
prof_braino
Posts: 4,313
Another thread discusses a 1 cog chip
http://forums.parallax.com/showthread.php/123505-Any-interest-in-one-COG-chip
This got me thinking, but rather than derail that discussion, I thought a separate thread would be appropriate.
Is there any interest in a "more than 8 cog" chip?
Most simple apps have trouble using all 8 cogs. On the other hand, some complex apps can use many more than 8 cogs.
In the context of propforth, we can simply add more physical chips and communicate over MSC, at the cost of one cog and two I/O pins on both the master and the slave chip. MSC is similar to Beau's synchronous serial: it is also synchronous serial but includes a protocol, and the link continuously sends 96-bit packets at clock speed. The effect is that we drop in a character on this side and it pops out the other side, into the I/O stream of whatever cog is connected.
This is fine, but really, for the larger apps, we would benefit from 16 or 32 cogs on a chip, and more I/O pins. Even if the cost is more than $1 per cog, this would be a benefit. The cogs would be full, complete cores, not pared down or otherwise reduced in function.
More I/O pins: Although the Prop 1 COULD control 32 servos at once (for example), we would have no pins left to talk to the chip. Another 32 pins would be the smallest reasonable increase, I think.
So it seems our needs go in the opposite direction from the other thread's needs in several ways:
- need more COMPLETE, identical cogs; 16 or 32.
- need more I/O pins. 64 pins total would be minimum on a chip with more cores.
- More cog memory would be secondary, but nice.
- More hub memory would be secondary, but nice.
- faster clock would be secondary, but nice.
- VGA support could be removed. Rarely do apps require this. Even those that do only need it on a few cogs, so having it present on all cogs is not a big benefit. Any external solution might provide this functionality better.
Does anyone find a need for MORE cogs?
Could the cost still be around $1 per core?
Thanks for your opinions.
Comments
I know this idea gets bounced around every so often.
First off, I often need/want more cogs. I frequently use a second Prop in my projects.
There are lots of severe problems with adding additional cogs to a Propeller.
One problem is just the physical layout of the chip. I don't think there's much more room around the hub for the additional silicon the new cogs would require.
The other big problem is with timing. More cogs means less access to the hub. Many drivers that work on an 8 cog Prop would not work on a 16 cog version.
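To put rough numbers on the timing concern, here is a minimal sketch. It assumes a plain round-robin hub in which every cog gets one slot per rotation and each slot takes two system clocks, which matches the P1's one hub access per 16 clocks with 8 cogs. The 16-cog and 32-cog figures are hypothetical, and the function name is made up for illustration.

```python
def hub_accesses_per_second(clock_hz, num_cogs, clocks_per_slot=2):
    """Hub accesses per second available to ONE cog under plain round robin.

    clocks_per_slot=2 is an assumption taken from the P1, where 8 cogs
    share a 16-clock rotation.
    """
    clocks_per_rotation = num_cogs * clocks_per_slot
    return clock_hz // clocks_per_rotation


if __name__ == "__main__":
    for cogs in (8, 16, 32):
        rate = hub_accesses_per_second(80_000_000, cogs)
        print(f"{cogs:2d} cogs: {rate:,} hub accesses/s per cog")
```

Under these assumptions, doubling the cog count halves each cog's hub rate, which is why a driver tuned to the 8-cog hub window could miss its timing on a 16-cog part.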
I ran into the problem of having no pins left for input when I made my 32 servo demo. Sure the Prop could drive 32 servos but all movements had to be pre-programmed since there weren't any pins left over to receive input. This problem could easily be solved with a few extra chips though. One Prop pin could be used to drive many servos if external circuitry is used.
The most common requests we are receiving are lower pin-count chips, at a lower cost, maybe even with fewer cogs. Aside from this, they ask for more RAM, code protect, A/D and more pins (Propeller 2). This larger request might also include more cogs.
With 16 cogs, each cog would be assigned by default one hub access slot out of 16. This could then be modified where one cog might get every 4th access slot, while other cogs get assigned on a first come, first served basis.
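The slot scheme described above can be pictured as a table with one entry per hub slot. The following is only a sketch of that idea in Python, not anything Parallax has specified: the function names are hypothetical, and "open" slots (marked None) stand for the ones handed out first come, first served.

```python
from fractions import Fraction


def default_slots(num_cogs):
    """Plain round robin: slot i belongs to cog i."""
    return list(range(num_cogs))


def prioritized_slots(num_cogs, fast_cog, stride=4):
    """Reserve every `stride`-th slot for fast_cog; leave the rest open (None)."""
    return [fast_cog if i % stride == 0 else None for i in range(num_cogs)]


def guaranteed_share(slots, cog):
    """Fraction of hub slots that are reserved for `cog`."""
    return Fraction(slots.count(cog), len(slots))


if __name__ == "__main__":
    print(guaranteed_share(default_slots(16), 3))         # default 16-cog share
    print(guaranteed_share(prioritized_slots(16, 0), 0))  # cog 0 with every 4th slot
```

Note that in the second table only the fast cog has a guaranteed share. What the other cogs get depends on who asks first.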
It would be nice if the cog memory was increased. An additional 2K of memory per cog would be good. And more hub RAM would be good. 128K by P2 would be very nice.
Thanks for the input! So, it looks like more PINS is the driving request (and last place), and more cogs has been secondary (or off the list), to most users. That tells the story.
And, of course, more memory!
- more I/O
- more hub ram
- more cog ram
- more cogs
With a new P1C design (keeping power usage low)...
- more I/O
  - 48-64 I/O would be nice (perhaps 2 bonding options?)
  - add in another 32 internal I/O for use between cogs (just I/O without the P2 support)
- more hub ram
  - suggest ~60KB hub ram and use ~4KB for boot/debug (as P2)
  - 64KB is a current P1 limitation and suggest this be kept
  - yes it is a larger die
- more cog ram
  - Would love to see this but it's just too complex
  - P2 adds a 256 long clut/fifo
- more cogs
  - Personally I would love to see this. However, doesn't seem to be so much demand
Other comments...
- I would like to see the hub access from 1:16 to 1:8 if possible.
- I don't think VGA or the counters add much silicon space, so no real benefit to remove them from some cogs
Smaller P1...
I am really surprised by this request. Seems to me that no one really understands that Parallax cannot significantly reduce the P1's cost without huge volume. The die is a certain size, and removing pins does not really reduce this much.
I believe the driving force to limit the P1 is to achieve reduced cost, not board space.
I wonder if some of these things can be done in software rather than hardware?
Buy two propeller chips. Devote two pins to an interprop comms system (objects already written) and devote a small amount of hub ram for circular buffers. Open a spin editor, and start adding objects. Add one line to each object which defines which chip it resides in.
A lot of the code could be hidden from the user, or added automatically, For instance, say one prop talks to a keyboard, and the other prop does the display. Use a keyboard object and a display object. Write a tiny bit of code that is a series of virtual ports, so the keyboard might decide to send all the captured data to a virtual port #1. And on the display propeller, the characters to display are collected from virtual port #1. The virtual port handler object would have n ports, with circular buffers of size x.
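As a rough illustration of the virtual port idea, here is a sketch in Python rather than Spin. It only models the buffering: n ports, each with a circular buffer of size x. The class and method names are invented for the example, and the two-pin link that would actually move bytes between the chips is left out.

```python
class RingBuffer:
    """Fixed-size circular byte buffer."""

    def __init__(self, size):
        self._data = [0] * size
        self._size = size
        self._head = 0   # next write index
        self._tail = 0   # next read index
        self._count = 0  # bytes currently stored

    def put(self, byte):
        """Store one byte; return False if the buffer is full."""
        if self._count == self._size:
            return False
        self._data[self._head] = byte
        self._head = (self._head + 1) % self._size
        self._count += 1
        return True

    def get(self):
        """Return the oldest byte, or None if the buffer is empty."""
        if self._count == 0:
            return None
        byte = self._data[self._tail]
        self._tail = (self._tail + 1) % self._size
        self._count -= 1
        return byte


class VirtualPorts:
    """n independent ports, each backed by its own ring buffer."""

    def __init__(self, n_ports, buffer_size):
        self._ports = [RingBuffer(buffer_size) for _ in range(n_ports)]

    def send(self, port, byte):
        return self._ports[port].put(byte)

    def receive(self, port):
        return self._ports[port].get()


if __name__ == "__main__":
    ports = VirtualPorts(n_ports=4, buffer_size=16)
    for ch in b"hello":          # keyboard side writes to virtual port 1
        ports.send(1, ch)
    received = bytearray()
    while (b := ports.receive(1)) is not None:  # display side reads port 1
        received.append(b)
    print(received.decode())
```

In a real two-prop setup the send side and the receive side would live on different chips, with the comms object copying bytes from one chip's buffer to the other's.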
I've got more things on a "wish list", but I'm thinking about what could be done now with the existing tools. Prop chips can program other prop chips. I have done experiments with the speed and it is possible for a prop chip to 'capture' a download and send it out again to a second prop chip. The virtual port object is a combination of existing code already out there. The second prop could have an eeprom or it could work without an eeprom and be loaded from the first prop's eeprom.
So ok, not a prop with 16 cogs, but it could be done with two props and one 64k eeprom.
I don't mind getting a few pcbs done, but I sure would not want to put the $$$$ into a new P1 just because a few of us have pet projects that may have use for them.
If I were Chip I'd be rather ticked off at it. I mean he spent years developing the fabulous P2 with much input and discussion with all us folks on the forum. And now before it even hits the shelves people are clamoring for more this and better that.
At one point Chip did ask if we wanted more COGs or more RAM, given that you have to make a trade off: you don't have an infinite number of transistors or amount of space to play with, and you have a price point to consider. I think the majority said "Give us the RAM, 8 COGs is OK".
The reasons for this being:
1) Adding RAM adds a lot more utility to the device than adding COGs. You can put a lot more functionality into your code for the amount of silicon consumed.
2) Adding COGs reduces bandwidth to HUB. Not good.
There are those that say the bandwidth issue can be mitigated by changing the round robin HUB access somehow and allowing more HUB cycles to COGs that need it. Or by arranging for a bus arbiter that can dynamically let COGs take access slots when other COGs do not.
I believe this is a really bad idea because:
1) It buggers up determinism. (You don't have real-time deterministic timing on your HUB accesses)
2) It buggers up determinism. (You can no longer tell if random objects selected from OBEX will work together)
Some say that 1) is OK because we rarely need real-time determinism in HUB access, that's all done on the COG side of things.
BUT I say 2) is the most important thing about the Propeller. Consider: I write an object or two that actually requires the maximum hub bandwidth I can get with some asymmetric HUB access scheme. Then I select an object I like from OBEX that also requires maximum hub access bandwidth.
BOOM, I'm screwed: those objects will not play together. And the reasons why they fail may not be obvious.
Introducing a bus arbiter or such HUB access mechanism means that not all COGs are equal. That breaks a fundamental design principle of the Propeller. Not good.
Don't forget the PII now has hardware multi-threading in each COG. So an 8 COG PII can run, what is it, 32 real time threads. That goes a very long way to mitigating the desire for more COGs.
Interest ? Sure, interest is free. Commercial critical mass ? : Nope.
Once you add the threading of Prop 2, this chestnut simply goes away.
The 8 Cogs that exist now, are very rarely fully code loaded, and it is that code-memory-usage that can now be fully harvested with threading/time slicing.
That same threading, is what allows 4 or even 1 COG parts to have practical use.
All COGs do not need to be equal. There is significant silicon cost to the many fast MUL and DIV, and I doubt many designs will use EIGHT sets of Screaming MathOps.
I would rather have (say) 4 Full Speed math cores and 4 smaller, slower ones, which could release silicon for some smarter serial IO and smarter Counters/timers.
Imagine a "threadinit" function that acted just the same as coginit. Point it at some PASM, give it some PAR pointer and start up a thread.
I'm sure the idea has issues. Like, how do you know the code you want to run as a thread will fit in the COG that is already running some threads? How does it get loaded, you would need a kernel in the COG to do the load as the hardware can't load a partial COG?
Hmm..messy. Scratch that idea.
Even though I am very new to the Propeller, I can't help adding my opinion here:
I believe it is important to feed Chip and his team with ideas, but on the other hand we have to be fair enough to keep the commercial interest in mind as well.
Companies have failed just because they started diversifying chip designs too early (e.g. Inmos).
See you around
Frank
I disagree. All COGs do need to be equal.
Of course if you are writing all your code yourself you can be sure you know what resources are needed by your different processes and allocate them as required. That's a bit of extra thought and planning required but doable. It's all under your control.
BUT:
One of the most brilliant features of the Prop is that I can fetch a bunch of objects from OBEX, that I did not write. Then without studying their internals I can stitch them together into an application with a bit of my own top level code. All I have to worry about is:
0) Do I have as many COGs as those objects require free?
1) Do I have pins available for them to use?
2) Do I have enough memory space to fit them in.
Given the above conditions I can be sure that mixing and matching any objects will "just work". This amazing simplicity of combining stuff is one of the Prop's greatest strengths.
However if you destroy the symmetry, if all my COGs are different, all that ease of use of third party objects is also destroyed. I would have to check carefully the resources required of each object and check if they will play nicely together. Yuk.
In a way it's like putting interrupts into the Prop. Which would bring us back to the world of having to organize resources between objects. Effectively having to assign priorities to objects. Having to give up when two objects both want the top priority, etc.
This is not good.
Your example of maths cores illustrates this perfectly.
You say it's unlikely we need 8 cores with full high speed maths support.
I say, what is unlikely is surely going to happen. What if Joe user happens to stumble across five objects that need those maths capabilities? BOOM he is thwarted. And thwarted for reasons he may not understand and should not have to worry about.
By the way your plan for only four cores with high speed maths has just buggered up my plan for running my Fast Fourier Transform in parallel chunks over six cores:(
It's fun to discuss this sort of thing. It will probably never happen. The P2 will take us way up to the next level and beyond. However, power will be a problem - just how much we don't know yet. IMHO the only mistake we made was when asked about the P1B we said NO, go for the P2. In hindsight, I think most of us would have now preferred to get the P1B a year or two ago.
I agree that threading will help a lot. But there is still the old 2KB cog restriction although we do have some clut to use as fifo or scratch registers.
And of course, you are 100% correct in that we asked for more hub ram in preference to cogs, and this was before we got threading. But unfortunately we didn't get as much ram as we would have liked - but that's life.
What really annoys me is that I just cannot think of that killer app that would sell millions of P1s or P2s
What you describe is a luxury, and not a necessity.
If you really want equal COGS, you will have that in Prop 2. - but you do need to be aware that luxury is not cost free.
History shows us Asymmetric Cores are already being used more, where the price matters. Making all cores the same, simply adds too much to the price.
Somehow I believe more in the concept of growing slowly but steadily (even though I admit that the Parallax team might feel better with the income of the killer app).
Therefore don't be annoyed, keep thinking.
"Disrespectful" might be a bit of an overstatement.
I'm just thinking about it like a mother spending all day cooking a fabulous roast beef dinner with all the trimmings for her children. When it's done and on the table the kids look to mom and say "Where are the hamburgers?".
I can't help thinking that all those clamoring for more this and better that should come back after they have had the Prop II in their hands for a couple of years and have exhausted all its possibilities.
Or as that mother might say to the kids, "next Sunday you can make your own f'ing dinner."
What I describe may well be a luxury, but I believe it is a very important luxury that makes the Prop easier to use for an ocean of people. And not just those new to programming, but also those who want to quickly throw something together and not have to invest a lot of time into sorting out the conflicts I describe.
That is clear enough. I believe the cost is worth it.
Only if by "history" you mean the past couple of years:).
We see this in ARM processors now. The logic goes something like this:
1) Let's have a full-up ARM core with piles of memory space that can run Linux and hence Android. Or iOS.
2) Better have two or more of those for performance.
3) We'd like more for real time I/O work in other embedded apps but a full ARM is expensive. Let's tack on a couple of lesser non-ARM cores to do that.
4) Of course you have the GPU to deal with if you want any display output.
Well, if you want all that just buy a Beagle Bone Black. It will cost you about the same as a Prop board and has a similar amount of IO. There is no need for you to have a P II.
You might find though that programming all that is a bit of a handful. That's OK if you have a team of software engineers and are going to deliver millions of units of the same product. You can afford to invest the time in development.
It's not OK for where the Prop(s) end up being used.
More philosophically. The symmetry and purity of the Prop is a unique selling point. The only other such devices come from XMOS. Even they don't quite do it (XMOS cores can only access "their" set of pins on the device). If we give that up we might as well give up the whole idea and go the way of the modern ARMs as you suggest.
In general there is no point in trying to compete on those terms, there are already many players there.
P.S. By the way, history shows no such thing. I'm really glad my 4 or 8 core Intel box has all cores the same. That makes it dead easy for my app to be parallelized or for the OS to share app load among cores.
You almost need a degree to fathom which devices you can use together with some other micros as they share specific sets of pins.
A large portion of the silicon on the P2 is devoted to each pin configuration/circuit. Some die space could probably be saved here, but then again, the pins would not be equal. The same goes for the clut on each cog. I am totally in awe of what Chip has managed to cram into the P2. I am just hoping I cannot cook my eggs on the P2
As mentioned in the first post, this is what we do, except we use cog RAM for the buffers. This gives us 7 usable cogs on the first prop, 6 usable cogs on the last prop, and 5 usable cogs on each of the ones in between. So if N is the number of prop chips (2 or more), the number of available cogs is (N-2)*5 + 13. The delay from the first prop to the last prop is about one cycle per prop chip; otherwise the communication works the same as being on the same chip.
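The cog count above is easy to check with a few lines of Python. This is just the arithmetic from the post (7 usable cogs on the first prop, 6 on the last, 5 on each one in between); the function name is made up for the example.

```python
def usable_cogs(num_props):
    """Usable cogs in a chain of props linked by MSC.

    First prop keeps 7 cogs, last prop keeps 6, every prop in
    between keeps 5, i.e. (N - 2) * 5 + 13 for N >= 2.
    """
    if num_props < 2:
        raise ValueError("the formula only applies to 2 or more props")
    return (num_props - 2) * 5 + 13


if __name__ == "__main__":
    for n in (2, 3, 4, 8):
        total = 8 * n
        print(f"{n} props: {usable_cogs(n)} usable of {total} cogs")
```

For two props that is 13 of 16 cogs, and each additional prop in the chain contributes 5 of its 8.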
But we lose resources to the over head. This question is to see if anyone else has found this to be a problem.
Don't be silly. The only stupid question is the one that isn't asked. This is free market research, and if it confirms previous research, it means you did it right. If it doesn't, it means you better start thinking about it right now. In this case, it seems to be showing that folk haven't finished growing into 8 cores, and have not outgrown them to the point of needing more than 8.
As far as any implementation details, these should be left up to the guy doing the work. Unless you consider yourself Chip's peer in design skill.
The need we feel is for MANY IDENTICAL cores (16 or 32 or more). The cores should be complete CPU cores in their own right, not trimmed down "core-ettes" as found in the Green Arrays GA144.
And it seems to confirm that this is NOT a general need, so there is no need to get agitated. But it's still a need for me, and I will continue to be patient.
This is being tested on the P1, with 8 threads. So far it appears to work really well, but multitasking is orders of magnitude more difficult to debug.
The developer does have to be aware of what is going on to use it properly.
Imho, this discussion (and also the 1-cog version) is a WOMBAT (waste of money, brain and time).
Any special, downsized version of the P1 would be more expensive than the current one (because of lower production volume) and therefore useless.
Any special, more complex version of the P1 will be inferior to the P2 but nearly as expensive and therefore useless.
The only improvement to the P1 that has a slight chance of making sense would be scaling down the current design unchanged to a smaller process to cut cost. I don't know if this is possible.
"More cogs" can be solved with multiple chips. "More pins" can be handled with external shift registers (HC595-something).
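For anyone who has not used one, here is a small Python model of how a 74HC595-style shift register turns three pins (data, shift clock, latch) into eight outputs. It is a simulation for illustration only, not Propeller code, and the class and function names are invented.

```python
class ShiftRegister595:
    """Minimal model of an 8-bit serial-in, parallel-out shift register."""

    def __init__(self):
        self.shift = [0] * 8    # internal stages, index 0 = QA ... index 7 = QH
        self.outputs = [0] * 8  # latched output pins QA..QH

    def clock_in(self, bit):
        """One shift-clock pulse: new bit enters at QA, the rest move toward QH."""
        self.shift = [bit & 1] + self.shift[:-1]

    def latch(self):
        """One latch-clock pulse: copy the shift stages to the output pins."""
        self.outputs = list(self.shift)


def write_byte(reg, value):
    """Shift out 8 bits MSB first so bit 0 lands on QA and bit 7 on QH."""
    for i in range(7, -1, -1):
        reg.clock_in((value >> i) & 1)
    reg.latch()


if __name__ == "__main__":
    reg = ShiftRegister595()
    write_byte(reg, 0b10100101)
    print(reg.outputs)
```

Chaining several of these (the last stage of one feeding the data input of the next) extends the output count without using more pins, at the cost of update speed.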
So I think Parallax's decision against many derivatives is right! All resources for further development should be concentrated on bringing out the long awaited P2. Basta.
Stupid questions are like when your wife comes home with a new blue dress and you ask "Why didn't you buy the red one?"
Which is kind of what we have here.
After years of consultation, and tremendous effort by Chip and crew, we are about to get the Prop II.
I don't want to be looking to the Prop III, IV, V... too hard before we have even celebrated the arrival of the P II, let alone worked with it.
Good points. Sorry your post has gone off topic a bit. Debating the merits/new design features/existence etc of the P2 is easy. Actually solving problems is hard. Let's talk about hard problems.
Firstly, interesting you are using cog ram as the buffer rather than hub ram. What are the merits of each system? I guess it depends where the data goes. If most of the data is going through a propeller and on to another one, then it can stay in the cog. If it is being passed to other cogs, presumably it has to go to the hub so other cogs can access it. That kind of implies data packets, with a header attached which contains the destination, and some code to work out where the packet should go.
You Forth boffins are probably ahead of the game here.
Ok, data comms systems. For years it was a 'common bus', eg 50 ohm cable around the place running ethernet. Then everyone seemed to go to a 'star' system with cat5 cable. So for a 'star' system, your master propeller will have maybe one cog and two pins per connection. You could maybe have a cog multitasking but it would slow down the comms. So the star maybe isn't the best as it runs out of cogs too quickly.
I think your solution may indeed be close to optimum.
Perhaps there is something to learn from the biology of brains, which seem to have mostly local connections, and then a lesser number of longer distance connections. The brain does not appear to be one general purpose neural network, but instead, functions are divided into fairly distinct anatomical areas. When someone has a stroke, it becomes more obvious how this works, eg a patient I have in her late '30s who has had a stroke that has damaged the expressive part of speech, so she cannot say any words, but she can still communicate with the aid of customised software on an ipad. The speed with which she communicates makes it quite clear she is still able to 'think' words, they just don't get translated into speech so the data has to go through another path.
I wonder if one could think about local networks of propellers and whether that leads to a hybrid star/single bus model? Take a cluster, say, of 8 propellers, and build the most efficient comms system in terms of cogs and pins. Maybe this is a common bus with 2 pins per propeller and one cog per propeller? Use data packets and label each one as local or external to the group. For external messages, one propeller in the group is devoted to handling external messages, perhaps on another common bus. So you have clusters of 8 propellers and then clusters of clusters?
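A sketch of how the local/external labelling might drive routing, written in Python purely to make the idea concrete. Everything here is hypothetical: the packet fields, the choice of prop 0 as the cluster's gateway, and the names of the routing decisions.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    dest_cluster: int  # which cluster of props the packet is for
    dest_prop: int     # which prop inside that cluster
    payload: bytes


GATEWAY_PROP = 0  # assumption: prop 0 of each cluster handles external traffic


def next_hop(packet, my_cluster, my_prop):
    """Decide what a prop should do with a packet it has just received."""
    if packet.dest_cluster == my_cluster:
        if packet.dest_prop == my_prop:
            return "deliver"       # packet is for this prop
        return "local_bus"         # local packet, leave it on the cluster bus
    if my_prop == GATEWAY_PROP:
        return "external_bus"      # gateway forwards it toward another cluster
    return "to_gateway"            # external packet, hand it to the gateway


if __name__ == "__main__":
    pkt = Packet(dest_cluster=2, dest_prop=5, payload=b"hi")
    print(next_hop(pkt, my_cluster=1, my_prop=3))
    print(next_hop(pkt, my_cluster=1, my_prop=0))
    print(next_hop(pkt, my_cluster=2, my_prop=5))
```

The point of the split is that only the gateway prop needs to know anything about other clusters. Every other prop just compares the cluster number in the header with its own.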
Could this change your equation so the 5 becomes 6, and hence frees up another cog?
The sending cog sends preprocessed results. The receiving cog processes those results and produces further results. So we don't send raw data, we have less to buffer, and we run faster whenever possible, which is most of the time if it's done properly. (If it isn't, then we should think a little more.)
Just Sal, not me. I only talk.
So the limit is the number that can hang off one prop master, which is 7 slaves.
Could do, but then the cost of support overhead goes up: the extra code needed to set up and organize all the props eats into the room for code for each cog's activity. Each cog could run an individual script from SD, for example, but it's a whole series of trade-offs.