The New 16-Cog, 512KB, 64 analog I/O Propeller Chip - Page 64 — Parallax Forums

The New 16-Cog, 512KB, 64 analog I/O Propeller Chip


Comments

  • RaymanRayman Posts: 13,290
    edited 2014-05-06 08:01
    Is there any application in mind where 2X hub bandwidth is needed where one couldn't just use 2 cogs to do it?

    Seems to me this desire for faster hub bandwidth would only be needed in cases where one just wants to spit out HUB data to the I/O pins (or vice versa), without doing any processing on it.

    But, it seems to me you could get the same throughput using a pair of coordinated cogs...
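A back-of-the-envelope model (hypothetical numbers, not Propeller code) shows why the pairing argument holds for raw transfer rate:

```python
# Toy model of round-robin hub access: each cog gets one window
# every ROUND slots, and moves WINDOW_BYTES per window.
ROUND = 16          # hub slots per full rotation (the 1:16 scan)
WINDOW_BYTES = 4    # bytes moved per hub window (one long, assumed)

def throughput(num_cogs, total_slots):
    """Bytes moved by a team of cogs splitting the work over total_slots."""
    windows_per_cog = total_slots // ROUND
    return num_cogs * windows_per_cog * WINDOW_BYTES

one_cog  = throughput(1, 1600)
two_cogs = throughput(2, 1600)
assert two_cogs == 2 * one_cog   # a coordinated pair doubles the raw rate
```

Of course this ignores the cost of keeping the two cogs in lock-step, which is exactly the overhead the slot-sharing proposals are trying to avoid.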
  • MJBMJB Posts: 1,235
    edited 2014-05-06 08:16
    Cluso99 wrote: »
    Why cannot some of you not see the benefits ???
    I can ;-) - does this help?
  • MJBMJB Posts: 1,235
    edited 2014-05-06 09:24
    It's long been my opinion, from the earliest mentions of the P2, that there will need to be something more than the OBEX. That there will need to be an official Parallax Peripheral Library, maybe even built right into the various IDEs, which implements the usual peripherals found on other processors.
    +1 - this will be VERY important for professionals, to have officially certified and supported standard peripheral libraries.
    Hobbyists can live with searching an OBEX with many objects but of doubtful quality and support (OBEX is great, but not for professionals, who need to explain to their boss why they trust such a source).
  • potatoheadpotatohead Posts: 10,250
    edited 2014-05-06 10:08
    @Heater: I loved it! Hilarious!
  • Heater.Heater. Posts: 21,230
    edited 2014-05-06 11:10
    MJB,
    +1 - this will be VERY important for professionals, to have officially certified and supported standard peripheral libraries.
    Hobbyists can live with searching an OBEX with many objects but of doubtful quality and support (OBEX is great, but not for professionals, who need to explain to their boss why they trust such a source).
    I don't mean to be argumentative but, you know, it's me.

    I find this a very odd statement because:

    1) In my experience, engineers rarely have to explain anything to their bosses, who would not understand if they tried. Just make the product work.

    2) In the cases where code has to work then there is a lot of code review and testing to make sure it does. In which case it matters little where it came from.

    Now I do very much agree that something more than OBEX is required.

    Ideally there would be thousands, millions, of Prop II users all posting their code to OBEX, to github, to SourceForge, GoogleCode, random blogs and web sites all over the place.

    Such as you see for the Arduino for example.

    To get there, things have to be kept simple. So that mixing and matching code from wherever "just works" without a lot of head scratching.
  • Heater.Heater. Posts: 21,230
    edited 2014-05-06 11:26
    T Chap,
    Many hours of bright minds trying to find a solution to a problem over many months, and still no consensus. That means the method is flawed.
    My impression is that the problems of parallel processing and the required shared resource arbitration have been the subject of study since the beginning of the computer era itself.

    Shared memory, dedicated memory, CSP, NUMA.... there have been generations of youngsters earning their PhDs thinking about these issues.

    So, inevitably, we are at an impasse here. It's not just a "technical problem" that has a solution. It depends so much on expected use cases, tolerable complexity for the user, etc, etc.

    Clearly my "Monte Carlo HUB" solution of randomizing COG access to the HUB is a winner in terms of achieving maximal HUB utilization[1] and simplicity for the programmer. For some odd reason it is seen as a joke[2] and no doubt rejected by all, as it does not satisfy their other personal requirements, for whatever reason.

    1. Assuming the mechanics of random selection don't overshadow the gains achievable.
    2. I did propose it as a bit of "light relief"; however, the more I think about it, the more I think it really is superior to all other half-baked HUB sharing schemes :)
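For what it's worth, the utilization claim is easy to demonstrate with a toy simulation (a hypothetical Python model, not a hardware proposal):

```python
import random

ROUND = 16  # cogs, and slots per rotation in the fixed scan

def fixed_slots_used(active, slots):
    """Fixed 1:16 scan: slot t is reserved for cog t % 16 and is wasted if that cog is idle."""
    return sum(1 for t in range(slots) if (t % ROUND) in active)

def random_slots_used(active, slots, seed=42):
    """Monte Carlo hub: each slot is granted to a randomly chosen requesting cog."""
    rng = random.Random(seed)
    used = 0
    for _ in range(slots):
        if active:                      # someone is waiting
            rng.choice(sorted(active))  # grant the slot to a random requester
            used += 1
    return used

# With only 2 cogs wanting the hub, the fixed scan wastes 14/16 of the slots;
# the randomized hub wastes none.
assert fixed_slots_used({0, 1}, 1600) == 200
assert random_slots_used({0, 1}, 1600) == 1600
```

The flip side is that a cog's worst-case wait is no longer bounded, which is the determinism the fixed 1:16 scan buys.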
  • Invent-O-DocInvent-O-Doc Posts: 768
    edited 2014-05-06 11:43
    I'm getting so sick of this. There is no agreement because these concepts lack simplicity and elegance. People won't buy something they don't understand (unless it is an Apple product).
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-06 11:56
    And it's still spinning in the mud...

    I hope Chip can make the 8-long read/write per window work; it gets us most of what we want and is a realistic, practical use case. Then we can drop all this slot sharing :) ... and all its theoretical gains that aren't very practical.
  • T ChapT Chap Posts: 4,118
    edited 2014-05-06 11:58
    Well I hope you guys solve it soon, because this debate has gone on ad nauseam.
  • dnalordnalor Posts: 211
    edited 2014-05-06 12:10
    Heater. wrote: »
    MJB,

    I don't mean to be argumentative but, you know, it's me.

    I find this a very odd statement because:

    1) In my experience, engineers rarely have to explain anything to their bosses, who would not understand if they tried. Just make the product work.

    2) In the cases where code has to work then there is a lot of code review and testing to make sure it does. In which case it matters little where it came from.

    Now I do very much agree that something more than OBEX is required.

    Ideally there would be thousands, millions, of Prop II users all posting their code to OBEX, to github, to SourceForge, GoogleCode, random blogs and web sites all over the place.

    Such as you see for the Arduino for example.

    To get there, things have to be kept simple. So that mixing and matching code from wherever "just works" without a lot of head scratching.

    It is not a very odd statement.
    You have to explain why your code does not work. My boss is an engineer; he wants to know the reasons, and he understands.
    You simply cannot risk taking code from OBEX, because then it is your fault if it does not work, because you used it.
    You have to use certified and supported code or write it yourself. You have to minimize risk. Hunting for errors in foreign code can be very time-intensive and uncertain.
    Thousands or millions? NO!!! One source, with proven/tested code for all the standard peripherals other chips have in hardware.
  • potatoheadpotatohead Posts: 10,250
    edited 2014-05-06 12:26
    For those who have not had to deal with software risk metrics, there are (roughly) these levels:

    Packaged software for a purpose, vendor tested

    A macro or journal or input file to same, still vendor tested, but light use case testing required on the part of the engineering team

    Executable code written against an API for commercial, packaged software, still vendor supported, more testing required by engineering team

    Executable code written with libraries, about the same case

    Stand Alone Executable! All testing, certification, etc on engineering team.

    Those go from moderate to high risk.

    In the case of OBEX, the risk goes right to the highest level, due to there being no vendor to share responsibility for testing, fixing, etc... Also, certifications where required.

    How this plays out varies widely, and it is perfectly understandable to assume all fault as stated above.

    Some companies want to manage risk and will want to limit it by keeping their code to a minimum, which is why some people are saying an OBEX with known, tested, supported objects in it is important. For them it is.

    Others will simply do their testing and not worry so much.

    Just FYI. I went through this little exercise with a customer recently, and it was a very interesting discussion worth sharing here for perspective.

    BTW, there is an opportunity for Parallax and/or friends to package and support code for risk quantification purposes. Here is the interesting thing! The very same code, with a path to support and some demonstrated tests, is worth paying for just to buy down risk perception. It can be in the open, free for all to use under MIT, and it can be packaged, and those who need the transaction as a matter of business will pay to check the lower-risk box.

    Something to think about, isn't it? (and seeing these things is part of what I do for a living, and I'm sharing it out of the common interest here, not making any kind of pitch)

    @Heater: I took it as both a joke and serious. Your idea actually rings true to me in that the HUB issue really is binary: either we say a COG always acts like a COG, or it does not. The zillions of split-the-middle ideas we have collected now, plus your binary extreme one, highlight the overall dynamic perfectly.

    Well done!
  • Phil Pilgrim (PhiPi)Phil Pilgrim (PhiPi) Posts: 23,514
    edited 2014-05-06 12:27
    evanh wrote:
    However, when Chip asks for input we might have something he quickly likes.
    Hopefully, he won't make that mistake again! :)

    -Phil
  • jmgjmg Posts: 15,095
    edited 2014-05-06 12:37
    Rayman wrote: »
    Is there any application in mind where 2X hub bandwidth is needed where one couldn't just use 2 cogs to do it?

    Seems to me this desire for faster hub bandwidth would only be needed in cases where one just wants to spit out HUB data to the I/O pins (or vice versa), without doing any processing on it.

    But, it seems to me you could get the same throughput use a pair of coordinated cogs...

    Mostly, but using two COGS is difficult to manage, and ultimately wasteful, when you hit the COG limit. It also costs POWER.

    COGs represent significant silicon, and to use one simply to work around a slot-allocation oversight rather magnifies that oversight....

    The fundamental simple idea of Scan re-map and Scan re-load is easy to do, and has low silicon cost, and defaults safe.
    Covering all the gotchas is a little more difficult - I see setup as more challenging than scan.
  • potatoheadpotatohead Posts: 10,250
    edited 2014-05-06 12:47
    But we do that all the time now on P1.

    Really, as Roy mentioned, the real case for this is big programs, and if Chip can do the 8 longs per cycle, that will be a huge benefit to that case.

    I thought about your case of detect, or input, process, respond, and the truth is we do that all the time too! We do it on a COG, which runs much faster than the HUB does, and that COG writes its results to the HUB, or it is controlled by some other COG.

    When a burst is needed, we simply have a few COGS run the code, often the very same code, and they do it, then optionally wait and/or get tasked to do something else. One can fire them off in sequence or by specific ID, and the first thing they do is read the ID, then do their thing.
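The "read the ID, then do their thing" pattern sketches out like this (a hypothetical Python analogue, using threads in place of cogs and an in-process counter in place of a hub location):

```python
# Fire off identical workers; the first thing each does is claim an ID,
# then work its strided share of the data.
import itertools
import threading

next_id = itertools.count()   # stand-in for a hub location holding the next ID
results = {}                  # stand-in for per-worker hub result slots

def cog_body(data, ncogs):
    my_id = next(next_id)     # claim an ID (next() on count is atomic enough
                              # in CPython; a real design would use a hub lock)
    results[my_id] = sum(data[my_id::ncogs])   # process every ncogs-th item

data = list(range(100))
threads = [threading.Thread(target=cog_body, args=(data, 4)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sum(results.values()) == sum(data)   # the shares cover the whole buffer
```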

    In the case of bigger programs, the one downside on P1, apart from SPIN being easy, is running a few of them at once.

    Now, on this chip, we have a lot more COGS and RAM, and ideally the start-in-hubexec-mode and/or start-at-address-without-reloading features, both of which make reusing and combining cogs very efficient and easier than on P1, and P1 just isn't hard for most cases. Additionally, we can run multiple big programs at once, which is just great. It will be easy to have a program or two working at the same time; big programs can do things like set a buffer pointer and have another big program writing to an SD card or something, no worries. With only one big program, it has to be fast. When there are a few of them plus COGS, things play out much differently.

    And once this is done on a COG, it's done for everybody! I could go right now, get the guts of Propalyzer, drop it into something I'm doing, say a nice graphical output or SD card storage interface, and it will perform to spec, 80-100 MHz captures on P1, easy. (whatever the clock on P1 is)

    The only consideration I have is number of COGS, and that code will work with 1-4 COGS depending on burst sample rate needed, place to put the data, and a few house keeping considerations.

    I don't even need to know much about how it works! (and didn't last time I used it)

    This is a very powerful thing, and when we can't say a COG always performs in a specific way, we need to know a lot more to get the same thing done, and we can't reuse as easily, etc...

    Aside from running a big hubex program, using the COGS together maximizes the strength of the chip.

    I see this as people not really embracing the symmetric, concurrent, multi-processor attribute of the Propeller, which is the defining feature and strength of it.

    Most of this scheme discussion centers on making it a better serial processor with some peripherals dangling off of it, and that is just not what a Propeller is good at. As mentioned, one would be much better off running a fast, serial uniprocessor with a Prop connected to do simple tasks.

    Truth is, both can be done on a Prop, and the big limits were both the number of COGS and the amount of HUB RAM.

    Now we have a lot more of both and the smart pins!!

    That same program, Propalyzer, can do its thing, contain a nice GUI, SD, etc... and the main program for it won't need huge speed, because the parallel functions will be there, ready to go.

    On P1, SPIN runs about as fast as assembly language on a good 8-bit computer does. On this chip, it may well be a few times faster, and/or will allow inline PASM. Looking at C, we will see the same attributes.

    At a few MHz of instruction speed, the main program will interact with things very well, leaving the real-time parts in the COGS, where it gets knocked out of the park.

    Compiled code will run much faster, and COG code is real time. Put the real time in the COGS!

    Not only is this easy, but it is robust and we know it works well! Actually making big programs hubexec-style to accomplish tasks like the ones being discussed here will never be as optimal as a COG, and that is by design. Compartmentalize the problem, get killer easy reuse when you do it, and make writing that main program easy and robust, and have it perform with few hassles.

    That is how the Propeller way really works, and these other schemes don't add a lot of value, IMHO of course.

    Regarding power and silicon cost, the COGS are a sunk cost. We've got 'em. For power considerations, just load one and have it wait on an event. Or terminate it after a burst, assuming we get the start-at-address-sans-reload feature. Even if we don't, they start more quickly on this design than they do on P1.
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-05-06 12:49
    In the professional world there would be no issues with an optional hub slot sharing mode. Professional developers could handle the intricacies involved with such a feature. However, I can see that this could cause problems in the markets that Parallax participates in. They must be sensitive to a wide range of programming expertise, and features that might be useful for professional applications may be risky for novice programmers. So as I said before, even though I like hub-slot sharing I don't think it's a good idea for P1+.
  • potatoheadpotatohead Posts: 10,250
    edited 2014-05-06 13:25
    I think exactly that too.

    On the "beast" some really killer features got added. Personally, I hope we get to revisit that one on a much more favorable process, flesh it out some, and maximize its potential.

    Deffo target non-novice, and/or System-on-Chip, sans OS, etc... as discussed. Personally, I don't think this design has the speed and scale to go there. That is not saying it is a bad design. It is not looking that way. It is a necessary design given the process physics.

    There is the underlying assumption that Pros won't or don't need the consistent COG behavior. I'm not sure that is true, but I do recognize it as a valid point of discussion.
  • jmgjmg Posts: 15,095
    edited 2014-05-06 14:05
    potatohead wrote: »
    But we do that all the time now on P1.

    Yup, only because there is no alternative. We also run out of COGS earlier than necessary, as a result.
    Dave Hein wrote: »
    In the professional world there would be no issues with an optional hub slot sharing mode. Professional developers could handle the intricacies involved with such a feature. However, I can see that this could cause problems in the markets that Parallax participates in. They must be sensitive to a wide range of programming expertise, and features that might be useful for professional applications may be risky for novice programmers. So as I said before, even though I like hub-slot sharing I don't think it's a good idea for P1+.

    I agree about the strata/tier effect - posting on here shows that.
    That is why proposals universally default to 1:16.
    The chip powers up and works 1:16, nothing at all needs to be done. Novices need read no further.

    Adding a few gates and a control boolean to ALSO include those professional developers is, to me, a clear no-brainer.
    Even novice users should be encouraging this, as it means the P1+ can hit critical mass.

    The area under the sales curve is what matters to Parallax.
  • potatoheadpotatohead Posts: 10,250
    edited 2014-05-06 14:10
    The area under the sales curve is what matters to Parallax.

    Ok, do you have a few cases that would improve that area and that require detailed, per-COG control?
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-05-06 14:21
    jmg wrote: »
    Adding a few gates and a control boolean to ALSO include those professional developers is, to me, a clear no-brainer.
    Even novice users should be encouraging this, as it means the P1+ can hit critical mass.
    But then some expert developer will post some code using hub-slot sharing, which will pollute the minds of novices. :)
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-06 15:01
    Hopefully, he won't make that mistake again! :)

    -Phil

    Thankfully, Chip and Ken do not agree with you that it's a mistake. ;)
  • Roy ElthamRoy Eltham Posts: 2,996
    edited 2014-05-06 15:14
    jmg,
    I do not for a second believe that hub slot sharing will be the deciding factor on if the P2 can hit critical mass.

    I think overall cost is the number one deciding factor in that.
    After that comes things like number of I/O pins total, analog capability, and amount of memory. Those are all distant seconds to cost.

    Performance comes into play when it's not enough, but in this case, at 200 MHz sysclock, 100 MIPS per core, and 100 MB/s of shared-memory bandwidth per core (16 bytes every 16 clocks), it's well above the great majority of other MCUs available, and even many CPUs being used in the embedded space. Even if you only compare hubexec at 50 MIPS (or even less in practice), it still outperforms a ton of stuff and will be more than enough for many products.

    Assuming the overall cost can be kept low enough... and it's not just the cost of the chip fab.
  • tonyp12tonyp12 Posts: 1,950
    edited 2014-05-06 15:32
    Default is 1:16, but there could be a mode, not discussed in any novice learning books, that does 1:8 for cogs 0-7 and 1:32 for cogs 8-15.
    For example, cogs 0-7 are used for VGA + sprites, and cogs 8-15 are for joystick/keyboard/UART etc.
    One extra mode is enough to keep it simple; no 1:4 or 1:2 could possibly be needed.
  • RaymanRayman Posts: 13,290
    edited 2014-05-06 15:40
    Think we could just combine keyboard, mouse, and UART into one cog.
  • tonyp12tonyp12 Posts: 1,950
    edited 2014-05-06 15:45
    >just combine kb mouse and uart into one cog
    It's just an example; there could be some routines that are large and use a whole cog by themselves but whose hub access is low.

    Half the cogs get twice as many hub slots and half the cogs get half as many.
    Simple, and better than a fixed 1:16, as just one bit somewhere could flip it on.
    Like cognew using an address+1 (e.g. odd) could flip the mode on.
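The address+1 trick works because cog images are long-aligned, so the low address bit is normally zero and can carry a flag. A sketch (hypothetical helper names, not an actual cognew API):

```python
# Smuggle a mode flag in bit 0 of a long-aligned cog image address.
def encode(addr, fast_mode):
    """Pack a hub address and a slot-mode flag into one word."""
    assert addr % 4 == 0               # cog images are long-aligned
    return addr | (1 if fast_mode else 0)

def decode(word):
    """Recover the real address and the mode flag."""
    return word & ~1, bool(word & 1)

assert decode(encode(0x1F00, True))  == (0x1F00, True)
assert decode(encode(0x1F00, False)) == (0x1F00, False)
```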
  • kwinnkwinn Posts: 8,697
    edited 2014-05-06 16:02
    There's a reason I suggested that cogs only donate half of their slots. One was that it only requires a 32-nibble LUT, and that LUT could be filled by writing one quad to it. That way the table could be set up to the default 1/16 slots with a single write after power-up. The software could then change that to whatever the user wants.

    The second reason was that all the cogs can still access the hub, although at a slower rate for some. I also considered a 64-nibble table so a cog could donate 25%, 50%, or 75% of its slots.

    With a LUT, though, we are not limited to binary assignments. We could have one cog get every second slot for part of the LUT and none after that. The flexibility comes with using a LUT. An async serial cog needs to shift out 10 bits for every byte it reads, so how many slots does it really need?

    Truth is I really don't have an axe to grind in this debate. The one project I am contemplating using this chip for would be marginally simpler with slot sharing, but easily done by using 2 cogs. IOW, don't care which way it goes. Just presenting the facts (as I see them).

    I do think there are applications where being able to load a burst of data to hub for part of a hub access cycle would be beneficial. Does it take some planning and care to get things working right? Of course it does, that's why they pay us the big bucks, isn't it?

    Those who don't want the complication can stay with the default setting.
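The 32-nibble sizing checks out: 32 nibbles is 128 bits, exactly one quad (four longs), so the whole table can be swapped atomically. A sketch of how such a table might be filled and packed (hypothetical layout, not Chip's design):

```python
# A 32-entry slot LUT, one nibble (cog ID) per entry.
# 32 nibbles = 128 bits = four longs = one quad, so the whole table
# can be rewritten atomically with a single quad write.

def default_table():
    """Power-up fill: cogs 0..15 twice over, i.e. the plain 1:16 scan."""
    return [cog % 16 for cog in range(32)]

def donate_half(table, donor, receiver):
    """Donor gives up its 2nd appearance, halving its rate and doubling the receiver's."""
    new = list(table)
    second = [i for i, c in enumerate(table) if c == donor][1]
    new[second] = receiver
    return new

def pack_quad(table):
    """Pack the 32 nibbles into four 32-bit longs (low nibble = slot 0)."""
    longs = []
    for i in range(0, 32, 8):
        word = 0
        for j, cog in enumerate(table[i:i + 8]):
            word |= cog << (4 * j)
        longs.append(word)
    return longs

tab = donate_half(default_table(), donor=1, receiver=0)
assert tab.count(0) == 3 and tab.count(1) == 1   # cog 0 sped up, cog 1 halved
assert len(pack_quad(tab)) == 4                  # fits one quad write
```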
  • RossHRossH Posts: 5,229
    edited 2014-05-06 16:25
    Dave Hein wrote: »
    In the professional world there would be no issues with an optional hub slot sharing mode. Professional developers could handle the intricacies involved with such a feature. However, I can see that this could cause problems in the markets that Parallax participates in. They must be sensitive to a wide range of programming expertise, and features that might be useful for professional applications may be risky for novice programmers. So as I said before, even though I like hub-slot sharing I don't think it's a good idea for P1+.

    In my experience the reality is likely to be exactly the opposite of this. Novices will spend hours unraveling the intricacies of schemes like this, happily working around its deficiencies, using tricks that no-one ever thought of before - and also accepting the oddities that occasionally result from using it.

    But professionals have cost and risk constraints, and also hard deadlines to meet. Most will just take one look at the complexity of a scheme like this and then recommend their company use a simpler, cheaper and faster chip instead. If by some mischance they do manage to convince their company to try it, the first time it all goes pear shaped (which will be the first time Marketing comes along and says "Quick! We need to add just this one tiny new feature!") they will drop the whole thing as a bad job, and Parallax's reputation will gradually turn to mud.

    Ross.
  • RaymanRayman Posts: 13,290
    edited 2014-05-06 16:33
    Chip: Please make them stop...
  • Dave HeinDave Hein Posts: 6,347
    edited 2014-05-06 16:41
    RossH wrote: »
    In my experience the reality is likely to be exactly the opposite of this. Novices will spend hours unraveling the intricacies of schemes like this, happily working around its deficiencies, using tricks that no-one ever thought of before - and also accepting the oddities that occasionally result from using it.
    That's not my experience. I helped develop a few different videoconferencing platforms that used multiple processors with a shared bus, and we produced some fine products that generated multi-millions of dollars of revenue. I don't think novices could have figured out how to do that.

    Of course, if the processor is designed such that it requires spending lots of time to unravel its intricacies, and tricks are needed, and it contains lots of oddities, then it's probably not a good chip for novices or professionals.
  • jmgjmg Posts: 15,095
    edited 2014-05-06 16:53
    kwinn wrote: »
    There's a reason I suggested that cogs only donate half of their slots. One was that it only requires a 32 nibble lut, and that lut could be filled by writing one quad to it. That way the table could be set up to the default 1/16 slots with a single write after power up. The software could then change that to whatever the user wants.

    The second reason was that all the cogs can still access hub, although at a slower rate for some. I also considered a 64 nibble table so a cog could donate 25%, 50%, or 75% of it's slots.

    Agreed. I've looked into this too, and 32 will match a quad write. It also needs a 5-bit Reload field.

    kwinn wrote: »
    With a lut though we are not limited to binary assignments. We could have one cog get every second slot for part of the lut, and none after that. The flexibility comes with using a lut. An async serial cog needs to shift out 10 bits for every byte it reads so how many slots does it really need.

    Another key advantage of an (atomic access WRQUAD) table of 4b fields, is any CogID can go anywhere.
    This solves/avoids the all-COGS-are-not-actually-quite-equal issues some have raised.

    kwinn wrote: »
    Truth is I really don't have an axe to grind in this debate. The one project I am contemplating using this chip for would be marginally simpler with slot sharing, but easily done by using 2 cogs. IOW, don't care which way it goes. Just presenting the facts (as I see them).

    I do think there are applications where being able to load a burst of data to hub for part of a hub access cycle would be beneficial. Does it take some planning and care to get things working right? Of course it does, that's why they pay us the big bucks, isn't it?

    I am looking at a good "some planning and care" use case right now, with PAL video and USB time-domain budgets.

    With a locked 1:16 I get to about the 5th line, and it pretty much drops dead without complex gymnastics.

    Key point: HUB Slot Scan Rate (SSR) is not just about bandwidth; it also impacts tight code-loop granularity.


    With a COG map, I can get many lines further.

    e.g. I have a MOD 28 scanner, a PAL fSYS at 38x burst (1.684775125e8), and I have a 12 MHz USB data-sync DPLL that can lock +/- 0.285% with a centre point that is ~15 ppm off.

    I have 2 'fast' COGS @ 20ns SSR and 14 'slow' COGS @ 140ns SSR, both jitter-free.

    (fast and slow refer only to the SSR; in all other aspects COGS are of course identical)

    The fast COG SSR can co-operate with my DPLL in a way that is impossible with a locked 1:16 rate.
    Other SW issues may occur, but the timing budget moves to 'possible' on this simple example.

    Some of the many possible operational Table+Reload choices are
    (COG count @ Slot Scan Rate, some coverage options, any CogID allowed):
     2 @ 20ns SSR (fast) & 14 @ 160ns SSR (slow)
     2 @ 20ns SSR (fast) & 14 @ 140ns SSR (slow)  <- the example above
     3 @ 30ns SSR (fast) & 12 @ 120ns SSR (slow)
     4 @ 40ns SSR (fast) & 12 @ 120ns SSR (slow)
     5 @ 50ns SSR (fast) & 10 @ 100ns SSR (slow)
     2 @ 20ns SSR (fast) & 10 @ 100ns SSR (slow)
     reference default is 16 @ 80ns SSR (slow)
    Unallocated COGs are not in the hub scan, but they can still be used for other tasks.
    kwinn wrote: »
    Those who don't want the complication can stay with the default setting.

    Yup, they need never know it is even there.
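These numbers are internally consistent: at a 5 ns slot (one slot per clock at 200 MHz), a MOD-28 table with the two fast cogs in alternating even slots gives exactly the 20 ns / 140 ns split. A quick check (hypothetical table layout, one of many that would satisfy the figures above):

```python
# MOD-28 scan table: cogs 0 and 1 alternate in the even slots (every 4th
# slot each); the 14 odd slots hold cogs 2..15 once each.
SLOT_NS = 5   # assumed: one hub slot per clock at 200 MHz
TABLE = [0 if s % 4 == 0 else 1 if s % 4 == 2 else 2 + s // 2
         for s in range(28)]

def scan_interval_ns(table, cog):
    """Worst-case gap, in ns, between consecutive slots granted to `cog`."""
    hits = [i for i, c in enumerate(table) if c == cog]
    n = len(table)
    gaps = [b - a for a, b in zip(hits, hits[1:])] + [n - hits[-1] + hits[0]]
    return max(gaps) * SLOT_NS

assert sorted(set(TABLE)) == list(range(16))   # all 16 cogs are in the scan
assert scan_interval_ns(TABLE, 0) == 20        # fast cogs: 20 ns SSR
assert scan_interval_ns(TABLE, 1) == 20
assert scan_interval_ns(TABLE, 5) == 140       # slow cogs: 140 ns SSR
```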
  • jmgjmg Posts: 15,095
    edited 2014-05-06 16:57
    RossH wrote: »
    In my experience the reality is likely to be exactly the opposite of this. Novices will spend hours unraveling the intricacies of schemes like this, happily working around its deficiencies, using tricks that no-one ever thought of before - and also accepting the oddities that occasionally result from using it.

    But professionals have cost and risk constraints, and also hard deadlines to meet. Most will just take one look at the complexity of a scheme like this and then recommend their company use a simpler, cheaper and faster chip instead. If by some mischance they do manage to convince their company to try it, the first time it all goes pear shaped

    Is this a serious post ? - it has me laughing.

    The simple solution to your 'example' is for the company to sack the 'Professionals' and hire the 'Novices'!!