We're looking at 5 Watts in a BGA!

jmg · 2014-04-05 17:29

Ken Gracey wrote: »

I'd never suggest it takes us two months, ever.

Ken Gracey

Yes, sometimes enthusiasm can over-run engineering realities...

RossH · 2014-04-05 17:32

Ken Gracey wrote: »

I'd never suggest it takes us two months, ever.

Ken Gracey

Sorry Ken, I didn't mean to put words in your mouth. I was referring to this post.

Ross.

mindrobots · 2014-04-05 17:37

Ken Gracey wrote: »

I'd never suggest it takes us two months, ever.

Ken Gracey

Thank you, Ken, for bringing some sense of reality to this party.

If you had a finished, tested design ready to send off to be synthesized and shuttled on Monday morning, I don't think you could have first chips packaged and ready to test in two months no matter how hard you guys tried.

Best bet is a P1 variant ready for sale to customers by EOY 2014.

koehler · 2014-04-05 17:57

Heater, that's an excellent description.
Can't say I dispute much of anything there.

However, to me it seems as though the entire foundation of the Prop is being flipped in a way.
No longer are we looking at 8 equal, discrete cores, as much as some sort of amalgamation of Core/Cogs with an overlay of C/C++ on top that will stuff compiled code into Cogs, and... well I'm a little confused.

So in addition to explaining the physical layer of the Prop, you also have to sell people on the hubexec and other tasks.
Not that it can't be done, however I think it just winnows out even more potential clients.

And, this may have been addressed, however another question that pops to mind, which seems common sense, is "Do I have to worry about memory management" in this uC/PSOC now?

I'm honestly not trying to be negative here, however I think unlike most here I'm not really looking at this from a "What can the P2 do for me", but more from a simple business proposition.
As a PSOC, I think the current P2 is probably fine, and will come in around 3W+/-.
As a uC, which is where Parallax has this aimed, I do not see this being viable for many new customers at all.

Hopefully I am simply wrong.

Heater. wrote: »

koehler,

I look at it the other way around. The P2 has 8 CPU's executing from HUB memory that happen to have 512 registers. But, as a bonus speed boost one can also execute code from those nice fast registers.

When you put multiple processors together sharing RAM as in modern ARM and Intel devices there is a law of diminishing returns coming in to play. Adding more processors reduces the memory bandwidth of each. Past a certain number it's just not worth adding anymore.

This is true of the Prop II as it is of any other symmetric multi-processor. But the PII has that bonus, with fcache the C compiler can stuff small tight loops into the COG and get around that HUB bandwidth problem.

An example of this working very well is the FFT on the Propeller. The guts of the FFT execute from within the COG, giving about the same performance as the hand-crafted PASM version. No special coding need be done to benefit from this bonus.

Phil Pilgrim (PhiPi) · 2014-04-05 18:26

cgracey wrote:

RossH wrote: »

All,

Over in the "consensus" thread we now have 34 in favor of developing a P16X32B and 2 against. Opinion also seems to favor the simpler (16-cog) variant, and more compatibility with the P1.

Ross.

Argh! Just when I make up my mind!

Chip, I'm not sure why that should even matter. It's not like it's a poll among your potential volume customers -- just a handful of very vocal and opinionated forumistas. (I include myself in that category.) My advice is to stay away from the forum, consult with your marketing people and potential customers, and do what you -- and they -- think will sell the most chips. Despite our loyalty and demonstrable enthusiasm, this simply cannot be about our dreams for the P2. We're neither experienced chip designers nor IC marketeers, and we're just a minuscule fraction of the Propeller's future customer base. What it's about is what will launch Parallax into the next decade of sustained sales and market recognition. But nothing will do that if it's never actually finished.

I understand how the engineering mind works and how intoxicating the forum instant-positive-feedback loop can be. But if you're spending millions of dollars and not getting closer to actual silicon because we're in the way of finishing things (and, frankly, our track record in that regard has proven to be abysmal), maybe that approach needs to change.

So get away from here. Please! Before we completely pollute your fertile mind with further unsustainable fantasies. It's time to bring the "open dev process" to an end. The next thing I'd like to hear from you is a product announcement!

Seriously. It worked for the P1, and it blew us away. It's a proven strategy that should also work for the P2, P1+, or whatever you finally cook up.

Thanks,
-Phil

koehler · 2014-04-05 18:34

+1 for both comments below which seem to strike at the heart of the matter:

Prop 1- Interruptless, spin of a Cog for a task
Prop 2- Interruptless, requires multi-tasking within a Cog to have any resources/Cogs left over

While I like the idea of a P16x, and am happy to hear it's reaching 'consensus' on the forum, I'd push any real decision back to Ken and team as to which makes the most business sense for existing customers, as it will probably be similar to potential new ones.

As someone else mentioned, this should not be rushed because of some future shuttle run.
Unless Parallax has a spare $50K lying around.

Cluso99 wrote: »

Chip,
I am quite happy with handling interrupts but I came to the Prop P1 because I didn't have to worry about those and I could just add simple program blocks and drivers in its own cog.

Ross- I find the idea of a 4-cog version of the P2 less than compelling. It just doesn't seem to have enough flexibility, apart for use as a high-level language execution engine - for which we already have many other alternatives.

Lawson · 2014-04-05 18:35

Bill Henning wrote: »

Really?

P2 @ 160Mhz: hubexec is ~160MIPS for normal instructions

P2 @ 100Mhz: hubexec is ~100MIPS for normal instructions

P1B 32 cog 2 cycle @ 200Mhz LMM: ~6.25MIPS for normal instructions (1/16th same clock freq P2 performance)

P1B 16 cog 2 cycle @ 200Mhz LMM: ~12.5MIPS for normal instructions (1/8 P2 same clock freq performance)

Interesting definition of adequate.
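The LMM figures quoted above can be reproduced with a very simple model. This is only a sketch under an assumption the post does not spell out: the hub serves one cog per clock, so each cog gets one hub window every `n_cogs` clocks and the LMM loop completes one instruction per window. The function name `lmm_mips` is made up for illustration.

```python
def lmm_mips(clock_mhz: float, n_cogs: int) -> float:
    """Estimated LMM throughput in MIPS.

    Assumption (not stated in the post): one hub window per cog every
    n_cogs clocks, and exactly one LMM instruction executed per window.
    """
    return clock_mhz / n_cogs


# P2 hubexec figure quoted in the post for a 100 MHz clock.
P2_HUBEXEC_MIPS_AT_100MHZ = 100.0

for cogs in (32, 16):
    mips = lmm_mips(200.0, cogs)
    ratio = mips / P2_HUBEXEC_MIPS_AT_100MHZ
    print(f"{cogs} cogs @ 200 MHz: {mips} MIPS ({ratio} of the 100 MIPS P2 figure)")
```

Under this model the 32-cog and 16-cog cases come out at 6.25 and 12.5 MIPS, and the 1/16 and 1/8 ratios in the post match a comparison against the 100 MIPS P2 figure.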

This is why Quad-long access was developed. It lets you keep the simplicity and code isolation of round-robin access while scaling the bandwidth high enough to keep 32 cogs happy. The main cost is a rather massive 128 bit data bus to hub ram, more latency, and 8 clock + hub access instructions. The instruction might end up faster depending on if the cog registers are r-r/w or r/w-r/w dual ported.

Marty
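To put a rough number on the bandwidth point in the post above, here is a small sketch comparing a 32-bit and a 128-bit hub bus under round-robin access. It assumes one hub window per cog every `n_cogs` clocks and one full bus-width transfer per window. The 200 MHz clock and 32-cog count are borrowed from the quoted post, and the function name is hypothetical.

```python
def per_cog_hub_bandwidth_mb_s(clock_mhz: float, n_cogs: int, bus_bits: int) -> float:
    """Per-cog hub bandwidth in MB/s (10**6 bytes per second).

    Assumption: round-robin hub with one window per cog every n_cogs
    clocks, and one transfer of bus_bits per window.
    """
    windows_per_us = clock_mhz / n_cogs   # hub windows per microsecond for one cog
    bytes_per_window = bus_bits // 8      # 4 bytes for a long, 16 for a quad-long
    return windows_per_us * bytes_per_window


long_bw = per_cog_hub_bandwidth_mb_s(200.0, 32, 32)
quad_bw = per_cog_hub_bandwidth_mb_s(200.0, 32, 128)
print(f"32-bit bus:  {long_bw} MB/s per cog")
print(f"128-bit bus: {quad_bw} MB/s per cog")
```

The 4x figure follows directly from the bus width. The model says nothing about the extra latency or the longer instruction that the post lists as costs.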

RossH · 2014-04-05 18:38

Phil Pilgrim (PhiPi) wrote: »

So get away from here. Please! Before we completely pollute your fertile mind with further unsustainable fantasies. It's time to bring the "open dev process" to an end.

Hi Phil,

Let's not forget that Chip himself wanted us to offer up some kind of consensus on the P16X32B - which we now seem to have.

Other than that, I agree with everything you said.

Ross.

potatohead · 2014-04-05 18:39

The main cost is a rather massive 128 bit data bus to hub ram, more latency, and 8 clock + hub access instructions.

All of which adds to the power profile of the COG.

Phil Pilgrim (PhiPi) · 2014-04-05 18:42

RossH wrote:

Let's not forget that Chip himself wanted us to offer up some kind of consensus on the P16X32B.

I know, I know. And that's part of the problem: he's asking the wrong people.

-Phil

mindrobots · 2014-04-05 18:45

How about the next announcement is, "Here's a new emulation release, can you guys test??". I really have too much tied up in FPGAs at this point to have things completely close down. :frown:

RossH · 2014-04-05 18:47

Phil Pilgrim (PhiPi) wrote: »

I know, I know. And that's part of the problem: he's asking the wrong people.

-Phil

Well, I would put it slightly more ... er ... diplomatically ... and say Parallax also needs to consult their higher-volume customers.

Ross.

koehler · 2014-04-05 18:47

Bill Henning wrote: »

As the next device, I'd be perfectly happy with a four cog P2 with 256KB hub

Bill, how many thousand units/yr can Parallax then expect you to order?

Ross's thread ( I assume he started it ) is a pretty useful measure as anyone for/against on the forum can say aye/nay.

However, both it, and your opinion as to what Parallax should actually commit to doing, and spending another $50-100,000 on, are equally-- value-less.
Actually, unless you actually do routinely order large numbers of P1's, or will do so with P2's, Ross's thread has more merit on sales alone.

What really matters, ethically speaking, is what the customers who make up the bulk of Parallax's revenue are looking for, correct?
If this were Atmel or Microchip with the deep pockets, then it'd be fine.
Parallax is at least somewhat capital-constrained, so having them make a decision based on what is probably less than a couple thousand units is not in their best long-term interests.
Especially if anyone wants an eventual P3.

Bill Henning · 2014-04-05 18:50

RossH wrote: »

I don't get it. You voted No, so have two others. What's the problem?

Ancient polling trick. The people who are opposed are less likely to visit that thread, those who agree with P1B more likely. Thus, skewed vote.

RossH wrote: »

Your performance thread is good - it shows a P16X32B would perform between 2 and 5 times faster than a P1. I think we could count on a four-fold performance improvement for many applications.

Thank you.

I agree with you if you postulate 80MHz P1.

My calculations were based on 100MHz P1, so in that context, I'd expect 1.75x - 4x

Of course, compared to P1, for compiled large code, a P2 hubexec cog will outperform P1 LMM cog somewhere between 12x - 32x, and will have 32x the hub bandwidth compared to P1, and 4x to 8x compared to a P16E32 LMM cog.

RossH wrote: »

Ross.

Bill Henning · 2014-04-05 18:52

+100

potatohead wrote: »

The assumption that we could get EITHER chip in 2 months is a very weak assumption.

Dave Hein · 2014-04-05 18:54

RossH wrote: »

Let's not forget that Chip himself wanted us to offer up some kind of consensus on the P16X32B - which we now seem to have.

A consensus has the following meaning "agreement, harmony, concurrence, accord, unity, unanimity, solidarity". I don't believe we have reached a consensus on the P16X32B since three people object to it.

Phil Pilgrim (PhiPi) · 2014-04-05 18:57

RossH wrote:

Well, I would put it slightly more ... er ... diplomatically ...

Diplomacy never has been my strong suit.

-Phil

potatohead · 2014-04-05 19:00

I very strongly agree with Parallax getting some sort of consult on this. Perhaps the right expression is simply emphatic. It's worth the data points.

Lawson · 2014-04-05 19:01

potatohead wrote: »

All of which adds to the power profile of the COG.

I wouldn't be so certain the Quad-long access increases the power consumption of the chip. The 4x wider hub data bus is likely to take more power, but the 8 cycle + hub sync instruction needed to get 4 longs into cog ram is likely to save power because instruction decode and the ALU can be gated for 6+ clocks. Have to work out the details to be sure one way or another.

Marty

Invent-O-Doc · 2014-04-05 19:05

Ok, I've been following this surprising development for a while. Here are my comments for Chip and for the community:

You need a chip to sell soon. That chip can fund your full P2 65nm if you can generate new income in the meantime. It appears unlikely you can get the P2 65nm on your budget.

I see that you have two options:
1) 16 core P1 w/ some analog and 256-512k RAM, maybe fuses and ROM monitor, wonderful! Keep it simple as possible, don't fill the die. Call it P2 (and make current P2 a later 'P3').
2) 4 core P2 on 180 process.

Regarding which one to do. Ask yourselves these two questions (in order of importance).
1) Which option is the LOWEST RISK and FASTEST TO PRODUCT?
2) Which option will your VOLUME BUYERS be more likely to purchase NEAR TERM?

I'm personally leaning towards 1 (with code compatibility!) while P2 evolves on FPGA. (Don't forget to call whatever you make P2). I admit I'm ignorant of the details that will inform your decision between the two. (also, since you are synthesizing, consider a process that allows for FLASH memory).

So - you've got a new interim 'P2 chip' to market (based on P1 or your P2 advanced design) with increasing revenues that can fund your next gen 65nm chip

There's my

RossH · 2014-04-05 19:05

Dave Hein wrote: »

A consensus has the following meaning "agreement, harmony, concurrence, accord, unity, unanimity, solidarity". I don't believe we have reached a consensus on the P16X32B since three people object to it.

We appear to have achieved a consensus on what kind of PaaXbbc we want. Whether Parallax actually builds it is up to them (and always has been).

Ross.

potatohead · 2014-04-05 19:10

I wouldn't be so certain the Quad-long access increases the power consumption of the chip.

The way I see it, the thing is either a set of P1 COGS connected together with the pins, or it's not. If it's not, there will be features added, and those will consume power. How much is perfectly debatable as you say. Agreed.

Secondly, all those COGS add up, and this process has some increasingly clear compute / heat limits too. So the way I see that is if we get performance at all, we are going to use watts. No magic bullet for that.

The P1E won't be sipping the juice. It's going to require some power to perform. The P2 is going to require power to perform. Peak performance of the P1E is much lower than the P2. That's nice for power considerations, because we just don't get much, because we aren't going to be asking for as much. However, we could build the P2, and simply not ask it for as much, and still get very good performance relative to a P1E.

All of it comes down to power management. We either bake it in with a design that simply won't use super high amounts of power for lack of potential to do so, or we produce a design that we can max out, and apply it appropriately, given our power budget and other design constraints.

For those who would want more of that peak, the P2 is going to deliver, but they will need to manage the power. For those wanting more out of the P1E, no go. It will be what it will be.

Either variant is some time away. So there is nothing quick about either one at this point. One is a simpler design that can go from idea to build test more quickly, likely placing it on par with the other which is closer to a build / test stage already, for having investment made in it.

Relative to the P1, both are attractive!

Which leaves us with?

Which one generates more revenue so that our futures are well funded? I don't think we know the answer to that. We know a lot about what people think they want, and what they say they want. But we really don't have an understanding yet of what the business looks like.

potatohead · 2014-04-05 19:15

We appear to have achieved a consensus on what kind of PaaXbbc we want.

Having read through that, all I see is a general consensus on that people say they want a PaaXbbc. Given that's not very well defined yet, I would leave the word consensus out of it, until it's passed the "but we need this feature" phase, which won't begin in a material way, until after a build commit happens.

In a real sense, that build commit would be to a different design path, not some actual product that has been defined, etc... It's a thumb in the wind alternative that looks sexy because people freaked out over power, prior to allowing that discussion to run its course. And, power will have yet to be discussed on that variant, which will have some power issues, or it simply won't perform very well relative to general expectations now, though either performs well relative to the P1. One or the other with this process.

AntoineDoinel · 2014-04-05 19:19

Dave Hein wrote: »

A consensus has the following meaning "agreement, harmony, concurrence, accord, unity, unanimity, solidarity". I don't believe we have reached a consensus on the P16X32B since three people object to it.

Dave, we didn't reach consensus not because three people object, but because all the rest are apparently each giving consent to a different thing!

So Chip, please take Phil's advice!

Things like 34-bit longs are already going to give me nightmares... evil PDP-10s attacking me with their swirling tape reels!

Escape the opium while you can!

koehler · 2014-04-05 19:22

Mindrobots,

sadly this seems like an unfair jab at Ross.
Here you take his words, add timeline and durations to them, and then knock them down like a typical strawman attack.
So far, some "Sarcasm" aside, this forum is one of my favorite on the net due to its generally congenial if heated discussions.

I 'got' exactly what Ross was saying.

If I may give my reading of his post,

Parallax currently has the P1 in production, a known good working physical device that has been proven.
The P2, is currently a device that has already failed a shuttle run, and since that time, has morphed far beyond what that shuttle run device entailed.
So not only is there the potential for still existing errors to be in that part of the original P2, but there are just as likely to be similar problems within all of the new stuff that's been tacked on, and probably far more possibility of 'subtle' errata.

Chip has already opined earlier that pulling out the old P1 and revisiting it was basically a no-brainer, because it was such a simple design, comparatively speaking. His words, or close to it.

So you do the math.
How hard would it be for Chip to take the P1, and double the Cogs from 8 to 16, knowing that it's a proven design?
I think it's realistically a lot easier for Parallax to do that successfully in a timely fashion than to 'prove' the P2.

If you are in favor of the P2, perhaps because you have spent $$ on a FPGA, then maybe just say so.
Trying to dismiss Ross' comments as wispy, ungrounded in reality is rather unseemly to anyone, and perhaps more so to someone who has given real substance back to the community at large.

Dave Hein · 2014-04-05 19:36

RossH wrote: »

We appear to have achieved a consensus on what kind of PaaXbbc we want. Whether Parallax actually builds it is up to them (and always has been).

So you are saying that everyone agrees that the kind of PaaXbbc we want is to continue with the P2 as it was a few days ago?

rogloh · 2014-04-05 19:37

One interesting and simple thing you might imagine is the proposed P1 variant running a keyboard driver or a simple UART in a COG. Assuming no hub exec or LMM is used, you manage to squeeze the code into the 512 longs (which we already know can be done in this case) and you then burn a potential 100 MIPS to do a keyboard driver. A whole 100 MIPS of the device is lost just to decode PS/2 protocol! No problem, we have plenty of COGs. LOL!

Now if 512 longs gets too small and you need more memory for your I/O driver or other code, you could always try to use LMM and get yourself a 25 MIPS VM. You now consume 100 MIPS for running your 25 MIPS VM. I'm assuming 1:8 hub cycles per COG and a 200MHz hub with an optimistic 4 cycle VM loop (not even sure that is possible, depends on final jump delay). This is only 25% efficient use of the COG's inherent power when fully loaded and running at 25 MIPS, but you are consuming 75% of the total COG power for running the actual VM loop. These are the realities we face with a P1 variant @ 100MHz. It is going to be difficult to utilize all that power effectively and efficiently unless almost everything uses lots of high speed I/O, fits in single COGs and the COGs are working flat out driving the pins and not requiring large amounts of hub memory bandwidth.
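The arithmetic in the paragraph above can be written out as a short sketch. The inputs are the ones the post states (200 MHz hub, one hub window per 8 clocks, 100 MIPS of native COG throughput). The 2 clocks per native instruction is inferred from those two figures, and the assumption that the VM loop fits exactly in one hub window is the "optimistic" case the post describes. The function name is hypothetical.

```python
def lmm_vm_stats(hub_mhz: float, hub_window_clocks: int, clocks_per_native_instr: int):
    """Return (native_mips, vm_mips, efficiency, overhead) for an LMM VM.

    Assumptions: the VM loop fits in exactly one hub window, so the VM
    retires one instruction per window; native code retires one
    instruction every clocks_per_native_instr clocks.
    """
    native_mips = hub_mhz / clocks_per_native_instr
    vm_mips = hub_mhz / hub_window_clocks
    efficiency = vm_mips / native_mips   # useful work per unit of raw COG throughput
    overhead = 1.0 - efficiency          # share of the COG spent running the VM loop
    return native_mips, vm_mips, efficiency, overhead


native, vm, eff, ovh = lmm_vm_stats(hub_mhz=200.0, hub_window_clocks=8,
                                    clocks_per_native_instr=2)
print(f"native: {native} MIPS, VM: {vm} MIPS, "
      f"efficiency: {eff:.0%}, VM overhead: {ovh:.0%}")
```

If the final jump delay pushes the loop past one hub window, `hub_window_clocks` would effectively double and the VM figure would halve, which is why the post calls the 4 cycle loop optimistic.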

So if you don't rewrite and share a bunch of drivers in single COGs (something Cluso didn't want to have to do for P2), what do you have?
100 MIPS keyboard/mouse driver COG,
100 MIPS full duplex UART COG,
25 MIPS LMM main application actually consuming 100 MIPS of raw power but only getting 25% efficiency at best
etc

So what you can see from this is that the fundamental way to realize the true potential benefit and raw performance of the P1 variant would be to use hub exec and/or tasking. But I imagine this is a non-trivial change to the P1 variant and will take quite a while to do, so some people will not want to go down that path for expediency (and I would agree there too). So you skip it. End result is you have an updated chip with a whole lot of potential power on paper but quite difficult to realize in practice. Maybe that will satisfy the existing P1 market however when it comes to I/O pin limits and total hub RAM. It does solve that problem at least.

In comparison, in my opinion the P2 seems better balanced and designed for optimizing system performance by providing the ability to share the resources, but I would have to say I still prefer more than 4 COGs if at all possible, assuming they could fit the die size/power envelope. 4 COGs will definitely require more I/O driver sharing, but given what we've seen above it may make sense to do this on the P2. At least a USB host COG on a P2 would allow lots of I/O peripherals to be attached and help reduce the number of COGs consumed. Would FS USB host be possible on P1 variant without extra H/W support? Not sure. Maybe with multiple COGs in parallel. But again lots of potential MIPs used to do it.

koehler · 2014-04-05 19:49

Ken,

We know you're lurking about, how about throwing us a bone and help us help you?

You're probably sitting back, hoping Chip can work his usual magic and make this problem go away for the most part.
However, it looks like we're at a full stop, with probably 4 Cogs.

The problem some of us see is that going ahead with 4 Cogs means multithreading is now required, whereas in the past we had something akin to lightweight, disposable Cogs.

The premise of the Prop has been simplicity, and interruptless, single-threading functionality.

The P2 option now available appears to jettison that simplicity, and require multi-threading. Without useful interrupts at that.

The P1b/P16 may be an option, although less powerful, it seems to be closer to the Prop's raison d'être.

Bill Henning · 2014-04-05 19:50

koehler,

I run a real consulting / design company - that is my "day" job, and I do not reveal sales numbers. It is none of anyone's business.

With all due respect, what have you contributed? What right do you have to say that Ross' or my advice is valueless?

I've contributed LMM, many optimizations, a lot of contribution to the P2 design, partially responsible for saving at least one shuttle run. On a P1, I have made a product with 256 color high resolution bitmapped graphics with 20MB/sec external memory on a Propeller 1 - something that was considered impossible. I am not going to waste time going into more detail, as I do not have to justify anything to you.

Ross has also contributed a LOT over the years, including the Catalina C compiler.

In the forums, the normal rule is, make technical arguments, not ad hominem attacks.

koehler wrote: »

Bill, how many thousand units/yr can Parallax then expect you to order?

Ross's thread ( I assume he started it ) is a pretty useful measure as anyone for/against on the forum can say aye/nay.

However, both it, and your opinion as to what Parallax should actually commit to doing, and spending another $50-100,000 on, are equally-- value-less.
Actually, unless you actually do routinely order large numbers of P1's, or will do so with P2's, Ross's thread has more merit on sales alone.

What really matters, ethically speaking, is what the customers who make up the bulk of Parallax's revenue are looking for, correct?
If this were Atmel or Microchip with the deep pockets, then it'd be fine.
Parallax is at least somewhat capital-constrained, so having them make a decision based on what is probably less than a couple thousand units is not in their best long-term interests.
Especially if anyone wants an eventual P3.

David Betz · 2014-04-05 20:00

cgracey wrote: »

Actually, the die will still be pretty big, even with 4 cogs and 256KB RAM.

The difference between Prop2 and Prop1 is that Prop2 has tons of DSP capability and lots of creature comforts, whereas Prop1 is more bare-bones. The two chips we've been talking about are very different animals. Each has a claim to existence.

I'm actually a big fan of bare bones. It is one of the things I grew to like in the P1 and was afraid I might miss in the P2. Of course, I still want hub execution! :-)
