#549, which set you off had the following technical arguments comparing a 16 core P1B and the current P2 design:
Maximum Hub Bandwidth per Cog
Total hub bandwidth per chip
MIPS per cog
MIPS per chip
Video Limits
Signal Capture / Generation Limits
SDRAM
LMM vs HUBEXEC
Pins
Which were based on simple algebraic calculations and known specifications.
I am well qualified to make those calculations, and I dare say you are also qualified - if you have been following the specifications and have delved into the P2 instructions at a cycle level.
Which I have.
And I don't think anyone would dispute that I know the P1 inside out, and sideways.
The only point where I am not as qualified, and neither are you was
Power consumption
where I wrote "Roughly the Same" based on my general knowledge, and would have welcomed a technical post showing the difference from those who design ASICs.
Given the rough equivalence in the number of transistors (a 200MHz clock with 2-cycle instructions and twice as many cogs, versus 160MHz with single-cycle instructions and half the hub), that is a reasonable estimate. As reasonable as 5W for the P2.
I have enough knowledge to be able to say that, and in case you missed it, I welcomed corrections by jmg (re power usage) and Ray (re misremembering the number of transistors in an SRAM cell).
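The throughput side of that estimate is simple enough to sanity-check. Here is the back-of-envelope math in Python, using only the clock and cycle figures quoted in this thread (estimates, not measurements):

```python
# Rough per-chip instruction throughput, from the specs quoted above.
# 16-cog P1B proposal: 200 MHz clock, 2 clocks per instruction.
p1b_mips_per_cog = 200 / 2              # 100 MIPS per cog
p1b_chip_mips = 16 * p1b_mips_per_cog   # 1600 MIPS per chip

# Current 8-cog P2 design: 160 MHz clock, 1 clock per instruction.
p2_mips_per_cog = 160 / 1               # 160 MIPS per cog
p2_chip_mips = 8 * p2_mips_per_cog      # 1280 MIPS per chip

# Similar aggregate throughput from a similar transistor budget is why
# "roughly the same power" is a plausible first guess.
print(p1b_chip_mips, p2_chip_mips)
```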
The point is, instead of arguing all but one of the points (i.e. everything except power consumption) - which you chose not to do, either because you could not be bothered to do the math, or because you realized I was correct and could not argue - you made a personal attack by mischaracterizing what I said.
Therefore my request stands.
I am actually quite happy to get technical rebuttals, because I can learn from them when I am incorrect. So you are VERY welcome to dispute technical matters, I quite enjoy that, and learn from such discussion.
I will not accept ad hominem or strawman arguments, and will always fight them. From everyone.
I would if I thought we were both qualified to make such arguments, considering the dearth of hard data available to support either position. I know I'm not. Even if you're right -- and I'm not saying you aren't -- it still seems like a good time to step back, take a breath, and consider other alternatives, even if they don't materialize.
I've worked with SXes and NMOS Z8s in industrial environments. Both got hot, and the thermal considerations were unpleasant to deal with. When I see figures like 5W -- or even 3W -- I think, "Oh, no, here we go again!" But maybe I've just gotten spoiled by the cool-running P1.
I fully intend to keep using the P1 when appropriate, no matter P2/3/4/...
Here is one particularly thorny thermal environment I had to deal with about 20 years ago:
I had to design a sensor network for use inside ovens that absolutely had to run to at least 125°C, and ideally above that. The PIC nodes had to be networked with RS485, as I recall.
Got it running - the biggest problem was chasing down milspec parts for all the components in the design. All units I tested ran fine through prolonged testing to at least 150°C, some all the way up to 175°C. That was a fun project.
I agree that P2 has to be built ASAP. A wealth of ideas has gone into it and it is the prime platform. Today, the engineer at OnSemi is going to check power dissipation for the case of S and D not toggling. This will tell us right away what power level we could approach with lots of flop gating into the ALU sections.
About the Prop1 ideas: I was thinking that hub exec could be realized by having no more than a single cache line that would afford 100% execution speed in a straight line. Also, this would allow the dual-port cog RAM to be cut in half, from 0.292 sq mm to ~0.15 sq mm, since we only need it for deterministic code and variables. Add in the ~0.1 sq mm of cog logic and the cog area drops from 0.4 sq mm to 0.25 sq mm - a ~38% shrink, while affording execution from hub. This is all at 180nm.
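The area arithmetic works out like this, as a quick sketch using only the 180nm figures from the paragraph above:

```python
# Cog area at 180nm, figures from the post above (sq mm).
dual_port_ram = 0.292      # current dual-port cog RAM
halved_ram = 0.15          # roughly half, kept for deterministic code + variables
cog_logic = 0.10           # cog logic

old_cog_area = 0.40
new_cog_area = halved_ram + cog_logic                  # ~0.25 sq mm
shrink = (old_cog_area - new_cog_area) / old_cog_area  # about a 38% shrink
print(round(new_cog_area, 2), f"{shrink:.1%}")
```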
When I went in to look at the old Prop1 code, I was blown away how there is almost nothing there. It doesn't have the creature comforts of the Prop2, but it is as lean as can be. It's so compact that you can just add more of them, rather than inflate them.
Lots of P1 COGs sounds interesting especially if you can find a way to keep a simplified way to execute code from hub memory. Actually, that's the main thing I wanted out of P2 anyway! :-)
I'm not trying to be disagreeable here, just realistic. When the BASIC Stamp came out there wasn't much else around; nowadays there are lots of options, from PICaxe to Arduinos and Teensys, all real cheap and very capable (maybe not the PICaxe so much). But it's never been really clear to me what the P2 is aimed at. With the video capabilities it seemed like maybe the Raspberry Pi, but I don't think it would benchmark well against it. I view the Raspberry Pi as a closed system and have avoided using it, while we have looked pretty closely at the BeagleBone; right now, for a lot of things, it looks pretty good, and hence they are shipping >10,000 a month of those.
Now yes, you do have that camera sensor, but it is only black-and-white and 128x96 pixels (pretty low res). I just received my two PixyCams for the same price, with a built-in LPC4330 that has two 200MHz ARM cores plus all the other peripherals. I just don't see how a P2 will match it in performance.
Basically, if you are going to convince the world you have a better mousetrap, you need to do some proving. The P1 came up short on a lot of measures; sure, you can do video with it, but only in some very constrained ways. Yes, we had a lot of really talented people here on the forum who, with lots of pre-processing and lots of man-hours, could produce some pretty images or demos. But I remember in the forum a few years back, when it became apparent the P1 could not display an arbitrary image at VGA resolution. That's why I say if the P2 is a monster compute engine that also does video, I want to see that proven, because there are lots of cheap ARMs out there that already do it.
A P2 running at 20MHz or 40MHz or 80MHz (and, I am guessing, not even at 160MHz) will not come anywhere near the theoretical 5W worst-case calculation of everything toggling in every cog.
To compare to an ARM fairly, pick EIGHT ARM chips at the appropriate clock frequency and add together their thermal output.
Or just run one cog on the prop, and compare power used at the same frequency.
And don't forget the cost difference between, say, eight Cortex-M3s and one P2.
The difference is, with the P2 you could crank up the clock, if you had the power budget, to run eight cores at 160MHz.
The bottom line: the P2 is a power hog in comparison to the ARMs it will be compared against. When outsiders get a look at its power requirements, they're going to walk away from the P2. Not to mention it's a no-go for portable apps.
In short, the P2 is coming up short for a chip that needs to sell millions for Parallax to break even.
All it will take to hurt its sales is a few bad reviews in trade mags that expose its power problem. And Parallax isn't Microchip; it isn't big enough to absorb some duds.
Can Parallax sell a $12, 4-5W micro in a market where chips can be found at half that price with 1/4 to 1/8 the power requirements?
Almost forgot.
Not fair comparing a $12 eight core design to a single core $6 design.
- P2 will have tons of ADCs, DACs, serial ports, video, hardware threads, etc.
Different animal.
I think the design wins will come from integration - designs where you'd otherwise need several ARMs, maybe more microcontrollers, ADCs, DACs, etc., collapsed into a single P2-based device.
I agree that this is a concern. That is why the renewed/continued discussion about what to do with the P1 is important and is continuing.
There are a host of other issues to consider. First is the impact of the P2 on the development and marketability of direct sales into vertical markets at the wholesale level.
The second would be to differentiate that effort sufficiently so as not to disturb (and hopefully to support) markets carefully groomed by third party developers.
It wouldn't matter to me if most of the world walked away… what would matter to me is how many new people flocked my way…. while
being careful not to push away loyal developers.
It's a sharp edge… I think Parallax will successfully negotiate that edge.
I just talked to OnSemi about where this whole project is headed.
The understanding is that we need to determine how much power the P2 will take before we know if it's viable to build at 180nm. They said doing this at 65nm would cost between $1M and $2M, and that they get asked all the time to partner with hopefuls like Parallax, but that's not their business model.
Their engineer got back to me after doing a test where the ALU inputs were kept from toggling. A cog's core power went from 700mW to 450mW, down by 36%. So, with aggressive flop gating into the ALU, we could maybe reduce power by 30%. That's still 500mW per cog, or 4W total, for just the core.
We could probably get the 1.5V 100MHz case under 2W with aggressive flop gating in the cogs' ALUs. He's also setting up some memory considerations, which are going to give us a realistic idea of total core power.
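The arithmetic behind those figures, as a quick Python sketch (all numbers come from the OnSemi test described above; the 30% flop-gating recovery is the post's assumption):

```python
# OnSemi test figures from the post above (per-cog core power, mW).
all_toggling = 700
alu_inputs_held = 450
measured_drop = 1 - alu_inputs_held / all_toggling   # ~36% with ALU inputs held

# Assume practical flop gating captures most of that, say 30%:
gated_per_cog = round(all_toggling * (1 - 0.30))     # 490 mW, call it ~500
chip_core_watts = 8 * 0.5                            # 8 cogs * 0.5 W = 4 W core
print(round(measured_drop, 2), gated_per_cog, chip_core_watts)
```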
I think this is where the P2 will struggle as it's not an obvious comparison.
This afternoon I was thinking about dropping a new MPU into an existing unit to give more memory for the next round of software. While I was at it I thought I may as well drop in a chip with lots of everything.
85 IO pins
512k Flash
128k RAM
USB 2.0 OTG
Ethernet
2 CAN modules
8 DMA Channels
4 SPI
5 I2C
16 ADC
6 20Mbps USARTS
5 timers with bells and whistles
And the cost for that?
$5 and 132mW
Quoted DMIPS is 105.
OK, so the raw DMIPS of a P2 is likely to be up around 1600. BUT, to provide that set of peripherals, you're going to be using what, 6 COGs? That leaves you two COGs for the main code, with 400 DMIPS of processing. I can stick in 4 of the other chips for the same DMIPS rating, about the same money, and way less power consumption at 528mW.
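In numbers, that comparison looks like this (the ARM figures are the ones quoted above; the P2 DMIPS is this thread's estimate, not a benchmark):

```python
# Small ARM chip, figures quoted above.
arm = {"dmips": 105, "price_usd": 5, "mw": 132}

# Estimated P2 figures from this thread (not benchmarks).
p2_total_dmips = 1600
cogs_free = 2                                    # after ~6 cogs act as soft peripherals
p2_app_dmips = p2_total_dmips * cogs_free / 8    # DMIPS left for the application

# Four of the small ARM chips, for comparison:
four_arms = {k: 4 * v for k, v in arm.items()}
print(p2_app_dmips)                              # 400.0
print(four_arms["dmips"], four_arms["mw"])       # 420 DMIPS at 528 mW
```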
LPC4370 (Digi-Key quantity-1 price <$10). What you get: three 204MHz ARM cores, 208KB internal memory, and no external core supply needed.
Built-in peripherals: SPIFI quad high-speed flash interface, SDRAM controller, lots of UARTs, CAN, Ethernet, SPI, I2C, video output ...
P2 with SDRAM (1 COG gone), video output (another COG gone), some serial ports (another COG gone) pick another peripheral (another COG gone)
So you're down to maybe 4 COGs running an application, and then they also have that memory bottleneck.
So where's the performance advantage? And the LPC4370 runs at about 1/2W versus the P2's 2-5W.
The P2 may find its niche out there, it is just not what I see the volume microcontroller market needs. And there are new ARMs being introduced every few weeks.
A P1B would take away resources from the P2, but I think Chip is considering the P1B as a stop-gap to fill the business needs of existing demand.
If releasing a P1B would provide more capital towards a P2, then it might make sense.
I really think that 1MB of RAM is the right target for a P1B, because there are so many applications where the P1 is fast enough but doesn't have enough RAM to do the job. Video requires big RAM, datalogging/high-speed sampling requires big RAM; heck, just complex programs require big RAM.
I've been thinking that the 256KB of RAM in the P2 is a little meager, given the increased capability and performance - you don't want a bottleneck created by having too scarce a resource. The P1 already suffers from this and demonstrates the problem well (frame buffers are too small to take advantage of what the P1 can do today).
I think the P2 needs to go on a functionality diet. I would like to see hubexec for one thread retained, but ditch all the other enhancements for multi-tasking and multi-threaded hubexec. These features are too arcane for the target customers to use effectively; hubexec is an easy thing to implement when C is the language, but the other features require heavy compiler work and will need software that won't be available for a year after launch.
Hubexec was a good win, but let's remember that the P2 had none of these power problems less than a year ago, and the major changes since then are hubexec and all the caches. The P2 used to fit a full cog in a DE0-Nano; it doesn't now.
USB OTG guess: 1 cog (with SERDES, tasks)
Ethernet: 1 cog (assuming 10Mbps; with SERDES it might only take 0.5 cog, we will know once we try)
2 CAN: 0.25 cog, and a UART or SERDES
8 DMA channels: 1 cog, using Chip's 1-cycle-per-long move; could be 0 full-time cogs if occasional
4 SPI: with SERDES, two 1/4 cogs, so 0.5 total
5 I2C: depends on speed; bit-banged, 1/4 cog @ 400kHz
16 ADC: 0 cogs
6 20Mbps USARTs: serial ports on 3 cogs, using 1/4 of the time of those cogs
5 timers with bells and whistles: timers from 2.5 cogs
So assuming that you actually use all of that, you would need: (first approximation)
1 + 1 + 0.25 + 1 + 0.5 + 0.25 + 0.75 + 2.5 (sort of) = 7.25
With care, combining brings the number way down:
1 USB OTG
1 Eth + CAN
3 DMA + SPI + UARTS + I2C + timers
5 cogs
If we can use some other resources, or tasks, in the Ethernet cog, probably only 4 cogs.
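Tallying that up (the per-peripheral cog costs are this thread's guesses, not measurements; note the first-pass terms sum to 7.25 cogs):

```python
# First-pass cog budget from the list above (terms as written, in cogs).
first_pass = [1, 1, 0.25, 1, 0.5, 0.25, 0.75, 2.5]
print(sum(first_pass))          # 7.25 cogs before combining

# After combining functions into shared cogs:
combined = {
    "USB OTG": 1,
    "Ethernet + CAN": 1,
    "DMA + SPI + UARTs + I2C + timers": 3,
}
print(sum(combined.values()))   # 5 cogs
```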
One cog should suffice for the "business/main" logic.
Of course, it is not fair to count cogs used to provide peripherals in the design.
May I ask what peripherals of the list you will actually use?
Next, we don't know how fast it will need to run. With USB helper instructions, 72MHz may suffice.
Also, I strongly suspect the power consumption figures you gave did not assume all those peripherals going full blast, so let's not assume the P2 cogs would be full blast either.
Agreed.
Most ARM core setups are targeted toward cell phones and tablet devices where power usage is critical. How many additional components would be needed to get an ARM setup comparable to a P2 setup? Just look at the STM32F429xx datasheet and read where they talk about all of the 'power profiles' they have! Normal mode, low-voltage mode, over-drive mode, under-drive mode, power-down mode... mode this, mode that. People are worried about a P2 with 2 or maybe 3 power/speed profiles? Talk about complex: table after table of which pins you can use (and only use) for what! On the P2: any pin, anything (except the high-speed analog pins). That ST is a single-core, 180MHz, limited-configuration controller, and it is not cheap at $14.00+ for 50.
The P2 is a fully open design which will compare well in a true head to head match up in applications it is suited for. Build a tablet with it? NO of course not. Build a state of the art PLC, automation controller, data logger, robot controller, custom widget? You bet!
Chip, with those latest metrics, adding the per cog clocking could really make a difference! Have you explored that possibility yet?
@All I personally translated bean counters into "I know about business realities" and in that sense, the comment and quoted pieces made good sense.
The power discussion continues to be an interesting one!
First and foremost, we need a LOT more "what is possible" discussions - I mean in terms of who would adopt the chip at a modest couple-of-watt power budget.
There are plenty of "want it to be like P1" discussions, and "the other guys do X" discussions, but really those are the wrong ones at 180nm, because it just won't be about that. A shrink could make it about that, but I have yet to see a path to the shrink without this 180nm P2 funding it, so we need the "what is possible" discussion more than we need the "wish it were" type.
Really, it simply is what it is, barring those things we can do to maximize the 180nm design. Those may yield half the extreme profile, and good code strategy may well get us to a couple of watts for a sweet-spot case, ignoring education, which may be better still, depending.
Another general macro level observation is the impact of education. That business does well.
Would this design do as well, or maybe better, at a reasonable power profile? USB can deliver 2.5 watts, so we play it safe and target 1.5 watts. Will this design be viable, or maybe even potent, at that budget?
pedward,
We have no idea what power problems the P2 had before now. They were not able to test it before, and they never made a successful run. It's very possible (and likely) that if the last run had "functioned" that it would have had similar high power usage.
It's a bummer that going to 65nm costs so much. Is it an option at OnSemi to go to 90nm? Would it help enough, and how much would it cost?
This is a frustrating situation, and it's clearly causing feathers to be ruffled all around. People need to step back, and stop pointing fingers, and just let Chip/Parallax and OnSemi work out a solution.
I really like the possibility of a single cache line for hubexec. But I think this needs a little more thought about the best approach versus improving LMM. We need to be careful to KISS and not go overboard.
I am really interested in more OnSemi info.
However, the P32X32B is way more in line with what was initially targeted as the P2. The current P2 is so far from the original specs that it is really a P3. But none of this really matters; it is the ultimate chip(s) delivered that matters.
To me, the 5W caused a major point to stop and reflect. The P32X32B (P1B if you like) has way more appeal to me because
1. Way less risk
2. Significantly quicker to market
3. Will use less power than P2 (read Chip's posts for proof)
4. P2 development can still continue - see Chip's post
-- slightly delayed, but it may catch up because the P32X32B will prove out the OnSemi process.
5. What everyone wanted originally.
6. 32 cores and 512KB RAM @ ~200MHz will get a lot of free press - more exposure --> more sales?
Meanwhile P2 power can be refined and USB/SERDES done.
One big thing to come out of all this is the realisation that lots of simple cogs for peripherals trumps fewer, far more complex cogs that require complex software, multi-tasking, and multi-threading to get the job done!
As a final note: Parallax has never been driven by bean counters, but they won't risk the farm either, so our P1 supply is not in question.
Not sure I'm following this properly, but what is the power budget this proposal is targeting?
Everyone wants more cores, and more memory, and more speed.
How much power are you willing to spend to get that, and are we not then getting back to the same problem the P2 currently faces?
Why not 16 cores, more memory, more speed, and at half the power?
That is in and of itself going to be at least 2x (200%) P1 on cores alone, and another 5x (500%) on top of that for speed, which I guess is 2x * 5x = 10x P1.
That's far, far above an incremental change to the P1, i.e. P1 -> P1B.
As an interim solution, it satisfies the core, memory and speed requirements well, and the power budget is not quite as insane.
Plus, fewer cores means more dice per wafer, and better revenue from which to fund the P2.
If you did this right, I really think Parallax would soon find it couldn't make these things fast enough!
Don't think of it (or market it) as a P2 - that chip is still to come next year, and will be a game changer. These chips are not that, but they would be a great stopgap, and potentially raise a lot of interest (and money!) for the P2. Market them as "Quad Props" and keep the compatibility with the current P1 objects as high as possible - then tout the sheer number of 32 bit processors you can now dedicate to any tasks, and the number of "soft" peripherals that are already available. This makes any idea of "pre-wired" peripherals (which is all the opposition can offer) look totally ridiculous.
But 64KB per core is not ideal - if you can't get 256KB per core, could we perhaps get 128KB? Or even 96KB? To avoid ever having to resort to external SRAM? This chip really needs to be a "single chip" solution for its market niche (like the P1 was originally intended to be, but which we lost when we all pushed it so far beyond its original niche).
Also, we need a fast inter-core comms channel that can be shared between all cogs on all four cores - this is something the P1 sorely lacked, and it meant you could really only use multiple cogs for a single task at hub speeds, not cog speeds.
I certainly hope you aren't ordering your chips from Canada… the postage will kill you:)
I am not an engineer. And I am not trying to be either argumentative or facetious.
You didn't mention video, math performance, texture mapping, the learning curve for new users, company support, forum quality, or the open software base.
How long is the food chain?
From a running start, which supports a more robust end-user experience?
Yes, you missed a critical post by Chip earlier in this thread, which is that the P2 currently requires about 16x more logic elements than the P1. So 4 much simpler P1 "cores" would still only be around 1/4 the complexity of the P2, and at the same clock speeds would presumably only consume about 1/4 the power (yes, I know it's more complex than this, but as a "back of an envelope" guesstimate it's probably within the ballpark).
Not quite, as you have assumed the idling logic ratio is the same on both cores.
The P2 uses 90% clock gating, so the large blocks of high-level logic that aren't in use are not costing power.
Expecting 32 cores running 32 bit opcodes to be somehow 16 times(!) more Joule-Efficient than 8 cores running 32 bit opcodes, is 'optimistic'.
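A toy model makes that point concrete: dynamic power goes roughly as activity x capacitance x V^2 x f, so raw gate count alone doesn't predict power. The activity factors below are made-up illustrative values, not P1/P2 data:

```python
# Toy CMOS dynamic-power model: P ~ sum(alpha_i * C_i) * V^2 * f.
def dynamic_power(blocks, volts, freq_hz):
    """blocks: list of (relative_capacitance, activity_factor) pairs."""
    return sum(c * a for c, a in blocks) * volts**2 * freq_hz

# Hypothetical: a cog with 16x the logic but 90% of it clock-gated off
# (activity ~0.1) vs. a simple cog that is fully active (activity 1.0).
big_gated_cog = [(16.0, 0.1)]
small_busy_cog = [(1.0, 1.0)]

ratio = (dynamic_power(big_gated_cog, 1.8, 160e6)
         / dynamic_power(small_busy_cog, 1.8, 160e6))
print(ratio)   # ~1.6x the power, nowhere near the 16x raw gate ratio
```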
1) Where did you see the LPC4370 using only 1/2W while doing video and SDRAM access? P. 85 of the datasheet shows 1.5W total power dissipation.
Well I had one on my desk doing just that about this time LAST year. I can tell you that we've turned on every peripheral and CPU on the part and never seen over 200mA
Let me throw out another case: those M4s are 90nm processes, and they are blown away by ARM11s (most 65nm) running three times as fast, all for about $15 and typically around a watt.
I know it's impossible for a P2 to run Linux, so someone will go out there and emulate something that runs Linux; it will just take the whole weekend before you get the > prompt.
edit
Actually, I just checked: it was two years ago, with SGPIO capturing a QVGA sensor at 30Hz and dumping it into SDRAM, which had internal DMA running to the LCD controller to drive a VGA display at a full 60Hz. All this while the M4 was twiddling its thumbs. Later, with the same setup, we added the M0 driving the Ethernet to serve a web page and control the I2S output levels that were streaming in from USB under the M4's control. Lots of stuff going on, and no meltdown.
I really like 8 COGs + 1MB (with HUBEXEC on one COG) and "some ADCs + pull-up/down" for an interim chip. That is unlikely though right? Can it possibly be kept simple and non-symmetric? Lowest risk and highest bang for the buck is very important right now.
As far as power goes, the biggest Propeller customer right now is most likely Parallax Education. Can you imagine a 3W to 5W ActivityBot? I certainly can not.
That's great to know, as I will be playing with those chips. Not as nice to develop for as the P1/P2, but definitely interesting for some uses.
The point is, I don't believe for a second that the typical usage of a P2 would be 5W.
I don't even believe 5W as the "worst case", as the assumptions of how much logic was toggling for the simulation are unrealistic (as far as I have been able to determine).
As I've been stating for quite a while... the posted 5W is a worst-case simulation, not a measured known quantity. Therefore, the panic over it is not warranted, especially as the 1-2W typical usage figure talked about for quite a while is consistent with a 3.5W - 5W worst case.
Two hours to boot to a command prompt, with an ATmega1284P running at 24MHz emulating an ARMv5TE, with a 30-pin SIMM bit-banged.
As a WAG, a P2 with the SDRAM would be at least 60 times faster.
So two minutes to boot to a shell prompt sounds about right, not a weekend.
The point is, we are right now looking at 180nm power consumption, without the latest simulation results, at 160MHz. Running at, say, 40MHz for applications that don't need the horsepower drops the power envelope a LOT.
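To first order, dynamic power scales linearly with clock frequency (and with the square of core voltage, if that can be lowered too). A sketch anchored to the 5W worst-case figure from this thread; leakage, I/O, and real activity factors will shift these numbers:

```python
# First-order CMOS dynamic power scaling: P ~ f * V^2.
def scaled_power(p_ref_watts, f_mhz, f_ref_mhz=160.0, v=1.8, v_ref=1.8):
    """Scale a reference power figure to a new clock (and optionally voltage)."""
    return p_ref_watts * (f_mhz / f_ref_mhz) * (v / v_ref) ** 2

worst_case = 5.0   # W, the everything-toggling figure at 160 MHz
for f in (160, 80, 40, 20):
    print(f, "MHz ->", scaled_power(worst_case, f), "W")
# Halving the clock halves the dynamic budget: 80 MHz -> 2.5 W, 40 MHz -> 1.25 W.
```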
Would you kindly stop saying 3W-5W will be required to run a P2 until we have a better handle on it?
WE DO NOT KNOW THAT.
That is the WORST CASE: all 8 cogs firing all logic (MUL, DIV, MAC, CORDIC, VIDEO, everything at once), at 160MHz and 1.8V core, before Chip does his magic power reduction.
And can we stop trying to justify feature cuts that will not affect the power envelope noticeably (hubexec, tasks, threads) by saying they will drop power requirements a lot (without providing proof HOW they will do that)?
Also, if you propose a 32 core P1 with 512KB hub (which would be a nice chip after P2) please show how in the same process it will have a lot lower power utilization? I already showed it will be a lot slower than the P2 for many applications.
Can we please remember the scientific method and not panic yet?
I propose that every power-cutting measure be required to show HOW it will cut power utilization, backed up with references and numbers. No hand waving.
Comments
#549, which set you off had the following technical arguments comparing a 16 core P1B and the current P2 design:
Maximum Hub Bandwidth per Cog
Total hub bandwidth per chip
MIPS per cog
MIPS per chip
Video Limits
Signal Capture / Generation Limits
SDRAM
LMM vs HUBEXEC
Pins
Which were based on simple algebraic calculations and known specifications.
I am well qualified to make those calculations, and I dare say you are also qualified - if you have been following the specifications, and have dwelled into the P2 instructions at a cycle level.
Which I have.
And I don't think anyone would dispute that I know the P1 inside out, and sideways.
The only point where I am not as qualified, and neither are you was
Power consumption
where I wrote "Roughly the Same" based on my general knowledge, and would have welcomed a technical post showing the difference by those that design ASIC's.
Given the rough equivalence in the number of transistors, 200Mhz clock with 2 cycle instructions with twice as many cogs, vs 160Mhz with single cycle instructions, and half the hub, that is a reasonable estimate. As reasonable as 5W for P2.
I have enough knowledge to be able to say that, and in case you missed it, I welcomed corrections by jmg (re power usage) and Ray (re misrimembering number of transistors in an sram cell).
The point is, instead of arguing all but one of the points (ie power consumption) - which you chose not to, either because you could not be bothered to do the math, or you realized I was correct and could not argue - you made a personal attack, by mis-characterizing what I said.
Therefore my request stands.
I am actually quite happy to get technical rebuttals, because I can learn from them when I am incorrect. So you are VERY welcome to dispute technical matters, I quite enjoy that, and learn from such discussion.
I will not accept ad hominem or strawman arguments, and will always fight them. From everyone.
I fully intend to keep using the P1 when appropriate, no matter what happens with P2/3/4/...
Here is one particularly thorny thermal environment I had to deal with about 20 years ago:
I had to design a sensor network for use within ovens, one that absolutely had to run to at least 125°C and ideally above that. The PIC nodes had to be networked with RS485, as I recall.
Got it running - the biggest problem was chasing down mil-spec parts for all the components in the design. All units I tested ran fine through prolonged testing to at least 150°C, some all the way up to 175°C.
That was a fun project.
I'm not trying to be disagreeable here, just realistic. When the BASIC Stamp came out, there wasn't much else around; nowadays there are lots of options - PICaxe, Arduinos, and Teensys - all real cheap and very capable (maybe not the PICaxe so much). But it's never been real clear to me what the P2 is aimed at. With the video capabilities it seemed like maybe the Raspberry Pi, but I don't think it would benchmark well against it. I view the Raspberry Pi as a closed system and have avoided using it, while we have looked pretty closely at the BeagleBone, and right now for a lot of things it looks pretty good - hence they are shipping more than 10,000 a month of those.
Now yes, you do have that camera sensor, but it is only B&W and 128x96 pixels (pretty low res). I just received my two PixyCams for the same price, with a built-in LPC4330 which has two 200MHz ARM cores plus all the other peripherals. I just don't see how a P2 will match it in performance.
Basically, if you are going to convince the world you have a better mousetrap, you need to do some proving. The P1 came up short on a lot of measures; sure, you can do video with it, but only in some very constrained ways. Yes, we had a lot of real talented people here on the forum who, with lots of pre-processing and lots of man-hours, could produce some pretty images or demos. But I remember in the forum a few years back when it became apparent the P1 could not display an arbitrary image at VGA resolutions. That's why I say if the P2 is a monster compute engine that also does video, I want to see that proven, because there are lots of cheap ARMs out there that already do it.
I think you are missing the point.
A P2 running at 20MHz or 40MHz or 80MHz (and, I am guessing, not even at 160MHz) will not come anywhere near the theoretical 5W worst-case calculation of everything toggling in every cog.
To compare to an ARM fairly, pick EIGHT ARM chips at the appropriate clock frequency and add together their thermal output.
Or just run one cog on the prop, and compare power used at the same frequency.
And don't forget the cost difference between, say, eight Cortex-M3s and one P2.
The difference is, with the P2 you could crank up the clock - if you had the power budget - to run eight cores at 160MHz.
I like apple to apple comparisons.
Almost forgot.
It's not fair comparing a $12 eight-core design to a $6 single-core design.
- P2 will have tons of ADCs, DACs, serial ports, video, hardware threads, etc.
Different animal.
I think the design wins will come from integrating designs - cases where you'd otherwise need several ARMs, maybe more microcontrollers, ADCs, DACs, etc., consolidated into a single P2-based device.
I agree that this is a concern. That is why the renewed/continued discussion about what to do with the P1 is important and is continuing.
There are a host of other issues to consider. First is the impact of the P2 on the development and marketability of direct sales into vertical markets at the wholesale level.
The second would be to differentiate that effort sufficiently so as not to disturb (and hopefully to support) markets carefully groomed by third party developers.
It wouldn't matter to me if most of the world walked away… what would matter to me is how many new people flocked my way…. while
being careful not to push away loyal developers.
It's a sharp edge… I think Parallax will successfully negotiate that edge.
Rich
The understanding is that we need to determine how much power the P2 will take before we know if it's viable to build at 180nm. They said doing this at 65nm would cost between $1M and $2M, and that they get asked all the time to partner with hopefuls like Parallax, but that's not their business model.
Their engineer got back to me after doing a test where the ALU inputs were kept from toggling. A cog's core power went from 700mW to 450mW, or down by 36%. So, with aggressive flop gating into the ALU, we could maybe reduce power by 30%. That's still ~500mW per cog, or 4W total for just the cores.
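The percentages in that test work out as follows (a restatement of the post's arithmetic, nothing new added):

```python
# Reproducing the flop-gating arithmetic from the engineer's test.
before_mw = 700          # cog core power with ALU inputs toggling
after_mw = 450           # cog core power with ALU inputs held
reduction = (before_mw - after_mw) / before_mw
print(f"measured reduction: {reduction:.0%}")   # ~36%

# The post then assumes a more conservative 30% from real flop gating:
gated_cog_mw = before_mw * (1 - 0.30)            # ~490 mW, i.e. ~500 mW
total_core_w = 8 * gated_cog_mw / 1000           # ~3.9 W, i.e. ~4 W for 8 cogs
```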
Here are some numbers he gave me the other day:
We could probably get the 1.5V 100MHz case under 2W with aggressive flop gating in the cogs' ALUs. He's also setting up some memory considerations, which are going to give us a realistic idea of total core power.
Let's see what he comes back with.
I think this is where the P2 will struggle as it's not an obvious comparison.
This afternoon I was thinking about dropping a new MPU into an existing unit to give more memory for the next round of software. While I was at it I thought I may as well drop in a chip with lots of everything.
85 IO pins
512k Flash
128k RAM
USB 2.0 OTG
Ethernet
2 CAN modules
8 DMA Channels
4 SPI
5 I2C
16 ADC
6 20Mbps USARTs
5 timers with bells and whistles
And the cost for that?
$5 and 132mW
Quoted DMIPS is 105.
OK, so the raw DMIPS of a P2 is likely to be up around 1600. BUT, to achieve that set of peripherals, you're going to be using what, six cogs? That leaves two cogs for the main code, with 400 DMIPS of processing. But I can stick in four of the other chips for the same DMIPS rating, about the same money, and way less power consumption at 528mW.
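The comparison arithmetic can be restated like so (the per-chip figures are the post's quoted specs; the P2 DMIPS figure is the post's estimate, not a benchmark):

```python
# Restating the DMIPS / cost / power comparison from the post.
p2_dmips_total = 1600            # the post's estimate for 8 cogs
p2_cogs_for_app = 8 - 6          # 6 cogs assumed eaten by peripherals
p2_dmips_left = p2_dmips_total * p2_cogs_for_app / 8

alt_dmips, alt_cost_usd, alt_mw = 105, 5.0, 132   # the quoted $5 MCU
n_chips = 4
print(n_chips * alt_dmips, "DMIPS,",   # comparable to the P2's leftover 400
      n_chips * alt_cost_usd, "USD,",
      n_chips * alt_mw, "mW")
```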
LPC4370 (Digi-Key quantity 1 price <$10). What you get: three 204MHz ARM cores, 208KB internal memory - no external core supply needed.
Built-in peripherals: SPIFI quad high-speed flash interface, SDRAM controller, lots of UARTs, CAN, Ethernet, SPI, I2C, video output...
P2 with SDRAM (1 cog gone), video output (another cog gone), some serial ports (another cog gone), pick another peripheral (another cog gone).
So you're down to maybe 4 cogs running an application, and then they also have that memory bottleneck.
So where's the performance advantage? And the LPC4370 is about 1/2W while the P2 is 2-5W.
The P2 may find its niche out there, it is just not what I see the volume microcontroller market needs. And there are new ARMs being introduced every few weeks.
If releasing a P1B would provide more capital towards a P2, then it might make sense.
I really think that 1MB of RAM is the right target for a P1B, because there are so many applications where the P1 is fast enough but doesn't have enough RAM to do the job. Video requires big RAM, datalogging/high-speed sampling requires big RAM - heck, just complex programs require big RAM.
I've been thinking that the 256KB of RAM in the P2 is a little meager given the increased capability and performance - you don't want a bottleneck created by having too scarce a resource. The P1 already suffers from this and demonstrates the problem well (frame buffers are too small to take advantage of what the P1 can do today).
I think the P2 needs to go on a functionality diet. I would like to see hubexec for one thread retained, but ditch all the other enhancements for multi-tasking and multi-threaded hubexec. These features are too arcane for the target customers to use effectively. Hubexec is an easy thing to implement when C is the language, but the other features require heavy compiler work and will require software that won't be available for a year after launch.
Hubexec was a good win, but let's remember that the P2 had none of these power problems less than a year ago, and the major changes since then are hubexec and all the caches. The P2 used to run a full cog in a DE0-Nano; it doesn't now.
Interesting case. Let me take a peek at it...
USB OTG guess: 1 cog (with SERDES, tasks)
Ethernet: 1 cog (assuming 10Mbps; with SERDES it might only take 0.5 cog - we will know once we try)
2 CAN: 0.25 cog, plus a UART or SERDES
8 DMA channels: 1 cog, using Chip's 1-cycle-per-long move; could be 0 full-time cogs if occasional
4 SPI: with SERDES, two 1/4 cogs, so 0.5 total
5 I2C: depends on speed; bit-banged, 1/4 cog @ 400kHz
16 ADC: 0 cogs
6 20Mbps USARTs: spread across 3 cogs, using 1/4 of the time of each, so 0.75 total
5 timers with bells and whistles: timers from 2.5 cogs
So, assuming that you actually use all of that, you would need (first approximation, taking Ethernet at the optimistic 0.5 cog):
1 + 0.5 + 0.25 + 1 + 0.5 + 0.25 + 0.75 + 2.5 = 6.75 cogs
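Tallying those per-peripheral estimates (my sum, not the poster's; the 6.75 figure works out when the hedged Ethernet entry comes in at its optimistic 0.5 cog):

```python
# Tally of the per-peripheral cog estimates above.  The Ethernet entry
# is hedged in the post (1 cog, maybe 0.5 with SERDES), so compute both.
cogs = {
    "USB OTG": 1.0,
    "CAN x2": 0.25,
    "DMA x8": 1.0,
    "SPI x4": 0.5,       # two 1/4 cogs
    "I2C x5": 0.25,
    "USARTs x6": 0.75,   # 1/4 of each of 3 cogs
    "timers x5": 2.5,
}
optimistic = sum(cogs.values()) + 0.5    # Ethernet at 0.5 cog
pessimistic = sum(cogs.values()) + 1.0   # Ethernet at 1 cog
print(optimistic, "to", pessimistic, "cogs")
```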
With care, combining brings the number way down:
1 USB Otg
1 Eth + CAN
3 DMA + SPI + UARTs + I2C + timers
5 cogs
If we can use some other resources, or tasks, in the Ethernet cog, probably only 4 cogs.
One cog should suffice for the "business/main" logic.
Of course, it is not fair to count cogs used to provide peripherals in the design.
May I ask what peripherals of the list you will actually use?
Next, we don't know how fast it will need to run. With the USB helper instructions, 72MHz may suffice.
Also, I strongly suspect the power consumption figures you gave did not assume all those peripherals going full blast, so let's not assume the P2 cogs would be full blast either.
Thanks for the updated numbers!
Is this a worst case figure, or "typical" case?
Agreed.
Most ARM core setups are targeted towards cell phones and tablet devices, where power usage is critical. How many additional components would be needed to get an ARM setup comparable to the P2 setup? Just look at the STM32F429xx datasheet and read where they talk about all of the 'power profiles' they have: Normal mode, Low-voltage mode, Over-drive mode, Under-drive mode, power-down mode... mode this, mode that. People are worried about a P2 with two or maybe three power/speed profiles? Talk about complex - table after table of which pins you can use (and only use) for what! On the P2: any pin, anything (except the high-speed analog pins). That ST is a single-core, 180MHz, limited-configuration controller, and it is not cheap at $14.00+ for 50.
The P2 is a fully open design which will compare well in a true head-to-head matchup in applications it is suited for. Build a tablet with it? No, of course not. Build a state-of-the-art PLC, automation controller, data logger, robot controller, custom widget? You bet!
@All: I personally translated "bean counters" into "I know about business realities", and in that sense the comment and quoted pieces made good sense.
The power discussion continues to be an interesting one!
First and foremost, we need a LOT more "what is possible" discussions - and I mean in terms of who would adopt the chip at a modest couple-of-watts power budget.
There are plenty of "want it to be like the P1" discussions, and "the other guys do X" discussions, but really those are the wrong ones at 180nm, because it just won't be about that. A shrink could make it more about that, but I have yet to see a path to the shrink that doesn't require this 180nm P2 to fund it, so we need the "what is possible" discussion more than we need the "wish it were" type of discussion.
Really, it simply is what it is, barring those things we can do to maximize the 180nm design. Those may yield half the extreme power profile, and good code strategy may well get us to a couple of watts for a sweet-spot case - ignoring education, which may be better still, depending.
Another general macro level observation is the impact of education. That business does well.
Would this design do as well, or maybe better, at a reasonable power profile? USB can deliver 2.5 watts, so say we play it safe and target 1.5 watts. Will this design be viable, or maybe even potent, at that budget?
A lot appears to ride on that being true.
We have no idea what power problems the P2 had before now. They were not able to test it before, and they never made a successful run. It's very possible (and likely) that if the last run had "functioned" that it would have had similar high power usage.
It's a bummer that going to 65nm costs so much. Is 90nm an option at OnSemi? Would it help enough, and how much would it cost?
This is a frustrating situation, and it's clearly causing feathers to be ruffled all around. People need to step back, and stop pointing fingers, and just let Chip/Parallax and OnSemi work out a solution.
I am really interested in more OnSemi info.
However, the P32X32B is way more in line with what was initially targeted as the P2. The current P2 is so far from the original specs that it is really a P3. But none of this really matters - it is the ultimate chip(s) delivered that matters.
To me, the 5W figure was a major cause to stop and reflect. The P32X32B (P1B if you like) has way more appeal to me because:
1. Way less risk
2. Significantly quicker to market
3. Will use less power than P2 (read Chip's posts for proof)
4. P2 can still continue - see Chips post
-- slightly delayed, but may get caught up because P32X32B will prove OnSemi process.
5. What everyone wanted originally.
6. 32 cores and 512KB RAM @ ~200MHz will get a lot of free press - more exposure --> more sales?
Meanwhile P2 power can be refined and USB/SERDES done.
One big thing to come out of all this is the realisation that lots of simple cogs for peripherals trumps fewer, far more complex cogs that require complex software, multi-tasking, and multi-threading to get the job done!
As a final note: Parallax has never been driven by bean counters, but they won't risk the farm either, so our P1 supply is not in question.
Everyone wants more cores, and more memory, and more speed.
How much power are you willing to spend to get that, and are we not then back at the same problem the P2 currently faces?
Why not 16 cores, more memory, more speed, and at half the power?
That is, in and of itself, going to be at least 2x (200%) P1 on cores alone, and another 5x (500%) on top of that for speed, which I guess is 2x * 5x = 10x P1.
That's far, far above an incremental change to the P1, i.e. P1 -> P1B.
As an interim solution, it satisfies the core, memory and speed requirements well, and the power budget is not quite as insane.
Plus, fewer cores means more dies per wafer, and better revenue from which to fund the P2.
I certainly hope you aren't ordering your chips from Canada… the postage will kill you:)
I am not an engineer. And I am not trying to be either argumentative or facetious.
You didn't mention video, math performance, texture mapping, the learning curve for new users, company support, forum quality, or the open software base.
How long is the food chain?
From a running start, which supports a more robust end-user experience?
Rich
P2:
SDRAM/video can share 1 cog
serial ports: 2 per cog, needs 1 task (.25 cog)
We are down to 6.75 cogs left - heck, let's call it 6
1) Where did you see the LPC4370 using only 1/2W while doing video & SDRAM access? P.85 of the datasheet shows 1.5W total power dissipation.
2) Its core runs at 0.5V, on a much smaller process than 180nm.
3) The performance advantage is the SIX processors left over - twice what the 4370 has.
FYI, p.87: M4 running while(1), all peripherals shut down, both M0s shut down, no SDRAM or internal memory use: 72mA @ 180MHz from 3.3V.
That's 0.238W doing absolutely nothing, using only one of three cores, with all peripherals shut down. Sheesh.
Somehow, I doubt your 0.5W figure when doing a lot with the peripherals and the other two cores.
You are comparing an unproven worst case 5W for P2, with an unproven unsupported "typical" consumption for the 4370.
The data from the data sheet strongly suggests that typical consumption with three cores and peripherals would be over 1W.
Apples to apples arguments please.
Not quite, as you have assumed the idling logic ratio is the same on both cores.
The P2 uses 90% clock gating, so the large blocks of high-level logic not in use are not costing power.
Expecting 32 cores running 32-bit opcodes to be somehow 16 times(!) more Joule-efficient than 8 cores running 32-bit opcodes is 'optimistic'.
Well, I had one on my desk doing just that about this time LAST year. I can tell you that we've turned on every peripheral and CPU on the part and never seen over 200mA.
Let me throw out another case: those M4s are on 90nm processes, and they are blown away by ARM11s (mostly 65nm) running three times as fast, all for about $15 and typically around a watt.
I know it's impossible for the P2 to run Linux natively, so someone out there will emulate something that runs Linux - it will just take the whole weekend before you get the > prompt.
edit
Actually, I just checked - it was two years ago, with SGPIO capturing a QVGA sensor at 30Hz and dumping it into SDRAM, which had internal DMA running to the LCD controller to drive a VGA display at a full 60Hz. All this while the M4 was twiddling its thumbs. Later, with the same setup, we added the M0 driving the Ethernet to serve a web page and control the I2S output levels that were streaming in from USB under M4 control. Lots of stuff going on, and no meltdown.
https://forum.sparkfun.com/viewtopic.php?f=5&t=10314&start=420
One high value proposition the P2 offers is doing lots of things at once sans an OS.
As far as power goes, the biggest Propeller customer right now is most likely Parallax Education. Can you imagine a 3W to 5W ActivityBot? I certainly cannot.
The point is, I don't believe for a second that the typical usage of a P2 would be 5W.
I don't even believe 5W as the "worst case", as the assumptions of how much logic was toggling for the simulation are unrealistic (as far as I have been able to determine).
As I've been stating for quite a while, the posted 5W is a worst-case simulation, not a measured known quantity. Therefore the panic over it is not warranted, especially as the 1-2W typical-usage figure talked about for quite a while is consistent with a 3.5W-5W worst case.
An AVR 8-bitter booting Linux to a command prompt: http://dmitry.gr/index.php?r=05.Projects&proj=07.%20Linux%20on%208bit
Two hours to boot to a command prompt, with an ATmega1284p running at 24MHz emulating an ARMv5TE, with a bit-banged 30-pin SIMM.
At a WAG, a P2 with the sdram would be at least 60 times faster.
So two minutes to boot to a shell prompt sounds about right - not a weekend.
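That estimate is simple scaling (and the 60x factor is explicitly the poster's WAG, not a benchmark):

```python
# The boot-time estimate, restated: the AVR demo took ~2 hours to reach
# a shell prompt; scale by the post's guessed 60x speedup for P2 + SDRAM.
avr_boot_minutes = 2 * 60
speedup_guess = 60                 # the post's WAG, not a measurement
p2_boot_minutes = avr_boot_minutes / speedup_guess
print(p2_boot_minutes, "minutes")
```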
The point is, we are right now looking at 180nm power consumption, without the latest simulation results, at 160MHz. Running at, say, 40MHz for applications that don't need the horsepower drops the power envelope a LOT.
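A rough sense of how much headroom clocking down buys (a generic CMOS scaling sketch, not P2 simulation data): dynamic power scales about linearly with frequency at fixed voltage, and with the square of voltage if the core supply can drop too.

```python
# Generic dynamic-power scaling: P ~ f * V^2 (static leakage ignored).
# Illustrative only - these are not P2 simulation results.
def scale(p_ref_w, f_ref_mhz, f_mhz, v_ref=1.8, v=1.8):
    """Scale a reference dynamic power figure to a new clock and voltage."""
    return p_ref_w * (f_mhz / f_ref_mhz) * (v / v_ref) ** 2

worst_case_w = 5.0                        # the quoted 160 MHz worst case
print(scale(worst_case_w, 160, 40))       # 40 MHz at the same 1.8 V core
print(round(scale(worst_case_w, 160, 40, v=1.5), 2))  # lower still at 1.5 V
```

Even without touching the voltage, a 40MHz clock cuts the 5W worst case to about 1.25W by this first-order model, which is the "drops the power envelope a LOT" point in numbers.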
Gentlemen,
Seriously.
Really.
Would you kindly stop saying 3W-5W will be required to run a P2 until we have a better handle on it?
WE DO NOT KNOW THAT.
That is the WORST CASE: all 8 cogs firing all logic (MUL, DIV, MAC, CORDIC, VIDEO - everything at once) at 160MHz with a 1.8V core, before Chip does his magic power reduction.
And can we stop trying to justify feature cuts that will not noticeably affect the power envelope (hubexec, tasks, threads) by saying they will drop power requirements a lot, without providing proof of HOW they will do that?
Also, if you propose a 32-core P1 with a 512KB hub (which would be a nice chip after the P2), please show how, in the same process, it will have much lower power utilization. I already showed it will be a lot slower than the P2 for many applications.
Can we please remember the scientific method and not panic yet?
I propose that every power-cutting measure be required to show HOW it will cut power utilization, backed up with references and numbers. No hand waving.
Need to keep in mind that a lot of what makes HUBEXEC attractive on the P2 is all the new instructions using pointers, relative addresses, etc.
With the P1 instruction set HUBEXEC would look a lot more like hardware assisted LMM, which isn't bad, it just isn't the same as P2 HUBEXEC.
C.W.