Prop 1 failure rate calculation
madrfskills
Folks,
I'm uploading revision 0.0 of a reliability calculation tool for the Prop 1, using the methodology of MIL-HDBK-217F Notice 2. Please review and comment.
Notes:
1) I've locked the worksheet to keep formulae from getting modified accidentally. If you want to edit this, go ahead - there is no password set, all you need to do is unlock it. Obviously the input cells in the "input section" can be modified.
2) The sheet includes an assumed thermal impedance of 10-11 DegC / Watt for the props, depending on the package. This seems low to me, but I'm using the number because it is the assumption used in the handbook.
3) Package reliability for the LQFP and QFN are equal in the handbook. In the real world leadless devices have higher failure rates. That said, this is not really an error: it is not the chip's package that fails, and in the MIL-HDBK-217F model this is considered a PCB failure. And, indeed, there is a section of the handbook dedicated to PCB failure rates. A full system-level analysis would take this into account, but that's a topic for another spreadsheet...
I hope this is helpful. Please review.
V/R
Mike
Folks,
A product I am working on requires a complete failure rate analysis per MIL-HDBK-217F. Yes, I know the many warts and weaknesses of -217, but a contract is a contract and I already said "Yes, sir."
If anyone has already spent the bucks to do lifecycle testing of Prop 1 and has the FIT numbers, please add these to the Prop 1 data sheet; it would save people much analysis pain. I assume this hasn't been done, so I'm making the following assumptions. Could anyone please check these??
1) Number of transistors = 3 million
2) Feature size = 0.35 micron
3) Die area = 0.53 cm^2
4) Failure activation energy = 0.35 electron volts (CMOS VHSIC-like)
5) I'm assuming that the chip fab and packager are non-QPL, non-QML for manufacturing processes
6) I'm assuming the electrical overstress hardness is between 4,000 and 10,000V, single event
Here is what I just don't know:
A) Thermal resistance from junction to ambient for the QFN package. I'm assuming 70DegC/Watt, using 20 thermal vias under the ground paddle and lots of copper.
B) Anything about the quality screening. Does Parallax use any formal screening processes such as MIL-STD-883? Specifically, are TM1015 burn-ins done? To what level?
Is TM5004 (temp extremes) run?
Is TM2010/17 (internal visual) run?
Is TM2012 (X-ray) run?
Is TM2009 (external visual) run?
Is TM2023 (non-destructive bond pull) run?
Right now I'm just assuming the "non-screened / no QC" defaults in the failure model.
Using the assumptions I have right now, what I'm estimating looks pretty good: for an unsheltered naval environment, 70DegC ambient temperature, 80mA consumption (8 cogs running full-out) -- the estimated failure rate is 19 per million hours.
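The junction temperature behind that estimate can be checked with a few lines of Python. This is only a sketch: the 3.3 V supply voltage is an assumption (the post gives current, not power), and the temperature factor uses the Arrhenius form with a 0.1 prefactor and a 25 DegC reference, which is how the handbook's CMOS temperature factor is usually written. The function names are made up for illustration.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def junction_temp_c(ambient_c, theta_ja_c_per_w, current_a, supply_v):
    """Junction temperature = ambient + (thermal resistance * dissipated power)."""
    power_w = current_a * supply_v
    return ambient_c + theta_ja_c_per_w * power_w


def temperature_factor(tj_c, ea_ev, ref_k=298.0):
    """Arrhenius-style pi_T (assumed form: 0.1 prefactor, 298 K reference)."""
    tj_k = tj_c + 273.0
    exponent = -(ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / tj_k - 1.0 / ref_k)
    return 0.1 * math.exp(exponent)


# Values from the post: 70 DegC ambient, 70 DegC/W, 80 mA, Ea = 0.35 eV.
# The 3.3 V supply is an assumption.
tj = junction_temp_c(ambient_c=70.0, theta_ja_c_per_w=70.0,
                     current_a=0.080, supply_v=3.3)
pi_t = temperature_factor(tj, ea_ev=0.35)
print(f"Tj = {tj:.1f} DegC, pi_T = {pi_t:.2f}")
```

Under those assumptions the junction lands at roughly 88 DegC, and the temperature factor comes out close to (not exactly at) the value quoted in the breakdown, which is about what rounding differences would explain.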
Specific breakdowns:
lambda bd (die base failure rate) = 0.16
pi mfg (manufacturer QPL) = 2 (i.e. none)
pi t (temperature factor) = 1.07 (junction temp estimated at 88DegC)
pi cd (die complexity factor) = 53
lambda bp (package failure rate) = 0.0029
pi e (environmental) = 6 (naval unsheltered)
pi q (quality factor) = 10 (unknown quality / no characterization)
pi pt (package type factor) = 6.1 (plastic SMD)
lambda EOS = 0.029 (4000-10000 ESD survivability)
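As a cross-check, the factors above can be rolled up in a short Python sketch. It assumes the usual two-term structure (die term plus package term plus the EOS term); the function and variable names are illustrative and not taken from the spreadsheet.

```python
def part_failure_rate(lambda_bd, pi_mfg, pi_t, pi_cd,
                      lambda_bp, pi_e, pi_q, pi_pt, lambda_eos):
    """Return (total, die term, package term) in failures per million hours.

    Assumed structure:
        lambda_p = lambda_bd*pi_mfg*pi_t*pi_cd + lambda_bp*pi_e*pi_q*pi_pt + lambda_eos
    """
    die = lambda_bd * pi_mfg * pi_t * pi_cd
    package = lambda_bp * pi_e * pi_q * pi_pt
    return die + package + lambda_eos, die, package


# Values copied from the breakdown in this post.
total, die, package = part_failure_rate(
    lambda_bd=0.16, pi_mfg=2, pi_t=1.07, pi_cd=53,
    lambda_bp=0.0029, pi_e=6, pi_q=10, pi_pt=6.1,
    lambda_eos=0.029,
)
print(f"die = {die:.2f}, package = {package:.2f}, total = {total:.2f} per million hours")
```

With these inputs the die term is about 18.1 and the package term about 1.06, so the total of roughly 19 per million hours is dominated almost entirely by the die complexity factor.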
Thanks!
V/R
Mike
Comments
One thing we can all attest to is the reliability of the prop chip. I err'd on a prototype and placed a 1K in series to a prop input pin with about +6V on it. That was actually out of spec, but when thinking about it one night I realised that I completely forgot that the voltage swings to -6V too. I ran the pcb for a day solid with no problems. Both 33K and 100K work within spec.
I will open an internal support ticket for this request. Please contact me off-line so we can obtain your complete contact information. I'm at kgracey@parallax.com.
The earliest I can assign this to one of our FAEs is May 2nd - if you are in need of more immediate details please let me know and I can have the urgent requests addressed beforehand. I chose May 2nd because our internal company structure and processes are changing significantly at that time, or at the very latest one week later. Our team and systems will be far more conducive to managing these requests between having available FAEs, formal support portal software, etc.
We will incorporate any information we provide to you into our datasheet.
Thanks,
Ken Gracey
Thanks - I will definitely be in touch through those paths. I thought I'd hit the forum first because I've found an incredible wealth of information here!
From an anecdotal perspective I can attest to the ruggedness of the Prop. I've not used it in operational hardware per se, but have several designs in which I use props to automate system-level tests inside of environmental chambers. In my application a prop is incorporated into a panel of 36 to 50 target PCBs and tests their functionality as the group is thermally cycled from -40 to +85 DegC. We dwell at each extreme for 30 minutes to an hour - depending on the test goal - and then thermally shock the system by heating or cooling very rapidly. I've had props put up with 100-200 cycles of this type of abuse. I've never seen the prop die before the PIC micros used on the target hardware - that should tell you something.
What I'm hoping to do with the -217F analysis is see if I can help establish some crude estimates of life expectancy using perhaps flawed but very common systems engineering tools. Obviously I'll share results with the community, which is why I'm not using my actual work email...
Cheers,
Mike
Roger all; I will be in touch offline and definitely appreciate the assistance.
To the extent I can add anything to the knowledge base through my analysis and/or experience I will share it with Parallax and the Prop community. It's a great product and has helped me out of a lot of jams.
V/R
Mike
Thanks for posting your inquiry and findings to the rest of us. I would not have had an opportunity to observe an investigation into a part per these standards otherwise. Thanks for sharing your expertise!
You must have one fanta$tic environmental chamber
Have you done any 4 corners testing on Propeller with that range?
Jazzed, I've not been testing the Prop per se, but using it as part of a test rig, so I don't do 4 corners testing on the parts and have been providing them with carefully regulated power. The props are clocked with LT1799 resistor-set silicon oscillators, which are remarkably stable over temperature, so I am not testing over a significant clock frequency space.
For environmental chambers I've been very happy with the Integra temperature plates - these are essentially well-regulated hot plates that incorporate liquid CO2 cooling. For convection type equipment I like the LR Technologies bench-top chambers; these will give you a 12 minute ramp from -55 to +125DegC. I know that -40 to +85 seems heroic, but the full military range is -55 to +125. I've had props run at +125, but have not tested any at the low end. The full automotive range is -40 to +125 as well. What we find is that fast cycling tends to make any material flaws or CTE mismatches between components, boards, and so forth rather obvious.
-40C is kind of normal for the equipment I work on around here, Finland.
And things can also get rather roasty in outdoor metal cabinets in the summer time.
Does this only apply to well designed final products, or does it also include hackers doing smoke testing in the lab?
Where does the assumption "can withstand 4-16kV overstress" come from? (I always have to ask about assumptions)
I used to see MTBF numbers for storage that appeared to be a longer time period than the disk media could physically retain the data. For example, a co-worker told me that CD media deteriorates after a couple years, while the MTBF indicates decades. (Issue was CD disks deteriorated and data archives were lost). Is this a marketing abuse or is there a further aspect of this estimation?
Excellent questions; I'll take these in turn.
What the 37 year number means is that, under the assumption of proper component handling, installation, and use within data sheet parameters, and for the chosen environment (ground benign environment, 80mA current draw), the reliability of the propeller chip (chip itself) falls to about 37% after 37 years. The reason this is 37% and not 50% is that an exponential failure distribution is assumed (F(t) = integral(lambda*exp(-lambda*t),0,t) = 1 - exp(-lambda*t), so at t = 1/lambda the surviving fraction is exp(-1), or about 37%). This is usually valid for electronic components which have undergone burn-in and screening to eliminate the "infant mortality" failures and reach a steady-state failure rate. Note that the rising failure rate at some late time (i.e. "wearout") depicted on the traditional bathtub curve usually doesn't apply to semiconductor components, but it is very much a fact of life for mechanical components, light bulbs, etc. For a commercial micro without special quality control screening processes, this is actually a pretty good result.
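The 37% figure can be reproduced with a small Python sketch of the constant-failure-rate model. The 37 year MTBF is the number discussed above; the function name and the median-life calculation are added here purely for illustration.

```python
import math

HOURS_PER_YEAR = 8760.0


def reliability(t_hours, lambda_per_hour):
    """Surviving fraction R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-lambda_per_hour * t_hours)


mtbf_years = 37.0
lam = 1.0 / (mtbf_years * HOURS_PER_YEAR)  # failures per hour

# At t = MTBF the surviving fraction is exp(-1), not 0.5.
r_at_mtbf = reliability(mtbf_years * HOURS_PER_YEAR, lam)

# The 50% point (median life) arrives earlier, at MTBF * ln(2).
median_years = mtbf_years * math.log(2.0)

print(f"R(MTBF) = {r_at_mtbf:.3f}, median life = {median_years:.1f} years")
```

In other words, under this model half of the population would be expected to have failed after roughly 26 years, well before the 37 year MTBF is reached.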
Therefore, yes, this applies only to well-designed products. If one exceeds data sheet parameters, does caveman soldering, encounters a vicious (maladjusted and therefore high impact) pick and place robot, etc. then the MIL-HDBK model does not apply.
The ESD hardness is an assumption on my part based on industry norms for 0.35 micron process technology; generally all modern commercial 0.35 micron class devices have input protection on all pins sufficient to handle a 15kV human body model ESD threat. I did some X-rays of a prop and reached the conclusion that the I/O pad device geometries are consistent with norms, and actually look a bit conservative - therefore I'm assuming it's a 4-16kV class device. I floated the number to Parallax and there was no dissension. For my products, all external leads are tested to at least 15kV HBM, 1000 shots, both polarities and I've not had a prop fail - that said, I haven't hit the prop directly and ESD damage usually takes a long time to manifest itself... As an alternative, you can go conservative and assume no ESD hardness, which takes the lambda_EOS number from 0.029 to 0.065 failures / million hours.
One other aspect that is not rolled up in the calculation I did is PCB reliability. For a concrete example, here are the results for a recent design I'm using the prop on: in my application, the ambient temperature inside the equipment case is 70DegC and the environment is best described as "Naval Unsheltered". I'm using the leadless chip carrier package. The model predicts a failure rate of about 12 per million hours, yielding an MTBF of 9.5 years. You must compute reliability for everything on the board; in my relatively simple design the prop (the most complex chip on the board by far) accounts for only 33% of the total failure rate... the next biggest threat is the 25 decoupling caps I have scattered around the board. Doing a Pareto analysis is helpful because it enables you to understand what is worth fixing and what can be ignored. In the first pass design the caps dominated and I spent the extra $ to better specify these. Now let's consider my PCB choices:
CASE 1: I'm a loser addicted to using cheap FR-4 circuit board laminates in safety critical systems. FR-4 has a coefficient of thermal expansion of 18 parts per million and the QFN package has a CTE of about 7ppm. In my application I have a thermal cycle about every four hours. Using section 16.2 of MIL-HDBK-217F I achieve a failure rate for the PCB/propeller package combination of 46 failures / million hours; i.e. MTBF is only 2.5 years. Worse, that's just for one component ... you need to do the calculation again for every component on the board. This board is bound to show significant failures in field well before my projected retirement date. Not good.
CASE 2: Same as case 1 but using the LQFP package. Lambda drops to 0.04 failures / million hours; i.e. MTBF for the PCB/prop interface is over 2,000 years. BUT - this PCB will continue to torment my surface mount ceramic chip decoupling caps, so who cares? A cracked decoupling cap will fail you just as surely as a broken board/chip bond. My retirement is still in doubt.
CASE 3: I keep the small QFN package but I spent the money on a good epoxy/kevlar laminate with a CTE of 7ppm. Now lambda is 0.03 failures / million hours and all my MLCC caps are likely to survive and prosper.
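The MTBF figures quoted above follow directly from the failure rates. A minimal Python sketch of the conversion, assuming a constant failure rate and 8,760 hours per year (the labels are just shorthand for the cases in this post):

```python
HOURS_PER_YEAR = 8760.0


def mtbf_years(failures_per_million_hours):
    """MTBF in years for a constant failure rate given in failures per 1e6 hours."""
    return 1.0e6 / failures_per_million_hours / HOURS_PER_YEAR


# Failure rates quoted in this post (failures per million hours).
cases = {
    "whole-model estimate (70DegC, naval unsheltered)": 12.0,
    "CASE 1: QFN on FR-4": 46.0,
    "CASE 2: LQFP on FR-4": 0.04,
    "CASE 3: QFN on CTE-matched laminate": 0.03,
}

for label, rate in cases.items():
    print(f"{label}: {mtbf_years(rate):,.1f} years")
```

Because failure rates of parts in series simply add under this model, a single 46 per million hours board/package interface would swamp everything else on the board, which is why the laminate or package choice matters so much here.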
There are tradeoffs everywhere. For example, as a rule of thumb most people do not use surface mount capacitors bigger than 1206 size due to their high susceptibility to in-service cracking. 0402 and 0603 sizes seem best suited to surviving thermal shock. BUT, the very small 0402 and 0603's are highly susceptible to damage if someone's pick and place robot is rabid. Also, they're harder to hand solder, so the caveman fixing your gear years later may give you a better repair if you used, say, 0805's or 1206's. On the third hand, if using very small caps helps me shrink my board area, the mechanical resonant frequencies of the board increase and I find that equipment vibration is less of a threat ... and on average maybe I can tolerate a higher risk of whatever Godzilla is lurking in the board house's back room. On the fourth hand, if I don't match my board CTE with my caps, or use hard potting with a bad CTE match, I'm toast anyways.
Getting products to last 10+ years is a fun engineering challenge!
With respect to your optical media question: I suspect it's a case where the quoted MTBF applies to only a single component in the system (say, metallization integrity) and doesn't apply to something else which is forcing the overall system reliability down (say, dimensional stability of the plastic substrate). For whatever it's worth, you may want to contract a data recovery house - they can pull at least some data off of even highly degraded media. It will cost you serious $$, so you have to think hard about what the data are worth to you...
V/R
Mike
Thanks for the detailed answers. Many of these calculations are dark arts to me. I now know enough to realize I don't know enough, and that experts can be found that can help.
No worries on the optical media, the parties involved learned lessons measured in specific dollar amounts.
Also where did you learn all this reliability and failure mode stuff? None of my electrical engineering courses got that practical. (admittedly as an ME I didn't have many EE courses)
Lawson
There are not too many glass reinforced laminates which have both a good CTE match with components in the X-Y plane and good match to copper plated through-holes in the Z direction. To get there I think you do need a very high fill ratio, and then your manufacturing costs go up due to excessive tool wear. Another approach is to use an invar inner core layer to counteract FR-4 expansion, but I've never used this due to concerns about creep and strain against small vias.
Some of the aramids can deliver -4ppm/DegC. A notable, very good product is Arlon 45NK - a good, general purpose epoxy/aramid. At high frequencies I've had great success with Rogers RO4003C, a ceramic-filled thermoplastic. It gives an excellent compromise of RF and mechanical performance and - best of all - is usually in stock somewhere.
Learned most of my reliability work back in the good old days of AT&T. If you can find a copy, the AT&T Reliability Handbook is probably the best reference you'll ever find. After that came divestiture and a corporate culture which promotes pushing cheap junk out the door as fast as possible. Been bouncing back and forth between the government and academia. I think some of the graduate systems engineering curricula now offer pretty decent courses in reliability, and some of the larger schools such as University of Maryland have courses which cover it in great depth. But for the most part we learn by trial and lots and lots of error...
V/R
Mike
I think the reliability depends a lot on how hard you drive the outputs and how much you overdrive the input protection diodes with higher voltages. Power supply design is also very important; proper decoupling and good quality capacitors go a long way in keeping the Prop chips happy.