Design Techniques for Critical Embedded Systems
Hi All,
I will soon be starting a senior design project working on an embedded system that needs to be as reliable as possible, with no downtime. I'm starting this thread to ask for feedback from the professional engineers out there who make critical embedded systems for a living (flight controllers, power grid electronics, etc.) on what design techniques to use to make the system have as few bugs as possible and to ensure suitable redundancy.
The system will be a battery monitoring system for a large ($1M) array of batteries. These are the same batteries used in Balqon electric trucks, but put into a cabinet and hooked up to the electrical system of a building. The monitor we will be implementing will watch over the variables associated with the array and ensure that the batteries are functioning properly. Since this has never been done before, the system needs to be as reliable as possible so that hidden problems are dealt with quickly and safely.
This project will mostly be software based. The hardware will likely consist of interface board(s) to the battery systems and server(s) to process, store, and serve/display the data in real time.
I am looking for feedback on engineering techniques to ensure that the system is built as reliably as possible. Stories are definitely appropriate, as are approaches to bug-free design.
Comments
The power could go off for a long time. The power might not be mains voltage; it might be a million volts (lightning strike). You can have thermal stresses causing parts to fail. You can have mechanical stresses, e.g., at plugs or where wires go into boards. You can have oxidation. Water (both spray and humidity). You can have a colony of ants take up residence in your control box because it is nice and warm. Or mice. Or venomous spiders.
And then there is the software side. As an example, my work computer decides about once a month to stop connecting to the internet. So do many computers at other sites where I work. You reboot Windows and it still doesn't work. So you take the box down to the computer guy and it works fine. What is going on? Well, the answer is that the motherboard contains a number of microcontrollers, including one that handles the Ethernet connection. When you turn off your computer, it does not turn off everything on the motherboard, as you can see by looking inside the box, where you will see a little LED glowing on the motherboard. If you unplug the computer, this LED takes about 30 seconds to go off. So when you take the computer down to get it fixed, it is off for more than 30 seconds, which ends up rebooting everything.
The person who wrote the code for those motherboard controllers has not done a complete analysis of how things fail. You could always triplicate things like the Space Shuttle did, and then take a vote on the answer. If your microcontroller fails, the pins are likely to fail either stuck high or stuck low, but they are unlikely to fail in a way that produces, say, a 1 Hz square wave. So what you can do is have some code that produces a 1 Hz waveform on a pin, in amongst all the other things it is doing, and then a 'watchdog' circuit with another chip that resets everything if that 1 Hz signal fails. Such chips exist, or you can build your own out of op amps and 555 timers and the like.
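To make that concrete, here is a minimal sketch of the firmware half of such a scheme. The pin number and the toggle_pin()/millis() helpers are placeholders for whatever your platform provides; the external reset circuit (supervisor chip, 555 monostable, or a second small micro) is separate hardware.

```c
/* Heartbeat sketch: the main loop toggles a dedicated pin so an external
 * watchdog circuit can verify the firmware is still cycling. Pin number and
 * the toggle_pin()/millis() helpers are hypothetical placeholders. */

#include <stdint.h>

#define HEARTBEAT_PIN        7       /* hypothetical GPIO driving the watchdog */
#define HEARTBEAT_PERIOD_MS  500     /* toggle every 500 ms -> 1 Hz square wave */

extern void     toggle_pin(uint8_t pin);   /* platform-specific GPIO toggle */
extern uint32_t millis(void);              /* platform-specific millisecond tick */

void main_loop(void)
{
    uint32_t last_toggle = millis();

    for (;;) {
        /* ... read battery voltages, temperatures, currents, log the data ... */

        /* Only kick the heartbeat when the loop actually completes its work;
         * a hung sensor read or a stuck state machine stops the 1 Hz signal,
         * and the external circuit then resets the whole board. */
        if ((uint32_t)(millis() - last_toggle) >= HEARTBEAT_PERIOD_MS) {
            toggle_pin(HEARTBEAT_PIN);
            last_toggle = millis();
        }
    }
}
```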
Another concept is to build things in layers. Biology does this - for instance, the heart keeps ticking along even when disconnected from the brain, but the brain has the ability to modify the heart's rate. This means that if the brain goes off (e.g., during an epileptic seizure, or if someone hits you over the head with a club), the heart keeps working.
So if there is a part of the circuit that can be controlled with simpler components like relays and op amps and 555 timers, then do it the simple way. On top of that, you can add a clever microcontroller layer that checks everything is working as it should and sends you an SMS if it fails.
So if the sun decides to go crazy with a violent electrical storm and the EEPROM in your microcontroller gets zapped, a 555 might still keep things limping along.
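As a rough illustration of that split, here is a sketch of what the supervisory layer might look like, assuming hypothetical read_pack_voltage(), hardware_cutoff_tripped(), and send_sms() drivers. The point is that the micro only watches and reports, while the simple hardwired layer does the actual protection.

```c
/* Supervisory-layer sketch: the hardwired layer (comparator + relay) handles
 * the actual disconnect; the microcontroller only watches and reports.
 * read_pack_voltage(), hardware_cutoff_tripped(), and send_sms() are
 * hypothetical placeholders for your own drivers and a GSM/modem module. */

#include <stdbool.h>

#define PACK_OVERVOLTAGE_LIMIT  58.0f   /* illustrative threshold, volts */

extern float read_pack_voltage(void);
extern bool  hardware_cutoff_tripped(void);   /* sense line from the relay layer */
extern void  send_sms(const char *msg);

void supervise(void)
{
    float v = read_pack_voltage();

    /* The micro never opens the contactor itself; it just checks that the
     * simple layer did its job and tells a human when something looks wrong. */
    if (v > PACK_OVERVOLTAGE_LIMIT && !hardware_cutoff_tripped())
        send_sms("Pack overvoltage but hardware cutoff did NOT trip");

    if (hardware_cutoff_tripped())
        send_sms("Hardware cutoff has tripped - check the array");
}
```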
And then there is the issue of malicious hacking. Relays and op amps and timers are more immune to hacking than microcontrollers. Check out the story of the Stuxnet virus http://en.wikipedia.org/wiki/Stuxnet or this story on how to exploit the known weaknesses of a system in order to take it over http://www.haaretz.com/news/middle-east/iran-official-we-tricked-the-u-s-surveillance-drone-to-land-intact-1.401641
In general, ask yourself how things can go wrong. What inputs are you measuring, what are the outputs, and what is the worst-case scenario for any combination of outputs? As a practical example, even if your traffic lights are controlled by the most robust and simple code you can think of, it is still useful to have a failsafe mechanism that switches the lights to flashing yellow if two green lights happen to come on at the same time.
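A minimal sketch of that kind of output-sanity check, with hypothetical bit assignments and set_lights()/enter_failsafe_flashing_yellow() helpers standing in for real drivers:

```c
/* Output-sanity sketch for the traffic-light example: before driving the
 * outputs, check the commanded state against the combinations that must
 * never occur, and fall back to a safe state if one shows up. */

#include <stdint.h>
#include <stdbool.h>

#define NS_GREEN  (1u << 0)   /* north-south green */
#define EW_GREEN  (1u << 1)   /* east-west green   */

extern void set_lights(uint8_t state);              /* hypothetical driver */
extern void enter_failsafe_flashing_yellow(void);   /* independent safe fallback */

static bool state_is_forbidden(uint8_t state)
{
    /* Both greens on at once is the one combination that must never happen. */
    return (state & NS_GREEN) && (state & EW_GREEN);
}

void drive_outputs(uint8_t commanded_state)
{
    if (state_is_forbidden(commanded_state)) {
        enter_failsafe_flashing_yellow();
        return;
    }
    set_lights(commanded_state);
}
```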
Do you have a schematic and a board design?
Guess which part failed? Yes, that AND gate! (It almost caused a launch!)
Anyway, I think a good design is two separate systems, each using a totally separate design and parts. You might want to Google the following words...
aircraft fault tolerant systems
Three separate systems, with majority voting on the outputs. But, then, what will determine the reliability of the vote counter?
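For analog readings like cell voltages, one common way to take that vote is a mid-value select (median of three), which outvotes any single faulty channel. A sketch, with hypothetical read_channel_*() placeholders:

```c
/* Mid-value-select sketch for the 2-of-3 vote: with three independent
 * readings, taking the median means any single faulty channel (stuck high,
 * stuck low, or wildly off) is outvoted. The read_channel_*() functions are
 * hypothetical placeholders, and the caveat above still applies: the voter
 * itself is now the part whose reliability you have to argue for. */

#include <stdint.h>

extern int32_t read_channel_a(void);   /* e.g., three independent ADC chains */
extern int32_t read_channel_b(void);
extern int32_t read_channel_c(void);

/* Median of three: max(min(a,b), min(max(a,b), c)). */
static int32_t median3(int32_t a, int32_t b, int32_t c)
{
    if (a > b) { int32_t t = a; a = b; b = t; }   /* ensure a <= b */
    if (b > c) { b = c; }                         /* b = min(b, c) */
    return (a > b) ? a : b;                       /* the median    */
}

int32_t voted_reading(void)
{
    return median3(read_channel_a(), read_channel_b(), read_channel_c());
}
```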
-Phil
My ISP went offline for a weekend, a year or two ago. Of course it turned out to be the redundant (backup) generator that pulled the whole system down.
Since battery performance deteriorates in cold temperatures, make sure your system will work well under those conditions too, or in the case of a mains grid failure during a big cold snap, when the batteries may be needed even more.
Consider a good conformal coating to exclude moisture.
Study up on protective devices: fuses, circuit breakers, surge protection, etc.
The most important thing you can do during the design phase is to rapid-prototype, then blow it up. Find the modes of failure. I used to use two or three really ruthless engineers who would bang the switches on and off, short out cables, or pull out plugs at the worst possible times. By doing this as early as possible in the design cycle, we had time to do major structural redesigns so that fundamental issues could still be rectified. As a rule of thumb, if you haven't made four major revisions to the design before manufacture, then there will be problems later. We even used to drop the finished items off the back of moving vehicles so that we could see which parts might get damaged during shipping. We also put prototypes in field conditions as early as possible, and they usually failed in different ways than we had anticipated. And we gave the software to the worst-qualified user we could find to see how they made it fail.
Reliability is about listening to that voice in your head that keeps you awake at night. If you're not sure about part of the design then test it or change it. No compromises.
Another thing: whenever possible, buy it, don't build it. You don't have time to do low-level debugging of trivial systems or components. It's more important to spend more time on testing and less time on vanity, and as engineers we always think we can design better than anyone else. For high-reliability design, it's not about how clever you are, it's about how fast you can figure out how stupid you are.
Finally, multiple redundant systems are all well and good, but the latest standards call for redundant, independent systems based on different technologies. It turns out that different technologies have different modes of failure, so an inferior technology can actually support a superior one when used as a backup. Take a look at the SIL standards to see how this works.
Good luck with your project.
-Phil
So along with all this other good advice, I'd suggest building the human interface as mistake-proof and self-correcting as possible.
Absolutely!
Classic! This is going on my cube wall.
Look into the archives of IEEE and ACM if you can. I recall a series the ACM did on notable software failures such as the jet which went into inverted flight when it realized it had crossed the equator and other "discoveries".
FF