Design Techniques for Critical Embedded Systems
Hi All,
I will soon be starting a senior design project working on an embedded system that needs to be as reliable as possible, with no downtime. I'm starting this thread to ask for feedback from the professional engineers out there who make critical embedded systems for a living (flight controllers, power grid electronics, etc.) on what design techniques to use to make the system have as few bugs as possible and to ensure suitable redundancy.
The system will be a battery monitoring system for a large ($1M) array of batteries. These are the same batteries used in Balqon electric trucks, but put into a cabinet and hooked up to the electrical system of a building. The monitor we will be implementing will watch over the variables associated with the array and ensure that the batteries are functioning properly. Since this has never been done before, the system needs to be as reliable as possible so that hidden problems are dealt with quickly and safely.
This project will mostly be software based. The hardware will likely consist of interface board(s) to the battery systems and server(s) to process, store, and serve/display the data in real time.
I am looking for feedback on engineering techniques to ensure that the system is built as reliably as possible. Stories are definitely appropriate, as are approaches to bug-free design.
Comments
The power could go off for a long time. The power might not be mains voltage; it might be a million volts (lightning strike). You can have thermal stresses causing parts to fail. You can have mechanical stresses, e.g., at plugs or where wires go into boards. You can have oxidation. Water (both spray and humidity). You can have a colony of ants take up residence in your control box because it is nice and warm. Or mice. Or venomous spiders.
And then there is the software side. As an example, my work computer decides about once a month to stop connecting to the internet. So do many computers at other sites where I work. You reboot Windows and it still doesn't work. So you take the box down to the computer guy and it works fine. What is going on? Well, the answer is that the motherboard contains a number of microcontrollers, including one that handles the Ethernet connection. When you turn off your computer, it does not turn off everything on the motherboard, as you can see by looking inside the box, where you will see a little LED glowing on the motherboard. If you unplug the computer, this LED takes about 30 seconds to go off. So when you take the computer down to get it fixed, it is off for more than 30 seconds, which ends up rebooting everything.
The person who wrote the code for those motherboard controllers has not done a complete analysis of how things fail. You could always triplicate things like the Space Shuttle did, and then take a vote on the answer. If your microcontroller fails, the pins are likely to fail either stuck high or stuck low, but they are unlikely to fail in a way that produces, say, a 1 Hz square wave. So what you can do is have some code that produces a 1 Hz waveform on a pin, in amongst all the other things it is doing, and then a 'watchdog' circuit with another chip that resets everything if that 1 Hz signal fails. Such chips exist, or you can build your own out of op amps and 555 timers and the like.
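To make that concrete, here is a minimal sketch of the firmware half of such a scheme. The pin number and the toggle_pin()/millis() helpers are placeholders for whatever your platform provides; the external reset circuit (supervisor chip, 555 monostable, or a second small micro) is separate hardware.

```c
/* Heartbeat sketch: the main loop toggles a dedicated pin so an external
 * watchdog circuit can verify the firmware is still cycling. Pin number and
 * the toggle_pin()/millis() helpers are hypothetical placeholders. */

#include <stdint.h>

#define HEARTBEAT_PIN        7       /* hypothetical GPIO driving the watchdog */
#define HEARTBEAT_PERIOD_MS  500     /* toggle every 500 ms -> 1 Hz square wave */

extern void     toggle_pin(uint8_t pin);   /* platform-specific GPIO toggle */
extern uint32_t millis(void);              /* platform-specific millisecond tick */

void main_loop(void)
{
    uint32_t last_toggle = millis();

    for (;;) {
        /* ... read battery voltages, temperatures, currents, log the data ... */

        /* Only kick the heartbeat when the loop actually completes its work;
         * a hung sensor read or a stuck state machine stops the 1 Hz signal,
         * and the external circuit then resets the whole board. */
        if ((uint32_t)(millis() - last_toggle) >= HEARTBEAT_PERIOD_MS) {
            toggle_pin(HEARTBEAT_PIN);
            last_toggle = millis();
        }
    }
}
```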
Another concept is to build things in layers. Biology does this - for instance, the heart keeps ticking along even when disconnected from the brain, but the brain has the ability to modify the heart's rate. This means that if the brain goes off (e.g., during an epileptic seizure, or if someone hits you over the head with a club), the heart keeps working.
So if there is a part of the circuit that can be controlled with simpler components like relays and op amps and 555 timers, then do it the simple way. On top of that, you can add a clever microcontroller layer that checks everything is working as it should and sends you an SMS if it fails.
So if the sun decides to go crazy with a violent electrical storm and the EEPROM in your microcontroller gets zapped, a 555 might still keep things limping along.
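As a rough illustration of that split, here is a sketch of what the supervisory layer might look like, assuming hypothetical read_pack_voltage(), hardware_cutoff_tripped(), and send_sms() drivers. The point is that the micro only watches and reports, while the simple hardwired layer does the actual protection.

```c
/* Supervisory-layer sketch: the hardwired layer (comparator + relay) handles
 * the actual disconnect; the microcontroller only watches and reports.
 * read_pack_voltage(), hardware_cutoff_tripped(), and send_sms() are
 * hypothetical placeholders for your own drivers and a GSM/modem module. */

#include <stdbool.h>

#define PACK_OVERVOLTAGE_LIMIT  58.0f   /* illustrative threshold, volts */

extern float read_pack_voltage(void);
extern bool  hardware_cutoff_tripped(void);   /* sense line from the relay layer */
extern void  send_sms(const char *msg);

void supervise(void)
{
    float v = read_pack_voltage();

    /* The micro never opens the contactor itself; it just checks that the
     * simple layer did its job and tells a human when something looks wrong. */
    if (v > PACK_OVERVOLTAGE_LIMIT && !hardware_cutoff_tripped())
        send_sms("Pack overvoltage but hardware cutoff did NOT trip");

    if (hardware_cutoff_tripped())
        send_sms("Hardware cutoff has tripped - check the array");
}
```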
And then there is the issue of malicious hacking. Relays and op amps and timers are more immune to hacking than microcontrollers. Check out the story of the Stuxnet virus http://en.wikipedia.org/wiki/Stuxnet or this story on how to exploit the known weaknesses of a system in order to take it over http://www.haaretz.com/news/middle-east/iran-official-we-tricked-the-u-s-surveillance-drone-to-land-intact-1.401641
In general, ask yourself how things can go wrong. What inputs are you measuring, what are the outputs, and what is the worst-case scenario for any combination of outputs? As a practical example, even if your traffic lights are controlled by the most robust and simple code you can think of, it is still useful to have a failsafe mechanism that switches the lights to flashing yellow if two green lights happen to come on at the same time.
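A minimal sketch of that kind of output-sanity check, with hypothetical bit assignments and set_lights()/enter_failsafe_flashing_yellow() helpers standing in for real drivers:

```c
/* Output-sanity sketch for the traffic-light example: before driving the
 * outputs, check the commanded state against the combinations that must
 * never occur, and fall back to a safe state if one shows up. */

#include <stdint.h>
#include <stdbool.h>

#define NS_GREEN  (1u << 0)   /* north-south green */
#define EW_GREEN  (1u << 1)   /* east-west green   */

extern void set_lights(uint8_t state);              /* hypothetical driver */
extern void enter_failsafe_flashing_yellow(void);   /* independent safe fallback */

static bool state_is_forbidden(uint8_t state)
{
    /* Both greens on at once is the one combination that must never happen. */
    return (state & NS_GREEN) && (state & EW_GREEN);
}

void drive_outputs(uint8_t commanded_state)
{
    if (state_is_forbidden(commanded_state)) {
        enter_failsafe_flashing_yellow();
        return;
    }
    set_lights(commanded_state);
}
```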
Do you have a schematic and a board design?
Guess which part failed? Yes, that AND gate! (It almost caused a launch!)
Anyway, I think a good design is two separate systems, each using a totally separate design and parts. You might want to Google the following words...
aircraft fault tolerant systems
Three separate systems, with majority voting on the outputs. But, then, what will determine the reliability of the vote counter?
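For analog readings like cell voltages, one common way to take that vote is a mid-value select (median of three), which outvotes any single faulty channel. A sketch, with hypothetical read_channel_*() placeholders:

```c
/* Mid-value-select sketch for the 2-of-3 vote: with three independent
 * readings, taking the median means any single faulty channel (stuck high,
 * stuck low, or wildly off) is outvoted. The read_channel_*() functions are
 * hypothetical placeholders, and the caveat above still applies: the voter
 * itself is now the part whose reliability you have to argue for. */

#include <stdint.h>

extern int32_t read_channel_a(void);   /* e.g., three independent ADC chains */
extern int32_t read_channel_b(void);
extern int32_t read_channel_c(void);

/* Median of three: max(min(a,b), min(max(a,b), c)). */
static int32_t median3(int32_t a, int32_t b, int32_t c)
{
    if (a > b) { int32_t t = a; a = b; b = t; }   /* ensure a <= b */
    if (b > c) { b = c; }                         /* b = min(b, c) */
    return (a > b) ? a : b;                       /* the median    */
}

int32_t voted_reading(void)
{
    return median3(read_channel_a(), read_channel_b(), read_channel_c());
}
```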
-Phil
My ISP went offline for a weekend, a year or two ago. Of course it turned out to be the redundant (backup) generator that pulled the whole system down.
Since battery performance deteriorates in cold temperatures, make sure your system will work well under those conditions too, or in the case of a mains grid failure during a big cold snap, when the batteries may be needed even more.
Consider a good conformal coating to exclude moisture.
Study up on protective devices: fuses, circuit breakers, surge protection, etc.
The most important thing you can do during the design phase is to rapid-prototype, then blow it up. Find the modes of failure. I used to use two or three really ruthless engineers who would bang the switches on and off, short out cables, or pull out plugs at the worst possible times. By doing this as early as possible in the design cycle, we had time to do major structural redesigns so that fundamental issues could still be rectified. As a rule of thumb, if you haven't made four major revisions to the design before manufacture, then there will be problems later. We even used to drop the finished items off the back of moving vehicles so that we could see which parts might get damaged during shipping. We also put prototypes in field conditions as early as possible, and they usually failed in different ways than we had anticipated. And we gave the software to the worst-qualified user we could find to see how they made it fail.
Reliability is about listening to that voice in your head that keeps you awake at night. If you're not sure about part of the design then test it or change it. No compromises.
Another thing: whenever possible, buy it, don't build it. You don't have time to do low-level debugging of trivial systems or components. It's more important to spend more time on testing and less time on vanity, and as engineers we always think we can design better than anyone else. For high-reliability design, it's not about how clever you are, it's about how fast you can figure out how stupid you are.
Finally, multiple redundant systems are all well and good, but the latest standards call for redundant, independent systems based on different technologies. It turns out that different technologies have different modes of failure, so an inferior technology can actually support a superior one when used as a backup. Take a look at the SIL standards to see how this works.
Good luck with your project.
-Phil
So along with all this other good advice, I'd suggest building the human interface as mistake-proof and self-correcting as possible.
Absolutely!
Classic! This is going on my cube wall.
Look into the archives of IEEE and ACM if you can. I recall a series the ACM did on notable software failures such as the jet which went into inverted flight when it realized it had crossed the equator and other "discoveries".
FF