Fault-tolerant systems — Parallax Forums

Fault-tolerant systems

Leon Posts: 7,620
edited 2008-10-06 15:47 in General Discussion
I've started this thread for discussing fault-tolerant systems using Propeller chips, as well as other processors.

Roke Manor used one of my transputer systems for a fault-tolerant demo. My system had 16 vertically mounted plug-in modules, and they'd ask a prospective customer to pull out two or three modules at random; the application would carry on working. The transputer made an ideal fault-tolerant processor, because it was easy to test the communication links, and switch to another processor if the links, or the processor, had failed.

The XMOS chips have a similar capability, of course. Cambridge University is using them for student projects, and a fault-tolerant system is one of the suggestions:

"Fault-tolerant parallel programming (and tools) using arrays of XMOS cores/chips. This involves some research into fault-tolerant techniques and will involve some inventiveness in designing automated systems for fault-tolerant programming. For example, replicating code onto several different threads, different cores, different chips in some automated manner and collecting/collating results from executed code."

Leon

▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
Amateur radio callsign: G1HSM
Suzuki SV1000S motorcycle


Comments

  • heater Posts: 3,370
    edited 2008-10-04 11:07
    The first thing I learned about fault-tolerant systems was that if you want to tolerate 1 failed node in a redundant network you need 4 nodes to start with, all fully connected. This was something of a surprise, as you might intuitively guess that 3 would be sufficient to get a vote on which is the failed/incorrect node. In general, to tolerate n failed nodes one needs 3n + 1 nodes in the network to start with. See http://en.wikipedia.org/wiki/Byzantine_Fault_Tolerance for example.

    So for example Leon's transputer network only needed 10 nodes to tolerate the yanking out of 3 modules.

    Surprisingly, things like the Boeing 777 primary flight computers only have 3 nodes!

    If anyone could explain a solution to the Byzantine Generals problem for only 4 nodes in such a way that I could understand it I would be very happy.
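
    heater's 4-node puzzle can at least be simulated. The following is a rough Python sketch (my own toy model, not any real implementation) of one round of the "oral messages" algorithm OM(1) from the Lamport/Shostak/Pease paper: a commander sends its order to three lieutenants, each lieutenant relays what it heard to the other two, and each loyal lieutenant takes the majority of the three values it ends up holding.

```python
def flip(v):
    """The 'other' order; a traitor sends this instead of the truth."""
    return "retreat" if v == "attack" else "attack"

def om1(order, traitor=None):
    """Toy simulation of OM(1): commander is node 0, lieutenants 1..3,
    tolerating at most one traitor (4 = 3*1 + 1 nodes).

    A traitorous commander lies to odd-numbered lieutenants; a
    traitorous lieutenant lies when relaying. Returns the decision
    of each loyal lieutenant."""
    lieutenants = [1, 2, 3]

    # Round 1: the commander sends its order to every lieutenant.
    received = {}
    for l in lieutenants:
        lie = (traitor == 0 and l % 2 == 1)
        received[l] = flip(order) if lie else order

    # Round 2: each lieutenant relays what it heard to the other two,
    # then takes the majority of the three values it holds.
    decisions = {}
    for l in lieutenants:
        votes = [received[l]]                    # commander's word to me
        for peer in lieutenants:
            if peer != l:
                relayed = received[peer]
                if peer == traitor:              # a traitorous peer lies
                    relayed = flip(relayed)
                votes.append(relayed)
        decisions[l] = max(set(votes), key=votes.count)

    return {l: d for l, d in decisions.items() if l != traitor}
```

    With a traitorous lieutenant the loyal ones still recover the commander's order; with a traitorous commander they at least all agree with each other, which is the most OM(1) can guarantee.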

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    For me, the past is not over yet.
  • hippy Posts: 1,981
    edited 2008-10-04 14:25
    The first step has to be defining what one means by "fault tolerant", both in general and in specific cases and applications: just how fault-tolerant does a fault-tolerant system have to be?

    Nothing is ever going to be truly fault-tolerant on a practical scale. No matter how many nodes monitor the others, pull the power and everything goes down unless that is catered for. As nodes fail, there comes a point where one additional fault is the straw that breaks the camel's back, and multiple nodes will not note a fault if they are affected by that fault themselves.

    This is probably what drives people, as in the example of Boeing's 777, to use fewer nodes than some would recommend; what they call fault-tolerant, others may simply call resilient.
  • Leon Posts: 7,620
    edited 2008-10-04 18:45
    When I worked for BAe Military Aircraft they used three different computers for flight control systems, each designed by a different team.

    Leon

  • SRLM Posts: 5,045
    edited 2008-10-04 23:26
    Thanks, Leon, for starting this thread. I'd never heard of fault tolerance before today.

    Anyway, a question: what happens when you put the chip back in? I'd think the best design would bring the chip back up to speed as if it had never been removed, but I don't know.
  • Leon Posts: 7,620
    edited 2008-10-04 23:35
    It cropped up originally on the Propeller forum, but was getting off-topic there.

    With the transputer system, it would detect that a module had been added and start using it: zero downtime if a module failed and had to be replaced. All that happened when modules failed was that the system would slow down. I think it was intended for military systems; it cost about £13,000 just for the hardware, and that was 25 years ago.

    A related area is safety-critical systems. That is just as interesting. For instance, can reliable software be designed?

    Built-in test is another related area. I once worked on military radio systems and the radios had to test themselves before they were ready for use. Of course, that raises the problem of what happens if the test function itself fails.

    Leon


  • SRLM Posts: 5,045
    edited 2008-10-04 23:58
    So I did a little bit of thinking about Byzantine fault tolerance, and this is what I deduced.

    Four generals (for simplicity), with one a traitor. This should work according to the formula n >= 3t + 1, where n is the number of generals and t is the number of traitors.

    So I made a table of what they said to each other:

    1 says:  1:A, 2:A, 3:A, 4:A
    2 says:  1:A, 2:A, 3:A, 4:A
    3 says:  1:B, 2:B, 3:B, 4:B
    4 says:  1:A, 2:A, 3:A, 4:A

    Now, for a moment, assume that there were only three generals. #1 looks at the results of #2 and #3 and sees A and B, respectively (reading down the table). It doesn't know which one is correct, so the situation doesn't work.

    Okay, time for all four. General #1 looks and sees A, B, A. He decides #3 is traitorous because it's different from the other two. Each of the other generals reaches the same conclusion, and they throw out the traitor.

    So, is this how the system works? Seems very simple and reliable (at least with four).
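
    SRLM's table can be checked with a short sketch (the values are the hypothetical A/B orders from the table above, and the traitor is assumed to report the same wrong value to everyone, which is the easy, consistent case; an actively devious traitor is exactly where this simple check breaks down):

```python
# Hypothetical broadcast table from the post above: generals 1, 2 and 4
# report "A"; general 3 (the traitor) reports "B" to everyone.
claims = {1: "A", 2: "A", 3: "B", 4: "A"}

def flagged(claims, me):
    """What general `me` concludes: take the majority of the other
    three reports and flag any general whose value differs from it."""
    others = {g: v for g, v in claims.items() if g != me}
    vals = list(others.values())
    majority = max(set(vals), key=vals.count)
    return [g for g, v in others.items() if v != majority]
```

    Every loyal general flags general 3 and carries on with order "A".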
  • heater Posts: 3,370
    edited 2008-10-06 08:02
    As a recap for those new to this thread: Sapieha suggested that the Propeller was a good device for a "single-chip fault-tolerant system", which I objected to thus:

    1) It has no redundancy in the power supply.
    2) It has no redundancy in the main memory.
    3) It has no redundant I/O. All outputs are ORed together, for example.
    4) It has no redundant main time counter.
    5) Redundant copies of code running on multiple cogs have the same CPU bugs (if any) and probably the same compiler bugs.

    Sapieha has replied elsewhere that:

    1. That option is not present with other chips.

    2. Using regions of HUB memory gives little confidence against memory faults (it is not perfect, but OK).

    3. The I/O is just perfect for it. Any COG can adopt I/O work from another to test for faults, etc.

    4. The counters are still not as good as I would like. I hope the Prop II has everything I need.

    5. My code for the COGs loads from a separate EEPROM for every COG. (Compiler problems you may have with all other firms' CPUs.)

    To which I now reply:

    1. A redundant power supply is NOT possible with a single Prop (or any single chip). One must use multiple devices with multiple power sources, so the Prop has no inherent advantage here. Taking cost into account, multiples of some other device may be preferable.

    2. That is my point: "little confidence".

    3. The I/O is not perfect at all. For example, all outputs from COGs are ORed together on their way to the pins, so a failed/rogue COG can cause global failure.

    4. The local COG counters are at least truly redundant resources.

    5. Loading each COG from its own EEPROM seems OK. Compiler bugs we will have to live with. CPU (COG) bugs we will have to live with, unless you really want to spend an age developing the same code in 3 or 4 different languages for 3 or 4 different CPU architectures.

    But now Sapieha has said what he really is doing:
    Sapieha said...
    ... For experiments and studies it is OK with one chip, which has the possibilities to do that work without many extra components. The possibilities in the Propeller give me a cheap system for students.

    This sounds like a great idea.

    Sapieha, I'd love a book like the one you are planning. Any chance of an English version?

  • Sapieha Posts: 2,964
    edited 2008-10-06 08:22
    Hi Leon.
    Thanks for starting this thread.

    Hi all
    In the first place, a fault-tolerant system can never be better than the hardware engineers and software engineers who construct it.
    There must be full cooperation between them.

    In the second place, it must have as many tests and arrangements as possible to correct faults, where possible, in the external components.
    It is not possible for any software engineer to program around something that is not in the hardware, no matter how good his programming skills are.

    PS. Heater: yes, I am planning to write it in English, with the help of a friend.

    heater said...
    3. I/O is not perfect at all. For example, all outputs from COGs are ORed together on their way to the pins, so a failed/rogue COG can cause global failure.

    Yes, in the real world; but for students studying such a system it is OK (it is not as if all the I/O pins fail at the same time).

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    Nothing is impossible, there are only different degrees of difficulty.
    For every stupid question there is at least one intelligent answer.
    Don't guess - ask instead.
    If you don't ask you won't know.
    If you're going to construct something, make it as simple as possible yet as versatile as possible.


    Sapieha

  • heater Posts: 3,370
    edited 2008-10-06 08:58
    SRLM: As far as I understand it, this problem is not so simple. Perhaps it would be if every node were working on the same input data at the same time. But how do we know the input data is the same for all? Perhaps there is a fault. And how do we know that all nodes run at the same time, in lock step? That seems to require a common clock, but then that is a single point of failure.

    Then there is the possibility of not just a wrong or failed node giving a wrong result to its peers, but a node that is actively devious, giving different and confusing results to its peers.

    In fact, the problem is so non-obvious that nobody really thought it through until 1982, when L. Lamport, R. Shostak, and M. Pease wrote their famous paper "The Byzantine Generals Problem": research.microsoft.com/users/lamport/pubs/byz.pdf
    That is probably why the missile systems I worked on at Marconi in the '80s only had triple redundancy. Not to mention the Boeing 777 fly-by-wire system.

    Hippy says "The first step has to be defining what one means by Fault Tolerant". Quite so; and first define what we mean by "fault", which faults we want to tolerate, and which we give up and die for.

    I'd like to propose a challenge as an example, if only as a thought exercise: to build a fault-tolerant clock that will:

    1) Produce a reliable estimate of the time since some event, say power-up.
    2) Produce a result unaffected by any single fault in any part of the system used in the estimation. That includes CPUs, memory, program corruption, faulty links between components, and power failures.
    3) Allow a component, once faulty, to "recover" or be replaced and resume its work without disrupting the results produced.
    4) In a multi-CPU system (the probable solution), ensure that a completely rogue program loaded onto one CPU that actively tries to confuse its peers cannot induce an incorrect result.

    We will assume our compilers are perfect and that our CPUs etc. have no inherent design flaws. Basically we are looking at faults arising due to ageing, mechanical violence, power failure, etc. of a single component.

    Requirement 4) allows the simulation of the bizarre faults that can arise due to faulty communication links.
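
    For requirement 1), one simple fusion rule (an assumption on my part, not a full answer to the challenge) is a median vote across redundant counters: with four nodes, a single arbitrarily wrong, even maliciously chosen, value cannot pull the fused estimate outside the range spanned by the three honest ones. A minimal Python sketch:

```python
import statistics

def fused_time(estimates):
    """Fuse per-node elapsed-time estimates (e.g. tick counts since
    power-up) by taking the median. With four estimates, one rogue
    value lands at an extreme of the sorted list and is averaged out
    by the two middle (honest) values."""
    return statistics.median(estimates)
```

    For example, fused_time([100, 101, 99, 100000]) stays between 99 and 101 no matter what the fourth node reports. Agreeing on which node is the rogue one, and on the inputs themselves, is of course where the Byzantine machinery above comes in.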

    As Sapieha says, a Prop can be used to "simulate" the system, if anyone actually comes up with some code to do it.

  • hippy Posts: 1,981
    edited 2008-10-06 10:49
    A second division in fault tolerance is software fault tolerance and hardware fault tolerance.

    The Propeller looks good for the first, being multi-core, but in practice isn't necessarily so, because it is not sand-boxed: a rogue cog can override outputs and also corrupt memory which other cogs may be relying upon. With spare cogs it would be possible to CogStop one which has gone rogue and restart its task in another, but achieving that isn't without some difficulty.

    For fault tolerance to algorithm error only, the Propeller is probably a better choice than many, particularly as it is simply 8 independent MCUs with shared memory.

    I'll put my hand up to being entirely sceptical of fault tolerance, believing it to be an unachievable holy grail beyond simple voting mechanisms, redundancy and reconfiguration when something is removed from the system. I'll accept, however, that the concept may simply be beyond my ability to comprehend. Whenever I look at any system I always see a single point of failure which renders whatever is done entirely fruitless in the scheme of things. Can there ever be a guaranteed 100% correct indicator on a plane which tells the pilot it is safe to continue or to land immediately?

    Maybe I expect too much?
  • heater Posts: 3,370
    edited 2008-10-06 11:39
    You are right to be sceptical, indeed paranoid. If one thinks it's easy and implements something that is "obviously" correct and safe, it will no doubt crash and burn fairly quickly. We have seen this with railway systems, aircraft, cars, lifts, nuclear power stations, etc. It has taken a long time to realize all the really odd things that can go wrong.

    For example: your assumption that the Prop may be better suited to dealing with algorithm error than hardware error. How can this be? A rogue algorithm could stamp over its peers' memory just as surely as something going wrong with the HUB, the memory decoder, or the pointer in a wrlong. Indeed, the "obvious" assumption that hardware errors and software errors are different may not be so simple at all.

    Still, we have not lost a Boeing 777 yet, and they have had few "incidents" since introduction. The primary flight control has three black boxes running in different locations in the aircraft (and yes, they actually are black), three separate power supplies, at least triple-redundant sensors on all control surfaces and other inputs, at least triple-redundant actuators on the control surfaces, "stick shaker" pilot feedback, etc. Each black box contains three different CPUs on three boards: Intel 486, Motorola 6800xx and AMD 2900, if I remember correctly, all checking each other's work before it goes out of the box.
    I think I remember that there were actually two of each type of CPU running in lock step on each board, for rapid detection of the failure of one of them.

    In spite of all this, there is a BIG RED SWITCH overhead in the cockpit that the pilot can throw to disable all the digital control and fall back to analogue circuitry if he thinks the computers are really screwed.

    As for software errors: the 777 obviously had three different compilers for the three CPU architectures, BUT they all run control software from the same code base. Whether that is wise or not seems to be still open to debate.

    That BIG RED SWITCH also has triple-redundant contacts, of course. I had the pleasure of testing that switch when we were testing the black boxes at Marconi Aerospace, only to find that it did NOT work!

  • heater Posts: 3,370
    edited 2008-10-06 11:48
    Leon: "When I worked for BAe Military Aircraft they used three different computers for flight control systems, each designed by a different team."

    I think this is an example of what I mean in my previous post; they used to do it in missile controls at Marconi too, before anyone had thought of Byzantine Generals.

    It has been shown that, under certain conditions, three computers may not be enough to survive as you would expect. It can be done if all communication is signed, but I have never seen that implemented.

    It has also been shown that three teams writing code to the same specification will tend to write the same bugs, either due to errors in the spec itself or ambiguities that get interpreted the same way.

    So it's not much better than having one team write the code and another team create all the unit and integration tests. Creating the tests is more work than creating the code.

  • hippy Posts: 1,981
    edited 2008-10-06 11:57
    Sorry; by "algorithm error" I meant delivering an incorrect result value rather than causing any other effect: the sort of bug which results in indicating a plane is upside down or facing the wrong way as soon as it flies over the equator.
  • heater Posts: 3,370
    edited 2008-10-06 12:15
    Quite so. And limiting our scope to that kind of algorithm error makes the Prop quite a nice way to experiment with fault-tolerant designs, as Sapieha intends. It's then easy to introduce purposely faulty algorithms into a COG to simulate all kinds of processing and communication errors, as well as deliberately malicious and misleading algorithms. As long as we can be sure we don't have the other kind of error, which would confuse our experiments and cause a lot of head scratching.

    One could always run experiments like this using communicating processes under Linux or whatever, but it's much more satisfying to have some real LEDs and switches and real hardware to get into the feel of the thing.

    Actually, having four boards running from four batteries, displaying time, for example, might at least result in something to show off to friends and family when they ask "what the hell are you wasting all those hours doing?"

  • Jo Posts: 55
    edited 2008-10-06 14:16
    Note also that the 3-way redundancy systems typically work as:

                   / unit1 \
    system input <   unit2   > fail-safe voting system -> system output
                   \ unit3 /

    The most complex part is usually the voting system. This part *has* to be trusted, is typically designed to fail safe (i.e. when it fails it indicates that the whole system is non-operational), and is generally some type of majority voting system. None of the units themselves know whether the others are good or not. This works even under the Byzantine Generals problem, as the voter counts as a 4th observer, even though it is non-participating.
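
    Jo's fail-safe majority voter can be sketched in a few lines of Python (illustrative only; a real voter would compare values within tolerances, and in avionics it is usually done in hardware):

```python
def vote(a, b, c):
    """2-out-of-3 majority voter with a fail-safe output.

    Returns (value, ok). If no two units agree there is no majority,
    so ok is False: the voter declares the whole system non-operational
    rather than guess, which is the fail-safe behaviour described
    above."""
    if a == b or a == c:
        return a, True
    if b == c:
        return b, True
    return None, False
```

    So vote(5, 5, 7) masks the bad unit and returns (5, True), while vote(5, 6, 7) fails safe with (None, False).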

    And no, I wouldn't consider the Propeller fault-tolerant, in either software or hardware: there are too many single points of failure in the system (power, memory bus, I/O pins, cogs actively killing each other, etc.).

    ▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔
    ---
    Jo
  • Leon Posts: 7,620
    edited 2008-10-06 14:24
    At least it doesn't have interrupts. Does anyone remember the Viper MPU? It was developed by the MoD for safety-critical systems and had a very simple architecture with no interrupts, to make it as reliable as possible. I think it was abandoned because its performance was abysmal; it was too simple. Einstein once said that things should be as simple as possible, but no simpler.

    Leon

  • Sapieha Posts: 2,964
    edited 2008-10-06 15:00
    Hi Jo
  • heater Posts: 3,370
    edited 2008-10-06 15:13
    Last I heard about the Viper was that it deliberately did not have interrupts, so that programs could be proved correct by static analysis, etc. Of course, the first thing its customers wanted was interrupts. Then there was the rumour that, despite it being designed using mathematically formal methods to prove its correctness, they found a bug in it anyway, and that was the end of Viper.

  • Jo Posts: 55
    edited 2008-10-06 15:47
    heater:
    Yeah, full-up fault-tolerant stuff is more complex. I worked on fault-tolerant distributed systems during post-doc work, and they certainly get very complex, and the types of failure modes that have to be considered definitely get "odd".
    But you can simplify each part of the system into that pattern: input source(s) -> redundant processing units -> voting circuit.
    You just keep putting more of these things in series/parallel at each point in the system where it is unacceptable to have a failure mode. So, if sensors need to be redundant, you have multiple sensors and a voter to decide what the real sensed value ought to be. Similarly for actuators: they receive multiple inputs, and internally each actuator votes on which input(s) form the majority and works in that direction.

    Of course, the fewer failure modes you can tolerate, the more complex the system becomes, until the fault-tolerance overhead itself consumes all engineering resources and budget :-)

    In avionics there are generally just 2 primary displays (captain & co-pilot), and backup displays tend to be purely "mechanical", fed from a completely different set of sensors/wiring harness (to deal with total power failure modes; there have been 2 instances with Airbus of dead-stick landings).
