
P2 Application Idea: Fault Tolerant Computer?

Recently I read a paper about the construction of a fault tolerant computer: a machine that can detect a failure by comparing the results of three redundant computers running in parallel. Vote for truth.
Would it be possible to do this on a P2 with 3 cogs and three regions of RAM? Or with 4 cogs, one as a spare? - It seems to be merely a question of software?
web.cs.ucla.edu/~rennels/article98.pdf
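
A minimal sketch of the voting step in C (the result slots and the main() test harness are hypothetical placeholders; on a real P2 the three cogs would each write their result word into hub RAM):

    #include <stdint.h>
    #include <stdio.h>

    /* Each redundant cog writes its result into one hub RAM slot (layout hypothetical). */
    static volatile uint32_t result[3];

    /* 2-out-of-3 majority vote. Returns the agreed value; *fault is set to the
     * index of the dissenting copy, -1 if all three agree, -2 if none agree. */
    static uint32_t vote(int *fault)
    {
        uint32_t a = result[0], b = result[1], c = result[2];
        if (a == b && b == c) { *fault = -1; return a; }
        if (a == b)           { *fault =  2; return a; }
        if (a == c)           { *fault =  1; return a; }
        if (b == c)           { *fault =  0; return b; }
        *fault = -2;          /* total disagreement: no majority exists */
        return 0;
    }

    int main(void)
    {
        int fault;
        result[0] = 42; result[1] = 42; result[2] = 7;   /* simulate one bad copy */
        uint32_t v = vote(&fault);
        printf("value=%u, faulty copy=%d\n", (unsigned)v, fault);
        return 0;
    }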

This is somewhat related to @PurpleGirl 's https://forums.parallax.com/discussion/174792/what-would-be-a-good-idea-for-a-new-cpu-and-platform-to-try-on-a-p2#latest

Perhaps an application could be to run a P2 at the very edge of the possible clock frequency. If faults occur, then reduce the speed. If not, just push the frequency up a tiny bit...
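
A minimal sketch of such a control loop in C, assuming hypothetical helpers: set_sysclock() stands in for whatever clock-setting routine the toolchain provides, and run_test_and_vote() for the three-cog comparison described above:

    #include <stdint.h>

    extern void set_sysclock(uint32_t hz);   /* placeholder for the toolchain's clock setter */
    extern int  run_test_and_vote(void);     /* nonzero if the redundant cogs disagreed */

    void tune_clock(void)
    {
        uint32_t hz = 180000000;             /* start at a conservative frequency */
        for (;;) {
            if (run_test_and_vote())
                hz -= 1000000;               /* fault seen: back off 1 MHz */
            else
                hz += 100000;                /* all copies agree: creep up 100 kHz */
            set_sysclock(hz);
        }
    }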

Have Fun!
Christof

Comments

  • IMHO the problem is the lack of redundancy in such a design. Sure, three cores could compute the same task, and then the results are compared. But hub RAM, the whole infrastructure around serving it, and plenty of other functions are shared. So for this to be relevant, a fault would need to narrowly affect only a specific core - not e.g. an overall brownout, or a cosmic ray hit depending on where it hits. I would also assume that the decision process in such a design should be simpler and more robust than the individual computers whose results it compares, e.g. in the case of simple binary outputs, a majority-vote circuit of some sort. Because only if the guards are more robust, they don't need guarding ;) A software equivalent of that circuit is sketched below.
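
    That gate-level majority voter maps directly onto a bitwise expression; a minimal sketch in C:

        #include <stdint.h>

        /* Bitwise 2-of-3 majority: each output bit follows whatever at least two
         * of the three inputs say, like an AND/OR hardware majority voter. */
        static inline uint32_t majority3(uint32_t a, uint32_t b, uint32_t c)
        {
            return (a & b) | (b & c) | (a & c);
        }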

  • A thought could be to use 4 cogs. The two cogs within a pair could audit each other, so only 2 results need to be sent to the hub. Then the 2 pair results could be audited against each other.
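
    A minimal sketch of that pair-and-compare check in C (the hub RAM slots and their layout are hypothetical):

        #include <stdint.h>

        /* Hypothetical hub RAM slots: cogs 0+1 form pair A, cogs 2+3 form pair B. */
        static volatile uint32_t pairA[2], pairB[2];

        /* Each pair audits itself; a pair that disagrees internally is disqualified,
         * then the surviving pair results are compared against each other.
         * Returns 0 if all agree, 1 if one pair faulted, -1 if nothing is trustworthy. */
        static int duplex_check(uint32_t *out)
        {
            int a_ok = (pairA[0] == pairA[1]);
            int b_ok = (pairB[0] == pairB[1]);
            if (a_ok && b_ok && pairA[0] == pairB[0]) { *out = pairA[0]; return 0; }
            if (a_ok && !b_ok) { *out = pairA[0]; return 1; }   /* pair B faulted */
            if (b_ok && !a_ok) { *out = pairB[0]; return 1; }   /* pair A faulted */
            return -1;   /* pairs conflict or double fault: no trustworthy result */
        }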

  • IMHO, redundancy inside a single chip is no good. It catches only a small percentage of the possible errors. If you have a major power failure, overheating or any other event that affects the whole chip, you gain nothing.

    For example, in most modern passenger aircraft you have three hydraulic systems with three hydraulic pumps, separate valves and pipes, and three actuators at each control surface. The voting takes place mechanically at the joint of the actuators. Any two of the three systems can fail (power or pressure loss) and everything is still controllable. Even if one system actively pushes in the wrong direction, the other two can override it simply by applying more force. No complex and itself fault-sensitive voting, guarding or surveillance logic is required.

    It would be very hard to build something like that in software which relies on the same hardware as the process you try to protect against failures.

  • I've been thinking a bit about this redundancy and fault tolerance, and actually what Christof proposes is quite interesting. It's not aimed at 100% fault tolerance but rather at finding the operating point just below where things start to fall apart - a point of maximum reliable performance. Might be useful.
    The plane example is a good one, but... it only makes the hydraulic (or any other doubled or tripled) system fairly fault tolerant. The plane as a whole is not, because there are parts of it that simply aren't duplicated or tripled (like the wings or the tail).

    So if we take a plane as a stand-in for a computer, then multiple cogs executing in parallel are a fairly close substitute for the fault tolerant hydraulics or whatever else.

  • Well, yes, as engineers we are aware that there are no universally fault-proof systems. At Chernobyl, the operators switched off the very system that was designed to prevent the silly experiment they were running. At Fukushima, the power plant was built to withstand an earthquake and also a flood wave, but not both of them at the same time. And some idiot might fill ordinary hydraulic fluid into the plane's systems, which won't work at low temperature... What we can do is reduce the probabilities of known problems.

    Some time ago, I read a paper about an early computer built with tubes and delay lines. This thing was so unreliable that it was normal for it to fail every few hours. They were proud that they had means to detect faults; you just had to redo the last piece of work. https://tcm.computerhistory.org/ComputerTimeline/Chap8_univac_CS1.pdf

  • Yes, detecting faults is much easier than correcting them. In most cases it is safer to shut down the system when a fault occurs instead of continuing to run it, possibly with undefined behaviour. That, of course, requires that the system has a safe state. A flying airplane does not have one, but most machines do.

    The Propeller is especially sensitive to undefined behaviour in the case of a software bug or crash because...

    • there is no "illegal instruction" exception. Cogs continue to execute bad code from undefined or currupted memory forever
    • there is no memory protection or MMU. All user code is executed from (writeable) RAM
    • there are OR gates connecting all OUT and DIR registers of all cogs together. A single crashing cog can override the outputs used by other cogs

    So some sort of watchdog would not prevent crashes from happening, but it would at least minimize the consequences by turning all outputs off and stopping all cogs. A sketch of such a watchdog follows below.
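
    A minimal sketch of such a watchdog cog in C, assuming hypothetical helpers: kicked_recently() stands in for reading per-cog heartbeat flags in hub RAM, stop_cog() and float_all_pins() for the toolchain's cog-stop and DIR-clearing calls, and the watchdog itself is assumed to occupy cog 7:

        #include <stdint.h>

        extern int  kicked_recently(int cog);  /* placeholder: heartbeat flag per worker cog */
        extern void stop_cog(int cog);         /* placeholder for the toolchain's cog-stop call */
        extern void float_all_pins(void);      /* placeholder: clear DIR bits, outputs go high-Z */

        /* If any worker cog misses its heartbeat, force the chip into a safe
         * state instead of letting a crashed cog keep driving the OR'ed pins. */
        void watchdog_cog(void)
        {
            for (;;) {
                for (int cog = 0; cog < 7; cog++) {   /* workers in cogs 0..6 (assumption) */
                    if (!kicked_recently(cog)) {
                        float_all_pins();             /* safe state first: all outputs off */
                        for (int c = 0; c < 7; c++)
                            stop_cog(c);              /* then halt every worker cog */
                        for (;;) ;                    /* park: nothing left to supervise */
                    }
                }
            }
        }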
