Fault Tolerance And Reliability

If n different events all have the same mean time m, then the mean time to the first of these events is m/n.

Theorem 1:

The mean time to event A is MT(A) = 1/P(A), where P(A) is the probability of A occurring in a given unit of time.

Theorem 2:

P(A or B) = P(A) + P(B) - P(A and B)
Assuming A and B are independent:
= P(A) + P(B) - P(A) * P(B)
≈ P(A) + P(B) (if P(A) and P(B) are very small)
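
To see how little the approximation in Theorem 2 loses, here is a small sketch in Python. The probabilities 0.001 and 0.002 are illustrative assumptions, not values from these notes.

```python
def p_either_exact(p_a: float, p_b: float) -> float:
    """P(A or B) for independent events A and B."""
    return p_a + p_b - p_a * p_b


def p_either_approx(p_a: float, p_b: float) -> float:
    """Small-probability approximation: drop the P(A) * P(B) term."""
    return p_a + p_b


# Illustrative (assumed) per-unit-time probabilities.
p_a, p_b = 0.001, 0.002
exact = p_either_exact(p_a, p_b)
approx = p_either_approx(p_a, p_b)

# The error of the approximation is exactly the dropped product term.
error = approx - exact
print(exact, approx, error)
```

The error equals P(A) * P(B), which is about two millionths here, so for rare events the simple sum is a reasonable stand-in.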

Theorem 3:

If events A and B have mean times MT(A) and MT(B), then the mean time to the first event is 1/(P(A) + P(B)) = 1/(1/MT(A) + 1/MT(B)).

Proof:

If p is the probability of an event in a given unit of time, then its mean time is m = 1/p (Theorem 1).
If there are n such independent events, the probability that one of them occurs in that time is approximately n * p (Theorem 2, for small p).
Therefore, the mean time to the first of these events is 1/(n * p) = m/n.
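
The result can be checked numerically. The sketch below computes the mean time to the first event from a list of mean times and compares it with a seeded simulation. The function names, the values m = 100 and n = 5, and the mean times 1000 and 500 are assumptions chosen for illustration.

```python
import random
from typing import Sequence


def mean_time_to_first(mean_times: Sequence[float]) -> float:
    """Theorem 3 extended to n events: 1 / (sum of the per-event probabilities 1/MT)."""
    return 1.0 / sum(1.0 / mt for mt in mean_times)


def simulate_mean_time_to_first(p: float, n: int, trials: int, seed: int = 0) -> float:
    """Estimate the mean number of time steps until any one of n independent
    events, each with per-step probability p, occurs for the first time."""
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(trials):
        steps = 1
        # Keep stepping until at least one of the n events fires in a step.
        while not any(rng.random() < p for _ in range(n)):
            steps += 1
        total_steps += steps
    return total_steps / trials


m, n = 100.0, 5                                   # assumed example values
predicted = mean_time_to_first([m] * n)           # equals m / n
estimated = simulate_mean_time_to_first(p=1.0 / m, n=n, trials=5000)
mixed = mean_time_to_first([1000.0, 500.0])       # two events with different mean times
print(predicted, estimated, mixed)
```

The simulated value comes out slightly above m/n because n * p is itself the small-probability approximation from Theorem 2. The exact per-step probability is 1 - (1 - p)^n.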

Fault Tolerance Strategy:

Fail-vote:
Use two or more modules and compare their outputs; the system stops if no majority of the outputs agree. With duplication (two modules) it fails twice as often as a single module, but it gives clean failure semantics.
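
A minimal sketch of fail-vote in Python, assuming each module's output is passed in as a list entry and that `None` stands for a module that produced no output. The names `fail_vote` and `NoMajorityError` are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


class NoMajorityError(Exception):
    """Raised when the voter must stop because no majority agrees."""


def fail_vote(outputs: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value agreed on by a majority of ALL configured modules.

    `outputs` holds one entry per configured module; None marks a module
    that produced no output (an assumption of this sketch).
    """
    needed = len(outputs) // 2 + 1  # strict majority of all modules
    counts = Counter(o for o in outputs if o is not None)
    if counts:
        value, votes = counts.most_common(1)[0]
        if votes >= needed:
            return value
    raise NoMajorityError("no majority of module outputs agree")


print(fail_vote([42, 42, 41]))  # two of three modules agree
```

With two modules, `needed` is 2, so a single disagreement or a single missing output stops the system. This is the "fails twice as often" behaviour, traded for the guarantee that a wrong answer is never passed on silently.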

Fail-fast:
Similar to fail-vote, except that the system senses which modules are available and then uses the majority of the available modules.
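
A minimal sketch of the fail-fast variant in Python. It assumes that an unavailable module is reported as `None`, and the names `fail_fast_vote` and `NoMajorityError` are hypothetical. The only change from voting over all modules is that the majority is counted among the modules that are still available.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


class NoMajorityError(Exception):
    """Raised when the available modules cannot produce a majority."""


def fail_fast_vote(outputs: Sequence[Optional[Hashable]]) -> Hashable:
    """Return the value agreed on by a majority of the AVAILABLE modules.

    None marks a module that has been sensed as unavailable
    (an assumption of this sketch).
    """
    available = [o for o in outputs if o is not None]
    if not available:
        raise NoMajorityError("no modules available")
    needed = len(available) // 2 + 1  # strict majority of available modules
    value, votes = Counter(available).most_common(1)[0]
    if votes >= needed:
        return value
    raise NoMajorityError("available modules disagree")


# One surviving module out of three is still enough to continue.
print(fail_fast_vote([42, None, None]))
```

Because failed modules are excluded from the count, a three-module system keeps running after two detected failures, whereas a vote over all three modules would already have stopped.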

Improve the software reliability:

Periodic transfer of data: The primary process does all the work until it fails; a second process, called the backup, then takes over from the primary and continues.
Checkpoint-restart: The primary records its state on a duplexed storage module. At takeover, the secondary reads the primary's state from the duplexed storage and resumes the application.
Checkpoint messages: The primary sends its state changes as messages to the backup. At takeover, the backup gets its current state from the most recent checkpoint message.
Persistent: The backup restarts in the null state and lets the transaction mechanism clean up all uncommitted transactions. This is the approach taken by most database systems.
Highly available storage
- Write to several storage modules.
- Use some kind of checksum to make sure, with very high probability, that the data read is correct.
- Disk mirroring is an example of this.
- Shadowing is another mirroring technique which allows atomic write operations.
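
The storage points above can be sketched as a toy in-memory mirrored store in Python. The class `MirroredStore` and its methods are hypothetical, CRC-32 is only one possible checksum, and shadowing (atomic writes) is not modelled here.

```python
import zlib
from typing import Dict, List, Tuple


class MirroredStore:
    """Toy mirrored storage: every write goes to all modules, and every
    read is checked against a CRC-32 checksum stored with the data."""

    def __init__(self, copies: int = 2) -> None:
        # Each dict stands in for one independent storage module.
        self.modules: List[Dict[str, Tuple[bytes, int]]] = [{} for _ in range(copies)]

    def write(self, key: str, data: bytes) -> None:
        record = (data, zlib.crc32(data))
        for module in self.modules:
            module[key] = record

    def read(self, key: str) -> bytes:
        # Return the first copy whose checksum still matches its data.
        for module in self.modules:
            record = module.get(key)
            if record is None:
                continue
            data, checksum = record
            if zlib.crc32(data) == checksum:
                return data
        raise IOError(f"no valid copy of {key!r}")


store = MirroredStore(copies=2)
store.write("page1", b"hello")
# Simulate corruption of the first copy: the data changes, the checksum does not.
store.modules[0]["page1"] = (b"hellx", zlib.crc32(b"hello"))
print(store.read("page1"))  # served from the intact mirror
```

A real implementation would also repair the bad copy from the good one after a failed checksum. That step is left out to keep the sketch short.
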
Highly available processes
- Process pairing
- Transaction-based restart
- Checkpoint-restart
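
Process pairing with checkpoint messages, as described above, can be sketched in Python as follows. The classes `Backup` and `CounterProcess` are hypothetical, and the "application" is just a counter. Real process pairs run in separate processes or machines and must also detect the primary's failure, which this sketch does not model.

```python
from typing import Dict, Optional


class Backup:
    """Backup half of the pair: remembers the most recent checkpoint message."""

    def __init__(self) -> None:
        self.last_checkpoint: Optional[Dict[str, int]] = None

    def receive_checkpoint(self, state: Dict[str, int]) -> None:
        # Copy the state so later changes in the primary do not leak in.
        self.last_checkpoint = dict(state)

    def take_over(self) -> "CounterProcess":
        # Resume the application from the most recent checkpoint message.
        return CounterProcess(state=self.last_checkpoint)


class CounterProcess:
    """Toy application: a counter whose whole state is one dictionary."""

    def __init__(self, state: Optional[Dict[str, int]] = None,
                 backup: Optional[Backup] = None) -> None:
        self.state = dict(state) if state else {"count": 0}
        self.backup = backup

    def increment(self) -> int:
        self.state["count"] += 1
        if self.backup is not None:
            # Send the state change to the backup as a checkpoint message.
            self.backup.receive_checkpoint(self.state)
        return self.state["count"]


backup = Backup()
primary = CounterProcess(backup=backup)
primary.increment()
primary.increment()
primary = None                      # simulate failure of the primary
new_primary = backup.take_over()
print(new_primary.increment())      # prints 3
```

The persistent approach would instead start the new process with no state at all and rely on the transaction mechanism to undo uncommitted work.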

Improve the communication reliability