Reliability Evaluation Techniques

·  Reliability refers to the degree of tolerance against errors and component failures in a system.

·  A reliable system prevents loss of information even in the event of component failures.

·  The multiplicity of storage devices and processors in a distributed system allows the maintenance of multiple copies of critical information within the system and the execution of important computations redundantly to protect them against catastrophic failures.

E.g. if one of the processors fails, the computation can be successfully completed at the other processor, or if one of the storage devices fails, the information can still be retrieved from the other storage device.

·  Distributed computing systems have a potential for higher reliability since a few parts of the system can be down without interrupting the jobs of the users who are using the other parts of the system. This is in contrast to the centralized system, where a failure of the central processor results in the entire system shutting down.

E.g. if a workstation of a distributed computing system that is based on the workstation-server model fails, only the user of that workstation is affected.

·  The advantage of higher reliability is an important reason for use of distributed computing systems for mission-critical applications whose failure may be disastrous.

Ways to represent reliability

There are two ways to represent reliability:

1.  Network point of view

2.  Application point of view

Network Point of View

·  Given the link and node reliability for a specific network topology, determine the probability of the network becoming partitioned.

There are two popular parameters used to evaluate network reliability:

1.  Source-terminal reliability (STR)

This is the probability that a path exists between a given source node, s, and a given destination node, d.

2.  Computer Network Reliability (CNR)
This is the probability that a path exists between any source node and any destination node (i.e. the probability that any node in the network can communicate with any other node in the network).

Application Point of View

Given the link and processing node reliabilities, and the distribution of programs as well as data files, determine the probability of successfully completing a specific application.

A popular parameter used to evaluate reliability from the application execution standpoint is:

Distributed Program Reliability (DPR)

This is the probability that a program can successfully execute in a distributed computing system (i.e. the program can always access the

data files it needs to successfully execute).

Computing the STR

Consider the network topology given below:

X1 0.85 0.8 X4

0.8 X3

X2 0.9 0.9 X5

Assumptions:

1.  Each link is full-duplex when working.

2.  Each link can is either in a working state or in a failed state.

3.  The reliability of every possible link is known.

4.  The nodes are perfectly reliable (this assumption is for mathematical convenience only. The failure of a node can be easily modeled by assuming the failure of all links that connect to the node that has failed).

Step 1:

Determine all possible paths from source node 1 to destination node 3.

The possible paths (in increasing order of cardinality) are: X1 X4, X2 X5, X1 X3 X5, and X2 X3 X4. These are depicted below in order:

X1 X4 X2 X5 X1 X3 X5 X2 X3 X4

Step 2:

Reliability Expression Generation by making the terms disjoint:

Term 1: X1 X4 P1 P4

Term 2: XI X4 (these are both not present in the second path

X2 X5)

Z(X1 X4) = X1’ + X1 X4’

X2 X5 (X1’ + X1 X4’) P2 P5 (Q1 + P1 Q4)

Term 3: X4 + X2

Z(X4 + X2) = X4’ X2’

X1 X3 X5 X4’ X2’ P1 P3 P5 Q4 Q2

Term 4: X1 + X5 + X1 X5 X1 (1 + X5) + X5 X1 + X5

Z(X1 + X5) = X1’ X5’

X2 X3 X4 X1’ X5’ P2 P3 P4 Q1 Q5

Hence,

STR = P1 P4 + P2 P5 (Q1 + P1 Q4) + P1 P3 P5 Q4 Q2 + P2 P3 P4 Q1 Q5

STR = 0.85 * 0.8 + 0.9 * 0.9 (0.15 + 0.85 * 0.2) + 0.85 * 0.8 * 0.9 * 0.2 * 0.1 + 0.9 * 0.8 * 0.8 * 0.15 * 0.1 = 0.96008

The Zeta Operator

X1, X2, X3 are logical variables.

Z(X1) = X1’

Z(X1 + X2) = Z(X1) Z(X2) = X1’ X2’

Z(X1 X2) = Z(X1) + X1 Z(X2) = X1’ + X1 X2’

Z(X1 + X2 + X3) = Z(X1) Z(X2) Z(X3) = X1’ X2’ X3’

Z(X1 X2 X3) = Z(X1) + X1 Z(X2) + X1 X2 Z(X3)

= X1’ + X1 X2’ + X1 X2 X3’

Computing the CNR

Consider the network topology given below:

X1 0.85 0.8 X4

0.8 X3

X2 0.9 0.9 X5

Step 1:

Determine all minimum spanning trees (MST) for the topology given.

A MST is defined as the tree that connects every node in the network with no cycle in them. For e.g. X1 X3 X4 is a MST but X1 X2 X3 X5 is not.

The MSTs (in increasing order of cardinality) are:

MST1 = X1 X3 X4

MST2 = X1 X3 X5

MST3 = X2 X3 X4

MST4 = X2 X3 X5

MST5 = X1 X2 X5

MST6 = X1 X2 X4

MST7 = X2 X4 X5

MST8 = X1 X4 X5

Step 2:

Make the MSTs disjoint.

Term1

Pr[MST1] = P1 P3 P4

Term2

X4

Z(X4) = X4’

X1 X3 X5 X4’ P1 P3 P5 Q4

Pr[MST2] = P1 P3 P5 Q4

Term3

X1 + X1 X5 X1 (1 + X5) X1

Z(X1) = X1’

X2 X3 X4 X1’ P2 P3 P4 Q1

Pr[MST3] = P2 P3 P4 Q1

Term4

X1 X4 + X1 + X4 X1(X4 + 1) + X4 X1 + X4

Z(X1 + X4) = X1’ X4’

X2 X3 X5 X1’ X4’ P2 P3 P5 Q1 Q4

Pr[MST4] = P2 P3 P5 Q1 Q4

Term5

X3 X4 + X3 + X3 X4 + X3 X3 (X4 + 1 + X4 + 1) X3

Z(X3) = X3’

X1 X2 X5 X3’ P1 P2 P5 Q3

Pr[MST5] = P1 P2 P5 Q3

Term6

X3 + X3 X5 + X3 + X3 X5 + X5 X3 (1 + X5 + 1 + X5) + X5

X3 + X5

Z(X3 + X5) X3’ X5’

X1 X2 X4 X3’ X5’ P1 P2 P4 Q3 Q5

Pr[MST6] = P1 P2 P4 Q3 Q5

Term7

X1 X3 + X1 X3 + X3 + X3 + X1 + X1

X1 (X3 + X3 + 1 + 1) + X3 + X3 X1 + X3

Z(X1 + X3) X1’ X3’

X1 X4 X5 X1’ X3’ P2 P4 P5 Q1 Q3

Pr[MST7] = P2 P4 P5 Q1 Q3

Term8

X3 + X3 + X2 X3 + X2 X3 + X2 + X2 + X2

X3 (1 + 1 + X2 + X2) + X2 + X2 + X2 X3 + X2

Z(X3 + X2) X3’ X2’

X1 X4 X5 X3’ X2’ P1 P4 P5 Q2 Q3

Pr[MST8] = P1 P4 P5 Q2 Q3

Hence,

8

CNR = S Pr [MSTj]

j = 1

Computing the DPR

Consider the topology of the DCS. The distribution of files is shown in curly braces. Program P is executed on node 1 and the files required for its execution are f1, f2, f3. The link reliabilities are as shown.

{f1, f2} X2 {f2, f5}

0.85

X1 0.9 X7

0.95

{f1, f5} 0.8 0.8 {f1, f4}

X4 X6

X3 0.9 0.9 X8

0.9

X5

{f3, f4} {f3, f6}

Step 1:

Determine all minimum file spanning trees (MFST) for the topology given. The MFSTs (in increasing order of cardinality) are:

MFST1 = X1 X3

MFST2 = X1 X4

MFST3 = X3 X4

MFST4 = X3 X5 X6

MFST5 = X1 X2 X6

MFST6 = X3 X5 X8 X7

MFST7 = X1 X2 X7 X8

Term 1

Pr[MFST1] = P1 P3

Term 2

Z(X3) X3’

Pr[MFST2] = P1 P4 Q3

Term 3

X1 + X1 X1

Z(X1) X1’

Pr[MFST3] = P3 P4 Q1

Term 4

X1 + X1 X4 + X4 X1 (1 + X4) + X4 X1 + X4

Z(X1 + X4) X1’ X4’

Pr[MFST4] = P3 P5 P6 Q1 Q4

Term 5

X3 + X4 + X3 X4 + X3 X5 X3(1 + X4 + X5) + X4 X3 + X4

Z(X3 + X4) X3’ X4’

Pr[MFST5] = P1 P2 P6 Q3 Q4

Term 6

X1 + X1 X4 + X4 + X6 + X1 X2 X6

X1 (1 + X4 + X2 X6) + X4 + X6 X1 + X4 + X6

Z(X1 + X4 + X6) X1’ X4’ X6’

Pr[MFST6] = P3 P5 P7 P8 Q1 Q4 Q6

Term 7

X3 + X4 + X3 X5 X6 + X6 + X3 X5

X3 (1 + X5 X6 + X5) + X4 + X6 X3 + X4 + X6

Z(X3 + X4 + X6) X3’ X4’ X6’

Pr[MFST7] = P1 P2 P7 P8 Q3 Q4 Q6

Hence,

7

DPR = S Pr [MFSTj]

j = 1

Applicability of CNR to network design

One of the considerations in the design of computer networks is the reliability of between any pair of nodes and the maximum permissible cost.

Given the location of various nodes (routers) of the network, the maximum permissible cost of the installing the links, and the possible position of the links, an algorithm by K. K. Aggarwal is used to determine an optimal network topology, which maximizes the CNR.

Network Design Algorithm is as follow:

Steps:

1.  Find all the MSTs.

2.  Generate a cost matrix, Sc, with its rows corresponding to each spanning tree. For each row, if branch i exists in the MST then initialize it with its cost, else zero.

3.  Generate a reliability matrix, Sr, with its rows corresponding to each spanning tree. For each row if branch i exists in the MST then initialize it with its reliability, else one.

4.  Determine the cost of each MST from matrix Sc and generate a new matrix A.

5.  Determine the reliability of each MST from matrix Sr and generate a new matrix X.

6.  Determine the ratio of MST reliability and MST cost and generate a new matrix D, i.e., D = X / A.

7.  Choose the MST that has the highest ratio and satisfies the cost.

8.  Remove the links that are in the MST of step 7.

9.  For each remaining branch compute the increment in the reliability if that branch is added. Find the ratio of incremental reliability, DR, with the cost of the link

(i.e. DD = DR / cost of link).

10.  Choose the one with the highest ratio (i.e. highest value of DD) and satisfy the remaining cost.

11.  Remove this link from consideration and repeat steps 9 and 10 until all links are exhausted or the cost is exceeded.

Example of the network design problem

a b c d e f g h

Link cost 2 5 3 6 4 3 4 3

Link reliability 0.9 0.7 0.8 0.6 0.9 0.8 0.7 0.8

Maximum permissible cost = 21 units

The nodes to be connected are as follows. The algorithm will determine which links are to be maintained.

d

a g

c f

b h

e

Step 1: All the possible MSTs are abdeh, bcdeh, acdeh, bdefh, adefh, abefh, bdegh, adegh, abdgh, abdeg, abegh, bcefg, bcdfh, cdfha, acefh, bcegh, bcdeg, bcdgh, acegh, acdeg, acdgh, bdefg, adefg, abefg, abdfg, bcefg, abcfg, acefg, acdfg.

Steps 2 and 4

a b c d e f g h

2 5 0 6 4 0 0 3 20

0 5 3 6 4 0 0 3 21

2 0 3 6 4 0 0 3 18

0 5 0 6 4 3 0 3 21

2 0 0 6 4 3 0 3 18

2 5 0 6 0 3 0 3 19

2 5 0 0 4 3 0 3 17

0 5 0 6 4 0 4 3 22

2 0 0 6 4 0 4 3 19

2 5 0 6 0 0 4 3 20

2 5 0 6 4 0 4 0 21

2 5 0 0 4 0 4 3 18

0 5 3 0 4 3 0 3 18

0 5 3 6 0 3 0 3 20

2 0 3 6 0 3 0 3 17

2 0 3 0 4 3 0 3 15

0 5 3 0 4 0 4 3 19

Sc = 0 5 3 6 4 0 4 0 A = 22

0 5 3 6 0 0 4 3 21

2 0 3 0 4 0 4 3 16

2 0 3 6 4 0 4 0 19

2 0 3 6 0 0 4 3 18

0 5 0 6 4 3 4 0 22

2 0 0 6 4 3 4 0 19

2 5 0 0 4 3 4 0 18

2 5 0 6 0 3 4 0 20

0 5 3 0 4 3 4 0 19

0 5 3 6 0 3 4 0 21

2 0 3 0 4 3 4 0 16

2 0 3 6 0 3 4 0 18

Steps 3 and 5

a b c d e f g h