Optimal structure of fault-tolerant software systems

Gregory Levitin

Reliability Department, Planning, Development and Technology Division,

Israel Electric Corporation Ltd., P.O. Box 10, Haifa, 31000 Israel

E-mail:

Abstract

This paper considers software systems consisting of fault-tolerant components. These components are built from functionally equivalent but independently developed versions characterized by different reliability and execution time. Because of hardware resource constraints, the number of versions that can run simultaneously is limited. The expected system execution time and its reliability (defined as probability of obtaining the correct output within a specified time) strictly depend on parameters of software versions and sequence of their execution. The system structure optimization problem is formulated in which one has to choose software versions for each component and find the sequence of their execution in order to achieve the greatest system reliability subject to cost constraints. The versions are to be chosen from a list of available products. Each version is characterized by its reliability, execution time and cost.

The suggested optimization procedure is based on an algorithm for determining system execution time distribution that uses the moment generating function approach and on the genetic algorithm. Both N-version programming and the recovery block scheme are considered within a universal model. Illustrated example is presented.

Keywords: fault-tolerant programming; performance; reliability; expected execution time; optimization; genetic algorithm.

1. Introduction

Software failures are caused by errors made in various phases of program development. When the software reliability is of critical importance, special programming techniques are used in order to achieve its fault tolerance. Two of the best-known fault-tolerant software design methods are N-version programming (NVP) and recovery block scheme (RBS) [1]. Both methods are based on the redundancy of software modules (functionally equivalent but independently developed) and the assumption that coincident failures of modules are rare. The fault tolerance usually requires additional resources and results in performance penalties (particularly with regard to computation time), which constitutes a tradeoff between software performance and reliability.

NVP was proposed by Chen and Avizienis [2]. This approach presumes the execution of N functionally equivalent software modules (called versions) that receive the same input and send their outputs to a voter, which is aimed at determining the system output. The voter produces an output if at least M out of N outputs agree. Otherwise, the system fails.

In some applications, the available computational resources do not allow all of the versions to be executed simultaneously. In these cases, the versions are executed according to some predefined sequence and the program execution terminates either when the M versions produce the same output (success) or when after the execution of all the N versions the number of equivalent outputs is less than M (failure). The entire program execution time is a random variable depending on the parameters of the versions and on the order, according to which they are executed.

RBS was proposed by Randell [3]. In this approach, after execution of each version, its output is tested by an acceptance test block (ATB). If the ATB accepts the version output, the process is terminated and the version output becomes the output of the entire system. If the ATB does not accept the output, the next version is executed. If all N versions do not produce the accepted output, the system fails. It can be easily seen that in the RBS the entire program execution time is also a random variable depending on the parameters of the versions and on the order of their execution.

Estimating the effect of the fault-tolerant programming on system performance is especially important in safety critical real-time computer applications. This effect has been studied by Tai et al. in [4] and by Goseva-Popstojanova and Grnarov in [5, 6]. While in [4] a basic realization of NVP (N=3, M=2) consisting of versions with identical fault probabilities and different execution times has been considered, in [5] and [6] NVP with arbitrary N has been studied in which both times to failure and execution times of different versions are identically distributed random variables.

In many cases, the information about version reliability and execution time is available from separate testing and/or reliability prediction models [7]. This information can be incorporated into a fault-tolerant program model in order to obtain an evaluation of its reliability and performance. The reliability model of NVP with versions having different reliability has been considered in [8]. However, in this study, the system performance evaluation problem has not been addressed and a general algorithm for evaluating NVP reliability for arbitrary N and M has not been suggested.

The total cost of the software system is defined by the cost of the versions it consists of. Cost for each version can be the purchase cost, if the versions are commercial off the shelf, or an estimate based upon version size, complexity and performance.

Since in programs consisting of different versions the parameters of the versions and the sequence of their execution affects the program reliability and performance, the problem of optimal version choice and sequencing arises. This paper presents an algorithm that selects the set of versions and determines thesequence of their execution such that the system reliability (defined as probability of obtaining the correct output within a specified time) is maximized and total cost remains within budget.It is assumed that the software system is built from NVP and RBS components and the versionscan be chosen from a list of available products characterized by different reliability and execution time.

The universal model for both types of fault-tolerant programs and measures of their reliability and performance are introduced in the following section. A fast algorithm for evaluating the software system reliability and performance measures for an arbitrary sequence of versions is presented in section 3. The optimization technique is described in section 4. Section 5 contains illustrative examples.

Nomenclature

Pr(e)probability of event e

1(x)unity function: 1(TRUE)=1, 1(FALSE)=0

Cnumber of components in the software system

Ncnumber of versions chosen for component c

Jcnumber of versions available for component c

Mcnumber of identical outputs needed for component c to succeed

Lcnumber of versions of component c that can be executed simultaneously

rc(i)reliability of i-th version of component c

c(i)execution time of i-th version of component c

sc(i)cost of i-th version of component c

tc(i)termination time of i-th version of component c

pciprobability that system component c produces correct output after execution of i first versions

vcjnumber of version which is j-th in the line for execution in component c

wcjnumber of version that produces j-th output in component c

Tcrandom execution time for system component c

Trandom execution time for the entire system

txx-th realization of T

qxPr(T=tx)

Xnumber of different realizations of T

Stotal system cost

S*maximal allowable system cost

T*upper bound for system execution time

R(T*) software system reliability (probability that the correct output is produced in time less than T*)

Esoftware system expected execution time

Abbreviations

NVPN-version programming

RBS recovery block scheme

ATB acceptance test block

MGFmoment generating function

2. The problem formulation

According to the model presented in [8] the software system consists of C components. Each component performs a subtask and the sequential execution of the components performs a major task.

It is assumed that Nc functionally equivalent versions are available for each component c. Each version has an estimated reliability and constant execution time. Failures of versions for each component are statistically independent as well as the total failures of the different components.

The versions of each component c start their execution in accordance with an ordered list. Lc first versions from the list start their execution simultaneously (at time 0). If the number of terminated versions is less than Mc, after termination of each version a new version from the list starts its execution immediately. If the number of terminated versions is not less than Mc, after termination of each version the voter compares the outputs. If Mc outputs are identical, the component terminates its execution (terminating all the versions that are still executed), otherwise a new version from the list is executed immediately.

If after termination of Nc versions the number of identical outputs is less than Mc the component and the entire system fail.

In the case of component success, the time of the entire component execution is equal to the termination time of the version that has produced the Mc-th correct output (in most cases the time needed by the voter to make the decision can be neglected). It can be seen that the component execution time is a random variable depending on the reliability of the component versions.

The examples of time diagrams corresponding to a component with Nc=5 and Mc=3 and with a given sequence of versions execution (the versions are numbered according to this sequence) and different Lc are presented in Fig. 1. In this component, the second version fails and the task execution successfully terminates after termination of the four first versions. The influence of the sequence of versions execution on the component performance time is demonstrated in Fig. 2 for the same system for Nc=5, Mc=3 and Lc=2.

The RBS technique is very similar to the NVP with Mc=1. The main difference is in using the ATB after the execution of each version (the acceptance test time is usually much greater than the decision time of the voter and, therefore, cannot be neglected). By adding the acceptance test time to the execution time of each version one can consider the RBS to be NVP with Mc=1.

The sum of the random execution times of each component gives the random task execution time for the entire system. In order to estimate both the system's reliability and its performance, different measures can be used depending on the application.

Fig. 1. Time diagrams of versions execution in a component with different Lc.

In applications where the execution time of each task is of critical importance, the system reliability R(T*) is defined (according to performability concept [4, 9]) as a probability that the correct output is produced in time less than T*.

In applications where the average system productivity (the number of executed tasks) over a fixed mission time is of interest [10], the system reliability is defined as the probability that it produces correct outputs without respect to the total execution time, (this index can be referred to as R()) while the conditional expected system execution time E is considered to be a measure of its performance. This index determines the system's expected execution time given that the system does not fail.

Fig. 2. Execution of different sequences of versions in a component with Lc=2 .

Consider a set of Jc different versions available for each component c. Each version j (1jJc)is characterized by its reliability rc(j), execution time c(j) and cost sc(j). The permutation v*cof Jc different integer numbers ranging from 1 to Jc determines the order of the version that can be used in component c.Let hcj=1 if the version j is chosen to be included into component c and hcj=0 otherwise. The vector determines the subset of chosen versions for component c.

Having the vectors v*c and hc one can determine the execution order vcof the chosen versions by removing from v*cany number k for which hck=0. The total number of versions in component c(equal to the length of vector vc after removing the unchosen versions) is determined as

(1)

The system structure optimization problem can now be formulated as follows: find vectorsvc for 1cC that maximize R(T*) subject to cost constraint:

.(2)

3. Evaluating software system reliability and performance indices

3.1. Version termination times

In each component c a sequence in which the versions start their execution is defined by the vector vc. This means that each versionvci starts execution not earlier than versions vc1,…, vci-1 and not latter than versions vci+1,…, vcNc. Having the execution time of each version c(vci)(1iNc) and number of versions that can run simultaneously Lc, one can obtain termination time tci for each version using the following simple algorithm:

1. Assign x1=…= xLc=0;

2. For i=1,…,Nc repeat:

2.1.Find any k (1kLc): xk=min{ x1,…, xLc};

2.2. Obtain tc(vci)=xk+c(vci) and assign xk=tc(vci).

Times tc(vci),1iNc correspond to intervals between the beginning of component execution and the moment when the versions produce their outputs. Observe that the versions that start execution earlier can terminate later: jk does not guarantee that tc(vcj)tc(vck). To obtain the sequence, in which the versions produce their outputs, the obtained termination times should be sorted in increasing order tc(wc1)tc(wc2)…tc(wcNc), which gives the order of versions wc1,wc2,…,wcNc, corresponding to times of their termination.

The mapping

vc1,vc2,…,vcNcwc1,wc2,…,wcNc

determines the sequence of version outputs in which they arrive to the voter. Now one can consider the component c as a system in which the Nc versions are executed consecutively according to order wc1,wc2,…,wcNc and produce their outputs at times tc(wc1), tc(wc2), …, tc(wcNc).

3.2. Reliability and performance of components and entire system

Consider the probability that m out of k first versions of component c succeed. This probability can be obtained as:

(3)

The component c produces the correct output directly after the end of the execution of j versions (jMc) if and only if the wcj-th version succeeds and exactly Mc-1 out of the first executed j-1 versions succeed.

The probability of such event pcj is:

(4)

The component reliability Rc() is the probability that Mc out of j first versions succeed for j{Mc,…,Nc). Since the events of successful component execution termination for different j are mutually exclusive, we obtain:

Rc() =.(5)

The failure probability of component c is 1-Rc().

Observe that pci and tc(wci) for MciNc define the distribution of the random execution time Tc for the system component c. The random execution time of the entire system is equal to the sum of the execution times of its components:

T=.(6)

Having the distribution of this value in the form qx, tx (1xX) one can obtain the system reliability defined as its ability to produce the correct output in less time than a specified value T*:

.(7)

The probability that the correct output is obtained without respect to the execution time is defined as

.(8)

The conditional expected system execution time (the expected system execution time on condition that the system does not fail) can be obtained as:

.(9)

3.3. Algorithm for determining the execution time distribution for the system component

In order to obtain the execution time distribution for a component c in the form pcj, tc(wcj) (McjNc) one can determine the realizations tc(wcj) of the component execution time Tc using procedure presented in section 3.1 and probabilities pcj=Pr(Tc=tcj) using Eq. (4).

However, the probabilities pcj can be obtained in a much simpler way using a procedure based on the moment generating function (MGF) technique.

The MGF of a discrete random variable Y is defined as a polynomial

(10)

where the variable Y has K possible values and ak is the probability that Y is equal to yk.

Let the random binary variable bci be an indicator of the success of version wci in component c such that bci=1 if the version produces the correct output and bci=0 if it produces the wrong output. The distribution of bci can be represented by the MGF

uci(z)=rc(wci)z1+(1- rc(wci))z0. (11)

It can be easily seen that the product of polynomials

(12)

represents the distribution of the number of correct outputs in component c after the execution of a group of first j versions. Indeed the resulting polynomial relates the probabilities of combinations of correct and wrong outputs (the product of corresponding probabilities) with the number of correct outputs in these combinations (the sum of success indicators). Observe that after collecting the like terms (corresponding to obtaining the overall probability of a different combinations with the same number of correct outputs) Ucj(z) takes the form

,(13)

where jk is the probability that the group of first j versions produces k correct outputs.

Note that Ucj(z) can be obtained by using the recurrent expression

. (14)

According to its definition, pcj is the probability that the group of first j versions produces Mc correct outputs given the group of first j-1 versions has produced Mc-1 correct outputs, while jMcin polynomial Ucj(z) is equal to the unconditional probability that the group of first j versions produces Mc correct outputs.

In order to let the coefficient jMc in polynomial Ucj(z) be equal to pcj, the term with the exponent equal to Mc should be removed from Uc j-1(z) before applying Eq. (14) (excluding the combination in which j first versions produce Mc correct outputs while the j-th version fails).

The above considerations lie at the base of the following algorithm for determining all of the probabilities pcj (McjNc):

  1. Determine the MGF of each version of component c according to Eq. (11);
  2. Define Uc0(z)=1;
  3. For j=1,2,…,Nc:

3.1 Obtain Ucj(z) using Eq. (12) and collecting like terms;

3.2. If jMc assign: pcj=jMc.

3.3. Remove term jMczMc from Ucj(z) (if such a term exists);

This algorithm is much simpler than using Eq. (4). Since after collecting like terms each MGF Ucj(z) can have no more than Mc terms, the computational complexity of the algorithm (number of term multiplications in the procedure (14)) is not greater than 2(Nc-1)Mc.

3.4. Determining the execution time distribution for the entire system

Having the distributions of random execution times Tc for each component (pci, tc(wci) for 1cC, 1iNc) one can represent these distributions in the MGF form as follows:

(15)

The product

(16)