
Methods of increasing modelling power for safety analysis of digital systems, applied to a Gas Turbine Control System

A. Bobbio1, E. Ciancamerla2, G. Franceschinis1, R. Gaeta3, M. Minichino2, L. Portinale1

1 DISTA, Università del Piemonte Orientale, 15100 Alessandria (Italy)

2 ENEA CR Casaccia, 00060 Roma (Italy)

3 Dipartimento di Informatica, Università di Torino, 10150 Torino (Italy)

Abstract. The paper describes a probabilistic approach to analysing the safety of a Gas Turbine Digital Controller, based on methods of increasing modelling power and different analytical tractability. First, a Fault Tree (FT) is built to model the system under basic assumptions, such as independent stochastic activities and binary states. The FT is then converted into a Bayesian Net, to include multi-state components and sequentially dependent failures and to perform diagnoses, and further into a Stochastic Petri Net, to accommodate repair activity and components in cold stand-by redundancy. Because the state space of the Stochastic Petri Net model is very large, Stochastic Well-formed Nets (SWN) are adopted to alleviate the state explosion problem. Through the use of colours, SWN make it possible to represent the symmetries of the system compactly and to fold components with the same failure rates. Safety measures are computed with reference to the emerging standard IEC 61508. The applicability, the limits and the main selection criteria of the investigated methods are provided.

1. Introduction

The paper describes a probabilistic approach to analysing the safety of a Gas Turbine Digital Controller, based on methods of increasing modelling power and different analytical tractability. The Gas Turbine Digital Controller belongs to a co-generative plant, ICARO, in operation at the ENEA CR Casaccia centre. The plant is composed of two sections: the gas turbine section, which produces electrical power, and the heat exchange section, which extracts heat from the turbine exhaust gases. The Gas Turbine Digital Controller performs both control and protection functions, so that the gas turbine section works efficiently and with high availability and the engine is protected from over-temperature and over-speed; it therefore involves safety aspects. A fault or a deterioration in the Gas Turbine Digital Controller could result in a reduction of plant efficiency (e.g. increased fuel consumption or nitrogen oxide pollution), of plant availability (e.g. reduced operating time due to trips or failures) or of plant safety (e.g. the failure of a protection function, which could damage the engine, a safety-critical item for the plant because of its high capital cost).

The control and protection functions rely on a digital system. While this brings benefits, it also increases risks, because such systems are vulnerable to random failures and design errors.

For digital systems, the demand for safety is increasingly urgent even in conventional application domains, such as the ICARO co-generative plant, as shown by the growing demand for conformity to the IEC 61508 standard. IEC 61508 does not address any specific sector. A very important concept in IEC 61508 is that of Safety Integrity Level (SIL). SILs are used as the basis for specifying the safety integrity requirements of the safety functions to be implemented by the safety-related system. To determine the appropriate SIL, IEC 61508 relies on the concept of risk and provides a number of different methods, both quantitative and qualitative.

As a starting point, a Fault Tree (FT), which models the system under basic assumptions such as independent stochastic activities and binary states, has been built and analysed. To include multi-state components and sequentially dependent failures of some components of the Digital Controller, and to perform diagnoses, the FT has been converted into a Bayesian Net. To accommodate statistical dependence of the transition rates and repair transitions, the FT has been converted into a Stochastic Petri Net. The Stochastic Petri Net model turned out to be unmanageable because of its very large state space. To manage this complexity, Stochastic Well-formed Nets, a special class of Stochastic Petri Nets, have been adopted. By using colours, Stochastic Well-formed Nets make it possible to represent the symmetries of the system compactly and to alleviate the state explosion problem. Model results have been compared with the Safety Integrity Levels of the IEC 61508 standard.

The paper is organised as follows. Section 2 summarises the main requirements of the IEC 61508 standard. Section 3 describes the case study. Sections 4, 5 and 6 present the models of the case study built with the different methods, the investigated measures and the experimental results. Section 7 draws the conclusions.

2. IEC 61508 Standard

The process industry requires that well-defined safety requirements be met, since hazards may be present in process installations. IEC 61508 introduces the principle known as As Low As Reasonably Practicable (ALARP). ALARP defines the tolerable risk as the level of risk at which additional spending on risk reduction would be disproportionate to the reduction of risk actually obtainable. The strategy proposed by IEC 61508 takes into account both random and systematic errors, and emphasises not only technical requirements but also the management of safety activities throughout the whole safety lifecycle [2].

IEC 61508 has introduced the concept of Safety Integrity Level (SIL) attempting to homogenise the concept of safety requirements for the Safety Instrumented Systems. According to IEC 61508 the SIL is defined as “one of 4 possible discrete levels for specifying the safety integrity requirements of the safety functions to be allocated to the safety-related systems. SIL 4 has the highest level of safety integrity, SIL 1 has the lowest”.

The target dependability measures for the 4 SILs are specified in Table 1, for systems with low demand mode of operation and for systems with continuous (or high demand) mode of operation. Determining the appropriate SIL for a safety-related system is a difficult task, which relies largely on the experience and judgement of the team doing the job. IEC 61508 offers criteria and guidelines for assigning the appropriate SIL as a function of the level of fault tolerance and of the diagnostic coverage.

Table 1. Safety Integrity Levels: Target Failure Measures

Safety Integrity Level / Low demand mode of operation (probability of failure to perform its design function on demand) / Continuous or high demand mode of operation (probability of a dangerous failure per hour)
4 / >=10^-5 to <10^-4 / >=10^-9 to <10^-8
3 / >=10^-4 to <10^-3 / >=10^-8 to <10^-7
2 / >=10^-3 to <10^-2 / >=10^-7 to <10^-6
1 / >=10^-2 to <10^-1 / >=10^-6 to <10^-5
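
The bands of Table 1 can be read as a simple lookup. The following is a minimal sketch, not part of the standard or of the tool used in the paper; the names `LOW_DEMAND`, `HIGH_DEMAND` and `sil_for` are illustrative assumptions, and only the numerical bounds come from Table 1.

```python
# Target failure measures from Table 1, as (SIL, lower bound inclusive, upper bound exclusive).
# The constant and function names are hypothetical; only the bounds come from Table 1.
LOW_DEMAND = [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3), (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]
HIGH_DEMAND = [(4, 1e-9, 1e-8), (3, 1e-8, 1e-7), (2, 1e-7, 1e-6), (1, 1e-6, 1e-5)]


def sil_for(measure, bands):
    """Return the SIL whose band contains `measure`, or None if it lies outside every band."""
    for sil, low, high in bands:
        if low <= measure < high:
            return sil
    return None


if __name__ == "__main__":
    # A dangerous failure frequency of 3.722e-8 per hour falls in the SIL 3 band.
    print(sil_for(3.722e-8, HIGH_DEMAND))
```

How to treat a value below the SIL 4 band (for which the sketch returns None) is left to the caller, since Table 1 does not list a band for it.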

3. Gas Turbine Section

The gas turbine section (figure 1) consists of four main parts: the compressor, the combustion chamber, the turbine itself and the generator. The gas turbine is a single-shaft engine. The rotor, which rotates at 22,500 rpm, is linked to a reduction gear for coupling with the generator. The compressor feeds air to the combustion chamber, into which the gas fuel is also fed. There, combustion produces gases at high pressure and high temperature. The NOx content can be kept within the required limits by injecting water to reduce the flame temperature. The expansion of these gases in the turbine makes the turbine rotate, with a torque that is transmitted to the generator to produce the electrical power output.

Figure 1 - Gas turbine section

The air flow rate is constant and a control valve regulates the gas fuel fed to the combustion chamber. The control valve is actuated by the control system and a position sensor reads its position. The exhaust gas temperature, which is the most critical variable for the engine control system, is taken as the average of eight thermocouples located along the circumference of the turbine exit. Of all the variables involved in the operation of the gas turbine, only a few are directly measured by sensors. Averages of the sensor signals are taken by analog circuitry and are used, together with the turbine speed, to protect the engine.

3.1. Gas Turbine Digital Controller

The Gas Turbine Digital Controller performs both control functions and protection functions [3,4]. It also performs alarm monitoring and communication functions, which are not considered in this paper.

Control functions address normal running and all the plant sequencing needed in starting and stopping operations. In performing control functions, the control system evolves through several states: from the starting and no-load states to the running-on-load and shutting-down states. From any of these states, shutdown requests override the control logic and lead the system to the prior-to-start state. At any time, a shutdown request causes the control system to enter its emergency shutdown state and carry out the shutdown actions, which include the de-energisation of the related relays. The control functions are essentially based on the control of the fuel metering valve position.

Protection functions consist in protecting the engine by means of independent overtemperature and overspeed shutdowns. Two thermocouple sensors and one speed probe are used as inputs of the protection functions. The gas turbine control system (figure 2) comprises a Main Controller, which implements the control functions, and a Backup Unit, which implements the protection functions. The Main Controller and the Backup Unit have separate processors and independent power supplies, so that the Backup Unit is able to provide independent protection functions. The Main Controller occupies two full boards, the Baseboard and the Expansion Board, and a portion of the Auxiliary Board that is shared with the Backup Unit. The Baseboard hosts a processor that performs the main control functions. The Expansion Board provides additional intelligent I/O capacity. The Backup Unit occupies the other portion of the Auxiliary Board, on which an independent processor performs the protection functions. The sensors and actuators shown in figure 2 are limited to those needed to perform the fuel demand control logic in the run state.

Figure 2 - High level architecture of the Gas Turbine Control System

The Backup Unit implements the protection functions of the turbine by providing an independent overtemperature and overspeed shutdown.

The hardware structure of the control system has been summarised as two subsystems (figure 3), representing the Main Controller and the Backup Unit. Each unit has an independent CPU and uses a separate power supply circuit (operating from the same supply inlet), but the two units share the following transducer signals: 2 thermocouples and 1 speed probe. A "watchdog" relay is associated with each hardware circuit board.

DI - Digital input;
AI - Analog input;
CPU - 32-bit microprocessor;
MEM - Memory;
I/O - I/O bus;
DO - Digital output;
AO - Analog output;
WD - Watchdog relay;
PS - Power supply inlet;
SMC - Supply circuit of the main controller;
RO - Relay output.

Figure 3 - Main Controller and Backup Unit hardware structure

The elementary components of the gas turbine control system are assumed to have constant failure rates (table 2).

Table 2 - Component failure rates (f/h)

Component / Failure rate (f/h)
I/O bus / λ_IO = 2.0 × 10^-9
Therm. / λ_Th = 2.0 × 10^-9
Speed / λ_Sp = 2.0 × 10^-9
Memory / λ_M = 5.0 × 10^-8
DO / λ_DO = 2.5 × 10^-7
AO / λ_AO = 2.5 × 10^-7
RO / λ_RO = 2.5 × 10^-7
DI / λ_DI = 3.0 × 10^-7
AI / λ_AI = 3.0 × 10^-7
PS / λ_PS = 3.0 × 10^-7
SMC / λ_Smc = 3.0 × 10^-7
SBU / λ_Sbu = 3.0 × 10^-7
CPU / λ_CPU = 5.0 × 10^-7
WD / λ_WD = 2.5 × 10^-7

4. The Fault Tree model

At the highest level of the analysis we adopt a Fault Tree model (figure 4).

Figure 4 - Fault Tree model for safety critical failures

The Fault Tree analysis is based on the following simplifying assumptions: components (and the system) have binary behaviour (up or down), and failure events are statistically independent. Both qualitative and quantitative analyses of the FT have been carried out. The qualitative analysis aimed at identifying the most critical failure paths, i.e. the minimal cut sets (mcs). The quantitative analysis aimed at evaluating measures useful for characterising safety. The analysis found 43 mcs, with the characteristics given in table 3. The most critical mcs, sorted by order, are shown in table 4.

Table 3 - Order and number of mcs

Order / Number of mcs / % on TE Unreliability
4 / 2 / 93.09
3 / 41 / 6.91

The following measures have been computed:

Unreliability versus time;
Safe Mission Time (SMT), computed as the time interval in which the system unreliability is strictly lower than a pre-assigned threshold;
Mean Time To Failure (which we consider a less significant measure than SMT);
Most critical failure paths;
SIL evaluation, limited to the table 1 requirements of the IEC 61508 standard.

Table 4 - Most critical mcs

No. / Minimal cut set
1 / PS WDB WDM
2 / CPUB CPUM WDB WDM
3 / Speed WDB WDM
4 / AIB CPUM WDB WDM
5 / DIM CPUB WDB WDM
6 / SupM CPUB WDB WDM
7 / CPUM SupB WDB WDM
8 / AIM CPUB WDB WDM
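
Under the independence assumption stated above, the probability of a minimal cut set at time t is the product of the failure probabilities of its components, and the sum over the cut sets is an upper bound on the top event probability. The sketch below illustrates this for the first three cut sets of table 4. It assumes that both watchdogs (WDB, WDM) and both CPUs use the WD and CPU rates of table 2; the function names are illustrative, and the sketch is not expected to reproduce the figures obtained by the authors' tool on the full set of 43 mcs.

```python
import math

# Constant failure rates (f/h) from table 2. Mapping WDB/WDM to the single WD rate and
# CPUB/CPUM to the single CPU rate is an assumption made for this illustration.
RATES = {"PS": 3.0e-7, "WD": 2.5e-7, "CPU": 5.0e-7, "Speed": 2.0e-9}


def p_fail(rate, t):
    """Failure probability at time t (hours) of a component with constant failure rate."""
    return 1.0 - math.exp(-rate * t)


def cut_set_probability(component_types, t):
    """Product of component failure probabilities (valid for independent failures)."""
    prob = 1.0
    for kind in component_types:
        prob *= p_fail(RATES[kind], t)
    return prob


def union_bound(cut_sets, t):
    """Sum of cut-set probabilities: an upper bound on the probability of their union."""
    return sum(cut_set_probability(cs, t) for cs in cut_sets)


if __name__ == "__main__":
    first_three = [["PS", "WD", "WD"], ["CPU", "CPU", "WD", "WD"], ["Speed", "WD", "WD"]]
    for cs in first_three:
        print(cs, cut_set_probability(cs, 1.0e5))
    print("upper bound:", union_bound(first_three, 1.0e5))
```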

Unreliability versus time and failure frequency have been computed (table 5) for the SIL evaluation according to IEC 61508. Comparing the failure frequency of dangerous failures (third column of table 5) with the SIL target failure measures (table 1), the system achieves SIL 3 up to 500,000 h.

Fixing a limit U = 1.0 × 10^-3 for the unreliability, the Safe Mission Time is SMT = 210,000 h.

The Mean Time To Failure for the Top Event (TE) is MTTF = 3.072 × 10^6 h.

Table 5 - Unreliability and failure frequency of dangerous failures

Time t (h) / TE Unreliability / Failure Frequency
10,000 / 9.095 × 10^-9 / 9.095 × 10^-13
50,000 / 5.157 × 10^-6 / 1.031 × 10^-10
100,000 / 7.317 × 10^-5 / 7.317 × 10^-10
150,000 / 3.291 × 10^-4 / 2.194 × 10^-9
200,000 / 9.256 × 10^-4 / 4.128 × 10^-9
250,000 / 2.014 × 10^-3 / 8.056 × 10^-9
300,000 / 3.730 × 10^-3 / 1.243 × 10^-8
350,000 / 6.181 × 10^-3 / 1.766 × 10^-8
400,000 / 9.447 × 10^-3 / 2.372 × 10^-8
450,000 / 1.358 × 10^-2 / 3.018 × 10^-8
500,000 / 1.861 × 10^-2 / 3.722 × 10^-8
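
The third column of table 5 appears to be consistent with dividing the TE unreliability by the mission time (for example, 9.095 × 10^-9 / 10,000 h = 9.095 × 10^-13). This is an inference from the tabulated values, not a definition stated in the text. Under that assumption, the sketch below recomputes the frequency for a subset of rows and finds the first tabulated time at which it reaches the SIL 3 lower bound of table 1 (10^-8 per hour). The names `TABLE_5_SUBSET`, `failure_frequency` and `first_time_at_or_above` are illustrative.

```python
# (time in hours, TE unreliability) pairs copied from a subset of the rows of table 5.
TABLE_5_SUBSET = [
    (10_000, 9.095e-9),
    (50_000, 5.157e-6),
    (100_000, 7.317e-5),
    (150_000, 3.291e-4),
    (250_000, 2.014e-3),
    (300_000, 3.730e-3),
    (500_000, 1.861e-2),
]


def failure_frequency(unreliability, t):
    """Average failure frequency over [0, t]; assumed here to be unreliability / t."""
    return unreliability / t


def first_time_at_or_above(threshold, rows):
    """First tabulated time whose failure frequency is >= threshold, or None."""
    for t, unreliability in rows:
        if failure_frequency(unreliability, t) >= threshold:
            return t
    return None


if __name__ == "__main__":
    for t, u in TABLE_5_SUBSET:
        print(t, failure_frequency(u, t))
    print(first_time_at_or_above(1e-8, TABLE_5_SUBSET))
```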

5. The Bayesian Network model

Following the translation algorithm presented in [5], the Bayesian Network (BN) derived from the FT of Figure 4 is reported in Figure 5. In the BN of Figure 5, gray ovals represent root nodes (corresponding to the basic events in the FT), while white ovals represent non-root nodes. Every node in the BN is binary, since the variable associated with it is a binary variable. The values of each variable represent the presence of a failure condition (true) or an operational condition (false). The only chance (probabilistic) nodes of the BN are the roots (gray nodes). All the other nodes (white ovals) are deterministic nodes whose Conditional Probability Tables (CPTs) contain only 0 or 1 and are determined by the type of the FT gate they refer to (namely the boolean AND function or the boolean OR function) [5]. Each root node must be assigned a prior probability. Since the information about the failure probability of the system components is given as a constant failure rate (Table 2), the probability of the true value for a generic component C with failure rate $\lambda_C$ at a specific mission time t is $\Pr(C = \text{true}) = 1 - e^{-\lambda_C t}$.
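
The two ingredients described above, root priors obtained from constant failure rates and deterministic CPTs for gate nodes, can be sketched as follows. This is a minimal illustration and not the translation algorithm of [5]; the function names are assumptions, and `node_probability` is only valid when the parents are independent root nodes.

```python
import itertools
import math


def prior(rate, t):
    """Pr(C = true) = 1 - exp(-rate * t) for a root node with constant failure rate."""
    return 1.0 - math.exp(-rate * t)


def gate_cpt(kind, n_parents):
    """Deterministic CPT of a gate node: parent states -> Pr(node = true), either 0 or 1."""
    if kind not in ("AND", "OR"):
        raise ValueError("kind must be 'AND' or 'OR'")
    combine = all if kind == "AND" else any
    return {
        states: float(combine(states))
        for states in itertools.product((False, True), repeat=n_parents)
    }


def node_probability(cpt, parent_probs):
    """Pr(node = true), summing the CPT over the states of independent parent nodes."""
    total = 0.0
    for states, p_true in cpt.items():
        weight = 1.0
        for failed, p in zip(states, parent_probs):
            weight *= p if failed else 1.0 - p
        total += p_true * weight
    return total


if __name__ == "__main__":
    t = 5.0e5
    p_cpu = prior(5.0e-7, t)  # CPU rate from Table 2
    p_ps = prior(3.0e-7, t)   # PS rate from Table 2
    print(node_probability(gate_cpt("OR", 2), [p_cpu, p_ps]))
    print(node_probability(gate_cpt("AND", 2), [p_cpu, p_ps]))
```

For an AND gate the result reduces to the product of the parent probabilities, and for an OR gate to one minus the product of their complements, as in the FT.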

Given the prior failure probabilities of the system components (i.e. the basic events in the FT) computed at different mission times (from t = 1 × 10^5 h to t = 5 × 10^5 h), we can evaluate the unreliability of the TE by computing the probability of node TE in the BN of Figure 5 given a null evidence.

Figure 5 - The Bayesian Net translating the FT of figure 4

5.1 Posterior analysis

The novelty and the strength of the BN approach lie in the possibility of computing posterior probabilities (i.e. diagnoses), in order to analyse the criticality of the system components with respect to a partial or total system failure. To this end, a probabilistic computation is carried out in which the occurrence of the TE is the evidence provided to the BN.

There are two main probabilistic computations that can be performed:

the posterior probability of each single component (in terms of BN, a belief updating propagation must be carried out);
the joint posterior probability over the set of components (in terms of BN, a belief revision looking for the most probable configurations of the root variables must be carried out).

The first analysis provides information about the criticality (with respect to the occurrence of the TE) of each single component alone, by computing the probability that the component is down given that the TE has occurred. The second kind of analysis is more sophisticated and addresses the criticality problem over a set of components. It is worth noting that, unlike the MCS computation, it considers all the components (i.e. basic events) in a given configuration, and thus provides more precise information. In this case, the posterior joint probability of all the components, given that the TE has occurred, is computed. Table 6 reports the posteriors of each single component computed at time t = 5 × 10^5 h.

Table 6 - Posterior Probabilities for single components

Component / Posterior
WDb / 1
WDm / 1
CPUb / 0.37063624
PS / 0.34525986
CPUm / 0.30848555
AIb / 0.2333944
SB / 0.2333944
ROb / 0.19688544
AIm / 0.19425736
SM / 0.19425736
DIm / 0.19425736
AOm / 0.16387042
DOm / 0.16387042
Mem / 0.03443292
Speed / 0.00247744
I/Ob / 0.00167474
I/Om / 0.00139391
Th1 / 0.00100097
Th2 / 0.00100097

The criticality of a component is a more significant measure than its (prior) failure probability. Indeed, the components are ranked differently by the prior and by the posterior computations. The two watchdogs WDm and WDb have criticality 1, since their failures are necessary for a system failure (as could also be deduced from the structure of the FT). Moreover, the probability of a CPU failure given the occurrence of the TE is about 30% for the CPU of the main controller (CPUm) and about 37% for the CPU of the backup unit (CPUb). Notice that these posterior values differ, even though both CPUs have the same failure rate, because of the different roles they play in the overall system dependability. In fact, the failure of the main controller MC and the failure of the backup unit BU are each given by the failure of the corresponding CPU in boolean OR with the failure of the PER sub-system, but the failure of PER M follows a different sequence of events than the failure of PER B, which results in different posterior probabilities for the two CPUs as well.
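
The two posterior computations can be illustrated by brute-force enumeration on a small example. The structure used below, TE = WDm AND WDb AND (CPUm OR PS) AND (CPUb OR PS), is a hypothetical simplification chosen for this sketch and is not the FT of figure 4, so the numbers it produces are not those of Table 6. It does reproduce one qualitative property discussed above: a component whose failure is necessary for the TE has posterior 1.

```python
import itertools
import math

T = 5.0e5  # mission time (h)
# Rates from Table 2; the five-component structure below is hypothetical.
RATES = {"WDm": 2.5e-7, "WDb": 2.5e-7, "CPUm": 5.0e-7, "CPUb": 5.0e-7, "PS": 3.0e-7}
PRIOR = {name: 1.0 - math.exp(-rate * T) for name, rate in RATES.items()}


def top_event(s):
    """Hypothetical structure function: True when the top event occurs."""
    return s["WDm"] and s["WDb"] and (s["CPUm"] or s["PS"]) and (s["CPUb"] or s["PS"])


def joint(s):
    """Prior joint probability of a configuration of independent root nodes."""
    p = 1.0
    for name, failed in s.items():
        p *= PRIOR[name] if failed else 1.0 - PRIOR[name]
    return p


def configurations():
    names = list(PRIOR)
    for values in itertools.product((False, True), repeat=len(names)):
        yield dict(zip(names, values))


def posteriors():
    """Return Pr(TE), single-component posteriors, and the most probable configuration given TE."""
    failing = [s for s in configurations() if top_event(s)]
    p_te = sum(joint(s) for s in failing)
    single = {c: sum(joint(s) for s in failing if s[c]) / p_te for c in PRIOR}  # belief updating
    best = max(failing, key=joint)  # belief revision: most probable explanation
    return p_te, single, best


if __name__ == "__main__":
    p_te, single, best = posteriors()
    print(p_te)
    print(single)
    print(best)
```

Enumeration grows exponentially with the number of root nodes, so it is only practical for toy examples; BN tools rely on dedicated propagation algorithms instead.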

5.2 Multi-state nodes and sequentially dependent failures

In the present section, we discuss the use of BN to highlight two peculiar features that are not available in FT, namely the possibility of modelling non-binary events (i.e. events whose behaviour is more accurately described by multi-state variables) and the inclusion of localized dependencies (where the state of a root component influences the state of other root components). A more realistic model of the power supply PS has three different conditions (states): working, degraded and failed. When PS is in the degraded state, it also induces an anomalous behaviour in the supply equipment of the main controller MC (SM) and of the back-up unit BU (SB). The BN that models this situation is reported in Figure 6, where only the relevant part of the BN of Figure 5 is reconsidered. The PS node has three states, denoted by W for working, deg for degraded and F for failed. The prior probabilities of the PS node in the three states are also reported in the figure. The arcs connecting node PS with nodes SM and SB indicate a possible influence of the parent node PS on the child nodes SM and SB. This influence is quantified in the CPTs reported in Figure 6, which show that a degradation in PS also induces a failure in SM and in SB with probability 0.9.
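
A small numerical sketch of this multi-state dependency follows. Only the value 0.9 for the degraded state comes from the text. The prior probabilities of the PS states and the CPT entries for the working and failed states are hypothetical placeholders, since the values of Figure 6 are not reproduced here.

```python
import math

T = 5.0e5  # mission time (h)
# Failure probability of a supply circuit on its own, from the Table 2 rate of 3.0e-7 f/h.
P_OWN = 1.0 - math.exp(-3.0e-7 * T)

# Hypothetical prior over the three PS states (the values of Figure 6 are not available here).
PS_PRIOR = {"W": 0.85, "deg": 0.10, "F": 0.05}

# Pr(supply circuit fails | PS state). Only the 'deg' entry (0.9) comes from the text;
# the 'W' and 'F' entries are assumptions made for this illustration.
CPT_FAIL = {"W": P_OWN, "deg": 0.9, "F": 1.0}


def marginal_failure():
    """Pr(SM fails), marginalising over the PS state (the same value holds for SB)."""
    return sum(PS_PRIOR[s] * CPT_FAIL[s] for s in PS_PRIOR)


def joint_failure():
    """Pr(SM fails and SB fails); SM and SB are conditionally independent given PS."""
    return sum(PS_PRIOR[s] * CPT_FAIL[s] ** 2 for s in PS_PRIOR)


if __name__ == "__main__":
    print(marginal_failure())
    print(joint_failure())
    print(marginal_failure() ** 2)  # value the joint would take if SM and SB were independent
```

The joint failure probability exceeds the product of the marginals, which is the kind of common-cause dependence that a FT with independent basic events cannot express.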