Computer architectures for embedded safety-critical applications

Manfred Schmitz, Technical Director
MEN Mikro Elektronik GmbH

Failures of safety-critical electronic systems can result in loss of life, substantial financial damage or severe harm to the environment. Such systems are used, for example, in medical equipment, in airplanes, in trains, in weapons and in nuclear power stations. The terms "safety" and "security" are often confused. The technical term "security" means that something is not only secure but has been secured, i.e. protected against deliberate interference or unauthorized access. A safe system, in contrast, is a system with a determined error behavior.

Two basic modes of error behavior are distinguished. Systems that are "fail-operational" continue to operate when an error occurs; examples are the flight control systems of airplanes and passively safe nuclear reactors. "Fail-safe" systems are switched off in case of a failure and are then in a safe state. A machine or a train, for example, can simply be stopped when its control system fails; for an airplane that is not possible.


Systems which continue to work correctly when part of the system fails are called fault-tolerant. Autopilots in airplanes or the control systems of conventional nuclear power plants work according to that principle. A frequently used method for achieving fault tolerance is permanent testing of all components. In case of an error the defective component is switched off and another one is used.

Figure 1: Process Control

Safety begins with the planning of development

The safety standard a system must comply with is described by classes, the so-called SIL levels (SIL = Safety Integrity Level). For each SIL level a certain reliability is required. There are a number of different standards in different industries; the IEC 61508 is of particular importance. This standard defines four levels – 4 is the highest reliability requirement, 1 the lowest. SIL 0 means that there are no special safety requirements.

The required reliability depends on the frequency of usage. For functions which are used only infrequently, like for example emergency stop switches, the standard defines a maximum PFD (probability of failure on demand) of

  • SIL4: 10⁻⁵ to 10⁻⁴
  • SIL3: 10⁻⁴ to 10⁻³
  • SIL2: 10⁻³ to 10⁻²
  • SIL1: 10⁻² to 10⁻¹
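
To make the band limits concrete, the following small C sketch (purely illustrative, not part of IEC 61508 or of any tool) maps a given PFD of a low-demand safety function onto the SIL bands listed above.

    #include <stdio.h>

    /* Illustrative sketch: classify the PFD of a low-demand safety function
     * into the SIL bands quoted above. A PFD better than 1e-5 is simply
     * reported as SIL 4 here; 0 means no special safety requirement. */
    static int sil_from_pfd(double pfd)
    {
        if (pfd < 1e-4) return 4;   /* 10^-5 .. 10^-4 (or better)   */
        if (pfd < 1e-3) return 3;   /* 10^-4 .. 10^-3               */
        if (pfd < 1e-2) return 2;   /* 10^-3 .. 10^-2               */
        if (pfd < 1e-1) return 1;   /* 10^-2 .. 10^-1               */
        return 0;                   /* outside the defined bands    */
    }

    int main(void)
    {
        printf("PFD of 5e-4 corresponds to SIL %d\n", sil_from_pfd(5e-4)); /* SIL 3 */
        return 0;
    }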

Other important industry standards in this context are:

  • IEC 61508 (Functional safety of electrical/electronic/programmable electronic safety-related systems)

In avionics:

  • DO-178B (Software Considerations in Airborne Systems and Equipment Certification) and the
  • DO-254 (Design Assurance Guidance for Airborne Electronic Hardware)

For railway applications:

  • EN 50126 (Railway Applications - The Specification and Demonstration of Reliability, Availability, Maintainability and Safety (RAMS)), the
  • EN 50128 (Railway Applications - Software for Railway Control and Protection Systems) and the
  • EN 50129 (Railway Applications - Safety-Related Electronic Systems for Signalling)

Such systems have to be extremely reliable in any event, which means that the development process itself already has to meet high requirements. Planning the development is very important: DO-178B, for example, stipulates ten different document types in the planning stage alone. The actual development work is done according to the V-model. On system level, requirements are formulated (what has to be done). These requirements are then translated into an architecture (how it shall be done) on system level. From that, requirements for the individual components are derived, and on the basis of these requirements the component architecture is developed. Depending on the complexity of the system there can be several document cascades from system level down to module level. The last step is the actual detailed design and its implementation. These documents form the left side of the "V" in the V-model. The counterpart of the detailed design is the module test, the architecture is mirrored by the integration test, the requirements by the validation, and so on from cascade to cascade up to the highest system level. This part forms the right side of the "V".

Figure 2: Development according to the V-Model

All requirement documents (including the architecture and design documents in addition to the actual requirement specification) have to be written in a clearly defined requirement language (shall, should, may, …). Only the use of such unambiguous expressions guarantees that every requirement can be verified and tested. Formulating clear and verifiable requirements is very difficult and requires a lot of experience in the field. It is important that the global context stays clear: each requirement has to be implemented and tested, and it must be possible to verify the completeness of that chain. Doing this manually is very difficult, so tools based on databases are often used for this task, for example DOORS from Telelogic. These tools make it possible to trace the requirements, the respective test cases and the test reports relatively easily through all documents (requirement tracing). However, the use of these tools does not guarantee that the requirements are well formulated. Operating these tools is complicated, so that often only one person in a design team is able to enter the requirements – which is why they are often not kept up to date.

Apart from requirement tracing, several other development and verification methods must be used. In software and FPGA development, code-rule checking is used: the code itself is checked automatically against certain formal criteria. Code-coverage analysis is a method for proving that every line of code has been executed at least once during testing. Other methods are black-box and white-box tests to ensure that the software meets all requirements without errors. Configuration management is of course needed to be able to trace code changes and their causes. The choice of the tools is very important, as it has to be guaranteed that the editor, the compiler and the linker yield the desired result. Strictly speaking, these tools already have to be produced with the same care and the same amount of time and money. Indeed, such tools are available; they are, for example, included in packages from several software manufacturers (Windriver: Platform for Safety Critical DO-178B, Green Hills Software: Integrity-178B RTOS, LynuxWorks: LynxOS-178). If you use "normal" software, its reliability in operation has to be proven to the approval authority – often a difficult undertaking.
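
As an illustration of the kind of formal criteria such code-rule checkers enforce, the following C fragment is written in the spirit of typical embedded rule sets such as MISRA C; it is a generic sketch and is not taken from any of the packages mentioned. No dynamic memory, a loop with a fixed upper bound, checked pointers and a single exit point also keep the code easy to cover completely in a module test.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define BUF_LEN 8u

    /* Sketch of rule-conforming code: fixed-size buffer instead of malloc(),
     * explicit-width integer types, a bounded loop, checked pointers and a
     * single return point so that every path is easy to reach in a test. */
    bool average(const int16_t samples[BUF_LEN], int16_t *result)
    {
        bool ok = false;
        int32_t sum = 0;

        if ((samples != NULL) && (result != NULL)) {
            for (uint8_t i = 0u; i < BUF_LEN; i++) {   /* bounded loop */
                sum += samples[i];
            }
            *result = (int16_t)(sum / (int32_t)BUF_LEN);
            ok = true;
        }
        return ok;                                      /* single exit point */
    }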

Measures alongside development

Alongside the development itself, risk management is required. Risk management is based on the identification of risks – all risks, that is, technical and financial risks as well as risks related to deadlines or to the personnel involved. The significance of these risks has to be determined in order to decide whether they have to be reviewed in more detail, and the risks have to be evaluated. The aim is to minimize the risks in the development.


Common strategies are: transferring a risk (e.g. outsourcing part of the development to an external expert), avoiding a risk (e.g. using a proven component instead of a new one) or insuring against a risk (e.g. insuring a factory against fire). In most cases the whole development team tries to identify all risks in a brainstorming session. Part of the team (the risk expert team) then carries out the evaluation, the prioritization and the regular (mostly monthly) review and triggers remedial measures if necessary.

Figure 3: Risk Management

Several environmental tests are also carried out. Environmental tests are especially important for safety-critical systems, because these systems have to function error-free even under extreme conditions (extreme temperatures, extreme vibrations etc.). Especially in avionics, HALT tests (HALT = Highly Accelerated Life Test) are required. In normal environmental qualification an electronic system is tested in exactly the conditions it was developed for: if a temperature range of -40 to +85°C is required, qualification is carried out in exactly that range. In a HALT test the procedure is different – these limits are exceeded on purpose. Apart from the temperature, the DUT is exposed to extreme vibrations at the same time, and sometimes the supply voltage is varied too. The purpose of the test is to find the limits of the system. The system is destroyed on purpose, the damage is then assessed, and the construction is improved in situ. This process normally takes three to five days, during which the design engineers have to be present. With HALT, weak points of the system can be detected and a margin with respect to the environmental requirements is gained – and with it, safety.

How safety is evaluated

For safety evaluation, the MTBF value plays an important role. MTBF means "Mean Time Between Failures". Depending on the component affected, an error can be permanent or temporary: if a resistor shows an error, it is in most cases defective; if a memory shows an error, it is possible that just one bit has flipped. The MTBF value is the average time which passes between two errors and describes the reliability of a system. The MTBF value is often confused with the MTBR (Mean Time Between Repair) value. The MTBR value corresponds to the average time between two errors which can only be removed by repair. There is no difference between the MTBF value and the MTBR value of a resistor, whereas the two values of a memory do differ.

The MTBF value is indicated in hours. In order to get more "manageable" values, the lambda or FIT value is used (MTBF = 1/lambda, with the FIT value giving the number of failures per 10⁹ hours). If the lambda values of all components are known, all lambdas can be added to obtain the total lambda and thus the MTBF value of the system. Often it is difficult to get the lambda value of a component from the manufacturer; in these cases the value has to be estimated.
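
As a numerical illustration of MTBF = 1/lambda, the sketch below adds up the failure rates of the individual components, given in FIT (failures per 10⁹ hours), and derives the resulting system MTBF. The component values are hypothetical and only serve as an example.

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical FIT values (failures per 1e9 hours) for the parts of a
     * board; real values come from the manufacturers or must be estimated. */
    static const double component_fit[] = { 12.0, 4.5, 30.0, 0.8, 7.2 };

    int main(void)
    {
        double lambda = 0.0;   /* total failure rate in failures per hour */
        size_t i;

        for (i = 0; i < sizeof(component_fit) / sizeof(component_fit[0]); i++) {
            lambda += component_fit[i] * 1e-9;   /* convert FIT to failures/hour */
        }
        printf("System MTBF = %.0f hours (MTBF = 1/lambda)\n", 1.0 / lambda);
        return 0;
    }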

Furthermore, the FIT value – and thus the reliability – depends on the "stress" the component is exposed to. If a 0.5 W resistor is operated at a power of 0.5 W, its life will naturally not be as long as if it were only operated at 0.1 W. This means that an evaluation can only be carried out with detailed knowledge of the circuit. How the FIT value changes under such conditions is described in several competing standards. The MIL-HDBK-217 is widely known but has been obsolete for more than ten years. Other standards are Telcordia, PRISM and IEC TR 62380; the RDF 2000 (= UTE C 80-810), however, seems to be the standard of the future.

This MTBF value is also needed for the required FMEA (Failure Mode and Effects Analysis). During the FMEA each component is checked from the bottom to the top: it is evaluated whether the component influences the safe operation of the system and, if so, into which state the system switches. For a system with a safe state ("fail-safe") like a train or a machine, two types of faults can be distinguished: faults which lead to a safe state and faults which lead to an unsafe state. Only the dangerous faults have to be considered. When an error is detected, the required remedial measures are taken. Only the errors which are not detected are critical, which means that in order to determine the reliability of a system, i.e. its SIL level, only the FIT values of the undetected dangerous errors have to be added.
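
The bookkeeping described above can be sketched as follows; the failure modes and their rates are hypothetical and only illustrate the principle of adding up the dangerous, undetected faults.

    #include <stdio.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One failure mode from the FMEA: its rate in FIT, whether its effect is
     * dangerous (can lead to an unsafe state) and whether it is detected by a
     * test or diagnosis (a detected fault triggers a remedial measure). */
    struct failure_mode {
        const char *name;
        double fit;
        bool dangerous;
        bool detected;
    };

    int main(void)
    {
        /* Hypothetical entries, for illustration only. */
        const struct failure_mode fmea[] = {
            { "output driver stuck high", 5.0, true,  true  },
            { "RAM bit flip",             8.0, true,  false },
            { "clock oscillator stops",   1.0, false, true  },
            { "shutdown relay welded",    0.5, true,  false },
        };
        double lambda_du = 0.0;   /* dangerous undetected failure rate, in FIT */
        size_t i;

        for (i = 0; i < sizeof(fmea) / sizeof(fmea[0]); i++) {
            if (fmea[i].dangerous && !fmea[i].detected) {
                lambda_du += fmea[i].fit;   /* only these count against the SIL */
            }
        }
        printf("Dangerous undetected failure rate: %.1f FIT\n", lambda_du);
        return 0;
    }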

Especially in avionics, effects like SEU (Single Event Upsets) and MEU (Multiple Event Upsets) have to be taken into account. These are statistical errors which are caused by cosmic radiation. This radiation causes bits to flip in static and dynamic memory structures and in registers. Thus it is possible that a processor unit which might have functioned error-free for several decades at sea level shows an error after just a few hours at an altitude of 60,000 feet. In order to avoid SEU effects, precautions have to be taken (e.g. avoiding RAM where possible) to obtain a higher MTBF value.
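
One common precaution of this kind, shown here only as a hedged illustration (the text above does not prescribe a specific technique), is to store a safety-relevant variable together with its bitwise complement and to verify the pair on every access, so that a flipped bit is at least detected:

    #include <stdint.h>
    #include <stdbool.h>

    /* Illustrative sketch: keep a safety-relevant variable together with its
     * bitwise complement and verify the pair on every read, so that a single
     * bit flip caused by an SEU is detected instead of silently used. */
    struct protected_u32 {
        uint32_t value;
        uint32_t inverse;
    };

    void prot_write(struct protected_u32 *p, uint32_t v)
    {
        p->value   = v;
        p->inverse = ~v;
    }

    bool prot_read(const struct protected_u32 *p, uint32_t *out)
    {
        if ((p->value ^ p->inverse) != UINT32_MAX) {
            return false;            /* corruption detected: go to the safe state */
        }
        *out = p->value;
        return true;
    }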

Redundant Systems

For the evaluation of safety and for the FMEA, knowledge of the system is always required. If the FMEA/MTBF does not yield a sufficient result, measures have to be taken in the hardware development to improve the FIT value. These measures, or the design of the architecture, can only be carried out under consideration of the whole system. At any rate, everything has to be done to detect errors. This error detection can often be carried out in quite a simple way, for example by a person who watches the machine and switches it off when there is danger – however, the MTBF value of a human is not particularly high. In most cases one therefore tries to make the system or its control safe.

One method for detecting errors is redundancy. The system or parts of it are duplicated (for example the CPU unit). Only both computers together can set the system into the critical state. A "voter" carries out the evaluation: if it detects a difference between the two systems, it puts the system into the safe state – it switches it off. This architecture is also called "1oo2", that is, a "1 out of 2" architecture. This method only works in fail-safe systems. In fail-operational systems the redundant part is kept in "hot standby": a faulty system part switches itself off and the other part takes over its function. The system is then still functional, but unsafe. Software is considered safe if it has been programmed according to the rules described above; hardware, as mentioned above, is never fault-free. With a 1oo2 architecture only single faults can be detected. For SIL3 and SIL4, the industry standards prescribe that double faults must also be detected. That is why built-in self-tests (BITE = Built-In Test Equipment) are so important. Often the hardware and software for the self-test are far more complex than the equipment for the function itself.
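
The comparison performed by such a voter can be pictured with the following minimal C sketch (the interface is hypothetical); a disagreement between the two channels is reported so that the system can be switched into the safe state:

    #include <stdint.h>
    #include <stdbool.h>

    /* Minimal 1oo2 voter sketch: the safety-critical output is only driven
     * if both redundant channels agree; any mismatch is reported so that
     * the caller can put the system into the safe state (switch it off). */
    bool vote_1oo2(uint16_t channel_a, uint16_t channel_b, uint16_t *out)
    {
        if (channel_a != channel_b) {
            return false;          /* disagreement: switch off (fail-safe) */
        }
        *out = channel_a;          /* both channels agree: output may be driven */
        return true;
    }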

In avionics a different approach is taken: since there is no safe state, every fault counts and enters the MTBF. A subsystem must nevertheless be able to detect a defect itself (the second computer is only on hot standby) – so the same problems arise.

In redundancy concepts, special attention must be paid to "common cause faults". If, for example, two parallel computers have the same defect in their floating point unit, this error will not be detected. Something similar happens at binary outputs if they are of the same kind: an overvoltage might destroy two redundant outputs at the same time. Other examples of common cause faults are power supply units and often also the cabling. One tries to avoid common cause faults by making the two channels as independent of each other as possible: two redundant CPU boards, for example, are given two separate power supply units, two separate timers etc. The principle can be further improved by diversity. If two different CPU types are used, the same fault cannot occur in both channels at the same time and thus be overlooked.

With a classic redundant system and a voter the error probability can be decreased, as the system switches into the safe state when there is an error. However, availability is decreased at the same time. If availability is important (so important that you are willing to pay money for it), triple systems are often used (2oo3, "2 out of 3"). Here you assume that the single components are safe – i.e. a safe operating system etc. The voter now evaluates three votes and the majority decides: if one component delivers a deviating result, it is switched off. The system stays safe, but availability is reduced again. With one more computer a 2oo4 structure is achieved: when one component fails, three functional components remain, and the system is still both safe and available.

If you want to use standard (i.e. unsafe) operating systems, you can use a modified structure with two tuples. Each tuple in itself is diverse; it is switched off when there is an error, while the second tuple continues operation. This diversely coupled structure is not as available as the 2oo4 structure described above, but under certain circumstances it enables the use of standard operating systems.

Even simpler architectures can be used for very small control systems or controllers, provided that all components are "developed safely". It is possible, for example, to combine a simple CPU with a very good MTBF value with a small intelligent watchdog. This watchdog checks the results of the CPU's self-test and whether these results are delivered on time – so the watchdog expects a certain result at a certain time.
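
The watchdog principle described at the end of the paragraph can be sketched as follows; the expected signature and the time window are purely illustrative:

    #include <stdint.h>
    #include <stdbool.h>

    /* Sketch of an intelligent windowed watchdog (all numbers illustrative):
     * the CPU must deliver the correct self-test signature neither too early
     * nor too late; otherwise the watchdog removes the enable signal and the
     * system falls into the safe state. */
    #define WINDOW_MIN_MS   90u
    #define WINDOW_MAX_MS  110u
    #define EXPECTED_SIG 0x5A3Cu

    bool watchdog_check(uint16_t reported_sig, uint32_t elapsed_ms)
    {
        bool in_window = (elapsed_ms >= WINDOW_MIN_MS) && (elapsed_ms <= WINDOW_MAX_MS);
        bool sig_ok    = (reported_sig == EXPECTED_SIG);

        return in_window && sig_ok;   /* false -> de-energize the output */
    }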

Figure 4: 2oo4 Structure

As described above, software is considered to be fault-free if it has been developed according to the required standards. This becomes difficult when it comes to the operating system. In avionics the only solution is often to make the operating system "safe" as well, which means that the whole operating system has to be tested thoroughly – for that, the software sources, time and money are needed. In other industries other concepts are often accepted as well. The architectural principle of diversity can be transferred to the software, for example: using two different proven operating systems is often a solution, provided that at least the application has been developed according to the applicable rules.

Safety-critical architectures are very complex. They concern not only the hardware and the software themselves but also the whole development process and even the tools used. The architecture concepts for the different industries – railway, avionics, automotive, medical engineering – differ only slightly, but the way of thinking and developing is different in the different markets.
