Using Software Rules To Enhance FPGA Reliability

Chandru Mirchandani

Lockheed Martin

Abstract

A question frequently asked by system designers is: how reliable is my design? But what does this really mean? To a program technical lead it might mean that the system will perform at the required quality level for performance, speed and data integrity over the lifetime of the mission (and then some). This is a very tall order for designers to meet if there are constraints on how much time and money can be spent and on how much of that money can be allocated for experimentation, prototyping and implementation. Software engineers have been tasked with predicting the quality of their software and have done so based on historical information. However, this does not always pan out, because innovative designs do not reuse complex code but instead develop new complex code. Software system engineers have devised novel methods of improving the quality, and hence the 'reliability', of their systems by developing system designs and configurations that minimize failures, maximize fault tolerance and optimize cost. This paper examines a process whereby some of these rules can be extended to reliability and fault tolerance in Field Programmable Gate Arrays (FPGAs), specifically towards minimizing the effect of common cause failures.

Background

Due to its complexity, the reliability of some software does not reach a mature level until the initial faults, due to integration, new code releases, unexpected scenarios, etc., are uncovered and removed from the system. Thus the predicted reliability of the 'new' software, though based on expert opinion and a historical knowledge base, cannot be validated completely in time to ensure that the software will meet the design reliability requirements. If an adaptive model is used, software systems engineers can re-engineer the software based on new test results while the system is being tested; this increases their confidence in the newly developed system and at the same time allows the initial integration and operation region to be bridged faster. Schedule and budget constraints, however, do not always allow the software systems engineers to develop an adaptive model. This paper proposes a method by which an adaptive model can be developed for FPGAs such that the reliability growth of the FPGA subsystem is more apparent at the early stages of system implementation. This will mitigate risks and improve the reliability of the delivered FPGA-based system. The paper describes the process by which common mode failure effects are modeled for FPGA-based systems and outlines the metrics that are used to track and dynamically minimize the effects of common mode failures in such systems.

Definitions and Classifications

For the purpose of clarity, this section explains some of the terms used so that they can be applied interchangeably between the two design paradigms. The historical definition of software reliability is the ability of the software to be failure free over a period of execution time. Thus if a program has been released and is 'operational', i.e. used in the environment and manner for which it was designed, the failure intensity will be constant, the number of failures per unit time will follow a Poisson distribution, and the reliability will follow an exponential distribution:

R(t) = exp(-λ·t)

where λ is the failure intensity and t is the execution time. Thus if the failures are not corrected, the reliability will follow this exponential curve and decrease over execution time. In practice these failures are corrected through frequent software releases that are made to fix the problems perceived to have caused the failures. In this case the failure intensity is not constant but instead decreases in some manner determined by the resources and schedule used to implement the releases. As a result the reliability may still decrease over time, but at a much slower rate.
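As a quick illustration of these two regimes, the following Python sketch compares the reliability curve for a constant failure intensity with one in which corrective releases gradually reduce the failure intensity; the specific λ value and the exponential decay of λ are illustrative assumptions, not values taken from this paper.

```python
import math

def reliability_constant(lam, t):
    """R(t) = exp(-lambda * t) for a constant failure intensity lam
    (failures per hour) over execution time t (hours)."""
    return math.exp(-lam * t)

def reliability_decreasing(lam0, decay, t, steps=1000):
    """Reliability when corrective releases reduce the failure intensity,
    modelled here (illustrative assumption) as lam(t) = lam0 * exp(-decay * t),
    so that R(t) = exp(-integral of lam over [0, t])."""
    dt = t / steps
    cumulative = sum(lam0 * math.exp(-decay * (i + 0.5) * dt) * dt
                     for i in range(steps))
    return math.exp(-cumulative)

if __name__ == "__main__":
    lam0 = 0.05      # assumed: 5 failures per 100 hours at release
    decay = 0.01     # assumed decay rate per hour of execution
    for t in (10, 50, 100):
        print(f"t={t:4d} h  constant: {reliability_constant(lam0, t):.3f}  "
              f"decreasing: {reliability_decreasing(lam0, decay, t):.3f}")
```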

What prompts these failures? The most frequently used explanation is that they are of one of three kinds. The first, which is generally discarded in measuring software quality, is that the error could be caused by bad documentation or incorrect usage. The second source of failures is attributed to the design of the software: the software performs a certain sequence of steps to manipulate the data and produces a result or output data product that is incorrect. This failure is attributed to a design failure, because no matter what the input data, the result will always be wrong. Finally, the third source of failure is attributed to the information being fed into the software being incorrect. This could come from another software system or a hardware component; if the data is from another software component within the same software system, it is attributed to a design failure of the software system. There is another general source of errors, referred to as environment errors, which cause intermittent failures due to unexpected states in the environment. From these definitions it is obvious that a failure causes an interruption, and thus failures occur only when the program is executing the code; hence they are dynamic.

On the other hand, a fault is a defect in the program itself which, when executed under certain conditions, will cause one or more failures. This indicates that faults are primarily static problems in that they are a property of the program and hence of the design. If the conditions that prompt a fault to manifest itself as a failure do not occur, then the fault lies dormant and may remain undiscovered for a very long period of time. There could be many such faults, and the number of faults is a testimony to the design quality of the software and the design expertise of the programmer, respectively.

To extend this design paradigm to an FPGA-based system, one has to ensure that the "software", which in this case is the Very High Speed Integrated Circuit Hardware Description Language (VHDL) code, is fault free. This is not trivial. VHDL programmers use many techniques for testing their code, simulating off-nominal conditions to stress their design. Though this is comparatively easy with compiled code and a software test bed, programming the FPGA device with the compiled code and running tests to verify the design takes time and therefore increases the cost of development and implementation. Legacy systems are very useful in identifying operational scenarios and building reliability into the system. However, this will not completely eliminate the faults or defects in newly developed VHDL code, though it will decrease the failure intensity per unit time. For example, if the failure intensity at the beginning of operations was 5 failures per 100 hours of operation, through judicious corrective techniques and experience the failure intensity could very well drop to 1 failure per 100 hours of operation.

Software Reliability

Software reliability is achieved, maintained and improved over time by utilizing the properties of a distributed architecture with a limited amount of cohesion that would allow rapid recovery from fault conditions.

  • Critical software functions are distributed as redundant instances on multiple processors, thus minimizing the down time due to a processor failure

In addition, this type of distributed architecture introduces some cohesion by building in the capability for the redundant instances of the software applications to:

1. Initially detect, contain and recover from faults as soon as possible, and in the event this is not possible,

2. Allow the control to be passed on to the redundant instance within the reliability and availability requirements levied on the system, and

3. Include language defined mechanisms to detect and prevent the propagation of errors.
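A minimal sketch of this failover behaviour is given below; the RedundantService class, the instance functions and the recovery logic are hypothetical stand-ins used only to illustrate items 1 and 2 of the list above, not the mechanism of any particular system.

```python
class InstanceFailure(Exception):
    """Raised when an instance cannot detect, contain and recover locally."""

class RedundantService:
    """Runs a critical function on a primary instance and, if local recovery
    fails, passes control to a redundant instance (hypothetical illustration
    of items 1 and 2 above)."""

    def __init__(self, instances):
        self.instances = list(instances)   # e.g. one instance per processor

    def execute(self, payload):
        for instance in self.instances:
            try:
                return instance(payload)   # 1. attempt to run / recover locally
            except InstanceFailure:
                continue                   # 2. pass control to the next copy
        raise InstanceFailure("all redundant instances failed")

def primary(payload):
    raise InstanceFailure("simulated fault on primary processor")

def backup(payload):
    return f"processed {payload} on backup processor"

if __name__ == "__main__":
    service = RedundantService([primary, backup])
    print(service.execute("frame-42"))
```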

To increase the fault tolerance of the software system, the software modules or components that make up the system are designed to be independent of one another. The defects that are actual design faults are uncovered during execution. If the execution time model [1] is used, the actual time the code is being executed is recorded, which enables the fault discovery rate during testing and initial deployment to be counted cumulatively.
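A hedged sketch of how execution time and cumulative fault discoveries might be recorded per module is shown below; the FaultLog structure and the simple failure intensity estimate (faults discovered divided by recorded execution time) are illustrative assumptions rather than the specific bookkeeping prescribed by the execution time model of [1].

```python
from dataclasses import dataclass, field

@dataclass
class FaultLog:
    """Records execution time and discovered faults for one module so a
    failure intensity can be estimated (illustrative structure only)."""
    module: str
    execution_hours: float = 0.0
    faults: list = field(default_factory=list)

    def log_execution(self, hours):
        self.execution_hours += hours

    def log_fault(self, description):
        # Remember at what cumulative execution time the fault was found.
        self.faults.append((self.execution_hours, description))

    def failure_intensity(self):
        """Cumulative faults per hour of recorded execution time."""
        if self.execution_hours == 0:
            return float("nan")
        return len(self.faults) / self.execution_hours

if __name__ == "__main__":
    log = FaultLog("reading_task")
    log.log_execution(40.0)
    log.log_fault("parity error on ingest")
    log.log_execution(60.0)
    print(f"{log.module}: {log.failure_intensity():.4f} faults/hour")
```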

FPGA Design Reliability

An FPGA-based design is, by virtue of its physical environment, a combination of hardware and software, with one extension: it is also dependent on the process. The reliability of the hardware component of the FPGA has been discussed at length both in this forum and many others. The purpose of this discussion is to address the 'software' aspect of the FPGA, namely the programmed component of the design, and to ensure that the 'software' design is optimized to enhance the operational reliability of the FPGA design.

The conceptual design, which is FPGA-based, ingests data and performs a sequence of steps before it provides an output product. The design is composed of a number of tasks, similar to software modules in a software system. Consider the first task, which takes in a stream of data at a rate of N bytes/second. This data has to pass through n tasks before it is output as another stream of data in a form that the next system element can recognize. This means that if the next system element is a video card, the image is shown on the screen, whereas if the next system element is a software program that measures cloud cover, the output from that element will be text and numbers specifying the percentage of cover, the type of cloud, etc.

One option would be to assign scores based on the time of completion for each of these tasks, to arrive at a figure of merit that can be compared as a performance measure. However, this would not be a true figure of merit, since all tasks within the same class may not have the same score. For example, the task of pre-processing the data from a serial to a parallel stream may have a score different from the pre-processing task of decoding the identified data stream. Within each class we have to assign a factor that takes into account the number of steps needed to achieve the result.

Assign a time of completion to each task class as follows: t_r is the time of completion of the 'Reading' task, t_p is the time for the 'Parsing' task, and so on. The next step is to introduce a weighting scheme to take into account the variation within the class based on the number of steps needed to perform the task. For example, it may take x_r1 clock cycles to read in a unit of data before the task can start, then x_r2 clock cycles for the next element of the task to start, and so on. Let us assume that each sub-task is timed using clock cycles; this normalizes the measurement across all tasks within a class, and across all classes. Thus a single 'Reading' task has a total of x_ri clock cycles or steps. Each of these clock cycles takes s_r units of time to complete, and the single 'Reading' task time is given by:

Task Time = sr.xri units of time

Table 1. Task Times

Task Class / Steps / Step Time (s_task) / Task Time / Total Task Time (t_task)
Reading r / x_ri / s_r / s_r·x_ri / (s_r·x_ri)·n_r = t_r
Parsing p / x_pi / s_p / s_p·x_pi / (s_p·x_pi)·n_p = t_p
Pre-processing p1 / x_p1i / s_p1 / s_p1·x_p1i / (s_p1·x_p1i)·n_p1 = t_p1
Monitoring M / x_Mi / s_M / s_M·x_Mi / (s_M·x_Mi)·n_M = t_M
Sorting s / x_si / s_s / s_s·x_si / (s_s·x_si)·n_s = t_s
Processing P / x_Pi / s_P / s_P·x_Pi / (s_P·x_Pi)·n_P = t_P
Post-processing p2 / x_p2i / s_p2 / s_p2·x_p2i / (s_p2·x_p2i)·n_p2 = t_p2
Status-gathering S / x_Si / s_S / s_S·x_Si / (s_S·x_Si)·n_S = t_S
Writing w / x_wi / s_w / s_w·x_wi / (s_w·x_wi)·n_w = t_w

If in the performance of one scenario the reading task has to go through n_r such instances, the total 'Reading' task time for the system to produce a single output product is:

tr = (sr.xri).nr

This allows us to calculate an execution time for each 'programmed' module, such that the faults or errors that do occur can be tracked per module to provide an 'error' or failure intensity. For the sake of discussion this paper selects the first three task classes as the full functionality of the FPGA and develops a model that lends itself to building reliability into an FPGA based on software reliability rules. The objective is to attain a reliability over an execution time of 100 hours, for the subsystem implemented in the FPGA, of greater than or equal to 0.94. On average, the task time for each of the first three task classes can be evaluated in a manner similar to the process shown in Table 1. Assume that the percentage of time taken by each of the three tasks was calculated as 0.3, 0.3 and 0.4 for the Reading, Parsing and Pre-Processing tasks respectively.
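The sketch below works through this bookkeeping for the three selected task classes. The clock-cycle counts, step times and instance counts are placeholder values (chosen so that the execution fractions come out to the 0.3/0.3/0.4 split quoted above); only the structure of the calculation, Task Time = s·x and total time = (s·x)·n normalised to a fraction of the scenario time, follows Table 1.

```python
# Illustrative task-time bookkeeping following Table 1.
# All numeric values below are placeholders, not data from the paper.
tasks = {
    #                 steps x_i,  step time s (s/clock),  instances n
    "reading":        {"x": 64,  "s": 10e-9, "n": 3_000},
    "parsing":        {"x": 96,  "s": 10e-9, "n": 2_000},
    "pre_processing": {"x": 256, "s": 10e-9, "n": 1_000},
}

def total_task_time(task):
    """t = (s * x) * n : single-task time scaled by the number of instances."""
    return task["s"] * task["x"] * task["n"]

totals = {name: total_task_time(t) for name, t in tasks.items()}
scenario_time = sum(totals.values())

for name, t in totals.items():
    print(f"{name:15s} t = {t*1e3:.3f} ms  fraction e = {t / scenario_time:.2f}")
```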


Figure 1. FPGA System - Reading, Parsing and Pre-Processing Tasks

The basic premise of the model is the comparison of a system on an FPGA to a software system running on a processor. The common elements of the two systems, namely the workstation, are not used in the model, and the hardware component of the FPGA-based system is comprised of the hardware gates on the device. The reliability allocated to the device is assumed to be comparable with the hardware component of a traditional software-based system. Figure 1 depicts a conceptual configuration of the FPGA-based software system.

Each of the tasks or functions is redundant, but the type of redundancy and the corresponding parameters are modeled as shown in Figure 2. For example, the hardware component of the Reading subsystem (task) is in series with the software component of the subsystem, and this series path is configured redundantly. The parallel configuration is in series with a block that represents the common cause failures of the Reading subsystem.

[1 - {1 - exp(-(1-γ_h)·λ_shwi·t) · exp(-(1-γ_s)·λ_sswi·t)}^2] · exp(-γ_h·u_h·λ_hwi·t) · exp(-γ_s·u_s·λ_swi·t)

Figure 2. FPGA System - Reading Subsystem Reliability Block Diagram

The reliability of the system is analysed for the allocated and defined parameters shown in Table 2. These parameters are then used to calculate the Failure Intensity and Common Cause relationships, and subsequently evaluate the predicted reliabilities of the Reliability Block Diagram.

Failure Intensity (HW): λ_shwi = λ_hwi · u_h · (1 - γ_h)

Failure Intensity (SW): λ_sswi = λ_swi · u_s · (1 - γ_s)

Common Cause: λ_hwi · u_h · γ_h and λ_swi · u_s · γ_s
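These relationships translate directly into code; the helper below (the function name and structure are my own) splits a task's raw failure intensity into its independent and common cause parts, and the example call uses the Reading task values from Table 3 with an assumed common cause fraction of 0.01.

```python
def split_failure_intensity(lam, usage, gamma):
    """Split a task's raw failure intensity into an independent part,
    lam * u * (1 - gamma), and a common cause part, lam * u * gamma,
    following the relationships above."""
    independent = lam * usage * (1.0 - gamma)
    common_cause = lam * usage * gamma
    return independent, common_cause

# Example: hardware side of the Reading task (lambda = 0.3, u_h = 0.3 from
# Table 3) with an assumed gamma_h = 0.01.
lam_shw, ccf_hw = split_failure_intensity(lam=0.3, usage=0.3, gamma=0.01)
print(f"independent: {lam_shw:.4f}  common cause: {ccf_hw:.4f}")  # 0.0891, 0.0009
```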

Table 2. Definitions

Calendar Time - τ / Mission time over which the reliability is calculated
Execution - e_i / Percentage of the mission time used by the task (or subsystem)
Execution Time - t / e_i · τ
Usage for SW - u_s / Percentage of the total software used by the task
Usage for HW - u_h / Percentage of the area of the active portion of the device used by the task
λ_shwi / Failure intensity of Task i hardware with respect to execution time
λ_sswi / Failure intensity of Task i software with respect to execution time
γ_hi / Fraction of Task i hardware failures that are common cause failures
γ_si / Fraction of Task i software failures that are common cause failures

The overall Reliability of the Subsystem i is the AND-OR combination of the Independent Failure Reliability Blocks with the Common Cause Failure Blocks of the Task i hardware and software. The AND combination is the parallel combination of the Task i hardware and software, which is OR-ed with the Common Cause Failures of Task i hardware and software.

AND: [1 - {1 - exp(-(1-γ_h)·λ_shwi·t) · exp(-(1-γ_s)·λ_sswi·t)}^2]

AND-OR: [1 - {1 - exp(-(1-γ_h)·λ_shwi·t) · exp(-(1-γ_s)·λ_sswi·t)}^2] · exp(-γ_h·u_h·λ_hwi·t) · exp(-γ_s·u_s·λ_swi·t)

Thus the overall Subsystem i Reliability is given as:

R_SSi = [1 - {1 - exp(-(1-γ_h)·λ_shwi·t) · exp(-(1-γ_s)·λ_sswi·t)}^2] · exp(-γ_h·u_h·λ_hwi·t) · exp(-γ_s·u_s·λ_swi·t)

The FPGA-based System Reliability is given as:

R_S = Π_i R_SSi
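An end-to-end sketch of the model is given below. It implements R_SSi and R_S exactly as written above, using the Table 3 failure intensity and usage parameters and the Table 4 common cause fractions; however, the interpretation of the remaining parameters, in particular taking t = e_i·τ with τ normalised to a single mission and applying one (γ_h, γ_s) pair across all three tasks, is an assumption made here, so the values it prints are not claimed to reproduce Table 5.

```python
import math

def subsystem_reliability(lam_hw, lam_sw, u_h, u_s, gamma_h, gamma_s, t):
    """R_SSi: the redundant (parallel) combination of the independent HW+SW
    path, placed in series with the common cause failure block, as in the
    AND-OR expression above."""
    lam_shw = lam_hw * u_h * (1.0 - gamma_h)        # independent HW intensity
    lam_ssw = lam_sw * u_s * (1.0 - gamma_s)        # independent SW intensity
    path = math.exp(-(1.0 - gamma_h) * lam_shw * t) * \
           math.exp(-(1.0 - gamma_s) * lam_ssw * t)
    independent = 1.0 - (1.0 - path) ** 2           # two redundant paths
    common_cause = math.exp(-gamma_h * u_h * lam_hw * t) * \
                   math.exp(-gamma_s * u_s * lam_sw * t)
    return independent * common_cause

# Table 3 parameters; t = e_i * tau with tau normalised to 1 (assumption).
tasks = {
    "reading":        dict(lam_hw=0.3, lam_sw=0.3, u_h=0.3, u_s=0.3, e=0.2),
    "parsing":        dict(lam_hw=0.4, lam_sw=0.4, u_h=0.4, u_s=0.3, e=0.1),
    "pre_processing": dict(lam_hw=0.3, lam_sw=0.3, u_h=0.3, u_s=0.4, e=0.7),
}
# Table 4 common cause fractions for the four configurations.
configurations = {
    "same code, same device":  (0.01,   1.0),
    "same code, diff devices": (0.0025, 0.9975),
    "diff code, same device":  (0.01,   0.5),
    "diff code, diff devices": (0.0025, 0.1),
}

for name, (gamma_h, gamma_s) in configurations.items():
    r_system = 1.0
    for p in tasks.values():
        r_system *= subsystem_reliability(p["lam_hw"], p["lam_sw"],
                                          p["u_h"], p["u_s"],
                                          gamma_h, gamma_s,
                                          t=p["e"])
    print(f"{name:25s} R_S = {r_system:.4f}")
```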

The following tables show the values used as parameters for the Reliability Block Diagrams and for calculating overall reliability per Subsystem (or Task) and the overall System Reliability of the FPGA-based System.

Table 3. Failure Intensity and Usage Parameters for Systems Tasks

Parameter / Reading / Parsing / Pre-Processing
Usage SW - u_s / 0.3 / 0.3 / 0.4
Usage HW - u_h / 0.3 / 0.4 / 0.3
λ_hwi / 0.3 / 0.4 / 0.3
λ_swi / 0.3 / 0.4 / 0.3
Execution - e_i / 0.2 / 0.1 / 0.7

The model is analysed for four different alternatives:

Table 4. Fraction of Failures due to Common Cause Failures

Configuration / HW Common Cause Fraction (γ_h) / SW Common Cause Fraction (γ_s)
Two copies of the same code on the same device / 0.01 / 1
Same code, on different devices / 0.0025 / 0.9975
Different codes for the same task on the same device / 0.01 / 0.5
Different codes for the same task on different devices / 0.0025 / 0.1

The fraction of failures due to common cause varies depending on the configuration and design of the system. When the redundancy is built into the same device, the fraction of common cause failures is higher than when the redundant copies are placed on different devices.

Figure 3. FPGA System – Overall System Reliability Block Diagram

Table 5. Overall System Reliability

Option / Configuration / FPGA-based System Reliability
1 / Same Code, Same Devices / 0.895726564
2 / Same Code, Diff Devices / 0.895973815
3 / Diff Code, Same Devices / 0.944752579
4 / Diff Code, Diff Devices / 0.98356125

The common cause failure fraction of the software failure intensity is lowest when the software code is different on different devices. The overall FPGA-based configuration shown in Figure 3 is analysed and the results are shown in Table 5.