Technical Report (P8) Diverse Fault Management – a comment and prestudy of industrial practice
Jon Arvid Børretzen
Department of Computer and Information Science, Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway
Abstract: This report describes our experiences with fault reports and their processing in several organizations. Data from the investigated projects is presented to show the diversity of, and at times lack of, information in the reports used. We also show that although useful process information is readily available, it is seldom used or analyzed with process improvement in mind. An important challenge is to explain to practitioners why a standard description of faults could be advantageous, and to propose better use of the knowledge gained about faults. The main contribution is to explain why more effort should be put into codifying fault reports, and how this information can be used to improve the software development process.
1. Introduction
In all software development organizations there is a need for some minimum of fault logging and follow-up to respond to faults discovered during development and testing, as well as to claimed fault reports (really failure reports) from customers and field use. Such reports typically contain fault attributes that are used to describe, classify, analyze, decide on, and correct faults. There are many standards for this kind of information, although the original fault reports tend to be more ad hoc in character than compliant with a specified standard. In addition to a fault report scheme, a software development organization will have defined its own customized metrics and processes related to fault management.
Systematic fault management is often also motivated by certification efforts, for instance ISO 9000. Software Process Improvement (SPI) and Quality Assurance (QA) initiatives can also be a motivation for fault management improvement work.
Despite this, most organizations still have much underused or even unused data, either from lack of knowledge about the subject or from lack of procedures to assist in using the available data. As Jørgensen et al. state, “no data is better than unused data” [Jør98]. Collecting data that is not used leads to wasted effort during data collection, poor data quality, and possibly even a negative attitude towards any kind of measurement during development or other SPI and QA activities.
In this investigative pre-study, we report earlier experience from case studies and data mining in 8 Norwegian IT organizations as well as an Open Source Software community, where fault data has been under-reported and/or under-analyzed. That is, we have seen poor or wrongly coded classification of faults, missing information about the affected program module, no effort registration, and so on. There is also the issue of fragmented data representation, where fault information is spread over partial fault reports, Software Configuration Management logs, or merely comments in the code. This report describes our experiences from working with fault reports and fault reporting in several different organizations. Results from these studies have been published in papers such as [Bor06, Bor07, Con99, Moh04, Moh06], but there is also a need to describe what we have learned from these studies in a descriptive manner.
In this field of study, different terminology is used in different sources. In this report, we use the term fault with the same meaning as bug or defect; that is, a fault is a passive flaw in the system that may lead to an observable failure (a deviation from the requirements) when the faulty code is executed. Other terms used for fault report are problem report and trouble report.
2. Metrics
Our studies have given us insight into the practice and the information available in fault report repositories in several commercial organizations. This section presents these organizations and some attributes of their repositories. Such information gives a quick insight into how fault reporting is performed and what possibilities are available in terms of analysis and process improvement.
Table 1 shows an overview of the 8+1 involved organizations. Because of nondisclosure agreements with some organizations, their identities have been anonymized (O1-O8). We compare with the Open Source organization Gentoo [Gen].
Table 1. Organization information
For each of these organizations, we have studied and analyzed fault reports in one or more development projects. From this, we have selected some relevant fault report attributes and report the situation for each of the organizations. The attributes are the following (a sketch of a fault report record covering these fields is given at the end of this section):
• Fault report description: Whether the initial description of the fault is long or short; this indicates how well the fault was described when it was found.
• Fault severity: How many levels of fault severity does the organization use to discern their faults?
• Fault type categorization: Does the organization classify faults according to type?
• Fault location: Does the organization describe where the fault is located, either structurally (i.e. which component) or functionally (what user function the fault relates to)?
• Release version of fault: Does the organization register in which release of the software the fault was found?
• Correction log: Does the organization keep a correction log for each fault, where developers can enter information relevant to the identification of fault cause and correction?
• Solution description: Does the organization record what the solution of the problem was and how the fault was corrected?
• Correction effort: Is information recorded about the effort needed to find and correct the fault?
• Mandatory completion: Are all fault report entry fields mandatory for completion?
• Specialized fault report system or change reports: Is the fault reporting system a separate entity, or is it used in combination with all change reports?
• Standard or custom fault reporting system: Does the organization use a standard available fault reporting system, or do they use a custom made system?
These attributes are shown for each organization in Table 2.
Table 2 shows that there is a wide range in the information used in the fault reports of these organizations. For instance, the amount of information recorded ranges from well-described faults with a correction log and solution description, to cases where the faults are scantily described and the only information about correction or solution is whether the fault has been solved or not.
Table 2. Fault report attributes for each organization
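To make the attribute set above concrete, the sketch below shows one possible way of representing a single fault report as a record. This is purely illustrative; the field names and types are our own assumptions and do not correspond to any particular organization's tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FaultReport:
    """Hypothetical fault report record covering the attributes discussed above."""
    report_id: str
    description: str                        # initial fault description (long or short)
    severity: str                           # one of the organization's severity levels
    fault_type: Optional[str] = None        # type categorization, if used
    location: Optional[str] = None          # structural or functional location
    release_version: Optional[str] = None   # release in which the fault was found
    correction_log: List[str] = field(default_factory=list)  # notes made during correction
    solution_description: Optional[str] = None
    correction_effort_hours: Optional[float] = None
```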
3. Process
Using fault report information to support process improvement can be a viable approach to certain parts of software process improvement [Gra92]. Some organizations have used this approach actively, while others have not. For the most part, the organizations had done little work in this area until external researchers started studying (by data mining) their fault report repositories. For each organization, we describe the level of fault report use beyond fault correction, i.e. whether any analysis work has been performed by the organization itself, followed by what has been performed by us as researchers. This is shown in Table 3. As can be seen, as external researchers we have been able to exploit the available data in the companies to a much larger degree than the organizations themselves.
Table 3. Level of fault report work and external research
One example of research results is from organization O1, where one conclusion of the work performed by the external researchers was that performing software inspections was cost-effective for that organization. Inspections found about 70% of the recorded defects, but cost only 6-9% of the effort compared with testing. This yielded a saving of 21-34%. In addition, this study showed that the existing inspection method was based on inspections that were too short. By increasing the length of inspections, a large amount of effort could be saved compared to the effort needed in testing. Figure 1 shows that by slowing the inspection rate down from 8 pages/hour to 5 pages/hour, they could find almost twice as many faults. Calculations showed that by spending 200 extra analysis hours and 1250 more inspection hours, they could save approximately 8000 test hours.
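As a back-of-the-envelope check, the hour figures quoted above for O1 can be combined as follows (only the three quoted figures are taken from the study; the variable names are ours):

```python
# Figures quoted above for organization O1.
extra_analysis_hours = 200
extra_inspection_hours = 1250
saved_test_hours = 8000

extra_effort = extra_analysis_hours + extra_inspection_hours  # 1450 hours invested
net_saving = saved_test_hours - extra_effort                  # roughly 6550 hours saved overall
print(f"Invested {extra_effort} hours, saved {saved_test_hours} test hours, "
      f"net saving about {net_saving} hours")
```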
In O7, the external researchers concluded through analysis of fault reports and fault types that the organization’s development process had definite weaknesses in the specification and design phases, as a large percentage of the faults found during system testing were of types that originate mainly in these early phases. Additionally, this external research led the organization to alter the way they classified faults in a pilot project in order to study these issues further.
Figure 1. Inspection rates/defects in organization O1.
Another result we have drawn from several of the studies is that the data material is not always well suited for analysis, mostly because of missing, incorrect or ambiguous data. It is apparent that since an organization generally does not use this data after recording it, the motivation for recording correct data is low. In O3, for instance, 97% of the fault reports were classed as “medium” severity faults. This was the default severity when recording a fault, and it was rarely altered even if the fault was actually of a more severe character.
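Skew of this kind is easy to detect once the reports are available in machine-readable form. The sketch below assumes the fault reports can be loaded as a list of dictionaries with a 'severity' field; this layout is our assumption, not a description of O3's actual system.

```python
from collections import Counter

def severity_distribution(fault_reports):
    """Print how often each severity value occurs and flag suspicious skew.

    `fault_reports` is assumed to be an iterable of dicts with a 'severity'
    key. A single value covering nearly all reports is a hint that a default
    severity is being left unchanged.
    """
    counts = Counter(r.get("severity", "unknown") for r in fault_reports)
    total = sum(counts.values())
    if not total:
        return counts
    for severity, n in counts.most_common():
        share = n / total
        flag = "  <-- possible unchanged default" if share > 0.9 else ""
        print(f"{severity:<10} {n:>6} ({share:6.1%}){flag}")
    return counts
```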
There are some interesting fault report attributes that are not in wide use, even though the information is most likely available. Such information could be very useful in process improvement initiatives, and the cost of collecting and analyzing this data is marginal. Some examples of such attributes are the following:
• Fault location: This attribute addresses where the fault is located, either as a functional or a structural location. When the functional location of a fault is reported, this is mainly from the view of users or testers; it tells us in which function or functional part of the system the fault was discovered through a failure. In the case of a structural location, the fault report points to a place (or several) in the code, an interface, or a component where the fault has been found. For analysis purposes, the structural location is often the more useful information.
• Fault injection/discovery phase: The fault injection phase describes when in the development process the fault was introduced. Some faults are injected in the specification phase, but most faults are introduced during design and implementation. Faults can even be introduced during testing, if test preparation is counted as part of the system implementation. The fault discovery phase describes in which phase the fault was discovered, and the gap between injection and discovery should preferably be as small as possible, because the longer a fault is present in the system, the more effort it will take to remove it (see the sketch after this list).
• Fault cost (effort): This shows how much effort has gone into finding and/or correcting a fault. Such information shows how expensive a fault has been for a project, and may be an indication of fault complexity or of areas where a project needs to improve its knowledge or work process.
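As a small illustration of how the injection/discovery gap could be tracked if both phases were recorded, consider the sketch below; the phase names and their ordering are our assumptions, chosen only for illustration.

```python
# Assumed ordering of development phases (illustrative only).
PHASES = ["specification", "design", "implementation", "testing", "field use"]

def phase_gap(injection_phase, discovery_phase):
    """Number of phases between fault injection and discovery (0 = same phase)."""
    return PHASES.index(discovery_phase) - PHASES.index(injection_phase)

# Example: a design fault that is not discovered until system testing
# has survived two intermediate phases.
assert phase_gap("design", "testing") == 2
```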
By introducing and implementing a core set of fault data attributes (i.e. a metric) to be recorded and analyzed, a common process for fault reporting could be established. Several schemes for recording and classifying faults already exist, such as the Orthogonal Defect Classification scheme [Chi92] and the IEEE 1044 standard [IEEE 1044]. A core process could then be customized for organizations that want a broader approach to analysis of fault reports. Some organizations use custom-made tools for fault reporting, but a great many use standard commercial or open source tools. Introducing a core set of fault report attributes in these tools would encourage organizations to record the most useful information, which can then be used as a basis for process improvement. Many tools already have functionality for analyzing the data sets they contain.
4. Use of data repositories
By industrial data repositories, we mean the contents of defect reporting systems, source control systems, or any other data repository containing information about a software product or a software project. This is data gathered during the lifetime of a product or project, and it may or may not be part of a measurement program. Some of this data is stored in databases that have facilities for search or mining, while other data is not. Zelkowitz and Wallace define examining data from completed projects as a type of historical study [Zel98]. As the fields of Software Process Improvement (SPI) and empirical research have matured, these communities have increasingly focused on gathering data consciously and according to defined goals. This is best reflected in the Goal-Question-Metric (GQM) paradigm first developed by Basili [Bas94]. GQM explicitly states that data collection should proceed in a top-down way (i.e. defining research goals and process before examining data) rather than a bottom-up way (i.e. defining research goals and process after seeing what data is available). However, some reasons why bottom-up studies are useful are the following (taken from [Moh04]):
1 There is a gap between the state of the art (best theories) and the state of the practice (current practices). Therefore, most data gathered in companies’ repositories is not defined and collected following the GQM paradigm.
2 Many projects have been running for a while without an improvement program and may later want to start one. Such projects want to assess the usefulness of the data that has already been collected and to relate that data to goals (reverse GQM).
3 Even if a company has a measurement program with defined goals and metrics, such programs can be improved through bottom-up studies.
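To make the GQM structure mentioned above concrete, a minimal example applied to fault data could look like the following; the goal, questions and metrics are our own illustration and are not taken from any of the studied organizations.

```python
# Illustrative Goal-Question-Metric breakdown for fault data (made-up example).
gqm_example = {
    "goal": "Reduce the number of severe faults reaching system testing",
    "questions": {
        "Which fault types dominate in system testing?":
            ["number of faults per fault type, per test phase"],
        "In which phases are severe faults injected?":
            ["number of faults per injection phase, grouped by severity"],
        "What do severe faults cost to correct?":
            ["correction effort (hours) per fault, grouped by severity"],
    },
}
```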
Another issue with data repositories is the ease with which data can be extracted for analysis. One example is from O1, where the researchers had to expend a great deal of effort to convert the fault data into a form that could be analyzed. In O3, the fault reports could only be accessed for analysis by printing hardcopies of the reports, which in turn had to be scanned and converted into analyzable data. To support process analysis in an efficient manner, fault repositories should be kept available in a standard and well-maintained form.
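By contrast, when the reports live in an ordinary database, extraction for analysis can be a few lines of code. The sketch below assumes an SQLite database with a table named faults and the listed columns; both the database layout and the column names are assumptions made for illustration.

```python
import csv
import sqlite3

def export_faults_to_csv(db_path, csv_path):
    """Dump selected fault report fields to a CSV file for further analysis."""
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(
            "SELECT report_id, severity, fault_type, location, correction_effort_hours "
            "FROM faults"
        )
        header = [col[0] for col in cur.description]
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(cur)
    finally:
        con.close()
```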
5. Discussion and conclusion
We have presented an overview of the studies performed concerning fault reports, and shown what types of information exist in, and are lacking from, such reports.
What we have learned from the studies of the fault report repositories of these organizations is that the data is in some cases under-reported, and in most cases under-analyzed. By including some of the information that the organizations already have, more focused analyses could be made possible. For instance, specific information about fault location and fault correction effort is generally not reported, even though it is easy to register. One possibility is to introduce a standard for fault reporting, where the most important and useful fault information is mandatory. A reasonable approach to improving fault reporting, and to using fault reports as a support for process improvement, is to start by being pragmatic: first use the readily available data that has already been collected, and over time adjust the amount and type of data collected during development and testing to tune this process. We have also learned that the effort spent by external researchers to produce useful results from the available data is quite small compared to the collective effort spent by developers recording this data. This shows that very little effort may give substantial effects for many software development organizations.