Factoring in a priori classifier performance into decision fusion
Kai Goebel[*]a, Shreesh P. Mysore[**]b
aGE Corporate Research & Development; bDept. of Control and Dyn. Systems, Cal. Inst. of Tech.
Abstract
In this paper we present methods to enhance the classification rate in decision fusion with partially redundant information by manipulating the input to the fusion scheme using a priori performance information. Intuitively, it seems to make sense to trust a more reliable tool more than a less reliable one without discounting the less reliable one completely. For a multi-class classifier, the reliability per class must be considered. In addition, complete ignorance for any given class must also be factored into the fusion process to ensure that all faults are equally well represented. However, overly trusting the best classifier will not permit the fusion tool to achieve results that rate beyond the best classifiers performance. Moreover, where several classes are considered and where the performance of the classifiers varies across the classes, additional constraints must be considered. We assume that the performance of classifiers to be fused is known, and show how to take advantage of this information. In particular, we glean pertinent performance information from the classifier confusion matrices and their cousin, the relevance matrix. We further demonstrate how to integrate a priori performance information within an hierarchical fusion architecture. We investigate several schemes for these operations and discuss the advantages and disadvantages of each. We then apply the concepts introduced to the diagnostic realm where we aggregate the output of several different diagnostic tools. We present results motivated from diagnosing on-board faults in aircraft engines
Keywords: Classification; Diagnostics; Information Fusion; Decision Fusion; A priori Information; Confusion Matrix
1. Introduction
To satisfy the need for high classification performance or a need for increased class coverage, different classification tools are sometimes developed either in parallel or sequentially. Often times, it is difficult or impossible for any one given classifier to deal with classes of interest at the desired level of accuracy. This motivates the parallel use of several classifiers. In addition, other classifiers are developed to be able to overcome expansion limitations of existing tools and the lack of adaptability to system changes and environmental changes. While the resulting patchwork approach might achieve optimization at a particular local level, it might also cause new problems due to the inevitable introduction of conflicting information. However, there is a potential benefits gained by taking a system-level view. This system-level scheme gathers and combines the results of different classification tools to maximize the advantages of each one while at the same time minimizing the disadvantages. Such a fusion scheme holds the promise to deliver a result that is better than the best result possible by any one tool employed. In part this can be accomplished because redundant information is available, which when combined correctly improves the estimate of the better tool and compensates for the shortcomings of the less capable tool. In addition, there is a gamut of secondary information that can potentially be folded into the fusion scheme to boost the overall classification performance. However, there is no substitute for a good classification tool and, ordinarily, multiple, marginal-performance tools do not necessarily combine to produce an improved result and in fact may worsen the outcome1.
There are several traditional approaches that deal with aggregation of information. Weighted averaging attempts to compensate for poor tool decisions by smoothing out the inferior performers. However, the trade off is that good information succumbs to the bad information in the process and a particular tool’s superior performance for some classes is sacrificed. In voting schemes, the tools decide jointly on the final output through a majority vote but encounter similar problems as the weighted averaging because several poor performers can always outvote a good tool. Bagging and boosting2 try to address some of those problems. Pattern recognition approaches such as neural networks can be employed to recognize patterns of behavior that may lead to correct decisions. However, if the input to the tools is not available to the fusion tool and the output pattern looks similar for different input scenarios, the neural network fusion may not perform satisfactorily. Dempster-Shafer reasoning is widely used for fusion tasks where several information sources are aggregated using Dempster’s rule of combination. However, it is imperative to properly fix the meaning of the underlying belief functions because the suitability of the rule depends on the correct context3. Model-based approaches consist of a sequence of steps for validation and conflict resolution, among others. The method shown by Nelson and Mason4 uses multiple models of known (or suspected) behavioral patterns to establish degrees of compatibility between data and hypothesis. It enforces preferences over the set of candidates (by removing candidates that violate these preferences), and iterates through a cycle of merging and deleting within a set of associated hypotheses for that conflict. Sequential and parallel multi-layered configurations5 employ a number of diagnostic tools in a sequential and parallel fashion for the refinement of the initial class found utilizing a priori probabilistic performance information about the tools which is used to calculate an error probability. The individual classifiers have the current input pattern as well as the class index of the preceding layer as input variables. A fuzzy fusion schemes described in Loskiewicz-Buczak and Uhrig6 utilizes the generalized mean and an -cut. The fusion scheme fuses the first two sensors, defines the confidence of the fused decision, and then continues to fuse additional sensors. If the confidence drops, the step is reversed. Finally, an -cut (depending on the particular class) determines the exact class. Rao7 discusses methods that provide performance guarantees based on finite samples from a statistical perspective. These approaches are constrained by performance bounds which could be improved by suitable incorporating application specific details.
We address the overall fusion problem using a multi-layered solution approach that focuses on the incorporation of a priori and external information in addition to the information provided by the individual classifiers. The approach specifically avoids statistics based approaches. Rather, it is a compilation of heuristics gleaned from expert reasoning. This method presented breaks the problem down into the three major components pre-processing, analysis, and post-processing. Each component contains several sub-modules.
2. Information used in Fusion Schemes
2.1 A priori Information
The fusion tool makes use of the output coming from the classifiers, non-classification information sources, and a priori tool performance information. The latter corresponds to information that is attainable through experiments or simulations.
Table 1: Confusion Matrix used as Input for both Design and IFM run-time version
/ 0.833 / 0.023 / 0.039 / 0.035 / 0.035 / 0.013 / 0.023/ 0.258 / 0.696 / 0.019 / 0.012 / 0.005 / 0.005 / 0.005
/ 0.313 / 0.011 / 0.582 / 0.029 / 0.027 / 0.014 / 0.024
/ 0.325 / 0.010 / 0.029 / 0.573 / 0.052 / 0.007 / 0.004
/ 0.382 / 0.007 / 0.027 / 0.041 / 0.496 / 0.007 / 0.041
/ 0.094 / 0.001 / 0.013 / 0.005 / 0.012 / 0.848 / 0.028
/ 0.234 / 0.007 / 0.032 / 0.004 / 0.058 / 0.026 / 0.640
The confusion matrix of the classification tools is the primary source of a priori information for the information fusion scheme described herein. The confusion matrix is a performance measure for the individual classification tools. It lists the observed classes versus the estimated classes. Because all classes are enumerated, it is possible to obtain information not only about the correctly classified states, but also about the false positives (FP), false negatives (FN), and false classified (FC) states. In our representation of the confusion matrix, the rows list the actual classes, the columns list the estimated classes. Note that the class C0 represents the normal (“null”) condition. The diagonal entries of the confusion matrix represent the correctly classified cases. The first row – except the first entry – contains the FP. The first column – except the first entry – contains the FN. The off-diagonal elements – except the FP and FN – are the FC. Table 1 shows the normalized confusion matrix for a classification tool where the result was divided by the number of experiments for each class. The classes are denoted as Cn where n={0, … , 6}.
2.2 Fusion Input
Primary input to the information fusion is the output of the classifiers. The information fusion tool is built on the premise that it can utilize information that led to the classification. In other words, it will not only consider the final fault assignment but also the underlying relevant fault strength. Depending on the diagnostic tool employed this can be a distance measure (for example for a k-means classifier), probability (for example for a Bayesian Belief Net), weight (for example for a neural net), membership (for example for a fuzzy knn), etc. This individual assignment criterion is then scaled between zero and one using an appropriate classifier specific non-linear function. The implicit interpretation is that a level closer to one means that the fault is increasingly more likely while a confidence level less than 0.5 is increasingly not likely. Thereby we avoid the step of needing a parametric model for fusing heterogeneous data8 and instead impose this task on the designer of the diagnostic tools who has to provide the mapping from diagnostic output to confidence level.
Other system information not diagnostic in nature (that can be used to support a diagnostic opinion) is also provided as evidential input for the information fusion tool. This is information that would not in itself give rise to an action but helps the diagnostician in understanding and confirming a diagnostic opinion.
3. Fusion architectures
Generally, our fusion tool is divided into three components: 1.) pre-processing, 2.) analysis or core fuser, and 3.) post-processing. We have designed the architecture in such a manner that a maximum amount of external information can be integrated. In addition, we attempted to keep the design modular to allow for later addition of domain-specific modules. Each component consists of several modules that are designed to improve the fusion task at hand. We have explored different schemes that can deal with the fusion scheme. Data are first conditioned in the data pre-processing component. Figure 1: Fusion Components of scheme 1
This includes changes to the classifier output through smoothing, outlier eliminations, capturing temporal effects, and integrating a priori classifier performance information. The outputs of the pre-processing component are modified class estimates. Next, the modules of the hierarchical core fuser aggregate the modified inputs. Finally, the results are polished – where necessary – in the post-processing component to allow for better user interpretation and to account for unequal fault representation. Elsewhere9,10,11,12,13, we reported about a hierarchical architecture (“scheme 1”) as shown in Figure 1. In this paper, we contrast findings from scheme 1 with a modified version which results in increased performance of the classifier fusion scheme. We will first briefly introduce the components of scheme 1 before addressing the changes made to the new scheme (“scheme 2”). The focus of this paper will be on the changes between the two schemes and in particular to the Relevance Equalization and Inter Tool Fusion of scheme 2.
3.1 Scheme 1
Preprocessing - Averaging
This stage deals with temporal information aggregation. Although plain averaging ensures smoothing of information with time, it can stifle the influence of new information. Hence adaptive averaging is used using an adaptive filter parameter that is adjusted to be low when ‘changes’ are high and vice-versa10.
Preprocessing – Fading
This module serves to aggregate conflicting information across tools and simultaneously acts on the temporal information also. For instance, if tool X indicates class A at time t1 and tool Y indicates class B at time t2, (t2>t1), then we need to account for the fact that B might have occurred in the time interval t2-t1. So we need to fade the past information with a fading factor that is a function both of the time elapsed and the confidence in the earlier tool’s decision10.
Cross-Correlation
This module makes use of the preferred misclassification of tools (off-diagonal entries of the confusion matrix) to discount the tool output for each class. The purpose is to factor out cross correlation effects. this scheme, information about preferred misclassifications is used in a manner that discounts the output of a certain class based on the entries in the association matrix11.
Scaling
The inequity in the representation of faults by tools is addressed here by boosting the diagonal entries of the confusion matrix. In this process, the module uses a ‘relevance matrix’ [ rt,i], where rt,i is 1 if tool t is built to recognize fault class i and 0 otherwise. Once this is done, the diagonal entries are used to scale the tool outputs so that more reliable tools are ‘trusted’ more12.
Strengthening
If tools agree on a certain class, then this module strengthens the output for that class (by a simple addition operation) 12.
Weakening
This module performs conflict resolution by discounting the entries of the classes in conflict12.
Tie-breaking
Fault criticality and fault frequency information are used to break ties.
Evidence updating
Any evidentiary information available is used to modify the fused output. We note that evidence plays only a supporting role and is used only to reinforce a decision, not to weaken it. Evidence information is available in the form of the evidence vector and it is multiplied by the evidence matrix (a binary matrix which captures the relevance of an evidence item to a fault class).
Back scaling
This is the final module and it converts the internally coherent information to a form that is externally interpretable. A [0,1] normalization and a dilation operation also constitute this stage.
3.2 Scheme 2
Scheme 2 is divided into pre-processing and core fusion (Figure 2). The post-processing was folded into the modules of the core fusion. Decision Averaging and Decision Fading functions as well as evidence updating was largely retained. Changes were made to Cross-Correlation, Scaling, Strengthening, Weakening, and Back Scaling. These functions were consolidated and newly structured to better accommodate the particular demands on the overall fusion task. In particular, a new definition of relevance was devised which uses some of the concepts of the scaling in scheme 1. Cross-Correlation, Scaling, Tie Breaking concepts, and tasks from the Back Scaling of scheme 1 were moved into a new module “Intra Tool Fusion”.
Figure 2: Fusion Components of scheme 2
Strengthening, Weakening, and concepts from Back Scaling were moved into a module “Inter Tool Fusion”. The changes are depicted in Figure 3.
Figure 3: Changes from scheme 1 to scheme 2
Preprocessing – Averaging
The spirit of this module is the same as before, but a few changes have been made to the implementation. Firstly, is now tool specific. Therefore, we have
(1)
Secondly, the value of is now determined by a simple fuzzy inference sytem. A sample rule is
IF (changes/window) is small THEN is large
Preprocessing – Fading
This is the same as before.
At this point, some non trivial changes have been made to the existing scheme. In the scheme 1, the scaling module performs two tasks – a.) relevance equalization and b.) modification of the tool outputs. In the scheme 2, relevance is defined in an entirely different way. In addition, the relevance equalization part of scaling is performed in a separate module. Finally, the modification of the tool outputs that was part of scaling in the existing scheme is moved over to a new module, along with the cross-correlation functions performed in the scheme 1. The rationale behind these changes is made clear in the following paragraphs.
Relevance equalization
In the scheme 1, the entries of the relevance matrix are binary valued. The entry (rt,i) corresponding to tool i and fault class i is 1 if cii in the confusion matrix for t is ‘high’ (presumably greater than .6 or .7). Since the proposed association scheme aims at exploiting the preferred misclassification of a tool to strengthen the confidence in a class, we wish to define relevance matrix entries based not only on the diagonal entries of the confusion matrices, but also the off diagonal entries. The following heuristic scheme is proposed to ensure equal representation of all classes.
We define , , and to be the top three values (in descending order) in the ith row of the confusion matrix for tool t.
R is called the relevance equalization matrix and is of the form
(2)
Each ri is the relevance equalization factor for class i and this operation suitably boosts the confusion matrix entries corresponding to those classes that are weakly represented by the tools. Each ri is computed by using the information about the extent of representation of class i by the tools and is derived from the confusion matrices as follows. Let r(t,i) be the “relevance” or “extent of representation” of class i by tool t. Then define
(3)
The representation of class i by all the tools is the computed as t r(t,i). The equalization factor ri for class i is then computed as
(4)
By defining ri in this way and using equation (6), we ensure equal representation of all classes.
The new module consists only scaling the confusion matrix entries for those classes that have insufficient number of tools recognizing them. The quantifier used to decide whether or not a class is well represented is the ratio of the sum of the relevances for a given fault (t r(t,i)) to the total number of tools (t 1). This is used in the following manner to modify the diagonal entries of the confusion matrices.