Joseph George Caldwell, PhD; Topic No. AF161-045; Proposal No. F161-045-1533

Volume Two

Automated Bayesian Data Fusion Analysis System (ABDFAS)

Joseph George Caldwell, PhD

1432 N Camino Mateo, Tucson, AZ 85745 USA

Tel. (520)222-3446, e-mail

(1) Identification and Significance of the Problem or Opportunity

This most recent SBIR solicitation underscores the need for improved methods of data fusion (sensor fusion, information fusion, multisensor fusion, multisource multisensor fusion). The following are selected solicitation titles on this subject:

1. AF161-045 Information Fusion to Enable Shared Perception between Humans and Machines

2. AF161-056 Fusion of Multiple Motion Information Sources

3. AF161-059 Event Recognition for Space Situational Awareness

4. AF161-153 Fusion of Kinematic and Identification (ID) Information

5. A16-043 Enterprise Enabled Intelligent Agents to Optimize Intelligence, Surveillance, and Reconnaissance (ISR) Collection

6. A16-037 Predicting, Prognosticating, and Diagnosing via Heuristics and Learned Patterns

7. N161-020 Human Computer Interfacing (HCI) for Autonomous Detect and Avoid (DAA) Systems on Unmanned Aircraft Systems (UAS)

This proposal describes an effort to develop a general-purpose computer-software program for constructing, analyzing, evaluating and optimizing data fusion systems. Specifically, the goal is to develop an automated tool that can construct Bayesian networks, analyze their performance, and optimize that performance. The system name is Automated Bayesian Data Fusion Analysis System (ABDFAS).

In this proposal, we use the term "data fusion" as a general term to refer to what is more specifically called sensor fusion, multisensor fusion, multisource multisensor fusion, and information fusion. A standard definition of data fusion is that provided by the Joint Directors of Laboratories (JDL) Subpanel on Data Fusion: "Data fusion is a process dealing with the association, correlation, and combination of data and information from single and multiple sources to achieve refined position and identity estimates, and complete and timely assessments of situations and threats, and their significance. The process is characterized by continuous refinements of its estimates and assessments, and the evaluation of the need for additional sources, or modification of the process itself, to achieve improved results."

The JDL identify five levels of data fusion: (0) Sub-object data assessment; (1) Object assessment; (2) Situation assessment; (3) Impact (threat) assessment; (4) Process refinement / resource management (adaptive data acquisition and processing). Levels 0 and 1 involve statistical analysis of object physical characteristics or dynamics (e.g., identification of a reentry vehicle, estimation of a ballistic trajectory), and effective techniques are available for working with such attributes (e.g., the Kalman-Bucy filter). Levels 2-4 involve estimation of relationships among battlefield entities and estimation of causal effects, and are substantially more difficult to address. This effort will focus on Levels 2-4, since those are the areas that are more difficult to address technically, and for which the need and demand for improved methodology are greater.
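
As an example of the well-developed techniques available at Levels 0 and 1, the following minimal sketch (in Python; the noise parameters and measurement values are illustrative assumptions, not values from any particular system) shows a one-dimensional Kalman filter update, the scalar discrete-time analogue of the Kalman-Bucy filtering mentioned above:

    import numpy as np

    # Minimal one-dimensional Kalman filter tracking a scalar state from
    # noisy measurements. All parameter values are illustrative.
    def kalman_update(x_est, p_est, z, q=0.01, r=1.0):
        # Predict: state assumed (nearly) constant; process noise q inflates variance.
        x_pred = x_est
        p_pred = p_est + q
        # Update: blend prediction and measurement z using the Kalman gain k.
        k = p_pred / (p_pred + r)
        x_new = x_pred + k * (z - x_pred)
        p_new = (1.0 - k) * p_pred
        return x_new, p_new

    rng = np.random.default_rng(0)
    truth = 10.0
    x, p = 0.0, 100.0  # diffuse initial estimate
    for _ in range(20):
        z = truth + rng.normal(0.0, 1.0)  # noisy sensor measurement
        x, p = kalman_update(x, p, z)
    print(f"estimate {x:.2f}, variance {p:.3f}")  # converges toward 10.0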

There is a vast literature on the subject of data fusion. A quick search of the Internet identifies two sources that present an illustrative summary of the subject and a sample technique:

1. Castanedo, Federico, "A Review of Data Fusion Techniques," The Scientific World Journal, Vol. 2013 (2013), Article ID 704504, 19 pages.

2. Pan, Heping, Nickens Okello, Daniel McMichael and Matthew Roughan, "Fuzzy Causal Probabilistic Networks and Multisensor Data Fusion," Cooperative Research Centre for Sensor Signal and Information Processing, SPRI Building, Technology Park Adelaide, The Levels, SA 5095, Australia, invited paper for SPIE International Symposium on Multispectral Image Processing, Wuhan, China, October 1998, SPIE Proceedings, Vol. 3543.

The preceding articles describe many approaches to data fusion and illustrate a particular one, viz., fuzzy Bayesian networks.

The motivation for proposing to develop a tool for assisting data fusion is revealed in the paragraphs that follow, which discuss the challenges faced by present-day data fusion applications.

Challenges facing current data fusion applications

1. Data and information overload. A major problem facing current data fusion applications is the massive amount of data that is generated and requires processing and analysis. There are two aspects to this problem. First is simply the amount of raw data that must be handled. To be useful, these data must be stored, processed and analyzed. Prior to at least a minimal level of processing and analysis, it is often not known in advance which data are important. This fact requires that a large amount of data be stored, at least for a time. As the total amount of data increases, the likelihood that useful information will go undiscovered increases. Second, many statistical and optimization procedures suffer from the "curse of dimensionality," which refers to the fact that the number of computations required for analysis increases exponentially with the number of entities involved – objects, events, sources, sensors, variables, observations (in time or in space), missing values, factors, parameters and hypotheses. Adding more sources and sensors may theoretically improve accuracy and decision performance (reduce false negatives and false positives), but in practice it may not improve system performance, because of an inability to process the additional data in a timely fashion.
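
The computational core of the curse of dimensionality can be made concrete with a small numerical sketch (Python; the numbers of variables and states are illustrative). A full joint probability table over n discrete variables, each with m states, contains m**n entries, whereas a Bayesian network with bounded parent sets stores only small local conditional tables:

    # Size of a full joint probability table over n discrete variables,
    # each with m states: m**n entries. Illustrative numbers only.
    m = 4  # e.g., four states per attribute
    for n in (5, 10, 20, 30):
        print(n, "variables ->", m**n, "joint-table entries")
    # 5 -> 1,024; 10 -> ~1.0e6; 20 -> ~1.1e12; 30 -> ~1.2e18.
    # A Bayesian network with at most k parents per node stores only
    # about n * m**(k+1) local conditional-table entries instead.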

2. Inability to convert information into intelligence (context-relevant meaning); difficulty of causal modeling and analysis. Sensor systems produce data, and in some instances they automatically process data to the point where they make a decision. In some applications, the data preprocessing is done based solely on observed associations in the data, without consideration of a causal model (one that specifies which variables may affect other variables, and which variables are the source of a threat). By its nature, preprocessing reduces the amount of data that must be handled. If the preprocessing does not take into account a causal model that specifies which variables affect other variables, valuable data may be discarded in error, estimates may be seriously biased, and the power of statistical tests may be seriously degraded. In general, over all disciplines, statistical inference often fails to take causal relationships into account. The fundamental source of this problem is that almost all of modern statistical theory is concerned with associative analysis, not causal analysis. The word "causal" appears in almost no statistics texts, because most of statistical theory is concerned simply with estimating the strength of associations, rather than with estimating the magnitude of causal relationships. Furthermore, even in the realm of causal analysis, almost all of classical statistics is concerned with estimating the effects of causes, not with identifying the causes of effects. The detection or identification of threats, intents, situations and activities requires the use of causal models, not non-causal (associative-only) models.

Unfortunately, causality cannot be inferred from associative analysis alone (i.e., analysis of observational data without benefit of randomized forced interventions or a specified causal model; "correlation is not causation"). Some of the procedures and methods used in multisensor fusion at present are based on associative models, not on causal models. Association-only models may work well for data fusion Levels 0 and 1, but they are not appropriate for Levels 2-4, where causal relationships among variables are an essential aspect of the application. If causal relationships are important in an application but are not properly taken into account, the conceptual framework of such methods is fundamentally flawed, and it is unreasonable to expect high performance from systems based on such (associative) methods. In a competitive game, the probability of hidden or missing data will depend on key response variables, introducing biases into estimates based on most traditional (associative) statistical methods. In analysis of experimental data, such as in a laboratory experiment or in a clinical trial (randomized controlled trial), forced randomized interventions are used to unequivocally infer causal relationships. In analysis of sensor data, which are observational, not experimental, it is essential at higher levels of data fusion to specify a causal model (i.e., identify which variables affect other variables).
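
The distinction between associative and causal inference can be made concrete with a small numerical sketch (Python; the network structure and all conditional probabilities are illustrative assumptions). A confounder C influences both X and Y, so the observational quantity P(Y=1 | X=1) differs from the interventional quantity P(Y=1 | do(X=1)) obtained by back-door adjustment over C, illustrating that "correlation is not causation":

    # Conditioning vs. intervention in a three-node causal model:
    # C -> X, C -> Y, X -> Y (C is a confounder). Values illustrative.
    pC = {1: 0.5, 0: 0.5}
    pX_given_C = {1: 0.8, 0: 0.2}             # P(X=1 | C=c)
    pY_given_XC = {(1, 1): 0.9, (1, 0): 0.6,  # P(Y=1 | X=x, C=c)
                   (0, 1): 0.5, (0, 0): 0.2}

    # Associative: P(Y=1 | X=1), computed from the observational joint.
    num = sum(pC[c] * pX_given_C[c] * pY_given_XC[(1, c)] for c in (0, 1))
    den = sum(pC[c] * pX_given_C[c] for c in (0, 1))
    p_obs = num / den

    # Causal: P(Y=1 | do(X=1)), by back-door adjustment over C.
    p_do = sum(pC[c] * pY_given_XC[(1, c)] for c in (0, 1))

    print(f"P(Y=1 | X=1)     = {p_obs:.3f}")  # 0.840
    print(f"P(Y=1 | do(X=1)) = {p_do:.3f}")   # 0.750

The conditional probability overstates the causal effect here because units with C=1 are both more likely to exhibit X=1 and more likely to exhibit Y=1, regardless of X.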

In a military setting, the adversary is attempting to hide data from the observer. In this setting, "selection effects" are present that introduce a stochastic relationship between the response variable (dependent variable, explained variable) and the missingness event. In technical terms, the nonresponse mechanism is not "ignorable" for inference methods based on the likelihood function (either classical methods or Bayesian methods). If the causal relationships are not correctly taken into account, and if the missing-data mechanism is not correctly taken into account, estimates will be biased and the error probabilities (significance level, power) of statistical tests will be incorrect.

3. Specialized nature of standard statistical and optimization methods and procedures; need for model validation. A very large proportion of the standard (basic, textbook) methods and procedures of modern statistics and optimization theory apply only to very specialized situations: restriction of the probability distribution function describing a phenomenon to a small number of variables, or to a particular functional class (such as the class of stationary distributions, or exponential distributions, or linear models), or dependence on a small number of parameters (such as means, variances and covariances); restriction of an optimization problem to the case of a single outcome (response) variable of interest; or restriction of a decision framework to a single-player, two-player, or zero-sum game representation. These special cases rarely represent real-world phenomena well. The assumptions that define these simple models (such as assumptions about the probability distribution of variables of interest) almost invariably do not apply to phenomena observed in the real world. If the assumptions are not correct, the statistical inferences (estimates, tests of hypotheses) will not be correct. To obtain good results from applying statistical methodology, the models must represent the salient features of real-world situations well. In real-world applications, valid models are typically far more complex than standard textbook models. To have confidence that the models on which decisions are based are valid representations of reality, the models must be subjected to validation testing. For simple models, system performance can often be measured by analytical means (formulas). For complex applications, system performance is assessed by means of simulation. At present, no comprehensive automated system exists, possessing all of the features intended for the ABDFAS, for validating general data fusion systems. The proposed system would provide a general-purpose simulation framework for assessing the performance of complex data fusion systems. Simulation can accomplish this because the system generates the ground truth (given user specifications), so that system performance can be compared to it.
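
The following minimal sketch suggests the flavor of such simulation-based validation (Python; the scenario parameters, sensor noise model, and threshold decision rule are illustrative assumptions, not part of the proposed design): the simulator generates ground truth, applies a candidate fusion rule, and scores the rule against the truth:

    import numpy as np

    # Minimal simulation-based validation: generate ground truth, run a
    # candidate decision rule, and score it against the truth.
    rng = np.random.default_rng(1)
    n = 100_000
    threat = rng.random(n) < 0.1                               # ground truth (10% threats)
    sensor = rng.normal(loc=threat.astype(float), scale=0.7)   # noisy evidence
    decide = sensor > 0.5                                      # candidate fusion rule

    p_fa = np.mean(decide[~threat])                            # false-positive rate
    p_md = np.mean(~decide[threat])                            # missed-detection rate
    print(f"false alarms {p_fa:.3f}, missed detections {p_md:.3f}")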

4. Missing data. Many statistical procedures are designed for situations in which values are known for all variables, over a regular grid of space or time. When missing values occur in some variables, or entire observations are missing, the computational procedures required to produce correct estimates and tests become complicated and laborious (computationally intensive). Which procedures are correct depends on the nature of the missingness, i.e., whether the data are missing completely at random (MCAR); missing at random (MAR), i.e., with missingness dependent on other explanatory variables; or missing not at random (MNAR), i.e., with missingness dependent on the value of a response variable (in which case the missingness phenomenon is not "ignorable" for likelihood-function-based approaches such as maximum likelihood and Bayesian estimation). The occurrence of missing data causes the performance of a sensor system to degrade, both in terms of the quality of the processed data and in terms of the computational burden imposed to handle the missing data (e.g., a large increase in the number of candidate tracks in a correlation / tracking system). Missing data are a prominent feature of many sensor systems, and the algorithms used must be able to accommodate this feature in a correct fashion. Analysis of missing data is complex and difficult, and much statistical processing done in many systems does not handle missing data correctly (i.e., using likelihood-function-based methods). In many situations, the occurrence of missing data is well represented by means of a probability distribution for the missingness phenomenon. Examples include situations in which an adversary attempts to conceal his activity but is not completely successful in doing so, or in which physical phenomena introduce noise into a sensor response.
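
The consequences of the three missingness mechanisms can be demonstrated with a short simulation (Python; the distributions and missingness rates are illustrative assumptions). The naive complete-case mean is unbiased under MCAR, biased but correctable (by conditioning on the observed covariate) under MAR, and biased in a way not correctable from the observed data alone under MNAR:

    import numpy as np

    # Effect of the missingness mechanism on a naive complete-case mean.
    # The response y has true mean 0; x is an observed covariate.
    rng = np.random.default_rng(2)
    n = 200_000
    x = rng.normal(size=n)
    y = 0.8 * x + 0.6 * rng.normal(size=n)           # true E[y] = 0

    mcar = rng.random(n) < 0.3                       # missing completely at random
    mar = rng.random(n) < np.where(x > 0, 0.5, 0.1)  # missingness depends on x
    mnar = rng.random(n) < np.where(y > 0, 0.5, 0.1) # missingness depends on y itself

    for name, miss in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
        print(name, "complete-case mean of y: %.3f" % y[~miss].mean())
    # MCAR: near 0 (unbiased). MAR: biased, but recoverable by modeling
    # y given x. MNAR: biased, and likelihood-based methods that ignore
    # the missingness mechanism cannot correct it.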

5. Handling of sparse data. The preceding heading, "Missing data," refers to situations in which it is appropriate to represent the missing-data mechanism by a random variable. In some applications, large data gaps are present, such as having no data at all from a particular source, or having no data for an extended period of time (such as that caused by intermittent satellite coverage). In some applications, data become very sparse (either because of physical constraints such as satellite paths or because an adversary is deliberately reducing observability). Conventional (likelihood-function-based) methods of handling missing data make use of a probability distribution of missingness. Classical (frequentist) methods require substantial amounts of data in order to produce useful estimates and tests (in order to assure that laws of large numbers and the central limit theorem apply). These methods may degrade substantially or break down completely in sparse-data situations (sensor coverage gaps; adversary-caused missing data). Estimation and testing must be able to respond to these situations. In such applications, a Bayesian approach may be appropriate. (If no data are available for an appreciable time, then estimates are derived from the last-updated posterior distribution, until new data arrive.)
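
The behavior described in the parenthetical remark above can be sketched with a conjugate Beta-Binomial updating example (Python; the prior and observation counts are illustrative assumptions). During gaps with no observations, inference simply carries the last posterior forward:

    # Bayesian updating under sparse data: a Beta-Binomial model for the
    # probability that a target is present in a surveilled cell.
    a, b = 1.0, 1.0                         # uniform Beta(1,1) prior
    batches = [(3, 7), None, None, (1, 2)]  # (detections, looks); None = data gap
    for batch in batches:
        if batch is not None:
            k, m = batch
            a, b = a + k, b + (m - k)       # conjugate posterior update
        print(f"posterior mean {a / (a + b):.3f}")  # unchanged during gaps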

6. Disparate data sources; noncommensurate data. Many statistical procedures are designed to accommodate a single phenomenon, e.g., to process radar returns or events of a certain type (such as a noise), and are not able to accommodate data from diverse sources (e.g., COMINT, SIGINT, HUMINT, all-source, open source, photographic), or to quickly update estimates as new data arrive. What are needed are decision systems that can accommodate all of the available data relating to a situation or event of interest, regardless of their nature, and update estimates quickly whenever observations arrive, however simple or complex those observations may be. It is very important, when data are obtained from disparate sources, that they be combined with existing data in a way that is consistent with the assumed causal model, and in a way that the statistical properties of the resultant estimates and hypothesis tests are known. Combining data from disparate sources presents difficulties for classical (frequentist) statistical inference methods, since the standard approach is to combine all of the data into a single likelihood function. The use of Bayesian networks overcomes this difficulty, since the Bayesian network representation allows naturally and easily for updating on any amount of data, ranging from a complete observation or complete sample on all important random variables to a single value of a single random variable (in every case, the posterior distribution is simply recalculated).
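
The following sketch illustrates incremental updating in a small Bayesian network (Python; the network structure and conditional probability values are illustrative assumptions, not part of the proposed design). A threat node drives two disparate sources, and the posterior is simply recalculated as each piece of evidence arrives, whether one value or several:

    # Posterior updating in a two-sensor Bayesian network:
    # Threat -> Radar, Threat -> Report. Evidence from disparate sources
    # is absorbed one value at a time by recomputing the posterior.
    pT = 0.05                          # prior P(threat)
    p_radar = {True: 0.7, False: 0.1}  # P(radar hit | threat state)
    p_report = {True: 0.6, False: 0.2} # P(positive report | threat state)

    def posterior(evidence):
        # evidence: dict mapping source name -> observed value (True/False)
        like = {True: pT, False: 1 - pT}
        cpts = {"radar": p_radar, "report": p_report}
        for name, val in evidence.items():
            for t in (True, False):
                p = cpts[name][t]
                like[t] *= p if val else (1 - p)
        return like[True] / (like[True] + like[False])

    print(posterior({"radar": True}))                  # single observation: 0.269
    print(posterior({"radar": True, "report": True}))  # both sources fused: 0.525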

7. Difficulty in comparing alternative decision systems. Information collected from multiple sources can be processed in different ways and presented to human operators in different forms. There are often a number of different measures of effectiveness of a decision system, such as the probabilities of false positives and false negatives, and the cost associated with wrong decisions. There is a substantial body of knowledge relating to decision systems analysis, but it is not always employed in designing or selecting a preferred data fusion system.
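
A basic element of decision systems analysis is comparison of candidate rules by expected cost. The following sketch (Python; the cost ratio, threat rate, and thresholds are illustrative assumptions) scores two candidate decision rules on the same simulated scenario:

    import numpy as np

    # Comparing two decision rules by expected cost on a common scenario.
    rng = np.random.default_rng(3)
    n = 100_000
    threat = rng.random(n) < 0.1
    evidence = rng.normal(loc=threat.astype(float), scale=0.8)

    C_FA, C_MD = 1.0, 20.0  # false alarms cheap, missed threats costly
    for name, thresh in [("conservative", 1.0), ("aggressive", 0.2)]:
        decide = evidence > thresh
        cost = C_FA * np.sum(decide & ~threat) + C_MD * np.sum(~decide & threat)
        print(f"{name}: expected cost per case {cost / n:.3f}")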

8. Difficulty in defining an optimal "human / machine" interface. In most systems, data are preprocessed to a certain extent and then presented in summary form to human decision makers. In order to achieve optimal performance, it is necessary to define the nature of this interface. If causal relationships are important to the application (e.g., at higher levels of data fusion), it is essential to recognize that the human being must specify the nature of the causal relationships among system variables. For this reason, it is necessary for the human being to remain a part of the loop that determines how the sensor system preprocesses data. The human / machine interface should be defined in a way that makes effective use of the advantages of human beings and of statistical methodology. The human / machine interface can be configured in many different ways, and the performance characteristics of these alternatives can vary substantially. Automated means are not generally available to assist this determination, i.e., to achieve optimal or near-optimal system performance.

9. Sensor allocation and tasking. In some systems there are opportunities for altering the data collection process, by varying the distribution of sensors over time and space and by varying the tasking of the sensors. It is desired to distribute and task sensors in a way that optimizes the value of the collected information, with respect to the quality of the decisions being made from the data. For many systems, there is no quantitatively defined analytical link between the sensor distribution and decision quality, and the process of determining optimal allocations and tasking is done heuristically. In these applications, the allocation of limited resources may be done suboptimally, and it may not be known to what extent system performance could be improved by the use of mathematically rigorous optimizing procedures.
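
When a quantitative link between allocation and information value is available, even simple mathematical procedures can replace ad hoc tasking. The following sketch (Python; the value figures, diminishing-return factor, and budget are illustrative assumptions) allocates a fixed budget of sensor looks greedily by marginal expected value:

    # Greedy sketch of sensor tasking: allocate a limited number of sensor
    # looks across regions to maximize total expected information value.
    # A real system would derive these values from decision quality.
    regions = {"A": 5.0, "B": 3.0, "C": 1.5}  # expected value of first look
    decay = 0.5                               # diminishing return per repeat look
    budget = 4
    looks = {r: 0 for r in regions}
    for _ in range(budget):
        # choose the region with the largest marginal value for one more look
        best = max(regions, key=lambda r: regions[r] * decay ** looks[r])
        looks[best] += 1
    print(looks)  # greedy allocation, e.g., {'A': 2, 'B': 2, 'C': 0} here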

10. Lack of sufficiently fast algorithms. A standard statistical approach to estimation in complex models with substantial missing data is the Expectation-Maximization (EM) algorithm. As the amount of data involved in an analysis increases to very large amounts, as the complexity of the model increases, and as the amount of missing data increases, the amount of processing that is required to properly analyze the data increases dramatically. To be of practical value in demanding circumstances (where decisions are required quickly), decision systems must be able to accomplish the needed processing within a reasonable amount of time (where "reasonable" depends on the application). Reliable decision systems should incorporate a capability to tailor the processing procedures to the workload burden and time constraints imposed by the situation. It is desired to assess the loss of performance incurred by using "fast algorithms" or by discarding data to cope with data / information overload.
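
For concreteness, the following minimal sketch shows the E-step / M-step structure of the EM algorithm for a two-component Gaussian mixture with known unit variances (Python; the data-generating values and starting guesses are illustrative assumptions):

    import numpy as np

    # Minimal EM for a two-component Gaussian mixture (unit variances).
    # The mixing weight and component means are estimated from the data.
    rng = np.random.default_rng(4)
    data = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(3.0, 1.0, 700)])

    w, mu1, mu2 = 0.5, -1.0, 1.0  # initial guesses
    for _ in range(50):
        # E-step: responsibility of component 1 for each observation
        p1 = w * np.exp(-0.5 * (data - mu1) ** 2)
        p2 = (1.0 - w) * np.exp(-0.5 * (data - mu2) ** 2)
        r = p1 / (p1 + p2)
        # M-step: reestimate the weight and means from the responsibilities
        w = r.mean()
        mu1 = (r * data).sum() / r.sum()
        mu2 = ((1.0 - r) * data).sum() / (1.0 - r).sum()
    print(f"w={w:.2f}, mu1={mu1:.2f}, mu2={mu2:.2f}")  # near 0.3, -2, 3

Each EM iteration requires a full pass over the data; with very large data sets, complex models, or substantial missing data, the cost of such iterative procedures motivates the assessment of faster approximate algorithms described above.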