A Framework for Information Interoperability
Len Seligman and Arnon Rosenthal, MITRE
To rapidly respond to new opportunities and threats, both government and industry are looking for faster, cheaper ways of sharing information via computer systems. In response, vendors offer a new “solution” every few years, such as data warehouses, Web services, “enterprise information integration” tools, and ontologies. While these are all potentially useful when applied appropriately, none is a silver bullet. Each organization has to assess its own needs and the best approach to meet them.
In all cases, the goal is to make available information that sources have and are willing to export. The framework presented in this article can be used to evaluate interoperability approaches. We'll describe the most common architectures available for achieving interoperability and the challenges that still stand in the way.
There are two main types of information interoperability:
- Exchange, in which a producer (such as the Department of Defense) provides information to a consumer (such as NATO), and the information is transformed to suit the consumer’s needs.
- Integration, in which, in addition to being transformed, information from multiple sources is also correlated and fused. In general, the consumer sees a single, coherent view rather than all the systems’ opinions.
Exchange requires addressing the first three problem levels in figure 1. Integration requires that all four levels be addressed. You can use these problem levels to help you analyze proposed interoperability solutions.
Level 1: Overcome geographic distribution and infrastructure heterogeneity.
Data can be widely distributed geographically. In addition, to access the data you must overcome several types of infrastructure heterogeneity including:
- Different data-structuring primitives, such as relational database tables versus XML versus objects
- Different data manipulation languages (such as SQL or XQuery), proprietary data languages, and sources with no query language that require use of a general-purpose programming language (e.g., Java)
- Different platforms, operating systems, networks, etc.
Level 1 challenges are not as resource-consuming as the others because off-the-shelf middleware products handle most of these challenges. In certain environments (e.g., tactical military applications), however, significant engineering is still required at this level.
Figure 1: The four levels of information integration
Level 2: Match semantically compatible attributes.
Some independently developed information systems use the same terms for the same concepts, but many don’t. Sometimes, these differences in meaning are quite subtle. For example, in one system, “number-of-employees” may include full-time and part-time employees but not contractors, whereas in another system, it includes all full-time workers, regardless of whether they are regular employees or contractors. If users combine results across systems without understanding these details, the resulting data is unlikely to satisfy the needs of the application.
Level 3: Mediate between diverse representations.
Integrators often must reconcile different representations of the same concept. For example, one system might measure altitude in meters from the earth’s surface while another measures it in miles from the earth’s center. In the future, application developers may define interfaces in terms of abstract attributes using “self-description,” for example, Altitude (datatype=integer, units=miles). Mediators can use these descriptions to shield users from the representational details.
Levels 2 and 3 can be addressed by developing mappings across systems.
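As a minimal sketch of such a mapping, the altitude example can be mediated by attaching a small descriptor to each representation and converting through a common form. The names `AltitudeDescriptor` and `mediate_altitude` are hypothetical, and the spherical-earth radius is an assumption made only to keep the example short.

```python
from dataclasses import dataclass

METERS_PER_MILE = 1609.344       # international mile
EARTH_RADIUS_M = 6_371_000.0     # assumption: mean radius, spherical earth

TO_METERS = {"meters": 1.0, "miles": METERS_PER_MILE}


@dataclass(frozen=True)
class AltitudeDescriptor:
    """Hypothetical self-description of how a system represents altitude."""
    units: str       # "meters" or "miles"
    reference: str   # "surface" or "center"


def mediate_altitude(value, source, target):
    """Convert an altitude from the source representation to the target one."""
    # Step 1: normalize to meters above the earth's surface.
    meters = value * TO_METERS[source.units]
    if source.reference == "center":
        meters -= EARTH_RADIUS_M
    # Step 2: re-express in the target's reference point and units.
    if target.reference == "center":
        meters += EARTH_RADIUS_M
    return meters / TO_METERS[target.units]


surface_meters = AltitudeDescriptor(units="meters", reference="surface")
center_miles = AltitudeDescriptor(units="miles", reference="center")

# 10,000 meters above the surface, as the second system would report it.
altitude_in_miles_from_center = mediate_altitude(10_000, surface_meters, center_miles)
```

Because each system only declares its own descriptor, adding a third representation means adding one descriptor rather than a conversion routine for every pair of systems.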
Level 4: Merge instances from multiple sources.
You can do this through data correlation and data-value reconciliation (sometimes called fusion). Data correlation determines if two objects, usually from different data sources, refer to the same real-world object. For example, if the Criminal Records database has “John Public, armed robber, born 1 Jan. 1970” and the Motor Vehicle Registry database has “John Public Sr., license plate JP-1, born 9 Sept. 1939,” might a police query consider these to refer to the same person and return “John Public, armed robber”?
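One deliberately conservative correlation rule for the example above might look like the following sketch. The record layout, the list of name suffixes, and the rule itself (normalized name plus exact birth date) are illustrative assumptions; real correlation policies depend heavily on the application.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: these suffixes are ignored when comparing names.
NAME_SUFFIXES = {"sr", "jr", "ii", "iii"}


@dataclass(frozen=True)
class PersonRecord:
    name: str
    born: date


def normalize_name(name):
    """Lowercase the name, strip punctuation, and drop generational suffixes."""
    tokens = [token.strip(".,").lower() for token in name.split()]
    return " ".join(t for t in tokens if t and t not in NAME_SUFFIXES)


def same_person(a, b):
    """Treat two records as one person only if name and birth date both agree."""
    return normalize_name(a.name) == normalize_name(b.name) and a.born == b.born


criminal_record = PersonRecord("John Public", date(1970, 1, 1))
vehicle_record = PersonRecord("John Public Sr.", date(1939, 9, 9))

# The names match after normalization, but the birth dates do not,
# so this rule keeps the two records separate.
is_match = same_person(criminal_record, vehicle_record)
```

Under this rule the police query would not attach "armed robber" to John Public Sr.; a looser rule that matched on name alone would.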
Data correlation can identify different sources that disagree about particular facts. Suppose three sources report John Public’s height to be 180, 187, and 0 centimeters, respectively. Data-value reconciliation can be used to determine what values the search should return to the application. This capability requires detailed application knowledge. Vendors and researchers are increasing their efforts in the “data-cleaning” area to help administrators specify the desired policy, semi-automatically identify candidate objects to be merged, and—if cost-justified—resolve individual instances. Reconciliation rules should be flexible, modular, and displayable to domain experts who lack programming skills.
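A reconciliation policy for the height example can be sketched as a short, readable rule: discard implausible values, then take the median of what remains. The plausibility bounds and the choice of median are assumptions for illustration, not a recommended policy.

```python
from statistics import median


def reconcile_height_cm(reported_values, min_cm=50, max_cm=250):
    """Return one height from conflicting reports, or None if none is plausible.

    Assumed policy: values outside [min_cm, max_cm] are treated as missing
    or erroneous, and the median of the remaining values is returned.
    """
    plausible = [v for v in reported_values if min_cm <= v <= max_cm]
    if not plausible:
        return None
    return median(plausible)


# The three sources report 180, 187, and 0 centimeters.
reconciled = reconcile_height_cm([180, 187, 0])
```

Here the 0 is dropped as implausible and the function returns 183.5. An application that needs an auditable answer might instead prefer the value from its most trusted source, which is why the policy should stay easy to inspect and replace.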
Typically, you must address these challenges in order, from lowest to highest. For example, unless a solution first meets the challenges of geographic distribution and diverse infrastructures, addressing higher levels will yield little benefit. For information exchange, levels 1–3 are sufficient, while integration efforts also require information merging (level 4).
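The level requirements can be written down as a simple checklist. The sketch below is only an illustration of the framework; the function name and the way a product's coverage is recorded are assumptions.

```python
# Levels each kind of interoperability must address, per the framework.
REQUIRED_LEVELS = {
    "exchange": frozenset({1, 2, 3}),
    "integration": frozenset({1, 2, 3, 4}),
}


def missing_levels(addressed, goal):
    """List the levels a proposed solution still leaves unaddressed."""
    required = REQUIRED_LEVELS[goal]  # raises KeyError for an unknown goal
    return sorted(required - set(addressed))


# A hypothetical product that handles infrastructure and representation only.
gaps_for_exchange = missing_levels({1, 3}, "exchange")
gaps_for_integration = missing_levels({1, 3}, "integration")
```

For this hypothetical product, semantic matching (level 2) is missing for exchange, and both level 2 and level 4 are missing for integration.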
This issue of The Edge discusses how we have addressed these levels through several approaches.
General architecture approaches for information interoperability include:
- Integration within the application. An application or Web portal communicates directly with each source using that source’s native interface and reconciles the data it receives. While common, this approach has serious drawbacks: it places great demands on the application developer, who must stay knowledgeable about each of several data interfaces. In addition, information combination becomes part of the code base that must be maintained, making it difficult to leverage commercial database management or middleware products.
- Data warehouses. Administrators define a global schema (i.e., a template) for the shared data. They provide the derivation logic to reconcile data and pump it into one system, typically with the help of extract-transform-load tools. Typically, the warehouse is read-only, with updates made directly on the source systems. As a variation, data marts give individual communities their own subsets of the global data.
- Federated databases. These virtual data warehouses do not populate the global schema. Instead, the source systems retain the physical data and a middleware layer translates all requests to run against the source systems. Commercial companies call this “enterprise information integration.”
- Messaging. One application or database uses structured messages to pass data to others. Often, however, the sender and receiver use different terms for the same concepts, so that the data must be transformed to meet the needs of the receiver. Enterprise application integration (EAI) products support message-based interoperability.
- Parameter passing. One application invokes another and passes data as parameters. Web services are an example of this architecture, in which services are invoked and described using standard Web languages and protocols. EAI products also support this architecture.
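To make the messaging case concrete, the following sketch transforms a sender's message into the receiver's vocabulary using an explicit field mapping. The field names and the `transform_message` helper are hypothetical, and commercial EAI products provide far richer transformation facilities than this.

```python
# Assumed mapping from the sender's field names to the receiver's.
FIELD_MAP = {
    "emp_count": "number_of_employees",
    "org": "organization",
}


def transform_message(message, field_map):
    """Rename the sender's fields to the receiver's terms.

    Fields absent from the mapping are dropped; a mapped field missing
    from the message is reported rather than silently ignored.
    """
    missing = [name for name in field_map if name not in message]
    if missing:
        raise KeyError(f"message lacks mapped fields: {missing}")
    return {target: message[source] for source, target in field_map.items()}


sender_message = {"emp_count": 42, "org": "Acme", "internal_id": "x-17"}
receiver_message = transform_message(sender_message, FIELD_MAP)
```

Renaming fields addresses only part of the problem: if the two systems count employees differently, as in the level 2 example, the values themselves still need a semantic mapping.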
Challenges
The technical issues of these approaches revolve around heterogeneity, distribution, and multiple versions. In general, the greatest challenges lie in semantics.
The framework presented in this article can be used to evaluate interoperability approaches. For example, if a vendor describes their product as being “the answer” to information interoperability, you can ask them which of the four levels their product addresses. If they say “all of them,” be suspicious!