VLDB 2000 SimQL, paper # 108
Augmenting Information Systems with Access to Predictive Tools
Gio Wiederhold and Rushan Jiang
Computer Science Department
Gates Computer Science Building 4A
Stanford CA 94305-9040
650 725-8363 fax 725-2588
We report on a prototype system that provides access to computational tools that predict future states of the world. We also discuss its interoperation with SQL-accessed resources, which will augment the decision-making support capabilities of information systems. The central component is a new interface language, SimQL, which mirrors the functionality of SQL but delivers information projecting future states, obtained from a variety of simulations.
Simulations to be wrapped for SimQL access include spreadsheets, business simulations, planning models, as well as large remote continuous simulations, as used for weather forecasting. Results reported through SimQL are paired data elements: the expected value and its certainty. SimQL is intended to be used within information systems that cover data from the past into the future, and support the assessment of the effects of alternate decisions, so that multiple future courses can be compared. Placing results of simulations into a consistent framework with databases and web-based information avoids the system inconsistencies that decision-makers face today.
The long-range motivating vision is that an interface language provides separation of clients and tool providers. Their autonomy will allow information consumers and providers to make progress independently, mirroring the past decades of SQL use.
Basic database systems are being extended to encompass wider access and analysis capabilities. Today rapid progress is being made in information fusion from heterogeneous resources such as databases, text, and semi-structured information bases [WiederholdG:97]. Results of this research are being transferred to practical settings. The objective of many database technology extensions is to provide more capabilities for decision-making. However, the decision maker also has to plan and schedule actions beyond the current point in time. Databases make past and nearly current data available, but tools that predict future states are required for projecting the outcome at some future time of the decisions that can be made today. [StonebrakerK:82] proposed extensions that allow hypothetical relations to be defined within the schema. The data represented in such relations is to be computed using rules and other stored data, and could include projected values. Such computational features are now included in several database systems, and are expected to be touted soon as part of Microsoft's SQL Server. For substantial predictions and alternative futures it seems better to rely on existing and well-developed simulation tools. The predictive requirements for decision-making have rarely been addressed in terms of integration and fusion [Orsborn:94].
The tools that are available for projecting future states include spreadsheets with formulas predicting expected results, planning models, business-specific simulations, and continuous simulations, as used for weather forecasting. They all require computations to produce results; sometimes they may precompute values and store them for later retrieval. We will refer to all of them as simulations. Actual future states to be computed will be affected by two types of factors: 1. actions initiated by the decision maker, and 2. failures to execute actions correctly, events due to nature, and actions of others. Simulations have input parameters that allow setting of such expectations. Some of these parameters are best set by the decision maker, and others by experts. For instance, in a business setting, a product manager may set the amount of a product investment, but experts will contribute, say, the size of the market base, expected interest rates, failure probabilities based on historical records, and the like. A simulation will then provide, for any investment made at a given time, the likely sales, profits, and associated risks.
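As a minimal illustration of this division of labor, a simulation of this kind can be viewed as a function from decision parameters and a future time to a projected value paired with its certainty. All names and formulas below are hypothetical, not taken from any particular business model:

```python
# Hypothetical sketch: a toy business "simulation" projecting sales for a
# product investment. All parameter names and formulas are illustrative.

def project_sales(investment, months_ahead, market_base=100_000, growth_rate=0.02):
    """Return (expected_sales, certainty) for a given investment and horizon.

    The decision maker sets `investment`; experts would supply
    `market_base` and `growth_rate` from historical records.
    """
    expected = investment * 0.1 * (market_base / 100_000) * (1 + growth_rate) ** months_ahead
    # Certainty decays as the projection reaches further into the future.
    certainty = max(0.0, 1.0 - 0.05 * months_ahead)
    return expected, certainty

value, certainty = project_sales(investment=50_000, months_ahead=6)
```

The point is not the formula but the shape of the interface: decisions and a time horizon go in, a value-certainty pair comes out.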
Database technology is not very visible in this domain. In simple cases the expected future state is projected from a back-of-the-envelope estimate, counting on individual experience. Dealing with the increasing complexity of the modern world and the widening range of alternatives demands computational assistance. The most common tool used for planning and documenting predictions is the spreadsheet. In situations where the amount of data is modest and relatively static, files associated with spreadsheets have entirely supplanted the use of database technology, and extensions to allow sharing of such files are being developed [FullerMP:93]. Business-specific simulation tools allow convenient entry of alternative decisions and might store intermediate results in their own file structures. Analyzing the stored alternatives helps in selecting the best course-of-action [LindenG:92]. Since multiple values are obtained for the future, time-oriented database extensions, supporting a past history [Snodgrass:95], have not been adequate. To be effectively used, predicted values must be labeled with the parameter settings and system assumptions that led to them. A weak point in predictive systems that store simulation results is that the volume of possible alternatives is huge, and any stored information becomes invalid very rapidly as time and events pass. Hence we find in practice much more ad-hoc use of simulations. The effect is that simulations are not integrated into more comprehensive information systems, and data is often transferred (in and out) by cut-and-paste technology. The problem has been recognized in military planning; quoting from [McCall:96], the two `Capabilities Requiring Military Investment in Information Technology' are:
`1. Highly robust real-time software modules for data fusion and integration;
`2. Integration of simulation software into military information systems.'
The database paradigm, providing clients with anywhere, anytime access to valid information, defined in a schema, supported by a substantial infrastructure, should also be attractive for data about the future. Our SimQL research has investigated such an approach and developed a language tool suitable for integration with database technology.
In order to bring prediction into the database paradigm we can exploit a wealth of available information technologies, reaching beyond the database community. We have made great strides in accessing information about past events, stored in databases, object-bases, and the World-Wide Web. Access to information about current events is also improving dramatically, with real-time news feeds and on-line cash registers. To serve projections we must expand the temporal range into the future, deal with multiple alternative projections, and manage the uncertainty of such projections.
The importance of rapid, ad hoc access to data for planning is understood by database specialists, but should not be limited to historic data from databases. This audience understands database capabilities well, so we will not belabor them. The invention of the schema [McGee:59] and of formal query languages [Codd:72] that depend on the schema has transformed application-specific file programming into an independent services industry. Eventually, multiple, remote databases could be queried [Litwin:83]. Modern versions of SQL now also provide remote access [DateD:93]. Extensions to SQL to manage historical data are becoming well accepted [Snodgrass:95]. Data warehouses that integrate data into historic views are becoming broadly available [Widom:95].
Figure 1: The Place of Simulation Access in Information Systems
Planners must consider alternate futures, so an information model that supports planning must handle a tree of data beyond now, as shown in Figure 1. Branches of the plan have associated uncertainties. Planning systems developed in Artificial Intelligence do deal with alternatives and uncertainties [TateDL:98]. They model processes otherwise performed only in a planner's mind. That is, the planner sketches reasonable scenarios, mentally developing alternate courses-of-action, focusing on those that had been worked out in earlier situations. Such mental models are the basis for most decisions, and only fail when the factors are complex or the planning horizon is long. Human short-term memory can only manage about 7 factors at a time [Miller:56]. That matches the number of reasonable choices in a chess game, so that a game such as chess can be played about as well by a trained chess master as by a dumb computer with a lot of memory. But chess is simpler than most of the real world. To help the client, tools for managing uncertainty, pruning the space of alternatives, presenting viable choices, and comparing them become essential.
Planning systems provide for computing the uncertainty forward in time, as the tree of alternatives widens. If values (i.e., income, profit, position, benefits, inventory) at the end-points are known, planning systems can perform the backwards calculations to obtain the current-net-value for decisions to be made now, or at intermediate points in the future [Tate96]. Unfortunately, they tend to be data poor. Instead of matching conditions with actual information, they tend to depend on equations, derived from mining past data to compute the projections. Since most planning systems store all their information internally, they also tend to be static. Recent events, and the progress of time, are not directly incorporated.
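Assuming a simple tree of alternatives with probabilities on the branches and values at the end-points (all figures below are made up for illustration), the backwards calculation described above reduces to a probability-weighted fold over the tree:

```python
# Sketch of the backwards calculation over a tree of alternative futures.
# Each internal node is a list of (probability, subtree) branches; leaves
# carry a known end-point value (e.g., profit). All numbers are invented.

def expected_value(node):
    """Fold end-point values back toward now as a probability-weighted sum."""
    if isinstance(node, (int, float)):      # leaf: a known end-point value
        return node
    return sum(p * expected_value(child) for p, child in node)

# Two alternatives now, each with uncertain outcomes later:
plan = [
    (0.6, [(0.5, 120.0), (0.5, 80.0)]),    # alternative A
    (0.4, [(0.9, 50.0), (0.1, -10.0)]),    # alternative B
]
print(expected_value(plan))
```

A real planning system would also attach values at intermediate points and compare subtrees per decision, but the backward direction of the computation is the same.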
Decision-making in planning depends on knowing past, present, and likely future situations. Justifiable projections require entering current data, and computing results using well-defined models. We find such models in existing simulations. Replacing manual planning with simulations has the benefit that it becomes easy to dynamically re-execute the planning process when situations change. Events expected at planning time may change in relevance. Uncertainties reduce as time passes. Keeping information models for planning up-to-date is hence much work, and is unlikely to happen without tools that enable easy access to simulation results within dynamic decision-support systems. Integration of simulation results into effective client systems distinguishes our work from the objective of building grander simulations, which motivates the simulation community.
To assess the future thoroughly we must access and execute simulations dynamically. Spreadsheets use simple formulas. It is up to the spreadsheet designer to identify columns or rows as results representing future states. Simulations typically deal with time explicitly. They employ a wide variety of technologies, including continuous equational models and discrete, time-step models. Many simulations are available from remote sites [FishwickH:98]. Simulation access by more general information systems should handle local, remote, and distributed simulation services. Distributed simulations can also communicate with each other [MillerT:95]. These interact using the highly interactive High Level Architecture (HLA) protocols [IEEE:98], but their results are not now accessible to general information systems [Singhal:96]. If the simulation is a federated distributed simulation, as envisaged by the HLA protocol, then one federation member may supply the data to the decision-making support system, by first aggregating data from detailed events to the level that is appropriate for initiating planning interactions.
Extrapolating from the past into the future creates uncertainty. Uncertainty is an essential aspect of planning, and has been studied in a variety of abstract settings [BhatnagarK:86]; this research direction is ongoing. The Artificial Intelligence (AI) community has a long history of computing with a variety of uncertainty measures, and some researchers have found commonalities in approaches that make integration feasible [KanalL:86]. It will be important to bring this research into practical, information-based planning systems.
Alternate future scenarios represent not only choices that can be made by the client, but also events outside of the decision-maker's control, such as responses by others or acts of nature. When the projections become detailed and planning horizons extend far, the space of alternatives becomes immense. At each ply the alternatives multiply, and pruning or coalescing of branches becomes crucial. As time passes, opportunities for choosing alternatives disappear, so that the future tree is continuously chopped off at the root as the now marker marches forward [CliffordEa:97].
Today, the initial pruning is mainly done intuitively or interactively, with participants sharing whiteboards. Available tools are video-conferences and communicating smartboards, sometimes augmented by pasting results that participants extract from isolated analysis programs. For instance, a participant may execute a simulation to see how a proposal would impact people and supply resources. Financial planners will use spreadsheets to work out alternate budgets, and show a subset of the parameters to others. Automated pruning may be based on low probabilities, or on low potential loss or gain. Coalescing of low-valued branches can simplify computation, and allow expansion when conditions change. Automation of these techniques will be a challenge.
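Automated pruning of the kind described above can be sketched as a simple filter on branch probability and potential value. The thresholds and data here are arbitrary illustrations, not part of SimQL:

```python
# Illustrative pruning of alternative branches: drop those whose
# probability, or whose potential gain or loss, falls below thresholds.

def prune(branches, min_prob=0.05, min_abs_value=1.0):
    """Keep only (probability, value) branches worth further expansion."""
    kept = []
    for prob, value in branches:
        if prob >= min_prob and abs(value) >= min_abs_value:
            kept.append((prob, value))
    return kept

branches = [(0.50, 120.0), (0.45, 0.2), (0.04, 500.0), (0.01, -3.0)]
print(prune(branches))
```

Coalescing, by contrast, would merge the dropped branches into a single aggregate node so that they can be re-expanded when conditions change.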
The SimQL Approach
The concept of our simulation access language, SimQL, mirrors that of SQL for databases. Instead of requesting stored information, SimQL initiates interactions with a computational module. These modules are assumed to be external and substantial, so that the overhead of accessing them is worthwhile. The modules, or rather their wrappers, accept input parameters, including the desired future time, and return corresponding result values. Typical input parameters implicitly specify actions, say making a certain investment, or choosing an available alternative. For example, a client may specify a decision to use air freight rather than road transport. The computations may generate further alternatives, say, the possibility or not of a snowstorm causing delays in Chicago.
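A wrapper in this style can be sketched as a thin adapter that maps query parameters, including the requested future time, onto the underlying tool and returns value-certainty pairs. The class, method, and formulas below are our own illustration, not part of any SimQL specification:

```python
# Hypothetical wrapper exposing a shipping-delay simulation to SimQL-style
# queries. The wrapped "simulation" here is a trivial stand-in; a real
# wrapper would invoke a spreadsheet, planning model, or remote simulation.

class ShippingWrapper:
    """Adapter exposing selected simulation variables as queryable results."""

    def query(self, mode, days_ahead):
        # Map external parameters onto the wrapped tool.
        base_delay = {"air": 1.0, "road": 3.0}[mode]
        delay = base_delay + 0.1 * days_ahead          # made-up formula
        certainty = 1.0 / (1.0 + 0.2 * days_ahead)     # decays with horizon
        return {"delay_days": (delay, certainty)}

result = ShippingWrapper().query(mode="air", days_ahead=5)
```

Only the variables exposed through the wrapper's interface are visible to clients; the simulation's remaining internal variables stay hidden, as discussed below.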
To make the results obtained from a simulation clear and useful for the decision maker, the interface must use a simple model. Computer screens today focus on providing a desktop image, with cut-and-paste capability, while relational databases use tables to present their contents, and spreadsheets use a matrix with hidden formulas. To be effectively used, simulations should also present a coherent interface model. In terms of system structure, we follow the accepted SQL approach. Note that SQL is not a language in which database management systems are written; those may be written in C, Ada, etc. Rather, SQL is a language to describe, select, and fetch results for further use in information systems. The databases themselves are owned and maintained by others, such as domain specialists and database administrators. Similarly, use of SimQL enables access to the growing portfolio of simulation technology and predictive services maintained by experts in the simulation community. Having a language interface will overcome the discontinuity now experienced when predictions are to be integrated with larger planning systems.
The research carried out under the proof-of-concept support included three phases:
1. Defining an initial specification for SimQL and creating a simple compiler and execution support
2. Wrapping several existing simulations to assess the generality of the SimQL concept
3. Performing experiments with a variety of simulation resources
There are two aspects to the SQL language, mimicked by SimQL:
- A Schema that describes the accessible content to an invoking program, its programmers, and its clients.
- A Query Language that provides the actual access to information resources.
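To make these two aspects concrete, the sketch below shows a SimQL-like schema declaration and query as strings handed to an illustrative client stub. The syntax and the stub are our invention for this example; the concrete SimQL grammar is not fixed here:

```python
# Illustrative only: a SimQL-like schema and query, in a syntax invented
# for this sketch, plus a stub that "executes" the query by returning a
# value-certainty pair for the requested attribute.

SCHEMA = """
SIMULATION Weather
  INPUT  region, time
  OUTPUT temperature, precipitation
"""

QUERY = "SELECT temperature FROM Weather WHERE region = 'Chicago' AND time = now + 24h"

def execute(query):
    # Stand-in for a real SimQL processor: pick out the requested
    # attribute and return a plausible (value, certainty) pair.
    attribute = query.split()[1]
    return {attribute: (-2.0, 0.8)}

print(execute(QUERY))
```

As with SQL, the schema tells client programmers what may be requested, while the query language performs the actual access.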
Using similar interface concepts simplifies understanding for clients and also encourages seamless interoperation of SimQL with database tools in supporting advanced information systems. There are differences, of course, in accessing past data and computing information about the future:
- Not all information about a simulation is made accessible via the SimQL schema. Simulations are often controlled by hundreds of variables, and mapping all of them into a schema for external access is inappropriate. Only those variables that are needed for querying results and for specifying the simulation ranges are made externally accessible. The remainder will still be accessible to the simulation developer. Defining the appropriate schema requires the joint efforts of the developer, the model builder, and the client.
- Predictions always incorporate uncertainty. Thus, a measure of uncertainty is always reported with the results. Its interpretation requires insights by the client programmer, just as the semantics of any retrieved results do. The information systems that process the results can then choose to take uncertainty explicitly into account, so that the decision-maker can weigh tradeoffs, say, risks versus costs.
- Results are also associated with points-in-time, complementing historical database models. The client should be able to integrate past, present, and simulated information, providing a continuous view, with increasing uncertainty. When delays occur in reporting past data, then the certainty at t=0 is already less than 1.0.
- For true decision support, multiple courses-of-action (CoAs) should be supported in the client information system, since multiple candidate alternatives may be valid simultaneously, with some probability, in the future. Full utilization of predictive data implicitly requires a multi-value information model. In the proverbial sense, SimQL only provides the egg here, not the chicken.
- We do not expect to need persistent update capabilities in SimQL. Model updates are the responsibility of the providers of the simulations. The queries submitted to SimQL supply temporary variables that parameterize the simulations for a specific instance, but are not intended to update the simulation models.
Since we expect to often have to integrate past information from databases with simulation results, we start with the relational model and SQL. However, the objects to be described have a time dimension and an uncertainty associated with them. We hence used a simple object extension as the data representation for SimQL.
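A minimal sketch of such an extended representation, assuming nothing beyond what the text states: a relational attribute value tagged with its time point and certainty. The field names are our illustration:

```python
# Sketch of a result object extending a relational attribute value with
# the time dimension and uncertainty discussed above.

from dataclasses import dataclass

@dataclass(frozen=True)
class PredictedValue:
    value: float       # the projected attribute value
    time: float        # time point, e.g., days from now (t=0 is "now")
    certainty: float   # in [0, 1]; typically decreasing with time

row = PredictedValue(value=42.0, time=30.0, certainty=0.65)
```

A SimQL result set is then a collection of such objects, which a client can merge with ordinary relational rows covering the past.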