BAA 03-06-FH AQUAINT Phase II SUNY Albany/Rutgers/CUNY

VOLUME 1 - Technical / Management Details

(40 pages max excluding covers)

Organization / Company / University at Albany, SUNY
CAGE Code
DUNS / CEC Number
TIN Number
Type of Business / OTHER EDUCATIONAL
Proposal Title
System Design Perspective Category (Check ONLY ONE Box) (See Section 5.1) / _X_ 1. End-to-End System
___ 2. Component Elements
___ 3. Cross Cutting / Enabling Technologies
If "Component Elements" Category Selected Above
(Check ALL that Apply) (See Section 5.1.2) / ___ Question Understanding and Interpretation
___ Determining the Answer
___ Formulating and Presenting the Answer
___ Other (Identify):
Data Strategy (See Section 5.2.3) / ___ Focused Data Strategy
_X_ Diverse Data Strategy
___ Other (Identify):

BAA 03-06-FH - VOLUME 1 - Technical / Management Details (CONTINUED)

Team Members / Type of Business / University at Albany, State University of New York / Other Educational
Rutgers, State University of New Jersey / Other Educational
City University of New York, Lehman College / Other Educational
Principal Investigator(s) Name(s) / Professor Tomek Strzalkowski
Mail Address / University at Albany, SUNY
1400 Washington Avenue, SS-262
Albany, NY 1222
Phone Number / 518-442-2608
Fax Number / 518-442-2606
E-mail Address /
Administrative Contact Name / Ms. Linda Donovan
Mail Address
Phone Number / 518-437-4555
Fax Number
E-mail Address /
Proposal Duration / 24 months
Cost - Year 1 / $
Cost - Year 2 / $
Total Cost / $

Part I: Summary of Proposal.(Tomek + Paul)

National security depends more than ever upon accurate, high-quality information being available at the right time to support important policy decisions. A key element of this process is the work of the intelligence analyst, who must quickly and efficiently produce the right information from a potentially enormous number of sources, reports and databases. Today's information retrieval and factoid question answering technology provides some help, but is also a source of growing frustration and missed opportunities. There is little doubt that a more powerful technology is needed to address the question answering needs of professional intelligence analysts. The first phase of ARDA’s AQUAINT Program has pushed QA technology out of its infancy, to the point where its practical benefits are coming into view.

The HITIQA team has made significant progress in Phase 1 of AQUAINT. We have developed a prototype analytical QA system in which we demonstrated preliminary solutions to several key goals of the AQUAINT program:

  • Accepting complex, analytical questions in a form natural to the analyst.
  • “Understanding” the questions in the context of available unstructured data.
  • Negotiating this understanding with the analysts through a multimodal dialogue.
  • Providing access to related information uncovered through the framing process.
  • Assessing information quality and maximizing relevance through source fusion.
  • Delivering means for exploring the answer space via interactive visualization.
  • Generating an answer out of fused “headlines” and fragments of source material.

HITIQA technology represents a radical departure from the “factoid” question answering that has dominated the research landscape until now. While factoid systems have made significant strides in accuracy, as demonstrated in TREC QA evaluations, their utility has always been limited in the world of professional analysts. Therefore, based on our successes with the HITIQA-1 system and our growing experience and understanding of the analytical process, and given the goals of AQUAINT Phase 2, we propose another radical leap forward to turn HITIQA from a helper tool into an indispensable, highly adaptable “analyst’s assistant”, schematically illustrated in Figure 1.

In doing so, we will address the following Phase 2 goals:

  • Question Answering as Part of a Larger Information Gathering Process: HITIQA-2 will support the analyst throughout the entire analytical process associated with an “analytical scenario”. This means not only accepting a series of interrelated questions but also providing their interpretation in the context of the overall information task, along with adjunct information uncovered in the process of searching for answers to specific questions.
  • Accessing, Retrieving and Integrating Diverse Data Sources: HITIQA-2 will exploit structured data as a source of pre-processed information of direct interest to the analyst, as well as a source of knowledge that can be adapted to provide a better understanding of unstructured, unprocessed, novel information.

Figure 1. HITIQA-2 Concept and Components

  • Interact with the system, using questioning strategies natural to the analyst: We will advance the current triaging dialogue and visual browsing in HITIQA-1 to full problem-solving dialogue and exploratory navigation, providing a cooperative environment where the system actively assists the analysts in their work.
  • Explore boundaries of statistical and linguistic approaches to QA: HITIQA is already a hybrid system encompassing a variety of statistical and linguistic methods for information processing. This will be significantly expanded by adding knowledge acquisition methods that utilize structured databases to learn how to process unstructured data with accuracy comparable to manually built knowledge-based methods, while remaining scalable to new and diverse domains.
  • Adapt to analyst’s preferred problem solving style: We will build into HITIQA an automated mechanism for adapting the system’s performance to closely match the analyst’s personal preferences. This will be achieved over time through an adaptation process that tracks the analyst’s information selections and interaction patterns and adjusts the system’s behavior accordingly.
  • Maintain analyst’s confidence in the QA process: HITIQA-2 will create and maintain a persistent network of successive models reflecting the analyst’s information exploration strategy and a changing peripheral context. This will include a working space of the currently active answer model, as well as the backdrop of secondary information which can be explored to guarantee completeness.
  • Evaluate, Validate and Present the Answer: In HITIQA-2, the answer, in the form of a preliminary analytical report, will be assembled from the structured knowledge sources and unstructured data items. This will be accomplished through adoption of frame-based semantics, shared among multiple data sources.
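
As an illustration of the adaptation goal above, the following minimal sketch shows one way analyst selections could feed back into source ranking. It is not the HITIQA mechanism: the class name PreferenceTracker, the neutral starting weight of 0.5, the learning rate, and the source labels are all assumptions made for this example.

```python
from collections import defaultdict


class PreferenceTracker:
    """Hypothetical per-analyst model: one preference weight per source."""

    def __init__(self, learning_rate=0.2):
        self.learning_rate = learning_rate
        # Every source starts at a neutral weight of 0.5 (an assumption).
        self.weights = defaultdict(lambda: 0.5)

    def record(self, source, accepted):
        """Move the source weight toward 1.0 on acceptance, 0.0 on rejection."""
        target = 1.0 if accepted else 0.0
        current = self.weights[source]
        self.weights[source] = current + self.learning_rate * (target - current)

    def rank(self, candidates):
        """Order (source, relevance) pairs by relevance scaled by preference."""
        return sorted(
            candidates,
            key=lambda pair: self.weights[pair[0]] * pair[1],
            reverse=True,
        )


tracker = PreferenceTracker()
tracker.record("newswire", accepted=True)
tracker.record("newswire", accepted=True)
tracker.record("web forum", accepted=False)
ranked = tracker.rank([("web forum", 0.9), ("newswire", 0.7)])
```

After these three recorded decisions, the newswire passage outranks the web forum passage even though its raw relevance score is lower, which is the kind of behavior shift the adaptation process is meant to produce over time.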

A. Summary of Innovative Claims

HITIQA has been conceived as a long-term research project to address the challenges for the intelligence community identified in the AQUAINT Program as a whole. In Phase 1 we attacked a number of these challenges, finding solutions to some and making inroads into others. We have also discovered additional challenges that need to be solved before QA technology can have a visible impact on the work of the intelligence analyst. What we propose for Phase 2 is therefore not an incremental addition to our Phase 1 work; rather, the challenges before us require an entirely new set of innovations. The innovations of Phase 2 are organized into three tightly coupled “research thrusts”:

  1. Information Modeling Thrust: This research thrust includes question understanding, problem-solving dialogue, question and domain semantics, knowledge acquisition, information retrieval, extraction and framing, model maintenance and evolution, answer generation, and related topics.
  2. Quality Modeling Thrust: This thrust includes research into analyst-centered information quality models, task-specific information fusion, system adaptation to the user, persistent memory, and related topics.
  3. Visual Rendering Thrust: Research in this thrust includes the design of intuitive visual interfaces for representing the content, context and perspective of the analytical task, as well as means for navigation and manipulation of the answer space and outlying areas.

These innovations are summarized in Table 1 below, which lays out Phase 1 advances and proposed Phase 2 goals against the grid of overall objectives for the HITIQA project.

Table 1: HITIQA-2 advances compared to the HITIQA-1 base

Area / HITIQA Innovation / Phase 1 / Phase 2
Information Modeling / Questions / Single analytical question / Scenarios involving a series of questions and a strategy
Information Modeling / Dialogue / Clarification triage; follow-up limited to answer space model / Clarification & negotiation; navigation and problem solving; multiple dialogue strategies
Information Modeling / Answers / Fused headlines and text passages / Fused passages; correlated generated reports
Information Modeling / Semantics / Data-driven in general domain; manual fit over specialized domain / Data- and knowledge-driven; domain adaptable; knowledge acquisition from structured sources
Information Modeling / Task-level persistence & adaptability / Not adaptable; one model per interaction / Retains successive models following analyst’s task strategy; model backdrop context; feedback with source fusion
Quality Modeling / User-level persistence & adaptability / None; no personalized features / Adapts to user information selection and judgments; adapts to analyst’s interaction needs; persistent memory of interactions
Quality Modeling / Information Quality & Usability / Measured per source; 9 empirical quality criteria / Measured per source & topic; individualized criteria based on the analyst’s pattern of use
Visual / Visualization and Navigation / Answer space topology; interaction alternative to dialogue; single model navigation / Event and relationship map; navigation and exploration of multiple models; coordinated multimodal interaction
Data and Evaluation / Evaluations & Usability Studies / Program-level pilots; short sessions with users; USNR sessions / Program-level metric-based evaluations; sustained usability testing with USNR, USAF, other analysts
Data and Evaluation / Data sources / Unstructured text / Unstructured text; structured databases; Web-based sources

B. Summary of Technical Rationale

The key technical challenge in developing a practical QA system for the intelligence analyst is equipping it with the capacity to substantially assist the analytical process. This means augmenting the analyst’s own expertise in locating, correlating and following up information with capabilities to do these tasks more efficiently, accurately, thoroughly and quickly. In HITIQA-1 we developed an advanced tool that analysts can use to work on complex, analytical questions. However, transforming this tool into an indispensable analyst’s assistant requires a new set of research objectives for the second phase of the AQUAINT program (see Box 1). The key goal is to enable a full range of interaction, negotiated through HITIQA, between an analyst and a variety of structured and unstructured data sources. The HITIQA-1 Dialogue Manager developed in the first phase is primarily focused on question clarification triage, with few options to engage in follow-ups. In HITIQA-2 we plan to expand this into a complete problem-solving dialogue by substantially increasing the amount of knowledge that the system can manipulate, including knowledge about the domain, the task at hand, and the state of the interaction. The added knowledge will make complex problem-solving dialogue possible. It will also allow a full and meaningful integration of language-based dialogue with visual navigation of the answer space.

Here we summarize the key steps needed to achieve HITIQA-2 objectives, each explained in more detail in the second part of this volume:

  1. Structured data is converted into knowledge. Structured data, most often in the form of relational databases, reflects the way users perceive the domain for a particular purpose. Often, multiple databases (or database relations) are required to capture key aspects of the domain (e.g., weapon transfer events, terrorism incidents, etc.). The structure provides adequate semantics of the data for at least one type of application.
  2. Knowledge is projected over new data. The knowledge extracted from structured data can now be projected over unstructured information to achieve an initial, partial structuring. Specifically, the structured relationships in the database provide frame patterns (event and relationship templates) for information extraction.
  3. Attribute extraction rules are bootstrapped. Structured data contains a large number of instances of one or more types of relations. These are completely filled-out frames (event or relationship templates), and thus they can subsequently be used to derive extraction tools that locate attributes (entities and relations) for these frames in the unstructured data.
  4. Enhanced question understanding is enabled. The analyst’s information request activates one or more event frames, which are used to access structured data sources and to assist interpretation of information retrieved from unstructured sources. Additional frames will be established as a result of the search process. The search is decomposed into a series of queries corresponding to active frames.
  5. Revised answer space models are derived. The initial, partial understanding of the data can now be refined through the dialogue with the analyst so that (a) an initial model of the answer space is created, (b) the dialogue and visual interaction continue until the system’s understanding of the task is improved, and (c) a revised model is built.
  6. Answer space navigation is enabled. All observed data is rendered into a 3-D interactive visualization that provides an interaction mode orthogonal to the language-based dialogue.
  7. The larger context is explored. The system’s growing grasp of the analyst’s goal and strategy is projected from the refined model onto the larger data context, thus allowing for informed source fusion based on the analyst’s perceived usefulness and completeness of information rather than on any specific “objective” metric such as precision and recall.
  8. The system adapts to the analyst’s style. The system records all of the analyst’s information selections and relevance decisions made through the dialogue and visual navigation, and uses them as feedback to revise its information quality and source fusion criteria. Over time a personalized analyst’s model (an Analytical Strategy Model) is derived. ASMs can be swapped to provide alternative task solutions.
  9. The emerging system is continually evaluated. Our intention is to have the system used in a continuous series of exercises with a dedicated group of analysts.
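
Steps 1 through 3 above can be illustrated with a minimal sketch. It treats each fully populated database row as a filled frame, collects the known attribute values, and uses them to partially fill a new frame from unstructured text. The Frame class, the function names, the "weapon_transfer" relation and its rows are all hypothetical, and plain substring matching stands in here for the extraction rules that would actually be bootstrapped.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Frame:
    """An event or relationship template tied to one database relation."""
    frame_type: str
    attributes: dict[str, str] = field(default_factory=dict)


def frames_from_rows(relation: str, rows: list[dict[str, str]]) -> list[Frame]:
    """Step 1: every fully populated row becomes a filled frame."""
    return [Frame(relation, dict(row)) for row in rows]


def known_values(frames: list[Frame]) -> dict[str, set[str]]:
    """Step 3 (simplified): gather the values seen for each attribute."""
    values: dict[str, set[str]] = {}
    for frame in frames:
        for attr, value in frame.attributes.items():
            values.setdefault(attr, set()).add(value.lower())
    return values


def project(frame_type: str, values: dict[str, set[str]], text: str) -> Frame:
    """Step 2: partially fill a new frame from unstructured text."""
    lowered = text.lower()
    found: dict[str, str] = {}
    for attr, candidates in values.items():
        # Try longer values first; ties are broken alphabetically.
        for value in sorted(candidates, key=lambda v: (-len(v), v)):
            if value in lowered:
                found[attr] = value
                break
    return Frame(frame_type, found)


# Hypothetical rows from a weapon transfer relation.
rows = [
    {"supplier": "Country A", "recipient": "Country B", "item": "missile components"},
    {"supplier": "Country C", "recipient": "Country D", "item": "centrifuge parts"},
]
filled = frames_from_rows("weapon_transfer", rows)
partial = project(
    "weapon_transfer",
    known_values(filled),
    "Reports indicate that Country A shipped missile components last month.",
)
```

In this example the projected frame has the supplier and item attributes filled, while the recipient slot stays empty. An unfilled slot of this kind is the sort of gap that the clarification dialogue in step 5 could raise with the analyst.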

C. Schedule and Milestones

Schedule and milestones for the proposed research, including overall estimates of cost for each task. A one-page graphic illustration that depicts major milestones of the proposed effort arrayed against the proposed time and cost estimates must be included.

D. Summary of Deliverables

A summary of the deliverables associated with the proposed research.

E. Key Personnel

A clearly defined organizational chart of all anticipated program participants with brief biographical sketches of key personnel and significant contributors, their roles (including role of Principal Investigator) and their level of effort in each year (calendar year or academic / summer year) of the program. A chart, such as the following, is suggested.

Participants / Organization / Role / Year 1 / Year 2
Prof. Tomek Strzalkowski / University at Albany / Key Personnel (PI, PM) / 25% / 25%
Prof. Deborah Andersen / University at Albany / Significant Contributor / 25% / 25%
Ms. Sharon Small / University at Albany / Significant Contributor / 100% / 100%
Doctoral Candidate 1 / University at Albany / Contributor / 50% / 50%
Doctoral Candidate 2 / University at Albany / Contributor / 50% / 50%
Graduate Assistant 1 / University at Albany / Contributor / 50% / 50%
Graduate Assistant 2 / University at Albany / Contributor / 50% / 50%
Graduate Assistant 3 / University at Albany / Contributor / 50% / 50%
Prof. Paul Kantor / Rutgers University / Key Personnel (co-PI) / 25% / 25%
Prof. Nina Wacholder / Rutgers University / Significant Contributor / 25% / 25%
Prof. K.B. Ng / Rutgers University / Significant Contributor / 25% / 25%
Graduate Assistant 1 / Rutgers University / Contributor / 50% / 50%
Graduate Assistant 2 / Rutgers University / Contributor / 50% / 50%
Prof. Boris Yamrom / City University of New York / Key Personnel (co-PI) / 25% / 25%
Graduate Assistant 1 / City University of New York / Contributor / 50% / 50%

Professor Tomek Strzalkowski – University at Albany, SUNY

Education: Simon Fraser University, PhD Computer Science, 1986.

Experience: Dr. Strzalkowski is an Associate Professor of Computer Science at SUNY Albany. Prior to joining SUNY, he was a Natural Language Group Leader and a Principal Computer Scientist at GE CRD. Prior to GE, he was an Assistant Professor of Computer Science at New York University. He received his PhD in Computer Science from Simon Fraser University in 1986 for work on the formal semantics of discourse. He has done research in a wide variety of areas in computational linguistics, including database query systems, formal semantics, and reversible grammars. He has directed research projects in natural language processing and information retrieval sponsored by ARDA, DARPA and NSF, including work under several TIPSTER contracts. While at GE, he developed advanced text summarization systems for the Government. Dr. Strzalkowski has published over a hundred scientific papers on computational linguistics and information retrieval. He is the editor of two books: Reversible Grammar in Natural Language Processing, and Natural Language Information Retrieval. A new book, Advances in Open Domain Question Answering, is currently being prepared for publication by Kluwer. Current sources of support include the DARPA-funded AMITIES project (2001-04; 20% commitment), the ARDA-funded HITIQA-1 project (2001-03; 20%), and an NSF-funded ITR project (2002-2004; 5%). Pending proposals: NSF ITR (2003-06; 10%).

Professor Paul B. Kantor – Rutgers University

Education: Ph.D. Theoretical Physics, Princeton University (1963)

Experience: Dr. Kantor is Professor of Information Systems in the School of Communication, Information and Library Studies at Rutgers, the State University of New Jersey. Previously he served as a faculty member at Case Western Reserve University, in the departments of Physics, Library Science, System Engineering, and Operations Research. At Rutgers since 1991, he has directed numerous research projects on the development and evaluation of library and information systems, most notably the ANLI system for augmenting a library online catalog with hyperlinks, and the AntWorld project. Prof. Kantor is also a Member of the internationally renowned Rutgers Center for Operations Research (RUTCOR), director of the Alexandria Project Laboratory, and director of the Rutgers Distributed Laboratory for Digital Libraries. He is the author of more than 160 journal articles, book chapters, conference papers and technical reports, and his research has been supported by the ONR, the Institute for Defense Analyses, NSF, DARPA, and other organizations. He is a regular participant in the NSF Information and Data Management planning conferences, serves as a reviewer for numerous scientific and scholarly journals, and is a Fellow of the American Association for the Advancement of Science and the founding Editor in Chief of the journal Information Retrieval. Current projects include the DARPA-funded Novel Approach to Information Finding (AntWorld) N66001-97-C-8537 (15%). Pending projects include Dynamic Indexing and Archiving of Brain Images (NSF/ITR 11%) and Disruption of Quantum Coded Messages (NSF ITR/SY. 15%).

<ADD ALL key personnel and significant contributors>
Part II: Detailed Proposal Information.