We propose to address the “Data Triage” topic, under the general heading “Advanced Methods and Tools for Intelligence Analysis,” focusing our efforts on text data. The goal is to categorize individual pieces of text for triage purposes, identifying each item as belonging to one of three categories: a) items of current significance that should be processed by a human soon, b) items of potential future value, and c) items that may be discarded without further processing. This text data is expected to include both clean (electronic) text and scanned images of text documents, with Optical Character Recognition (OCR) output available for the latter; the data may also include Automatic Speech Recognition (ASR) output and/or keyword spotting results from voice sources. The data will be in English for the initial proof-of-concept, but this English data may include output from Machine Translation tools, and we intend to use approaches that can be ported to new languages given the appropriate development of underlying tools and data understanding (e.g., a syntactic parser for the new language, a characterization of typical phrases in formal documents in that language, etc.). The goal is to move beyond simple indications of significance, such as keyword presence and anomaly detection, towards a more sophisticated use of the various types of structure intrinsic to the data.

The proposed workshop will investigate characteristics that can distinguish the three specified categories of intelligence value in text data, based primarily on linguistic processing, supplemented by other kinds of available automatic processing. These characteristics may include linguistic information, such as syntactic structure; visual characteristics, such as those of maps, diagrams, etc.; discourse information that may distinguish types of communications (instructions, requests, standard commercial documents, social chatting, etc.), as well as elements like degree of formality and degree of emotion; potential indicators of coded messages; other information extracted from text, such as named entities and keywords and phrases (including significant subject matter, “hot” locations and other topics of interest); structure from e-mail and other semi-structured documents; and time information where it is available, either from the document itself, as with e-mail, or from file system time stamps. As a simple example of how these sources of evidence might combine, an e-mail from a person whose name appears on a watch list, with a discourse that indicates a command and a subject matter of transportation, is likely to be of high immediate significance; an e-mail that is part of a thread with a low degree of formality and a subject matter of food, occurring entirely within a one-hour period near noon on a single day, may be categorized as having no intelligence value.
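
The two examples above can be sketched as a small rule-based combiner. This is only an illustrative sketch, not the method the workshop would necessarily adopt: the `Evidence` record, its field names, and the `triage` function are all hypothetical, and the choice to retain any item that matches neither rule as "potential future value" is an assumption made here for the sake of a complete example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Triage(Enum):
    IMMEDIATE = "a"  # current significance; a human should process it soon
    FUTURE = "b"     # potential future value
    DISCARD = "c"    # may be discarded without further processing


@dataclass
class Evidence:
    # Hypothetical feature record; every field name is an assumption.
    sender_on_watch_list: bool = False
    discourse_type: str = "unknown"   # e.g. "command", "social_chat"
    subject: str = "unknown"          # e.g. "transportation", "food"
    formality: str = "unknown"        # e.g. "low", "high"
    thread_span_hours: Optional[float] = None  # None if no time information
    near_noon: bool = False


def triage(ev: Evidence) -> Triage:
    # Rule 1: watch-list sender + command discourse + transportation subject.
    if (ev.sender_on_watch_list
            and ev.discourse_type == "command"
            and ev.subject == "transportation"):
        return Triage.IMMEDIATE
    # Rule 2: informal thread about food, contained in one hour near noon.
    if (ev.formality == "low"
            and ev.subject == "food"
            and ev.thread_span_hours is not None
            and ev.thread_span_hours <= 1.0
            and ev.near_noon):
        return Triage.DISCARD
    # Assumed default: keep anything not matched by a rule for later review.
    return Triage.FUTURE
```

In practice, hand-written rules of this kind would more likely serve as a baseline against which feature combinations learned from tagged data could be compared.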

The participants in this effort have developed techniques for search based on user-specified syntactic patterns, [to fill in appropriately according to final participants], and detection of novel information. The proposed workshop will focus on developing and combining the complementary sources of value from these techniques. In addition, a primary expected outcome of the workshop is a well-studied set of tagged data. We intend to make use of our extensive contacts in the intelligence community in order to develop a set of realistic text data, with associated tags for intelligence value, corresponding to the three identified triage categories. This data set will form the basis for studying the correlation of the triage categories with the various extractable features and their combinations. In turn, this study can enable a proof-of-concept prototype that performs categorization based on the features identified as relevant.
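
One simple way to begin studying how triage categories relate to extractable features is to tabulate, for each feature, the share of tagged items falling into each category. The sketch below assumes a hypothetical representation in which each tagged item is a pair of a feature set and a category label ("a", "b", or "c"); the function name `category_rates`, the feature names, and the toy items are invented purely for illustration and do not represent real data or results.

```python
from collections import Counter, defaultdict


def category_rates(tagged_items):
    """Return {feature: {category: share of items with that feature}}.

    tagged_items is an iterable of (features, category) pairs, where
    features is a set of feature names and category is "a", "b", or "c".
    """
    counts = defaultdict(Counter)
    for features, category in tagged_items:
        for feature in features:
            counts[feature][category] += 1
    rates = {}
    for feature, per_category in counts.items():
        total = sum(per_category.values())
        rates[feature] = {cat: n / total for cat, n in per_category.items()}
    return rates


# Invented placeholder items, used only to show the call shape.
toy_items = [
    ({"watch_list_sender", "command"}, "a"),
    ({"watch_list_sender"}, "a"),
    ({"command"}, "b"),
    ({"low_formality", "food"}, "c"),
]
rates = category_rates(toy_items)
```

Per-feature shares like these ignore interactions between features, so a fuller study would also need to examine feature combinations, as the text above notes.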