How to Build a Pacifica Model[1]

Douglas E. Dyer, PhD[2]

Sarah B. Dyer[3]

Active Computing, Inc.[4]

1 August 2005

News articles and text-based reports regularly provide valuable information about current events in all parts of the world. We can infer important trends about a country’s political stability by reading news articles. For example, an article about a recent suicide bombing in Iraq might imply that Iraqi insurgents are gaining power. Interpreting large numbers of articles from local and international newspapers can normalize bias and supply us with an accurate idea of a country’s current situation. However, people require a substantial amount of time to read thousands of articles, so automation is desirable. While we lack technology for general natural language understanding, it is possible to identify relevant concepts and general tones in natural language text. Thus, for models that consist of text-based questions, we can identify relevant parts of a model and estimate the impact for each news article. Active Computing’s Pacifica is a modeling engine for interpreting text inputs (e.g., news articles) in the context of a model consisting of natural language questions. Pacifica uses natural language processing technology and linear combination, in conjunction with a particular model, to make assessments based on large text-based input sets such as news articles. A Pacifica model consists of several structured text files which are easy for domain experts to create with a spreadsheet. This paper describes the process for creating a Pacifica model and is intended for domain experts who wish to enhance existing models or create new models for various purposes. For illustration purposes, we will use an example of a model intended to assess the strength of the insurgency in Iraq.

The Reflex Insurgency Model analyzes news articles based on six areas considered to be strategically significant by the United States Joint Forces Command: political, military, economic, social, information, and infrastructure (PMESII). It includes weighted questions relevant to the insurgency, keywords and score words, and it runs in Pacifica.

The Reflex Insurgency Model consists of three spreadsheets.[5] The first spreadsheet includes a set of questions, each of which is associated with several sets of keywords. Each question addresses a specific concern related to one of the areas of PMESII. These questions and keywords are formatted as shown below.

Each row of the spreadsheet contains information corresponding to one question.

What does each column mean?

Column A: This column contains a number corresponding to the question in the same row. These numbers are used to identify questions in the second spreadsheet, which will be explained later.

Column B: This column contains the name of a subcategory of one of the larger categories of PMESII. Each question is categorized under one of the areas of PMESII, but is also further grouped into a subdivision. For example, in the spreadsheet shown above, question #33 states “Are judges corruptible?” We chose to put this question in the Social category and specifically the Injustice subcategory.

Column C: This column contains the list of questions. The questions are chosen as a result of the information desired. For example, if we desire the interpretation of large numbers of articles to tell us about the strength of an insurgency, we word the questions so that if the answer to them is yes, then the insurgency is strong.

Columns D, E, F, etc.: These columns each contain a list of keywords. Each question is associated with one or more lists of keywords. When an article is being “read,” the algorithm interprets a question as being relevant only if the article contains at least one keyword from each of the lists associated with that question. In other words, the text of the article must have a non-empty intersection with every one of the question’s keyword lists. For example, for question #4 (“Is availability of water inadequate?”), the answer is considered to be yes if and only if a keyword from each associated set is contained in the text. The first set of keywords is “water drinking drink h2o.” The second set is “scarce scarcity available availability unavailable inadequate lack lacking short meager insufficient "not sufficient" dry drier driest.” If the article contains at least one word from each set, the algorithm decides that the availability of water is addressed in the article.
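The relevance test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Pacifica’s actual implementation: the function names (parse_keyword_set, set_matches, is_relevant) are hypothetical, and the choices of case-insensitive whole-word matching and substring matching for quoted phrases are our own guesses about reasonable behavior.

```python
import re
import shlex


def parse_keyword_set(cell):
    """Split one spreadsheet cell into keywords; quoted phrases stay together."""
    return [keyword.lower() for keyword in shlex.split(cell)]


def set_matches(keywords, text):
    """Return True if at least one keyword (or quoted phrase) occurs in the text."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z0-9']+", lowered))
    for keyword in keywords:
        if " " in keyword:
            # Multi-word phrase: assumed to match as a plain substring.
            if keyword in lowered:
                return True
        elif keyword in words:
            # Single keyword: assumed to match whole words only.
            return True
    return False


def is_relevant(keyword_sets, text):
    """A question is relevant only if every one of its keyword sets matches."""
    return all(set_matches(keywords, text) for keywords in keyword_sets)


# Keyword sets for question #4, copied from the example in the text.
WATER_SETS = [
    parse_keyword_set("water drinking drink h2o"),
    parse_keyword_set(
        "scarce scarcity available availability unavailable inadequate "
        'lack lacking short meager insufficient "not sufficient" dry drier driest'
    ),
]

print(is_relevant(WATER_SETS, "Residents say drinking water is scarce."))  # True
print(is_relevant(WATER_SETS, "The water festival drew large crowds."))    # False
```

The second article mentions water but contains no word from the second set, so the question is not considered relevant to it.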

The second spreadsheet is used to determine how important an affirmative answer to each of the questions would be to the overall situation of a country. If we are trying to determine how strong the insurgency is in Iraq, then the answers to some of our questions carry more weight than the answers to others. For example, if water is widely unavailable or unfit for consumption, the population will be extremely likely to support the insurgency if its leaders promise that their basic needs will be satisfied. On the other hand, inadequate housing provides less incentive for the population to rebel against the government because it is probably of less immediate importance. This spreadsheet is formatted as shown below:

What does each column mean?

Column A: This column contains a number identifying which of the PMESII areas the corresponding question falls under. If the question addresses a Political aspect, then the indicator ID is 1. If the appropriate area is Military, the indicator ID is 2. Economic, Social, Information, and Infrastructure are 3, 4, 5, and 6, respectively.

Column B: This column contains the broadest categories under which the questions are grouped. There is one category for each of the areas in PMESII, and they are directly related to the information desired. If we are trying to determine how strong the insurgent movement in Iraq is, then our category corresponding to Political is “Political strength of the insurgency.”

Column C: This column contains the ID numbers for the questions. These numbers are the same as those in Column A in the first spreadsheet.

Column D: This column contains the weight for each question. If the user wishes, he can weight some questions more than others. If an affirmative answer to a particular question is exceptionally important to the results of the interpretation of the article, then that question can be weighted accordingly. For the sake of simplicity, our weights in Column D are all 1. Optionally, a question may be removed from the model by assigning a weight of 0.

Column E: This column contains a score for each question. The nominal score represents an expected value based on experience reading the news. For example, if a news article mentions the water supply, chances are that the information in the article will strengthen the insurgency. Why? News articles are biased towards dramatic events. If the water supply is adequate, newspapers probably are not going to report it. A story about the water supply would only be published if something is wrong. The scores range from 0 to 10. In our example, we are assessing the strength of the insurgency in Iraq. If a question has a high score, then relevance implies that the insurgency is quite strong. If a question has a low score, then relevance means that the insurgency is weak.
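To make the roles of the weight and the nominal score concrete, here is a minimal sketch of one way the two columns could be combined for a single article. The paper states only that Pacifica uses linear combination; the weighted-average formula, the function name article_score, and the weights and scores in MODEL are assumptions made for illustration.

```python
def article_score(relevant_ids, model):
    """Weighted average of nominal scores over the relevant questions.

    `model` maps question ID -> (weight, nominal_score). Returns None when
    no question with a non-zero weight is relevant to the article.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for question_id in relevant_ids:
        weight, score = model[question_id]
        total_weight += weight
        weighted_sum += weight * score
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical weights and nominal scores, for illustration only.
# Question 12 is switched off by its weight of 0.
MODEL = {4: (1, 8), 33: (1, 6), 12: (0, 9)}

print(article_score([4, 33, 12], MODEL))  # 7.0
```

With this formula, a question with weight 0 contributes nothing to the result, which matches the stated effect of removing it from the model.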

The third spreadsheet works independently from the previous two. It consists of a list of words and their associated scores. The words are chosen as indicators of the overall tone of the article. This spreadsheet is formatted as shown below:

The purpose of this spreadsheet is to assess the overall tone of an article. If any of the specified score words are present in the text of the article, then the corresponding scores are used to rate the article. For example, look at the 7th row of the above spreadsheet. The score words are “casualty” or “casualties.” If these words appear in an article, the article is most likely addressing a negative event. The corresponding score, 9, reflects the severity of the negativity. We are using a scale of 0-10 in this example. As the scores increase in magnitude, the implication becomes increasingly negative. However, this part of the model searches for positive implications, as well. For example, the words “justice” and “relief” both have corresponding scores of 2, implying that articles containing these words have a positive tone.
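The tone assessment can be sketched as follows. The three rows and their scores come from the example above; the function name tone_score, the whole-word matching, and the decision to average the scores of all matching rows are assumptions, since the paper does not say how multiple matches are combined.

```python
import re

# Each row pairs the score words of one spreadsheet row with its score.
SCORE_ROWS = [
    (("casualty", "casualties"), 9),
    (("justice",), 2),
    (("relief",), 2),
]


def tone_score(text, rows=SCORE_ROWS):
    """Average the scores of all rows that have a score word in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    hits = [score for row_words, score in rows if words.intersection(row_words)]
    if not hits:
        return None  # no score word present, so no tone estimate
    return sum(hits) / len(hits)


print(tone_score("The blast caused many casualties, but relief arrived quickly."))  # 5.5
print(tone_score("Markets reopened on Monday."))                                    # None
```

A row counts at most once per article here, however many of its words appear.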

Now that each part of the model has been explained, how is the process of creating a model completed? The development of a Pacifica model can be broken down into several distinct steps. First we identify an appropriate topic for the model. In the example model above, the topic is “Iraqi insurgency strength index.” The topic must be characterized quantitatively. After a suitable topic has been determined, we identify relevant, quantifiable questions. These questions all contribute to the condition or circumstance identified by the topic. Next, these questions are organized into a set of categories and subcategories, depicted by a tree diagram. An example of this diagram is included in the Appendix. Then we identify questions for each factor that, if answered positively, would maximize the value of the factor to the model. For each question, we must choose sets of keywords such that any article relevant to the question will contain at least one word in each set. We also assign each question a weight that reflects its importance in the model as well as a nominal score that reflects the impact on the model if a positive answer is assumed. Finally, we test our model on example inputs. If the results are not satisfactory, we refine the model.[6]

To further illustrate the process of creating a Pacifica model, let’s walk through the addition of a single question to our existing model on the Iraqi insurgency strength index. We want our question to relate to the subcategory “Impotence of anti-insurgency measures” under the category “Weakness of government leadership.” We identify our question to be “Is the government failing to control insurgent travel?” A positive answer to this question implies that the Iraqi insurgency strength index is high. Since our question is the 22nd question in the model, its ID number is 22. Here’s what the spreadsheet row looks like at this point:

Next we identify appropriate sets of keywords. In order for an article to be considered relevant to this question, at least one keyword from each set we define must be present in the text. Our first set of keywords includes the words: “insurgents, insurgency, rebels, rebel, rebellion, insurrection, resistance.” If an article contains at least one of these words, then it probably addresses some aspect of the insurgency. Our next set of keywords contains the words: “travel, airlines, airplanes, airports, trains, buses, bus, train, airplane, aeroplane, aeroplanes, transportation, road, roads, highway, highways.” If one of these words is present in the article, then transportation is probably being discussed. Our next set of keywords is: “government, leadership, leaders, leader, president, dictator, tyrant.” One of these words in the article’s text signifies that the government is being discussed. Our next set of keywords is: “control, measures, security, checkpoints, restriction, restrictions, limit, limits, confinement.” Any of these words indicates that some degree of restriction is being exercised. Our assumption is that the government is exercising control over the insurgents’ travel. The final set of keywords is used to determine if the government’s control over insurgent travel is inadequate: “inadequate, deficient, incapable, incompetent, unsatisfactory, poor, unlimited, unrestricted, unconfined, unsuccessful.” If an article contains at least one keyword from every one of these five sets, then that article is deemed relevant to the question. After the keywords have been added, the spreadsheet row looks like this:

After the sets of keywords have been determined, we must assign a weight to the question. If this question appeared multiple times in different categories, then the question would be weighted accordingly, but for the sake of simplicity, we have weighted all of our questions equally. Finally, we must assign a nominal score that reflects this question’s importance to the Iraqi insurgency strength index. On our scale of 0 to 10, a high score implies that an affirmative answer to a question benefits the insurgency’s strength, while an affirmative answer to a low-scored question is disadvantageous to the insurgency. If the government is failing to control insurgent travel, the insurgency’s strength is increased; for this reason, we have assigned the nominal score of 7. The corresponding spreadsheet row looks like this:
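As a rough sketch, the two rows for question 22 could be written out as tab-separated lines like this. The column order follows the descriptions of the first and second spreadsheets given earlier. Three things are assumptions: that keywords are space-separated within a cell (as in the question #4 example), the helper name to_tsv, and the indicator ID of 1 (Political), since the text gives only the category name for this question.

```python
import csv
import io

# First spreadsheet: ID, subcategory, question, then one column per keyword set.
QUESTION_ROW = [
    "22",
    "Impotence of anti-insurgency measures",
    "Is the government failing to control insurgent travel?",
    "insurgents insurgency rebels rebel rebellion insurrection resistance",
    "travel airlines airplanes airports trains buses bus train airplane "
    "aeroplane aeroplanes transportation road roads highway highways",
    "government leadership leaders leader president dictator tyrant",
    "control measures security checkpoints restriction restrictions "
    "limit limits confinement",
    "inadequate deficient incapable incompetent unsatisfactory poor "
    "unlimited unrestricted unconfined unsuccessful",
]

# Second spreadsheet: indicator ID, category, question ID, weight, nominal score.
# The indicator ID of 1 (Political) is an assumption made for this sketch.
WEIGHT_ROW = ["1", "Weakness of government leadership", "22", "1", "7"]


def to_tsv(row):
    """Serialize one row as a tab-separated line."""
    buffer = io.StringIO()
    csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow(row)
    return buffer.getvalue()


print(to_tsv(QUESTION_ROW), end="")
print(to_tsv(WEIGHT_ROW), end="")
```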

After the Pacifica model has been successfully created, the testing phase begins. If the initial selection of inputs fails to produce satisfactory results from the model, then the model must be refined. If relevant articles are not being flagged by the model, then the keywords for the individual questions probably need to be modified. If the model does not produce a desired piece of information, then possibly new questions should be added. Problems might occur if the chosen categories are not well organized or if the questions are inappropriate. A flowchart included in the Appendix summarizes the refinement process.
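When a relevant article is not flagged, it helps to see which keyword set failed to match. The following sketch reports the unmatched sets for question 22. The function name unmatched_sets and the sample article are hypothetical, and whole-word, case-insensitive matching is an assumption about how keywords are compared.

```python
import re


def unmatched_sets(keyword_sets, text):
    """Return the indexes of keyword sets that have no keyword in the text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return [
        index
        for index, keywords in enumerate(keyword_sets)
        if not words.intersection(keywords)
    ]


# The five keyword sets chosen for question 22 in the walkthrough above.
TRAVEL_SETS = [
    {"insurgents", "insurgency", "rebels", "rebel", "rebellion",
     "insurrection", "resistance"},
    {"travel", "airlines", "airplanes", "airports", "trains", "buses", "bus",
     "train", "airplane", "aeroplane", "aeroplanes", "transportation",
     "road", "roads", "highway", "highways"},
    {"government", "leadership", "leaders", "leader", "president",
     "dictator", "tyrant"},
    {"control", "measures", "security", "checkpoints", "restriction",
     "restrictions", "limit", "limits", "confinement"},
    {"inadequate", "deficient", "incapable", "incompetent", "unsatisfactory",
     "poor", "unlimited", "unrestricted", "unconfined", "unsuccessful"},
]

# Hypothetical article text that a reader would judge relevant.
ARTICLE = ("Rebels moved freely along the highway because government "
           "checkpoints were poorly staffed.")

print(unmatched_sets(TRAVEL_SETS, ARTICLE))  # [4]
```

Here only the last set fails: the article says “poorly,” which is not the keyword “poor” under whole-word matching, so adding “poorly” to that set would be one possible refinement.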

Summary

Perfect natural language technology does not yet exist, but Pacifica provides a flexible and straightforward way to interpret newspaper articles by way of a model of natural language questions. The Pacifica model’s three spreadsheets can easily be tailored to fit almost any desired assessment. Pacifica’s combination of natural language questions, sets of relevant keywords, and nominal scores allows for an automatic assessment of the situation of a country from a large volume of text-based sources, providing potentially invaluable information quickly and efficiently.

Appendix: Flowcharts for creating and refining a Pacifica Model.

We produced this tree diagram with Active Computing’s SimpleCIM tool, but any outlining tool will suffice.


[1] This work has been supported by funding originating from DARPA and managed by the US Army’s Communications Electronics Research and Development Command.

[2]

[3]

[4]

[5] Pacifica actually uses tab separated field files which easily convert to and from spreadsheets.

[6] A flowchart summarizing this process is shown in the Appendix.