Robust Semantic Role Labeling for Nominals
Robert Munro / Aman Naimat
Department of Linguistics / Department of Computer Science and School of Business
Stanford University / Stanford University
Abstract
We designed a semantic role labeling system for nominals that consistently outperformed the current state-of-the-art system. Focusing on the core task of classifying the known arguments of nominals, we devised a novel set of features that modeled the syntactic context and animacy of the nominals, improving on the F1 of the current state-of-the-art system by 0.012 on the NomBank corpus and, most significantly, by 0.033 on test items with unseen predicate/headword pairs. This corresponds to a reduction in error of approximately 10% and 15% respectively.
1 Introduction
Semantic role labeling (SRL) for verbs is an established task in NLP, and a core component of tasks like Information Extraction and Question-Answering (Gildea & Jurafsky 2002, Carreras & Màrquez 2005, Toutanova, Haghighi & Manning 2005), but semantic role labeling for nominals has received much less attention (Lapata 2002, Pradhan et al 2004, Liu & Ng 2007). In general, nominal-SRL has proved to be a more difficult task than verb-SRL. Features that were successful for verb-SRL have not always produced significant results for nominal-SRL, and in general the error rates for nominal-SRL have been at least twice as high (Pradhan et al 2004, Jiang & Ng 2006).
SRL is typically divided into two tasks: identifying the arguments of a predicate, and classifying those arguments according to their semantic role. These are known as semantic role identification and semantic role classification respectively. The two are typically tested both independently and in combination. The best reported results for both tasks and for their combination are given in (Liu & Ng 2007), tested on the NomBank corpus (Meyers et al 2004). Constrained by time, in this paper we focus solely on semantic role classification, comparing our results directly to those of (Liu & Ng 2007).
1.1 SRL for nominals
It is easy to demonstrate why nominal-SRL is a more complicated task than verb-SRL:
1) [the police agt] [investigated pred] [the crime pat]
2) [the crime pat] was [investigated pred] by [the police agt]
3) [the police pat] were [investigated pred] by [the governor agt]
Examples (1-3) show semantic roles predicated on a verb, and all three are unambiguous. For the predicate investigate, the Subject of an active sentence is the Agent, and the Object is the Patient. For a passive sentence, this is reversed. Provided a verb-SRL system models active/passive sentences, this is uncomplicated. Compare these to some equivalent nominalizations:
4) [The police agt] filed the report after 3 weeks, causing the governor to declare the [investigation pred] closed.
5) [The investigation pred] took 3 weeks.
6) [the crime’s pat] [investigation pred] ...
7) [the police’s agt] [investigation pred] ...
8) [The investigation pred] of [the police agt/pat??] took 3 weeks.
Example (4) shows that an argument may be realized some distance from the predicate; example (5) shows that arguments are not obligatorily subcategorized by the predicate; and examples (6-8) show that different roles may be realized in the same syntactic position, and can be inherently ambiguous. Therefore, successful nominal-SRL is a more difficult task than verb-SRL.
1.2 Our contribution
We report on new features, and interactions of features, that consistently improved the accuracy of semantic role classification for nominals.
In particular, we demonstrate that the new features modeling the syntactic context of the nominals improve the accuracy of semantic role classification systems, especially over unseen predicate/argument pairs, and report on the distribution of such structures across the labels in NomBank.
From some basic strategies to model a predicate’s arguments holistically, we find that modeling the relative animacy of the arguments improves the classification of their roles, especially in combination with other features.
We report on the influence of training-set size on classification accuracy, and the relative success in classifying unseen predicate/argument pairs.
2 NomBank
The NomBank corpus uses the PropBank set of labels (Palmer et al., 2005) to annotate all arguments of nominals in the Wall Street Journal corpus. It differs from PropBank in two ways. Firstly, an argument may overlap the predicate. For example, investigator is both the predicate and its own Agent. Secondly, it is possible for an argument to realize two roles. For example, truckdriver contains both the Agent driver and the Patient truck.
There are 20 labels in NomBank: ARG0, ARG1, ARG2, ARG3, ARG4, ARG5, ARG8, ARG9, ARGM-ADV, ARGM-CAU, ARGM-DIR, ARGM-DIS, ARGM-EXT, ARGM-LOC, ARGM-MNR, ARGM-MOD, ARGM-NEG, ARGM-PNC, ARGM-PRD, and ARGM-TMP.
This results in sentences like:
[The police’s arg0] [investigation pred] [of a crime arg1] [in Boston argm-loc] [took 3 weeks argm-tmp].
We followed the standard split of the corpus, training on sections 2-22, validating on section 24, and testing on section 23.
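The split described above can be written down as a small lookup. This is only a sketch: the function name and the string labels are illustrative assumptions, and only the section numbers come from the text.

```python
from typing import Optional

# WSJ section numbers for each split, as stated in the text.
TRAIN_SECTIONS = set(range(2, 23))  # sections 2-22
DEV_SECTIONS = {24}                 # validation
TEST_SECTIONS = {23}                # final testing


def split_for_section(section: int) -> Optional[str]:
    """Map a WSJ section number to 'train', 'dev' or 'test' (None if unused)."""
    if section in TRAIN_SECTIONS:
        return "train"
    if section in DEV_SECTIONS:
        return "dev"
    if section in TEST_SECTIONS:
        return "test"
    return None  # e.g. sections 0 and 1 are not used in this split
```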
3 Maximum entropy classifier
Maximum entropy (MaxEnt) classifiers have been the staple for nominal-SRL (Jiang & Ng 2006). More sophisticated discriminative learning algorithms have also been used, including multitask linear optimizers and alternating structure optimizers, but they have not significantly improved the results for semantic role classification over MaxEnt classifiers (Liu & Ng 2007), and so we used the Stanford MaxEnt classifier. The objective function was:
where:
and the derivatives of the log likelihood correspond to:
The objective function was smoothed by assuming a Gaussian distribution, and penalized accordingly:
where the derivatives were adjusted by:
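As a point of reference only, the standard textbook formulation of a conditional MaxEnt model with a Gaussian prior is sketched below. The notation is an assumption of this sketch (features f_i, weights \lambda_i, classes c, data items d, prior variance \sigma^2) and may not match the notation of the equations intended at this point.

```latex
% Conditional log-likelihood of the labeled training data (C, D)
\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)

% Class probability: a normalized exponential of the weighted features
P(c \mid d, \lambda) =
  \frac{\exp \sum_i \lambda_i f_i(c, d)}
       {\sum_{c'} \exp \sum_i \lambda_i f_i(c', d)}

% Derivative: empirical feature count minus expected feature count
\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i}
  = \sum_{(c,d) \in (C,D)} f_i(c, d)
  - \sum_{(c,d) \in (C,D)} \sum_{c'} P(c' \mid d, \lambda)\, f_i(c', d)

% Zero-mean Gaussian prior on each weight adds a quadratic penalty
\log P(C, \lambda \mid D)
  = \log P(C \mid D, \lambda) - \sum_i \frac{\lambda_i^2}{2\sigma^2} + \text{const}

% The penalty adjusts each partial derivative by
-\frac{\lambda_i}{\sigma^2}
```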
Brief experimentation with Naïve Bayes and KNN classifiers produced much less accurate results.
For future work, it would be interesting to compare our results across a greater number of classifiers, and with joint learning of the labels (Toutanova et al. 2005).
Figure 1: The relative frequency of different roles in different sentence syntactic positions.
4 Data analysis
No previous attempt at modeling the semantic roles of nominals has looked at the broader syntactic context of the constructions. Almost immediately, our data analysis revealed strong tendencies for different roles to be realized in different sentence positions.
We found strong likelihoods for certain roles to appear when the argument is realized in the Subject, Object or Adjunct position of a sentence. For example:
The police’s [investigation pred] took 3 weeks. (Subject)
They reported the police’s [investigation pred] (Direct Object)
The case was closed after the police’s [investigation pred] (Adjunct)
Figure 1 gives the distributions of sentence positions for each role type in the training data. With the exception of the ARG0/ARGM distinction for other positions, within each sentence position, the results are significant to p < 0.001 by χ².
It is clear from the graph that the difference is important. An argument in the Subject position is 55% likely to be an ARG0, only 10% likely to be one of the ARG2+ roles and not at all likely to be one of the ARGM roles. For a Direct Object, on the other hand, an ARGM is almost 50% likely, and for other syntactic positions ARG1 is >50% likely.
When we ran the baseline features, sentences with possessives were often misclassified, such as:
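Distributions of this kind can be tabulated from (sentence position, role) pairs, and a Pearson chi-square statistic computed over the resulting contingency table. The sketch below is illustrative: the function names are assumptions, and the exact design of the significance test used for Figure 1 is not specified in the text.

```python
from collections import Counter, defaultdict


def count_table(pairs):
    """Count roles per sentence position from (position, role) pairs."""
    counts = defaultdict(Counter)
    for position, role in pairs:
        counts[position][role] += 1
    return counts


def role_distribution(counts):
    """Relative frequency of each role within each sentence position."""
    return {
        pos: {role: n / sum(c.values()) for role, n in c.items()}
        for pos, c in counts.items()
    }


def chi_square(counts):
    """Pearson chi-square statistic for a position-by-role contingency table."""
    rows = list(counts)
    cols = sorted({role for r in rows for role in counts[r]})
    row_tot = {r: sum(counts[r].get(c, 0) for c in cols) for r in rows}
    col_tot = {c: sum(counts[r].get(c, 0) for r in rows) for c in cols}
    total = sum(row_tot.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = row_tot[r] * col_tot[c] / total
            if expected > 0:  # skip empty rows/columns
                stat += (counts[r].get(c, 0) - expected) ** 2 / expected
    return stat
```

The statistic would then be compared against a chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom to obtain a p-value.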
[P&G's arg0] [share pred] of the Japanese market
This led us to create a feature modeling whether the constituent was a possessive.
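A minimal sketch of such a feature is given below, assuming the constituent is available as a list of tokens. The pronoun list and the surface-form test are assumptions of this sketch; the text does not say how possessives were detected, and a parser's possessive tags could be used instead.

```python
# Assumed list of possessive pronouns; "her" is ambiguous with the object form.
POSSESSIVE_PRONOUNS = {"my", "your", "his", "her", "its", "our", "their"}


def is_possessive(tokens):
    """True if the constituent is a possessive pronoun or ends in a possessive marker."""
    if not tokens:
        return False
    last = tokens[-1].lower()
    # Treebank-style tokenization: the clitic is a separate token.
    if last in ("'s", "’s", "'", "’"):
        return True
    # Clitic still attached to the word, e.g. "P&G's" or "investors'".
    # Note: this surface test would also match contractions such as "it's".
    if last.endswith(("'s", "’s", "s'", "s’")):
        return True
    return len(tokens) == 1 and last in POSSESSIVE_PRONOUNS
```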
Figure 2: The relative frequency of possessives realizing different types of named entities.
The animacy of constituents is also correlated with the distribution of semantic roles, as more animate constituents will tend to more frequently realize agents, and therefore ARG0s. More general knowledge of named entities will also help identify locations, and therefore ARGM-LOCs.
We defined features based on animacy of the constituent using gazetteers of named entities and pronouns, and modeled an order of animacy from people to objects, based on the presence of known entities, proper nouns or nouns in a constituent.
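The animacy ordering might be sketched as follows. The miniature gazetteers, the Penn Treebank tag tests and the exact ordering of the middle categories are assumptions made for illustration; real gazetteers would be far larger.

```python
# Hypothetical miniature gazetteers (lowercased entries).
PERSON = {"he", "she", "they", "governor", "police"}
ORGANIZATION = {"p&g", "ibm"}
LOCATION = {"boston", "japan"}

# Ordered from most to least animate; the ordering is an assumption.
ANIMACY_SCALE = ["person", "organization", "location", "proper_noun", "noun"]


def animacy(tokens, pos_tags):
    """Return the most animate category matched by any token of the constituent."""
    found = set()
    for token, tag in zip(tokens, pos_tags):
        word = token.lower()
        if word in PERSON:
            found.add("person")
        elif word in ORGANIZATION:
            found.add("organization")
        elif word in LOCATION:
            found.add("location")
        elif tag in ("NNP", "NNPS"):
            found.add("proper_noun")  # unknown proper noun
        elif tag.startswith("NN"):
            found.add("noun")
    for category in ANIMACY_SCALE:
        if category in found:
            return category
    return "noun"  # fallback when nothing matched
```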
Figure 2 shows the intersection of the animacy and possessive features, clearly showing that possessives realizing a person or organization are more likely to take the ARG0 role.
The semantic roles of a given predicate are highly interdependent (Jiang et al. 2005, Toutanova et al. 2005). For example, if we know that an ARG0 already exists for a predicate, then the chances of another ARG0 for that predicate are greatly reduced. Because of time constraints, we did not implement the joint learning of (Toutanova et al. 2005), which gives the best reported results for verb-SRL. However, we did define features that took into account all arguments for a given predicate, capturing at least part of the interdependency. Using the animacy feature, we included a feature to indicate whether a given argument was the (equal) most animate, medially animate, or least animate of the arguments for the current predicate. We also included the total number of arguments for the given predicate, as predicates with only one argument were more likely to realize either ARG0 or ARG1, while predicates with a high number of arguments were evidence that a low-frequency role might be present.
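The predicate-level features described above could be computed along these lines. This is a sketch under assumptions: the animacy categories and their ordering are illustrative, and ties for most animate are all labeled "highest", which is one reading of "(equal) most animate".

```python
# Assumed animacy categories, ordered from most to least animate.
ANIMACY_SCALE = ["person", "organization", "location", "proper_noun", "noun"]


def relative_animacy(arg_animacies, index):
    """Label argument `index` as 'highest', 'medial' or 'lowest' among its co-arguments."""
    ranks = [ANIMACY_SCALE.index(a) for a in arg_animacies]  # 0 = most animate
    rank = ranks[index]
    if rank == min(ranks):
        return "highest"  # includes ties for most animate
    if rank == max(ranks):
        return "lowest"
    return "medial"


def predicate_level_features(arg_animacies, index):
    """Features that depend on all arguments of the current predicate."""
    best_rank = min(ANIMACY_SCALE.index(a) for a in arg_animacies)
    return {
        "relativeanimacy": relative_animacy(arg_animacies, index),
        "numargs": len(arg_animacies),
        "highestanimacy": ANIMACY_SCALE[best_rank],
    }
```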
1 / position: whether the constituent is to the left, right or overlaps with the predicate
2 / ptype: phrase type of the constituent C
3 / firstword: first word spanned by C
4 / lastword: last word spanned by C
5 / ptype.r: phrase type of right sister
6 / nomtype: the NOM-TYPE of predicate, as defined by the NOMLEX dictionary
7 / predicate & ptype
8 / predicate & lastword
9 / predicate & headword
10 / predicate & position
11 / nomtype & position
Table 1. Baseline features (‘&’ indicates an interaction feature)
5 Features
We used the features of (Liu and Ng 2007) as a baseline, as these produced the previous highest-reported results on this task. These features are given in Table 1, and give the results labeled L&N in the Results section.
Table 2 gives the new features that we developed based on the observations discussed in the previous section. As the interaction of features is vital to accurate classification, we defined a number of interaction features. In the final set of features, we had more interaction features than regular ones.
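One common way to realize an interaction feature for a classifier over categorical features is to conjoin the values of the participating features into a single new feature. The sketch below assumes this approach; the function name and the dictionary representation are illustrative, and the text does not state how interactions were encoded for the classifier.

```python
def add_interactions(features, interactions):
    """Return a copy of `features` with one conjoined feature per listed combination."""
    out = dict(features)
    for names in interactions:
        key = " & ".join(names)
        # The conjoined value is distinct for every combination of component values.
        out[key] = "|".join(str(features[name]) for name in names)
    return out


# Usage with made-up values for one constituent:
base = {
    "predicate": "investigation",
    "position": "left",
    "possessive": True,
    "animacy": "person",
}
expanded = add_interactions(
    base,
    [("predicate", "position"), ("possessive", "position", "animacy")],
)
```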
The final results report the use of features 1-17. Many of the features that were not significant on their own, like subject and possessive, were significant when interacting with other features. It is also likely that the correlation between subject and possessive, and the interactions they take part in, somewhat masked their individual significance, for the reasons described in Section 5.1.
The biggest surprise among the features that were not significant, either alone or in combination, was sentposition. While it was very significant to model whether an argument was in a certain syntactic position, including its position in the sentence made no difference. This supports the theory that it is the explicit interaction of syntactic properties that produces the distributions in Figure 1.
5.1 Processing for significance
Testing all possible feature combinations is prohibitively expensive, so we devised two ways to test the significance of each feature, leaving a greater study of significance as future work.
Features sig to final classification: / sig.
12 / relativeanimacy: the animacy of the constituent relative to the animacy of the other arguments for the current predicate: (equal)highest/medial/lowest. / 0.002
13 / head: the headword of the constituent / 0.001
14 / subject & position / 0.002
15 / predicate & animacy & subject / 0.006
16 / possessive & nomtype / 0.001
17 / possessive & position & animacy / 0.001
Features not sig to final classification:
18 / subject: whether the constituent is in the sentence subject position / 0.001
19 / possessive: whether the constituent is a possessive (eg, our, her, its, -’s) / 0
20 / sentposition: whether the constituent is in the topic position, the 1st/2nd half of the sentence, or is sentence final / 0
21 / animacy: whether the constituent is a person, organization, location, other (unknown) proper noun or a noun. / 0
22 / predsyn: syntactic category of the predicate / 0
23 / numargs: the number of arguments for the current predicate / 0.004
24 / modifies: whether the constituent is a premodifier of the predicate / 0
25 / highestanimacy: the highest animacy of all the arguments for the current predicate / 0.003
26 / sentposition & position / 0
27 / pred & animacy / 0.004
28 / nomtype & animacy / 0
29 / possessive & position / 0
30 / numargs & animacy / 0.004
31 / modifies & animacy / 0
32 / modifies & sentposition / 0
Table 2. Novel features and interactions
First, we iteratively removed each feature from the training data and looked at the change in accuracy. If the removal of a feature did not result in a significant change in accuracy, then we considered it to be not significant to the final classification. This gives the division of features in Table 2.
We expected that some of the non-significant features would be good indicators of semantic roles, but would be correlated strongly enough with other features for there to be no gain from their inclusion. Second, we therefore also looked at the significance of adding each new feature to the baseline features. The resulting increase in overall F1 for each feature is given in the sig. column of Table 2.
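The two procedures can be sketched as follows. Here `evaluate` is a hypothetical callback that trains a classifier on the given feature list and returns a score such as F1, and `threshold` is a hypothetical cut-off for what counts as a significant change; neither is specified in the text.

```python
def ablation_significance(all_features, evaluate, threshold):
    """Drop each feature in turn; flag those whose removal changes the score by >= threshold."""
    full_score = evaluate(all_features)
    significant = {}
    for feature in all_features:
        reduced = [f for f in all_features if f != feature]
        significant[feature] = abs(full_score - evaluate(reduced)) >= threshold
    return significant


def additive_gain(baseline, candidates, evaluate):
    """Gain in score from adding each candidate feature, alone, to the baseline set."""
    base_score = evaluate(baseline)
    return {
        feature: evaluate(baseline + [feature]) - base_score
        for feature in candidates
    }
```

The first function needs one retraining per feature and the second one per candidate, which is far cheaper than testing all possible feature combinations.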
Figure 3. Accuracy on test items with increasing training set sizes
6 Results
The ARG0 and ARG1 labels, roughly corresponding to the Agent and Patient roles, made up the majority of the examples and were more easily classified than the other labels. We therefore compared the overall F1 values to the F1 values for ARG0 and ARG1 (ARG0,1), and to the F1 values for all other labels (ARG2+).
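F1 restricted to a group of labels can be computed as in the sketch below. It assumes micro-averaging over the labels in the group, which the text does not specify, and the function name is illustrative.

```python
def f1_for_labels(gold, predicted, labels):
    """Micro-averaged F1 over the given set of labels."""
    labels = set(labels)
    # Correct predictions whose label belongs to the group.
    true_pos = sum(1 for g, p in zip(gold, predicted) if g == p and g in labels)
    n_predicted = sum(1 for p in predicted if p in labels)
    n_gold = sum(1 for g in gold if g in labels)
    if true_pos == 0:
        return 0.0
    precision = true_pos / n_predicted
    recall = true_pos / n_gold
    return 2 * precision * recall / (precision + recall)
```

Calling it with {"ARG0", "ARG1"} gives the ARG0,1 score, and calling it with the remaining labels gives the ARG2+ score.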
Figure 3 gives the results for different training set sizes, comparing our results to those of the baseline features (L&N), which give the current state-of-the-art performance (Liu & Ng 2007).
The results show that we consistently outperform the baseline, especially among the less frequent items. Our F1 over the full set was 0.884, with 0.902 on ARG0,1 and 0.847 on ARG2+. This corresponds to an increase over the L&N results of 0.012, 0.009 and 0.019 respectively. While this beats the current state-of-the-art results, the margin is modest. Nonetheless, it is a significant increase in accuracy when we take into account the consistency over different training set sizes. The difference between our L&N results and those reported in (Liu & Ng 2007) is negligible, and is probably the result of a slightly different MaxEnt algorithm and/or NomBank version.
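The F1 gains above can be related to a relative reduction in error with a short calculation. This sketch assumes that "error" means 1 - F1 and that the baseline F1 is our F1 minus the reported gain; the function name is illustrative.

```python
def relative_error_reduction(new_f1, gain):
    """Fraction of the baseline error (1 - baseline F1) removed by the gain."""
    baseline_f1 = new_f1 - gain
    return gain / (1.0 - baseline_f1)


# F1 and gain over L&N, as reported in the text.
overall = relative_error_reduction(0.884, 0.012)
arg01 = relative_error_reduction(0.902, 0.009)
arg2_plus = relative_error_reduction(0.847, 0.019)
```

Under these assumptions the overall reduction is approximately 9.4%, consistent with the figure of about 10% quoted in the abstract, and the reduction on ARG2+ is approximately 11%.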