
The Pitfalls of Accuracy and Precision in Sentiment Classification – Lessons from Medical Testing

Mark Frank – Department of Web Science – University of Southampton

Abstract

1. Introduction

In recent years sentiment analysis has been used to address a rapidly expanding range of increasingly important problems such as product design, purchasing decisions, investment decisions, and predicting security threats [1]. As sentiment analysis plays a more significant part in our practical lives, it becomes vital to be rigorous and realistic about what it can achieve and to avoid conceptual pitfalls often associated with stochastic methods. Medical testing is another stochastic method with elements in common with sentiment analysis; it has a much longer history, plays an even larger part in our practical lives, and sentiment analysis can learn from it. This paper applies one fundamental lesson from medical testing to that branch of sentiment analysis which we will call sentiment classification.

We define sentiment classification as that part of sentiment analysis concerned with classifying units of text according to their sentiment [2]. The performance of a sentiment classification algorithm is typically reported as one or more metrics derived from information retrieval, such as accuracy, precision, recall, the F score and occasionally the ROC curve or AUC. Each of these metrics has strengths and weaknesses and some of them can be very misleading, particularly in the sentiment classification context. The confusion matrix, and its associated assumptions, which underlies all of these metrics, is used in a wide range of fields including medical testing, where there is extensive experience of such performance measures. This paper highlights an issue associated with these performance measurements which is widely recognised in medical testing but frequently neglected in sentiment classification. Section 2 recapitulates the standard definitions of the most important metrics and defines some less familiar terms. Section 3 illustrates the problem. Section 4 assesses the significance of the problem and calls for reporting metrics such as recall (sensitivity) and specificity rather than accuracy or precision.

The examples and discussion in this paper are limited to sentiment classification of text into two alternatives such as approval and disapproval of a product, but the conclusions can easily be extended to multiple alternatives, and to sentiment analysis and information retrieval more generally.

2. Recapitulation of standard performance metrics

Most sentiment classification studies report one or more of three metrics: accuracy, precision and recall [3], although a large number of other metrics are possible [4].

Any sentiment classification algorithm, whether it be a simple search for key words or a sophisticated supervised algorithm, can be considered to be a test for the presence of one or more conditions (e.g. approval or disapproval of a product) in the textual unit of interest (e.g. a comment on a web site). In the context of this paper we will only consider the case where there are two conditions[1]. The result of applying this test leads to four possibilities. The condition may be true (the comment does indeed approve of the product) or false, and the test result may be positive (the algorithm predicts that the comment approves of the product) or negative. If TP, TN, FN and FP stand for the proportions of each of these four possibilities in the overall population, they can be expressed in a table (often called a confusion matrix):

                                              Has condition                 Does not have condition
                                              (e.g. comment does approve    (e.g. comment does not
                                              of the product)               approve of the product)

Passes test (e.g. test suggests
comment approves of the product)                      TP                            FP

Does not pass test (e.g. test suggests
comment does not approve of the product)              FN                            TN

An analyst evaluates an algorithm by testing it on a sample population where the presence of the condition is known by some other means (e.g. agreement among human assessors). This allows the analyst to determine the number of comments in each quadrant of the confusion matrix for that sample.

Then:

Accuracy = TP + TN, e.g. the proportion of comments that have been correctly classified by the test.

Precision = TP / (TP + FP), e.g. the proportion of comments classified as approving of the product by the test that did in fact approve of the product.

Recall = TP / (TP + FN), e.g. the proportion of comments that did in fact approve of the product that the test correctly classified.

None of these figures is superior to the others, although any two of them, together with the base rate, are sufficient to determine the third. Sometimes they are combined to give an overall assessment of the test, e.g. the F measure:

F = 2 × Precision × Recall / (Precision + Recall)

There are two other ratios which are routinely quoted in medical testing but are less common in information retrieval:

Specificity = TN / (TN + FP), e.g. the proportion of comments not approving of the product which the test correctly classified.

Base rate = TP + FN, e.g. the proportion of comments in the population that approve of the product.

Finally there is an important ratio that does not have a commonly accepted name. This paper refers to it as the apparent base rate = TP + FP, e.g. the proportion of comments that the algorithm predicts to be positive. This paper adopts the name apparent base rate because this ratio might at first glance appear to be a good estimate of the real base rate. However, this is deceptive and it can be a very poor estimate.
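
To make these definitions concrete, the following short Python sketch (illustrative only, not part of the original study; the function and variable names are our own) computes each of the ratios above, plus the F measure, from the four cells of a confusion matrix expressed as counts.

```python
# Illustrative sketch: computing the metrics defined above from the four
# cells of a confusion matrix. Counts are used here; dividing by the total
# gives the population proportions used in the text.

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)       # positive predictive value (PPV)
    recall = tp / (tp + fn)          # called sensitivity in medical testing
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
        "base_rate": (tp + fn) / total,           # true proportion of positives
        "apparent_base_rate": (tp + fp) / total,  # proportion predicted positive
    }

# Example with arbitrary illustrative counts.
print(metrics(tp=180, fp=20, fn=10, tn=40))
```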

In practice the parameters of a test are often adjustable so that recall can be traded off against specificity, e.g. by deciding what level of prostate specific antigen will be taken as indicative of prostate cancer. This is captured by plotting recall against 1 - specificity to create a receiver operating characteristic (ROC) curve, and the quality of the test can be summarised as a single value by reporting the area under the curve (AUC) [5].
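
As a rough illustration of how such a curve is produced (a sketch over assumed data, not a procedure taken from this paper), the snippet below sweeps a decision threshold over hypothetical classifier scores, records recall against the false positive rate (1 - specificity) at each setting, and approximates the AUC with the trapezoidal rule.

```python
# Sketch: building a ROC curve by sweeping the decision threshold over
# hypothetical classifier scores (higher score = more likely positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]   # 1 = actually positive

points = []
for threshold in sorted(set(scores), reverse=True):
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l)
    fp = sum(1 for p, l in zip(predicted, labels) if p and not l)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l)
    tn = sum(1 for p, l in zip(predicted, labels) if not p and not l)
    recall = tp / (tp + fn)   # true positive rate
    fpr = fp / (fp + tn)      # 1 - specificity
    points.append((fpr, recall))

# Approximate the area under the curve with the trapezoidal rule.
points = sorted(points + [(0.0, 0.0), (1.0, 1.0)])
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(round(auc, 3))
```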

3. A Problem with Common Metrics

Imagine that, as a sentiment analysis expert, you have developed an algorithm for detecting positive sentiment for your client’s product among comments on blog posts, and it’s a pretty good algorithm. On a realistic sample of 250 comments you got an accuracy of 88%, a precision of 90% and a recall of 95%. Now you use it for real on a much larger body of 10,000 comments. Your algorithm reports back that about 4000 comments (40%) were positive and 6000 negative. Your client is naturally a bit concerned about her product, but you can give yourself a pat on the back for being able to give her the full picture.

Or can you? How pleased would your client be to know that, applied to the real world, the accuracy of your algorithm has dropped to 69%, the precision has collapsed to 24%, and the number of positive comments was actually about 10%, not 40%? Yet this outcome is quite possible even though the algorithm is performing just as well on each individual comment as it was on the sample. In fact the only relevant change is that the base rate, the proportion of positive comments, has dropped. The change in performance comes from neglecting the importance of false positives, that is, comments that appear to be positive but are actually negative.

Given that there were 250 comments in the sample, with 88% accuracy, 95% recall and 90% precision, the confusion matrix for the sample must have looked like this:

                        actually positive    actually negative    total
reported positive              180                  20             200
reported negative               10                  40              50
total                          190                  60             250
  • Of the 250 comments 190 were actually positive and 60 were actually negative.
  • Of the 250 comments 220 were reported correctly, hence the 88% accuracy.
  • Of the 200 comments reported as positive 180 really were positive, hence the 90% precision.
  • Of the 190 comments that really were positive the algorithm caught 180 of them, hence the 95% recall.

However, note that the less commonly reported specificity – the proportion of negative comments that were correctly reported as negative – is still good but less impressive. In this case it is 40 out of 60, i.e. 67%.

Sentiment classification derived these measures from information retrieval and machine learning in general, which in turn borrowed them from the assessment of medical trials [3]. A confusion matrix like this would be familiar to a medical statistician, except that he would be using it in the context of a test for some underlying medical condition – such as a mammogram to test for breast cancer – and he would give some of the ratios different names. In medical statistics the recall would be called the sensitivity and the precision would be called the positive predictive value (PPV). And it is widely recognised in medical statistics that the PPV is dependent on the base rate – the underlying frequency of the condition. If breast cancer is very rare then even for a good test there will be a large number of false positives. Similarly, in sentiment classification, if the number of positive comments is low compared to the number of negative comments, then the number of negative comments reported as positive will be high relative to the number of true positives. This conclusion depends on two crucial assumptions:

  1. The sensitivity/recall does not vary with base rate.
  2. The specificity does not vary with base rate.

These assumptions need not hold for all confusion matrices, but they are not controversial in medical testing and similar reasoning applies to sentiment analysis. The probability of a patient with breast cancer having a positive mammogram is not affected by the frequency of breast cancer in other patients. A sentiment classification algorithm may well be affected by the frequency of positive comments across the whole sample while in development. But once developed, the probability of a positive comment being reported as positive is not typically affected by the frequency of positive comments among the rest of the population.

If assumptions 1 and 2 are true, then when the test is repeated in the real world the recall should not change unless the nature of the domain changes, nor should the specificity. However, the precision will change with changing base rate. So in the example above the real-world confusion matrix would be like this:

                        actually positive    actually negative    total
reported positive              947                3000            3947
reported negative               53                6000            6053
total                         1000                9000           10000

The recall is 947/1000, i.e. still approximately 95%. The specificity is 6000/9000, i.e. still 67%. But the base rate has changed dramatically from 190/250 (76%) to 1000/10000 (10%). As a result the precision has changed from 90% to 947/3947, i.e. 24%, and the accuracy has changed to 6947/10000, i.e. 69%. Furthermore, in the sample, the test would have reported that 200 out of 250 comments were positive – very close to the right answer, which is 190. In the real case the test would have reported that 3947 out of 10000 comments were positive when the right answer is actually only 1000. The apparent base rate is more than three times the actual base rate. In general the real base rate will be more extreme than the apparent base rate, and the lower the sensitivity or specificity, the worse the distortion. Figures 1 and 2 illustrate the effect of changing base rate, sensitivity and specificity on the apparent base rate. Figures 3 and 4 illustrate the effect of changing base rate on precision and accuracy.
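
The arithmetic of this projection can be sketched in a few lines of Python (an illustration using the figures from the example above): holding recall and specificity fixed and varying only the base rate reproduces the collapse in precision and accuracy.

```python
# Sketch: projecting the sample's recall and specificity onto a population
# with a different base rate (assumptions 1 and 2 above).

def project(n, base_rate, recall, specificity):
    pos = n * base_rate                 # actually positive
    neg = n - pos                       # actually negative
    tp = recall * pos
    fp = (1 - specificity) * neg
    tn = specificity * neg
    return {
        "reported_positive": tp + fp,
        "apparent_base_rate": (tp + fp) / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / n,
    }

# Recall 180/190 and specificity 40/60 from the sample, applied to
# 10,000 comments with a 10% base rate.
print(project(10_000, 0.10, 180 / 190, 40 / 60))
```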

Of course this discrepancy depends on there being a big change in the base rate. However, it is quite common in sentiment classification to apply the same test to different sources with very different base rates. In a frequently cited paper, Dave et al [6], when developing algorithms for classifying product reviews, used reviews from Cnet for different types of products where the proportion of positive reviews ranged from 57% to 90%. In a recent example, Lane et al [7] worked on three sets of magazine articles about different organisations, provided by the same media organisation. The base rate of positive attitude as assessed by human observers varied from 22% to 57%.

Once the analyst is aware of the danger, provided the recall (sensitivity) and specificity are known, it is a straightforward mathematical procedure to adjust estimates to allow for false positives. However, it is common not to report both of these ratios, and there are aspects of sentiment classification that exacerbate the problem. It is a common procedure in sentiment classification to conduct the analysis in two stages using two different algorithms. First the units of text are analysed to determine which ones contain opinions and which do not. Then those which are classified as having opinions are processed to classify that opinion as positive or negative. Typically the results are presented as a percentage of those units of text which have an opinion, for example 30% of those comments which had an opinion on the product were positive. This means that effectively there are two tests, each of which has its own recall and specificity. The first test, for having an opinion, will produce false positives and, significantly, false negatives. There may well be texts that had opinions that were not picked up by the algorithm and did not make it to the second stage. Even if the algorithm has excellent accuracy, precision and recall, if the percentage of texts that were rejected as having no opinion is high, then the number that were wrongly rejected can be large, perhaps very large, compared to the number that were classified as having an opinion. And there is no reason to suppose that this population of opinionated texts has the same characteristics as the population that were correctly identified and then analysed.
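
A small numerical sketch (with entirely hypothetical stage-1 figures) shows how quickly wrongly rejected texts can accumulate before the second stage ever sees them.

```python
# Sketch with hypothetical numbers: stage 1 filters texts for "has an
# opinion"; only texts it passes reach the positive/negative classifier.
n = 10_000
opinion_base_rate = 0.30      # assumed true proportion of opinionated texts
stage1_recall = 0.85          # assumed stage-1 recall (sensitivity)
stage1_specificity = 0.90     # assumed stage-1 specificity

opinionated = n * opinion_base_rate
passed_correctly = stage1_recall * opinionated            # opinionated and passed
wrongly_rejected = opinionated - passed_correctly         # opinionated but never seen by stage 2
wrongly_passed = (1 - stage1_specificity) * (n - opinionated)

print(f"texts passed to stage 2:           {passed_correctly + wrongly_passed:.0f}")
print(f"opinionated texts lost at stage 1: {wrongly_rejected:.0f}")
```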

A more fundamental and intractable problem for sentiment analysis is measuring the real opinion. In medical statistics the characteristics of a test, such as its sensitivity and specificity, are determined by comparing the test results to a gold standard which is assumed to reveal whether the underlying condition is really present. So the quality of a mammogram can be determined by comparing its results with a sample of women who have a biopsy, which is the gold standard. However, in sentiment classification gold standards are hard to come by. Sometimes a dataset will be available that gives an algorithm access to a more reliable measure of opinion. For example, sometimes users commenting on a product will write comments and also give the product a rating such as a number of stars out of five. This provides useful training data and an objective way of determining the algorithm's accuracy, precision and recall. But it also implies that when the algorithm is used for real it will be on a different set of data – otherwise it would be simpler to measure the ratings directly. Alternatively the sentiment of a sample of text units may be classified by observers and that sample used for training and for determining the quality of the algorithm. But observers often disagree and are by no means certain to classify the sentiment of the text correctly. All of which implies that the reported accuracy, precision and recall are themselves open to question.

4. Why It Matters

Performance metrics exist to give an objective indication of the quality of an algorithm. A good metric needs to be relevant (measuring something users care about) and consistent (giving similar results across the different situations in which the algorithm is likely to be used). The commonly used metrics have weaknesses in one or other of these respects.

Accuracy may sound like an attractive option. It seems very reasonable to report the proportion of predictions that were correct. However, in many circumstances it can fail in relevance. It is not uncommon for a source of text to have an overwhelming preponderance of one condition; for example, when commenting on services provided by the charity sector the great majority of texts are likely to be positive. In cases like this it is possible to get a very high accuracy simply by predicting that all texts will be positive. It is equivalent to having a “test” for breast cancer that diagnoses all patients as not having breast cancer. It will be right the vast majority of the time, but it is not relevant. Although accuracy does not change as much as precision in the face of changing base rates (see Figure 4), it is still significantly inconsistent, particularly if either the sensitivity or specificity is mediocre.

Precision is sometimes relevant but unlikely to be consistent. It is equivalent to asking: if the mammogram indicates a problem, what are the chances the patient has breast cancer? Also relevant, but rarely quoted in the sentiment classification literature, is the negative predictive value (NPV). This is equivalent to asking: if the patient gets the all clear from the mammogram, what are the chances the patient does not have breast cancer? Both precision and NPV are highly relevant in a medical context where the focus is on the individual situation, less so in a sentiment classification context. A marketing manager is more likely to be interested in a best estimate of the overall number of positive reviews. In any case both precision and NPV are extremely vulnerable to a change in base rate and are thus likely to be inconsistent in practice (see Figure 3).

Recall (sensitivity) and specificity are not vulnerable to changes in base rate and are thus far more likely to be consistent within a given domain than the other metrics. They do, however, appear to be of limited relevance. They report the probability of a prediction given knowledge of the underlying condition. In reality it is the prediction that is known and the task is to determine the condition. However, given the recall (sensitivity), the specificity and the apparent base rate it is possible to give an unbiased estimate of the real base rate and thus the precision, NPV and accuracy for that particular body of text. Recall and specificity can be seen as measures of the quality of an algorithm in a given domain. Base rate is a property of a set of texts from that domain. Precision and NPV are the results of applying an algorithm to that set of texts.
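
A sketch of that adjustment (assuming, as above, that recall and specificity carry over to the new body of text): inverting apparent base rate = b × recall + (1 - b) × (1 - specificity) gives an estimate of the real base rate b, from which the other metrics follow.

```python
# Sketch: recovering the real base rate from the apparent base rate,
# given recall (sensitivity) and specificity, by inverting
#   apparent = b * recall + (1 - b) * (1 - specificity)

def estimate_base_rate(apparent, recall, specificity):
    return (apparent + specificity - 1) / (recall + specificity - 1)

# Worked example from Section 3: apparent base rate 3947/10000,
# recall 180/190, specificity 40/60 -> real base rate of about 0.10.
print(estimate_base_rate(0.3947, 180 / 190, 40 / 60))
```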

But does this matter in practice? The vast majority of the sentiment classification literature reports one or more of the three measures: accuracy, precision and recall. If all three ratios are reported then it is possible to calculate the specificity as well, but this is unusual. It is quite common, even in some of the most cited papers in the area, to quote accuracy only [6], [8–11]. Dave et al justify using accuracy on the grounds that other values such as precision and recall did not seem to provide any better differentiation than simple accuracy. While this may be true for their algorithms in their context, providing accuracy alone makes it almost impossible to assess the likely performance of the algorithm in any other situation with a different base rate. It also makes it impossible to estimate the real base rate from the apparent base rate.