Chapter 2-6. More on Levels of Measurement
Summing Dichotomous-Scaled Variables or Ordinal-Scaled Variables Produces an Interval-Scaled Variable
Almost all standardized tests, such as the Zung Self-Rating Depression Scale (Zung, 1965), are made up of several ordinal-scale items, and a total score is derived by summing the item scores.
Zung's scale, for example, is made up of 20 items, each scored from 1 to 4. The first item is:
I feel down-hearted and blue
(1) A little of the time
(2) Some of the time
(3) Good part of the time
(4) Most of the time
A total score is then computed by summing the scores from the 20 items, giving a range from 20 to 80.
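As a minimal sketch of how such a total score could be computed in Stata (the item variable names zung1-zung20 are hypothetical, not from the actual instrument file):

* Sketch: sum 20 ordinal item scores (hypothetical variables zung1-zung20,
* each scored 1 to 4) into a single total score.
egen zung_total = rowtotal(zung1-zung20)
summarize zung_total    // ranges from 20 to 80 when all 20 items are answered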
It is widely accepted by measurement theory experts that these total scores, or totals from subsets of the items, are sufficiently close to interval scales to be analyzed as such, while individual items should be treated as ordinal scales.
Two of the best-known measurement theory experts, Nunnally and Bernstein (1994, p.16), comment,
“Whereas there is usually little dispute over whether nominal or ordinal properties have been established, there is often great dispute over whether or not a scale possesses a meaningful unit of measurement. Formal scaling methods designed to this end are discussed in Chapters 2, 10, and 15. For now, it suffices to note that many measures are sums of item responses, such as conventionally scored multiple-choice, true-false, and Likert scale items. Data from individual items are clearly ordinal. However, the total score is usually treated as interval, as when the arithmetic mean score, which assumes equality of intervals, is computed. Those who perform such operations thus implicitly use a scaling model to convert data from a lower (ordinal) to a higher (interval) level of measurement when they sum over items to obtain a total score. Some adherents of Stevens’ position have argued that these statistical operations are improper and advocate, among other things, that medians, rather than arithmetic means should be used to describe conventional test data. We strongly disagree with this point of view for reasons we will note throughout this book, not the least of which is the results of summing item responses are usually indistinguishable from using more formal methods. However, some situations clearly do provide only ordinal data, and the results of using statistics that assume an interval can be misleading. One example would be the responses to individual items scored on multi-category (Likert-type) scales.”
A refinement of this idea
Although summing individual items to produce an interval scale is widely accepted, a little thought reveals a problem.
For example, suppose you give the following list of tasks to someone who has resumed bearing weight on the leg after a hip replacement operation:
1. Stand up from a sitting position
2. Walk from room to room in own house using a cane or walker
3. Walk from room to room in own house unassisted
4. Walk up one flight of stairs
5. Run on treadmill for 5 minutes
If you sum up the number of tasks completed to get a total score, is it really an interval scale? The problem is that the tasks do not have the same level of difficulty, so the sum will not strictly have equal intervals. To make this a true interval scale, you need to weight the items by level of difficulty. An excellent way to assign weights is the Rasch model, which is widely used in measurement development. An excellent textbook on applying this method is Bond and Fox (2007).
Referring to the common practice of scoring the number of items answered correctly on a school test, such as a math test, or of expressing this count as a percentage, to measure the student's ability, Bond and Fox (2007, p.21) regard such scores as only ordinal,
“…The routine procedure in education circles is to express each of these n/N fractions as a percentage and to use them directly in reporting students’ results. We will soon see that this commonplace procedure is not justified. In keeping with the caveat we expressed
earlier, these n/N fractions should be regarded as merely orderings of the nominal
categories, and as insufficient for the inference of interval relations between the
frequencies of observations.”
Bond and Fox (2007) then go on to show how to weight the difficulty of the exam questions using a Rasch Model to provide a true interval scale with equal intervals of difficulty or student ability.
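As a hedged sketch only: in Stata 14 or later, the built-in irt command fits the Rasch (one-parameter logistic) model. The item names below (task1-task5) are hypothetical 0/1 variables like those in the hip-replacement example above; this is not Bond and Fox's own program.

* Sketch: fit a Rasch (1PL) model to binary task items
* (hypothetical variables task1-task5, coded 0 = not completed, 1 = completed).
irt 1pl task1-task5        // item difficulties estimated on a common logit scale
estat report               // report the estimated item difficulties
predict ability, latent    // person ability on the logit (interval) scale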
Visual Analog Scales for Symptom Measurements
A frequently used way to assess pain is the visual analog scale (VAS). Here, the study subject rates his or her pain by placing a mark on a visual scale, such as,
no pain |---------------------------------------| worst possible pain
These are frequently drawn with a line 100 mm long, so the score is the mm distance from the left (range, 0 to 100). Another variation is an integer rating, from 0 to 10.
In remarking on what level of measurement such a scale achieves, McDowell (2006, p.478), in his textbook on rating scales, comments without committing himself to an opinion,
“Although nonparametric statistical analyses are generally considered appropriate (4), one study showed that VAS measures produced a measurement with ratio scale properties
(12).”
------
(4) Huskisson EC. (1982). Measurement of pain. J Rheumatol 9:768-769.
(12) Price DD, McGrath PA, Rafi A, et al. (1983). The validation of visual analogue
scales as ratio scale measures for chronic and experimental pain. Pain 17:45-56.
Perhaps the best article to cite for justifying that a VAS can be analyzed as an interval scale is Dexter and Chestnut (1995). These authors did a Monte Carlo simulation, sampling from an actual VAS dataset, and demonstrated that the independent-sample t test (which assumes an interval scale) performed as well as the Wilcoxon-Mann-Whitney test (which assumes an ordinal scale) in not inflating the Type I error rate. Similarly, they showed that one-way analysis of variance (which assumes an interval scale) performed as well as the Kruskal-Wallis test (which assumes an ordinal scale) in not inflating the Type I error rate. So, treating the VAS as an interval scale for analysis resulted in a correct hypothesis test.
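A minimal simulation sketch along the same lines (not Dexter and Chestnut's actual program) can be run in Stata to check that both tests hold the Type I error rate near 0.05 when the null hypothesis is true; the VAS data here are generated artificially rather than sampled from a real dataset.

capture program drop vassim
program define vassim, rclass
    drop _all
    set obs 60                         // two groups of 30
    gen group = _n > 30
    gen vas = round(100*runiform())    // same VAS distribution in both groups
    ttest vas, by(group)
    return scalar p_t = r(p)
    ranksum vas, by(group)
    return scalar p_w = 2*normal(-abs(r(z)))
end

simulate p_t=r(p_t) p_w=r(p_w), reps(1000) seed(12345): vassim
gen reject_t = p_t < .05
gen reject_w = p_w < .05
summarize reject_t reject_w            // both means should be close to .05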
In their methods paper assessing the bias and precision of VASs, Paul-Dauphin et al (1999) analyzed these scales using a statistical approach that assumes at least an interval scale. The authors discuss different ways of presenting a VAS, such as with or without reference ticks and labels, and vertical versus horizontal orientation. It is a good paper to cite if you intend to use a VAS and want to show you have put some effort into designing your study well.
How Many Categories In An Ordinal Scale Are Required To Consider It an Interval Scale?
It would seem that adding more categories would take an ordinal scale closer to an interval scale, regardless of whether the intervals are strictly equal sized or not. This occurs because with more categories, there is less opportunity for the intervals to have large inequalities.
Also, an ordinal scale has an underlying theoretical continuous scale. So, the scores of the ordinal scale are approximations of the underlying continuous scale. It is somewhat analogous to expressing height by rounding to the nearest inch or centimeter.
So, just how many categories does it take to justify analyzing an ordinal scale as an interval scale?
Nunnally and Bernstein (1994, p.115) make a suggestion about the number of categories,
“We will somewhat arbitrarily treat a variable as continuous if it provides 11 or more levels, even though it is not continuous in the mathematical sense. Consequently we will normally think of item responses as discrete and total scores as continuous. The number 11 is not ‘magical,’ but experience has indicated that little information is lost relative to a greater number of categories. Moreover, the law of diminishing returns applies, and so using even 7 or 9 categories does little harm if the convenience of reporting data as a single digit is important to the application.”
Exercise. Download the Multiple Sclerosis Quality of Life (MSQOL)-54 Instrument from the website:
Go to items 53 and 54 on the second-to-last page. Item 53 is an 11-point scale and item 54 is a 7-point scale. Which would you say does the better job of approaching the accuracy of an interval scale?

Is It All Right to Treat an Ordinal Scale as an Interval Scale for Analysis?
Point
Actually, it is okay to analyze an ordinal scale using statistical methods that require an interval scale, but do not do it, since the idea has not yet caught on in biomedicine. You can, however, use this idea to make yourself feel comfortable analyzing sums of items not developed using the Rasch method as interval scales, or treating VAS scores as interval scales.
Detail (if you are curious)
As explained by Nunnally and Bernstein (1994, p.20), there is one camp, called the “fundamentalists,” who hold that ordinal scales should strictly be analyzed with nonparametric tests that operate on the rank order of the data. The other camp, called the “representationalists,” advocates that the essential information in an interval scale is the rank ordering, and that there is little harm in analyzing ordinal data using parametric tests that assume an interval scale. These points of view were hotly debated in the 1950s in the social science literature. Studies were done demonstrating that there was little difference in outcomes whether a scale was treated as ordinal or as interval; either approach produced basically the same correlation coefficient and p value.
What came out of that was a justification for many social science researchers to analyze ordinal scales as interval scales.
The trend was not adopted by researchers in biomedicine. Most of the measurements in medicine are either dichotomous or interval, so the issue did not have to be faced.
In contrast, most of the measurements in the social sciences are ordinal, or are total scores derived from multiple ordinal items. Therefore, the social scientists looked into this question in earnest, to enable themselves to use regression models and their variants.
A famous biostatistician, Ralph D’Agostino, published papers to introduce the idea into biostatistics. In the first paper (Heeren and D’Agostino, 1987), it was demonstrated by simulation that analyzing ordinal scales with a few categories using a t test, with small sample sizes of 5 to 20, had the desired statistical property of the type I error being what it should be.
In a follow-up paper, Sullivan and D’Agostino (2003) investigated the performance of analysis of covariance, a parametric technique that assumes an interval scale, on ordinal data with 3, 4, and 5 categories. Again, they found that the type I error was not inflated, while the power of the test remained high.
D’Agostino did not take a position by stating a conclusion for or against analyzing ordinal scaled data with interval-level statistical approaches. Instead, he published his papers to lay the groundwork to move biostatisticians in this direction. His work, however, implies that this could be done and the idea will slowly catch on.
Dichotomous Variables Are Actually Interval Scaled Variables
Hardly anyone knows this, because it is not taught in statistics courses, but a dichotomous scale is also an interval scale.
What statistics books will advocate, however, is that categorical variables be converted to a set of dummy variables, or indicator variables (these are dichotomies, scored 0 or 1), as a way to include a categorical variable (either nominal or ordinal) into a regression model.
Statistics books fail to point out that the reason this works is that dichotomous variables are actually interval scales, so arithmetic can be done on the variables themselves. Linear regression estimates an intercept and slope (the equation for a straight line) using the following equations:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} \qquad \text{and} \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\,\bar{x},$$

where

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .$$
We can see that arithmetic is being done on the variables themselves.
Interval Scale Assumption
Linear regression, as well as the other forms of regression models, assumes that all predictor variables have at least an interval scale.
This assumption is necessary so arithmetic can be performed on the values of each predictor variable.
It makes sense to do arithmetic on an interval scaled variable, since this scale is
sufficiently close to our notion of integers and real numbers (the interval scale shares the property of equal intervals with both of these number systems). It is generally accepted that it does not make sense to do arithmetic on nominal and ordinal scales, since these scales do not have equal intervals.
Although it is rarely claimed as such, a dichotomous scale could be considered an interval scale, since it has order (although perhaps an arbitrary order), it has equal intervals (one interval that is equal to itself), and one category can be selected to represent 0.
This claim is made by Jum C. Nunnally, one of the best-known psychometric experts (Nunnally and Bernstein, 1994, p.16):
“When there are only two categories, there is only one interval to consider, so that one interval may be considered an ‘equal’ interval. That is why binary (dichotomous) variables may be considered to form interval scales, the point noted above as being so important to modern regression theory and elsewhere in statistics.”
Nunnally and Bernstein (1994, pp. 189-190) further state:
“As noted in the section titled ‘Another form of Partialling,’ categorical variables are now used quite commonly in multivariate analysis thanks to Cohen (1968). This use reflects the point made in Chapter 1 that a scale may be regarded as an interval scale when it contains only two points. This is the basis of the analysis of variance. If the variable takes on only two values, such as gender, one level may be coded 0 and the other coded 1…. A variable coded 0 or 1 is called a ‘dummy’ or ‘indicator’ variable. The independent variable’s ‘scale’ has interval properties, by definition, because the scale has only two points.”
Sarle (1997), on his web-site discussing measurement theory, states the same thing,
“What about binary (0/1) variables?
For a binary variable, the classes of one-to-one transformations, monotone increasing/decreasing transformations, and affine transformations are identical--you can't do anything with a one-to-one transformation that you can't do with an affine transformation. Hence binary variables are at least at the interval level. If the variable connotes presence/absence or if there is some other distinguishing feature of one category, a binary variable may be at the ratio or absolute level.
Nominal variables are often analyzed in linear models by coding binary dummy variables. This procedure is justified since binary variables are at the interval level or higher.”
This is why you can recode nominal and ordinal predictor variables into indicator, or dummy variables, and include them directly into the regression equation. The regression algorithm treats the indicator variable as an interval scale, and performs arithmetic directly on the 0-1 values.
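As a minimal sketch (the variable names race and sbp are hypothetical), either of the following approaches enters a nominal predictor into a linear regression in Stata:

* Approach 1: create 0-1 indicator variables by hand and enter all but one
tabulate race, generate(race_)     // creates race_1, race_2, race_3, ...
regress sbp race_2 race_3          // race_1 serves as the reference group

* Approach 2: let Stata create the indicators with factor-variable notation
regress sbp i.race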
This claim that dichotomous variables are actually interval scales is rarely taught in statistics classes, so few people are even aware why indicator variables work in regression models.
Statisticians are traditionally trained to think of a 0-1 variable as a “Bernoulli variable,” rather than as a continuous “interval scale” variable. A Bernoulli variable has mean p and variance p(1-p), where p is the probability of a 1 (Ross, 1998).
The derivation of this mean and variance for a Bernoulli variable, with the standard deviation being the square root of the variance, is taught in the first semester of a master’s degree-level statistics program. The important point about the formulas is that they use only the nominal scale property of the variable. That is, they are based on simply counting the number of occurrences of each of the variable’s outcomes (how many 0’s and how many 1’s), and then doing arithmetic on the counts. Arithmetic is not done on the values of the variable themselves.
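The derivation itself uses only the two outcome probabilities. For a Bernoulli variable $X$ with $P(X = 1) = p$:

$$E(X) = 0\cdot(1-p) + 1\cdot p = p$$

$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \left[0^2\cdot(1-p) + 1^2\cdot p\right] - p^2 = p(1-p)$$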
These formulas for the mean and standard deviation of a Bernoulli variable look very different than the sample mean and sample standard deviation used in statistics:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{(sample mean)}$$

and

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2} \quad \text{(sample standard deviation)}$$
Let’s apply these standard formulas to a dichotomous variable and see what happens.
Reading in the Stata formatted data file, births.dta, using Stata menus:
File > Open
Find the directory where you copied the course CD:
Change to the subdirectory datasets & do-files
Single click on births.dta
Open
use births.dta
Requesting a frequency table for the dichotomous variable, lowbw, using Stata menus:
Statistics > Summaries, tables & tests
Tables
Oneway tables
Categorical variable: lowbw
OK
tabulate lowbw
   low birth |
      weight |      Freq.     Percent        Cum.
-------------+-----------------------------------
           0 |        440       88.00       88.00
           1 |         60       12.00      100.00
-------------+-----------------------------------
       Total |        500      100.00
We see that the lowbw variable is a 0-1 variable, or Bernoulli variable.
Using the Bernoulli formulas, we get
mean = p = 60/500 = 0.1200
variance = p(1-p) = 0.1200(.8800) = 0.1056
standard deviation = √(0.1056) = 0.324962
Notice how we just use the counts of the categories, the “Freq.” column of the frequency table, and then do arithmetic on the counts, rather than on the values of the variable. That is, we computed these statistics using only the nominal scale property of the variable (we just counted the frequency of occurrence of the name, or label, given to each value of the variable).
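The same arithmetic on the counts can be reproduced directly in Stata with the display command:

display "mean = " 60/500
display "variance = " (60/500)*(440/500)
display "standard deviation = " sqrt((60/500)*(440/500))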
Now, using the ordinary statistical formulas for mean and standard deviation, which were designed for interval scales,
Statistics > Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Variables: lowbw
Options: standard display
OK
summarize lowbw
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       lowbw |       500         .12     .325287          0          1
We see that the Bernoulli mean is exactly the same as when the ordinary formula for the mean is applied, both giving 0.12. The standard deviations differ only slightly (0.324962 versus 0.325287) because the sample formula divides by n - 1 rather than n.
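As a quick check of that last remark, using the counts above, the sample standard deviation differs from the Bernoulli standard deviation only by the factor sqrt(n/(n-1)):

display sqrt((60/500)*(440/500))                  // Bernoulli SD = .3249615
display sqrt(500/499)*sqrt((60/500)*(440/500))    // sample SD    = .3252869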