Chapter 1
Data and Descriptive Statistics
1.1 Introduction
Statistics is the art and science of collecting, summarizing, analyzing and interpreting data. The field of statistics can be broadly divided into two branches: (i) descriptive statistics and (ii) inferential statistics. In descriptive statistics, we simply describe a given set of data in a way that makes it understandable to the user or the decision maker. There are various ways to describe data, such as summarizing it with a few numbers, tabulating it, or visualizing it with graphics. We will study some of these approaches in this chapter. In inferential statistics, statisticians try to make useful statements about a population based on an analysis of sample data. In general, a decision maker is interested in statements about a population, but collecting data for an entire population is usually not feasible, either practically or economically. Statisticians therefore collect data from a small sample and make statements about the population based on the sample data. In later chapters, when we study inferential statistics, we will learn the process of making statements about a population based on sample data.
1.2 What is Data?
A single unit of data is the value of some variable of interest. For example, the number 5 is a single unit of data. It could represent the number of customers waiting in line at a given time; it could represent the number of days it takes to ship an item; it could represent the weight in pounds of a package being shipped. In these examples, the number of customers waiting in line is a variable, the number of days it takes to ship an item is a variable, and the weight in pounds of a package is a variable. Data need not always be a number; it could also be a non-numeric value, such as “red” or “high”. The data value “red” might represent the color of the next car you see on the road. It might represent the favorite color of your best friend or the most popular color for clothes amongst women. The data value “high” may represent the degree of satisfaction of a company's customers, or the degree of perceived quality of a product. As a statistician, whenever you see a collection of data, you must understand what each data item represents. There is always a real-world entity that a piece of data represents or describes.
1.3 Types of Data
Quantitative vs. Qualitative Data
Broadly, there are two types of data: (i) numeric or quantitative data and (ii) non-numeric, qualitative or categorical data. Whether the data is quantitative or qualitative depends on whether the underlying variable is quantitative or qualitative. If the variable is weight, height, time taken, number of people and so on, it is inherently quantitative, and therefore data that describes it will be quantitative. If the underlying variable is qualitative in nature, such as the color of a dress or the degree of satisfaction, then the data is also qualitative. Sometimes a qualitative variable is represented by a number, which can create some confusion. For example, in some countries the zip code is represented by numbers (e.g. in the USA). In some countries the zip code is not purely numeric (e.g. in Canada). Even if it is represented as a number, a zip code is essentially a qualitative variable, because it is simply a label for a neighborhood to facilitate mail delivery.
One way to test whether a variable is truly numeric is to see if it makes sense to perform arithmetic on the data values. If it makes sense, then the variable is truly numeric; if not, it is not. For example, while it makes sense to add two values of weight, it makes no sense to add two zip codes. The sum of two zip codes does not produce any meaningful number, whereas the sum of two weight values produces a meaningful value.
A third type of data is the date. It is neither completely quantitative nor completely qualitative; it has elements of both. For example, part of the date represents the month, which is a non-numeric quantity such as January, February, etc. The fact that these months can be represented as numbers does not necessarily make them numeric: adding 1 and 2 gives 3, but adding the corresponding months January and February does not give March. In fact, adding January and February does not give anything meaningful. Yet some arithmetic can be performed with dates. For example, you can calculate the date 60 days from today.
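The kind of arithmetic that does and does not make sense for dates can be illustrated with a short sketch. It assumes Python's standard datetime module and an arbitrarily chosen start date (1 January 2024), used only so that the result is reproducible; neither comes from the discussion above.

```python
from datetime import date, timedelta

# Assumed start date, chosen only for reproducibility.
# Replace it with date.today() to start from today.
start = date(2024, 1, 1)

# Adding a number of days to a date is meaningful arithmetic.
sixty_days_later = start + timedelta(days=60)
print(sixty_days_later)

# Subtracting two dates is also meaningful: it gives the elapsed time.
elapsed = sixty_days_later - start
print(elapsed.days)

# Adding two dates together is not defined, so Python raises a TypeError.
try:
    start + sixty_days_later
    dates_can_be_added = True
except TypeError:
    dates_can_be_added = False
print(dates_can_be_added)
```

The date behaves like a number when a duration is added or when two dates are subtracted, but not when two dates are added, which mirrors the mixed nature described above.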
Discrete vs. Continuous Variables
Within quantitative variables, there are two types: discrete and continuous. A discrete variable can assume only certain separate values, such as 3 or 5; it cannot take an arbitrary value such as 3.14159. A continuous variable, on the other hand, can assume any value on a continuum, such as 3.14159. Height and weight are examples of continuous variables. If we have a scale of extremely fine resolution that measures height to six decimal places, the height can be any number, such as 67.914159 inches. But because we tend to round it to the nearest integer, it appears to be discrete. Similarly, weight is a continuous variable because a very sensitive weighing machine can weigh a person to several decimal places. The number of people in a line is an example of a discrete variable. The number of countries visited by a person is also a discrete variable; nobody can visit 3.14159 countries, for example. Most discrete variables are those whose values are the result of counting, such as the number of customers who enter a store in an hour, the number of cars that pass through a traffic light in a day, or the number of students enrolled in a course. Most continuous variables are those whose values are the result of measurement, such as distance, weight, temperature etc.
Scales of Measurement
There are four different scales of measurement, namely (i) Nominal scale, (ii) Ordinal scale, (iii) Interval scale and (iv) Ratio scale. A qualitative (or categorical) variable may have a nominal scale or an ordinal scale. A quantitative variable may have an interval scale or a ratio scale. A variable with a nominal scale is a categorical variable whose values cannot be ordered. For example, color is a nominal variable because there is no sensible way to order values such as Red, Green, Brown and Blue. Another example of a nominal variable is gender. Values of an ordinal scale variable can be ordered. For example, when filling out a survey on customer satisfaction, you might choose amongst the categories Poor, Below Average, Average, Above Average and Excellent. These values can be ordered, and therefore the variable customer satisfaction is an ordinal variable. An interval scale variable is a quantitative variable whose values do not have a true zero; consequently the ratio of two values is meaningless, and only the interval between two values is meaningful. For example, temperature in Fahrenheit is a variable whose value of zero is an arbitrary temperature. A temperature of zero degrees Fahrenheit does not correspond to zero heat, and therefore this variable does not have a true zero. The ratio of 40 degrees to 20 degrees is 2, but this does not imply that a temperature of 40 degrees corresponds to twice the heat of a temperature of 20 degrees. So the ratio of two values is meaningless. In business examples, we rarely come across interval scale variables. A ratio scale variable is a quantitative variable with a true zero, for which the ratio of two values is therefore meaningful. For example, sale price, height, length and weight are all ratio scale variables.
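To see why the Fahrenheit ratio is not meaningful, one can re-express the same two temperatures on a scale that does have a true zero. The sketch below assumes the Kelvin scale for this purpose; the Kelvin scale and the helper name fahrenheit_to_kelvin are introduced here for illustration and are not part of the discussion above.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (temp_f - 32) * 5 / 9 + 273.15


# The ratio of the raw Fahrenheit readings
ratio_fahrenheit = 40 / 20

# The ratio of the same two temperatures measured from a true zero
ratio_kelvin = fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20)

print(ratio_fahrenheit)        # 2.0
print(round(ratio_kelvin, 3))  # 1.042
```

The ratio changes from 2 to roughly 1.04 simply because the zero point moved, which is what it means for a ratio to be meaningless on an interval scale. The interval of 20 Fahrenheit degrees, by contrast, describes the same temperature difference whichever zero point is used.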
Population vs. Sample
When learning statistics, we must learn to clearly distinguish between a population and a sample. A population consists of all entities of interest. A sample is a subset of entities from a population. Usually, though not always, it is infeasible to collect data about the entire population of interest. In rare cases, when the population is small, it is feasible to collect data about the whole population. For example, if we are interested in the income distribution of everyone in a city of a million residents, it would be quite infeasible to collect data on each resident’s income. If, however, we are interested in the income distribution of everyone in a small town of 15 residents, we may be able to collect data for the entire population. Whenever collecting population data is infeasible, we have no choice but to work with sample data.
1.4 Descriptive Statistics
We will now discuss how data is described using descriptive statistics. It is important to recognize the type of data before deciding how to describe it because the descriptive statistics for quantitative data are different from the descriptive statistics for qualitative data.
Descriptive Statistics for Quantitative Variables
Sometimes data for a quantitative variable is given as a list of raw numbers, also called ungrouped data, and sometimes it is given as grouped data. An example of ungrouped data is a list of raw numbers such as 2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10. These numbers could represent any quantitative variable, such as the number of cars sold per day in April at a car dealership. Grouped data appears as a frequency table for different groups of values, such as:
Table 1.1: Grouped Data
Num of cars sold in a day in April / Count (or frequency)
1 – 3 / 4
4 – 6 / 12
7 – 9 / 7
10 – 12 / 3
13 – 15 / 2
16 – 18 / 1
19 – 21 / 1
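The grouping of the raw numbers into the frequency table above can be sketched in a few lines of Python. The variable names data, groups and frequency are illustrative choices; the group width of 3 and the starting value of 1 are read off the table.

```python
# Raw (ungrouped) data: cars sold per day in April
data = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11,
        1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]

# Groups of width 3 starting at 1: (1, 3), (4, 6), ..., (19, 21)
groups = [(low, low + 2) for low in range(1, 20, 3)]

# Count how many values fall inside each group (both ends inclusive)
frequency = {
    (low, high): sum(low <= x <= high for x in data)
    for low, high in groups
}

for (low, high), count in frequency.items():
    print(f"{low} - {high} / {count}")
```

Each raw value lands in exactly one group, so the counts add up to the 30 days in the data.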
Depending on whether the data is grouped or ungrouped, the way we describe data is different.
We describe data either using some summary measures or by some visual graphs.
Summary Measures for Quantitative Variables
There are four types of summary measures:
(i) measures of central tendency
(ii) measures of variation
(iii) measures of location
(iv) measures of shape.
Measures of Central Tendency
In general, any given data tends to crowd around a center. It helps to know what this center is. There are three measures of central tendency –
(i) Mean
(ii) Median
(iii) Mode.
The mean is simply the average of all the values. We can calculate the mean by summing up all the values and dividing by the total number of values. For example, the mean of these values: 2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10 can be determined by summing up these values and dividing by 30. The sum of all these values happens to be 222. So the mean is 222/30 or 7.4. Mathematically, the mean is given by the formula $\text{Mean} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of values and the $x_i$ are the data values. We can say that the mean number of cars sold per day in April is 7.4. If the data is grouped, then the way we calculate the mean is different. We compute the middle value of each group, multiply it by the frequency of that group, add up the products, and divide by the total frequency.
Table 1.2: Computing Mean in Grouped Data
Num of cars sold in a day in April / Middle group value / Count (or frequency) / Product of middle value and freq.
1 – 3 / 2 / 4 / 8
4 – 6 / 5 / 12 / 60
7 – 9 / 8 / 7 / 56
10 – 12 / 11 / 3 / 33
13 – 15 / 14 / 2 / 28
16 – 18 / 17 / 1 / 17
19 – 21 / 20 / 1 / 20
Total / / 30 / 222
The mean is 222/30 = 7.4
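Although the mean of the ungrouped data and the mean of the grouped data turned out to be the same in this example, this is not always the case. The grouped calculation treats every value in a group as if it were equal to the middle value of that group, so in general it only approximates the mean of the raw data.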
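Both calculations of the mean can be checked with a short Python sketch. The list grouped, with each group written as (lowest value, highest value, frequency), is simply one assumed way of encoding the grouped table; the variable names are illustrative.

```python
# Ungrouped data: cars sold per day in April
data = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11,
        1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]

# Mean of ungrouped data: sum of the values divided by the number of values
mean_ungrouped = sum(data) / len(data)

# Grouped data as (lowest value, highest value, frequency)
grouped = [(1, 3, 4), (4, 6, 12), (7, 9, 7), (10, 12, 3),
           (13, 15, 2), (16, 18, 1), (19, 21, 1)]

# Mean of grouped data: sum of (middle value * frequency) / total frequency
total_frequency = sum(freq for _, _, freq in grouped)
total_product = sum((low + high) / 2 * freq for low, high, freq in grouped)
mean_grouped = total_product / total_frequency

print(mean_ungrouped)  # 7.4
print(mean_grouped)    # 7.4
```

Changing a single raw value within its group, for example turning a 2 into a 3, would change the ungrouped mean but leave the grouped mean at 7.4, which shows why the two need not agree in general.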
The second measure of central tendency, the median, is the middle value. It can be determined by arranging the data in either ascending or descending order and finding the value in the middle of the sorted list. The median is easier to obtain if there is an odd number of values, because there is only one middle value. If there is an even number of values, as in our example, then there are two middle values and the median is the average of the two. If we arrange our data in ascending order it looks like this:
1,2,3,3,4,4,4,4,4,5,5,5,6,6,6,6,7,7,8,9,9,9,9,10,10,11,13,14,18,20
Since there are a total of 30 values, there are two middle values - the fifteenth and the sixteenth values. Since both of them happen to be 6, the average of these two middle values is also 6, so the median for this data is 6. If the fifteenth value had been 6 and the sixteenth value had been 7, the median would have been 6.5.
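The sorting-and-middle-value procedure can be sketched in Python as follows. The manual calculation mirrors the steps described above; the call to statistics.median, from Python's standard library, is included only as a cross-check.

```python
import statistics

data = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11,
        1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]

ordered = sorted(data)
n = len(ordered)

if n % 2 == 1:
    # Odd number of values: a single middle value
    median = ordered[n // 2]
else:
    # Even number of values: average the two middle values.
    # With n = 30 these are the 15th and 16th values (indices 14 and 15).
    median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median)                   # 6.0
print(statistics.median(data))  # 6.0
```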
For grouped data, we compute the cumulative frequency column and look for the group that contains the middle value. For example, in the table below, the 15th and the 16th values are the two middle values, and both of them fall in the group 4 to 6, because the cumulative frequency is 4 at the end of the group 1 to 3 and 16 at the end of the group 4 to 6. So we know that the median is in the group 4 to 6. Within that group, we find the middle values in a prorated manner. The group holds 12 values, the 5th through the 16th; the 5th value is closest to 4 and the 16th value is closest to 6. If we prorate, then the 15th value is 4 + (6-4)*(15-4)/12 or 5.833 and the 16th value is 4 + (6-4)*(16-4)/12 = 6. So the median is the average of 5.833 and 6, or 5.92.
Table 1.3: Cumulative Frequency Table for Grouped Data
Num of cars sold in a day in April / Count (or frequency) / Cum. Freq.
1 – 3 / 4 / 4
4 – 6 / 12 / 16
7 – 9 / 7 / 23
10 – 12 / 3 / 26
13 – 15 / 2 / 28
16 – 18 / 1 / 29
19 – 21 / 1 / 30
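The prorating used for the grouped median can be written as a small Python sketch. It follows the calculation in this section, which prorates between the lowest and highest values of the group; other texts prorate between class boundaries instead and may give a slightly different answer. The function name prorated_value and the (lowest value, highest value, frequency) encoding are assumptions made for illustration.

```python
# Grouped data as (lowest value, highest value, frequency)
grouped = [(1, 3, 4), (4, 6, 12), (7, 9, 7), (10, 12, 3),
           (13, 15, 2), (16, 18, 1), (19, 21, 1)]


def prorated_value(position, grouped):
    """Estimate the value at a 1-based position in the sorted data."""
    cumulative = 0  # cumulative frequency of all earlier groups
    for low, high, freq in grouped:
        if position <= cumulative + freq:
            # position - cumulative is the rank of the value within this group
            return low + (high - low) * (position - cumulative) / freq
        cumulative += freq
    raise ValueError("position exceeds the total frequency")


n = sum(freq for _, _, freq in grouped)  # 30 values in total

# With an even number of values, the two middle positions are n/2 and n/2 + 1
value_15 = prorated_value(n // 2, grouped)
value_16 = prorated_value(n // 2 + 1, grouped)
median_grouped = (value_15 + value_16) / 2

print(round(value_15, 3))        # 5.833
print(value_16)                  # 6.0
print(round(median_grouped, 2))  # 5.92
```

The grouped estimate of 5.92 is close to, but not exactly, the median of 6 obtained from the raw data, because the individual values inside each group are no longer known.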
The third measure of central tendency, the mode, is the value that appears the greatest number of times. In our example, the value 4 appears most often: it appears five times, and no other value appears five times or more. The mode for this data is therefore 4. Sometimes there may be more than one mode. For example, if only 9 cars had been sold on one of the two days when 10 cars were sold, there would have been five days when 9 cars were sold. In that case there would be two modes, 4 and 9. When there are two modes, we do not average them. We simply report that there are two (or more) modes.
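Counting how often each value occurs, and the two-mode scenario described above, can be sketched in Python. The sketch assumes Python 3.8 or later, where statistics.multimode is available in the standard library; the list modified is a hypothetical variant of the data in which one 10-car day is changed to a 9-car day.

```python
from collections import Counter
import statistics

data = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11,
        1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]

# How many times does each value appear?
counts = Counter(data)
print(counts[4])  # 5
print(counts[9])  # 4

# multimode returns a list of all values that share the highest count
print(statistics.multimode(data))  # [4]

# Hypothetical variant: one of the 10-car days becomes a 9-car day
modified = list(data)
modified[modified.index(10)] = 9
print(sorted(statistics.multimode(modified)))  # [4, 9]
```

Reporting the full list of modes, rather than a single number, matches the advice above not to average two modes.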