Chapter 1, Data and Descriptive Statistics
The term Statistics broadly refers to two things. First, it refers to “Data” and second, it refers to “The science of analyzing data”.
The First Meaning of “Statistics”
The term statistics is often used simply to mean data. For example, when a close friend of yours delivers a baby, you might be interested in some “statistics” about the newborn baby, such as its gender, height, weight, the date and time of birth, the name, whether the birth was natural or cesarean, etc. The statistics (or the data) of the baby might be: Female, 19 inches, 8 lbs., Dec. 31st, 11:59 p.m., Dorothy and Cesarean. In this example, the words “statistics” and “data” refer to essentially the same thing: a bunch of values (such as Female, 19, 8, Dec. 31st, 11:59 p.m., Dorothy and Cesarean). These values describe something about the world that is of interest to you.
It is as if the data (or statistics) were telling a story or a tale, such as: There once was a close friend of mine, who delivered a little baby girl who happened to be exactly 19” at birth and weighed approximately 8 lbs. and was born on Dec. 31st at exactly 11:59 p.m. and she had to be delivered through a C-section and my close friend named the baby Dorothy and she lived happily ever after. Please remember this: every time you see any data, there is a tale behind it. It may not always be a tale about a little princess named Dorothy and it may not always have a happy ending, but there is always a story behind it.
No matter what your field of study, you will encounter a lot of data in your career, but always remember: all data tell a story. When you try to make sense of the data, you are essentially being a storyteller. If you are a business student, your story may involve business performance, changes in the economic climate, changes in consumer preferences, etc. If you are a biology major, your story may involve whether a certain species is becoming extinct and, if so, how fast.
In the above example, the newborn baby was of interest to you because it was not just any baby but the baby of someone you really care about, such as a close friend or a relative or your statistics professor. If it were just any baby born in a hospital that you pass by every day on your commute, you would probably not care much about its name, gender, height, weight, date and time of birth, etc. If I start telling you the name, gender, height, weight, and date and time of birth of every baby born at that hospital, you will probably ask me to leave. But if you were concerned that by asking me to leave you might hurt my feelings (after all I am your statistics professor, someone you deeply care about), you might politely ask me to perhaps just give you the total number of babies born, the total number of males and females, the average height and the average weight of all the babies. I might tell you that in a given year, at that hospital, 625 babies were born, of which 325 were boys and 300 were girls, the average height of all babies was 20” and the average weight was 6.8 lbs. This way I will feel good about giving you data about all the babies, without really giving you data about each individual baby, and you will feel good about not hurting my feelings, and you will only have to listen to me for a few seconds instead of a few hours.
Who knows, this summary data I gave you just might come in handy someday. For example, if you happen to be expecting a baby someday and you happen to go to this hospital, since it is on your commute, and if you happen to remember that I had told you that the average height of all the babies born at that hospital was 20”, you can take some comfort in the knowledge that your baby will probably be somewhere around 20” and not, say, 10” or 40”, and having this comfort can be quite uplifting at a time when you need all the uplifting you can get. And if you happen to work at the delivery ward of this hospital, or if you are a researcher at this hospital studying the health of newborn babies, you might actually be very interested in the summary data about the total number of babies born, the number of males and females, the average height, weight, etc. Whether the data is about individual entities or summary data about lots of entities put together, all data tell a story.
Summary of what has been discussed so far:
- Data, whether about individual entities or summary data, tell a story.
The Second Meaning of “Statistics”
The second meaning of the term statistics is: the science of analyzing data. A more detailed meaning is: the science of collecting, summarizing, analyzing and interpreting data. A more informal second meaning might be: the science of collecting data and then telling the story behind the data.
How is Statistics useful? The usefulness of Statistics really depends upon how important the story behind the data is to you. For example, if your retirement plan depends upon some data about the various investment options, then clearly the data and the story embedded in it are extremely useful to you, because they become instrumental in determining the quality of your life after your retirement. If you enjoy a good retirement, you might be able to tell lots of stories about a princess Dorothy to your grandchildren, and your grandchildren can tell their children the story of the great quality of life you enjoyed because of your great retirement planning, which ultimately hinged on a bunch of data. As another example, if as a manager you are faced with some difficult decisions about budget cuts, then summary data about the various expenses in your department will be very useful in making your decisions, because if you are not successful in cutting the budget, your job might be the target of someone at a higher level in your organization who is looking to cut the payroll budget. If a bunch of numbers regarding the various expenses within your department can save your job, I am sure you will find those numbers, or data, or statistics, very useful.
Similarly, how interesting statistics can be depends on how entertaining your story is. In fact, statistics is used a lot in party conversations. I am sure you have heard people engaged in entertaining conversation at a party saying things like: did you know that x% of people in such and such country do bla bla and bla? If this x% seems to be an outrageous number, people find it entertaining. For example, someone might tell you that on an ice-hockey team in Canada, 80% of the players were born in January, February or March. You might find this a very entertaining piece of information, because if birthdays were spread evenly over the year, one would expect only about 25% of the players (3 months out of 12) to be born in January, February or March. So, compared to an expected 25%, the 80% in the story seems to be an outrageous number, something to write home about. Remember: this interesting story hinged on some statistics about the birthdates of some hockey players.
So informally speaking, statistics is the science of telling a story. Now, I am sure you must be asking yourself, is telling a story an art or a science? Like most people, you might be inclined to think that telling a story is more of an art than a science. And I will not argue with you on this. Even the field of mathematics, which is regarded as an exact science, can be an art. For example, proving a theorem is more of an art than a science, especially if you are proving a new theorem that no one else has ever proved before. Yet the proof itself is very scientific. So, you could say that statistics is the art and science of telling a story. Given the same data, one more adept in the art of statistics can tell a far better story than a novice. If it were merely a science, then there would only be one possible story hidden behind a given set of data.
Now, art is usually a talent that someone is born with, although sometimes, by constantly practicing something over a long period of time, someone without an inborn talent can become an artist. In a one-semester course, it will be difficult to turn you into an artist. But we can study the science of statistics, and once you start practicing it, after a few years, who knows, some of you may become adept in the art of statistics. So let’s next study the science of telling a story from data.
The Science of Telling a Story Hidden in Data
Let’s look at an example. Let’s suppose the following values represent something in the real world:
2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10
When you see data like this, it clearly appears very boring. But put a little story behind it and suddenly the data comes to life. Here are some more interesting definitions of statistics, definitions that you will not see in standard statistics textbooks – it is the art and science of making data come to life or the art of converting boring data into an interesting story. Please do not quote these definitions as standard definitions. I just made them up to make statistics more interesting.
When you see data like the values listed above, the first thing you should ask is: what do these numbers represent? Suppose I tell you that in a statistics class of 30 students, I asked everyone how many different countries they have visited in their life, and this data represents the number of countries visited by each student. Suddenly the data is not merely a set of boring numbers; it is telling a story: that one student has actually visited 20 countries (amazing), that one student has visited 18 countries (not bad), that one student has never been outside of their home country, that on average a student has been to 7.4 countries, that the number of countries visited by the largest number of students is 4 (the mode), and that the median number of countries visited is 6. By making these statements, you have already made this data far more interesting than it originally was. You can make it even more interesting by doing the following. You can say: let’s count the number of students who have visited 3 or fewer countries, the number who have visited 4 through 6 countries, 7 through 9, 10 through 12 and so on, and you might get a table that looks like this:
Number of countries visited / Count (or frequency)
1 – 3 / 4
4 – 6 / 12
7 – 9 / 7
10 – 12 / 3
13 – 15 / 2
16 – 18 / 1
19 – 21 / 1
Table 1: Frequency Table
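The counting behind Table 1 can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_table` and its `start` and `width` parameters are made up for this example, and the classes are assumed to be three countries wide starting at 1, as in the table.

```python
# Number of countries visited by each of the 30 students (from the text).
countries = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10,
             4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]


def frequency_table(values, start=1, width=3):
    """Count how many values fall in each class of `width` consecutive integers."""
    table = []
    low = start
    while low <= max(values):
        high = low + width - 1
        count = sum(1 for v in values if low <= v <= high)
        table.append((low, high, count))
        low += width
    return table


table1 = frequency_table(countries)
for low, high, count in table1:
    print(f"{low} - {high} / {count}")
```

The printed counts should match the Count column of Table 1, and they add up to the 30 students in the class.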
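A table like this is called a Frequency Table. You can add some more columns: one for cumulative frequency (the running total of the counts up to and including each row), one for relative frequency (each count as a percentage of the total of 30 students) and one for cumulative relative frequency (the running total of the relative frequencies).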
Number of countries visited / Count (or frequency) / Cumulative Frequency / Relative Frequency / Cumulative Relative Frequency
1 – 3 / 4 / 4 / 13.3% / 13.3%
4 – 6 / 12 / 16 / 40.0% / 53.3%
7 – 9 / 7 / 23 / 23.3% / 76.7%
10 – 12 / 3 / 26 / 10.0% / 86.7%
13 – 15 / 2 / 28 / 6.7% / 93.3%
16 – 18 / 1 / 29 / 3.3% / 96.7%
19 – 21 / 1 / 30 / 3.3% / 100%
Table 2: Frequency, Cumulative Frequency, Relative Frequency, Cumulative Relative Frequency
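As a rough sketch of how the extra columns in Table 2 are obtained from the counts, the following Python snippet computes them with a running total. The variable names are illustrative only; the counts are copied from Table 1.

```python
from itertools import accumulate

# Classes and counts copied from Table 1.
classes = ["1 - 3", "4 - 6", "7 - 9", "10 - 12", "13 - 15", "16 - 18", "19 - 21"]
frequency = [4, 12, 7, 3, 2, 1, 1]

total = sum(frequency)                            # 30 students in all
cumulative = list(accumulate(frequency))          # running total of the counts
relative = [100 * f / total for f in frequency]   # each count as a percentage of the total
cumulative_relative = [100 * c / total for c in cumulative]

for label, f, cf, rf, crf in zip(classes, frequency, cumulative,
                                 relative, cumulative_relative):
    print(f"{label} / {f} / {cf} / {rf:.1f}% / {crf:.1f}%")
```

Computing the cumulative relative frequency from the cumulative counts, rather than by adding up the rounded percentages, avoids accumulating rounding error.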
All of a sudden, you can tell so much about the travel behavior of students in a class, based on a bunch of 30 values. For example, you can say that 40% of the students have visited 4 to 6 countries, or that only 3.3% of the students have visited 19 to 21 countries. By looking at the cumulative relative frequency column, you can make statements like: 53.3% (or roughly half) of the students have visited 6 or fewer countries, and about three quarters (76.7%) of the students have visited 9 or fewer countries. This second fact can also be stated as: almost a quarter (23.3%) of the students have visited 10 or more countries. So, by converting the data into a frequency and relative frequency table, a number of interesting statements can be made.
I am sure we have all heard the phrase – a picture is worth a thousand words. As long as we are in the business of telling stories, we can make use of pictures, which will be worth a lot more. For example, in Figure 1, we have a bar chart or a column chart of the frequency.
Figure 1: A Bar Chart of Frequency
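A quick way to get a feel for such a chart without any plotting software is to print one symbol per student. The sketch below is only a text approximation of a bar chart, drawn sideways, and the helper name `text_bar_chart` is made up for this example; the counts are those of Table 1.

```python
# Classes and counts copied from Table 1.
classes = ["1 - 3", "4 - 6", "7 - 9", "10 - 12", "13 - 15", "16 - 18", "19 - 21"]
frequency = [4, 12, 7, 3, 2, 1, 1]


def text_bar_chart(labels, counts, symbol="#"):
    """Return one line per class: the label, then one symbol per student."""
    width = max(len(label) for label in labels)
    return [f"{label:<{width}} | {symbol * count} ({count})"
            for label, count in zip(labels, counts)]


chart = text_bar_chart(classes, frequency)
print("\n".join(chart))
```

The longest row belongs to the 4 - 6 class, which is the tallest column in the bar chart.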
If we eliminate the spaces between the columns in Figure 1, we create a diagram called a histogram, as shown in Figure 2 below. Note that we can call it a histogram only as long as the x-axis has numerical values. If the x-axis had categories, such as USA, Korea and India, then the diagram would not be called a histogram; it would be a bar chart, and it is better to have spaces between the bars.
Figure 2: A Histogram of Frequency
If we draw a line graph of the relative frequencies as follows, we call it a frequency polygon:
Figure 3: A Frequency Polygon of the relative frequency
I hope through this example, you got a glimpse of how to convert a set of boring data into a useful story using some simple techniques such as frequency tables, bar charts and other charts.
The science of analyzing data, which is the second meaning of statistics, can be divided into two parts: Descriptive Statistics and Inferential Statistics. In Descriptive Statistics, we describe data, i.e., we tell the story hidden behind a given set of data. We saw descriptive statistics in action through the above example. In Inferential Statistics, we make inferences about a population of interest from data collected from a sample of that population. In the next chapter, and the next and the next, we will study quite a bit about inferential statistics. We are still telling a story in inferential statistics; it is just that the story is about a population of interest instead of about the sample for which data was collected.
Summary of what we have said since the last summary:
Descriptive Statistics
Although we have seen an example of descriptive statistics above, we have not seen all the elements of descriptive statistics. In the example, we calculated the mean (or the average), the median and the mode and we did frequency tables and drew some graphs. Those are some of the things we do in descriptive statistics. There are many other things we do, such as calculate variances and standard deviations and percentiles and quartiles and a few other things.
Basically, the mean, the median, the mode, the standard deviation, etc. are summary measures to describe data. If we have some summary measures, we don’t need to see all of the raw data to get a sense of how the overall data looks. So if we know that the mean is 7.4, the median is 6, the mode is 4, the high value is 20 and the low value is 1, we get a rough idea of how the data values are distributed. Often, a rough idea is all that is needed.
Let me formally introduce these measures. There are broadly four types of summary measures: (1) measures of central tendency, (2) measures of location, (3) measures of variation and (4) measures of shape.
Measures of Central Tendency
It has been generally observed that given any set of data, the values tend to crowd around a central value. It is as if there is a center of gravity and the data seems to be attracted towards it. Of course, some data is always found away from the center. The location of the center of gravity is determined by certain measures called the measures of central tendency. There are three measures of central tendency: (1) Mean, (2) Median and (3) Mode. A measure of central tendency is like a representative number for the entire data. The Mean is simply the average of all the values. We can calculate the average by summing up all the values and dividing by the total number of values. For example, the mean of these values: 2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10, 4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10 can be determined by summing up these values and dividing by 30. The sum of all these values is 222. So the mean is 222/30 or 7.4.
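The arithmetic for the mean can be checked with a short Python sketch. The variable names are illustrative; the values are the 30 numbers from the text.

```python
# Number of countries visited by each of the 30 students (from the text).
countries = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10,
             4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]

total = sum(countries)   # add up all the values
n = len(countries)       # how many values there are
mean = total / n         # the mean is the sum divided by the count

print(total, n, mean)
```

This should report a sum of 222 over 30 values, giving the mean of 7.4 quoted above.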
The median is the middle value. The middle value can be determined by organizing the data in either ascending or descending order and finding out which value is in the middle. Median is easier to obtain if there are an odd number of values because there is only one middle value. If there are an even number of values, such as in our above example, then there are two middle values and the median is the average of the two middle values. If we organize the data in ascending order it looks like this:
1,2,3,3,4,4,4,4,4,5,5,5,6,6,6,6,7,7,8,9,9,9,9,10,10,11,13,14,18,20
Since there are a total of 30 values, there are two middle values - the fifteenth and the sixteenth. Since both of them happen to be 6, the mean of these two middle values is also 6, so the median for this data is 6. If the fifteenth value had been 6 and the sixteenth value had been 7, the median would have been 6.5.
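The procedure just described (sort the values, then take the middle value or the average of the two middle values) can be sketched in Python as follows. The function name `median` is chosen for this example only.

```python
# Number of countries visited by each of the 30 students (from the text).
countries = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10,
             4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]


def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: a single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


print(median(countries))   # the 15th and 16th sorted values are both 6
print(median([6, 7]))      # two middle values of 6 and 7 average to 6.5
```

Note that Python lists are indexed from 0, so the fifteenth and sixteenth values are `ordered[14]` and `ordered[15]`.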
The mode is the value that appears most often. In this example, the value 4 appears most often: it appears five times, and no other value appears five times or more. The mode for this data is therefore 4. Sometimes, there may be more than one mode. For example, if one of the students who visited 10 countries had visited only 9 countries, there would have been five students who visited 9 countries. In that case there would be two modes: 4 and 9.
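Finding the mode amounts to counting how often each value occurs and keeping the value (or values) with the highest count. A minimal Python sketch, using a helper named `modes` that is made up for this example, also covers the two-mode case described above:

```python
from collections import Counter

# Number of countries visited by each of the 30 students (from the text).
countries = [2, 5, 7, 9, 4, 3, 3, 4, 6, 8, 14, 4, 20, 6, 10,
             4, 5, 9, 11, 1, 6, 9, 4, 5, 13, 18, 7, 6, 9, 10]


def modes(values):
    """Return every value that occurs the greatest number of times."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes(countries))

# Hypothetical variation from the text: one student who visited 10
# countries is changed to have visited only 9.
altered = countries.copy()
altered[altered.index(10)] = 9
print(modes(altered))
```

With the original data the only mode is 4; with the altered data, both 4 and 9 occur five times, so both are returned.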
Which of the three measures of central tendency you should use depends on the type of data. If the data is distributed somewhat symmetrically around the center, the mean is the most appropriate measure of central tendency because it acts as a good representative of the data. If the data is not distributed symmetrically, i.e., it has a long tail on one side, then the median is a better measure of central tendency. For example, suppose the student who visited 20 countries had instead visited 150 countries. The data would look like this: 1,2,3,3,4,4,4,4,4,5,5,5,6,6,6,6,7,7,8,9,9,9,9,10,10,11,13,14,18,150
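How the mean and the median respond to this one changed value can be checked with a short sketch using Python's standard `statistics` module. The variable names `original` and `skewed` are illustrative only.

```python
from statistics import mean, median

# Sorted data from the text, and the same data with 20 replaced by 150.
original = [1, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8,
            9, 9, 9, 9, 10, 10, 11, 13, 14, 18, 20]
skewed = original[:-1] + [150]

print("original:", mean(original), median(original))
print("skewed:  ", round(mean(skewed), 2), median(skewed))
```

The sum rises from 222 to 352, so the mean moves from 7.4 to about 11.73, while the two middle values are untouched and the median stays at 6.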