Chapter 2 Organizing and Summarizing Data

Definition: When data are in their original form, as collected, they are called raw data.

We want to be able to visualize the characteristics of a data set; hence we construct graphical representations of the data. In order to do so, we must look at the frequency of occurrence of data values.

Definition: A categorical frequency distribution, used for categorical (qualitative) data, is a table listing the categories, together with the frequency of occurrence of each category in the observed data.

Example: The following table shows data on class rank of students receiving financial aid at a small 4-year college.

College Class Rank / Frequency
Fr / 18
So / 12
Jr / 6
Se / 4
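The tally behind such a table can be sketched in a few lines of Python. This is only an illustration: the raw list below is rebuilt from the frequencies in the table (the notes do not list the individual students), and the function name is our own.

```python
from collections import Counter

# Raw data rebuilt from the table above: one entry per student (an assumption;
# the original per-student records are not given in the notes).
raw = ["Fr"] * 18 + ["So"] * 12 + ["Jr"] * 6 + ["Se"] * 4

def categorical_frequencies(values):
    """Return {category: count}, keeping categories in first-seen order."""
    return dict(Counter(values))

freq = categorical_frequencies(raw)
for category, count in freq.items():
    print(f"{category} / {count}")
```

Each printed line pairs a category with its frequency, in the same layout as the table.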

Often, when the data are numeric, there are too many different data values for a listing of the raw data to be of use in seeing the characteristics of the data. It is common to divide the interval of values of the data into a relatively small number of subintervals, called classes, and to tabulate the data using the frequencies. Each frequency is the number of occurrences of data values in one of the classes.

Definition: A grouped frequency distribution is the organizing of raw data in table form, using classes and frequencies.

Definition: The largest data value that can be included in a class is the upper class limit for that class; the smallest data value that can be included is the lower class limit.

Definition: The class width is found by subtracting the upper class limit of one class from the upper class limit of the next-higher class.

Definition: The cumulative frequency for a class is the count of all observed data values in that class or in lower classes.

Rules for constructing a frequency distribution:

1) The number of classes should be between 5 and 20: closer to 5 for small data sets, closer to 20 for large data sets.

2) An observed data value must be in one, and only one, class. This means that the classes must be non-overlapping, or mutually exclusive.

3) The classes must be continuous; even if there are no observed data values in a given class, that class must be included, with a frequency value of 0.

4) The classes must be exhaustive; i.e., together they must include all of the data.

5) The classes must be equal in width.
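As a rough check, the five rules can be tested mechanically. The sketch below is one possible way to do so, not part of the course procedure; it assumes each class is written as a pair (lower, upper) and treated as the interval from lower up to, but not including, upper. The function name and the sample classes are hypothetical.

```python
def check_classes(classes, data):
    """Return the numbers of the rules (1-5) that the classes violate.

    Assumption of this sketch: each class is a (lower, upper) pair that
    contains the values x with lower <= x < upper.
    """
    violated = []
    # Rule 1: between 5 and 20 classes.
    if not 5 <= len(classes) <= 20:
        violated.append(1)
    ordered = sorted(classes)
    pairs = list(zip(ordered, ordered[1:]))
    # Rule 2: no class may start before the previous class ends.
    if any(nxt[0] < cur[1] for cur, nxt in pairs):
        violated.append(2)
    # Rule 3: no gaps between consecutive classes.
    if any(nxt[0] > cur[1] for cur, nxt in pairs):
        violated.append(3)
    # Rule 4: every data value must fall in some class.
    if any(not any(lo <= x < hi for lo, hi in ordered) for x in data):
        violated.append(4)
    # Rule 5: all classes have the same width.
    if len({hi - lo for lo, hi in ordered}) > 1:
        violated.append(5)
    return violated

# Hypothetical classes, for illustration only.
good = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
bad = [(0, 10), (10, 20), (25, 30), (30, 40), (40, 50)]
print(check_classes(good, [3, 17, 49]))  # []
print(check_classes(bad, [3, 17, 22]))   # [3, 4, 5]
```

In the second call, the gap between 20 and 25 breaks rule 3, the value 22 falls in no class (rule 4), and the class 25 to 30 is narrower than the others (rule 5).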

Procedure for constructing a grouped frequency distribution:

1) Find the range by subtracting the lowest value of the data from the highest.

2) Select the number of classes desired (between 5 and 20).

3) Find the class width by dividing the range by the number of classes; round the result up to get the class width.

4) Go to the TI-83 calculator and construct a graph called a histogram, using the procedure listed below.

5) Use the information read off the calculator screen to construct the grouped frequency table.

Example: We have 25 scores on a final exam, as follows:

86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64, 78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64

We want a frequency distribution. Since the data set is small, we choose 5 as the number of classes. The range of the data is R = Largest value – Smallest value = 98 – 52 = 46. To get the class width, we divide the range by 5, obtaining 9.2. We round this number up to obtain the class width, 9.25. We then go to the TI-83 to construct the histogram. We will talk about constructing histograms first, then get back to constructing the grouped frequency table.
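The arithmetic in this example can be checked with a short Python sketch. The notes say only to "round up"; rounding up to the next multiple of 0.25 is our assumption, chosen because it reproduces the width 9.25 used here. The function name and the step parameter are our own.

```python
import math

scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

def class_width(data, num_classes, step=0.25):
    """Range divided by the number of classes, rounded up to a multiple of step.

    The step of 0.25 is an assumption; the notes only say "round up".
    """
    data_range = max(data) - min(data)
    raw_width = data_range / num_classes
    return math.ceil(raw_width / step) * step

width = class_width(scores, 5)
print(max(scores) - min(scores), width)  # 46 9.25
```

A different step (for example 1, giving a width of 10) would also satisfy the procedure; the choice affects only where the class boundaries fall.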

Graphical Representations of Data

We will do several types of graphs that display numeric data. One of the most common ways to graph numeric data is through use of a histogram.

Definition: A histogram is a graph that displays the data by using vertical bars of various heights to represent the frequencies.

Characteristics of a histogram:

1) The classes are listed in order along the horizontal axis of the chart.

2) The vertical axis provides a scale for the frequencies.

3) A rectangle, or bar, is constructed for each class so that

a) the height of the bar is the frequency of the class

b) the bar for the class extends from the lower boundary of the class to the upper boundary

4) Each axis of the histogram has a label, and the histogram has a title.

Example: Now let us create a histogram for a data set, and in so doing, generate a grouped frequency distribution.

Entering a data set into the TI-83 graphing calculator, using the statistics exam data:

The stat list editor is a table where you can store, edit, and view up to 20 lists that are in memory. Also, you can create list names from the stat list editor.

1) To display the stat list editor, press STAT, and then select 1:Edit from the STAT EDIT menu.

2) Use the up arrow key to move the cursor to the top row of the table. Press 2ND, and then INS. You will see the Name = prompt at the bottom of the screen. Type the name of your variable using the alphabetic keys (green symbols on your calculator).

3) Use the down arrow to move to the list. Type in the first data value and press ENTER. The cursor will automatically move down to the next space for the next entry. If you make a mistake, use the arrow keys to return to the location of the mistake and make a correction.

4) If you want to erase a list, move the cursor to the list name, and press DEL.

Steps in constructing a histogram using the TI-83 graphing calculator:

First, you need to clear previous graphs.

1) Press Y=. You will see a list of functions. If any of them have already been defined, use the arrow keys and the CLEAR key to erase them.

2) Next press 2ND, and STATPLOT. You will see a list of plots. All of them should be off. If any are not, go down to 4:PlotsOff and press ENTER.

3) Clear all drawn figures. Press 2ND and DRAW. Choose 1:ClrDraw, and press ENTER.

4) Set the size of your graph window. Press WINDOW. The Xmin value should be equal to your smallest data value; in this case, we choose Xmin = 52. The Xmax value should be equal to or slightly larger than your largest data value; in this example, we choose Xmax = 102. The Xscl value is your class width. For this example, we chose 5 classes, and so Xscl = 9.25. The Ymin value should be 0; the Ymax value should be somewhat larger than your expected largest class frequency. Since there are 25 items of data, we choose Ymax = 12.

5) Press 2ND, STATPLOT, 1:Plot1, and ENTER. Turn Plot 1 On.

6) Choose the histogram symbol (the third symbol on the third line of the screen).

7) Go down to Xlist: and enter the name of your variable.

8) Press the GRAPH key. You will see the histogram displayed.
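For readers without the calculator at hand, a rough text version of the same picture can be produced in Python. This is only a sketch: it uses the Xmin and Xscl values from step 4 and the 5 classes chosen in the exam example, and the function names are our own. It is not a reproduction of the calculator's display.

```python
scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

XMIN, XSCL, NUM_CLASSES = 52, 9.25, 5  # window settings from step 4

def bin_counts(data, xmin, xscl, num_classes):
    """Count how many data values fall in each class of width xscl."""
    counts = [0] * num_classes
    for x in data:
        counts[int((x - xmin) // xscl)] += 1
    return counts

def text_histogram(counts):
    """Draw one vertical bar of '#' symbols per class, tallest row first."""
    rows = []
    for level in range(max(counts), 0, -1):
        bars = "".join(" # " if c >= level else "   " for c in counts)
        rows.append(f"{level:>2} |{bars}")
    rows.append("   +" + "---" * len(counts))
    return "\n".join(rows)

counts = bin_counts(scores, XMIN, XSCL, NUM_CLASSES)
print("Final exam scores: frequency by class")
print(text_histogram(counts))
```

The numbers down the left side play the role of the vertical (frequency) axis, and each column of symbols is one class, in order from lowest to highest.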

To generate the frequency distribution from the histogram:

1) Press the TRACE key.

2) Use the right arrow key to move from one bar of the histogram to the next, reading the class boundaries and the frequencies from the calculator screen. The result for this example is given below.

Class Limits / Frequency / Cumulative Frequency
52.00 – 61.24 / 2 / 2
61.25 – 70.49 / 4 / 6
70.50 – 79.74 / 7 / 13
79.75 – 88.99 / 9 / 22
89.00 – 98.25 / 3 / 25
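The same table can be rebuilt directly from the raw scores. The sketch below is one way to do it, assuming that each class contains its lower boundary but not its upper boundary; it therefore prints class boundaries (for example 61.25) where the table shows upper class limits (61.24). The function name is our own.

```python
from itertools import accumulate

scores = [86, 83, 56, 98, 82, 52, 71, 88, 75, 91, 69, 88, 64,
          78, 81, 74, 77, 83, 90, 85, 64, 79, 71, 83, 64]

def grouped_frequency_table(data, start, width, num_classes):
    """Return rows of (lower boundary, upper boundary, frequency, cumulative)."""
    edges = [start + i * width for i in range(num_classes + 1)]
    # Assumption: a value x belongs to a class when lower <= x < upper.
    freqs = [sum(lo <= x < hi for x in data)
             for lo, hi in zip(edges, edges[1:])]
    return list(zip(edges, edges[1:], freqs, accumulate(freqs)))

table = grouped_frequency_table(scores, start=52, width=9.25, num_classes=5)
for lo, hi, freq, cum in table:
    print(f"{lo:6.2f} to {hi:6.2f} / {freq} / {cum}")
```

The last cumulative frequency equals the size of the data set, 25, which is a quick check that no score was missed or counted twice.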

A frequency table may also include a column for the relative frequencies, which are the proportions of the data set falling into each class.

Defn: The relative frequency associated with a class is the proportion of the data set falling into that class. It is found by dividing the class frequency by the size of the data set.

Defn: The cumulative relative frequency associated with a class is the proportion of the data set falling into that class or lower classes. It is found by dividing the cumulative frequency for a class by the size of the data set.

Interpretation of Relative Frequency and Cumulative Relative Frequency: If we randomly select an observation from the data set, the relative frequency for a class is the probability that our selected observation will be found in that class. The cumulative relative frequency for a class is the probability that the observation will be found either in that class or in a lower class.
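Both quantities follow from the class frequencies by division. The short sketch below applies the two definitions to the exam-score frequencies; exact fractions are used so that the last cumulative relative frequency comes out to exactly 1. The variable names are our own.

```python
from fractions import Fraction
from itertools import accumulate

frequencies = [2, 4, 7, 9, 3]  # class frequencies from the exam-score table
n = sum(frequencies)           # size of the data set

# Relative frequency: class frequency divided by the size of the data set.
relative = [Fraction(f, n) for f in frequencies]
# Cumulative relative frequency: running total of the relative frequencies.
cumulative_relative = list(accumulate(relative))

for f, rf, crf in zip(frequencies, relative, cumulative_relative):
    print(f"{f} / {float(rf):.2f} / {float(crf):.2f}")
```

For instance, the third class has relative frequency 7/25 = 0.28 and cumulative relative frequency 13/25 = 0.52, so a randomly selected score has probability 0.52 of falling in the third class or a lower one.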

Distribution Shapes: (See p. 63)

1) In a uniform distribution, the frequencies are equal for all classes; the relative frequencies are also equal for all classes.

2) In a bell-shaped distribution, the greatest frequency (or relative frequency) occurs in the middle class, with decreasing frequencies away from the center in either direction.

Uniform and bell-shaped distributions are examples of symmetric distributions.

3) In a distribution that is positively skewed, or right-skewed, the majority of the data values fall to the left of the center and cluster at the lower end of the distribution; the tail of the distribution is to the right.

4) In a distribution that is negatively skewed, or left-skewed, the majority of the data values fall to the right of the center and cluster at the upper end of the distribution; the tail of the distribution is to the left.

Other Types of Graphs

Defn: A bar graph is used to represent the frequency distribution for a categorical variable, and the frequencies are displayed by the heights of the vertical bars.

Defn: A Pareto chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.

Example: p. 44

Note: Since we are dealing with non-numeric data, the TI-83 calculator will not do this type of graph.
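Although the calculator cannot draw the chart, the ordering step of a Pareto chart is easy to sketch in Python. The example reuses the financial aid frequencies from earlier in the chapter, deliberately entered out of order; the function name is our own.

```python
# Financial aid frequencies from earlier in the chapter, entered out of order.
frequencies = {"Se": 4, "Jr": 6, "Fr": 18, "So": 12}

def pareto_order(freq):
    """Return (category, frequency) pairs sorted by decreasing frequency."""
    return sorted(freq.items(), key=lambda item: item[1], reverse=True)

for category, count in pareto_order(frequencies):
    print(f"{category} / {count}")
```

The bars of the Pareto chart would then be drawn left to right in the printed order.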

Another type of graph used with categorical data is the pie graph.

Defn: A pie graph is a circle that is divided into sections or wedges according to the proportion of the data set in each category.

Note: The TI-83 will not do this type of graph. It must be done by hand. See Illustration 2.1, p. 31.
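When drawing a pie graph by hand, each wedge's central angle is its proportion of the data set times 360 degrees. The sketch below works this out for the financial aid data from earlier in the chapter; the function name is our own.

```python
frequencies = {"Fr": 18, "So": 12, "Jr": 6, "Se": 4}

def wedge_angles(freq):
    """Return the central angle, in degrees, of each category's wedge."""
    total = sum(freq.values())
    return {category: 360 * count / total for category, count in freq.items()}

angles = wedge_angles(frequencies)
for category, angle in angles.items():
    print(f"{category}: {angle:.0f} degrees")
```

The angles (162, 108, 54, and 36 degrees) add up to 360, which is a useful check before measuring them out with a protractor.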

Note: In any situation in which data are represented using graphical techniques, it is easy to construct the graph in such a way as to mislead the viewer. It is necessary to carefully examine the graph in order to interpret it properly. On pages 94 - 95 of the textbook, there are examples of graphs constructed to be misleading.

Time Series Plots

If the values of a variable are measured at regular intervals over a period of time, the data are referred to as time series data. Unlike previous data sets, the items in a time series data set may be related to each other. To represent the data graphically, we use a time series plot.

Defn: A time series plot is obtained by plotting the time at which a variable is measured along the horizontal axis and the measured value of the variable along the vertical axis. Lines are then drawn connecting the points.

Example: p. 70, Ex. 31

To do this type of plot using the TI-83, we need to enter two lists of numbers. The first list is the sequence of time points; the second list is the sequence of data values. The type of graph we are doing is the second of the six types available with the Stat Plot function of the calculator.

For time series data, we are looking for trends. In this example, we see that, although there are fluctuations in percent enrolled from year to year, there is an overall increasing trend.

Graphical Misrepresentations of Data

Data are sometimes graphed in ways that mislead the reader, whether intentionally or not.

Example: p. 72, Ex. 2

Example: p. 74, Ex. 2

Example: p. 75, Ex. 6