Lab 2: Descriptive Statistics

Pre-Reading

STAT 200

Objectives:

•Be able to use Excel to:

•Calculate measures of center and spread

•Create a histogram

•Create a boxplot

•Obtain five-number summary

•Understand the effect of outliers and distribution shape on the measures of center and spread.

•Determine which numerical and/or graphical summary to use for a given data set.

The lab itself involves computer based problem solving applied to various datasets. The problems and datasets will be given at lab time. Proper solving of these problems will require certain Excel skills. In Lab 1, we introduced how to obtain summary statistics. So you should know how to obtain measures of spread and measures of center (or location). The five-number summary is the minimum, Q1, median (Q2), Q3, and the maximum. These five measures give a quick summary of both centre and spread for quantitative variables and comprise, in general, a resistant set of measures (i.e., resistant/insensitive to outliers). The five-number summary is simply a collection of summary statistics seen in the previous lab. The boxplot is the graph that visualizes the five-number summary.

Statistics is mathematical in nature and as with math we tend to learn very little from simply reading without doing. In actual fact, it is easy to get the sensation of learning this way, when in fact very little has been absorbed. An active approach is required to learn Statistics properly. The assignments and labs offer an active learning opportunity. In order to be prepared for the labs, reading how to run certain functions may be enough to get by, but trying them beforehand on your computer or in the lab will help you to understand them.

To create histograms and boxplots (two statistical graphing methods) you will need to load an add-in and a macro, respectively. You need to do this on your home computer and your account in the lab. Any dataset you obtain from the web must be saved on your personal diskspace before you begin an analysis using Excel.

Histograms

A histogram is a graphical device used to visualize the distribution of a set of observations collected on a quantitative variable. Histograms can be used to assess the shape of the distribution (e.g.: symmetric, skewed to theleft, skewed to the right), whether the data set contains outliers, whether the distribution is bimodal and so on.

On the course website( you can find a file called “BetterHistogram.xls”. Download this file to you personal disk. Launch Excel, and open the “BetterHistogram.xls” file.If you receive a security warning when you try to open the xls file, click the button Enable Macros. Then the histogram macro is installed in your excel system.

We now make a histogram using the data set Eruptions.csv,also available on the course website. The ocean swell produces spectacular eruptions of water through a hole in the cliff at Kiama, known as the Blowhole, about 120km south of Sydney. The times at which 65 successive eruptions occurred from 1340 hours on 12 July 1998 were observed using a digital watch. The file contains 44 numbers – the times between 45 eruptions.

In order to draw a nice histogram, we may use excel functions(learned at lab 1!!) to find the maximum and minimum values of the data. Now, open the file “Eruption.csv”, and choose Tools  Better Histogram. The following dialog box will appear.

The data range must be a range of cells containing only numbers, i.e., no text. For this dataset, the Data Range should be $A$2:$A$45 (you should be able to simply drag the entries on the spreadsheet).The start value must be a number less than or equal to the minimum value of the data range. The step value, which indicates the width of a bin, must be greater than zero. The stop value must be greater than or equal to the maximum value of the data range. The start, step, and stop values must be chosen so that there are at least two bins between the start and the stop value.

For example, if the start value is set to be 0, stop value 150 and step value 10, when you click OK, the frequency distribution and a histogramwill be created on a new worksheet.You can modify the graph titles as you wish. The histogram is shown below.

More Resources

This is still not an ideal histogram. Further refinement of the histogram would require details that are more particular to Excel than to understanding a histogram. The Excel companion to your textbook containsmore details about how to customize a histogram plot. There are many good resources on the internet(Google histogram, excel, or visit www.treeplan.com/better.htm) and there aresome informative video podcasts which discuss how to better use Excel (Learn Excel from MrExcel VideoPodcast).

Boxplots

Boxplots show less detail than histograms, but can be more useful for highlighting features of certain distributions. They can be used fora single variable, but are very useful for side-by-side comparisons, comparing the distribution of heights of women and men, for example. The data set heights.xls relates exactly to this situation. It contains the observed height for 25 women and 25 men.

Open the Excel file calledboxplot,enable the macrosand save it to either your desktop or your Z: directory.Do not use the file directly from the shared folder (S:).Open the Excel file calledheights.xls. Select Data Area:$B$1:$C$26(or by simply dragging the entries on the spreadsheet) and copy to the Data sheet of the boxplot file in the cells $A$1:$B$26.

•Now press the Run button located on the data spreadsheet of the boxplot macro. This will launch a boxplot window. Complete the info in this window as follows:

•Select Data Area: Data$A$1:$B$26

•Check the box“First Row Contains Labels?”

•X-axis label: Height

•Click OK to validate your choices.

Notice that the boxplot macro produces a five-number summary and identifies outliers. The boxplot producedby the macro consists of

•a central box, which spans the first and third quartiles,

•a line in the box which marks the median,

•a line that extends to the smallest observation that has not been identified as an outlier (similarly forthe largest observation), and

•dots to indicate outliers (observations that fall more than 1.5xIQR above Q3 or below Q1).

Often, researchers want to know if populations are similar or different. In most situations we do not haveinformation on the entire population. We only have a sample. Here, we have a sample of 50 people (25 women,25 men) from some human population. The comparative boxplot helps researchers appreciatesimilarities or differences between populations. Look at your boxplot and decide if women and men havesimilar heights. Do you see any outliers?

It is good to note that there is a natural connection between the measures of centre and spread and the visualization of the data. A boxplot is a natural way to visualize data that have beenpresented with the median and IQR.

Enjoy the Lab!

1