STA 101: Data Analysis and Statistical Inference LAB 2

Sampling, Describing One Variable

Dr. Kari Lock Morgan

NAMES:

INTRODUCTION

You will create a lab report as a group. You can turn this report in to your TA during lab if you finish, or anytime by the next class period (so by Wednesday, 1/22 this week). You may turn it in to your TA either by email or as a hard copy. The report need only include the answers to the numbered questions.

The software we will use for the first half of the semester is StatKey, free online software at www.lock5stat.com/statkey/. From the home menu, you will see several different options. All of these will make sense to you by the end of the course, but for now we’ll just focus on descriptive statistics and sampling distributions. Everything in blue on this homepage (and on all of StatKey) is clickable and will bring up a new page.

Capturing Images

Parts of this lab (and most future labs) will require you to copy and paste images from the screen into your lab report. Here's how to do this (if you don't know already!):

Mac: Press COMMAND+SHIFT+4, which will bring up a crosshair cursor. Select the region you want to capture, and it will be saved to your desktop. If you would rather copy the image instead of saving it, press CONTROL+COMMAND+SHIFT+4.

PC: Go to All Programs > Accessories > Snipping Tool. Clicking "New" within the Snipping Tool allows you to select the region you want to include in your image. After it is selected, you can either save it or copy it by right-clicking on the image.

PART 1: GROUPS

You will be assigned to your lab group. This group will be permanent throughout the semester, so take some time and get to know each other! You will work on labs together, and hand in just one lab as a group. Include the names of all group members who were in lab and participated. If someone is absent from lab, do not include their name.

You can create the report using anything you'd like, but I personally recommend using Google Docs and sharing the document with your teammates. This makes it easy to type answers on one computer and paste in graphics generated on another computer. Also, if you don't fully finish during lab, it makes it easy for all of you to access and edit the same document.

  1. Come up with a group name.

PART 2: SAMPLING

The goal of this part is to familiarize you with StatKey and to practice random sampling. From StatKey, go to the sampling distribution row, and click on sampling distribution for a mean (click on Mean). This opens a new page, designed for sampling and calculating the mean each time. In the upper left corner, you will see Statistics Grad Schools, which is the name of the current dataset: clicking on this will open a drop-down menu of built-in datasets. Go to Baseball Players, and click. This loads a dataset of the population of all major league baseball players from 2012, and their annual salaries (in millions). By clicking on Show Data Table, you see the dataset with all the cases.

  1. Who was the highest paid MLB player in 2012?

You will find a plot of the data and relevant descriptive statistics in the upper right.

  1. How many baseball players are in this data set? What was the average salary for major league baseball players in 2012?

Suppose we didn’t have access to the population, but wanted to estimate the average salary using a sample. (IMPORTANT NOTE: This would be pointless here – there is no point in taking a sample if you already have data on the entire population! This is purely pedagogical!) By clicking Generate 1 Sample (upper left), you randomly sample 10 players from the dataset (this number can be changed next to “Choose samples of size n =”, but we’ll leave it at 10 for now). The results are displayed under “Sample” on the bottom right.
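If you are curious what this button does behind the scenes, the following is a minimal Python sketch of drawing one random sample of 10 and computing its mean. It is only an illustration: the salary values are made up (they are NOT the 2012 baseball data), the names `population` and `draw_sample_mean` are hypothetical, and sampling without replacement is an assumption about how such a sample would typically be drawn.

```python
import random
import statistics

# Hypothetical population of salaries in millions (NOT the real 2012 MLB data).
random.seed(101)
population = [round(random.uniform(0.4, 30.0), 2) for _ in range(500)]

def draw_sample_mean(values, n=10):
    """Draw a simple random sample of size n (without replacement).

    Returns the sampled values and their mean.
    """
    sample = random.sample(values, n)
    return sample, statistics.mean(sample)

sample, sample_mean = draw_sample_mean(population, n=10)
print("Sample:", sample)
print("Sample mean:", round(sample_mean, 3))
print("Population mean:", round(statistics.mean(population), 3))
```

Changing the seed gives a different sample and usually a different sample mean, just as clicking Generate 1 Sample again does.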

  1. Which players were selected in your sample? What is the mean salary for your sample?

Notice that the mean from your sample is plotted with a dot in the big dotplot – the Sampling Dotplot of Mean (sampling distribution). This is analogous to what you did in class with Lincoln’s Gettysburg Address – clicking Generate 1 Sample randomly sampled 10 players (just as you each randomly sampled 10 words), and then the sample mean is plotted in the dotplot, just as you each plotted your X on the board for your sample average.

  1. Click Generate 1 Sample again to sample another 10 players. What is the mean for this sample? Note the appearance of this dot on the sampling distribution.

It’s important to note that each dot on the sampling distribution represents the mean from one sample of size 10 (NOT a single player). Hover over one of these dots with your mouse, and you will see the sample from which it arose.

  1. If you were to generate thousands of such random samples, what number would they be centered around? Why?
  1. Try it! Click on Generate 1000 Samples. The center (average) is denoted with the black arrow at the bottom. Is this close to what you expected?
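The repeated sampling in this part can also be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not StatKey's actual implementation: the population below is made-up, right-skewed data (NOT the real salaries), and `sampling_distribution` is a hypothetical helper name. Each entry of `means` plays the role of one dot in the sampling dotplot.

```python
import random
import statistics

random.seed(2014)

# Hypothetical right-skewed population of salaries in millions (NOT the real data).
population = [round(random.expovariate(1 / 3.0) + 0.48, 2) for _ in range(800)]

def sampling_distribution(values, n=10, reps=1000):
    """Return `reps` sample means, each computed from a random sample of size n."""
    return [statistics.mean(random.sample(values, n)) for _ in range(reps)]

means = sampling_distribution(population, n=10, reps=1000)

print("Population mean:       ", round(statistics.mean(population), 3))
print("Center of sample means:", round(statistics.mean(means), 3))
print("Spread of sample means:", round(statistics.stdev(means), 3))
```

Compare the two centers printed at the end with what you saw in StatKey, and try a larger `n` to see how the spread of the sample means changes.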

PART 3: DESCRIBING ONE VARIABLE

We haven’t done descriptive statistics in class yet, but we’ll jump ahead and do it in lab. If you have questions, ask your group members or TA for help. The goal of this part is to get you comfortable playing with and exploring data!

For this, we’ll work with our class survey data. You can view the data at https://docs.google.com/spreadsheet/ccc?key=0AtJ5X5rxFtfqdGl4bU50YUI3OWVGVmIzcnZ0N0dzWEE&usp=sharing. (Note: I deleted some cases and variables so that no one can be identified.) To know what the variables are, you can view the survey questions at http://stat.duke.edu/courses/Spring14/sta101.001/survey.pdf. If you want to edit the data, I recommend downloading it, or copying and pasting it into your own Google spreadsheet.

Go back to the StatKey homepage (clicking on "StatKey" in the upper left will bring you back to the main menu). We’ll use the first column, Descriptive Statistics and Graphics.

We'll begin with One Categorical Variable. Datasets from the textbook are already loaded into StatKey (upper left drop-down menu), but here we'll input our own data. Select the column from the class survey with the variable you want to explore and copy it, then click on "Edit Data" in StatKey. Delete what is already there and paste in your new variable. Because this is the raw data and not just summary statistics, make sure you click the "Raw Data" box underneath. Click Okay, and you will see your variable described!
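To make the difference between raw data and summary statistics concrete, here is a small Python sketch that turns a raw column of categorical responses into a frequency table of counts and proportions. The responses are made up for illustration (they are not from the class survey), and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical raw responses to a categorical survey question (made-up values).
responses = ["Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"]

# Summary statistics for a categorical variable: a count for each category.
counts = Counter(responses)
total = sum(counts.values())

# Print each category with its count and its proportion of all responses.
for category, count in counts.most_common():
    print(f"{category:<5} count={count}  proportion={count / total:.3f}")
```

The raw data is the list of individual answers; the summary is the table of counts and proportions computed from it.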

  1. Use StatKey to play around with a few categorical variables you are interested in. Choose one variable to describe in your lab report. Include relevant summary statistic(s) and a visualization (a screenshot is fine). What does this tell you about Duke students who are taking STA 101 this semester?

You can enter a quantitative variable into StatKey in the same way (highlight the variable and copy it, click "Edit Data", then paste your variable in). However, note that ONLY NUMERIC VALUES are accepted. The survey instructions said to respond with only numbers for quantitative variables, but as you will see, many of you still entered non-numeric answers. You will have to manually edit the data accordingly (welcome to the real world of data analysis!). I recommend sorting the data by the column you are interested in, which makes it easier to spot the non-numeric observations. If the correct response is obvious, you can correct it; otherwise, you can just ignore that case and paste only the numeric values into StatKey. Using CTRL+F and the replace feature can also be useful (e.g. replace "hour" or "hours" with nothing). Your TA can help you if needed.
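If you would rather clean a column with code than by hand, the same idea can be sketched in Python. This is only an illustration under assumptions: the raw answers are made up (they are not from the class survey), and `to_number` is a hypothetical helper. It strips a trailing "hour" or "hours", keeps anything that is then a plain number, and sets everything else aside for you to review by hand.

```python
import re

# Hypothetical raw answers to a numeric survey question (made-up values).
raw = ["7", "6.5", "8 hours", "about 7", "1 hour", "n/a", "", "7.5"]

def to_number(text):
    """Return a float if the answer is a number, optionally followed by
    'hour' or 'hours'; otherwise return None."""
    cleaned = re.sub(r"\s*hours?\s*$", "", text.strip(), flags=re.IGNORECASE)
    try:
        return float(cleaned)
    except ValueError:
        return None  # ambiguous or non-numeric: set aside rather than guess

parsed = [to_number(answer) for answer in raw]
numeric = [x for x in parsed if x is not None]
dropped = [answer for answer, x in zip(raw, parsed) if x is None]

print("Paste into StatKey:", numeric)
print("Review by hand:", dropped)
```

An answer such as "about 7" is deliberately left for manual review, since deciding whether the correct response is obvious is a judgment call.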

  1. Use StatKey to play around with a few quantitative variables you are interested in. Choose one variable to describe in your lab report. Include relevant summary statistic(s) and a visualization (a screenshot is fine). What does this tell you about Duke students who are taking STA 101 this semester?