Data Sets for Use in Statistics, Measurement, and Design Courses

Charles Stegman, Calli Holaway-Johnson, Sean Mulvenon, Sarah McKenzie,

Ronna Turner, and Karen Morton

University of Arkansas

Paper presented at the Joint Statistical Meetings of the American Statistical Association, International Biometric Society, Institute of Mathematical Statistics,

and Statistical Society of Canada

Seattle, Washington

August 2006

Data Sets for Use in Statistics, Measurement, and Design Courses

Abstract

A major focus in teaching graduate-level courses in statistics, measurement, and design should be the analysis of data. Results can be used to illustrate key concepts underlying the procedures discussed, help students learn how to analyze data in preparation for their careers, aid in interpreting and presenting research results, and contribute to preparing future researchers. This paper presents information on a multitude of datasets applicable for teaching courses at multiple levels, and the accompanying CD contains the actual datasets.

Background

It is common for textbooks in statistics and research methodology to include a disk with several datasets that are used throughout the text. Glass and Hopkins (1996) is a good example, although others could be mentioned. Textbook datasets are commonly limited in terms of the number of datasets included and the number of cases within each dataset.

The CD produced for this paper contains over 100 datasets from multiple fields, as well as Monte Carlo computer-generated datasets. In addition, the datasets can be used across a range of courses, from introductory research methodology and statistics through regression, ANOVA, multivariate, and advanced measurement courses.

Development of the CD

A first step was to locate publicly accessible datasets available on the web. These are datasets that can be downloaded and used in teaching so long as appropriate acknowledgement is given. For example, many researchers and professors have made their datasets available for public use through the StatLib library at Carnegie Mellon University. Four other helpful sites are the National Institute of Standards & Technology website, the UCLA Statistics Lab website, the Journal of Statistics Education (JSE) Data Archive, and the DataFerrett. The first of these contains datasets that can be used to test or demonstrate the accuracy and precision of different computer packages when analyzing statistical data. The UCLA site contains a wealth of statistical information and sample programs. The JSE Data Archive contains datasets that have been submitted by researchers around the world and includes articles utilizing the datasets when available. The DataFerrett allows users to search multiple topics through data-mining technology and select variables for different analyses.

For the CD, selected datasets have been collected from these sites, with each dataset reviewed and included because it relates to topics regularly used as examples in statistics and research methodology courses. The datasets represent data from many fields of study, as do the examples in many textbooks. While professors and students can access any of these public-domain datasets, the advantage of collecting them on a CD is that they are put into a standard format (Excel) and made readily available for importing into numerous statistical packages. This should facilitate their use by multiple users in a variety of courses. Each dataset includes variable descriptions as well as the bibliographic information from the original source.

Additionally, samples from large-scale datasets based on government-sponsored research have been generated to support substantive educational research examples. For example, census data and other government-sponsored large-scale research have produced datasets such as the Early Childhood Longitudinal Study (ECLS-K), the National Longitudinal Survey of Youth (NLSY), the National Household Education Survey (NHES), and the National Education Longitudinal Study (NELS). DataFerrett can also be used to access large-scale databases. The following are some of the topics available from DataFerrett: Health Care, Child School Enrollment, Computer Ownership & Uses, Voting & Registration, Race & Ethnicity, School Enrollment, Teenage Attitudes & Practices, and Library Use. Note that DataFerrett allows users to search these and many more topics and to select the variable sets they want.

A third area where datasets have been generated is through Monte Carlo procedures. By specifying population parameters, we generated datasets that reflect educational settings and illustrate important statistical properties. Multivariate data are also generated that can be used in a number of ways. For instance, variables can be selected for analysis in introductory courses and then revisited in more advanced courses such as regression, design, and multivariate statistics.
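As an illustration of the generation step, the following R sketch draws multivariate normal data with specified population means, standard deviations, and correlation using MASS::mvrnorm; all parameter values below are illustrative, not taken from a specific CD dataset.

    library(MASS)
    set.seed(2006)
    mu    <- c(200, 210)                          # population means (illustrative)
    Sigma <- matrix(c(25^2,          0.6 * 25 * 30,
                      0.6 * 25 * 30, 30^2), 2)    # SDs 25 and 30, correlation .60
    sim <- as.data.frame(mvrnorm(500, mu, Sigma)) # 500 simulated students
    names(sim) <- c("reading", "math")
    cor(sim)                                      # recovers the specified structure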

The Structure of the CD

Table 1 contains a list of the datasets contained on the CD. The title of each dataset is provided, as well as its name on the CD. The sample size and variables are also included. Finally, the original source for the data is given.

Insert Table 1

The datasets have been reformatted into Excel files. Many of the original files were in different formats and, while statisticians are adept at handling these, many students may still be learning basic data management. Especially in introductory classes, the emphasis is on data analyses using programs like SAS, SPSS, or R. Having the Excel files lets instructors write one set of instructions for importing data, leaving more time to concentrate on statistical analyses. The exception is the large-scale datasets from the national databases, which are applicable to more advanced classes. Given the size of those datasets and the need for the weighting factors, Excel was too limiting; in their case, dBase and SAS data files were created.
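For example, a minimal R sketch for importing one of the Excel files follows; it assumes the readxl package is installed and that the CD files are in the working directory.

    library(readxl)                          # reads both .xls and .xlsx files
    math <- read_excel("Arkansas Math.xls")  # import as a data frame
    str(math)                                # inspect variable names and types
    summary(math)                            # quick descriptive overview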

In more advanced classes, students could be expected to find, import, and clean data from the original sources. They could then analyze both their own version and the CD version to make sure they obtain the same answers.

Examples of Using Some of the Datasets

The dataset (Arkansas Math.xls) is based on simulated student data for grades 3-5 on the Arkansas Benchmark Mathematics Examination. The Arkansas Benchmark is a criterion-referenced examination that consists of both multiple-choice and open-response questions. Tests for each grade level are developed to reflect content identified in the Arkansas state frameworks. The multiple-choice and open-response sections are weighted equally in determining a student’s score. In addition to their reported scaled scores, students are categorized as Below Basic, Basic, Proficient, or Advanced. Students with scaled scores of 200 or above are considered proficient, and those with scores above 250 are considered advanced. The dataset contains 216 observations on 19 variables that would be available to school personnel. The observations were generated to reflect the actual variables used by the State of Arkansas for No Child Left Behind (NCLB) school assessments.

Some of the ways we have used the Arkansas Math dataset include the following: the scaled scores can be used to demonstrate graphs (frequency distribution, frequency polygon, box plot, and stem-and-leaf), measures of central tendency, variability, skewness, kurtosis, and normality. Similarly, we have used the grade, gender, and teacher variables to create subgroups for the same types of analyses. Several of the categorical variables are analyzed as well (demographics, crosstabs, and percentages). This is the material in the first five or six chapters of an introductory course. Students are required to create tables and figures using APA format to help them in writing reports or articles.
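The following R sketch illustrates these descriptive analyses; the column names ScaledScore, Grade, and Gender are assumptions for illustration, and the actual names appear in the variable descriptions on the CD.

    math <- readxl::read_excel("Arkansas Math.xls")
    hist(math$ScaledScore, main = "Scaled Scores")    # frequency distribution
    boxplot(ScaledScore ~ Grade, data = math)         # box plots by grade
    stem(math$ScaledScore)                            # stem-and-leaf display
    mean(math$ScaledScore); sd(math$ScaledScore)      # central tendency, variability
    tapply(math$ScaledScore, math$Gender, summary)    # subgroup summaries
    table(math$Grade, math$Gender)                    # crosstab of categorical variables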

The Arkansas Math dataset is also used to demonstrate a multitude of inferential statistical procedures. Instructors can select data for t-tests, ANOVA (one-way and factorial), model assumptions, multiple comparisons, effect sizes, correlation, regression, and chi-square analyses. The multiple-choice and open-response scores, as well as the strand scores, provide multivariate data.
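A sketch of several of these procedures in R, again with assumed column names (PerformanceLevel is hypothetical):

    math$Grade <- factor(math$Grade)                      # treat grade as categorical
    t.test(ScaledScore ~ Gender, data = math)             # two-sample t-test
    fit <- aov(ScaledScore ~ Grade * Gender, data = math) # factorial ANOVA
    summary(fit)
    TukeyHSD(fit, "Grade")                                # multiple comparisons
    chisq.test(table(math$Grade, math$PerformanceLevel))  # chi-square analysis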

Another generated education dataset is Literacy Test.xls. This dataset was created to reflect data that would be available on many state criterion-referenced tests that are given at different grade levels. It differs from the previous example in two important ways. First, it is a larger dataset (5,000 observations), and second, it includes individual student item scores tied to three strands that might be typical of a literacy examination. The strands in this example are content, literacy, and practical. Each strand has 8 multiple-choice items (worth 2 points each) and an open-response item worth 16 points. Students receive a scaled score based on the points earned on the literacy items plus their response to a writing prompt. Other variables include gender, race, and free and reduced-price lunch participation. The same types of analyses mentioned above can be demonstrated with the dataset, but by having item data, a number of advanced measurement issues can also be discussed.
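For example, one might rebuild a strand total from the item scores and estimate its internal consistency. In the R sketch below, the item column names (content1 through content8 for the multiple-choice items and content_or for the open-response item) are hypothetical; the actual names are in the dataset's variable descriptions.

    lit <- readxl::read_excel("Literacy Test.xls")
    mc_items <- paste0("content", 1:8)                   # eight 2-point MC items
    lit$content_total <- rowSums(lit[, mc_items]) + lit$content_or

    # Cronbach's alpha for the multiple-choice items
    items <- as.matrix(lit[, mc_items])
    k <- ncol(items)
    (k / (k - 1)) * (1 - sum(apply(items, 2, var)) / var(rowSums(items)))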

A third example involves the two datasets based on the binomial distribution (Random Guessing.xls, 80% Mastery.xls). These datasets involve the expected performance of 50 students on examinations worth 40 points. The first set assumes random guessing and the second set involves “mastery learning.” Note that instructors could actually conduct a class exercise and create the first dataset by giving students answer sheets to fill out without giving them the questions. The instructor could have the students “score” their tests with a pre-assigned answer key. The instructor could also discuss why some national tests involve a correction for guessing. Simple SAS “proc univariate” analyses show the first distribution is positively skewed (binomial p = 0.2), while the second is negatively skewed (binomial p = 0.8). Students could then practice merging the datasets and demonstrate a bimodal distribution.
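These datasets are also easy to recreate by simulation. A minimal R sketch under the binomial parameters just described:

    set.seed(2006)
    guessing <- rbinom(50, size = 40, prob = 0.2)  # Random Guessing: positively skewed
    mastery  <- rbinom(50, size = 40, prob = 0.8)  # 80% Mastery: negatively skewed
    hist(guessing); hist(mastery)
    hist(c(guessing, mastery))                     # merged scores show the bimodal shape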

A fourth example (Star.xls) is based on student data (sample size of 150) for the STAR Reading and STAR Math tests given during the first quarter of the school year and the SAT-9 (reading, literacy, and math) given in the spring. Student gender is also included, so there are six variables for each student. Instructors can use the data for descriptive statistical purposes as well as correlation and regression analyses (including the correlation matrix, multiple regression, and testing for bivariate normality). An instructor could also run simple procedures on the total dataset or separate analyses for each gender, test for equality of correlations and parallelism of regression lines, and demonstrate ANCOVA and MANOVA. One use of such data might be the identification of “at-risk” students and discussion of potential interventions that might be used between October and May.
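A sketch of the correlational analyses in R; the column names (STARread, STARmath, SAT9read, SAT9lit, SAT9math, Gender) are assumptions for illustration:

    star <- readxl::read_excel("Star.xls")
    tests <- c("STARread", "STARmath", "SAT9read", "SAT9lit", "SAT9math")
    round(cor(star[, tests]), 2)                              # correlation matrix

    summary(lm(SAT9read ~ STARread + STARmath, data = star))  # multiple regression
    summary(lm(SAT9read ~ STARread * Gender, data = star))    # parallelism of regression lines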

The Diamond Pricing datasets provide an example of how different analyses may require reformatting of the datasets. With the Diamond Pricing.xls dataset, students may conduct univariate analyses. With the Diamond Pricing With Dummy Variables.xls dataset, students can perform more complicated analyses such as multiple regression. One valuable exercise might be to have students begin with the basic dataset and create Diamond Pricing With Dummy Variables.xls themselves by using a statistical package such as SAS, SPSS, or R.
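In R, dummy variables can be created in one step with model.matrix; the categorical column name Color below is hypothetical and should be replaced with the actual factor(s) in the dataset.

    diamonds <- readxl::read_excel("Diamond Pricing.xls")
    dummies  <- model.matrix(~ Color - 1, data = diamonds)  # one indicator column per level
    diamonds <- cbind(diamonds, dummies)                    # append the dummy columns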

Certain datasets allow instructors to demonstrate various statistical concepts. For example, the Birth To Ten datasets are actual data that illustrate Simpson's paradox. The Baby Boom.xls dataset allows students to examine a variety of distributions, including the binomial, Poisson, and exponential. These types of datasets can assist students in transitioning from a theoretical understanding to pragmatic application.
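With Baby Boom.xls, for instance, the gaps between successive birth times can be compared with a fitted exponential distribution. The R sketch below assumes a column of cumulative birth times in minutes, here called minutes; check the variable descriptions for the actual name.

    boom <- readxl::read_excel("Baby Boom.xls")
    gaps <- diff(boom$minutes)                         # inter-arrival times
    hist(gaps, freq = FALSE)
    curve(dexp(x, rate = 1 / mean(gaps)), add = TRUE)  # fitted exponential overlay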

In addition to their use in parametric statistical analyses, many of the datasets lend themselves to nonparametric analyses. A valuable exercise might be to have students analyze a dataset using both parametric and nonparametric procedures. The resulting discussion could focus on the importance of choosing the appropriate statistical analysis, as well as the impact of violations of normality assumptions.
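For example, students might run the same two-group comparison both ways in R (column names as assumed above):

    t.test(ScaledScore ~ Gender, data = math)        # parametric comparison
    wilcox.test(ScaledScore ~ Gender, data = math)   # nonparametric counterpart
    shapiro.test(math$ScaledScore)                   # check the normality assumption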

Large Scale Datasets

For large-scale data analyses we have included the ECLS-K dataset. The Early Childhood Longitudinal Study – Kindergarten (ECLSK_sample) dataset is a subset of data from the ECLS-Kindergarten Class of 1998-99 (ECLS-K) Public Use Dataset collected by the National Center for Education Statistics (West). The complete dataset is available for public use and is located at the NCES website along with more detailed User’s Guide information, statistical documentation, and user resources. The complete dataset includes data on a nationally representative sample of about 21,260 children enrolled in both private and public full-day and part-day kindergarten programs in the 1998-99 academic year. The data include child and parent demographic, child academic and behavioral, family environment, and classroom and school demographic variables.

The data file included on this disk is a subset of 97 academic, behavioral, demographic, and family environment variables (with 6 sample weighting variables and their associated 540 replicate weights) for a total of 643 variables. All 21,260 students are included, so the ECLSK_sample dataset retains the sampling properties of the original public use dataset. In the original sampling, oversampling occurred for select subgroups such as Asian students and students in private kindergarten programs (West). Thus, weighting variables are necessary for producing data that are representative of the 1998-99 national population. Additionally, the multi-stage sampling procedure used probability sampling from within primary sampling units. Because the sampling procedure allows for correlated samples, the within-group error variance is an underestimate of what would be found in the population, and consequently, test statistics computed from the samples will be inflated. There are two common ways to adjust test statistics computed from such samples: the use of design effects or the use of re-estimation statistical packages such as SUDAAN or WesVar. Design effect estimates can be found in the ECLS-K User’s Guide.

The ECLSK_sample data file is recommended for use by students in moderate to advanced applied research methods and statistics courses; it is not recommended for students in introductory courses. The format of the variables requires students to utilize recoding procedures and provides opportunities for students to practice the creation of new variables by combining multiple related background and/or environmental variables. Weighting can be introduced to the students through the use of the sampling weights provided in the data file. Additionally, students can learn about the need for design effects with samples obtained by clustered or multi-stage sampling procedures and/or the use of jackknifing procedures with selection of the replicate weights provided.
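A minimal R sketch of a design-adjusted estimate using the survey package follows. The weight names (C1CW0 for the full-sample weight and C1CW1 through C1CW90 for the replicates) and the reading scale score C1RRSCAL are patterned on ECLS-K naming conventions but should be verified against the codebook, and the paired-jackknife (JK2) type is an assumption based on the User's Guide.

    library(foreign)
    library(survey)
    ecls <- read.dbf("ECLSK_sample.dbf")            # the dBase file on the CD

    des <- svrepdesign(data       = ecls,
                       weights    = ~C1CW0,         # full-sample child weight
                       repweights = "C1CW[1-9]",    # replicate weights matched by name
                       type       = "JK2")          # paired-jackknife variance estimation
    svymean(~C1RRSCAL, des)                         # design-adjusted mean and SE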

The types of variables allow for a variety of statistical procedures including nonparametric statistics, multiple regression, analysis of variance, analysis of covariance, and multivariate analysis of variance procedures. Professors teaching courses that include multiple regression, multivariate analysis, measurement and evaluation, and large-scale database analysis may find the data file useful for classroom examples and student practice. Additionally, professors will be able to create numerous smaller datasets from the data file for classroom use.

Included in the ECLS-K folder are the data file in two formats (a dBase file and a SAS data file; an Excel file could not be used because of the 256-column limit), a Microsoft Word file of the variable codebook, and a SAS file listing the variable labels and format statements. The user will want to review the ECLS-K User’s Guide for more detailed information on sampling, data collection, variables, use of weights, design effects, and appropriate variance estimation procedures. The dBase (.dbf) file is recommended for use in WesVar.

Monte Carlo Simulations

If you have descriptive statistics for a dataset but do not actually have the data, a very efficient way to develop a practice or pilot research dataset is through Monte Carlo simulation. In a Monte Carlo simulation, a researcher uses the descriptive data to create “parallel” datasets that have the characteristics of the original dataset. Further, the researcher can create an unlimited number of cases and conditions associated with the original dataset.
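A minimal R sketch: given only a reported mean, standard deviation, and sample size (the values below are illustrative), a parallel score variable can be generated directly.

    set.seed(1)
    parallel_scores <- rnorm(216, mean = 200, sd = 25)        # reported n, mean, SD
    c(mean = mean(parallel_scores), sd = sd(parallel_scores)) # check the match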

Monte Carlo simulations have traditionally been used in statistics and related fields to evaluate the effectiveness of new methods and procedures. For example, a researcher may develop a new statistical procedure that needs to be checked under various conditions, such as discrepant sample sizes and normal and non-normal distributions. Collecting data or relying on archival datasets to evaluate the effectiveness of the new procedure under all of these conditions would take a protracted amount of time, and random sampling error in the archival datasets may also be a problem. Thus, the researcher would supplement collected and archival datasets with Monte Carlo simulations.
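As a small illustration of this use, the R sketch below estimates the empirical Type I error rate of the two-sample t-test when both populations are skewed; all settings are illustrative.

    set.seed(1)
    pvals <- replicate(2000, {
      x <- rexp(15); y <- rexp(15)      # skewed populations with equal means
      t.test(x, y)$p.value
    })
    mean(pvals < 0.05)                  # empirical rejection rate vs. nominal .05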