Lab Objective

To become familiar with the software package R.

Why should we care about R?

R gives us an enormous advantage over people who learned and performed statistical analyses in the pre-computer days. It lets us skip the drudgery of long arithmetic calculations and concentrate on understanding concepts and analyzing data. A large community of researchers also adds to R constantly: they write R packages that you can use to apply their cutting-edge results to your own problems. With this in mind, R is a strong choice as a long-term tool for statistical analysis.

Questions:

Data analysis tip: It is common for some data to be missing from a file. Unfortunately, there is no universally accepted way of representing missing values. R uses NA ("not available"), but the data may not have been originally encoded in R. Some data producers, like federal agencies, use extreme values of a variable (e.g., -99) to indicate missing values. Using extreme values is bad practice: how does the user know whether a value is an actual observation or a placeholder for a missing one? When you get a data set from someone, learn how they code missing data before doing any further analyses.
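To make the tip concrete, here is a minimal sketch of recoding a sentinel value to NA in R. The data frame `raw`, its columns, and the choice of -99 as the sentinel are all made up for illustration; they are not taken from the lab's data file, which may code missing values differently.

```r
# Hypothetical toy data: -99 is an ASSUMED sentinel for "missing".
raw <- data.frame(
  ceo    = c("A", "B", "C", "D"),
  salary = c(1200, -99, 850, NA)
)

# Count values that match the sentinel before recoding.
n_sentinel <- sum(raw$salary == -99, na.rm = TRUE)

# Recode the sentinel to NA so R treats it as missing.
# %in% returns FALSE (not NA) for existing NAs, so the index is safe.
clean <- raw
clean$salary[clean$salary %in% -99] <- NA

# Count missing values and compute a mean that skips them.
n_missing   <- sum(is.na(clean$salary))
mean_salary <- mean(clean$salary, na.rm = TRUE)
```

If the coding is known before the file is read, the `na.strings` argument of `read.csv()` can do the same conversion at import time.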

1.) What type of study is this?

2.) How is missing data encoded in this dataset?

3.) True or false: There are more than ten CEOs whose values of total compensation are missing in the data file.

4.) What is the graduate degree of the CEO with the highest total compensation?

5.) What is the industry of the CEO with the highest total compensation? What is the industry of the CEO with the lowest total compensation?

6.) Which industry type has the highest average CEO total compensation?

7.) Highest attained educational degree is in the variable "Grad degree". Which degree has the highest total compensation: MBA (business), JD (law), MD (physician), PhD, or no graduate degree? Use highest average total compensation as your criterion, and choose only from these categories.

8.) I have provided you with code that creates side-by-side boxplots of total compensation by industry. Modify this code to give side-by-side boxplots by graduate degree. Then use these boxplots to compare and contrast the distributions of total compensation for the different graduate degrees. To see what’s happening, choose an appropriate axis range for this plot.

9.) What causal claims can we make about total compensation and graduate degree?

10.) Explore the data to answer at least one question that interests you.
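
Several of the questions above come down to grouped summaries and grouped plots. The sketch below shows those general techniques in base R on a toy data frame. The names `toy`, `group`, and `pay` and all of the values are hypothetical; this is not the code provided with the lab, and the column names in the lab's data file will differ.

```r
# Hypothetical toy data; names and values are made up for illustration.
toy <- data.frame(
  group = c("X", "X", "Y", "Y", "Z", "Z"),
  pay   = c(10, 20, 30, NA, 5, 500)
)

# Mean of pay within each group, ignoring missing values.
group_means <- tapply(toy$pay, toy$group, mean, na.rm = TRUE)

# Row with the largest pay; which.max() skips NA values.
top_row <- toy[which.max(toy$pay), ]

# Side-by-side boxplots. ylim restricts the visible range so that one
# extreme value does not flatten the other boxes.
boxplot(pay ~ group, data = toy, ylim = c(0, 50),
        xlab = "group", ylab = "pay")
```

Note that `ylim` only changes what is drawn; the extreme value is still in the data and still affects the group mean.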