EDA Project
Emily Poitan Mendez
2/20/18
Data Description
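# Packages for reading Excel files, plotting, and report tables
library(readxl)
library(ggplot2)
library(knitr)
# Load the 2015 police shootings data from the Excel file
police <- read_excel("~/Documents/R CODE/WashPost Police Shootings 2015.xlsx")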
I will be using the data set of police shootings. This data set has characteristics of individuals killed by police in 2015. It has 990 observations and 14 variables, a mix of numerical and categorical. In this analysis, I will look at the relationship between race and age to see whether age differs among the race groups.
Data Management (Race)
table(police$race)
##
##   A   B   H   N   O   W
##  14 258 172   9  15 494
Table 1.
Table 1 shows six race categories. The first row gives the race code and the second row gives the number of people killed by police in that category. The codes are: A for Asian, B for Black, H for Hispanic, N for Native American, O for Other, and W for White.
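The code-to-name mapping described above can be attached to the data as factor labels, so tables and plots show full names instead of single letters. This is a minimal sketch: the vector is rebuilt from the Table 1 counts so the example runs on its own, and the names `race_counts`, `race_labels`, `race_codes`, and `race_factor` are hypothetical, not part of the original analysis.

```r
# Counts copied from Table 1 (codes as they appear in the data)
race_counts <- c(A = 14, B = 258, H = 172, N = 9, O = 15, W = 494)

# Code-to-label lookup taken from the description of Table 1
race_labels <- c(A = "Asian", B = "Black", H = "Hispanic",
                 N = "Native American", O = "Other", W = "White")

# Rebuild a vector of codes from the counts (stand-in for police$race)
race_codes <- rep(names(race_counts), times = race_counts)

# Convert the codes to a factor whose levels carry the full names
race_factor <- factor(race_codes,
                      levels = names(race_labels),
                      labels = unname(race_labels))

table(race_factor)
```

On the real data, `police$race` would take the place of `race_codes`.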
# Drop rows with a missing race before plotting
temp.data.to.plot. <- police[!is.na(police$race), ]
ggplot(temp.data.to.plot., aes(x = race)) + geom_bar()
Figure 1.
In this bar chart you can see that White has the highest number of people killed by police. Black is second and Hispanic is third. Asian, Native American, and Other have the lowest counts.
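The ranking in the bar chart can also be expressed as percentages. The sketch below works only from the counts printed in Table 1 and the 990 observations stated in the data description; `race_counts`, `race_share`, and `n_without_race` are hypothetical names used for illustration.

```r
# Counts copied from Table 1
race_counts <- c(A = 14, B = 258, H = 172, N = 9, O = 15, W = 494)

# Share of each race among rows with a recorded race, in percent
race_share <- round(100 * prop.table(race_counts), 1)
sort(race_share, decreasing = TRUE)

# Rows not counted in Table 1 (assumed to be missing race values),
# given the 990 observations stated in the data description
n_without_race <- 990 - sum(race_counts)
n_without_race
```

These percentages describe shares of the recorded deaths only. They are not rates, because they do not account for the size of each group in the population.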
New Factor Variable
# Copy race, then fold Asian (A) and Native American (N) into Other (O)
police$new_race <- police$race
police$new_race[police$new_race %in% c("A", "N")] <- "O"
table(police$new_race)
##
##   B   H   O   W
## 258 172  38 494
Table 2.
temp.data.to.plot. <- police[!is.na(police$new_race), ]
ggplot(temp.data.to.plot., aes(x = new_race)) + geom_bar()
Figure 2.
I decided to combine Asian, Native American, and Other into one group because each has very few observations on its own. I created a new variable called new_race in which the codes A and N are recoded to O, so the combined Other group has 14 + 9 + 15 = 38 people. Table 1 and Figure 1 show all 6 race categories, while Table 2 and Figure 2 show only 4: Black, Hispanic, White, and Other (which now includes Asian and Native American).
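One way to check a recode like this is to cross-tabulate the old codes against the new ones. The sketch below rebuilds the race vector from the Table 1 counts so it runs without the Excel file; `race`, `new_race`, and `recoded` are stand-in names for `police$race` and `police$new_race`.

```r
# Counts copied from Table 1, expanded into one code per person
race_counts <- c(A = 14, B = 258, H = 172, N = 9, O = 15, W = 494)
race <- rep(names(race_counts), times = race_counts)

# Same recode as in the analysis: A and N become O
new_race <- race
new_race[new_race %in% c("A", "N")] <- "O"

# Counts after the recode (should match Table 2)
recoded <- table(new_race)
recoded

# Cross-tab of old vs. new codes shows where each original code ended up
table(old = race, new = new_race)
```

In the cross-tab, the rows for A, N, and O should all land in the O column, and B, H, and W should be unchanged.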
Data Management (Age)
summary(police$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    6.00   27.00   34.00   36.66   46.00   86.00      12
This table shows the summary of age. The mean age is 36.66 and the median is 34. The youngest person killed by police was 6 years old and the oldest was 86. The middle half of the ages falls between 27 and 46 (the first and third quartiles), so most of the people who were shot were in their 20s to 40s, with only a few above 60. The mean is slightly higher than the median, which suggests a mild right skew. There are 12 missing values for age.
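The summary statistics above are enough for a quick check on spread and skew. This sketch uses only the printed values and applies the common 1.5 * IQR rule for flagging unusually low or high ages; the variable names are hypothetical and the rule is one convention among several.

```r
# Values copied from the summary(police$age) output
q1 <- 27; med <- 34; avg <- 36.66; q3 <- 46

# Interquartile range: spread of the middle 50% of ages
iqr <- q3 - q1

# Fences from the 1.5 * IQR rule, a common boxplot convention
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# A positive gap between mean and median hints at a right skew
mean_minus_median <- avg - med

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence,
  mean_minus_median = mean_minus_median)
```

The upper fence comes out at 74.5, so the maximum age of 86 lies beyond it, while the minimum age of 6 is well above the lower fence of -1.5. That matches a distribution with a longer tail on the older side.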
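# Drop rows with a missing age, then overlay a density curve on a density-scaled histogram
temp.data.to.plot. <- police[!is.na(police$age), ]
# after_stat(density) replaces the deprecated ..density.. notation (ggplot2 >= 3.3.0)
ggplot(temp.data.to.plot., aes(x = age)) + geom_density(col = "blue") +
  geom_histogram(aes(y = after_stat(density)), colour = "black", fill = NA)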
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In this histogram you can see that a majority of the people who were shot were in their 20s to 40s. The vertical axis shows density, not age, and the curve is at its highest (between roughly 0.01 and 0.04) over that age range. As you follow the line toward older ages, it decreases. The graph is skewed right: only a few older people were shot, which is why the density is close to 0 at the highest ages.
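The `stat_bin()` message above suggests choosing a bin width instead of relying on the default of 30 bins. The sketch below shows one way to do that with 5-year bins. The ages are simulated for illustration only and are not the police data; `ages` and `p` are hypothetical names, and the width of 5 is an assumption, not a recommendation from the original analysis.

```r
library(ggplot2)

# Simulated, right-skewed ages for illustration only (NOT the police data)
set.seed(1)
ages <- data.frame(age = round(rgamma(200, shape = 8, rate = 0.22)))

# 5-year bins starting at multiples of 5, with the density curve on top
p <- ggplot(ages, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5, boundary = 0,
                 colour = "black", fill = NA) +
  geom_density(colour = "blue")
p
```

With the real data, `temp.data.to.plot.` would replace `ages`. Bins aligned to round ages (20 to 25, 25 to 30, and so on) make it easier to read off which age ranges hold the most cases.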
Bivariate Comparison
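# Keep rows where both race and age are present (the two conditions are joined with &)
temp.data.to.plot. <- police[!is.na(police$new_race) & !is.na(police$age), ]
ggplot(temp.data.to.plot., aes(x = new_race, y = age, fill = new_race)) +
  geom_violin(alpha = .1) +
  geom_boxplot(alpha = .5, width = .2)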
This plot overlays a boxplot on a violin plot to compare age across the race groups. White has a wide distribution in age, with people shot at ages spread across almost the entire age range. For Black, Hispanic, and Other, the majority were shot young, between the ages of 20 and 40, which is why their violins are widest in that range.
temp.data.to.plot. <- police[!is.na(police$new_race) & !is.na(police$age), ]
ggplot(temp.data.to.plot., aes(x = age, col = new_race)) + geom_density()
This plot shows overlaid density curves of age, one for each race group, rather than a histogram. It compares the ages at which people were shot by race. The Black curve peaks around age 20, Hispanic around 25, and Other around 30, while White is spread from about 30 to 50.
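The two plots above compare age across race groups visually. A per-group numeric summary can back up that reading. The sketch below uses a small made-up data frame so it runs on its own; the rows in `toy` are invented for illustration and are not from the police data, and `age_median` and `age_iqr` are hypothetical names.

```r
# Made-up rows for illustration only; they are NOT from the police data
toy <- data.frame(
  new_race = c("B", "B", "B", "H", "H", "H", "O", "O", "W", "W", "W", "W"),
  age      = c(19, 24, 31, 22, 27, 35, 30, 41, 25, 38, 47, 60)
)

# Median and interquartile range of age within each race group
age_median <- tapply(toy$age, toy$new_race, median)
age_iqr    <- tapply(toy$age, toy$new_race, IQR)

data.frame(median = age_median, iqr = age_iqr)
```

On the real data, the same idea would be `tapply(police$age, police$new_race, median, na.rm = TRUE)`, where `na.rm = TRUE` skips the 12 missing ages.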
Conclusion
Based on the data, there is not really a large age difference among the different races. White is the only group with a fairly even age distribution. Black, Hispanic, and Other are skewed right, with the majority of people shot between the ages of 20 and 40. White has the most people shot, while Other has the fewest. Black has its highest density around age 20, but Hispanic and Other are close to it. In conclusion, more White people were shot than people of any other race, but in the other 3 groups the shootings are concentrated more in the 20s and 30s.