Final Project
Adam Prekeges
February 24, 2017
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
ncbirths <- read.csv("C:/Users/adam/Desktop/R/NCBirths.csv", header=TRUE, stringsAsFactors = FALSE)
This paper analyzes the NCBirths data set, a sample of births recorded in North Carolina in 2004. For each birth it records the baby's weight, whether the mother smoked, whether she was married, and several other variables. We focus on the baby's weight and the mother's smoking habit to see whether the two are associated.
head(ncbirths)
##   fage mage      mature weeks    premie visits marital gained weight
## 1   NA   13 younger mom    39 full term     10 married     38   7.63
## 2   NA   14 younger mom    42 full term     15 married     20   7.88
## 3   19   15 younger mom    37 full term     11 married     38   6.63
## 4   21   15 younger mom    41 full term      6 married     34   8.00
## 5   NA   15 younger mom    39 full term      9 married     27   6.38
## 6   NA   15 younger mom    38 full term     19 married     22   5.38
##   lowbirthweight gender     habit  whitemom
## 1        not low   male nonsmoker not white
## 2        not low   male nonsmoker not white
## 3        not low female nonsmoker     white
## 4        not low   male nonsmoker     white
## 5        not low female nonsmoker not white
## 6            low   male nonsmoker not white
Although the data set contains 13 variables, we focus on birth weight (weight) and the mother's smoking habit (habit). Earlier research suggests that babies of mothers who smoke tend to weigh less and to be less healthy, whereas babies of nonsmoking mothers tend to be born at a healthy weight.
mean(ncbirths$weight, na.rm = TRUE)
## [1] 7.101
table(ncbirths$habit)
##
## nonsmoker    smoker
##       873       126
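The overall mean and the habit counts above do not yet compare the two groups directly. A minimal sketch of a per-group summary follows, using dplyr, which the paper already loads. The helper name `summarise_weight_by_habit` and the `toy_births` rows are hypothetical and exist only to show the call; they are not NCBirths values.

```r
library(dplyr)

# Hypothetical helper (not part of the original paper): summarise birth
# weight by smoking habit. Assumes columns named `habit` and `weight`.
summarise_weight_by_habit <- function(births) {
  births %>%
    filter(!is.na(habit), !is.na(weight)) %>%  # drop rows missing either value
    group_by(habit) %>%
    summarise(
      n             = n(),
      mean_weight   = mean(weight),
      median_weight = median(weight),
      sd_weight     = sd(weight)
    )
}

# Made-up toy rows, used only to demonstrate the call.
toy_births <- data.frame(
  habit  = c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker", NA),
  weight = c(7.5, 8.0, 6.5, 6.0, 7.0, 7.2)
)

toy_summary <- summarise_weight_by_habit(toy_births)
print(toy_summary)
```

With the real data, the same call would presumably be `summarise_weight_by_habit(ncbirths)`, giving one row per habit group.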
ggplot(ncbirths, aes(x = weight)) + geom_histogram(binwidth = 1) + ggtitle("Weight of babies")
This histogram shows the weights of all babies in the data. The tallest bar is near 7 pounds, close to the mean of 7.101. The distribution is slightly skewed to the left, with a tail of low weights. Some of that tail may come from smokers' babies, if they do weigh less, but the histogram alone cannot show this.
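A rough numeric check on the direction of skew is to compare the mean with the median: in a left-skewed distribution the low tail usually pulls the mean below the median. The sketch below is a heuristic, not a formal skewness test. The function name `skew_hint` and the `toy_weights` vector are hypothetical and are not taken from NCBirths.

```r
# Hypothetical helper: compare mean and median as a rough hint of skew.
# mean < median hints at left skew; mean > median hints at right skew.
skew_hint <- function(x) {
  x <- x[!is.na(x)]  # ignore missing weights
  list(
    mean      = mean(x),
    median    = median(x),
    direction = if (mean(x) < median(x)) "left" else
                if (mean(x) > median(x)) "right" else "none"
  )
}

# Made-up weights with one low value, used only to demonstrate the call.
toy_weights <- c(3.0, 6.5, 7.0, 7.2, 7.5, 7.8, 8.0, NA)
toy_hint <- skew_hint(toy_weights)
print(toy_hint)
```

For the paper's data the call would presumably be `skew_hint(ncbirths$weight)`.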
ggplot(ncbirths, aes(x=habit, y=weight)) + geom_boxplot()
This boxplot compares the birth weights of babies born to smokers and nonsmokers. As the table above shows, there are many more nonsmoker observations (873 versus 126), so the nonsmoker group covers a wider range of weights and shows more extreme values. The smokers' babies tend to weigh less than the nonsmokers' babies. The NA category appears on the x-axis because habit is missing for at least one observation; rows with a missing weight are dropped from the plot rather than given their own category.
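An NA group on the x-axis of a plot like this comes from rows where habit is missing. One way to keep it out of the plot is to filter those rows first. The sketch below assumes the same column names as the paper. The `toy_births` rows are made up for illustration, and the axis labels are an assumption (the pound unit in particular is not stated in the paper).

```r
library(dplyr)
library(ggplot2)

# Made-up toy rows, used only to demonstrate the filtering step.
toy_births <- data.frame(
  habit  = c("nonsmoker", "nonsmoker", "nonsmoker", "smoker", "smoker", NA),
  weight = c(7.5, 8.0, 6.5, 6.0, 7.0, 7.2)
)

# Drop rows with a missing habit so no NA box is drawn.
plot_data <- toy_births %>% filter(!is.na(habit))

habit_boxplot <- ggplot(plot_data, aes(x = habit, y = weight)) +
  geom_boxplot() +
  labs(x = "Mother's smoking habit", y = "Birth weight (pounds, assumed unit)")

print(habit_boxplot)
```

On the real data, replacing `toy_births` with `ncbirths` should give the same boxplot as before without the NA category.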
ggplot(ncbirths, aes(x = habit, y = weight)) + geom_point() + facet_wrap(~ habit)
This last plot shows the individual weights within each habit category, one panel per category. The nonsmokers' points cover the widest range, while the smokers' points are clustered more tightly near the middle.
Conclusion
In this short paper we compared the birth weights of babies born to mothers who smoked and mothers who did not, using births in North Carolina in 2004. The plots suggest that nonsmokers' babies have a wider spread of weights than smokers' babies. They are descriptive only, however, so we cannot conclude from them that smoking causes a baby to weigh less.
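The paper stops at plots, so the comparison between groups is never tested. One possible next step, not something the paper performs, is a two-sample t-test on weight by habit. The sketch below uses base R's `t.test`, which runs Welch's unequal-variance test by default. The `toy_births` rows are made up, so the numbers printed say nothing about NCBirths. Even a significant result would show an association, not that smoking causes lower weight.

```r
# Made-up toy rows, used only to demonstrate the call.
toy_births <- data.frame(
  habit  = rep(c("nonsmoker", "smoker"), each = 4),
  weight = c(7.5, 8.0, 6.5, 7.8, 6.0, 7.0, 6.4, 6.8)
)

# Welch two-sample t-test (var.equal = FALSE is the default).
# Rows with missing habit or weight are dropped by the formula interface.
welch <- t.test(weight ~ habit, data = toy_births)

print(welch$estimate)  # mean weight in each group
print(welch$conf.int)  # confidence interval for the difference in means
print(welch$p.value)   # p-value for the null hypothesis of equal means
```

With the real data the call would presumably be `t.test(weight ~ habit, data = ncbirths)`.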