
Life Expectancy by Country

Nick Dyer

ECO 328 J

April 17, 2006

Abstract

Even with the technological advances in the world, life expectancy differs, in some cases greatly, from country to country. Building on earlier papers, this project models life expectancy as a function of GNP per capita, region of the world, and the percentage of the population with access to safe drinking water. Using country-level data from 1982-1996, the empirical results show that life expectancy is higher in countries with higher per capita GNP, in countries with better access to safe drinking water, and in countries located in the Europe and Central Asia region of the world.

Table of Contents:

Introduction

Literature Review

Model

Data

Empirical Results

Conclusion

Data and Tables

Bibliography

Introduction:

Life expectancy today, in an era of medication and advanced medical understanding, still varies greatly around the globe. Being lucky enough to live in a country where life expectancy is close to seventy-two years, I know that I am extremely blessed to have access to safe drinking water and to live in an economically prosperous country. This model looks into the factors that determine a country's overall life expectancy.

I use data from 1982-1996 for one hundred and fifty-two countries in the world. My dependent variable is life expectancy in a given country, and my independent variables are the percentage of the population with access to safe drinking water, per capita GNP, and the region of the world in which the country is located. Both my own results and the research of others that I have reviewed suggest that these independent variables help to explain life expectancy.

My data was collected from several different places. The data for life expectancy (my dependent variable), GNP, access to safe drinking water, and region of the world was collected from The World Bank Group. The number of people per physician was collected from The World Almanac and Book of Facts.

My hypothesis is that each of my independent variables will have an effect on the life expectancy of someone residing in a given country. In my project I will use the appropriate tests to decide whether my hypothesis is correct. In addition, my paper will look at possible problems that might cause bias in the data, as well as possible solutions for these problems. After testing, I found that all of my independent variables have a significant impact on life expectancy, especially after correcting for heteroskedasticity.

Literature Review:

While looking for literature concerning life expectancy, I was able to find an actual regression model that contained variables that I had not yet thought of. I was also able to find a project that a professor uses with his students to show some important relationships concerning correlation, causation, and prediction. In it he studies life expectancy in relation to the number of people per television set and the number of people per physician (Rossman, 1992). Rossman found that the relationship between life expectancy and people per physician is very similar to that between life expectancy and people per television (Rossman, 1992). This means that I might have a problem with multicollinearity; however, I am considering using per capita GNP rather than the number of people per television set, as well as a dummy variable for the part of the world in which an individual lives.

One of the factors that affects the life expectancy of an individual is the number of people per doctor. Rossman was able to determine that “The correlation between life expectancy and people per physician estimated to be -0.666” (Rossman, 1992). Chen and Ching used the number of physicians and the availability of health care as variables to help explain life expectancy because they believe that the more physicians and the more available health care there are, the longer the life expectancy (Chen, 2000). This is why I have decided to use the number of people per doctor as a variable; even though I might run into multicollinearity, I think that it is important to keep all relevant variables.

The number of people per television set is another factor correlated with life expectancy. Rossman constructed a scatter plot of life expectancy versus people per television and concluded that there is a noticeable negative correlation. He calculates the value of the correlation coefficient to be -0.606. At first, this did not seem logical to me. I expected that the lower the number of people per television set, the lower the life expectancy would be, because I figured that the more sedentary the people, the shorter their life expectancy would be. After thinking it through, the outcome started to make more sense to me. The number of people per television set helps to measure the overall wealth of a country: the fewer people per set, the wealthier the country. Chen and Ching found that the more money a country's citizens have, the more they can spend on health care and leisure activities (Chen, 2000). This leads me to believe that money spent on health care, or a measure such as per capita GNP, would be a better key variable than the number of people per television set, which only estimates the wealth of citizens indirectly. This is why I have decided to include a variable that represents per capita GNP.

Geography is one thing that Chen and Ching study. After consideration, and after finding data that breaks the countries into regions, I have decided to use the region in which a country is located as one of my variables. Chen and Ching used geography in their model by stating that African countries have more disease, and that more disease brings down life expectancy (Chen, 2000). The World Bank Group, which performed much of the research that I will be using in my regression model, found that GNP per capita, the region of the world in which a country is located, and a country's access to safe water were all major factors in determining life expectancy (Life Expectancy, 2006). This is why I will use the region of the world in which a country is located in my regression model to estimate life expectancy in a given country.

Model:

The dependent variable in my equation is the life expectancy in different countries, measured in years. This variable is labeled LE, an abbreviation for life expectancy. The independent variables in my equation are the percentage of the population with access to safe drinking water, per capita GNP, and the region of the world in which a country is located.

The first independent variable in my equation is the percentage of the population with access to safe drinking water. This independent variable will be labeled WTR. I would expect that as the percentage of people with access to safe water increases, life expectancy would increase. I think that this is true because, according to the data that I have found from the World Bank Group, the higher the percentage of people with access to safe water, the healthier a given country is and the longer its life expectancy. This is understandable because the easier it is for someone to access safe water, the more sanitary a country is likely to be, so the amount of infection and disease will be reduced. If I am correct, then the sign of the coefficient for WTR will be positive.

The second independent variable in my equation is per capita gross national product (GNP), which will be labeled GDP. In running preliminary regressions I have found that the higher the per capita GNP in a country, the more money people have to spend on health care, which increases life expectancy. I feel that this can also be explained by saying that, after going to medical school, many doctors wish to practice where they feel they can be profitable and would go where people have the money to pay for their skills. The expected sign of the coefficient on GDP is positive.

The last independent variable in my equation is the region of the world in which a country is located. This independent variable will be labeled ROW, an abbreviation for region of the world. Chen and Ching found that the region of the world in which a country is located can be a major determinant of life expectancy (Chen, 2000). ROW is a dummy variable equal to 1 for the region in which a country is located and equal to 0 in all other cases, with Europe and Central Asia as the reference region. I expect the sign of the coefficient to be negative relative to that reference region.
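The dummy coding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region codes are the labels used in the Empirical Results section, the function name is hypothetical, and the two example calls are not rows from the actual data set.

```python
# Region codes follow the labels used in the Empirical Results section.
# ECA (Europe and Central Asia) is the reference region, so it gets no dummy.
REGIONS = ["AAP", "MENA", "SA", "SSA", "NAC"]


def region_dummies(region):
    """Return the 0/1 region dummies for one country (hypothetical helper)."""
    if region != "ECA" and region not in REGIONS:
        raise ValueError(f"unknown region code: {region}")
    return {name: int(region == name) for name in REGIONS}


# Illustrative calls only; these are not observations from the paper's data.
print(region_dummies("SSA"))  # SSA dummy is 1, all others 0
print(region_dummies("ECA"))  # reference region: every dummy is 0
```

Leaving the reference region out avoids the dummy variable trap, and each estimated region coefficient is then read as a difference in years relative to Europe and Central Asia.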

The hypothesized regression of my model would be:

LEi = β0 + β1 WTRi + β2 GDPi + β3 ROWi + εi

LE: Life expectancy in nation i.

WTR: Percentage of the population with access to safe drinking water in nation i.

GDP: Per capita GNP in nation i.

ROW: Region of the world in which nation i is located.

The main hypothesis that I will test is whether the variables that I have chosen have a significant effect on the life expectancy in the individual countries. Based on the expected signs described above, my null and alternative hypotheses are as follows:

Ho: β1 ≤ 0 ; β2 ≤ 0 ; β3 ≥ 0

Ha: β1 > 0 ; β2 > 0 ; β3 < 0

The functional form of my equation is going to be linear because the slope of each variable with respect to life expectancy should be constant; however, I feel that it will make sense to log GNP so that the coefficient says how a one percent increase in GNP increases life expectancy in years. Originally my model contained the number of people per television set; however, I realized that there was an extreme multicollinearity problem with per capita GDP. After omitting the number of people per television set, I no longer expect to have any problems with multicollinearity, but I will still make sure to check for its presence. There should not be a problem with serial correlation because my model does not use time series data; however, I will still check my Durbin-Watson statistic after running my regression. I do not feel that I will have a problem with heteroskedasticity because I believe that the variables come from a distribution with a constant variance. I feel that there is always the possibility of omitted variables, so I will be careful to check that the signs are as I predicted and that my model makes sense theoretically.
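The interpretation of a logged regressor can be illustrated with a short calculation. In a model of the form LE = b0 + b ln(GNP) + ..., a one percent increase in GNP changes LE by b ln(1.01) years, which is approximately b / 100. The sketch below uses a made-up coefficient of 5.0 purely to show the arithmetic; it is not an estimate from this paper, and the function name is hypothetical.

```python
import math


def le_change_from_gnp_growth(beta_log_gnp, pct_increase):
    """Change in LE (years) when GNP rises by pct_increase percent,
    assuming LE = b0 + beta_log_gnp * ln(GNP) + other terms."""
    return beta_log_gnp * math.log(1.0 + pct_increase / 100.0)


# beta = 5.0 is an invented value used only to illustrate the arithmetic.
beta = 5.0
exact = le_change_from_gnp_growth(beta, 1.0)
rule_of_thumb = beta / 100.0  # common approximation for small percent changes

print(round(exact, 4), rule_of_thumb)
```

For small percentage changes the two numbers are nearly identical, which is why the coefficient on a logged variable is usually read as "b / 100 units per one percent".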

Data:

My data is cross-sectional, covering one hundred and fifty-two countries throughout the world. Most of it is in percentages or per capita figures, which reduces the large differences between countries that come from differences in population size. This should keep heteroskedasticity to a minimum. My data was collected from several different places. The data for life expectancy (my dependent variable), GNP, access to safe drinking water, and region of the world was collected from The World Bank Group for 1990-1996. After working on my data I discovered that I needed to identify six different regions of the world that I believed would impact life expectancy. The number of people per physician was collected from The World Almanac and Book of Facts. I did encounter a couple of problems with missing observations, especially for access to safe drinking water in the Europe and Central Asia region of the world. I attempted to fix this problem by searching for replacement data but could not do so.

Empirical Results:

When I first ran Ordinary Least Squares this is the regression equation that I originally developed:

LE = 57.01628 + 0.000224 log(GNP) + 0.164050 WTR - 2.393049 AAP - 1.718252 MENA + 0.685565 SA - 14.74970 SSA + 0.525441 NAC

I expect gross national product (GNP) and the percentage of people with access to safe drinking water (WTR) to have a positive effect on LE, and the variables that stand for the region of the world (AAP, MENA, SA, SSA, NAC) to have a negative effect relative to the reference region, Europe and Central Asia (ECA). I have decided to log GNP. If GNP increases by one percent, LE will increase by 0.224 years. If WTR increases by one percentage point, LE will increase by 0.164050 years. My first region is AAP, or Asia and Pacific. I found that if a country is located in this region, its LE would be 2.393049 years lower than in Europe and Central Asia. My second region is MENA, or Middle East and Northern Africa. I found that if a country is located in this region, its LE would be 1.718252 years lower than in Europe and Central Asia. My third region is SA, or South America. I found that if a country is located in this region, its LE would be 0.685565 years higher than in Europe and Central Asia, which I feel is mainly due to my data problem with missing observations. My fourth region is SSA, or Sub-Saharan Africa. I found that if a country is located in this region, its LE would be 14.74970 years lower than in Europe and Central Asia. My fifth region is NAC, or Northern Central America and Caribbean. I found that if a country is located in this region, its LE would be 0.525441 years higher than in Europe and Central Asia, which I also feel is due to my data problem with missing observations.

My R-squared was 0.805081 and my adjusted R-squared was 0.790719. (See Table 1) Since the adjusted R-squared was not much lower, the model is a good fit, and the variables explain about seventy-nine percent of the variation in life expectancy. Looking at the preliminary equation, some of the coefficients have unexpected signs, which I feel is due to missing observations. To test whether the coefficients are significant, I did a hypothesis test. My null and alternative hypotheses were as follows:

Ho: β1 ≥ 0 ; β2 ≥ 0 ; β3 ≥ 0 ; β4 ≥ 0 ; β5 ≥ 0 ; β6 ≥ 0 ; β7 ≥ 0

Ha: β1 < 0 ; β2 < 0 ; β3 < 0 ; β4 < 0 ; β5 < 0 ; β6 < 0 ; β7 < 0

I used the ninety-five percent confidence level, which means the critical t-value is approximately 1.658 with 95 degrees of freedom. The t-statistics for the variables are available in Table 2. (See Table 2) Since the t-statistics for GNP, WTR, and SSA are greater than or close to the critical t-value, I can reject the null for these variables and conclude that these are the most significant variables. I also did an F-test of the significance of the model as a whole at the ninety-five percent confidence level, for which the critical F-value was 2.09. The null and alternative hypotheses are as follows:

Ho: β1= β2= β3= β4= β5= β6= β7=0

Ha: Ho is not true.

The probability (p-value) for the F-statistic was reported as 0, which is below 0.05, so I can reject the null and conclude that the model is significant.
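The fit statistics reported above can be cross-checked with the standard formulas for adjusted R-squared and the overall F-statistic. The sketch below is a check under stated assumptions, not output from the original regression: it takes the reported R-squared, the seven slope coefficients, and the 95 degrees of freedom quoted for the t-test, and infers the number of usable observations from them.

```python
r2 = 0.805081         # R-squared reported in the paper
k = 7                 # slope coefficients: GNP, WTR, and five region dummies
df_resid = 95         # degrees of freedom quoted for the t-test
n = df_resid + k + 1  # implied number of usable observations (an inference)

# Adjusted R-squared penalizes R-squared for the number of regressors.
adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic for Ho: all slope coefficients equal zero.
f_stat = (r2 / k) / ((1 - r2) / df_resid)

print(n, round(adj_r2, 4), round(f_stat, 1))
```

Under these assumptions the implied sample is 103 countries, the adjusted R-squared matches the reported 0.790719 to within rounding, and the F-statistic is roughly 56, far above the critical value of 2.09. An implied sample smaller than the one hundred and fifty-two countries collected would be consistent with the missing observations described in the Data section.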

The first of the econometric issues that I feel is important to look into in any model is multicollinearity. The results of my regression model do not show any strong multicollinearity. There is not a big change from R-squared to adjusted R-squared, the simple correlation coefficients between the explanatory variables are low, and the variance inflation factors (VIF) are generally low. (See Table 6) I do not feel that strong multicollinearity is a problem in my model.
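As an illustration of the two diagnostics mentioned above, the sketch below computes a simple correlation coefficient and the corresponding variance inflation factor, VIF = 1 / (1 - R-squared), where R-squared comes from regressing one explanatory variable on the others. With only two explanatory variables that auxiliary R-squared is just the squared simple correlation, which is the case shown here. The numbers are invented for the example and the function names are hypothetical.

```python
def pearson_r(x, y):
    """Simple correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_from_r2(aux_r2):
    """VIF for one regressor given the R-squared of its auxiliary regression."""
    return 1.0 / (1.0 - aux_r2)


# Invented values standing in for two explanatory variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(x1, x2)
print(round(r, 2), round(vif_from_r2(r ** 2), 2))
```

A common rule of thumb treats a VIF above about 5 as a sign of severe multicollinearity, so a correlation of 0.8 between two regressors, which gives a VIF below 3, would not by itself be alarming.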

The second of the econometric issues that I feel is important to look into in any model is serial correlation. Even though I do not have any time series data, I feel that it won't hurt to take a look at the Durbin-Watson statistic of 2.232262 to check for impure serial correlation. (See Table 1) I tested the Durbin-Watson statistic at the ninety-five percent level and found no evidence of serial correlation. I feel that I can confidently conclude that serial correlation is not a problem.
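For reference, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch follows; the residual lists are invented to show the extremes of the statistic and are not the residuals from this regression.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Invented residuals: alternating signs push the statistic above 2,
# while identical residuals push it to 0.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))
```

The statistic ranges from 0 to 4, and values near 2 indicate no first-order serial correlation, which is why a value of 2.232262 is not a cause for concern. With cross-sectional data the result also depends on the arbitrary order in which the countries are listed.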

The third of the econometric issues that my model might face, and that I feel is important to look into, is heteroskedasticity. Since the data comes from a cross-section of nations, I think that heteroskedasticity has a high likelihood of being present. (See Table 7) Tested at the ninety-five percent confidence level, I can reject the null of constant variance, meaning there is heteroskedasticity. Since heteroskedasticity is present, I feel that it is important to use White standard errors in the model to correct for the problem. (See Table 2) The White correction adjusts the standard errors and does not affect the coefficients.
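The idea behind the White correction can be shown on a regression with a single explanatory variable, where the formulas are short enough to write out. The sketch below is a simplified illustration, not the multi-variable procedure used for Table 2: it computes the basic (HC0) form of White's standard error for the slope, the data are invented, and statistical packages may apply small-sample adjustments that give slightly different numbers.

```python
def ols_with_white_se(x, y):
    """Slope of y on x (with intercept), plus classical and White (HC0)
    standard errors for the slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical standard error assumes one common error variance.
    s2 = sum(e ** 2 for e in resid) / (n - 2)
    se_classical = (s2 / sxx) ** 0.5

    # White (HC0) standard error lets each observation have its own variance.
    meat = sum((xi - mean_x) ** 2 * e ** 2 for xi, e in zip(x, resid))
    se_white = meat ** 0.5 / sxx

    return slope, se_classical, se_white


# Invented data, used only to show that the slope is the same either way.
slope, se_classical, se_white = ols_with_white_se(
    [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]
)
print(round(slope, 3), round(se_classical, 4), round(se_white, 4))
```

The slope estimate is identical under both approaches; only the standard error, and therefore the t-statistic, changes.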

Conclusion:

My model includes several of the key variables needed to estimate life expectancy in a given country; however, I'm sure that there are other useful variables that I have failed to include. Some other variables that I feel would have been good to include are population growth, literacy levels, and even a variable that measures disease in a country. Chen and Ching said that by using other variables such as literacy you can find that “literacy yielded an adjusted R2 of 0.86350” (Chen, 2000). Given more time, it would be interesting to look for other variables that add to my regression model and allow me to make better estimates. Hoyert, Kung and Smith even found that by using disease rates for each individual disease they were able to achieve a better regression (Hoyert, 2005). It would make sense that the higher the rate of a disease in a given country, the lower the life expectancy will be; however, I could not obtain enough data for enough countries for this to be a usable independent variable.

As a whole, my model looks good. The R-squared and adjusted R-squared are both high. The F-statistic shows that my model is significant. Other than heteroskedasticity, which I corrected for by using the White correction, I see no violations of the classical assumptions. I feel good about my model and about the hypothesis that it tests. If there was anything that I could change, I would add some of the variables that I mentioned earlier, and I think that with more time I could use them to improve my model.