Report

Pollu-X prides itself on its global efforts to reduce pollution for both environmental and wellness reasons. After a series of analyses, it is recommended that the company monitor three characteristics of a country in order to predict a large portion of its pollution index score: the median age of the population, population density, and fossil fuel consumption as a percentage of total energy consumption (Appendix A). While these factors do not have a causal relationship with the pollution index score, in combination they can indicate shifts in pollution levels. Each factor should therefore be analyzed to detect countries that currently have potentially risky levels of pollution, and monitored consistently to predict which countries are moving toward unsafe levels.

The pollution index of a country is built from a variety of factors and formulas that are not currently available to us, but the model is still statistically significant for our purpose. It shows that median age carries the most weight in predicting a country's pollution level, followed by fossil fuel consumption, then population density (Appendix A).

Predicting which countries are reaching unsafe levels will help our company to thrive while simultaneously contributing to a higher cause. This model can broaden our knowledge and help us expand our reach. It is statistically significant at the 95% confidence level and accounts for over 40% of the variation in the pollution index (Appendix B).

Individually, each data set tells us little about pollution levels; however, the combined relationship among the three predictors and the pollution index is considerably stronger.

It is crucial that our company consider the strength of the model in order to fulfill our purpose. The model is a significant indicator of the pollution index score, while the other factors tested proved less significant during the analysis (Appendix C). By adopting this model, Pollu-X can grow and make a global difference.

Appendix A

The model constructed from the analysis to indicate pollution index levels per country is as follows, where P is population density, F is fossil fuel consumption as a percentage of total energy consumption, and A is the median age of the population:

Y = 101.4248 + 0.0219(P) + 0.2490(F) - 1.9445(A)

From this model, we can see the relationship between the dependent variable and each independent variable, holding the others constant. For each additional person per square kilometer, the pollution index is expected to increase by 0.0219. For example, Malta's population density of approximately 1323 people per square kilometer contributes an increase of about 29.0006 to the country's pollution index (0.0219*1323). Additionally, the pollution index is expected to increase by 0.2490 for each percentage-point increase in fossil fuel consumption as a share of total energy consumption. In Malta, fossil fuels account for 94.5% of all energy consumption, which according to the model contributes an increase of approximately 23.5347 (94.5*0.2490). In contrast, the pollution index is expected to decrease by 1.9445 for each one-year increase in the median age of the population. The median age in Malta is 40.9, and multiplying it by the coefficient of -1.9445 gives -79.5318. These contributions were computed from unrounded values, so they differ slightly from the products of the rounded figures shown. The model's prediction is the sum of the three contributions and the intercept (101.4248). The predicted pollution index for Malta is 74.4284, which is approximately 3 points away from the actual recorded value.
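The Malta calculation can be reproduced with a short sketch. The function and constant names below are illustrative; the coefficients are the unrounded values from the regression output in Appendix B, and the inputs are the rounded Malta figures quoted in the text, so the result differs slightly from the reported 74.4284.

```python
# Coefficients taken from the regression output in Appendix B (unrounded).
INTERCEPT = 101.4248612
COEF_DENSITY = 0.021924376   # per person per km²
COEF_FOSSIL = 0.249006892    # per percentage point of fossil fuel share
COEF_AGE = -1.944542412      # per year of median age


def predict_pollution_index(density, fossil_pct, median_age):
    """Return the model's predicted pollution index for one country."""
    return (INTERCEPT
            + COEF_DENSITY * density
            + COEF_FOSSIL * fossil_pct
            + COEF_AGE * median_age)


# Rounded Malta inputs quoted in the text (illustrative call).
malta = predict_pollution_index(density=1323, fossil_pct=94.5, median_age=40.9)
print(round(malta, 2))
```

The same function can be applied to any country once its three inputs are known, which is how the model would be used for ongoing monitoring.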

Appendix B

Although the independent variables in this model had little correlation with the pollution index score individually, the combination of the three variables is able to predict approximately 43% of the pollution index score, as indicated by the R² of the multiple regression. Because the model explains less than half of the variation, predicted scores will differ from actual scores, as seen in Appendix A. Correlations between the pollution index score and each independent indicator are shown in the chart below. All correlations are positive except that between the median age of the population and that country’s pollution index.

Relationship to Pollution Index
Independent Factor / Correlation (R) / Single-Factor Regression (R²)
Population Density / R = 0.2029 / R² = 0.0412
Fossil Fuel Consumption / R = 0.0949 / R² = 0.0090
Median Age of Population / R = -0.5732 / R² = 0.3286
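For a regression with a single predictor, R² is the square of the correlation coefficient R, which gives a quick consistency check on the table. A minimal sketch, with the dictionary name `correlations` chosen for illustration; only the magnitudes of R are used, since squaring removes the sign:

```python
# Correlation magnitudes |R| from the table.
correlations = {
    "Population Density": 0.2029,
    "Fossil Fuel Consumption": 0.0949,
    "Median Age of Population": 0.5732,
}

# Squaring R gives the share of variance each factor explains on its own.
r_squared = {name: r ** 2 for name, r in correlations.items()}

for name, value in r_squared.items():
    print(f"{name}: R² = {value:.4f}")
```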

The pollution index score consists of several factors that contribute to the overall pollution of a country. The most heavily weighted factors are air pollution and water pollution, followed by other smaller and less relevant pollution factors. Because the score is calculated from these pollution measures through a series of relatively complex formulas, the variables in this model should not be considered causes of the pollution index, but merely predictors of the score. In other words, a higher median age does not cause lower pollution, and neither do the other factors cause higher pollution.

The model is also relatively strong because the data set was complete. To increase accuracy, any gaps found in the variables and their observations were filled prior to the analysis. Additionally, the only outlier judged irrelevant to the analysis was Singapore, which was eliminated. Eliminating Singapore had a dramatic impact on the results because its data was severely atypical.

The distribution of each factor was also taken into consideration in making the model as strong as possible. The pollution index was found to be normally distributed, as was the median age of the population. Fossil fuel consumption exceeded the limits used for a normal distribution by only .01, but it still cannot be considered normally distributed; it is negatively skewed. Lastly, population density was found to be heavily positively skewed.
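The skewness figures behind these judgments can be computed as follows. This is a sketch that assumes the bias-adjusted sample skewness formula used by common spreadsheet software and a rule-of-thumb cutoff of ±1 for treating a variable as approximately normal; the cutoff and the names `sample_skewness` and `describe_shape` are assumptions, not taken from the analysis.

```python
import math


def sample_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (bias-corrected)."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three observations")
    mean = sum(values) / n
    # Sample standard deviation with n - 1 in the denominator.
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    cubed = sum(((x - mean) / sd) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


def describe_shape(skew, cutoff=1.0):
    """Classify a skewness value using an assumed +/- cutoff."""
    if skew > cutoff:
        return "positively skewed"
    if skew < -cutoff:
        return "negatively skewed"
    return "approximately symmetric"


# Skewness values reported in Appendix D.
print(describe_shape(-0.01146))   # pollution index
print(describe_shape(-1.02522))   # fossil fuel consumption
print(describe_shape(8.577405))   # population density
```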

The coefficient for each variable expresses the weight it holds within the equation. We see that median age has the greatest weight when predicting the pollution index. A p-value under 0.05 signifies that the variable is significant in our model, and a larger absolute t-value likewise indicates greater importance to the equation.

Regression Statistics
Multiple R / 0.658329757
R Square / 0.433398069
Adjusted R Square / 0.413162286
Standard Error / 18.28803137
Observations / 88
ANOVA
Source / df / SS / MS / F / Significance F
Regression / 3 / 21489.29284 / 7163.097612 / 21.41741014 / 2.14046E-10
Residual / 84 / 28093.97567 / 334.4520914
Total / 87 / 49583.26851
Variable / Coefficient / Standard Error / t Stat / P-value
Intercept / 101.4248612 / 9.253430845 / 10.96078448 / 7.1289E-18
Population Density (per km²) / 0.021924376 / 0.009329958 / 2.349890104 / 0.021124474
Fossil fuel (%) / 0.249006892 / 0.086607042 / 2.875134461 / 0.005115672
Median age of population / -1.944542412 / 0.256671402 / -7.575999497 / 4.25187E-11
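The summary statistics in this output follow directly from the ANOVA sums of squares, so they can be cross-checked with a few lines. The sketch below uses the standard definitions of R², adjusted R², the F statistic, and the regression standard error; the variable names are illustrative.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 21489.29284
ss_residual = 28093.97567
n_obs = 88          # observations
k = 3               # predictors in the final model

ss_total = ss_regression + ss_residual
df_residual = n_obs - k - 1

r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
ms_regression = ss_regression / k
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)

# t statistic = coefficient / standard error, shown here for median age.
t_median_age = -1.944542412 / 0.256671402

print(f"R² = {r_squared:.4f}, adjusted R² = {adj_r_squared:.4f}")
print(f"F = {f_stat:.2f}, standard error = {std_error:.2f}")
print(f"t (median age) = {t_median_age:.2f}")
```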

Appendix C

The process by which these three variables were selected from the original 11 independent factors included careful analysis of each factor’s impact on the strength of the model. To begin, a multiple regression was run with all of the variables: population, CO2 emissions (metric tons per capita), GDP (in USD), population density (per km²), precipitation (mm), forest land (percentage of country area), imports (percentage of GDP), exports (percentage of GDP), fossil fuel (percentage of total energy consumption), renewable energy (percentage of total energy consumption), and the median age of the population. It immediately became clear that Singapore was a strong outlier, so it was eliminated from the data. Additionally, variables showing multicollinearity were eliminated because they interfered with the accuracy of the model. Factors with the greatest p-values were the first to be eliminated, as they were the least relevant to the model. However, because GDP and CO2 emissions seemed highly likely to be good predictors of the pollution index level, they were the last to be eliminated. After each elimination, the new R² was evaluated to ensure that the remaining variables were still relevant.
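The multicollinearity screen mentioned here can be sketched as a pairwise correlation check between predictors. The 0.8 cutoff, the function names, and the toy data are all assumptions for illustration; the analysis does not state which threshold was used.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(columns, cutoff=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| >= cutoff."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= cutoff:
            flagged.append((a, b, r))
    return flagged


# Hypothetical toy data, not the report's data set.
toy = {
    "imports": [30.0, 45.0, 60.0, 80.0, 120.0],
    "exports": [28.0, 40.0, 65.0, 75.0, 130.0],
    "median_age": [40.0, 22.0, 35.0, 30.0, 25.0],
}

flagged = collinear_pairs(toy)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

When a pair is flagged, only one of the two variables would normally be kept, which matches the decision described in Appendix D to use a single energy subcategory.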

Because the model's residuals did not tend to violate homoscedasticity or independence, no transformations were applied to the data. Such violations would have been visible in the residual plots, but there were no obvious patterns in any of these graphs, which allowed variables to be freely eliminated or retained based on their significance.

Appendix D

Dependent Variable:

Pollution Index
Mean / 58.08292
Standard Error / 2.496798
Median / 58.55
Mode / #N/A
Standard Deviation / 23.55474
Sample Variance / 554.826
Kurtosis / -0.83845
Skewness / -0.01146
Range / 103.94
Minimum / 9.85
Maximum / 113.79
Sum / 5169.38
Count / 89

Pollution Index – generated unit used to measure the levels of pollution within a country

  • Central Tendency
  • Mean: 58.08
  • Mean is the best measure for central tendency since the data is normally distributed.
  • Variability
  • Standard deviation: 23.55
  • Normally distributed
  • Skewness: -.011
  • The data is approximately symmetric; the skewness is negative but negligible.
  • Outliers:
  • There are no obvious outliers in this data set
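The entries in each descriptive table are linked by simple formulas, which makes them easy to cross-check. A small sketch using the Pollution Index figures, with illustrative variable names:

```python
import math

# Values from the Pollution Index table.
total = 5169.38
count = 89
std_dev = 23.55474

mean = total / count
std_error = std_dev / math.sqrt(count)   # standard error of the mean
variance = std_dev ** 2                  # sample variance

print(f"mean = {mean:.2f}, standard error = {std_error:.3f}, variance = {variance:.1f}")
```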

Independent Variables:

Population
Mean / 64425692.19
Standard Error / 20951642.55
Median / 11032328
Mode / #N/A
Standard Deviation / 197657400.5
Sample Variance / 3.90684E+16
Kurtosis / 35.08329351
Skewness / 5.803151553
Range / 1357056998
Minimum / 323002
Maximum / 1357380000
Sum / 5733886605
Count / 89

Population – measured in people

  • Central Tendency
  • Median: 11,032,328
  • Median is the best measure for central tendency since the data is heavily skewed.
  • Variability
  • Standard deviation: 197,657,400.5
  • Normally distributed:
  • Skewness: 5.8
  • The data is heavily positively skewed, with the tail towards the positive side.
  • Outliers:
  • China: 1,357,380,000
  • India: 1,252,139,596
  • Correlation with dependent:
  • R = 0.1957
  • Weak positive correlation
  • Concerning correlations:
  • There are none in this data set

CO2 Emissions – measured in metric tons per capita

CO2 Emissions
Mean / 5.978888
Standard Error / 0.701442
Median / 4.470965
Mode / #N/A
Standard Deviation / 6.617389
Sample Variance / 43.78984
Kurtosis / 13.07106
Skewness / 3.085846
Range / 40.23552
Minimum / 0.074565
Maximum / 40.31008
Sum / 532.121
Count / 89
  • Central Tendency
  • Median: 4.47
  • Variability:
  • Standard deviation: 6.62
  • Normally distributed:
  • Skewness: 3.09
  • The data is positively skewed, with the tail towards the positive side
  • Outliers:
  • Qatar: 40.31
  • Trinidad and Tobago: 38.16113079
  • Correlation with dependent:
  • R = 0.1587
  • Concerning correlations:
  • There are none in this data set

Gross Domestic Product – measured in dollars to analyze level of production

GDP
Mean / 7.55E+11
Standard Error / 2.25E+11
Median / 1.77E+11
Mode / #N/A
Standard Deviation / 2.13E+12
Sample Variance / 4.53E+24
Kurtosis / 39.29062
Skewness / 5.822838
Range / 1.68E+13
Minimum / 1.62E+09
Maximum / 1.68E+13
Sum / 6.72E+13
Count / 89
  • Central Tendency
  • Median: 1.77E+11
  • The median is a good measure because the data is skewed
  • Variability:
  • Standard deviation: 2.13E+12
  • Normally distributed:
  • Skewness: 5.82
  • The data is positively skewed with the tail towards the positive end
  • Outliers:
  • USA: 16768100000000
  • Correlation with dependent:
  • R = 0.008
  • No correlation
  • Concerning correlations:
  • There are none in this data set


Population Density – people per km²

Population Density
Mean / 233.6594
Standard Error / 87.8725
Median / 90.44884
Mode / #N/A
Standard Deviation / 828.9875
Sample Variance / 687220.4
Kurtosis / 77.57378
Skewness / 8.577405
Range / 7711.315
Minimum / 1.827463
Maximum / 7713.143
Sum / 20795.68
Count / 89
  • Central Tendency
  • Median: 90.45
  • The median is a good measure because the data is skewed
  • Variability:
  • Standard deviation: 828.99
  • Normally distributed:
  • Skewness: 8.58
  • The data is heavily positively skewed
  • Outliers:
  • Singapore: 7713.142857
  • Correlation with dependent:
  • R = 0.014
  • No correlation
  • Concerning correlations:
  • There are none in this data set

Average Precipitation – measured in mm

Precipitation
Mean / 1141.843
Standard Error / 81.9693
Median / 848
Mode / 250
Standard Deviation / 773.2968
Sample Variance / 597988
Kurtosis / -0.05337
Skewness / 0.889282
Range / 3184
Minimum / 56
Maximum / 3240
Sum / 101624
Count / 89
  • Central Tendency
  • Mean: 1141.84
  • Mean is the best measure for central tendency since the data is normally distributed
  • Variability:
  • Standard deviation: 773.297
  • Normally distributed:
  • Skewness: .889
  • The data is normally distributed
  • Outliers:
  • There are no obvious outliers
  • Correlation with dependent:
  • R = 0.105
  • No correlation
  • Concerning correlations:
  • There are none in this data set

Forest Land – percentage of total land in country

Forest Land
Mean / 28.90846
Standard Error / 2.079039
Median / 28.89227
Mode / #N/A
Standard Deviation / 19.61362
Sample Variance / 384.694
Kurtosis / -0.59005
Skewness / 0.446987
Range / 77.24155
Minimum / 0
Maximum / 77.24155
Sum / 2572.853
Count / 89
  • Central Tendency
  • Mean: 28.91
  • Mean is the best measure for central tendency since the data is normally distributed.
  • Variability:
  • Standard deviation: 19.61
  • Normally distributed:
  • Skewness: .447
  • The data is normally distributed
  • Outliers:
  • There are no obvious outliers
  • Correlation with dependent:
  • R = 0.237
  • Weak positive correlation
  • Concerning correlations:
  • There are none in this data set

Imports – percentage of total GDP

Imports
Mean / 46.38118
Standard Error / 2.610894
Median / 39.75387
Mode / #N/A
Standard Deviation / 24.63113
Sample Variance / 606.6924
Kurtosis / 5.774243
Skewness / 1.763199
Range / 154.5235
Minimum / 12.98455
Maximum / 167.5081
Sum / 4127.925
Count / 89
  • Central Tendency
  • Median: 39.75
  • Variability:
  • Standard Deviation: 24.63
  • Normally distributed:
  • Skewness: 1.763
  • Outliers:
  • Singapore: 167.51
  • Guyana: 119.21
  • Correlation with dependent:
  • R = 0.114
  • Weak positive correlation
  • Concerning correlations:
  • Concerning correlation with exports
  • R = 0.873
  • Strong positive correlation

Exports
Mean / 44.28873
Standard Error / 2.816531
Median / 38.8786
Mode / #N/A
Standard Deviation / 26.5711
Sample Variance / 706.0232
Kurtosis / 9.079757
Skewness / 2.135848
Range / 180.9444
Minimum / 9.57766
Maximum / 190.5221
Sum / 3941.697
Count / 89

Exports – percentage of total GDP

  • Central Tendency
  • Median: 38.88
  • Median is the best measure since data is skewed
  • Variability:
  • Standard Deviation: 26.57
  • Normally distributed:
  • Skewness: 2.136
  • The data is positively skewed
  • Outliers:
  • Singapore: 190.52
  • Correlation with dependent:
  • R = 0.2135
  • Weak positive correlation
  • Concerning correlations:
  • Concerning correlation with imports
  • R = 0.873
  • Strong positive correlation

Fossil Fuel and Renewable Energy – percentage of total energy consumption

Fossil Fuel
Mean / 72.18485
Standard Error / 2.511466
Median / 76.85397
Mode / 100
Standard Deviation / 23.69312
Sample Variance / 561.3642
Kurtosis / 0.308741
Skewness / -1.02522
Range / 94.27589
Minimum / 5.724115
Maximum / 100
Sum / 6424.451
Count / 89

Fossil Fuel Consumption

  • Central Tendency
  • Mean: 72.18
  • Mean is an acceptable measure for central tendency since the data is nearly normally distributed.
  • Variability:
  • Standard Deviation: 23.69
  • Normally distributed:
  • Skewness: -1.02522
  • Nearly normally distributed; skewed negatively
  • Outliers:
  • None
  • Correlation with dependent:
  • R = 0.089
  • Weak positive correlation
  • Concerning correlations:
  • Concerning correlation with renewable energy
  • R = 0.873
  • Strong positive correlation

Renewable Energy
Mean / 28.15029
Standard Error / 2.827231
Median / 21.98441
Mode / 0
Standard Deviation / 26.67204
Sample Variance / 711.3979
Kurtosis / 3.845637
Skewness / 1.673273
Range / 147.845
Minimum / 0
Maximum / 147.845
Sum / 2505.375
Count / 89

Renewable Energy Consumption

  • Central Tendency
  • Mean: 28.15029
  • I used the mean because this variable is closely related to fossil fuel consumption, which was just outside of normal distribution.
  • Variability:
  • Standard Deviation: 26.67
  • Normally distributed:
  • Skewness: 1.67
  • Not normally distributed; positively skewed
  • Outliers:
  • Paraguay: 147.8
  • Correlation with dependent:
  • R = 0.084
  • No correlation
  • Concerning correlations:
  • Concerning correlation with fossil fuel
  • R = 0.873
  • Strong positive correlation

Median Age
Mean / 32.94607
Standard Error / 0.836518
Median / 32.6
Mode / 40.9
Standard Deviation / 7.891697
Sample Variance / 62.27888
Kurtosis / -1.04781
Skewness / -0.20081
Range / 30.6
Minimum / 15.5
Maximum / 46.1
Sum / 2932.2
Count / 89

Median Age of Population – measured in years

  • Central Tendency
  • Mean: 32.94
  • Mean is the best measure for central tendency since the data is normally distributed.
  • Variability:
  • Standard Deviation: 7.89
  • Normally distributed:
  • Skewness: -0.20081
  • Data is slightly negatively skewed
  • Outliers:
  • There are no outliers for this data
  • Correlation with dependent:
  • R = -0.573
  • Moderate negative correlation
  • Concerning correlations:
  • No concerning correlations with other variables

Overall:

After running the descriptives on all of my data, I am under the impression that the predictors are not very good. I feel that there are some observations that I might consider discarding, like Singapore; I have not yet decided on China or India, although they contribute greatly to the skewness of my data overall. I will not be removing energy as a variable, but will likely only use one subcategory (fossil fuel consumption).

Appendix E

Pollu-X is a nonprofit dedicated to educating populations about pollution, where I hold the position of Senior Data Analyst. The purpose of gathering this data is to minimize the negative effects of pollution on health and the environment. The data is meant to help detect potential problem areas before they experience those negative effects.

Data gathered from:

Numbeo

World Bank

Human Development Reports

Unit of analysis:

Countries

Dependent variable:

  • pollution index

Independent variables include:

  • population,
  • CO2 emissions (metric tons per capita),
  • GDP (in USD),
  • population density (per km²),
  • precipitation (mm),
  • forest land (percentage of country area),
  • imports (percentage of GDP),
  • exports (percentage of GDP),
  • fossil fuel (percentage of total energy consumption),
  • renewable energy (percentage of total energy consumption),
  • median age of population