CS 365, Assignment 2 Report Template
NAME:
SCORE: /105 total
CODE: 10 possible points for well organized, well commented code that runs without errors
Part 1: 50 points + 10 possible Extra Credit
1.1 Fill in Table 1 (20 points + Extra Credit)
Dataset / Is the data powerlaw distributed? Answer ('no' or 'cr' for can't reject) and p-value, e.g., No, p = 0.42 / Max. likelihood estimate of powerlaw slope with CI, e.g., -2.12 +/- 0.25 / Slope obtained from regression on CDF of ranked data / Is the data lognormal? Answer ('no' or 'cr' for can't reject) and p-value / 5 pts Extra Credit: Is there a range over which it is lognormal? Give xmin, xmax and p-value for the range.
Webpages
Transistors
Moby words
Spoken words
Income
Extra Credit: Wealth
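For the maximum likelihood column, the following is a minimal Python sketch of the standard continuous power-law estimator, alpha = 1 + n / sum(ln(x_i / xmin)), with approximate standard error (alpha - 1) / sqrt(n). It is illustrative only: the function name `powerlaw_mle` is hypothetical, `xmin` is assumed to be known in advance, and the data are synthetic. The course tools may choose xmin and compute the CI differently. Note that the table reports the slope as a negative number, i.e. -alpha.

```python
import math
import random


def powerlaw_mle(data, xmin):
    """Continuous power-law MLE for p(x) ~ x^(-alpha) on x >= xmin.

    Returns (alpha, standard_error, n_tail). xmin is assumed given.
    """
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    std_err = (alpha - 1.0) / math.sqrt(n)
    return alpha, std_err, n


# Synthetic data (not course data): inverse-transform sampling
# from a power law with a known exponent, seeded for repeatability.
rng = random.Random(0)
true_alpha, xmin = 2.5, 1.0
sample = [
    xmin * (1.0 - rng.random()) ** (-1.0 / (true_alpha - 1.0))
    for _ in range(20000)
]

alpha_hat, se, n = powerlaw_mle(sample, xmin)
# Approximate 95% CI half-width under a normal approximation.
half_width = 1.96 * se
print(f"slope = {-alpha_hat:.2f} +/- {half_width:.2f} (n = {n})")
```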
1.2 (9 points) Provide 1 panel in each figure for each of the 5 datasets (or 6 for extra credit).
Show results of the maximum likelihood powerlaw fit (from plplot) in Figure 1a (on a single page)
Show results of the regression through the CDF of the ranked distribution in Figure 1b (on a single page)
Show the histogram of logged data in Figure 1c (on a single page)
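The regression behind Figure 1b can be sketched as follows, assuming "CDF of ranked data" means the complementary CDF built from ranks: sort the values in descending order, take rank / n as the fraction of values at least that large, and fit a line in log-log space. The function names are hypothetical and the data are synthetic. If the density follows x^(-alpha), the complementary CDF follows x^(-(alpha - 1)), so the fitted slope differs from the density slope by 1.

```python
import math


def ols_slope_intercept(xs, ys):
    """Ordinary least squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def ranked_ccdf_slope(data):
    """Regress log10(rank / n) on log10(value) for positive data."""
    ranked = sorted(data, reverse=True)  # rank 1 = largest value
    n = len(ranked)
    log_x = [math.log10(x) for x in ranked]
    log_p = [math.log10((i + 1) / n) for i in range(n)]
    return ols_slope_intercept(log_x, log_p)


# Synthetic check (not course data): values constructed so that
# rank / n = x^(-1.5) exactly, i.e. a density exponent of 2.5.
n_points = 1000
data = [(r / n_points) ** (-1.0 / 1.5) for r in range(1, n_points + 1)]

ccdf_slope, intercept = ranked_ccdf_slope(data)
density_slope = ccdf_slope - 1.0  # comparable to the MLE slope
print(ccdf_slope, density_slope)
```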
1.3 Explain your results
1.3.1 (15 points) Provide 2-3 sentences for each of the 5 data sets explaining what the data represent and whether they are consistent with a powerlaw distribution, a lognormal distribution, both, or neither. Describe any deviations from the powerlaw distribution in the large or small values (for example, is there evidence of an exponential cutoff in the tail?). Explain what the distribution and any deviation from the powerlaw or lognormal distribution mean.
Web page links:
Transistors on chips:
Words from Moby Dick:
Spoken Words:
Income:
1.3.2 (6 points) In which cases does regression on the cumulative distribution give a slope consistent with the maximum likelihood slope estimate? (Consider the confidence intervals for the maximum likelihood estimate in your answer.) In which cases is it different, and why do you think the regression fails to reproduce the maximum likelihood estimate?
5 more points of Extra credit:
Download the data listing the wealth of the 400 wealthiest Americans. You will have to grab the raw html and pull out the worth values from each of the 4 pages (for example,
grep "worth" rawwealthhtml.txt > worth.txt)
Fill in the table above, and explain your results. What are the differences between the wealth distribution and the income distribution, and what do the two datasets tell you about the distribution of wealth?
Part 2: 40 points + 5 points possible Extra Credit
2.1 Fill in Table 2 (24 points)
Dataset / p-value of regression / r2 / OLS slope with CI / RMA slope with CI
Lifespan vs mass (no logs)
Log10 Lifespan vs log10 mass
Log10 Reproductive rate vs log10 mass
Log10 Age at maturity vs log10 mass
Log10 LRE vs log10 mass
Log10 Power vs log10 Ntrans
Extra Credit: Log10 Power vs all other variables / (list slopes with CIs for each variable) / N/A
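As a reminder of how the two slope columns relate, here is a minimal Python sketch that computes the OLS slope, the RMA (reduced major axis) slope, and r2 for one pair of variables. The RMA slope is the ratio of the standard deviations of y and x, carrying the sign of the correlation, which equals the OLS slope divided by r. The function name and the five data points are hypothetical, and confidence intervals are not computed here.

```python
import math


def ols_and_rma(xs, ys):
    """Return (ols_slope, rma_slope, r_squared) for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    ols_slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    # RMA slope: sd(y) / sd(x), with the sign of the correlation.
    rma_slope = math.copysign(math.sqrt(syy / sxx), r)
    return ols_slope, rma_slope, r * r


# Made-up values standing in for log10(mass) and log10(lifespan).
log_mass = [0.0, 1.0, 2.0, 3.0, 4.0]
log_lifespan = [0.1, 0.2, 0.6, 0.7, 1.0]

ols, rma, r2 = ols_and_rma(log_mass, log_lifespan)
print(f"OLS = {ols:.3f}, RMA = {rma:.3f}, r2 = {r2:.3f}")
```

Because |r| is at most 1, the RMA slope is never smaller in magnitude than the OLS slope, and the two agree only when the points fall exactly on a line.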
2.2 (5 points) Provide Figure 2a showing the regression of lifespan vs mass and Figure 2b showing the regression of log lifespan vs log mass.
2.3.1 (6 points)
Explain whether RMA or OLS regression is more appropriate, and whether the data should be log transformed, for the mammal dataset and for the chip dataset.
2.3.2 (5 points)
Are the exponents for the mammal regressions consistent with the Metabolic Scaling Predictions? Consider whether the RMA or OLS regression methods and their confidence intervals give exponents consistent with quarter powers or not.
5 points Extra Credit: Perform a multiple regression to explain the variation in chip Power as a function of all the other available variables. Ask Matlab for help on regress in order to use multiple predictor variables. Fill in the extra credit line in Table 2. Explain which variables are significant in the prediction. Do you get any additional predictive power by transforming any of the variables (e.g., taking the square root or log)?
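To illustrate what a multiple regression computes, the following Python sketch fits coefficients by solving the normal equations (X'X) b = X'y with a column of ones added for the intercept. It is not the Matlab regress routine and reports no confidence intervals or significance tests. The function names, the predictor layout, and the data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def multiple_regression(predictors, response):
    """Least-squares fit; returns [intercept, b1, b2, ...].

    predictors: one row of predictor values per observation.
    """
    design = [[1.0] + list(row) for row in predictors]  # add intercept
    k = len(design[0])
    xtx = [
        [sum(row[i] * row[j] for row in design) for j in range(k)]
        for i in range(k)
    ]
    xty = [sum(row[i] * y for row, y in zip(design, response)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic data (not chip data): y = 1 + 2*x1 - 0.5*x2 exactly.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3), (3, 1)]
y = [1.0 + 2.0 * x1 - 0.5 * x2 for x1, x2 in rows]

coeffs = multiple_regression(rows, y)
print(coeffs)
```

To try a transformed variable, apply the transform (for example a log or square root) to that column of `rows` before fitting and compare the fits.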