DSCI 415 Unsupervised Learning - Assignment 3 (93 Points)

1 – Places Rated Almanac for Several Major U.S. Cities

These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:

  • Climate & Terrain
  • Housing
  • Health Care & Environment
  • Crime
  • Transportation
  • Education
  • The Arts
  • Recreation
  • Economics

For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):

  • Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
  • Housing: utility bills, property taxes, mortgage payments.
  • Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, fluoridation of drinking water, air pollution.
  • Crime: violent crime rate, property crime rate.
  • Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
  • Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, academic options in higher education.
  • The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
  • Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor-league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
  • Economics: average household income adjusted for taxes and living costs, income growth, job growth.

In addition, latitude, longitude, population, and state are also given, but you will not use these in your PCA. Use PCA to identify the major components of variation in the ratings among cities.
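The nine ratings are measured on very different scales, so one common choice is to standardize each criterion before the PCA, which is equivalent to working from the correlation matrix. The assignment does not say which to use, so treat this as an assumption. A minimal Python sketch of the standardization step, with made-up ratings and a hypothetical helper name `standardize`:

```python
from statistics import mean, stdev

# Hypothetical ratings for three cities; NOT values from the almanac.
ratings = {
    "Housing": [6000.0, 8000.0, 10000.0],
    "Crime": [700.0, 900.0, 1400.0],
}

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Each criterion now has mean 0 and standard deviation 1.
z_scores = {name: standardize(column) for name, column in ratings.items()}
print(z_scores["Housing"])
```

After this step every criterion contributes one unit of variance, so no single rating dominates the components only because of its units.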

In particular, do the following:

a) How many principal components would you retain for these data? Include a scree plot. (4 pts.)

b) Interpret the retained principal components by examining the eigenvectors/loadings. Discuss. (6 pts.)

c) If you could only use a few variables from the original dataset, which would they be? Explain. (3 pts.)

d) Use plots of the principal component scores to identify any cities which are outliers. What are the outlying cities and what characteristics make them unique? (5 pts.)
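A scree plot is a plot of the eigenvalues against component number, and the retention decision usually leans on the proportion of variance each eigenvalue accounts for. The sketch below, in Python, shows that arithmetic. The eigenvalues are invented for illustration (they are not the Places Rated results), the helper names are hypothetical, and the eigenvalue-greater-than-one rule is only one common heuristic, not a required method.

```python
# Hypothetical eigenvalues of a 9 x 9 correlation matrix. They sum to 9,
# the number of standardized variables. NOT the Places Rated results.
eigenvalues = [3.4, 1.2, 1.1, 0.9, 0.8, 0.6, 0.5, 0.3, 0.2]

def variance_explained(eigs):
    """Return (proportions, cumulative proportions), one entry per component."""
    total = sum(eigs)
    proportions = [e / total for e in eigs]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

def components_for(cumulative, target):
    """Smallest number of leading components whose cumulative share reaches target."""
    for k, share in enumerate(cumulative, start=1):
        if share >= target:
            return k
    return len(cumulative)

proportions, cumulative = variance_explained(eigenvalues)

# Heuristic: count the eigenvalues larger than 1 (the average for a correlation matrix).
kaiser_count = sum(1 for e in eigenvalues if e > 1.0)

for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"PC{k}: {100 * p:5.1f}%  cumulative {100 * c:5.1f}%")
```

With these invented values the two rules disagree (three components by the eigenvalue rule, five to reach 80% of the variance), which is why the scree plot and the interpretability of the loadings should also inform the choice.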

2 – Fine Needle Aspiration of Breast Tumor Cells

The files BreastDiag.JMP and BreastDiag (in R) contain data from a study of malignant and benign breast cancer cells using fine needle aspiration.

These data come from a study of breast tumors conducted at the University of Wisconsin-Madison. The goal was to determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells. The samples of tumor cells were examined under an electron microscope and a variety of cell shape characteristics were measured.
Your goal is to use summary statistics and graphical displays to determine which characteristics are most useful for discriminating between benign and malignant tumors.

The variables in the data file you will be using are:

  • ID - patient identification number (not used)
  • Diagnosis = diagnosis determined by biopsy (B = benign, M = malignant)
  • Radius = radius (mean of distances from center to points on the perimeter)
  • Texture = texture (standard deviation of gray-scale values)
  • Smoothness = smoothness (local variation in radius lengths)
  • Compactness = compactness (perimeter^2 / area - 1.0)
  • Concavity = concavity (severity of concave portions of the contour)
  • Concavepts = concave points (number of concave portions of the contour)
  • Symmetry = symmetry (measure of symmetry of the cell nucleus)
  • FracDim = fractal dimension ("coastline approximation" - 1)
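To make the compactness definition in the list above concrete, here is a small Python sketch that evaluates perimeter^2 / area - 1.0 for two idealized shapes. The function name `compactness` and the shapes are illustrative assumptions, not part of the data file.

```python
import math

def compactness(perimeter, area):
    """Compactness as defined in the variable list: perimeter^2 / area - 1.0."""
    return perimeter ** 2 / area - 1.0

# Circle of radius r: perimeter = 2*pi*r and area = pi*r^2, so the value is 4*pi - 1.
r = 3.0
circle = compactness(2 * math.pi * r, math.pi * r ** 2)

# Square of side s: perimeter = 4*s and area = s^2, so the value is 16 - 1 = 15.
s = 2.0
square = compactness(4 * s, s ** 2)

print(round(circle, 3), square)
```

The measure does not depend on size, only on shape, and a circle gives the smallest possible value, so larger values indicate a more irregular outline.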
Medical literature citations:

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.

Perform a principal component analysis using the characteristics listed above (note: perimeter and area are not included).

a) How many principal components would you retain for these data? Include a scree plot. (4 pts.)

b) Interpret the retained principal components by examining the eigenvectors/loadings. Discuss. (6 pts.)

c) Examine a biplot from the PCA. Are the benign tumor cells and malignant tumor cells well separated on the basis of the PC scores? Explain. (3 pts.)

d) Can you characterize what makes the benign and malignant tumor cells different on the basis of these measurements? (4 pts.)
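Alongside the biplot, a quick numerical check of separation is to compare the PC scores of the two diagnosis groups. The Python sketch below does this for a single component using invented scores. The data, the helper names, and the choice of a simple range-overlap check are all assumptions for illustration, and real scores would come from your own PCA.

```python
from statistics import mean

# Hypothetical (diagnosis, PC1 score) pairs; NOT values from BreastDiag.
scores = [("B", -1.8), ("B", -0.9), ("B", -1.2),
          ("M", 2.1), ("M", 1.4), ("M", 0.4)]

def group_means(pairs):
    """Mean score for each diagnosis label."""
    groups = {}
    for label, value in pairs:
        groups.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in groups.items()}

def ranges_overlap(pairs, a="B", b="M"):
    """True if the score ranges of the two groups share at least one point."""
    va = [v for label, v in pairs if label == a]
    vb = [v for label, v in pairs if label == b]
    return max(min(va), min(vb)) <= min(max(va), max(vb))

means = group_means(scores)
gap = means["M"] - means["B"]
print(means, gap, ranges_overlap(scores))
```

A large gap with no overlap on a component suggests that the variables loading heavily on that component are the ones that distinguish the groups.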

3 – Gene Expression Levels for Colon Tissue

These data come from a microarray experiment where gene expression levels were measured for colon tissue samples from 40 individuals with colon cancer and 22 individuals without. Below is a blurb about microarray experiments taken from Modern Multivariate Statistical Techniques by Alan Izenman.

The files Alontop.JMP and Alontop.txt contain the expression levels for 92 genes (columns) found in n = 62 tissue samples (rows).

a) Use corrplot to display the correlations between the 92 genetic expression variables. Do you see distinct groups of genes that are highly correlated with each other, i.e. cluster together? How many groups/clusters of genes would you say there are? (4 pts.)

b) Conduct a PCA of the gene expression levels. How many PCs would you retain? Include a scree plot and table of the percent of total variation explained. (4 pts.)

c) How do the loading patterns that define the PCs relate to the results from part (a)? Use a loading plot of the first two PCs to identify the “clusters” of similar genes. (4 pts.)

d) Examine a 2-D biplot based on the first two principal components. Color code and label the points in this plot according to tumor type (cancerous vs. normal). Are the two groups of points well separated? Can you use this plot to identify genes that would be associated with developing colon cancer? List the genes that you would tell geneticists are potential markers for developing colon cancer. (6 pts.)
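The correlation display in part (a) is built from pairwise Pearson correlations between gene columns. As a rough illustration of what "highly correlated groups" means numerically, the Python sketch below computes those correlations for three invented genes and lists the pairs above a threshold. The gene names, values, helper names, and the 0.8 cutoff are all hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical expression levels: gene name -> values across the same 5 samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],
    "geneC": [5.0, 3.0, 4.0, 1.0, 2.0],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlated_pairs(data, threshold=0.8):
    """Gene pairs whose correlation is at least `threshold`."""
    pairs = []
    for g1, g2 in combinations(sorted(data), 2):
        r = pearson(data[g1], data[g2])
        if r >= threshold:
            pairs.append((g1, g2, round(r, 3)))
    return pairs

print(correlated_pairs(expr))
```

Genes that end up in the same highly correlated group would be expected to have similar loadings on the leading components, which is the link that part (c) asks about.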

4 – NHL Player Statistics (2016-2017 Season)

These data are comprehensive statistics for forwards and defensemen in the NHL (no goalies) from the 2016-2017 season. I have limited the database to contain only players in these positions that played at least 50 games.

Variables in these data:

  • Rank – player rank based entirely upon PTS (not used in PCA)
  • Player – name of player (point labels)
  • Age – age of player (not used in PCA)
  • Position – center C, left wing LW, right wing RW, or defense D
  • Team – team
  • G60 – goals scored per 60 minutes of ice time
  • A60 – assists per 60
  • PTS60 – goals + assists per 60 (G60 + A60)
  • PlusMinus (+/-) – goals for minus goals against while the player is on the ice, i.e. plus-minus rating.
  • PIM – penalties in minutes (total penalty time for the season)
  • PS – point shares; an estimate of the number of points contributed by a player, interpret as a %.
  • EV – even strength goals
  • PP – power play goals
  • SH – short-handed goals
  • GW – game winning goals
  • EVA – even strength assists
  • PPA – power play assists
  • SHA – short-handed assists
  • S60 – shots on goal per 60 minutes of ice time
  • Sper – shooting percentage, percentage of shots on goal that result in a goal.
  • AveTOI – average time on ice per game (in minutes)
  • BLKperGame – blocked shots at even strength per game
  • HITperGame – number of hits at even strength per game
  • CF60 – Corsi For at even strength (Corsi = Shots + Shots Blocked (against) + Misses) per 60
  • CA60 – Corsi Against at even strength per 60
  • CFp – Corsi For % at even strength (CF/(CA+CF)), above 50% means the team was controlling the puck more often than not with this player on the ice in this situation.
  • CFprel – Relative Corsi For % at even strength, which is the difference in CF% when the player is on and off the ice, i.e. CF% on minus CF% off.
  • FF60 – Fenwick For at even strength (Shots + Misses) per 60
  • FA60 – Fenwick Against at even strength per 60
  • FFp – Fenwick For % at even strength (FF/(FA+FF)), above 50% means the team was controlling the puck more often than not with this player on the ice in this situation.
  • FFprel – Relative Fenwick For % at even strength, which is the difference in FF% when the player is on and off the ice, i.e. FF% on minus FF% off.
  • oiSHp – Team on-ice shooting percentage at even strength, the shooting percentage when this player was on the ice.
  • oiSVp – Team on-ice save percentage at even strength, the save percentage when this player was on the ice.
  • PDO – PDO at even strength = Shooting % + Save % (the higher the better)
  • oZSp – offensive zone start % at even strength = Offensive Zone Faceoffs/(Offensive Zone Faceoffs + Defensive Zone Faceoffs).
  • dZSp – defensive zone start % at even strength = Defensive Zone Faceoffs/(Defensive Zone Faceoffs + Offensive Zone Faceoffs).
  • TK60 – takeaways per 60
  • GV60 – giveaways per 60
  • SAtt60 – Total shots attempted in all situations (even strength, penalty kill, power play) per 60 minutes
  • Thru – percentage of shots taken that go on net.
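Several of the variables above are simple functions of on-ice counts. The Python sketch below works through the stated formulas for CF% (FF% has the same form), the relative percentages, PDO, and the zone start percentages, using invented counts. The function names and all numbers are hypothetical and do not come from the data files.

```python
def for_pct(events_for, events_against):
    """CF% or FF%: events for as a percentage of all events, e.g. CF/(CA+CF)."""
    return 100.0 * events_for / (events_for + events_against)

def relative_pct(on_ice_pct, off_ice_pct):
    """CF% rel or FF% rel: percentage with the player on the ice minus off the ice."""
    return on_ice_pct - off_ice_pct

def pdo(on_ice_shooting_pct, on_ice_save_pct):
    """PDO: on-ice shooting percentage plus on-ice save percentage."""
    return on_ice_shooting_pct + on_ice_save_pct

def zone_start_pcts(off_zone_faceoffs, def_zone_faceoffs):
    """Return (oZS%, dZS%) as defined in the variable list."""
    total = off_zone_faceoffs + def_zone_faceoffs
    return 100.0 * off_zone_faceoffs / total, 100.0 * def_zone_faceoffs / total

# Hypothetical player: 660 shot attempts for, 540 against while on the ice.
cf_pct = for_pct(660, 540)              # 55.0, above 50%
cf_rel = relative_pct(cf_pct, 51.5)     # 3.5 points better than when off the ice
player_pdo = pdo(8.5, 92.0)             # 100.5
ozs, dzs = zone_start_pcts(300, 200)    # 60.0 and 40.0
print(cf_pct, cf_rel, player_pdo, ozs, dzs)
```

By these definitions oZSp and dZSp always sum to 100, and PTS60 is exactly G60 + A60, so the variable set contains exact linear dependencies. It is worth keeping that in mind when reading the correlation display and the smallest eigenvalues.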

These data are contained in the files NHL Skater Stats (Rates).csv, NHL Skater Stats (Forwards).csv, and NHL Skater Stats (Defense).csv. The first file contains both forwards and defensemen in the same file.

a) Conduct a PCA for the full skater (non-goalie) statistics. Construct a plot of the first two principal components, color coding the player names by their position. Which player position stands out from the rest? Given the loadings on the first two principal components, does their position make sense? Comment on any players that stand out. What makes them unique? (10 pts.)

b) Conduct a thorough PCA of the NHL Forwards data. (15 pts.)

c) Conduct a thorough PCA of the NHL Defense data. (15 pts.)

A thorough PCA will consist of the following:

  • A nice display of the pairwise correlations between the variables (use corrplot).
  • A discussion of what the first few PCs are measuring. You might also want to include plots/graphs illustrating the variable loadings on the first few components.
  • Some nice graphical displays of the results, e.g. biplots and plots of the PC scores (2-D and/or 3-D) with interesting color coding and labeling. You should discuss what these displays show.
  • Identification of players that stand out and a discussion of what makes them unique.