Problem Set #6
Geog 3000: Advanced Geographic Statistics
Instructor: Dr. Paul C. Sutton
Problem Set Number 6 is another short problem set. We will probe into topics that are beyond the scope of the textbooks. However, I present the problems with a bit of background information that I believe is instructional, and the problems will be relatively easy to solve. I hope these ‘exercises’ are instructive. I encourage you to search the web for good material on these data reduction topics. Topics we will cover include some common data reduction techniques: Principal Components Analysis (PCA)/Factor Analysis and Clustering. I will also introduce the idea of stochastic modeling via a Monte Carlo simulation of 1,000 hockey games. I got this idea from a book on geographic statistics by Peter A. Rogerson. There is a paper in the course documents section of the Blackboard page for the course titled Judgment under Uncertainty: Heuristics and Biases by Amos Tversky and Daniel Kahneman. This is a classic paper that likely influenced famous geographers such as Michael Goodchild and Reg Golledge. It is very relevant to answering the first question of this problem set about air force pilot training. Good luck. Don’t pull your hair out. You will get a chance to punish me back at course evaluation time soon.
From “Statistician’s Blues” by Todd Snider
They say 3 percent of the people use 5 to 6 percent of their brain
97 percent use 3 percent and the rest goes down the drain
I'll never know which one I am but I'll bet you my last dime
99 percent think we're 3 percent 100 percent of the time
64 percent of all the world's statistics are made up right there on the spot
82.4 percent of people believe 'em whether they're accurate statistics or not
I don't know what you believe but I do know there's no doubt
I need another double shot of something 90 proof
I got too much to think about
#1) Judgment under Uncertainty: Heuristics and Biases
Daniel Kahneman and Amos Tversky were at an Israeli pilot training school conducting some sort of seminar with Israeli Air Force flight instructors. They asked a room full of flight instructors whether ‘positive reinforcement’ worked better, worse, or no differently than ‘negative reinforcement’ for pilot training. Most of the flight instructors had come to the conclusion that negative reinforcement worked (improved student flight performance) and positive reinforcement was actually detrimental (made students fly worse). They based this conclusion on the fact that after the really bad landings performed by students (let’s say the worst 10% of all student landings) they yelled at the student and gave them lots of ‘negative reinforcement’. Their observation was that almost invariably that student’s next landing was better. In the case of really good landings by students (let’s say the best 10% of all student landings) they lauded them with praise, smiles, and other kinds of ‘positive reinforcement’. Their observation in these cases was that almost invariably the next landing by that student was not as good. The individual and collective conclusion of these flight instructors was that negative reinforcement works and positive reinforcement does not.
Question 1: What statistical phenomenon explains this pattern of observations? Are the conclusions of the Air Force instructors valid? Why or why not?
NOTE: When you think about this, think about the students in this class and their parallel parking skills. Assume everyone in this class is equally skilled at parallel parking but not all of us perform each individual parallel parking job with the same results each time. Sometimes we nail it on the first pull in, sometimes we go back and forth five times. Visualize all of us performing a parallel parking job with a ‘Parallel parking Instructor’ sitting in the passenger seat. They yell and scream at us if we botch it and have to go back and forth five times. They give us a Hershey’s Kiss and a Hug if we nail it without having to go back and forth. Will we do better after a botched parking job? Will we do worse after a perfect one? Remember the assumption: We are all equally skilled at parallel parking. However, there is random variation with respect to our individual attempts at parallel parking.
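The parallel-parking thought experiment is easy to simulate. Here is a minimal Monte Carlo sketch in Python — all the numbers (a true skill of 50, noise standard deviation of 10, the worst/best 10% cutoffs) are illustrative assumptions, and feedback deliberately has no effect in the model:

```python
import random

random.seed(42)

# Every "attempt" is the SAME true skill plus independent random noise.
# Yelling and Hershey's Kisses change nothing in this model.
def simulate_attempts(n=100_000, skill=50.0, noise_sd=10.0):
    scores = [random.gauss(skill, noise_sd) for _ in range(n)]
    pairs = list(zip(scores, scores[1:]))          # (this attempt, next attempt)
    ranked = sorted(scores)
    cutoff_low, cutoff_high = ranked[n // 10], ranked[-n // 10]
    mean = lambda xs: sum(xs) / len(xs)
    bad = [(c, nxt) for c, nxt in pairs if c <= cutoff_low]     # got yelled at
    good = [(c, nxt) for c, nxt in pairs if c >= cutoff_high]   # got praised
    return (mean([c for c, _ in bad]), mean([nxt for _, nxt in bad]),
            mean([c for c, _ in good]), mean([nxt for _, nxt in good]))

bad_cur, bad_next, good_cur, good_next = simulate_attempts()
# The attempt after a terrible one looks much better, and the attempt after
# a great one looks worse -- even though no feedback of any kind is modeled.
```

Whatever pattern this produces in the attempt that follows an extreme one appears with no reinforcement of any kind in the model — worth keeping in mind when you judge the instructors’ conclusion.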
Question 2: Consider your knowledge of the military (real, TV-based, or imaginary). What kinds of reinforcement do you typically see being exercised by military personnel? Does this little vignette perhaps explain your observations?
Helpful Hint: Read the paper Judgment under Uncertainty: Heuristics and Biases by Tversky and Kahneman (1974), which is posted under the course documents section of the Blackboard site for this course.
#2) Pixels, Remote Sensing, and Cluster Analysis: Separating Sand, Water, and Grass
I have produced two very simple ‘images’ represented as a matrix or ‘raster’ of numbers below (if you have taken a remote sensing course this might seem familiar). The numbers in each cell or ‘pixel’ represent the measurements of the Near-Infrared (VNIR) and Green (visible) sensors at that location (higher numbers mean more VNIR or Green radiation was detected at that location). The ‘scene’ is a patch of earth that has Sand, Grass, and Water in it. Type these numbers into a table with two columns titled VNIR and Green (this is a raster version of a ‘Spatial Join’ in a GIS). Read the table into JMP and answer the questions below:
3 / 4 / 3 / 5 / 31 / 32 / 0 / 1 / 2 / 2 / 5 / 43
6 / 5 / 4 / 33 / 34 / 1 / 2 / 3 / 3 / 3 / 3
4 / 5 / 5 / 25 / 31 / 2 / 0 / 1 / 1 / 20 / 7 / 2
34 / 33 / 30 / 31 / 30 / 36 / 40 / 44 / 48 / 46 / 5 / 6
36 / 3 / 29 / 30 / 33 / 35 / 42 / 5 / 40 / 43 / 3 / 5
35 / 34 / 36 / 35 / 37 / 36 / 45 / 49 / 47 / 38 / 43 / 4
[Figure: the grids above show the Visible Near Infrared (VNIR) band and the Green band; a classified image labeling each pixel Sand (S), Water (W), or Grass (G), with spatial & spectral ‘outliers’ marked, appeared here.]
Question #1) Create a scatter plot of VNIR and Green bands. Cluster it by eye. Explain how you performed this ‘clustering’ operation.
Question #2) Use radiometric theory to classify these clusters (Assume Water (W) is Low VNIR and Low Green, Grass (G) is High VNIR and High Green, Sand (S) is High VNIR and low Green).
Question #3) Use ground truthing (aka empiricism) to classify these clusters. Assume you paid a graduate student to go out with a GPS and find out that the upper left hand corner pixel was definitely water, the lower left hand pixel was definitely grass, and the upper right hand pixel was definitely sand.
Question #4) Use JMP to classify these pixels into clusters. Go to ‘Analyze’, choose ‘Clustering’, and add both ‘VNIR’ and ‘Green’ to the Y, Columns. Click ‘OK’. On the little red upside-down triangle next to ‘Hierarchical Clustering’ choose ‘Number of Clusters’ and enter ‘4’ or ‘3’ (big hint from the scatterplot!). How does JMP’s clustering of these pixels differ from yours? Explain what you think JMP is doing to ‘cluster’ these pixels.
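For intuition about what the clustering platform is doing, here is a minimal sketch (in Python rather than JMP) of agglomerative hierarchical clustering. Conceptually this mirrors JMP’s approach, though JMP’s default linkage (Ward) differs from the single linkage used here, and the (VNIR, Green) pairs below are illustrative values in the range of the grids above, not a transcription of the table:

```python
import math

# Illustrative (VNIR, Green) pairs: low/low, high/high, and high/low groups.
pixels = [(3, 2), (4, 1), (5, 3),        # low VNIR, low Green
          (33, 44), (34, 46), (36, 42),  # high VNIR, high Green
          (31, 3), (32, 2), (30, 5)]     # high VNIR, low Green

def hcluster(points, k):
    """Repeatedly merge the two closest clusters (single linkage) until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair of clusters
    return clusters

clusters = hcluster(pixels, 3)
```

Each pixel starts as its own cluster; the algorithm knows nothing about sand, water, or grass — it only merges whatever is closest in (VNIR, Green) space.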
Question #5) Are these pixels ‘clustered’ based on spectral or ‘spatial’ characteristics? Explain. In the diagram above identify one pixel that is a ‘spatial’ outlier and one pixel that is a ‘spectral’ outlier. Explain your reasoning and real world reasons they might occur.
#3) Using Attitudes about ‘Abortion’ and ‘Gun Control’ to understand Principal Components Analysis
Recall that in Problem Set #4 you wrote a little about Principal Components Analysis (PCA), Factor Analysis, and Clustering. Here is a little tutorial/exercise that I hope helps you understand the ideas behind PCA and Factor Analysis (PCA and Factor Analysis are very similar – I challenge you to find the best explanation of the difference between them on the web). I found the JMP help on Principal Components to be very helpful (check it out).
Imagine 35 people indicating with a number between 0 and 100 how they feel about the following statements (Where 100 is ‘Strongly Agree’ and 0 is ‘Strongly disagree’):
Statement #1: Abortion should be outlawed.
Statement #2: A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms shall not be infringed.
Load the table in the file named PSno6_AbortionGunControlPCA.jmp into JMP and look at an X-Y scatter plot of the two columns. I imagine you can see that answers to these questions might co-vary significantly (i.e. there is a strong correlation between the responses to these two questions – in fact the R² is 0.69). I made up these results but I think they might not be far from what we might really measure. This particular little problem reminds me of a bumper sticker that says: “Look Honey, another Pro-Lifer for the War”. In any case, Principal Components Analysis attempts to reduce the number of ‘columns’ in a data table. Can the responses to these two questions/statements be predominantly captured by some other attribute? To perform PCA on this data do the following: ‘Analyze’, ‘Multivariate Methods’, ‘Principal Components’. Select both columns as ‘Y’ variables and click ‘OK’. The output suggests that the first principal component captures 91% of the common variation in the responses to these two statements and the second principal component captures 9% (8.517% to be precise). You can actually save the principal component scores by clicking on the tiny little upside-down red triangle and choosing ‘Save Principal Components’ (do this). Answer the following questions:
#1) Factor Analysis involves the art of ‘naming’ factors (which are closely related to principal components) based on ‘factor loadings’, which are derived from ‘eigenvectors’. This example is REAL simple, so you don’t need to get involved with factor loadings. What might you ‘name’ the first and second principal component in this particular example based on your experience of people, attitudes, politics, etc.?
Factor #1: ______Factor #2: ______
#2) When you saved the ‘Principal Component Scores’, each person got a ‘score’ which represents their ‘Factor #1-ness’ and ‘Factor #2-ness’. How would you characterize an individual with a high Factor #1 score? How would you characterize someone with a high Factor #2 score? What about low Factor #1 and Factor #2 scores?
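If you want to see the arithmetic hiding behind the PCA platform, here is a sketch on simulated data (the two columns below are generated to co-vary the way the survey responses do — they are not the values in PSno6_AbortionGunControlPCA.jmp). For two standardized variables the correlation matrix is [[1, r], [r, 1]], whose eigenvalues 1 + r and 1 − r give the variance captured by each component:

```python
import math, random

random.seed(0)

# Simulate 35 respondents whose answers to two 0-100 statements co-vary.
n = 35
x = [random.uniform(0, 100) for _ in range(n)]
y = [min(100, max(0, xi + random.gauss(0, 15))) for xi in x]

def standardize(v):
    m = sum(v) / len(v)
    sd = math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))
    return [(vi - m) / sd for vi in v]

xs, ys = standardize(x), standardize(y)
r = sum(a * b for a, b in zip(xs, ys)) / (n - 1)   # Pearson correlation

# Eigenvalues of [[1, r], [r, 1]]; eigenvectors are (1,1)/sqrt(2), (1,-1)/sqrt(2).
lam1, lam2 = 1 + r, 1 - r
pct1 = 100 * lam1 / (lam1 + lam2)   # % of variance on the first component
```

The stronger the correlation, the more of the total variance piles onto the first component — which is exactly why PCA can replace two highly correlated columns with (mostly) one.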
#4) Clustering and Principal Components Analysis – Fun with 3-D visualization
Load the file PCAfactorCluster.JMP into JMP. We are going to play with some of the JMP functionality. In the ‘Row’ menu select ‘Clear Row States’ to decolorize and de-symbolize the records. Choose ‘Graph’ – ‘Scatterplot 3-D’. Add ‘ClusterA’, ‘ClusterB’, and ‘ClusterC’ to the Y, columns. Click OK. Choose the ‘hand’ tool and click and drag on the 3-D cube. You should see the points as black dots in a 3-D cube. Let’s use the Clustering functionality of JMP. From your ‘exploratory data analysis’ (e.g. diddling around with your 3-D cube visualization) I hope you see that there are three clusters. Go to ‘Analyze’ - ‘Multivariate Methods’ – ‘Cluster’. Select ‘ClusterA’, ‘ClusterB’, and ‘ClusterC’ for the Y, columns. Click OK. Now in the little red upside down triangle choose ‘Number of Clusters’ enter ‘3’. Now using same red triangle choose ‘Color Clusters’ and ‘Mark Clusters’. Diddle with the 3-D cube again. Answer the following questions:
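As a supplement to the JMP walkthrough, here is a sketch of the same idea using a different algorithm — a tiny k-means rather than JMP’s hierarchical clustering. The three synthetic 3-D blobs below are made-up stand-ins for the ClusterA, ClusterB, ClusterC columns:

```python
import math, random

random.seed(1)

# Three well-separated blobs of 30 points each in a 3-D "variable space".
centers = [(0, 0, 0), (10, 10, 0), (0, 10, 10)]
points = [tuple(c + random.gauss(0, 1) for c in ctr)
          for ctr in centers for _ in range(30)]

def kmeans(points, seeds, iters=20):
    """Alternate between assigning each point to its nearest mean and updating means."""
    means = list(seeds)
    for _ in range(iters):
        groups = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda i: math.dist(p, means[i]))
            groups[i].append(p)
        means = [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups]
    return means, groups

# Seed with one point from each blob so the sketch is deterministic.
means, groups = kmeans(points, [points[0], points[30], points[60]])
```

The algorithms differ, but the core idea is the one you see when you diddle with the 3-D cube: group records that sit close together in variable space.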
#1) In a broad conceptual way (no hairball math please) – How does JMP classify these points (records in our table) into the three clusters? And, do you think it did a good job?
#2) Different ‘clustering’ techniques are often simply using different ‘distance’ metrics in variable space. Many approaches attempt to minimize the distance between each ‘record’ or ‘point’ in variable space and the ‘mean center’ of its cluster. Google the idea of ‘Mahalanobis Distance’ and have a go at explaining how it is a different way of clustering points in a variable space. I found this one to be fairly useful:
(http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm ).
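As a supplement to that page, here is a sketch of the Mahalanobis distance calculation for two variables, using the closed-form inverse of a 2×2 covariance matrix (the covariance numbers below are illustrative):

```python
import math

def mahalanobis_2d(p, mean, cov):
    """Distance from p to mean, scaled by the variances and correlation in cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # closed-form 2x2 inverse
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    # d^2 = [dx dy] * cov^-1 * [dx dy]^T
    d2 = (dx * (inv[0][0] * dx + inv[0][1] * dy)
          + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(d2)

# With the identity covariance this is just plain Euclidean distance.
d_plain = mahalanobis_2d((3, 4), (0, 0), ((1, 0), (0, 1)))
# Stretch the x-variance to 25 and the same point becomes "closer" in x.
d_stretched = mahalanobis_2d((3, 4), (0, 0), ((25, 0), (0, 1)))
```

Because each offset is scaled by the spread of the data in that direction, a point far out along a high-variance axis counts as closer than the same Euclidean offset along a low-variance axis — which is what makes this metric useful for clustering elongated clouds of points.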
#3) Distance is a basic geographic concept. In statistics we take it into the twilight zone. In a 2-D Cartesian space the ‘distance’ between two points is given by the formula:
Distance = ((x1 − x2)^2 + (y1 − y2)^2)^(1/2)    (the good old Pythagorean theorem)
In 3-D space the distance between two points is given by the formula:
Distance = ((x1 − x2)^2 + (y1 − y2)^2 + (z1 − z2)^2)^(1/2)    (Pythagorean again)
What about distance in a 4-D space? What about distance in ‘Variable’ Space?
Can we simply add (w1 − w2)^2 terms inside the major parentheses? Explain how this concept of distance is important to ‘clustering’ techniques. What would the formula for measuring Euclidean distance in a 5-dimensional world look like?
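One way to convince yourself the pattern generalizes is to write the formula once for any number of dimensions, as in this quick Python sketch:

```python
import math

def euclid(p, q):
    """Euclidean distance in any number of dimensions: sum the squared
    differences across every axis, then take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# 2-D: the familiar 3-4-5 triangle.
d2 = euclid((0, 0), (3, 4))
# 5-D: same formula, two more (w1 - w2)^2-style terms under the root.
d5 = euclid((1, 2, 3, 4, 5), (1, 2, 3, 4, 10))
```

The standard library’s math.dist does the same computation, so this is purely for seeing the formula spelled out.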
#4) In the 3-D cube you played with imagine the points to be ‘pixels’ in a three band satellite image. What are the numerical values of the points (e.g. x, y, z) in terms of the satellite image data? Are the points in each cluster necessarily near each other on the ground in the real world?
#5) PCA, Factor Analysis, and Clustering as Data Compression Techniques
Statistics is often a mechanism by which we compress large amounts of information that our poor little brains cannot handle into smaller amounts of information that our brains can handle. Graphical techniques such as the Histogram and Scatterplot are incredibly profound for helping our mind ‘grok’ a large and abstract set or table of numbers. Geographic Information Systems provide similar cognitive aids to our comprehension of spatial data.