Problem Set #3

Geog 2000: Introduction to Geographic Statistics

Instructor: Dr. Paul C. Sutton

Due Tuesday Week #7 in Lab

Instructions:

There are three parts to this problem set: 1) Do it by Hand Exercises (#’s 1 – 15). Do these with a calculator or by hand. Draw most of your figures by hand. Write out your answers either digitally or by hand. Start your answer to each numbered question on a new page. Make it easy for me to grade. 2) Computer exercises (#’s 16-18). Use the computer to generate the graphics and paste them into a digital version of this assignment while leaving room for you to digitally or hand write your responses if you so wish. 3) ‘How to Lie with Statistics’ Essay(s) (#’s 19-23). For this section, prepare a typed paragraph answering each question. Do not send me digital versions of your answers. I want paper copies because that is the easiest way for me to spill coffee on your assignments when I am grading them. All of the questions on these four problem sets will resemble many of the questions on the three exams. Consequently, it behooves you to truly understand – ON YOUR OWN – how to do these problems. This is a LONG problem set. Get started early.

Rosencrantz and Guildenstern are spinning in their graves.

Do it By Hand Exercises (i.e. Don’t Use A Computer)

#1) Sampling Design

Usually it is too costly in terms of time, money, and/or effort to measure every item in a population (e.g. the length of all the bananas in the world, the average weight of 18 year old men in California, the fraction of defective light bulbs coming off an assembly line, etc). Consequently, we take a statistical approach to such problems by sampling a much smaller number from a population to make an inference about the parameters of a population. For the previous examples you could select 30 bananas and measure their respective lengths and estimate two parameters: The mean length of a banana (central tendency), and the standard deviation or variance of the lengths of those bananas (spread). The same parameters (mean, and variance) could be estimated for the weights of 18 year old Californian men. With the light bulbs the parameter you would be estimating is what fraction of the bulbs are defective. The quality of your estimates of these parameters depends profoundly on the quality of your sampling approach. Your sampling design must be ‘representative’ of the population you are trying to estimate parameters for. This usually requires ‘randomness’ of selection of your sample. Also, the confidence intervals about your parameter estimates tend to shrink as your sample size increases.
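The last point (intervals shrink as the sample grows) can be seen in a minimal Python sketch. Everything here is an assumption made for illustration: the banana lengths are simulated from a made-up normal population (mean 18 cm, standard deviation 2 cm), and the helper name estimate_mean_and_spread is hypothetical.

```python
import random
import statistics

def estimate_mean_and_spread(sample):
    """Return (sample mean, sample standard deviation, standard error of the mean)."""
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)      # uses the n-1 denominator
    se = sd / len(sample) ** 0.5       # standard error shrinks like 1/sqrt(n)
    return mean, sd, se

# Hypothetical population of banana lengths in cm (made-up, not real data)
rng = random.Random(2000)
population = [rng.gauss(18.0, 2.0) for _ in range(100_000)]

for n in (30, 300, 3000):
    sample = rng.sample(population, n)     # simple random sample of size n
    mean, sd, se = estimate_mean_and_spread(sample)
    print(f"n={n:5d}  mean={mean:6.2f}  sd={sd:5.2f}  "
          f"approx. 95% half-width={1.96 * se:5.2f}")
```

The estimated mean and standard deviation stay in the same neighborhood for every n, while the half-width of the interval around the mean drops by roughly a factor of sqrt(10) each time the sample size grows tenfold.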

A) Define and provide an example of the following terms: Population, Sample, Parameter, Random Selection, Sampling Frame

A concrete example is a reasonable way to provide these definitions. Assume you are a campaign manager for a presidential candidate. You want to assess how your candidate stands with respect to actual voters. You can’t survey all of them (we have elections for that – pretty expensive :) ). All the actual voters would be the population you would define. Human behavior being what it is you cannot really guarantee to sample from actual voters. Consequently you will develop a sampling frame which is often registered voters. The sampling frame is the pool of potential entities that can possibly be sampled. You may use various tricks to weight your actual sample to get at something like likely voters (demographic weightings, geographic weightings, etc.). You may or may not parse (aka stratify) your sampling frame to in essence weight your results; in any case, you will want to randomly sample from the stratified or not stratified entities in your sampling frame. Random sampling simply means that all entities in your sampling frame or strata of your sampling frame have an equal chance of being selected. Your sample is simply those entities that are actually selected and measured. The parameter you might be trying to estimate is the fraction (or percentage) of the population that intends to vote for your candidate. This has an actual and true value which will be ascertained on election day. You will make an estimate of that parameter using a statistic. Your estimate will tend to be more accurate the larger a sample size you use; however, many things can mess you up: 1) Your candidate could be caught on camera at a Ku Klux Klan rally between the time your survey is done and election day (that might reduce his popularity). 2) A military conscription or draft may be instituted which changes the voter turnout of young voters. Basically there are lots of ways that your estimate of the percentage of voters intending to vote for your candidate can be changed by the course of events. 
Nonetheless, these polls are more often than not pretty useful as is suggested by the fact that so many are still conducted.

B) Define and provide an example of the following sampling approaches: Simple Random Sampling, Stratified Random Sampling, Cluster Sampling, and Systematic sampling.

Simple Random Sampling: Every entity within the sampling frame has an equal probability of being selected for the sample.

Cluster Sampling & Stratified Sampling: Wikipedia again - Cluster sampling is a sampling technique used when "natural" groupings are evident in a statistical population. It is often used in marketing research. In this technique, the total population is divided into these groups (or clusters) and a sample of the groups is selected. Then the required information is collected from the elements within each selected group. This may be done for every element in these groups or a subsample of elements may be selected within each of these groups. Elements within a cluster should ideally be as heterogeneous as possible, but there should be homogeneity between cluster means. Each cluster should be a small scale representation of the total population. The clusters should be mutually exclusive and collectively exhaustive. A random sampling technique is then used on any relevant clusters to choose which clusters to include in the study. In single-stage cluster sampling, all the elements from each of the selected clusters are used. In two-stage cluster sampling, a random sampling technique is applied to the elements from each of the selected clusters. The main difference between cluster sampling and stratified sampling is that in cluster sampling the cluster is treated as the sampling unit so analysis is done on a population of clusters (at least in the first stage). In stratified sampling, the analysis is done on elements within strata. In stratified sampling, a random sample is drawn from each of the strata, whereas in cluster sampling only the selected clusters are studied. The main objective of cluster sampling is to reduce costs by increasing sampling efficiency. This contrasts with stratified sampling where the main objective is to increase precision. One version of cluster sampling is area sampling or geographical cluster sampling. Clusters consist of geographical areas. 
Because a geographically dispersed population can be expensive to survey, greater economy than simple random sampling can be achieved by treating several respondents within a local area as a cluster. It is usually necessary to increase the total sample size to achieve equivalent precision in the estimators, but cost savings may make that feasible. In some situations, cluster sampling is only appropriate when the clusters are approximately the same size. This can be achieved by combining clusters. If this is not possible, probability proportionate to size sampling is used. In this method, the probability of selecting any cluster varies with the size of the cluster, giving larger clusters a greater probability of selection and smaller clusters a lower probability. However, if clusters are selected with probability proportionate to size, the same number of interviews should be carried out in each sampled cluster so that each unit sampled has the same probability of selection. Cluster sampling is used to estimate high mortalities in cases such as wars, famines and natural disasters. Stratified sampling may be done when you are interested in a sector of the population that might be small (say left-handed horseback riders). A simple random sample would not produce many subjects that were left-handed horseback riders. So, you can stratify your sample and increase the number of left-handed horseback riders that you sample to allow for statistical comparisons.

Systematic Sampling: An example is taking every 50th phone number in the phone book or every 5th customer that walks into a Barnes & Noble bookstore. It is supposedly a structured way of getting at a random sample but has some pitfalls associated with potential periodicity in the ‘stream’ of your sampling frame.
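The four approaches in Part B can be sketched side by side in Python. This is only an illustration under stated assumptions: the frame of 1,000 voters, the county and age-group labels, and all four function names are hypothetical, and the cluster version is the single-stage kind (every unit in a chosen cluster is kept).

```python
import random

def simple_random_sample(frame, n, rng):
    """Every unit in the frame has the same chance of selection."""
    return rng.sample(frame, n)

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Draw a separate simple random sample inside every stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    chosen = []
    for members in strata.values():
        chosen.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return chosen

def cluster_sample(frame, cluster_of, n_clusters, rng):
    """Single-stage: pick whole clusters at random, keep every unit in them."""
    clusters = {}
    for unit in frame:
        clusters.setdefault(cluster_of(unit), []).append(unit)
    picked = rng.sample(sorted(clusters), n_clusters)
    return [unit for c in picked for unit in clusters[c]]

def systematic_sample(frame, step, rng):
    """Random start, then every step-th unit in frame order."""
    start = rng.randrange(step)
    return frame[start::step]

# Hypothetical frame: 1,000 voters as (id, county, age_group) tuples
rng = random.Random(7)
frame = [(i, f"county_{i % 20}", "under_30" if i % 5 == 0 else "30_plus")
         for i in range(1000)]

srs = simple_random_sample(frame, 50, rng)
strat = stratified_sample(frame, lambda u: u[2], 25, rng)
clus = cluster_sample(frame, lambda u: u[1], 2, rng)
syst = systematic_sample(frame, 20, rng)

print(len(srs), len(strat), len(clus), len(syst))
print("counties hit by the systematic sample:", len({u[1] for u in syst}))
```

The made-up frame cycles through its 20 counties in order, so a systematic sample with a step of 20 lands in exactly one county. That is the periodicity pitfall in miniature. The stratified sample, by contrast, guarantees 25 people from the smaller under-30 group, which a simple random sample of 50 would not.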

C) Explain the two important aspects of Simple Random Sampling: Unbiasedness and Independence

The property of unbiasedness simply means that the mechanism or procedure by which you select entities from your sampling frame should not mess with the equal probability of selection principle (this is of course assuming your sampling frame is representative of the population you are trying to estimate parameters of – See Part ‘D’). A classic example is asking for volunteers. People who volunteer for things are a biased sample of human subjects in many many ways. Suppose you are sampling fish in a lake. If you are catching them the old-fashioned way (probably not a good idea) with a fishing hook – you probably won’t sample the fish whose mouth is too small to get around the hook or the older wiser fish that tend not to get caught. These are kinds of bias.

The property of independence manifests in several ways. In terms of sampling – the selection of one entity should not change the probability of any other entity in the sampling frame being selected. Also, knowing the value of the measurement of a sampled entity should not inform you of the value of the measurement of any prior or subsequently selected entities of your sample.

D) Suppose you wanted to estimate the fraction of people of voting age in Colorado that believe the landmark Supreme Court decision regarding abortion (Roe v. Wade) should be overturned. (BTW: An interesting factoid I discovered in the Literature review for my Master’s Thesis was this – Only 30% of American adults could answer the question: “Roe vs. Wade was a landmark Supreme Court decision regarding what?”). You use all the names and phone numbers in all the yellow page phone books of the state. You randomly sample 1,000 people from these phone books and ask them if they want Roe v. Wade overturned. Is this sampling approach a good one? What is the sampling frame? Is this an unbiased approach to sampling? Explain why or why not.

No matter how carefully you take this approach you probably have some serious bias in your sample. This is probably not a good sampling approach. The sampling frame is people who have phone numbers in Colorado phone books. This is probably biased because many people don’t have land lines anymore and cell phone numbers don’t show up in phone books. One way this is almost definitely biased is that it undersamples young people and oversamples older folks. Also, some adults don’t have any phones at all. This cell phone issue is becoming increasingly problematic for polling enterprises such as Pew and Gallup.
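To see how a skewed sampling frame shifts an estimate even when the draw from the frame is perfectly random, here is a small Python sketch. All of the numbers are made up for illustration (they are not survey data), and the function name frame_estimate is hypothetical. It assumes two age groups that differ both in how often they are listed in a phone book and in how they answer a yes/no question.

```python
def frame_estimate(groups):
    """groups: list of (population share, listing rate, fraction answering yes).

    Returns (true population fraction answering yes,
             expected fraction answering yes among people in the frame).
    """
    truth = sum(share * yes for share, _, yes in groups)
    listed = sum(share * rate for share, rate, _ in groups)
    in_frame = sum(share * rate * yes for share, rate, yes in groups) / listed
    return truth, in_frame

# Made-up numbers, for illustration only
groups = [
    (0.30, 0.20, 0.30),   # younger adults: rarely listed in a phone book
    (0.70, 0.80, 0.50),   # older adults: usually listed
]

truth, in_frame = frame_estimate(groups)
print(f"true fraction: {truth:.3f}")
print(f"expected phone-book estimate: {in_frame:.3f}")
```

With these assumed inputs the phone-book estimate overshoots the true fraction because the older group dominates the frame. A bigger sample does not fix this: the bias comes from who is in the frame, not from how many of them you call.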

#2) Condom manufacturing as a Bernoulli Trial

Suppose you start a condom manufacturing company. You have a method to test whether or not there are holes in your condoms (clearly this is something you might want to minimize). You have a machine that produces condoms for which you can control various settings. Suppose you test every one of your first batch of 10,000 condoms produced by this machine and 931 of them ‘fail’ the hole test.

A) Take a statistical approach to characterize the ‘effectiveness’ of your condoms based on your test results. Define your population and unknown parameter(s); find a statistic that estimates this parameter (an estimator) and the theoretical sampling distribution, mean, and standard deviation of this estimator; and use the data above as a random sampling of your product; finally, report and interpret your results and the statistical or sampling error associated with it.

The parameter I am trying to estimate is the fraction of the condoms that my machine produces that either do or don’t have holes in them. The population is all the condoms that my machine produces (this assumes that the defect rate is constant over time – probably not a good assumption but we’re going with it for the sake of simplicity). This problem is essentially identical to the Bernoulli Tack Factory problem in The Cartoon Guide. The statistic that estimates the parameter is given by: p^ = x / n , where ‘x’ is the number of condoms with holes (931), and ‘n’ is the size of the sample (10,000). Thus your estimate of the ‘defect rate’ of your condom machine is 9.31%.

For large sample sizes the sampling distribution of p^ will be approximately normal with mean equal to ‘p’ (the true population parameter value) and standard deviation (σ aka sigma) given by: σ = sqrt( p*(1-p) / n ). Plugging in p^ for the unknown ‘p’ gives an estimate of sqrt( 0.0931 * 0.9069 / 10,000 ) ≈ 0.0029, or about 0.29 percentage points. Since 10,000 is pretty dang large we can say with reasonable confidence that the true ‘p’ is close to our estimate of 9.31%.

How close? Well, we can say with about 68% confidence that the range 9.31% +or- 0.29% (i.e. 9.02% - 9.60%), one standard deviation either side of p^, contains the true parameter value ‘p’. We can say with 95% confidence that the range 9.31% +or- 0.57% (i.e. 8.74% - 9.88%), 1.96 standard deviations either side of p^, contains the true parameter value ‘p’.
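The arithmetic can be checked with a few lines of Python. This is a minimal sketch of the normal-approximation interval p^ +or- z*σ, with p^ plugged in for the unknown ‘p’ in the formula for σ. The function name proportion_interval is hypothetical, and the multipliers 1.0 and 1.96 are the usual normal-distribution values for roughly 68% and 95% coverage.

```python
import math

def proportion_interval(x, n, z):
    """Normal-approximation interval for a proportion: p_hat +/- z * SE."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # p_hat stands in for the unknown p
    return p_hat, se, (p_hat - z * se, p_hat + z * se)

# 931 failures out of 10,000 condoms tested
p_hat, se, ci68 = proportion_interval(931, 10_000, 1.0)
_, _, ci95 = proportion_interval(931, 10_000, 1.96)

print(f"p_hat = {p_hat:.4f}, standard error = {se:.4f}")
print(f"about 68%: {ci68[0]:.4f} to {ci68[1]:.4f}")
print(f"about 95%: {ci95[0]:.4f} to {ci95[1]:.4f}")
```

Because the standard error has sqrt(n) in the denominator, testing four times as many condoms would only cut the width of these intervals in half.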