Stats Exam Paper Answers:

1ai. Quota sampling involves selecting a predetermined number of individuals from mutually exclusive groups in a population for use as a sample. For instance, if you decided to get 50 men and 50 women, then you would be using quota sampling. This is in contrast to simple random sampling, where you would simply select 100 individuals at random.

ii)Stratified sampling involves the separation of the population into mutually exclusive subgroups (strata), as in quota sampling, only in this case a random sample is taken from each of the defined strata. The number of individuals selected from each stratum is often defined by the (theorized) proportion of individuals in the population that belong to that stratum.

iii)An alternative hypothesis, also known as a research hypothesis, is the claim that the researcher is trying to show. For instance, if a quality control expert at a lightbulb factory is trying to find out whether their lightbulbs last less than the advertised length of time, the alternative hypothesis would be:

$H_a: \mu < k$

where $\mu$ is the mean lifetime of the lightbulbs and $k$ is the advertised length. In short, it’s what we want to find out in an experiment.

b)The median is the odd one out. It is the only measure of central tendency in the group. The rest are measures of variability.

c)i) Getting the median of 1.5 would mean ½ of the data being 0 or 1 and the rest of the data being 2 or greater. Hence, since there are 15 points less than or equal to 1, there would need to be 15 points 2 or greater. There are 11 accounted for in the table, so x=4.

ii)The calculation is as follows:

d)

i)There are 36 ways that these two dice could fall. We know the formula for a conditional probability, which is what we have here.

Let A be the event that neither face is 6.

Let B be the event that the sum of the faces is 7.

We use the formula:

$P(B \mid A) = \frac{P(A \cap B)}{P(A)}$

So we need to count the ways that the faces can add to 7 without either face being 6. This can happen in the following ways:

(2, 5), (5, 2), (3, 4), (4, 3)

So $P(A \cap B) = \frac{4}{36}$.

And these are the ways that you don’t get a 6:

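(1, 1) (1, 2) (1, 3) (1, 4) (1, 5)
(2, 1) (2, 2) (2, 3) (2, 4) (2, 5)
(3, 1) (3, 2) (3, 3) (3, 4) (3, 5)
(4, 1) (4, 2) (4, 3) (4, 4) (4, 5)
(5, 1) (5, 2) (5, 3) (5, 4) (5, 5)

There are 25 of them. So $P(A) = \frac{25}{36}$.

So, $P(B \mid A) = \frac{4/36}{25/36} = \frac{4}{25}$.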
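The counting argument above can be checked by brute force. The following is a minimal sketch, assuming two fair six-sided dice and taking A as "neither face is 6" and B as "the faces sum to 7"; the variable names are only illustrative.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes for two fair six-sided dice (assumption).
outcomes = list(product(range(1, 7), repeat=2))

# Event A: neither face shows a 6.
a = [o for o in outcomes if 6 not in o]

# Event A and B together: neither face shows a 6 and the faces sum to 7.
a_and_b = [o for o in a if sum(o) == 7]

p_a = Fraction(len(a), len(outcomes))
p_a_and_b = Fraction(len(a_and_b), len(outcomes))

# Conditional probability P(B | A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(len(a), len(a_and_b), p_b_given_a)
```

Using `Fraction` keeps the result exact, so it can be compared directly with the hand calculation.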

ii)Let’s think about the possible outcomes. The largest number of empty bowls that you can have is 4. That happens when all the balls go into the same bowl. The smallest number of empty bowls that you can have is 2, which happens when all the balls go into different bowls.

First, consider the probability that they all go into the same bowl. It doesn’t matter what bowl the first one goes into, but the probability that the second and the third both go into that same bowl is $\frac{1}{5} \times \frac{1}{5} = \frac{1}{25}$.

Next, consider the probability that they land in different bowls. The first ball can go into any bowl. But the probability that the second ball goes into a different bowl than the first ball is 4/5, and the probability that the third ball goes into neither of those bowls is 3/5. So the probability of being in all different bowls is $\frac{4}{5} \times \frac{3}{5} = \frac{12}{25}$.

We can get the third probability by subtracting these from 1. We get: $1 - \frac{12}{25} - \frac{1}{25} = \frac{12}{25}$

So we can write out the probability distribution:

Number of empty bowls / Probability
2 / 12/25
3 / 12/25
4 / 1/25
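One way to double-check a distribution like this is to enumerate every placement. The sketch below assumes 3 balls dropped independently into 5 equally likely bowls, which is what the answer implies (at most 4 and at least 2 empty bowls); `BOWLS` and `BALLS` are names introduced here for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

BOWLS, BALLS = 5, 3  # assumed setup, inferred from the answer above

# Each placement is a tuple giving the bowl chosen by each ball.
counts = Counter()
for placement in product(range(BOWLS), repeat=BALLS):
    empty = BOWLS - len(set(placement))  # bowls that received no ball
    counts[empty] += 1

total = BOWLS ** BALLS  # number of equally likely placements
distribution = {k: Fraction(v, total) for k, v in sorted(counts.items())}

for empty, prob in distribution.items():
    print(empty, prob)
```

The probabilities are exact fractions and sum to 1, which gives a quick sanity check on the subtraction step.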

e)i)These variables are certainly not independent. Extreme values of x tend to produce low values of y.

ii)The approximate correlation coefficient is 0. This is because the correlation coefficient measures linear correlation. That is, if we drew a linear regression line through these points, we’d expect it to have a slope of 0 (i.e., correlation = 0).

iii)If we split the data, then the correlation would be positive (guess is 0.65) on the left side and negative (guess is -0.65) on the right side.

f)Since the sample size is large and n*p-hat>5, we’re going to use the normal approximation to the binomial to do this test. Our hypotheses are:

  1. In this case our sample proportion is , and our test statistic is calculated as:
  2. We use the standard normal table to find and we find that probability to be 0.0035. This is our p-value.
  3. Since this value is less than our significance level, we reject the null hypothesis.
  4. There is sufficient evidence to conclude that the advertising campaign has been successful.

g)

i)

ii)

iii)

h)

i)We have the formula:

This is clearly not possible because probabilities must be between 0 and 1.

ii)This is not possible because the correlation coefficient must be between -1 and 1.

iii)This is not possible because the variance must be non-negative.

iv)This is possible. For example, the variance of the sample mean decreases by a factor of n, so an increase in sample size would result in a decrease of the variance of the sample mean.

2a

i)

Stem-and-Leaf Display: C1

Stem-and-leaf of C1  N = 50
Leaf Unit = 0.10

  10   10  2335666699
  15   11  01667
  21   12  115689
 (10)  13  5778889999
  19   14  0156789999
   9   15  00112236
   1   16
   1   17
   1   18  4

ii)There are 50 data points, so the median is the average of the 25th and 26th points. Using the above table, we find that both are 13.8, so the median is 13.8. There are 25 points in each half of the data, so the first quartile is the 13th from the bottom and the third quartile is the 13th from the top. These values are 11.6 and 14.9, respectively.
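The positions used above can be checked by rebuilding the data from the display. This is a minimal sketch, assuming the quartiles are taken as the medians of the lower and upper halves (the convention the answer uses); other software conventions may interpolate and give slightly different values.

```python
# Rebuild the 50 observations from the stem-and-leaf display (leaf unit 0.1).
stems = {
    10: "2335666699",
    11: "01667",
    12: "115689",
    13: "5778889999",
    14: "0156789999",
    15: "00112236",
    18: "4",
}
data = sorted(
    stem + int(leaf) / 10 for stem, leaves in stems.items() for leaf in leaves
)

n = len(data)
# Median: average of the 25th and 26th ordered values.
median = (data[n // 2 - 1] + data[n // 2]) / 2

# Quartiles: middle value of each half of 25 points.
lower, upper = data[: n // 2], data[n // 2 :]
q1 = lower[len(lower) // 2]  # 13th from the bottom
q3 = upper[len(upper) // 2]  # 13th from the top

print(n, median, q1, q3)
```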

iii)The stem-and-leaf plot shows each data point. The numbers on the right are called leaves, and they represent the first decimal place of each observation. The stems are the numbers in the 2nd column, and they give the whole-number part of the observation. The first column is the “depth gauge.” It tells us how many points lie on or below the current line (for the first half of the data) or on or above it (for the second half). The value in parentheses is the count on the line that contains the median. So, for example, the third entry in the 1st column is 21. That tells us that there are 21 observations in the 1st through 3rd lines.

b)

i)We convert this value to a z-score: . This value of z is so large that no table goes out that far. That is, the probability that the can gets more than 290 units is effectively 0 (less than 0.001).

ii)In this case we need to rely on the fact that the difference of two normally distributed random variables is also normal. Specifically,

Or, W follows a normal distribution with mean -40 and with standard deviation of

So, now we find our z-score: . Using the standard normal table, we find that

iii)Here again, we need to rely on the sum of normals being normal, in particular:

So we find our test statistic: , and we find from the table

3a)

i)

ii)It appears that there is a negative relationship between these variables. That is, the higher the wind speed, the lower the running times. Also, there appears to be a linear relationship between them.

iii)The following table shows the calculations:

y / x / y-ybar / x-xbar / (y-ybar)(x-xbar)
10.510 / -2.940 / 0.206 / -3.049 / -0.628
10.440 / -1.630 / 0.136 / -1.739 / -0.237
10.390 / -0.740 / 0.086 / -0.849 / -0.073
10.220 / -0.380 / -0.084 / -0.489 / 0.041
10.440 / -0.170 / 0.136 / -0.279 / -0.038
10.180 / 0.210 / -0.124 / 0.101 / -0.013
10.300 / 0.600 / -0.004 / 0.491 / -0.002
10.150 / 1.080 / -0.154 / 0.971 / -0.150
10.200 / 2.250 / -0.104 / 2.141 / -0.223
10.210 / 2.810 / -0.094 / 2.701 / -0.254
Average / 10.304 / 0.109
Standard deviation / 0.130 / 1.714
Sum of (y-ybar)(x-xbar) / -1.575

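So we calculate the coefficient to be:

$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} = \frac{-1.575}{9 \times 1.714 \times 0.130} \approx -0.785$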
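The table calculations can be reproduced in a few lines. This is a sketch using the y and x values from the table and sample standard deviations (divisor n - 1); small differences in the last decimal place are expected because the table rounds intermediate values.

```python
import math

# Values copied from the table: y is the running time, x is the wind speed.
y = [10.51, 10.44, 10.39, 10.22, 10.44, 10.18, 10.30, 10.15, 10.20, 10.21]
x = [-2.94, -1.63, -0.74, -0.38, -0.17, 0.21, 0.60, 1.08, 2.25, 2.81]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sum of cross-products of deviations (last column of the table).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Sample standard deviations (divisor n - 1).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# Pearson correlation coefficient.
r = s_xy / ((n - 1) * s_x * s_y)

print(round(s_xy, 3), round(s_x, 3), round(s_y, 3), round(r, 3))
```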

iv)We find, as we expected, that the correlation is negative, and that there is a fairly high level of linear correlation between the two variables.

v)Yes. The correlation is high, so a predictive model would be useful.

b)Since I have a complete list of the employees of the company, and since the company is spread out over a large area, it might be useful to use stratified sampling to conduct the study. That is, take a simple random sample from each country, the size of which would be based on the proportion of employees that live in that country. To avoid self-selection bias, we should choose the employees that we would like to survey and then deliberately seek them out to get them to fill in a survey. Nonetheless, one of the problems will be getting a high response rate, and a low response rate may cause the study to lose some validity. We might add additional control variables including gender, job type, salary, years with the company and others. These would help us to get a good idea of how the different kinds of employees at the company would like to spend their holidays.

4a.

i)The formal test that we’re talking about here is a test of independence. That is, are the variables attitude and social class related in some way?

The following table shows some needed values:

Observed Values
/ lower / middle / upper / Total
approved / 34 / 45 / 61 / 140
indifferent / 55 / 30 / 203 / 288
disapproved / 28 / 22 / 220 / 270
Total / 117 / 97 / 484 / 698

Expected Values
/ lower / middle / upper / Total
approved / 23.47 / 19.46 / 97.08 / 140
indifferent / 48.28 / 40.02 / 199.70 / 288
disapproved / 45.26 / 37.52 / 187.22 / 270
Total / 117 / 97 / 484 / 698

The expected values table relies on the fact that if the variables are independent then we know that $P(A \cap B) = P(A)\,P(B)$. So the values inside the table are calculated by multiplying the marginal probabilities together. For example:

The approved, lower class cell is: $\frac{140}{698} \times \frac{117}{698} \approx 0.0336$. That’s the expected proportion. We multiply by the total sample size to get the final value: $698 \times 0.0336 \approx 23.47$.

So the test statistic is the squared difference between the observed and expected values, divided by the expected value, summed over all the cells: $\chi^2 = \sum \frac{(O - E)^2}{E}$. We do this and get 73.916 for our test statistic value. We consult the chi-square table, knowing that the degrees of freedom are (3-1)x(3-1) = 4, and find the p-value to be less than 0.001.
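The expected counts and the test statistic can be recomputed from the observed table. This is a minimal sketch in plain Python; the p-value lookup is left to the chi-square table, as in the answer.

```python
# Observed counts; columns are lower, middle, upper social class.
observed = [
    [34, 45, 61],    # approved
    [55, 30, 203],   # indifferent
    [28, 22, 220],   # disapproved
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Under independence, expected count = row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (O - E)^2 / E over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(row_totals) - 1) * (len(col_totals) - 1)

print(round(expected[0][0], 2), round(chi_square, 3), df)
```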

ii)The results show that there is very strong evidence that the two variables are not independent. In particular, the middle class tend to approve of the proposal far more often than they would if the variables were independent.

b)The sample proportion is . So the confidence interval is:

c)We’ll use a two-independent-sample t-test with equal variance to carry out this test.

  1. The hypotheses are: and .
  2. We set
  3. Now we calculate the test-statistic

First we need the pooled standard deviation. The formula is:

$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

So,

And the test statistic is:

We use the t-table, noting that there are $n_1 + n_2 - 2$ degrees of freedom, and find that our p-value is less than 0.001.

  4. All we have demonstrated here is that contact between mother and infant in the first few hours makes the mother feel more attached to the baby in the first week. It is not obvious that this effect would extend to the relationship at other times. In any case, this is a sociological question, and not a statistical one. The analysis does not support the extension of these results beyond the first week.
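The pooled-variance calculation in part c) can be sketched as a small function. The function name `pooled_t` and the summary statistics passed to it are hypothetical and are NOT the exam data; they only illustrate how the pooled standard deviation, the test statistic and the degrees of freedom fit together.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    # Pooled standard deviation: weighted average of the two sample variances.
    s_p = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df)
    # Test statistic: difference in means over its estimated standard error.
    t = (mean1 - mean2) / (s_p * math.sqrt(1 / n1 + 1 / n2))
    return s_p, t, df

# Hypothetical summary statistics, for illustration only.
s_p, t, df = pooled_t(mean1=6.0, s1=2.0, n1=10, mean2=4.0, s2=2.0, n2=10)
print(round(s_p, 3), round(t, 3), df)
```

The resulting t value would then be compared against the t-table with `df` degrees of freedom.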