A Catechism
for
The Practice of Statistics, Second Edition
The excellent textbook was written by Daniel Yates, David Moore, and Daren Starnes, and is copyright 2003, W.H. Freeman and Company.
This catechism was written by Joseph Strayhorn, in 2006, and is not copyrighted. Strayhorn’s email address is
YMS Chapter 1: Exploring Data
Q1. The science of data is known as ____.
A1. Statistics
Q2. Most raw data sets can be organized into rows and columns. Each row represents some object or person that is studied, and each column represents some characteristic about that thing that is measured. Our textbook calls those objects and characteristics what two things respectively?
A2. Individuals and variables
Q3. What are the two main classes of variable types?
A3. Categorical and quantitative
Q4. A description, depiction, or equation telling what values a variable takes on and how often it takes on these values is called the ___ of the variable.
A4. Distribution
Q5. Before studying the relationships among variables, it's usually good to begin by examining what?
A5. Each variable by itself
Q6. Before getting numerical summaries of the data, your textbook advises exploring the data with what?
A6. Graphs
Q7. What two types of graphs are usually most appropriate for categorical data?
A7. Bar charts and pie charts
Q8. If several percentages do not represent portions of the same whole, then what type of graph is inappropriate?
A8. A pie chart
Q9. When you are asked to describe a distribution after looking at a graph, the general tactic is to look for an overall pattern and also for striking deviations from that pattern. When describing the overall pattern, what three features should you mention?
A9. Center, shape, and spread
Q10. When you are asked to describe a distribution, the general tactic is to look for an overall pattern and also for striking deviations from that pattern. What are the striking deviations called?
A10. Outliers
Q11. Someone wants to display the center, shape, and spread of a data set with a picture. But the person also wants to communicate, through the same graph, the individual raw data values that were collected in the study. There are too many different values that the variable takes on to make a dot plot feasible. What type of graph should the person choose?
A11. A stem plot
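To illustrate how a stem plot shows the shape of a distribution while keeping every raw value visible, here is a minimal Python sketch. It assumes nonnegative whole-number data, with the ones digit as the leaf and everything to its left as the stem. The function name stem_plot and the data are made up for this example and do not come from the textbook.

```python
from collections import defaultdict

def stem_plot(values):
    """Build stem plot lines: ones digit is the leaf, the rest is the stem."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    first, last = min(leaves), max(leaves)
    lines = []
    # Include every stem in the range, even empty ones, so gaps stay visible.
    for stem in range(first, last + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

ages = [12, 15, 21, 23, 23, 38, 40, 61]   # made-up data
for line in stem_plot(ages):
    print(line)
```

Each printed row is one stem, and the leaves can be read back as the original values (for example, stem 2 with leaves 1, 3, 3 stands for 21, 23, 23).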
Q12. Instead of a dot plot or a stem plot, a ____ is the most common graph of the distribution of one quantitative variable.
A12. Histogram
Q13. What does your textbook depict as a minimum number for either the number of stems in a stem plot, or the number of classes in a histogram?
A13. Five
Q14. If the right and left sides of a histogram are approximately mirror images of each other, we call the distribution what?
A14. Symmetric
Q15. If there's a big hump on the left side of a histogram and a long tail extending far out to the right, do we say that the distribution is skewed right or skewed left?
A15. Skewed right
Q16. If you look at people's incomes, defining income so that zero is the smallest possible value, and your sample includes mainly middle-income people but also a few extremely high-income people, will the distribution be skewed right or skewed left?
A16. Skewed right
Q17. Mary gets a test report saying that 79% of the test takers fell at or below the score that she made. The name of the type of score she got is what?
A17. Percentile
Q18. A relative cumulative frequency graph is often called what?
A18. Ogive.
Q19. In a relative cumulative frequency graph, or ogive, the horizontal axis is for the values of the variable you are looking at. For any given value on the horizontal axis, what does the value on the vertical axis stand for?
A19. The fraction of observations less than or equal to that value
Q20. If you are given a relative cumulative frequency graph, and someone asks you to find the center of the distribution, how do you do it?
A20. Find the value on the x-axis that has a 50% or .5 value on the y-axis.
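The idea behind an ogive can be sketched in a few lines of Python. This is only an illustration with made-up scores, and the function names are hypothetical. Because it works from raw values rather than a smooth graph, the center it reports is always an observed value, so it can differ slightly from a median found by averaging the two middle observations.

```python
def relative_cumulative_frequency(values, x):
    """Fraction of observations less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)

def center_from_ogive(values):
    """Smallest observed value whose cumulative fraction reaches 0.5."""
    for x in sorted(set(values)):
        if relative_cumulative_frequency(values, x) >= 0.5:
            return x

scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]   # made-up test scores

# A score of 90 is at the 80th percentile: 80% scored at or below it.
print(relative_cumulative_frequency(scores, 90))   # 0.8
print(center_from_ogive(scores))                   # 75
```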
Q21. On a time plot, what axis does time go on?
A21. The horizontal axis
Q22. On a time plot, an overall upward or downward slope is called what?
A22. A trend
Q23. On a time plot, what do you call the shorter-term variations that occur regularly, repeating themselves in a cyclic fashion?
A23. Seasonal variation
Q24. 1/n times the summation of the x(i), where n is the number of cases and x(i) is the value of the ith case, is known as what?
A24. The mean
Q25. The number in a distribution such that half the observations are smaller and the other half are larger is called what?
A25. The median
Q26. If there is no middle value in a data set because you have an even number of cases, how do you find the median then?
A26. You find the mean of the two center observations.
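The definitions of the mean and the median can be written out directly. The following Python sketch uses made-up data and hypothetical function names. It is meant only to mirror the wording of the answers above.

```python
def mean_by_formula(values):
    """Mean as (1/n) times the sum of the x(i)."""
    n = len(values)
    return sum(values) / n

def median_by_position(values):
    """Middle value of the ordered list, or the mean of the two center values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: one middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the two centers

scores = [4, 8, 15, 16, 23, 42]     # made-up data with an even number of cases
print(mean_by_formula(scores))      # 18.0
print(median_by_position(scores))   # 15.5
```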
Q27. Between the mean and median, which of these is pulled farther in the direction of extreme values or outliers?
A27. The mean.
Q28. If a distribution is highly skewed to the right, which value will be lower: the mean, or the median?
A28. The median.
Q29. From which statistic, the mean or the median, can you recover the total value of all the cases in your data set, if you know how many cases there are?
A29. The mean
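A small numerical example may help with the last three answers. The Python sketch below uses made-up incomes and a hypothetical helper named mean_of. It shows one extreme value pulling the mean far more than the median, and the total being recovered from the mean and the number of cases.

```python
from statistics import median

def mean_of(values):
    """Arithmetic mean: the total divided by the number of cases."""
    return sum(values) / len(values)

incomes = [30, 35, 40, 45, 50]     # made-up incomes, in thousands of dollars
with_outlier = incomes + [1000]    # add one extremely high income

print(mean_of(incomes), median(incomes))             # 40.0 40
print(mean_of(with_outlier), median(with_outlier))   # 200.0 42.5

# The total can be recovered from the mean and the number of cases.
total = mean_of(with_outlier) * len(with_outlier)
print(total == sum(with_outlier))                    # True
```

With the outlier added, the distribution is skewed right, and the median (42.5) is lower than the mean (200.0).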
Q30. What's the definition of the range of a distribution?
A30. The difference between the largest and smallest value
Q31. What's the chief problem with using the range as a measure of the spread of a distribution?
A31. It's too sensitive to outliers, and it depends on only two values in the data set.
Q32. What do you call the median of the subset of observations whose position in the ordered list is to the left of the overall median?
A32. The first quartile
Q33. What's the definition of the interquartile range?
A33. The third quartile minus the first quartile.
Q34. What's the rule of thumb for defining outliers in terms of the interquartile range?
A34. An outlier falls more than 1.5 times the interquartile range above the third quartile or below the first quartile.
Q35. What five numbers are in the so-called five number summary?
A35. The minimum, the first quartile, the median, the third quartile, and the maximum.
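The quartile definition, the interquartile range, and the 1.5 times IQR rule can be combined in one short Python sketch. It assumes the convention described above, in which the quartiles are the medians of the observations to the left and right of the overall median. Other software may use slightly different quartile conventions and give slightly different answers. The function names and data are made up for illustration.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # right of the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

def outliers_by_iqr_rule(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [1, 3, 5, 7, 9, 11, 13, 15, 50]   # made-up data
print(five_number_summary(data))         # (1, 4.0, 9, 14.0, 50)
print(outliers_by_iqr_rule(data))        # [50]
```

Here the IQR is 14 minus 4, or 10, so anything below -11 or above 29 is flagged, and only 50 qualifies.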
Q36. What type of graph gives a picture of the five number summary?
A36. The box plot
Q37. What's the difference between a regular box plot and a modified box plot?
A37. In a regular box plot, the whiskers go out to the maximum and minimum. In a modified box plot, the whiskers go out to the largest and smallest data points that are not outliers. The outliers are plotted as isolated points on a modified box plot.
Q38. If you take the deviation of each observation from the mean of the whole set, square those deviations, add those squares, and divide by one less than the number of observations, what do you call the resulting number?
A38. The variance
Q39. What is the relationship between the variance and the standard deviation?
A39. The standard deviation is the square root of the variance.
Q40. How is the standard deviation like the interquartile range?
A40. Both of them are measures of spread of the distribution.
Q41. When you average the squared deviations from the mean to find the variance of a sample, what should you divide by: the n of cases, or the "degrees of freedom"?
A41. The degrees of freedom
Q42. Under what conditions will a standard deviation equal zero?
A42. When all the observations have the same value.
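The recipe for the variance and standard deviation translates almost word for word into code. The Python sketch below uses hypothetical function names and made-up data. It divides by n - 1, the degrees of freedom, as described above, and compares the result with the standard library's statistics.variance, which also uses n - 1.

```python
from math import sqrt
from statistics import variance

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)

def sample_sd(values):
    """Standard deviation: the square root of the variance."""
    return sqrt(sample_variance(values))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up data, mean 5, squared deviations sum to 32
print(sample_variance(data))      # 32 / 7
print(variance(data))             # standard library value for comparison
print(sample_sd([3, 3, 3, 3]))    # 0.0, since all observations are equal
```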
Q43. Between the interquartile range and the standard deviation, which is more resistant to the effects of the outliers?
A43. The interquartile range
Q44. How do you choose between the five number summary on the one hand, and the mean and standard deviation, on the other hand, as ways of describing a distribution?
A44. The mean and standard deviation are good for reasonably symmetric distributions that are free of outliers. Otherwise the five number summary is usually better.
Q45. If you add the same number to each observation, how does that affect the center and the spread of the distribution?
A45. The number that you add is added to the measures of center, such as the mean and median. But measures of spread, such as the interquartile range and standard deviation, are not affected.
Q46. If you multiply each observation by the same number, how does that affect measures of center and spread?
A46. Both the measures of center (median and mean) and the measures of spread (standard deviation and interquartile range) are multiplied by the same number. (The variance, which is also a measure of spread, is multiplied by the square of the number each observation is multiplied by.)
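The effects of adding a constant and of multiplying by a constant can be checked numerically. This Python sketch uses made-up data and the standard library's statistics module. The constants 10 and 3 are arbitrary choices for illustration.

```python
from statistics import mean, median, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up data
shifted = [x + 10 for x in data]    # add the same number to each observation
scaled = [x * 3 for x in data]      # multiply each observation by the same number

# Adding 10 moves the center by 10 but leaves the spread alone.
print(mean(shifted) - mean(data), median(shifted) - median(data))
print(stdev(shifted) - stdev(data))

# Multiplying by 3 multiplies center and spread by 3, and the variance by 9.
print(mean(scaled) / mean(data), median(scaled) / median(data))
print(stdev(scaled) / stdev(data), variance(scaled) / variance(data))
```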
Q47. What are three graphical methods of comparing distributions?
A47. Side-by-side bar graphs, back-to-back stem plots, and side-by-side box plots.
YMS Chapter 2: The Normal Distribution
Q1. The scales of density curves are adjusted so that the total area under each curve is what?
A1. One
Q2. The area under a density curve between two x-axis values represents what?
A2. The proportion of all observations that fall between those values.
Q3. Do measures of center and spread apply to a density curve as well as to sets of observations?
A3. Yes
Q4. How do you define the median of the density curve?
A4. The point with half the area under the curve to its left and the remaining half of the area to its right.
Q5. The quartiles of a density curve divide the area into what?
A5. Four equal parts.
Q6. What is the relationship between the mean and the median of a symmetric density curve?
A6. They are equal.
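These properties of density curves can be checked on a simple example. The Python sketch below uses a made-up triangular density on the interval from 0 to 2 (not a curve from the textbook) and approximates areas with the midpoint rule. The function names density and area are hypothetical.

```python
def density(x):
    """A made-up symmetric triangular density on the interval from 0 to 2."""
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0

def area(a, b, steps=10_000):
    """Approximate area under the density between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width

print(round(area(0, 2), 6))       # total area under the curve: 1.0
print(round(area(0.5, 1.5), 6))   # proportion of observations between 0.5 and 1.5: 0.75
print(round(area(0, 1), 6))       # half the area lies left of 1, so the median is 1: 0.5
```

Because this curve is symmetric about 1, its mean is also 1, equal to its median.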
Q7. Which is pulled farther toward the tail of a skewed distribution: the median, or the mean?
A7. The mean
Q8. In conventional notation, what are the meanings of x-bar and s, as contrasted to mu and sigma?
A8. The first two refer to the mean and standard deviation, respectively, of a set of observations, a sample. The second two refer to the mean and standard deviation, respectively, of a density curve or idealized distribution, or the population distribution.
Q9. What three features describe the overall shape of a normal curve?
A9. Normal curves are symmetric, single peaked, and bell shaped.
Q10. Is there only one normal curve, or is there an infinite number of normal curves?
A10. An infinite number.
Q11. For any given mean and standard deviation, is there only one normal curve, or an infinite number of normal curves?
A11. Only one.
Q12. How can you visually find the points one standard deviation from the mean of a normal curve?
A12. Those points are the inflection points of the curve. That is, the curve changes from falling more and more steeply to falling less and less steeply, or vice versa. (Optional answer for calculus lovers: they are points where the second derivative of the curve equals zero.)
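For calculus lovers, the optional answer can be verified from the standard formula for the normal density with mean \mu and standard deviation \sigma. The working below is a sketch added for illustration and is not taken from the textbook.

```latex
\begin{align*}
f(x)   &= \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \\
f'(x)  &= -\frac{x-\mu}{\sigma^2} \, f(x) \\
f''(x) &= \left( \frac{(x-\mu)^2}{\sigma^4} - \frac{1}{\sigma^2} \right) f(x) \\
f''(x) = 0 &\iff (x-\mu)^2 = \sigma^2 \iff x = \mu \pm \sigma
\end{align*}
```

Since f(x) is never zero, the second derivative vanishes only at the two points one standard deviation from the mean.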
Q13. The distributions of test scores, of measures of characteristics of living things, and of summary statistics for chance outcomes repeated many times, often (but not always!) follow what type of distribution?
A13. The normal distribution
Q14. What three percentages do you have to remember when you are stating the “empirical rule”?
A14. 68%, 95%, and 99.7%.
Q15. Are the three percentages for 1, 2, and 3 standard deviations exact, or easier-to-remember rounded approximations?
A15. Approximations.
Q16. What do the three percentages in the empirical rule apply to? In other words, what is the meaning of this rule?
A16. The three numbers tell the percent of observations falling within 1, 2, or 3 standard deviations of the mean, respectively, in a normal curve. (Note that each percent refers to the observations in the interval from that number of standard deviations below the mean to that number above the mean.)
Q17. True or false: If Mary scores one standard deviation above the mean on a normally distributed test, then approximately 68% of the test takers scored as close to the mean of the test as, or closer to the mean than, Mary did.
A17. True
Q18. True or false: If Mary scores one standard deviation above the mean on a normally distributed test, her score is in the 68th percentile.
A18. False
Q19. True or false: If Mary scores one standard deviation above the mean on a normally distributed test, half of 68%, or 34%, are above the mean but at or below Mary’s score. An additional 50% are below the mean. Thus Mary equals or surpasses 50% plus 34% of the test takers, and is at the 84th percentile.
A19. True
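The empirical rule and Mary's percentile can be checked with Python's standard library, which provides statistics.NormalDist (available in Python 3.8 and later). This is an illustrative sketch, and the helper name fraction_within is made up for this example.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal curve: mean 0, standard deviation 1

def fraction_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k) * 100, 2))   # 68.27, 95.45, 99.73

# Mary, one standard deviation above the mean, is near the 84th percentile.
print(round(z.cdf(1) * 100, 2))                    # 84.13
```

The printed values show that 68%, 95%, and 99.7% are rounded approximations, and that one standard deviation above the mean corresponds to roughly the 84th percentile, not the 68th.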
Q20. What does the notation N(100,15) mean?
A20. It denotes a normal distribution with mean 100 and standard deviation 15.
Q21. True or false: the standard score for any observation tells how many standard deviations that score is from the mean.
A21. True
Q22. What two operations do we do, to standardize a score?
A22. Subtract the mean and divide by the standard deviation.
Q23. A standard score is often called by what other term?
A23. The z-score.
Q24. What does the sign of a standard score correspond to?
A24. If the z-score is positive, it’s above the mean, and if negative, below the mean.
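Standardizing takes only one line of arithmetic. The Python sketch below uses a hypothetical function named standardize and made-up scores from an N(100,15) distribution.

```python
def standardize(x, mu, sigma):
    """Subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma

# Made-up scores on a test distributed as N(100, 15).
print(standardize(130, 100, 15))   # 2.0: two standard deviations above the mean
print(standardize(85, 100, 15))    # -1.0: one standard deviation below the mean
print(standardize(100, 100, 15))   # 0.0: exactly at the mean
```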
Q25. Are there an infinite number of standard normal curves, each with its own equation describing it, or just one standard normal curve, with just one equation describing it?
A25. There is just one standard normal curve, with only one equation describing it.