Exam 1 Review Notes
As I have been describing in class over the past weeks, the purpose of statistics is to:
· Complement visuals (graphs, tables, charts and plots) by offering us a way to organize data with numbers that describe and measure data samples (these numbers are called statistics).
· Statistics are therefore NUMERICAL ADJECTIVES that describe features of a sample of data that are important in answering the TWO big questions pertinent to organizing any data set:
o Question 1: What is happening in the data?
§ The answer to this question will be obtained in four different stages of description; we must measure a data set’s:
· CENTRAL TENDENCY
· DISPERSION or VARIATION
· LOCATION or POSITION, and
· ASSOCIATION
o Question 2: How often?
§ The answer to this question will be obtained by comparing statistics from the above different categories such that we can make descriptive statements about:
· the SHAPE of the frequency distribution of the data (SKEWNESS v. SYMMETRY)
· the MAGNITUDE of the frequency of the data (CHEBYSHEV’S AND EMPIRICAL RULES), and
· relationships between 2 variables (correlation and covariance)
MEASURES OF CENTRAL TENDENCY (except for geometric mean computations, which are covered at the end of these notes)
The purpose of Central Tendency statistics is to find the MIDDLE of a data set. Because there are 5 varieties of data, there are different kinds of MIDDLE for data sets, and therefore different central tendency statistics. However, all of these statistics share one common feature: they all try to measure or describe what a “TYPICAL VALUE” for a data set is.
The arithmetic mean uses the magnitude of numerical data as the criterion for finding a TYPICAL VALUE; the median finds a TYPICAL VALUE by revealing the number in the middle position of an ordered data set; and the mode finds a TYPICAL VALUE by frequency of occurrence in the data set.
Look at the five examples below for computations of the three statistics for data sets consisting of unlinked observations forming a sample:
Ordered arrays (n = 10 in each sample):
Data set #1 (constant):                              5, 5, 5, 5, 5, 5, 5, 5, 5, 5
Data set #2 (symmetrical and bulky in the middle):   1, 2, 3, 4, 5, 5, 6, 7, 8, 9
Data set #3 (symmetrical and bulky in the extremes): 1, 1, 2, 3, 4, 6, 7, 8, 9, 9
Data set #4 (skewed right):                          1, 2, 3, 4, 5, 5, 6, 7, 8, 100
Data set #5 (skewed left):                           -100, 2, 3, 4, 5, 5, 6, 7, 8, 9

                             #1      #2      #3           #4       #5
arithmetic mean (AVERAGE)     5       5       5           14.1     -5.1
MEDIAN                        5       5       5            5        5
MODE                          5       5      1 and 9       5        5
                                             (BIMODAL)

Note: the mean is a NONRESISTANT STATISTIC … IT IS INFLUENCED BY OUTLIERS (compare data sets #4 and #5).
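If you want to check these results outside of Excel, here is a minimal sketch in Python (my choice of tool for these notes' examples, not something the course requires). The data set labels and variable names are mine, and the mode logic keeps every tied value so that the bimodal data set #3 shows both modes:

    # Central tendency for the five sample data sets above, using only the Python standard library.
    from statistics import mean, median
    from collections import Counter

    data_sets = {
        "#1 constant":                       [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
        "#2 symmetrical, bulky in middle":   [1, 2, 3, 4, 5, 5, 6, 7, 8, 9],
        "#3 symmetrical, bulky in extremes": [1, 1, 2, 3, 4, 6, 7, 8, 9, 9],
        "#4 skewed right":                   [1, 2, 3, 4, 5, 5, 6, 7, 8, 100],
        "#5 skewed left":                    [-100, 2, 3, 4, 5, 5, 6, 7, 8, 9],
    }

    for name, values in data_sets.items():
        counts = Counter(values)
        top = max(counts.values())
        modes = [v for v, c in counts.items() if c == top]  # keeps both modes for the bimodal set
        print(name, "| mean =", mean(values), "| median =", median(values), "| mode(s) =", modes)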
What can be learned about central tendency statistics and their implied description of the frequency distribution of the samples in question? Think about this for next Tuesday’s class, and for the enclosed extra credit question.
MEASURES OF LOCATION OR POSITION, by ranking or placement only (except standard scores)
These statistics are important because they let you describe sample values BY HOW EACH OBSERVATION DIFFERS FROM THE OTHERS, in contrast with the purpose of central tendency measures, which is to find “TYPICAL” or REPRESENTATIVE values.
These measures differ in the degree of refinement you wish to use when describing the relative position of observations in a sample with respect to each other. For these measures to be computed, data sets must be AT LEAST ordinal by level of measurement, so that we can rank samples into ordered arrays.
For example, using the data set below, we can compute the QUARTILES of the data set using Excel:
University                                    Change in Cost ($)    Ordered array
University of California, Berkeley                  1,589               308   (minimum)
University of Georgia, Athens                          593               423
University of Illinois, Urbana-Champaign             1,223               593
Kansas State University, Manhattan                     869               708
University of Maine, Orono                             423               869
University of Mississippi, Oxford                    1,720               922
University of New Hampshire, Durham                    708             1,223
Ohio State University, Columbus                      1,425             1,425
University of South Carolina, Columbia                 922             1,589
Utah State University, Logan                           308             1,720  (maximum)

Five-number Summary (generated by PhStat)
Minimum         308
Q1              508
Median = Q2     869
Q3            1,324
Maximum       1,720

[Box-and-whisker plot generated by PhStat]
The five-number summary splits the data into QUARTILE rankings that, plotted together, form the so-called box-and-whisker plot.
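If you prefer to reproduce the five-number summary and the box-and-whisker plot outside of Excel/PhStat, the sketch below uses Python with numpy and matplotlib (my own choice of tools, assumed to be installed). Note that numpy's default quartile rule interpolates between observations, so its Q1, median, and Q3 may differ slightly from the PhStat values shown above:

    # Five-number summary and box-and-whisker plot for the tuition-change data above.
    import numpy as np
    import matplotlib.pyplot as plt

    change_in_cost = [1589, 593, 1223, 869, 423, 1720, 708, 1425, 922, 308]

    # numpy uses linear interpolation by default, one of several quartile conventions
    minimum, q1, median, q3, maximum = np.percentile(change_in_cost, [0, 25, 50, 75, 100])
    print("min =", minimum, "Q1 =", q1, "median =", median, "Q3 =", q3, "max =", maximum)

    plt.boxplot(change_in_cost, vert=False)  # box spans Q1 to Q3; whiskers reach toward the extremes
    plt.xlabel("Change in cost ($)")
    plt.title("Box-and-whisker plot of the tuition-change sample")
    plt.show()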
MEASURES OF DISPERSION
These statistics are probably the MOST IMPORTANT ONES to comprehend from this course. What these statistics try to describe is the “TYPICAL DIFFERENCE IN VALUE (by magnitude or position) FROM A TYPICAL VALUE (a central tendency measure) USED AS A REFERENCE POINT, OR FROM AN EXTREME VALUE USED AS A REFERENCE POINT.”
Their role is to describe how scattered data values are with respect to different LANDMARK values; these landmark values are either central tendency measures (for magnitude-based dispersion measures) or extreme values (for position-based dispersion measures).
STATISTIC: VARIANCE
Algebraic symbol: s^2
Excel formula: =VAR(*:*)
Criterion: magnitude of numeric values
Describes: the absolute amount of typical variation in a sample, measured in squared units of the actual data
Remarks: least useful for descriptive purposes, because squared units are difficult to interpret in many applications

STATISTIC: STANDARD DEVIATION
Algebraic symbol: s
Excel formula: =STDEV(*:*)
Criterion: magnitude of numeric values
Describes: the absolute amount of typical variation in a sample, measured in the same units as the actual data
Remarks: the MOST USEFUL measure of the average amount of dispersion in a sample of data … this measure is effectively a FANCY AND OBJECTIVE “CLASS WIDTH” that translates data into standard units of measurement, enabling comparisons of different data sets

STATISTIC: COEFFICIENT OF VARIATION
Algebraic symbol: CV
Excel formula: =STDEV(*:*) / AVERAGE(*:*)
Criterion: magnitude of numeric values
Describes: the RELATIVE amount of typical variation in the data, expressed as a fraction (or percentage) of the mean, which makes it unit-free
Remarks: a VERY USEFUL measure of the average amount of dispersion in a sample of data relative to the magnitude of the mean, and thus useful for comparing relative dispersion across different data sets

STATISTIC: RANGE
Algebraic symbol: RANGE
Excel formula: none built in; compute as in chapter 2, =MAX(*:*) - MIN(*:*)
Criterion: location-based (position) dispersion measure
Describes: the difference in value between the extremes of the data values of a sample
Remarks: can be computed easily as range = max - min

STATISTIC: INTERQUARTILE RANGE
Algebraic symbol: IQR
Excel formula: none built in; compute as IQR = Q3 - Q1
Criterion: location-based (position) dispersion measure
Describes: the difference in value across the middle half of the data values of a sample
Remarks: can be computed just as easily as the RANGE, as Q3 - Q1
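As a quick cross-check of the table above, here is a minimal Python sketch (the function name is mine) that reproduces the same five measures. Note that Excel's =VAR and =STDEV are the sample (n - 1) versions, which is also what statistics.variance and statistics.stdev compute, and that Q1 and Q3 are passed in because quartile rules differ across software:

    # Sample dispersion measures corresponding to the Excel formulas in the table above.
    from statistics import mean, variance, stdev

    def dispersion_summary(values, q1, q3):
        s2 = variance(values)                    # variance, in squared units of the data
        s = stdev(values)                        # standard deviation, same units as the data
        cv = s / mean(values)                    # coefficient of variation (relative dispersion)
        data_range = max(values) - min(values)   # range = max - min
        iqr = q3 - q1                            # interquartile range = Q3 - Q1
        return s2, s, cv, data_range, iqr

    # Example: the tuition-change sample, with the quartiles taken from the PhStat summary above
    tuition_changes = [1589, 593, 1223, 869, 423, 1720, 708, 1425, 922, 308]
    print(dispersion_summary(tuition_changes, q1=508, q3=1324))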
Conceptual examples of how to apply dispersion measures to better describe data samples
Example #1: Two distinct residential real estate markets have the following price characteristics:
Neighborhoods      Mean home selling price    Median home selling price    Standard deviation of home
                   (in $1,000s)               (in $1,000s)                 selling prices (in $1,000s)
Oakland Hills C          700                        600                              100
Noe Valley B           1,000                        750                              200
Description: The Noe Valley B homes are more expensive, more scattered in dollars, and more right-skewed in distribution shape than the Oakland Hills C homes. A home in Noe Valley B that costs $800K is comparable to an Oakland Hills C home that costs $600K (can you explain why? We will go over this problem on Tuesday, so be ready).
Practice question: what is the price of a statistically equivalent home in Noe Valley B to a home that sells for $1,000,000 in the Oakland Hills C neighborhood? (Hint: notice that a million-dollar home in the Oakland Hills C neighborhood sells for 3 STANDARD DEVIATIONS MORE THAN THE MEAN PRICE of a typical home in that neighborhood.)
Example #1 Using Z-scores, means and standard deviations
Explain, in the context of example #1 above, what it means when people say:
“I live in the Oakland Hills C neighborhood, and if my house were located in the Noe Valley B neighborhood, it would sell for 2 million dollars.”
Specifically, what can you infer about this person’s home selling price in the Oakland Hills C neighborhood?
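A minimal sketch of the z-score reasoning behind example #1 (the function names are mine, and the numbers used are the $800K / $600K comparison already stated in the description above):

    # z = (x - mean) / standard deviation; two homes are "statistically equivalent"
    # when they have the same z-score within their own neighborhoods.
    def z_score(x, mu, sigma):
        return (x - mu) / sigma

    def equivalent_price(x, mu_from, sigma_from, mu_to, sigma_to):
        """Convert a price in one neighborhood to the equally ranked price in another."""
        return mu_to + z_score(x, mu_from, sigma_from) * sigma_to

    # $800K in Noe Valley B (mean 1,000, sd 200) vs. $600K in Oakland Hills C (mean 700, sd 100):
    print(z_score(800, 1000, 200))                     # -1.0
    print(z_score(600, 700, 100))                      # -1.0, so the two homes are comparable
    print(equivalent_price(800, 1000, 200, 700, 100))  # 600.0 (in $1,000s)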
Example #2, applying the coefficient of variation (CV): The two investments described below have the following investment characteristics.
Investments     Mean annual rate of return    Standard deviation around the mean       Coefficient of Variation
                on investment (%)             rate of return on investment (%)         (for each portfolio)
Portfolio P             20                                  5                            5/20 = 25%
Portfolio Q             10                                  3                            3/10 = 30%
Description: Portfolio P offers a higher return on investment on average (but not necessarily always), but it comes at a higher level of risk (fluctuations in returns on investment), since its standard deviation (5) is greater than that of portfolio Q (3). However, portfolio Q has greater risk relative to the expected reward from investing in it, since its coefficient of variation, a ratio of “risk to reward,” is higher (30%) than P’s (25%). In other words, relative to how much you expect to get from Q, a standard deviation of 3% is more “risk” per unit of return than P’s larger absolute risk of 5% is in relation to P’s 20% expected payoff.
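The same risk-to-reward comparison, as a tiny Python sketch (the dictionary layout is mine):

    # Coefficient of variation as a "risk per unit of reward" ratio for example #2
    portfolios = {"P": (20, 5), "Q": (10, 3)}  # (mean return %, standard deviation %)
    for name, (mu, sigma) in portfolios.items():
        cv = sigma / mu
        print(f"Portfolio {name}: CV = {cv:.0%}")  # prints 25% for P and 30% for Q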
The Shape of a distribution
So far, these notes have only covered using basic statistics to describe data sets’ values. However, as the textbook details, you can also use some of these statistics to describe the frequency distribution’s shape as either symmetric or skewed.
Symmetry occurs when the difference in value from Q1 to the median is the same as the difference from the median to Q3, and when, moreover, the difference in value from the minimum to the median equals the difference from the median to the maximum. Alternatively, symmetry can be inferred from a sample of data in which the mean equals the median in magnitude.
Skewness refers to the distortion of the shape of the frequency distribution of a sample, measured by the difference in value between the mean and the median. When the mean is the greater of the two you have skewness to the right, and vice versa. Skewness to the right may also be observed when the difference from the median to the upper quartile exceeds the difference between the lower quartile and the median.
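Here is a small sketch of the two shape checks just described (the function name is mine, and the quartiles are passed in rather than computed because quartile rules differ across software):

    # Shape checks: (1) compare the mean with the median, (2) compare quartile distances around the median.
    from statistics import mean, median

    def describe_shape(values, q1, q3):
        m, med = mean(values), median(values)
        if m > med:
            by_center = "skewed right (mean > median)"
        elif m < med:
            by_center = "skewed left (mean < median)"
        else:
            by_center = "symmetric (mean = median)"
        by_quartiles = (q3 - med) - (med - q1)  # > 0 suggests right skew, < 0 left skew, 0 symmetric
        return by_center, by_quartiles

    # Data set #4 from the central tendency table, with quartiles taken as the medians of its halves;
    # the mean/median check flags the right skew, while the quartile check is resistant to the outlier (100).
    print(describe_shape([1, 2, 3, 4, 5, 5, 6, 7, 8, 100], q1=3, q3=7))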
The percentage distributions of data value ranges: empirical rule and Chebyshev’s rule
Both of these rules let you predict how often data values lie in certain ranges around the mean (or median) of a data sample. The empirical rule only works effectively for symmetric data sets, while Chebyshev’s rule applies to any shape of frequency distribution, skewed or symmetric.
(The percentages below are the share of the data’s distribution that lies in the stated range, according to each rule.)

Range of data values                                 Empirical rule      Chebyshev’s rule

For symmetrically shaped data sets:
1 standard deviation below to 1 above the mean       68%                 NO PREDICTION
2 standard deviations below to 2 above the mean      95%                 AT LEAST 1 - 1/2^2 = 75%
3 standard deviations below to 3 above the mean      99.7%               AT LEAST 1 - 1/3^2 = 88.89%

For skewed data sets:
k standard deviations below to k above the mean      NO PREDICTION       AT LEAST 1 - 1/k^2
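To see the two rules in action on a sample, the sketch below compares the actual share of observations within k standard deviations of the mean to Chebyshev’s lower bound (the function names are mine; the sample mean and standard deviation are used):

    # Actual share of a sample within k standard deviations of the mean, versus Chebyshev's lower bound.
    from statistics import mean, stdev

    def within_k_sd(values, k):
        mu, s = mean(values), stdev(values)
        inside = [x for x in values if mu - k * s <= x <= mu + k * s]
        return len(inside) / len(values)

    def chebyshev_bound(k):
        return 1 - 1 / k ** 2  # "at least" this fraction of the data, for any shape, when k > 1

    sample = [1, 2, 3, 4, 5, 5, 6, 7, 8, 9]  # data set #2 from the central tendency table (symmetric)
    for k in (2, 3):
        print("k =", k, "| actual share =", within_k_sd(sample, k), "| Chebyshev: at least", chebyshev_bound(k))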
Linked or Chained data: the geometric mean and the geometric mean rate of return
Sometimes, data sets are formed such that the individual data points are connected or intertwined; we call this “chained.” For example, if I measure my body’s weight over time to establish whether my weight is changing, up or down, then I am forming a chained data set. If, on the other hand, I choose to weigh many other (different) persons, who may or may not be otherwise connected, then I have a data set that is said to be unchained. Another example: if I track bids for an auction item on eBay, I am collecting a chained data set, whereas if I collect the prices at which different items were bought on eBay, then I am collecting prices that make up a data set that is not chained.
When the data set you are observing is chained, a more appropriate process for measuring the MEAN is a GEOMETRIC AVERAGING PROCESS, not an arithmetic one. Thus the book describes the geometric mean and the geometric mean rate of return. Furthermore, on the web site you will find an Excel file illustrating the three ways you can use MS EXCEL (not PhStat) to compute both the geometric mean (in value units) of a sample and the geometric mean rate of return (in % change terms) of a sample. In fact, the problem described there is the one due for homework in chapter 3.
Use these resources to learn to calculate and use these two geometric constructions.
Chained or linked data refers to observed sample data values (typically numerical) that are related by some other variable or factor, usually time, space or origin. Examples of chained data include time series, measurements of a variable from the same source at different points in time, or spatially related data, such as bids from different sources for the same auctioned item.
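While the homework uses the Excel file mentioned above, here is a small Python sketch of the same two calculations. The numbers are made up purely for illustration and are not the homework data; math.prod requires Python 3.8 or later:

    # Geometric mean (in value units) and geometric mean rate of return (in % change terms).
    import math

    def geometric_mean(values):
        """n-th root of the product of n positive values."""
        n = len(values)
        return math.prod(values) ** (1 / n)

    def geometric_mean_rate_of_return(rates):
        """Average per-period rate of return for chained returns, e.g. 0.10 means +10% in a period."""
        n = len(rates)
        total_growth = math.prod(1 + r for r in rates)  # growth factor over the whole chain
        return total_growth ** (1 / n) - 1

    prices = [100, 110, 121]                     # a chained series of values
    rates = [0.10, -0.50, 0.30]                  # chained per-period rates of return
    print(geometric_mean(prices))                # approximately 110 (the cube root of 1,331,000)
    print(geometric_mean_rate_of_return(rates))  # about -0.106, i.e. roughly a -10.6% average per period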