Chapter 3 Part B
Descriptive Statistics: Numerical Methods
A. Measures of Relative Location and Detecting Outliers
1. Skewness
2. z-Scores
- The z-score is often called the standardized value.
- It denotes the number of standard deviations a data value $x_i$ is from the mean: $z_i = (x_i - \bar{x}) / s$.
- A data value less than the sample mean will have a z-score less than zero.
- A data value greater than the sample mean will have a z-score greater than zero.
- A data value equal to the sample mean will have a z-score of zero.
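The sign behavior described above can be checked with a minimal sketch. It assumes the sample standard deviation (divisor n - 1); the function name `z_scores` and the data are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (x_i - sample mean) / s for each value in the sample."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(x - x_bar) / s for x in values]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5
scores = z_scores(sample)

for x, z in zip(sample, scores):
    print(f"x = {x}: z = {z:+.2f}")
```

Values below the mean of 5 print negative z-scores, values above it print positive ones, and the two values equal to 5 print zero.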
3. Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be within z standard deviations of the mean, where z is any value greater than 1.
• At least 75% of the items must be within z = 2 standard deviations of the mean.
• At least 89% of the items must be within z = 3 standard deviations of the mean.
• At least 94% of the items must be within z = 4 standard deviations of the mean.
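The three percentages follow directly from the bound 1 - 1/z². A short sketch of the arithmetic, where the function name `chebyshev_bound` is a hypothetical helper:

```python
def chebyshev_bound(z):
    """Minimum proportion of items within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z**2

for z in (2, 3, 4):
    print(f"z = {z}: at least {chebyshev_bound(z):.2%} of the items")
```

The exact values for z = 3 and z = 4 are 8/9 and 15/16, which round to the 89% and 94% quoted above.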
4. Empirical Rule
For data having a bell-shaped distribution:
• Approximately 68% of the data values will be within one standard deviation of the mean.
• Approximately 95% of the data values will be within two standard deviations of the mean.
• Almost all (99.7%) of the items will be within three standard deviations of the mean.
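These percentages can be reproduced from the normal distribution, which is the usual model behind "bell-shaped". The sketch below assumes an exactly normal distribution; real data will only match approximately. The function name `normal_share_within` is hypothetical.

```python
from statistics import NormalDist

def normal_share_within(k):
    """Proportion of a normal distribution within k standard deviations of its mean."""
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: normal_share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```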
5. Detecting Outliers
- An outlier is an unusually small or unusually large value in a data set.
- A data value with a z-score less than -3 or greater than +3 might be considered an outlier.
- It might be an incorrectly recorded data value.
- It might be a data value that was incorrectly included in the data set.
- It might be a correctly recorded data value that belongs in the data set!
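The z-score rule can be sketched as a simple filter. The threshold of 3 comes from the rule above; the function name `flag_outliers` and the data are hypothetical. A flagged value is only a candidate for review, since it may still be correct and belong in the data set.

```python
from statistics import mean, stdev

def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 divisor)
    return [x for x in values if abs((x - x_bar) / s) > threshold]

# Hypothetical sample: nineteen values between 12 and 25, plus one value of 95.
sample = [12, 14, 15, 15, 16, 16, 17, 17, 18, 18,
          19, 19, 20, 20, 21, 22, 23, 24, 25, 95]
candidates = flag_outliers(sample)
print(candidates)
```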
B. Exploratory Data Analysis
1. Five-Number Summary
- Smallest Value
- First Quartile
- Median
- Third Quartile
- Largest Value
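A minimal sketch of the five-number summary using Python's standard library. Quartile conventions differ between textbooks and software; this example assumes the "inclusive" interpolation method of `statistics.quantiles`, which may not match the rule used elsewhere in this chapter. The function name `five_number_summary` is hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest) for the data."""
    # quantiles() sorts the data itself; n=4 yields the three quartile cut points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])  # hypothetical data
print(summary)
```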
2. Box Plot
- A box is drawn with its ends located at the first and third quartiles.
- A vertical line is drawn in the box at the location of the median.
- Limits are located (not drawn) using the interquartile range (IQR).
- The lower limit is located 1.5(IQR) below Q1.
- The upper limit is located 1.5(IQR) above Q3.
- Data outside these limits are considered outliers.
- Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits.
- The location of each outlier is shown with the symbol *.
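The limit and whisker rules above can be sketched without drawing anything. As an assumption, quartiles are computed with the "inclusive" method of `statistics.quantiles`; a different quartile convention would shift the limits slightly. The function name `box_plot_parts` and the data are hypothetical.

```python
from statistics import quantiles

def box_plot_parts(values):
    """Compute the limits, whisker ends, and outliers used to draw a box plot."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    return {
        "limits": (lower_limit, upper_limit),
        # Whiskers end at the most extreme data values inside the limits.
        "whiskers": (min(inside), max(inside)),
        "outliers": [x for x in values if x < lower_limit or x > upper_limit],
    }

parts = box_plot_parts([1, 2, 3, 4, 5, 6, 7, 8, 30])  # hypothetical data
print(parts)
```

Here Q1 = 3 and Q3 = 7, so the IQR is 4 and the limits sit at -3 and 13. The value 30 falls outside and would be plotted with *.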
C. Measures of Association Between Two Variables
1. Covariance
- The covariance is a measure of the linear association between two variables.
- Positive values indicate a positive relationship.
- Negative values indicate a negative relationship.
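- If the data sets are samples, the covariance is denoted by $s_{xy}$: $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
- If the data sets are populations, the covariance is denoted by $\sigma_{xy}$: $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$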
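A minimal sketch of both covariance computations, using the standard definitions: the sum of products of deviations from the means, divided by n - 1 for a sample or by N for a population. The function name `covariance` and the data are hypothetical.

```python
def covariance(xs, ys, population=False):
    """Covariance of paired data; divides by n - 1 (sample) or N (population)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_products / (n if population else n - 1)

xs = [1, 2, 3, 4, 5]   # hypothetical paired observations
ys = [2, 4, 6, 8, 10]
print(covariance(xs, ys), covariance(xs, ys, population=True))
print(covariance(xs, ys[::-1]))  # reversed y values give a negative relationship
```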
2. Correlation Coefficient
- The coefficient can take on values between -1 and +1.
- Values near -1 indicate a strong negative linear relationship.
- Values near +1 indicate a strong positive linear relationship.
- If the data sets are samples, the coefficient is $r_{xy} = \frac{s_{xy}}{s_x s_y}$.
- If the data sets are populations, the coefficient is $\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$.
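A short sketch of the sample correlation coefficient, computed as the covariance divided by the product of the two standard deviations. The n - 1 divisors cancel, so the code works directly with sums of deviations. The function name `correlation` and the data are hypothetical.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient of paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)

xs = [1, 2, 3, 4, 5]       # hypothetical paired observations
rising = [2, 4, 6, 8, 10]  # exact positive linear relationship
falling = [10, 8, 6, 4, 2] # exact negative linear relationship
print(correlation(xs, rising), correlation(xs, falling))
```

Unlike the covariance, the result does not depend on the units of measurement, which is why it is bounded by -1 and +1.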
D. The Weighted Mean and Working with Grouped Data
1. Weighted Mean
- When the mean is computed by giving each data value a weight that reflects its importance, it is referred to as a weighted mean.
- In the computation of a grade point average (GPA), the weights are the number of credit hours earned for each grade.
- When data values vary in importance, the analyst must choose the weight that best reflects the importance of each value.
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where: $x_i$ = value of observation i
$w_i$ = weight for observation i
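The GPA case mentioned above can be sketched as follows, with the weighted mean computed as the sum of weight times value divided by the sum of the weights. The grade points and credit hours are hypothetical, as is the function name `weighted_mean`.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grade_points = [4, 3, 2]   # hypothetical grades: A, B, C
credit_hours = [3, 4, 3]   # hypothetical credit hours earned at each grade
gpa = weighted_mean(grade_points, credit_hours)
print(gpa)
```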
3. Mean for Grouped Data
- The weighted mean computation can be used to obtain approximations of the mean, variance, and standard deviation for grouped data.
- To compute the weighted mean, we treat the midpoint of each class as though it were the mean of all items in the class.
- We compute a weighted mean of the class midpoints using the class frequencies as weights.
- Similarly, in computing the variance and standard deviation, the class frequencies are used as weights.
- Sample Data: $\bar{x} = \frac{\sum f_i M_i}{n}$
- Population Data: $\mu = \frac{\sum f_i M_i}{N}$
where:
$f_i$ = frequency of class i
$M_i$ = midpoint of class i
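A minimal sketch of the grouped-data mean, treating each class midpoint as the value of every item in its class and using the class frequencies as weights. The frequency distribution and the function name `grouped_mean` are hypothetical.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: sum of f_i * M_i divided by the total number of items."""
    n = sum(frequencies)  # total number of items across all classes
    return sum(f * m for m, f in zip(midpoints, frequencies)) / n

midpoints = [5, 15, 25]    # hypothetical classes 0-10, 10-20, 20-30
frequencies = [2, 5, 3]
approx_mean = grouped_mean(midpoints, frequencies)
print(approx_mean)
```

The same arithmetic applies to sample and population data; only the notation for the result and the total count differs.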
4. Variance for Grouped Data
- Sample Data: $s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$
- Population Data: $\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$
5. Standard Deviation for Grouped Data
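The standard deviation for grouped data is the positive square root of the grouped-data variance. The sketch below computes both, weighting each squared deviation of a class midpoint from the grouped mean by the class frequency, and dividing by n - 1 for a sample or N for a population. The function names and the frequency distribution are hypothetical.

```python
from math import sqrt

def grouped_variance(midpoints, frequencies, population=False):
    """Approximate variance from class midpoints M_i and frequencies f_i."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    squared = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return squared / (n if population else n - 1)

def grouped_std_dev(midpoints, frequencies, population=False):
    """Approximate standard deviation: square root of the grouped variance."""
    return sqrt(grouped_variance(midpoints, frequencies, population))

midpoints = [5, 15, 25]    # hypothetical classes 0-10, 10-20, 20-30
frequencies = [2, 5, 3]
print(grouped_variance(midpoints, frequencies))
print(grouped_std_dev(midpoints, frequencies, population=True))
```

With a grouped mean of 16, the weighted squared deviations sum to 490, giving a sample variance of 490/9 and a population variance of 49.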