Introduction and Descriptive Statistics

Chapter 1


1-1. 1. quantitative/ratio

2. qualitative/nominal

3. quantitative/ratio

4. qualitative/nominal

5. quantitative/ratio

6. quantitative/interval

7. quantitative/ratio

8. quantitative/ratio

9. quantitative/ratio

10. quantitative/ratio

11. quantitative/ordinal

1-2.Data are based on numeric measurements of some variable, either from a data set comprising an entire population of interest, or else obtained from only a sample (subset) of the full population. Instead of doing the measurements ourselves, we may sometimes obtain data from previous results in published form.

1-3. The weakest is the Nominal Scale, in which categories of data are grouped by qualitative differences and assigned numbers simply as labels, not usable in numeric comparisons. Next in strength is the Ordinal Scale: data are ordered (ranked) according to relative size or quality, but the numbers themselves don't imply specific numeric relationships. Stronger than this is the Interval Scale: the ordered data points have meaningful distances between any two of them, measured in units. The strongest is the Ratio Scale, which is like an Interval Scale but also has a true zero point, so the ratio of any two data values is meaningful when comparing them.

1-4. Fund: Qualitative

Style: Qualitative

US/Foreign: Qualitative

10 yr Return: Quantitative

Expense Ratio: Quantitative

1-5.Ordinal.

1-6.A qualitative variable describes different categories or qualities of the members of a data set, which have no numeric relationships to each other, even when the categories happen to be coded as numbers for convenience. A quantitative variable gives numerically meaningful information, in terms of ranking, differences, or ratios between individual values.

1-7.The people from one particular neighborhood constitute a non-random sample (drawn from the larger town population). The group of 100 people would be a random sample.

1-8.A sample is a subset of the full population of interest, from which statistical inferences are drawn about the population, which is usually too large to permit the variables to be measured for all the members.

1-9.A random sample is a sample drawn from a population in a way that is not a priori biased with respect to the kinds of variables being measured. It attempts to give a representative cross-section of the population.

1-10.Nationality: qualitative. Length of intended stay: quantitative.

1-11.Ordinal. The colors are ranked, but no units of difference between any two of them are defined.

1-12. Income: quantitative, ratio

Number of dependents: quantitative, ratio

Filing singly/jointly: qualitative, nominal

Itemized or not: qualitative, nominal

Local taxes: quantitative, ratio

1-13. Lower quartile = 25th percentile = data point in position (n + 1)(25/100) = 34(25/100) = position 8.5. (Here n = 33.) Let us order our observations:

109, 110, 114, 116, 118, 119, 120, 121, 121, 123, 123, 125, 125, 127, 128, 128, 128, 128, 129, 129, 130, 131, 132, 132, 133, 134, 134, 134, 134, 136, 136, 136, 136.

Lower quartile = 121

Middle quartile is in position: 34(50/100) = 17. Point is 128.

Upper quartile is in position: 34(75/100) = 25.5. Point is 133.5

10th percentile is in position: 34(10/100) = 3.4. Point is 114.8.

15th percentile is in position: 34(15/100) = 5.1. Point is 118.1.

65th percentile is in position: 34(65/100) = 22.1. Point is 131.1.

IQR = 133.5 - 121 = 12.5.

Percentile and Percentile Rank Calculations
x / x-th Percentile / y / Percentile rank of y
10 / 116.4 / 116.4 / 10
15 / 118.8 / 118.8 / 15
65 / 130.8 / 130.8 / 65
Quartiles
1st Quartile / 121
Median / 128 / IQR / 12
3rd Quartile / 133
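
The position rule used in the hand calculations for Problem 1-13 can be sketched in Python as follows. The function names are illustrative and are not part of the textbook's template. The second function uses the position (n – 1)(P/100) + 1; this is an assumption, but it reproduces the template figures listed above (116.4, 118.8, 130.8), which would explain why they differ from the hand-computed values.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based, possibly fractional position, by linear interpolation."""
    n = len(ordered)
    pos = min(max(pos, 1), n)      # clamp to the valid range of positions
    whole = int(pos)               # position of the lower neighbour
    frac = pos - whole             # fraction of the way to the upper neighbour
    if whole == n:
        return ordered[-1]
    return ordered[whole - 1] + frac * (ordered[whole] - ordered[whole - 1])


def percentile_n_plus_1(data, p):
    """Rule used in the hand calculations: position (n + 1)(P/100)."""
    ordered = sorted(data)
    return value_at_position(ordered, (len(ordered) + 1) * p / 100)


def percentile_n_minus_1(data, p):
    """Assumed alternative rule: position (n - 1)(P/100) + 1."""
    ordered = sorted(data)
    return value_at_position(ordered, (len(ordered) - 1) * p / 100 + 1)


# Ordered observations from Problem 1-13 (n = 33).
data_1_13 = [109, 110, 114, 116, 118, 119, 120, 121, 121, 123, 123,
             125, 125, 127, 128, 128, 128, 128, 129, 129, 130, 131,
             132, 132, 133, 134, 134, 134, 134, 136, 136, 136, 136]

for p in (10, 15, 25, 50, 65, 75):
    print(p,
          round(percentile_n_plus_1(data_1_13, p), 2),
          round(percentile_n_minus_1(data_1_13, p), 2))
```

The two rules agree on the median here but differ for the other percentiles, so it is worth stating which rule a reported percentile follows.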

1-14.First, order the data:

-1.2, 3.9, 8.3, 9, 9.5, 10, 11, 11.6, 12.5, 13, 14.8, 15.5, 16.2, 16.7, 18

The median, or 50th percentile, is the point in position 16(50/100) = 8. The point is 11.6.

First quartile is in position 16(25/100) = 4. Point is 9.

Third quartile is in position 16(75/100) = 12. Point is 15.5.

55th percentile is in position 16(55/100) = 8.8. Point is 12.32.

85th percentile is in position 16(85/100) = 13.6. Point is 16.5.

1-15.Order the data:

38, 41, 44, 45, 45, 52, 54, 56, 60, 64, 69, 71, 76, 77, 78, 79, 80, 81, 87, 88, 90, 98

Median is in position 23(50/100) = 11.5. Point is 70.

20th percentile is in position 23(20/100) = 4.6. Point is 45.

30th percentile is in position 23(30/100) = 6.9. Point is 53.8.

60th percentile is in position 23(60/100) = 13.8. Point is 76.8.

90th percentile is in position 23(90/100) = 20.7. Point is 89.4.

Percentile and Percentile Rank Calculations
x / x-th Percentile / y
20 / 46.4 / 46.4
30 / 54.6 / 54.6
60 / 76.6 / 76.6
Quartiles
1st Quartile / 52.5
Median / 70 / IQR / 27.25
3rd Quartile / 79.75

1-16.Order the data: 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7.

Lower quartile is the 25th percentile, in position 16(25/100) = 4. Point is 2.

The median is in position 16(50/100) = 8. The point is 3.

Upper quartile is in position 16(75/100) = 12. Point is 5.

IQR = 5 - 2 = 3.

60th percentile is in position 16(60/100) = 9.6. Point is 4.

Percentile and Percentile Rank Calculations
x / x-th Percentile / y
60 / 4 / 4.0
1 / 0
1 / 0
Quartiles
1st Quartile / 2
Median / 3 / IQR / 3
3rd Quartile / 5

1-17. The data are already ordered; there are 16 data points. The median is the point in position 17(50/100) = 8.5. It is 51.

Lower quartile is in position 17(25/100) = 4.25. It is 30.5.

Upper quartile is in position 17(75/100) = 12.75. It is 194.25.

IQR = 194.25 - 30.5 = 163.75.

45th percentile is in position 17(45/100) = 7.65. Point is 42.2.

Percentile and Percentile Rank Calculations
x / x-th Percentile / y
45 / 43 / 43.0
0
0
Quartiles
1st Quartile / 31.5
Median / 51 / IQR / 131.25
3rd Quartile / 162.75

1-18. The mean is a central point that summarizes all the information in the data. It is sensitive to extreme observations. The median is a point "in the middle" of the data set and does not contain all the information in the set. It is resistant to extreme observations. The mode is the value that occurs most frequently.

1-19. Mean, median, mode(s) of the observations in Problem 1-13:

Mean = 126.64

Median = 128

Modes = 128, 134, 136 (all have 4 points)

Measures of Central tendency
Mean / 126.63636 / Median / 128 / Mode / 128
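
As a cross-check on Problem 1-19, the three measures can be recomputed with Python's standard `statistics` module. This is only a sketch and not the textbook's template; the variable names are illustrative. Note that `multimode` returns every value tied for the highest frequency, whereas the template line above reports a single mode.

```python
import statistics

# Ordered observations from Problem 1-13 (n = 33).
data_1_13 = [109, 110, 114, 116, 118, 119, 120, 121, 121, 123, 123,
             125, 125, 127, 128, 128, 128, 128, 129, 129, 130, 131,
             132, 132, 133, 134, 134, 134, 134, 136, 136, 136, 136]

mean = statistics.mean(data_1_13)        # arithmetic average
median = statistics.median(data_1_13)    # middle value of the ordered data
modes = statistics.multimode(data_1_13)  # all values tied for most frequent

print(round(mean, 5), median, modes)
```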

1-20.For the data of Problem 1-14:

Mean = 11.2533

Median = 11.6

Mode: none

1-21.For the data of Problem 1-15:

Mean = 66.955

Median = 70

Mode = 45

Measures of Central tendency
Mean / 66.954545 / Median / 70 / Mode / 45

1-22.For the data of Problem 1-16:

Mean = 3.467

Median = 3

Modes = 1 and 2

Measures of Central tendency
Mean / 3.4666667 / Median / 3 / Mode / 1

1-23.For the data of Problem 1-17:

Mean = 199.875

Median = 51

Mode: none

Measures of Central tendency
Mean / 199.875 / Median / 51 / Mode / #N/A

1-24.For the data of Example 1-1:

Mean = 163,260

Median = 166,800

Mode: none

1-25.(Using the template: “Basic Statistics.xls”, enter the data in column K.)

Basic Statistics from Raw Data

Measures of Central tendency
Mean / 21.75 / Median / 13 / Mode / 12

1-26.(Using the template: “Basic Statistics.xls”)

Mean = .0514

Median = 0.3

Outliers: none

1-27.Mean = 592.93

Median = 566

Std Dev = 117.03

QL = 546

QU = 618.75

Outliers: 940

Suspected Outlier: 399

1-28.Measures of variability tell us about the spread of our observations.

1-29. The most important measures of variability are the variance and its square root, the standard deviation. Both reflect all the information in the data set.

1-30.For a sample, we divide the sum of squared deviations from the mean by n – 1, rather than by n.

1-31.For the data of Problem 1-13, assumed a sample: Range = 136 – 109 = 27

Variance = 57.74 Standard deviation = 7.5986

If the data is of a
Sample / Population
Variance / 57.7386364 / 55.9889807
St. Dev. / 7.59859437 / 7.48257848

1-32.For the data of Problem 1-14: Range = 18 – (–1.2) = 19.2

Variance = 25.90 Standard deviation = 5.0896

1-33.For the data of Problem 1-15: Range = 98 – 38 = 60

Variance = 321.38 Standard deviation = 17.927

If the data is of a
Sample / Population
Variance / 321.378788 / 306.770661
St. Dev. / 17.9270407 / 17.5148697

1-34.For the data of Problem 1-16: Range = 7 – 1 = 6

Variance = 3.98 Standard deviation = 1.995

If the data is of a
Sample / Population
Variance / 3.98095238 / 3.71555556
St. Dev. / 1.99523241 / 1.92757764
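
The sample and population columns in these tables differ only in the divisor described in Problem 1-30 (n – 1 versus n). A minimal Python sketch using the Problem 1-16 data, with the standard `statistics` module and illustrative variable names:

```python
import statistics

# Ordered observations from Problem 1-16 (n = 15).
data_1_16 = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7]

n = len(data_1_16)
mean = sum(data_1_16) / n
sum_sq_dev = sum((x - mean) ** 2 for x in data_1_16)  # sum of squared deviations

sample_var = sum_sq_dev / (n - 1)   # divide by n - 1 for a sample
population_var = sum_sq_dev / n     # divide by n for a population

# The library functions apply the same two divisors.
assert abs(sample_var - statistics.variance(data_1_16)) < 1e-9
assert abs(population_var - statistics.pvariance(data_1_16)) < 1e-9

print(round(sample_var, 8), round(population_var, 8))
print(round(sample_var ** 0.5, 8), round(population_var ** 0.5, 8))
```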

1-35.For the data of Problem 1-17: Range = 1,209 – 23 = 1,186

Variance = 110,287.45 Standard deviation = 332.096

If the data is of a
Sample / Population
Variance / 110287.45 / 103394.484
St. Dev. / 332.095543 / 321.550127

1-36. Mean ± 2s = 126.636 ± 2(7.5986) = [111.44, 141.83]; this captures 31/33 of the data points, so Chebyshev's theorem holds. The data set is not mound-shaped, so the empirical rule does not apply.

1-37. Mean ± 2s = 11.2533 ± 2(5.0896) = [1.07, 21.43]; this captures 14/15 of the data points, so Chebyshev's theorem holds. The data set is not mound-shaped, so the empirical rule does not apply.

1-38. Mean ± 2s = 66.955 ± 2(17.927) = [31.10, 102.81]; this captures all the data points, so Chebyshev's theorem holds. The data set is not mound-shaped, so the empirical rule does not apply.

1-39. Mean ± 2s = 3.467 ± 2(1.995) = [–0.52, 7.46]; this captures all the data points, so Chebyshev's theorem holds. The data set is not mound-shaped, so the empirical rule does not apply.

1-40. Mean ± 2s = 199.875 ± 2(332.096) = [–464.32, 864.07]; this captures 15/16 of the data points, so Chebyshev's theorem holds. The data set is not mound-shaped, so the empirical rule does not apply.
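
The Chebyshev checks in Problems 1-36 through 1-40 can be sketched in Python. The example assumes k = 2 standard deviations, which is consistent with the captured fractions reported above, and uses the Problem 1-14 data; the function name is illustrative.

```python
import statistics

def chebyshev_check(data, k=2):
    """Count points within k sample standard deviations of the mean.

    Returns (points inside, total points, Chebyshev lower bound 1 - 1/k^2).
    """
    mean = statistics.mean(data)
    s = statistics.stdev(data)          # sample standard deviation (n - 1 divisor)
    low, high = mean - k * s, mean + k * s
    inside = sum(1 for x in data if low <= x <= high)
    return inside, len(data), 1 - 1 / k ** 2

# Ordered observations from Problem 1-14 (n = 15).
data_1_14 = [-1.2, 3.9, 8.3, 9, 9.5, 10, 11, 11.6, 12.5, 13,
             14.8, 15.5, 16.2, 16.7, 18]

inside, total, bound = chebyshev_check(data_1_14)
print(f"{inside}/{total} captured; Chebyshev requires at least {bound:.0%}")
```

The theorem holds whenever the captured fraction is at least 1 – 1/k², which is 75% for k = 2.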

1-41.

1-42.

1-43.

1-44.Mean = 0.917 Median = 0.85 Std dev = 0.4569

1-45.Mean = $18.53 Median = $15.93

1-46.

1-47. Using MINITAB

Stem / Leaves
4 / 5 / 5688
8 / 6 / 0123
14 / 6 / 677789
(9) / 7 / 002223334
11 / 7 / 55667889
3 / 8 / 224

1-48.

There are no outliers. Distribution is skewed to the left.

1-49.A stem-and-leaf display is a quickly drawn type of histogram useful in analyzing data. A box plot is a more advanced display useful in identifying outliers and the shape of the distribution of the data.

1-50. / Stem / Leaves
1 / 0 / 5
1 / 1
1 / 2
7 / 3 / 234578
(13) / 4 / 2234567788899
11 / 5 / 012235678
2 / 6 / 3
1 / 7 / 8

1-51.The data are narrowly and symmetrically concentrated near the median (IQR and the whisker lengths are small), not counting the two extreme outliers.

1-52.Wider dispersion in data set #2. Not much difference in the lower whiskers or lower hinges of the two data sets. The high value, 24, in data set #2 has a significant impact on the median, upper hinge and upper whisker values for data set #2 with respect to data set #1.

1-53.Mean = 127

Var = 137

sd = 11.705

mode = 127

outliers: TWA, Lufthansa

1-54.Stem-and-leaf of C2 N = 45

Leaf Unit = 1.0

f / Stem / Leaves
13 / 1 / 0011111223444
18 / 1 / 55689
(6) / 2 / 022333
21 / 2 / 567789
15 / 3 / 0122234
8 / 3 / 78
6 / 4 / 012
3 / 4 / 7
2 / 5 / 23

1-55.Outliers are detected by looking at the data set, constructing a box plot or stem-and-leaf display. An outlier should be analyzed for information content and not merely eliminated.

1-56.The median is the line inside the box. The hinges are the upper and lower quartiles. The inner fences are the two points at a distance of 1.5 (IQR) from the upper and lower quartiles. Outer fences are similar to the inner fences but at a distance of 3 (IQR). The box itself represents 50% of the data.
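
The fence definitions in Problem 1-56 translate directly into a short calculation. The sketch below applies them to the quartiles found by hand in Problem 1-13 (121 and 133.5), purely as an illustration; the function name is hypothetical.

```python
def fences(q1, q3):
    """Return (inner fences, outer fences) for a box plot, given the two hinges."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # 1.5(IQR) beyond the hinges
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # 3(IQR) beyond the hinges
    return inner, outer

# Lower and upper quartiles from the hand calculation in Problem 1-13.
inner, outer = fences(121, 133.5)
print(inner, outer)
```

Points between the inner and outer fences are usually flagged as suspected outliers, and points beyond the outer fences as outliers.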

1-57. / Mine A: / Mine B:
f / Stem / Leaves / f / Stem / Leaves
2 / 3 / 24 / 2 / 2 / 34
4 / 3 / 57 / 4 / 2 / 89
7 / 4 / 123 / 6 / 3 / 24
(5) / 4 / 55689 / 9 / 3 / 578
7 / 5 / 123 / (3) / 4 / 034
4 / 5 / 7 / 4 / 789
4 / 6 / 0 / 4 / 5 / 012
3 / 7 / 36 / 1 / 5 / 9
1 / 8 / 5

Values for Mine A are smaller than for Mine B, right-skewed, and there are three outliers. Values for Mine B are larger and the distribution is almost symmetric. There is larger variance in B.

1-58.No. One needs to use descriptive statistics and/or statistical inference.

1-59.

Comparing two data sets using Box Plots
Lower Whisker / Lower Hinge / Median / Upper Hinge / Upper Whisker
Shipments / 1.3 / 1.975 / 2.4 / 3.4 / 4.2
Market Share / 3.6 / 5.3 / 6.55 / 9.275 / 11.4
[Box plots: Shipments, Market Share]

1-60.Mean = 5.785 median = 5.782

The mean is impacted by the high rate of fatalities for the very small car classification.

1-61.Answers will vary.

a) If we add the value “5” to all the data points, then the average, median, mode, first quartile, third quartile and 80th percentile values will change by “5”. There is no change in the variance, standard deviation, skewness, kurtosis, range and interquartile range values.

b) Average: if we add “5” to all the data points, then the sum of all the numbers will increase by “5*n”, where n is the number of data points. The sum is divided by n to get the average. So 5*n / n = 5: the average will increase by “5”.

Median: If we add “5” to all the data points, the median value will still be the midway point in the ordered array. Its value will also increase by “5”

Mode: Adding “5” to all the data points changes the number that occurs most frequently by “5”

First Quartile: adding “5” to all the data points does not change the location of the first quartile in the ordered array of numbers, which is: (.25)(n+1) where n is the number of data points. Whether the first quartile falls on a specific data point or between two data points, the resulting value will have been increased by “5”.

Third Quartile: adding “5” to all the data points does not change the location of the third quartile in the ordered array of numbers, which is: (.75)(n+1) where n is the number of data points. Whether the third quartile falls on a specific data point or between two data points, the resulting value will have been increased by “5”.

80th percentile: adding “5” to all the data points has the same effect as in the calculation of the first or third quartile. The value will be increased by “5”

Range: adding “5” to all the data points will have no effect on the calculation of the range. Since both the highest value and the lowest value have been increased by the same number, the subtraction of the lowest value from the highest value still yields the same value for the range.

Variance: adding “5” to all the data points has no effect on the calculation of the variance. Since each data point is increased by “5” and the average has also been shown to increase by the same amount, the differences between each individual new data point and the new average will not change. Squaring the differences, summing the squared differences and dividing by the number of data points (or by n – 1 for a sample) therefore gives the same result.

Standard Deviation: since the variance is not affected by adding “5” to each data point, neither is the standard deviation.

Skewness: Since each data point is increased by “5” and the average has also been shown to increase by the same amount, the differences between each individual new data point and the new average will not change. Therefore, the numerator in the formula for skewness is not affected. Since the standard deviation (the denominator) is not affected either, there is no change in the value for skewness.

Kurtosis: Since each data point is increased by “5” and the average has also been shown to increase by the same amount, the differences between each individual new data point and the new average will not change. Therefore, the numerator in the formula for kurtosis is not affected. Since the standard deviation (the denominator) is not affected either, there is no change in the value for kurtosis.

Interquartile Range: given that both the first quartile and the third quartile increased by the same amount, “5”, the difference between the two values remains the same.

c) Multiplying each data point by a factor “3” results in the following changes. The mean, median, mode, first quartile, third quartile and 80th percentile values will be increased by the same factor “3”. In addition, the standard deviation and the range will also increase by the same factor “3”. The variance will increase by the factor squared, and the skewness and kurtosis values will remain unchanged.

d) Multiplying all data points by a factor “3” and adding a value “5” to each data point has the following results. The order of operation is first to multiply each data point and then add a value to each data point. Each data point is first multiplied by the factor “3” and then the value “5” is added to each newly multiplied data point. Multiplying each data point by the factor “3” yields the results listed in c). Adding a value 5 to the newly multiplied data points yields the results listed in a).
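
The effects of adding “5” and multiplying by “3” described in this answer can be checked numerically. The sketch below uses the Problem 1-16 data as an arbitrary example; the helper name `summary` is illustrative.

```python
import statistics

def summary(values):
    """Return the measures discussed above for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),
        "stdev": statistics.stdev(values),
        "range": max(values) - min(values),
    }

# Ordered observations from Problem 1-16, used here only as example data.
data = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7]

original = summary(data)
shifted = summary([x + 5 for x in data])   # add 5 to every point
scaled = summary([3 * x for x in data])    # multiply every point by 3

for name in original:
    print(name, original[name], shifted[name], scaled[name])
```

Adding a constant moves the measures of location and leaves the measures of spread alone; multiplying by a constant scales location, range and standard deviation by that constant and the variance by its square.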

1-62. s = 13.944, s² = 194.43

1-63. = 504.688 = 94.547

Measures of Central tendency
Mean / 504.6875 / Median / 501.5 / Mode / #N/A
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 9227.5121 / 8939.15234 / Range / 346
St. Dev. / 96.0599401 / 94.5470906 / IQR / 149.5

1-64.

Step 1: Enter the data from problem 1-63 into cells Y4:Y35 of the template: Histogram.xls from Chapter 1. The template will order the data automatically.

Step 2: We need to select a starting point for the first class, an ending point for the last class, and a class interval width. The starting point of the first class should be a value less than the smallest value in the data set. The smallest value in the data set is 344, so you would want to set the first class to start with a value smaller than 344. Let’s use 320. We also selected 710 as the ending value of the last class, and selected 50 as the interval width. The data input column and the histogram output from the template are presented below. The end-point for each class is included in that class; i.e., the first class of data goes from more than 320 up to and including 370, the second class starts with more than 370 up to and including 420, etc.

1-65.Range: 690 – 344 = 346

90th percentile lies in position: 33(90/100) = 29.7. It is 632.7.

First quartile lies in position: 33(25/100) = 8.25. It is 419.25.

Median lies in position: 33(50/100) = 16.5. It is 501.5.

Third quartile lies in position: 33(75/100) = 24.75. It is 585.75.

1-66.

1-67. / Stem / Leaves
2 / 1 / 24
7 / 1 / 56789
(3) / 2 / 023
6 / 2 / 55
4 / 3 / 24
2 / 3
2 / 4 / 01

1-68.

The data is skewed to the right.

1-69. / Stem / Leaves
3 / 1 / 012
4 / 1 / 9
12 / 2 / 1122334
(9) / 2 / 556677889
6 / 3 / 024
3 / 3 / 57
1 / 4
1 / 4
1 / 5
1 / 5
1 / 6 / 2

The data is skewed to the right, with one extreme outlier (62) and three suspected outliers (10, 11, 12).

1-70.

1-71.Mean = 25.857 sd = 9.651

1-72. Mean = 18.875

var = 38.65

outliers: none

1-73.Mean = 33.271

sd = 16.945

var = 287.15

QL = 25.41

Med = 26.71

QU = 35

Outliers: Morgan Stanley (91.36%)

1-74.Mean = 3.18

sd = 1.348

var = 1.817

QL = 1.975

Med = 2.95

QU = 3.675

Outliers: 8.70

1-75.

IQR = 3.5
data is right-skewed
9.5 is more likely to be the mode, since the data is right-skewed
Will not affect the plot.

1-76.Bar graph showing changes over time. Both the employee’s out-of-pocket and payroll deduction expenses have increased substantially over the last three years.

1-77.Mean (billions of tons) = 1.439

Mean (per capita tons) = 9.98

The mathematical computation for both averages is the same; however, they differ in meaning. On average, the countries listed emit 1.439 billion tons of carbon dioxide each, whereas the average emissions per person are 9.98 tons. Dividing billions of tons by the rate per capita for the US, we get a population estimate of 256 million people, which is close to the actual population for 1997.

1-78.Mean = 2.75

sd = 14.44

var = 208.59

QL = 5.075

Med = 7.9

QU = 13.675

Outliers: –30.2

1-79.

Mean = 10301.05

sd = 16.916

var = 286.155

(Using the template: “Basic Statistics.xls”)

Measures of Central tendency
Mean / 10301.05 / Median / 10300.5 / Mode / 10300
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 286.155263 / 271.8475 / Range / 54
St. Dev. / 16.9161244 / 16.4877985 / IQR / 16.25

1-80.Mean = 99.039

sd = .4366

var = .1907

Median = 99.155

1-81.Mean = 17.587

sd = .466

var = .2172

Measures of Central tendency
Mean / 17.5875 / Median / 17.5 / Mode / 18.3
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 0.21716667 / 0.20359375 / Range / 1.4
St. Dev. / 0.46601144 / 0.45121364 / IQR / 0.75

1-82.Mean = 29.018

sd = 4.611

(Using the template: “Basic Statistics.xls”)

Measures of Central tendency
Mean / 29.018 / Median / 29.75 / Mode / #N/A
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 21.26552 / 17.012416 / Range / 12.38
St. Dev. / 4.6114553 / 4.12461101 / IQR / 2.92

1-83.Mean = 4.8394

sd = .08

Median = 4.86

1-84.Stock Prices for period: April, 2001 through June, 2001 [Answers will vary due to dates used.]

a). Mean and Standard Deviation for Wal-Mart

Basic Statistics from Raw Data / Stock Prices: Wal-Mart
Measures of Central tendency
Mean / 51.041478 / Median / 51.1266 / Mode / 50.158
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 2.25711298 / 2.22128579 / Range / 6.1911
St. Dev. / 1.50236912 / 1.49039786 / IQR / 1.9613
Higher Moments
If the data is of a
Sample / Population
Skewness / 0.07083784 / 0.06913994
(Relative) Kurtosis / -0.711512 / -0.7500338

b). Mean and Standard Deviation for K-Mart

Basic Statistics from Raw Data / Stock Prices: K-Mart
Measures of Central tendency
Mean / 10.450952 / Median / 10.66 / Mode / 11.8
Measures of Dispersion
If the data is of a
Sample / Population
Variance / 0.9852023 / 0.96956417 / Range / 3.51
St. Dev. / 0.99257358 / 0.9846645 / IQR / 1.955
Higher Moments
If the data is of a
Sample / Population
Skewness / -0.4070262 / -0.3972703
(Relative) Kurtosis / -1.132009 / -1.1378913

c). Coefficient of variation:

CV = std. dev. / mean

Considering the data as a population:

Wal-Mart: CV = 1.49039786 / 51.041478 = 0.0292

K-Mart: CV = 0.9846645 / 10.450952 = 0.0942

Considering the data as a sample:

Wal-Mart: CV = 1.50236912 / 51.041478 = 0.02943

K-Mart: CV = 0.99257358 / 10.450952 = 0.09497

d). There is a greater degree of risk in the stock prices for K-Mart than for Wal-Mart over this three-month period.

e). For DJIA

considering the data as a population:

CV = 427.913791 / 10681.11 = 0.04006

considering the data as a sample:

CV = 431.350905 / 10681.11 = 0.04038
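
The coefficient-of-variation figures in parts c) and e) can be reproduced with a few lines of Python. The standard deviations and means are copied from the answers above; the function name and the dictionary layout are illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a fraction of the mean."""
    return std_dev / mean

# (population std. dev., sample std. dev., mean) as reported in this answer.
stocks = {
    "Wal-Mart": (1.49039786, 1.50236912, 51.041478),
    "K-Mart": (0.9846645, 0.99257358, 10.450952),
    "DJIA": (427.913791, 431.350905, 10681.11),
}

cv = {}
for name, (pop_sd, sample_sd, mean) in stocks.items():
    cv[name] = (coefficient_of_variation(pop_sd, mean),
                coefficient_of_variation(sample_sd, mean))
    print(f"{name}: population CV = {cv[name][0]:.5f}, sample CV = {cv[name][1]:.5f}")
```

Because the CV is unit-free, it allows the variability of series with very different price levels to be compared, as in part d).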