Descriptive Statistics

Stat 11

January 24, 2008

Descriptive Statistics

(single variable)

Notation

n = number of items in the sample

$x_1, x_2, \ldots, x_n$ = values in the sample (or use y, z, etc., all lower case)

$x_i$ = i-th value (where i = 1, …, n)

α (Greek “alpha”) – some number between 0 and 1, usually near 0,

representing, for example, a fraction of observations

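$x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ = “order statistics” – same values as $x_1, \ldots, x_n$, but sorted in increasing order. For example, $x_{(n)}$ is the largest of the numbers $x_1, \ldots, x_n$. Always: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$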
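
A minimal Python sketch of the order statistics, writing the k-th smallest value as x_(k). The sample values are made up for illustration. Python lists count from 0, so x_(k) corresponds to index k - 1.

```python
# Hypothetical sample; the values are made up for illustration.
x = [7, 3, 9, 4, 3]
n = len(x)

# Order statistics: the same values, sorted in increasing order.
order_stats = sorted(x)

# Python lists are 0-indexed, so x_(k) in the notes is order_stats[k - 1].
smallest = order_stats[0]      # x_(1)
largest = order_stats[n - 1]   # x_(n)

# The "always" property: each order statistic is <= the next one.
is_nondecreasing = all(
    order_stats[k] <= order_stats[k + 1] for k in range(n - 1)
)
```

Ties are allowed: here the value 3 appears twice, so x_(1) = x_(2).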

Graphical representations

Dot plot

Histogram

Bars should normally have equal width. If not, then be sure

that equal areas represent equal quantities (“equal area principle”).

This means that the vertical axis has units of “number of

observations per x-axis unit.”
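
The equal area principle can be sketched in Python as follows. The data and bin edges are hypothetical, and the rule for which bin gets a value that falls exactly on an edge is an assumption (left-closed bins, with the last bin also including its right edge).

```python
# Hypothetical data and bin edges, made up for illustration.
x = [1, 2, 2, 3, 4, 6, 7, 12, 15, 18]
edges = [0, 5, 10, 20]   # the last bin is twice as wide as the others

bins = list(zip(edges[:-1], edges[1:]))

heights = []
for lo, hi in bins:
    is_last = hi == edges[-1]
    # Assumed convention: bins are [lo, hi), except the last, which is [lo, hi].
    count = sum(1 for v in x if lo <= v < hi or (is_last and v == hi))
    # Bar height = number of observations per x-axis unit.
    heights.append(count / (hi - lo))

# Bar area (height * width) recovers the number of observations in the bin.
areas = [h * (hi - lo) for h, (lo, hi) in zip(heights, bins)]
```

The wide bin holds 3 observations, more than the middle bin's 2, yet its bar is shorter (0.3 versus 0.4 per unit), so it is the areas, not the heights, that compare the counts.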

Box plot (figures not reproduced: the conventional form, or the form per Tufte)

Measures of Location

Mean = Average = Arithmetic mean (AM) = $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Other kinds of means:

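Geometric mean (GM) = $\left(\prod_{i=1}^{n} x_i\right)^{1/n} = \sqrt[n]{x_1 x_2 \cdots x_n}$ (interesting only if all $x_i$’s are positive)

Harmonic mean (HM) = inverse of average of inverses

= $\left(\frac{1}{n}\sum_{i=1}^{n} \frac{1}{x_i}\right)^{-1}$

Root-mean-square (RMS) = $\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}$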

Grammatically challenging. We’ll sometimes call it the RMS mean.
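
The four means can be compared on one sample with a short Python sketch. The sample is hypothetical and deliberately all positive, so that the geometric and harmonic means make sense. `math.prod` requires Python 3.8 or later.

```python
import math

# Hypothetical positive sample, made up for illustration.
x = [1.0, 2.0, 4.0, 8.0]
n = len(x)

am = sum(x) / n                              # arithmetic mean
gm = math.prod(x) ** (1 / n)                 # geometric mean: n-th root of the product
hm = n / sum(1 / v for v in x)               # harmonic mean: inverse of average of inverses
rms = math.sqrt(sum(v * v for v in x) / n)   # root-mean-square

print(hm, gm, am, rms)
```

For positive values the four always come out in the order HM ≤ GM ≤ AM ≤ RMS, with equality only when all the values are the same.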

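Median: $x_{0.5}$ = $x_{((n+1)/2)}$ if n is odd, or $\frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right)$ if n is even (where $x_{(k)}$ is the k-th smallest value)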

Midmean: average of the middle half

(exclude largest 25% and smallest 25%; report the average of the rest)

Trimmed mean: (really, “α-trimmed mean”)

= average of remaining values, after the lowest αn values

and the highest αn values are removed

(sensible only if α < 1/2.)
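
A Python sketch of the median, midmean, and trimmed mean. The function names are hypothetical, and one detail is an assumption: when αn is not a whole number it is rounded down, since the notes do not say how to handle that case.

```python
def median(x):
    """Middle value of the sorted sample (average of the two middle values if n is even)."""
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def trimmed_mean(x, alpha):
    """Average after removing the lowest and highest alpha*n values.

    Assumption: alpha*n is rounded down to a whole number of values.
    """
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 1/2")
    s = sorted(x)
    k = int(alpha * len(s))          # number of values dropped from each end
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


def midmean(x):
    """Average of the middle half: the 0.25-trimmed mean."""
    return trimmed_mean(x, 0.25)


# Hypothetical sample with one large outlier.
x = [2, 3, 5, 7, 11, 13, 17, 100]
print(sum(x) / len(x))   # ordinary mean, pulled up by the outlier: 19.75
print(median(x))         # 9.0
print(midmean(x))        # 9.0
```

The outlier 100 moves the ordinary mean a long way but leaves the median and midmean untouched, which is the usual reason for trimming.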

Weighted averages:

weighted average = $\sum_{i=1}^{n} w_i x_i$

where each $w_i \ge 0$ and $\sum_{i=1}^{n} w_i = 1$.
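
A small Python sketch of a weighted average, checking that the weights are nonnegative and sum to 1. The function name, the tolerance, and the numbers are all hypothetical.

```python
def weighted_average(x, w, tol=1e-9):
    """Sum of w_i * x_i, with nonnegative weights that sum to 1."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    if any(wi < 0 for wi in w):
        raise ValueError("weights must be nonnegative")
    if abs(sum(w) - 1) > tol:
        raise ValueError("weights must sum to 1")
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical example: three scores weighted 50%, 30%, 20%.
scores = [80, 90, 70]
weights = [0.5, 0.3, 0.2]
result = weighted_average(scores, weights)

# With equal weights 1/n, the weighted average is the ordinary mean.
equal = weighted_average(scores, [1 / 3] * 3)
```

Here `result` is 81 up to floating-point rounding (40 + 27 + 14), and `equal` matches the ordinary mean of 80.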

Percentiles

Percentiles (= quantiles = fractiles):

$x_\alpha$ = value such that fraction α of observations are ≤ $x_\alpha$

and fraction 1 – α of observations are ≥ $x_\alpha$

(not well defined for certain values of α)

Quartiles:

Q1 = lower quartile, same as $x_{0.25}$

Q3 = upper quartile, same as $x_{0.75}$

Extremes:

$x_{\min}$ = minimum value among $x_1, \ldots, x_n$

$x_{\max}$ = maximum value among $x_1, \ldots, x_n$

Five-number summary:

minimum value, Q1, median, Q3, maximum value.

(Not quite standard. Some would use some extreme percentiles in place

of the minimum and maximum, especially in large data sets prone to outliers)
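
One way to compute percentiles and the five-number summary in Python is sketched below. Because the percentile is not well defined for some α, a convention is needed. The one used here is an assumption, not something these notes prescribe: when αn is a whole number k, any value between the k-th and (k+1)-th smallest observations fits the definition, and the sketch takes their midpoint. Other software uses other conventions and may give slightly different quartiles. The function names are hypothetical.

```python
import math


def percentile(x, alpha):
    """Value with at least fraction alpha of observations <= it
    and at least fraction 1 - alpha of observations >= it."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    s = sorted(x)
    n = len(s)
    pos = alpha * n
    k = round(pos)
    if abs(pos - k) < 1e-9 and 1 <= k <= n - 1:
        # alpha*n is a whole number: the definition allows a whole interval.
        # Assumed convention: take the midpoint of that interval.
        return (s[k - 1] + s[k]) / 2
    # Otherwise exactly one observation fits the definition.
    return s[math.ceil(pos) - 1]


def five_number_summary(x):
    """Minimum, Q1, median, Q3, maximum."""
    return (min(x), percentile(x, 0.25), percentile(x, 0.5),
            percentile(x, 0.75), max(x))


# Hypothetical sample of 9 values.
x = [2, 3, 5, 7, 11, 13, 17, 19, 23]
print(five_number_summary(x))   # (2, 5, 11, 17, 23)
```

With n = 9 none of the three quartile positions is a whole number, so each quartile is a single observation. With n = 8 all three fall between observations and the midpoint convention is used.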


Measures of Dispersion

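Variance (or “population variance,” VARP( ) in Excel):

variance = $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$ = “mean square deviation” (where $\bar{x}$ is the mean)

Standard deviation (or “population standard deviation,” STDEVP( ) in Excel):

standard deviation = $\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$ = “root-mean-square deviation”

(“standard” often means root-mean-square in statistics,

so “standard deviation” is a formula as well as a name)

Sample variance (VAR( ) in Excel): $\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Sample standard deviation (STDEV( ) in Excel): $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$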
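
The difference between the population and sample versions (divide by n versus n - 1) can be checked with a short Python sketch. The sample is hypothetical. The standard library's `statistics` module provides both versions, which gives an independent check of the hand computation.

```python
import math
import statistics

# Hypothetical sample, made up for illustration.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean = sum(x) / n

# Sum of squared deviations from the mean.
ss = sum((v - mean) ** 2 for v in x)

pop_var = ss / n                 # "mean square deviation" (Excel VARP)
pop_sd = math.sqrt(pop_var)      # "root-mean-square deviation" (Excel STDEVP)
samp_var = ss / (n - 1)          # sample variance (Excel VAR)
samp_sd = math.sqrt(samp_var)    # sample standard deviation (Excel STDEV)

# The standard library agrees with the hand computation.
assert math.isclose(pop_var, statistics.pvariance(x))
assert math.isclose(pop_sd, statistics.pstdev(x))
assert math.isclose(samp_var, statistics.variance(x))
assert math.isclose(samp_sd, statistics.stdev(x))
```

For this sample the mean is 5, the population variance is 4, and the population standard deviation is 2. The sample versions are a little larger because they divide by 7 instead of 8.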

Range: $x_{\max} - x_{\min}$ (largest value minus smallest)

(not really standard usage; some use the word

“range” to refer to the pair $x_{\min}$, $x_{\max}$.)

Interquartile range (IQR): Q3 – Q1 = $x_{0.75} - x_{0.25}$

(not well defined if n is a multiple of 4)

Mean absolute deviation:

mean absolute deviation = $\frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$ (where $\bar{x}$ is the mean)
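
A Python sketch of the range and the mean absolute deviation, on a hypothetical sample. The deviations are measured from the mean, which is an assumption about the intended definition (some authors measure them from the median instead).

```python
import math

# Hypothetical sample, made up for illustration.
x = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(x)
mean = sum(x) / n

# Range: largest value minus smallest.
value_range = max(x) - min(x)

# Mean absolute deviation: average distance from the mean (assumed definition).
mad = sum(abs(v - mean) for v in x) / n

# For comparison: the root-mean-square deviation (population standard deviation).
rms_dev = math.sqrt(sum((v - mean) ** 2 for v in x) / n)

print(value_range, mad, rms_dev)   # 7 1.5 2.0
```

The mean absolute deviation never exceeds the root-mean-square deviation, because an RMS is always at least as large as the ordinary average of the same nonnegative numbers.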
