OK. A little better than in '06, but still lacking. Too much meandering all over the place.

What I think needs to be done is to spend more time on the normal distribution—show a geophysical data set and how it in fact has a normal distribution, then do statistical tests with that data set.

I also skipped over a lot of the Expected Value stuff and deviated to a discussion on degrees of freedom.

11:628:452 16:712:615

Class web site

http://marine.rutgers.edu/dmcs/ms615/

Syllabus

Course Outline

Lecture notes (to come)

Data sets (to come)

Student presentations

The Book—

Only using 2 chapters from this $100 book. Did not order it from the book store—if you want to buy it, try Amazon (~$90). You could probably get by without it.

Grading

50% Homework

50% Project

Three components of Final project

1)  Oral Presentation

2)  Detailed description of data analysis techniques

3)  Results paper in the format of a Geophysical Research Letters publication (~10 pages double spaced and 4 figures)

MATLAB??

Does everyone have access to it?

If it's new to you, check it out before the first homework:

http://www.mathworks.com/academia/

and click on MATLAB & Simulink Tutorials for training.

Between this and just trying stuff out you’ll get the hang of it.

Today: some statistics and terminology. Future lectures will have clearer applications.

“Lies, Damned Lies and Statistics”

The great thing about this comment is that we don't know who said it. Mark Twain used it in his autobiography, published in 1924, but it has been attributed to the British statesman Benjamin Disraeli (1804-1881) and to the radical journalist and politician Henry Labouchere (1831-1912).

Basic Statistical Concepts & Terminology

Fundamentally, all physical processes in the ocean and atmosphere are deterministic, for they are governed by basic laws of physics that can be characterized by a set of PDEs. (Chemistry and biology may have some random component to them—or do they? Einstein once said, "God does not play dice.")

However, even the system of equations governing the watery bits of the earth leads to deterministic chaos, which can have statistics identical to those of a random process. While some of the signals inherent in geophysical data are periodic and clearly deterministic—such as tidal motion—many signals have a more complex evolution, such as the North Atlantic Oscillation, which may be modeled as some damped/forced oscillator—but we do not know its detailed physics.

Random variable—an event that could have any outcome from a set of possible values; each outcome represents a sample of a population.

Examples

·  Flip a coin

·  Roll a die

·  Measure the grain size of a piece of sand

·  Measure the distribution of grain sizes from a sediment core

Population—all potential measurements (the real object of interest)

In the case of a die it would be the numbers 1-6.

Sample—What’s actually has been measured

If we roll the dice enough times we should be able to estimate what the population is and how frequently they occur. Also, if we know the true population (1-6) we should be able to estimate the odds, for example, of throwing the dice 30 times and getting no sixes.

The science of statistics is about drawing inferences about the population from the sample.

Numbers calculated from the sample are the statistics (mean, variance).

(Note that these are discrete measures—there are analogs for continuous variables—but since all geophysical data are discrete I'll leave it at this for now.)

One method statisticians have developed to measure random variables is essentially to make measurements at random. And while biologists often do this (for example, the stratified random sampling methodologies often used in ecological surveys), we in the geophysical field tend to make measurements systematically—often in rapid sequence.

Consider measuring the temperature at 1-minute intervals—clearly this would not be a random measurement, since the temperature tends to be fairly constant over this time (aside from microscale turbulent fluctuations). Besides—it'd be pretty stupid to make random measurements of temperature, for it would be hard to physically interpret them. Rather, we want to make systematic measurements.

However, statistical methods are based on data sets collected in some random fashion, so when applying them to data that were largely collected non-randomly one must take this into account. We use the systematically collected data to describe trends, while the variability of the measurements about those trends (due to both sampling error and natural variability in the system) can be treated as a random variable.

This class will explore methods to characterize trends in the data, including how data vary in space and time. We must also make error estimates, which rest on various assumptions about the statistics of the data (many of which we are not sure are correct!). To make error estimates we must quantify the variability in the data.

Types of Data

1) Deterministic Data

a) Periodic Deterministic Data

Tidal motion is an example of deterministic data—or, more generally,

any periodic data can be described as deterministic if f(t) = f(t ± nT), where T is the period of the function and n is an integer.

Complex periodic data—multiple frequencies, but the ratios of all the periods of oscillation are rational numbers.

(Recall that a rational number can be expressed as a fraction m/n.)

b) Non-periodic Deterministic Data

Almost periodic: if the ratios of all the periods of oscillation are rational numbers the time series is periodic—if not, it is almost periodic. Such a signal can be expressed as a sum of periodic functions, but the sum itself is not periodic.

North Atlantic Oscillation?

c) Transient

z = e^(-rt)

Earthquake, Tsunami, our life (maybe?)

2) Random Data

Random data cannot be described by an explicit mathematical function and are instead characterized by statistics. A single time history of a random (stochastic) process is called a sample function X(t), and a finite portion of this time history is called a sample record.

Sample records can be either stationary or non-stationary. Data are non-stationary if their statistics, such as the mean value or the variance, change with time.

When, in addition, the statistics computed from time averages over a single sample record equal the ensemble statistics, the process is referred to as an ergodic random process (an ergodic process is necessarily stationary).

Note that systems that are deterministic but chaotic, like the coupled ocean/atmosphere system, can appear as a stochastic process because of their chaotic nature.

The statistics of a chaotic process, while deterministic, may appear random. In the end, flipping a coin is really a deterministic process: the result (heads or tails) depends on the details of the toss, the air motion, the bounce, etc. However, since we can't control these details with our cumbersome flipping, the result is in fact random.

Most oceanographic/ecological/economic/etc. population parameters are unknown and must be estimated from the sample.

Therefore the estimator needs to be:

·  unbiased

·  efficient (small variance)

·  consistent

Some More Statistical Terminology (we’ve already defined population and sample)

Expected Value

Use normal distribution as an example of this.

If the expected value of a statistic is equal to the value of the population parameter, then we say that the statistic is an unbiased estimate of that parameter.

Consistency

Suppose parameter a (the thing that we want to know!) is estimated by statistic A from a sample of size N. If A goes to a as N goes to infinity, we say the estimator is consistent.

Efficiency

The statistic A is an efficient estimate of a if its variance is small.

Think of scattershot: the smaller the variance, the better the efficiency—though the estimate may still be biased.

(Draw figure)

Unbiased and efficient correspond to accuracy and precision.

If you can only have one, efficiency (precision) is probably better, for you could probably correct for the bias.

For example, consider a sharpshooter. If the sharpshooter knew that her sight was misaligned, she could adjust her aim and compensate for the bias. However, if all you have is a shaky hand, the only thing you can do is take a lot of shots and average them to obtain a good measurement.

Sometimes this is what we need to do in geophysical science—because many of our signals are very noisy and thus have a large variance.

Probability, Probability Distribution, and Probability Density Function

These characterize the probability that an event occurs. In discrete space the function is called a probability mass function, while in continuous space it is called a probability density function. The most common one (and the one that we will mostly assume holds for our data) is the normal distribution.

Normal Distribution

Obviously we always work with limited data sets, but we must use this limited sample to make estimates of the true statistical properties of the system (population).

Sometimes we're left with one or two data points—we can't do much statistics with that—but often it doesn't stop us from trying.

A common assumption made by statisticians is that the population has a normal or Gaussian distribution; this is the most common PDF.

p(x) = 1/(σ·sqrt(2π)) · e^(-(x-μ)²/(2σ²))

μ — mean value

σ — standard deviation (the square root of the variance)

and the probability of an event falling between a and b is

P(a ≤ x ≤ b) = ∫ from a to b of p(x) dx    (3.5.3)

The importance of the normal distribution is that

1) Much of the natural variability is observed to be approximated by the normal distribution. A good example is the variability associated with Fickian diffusion.

2) It is mathematically simple (we can compare data with the normal distribution simply by calculating the mean and variance).

3) Averages taken randomly from a non-normal population tend to follow a normal distribution more closely than the original population.

Consider the example of rolling dice. The original population has a flat distribution. If we take a number of samples—each consisting of one roll of the die—we get back the flat distribution of the population. If however we average two dice, we get a triangular distribution—and as we average more and more dice, the distribution of the mean approaches a normal distribution (see the sketch below).

Three simple statistical descriptions of the data are:

·  the histogram

·  the sample mean

·  the sample variance / population variance

68% of the data fall within one standard deviation

95% within two

99.7% within three

Since a closed form of the integral does not exist, it must be evaluated with a table—or with the MATLAB error function erf, which is not exactly the cumulative probability distribution: erf(z) gives twice the integral of the normal distribution from 0 to z·sqrt(2), rather than from -infinity to z. By substituting z′ = z/sqrt(2) one can compute the cumulative probability.

The normal distribution for zero mean and unit variance is

p(z) = 1/sqrt(2π) · e^(-z²/2)

Introduce the dummy variable

z′ = z/sqrt(2), so dz = sqrt(2)·dz′

Then ∫ from 0 to a of p(z) dz = 1/sqrt(π) ∫ from 0 to a/sqrt(2) of e^(-z′²) dz′ = 1/2 of the error function.

But since erf only integrates from zero—and the normal distribution is symmetric—doubling gives the integral from -a to a, so P(-a ≤ z ≤ a) = erf(a/sqrt(2)).

Draw on board

Lognormal Distribution

Often data may not be normally distributed—but the log of the data may be. An example is the distribution of natural grain sizes of sediment: plotted in millimeters the distribution is skewed toward smaller values, but plotted in phi units (phi = -log2 d, with d the diameter in mm) it is approximately normal.

Is the data normally distributed?

Compare histogram of data vs normal distribution

First normalize your data to Z:

Z = (x - μ)/σ    (3.5.4)

which has a mean of 0 and a standard deviation of 1, and compare the histogram of your data with the Gaussian distribution (later we can test how good this fit is).

use the MATLAB function hist

The distribution curve of Z is called a probability density function because

1)  z & x are continuous (discrete is called probability mass function)

2)  Area under curve is one

3)  Area to left of any ordinate z is the probability of obtaining a value of z or less

It is symmetric, so mean = mode = median = 0 and SD = 1; the inflection points lie one SD away from the mean.

Gamma PDF (of which the Chi-squared distribution is a special case)

Suppose we had a theoretical model of a population that we can use to predict the frequencies by class in a random sample of size N. In the dice example, each number will have expected frequency N/6. We can then use the Chi-squared distribution to predict the expected magnitude of the observed deviations from the theory.

Essentially it is a measure of the variance—and thus it is always positive. The formal mathematical description of the chi-squared distribution is quite complex—as is the theory behind it (which I do not understand)—but with data collected from a known population we can calculate the chi-squared statistic and thus estimate the expected deviations from theory as a function of sample size (degrees of freedom).

χ² = Σ (fi - f̂i)² / f̂i

where fi are the observed frequencies and f̂i are the predicted ones. It has one parameter, called the degrees of freedom, which is the number of independent observations. This can be easy to determine in some data sets—and not so easy in others (we'll talk about this later). This leads to hypothesis testing.

Hypothesis testing

Are we dealing with loaded dice?

Suppose 30 rolls yield six each of the numbers 1-5 but no sixes. The expected frequency for each face is 30/6 = 5, so χ² = 5·(6-5)²/5 + (0-5)²/5 = 1 + 5 = 6.

The degrees of freedom in this case is 5: there are six possible outcomes (1-6), but with 30 rolls, once you've been told the counts for 1-5 the count for 6 is determined. Thus the degrees of freedom is, in general, the number of possible outcomes minus 1.

Looking at the chi-squared table we find this yields a probability of about 30%, meaning there is nearly a one-in-three chance of getting such a result with unloaded dice. It may look suspicious, but it is certainly within the statistical odds.

However, if you rolled the die 60 times and got the same proportions (twelve each of 1-5 but no sixes), then χ² = 5·(12-10)²/10 + (0-10)²/10 = 12, which corresponds to a probability of only about 3.5%—below the usual 5% significance level, and very unlikely with fair dice.

More hypothesis testing following discussion of covariance and Fourier analysis.

ADDENDUM

Did not mention the standard error. The standard error of the mean is

SE = std/sqrt(N)

and 95% error bars are simply ±1.96·SE.

What is 1.96?

That is simply how many standard deviations are needed to contain 95% of the area in a normal distribution.

You can check this by typing erf(1.96/sqrt(2))
