H1: The Art and Science of Learning from Data

Glossary of social statistics

How can we investigate using data?

Statistics is the art and science of designing studies and analyzing the data that those studies produce. Its ultimate goal is translating data into knowledge and understanding of the world around us. In short, statistics is the art and science of learning from data.

Why use statistical methods?
- Design: planning how to obtain data to answer the questions of interest.
- Description: summarizing the data that are obtained.
- Inference: making decisions and predictions based on the data.

Variable = the characteristic being measured, such as the number of hours per day that you watch TV.

We learn about populations using samples

Subjects = the entities that we measure in a study.

The population is the total set of subjects in which we are interested. A sample is the subset of the population for whom we have (or plan to have) data. In practice, we usually have data for only some of the subjects who belong to that population.

Descriptive statistics refers to methods for summarizing the data. The summaries usually consist of graphs and numbers such as averages and percentages.
A descriptive statistical analysis usually combines graphical and numerical summaries, for instance a bar graph.

The main purpose of descriptive statistics is to reduce the data to simple summaries without distorting or losing much information.

Inferential statistics refers to the methods of making decisions or predictions about a population, based on data obtained from a sample of that population.

In most surveys, we have data for a sample, not for the entire population. We use descriptive statistics to summarize the sample and inferential statistics to make predictions about the population.

An important aspect of statistical inference involves reporting the likely precision of a prediction.

For example: how close is the sample value of 54% likely to be to the true percentage of the population favoring gun control?

=> This is the so-called ‘margin of error’.

Sample statistics and population parameters:
A parameter is a numerical summary of the population. A statistic is a numerical summary of a sample taken from the population.
Important: random sampling!

Most studies design experiments or surveys to collect data. Often, though, it is adequate to take advantage of existing archived collections of data files, called databases.

An applet is a short application program for performing a specific task.

H2: Exploring Data with Graphs and Numerical Summaries

What are the types of data?

The characteristics observed to address the questions posed in a study are called variables.

The data values that we observe for a variable are referred to as the observations.

Each observation can be numerical, such as a number of centimeters, or it can be a category, such as ‘yes’ or ‘no’.

A variable is called categorical if each observation belongs to one of a set of categories.
=> Key feature: the relative number of observations.

A variable is called quantitative if observations on it take numerical values that represent different magnitudes of the variable.
=> Key features: center & spread.

Watch out! We don’t regard a variable such as telephone area code as quantitative, even though it is numerical. Quantitative variables measure how much there is of something.

A quantitative variable is either discrete or continuous:

- Discrete: if its possible values form a set of separate numbers, such as 0, 1, 2, 3, …

For example: the number of pets in a household, the number of children in a family, …

=> Any variable that is phrased as “the number of …” is discrete.

- Continuous: if its possible values form an interval.

For example: height, weight, age, the amount of time it takes to complete an assignment.

=> A continuous variable doesn’t have a set of separate values. The amount of time it takes to complete an assignment could be 4.2458 hours.

Mode = the category with the highest frequency.

Frequency = count.

For example: the number of reported shark attacks in the United States is a frequency.

The proportion of the observations that fall in a certain category is the frequency of observations in that category divided by the total number of observations.

The percentage is the proportion multiplied by 100. Proportions and percentages are also called relative frequencies.

A frequency table is a listing of possible values for a variable, together with the number of observations for each value.
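The definitions of frequency, proportion, percentage and mode can be sketched in a few lines of Python. The survey answers below are made up for illustration, and `frequency_table` is a hypothetical helper name, not something from the textbook.

```python
from collections import Counter


def frequency_table(observations):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


# Hypothetical answers to a yes/no survey question (not real data).
answers = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no", "yes", "yes"]

table = frequency_table(answers)
for category, (freq, prop, pct) in table.items():
    print(f"{category}: frequency={freq}, proportion={prop:.2f}, percentage={pct:.0f}%")

# The mode is the category with the highest frequency.
mode = max(table, key=lambda category: table[category][0])
print("mode:", mode)
```

The proportions in a frequency table always add up to 1, and the percentages to 100.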

How can we describe data using graphical summaries?

Graphs for categorical variables:
A pie chart is a circle having a “slice of the pie” for each category. The size of a slice corresponds to the percentage of observations in the category.

A bar graph displays a vertical bar for each category. The height of the bar is the percentage of observations in the category.

A Pareto chart is a bar graph with categories ordered by their frequency, from the tallest bar to the shortest bar.

Graphs for quantitative variables:
A dot plot shows a dot for each observation, placed just above the value on the number line for that observation.

A stem-and-leaf plot: Each observation is represented by a stem and a leaf. Usually the stem consists of all the digits except for the final one, which is the leaf. Now sort the data in order from smallest to largest. Place the stems in a column, starting with the smallest. Place a vertical line to their right. On the right side of the vertical line, indicate each leaf (= final digit) that has a particular stem. List the leaves in increasing order.
=> Truncate the data values to make the plot more compact.
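The construction steps for a stem-and-leaf plot can be sketched as follows. This is a minimal illustration that assumes non-negative whole numbers and uses the final digit as the leaf; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build the lines of a stem-and-leaf plot for non-negative integers."""
    leaves_by_stem = defaultdict(list)
    # Sorting first guarantees the leaves of each stem are in increasing order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # stem = all digits but the last
        leaves_by_stem[stem].append(leaf)

    lines = []
    # Include stems without observations so that gaps stay visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem} | {leaves}")
    return lines


# Hypothetical observations (not from the textbook).
plot = stem_and_leaf([12, 15, 21, 34, 30, 12])
print("\n".join(plot))
```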

A histogram is a graph that uses bars to portray the frequencies or the relative frequencies of the possible outcomes for a quantitative variable.

Which graph should we use?

- The dot plot and stem-and-leaf plot are more useful for small data sets (50 or fewer observations).

- Histograms are better for large data sets.

The shape of the distribution

A distribution of data is a frequency table or a graph that shows the values a variable takes and how often they occur.

=> Look for the overall pattern (do the observations cluster together, or is there a gap?).

=> Unimodal (one mound; its highest point is the mode) vs. bimodal (two mounds).

Shape:
- Symmetric.
- Skewed (to the left or to the right).
The tails of the distribution = the parts of the curve for the lowest and for the highest values.

A data set collected over time is called a time series. We can display time-series data graphically using a time plot. This charts each observation, on the vertical scale, against the time it was measured, on the horizontal scale, which makes a trend over time visible.

How can we describe the center of quantitative data?

The mean is the sum of observations divided by the number of observations.

$\bar{x} = \frac{\sum x}{n}$

An outlier is an observation that falls well above or well below the overall bulk of the data.

The median is the midpoint of the observations when they are ordered from the smallest to the largest (or the other way around).

Comparing mean & median:
- Symmetric: mean = median.
- Skewed to the right: mean > median.
- Skewed to the left: mean < median.

A numerical summary of the observations is called resistant if extreme observations have little, if any, influence on its value => median.
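A small sketch of why the median is resistant and the mean is not. The values are hypothetical; one extreme observation is appended to show the effect.

```python
from statistics import mean, median

# Hypothetical observations (not from the textbook).
values = [20, 22, 25, 27, 30]
with_outlier = values + [500]  # add one extreme observation

print("without outlier:", mean(values), median(values))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

The outlier pulls the mean from 24.8 up to 104, while the median only moves from 25 to 26. The result is also an example of a right-skewed data set, where the mean exceeds the median.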

How can we describe the spread of quantitative data?

The range is the difference between the largest and the smallest observation.
The deviation of an observation $x$ from the mean $\bar{x}$ is $(x - \bar{x})$, the difference between the observation and the sample mean.

The standard deviation s of n observations is:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

The symbol $\sum (x - \bar{x})^2$ is called the sum of squares. It represents finding the deviation for each observation, squaring each deviation, and then adding them up. The standard deviation s represents a typical distance or a type of “average distance” of an observation from the mean.

The symbol $n - 1$ stands for the sample size minus 1.

Remember:

- The larger the standard deviation s, the greater the spread of the data.
- s = 0 when all observations take the same value (= minimum possible spread). For example: if all the observations are 2, 2, 2, 2, 2, 2, 2, then the mean equals 2, each of the seven deviations equals 0, and s = 0.
- s can be influenced by outliers.

The standard deviation s is the square root of the variance $s^2$, which is an average of the squares of the deviations from their mean:

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

EXAMPLE:

The sum of squared deviations equals:

$\sum (x - \bar{x})^2 = 4 + 4 + 4 + 0 + 4 + 4 + 4 = 24$

The standard deviation of these n = 7 observations equals:

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{24}{6}} = \sqrt{4} = 2.0$
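The calculation can be checked step by step in Python. The notes do not list the seven observations, so the data below are an assumption: they are simply one set of values whose squared deviations are 4, 4, 4, 0, 4, 4, 4.

```python
from math import sqrt
from statistics import mean, stdev

# Assumed data (not given in the notes), chosen to match the squared deviations.
scores = [0, 0, 0, 2, 4, 4, 4]

x_bar = mean(scores)                                     # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in scores]  # (x - x_bar)^2 per observation
sum_of_squares = sum(squared_deviations)                 # sum of squares
variance = sum_of_squares / (len(scores) - 1)            # divide by n - 1
s = sqrt(variance)                                       # standard deviation

print(squared_deviations, sum_of_squares, variance, s)
print(stdev(scores))  # the library function uses the same n - 1 formula
```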

The Empirical Rule (for distributions that are approximately bell shaped):
- About 68% of the observations fall within 1 standard deviation of the mean.
- About 95% of the observations fall within 2 standard deviations of the mean.
- All or nearly all (about 99.7%) observations fall within 3 standard deviations of the mean.
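The Empirical Rule can be illustrated with simulated data. This is only a sketch: the bell-shaped data are generated artificially with an arbitrary mean of 100 and standard deviation of 15, and the helper `share_within` is a hypothetical name.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable
# Simulated bell-shaped data; mean 100 and standard deviation 15 are arbitrary.
data = [random.gauss(100, 15) for _ in range(20_000)]

x_bar = mean(data)
s = stdev(data)


def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in data) / len(data)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.3f}")
```

For skewed or otherwise non-bell-shaped data the three proportions can differ noticeably from 68%, 95% and 99.7%.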

How can measures of position describe spread?

The pth percentile is a value such that p percent of the observations fall below or at that value.

EXAMPLE:

Your score of 1200 out of 1600 for a test falls at the 90th percentile. Set p=90 in this definition.

Then 90% of those who took the test scored between the minimum score and 1200. Only 10% of the scores were higher than yours.

Three useful percentiles are the quartiles. The first quartile has p = 25, the second quartile has p = 50 (also referred to as the median), and the third quartile has p = 75. The quartiles split the distribution into four parts, each containing a quarter (25%) of the observations.

The interquartile range is the distance between the third and the first quartiles:

IQR=Q3-Q1

The five-number summary of a dataset is the minimum value, first quartile Q1, median, third quartile Q3, and the maximum value.
=> Graph = box plot.
- Box: contains 50% of the distribution, from Q1 to Q3.
- Whiskers: the lines extending from the box; they encompass the rest of the data, except potential outliers.
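A minimal sketch of the five-number summary and the IQR, using made-up data. Note that software packages and textbooks use slightly different conventions for computing quartiles; the `method="inclusive"` setting of Python's `statistics.quantiles` is just one of them, so results may differ slightly from a hand calculation.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


# Hypothetical observations (not from the textbook).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

minimum, q1, med, q3, maximum = five_number_summary(data)
iqr = q3 - q1  # interquartile range: width of the box in a box plot

print(minimum, q1, med, q3, maximum, iqr)
```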

The z-score for an observation is the number of standard deviations that it falls from the mean. For sample data, the z-score is calculated as:

$z = \frac{x - \bar{x}}{s}$
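As a small illustration, the z-score of every observation in a sample can be computed as below. The data are hypothetical (chosen so that the mean is 2 and the standard deviation is 2), and `z_scores` is an illustrative helper name.

```python
from statistics import mean, stdev


def z_scores(data):
    """Number of standard deviations each observation falls from the mean."""
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (divides by n - 1)
    return [(x - x_bar) / s for x in data]


# Hypothetical observations with mean 2 and standard deviation 2.
scores = [0, 0, 0, 2, 4, 4, 4]
z = z_scores(scores)
print(z)
```

A positive z-score means the observation lies above the mean, a negative one means it lies below.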

H3: Association: Contingency, Correlation and Regression

Response variables and explanatory variables

(analyzing data on two variables)

The response variable is the outcome variable on which comparisons are made.

The explanatory variable defines the groups to be compared with respect to values on the response variable.

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

How can we explore the association between two categorical variables?

A contingency table is a display for two categorical variables. Its rows list the categories of one variable and its columns list the categories of the other variable.

Each entry in the table is the number of observations in the sample with certain outcomes on the two variables.

Each row and column combination in this table is called a cell. For instance, in Table 3.1 the first cell in the second row has the frequency 19485; this is the number of observations in the conventional category of food type that had pesticides present.

The process of taking a data file and finding the frequencies for the cells of the contingency table is referred to as cross tabulation of the data.

Table 3.1 is formed by cross-tabulation of food type and pesticide status for the 26698 sampled food items.

EXAMPLE (see textbook pages 96-97):

TABLE 3.1

Pesticides
Food type / YES / NO / TOTAL
Organic / 29 / 98 / 127
Conventional / 19485 / 7086 / 26571
Total / 19514 / 7184 / 26698

The proportions of food items with pesticides present are 29/127 = 0.23 for organic foods and 19485/26571 = 0.73 for conventionally grown foods.

The proportions are called conditional proportions. The proportions are formed conditional upon food type.

TABLE 3.2

Pesticide status
Food type / Present / Not present / Total / N
Organic / 0.23 / 0.77 / 1.00 / 127
Conventional / 0.73 / 0.27 / 1.00 / 26571

The conditional proportions in each row sum to 1.0. The sample size n for each set of conditional proportions is listed so you can determine the frequencies on which the conditional proportions were based. Whenever we distinguish between a response variable and an explanatory variable, it is natural to form conditional proportions for categories of the response variable.

These conditional proportions treat pesticide status as the response variable.

The proportion of all sampled produce items that contained pesticide residues is not a conditional proportion. It is not found for a particular food type. We formed the proportion 19514/26698 = 0.731. Such a proportion is called a marginal proportion. It is found using counts in the margin of the table.
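The conditional and marginal proportions can be recomputed from the counts in Table 3.1. This is a sketch; the dictionary layout and the function name are illustrative choices, only the counts come from the table.

```python
# Cell counts from Table 3.1 (rows: food type, columns: pesticide status).
counts = {
    "Organic": {"present": 29, "not_present": 98},
    "Conventional": {"present": 19485, "not_present": 7086},
}


def conditional_proportions(row):
    """Proportions of each pesticide status within one food type."""
    row_total = sum(row.values())
    return {status: n / row_total for status, n in row.items()}


# Conditional proportions: computed separately for each food type.
conditional = {food: conditional_proportions(row) for food, row in counts.items()}

# Marginal proportion: uses the column total and the grand total.
grand_total = sum(sum(row.values()) for row in counts.values())
marginal_present = sum(row["present"] for row in counts.values()) / grand_total

for food, props in conditional.items():
    print(food, {status: round(p, 2) for status, p in props.items()})
print("marginal proportion present:", round(marginal_present, 3))
```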

TABLE 3.3

Pesticide status
Food type / Present / Not present / N
Organic / 0.40 / 0.60 / 127
Conventional / 0.40 / 0.60 / 26571

Suppose that for each food type 40% had pesticides present and 60% did not have pesticides present, as shown in Table 3.3. Then the food types would have the same pesticide status distribution. We then say that pesticide status is independent of food type.
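The idea of independence (identical conditional distributions for every food type) can be expressed as a simple comparison. This is a descriptive sketch only, using the rounded proportions from Tables 3.2 and 3.3; the function name is hypothetical, and in real samples the conditional proportions are rarely exactly equal.

```python
from math import isclose


def same_conditional_distribution(rows, tolerance=1e-9):
    """True if every row has the same conditional proportions as the first row."""
    distributions = list(rows.values())
    first = distributions[0]
    return all(
        isclose(row[status], first[status], abs_tol=tolerance)
        for row in distributions[1:]
        for status in first
    )


# Conditional proportions from Table 3.3 (hypothetical, no association).
table_3_3 = {
    "Organic": {"present": 0.40, "not_present": 0.60},
    "Conventional": {"present": 0.40, "not_present": 0.60},
}

# Conditional proportions from Table 3.2 (observed data).
table_3_2 = {
    "Organic": {"present": 0.23, "not_present": 0.77},
    "Conventional": {"present": 0.73, "not_present": 0.27},
}

print(same_conditional_distribution(table_3_3))  # identical rows
print(same_conditional_distribution(table_3_2))  # rows differ: an association
```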