Biostatistics

for Preventive Medicine

Handout 1 (5.1-5.3)

Statistic: A number that can be computed from data

Statistics: Scientific methods for gathering, organizing, summarizing, presenting, and analyzing information to draw valid conclusions and make decisions.

Biostatistics: The scientific study of numerical data based on natural phenomena.

Population: The largest collection of entities in which we have an interest at a particular time.

Sample: Part of the population that is examined in order to gather information.

Random Sample: Basic type of probability sample where each and every individual occurrence in the universe has an equal chance of being included in the sample.

Randomization: The use of chance to assign experimental units to groups. It is essential to probability sampling.

Bias: Occurs when some form of error is introduced to the design of the study. It may lead to false results in the association between the exposure and disease under study.

Valid sample: Adequate in size for the population under study and free from introduced bias.

PROBABILITY SAMPLE DESIGN

Simple Random Sample (SRS) is the basic type of probability sample. SRS of size n consists of n units from the population chosen at random in such a way that every set of n units has an equal chance to be the sample actually selected. In general, results from the analysis of the statistical sample are applicable to the population under study.

Randomization is the use of chance in assigning subjects to eliminate bias.

Steps: Obtain a list of all N subjects in the population (hospital records, ship’s company list, other directories). Prepare a tag for each subject carrying a number from 1 to N (N is the total number in the population). Mix the tags in a receptacle. Draw n tags blindly, one for each unit in the sample.

Alternatively, one could use a table of random numbers to identify the subjects. This gives every member an equal chance of being selected for the sample.
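The tag-drawing procedure can also be sketched in software. The following is a minimal illustration, not a prescribed method: the function name simple_random_sample, the population of 200 numbered tags, and the seed are all assumptions made for the example.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n units without replacement.

    random.Random.sample gives every set of n units the same chance
    of being the sample selected, which is the SRS property.
    """
    rng = random.Random(seed)  # a fixed seed only makes the draw repeatable
    return rng.sample(population, n)


# Hypothetical sampling frame: tags numbered 1 to N for N = 200 subjects.
tags = list(range(1, 201))
sample = simple_random_sample(tags, n=10, seed=42)
print(sorted(sample))
```

Drawing without replacement mirrors pulling tags from the receptacle: once a tag is drawn it cannot be drawn again.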

SAMPLE BIAS

Selection Bias: Occurs when the outcome variable (disease or exposure) influences the selection of subjects differently for compared groups. Examples: Exposure and outcome have occurred at the time subjects are selected in the study. Use of inadequate comparison groups. Loss to follow-up or non-participation.

Information Bias: Misclassification of exposure or disease information. Examples: Interviewer knowledge of the subject’s disease status. Recall: individuals with disease remember their exposure histories differently than those who do not have disease.

Confounding: A mixing of the effect of interest with the effect of other predictors of disease. It occurs when another variable is associated with both the exposure and the disease. Example: Walter Reed, Panama, mosquito.

TECHNIQUES TO MINIMIZE BIAS

Selection: Use incidence data. Use more than one comparison group. Minimize loss to follow-up.

Information: Maximize accuracy of data. Double blind studies. Standard training of study personnel.

Confounding: Randomization in experiments. Matching (distributes confounders to cases and controls). High level analysis.

Consequences of Sample Bias: Conclusions drawn from the statistical analysis may be false. Bias alters the statistical estimate of effect, making it either more or less extreme than the true association.

Frequency distribution: A table that summarizes the values a variable can take and the number of observations of each value.
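As an illustration, a frequency distribution can be tallied with Python's collections.Counter. The blood-type observations below are made up for the example.

```python
from collections import Counter

# Hypothetical observations of one categorical variable (blood type).
blood_types = ["A", "O", "B", "O", "A", "O", "AB"]

# Counter maps each observed value to the number of times it occurs.
frequency = Counter(blood_types)

# Print the values with their counts, most frequent first.
for value, count in frequency.most_common():
    print(f"{value:<3}{count}")
```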

Table: A set of data arranged in rows and columns. Used to demonstrate relationships in data. This is the basis for the visual display of data.

Graphs: Quantitative display of data using a system of coordinates.

Charts: A method of illustrating statistical information using only one coordinate.

Types of Variables

Variable: Any characteristic of a person or thing that can be expressed as a number. A value of the variable is the number.

Categorical Variable: A variable whose values are non-numerical categories. Examples: Blood type, sex, exposure.

Continuous Variable: Can assume an infinite number of values between any two fixed points. Examples: Measurements, such as height, time, temperature, bacteria count.

Rounding Data: Rounding should be done as the last step of a calculation to avoid inaccuracies.
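A small sketch of why rounding belongs at the end: three equal shares of one-third should sum to 1, but rounding each share before adding loses a hundredth. The values are chosen only for illustration.

```python
# Three equal shares that together make up the whole.
thirds = [1 / 3, 1 / 3, 1 / 3]

# Rounding each value first, then summing, accumulates rounding error.
rounded_early = round(sum(round(p, 2) for p in thirds), 2)

# Summing at full precision and rounding only the final answer does not.
rounded_last = round(sum(thirds), 2)

print(rounded_early, rounded_last)
```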

Ratios, Proportions And Percentages

Ratio: A ratio is a frequency measure that compares two quantities. Defined: Ratio = x/y. The values of x and y may be independent, or x may be included in y.

Proportion: A frequency measure defined as a part of the whole (x is included in y), usually expressed as a decimal: P = x/y

Percentage: A percentage is simply a proportion multiplied by 100. Percentage = (x/y) × 100

When the population is small, percentages are unstable and can be misleading.
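The three measures can be compared side by side in a short sketch. The case counts are hypothetical and the function names are chosen for this example only.

```python
def ratio(x, y):
    """Compare two quantities; x may or may not be part of y."""
    return x / y


def proportion(part, whole):
    """Part of a whole, expressed as a decimal; part is included in whole."""
    return part / whole


def percentage(part, whole):
    """A proportion multiplied by 100."""
    return 100 * part / whole


# Hypothetical counts: 30 cases in males, 20 cases in females.
male_cases, female_cases = 30, 20
total_cases = male_cases + female_cases

male_to_female = ratio(male_cases, female_cases)   # independent quantities
male_share = proportion(male_cases, total_cases)   # males are part of the total
male_percent = percentage(male_cases, total_cases)

# Small populations: one extra case among 4 people moves the percentage a lot.
small_before = percentage(1, 4)
small_after = percentage(2, 4)
print(male_to_female, male_share, male_percent, small_before, small_after)
```

The last two values show the instability noted above: in a group of four, a single additional case doubles the percentage.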

Presentation Graphics

Array: Raw ungrouped data can be organized into an orderly arrangement called an array. An Array can be arranged in order: Ascending or descending.

Range: The range is the difference between the largest (maximum) and smallest (minimum) values in a data set. It is reported as a single number in statistics.

Class Intervals (k): The number of class intervals is relative. For tables, use 4-8 class intervals. For graphs and maps, use 3-6 intervals.

Class Width: This is the range of the data in each class. To calculate class width, divide the range (of the entire data set) by the number of class intervals (k).
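A worked sketch of the range and class width calculation follows. The ages and the choice of k = 6 are hypothetical, and placing the maximum value in the last class is an assumption of this example rather than a rule from the handout.

```python
def class_width(values, k):
    """Range of the whole data set divided by the number of class intervals."""
    data_range = max(values) - min(values)
    return data_range / k


def frequency_by_class(values, k):
    """Count the observations in each of k equal-width classes.

    Assumes the values are not all identical (width would be zero).
    """
    low = min(values)
    width = class_width(values, k)
    counts = [0] * k
    for v in values:
        # The maximum value would index one past the end, so it is
        # placed in the last class.
        index = min(int((v - low) / width), k - 1)
        counts[index] += 1
    return counts


# Hypothetical ages of 10 subjects, already arranged as an ascending array.
ages = [22, 25, 31, 38, 40, 47, 52, 58, 61, 70]
width = class_width(ages, k=6)          # (70 - 22) / 6
counts = frequency_by_class(ages, k=6)
print(width, counts)
```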

Tables: Defined as an orderly arrangement of numerical data in columns and rows so that comparisons of the data can be made. There are 2 types: Frequency distribution tables (which have 1 variable only) and Two-by-two tables (which usually compare exposure and disease, or treatment and outcome).

Parts of statistical tables

Title: Should be clear and to the point. Answer the questions “What, Where, When.” Is located above the table and centered.

Boxhead: Space provided for the column headings.

Stub: The body of the table where the data is placed. Data should be arranged in an orderly manner. May also contain a summary (totals) row.

Rows: Horizontal arrangement of data. No lines are used to separate rows.

Columns: Vertical arrangement of the data. Vertical lines may separate columns.

Footnotes: Provide an explanation for something in the table. Located at the bottom left side, directly under the table.

Source: Identifying the reference from which a table (or the data for the table) is borrowed. Located bottom left side of table below the footnotes.

Table Ruling

Title: No ruling. Words in ALL CAPITALS or bold.

Boxhead: Top - double line. Bottom - single line. Vertical lines may separate column headings.

Stub: No horizontal lines inside data. Single horizontal line on the bottom. Separate summarization row with a horizontal line.

Bar Chart

A graph of the frequency distribution of a categorical variable

Bars are separated by spaces. May be grouped.

Histogram

A graph of the frequency distribution of a continuous variable

In epidemiology, a histogram that plots cases of disease by their date or time of onset is referred to as an epidemic curve

Class intervals are equal

Histogram construction

Intervals numbered between tick marks on axis. No spacing between intervals. Data represented as squares, and may be shaded or colored

Frequency Polygon

It is similar to a histogram, but constructed by connecting the midpoints of the class intervals with straight lines. It is “closed” by connecting the first and last midpoints to the midpoints of the adjacent empty intervals on the X-axis.

Central Tendency

The purpose of measures of central tendency: To characterize all of the data in a distribution.

Shapes of Frequency Distributions

Normal Distributions: The symmetrical clustering of values around a central location that presents a "bell-shaped" curve. The basis of tests we use to draw conclusions or make generalizations from the data. The 3 measures of central tendency will be equal.

Skewed Distributions: An asymmetrical distribution. The measures of central tendency will be unequal.

Positively skewed - The tail of the distribution extends to the right

Negatively skewed - The tail of the distribution extends to the left
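One informal way to see skew is to compare the mean with the median, since the mean is pulled toward the tail. This is a rough check on made-up data, not a formal test of skewness.

```python
import statistics

# Hypothetical data with a long right tail (one very large value).
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]

# Hypothetical data with a long left tail (one very small value).
left_tailed = [1, 17, 18, 18, 18, 19, 19, 20]

right_mean = statistics.mean(right_tailed)
right_median = statistics.median(right_tailed)
left_mean = statistics.mean(left_tailed)
left_median = statistics.median(left_tailed)

# Right tail: mean above the median. Left tail: mean below the median.
print(right_mean, right_median, left_mean, left_median)
```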

Measures Of Central Tendency

Mean: The arithmetic average, which is the measure most influenced by the extremes. Preferred for normal distributions. The most commonly used measure of center. Calculated by taking the sum of all the values observed, divided by the total sample size of the group.

Formula: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

$\bar{X}$ = Mean of sample

$\Sigma$ = Greek sigma for "The sum of"

$X_1, \ldots, X_n$ = Values of the variable

$n$ = The total number of values
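The mean can be computed directly from its definition, as in this short sketch. The values are hypothetical, and the second data set adds one extreme value to show how strongly it moves the mean.

```python
import statistics


def sample_mean(values):
    """Sum of all observed values divided by the number of values."""
    return sum(values) / len(values)


# Hypothetical observations.
values = [2, 4, 4, 6, 9]
mean_value = sample_mean(values)

# The same data with one extreme observation added.
with_extreme = values + [95]
mean_with_extreme = sample_mean(with_extreme)

# The standard library gives the same answer as the hand-written formula.
library_mean = statistics.mean(values)
print(mean_value, mean_with_extreme, library_mean)
```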

Median: The median is a measure that is not influenced by the extremes. It identifies the midpoint of a distribution. Preferred measure for skewed distributions. To compute, arrange the data into an ordered array. The median will ALWAYS have the same number of values above it as it has below it. Do not confuse determining the rank (place on list) with the value of the median (measurement listed).

Determining the middle rank

Formula: middle rank = (n + 1) / 2, where n = Number of Values.

If the n is an odd number of values, the middle rank will fall on a specific observation.

If the data set has an EVEN number of values, the rank will be between 2 whole numbers. You must find the High Middle Value (HMV) and Low Middle Value (LMV), add the two and divide them by two.

Formula: Median = (HMV + LMV) / 2
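The middle-rank procedure can be sketched as follows. The function name median_by_rank and the data are hypothetical; the result is cross-checked against Python's statistics.median.

```python
import statistics


def median_by_rank(values):
    """Find the median using the middle rank (n + 1) / 2."""
    ordered = sorted(values)          # arrange the data into an ordered array
    n = len(ordered)
    middle_rank = (n + 1) / 2         # a rank (place on the list), not a value
    if n % 2 == 1:
        # Odd n: the rank falls on a specific observation (ranks start at 1).
        return ordered[int(middle_rank) - 1]
    # Even n: the rank falls between two observations; average them.
    lmv = ordered[n // 2 - 1]         # Low Middle Value
    hmv = ordered[n // 2]             # High Middle Value
    return (hmv + lmv) / 2


odd_data = [12, 3, 8, 5, 10]          # n = 5, middle rank 3
even_data = [10, 3, 8, 5]             # n = 4, middle rank 2.5

odd_median = median_by_rank(odd_data)
even_median = median_by_rank(even_data)
print(odd_median, even_median)
```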

Mode: The most frequently observed value. There may be more than 1 mode. Not generally useful as a measure of central tendency. The mode value may also be dependent on the method of measurement or the rounding of values.
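A data set with more than one mode can be handled with statistics.multimode (available in Python 3.8 and later), which returns every value tied for the highest frequency. The observations are hypothetical.

```python
import statistics

# Hypothetical observations in which two values tie for most frequent.
observations = [1, 2, 2, 3, 3, 4]

# multimode returns all modes, in the order they first appear in the data.
modes = statistics.multimode(observations)
print(modes)
```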
