Basic Medical Statistics

Background

You need to develop a reasonable level of competence and confidence in medical statistics, not just for the AKT exam but to allow you to develop critical appraisal skills. This “Noddy guide” covers a lot of the basic material needed for the AKT exam.

The first section covers the concepts of normal distributions, population sampling, significance, confidence limits and risks. There is some information on the terminology around screening, and a list of useful definitions.

Further reading – Trish Greenhalgh’s “How to Read a Paper” is almost mandatory.

What do the statistics mean? After this, read chapters 5 and 8 of “How to Read a Paper”.

P-values - the probability that the result of the study could have occurred by chance alone, ie if the intervention truly had no effect.

Simply looking at “averages” can be misleading. I am testing a new BP drug. Every person taking it will respond slightly differently – some better than others. I test it in 3 people and the BP drops by an average of 10mmHg. Looks good. BUT – could it be that with this drug, BP actually goes up in most people and I just happened to have tested it on 3 people where the BP dropped a bit?

The more numbers I have, the better I can look at the spread (distribution pattern) of the results. The better I can judge the distribution pattern, the more sure I can be about chance effects. By convention, a p value of <5% (or <0.05) is defined as significant: the likelihood of your result being coincidence is less than 1 in 20 (5%). The bigger the change you see and the more people you test, the more believable your result is.

So I went back and did my BP study on 100 people. I found an average drop of 8.9mmHg. I looked at the spread of the results and the p-value came out at 0.01 – ie only a 1% chance that a drop like this would have appeared if the drug really did nothing and I had simply happened to test a non-representative group of “good responders” – a significant result.

[Figure: a normal curve showing BP levels in the “normal” population, with three red splodges marking my 3 results. Perhaps the average change with my tablet is nothing, but I happened to have tested it in 3 people where there was a slight drop in BP? Do more tests and you start to get a better picture.]

Think about it – read a journal with 20 papers in it, each quoting “significant” results with p = 0.05. On average, one of those papers has recorded a fluke and has not really found anything “significant”…
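To see the arithmetic in action, here is a minimal Python sketch of the 100-person BP study (the data are simulated for illustration, not real trial results): a one-sample t-test asks how often a mean drop this large would turn up if the drug truly did nothing.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated BP changes (mmHg) in 100 patients; negative values = a drop.
bp_changes = rng.normal(loc=-8.9, scale=15.0, size=100)

# One-sample t-test against the null hypothesis of "no change" (mean = 0).
t_stat, p_value = stats.ttest_1samp(bp_changes, popmean=0)

print(f"mean change = {bp_changes.mean():.1f} mmHg, p = {p_value:.4f}")
# p < 0.05 -> by convention, unlikely to be a chance finding.
```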

Confidence Intervals – how confident should you be that your result is accurate?

Sounds similar to p-values, but you need to think hard to spot the difference.

Statement of fact - my drug actually lowers BP by an average of 8.9mmHg.

If I picked another group of 100 people and tested my drug again, I would get a result that was slightly different from 8.9mmHg. If I did the study 10 times, I would probably get 10 different answers. Which one is true? Confidence interval calculations look at the data spread and give an indication as to where the “true” answer probably falls.

My BP study results may state: “the fall in BP was 8.9mmHg (p=0.01, CI –7.1 to –10.4)”. This means:

1) significant drop in BP averaging 8.9mmHg (only 1% likely to be a chance finding)

2) 95% sure that the drug’s true BP lowering effect is between 7.1 and 10.4mmHg

[Figure: the blue curve shows the “true effect” of the drug when tested on a massive population; the red curve is the spread of my results in my small sample population.]

The confidence interval indicates how well I think I have matched up the 2 curves – a big interval means I am not very confident; a small interval means I am more sure that I am close to the “true” answer.

What if we get a “significant result” on p-values, but the confidence interval includes zero (the point of no effect)? For instance p = 0.05, CI = 1.4 to –5.6. What is being said is: “In the people I picked to represent everyone, I am sure that the drug worked. But I did not look at enough people to be confident that these people really do represent the whole population. Please consider repeating the trial.” A confidence interval that includes zero is another way of saying it is possible that the drug achieved nothing.

If the confidence interval comes nowhere near zero (like CI –7.1 to –10.4), then we are pretty sure that our study population represents the whole population. We can be reasonably sure that our result is valid and that further trials are probably not needed.
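For the curious, a minimal sketch of how a 95% confidence interval is computed around a sample mean (the BP data are again simulated):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
bp_changes = rng.normal(loc=-8.9, scale=8.0, size=100)  # simulated data

mean = bp_changes.mean()
sem = stats.sem(bp_changes)  # standard error of the mean
lo, hi = stats.t.interval(0.95, df=len(bp_changes) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.1f} mmHg, 95% CI = {lo:.1f} to {hi:.1f}")
# A wider spread, or fewer patients, gives a wider (less confident) interval.
```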

Relative and Absolute Risks

Some studies measure changes in risk – like the change in survival with a statin – and the results are often expressed in terms of changes in risk.

Absolute risk (AR) – statement of how often the “event” happened

Relative risk (RR) – risk of the event in one group “relative” to another group

Absolute risk reduction (ARR) – statement of how much the risk in one group is less than the risk in the other group

Relative risk reduction (RRR) – rarely bother with this one!

Example:

Annual stroke risk in high risk AF patients:

Aspirin – 11% stroke risk
Warfarin – 6% stroke risk

AR of a stroke on warfarin is 6% (0.06) and the AR of a stroke on aspirin is 11% (0.11)

RR of a stroke on warfarin compared with aspirin is 54% (0.54) – 6/11

ARR from taking warfarin rather than aspirin is 5% (0.05) - AR reduced from 11% to 6%

RR numbers always look better than AR numbers! Saying “warfarin is better than aspirin because the RR of a stroke is 54% lower” sounds much better than “warfarin gives an ARR of 5%”.

Numbers Needed to Treat (NNT)

How many people do you need to “treat” with the study intervention to stop the study event from happening once? Remember that ARR is a statement of the reduction in the rate of occurrence of an event. NNT is simply the number of ARRs needed to make 100% – in other words, NNT = 1/ARR. In the example above, the ARR is 5%, so 20 will be needed to make 100% – the NNT is 20. Therefore, treating 20 people with warfarin rather than aspirin, in the high risk group, will prevent one stroke per year.
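The whole risk example can be worked through in a few lines of Python, using the stroke-risk figures quoted above:

```python
ar_aspirin = 0.11   # absolute risk of stroke per year on aspirin
ar_warfarin = 0.06  # absolute risk of stroke per year on warfarin

rr = ar_warfarin / ar_aspirin   # relative risk: 6/11, roughly 0.55
arr = ar_aspirin - ar_warfarin  # absolute risk reduction: 0.05
rrr = arr / ar_aspirin          # relative risk reduction: roughly 0.45
nnt = 1 / arr                   # number needed to treat: 20

print(f"RR = {rr:.2f}, ARR = {arr:.2f}, RRR = {rrr:.2f}, NNT = {nnt:.0f}")
```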

Forest Plots / Blobbograms – after the tutorial, read “How to Read a Paper” Chapter 8

Meta-analysis means putting the results of several similar trials together and looking at the result. Thinking back to p-values, remember we said “the more numbers I have, the better I can look at the spread (distribution pattern) of the results”. This is what statisticians do with meta-analysis. One problem is that all the trials may have asked the question slightly differently – one may have presented data about death in the first 5 years after MI, another death rates in the first 10 years, etc. The meta-analysts go back to the individual study data and re-present it so that the results are directly comparable.

The big thing that changes when you pool results is the confidence interval: you are effectively repeating the trial several times, so it is logical that you should be more confident about where the “true answer” lies. This is what the Forest Plot (or blobbogram) represents.
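To illustrate why pooling narrows the interval, here is a minimal sketch of fixed-effect (inverse-variance) pooling – one common meta-analysis method; the trial effects and standard errors below are invented:

```python
import numpy as np

# Invented results from four trials: effect sizes (eg log risk ratios,
# negative = favours treatment) and their standard errors.
effects = np.array([-0.30, -0.10, -0.25, 0.05])
std_errs = np.array([0.15, 0.20, 0.08, 0.25])

weights = 1 / std_errs**2                 # bigger (more precise) trials weigh more
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))  # smaller than any individual trial's SE

lo, hi = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"pooled effect = {pooled:.2f}, 95% CI = {lo:.2f} to {hi:.2f}")
```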

Each trial is represented by one row on the plot. The box on each line represents the “result” of that trial, and the size of the box represents the size of the trial. The width of the line drawn for that trial represents the confidence interval for that trial. The very clever bit is the blob in the final row: this represents the new confidence interval for the pooled results of all the trials.

In this example, all of the trials showed improved mortality except “Saint 1998”. However, in a lot of the trials, the confidence interval crosses the line of no effect, making it hard to state with certainty that the trial definitely showed better survival. “Stewart 1994” has a big blob (= big trial) and is it any surprise therefore that it has narrow confidence intervals (horizontal line is NOT very wide)? Smaller trials have wider confidence intervals. The final blob, however, unarguably lies on the “favours Tx” side of the line. Put simply – no more trials needed.

Hierarchy of studies

Case Reports

“This interesting thing was seen to happen on a few occasions….”

Case Control Studies

“We looked at how close people with leukaemia lived to power lines compared with…”

Cross-sectional surveys

Data collected at a single point in time

eg counting cases of flu during an epidemic

Cohort studies

“We watched these two groups of people over 10 years”

Randomised controlled trials

“Two or more matched groups randomly allocated to intervention or placebo”

Systematic reviews

Meta-analysis, Forest plots, definitive statements.

Sensitivity & Specificity:

                          True Diagnosis
                    Positive                          Negative
Test      Positive  a (TP, True Positive)             b (FP, False Positive,      PPV = a/(a+b)
Result                                                   Type 1 error)
          Negative  c (FN, False Negative,            d (TN, True Negative)       NPV = d/(c+d)
                       Type 2 error)
                    Sensitivity = a/(a+c)             Specificity = d/(b+d)

Sensitivity = a/(a+c) Proportion of people with the target disorder who have a positive test

Specificity = d/(b+d) Proportion of people without the target disorder who have a negative test

PPV = a/(a+b) Proportion of people with a positive test who have the target disorder

NPV = d/(c+d) Proportion of people with a negative test who do not have the target disorder

Power = 1 – FN rate (the Type 2 error rate). A measure of the strength of a test – its ability to detect an effect that is really there.

SnNout When a sign/test/symptom has a high Sensitivity, a Negative result rules out the diagnosis.

SpPin When a test has a high Specificity, a Positive result rules in the diagnosis.
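A minimal sketch showing all four measures computed from the a/b/c/d cells of the table (the counts are invented):

```python
a, b = 90, 30   # a = true positives, b = false positives
c, d = 10, 870  # c = false negatives, d = true negatives

sensitivity = a / (a + c)  # 0.90 - sick people who test positive
specificity = d / (b + d)  # ~0.97 - well people who test negative
ppv = a / (a + b)          # 0.75 - chance a positive test means disease
npv = d / (c + d)          # ~0.99 - chance a negative test means no disease

print(f"Sens {sensitivity:.2f}, Spec {specificity:.2f}, "
      f"PPV {ppv:.2f}, NPV {npv:.2f}")
```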

Many people practise and memorise this table and draw it out as soon as they start the exam. There is nearly always one question based on it…

Excellent Wiki page: http://en.wikipedia.org/wiki/Sensitivity_and_specificity

Definitions of Statistical Terms Needed for AKT:

Know each and every one! Most of the important ones are listed here

http://www.gpnotebook.co.uk/simplepage.cfm?ID=x20050322112724411760

Mode of a set of values is the value which appears most often. In a normal distribution the mode will equal the mean and the median.

Median value is the value which falls in the middle if all the values are arranged in sequential order.

Mean of a set of values is given by the sum of the values divided by the number of values.

Variance is a measure of the spread of observations around a mean. It is the square of the standard deviation.

Standard deviation gives the average distance of each of a set of observations from the mean.

If the standard deviation is known then it can be stated that in a normally distributed population 68% of values will fall within one standard deviation of the mean, 95% within two and 99.7% within three standard deviations of the mean. The standard deviation is the square root of the variance.
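Python's built-in statistics module computes all of these directly – a quick sketch with invented values:

```python
import statistics

values = [4, 7, 7, 8, 9, 10, 12]

print(statistics.mean(values))      # 8.14... (sum divided by count)
print(statistics.median(values))    # 8 (middle value when sorted)
print(statistics.mode(values))      # 7 (most frequent value)
print(statistics.variance(values))  # sample variance
print(statistics.stdev(values))     # standard deviation = sqrt(variance)
```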

Chi squared test is used to compare characteristics observed in a population with what would be expected in that population – how well does what was observed fit with what would have been expected?

Confidence intervals around a result obtained from a study sample indicate the range of values within which there is a specific level of certainty (usually 95%) that the true population value for that result lies.

If the CI contains the value zero, then it is possible that there is no difference in effect between the interventions.

If the CI excludes the value zero, one can be reasonably (95%) certain that there is a statistically significant difference between the interventions.

Type II error, also known as a false negative, occurs when the test fails to reject a false null hypothesis. For example, if the null hypothesis states that a patient is healthy and the patient is in fact sick, a type II error occurs when the test fails to reject the hypothesis, falsely suggesting that the patient is healthy. The rate of the type II error is related to the power of a test.

Power of a test is essentially the ability to show a difference if one exists (i.e. to reject the null hypothesis). It is defined as 1 – the type 2 error rate. You may have heard studies being criticised for being “underpowered”?

The power of a study defines the ability of a study to demonstrate an association between two variables, given that an association exists. For example, 80% power in a clinical trial means that the study has an 80% chance of ending up with a p value of less than 5% in a statistical test (i.e. a statistically significant treatment effect). 80% is often considered an acceptable level of power for a study. Power can be calculated before (a priori) or after (post hoc) data is collected:

An a priori power calculation is conducted prior to the research study, and is typically used to determine an appropriate sample size to achieve adequate power.

A post hoc power calculation is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population.
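As an illustration, an a priori sample-size calculation can be sketched with the statsmodels library (the effect size, alpha and power below are example values, not recommendations):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many patients per arm are needed to detect a standardised effect
# size of 0.5 with alpha = 0.05 and 80% power?
n_per_arm = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"about {n_per_arm:.0f} patients per arm")  # roughly 64
```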

Odds Ratio. In case control studies the relative risk cannot be calculated because the incidence is not determined by the study. Thus the useful manipulation of the data is into odds ratios:

suspected cause present: number of cases = a; number of controls = b
suspected cause absent: number of cases = c; number of controls = d

Odds ratio (OR) = ad/bc

If OR = 1, or the confidence interval (CI) includes 1, there is no significant difference between the treatment and control groups

If OR>1 and the CI does not include 1, events are significantly more likely in the treatment group than the control group

If OR<1 and CI does not include 1, events are significantly less likely in the treatment group than the control group
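Putting the pieces together, a minimal sketch computing an OR and its 95% CI from invented 2x2 counts, using the standard formula SE(log OR) = sqrt(1/a + 1/b + 1/c + 1/d):

```python
import math

a, b = 40, 60  # suspected cause present: a cases, b controls
c, d = 20, 80  # suspected cause absent:  c cases, d controls

odds_ratio = (a * d) / (b * c)
log_se = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * log_se)
hi = math.exp(math.log(odds_ratio) + 1.96 * log_se)

print(f"OR = {odds_ratio:.2f}, 95% CI = {lo:.2f} to {hi:.2f}")
# Here the CI excludes 1, so the difference is statistically significant.
```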