Paper RA08

Drug Safety Reporting - now and then

David J. Garbutt, Business & Decision, Zürich, Switzerland

ABSTRACT

Introduction

This paper is about the now and then of safety reporting, about its future and where it can, and should, go. I will not talk about Drug Safety Monitoring, although many of the lessons I hope to convince you about could apply there also.

Many of you may know the story of the emperor who had no clothes on, although he had been convinced he really did. Here we have a big pile of clothes, but no emperor. Our journey today is to see how we can go about putting the emperor back into his clothes so he can easily be recognized, as an emperor that is.

This paper will remind us why we do safety reporting, ask whether what we currently produce really fills that need and what we could do to improve our product, and briefly look at factors that I believe indicate safety reporting will change in the next few years.

Clothes, but no emperor

Standard Safety reporting generates large amounts of paper. Listings with 20,000 lines are not uncommon. And AE tables can be as big, not to mention shift tables. A colleague of mine recently had to review a shift table taking up 280+ pages; actually there were four tables that long. Is there a dummies guide to interpreting shift tables? I certainly hope there is a dummies guide to programming them [1].

This sheer amount of product creates problems in generation, assessment, validation, assembly and last, and worst – comprehension and communication. Safety outputs are almost always descriptive – the outputs we create are only rarely analytical and therefore very limited. And, I have always suspected, not read.

We aim to show a drug has no dangers, or at least to make clear what dangers there are and under what circumstances they are important to the patient. We should also be asking what constellation of AEs comes with the drug. Is the incidence dose or exposure related? Is it related to any concomitant medications? Are there patient subsets that are particularly prone? Are there any surprises in the data?

Safety data are more important than ever

Safety reporting used to be a check but now it is vital to marketing, drug screening, approval, and perhaps continued existence on the market.

Good safety analysis also has the potential to affect time to market. A 2003 study at the FDA[§] of the reasons for repeated reviews of new drug applications (NDAs) showed the commonest reason was safety concerns.

Standard NMEs studied were those with total approval times greater than 12 months in 2000 and 2001. Fifty-seven percent of these applications had times greater than 12 months, ranging from 12.1 to 54.4 months. The most frequent primary reasons for delay on the first cycle were safety issues (38 percent) followed by efficacy issues (21 percent), manufacturing facility issues (14 percent), labeling issues (14 percent), chemistry, manufacturing, and controls issues (10 percent), and submission quality (3 percent).


For priority NDAs the proportion of delays due to safety was 27% and came second to manufacturing and quality concerns.

Characterizing safety data

Safety data are not easy to analyse with conventional statistical methods because many of the ‘standard’ assumptions are not fulfilled. Pathological features frequently seen in safety data include:

- Asymmetric, non-normal distributions, often with variability proportional to the mean.
- Heterogeneous subpopulations (e.g. patients differentially prone to particular AEs, such as liver damage).
- Data available as counts or times to occurrence.
- Large amounts of variability, e.g. clearly seen at baseline.
- Data that are inherently multivariate time series.
- Scaling and range shifts between centres.
- Differentially responding subgroups of patients whose frequencies may vary across centres.

Adverse events

Count data for adverse events are variable (as count data are) and complicated by the large number of possible events and the high incidence on placebo. The large number of possible events means there is a real possibility of false positives, because so many tests are being performed. Some methods of analysis break down because there are many zero counts on placebo treatment.
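As a minimal sketch of both problems (with simulated counts and assumed group sizes, not real trial data), a Fisher's exact test per term copes with zero cells, and a Benjamini-Hochberg adjustment keeps the false-positive rate across the many terms under control:

# Minimal sketch (simulated data): testing many AE terms at once.
# Fisher's exact test copes with zero placebo counts, where a chi-square
# test would break down; the Benjamini-Hochberg adjustment controls the
# false discovery rate across the many terms tested.
set.seed(42)
n_active  <- 200   # assumed group sizes, for illustration only
n_placebo <- 200
n_terms   <- 50    # number of AE preferred terms

# Simulated AE counts: low background rates, many zeros on placebo
active_counts  <- rbinom(n_terms, n_active,  runif(n_terms, 0, 0.08))
placebo_counts <- rbinom(n_terms, n_placebo, runif(n_terms, 0, 0.04))

raw_p <- mapply(function(a, p) {
  fisher.test(matrix(c(a, n_active - a, p, n_placebo - p), nrow = 2))$p.value
}, active_counts, placebo_counts)

adj_p <- p.adjust(raw_p, method = "BH")   # false discovery rate control
data.frame(term = paste0("AE", seq_len(n_terms)),
           active = active_counts, placebo = placebo_counts,
           raw_p = round(raw_p, 3), adj_p = round(adj_p, 3))[adj_p < 0.1, ]

The point is not this particular test but that the multiplicity is handled explicitly rather than judged by eye.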

ECG Data

These data are increasingly collected (especially in phase II) and are multivariate, non-normal, longitudinal series of measures per patient. In fact the measurements are summaries derived from traces measured at two or three time points. The derivation of these measures needs a certain skill, and this introduces another source of variation. In addition, the assessment of abnormalities is not very reproducible between experts (20% of cases will be assessed differently).

Laboratory test result data

They have some similarities to ECG data - they are also multivariate, non-normal, correlated time series per patient. They are typically assessed using codings that compare the values to (more or less arbitrary) normal ranges. These limits are a univariate approach, which is well known from basic multivariate distribution theory to be problematical for correlated variables [2]. For an example see Figure 1, due to Merz [3]. This figure shows how high the misclassification rate can be using this method. And these misclassifications go both ways - signals missed that should have been flagged (FN in the figure) and values flagged that should not have been (FP).
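A small simulated sketch of the same point: for a strongly correlated pair of parameters, flags based on separate univariate limits and flags based on a joint (Mahalanobis-distance) criterion disagree in both directions - nothing here uses real lab data or real reference limits.

# Minimal sketch (simulated data): univariate normal-range flags versus a
# bivariate criterion for two correlated lab parameters.
library(MASS)   # for mvrnorm
set.seed(1)
Sigma <- matrix(c(1, 0.8, 0.8, 1), 2)            # strongly correlated pair
x <- mvrnorm(2000, mu = c(0, 0), Sigma = Sigma)

# Univariate "normal ranges": flag if either value is outside +/- 1.96 SD
uni_flag <- abs(x[, 1]) > 1.96 | abs(x[, 2]) > 1.96

# Bivariate criterion: Mahalanobis distance beyond the 95% chi-square cut-off
maha_flag <- mahalanobis(x, center = c(0, 0), cov = Sigma) > qchisq(0.95, df = 2)

table(univariate = uni_flag, bivariate = maha_flag)   # off-diagonal = disagreement

The off-diagonal cells of the table correspond to the FN and FP regions sketched in Figure 1.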

Against lab normal ranges

Normally we accept normal ranges at face value, and I have always wondered how they were derived. One reason for concern is that lab data are skewed, and estimating quantiles (the 95th percentile, for example) accurately needs a lot of data; ignoring the skewed shape and using theoretical limits based on a normal distribution would be misleading. A 1998 paper assessing lab normal ranges against a large (8000+ people) population found a situation of concern.

Abstract:

Background: When interpreting the results of clinical chemistry tests, physicians rely heavily on the reference intervals provided by the laboratory. It is assumed that these reference intervals are calculated from the results of tests done on healthy individuals, and, except when noted, apply to people of both genders and any age, race, or body build. While analyzing data from a large screening project, we had reason to question these assumptions.

Methods: The results of 20 serum chemistry tests performed on 8818 members of a state health insurance plan were analyzed. Subgroups were defined according to age, race, sex, and body mass index. A very healthy subgroup (n = 270) was also defined using a written questionnaire and the Duke Health Profile. Reference intervals for the results of each test calculated from the entire group and each subgroup were compared with those recommended by the laboratory that performed the tests and with each other. Telephone calls were made to four different clinical laboratories to determine how reference intervals are set, and standard recommendations and the relevant literature were reviewed.

Results: The results from our study population differed significantly from laboratory recommendations on 29 of the 39 reference limits examined, at least seven of which appeared to be clinically important. In the subpopulation comparisons, "healthy" compared with everyone else, old (> or = 75 years) compared with young, high (> or = 27.1) compared with low body mass index (BMI), and white compared with nonwhite, 2, 11, 10, and 0 limits differed, respectively. None of the contacted laboratories were following published recommendations for setting reference intervals for clinical chemistries. The methods used by the laboratories included acceptance of the intervals recommended by manufacturers of test equipment, analyses of all test results from the laboratory over time, and testing of employee volunteers.

Conclusions: Physicians should recognize when interpreting serum chemistry test results that the reference intervals provided may not have been determined properly. Clinical laboratories should more closely follow standard guidelines when setting reference intervals and provide more information to physicians regarding the population used to set them. Efforts should be made to provide appropriate intervals for patients of different body mass index and age.

Mold JW, Aspy CB, Blick KE, Lawler FH (1998) [4]

Figure 1 Univariate limits are misleading for correlated variables. FN is a false negative, and FP a false positive. Figure from Merz [3]

The situation may have improved now, although a recent survey of 169 laboratories by Friedberg et al. (2007) [5] would seem to argue that things have not changed. In any case this is just another argument for using the internal properties of the data we have rather than discarding information and using arbitrary classifiers.
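To make the point about skewed data concrete, here is a small simulated sketch: reference limits taken from the data's own percentiles versus limits computed as if the data were normal. The numbers are invented, but the pattern is typical of positively skewed lab parameters.

# Minimal sketch (simulated, log-normal "lab" values): reference limits taken
# directly from the data's own percentiles versus limits computed as if the
# data were normal. For skewed data the two can differ substantially.
set.seed(7)
lab <- rlnorm(5000, meanlog = log(30), sdlog = 0.5)    # skewed, positive values

empirical     <- quantile(lab, c(0.025, 0.975))         # data-driven limits
normal_theory <- mean(lab) + c(-1.96, 1.96) * sd(lab)   # assumes symmetry

round(rbind(empirical, normal_theory), 1)
mean(lab < normal_theory[1])   # proportion "flagged low" by the normal-theory limit

With the skew in this example the normal-theory lower limit falls below every observed value, so it flags nothing, while the empirical 2.5% limit behaves as intended.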

Levels of variation in safety data

There are multiple sources of variability in safety data, and this must be taken into account when analysing and, preferably, when plotting. There are large differences between patients, and with many repeated measures there are visit-to-visit correlations. The time scale of these correlations varies according to the lab parameter being observed, but for blood haemoglobin levels it should be at least 3 months (this being the replacement time for blood haemoglobin). So simple scatter plots of individual liver function enzymes (Figure 15) that ignore the patient dimension are liable to be misleading. It is also a corollary that repeated measurements are not worth as much as might be expected. On the plus side, we are lucky to have baseline data for treated patients and placebo patients during the whole time course of treatment. In cross-over trials we can estimate treatment differences within patients and escape even more variation.
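A minimal sketch (entirely simulated) of what these levels of variation mean in practice: a random-intercept model splits the variability of a repeatedly measured lab value into a between-patient and a within-patient component.

# Minimal sketch (simulated data): partitioning variability in a repeatedly
# measured lab value into between-patient and within-patient components with
# a random-intercept model (nlme ships with standard R installations).
library(nlme)
set.seed(3)
n_pat <- 40; n_visit <- 6
d <- data.frame(
  patient = factor(rep(1:n_pat, each = n_visit)),
  visit   = rep(1:n_visit, times = n_pat)
)
pat_effect <- rnorm(n_pat, sd = 8)                     # large between-patient spread
d$value <- 50 + pat_effect[as.integer(d$patient)] + rnorm(nrow(d), sd = 3)

fit <- lme(value ~ visit, random = ~ 1 | patient, data = d)
VarCorr(fit)   # between-patient (Intercept) vs within-patient (Residual) variance

# Intraclass correlation: how alike two measurements from the same patient are
v <- as.numeric(VarCorr(fit)[, "Variance"])
v[1] / sum(v)

The intraclass correlation at the end is the share of total variability attributable to differences between patients; in this simulation it is deliberately large.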

Why is there no more analysis than this?

Unlike those for efficacy endpoints, clinical hypotheses for safety endpoints are typically loosely defined. This often results in little attention being given to additional and more innovative approaches (including graphical ones). In a recent informal survey among over 200 statisticians involved in clinical trial design and analysis in GlaxoSmithKline, fewer than 5% routinely used graphical approaches to summarize safety data.

Amit, Heiberger, Lane (2007) [6]

I find this state of affairs shocking, although it fits with my experience of what reporting is done currently and what has been standard practice for the last 20 years.

I suspect the number using any (statistical) analytical method is even lower. And consider for a second how much money is spent on running lab tests on patients. We are looking at hundreds of dollars per time point, per replication. With a single patient’s lab data costing thousands of dollars, we should ask how much programming time that money would buy, and how much reading and reviewing time it might save.

A new paradigm for analysing safety data

It may be worth going so far as to say that the analysis of safety data should be aimed at identifying patients with unusual patterns of response and characterizing exactly what those responses are.

What can we do?

It is difficult to prove a negative - that there are no dangers - because there are many rare effects, such as Torsade de Pointes, which has an incidence of 1 in 100,000 in the general population.

If our drug increases the chance of that condition by a factor of 10, we still need to study thousands of patients to have a reasonable chance of detecting the problem. It all depends on the power of the test: how many patients? How long (in total) have they been exposed to the drug?
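A quick back-of-the-envelope sketch of those numbers (using the assumed rates above): the probability of seeing even one case among n exposed patients is 1 - (1 - p)^n.

# Minimal sketch of the numbers behind this point (assumed, illustrative rates):
# background incidence 1 in 100,000; drug multiplies the risk by 10.
p_background <- 1 / 100000
p_drug       <- 10 * p_background

n <- c(1000, 5000, 20000, 100000)          # patients exposed
# Probability of observing at least one case among n exposed patients
p_at_least_one <- 1 - (1 - p_drug)^n
round(data.frame(n, p_at_least_one), 3)

Even with 20,000 exposed patients the chance of observing at least one case of the 10-fold elevated event is only about 86% - and that is observation, not attribution.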

With safety data we really want to prove the null hypothesis, but not fall into the trap of declaring an issue when there is not one. So we look for issues comprehensively, but not analytically. Too much is left to ad hoc comparisons, which is no better. We group values and lose information (e.g. lab shift tables). We do simplistic univariate analyses. We list or tabulate endless subsets, without proper comparison.

We have a problem because more data are coming. How can we include genetic markers in this informal mess?

Undersized and over-clad

Efficacy analysis has always been more important and because of this studies are sized for tests planned for efficacy variables and undersized for accurately measuring safety issues. I believe another reason is that safety data are more amenable to standardisation and in many companies this was done 10-15 years ago according to good (or acceptable) practices at the time. Standardisation is good and saves money and needlessly repeated effort, but setting things in stone is also like fossilisation.

Why is ten years ago different?

Computer resources and, especially, software were different then, and the methods for creating graphics output for publication had a much longer turnaround time than now - although we thought (and we were right!) that a week was a big improvement on waiting for a technician to make a black-ink drawing with a Rotring® pen and tracing paper.

Statistics and statistical software have not stood still in the last 15 years. There are pictures, models and software that can help us.

Making progress

Modern statistical graphics was started by Tukey with his book Exploratory Data Analysis (EDA, published 31 years ago in 1977) [7]. In this book he developed and exemplified methods for examining data which were semi-analytical. By this I mean they were graphical but were also based on an underlying method. A good example of this is his treatment of two-way tables. He also developed the boxplot for displaying the distribution of data while exposing asymmetry and the presence of outliers. EDA is full of hand-drawn graphs; at that time sketching was the only way to quickly examine a set of data points. This highlights an important requirement for any graphics system: the effort of making a plot ‘for a quick look’ should be low enough to make speculation effortless. And when something interesting is found, the effort to create a report-quality output should also be as low as possible.
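In R, for example, that quick look is a one-liner; the sketch below (simulated, skewed lab values and an invented parameter label) is the kind of boxplot-by-treatment view Tukey had in mind.

# Minimal sketch (simulated data): a Tukey-style quick look at a skewed lab
# parameter by treatment. One plotting call is enough for the exploratory view.
set.seed(11)
d <- data.frame(
  trt = rep(c("Placebo", "Active"), each = 100),
  alt = rlnorm(200, meanlog = log(25), sdlog = rep(c(0.3, 0.5), each = 100))
)
boxplot(alt ~ trt, data = d, log = "y",
        ylab = "ALT (U/L, log scale)", main = "Quick-look boxplot")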

The development of statistical graphics really took off in the 80’s and 90’s with the work of Cleveland [8], Tufte [9] and others, who utilised experimental work on our perceptual mechanisms and a realization that good communication would result from designs made with those characteristics in mind.

That research and the rise of interactive statistical packages have made these methods mainstream. There have been good implementations available for some time in S-Plus, R, and JMP.

New graphics of note


Figure 2 Quartiles of EKG variables in two treatments over time (Harrell [12])

The advent of lattice and trellis graphics and high-resolution displays really made the use of static plots a viable method of data analysis. It is an important development because not all data is analysed statistically or by statisticians; much data analysis is done by scientists. Producing sets of plots conditioned on other variables can really show what factors are important, and such plots are especially useful when analysing data sets where the factors used for analysis have interactions. I have mentioned several books for those wanting to read more on this subject; I should also mention Frank Harrell’s graphics course, which is available online [10]. A useful survey with history and examples is Leland Wilkinson’s article on Presentation Graphics [11].
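A minimal lattice sketch (simulated data, with an invented dose-by-renal-function interaction) of what such a conditioned display looks like:

# Minimal sketch (simulated data): a trellis display of a lab value against
# dose, conditioned on a prognostic factor and grouped by sex. Interactions
# show up directly as panels with different slopes.
library(lattice)
set.seed(5)
d <- expand.grid(dose = c(0, 10, 20, 40), sex = c("F", "M"),
                 renal = c("Normal", "Impaired"), rep = 1:15)
# Assumed effect: a dose response only in the renally impaired subgroup
d$creatinine <- 80 + ifelse(d$renal == "Impaired", 0.8, 0.05) * d$dose +
                rnorm(nrow(d), sd = 8)

xyplot(creatinine ~ dose | renal, groups = sex, data = d,
       type = c("p", "r"),                 # points plus a regression line per group
       auto.key = list(columns = 2),
       xlab = "Dose (mg)", ylab = "Serum creatinine (umol/L)")

Because each panel shares the same scales, a slope that appears in only one panel is immediately visible.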

I will illustrate some of the new methods later in this paper, but for now I will just mention some of the most useful. Dotplots should replace barcharts as they are easier to interpret, more flexible and use less ink. Sparklines (Tufte [9]) put graphics in-line as word-sized pictures. The aim is to integrate argument and evidence in text, but sparklines have also been used to achieve high information densities; a web page of examples is maintained online. There are obvious possibilities here for plotting data for large numbers of patients on a single page for use as a compact patient profile.

There is some very interesting work by Frank Harrell in Rreport [12], which re-imagines the data and safety monitoring board (DSMB) report and displays of ECG data using half confidence intervals (see Figure 2), as well as time-to-event analyses for adverse events. The plots in Figure 2 also use shades of grey in a clever way, by plotting the treatment in grey after the placebo (in black) so differences are highlighted. The outer lines are 25% and 75% quantiles and the thicker bars depict medians. Vertical bars indicate half-widths of approximate 0.95 confidence intervals for differences in medians. When the distance between two medians exceeds the length of the bar, differences are significant at approximately the 0.05 level. The comparison of demographic data across treatment groups shown in the sample report is also interesting.
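The half-width idea is easy to reproduce. The sketch below uses a bootstrap interval for the difference in medians on simulated values; the bootstrap is my own assumption about how such an interval could be obtained, not necessarily how Rreport computes it.

# Minimal sketch (simulated data) of the idea behind the half-width bars.
# Note: the bootstrap here is an assumption of mine, not Rreport's method.
set.seed(9)
placebo <- rlnorm(120, log(400), 0.15)   # e.g. QTc-like values, placebo
active  <- rlnorm(120, log(410), 0.15)   # active treatment

boot_diff <- replicate(2000, {
  median(sample(active, replace = TRUE)) - median(sample(placebo, replace = TRUE))
})
ci <- quantile(boot_diff, c(0.025, 0.975))
half_width <- diff(ci) / 2

c(median_difference = median(active) - median(placebo),
  half_width = unname(half_width))

If two plotted medians are further apart than this half-width, the difference is significant at roughly the 0.05 level - exactly the reading rule described above.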

The HH R package [13], which accompanies the book by Heiberger and Holland [14], includes the R function Ae.dotplot. We will see examples and adaptations of this later.
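To give a flavour now, here is a plain-lattice imitation of that kind of display on simulated AE counts; it illustrates the idea only and does not use the HH function or its interface.

# Minimal sketch (simulated data): incidence of each AE term on the two arms,
# one dot per term and arm, sorted by incidence on the active arm.
library(lattice)
set.seed(13)
terms <- c("Headache", "Nausea", "Dizziness", "Rash", "Insomnia", "Diarrhoea")
n_a <- 200; n_p <- 200
a <- rbinom(length(terms), n_a, c(0.12, 0.08, 0.07, 0.05, 0.04, 0.03))
p <- rbinom(length(terms), n_p, c(0.10, 0.05, 0.03, 0.02, 0.04, 0.02))

inc <- data.frame(term = factor(rep(terms, 2), levels = terms[order(a / n_a)]),
                  arm  = rep(c("Active", "Placebo"), each = length(terms)),
                  pct  = 100 * c(a / n_a, p / n_p))

dotplot(term ~ pct, groups = arm, data = inc, auto.key = list(columns = 2),
        xlab = "Incidence (%)", main = "AE incidence by treatment arm")

The published AE dotplot design adds a second panel showing relative risks with confidence intervals; the principle of one dot per term and arm, sorted by incidence, is the same.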


Figure 3 Scaled graph-theoretic measures (Wilkinson et al. [17])

Another approach, which is not strictly graphical and not strictly modelling, is scagnostics - an idea from John Tukey that he never fully developed [16]. The term is a(nother) Tukey neologism, a portmanteau word derived from ‘scatter plot diagnostics’. The aim is to quantitatively characterize a two-way scatter plot and therefore allow an automated or semi-automated search for the interesting features in data. This looks like a promising approach for cases where we look for ‘issues’ without being able to specify in advance all the possible patterns that might be important. Wilkinson et al. [17], [18] have expanded and extended this work.
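As a hedged sketch: the CRAN scagnostics package (an assumption on my part; it implements Wilkinson's measures and needs Java via rJava) computes these measures directly from two numeric vectors.

# Minimal sketch using the CRAN 'scagnostics' package (assumed interface).
# install.packages("scagnostics")
library(scagnostics)
set.seed(17)

# Two contrasting scatter plots: structureless noise versus a clumpy pattern
x1 <- rnorm(500); y1 <- rnorm(500)
x2 <- c(rnorm(250, -2, 0.3), rnorm(250, 2, 0.3))
y2 <- c(rnorm(250, -2, 0.3), rnorm(250, 2, 0.3))

scagnostics(x1, y1)   # measures such as Outlying, Clumpy, Striated, Monotonic
scagnostics(x2, y2)   # the clumpy pattern should score higher on "Clumpy"

Ranking thousands of candidate scatter plots by, say, their Outlying or Clumpy scores is one way to automate the first pass of a search for issues.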