Title: Gini index-based maximum concentration and area under the curve split points for analyzing adverse event occurrence in bioequivalence studies

Journal: Pharmaceutical Medicine

Authors: Blanca L. Torres-García, Lucila I. Castro-Pastrana, Sara Rodríguez-Rodríguez, Larisa Estrada-Marín, Beatriz Cedillo-Carvallo, Olga Guzmán-García, Alejandro Ruíz-Argüelles.

**Corresponding author:** Lucila I. Castro-Pastrana, Q.F.B., Ph.D.

Departamento de Ciencias Químico Biológicas, Universidad de las Américas Puebla, San Andrés Cholula, Puebla, México.

Email:

**Electronic supplementary material 1: Using the Gini coefficient method to estimate Cmax and AUC0-t split points in relation to adverse event occurrence in bioequivalence studies**

Abbreviations: Cmax (maximum plasma concentration), AUC0-t (area under the plasma concentration curve from administration to last observed concentration at time t)

We propose using the Gini index to obtain a point of division (or split point, v) that reveals a possible relationship between the Cmax (or AUC0-t) values and the number of adverse events occurring above or below this value v. Specifically, the adverse events considered in our study were 'suspected adverse drug reactions' (SADRs). A SADR is any adverse event for which there is evidence to suggest a causal relationship between the drug and the adverse event [1].

The Gini index is commonly used in economics as a measure of income inequality. However, Tan et al. [2] suggested that it is also applicable as a method for finding the best point of division in data classification, specifically in the construction of decision trees. According to Hastie et al. [3], there are similar methods, but the Gini index has the property of being differentiable and thus more amenable to numerical optimization, as well as being sensitive to changes in probability at the nodes.

Our data set consists of pharmacokinetic parameters (Cmax and AUC0-t) measured during bioequivalence studies and the SADRs presented by the subjects participating in these studies.

Here we obtain a point of division (v) that divides our data set into subsets that are homogeneous with respect to the number of SADRs within each of them. That is, once v is found, it is assumed to divide the data set into two homogeneous subsets of pure classes, based on the number of SADRs occurring at or below a certain level of Cmax (Cmax <= v) and the number of SADRs occurring above that level (Cmax > v).

A smaller Gini value (degree of impurity) indicates a more skewed class distribution. For example, two observations of Cmax with class distribution (0, 2) mean that Cmax1 presents 0 SADRs and Cmax2 presents 2 SADRs, and a degree of impurity equal to zero is obtained (Gini = 0). Two observations of Cmax with class distribution (1, 1) mean that Cmax1 presents 1 SADR and Cmax2 presents 1 SADR, and a higher degree of impurity is obtained.

The best point of division (v) can be found by evaluating the Gini index (equation 1) for each candidate value to become a point of division,

$$\mathrm{Gini}(t) = 1 - \sum_{j=1}^{c} \left[ p(j \mid t) \right]^{2} \qquad \text{(equation 1)}$$

where c represents the number of classes or, in our case, the total number of SADRs that can occur, and p(j | t) denotes the fraction of SADRs belonging to class j at a given node t. The best point of division will be the one with the smallest Gini coefficient.
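The two class distributions discussed above, (0, 2) and (1, 1), can be checked numerically. The following is a minimal sketch, assuming the standard Gini impurity (one minus the sum of squared class fractions) described by Tan et al. [2]; the function name `gini_impurity` is illustrative and not part of the original analysis.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node, given the number of observations in each class."""
    total = sum(class_counts)
    if total == 0:
        # Assumption: an empty node is treated as pure (impurity 0).
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


print(gini_impurity([0, 2]))  # 0.0 (pure node)
print(gini_impurity([1, 1]))  # 0.5 (maximally mixed for two classes)
```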

To show step by step how the optimal point of division is selected, we chose the data from one period of the single-dose bioequivalence study for leuprolide 3.75 mg. Subjects were randomized to either of the two treatment arms (generic or originator). Of the 15 vasectomized men who participated in the study, only 2 had SADRs. One subject presented testicular pain (mild) and hot flashes (mild). The second subject presented myalgia (mild).

Table 1 shows the measured Cmax values for each volunteer. In the upper part of the table, the number of SADRs that each subject presented is shown. In this study, the maximum number of SADRs was 2.

We started by sorting the Cmax values in ascending order, alongside their corresponding numbers of SADRs. Then, the candidate points of division were identified as the midpoints between adjacent observations of Cmax. These are presented in the first four rows of Table 1.
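The sorting and midpoint step can be sketched as follows. This is only an illustration: the Cmax values are hypothetical and are not the study data, and the function name `candidate_split_points` is an assumption introduced here.

```python
def candidate_split_points(values):
    """Midpoints between adjacent distinct values, after sorting in ascending order."""
    ordered = sorted(values)
    return [(a + b) / 2 for a, b in zip(ordered, ordered[1:]) if a != b]


# Hypothetical Cmax values (ng/mL), not taken from Table 1.
cmax = [9.0, 7.0, 8.0, 12.0]
print(candidate_split_points(cmax))  # [7.5, 8.5, 10.5]
```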

The first candidate point of division is Cmax = 7.3. The number of times that 0, 1 and 2 SADRs occurred is then counted for Cmax <= 7.3 and, in the same way, for Cmax > 7.3.

Subsequently, the Gini index is calculated for each node. For the first row of the table, Gini(1) = 0; the Gini index of the second node, Gini(2), is obtained in the same way, and the two are combined into the general Gini index for this candidate point of division (see Table 1).

For the second candidate value, Cmax = 7.9575, and the third candidate value, Cmax = 8.787, the Gini index of each node and the general Gini index for the candidate point of division are calculated in the same manner (see Table 1).
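The combination of the two node-level values into a general Gini index can be written as below. This is a sketch assuming the size-weighted average used for decision-tree splits in Tan et al. [2]; the symbols n_1, n_2 and n are introduced here for illustration and do not appear in the original text.

$$\mathrm{Gini}_{\text{split}}(v) = \frac{n_1}{n}\,\mathrm{Gini}(1) + \frac{n_2}{n}\,\mathrm{Gini}(2)$$

Here n_1 is the number of observations with Cmax <= v, n_2 is the number with Cmax > v, and n = n_1 + n_2.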

This calculation is repeated until the Gini indexes for all the Cmax candidate values have been obtained. The best point of division corresponds to the candidate value with the smallest Gini index. For our data set, this is Cmax = 10.062 (see Table 1).
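The whole search can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: each subject's SADR count (0, 1 or 2) is treated as its class, the general Gini index of a candidate is taken as the size-weighted average of the two node impurities, and the data are hypothetical. All function names are introduced here for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node; labels are per-subject SADR counts."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node counts as pure
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())


def split_gini(values, labels, v):
    """Size-weighted Gini index of the two nodes produced by split point v."""
    below = [y for x, y in zip(values, labels) if x <= v]
    above = [y for x, y in zip(values, labels) if x > v]
    n = len(labels)
    return len(below) / n * gini(below) + len(above) / n * gini(above)


def best_split(values, labels):
    """Candidate midpoint with the smallest weighted Gini index."""
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return min(candidates, key=lambda v: split_gini(values, labels, v))


# Hypothetical data, not the leuprolide data set from Table 1.
cmax = [6.0, 7.0, 8.0, 9.0, 11.0, 12.0, 13.0, 14.0]
sadrs = [0, 2, 1, 0, 0, 0, 0, 0]

v = best_split(cmax, sadrs)
print(v, round(split_gini(cmax, sadrs, v), 4))  # 8.5 0.25
```

In this toy example every subject above 8.5 has zero SADRs, so the upper node is pure and the split at 8.5 gives the lowest weighted impurity.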

After estimating the split point, we calculated the mean SADR occurrence above and below that point as the ratio between the number of SADRs (above or below) and the number of PK values (above or below). The Cmax split point for leuprolide was 10.062 ng/mL. Among the 15 Cmax values, a total of 3 SADRs were below the split point, giving an average of 0.4286 (3 SADRs / 7 Cmax values). This means that, according to our data set, an occurrence rate of 0.4286 SADRs per subject was estimated for subjects with Cmax values below the split point. Above the split point, the average frequency of SADRs was zero (0.0000) (0 SADRs / 8 Cmax values). Figure 1 shows the corresponding decision tree, constructed using the R project based on the estimated Gini index.
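The two ratios reported above can be reproduced with a short calculation. This is a minimal sketch using only the counts stated in the text (3 SADRs among 7 Cmax values below the split point, 0 SADRs among 8 above it); the function name `mean_sadrs` is illustrative.

```python
def mean_sadrs(n_sadrs, n_values):
    """Mean number of SADRs per PK value on one side of the split point."""
    if n_values == 0:
        return float("nan")  # assumption: undefined when a side has no values
    return n_sadrs / n_values


below = mean_sadrs(3, 7)  # Cmax <= 10.062 ng/mL
above = mean_sadrs(0, 8)  # Cmax > 10.062 ng/mL
print(f"{below:.4f} {above:.4f}")  # 0.4286 0.0000
```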

This procedure was repeated for each bioequivalence study included in our paper and for both PK parameters, Cmax and AUC0-t.

These values were incorporated into Table 3 of our paper with the decimal figures rounded.

References

1. Food and Drug Administration. Guidance for industry and investigators. Safety reporting requirements for INDs and BA/BE studies. Silver Spring, MD: FDA. 2012: 32.

2. Tan PN, Steinbach M, Kumar V. Classification: basic concepts, decision trees and model evaluation. Chapter 4. In: Introduction to data mining. Boston: Pearson Education; 2006. pp. 158-163.

3. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. Data mining, inference, and prediction. 2nd ed. (Springer Series in Statistics) New York: Springer; 2009. pp. 309-312.

**Table 1 (Supplementary material 1).** Data set of Cmax values and SADRs registered during a study evaluating the bioequivalence of a 3.75 mg leuprolide generic formulation with the originator product, in 15 vasectomized men. Estimations of Gini index values for candidate split points are shown.

SADRs = suspected adverse drug reactions

Cmax = maximum plasma concentration

**Figure 1 (Supplementary material 1).** Classification tree for the estimated number of SADRs occurring below and above the calculated Cmax split point for leuprolide 3.75 mg. The leuprolide bioequivalence study data set was analyzed using the Gini index method.

SADRs = suspected adverse drug reactions

Cmax = maximum plasma concentration