Lo-Ciganic et al, Machine Learning and Medication Adherence Thresholds: Online Supplement

Online Supplement

eMethod. Technical Appendix

eTable 1. Operational Definitions for Diabetes-related Hospitalizations

eTable 2. Hospitalization Rates during the Post-Index Year

eTable 3. PDC and Hazard Ratios for Each Terminal Node in Survival Tree: All-Cause Hospitalizations

eTable 4. Multivariate Cox Proportional Models with Same Set of Predictors in Survival Tree: All-Cause Hospitalizations

eTable 5. PDC and Hazard Ratios for Each Terminal Node in Survival Tree: Diabetes-Related Hospitalizations

eTable 6. Multivariate Cox Proportional Models with Same Set of Predictors in Survival Tree: Diabetes-Related Hospitalizations

eFigure 1. Sample Size Flow Chart

eFigure 2. Important Predictors of Diabetes-related Hospitalizations Selected by Minimal Depth from Random Survival Forests

eFigure 3. Adherence Thresholds associated with Risk of Diabetes-related Hospitalizations: A Survival Tree

eMethod. Technical Appendix

Random Survival Forests

Details of random survival forest techniques are described elsewhere.1-3 To select the most important predictors of hospitalizations, we constructed a random survival forest of 1,000 survival trees, where each tree was grown from an independent bootstrap sample of the training sample. At each node, a random subset of predictors was chosen as candidates to split the node into 2 branches; the number of variables assessed at each node was the square root of the total number of variables (e.g., the square root of 14, rounded to 4). Among the randomly selected variables, the variable whose split yielded the highest log-rank value was chosen to occupy the first node.4 Categorical variables were split according to their categories, and continuous variables were split at randomly selected cut points (nsplit=5 in our study). For each subsequent node of the tree, the random selection of candidate predictors and the selection of the best split or threshold were repeated. The process continued until we reached a unique subset that contained no fewer than three hospitalization events (Figure A).2

From these individual survival trees, we identified important variables by the average minimal depth of each variable from the tree trunk.2 The most predictive variables were defined as those whose average minimal depth (i.e., the depth of the split node nearest to the root node) was smaller than a threshold: the minimal depth of a variable that was unrelated to the survival distribution, determined under the null hypothesis of no effect.5 The smaller the minimal depth, the greater the association with the dependent variable and hence the impact of that variable on prediction. Simply because of random chance, a variable with no predictive power may on occasion split at a lower depth when a large number of trees are grown. Thus, similar to prior work, we used the average minimal depth of such an unrelated variable as the threshold and treated as important predictors those variables whose average minimal depths were less than that threshold. These important predictors were then used to construct the survival tree described in the next section.

We assessed the prediction accuracy of random survival forests by the Harrell concordance index (C-index) using the out-of-bag method (i.e., each tree was grown on a bootstrap sample covering about 2/3 of the training sample and evaluated on the remaining, out-of-bag patients).2,6 The C-index is defined as the probability of concordance among usable pairs, that is, pairs in which at least one patient had an event. It can be interpreted as the probability that a patient from the event group has a higher predicted probability of having an event than a patient from the non-event group. Unlike other measures of survival performance, Harrell’s C-index does not depend on choosing a fixed time for evaluation of the model and specifically takes into account censoring of individuals.7 The prediction error rate is 1 minus the C-index. A small prediction error is preferred; however, there is no gold standard for how small an error is desirable. Previous studies have reported prediction error rates ranging from 25% to 40%, which indicates some promise for the predictors but is not ideal and reflects the complexity of predicting health outcomes. In these studies, random survival trees performed as well as or better than traditional models.8,9
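
The C-index described above can be illustrated with a minimal sketch. This is not the authors' code: the function name, the toy data, and the handling of ties (tied predictions count as half, tied event times are skipped) are assumptions made for illustration. The error rates reported alongside the C-statistics in eTables 4 and 6 correspond to 1 minus the C-statistic (e.g., 1 - 0.672 = 0.328).

```python
def harrell_c_index(times, events, risk_scores):
    """Return Harrell's C-index for right-censored data.

    times       -- follow-up time for each patient
    events      -- 1 if the patient had the event, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # each usable pair is anchored on an observed event
        for j in range(n):
            # Usable pair: patient i had the event strictly before the end
            # of patient j's follow-up, so i is known to have "failed first".
            if times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # predicted ordering matches the outcome
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Toy data (hypothetical): months to hospitalization, event flag, predicted risk.
months = [2, 4, 6, 8]
hospitalized = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]

c_index = harrell_c_index(months, hospitalized, risk)
error_rate = 1.0 - c_index
print(f"C-index = {c_index:.3f}, error rate = {error_rate:.3f}")
# prints: C-index = 0.800, error rate = 0.200
```

In the toy data, 5 pairs are usable and 4 of them are ordered correctly, which gives 4/5 = 0.8. No evaluation time point is chosen anywhere, and the censored patient only enters pairs in which the other patient's event came first.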

Survival Trees

We then fit survival trees with the important predictors identified from random survival forests and explored the optimal threshold of adherence to oral hypoglycemics that was most strongly associated with hospitalizations.10 Briefly, a survival tree starts with a root that includes all patients from the training sample and uses binary recursive partitioning to systematically search among all predictors for variables that segment the target population into increasingly homogeneous subgroups with respect to the outcome of interest (i.e., hospitalizations in this study). For continuous variables (e.g., PDC), the algorithm searched for the threshold value that optimally split patients into groups with similar hospitalization risk. Partitioning stopped when the hospitalization risks of the two partitioned subgroups were not statistically different based on log-rank tests or when a terminal node would contain fewer than 20 patients. In addition, we used 10-fold cross-validation to guard against model over-fitting (complexity parameter=0.005). We calculated hazard ratios (HRs) for each terminal node. To compare prediction performance, we compared C-indices with 95% confidence intervals (CIs) between the final survival tree and the Cox proportional hazards model.6,11
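
The threshold search for a continuous predictor such as PDC can be sketched as follows. This is an illustrative sketch under stated assumptions, not the software used in the study: the function names and toy data are hypothetical, the split criterion is a plain two-group log-rank chi-square statistic, and the significance test and cross-validated pruning described above are omitted.

```python
def logrank_statistic(times, events, in_left):
    """Two-group log-rank chi-square statistic (larger = better separated)."""
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [k for k in range(len(times)) if times[k] >= t]
        n = len(at_risk)                                # still at risk at time t
        n_left = sum(1 for k in at_risk if in_left[k])  # ... in the left node
        d = sum(1 for k in at_risk if times[k] == t and events[k])
        d_left = sum(1 for k in at_risk
                     if times[k] == t and events[k] and in_left[k])
        observed += d_left
        expected += d * n_left / n
        if n > 1:
            variance += d * (n_left / n) * (1 - n_left / n) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance if variance > 0 else 0.0


def best_threshold(values, times, events, min_node_size=20):
    """Try every observed value as a cut point; keep the largest statistic."""
    best_cut, best_stat = None, 0.0
    for cut in sorted(set(values))[:-1]:
        in_left = [v <= cut for v in values]
        n_left = sum(in_left)
        if min(n_left, len(values) - n_left) < min_node_size:
            continue  # a child node would be too small
        stat = logrank_statistic(times, events, in_left)
        if stat > best_stat:
            best_cut, best_stat = cut, stat
    return best_cut, best_stat


# Toy data (hypothetical): PDC, months of follow-up, hospitalization flag.
pdc = [0.2, 0.3, 0.4, 0.5, 0.8, 0.85, 0.9, 0.95]
months = [1, 2, 3, 4, 12, 12, 12, 12]
hospitalized = [1, 1, 1, 1, 0, 0, 0, 0]

# min_node_size=4 is a scaled-down stand-in for the 20-patient minimum.
cut, stat = best_threshold(pdc, months, hospitalized, min_node_size=4)
print(cut, round(stat, 2))
```

With only eight toy patients, the node-size constraint leaves a single admissible cut, so the example mainly shows the mechanics; on real data every admissible PDC value competes and the one with the largest log-rank statistic becomes the threshold.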

Figure A. Illustration of a Random Tree from Random Survival Forests. A bootstrap sample of patients from the original dataset is used to build a random tree. At each open circle, a randomly selected subset of variables (e.g., PDC, age) competes to split the node. Among these, the single variable that best discriminates between events and non-events is chosen to permanently split the node. Node levels are numbered based on their distance from the root of the tree (i.e., level 1, 2, 3). Splitting of nodes continues until the terminal nodes contain few distinct events. Each terminal node (*) contains a group of patients with unique characteristics and a survival curve describing their outcome.
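
The random competition at each node can be sketched in a few lines. The predictor names below are placeholders (the 14 study predictors are not listed here), the function names are hypothetical, and rounding the square root with `round` is an assumption; forest software may round up instead, which gives the same value of 4 for 14 predictors.

```python
import math
import random


def n_candidates(n_predictors):
    """Number of predictors tried at each node: square root, rounded."""
    return round(math.sqrt(n_predictors))


def draw_candidates(predictors, rng):
    """Randomly pick the predictors that compete to split one node."""
    return rng.sample(predictors, n_candidates(len(predictors)))


def draw_cut_points(values, nsplit, rng):
    """Pick up to `nsplit` random candidate cut points for a continuous predictor."""
    distinct = sorted(set(values))[:-1]  # the largest value cannot split the node
    return sorted(rng.sample(distinct, min(nsplit, len(distinct))))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
predictors = [f"predictor_{i}" for i in range(1, 15)]  # 14 placeholder names

candidates = draw_candidates(predictors, rng)
cuts = draw_cut_points([0.03, 0.2, 0.4, 0.5, 0.66, 0.8, 0.9, 1.0], 5, rng)
print(n_candidates(14), candidates, cuts)
```

Each node repeats both draws, so different trees, and different nodes within a tree, consider different predictors and different cut points.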

eTable 1. Operational Definitions for Diabetes-related Hospitalizations

Type of diabetes-related hospitalizations / ICD-9 or CPT codes
Diabetes / 250.xx
Hyperglycemia / 250.1x, 250.2x, 250.3x
Hypoglycemia / 250.8x, 251.1x, 251.2x
Septicemia or bacteremia / 038.xx, 790.7
Pneumonia / 480-6
Kidney infections, cystitis, urinary tract infection / 590, 595, 599.0x
Cellulitis / 680-682, 686
Electrolyte imbalance / 276.xx
Diabetes retinopathy / 250.5x, 361.xx, 362.0x, 362.1, 362.8x, 379.23, 369.xx
Diabetic nephropathy / 250.4x, 585.xx, 593.9
Diabetic neuropathy / 250.6x, 356.9, 357.2x
Ischemic heart disease / 410-414, v45.81, v45.82; CPT codes: 36.1x, 36.2x, 00.66, 36.06, 36.07
Stroke / 433-434
Diabetes peripheral circulatory disorders / 250.7x, 440.2x, 707.1x, 785.4x, v49.6, v49.7; CPT codes: 84.0x, 84.1x [excluded if any diagnosis is 895-897]
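
As an illustration of how the wildcard codes in eTable 1 might be applied to claims data, the sketch below matches a diagnosis code against a few rows copied from the table. It is not the authors' code: the function and variable names are hypothetical, only four rows are included, and treating each "x" as one optional digit is an assumption about the notation.

```python
import re

# A few rows of eTable 1, copied as written.
CODE_PATTERNS = {
    "Diabetes": ["250.xx"],
    "Hyperglycemia": ["250.1x", "250.2x", "250.3x"],
    "Hypoglycemia": ["250.8x", "251.1x", "251.2x"],
    "Electrolyte imbalance": ["276.xx"],
}


def pattern_to_regex(pattern):
    """Turn a table pattern into a regex; each 'x' is one optional digit (assumption)."""
    return re.compile(re.escape(pattern).replace("x", r"\d?"))


COMPILED = {
    name: [pattern_to_regex(p) for p in patterns]
    for name, patterns in CODE_PATTERNS.items()
}


def matching_categories(code):
    """Return every category whose patterns match the whole ICD-9 code."""
    return [
        name
        for name, regexes in COMPILED.items()
        if any(regex.fullmatch(code) for regex in regexes)
    ]


print(matching_categories("250.13"))  # ['Diabetes', 'Hyperglycemia']
print(matching_categories("401.9"))   # []
```

A code can fall into more than one category because the rows overlap (250.1x is also covered by 250.xx), so the function returns a list rather than a single label.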

eTable 2. Hospitalization Rates during the Post-Index Year

Hospitalizations / Training sample (N=29,855) / Testing sample (N=3,275)
All-cause hospitalizations, n (%)
1 / 4,224 (14.2) / 478 (14.6)
≥ 2 / 2,936 (9.8) / 331 (10.1)
Number of months to first all-cause hospitalizations, mean (SD)/median (min-max) / 5.2 (3.5)/ 4.7 (0.03-12) / 5.2 (3.6)/ 4.9 (0.03-12)
Diabetes-related hospitalizations, n (%)a
1 / 2,822 (9.4) / 327 (10.0)
≥ 2 / 1,124 (3.8) / 118 (3.6)
Number of months to first diabetes-related hospitalizations, mean (SD)/median (min-max) / 5.6 (3.5)/ 5.6 (0.03-12) / 5.7 (3.6)/ 5.3 (0.03-12)

a: Diabetes-related hospitalizations were defined as inpatient admissions during the post-index year with an ICD-9 code as the “primary” discharge diagnosis or a Current Procedural Terminology code in any position, including diabetes, hyperglycemia, hypoglycemia, septicemia or bacteremia, pneumonia, kidney infections, cystitis, urinary tract infection, cellulitis, electrolyte imbalance, diabetes retinopathy, diabetic nephropathy, diabetic neuropathy, ischemic heart disease, stroke, and diabetes peripheral circulatory disorders.

eTable 3. PDC and Hazard Ratios for Each Terminal Node in Survival Tree: All-Cause Hospitalizations

Terminal nodes / N / Average PDC (SD) / Median PDC (min, max) / HR (95% CI)
A / 7,989 / 0.63 (0.26) / 0.66 (0.03, 1.00) / Referent
B / 1,186 / 0.66 (0.25) / 0.68 (0.03, 1.00) / 1.42 (1.22, 1.64)
C / 3,367 / 0.76 (0.22) / 0.84 (0.03, 1.00) / 1.41 (1.28, 1.56)
D / 956 / 0.87 (0.10) / 0.91 (0.63, 1.00) / 1.78 (1.54, 2.06)
E / 303 / 0.45 (0.13) / 0.47 (0.14, 0.62) / 3.16 (2.59, 3.84)
F / 2,130 / 0.82 (0.12) / 0.84 (0.60, 1.00) / 1.64 (1.46, 1.83)
G / 1,728 / 0.38 (0.13) / 0.40 (0.04, 0.59) / 2.40 (2.16, 2.67)
H / 481 / 0.78 (0.21) / 0.85 (0.16, 1.00) / 3.12 (2.67, 3.69)
I / 291 / 0.97 (0.01) / 0.96 (0.95, 1.00) / 0.94 (0.67, 1.30)
J / 4,198 / 0.50 (0.25) / 0.49 (0.03, 0.94) / 1.94 (1.78, 2.12)
K / 484 / 0.92 (0.04) / 0.92 (0.84, 1.00) / 1.88 (1.54, 2.28)
L / 1,462 / 0.47 (0.21) / 0.48 (0.04, 0.83) / 3.03 (2.74, 3.38)
M / 1,879 / 0.78 (0.16) / 0.81 (0.47, 1.00) / 2.71 (2.46, 3.00)
N / 554 / 0.32 (0.10) / 0.33 (0.07, 0.46) / 3.93 (3.44, 4.56)
O / 1,347 / 0.82 (0.13) / 0.85 (0.57, 1.00) / 3.43 (3.10, 3.83)
P / 694 / 0.37 (0.13) / 0.39 (0.03, 0.56) / 5.54 (4.98, 6.30)
Q / 806 / 0.72 (0.23) / 0.79 (0.07, 1.00) / 6.02 (5.46, 6.79)

eTable 4. Multivariate Cox Proportional Models with Same Set of Predictors in Survival Tree: All-Cause Hospitalizations

Predictor / HR (95% CI) / P value
Prior hospitalizations or ED visits / 1.67 (1.59, 1.76) / <0.0001
Had insulin fills during the index year (reference= non-users)
0-<90 days / 1.41 (1.29, 1.55) / <0.0001
≥ 90 days / 1.27 (1.20, 1.34) / <0.0001
Had diabetes comorbidities (ref=DCSI=0) / 1.42 (1.35, 1.49) / <0.0001
PDC / 0.53 (0.48, 0.58) / <0.0001
Number of monthly total prescriptions / 1.06 (1.06, 1.07) / 0.0001
C-statistics (error rate)* / 0.672 (32.8%)

Abbreviations: DCSI: diabetes comorbidity severity index; ED: emergency department; HR: hazard ratios; PDC: proportion of days covered
* Error rate for the survival tree was 26%

eTable 5. PDC and Hazard Ratios for Each Terminal Node in Survival Tree: Diabetes-Related Hospitalizations

Terminal nodes / N / Average PDC (SD) / Median PDC (min, max) / HR (95% CI)
A / 11,356 / 0.67 (0.26) / 0.72 (0.03, 1.00) / Referent
B / 2,869 / 0.52 (0.27) / 0.51 (0.03, 1.00) / 1.31 (1.14, 1.51)
C / 426 / 0.96 (0.02) / 0.96 (0.93, 1.00) / 1.25 (0.89, 1.73)
D / 1,889 / 0.59 (0.23) / 0.62 (0.03, 0.92) / 2.27 (1.98, 2.60)
E / 1,695 / 0.82 (0.12) / 0.84 (0.60, 1.00) / 1.75 (1.51, 2.05)
F / 793 / 0.81 (0.12) / 0.81 (0.60, 1.00) / 2.72 (2.28, 3.27)
G / 1,304 / 0.35 (0.14) / 0.34 (0.07, 0.59) / 2.52 (2.18, 2.95)
H / 990 / 0.39 (0.13) / 0.41 (0.04, 0.59) / 3.68 (3.19, 4.27)
I / 2,440 / 0.84 (0.11) / 0.87 (0.62, 1.00) / 1.74 (1.51, 2.00)
J / 1,612 / 0.39 (0.14) / 0.41 (0.04, 0.59) / 2.71 (2.39, 3.14)
K / 1,181 / 0.77 (0.21) / 0.84 (0.11, 1.00) / 4.04 (3.56, 4.65)
L / 612 / 0.89 (0.08) / 0.91 (0.73, 1.00) / 2.62 (2.14, 3.21)
M / 683 / 0.47 (0.17) / 0.49 (0.09, 0.72) / 4.02 (3.51, 4.87)
N / 1,069 / 0.82 (0.12) / 0.84 (0.60, 1.00) / 4.79 (4.20, 5.47)
O / 936 / 0.37 (0.14) / 0.37 (0.03, 0.59) / 6.64 (5.94, 7.64)

eTable 6. Multivariate Cox Proportional Models with Same Set of Predictors in Survival Tree: Diabetes-Related Hospitalizations

Predictor / HR (95% CI) / P value
Prior hospitalizations or ED visits / 1.52 (1.42, 1.62) / <0.0001
Had insulin fills during the index year (reference= non-users)
0-<90 days / 1.72 (1.53, 1.93) / <0.0001
≥ 90 days / 1.71 (1.60, 1.83) / <0.0001
Had diabetes comorbidities (ref=DCSI=0) / 1.65 (1.54, 1.77) / <0.0001
PDC / 0.52 (0.46, 0.58) / <0.0001
Number of monthly total prescriptions / 1.07 (1.05, 1.06) / <0.0001
C-statistics (error rate)* / 0.669 (33.1%)

Abbreviations: DCSI: diabetes comorbidity severity index; ED: emergency department; HR: hazard ratios; PDC: proportion of days covered
* Error rate for the survival tree was 29%

eFigure 1. Sample Size Flow Chart
Abbreviations: OHA: oral hypoglycemic agents; T1DM: type 1 diabetes mellitus; T2DM: type 2 diabetes mellitus
Note: Three other exclusion criteria were applied but are not listed in the chart because n=0: (1) women who used metformin only and had a diagnosis of polycystic ovary syndrome but no diagnosis of diabetes, (2) hyperglycemia not otherwise specified (ICD-9 790.6 without any diabetes code).

eFigure 2. Important Predictors of Diabetes-related Hospitalizations Selected by Minimal Depth from Random Survival Forests

Note: From 1,000 individual survival trees, the most predictive variables were defined as those whose average minimal depth (i.e., split nodes nearest to the root node) was smaller than the minimal depth of a variable that was unrelated to the survival distribution and determined under the null hypothesis of no effect (i.e., the threshold). The threshold was calculated from a variable whose average minimal depth behaves like a random coin-tossing experiment, or whose average minimal depth increases little as the number of variables increases. The horizontal dashed line in the figure is the threshold for filtering variables. All variables below the line are important predictors.
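
The filtering rule in this note can be sketched as follows. The depths and the threshold value are invented for illustration, the function name is hypothetical, and the threshold is taken as a given input rather than derived from the null distribution as in the study.

```python
from statistics import mean


def select_important(minimal_depths, threshold):
    """Average each variable's minimal depth over trees; keep those below the threshold."""
    averages = {name: mean(depths) for name, depths in minimal_depths.items()}
    important = sorted(
        (name for name, avg in averages.items() if avg < threshold),
        key=averages.get,  # most important (shallowest) first
    )
    return important, averages


# Hypothetical minimal depths of three variables in four trees.
minimal_depths = {
    "PDC": [1, 2, 2, 3],
    "prior_hospitalization_or_ED": [2, 1, 1, 2],
    "unrelated_variable": [4, 6, 5, 7],
}
threshold = 4.0  # hypothetical value standing in for the dashed line

important, averages = select_important(minimal_depths, threshold)
print(important)  # ['prior_hospitalization_or_ED', 'PDC']
```

Variables that tend to split close to the root have small average depths and pass the filter, while a variable that splits only deep in the trees does not.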