Supplemental Digital Content

Appendix A. Validation Framework for Spasticity Screening Tool

Study Design and Statistical Analysis Plan

This statistical analysis plan (SAP) outlines the minimum study design elements and analyses needed to evaluate the performance of the screening tool. In addition, example SAS (SAS Institute, Cary, NC) code detailing the implementation of the methods used to calculate the necessary statistics is available by request. Examples of the interpretation of these statistics are given where the complexity of the analysis warrants it. The analyses described in this SAP include assessment of internal consistency with Cronbach’s α, test-retest reliability (TRT), convergent validity, and classification accuracy of the screening tool relative to the gold standard diagnostic assessment conducted in clinic.

Study Design

The methods described here for assessing the reliability, validity, and classification accuracy of the spasticity screening tool can be employed in any study design of interest. This SAP assumes that the design will be built around clinic-based convenience sampling. It also assumes that, although design extensions may be incorporated, the following elements will serve as the basis of the design:

  1. Patients will be recruited in clinic.
  2. At baseline, patients will complete the spasticity screening tool and the clinician will perform an evaluation using the Modified Ashworth Scale (MAS).
  3. In addition to completing the screening tool at baseline, patients will complete the screening tool again 1 week later to assess TRT. To ensure that the TRT estimate reflects the temporal stability of the screening tool rather than true change in spasticity, the MAS will also be completed again, and a second TRT analysis will be conducted among the cases for whom the MAS indicates no change in spasticity.
  4. Clinicians working in the clinics where data are being collected will be blinded to the patient responses to the screening tool.

Scores for the Spasticity Screening Tool

The spasticity screening tool is composed of 13 items, each with response options scored from 0 to 4 and heterogeneous response anchors. This SAP emphasizes classical test theory scores and score evaluation in addition to classification accuracy assessments. The score for the screening tool is defined as the unweighted sum of the 13 items; it ranges from 0 to 52, with higher scores indicating greater spasticity severity.
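A brief illustrative sketch of this scoring rule is given below, assuming a hypothetical dataset named SPAS with the item responses stored in variables ITEM1-ITEM13 (all names here are chosen for illustration only):

  data spas;
     set spas;
     * Unweighted sum of the 13 items; possible range 0-52;
     * assumes complete responses: the SUM function ignores missing values;
     score = sum(of item1-item13);
  run;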

Goal of Validation

Because this is a screening tool, the main criterion by which the quality of the measure will be judged is its ability to detect subjects in whom spasticity is disabling enough to necessitate treatment, as judged by an expert clinician. Therefore, maximizing agreement with clinician diagnosis, that is, the classification accuracy of the screening tool, is the target of this study. Consequently, the statistics of interest are sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). In addition, supplemental validation evidence will be derived from estimates of internal consistency, TRT, and convergent validity. Taken together, this set of analyses will demonstrate how accurately scores derived from the screening tool measure spasticity in need of treatment, how stable scores are within person over time, how the scores relate to an existing spasticity measure, and how cut scores may be defined on the screening tool to maximize classification accuracy relative to clinical diagnostic evaluations. This body of evidence will support a publication describing the psychometric and classification accuracy properties of the spasticity screening tool. The order of analyses in this SAP is as follows: first, psychometric assessments of the scores are reviewed, followed by a description of the analyses for developing a cutpoint on the scores that maximizes classification accuracy relative to the diagnostic evaluations.

Internal Consistency

As scores are expected to be approximately normally distributed, coefficient α will be used to quantify internal consistency. Coefficient α measures how strongly a scale agrees with itself over all possible split-half permutations of the item content and is a function of the average inter-item correlation for the set of items defining the scale. The CORR procedure in SAS may be used to estimate coefficient α (documented code is available by request): all 13 spasticity items are listed on the VAR statement, and SAS reports coefficient α based on the inter-item correlations. As shown in Guttman’s equation 52 and the corresponding inequality, this estimate is a lower bound on the reliability of the scores.[i] The standardized estimate is the statistic to interpret. A sufficient value for coefficient α is 0.70 to 0.80. Values above 0.80 may also demonstrate good internal consistency, but they should be interpreted carefully because they may indicate that the screening tool contains redundant, or too many, items.
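A minimal sketch of this step, assuming the same hypothetical SPAS dataset and ITEM1-ITEM13 variable names used above (the fully documented code remains available by request):

  * ALPHA requests Cronbach coefficient alpha, raw and standardized;
  proc corr data=spas alpha nomiss;
     var item1-item13;   * all 13 screening tool items;
  run;

The standardized estimate equals kr̄ / [1 + (k − 1)r̄], where k = 13 is the number of items and r̄ is the average inter-item correlation, which is why α may be read as a summary of inter-item agreement.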

Test-Retest Reliability

When estimating TRT for spasticity screening tool scores, estimates will be based on the 2-way random-effects intraclass correlation [ICC(2,1)] described by Shrout and Fleiss (1979).[ii] A SAS macro exists for calculating this ICC; the macro, along with example SAS code for structuring the data and implementing the macro, is available by request. Scores from time 1 (T1) will be correlated with scores from time 2 (T2). A minimally sufficient value of TRT is deemed to be 0.70 to 0.80. Table 1 provides an example of how the TRT analyses should be presented, and a sketch of the underlying variance-components model follows the table.

Table 1. TRT Example

Comparison / Statistic
T1 vs T2 / ICC(2,1)
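The following variance-components sketch illustrates the model underlying ICC(2,1) using PROC MIXED rather than the macro itself. It assumes a hypothetical long-format dataset TRT with one row per subject (ID) per occasion (TIME = 1 or 2) and the total score in SCORE:

  proc mixed data=trt covtest;
     class id time;
     model score = ;       * intercept-only mean model;
     random id time;       * 2-way random effects: subjects and occasions;
  run;
  * From the Covariance Parameter Estimates table:
    ICC(2,1) = Var(ID) / [Var(ID) + Var(TIME) + Var(Residual)];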

Convergent Validity

The standard regulatory measure of spasticity is the MAS, which will serve as the sole convergent validator. Scores on the MAS for the most severely affected limb will be correlated with the spasticity screening tool scores using Pearson correlation. Convergent validity correlations tend to be moderate and positive; thus, a correlation in the range of 0.4 to 0.6 will be taken as evidence of convergent validity.
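A minimal sketch, assuming the hypothetical SPAS dataset now also carries the derived SCORE and a variable MAS_WORST holding the MAS rating for the most severely affected limb (both names are illustrative):

  proc corr data=spas pearson;
     var score mas_worst;   * screening tool total and MAS of most affected limb;
  run;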

Receiver Operating Characteristic Curves for Identifying a Cut Score That Maximizes Classification Accuracy

Scores generated by the spasticity screening tool may be partitioned so as to maximize classification accuracy relative to a gold standard such as clinical diagnosis. However, the threshold, or cutpoint, at which the scores should be partitioned to maximize classification accuracy is not known a priori and must be estimated from the data. In this context, the binary clinical diagnosis (spasticity = 1, no spasticity = 0) may be used as the outcome in a binary logistic regression model with the spasticity score as the predictor. Alternative thresholds may then be estimated and the spasticity score cut at each location. For each location, classification accuracy may be assessed, and the threshold that maximizes correct classification and other important statistics quantifying classification accuracy may be selected as the proposed cutpoint that maximizes screening accuracy. The general statistical method used in this procedure is a receiver operating characteristic (ROC) curve estimated from a logistic regression model. A convenient, easy-to-use SAS macro, the ROCPLOT macro, can be used to calculate the needed statistics. The ROCPLOT macro automates the calculation of the statistics needed for ROC curve estimation and threshold selection by operating on output from the LOGISTIC procedure. One simply fits the model predicting clinical diagnosis from the spasticity screening tool scores, and the threshold that maximizes the desired criterion or criteria is returned. SAS code available by request details the implementation and interpretation of the ROCPLOT macro.
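A minimal sketch of the underlying LOGISTIC step (the ROCPLOT macro call itself is part of the code available by request), assuming the hypothetical SPAS dataset with binary DIAGNOSIS and total SCORE:

  proc logistic data=spas;
     model diagnosis(event='1') = score / outroc=rocstats;
  run;
  * ROCSTATS contains _PROB_, _SENSIT_, and _1MSPEC_ at each candidate
    cutpoint, from which correct classification, distance to the (0,1)
    corner, and the Youden index can be computed;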

As seen in Table 2, criteria C, D, and Y are all satisfied at the cutpoint 0.42814, and, as seen in the label column, this cutpoint on the predicted probabilities corresponds to an observed screening tool score of 50. The cutpoint and screening tool score at which sensitivity equals specificity is different but close, at a screening tool score of 55. Thus, 3 of the 4 criteria are satisfied at a cutscore of 50 on this example screening tool. At this cutpoint, correct classification is 87.5%, the D criterion indicates that the specificity is 75% (1 − 0.25), and the vertical distance from the noninformative line, or Youden index, is 0.75. Figure 1 contains these statistics plotted on the ROC curve. The point labels in Figure 1 are identical to those listed in the label column of Table 2.

Once the optimal cutscore is identified, the spasticity screening tool scores may be partitioned at that location. In this example, the optimal cutscore is a screening tool value of 50, which generates a binary screening tool criterion equal to 1 if the screening tool score is 50 or above and 0 otherwise. This criterion can be cross-tabulated with the clinical diagnosis variable. The rows of the resulting 2 × 2 table contain the screening tool criterion and the columns contain the clinical diagnosis status. Sensitivity is the column percent of the cell at the intersection of screening tool criterion positive and diagnosis positive; specificity is the column percent of the cell at the intersection of screening tool criterion negative and diagnosis negative. PPV and NPV are the row percentages of those same cells, respectively. For the example data given here, partitioning the screening tool scores at the estimated optimal cutscore of 50 derived from the results in Table 2 and Figure 1 yields a screening tool criterion with sensitivity, specificity, NPV, and PPV of 75%. Applying this same procedure to data collected from subjects who completed both the clinical evaluation and the screening tool is the most effective way to determine a screening tool criterion that maximizes classification accuracy.
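A minimal sketch of this cross-tabulation, using the example cutscore of 50 and the hypothetical names introduced above:

  data screen;
     set spas;
     screenpos = (score >= 50);   * 1 if at or above the example cutscore;
  run;

  proc freq data=screen;
     * column percents give sensitivity and specificity;
     * row percents give PPV and NPV;
     tables screenpos*diagnosis;
  run;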

In addition, a single-item assessment of clinician opinion on the need for treatment will be administered. The binary classification from the clinician impression of need for treatment will be contrasted with the screening tool classification, and classification accuracy analyses for this contrast will be computed. The odds ratio and 95% confidence interval for the association between the clinician impression of need for treatment and the screening tool classification will also be reported.
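A minimal sketch, assuming a hypothetical binary variable NEED_TX coding the clinician impression of need for treatment:

  proc freq data=screen;
     * RELRISK prints the case-control odds ratio with its 95% confidence limits;
     tables screenpos*need_tx / relrisk;
  run;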

Table 2. ROC Curve Criteria, Cutpoints, and Statistics

Criterion / Symbol / Cutpoint / Label / Value
Correct / C / 0.42814 / 50,0.428,C,D,Y, / 0.875
Dist To 0,1 / D / 0.42814 / 50,0.428,C,D,Y, / 0.250
Sens-Spec / = / 0.60930 / 55,0.609,,,,= / 0.000
Youden / Y / 0.42814 / 50,0.428,C,D,Y, / 0.750

ROC= receiver operating characteristic.

Figure 1. ROC Curve for Example Diagnosis Predicted by Screening Tool

Possible Analysis Extensions of This Statistical Analysis Plan

In addition to the analyses described in this SAP, many other analyses exist that are designed to support the validation of screening tools. These analyses are traditionally applied upstream of where the analyses described in this SAP begin and are designed to characterize the domains assessed by a screening tool, evaluate item quality, empirically reduce the item pool, and generate empirically weighted scores supported by statistical theory.

Assessing dimensionality traditionally relies on exploratory factor analysis (EFA). A proper EFA for the ordered multinomial response options used in this screening tool relies on marginal maximum likelihood or weighted least squares estimation using the appropriate distribution (multinomial) and link function (cumulative logit or probit, respectively). Under an appropriate EFA, the number of domains, or factors, to estimate is determined from model fit indices, which may include information criteria (the Akaike and Bayesian information criteria: AIC and BIC, respectively), χ2 tests of absolute fit, and the root mean squared error of approximation. The estimated loadings for the optimal factor solution are then rotated using a proper oblique rotation algorithm, preferably one from the Crawford-Ferguson family.

After dimensionality is assessed, item response theory (IRT) models are commonly used to assess item quality. Model fit may be used to select the appropriate parameterization of the IRT model. The choice between Rasch and graded response model parameterizations is strictly a model-fit decision, and the same indices used to determine dimensionality in the EFA may be used to determine the optimal IRT parameterization. After the optimal IRT parameterization is selected, the estimated item parameters may be used to evaluate item quality. Items whose parameters indicate suboptimal psychometric properties are candidates for elimination. If the item pool is modified by item elimination, the EFA and IRT models must be refit to confirm that the dimensionality has not changed and that item removal did not affect the performance of the retained items.

At the conclusion of the IRT analyses, the IRT item parameters may be used to generate empirically weighted scores that form a composite of frequency and severity. These scores can then be assessed using the procedures described earlier in this SAP, starting with internal consistency and ending with the ROC curve analyses for cutpoints and classification accuracy.

These analyses are given only a cursory description here to point interested researchers toward additional analyses that may be employed if there is interest in furthering the validation of this screening tool. However, these analyses require specialized training and software and are highly iterative; therefore, further support and delivery of explicit code is not possible for them.

Summary

The design and analyses described in this SAP, along with the code (available by request), should provide a straightforward means of establishing the preliminary psychometric and classification accuracy properties of the spasticity screening tool. The full set of analyses proposed in this SAP is summarized in Table 3, together with the criterion for acceptability for each analysis.

Table 3. Summary of Proposed Primary Analyses

Property / Definition / Test / Criterion for Acceptability
Internal Consistency / Average inter-item correlation among screening tool items / Correlation-based Cronbach’s α / 0.7–0.8
TRT1 / Stability of scores across baseline and follow-up / Shrout & Fleiss ICC(2,1) / 0.7–0.8
TRT2 / Stability of scores across baseline and follow-up, conditioned on no change in MAS / Shrout & Fleiss ICC(2,1) / 0.7–0.8
Convergent Validity / Correlation between the spasticity screening tool and the MAS / Pearson correlation / > 0.4
Classification Accuracy / Agreement between screening tool classification and diagnostic classification / Logistic regression-based ROC curves to define the optimal cutscore on the screening tool; classification accuracy of the cutscore criterion assessed with sensitivity, specificity, NPV, and PPV / Sensitivity and specificity close to or exceeding 0.8
Classification Accuracy / Agreement between screening tool classification and a single item assessing clinician impression of need for treatment / Classification accuracy of the cutscore criterion assessed with sensitivity, specificity, NPV, and PPV; reporting of the odds ratio and 95% confidence interval / Sensitivity and specificity close to or exceeding 0.8; an odds ratio greater than 1 with a narrow 95% confidence interval that does not contain 1.0

ICC= intraclass correlation; MAS= Modified Ashworth Scale; NPV= negative predictive value; PPV= positive predictive value; ROC= receiver operating characteristic; TRT1= first test-retest reliability analysis; TRT2= second test-retest reliability analysis (conditioned on no change in MAS); Cronbach’s α may be defined as a function of the average inter-item correlation.

[i] Guttman L. A basis for analyzing test-retest reliability. Psychometrika 1945;10(4):255-282.

[ii] Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychological Bulletin 1979;86(2):420-428.