Appendix A: Detection and Removal of Mischievous Responders
We used a combination of statistical methods and judgment to identify mischievous responders, paying particular attention to respondents who were identified as statistical outliers according to multiple criteria. We considered the parent proxy data and child data as two separate samples, examining each sample independently for outliers but using the same methodology.
One outlier detection method involved calculating the probability of each response pattern in the sample and flagging as potential outliers those respondents whose item responses had the lowest probabilities, as computed using the IRT model across the two forms for a specific domain. This was done for each of the three domains. This method allowed us to identify unlikely response patterns, such as one parent-child pair who both endorsed responses 1, 2, 3, 4, 5, in that order, for five consecutive items across all forms and domains; this type of pattern clearly suggests mischievous responding. However, some respondents were flagged as having unlikely response patterns simply because their responses suggested poor health relative to the calibration sample (i.e., low Mobility or high Depressive Symptoms/Fatigue). For this reason, we used these response pattern probabilities only in tandem with other statistical and judgmental methods to identify persons for further investigation; no respondent was excluded from analyses based solely on having a low probability response pattern.
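As an illustration of the pattern-probability idea, the following is a minimal sketch and not the procedure actually used. It assumes a graded response model with a standard normal latent trait and a simple grid approximation to the marginal probability; the text above does not specify the IRT model or the integration method. The item parameters and the function names grm_category_probs, pattern_log_prob, and flag_least_likely are hypothetical.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under a graded response model.

    thresholds must be in ascending order; K - 1 thresholds give K categories.
    """
    # Cumulative probabilities P(X >= k), padded with 1 (lowest) and 0 (beyond highest).
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

def pattern_log_prob(responses, items, n_points=81, lo=-4.0, hi=4.0):
    """Marginal log-probability of a response pattern (responses coded 1..K).

    Assumption: theta ~ N(0, 1), approximated on an equally spaced grid.
    """
    step = (hi - lo) / (n_points - 1)
    total, weight_sum = 0.0, 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * theta * theta)  # unnormalized normal density
        likelihood = 1.0
        for resp, (a, thresholds) in zip(responses, items):
            likelihood *= grm_category_probs(theta, a, thresholds)[resp - 1]
        total += weight * likelihood
        weight_sum += weight
    return math.log(total / weight_sum)

def flag_least_likely(patterns, items, n_flag):
    """Indices of the n_flag patterns with the lowest marginal log-probability."""
    log_probs = [pattern_log_prob(p, items) for p in patterns]
    return sorted(range(len(patterns)), key=lambda i: log_probs[i])[:n_flag]

# Hypothetical parameters: five identical five-category items.
items = [(2.0, [-1.5, -0.5, 0.5, 1.5])] * 5
patterns = [[3, 3, 3, 3, 3], [1, 2, 3, 4, 5], [2, 2, 3, 2, 2]]
flagged = flag_least_likely(patterns, items, n_flag=1)
```

In this toy setup the ascending pattern (1, 2, 3, 4, 5) receives the lowest probability because no single trait level makes all five responses likely at once. A uniformly severe pattern can also score low when few calibration respondents are that severe, which is why the probabilities serve only as a screening flag.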
A second method involved examination of data from respondents exhibiting score differences of 15 or more points on the T-score scale between the two forms administered within any of the domains. Fifteen points was chosen as an initial threshold for flagging respondents for possible exclusion after examining graphical trends in the relationship between the two forms for each domain, which showed that the majority of respondents did not have such large score differences between the two administrations. Because the two forms were administered in close temporal proximity, it is highly unlikely that a 15-point difference reflects real change; such a score difference between successive administrations is more likely due to mischievous or non-thoughtful response behavior. Depending on the number of respondents flagged as outliers within a particular sample and domain, we sometimes increased the threshold to 20 points. This was done to avoid identifying too many respondents for consideration in the judgmental stages.
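The score-difference screen can be sketched as follows. This is a minimal illustration that assumes the T-scores for the two forms are already available as parallel lists; the function name flag_large_differences and the example scores are hypothetical.

```python
def flag_large_differences(scores_form_a, scores_form_b, threshold=15.0):
    """Indices of respondents whose T-scores on the two forms differ by
    at least `threshold` points in either direction."""
    return [
        i
        for i, (a, b) in enumerate(zip(scores_form_a, scores_form_b))
        if abs(a - b) >= threshold
    ]

# Made-up T-scores for three respondents on Form A and Form B.
form_a = [50.0, 62.0, 40.0]
form_b = [52.0, 45.0, 55.0]

flagged_15 = flag_large_differences(form_a, form_b)                  # [1, 2]
flagged_20 = flag_large_differences(form_a, form_b, threshold=20.0)  # []
```

Raising the threshold from 15 to 20 shrinks the flagged set, which mirrors the adjustment described above for samples and domains in which too many respondents were flagged.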
A third outlier detection statistic was the Mahalanobis distance (MD) value computed from the two-dimensional distributions of pre- and post-test scores within each domain. MD is a measure of the multivariate distance between a respondent’s score-pair and the overall mean of the sample. We used a robust estimator of the covariance matrix in calculating the MD [19]. Respondents with outlying values of MD were flagged for further examination.
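To make the MD computation concrete, the sketch below computes distances for two-dimensional score pairs. The robust covariance step is an assumption: it uses a simplified trimming scheme (repeatedly re-estimating the mean and covariance from the fraction of points closest to the current center) and is not the estimator of [19]. The function names, the keep_fraction parameter, and the example scores are hypothetical, and no cutoff for an "outlying" MD is implied.

```python
import math

def mean_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_2d(point, center, cov):
    """Mahalanobis distance of one point, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy  # must be nonzero (points not collinear)
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.sqrt((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)

def robust_distances(points, keep_fraction=0.75, max_iter=20):
    """MD of every point from a center and covariance estimated on a trimmed subset."""
    n = len(points)
    h = max(3, math.ceil(keep_fraction * n))
    subset = list(range(n))
    for _ in range(max_iter):
        center, cov = mean_cov_2d([points[i] for i in subset])
        dists = [mahalanobis_2d(p, center, cov) for p in points]
        # Keep the h points closest to the current center for the next estimate.
        new_subset = sorted(sorted(range(n), key=lambda i: dists[i])[:h])
        if new_subset == subset:
            break
        subset = new_subset
    # Final distances use estimates from the last retained subset.
    center, cov = mean_cov_2d([points[i] for i in subset])
    return [mahalanobis_2d(p, center, cov) for p in points]

# Made-up (first score, second score) pairs; the last pair is far off the trend.
score_pairs = [(45, 47), (50, 49), (55, 56), (60, 58), (40, 42),
               (52, 53), (48, 46), (58, 60), (43, 44), (30, 70)]
distances = robust_distances(score_pairs)
most_outlying = max(range(len(score_pairs)), key=lambda i: distances[i])
```

Trimming matters here because a classical covariance estimate is inflated by the very outliers it is meant to expose; estimating from the central bulk of the data keeps an aberrant score pair from masking itself.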
After each of these methods flagged potential outliers, we compiled a list of questionable respondents and carefully examined the associated response patterns before deciding to exclude any individual from the analyses. Specifically, we checked whether each questionable response pattern was likely reflective of mischievous or non-thoughtful behavior. For example, respondents endorsing all the least severe response options on one form in a domain (e.g., 1, 1, 1, 1, 1) and all the most severe response options on the other form within that same domain (e.g., 5, 5, 5, 5, 5) were considered mischievous responders and set aside from subsequent analyses. On the other hand, some of the respondents flagged as potential outliers had response patterns that could be plausible, such as those endorsing the most severe categories for all or nearly all items on both forms. Likewise, sometimes only slight changes in response patterns between two forms resulted in relatively large score differences between those two forms. Within the parent proxy sample, for example, a response pattern of (2, 2, 2, 2, 1) on Fatigue Form A, followed by a response pattern of (1, 1, 1, 1, 1) on Fatigue Form B, resulted in a difference of more than 15 points on the T-score scale. We therefore relied on multiple criteria, taking into account both the scores and the raw response patterns, in deciding which respondents to set aside.
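The compilation step can be sketched as below. This is only a bookkeeping aid for the judgmental review, under the assumption that each method yields a list of flagged respondent IDs; it does not automate the exclusion decision. The function names compile_review_list and is_opposite_extremes, the method labels, and the respondent IDs are hypothetical.

```python
def compile_review_list(flags_by_method):
    """List (respondent ID, methods that flagged them), most-flagged first."""
    review = {}
    for method, respondent_ids in flags_by_method.items():
        for rid in respondent_ids:
            review.setdefault(rid, []).append(method)
    return sorted(
        ((rid, sorted(methods)) for rid, methods in review.items()),
        key=lambda entry: (-len(entry[1]), entry[0]),
    )

def is_opposite_extremes(form_a, form_b, lowest=1, highest=5):
    """True when one form has only the lowest option and the other only the highest."""
    a, b = set(form_a), set(form_b)
    return (a == {lowest} and b == {highest}) or (a == {highest} and b == {lowest})

# Made-up flags from the three screening methods.
flags = {
    "pattern_probability": ["r1", "r2"],
    "score_difference": ["r2"],
    "mahalanobis": ["r2", "r3"],
}
review_list = compile_review_list(flags)

# Patterns taken from the examples in the text.
clear_case = is_opposite_extremes([1, 1, 1, 1, 1], [5, 5, 5, 5, 5])       # True
plausible_case = is_opposite_extremes([2, 2, 2, 2, 1], [1, 1, 1, 1, 1])   # False
```

Sorting by the number of flagging methods puts respondents identified under multiple criteria at the top of the list, while leaving each case to be judged from its raw response patterns.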
We treated the parent and child samples independently; that is, we did not exclude a child from the analyses simply because his or her parent was excluded, and vice versa. However, if a respondent was identified as an outlier for one of the domains, he or she was excluded from the analyses for all the domains. If a respondent exhibited random or non-thoughtful response behavior at any point during the procedure, we were not confident that this person’s scores were trustworthy for the remaining domains. After examining the score differences, MD values, and response patterns of all respondents who were flagged as potential outliers, we excluded a total of 25 respondents from the parent proxy sample and 23 respondents from the child sample.