Abstract
Background Global ratings of change (GROCs) are commonly used in research and clinical practice to determine which patients respond to therapy, but their validity as a criterion for change has not been firmly established. One factor related to their validity is the length of the recall period.
Objective The study objective was to examine the influence of the length of the recall period on the validity of a GROC for determining true change over time in the clinical setting.
Design This was a longitudinal, single-cohort observational study.
Methods Data from the Focus on Therapeutic Outcomes clinical database were collected for 8,955 patients reporting for physical therapy treatment of a knee disorder. Computerized adaptive testing was used to assess knee functional status (FS) at the initial and final (discharge) physical therapy visits. Each patient's GROC was obtained at discharge. Correlation and linear regression analyses of knee FS and GROC, stratified by length of time between intake and discharge, were conducted.
Results Correlations of GROC with knee FS change scores were modest even for the shortest period of recall (0–30 days) and were slightly lower for longer recall periods. Regression analyses using knee FS to predict GROC scores revealed similar findings. Correlations of GROC with intake and discharge scores indicated a strong bias toward discharge status, with little or no influence of baseline status. Standardized regression coefficients fitted the pattern expected for a valid measure of change but confirmed the strong bias toward discharge status.
Limitations One version of the GROC administered serially in a cohort of patients seen in clinical practice was examined.
Conclusions These results call into question the validity of GROCs for measuring change over time in routine clinical practice.
Patient-reported outcomes (PROs) have achieved a prominent role in clinical practice and research.1–4 They offer an efficient means to gather a patient's perception of level of symptoms, functional ability, and health-related quality of life. Established patient-reported measures have sound psychometric properties and often have stronger correlations with functional abilities than common objective measures.5 They are recognized by third-party payers and by government organizations as key measures of progress in response to therapeutic regimens.2
In both clinical and research applications, PROs are used to determine which patients respond to therapy (“responders”), that is, which patients have shown clinically meaningful improvements in response to health care interventions.5–8 Because PRO measures, like all clinical measures, are abstract representations of reality, researchers and clinicians have sought an external criterion or reference to determine the threshold for clinically meaningful change. One commonly used criterion is a global rating of change (GROC), also referred to as a transitional scale.9 Global ratings of change exist in various formats, but all allow a patient to indicate whether change has occurred, typically relative to the date of the initiation of care and, if so, the extent of that change. The GROC allows a patient to indicate the direction of change (ie, worsening or improvement) and the degree of change (ie, small to large) by using a Likert scale. Although the GROC was originally proposed as a means to indirectly establish a minimum clinically important difference,9 which may then be used as a threshold that patients must surpass to qualify as responders, various versions of the GROC have become commonly used in clinical practice to directly determine which patients have responded to therapy.10–13
Despite the frequent use of transitional ratings such as the GROC in research and clinical practice, the validity of these ratings for determining true change over time has not been firmly established.14 Supporters of global ratings argue for their strong face validity on the basis of straightforward interpretation and ease of use and cite their correlation with change scores on other PRO measures as support for their construct validity.15,16 However, Norman et al17 have challenged the ability of patients to recall their previous state of health, compare it with their current state, and mentally calculate a difference score that would accurately represent the true degree of change. They concluded that a retrospective rating cannot be substituted for the calculation of change between the baseline and the endpoint because a retrospective rating is strongly influenced by other factors. They demonstrated that transitional scales such as the GROC are strongly biased by health status at the time the rating is made, show little or no correlation with the baseline level of health, and hence are not valid measures of change over time.
Guyatt et al pointed out that, “Intuitively, one would expect that the longer the duration of time over which patients must cast their memory, the less likely they are to recollect accurately.”16(p904) Since this idea was proposed, only 1 study has investigated the influence of the length of the recall period on the validity of a transitional measure. Kamper et al18 reviewed data from 6 controlled trials and a single cohort study in which similar versions of the Global Perceived Effect scale for patients with neck, low back, ankle, or shoulder disorders were used. They studied 15 time periods ranging from 2 to 52 weeks; their results confirmed the strong bias of current status on a patient's perception of change in research trials and showed this bias strengthening as transition time lengthened. Despite this serious challenge to the validity of such retrospective global ratings, they continue to be used in both research and clinical practice.
These threats to the validity of transitional scales may be of greatest concern in clinical practice, in which the length of time to follow-up is variable both within and between patients and in which follow-up may occur over long periods. No studies to date have examined the validity of the GROC for determining true change over time with data from clinical practice. The purpose of this study was to examine whether the validity of the GROC in clinical practice is influenced by the length of the period of recall. Specifically, we hypothesized that global ratings over a shorter period of recall would show greater validity than global ratings over a longer period of recall, as reflected by correlations with both baseline and follow-up PRO scores and the strength of the associations between PRO change scores and transitional ratings.
Method
Participants
The data for this single-cohort study were obtained from the Focus on Therapeutic Outcomes (FOTO) database. The FOTO database is a proprietary database intended for the reporting of risk-adjusted outcomes, process improvement, and research purposes. Data were collected prospectively for all patients who attended participating outpatient physical therapy clinics in 26 states across the United States and were discharged from care between April 1, 2010, and December 30, 2011. Patients routinely provided demographic and risk adjustment information before the start of treatment using Patient Inquiry Software (FOTO Inc, Knoxville, Tennessee). Patients completed a functional status (FS) measure at intake and again at the final (discharge) visit, with optional status reports during the course of treatment, as directed by clinic policy, the clinician, or both. Patients who were seen for a knee problem at their initial physical therapy visit and had a follow-up FS measurement along with a GROC were eligible for this study. The GROC was an optional measure used by more than 500 clinics in the FOTO system.
Measures
Computerized adaptive tests.
Computerized adaptive tests (CATs) are designed to decrease respondent burden while providing precise estimates of FS. The CAT used for lower extremity function in this study was developed on the basis of items from the Lower Extremity Functional Scale (LEFS).19 Using item response theory to analyze extensive data from the FOTO database, the developers of the lower extremity CAT established a hierarchical order of items from the LEFS according to level of difficulty. Two items from the original LEFS were dropped after factor analysis and identification of redundant items, leaving a final item bank of 18 questions from the original LEFS. Although the same item bank was used for any patient with a lower extremity disorder, separate CATs were developed for hip, knee, and ankle or foot problems on the basis of the most efficient algorithm for estimating lower extremity function scores.20 Therefore, we analyzed data for only the most commonly presenting of these body regions (the knee) to ensure homogeneity.
A patient's FS score was ascertained by starting with an item at a median level of difficulty (eg, ability to walk 2 blocks) and then determining the order of subsequent items on the basis of prior responses. After each patient response, a predicted FS score was determined along with the estimated error. Additional questions were posed until a stopping rule was satisfied—when either the estimated error was sufficiently low (<4 FS scale units, or 4% of the total score range) or a change in the predicted FS based on the previous 3 items was less than 1%. Functional status was scored on a scale from 0 to 100, with 100 representing the highest functional level. Although the knee FS score is based on items from the LEFS, it is not intended to reproduce or estimate the original LEFS score but rather is intended to produce a new estimate of a patient's functional ability on a continuum represented by a scale from 0 to 100.
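The stopping rule described above can be sketched in a few lines. This is a minimal illustration, not the FOTO algorithm: the function name `should_stop` and its parameters are hypothetical, and the second condition is interpreted here (as an assumption) to mean that the predicted FS score varied by less than 1 scale unit, that is, 1% of the 0-to-100 range, across the previous 3 items.

```python
from typing import Sequence

SE_THRESHOLD = 4.0         # FS scale units (4% of the 0-100 range), per the text
STABILITY_THRESHOLD = 1.0  # assumed: 1% of the 0-100 range, in FS scale units
STABILITY_WINDOW = 3       # number of previous items considered, per the text


def should_stop(fs_estimates: Sequence[float], standard_error: float) -> bool:
    """Return True when either stopping condition holds.

    fs_estimates: predicted FS score after each administered item, oldest first.
    standard_error: estimated error of the most recent FS estimate.
    """
    # Condition 1: the estimate is already precise enough.
    if standard_error < SE_THRESHOLD:
        return True
    # Condition 2: the estimate has stabilized over the previous 3 items.
    # We need the estimate held before those items plus the 3 updates after it.
    if len(fs_estimates) > STABILITY_WINDOW:
        recent = fs_estimates[-(STABILITY_WINDOW + 1):]
        return max(recent) - min(recent) < STABILITY_THRESHOLD
    return False
```

Under this reading, a test ends early either because the score is precise or because further items are no longer moving it, which is how a CAT can stop well short of the full item bank.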
Computerized adaptive test scores for hip, knee, and ankle or foot problems have shown excellent reliability, with a Cronbach alpha of .96 and a reliability coefficient of .95, defined as the ratio of the true measurement variance to the observed variance.20 This reliability coefficient reflects the relative amount of error associated with a single estimate of knee function. The CAT scores had a low absolute error (standard error of measurement=3, or 3% of the scale from 0 to 100) and a minimum detectable change at the 95% confidence level of 7.9 FS scale units.20 Subsequent prospective data collection demonstrated that the knee CAT had good construct validity, differentiating levels of FS between subgroups on the basis of age, acuteness of symptoms, surgical history, and number of comorbidities.21 Ceiling or floor effects, defined as the proportion of scores within 5% of the top or bottom of the scale, respectively, were less than 1% at baseline and less than 5% at discharge. The knee CAT was sensitive to overall change, with 75% of patients' change scores at discharge exceeding the minimum detectable change, and responsive to clinically meaningful change, with receiver operating characteristic analysis showing areas under the curve of 71% to 78% when FS cutoff scores were used to differentiate patients who showed improvements from those whose status remained unchanged. The average number of items needed to establish a patient's CAT functional score was 6.9 items (in the original LEFS, 20 items are needed), and the computerized format of the CAT allowed for instantaneous scoring and, thus, for greater efficiency than most paper-and-pencil outcome measures.21
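For readers who want to relate the standard error of measurement (SEM) to the minimum detectable change (MDC), the conventional formula is MDC95 = 1.96 × √2 × SEM. The sketch below applies that formula; it is an assumption that the cited work used this convention, and the function names are illustrative only. With the rounded SEM of 3, the formula gives a value slightly above the reported 7.9, which would be consistent with an unrounded SEM of about 2.85.

```python
import math


def mdc95(sem: float) -> float:
    """Minimum detectable change at the 95% level from the SEM.

    The sqrt(2) term reflects that a change score combines the error
    of two measurements (intake and discharge).
    """
    return 1.96 * math.sqrt(2.0) * sem


def implied_sem(mdc: float) -> float:
    """Invert the same formula to recover the SEM implied by a reported MDC95."""
    return mdc / (1.96 * math.sqrt(2.0))


if __name__ == "__main__":
    print(f"MDC95 for SEM=3: {mdc95(3.0):.2f}")
    print(f"SEM implied by MDC95=7.9: {implied_sem(7.9):.2f}")
```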
GROC.
The GROC used in this study was a 15-point Likert scale9 ranging from −7 to +7. Patients were asked to rate their self-perceived “overall change” compared with their status at the initial clinic visit. Response options were listed on a continuum, with −7 (labeled “worse”) on the left, +7 (labeled “better”) on the right, and 0 (labeled “no change”) in the middle. This version of the GROC was described previously.21–25
Data Analysis
All data were analyzed with NCSS 8 (NCSS, LLC, Kaysville, Utah). Descriptive statistics were used to summarize demographic and outcome data. For assessment of bias, chi-square and independent t tests were used to look for significant differences in baseline demographics and FS scores between patients included in the analysis and those excluded because of incomplete data, at an alpha level of .05.
When multiple FS reports over the course of treatment were available, the final status report was used for data analysis to reflect overall progress for the episode of care. The data were stratified according to the period of time between intake and the final status report, reflecting the length of the recall period for patient GROCs. Recall periods were 0 to 30, 31 to 60, 61 to 90, 91 to 180, and greater than 180 days.
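The stratification above amounts to binning each episode of care by the number of days between intake and the final status report. A minimal sketch follows; the function name `recall_period` and the string labels are hypothetical, and the analysis in this study was carried out in NCSS rather than with this code.

```python
from datetime import date

# Inclusive day ranges for the recall periods named in the text.
RECALL_BINS = [
    (0, 30, "0-30"),
    (31, 60, "31-60"),
    (61, 90, "61-90"),
    (91, 180, "91-180"),
]


def recall_period(intake: date, final_report: date) -> str:
    """Return the recall-period label for one episode of care."""
    days = (final_report - intake).days
    if days < 0:
        raise ValueError("final status report precedes intake")
    for low, high, label in RECALL_BINS:
        if low <= days <= high:
            return label
    return ">180"
```

Each patient's GROC, intake FS score, and discharge FS score can then be grouped by the returned label before any correlations are computed.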
The validity of the GROC was examined in 3 ways. First, for each recall period, Spearman rank order correlation coefficients were calculated to assess the strength of the relationship between patient GROCs and intake, discharge, and change scores on the knee FS CAT. The Spearman correlation was used because of the ordinal nature of the GROC and for consistency, as normality could not be guaranteed across all data. If the GROC reflected true change, the correlations with change scores on the validated knee FS measure should be moderate to strong (.40–.70), statistically significant, and greater than the correlations with baseline or discharge scores.17 In addition, Guyatt et al16 proposed that the validity of a transitional change rating requires that baseline status and follow-up status exert equal (and opposite) influences on the transitional rating. They offered mathematical proof of this expected correlation pattern, which assumed equal variances of pretreatment and posttreatment outcome measure scores.16 Therefore, we used the modified Levene equal variance test for the assessment of significant differences between intake and discharge FS variances and calculated the correlations of the GROC with the knee FS measure at intake and at discharge to investigate this relationship.16,17
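The expected correlation pattern can be illustrated with simulated data. The sketch below is not the study's analysis: it generates hypothetical intake and discharge scores with equal variances, builds a 15-point rating that tracks the true difference plus noise, and computes Spearman correlations with a small hand-written implementation. All distribution parameters and function names are assumptions chosen for illustration.

```python
import math
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def simulate_valid_rating(n=5000, seed=1):
    """Simulate a rating that reflects true change (hypothetical parameters)."""
    rng = random.Random(seed)
    intake, discharge, rating = [], [], []
    for _ in range(n):
        i = rng.gauss(50.0, 12.0)
        # Discharge has the same variance as intake (0.25*144 + 108 = 144).
        d = 40.0 + 0.5 * i + rng.gauss(0.0, math.sqrt(108.0))
        raw = (d - i) / 5.0 + rng.gauss(0.0, 1.0)
        intake.append(i)
        discharge.append(d)
        rating.append(max(-7, min(7, round(raw))))  # 15-point scale, -7 to +7
    change = [d - i for d, i in zip(discharge, intake)]
    return {
        "intake": spearman(rating, intake),
        "discharge": spearman(rating, discharge),
        "change": spearman(rating, change),
    }
```

Under these assumptions, the rating correlates negatively with intake and positively with discharge, with similar magnitudes, and most strongly with the change score. A rating that tracked only discharge status would instead show a near-zero correlation with intake.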
Stepwise linear regression was carried out to further examine the validity of the relationship between GROC scores and FS scores.16 In the initial model, a patient's change in the knee FS score was the sole predictor, and the patient's GROC was the dependent variable. Next, we investigated the relative contributions of the patient's baseline and discharge FS scores by building a model initially including the discharge score and then entering the intake knee FS score into the model as a second predictor. Two models were used to avoid multicollinearity of the change score with the initial and discharge scores. Data plots were inspected to verify assumptions of normality and homoscedasticity. The contribution of the change in the knee FS score to the patient's GROC and the relative contributions of the patient's intake status and discharge status to the patient's GROC were determined on the basis of the change in R² between the regression models. Standardized regression coefficients were examined as an indication of the relative contribution of each independent variable to the full multiple regression model.
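The second model can be sketched with the standard closed-form expressions for a two-predictor regression, in which the standardized coefficients and R² follow directly from the three pairwise Pearson correlations. This is an illustrative sketch only; the study used NCSS, and the function and key names here are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def discharge_then_intake(groc, discharge, intake):
    """Fit GROC on discharge FS, then add intake FS as a second predictor.

    Returns the standardized coefficients of the full model, the R² of each
    model, and the change in R² attributable to the intake score.
    """
    r_gd = pearson(groc, discharge)
    r_gi = pearson(groc, intake)
    r_di = pearson(discharge, intake)
    denom = 1.0 - r_di ** 2  # zero only if the two predictors are collinear
    beta_discharge = (r_gd - r_gi * r_di) / denom
    beta_intake = (r_gi - r_gd * r_di) / denom
    r2_discharge_only = r_gd ** 2
    r2_full = beta_discharge * r_gd + beta_intake * r_gi
    return {
        "beta_discharge": beta_discharge,
        "beta_intake": beta_intake,
        "r2_discharge_only": r2_discharge_only,
        "r2_full": r2_full,
        "delta_r2": r2_full - r2_discharge_only,
    }
```

For a rating that reflects true change, the two standardized coefficients would be expected to have opposite signs and similar magnitudes, and adding the intake score would raise R² appreciably.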
Results
There were 10,712 patients with knee problems and intake lower extremity FS scores in the database. Follow-up data with complete FS and GROC scores were available for 8,955 patients (84%). Patients for whom complete data were available were generally older than those for whom data were missing (mean difference=4.5 years). They were more likely to be women (61% versus 56%), had lower intake FS scores than those for whom data were missing (mean difference=1.9 percentage points), and were more likely to have had surgery for their current problem (59% versus 48%). Comparisons of included versus excluded patients are shown in Table 1. Table 2 shows the mean, standard deviation, and variance of intake and discharge FS scores for each recall period.
Table 1. Baseline Data for Patients for Whom Complete Data Were Available and Patients for Whom Data Were Missing
Table 2. Means, Standard Deviations, and Variances of Knee Functional Status (FS) Intake and Discharge Scores
The correlations of the GROC with intake, discharge, and change scores on the knee FS measure are shown in Table 3. The correlations of the GROC with intake FS scores were all near 0 (range=−.05 to .13) and usually were not significant, whereas the correlations of the GROC with discharge FS scores were all positive and were statistically significant (range=.41 to .51). In each case, the variance of the discharge scores was significantly greater than the variance of the intake scores. The correlations of the GROC with FS change scores were moderate, positive, and significant; the values were nearly identical to the corresponding GROC–discharge FS score correlations, with differences of .04 or less in all 5 comparisons. The GROC–FS change score correlations exceeded the GROC–discharge FS score correlations in only 2 of the 5 time periods; in both cases, the difference was only .01.
Table 3. Correlations of Global Ratings of Change With Knee Functional Status (FS) Intake, Discharge, and Change Scores
The correlation of the GROC with the FS change score was highest for the 0- to 30-day recall period and diminished slightly over the longer recall periods (Tab. 3). The correlation of the GROC with the discharge FS score was also highest for the 0- to 30-day recall period. In no recall period did the GROC show a statistically significant negative correlation with the intake FS score.
The results of the stepwise regression analyses are shown in Table 4. In all cases, the initial regression analysis was statistically significant, indicating that both variations of FS scores (change score only and discharge only) were significant predictors of GROC scores for the same recall period. The R² values for these predictions ranged from .07 to .19, reflecting the low to moderate correlations seen in Table 3. When the intake FS score was added to the discharge FS score in a stepwise fashion, changes in R² values were minimal, ranging from 0 to .03, indicating small contributions to the ability to predict GROC scores. Interestingly, the standardized beta coefficients did fit the expected pattern, as described by Norman et al,17 with all of the intake beta values being negative and all of the discharge beta values being positive. This result indicated that lower intake scores and higher discharge scores were predictive of higher GROC scores. In all cases, however, the discharge beta values were larger in magnitude than the intake beta values.
Table 4. Regression Analyses of Prediction of Global Ratings of Change by Knee Functional Status Scores
Discussion
The primary conceptual challenge to the validity of transitional measures is the difficulty of unbiased recall. Patients have to assess their current level of health, recall their previous level (which may be 1–3 months or more in the past), and then perform the mental arithmetic to formulate a valid estimation of change.17 Our results based on data from clinical practice add to the body of literature questioning the validity of transitional ratings such as the patient GROC for determining true change over time.16,18
We showed that, across a range of recall periods, the shortest recall period resulted in only modestly accurate predictions of GROC scores, and prediction strength diminished slightly with longer recall periods. A true measure of change over the course of treatment should have substantial correlations with change scores on established functional outcome measures, but it should also equally reflect a patient's status at baseline and status at the completion of therapy.17 Our results showed little to no correlation between the GROC and baseline FS. Although the correlations of the GROC with the change in FS over the period of treatment were moderate, these associations were primarily due to moderate correlations of the GROC with discharge FS. In all 5 comparisons, the correlations of the GROC with the discharge FS score were greater than or practically the same as the correlations of the GROC with the FS change score. Changes in R² values after the addition of intake values were negligibly low across all recall periods. Although the intake beta coefficients were statistically significant in all but the longest time frame, they diminished sharply beyond the 60-day recall period and consistently were substantially lower in magnitude than the discharge beta values.
This is the first study investigating the validity of using the GROC in routine clinical practice. Results similar to ours have been shown by data from clinical trials.16–18,26,27 Guyatt et al16 examined GROCs for various scales of PRO measures related to chronic respiratory illnesses. They reported mixed findings for global ratings assigned at a 4-week recall period. First, they found that only 14 of 37 possible time periods met their inclusion criterion of a correlation of .5 or greater between the GROC and a change in patient-reported respiratory status. Five of the 14 comparisons were judged to be anomalous because they reflected a relatively symptom-free baseline at the beginning of the allergy season. Of the remaining 9 comparisons, only 3 showed similar correlations of global change with both baseline and posttreatment PRO measures, as would be expected for a valid criterion for change, whereas the other 6 did not. Their results differed from ours in that their stepwise multiple regression analyses showed that in all 9 cases, baseline PRO scores were significant predictors of global ratings, over and above the variance explained by posttreatment scores,16 whereas we found that the predictive validity of baseline scores diminished to nonsignificance at the longest recall period (>180 days) and contributed negligibly (0%–3%) to R² over discharge scores alone.
One explanation for the qualified support of the validity of global ratings in the study of Guyatt et al16 could be that 12 of the 14 analyses of clinical trial data were for a relatively short recall period—4 weeks or less. It has been hypothesized that with shorter recall periods, GROCs would have greater validity.16 Our results based on data from clinical practice offer mixed evidence to support this hypothesis. The correlations of the GROC with FS change scores were highest for the time frame of 1 month or less, but they were only slightly higher than those for longer recall periods, with overlapping confidence intervals in most instances. Even with recall periods of 1 month or less, patients' estimates of change did not appear to adequately reflect their baseline status. It is apparent from Table 3 that the GROC–FS change score correlations simply reflected higher correlations with discharge scores than with baseline scores, confirming patient bias based on the current FS when the GROC was completed.17
Kamper et al18 reported a general trend for stronger correlations of the GROC with change scores on functional measures for shorter recall periods than for longer ones, but in 15 analyses, there were no apparent instances of the expected negative, moderate correlations with intake scores. In their study of clinical trial data, for 3 of the 4 instances when the recall period was 4 to 7 weeks or less, the GROC–discharge score correlations were greater than the GROC–change score correlations, consistent with our results for a recall period of 30 days or less. In the present study, we did not find the expected negative correlations between the GROC and intake FS scores in any of the 5 recall periods, which ranged from 0 to 30 days to greater than 6 months.
The R² values were modest for the 0- to 30-day period and declined by 6% or more for all but 1 of the longer recall periods, whether based on FS change scores or discharge FS scores only. The use of discharge FS scores alone allowed modestly accurate predictions of patient GROC scores, with R² values equal to or better than those obtained with FS change scores for all recall periods. The addition of baseline FS scores to the model with discharge scores improved predictive ability only marginally, with R² values improving by 3% or less in each case. Kamper et al18 obtained similar results in their analyses of data from clinical trials, with increases in R² values of 5% or less in 13 of 15 analyses when intake disability scores were added to a regression model with Global Perceived Effect scale scores as the dependent variable.
In support of the validity of the GROC, in the present study, the standardized regression coefficients indicated the expected pattern of association with GROC scores, with negative coefficients for intake FS scores and positive coefficients for discharge FS scores. This finding indicates that higher GROC scores would be associated with lower intake scores and higher discharge scores, as expected for a valid change criterion. In 12 of 15 analyses, Kamper et al18 obtained the same pattern for the prediction of Global Perceived Effect scale scores by disability scores over various recall periods. In all cases in the present study and that of Kamper et al,18 however, the standardized coefficients for the discharge scores were substantially higher in magnitude than those for the intake scores, and the overall accuracy of the model was low to moderate. Guyatt et al16 showed that standardized regression coefficients for pretreatment and posttreatment quality-of-life measures for respiratory conditions were similar in magnitude in 8 of 14 analyses of clinical trials, with all 14 also having opposite signs.
One possible explanation for the overall low correlations between the GROC and FS change scores is that patients included aspects of health beyond function when determining their global ratings. The GROC presented to patients in the FOTO database directs patients to rate the “overall change during the treatment for this condition.” It has been suggested that an advantage of the GROC is that it allows patients to decide which aspects of their recovery are important.14 Previous studies of the validity of transitional measures have used general wording that does not direct patients to a specific aspect of their condition15,26–28; this factor may limit their correlations with more specific measures of symptoms or function. However, an examination of studies of global ratings in which the domain of health was specified in the root question, such as symptoms (eg, pain, cough severity)16,18,29,30 or physical function,18 did not show greater support for the validity of the GROC. The aspects of health that a patient considers when estimating the amount of change over time are not known. It seems likely that the parameters influencing this estimate vary across patients and perhaps also within patients over time.
Why are transitional measures still used in research and clinical practice, given tepid support for their validity? The answer probably lies in their strong face validity and the intuitive sense that a patient knows best how much change has occurred. The integrity of clinical judgments and research conclusions about responders depends on valid measures. Transitional measures may continue to be used for gauging a patient's perception of change or for their original purpose of quantifying the minimum clinically important difference of a measure; however, their weak to absent validity for representing true change over time must be acknowledged. Our data suggest that transitional ratings such as the GROC have limited validity for determining true change across the various time frames seen in routine clinical practice. We concur with Norman et al17 that a retrospective transitional rating cannot be substituted for a calculation of change between the baseline and the endpoint.
Limitations and Future Research
A limitation of the results of the present study is the potential bias introduced by the violation of the assumption of equal variance between intake and discharge scores, as required by the mathematical proof provided by Guyatt et al.16 Those researchers also found that the variance of the follow-up measurement was greater than that of the baseline measurement in 4 of 14 analyses (P<.05).16 Greater variance in scores would tend to bias correlation upward, perhaps contributing to the discharge score–GROC correlations being larger than the intake score–GROC correlations. However, our use of standardized regression coefficients accounted for any differences in variance between pretest and posttest scores, indicating that our results were robust.18 Because there are challenges to the use and interpretation of standardized regression coefficients, these results must be interpreted in context with the other results of our study. Also, our data had a thin tail, with few values representing negative GROC scores, so we caution against using regression coefficients to predict negative GROC values.
Our results may not be generalizable to other transitional measures because differences in the formats of transitional measures may limit comparability across studies. In a review, Kamper et al14 discussed differences in title, wording, time frame, reference point, and magnitude of scale among various versions used in previous studies. In the FOTO database used in the present study, a global rating of overall change was indicated on a 15-point scale ranging from −7 to +7, with descriptors only in the middle and at each end of the scale. The FOTO instructions ask patients to rate change since the initiation of therapy, whereas other instruments ask for the level of change since the last visit. The latter approach has the benefit of shortening the period of recall, but the former approach is common in clinical practice and may be most relevant to practitioners wanting a measure of overall effect. Our results question the validity of rating change from the start of therapy for determining true change over time. Finally, patients in clinics using this database are asked to rate their FS and degree of change at each clinic visit. It is not known how serial ratings over the course of multiple clinic visits affect the validity of overall pretreatment and posttreatment PROs.
Conclusions and Clinical Implications
This analysis of routine clinical data from patients receiving physical therapy for musculoskeletal problems of the knee does not support the validity of the use of transitional scales such as the GROC for determining true change over time in routine clinical practice. Our results concur with those of previous studies of data from clinical trials, with correlations between transitional ratings and PRO scores being strongly biased by a patient's status at discharge and being unrelated to the patient's status at baseline. The GROC performed only marginally better for a recall period of 30 days or less than for longer recall periods. Clinicians should be aware of this bias and choose alternative methods for gauging the overall level of change over time in their patients.
Footnotes
Dr Schmitt provided concept/idea/research design, project management, and institutional liaisons. Both authors provided writing and data analysis. Dr Abbott provided consultation (including review of manuscript before submission). The authors thank Judy Holder and the staff at Focus on Therapeutic Outcomes for their assistance in obtaining the relevant data for this study. They also acknowledge posthumously the encouragement and support that Dennis Hart (1948–2012) gave to the inception of this project and all of his contributions to the measurement and analysis of health care outcomes.
The protocol for the secondary use of FOTO data was reviewed and approved by the St Catherine University Institutional Review Board.
A platform presentation of this research was given at the Combined Sections Meeting of the American Physical Therapy Association; January 21–24, 2013; San Diego, California.
Dr Abbott was supported in part by a Sir Charles Hercus Health Research Fellowship from the Health Research Council of New Zealand.
- Received April 30, 2013.
- Accepted November 8, 2013.
- © 2014 American Physical Therapy Association