Abstract
Assessing patient progress is an integral part of physical therapist practice. In an attempt to assist clinical decision making regarding a patient's change status, researchers have offered study-based threshold change values. Often researchers have provided reliability and diagnostic test–based estimates of threshold change values obtained from the same patient sample. A potential dilemma occurs when the reliability (ie, the minimal detectable change [MDC])–based threshold change value exceeds the diagnostic test–based threshold value. How can a change be detected if the threshold change value falls within the limits of error? In this situation, researchers have recommended using the larger MDC threshold change value. In this perspective article, we describe the interpretation of the threshold values provided by each of these estimation methods and consider which one offers information that is more relevant to the challenge faced by physical therapists when making decisions concerning the change status of patients. The context for our discussion is a clinical vignette that depicts the dilemma outlined above. We conclude this perspective with suggestions for researchers concerning essential information to include when reporting threshold estimates obtained from reliability-based and diagnostic test–based studies of outcome measures.
Physical therapists routinely make decisions about the change status of patients. The typical process involves comparing the current and previous assessments' values for the outcome of interest. If the difference between assessments meets a preconceived threshold value, a patient is labeled as having changed, and this impression is recorded in the medical record. In the past, threshold change values were based on clinical experience. Today, study-based threshold change values are available for many outcomes.1–4
Although a number of threshold estimation methods exist, the 2 most frequently reported originate from the test-retest reliability method and diagnostic test methodology (DTM), where the goal is to “diagnose change.”5 Investigators often have applied both methods to the same patient sample. In such situations, it is not uncommon for the 2 methods to yield different threshold values.6–9 An apparent conundrum arises when the reliability-based threshold value, referred to as the “minimal detectable change” (MDC), exceeds the diagnostic test method's threshold value.6–9 When this situation has been encountered, investigators have advocated applying the larger MDC threshold value.6,10 The rationale for using MDC over the alternative estimate is that true change cannot be detected if the measured change lies within the bounds of measurement error as defined by MDC. An example of this is found in a study by Young et al.6 These investigators estimated the change threshold for the Neck Disability Index (NDI) to be 10.2 points applying the MDC at a 90% confidence level (MDC90) (ie, 90% of patients who are truly stable will display random fluctuations between test and retest of 10.2 or fewer NDI points) and 7.5 points using DTM. Young and colleagues went on to recommend using the MDC estimate of 10.2 change points when making clinical decisions because the DTM estimate of 7.5 change points was within the bounds of measurement error.
Given that researchers have obtained different threshold values for MDC and DTM when estimated on the same patient sample, the purpose of this perspective article is to compare the information conveyed by these methods and examine which interpretation is more consistent with the information physical therapists need to know when interpreting changes in scores obtained from outcome measures. We set the scene for our discussion by introducing a vignette to illustrate the dilemma encountered by physical therapists when reliability and DTM estimation methods produce different threshold change values. The vignette is followed by a discussion of the information conveyed by each estimation method and suggestions for researchers when reporting the results from threshold value estimation studies.
Clinical Vignette
In this vignette, we refer to a hypothetical measure named the Hip Knee Ankle Scale (HKAS), a patient-reported outcome measure designed to quantify lower-extremity functional status. Values for this measure vary from 0 to 20, with higher values representing higher levels of functional status.
The Physical Therapist's Dilemma
A physical therapist is treating Mr Smith, a patient originally seen 5 days after an ankle sprain. His HKAS value was 11/20 when assessed 1 week ago, and today it is 14/20. The physical therapist chose a 1-week reassessment interval because in her experience the typical patient would undergo an improvement over this period. However, the physical therapist has not used the HKAS previously and wonders whether the measured 3-point difference is likely to represent a true improvement. A colleague provides the therapist with the results of a recently published study.
Hypothetical Study Précis
Purpose.
The aim of the study was to estimate an improvement threshold value for the HKAS when applied to patients after ankle sprain.
Methods.
The study sample consisted of 300 patients with ankle sprains who were administered the HKAS at their initial visit and 1 week later. At the 1-week follow-up visit, a reference standard for change was applied that classified patients as worsened, about the same, or improved. The HKAS change (difference) value was calculated for each patient by subtracting the initial assessment's value from the follow-up assessment's value. Thus, positive changes represent improvement.
A threshold for true improvement was calculated using reliability and DTM estimation methods. The first approach applied a test-retest reliability study design and defined the threshold value for improvement as the MDC90 value. Patients categorized by the reference standard as “about the same” were included in this analysis. The MDC90 value was calculated by multiplying the standard deviation of HKAS difference values (SDdiff) between test and retest by 1.65, the Z-value (ie, standard normal deviate) associated with a 90% confidence value. The SDdiff is equal to the standard error of measurement times the square root of 2.11
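To make the computation concrete, here is a minimal sketch in code; the test and retest values are hypothetical stand-ins, not data from the study.

```python
import numpy as np

# Hypothetical HKAS values for patients classified as "about the same."
test = np.array([11, 14, 9, 13, 12, 10, 15, 8])
retest = np.array([12, 13, 9, 14, 11, 11, 14, 9])

diff = retest - test
sd_diff = diff.std(ddof=1)        # SD of test-retest difference scores
sem = sd_diff / np.sqrt(2)        # standard error of measurement
mdc90 = 1.65 * sd_diff            # equivalently, 1.65 * SEM * sqrt(2)
print(f"MDC90 = {mdc90:.1f} HKAS points")
```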
In addition, a threshold value for improvement was calculated using DTM. Patients categorized by the reference standard as “improved” were compared with those categorized as “about the same” or “worsened.” Receiver operating characteristic (ROC) curve analysis was applied, and sensitivity and specificity values were calculated for all reported HKAS change values. Applied in this context, sensitivity represents the number of patients correctly identified by the HKAS as having improved divided by all of those labeled by the reference standard as having improved. Specificity denotes the number of patients correctly identified by the HKAS as not having improved divided by all of those labeled by the reference standard as not having improved (ie, about the same or worsened). An ROC curve plots sensitivity on the y-axis against 1 − specificity on the x-axis. Consistent with the works of others, the threshold improvement value was defined as the HKAS change score that jointly maximized sensitivity and specificity.4–6,8,9 Graphically, the change score for the data point on the ROC curve closest to the upper left-hand corner of the graph represents the joint maximization of sensitivity and specificity.
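The threshold selection step can be sketched in code as well. The change scores below are simulated under assumed normal distributions (they are not the study's data), and the selection rule implemented is the graphical one described above: the cut point closest to the upper left-hand corner of the ROC plot.

```python
import numpy as np

# Simulated change scores: 1 = improved, 0 = not improved (assumed data).
rng = np.random.default_rng(0)
change = np.concatenate([rng.normal(4.0, 2.3, 200).round(),   # improved
                         rng.normal(0.0, 2.7, 100).round()])  # not improved
truth = np.concatenate([np.ones(200), np.zeros(100)])

best = None
for t in np.unique(change):
    sens = np.mean(change[truth == 1] >= t)   # improved correctly flagged
    spec = np.mean(change[truth == 0] < t)    # not-improved correctly flagged
    dist = np.hypot(1 - spec, 1 - sens)       # distance to corner (0, 1)
    if best is None or dist < best[0]:
        best = (dist, t, sens, spec)

print(f"threshold >= {best[1]:.0f}: "
      f"sensitivity {best[2]:.0%}, specificity {best[3]:.0%}")
```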
Results.
Of the 300 participants, 100 were classified by the reference standard as “about the same,” and the remaining 200 were classified as “improved.” No patients were categorized as “worsened.” The mean (SD) difference (ie, follow-up value − initial value) for the “about the same” group was 0 (2.7) HKAS points. The MDC90 was estimated to be 4.5 HKAS points (ie, 2.7 × 1.65). Because the HKAS allows integer values only, to maintain a confidence level of at least 90%, an effective change of 5 HKAS points would be required.
The mean change for the 200 patients who improved was 4.0 (2.3) HKAS points. The ROC curve, shown in Figure 1, revealed that sensitivity and specificity were jointly maximized for an improvement of ≥2 HKAS points. Table 1 displays the results for an improvement of ≥2 HKAS points. The corresponding values for sensitivity and specificity were 90% (95% confidence interval [CI]=86%–94%) and 75% (95% CI=67%–83%), respectively. The area under the ROC curve was 0.88 (95% CI=0.83–0.92).
Figure 1. Receiver operating characteristic curve, with potential threshold change values.
Table 1. 2 × 2 Table Results for the Threshold Improvement Value ≥2
Table 2. 2 × 2 Table Used to Illustrate Predictive Value and Post-measure Chance of Improvement Calculations
The Physical Therapist's Dilemma Revisited
After reviewing the results from this study, Mr Smith's physical therapist is not sure whether to use a threshold value of 2 HKAS change points as derived from the DTM analysis or the 5 HKAS change points determined from the MDC analysis. If the value of 2 is applied, Mr Smith's functional status will be labeled as improved; however, if a threshold value of 5 is applied, Mr Smith's functional status will be labeled as not improved.
What Do the MDC and DTM Estimates Convey?
To assist the physical therapist with this challenge, we now provide interpretations of the threshold estimates produced by the 2 methods and relate their interpretations to the question facing the physical therapist: “Is Mr Smith's 3-point HKAS improvement more likely to be associated with a patient who has truly improved or with a patient who has not improved?” Implicit in this question is the realization that some patients who are truly unimproved will display HKAS change scores greater than the improvement threshold value and that other patients who are truly improved will provide HKAS change scores less than the improvement threshold value. That some patients will be misclassified is inevitable. No threshold is likely to be 100% accurate.
Interpretation of the MDC Threshold Value
We suspect that one of the attractions of MDC is that many individuals interpret it as reporting the confidence in correctly labeling a patient's change status. It does not. Because MDC is calculated from only patients who are unchanged, it represents the variability in difference values between test and retest for a predetermined percentage of patients who are truly unchanged. The percentage level chosen by researchers is admittedly arbitrary; however, 90% (MDC90) and 95% (MDC95) levels are typically reported. The interpretation of MDC90 is that 90% of patients who are unchanged will display random fluctuations within its bounds, or that 10% of patients who are truly unchanged will display fluctuations greater than MDC90. Figure 2 displays the distribution of difference scores between test and retest for our 100 patients who were unchanged. The MDC90 is superimposed on this distribution. The interpretation of MDC is with respect to patients known to be unchanged, and it is our contention that this is distinctly different from the challenge faced by physical therapists. In clinical practice, a change score for an outcome measure is obtained on a patient, and the physical therapist must decide whether this measured change is more likely to be associated with a patient who is unchanged or one who is changed.
Figure 2. Distribution of difference values for patients classified as unchanged. MDC90=minimal detectable change at a 90% confidence level.
Interpretation of the DTM Threshold Value
Whereas MDC is based on a single distribution of patients who are unchanged, DTM's threshold value is obtained from 2 distributions (Fig. 3). When the goal is to identify an improvement threshold value, one distribution consists of patients who are improved and the second distribution consists of patients who are unimproved. The threshold change value is derived from a decision rule involving sensitivity and specificity. Although many possible rules could be stated, the rule uniformly applied in threshold change value studies to date is to jointly maximize sensitivity and specificity.4–6,8,9 Sensitivity is defined in terms of patients who are improved, and specificity is defined in terms of patients who are unimproved (ie, those who remain unchanged or have worsened). Because sensitivity and specificity are referenced in terms of known distributions, knowledge of their values does not directly answer the physical therapist's question. Presumably, the 2 distributions will overlap: some patients who are unimproved will display change scores greater than the threshold value, and some patients who are improved will report change scores below the threshold value. This concept is shown in Figure 3, which displays the distributions of patients who are unimproved and those who are improved from the hypothetical study.
Figure 3. Change value distributions for patients who were unimproved and those who were improved (prevalence of improvement=66.7%). HKAS=Hip Knee Ankle Scale.
To answer the physical therapist's question, information from sensitivity and specificity must be combined with a patient's pre-measure chance of improvement. By pre-measure chance of improvement, we mean the chance that a therapist assigns to a patient as having improved prior to administering and interpreting the outcome measure's result. There are 2 ways of combining this information. One method calculates predictive values, and the other approach calculates the post-measure chance of improvement. We will frame our discussion in terms of predictive values: they can be calculated directly from Figures 3 and 4. Details concerning the post-measure chance of improvement and its relationship to predictive values are provided in the Appendix.
Figure 4. Change value distributions for patients who were unimproved and those who were improved standardized to a pre-measure chance of improvement of 50%. HKAS=Hip Knee Ankle Scale.
The positive predictive value represents those patients correctly identified by the measure as improved divided by all patients identified by the measure as improved (ie, those to the right of the vertical line in Fig. 3). For the study sample, the positive predictive value is 88%. The negative predictive value represents those patients correctly identified by the measure as unimproved divided by all patients identified by the measure as unimproved (ie, those to the left of the vertical line in Fig. 3). For the study sample, the negative predictive value is 79%.
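These values follow directly from the study's counts: a sensitivity of 90% applied to the 200 patients who improved gives 180 true positives (and 20 false negatives), and a specificity of 75% applied to the 100 patients who did not improve gives 75 true negatives (and 25 false positives). Hence:

```latex
\mathrm{PPV} = \frac{180}{180 + 25} \approx 0.88, \qquad
\mathrm{NPV} = \frac{75}{75 + 20} \approx 0.79
```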
An essential point to realize when making decisions concerning individual patients is that, unlike sensitivity and specificity, which are not affected by the prevalence or pre-measure chance of improvement, predictive values are influenced by the prevalence or pre-measure chance of improvement. In a research study, the prevalence of patients who are truly improved is simply the percentage of patients who underwent a true improvement during the course of the study. In clinical practice, a therapist works with one patient at a time and in so doing must form an impression of a patient's chance of improving over a specified reassessment interval. One approach for doing this is to estimate the interval over which the typical patient with characteristics similar to those of the patient of interest would be expected to meet the threshold change value. This expectation can be written in the form of a measurable goal. For example, "To increase Mr Smith's HKAS score by x points in y weeks," where "x" is equal to the change threshold and "y" is the number of weeks over which the typical or average patient would be expected to improve x points. By framing the reassessment interval in terms of the typical or average patient's change profile, the therapist is effectively setting the pre-measure chance of improvement at 50%.12
Figure 4 displays the distribution of change scores standardized to a pre-measure chance of improvement of 50%. This figure shows 100 patients who are unimproved and 100 patients who are improved, applying the same sensitivity and specificity values obtained from the hypothetical study. Note that the improved group in Figure 4 is composed of one half the number of patients who are improved in Figure 3. This difference in prevalence affects the number of patients in the lower panel to the right and left of the vertical line and, therefore, affects the positive and negative predictive values.
In clinical practice, there are several important reasons for considering a pre-measure chance of improvement of 50%. First, 50% represents maximum uncertainty in whether a patient has improved or not. Thus, the information gain (ie, the difference in a patient's chance of improving between the pre-measure estimate of change and the predictive value result) will be maximized. Second, because predictive values are influenced by the pre-measure chance of improvement—as the pre-measure chance of improvement increases, the positive predictive value increases and the negative predictive value decreases—selecting a pre-measure chance of improvement of 50% allows a direct representation of whether application of the threshold change value does better at assisting to rule in or rule out an improvement. The third reason pertains to the post-measure chance of improvement method discussed in the Appendix. When the pre-measure chance of improvement equals 50%, the post-measure odds of improvement equal the likelihood ratio (see Appendix for elaboration).
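To make the arithmetic concrete: when the pre-measure chance of improvement is 50%, the improved and unimproved groups carry equal weight, and the predictive values reduce to simple functions of sensitivity and specificity. With the hypothetical study's values of 90% and 75%:

```latex
\mathrm{PPV}_{50} = \frac{\mathrm{sens}}{\mathrm{sens} + (1 - \mathrm{spec})}
                  = \frac{0.90}{0.90 + 0.25} \approx 0.78,
\qquad
\mathrm{NPV}_{50} = \frac{\mathrm{spec}}{\mathrm{spec} + (1 - \mathrm{sens})}
                  = \frac{0.75}{0.75 + 0.10} \approx 0.88
```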
Returning to the vignette, we see that based on clinical experience, the therapist scheduled Mr Smith's reassessment to take place when the typical patient would be expected to show improvement. In our example, the therapist had not used the HKAS previously; accordingly, her expectation of change was based on experience and other rich information gained in the course of clinical practice. Combining a pre-measure chance of improvement of 50% with the reported sensitivity and specificity values obtained for a threshold change value of ≥2 points yields positive and negative predictive values of 78% and 88%, respectively. Given that Mr Smith's change score of 3 points exceeded the threshold change value, the therapist interprets this as a true improvement with a substantial level of confidence. Had the change score been less than the threshold change value, the therapist could have concluded with a high level of confidence (the negative predictive value is 88%) that Mr Smith had not improved. Sample predictive value and post-measure chance of improvement calculations standardized to a pre-measure chance of improvement of 50% are shown in the Appendix. Although these calculations may appear quite onerous, they can be easily performed using an inexpensive (99 cents at the time of this writing) application software program for mobile devices called "twoBYtwo."13
Reconciling the Difference in Interpretations of MDC and DTM Threshold Change Estimates
In this section, we provide a brief comparison of the information offered by the 2 methods. When the natural history of a condition favors improvement, few patients will display worsening, and the distributions of patients who are unchanged and those who are unimproved will be similar. In our example, none of the patients worsened. When this is the case, the distribution of difference scores used to estimate MDC is identical to the distribution used to calculate specificity, which can be seen by comparing the distribution shown in the upper panel of Figure 3 with the distribution displayed in Figure 2. Without knowledge of the location and dispersion of the improved patient distribution, the MDC's usefulness in providing a threshold value for improvement is uncertain. For example, if the distribution of patients who are improved displayed considerable overlap with the MDC distribution, as would be the case if the change scores of patients who are improved resembled the change scores of those who are unimproved, the MDC threshold change value would not discriminate between the 2 groups.
In summary, when deciding on a patient's improvement status, information from both unimproved and improved distributions is required. Because MDC is based on a single distribution consisting entirely or almost entirely of patients who are unchanged, it is incapable of providing the chance that a patient with a given change score has improved.
Suggestions for Researchers
In their eagerness to provide threshold change values, researchers often have left out important information,6–10 or at times have applied an incorrect interpretation to the threshold value.8 For MDC to have meaning, the distribution of difference scores between test and retest must be consistent with a normal distribution and centered at or close to zero. If the distribution is not consistent with a normal distribution, it is not appropriate to estimate MDC using a Z-value (ie, a standard normal deviate). If the mean difference between test and retest values differs by a meaningful amount from zero, either the patients have truly changed or there is a drift associated with the measure. Either way, a systematic difference between test and retest values calls into question the interpretation and usefulness of MDC. Accordingly, when estimating MDC, it is necessary to verify these assumptions. As noted above, investigators have at times applied an incorrect interpretation to MDC.8 The MDC does not convey the chance that the measured change is real; rather, it provides an expected range for random fluctuations between test and retest for patients who are truly unchanged.
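As a sketch of what verifying these assumptions might look like in code (the difference scores are hypothetical, and the choice of tests is ours, not a requirement of the method):

```python
import numpy as np
from scipy import stats

diff = np.array([1, -2, 0, 3, -1, 0, 2, -3, 1, -1])  # retest - test (hypothetical)

# Normality: a non-significant Shapiro-Wilk result is consistent with
# (though does not prove) normally distributed difference scores.
w, p_norm = stats.shapiro(diff)

# Systematic change: test whether the mean difference departs from zero.
t, p_zero = stats.ttest_1samp(diff, 0.0)

print(f"Shapiro-Wilk p = {p_norm:.2f}; mean-difference t test p = {p_zero:.2f}")
```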
Another consideration when reporting MDC is to acknowledge that it is a point estimate of the population value. The confidence in the reported value of MDC is directly related to the confidence in SDdiff, or more correctly SDdiff² (ie, the variance). For this reason, we believe that although it is appropriate to comment on the point estimate of MDC, it is necessary to present a CI for MDC in order to convey the range over which the population value is likely to lie. The narrower the CI, the more likely it is that the point estimate provides a reasonable representation of the population value. For our hypothetical study, MDC90 was 4.5 HKAS points; applying a 95% CI to this estimate yields an interval of 3.9 to 5.2 HKAS points. Thus, the population MDC90 could be as low as 3.9 points or as high as 5.2 points.
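The method behind such a CI is not dictated here; one standard approach, assumed in the sketch below, builds the interval from the chi-square distribution of the sample variance of the 100 test-retest difference scores:

```python
import numpy as np
from scipy.stats import chi2

n = 100          # patients classified as "about the same"
sd_diff = 2.7    # SD of test-retest difference scores
z90 = 1.65       # Z-value used for MDC90 in the text
df = n - 1

# 95% CI for the population SD via the chi-square distribution of the variance
lo_sd = sd_diff * np.sqrt(df / chi2.ppf(0.975, df))
hi_sd = sd_diff * np.sqrt(df / chi2.ppf(0.025, df))

print(f"MDC90 point estimate: {z90 * sd_diff:.1f} HKAS points")           # 4.5
print(f"95% CI for MDC90: {z90 * lo_sd:.1f} to {z90 * hi_sd:.1f} points") # 3.9 to 5.2
```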
When applying DTM, it is not uncommon for investigators to report a threshold value for improvement without providing information concerning sensitivity and specificity.10,14,15 Because this practice does not allow the calculation of predictive values or post-measure chance of improvement, we advocate the reporting of sensitivity and specificity (or equivalently likelihood ratios) and their CIs. Of equal importance is the conversion of this information to predictive values or post-measure chance of improvement standardized to a pre-measure chance of improvement of 50%. For our hypothetical study, the positive predictive value adjusted to a prevalence of 50% was 78%. The 95% CI for this estimate is 72% to 83% given a total sample size of 200 patients.
We also encourage researchers to provide a figure containing the ROC curve with the threshold values for each sensitivity and 1 − specificity pairing superimposed on the curve (ie, similar to Fig. 1). This added information allows readers to calculate the confidence in clinical decisions based on threshold values other than the one advocated by the study's authors. Using Mr Smith's improvement score of 3 HKAS change points as an example, and referring to Figure 1, Mr Smith's physical therapist could estimate the sensitivity and specificity values to be approximately 73% and 83%, respectively. Combining these values with a pre-measure chance of improvement of 50% produces a positive predictive value of 81%, which is greater than the 78% obtained when a threshold value of 2 was applied.
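The same 50% standardization as before confirms that figure:

```latex
\mathrm{PPV}_{50} = \frac{0.73}{0.73 + (1 - 0.83)} = \frac{0.73}{0.90} \approx 0.81
```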
Our final comment addresses a reality encountered by researchers when estimating MDC and DTM threshold change values. For many important patient outcomes, including pain, functional status, and health-related quality of life, it is generally agreed that no error-free reference standard exists. Accordingly, investigators have applied a construct validation process that has included retrospective,6 prognostic,16 and concurrent reference standards of change.5 These imperfect standards have been applied to identify both patients who are changed and those who are unchanged. We believe that it is important for researchers to acknowledge the limitation of using any one of these methods and, where possible, to obtain threshold estimates using multiple reference standards. Confidence in an advocated threshold value can be increased when different reference standards yield similar results. An excellent example of this concept is applied in the study by Grotle et al,16 who used both prognostic and retrospective reference standards.
Summary
In this perspective article, we have described a framework for interpreting reliability and DTM threshold change values. We have shown that only DTM has the potential to provide information consistent with the challenge faced by clinicians when contemplating whether a patient has improved or not. These chances for improvement are provided by predictive values and post-measure chance of improvement, and we have noted that these values are dependent on the pre-measure chance of improvement. Finally, we have recommended that researchers comment on the requisite assumptions associated with applied estimation methods and provide results in a format that can be directly applied to clinical decision making.
Appendix.
Predictive Value and Post-measure Chance of Improvement Calculations
Background for Calculating Predictive Values and Post-measure Chance of Improvement From Sensitivity, Specificity, and Pre-measure Chance of Improvement
In this appendix, we explain how predictive values and post-measure chance of improvement are calculated. However, we suggest that interested readers apply easy-to-use application software programs such as twoBYtwo cited in this perspective article or use a nomogram.17 A worked sketch in code follows each set of steps below.
Steps taken to calculate predictive values (refer to Table 2):
Step 1. Choose a convenient sample size for N. We chose 2,000.
Step 2. Calculate the number of patients who truly improved by multiplying N by the pre-measure chance of improvement: (a+c)=N × pre-measure chance of improvement.
Step 3. Calculate the number of patients who truly did not improve by subtracting those who improved from N: (b+d)=N−(a+c).
Step 4. Calculate the number of patients the measure correctly identified as having improved by multiplying all those who truly improved by the sensitivity: a=(a+c) × sensitivity.
Step 5. Calculate the number of patients the measure incorrectly identified as not having improved by subtracting those who were correctly identified as having improved from all those who truly improved: c=(a+c)−a.
Step 6. Calculate those patients the measure correctly identified as not having improved by multiplying all of those who truly did not improve by the specificity: d=(b+d) × specificity.
Step 7. Calculate those patients the measure incorrectly identified as having improved by subtracting those the measure correctly identified as not having improved from all of those who truly did not improve: b=(b+d)−d.
Step 8. Calculate the positive predictive value by dividing the number of patients correctly identified by the measure as having improved by all patients identified by the measure as having improved: positive predictive value=a/(a+b).
Step 9. Calculate the negative predictive value by dividing the number of patients correctly identified by the measure as not having improved by all patients identified by the measure as not having improved: negative predictive value=d/(c+d).
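As noted above, here is a minimal sketch of steps 1 through 9 in code, using the hypothetical study's sensitivity (90%) and specificity (75%) and a pre-measure chance of improvement of 50%:

```python
# Sketch of Appendix steps 1-9 (predictive values from sensitivity,
# specificity, and the pre-measure chance of improvement).
sensitivity, specificity, pre_chance, N = 0.90, 0.75, 0.50, 2000  # Step 1

improved = N * pre_chance           # Step 2: (a + c)
not_improved = N - improved         # Step 3: (b + d)
a = improved * sensitivity          # Step 4: correctly flagged as improved
c = improved - a                    # Step 5: missed improvements
d = not_improved * specificity      # Step 6: correctly flagged as not improved
b = not_improved - d                # Step 7: incorrectly flagged as improved

ppv = a / (a + b)                   # Step 8
npv = d / (c + d)                   # Step 9
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")   # PPV = 78%, NPV = 88%
```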
Calculating Post-measure Chance of Improvement From Likelihood Ratios and Pre-measure Chance of Improvement
Step 1. Convert the pre-measure chance of improvement to the pre-measure odds of improvement: pre-measure odds of improvement=[(a+c)/(b+d)]:1.
Step 2. Calculate the positive likelihood ratio (LR+): LR+=sensitivity/(1−specificity). LR+ is the likelihood ratio for a change score meeting or exceeding the threshold improvement value.
Step 3. Calculate the negative likelihood ratio (LR−): LR−=(1−sensitivity)/specificity. LR− is the likelihood ratio for a change score not meeting the threshold improvement value.
Step 4. Calculate the post-measure odds for a change score meeting or exceeding the threshold improvement value: post-measure odds threshold met = pre-measure odds × LR+.
Step 5. Calculate the post-measure odds for a change score not meeting the threshold improvement value: post-measure odds threshold not met = pre-measure odds × LR−.
Step 6. Convert the post-measure odds to the post-measure chance of improvement: post-measure chance of improvement = post-measure odds/(post-measure odds +1).
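And a companion sketch of the likelihood ratio steps. At a 50% pre-measure chance, the pre-measure odds are 1, so the post-measure odds equal the likelihood ratios themselves:

```python
# Sketch of the likelihood ratio steps (post-measure chance of improvement).
sensitivity, specificity, pre_chance = 0.90, 0.75, 0.50

pre_odds = pre_chance / (1 - pre_chance)             # Step 1: odds = 1.0
lr_pos = sensitivity / (1 - specificity)             # Step 2: LR+ = 3.6
lr_neg = (1 - sensitivity) / specificity             # Step 3: LR- = 0.13

odds_met = pre_odds * lr_pos                         # Step 4
odds_not_met = pre_odds * lr_neg                     # Step 5

chance_met = odds_met / (odds_met + 1)               # Step 6: about 78%
chance_not_met = odds_not_met / (odds_not_met + 1)   # about 12%
print(f"threshold met: {chance_met:.0%}; not met: {chance_not_met:.0%}")
```

The printed chances (78% and 12%) match the positive predictive value and 1 minus the negative predictive value from the previous sketch, illustrating the relationship stated below.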
Relationship Between Predictive Values and Post-measure Chance of Improvement
The post-measure chance of improvement given the threshold change value has been met is equal to the positive predictive value.
The post-measure chance of improvement given the threshold change value has not been met is equal to 1 minus the negative predictive value.
Footnotes
- Both authors provided concept/idea/project design and writing.
- Received January 2, 2012.
- Accepted June 25, 2012.
- © 2012 American Physical Therapy Association