# On “Responsiveness of the Balance Evaluation Systems Test (BESTest) in people with subacute stroke.” Chinsongkram B, Chaikeeree N, Saengsirisuwan V, et al. Phys Ther. doi: 10.2522/ptj.20150621.

- P.W. Stratford, PT, MS, School of Rehabilitation Science, McMaster University, Hamilton, Ontario, Canada.

- Address all correspondence to Mr Stratford at: stratfor{at}mcmaster.ca.

Clinicians and researchers are inundated with different measures intended for the same purpose. This leads to the question: “Which measure should I use from the pool of competing measures?” In addition to feasibility, the choice of measure will likely be guided by the extent to which valid inferences can be drawn from a measured value or change score. Chinsongkram and colleagues^{1} examined this issue for competing measures used on patients with stroke.

The study by Chinsongkram and colleagues^{1} compared the responsiveness of several balance tests in people with subacute stroke. These investigators adopted the framework proposed by Husted and colleagues,^{2} which views internal and external responsiveness as different concepts. Internal responsiveness is estimated from a single-group before-after study design and is quantified by an effect size. Consistent with this representation of responsiveness is the unstated assumption that the sample's change scores come from the same population: that is, the sample of participants is derived from one population and therefore shares the same population mean. The standardized response mean (SRM), a commonly used responsiveness statistic, is calculated as the mean change divided by the standard deviation of the change scores; the signal is in the numerator and the noise in the denominator. Framed in this context, the denominator of the SRM represents an estimate of the variability in responses for patients who share the same population mean change score.
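The SRM calculation described above can be sketched in a few lines. The data below are invented for illustration and are not the article's; the point is only the arithmetic: mean change over the sample standard deviation of the change scores.

```python
# Minimal sketch (hypothetical data, not from the article): the standardized
# response mean (SRM) for a single-group pretest-posttest design.
# SRM = mean(change) / SD(change); under the internal-responsiveness
# assumption, every patient shares the same population mean change.
from statistics import mean, stdev

pre =  [30, 42, 55, 38, 47, 33, 60, 41]   # hypothetical pretest scores
post = [36, 49, 60, 45, 70, 55, 82, 58]   # hypothetical posttest scores

change = [b - a for a, b in zip(pre, post)]
srm = mean(change) / stdev(change)  # stdev uses the sample (n-1) formula
print(round(srm, 2))
```

Note that `stdev` treats the pooled change scores as one homogeneous sample; the letter's later argument is precisely about what happens when that assumption fails.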

In contrast, external responsiveness differs from internal responsiveness in 2 ways. First, a key underlying assumption is that the study sample is drawn from 2 populations and that the mean change for these 2 subsamples differs. Chinsongkram and colleagues^{1} conceived of 2 groups in their second sample based on a Berg Balance Scale (BBS) change score >7 or ≤7. Second, external responsiveness designs are more rigorous than internal designs because they attempt to answer the question "To what extent can the measure differentiate between samples of patients whose change scores truly differ?" rather than the question "Can the measure detect change?" Typical analyses used to evaluate between-group differences include receiver operating characteristic curves, *t* tests for independent sample means, and Norman's S.^{3,4}
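One of the between-group analyses named above, the independent-samples *t* test, can be sketched as follows. The change scores and the cut-score grouping are hypothetical, invented only to mirror the design (improved vs not improved on an external criterion such as a BBS change >7).

```python
# Hedged sketch of an external-responsiveness analysis: an independent-samples
# t test comparing change scores between two criterion-defined groups.
# All data are hypothetical.
from statistics import mean, variance
from math import sqrt

improved     = [22, 17, 23, 22]   # change scores, BBS change > 7 (hypothetical)
not_improved = [6, 7, 5, 7]       # change scores, BBS change <= 7 (hypothetical)

def pooled_t(a, b):
    """Two-sample t statistic with a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(sp2 * (1 / na + 1 / nb))

print(round(pooled_t(improved, not_improved), 2))
```

A large *t* statistic here would indicate that the measure separates patients whose change scores truly differ, which is the external-responsiveness question.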

Rather than viewing internal and external responsiveness as 2 separate aspects of responsiveness, a different conceptualization is that they represent a hierarchy in design options when the goal is to evaluate the extent to which valid inferences can be drawn from a measure's change score.^{3} Using the study hierarchy approach, a problem arises when internal and external analyses are applied and interpreted on the same patient sample. Specifically, if patients or subsamples of patients truly change by different amounts (as was the case when using the BBS cut-score of 7), the subsamples do not share the same population change mean. A consequence is that the denominator of the SRM, which is intended to contain noise only, now contains both noise and signal (ie, the true difference between subsamples' means).^{5} I will illustrate this with the following example.
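The contamination of the SRM denominator can be written as a simple variance decomposition. As a sketch, assume two equal-sized subgroups with mean changes $\bar{d}_1$ and $\bar{d}_2$ and use population (divide-by-$N$) variances; the symbols below are mine, not the letter's:

$$
\sigma^2_{\text{pooled}} \;=\; \sigma^2_{\text{within}} \;+\; \frac{(\bar{d}_1 - \bar{d}_2)^2}{4}
$$

The first term is noise; the second is signal. Whenever the subgroup means differ, the pooled change-score standard deviation in the SRM denominator exceeds the within-group noise, and the SRM is attenuated.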

Table 1 contains hypothetical Balance Evaluation Systems Test (BESTest) scores for 8 patients, consistent with the study design of Chinsongkram and colleagues.^{1} In this illustration, there are 2 subsamples: 4 patients who improved a little and 4 patients who improved a great deal according to the reference standard (ie, BBS change score >7). The mean change for the entire group was 13.50 (SD=10.24), resulting in an SRM of 1.32. Table 2 displays the repeated-measures analysis of variance results for an analysis in which BESTest scores were the dependent variable and patients and occasions (2 levels: pretest, posttest) were factors. Applying the mean square (MS) terms from this table, the mean change can be reproduced as √(2 × MS_occasion / n), or 13.5, and the standard deviation of the change scores as √(2 × MS_residual), or 10.24. Table 3 expands the previous analysis by partitioning the residual error sum of squares (SS) from Table 2 into a group-by-occasion SS, which represents the extent to which the change scores differ between groups (signal), and a residual SS (noise). This analysis reveals that much of the denominator of the SRM calculation actually contains signal, which masks the extent to which valid inferences can be drawn from the results. Accordingly, the SRM analysis in the study by Chinsongkram and colleagues does not capture the measures' relative abilities to assess valid change.
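The partitioning step can be illustrated directly on change scores. The numbers below are invented (they are not Table 1), but they follow the same design: 2 subsamples of 4 patients each. The change-score sum of squares is split into a between-group component (the group-by-occasion signal) and a within-group residual (noise).

```python
# Illustrative sketch (hypothetical data, not the article's Table 1):
# partitioning the change-score sum of squares into a between-group
# "signal" component and a within-group "noise" component.
from statistics import mean

low  = [6, 7, 5, 7]      # change scores, BBS change <= 7 (hypothetical)
high = [22, 17, 23, 22]  # change scores, BBS change > 7 (hypothetical)
all_changes = low + high

grand = mean(all_changes)
ss_total   = sum((d - grand) ** 2 for d in all_changes)
ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in (low, high))  # signal
ss_within  = ss_total - ss_between                                      # noise

print(ss_total, ss_between, ss_within)
```

With these invented numbers, the between-group SS dwarfs the within-group SS, so a pooled change-score SD would consist mostly of signal, which is the letter's point about the SRM denominator.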

Although this illustration questions the SRM application in the current study, it does not detract from the overall conclusion that the BESTest may have an advantage over the Mini-BESTest, particularly as it relates to a floor effect.

## Footnotes

*This letter was posted as an eLetter on May 17, 2016, at ptjournal.apta.org. The letter is responding to the version of the article published ahead of print on April 21, 2016.*

- © 2016 American Physical Therapy Association