We read with interest the article by Benka Wallén et al.1 The aim of their study was to examine the dimensionality of the Mini-BESTest, as well as the properties of its items and their interrelationships, in a sample of elderly individuals with mild-to-moderate Parkinson disease (PD). New studies on the psychometric features of the tool in specific populations are welcome. In particular, the authors have usefully highlighted the peculiar behavior of the Timed “Up & Go” Test with dual task (item 14) in the participants under study. Indeed, this test may even be a sensitive predictor of early impairment in PD2 and merits further attention. However, we are concerned about some critical points related to both data processing and interpretation. We therefore take this opportunity to highlight a series of issues that may have affected the results of both the Rasch analysis and the exploratory factor analysis (EFA), and to propose additional analyses and comments on the data set, in order to find more parsimonious and plausible explanations for some of the results.
Sample
The authors attempted to replicate the calibration of the Mini-BESTest (an instrument aimed at measuring balance deficits) in a sample in which about half of the participants had, at most, minimal balance deficits (Hoehn & Yahr stages 1–2, where, by definition, stage 2 means “bilateral or midline involvement without impairment of balance”) and none had severe or very severe balance deficits (ie, balance ability less than −1.5 logits, in their Fig. 1).1 In such a situation, any statistical analysis reveals more about the specific characteristics of that sample than about the actual general performance of the tool used for balance measurement. Indeed, in a sample such as this one (which was not large), such a restricted set of participants can provide useful, but only partial, information for the calibration process.
As a consequence of the sample characteristics, the following findings emerged: (1) a lack of responses for category 0 (“high difficulty”) in items 1 (“sit to stand”), 7 (“standing, eyes open”), and 10 (“change in gait speed”), which prevented those items from being fully calibrated in the Rasch analysis (because the optimal standards for interpreting rating scale functioning could not be met3) and which required the deletion of item 7 before EFA; and (2) poor targeting (ie, the extent to which the items are appropriately difficult for that sample). Indeed, as expected in a sample with a shortage of severe balance disorders, the sample could not be considered “approximately normally distributed on the latent dimension.” Rather, on visual inspection, the person ability distribution was negatively skewed, with its mean value (Figure) more than 1 logit greater than the mean item difficulty (set at zero, by convention). This drawback produced a series of negative effects on the results.
Figure. Comparison of the frequency distributions of participant ability (balance) between the studies by Benka Wallén et al1 and Franchignoni et al.8 The horizontal axis represents the measure of balance, in linear logit units: from left to right, measures indicate increasing balance, from very severe balance deficit to normal balance. By convention, the average difficulty of the scale’s items (indicated here by the dashed vertical line) is set at 0 logits.
Overall, it is of scant use to perform a validation study on an outcome measure using a relatively small convenience sample (recruited for a different trial) that does not cover the full breadth of the variable under analysis (in this case, balance).3,4 This represents a threat to the generalizability of the research results.
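As an illustration of the targeting problem described above, the offset between the mean person measure and the mean item difficulty, together with the skewness of the person distribution, can be checked directly once person measures (in logits) have been exported from the Rasch software. The following Python sketch uses fabricated values for illustration only; the variable names and data are hypothetical and are not taken from either study.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical person measures (logits), as exported from Rasch software.
# These values are illustrative only; they are not the study's data.
person_measures = np.array([-0.5, 0.8, 1.2, 1.5, 1.9, 2.3, 2.6, 3.0, 3.1, 3.4])

mean_item_difficulty = 0.0  # by convention, mean item difficulty is set at 0 logits

# Targeting: offset between mean person ability and mean item difficulty.
# An offset well above ~1 logit means the test is "too easy" for the sample.
offset = person_measures.mean() - mean_item_difficulty

# Negative skewness signals a pile-up of persons at the able end of the scale.
print(f"targeting offset: {offset:.2f} logits")
print(f"skewness of person measures: {skew(person_measures):.2f}")
```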
Rasch Analysis
Item Fit
Benka Wallén et al1 reported the results of the Rasch analysis of the data set only partially, omitting several customary statistics (including reliability indexes) that are crucial for a full understanding of the analysis. As for item fit, in identifying misfitting items the authors seem to have given the same weight to the information coming from infit and outfit values, whereas it is widely recommended that more emphasis be placed on infit than on outfit values.5,6 All Mini-BESTest infit values were very good (ranging from 0.86 to 1.17) and confirm the internal construct validity of the tool. The low (ie, overfitting) outfit values of item 7 (“standing, eyes open”) and, to a lesser extent, of item 1 (“sit to stand”) simply indicate an overly predictable (ie, locally deficient in stochastic variation), off-target pattern of responses and limited use of the available category options, as expected in a population with a negatively skewed distribution of balance performance.6
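For readers less familiar with the distinction, the two fit statistics weight the same squared standardized residuals differently: outfit is their unweighted mean (and is therefore dominated by responses from persons far from an item’s difficulty), whereas infit is information-weighted toward on-target persons. A minimal Python sketch for the dichotomous Rasch model follows; the Mini-BESTest items are actually polytomous (0–2), so this is a deliberate simplification, and all data shown are fabricated.

```python
import numpy as np

def rasch_fit_statistics(X, theta, b):
    """Infit and outfit mean-squares for a dichotomous Rasch model.

    X: persons-by-items matrix of 0/1 responses
    theta: person abilities (logits); b: item difficulties (logits)
    Returns one (infit, outfit) value pair per item.
    """
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    W = P * (1.0 - P)                        # model variance of each response
    R2 = (X - P) ** 2                        # squared score residuals
    Z2 = R2 / W                              # squared standardized residuals
    outfit = Z2.mean(axis=0)                 # unweighted: sensitive to off-target persons
    infit = R2.sum(axis=0) / W.sum(axis=0)   # information-weighted: on-target persons
    return infit, outfit

# Fabricated example: 8 persons, 3 items
rng = np.random.default_rng(0)
theta = rng.normal(1.0, 1.0, size=8)        # an off-target (too able) sample sits above 0
b = np.array([-1.0, 0.0, 1.0])              # item difficulties
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((8, 3)) < P).astype(float)  # simulate model-fitting responses
print(rasch_fit_statistics(X, theta, b))
```

Because of this weighting, an off-target, negatively skewed sample moves outfit long before it moves infit, which is why overfitting outfit values in such a sample are unremarkable.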
In this context, the message that some items “might overlap with the content of other items” and “exhibit redundancy and may instead compromise the validity of the scores” is misleading. In a 14-item Rasch-validated tool in which all items show good infit values, the overfitting items 1 and 7 do not degrade the quality of measurement,7 particularly if one uses Rasch linear measures obtained from a sample representative of all levels of balance disorders.8 Actually, items 1 and 7 are conceptually valid, do not show any duplication of content with other items, and their standardized residual correlations do not exhibit any item dependency.8 As is clear from the item map, they fill a precise gap in the difficulty hierarchy, useful for producing information in participants with high levels of postural instability (who were not represented in Benka Wallén and colleagues’ sample1). An outcome measure has to be constructed so that it measures ability over the full range of the latent construct. In the case of such overfitting items, Wright and Linacre suggested that “if we are analyzing data from an existing test: take no action, even an item with a very low mean-square tells us a little something that is new and useful.”9 Moreover, the literature confirms the validity and clinical usefulness of the Mini-BESTest in individuals with mild PD or more subtle balance deficits, showing it to be more useful than the Berg Balance Scale in evaluating balance disorders.10
Dimensionality
The analysis of dimensionality using principal component analysis of the Rasch residuals was incompletely reported, which raises some concerns: its interpretation would have been better based on a long series of analyses6 aimed at deciding, step by step and according to expert opinion, whether the possible secondary dimension is large enough to distort measurement and whether it merits 2 different outcome measures (or the omission of those items, if they are off-dimension). In any case, a first contrast with an eigenvalue of 2.04 suggests a possible dimension of only 2 items (which ones, in the present study?), whereas 3 items are widely considered the minimum for a real and reproducible dimension.6,11 Such a weak second dimension (often present without much distortion of the results) cannot be deemed sufficient to declare the tool multidimensional and, according to our experience,8,12 could simply reflect a reasonable contrast between items based on maintenance of upright stance and those related to gait.
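The size of the first contrast is obtained from a principal component analysis of the correlations among standardized Rasch residuals, as dedicated Rasch software does. A simplified dichotomous sketch in Python follows; the actual computation in programs such as WINSTEPS has its own conventions, so this is only an approximation for illustration.

```python
import numpy as np

def first_contrast_eigenvalue(X, theta, b):
    """Largest eigenvalue of a PCA of standardized Rasch residuals
    (dichotomous sketch). An eigenvalue near 2 corresponds to a
    possible secondary 'dimension' of only about 2 items."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    Z = (X - P) / np.sqrt(P * (1.0 - P))     # standardized residuals
    R = np.corrcoef(Z, rowvar=False)         # item-by-item residual correlations
    return np.linalg.eigvalsh(R)[-1]         # eigvalsh returns ascending order
```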
Overall, after having found such a weak second dimension, Benka Wallén et al (instead of moving directly to EFA, as they did) would have been better advised to check the data-set characteristics in more depth (taking full advantage of the Rasch software), to perform a simulation of a Rasch-fitting data set with the same characteristics as their own (to verify how large the secondary dimension really is), and to search for possible outliers or idiosyncratic response patterns.
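A minimal sketch of such a simulation follows, reusing first_contrast_eigenvalue from the previous sketch; the person and item parameters here are fabricated, whereas in practice one would plug in the estimates from the observed data set.

```python
import numpy as np

# Reuses first_contrast_eigenvalue from the previous sketch.
def simulate_rasch_data(theta, b, rng):
    """Responses that fit the dichotomous Rasch model exactly, generated
    from the person and item estimates of the observed data set."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(P.shape) < P).astype(float)

rng = np.random.default_rng(1)
theta = rng.normal(1.0, 1.0, size=100)      # fabricated person measures
b = np.linspace(-2.0, 2.0, 14)              # fabricated difficulties for 14 items
sim_eigs = [first_contrast_eigenvalue(simulate_rasch_data(theta, b, rng), theta, b)
            for _ in range(100)]

# If the observed first-contrast eigenvalue (2.04 in the article) falls within
# this simulated distribution, the secondary "dimension" is no larger than
# chance alone produces in perfectly unidimensional data.
print(np.percentile(sim_eigs, [50, 95]))
```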
Exploratory Factor Analysis
The accuracy of EFA results largely depends on the quality of the many methodological decisions that a researcher must make in order to complete the analysis (eg, adherence to distributional assumptions, factorability of the data matrix, use of polychoric correlations).13 However, apart from the extraction method and the type of rotation, no additional technical information about these analyses was provided. This information is particularly important in this sample, whose homogeneity would be expected to produce lower variances and factor loadings.14 Regrettably, after having found a parsimonious (and probably “acceptable,” except for item 14) 1-factor solution, with 13 of 14 loadings >0.4, the authors let the selected statistics guide them toward a “most rational” 3-factor solution (with factor loadings far from strong and showing cross-loadings), instead of performing, for example, parallel analysis, a useful tool for understanding to what extent the scale is “unidimensional.”11,13 Under such conditions (a small, homogeneous sample and a weak factor structure), EFA gives unstable solutions,11,13,14 and, as Norman and Streiner stated, the results risk being “as useful and informative as tarot cards.”15
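For completeness, Horn’s parallel analysis retains only the factors whose observed eigenvalues exceed those obtained from random data of the same dimensions. The Python sketch below uses Pearson correlations for simplicity (polychoric correlations would be preferable for ordinal item scores, as noted above), and all data are fabricated.

```python
import numpy as np

def parallel_analysis(data, n_reps=500, quantile=95, seed=0):
    """Horn's parallel analysis: retain only the factors whose observed
    eigenvalues exceed the chosen quantile of eigenvalues from random
    data of the same size. Pearson correlations are used here for
    simplicity; polychoric correlations would be preferable for
    ordinal item scores such as the Mini-BESTest's 0-2 ratings."""
    n, p = data.shape
    rng = np.random.default_rng(seed)
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]
    rand = np.empty((n_reps, p))
    for r in range(n_reps):
        sim = rng.standard_normal((n, p))
        rand[r] = np.sort(np.linalg.eigvalsh(np.corrcoef(sim, rowvar=False)))[::-1]
    threshold = np.percentile(rand, quantile, axis=0)
    return int(np.sum(obs > threshold))      # number of factors to retain

# Fabricated 0-2 item scores for illustration (not the study's data):
rng = np.random.default_rng(2)
scores = rng.integers(0, 3, size=(100, 14)).astype(float)
print(parallel_analysis(scores))
```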
Moreover, the very low communality of the Timed “Up & Go” Test with dual task (item 14) in EFA, in the presence of good infit values in the Rasch analysis, points to the need for a thorough check of the data, bearing in mind the characteristics of the study population (eg, on/off phenomenon; cognitive status, including attention; anxiety level; education level) and applying clinical reasoning. More generally, we think that the Timed “Up & Go” Test with dual task deserves more detailed guidelines for its scoring. For example, is the score the same whether the dual task disrupts the counting or the walking? How can this task be administered to people with language, speech, or cognition disorders? How difficult is the added task for individuals with a low level of education?
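For clarity, an item’s communality in an orthogonal EFA solution is the share of its variance reproduced by the retained factors:

```latex
h_j^2 = \sum_{k=1}^{m} \lambda_{jk}^2 , \qquad u_j^2 = 1 - h_j^2
```

where \lambda_{jk} is the loading of item j on factor k and u_j^2 is the item’s uniqueness; a very low h^2 for item 14 therefore means that the extracted factors reproduce almost none of its variance.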
Overall, this data-driven analysis produced factors that are difficult to explain from a clinical point of view (as the authors’ own discussion shows) and at odds with the literature.8,16 Furthermore, the selected procedures disregard generalizability issues, so the likelihood of cross-validating the results in new or different clinical data sets is very small. For these reasons, we will not discuss the further analyses reported in the article, such as the second Rasch analysis and the confirmatory factor analysis.
Conclusion
In summary, we agree with the authors that, at present, the Mini-BESTest seems to be the best tool for assessing balance in people with PD16: the very good Rasch infit values suggest that all items measure the same underlying construct, “dynamic balance.”12 However, considering the increasing use of this scale in both research and clinical practice, additional studies are warranted to confirm some validity issues (including the stability of item difficulties) in larger samples covering different diseases and levels of balance disorders. The article by Benka Wallén et al1 rightly highlights that the use of subscores (related to supposed subcomponents of balance) in the Mini-BESTest has never been validated and is inconsistent with the unidimensional conceptual model that the scale fulfills. Moreover, we think that more detailed guidelines are warranted for administering (in special contexts) and scoring item 14 (Timed “Up & Go” Test with dual task).
At the same time, we are concerned about the technical limitations of the study (performed on a quite small convenience sample that did not cover a broad range of balance impairment)14: the authors correctly acknowledge these limitations but, in the Discussion section, underestimate their misleading effects on the results. Given that the main aim of validation studies is to provide in-depth information about a series of relevant aspects of a tool’s validity,3 the work by Benka Wallén et al1 seems to have limited external validity. Thus, on current knowledge, we cannot agree with the authors’ (mainly data-driven) conclusions that the gait items are highly diverse from one another and that, in a short balance scale intended to be valid for a large spectrum of individuals with balance disorders, the Timed “Up & Go” Test with dual task is so different from the balance construct as to merit a separate outcome measure.
Footnotes
Competing Interests: None declared.
This letter was posted as an eLetter on June 17, 2016, at ptjournal.apta.org. The letter is responding to the version of the article published ahead of print on May 26, 2016.
© 2016 American Physical Therapy Association