Abstract
Background. The Brief Balance Evaluation Systems Test (Brief-BESTest) was recently proposed as a clinical tool for quickly measuring balance disorders, but its measurement properties warrant investigation.
Objective. The study objective was to perform a detailed analysis of the psychometric properties of the Brief-BESTest by means of Classical Test Theory and Rasch analysis.
Design. This was an observational measurement study.
Methods. Brief-BESTest data were collected from a sample of 244 participants. Internal consistency was analyzed with the Cronbach α and item-to-total correlations. Test-retest reliability and interrater reliability were investigated in a subgroup of 21 participants. The minimum detectable change at the 95% confidence level was calculated. Scale dimensionality was examined through Horn parallel analysis; this step was followed by exploratory factor analysis for ordinal data. Finally, data were examined using Rasch analysis (rating scale model).
Results. The Cronbach α was .89, and all item-to-total correlations were greater than .40. Test-retest reliability had an intraclass correlation coefficient (ICC) (2,1) of .94, and interrater reliability had an ICC (2,1) of .90. The minimum detectable change at the 95% confidence level was 4.30 points. The unidimensionality of the test was confirmed, but 1 item showed low communality. Rasch analysis revealed the inadequacy of response categories, 5 misfitting items, minor mistargeting, moderate person reliability (.80), and 2 pairs of locally dependent items.
Limitations. The sample was a cross-section of people who had balance disorders from different neurological etiologies and were recruited consecutively at a single rehabilitation facility.
Conclusions. The Brief-BESTest was confirmed to have some acceptable-to-good reliability indexes when calculated according to Classical Test Theory, but the scale showed fairly limited sensitivity to change. Rasch analysis indicated that item selection should be improved from a psychometric point of view. Item redundancy needs to be reduced, and the metric coverage of the measured construct needs to be improved with new items.
Balance control depends on the interplay of peripheral inputs integrated in the central nervous system, leading to a variety of coordinated motor responses aimed at maintaining postural stability.1 The control of posture can be affected in various ways, either as a consequence of disease, such as peripheral neuropathy,2–4 Parkinson disease,3,5–11 and stroke,12 or through aging.13 Therefore, measuring balance performance is complex and warrants the investigation of different types of postural tasks to accurately describe the neural processes underlying balance control.14 In view of this need, a systems framework for postural control was proposed,15,16 and, accordingly, a new 36-item balance test—the Balance Evaluation Systems Test (BESTest)—was introduced.17
However, some possible drawbacks, in terms of item redundancy and the lengthy test duration, were found with the new test.18 For addressing these potential limitations, 2 short versions of the BESTest have been published: the Mini-BESTest18 and the Brief-BESTest.19 The Mini-BESTest is a unidimensional 14-item version focused on assessing dynamic balance and was produced with the aid of factor analysis and Rasch analysis. In contrast, the Brief-BESTest consists simply of 1 item from each of the 6 balance domains of the BESTest.
Until now, only a few studies of limited sample size19–22 have assessed some of the psychometric properties of the Brief-BESTest according to the Classical Test Theory approach. From these studies, it was concluded that the Brief-BESTest was a promising instrument, with interrater and test-retest reliability values ranging from .92 to .99 and internal consistency values ranging from .86 to .92.19,22 Moreover, for prospective fall prediction accuracy over 6 months, the area under the receiver operating characteristic curve was 0.88, the sensitivity was 0.71, and the specificity was 0.87.20 Similar values were observed for retrospective fall prediction accuracy in people who had Parkinson disease and reported at least one fall; the area under the receiver operating characteristic curve was 0.82, the sensitivity was 0.76, and the specificity was 0.84.20 Moreover, a Spearman ρ correlation of .81 (P<.001) was found with the Activities-specific Balance Confidence Scale and, with regard to the sensitivity to change, the standard error of measurement and the minimum detectable change were 1.13 and 2.55 points, respectively.22
According to the authors of the original Brief-BESTest,19 the items selected were those that were considered to be the most representative on the basis of item-to-total correlations. However, it has been well established that caution about possible weaknesses of the measurement properties of a shortened instrument is warranted, particularly when the reduction has been obtained through the Classical Test Theory approach.23,24 Other psychometric strategies, such as Rasch analysis, should be applied if the goal is to select the smallest number of items while optimizing the technical quality of an outcome measure.25 Rasch analysis allows assessment of how well an item performs in terms of its relevance or usefulness for measuring the underlying construct, the amount of the construct targeted by each question, the metric coverage of the construct, the possible redundancy of an item relative to other items in the scale, and the appropriateness of the response categories.26–28
Not unexpectedly, a letter to the editor examining the scale with Rasch analysis29 raised a series of concerns about the structure and functioning of the Brief-BESTest. Therefore, a more in-depth evaluation of this new balance assessment scale, supported by detailed statistical analysis in an adequate sample of people with balance deficits, is warranted.
The aims of our study were to evaluate the psychometric properties of the Brief-BESTest by means of both Classical Test Theory and Rasch methods and to compare the results of the 2 methods. To test the appropriateness of the scale as a general-purpose tool for the measurement of balance disorders, we collected data from a sample of people who had balance disorders from different neurological etiologies and were recruited through a consecutive sampling method at a single rehabilitation facility. This sample was representative of the spectrum of people who have neurological diseases and usually attend departments of rehabilitation.30
Method
Participants
Between October 2012 and January 2014, the Brief-BESTest was administered to a sample of 270 people who had balance disorders from different etiologies and had been admitted to the Division of Physical Medicine and Rehabilitation and the Division of Neurologic Rehabilitation of the Salvatore Maugeri Foundation, Scientific Institute of Veruno, Novara, Italy, a free-standing rehabilitation facility. Potential participants had been referred from surrounding acute care hospitals and by general practitioners and had been screened for rehabilitation potential. The inclusion criteria were an ability to maintain an upright position without support for 5 seconds, an ability to understand the required motor tasks, no hip or knee replacement in the preceding 6 months, and no recent bone fracture. Of the 255 people fulfilling the inclusion criteria, 6 declined to participate and 5 did not complete the assessment (3 because of fatigue and 2 because of back pain).
The final study population consisted of 244 participants (55% men) with a mean age of 65.3 years (SD=14.9) and a median age of 69 years (range=21–90). Their clinical characteristics are shown in Table 1. Our sample was composed mostly of Italians (95%) from zones with a population density of less than 300 inhabitants/km2 (Piedmont area). The participants' mean education was 6.5 years (SD=3.2, range=4–18). All participants signed an informed consent statement.
Table 1. Clinical Characteristics of Participants
All participants were tested at admission randomly by 1 of the 3 physical therapists involved in the study. The physical therapists had at least 10 years of experience in assessment and treatment of patients with neurological conditions. They also had experience in BESTest administration and daily administration of balance scales such as the Mini-BESTest and the Berg Balance Scale. They had been specifically trained in balance assessment with the Brief-BESTest during a course consisting of three 1-hour sessions.
Assessment With the Brief-BESTest
The Brief-BESTest is a balance assessment instrument comprising all 6 dimensions of the original BESTest (biomechanical constraints, stability limits, anticipatory adjustments/transitions, reactive postural responses, sensory orientation, and dynamic gait). Theoretically, it is a 6-item scale, because one item from each subsection of the BESTest system was selected on the basis of the highest item-to-total correlation coefficients within each respective section. However, because the items related to anticipatory adjustments/transitions and reactive postural responses include both right and left components, the full scale actually consists of 8 items. The items are administered and scored according to the original test instructions,19 with a 4-level rating scale ranging from 0 to 3, with 0 representing severe balance impairment (or inability to perform the task without falling) and 3 representing no balance impairment. Thus, the maximum possible total score for the Brief-BESTest is 24,19 with higher scores indicating less severe balance impairment.
Data Analysis
Sample size requirements were determined according to the planned statistical analysis and are described in detail in the corresponding sections.
Classical Test Theory.
Reliability analysis includes the evaluation of internal consistency and repeatability to test whether a scale ensures stable and objective measurements. The internal consistency of the Brief-BESTest was analyzed first with the Cronbach α coefficient (a minimum α value of .90 is recommended for individual judgments, and a value of .95 is desirable31). The Cronbach α was calculated32 from the polychoric correlation matrix, with a computation method appropriate for ordinal data.33 Next, we verified the item-to-total correlation, that is, the correlation between each single-item score and the overall assessment score, omitting this item from the total. A Spearman ρ coefficient of greater than .40 was considered to be the minimum value for an acceptable item-to-total correlation.34
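As an illustration of the two internal consistency indexes described above, the following minimal sketch computes a Cronbach α and corrected item-to-total Spearman correlations. It is not the procedure used in the study: it applies the conventional raw-score α rather than the ordinal α based on the polychoric correlation matrix, and the function names and the `demo` ratings are hypothetical.

```python
from statistics import mean, pvariance


def cronbach_alpha(scores):
    """Raw-score Cronbach alpha; scores holds one row per participant."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def corrected_item_total(scores, item):
    """Spearman rho between one item and the total of the remaining items."""
    item_scores = [row[item] for row in scores]
    rest_scores = [sum(row) - row[item] for row in scores]
    return pearson(average_ranks(item_scores), average_ranks(rest_scores))


# Hypothetical 0-3 ratings for 6 participants on 4 items (not study data).
demo = [
    [0, 0, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [3, 2, 3, 3],
    [3, 3, 3, 2],
]
alpha = cronbach_alpha(demo)
item_total = [corrected_item_total(demo, j) for j in range(4)]
```

Omitting the item from the total, as in `corrected_item_total`, avoids inflating the correlation by correlating the item with itself.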
The repeatability of the Brief-BESTest was assessed by means of test-retest and interrater reliability with the intraclass correlation coefficient (ICC [2,1]) measure.35 Intraclass correlation coefficients exceeding .90 were considered to be reasonable for clinical measurements.35 The sample size required was determined on the basis of a pilot study, with the expectation of obtaining ICCs of about .90 and a 95% confidence interval (CI) close to .2.36 A minimum number of 21 participants was needed. Therefore, a subgroup of 21 participants was assessed with the Brief-BESTest twice, at admission and within 24 hours (before the first rehabilitation sessions), to study test-retest reliability. At admission, the same subgroup of participants was evaluated in turn by 3 physical therapists to assess interrater reliability.
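The ICC (2,1) named above can be obtained from the mean squares of a two-way analysis of variance. The sketch below is a minimal illustration, assuming a complete participants × raters (or sessions) table with no missing values; the function name and the ratings are hypothetical and are not study data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rating.

    ratings: n rows (participants) by k columns (raters or sessions).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-participants mean square
    jms = ss_cols / (k - 1)               # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)


# Illustrative ratings: 6 participants scored by 4 raters (not study data).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(example)  # about 0.29: raters differ systematically
```

Because ICC (2,1) measures absolute agreement, systematic differences between raters (the `jms` term) lower the coefficient even when raters rank participants similarly.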
To investigate the size of the measurement error and, consequently, the sensitivity of the Brief-BESTest to change, we calculated the standard error of measurement and its 95% CI on the basis of the analysis of variance used to produce the ICC.37 Then, the minimum detectable change was obtained with the following formula: standard error of measurement × z value × √2. The minimum detectable change at the 95% confidence level (MDC95) was established for a z value of 1.96.
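The formula above can be checked with a few lines of code. The sketch below applies it to the standard error of measurement reported in the Results section (1.55 points) and expresses the outcome as a percentage of the 24-point maximum score; the function names are illustrative only.

```python
import math


def minimum_detectable_change(sem, z=1.96):
    """MDC = standard error of measurement x z value x square root of 2."""
    return sem * z * math.sqrt(2)


def percent_of_scale(value, max_score=24):
    """Express a value in points as a percentage of the maximum total score."""
    return 100 * value / max_score


mdc95 = minimum_detectable_change(1.55)   # about 4.30 points
mdc95_percent = percent_of_scale(mdc95)   # about 17.9% of 24 points
sem_percent = percent_of_scale(1.55)      # about 6.5% of 24 points
```

The √2 term reflects that a change score combines the measurement error of 2 assessments.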
We investigated dimensionality using an exploratory approach. The sample size calculation for exploratory factor analysis38 took into consideration 2 general recommendations: a minimum of 200 cases and a participant-to-item ratio of 20:1. On the basis of the participant-to-item ratio, because the Brief-BESTest is an 8-item scale, we would have had to include at least 160 participants. The Kaiser-Meyer-Olkin measure of sampling adequacy was used as a measure of the appropriateness of factor analysis. A Kaiser-Meyer-Olkin value of greater than 0.60 is required, and a value of 0.90 is considered to indicate perfect appropriateness.39 Parallel analysis40,41 was applied to identify the number of dimensions (factors) measured by the items of the scale. Then, the contribution of each item to the identified factors was evaluated with exploratory factor analysis for ordinal data. If an item has a communality of less than 0.40, it may not be related to the other items, or an additional factor should be explored.42,43
Rasch analysis.
We examined the responses to the Brief-BESTest items using Rasch analysis with a rating scale model. The sample size required to provide 99% confidence that no item calibration was more than ±0.5 logit away from its stable value was 243.44 The analysis was run according to the following 5 steps.
First, we performed a rating scale diagnostic to verify whether the response categories (from 0 [unable to perform] to 3 [normal performance]) for each item of the scale were being used effectively and consistently, according to the following 5 guidelines26,45: at least 10 cases per category; even distribution of category use; monotonic increases in both the average measure across rating scale categories and thresholds (ie, the ability levels at which the response to either of 2 adjacent categories was equally likely); category outfit mean square (MnSq) value of less than 2; and threshold differences of greater than 1.10 logits and less than 5 logits.
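Two of these guidelines (category frequency and threshold spacing) reduce to simple arithmetic checks. The sketch below is a hedged illustration with hypothetical names and input values; it does not cover the distribution, average measure, or outfit criteria, which require the full Rasch output.

```python
def category_diagnostics(counts, thresholds, min_count=10,
                         min_gap=1.10, max_gap=5.0):
    """Check category frequencies and the spacing of adjacent thresholds.

    counts: number of responses observed in each category.
    thresholds: threshold locations in logits, from lowest category upward.
    """
    gaps = [upper - lower for lower, upper in zip(thresholds, thresholds[1:])]
    return {
        "counts_ok": all(c >= min_count for c in counts),
        "thresholds_ordered": all(g > 0 for g in gaps),
        "gaps": gaps,
        "gaps_ok": [min_gap < g < max_gap for g in gaps],
    }


# Hypothetical 4-category scales (not study data).
well_spaced = category_diagnostics([40, 55, 60, 45], [-1.5, 0.0, 1.5])
crowded = category_diagnostics([40, 8, 60, 45], [-0.9, -0.4, 1.4])
```

In the `crowded` example, the second category is both rarely used and squeezed between thresholds only 0.5 logit apart, the kind of pattern that usually prompts collapsing adjacent categories.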
In the second step, we estimated the item difficulty and person ability measures, expressed on a linear logit scale. According to Rasch theory, in principle, people of medium ability should agree (or have success) with the easier items and disagree (or fail) with the more difficult ones. We also examined the data for floor and ceiling effects. Then, we assessed internal construct validity by determining how well the empirical data fit the Rasch model.27,28 In line with Linacre's recommendations,46,47 infit and outfit MnSq values between 0.80 and 1.20 were considered to indicate acceptable fit, and more emphasis was placed on infit values than on outfit values for identifying misfitting items. An MnSq of greater than 1.20 indicates that the data are unexpected compared with the predictions of the model. Such is the case with underfitting items, which could degrade the measurement system. An MnSq of less than 0.80 indicates that the data are more predictable than expected from the model. In this case, items are considered overfitting, and their presence might produce misleadingly good reliabilities and separations.
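To make the fit statistics concrete, the sketch below computes infit and outfit mean squares from observed scores, model-expected scores, and model variances, using the standard definitions (outfit is the plain mean of squared standardized residuals; infit weights residuals by the model variance). The function names and the input values are hypothetical; in practice the expected scores and variances come from the fitted Rasch model.

```python
def fit_mean_squares(observed, expected, variance):
    """Return (infit, outfit) mean squares for one item or one person."""
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(sq_resid, variance)) / len(observed)
    # Infit: information-weighted mean square.
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit


def classify_fit(mnsq, low=0.80, high=1.20):
    """Apply the 0.80-1.20 acceptance range used in this study."""
    if mnsq > high:
        return "underfit"
    if mnsq < low:
        return "overfit"
    return "acceptable"


# Hypothetical responses; the last one is an unexpected success on an
# off-target item (expected score 0.1, small model variance).
infit, outfit = fit_mean_squares(
    observed=[1, 0, 1, 1],
    expected=[0.5, 0.5, 0.5, 0.1],
    variance=[0.25, 0.25, 0.25, 0.09],
)
```

The single off-target surprise raises outfit more than infit, which is why outfit is described as more sensitive to outlying responses.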
In the third step, we estimated person reliability and item reliability. Person reliability and item reliability coefficients range from 0 to 1 and are often interpreted in a manner similar to that used for the Cronbach α. The main difference is that the estimate of reliability in Rasch models is based on a procedure that acknowledges discrete data properties, rather than treating the data as continuous raw scores, as is done in Classical Test Theory methods. Reliability also can be expressed in terms of a separation index, with a range from 0 to infinity, indicating the reproducibility of relative measure location—that is, how well one can differentiate between different person or item performances along the measurement construct. A separation index of greater than 1.50 is considered to be the minimum acceptable separation index, whereas a separation index of 2 enables the distinction of 3 statistically detectable groups or strata of measures, according to the following formula: (4G + 1)/3.28
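The link between reliability, separation, and strata can be sketched as follows. The strata formula is the one given above; the conversion from a reliability coefficient R to a separation index G, G = √(R/(1 − R)), is the standard Rasch relationship and is added here as background rather than taken from the text. Function names are illustrative.

```python
import math


def separation_index(reliability):
    """Separation G from a reliability coefficient R: sqrt(R / (1 - R))."""
    return math.sqrt(reliability / (1 - reliability))


def strata(separation):
    """Number of statistically distinct strata: (4G + 1) / 3."""
    return (4 * separation + 1) / 3


g = separation_index(0.80)  # a reliability of .80 gives G of about 2
n_strata = strata(2.0)      # G of 2 gives 3 strata
```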
The fourth step addressed dimensionality and local item dependence through principal components analysis (PCA) of the standardized residuals. For confirmation of the unidimensionality of the scale, the variance explained by the measured construct (ie, the Rasch factor) should be greater than 50%, whereas the variance explained by the first residual factor should be less than 10%, with an eigenvalue lower than 3.46,48 We calculated the correlations of standardized residuals to test local independence between items. A high correlation (>.30) of residuals for 2 items indicates that they may be locally dependent, either because they duplicate some feature of each other or because they both incorporate some other shared dimension.28
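The local dependence check amounts to correlating the standardized residuals of every pair of items and flagging correlations above .30. The sketch below assumes the residuals have already been exported from the Rasch software; the item labels and residual values are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def locally_dependent_pairs(residuals, cutoff=0.30):
    """Return (item, item, r) for item pairs with residual correlation > cutoff.

    residuals: dict mapping an item label to its standardized residuals,
    listed in the same participant order for every item.
    """
    flagged = []
    for a, b in combinations(residuals, 2):
        r = pearson(residuals[a], residuals[b])
        if r > cutoff:
            flagged.append((a, b, r))
    return flagged


# Hypothetical residuals for 3 items and 4 participants (not study data).
example = {
    "A": [1.0, -1.0, 1.0, -1.0],
    "B": [1.0, -1.0, 1.0, -1.0],
    "C": [1.0, 1.0, -1.0, -1.0],
}
pairs = locally_dependent_pairs(example)
```

Here only items A and B are flagged, because their residuals move together after the Rasch factor has been removed.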
In the last step, we performed a differential item functioning (DIF) analysis to search for possible differences resulting from context effects between the measures obtained for different subgroups of participants.49,50 We defined 2 grouping variables: sex (men versus women) and age (<69 years old versus >69 years old, 69 years being the median age). We investigated DIF by calibrating the scale for each group separately to obtain an estimate of the item difficulties for each group and using, as anchor values, the person calibrations for the entire sample; then, we performed pair-wise t tests between the 2 sets of item difficulties (2-sided α of .05, with Bonferroni correction, depending on the number of comparisons). The a priori hypothesis was that no DIF would be found between the analyzed groups.
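The pair-wise comparison of item difficulties can be sketched as below. This is an illustration under stated assumptions, not the Winsteps procedure: each item contributes a difficulty estimate and its standard error per group, the test statistic is the difference divided by the joint standard error, and a normal approximation replaces the t distribution (reasonable only for large groups). All names and values are hypothetical.

```python
from statistics import NormalDist


def dif_flags(group_a, group_b, alpha=0.05):
    """Flag items whose difficulty differs between 2 groups.

    group_a, group_b: lists of (difficulty, standard_error) tuples,
    one per item, in the same item order.
    """
    adjusted_alpha = alpha / len(group_a)  # Bonferroni correction
    flags = []
    for (d1, se1), (d2, se2) in zip(group_a, group_b):
        statistic = (d1 - d2) / (se1 ** 2 + se2 ** 2) ** 0.5
        # Two-sided P value under the normal approximation (assumption).
        p_value = 2 * (1 - NormalDist().cdf(abs(statistic)))
        flags.append(p_value < adjusted_alpha)
    return flags


# Hypothetical calibrations for 2 items in 2 groups (not study data).
men = [(0.0, 0.1), (1.0, 0.1)]
women = [(0.0, 0.1), (-1.0, 0.1)]
flags = dif_flags(men, women)
```

With 8 items, the Bonferroni-adjusted significance level would be .05/8 = .00625 per comparison.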
Statistical software.
The Classical Test Theory analyses were performed with Stata version 12 (StataCorp LP, College Station, Texas). Rasch analysis was run with the Winsteps Rasch measurement computer program, version 3.68.2 (Winsteps, Beaverton, Oregon).47
Role of the Funding Source
This study was supported, in part, by a Ricerca Finalizzata grant (RF-2010-2312497) from the Italian Ministry of Health and by a PRIN 2010–2011 grant (2010R277FT) from the Italian Ministry of Education, University, and Research.
Results
Table 1 shows that different neurological conditions were represented in our sample; the mean age across the participant groups ranged from about 48.1 to 71.5 years. For the entire sample, the median of the Brief-BESTest raw scores (entire potential range=0–24) was 8 (interquartile range=3–13), with a skewness of 0.60 and a kurtosis of 2.60.
Classical Test Theory
Reliability and sensitivity to change.
The Cronbach α value for the Brief-BESTest was .89. Item-to-total correlations (Tab. 2) ranged from .47 (item 1, “hip/trunk lateral strength”) to .70 (items 4a and 4b, “compensatory stepping lateral, left” and “compensatory stepping lateral, right,” respectively).
Table 2. Item-to-Total Correlations and Factor Analysis of the Brief-BESTest
For the test-retest reliability of the Brief-BESTest, the ICC (2,1) was .94 (95% CI=.86, .97); for the interrater reliability, the ICC (2,1) was .90 (95% CI=.81, .96). The standard error of measurement of the Brief-BESTest was 1.55 points (95% CI=1.28, 2.10), and its MDC95 was 4.30 points (95% CI=3.56, 5.83), representing 6.5% and 17.9% of the maximum total score of the scale (24 points), respectively.
Dimensionality.
The Kaiser-Meyer-Olkin measure of sampling adequacy, equal to 0.86, revealed the adequacy of the data matrix for factor analysis. The results of the parallel analysis supported a unidimensional underlying response structure, because only the eigenvalue of the first factor from our data was larger than the related 95th percentile eigenvalue from the random data, accounting for 63% of the explained variance. Item loadings for the single-factor solution are shown in Table 2. Item 1 (“hip/trunk lateral strength”) and item 2 (“functional reach forward”) showed the lowest communalities (0.30 and 0.43, respectively) (see also Rasch analysis results on dimensionality).
Rasch Analysis
Rating scale category functioning.
The rating scale (from 0 to 3) did not fulfill all of the category functioning criteria detailed in the Method section.45 Three criteria were clearly met: each response category was well populated, from a minimum of 229 responses (category 3, no balance impairment) to a maximum of 708 responses (category 0, severe balance impairment); the average measure across rating scale categories and the thresholds increased monotonically; and the outfit MnSq values for each category were less than 2. The criterion on threshold spacing was not met: Figure 1 shows that the thresholds (ie, the ability levels at which the response to either of 2 adjacent categories was equally likely) were at −0.91, −0.44, and 1.35 logits, so the first 2 thresholds were only 0.47 logit apart, below the 1.10-logit minimum. These thresholds showed that category 1 occupied a narrow interval of the latent variable, raising concerns about the rating scale category definition.
Figure 1. Category probability curves. The y-axis represents the probability (0–1) of responding to 1 of the 4 rating categories, and the x-axis represents the corresponding performance values, calculated as person ability minus item difficulty and expressed in logits. The 3 thresholds (vertical dashed lines), that is, the ability levels at which the response to either of 2 adjacent categories was equally likely, were at −0.91, −0.44, and 1.35 logits.
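Category probability curves of this kind can be reproduced from the thresholds alone. The sketch below is a minimal illustration of the rating scale model category probabilities, using the 3 reported thresholds; the function name is hypothetical, and the code is not the Winsteps implementation.

```python
import math

THRESHOLDS = [-0.91, -0.44, 1.35]  # thresholds reported for the 0-3 scale


def category_probabilities(measure, thresholds=THRESHOLDS):
    """Rating scale model probabilities for categories 0..len(thresholds).

    measure: person ability minus item difficulty, in logits.
    """
    # Category k has exponent sum_{j<=k} (measure - tau_j); category 0 has 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (measure - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


at_first_threshold = category_probabilities(-0.91)  # categories 0 and 1 tie
inside_category_1 = category_probabilities(-0.70)   # category 1 most likely
```

At −0.70 logit, inside the narrow band between the first 2 thresholds, category 1 is the most probable response, but only with a probability of about 0.37, which illustrates why it contributes little distinct information.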
Internal construct validity: data fit, item difficulty, and person ability estimates.
Table 3 shows both item difficulty measures and fit information. The easiest item was 2 (“functional reach forward”), and the most difficult items were 3a (“one-leg stance, left”) and 3b (“one-leg stance, right”). Five of the 8 Brief-BESTest items misfit the underlying construct. More exactly, item 1 (“hip/trunk lateral strength”) underfit the model (infit and outfit MnSq values of 1.47 and 1.52, respectively), whereas item 2 (“functional reach forward”) (infit MnSq=0.76) and items 3a (“one-leg stance, left”), 4a (“compensatory stepping lateral, left”), and 4b (“compensatory stepping lateral, right”) (outfit MnSq=0.74, 0.78, and 0.78, respectively) overfit the model (Tab. 3).
Table 3. Item Difficulty Measures and Goodness-of-Fit Statistics for the 8 Items of the Brief-BESTest
Figure 2 shows the distributions of person ability and item difficulty. The average item difficulty estimates covered a limited range of 1.92 logits, and the distance between the threshold boundaries was 4.18 logits (from −2.11 to 2.07 logits; Fig. 2, horizontal arrows). Person ability spanned 9.18 logits (from −4.42 to 4.76 logits), with a skewed distribution. The mean person ability was situated at −0.92 logit (Fig. 2, M; SD=1.68) from the average item difficulty, conventionally set at 0 logits (Fig. 2, M′; SD=0.62).27 Only 2 participants (0.8% of the sample) achieved the maximum score (ie, no balance impairment), whereas 14 participants (5.7%) achieved the minimum score (ie, severe balance impairment). Therefore, there was no evidence of a ceiling effect, but the percentage of participants with a minimum score indicated a floor effect, to some extent.48
Figure 2. Person ability and item difficulty map of the Brief-BESTest. A vertical line represents the measure of the variable, in linear logits. The leftmost column locates each person's ability, from higher (top) to lower (bottom) ability. The rightmost column locates the relative difficulty of each item for this sample. A higher item measure (item 3a) reflects more difficulty in that item (higher score). By convention, the average difficulty of items in the test is set at 0 logits (indicated by M′), and participants with average ability are located at M. Horizontal dashed arrows indicate the 2 extreme threshold boundaries, and S and T represent, respectively, 1 and 2 standard deviations. Each point indicates one participant, and each hash mark indicates 2 participants. CSL-L=compensatory stepping lateral, left; CSL-R=compensatory stepping lateral, right; FRF=functional reach forward; H/T LS=hip/trunk lateral strength; SECF=stance with eyes closed, on foam surface; SL-L=one-leg stance, left; SL-R=one-leg stance, right; TUG=Timed “Up & Go” Test.
Figure 3 shows the standard error of measurement along the whole person ability span; outside the item threshold boundaries (from −2.08 to +1.78), there was a steep increase in these errors, implying a loss of precision of the ability measure.
Figure 3. Standard error of measurement (SE, expressed in logits) for the Rasch model person ability estimates and cumulative percentage of participants at each level of ability. About 30% of the sample had an SE of greater than 0.6 logit (from −4.42 to −2.08 logits and from 1.78 to 4.76 logits).
The person reliability coefficient was .80, and the person separation index was 2.01 (enabling the distinction of 3 detectable strata of person ability). The item reliability coefficient was .98, and the item separation index was 6.25.
Local item dependence and dimensionality.
The PCA of the standardized residuals revealed that the variance explained by the Rasch factor was 55.5% (26.6% explained by items and 28.9% explained by participants). The variance explained by the first residual factor was 11.6%, with an eigenvalue of 2.10. For 2 item pairs, items 4a (“compensatory stepping lateral, left”) and 4b (“compensatory stepping lateral, right”) and items 3a (“one-leg stance, left”) and 3b (“one-leg stance, right”), the residual correlations were high (.55 and .31, respectively).
DIF analysis.
No DIF was found in terms of either sex or age (as separated into 2 groups by the median age of 69 years).
Discussion
For valid decision making in clinical practice, high-quality outcome measures that meet rigorous measurement standards are required. The aim of the present study was to evaluate the measurement properties of the Brief-BESTest with the goal of determining its appropriateness as a general-purpose tool for the assessment of balance disorders. To our knowledge, this is the first study analyzing the Brief-BESTest by means of both Classical Test Theory and Rasch analysis in a wide range of participants with balance disorders, representing the spectrum of people who usually attend departments of neurological rehabilitation.30
The main result of the present study was that Classical Test Theory and Rasch analysis highlighted several weaknesses of the Brief-BESTest, in particular, with regard to reliability issues and sensitivity to change, rating scale category functioning, internal construct validity and clustering of the item difficulty calibration, and local item dependence. Overall, our findings are in line with concerns raised by a preliminary Rasch analysis evaluation of the Brief-BESTest.29
Reliability and Sensitivity to Change
In the present study, the Cronbach α was found to be good for group-level comparisons but only borderline for individual judgments in clinical settings.31 This result is in line with findings reported by the authors of the original Brief-BESTest for 2 different cohorts of patients.19 We checked measurement error through reliability tests; we found good test-retest and interrater reliability, as did Huang et al22 for a sample of 28 people who survived cancer after chemotherapy treatment. In the study by Padgett et al,19 the interrater reliability of the Brief-BESTest was higher; however, half of the small sample in that study (n=20) consisted of people who were healthy, the score distribution was negatively skewed, and there was a clear ceiling effect.
Starting from the test-retest reliability, we calculated the MDC95 of the Brief-BESTest; the MDC95 indicates the smallest change in a participant's total score that likely reflects a true change (rather than measurement error alone) in balance ability. Because total scores are whole numbers, the MDC95 of 4.30 points corresponds in practice to a 5-point change, which is quite high (about 20% of the 24-point maximum scale range) and indirectly reveals the low sensitivity of the tool to change. In a recent study performed with older people who survived cancer,22 higher MDC95 values were found for the Brief-BESTest than for the Mini-BESTest. However, to our knowledge, the present study is the first to estimate the sensitivity of the Brief-BESTest to change by means of the MDC95 for a large sample of people with neurological conditions and balance disorders.
According to Rasch analysis, the person separation index indicated that the scale was just able to differentiate among 3 different strata of balance performance (ie, normal, moderately impaired, and severely impaired).28 As a comparison, the Mini-BESTest is able to differentiate 5 strata of balance ability, from mild to very severe deficits.51 The larger number of strata in the Mini-BESTest than in the Brief-BESTest suggests that the Mini-BESTest may be better able to detect balance changes in individual patients because of the natural course of the disease or treatment. Overall, Rasch analysis clarified some reasons for the low potential of the Brief-BESTest for detecting change; indeed, the sensitivity of a scale to change is directly related to the distribution of item difficulty calibrations and rating scale thresholds relative to the dispersion of person measures before an intervention.26 These points are discussed in the following text.
Rating Scale Category Functioning
Rating scale diagnostics revealed that the category rating scale (from 0 to 3) of the Brief-BESTest requires some refinement, probably through a reduction of categories (from the original 4 to 3) by collapsing misfunctioning category 1 (“moderate”) with an adjacent category (0 [“severe”] or 2 [“mild”]). This result is not surprising. Category misfunctioning had already been identified for the BESTest,18 and the development of the Mini-BESTest revealed that the model best meeting the criteria for category functioning was a 3-category rating scale, the only one able to ensure that each rating category was distinct from the others in representing a different balance ability.18 However, the suboptimal functioning of category 1 also might have been an idiosyncratic feature of the sample in the present study; hence, further analysis of the performance of the 4 rating categories is warranted.
Internal Construct Validity
The fit statistics of the Rasch analysis revealed an unexpectedly high variability of responses related to “hip/trunk lateral strength” (item 1). This result suggests that item 1 may not belong to the same construct as the other items. This finding is in line with that of the previous factor analysis of the BESTest18: item 1 (like the others related to biomechanical constraints) failed to load meaningfully onto any factor. In addition, the lowest item-to-total correlation in the present study was associated with item 1. In our opinion, this item should be deleted. In general, the section of the BESTest related to biomechanical constraints (as well as the section related to stability limits) warrants separate psychometric studies. Biomechanical constraints are important facets of postural control but appear to be independent of the unidimensional construct related to dynamic balance.
On the other hand, the infit values for item 2 (“functional reach forward”) and, to a lesser extent, the outfit values for items 3a (“one-leg stance, left”), 4a (“compensatory stepping lateral, left”), and 4b (“compensatory stepping lateral, right”) revealed an overfit, that is, an overly predictable pattern. Linacre47 recommended that more emphasis should be placed on infit values than on outfit values for identifying misfitting items. Overfit generally can be ignored unless it is extreme; nevertheless, the widespread overfitting behavior suggested that these items did not provide unequivocal information about the respondents and tended to inflate person reliability, although they did not degrade the quality of the measurement.47 These observations take on more significance given that the person reliability coefficient of the Brief-BESTest in the present study was acceptable but not high.
The distance between the average ability of participants and the mean item difficulty (about 1 logit; Fig. 2) revealed that the Brief-BESTest items were difficult for our participants. A smaller distance would have increased person reliability, because person reliability depends on how well item difficulty is targeted to participant ability. Furthermore, the spread of average item difficulties and of extreme threshold boundaries was limited compared with the wide range of person ability estimates. These findings were confirmed by analysis of the standard errors of measurement for the Rasch model person ability estimates (Fig. 3); they were higher than 0.5 to 0.6 logit in about 30% of the sample (mostly from participants with the lowest balance ability and just marginally from those with the highest ability). Accordingly, in these 2 zones, the precision of the ability measurement decreased sharply with movement to one extreme or the other.
Overall, the results indicate that the Brief-BESTest does not have sufficient items to optimally measure a wide range of balance abilities. Indeed, for the best test design, the distribution of item difficulties should match the distribution of person abilities. Figure 2, however, shows the lack of such a match for participants with a lower balance ability (ie, higher impairment), located at the negative extreme of the logit scale (bottom left). The addition of some less difficult items is warranted. Finally, some Brief-BESTest items share the same span of difficulty, indicating potential item redundancy and the risk for inflation of the cumulative raw score when the scores of individual items reflecting the same level of ability are summated.27,28
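The link between targeting and measurement precision described above can be sketched numerically. The example assumes dichotomous Rasch items for simplicity (the present study used the rating scale model), and the item difficulties are hypothetical values chosen for illustration, not the Brief-BESTest calibrations. Under that assumption, the model standard error of an ability estimate is the inverse square root of the test information at that ability.

```python
import math

def rasch_prob(theta, b):
    """Probability of success under the dichotomous Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ability_se(theta, difficulties):
    """Model standard error (logits) of a person ability estimate.

    Each dichotomous item contributes p(1 - p) to the test information
    at ability theta; the standard error is 1 / sqrt(information).
    """
    info = 0.0
    for b in difficulties:
        p = rasch_prob(theta, b)
        info += p * (1.0 - p)
    return 1.0 / math.sqrt(info)

# Hypothetical item difficulties (logits), clustered in a narrow range
items = [-1.0, -0.5, 0.0, 0.0, 0.5, 0.5, 1.0, 1.5]

se_centre = ability_se(0.25, items)   # person located within the item range
se_low = ability_se(-4.0, items)      # person far below every item

# Adding hypothetical easier items improves precision for low-ability persons
easier_items = [-4.5, -4.0, -3.5, -3.0]
se_low_extended = ability_se(-4.0, items + easier_items)
```

In this sketch the standard error is larger for the person located far below the items than for the well-targeted person, and it shrinks once easier items are added, which is the rationale for extending the scale toward the lower end of the ability continuum.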
Local Item Dependence and Dimensionality
The presence of item redundancy in the scale was confirmed by the PCA of standardized residuals. The PCA supported a substantial (although not high) unidimensionality, but it revealed 2 pairs of items with high local dependence: items 3a (“one-leg stance, left”) and 3b (“one-leg stance, right”) and items 4a (“compensatory stepping lateral, left”) and 4b (“compensatory stepping lateral, right”). This local dependence, due to the logical relationship within each of these 2 item pairs, also seemed to be the reason for the relatively high percentage of variance explained by the first residual factor (>10%).
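Local dependence of the kind described above is commonly screened by correlating the standardized residuals of item pairs. The following is a minimal sketch under stated assumptions: dichotomous Rasch items instead of the rating scale model, abilities and difficulties already estimated, and hypothetical data and function names (`std_residuals`, `pearson`, `residual_correlation`). It is not the procedure or software used in the present study.

```python
import math

def rasch_prob(theta, b):
    """Probability of success under the dichotomous Rasch model (assumed here)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def std_residuals(responses, thetas, b):
    """Standardized residuals (x - p) / sqrt(p(1 - p)) for one item."""
    out = []
    for x, theta in zip(responses, thetas):
        p = rasch_prob(theta, b)
        out.append((x - p) / math.sqrt(p * (1.0 - p)))
    return out

def pearson(u, v):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (c - mean_v) for a, c in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((c - mean_v) ** 2 for c in v)
    return cov / math.sqrt(ss_u * ss_v)

def residual_correlation(resp_a, resp_b, thetas, b_a, b_b):
    """Correlation between the standardized residuals of two items."""
    return pearson(std_residuals(resp_a, thetas, b_a),
                   std_residuals(resp_b, thetas, b_b))

# Hypothetical data: a left/right item pair answered identically by every person
thetas = [-1.0, 0.0, 1.0, 2.0]
left = [0, 1, 0, 1]
right = [0, 1, 0, 1]
r_dependent = residual_correlation(left, right, thetas, b_a=0.0, b_b=0.0)
```

When two items of equal difficulty receive identical responses, as in this toy bilateral pair, their residuals coincide and the correlation reaches its maximum; under local independence, residual correlations are expected to be close to zero.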
A similar problem of high residual correlation between BESTest items was found and addressed, leading to the development of the Mini-BESTest18; in that study, to enhance the metric properties of the scale, only the worst performance in bilateral and locally dependent tasks was scored.18,29 In any case, response dependence between items in an outcome measure should be avoided because it inflates person reliability and gives a false impression of measurement precision; in addition, given that all other factors are held constant (eg, the same number and kind of items), there is less information if responses are locally dependent than if they are independent.52 Indeed, valid sums from sets of items require each item in the set to provide related but independent information as well as relevant but not redundant information.53
In line with Rasch analysis, Classical Test Theory methods also raised some concern about the actual contribution of item 1 (“hip/trunk lateral strength”) to the underlying construct of the scale—that is, balance control. Although parallel analysis supported the unidimensionality of the Brief-BESTest, exploratory factor analysis revealed low communality for item 1.
However, caution in generalizing these results to different groups or settings is warranted because of the selection criteria of our sample, a cross-section of participants who had balance disorders from different neurological etiologies and were recruited consecutively at a single rehabilitation facility. Moreover, the present study lacked a DIF analysis by diagnosis because the variety of diagnostic categories with relatively small numbers of participants with each pathology would have yielded unreliable estimates for most pathologies. For adequate power, more participants equally distributed among different diagnoses would have been needed.
In conclusion, the aim of the present study was to investigate a series of psychometric properties of the Brief-BESTest with stringent criteria, including Rasch methods. The 3 main findings of the present study can be summarized as follows: (1) the Brief-BESTest was confirmed to have some acceptable-to-good reliability indexes when calculated according to Classical Test Theory, (2) the Brief-BESTest showed fairly limited sensitivity to change, and (3) the application of Rasch methods to both analysis of internal construct validity and PCA of standardized residuals clearly indicated that item selection in the Brief-BESTest could be improved from a psychometric point of view.
More generally, to produce—starting from the BESTest—a very short measure of balance useful for a wide spectrum of people with postural instability, further research based on modern psychometric approaches is needed. Such research should reanalyze, in depth, both the BESTest and the Brief-BESTest, with the goals of minimizing item redundancy and selecting a unidimensional pool of items with coverage and technical quality similar to or better than those of the Mini-BESTest.18
Footnotes
Mr Godi was responsible for concept/idea/research design. Dr Bravini, Prof Nardone, Mr Godi, Dr Franchignoni, and Dr Giordano wrote the manuscript. Dr Bravini, Mr Godi, and Mr Guglielmetti were responsible for data collection and participant recruitment. Dr Bravini, Mr Godi, and Dr Giordano provided data analysis. Mr Godi was project manager. Prof Nardone, Dr Franchignoni, and Dr Giordano provided guidance and consultation (including review of manuscript before submission). Rosemary Allpress scrutinized the English.
This study was approved by the Central Ethics Committee of the Salvatore Maugeri Foundation.
This study was supported, in part, by a Ricerca Finalizzata grant (RF-2010-2312497) from the Italian Ministry of Health and by a PRIN 2010–2011 grant (2010R277FT) from the Italian Ministry of Education, University, and Research.
- Received September 28, 2015.
- Accepted April 14, 2016.
- © 2016 American Physical Therapy Association