Abstract
Background Despite increasing clinical and research use of the 11-item version of the Tampa Scale for Kinesiophobia (TSK-11) in people with neck pain, little is known about its measurement properties in this population.
Objective The purpose of this study was to rigorously evaluate the measurement properties of the TSK-11 when used in people with mechanical neck pain.
Design This study was a secondary analysis of 2 independent databases (N=235) of people with mechanical neck pain of primarily traumatic origin.
Methods The TSK-11 was subjected to Rasch analysis and subsequent evaluation of concurrent associations with the Neck Disability Index and a numeric rating scale for pain intensity.
Results The TSK-11 conformed well to the Rasch model for interval-level measurement, but less so for acute or nontraumatic etiologies. A transformation matrix suggested that small changes at the extremes of the scale are more meaningful than in the middle. Cross-sectional convergent validity testing suggested relationships of expected magnitude and direction compared with pain intensity and neck-related disability. The use of the linearly transformed TSK-11 led to potentially important differences in distribution of data compared with use of the raw scores.
Limitations The sample size was slightly smaller than desired for Rasch analysis. The 2 databases were similar in terms of symptom duration, but differed in pain intensity and age.
Conclusions The TSK-11 can be considered an interval-level measure when used in people with neck pain. It provides potentially important information regarding the nature of neck-related disability. Clinically important difference may not be consistent across the range of the scale.
The perception that neck pain is a biopsychosocial phenomenon with multifactorial etiologies is gaining acceptance.1 Although symptoms of neck pain and stiffness contribute significantly to the overall experience of neck-related disability, they are not the only influences. As with most musculoskeletal pain conditions, cognitions and beliefs about the symptoms, fear of movement or reinjury,2,3 and social or environmental factors4,5 also contribute to a person's experience of disability. In the absence of hard diagnostic signs of these constructs, clinicians must rely upon the judicious use of carefully constructed, well-validated self-report measures to better identify specific factors that could influence clinical outcomes. One such tool is the Tampa Scale for Kinesiophobia (TSK).6
The TSK provides a measure of irrational fear of movement or reinjury. It is premised on the idea that fear of pain or movement may be a stronger influence on disability than pain itself.7 In its original form, the TSK comprised 17 items that evaluated respondents' general disposition toward the safety of movement and the fragility of their condition on an opinion-based scale (0=strongly disagree, 1=disagree, 2=agree, and 3=strongly agree). A re-evaluation of the original scale's properties by Woby and colleagues8 led to the removal of the 4 reverse-scored items and 2 additional poorly performing items, creating an 11-item version of the Tampa Scale for Kinesiophobia (TSK-11). Using approaches drawn from classical test theory, Woby and colleagues found the TSK-11 to possess adequate evidence of construct and content validity and test-retest reliability.8 Originally intended for use in people with chronic low back pain, both forms of the TSK have shown increasing use among other clinical populations, most notably in people with neck pain.9,10
An important issue arises, however, in the accurate clinical interpretation of TSK scores for such patients. For example, there are 34 different possible total scores on the TSK-11 (0 to 33), and it should not inherently be considered an interval-level measure for the purposes of statistical analysis without higher-level analyses of its properties. Luce and Tukey11 have defined the axioms of quantitative measurement that must be satisfied before a measurement tool can be logically considered a continuous, interval-level scale. Among these axioms are that: (1) the scale score must have the same meaning regardless of who is observing (measuring) or being measured, (2) the scale scores must be additive, and (3) the distance between any 2 points must be consistent across the entire range of the scale. Many common statistical procedures for hypothesis testing assume interval-level measurement, but many psychometric tools are more appropriately classed as ordinal level until evidence is available to suggest otherwise. The potential for spurious results increases when procedures reserved for interval-level measurement are applied to ordinal-level scales.12–14 Rasch analysis15 is one approach that can be used to evaluate the degree to which an ordinal scale, such as the TSK-11, conforms to the axioms of quantitative measurement and, therefore, can be supported as interval level for clinical and research purposes.
The purposes of this study were (1) to evaluate the properties of the TSK-11, using a Rasch paradigm, and (2) to re-evaluate concurrent associations using the interval-level form of the scale in people with neck pain.
Method
Procedure
The database for this study was constructed from 2 independent databases owned by the authors. One sample was drawn from participants at the University of Queensland, Brisbane, Australia, and Northwestern University, Chicago, Illinois, and another was drawn from research participants at Western University, London, Canada (longitudinal cohort study) collected between 2009 and 2012. Both databases were intended to evaluate influences on outcomes following neck pain of primarily traumatic origin, and recruitment was through rehabilitation clinics or in response to posted advertisements.
Participants
Participants were eligible for the study if they were experiencing neck pain of nonmalignant and nonbony (ie, fracture or dislocation) etiology, of any mechanism and for any duration; were between the ages of 18 and 65 years; were able to read and understand English at a conversational level; and were free of major systemic disease (eg, cancer, lung disease) or neuromuscular conditions (eg, stroke, multiple sclerosis). Neck pain was considered present if the respondent reported any pain or stiffness in the area bounded by the occiput superiorly and the seventh cervical vertebra inferiorly, with or without radiating symptoms. The 2 databases did not differ in duration of symptoms, but, on average, the individuals in the Canadian database were older (42.6 versus 29.7 years, P<.01) and reported lower mean pain intensity (5.1 versus 5.9, P=.01). Approval for both original studies was obtained from the relevant institutional review boards (University of Queensland, Northwestern University, and Western University) prior to initiating data collection.
All participants provided demographic and general descriptive information, including age, sex, mechanism of injury, and location and duration of symptoms. Pain intensity was rated on a 0 to 10 numeric rating scale (NRS), where 0 indicated no pain and 10 indicated extremely intense pain. Participants also completed the TSK-11 and the Neck Disability Index (NDI). The NDI is a widely used and well-validated self-report measure of neck-related disability16,17 comprising 10 items, each scored on a 0 to 5 scale, where a higher number indicates greater disability.
Data Analysis
Rasch analysis assumes a unidimensional scale. Principal components analysis (PCA) with varimax rotation was conducted as a first step to identify any potential problems with the factor structure in our sample prior to Rasch analysis. Horn's parallel analysis technique18 was used to determine the number of factors to extract.
Assuming appropriate factor structure, extreme scores, either 0 (floor) or 33 (ceiling), were removed, as it is impossible to reliably locate those scores in the Rasch model. Two different Rasch models are widely used: the distance-varying partial credit model and the uniform-distance rating scale model. Generally, a strong argument is required before the rating scale model is accepted. A likelihood ratio test can be conducted to determine whether the distances between response options are similar enough to use the rating scale model. A statistically significant result would suggest that the distances vary from each other to an extent greater than chance and, therefore, that the partial credit model should be used.
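To make the distinction between the 2 models concrete, the following minimal Python sketch computes the response-category probabilities for a single 4-option item. It assumes the standard partial credit formulation, in which the probability of scoring k is proportional to the exponential of the cumulative sum of (person location minus threshold location). The function name and the threshold values are hypothetical and are not estimates from this study.

```python
import math

def pcm_probabilities(theta, thresholds):
    """Category probabilities for one polytomous item (partial credit model).

    theta      -- person location in logits
    thresholds -- threshold locations in logits, one per step between
                  adjacent response options (3 thresholds for a 0-3 item)
    """
    # Cumulative sums of (theta - threshold); the empty sum for category 0 is 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    numerators = [math.exp(e) for e in exponents]
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical thresholds for illustration only (not values from this study).
item_thresholds = [-1.0, 0.0, 1.0]
probs = pcm_probabilities(0.0, item_thresholds)
print([round(p, 3) for p in probs])
```

Under the rating scale model, every item would share the same spacing between thresholds and differ only in overall location, whereas the partial credit model allows each item its own set of threshold distances.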
In order for scale data to fit the Rasch model, the position of each person in the database should be estimable by virtue of knowing only his or her score on the scale. Conversely, the score of each person should be estimable by virtue of knowing only his or her position relative to all other individuals. Deviations from estimated person location or scale score are termed “residuals.” If the residuals are relatively small and are normally distributed, the data adequately fit the Rasch model. This assumption can be tested descriptively and statistically. Descriptively, the mean of all item and person residuals, which are logit transformed prior to analysis, should be close to 0, with a standard deviation close to 1. Fit can be evaluated statistically through an item-trait interaction statistic where a P value <.05 indicates significant deviation. Significance would suggest the data do not fit the model and reasons for misfit should be explored.
Misfit can be the result of a number of scale problems, and Rasch analysis allows for a detailed evaluation of each problem. The first is that the response options may not be performing as expected. RUMM2030 software (Rumm Laboratory Pty Ltd, Duncraig, Western Australia, Australia) was used to determine whether any of the response thresholds were disordered, as would be the case if, for example, the threshold between responses of 2 and 3 was located lower on the scale than the threshold between responses of 1 and 2. This disordered response threshold may be the result of a poorly worded item or ambiguous response options, either of which will result in an inability to reliably estimate a person or item score.
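The ordering check described above can be sketched in a few lines of Python. This is an illustration of the concept only, not the procedure implemented in RUMM2030; the function name and threshold values are hypothetical.

```python
def disordered_thresholds(thresholds):
    """Return the indices i at which threshold i+1 is not located above threshold i."""
    return [i for i in range(len(thresholds) - 1)
            if thresholds[i + 1] <= thresholds[i]]

# Hypothetical threshold locations (logits) for two 0-3 items.
# Index 0 is the 0/1 threshold, index 1 the 1/2 threshold, index 2 the 2/3 threshold.
ordered_item = [-1.2, 0.1, 1.4]
disordered_item = [-0.5, 0.9, 0.3]  # the 2/3 threshold lies below the 1/2 threshold

print(disordered_thresholds(ordered_item))      # []
print(disordered_thresholds(disordered_item))   # [1]
```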
Another source of misfit is differential item functioning (DIF). Conceptually, DIF occurs when a clinically relevant subset of the sample responds differently to an item than does the rest of the sample. This difference is identified through splitting the sample by level of the trait (eg, sex) and then plotting the location of similar groups of each sample, termed “class intervals,” along the continuum of the construct of kinesiophobia. A trait-by-class-interval analysis of variance is conducted to determine whether the ability to estimate score or location between the 2 levels of the trait differs to a degree greater than chance. Differential item functioning can be either: (1) uniform, in which one subgroup consistently performs differently across all levels of the item, or (2) nonuniform, in which the subgroups differ at some but not all levels of the item.
A third source of misfit is referred to as “multidimensionality” or “location dependence.” According to the axioms of quantitative measurement, a scale should measure only one underlying domain. If there are multiple factors, or “dimensions” in the Rasch lexicon, the location of people or items cannot be estimated for one dimension by virtue of knowing the score on another dimension. In other words, it is not appropriate to sum scores across 2 different subscales. Location dependence is a related construct; in this case, the response to one item is influenced by the response to a previous item. For example, if an earlier item on a scale were to ask about walking ability, a negative response to that item would, by definition, result in a negative response to a subsequent item asking about running ability. Location dependence would suggest that a subset of items may be related and, as such, should either be partialled out of the scale as a subtest or collapsed together into a “super item,” such as “mobility” in the case of our walking example. Both multidimensionality and location dependence can be identified through the individual item residuals. A PCA with varimax rotation of the residuals and a correlation matrix among all residuals can be used to identify multidimensionality or location dependence. To test the effect that location dependence has on model fit, a subtest procedure is performed in which conceptually similar items are grouped together to form a super item and re-entered into the Rasch analysis. If this procedure corrects the misfit, the scale is said to conform adequately to the Rasch measurement model.
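As a rough illustration of the residual correlation step, the sketch below computes pairwise Pearson correlations among item residuals and flags pairs above a cutoff. The cutoff of 0.3, the item labels, and the residual values are all hypothetical; the criterion actually applied in a given analysis should be taken from the analyst's own protocol.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_dependent_pairs(residuals, cutoff):
    """residuals: dict of item label -> list of person residuals (same person order).

    Returns (item_a, item_b, r) for every pair whose residual correlation
    exceeds the cutoff.
    """
    flagged = []
    for a, b in combinations(residuals, 2):
        r = pearson(residuals[a], residuals[b])
        if r > cutoff:
            flagged.append((a, b, round(r, 2)))
    return flagged

# Hypothetical residuals for 3 items and 5 respondents.
example_residuals = {
    "item_a": [0.5, -1.0, 1.2, -0.3, 0.1],
    "item_b": [0.6, -0.9, 1.0, -0.2, 0.0],
    "item_c": [-0.4, 0.8, 0.1, -0.9, 0.5],
}
flagged = flag_dependent_pairs(example_residuals, cutoff=0.3)  # hypothetical cutoff
print(flagged)
```

In this made-up example, only the first 2 items are flagged, which would make them candidates for grouping into a super item.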
Assuming adequate fit, a transformation matrix is constructed that allows transformation of raw TSK-11 scores into a linear scale for statistical testing. To determine whether this procedure has an appreciable impact on the results of cross-sectional validity testing, we correlated raw and transformed TSK-11 scores with age, NDI and NRS score. Group mean comparisons were performed using independent t tests to explore differences in the P value when the raw and transformed scores were used among sex (male/female), duration (acute/chronic), and cause (motor vehicle accident/other).
Role of the Funding Source
Part of the data for this study were collected for a cohort study funded through the Physiotherapy Foundation of Canada. No funding was received specifically for the secondary analysis described in this report.
Results
PCA
The initial sample was composed of 242 individuals with neck pain of varying causes and durations (Tab. 1). Eigenvalues for 100 randomly generated 242-person data sets were calculated and averaged to determine the appropriate threshold for factor retention. Using PCA with varimax rotation, one factor was extracted with an eigenvalue of 4.83 that explained 43.9% of total variance in the score. A second factor with an eigenvalue of 1.16 fell below the threshold for a second factor retention (1.26), as determined by the random data sets generated by the parallel analysis technique. As such, we continued to Rasch analysis.
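The retention decision and the proportion of variance reported above can be reproduced with simple arithmetic. The sketch below assumes the PCA was run on the correlation matrix of the 11 items, so that total variance equals the number of items; the variable names are illustrative.

```python
n_items = 11
observed_eigenvalues = [4.83, 1.16]   # first two components, as reported
parallel_threshold_second = 1.26      # random-data threshold for the second component, as reported

# With a correlation-matrix PCA, each item contributes 1 unit of variance.
variance_explained = observed_eigenvalues[0] / n_items

# A component is retained only if it exceeds the random-data threshold.
retain_second = observed_eigenvalues[1] > parallel_threshold_second

print(f"{variance_explained:.1%}")   # 43.9%
print(retain_second)                 # False
```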
Table 1. Characteristics of the Sample (N=242)
Preanalysis
Seven people were removed from the database after identifying floor scores that were deemed extreme. There were no individuals who scored the extreme ceiling. Accordingly, 235 individuals remained for further analysis. The likelihood ratio test was significant, suggesting the partial credit model was the most appropriate for Rasch analysis.
Fit of the Data to the Rasch Model
General summary statistics for the full 11-item scale, with 4 response options on each item, are shown in Table 2. Mean person location (X̅=−0.33, SD=1.08) approximated mean item location (X̅=0.00, SD=0.38), suggesting the sample was a good match with the items for evaluating the properties of the scale. The data showed significant item-trait interaction (χ2=61.4, P=.002), suggesting misfit to the expectations of the Rasch model. Mean item fit residual was good (X̅=0.20, SD=1.25), and none of the items individually showed significant misfit when a Bonferroni-corrected P value was used (P=.05/11). Individual person fit was evaluated, and 22 respondents had fit residuals outside of the ±2.5 standard deviation criterion for acceptable fit. Of these 22 respondents, 2 showed a positive fit residual, and 20 showed a negative fit residual. Lundgren-Nilsson and colleagues19 have suggested that negative fit residuals do not threaten the interpretation of the findings; rather, they simply do not add additional information. A simple interpretation of person misfit is that those individuals respond differently than expected to the scale items, which warrants investigation of DIF.
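The 2 decision rules used in this paragraph can be written out as a short sketch: the Bonferroni-corrected significance level for 11 item-level tests and the ±2.5 criterion for person fit residuals. The function name and the example residual values are hypothetical.

```python
n_items = 11
alpha = 0.05
bonferroni_alpha = alpha / n_items   # corrected threshold for each item-level test

def person_misfit(fit_residual, limit=2.5):
    """Classify a person fit residual against the +/-2.5 criterion."""
    if fit_residual > limit:
        return "positive misfit"
    if fit_residual < -limit:
        return "negative misfit"
    return "acceptable"

print(round(bonferroni_alpha, 4))    # 0.0045
# Hypothetical residuals for three respondents.
print(person_misfit(3.1), person_misfit(-2.8), person_misfit(0.4))
```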
Table 2. Summary Statistics of the Rasch Analysis of the 11-Item Version of the Tampa Scale for Kinesiophobia
There were no instances of disordered response thresholds for any of the 11 items, suggesting this factor was not the cause of the misfit. The following factors were used to create subgroups for evaluation of DIF: (1) sex, (2) traumatic versus nontraumatic causes of neck pain, and (3) short-term (<180 days) versus chronic (≥180 days) duration of symptoms. There were no cases of DIF by sex. Item 5 (“My accident/injury has put my body at risk for the rest of my life”) showed significant uniform DIF for both cause and duration of neck symptoms. Respondents with a nontraumatic mechanism for their symptoms showed greater misfit to the model than did those with traumatic mechanisms. Similarly, respondents experiencing symptoms for less than 6 months showed greater misfit to the model than did those experiencing symptoms for 6 months or longer. Taken together, these results suggest that item 5 may not function as intended for respondents with neck pain of nontraumatic origin or when their pain has been present for less than 6 months.
Rather than attempting to split item 5 for the acute nontraumatic subgroup, we elected to remove those 28 individuals from the database, as that item is arguably not appropriate for that etiology. The remaining data set of 207 respondents was re-evaluated (Tab. 1). This re-evaluation did not resolve the overall scale misfit (item-trait interaction: χ2=52.9, P=.0002), but did resolve the DIF for item 5.
A correlation matrix of the residuals was constructed, and 6 such correlations fell outside of the accepted criterion, suggesting a complex pattern of locally dependent items. The pattern most closely matched the findings of a previous exploratory factor analysis by Roelofs and colleagues,20 who found the 17-item version of the TSK to contain 2 subscales: activity avoidance and pathologic somatic focus. The corresponding items from the 11-item version used here are: activity avoidance (items 1, 2, 7, 9, 10, and 11) and pathologic somatic focus (items 3, 4, 5, 6, and 8). Guided by this theory, we created 2 subtests with the TSK-11 and re-entered them into the analysis as super items. In simple terms, this procedure considers each subtest (activity avoidance and pathologic somatic focus) as an individual item, with a number of levels dictated by the total number of items-by-response options (18 thresholds for activity avoidance, 15 thresholds for pathologic somatic focus). The subtest procedure using the sample of 207 respondents (excluding those with short-term, nontraumatic neck pain) resolved the misfit (item-trait interaction: χ2=0.69, P=.95). The 2 subscales correlated at r=.84, and there were no significant differences in fit residuals at any level between the 2 subscales. The mean fit residual was 0.27 (SD=0.55), and the person separation index (analogous to the Cronbach alpha) was 0.84. The Figure provides a histogram of response threshold locations to person locations for the total scale, and Table 3 is a transformation matrix for converting raw TSK-11 scale scores to interval-level scores for statistical analysis.
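The construction of the 2 super items can be illustrated with a brief sketch that sums item responses within each subtest, using the item groupings listed above. The function name and the example respondent are hypothetical; the sketch shows only the scoring step, not the Rasch re-analysis itself.

```python
ACTIVITY_AVOIDANCE = (1, 2, 7, 9, 10, 11)
PATHOLOGIC_SOMATIC_FOCUS = (3, 4, 5, 6, 8)

def subtest_scores(responses):
    """responses: dict of item number (1-11) -> response (0-3).

    Returns (activity avoidance score, pathologic somatic focus score).
    """
    if sorted(responses) != list(range(1, 12)):
        raise ValueError("responses must cover items 1 to 11 exactly")
    if any(score not in (0, 1, 2, 3) for score in responses.values()):
        raise ValueError("each response must be 0, 1, 2, or 3")
    avoidance = sum(responses[i] for i in ACTIVITY_AVOIDANCE)
    somatic = sum(responses[i] for i in PATHOLOGIC_SOMATIC_FOCUS)
    return avoidance, somatic

# Maximum possible scores equal the number of thresholds in each super item.
max_scores = subtest_scores({item: 3 for item in range(1, 12)})
print(max_scores)   # (18, 15)

# Hypothetical respondent for illustration.
example = {1: 2, 2: 1, 3: 0, 4: 3, 5: 2, 6: 1, 7: 2, 8: 1, 9: 0, 10: 1, 11: 2}
print(subtest_scores(example), sum(example.values()))
```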
Figure. Person-item threshold histogram. The spread of the 11-item version of the Tampa Scale for Kinesiophobia (TSK-11) items covers the location estimates of the respondents in the sample, with little evidence of redundancy.
Table 3. Transformation Matrix of the 11-Item Version of the Tampa Scale for Kinesiophobia for Conversion of Raw Ordinal-Level Scores Into Interval-Level Scores
Cross-Sectional Construct Validity
Using the results of the Rasch analysis, we evaluated cross-sectional convergent validity. The sample included those respondents with extreme scores, but excluded respondents with short-term, nontraumatic neck pain, for a total sample of 213. Overall, the linearly transformed scores resulted in a higher mean value (15.3 versus 14.5, P<.01), with a smaller standard deviation (5.4 versus 6.9) and standard error of the mean (0.37 versus 0.47). Table 4 shows the simple bivariate associations among sex, chronicity, cause, age, pain intensity, and NDI. For illustrative purposes, the results of the analyses are presented for both the raw TSK scores and the linearly transformed scores. There were several instances in which the group mean standard deviations and accompanying P values were different to a nontrivial degree between the 2 scoring methods, notably for sex and cause. The TSK-11 showed a significant relationship with cause, pain intensity, and NDI score (P<.01 in all cases).
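The reported standard errors follow directly from the standard deviations and the sample size (standard error of the mean = SD divided by the square root of n). A minimal check of that arithmetic, using only the values reported above:

```python
import math

n = 213
sd_transformed, sd_raw = 5.4, 6.9   # standard deviations as reported

sem_transformed = sd_transformed / math.sqrt(n)
sem_raw = sd_raw / math.sqrt(n)

print(round(sem_transformed, 2), round(sem_raw, 2))   # 0.37 0.47
```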
Table 4. Simple Bivariate Associations Between the 11-Item Tampa Scale for Kinesiophobia Raw and Transformed Linear Scores
Discussion
A higher-order analysis of the measurement properties of the TSK-11 in a sample of people with neck pain has been described using Rasch analysis and cross-sectional associations. The results of the Rasch procedure suggest that the TSK-11 functions well as an interval-level linear scale when not used in people with short-term (<6 months), nontraumatic neck pain. There appear to be 2 dimensions within the scale, but both are related strongly enough that they can be logically summed to produce a total score. Use of the transformed TSK-11 leads to potentially important differences in standard error and associated P values compared with the raw score.
Cleland and colleagues21 evaluated the properties of both the full 17-item version of the TSK and a shortened version with the reverse-scored items removed in a sample of 78 people with neck pain using classical approaches. Test-retest reliability in this sample was moderate to good (intraclass correlation coefficient=.69–.80), and internal consistency was acceptable at α=.87 to .89. Although we did not evaluate test-retest reliability in our sample, the values for internal consistency are comparable to the results from our larger cohort (Person Separation Index=.84).
Despite relatively little formal validation of the TSK in people with neck pain, it has been used as a prognostic indicator following acute injury22,23 and as a modifier of treatment effectiveness in that population.24 Whether in neck pain or other conditions, the score on the TSK has almost universally been evaluated using statistical approaches that should be reserved for linear measurement. The degree to which such practices actually affect results is debatable (see Knapp25 for a review of these controversies). We compared the standard deviation and associated significance (P) value obtained on simple bivariate tests of association between the raw ordinal-level TSK-11 data and the transformed values. The difference in standard deviation was as high as 1.6 points, with difference in P value of .11 between the raw mean difference and transformed mean difference when comparing data between the sexes. We believe this difference extends beyond triviality and could have an important effect when interpreting the results of research using the TSK-11. Although none of the associations evaluated here led to a nonsignificant finding becoming significant or vice versa, the magnitude of difference in distribution of scores conceivably could have such an effect. Furthermore, the mechanics of the Rasch method allow a more in-depth analysis of the performance of individual items and separate clinical groups. Our results suggest that, if the TSK-11 were to be used in people with short-term (ie, acute) or nontraumatic neck pain, item 5 might not function well, and its removal should be considered prior to administration and scoring.
Roelofs and colleagues20 found the full TSK to be composed of 2 separate factors: activity avoidance and pathologic somatic focus. Normally, a multifactorial scale should not be considered summative unless strong evidence for doing so has been presented. We used the same conceptual groupings of items to establish multidimensionality in the Rasch analysis, but went further to determine that these 2 factors are strongly related, such that the score on one can be used to accurately predict the score on the other. Accordingly, although the TSK-11 may be composed of more than one subscale, summing the subscale scores to provide a total overall score that acts as a linear measure is reasonably justified.
Readers should be aware that although the TSK-11 has performed in accordance with the axioms of quantitative measurement and has shown cross-sectional relationships with related constructs in magnitude and direction that make sense, we have not supported criterion-related or “gold standard” validity. We are unable to make any definitive claims that the TSK-11 is a valid measure of kinesiophobia. Phobias are defined by irrational fears, and we are not convinced that statements such as “Pain lets me know when to stop exercising so that I don't injure myself” are inherently irrational. It is conceivable that, in many instances, such a belief is entirely rational. Currently, no firm diagnostic criteria exist for identifying kinesiophobia, and we are at best able to suggest that the TSK-11 provides a measure of general negative valence toward exercise, but not an irrational fear or specific phobia.
The transformation matrix suggests that ordinal-level change in the middle range of the scale is not as meaningful as is change toward the extremes, and this finding may have an important impact on identifying changes that are clinically meaningful. As a practical example, let us assume that an interval-level change of 10% of the overall scale, or 3.3 interval-level points, represents meaningful change. We recognize this is an arbitrary decision, as the clinically important difference has yet to be established for the TSK-11, but will work for this example. A baseline score of 32/33 on the ordinal-level scale represents a score of 29.9/33 on the interval-level scale. In order to reduce that score by 3.3 interval-level points, the ordinal score would have to move from 32 to 30, or from 29.9 to 26.6 on the interval-level scale, to be considered meaningful in our hypothetical example. Conversely, a baseline score of 19/33 on the ordinal scale (18.8/33 on the interval scale) would require an ordinal change from 19 to 14 (18.8 to 15.2 on the interval scale) to be confident that the same level of change had occurred. Using this example, an ordinal change of 5 points in the middle of the scale is linearly similar to an ordinal change of 2 points at the upper end. To facilitate this conversion, a worksheet has been created in Microsoft Excel 9 (Microsoft Corporation, Redmond, Washington) format that is free for clinicians or researchers to use (the worksheet is available in Excel and PDF formats).
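The worked example above can be expressed as a small lookup. The sketch below contains only the 4 raw-to-interval pairs quoted in this paragraph, not the full transformation matrix in Table 3, and the 3.3-point threshold is the arbitrary 10% value adopted for the example. The names used are illustrative.

```python
# Only the raw -> interval pairs quoted in the text; see Table 3 for the full matrix.
QUOTED_CONVERSIONS = {32: 29.9, 30: 26.6, 19: 18.8, 14: 15.2}

def interval_change(raw_before, raw_after, table=QUOTED_CONVERSIONS):
    """Interval-level change corresponding to a change in raw ordinal score."""
    # Rounded to 1 decimal place to match the precision of the quoted values.
    return round(table[raw_before] - table[raw_after], 1)

threshold = 3.3   # hypothetical meaningful change (10% of the 0-33 scale)

upper = interval_change(32, 30)    # 2 raw points near the top of the scale
middle = interval_change(19, 14)   # 5 raw points in the middle of the scale

print(upper, upper >= threshold)     # 3.3 True
print(middle, middle >= threshold)   # 3.6 True
```

With the quoted values, a 2-point raw change at the upper end and a 5-point raw change in the middle both meet the hypothetical threshold, which is the asymmetry the example is intended to show.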
This study is not without its limitations. Although the sample size was large by most standards, it was arguably small for Rasch analysis, in which a sample of 243 or greater is generally desirable for 99% confidence in item calibration stability at ±½ logit.26 However, our sample of more than 200 respondents was clearly well targeted to the scale and, therefore, can be considered appropriate for making inferences about the scale's properties. Another potential limitation is that the data were taken from 2 independent databases and combined, which may introduce some bias into the measurements. In most cases, the variables used were self-reported, and different raters should not have introduced any bias. The difference in mean pain intensity between the 2 databases (0.8 point) is statistically significant but not likely to be clinically relevant, and the difference in mean age provides a better representation across the age span than either database alone. However, readers should note that the sample was largely female (80%), which, while representative of the overall population of people seeking care for neck pain,27 limits our ability to generalize our findings to male respondents. Finally, readers may note that the first step in the analysis described here was a factor analysis on a scale that had yet to be supported as interval level, which on the basis of our arguments presented in the introduction, is arguably inappropriate. Such an approach is common not only in Rasch analysis but also in classical test theory, and is used simply as a starting point from which subsequent results can be interpreted. The results of the Rasch analysis supported the unidimensional interval-level nature of the scale, suggesting the factor analysis was appropriate, but we recognize that such an approach required a leap in logic that ultimately could have been erroneous.
In summary, the TSK-11 appears to function well as an interval-level scale when used in people with neck pain. It does not function as well when the neck pain is shorter term (<6 months) and of nontraumatic origin, and we suggest clinicians consider removing item 5 when the scale is to be used in that population. For researchers comparing simple group means or correlations, the difference in distribution of scores was potentially important between the raw and transformed values, which could affect interpretation of results. For clinicians and researchers, the magnitude of interval-level change associated with the raw ordinal scores is not consistent across the scale, which might influence interpretation of change scores. Using this information, clinicians should feel confident in using the TSK-11 with people with neck pain, especially of traumatic origin and longer duration, and should consider the transformation matrix (Tab. 3) when evaluating change in their patients' fear of movement. The TSK-11 conforms to axioms of linear quantitative measurement and provides potentially valuable information when used in people with neck pain of various etiologies.
Footnotes
- Both authors provided concept/idea/research design, writing, data collection, and institutional liaisons. Dr Walton provided data analysis, project management, fund procurement, and facilities/equipment. Dr Elliott provided study participants and consultation (including review of manuscript before submission).
- Some of the material contained in this report was given as a poster presentation at the 13th World Congress of the International Association for the Study of Pain; August 27–31, 2012; Milan, Italy.
- Received June 15, 2012.
- Accepted September 4, 2012.
- © 2013 American Physical Therapy Association