Abstract
Background Physical functioning is a core outcome domain to be measured in nonspecific low back pain (NSLBP). A panel of experts recommended the Roland-Morris Disability Questionnaire (RMDQ) and Oswestry Disability Index (ODI) to measure this domain. The original 24-item RMDQ and ODI 2.1a are recommended by their developers.
Purpose The purpose of this study was to evaluate whether the 24-item RMDQ or the ODI 2.1a has better measurement properties than the other to measure physical functioning in adult patients with NSLBP.
Data Sources Bibliographic databases (MEDLINE, Embase, CINAHL, SportDiscus, PsycINFO, and Google Scholar), references of existing reviews, and citation tracking were the data sources.
Study Selection Two reviewers selected studies performing a head-to-head comparison of measurement properties (reliability, validity, and responsiveness) of the 2 questionnaires. The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist was used to assess the methodological quality of these studies.
Data Extraction The studies' characteristics and results were extracted by 2 reviewers. A meta-analysis was conducted when there was sufficient clinical and methodological homogeneity among studies.
Data Synthesis Nine articles were included, for a total of 11 studies assessing 5 measurement properties. All studies were classified as having poor or fair methodological quality. The ODI displayed better test-retest reliability and smaller measurement error, whereas the RMDQ presented better construct validity as a measure of physical functioning. There was conflicting evidence for both instruments regarding responsiveness and inconclusive evidence for internal consistency.
Limitations The results of this review are not generalizable to all available versions of these questionnaires or to patients with specific causes for their LBP.
Conclusions Based on existing head-to-head comparison studies, there are no strong reasons to prefer 1 of these 2 instruments to measure physical functioning in patients with NSLBP, but studies of higher quality are needed to confirm this conclusion. Foremost, content, structural, and cross-cultural validity of these questionnaires in patients with NSLBP should be assessed and compared.
Low back pain (LBP) is the primary worldwide cause of years lived with disability according to a report of the Global Burden of Disease.1 Approximately 80% of people experience activity-limiting LBP at some point in their lifetime, and approximately 5% develop chronic LBP lasting for more than 3 months.2 Costs associated with LBP represent a serious burden to society, and lost work productivity accounts for the bulk of these costs.3,4 Approximately 90% of patients with LBP are labeled as having nonspecific low back pain (NSLBP) because a specific cause for their pain cannot be found.5–7
Limitations in physical functioning are frequently reported by patients with NSLBP. The measurement of physical functioning as a core outcome domain in all clinical trials for NSLBP has been recently recommended by a wide, international, multidisciplinary, and multi-stakeholder panel of experts.8 Several patient-reported and back-specific questionnaires have been developed and used to measure back-specific functional status.9 Among these questionnaires, 2 are most frequently used10 and were previously recommended by panels of experts11,12: the Roland-Morris Disability Questionnaire (RMDQ) and the Oswestry Disability Index (ODI). Different versions of both questionnaires have been developed over time,9 and to reduce inconsistency across studies, one specific version for each questionnaire was recommended by their developers: the original 24-item RMDQ and version 2.1a of the ODI.13
The original RMDQ was developed in 1983 from the Sickness Impact Profile, with the aim of developing “a simple, sensitive, and reliable method of measuring disability in patients with back pain.”14(p141) It consists of 24 items representing “physical functions that were likely to be affected by LBP”; each item can be checked if it applies to a patient for that day, leading to a total score that is obtained by counting the number of checked items.13(p3115) The original version of the ODI (ie, ODI 1.0) was published in 1980 with the scope of being “a valid indicator of disability,” where disability was defined as “the limitations of a patient's performance compared with that of a fit person.”15(p271) The ODI consists of 10 items representing different health constructs (eg, pain intensity, physical functioning, sleep functioning, social functioning).16 The first item of ODI 1.0 underwent a substantial change that resulted in the development of ODI version 2.0,17 which presented some very small typographical errors that were corrected to become version 2.1a of the questionnaire.18 The total score of the ODI is calculated by adding all scores of applicable items, dividing the obtained score by the maximal total score, and multiplying the result by 100 to obtain a percentage score.16
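The scoring rules described above can be sketched as follows. This is a minimal illustration, not the developers' official scoring code: the 0 to 5 item range of the ODI is as published, but the function names are illustrative, and the handling of skipped ODI items (dividing by the maximum possible score of the answered items only) reflects the “applicable items” rule described above.

```python
def rmdq_score(checked_items):
    """Original 24-item RMDQ: the total score (0-24) is simply
    the number of items the patient has checked."""
    return sum(1 for item in checked_items if item)

def odi_score(item_scores):
    """ODI 2.1a: each of 10 items is scored 0-5; items left blank
    are passed as None and excluded. The percentage score is the
    sum of answered items divided by their maximum possible score,
    multiplied by 100."""
    answered = [s for s in item_scores if s is not None]
    if not answered:
        raise ValueError("no items answered")
    return 100.0 * sum(answered) / (5 * len(answered))
```

For example, a patient checking 10 of 24 RMDQ items scores 10, and a patient answering all 10 ODI items for a raw sum of 23 scores 23/50 × 100 = 46%.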
To be used in research and clinical practice, a measurement instrument needs to show adequate measurement properties (ie, validity, reliability, and responsiveness).19 The measurement properties of an instrument are context-specific (ie, they depend on various factors, such as study population, clinical setting, time points of assessment, and comparator instruments).20 Therefore, to make an adequate judgment on which of 2 instruments has better measurement properties, both instruments should be administered to the same patients, in the same setting, at the same time points, and with the same comparator instruments. For researchers, clinicians, and their patients who want or have to make a choice between recommended versions of RMDQ and ODI, it would be crucial to know whether one instrument has better measurement properties than the other.
An attempt to compare the measurement properties of the RMDQ and ODI has been made in some reviews13,21–24; however, all of these reviews failed on some key methodological aspects for systematic reviews on measurement properties of instruments.20 Two of these reviews were narrative reviews, as they were not conducted in a systematic fashion,13,24 and none of them included an assessment of the methodological quality of the studies, which is necessary to weigh the trustworthiness of results. Moreover, none of them focused specifically on head-to-head comparison studies, which have the best design to establish whether an instrument is better than another.25 Newman et al26 recently performed a systematic review of head-to-head comparisons between RMDQ and ODI, but they focused only on responsiveness, without making a specific distinction between different versions of the questionnaires, and included all LBP disorders in their evidence synthesis. Hence, to date, no systematic reviews have been conducted to summarize head-to-head comparison studies focusing on all measurement properties of recommended versions of RMDQ and ODI in only patients with NSLBP.
This systematic review aimed to determine whether the 24-item RMDQ or the ODI 2.1a has better measurement properties than the other to measure physical functioning in patients with NSLBP. The rationale for focusing this review solely on patients with NSLBP is related to the scope of the ongoing international effort aimed at developing a core outcome set of domains and measurement instruments to be used and reported in all clinical trials conducted in this large subgroup of patients with LBP.8,27 The highest consensus was reached on the measurement of physical functioning,8 and as a previous panel of experts suggested both the 24-item RMDQ and the ODI 2.1a for this domain,11,12 it is essential to assess whether 1 of the 2 instruments has better measurement properties in the NSLBP population.
Method
This review was conducted and reported following the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) statement.28,29 A protocol was written a priori and can be accessed on the international prospective register of systematic reviews (http://www.crd.york.ac.uk/PROSPERO/, registration number: CRD42014014803).
Data Sources and Searches
The following biomedical databases were last searched on February 2, 2015, to retrieve eligible articles: MEDLINE (through the interface PubMed), Embase (Embase.com), CINAHL (EBSCOhost), PsycINFO (EBSCOhost), and SportDiscus (EBSCOhost). The search strategy consisted of 3 groups of search terms representing the following components of the research aim: (1) RMDQ and ODI, (2) NSLBP, and (3) measurement properties. The 3 groups of search terms were combined with each other with the Boolean operator “AND,” and index and/or title/abstracts terms within each group were combined with the operator “OR.” A specific search filter was used for retrieval of studies on measurement properties of instruments in the MEDLINE database.30 The full electronic search strategies for all databases are presented in eAppendix 1. No language or publication date restrictions were applied to the search strategies. Google Scholar also was searched twice using the extensive names of the 2 questionnaires; the first 100 hits of each search were last checked on February 12, 2015, for inclusion. References of studies included in other systematic reviews21–23,26 also were screened. Backward citation tracking was performed by checking the references of the studies deemed as eligible; forward citation tracking was performed in the database Web of Science by screening titles of articles that cited the eligible studies.
Study Selection
A study was included if it met the following criteria: (1) full-text original article (eg, not an abstract, editorial, or review), (2) purpose to evaluate one or more measurement properties of both 24-item RMDQ and ODI 2.1a, and (3) study population of adult patients (ie, >18 years old) with NSLBP. For the scope of this review, considering the very small adjustments in wording highlighted by its developer,18 the 3 versions of the ODI (ie, 2.0, 2.1, and 2.1a) were included, assessed, and renamed as the same questionnaire (ie, ODI 2.1a). Studies including patients with specific mechanical diagnoses (eg, spinal stenosis, herniated disk) were not included. Studies including patients with the following specific nonmechanical causes for their LBP (eg, infection, cancer, rheumatoid arthritis, ankylosing spondylitis, other inflammatory disorders) also were excluded. Studies including a “mixed” population of patients with LBP were included only if at least 75% of the patients met the inclusion criterion, and the same rule was followed for studies including patients with spinal pain at different levels.
Eligibility criteria were applied independently by 2 reviewers (A.C., L.J.M.) to titles and abstracts of all articles retrieved with literature searches. Full texts of potentially eligible articles were downloaded and assessed against the inclusion criteria by the same 2 reviewers independently. Agreement regarding inclusion was sought in a consensus meeting between reviewers, and in case of disagreements, a third reviewer (R.W.O.) made decisions. If it was not clear which version of the RMDQ or ODI was used in a study, the authors of that study were contacted by email to request this information. The corresponding author of a study was contacted first, and if no answer was received, other authors with a retrievable email address were contacted. If no answer was received from any of the authors or if the authors were not able to say which version was used, the study was not included. Citation tracking and checking references of other reviews were conducted by 1 reviewer (A.C.) and, when potentially eligible studies were retrieved, their eligibility was screened by 2 reviewers independently (A.C., L.J.M.).
Data Extraction and Quality Assessment
The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist31,32 was used to assess the methodological quality of the studies. This checklist consists of 9 boxes, each representing a measurement property included in the COSMIN taxonomy: internal consistency, reliability, measurement error, content validity, structural validity, construct validity (hypotheses testing), cross-cultural validity, criterion validity, and responsiveness.19 Each box contains several items that can each be scored on a 4-point rating scale (ie, poor, fair, good, or excellent). An overall score for the methodological quality of each measurement property for each study is determined by taking the lowest rating of any of the items in a box.32 The COSMIN consensus-based definitions of measurement properties19 were used to decide which properties were assessed in a study and which corresponding boxes had to be completed, regardless of the terminology used in the included studies. Assessment of the methodological quality was performed by 2 reviewers independently (A.C., L.J.M.), and in case of disagreements, a third reviewer (R.W.O.) made final decisions.
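The COSMIN “worst score counts” rule described above, by which the lowest item rating in a box determines the overall methodological-quality score for that measurement property, can be expressed in a few lines. The function name is illustrative:

```python
# Ordered from worst to best, per the COSMIN 4-point rating scale.
RATING_ORDER = ["poor", "fair", "good", "excellent"]

def cosmin_box_score(item_ratings):
    """Overall methodological-quality score for one COSMIN box:
    the lowest rating given to any item in that box ('worst
    score counts')."""
    return min(item_ratings, key=RATING_ORDER.index)
```

For example, a box with items rated excellent, good, and fair is scored fair overall; a single poor item makes the whole box poor.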
A customized data extraction form was developed for this review, and extracted data were subsequently reported in tables. The following information was extracted from each included study by one reviewer (A.C.) and double checked by a second reviewer (L.J.M.): characteristics of the studies (ie, country, language, design, clinical setting, inclusion and exclusion criteria, type of intervention, methods for selection of patients, measurement properties assessed, time points of assessment), characteristics of the patients included in the studies (ie, sample size, age, sex, disease characteristics, and RMDQ and ODI scores at baseline), and results on the assessed measurement properties.
Data Synthesis and Analysis
Meta-analysis of different parameters (eg, Cronbach alpha, intraclass correlation coefficient [ICC], Pearson correlation) was conducted for studies assessing the same measurement properties of the 2 questionnaires. Data extracted on characteristics of studies and participants were used to assess whether there was sufficient clinical and methodological homogeneity. Results of different studies were statistically pooled when: (1) participants displayed similar characteristics in terms of age, sex, and RMDQ and ODI baseline scores; (2) participants were assessed with the same time interval; and (3) the same statistical parameters (ie, same statistical models or formulas) were used. Pooled correlation coefficients with their 95% confidence intervals (95% CIs) were calculated using a Fisher z transformation of the correlations.33 In light of expected between-study heterogeneity, the DerSimonian and Laird random-effects model was used in the meta-analysis.34 Statistical heterogeneity of results was assessed using the Q statistic and the I2. The Q statistic reflects the total amount of variance in the meta-analysis, and the I2 indexes the proportion of variance that is due to between-study differences and is not sensitive to the number of studies considered.35 The I2 values range from 0% to 100%, and values >50% are suggested to represent substantial heterogeneity.35 Sensitivity analyses excluding studies of poor methodological quality were performed to assess whether the pooled estimates were strongly influenced by the results of these studies. All meta-analyses were performed using the Comprehensive Meta-Analysis 2.1 software (Biostat, Englewood, New Jersey).
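The pooling procedure just described can be sketched as follows. This is a minimal illustration of the Fisher z transformation, the DerSimonian and Laird between-study variance estimate, and the Q and I2 statistics, not the Comprehensive Meta-Analysis implementation; it assumes the standard within-study variance of 1/(n − 3) on the z scale.

```python
import math

def pool_correlations(rs, ns):
    """Pool Pearson correlations via Fisher z with a DerSimonian and
    Laird random-effects model. Returns (pooled r, (95% CI), I2 in %)."""
    zs = [math.atanh(r) for r in rs]        # Fisher z transformation
    vs = [1.0 / (n - 3) for n in ns]        # within-study variance on z scale
    ws = [1.0 / v for v in vs]
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    # Q statistic and DerSimonian-Laird between-study variance tau^2
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))
    df = len(rs) - 1
    c = sum(ws) - sum(w ** 2 for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c)
    # Random-effects weights, pooled estimate, and 95% CI on the z scale
    ws_re = [1.0 / (v + tau2) for v in vs]
    z_re = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    se = math.sqrt(1.0 / sum(ws_re))
    lo, hi = z_re - 1.96 * se, z_re + 1.96 * se
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Back-transform (inverse Fisher z is tanh) to the correlation scale
    return math.tanh(z_re), (math.tanh(lo), math.tanh(hi)), i2
```

When all studies report the same correlation, Q is 0, tau squared is truncated to 0, I2 is 0%, and the pooled estimate equals the common value, as expected.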
The overall rating for a measurement property of each instrument was considered “positive,” “indeterminate,” or “negative,” following adapted international quality criteria for good measurement properties (eAppendix 2).36 The criteria for measurement error were modified a priori (eAppendix 2) to enable a straightforward interpretation of results on this property. This interpretation of results would not have been possible if using the original criteria, which take for granted that a study would report parameters of measurement error together with the minimal important change (MIC),36 although this is often not the case. As suggested by the COSMIN initiative,20 a best evidence synthesis was performed for each measurement property, taking into account the results, their consistency, and the methodological quality of the studies (eAppendix 3). One instrument was considered to be better than the other on a given measurement property when it displayed at least a moderate level of evidence with consistent and positive ratings and the other instrument displayed conflicting findings or negative ratings (eAppendix 3). When, for a certain measurement property, both instruments displayed the same level of evidence with consistent and positive ratings, 1 of the 2 instruments was considered better than the other if showing consistently better results in all of the studies. Results for each measurement property were carefully inspected to assess whether a clear difference between instruments could be found in patients with acute or subacute/chronic NSLBP duration.
Role of the Funding Source
The authors acknowledge the Wetenschappelijk College Fysiotherapie (WCF) of the Royal Dutch Society for Physical Therapy (KNGF) for providing funding for this study. This funding body did not have any role in design, conduct, analysis, or interpretation of data, nor in writing the manuscript and deciding to submit the manuscript for publication. The views expressed here are those of the authors and do not necessarily reflect those of their funding bodies.
Results
Figure 1 presents the flowchart for the study selection process. Nine articles37–45 were considered as eligible, including a total of 11 studies comparing the measurement properties of the 24-item RMDQ and ODI 2.1a in patients with NSLBP. Eight articles in which a head-to-head comparison of RMDQ and ODI was performed were not included because the recommended versions of the RMDQ and ODI (Fig. 1) were not used: 6 studies46–51 used the ODI 1.0, 1 study52 used the “chiropractic version” of the ODI, and 1 study53 used the 23-item version of the RMDQ. Two articles54,55 were excluded because they presented the direct comparison of the 2 instruments in patients with specific LBP (Fig. 1). Citation tracking of eligible articles did not add any study to those retrieved through databases and searches of other sources. Table 1 presents the characteristics of the studies and the included participants.
Flowchart of results of search strategy and selection of articles.
Characteristics of the 11 Studies and of the Patients Included in This Reviewa
Two studies40,45 evaluated internal consistency, 4 studies38,40,43,45 evaluated test-retest reliability, 4 studies38,40,43,44 evaluated measurement error, 5 studies40,42,44,45 evaluated construct validity, and 7 studies37–39,41,43,44 evaluated responsiveness. None of the studies made a direct comparison of the following measurement properties: content validity, structural validity, cross-cultural validity, and criterion validity.19 Five studies37,40–43 were conducted only in patients with chronic NSLBP, where chronic NSLBP was defined as the presence of nonspecific LBP for more than 3 months (Tab. 1). Two studies40,41 were conducted in patients with NSLBP for less than 3 weeks, 2 studies39,44 included patients with subacute and chronic NSLBP (ie, pain for more than 6 weeks), and 2 studies38,45 included the whole spectrum of NSLBP duration (Tab. 1). Results on the measurement properties are subdivided and presented in the 3 COSMIN macro domains: reliability, validity, and responsiveness (Tabs. 2 and 3, Fig. 2; eTable). Eight of the studies included in this review37,40–42,44,45 assessed the measurement properties of 5 translated and cross-culturally adapted versions of RMDQ and ODI (ie, Brazilian, Norwegian, German, Italian, and Persian). These studies were considered and assessed together with those evaluating the measurement properties of the original versions, as no modifications were made in the structure of the questionnaires (eg, number of items, type of response options) during the process of translation and adaptation.
Internal Consistency, Reliability, and Measurement Error of the RMDQ and ODI in Head-to-Head Comparison Studies Conducted in Patients With NSLBPa
Responsiveness of the RMDQ and ODI in Studies Making a Head-to-Head Comparison in Patients With NSLBPa
Pooled correlations with 95% confidence intervals (95% CIs) of Roland-Morris Disability Questionnaire (RMDQ) and Oswestry Disability Index (ODI) with other instruments measuring related or unrelated constructs in patients with nonspecific low back pain: (A) correlation between RMDQ and physical functioning subscale of the 36-Item Short-Form Health Survey (SF-36-PF) in 384 patients, (B) correlation between ODI and SF-36-PF in 384 patients, (C) correlation between RMDQ and pain intensity measures (pain intensity was measured with a 100-mm visual analogue scale by Grotle et al,40 Mannion et al,42 and Mousavi et al45 and with a 0–10 numeric rating scale by Monticone et al44) in 416 patients, (D) correlation between ODI and pain intensity measures (pain intensity was measured with a 100-mm visual analogue scale by Grotle et al,40 Mannion et al,42 and Mousavi et al45 and with a 0–10 numeric rating scale by Monticone et al44) in 416 patients, (E) correlation between RMDQ and bodily pain subscale of the 36-Item Short-Form Health Survey (SF-36-BP) in 384 patients, (F) correlation between ODI and SF-36-BP in 384 patients, (G) correlation between RMDQ and general health subscale of the 36-Item Short-Form Health Survey (SF-36-GH) in 205 patients, (H) correlation between ODI and SF-36-GH in 205 patients, (I) correlation between RMDQ and mental health subscale of the 36-Item Short-Form Health Survey (SF-36-MH) in 205 patients, and (J) correlation between ODI and SF-36-MH in 205 patients.
Internal Consistency
Two studies of poor methodological quality assessed internal consistency.40,45 These studies were classified as being of poor quality because they calculated the Cronbach alpha of the total scores without assessing or providing evidence that the questionnaires were unidimensional (Tab. 2). Considering this limitation, statistical pooling was not performed, and it remains unknown whether one of the questionnaires has better internal consistency (Tab. 4).
Best Evidence Synthesis of Measurement Properties of the RMDQ and ODI in Head-to-Head Comparison Studies Conducted in Patients With Nonspecific Low Back Paina
Reliability
Four studies assessed the test-retest reliability of the 2 questionnaires: 2 studies40,43 were classified as being of poor methodological quality, and the other 2 studies38,45 were classified as being of fair quality because of the small sample sizes included (Tab. 2). The study by Mousavi et al45 also was classified as being of fair quality because a short time interval was adopted for the reassessment of the participants (Tab. 2). Statistical pooling was not performed due to discrepancies in time points of assessment and differences in ICC parameters (Tab. 2). Two studies38,45 included patients with acute and chronic NSLBP but did not report descriptive statistics on pain duration (Tab. 1), making it impossible to identify different findings related to NSLBP duration. A moderate level of evidence of good reliability was found for the ODI but not for the RMDQ, for which there were conflicting findings in the 2 studies of fair quality (Tab. 2). These results suggest that the ODI displays better test-retest reliability than the RMDQ (Tab. 4).
Measurement Error
The measurement error of the RMDQ and ODI was compared by 4 studies: 2 studies40,43 were rated as being of poor methodological quality, and 2 studies38,44 were rated as being of fair quality. The sample sizes influenced the quality of 3 of these studies (eTable), and the fair rating of the study by Monticone et al44 was due to the lack of information on how missing items were handled. Meta-analysis for the standard error of measurement (SEM) and the smallest detectable change (SDC) was not performed due to discrepancies in time intervals and in the parameters' formulas (Tab. 2). Due to limited reporting on NSLBP duration in 3 of these studies,38,40,44 it was not feasible to identify whether there were discrepant results on this property related to pain duration. In the 2 studies of fair methodological quality, the ODI displayed moderate evidence of a positive rating for its SDC, while the RMDQ displayed a negative rating (Tab. 2; eAppendix 3). These results indicate that the ODI has a smaller measurement error than the RMDQ.
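The SEM and SDC parameters compared here are commonly computed as SEM = SD × √(1 − ICC) and SDC = 1.96 × √2 × SEM. Because the included studies used varying formulas (one of the reasons pooling was not performed), the sketch below shows only one common variant, under those assumptions:

```python
import math

def sem_from_icc(sd, icc):
    """Standard error of measurement, one common formula:
    SEM = SD * sqrt(1 - ICC), with SD the between-subject
    standard deviation and ICC the test-retest reliability."""
    return sd * math.sqrt(1.0 - icc)

def sdc_individual(sem):
    """Smallest detectable change at the individual level:
    SDC = 1.96 * sqrt(2) * SEM, ie, the change needed to
    exceed measurement error with 95% confidence."""
    return 1.96 * math.sqrt(2.0) * sem
```

For instance, with an SD of 10 points and an ICC of .91, the SEM is 3.0 points and the SDC about 8.3 points; a change score is then distinguishable from measurement error only if it exceeds the SDC, which is why an SDC smaller than the MIC is rated positively.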
Construct Validity–Hypotheses Testing
Construct validity was assessed in 5 studies: 3 of fair methodological quality40,45 and 2 of poor methodological quality42,44 (eTable). Studies of fair quality were judged as such because of limited information regarding the measurement properties of comparator instruments in any study population. The other 2 studies were rated as poor because of lack of information on the comparator instruments42 or because it was unclear what was expected for the correlations between instruments.44 Meta-analyses were performed on Pearson correlations of the RMDQ and ODI separately when these correlations were calculated with the same comparator instruments in at least 3 studies (Fig. 2). Given the focus of this review on physical functioning, disease-specific comparator instruments measuring function or disability were considered as instruments measuring the same or a related construct; all other instruments were considered as measuring unrelated constructs.
Pooled correlations with the physical functioning subscale of the Medical Outcomes Study 36-Item Short-Form Health Survey (SF-36-PF) were −.66 (95% CI=−.77, −.60) for the RMDQ and −.70 (95% CI=−.77, −.61) for the ODI (Figs. 2A and 2B); substantial heterogeneity was found for the pooled estimate of the ODI. Pooled correlations of .46 (95% CI=.35, .55) and −.46 (95% CI=−.61, −.26) were found for the RMDQ with pain instruments (Figs. 2C and 2E), and correlations of .54 (95% CI=.41, .64) and −.56 (95% CI=−.68, −.40) were found for the ODI with the same instruments (Figs. 2D and 2F); substantial heterogeneity was found for all but one of these estimates (Fig. 2C). Both instruments displayed correlations below .50 with other unrelated constructs, with the ODI showing higher correlations than the RMDQ and with no substantial heterogeneity in these meta-analyses (Figs. 2G–2J). Sensitivity analyses revealed that all of these pooled estimates were not substantially different when the studies of poor methodological quality42,44 were removed. Correlations investigated only in 1 or 2 studies were not included in meta-analyses and are presented in the eTable. The ODI showed consistently higher correlations than the RMDQ with all of the other instruments assessed, with the only exception of the correlation with the role–physical subscale of the SF-36 in one study44 (Fig. 2, eTable). One study40 in the meta-analyses included patients with acute NSLBP, and its correlations were in line with the other studies in chronic NSLBP included in the meta-analyses; the only difference was a lower correlation with the SF-36 bodily pain subscale, but that was not substantially different between the RMDQ and ODI. Correlations in the meta-analyses were all as hypothesized (eAppendix 2) for the RMDQ (100%), whereas the ODI met 3 out of 5 of the expected correlations (60%) (Figs. 2D and 2F).
In assessing the results of individual studies, the RMDQ met 62.5% of the a priori hypotheses, and the ODI met 75% of them (eTable). In performing best evidence synthesis, more weight was allocated to the results of the meta-analyses, as the meta-analyses were based on more precise correlation estimates. A moderate level of evidence with a consistent positive rating was given to the RMDQ, as all a priori hypotheses were met in the meta-analyses, whereas results for the ODI were considered as conflicting. These results indicate that the RMDQ has better construct validity than the ODI for measuring physical functioning in patients with NSLBP (Tab. 4).
Responsiveness
Seven studies37–39,41,43,44 of fair methodological quality compared the responsiveness of the 2 instruments (Tab. 3). All but one study43 assessed responsiveness using a construct approach, and all of the studies assessed responsiveness using a criterion approach, with a global perception of change scale (GPCS) as a gold standard, although the number of scale points of the GPCSs was inconsistent across studies (Tab. 3). The overall quality score of all studies37–39,41,43,44 was influenced by different factors: unclear description of the handling of missing items, vague or absent hypotheses regarding correlations or effect sizes, limited information on measurement properties of comparator instruments, and uncertainty regarding the GPCS as an adequate gold standard. Statistical pooling was not performed due to discrepancies in time points of assessment and differences in the GPCSs used in the different studies (Tab. 3). As was done for construct validity, disease-specific comparator instruments measuring function or disability were considered as measuring a similar construct, and all other instruments were considered as measuring unrelated constructs. In 3 studies,41,44 correlations of change scores of both instruments with change scores in the SF-36-PF were lower than correlations with changes in pain measures, which was unexpected. In 2 of these studies,41,44 the correlations with the SF-36-PF were below .50, whereas those with some pain measures were above .50; both findings also were unexpected (Tab. 3). Forty percent of the correlations were in accordance with our hypotheses for the RMDQ, and 50% were in accordance with our hypotheses for the ODI. In 6 studies,37–39,41,44 both the RMDQ and ODI displayed larger standardized response means (SRMs) for the group of “improved” patients than for the whole group or for those “not improved” (Tab. 3).
In one study,43 both questionnaires displayed areas under the curve (AUCs) below 0.70; in 2 studies,39,44 only the RMDQ presented an AUC slightly below this threshold for a positive rating (eAppendix 2). The only study including solely patients with acute NSLBP41 showed higher correlations, effect sizes, and AUCs than other studies, but the results were similar and conflicting for both questionnaires. Overall, due to a negative rating for correlations and a positive rating on SRMs (Tab. 3, eAppendix 2), the evidence was considered as conflicting for both instruments, which consequently rendered the comparison of the responsiveness of the 2 instruments inconclusive (Tab. 4).
Discussion
A systematic review was conducted to assess studies directly comparing the measurement properties of the original 24-item version of the RMDQ and version 2.1a of the ODI in patients with NSLBP. Nine articles, including 11 studies in the review, met the eligibility criteria (Fig. 1). There was moderate-quality evidence showing that the ODI has better test-retest reliability and less measurement error than the RMDQ (Tab. 4). On the other hand, there was moderate-quality evidence suggesting that the RMDQ has better construct validity than the ODI as a tool to assess physical functioning. Conflicting evidence was found for responsiveness of both instruments, and their internal consistency is unknown because only studies of poor methodological quality assessed that measurement property (Tab. 4). In this review, no clearly different findings in measurement properties could be shown for patients with a different NSLBP duration. Overall, based on the 5 measurement properties assessed in the studies included in this review, there are no strong arguments to prefer one instrument over the other to measure physical functioning in patients with NSLBP. Nevertheless, this systematic review provides some valuable information that should be placed on the research agenda of the scientific community. First, head-to-head comparison studies of adequate methodological quality on the 5 measurement properties included in this review are necessary. Second, and more importantly, some key measurement properties of these 2 instruments (ie, content, structural, and cross-cultural validity) should be compared in patients with NSLBP.
The ODI 2.1a was found to have better test-retest reliability, mainly because an ICC below .70 was found for the 24-item RMDQ in one study38 of fair methodological quality (Tab. 2). A recent systematic review23 retrieved 28 studies assessing test-retest reliability of all RMDQ versions, finding only 2 studies displaying an ICC below .70.38,56 These results might suggest that the findings of Davidson and Keating38 could be considered fortuitous or strictly related to the long time interval used for reassessment; that review also found that heterogeneity in reliability results across studies can be explained by the different test-retest time frames adopted.23 However, the same review also substantiated the results for test-retest reliability found in this review, as the pooled ICC for the ODI was higher than that of the RMDQ.23
The studies included in this review showed that the ODI has a smaller measurement error than the RMDQ, which also explains the results in favor of the ODI for test-retest reliability. These findings are in line with those of the review of Geere et al,23 who found smaller mean SDCs for the ODI when all versions of the 2 questionnaires were considered. However, in that review,23 the difference in SDC between the 2 questionnaires was no longer present when only time intervals shorter than 14 days were analyzed. This difference could not be assessed in our review because the 2 studies of fair methodological quality38,44 adopted time intervals of 6 and 8 weeks, respectively. Another way to assess the measurement error of an instrument is to compare it with its MIC and evaluate whether the instrument is able to discriminate measurement error from the MIC.36 Nevertheless, a limitation of this approach is that it might be difficult to use absolute MIC values for an instrument, considering that they can be context- and population-specific57 and dependent on baseline values of the assessed questionnaires.58 Two of the studies included in this review43,44 also estimated the MIC of the 2 instruments and showed them to be smaller than SDCs for both questionnaires. This finding would indicate that neither of the 2 instruments is able to discriminate between SDC and MIC or to detect “real” changes in the construct measured. Hence, although the ODI has a smaller measurement error than the RMDQ, it cannot be asserted that it also has a greater ability to discriminate between SDC and MIC.
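The link between test-retest reliability and measurement error noted above can be made explicit with the conventional formulas from the measurement literature (shown here as a sketch in standard notation; the included studies may have estimated the SEM differently, eg, from a repeated-measures ANOVA):

```latex
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1-\mathrm{ICC}}, \qquad
\mathrm{SDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}
```

Because the SDC shrinks as the ICC rises (for a comparable score SD), an instrument with better test-retest reliability will typically also display a smaller SDC, which is consistent with the ODI performing better on both properties here.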
The ODI consistently displayed higher correlations with other instruments measuring the same or unrelated constructs (eg, pain intensity, general health, mental health, social functioning) (Fig. 2, eTable). On the one hand, these results could suggest that the construct measured by the 2 questionnaires is not precisely the same and that the construct measured by the ODI might be broader than that of the RMDQ; they also might indicate that the RMDQ measures a narrower construct and that it might provide a more focused assessment of physical functioning. On the other hand, it also is possible that the stronger correlations of the ODI could be partly explained by its smaller measurement error documented in this review. It should be noted that we made the a priori decision to consider pain intensity as an unrelated construct because the purpose of our study was to assess the RMDQ and ODI as measures of physical functioning, defined as “the ability to carry out daily physical activities.”8(p1133) This subjective decision strongly influenced both the specific conclusion that the RMDQ has better construct validity and the general conclusion that there are no strong arguments to prefer 1 of the 2 instruments. It could be criticized, as pain intensity also could be considered a construct related to the RMDQ and ODI, given that they are LBP-specific instruments. These 2 instruments were developed to measure disability,14,15 which, taking into account frequently used models and definitions,59,60 is a domain that cannot be considered equivalent to physical functioning as defined for this study.
Moreover, previous analyses of the content of the 2 instruments have shown that they do not measure only daily physical activities.9,61,62 Overall, the results of this review on construct validity should be further explored by future studies of good or excellent methodological quality formulating multiple and specific a priori hypotheses regarding expected correlations with other instruments and by studies comparing the content validity of the 2 instruments as measures of physical functioning or of a larger construct.
Responsiveness was assessed by the majority of the studies included in this review, but conflicting evidence was found for both the RMDQ and ODI (Tab. 4). All of the studies lacked the formulation of multiple a priori and specific hypotheses regarding expected correlations with changes in other instruments or effect sizes; this gap should be filled by future longitudinal studies assessing this measurement property.20 Another methodological aspect that should be improved in future studies is the formulation of GPCSs used as gold standards to assess responsiveness following a criterion approach.20 It was recently shown that construct-specific anchors have higher validity than global anchors,63 whereas the studies included in this review used generic GPCSs. Considering that both the RMDQ and ODI are widely used as outcome measurement instruments,10 it is fundamental that they display good responsiveness, and studies of good or excellent quality are needed to better assess this measurement property. The rating of conflicting evidence in this review also was driven by the fact that correlations between the change on the RMDQ or ODI and the change on the SF-36-PF were found to be lower than correlations obtained with instruments not measuring the same construct (Tab. 3). It could easily be asserted that these unexpected results were due to the poor responsiveness of the SF-36-PF, but several studies have shown that, although it displays lower results than other measurement tools, the responsiveness of the SF-36-PF is above minimum criteria for both AUCs and effect sizes.64–67
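As a concrete illustration of the criterion approach described above, a responsiveness AUC quantifies how well change scores discriminate patients classified as improved versus not improved on an anchor such as a GPCS. The following is a minimal sketch with invented data (the included studies used their own anchors, samples, and software); the AUC is computed via the rank-based Mann-Whitney formulation, which is equivalent to the area under the ROC curve.

```python
import numpy as np

def responsiveness_auc(change_scores, improved):
    """AUC for discriminating improved vs non-improved patients,
    computed as the Mann-Whitney U statistic scaled to [0, 1].
    Ties between the two groups count as 0.5."""
    change = np.asarray(change_scores, dtype=float)
    improved = np.asarray(improved, dtype=bool)
    pos = change[improved]       # change scores of patients rated improved on the anchor
    neg = change[~improved]      # change scores of patients rated not improved
    # Pairwise comparisons: estimate P(change of improved > change of non-improved)
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical questionnaire change scores (positive = improvement) and a
# dichotomized global anchor (1 = improved) -- purely illustrative values
change = [9, 7, 4, 3, 5, 1, 0, -1]
anchor = [1, 1, 1, 1, 0, 0, 0, 0]
auc = responsiveness_auc(change, anchor)  # 0.875, above the 0.70 threshold
```

An AUC of at least 0.70 is the usual criterion for a positive responsiveness rating, which matches the threshold applied to the studies summarized in this review.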
The comparison of internal consistency of the 2 instruments was inconclusive, considering that unidimensionality of the instruments was not assessed. Despite the fact that these are the 2 most widely used outcome measurement instruments,10 no studies comparing the structural validity of these questionnaires were found. In recent years, some studies assessed the dimensionality of one or the other instrument by means of factor analysis or Rasch analysis. Results of these studies are contradictory regarding the dimensionality of both questionnaires, with some studies revealing them to be unidimensional68–73 and others not.40,74–76 The results of these studies highlight that it is not clear whether both instruments are unidimensional and that, possibly, their internal structure might vary across different languages and populations. For this reason, it is crucial that future studies on the RMDQ and ODI compare their structural validity in the same population and that they do so before assessing their internal consistency. It also is suggested to further evaluate cross-cultural validity of both questionnaires, as this evaluation will give insight into possible differences in factor structure or differential item functioning across translated versions.
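A first, crude screen for the unidimensionality discussed above is the dominance of the first eigenvalue of the inter-item correlation matrix; a large first-to-second eigenvalue ratio suggests one dominant factor. This is far simpler than the confirmatory factor or Rasch analyses used in the cited studies and is shown only as an illustrative sketch on simulated data (all values invented):

```python
import numpy as np

def first_factor_dominance(item_scores):
    """Ratio of the first to second eigenvalue of the inter-item
    correlation matrix; a large ratio (a common rule of thumb is > 3)
    suggests a single dominant factor."""
    corr = np.corrcoef(np.asarray(item_scores, dtype=float), rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return eigvals[0] / eigvals[1]

# Simulated responses: 200 patients, 6 items all driven by one latent trait
rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))                 # latent "disability" per patient
items = theta + 0.5 * rng.normal(size=(200, 6))   # items share the same trait plus noise
ratio = first_factor_dominance(items)             # clearly > 3 for these data
```

Formal structural validity testing would instead fit and compare explicit measurement models, but a screen like this shows why eigenvalue patterns are often reported alongside factor analyses.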
Studies performing a head-to-head comparison of the content validity of recommended versions of the RMDQ and ODI are needed to evaluate whether the content of 1 of the 2 instruments better represents the most relevant aspects of physical functioning in patients with NSLBP. Content validity refers to the extent to which, in a given measurement application, the most relevant and comprehensive aspects of a construct are adequately reflected in the content of a measurement instrument.19,77 To date, the RMDQ and ODI have been considered and recommended for measuring the same health construct.11–13 However, the results on construct validity of our review clearly suggest that there are discrepancies in their correlations with other instruments and that, possibly, they do not measure exactly the same construct. Two studies comparing the content of RMDQ and ODI showed that some body functions or activity limitations were related to the items of one questionnaire but not the other, and vice versa.61,62 Considering the emerging importance of this measurement property in the selection of instruments and in the assessment of their validity,77–80 it is essential to investigate content validity further for these 2 questionnaires. In general, when making a choice between 2 instruments, content validity should be the first property to be explored to evaluate whether one instrument is a better reflection of the construct measured in the specific target population.
Here, we provide some suggestions for future studies assessing and comparing content, structural, and cross-cultural validity of the RMDQ and ODI in patients with NSLBP. Qualitative research plays a key role in the assessment of content validity of existing instruments.81 Focus groups or cognitive interviews81 could be conducted in patients with NSLBP to assess which of the 2 instruments covers the most important aspects of physical functioning and whether there are additional relevant aspects that are not covered. Previous studies have assessed the content of these instruments by linking it to the International Classification of Functioning (ICF) categories.61,62 However, to date, no studies have attempted to link the content of the RMDQ and ODI to the ICF core set for LBP.82 Focusing only on the ICF categories included in the core set would allow us to better understand whether the content of these instruments reflects and covers the aspects most important to patients with LBP. A recent study that followed this procedure with the ICF core set for rheumatoid arthritis could be used as a valid example for such a study.83 The qualitative assessment of content validity of both instruments should be combined with the quantitative assessment of their structural validity (ie, dimensionality).77 Evidence on the unidimensionality of the RMDQ and ODI should be provided, as their total score is routinely used to assess the effectiveness of health interventions in patients with NSLBP.84
Statistical techniques such as confirmatory factor analysis85 or item response theory (IRT) models86,87 allow us to assess the unidimensionality of a patient-reported instrument. Item response theory models provide some advantages over factor analysis,88,89 as they also permit us to estimate item parameters along a continuum representing different levels of ability on the measured construct and to estimate the measurement precision of an instrument along the same continuum.86,87 Therefore, IRT should be preferred to compare the performance of the RMDQ and ODI in the same group of patients with NSLBP, but authors of future IRT studies should be aware that, when testing both instruments in the same patients, a large sample size is required. For example, if using an IRT 2-parameter logistic model (eg, graded response model), at least 1,000 participants are needed to obtain accurate parameter estimates.87,89
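The 2-parameter logistic machinery referenced above can be sketched for the dichotomous case (the graded response model extends the same idea to polytomous items such as those of the ODI). All parameter values below are invented for illustration; real studies would estimate them from patient data with dedicated IRT software.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2-parameter logistic IRT model: probability of endorsing an item
    given latent trait level theta, discrimination a, and difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: indicates where along the trait
    continuum the item measures most precisely (peaks at theta = b)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

# Hypothetical item with moderate discrimination and average difficulty,
# evaluated across a range of latent disability levels
theta_grid = np.linspace(-3, 3, 7)
probs = p_2pl(theta_grid, a=1.5, b=0.0)          # increases with theta
info = item_information(theta_grid, a=1.5, b=0.0)  # peaks at theta = 0
```

Plotting item information curves for every RMDQ and ODI item in the same sample would show where on the disability continuum each questionnaire measures most precisely, which is exactly the kind of comparison the text above argues for.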
To date, to our knowledge, there are no studies examining whether the factor structure of the RMDQ and ODI is consistent across different countries and languages, and, for this reason, it is of high priority to assess cross-cultural validity, defined as “the degree to which the performance of the items on a translated or culturally adapted patient-reported instrument are an adequate reflection of the performance of the items of the original version of the instrument.”19(p743) It should be highlighted that cross-cultural validity refers not only to the factor structure of a questionnaire but also to other aspects of validity, such as face and content validity. Therefore, it would be important that cross-cultural adaptation processes include an assessment of face and content validity in patients with NSLBP, as these properties can vary for the same instrument in different languages, cultures, and settings. Having empirical evidence supporting the cross-cultural validity of the RMDQ and ODI would allow reviewers to combine studies with more confidence in future systematic reviews. A study on cross-cultural validity of these widely used instruments would require a collaborative effort of the scientific community to join forces and design parallel studies in different countries. Such an effort could be facilitated or embedded within already active collaborations such as the international and multidisciplinary steering committee working on the development of a core outcome set for clinical trials in NSLBP.8,27
It was out of the scope of this review to compare the RMDQ and ODI as measures of disability, but our results give a clear indication on this matter in patients with NSLBP. If researchers or clinicians want to measure a functional domain broader than solely physical functioning, the ODI should be preferred over the RMDQ because: (1) it displays better test-retest reliability and smaller measurement error and (2) its higher cross-sectional correlations with all other instruments would indicate better construct validity. As previously reported, consistently higher correlations with other instruments can be explained by the fact that the instrument measures a broader construct or has smaller measurement error, which would support the preference for the ODI in both cases.
A previous review13 recommended use of the ODI in patients with persistent and severe disability and use of the RMDQ in patients with lower levels of disability. These recommendations were based on a previous study48 showing differences related to floor and ceiling effects, with the greater proportion of patients scoring higher on the RMDQ or lower on the ODI. All studies included in this review reported very similar score levels on the RMDQ and ODI (Tab. 1), making it unfeasible to empirically assess these previous recommendations. However, we attempted to assess whether a difference in some measurement properties could be related to the pain duration (ie, acute versus subacute or chronic). Only 2 studies40,41 in patients with acute NSLBP were included, and they did not show a substantially different trend in results between the 2 questionnaires. Hence, due to the small amount of head-to-head comparisons in acute NSLBP, we are not able to make suggestions regarding one instrument being better (or not) than the other in patients with a different pain duration.
Three considerations can be made on some methodological aspects of this review. First, we chose to focus on RMDQ and ODI versions recommended by their developers13 because they showed superior measurement properties compared with shorter or modified versions in patients with NSLBP.65,68,69,90 Consequently, the results of this review cannot be generalized to all existing versions of these questionnaires. Second, the results of this review are not generalizable to specific LBP populations (eg, patients with spinal stenosis), as we focused on and included only studies conducted in patients with NSLBP; this decision was taken to be consistent with the scope of the core outcome set that has been developed for clinical trials in patients with NSLBP.8,27 Third, a potential limitation of this study is that we combined RMDQ and ODI data from different countries and languages without knowing whether the items of these questionnaires have the same performance in different cultures. Hence, the evaluation of validity of these questionnaires across different languages and countries should have very high priority. This is an important issue not only when using these patient-reported outcome measures in clinical practice but also in systematic reviews in which data from different cultures and languages are routinely combined.
To sum up, this systematic review identified 11 studies of fair or poor methodological quality, performing head-to-head comparisons of 5 measurement properties of the 24-item RMDQ and ODI 2.1a. The ODI showed better reliability and measurement error, whereas the RMDQ showed better construct validity as a measure of physical functioning. In light of these findings, there are no strong reasons to prefer one instrument over the other to measure physical functioning in patients with NSLBP. To further compare the measurement properties of these 2 instruments, studies of higher methodological quality are needed, and priority should be given to studies on the content, structural, and cross-cultural validity of these questionnaires.
Footnotes
Mr Chiarotto, Ms Maxwell, Dr Terwee, Professor Wells, Professor Tugwell, and Professor Ostelo provided concept/idea/research design. Mr Chiarotto, Dr Terwee, and Professor Ostelo provided project management. Mr Chiarotto and Ms Maxwell provided data collection. Mr Chiarotto, Ms Maxwell, Dr Terwee, and Professor Ostelo provided data analysis. Professor Ostelo provided fund procurement. All authors provided writing and consultation (including review and approval of manuscript before submission).
The authors acknowledge the Wetenschappelijk College Fysiotherapie (WCF) of the Royal Dutch Society for Physical Therapy (KNGF) for providing funding for this study.
- Received August 17, 2015.
- Accepted March 31, 2016.
- © 2016 American Physical Therapy Association