Abstract
Background. There is little evidence for the measurement properties of instruments commonly used for women with pelvic girdle pain.
Objective. The aim of this study was to examine the internal consistency, test-retest reliability, and construct validity of instruments used for women with pelvic girdle pain.
Design. This was a cross-sectional methodology study, including test-retest reliability assessment.
Methods. Women with pelvic girdle pain in pregnancy and after delivery participated in a postal survey that included the Pelvic Girdle Questionnaire (PGQ), Oswestry Disability Index (ODI), Disability Rating Index (DRI), Fear-Avoidance Beliefs Questionnaire (FABQ), Pain Catastrophizing Scale (PCS), and 8-item version of the Medical Outcomes Study 36-Item Short-Form Health Survey questionnaire (SF-36). Test-retest reliability was assessed with a random subsample 1 week later. Internal consistency was assessed with the Cronbach alpha, and test-retest reliability was assessed with the intraclass correlation coefficient (ICC) and minimal detectable change (MDC). Construct validity based on hypotheses was assessed by correlation analysis. Discriminant validity was assessed with the area under the receiver operating characteristic curve.
Results. All participants responded to the main (N=87) and test-retest (n=42) surveys. Cronbach alpha values ranged from .88 to .94, and ICCs ranged from .78 to .94. The MDC at the individual level constituted about 7% to 14% of total scores for the 8-item version of the SF-36, ODI, and PGQ activity subscale; about 18% to 22% for the DRI, PGQ symptom subscale, and PCS; and about 25% for the FABQ. Hypotheses were mostly confirmed by correlations between the instruments. The PGQ was the only instrument that significantly discriminated both between participants who were pregnant and participants who were not pregnant and between pain locations.
Limitations. A comparison of responsiveness to change of the various instruments used in this study was not undertaken, but will be carried out in a future study.
Conclusions. Self-report instruments for assessing health showed good internal consistency, test-retest reliability, and construct validity for women with pelvic girdle pain. The PGQ was the only instrument with satisfactory discriminant validity; thus, it is recommended for evaluating symptoms and disability in patients with pelvic girdle pain.
Pregnancy-related back and pelvic girdle pain is a common condition varying from self-limiting symptoms of short duration during pregnancy to great pain and disability both during and after pregnancy.1 To obtain information regarding the impact of pelvic girdle pain on general functioning or treatment effects, clinicians and researchers must rely on patients' self-reports of symptoms and disability. To date, most of the self-report instruments that frequently have been used in clinical studies with women who have pelvic girdle pain were developed or tested for psychometric properties in patients with low back pain. For example, recent studies2–4 included disability instruments such as the Oswestry Disability Index (ODI)5 and the Disability Rating Index (DRI).6
Despite the fact that studies relating to pelvic girdle pain have greatly increased during the last decade, there is little published evidence for the measurement properties of instruments commonly used for patients with pelvic girdle pain. One exception is the new condition-specific instrument developed for pelvic girdle pain, the Pelvic Girdle Questionnaire (PGQ), which assesses activity limitations and symptoms and which was found to have good construct validity as well as reliability when used for patients with pelvic girdle pain.7 However, the measurement properties of this instrument have not been compared with those of other commonly used instruments. It is important to provide evidence for the comparative performance of instruments that have been used to assess various aspects of health status in patients with pelvic girdle pain. This evidence can inform their future selection in research, including evaluations of physical therapy and other treatments for pelvic girdle pain.
The purpose of this study was to examine the internal consistency, test-retest reliability, and validity of a range of instruments used for women with pelvic girdle pain.
Method
Participants and Recruitment Method
Women with pelvic girdle pain in pregnancy and after delivery were consecutively recruited at primary care clinics in Oslo, Norway, by 7 physical therapists who were experienced in the management of pelvic girdle pain. The women were clinically examined and assessed by the physical therapists using recommended inclusion criteria.1,3 Inclusion criteria were as follows: pelvic girdle pain located distal, lateral, or both in relation to the L5–S1 area, in the buttocks, symphysis, or both, with pain onset during pregnancy or within 3 weeks after delivery. Fulfillment of the diagnostic criteria was based on the following tests: Posterior Pelvic Pain Provocation Test, Active Straight Leg Raising Test, pain provocation of the long dorsal sacroiliac ligament, and pain provocation of the symphysis by palpation and by a modified Trendelenburg test. The results of the Posterior Pelvic Pain Provocation Test or the Active Straight Leg Raising Test had to be positive on the right side, left side, or both, and the results of at least 1 of the other 3 tests had to be positive.
The women were invited to participate in a postal survey about pelvic girdle pain and functioning and were asked to complete and return a comprehensive questionnaire using a prepaid envelope after the first clinical consultation. The comprehensive questionnaire included 6 instruments, and the participants were informed that there was some overlap between the questions and that the survey would take about 30 minutes to complete. Test-retest reliability was assessed 1 week later with a random sample of 42 participants, 21 of whom were pregnant; the questionnaire was administered with the same survey method. Written informed consent was obtained from all of the participants.
All women invited to participate in the main survey (N=87) and the test-retest survey (n=42) responded (Tab. 1). The mean age of the participants was 35 years (SD=5.0), and approximately half of the participants were pregnant. There were no statistically significant differences between the total sample and the test-retest subsample in any of the baseline characteristics (Tab. 1).
Table 1. Baseline Characteristics of Participants in Total Sample and in Test-Retest Subsample
Comprehensive Self-Report Questionnaire
The questionnaire included measures of pain and symptoms, activity limitations, disability, psychological constructs, and quality of life. Pain location was assessed with a mannequin on which participants drew the body area in which they had pain, according to categories distinguishing 4 areas around the pelvis: symphysis, left sacroiliac joint, right sacroiliac joint, and over the sacrum.
The PGQ is a condition-specific instrument that assesses activity limitations (activity subscale with 20 items) and symptoms (symptom subscale with 5 items)7 in patients with pelvic girdle pain. Items are scored on a 4-point descriptive scale, and item scores are summed and transformed to yield a score of 0 to 100, where 100 is the worst possible score. The PGQ was developed for patients with pelvic girdle pain in pregnancy as well as postpartum. The ODI was originally developed in a specialist referral clinic for patients with chronic low back pain.5 In version 2.0, patients rate their perceived disability for 10 items (pain intensity, personal hygiene, lifting, walking, sitting, standing, sleeping, sexual activity, social activity, and traveling) on a 6-point scale. Item scores are summed and transformed to yield a score of 0 to 100, where 100 represents the greatest possible disability. The DRI was developed for the assessment of physical disability in patients with chronic pain in the neck, shoulder, and lower back.6 It contains 12 items of daily activities, more demanding daily physical activities, and work-related or more vigorous activities. Patients rate their ability to carry out the activities on a 100-mm visual analog scale (VAS). Scores on the DRI are the mean of the 12 item scores, with 100 representing the greatest possible disability.
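The sum-and-transform scoring described above can be sketched in Python. This is a minimal illustration rather than the official scoring manual of any instrument: the function names are hypothetical, item codes are assumed to start at 0 (0 to 3 for the 4-point PGQ scale, 0 to 5 for the 6-point ODI scale), and every item is assumed to be answered.

```python
def rescale_sum_to_100(item_scores, max_item_score):
    """Sum item scores and rescale linearly to 0-100 (100 = worst).

    Assumes items are coded 0..max_item_score and none are missing.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(s < 0 or s > max_item_score for s in item_scores):
        raise ValueError("item score outside the allowed range")
    max_total = max_item_score * len(item_scores)
    return 100.0 * sum(item_scores) / max_total


def dri_score(vas_scores_mm):
    """DRI: mean of the 12 item scores, each a 0-100 mm VAS reading."""
    if len(vas_scores_mm) != 12:
        raise ValueError("the DRI has 12 items")
    return sum(vas_scores_mm) / len(vas_scores_mm)


# Hypothetical respondent: 10 ODI items coded 0-5 (sum = 20 of a possible 50).
odi = rescale_sum_to_100([2, 1, 3, 2, 2, 3, 1, 2, 1, 3], max_item_score=5)
print(odi)  # 40.0
```

The same helper would apply to the PGQ subscales with `max_item_score=3` under the assumed 0 to 3 coding.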
Two instruments assessing psychological constructs were included: the physical activity subscale of the Fear-Avoidance Beliefs Questionnaire (FABQ activity subscale)8 and the Pain Catastrophizing Scale (PCS).9 The FABQ activity subscale was developed for patients with chronic low back pain and consists of 5 items scored from 0 to 6. Scores for items 2, 3, 4, and 5 are summed to yield a score ranging from 0 to 24, the latter representing the highest level of fear-avoidance beliefs. The PCS was developed for patients with chronic pain and contains 13 items scored on a 5-point descriptive scale. Scores on the PCS are the mean of the 13 item scores, ranging from 0 to 4, the latter representing the highest level of pain catastrophizing.
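As a brief sketch of the two scoring rules just described, the following Python functions follow the text literally: the FABQ activity score sums items 2 to 5 and leaves out item 1, and the PCS score is the mean of 13 items. The function names are hypothetical, and PCS items are assumed to be coded 0 to 4.

```python
def fabq_activity_score(items):
    """FABQ activity subscale: 5 items scored 0-6; items 2-5 are summed (0-24)."""
    if len(items) != 5:
        raise ValueError("expected 5 item scores")
    if any(not 0 <= s <= 6 for s in items):
        raise ValueError("item scores must be between 0 and 6")
    return sum(items[1:])  # item 1 is answered but not included in the score


def pcs_score(items):
    """PCS as described in the text: mean of 13 items, assumed coded 0-4."""
    if len(items) != 13:
        raise ValueError("expected 13 item scores")
    if any(not 0 <= s <= 4 for s in items):
        raise ValueError("item scores must be between 0 and 4")
    return sum(items) / len(items)


# Hypothetical responses.
print(fabq_activity_score([6, 3, 4, 2, 5]))  # 14
print(pcs_score([2] * 13))                   # 2.0
```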
General health was assessed with the 8-item version of the Medical Outcomes Study 36-Item Short-Form Health Survey questionnaire (8-item SF-36).10 Each item represents one dimension of the SF-36: general health (SF1), physical functioning (SF2), role–physical (role of physical health problems in work or other daily activities) (SF3), bodily pain (SF4), vitality (SF5), social functioning (SF6), mental health (SF7), and role–emotional (role of emotional problems in work or other daily activities) (SF8). Each item is weighted with norm-based scoring, and scores above and below 50 are considered above and below the average in the general US population, respectively.11
The comprehensive questionnaire also included questions about age, education level, number of children, and work status. Clinical information concerned pain localization in the pelvic region, pain duration (months), and pain pattern (pain-free periods).
Data Analysis
Data quality.
The amounts of missing data at the item and scale levels of the instruments were compared. Floor or ceiling effects were considered to be present if more than 15% of the participants reported the lowest or the highest possible score, respectively.12 The numbers of participants with the lowest and highest possible scores for each of the items also were compared.
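The 15% criterion can be expressed as a short check. The sketch below is illustrative only: the function name and the returned keys are hypothetical, and missing responses (represented here as `None`) are assumed to be excluded from the denominator.

```python
def end_effects(scores, lowest, highest, threshold=0.15):
    """Flag floor and ceiling effects using the >15% criterion.

    scores: total (or item) scores; None marks a missing response.
    """
    observed = [s for s in scores if s is not None]
    if not observed:
        raise ValueError("no observed scores")
    n = len(observed)
    floor_share = sum(1 for s in observed if s == lowest) / n
    ceiling_share = sum(1 for s in observed if s == highest) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical data: 4 of 20 respondents (20%) report the lowest possible score.
result = end_effects([0] * 4 + [50] * 16, lowest=0, highest=100)
print(result["floor_effect"], result["ceiling_effect"])  # True False
```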
Internal consistency.
Internal consistency was assessed with the Cronbach alpha. Cronbach alpha values of less than .7, .7 to .8, and more than .8 were considered to indicate low, moderate, and good internal consistency, respectively.13
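For readers who want to see the calculation, the Cronbach alpha can be computed from the item variances and the variance of the summed score. The sketch below uses only the Python standard library and assumes complete cases (one list of item scores per participant); the function names are hypothetical, and the study itself used SPSS.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach alpha for a list of participants, each a list of k item scores."""
    if len(rows) < 2:
        raise ValueError("need at least 2 participants")
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every participant needs the same k >= 2 items")
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def classify_alpha(alpha):
    """Apply the thresholds stated in the text: <.7 low, .7-.8 moderate, >.8 good."""
    if alpha < 0.7:
        return "low"
    if alpha <= 0.8:
        return "moderate"
    return "good"


# Hypothetical 3-item scale on which all items move together perfectly.
alpha = cronbach_alpha([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(round(alpha, 3), classify_alpha(alpha))  # 1.0 good
```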
Test-retest reliability.
Test-retest reliability was assessed with a 2-way random-effects model and expressed as the intraclass correlation coefficient.14 An intraclass correlation coefficient of .80 or more was considered to indicate good test-retest reliability. Test-retest agreement was expressed as the minimal detectable change (MDC) at the individual level (MDCind) and at the group level (MDCgroup).12 The MDCind was assessed with the standard error of measurement (SEM), which was calculated as the square root of the sum of the between-measures variance and the residual variance obtained from an analysis of variance. The MDCind was calculated as $\mathrm{MDC}_{\mathrm{ind}} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$, and the MDCgroup was calculated as $\mathrm{MDC}_{\mathrm{group}} = \mathrm{MDC}_{\mathrm{ind}} / \sqrt{n}$, where $n$ is the number of participants in the test-retest subsample.
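The reliability and agreement statistics can be sketched from a two-way analysis of variance on paired test and retest scores. Several choices here are assumptions rather than details confirmed by the text: the ICC form (absolute agreement, single measures, often written ICC(2,1)), the conventional multipliers $1.96 \times \sqrt{2}$ for the MDCind and division by $\sqrt{n}$ for the MDCgroup, and the function name `retest_agreement`.

```python
from math import sqrt


def retest_agreement(test, retest):
    """Two-way ANOVA on paired scores; returns ICC, SEM, MDC_ind, and MDC_group."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need at least 2 paired observations")
    n, k = len(test), 2
    rows = list(zip(test, retest))
    grand = (sum(test) + sum(retest)) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(test) / n, sum(retest) / n]

    # Sums of squares for participants, measurement occasions, and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum(
        (x - row_means[i] - col_means[j] + grand) ** 2
        for i, row in enumerate(rows)
        for j, x in enumerate(row)
    )
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("scores have no variance")
    icc = (ms_rows - ms_err) / denom

    # SEM: square root of between-measures variance plus residual variance.
    var_measures = max((ms_cols - ms_err) / n, 0.0)
    sem = sqrt(var_measures + ms_err)
    mdc_ind = 1.96 * sqrt(2) * sem   # assumed conventional formula
    mdc_group = mdc_ind / sqrt(n)    # assumed conventional formula
    return {"icc": icc, "sem": sem, "mdc_ind": mdc_ind, "mdc_group": mdc_group}


# Hypothetical data: every participant scores 2 points higher at retest.
stats = retest_agreement([10, 20, 30, 40], [12, 22, 32, 42])
print(round(stats["mdc_ind"], 2), round(stats["mdc_group"], 2))  # 3.92 1.96
```

Dividing `mdc_ind` by the scale range would give the percentage of the total score reported in the Results.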
Construct validity.
To assess the convergent and divergent construct validity of the instruments, we formulated and tested several hypotheses. First, we expected high correlations between instruments that assessed activity limitations (PGQ activity subscale, DRI, and ODI) and an instrument that assessed physical functioning (SF2). We also expected these instruments to correlate relatively highly with the PGQ symptom subscale. Second, low to moderate correlations were expected between instruments that assessed activity limitations or physical functioning (PGQ activity subscale, DRI, ODI, and SF2) and instruments that assessed other aspects of physical health (SF1, SF3, SF4, and SF6). Third, we also expected low to moderate correlations between instruments that assessed activity limitations and an instrument that assessed fear of activities (FABQ activity subscale). Fourth, we expected low correlations between instruments that assessed activity limitations or physical functioning (PGQ activity subscale, DRI, ODI, and SF2) and instruments that assessed aspects of mental health (PCS, SF5, SF7, and SF8). Fifth, we expected low to moderate correlations between the PGQ symptom subscale and instruments that assessed other aspects of physical health (SF1, SF3, SF4, and SF6). Sixth, we expected low correlations between an instrument that assessed physical pelvic symptoms (PGQ symptom subscale) and instruments that assessed aspects of mental health (PCS, SF5, SF7, and SF8). Finally, we expected moderate to high correlations between instruments that assessed aspects of mental health (PCS, SF5, SF7, and SF8) and between an instrument that assessed fear of activities and instruments that assessed aspects of mental health.
To test the hypotheses, we computed Spearman rank correlation coefficients. Values of less than .3, between .3 and .6, and of greater than .6 were considered to indicate low, moderate, and high correlations, respectively.13
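A minimal sketch of the Spearman rank correlation follows, written with the standard library only. It ranks each variable (ties receive the average rank) and then computes the Pearson correlation of the ranks. The function names are hypothetical, and classifying by the absolute value of the coefficient is an assumption made here so that inverse relationships are sized the same way as direct ones.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 values")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("a variable with no variation has no rank correlation")
    return cov / (sx * sy)


def classify_correlation(rho):
    """Thresholds from the text, applied to |rho|: <.3 low, .3-.6 moderate, >.6 high."""
    size = abs(rho)
    if size < 0.3:
        return "low"
    if size <= 0.6:
        return "moderate"
    return "high"


print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```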
Discriminant validity.
To assess discriminant validity, we hypothesized that, compared with participants who were postpartum, participants who were pregnant would have poorer scores on the self-report questionnaires. We also expected the questionnaires to discriminate between participants with pain located at all 3 pelvic girdle joints (symphysis and sacroiliac joints) and participants with pain located at 1 or 2 joints.15 We used receiver operating characteristic curve analysis to assess the discriminative ability of the instruments. We computed the area under the receiver operating characteristic curve, which reflects the accuracy of an instrument for differentiating between 2 subgroups (eg, diagnostic group or change versus no change). The area under the receiver operating characteristic curve may range from .50 (no discriminative ability) to 1.0 (perfect discriminative ability).14
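The area under the receiver operating characteristic curve has a simple interpretation that can be computed directly: it is the proportion of all pairs, one participant from each subgroup, in which the participant from the first subgroup has the higher score, with ties counted as one half. The sketch below uses that pairwise definition; the function name and the scores are hypothetical, and higher scores are assumed to indicate poorer status.

```python
def roc_auc(scores_group_a, scores_group_b):
    """AUC as the share of (a, b) pairs in which a scores higher; ties count 0.5."""
    if not scores_group_a or not scores_group_b:
        raise ValueError("both subgroups need at least one score")
    wins = 0.0
    for a in scores_group_a:
        for b in scores_group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(scores_group_a) * len(scores_group_b))


# Hypothetical 0-100 scores for two subgroups.
pregnant = [60, 55, 70, 45]
postpartum = [40, 50, 35, 58]
print(roc_auc(pregnant, postpartum))  # 0.8125
```

A value of .50 means the scores do not separate the subgroups at all, and 1.0 means every participant in the first subgroup scores higher than every participant in the second.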
All data analyses were performed with SPSS version 18.0 (SPSS Inc, Chicago, Illinois). Because we were evaluating multiple tests, we set the significance level at 1%.
Role of the Funding Source
Grant support was provided by The Norwegian Fund for Postgraduate Education in Physiotherapy.
Results
Data Quality
There were very few missing data for the self-report questionnaires (Tab. 2). There were 3 missing values for the item “push a shopping cart” and 1 missing value for the item “roll over in bed” for the PGQ activity subscale, and there was 1 missing value for the item “is your sleep interrupted” for the PGQ symptom subscale. There were no missing data for the DRI, ODI, and 8-item SF-36. One participant did not fill in any of the items for the FABQ activity subscale and the PCS.
Table 2. Missing Data, End Effects, and Internal Consistency for the Instruments
None of the instruments showed floor or ceiling effects in their total scores. Table 2 shows that for the PGQ activity subscale and PGQ symptom subscale, the range for the lowest possible scores was 0% to 44.8% and the range for the highest possible scores was 0% to 78.2%. For the DRI, which was assessed on a VAS with scores of 0 to 100, the ranges for the lowest and highest possible scores were 0% to 4.6% and 0% to 10.3%, respectively. For the ODI, the range for the lowest possible scores was 1.1% to 57.5%, and there were few scores at the top end of the scale. For the FABQ activity subscale, the ranges for the lowest and highest possible scores were 2.3% to 17.2% and 4.6% to 39.1%, respectively. For the PCS, the range for the lowest possible scores was 14.9% to 71.3%, and there were few scores at the top end of the scale. For the 8-item SF-36, the ranges for the lowest and highest possible scores were 0% to 25.3% and 0% to 14.9%, respectively.
Item-total correlations were acceptable for most of the PGQ items, but 6 items in the activity subscale and 2 items in the symptom subscale had item-total correlations below .4 (Tab. 2). The DRI, ODI, and FABQ activity subscale each had 1 or 2 items with item-total correlations below .4. The Cronbach alpha values were acceptable for the majority of the scales, but the value for the PGQ symptom subscale was just below the .7 criterion, and that for the FABQ activity subscale was low, at .6.
Table 3 shows the test-retest reliability and measurement errors for the included instruments. Intraclass correlation coefficients were acceptable and varied from .78 to .94. The MDCind was low for the 8-item SF-36, ODI, and PGQ activity subscale, representing about 7% to 14% of the total scores. The MDCind for the DRI, PGQ symptom subscale, and PCS constituted about 18% to 22% of the total scores. A higher measurement error, about 25% of the total score, was found for the FABQ activity subscale. Test-retest differences were independent of the magnitude of the mean scores, and there were no systematic differences between the test and retest scores.
Table 3. Test and Retest Scores, Measurement Error, and Intraclass Correlation Coefficient (ICC)
The hypotheses were mostly confirmed by correlations between the instruments (Tabs. 4 and 5). For example, for the most part we found high correlations between activity limitations assessed with the PGQ activity subscale, DRI, and ODI and physical functioning assessed with the SF2; in addition, we found low correlations between these 4 instruments and aspects of mental health assessed with the PCS, SF5, SF7, and SF8. The final hypothesis, regarding expected moderate to high correlations between instruments that assessed aspects of mental health and between an instrument that assessed fear of activities and instruments that assessed aspects of mental health, was not completely confirmed. In particular, correlations between the FABQ activity subscale and instruments that assessed aspects of mental health were lower than expected.
Table 4. Correlations Between Instruments
Table 5. Construct Validity: Seven A Priori Formulated Hypotheses and Correlation Values
The discriminant validity of the instruments is shown in Table 6. The PGQ symptom subscale and PGQ total were the only instruments that significantly (P<.01) discriminated both participants who were pregnant from participants who were not pregnant and participants with pain located at all 3 pelvic girdle joints from participants with pain located at 1 or 2 joints. The FABQ activity subscale showed good discriminative ability for participants who were pregnant and those who were postpartum but not for pain localization. The PGQ activity subscale, DRI, and ODI discriminated between pain locations but not between participants who were pregnant and participants who were not pregnant. The PCS and 8-item SF-36 discriminated neither between participants who were pregnant and participants who were not pregnant nor between pain locations.
Table 6. Comparison of Instrument Scores According to Pregnancy and Pain Localization by Receiver Operating Characteristic Curve Analysis
Discussion
We found good internal consistency, test-retest reliability, and construct validity for a range of self-report instruments used for women with pelvic girdle pain. However, the PGQ, including both symptom subscale scores and total scores, was the only instrument that significantly discriminated both between participants who were pregnant and participants who were postpartum and between pain locations.
There are several advantages of the present study. First, we were able to concurrently assess the reliability and validity of several of the instruments that are most commonly used for patients with pelvic girdle pain, including the ODI and DRI. We also compared these instruments with a new condition-specific instrument for patients with pelvic girdle pain, the PGQ. Second, we included instruments that assess psychological constructs; like instruments for back-related pain conditions, these instruments may be relevant for patients with pelvic girdle pain. Third, by including both participants who were pregnant and participants who were postpartum, we were able to investigate the ability of the instruments to distinguish between these 2 conditions, which is important from a clinical perspective. Because we included both participants who were pregnant and participants who were postpartum as well as different severities and durations of pelvic girdle pain, we consider the findings of the present study to be highly representative for patients across different clinical settings, such as primary care and secondary care. Finally, there were very few missing data for all of the self-report instruments, despite their similarities. The fact that we did not assess the responsiveness to change of the various instruments used in the present study might be considered a study limitation. Responsiveness to change is an important criterion in choosing instruments for clinical trials as well as in clinical practice. The concurrent evaluation of instrument responsiveness to change will be undertaken in a future study.
Measurement errors reported as MDCind estimates were low for the 8-item SF-36, ODI, and PGQ activity subscale, at about 7% to 14% of the total scores, but also were acceptable for the other instruments, at about 18% to 22% of the total scores. These estimates are similar to previously reported measurement errors (10% to 35%) for self-report instruments.16,17 We found a higher measurement error—about 25%—for the FABQ activity subscale, similar to findings previously reported for patients with low back pain.18 Estimates of MDCind are useful in deciding whether a patient has shown improvement (or deterioration) greater than measurement error. The MDCgroup is relevant when instruments are used for research purposes, when the intention is to assess differences between groups.12
We used a 1-week interval to investigate test-retest reliability on the assumption that this interval would be short enough for the participants' status to remain stable and long enough to ensure that they would not recall their first responses. The fact that participants completed several instruments also might have reduced the possibility that they would remember their previous responses. Furthermore, we considered the participants in the retest subsample to be representative of the population studied because they were similar to the total cohort with regard to all baseline characteristics.
Although there were no floor or ceiling effects, according to the definitions of Terwee et al,12 substantial proportions of participants had the lowest and highest possible scores for several of the items in the PGQ. This result can be interpreted as an indication that the PGQ items adequately covered the constructs that they were designed to assess. The results of the Rasch analysis in the development of the PGQ also showed that the construct hierarchy was well covered by the items in the instrument.7 The moderate to low scores for the ODI in the present study may indicate that the ODI is not as sensitive to activity limitations in women with pelvic girdle pain as the PGQ. The ability of an instrument to cover the whole difficulty level of the construct (here, the level of activity limitations) is important for responsiveness to change.19 Instruments that produce scores at the floor and ceiling are limited in their ability to measure changes in health.
Moreover, there were relatively large differences between the instruments in the percentages of participants with the lowest and highest possible scores for similar items, such as “dressing,” “climbing stairs,” and “exercise/sports.” Some of these differences could be explained by the different scaling methods used. For example, a 4-point scale was used in the PGQ, whereas a VAS with scores of 0 to 100 was used in the DRI. Not many participants with pain and symptoms score 0 on a VAS; hence, it could be argued that a low score on this scale could be classified as less than 10 to make interpretation of the results more comparable. The percentages of participants with scores of less than 10 on the VAS (with scores of 0–100) for the DRI items “dressing,” “climbing stairs,” and “exercise/sports” were 35.6%, 12.6%, and 3.4%, respectively. These data confirm that the participants scored consistently across the PGQ and DRI instruments and that the differences in percentages must be interpreted in light of the different scaling methods used.
Most of the hypotheses regarding convergent and divergent validity were supported, with the exception that correlations between instruments that assessed aspects of mental health, in particular, between the FABQ activity subscale and PCS, were lower than expected. A potential explanation is that participants with pelvic girdle pain scored significantly higher on the FABQ activity subscale than on the PCS (as shown by the rescaled scores [0–100] in Tab. 3) and, thus, that the FABQ activity subscale captures a more relevant aspect of pelvic girdle pain than the PCS. The mean score on the FABQ activity subscale in the present study was slightly higher than that in another Norwegian study, which included women in early pregnancy.20 In addition, the correlation between the FABQ activity subscale and instruments that assessed activity limitations in the present study was higher (r=.33) than that in the other Norwegian study (r=.18).20
Moreover, the 2 hypotheses for discriminant validity were consistently confirmed only for the PGQ. The other instruments for assessing activity limitations showed satisfactory discriminative ability for pain localization but not for pregnancy (ie, whether participants were pregnant or not). We consider it important for a pelvic-specific instrument to discriminate between different levels of severity of pelvic girdle pain, such as being pregnant or not as well as number of pain localizations. In a recent Norwegian study, pain locations in the pelvis in early pregnancy were found to be significantly associated with disability and pain intensity in late pregnancy.4 The FABQ activity subscale had surprisingly good discriminative ability for participants who were pregnant and those who were not pregnant, indicating that women who are pregnant may be more careful about physical activities than women who are postpartum. Being pregnant seems to lead to reduced physical activity in terms of both regular exercise21,22 and physical activities such as household and family care.21
The Norwegian versions of the PGQ, DRI, ODI, 8-item SF-36, FABQ activity subscale, and PCS had good internal consistency, test-retest reliability, and construct validity when used with a sample of participants who had pelvic girdle pain. The PGQ was the only instrument with satisfactory discriminant validity. The PGQ can be recommended for evaluating symptoms and disability in patients with pelvic girdle pain for both clinical and research purposes. Responsiveness to change should be concurrently assessed for these instruments.
Footnotes
- Dr Grotle and Dr Stuge provided concept/idea/research design. Dr Stuge, Ms Krogstad Jenssen, and Dr Grotle provided data collection. Dr Grotle, Dr Stuge, and Dr Garratt provided data analysis and writing.
- The study protocol was approved by the Regional Committee for Medical Research Ethics in Norway.
- The authors thank The Norwegian Fund for Postgraduate Education in Physiotherapy for grant support.
- Received March 9, 2011.
- Accepted July 11, 2011.
- © 2012 American Physical Therapy Association