Abstract
Background. Variations in paraspinal muscle cross-sectional area (CSA) and composition, particularly of the multifidus muscle, have been of interest with respect to risk of, and recovery from, low back pain problems. Several investigators have reported on the reliability of such muscle measurements using various protocols and image analysis programs. However, there is no standard protocol for tissue segmentation, nor has there been an investigation of reliability or agreement of measurements using different software.
Objective. The purpose of this study was to provide a detailed muscle measurement protocol and determine the reliability and agreement of associated paraspinal muscle composition measurements obtained with 2 commonly used image analysis programs: OsiriX and ImageJ.
Design. This was a measurement reliability study.
Methods. Lumbar magnetic resonance images of 30 individuals were randomly selected from a cohort of patients with various low back conditions. Muscle CSA and composition measurements were acquired from axial T2-weighted magnetic resonance images of the multifidus muscle, the erector spinae muscle, and the 2 muscles combined at L4–L5 and S1 for each participant. All measurements were obtained twice with each software program, at least 5 days apart. The assessor was blinded to all earlier measurements.
Results. The intrarater reliability and standard error of measurement (SEM) were comparable for most measurements obtained using OsiriX or ImageJ, with reliability coefficients (intraclass correlation coefficients [ICCs]) varying between .77 and .99 for OsiriX and between .78 and .99 for ImageJ. There was similarly excellent agreement between muscle composition measurements using the 2 software applications (inter-software ICCs=.81–.99).
Limitations. The high degree of inter-software measurement reliability may not generalize to protocols using other commercial or custom-made software.
Conclusion. The proposed method to investigate paraspinal muscle CSA, composition, and side-to-side asymmetry was highly reliable, with excellent agreement between the 2 software programs.
Cross-sectional area (CSA) asymmetries of lumbar paraspinal muscles,1–7 as well as fat infiltration,8,9 have been associated with low back pain (LBP) and related pathologies using various imaging techniques. As a result, the measurement of paraspinal muscle asymmetry or composition has been emphasized in a number of studies related to the etiology and prognosis of LBP.1–15 There are inconsistencies, however, in study findings of the association between painful spinal conditions and paraspinal muscle morphology. For example, Ploumis et al6 used a manual segmenting technique to measure paraspinal muscle functional CSA (FCSA), defined as fat-free muscle mass, in a group of 40 patients with monosegmental disk disease and unilateral LBP, with or without radicular symptoms, and reported significant multifidus muscle atrophy on the symptomatic side. Yet, in another magnetic resonance imaging (MRI) study, Hyun et al10 found no significant asymmetry between involved and uninvolved sides in a group of 39 patients with disk herniation, with or without radiculopathy. They also measured multifidus muscle FCSA, but used a technique to determine the proportion of muscle versus fat tissue based on a signal intensity threshold.
Similarly, 2 studies that quantitatively compared the degree of paraspinal muscle fatty infiltration in patients with chronic LBP with that in a control group of individuals who were healthy showed conflicting results.1,2 Different threshold techniques and measurement protocols were used to measure the proportion of muscle fatty infiltration, which may have contributed to the discrepant findings, but the effect of such differences on measurement is not known.
Variations in imaging modalities (MRI, computed tomography scan, and ultrasound), image analysis programs, and measurement protocols may contribute to conflicting results. Currently, several methods are used to investigate paraspinal muscle morphology, and too little attention has been given to whether they lead to roughly equivalent measurements. Some investigators have focused on total CSA,3,4,7,12–14 whereas others contend that FCSA is a better indicator of muscle atrophy and contractility.16 Functional CSA is calculated by using either a manual technique or a signal intensity threshold technique with the aid of computer software. The reliability of measurements of FCSA using the 2 different approaches has been investigated in several studies.1,15–19 However, investigators interested in segmenting paraspinal muscles or fat tissues currently use a variety of computer software, including in-house custom-made software,1,18 software that is part of an MRI scanner,20 picture archiving and communications systems workstations,17,19 commercial software,10 computer-aided drafting (AutoCAD) software,3,21 and freeware.15,16,22 Moreover, the use of proprietary software and insufficient descriptions of measurement protocols hinder replication of results by others, and the comparability of measurements obtained using different software and measurement protocols has been neglected.
Although the measurement error related to the measurement methods used appears to be mostly associated with the observer,23 the software used also might lead to measurement differences, and there is a need to determine whether direct comparisons can be made among different software packages (using comparable methods). There currently is no standard protocol, and we found no investigations of reliability or agreement among measurements obtained with different software or protocols.
To clarify the measurement error related to the use of 2 widely available, free image analysis programs and their associated measurement techniques, the purpose of the present study was to determine the reliability, agreement, and standard error of measurement (SEM) of paraspinal muscle CSA and composition measurements obtained using 2 open source software programs: ImageJ and OsiriX. In addition, the associated image analysis protocol is proposed for standardized use to facilitate comparisons among studies.
Materials and Method
Measurement Study Design
Total CSA and FCSA measurements of the multifidus muscle, the erector spinae muscle, and the 2 muscles combined, bilaterally, were directly obtained for each participant using 2 open source software packages. ImageJ (version 1.43, National Institutes of Health, Bethesda, Maryland) is a free, downloadable, public domain image processing software program (http://rsbweb.nih.gov/ij/download.html) that was developed by the National Institutes of Health. The 32-bit OsiriX software (version 3.8.1, Pixmeo, Geneva, Switzerland) was downloaded from http://www.osirix-viewer.com/ and was previously assessed as a more user-friendly image analysis software package than ImageJ for the Apple Mac OS (Apple Inc, Cupertino, California).24 One of the OsiriX program's main advantages is its integrated picture archiving and communication system (PACS), which allows patient data to be stored automatically.24 Both software packages have been utilized by clinicians and scientists in a wide variety of studies as functional tools for image analysis.24–26
To determine intrarater and inter-software measurement reliability, each muscle measurement was acquired 4 times by the same rater, twice using each software program. In an effort to minimize bias from carryover or practice effects, the first complete set of measurements using each software program was obtained by alternating between programs after every block of 10 participants' images, randomly selected and ordered. After all magnetic resonance images were assessed once using either ImageJ or OsiriX, the images were reordered and blinded to be similarly assessed again a minimum of 5 days after the first measurements were completed.
Sample of Lumbar MRI
A sample of 30 patients (11 female and 19 male) was randomly selected from an ongoing study of patients attending spine specialty clinics and having commonly diagnosed lumbar pathologies, including disk herniation, spinal stenosis, spondylolisthesis, and nonspecific chronic LBP. Patients were excluded if they were younger than 18 or older than 60 years of age; had a contrast agent allergy, reduced renal function, a tumor, infection, spinal fracture, or rheumatoid arthritis; were not able to undergo MRI acquisition; or were pregnant.
The MRI protocol included routine T2-weighted turbo spin echo sequences for both axial and sagittal images acquired with a Siemens Avanto 1.5T MRI system (Siemens AG, Erlangen, Germany) (axial T2 parameters included repetition time=4,000 milliseconds, echo time=113 milliseconds, and slice thickness=3 mm).
Muscle Measurements
All muscle measurements were acquired by one of the investigators (M.F.) who, in preparation for the measurements, received training in spine MRI assessments focusing on lumbar intervertebral disk and paraspinal muscle morphology. For practice purposes, a sample of about 15 images was analyzed with each software application prior to the beginning of the measurement study.
Quantitative measurements of the multifidus and erector spinae muscles individually and as a group (multifidus and erector spinae muscles together) were obtained from the T2-weighted axial images using ImageJ and OsiriX. ImageJ has already been used in previous studies to measure total CSA and FCSA using a threshold method, with previously reported intraclass correlation coefficients for intrarater reliability of both area measurements ranging from .89 to .99.15,16 We are not aware of any reports of reliability of paraspinal muscle morphology measurements using OsiriX. The same MRI slices were used for the ImageJ and OsiriX muscle measurements. Because the reliability of FCSA and total CSA measurements has been shown to be relatively equivalent across spinal levels,16 measurements for this study were taken only at mid-disk for L4–L5 and mid-S1 for every participant. The 2 levels were selected because most lumbar pathologies and muscle morphological changes occur between L4–L5 and L5–S1.27
The paraspinal muscle measurements of interest in this study for the multifidus and erector spinae muscles and the 2 muscles as a group included the following: total CSA, FCSA, ratio of FCSA to total CSA, side-to-side differences (muscle asymmetry) in total CSA and FCSA, and mean signal intensity of total CSA.
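The derived measures in this list follow directly from the left and right area measurements. A minimal sketch of the arithmetic is given here, assuming that a side-to-side difference is taken as the absolute difference in area; the function name, dictionary keys, and input values are hypothetical and are not taken from the study data.

```python
def derived_measures(csa_left, fcsa_left, csa_right, fcsa_right):
    """Ratio and side-to-side difference measures for one muscle at one level.

    All areas are in cm^2. Taking the absolute left-right difference is an
    assumption of this sketch, not a definition quoted from the protocol.
    """
    return {
        "fcsa_csa_ratio_left": fcsa_left / csa_left,
        "fcsa_csa_ratio_right": fcsa_right / csa_right,
        "csa_diff": abs(csa_left - csa_right),
        "fcsa_diff": abs(fcsa_left - fcsa_right),
    }


# Hypothetical areas (cm^2) for a single muscle, for illustration only.
example = derived_measures(csa_left=8.0, fcsa_left=6.0,
                           csa_right=7.5, fcsa_right=5.4)
```

A lower FCSA/CSA ratio indicates that a smaller share of the outlined muscle area falls within the lean muscle signal range.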
The FCSA measurement was obtained by selecting a threshold signal within the total muscle CSA to include only pixels within the lean muscle tissue range (Fig. 1A). The gray scale range for lean muscle tissue was established for every participant, on each scan slice. Four to 6 sample regions of interest (ROIs) within the bilateral paraspinal muscle group (multifidus and erector spinae) were taken from areas of lean muscle tissue visible on each slice (Fig. 1B). If atrophied paraspinal muscle with significant fatty infiltration was encountered, care was taken to avoid the inclusion of any visible pixel of fat. The maximum value acquired from the sample ROIs was used as the upper threshold to distinguish muscle tissue from fat. In the same way, the lower limit was determined by the minimum signal intensity value obtained from the sample ROIs. However, because we observed that the lower limit was typically 0 or 1, it might be best to standardize the lower limit at 0. This standardization could potentially decrease related measurement error and simplify the protocol. When timing a sample of measurements obtained with each software program, the average time taken to complete the measurements of the 3 muscle regions bilaterally at one spinal level was approximately 9 minutes with OsiriX and 5 minutes with ImageJ.
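The threshold step described above can be illustrated with a short sketch. It assumes the image is a grid of signal intensities, the traced muscle outline is available as a 0/1 mask, and each pixel covers a known area; the function names, the toy image, and the pixel area are hypothetical and do not reproduce either software program's implementation.

```python
def threshold_bounds(sample_rois):
    """Lower and upper intensity limits taken from sample ROIs of lean muscle."""
    values = [value for roi in sample_rois for value in roi]
    return min(values), max(values)


def total_csa(muscle_mask, pixel_area_cm2):
    """Area (cm^2) of all pixels inside the traced muscle outline."""
    n_pixels = sum(1 for row in muscle_mask for inside in row if inside)
    return n_pixels * pixel_area_cm2


def functional_csa(image, muscle_mask, lower, upper, pixel_area_cm2):
    """Area (cm^2) of outlined pixels whose intensity lies in [lower, upper]."""
    n_pixels = 0
    for image_row, mask_row in zip(image, muscle_mask):
        for intensity, inside in zip(image_row, mask_row):
            if inside and lower <= intensity <= upper:
                n_pixels += 1
    return n_pixels * pixel_area_cm2


# Toy 3x3 slice: low values stand for lean muscle, high values for fat.
image = [[10, 20, 200],
         [15, 180, 25],
         [300, 30, 12]]
muscle_mask = [[1, 1, 1],
               [1, 1, 1],
               [0, 1, 1]]
sample_rois = [[10, 15, 20], [12, 30]]  # intensities sampled from lean muscle
pixel_area_cm2 = 0.01                   # hypothetical pixel area

lower, upper = threshold_bounds(sample_rois)
csa = total_csa(muscle_mask, pixel_area_cm2)
fcsa = functional_csa(image, muscle_mask, lower, upper, pixel_area_cm2)
```

In this toy slice, 8 pixels lie inside the outline and 6 of them fall within the sampled lean muscle range, so the FCSA is three quarters of the total CSA.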
Figure 1. (A) Measurement of total cross-sectional area of erector spinae and multifidus muscles (right) at L4–L5. Lean muscle functional cross-sectional area (FCSA) of the paraspinal muscle group using a threshold method is represented by the area highlighted in green (left). (B) Sample selection of regions of interest to define upper and lower signal intensity threshold limits.
Data Analysis
The statistical analysis was performed using Statistical Package for the Social Sciences version 18.0 (SPSS Inc, Chicago, Illinois). Means and standard deviations for each variable were obtained. The ICC (2,1) was calculated to determine the intrarater reliability of measurements using OsiriX and ImageJ for each measurement variable and every muscle of interest using a 2-way random-effects model and absolute agreement. The ICC reflects both the degree of correlation and agreement between the ratings and was interpreted using the following criteria, as suggested by Portney and Watkins28: .00–.49=poor, .50–.74=moderate, and .75–1.00=excellent. The SEM was calculated to provide an estimate of the expected error related to a particular measurement.28 The ICC defines the ability to discriminate among individuals, whereas the SEM defines the measurement error in the same units as the initial measurement.29 Method agreement between the measurements acquired from the different software programs also was evaluated using the 95% limits of agreement as suggested by Bland and Altman.30–32 Reliability results were analyzed and reported according to spinal level, muscle investigated, and muscle side.
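For readers without access to SPSS, the two reliability indexes can be sketched as follows. The ICC function implements the standard 2-way random-effects, absolute-agreement, single-measurement form, ICC (2,1), from the analysis-of-variance mean squares. The SEM function assumes the common definition SEM = SD × √(1 − ICC); the study does not state which SEM formula was applied, so that choice, like the function names, is an assumption of this sketch.

```python
import math


def icc_2_1(ratings):
    """ICC (2,1): 2-way random effects, absolute agreement, single measurement.

    ratings holds one row per participant and one column per repeated
    measurement (for example, session 1 and session 2).
    """
    n = len(ratings)        # participants
    k = len(ratings[0])     # measurements per participant
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def standard_error_of_measurement(sd, icc):
    """SEM in the units of the measurement, assuming SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)
```

Because absolute agreement is requested, a systematic shift between the 2 measurement sessions (the column mean square) lowers the coefficient, which a consistency-type ICC would ignore.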
Results
Inter-Software Reliability of Muscle Measurements Using OsiriX and ImageJ
The results for the inter-software reliability (ICC), SEM values, and descriptive statistics (mean±SD) for the left side are presented in Table 1 for the L4–L5 spinal measurements and in Table 2 for the S1 measurements. The results for the right side were virtually equivalent and are not presented. The inter-software reliability was analyzed by comparing the first set of measurements collected with each software program. The ICCs for all of the different muscle composition measurements, regardless of the muscle analyzed or spinal level, showed excellent agreement and varied between .81 and .99. However, the SEM associated with the side-to-side difference measurements was larger than that of the other muscle measurements.
Table 1. Inter-Software Reliability Indexes for Left Paraspinal Muscle Measurements at L4–L5
Table 2. Inter-Software Reliability Indexes for Left Paraspinal Muscle Measurements at S1
Inter-Software Agreement
Figure 2 shows the combined Bland and Altman 95% limits of agreement plots for the different muscle composition measurements from the left multifidus muscle at L4–L5 using the first set of measurements collected with each software program. Two methods are considered to have good agreement when the measurement difference is small enough for both methods to be used interchangeably.30 All of the plots show good agreement between OsiriX and ImageJ and no systematic bias: the mean of the difference scores approximates zero, and the scores are spread evenly and randomly above and below the mean line.28 As suggested by Bland and Altman, a histogram of the difference scores was first plotted for every measurement parameter, and all histograms followed a normal distribution. Because the error is normally distributed, about 95% of the points are expected to lie between the limits of agreement (noted by the dashed lines on the plots), and this was observed for each measure. The width of the limits of agreement for the different measurements also was small (Fig. 2).
Figure 2. Bland-Altman 95% limits of agreement plots for the different muscle composition measurements of the left multifidus muscle at L4–L5. CSA=cross-sectional area, FCSA=functional CSA, CSA diff=side-to-side difference in CSA, FCSA diff=side-to-side difference in functional CSA, FCSA/CSA=ratio.
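The limits of agreement drawn on such plots are the mean of the paired differences plus and minus 1.96 standard deviations of those differences. A minimal sketch of that calculation follows; the function name and the paired values are hypothetical and are not study data.

```python
import statistics


def limits_of_agreement(method_a, method_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the 95% limits are
    bias -/+ 1.96 times the sample standard deviation of the differences.
    """
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired areas (cm^2) from 2 programs, for illustration only.
program_a = [5.0, 6.0, 7.0, 8.0]
program_b = [5.5, 5.5, 7.5, 7.5]
bias, lower_limit, upper_limit = limits_of_agreement(program_a, program_b)
```

A bias close to zero indicates no systematic difference between the 2 programs, and narrow limits indicate that individual measurements from either program can be expected to differ only slightly.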
Intrarater Reliability of Muscle Measurements Using OsiriX and ImageJ
The intrarater reliability (ICC), SEM values, and descriptive statistics (mean±SD) related to OsiriX and ImageJ muscle measurements for the left side are presented in Table 3 for the L4–L5 level and in Table 4 for the S1 level. Again, the results for the right side were virtually equivalent and are not presented. The ICCs for intrarater reliability across both spinal levels for total CSA measurements of the paraspinal muscles, individually and as a group, ranged from .94 to .99 for ImageJ and from .97 to .99 for OsiriX. The FCSA ICCs across both spinal levels for all of the measured muscles tended to be slightly lower for ImageJ (ICC=.90–.96) compared with OsiriX (ICC=.97–.98), although all values were excellent.
Table 3. Intrarater Reliability Indexes for OsiriX and ImageJ for Left Paraspinal Muscle Measurements at L4–L5
Table 4. Intrarater Reliability Indexes for OsiriX and ImageJ for Left Paraspinal Muscle Measurements at S1
The side-to-side difference measurements are of much smaller areas compared with the total CSA and FCSA measurements and had lower reliability values (ICC=.77–.97). The intrarater ICCs for the side-to-side difference in total CSA varied from .80 to .90 for OsiriX and from .78 to .91 for ImageJ, and the side-to-side difference in FCSA varied from .77 to .96 for OsiriX and from .85 to .97 for ImageJ. The reliability of the signal intensity of the total CSA and the ratio of FCSA/CSA also was measured because these data provide an estimate of the proportion of fat within a muscle. The mean ICC for the signal intensity of the total CSA was .99 for measurements acquired with either software program, and the mean for the FCSA/CSA ratio was .96 for OsiriX and .91 for ImageJ (range=.88–.97). The SEM associated with each muscle composition measurement was generally comparable between the software programs, except for the FCSA measurement, where the SEM tended to be higher for ImageJ.
Discussion
We have presented specific protocols for paraspinal muscle measurements using 2 readily available, free image analysis programs, OsiriX and ImageJ, in a level of detail to allow replication (Appendix). The reliability and agreement of related paraspinal muscle measurements were found to be reasonably comparable between software programs, with excellent reliability when applied to a clinically relevant population. These findings are supported by the Bland and Altman limits of agreement that indicate inter-software agreement is within an acceptable range to use either of the 2 methods. Furthermore, the similar intrarater and inter-software reliability coefficients and SEMs suggest that the software used contributes little to the measurement error.
A threshold technique was utilized to calculate FCSA based on differences in pixel intensities between muscle (low intensity) and fat tissues (high intensity) on T2-weighted axial images. The application used in OsiriX is based on a region-growing algorithm, whereas ImageJ uses a signal intensity threshold algorithm. With OsiriX, once the lean muscle signal intensity is defined, the region-growing image segmentation involves the selection of seed points, which determine whether neighboring pixels will be included in the selection. This method is more time-consuming compared with a straight threshold algorithm where the only step needed is to indicate the upper and lower bounds of the threshold limit for muscle tissue. However, as suggested by Dello et al,24 our impression was that OsiriX is a more user-friendly software package than ImageJ. We are not aware of any other study that investigated the agreement of paraspinal muscle measurements between 2 different image analysis programs.
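The difference between the 2 approaches can be illustrated with a toy example. The sketch below is a generic 4-connected region-growing routine and a plain global threshold; it is not the OsiriX or ImageJ implementation, and the function names, the image, and the intensity bounds are hypothetical.

```python
from collections import deque


def region_grow(image, seeds, lower, upper):
    """Pixels connected to the seed points whose intensity lies in [lower, upper]."""
    n_rows, n_cols = len(image), len(image[0])
    queue = deque(s for s in seeds if lower <= image[s[0]][s[1]] <= upper)
    selected = set(queue)
    while queue:
        r, c = queue.popleft()
        # Visit the 4 direct neighbors of the current pixel.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < n_rows and 0 <= nc < n_cols
                    and (nr, nc) not in selected
                    and lower <= image[nr][nc] <= upper):
                selected.add((nr, nc))
                queue.append((nr, nc))
    return selected


def global_threshold(image, lower, upper):
    """All pixels whose intensity lies in [lower, upper], connected or not."""
    return {(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if lower <= value <= upper}


# Toy slice: a band of high-intensity (fat) pixels separates 2 lean regions.
image = [[10, 12, 200, 11],
         [14, 13, 210, 15],
         [220, 205, 215, 12]]

one_seed = region_grow(image, seeds=[(0, 0)], lower=0, upper=30)
two_seeds = region_grow(image, seeds=[(0, 0), (0, 3)], lower=0, upper=30)
thresholded = global_threshold(image, lower=0, upper=30)
```

With a single seed, region growing selects only the lean pixels connected to that seed; a second seed is needed to reach the lean region on the other side of the fat band, whereas the global threshold selects both regions in one step. This is consistent with the additional time the seed-based approach required.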
The results of this study related to intrarater reliability, however, are similar to those of other studies examining measurements of FCSA and total CSA that used a threshold technique. Danneels et al1 reported ICCs for intrarater reliability that varied between .81 and .92 for FCSA, whereas other authors reported ICCs for intrarater reliability that were slightly higher (.90–.99).15,16,18 Studies using a tracing technique to measure FCSA by manually segmenting muscle from fat tissues have shown somewhat lower ICCs for intrarater reliability, varying between .81 and .96.17,19 Other investigators measuring total CSA reported ICCs for intrarater reliability that varied between .89 and .99.3,15,22,33,34 In the present study, however, intrarater reliability indexes were computed primarily in order to better interpret the contribution of inter-software reliability to measurement error. The fact that inter-software reliability is as high as intrarater reliability further suggests that using one software program as opposed to the other contributes little to measurement error.
One of the strengths of this study is the report of reliability indexes related to both individual muscle measurements and side-to-side differences. After several investigations of individuals with chronic LBP and those who were asymptomatic, Hides et al4 suggested that total CSA side-to-side asymmetry of the multifidus muscle greater than 10% could potentially signify an abnormality. Other investigators are now referring to this guideline.15
However, to our knowledge, only 2 studies have examined the reliability of side-to-side difference measurements, with ICCs varying between .77 and .97 for side-to-side difference measurements of total CSA and between .82 and .94 for FCSA (Battié and colleagues, unpublished research).15 The ICCs for both side-to-side difference measurements reported in our study are similar. Although both single muscle measurements and side-to-side difference measurements had high reliability coefficients and similar SEMs, the error is relatively more important in the difference measurements, as they represent much smaller areas. For example, when using OsiriX, we found that the mean FCSA side-to-side difference of the multifidus muscle at L4–L5 was 0.75 cm2 and the associated SEM was 0.19 cm2, which is small in absolute terms but still relatively large, as it represents approximately 25% of the mean measurement of multifidus asymmetry. By comparison, the SEM of 0.30 cm2 for multifidus muscle FCSA represents only approximately 5% of the mean FCSA measurement of 5.84 cm2. When changes over time are of interest, such as in preintervention and postintervention measurements, there may be a high probability that the differences observed are due to measurement error rather than true changes if they do not exceed 2 SEMs.35 The greater measurement error related to side-to-side difference was confirmed by the Bland and Altman plots, where the limits of agreement were relatively large in comparison with the other measurements.
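The percentages quoted above, and the 2-SEM criterion for change, can be checked with a few lines of arithmetic. The SEM and mean values below are the ones reported in the preceding paragraph; the function names and the example change scores of 0.30 and 0.50 cm2 are hypothetical.

```python
def sem_percent_of_mean(sem, mean):
    """SEM expressed as a percentage of the mean measurement."""
    return 100.0 * sem / mean


def exceeds_two_sem(change, sem):
    """True if an observed change is larger than 2 SEMs in either direction."""
    return abs(change) > 2.0 * sem


# Values reported in the text (OsiriX, multifidus muscle, L4-L5), in cm^2.
asymmetry_pct = sem_percent_of_mean(sem=0.19, mean=0.75)  # FCSA side-to-side difference
fcsa_pct = sem_percent_of_mean(sem=0.30, mean=5.84)       # single-muscle FCSA

# Hypothetical pre/post changes in FCSA asymmetry, in cm^2.
small_change_is_real = exceeds_two_sem(change=0.30, sem=0.19)
large_change_is_real = exceeds_two_sem(change=0.50, sem=0.19)
```

Under this criterion, a change in multifidus FCSA asymmetry would need to exceed 0.38 cm2, about half of the mean asymmetry itself, before it could be distinguished from measurement error.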
Another strength of this study is that we studied patients with LBP conditions for whom the measurements are most likely to be of interest and who are expected to have more fatty infiltration9,36 and muscle atrophy1,4 compared with people who are healthy, increasing the difficulty of determining muscle boundaries during manual segmentation. Other authors reporting on the reliability of FCSA measurements primarily used samples of participants who were healthy.15,16,18 Our results suggest that total muscle size, within the range studied, and spinal level (L4–L5, S1) do not influence intrarater reliability or inter-software agreement. Only the erector spinae muscle at S1 seems to have a proportionally higher SEM associated with the composition measurements with both software programs, in comparison with the other analyzed muscles. This finding could be explained by the high fatty infiltration and the smaller size of the erector spinae muscle at S1, which increased the difficulty in determining the muscle borders.
A limitation of this study is the restriction of the measurement analysis to only 2 software packages. Even though inter-software reliability and agreement between OsiriX and ImageJ were excellent, even when measurements were obtained by an individual with modest experience, this finding might not be the case for other custom-made and commercial software used for image analysis. As determining inter-software reliability was the primary purpose of this study, replicate measurements were obtained from the same image to remove a potential extraneous source of measurement error. However, this represents a limitation when looking at intrarater reliability, where estimates might have been somewhat lower if the rater had repeated the entire procedure, including selecting the image from which to obtain the measurement.
In summary, a detailed protocol for paraspinal muscle CSA and composition measurements using 2 widely available, commonly used software programs was described, which yielded measurements with high inter-software and intrarater reliability. However, we found slightly lower reliability of side-to-side difference measurements compared with measurements of single muscles, which may be an important consideration in view of the current interest in muscle asymmetry. Future related studies would benefit from using a standard muscle measurement protocol to facilitate replication and comparisons among studies.
Appendix.
Specific Protocols for Obtaining Muscle Cross-Sectional Area (CSA) and Functional CSA (FCSA) Signal Measurement
Footnotes
- Both authors provided concept/idea/research design, writing, and data analysis. Ms Fortin provided data collection and project management. Dr Battié provided fund procurement and facilities/equipment. The authors thank Doug Gross and Luciana Macedo for their review of this work and helpful comments.
- This study was approved by the Health Research Ethics Board of the University of Alberta.
- Support was received from the Canada Research Chairs Program and the European Union Community's Seventh Framework Programme (FP7, 2007–2013; grant HEALTH F2–2008-201626; project GENODISC).
- Received November 5, 2011.
- Accepted March 4, 2012.
- © 2012 American Physical Therapy Association