Translating reliability coefficients into clinically meaningful representations of measurement error is a necessary step when the goal is to link clinical research to clinical practice. The study by Steffen and Seney^{1} investigates the reliability of several balance and ambulation tests and converts the obtained coefficients into minimal detectable change (MDC) estimates. The authors apply Shrout and Fleiss^{2} type 3,k intraclass correlation coefficients (ICCs) to quantify relative reliability and, from these estimates, calculate the standard error of measurement (SEM) to quantify measurement error in the same units as the original measurement. For some of the balance and ambulation tests (eg, Timed “Up & Go” Test [TUG]), 2 trials were performed on each of 2 occasions; for other tests (eg, Six-Minute Walk Test [6MWT]), a single measurement was performed on each of 2 occasions. In the former case, the authors reported a type 3,2 ICC; in the latter case, they presented a type 3,1 ICC.
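For reference, the conventional route from a reliability coefficient to an MDC can be written as follows. This is a sketch of the commonly used formulas; it is assumed here, not quoted, that they correspond to the computations in Steffen and Seney's study. SD denotes the between-person standard deviation of the scores, and z is the standard normal deviate for the chosen confidence level (eg, 1.96 for 95%).

```latex
\mathrm{SEM} = SD\,\sqrt{1 - \mathrm{ICC}}, \qquad
\mathrm{MDC}_{z} = z \times \sqrt{2} \times \mathrm{SEM}
```

Because the ICC enters under the square root, a coefficient that omits a source of error yields a smaller SEM and, in turn, a smaller MDC.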
The authors’ rationale for applying the type 3,k ICC was “The ICC(3,k) was used instead of the Pearson correlation coefficient (r) for test-retest reliability because it assesses rating reliability by comparing the variability of different ratings of the same subject with the total variation across all ratings and all subjects.”^{1(pp740–741)} In fact, the type 3,1 ICC provides an estimate of reliability similar to the Pearson r because neither coefficient accounts for a systematic difference in scores between the replicate measures (eg, either trials or occasions in Steffen and Seney's study). Presumably, in a test-retest reliability study, one is interested in both systematic and random errors, and, if this is true, the type 2,k ICC is the better choice because it includes both sources of variance in the reliability coefficient calculation. When the systematic error is zero, the type 2,k and 3,k ICCs provide identical estimates of reliability. However, when systematic error is present, as in the case of Steffen and Seney's 6MWT data, the type 2,k ICC will be less than the type 3,k ICC.
My second reflection addresses the use of the Shrout and Fleiss classification system in situations where 2 or more facets exist, such as for the TUG data. Here, the facets are trials and occasions. A dilemma occurs when attempting to interpret the meaning of the type 3,2 ICC reported by Steffen and Seney. It is not clear if the second digit (2) refers to 2 trials, 2 occasions, or 2 trials performed on each of 2 occasions (ie, a total of 4 measurements). I propose that a generalizability^{3} approach to the analysis has the potential to provide a clearer picture of the sources of variance, their magnitude, and the relative merits of averaging over either trials or occasions, or both.
To illustrate the points raised above, I have generated synthetic data for the TUG. Paralleling the design of Steffen and Seney, the synthetic data represent 2 TUG trials performed on each of 2 occasions for 10 persons. The data presented in Table 1 were contrived to illustrate a systematic difference between occasions, but no systematic difference between trials.
Table 2 reports the mean scores for trials and occasions. Of interest is that the trial means averaged over occasions are almost identical; however, the occasion means differ. Stated another way, a systematic difference exists between occasions, but not between trials averaged over occasions.
Table 3 displays Shrout and Fleiss type 2,1 and type 3,1 ICCs obtained by performing randomized block analysis of variance (ANOVA). Negative variance estimates were set to zero for all analyses. Pearson r values also are reported in this table. That the intertrial type 2,1 and 3,1 ICCs are identical to 2 decimal places reflects the similarity of trial means shown in Table 2. By contrast, the interoccasion means shown in Table 2 differed, and this systematic difference is not reflected in the type 3,1 ICC or in the Pearson r. Accordingly, the type 3,1 ICC is greater than the type 2,1 ICC because the variance due to occasion is greater than zero.
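The contrast among the type 2,1 ICC, the type 3,1 ICC, and the Pearson r can be sketched in a few lines of Python. This is a minimal illustration, assuming a persons-by-measures randomized block layout with one score per cell; the function name `icc_single` and the scores are hypothetical and are not the data in Table 1 or the analysis behind Table 3. Negative variance estimates are set to zero, as in the text.

```python
import numpy as np

def icc_single(scores):
    """Return (ICC(2,1), ICC(3,1)) for an n_persons x k_measures array."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Sums of squares for a randomized block (two-way, no replication) ANOVA.
    ss_persons = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_measures = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_error = np.sum((x - grand) ** 2) - ss_persons - ss_measures
    bms = ss_persons / (n - 1)                # between-persons mean square
    jms = ss_measures / (k - 1)               # between-measures mean square
    ems = ss_error / ((n - 1) * (k - 1))      # residual mean square
    # Variance components; negative estimates are set to zero.
    var_persons = max((bms - ems) / k, 0.0)
    var_measures = max((jms - ems) / n, 0.0)
    var_error = ems
    icc21 = var_persons / (var_persons + var_measures + var_error)
    icc31 = var_persons / (var_persons + var_error)
    return icc21, icc31

# Hypothetical scores for 10 persons measured twice (illustration only).
occasion1 = np.array([10.0, 12, 14, 16, 18, 20, 22, 24, 26, 28])
wobble = np.array([0.5, -0.5] * 5)            # small error with mean zero
no_shift = np.column_stack([occasion1, occasion1 + wobble])
shifted = np.column_stack([occasion1, occasion1 + wobble + 3.0])  # systematic +3

for label, data in [("no systematic difference", no_shift),
                    ("systematic difference", shifted)]:
    icc21, icc31 = icc_single(data)
    r = np.corrcoef(data[:, 0], data[:, 1])[0, 1]
    print(f"{label}: ICC(2,1)={icc21:.3f}  ICC(3,1)={icc31:.3f}  r={r:.3f}")
```

Adding a constant to one column leaves the type 3,1 ICC and the Pearson r unchanged but lowers the type 2,1 ICC, which is the pattern described for the interoccasion coefficients.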
The following section illustrates a generalizability analysis that includes both trials and occasions in a single analysis. I applied a 3-way random effects ANOVA. The rationale for applying a random effects model was that I wished to generalize beyond the persons, trials, and occasions composing the study sample. The ANOVA and variance components were calculated using MINITAB statistical software*, and the results appear in Table 4. Once again, negative variance estimates were set to zero.
Inspection of the variance components reveals the following findings: (1) there is a large variance among persons, which is desirable; (2) the variance between trials averaged over occasions is zero (this reflects the nearly identical means reported in Table 2); (3) there is a relatively large variance due to occasions (this reflects the difference in occasion means reported in Table 2); (4) the person by occasion (P × O) variance is substantially greater than the person by trial (P × T) variance (this suggests that averaging over occasions will have a greater effect than averaging over trials); and (5) the residual error is relatively small compared with the person variance.
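Variance components of this kind can be estimated from the mean squares of a fully crossed persons × trials × occasions design. The sketch below uses the expected-mean-squares (ANOVA) method for a 3-way random effects model with one score per cell; it is an illustration under that assumption, not the MINITAB procedure, and the function name `variance_components` and the simulated scores are hypothetical (they are not the data in Table 1 and will not reproduce Table 4).

```python
import numpy as np

def variance_components(y):
    """ANOVA estimates for a persons x trials x occasions array (one score per cell)."""
    y = np.asarray(y, dtype=float)
    n_p, n_t, n_o = y.shape
    g = y.mean()
    # Marginal means for main effects and two-way tables.
    m_p, m_t, m_o = y.mean(axis=(1, 2)), y.mean(axis=(0, 2)), y.mean(axis=(0, 1))
    m_pt, m_po, m_to = y.mean(axis=2), y.mean(axis=1), y.mean(axis=0)

    ss_p = n_t * n_o * np.sum((m_p - g) ** 2)
    ss_t = n_p * n_o * np.sum((m_t - g) ** 2)
    ss_o = n_p * n_t * np.sum((m_o - g) ** 2)
    ss_pt = n_o * np.sum((m_pt - m_p[:, None] - m_t[None, :] + g) ** 2)
    ss_po = n_t * np.sum((m_po - m_p[:, None] - m_o[None, :] + g) ** 2)
    ss_to = n_p * np.sum((m_to - m_t[:, None] - m_o[None, :] + g) ** 2)
    ss_pto = np.sum((y - g) ** 2) - ss_p - ss_t - ss_o - ss_pt - ss_po - ss_to

    ms_p, ms_t, ms_o = ss_p / (n_p - 1), ss_t / (n_t - 1), ss_o / (n_o - 1)
    ms_pt = ss_pt / ((n_p - 1) * (n_t - 1))
    ms_po = ss_po / ((n_p - 1) * (n_o - 1))
    ms_to = ss_to / ((n_t - 1) * (n_o - 1))
    ms_pto = ss_pto / ((n_p - 1) * (n_t - 1) * (n_o - 1))

    # Solve the expected-mean-square equations for each component.
    raw = {
        "p": (ms_p - ms_pt - ms_po + ms_pto) / (n_t * n_o),
        "t": (ms_t - ms_pt - ms_to + ms_pto) / (n_p * n_o),
        "o": (ms_o - ms_po - ms_to + ms_pto) / (n_p * n_t),
        "pt": (ms_pt - ms_pto) / n_o,
        "po": (ms_po - ms_pto) / n_t,
        "to": (ms_to - ms_pto) / n_p,
        "pto,e": ms_pto,
    }
    # Negative variance estimates are set to zero.
    return {name: max(value, 0.0) for name, value in raw.items()}

# Simulated scores: 10 persons, 2 trials, 2 occasions (illustration only).
rng = np.random.default_rng(0)
person = rng.normal(0.0, 5.0, size=(10, 1, 1))
occasion = np.array([0.0, 2.0]).reshape(1, 1, 2)      # systematic occasion shift
person_by_occasion = rng.normal(0.0, 1.5, size=(10, 1, 2))
noise = rng.normal(0.0, 0.5, size=(10, 2, 2))
scores = 20.0 + person + occasion + person_by_occasion + noise

for name, value in variance_components(scores).items():
    print(f"{name:>6}: {value:.3f}")
```

The simulation builds in a person effect, an occasion shift, and a person by occasion interaction but no trial effect, so the trial-related components should come out near zero.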
The variance components reported in Table 4 can be applied to calculate generalizability coefficients that represent intertrial and interoccasion reliability. They also can be used to examine the distinct effect of averaging over trials, occasions, or both.
The theoretical intertrial reliability (generalizability) for a single trial is obtained by substituting the variance components into Equation 1 and by setting n_{t} and n_{o} to 1. The obtained value is .97, and this is analogous to the Shrout and Fleiss type 2,1 intertrial ICCs of .96 reported in Table 3. The intertrial reliability for an average of 2 trials can be obtained by setting n_{t} to 2 and n_{o} to 1. This yields an intertrial reliability of .98, which is analogous to a Shrout and Fleiss type 2,2 ICC.
When the goal is to draw inferences about the change status of a person, as is the case when MDC is applied, the interoccasion reliability (generalizability) coefficient is of interest. It is calculated by applying Equation 2. The theoretical interoccasion reliability for a single trial is obtained by substituting the variance components into Equation 2 and by setting n_{t} and n_{o} to 1. This gives an interoccasion reliability of .74, which is the average of the 2 interoccasion reliability estimates reported in Table 3. The interoccasion reliability for a single trial performed on each of 2 occasions is obtained by setting n_{t} to 1 and n_{o} to 2. This yields an interoccasion reliability of .85.
Finally, one can examine the interoccasion reliability for the average of 2 trials on each of 2 occasions. This is accomplished by setting n_{t} to 2 and n_{o} to 2 in Equation 2. A value of .86 is obtained, and, to my knowledge, there is no equivalent Shrout and Fleiss coding scheme to represent this combination.
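The effect of averaging over trials, occasions, or both can be sketched with the textbook index of dependability for a fully crossed persons × trials × occasions random design. This coefficient generalizes over both facets at once, so it should not be read as Equation 1 or Equation 2; the function name `dependability` and the variance components below are hypothetical values chosen for illustration and are not those of Table 4.

```python
def dependability(vc, n_t=1, n_o=1):
    """Absolute-agreement coefficient for the mean of n_t trials on each of n_o occasions."""
    # Every component except the person variance contributes to absolute error,
    # divided by the number of conditions it is averaged over.
    error = (
        vc["t"] / n_t
        + vc["o"] / n_o
        + vc["pt"] / n_t
        + vc["po"] / n_o
        + (vc["to"] + vc["pto,e"]) / (n_t * n_o)
    )
    return vc["p"] / (vc["p"] + error)

# Hypothetical variance components (illustration only): the person by occasion
# component is larger than the person by trial component.
vc = {"p": 20.0, "t": 0.0, "o": 3.0, "pt": 1.0, "po": 4.0, "to": 0.0, "pto,e": 2.0}

for n_t, n_o in [(1, 1), (2, 1), (1, 2), (2, 2)]:
    print(f"n_t={n_t}, n_o={n_o}: {dependability(vc, n_t, n_o):.3f}")
```

With components of this shape, adding a second occasion raises the coefficient more than adding a second trial, which mirrors the pattern seen when the P × O variance exceeds the P × T variance.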
Footnotes

This letter was posted as a Rapid Response on June 3, 2008, at www.ptjournal.org.

* Minitab Inc, Quality Plaza, 1829 Pine Hall Rd, State College, PA 16801-3008.
 Physical Therapy