Abstract
A stepped wedge cluster randomized trial is a type of longitudinal cluster design that sequentially switches clusters to intervention over time until all clusters are treated. While the traditional posttest-only parallel design requires adjustment for a single intraclass correlation coefficient, the stepped wedge design allows multiple outcome measurements from the same cluster and so additional correlation parameters are necessary to characterize the within-cluster correlation structure. Although a number of studies have differentiated between the concepts of within-period and between-period correlations, only a few studies have allowed the between-period correlation to decay over time. In this article, we consider the proportional decay correlation structure for a cohort stepped wedge design, and provide a matrix-adjusted quasi-least squares approach to accurately estimate the correlation parameters along with the marginal intervention effect. We further develop the sample size and power procedures accounting for the correlation decay, and investigate the accuracy of the power procedure with continuous outcomes in a simulation study. We show that the empirical power agrees well with the prediction even with as few as nine clusters, when data are analyzed with matrix-adjusted quasi-least squares concurrently with a suitable bias-corrected sandwich variance. Two trial examples are provided to illustrate the new sample size procedure.
Keywords: finite-sample correction, generalized estimating equations (GEEs), group-randomized trial, proportional decay, quasi-least squares (QLS), sample size calculation
1 |. INTRODUCTION
A unique feature of cluster randomized trials (CRTs) is that intact clusters, such as schools or clinics, are randomized to intervention arms.1,2 Randomization at the cluster level often carries pragmatic considerations, eg, administrative convenience, political reasons and prevention of treatment contamination.3 A stepped wedge CRT is a type of longitudinal design that sequentially switches clusters to intervention during the course of the study until all clusters are treated.4 Such designs have become increasingly popular due to their logistical flexibility and perceived ethical benefits. Because individual outcomes within the same cluster tend be more similar than those in different clusters, the intraclass correlation coefficient (ICC) plays a central role in designing CRTs. While the traditional posttest-only parallel design (ie, parallel cluster randomized design without a baseline period as defined in the work of Murray1) requires adjustment for a single ICC, longitudinal cluster randomized design such as the crossover or stepped wedge design allows multiple outcome measurements from the same cluster and so naturally requires additional correlation parameters to characterize the within-cluster structure.5,6 Correspondingly, sample size and power calculations for stepped wedge designs necessitate the specification of more than one correlation parameters. For example, Hemming et al7 considered both the within-period and between-period ICCs in their sample size procedure for a cross-sectional design. Hooper et al8 and Li et al5 examined a three-correlation structure that additionally accounts for the within-individual repeated measurements in a closed-cohort design.
Despite existing development of multiparameter correlation structures for designing stepped wedge trials, most of them assumed a constant between-period ICC with a few exceptions. For example, in a cross-sectional design where outcome data are obtained from a different set of participants in each cluster period,1,9 Hemming et al7 allowed the between-period ICC to be different from the within-period ICC, but restricted the between-period ICC to be constant irrespective of the distance between periods. Relaxing the constant between-period ICC assumption for a cross-sectional design, Kasza et al10,11 studied a nonuniform correlation structure with a decay parameter and proposed a sample size procedure that accounts for the exponential correlation decay. Grantham et al12 further extended their sample size procedure to allow for continuous-time correlation decay in multiple-periods CRTs with continuous recruitment. From a trial planning standpoint, if correlation decay is present, Kasza et al10 indicated that omitting the correlation decay in a cross-sectional design would either underestimate or overestimate the true variance of the intervention effect, which led to inaccurate sample size determination. As we demonstrate in Section 6.3, similar considerations carry over to closed-cohort designs, where outcome data are collected from the same set of participants in each cluster period.1,9,13 Particularly, the constant between-period ICC assumption in the works of Hooper et al8 and Li et al5 may not always be realistic and it is therefore necessary to develop alternative design and analysis strategies accounting for correlation decay in cohort stepped wedge studies.
Two popular modeling approaches for stepped wedge designs are cluster-specific models (eg, random-effects models) and population-averaged models.15 The parameter estimates from a population-averaged model can be interpreted as the marginal intervention effect for the participating individuals combined over all cluster periods, and may be preferred over the cluster-specific models for trials conducted in the health policy or health services settings.5 In this article, we consider a population-averaged model with a decay correlation structure. As indicated in Table 1, the first contribution of this article is to fill in the gap and characterize a proportional decay structure appropriate for cohort stepped wedge designs. Such a proportional decay structure has been previously introduced in analyzing clustered longitudinal data; see, eg, the works of Lefkopoulou et al,16 Shults and Morrow17 and Liu et al,18 but has not been exemplified in CRTs with a staggered randomization. Based on the proportional decay structure, we derive a new closed-form variance expression to facilitate sample size and power determination. Based on the derived variance expression, we additionally obtain a simple-to-use design effect (DE) and study how the power depends on the correlation parameters. Since the sample size procedure requires input for the correlation parameters, accurate estimation of the correlations is instrumental for future planning of stepped wedge trials. Therefore, a second contribution of this article is to introduce a modified generalized estimating equations (GEEs) approach to accurately estimate the decay correlation structure along with the marginal mean parameters. The traditional GEE19 is modified in the following two ways. (1) As simple moment estimators for the decay correlation structure may not be easy to obtain, we estimate the marginal mean and correlation parameters by quasi-least squares (QLS).20,21 The QLS approach shares the same estimating equations with GEE regarding the marginal mean parameters, but is flexible enough to accommodate nonstandard correlation structures. Similar to the traditional GEE estimator, the QLS estimator is also robust to correlation misspecification. (2) Since stepped wedge cluster randomized design frequently includes a small number of clusters, we refine the QLS approach by incorporating appropriate finite-sample bias corrections to both the estimation of correlation parameters as well as the variance of the intervention effect.
TABLE 1.
Decay | Design feature | Correlation structure | Example references |
---|---|---|---|
No | Cross-sectional | Nested exchangeable | Hemming et al,7 Hooper et al,8 and Li et al5 |
Closed-cohort | Block exchangeable | Hooper et al8 and Li et al5,14 | |
Yes | Cross-sectional | Exponential decay | Kasza et al,10 Kasza and Forbes,11 and Grantham et al12 |
Closed-cohort | Proportional decay | This article |
The remainder of this article is organized as follows. The notation of cohort stepped wedge designs is introduced in Section 2. In Section 3, we provide the proportional decay correlation structure and discuss the QLS estimators to estimate the correlation structure. In Section 4, we develop closed-form procedures for sample size and power calculations based on the population-averaged model coupled with the proportional decay structure. We conduct a simulation study in Section 5 to examine the accuracy of the proposed power procedure when the trials are analyzed by the QLS approach. Section 6 provides two illustrative examples of cohort stepped wedge designs and Section 7 concludes.
2 |. NOTATION AND BASIC SETUP
We consider a cohort stepped wedge design, where a closed cohort of individuals are enrolled at each of the I participating clusters at the start of the trial. We mainly focus on cohort designs to inform the applications in Section 6, and will defer the discussion of cross-sectional designs to Section 7. We assume the trial involves a total of T time periods. All clusters start from the control condition, and may be randomly chosen to switch to intervention during the course of the study, until all clusters are treated at the end of the Tth period. Individual participants will be scheduled for outcome measurement during each period, and so each individual has a total of T repeated measurements. Denote yijt as the outcome for individual j (j = 1, …, Ni) from cluster i (i = 1, …, I) at period t (t = 1, …, T). A step is defined as the preplanned time point when at least one cluster crosses over from control to intervention. We denote the total number of steps by S (2 ≤ S ≤ T − 1), and assume that ms clusters cross over at step s such that . We assume a complete design in the terminology of the work of Hemming et al7 such that outcome measurements are taken for all individuals during each period. Following the work of Woertman et al,22 we define the baseline measurements as those taken before any cluster is randomized to intervention, and follow-up measurements as those taken after at least one cluster is randomized to intervention. We assume there are b ≥ 1 baseline measurements planned under the control condition, and cs ≥ 1 follow-up measurements planned between step s and step s + 1 (or end of study). Each measurement time point is associated with a distinct time period and the total number of periods . A standard stepped wedge design is given by b = cs = 1 for all s, and T = S + 1 (T ≥ 3). A schematic illustration of a standard design with I = 6 clusters and T = 4 periods can be found in Figure 1. We also provide a schematic illustration of an alternative design with I = 6 clusters and T = 6 periods in Web Figure 1.
3 |. ANALYSIS CONSIDERATIONS: MODELS AND STATISTICAL INFERENCE
3.1 |. Population-averaged models
The population-averaged model relates the marginal mean, μijt, to the time trend and the intervention effect by
(1) |
where g is the link function and βt is the tth period effect.5 Furthermore, Xit is the intervention status, which is equal to 1 or 0 depending on whether cluster i receives intervention during period t, and δ describes the intervention effect on the link function scale. Model (1) can be regarded as the marginal counterpart of a number of existing random-effects models, such as those proposed in the works of Hemming et al,7 Hooper et al8 and Kasza et al.10,11 Like these random-effects models, our marginal model (1) does not specify treatment by period interaction and so δ should be interpreted as the average intervention effect across periods. We write the collection of model parameters as θ = (β1, …, β, δ)′, the collection of intervention status for cluster i (a sequence of ones preceded by zeros) as Xi = (Xi1, …, XiT)′, and define v(μijt) as a known variance function. To allow for potential correlation decay over time, we define the proportional decay correlation structure similar to the work of Lefkopoulou et al.16 Specifically, we define the within-period correlation as the correlation between outcomes for two distinct individuals from the same cluster during the same period, ie, corr(yijt, yij′t) = τ for j ≠ j′. The same definition is prevalently used in the posttest-only parallel designs.1 We then assume a first-order autoregressive structure for the set of within-individual repeated measurements. Therefore, the within-individual correlation, which describes the association between outcomes measured at time t and t′ of the same individual, is corr(yijt, yijt′) = ρ|t − t′|, t ≠ t′. Finally, we define the between-period correlation as the correlation between outcome measured at time t for individual j and outcome measured at time t′ for individual j′, and assumes a decay structure as corr(yijt, yij′t′) = τρ|t−t′| for j ≠ j′, t ≠ t′.
In a closed-cohort design, typically, the between-period correlation is much smaller than the within-individual correlation for any two fixed periods, since the former is defined for measurements from two distinct individuals, while the latter is defined for those from the same individual. This ordering of magnitude is reflected in the proportional decay structure because τ < 1 and corr(yijt, yij′t′) = τρ|t−t′| < corr(yijt, yijt′) = ρ|t−t′|. By comparison, the exponential decay structure by Kasza et al10 is based on cross-sectional designs and therefore obviates the need for modeling the within-individual correlation from repeated measurements. In fact, the exponential decay structure assumes corr(yijt, yij′t′) = τρ|t−t′| irrespective of whether j = j′ since two different sets of participants are included in two different periods for a cross-sectional design. In this respect, parameters τ and ρ have similar interpretations in both structures. We provide a visual comparison of these two structures in Table 2. In summary, the proportional decay correlation structure is defined through two parameters, τ and ρ, with the former resembling the traditional ICC definition in a parallel design and the latter controlling for the degree of correlation decay. Of note, the works Shults and Morrow17 and Liu et al18 also adopted the same proportional decay structure in longitudinal CRTs with a parallel assignment, and we extend the discussion of this decay structure to CRTs with a staggered assignment.
TABLE 2.
Proportional decay structure | Exponential decay structure | |
---|---|---|
corr(yi) |
3.2 |. Quasi-least squares analysis
We use the QLS approach introduced by Shults and Morrow17 to simultaneously estimate the intervention effect in model (1) and the correlation parameters in a cohort stepped wedge design. In particular, the QLS approach and the traditional GEE approach share the same estimating equations for the marginal mean parameters, whereas the former provides a flexible and convenient way to estimate nonstandard correlation structures. Furthermore, both the QLS and GEE approaches are robust to correlation misspecification, namely, estimators for the marginal intervention effect remain consistent even if the working correlation model is misspecified. In sufficiently large samples (usually I ≥ 30), the robust sandwich variance could also be used to adequately quantify the uncertainty of the intervention effect estimate even under correlation misspecification. We refer the readers to the textbook of Shults and Hilbe23 for a full exposition on the advantages of QLS over the traditional GEE.
Write yij = (yij1, …, yijT)′, μij = (μij1, …, μijT)′, , and . Furthermore, define Di(θ) =∂μi ∕∂θ′, and let the working covariance of yi be , where ϕ > 0 is the dispersion parameter, , Aij(θ) = diag{v(μij1), …, v(μijT)}, and Ri(α0, α1) is a positive definite working correlation parameterized by α0 and α1. We assume the true correlation structure among elements of yi is the proportional decay structure, denoted by Ri(τ, ρ). In matrix notation, we can verify that the proportional decay structure induces separability between τ and ρ in that Ri(τ, ρ) = Gi(τ) ⊗ F(ρ), where Gi(τ) is a Ni × Ni exchangeable correlation matrix (with respect to τ) and F(ρ) is a T × T first-order autoregressive correlation matrix as follows:
We could verify that the determinant
Therefore, valid correlation values that ensure positive definite Ri(τ, ρ) should be contained in the triangular region
(2) |
Finally, the inverse of the Ri also exists in closed form and is given by , where
Ii is a Ni × Ni identity matrix, Ji is a Ni × Ni matrix of ones, C2 = diag(0, 1, …, 1, 0), and C1 is a T × T tridiagonal matrix with zeros on the main diagonal and ones on the two subdiagonals.
To introduce the QLS estimating equations, we further define , and write . The first-stage QLS estimates for θ, α0, and α1 are obtained by alternating between the following estimating equations until convergence:
(3) |
(4) |
(5) |
In particular, (3) is the usual GEE coupled with the proportional decay structure, and (4) and (5) are scalar equations for the first-stage correlation estimates. Further, closed-form solutions exist for and (within an iterative step) and are provided in Web Appendix A. Chaganty and Shults21 showed that and are asymptotically biased for τ and ρ. To eliminate the large-sample bias in the first-stage correlation estimates, Chaganty and Shults21 provided the following second-stage estimating equations to obtain: ,
(6) |
(7) |
The closed-form solution for (6) and (7) are provided by the work of Shults and Morrow17 as
and .
3.3|. Bias-corrected correlation estimation
Although the correlation estimates obtained from the second-stage QLS estimating equations are unbiased in large samples, they could be subject to finite-sample bias. The finite-sample bias is a typical consideration in stepped wedge CRTs,5 as they usually include a small number of clusters (I ≤ 30). We refine the QLS approach with finite-sample bias corrections to the correlation estimating equations by utilizing the cluster leverage,24 defined as . Specifically, when I is small, the estimated residual tends to be biased toward zero, and following the work of Preisser et al,25 we have and, therefore, . This last equation suggests that is a better estimator for corr(yi) compared to the simple cross-product , since the former accounts for finite-sample bias in a multiplicative fashion. Furthermore, observe that
for all values of α0, α1, and θ; we propose to replace the first-stage estimating equations (4) and (5) by
(8) |
(9) |
where represents the matrix-adjusted estimator for the correlation structure. The solutions obtained from (8) and (9) could effectively reduce the finite-sample bias in and , which would in turn decrease the finite-sample bias in the QLS estimators for τ and ρ. Of note, similar finite-sample matrix adjustment was developed by Preisser et al25 for the Prentice-type GEE,26 and we extend this finite-sample bias-correction approach to the QLS estimating equations. Accurately estimating the correlation parameters in the analysis stage has practical implications since these estimates could be used to guide the planning of future trials.1 Additional details of the matrix-adjusted estimating equations, (8) and (9), along with the closed-form updates are provided in Web Appendix B.
3.4 |. Bias-corrected covariance estimation
The availability of a small number of clusters may also have implications for estimating the variance using GEE-based approaches.27 In general, the variance of the marginal mean model parameter can be estimated using the model-based variance or the sandwich variance , where
(10) |
and both Ω0 and Ω1 are evaluated at (, , ). When both Ci and Bi are identity matrices, Equation (10) reduces to the uncorrected sandwich estimator of Liang and Zeger,19 which we denote as BC0. BC0 provides valid inference regardless of the correct specification of the working correlation Ri, as long as the number of clusters is sufficiently large (I ≥ 30), while the consistency of the model-based variance requires the correct specification of the correlation structure. As is biased toward zero with a limited number of clusters, BC0 is likely to underestimate the variance and alternative choices of matrices Ci and Bi maybe necessary to provide a partial correction to the finite-sample bias.27 We consider three popular approaches for bias corrections summarized as follows and also in Table 3: the finite-sample correction due to the work of Kauermann and Carroll,28 or BC1, is given by Ci = I and Bi = (I − Hi)−1/2; the finite-sample correction due to the work of Mancl and DeRouen,29 or BC2, is given by Ci = I and Bi = (I − Hi)−1; and the finite-sample correction due to the work of Fay and Graubard,30 or BC3, given by and Bi = I, where the bound parameter ζ is a user-defined constant (< 1) with a default value 0.75. Because the matrix elements of the cluster leverage are between 0 and 1, we generally have BC0 < BC1 < BC2 in terms of the amount of correction.25 Furthermore, Scott et al31 have shown that BC3 tends to be close to BC1. Of note, the estimation of dispersion parameter should only affect the model-based variance. Similar to the work of Liang and Zeger,19 we propose to consistently update the dispersion parameter from iteration s to s + 1 by .
TABLE 3.
4 |. DESIGN CONSIDERATIONS: SAMPLE SIZE AND POWER CALCULATIONS
4.1 |. Closed-form variance of the intervention effect
Under the null hypothesis H0: δ = δ0, the large-sample variance of is provided by the (T + 1, T + 1) element of the large-sample covariance matrix of . Since the QLS estimator is asymptotically normal, we could use the z-test statistic to test the null of no intervention effect H0: δ = 0, and the power to detect an intervention effect of size δ ≠ 0 with a prescribed type I error rate α is approximately
(11) |
where Φ is the standard normal cumulative distribution function and zα/2 is the normal quantile such that Φ(zα/2) = 1 − α∕2. Because there is uncertainty in estimating the asymptotic variance , an alternative two-sided test uses the same statistic but refers to the t-distribution. We consider two choices of degrees of freedom (DoF). The first DoF dates back to the work of Mancl and DeRouen29 and equal to the number of clusters minus the number of regression parameters; this DoF has been previously used in the GEE analyses of parallel CRTs,32 three-level CRTs,33 crossover CRTs,6 stepped wedge CRTs,5,34 and shown to have test size not exceeding the nominal level. The second DoF was suggested in the PhD dissertation of Ford35 and specifies DoF = I − 2. This choice of DoF was found to provide excellent control of type I error rate for GEE analyses of both parallel CRTs and stepped wedge CRTs in the work of Ford and Westgate.35,36 With the same effect size δ and prescribed type I error rate α, the power of the t-test is approximately
(12) |
where ΨDoF is the t distribution function and the quantile tα/2 is chosen such that ΨDoF(tα/2) = 1 − α∕2. We notice that, because the t-distribution has a heavier tail compared with the standard normal distribution, the QLS z-test ismore likely to result in an inflated type I error rate with the use of BC0 than the corresponding QLS t-test. As the bias-corrected sandwich variance estimators (BC1, BC2, and BC3) provide different degrees of inflation relative to the uncorrected variance BC0, an investigation of both the z- and t-tests coupled with the collection of alternative variance estimators could help inform the practical choice among the analytical options for stepped wedge CRTs.
To assist the design of stepped wedge trials allowing for correlation decay, we derive a new closed-form variance expression for assuming the outcome is continuous and g is the identity link function. We will return to categorical outcomes and nonlinear link functions in Section 7. To do so, we follow the work of Shih37 and assume the covariance of Yi to be known as var(Yi) = Vi. Therefore, is the (T+1, T+1) element of themodel-based variance . We further assume a balanced design such that an equal number of participants will be recruited in each cluster prior to the first period, so that Ni = N. Such a simplification assumption is routinely made in designing stepped wedge trials.
Under a balanced design, we could write the design matrix corresponding to cluster i as Zi = 1N ⊗ (IT, Xi), where 1N is a N-vector of ones. Then, the variance of the intervention effect is equal to the lower-right element of , where ϕ is the marginal variance. We show in Web Appendix C that a closed-form variance expression for is
(13) |
where the design constant is the total number of cluster periods exposed under the intervention condition, is the squared number of clusters receiving the intervention summed across periods, and and are cross-product terms resulting from the decay correlation structure. It is interesting to see that this variance expression does not depend on the magnitude of the period effect βt as long as they are controlled for in the marginal mean model. Noticeably, the QLS-based variance (13) extends the formula due to the work of Liu et al18 to longitudinal cluster designs with staggered randomization. Furthermore, as the cohort size N becomes large, the variance expression converges to
(14) |
which is a finite constant since |ρ| < 1 and τ > 0 for large N, according to (2). Therefore, the limit of the variance is a positive constant determined by available design resources I, T, and two correlation values, τ and ρ, and cannot be made arbitrarily small. In other words, the power of the stepped wedge design may not be increased to one by solely increasing the cohort size, which is consistent with the known results for parallel cluster randomized designs.1 For this reason, when N is large, variance (14) could be used in the design stage to approximate the variance (13). Finally, given hypothesized values for I, N, T, and correlation parameters, variance expression (13) or (14) can be used in Equations (11) and (12) to obtain the predicted power.
4.2 |. The design effect
For determining the required sample size based on Equations (11) and (12), it is straightforward to solve N by fixing the required number of clusters I but not the other way around. However, with a predetermined cohort size N for each cluster, we could postulate a series of values for I and find the smallest value such that the estimated power is at least equal to the prescribed level. Additionally, in the following case studied by Woertman et al,22 we could derive a simple expression for the design effect (DE) relative to an individually randomized trial to simplify sample size calculation. Specifically, we assume that an equal number of clusters switch to intervention at each step so that ms = m, and, furthermore, an equal number of measurements are taken between steps such that cs = s for all s = 1, …, S. We then write the total number of clusters I = Sm and total number of periods T = b + Sc, and the design constants become
Plugging the design constants back into the variance formula (13) and dividing by the variance of the two-sample mean difference 4ϕ∕(NSm), we obtain
(15) |
The aforementioned DE allows us to easily study how the design resources affect the statistical efficiency relative to individual randomization and how the correlation parameters affect the statistical power. For example, since the DE is free of b, the relative design efficiency does not change according to the number of baseline periods. However, for fixed values of the correlation parameters, increasing the number of steps S and number of measurements between steps c decreases the DE and increases the efficiency. On the other hand, for fixed design resources, larger values of the within-period correlation τ increases the DE, confirming that τ functions as the traditional ICC of a parallel CRT. By contrast, the role of correlation parameter ρ is characterized by f(p) = (1 − ρ2)∕[(S + 1)c(1 − ρ)2 + 6ρ], which is monotonically increasing on (−1, r) and decreasing on (r, 1), where
For convenience, we can also define the decay parameter d = 1 − ρ so that d = 0 and d = 1 correspond to no decay and total decay, respectively. Since it is more plausible that ρ ∈ (0, 1), the aforementioned result suggests that, with an increasing level of correlation decay, the DE first increases to its largest value and then decreases, with the maximum DE obtained at ρ = r. A numerical illustration of the DE as a function of the decay parameter is provided in Web Figure 2.
5 |. A SIMULATION STUDY
5.1 |. Simulation design
We carry out a simulation study (1) to compare the correlation estimates from the uncorrected QLS and the proposed matrix-adjusted QLS (MAQLS), and (2) to evaluate the utility of the proposed power formula for QLS-based analyses of stepped wedge CRTs. For the second objective, we first determine the empirical type I error rates for the QLS-based tests coupled with alternative variance estimators, and then identify valid tests (those with a close-to-nominal type I error rate) whose empirical power corresponds well with the predicted power from the proposed formula. Findings specific to the second objective are informative for practical data analysis since we prefer tests that maintain a valid size and meanwhile demonstrate empirical power that is at least the magnitude of the analytical prediction.
Within-cluster correlated continuous outcomes were generated from a multivariate normal distribution with mean given by μijt = βt+Xitδ and covariance ϕR(τ, ρ), where R(τ, ρ) is the proportional decay structure defined in Section 3.1. We set the marginal variance ϕ = 1 and assumed a gently increasing period effect such that β1 = 0 and βt+1−βt = 0.1×(0.5)t−1 for t ≥ 1. As discussed before, the predicted power should be insensitive to the magnitude of the period effects as long as they are accounted for in the QLS analyses. We fix the effect size δ∕ϕ1/2 at zero for studying test size and choose δ∕ϕ1/2 from {0.2, 0.3, 0.4, 0.5} for studying power. We choose the within-period correlation τ ∈ {0.03, 0.1}, which represent typical ICC values reported in the parallel CRTs.1 We further chose ρ ∈ {0.2, 0.8}, representing large and moderate degree of correlation decay over time. The number of clusters is varied from 9 to 24 as stepped wedge CRTs usually include a limited number of clusters. We specify the number of periods 3 ≤ T ≤ 8 as these values are frequently reported in practice, according to recent reviews by the works of Martin et al38 and Grayling et al.39 Finally, the cohort size is chosen as 5 ≤ N ≤ 24 to ensure that the predicted power is at least 80%. For illustration, we focus on standard stepped wedge designs so that there is only one baseline period and the number of steps S = T − 1. In other words, an equal number of I∕S clusters cross over to intervention during each step, and the outcome is measured only once for each individual between consecutive steps. For each scenario, 10 000 data replications were generated and analyzed using both QLS and MAQLS. For the first objective, we report the percent relative bias in estimating τ and ρ. In general, an unbiased approach for estimating the correlation parameters is preferred since accurate reporting of correlations is critical for planning future trials. Web Tables 1 and 2 provide a summary of the simulation scenarios along with the convergence rates. The convergence rates are similar between QLS and MAQLS, and all exceed 97%. For the second objective, we consider both the z-tests and the t-tests for testing the null hypothesis of no intervention effect, coupled with five different variance estimators for , namely, the model-based variance, BC0, BC1, BC2, and BC3. The nominal type I error rate is held fixed at 5%, and we consider an empirical type I error rate between 4.5% and 5.5% to be acceptable based on the margin of error from a binomial model with 10 000 replications. By a similar reasoning, since the predicted power in each scenario is at least 80%, we consider an empirical power that differs by no more than 0.8% from the predicted value to be acceptable.
5.2 |. Results
For the first objective, we summarize in Table 4 and Web Table 3 the percent relative bias in estimating the correlations with QLS and MAQLS. It is evident that the percent bias in estimating the within-period correlation τ is much larger than that in estimating the correlation parameter ρ, without respect to the incorporation of matrix adjustment to the first-stage estimating equations (4) and (5). However, the QLS estimator for τ exhibits noticeable downward bias, especially when the number of clusters is not large. By contrast, MAQLS substantially reduces such finite-sample bias and improves the estimation of τ. On the other hand, the QLS estimator for the parameter ρ seems more accurate in that the absolute percent bias only occasionally exceeds one. Nevertheless, MAQLS still mildly improves the estimation of ρ in that the absolute percent bias is always maintained under one. The comparative findings between QLS and MAQLS are consistent regardless of the magnitude of intervention effect δ. Therefore, MAQLS is the preferred approach because it provides much less biased estimates for the correlation parameters; these more accurate correlation estimates will eventually facilitate accurate estimation of sample size and power for future cohort stepped wedge trials.
TABLE 4.
Correlations | Effect size | Design resource | Percent bias for τ | Percent bias for ρ | |||||
---|---|---|---|---|---|---|---|---|---|
τ | ρ | δ | I | N | T | QLS | MAQLS | QLS | MAQLS |
0.03 | 0.2 | 0 | 18 | 10 | 7 | −26.6 | 3.5 | −0.6 | 0.2 |
0.03 | 0.2 | 0 | 18 | 24 | 4 | −16.0 | 5.5 | −0.3 | 0.0 |
0.03 | 0.2 | 0 | 20 | 14 | 5 | −20.2 | 3.4 | −0.5 | 0.0 |
0.03 | 0.2 | 0 | 21 | 8 | 4 | −29.4 | 3.2 | −0.4 | 0.2 |
0.03 | 0.2 | 0 | 15 | 8 | 4 | −41.3 | 4.7 | −0.6 | 0.3 |
0.03 | 0.8 | 0 | 16 | 12 | 5 | −26.2 | 4.4 | −0.6 | 0.0 |
0.03 | 0.8 | 0 | 24 | 7 | 5 | −27.7 | 0.6 | −0.6 | −0.1 |
0.03 | 0.8 | 0 | 12 | 8 | 5 | −49.4 | 8.2 | −2.1 | −0.4 |
0.03 | 0.8 | 0 | 12 | 5 | 4 | −74.2 | 6.6 | −2.3 | −0.5 |
0.03 | 0.8 | 0 | 10 | 5 | 3 | −91.6 | −0.4 | −1.5 | −0.5 |
0.10 | 0.2 | 0 | 21 | 11 | 8 | −9.2 | 3.9 | −0.6 | 0.6 |
0.10 | 0.2 | 0 | 24 | 11 | 7 | −8.1 | 3.4 | −0.4 | 0.6 |
0.10 | 0.2 | 0 | 15 | 16 | 6 | −11.8 | 7.1 | −0.6 | 1.1 |
0.10 | 0.2 | 0 | 18 | 8 | 7 | −12.7 | 3.9 | −0.8 | 0.7 |
0.10 | 0.2 | 0 | 16 | 7 | 5 | −16.7 | 3.7 | −0.7 | 1.0 |
0.10 | 0.8 | 0 | 20 | 18 | 5 | −8.0 | 5.0 | −0.1 | 0.2 |
0.10 | 0.8 | 0 | 15 | 9 | 4 | −15.2 | 3.1 | −0.2 | 0.2 |
0.10 | 0.8 | 0 | 10 | 20 | 3 | −16.9 | 6.5 | −0.1 | 0.3 |
0.10 | 0.8 | 0 | 12 | 5 | 5 | −27.0 | 2.6 | −1.5 | 0.0 |
0.10 | 0.8 | 0 | 9 | 7 | 4 | −29.7 | 3.8 | −1.0 | 0.1 |
For the second objective, we present the empirical type I error rates of the z-tests and t-tests for the QLS and MAQLS analyses in Web Tables 4 to 6 and Web Figure 3. Overall, we observe that the matrix adjustment to the correlation estimation mildly affects the tests with the model-based variance but has little impact on the tests with the sandwich variance. This is in accordance with the work of Lu et al,40 who observed the same results for the GEE analyses of pretest-posttest CRTs. Since MAQLS provides more accurate estimation of the correlations, we will focus on this approach. Table 5 summarizes the empirical type I error rates of the MAQLS z-tests and t-tests with DoF = I − 2. We leave the results for t-tests with DoF = I − (T + 1) to Web Table 6 as these tests are conservative in many cases. From Table 5, we observe that MAQLS z-tests are more liberal than the corresponding MAQLS t-tests. The type I error rates of the MAQLS z-tests coupled with the model-based variance or BC2 are close to nominal when I ≥ 20, while the MAQLS z-tests coupled with BC0, BC1, or BC3 are always liberal. By contrast, only the MAQLS t-tests with BC0 remain liberal, while MAQLS t-tests with BC1 or BC3 maintain close-to-nominal size and MAQLS t-tests with model-based variance or BC2 are conservative. Overall, the t-tests with DoF = I − 2 and BC1 demonstrate test sizes that consistently agree with the nominal level.
TABLE 5.
z-test | t-test | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
τ | ρ | δ | I | N | T | Pred | MB | BC0 | BC1 | BC2 | BC3 | Pred | MB | BC1 | BC2 | BC2 | BC3 |
0.03 | 0.2 | 0 | 18 | 10 | 7 | 5.0 | 4.8 | 8.4 | 6.8 | 5.1 | 7.1 | 5.0 | 3.4 | 6.3 | 4.7 | 3.8 | 4.9 |
0.03 | 0.2 | 0 | 18 | 24 | 4 | 5.0 | 4.9 | 8.1 | 6.5 | 5.0 | 6.5 | 5.0 | 3.4 | 6.0 | 4.7 | 3.6 | 4.7 |
0.03 | 0.2 | 0 | 20 | 14 | 5 | 5.0 | 5.1 | 8.3 | 6.9 | 5.8 | 7.0 | 5.0 | 3.8 | 6.5 | 5.4 | 4.4 | 5.5 |
0.03 | 0.2 | 0 | 21 | 8 | 4 | 5.0 | 5.2 | 7.9 | 6.6 | 5.3 | 6.6 | 5.0 | 3.8 | 6.2 | 5.0 | 4.0 | 4.9 |
0.03 | 0.2 | 0 | 15 | 8 | 4 | 5.0 | 5.7 | 9.7 | 7.7 | 5.8 | 7.7 | 5.0 | 3.7 | 7.0 | 5.3 | 4.1 | 5.3 |
0.03 | 0.8 | 0 | 16 | 12 | 5 | 5.0 | 5.6 | 9.0 | 7.3 | 5.6 | 7.3 | 5.0 | 3.8 | 6.7 | 5.1 | 3.8 | 5.1 |
0.03 | 0.8 | 0 | 24 | 7 | 5 | 5.0 | 5.3 | 7.7 | 6.6 | 5.5 | 6.6 | 5.0 | 4.3 | 6.2 | 5.2 | 4.3 | 5.3 |
0.03 | 0.8 | 0 | 12 | 8 | 5 | 5.0 | 6.0 | 10.6 | 7.9 | 5.8 | 7.9 | 5.0 | 3.5 | 7.0 | 4.9 | 3.2 | 5.0 |
0.03 | 0.8 | 0 | 12 | 5 | 4 | 5.0 | 5.7 | 10.3 | 7.8 | 5.7 | 7.5 | 5.0 | 3.4 | 6.9 | 5.0 | 3.5 | 4.9 |
0.03 | 0.8 | 0 | 10 | 5 | 3 | 5.0 | 6.3 | 11.2 | 8.2 | 5.6 | 7.1 | 5.0 | 3.0 | 7.0 | 4.6 | 3.1 | 3.8 |
0.10 | 0.2 | 0 | 21 | 11 | 8 | 5.0 | 5.0 | 8.1 | 6.9 | 5.7 | 7.0 | 5.0 | 3.8 | 6.6 | 5.3 | 4.4 | 5.6 |
0.10 | 0.2 | 0 | 24 | 11 | 7 | 5.0 | 5.1 | 7.7 | 6.4 | 5.3 | 6.5 | 5.0 | 3.8 | 6.1 | 5.1 | 4.3 | 5.2 |
0.10 | 0.2 | 0 | 15 | 16 | 6 | 5.0 | 4.8 | 9.7 | 7.5 | 5.7 | 7.8 | 5.0 | 3.1 | 6.8 | 5.2 | 4.0 | 5.3 |
0.10 | 0.2 | 0 | 18 | 8 | 7 | 5.0 | 4.8 | 8.8 | 7.1 | 5.6 | 7.3 | 5.0 | 3.4 | 6.7 | 5.2 | 4.0 | 5.4 |
0.10 | 0.2 | 0 | 16 | 7 | 5 | 5.0 | 5.2 | 9.1 | 7.2 | 5.5 | 7.3 | 5.0 | 3.4 | 6.6 | 5.1 | 3.8 | 5.2 |
0.10 | 0.8 | 0 | 20 | 18 | 5 | 5.0 | 5.3 | 7.9 | 6.4 | 5.2 | 6.5 | 5.0 | 3.8 | 6.1 | 4.8 | 3.9 | 4.8 |
0.10 | 0.8 | 0 | 15 | 9 | 4 | 5.0 | 5.5 | 9.1 | 7.2 | 5.3 | 7.0 | 5.0 | 3.7 | 6.6 | 4.8 | 3.7 | 4.6 |
0.10 | 0.8 | 0 | 10 | 20 | 3 | 5.0 | 5.6 | 11.7 | 8.5 | 6.0 | 7.4 | 5.0 | 2.8 | 7.3 | 4.9 | 3.0 | 4.0 |
0.10 | 0.8 | 0 | 12 | 5 | 5 | 5.0 | 5.8 | 10.5 | 7.9 | 5.7 | 7.9 | 5.0 | 3.3 | 6.9 | 4.9 | 3.3 | 4.9 |
0.10 | 0.8 | 0 | 9 | 7 | 4 | 5.0 | 6.0 | 12.0 | 8.6 | 5.7 | 8.3 | 5.0 | 2.8 | 7.2 | 4.7 | 2.9 | 4.5 |
Web Tables 7 to 9 and Web Figure 4 present the predicted and empirical power for each simulation scenario. Because we are only interested in tests that maintain close-to-nominal sizes, we summarize in Table 6 the differences between the empirical and the predicted power only for the MAQLS z-tests and t-tests with DoF = I − 2. Among the z-tests, only the choice of BC2 provides substantially lower power than predicted. While the choices of model-based variance BC1 and BC3 provide adequate power for the z-tests in a number of scenarios, one should be cautious in adopting these tests with a small number of clusters since they may carry an inflated test size. On the other hand, the empirical power for MAQLS t-tests coupled with the model-based variance, BC1 or BC3 corresponds reasonably well with the analytical prediction from the proposed formula even for as few as nine clusters, while the empirical power for MAQLS t-tests with BC2 still tends to be substantially lower than predicted. Interestingly, the MAQLS t-test with DoF = I − (T + 1) coupled with model-based variance, BC1 or BC3 also demonstrates empirical power fairly close to prediction, even though these tests are more conservative under the null. These tests may not be preferred over the t-tests with DoF = I − 2 since they are less powerful.
TABLE 6.
z-test | t-test | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
τ | ρ | δ | I | N | T | Pred | MB | BC0 | BC1 | BC2 | BC3 | Pred | MB | BC0 | BC1 | BC2 | BC3 |
0.03 | 0.2 | 0.3 | 18 | 10 | 7 | 89.9 | −0.5 | 1.1 | −1.0 | −3.4 | −0.9 | 86.0 | 0.2 | 2.4 | −0.1 | −3.4 | 0.2 |
0.03 | 0.2 | 0.3 | 18 | 24 | 4 | 88.6 | −0.4 | 1.5 | −0.4 | −2.9 | −0.5 | 84.4 | 0.5 | 3.2 | 0.5 | −2.5 | 0.4 |
0.03 | 0.2 | 0.3 | 20 | 14 | 5 | 89.7 | 0.0 | 1.5 | −0.5 | −2.6 | −0.3 | 86.2 | 0.3 | 2.6 | 0.2 | −2.5 | 0.4 |
0.03 | 0.2 | 0.4 | 21 | 8 | 4 | 87.5 | −0.5 | 1.3 | −0.8 | −3.4 | −0.9 | 83.9 | −0.1 | 2.2 | −0.5 | −3.2 | −0.6 |
0.03 | 0.2 | 0.5 | 15 | 8 | 4 | 90.7 | −1.1 | 1.0 | −1.2 | −3.9 | −1.1 | 85.9 | −0.3 | 3.0 | −0.2 | −4.3 | −0.3 |
0.03 | 0.8 | 0.2 | 16 | 12 | 5 | 88.6 | −0.5 | 1.2 | −1.4 | −4.4 | −1.3 | 83.8 | 0.1 | 2.5 | −0.6 | −4.4 | −0.6 |
0.03 | 0.8 | 0.2 | 24 | 7 | 5 | 88.2 | −0.5 | 1.1 | −0.6 | −2.4 | −0.7 | 85.2 | 0.2 | 1.8 | 0.0 | −2.3 | 0.0 |
0.03 | 0.8 | 0.3 | 12 | 8 | 5 | 94.1 | −1.1 | 0.4 | −1.8 | −4.6 | −1.9 | 88.7 | 0.3 | 2.7 | −0.4 | −4.8 | −0.3 |
0.03 | 0.8 | 0.4 | 12 | 5 | 4 | 95.2 | −0.8 | 0.7 | −1.2 | −3.8 | −1.5 | 90.3 | 0.5 | 2.8 | −0.3 | −4.1 | −0.6 |
0.03 | 0.8 | 0.5 | 10 | 5 | 3 | 94.6 | −0.7 | 1.1 | −1.1 | −4.8 | −2.3 | 87.8 | 0.8 | 4.3 | −0.3 | −6.0 | −2.4 |
0.10 | 0.2 | 0.3 | 21 | 11 | 8 | 87.8 | −0.6 | 1.5 | −0.4 | −2.8 | −0.1 | 84.3 | −0.1 | 2.6 | 0.1 | −2.8 | 0.5 |
0.10 | 0.2 | 0.3 | 24 | 11 | 7 | 87.8 | −0.6 | 1.2 | −0.3 | −2.5 | −0.1 | 84.8 | 0.0 | 2.1 | 0.2 | −2.2 | 0.4 |
0.10 | 0.2 | 0.4 | 15 | 16 | 6 | 90.6 | −1.3 | 1.2 | −1.0 | −4.3 | −0.8 | 85.8 | −0.8 | 2.9 | −0.3 | −4.1 | 0.1 |
0.10 | 0.2 | 0.4 | 18 | 8 | 7 | 91.6 | −0.1 | 1.5 | −0.4 | −2.7 | −0.1 | 87.9 | 0.4 | 2.7 | 0.3 | −2.5 | 0.6 |
0.10 | 0.2 | 0.5 | 16 | 7 | 5 | 88.6 | −0.8 | 1.9 | −0.6 | −3.5 | −0.4 | 83.8 | −0.1 | 3.4 | 0.3 | −3.5 | 0.6 |
0.10 | 0.8 | 0.2 | 20 | 18 | 5 | 86.1 | −0.5 | 1.8 | −0.5 | −3.0 | −0.6 | 82.1 | −0.3 | 2.7 | 0.0 | −3.2 | 0.0 |
0.10 | 0.8 | 0.3 | 15 | 9 | 4 | 89.5 | −1.0 | 1.0 | −1.5 | −4.6 | −1.8 | 84.5 | −0.1 | 2.6 | −0.8 | −4.8 | −1.1 |
0.10 | 0.8 | 0.4 | 10 | 20 | 3 | 94.4 | −0.6 | 1.6 | −0.8 | −4.3 | −2.1 | 87.5 | 0.4 | 4.6 | 0.6 | −5.3 | −1.8 |
0.10 | 0.8 | 0.4 | 12 | 5 | 5 | 93.2 | −0.5 | 0.9 | −1.3 | −4.4 | −1.4 | 87.5 | 0.7 | 3.2 | −0.3 | −4.8 | −0.3 |
0.10 | 0.8 | 0.5 | 9 | 7 | 4 | 97.3 | −0.6 | 0.4 | −1.1 | −4.2 | −1.4 | 91.4 | 1.1 | 3.5 | −0.2 | −5.7 | −0.7 |
6 |. NUMERICAL ILLUSTRATIONS
6.1 |. The AEP study
We illustrate the proposed sample size procedure to design a cohort stepped wedge CRT that aims to study the effect of an exercise intervention on the physical function of patients with end-stage renal disease.41 The intervention was an accredited exercise physiologist (AEP) coordinated resistance exercise program, offered at hemodialysis clinics to improve the quality of life for dialysis patients. During the planning phase, it was determined that I = 15 clinics (clusters) were available, and would be randomized over T = 4 periods evenly spaced across 48 weeks. At baseline, no exercise programs were offered to any clinic. At week 12, 36 and 48, a random subset of 5 clinics cross over from control to intervention. A closed cohort of patients was recruited at baseline, and would be followed up during the study period. The primary patient-level outcome was the 30-second sit-to-stand test, recording the number of times a patient could rise from and return to a seated position in a 30-second time frame. The 30-second sit-to-stand test was conducted at the end of each period, resulting in four outcome measurements per patient. Based on a prior study within a similar context, a conservative estimate of the effect size was given by δ∕ϕ = 0.325,42 and the within-period correlation was estimated to be τ = 0.03.43 With I = 15 clusters, the simulations suggest the MAQLS t-test with DoF = I − 2 = 13 could maintain nominal size and adequate power; we illustrate the power calculation based on the t-test statistic.
Given this is a standard stepped wedge design where an equal number of clinics switch to intervention at each step, we can show that U = IT∕2, W = I2T(2T − 1)∕{6(T − 1)}, V = I(T − 2)∕2 and Q = I2T(T − 2)∕{3(T − 1)}. The variance expression (13) is then simplified to
(16) |
If we anticipate large correlation decay so that ρ = 0.2, the power is estimated using Equations (11) and (16) to be 79.4% if N = 21 and 80.5% if N = 22. Therefore, at least N = 22 patients should be recruited in each clinic to achieve 80% power under the proportional decay structure. On the other hand, we could arrive at the same results by using the DE formula (15). For example, in an individual randomized study, 348 patients would be required for the hypothesized effect size. Assuming 21 patients will be included in each clinic, the DE is approximately 0.92, indicating a total of 320 patients in approximately 15.2 clinics would be required. Since the study affords to randomize only 15 clinics, we increase the cohort size to N = 22, resulting in a DE 0.94. Therefore, 326 patients are required for a total of 326∕22 ≈ 14.8 clinics, and we conclude that 22 patients in 15 clinics ensured 80% power.
While the within-period correlation estimate was available from prior studies, published estimates of the correlation parameter ρ (or decay parameter d = 1 − ρ) are currently rare. For this reason, we carry out a sensitivity analysis on the power and present the results in panel (A) of Figure 2, where we fix the design resources but vary τ ∈ (0.03, 0.06) and ρ ∈ (0, 1). Note that the upper bound of the within-period correlation τ was reported by the work of Littenberg and MacLean43 and is used in this assessment. As expected, larger values of the within-period correlation reduce the study power, and furthermore, given a certain value of the within-period correlation, a greater magnitude of decay (smaller ρ or larger d) generally reduces the study power unless ρ ≈ 0 (or d ≈ 1). For the hypothesized τ = 0.03, the study power remains at least close to 80% regardless of the correlation decay. On the other hand, the amount of correlation decay could result in further power loss if the within-period correlation τ increases. Nevertheless, the power loss is at most around 10% even if the within-period correlation τ approximates the upper bound 0.06.
6.2 |. The CORE study
We next illustrate the proposed sample size procedure by designing the CORE stepped wedge trial. The CORE study aims to evaluate the patient-centered service design in health providers to improve the psychosocial recovery outcomes for people with severe mental illness in Australia.44 The new service design intervention adopted the experience-based co-design to identify users’ positive and negative experiences of the service, and involved patients’ participation to co-design solutions to the negative experiences. A total of I = 11 teams from four health service providers would be participating the study; each team involved a number of service users who will be affected by the intervention. A stepped wedge design was considered appropriate for the study due to logistical constraint in simultaneously introducing the intervention to more than a few teams. The experience-based co-design intervention will be delivered to the clusters in three waves, each with a duration of 9 months. Four teams will start the intervention in wave 1 and wave 2, respectively, while the remaining three teams receive the intervention in the final wave. In other words, the study includes four periods, with a baseline period lasting about 6 months.
The outcome of interest is the improvement in psychosocial recovery measured by the recovery assessment scale revised,45 and was measured for each user at the end of baseline period and each of the three follow-up period. The standardized effect size on the psychosocial recovery outcome was estimated to be 0.35, and the within-period correlation was assumed to be τ = 0.1.44 Since the study affords to randomize only 11 clusters, there may be a risk of inflated type I error rate with a z-test. As the t-test with DoF = I − 2 = 9 performs best with respect to empirical size and power in the simulations, we determine the required cohort size based on a test using expressions (12) and (13). Assuming the correlation decay is only moderate so that ρ = 0.8, power is estimated to be 0.79 for N = 8 and 0.81 for N = 9, barring drop out. Therefore, N = 9 is required to ensure 80% power given a 5% test size. We further conducted a sensitivity analysis to see how power changes according to the degree of correlation decay, and presented the power contour in panel (B) of Figure 2. Due to the small sample size and the heavy tail of the t distribution, the study is sensitive to correlation decay when τ = 0.1, and remains so even if τ approaches zero. On the other hand, the actual study planned to recruit N = 30 users in each team. With this larger cohort size, panel (C) of Figure 2 suggests that the power becomes less sensitive to the correlation decay, especially as the within-period correlation approaches zero. For example, if τ ≤ 0.02, the study power remains at least around 80% regardless of the amount of correlation decay.
6.3 |. Consequences of specifying a nondecay correlation structure
For GEE analyses of cohort stepped wedge designs, Li et al5 recently developed the block exchangeable correlation structure for estimating the sample size and power. With a continuous outcome and identity link function, the block exchangeable correlation structure is also implied by the linear mixed effects model for cohort studies discussed in the works of Hooper et al,8 and Girling and Hemming.46 To understand the implications of alternative correlation models, we compare the variance of the intervention effect estimator obtained under the proportional decay correlation structure to that obtained under the block exchangeable correlation structure. Specifically, both the proportional decay structure and the block exchangeable correlation structure assume a constant within-period correlation, ie, corr(yijt, yij′t) = τ for j ≠ j′. However, the latter correlation model also assumes constant between-period and within-individual correlations such that for j ≠ j′, t ≠ t′ and , t ≠ t′. To focus ideas, we carry out the comparisons based on the standard stepped wedge designs with a single baseline period and an equal number of clinics switching to intervention at each step. We refer to the variance of obtained under the proportional decay structure as , whose expression is provided in (16). We refer to the variance of obtained under the block exchangeable structure as , whose expression is derived in the work of Li et al5 as
(17) |
with and as the two distinct eigenvalues of the block exchangeable matrix. If we define function , we can write the relative variance as
(18) |
For each value of the within-period correlation τ and each value of the decay parameter d = 1 − ρ in the proportional decay model, there may exist pairs of values for (, ) that result in the same variance of the intervention effect. Obtaining the equal variance correspondence is equivalent to finding the straight line that solves . Because the relative variance is quadratic in , such a line may not always exist within the plausible range of (, ) that ensures a positive definite block exchangeable correlation matrix. We confirm this observation by plotting the relative variance for varying correlation parameters. Fixing τ = 0.03, T = 4 and N = 20 similar to the accredited exercise physiologist study, we present in Figure 3 the contour of over the (, ) plane for each level of decay d ∈ {0, 1, …, 0.9}. The dashed thick line indicates the equal variance correspondence. We also present the contour plots by specifying N = 100, T = 8, and τ = 0.1 in Web Figures 5 to 7. It is evident that the existence and location of the equal variance line on the (, ) plane depends on the value of correlation decay, cohort size and number of periods. If the equal variance line exists, the value of τ and number of periods only affects its location while cohort size N further affects its slope, as expected from inspecting expression (18). Apart from the equal variance correspondence, the two variances are generally different. Depending on the gray shades (colored shades in the online version), either the proportional decay or the block exchangeable structure will lead to a larger variance of the intervention effect and require a larger sample size. As a result, using the block exchangeable correlation structure in the presence of correlation decay could lead to an overestimation or underestimation of the true intervention effect variance, and vice versa. Particularly, the variances returned from these two correlation models become close when , approximate zero and when the decay parameter, d = 1 − ρ, approximates one. These two restrictions result in many nearly-zero entries in the block exchangeable and proportional decay correlation matrices, increasing the dependence of sample size estimation on the within-period correlation τ. Therefore, it is anticipated that the variances from the two models become similar in this particular scenario, even though in general there could be large differences between and . To summarize, our key message for cohort stepped wedge designs echo the principal findings in the work of Kasza et al10 for cross-sectional designs. It is possible to grossly overestimate or underestimate the variance of the intervention effect if the correlation model is misspecified, except in restrictive scenarios. In practice, researchers could investigate the sensitivity of sample size estimates to misspecification of the correlation structure when there is limited preliminary data at the design stage of the trial.
7 |. DISCUSSION
This article expanded on the design and analysis considerations for cohort stepped wedge CRTs in the presence of correlation decay. Since a cohort design involves repeated outcome assessments for fixed sets of individuals, we adopted the proportional decay structure of Lefkopoulou et al16 to characterize the within-cluster correlations among the outcome measurements. Based on a marginal mean model accounting for the treatment and period effects, we developed a new sample size and power procedure to design stepped wedge CRTs accounting for such correlation decay. To apply this procedure, a key step is to obtain reasonable values for the correlation parameters. The within-period correlation, τ, is similar to the traditional ICC in a parallel CRT, and may often be found in previous studies with a similar endpoint. By contrast, the correlation parameter, ρ (or decay parameter d = 1 − ρ), is not as commonly reported in the literature, and therefore the sensitivity of power should be investigated across a range of values for ρ, as illustrated in Section 6. Given that accurate reporting of correlations is vitally important for designing future stepped wedge trials, we also provided an improved MAQLS approach to estimate the correlation parameters along with the marginal mean parameters. The MAQLS has little impact on the estimation of the marginal mean parameters and the associated statistical tests coupled with the sandwich variance, but it substantially reduces the bias in estimating the within-period correlation τ and mildly improves the estimation of ρ, as confirmed in our simulation study.
In our simulation study with a small number of clusters, we found that, regardless of choice of variance estimators, the t-tests provide better control of the type I error rates compared to the z-tests, which are often liberal. Regardless of the two choices of DoF, the t-tests coupled with the model-based variance, BC1 and BC3 preserve the nominal size and demonstrate empirical power that agrees well with analytical prediction. Since the t distribution with a smaller DoF has a heavier tail, which implies a smaller power under the alternative, we prefer the t-tests with DoF = I − 2 for the design and analysis of cohort stepped wedge CRTs under the proportional correlation decay structure. An additional piece of evidence that favors DoF = I − 2 is found in the recent simulation study by Ford,35 who showed that the GEE t-tests with DoF = I − 2 provide satisfactory control of type I error rates even for I = 6 but in the absence of correlation decay. Additional work is needed to investigate whether these extremely small sample sizes could provide adequate power under the proportional decay structure. On the other hand, although the t-tests with DoF = I − (T + 1) could provide adequate power using the model-based variance, BC1 or BC3, they are frequently conservative under the null with a small number of clusters. In fact, one needs a minimum of T+2 clusters to provide at least one DoF, which further limits the applications of such a t-test in the design and analysis of small stepped wedge trials.
Recent reviews of stepped wedge CRTs9,38,39 indicated that both the cross-sectional and cohort designs were common in practice. Although we have developed the design and analysis strategies specifically for cohort stepped wedge CRTs, a parallel discussion for cross-sectional stepped wedge CRTs could be equally informative. As discussed in Section 3.1, the exponential decay structure is often used to model correlation decay in multiperiod CRTs with repeated cross-sectional samples.10 Assuming there are Ni observations in each period for each cluster, we could write the exponential decay structure as Li(τ, ρ) = (1 − τ)ITNi + τJN ⊗ F(ρ) without changing the interpretation of τ and ρ from the cohort setting. To estimate the intervention effect and correlation parameters, the MAQLS approach could still be applied once we replace the second-stage estimating equations 6 and (7) by
Such modifications are necessary because τ and ρ are no longer separable in Li(τ, ρ), and hence, updates for τ and ρ do not come in closed forms. Correspondingly, the inseparability between τ and ρ also precludes the derivation of an analytical inverse , and therefore, one may not be able to obtain a simple algebraic expression for . As a result, sample size and power calculation requires numerically inverting the correlation matrix Li(τ, ρ). In fact, with a continuous outcome and the identity link, it is straightforward to show that the QLS-based sample size procedure with Li(τ, ρ) coincides with the mixed-effects model-based sample size procedure developed in the work of Kasza et al10 with exponential correlation decay.
We have assumed that each individual has complete follow-up during the study. In reality, individual dropout may be anticipated and could be accounted for in the design phase. Given an expected attrition rate γ, a simple and commonly used strategy is to inflate the required total sample size by 1∕(1 − γ), so that a complete-trajectory GEE analysis may provide adequate power if the drop-out or missingness is completely at random.47 More sophisticated approaches that deal with monotone missingness have been studied for repeated-measure randomized trials and may be adapted to the stepped wedge context by considering staggered treatment assignments and appropriate levels of clustering.48,49 In any event, trial implementation methodologies to prevent attrition bias in longitudinal CRTs merit further investigation.50
One simplification we made in the sample size and power calculations was to assume equal cluster (cohort) sizes. It has been shown that cluster size imbalance leads to reduced power in parallel CRTs and therefore may be accounted for in the design phase.51 For a stepped wedge trial, Girling52 computed the relative efficiency of unequal versus equal cluster sizes by assuming a linear mixed-effects model without correlation decay. It was concluded that the efficiency loss due to unequal cluster sizes is unlikely to exceed 12% across a wide range design of resources and correlation values. Nevertheless, a corresponding expression for the relative efficiency accounting for correlation decay is currently not available and should merit additional study. The availability of such expressions for relative efficiency could inform the amount of additional design resources required to compensate the efficiency loss due to unequal cluster sizes. Another limitation of our design strategy is that we have assumed the proportional decay correlation is the correctly specified within-cluster dependency structure. However, both the QLS or MAQLS estimators for the intervention effect remain consistent even if the correlation structure is misspecified. If it is anticipated in the design phase that the working correlation may be misspecified, one could follow the general idea of Rochon48 to develop a modified sample size procedure based on the sandwich variance.
Finally, we have assumed a continuous outcome and an identity link function, corresponding to the scenarios of the illustrative examples. In practice, categorical outcomes could be collected in cohort stepped wedge designs. Under correlation decay, we could extend the QLS-based sample size procedure to accommodate binary and count outcomes by following the steps outlined in section 3.2 of the work of Li et al.5 In those cases, a further complication is that the variance is an explicit function of the marginal mean, and so the magnitude of the period effects necessarily affects the variance for the intervention effect. Future work is required to investigate the operating characteristics of such extensions for calculating sample size and power with binary and count outcomes.
Supplementary Material
ACKNOWLEDGEMENTS
This work was supported within the National Institutes of Health (NIH) Health Care Systems Research Collaboratory by the NIH Common Fund through cooperative agreement U24AT009676 from the Office of Strategic Coordination within the Office of the NIH Director and cooperative agreement UH3DA047003 from the National Institute on Drug Abuse. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health. The author thanks Dr John S. Preisser at the University of North Carolina at Chapel Hill for providing thoughtful comments to an earlier version of this article. The author is also grateful for the associate editor and two anonymous reviewers for providing helpful comments, which improved the exposition of this work.
Funding information
NIH Health Care Systems Research Collaboratory, National Institute on Drug Abuse, Grant/Award Number: UH3DA047003
Footnotes
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of the article.
CONFLICT OF INTEREST
The author declares no potential conflict of interests.
DATA AVAILABILITY STATEMENT
The illustrative examples only concern sample size and power estimation, so do not involve analysis of actual data sets. The R code for fitting the matrix-adjusted quasi-least squares with continuous outcomes may be found online in the supporting information tab for this article
REFERENCES
- 1.Murray DM. Design and Analysis of Group-Randomized Trials. New York, NY: Oxford University Press; 1998. [Google Scholar]
- 2.Donner A, Klar N. Design and Analysis of Cluster Randomization Trials in Health Research. Hoboken, NJ: Wiley; 2000. [Google Scholar]
- 3.Turner EL, Li F, Gallis JA, Prague M, Murray DM. Review of recent methodological developments in group-randomized trials: part 1–design. Am J Public Health. 2017;107(6):907–915. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Hussey MA, Hughes JP. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials. 2007;28(2):182–191. [DOI] [PubMed] [Google Scholar]
- 5.Li F, Turner EL, Preisser JS. Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics. 2018;74(4):1450–1458. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Li F, Forbes AB, Turner EL, Preisser JS. Power and sample size requirements for GEE analyses of cluster randomized crossover trials. Statist Med. 2019;38(4):636–649. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Hemming K, Lilford R, Girling AJ. Stepped-wedge cluster randomised controlled trials: a generic framework including parallel and multiple-level designs. Statist Med. 2015;34(2):181–196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Hooper R, Teerenstra S, de Hoop E, Eldridge S. Sample size calculation for stepped wedge and other longitudinal cluster randomised trials. Statist Med. 2016;35(26):4718–4728. [DOI] [PubMed] [Google Scholar]
- 9.Barker D, McElduff P, D’Este C, Campbell MJ. Stepped wedge cluster randomised trials: a review of the statistical methodology used and available. BMC Med Res Methodol. 2016;16(69):1–19. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Kasza J, Hemming K, Hooper R, Matthews JNS, Forbes AB. Impact of non-uniform correlation structure on sample size and power in multiple-period cluster randomised trials. Stat Methods Med Res. 2017;28(3):703–716. 10.1177/0962280217734981 [DOI] [PubMed] [Google Scholar]
- 11.Kasza J, Forbes AB. Inference for the treatment effect in multiple-period cluster randomised trials when random effect correlation structure is misspecified. Stat Methods Med Res. 2018;28(10–11):3112–3122. 10.1177/0962280218797151 [DOI] [PubMed] [Google Scholar]
- 12.Grantham KL, Kasza J, Heritier S, Hemming K, Forbes AB. Accounting for a decaying correlation structure in cluster randomized trials with continuous recruitment. Statist Med. 2019;38(11):1918–1934. 10.1002/sim.8089 [DOI] [PubMed] [Google Scholar]
- 13.Copas AJ, Lewis JJ, Thompson JA, Davey C, Baio G, Hargreaves JR. Designing a stepped wedge trial: Three main designs, carry-over effects and randomisation approaches. Trials. 2015;16(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Li F, Turner EL, Preisser JS. Optimal allocation of clusters in cohort stepped wedge designs. Stat Probab Lett. 2018;137:257–263. [Google Scholar]
- 15.Preisser JS, Young ML, Zaccaro DJ, Wolfson M. An integrated population-averaged approach to the design, analysis and sample size determination of cluster-unit trials. Statist Med. 2003;22(8):1235–1254. [DOI] [PubMed] [Google Scholar]
- 16.Lefkopoulou M, Moore D, Ryan L. The analysis of multiple correlated binary outcomes: application to rodent teratology experiments. J Am Stat Assoc. 1989;84(407):810–815. [Google Scholar]
- 17.Shults J, Morrow AL. Use of quasi-least squares to adjust for two levels of correlation. Biometrics. 2002;58(3):521–530. [DOI] [PubMed] [Google Scholar]
- 18.Liu A, Shih WJ, Gehan E. Sample size and power determination for clustered repeated measurements. Statist Med. 2002;21(12):1787–1801. [DOI] [PubMed] [Google Scholar]
- 19.Liang K-Y, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73(1):13–22. [Google Scholar]
- 20.Chaganty NR. An alternative approach to the analysis of longitudinal data via generalized estimating equations. J Stat Plan Inference. 1997;63(1):39–54. [Google Scholar]
- 21.Chaganty NR, Shults J. On eliminating the asymptotic bias in the quasi-least squares estimate of the correlation parameter. J Stat Plan Inference. 1999;76(1):145–161. [Google Scholar]
- 22.Woertman W, de Hoop E, Moerbeek M, Zuidema SU, Gerritsen DL, Teerenstra S. Stepped wedge designs could reduce the required sample size in cluster randomized trials. J Clin Epidemiol. 2013;66(7):752–758. [DOI] [PubMed] [Google Scholar]
- 23.Shults J, Hilbe JM. Quasi-Least Squares Regression. Boca Raton, FL: Chapman and Hall/CRC; 2014. [Google Scholar]
- 24.Preisser JS, Qaqish BF. Deletion diagnostics for generalised estimating equations. Biometrika. 1996;83(3):551–562. [Google Scholar]
- 25.Preisser JS, Lu B, Qaqish BF. Finite sample adjustments in estimating equations and covariance estimators for intracluster correlations. Statist Med. 2008;27(27):5764–5785. [DOI] [PubMed] [Google Scholar]
- 26.Prentice RL. Correlated binary regression with covariates specific to each binary observation. Biometrics. 1988;44(4):1033–1048. [PubMed] [Google Scholar]
- 27.Turner EL, Prague M, Gallis JA, Li F, Murray DM. Review of recent methodological developments in group-randomized trials: part 2–analysis. Am J Public Health. 2017;107(7):1078–1086. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Kauermann G, Carroll RJ. A note on the efficiency of sandwich covariance matrix estimation. J Am Stat Assoc. 2001;96(456):1387–1396. [Google Scholar]
- 29.Mancl LA, DeRouen TA. A covariance estimator for GEE with improved small-sample properties. Biometrics. 2001;57(1):126–134. [DOI] [PubMed] [Google Scholar]
- 30.Fay MP, Graubard BI. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics. 2001;57(4):1198–1206. [DOI] [PubMed] [Google Scholar]
- 31.Scott JM, DeCamp A, Juraska M, Fay MP, Gilbert PB. Finite-sample corrected generalized estimating equation of population average treatment effects in stepped wedge cluster randomized trials. Stat Methods Med Res. 2017;26(2):583–597. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Li F, Turner EL, Heagerty PJ, Murray DM, Vollmer WM, Delong ER. An evaluation of constrained randomization for the design and analysis of group-randomized trials with binary outcomes. Statist Med. 2017;36:3791–3806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Teerenstra S, Lu B, Preisser JS, Van Achterberg T, Borm GF. Sample size considerations for GEE analyses of three-level cluster randomized trials. Biometrics. 2010;66(4):1230–1237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Grayling MJ, Wason JMS, Mander AP. Group sequential designs for stepped-wedge cluster randomised trials. Clinical Trials. 2017;14(5):507–517. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Ford WP. Improved Standard Error Estimation for Maintaining the Validities of Inference in Small-Sample Cluster Randomized Trials and Longitudinal Studies [dissertation]. Lexington, KY: Theses and Dissertations–Epidemiology and Biostatistics, University of Kentucky; 2018. [Google Scholar]
- 36.Ford WP, Westgate PM. Improved standard error estimator for maintaining the validity of inference in cluster randomized trials with a small number of clusters. Biometrical Journal. 2017;59(3):478–495. [DOI] [PubMed] [Google Scholar]
- 37.Shih WJ. Sample size and power calculations for periodontal and other studies with clustered samples using the method of generalized estimating equations. Biometrical Journal. 1997;39(8):899–908. [Google Scholar]
- 38.Martin J, Taljaard M, Girling A, Hemming K. Systematic review finds major deficiencies in sample size methodology and reporting for stepped-wedge cluster randomised trials. BMJ Open. 2016;6:e010166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Grayling MJ, Wason JMS, Mander AP. Stepped wedge cluster randomized controlled trial designs : a review of reporting quality and design features. Trials. 2017;18(33):1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Lu B, Preisser JS, Qaqish BF, Suchindran C, Bangdiwala SI, Wolfson M. A comparison of two bias-corrected covariance estimators for generalized estimating equations. Biometrics. 2007;63(3):935–941. [DOI] [PubMed] [Google Scholar]
- 41.Bennett PN, Daly RM, Fraser SF, et al. The impact of an exercise physiologist coordinated resistance exercise program on the physical function of people receiving hemodialysis: a stepped wedge randomised control study. BMC Nephrology. 2013;14(1):204–210. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Cappy CS, Jablonka J, Schroeder ET. The effects of exercise during hemodialysis on physical performance and nutrition assessment. J Ren Nutr. 1999;9(2):63–70. [DOI] [PubMed] [Google Scholar]
- 43.Littenberg B, MacLean CD. Intra-cluster correlation coefficients in adults with diabetes in primary care practices: the Vermont Diabetes Information System field survey. BMC Med Res Methodol. 2006;6(1):20–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Palmer VJ, Chondros P, Piper D, et al. The CORE study protocol: a stepped wedge cluster randomised controlled trial to test a co-design technique to optimise psychosocial recovery outcomes for people affected by mental illness in the community mental health setting. BMJ Open. 2015;5:e006688. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Lusczakoski KD, Olmos-Gallo PA, McKinney CJ, Starks R, Huff S. Measuring recovery related outcomes: a psychometric investigation of the recovery markers inventory. Community Ment Health J. 2014;50:896–902. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Girling AJ, Hemming K. Statistical efficiency and optimal design for stepped cluster studies under linear mixed effects models. Statist Med. 2016;35(13):2149–2166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Little RJ, Rubin DB. Statistical Analysis with Missing Data. 2nd ed.Hoboken, NJ: Wiley; 2002. Hardcover. [Google Scholar]
- 48.Rochon J. Application of GEE procedures for sample size calculations in repeated measures experiments. Statist Med. 1998;17(14): 1643–1658. [DOI] [PubMed] [Google Scholar]
- 49.Jung S-H, Ahn C. Sample size estimation for GEE method for comparing slopes in repeated measurements data. Statist Med. 2003;22(8): 1305–1315. [DOI] [PubMed] [Google Scholar]
- 50.Prost A, Binik A, Abubakar I, et al. Logistic, ethical, and political dimensions of stepped wedge trials: critical review and case studies. Trials. 2015;16(1):351–361. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Eldridge SM, Ashby D, Kerry S. Sample size for cluster randomized trials: Effect of coefficient of variation of cluster size and analysis method. Int J Epidemiol. 2006;35(5):1292–1300. [DOI] [PubMed] [Google Scholar]
- 52.Girling AJ. Relative efficiency of unequal cluster sizes in stepped wedge and other trial designs under longitudinal or cross-sectional sampling. Statist Med. 2018;37(30):4652–4664. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.