Biostatistics (Oxford, England). 2022 Jul 21;24(4):833–849. doi: 10.1093/biostatistics/kxac026

Designing three-level cluster randomized trials to assess treatment effect heterogeneity

Fan Li 1, Xinyuan Chen 2, Zizhong Tian 3, Denise Esserman 4, Patrick J Heagerty 5, Rui Wang 6
PMCID: PMC10583727  PMID: 35861621

SUMMARY

Cluster randomized trials often exhibit a three-level structure with participants nested in subclusters such as health care providers, and subclusters nested in clusters such as clinics. While the average treatment effect has been the primary focus in planning three-level randomized trials, interest is growing in understanding whether the treatment effect varies among prespecified patient subpopulations, such as those defined by demographics or baseline clinical characteristics. In this article, we derive novel analytical design formulas based on the asymptotic covariance matrix for powering confirmatory analyses of treatment effect heterogeneity in three-level trials, which are broadly applicable to the evaluation of cluster-level, subcluster-level, and participant-level effect modifiers and to designs where randomization can be carried out at any level. We characterize a nested exchangeable correlation structure for both the effect modifier and the outcome conditional on the effect modifier, and generate new insights from a study design perspective for conducting analyses of treatment effect heterogeneity based on a linear mixed analysis of covariance model. A simulation study is conducted to validate our new methods and two real-world trial examples are used for illustrations.

Keywords: Design effect, Effect modification, Heterogeneous treatment effect, Intraclass correlation coefficient, Nested exchangeable correlation structure, Power calculation

1. Introduction

Pragmatic cluster randomized trials (CRTs) are increasingly popular for comparative effectiveness research within health care delivery systems based on real-world interventions and broadly representative patient populations. While the average treatment effect (ATE) has typically been the primary focus in planning pragmatic trials, interest is growing in understanding whether the treatment effect varies among prespecified patient subpopulations, such as those defined by demographics or baseline clinical characteristics. By design, the flexible inclusion of a range of clusters and patients to mimic real-world practice in pragmatic trials naturally induces more heterogeneity, an aspect that should be reflected at the design stage and which invites the study of the associated factors that may lead to variation in treatment effects.

While data-driven exploratory heterogeneity of treatment effect (HTE) analysis can be ad hoc and used to generate hypotheses for future studies, confirmatory HTE analysis is often hypothesis driven, and carried out based on prespecified effect modifiers identified from either prior data or subject-matter knowledge (Wang and Ware, 2013). A recent systematic review reported that 16 out of 64 health-related CRTs published between 2010 and 2016 examined HTE among prespecified demographic patient subgroups but also noted a lack of methods for designing CRTs to enable the assessment of confirmatory HTE (Starks and others, 2019). To this end, Yang and others (2020) proposed an analytical power calculation procedure for HTE analysis in CRTs with two-level data where patients are nested in clusters. Although planning CRTs for assessing the ATE only requires accounting for the outcome intraclass correlation coefficient (ICC), planning CRTs for assessing the HTE additionally requires accounting for the covariate-ICC (or equivalently, ICC of the effect modifier). The covariate-ICC captures the fraction of between-cluster covariate variation relative to the total of between- and within-cluster covariate variation and plays an important role in sample size determination for HTE analysis.

Although these two distinct types of ICCs were identified as essential ingredients for estimating the power of the HTE test in two-level CRTs, analytical design formulas are currently unavailable for pragmatic trials exhibiting a three-level structure. As a first example, the Health and Literacy Intervention (HALI) trial in Kenya is a three-level CRT studying a literacy intervention to improve early literacy outcomes (Jukes and others, 2017). In the HALI trial, continuous literacy outcomes are collected for children (participants), who are nested within schools (subclusters), which are further nested in Teacher Advisory Center (TAC) tutor zone (cluster); randomization is carried out at the TAC tutor zone level. Beyond studying the ATE on the literacy outcomes, one may be interested in detecting effect modification by different levels of the baseline literacy measurements to see whether the intervention is more effective in improving spelling or reading test scores among those with lower baseline scores. Power analysis for such an interaction test requires accounting for the covariate-ICCs and outcome-ICCs within each school and across different schools. In a second example, the Strategies to Reduce Injuries and Develop Confidence in Elders (STRIDE) trial (Gill and others, 2021) aims to study the effect of a multifactorial fall prevention intervention program and recruited patients (participants) nested within clinics (subclusters), which were nested in health care systems (clusters); the randomization was at the clinic level and therefore the study can be considered as a subcluster randomized trial. The study included prespecified HTE analysis with patient-level effect modifiers such as age and gender, among others. 
If the interest lies in detecting potential intervention effect modification by gender on the continuous outcome, the concern score for falling, then the power analysis for such an interaction test would also require accounting for the covariate-ICCs and outcome-ICCs within each clinic and between different clinics. In both examples, as we discuss in Section 2.1, a direct application of the design formula developed in two-level CRTs can result in either conservative or anticonservative sample size estimates; therefore, the three-level hierarchical structure clearly requires new methods that enable a precise understanding of how both participant- and subcluster-level characteristics moderate the impact of new care innovations. Addressing this gap in three-level designs is further complicated by the fact that not only can the randomization be carried out at either the subcluster or cluster level but also the effect modifier can be measured at each level. To date, existing methods for power analysis in three-level trials were mainly developed for studying the ATE, with a closed-form design effect jointly determined by the between-subcluster outcome-ICC and the within-subcluster outcome-ICC (Heo and Leon, 2008; Teerenstra and others, 2010). Dong and others (2018) have developed sample size formulas for HTE analysis with a univariate effect modifier under cluster-level randomization. However, the essential role of covariate-ICCs was neglected in their design formulas, which can lead to underpowered trials, as we demonstrate in Section 3. Beyond Dong and others (2018), computationally efficient and yet accurate power analysis procedures are lacking to ensure sufficient power for HTE analyses in three-level trials.

In this article, we contribute novel design formulas to address multilevel effect modification in trials with a three-level structure, without restrictions on the level of randomization or the level at which the effect modifier is measured. We characterize a nested exchangeable correlation structure for both the effect modifier and the outcome conditional on the effect modifier and derive new sample size expressions for HTE analysis based on the asymptotic covariance matrix corresponding to a linear mixed analysis of covariance (LM-ANCOVA) model. With a small set of design assumptions, our closed-form expressions can adequately capture key aspects of the data generating processes that contribute to the variance of the interaction effect parameters, are a good approximation to the true Monte Carlo variance for the sample size of interest (as we demonstrate in Section 3), and therefore obviate the need for exhaustive power calculations by simulations. Based on either a univariate effect modifier or multivariate effect modifiers, we prove in Section 2 that both the between-subcluster and within-subcluster covariate-ICCs are key elements in the asymptotic variance expression of the HTE estimator. Beyond power analysis for the HTE, in Section 2.3, we point out that the LM-ANCOVA framework can be used for more efficient analyses of the ATE, thus allowing the use of familiar sample size equations for testing the ATE. We report a simulation study to validate our new methods in Section 3 and provide two concrete examples in Section 4. Section 5 concludes.

2. Power analyses for confirmatory HTE analysis in three-level trials

We consider a parallel-arm design with a three-level data structure, including the common scenario where participants (level 1) are nested in subclusters (level 2), which are nested in clusters (level 3). Let Inline graphic be the outcome from participant Inline graphic (Inline graphic) from subcluster Inline graphic (Inline graphic) in cluster Inline graphic (Inline graphic). For design purposes, we assume each cluster includes an equal number of subclusters (Inline graphic), and each subcluster includes an equal number of participants (Inline graphic). We denote Inline graphic if participant Inline graphic in subcluster Inline graphic of cluster Inline graphic receives treatment and Inline graphic if the participant receives usual care. While randomization is more frequently carried out at the cluster or subcluster level in three-level designs, our framework also allows participant-level randomization as we elaborate on below.

For prespecified HTE analysis with a set of Inline graphic effect modifiers, Inline graphic, we consider the following LM-ANCOVA model to formulate the HTE test:

graphic file with name Equation1.gif (2.1)

where Inline graphic is the random intercept at the cluster level, Inline graphic is the random intercept at the subcluster level, and Inline graphic is the participant-level random error. We adopt the conventional assumption in CRTs that the random effects are mutually independent and assume the absence of additional random variations. In model (2.1), Inline graphic is the intercept parameter, Inline graphic is the main effect of the treatment, Inline graphic and Inline graphic are the main covariate effects and the treatment-by-covariate interactions. The null hypothesis of no systematic HTE across subpopulations defined by Inline graphic can be formulated as Inline graphic: Inline graphic. Putting Inline graphic in the context of a single binary effect modifier where Inline graphic for female and Inline graphic for nonfemale, Inline graphic represents the mean outcome among the nonfemale subgroup without receiving treatment, Inline graphic represents the treatment effect for the nonfemale subgroup, Inline graphic represents the mean outcome difference between the gender subgroups, and the scalar parameter Inline graphic encodes the difference in treatment effect between the gender subgroups, and a two-sided HTE test can proceed with the Wald statistic based on a standard normal sampling distribution. Of note, when Inline graphic is mean centered, the interpretations of Inline graphic and Inline graphic will change (we discuss the interpretation of Inline graphic after mean-centering covariates in Section 2.3), but the interpretations of Inline graphic and Inline graphic remain unchanged, and therefore mean-centering covariates does not affect the HTE test for Inline graphic.
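For orientation, the structure described above (intercept, treatment main effect, covariate main effects, treatment-by-covariate interactions, two random intercepts, and a participant-level error) can be written out as follows. This is only a sketch consistent with the verbal description: the symbols $Y_{ijk}$, $W_{ijk}$, $\boldsymbol{X}_{ijk}$, $\beta_1,\ldots,\boldsymbol{\beta}_4$, $\gamma_i$, $u_{ij}$, and $\epsilon_{ijk}$ are assumed labels and may differ from the notation of model (2.1).

```latex
% Assumed indices: i = cluster, j = subcluster, k = participant.
Y_{ijk} = \beta_1 + \beta_2 W_{ijk}
        + \boldsymbol{\beta}_3^{\top}\boldsymbol{X}_{ijk}
        + \boldsymbol{\beta}_4^{\top}\boldsymbol{X}_{ijk} W_{ijk}
        + \gamma_i + u_{ij} + \epsilon_{ijk},
\qquad H_0:\ \boldsymbol{\beta}_4 = \boldsymbol{0}.

% Single binary effect modifier X_{ijk} \in \{0, 1\}:
E(Y_{ijk} \mid W_{ijk} = w, X_{ijk} = x)
  = \beta_1 + \beta_2 w + \beta_3 x + \beta_4 x w,
% so the treatment effect is \beta_2 when x = 0 and \beta_2 + \beta_4 when x = 1;
% \beta_4 is the difference in treatment effect between the two subgroups.
```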

We assume the allocation ratio to the treatment and control groups is Inline graphic, with Inline graphic. We consider three design configurations: (i) CRT with randomization at the cluster level such that Inline graphic clusters are randomized to treatment and Inline graphic clusters are randomized to usual care; (ii) CRT with randomization at the subcluster level such that within each cluster, Inline graphic subclusters are randomized to treatment and Inline graphic subclusters are randomized to usual care; (iii) individually randomized trial (IRT) with a hierarchical structure such that within each subcluster, Inline graphic participants are randomized to treatment and Inline graphic participants are randomized to usual care. In configurations (i) and (ii), we can further write the treatment indicator Inline graphic, Inline graphic and Inline graphic, Inline graphic, respectively, but we maintain Inline graphic throughout for a more unified presentation. To facilitate the characterization of the covariance parameters, we reparameterize model (2.1) by centering the treatment indicator and obtain

graphic file with name Equation2.gif (2.2)

where Inline graphic, Inline graphic, Inline graphic and Inline graphic. Let Inline graphic denote the total variance of the outcome conditional on Inline graphic. Under model (2.2), Inline graphic. The strength of dependency between two outcomes conditional on covariates can be characterized by two distinct outcome-ICCs. The within-subcluster outcome-ICC describes the strength of dependency between two outcomes within the same subcluster, or Inline graphic. The between-subcluster outcome-ICC describes the degree of dependency between two outcomes from two different subclusters but within the same cluster, or Inline graphic. If we write Inline graphic, then the implied correlation structure for Inline graphic is nested exchangeable (Teerenstra and others, 2010) with the matrix expression given by

graphic file with name Equation3.gif (2.3)

where “Inline graphic” refers to the Kronecker operator, Inline graphic and Inline graphic are a Inline graphic identity matrix and Inline graphic matrix of ones, respectively. Define the collection of design points Inline graphic, and Inline graphic as the design matrix for cluster Inline graphic. Given values of the variance components, the best linear unbiased estimator of Inline graphic is the generalized least squares (GLS) estimator, given by Inline graphic. Furthermore, Inline graphic converges in distribution to a multivariate normal random variate with mean zero and covariance matrix Inline graphic. Then, the asymptotic variance of Inline graphic (based on Inline graphic-regime) is the lower-right element of the square matrix Inline graphic. Design calculations for sample size and power then boil down to explicitly characterizing the key elements of Inline graphic, which we operationalize below.
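The nested exchangeable structure and the inverse needed by the GLS estimator can be checked numerically with a short sketch. The names `n_s` (subclusters per cluster), `m` (subcluster size), `alpha0` (within-subcluster outcome-ICC), and `alpha1` (between-subcluster outcome-ICC) are labels assumed here for illustration, and the eigenvalue and inverse expressions are the standard ones for this correlation structure rather than a transcription of the displayed equations.

```python
def eigenvalues(n_s, m, alpha0, alpha1):
    """Three distinct eigenvalues of the nested exchangeable correlation matrix."""
    lam1 = 1 - alpha0
    lam2 = 1 + (m - 1) * alpha0 - m * alpha1
    lam3 = 1 + (m - 1) * alpha0 + m * (n_s - 1) * alpha1
    return lam1, lam2, lam3


def nested_exchangeable(n_s, m, alpha0, alpha1):
    """(1 - a0) I + (a0 - a1) (I_{n_s} kron J_m) + a1 J, built entry by entry."""
    size = n_s * m
    return [[1.0 if r == c else alpha0 if r // m == c // m else alpha1
             for c in range(size)] for r in range(size)]


def nested_exchangeable_inverse(n_s, m, alpha0, alpha1):
    """Closed-form inverse written in terms of the three eigenvalues."""
    lam1, lam2, lam3 = eigenvalues(n_s, m, alpha0, alpha1)
    c1 = (lam2 - lam1) / (m * lam1 * lam2)        # weight on I kron J_m
    c2 = (lam3 - lam2) / (n_s * m * lam2 * lam3)  # weight on J
    size = n_s * m
    return [[(1.0 / lam1 if r == c else 0.0)
             - (c1 if r // m == c // m else 0.0) - c2
             for c in range(size)] for r in range(size)]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


n_s, m, alpha0, alpha1 = 3, 4, 0.10, 0.05
R = nested_exchangeable(n_s, m, alpha0, alpha1)
R_inv = nested_exchangeable_inverse(n_s, m, alpha0, alpha1)
lam1, lam2, lam3 = eigenvalues(n_s, m, alpha0, alpha1)
size = n_s * m

# One eigenvector per eigenvalue: a within-subcluster contrast, a
# between-subcluster contrast, and the vector of ones.
v1 = [1.0, -1.0] + [0.0] * (size - 2)
v2 = [1.0] * m + [-1.0] * m + [0.0] * (size - 2 * m)
v3 = [1.0] * size
eigen_error = max(
    abs(rv - lam * x)
    for lam, v in ((lam1, v1), (lam2, v2), (lam3, v3))
    for rv, x in zip(matvec(R, v), v)
)

# Largest deviation of R * R_inv from the identity matrix.
inverse_error = max(
    abs(sum(R[r][k] * R_inv[k][c] for k in range(size)) - (1.0 if r == c else 0.0))
    for r in range(size) for c in range(size)
)
print(lam1, lam2, lam3, eigen_error, inverse_error)
```

Because the inverse has the same three-block pattern as the matrix itself, quadratic forms in the GLS estimator reduce to sums over the three eigenspaces, which is what makes closed-form variance expressions feasible.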

2.1. Univariate effect modifier

We first provide the analytical variance expressions when the treatment effect heterogeneity concerns a univariate effect modifier (Inline graphic). For testing Inline graphic: Inline graphic with a two-sided Wald test with type I error rate Inline graphic and target power Inline graphic, the required number of clusters, number of subclusters per cluster, and subcluster size, (or the total sample size), should satisfy Inline graphic, where Inline graphic is the Inline graphicth quantile of the standard normal distribution, Inline graphic is the interaction effect size and Inline graphic refers to the lower-right element of Inline graphic, or Inline graphic. Notice that Inline graphic is a function of Inline graphic, Inline graphic and decreases to zero as Inline graphic or Inline graphic increases. Therefore, we intentionally define Inline graphic by multiplying Inline graphic with Inline graphic to facilitate efficiency comparison when randomization is conducted at different levels. This way, the limit of Inline graphic will remain as a constant instead of zero if we also let Inline graphic or Inline graphic increase. To proceed, we write Inline graphic, Inline graphic and Inline graphic as three distinct eigenvalues of the nested exchangeable correlation structure (2.3) (Li and others, 2018), based on which an explicit inverse is given by Inline graphic; this analytical inverse plays a critical role in simplifying Inline graphic. For testing the HTE, we further assume a nested exchangeable correlation structure for the univariate effect modifier such that the marginal correlation matrix for Inline graphic is

graphic file with name Equation4.gif (2.4)

where Inline graphic, Inline graphic, defines the within-subcluster covariate-ICC and Inline graphic, Inline graphic (without restrictions on Inline graphic, Inline graphic), defines the between-subcluster covariate-ICC. Figure 1 provides a graphical representation of the data structure in a three-level trial. Defining Inline graphic, Inline graphic and Inline graphic as three distinct eigenvalues of the nested exchangeable correlation structure (2.4), we first establish the following result.

Fig. 1.

A graphical representation of the data structure in a cluster with two subclusters in a three-level trial. Each oval represents a participant (level 1) nested in a subcluster (level 2) nested in the cluster (level 3). The effect modifier Inline graphic and outcome Inline graphic are measured for each participant with their respective covariate-ICCs and outcome-ICCs depicted. A thicker arrow indicates a stronger correlation between variables.

Theorem 2.1

Defining Inline graphic as the marginal variance of the univariate effect modifier, the limit variance Inline graphic can be expressed as a function of the eigenvalues of the outcome-ICC and covariate-ICC matrices, and is dependent on the level of randomization. (i) When randomization is carried out at the cluster level, we have


graphic file for Theorem 2.1, part (i)

(ii) When randomization is carried out at the subcluster level, we have


graphic file for Theorem 2.1, part (ii)

(iii) When randomization is carried out at the participant level, we have Inline graphic. (iv) The variances are linearly ordered such that Inline graphic, with equality obtained in the absence of residual clustering (e.g., Inline graphic or Inline graphic).

The proof of Theorem 2.1 can be found in Appendix A of the Supplementary material available at Biostatistics online. Depending on the level of randomization, Theorem 2.1 provides a cascade of variance expressions to facilitate analytical power calculations. First, in the absence of any residual clustering in a three-level design such that Inline graphic, we can write Inline graphic, which is essentially the limit variance of an interaction effect estimator in a simple ANCOVA model without clustering (Shieh, 2009). Therefore, we immediately notice that all three variances share the same form of the variance without clustering, Inline graphic, multiplied by a design effect expression characterizing the nontrivial residual clustering, which we denote by Inline graphic, Inline graphic, Inline graphic, when the randomization unit is the cluster, subcluster, and participant, respectively. When the randomization is at the cluster level, Cunningham and Johnson (2016) showed that the design effect for testing the ATE in a three-level trial is unbounded when the subcluster size Inline graphic increases indefinitely. In sharp contrast, with a participant-level effect modifier, the design effect under a three-level CRT is bounded above even if the subcluster size goes to infinity, as Inline graphic. Intuitively, the HTE parameter Inline graphic corresponds to a participant-level interaction covariate that varies within each subcluster whereas the ATE corresponds to a cluster-level treatment variable that only varies between clusters; this subtle difference underlies the difference in the two design effects. Interestingly, when Inline graphic, then Inline graphic and there can even be variance deflation in testing HTE under a three-level CRT relative to an individually randomized trial without clustering. For the subcluster randomized trial, Inline graphic, and the design effect is similarly bounded in the limit.
The design effect Inline graphic, however, does not depend on Inline graphic or Inline graphic and shares the same form with the one derived in Cunningham and Johnson (2016) for testing the ATE in a three-level IRT.
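The "variance without clustering times a design effect" form described above translates directly into a sample size calculation. The sketch below assumes the unclustered limit variance of the interaction estimator is the conditional outcome variance divided by the product of the allocation proportions and the marginal covariate variance, and uses one minus the within-subcluster outcome-ICC as the participant-level design effect, which matches the description that this design effect depends only on that ICC. The function names and the numeric inputs are hypothetical; the design effects for cluster- and subcluster-level randomization are not reproduced and would have to be supplied through `design_effect`.

```python
import math
from statistics import NormalDist


def hte_limit_variance(sigma2_y, sigma2_x, pi, design_effect=1.0):
    """Limit variance of the interaction estimator, scaled by total sample size.

    sigma2_y: outcome variance conditional on the effect modifier (assumed name)
    sigma2_x: marginal variance of the effect modifier (assumed name)
    pi:       proportion allocated to treatment
    """
    return design_effect * sigma2_y / (pi * (1.0 - pi) * sigma2_x)


def clusters_needed(limit_var, delta, n_s, m, alpha=0.05, power=0.80):
    """Smallest number of clusters for a two-sided Wald test of the interaction."""
    z = NormalDist().inv_cdf
    total_n = limit_var * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return math.ceil(total_n / (n_s * m))


# Hypothetical inputs: unit variances, equal allocation, interaction effect 0.2,
# 4 subclusters of 20 participants per cluster, within-subcluster outcome-ICC 0.10.
alpha0 = 0.10
var_irt = hte_limit_variance(1.0, 1.0, 0.5, design_effect=1 - alpha0)
var_unclustered = hte_limit_variance(1.0, 1.0, 0.5)
n_c_irt = clusters_needed(var_irt, delta=0.2, n_s=4, m=20)
n_c_unclustered = clusters_needed(var_unclustered, delta=0.2, n_s=4, m=20)
print(n_c_irt, n_c_unclustered)
```

In practice the result may be rounded up further (for example, to an even number of clusters) so that the intended allocation ratio can be achieved exactly.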

In addition, Theorem 2.1 demonstrates that randomization at a lower level leads to potential efficiency gain for estimating the interaction parameter Inline graphic. This ordering result parallels that developed for testing the ATE in three-level designs (Cunningham and Johnson, 2016) and is intuitive because randomization at a lower level allows the within-cluster or within-subcluster comparisons to inform the estimation of the associated parameter. All three variances are increasing functions of the conditional outcome variance Inline graphic and decreasing functions of the marginal covariate variance Inline graphic, matching the intuition that explained variation due to the effect modifier can lead to an HTE estimate with higher precision. Similar to the analysis of the ATE in IRTs or CRTs, a balanced design with equal randomization, Inline graphic, leads to the largest power for testing Inline graphic under a fixed sample size. Interestingly, the covariate-ICCs enter the variance for estimating Inline graphic in an orderly fashion, that is, Inline graphic is independent of Inline graphic and Inline graphic, and Inline graphic depends on Inline graphic but is free of Inline graphic, whereas Inline graphic depends on both Inline graphic and Inline graphic. This observation suggests that the variance of Inline graphic only depends on the covariate-ICCs defined within each randomization unit but not between randomization units.

Theorem 2.1 helps demystify the relationships between the key ICC parameters and design efficiency for estimating the HTE. When randomization is carried out at the cluster level, larger values of covariate-ICCs, Inline graphic, Inline graphic, are always associated with a larger Inline graphic (smaller power). This is expected because larger covariate-ICCs imply less per-participant information for estimating Inline graphic and therefore reduce statistical efficiency. The relationship between Inline graphic and outcome-ICCs, Inline graphic, Inline graphic, can be nonmonotone and is graphically explored in Figure S1 of the Supplementary material available at Biostatistics online, where Inline graphic exhibits a parabolic relationship with Inline graphic and Inline graphic. This implies that a direct application of the existing design formula in Yang and others (2020) by equating Inline graphic and Inline graphic can lead to either conservative or anticonservative predictions for Inline graphic and inaccurate sample size results in general. When randomization is carried out at the subcluster level, a larger value of the between-subcluster outcome-ICC, Inline graphic, is always associated with a smaller Inline graphic (larger power), whereas a larger value of the covariate-ICC, Inline graphic, is always associated with a larger Inline graphic (smaller power). However, Inline graphic can be nonmonotone in the within-subcluster outcome-ICC, Inline graphic (see Figure S2 of the Supplementary material available at Biostatistics online for a graphical exploration). Finally, when randomization is carried out at the participant level, a larger value of the within-subcluster outcome-ICC, Inline graphic, is always associated with a smaller Inline graphic (larger power), whereas Inline graphic is independent of other ICC parameters.
Table 1 provides a concise summary of these relationships and we further expand on these remarks in Appendix B of the Supplementary material available at Biostatistics online.

Table 1.

Definition of ICC parameters, and a concise summary of the relationships between ICC parameters and the variance of the interaction effect parameter with a univariate participant-level effect modifier (Inline graphic), under different levels of randomization

ICC | Interpretation | Cluster-level randomization | Subcluster-level randomization | Participant-level randomization
(variance) | | Inline graphic | Inline graphic | Inline graphic
Inline graphic | The intraclass correlation parameter between outcomes from two participants within the same subcluster | Nonmonotone | Nonmonotone | Decreasing
Inline graphic | The intraclass correlation parameter between outcomes from two participants in two different subclusters | Nonmonotone | Decreasing | Indep
Inline graphic | The intraclass correlation parameter between covariates from two participants within the same subcluster | Increasing | Increasing | Indep
Inline graphic | The intraclass correlation parameter between covariates from two participants in two different subclusters | Increasing | Indep | Indep

“Indep” indicates that the variance is independent of and thus does not change with the specific ICC parameter; “Increasing” indicates a monotonically increasing relationship; “Decreasing” indicates a monotonically decreasing relationship; and “Nonmonotone” indicates a nonmonotone and likely quadratic relationship.

Based on Theorem 2.1, further simplifications of the variance expressions are possible when the effect modifier is measured at the subcluster or cluster level. When the effect modifier is measured at the subcluster level (i.e., Inline graphic) we obtain

graphic file with name Equation7.gif (2.5)

When the effect modifier is measured at the cluster level (i.e., Inline graphic), Inline graphic reduces to Inline graphic, but Inline graphic remains identical to (2.5). In both cases, Inline graphic is unaffected and remains the same as in Theorem 2.1.

2.2. Generalization to accommodate multivariate effect modifiers

Although a common practice for confirmatory tests of HTE is to consider a univariate effect modifier, the above analytical sample size procedure can be generalized to jointly test the interaction effects with multivariate effect modifiers. In this case, we write Inline graphic as the set of Inline graphic baseline covariates and Inline graphic. We are interested in testing the global null hypothesis Inline graphic: Inline graphic based on a Wald test. From the LM-ANCOVA model (2.2), the scaled GLS estimator Inline graphic is asymptotically normal with mean zero and variance equal to the lower-right Inline graphic block of Inline graphic, which is denoted by Inline graphic. This motivates a quadratic Wald test statistic Inline graphic, which converges to a Chi-squared distribution Inline graphic with Inline graphic degrees of freedom and noncentrality parameter Inline graphic, where Inline graphic is the estimated covariance matrix of Inline graphic. For fixed effect size vector Inline graphic, the corresponding power equation of this Wald test is approximated by

graphic file with name Equation8.gif (2.6)

where Inline graphic is the Inline graphic quantile of the central Chi-squared distribution with Inline graphic degrees of freedom and Inline graphic is the probability density function of the Inline graphic distribution. Fixing any two of Inline graphic, Inline graphic, or Inline graphic, solving (2.6) for the other argument (for example, using the Inline graphic function in Inline graphic and rounding to the next integer above) readily gives the required sample size to achieve a desired level of power; therefore, sample size determination for HTE analysis with multivariate effect modifiers boils down to the characterization of Inline graphic in explicit forms.
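The power equation for the joint Wald test can be evaluated without specialized software by writing the noncentral Chi-squared distribution as a Poisson mixture of central Chi-squared distributions. The sketch below uses only the Python standard library and a simple incremental search in place of a root finder. It assumes the noncentrality parameter equals the total sample size times a quadratic form in the effect size vector and the inverse limit covariance matrix; that quadratic form is passed in as the hypothetical argument `q`, and all function names are illustrative.

```python
import math
from statistics import NormalDist


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), power-series form."""
    if x <= 0.0:
        return 0.0
    term = total = 1.0 / a
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def chi2_cdf(x, df):
    return reg_lower_gamma(df / 2.0, x / 2.0)


def ncx2_cdf(x, df, nc):
    """Noncentral Chi-squared CDF as a Poisson(nc / 2) mixture of central CDFs."""
    if nc <= 0.0:
        return chi2_cdf(x, df)
    half, total, j = nc / 2.0, 0.0, 0
    while True:
        weight = math.exp(-half + j * math.log(half) - math.lgamma(j + 1))
        total += weight * chi2_cdf(x, df + 2 * j)
        if j > half and weight < 1e-16:
            return total
        j += 1


def chi2_quantile(prob, df):
    """Central Chi-squared quantile by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def wald_power(df, nc, alpha=0.05):
    """Probability that a noncentral Chi-squared variate exceeds the critical value."""
    return 1.0 - ncx2_cdf(chi2_quantile(1 - alpha, df), df, nc)


def clusters_needed(q, n_s, m, df, alpha=0.05, power=0.80):
    """Smallest number of clusters whose noncentrality n_c * n_s * m * q gives the power."""
    n_c = 1
    while wald_power(df, n_c * n_s * m * q, alpha) < power:
        n_c += 1
    return n_c


# Hypothetical example: two effect modifiers, quadratic form q = 0.015,
# 4 subclusters of 20 participants per cluster.
n_c = clusters_needed(q=0.015, n_s=4, m=20, df=2)
print(n_c, wald_power(2, n_c * 4 * 20 * 0.015))
```

With one degree of freedom the same function reproduces the two-sided normal-approximation power, which gives a convenient sanity check on the implementation.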

In Appendix C of the Supplementary material available at Biostatistics online, assuming a nested block exchangeable correlation structure for the multivariate effect modifiers, we prove a general version of Theorem 2.1. The general results establish the analytical forms of three covariance matrices of the Inline graphic-vector of HTE estimators, Inline graphic, Inline graphic, and Inline graphic, when randomization is carried out at the cluster level, subcluster level, and participant level, respectively. Furthermore, these covariance matrices have Löwner ordering such that Inline graphic, with equality obtained in the absence of residual clustering (e.g., Inline graphic or Inline graphic). Finally, with a univariate effect modifier measured at the participant level, the LM-ANCOVA model (2.1) can also be used to detect contextual effect modification by decomposing the covariate effect into cluster-level, subcluster-level, and participant-level components to address different effects for the aggregated and lower-level variations (Raudenbush, 1997). Correspondingly, Inline graphic becomes the transformed vector of effect modifiers, and the more general theorem derived in Appendix C of the Supplementary material available at Biostatistics online can be applied for analytical power analysis with contextual effect modification. We derive those explicit expressions in Appendix D of the Supplementary material available at Biostatistics online as an application of this more general result.

2.3. Connections to sample size requirements for studying the ATE

Even though the primary motivation for model (2.1) is to study HTE with prespecified effect modifiers, the LM-ANCOVA model also implies a covariate-adjusted estimator for the ATE. This is because the conditional ATE given by LM-ANCOVA is Inline graphic, and therefore the marginal ATE parameter can be given by integrating over the distribution of effect modifiers, Inline graphic. Often, a practical strategy is to globally mean center the collection of effect modifiers such that the sample mean is zero. Without loss of generality, we assume Inline graphic and therefore Inline graphic can be interpreted as the covariate-adjusted ATE estimator. Such an observation is akin to the ANCOVA analysis of IRTs, for which model (2.1) is a direct generalization with multiple random effects. Different from prior work on ANCOVA analysis of IRTs without residual clustering (Yang and Tsiatis, 2001), we assume model (2.1) is correctly specified such that an explicit model-based variance expression of the ATE estimator can be used as a basis for study planning. Specifically, we denote the asymptotic variance expression of Inline graphic as Inline graphic, based on which the generic sample size requirement in a three-level design is Inline graphic. Based on model (2.1), we provide a final result to facilitate power and sample size determination for testing the ATE in three-level designs (proof in Appendix E of the Supplementary material available at Biostatistics online).

Theorem 2.2

The limit variance expression of the covariate-adjusted ATE estimator depends on the unit of randomization and is at most a function of the outcome-ICCs. (a) When the randomization is carried out at the cluster level, we have Inline graphic. (b) When randomization is carried out at the subcluster level, we have Inline graphic. (c) When randomization is carried out at the participant level, we have Inline graphic. (d) The variances are linearly ordered such that Inline graphic, and equality is obtained when Inline graphic (Inline graphic) and Inline graphic (Inline graphic).

Theorem 2.2 implies that the design effect in a three-level design for estimating the ATE, as compared to an unclustered randomized design, is one eigenvalue of the outcome-ICC matrix Inline graphic, matching those derived by Cunningham and Johnson (2016) under three-level designs with one subtle difference. In Cunningham and Johnson (2016), the design effects were all derived assuming a linear mixed model without any effect modifiers and therefore the outcome variance and the outcome-ICCs in each eigenvalue are marginal with respect to effect modifiers. In contrast, the design effect expressions implied by Theorem 2.2 are parameterized by outcome variance and outcome-ICCs conditional on effect modifiers. As we demonstrate in Section 3, adjusting for effect modifiers can partially explain the residual variation of the outcomes at any level, and therefore may reduce either the outcome variance or the amount of residual correlation. In such a case, the conditional outcome-ICCs and conditional outcome variance are frequently no larger than their marginal counterparts, and therefore the ATE estimators based on LM-ANCOVA are likely to be more efficient than those under the unadjusted linear mixed model. This improvement in efficiency can directly translate into sample size savings. Finally, if the marginal variance of a univariate effect modifier is unity such that Inline graphic, regardless of the level of randomization, the large-sample variance expression for the covariate-adjusted ATE estimator is identical to the large-sample variance expression for the HTE estimator, when the effect modifier is measured at or above the unit of randomization. Namely, Inline graphic regardless of the measurement level of the effect modifier; Inline graphic when the effect modifier is measured at the subcluster or cluster level; and Inline graphic when the effect modifier is measured at the cluster level.
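Since each ATE design effect is a single eigenvalue of the outcome-ICC matrix, the three design configurations can be compared with a few lines of code. The sketch below assumes the usual pairing for three-level designs (smallest eigenvalue for participant-level, middle for subcluster-level, and largest for cluster-level randomization) and an unclustered limit variance equal to the conditional outcome variance divided by the product of the allocation proportions. Function names, argument names, and numeric inputs are hypothetical.

```python
import math
from statistics import NormalDist


def outcome_eigenvalues(n_s, m, alpha0, alpha1):
    """Eigenvalues of the nested exchangeable outcome correlation matrix."""
    lam1 = 1 - alpha0
    lam2 = 1 + (m - 1) * alpha0 - m * alpha1
    lam3 = 1 + (m - 1) * alpha0 + m * (n_s - 1) * alpha1
    return lam1, lam2, lam3


def ate_limit_variance(level, sigma2_y, pi, n_s, m, alpha0, alpha1):
    """Limit variance of the covariate-adjusted ATE estimator, scaled by total sample size."""
    lam1, lam2, lam3 = outcome_eigenvalues(n_s, m, alpha0, alpha1)
    design_effect = {"participant": lam1, "subcluster": lam2, "cluster": lam3}[level]
    return design_effect * sigma2_y / (pi * (1.0 - pi))


def clusters_needed(limit_var, delta, n_s, m, alpha=0.05, power=0.80):
    """Smallest number of clusters for a two-sided Wald test of the ATE."""
    z = NormalDist().inv_cdf
    total_n = limit_var * (z(1 - alpha / 2) + z(power)) ** 2 / delta ** 2
    return math.ceil(total_n / (n_s * m))


# Hypothetical design: 4 subclusters of 20 participants, conditional outcome-ICCs
# 0.10 (within subcluster) and 0.05 (between subclusters), ATE of 0.3 SD.
design = dict(sigma2_y=1.0, pi=0.5, n_s=4, m=20, alpha0=0.10, alpha1=0.05)
needed = {
    level: clusters_needed(ate_limit_variance(level, **design), delta=0.3, n_s=4, m=20)
    for level in ("participant", "subcluster", "cluster")
}
print(needed)
```

Because the ICCs and outcome variance entering these expressions are conditional on the effect modifiers, plugging in smaller conditional values than their marginal counterparts is what produces the sample size savings discussed above.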

3. Numerical studies

We carry out simulations to assess the finite-sample performance of the proposed sample size procedures for planning three-level trials. Our objectives are (i) assessing the accuracy of the proposed methods for powering three-level trials to detect HTE as well as the ATE and (ii) demonstrating, from a cost-effectiveness perspective, whether the sample size estimates based on Theorem 2.2 can be smaller than those estimated by the approach given in Cunningham and Johnson (2016), for the same level of power. Whenever applicable, we also compare our sample size results with those obtained from Dong and others (2018). To focus on the main idea, throughout we assume a univariate continuous effect modifier measured at the participant level. We consider designs with randomization at each of the three levels, leading to CRTs, subcluster randomized trials, and IRTs with a hierarchical structure. We assume equal randomization with Inline graphic=1/2, and a balanced design with equal numbers of subclusters Inline graphic in each cluster and equal subcluster sizes Inline graphic.

For the first objective, we fix Inline graphic and Inline graphic at Inline graphic, nominal type I error Inline graphic and the desired power level Inline graphic; the remaining parameters are varied. Under each design, we consider two levels of subcluster sizes Inline graphic and two values for the number of subclusters per cluster Inline graphic. Since typically Inline graphic and Inline graphic, we choose Inline graphic to represent small and moderate conditional outcome-ICC, and Inline graphic to represent small, moderate, and large covariate-ICCs, based on ranges commonly reported in the CRT literature (Murray and Blitstein, 2003). To ensure that the predicted number of clusters is practical and mostly below Inline graphic, we set Inline graphic for all randomization scenarios, Inline graphic for the CRTs, and Inline graphic for the subcluster randomized trial or IRT. For each parameter combination, we estimate the number of clusters Inline graphic that ensures at least Inline graphic power and round to the nearest even integer above to ensure an exactly equal randomization. We then use the predicted cluster number Inline graphic to simulate correlated outcome data based on the LM-ANCOVA model and compute the empirical power of the Wald test for HTE or the ATE. When the randomization is carried out at the cluster level, we quantify the proportion of explained variation due to the effect modifier (details in Appendix F of the Supplementary material available at Biostatistics online) and use the method of Dong and others (2018) to obtain the required number of clusters Inline graphic to ensure at least 80Inline graphic power, as a comparator to our new method.
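The cluster-number calculation and the rounding rule described above can be sketched as follows. This is a minimal illustration based on the usual normal approximation for a two-sided Wald test, not the authors' released R code. The argument unit_variance stands for the limit variance of the estimator scaled by the number of clusters and is taken as a given input (for example, from Theorem 2.1 or 2.2); the function names and the numerical inputs are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_clusters(unit_variance, effect, alpha=0.05, power=0.80):
    """Smallest even number of clusters whose predicted power is at least `power`.

    unit_variance: limit variance of sqrt(n) * (estimate - truth), supplied by the user.
    effect: effect size on the same scale as the estimate.
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    n = ceil(z_sum ** 2 * unit_variance / effect ** 2)
    return n + (n % 2)  # round up to an even integer for exactly equal allocation


def predicted_power(n, unit_variance, effect, alpha=0.05):
    """Normal-approximation power, ignoring the negligible opposite-tail rejection."""
    z = NormalDist()
    return z.cdf(abs(effect) * sqrt(n / unit_variance) - z.inv_cdf(1 - alpha / 2))


# Hypothetical inputs chosen only to exercise the functions.
n_clusters = required_clusters(unit_variance=0.295, effect=0.2)
print(n_clusters, round(predicted_power(n_clusters, 0.295, 0.2), 3))
```

Because of the rounding, the predicted power at the returned cluster number is generally somewhat above the target rather than exactly equal to it.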

We generate the individual-level outcomes using the LM-ANCOVA model (2.1) by fixing Inline graphic and Inline graphic. For the HTE tests under each randomized design, we fix Inline graphic, and Inline graphic. For studying the empirical power of the Wald test for the covariate-adjusted ATE, we fix Inline graphic such that Inline graphic. We generate the correlated effect modifiers based on the linear mixed model, Inline graphic, where the global mean Inline graphic, Inline graphic, Inline graphic, and Inline graphic. The cluster-specific random intercept in LM-ANCOVA Inline graphic, the subcluster-level random effect Inline graphic, and the random error Inline graphic. For each parameter configuration, we generate Inline graphic hypothetical trials to evaluate the empirical type I error under the null and empirical power under the alternative. For each hypothetical trial, we fit the LM-ANCOVA model using restricted maximum likelihood estimation and carry out the corresponding Wald test for inference. To evaluate the covariate-adjusted Wald test for the ATE, we first globally mean-center the effect modifier before fitting the LM-ANCOVA. Finally, while the sample size estimation for HTE test and ATE test under a three-level design can generally be based on the standard normal distribution, we consider the Inline graphic-distribution with the between-within degrees of freedom (Inline graphic) only when studying the ATE under a CRT (standard normal distribution will still be used for testing the ATE in both subcluster randomized trials and IRTs). 
This choice represents an effective small-sample degrees of freedom correction specifically for CRTs with a limited number of clusters (and will have negligible difference from the standard normal when Inline graphic is sufficiently large, also see Chapter 2.4.2 in Pinheiro and Bates (2006)) and has been previously shown to maintain valid type I error rate (Li and others, 2016) in small CRTs for testing the ATE and therefore is adopted to more objectively assess the agreement between empirical and predicted power under that specific scenario. For this scenario, we also confirm in Table S1 of the Supplementary material available at Biostatistics online that the predicted variance of the ATE estimator based on Theorem 2.2 is close to the Monte Carlo variance.
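The data-generating process for one hypothetical cluster randomized trial can be sketched as below. This is a simplified illustration under stated assumptions rather than the authors' implementation: we assume the LM-ANCOVA mean structure has an intercept, a treatment main effect, a covariate main effect, and a treatment-by-covariate interaction (coefficients b1 to b4), and that each ICC pair maps to variance components in the usual nested exchangeable way. All function names, argument names, and default values (including a covariate mean of zero) are hypothetical.

```python
import random


def variance_components(total_var, icc_within, icc_between):
    """Split a total variance into (cluster, subcluster, residual) components."""
    return (
        icc_between * total_var,
        (icc_within - icc_between) * total_var,
        (1 - icc_within) * total_var,
    )


def simulate_crt(n, n_s, m, beta, alpha0, alpha1, rho0, rho1,
                 sigma2_y=1.0, sigma2_x=1.0, mu_x=0.0, seed=None):
    """Simulate one three-level trial with randomization at the cluster level.

    Returns rows of (cluster, subcluster, participant, arm, x, y).
    """
    rng = random.Random(seed)
    b1, b2, b3, b4 = beta
    sd_y = [v ** 0.5 for v in variance_components(sigma2_y, alpha0, alpha1)]
    sd_x = [v ** 0.5 for v in variance_components(sigma2_x, rho0, rho1)]

    # Equal allocation: half of the clusters to each arm, in random order.
    arms = [1] * (n // 2) + [0] * (n - n // 2)
    rng.shuffle(arms)

    rows = []
    for i in range(n):
        w = arms[i]
        gamma_i = rng.gauss(0.0, sd_y[0])      # cluster random intercept (outcome)
        a_i = rng.gauss(0.0, sd_x[0])          # cluster random effect (covariate)
        for j in range(n_s):
            u_ij = rng.gauss(0.0, sd_y[1])     # subcluster random effect (outcome)
            c_ij = rng.gauss(0.0, sd_x[1])     # subcluster random effect (covariate)
            for k in range(m):
                x = mu_x + a_i + c_ij + rng.gauss(0.0, sd_x[2])
                eps = rng.gauss(0.0, sd_y[2])
                y = b1 + b2 * w + b3 * x + b4 * x * w + gamma_i + u_ij + eps
                rows.append((i, j, k, w, x, y))
    return rows


trial = simulate_crt(n=10, n_s=4, m=20, beta=(0.0, 0.2, 0.1, 0.1),
                     alpha0=0.10, alpha1=0.05, rho0=0.30, rho1=0.15, seed=1)
print(len(trial))
```

Each simulated data set would then be passed to a mixed-model routine for estimation; that fitting step is outside the scope of this sketch.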

For the second objective, given each effect size Inline graphic, we still generate the correlated outcome data using the LM-ANCOVA model as above, but assume that the primary analysis of the ATE is based on a linear mixed model without the effect modifier. Correspondingly, we compute the required sample size using the formulas in the absence of any covariates. To operationalize those formulas, we obtain the total outcome variance marginalizing over the covariate distribution, Inline graphic. Furthermore, the unconditional outcome-ICCs can be approximated by Inline graphic and Inline graphic, where the explicit form of Inline graphic is derived in Appendix F of the Supplementary material available at Biostatistics online. The required sample size for the unadjusted mixed model analysis is then estimated from Theorem 2.2 but by replacing Inline graphic, Inline graphic with Inline graphic, Inline graphic, and Inline graphic with Inline graphic, namely, using the formulas in Cunningham and Johnson (2016). We compare the results with those estimated based on Theorem 2.2 to investigate the savings in sample size due to adjustment for the univariate effect modifier. Finally, we also obtain the empirical type I error rate and empirical power by fitting the linear mixed model omitting Inline graphic to verify the accuracy of the sample size procedure without the effect modifier.

3.1. Results

Table 2 provides a summary of the estimated number of clusters using the proposed formula in Theorem 2.1, the empirical size, empirical power as well as predicted power by formula of the Wald test for HTE when Inline graphic, and when the randomization is carried out at the cluster level (differences from 80Inline graphic power are due to rounding). The Wald test maintains the nominal type I error rate, which ensures the validity of the subsequent comparisons between the empirical and predicted power. Across all scenarios, the predicted power is in good agreement with the empirical power, even when there are as few as Inline graphic clusters. In Table 2, we also observe that the method of Dong and others (2018) often leads to much smaller sample size estimates (Inline graphic) than the proposed method and therefore their method is consistently anticonservative. In Appendix F of the Supplementary material available at Biostatistics online, we have re-expressed their design formula using our notation and found that it does not depend on any covariate-ICC parameters. It is apparent from Table 2 that ignoring the covariate-ICC at each level during study planning can result in under-powered CRTs, especially in the presence of nontrivial covariate-ICCs.

Table 2.

Estimated required number of clusters Inline graphic for the HTE test based on the proposed formula, empirical type I error (Emp. size), empirical power (Emp. power), as well as predicted power (Pred. power) for the HTE test when randomization is at the cluster level. For studying power, the HTE effect size is fixed at Inline graphic. In the last two columns, we estimate the required sample size Inline graphic using the method of Dong and others (2018) and obtain the actual predicted power (Actual power) using our formula based on the estimated sample size Inline graphic to assess the degree to which the three-level CRT may be under-powered.

Design parameters Performance characteristics Comparator
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Emp. size Emp. power Pred. power Inline graphic Actual power
20 4 0.015 0.010 0.150 0.100 42 0.051 0.812 0.806 40 0.787
20 4 0.015 0.010 0.300 0.150 44 0.054 0.806 0.807 40 0.768
20 4 0.015 0.010 0.500 0.300 48 0.052 0.801 0.804 40 0.730
20 4 0.100 0.050 0.150 0.100 42 0.051 0.813 0.808 36 0.746
20 4 0.100 0.050 0.300 0.150 48 0.050 0.805 0.813 36 0.694
20 4 0.100 0.050 0.500 0.300 58 0.053 0.798 0.800 36 0.598
20 8 0.015 0.010 0.150 0.100 22 0.048 0.811 0.819 20 0.781
20 8 0.015 0.010 0.300 0.150 24 0.054 0.837 0.832 20 0.760
20 8 0.015 0.010 0.500 0.300 26 0.055 0.816 0.816 20 0.708
20 8 0.100 0.050 0.150 0.100 22 0.045 0.817 0.825 18 0.744
20 8 0.100 0.050 0.300 0.150 24 0.048 0.813 0.811 18 0.692
20 8 0.100 0.050 0.500 0.300 30 0.046 0.796 0.806 18 0.590
50 4 0.015 0.010 0.150 0.100 18 0.054 0.818 0.822 16 0.775
50 4 0.015 0.010 0.300 0.150 20 0.055 0.829 0.833 16 0.744
50 4 0.015 0.010 0.500 0.300 22 0.055 0.811 0.811 16 0.678
50 4 0.100 0.050 0.150 0.100 18 0.052 0.828 0.832 16 0.787
50 4 0.100 0.050 0.300 0.150 20 0.054 0.812 0.813 16 0.722
50 4 0.100 0.050 0.500 0.300 26 0.056 0.816 0.808 16 0.603
50 8 0.015 0.010 0.150 0.100 10 0.053 0.857 0.856 8 0.772
50 8 0.015 0.010 0.300 0.150 10 0.052 0.827 0.828 8 0.739
50 8 0.015 0.010 0.500 0.300 12 0.058 0.825 0.830 8 0.663
50 8 0.100 0.050 0.150 0.100 10 0.052 0.874 0.868 8 0.786
50 8 0.100 0.050 0.300 0.150 10 0.053 0.818 0.813 8 0.722
50 8 0.100 0.050 0.500 0.300 14 0.054 0.831 0.834 8 0.600
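The empirical size and empirical power reported in Table 2 are rejection proportions of the Wald test across simulated trials. A minimal sketch of that summary step is given below, assuming the point estimates and model-based standard errors from each replicate are already available; the function names and toy inputs are hypothetical, and a standard normal critical value is used throughout (the t-based correction applied to the ATE test in CRTs is not shown).

```python
from math import sqrt
from statistics import NormalDist


def wald_rejection_rate(estimates, std_errors, alpha=0.05):
    """Proportion of replicates in which the two-sided Wald test rejects."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = sum(abs(est / se) > crit for est, se in zip(estimates, std_errors))
    return rejections / len(estimates)


def monte_carlo_se(rate, n_reps):
    """Binomial standard error of an empirical rejection rate."""
    return sqrt(rate * (1 - rate) / n_reps)


# Toy replicates: two clearly significant estimates and two null-like ones.
rate = wald_rejection_rate([0.50, 0.02, -0.45, 0.10], [0.10, 0.10, 0.10, 0.10])
print(rate, monte_carlo_se(rate, 4))
```

Applied to data simulated under the null, the rejection rate estimates the type I error; applied to data simulated under the alternative, it estimates power. The Monte Carlo standard error indicates how much deviation from the nominal level is attributable to simulation noise alone.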

Under cluster randomization, Table 3 provides a summary of the estimated number of clusters using the proposed formula in Theorem 2.2, the empirical size, empirical power as well as predicted power by formula of the Wald test for the covariate-adjusted ATE when Inline graphic. The Wald test still maintains valid type I error rate, with empirical power close to analytical prediction by our formula. This suggests that our variance expressions are accurate for study design purposes. Interestingly, even though the ATE size is twice the HTE effect size, the required sample size to achieve 80Inline graphic power is not always larger for the HTE test compared to the ATE and can depend on the remaining design parameters. Finally, we demonstrate that ignoring the univariate effect modifier at the study design stage can lead to larger than necessary sample size estimates under a wide range of design configurations (Table S2 of the Supplementary material available at Biostatistics online). In fact, for the same ATE size, we find that the required sample size based on the Cunningham and Johnson (2016) formula may even be Inline graphic larger than that based on Theorem 2.2, as a result of explained variation. For example, as seen in Table S2 of the Supplementary material available at Biostatistics online, the adjusted outcome-ICCs Inline graphic and Inline graphic can be substantially smaller than their marginal counterparts, Inline graphic, Inline graphic, especially when the covariate-ICCs are farther away from zero.

Table 3.

Estimated required number of clusters Inline graphic for the covariate-adjusted ATE test based on the proposed formula, empirical type I error (Emp. size), empirical power (Emp. power), as well as predicted power (Pred. power) for the covariate-adjusted ATE test when randomization is at the cluster level. For studying power, the ATE size is fixed at Inline graphic

Design parameters Performance characteristics
Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Emp. size Emp. power Pred. power
20 4 0.015 0.010 0.150 0.100 22 0.051 0.824 0.828
20 4 0.015 0.010 0.300 0.150 22 0.052 0.821 0.828
20 4 0.015 0.010 0.500 0.300 22 0.054 0.821 0.828
20 4 0.100 0.050 0.150 0.100 60 0.052 0.798 0.801
20 4 0.100 0.050 0.300 0.150 60 0.053 0.797 0.801
20 4 0.100 0.050 0.500 0.300 60 0.053 0.796 0.801
20 8 0.015 0.010 0.150 0.100 16 0.048 0.820 0.819
20 8 0.015 0.010 0.300 0.150 16 0.048 0.818 0.819
20 8 0.015 0.010 0.500 0.300 16 0.051 0.814 0.819
20 8 0.100 0.050 0.150 0.100 52 0.048 0.815 0.811
20 8 0.100 0.050 0.300 0.150 52 0.048 0.815 0.811
20 8 0.100 0.050 0.500 0.300 52 0.049 0.813 0.811
50 4 0.015 0.010 0.150 0.100 16 0.051 0.839 0.832
50 4 0.015 0.010 0.300 0.150 16 0.053 0.840 0.832
50 4 0.015 0.010 0.500 0.300 16 0.055 0.837 0.832
50 4 0.100 0.050 0.150 0.100 56 0.050 0.812 0.810
50 4 0.100 0.050 0.300 0.150 56 0.050 0.814 0.810
50 4 0.100 0.050 0.500 0.300 56 0.050 0.813 0.810
50 8 0.015 0.010 0.150 0.100 14 0.054 0.851 0.851
50 8 0.015 0.010 0.300 0.150 14 0.056 0.850 0.851
50 8 0.015 0.010 0.500 0.300 14 0.059 0.842 0.851
50 8 0.100 0.050 0.150 0.100 48 0.051 0.805 0.801
50 8 0.100 0.050 0.300 0.150 48 0.052 0.802 0.801
50 8 0.100 0.050 0.500 0.300 48 0.054 0.802 0.801

Simulation results for when the randomization is carried out at the subcluster level and at the participant level are presented in Tables S3–S8 of the Supplementary material available at Biostatistics online. The patterns are qualitatively similar to the results in Tables 2 and 3 as well as Table S2 of the Supplementary material available at Biostatistics online and confirm that our analytical power procedure can accurately track the empirical power for the Wald test for HTE and ATE. In addition, these results confirm the ordering properties of the variances under different randomized designs, with the cluster-level randomization requiring the largest sample size and the participant-level randomization requiring the smallest sample size, for studying both HTE and the ATE. Finally, we observe that, for studying the ATE, the sample size saving by adjusting for the participant-level effect modifier is often the largest under a CRT and the smallest under an IRT with a hierarchical structure.

4. Two real CRT examples

We illustrate the proposed methods using two real trial examples, with a primary focus on the detection of confirmatory HTE. In both examples, we consider two-sided tests with a nominal Inline graphic type I error and Inline graphic power.

Example 4.1

(The HALI trial). The HALI trial compared the effect of a literacy intervention with usual instruction in terms of students’ literacy outcomes measured by performance on spelling and reading tests (Jukes and others, 2017). The study exhibits a three-level structure since the child participants are nested in schools (subclusters) which are further nested in TAC tutor zones (clusters); and the randomization of literacy intervention is carried out at the TAC level with Inline graphic. There are approximately Inline graphic children in each school, and on average Inline graphic schools per TAC tutor zone. We focus on the Inline graphic-month spelling score as a continuous outcome and the baseline spelling score as a continuous potential effect modifier. Based on the estimates in Jukes and others (2017), we assume the conditional within-subcluster and between-subcluster outcome-ICCs as Inline graphic and Inline graphic. We further assume the within-subcluster and between-subcluster covariate-ICCs Inline graphic and Inline graphic. Using Theorem 2.1(a), the required number of TAC tutor zones to detect HTE by the baseline spelling score for a standardized effect size, Inline graphic (interpreted as the impact due to one standard deviation unit change in baseline spelling score on standard deviation unit change in the 9-month spelling score) is Inline graphic. To explore the sensitivity of power to varying covariate- and outcome-ICC values, we estimate the power of the HTE test by varying Inline graphic, Inline graphic with fixed outcome-ICC values, and by varying Inline graphic, Inline graphic with fixed covariate-ICCs. The power contours in Figures 2(a) and (b) indicate that the test power is sensitive to the magnitude of the within-subcluster covariate-ICC, but relatively insensitive to the magnitude of the between-subcluster covariate-ICC. 
While larger values of the within-subcluster covariate-ICC can reduce the power of the HTE test to below Inline graphic, variations of the outcome-ICCs within the range we considered maintain the power of the HTE test above Inline graphic.

Fig. 2.

Estimated power contours for studying the heterogeneous treatment effect in the HALI and STRIDE trials as a function of the covariate and outcome-ICCs. (a) and (b) are based on the participant-level continuous effect modifier, baseline spelling score, in the HALI CRT (Inline graphic, Inline graphic and Inline graphic), while (c) and (d) are based on the participant-level binary effect modifier, SRH, in the STRIDE subcluster randomized trial (Inline graphic, Inline graphic and Inline graphic).

Example 4.2

(The STRIDE trial). The STRIDE trial was a subcluster randomized trial comparing the effectiveness of a multifactorial fall prevention intervention program versus enhanced usual care on patients’ health outcomes (Gill and others, 2021). The study randomized primary care clinics to intervention conditions with allocation probability Inline graphic and measured outcomes at the participant level; the clinics were nested within health centers, thus exhibiting a three-level structure. Each health center included about Inline graphic clinics, and the average clinic size was Inline graphic. In this illustrative example, we consider the concern score for falling as a continuous outcome and self-rated health (SRH; measures whether one has good/excellent SRH) as a binary effect modifier measured at the participant level (Gill and others, 2021). We assume the conditional within-subcluster and between-subcluster outcome-ICCs as Inline graphic and Inline graphic, and further assume the within-subcluster covariate-ICC Inline graphic. Using Theorem 2.1(b), Inline graphic health centers would be required to detect HTE moderated by SRH for a standardized effect size Inline graphic (interpreted as the impact due to change in SRH on the standard deviation unit change in the concern score). To additionally explore the sensitivity of power to varying covariate- and outcome-ICC values, we present the power contours in Figures 2(c) and (d) as in Example 4.1. In particular, Figure 2(c) confirms that larger values of Inline graphic can decrease the power of the HTE test, but varying Inline graphic has no effect on the test power in a subcluster randomized trial. Varying values of the outcome-ICCs generally do not result in an under-powered HTE test, except when Inline graphic and Inline graphic (a scenario with a moderate within-subcluster outcome-ICC but a small between-subcluster outcome-ICC), in which case the power appears just below Inline graphic. 
Overall, these two examples illustrate how our design expressions can be effectively operationalized in practical design situations and emphasize the critical role of the within-subcluster covariate-ICC in determining the power of the HTE test.

5. Concluding remarks

While prespecified HTE analysis has been a recognized goal in randomized trials, guidance to date on planning pragmatic trials involving clusters with HTE analysis is scarce. Through analytical derivations, we contribute a cascade of new variance expressions to allow for rigorous and yet computationally efficient design of pragmatic CRTs to power confirmatory HTE analysis with effect modifiers measured at each level, addressing a critical methodological gap in designing definitive pragmatic CRTs. Compared to the design of IRTs with HTE analysis (Brookes and others, 2004), the design of CRTs with HTE analysis requires more complex considerations due to the multilevel data structure and additional design parameters, especially the ICCs of the outcome as well as the effect modifier. While recent advancements in computational tools have made the use of a simulation-based power calculation an attractive alternative to analytical power predictions in clustered designs, such an approach is typically time-consuming and can quickly become impractical due to the need to search across multidimensional parameter spaces and repeatedly fit multilevel models. Our proposed design formulas not only reduce the associated computational issues, but, more importantly, identify key aspects of the data generating processes that contribute to the study power. For example, the power of the HTE test is only affected by the HTE effect size but not the ATE size. Furthermore, the power of the HTE test only depends on the covariate-ICCs defined within each randomization unit but not between randomization units. Regardless of the level of randomization, the variance of the interaction effect estimator from the LM-ANCOVA model is proportional to the ratio of the conditional outcome variance and the variance of the effect modifier. 
In the context of a binary effect modifier, for example, the variance of the effect modifier is a function of the mean, and it follows that the variance of the interaction effect estimator reaches its minimum when the prevalence of the effect modifier is Inline graphic (holding all other factors constant). These insights greatly simplify power analysis by obviating the need to exhaust ancillary design parameters as would otherwise be required in a simulation-based power calculation procedure, and can possibly provide a basis for deriving optimal designs for testing effect modification.
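The effect of prevalence on the interaction estimator can be illustrated with a short calculation. The sketch below relies only on two facts: the variance of the interaction effect estimator is proportional to the reciprocal of the effect modifier variance (as stated above), and a binary variable with prevalence p has variance p(1 - p). The function name is hypothetical, and all other design parameters are assumed to be held constant.

```python
def relative_hte_variance(prevalence):
    """Variance of the interaction estimator relative to a balanced binary modifier.

    Uses var(X) = p * (1 - p) for a binary effect modifier; the reference value
    0.25 is the variance at p = 0.5, where p * (1 - p) is largest.
    """
    var_x = prevalence * (1 - prevalence)
    return 0.25 / var_x


for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(p, round(relative_hte_variance(p), 3))
```

Since the required number of clusters scales with the variance, the same ratio approximates the relative sample size needed for an unbalanced binary effect modifier compared with a balanced one, before rounding.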

There are additional aspects that we do not address in this article but remain important topics for future research. First, we have assumed equal cluster sizes to arrive at the main results. In a two-level CRT, Tong and others (2021) recently developed a correction factor for the variance formula of the HTE estimator under variable cluster sizes and found that the impact of cluster size variation is minimal with a participant-level effect modifier but can be more substantial with a cluster-level effect modifier. We anticipate that these findings generalize to three-level designs, although a careful derivation is needed to obtain the explicit correction factors when only the number of participants varies across subclusters, or only the number of subclusters varies across clusters, or both. Second, our design expressions developed under the LM-ANCOVA model are at best an approximation for three-level designs with a binary outcome. Analytical design formulas for powering the HTE test with a binary outcome that explicitly acknowledge the binomial variance structure warrant additional research. Finally, it would be worthwhile to generalize our results to further accommodate random covariate effects as an additional source of random variation. In general, the asymptotic variance expression tends to be analytically less tractable under random coefficients models and therefore requires a separate development.

Supplementary Material

kxac026_Supplementary_Data

Acknowledgments

Conflict of Interest: None declared.

Contributor Information

Fan Li, Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, USA.

Xinyuan Chen, Department of Mathematics and Statistics, Mississippi State University, MS 39762, USA.

Zizhong Tian, Division of Biostatistics and Bioinformatics, Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033, USA.

Denise Esserman, Department of Biostatistics, Yale University School of Public Health, New Haven, CT 06510, USA.

Patrick J Heagerty, Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.

Rui Wang, Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115, USA and Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, MA 02215, USA.

Software

R code for implementing the proposed methods, simulations, and data examples is available at https://github.com/BillyTian/code_3levelHTE.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

Research in this article was supported by a Patient-Centered Outcomes Research Institute AwardInline graphic (PCORIInline graphic Award ME-2020C3-21072). This work was also partially supported by the Yale Clinical and Translational Science Award (UL1TR001863). The statements presented in this article are solely the responsibility of the authors and do not necessarily represent the views of PCORIInline graphic or its Board of Governors or Methodology Committee.

References

  1. Brookes, S. T., Whitely, E., Egger, M., Smith, G. D., Mulheran, P. A., and Peters, T. J. (2004). Subgroup analyses in randomized trials: risks of subgroup-specific analyses: power and sample size for the interaction test. Journal of Clinical Epidemiology 57, 229–236.
  2. Cunningham, T. and Johnson, R. (2016). Design effects for sample size computation in three-level designs. Statistical Methods in Medical Research 25, 505–519.
  3. Dong, N., Kelcey, B. and Spybrook, J. (2018). Power analyses for moderator effects in three-level cluster randomized trials. The Journal of Experimental Education 86, 489–514.
  4. Gill, T.M., Bhasin, S., Reuben, D.B., Latham, N.K., Araujo, K., Ganz, D.A., Boult, C., Wu, A.W., Magaziner, J., Alexander, N. and Wallace, R.B. (2021). Effect of a multifactorial fall injury prevention intervention on patient well-being: the STRIDE study. Journal of the American Geriatrics Society 69, 173–179.
  5. Heo, M. and Leon, A. (2008). Statistical power and sample size requirements for three level hierarchical cluster randomized trials. Biometrics 64, 1256–1262.
  6. Jukes, M.C., Turner, E.L., Dubeck, M.M., Halliday, K.E., Inyega, H.N., Wolf, S., Zuilkowski, S.S. and Brooker, S.J. (2017). Improving literacy instruction in Kenya through teacher professional development and text messages support: A cluster randomized trial. Journal of Research on Educational Effectiveness 10, 449–481.
  7. Li, F., Lokhnygina, Y., Murray, D.M., Heagerty, P.J. and DeLong, E.R. (2016). An evaluation of constrained randomization for the design and analysis of group-randomized trials. Statistics in Medicine 35, 1565–1579.
  8. Li, F., Turner, E. and Preisser, J. (2018). Sample size determination for GEE analyses of stepped wedge cluster randomized trials. Biometrics 74, 1450–1458.
  9. Murray, D. and Blitstein, J. (2003). Methods to reduce the impact of intraclass correlation in group-randomized trials. Evaluation Review 27, 79–103.
  10. Pinheiro, J. and Bates, D. (2006). Mixed-Effects Models in S and S-PLUS. New York: Springer Science & Business Media.
  11. Raudenbush, S. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods 2, 173.
  12. Shieh, G. (2009). Detecting interaction effects in moderated multiple regression with continuous variables power and sample size considerations. Organizational Research Methods 12, 510–528.
  13. Starks, M.A., Sanders, G.D., Coeytaux, R.R., Riley, I.L., Jackson, L.R., Brooks, A.M., Thomas, K.L., Choudhury, K.R., Califf, R.M. and Hernandez, A.F. (2019). Assessing heterogeneity of treatment effect analyses in health-related cluster randomized trials: a systematic review. PLoS One 14, e0219894.
  14. Teerenstra, S., Lu, B., Preisser, J.S., Van Achterberg, T. and Borm, G.F. (2010). Sample size considerations for GEE analyses of three-level cluster randomized trials. Biometrics 66, 1230–1237.
  15. Tong, G., Esserman, D. and Li, F. (2021). Accounting for unequal cluster sizes in designing cluster randomized trials to detect treatment effect heterogeneity. Statistics in Medicine 41, 1376–1396.
  16. Wang, R. and Ware, J. (2013). Detecting moderator effects using subgroup analyses. Prevention Science 14, 111–120.
  17. Yang, L. and Tsiatis, A. (2001). Efficiency study of estimators for a treatment effect in a pretest–posttest trial. The American Statistician 55, 314–321.
  18. Yang, S., Li, F., Starks, M.A., Hernandez, A.F., Mentz, R.J. and Choudhury, K.R. (2020). Sample size requirements for detecting treatment effect heterogeneity in cluster randomized trials. Statistics in Medicine 39, 4218–4237.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
