MASTOR: mixed-model association mapping of quantitative traits in samples with related individuals.

Jakobsdottir J; McPeek MS

doi:10.1016/j.ajhg.2013.03.014

Jakobsdottir J ¹,

McPeek MS

Affiliations

1. Department of Statistics, University of Chicago, Chicago, IL 60637, USA.

ORCIDs linked to this article

Jakobsdottir J | 0000-0002-8019-9683

American Journal of Human Genetics, 01 May 2013, 92(5):652-666
https://doi.org/10.1016/j.ajhg.2013.03.014 PMID: 23643379 PMCID: PMC3644644

Abstract

Genetic association studies often sample individuals with known familial relationships in addition to unrelated individuals, and it is common for some individuals to have missing data (phenotypes, genotypes, or covariates). When some individuals in a sample are related, power can be gained by incorporating all individuals in the analysis, including individuals with partially missing data, while properly accounting for the dependence among them. We propose MASTOR, a mixed-model, retrospective score test for genetic association with a quantitative trait. MASTOR achieves high power in samples that contain related individuals by making full use of the relationship information to incorporate partially missing data in the analysis while correcting for dependence. Individuals with available phenotype and covariate information who are not genotyped but have genotyped relatives in the sample can still contribute to the association analysis because of the dependence among genotypes. Similarly, individuals who are genotyped but are missing covariate or phenotype information can contribute to the analysis. MASTOR is valid even when the phenotype model is misspecified and with either random or phenotype-based ascertainment. In simulations, we demonstrate the correct type 1 error of MASTOR, the increase in power that comes from making full use of the relationship information, the robustness to misspecification of the phenotype model, and the improvement in power that comes from modeling the heritability. We show that MASTOR is computationally feasible and practical in genome-wide association studies. We apply MASTOR to data on high-density lipoprotein cholesterol from the Framingham Heart study.

Am J Hum Genet. 2013 May 2; 92(5): 652–666.

MASTOR: Mixed-Model Association Mapping of Quantitative Traits in Samples with Related Individuals

Johanna Jakobsdottir¹,³ and Mary Sara McPeek¹,²,*

Associated Data

Supplementary Materials: Document S1. Figure S1
mmc1.pdf (54K)

Introduction

Genome-wide association studies commonly contain both related and unrelated individuals. For example, families that have previously been recruited for linkage studies may later be included in an association analysis. In other instances, such as the Framingham Heart Study,¹ individuals may be sampled for a population-based study, with their relatives later recruited. For such studies it is important to develop association analysis tools that can incorporate pedigree and covariate information and can handle missing data. It has been demonstrated that when related individuals are included in an association study, the genealogy needs to be appropriately modeled² and that additional power can be achieved by including all individuals in the analysis while accounting for the dependence resulting from relatedness in the sample.³

For genetic analysis of association with quantitative traits in samples with related individuals, the GTAM method⁴ has previously been developed. GTAM is a prospective, mixed-model analysis that incorporates covariates and uses estimated variance components (VCs) for additive polygenic and environmental variance. (For individuals not known to be related, later variations on the GTAM method include EMMAX⁵ and similar methods,⁶⁻⁸ in which the kinship matrix of GTAM is replaced by an IBS matrix or other empirical relatedness matrix.) The T^SCORE test⁹ addresses the problem of testing association with a quantitative trait when there are high-resolution SNP genotypes for a subset of individuals in each pedigree, with sparser marker data for the remaining individuals. To address this problem, T^SCORE uses genotype imputation, followed by association testing with a score statistic similar to that of GTAM. Family-based association testing approaches can also be used for quantitative traits.¹⁰,¹¹ Family-based approaches, such as FBAT,¹¹ that restrict analysis to the within-family component of association by conditioning on relatives’ genotypes, provide robustness to population structure at the price of strong requirements on the availability of genotype data on relatives in order to have adequate power. Another recent method for association mapping of quantitative traits in related individuals is GQLS,¹² which is the same as the previously proposed WQLS method¹³ for binary traits, except with a quantitative trait used in place of the binary trait. GQLS is a retrospective analysis that does not allow for covariates and does not involve estimation of VCs.

We propose a retrospective mixed-model approach to genetic association mapping of a quantitative trait in samples with related individuals, which we call MASTOR (mixed-model association score test on related individuals). Unlike GQLS, MASTOR allows for covariates and for additive polygenic and environmental VCs. Unlike family-based tests, MASTOR is applicable to completely general samples of related and unrelated individuals, and unlike T^SCORE, MASTOR allows pedigrees with loops. Unlike GTAM and T^SCORE, MASTOR is retrospective and so is more robust to misspecification of the phenotypic model. A further advantage of MASTOR over GQLS and GTAM is in its handling of missing data. It uses the information on dependence of related individuals to allow individuals with missing genotype or missing phenotype to contribute power to the analysis. MASTOR uses a different approach to missing data than does T^SCORE and we compare them in simulations.

MASTOR can be viewed as an extension to quantitative traits of the MQLS method³ for binary traits. Like MQLS, MASTOR can be shown to be asymptotically optimal when, for example, the genetic effect of the tested locus is additive with effect size tending to 0. Features of MASTOR include the following: (1) it is applicable to and computationally feasible for essentially arbitrary combinations of related and unrelated individuals, including small outbred pedigrees and unrelated individuals, as well as large, complex inbred pedigrees; (2) it is computationally feasible for genetic studies with millions of markers; and (3) it incorporates phenotype and covariate information on relatives who have missing genotype data at the marker being tested, incorporates genotype data on individuals with missing phenotype or covariate information, and appropriately accounts for the uncertainty resulting from missing data and dependence. For comparison, we also propose the ASTOR test, which is a simplified version of MASTOR that also corrects for dependence but does not require estimation of heritability.

In order to assess the type 1 error and power of MASTOR and compare it to previously proposed statistics, we perform simulations under various models for quantitative traits, allowing for multiple causal loci with epistasis, covariates, both normal and nonnormal error, polygenic additive and/or dominance and environmental components of variance, and various assumptions about missing genotype, phenotype, and ascertainment. We illustrate the applicability of the method by analyzing data on high-density lipoprotein cholesterol (HDL-C) from the Framingham Heart Study.

Material and Methods

Suppose the data for the genetic association study include genotype, phenotype, pedigree, and relevant covariate information on a sample of individuals, where we allow missing data. We assume that the phenotype is quantitative, i.e., continuously varying. We consider an analysis in which genetic variants are tested, one at a time, for association with the trait. For simplicity, we assume that each tested variant is biallelic. (The extension to a multiallelic variant is described in Appendix A.) The individuals in the sample can be arbitrarily related, with the pedigree(s) that specify the relationships assumed to be known. Unrelated individuals can also be included in the sample, or the entire sample could consist of unrelated individuals.

We fix a particular genetic variant, which we refer to as the variant of interest, and we arbitrarily label its two alleles “0” and “1.” Let N denote the subset of individuals in the sample who have nonmissing genotype at the variant of interest. Let n = |N| be the number of individuals in set N. Let R denote the subset of individuals in the sample who have nonmissing phenotype value and no missing covariates and who satisfy at least one of the following three criteria: (1) they have nonmissing genotype at the variant of interest, (2) they have a relative in the study with nonmissing genotype at the variant of interest, or (3) they are in the same pedigree with an individual with nonmissing phenotype and no missing covariates who either has nonmissing genotype or a relative with nonmissing genotype at the variant of interest. Let r = |R| denote the number of individuals in set R. The set of individuals included in the MASTOR analysis is N ∪ R. That is, in addition to individuals who have complete data on phenotype, covariates, and genotype at the variant of interest, we also include in the analysis (1) individuals who are genotyped at the variant of interest but are missing phenotype or covariate information and (2) individuals who have phenotype and covariate information but have missing genotype at the variant of interest, provided that they have at least one relative who is genotyped at the variant of interest (or are in a pedigree that meets condition 3 above).
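The set definitions above can be illustrated with a small sketch. It is only an illustration under stated assumptions: the record layout (`geno`, `pheno`, `covs`, `ped`), the function name `analysis_sets`, and the reading of "relative" as "pair with nonzero kinship coefficient" are hypothetical choices made here, not part of the MASTOR software.

```python
def analysis_sets(people, kinship):
    """Return the sets (N, R, S) for one variant of interest.

    people  : dict id -> {"geno", "pheno", "covs", "ped"}; None marks a
              missing value and "ped" is a pedigree label (hypothetical layout).
    kinship : dict frozenset({i, j}) -> kinship coefficient; pairs that are
              absent from the dict are treated as unrelated.
    """
    def related(i, j):
        # Assumption: "relative" means a nonzero kinship coefficient.
        return i != j and kinship.get(frozenset((i, j)), 0.0) > 0.0

    # N: individuals with nonmissing genotype at the variant of interest.
    N = {i for i, d in people.items() if d["geno"] is not None}

    # Candidates for R: nonmissing phenotype and no missing covariates.
    complete = {
        i for i, d in people.items()
        if d["pheno"] is not None and all(c is not None for c in d["covs"])
    }

    # Criteria (1) and (2): genotyped, or has a genotyped relative.
    direct = {i for i in complete if i in N or any(related(i, j) for j in N)}

    # Criterion (3): same pedigree as someone who meets (1) or (2).
    linked_peds = {people[i]["ped"] for i in direct}
    R = {i for i in complete if i in direct or people[i]["ped"] in linked_peds}

    return N, R, N & R


people = {
    "father": {"geno": 1.0, "pheno": 1.2, "covs": [45], "ped": "A"},
    "mother": {"geno": None, "pheno": 0.8, "covs": [43], "ped": "A"},
    "child": {"geno": None, "pheno": 1.0, "covs": [12], "ped": "A"},
    "loner": {"geno": None, "pheno": 0.9, "covs": [50], "ped": "B"},
    "donor": {"geno": 0.5, "pheno": None, "covs": [30], "ped": "C"},
}
kinship = {
    frozenset(("father", "child")): 0.25,
    frozenset(("mother", "child")): 0.25,
}
N, R, S = analysis_sets(people, kinship)
```

In this toy sample the mother enters R only through criterion (3), the ungenotyped unrelated individual is excluded, and the genotyped individual without a phenotype belongs to N but not to R, so N ∪ R is strictly larger than S = N ∩ R.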

Let X = (X₁,…, X_n)^T denote the vector of genotype data for the individuals in N, where X_i = 0, .5, or 1, according to whether the i^th genotyped individual has 0, 1, or 2 copies of allele 1 at the variant of interest. Let Y = (Y₁,…, Y_r)^T denote the vector of phenotype data, where Y_i is the quantitative trait value for the i^th individual in set R, and let W be the r × (w + 1) matrix of covariates with (i, j)^th entry W_ij equal to the value of the j^th covariate for the i^th individual in set R. We assume that W always includes an intercept (i.e., column of ones) and therefore has w + 1 columns, where w is the number of covariates to be included in the analysis in addition to the intercept.

In MASTOR, we analyze the data retrospectively, i.e., we condition on (Y, W), the covariates and phenotype, and treat the genotype vector, X, as random. This approach is appropriate with either random or phenotype-based ascertainment. Additionally, this order of conditioning provides a natural way of incorporating information on individuals with partially missing data.³

In what follows, we first give a brief review of the mixed-effects GTAM method of Abney et al.,⁴ which is closely connected to MASTOR. We then describe MASTOR and show how it can gain power by using additional information not used by GTAM. We also briefly describe the T^SCORE statistic⁹ and contrast it with the MASTOR and GTAM statistics.

Brief Review of the Mixed-Effects GTAM Method of Abney et al.

We restrict attention to a special case of the GTAM method⁴ in which one tests for an additive effect of the variant of interest (1-df test), in the presence of covariates, with additive polygenic effects and independent normal error in the model. For computational convenience, we use an asymptotic assessment of significance instead of the more robust, covariance-preserving permutation test described by Abney et al.⁴

The GTAM method uses only the set of individuals, S = N ∩ R, who have complete data on phenotype, covariates, and genotype at the variant of interest. Let s = |S| be the number of such individuals. We let Y_S and X_S represent the length s subvectors of Y and X, respectively, obtained by considering only the individuals in S. Similarly, we let W_S denote the s × (w + 1) submatrix of W obtained by extracting the s rows of W that correspond to individuals in S.

In GTAM, the analysis is prospective, not retrospective. In other words, we condition on (W_S, X_S), the covariates and genotypes, and treat the quantitative trait vector, Y_S, as random. This approach is clearly justified for individuals randomly sampled from a population, though the justification is less obvious when ascertainment is based on phenotype. The quantitative trait (possibly after suitable transformation of phenotypes and/or covariates) is modeled as

Y_S = W_Sβ + X_Sγ + ε_S,

(Equation 1)

where β is the (w + 1) × 1 vector of covariate effects, including intercept, γ is the (scalar) association parameter, measuring the effect of genotype on phenotype, and ε_S is a random vector of length s, satisfying

ϵ_{S} \sim N_{s} (0, σ_{a}^{2} Φ_{S S} + σ_{e}^{2} I_{s}),

(Equation 2)

i.e., ε_S has the s-dimensional multivariate normal distribution with mean vector 0 and variance matrix $σ_{a}^{2} Φ_{S S} + σ_{e}^{2} I_{s}$ , where $σ_{a}^{2}$ represents additive genetic variance, $σ_{e}^{2}$ represents variance due to measurement error or environmental effects assumed to be acting independently on individuals, I_s is the s × s identity matrix, and Φ_SS is the s × s kinship matrix for the individuals in S, given by

Φ_{S S} = \begin{pmatrix} 1 + h_{1} & 2 ϕ_{12} & \cdots & 2 ϕ_{1 s} \\ 2 ϕ_{21} & 1 + h_{2} & \cdots & 2 ϕ_{2 s} \\ \vdots & \vdots & \ddots & \vdots \\ 2 ϕ_{s 1} & 2 ϕ_{s 2} & \cdots & 1 + h_{s} \end{pmatrix},

(Equation 3)

where ϕ_ij is the kinship coefficient between the i^th and j^th individuals of S, and h_i is the inbreeding coefficient of the i^th individual of S. Note that the identity 2ϕ_ii = 1 + h_i holds for all i, so the diagonal entries can be equivalently expressed as twice the self-kinship coefficients. Here, as in the original GTAM method, Φ_SS is taken to be the pedigree-based kinship matrix, though an estimated kinship matrix based on genotype data, such as that of Han and Abney,¹⁴ has also been used in GTAM.
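As a concrete illustration of Equation 3, the following sketch builds a pedigree-based kinship matrix with the standard recursive rule for kinship coefficients. The function name `kinship_matrix` and the `(id, father, mother)` input layout are assumptions made for this example; the only requirement is that parents are listed before their children.

```python
def kinship_matrix(pedigree):
    """Return the matrix of Equation 3 (entries 2*phi_ij, diagonal 1 + h_i).

    pedigree : list of (id, father, mother) tuples with parents listed before
               children; founders have father = mother = None.
    """
    index = {}
    n = len(pedigree)
    phi = [[0.0] * n for _ in range(n)]  # kinship coefficients phi_ij
    for i, (pid, fa, mo) in enumerate(pedigree):
        index[pid] = i
        f = index[fa] if fa is not None else None
        m = index[mo] if mo is not None else None
        # Kinship with every earlier individual j: average over the two parents.
        for j in range(i):
            value = 0.0
            if f is not None:
                value += 0.5 * phi[f][j]
            if m is not None:
                value += 0.5 * phi[m][j]
            phi[i][j] = phi[j][i] = value
        # Self-kinship: (1 + h_i) / 2, where h_i is the kinship of the parents.
        h = phi[f][m] if (f is not None and m is not None) else 0.0
        phi[i][i] = 0.5 * (1.0 + h)
    return [[2.0 * phi[i][j] for j in range(n)] for i in range(n)]


# Two founders, two full siblings, and an inbred child of the siblings.
pedigree = [
    ("fa", None, None),
    ("mo", None, None),
    ("k1", "fa", "mo"),
    ("k2", "fa", "mo"),
    ("gc", "k1", "k2"),
]
Phi = kinship_matrix(pedigree)
```

For this pedigree the sibling entry is 0.5, the parent-child entries are 0.5, and the diagonal entry for the inbred child is 1.25, which corresponds to an inbreeding coefficient of 0.25.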

We find it convenient to reparameterize the VCs, $(σ_{a}^{2}, σ_{e}^{2})$ , in terms of $(σ_{T}^{2}, ξ)$ , where $σ_{T}^{2} = σ_{a}^{2} + σ_{e}^{2}$ represents the total variance of the trait and $ξ = σ_{a}^{2} / (σ_{a}^{2} + σ_{e}^{2})$ is the (narrow-sense) heritability of the trait. With this parameterization, we can rewrite Equation 2 as

ϵ_{S} \sim N_{s} (0, σ_{T}^{2} Σ_{S}), where Σ_{S} = ξ Φ_{S S} + (1 - ξ) I_{s} .

(Equation 4)
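The equivalence of Equations 2 and 4 follows in one line from the reparameterization. The numerical values in the last line are purely illustrative and are not estimates from the paper.

```latex
\sigma_a^2 = \xi\,\sigma_T^2, \qquad \sigma_e^2 = (1-\xi)\,\sigma_T^2
\quad\Longrightarrow\quad
\sigma_a^2\,\Phi_{SS} + \sigma_e^2\,I_s
  = \sigma_T^2\bigl[\xi\,\Phi_{SS} + (1-\xi)\,I_s\bigr]
  = \sigma_T^2\,\Sigma_S .
% Illustration for an outbred sibling pair (off-diagonal entry of \Phi_{SS} equal to 1/2):
% \sigma_a^2 = 0.3,\ \sigma_e^2 = 0.7 \Rightarrow \sigma_T^2 = 1,\ \xi = 0.3,\quad
% \Sigma_S = \begin{pmatrix} 1 & 0.15 \\ 0.15 & 1 \end{pmatrix}.
```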

The null hypothesis of no association is H₀ : γ = 0, and the alternative hypothesis is H_A : γ ≠ 0. The GTAM statistic is asymptotically equivalent to the score statistic for this null hypothesis, assuming the model given in Equations 1, 2, 3, and 4. To calculate the GTAM statistic, we first estimate the heritability, ξ, by its null maximum likelihood estimate (MLE), i.e., its MLE in the submodel of Equations 1, 2, 3, and 4 in which γ = 0. We call this null MLE ${\hat{ξ}}_{0 S}$ , and we compute it only once per genome screen (at least at the initial stage of analysis). Let ${\hat{Σ}}_{S}$ denote Σ_S evaluated at $ξ = {\hat{ξ}}_{0 S}$ . Then, for each marker, the calculation of the GTAM statistic can be expressed in terms of a generalized regression based on Equation 1, where we take $ϵ_{S} \sim N_{s} (0, σ_{T}^{2} {\hat{Σ}}_{S})$ , with ${\hat{Σ}}_{S}$ treated as fixed and known and β, γ, and $σ_{T}^{2}$ treated as unknown. Then the parameter estimates, $\hat{β}$ , $\hat{γ}$ , and ${\hat{σ}}_{T}^{2}$ , can be obtained by generalized regression under this model, and the GTAM statistic is equal to the generalized-regression t statistic for testing H₀ : γ = 0 in this model. The statistic is given by

GTAM = \frac{\hat{γ} \sqrt{X_{S}^{T} P_{S} X_{S}}}{\sqrt{{\hat{σ}}_{T}^{2}}} = \frac{(X_{S}^{T} P_{S} Y_{S})}{\sqrt{(X_{S}^{T} P_{S} X_{S}) (Y_{S}^{T} Q_{S} Y_{S}) {(s - w - 2)}^{- 1}}},

(Equation 5)

where $P_{S} = {\hat{Σ}}_{S}^{- 1} - {\hat{Σ}}_{S}^{- 1} W_{S} {(W_{S}^{T} {\hat{Σ}}_{S}^{- 1} W_{S})}^{- 1} W_{S}^{T} {\hat{Σ}}_{S}^{- 1}$ and $Q_{S} = P_{S} - P_{S} X_{S} {(X_{S}^{T} P_{S} X_{S})}^{- 1} X_{S}^{T} P_{S}$ are both symmetric s × s matrices (see Abney et al.⁴ for further details). We consider (GTAM)² and assess its p value by using a $χ_{1}^{2}$ asymptotic distribution under the null hypothesis.
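Equation 5 and the asymptotic p value can be sketched in a few lines of NumPy. This is a minimal illustration that assumes $\hat{Σ}_{S}$ has already been computed and is passed in as `Sigma_hat`; the function names are hypothetical and this is not the GTAM software.

```python
import math
import numpy as np


def gtam_statistic(X_S, Y_S, W_S, Sigma_hat):
    """Second form of Equation 5 for a single marker (illustrative sketch)."""
    s, k = W_S.shape                        # k = w + 1 (covariates + intercept)
    Si = np.linalg.inv(Sigma_hat)
    Si_W = Si @ W_S                         # Sigma^{-1} W
    P = Si - Si_W @ np.linalg.solve(W_S.T @ Si_W, Si_W.T)
    PX = P @ X_S
    xPx = X_S @ PX                          # X^T P X
    xPy = PX @ Y_S                          # X^T P Y
    yQy = Y_S @ P @ Y_S - xPy ** 2 / xPx    # Y^T Q Y, expanded from Q's definition
    sigma2_T = yQy / (s - k - 1)            # denominator is s - w - 2
    return float(xPy / math.sqrt(xPx * sigma2_T))


def chi2_1df_pvalue(stat):
    """Upper tail of the chi-square distribution with 1 df, via erfc."""
    return math.erfc(math.sqrt(stat / 2.0))
```

A typical call would be `chi2_1df_pvalue(gtam_statistic(X_S, Y_S, W_S, Sigma_hat) ** 2)`. Expanding $Y_{S}^{T} Q_{S} Y_{S}$ as $Y_{S}^{T} P_{S} Y_{S} - (X_{S}^{T} P_{S} Y_{S})^{2} / (X_{S}^{T} P_{S} X_{S})$ avoids forming $Q_{S}$ explicitly.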

A technical point is that the set S typically differs from marker to marker, but we would like to avoid re-estimating the null MLE of the heritability, ${\hat{ξ}}_{0 S}$ , at every marker, at least in the initial analysis. Instead, we try to use some representative set of individuals to obtain ${\hat{ξ}}_{0 S}$ , e.g., we might use the set of individuals who have nonmissing phenotype and covariate information and who have genotype data for some minimum number of markers. Then, once a subset of interesting markers has been identified by the initial analysis, one could fit the full model in Equation 1 for each of those markers, including MLE estimation of ξ based on the individuals genotyped at each marker in the subset. This approach⁴ reduces the computational burden of the method by not requiring numerical maximum likelihood estimation of ξ at every marker in the entire screen.
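The once-per-screen estimation of the heritability can be sketched as follows. The text does not specify the numerical optimizer, so the grid search below is a stand-in chosen for simplicity, and the function names are hypothetical. The sketch profiles β and $σ_{T}^{2}$ out of the normal likelihood of the null model and maximizes over ξ.

```python
import numpy as np


def profile_loglik(xi, Y, W, Phi):
    """Null log-likelihood profiled over beta and sigma_T^2 (constants dropped)."""
    r = len(Y)
    Sigma = xi * Phi + (1.0 - xi) * np.eye(r)
    Si_W = np.linalg.solve(Sigma, W)            # Sigma^{-1} W
    Si_Y = np.linalg.solve(Sigma, Y)            # Sigma^{-1} Y
    beta = np.linalg.solve(W.T @ Si_W, W.T @ Si_Y)   # generalized least squares
    resid = Y - W @ beta
    sigma2_T = resid @ np.linalg.solve(Sigma, resid) / r   # ML estimate
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (r * np.log(sigma2_T) + logdet)


def null_mle_heritability(Y, W, Phi, grid=None):
    """Grid-search stand-in for the null MLE of xi (assumed optimizer)."""
    if grid is None:
        grid = np.linspace(0.0, 0.99, 100)   # stop short of 1 to keep Sigma well conditioned
    values = [profile_loglik(xi, Y, W, Phi) for xi in grid]
    return float(grid[int(np.argmax(values))])
```

The returned value would then be held fixed across all markers in the initial screen, and re-estimated only for the subset of markers selected for follow-up.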

MASTOR

MASTOR can be viewed as an extension, to quantitative traits, of the MQLS method³ for binary traits. In contrast to GTAM, MASTOR is based on a retrospective approach, rather than a prospective approach, and it uses the larger set of individuals N ∪ R, rather than S = N ∩ R. One advantage of the retrospective approach is that it provides a natural way to incorporate information on individuals with missing genotype by using the known dependence among relatives’ genotypes under the null hypothesis. This allows MASTOR to use extra information not used by GTAM. Another advantage is that the retrospective analysis is less dependent on correct specification of the phenotypic covariance matrix under the null hypothesis. (This is because the correct calibration of MASTOR depends only on the null conditional mean and variance of the genotype data, not on the phenotype model. See Appendix C for mathematical details and subsection Assessment of Type 1 Error and the Impact of Variance Components and subsection Power Studies for empirical confirmation.) We first present the calculation of the MASTOR statistic and then provide the justification for it.

The MASTOR statistic takes the form

MASTOR = \frac{{(V^{T} X)}^{2}}{{\hat{Var}}_{0} (V^{T} X | W, Y)} = \frac{{(V^{T} X)}^{2}}{{\hat{σ}}_{X}^{2} V^{T} Φ_{N N} V},

(Equation 6)

where V, which will be defined in the next paragraph, is a vector of length n that is a function of W, Y, and the pedigree information. We assume that under the null hypothesis of no association between genotype and phenotype, we have E₀(V^TX | W, Y) = 0 and ${Var}_{0} (X | W, Y) = σ_{X}^{2} Φ_{N N}$ , where Φ_NN is the n × n kinship matrix for the individuals in N (similar to Equation 3) and $σ_{X}^{2}$ is an unknown scalar. In that case, we have ${Var}_{0} (V^{T} X | W, Y) = σ_{X}^{2} V^{T} Φ_{N N} V$ . A previous work¹⁵ suggests estimation of $σ_{X}^{2}$ by

{\hat{σ}}_{X}^{2} = X^{T} U X {(n - 1)}^{- 1}, where U = Φ_{N N}^{- 1} - Φ_{N N}^{- 1} 1 {(1^{T} Φ_{N N}^{- 1} 1)}^{- 1} 1^{T} Φ_{N N}^{- 1},

(Equation 7)

where 1 denotes a vector of length n with every element equal to 1. An alternative estimator that accounts for possible dependence between genotype and covariates is derived by replacing U in Equation 7 with

U^{'} = Φ_{Q Q}^{- 1} - Φ_{Q Q}^{- 1} W_{Q} {(W_{Q}^{T} Φ_{Q Q}^{- 1} W_{Q})}^{- 1} W_{Q}^{T} Φ_{Q Q}^{- 1}

(Equation 8)

and replacing (n − 1) in Equation 7 by (q − w − 1), where Q is defined to be the set of individuals with both genotype and covariate information, but who may or may not have known phenotype (S ⊂ Q ⊂ N), with q = |Q|, Φ_QQ is the q × q kinship matrix for individuals in Q, and W_Q is the q × (w + 1) submatrix of W that is obtained by extracting the q rows of W that correspond to individuals in Q. Given the choice of ${\hat{σ}}_{X}^{2}$ based on either Equation 7 or 8, we use ${\hat{Var}}_{0} (V^{T} X | W, Y) = {\hat{σ}}_{X}^{2} V^{T} Φ_{N N} V$ . Note that if N contains both members of a monozygotic (MZ) twin pair, Φ_NN is not invertible and a similar problem arises for Φ_QQ. This difficulty can be easily overcome (see Appendix B).
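The two variance estimators can be sketched as follows. The function names are hypothetical, and the use of the genotype subvector for the individuals in Q (called `X_Q` here) in the covariate-adjusted version is an assumption about how the replacement in Equation 7 is applied. Both functions assume an invertible kinship matrix, so the MZ twin case would need the separate treatment of Appendix B.

```python
import numpy as np


def centering_matrix(Phi, W=None):
    """U of Equation 7 (W=None, intercept only) or U' of Equation 8."""
    n = Phi.shape[0]
    if W is None:
        W = np.ones((n, 1))
    Pi_W = np.linalg.solve(Phi, W)          # Phi^{-1} W
    return np.linalg.inv(Phi) - Pi_W @ np.linalg.solve(W.T @ Pi_W, Pi_W.T)


def sigma2_X_hat(X, Phi_NN):
    """Equation 7: X^T U X / (n - 1)."""
    n = len(X)
    U = centering_matrix(Phi_NN)
    return float(X @ U @ X) / (n - 1)


def sigma2_X_hat_cov(X_Q, Phi_QQ, W_Q):
    """Covariate-adjusted variant: X_Q^T U' X_Q / (q - w - 1)."""
    q, k = W_Q.shape                        # k = w + 1
    U_prime = centering_matrix(Phi_QQ, W_Q)
    return float(X_Q @ U_prime @ X_Q) / (q - k)
```

When all genotyped individuals are unrelated and outbred, Φ_NN is the identity and Equation 7 reduces to the ordinary sample variance of the genotype values, which gives a quick sanity check on the implementation.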

To define the vector V of Equation 6, we first obtain a null MLE of the heritability, call it ${\hat{ξ}}_{0}$ , which is similar to the ${\hat{ξ}}_{0 S}$ obtained for GTAM, except that it is based on the larger set of individuals, R, rather than on S ⊂ R. Specifically, we let ${\hat{ξ}}_{0}$ denote the MLE of ξ in the model

Y = Wβ + ε

(Equation 9)

(possibly after suitable transformation of phenotypes and/or covariates), where $ϵ \sim N_{r} (0, σ_{T}^{2} Σ)$ , with Σ ≡ ξΦ_RR + (1 − ξ)I_r, where $σ_{T}^{2}$ is the sum of the additive and environmental trait variances as before, Φ_RR is the r × r kinship matrix for individuals in R (similar to Equation 3), and I_r is the r × r identity matrix. Because it is based on a different subset of individuals, the null MLE ${\hat{ξ}}_{0}$ for MASTOR will generally differ from the ${\hat{ξ}}_{0 S}$ of GTAM. We let $\hat{Σ}$ denote Σ evaluated at $ξ = {\hat{ξ}}_{0}$ . Next we calculate the transformed phenotypic residuals from the generalized regression based on Equation 9, where we take $ϵ \sim N_{r} (0, σ_{T}^{2} \hat{Σ})$ , with $\hat{Σ}$ treated as fixed and known and β and $σ_{T}^{2}$ as unknown. In this generalized regression, $\hat{β}$ is obtained as $\hat{β} = {(W^{T} {\hat{Σ}}^{- 1} W)}^{- 1} W^{T} {\hat{Σ}}^{- 1} Y$ , and we define the transformed phenotypic residual to be ${\hat{Σ}}^{- 1} (Y - W \hat{β}) = P Y$ , where $P = {\hat{Σ}}^{- 1} - {\hat{Σ}}^{- 1} W {(W^{T} {\hat{Σ}}^{- 1} W)}^{- 1} W^{T} {\hat{Σ}}^{- 1}$ . Then, we define the vector V by V = UΦ_NRPY, where Φ_NR is the n × r cross-kinship matrix having (i, j)^th entry equal to twice the kinship coefficient between the i^th individual in set N and the j^th individual in set R. The resulting MASTOR statistic is

MASTOR = \frac{{(X^{T} U Φ_{N R} P Y)}^{2}}{(Y^{T} P Φ_{R N} U Φ_{N R} P Y) {\hat{σ}}_{X}^{2}},

(Equation 10)

where $Φ_{R N} = Φ_{N R}^{T}$ . The following two subsections give various ways of understanding this statistic.
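Putting the pieces together, the statistic of Equations 6 and 10 can be sketched in NumPy. This is an illustrative sketch rather than the MASTOR software: the function names are hypothetical, the heritability estimate is passed in as `xi_hat`, the variance estimator is the one in Equation 7, and Φ_NN is assumed to be invertible.

```python
import numpy as np


def gls_projection(Sigma, W):
    """Sigma^{-1} - Sigma^{-1} W (W^T Sigma^{-1} W)^{-1} W^T Sigma^{-1}."""
    Si_W = np.linalg.solve(Sigma, W)
    return np.linalg.inv(Sigma) - Si_W @ np.linalg.solve(W.T @ Si_W, Si_W.T)


def mastor_statistic(X, Y, W, Phi_NN, Phi_NR, Phi_RR, xi_hat):
    """MASTOR statistic in the form of Equation 6 (illustrative sketch).

    X       : genotypes (0, .5, 1) for the n individuals in N
    Y, W    : phenotypes and covariates (with intercept) for the r individuals in R
    Phi_NR  : n x r cross-kinship matrix (entries 2 * phi_ij)
    xi_hat  : null estimate of heritability, computed once per screen
    """
    n, r = len(X), len(Y)
    Sigma_hat = xi_hat * Phi_RR + (1.0 - xi_hat) * np.eye(r)
    P = gls_projection(Sigma_hat, W)                 # P of Equation 10
    U = gls_projection(Phi_NN, np.ones((n, 1)))      # U of Equation 7
    V = U @ Phi_NR @ (P @ Y)                         # V = U Phi_NR P Y
    sigma2_X = float(X @ U @ X) / (n - 1)            # Equation 7
    return float(V @ X) ** 2 / (sigma2_X * float(V @ Phi_NN @ V))
```

The denominator uses $V^{T} Φ_{N N} V$ from Equation 6; because U Φ_NN U = U, this equals the quadratic form $Y^{T} P Φ_{R N} U Φ_{N R} P Y$ that appears in Equation 10.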

Special Case: Complete Data

In the special case when there are complete data on all sampled individuals, we have R = N = S. In that case, it is easily verified that UΦ_NR P = P, so the numerators of the (GTAM)² and MASTOR statistics are both equal to (X^TPY)². Their denominators represent different estimators of the variance of X^TPY, with the denominator of (GTAM)² representing an estimator of Var(X^TPY | W, X) under a prospective model, and the denominator of MASTOR representing an estimator of Var₀(X^TPY | W, Y) under a retrospective model.
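The identity UΦ_NR P = P in the complete-data case can be checked directly from the definitions of U and P. The steps below are a short verification that relies only on the assumption, stated earlier, that W contains an intercept column; e₁ denotes the coordinate vector that selects that column.

```latex
% Complete data: R = N = S, so \Phi_{NR} = \Phi_{NN} \equiv \Phi.
U\Phi = I - \Phi^{-1}\mathbf{1}\,(\mathbf{1}^{T}\Phi^{-1}\mathbf{1})^{-1}\mathbf{1}^{T}
\quad\Longrightarrow\quad
U\Phi P = P - \Phi^{-1}\mathbf{1}\,(\mathbf{1}^{T}\Phi^{-1}\mathbf{1})^{-1}\,\mathbf{1}^{T}P .

% W contains the intercept column, so \mathbf{1} = W e_1, and PW = 0 by construction:
PW = \hat{\Sigma}^{-1}W - \hat{\Sigma}^{-1}W\,(W^{T}\hat{\Sigma}^{-1}W)^{-1}W^{T}\hat{\Sigma}^{-1}W = 0
\quad\Longrightarrow\quad
\mathbf{1}^{T}P = (P\,W e_1)^{T} = 0 .

% Hence
U\Phi_{NR}P = P, \qquad V = U\Phi_{NR}PY = PY, \qquad V^{T}X = Y^{T}PX = X^{T}PY .
```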

In the complete data case, the assumptions required for MASTOR to be a correctly calibrated statistic include the following. (1) E₀(X | W, Y) = Wα, where α is an unknown (w + 1) vector of coefficients. In other words, under the null hypothesis of no association between genotype and phenotype, the genotype is permitted to be linearly related to the covariates, or it can be unrelated to the covariates. (2) ${Var}_{0} (X | W, Y) = σ_{X}^{2} Φ_{N N}$ , where $σ_{X}^{2}$ is an unknown scalar. This is a version of the standard variance relationship that holds, for example, under Mendelian inheritance in a single population. Here, however, we do not require $σ_{X}^{2} = \frac{1}{2} p (1 - p)$ , where p is allele frequency, which would hold under Hardy-Weinberg equilibrium. Instead, we use the more robust estimator, ${\hat{σ}}_{X}^{2}$ , given in the previous subsection.

Justification for the MASTOR Statistic

We briefly present two different ways of understanding and justifying the MASTOR statistic. First, an intuitively clear interpretation of the MASTOR statistic is obtained by noting that it can be rewritten (see Appendix C) as

MASTOR = \frac{{({\hat{X}}^{T} P Y)}^{2}}{{\hat{Var}}_{0} ({\hat{X}}^{T} P Y | W, Y)},

(Equation 11)

where $\hat{X}$ is defined to be a vector of length r whose entry for individual i ∈ R is X_i if i’s genotype is observed (i.e., if i is in N) and is ${\hat{X}}_{i}$ if i’s genotype is not observed, where ${\hat{X}}_{i}$ is the best linear unbiased predictor (BLUP) of the missing genotype of individual i, based on the remaining genotype data at the marker.¹⁶ The BLUP is given by

{\hat{X}}_{i} = \hat{p} + \sum_{j \in N} 2 ϕ_{i j} {[Φ_{N N}^{- 1} (X - \hat{p} 1)]}_{j},

(Equation 12)

where

\hat{p} = {(1^{T} Φ_{N N}^{- 1} 1)}^{- 1} 1^{T} Φ_{N N}^{- 1} X

(Equation 13)

is the best linear unbiased estimator (BLUE) of allele frequency.¹⁷ When individual i ∈ N (i.e., i is genotyped), the formula for ${\hat{X}}_{i}$ reduces to X_i, the observed genotype for i. The version of the MASTOR statistic given in Equation 11 is analogous to the MASTOR statistic given for complete data, except with X replaced by $\hat{X}$ (and with the variance suitably adjusted for the additional uncertainty). In the case of incomplete data, the main assumptions needed for the MASTOR statistic to be a correctly calibrated statistic can be written as (1) $E_{0} (\hat{X} | W, Y) = W α$ , where α is an unknown length (w + 1) vector of coefficients, and (2) ${Var}_{0} (X | W, Y) = σ_{X}^{2} Φ_{N N}$ , where $σ_{X}^{2}$ is an unknown scalar (see Appendix C for details). These are identical to the assumptions for the complete-data case described in subsection Special Case: Complete Data above, except that in assumption (1), X is replaced by $\hat{X}$ .
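Equations 12 and 13 translate directly into a few lines of NumPy. The sketch below is illustrative: the function names and the argument `Phi_TN` (rows of 2ϕ_ij between each target individual and the genotyped individuals) are hypothetical, and Φ_NN is assumed to be invertible.

```python
import numpy as np


def blue_allele_frequency(X, Phi_NN):
    """Equation 13: (1^T Phi^{-1} 1)^{-1} 1^T Phi^{-1} X."""
    ones = np.ones(len(X))
    Pi_1 = np.linalg.solve(Phi_NN, ones)     # Phi^{-1} 1
    return float(Pi_1 @ X) / float(Pi_1 @ ones)


def blup_genotypes(X, Phi_NN, Phi_TN):
    """Equation 12 for every row of Phi_TN (one row per target individual)."""
    p_hat = blue_allele_frequency(X, Phi_NN)
    return p_hat + Phi_TN @ np.linalg.solve(Phi_NN, X - p_hat)


# Two unrelated genotyped individuals with genotype values 1 and 0.
X = np.array([1.0, 0.0])
Phi_NN = np.eye(2)
Phi_TN = np.array([
    [0.5, 0.5],   # ungenotyped child of both genotyped individuals
    [0.5, 0.0],   # ungenotyped child of the first individual only
    [1.0, 0.0],   # the first genotyped individual itself
    [0.0, 0.0],   # individual unrelated to everyone genotyped
])
X_hat = blup_genotypes(X, Phi_NN, Phi_TN)
```

In this toy example the predictor for the child of both genotyped parents is the midparent value, the predictor for a genotyped individual reproduces the observed genotype, and an individual with no genotyped relatives is assigned the estimated allele frequency.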

The interpretation of MASTOR in terms of BLUP imputation of genotypes is useful in understanding the role played by missing data in MASTOR. Suppose individual i has phenotype and covariate information but is missing genotype data. Then, in the MASTOR method, the greatest potential contribution i could make to the association analysis would occur if he or she had genotyped relatives whose information could then be used in the BLUP imputation of i’s genotype. A somewhat less-substantial contribution that i could make to the association analysis would occur if i had no genotyped relatives but was in the same pedigree with an individual j such that ${({\hat{Σ}}^{- 1})}_{i j} \neq 0$ , where j has phenotype and covariate information available and is either genotyped or has a genotyped relative. In that case, ${({\hat{Σ}}^{- 1})}_{i j} \neq 0$ means that i could contribute information to the transformed phenotypic residual of j, and j would have either an observed genotype or a BLUP and so would contribute to the analysis. Conversely, suppose individual i has genotype information, but is missing phenotype or covariate information. In that case, i will contribute to the analysis if there is at least one individual, j, who has nonmissing phenotype and covariate information, missing genotype information, and a genotyped relative. In that case, if i is not related to j, then the contribution of i will be in improved estimation of the nuisance parameter $\hat{p}$ , which is used in the BLUP imputation for j, whereas if i is related to j, then i also provides more direct information about j’s genotype and will contribute to the BLUP, ${\hat{X}}_{j}$ , through both the first and second terms of Equation 12.

A second interpretation of MASTOR is that it is a quasilikelihood score test of the null hypothesis H₀ : δ = 0, versus H_A : δ ≠ 0 in the retrospective mean model

E(X|W, Y) = p1 + δΦ_NRPY.

(Equation 14)

(See Bourgain et al.¹³ and Wang and McPeek¹⁸ for more details on quasilikelihood score tests in this setting.) For a genotyped individual i, the conditional expectation in Equation 14 can be rewritten as

E (X_{i} | W, Y) = p + δ \sum_{j \in R} 2 ϕ_{i j} {[{\hat{Σ}}^{- 1} (Y - W \hat{β})]}_{j},

(Equation 15)

so the summation can be viewed as a weighted sum of the transformed phenotypic residuals, i.e., elements of ${\hat{Σ}}^{- 1} (Y - W \hat{β})$ , with weights proportional to the kinship coefficient between each individual and individual i. This represents the quantitative trait version of the enrichment effect,³ which basically says that phenotypic values of an individual’s relatives provide additional information, beyond that provided by the individual’s own phenotypic value, about the probability that the individual carries an allele affecting the trait.

For outbred individuals, the mean model given in Equations 14 and 15 holds up to terms of order o(δ) as δ → 0 assuming a general, 2-allele, prospective model of the form E(Y_i|W, X) = W_iβ + γ₁1_{X_i=.5} + γ₂1_{X_i=1} + ϵ_i, with $ϵ \sim N_{r} (0, σ_{T}^{2} Σ)$ and Σ = ξΦ_RR + (1 − ξ)I_r, where γ₁ and γ₂ are both tending to 0, combined with the assumptions E₀(X | W) = p1 and ${Var}_{0} (X | W) = σ_{X}^{2} Φ_{N N}$ . The connection between δ of Equations 14 and 15 and (γ₁, γ₂) is that $δ = σ_{T}^{- 2} p (1 - p) (γ_{1} (1 - 2 p) + γ_{2} p)$ , where the model of Equations 14 and 15 holds as this tends to zero. (For inbred individuals, the mean model given in Equations 14 and 15 is derived under the further assumption that the genetic effect is additive or multiplicative.)

Despite the fact that the mean model of Equations 14 and 15 does not allow X to depend linearly on W under the null hypothesis, MASTOR is still correctly calibrated when E₀(X | W, Y) = Wα (see Appendix C). To also obtain optimality for this case, we can change the mean model, replacing p1 by Wα in Equation 14, which results in a modified MASTOR statistic obtained by replacing U by U′ in Equation 10.

The T^SCORE Test

The T^SCORE test⁹ is very similar to GTAM in that it is also a prospective mixed-model analysis that incorporates covariates and uses estimated VCs for additive and environmental variance. A major difference between T^SCORE and GTAM is that T^SCORE first imputes values for missing genotypes, so that a larger subset of individuals can be analyzed. Thus, the numerator of T^SCORE resembles that of MASTOR in Equation 11 but with a different imputation approach. The denominator of T^SCORE represents a prospective variance calculation, so it is similar to GTAM in that regard, but with some differences. In the GTAM denominator, the generalized genotypic sum of squares has covariates regressed out, but they are not regressed out in the generalized genotypic sum of squares appearing in the T^SCORE denominator. In the GTAM denominator, the term $(Y_{S}^{T} Q_{S} Y_{S}) {(s - w - 2)}^{- 1}$ represents a generalized regression estimate of the trait variance, based on residual sum of squares under the alternative model, using only individuals with nonmissing phenotype, covariate, and genotype. In T^SCORE, that estimated trait variance is effectively replaced with an MLE of the trait variance under the null hypothesis, using everyone with available phenotype and covariates, regardless of how much information is available on their genotypes. In contrast, the denominator of the MASTOR statistic has a phenotypic variance term that varies depending on the amount of genotype information available for the individuals having phenotype and covariate information. If all or almost all individuals who have phenotype and covariate information also have genotype data at the marker being tested, then GTAM, T^SCORE, and MASTOR should give very similar results. The main differences are in their handling of missing data. We include T^SCORE in our simulations and data analysis. 
Because both our simulations and data analysis involve pedigrees with >15 individuals, we calculate T^SCORE with the Ghost software,⁹ in which the genotype imputation uses the Elston-Stewart algorithm.

ASTOR: Is the Heritability Parameter Really Needed?

MASTOR involves estimation of the heritability. This is done only once per genome screen, at least initially, for computational reasons (see subsection Computational Approach and Software). Still, one could ask whether accurate heritability is really needed for MASTOR, because, in the retrospective framework, the statistic would still be correctly calibrated if heritability were ignored. We propose a simplified approach, ASTOR, that is a version of MASTOR in which the heritability is assumed to be 0, so that Σ = I_r. The formula for ASTOR is given by Equation 10 with P replaced by P_I = I_r − W(W^TW)⁻¹ W^T, eliminating the heritability estimation step. However, note that ASTOR still correctly accounts for relatedness in the sample. (This is because the correct calibration of ASTOR depends only on the null conditional mean and variance of the genotype data, not on the phenotype model; see Appendix C.) In Results we compare ASTOR to MASTOR in terms of power and type 1 error.
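As an illustration of the projection P_I = I_r − W(W^TW)⁻¹ W^T used by ASTOR, the following minimal sketch applies P_I to a phenotype vector when W has exactly two columns (for example, an intercept and a sex indicator). It covers only the residualization step, not the full statistic of Equation 10; the function name and the two-column restriction are assumptions made for this example and are not part of the MASTOR software.

```python
def residualize(y, w_columns):
    """Return P_I y = y - W (W^T W)^{-1} W^T y for a W with exactly two columns."""
    a, b = w_columns
    # Entries of the 2 x 2 matrix W^T W and of the vector W^T y.
    saa = sum(u * u for u in a)
    sab = sum(u * v for u, v in zip(a, b))
    sbb = sum(v * v for v in b)
    say = sum(u * t for u, t in zip(a, y))
    sby = sum(v * t for v, t in zip(b, y))
    det = saa * sbb - sab * sab
    if det == 0:
        raise ValueError("the columns of W are linearly dependent")
    # Least-squares coefficients from the closed-form 2 x 2 inverse.
    coef_a = (sbb * say - sab * sby) / det
    coef_b = (saa * sby - sab * say) / det
    return [t - coef_a * u - coef_b * v for t, u, v in zip(y, a, b)]


if __name__ == "__main__":
    intercept = [1, 1, 1, 1]
    female = [0, 0, 1, 1]
    print(residualize([1.0, 2.0, 3.0, 6.0], (intercept, female)))
```

With an intercept and a binary covariate, the result is simply each phenotype minus the mean of its own group, which makes the output easy to check by hand.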

Choice of Statistics to Compare by Simulation

MASTOR is designed for association analysis of quantitative traits with related individuals, taking into account covariates. The previously proposed GTAM and T^SCORE also address this problem, so it is useful to compare them to MASTOR. EMMAX is designed for samples of individuals not known to be related, rather than for family data, and when EMMAX is applied, the authors⁵ exclude individuals whose empirical kinship coefficient is greater than .10 (which corresponds to excluding first- and second-degree relatives). If EMMAX were modified to use the known pedigree information, it would be the same as GTAM in our simulation context. Therefore, within the context of our simulation, GTAM could be thought of as representing an upper bound on the power of a potential extension of EMMAX to family data. Previous work¹⁵ has shown that FBAT has very low power in simulation settings like those we consider, because it does not incorporate the data on the unrelated individuals and because many of the families do not meet the FBAT criteria for “informative families.” Thus, FBAT is not well suited to analyzing the type of data in our simulations and we do not consider this approach further. Similarly, GQLS does not allow covariates, so it is also not able to handle the type of data we consider. Of these methods, only GTAM and T^SCORE are designed to address the problem addressed by MASTOR, namely, quantitative trait association analysis in family data with general pedigree types, taking into account covariates. We also consider the simplified method ASTOR, which addresses the same problems but does not require heritability estimation.

Computational Approach and Software

We have developed software, called MASTOR, which is coded in C and implements the MASTOR, ASTOR, and GTAM methods. To calculate the MASTOR statistic, our software performs two main steps. The first step involves estimation of heritability and covariate effects under the null model, from which we can calculate the transformed phenotypic residual vector, PY. The second step is calculation of the statistic of Equation 10. The first step involves singular value decomposition (SVD) of a kinship matrix, Φ_RR, as well as numerical maximization of a likelihood to get the MLEs of the VCs. Note, however, that the computational burden is reduced in two ways. First, the block-diagonal structure of Φ_RR, with blocks corresponding to families, allows the SVD to be done independently on each block. Thus, for example, if set R were divided into f equal-size families, the computational cost of the SVD would be O(r³f⁻²) when the block-diagonal structure is exploited. Depending on f, this could be much faster than the naive SVD, which would have cost O(r³). Second, in our implementation, we use a well-known algebraic trick^4,5,19 to rewrite our likelihood as a function of just a single parameter, $σ_{a}^{2} / σ_{e}^{2}$ , which eliminates the need to perform the SVD in every iteration of the numerical maximization of the likelihood.
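The saving from the block-diagonal structure can be checked with a small calculation. The sketch below compares leading-order operation counts only (constant factors are ignored), and the function names are hypothetical.

```python
def dense_svd_cost(r: int) -> int:
    """Leading-order cost, r**3, of an SVD of one r x r matrix."""
    return r ** 3


def blockwise_svd_cost(r: int, f: int) -> int:
    """Leading-order cost when the matrix splits into f equal blocks of size r/f."""
    if r % f != 0:
        raise ValueError("r must be divisible by f for equal-size families")
    block = r // f
    return f * block ** 3  # equals r**3 / f**2


if __name__ == "__main__":
    # Illustrative sizes: 65 families of 16 individuals each.
    r, f = 65 * 16, 65
    print(dense_svd_cost(r) // blockwise_svd_cost(r, f))  # speed-up factor f**2
```

The ratio of the two counts is f², so the benefit grows quickly as the sample is divided into more families.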

In a genome scan, the first step of MASTOR would need to be performed only once under certain conditions. For example, this would hold if each person who is phenotyped is either genotyped or has a genotyped relative at every marker in the scan. In that case, the set R would be the same for every marker. For computational reasons, we choose to perform step 1 only once at the initial stage of the genome scan, even though R may differ slightly from marker to marker. To do this, we fix R at the beginning of the analysis. For example, one could include, in R, individuals who are phenotyped and who have at least some minimum number of markers at which they are either genotyped or have a genotyped relative. Then, once a subset of interesting markers has been identified in the initial analysis, one could perform a separate step 1 for each marker in the subset.

The second step, in which the statistic for each marker is calculated, scales linearly in the number of markers m, and for each marker there is an inversion of Φ_NN. As in the first step, the block-diagonal structure of Φ_NN greatly reduces the computational burden.

The MASTOR software also allows the option of fitting a linear mixed model to the data without performing a genome scan. This option, which we refer to below as mixedMLE, serves two purposes. First, it is useful for preliminary analyses of the phenotype and covariate data in order to formulate the null model (Equation 9). For example, the mixedMLE option could be used to fit the data under versions of the null model having different sets of covariates included and different choices of transformations of phenotype and/or covariates, in order to decide which choices should be used in the association tests. Once the null model is chosen, MASTOR can be run in the default mode to perform the association analysis. After genetic variants of interest have been identified by an initial analysis with MASTOR, the mixedMLE option of MASTOR can then serve a second purpose, namely, it can be applied, with one or more variants included as covariates, to estimate the parameters of the alternative model (Equations 1, 2, 3, and 4), including effect size(s) of variants or even interactions among them.

Simulation Studies

We perform simulation studies in order to (1) compare type 1 error and power of the tests; (2) determine whether it is possible to retain the high power of MASTOR without going to the trouble of estimating heritability (i.e., compare power of ASTOR and MASTOR); and (3) assess sensitivity of MASTOR and GTAM to misspecified VCs and to estimation of the heritability from a subset of individuals that is not exactly the same subset used in the test. To address these questions, we simulate data that include related individuals, under a variety of trait models and assumptions about missing genotype and phenotype, as we now describe.

Trait Models

We simulate five trait models, denoted I, II, III, IV, and V, all of which have sex as a covariate. Model I has a single major gene acting additively with additional additive polygenic effects. It is given by

Y = 1.5*1 + .5*1_female + 1.5*X + ε,

(Equation 16)

where 1 is the vector with all elements equal to 1, 1_female is the vector with i^th element equal to 1 if i is female and 0 if i is male, $ϵ \sim N (0, σ_{a}^{2} Φ + σ_{e}^{2} I)$ , with $σ_{a}^{2} = 4$ and $σ_{e}^{2} = 7$ , and X is the genotype vector with i^th element X_i = 0, .5, or 1, according to whether individual i has 0, 1, or 2 copies of allele 1 at the major gene, where the frequency of allele 1 is .1. Model II has four unlinked causal loci, three of which interact, with additional additive polygenic effects. It is given by

Y = 1.5*1 + .5*1_female + f(X₁,X₂,X₃) + g(X₄) + ε,

(Equation 17)

where $ϵ \sim N (0, σ_{a}^{2} Φ + σ_{e}^{2} I)$ , with $σ_{a}^{2} = 4$ and $σ_{e}^{2} = 7$ . Here, f(X₁, X₂, X₃) is a vector with i^th element equal to f(X_1i, X_2i, X_3i) and g(X₄) is a vector with i^th element equal to g(X_4i), where X_1i, X_2i, X_3i, and X_4i are the genotype values of individual i at causal loci 1, 2, 3, and 4, respectively. Table 1 gives the values of f(x₁, x₂, x₃), for (x₁, x₂, x₃) ∈ {0,.5,1}³, and we set g(x₄) = .1, 1.25, or 1.5, according to whether x₄ = 0, .5, or 1. The frequency of allele 1 at loci 1, 2, 3, and 4 is .1, .2, .3, and .2, respectively. Model III is an additive polygenic model with no major genes. It is given by

Y = 1.5*1 + .5*1_female + ε,

(Equation 18)

where $ϵ \sim N (0, σ_{a}^{2} Φ + σ_{e}^{2} I)$ , with $σ_{a}^{2} = 4$ and $σ_{e}^{2} = 25$ . Model IV is a heavy-tailed polygenic model with no major genes. It is given by

Y = 1.5*1 + .5*1_female + ε + η,

(Equation 19)

where $ϵ \sim N (0, σ_{a}^{2} Φ + σ_{e}^{2} I)$ , with $σ_{a}^{2} = 4$ and $σ_{e}^{2} = 25$ and the η_i’s are i.i.d. draws from the Laplace distribution with location parameter 0 and scale parameter 10. Model V is a polygenic model with both additive and dominance components of variance and no major genes. It is given by Equation 18 where $ϵ \sim N (0, σ_{a}^{2} Φ + σ_{d}^{2} Δ_{7} + σ_{e}^{2} I)$ , with $σ_{a}^{2} = 4$ , $σ_{d}^{2} = 20$ , and $σ_{e}^{2} = 25$ , where Δ₇ is the matrix with (i, j)^th entry equal to Δ₇ [i, j], the seventh condensed identity coefficient between individuals i and j, which is the probability that, at any given locus, i and j share two alleles identical by descent (IBD), with neither one homozygous by descent. If the individuals are outbred, then Δ₇ [i, i] = 1 and, for i ≠ j, Δ₇ [i, j] is the probability that i and j share two alleles IBD.

Table 1

Model for Interaction among Three of the Four Causal Loci in Trait Model II

	x₃= 0			x₃= .5			x₃= 1
	x₂= 0	x₂= .5	x₂= 1	x₂= 0	x₂= .5	x₂= 1	x₂= 0	x₂= .5	x₂= 1
x₁ = 0	.5	.5	.5	.75	.75	.75	1	1	1
x₁ = .5	.5	3	3	3.5	3.5	3.5	5	5	5
x₁ = 1	.5	3	3	3.5	3.5	3.5	5	5	5

Genotypic effects are given as a function of (x₁, x₂, x₃), where x_i is 0, .5, or 1 according to whether the individual has 0, 1, or 2 copies of a given allele at locus i.

Models I and II have major genes and are used in simulations to assess both type 1 error and power. Models III, IV, and V have no major genes and are therefore used to assess only type 1 error. For each model, we consider three different phenotype configuration settings, A, B, and C. In phenotype configuration A, the sample consists of 65 ascertained families, each of which consists of 16 individuals in a three-generation outbred pedigree, with one grandparent couple in the first generation and three parent couples in the second generation, two of whom have three offspring and one of whom has two offspring. To simulate null markers and/or major genes, founder genotypes or haplotypes are drawn at random based on their assumed population frequencies. Genotypes for other individuals are then generated by a standard “gene-dropping” approach. For models I and II, phenotypes are generated conditional on the simulated genotypes for the major genes, using the conditional distributions given in the previous paragraph. For the models without major genes (III–V), phenotypes are sampled according to the distributions given in the previous paragraph. For each simulation experiment, a genotype sampling scheme is chosen (described in the next subsection), and in phenotype configuration A, a family is ascertained conditional on having at least six genotyped individuals, i.e., at least six individuals who meet the criteria for the chosen genotype sampling scheme. (Computationally, this is carried out by rejection sampling.) Phenotypes for all individuals in an ascertained family are assumed to be observed in configuration A. 
In phenotype configuration B, the sample consists of 20 ascertained families who satisfy all the same conditions as in A, 500 ascertained unrelated individuals who are both phenotyped and genotyped and who are sampled conditional on meeting the criteria for the chosen genotype sampling scheme (see next subsection), and 500 unrelated, unphenotyped controls who are genotyped and who are randomly sampled from the population. In phenotype configuration C, the sample consists of 65 ascertained families, each of which consists of 16 individuals in a three-generation pedigree. Initially, some individuals’ phenotypes are set missing independently at random, with individuals in the oldest, second-oldest, and youngest generations having probabilities .1, .2, and .4, respectively, of having missing phenotype. In four of the genotype sampling schemes (all, even tails, skewed tails, and upper tail, which are described in the next subsection), all individuals with missing phenotype are assumed to be genotyped, and individuals with nonmissing phenotype are selected for genotyping according to the genotype sampling scheme. (In the random genotype sampling scheme, however, individuals are chosen at random for genotyping, as described in the next subsection, regardless of whether or not they are phenotyped.) Finally, the family is ascertained conditional on having at least three individuals who are both phenotyped and genotyped, at least three who are phenotyped and not genotyped, and at least three who are genotyped and not phenotyped. Note that, as a consequence, in each family in configuration C, there will be at least six individuals genotyped and at least six individuals phenotyped, and only in the random sampling scheme can an individual be missing both genotype and phenotype.
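The gene-dropping step can be sketched as follows for a single biallelic locus. The pedigree encoding is an assumption made for this example: it takes the three second-generation couples to each consist of one child of the grandparents and one married-in founder, which is one natural reading of the 16-person outbred pedigree. Individual ids and function names are hypothetical, and the sketch does not implement ascertainment or rejection sampling.

```python
import random

# Hypothetical encoding: id -> (father, mother), or None for a founder.
PEDIGREE = {
    1: None, 2: None,                    # grandparent couple
    3: (1, 2), 4: None,                  # couple A (4 marries in)
    5: (1, 2), 6: None,                  # couple B (6 marries in)
    7: (1, 2), 8: None,                  # couple C (8 marries in)
    9: (3, 4), 10: (3, 4), 11: (3, 4),   # three offspring of couple A
    12: (5, 6), 13: (5, 6), 14: (5, 6),  # three offspring of couple B
    15: (7, 8), 16: (7, 8),              # two offspring of couple C
}


def gene_drop(pedigree, freq, rng):
    """Return {id: (allele, allele)}, with alleles coded 1 (allele 1) or 0."""
    alleles = {}
    for person in sorted(pedigree):  # parents have smaller ids than children
        parents = pedigree[person]
        if parents is None:
            # Founder: two independent draws at the population frequency.
            alleles[person] = (int(rng.random() < freq), int(rng.random() < freq))
        else:
            father, mother = parents
            # Non-founder: one allele chosen at random from each parent.
            alleles[person] = (rng.choice(alleles[father]), rng.choice(alleles[mother]))
    return alleles


def genotype_values(alleles):
    """Code genotypes as 0, .5, or 1 for 0, 1, or 2 copies of allele 1."""
    return {person: sum(pair) / 2 for person, pair in alleles.items()}


if __name__ == "__main__":
    dropped = gene_drop(PEDIGREE, 0.1, random.Random(2013))
    print(genotype_values(dropped))
```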

Sampled Genotypes

We consider five different genotype sampling schemes. These genotype sampling schemes do not apply to the 500 unrelated, unphenotyped controls in phenotype configuration B, who are genotyped and are randomly sampled from the population, but they do apply to all other sampled individuals. In the “all” sampling scheme, all sampled individuals are genotyped regardless of their phenotype. As a consequence, ascertainment is random or population based when the all sampling scheme is used. In the “even tails” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≤μ − 1.5σ or ≥μ + 1.5σ, where μ and σ are the population mean and standard deviation of the trait. In the “skewed tails” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≤μ − .5σ or ≥μ + 2.5σ. In the “upper tail” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≥μ + 1σ. In the “random” sampling scheme, individuals are chosen for genotyping independently at random, with individuals in the oldest, second-oldest, and youngest generations having probabilities .4, .7, and .9, respectively, of being genotyped, regardless of phenotype.
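The selection rules above can be written compactly as a predicate on the phenotype value. This is a minimal sketch with hypothetical function names; the generation labels 1, 2, and 3 for the oldest, second-oldest, and youngest generations are an assumption of the example.

```python
def is_genotyped(scheme, y, mu, sigma):
    """Phenotype-based rule for the four non-random genotype sampling schemes."""
    if scheme == "all":
        return True
    if scheme == "even tails":
        return y <= mu - 1.5 * sigma or y >= mu + 1.5 * sigma
    if scheme == "skewed tails":
        return y <= mu - 0.5 * sigma or y >= mu + 2.5 * sigma
    if scheme == "upper tail":
        return y >= mu + 1.0 * sigma
    raise ValueError("unknown phenotype-based scheme: " + scheme)


def random_scheme_probability(generation):
    """Genotyping probability in the random scheme, regardless of phenotype."""
    return {1: 0.4, 2: 0.7, 3: 0.9}[generation]


if __name__ == "__main__":
    for scheme in ("all", "even tails", "skewed tails", "upper tail"):
        print(scheme, is_genotyped(scheme, 2.0, 0.0, 1.0))
```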

Impact of the Variance Components

One goal of our simulation studies is to assess sensitivity of MASTOR and GTAM to (1) misspecified VCs and (2) estimation of the heritability from a subset of individuals that is not exactly the same subset used in the test. To address (1), for both MASTOR and GTAM, we perform a procedure we call “misspecified VCs” in which we first set $σ_{a}^{2}$ and $σ_{e}^{2}$ to be the values used in the corresponding simulation model (I, II, III, IV, or V above). Then, instead of using the null MLE for ξ in the analysis, we set ξ to be $σ_{a}^{2} / (σ_{a}^{2} + σ_{e}^{2})$ . Furthermore, instead of using the generalized regression estimate of $σ_{T}^{2}$ under the alternative model, we plug $σ_{a}^{2} + σ_{e}^{2}$ in for $σ_{T}^{2}$ . Note that because of model misspecification and ascertainment, these VCs would be the correctly specified components of variance only for model III with the all sampling scheme; otherwise, they are misspecified. The resulting statistics are referred to in the Results section as “MASTOR misspecified VCs” and “GTAM misspecified VCs.”
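For concreteness, the plug-in values used in the "misspecified VCs" procedure follow directly from the simulation settings. The sketch below computes them; the dictionary and function names are hypothetical. For model V, only the additive and environmental components enter the plug-in, as described above.

```python
# (sigma_a^2, sigma_e^2) used to simulate each trait model.
SIMULATION_VCS = {
    "I": (4.0, 7.0),
    "II": (4.0, 7.0),
    "III": (4.0, 25.0),
    "IV": (4.0, 25.0),
    "V": (4.0, 25.0),
}


def plug_in_values(sigma_a2, sigma_e2):
    """Return (xi, sigma_T^2) as plugged in by the misspecified-VCs procedure."""
    xi = sigma_a2 / (sigma_a2 + sigma_e2)
    sigma_t2 = sigma_a2 + sigma_e2
    return xi, sigma_t2


if __name__ == "__main__":
    for model, (sa2, se2) in SIMULATION_VCS.items():
        xi, st2 = plug_in_values(sa2, se2)
        print(model, round(xi, 3), st2)
```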

To address (2), we calculate MASTOR and GTAM with the heritability estimated from a slightly different sample than the one used in the association test. Ordinarily, the heritability for MASTOR is estimated based on the individuals in group R, and the heritability for GTAM is estimated based on the individuals in group S = N ∩ R. In simulations, we also consider the results when the heritability for MASTOR is estimated based only on the individuals in group S (the GTAM sample), but the individuals in the larger group R ∪ N are included in the association analysis, and when the heritability for GTAM is estimated based on the larger set of individuals in R (the MASTOR phenotype sample), but only the individuals in S are included in the association analysis. We refer to the MASTOR and GTAM statistics calculated in this way as MASTOR_G and GTAM_M, respectively, where “G” refers to the estimation of heritability from the GTAM sample and “M” refers to the estimation of heritability from the MASTOR phenotype sample.

HDL-C Data from the Framingham Heart Study

The Framingham Heart Study (FHS)¹ is a multicohort, longitudinal study of risk factors for cardiovascular disease. Our use of the FHS data was approved by the Institutional Review Board of the Biological Sciences Division of the University of Chicago. The FHS sample consists of unrelated individuals as well as individuals from multigeneration pedigrees. We analyze high-density lipoprotein cholesterol (HDL-C) levels in exam 1 of cohort 3 (third-generation cohort) of FHS. We log-transform the phenotype and include age, age², sex, and log(FPG) as covariates in the analysis, where FPG is fasting plasma glucose. Although we analyze phenotype and covariate information on cohort 3 only, we include in the analysis Affymetrix 500K genotypes on individuals from all three cohorts (original, offspring, and generation 3). We exclude the genotypes of individuals who do not meet all of the following criteria: (1) empirical self-kinship < .525 (i.e., empirical inbreeding coefficient < .05) and (2) completeness (i.e., proportion of markers for which a given individual has genotype called) > 96%. We also use the off-diagonals of the empirical kinship matrix to exclude an additional 298 individuals with empirical kinship values that are not consistent with the pedigree information. The resulting data set has 3,879 individuals who are both genotyped and phenotyped with no missing covariates, 4,718 individuals who are genotyped but are missing the phenotype or some of the covariates, and 194 individuals who have complete phenotype and covariate information but do not have genotype data. Initially, we analyze the 369,046 SNPs from the Affymetrix 500K array that satisfy all of the following criteria: (1) call rate ≥ 96%, (2) Mendelian error rate ≤ 2%, and (3) minor allele frequency ≥ 1%.
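The equivalence between the self-kinship and inbreeding thresholds follows from the standard relation φ_ii = (1 + F_i)/2 between the self-kinship coefficient and the inbreeding coefficient. The sketch below applies the two individual-level filters; the function names are hypothetical, and the sketch does not cover the pedigree-consistency check based on the off-diagonal kinship values.

```python
def inbreeding_from_self_kinship(phi_ii):
    """Invert phi_ii = (1 + F) / 2 to recover the inbreeding coefficient F."""
    return 2.0 * phi_ii - 1.0


def passes_individual_qc(phi_ii, completeness):
    """Criteria (1) and (2): self-kinship < .525 and completeness > 96%."""
    return phi_ii < 0.525 and completeness > 0.96


if __name__ == "__main__":
    print(inbreeding_from_self_kinship(0.525))
    print(passes_individual_qc(0.50, 0.99))
```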

Because interesting distinctions between MASTOR, GTAM, and T^SCORE would be expected to occur only when there are a substantial proportion of individuals who have phenotype and covariate information but are missing genotype data, we randomly mask some genotypes in the data set. In the random masking scheme, the probability of an individual being selected for genotyping is allowed to depend on the phenotype value. Specifically, we set $\bar{Y}$ and s_Y to be the sample mean and sample standard deviation of Y, and we denote ${\tilde{Y}}_{i} = (Y_{i} - \bar{Y}) / s_{Y}$ . We mask the genotype of individual i with probability .99 if ${\tilde{Y}}_{i} < - 1$ , .9 if $- 1 \leq {\tilde{Y}}_{i} < - .7$ , .1 if $- .7 \leq {\tilde{Y}}_{i} < - .3$ , .01 if $- .3 \leq {\tilde{Y}}_{i} \leq .3$ , .1 if $.3 < {\tilde{Y}}_{i} \leq .7$ , .9 if $.7 < {\tilde{Y}}_{i} \leq 1$ , and .99 if $1 < {\tilde{Y}}_{i}$ .
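Because the masking probabilities are symmetric in the standardized phenotype, they can be expressed in terms of its absolute value. The following minimal sketch encodes the rule; the function names are hypothetical, and the random seed is used only to make the illustration reproducible.

```python
import random
import statistics


def standardize(y):
    """Return (Y_i - mean) / s_Y using the sample standard deviation."""
    mean = statistics.mean(y)
    sd = statistics.stdev(y)
    return [(v - mean) / sd for v in y]


def mask_probability(z):
    """Masking probability as a function of the standardized phenotype z."""
    a = abs(z)
    if a > 1.0:
        return 0.99
    if a > 0.7:
        return 0.9
    if a > 0.3:
        return 0.1
    return 0.01


def mask_genotypes(y, rng):
    """Return one True/False flag per individual; True means the genotype is masked."""
    return [rng.random() < mask_probability(z) for z in standardize(y)]


if __name__ == "__main__":
    print(mask_genotypes([3.9, 4.1, 4.0, 4.6, 3.4, 4.05], random.Random(0)))
```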


Results

Assessment of Type 1 Error and the Impact of Variance Components

To assess the type 1 error of the methods, we simulate an unlinked, unassociated marker and test for association by each method. Our type 1 error test is particularly challenging to the methods because the null model is misspecified in almost every scenario (except trait model III with the all sampling scheme). The results in Table 2 show that MASTOR, ASTOR, and GTAM all appear to be correctly calibrated in our simulations, including in those cases when the null model is not correctly specified. In contrast, T^SCORE was not correctly calibrated when individuals were chosen for genotyping based on their phenotype value. A likely reason for this is that the T^SCORE variance calculation seems to be appropriate only if either (1) the imputation procedure recovers close to complete genotype information or (2) the phenotype distribution is the same for genotyped and ungenotyped individuals.

Table 2

Empirical Type 1 Error of Test Statistics, Based on 25,000 Replicates

Sampled Genos	Trait Model	Nominal Level	MASTOR	ASTOR	GTAM	T^SCORE
All	IIA	.05	.049	.048	.049	.050
All	IIIA	.05	.049	.049	.050	.048
All	IVA	.05	.048	.048	.047	.049
All	VA	.05	.051	.051	.052	.050
Random	IIC	.05	.050	.050	.052	.053
Upper tail	IIA	.05	.049	.050	.050	.006
Upper tail	IIIA	.05	.049	.050	.049	.015
Upper tail	IIC	.05	.050	.049	.052	.045
Even tails	IIA	.05	.051	.051	.053	.108
Even tails	IIC	.05	.050	.051	.054	.119
All	IIA	.001	.0011	.0011	.0010	.0008
All	IIIA	.001	.0008	.0009	.0009	.0008
All	IVA	.001	.0012	.0012	.0012	.0010
All	VA	.001	.0010	.0010	.0010	.0009
Random	IIC	.001	.0008	.0010	.0011	.0009
Upper tail	IIA	.001	.0006	.0010	.0012	.0000
Upper tail	IIIA	.001	.0013	.0012	.0011	.0000
Upper tail	IIC	.001	.0014	.0015	.0010	.0006
Even tails	IIA	.001	.0008	.0008	.0010	.0066
Even tails	IIC	.001	.0012	.0013	.0014	.0089


Values in bold are those that differ significantly (p value < .01) from the nominal level by a z-test. “Sampled Genos” refers to the different genotype sampling schemes described in subsection Sampled Genotypes of the Material and Methods. The trait models are described in subsection Trait Models of the Material and Methods. For trait model II, the tested SNP has minor allele frequency .2, whereas for trait models III, IV, and V, the tested SNP has minor allele frequency .1.
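The comparison of an empirical type 1 error rate with its nominal level can be illustrated with a one-sample z-test for a binomial proportion. The sketch below assumes a two-sided test with the standard error computed under the nominal level; the legend does not state these details, so they are assumptions of the example, as is the function name.

```python
import math


def type1_error_z_test(empirical, nominal, n_replicates=25000):
    """Return (z, two-sided p value) for H0: true rejection rate equals nominal."""
    se = math.sqrt(nominal * (1.0 - nominal) / n_replicates)
    z = (empirical - nominal) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p_value


if __name__ == "__main__":
    for rate in (0.049, 0.053, 0.108):
        z, p = type1_error_z_test(rate, 0.05)
        print(rate, round(z, 2), p < 0.01)
```

Under these assumptions, an empirical rate of .049 at the .05 level is well within sampling error for 25,000 replicates, whereas a rate of .108 is far outside it.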

We also assess the impact of misspecified values for the VCs on the type 1 error of MASTOR and GTAM. In Table 3, the results in column 6 (MASTOR misspecified VCs) show that the type 1 error of MASTOR seems completely unaffected by misspecification of the VC values. In contrast, the results in column 7 (GTAM misspecified VCs) of Table 3 show that GTAM is more sensitive to misspecification of the VCs. This is to be expected, because the asymptotic assessment of significance that we use for GTAM depends crucially on the VCs.

Table 3

Effect on Type 1 Error of Misspecified VCs or Use of a Slightly Different Sample to Estimate Heritability

Sampled Genos	Trait Model	Nominal Level	MASTOR_G	GTAM_M	MASTOR Misspecified VCs	GTAM Misspecified VCs
Even tails	IIA	.05	.051	.066	.051	.097
Even tails	IIA	.001	.0009	.0018	.0007	.0048
Even tails	IIC	.05	.051	.063	.048	.074
Even tails	IIC	.001	.0010	.0028	.0012	.0034
Upper tail	IIIA	.05	.049	.052	.049	.046
Upper tail	IIIA	.001	.0012	.0013	.0010	.0010
Upper tail	IIA	.05	.049	.043	.050	.043
Upper tail	IIA	.001	.0009	.0008	.0007	.0010
Upper tail	IIC	.05	.049	.047	.049	.046
Upper tail	IIC	.001	.0009	.0009	.0011	.0008


MASTOR_G is calculated with heritability estimated from group S (the GTAM sample) and GTAM_M is calculated with heritability estimated from R (the MASTOR phenotype sample). Empirical type 1 error is based on 25,000 replicates. Values in bold are those that differ significantly (p value < .01) from the nominal level by a z-test. “Sampled Genos” and “Trait Model” defined in the legend to Table 2. For trait model II, the tested SNP has minor allele frequency .2, whereas for trait model III, the tested SNP has minor allele frequency .1.

We also consider a more subtle point, which is the effect of estimating the heritability parameter in a slightly different subset of the sampled individuals than the subset that is included in the association test. Again, in Table 3 column 4 (MASTOR_G), we find that the type 1 error of the MASTOR statistic seems completely unaffected by this, whereas the type 1 error of the GTAM statistic, in Table 3 column 5 (GTAM_M), can be thrown off by this. This is an important caveat for the use of GTAM: the heritability must be estimated in the same subset of individuals that is being tested for association. Thus, for example, individuals who are phenotyped but have missing genotype must be excluded from the heritability estimation for GTAM, whereas they could be included for MASTOR. (The power improvement for MASTOR when these individuals are included in the heritability estimation is considered in the next subsection.)

In the type 1 error assessments presented in Tables 2 and 3, for trait model II, the tested SNP has minor allele frequency .2, whereas for trait models III, IV, and V, the tested SNP has minor allele frequency .1. We obtained similar results with other choices of minor allele frequency, as well as with other choices of model and significance level (results not shown).

Power Studies

To compare the power of the tests, we simulate under models I and II, which have major genes. We simulate a total of 60 scenarios, where 40 of these are obtained by choosing all possible combinations from among two models (I and II), two phenotype configurations (A and B), four genotype sampling schemes (all, even tails, skewed tails, and upper tail), and for model II, testing each of the four different causal SNPs (whereas model I has only one causal SNP). An additional 20 scenarios were obtained by simulating model II with phenotype configuration C with each of five genotype sampling schemes (the four listed above plus random) and testing each of four different causal SNPs. For each of the 60 scenarios, power was evaluated at four different significance levels, .05, .01, .005, and .001. Results from all 240 settings are represented in Figure 1, and a subset of the results appears in Table 4. Because T^SCORE was not correctly calibrated in many of the simulation settings, we did not include it in the power comparison.


Figure 1

Power Comparisons between Statistics

For each of 60 simulated scenarios, power was evaluated at four different significance levels: .05, .01, .005, and .001. Results from all 240 settings are represented. Empirical power is based on 25,000 replicates. MASTOR_G is a version of MASTOR with heritability estimated from group S (the GTAM sample) and GTAM_M is a version of GTAM with heritability estimated from R (the MASTOR phenotype sample). For each of the 240 settings, graphs plot (A) the power of MASTOR versus the power of MASTOR_G, (B) the power of MASTOR versus ASTOR; (C) the power of MASTOR versus GTAM_M; (D) the power of ASTOR versus GTAM; (E) the power of GTAM versus GTAM_M; and (F) the power of MASTOR versus GTAM. In (C) and (E), a point is blue if the type 1 error of GTAM_M was significantly deflated (.05 level) in the corresponding scenario, the point is red if the type 1 error of GTAM_M was significantly inflated (.05 level) in the corresponding scenario, and the point is black if the type 1 error of GTAM_M was not significantly different from nominal.

Table 4

Power for Detection of Association with a Quantitative Trait

Sampled Genos	Trait Model	Tested SNP	Level	ASTOR	GTAM	MASTOR
All	IA	1	.01	.48	.57	.57
All	IB	1	.01	.48	.53	.52
All	IIA	3	.01	.56	.66	.66
All	IIB	3	.01	.56	.62	.61
Random	IIC	3	.01	.32	.34	.37
Even tails	IA	1	.001	.55	.51	.60
Even tails	IB	1	.001	.88	.90	.91
Even tails	IIA	3	.001	.87	.86	.91
Even tails	IIB	2	.05	.29	.31	.31
Even tails	IIC	3	.001	.47	.45	.52
Skewed tails	IA	1	.05	.30	.30	.35
Skewed tails	IB	1	.05	.38	.48	.48
Skewed tails	IIA	4	.01	.41	.38	$\underline{\underline{.50}}$
Skewed tails	IIB	4	.01	.53	.66	.70
Skewed tails	IIC	4	.01	.57	.29	$\underline{\underline{.65}}$
Upper tail	IA	1	.05	.30	.22	$\underline{\underline{.33}}$
Upper tail	IB	1	.05	.21	.27	$\underline{\underline{.37}}$
Upper tail	IIA	4	.05	.45	.30	$\underline{\underline{.53}}$
Upper tail	IIB	4	.05	.36	.41	$\underline{\underline{.62}}$
Upper tail	IIC	4	.05	.78	.18	$\underline{\underline{.82}}$


Empirical power is based on 25,000 replicates. For empirical power estimates in the range .2–.8, the estimated standard error is .003, whereas for the estimates outside this range, the estimated standard error is .002. MASTOR power values with a single underline are those that are at least .05 larger than the power of GTAM for that scenario, whereas values with a double underline are those that are at least .1 larger than the power of GTAM for that scenario. “Sampled Genos” and “Trait Model” defined in the legend to Table 2. For model I, association is tested with the sole causal SNP. For model II, association is tested with one of causal SNPs 1, 2, 3, or 4, as indicated in the “Tested SNP” column.
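The quoted standard errors are consistent with the usual binomial formula for a proportion estimated from 25,000 independent replicates. The sketch below is a check of that arithmetic rather than the authors' own calculation, and the function name is hypothetical.

```python
import math


def power_standard_error(power, n_replicates=25000):
    """Binomial standard error of an empirical power estimate."""
    return math.sqrt(power * (1.0 - power) / n_replicates)


if __name__ == "__main__":
    for power in (0.2, 0.5, 0.8, 0.9):
        print(power, round(power_standard_error(power), 3))
```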

From Table 4 and Figure 1F, it is clear that MASTOR has power that is either approximately equal to or higher than that of GTAM in all of our simulations, and in some cases it is much higher. From rows 1 and 3 of Table 4, we can see that when there are no missing data, the power of MASTOR and GTAM is equivalent. However, in most of the missing data scenarios, MASTOR outperforms GTAM. This is expected because MASTOR uses information on dependence among relatives to allow family members with some missing data to contribute to the analysis.

One goal of our simulation studies was to determine whether we could retain the high power of MASTOR without going to the trouble of estimating heritability. Specifically, we considered whether we could do almost as well as MASTOR by setting the heritability to 0 instead of estimating it. (We called the resulting statistic ASTOR.) From Table 4 and Figure 1B, it is clear that MASTOR dominates ASTOR in terms of power, so the heritability estimation step seems important for power. From Table 4 and Figure 1D, we can see that ASTOR sometimes does much better but often does worse than GTAM. ASTOR tends to do better than GTAM in phenotype configurations A and C with missing data and worse in phenotype configuration B or in the absence of missing data. It makes sense that ASTOR would perform well in strongly family-based samples with missing data because, like MASTOR, ASTOR is able to improve power by using information on dependence among relatives to allow family members with some missing data to contribute to the analysis.

Another goal of our simulation studies was to determine whether, to estimate heritability for the MASTOR and GTAM statistics, one should use only that subset, S, of individuals who have both phenotype and genotype data, or whether one should use the larger subset, R, of phenotyped individuals who do not necessarily have genotype data (but who meet additional conditions detailed above). From Figure 1A, it is clear that for the MASTOR analysis, higher power is achieved by estimating heritability from the full subset, R, (results labeled “MASTOR” in Figure 1A) rather than from the subset having both phenotype and genotype data (results labeled “MASTOR_G” in Figure 1A). In contrast, for the GTAM analysis, there is no gain in power from including the full subset of individuals in R when some of them have missing genotype data (results labeled “GTAM_M” in Figure 1E) as opposed to including only the subset of individuals who are both genotyped and phenotyped (results labeled “GTAM” in Figure 1E). In fact, use of the larger subset of individuals to estimate heritability for GTAM can lead to either an inflation or deflation of type 1 error (Table 3 and the red and blue dots in Figures 1C and 1E).

Analysis of HDL-C Data from the Framingham Heart Study

Table 5 reports the parameter estimates for the null model of log(HDL-C), fitted in cohort 3. We estimated the heritability to be .50 (95% confidence interval of .44–.56), which is consistent with previously reported^20,21 estimates of 0.40–0.69. The Q-Q plot for the MASTOR genome scan (see Figure S1 available online) does not show evidence of inflation, and the genomic control inflation factor is λ_GC = 1.01. In the initial association analysis (before masking of genotypes), the results by GTAM and T^SCORE (not shown) are almost the same as those for MASTOR, which is to be expected because of the low proportion of individuals with phenotype and covariate information who also have missing genotype. In this initial analysis, one SNP, rs9989419, shows significant association with HDL-C after Bonferroni correction, with a nominal p value of 1.0 × 10⁻⁸ by MASTOR. SNP rs9989419 is located in cholesterol ester transfer protein (CETP [MIM 118470]) and has been previously reported as associated with HDL-C level.^22,23 A previous analysis of a much larger subset of the Framingham data²⁴ also identified this association, using a method that accounts only for sibling correlations, with genomic control used to make a further correction. In Table 6 we report all SNPs with p value ≤10⁻⁵ in the MASTOR analysis. These SNPs are within or in close proximity to eight gene regions, three of which (CETP, LPL [MIM 609708], and LIPG [MIM 603684]) have been reported and replicated before.^22,25–27

Table 5

Null Parameter Estimates for the Analysis of log(HDL-C) in the Framingham Heart Study Data

Parameter	MLE	SE
Narrow-sense heritability (ξ)	.50	.03
Additive variance $(σ_{a}^{2})$	.032	.003
Environmental variance $(σ_{e}^{2})$	.032	.002
Intercept	5.12	0.02
Coefficient of age	−.007	.003
Coefficient of age²	.00012	.00005
Coefficient of sex	.243	.008
Coefficient of log(FPG)	−.32	.03

Age is measured in years. Sex is coded as female = 2, male = 1. The log(FPG) is the natural logarithm of fasting plasma glucose.
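
As a quick consistency check on Table 5, the sketch below recomputes the heritability from the two variance components and rebuilds the reported confidence interval. It assumes the usual definition ξ = σ_a² / (σ_a² + σ_e²) and a symmetric normal-approximation (Wald) interval; the paper does not state in this section how its interval was obtained, so the function names and the 1.96 multiplier are illustrative assumptions.

```python
# Estimates copied from Table 5.
sigma_a2 = 0.032            # additive variance
sigma_e2 = 0.032            # environmental variance
xi_hat, xi_se = 0.50, 0.03  # reported heritability and its standard error


def narrow_sense_heritability(additive_var, environmental_var):
    """Share of the total trait variance attributed to additive genetic effects."""
    return additive_var / (additive_var + environmental_var)


def wald_interval(estimate, se, z=1.96):
    """Symmetric normal-approximation interval (an assumed construction)."""
    return estimate - z * se, estimate + z * se


xi = narrow_sense_heritability(sigma_a2, sigma_e2)
lower, upper = wald_interval(xi_hat, xi_se)
print(f"heritability = {xi:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
# prints: heritability = 0.50, 95% CI = (0.44, 0.56)
```

Under these assumptions the table entries reproduce both the point estimate and the interval quoted in the text.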

Table 6

SNPs with Smallest p Values for HDL-C Level in the Framingham Heart Study

SNP	Chr	Position	Gene Region	MASTOR^a p Value (Original Data)	MASTOR p Value (Masked Genotypes)	GTAM p Value (Masked Genotypes)	T^SCORE p Value (Masked Genotypes)
rs11707795	3	139549939	CLSTN2	4.8 × 10⁻⁶	2.7 × 10⁻⁴	.153	.005
rs4921964	8	18679795	PSD3	5.0 × 10⁻⁶	5.9 × 10⁻⁴	.075	.005
rs17482753	8	19832646	LPL-SLC18A1	2.1 × 10⁻⁷	3.1 × 10⁻⁴	.260	.002
rs10503669	8	19847690	LPL-SLC18A1	1.8 × 10⁻⁷	1.6 × 10⁻⁵	.021	3.4 × 10⁻⁴
rs17410962	8	19848080	LPL-SLC18A1	7.6 × 10⁻⁷	1.1 × 10⁻⁵	.002	1.7 × 10⁻⁴
rs17489268	8	19852045	LPL-SLC18A1	7.3 × 10⁻⁷	7.2 × 10⁻⁶	.013	2.6 × 10⁻⁴
rs17411024	8	19852134	LPL-SLC18A1	7.5 × 10⁻⁷	5.7 × 10⁻⁵	.042	4.6 × 10⁻⁴
rs17411031	8	19852310	LPL-SLC18A1	7.3 × 10⁻⁷	1.2 × 10⁻⁴	.112	.002
rs17411126	8	19855272	LPL-SLC18A1	1.8 × 10⁻⁶	8.3 × 10⁻⁵	.039	.001
rs765547	8	19866274	LPL-SLC18A1	8.0 × 10⁻⁷	3.1 × 10⁻⁵	.004	7.3 × 10⁻⁴
rs1837842	8	19868290	LPL-SLC18A1	8.5 × 10⁻⁷	7.5 × 10⁻⁵	.052	.002
rs1919484	8	19869676	LPL-SLC18A1	5.4 × 10⁻⁷	7.2 × 10⁻⁶	.093	3.5 × 10⁻⁴
rs7006101	8	81897200	PAG1	4.8 × 10⁻⁶	.003	.104	.011
rs7904836	10	4097880	KLF6	2.3 × 10⁻⁶	.035	.039	.041
rs17259942	12	77072077	ZDHHC17-OSBPL8	4.5 × 10⁻⁶	.003	.101	.009
rs9989419	16	56985139	CETP-HERPUD1	1.0 × 10⁻⁸	1.9 × 10⁻⁴	.090	.003
rs7240405	18	47159090	ACAA2-LIPG	6.3 × 10⁻⁶	.004	.349	.016
rs4939883	18	47167214	ACAA2-LIPG	7.7 × 10⁻⁶	.003	.358	.018
rs2156552	18	47181668	ACAA2-LIPG	7.7 × 10⁻⁶	.006	.532	.024
rs6507945	18	47243912	ACAA2-LIPG	2.9 × 10⁻⁶	.001	.005	.003

MIM numbers of genes not mentioned in the text: PSD3 (MIM 614440), SLC18A1 (MIM 193002), PAG1, KLF6 (MIM 602053), ZDHHC17 (MIM 607799), OSBPL8 (MIM 606736), HERPUD1 (MIM 608070), ACAA2 (MIM 604770).

^aResults for the original data by GTAM and T^SCORE were very similar to those for MASTOR.

After masking some individuals’ genotypes (as described in the HDL-C Data from the Framingham Heart Study subsection of Material and Methods), we again tested for association with each of the 20 SNPs in Table 6 by using MASTOR, GTAM, and T^SCORE, and the results are given in the last three columns of Table 6. The GTAM p values are the largest for 19 of the 20 SNPs because GTAM uses only the individuals having complete data. The T^SCORE p values are generally larger than those of MASTOR in this analysis, probably because the T^SCORE variance calculation is appropriate only if either (1) the imputation procedure recovers close to complete genotype information or (2) the phenotype distribution is the same for genotyped and ungenotyped individuals. In addition, the Ghost implementation of T^SCORE does not allow loops, which occurred in six of the Framingham pedigrees.

Run Times for MASTOR

We performed simulations to estimate the run times of MASTOR and demonstrate its computational feasibility. We used a single processor on a shared machine with an 8-core Intel Xeon 3.16 GHz CPU and 32 GB RAM. For a data set with 65 three-generation families with at least 6 genotyped individuals per family, the first step took 20 ms and the second step took 10 min (552,930 ms) to analyze 500,000 SNPs. As expected, doubling the number of families to 130 roughly doubles the time, with the second step taking 20 min (1,120,780 ms) for 500,000 SNPs. Thus, our MASTOR software is clearly practical for genome-wide association studies.
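
The timings above can be turned into per-SNP costs and a scaling ratio with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the millisecond figures quoted in the text; the variable names are illustrative.

```python
SNPS = 500_000
# Number of families -> second-step run time in milliseconds, as quoted above.
step2_ms = {65: 552_930, 130: 1_120_780}

for families, ms in step2_ms.items():
    print(f"{families} families: {ms / 60_000:.1f} min total, {ms / SNPS:.2f} ms per SNP")

# Ratio of run times when the number of families is doubled.
scaling = step2_ms[130] / step2_ms[65]
print(f"time ratio when doubling families: {scaling:.2f}")
# prints:
# 65 families: 9.2 min total, 1.11 ms per SNP
# 130 families: 18.7 min total, 2.24 ms per SNP
# time ratio when doubling families: 2.03
```

The ratio of about 2.03 is in line with the roughly linear scaling in the number of families described in the text.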

Discussion

Data sets that contain both individuals with known familial relationships and unrelated individuals are common. Often, families that have been previously recruited for linkage analysis are later typed on high-density SNP chips for association analysis. It is also common to recruit the offspring of individuals who are in population-based studies, resulting in data sets with related individuals, such as the Framingham Heart Study.¹ For such studies, it is important to develop association analysis tools that can incorporate pedigree and covariate information and can handle missing data.

We have developed MASTOR, a powerful and robust method for association testing of quantitative traits in samples containing related individuals. MASTOR includes adjustment for covariates, is applicable to completely general combinations of unrelated and related individuals, and can appropriately handle and leverage missing data. MASTOR takes into account the dependence among related individuals to incorporate into the analysis phenotype and covariate information from individuals with missing genotype data and genotype data from individuals with missing phenotype or covariate information. In simulations, we show that the type 1 error of MASTOR is well calibrated, and we demonstrate the power gains that MASTOR obtains by (1) modeling the residual phenotypic correlation among related individuals and (2) incorporating partially missing information on related individuals. Because MASTOR is a retrospective analysis, it is robust to misspecification of the model for the distribution of phenotypes among related individuals. Thus, as demonstrated in our simulations, MASTOR remains well calibrated even when the variance components are misspecified or the heritability is estimated from a different sample than the one being tested for association. For best power results, however, heritability should ideally be estimated with the set of phenotyped individuals who contribute to the association analysis.

We used MASTOR to test for association with HDL cholesterol based on genome-wide SNP data from the Framingham Heart Study. In a version of the data set that included both individuals with phenotypes and covariates but missing genotypes and also individuals with genotypes but missing phenotypes and covariates, MASTOR was able to use the partially missing data to increase power over GTAM. Out of the 369,046 SNPs we tested, all of our 10 smallest p values (and 15 of our 20 smallest p values) are within or in close proximity to genes that have been previously reported and replicated as associated with HDL cholesterol, verifying that MASTOR is able to home in on the important loci in a genome scan.
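
To make the genome-wide significance claim concrete, the following sketch applies a Bonferroni threshold to the three smallest MASTOR p values on the original data in Table 6. The family-wise level of 0.05 is an assumption (this section does not state the level used); the SNP count is the 369,046 tested SNPs mentioned above.

```python
N_SNPS = 369_046   # number of SNPs tested, from the text
ALPHA = 0.05       # assumed family-wise error level
threshold = ALPHA / N_SNPS

# Three smallest MASTOR p values (original data) from Table 6.
p_values = {
    "rs9989419": 1.0e-8,
    "rs10503669": 1.8e-7,
    "rs17482753": 2.1e-7,
}

significant = [snp for snp, p in p_values.items() if p < threshold]
print(f"Bonferroni threshold = {threshold:.2e}")
print("significant:", significant)
# prints:
# Bonferroni threshold = 1.35e-07
# significant: ['rs9989419']
```

Under this assumed level, only rs9989419 passes the threshold, which matches the single Bonferroni-significant SNP reported in the Results.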

Software implementing the MASTOR, ASTOR, and GTAM statistics is freely downloadable under an open source GNU GPL (see Web Resources). We have demonstrated that MASTOR is computationally feasible, making it suitable for genome-wide association studies. There is still the potential for improvement in computational speed. For example, for a data set with multiple families, a natural parallelization scheme (not currently implemented) would be to process each family independently in parallel. Different sets of SNPs could also be processed independently in parallel. Another approach to speeding up the computations addresses the most time-consuming part of the analysis, which is the inversion of Φ_NN for every marker, where N is the set of individuals with nonmissing genotype at the given marker and where the inverse matrix can be calculated separately for each family. Recomputation of the inverse matrix at each marker allows for the possibility that genotypes may be missing for different individuals at different SNPs. However, in a genome screen with 500,000 SNPs, for each small- to medium-sized family in a sample, we are likely to see the same pattern of missingness for more than one SNP, in which case the inverse matrix could be calculated fewer times than the number of SNPs. For example, for a family with 16 genotyped individuals, there are 2¹⁶ = 65,536 missingness patterns possible, much fewer than the number of SNPs in a typical genome-wide association study. Furthermore, some individuals may be more likely to have missing genotypes than others, for example because of the quality of the sampled DNA, which may further reduce the number of observed missingness patterns. Then it becomes a question of whether or not storing in memory relevant information on all (or the most common) missing genotype patterns for each family is computationally preferable to performing the matrix inversion for every SNP.
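
The caching idea described above can be sketched as follows. This is not the MASTOR implementation; it is a minimal illustration in which the inverse of the kinship submatrix for one family is stored per missingness pattern, so that SNPs sharing a pattern reuse it. The family, the `InverseCache` class, and the exact-arithmetic `invert` helper are all hypothetical.

```python
from fractions import Fraction


def invert(matrix):
    """Gauss-Jordan inverse of a small nonsingular matrix, in exact arithmetic."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


HALF = Fraction(1, 2)
# Hypothetical nuclear family (father, mother, two children); entries are 2 x kinship.
PHI = [
    [1, 0, HALF, HALF],
    [0, 1, HALF, HALF],
    [HALF, HALF, 1, HALF],
    [HALF, HALF, HALF, 1],
]


class InverseCache:
    """Stores one inverse of the genotyped submatrix per missingness pattern."""

    def __init__(self, phi):
        self.phi = phi
        self.cache = {}
        self.inversions = 0

    def inverse_for(self, genotyped):
        """genotyped: tuple of booleans, True where the genotype is nonmissing."""
        key = tuple(genotyped)
        if key not in self.cache:
            idx = [i for i, present in enumerate(key) if present]
            sub = [[self.phi[i][j] for j in idx] for i in idx]
            self.cache[key] = invert(sub)
            self.inversions += 1
        return self.cache[key]


cache = InverseCache(PHI)
full = (True, True, True, True)
mother_missing = (True, False, True, True)
patterns = [full, full, mother_missing, full, mother_missing, full]  # one per SNP
for pattern in patterns:
    cache.inverse_for(pattern)
print(cache.inversions, "inversions for", len(patterns), "SNPs")
# prints: 2 inversions for 6 SNPs
```

Whether such a cache pays off in practice depends on the trade-off raised in the text: the memory needed to hold one inverse per observed pattern and family versus the time saved by skipping repeated inversions.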

For MASTOR (and, in fact, for ASTOR, GTAM, and T^SCORE as well), the variance calculation involves the kinship matrix and it would be moderately sensitive to misspecified kinship. We could extend the MASTOR method to also correct for both population structure and misspecified kinship, in addition to missing data and known family structure. We describe two possible approaches to this. One approach, analogous to the ROADTRIPS¹⁵ method for binary traits, would be to replace the MASTOR statistic of Equation 6 by

$$\mathrm{MASTOR} = \frac{(V^{T} X)^{2}}{\widehat{\mathrm{Var}}_{0}(V^{T} X \mid W, Y)} = \frac{(V^{T} X)^{2}}{\hat{σ}_{X}^{2} \, V^{T} \hat{Ψ}_{NN} V},$$

(Equation 20)

where $\hat{Ψ}$ is an empirical kinship matrix calculated from genome-wide data. In terms of the BLUP imputation interpretation of MASTOR, this would mean using the known pedigree information for the BLUP imputation and also for the phenotypic residuals, but using the empirical kinship matrix to assess the overall variance of the statistic. An alternative approach would be to include ancestry-informative vectors as covariates in W. Recent related work includes Yu et al.²⁸ and Peloso et al.²⁹

Acknowledgments

This study was supported by the National Institutes of Health grant R01 HG001645 (to M.S.M.). The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.

Appendix A: Extension to Multiallelic Variant

One approach to testing association with a multiallelic variant having a allelic classes is to perform an (a − 1) degree-of-freedom score test of the null hypothesis of no allelic association. In some situations, it may make sense to first pool some of the allelic classes in order to reduce the number of degrees of freedom. Let a be the number of allelic classes, possibly after pooling, and let Z be the n × (a − 1) matrix with Z_ij = 1/2 × (the number of class-j alleles in individual i). Then the multiallelic extension of the MASTOR statistic is

$$\mathrm{MASTOR} = \frac{(n - 1) \, Y^{T} P^{T} Φ_{RN} U Z (Z^{T} U Z)^{-1} Z^{T} U Φ_{NR} P Y}{Y^{T} P^{T} Φ_{RN} U Φ_{NR} P Y}.$$

(Equation A1)

Note that when a = 2, Z reduces to X, and Equation A1 reduces to Equation 10. An equivalent formulation of Equation A1 can be obtained as follows: first, let F be the (a − 1) × (a − 1) matrix having (i, j)^th entry F_ij = Cov(Z_ki, Z_kj) for any outbred individual k. We assume Cov₀(Z_ki, Z_lj | W, Y) = F_ij · 2ϕ_kl. Define Z_i = (Z_1i,…, Z_ni)^T, the i^th column of Z. We estimate F by a generalized sample covariance matrix $\hat{F}$ having (i, j)^th entry $\hat{F}_{ij} = (n - 1)^{-1} Z_{i}^{T} U Z_{j}$. Note that $\hat{F}$ is an unbiased estimator of F, even in the presence of inbreeding. Then the multiallelic extension of the MASTOR statistic can be equivalently written as

$$\mathrm{MASTOR} = (Y^{T} P^{T} Φ_{RN} U Φ_{NR} P Y)^{-1} \sum_{i = 1}^{a - 1} \sum_{j = 1}^{a - 1} (\hat{F}^{-1})_{ij} (Z_{i}^{T} U Φ_{NR} P Y) (Z_{j}^{T} U Φ_{NR} P Y).$$

(Equation A2)

Under the null hypothesis of no association and no linkage, the MASTOR statistic of Equations A1 and A2 is asymptotically $χ_{a - 1}^{2}$ distributed (assuming the usual regularity conditions).
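
The construction of Z described at the start of this appendix can be illustrated with a short sketch. The function name, the choice of the last class as the omitted one, and the `pooling` argument are assumptions made for illustration; the text only specifies that Z is n × (a − 1) with entries equal to half the count of class-j alleles.

```python
def allele_dosage_matrix(genotypes, classes, pooling=None):
    """Build Z with Z[i][j] = (number of class-j alleles in individual i) / 2.

    genotypes: list of 2-tuples of allele labels, one tuple per individual.
    classes:   the a allelic classes after pooling; the last one is omitted,
               so the result has a - 1 columns (an assumed convention).
    pooling:   optional map from an original allele label to its pooled class.
    """
    pooling = pooling or {}
    kept = classes[:-1]
    rows = []
    for genotype in genotypes:
        pooled = [pooling.get(allele, allele) for allele in genotype]
        rows.append([pooled.count(c) / 2 for c in kept])
    return rows


# Hypothetical four-allele variant, with rare alleles C and D pooled into "other".
genotypes = [("A", "A"), ("A", "B"), ("B", "C"), ("C", "D")]
Z = allele_dosage_matrix(genotypes, ["A", "B", "other"], {"C": "other", "D": "other"})
print(Z)
# prints: [[1.0, 0.0], [0.5, 0.5], [0.0, 0.5], [0.0, 0.0]]

# With a = 2 classes, Z has a single column, the biallelic genotype vector.
X = allele_dosage_matrix([("A", "A"), ("A", "B")], ["A", "B"])
print(X)
# prints: [[1.0], [0.5]]
```

Pooling reduces a from 4 to 3 in this example, so the resulting test would have 2 degrees of freedom rather than 3.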

A different, previously proposed estimator¹³ of F is given by $\check{F}$, where $\check{F}_{ij} = \frac{1}{2} (\hat{p}_{i} 1_{i = j} - \hat{p}_{i} \hat{p}_{j})$, with $\hat{p}_{i} = (1^{T} Φ^{-1} 1)^{-1} 1^{T} Φ^{-1} Z_{i}$. One could choose to use $\check{F}$ in place of $\hat{F}$ in Equation A2. One difference between $\hat{F}$ and $\check{F}$ is that $\check{F}$ assumes HWE in outbred founders under the null hypothesis, whereas $\hat{F}$ is more robust to deviations from HWE. However, $\check{F}$ involves estimation of fewer parameters than does $\hat{F}$, so $\check{F}$ might be preferred if there are some genotypes with small expected counts in the data.

Appendix B: Extension to MZ Twins

The main challenge presented by the occurrence of MZ twins in the sample is that the kinship matrix Φ_NN will not be invertible¹⁵ if there are MZ twins in N, the set of individuals with nonmissing genotype. (A similar problem would occur for Φ_QQ.) In contrast, invertibility of Σ or $\hat{Σ}$ is generally not a problem, because this will still hold provided ξ < 1 or $\hat{ξ} < 1$, respectively. When MZ twins are present in N, all the formulas in the paper will still hold, with the modification that $Φ_{NN}^{-1}$ be replaced by $Φ_{NN}^{-}$ everywhere, where $Φ_{NN}^{-}$ is the Moore-Penrose generalized inverse of Φ_NN. (The same would hold for Φ_QQ.) From a computational standpoint, we do not actually need to compute the generalized inverse but can achieve the same result by a simpler approach, which we now describe. Suppose that the two genotypes at a variant for the individuals in an MZ twin pair are observed to be identical. Then, use of $Φ_{NN}^{-}$ in place of $Φ_{NN}^{-1}$ in the formulas for $\hat{σ}_{X}^{2}$, the BLUP, and the BLUE, given in Equations 7, 12, and 13, respectively, is mathematically equivalent to removing one twin in each MZ twin pair before performing the calculation. Because each of these calculations involves only genotype data, not phenotype data, it is intuitively clear that the genotype of the second twin in each MZ pair is redundant information. Use of $Φ_{QQ}^{-}$ in place of $Φ_{QQ}^{-1}$ in the formula for U′ in Equation 8 is mathematically equivalent to (1) removing the genotype data for one twin in each MZ twin pair and (2) replacing each covariate for the remaining twin by the average of the two twins’ covariates, before performing the calculation of Equation 8. Use of $Φ_{NN}^{-}$ in place of $Φ_{NN}^{-1}$ in the MASTOR statistic of Equation 10 is mathematically equivalent to setting the genotype data of one twin to be missing, so that individual is moved out of group N, and keeping the remaining twin’s data unchanged. Because the one twin’s missing genotype data will be imputed perfectly by the BLUP, no information is lost by doing this.
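
A small numeric illustration of why MZ twins make the kinship matrix singular, and why dropping one twin restores invertibility, is sketched below. The three-person pedigree (a mother and her MZ twin children) and the helper function are hypothetical; matrix entries are twice the kinship coefficient, so an MZ pair has the same entry with each other as each twin has with itself.

```python
from fractions import Fraction


def determinant(m):
    """Determinant by cofactor expansion along the first row (small matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for col, value in enumerate(m[0]):
        minor = [row[:col] + row[col + 1:] for row in m[1:]]
        total += (-1) ** col * value * determinant(minor)
    return total


HALF = Fraction(1, 2)

# Order of individuals: mother, twin 1, twin 2 (hypothetical pedigree).
PHI_WITH_TWINS = [
    [1, HALF, HALF],
    [HALF, 1, 1],   # twin 1 and twin 2 have identical rows
    [HALF, 1, 1],
]
# Same pedigree after removing twin 2 from the genotyped set N.
PHI_ONE_TWIN = [
    [1, HALF],
    [HALF, 1],
]

print(determinant(PHI_WITH_TWINS))  # prints: 0
print(determinant(PHI_ONE_TWIN))    # prints: 3/4
```

The two identical rows give a zero determinant, so no ordinary inverse exists; with one twin removed the matrix is nonsingular, which is the practical shortcut described above.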

Appendix C: Additional Mathematical Details of MASTOR

To see that the MASTOR statistic of Equations 6 and 10 can be rewritten in the form of Equation 11, note that we can rewrite the BLUP of Equation 12 in matrix notation as

$$\hat{X} = 1_{r} (1^{T} Φ_{NN}^{-1} 1)^{-1} 1^{T} Φ_{NN}^{-1} X + Φ_{RN} U X,$$

(Equation C1)

where 1_r is a vector of length r with every element equal to 1 and 1 is a vector of length n with every element equal to 1. Then $Y^{T} P \hat{X} = Y^{T} P Φ_{RN} U X$, because P1_r = 0, and the result follows immediately.

The asymptotic $χ_{1}^{2}$ null distribution of the MASTOR statistic of Equation 10 is based on the standard central limit theorem argument applied to linear combinations of X. In addition to the usual regularity conditions of the central limit theorem, the conditions that need to hold are

$$E_{0}(Y^{T} P Φ_{RN} U X \mid W, Y) = 0$$

(Equation C2)

and

$$\mathrm{Var}_{0}(Y^{T} P Φ_{RN} U X \mid W, Y) = σ_{X}^{2} \, Y^{T} P Φ_{RN} U Φ_{NR} P Y.$$

(Equation C3)

In subsection Justification for the MASTOR Statistic of the Material and Methods section, the assumptions needed for correct calibration of MASTOR are stated: (1) $E_{0} (\hat{X} | W, Y) = W α$ and (2) ${Var}_{0} (X | W, Y) = σ_{X}^{2} Φ_{N N}$ . To see that (1) and (2) imply Equations C2 and C3, note that $Y^{T} P Φ_{R N} U X = Y^{T} P \hat{X}$ , so $E_{0} (Y^{T} P Φ_{R N} U X | W, Y) = E_{0} (Y^{T} P \hat{X} | W, Y) = Y^{T} P W α = 0$ because PW = 0, and ${Var}_{0} (Y^{T} P Φ_{R N} U X | W, Y) = Y^{T} P Φ_{R N} U {Var}_{0} (X | W, Y) U Φ_{N R} P Y = σ_{X}^{2} Y^{T} P Φ_{R N} U Φ_{N N} U Φ_{N R} P Y = σ_{X}^{2} Y^{T} P Φ_{R N} U Φ_{N R} P Y$ because UΦ_NNU = U. The implication is that MASTOR is correctly calibrated even if the genotype is linearly related to the covariates and even if the phenotype model is misspecified (e.g., if the variance components are wrong). Note that the same proof applies to ASTOR by replacing P with P_I throughout.
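
The identity UΦ_NN U = U used in the variance calculation above can be verified in a few lines. The derivation below assumes that U has the form that a BLUP with a generalized-least-squares mean would give, namely U = Φ_NN⁻¹ − Φ_NN⁻¹ 1 (1ᵀ Φ_NN⁻¹ 1)⁻¹ 1ᵀ Φ_NN⁻¹; the definition of U is given in Material and Methods rather than in this appendix, so this form is an assumption here, and the shorthand A and c is introduced only for the sketch.

```latex
\begin{align*}
A &= \Phi_{NN}^{-1}, \qquad c = 1^{T} A 1, \qquad
U = A - A 1 c^{-1} 1^{T} A \quad \text{(assumed form of } U\text{)} \\
U \Phi_{NN} &= I - A 1 c^{-1} 1^{T} \\
U \Phi_{NN} U &= (I - A 1 c^{-1} 1^{T})(A - A 1 c^{-1} 1^{T} A) \\
&= A - 2 A 1 c^{-1} 1^{T} A + A 1 c^{-1} (1^{T} A 1) c^{-1} 1^{T} A \\
&= A - A 1 c^{-1} 1^{T} A = U
\end{align*}
```

The last step uses 1ᵀA1 = c, so the third term collapses to one copy of A 1 c⁻¹ 1ᵀ A. Under the same assumed form, U1 = A1 − A1 c⁻¹ c = 0.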

The optimality of MASTOR is based on the standard results for quasi-likelihood and holds under the retrospective mean model in Equations 14 and 15, where this model is discussed in some detail in subsection Justification for the MASTOR Statistic of the Material and Methods section. The retrospective mean model of Equations 14 and 15 does not have genotype linearly related to the covariates under the null, so the version of MASTOR in Equation 10 is optimal only when genotype is unrelated to covariates under the null. If one desired optimality of MASTOR under the retrospective model E(X | W, Y) = Wα + δΦ_NRPY, in which genotype is linearly related to the covariates under the null, this would be achieved by replacing U by U′ in the MASTOR statistic of Equation 10. This could be interpreted in terms of BLUP imputation with covariates, in addition to relatives’ genotypes, as predictors.

Supplemental Data

Document S1. Figure S1 (PDF, 54K)

Web Resources

The URLs for data presented herein are as follows:

MASTOR source code, http://www.stat.uchicago.edu/~mcpeek/software/index.html
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/

References

1. Splansky G.L., Corey D., Yang Q., Atwood L.D., Cupples L.A., Benjamin E.J., D’Agostino R.B., Sr., Fox C.S., Larson M.G., Murabito J.M. The third generation cohort of the National Heart, Lung, and Blood Institute’s Framingham Heart Study: design, recruitment, and initial examination. Am. J. Epidemiol. 2007;165:1328–1335.

2. Newman D.L., Abney M., McPeek M.S., Ober C., Cox N.J. The importance of genealogy in determining genetic associations with complex traits. Am. J. Hum. Genet. 2001;69:1146–1148.

3. Thornton T., McPeek M.S. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am. J. Hum. Genet. 2007;81:321–337.

4. Abney M., Ober C., McPeek M.S. Quantitative-trait homozygosity and association mapping and empirical genomewide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites. Am. J. Hum. Genet. 2002;70:920–934.

5. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354.

6. Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008;178:1709–1723.

7. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012;44:821–824.

8. Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835.

9. Chen W.-M., Abecasis G.R. Family-based association tests for genomewide association scans. Am. J. Hum. Genet. 2007;81:913–926.

10. Abecasis G.R., Cardon L.R., Cookson W.O.C. A general test of association for quantitative traits in nuclear families. Am. J. Hum. Genet. 2000;66:279–292.

11. Lange C., DeMeo D.L., Laird N.M. Power and design considerations for a general class of family-based association tests: quantitative traits. Am. J. Hum. Genet. 2002;71:1330–1341.

12. Feng Z., Wong W.W., Gao X., Schenkel F. Generalized genetic association study with samples of related individuals. Ann. Appl. Stat. 2011;5:2109–2130.

13. Bourgain C., Hoffjan S., Nicolae R., Newman D., Steiner L., Walker K., Reynolds R., Ober C., McPeek M.S. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am. J. Hum. Genet. 2003;73:612–626.

14. Han L., Abney M. Identity by descent estimation with dense genome-wide genotype data. Genet. Epidemiol. 2011;35:557–567.

15. Thornton T., McPeek M.S. ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure. Am. J. Hum. Genet. 2010;86:172–184.

16. McPeek M.S. BLUP genotype imputation for case-control association testing with related individuals and missing data. J. Comput. Biol. 2012;19:756–765.

17. McPeek M.S., Wu X., Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60:359–367.

18. Wang Z., McPeek M.S. An incomplete-data quasi-likelihood approach to haplotype-based genetic association studies on related individuals. J. Am. Stat. Assoc. 2009;104:1251–1260.

19. Welham S.J., Thompson R. Likelihood ratio tests for fixed model terms using residual maximum likelihood. J. Royal Stat. Soc. Ser. B. 1997;59:701–714.

20. Weiss L.A., Pan L., Abney M., Ober C. The sex-specific genetic architecture of quantitative traits in humans. Nat. Genet. 2006;38:218–222.

21. Kathiresan S., Manning A.K., Demissie S., D’Agostino R.B., Surti A., Guiducci C., Gianniny L., Burtt N.P., Melander O., Orho-Melander M. A genome-wide association study for blood lipid phenotypes in the Framingham Heart Study. BMC Med. Genet. 2007;8(Suppl 1):S17.

22. Willer C.J., Sanna S., Jackson A.U., Scuteri A., Bonnycastle L.L., Clarke R., Heath S.C., Timpson N.J., Najjar S.S., Stringham H.M. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat. Genet. 2008;40:161–169.

23. Wallace C., Newhouse S.J., Braund P., Zhang F., Tobin M., Falchi M., Ahmadi K., Dobson R.J., Marçano A.C., Hajat C. Genome-wide association study identifies genes for biomarkers of cardiovascular disease: serum urate and dyslipidemia. Am. J. Hum. Genet. 2008;82:139–149.

24. Ma L., Yang J., Runesha H.B., Tanaka T., Ferrucci L., Bandinelli S., Da Y. Genome-wide association analysis of total cholesterol and high-density lipoprotein cholesterol levels using the Framingham Heart Study data. BMC Med. Genet. 2010;11:55.

25. Kathiresan S., Melander O., Guiducci C., Surti A., Burtt N.P., Rieder M.J., Cooper G.M., Roos C., Voight B.F., Havulinna A.S. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat. Genet. 2008;40:189–197.

26. Aulchenko Y.S., Ripatti S., Lindqvist I., Boomsma D., Heid I.M., Pramstaller P.P., Penninx B.W.J.H., Janssens A.C.J.W., Wilson J.F., Spector T., ENGAGE Consortium Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nat. Genet. 2009;41:47–55.

27. Kathiresan S., Willer C.J., Peloso G.M., Demissie S., Musunuru K., Schadt E.E., Kaplan L., Bennett D., Li Y., Tanaka T. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 2009;41:56–65.

28. Yu J., Pressoir G., Briggs W.H., Vroh Bi I., Yamasaki M., Doebley J.F., McMullen M.D., Gaut B.S., Nielsen D.M., Holland J.B. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006;38:203–208.

29. Peloso G.M., Dupuis J., Lunetta K.L. Evaluation of methods accounting for population structure with pedigree data and continuous outcomes. Genet. Epidemiol. 2011;35:427–436.

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics




