MASTOR: Mixed-Model Association Mapping of Quantitative Traits in Samples with Related Individuals
Abstract
Genetic association studies often sample individuals with known familial relationships in addition to unrelated individuals, and it is common for some individuals to have missing data (phenotypes, genotypes, or covariates). When some individuals in a sample are related, power can be gained by incorporating all individuals in the analysis, including individuals with partially missing data, while properly accounting for the dependence among them. We propose MASTOR, a mixed-model, retrospective score test for genetic association with a quantitative trait. MASTOR achieves high power in samples that contain related individuals by making full use of the relationship information to incorporate partially missing data in the analysis while correcting for dependence. Individuals with available phenotype and covariate information who are not genotyped but have genotyped relatives in the sample can still contribute to the association analysis because of the dependence among genotypes. Similarly, individuals who are genotyped but are missing covariate or phenotype information can contribute to the analysis. MASTOR is valid even when the phenotype model is misspecified and with either random or phenotype-based ascertainment. In simulations, we demonstrate the correct type 1 error of MASTOR, the increase in power that comes from making full use of the relationship information, the robustness to misspecification of the phenotype model, and the improvement in power that comes from modeling the heritability. We show that MASTOR is computationally feasible and practical in genome-wide association studies. We apply MASTOR to data on high-density lipoprotein cholesterol from the Framingham Heart Study.
Introduction
Genome-wide association studies commonly contain both related and unrelated individuals. For example, families that have previously been recruited for linkage studies may later be included in an association analysis. In other instances, such as the Framingham Heart Study,1 individuals may be sampled for a population-based study, with their relatives later recruited. For such studies it is important to develop association analysis tools that can incorporate pedigree and covariate information and can handle missing data. It has been demonstrated that when related individuals are included in an association study, the genealogy needs to be appropriately modeled2 and that additional power can be achieved by including all individuals in the analysis while accounting for the dependence resulting from relatedness in the sample.3
For genetic analysis of association with quantitative traits in samples with related individuals, the GTAM method4 has previously been developed. GTAM is a prospective, mixed-model analysis that incorporates covariates and uses estimated variance components (VCs) for additive polygenic and environmental variance. (For individuals not known to be related, later variations on the GTAM method include EMMAX5 and similar methods,6–8 in which the kinship matrix of GTAM is replaced by an IBS matrix or other empirical relatedness matrix.) The TSCORE test9 addresses the problem of testing association with a quantitative trait when there are high-resolution SNP genotypes for a subset of individuals in each pedigree, with sparser marker data for the remaining individuals. To address this problem, TSCORE uses genotype imputation, followed by association testing with a score statistic similar to that of GTAM. Family-based association testing approaches can also be used for quantitative traits.10,11 Family-based approaches, such as FBAT,11 that restrict analysis to the within-family component of association by conditioning on relatives’ genotypes, provide robustness to population structure at the price of strong requirements on the availability of genotype data on relatives in order to have adequate power. Another recent method for association mapping of quantitative traits in related individuals is GQLS,12 which is the same as the previously proposed WQLS method13 for binary traits, except with a quantitative trait used in place of the binary trait. GQLS is a retrospective analysis that does not allow for covariates and does not involve estimation of VCs.
We propose a retrospective mixed-model approach to genetic association mapping of a quantitative trait in samples with related individuals, which we call MASTOR (mixed-model association score test on related individuals). Unlike GQLS, MASTOR allows for covariates and for additive polygenic and environmental VCs. Unlike family-based tests, MASTOR is applicable to completely general samples of related and unrelated individuals, and unlike TSCORE, MASTOR allows pedigrees with loops. Unlike GTAM and TSCORE, MASTOR is retrospective and so is more robust to misspecification of the phenotypic model. A further advantage of MASTOR over GQLS and GTAM is in its handling of missing data. It uses the information on dependence of related individuals to allow individuals with missing genotype or missing phenotype to contribute power to the analysis. MASTOR uses a different approach to missing data than does TSCORE and we compare them in simulations.
MASTOR can be viewed as an extension to quantitative traits of the MQLS method3 for binary traits. Like MQLS, MASTOR can be shown to be asymptotically optimal when, for example, the genetic effect of the tested locus is additive with effect size tending to 0. Features of MASTOR include the following: (1) it is applicable to and computationally feasible for essentially arbitrary combinations of related and unrelated individuals, including small outbred pedigrees and unrelated individuals, as well as large, complex inbred pedigrees; (2) it is computationally feasible for genetic studies with millions of markers; and (3) it incorporates phenotype and covariate information on relatives who have missing genotype data at the marker being tested, incorporates genotype data on individuals with missing phenotype or covariate information, and appropriately accounts for the uncertainty resulting from missing data and dependence. For comparison, we also propose the ASTOR test, which is a simplified version of MASTOR that also corrects for dependence but does not require estimation of heritability.
In order to assess the type 1 error and power of MASTOR and compare it to previously proposed statistics, we perform simulations under various models for quantitative traits, allowing for multiple causal loci with epistasis, covariates, both normal and nonnormal error, polygenic additive and/or dominance and environmental components of variance, and various assumptions about missing genotype, phenotype, and ascertainment. We illustrate the applicability of the method by analyzing data on high-density lipoprotein cholesterol (HDL-C) from the Framingham Heart Study.
Material and Methods
Suppose the data for the genetic association study include genotype, phenotype, pedigree, and relevant covariate information on a sample of individuals, where we allow missing data. We assume that the phenotype is quantitative, i.e., continuously varying. We consider an analysis in which genetic variants are tested, one at a time, for association with the trait. For simplicity, we assume that each tested variant is biallelic. (The extension to a multiallelic variant is described in Appendix A.) The individuals in the sample can be arbitrarily related, with the pedigree(s) that specify the relationships assumed to be known. Unrelated individuals can also be included in the sample, or the entire sample could consist of unrelated individuals.
We fix a particular genetic variant, which we refer to as the variant of interest, and we arbitrarily label its two alleles “0” and “1.” Let N denote the subset of individuals in the sample who have nonmissing genotype at the variant of interest. Let n = |N| be the number of individuals in set N. Let R denote the subset of individuals in the sample who have nonmissing phenotype value and no missing covariates and who satisfy at least one of the following three criteria: (1) they have nonmissing genotype at the variant of interest, (2) they have a relative in the study with nonmissing genotype at the variant of interest, or (3) they are in the same pedigree with an individual with nonmissing phenotype and no missing covariates who either has nonmissing genotype or a relative with nonmissing genotype at the variant of interest. Let r = |R| denote the number of individuals in set R. The set of individuals included in the MASTOR analysis is N ∪ R. That is, in addition to individuals who have complete data on phenotype, covariates, and genotype at the variant of interest, we also include in the analysis (1) individuals who are genotyped at the variant of interest but are missing phenotype or covariate information and (2) individuals who have phenotype and covariate information but have missing genotype at the variant of interest, provided that they have at least one relative who is genotyped at the variant of interest (or are in a pedigree that meets condition 3 above).
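The construction of the sets N, R, S = N ∩ R, and N ∪ R can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `analysis_sets`, the `pedigree_of` mapping, and the `related` predicate (true when two individuals have nonzero kinship) are hypothetical and are not taken from the MASTOR software.

```python
def analysis_sets(genotyped, complete, pedigree_of, related):
    """Build N, R, S = N & R, and N | R for one variant.

    genotyped   : ids with nonmissing genotype at the variant of interest
    complete    : ids with nonmissing phenotype and no missing covariates
    pedigree_of : dict id -> pedigree label (hypothetical input format)
    related     : function (i, j) -> True if i and j are relatives
    """
    N = set(genotyped)
    # Criteria 1 and 2: genotyped, or has at least one genotyped relative.
    anchored = {i for i in complete
                if i in N or any(related(i, j) for j in N if j != i)}
    # Criterion 3: shares a pedigree with an individual meeting 1 or 2.
    anchored_pedigrees = {pedigree_of[i] for i in anchored}
    R = {i for i in complete if pedigree_of[i] in anchored_pedigrees}
    return N, R, N & R, N | R


# Toy data: a trio (father f, mother m, child c) plus two singletons u and v.
pedigree_of = {"f": "F1", "m": "F1", "c": "F1", "u": "U", "v": "V"}
relative_pairs = {frozenset(("f", "c")), frozenset(("m", "c"))}


def related(i, j):
    return frozenset((i, j)) in relative_pairs


N, R, S, included = analysis_sets({"f", "u"}, {"m", "c", "u", "v"},
                                  pedigree_of, related)
```

In this toy example the mother enters R only through criterion 3, and the ungenotyped singleton v is left out because no genotype data bear on it.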
Let X = (X₁, …, Xₙ)ᵀ denote the vector of genotype data for the individuals in N, where Xᵢ = 0, .5, or 1, according to whether the ith genotyped individual has 0, 1, or 2 copies of allele 1 at the variant of interest. Let Y = (Y₁, …, Yᵣ)ᵀ denote the vector of phenotype data, where Yᵢ is the quantitative trait value for the ith individual in set R, and let W be the r × (w + 1) matrix of covariates with (i, j)th entry Wᵢⱼ equal to the value of the jth covariate for the ith individual in set R. We assume that W always includes an intercept (i.e., column of ones) and therefore has w + 1 columns, where w is the number of covariates to be included in the analysis in addition to the intercept.
In MASTOR, we analyze the data retrospectively, i.e., we condition on (Y, W), the covariates and phenotype, and treat the genotype vector, X, as random. This approach is appropriate with either random or phenotype-based ascertainment. Additionally, this order of conditioning provides a natural way of incorporating information on individuals with partially missing data.3
In what follows, we first give a brief review of the mixed-effects GTAM method of Abney et al.,4 which is closely connected to MASTOR. We then describe MASTOR and show how it can gain power by using additional information not used by GTAM. We also briefly describe the TSCORE statistic9 and contrast it with the MASTOR and GTAM statistics.
Brief Review of the Mixed-Effects GTAM Method of Abney et al.
We restrict attention to a special case of the GTAM method4 in which one tests for an additive effect of the variant of interest (1-df test), in the presence of covariates, with additive polygenic effects and independent normal error in the model. For computational convenience, we use an asymptotic assessment of significance instead of the more robust, covariance-preserving permutation test described by Abney et al.4
The GTAM method uses only the set of individuals, S = N ∩ R, who have complete data on phenotype, covariates, and genotype at the variant of interest. Let s = |S| be the number of such individuals. We let YS and XS represent the length s subvectors of Y and X, respectively, obtained by considering only the individuals in S. Similarly, we let WS denote the s × (w + 1) submatrix of W obtained by extracting the s rows of W that correspond to individuals in S.
In GTAM, the analysis is prospective, not retrospective. In other words, we condition on (WS, XS), the covariates and genotypes, and treat the quantitative trait vector, YS, as random. This approach is clearly justified for individuals randomly sampled from a population, though the justification is less obvious when ascertainment is based on phenotype. The quantitative trait (possibly after suitable transformation of phenotypes and/or covariates) is modeled as
$$Y_S = W_S \beta + X_S \gamma + \varepsilon_S,$$
where β is the (w + 1) × 1 vector of covariate effects, including intercept, γ is the (scalar) association parameter, measuring the effect of genotype on phenotype, and εS is a random vector of length s, satisfying
$$\varepsilon_S \sim N(0, \Omega_{SS}),$$
i.e., εS has the s-dimensional multivariate normal distribution with mean vector 0 and variance matrix
$$\Omega_{SS} = \sigma_a^2 \Phi_{SS} + \sigma_e^2 I_s, \qquad (\Phi_{SS})_{ij} = \begin{cases} 1 + h_i, & i = j, \\ 2\phi_{ij}, & i \neq j, \end{cases}$$
where $\sigma_a^2$ and $\sigma_e^2$ are the additive polygenic and environmental VCs, $I_s$ is the s × s identity matrix, ϕij is the kinship coefficient between the ith and jth individuals of S, and hi is the inbreeding coefficient of the ith individual of S. Note that the identity 2ϕii = 1 + hi holds for all i, so the diagonal entries can be equivalently expressed as twice the self-kinship coefficients. Here, as in the original GTAM method, ΦSS is taken to be the pedigree-based kinship matrix, though an estimated kinship matrix based on genotype data, such as that of Han and Abney,14 has also been used in GTAM.
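As an illustration of the pedigree-based kinship matrix, the sketch below computes kinship coefficients with the standard parent-to-offspring recursion and returns the matrix whose entries are twice the kinship coefficients, so that each diagonal entry equals 1 plus the inbreeding coefficient. The function name and the dictionary pedigree format are assumptions made for this example; they are not taken from the MASTOR or GTAM software.

```python
from functools import lru_cache


def kinship_matrix(pedigree):
    """Return the matrix with (i, j) entry 2 * phi_ij.

    pedigree: dict id -> (father, mother), with (None, None) for founders,
    listed so that parents appear before their offspring (assumed format).
    """
    order = list(pedigree)
    rank = {ind: k for k, ind in enumerate(order)}

    @lru_cache(maxsize=None)
    def phi(i, j):
        if i is None or j is None:
            return 0.0          # unknown parent: treated as unrelated
        if rank[i] < rank[j]:
            i, j = j, i         # always recurse on the later-listed person
        father, mother = pedigree[i]
        if i == j:
            # Self-kinship: (1 + inbreeding coefficient) / 2.
            return 0.5 * (1.0 + phi(father, mother))
        return 0.5 * (phi(father, j) + phi(mother, j))

    return [[2.0 * phi(a, b) for b in order] for a in order]


# Nuclear family plus an inbred individual g (offspring of sibs c1 and c2).
pedigree = {
    "fa": (None, None),
    "mo": (None, None),
    "c1": ("fa", "mo"),
    "c2": ("fa", "mo"),
    "g": ("c1", "c2"),
}
Phi = kinship_matrix(pedigree)
```

Here the diagonal entry for g exceeds 1 because g is inbred, while the entries for outbred individuals equal 1.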
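We find it convenient to reparameterize the VCs in terms of the total variance, $\sigma_t^2 = \sigma_a^2 + \sigma_e^2$, and the heritability, $\xi = \sigma_a^2 / \sigma_t^2$, where $\sigma_a^2$ and $\sigma_e^2$ denote the additive polygenic and environmental variances. The variance matrix of εS can then be written as $\sigma_t^2 \Sigma_{SS}$, where $\Sigma_{SS} = \xi \Phi_{SS} + (1 - \xi) I_s$.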
The null hypothesis of no association is H0 : γ = 0, and the alternative hypothesis is HA : γ ≠ 0. The GTAM statistic is asymptotically equivalent to the score statistic for this null hypothesis, assuming the model given in Equations 1, 2, 3, and 4. To calculate the GTAM statistic, we first estimate the heritability, ξ, by its null maximum likelihood estimate (MLE), i.e., its MLE in the submodel of Equations 1, 2, 3, and 4 in which γ = 0. We call this null MLE
where
A technical point is that the set S typically differs from marker to marker, but we would like to avoid re-estimating the null MLE of the heritability,
MASTOR
MASTOR can be viewed as an extension, to quantitative traits, of the MQLS method3 for binary traits. In contrast to GTAM, MASTOR is based on a retrospective approach, rather than a prospective approach, and it uses the larger set of individuals N ∪ R, rather than S = N ∩ R. One advantage of the retrospective approach is that it provides a natural way to incorporate information on individuals with missing genotype by using the known dependence among relatives’ genotypes under the null hypothesis. This allows MASTOR to use extra information not used by GTAM. Another advantage is that the retrospective analysis is less dependent on correct specification of the phenotypic covariance matrix under the null hypothesis. (This is because the correct calibration of MASTOR depends only on the null conditional mean and variance of the genotype data, not on the phenotype model. See Appendix C for mathematical details and subsection Assessment of Type 1 Error and the Impact of Variance Components and subsection Power Studies for empirical confirmation.) We first present the calculation of the MASTOR statistic and then provide the justification for it.
The MASTOR statistic takes the form
where V, which will be defined in the next paragraph, is a vector of length n that is a function of W, Y, and the pedigree information. We assume that under the null hypothesis of no association between genotype and phenotype, we have E0(VTX | W, Y) = 0 and
where 1 denotes a vector of length n with every element equal to 1. An alternative estimator that accounts for possible dependence between genotype and covariates is derived by replacing U in Equation 7 with
and replacing (n − 1) in Equation 7 by (q − w − 1), where Q is defined to be the set of individuals with both genotype and covariate information, but who may or may not have known phenotype (S ⊆ Q ⊆ N), with q = |Q|, ΦQQ is the q × q kinship matrix for individuals in Q, and WQ is the q × (w + 1) submatrix of W that is obtained by extracting the q rows of W that correspond to individuals in Q. Given the choice of
To define the vector V of Equation 6, we first obtain a null MLE of the heritability, call it
(possibly after suitable transformation of phenotypes and/or covariates), where
where
Special Case: Complete Data
In the special case when there are complete data on all sampled individuals, we have R = N = S. In that case, it is easily verified that UΦNR P = P, so the numerators of the (GTAM)² and MASTOR statistics are both equal to (XᵀPY)². Their denominators represent different estimators of the variance of XᵀPY, with the denominator of (GTAM)² representing an estimator of Var(XᵀPY | W, X) under a prospective model, and the denominator of MASTOR representing an estimator of Var₀(XᵀPY | W, Y) under a retrospective model.
In the complete data case, the assumptions required for MASTOR to be a correctly calibrated statistic include the following. (1) E0(X | W, Y) = Wα, where α is an unknown (w + 1) vector of coefficients. In other words, under the null hypothesis of no association between genotype and phenotype, the genotype is permitted to be linearly related to the covariates, or it can be unrelated to the covariates. (2)
Justification for the MASTOR Statistic
We briefly present two different ways of understanding and justifying the MASTOR statistic. First, an intuitively clear interpretation of the MASTOR statistic is obtained by noting that it can be rewritten (see Appendix C) as
where
where
is the best linear unbiased estimator (BLUE) of allele frequency.17 When individual i ∈ N (i.e., i is genotyped), the formula for
The interpretation of MASTOR in terms of BLUP imputation of genotypes is useful in understanding the role played by missing data in MASTOR. Suppose individual i has phenotype and covariate information but is missing genotype data. Then, in the MASTOR method, the greatest potential contribution i could make to the association analysis would occur if he or she had genotyped relatives whose information could then be used in the BLUP imputation of i’s genotype. A somewhat less-substantial contribution that i could make to the association analysis would occur if i had no genotyped relatives but was in the same pedigree with an individual j such that
A second interpretation of MASTOR is that it is a quasilikelihood score test of the null hypothesis H0 : δ = 0, versus HA : δ ≠ 0 in the retrospective mean model
(See Bourgain et al.13 and Wang and McPeek18 for more details on quasilikelihood score tests in this setting.) For a genotyped individual i, the conditional expectation in Equation 14 can be rewritten as
so the summation can be viewed as a weighted sum of the transformed phenotypic residuals, i.e., elements of
For outbred individuals, the mean model given in Equations 14 and 15 holds up to terms of order o(δ) as δ → 0 assuming a general, 2-allele, prospective model of the form $E(Y_i \mid W, X) = W_i \beta + \gamma_1 1_{\{X_i = .5\}} + \gamma_2 1_{\{X_i = 1\}} + \epsilon_i$, with
Despite the fact that the mean model of Equations 14 and 15 does not allow X to depend linearly on W under the null hypothesis, MASTOR is still correctly calibrated when E0(X | W, Y) = Wα (see Appendix C). To also obtain optimality for this case, we can change the mean model, replacing p1 by Wα in Equation 14, which results in a modified MASTOR statistic obtained by replacing U by U′ in Equation 10.
The TSCORE Test
The TSCORE test9 is very similar to GTAM in that it is also a prospective mixed-model analysis that incorporates covariates and uses estimated VCs for additive and environmental variance. A major difference between TSCORE and GTAM is that TSCORE first imputes values for missing genotypes, so that a larger subset of individuals can be analyzed. Thus, the numerator of TSCORE resembles that of MASTOR in Equation 11 but with a different imputation approach. The denominator of TSCORE represents a prospective variance calculation, so it is similar to GTAM in that regard, but with some differences. In the GTAM denominator, the generalized genotypic sum of squares has covariates regressed out, but they are not regressed out in the generalized genotypic sum of squares appearing in the TSCORE denominator. In the GTAM denominator, the term
ASTOR: Is the Heritability Parameter Really Needed?
MASTOR involves estimation of the heritability. This is done only once per genome screen, at least initially, for computational reasons (see subsection Computational Approach and Software). Still, one could ask whether accurate heritability is really needed for MASTOR, because, in the retrospective framework, the statistic would still be correctly calibrated if heritability were ignored. We propose a simplified approach, ASTOR, that is a version of MASTOR in which the heritability is assumed to be 0, so that Σ = I_r. The formula for ASTOR is given by Equation 10 with P replaced by P_I = I_r − W(WᵀW)⁻¹Wᵀ, eliminating the heritability estimation step. However, note that ASTOR still correctly accounts for relatedness in the sample. (This is because the correct calibration of ASTOR depends only on the null conditional mean and variance of the genotype data, not on the phenotype model; see Appendix C.) In Results we compare ASTOR to MASTOR in terms of power and type 1 error.
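The projection used by ASTOR, I − W(WᵀW)⁻¹Wᵀ, maps the phenotype vector to its ordinary least-squares residuals after regression on the covariates. The sketch below computes those residuals in plain Python. It illustrates only this projection step, not the full ASTOR statistic, and the helper names `solve` and `astor_residuals` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def astor_residuals(W, Y):
    """Return Y - W (W^T W)^{-1} W^T Y, the OLS residual vector."""
    k = len(W[0])
    WtW = [[sum(row[a] * row[b] for row in W) for b in range(k)]
           for a in range(k)]
    WtY = [sum(row[a] * y for row, y in zip(W, Y)) for a in range(k)]
    beta = solve(WtW, WtY)
    return [y - sum(w * b for w, b in zip(row, beta))
            for row, y in zip(W, Y)]


# Intercept plus a sex indicator (1 = female), with made-up trait values.
W = [[1, 0], [1, 0], [1, 1], [1, 1]]
Y = [1.0, 3.0, 4.0, 8.0]
residuals = astor_residuals(W, Y)
```

With an intercept and a single binary covariate, the residuals are simply deviations from the sex-specific means, which makes the toy example easy to check by hand.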
Choice of Statistics to Compare by Simulation
MASTOR is designed for association analysis of quantitative traits with related individuals, taking into account covariates. The previously proposed GTAM and TSCORE also address this problem, so it is useful to compare them to MASTOR. EMMAX is designed for samples of individuals not known to be related, rather than for family data, and when EMMAX is applied, the authors5 exclude individuals whose empirical kinship coefficient is greater than .10 (which corresponds to excluding first- and second-degree relatives). If EMMAX were modified to use the known pedigree information, it would be the same as GTAM in our simulation context. Therefore, within the context of our simulation, GTAM could be thought of as representing an upper bound on the power of a potential extension of EMMAX to family data. Previous work15 has shown that FBAT has very low power in simulation settings like those we consider, because it does not incorporate the data on the unrelated individuals and because many of the families do not meet the FBAT criteria for “informative families.” Thus, FBAT is not well suited to analyzing the type of data in our simulations and we do not consider this approach further. Similarly, GQLS does not allow covariates, so it is also not able to handle the type of data we consider. Of these methods, only GTAM and TSCORE are designed to address the problem addressed by MASTOR, namely, quantitative trait association analysis in family data with general pedigree types, taking into account covariates. We also consider the simplified method ASTOR, which addresses the same problems but does not require heritability estimation.
Computational Approach and Software
We have developed software, called MASTOR, which is coded in C and implements the MASTOR, ASTOR, and GTAM methods. To calculate the MASTOR statistic, our software performs two main steps. The first step involves estimation of heritability and covariate effects under the null model, from which we can calculate the transformed phenotypic residual vector, PY. The second step is calculation of the statistic of Equation 10. The first step involves singular value decomposition (SVD) of a kinship matrix, ΦRR, as well as numerical maximization of a likelihood to get the MLEs of the VCs. Note, however, that the computational burden is reduced in two ways. First, the block-diagonal structure of ΦRR, with blocks corresponding to families, allows the SVD to be done independently on each block. Thus, for example, if set R were divided into f equal-size families, the computational cost of the SVD would be O(r³/f²) when the block-diagonal structure is exploited. Depending on f, this could be much faster than the naive SVD, which would have cost O(r³). Second, in our implementation, we use a well-known algebraic trick4,5,19 to rewrite our likelihood as a function of just a single parameter,
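The saving from the block-diagonal structure can be checked with simple arithmetic: f blocks of size r/f cost f · (r/f)³ = r³/f² in total, up to a constant factor. The sketch below counts operations only up to that constant, and the sample size of 65 families of 16 individuals is used purely as an illustration.

```python
def svd_cost_naive(r):
    """Operation count (up to a constant) for an SVD of an r x r matrix."""
    return r ** 3


def svd_cost_blockwise(r, f):
    """Operation count when the matrix splits into f equal diagonal blocks."""
    block = r // f
    return f * block ** 3


r, f = 65 * 16, 65            # illustrative: 65 families of 16 individuals
naive = svd_cost_naive(r)
blockwise = svd_cost_blockwise(r, f)
speedup = naive // blockwise  # equals f ** 2 for equal-size blocks
```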
In a genome scan, the first step of MASTOR would need to be performed only once under certain conditions. For example, this would hold if each person who is phenotyped is either genotyped or has a genotyped relative at every marker in the scan. In that case, the set R would be the same for every marker. For computational reasons, we choose to perform step 1 only once at the initial stage of the genome scan, even though R may differ slightly from marker to marker. To do this, we fix R at the beginning of the analysis. For example, one could include, in R, individuals who are phenotyped and who have at least some minimum number of markers at which they are either genotyped or have a genotyped relative. Then, once a subset of interesting markers has been identified in the initial analysis, one could perform a separate step 1 for each marker in the subset.
The second step, in which the statistic for each marker is calculated, scales linearly in the number of markers m, and for each marker there is an inversion of ΦNN. As in the first step, the block-diagonal structure of ΦNN greatly reduces the computational burden.
The MASTOR software also allows the option of fitting a linear mixed model to the data without performing a genome scan. This option, which we refer to below as mixedMLE, serves two purposes. First, it is useful for preliminary analyses of the phenotype and covariate data in order to formulate the null model (Equation 9). For example, the mixedMLE option could be used to fit the data under versions of the null model having different sets of covariates included and different choices of transformations of phenotype and/or covariates, in order to decide which choices should be used in the association tests. Once the null model is chosen, MASTOR can be run in the default mode to perform the association analysis. After genetic variants of interest have been identified by an initial analysis with MASTOR, the mixedMLE option of MASTOR can then serve a second purpose, namely, it can be applied, with one or more variants included as covariates, to estimate the parameters of the alternative model (Equations 1, 2, 3, and 4), including effect size(s) of variants or even interactions among them.
Simulation Studies
We perform simulation studies in order to (1) compare type 1 error and power of the tests; (2) determine whether it is possible to retain the high power of MASTOR without going to the trouble of estimating heritability (i.e., compare power of ASTOR and MASTOR); and (3) assess sensitivity of MASTOR and GTAM to misspecified VCs and to estimation of the heritability from a subset of individuals that is not exactly the same subset used in the test. To address these questions, we simulate data that include related individuals, under a variety of trait models and assumptions about missing genotype and phenotype, as we now describe.
Trait Models
We simulate five trait models, denoted I, II, III, IV, and V, all of which have sex as a covariate. Model I has a single major gene acting additively with additional additive polygenic effects. It is given by
where 1 is the vector with all elements equal to 1, 1female is the vector with ith element equal to 1 if i is female and 0 if i is male,
where
where
where
Table 1
| | x3 = 0, x2 = 0 | x3 = 0, x2 = .5 | x3 = 0, x2 = 1 | x3 = .5, x2 = 0 | x3 = .5, x2 = .5 | x3 = .5, x2 = 1 | x3 = 1, x2 = 0 | x3 = 1, x2 = .5 | x3 = 1, x2 = 1 |
|---|---|---|---|---|---|---|---|---|---|
| x1 = 0 | .5 | .5 | .5 | .75 | .75 | .75 | 1 | 1 | 1 |
| x1 = .5 | .5 | 3 | 3 | 3.5 | 3.5 | 3.5 | 5 | 5 | 5 |
| x1 = 1 | .5 | 3 | 3 | 3.5 | 3.5 | 3.5 | 5 | 5 | 5 |
Genotypic effects are given as a function of (x1, x2, x3), where xi is 0, .5, or 1 according to whether the individual has 0, 1, or 2 copies of a given allele at locus i.
Models I and II have major genes and are used in simulations to assess both type 1 error and power. Models III, IV, and V have no major genes and are therefore used to assess only type 1 error. For each model, we consider three different phenotype configuration settings, A, B, and C. In phenotype configuration A, the sample consists of 65 ascertained families, each of which consists of 16 individuals in a three-generation outbred pedigree, with one grandparent couple in the first generation and three parent couples in the second generation, two of whom have three offspring and one of whom has two offspring. To simulate null markers and/or major genes, founder genotypes or haplotypes are drawn at random based on their assumed population frequencies. Genotypes for other individuals are then generated by a standard “gene-dropping” approach. For models I and II, phenotypes are generated conditional on the simulated genotypes for the major genes, using the conditional distributions given in the previous paragraph. For the models without major genes (III–V), phenotypes are sampled according to the distributions given in the previous paragraph. For each simulation experiment, a genotype sampling scheme is chosen (described in the next subsection), and in phenotype configuration A, a family is ascertained conditional on having at least six genotyped individuals, i.e., at least six individuals who meet the criteria for the chosen genotype sampling scheme. (Computationally, this is carried out by rejection sampling.) Phenotypes for all individuals in an ascertained family are assumed to be observed in configuration A. 
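The gene-dropping step can be sketched as follows. Founder alleles are drawn independently from an assumed allele frequency, and each offspring receives one randomly chosen allele from each parent. The pedigree layout is an assumption made for illustration (each second-generation couple consists of one child of the grandparents and one married-in founder), and the function names and the allele frequency of 0.3 are hypothetical.

```python
import random


def build_pedigree():
    """16-person, three-generation pedigree: id -> (father, mother)."""
    ped = {"gf": (None, None), "gm": (None, None)}
    for k, n_kids in enumerate([3, 3, 2], start=1):
        ped[f"child{k}"] = ("gf", "gm")      # child of the grandparents
        ped[f"spouse{k}"] = (None, None)     # married-in founder (assumed)
        for c in range(1, n_kids + 1):
            ped[f"kid{k}_{c}"] = (f"child{k}", f"spouse{k}")
    return ped


def gene_drop(ped, freq, rng):
    """Simulate genotypes coded 0, .5, 1 (0, 1, or 2 copies of allele 1)."""
    alleles = {}
    for ind, (fa, mo) in ped.items():        # parents are listed first
        if fa is None:
            # Founder: two independent draws from the population frequency.
            alleles[ind] = (int(rng.random() < freq),
                            int(rng.random() < freq))
        else:
            # Non-founder: one allele transmitted at random by each parent.
            alleles[ind] = (rng.choice(alleles[fa]), rng.choice(alleles[mo]))
    return {ind: sum(a) / 2 for ind, a in alleles.items()}


ped = build_pedigree()
genotypes = gene_drop(ped, freq=0.3, rng=random.Random(1))
```

Ascertainment by rejection sampling would wrap this in a loop that redraws the family until it meets the chosen criteria.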
In phenotype configuration B, the sample consists of 20 ascertained families who satisfy all the same conditions as in A, 500 ascertained unrelated individuals who are both phenotyped and genotyped and who are sampled conditional on meeting the criteria for the chosen genotype sampling scheme (see next subsection), and 500 unrelated, unphenotyped controls who are genotyped and who are randomly sampled from the population. In phenotype configuration C, the sample consists of 65 ascertained families, each of which consists of 16 individuals in a three-generation pedigree. Initially, some individuals’ phenotypes are set missing independently at random, with individuals in the oldest, second-oldest, and youngest generations having probabilities .1, .2, and .4, respectively, of having missing phenotype. In four of the genotype sampling schemes (all, even tails, skewed tails, and upper tail, which are described in the next subsection), all individuals with missing phenotype are assumed to be genotyped, and individuals with nonmissing phenotype are selected for genotyping according to the genotype sampling scheme. (In the random genotype sampling scheme, however, individuals are chosen at random for genotyping, as described in the next subsection, regardless of whether or not they are phenotyped.) Finally, the family is ascertained conditional on having at least three individuals who are both phenotyped and genotyped, at least three who are phenotyped and not genotyped, and at least three who are genotyped and not phenotyped. Note that, as a consequence, in each family in configuration C, there will be at least six individuals genotyped and at least six individuals phenotyped, and only in the random sampling scheme can an individual be missing both genotype and phenotype.
Sampled Genotypes
We consider five different genotype sampling schemes. These genotype sampling schemes do not apply to the 500 unrelated, unphenotyped controls in phenotype configuration B, who are genotyped and are randomly sampled from the population, but they do apply to all other sampled individuals. In the “all” sampling scheme, all sampled individuals are genotyped regardless of their phenotype. As a consequence, ascertainment is random or population based when the all sampling scheme is used. In the “even tails” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≤μ − 1.5σ or ≥μ + 1.5σ, where μ and σ are the population mean and standard deviation of the trait. In the “skewed tails” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≤μ − .5σ or ≥μ + 2.5σ. In the “upper tail” sampling scheme, an individual is genotyped if and only if his or her phenotype value is ≥μ + 1σ. In the “random” sampling scheme, individuals are chosen for genotyping independently at random, with individuals in the oldest, second-oldest, and youngest generations having probabilities .4, .7, and .9, respectively, of being genotyped, regardless of phenotype.
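The five sampling schemes amount to simple selection rules, which can be written down as a sketch. The function names and the generation labels are hypothetical, and the thresholds are the ones stated in the text.

```python
import random

# Genotyping probabilities by generation in the "random" scheme.
RANDOM_PROBS = {"oldest": 0.4, "second-oldest": 0.7, "youngest": 0.9}


def is_genotyped(scheme, y, mu, sigma):
    """Phenotype-based rule: is an individual with trait value y genotyped?"""
    if scheme == "all":
        return True
    if scheme == "even tails":
        return y <= mu - 1.5 * sigma or y >= mu + 1.5 * sigma
    if scheme == "skewed tails":
        return y <= mu - 0.5 * sigma or y >= mu + 2.5 * sigma
    if scheme == "upper tail":
        return y >= mu + 1.0 * sigma
    raise ValueError(f"unknown scheme: {scheme}")


def is_genotyped_random(generation, rng):
    """'Random' scheme: depends on generation only, not on phenotype."""
    return rng.random() < RANDOM_PROBS[generation]


rng = random.Random(0)
example = is_genotyped_random("youngest", rng)
```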
Impact of the Variance Components
One goal of our simulation studies is to assess sensitivity of MASTOR and GTAM to (1) misspecified VCs and (2) estimation of the heritability from a subset of individuals that is not exactly the same subset used in the test. To address (1), for both MASTOR and GTAM, we perform a procedure we call “misspecified VCs” in which we first set
To address (2), we calculate MASTOR and GTAM with the heritability estimated from a slightly different sample than the one used in the association test. Ordinarily, the heritability for MASTOR is estimated based on the individuals in group R, and the heritability for GTAM is estimated based on the individuals in group S = N ∩ R. In simulations, we also consider the results when the heritability for MASTOR is estimated based only on the individuals in group S (the GTAM sample), but the individuals in the larger group R ∪ N are included in the association analysis, and when the heritability for GTAM is estimated based on the larger set of individuals in R (the MASTOR phenotype sample), but only the individuals in S are included in the association analysis. We refer to the MASTOR and GTAM statistics calculated in this way as MASTORG and GTAMM, respectively, where “G” refers to the estimation of heritability from the GTAM sample and “M” refers to the estimation of heritability from the MASTOR phenotype sample.
HDL-C Data from the Framingham Heart Study
The Framingham Heart Study (FHS)1 is a multicohort, longitudinal study of risk factors for cardiovascular disease. Our use of the FHS data was approved by the Institutional Review Board of the Biological Sciences Division of the University of Chicago. The FHS sample consists of unrelated individuals as well as individuals from multigeneration pedigrees. We analyze high-density lipoprotein cholesterol (HDL-C) levels in exam 1 of cohort 3 (third-generation cohort) of FHS. We log-transform the phenotype and include age, age², sex, and log(FPG) as covariates in the analysis, where FPG is fasting plasma glucose. Although we analyze phenotype and covariate information on cohort 3 only, we include in the analysis Affymetrix 500K genotypes on individuals from all three cohorts (original, offspring, and generation 3). We exclude the genotypes of individuals who do not meet all of the following criteria: (1) empirical self-kinship < .525 (i.e., empirical inbreeding coefficient < .05) and (2) completeness (i.e., proportion of markers for which a given individual has genotype called) > 96%. We also use the off-diagonals of the empirical kinship matrix to exclude an additional 298 individuals with empirical kinship values that are not consistent with the pedigree information. The resulting data set has 3,879 individuals who are both genotyped and phenotyped with no missing covariates, 4,718 individuals who are genotyped but are missing the phenotype or some of the covariates, and 194 individuals who have complete phenotype and covariate information but do not have genotype data. Initially, we analyze the 369,046 SNPs from the Affymetrix 500K array that satisfy all of the following criteria: (1) call rate ≥ 96%, (2) Mendelian error rate ≤ 2%, and (3) minor allele frequency ≥ 1%.
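The equivalence between the self-kinship threshold of .525 and the inbreeding threshold of .05 follows from the standard relation that an individual's self-kinship equals (1 + f)/2, where f is the inbreeding coefficient. The sketch below illustrates this relation and the two individual-level genotype filters; the function names are hypothetical and are not part of any published software.

```python
def inbreeding_from_self_kinship(self_kinship):
    """Invert self-kinship = (1 + f) / 2 to recover the inbreeding coefficient f."""
    return 2.0 * self_kinship - 1.0


def passes_genotype_qc(self_kinship, completeness):
    """Apply the two individual-level filters described in the text.

    completeness is the proportion of markers with a called genotype.
    """
    return self_kinship < 0.525 and completeness > 0.96


# The self-kinship cutoff of .525 corresponds to f = .05.
f_at_cutoff = inbreeding_from_self_kinship(0.525)
```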
Because interesting distinctions between MASTOR, GTAM, and TSCORE would be expected to occur only when there are a substantial proportion of individuals who have phenotype and covariate information but are missing genotype data, we randomly mask some genotypes in the data set. In the random masking scheme, the probability of an individual being selected for genotyping is allowed to depend on the phenotype value. Specifically, we set
Results
Assessment of Type 1 Error and the Impact of Variance Components
To assess the type 1 error of the methods, we simulate an unlinked, unassociated marker and test for association by each method. Our type 1 error test is particularly challenging to the methods because the null model is misspecified in almost every scenario (except trait model III with the all sampling scheme). The results in Table 2 show that MASTOR, ASTOR, and GTAM all appear to be correctly calibrated in our simulations, including in those cases when the null model is not correctly specified. In contrast, TSCORE was not correctly calibrated when individuals were chosen for genotyping based on their phenotype value. A likely reason for this is that the TSCORE variance calculation seems to be appropriate only if either (1) the imputation procedure recovers close to complete genotype information or (2) the phenotype distribution is the same for genotyped and ungenotyped individuals.
Table 2
Sampled Genos | Trait Model | Nominal Level | MASTOR | ASTOR | GTAM | TSCORE |
---|---|---|---|---|---|---|
All | IIA | .05 | .049 | .048 | .049 | .050 |
All | IIIA | .05 | .049 | .049 | .050 | .048 |
All | IVA | .05 | .048 | .048 | .047 | .049 |
All | VA | .05 | .051 | .051 | .052 | .050 |
Random | IIC | .05 | .050 | .050 | .052 | .053 |
Upper tail | IIA | .05 | .049 | .050 | .050 | .006 |
Upper tail | IIIA | .05 | .049 | .050 | .049 | .015 |
Upper tail | IIC | .05 | .050 | .049 | .052 | .045 |
Even tails | IIA | .05 | .051 | .051 | .053 | .108 |
Even tails | IIC | .05 | .050 | .051 | .054 | .119 |
All | IIA | .001 | .0011 | .0011 | .0010 | .0008 |
All | IIIA | .001 | .0008 | .0009 | .0009 | .0008 |
All | IVA | .001 | .0012 | .0012 | .0012 | .0010 |
All | VA | .001 | .0010 | .0010 | .0010 | .0009 |
Random | IIC | .001 | .0008 | .0010 | .0011 | .0009 |
Upper tail | IIA | .001 | .0006 | .0010 | .0012 | .0000 |
Upper tail | IIIA | .001 | .0013 | .0012 | .0011 | .0000 |
Upper tail | IIC | .001 | .0014 | .0015 | .0010 | .0006 |
Even tails | IIA | .001 | .0008 | .0008 | .0010 | .0066 |
Even tails | IIC | .001 | .0012 | .0013 | .0014 | .0089 |
Values in bold are those that differ significantly (p value < .01) from the nominal level by a z-test. “Sampled Genos” refers to the different genotype sampling schemes described in subsection Sampled Genotypes of the Material and Methods. The trait models are described in subsection Trait Models of the Material and Methods. For trait model II, the tested SNP has minor allele frequency .2, whereas for trait models III, IV, and V, the tested SNP has minor allele frequency .1.
We also assess the impact of misspecified values for the VCs on the type 1 error of MASTOR and GTAM. In Table 3, the results in column 6 (MASTOR misspecified VCs) show that the type 1 error of MASTOR seems completely unaffected by misspecification of the VC values. In contrast, the results in column 7 (GTAM misspecified VCs) of Table 3 show that GTAM is more sensitive to misspecification of the VCs. This is to be expected, because the asymptotic assessment of significance that we use for GTAM depends crucially on the VCs.
Table 3
Sampled Genos | Trait Model | Nominal Level | MASTORG | GTAMM | MASTOR Misspecified VCs | GTAM Misspecified VCs |
---|---|---|---|---|---|---|
Even tails | IIA | .05 | .051 | .066 | .051 | .097 |
Even tails | IIA | .001 | .0009 | .0018 | .0007 | .0048 |
Even tails | IIC | .05 | .051 | .063 | .048 | .074 |
Even tails | IIC | .001 | .0010 | .0028 | .0012 | .0034 |
Upper tail | IIIA | .05 | .049 | .052 | .049 | .046 |
Upper tail | IIIA | .001 | .0012 | .0013 | .0010 | .0010 |
Upper tail | IIA | .05 | .049 | .043 | .050 | .043 |
Upper tail | IIA | .001 | .0009 | .0008 | .0007 | .0010 |
Upper tail | IIC | .05 | .049 | .047 | .049 | .046 |
Upper tail | IIC | .001 | .0009 | .0009 | .0011 | .0008 |
MASTORG is calculated with heritability estimated from group S (the GTAM sample) and GTAMM is calculated with heritability estimated from R (the MASTOR phenotype sample). Empirical type 1 error is based on 25,000 replicates. Values in bold are those that differ significantly (p value < .01) from the nominal level by a z-test. “Sampled Genos” and “Trait Model” defined in the legend to Table 2. For trait model II, the tested SNP has minor allele frequency .2, whereas for trait model III, the tested SNP has minor allele frequency .1.
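The bolding criterion in the table legends can be illustrated with a one-sample z-test for a proportion. The sketch below assumes a two-sided test with the variance computed under the nominal level; the text does not specify these details, so they should be read as assumptions, and the function name is hypothetical.

```python
import math


def type1_error_z_test(empirical, nominal, replicates):
    """Two-sided z-test of an empirical rejection rate against the nominal level."""
    se = math.sqrt(nominal * (1.0 - nominal) / replicates)
    z = (empirical - nominal) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p value
    return z, p_value


# Even tails, trait model IIA, nominal level .05, 25,000 replicates (Table 3).
z_gtamm, p_gtamm = type1_error_z_test(0.066, 0.05, 25000)
z_mastorg, p_mastorg = type1_error_z_test(0.051, 0.05, 25000)
```

Under these assumptions, the GTAMM rate of .066 differs from .05 at the .01 level, whereas the MASTORG rate of .051 does not.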
We also consider a more subtle point, which is the effect of estimating the heritability parameter in a slightly different subset of the sampled individuals than the subset that is included in the association test. Again, in Table 3 column 4 (MASTORG), we find that the type 1 error of the MASTOR statistic seems completely unaffected by this, whereas the type 1 error of the GTAM statistic, in Table 3 column 5 (GTAMM), can be thrown off by this. This is an important caveat for the use of GTAM: the heritability must be estimated in the same subset of individuals that is being tested for association. Thus, for example, individuals who are phenotyped but have missing genotype must be excluded from the heritability estimation for GTAM, whereas they could be included for MASTOR. (The power improvement for MASTOR when these individuals are included in the heritability estimation is considered in the next subsection.)
In the type 1 error assessments presented in Tables 2 and 3, for trait model II, the tested SNP has minor allele frequency .2, whereas for trait models III, IV, and V, the tested SNP has minor allele frequency .1. We obtained similar results with other choices of minor allele frequency, as well as with other choices of model and significance level (results not shown).
Power Studies
To compare the power of the tests, we simulate under models I and II, which have major genes. We simulate a total of 60 scenarios, where 40 of these are obtained by choosing all possible combinations from among two models (I and II), two phenotype configurations (A and B), four genotype sampling schemes (all, even tails, skewed tails, and upper tail), and for model II, testing each of the four different causal SNPs (whereas model I has only one causal SNP). An additional 20 scenarios were obtained by simulating model II with phenotype configuration C with each of five genotype sampling schemes (the four listed above plus random) and testing each of four different causal SNPs. For each of the 60 scenarios, power was evaluated at four different significance levels, .05, .01, .005, and .001. Results from all 240 settings are represented in Figure 1, and a subset of the results appears in Table 4. Because TSCORE was not correctly calibrated in many of the simulation settings, we did not include it in the power comparison.
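The scenario counts above can be checked by direct enumeration. The following sketch simply lists the combinations described in the text; the variable names are illustrative.

```python
from itertools import product

schemes_ab = ["all", "even tails", "skewed tails", "upper tail"]
causal_snps = {"I": [1], "II": [1, 2, 3, 4]}  # model I has a single causal SNP

# Models I and II crossed with configurations A and B and four sampling schemes.
scenarios_ab = [
    (model, config, scheme, snp)
    for model, config, scheme in product(["I", "II"], ["A", "B"], schemes_ab)
    for snp in causal_snps[model]
]

# Model II with configuration C, five sampling schemes, and four causal SNPs.
scenarios_c = [
    ("II", "C", scheme, snp)
    for scheme, snp in product(schemes_ab + ["random"], causal_snps["II"])
]

scenarios = scenarios_ab + scenarios_c
levels = [0.05, 0.01, 0.005, 0.001]
settings = list(product(scenarios, levels))
```

The enumeration gives 40 + 20 = 60 scenarios and 60 × 4 = 240 settings, in agreement with the counts stated above.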
Table 4
Sampled Genos | Trait Model | Tested SNP | Level | ASTOR | GTAM | MASTOR |
---|---|---|---|---|---|---|
All | IA | 1 | .01 | .48 | .57 | .57 |
All | IB | 1 | .01 | .48 | .53 | .52 |
All | IIA | 3 | .01 | .56 | .66 | .66 |
All | IIB | 3 | .01 | .56 | .62 | .61 |
Random | IIC | 3 | .01 | .32 | .34 | .37 |
Even tails | IA | 1 | .001 | .55 | .51 | .60 |
Even tails | IB | 1 | .001 | .88 | .90 | .91 |
Even tails | IIA | 3 | .001 | .87 | .86 | .91 |
Even tails | IIB | 2 | .05 | .29 | .31 | .31 |
Even tails | IIC | 3 | .001 | .47 | .45 | .52 |
Skewed tails | IA | 1 | .05 | .30 | .30 | .35 |
Skewed tails | IB | 1 | .05 | .38 | .48 | .48 |
Skewed tails | IIA | 4 | .01 | .41 | .38 | |
Skewed tails | IIB | 4 | .01 | .53 | .66 | .70 |
Skewed tails | IIC | 4 | .01 | .57 | .29 | |
Upper tail | IA | 1 | .05 | .30 | .22 | |
Upper tail | IB | 1 | .05 | .21 | .27 | |
Upper tail | IIA | 4 | .05 | .45 | .30 | |
Upper tail | IIB | 4 | .05 | .36 | .41 | |
Upper tail | IIC | 4 | .05 | .78 | .18 | |
Empirical power is based on 25,000 replicates. For empirical power estimates in the range .2–.8, the estimated standard error is .003, whereas for the estimates outside this range, the estimated standard error is .002. MASTOR power values with a single underline are those that are at least .05 larger than the power of GTAM for that scenario, whereas values with a double underline are those that are at least .1 larger than the power of GTAM for that scenario. “Sampled Genos” and “Trait Model” defined in the legend to Table 2. For model I, association is tested with the sole causal SNP. For model II, association is tested with one of causal SNPs 1, 2, 3, or 4, as indicated in the “Tested SNP” column.
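The standard errors quoted in the legend are consistent with the usual binomial standard error of a proportion estimated from 25,000 replicates. A brief check, with a hypothetical function name:

```python
import math


def power_standard_error(power, replicates=25000):
    """Binomial standard error of an empirical power estimate."""
    return math.sqrt(power * (1.0 - power) / replicates)


se_mid = power_standard_error(0.5)    # largest possible value, at power .5
se_edge = power_standard_error(0.2)   # same value at power .8 by symmetry
se_high = power_standard_error(0.9)   # an estimate outside the .2 to .8 range
```

To three decimal places these evaluate to .003, .003, and .002, respectively.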
From Table 4 and Figure 1F, it is clear that MASTOR has power that is either approximately equal to or higher than that of GTAM in all of our simulations, and in some cases it is much higher. From rows 1 and 3 of Table 4, we can see that when there are no missing data, the power of MASTOR and GTAM is equivalent. However, in most of the missing data scenarios, MASTOR outperforms GTAM. This is expected because MASTOR uses information on dependence among relatives to allow family members with some missing data to contribute to the analysis.
One goal of our simulation studies was to determine whether we could retain the high power of MASTOR without going to the trouble of estimating heritability. Specifically, we considered whether we could do almost as well as MASTOR by setting the heritability to 0 instead of estimating it. (We called the resulting statistic ASTOR.) From Table 4 and Figure 1B, it is clear that MASTOR dominates ASTOR in terms of power, so the heritability estimation step seems important for power. From Table 4 and Figure 1D, we can see that ASTOR sometimes does much better but often does worse than GTAM. ASTOR tends to do better than GTAM in phenotype configurations A and C with missing data and worse in phenotype configuration B or in the absence of missing data. It makes sense that ASTOR would perform well in strongly family-based samples with missing data because, like MASTOR, ASTOR is able to improve power by using information on dependence among relatives to allow family members with some missing data to contribute to the analysis.
Another goal of our simulation studies was to determine whether, to estimate heritability for the MASTOR and GTAM statistics, one should use only that subset, S, of individuals who have both phenotype and genotype data, or whether one should use the larger subset, R, of phenotyped individuals who do not necessarily have genotype data (but who meet additional conditions detailed above). From Figure 1A, it is clear that for the MASTOR analysis, higher power is achieved by estimating heritability from the full subset, R (results labeled “MASTOR” in Figure 1A), rather than from the subset having both phenotype and genotype data (results labeled “MASTORG” in Figure 1A). In contrast, for the GTAM analysis, there is no gain in power from including the full subset of individuals in R when some of them have missing genotype data (results labeled “GTAMM” in Figure 1E) as opposed to including only the subset of individuals who are both genotyped and phenotyped (results labeled “GTAM” in Figure 1E). In fact, use of the larger subset of individuals to estimate heritability for GTAM can lead to either an inflation or deflation of type 1 error (Table 3 and the red and blue dots in Figures 1C and 1E).
Analysis of HDL-C Data from the Framingham Heart Study
Table 5 reports the parameter estimates for the null model of log(HDL-C), fitted in cohort 3. We estimated the heritability to be .50 (95% confidence interval of .44–.56), which is consistent with previously reported20,21 estimates of 0.40–0.69. The Q-Q plot for the MASTOR genome scan (see Figure S1 available online) does not show evidence of inflation, and the genomic control inflation factor is λGC = 1.01. In the initial association analysis (before masking of genotypes), the results by GTAM and TSCORE (not shown) are almost the same as those for MASTOR, which is to be expected because of the low proportion of individuals with phenotype and covariate information who also have missing genotype. In this initial analysis, one SNP, rs9989419, shows significant association with HDL-C after Bonferroni correction, with a nominal p value of 1.0 × 10⁻⁸ by MASTOR. SNP rs9989419 is located in cholesterol ester transfer protein (CETP [MIM 118470]) and has been previously reported as associated with HDL-C level.22,23 A previous analysis of a much larger subset of the Framingham data24 also identified this association, using a method that accounts only for sibling correlations, with genomic control used to make a further correction. In Table 6 we report all SNPs with p value ≤ 10⁻⁵ in the MASTOR analysis. These SNPs are within or in close proximity to eight gene regions, three of which (CETP, LPL [MIM 609708], and LIPG [MIM 603684]) have been reported and replicated before.22,25–27
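The Bonferroni statement can be illustrated with a short calculation. The sketch below assumes a family-wise level of .05, which the text does not state explicitly, and uses the number of tested SNPs (369,046) together with the three smallest original-data MASTOR p values from Table 6.

```python
n_snps = 369_046
alpha = 0.05  # assumed family-wise level; not stated in the text
threshold = alpha / n_snps

# Three smallest original-data MASTOR p values reported in Table 6.
p_values = {"rs9989419": 1.0e-8, "rs10503669": 1.8e-7, "rs17482753": 2.1e-7}
significant = [snp for snp, p in p_values.items() if p < threshold]
```

Under this assumption the threshold is about 1.35 × 10⁻⁷, so only rs9989419 passes, in agreement with the single significant SNP reported.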
Table 5
Parameter | MLE | SE |
---|---|---|
Narrow-sense heritability (ξ) | .50 | .03 |
Additive variance | .032 | .003 |
Environmental variance | .032 | .002 |
Intercept | 5.12 | 0.02 |
Coefficient of age | −.007 | .003 |
Coefficient of age² | .00012 | .00005 |
Coefficient of sex | .243 | .008 |
Coefficient of log(FPG) | −.32 | .03 |
Age is measured in years. Sex is coded as female = 2, male = 1. The log(FPG) is the natural logarithm of fasting plasma glucose.
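The heritability estimate in Table 5 can be related to the two variance components, and the reported confidence interval is consistent with a Wald-type interval built from the standard error. The interval construction is an assumption here, since the text does not say how the interval was obtained.

```python
additive_var = 0.032
environmental_var = 0.032

# Narrow-sense heritability as the additive share of the total variance.
heritability = additive_var / (additive_var + environmental_var)

se = 0.03
# Wald-type 95% interval (assumed construction).
ci = (heritability - 1.96 * se, heritability + 1.96 * se)
```

Rounded to two decimals, this gives .50 with an interval of (.44, .56).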
Table 6
SNP | Chr | Position | Gene Region | MASTORᵃ p Value (Original Data) | MASTOR p Value (Masked Genotypes) | GTAM p Value (Masked Genotypes) | TSCORE p Value (Masked Genotypes) |
---|---|---|---|---|---|---|---|
rs11707795 | 3 | 139549939 | CLSTN2 | 4.8 × 10⁻⁶ | 2.7 × 10⁻⁴ | .153 | .005 |
rs4921964 | 8 | 18679795 | PSD3 | 5.0 × 10⁻⁶ | 5.9 × 10⁻⁴ | .075 | .005 |
rs17482753 | 8 | 19832646 | LPL-SLC18A1 | 2.1 × 10⁻⁷ | 3.1 × 10⁻⁴ | .260 | .002 |
rs10503669 | 8 | 19847690 | LPL-SLC18A1 | 1.8 × 10⁻⁷ | 1.6 × 10⁻⁵ | .021 | 3.4 × 10⁻⁴ |
rs17410962 | 8 | 19848080 | LPL-SLC18A1 | 7.6 × 10⁻⁷ | 1.1 × 10⁻⁵ | .002 | 1.7 × 10⁻⁴ |
rs17489268 | 8 | 19852045 | LPL-SLC18A1 | 7.3 × 10⁻⁷ | 7.2 × 10⁻⁶ | .013 | 2.6 × 10⁻⁴ |
rs17411024 | 8 | 19852134 | LPL-SLC18A1 | 7.5 × 10⁻⁷ | 5.7 × 10⁻⁵ | .042 | 4.6 × 10⁻⁴ |
rs17411031 | 8 | 19852310 | LPL-SLC18A1 | 7.3 × 10⁻⁷ | 1.2 × 10⁻⁴ | .112 | .002 |
rs17411126 | 8 | 19855272 | LPL-SLC18A1 | 1.8 × 10⁻⁶ | 8.3 × 10⁻⁵ | .039 | .001 |
rs765547 | 8 | 19866274 | LPL-SLC18A1 | 8.0 × 10⁻⁷ | 3.1 × 10⁻⁵ | .004 | 7.3 × 10⁻⁴ |
rs1837842 | 8 | 19868290 | LPL-SLC18A1 | 8.5 × 10⁻⁷ | 7.5 × 10⁻⁵ | .052 | .002 |
rs1919484 | 8 | 19869676 | LPL-SLC18A1 | 5.4 × 10⁻⁷ | 7.2 × 10⁻⁶ | .093 | 3.5 × 10⁻⁴ |
rs7006101 | 8 | 81897200 | PAG1 | 4.8 × 10⁻⁶ | .003 | .104 | .011 |
rs7904836 | 10 | 4097880 | KLF6 | 2.3 × 10⁻⁶ | .035 | .039 | .041 |
rs17259942 | 12 | 77072077 | ZDHHC17-OSBPL8 | 4.5 × 10⁻⁶ | .003 | .101 | .009 |
rs9989419 | 16 | 56985139 | CETP-HERPUD1 | 1.0 × 10⁻⁸ | 1.9 × 10⁻⁴ | .090 | .003 |
rs7240405 | 18 | 47159090 | ACAA2-LIPG | 6.3 × 10⁻⁶ | .004 | .349 | .016 |
rs4939883 | 18 | 47167214 | ACAA2-LIPG | 7.7 × 10⁻⁶ | .003 | .358 | .018 |
rs2156552 | 18 | 47181668 | ACAA2-LIPG | 7.7 × 10⁻⁶ | .006 | .532 | .024 |
rs6507945 | 18 | 47243912 | ACAA2-LIPG | 2.9 × 10⁻⁶ | .001 | .005 | .003 |
After masking some individuals’ genotypes (as described in the HDL-C Data from the Framingham Heart Study subsection of Material and Methods), we again tested for association with each of the 20 SNPs in Table 6 by using MASTOR, GTAM, and TSCORE, and the results are given in the last three columns of Table 6. The GTAM p values are the largest for 19 of the 20 SNPs, because GTAM uses only the individuals having complete data. The TSCORE p values are generally larger than those of MASTOR in this analysis, probably because the TSCORE variance calculation is appropriate only if either (1) the imputation procedure recovers close to complete genotype information or (2) the phenotype distribution is the same for genotyped and ungenotyped individuals. In addition, the Ghost implementation of TSCORE does not allow loops, which occurred in six of the Framingham pedigrees.
Run Times for MASTOR
We performed simulations to estimate the run times of MASTOR and demonstrate its computational feasibility. We used a single processor on a shared machine with an 8-core Intel Xeon 3.16 GHz CPU and 32 GB RAM. For a data set with 65 three-generation families with at least 6 genotyped individuals per family, the first step took 20 ms and the second step took about 9 min (552,930 ms) to analyze 500,000 SNPs. As expected, doubling the number of families to 130 roughly doubles the time, with the second step taking about 19 min (1,120,780 ms) for 500,000 SNPs. Thus, our MASTOR software is clearly practical for genome-wide association studies.
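The reported timings translate into a per-SNP cost and a scaling factor as follows. This is plain arithmetic on the numbers quoted above; the variable names are illustrative.

```python
n_snps = 500_000
ms_65_families = 552_930
ms_130_families = 1_120_780

per_snp_65 = ms_65_families / n_snps     # milliseconds per SNP, 65 families
per_snp_130 = ms_130_families / n_snps   # milliseconds per SNP, 130 families
ratio = ms_130_families / ms_65_families # time ratio when families are doubled
minutes_65 = ms_65_families / 60_000
minutes_130 = ms_130_families / 60_000
```

The second step therefore costs roughly 1.1 ms per SNP for 65 families and 2.2 ms per SNP for 130 families, a ratio of about 2.03.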
Discussion
Data sets that contain both individuals with known familial relationships and unrelated individuals are common. Often, families that have been previously recruited for linkage analysis are later typed on high-density SNP chips for association analysis. It is also common to recruit the offspring of individuals who are in population-based studies, resulting in data sets with related individuals, such as the Framingham Heart Study.1 For such studies, it is important to develop association analysis tools that can incorporate pedigree and covariate information and can handle missing data.
We have developed MASTOR, a powerful and robust method for association testing of quantitative traits in samples containing related individuals. MASTOR includes adjustment for covariates, is applicable to completely general combinations of unrelated and related individuals, and can appropriately handle and leverage missing data. MASTOR takes into account the dependence among related individuals to incorporate into the analysis phenotype and covariate information from individuals with missing genotype data and genotype data from individuals with missing phenotype or covariate information. In simulations, we show that the type 1 error of MASTOR is well calibrated, and we demonstrate the power gains that MASTOR obtains by (1) modeling the residual phenotypic correlation among related individuals and (2) incorporating partially missing information on related individuals. Because MASTOR is a retrospective analysis, it is robust to misspecification of the model for the distribution of phenotypes among related individuals. Thus, as demonstrated in our simulations, MASTOR remains well calibrated even when the variance components are misspecified or the heritability is estimated from a different sample than the one being tested for association. For best power results, however, heritability should ideally be estimated with the set of phenotyped individuals who contribute to the association analysis.
We used MASTOR to test for association with HDL cholesterol based on genome-wide SNP data from the Framingham Heart Study. In a version of the data set that included both individuals with phenotypes and covariates but missing genotypes and also individuals with genotypes but missing phenotypes and covariates, MASTOR was able to use the partially missing data to increase power over GTAM. Out of the 369,046 SNPs we tested, all of our 10 smallest p values (and 15 of our 20 smallest p values) are within or in close proximity to genes that have been previously reported and replicated as associated with HDL cholesterol, verifying that MASTOR is able to home in on the important loci in a genome scan.
Software implementing the MASTOR, ASTOR, and GTAM statistics is freely downloadable under an open source GNU GPL (see Web Resources). We have demonstrated that MASTOR is computationally feasible, making it suitable for genome-wide association studies. There is still the potential for improvement in computational speed. For example, for a data set with multiple families, a natural parallelization scheme (not currently implemented) would be to process each family independently in parallel. Different sets of SNPs could also be processed independently in parallel. Another approach to speeding up the computations addresses the most time-consuming part of the analysis, which is the inversion of ΦNN for every marker, where N is the set of individuals with nonmissing genotype at the given marker and where the inverse matrix can be calculated separately for each family. Recomputation of the inverse matrix at each marker allows for the possibility that genotypes may be missing for different individuals at different SNPs. However, in a genome screen with 500,000 SNPs, for each small- to medium-sized family in a sample, we are likely to see the same pattern of missingness for more than one SNP, in which case the inverse matrix could be calculated fewer times than the number of SNPs. For example, for a family with 16 genotyped individuals, there are 2¹⁶ = 65,536 missingness patterns possible, far fewer than the number of SNPs in a typical genome-wide association study. Furthermore, some individuals may be more likely to have missing genotypes than others, for example because of the quality of the sampled DNA, which may further reduce the number of observed missingness patterns. Then it becomes a question of whether or not storing in memory relevant information on all (or the most common) missing genotype patterns for each family is computationally preferable to performing the matrix inversion for every SNP.
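The caching idea discussed above can be sketched as follows. This is an illustrative outline only, not the MASTOR implementation (which, as noted, does not currently include this optimization): the class and function names are hypothetical, the kinship matrix is for a toy family of two unrelated parents and one child, and a plain Gauss-Jordan routine stands in for whatever linear algebra library would be used in practice.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


class KinshipInverseCache:
    """Store the inverse of the kinship submatrix for each missingness pattern."""

    def __init__(self, kinship):
        self.kinship = kinship
        self.cache = {}
        self.inversions = 0  # number of inversions actually performed

    def inverse_for(self, genotyped):
        key = tuple(sorted(genotyped))  # indices of genotyped family members
        if key not in self.cache:
            sub = [[self.kinship[i][j] for j in key] for i in key]
            self.cache[key] = invert(sub)
            self.inversions += 1
        return self.cache[key]


# Kinship matrix for two unrelated parents (0, 1) and their child (2).
phi = [[0.5, 0.0, 0.25], [0.0, 0.5, 0.25], [0.25, 0.25, 0.5]]
cache = KinshipInverseCache(phi)

# Genotyped individuals at five SNPs; two patterns repeat.
patterns = [(0, 1, 2), (0, 2), (0, 1, 2), (0, 2), (1, 2)]
inverses = [cache.inverse_for(p) for p in patterns]
```

In this toy run only three inversions are performed for five SNPs. Whether the memory cost of such a cache is worthwhile for real families is exactly the trade-off raised in the paragraph above.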
For MASTOR (and, in fact, for ASTOR, GTAM, and TSCORE as well), the variance calculation involves the kinship matrix and it would be moderately sensitive to misspecified kinship. We could extend the MASTOR method to also correct for both population structure and misspecified kinship, in addition to missing data and known family structure. We describe two possible approaches to this. One approach, analogous to the ROADTRIPS15 method for binary traits, would be to replace the MASTOR statistic of Equation 6 by
where
Acknowledgments
This study was supported by the National Institutes of Health grant R01 HG001645 (to M.S.M.). The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.
Appendix A: Extension to Multiallelic Variant
One approach to testing association with a multiallelic variant having a allelic classes is to perform an (a − 1) degree-of-freedom score test of the null hypothesis of no allelic association. In some situations, it may make sense to first pool some of the allelic classes in order to reduce the number of degrees of freedom. Let a be the number of allelic classes, possibly after pooling, and let Z be the n × (a − 1) matrix with Zij = 1/2 × (the number of class-j alleles in individual i). Then the multiallelic extension of the MASTOR statistic is
Note that when a = 2, Z reduces to X, and Equation A1 reduces to Equation 10. An equivalent formulation of Equation A1 can be obtained as follows: first, let F be the (a − 1) × (a − 1) matrix having (i, j)th entry Fij = Cov(Zki, Zkj) for any outbred individual k. We assume Cov0(Zki, Zlj | W, Y) = Fij 2ϕkl. Define Zi = (Z1i,…, Zni)T, the ith column of Z. We estimate F by a generalized sample covariance matrix
Under the null hypothesis of no association and no linkage, the MASTOR statistic of Equations A1 and A2 is asymptotically χ² distributed with a − 1 degrees of freedom.
A different, previously proposed estimator13 of F is given by
Appendix B: Extension to MZ Twins
The main challenge presented by the occurrence of MZ twins in the sample is that the kinship matrix ΦNN will not be invertible15 if there are MZ twins in N, the set of individuals with nonmissing genotype. (A similar problem would occur for ΦQQ.) In contrast, invertibility of Σ or
Appendix C: Additional Mathematical Details of MASTOR
To see that the MASTOR statistic of Equations 6 and 10 can be rewritten in the form of Equation 11, note that we can rewrite the BLUP of Equation 12 in matrix notation as
where 1r is a vector of length r with every element equal to 1 and 1 is a vector of length n with every element equal to 1. Then
The asymptotic
and
In subsection Justification for the MASTOR Statistic of the Material and Methods section, the assumptions needed for correct calibration of MASTOR are stated: (1)
The optimality of MASTOR is based on the standard results for quasi-likelihood and holds under the retrospective mean model in Equations 14 and 15, where this model is discussed in some detail in subsection Justification for the MASTOR Statistic of the Material and Methods section. The retrospective mean model of Equations 14 and 15 does not have genotype linearly related to the covariates under the null, so the version of MASTOR in Equation 10 is optimal only when genotype is unrelated to covariates under the null. If one desired optimality of MASTOR under the retrospective model E(X | W, Y) = Wα + δΦNRPY, in which genotype is linearly related to the covariates under the null, this would be achieved by replacing U by U′ in the MASTOR statistic of Equation 10. This could be interpreted in terms of BLUP imputation with covariates, in addition to relatives’ genotypes, as predictors.
Web Resources
The URLs for data presented herein are as follows:
MASTOR source code, http://www.stat.uchicago.edu/~mcpeek/software/index.html
Online Mendelian Inheritance in Man (OMIM), http://www.omim.org/
References
Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
Full text links
Read article at publisher's site: https://doi.org/10.1016/j.ajhg.2013.03.014
Read article for free, from open access legal sources, via Unpaywall: http://www.cell.com/article/S0002929713001225/pdf