ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure.

Thornton T; McPeek MS

doi:10.1016/j.ajhg.2010.01.001

ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure.

Thornton T ¹,

McPeek MS

Affiliations

1. Department of Biostatistics, University of Washington, Seattle, WA 98195, USA.
Authors
Thornton T¹
(1 author)

American Journal of Human Genetics, 04 Feb 2010, 86(2):172-184
https://doi.org/10.1016/j.ajhg.2010.01.001 PMID: 20137780 PMCID: PMC2820184

Free full text in Europe PMC

Abstract

Genome-wide association studies are routinely conducted to identify genetic variants that influence complex disorders. It is well known that failure to properly account for population or pedigree structure can lead to spurious association as well as reduced power. We propose a method, ROADTRIPS, for case-control association testing in samples with partially or completely unknown population and pedigree structure. ROADTRIPS uses a covariance matrix estimated from genome-screen data to correct for unknown population and pedigree structure while maintaining high power by taking advantage of known pedigree information when it is available. ROADTRIPS can incorporate data on arbitrary combinations of related and unrelated individuals and is computationally feasible for the analysis of genetic studies with millions of markers. In simulations with related individuals and population structure, including admixture, we demonstrate that ROADTRIPS provides a substantial improvement over existing methods in terms of power and type 1 error. The ROADTRIPS method can be used across a variety of study designs, ranging from studies that have a combination of unrelated individuals and small pedigrees to studies of isolated founder populations with partially known or completely unknown pedigrees. We apply the method to analyze two data sets: a study of rheumatoid arthritis in small UK pedigrees, from Genetic Analysis Workshop 15, and data from the Collaborative Study of the Genetics of Alcoholism on alcohol dependence in a sample of moderate-size pedigrees of European descent, from Genetic Analysis Workshop 14. We detect genome-wide significant association, after Bonferroni correction, in both studies.

Free full text

Am J Hum Genet. 2010 Feb 12; 86(2): 172–184.

https://doi.org/10.1016/j.ajhg.2010.01.001

PMCID: PMC2820184

PMID: 20137780

ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure

Timothy Thornton¹ and Mary Sara McPeek^2,^3,

Author information Article notes Copyright and License information Disclaimer

This article has been cited by other articles in PMC.

Abstract

Introduction

It is well known that problems can arise in case-control genetic association studies when there is population structure.¹ At its most basic, case-control association testing can be thought of as a comparison of the allele (or genotype) frequency distribution between cases and controls, and markers that are not directly associated with the trait of interest can be spuriously associated with the trait if ancestry differences between cases and controls are not properly accounted for. Similarly, failure to account for population structure can also reduce power. To correct for population structure in case-control studies with samples of unrelated individuals, a number of methods have been proposed, including genomic control (GC),² structured association,³ spectral analysis,^4–8 and other approaches.^9–13 However, many genetic studies include related individuals. Several methods have been proposed for case-control association testing in related samples from a single population with known pedigrees^14–16 or with unknown or partially known pedigrees.¹⁷ However, these methods might not be valid in the presence of population heterogeneity. For certain types of study designs, family-based association tests such as the TDT¹⁸ and FBAT¹⁹ have been used to protect against potential problems of unknown population substructure. Family-based tests, however, are generally less powerful than case-control association methods^20,21 and are more restrictive because they typically require genotype data for family members of an affected individual. In contrast, case-control designs can allow, but do not require, genotype data for relatives of affected individuals.

We address the general problem of case-control association testing in samples with related individuals from structured populations. We do not put constraints on how the individuals might be related, and we allow for the possibility that the pedigree information could be partially or completely missing. We propose a new method, ROADTRIPS, where this name is inspired by the description of the method as a robust association-detection test for related individuals with population substructure. ROADTRIPS uses a covariance matrix estimated from genome-screen data to simultaneously correct for both population and pedigree structure. The method does not require the pedigree structure of the sampled individuals to be known, but when pedigree information is available, the method can improve power by incorporating this information into the analysis. ROADTRIPS is computationally feasible for genetic studies with millions of markers. Other features of ROADTRIPS include (1) appropriate handling of missing data and (2) the ability to incorporate both unaffected controls and controls of unknown phenotype (i.e., general population controls) in the analysis.

In order to compare ROADTRIPS to other methods, on the basis of type 1 error and power, we simulate case-control samples containing both related and unrelated individuals with various types of population structure, including admixture. We also apply ROADTRIPS to identification of SNPs associated with rheumatoid arthritis (RA [MIM 180300]) in small UK pedigrees²² from Genetic Analysis Workshop (GAW) 15, and we apply it to identification of SNPs associated with alcohol dependence (MIM 103780) in a sample of moderate-size pedigrees of European descent from the Collaborative Study of the Genetics of Alcoholism (COGA) data²³ of GAW 14.

Material and Methods

We first describe a class of testing procedures suitable for known structure. Then we describe the ROADTRIPS method for extending these tests to the contexts of unknown and partially known structure.

Overview of Association Testing with Known Structure

Consider the problem of testing for association of a genetic marker with a particular phenotype in a sample of n genotyped individuals. For simplicity, we assume that the marker to be tested is a SNP, with alleles labeled “0” and “1.” (The extension to multiallelic markers can be obtained as in previous work.¹⁶) Let Y = (Y₁, … Y_n)^T where Y_i = ½ × (the number of alleles of type 1 in individual i), so the value of Y_i is 0, ½, or 1. We treat the genotype data on the n individuals as random and the available phenotype information as fixed in the analysis, an approach that is appropriate, for example, with either random or phenotype-based ascertainment. Under the null hypothesis of no association and no linkage between marker and trait, we assume that the expected value of Y is E₀(Y) = p1, where 1 is a column vector of 1s of length n and p is a parameter representing the frequency of the type 1 allele. In models incorporating population structure, p would typically be interpreted as an “ancestral” allele frequency or some kind of average allele frequency across subpopulations. We denote by Var₀(Y) = Σ the n × n covariance matrix of Y under the null hypothesis of no association. It is often convenient to write Σ = σ²Ψ, where σ² is defined to be the variance of Y for an outbred individual in the absence of population structure, and Ψ accounts for relatedness, inbreeding, and population structure. We use the term “known structure” to refer to the case when the matrix Ψ is known. We always take σ² to be unknown and estimate it from the sample. Denote by $\hat{σ}$ ² a suitable estimator of σ² (where two examples of suitable estimators are given in the next subsection). Then, in the case of known structure, we consider test statistics for association that have the rather general form

\frac{{(V^{T} Y)}^{2}}{({\hat{σ}}^{2} V^{T} Ψ V)},

(Equation 1),

where V is a fixed, nonzero column vector of length n such that V^T1 = 0. Note that Var₀(V^TY) = σ²V^TΨV, so the denominator in Equation 1 can be viewed as an estimator of Var₀(V^TY). In a test for association, V would naturally include phenotype information and could also include pedigree information. One could include covariate information in V as well, although we do not treat that situation in the present work. There are a number of case-control association test statistics that have the general form in Equation 1, including the Pearson χ² test, the Armitage trend test,²⁴ the corrected χ² test,¹⁵ the W_QLS test,¹⁵ and the M_QLS test¹⁶ (details on how these tests can be written in the form of Equation 1 are given in subsection Examples of Association Tests with Known Structure). Under standard regularity conditions, the test statistic given in Equation 1 has an asymptotic χ₁² distribution under the null hypothesis of no association and no linkage.

Estimation of σ² when Structure Is Known

In the context of Equation 1, when structure is known, we have two general approaches for estimating σ² under the null hypothesis. If we assume that, for an outbred individual in the absence of population structure, HWE holds at the marker, then σ² = ½p(1 − p), where p is the frequency of allele 1 at the SNP being tested, and a reasonable estimator of σ² under this assumption is ${\hat{σ}}_{1}^{2} = 0.5 \hat{p} (1 - \hat{p})$ , where $\hat{p}$ is a suitable estimator of p, the frequency of allele 1 at the SNP being tested. Examples of suitable estimators of p are (1) the sample frequency, $\bar{Y}$ ; (2) the best linear unbiased estimator (BLUE),²⁵ given by

\hat{p} = {(l^{T} Ψ^{- 1} l)}^{- 1} l^{T} Ψ^{- 1} Y;

(Equation 2);

and (3) a Bayesian estimator²⁶ such as $(n \bar{Y} + 0.5) / (n + 1)$ .

Alternatively, an approach to estimation of σ² that does not assume σ² = 0.5p(1 − p) could be used. When Ψ is known, a reasonable estimator is

{\hat{σ}}_{2}^{2} = {(n - 1)}^{- 1} [Y^{T} Ψ^{- 1} Y - {(l^{T} Ψ^{- 1} l)}^{- 1} {(l^{T} Ψ^{- 1} Y)}^{2}],

(Equation 3),

which is RSS/(n − 1) for generalized regression of Y on 1, where RSS is the residual sum of squares. Note that when Ψ = I, the n × n identity matrix, e.g., with unrelated individuals in the absence of population structure, then ${\hat{σ}}_{2}^{2}$ is just the sample variance of Y.

Examples of Association Tests with Known Structure

Corrected Pearson χ² and Armitage Trend Tests

In the standard Pearson χ² and Armitage trend tests for allelic association, one assumes that the individuals are unrelated with no population structure, so that Ψ = I. A corrected version of the Pearson χ² test has previously been described¹⁵ for the situation when sampled individuals are related with all relationships known, in which case Ψ = Φ, where Φ is the kinship matrix, which is obtained as a function of the known pedigree information and is given by

Φ = (\begin{matrix} 1 + h_{1} & 2 ϕ_{12} & \dots & 2 ϕ_{1 n} \\ 2 ϕ_{12} & 1 + h_{2} & \dots & 2 ϕ_{2 n} \\ ⋮ & \dots & \dots & ⋮ \\ 2 ϕ_{n 1} & 2 ϕ_{n 2} & \dots & 1 + h_{n} \end{matrix}),

(Equation 4),

where h_i is the inbreeding coefficient of individual i, and ϕ_ij is the kinship coefficient between individuals i and j, 1 ≤ i, j ≤ n. We propose to use this same choice of Ψ in the corrected Armitage test. (More generally, for either test, one might consider known structure Ψ that does not necessarily equal Φ.) In both tests, one further assumes that every individual in the sample can be classified as either case or control. In that context, let 1_c be the case indicator, i.e., the vector of length n whose ith entry is 1 if individual i is a case and 0 if individual i is a control. Then both the corrected Pearson χ² and corrected Armitage test statistics are obtained as special cases of Equation 1 with the choice

V = 1_{c} - \frac{n_{c}}{n} 1,

(Equation 5),

where n_c is the number of case individuals among the n total individuals. (In the most general specification of the Armitage test for genetic association, mean-zero, nonlinear functions of Y are allowed in place of V^TY, but in practice, the test is almost always performed with the V given in Equation 5.) The difference between the two tests is in the estimation of σ². The corrected Pearson χ² test uses the estimator ${\hat{σ}}_{1}^{2}$ described in the previous subsection, with $\hat{p}$ taken to be either $\bar{Y}$ or the BLUE given in Equation 2, whereas the corrected Armitage test uses the estimator $(1 - n^{- 1}) {\hat{σ}}_{2}^{2}$ , where ${\hat{σ}}_{2}^{2}$ is given in Equation 3. In the special case, Φ = I, the corrected Pearson χ² and Armitage test statistics equal the standard Pearson χ² and Armitage test statistics, respectively. When we calculate the corrected Armitage test statistic, we actually use estimator ${\hat{σ}}_{2}^{2}$ instead of $(1 - n^{- 1}) {\hat{σ}}_{2}^{2}$ . For the large values of n typically encountered in human genetic studies, the difference between ${\hat{σ}}_{2}^{2}$ and $(1 - n^{- 1}) {\hat{σ}}_{2}^{2}$ is negligible. In the context of complex trait mapping in samples of related individuals with known pedigrees, the corrected χ² test has been demonstrated^15,16 to have correct type 1 error, generally higher power than the W_QLS test, and generally somewhat lower power than the M_QLS test. However, with additional unknown population structure, we would expect both the corrected Pearson χ² and corrected Armitage tests to have inflated type 1 error.

W_QLS Test

The W_QLS test¹⁵ was proposed in the context of related individuals without additional population structure, in which case Ψ = Φ given in Equation 4. The W_QLS test statistic is formed from Equation 1 by taking

V = Φ^{- 1} 1_{c} - 1_{c}^{T} Φ^{- 1} 1 {(1^{T} Φ^{- 1} 1)}^{- 1} Φ^{- 1} 1 .

(Equation 6).

This choice of V can be motivated by generalized least-squares regression, because V^TY is proportional to the estimated regression coefficient for 1_c in the generalized least-squares regression of Y on 1_c with intercept. The W_QLS test uses the estimator ${\hat{σ}}_{1}^{2}$ of σ², where $\hat{p}$ is taken to be the BLUE given by Equation 2. An alternative formulation could be obtained by using the estimator ${\hat{σ}}_{2}^{2}$ of Equation 3.

In the context of trait mapping in samples of related individuals with known pedigrees, the W_QLS test generally has lower power¹⁶ than the corrected χ² and M_QLS tests. Nonetheless, we include it in the present work because the ROADTRIPS extension of the W_QLS (described in subsection Association Tests when Structure Is Partially or Completely Unknown) is equivalent to the method, recently proposed by Rakovski and Stram,¹³ for association testing in the presence of hidden population structure and hidden relatedness. Thus, we include the ROADTRIPS extension of the W_QLS in our simulation studies to compare its power and type 1 error to those of our proposed methods.

M_QLS Test

In contrast to the preceding tests, the M_QLS test¹⁶ allows three possible values for an individual's phenotype: “affected,” “unaffected,” and “unknown,” where the label “unknown” is used to represent unphenotyped individuals, e.g., general population controls, or individuals who are deemed too young to have developed an age-related trait such as Alzheimer's, whereas the label “unaffected” is reserved for true unaffecteds. As they have different expected frequencies of predisposing alleles, the two types of controls are treated differently in the analysis. Furthermore, whereas the preceding tests use the phenotype information only for individuals who have genotype data at the marker being tested, the M_QLS also uses the phenotype information for individuals with missing genotype data at the marker being tested, provided that those individuals have a sampled relative who is genotyped at the marker.

As a result of these considerations, instead of using the phenotype vector 1_c that is used in the preceding four tests, the M_QLS uses the vector A = (A_N^T, A_M^T)^T, which contains more information than 1_c. Here, A_N is the phenotype vector for the n individuals with nonmissing genotype data at the marker being tested, and A_M is the phenotype vector for the m individuals with missing genotype data at the marker being tested, where individual i's phenotype is coded as A_i = 1 if i is affected, −k/(1 – k) if i is unaffected, and 0 if i is of unknown phenotype, where 0 < k < 1 is a constant that represents an external estimate of the population prevalence of the trait for a suitable reference population. (The prevalence estimate is permitted to be very rough; the M_QLS test is valid for arbitrary fixed k.)

The M_QLS test was proposed in the context of related individuals without additional population structure, in which case Ψ = Φ, the n × n matrix given in Equation 4. In order to incorporate the information of A_M into the M_QLS test, one also needs the n × m matrix, Φ_N,M, whose (i, j)th entry is 2ϕ_ij, where ϕ_ij is the kinship coefficient between the ith nonmissing and jth missing individuals. The M_QLS test statistic can be obtained from Equation 1 by choosing

V = A_N + Φ⁻¹Φ_N,MA_M − (A_N+Φ⁻¹Φ_N,MA_M)^T1(1^TΦ⁻¹1)⁻¹Φ⁻¹1

(Equation 7)

and using the estimator ${\hat{σ}}_{1}^{2}$ of σ², where $\hat{p}$ is taken to be the BLUE given by Equation 2. An alternative formulation could be obtained by using the estimator ${\hat{σ}}_{2}^{2}$ of Equation 3. Two different justifications for the choice of V in Equation 7 have previously been described; one¹⁶ is based on maximizing the noncentrality parameter among all tests of the type in Equation 1 when a two-allele model in outbreds (or an additive model in inbreds) with effect size tending to 0 is used, and the other²⁷ is based on a relationship with the score test for the retrospective likelihood based on logistic regression with an additive model. In the context of complex trait mapping in samples of related individuals with known pedigrees, the M_QLS test has been demonstrated¹⁶ to have generally higher power than both the corrected χ² and W_QLS tests. However, with additional unknown population structure, we would expect the M_QLS test to have inflated type 1 error.

Outline of ROADTRIPS Approach for Unknown Structure

The idea behind ROADTRIPS is to extend tests of the form given in Equation 1 to the situation when there could be unknown population structure and/or cryptic relatedness in the sample. To do this, we use genome-screen data to form an appropriate estimator $\hat{Ψ}$ of Ψ and consider various tests of the form

\frac{{(V^{T} Y)}^{2}}{({\hat{σ}}^{2} V^{T} \hat{Ψ} V)},

(Equation 8),

where we can allow V to take into account any known pedigree information, in addition to phenotype information, while simultaneously accounting for pedigree errors and additional unknown structure through $\hat{Ψ}$ . This approach allows us to easily adapt to different patterns of missing genotypes at different tested markers, by including only the rows and columns of $\hat{Ψ}$ (and the entries of V and Y) that correspond to the individuals genotyped at the particular marker being tested. In what follows, we first describe the population genetic modeling assumptions that underlie our estimation and testing procedures, then we describe the estimators $\hat{Ψ}$ and ${\hat{σ}}^{2}$ .

Population Genetic Modeling Assumptions

The modeling assumptions we make are weak and are satisfied by commonly used models of population structure and commonly used models for related individuals. We consider S SNPs in a genome screen, and we generalize the notation of the previous subsections to a set of S SNPs by letting Y^s be the genotype vector corresponding to SNP s, namely, Y^s = (Y₁^s, …, Y_n^s)^T, s = 1, …, S, where Y^s_i = ½ × (the number of alleles of type 1 at SNP s in individual i). Our modeling assumption on the null mean, generalized from the preceding subsections, can be stated as

E₀(Y^s) = p_s1, for 1 ≤ s ≤ S.

(Equation 9).

We make the following assumption regarding the null covariance matrix:

{Var}_{0} (Y^{s}) \equiv Σ_{s} = σ_{s}^{2} Ψ, for 1 \leq s \leq S,

(Equation 10),

where Ψ is an arbitrary, positive semidefinite matrix, and σ_s² > 0 for all 1 ≤ s ≤ S. Here, the key point is that the correlation structure, captured by Ψ, is assumed to be the same across SNPs, whereas the scalar multiplier σ_s² is allowed to vary across SNPs. (Of course, this presumes that the same individuals are genotyped at all SNPs. When some individuals have missing genotypes at SNP s, the entries of Y^s and the rows and columns of Ψ that correspond to individuals with missing genotypes would be deleted.)

Note that Ψ and σ_s² are defined only up to a constant multiple, in the sense that cΨ and c⁻¹σ_s² would give the same value of Σ_s. By convention, σ_s² is usually chosen to be the variance of an outbred individual in the absence of population structure. We now give two examples of population genetic models that satisfy our assumptions.

Example 1: Related Individuals without Additional Population Structure

An example of a simple model that satisfies the assumptions of the previous subsection is the model for related individuals in an unstructured population. In this model, individuals in the sample can be related by pedigrees, where the pedigree founders are assumed to be independently drawn from a population that is in Hardy-Weinberg equilibrium (HWE). Mendelian inheritance is assumed in the pedigrees. In this case, it has previously been shown¹⁵ that Equations 9 and 10 hold, where p_s is interpreted as the allele frequency of SNP s in the population from which the founders are drawn, $σ_{s}^{2} = 0.5 p_{s} (1 - p_{s})$ , and Ψ = Φ, where Φ is the kinship matrix given in Equation 4.

If the pedigrees are fully known, then the structure matrix Ψ is known, but if some genealogical information is missing, then Ψ might be partially or completely unknown.

Example 2: Balding-Nichols Model with Admixture

In the Balding-Nichols model^6,28,29 with admixture, we let p_s denote the “ancestral” allele frequency at SNP s and let q_k^s denote the allele frequency of SNP s in subpopulation k, 1 ≤ k ≤ K. We assume that the q_k^s are random variables that are independent across both k and s, with q_k^s ~Beta(p_s(1 − f_k)/f_k, (1 − p_s)(1 − f_k)/f_k), where f_k ≥ 0 can be viewed as Wright's standardized measure of variation³⁰ for subpopulation k. For SNP s, let q^s = (q₁^s, …, q_K^s)^T denote the vector of subpopulation-specific allele frequencies. Individual i is assumed to have admixture vector a_i = (a_i₁, …, a_iK)^T, where a_ik ≥ 0 for all i and k, and $\sum_{k = 1}^{K} a_{i k} = 1$ for all i. Conditional on the random variable q^s, the two alleles of individual i at SNP s are assumed to be independent, identically distributed (i.i.d.) Bernoulli(a_i^Tq^s) random variables. In the Balding-Nichols model with admixture, Equations 9 and 10 hold, where p_s is interpreted as the “ancestral” allele frequency, $σ_{s}^{2} = 0.5 p_{s} (1 - p_{s})$ , and the entries of Ψ are given by $Ψ_{i i} = 1 + \sum_{k = 1}^{K} a_{i k}^{2} f_{k}$ and $Ψ_{i j} = 2 \sum_{k = 1}^{K} a_{i k} a_{j k} f_{k}$ if i ≠ j. In this context, the population structure captured by Ψ would typically be unknown.

Estimation of the Matrix Ψ

The matrix Ψ is a function of the genealogy of the sampled individuals, where genealogy is broadly interpreted as including both population structure and the pedigree relationships of close relatives. The matrix Ψ will be unknown when there is hidden population structure and/or cryptic relatedness in the sample. We allow a completely general form for Ψ, assuming only that it is positive semidefinite (psd). When genome-screen data are available on the sampled individuals, this information can be used to estimate Ψ. For any pair of individuals i and j, let 𝒮_ij be the set of markers for which both i and j have nonmissing genotype data. Then if the allele frequencies p_s were known, and if σ_s² = 0.5p_s(1 − p_s) as in Examples 1 and 2, an unbiased estimator of Ψ_ij would be

{\tilde{Ψ}}_{i j} = \frac{1}{| S_{i j} |} \sum_{s \in S_{i j}} \frac{(Y_{i}^{s} - p_{s}) (Y_{j}^{s} - p_{s})}{.5 p_{s} (1 - p_{s})},

(Equation 11),

where |𝒮_ij| is the number of elements of 𝒮_ij. If one assumed, for example, that genotypes at different SNPs were independent with $| S_{i j} | \to ∞$ and p_s known and that σ_s² = 0.5p_s(1 − p_s) held at all but a finite number of SNPs, then Equation 11 would provide a consistent estimator of Ψ_ij. However, p_s will generally not be known, so we propose to further restrict 𝒮_ij to those markers that are polymorphic in the sample; let ${\hat{p}}_{s} = {\bar{Y}}^{s}$ , the observed proportion of type 1 alleles in the sample at marker s; and use estimator

{\hat{Ψ}}_{i j} = \frac{1}{| S_{i j} |} \sum_{s \in S_{i j}} \frac{(Y_{i}^{s} - {\hat{p}}_{s}) (Y_{j}^{s} - {\hat{p}}_{s})}{.5 {\hat{p}}_{s} (1 - {\hat{p}}_{s})} .

(Equation 12).

The estimator $\hat{Ψ}$ of Equation 12 is essentially the same as the estimated covariance matrix used in EIGENSTRAT.⁶ An alternative estimator could be obtained by using the sample variance of Y^s in the denominator instead of $0.5 {\hat{p}}_{s} (1 - {\hat{p}}_{s})$ .

If every sampled individual were genotyped at the same markers, with no missing genotypes, then $\hat{Ψ}$ would be psd and singular, with $\hat{Ψ} 1 = 0$ , i.e., 1 would be in the null space of $\hat{Ψ}$ . With missing genotypes, it is possible for $\hat{Ψ}$ to be nonsingular and non-psd. The fact that $\hat{Ψ}$ might be non-psd is not, in itself, particularly problematic from a practical point of view, provided $V^{T} \hat{Ψ} V > 0$ for the chosen V in Equation 8 and assuming this provides a sufficiently accurate estimator of Var₀(V^TY^s)/σ²_s. The fact that $\hat{Ψ}$ might be singular (e.g., in the case of no missing data) or close to singular, with $\hat{Ψ}$ orthogonal or approximately orthogonal to the vector 1, means that in those cases, one would not be able to directly plug $\hat{Ψ}$ into formulae such as Equations 2, 3, 6, and 7. This is discussed further in the next subsection. With substantially different amounts of missing data at different markers, as in the RA and COGA data sets analyzed in the Results section, the matrix $\hat{Ψ}$ might be nonsingular and so could be directly used in Equations 2, 3, 6, and 7.

Estimation of σ² when Structure Is Unknown

In this subsection, we drop the subscript s and use notations Y, p, and σ² for the SNP being tested; e.g., we assume E₀(Y) = p1 and Var₀(Y) = Σ = σ²Ψ. As we did for the case of known structure, we consider two general approaches to estimation of σ² when structure is unknown. The first approach is to take estimators of the form ${\hat{σ}}_{1}^{2} = 0.5 \hat{p} (1 - \hat{p})$ , where $\hat{p}$ is a suitable estimator of p. When $\hat{Ψ}$ is orthogonal or approximately orthogonal to the vector 1, we cannot plug it into Equation 2 to obtain the BLUE of p. (Note that use of the Moore-Penrose generalized inverse ${\hat{Ψ}}^{-}$ in place of ${\hat{Ψ}}^{- 1}$ also does not work, because ${\hat{Ψ}}^{-}$ is also orthogonal or approximately orthogonal to 1, so plugging into Equation 2 would result in both numerator and denominator being exactly or approximately zero.) Instead, we use the more stable estimator $\hat{p} = \bar{Y}$ , the sample allele frequency. Thus, our first estimator becomes

{\hat{σ}}_{1}^{2} = 0.5 \bar{Y} (1 - \bar{Y}) .

(Equation 13).

As we did in the case of known structure, we also consider an estimator of σ² that does not assume σ² = 0.5p(1 − p) at the SNP being tested. When $\hat{Ψ}$ is orthogonal or approximately orthogonal to the vector 1, we replace Equation 3 by

{\hat{σ}}_{2}^{2} = {(n - 1)}^{- 1} Y^{T} {\hat{Ψ}}^{-} Y,

(Equation 14),

where ${\hat{Ψ}}^{-}$ is the Moore-Penrose generalized inverse of $\hat{Ψ}$ .

Association Tests when Structure Is Partially or Completely Unknown

We apply Equation 8 to extend association tests developed for the situation of known structure (as described in subsection Examples of Association Tests with Known Structure) to association tests that are appropriate for situations of partially or completely unknown structure. We call the tests based on Equation 8 the ROADTRIPS versions of the corresponding tests given by Equation 1, and we now give several examples. Table 1 gives the weight vector for each statistic defined below.

Table 1

Weight Vectors V for the ROADTRIPS Statistics

Statistic	V
Rχ	$1_{c} - \frac{n_{c}}{n} 1$
RM	A_N + Φ⁻¹Φ_N,MA_M − (A_N + Φ⁻¹Φ_N,MA_M)^T1(1^TΦ⁻¹1)⁻¹Φ⁻¹1
RM_NI	$A_{N} - {\bar{A}}_{N} 1$
RW_NI	${\hat{Ψ}}^{-} 1_{c}$

${\hat{Ψ}}^{-}$ is the generalized inverse of $\hat{Ψ}$ and ${\bar{A}}_{N}$ is the average of the elements of A_N.

Rχ₁ and Rχ₂, the ROADTRIPS Versions of the Corrected χ² and Corrected Armitage Tests

To extend the corrected χ² and corrected Armitage tests to the situation of unknown structure, we apply Equation 8 with V given in Equation 5 and $\hat{Ψ}$ given in Equation 12. We define Rχ₁ to be the ROADTRIPS version of the corrected χ² test, where this is obtained by using ${\hat{σ}}_{1}^{2}$ given in Equation 13, and we define Rχ₂ to be the ROADTRIPS version of the corrected Armitage test, where this is obtained by using ${\hat{σ}}_{2}^{2}$ given in Equation 14. When all SNPs have the same pattern of missing genotypes, we expect Rχ₂ to perform similarly to GC, because in this case, both Rχ₂ and GC are equivalent to correcting all the Armitage χ² statistics across the genome by a common factor, though this factor differs between the two methods. However, when different SNPs have different rates of missing genotypes, we expect the Rχ₂ statistic to do better than GC, in terms of both type 1 error and power, because the Rχ₂ statistic allows different SNPs to have different correction factors appropriate to the level of genotype information available, whereas GC applies the same correction factor to all SNPs.

RM, the ROADTRIPS Version of M_QLS when Structure Is Partially Known

As the M_QLS is generally the most powerful of the statistics for complex trait mapping when structure is known,¹⁶ we expect the ROADTRIPS version of M_QLS to be powerful when structure is unknown. We consider separately the cases of partially known structure and completely unknown structure. An example of partially known structure occurs when reliable pedigree information on sampled individuals is available, but one wants to allow for the possibility of additional cryptic relatedness or unknown population structure not captured by the pedigree information. In the context of partially known structure, we compute the matrix Φ of Equation 4 as a function of the known pedigree information. At the same time, we also calculate the estimator $\hat{Ψ}$ in Equation 12 as before, with the expectation that it will capture the full structure in the data, including structure not explained by Φ. Then we obtain RM, the ROADTRIPS version of the M_QLS test when structure is partially known, by applying Equation 8 with V given in Equation 7. The idea is that we create a powerful test by using the known pedigree structure given in Φ to obtain weights V that will be optimal¹⁶ when Ψ = Φ. Then we preserve the validity of the test in the presence of additional structure, not captured by Φ, through use of the estimator $\hat{Ψ}$ in the denominator of Equation 8. As we did for the Rχ test, we could add subscripts 1 and 2 to the name of the test to distinguish the use of estimators ${\hat{σ}}_{1}^{2}$ and ${\hat{σ}}_{2}^{2}$ , given in Equations 13 and 14, respectively.

RM_NI, the ROADTRIPS Version of M_QLS when Structure Is Completely Unknown

For the case of completely unknown structure, we define the RM_NI test, which is a ROADTRIPS version of the M_QLS, where “NI” stands for “no information.” We form RM_NI from Equation 8, where we take $V = A_{N} - {\bar{A}}_{N} 1$ , where ${\bar{A}}_{N}$ is the sample average of the elements of A_N. This choice of V is the natural analog to Equation 7 when $\hat{Ψ}$ is used in place of Φ, for the case when $\hat{Ψ}$ is orthogonal to the 1 vector and where we ignore the contribution of A_M. The reason we ignore the contribution of A_M is that the expected gain by including this term for individuals not known to be related is not high enough to justify the computational cost involved in obtaining the inverse or generalized inverse of $\hat{Ψ}$ . (Note that $\hat{Ψ}$ tends to be much more costly to invert than Φ, because in typical applications, Φ is block-diagonal with small blocks.) We could add subscripts 1 and 2 to the name of the test to distinguish the use of estimators ${\hat{σ}}_{1}^{2}$ and ${\hat{σ}}_{2}^{2}$ given in Equations 13 and 14, respectively. Note that when the amount of missing data varies across SNPs, the matrix $\hat{Ψ}$ might be nonsingular, and there is the possibility of using Equation 7 with $\hat{Ψ}$ plugged in for Φ. For instance, this occurs in both the data sets we analyze in Results. In this case, one might still choose to ignore the information provided by A_M for the computational reasons mentioned.

RW_NI, the ROADTRIPS Version of W_QLS when Structure Is Completely Unknown

We form RW_NI from Equation 8, where we take $V = {\hat{Ψ}}^{-} 1_{c}$ , which is the natural analog to Equation 6 for the case when $\hat{Ψ}$ is orthogonal to the 1 vector. If we use estimator ${\hat{σ}}_{2}^{2} (n - 1) / (n - 2)$ of σ², then we obtain the test recently proposed by Rakovski and Stram.¹³ (In our simulation study, we actually use estimator ${\hat{σ}}_{2}^{2}$ instead of ${\hat{σ}}_{2}^{2} (n - 1) / (n - 2)$ , but the difference is completely negligible for the size of n we consider.)

GAW 15 Rheumatoid Arthritis Data

We apply ROADTRIPS to perform association analysis of RA data provided for GAW 15 by a UK group led by Jane Worthington and Sally John (these data are described in detail elsewhere).²² Data are available on 157 nuclear families, where 156 of these have at least two affected individuals. Individuals were diagnosed as affected according to the American College of Rheumatology (ACR) criteria. There are 550 individuals with available genotype data. After exclusion of 2 duplicate individuals and 4 outlier individuals who have estimated inbreeding coefficients more than 3 standard deviations (SDs) above the average (where the estimated inbreeding coefficient of individual i is taken to be ${\hat{Ψ}}_{i i} - 1$ ), there are 339 affected individuals, 198 unaffected controls, and 7 controls of unknown phenotype in the analysis. The data set includes 10,156 autosomal SNPs that passed quality control filters. We exclude 285 SNPs that are not polymorphic (minor allele frequency less than 0.01). The remaining 9871 SNPs are tested for association.

GAW 14 COGA Data

We apply ROADTRIPS to identify SNPs associated with alcohol dependence in data provided by the COGA for GAW 14.²³ These data were previously analyzed with association methods that assume known structure.¹⁶ There are a total of 1614 individuals from 143 pedigrees, with each pedigree containing at least three affected individuals. We include in our analysis only those individuals who are coded as “white, non-Hispanic.” We designate as cases those individuals who are affected with ALDX1 or who have symptoms of ALDX1, where ALDX1 is defined to be DSM-III-R alcohol dependence with the Feighner Alc Definite phenotype. By these criteria, there are 830 cases with available SNP data. We designate as “unaffected controls” those individuals who are labeled as “pure unaffected,” and we designate as “controls of unknown phenotype” those individuals who are labeled as “never drank alcohol.” Among individuals with available SNP data, these criteria result in 187 unaffected controls and 13 unknown controls. The data set includes 10,810 autosomal SNPs. We exclude 403 SNPs that are not polymorphic and analyze the remaining 10,407 SNPs.

Results

Simulation Studies

We perform simulation studies, in which population structure and related individuals are simultaneously present in the case-control sample, in order to compare the performance of ROADTRIPS to that of previously proposed association methods that correct in some way for either population structure or related individuals or both. The methods to which we compare ROADTRIPS are GC, FBAT, the method of Rakovski and Stram,¹³ EIGENSTRAT, M_QLS, and the corrected Armitage χ² test. We also include the standard (uncorrected) Armitage test in the type 1 error study. We simulate four different settings of population structure, including admixture, and two different settings of relationship configuration, where the latter refers to pedigree relationships among sampled individuals.

Relationship Configurations

Both relationship configurations 1 and 2 include 100 unrelated affected individuals, 400 unrelated unaffected individuals, and individuals sampled from 120 outbred, three-generation pedigrees, where each pedigree has a total of 16 individuals, the phenotypes of all individuals in the pedigrees are observed, and genotypes are observed for only a subset of individuals in each pedigree. We sample six types of these pedigrees; the types are described in Table 2. Relationship configuration 1 has 40 pedigrees of type 1, 40 of type 2, and 40 of type 3, as well as 100 unrelated affected and 400 unrelated unaffected individuals. Relationship configuration 2 contains all six types of pedigrees in Table 2, with 20 of each type, and it also contains 100 unrelated affected and 400 unrelated unaffected individuals.

Table 2

Pedigree Configuration Types Used in Simulations

Type	*N_af*	*N_un*	Genotyped Individuals
1	4	12	Unaffected sib pair and their unaffected first cousin
2	5	11	1 affected parent, 2 affected offspring
3	6	10	1 aff. parent, 2 aff. offspr., unaff. sib pair who are 1st cousins to the latter
4	4	12	1 affected parent with 2 affected and 1 unaffected offspring
5	5	11	1 affected and 2 unaffected sibs, unaffected aunt and her affected spouse
6	6	10	1 aff. and 1 unaff. parent with 2 unaff. offspr., 2 other affecteds

N_af and N_un are the total numbers of affected and unaffected individuals in the pedigree, respectively, among whom only the indicated individuals are genotyped.

Population Structure Settings

Each simulation setting specifies a particular relationship configuration combined with a particular setting of population structure. Each setting of population structure is a special case of the Balding-Nichols model with admixture described in Example 2, in which we take f_k = 0.01 for every subpopulation. Population structure 1 has individuals sampled from two subpopulations, with 60% of the pedigrees and affected unrelated individuals sampled from subpopulation 1 and the remaining 40% of the pedigrees and affected unrelated individuals sampled from subpopulation 2. Among the unrelated unaffecteds, 40% are sampled from subpopulation 1 and 60% from subpopulation 2. Population structure 2 is similar to population structure 1, except that the proportions 60% and 40% are replaced by 80% and 20%, respectively. Population structure 3 is similar to population structure 1, except that there are three subpopulations, with all of the unrelated unaffecteds sampled from subpopulation 3. Population structure 4 has individuals sampled from an admixed population, formed from two subpopulations. Individuals in the admixed population are assumed to have i.i.d. admixture vectors of the form (a, 1 − a), where a is a Uniform(0,1) random variable.

Within a given setting of population structure, to simulate the unrelated individuals needed in relationship configurations 1 and 2, we first simulate genotypes according to the chosen setting of population structure, simulate phenotypes conditional on genotypes, and then randomly ascertain 100 affected and 400 unaffected individuals. To sample particular pedigree types within a given setting of population structure, we first simulate genotypes for pedigree founders according to the chosen setting of population structure, drop alleles down the pedigree, simulate phenotypes conditional on genotypes, and then do rejection sampling to obtain 100 replicates of each of the pedigree configuration types. Then, in the simulations, pedigrees of each type are sampled with replacement from the 100 previously obtained replicates of that pedigree type.

Random and Causal SNPs

We use a trait model that has two unlinked causal SNPs (which we call “SNP 1” and “SNP 2”) with epistasis between them.¹⁶ The ancestral frequencies of the type 1 alleles at SNPs 1 and 2 are taken to be 0.1 and 0.5, respectively. Individuals with at least one copy of allele 1 at SNP 1 and at least one copy of allele 1 at SNP 2 have a penetrance of 0.3. All other individuals have a penetrance of 0.05. In the power studies, association is tested with SNP 2. In contrast, “random” SNPs are assumed to be unlinked and unassociated with the trait, and their ancestral allele frequencies are obtained as i.i.d. draws from a uniform (0.1, 0.9) distribution.

Assessment of Type 1 Error

For each of the eight combinations of population structure and relationship configuration, we generate genotype data for 100,000 random SNPs that are neither linked nor associated with the trait, and we test each of them for association at the 0.0001 level, using various test statistics. In Table 3, for each test statistic, we report the empirical type 1 error, which we calculate as the proportion of simulations in which the test statistic exceeds the χ₁² quantile corresponding to nominal type 1 error level 0.0001. The statistics compared are the four ROADTRIPS statistics, Rχ₂, RM₂, RM_NI₂, and RW_NI₂, where the subscript “2” denotes use of the estimator ${\hat{σ}}_{2}^{2}$ of Equation 14; GC; EIGENSTRAT; the M_QLS; the corrected Armitage χ²; and the uncorrected Armitage trend test. Note that because there are only two types of controls, the Rχ₂ and RM_NI₂ tests are identical. The method of Rakovski and Stram¹³ corresponds to the ROADTRIPS statistic RW_NI₂, which is included in the comparison. The correct type 1 error of FBAT has been established previously.³¹ Using an exact binomial calculation, we determine that empirical type 1 error rates falling in the range of 0.00004–0.00016 are not significantly different from the nominal 0.0001 level.

Table 3

Empirical Type 1 Error, at Level 0.0001, in the Presence of Both Related Individuals and Population Structure

	Empirical Type 1 Error of Tests^a
Setting^b	Rχ or RM_NI	RM	*RW_NI*	GC	EIG	*M_QLS*	Corr Arm	Arm
(1,1)	0.00009	0.00004	0.00011	0.00007	0.00118	0.00027	0.00018	0.00202
(1,2)	0.00011	0.00012	0.00011	0.00010	0.00059	0.00043	0.00016	0.00073
(2,1)	0.00008	0.00012	0.00010	0.00004	0.00116	0.01690	0.00485	0.01970
(2,2)	0.00013	0.00015	0.00012	0.00005	0.00054	0.02281	0.00723	0.01375
(3,1)	0.00010	0.00010	0.00010	0.00007	0.00155	0.06752	0.02409	0.06094
(3,2)	0.00015	0.00012	0.00008	0.00008	0.00058	0.08464	0.03189	0.05028
(4,1)	0.00007	0.00011	0.00012	0.00007	0.00086	0.00056	0.00032	0.00277
(4,2)	0.00008	0.00007	0.00012	0.00012	0.00036	0.00028	0.00019	0.00069

Empirical type 1 error rates are calculated based on 100,000 simulated random SNPs. Rates that are significantly different from the nominal 0.0001 level are in bold. Rχ and RM_NI are equivalent statistics for all the settings shown. For the ROADTRIPS statistics, the ${\hat{σ}}_{2}^{2}$ of Equation 14 is used.

^aAbbreviations of test names are genomic control (GC), EIGENSTRAT with outlier removal (EIG), corrected Armitage χ² (Corr Arm), and Armitage trend test (Arm).

^bSetting (i, j) denotes population structure setting i and relationship configuration setting j.

For GC and all of the ROADTRIPS statistics, empirical type 1 error is not significantly different from the nominal level. In contrast, type 1 error is inflated for EIGENSTRAT, M_QLS, the corrected Armitage χ², and the Armitage trend test. This is to be expected, because these tests either (1) correct for related individuals but not for population structure (M_QLS, corrected Armitage χ²), (2) correct for population structure but not for related individuals (EIGENSTRAT), or (3) correct for neither (Armitage trend test). In particular, the top principal components in EIGENSTRAT are not able to capture the complicated covariance structure due to the related individuals in the samples. The results for EIGENSTRAT in Table 3 are obtained with the default setting of ten principal components and with outlier removal. The results without outlier removal and with different numbers of principal components are similar (results not shown). We also performed all the ROADTRIPS tests with variance estimator ${\hat{σ}}_{1}^{2}$ of Equation 13 instead of ${\hat{σ}}_{2}^{2}$ and obtained nearly identical empirical type 1 error rates (results not shown).

Type 1 Error with Missing Genotypes: Rχ₂ and GC

When all SNPs have the same pattern of missing genotypes, we expect Rχ₂ to perform similarly to GC, because in this case, both Rχ₂ and GC are equivalent to correcting all the Armitage χ² statistics across the genome by a common factor, though this factor differs between the two methods. However, when different SNPs have different rates of missing genotypes, we expect the Rχ₂ statistic to have better control of type 1 error than GC, because the Rχ₂ statistic allows different SNPs to have different correction factors appropriate to the level of genotype information available, whereas GC applies the same correction factor to all SNPs.

To assess the magnitude of this effect, we perform a simulation study under four different settings of population structure and relationship configuration, given in columns 1 and 2 of Table 4. For each combination of settings, genotype data are generated for 100,000 random SNPs that are neither linked nor associated with the phenotype. The proportions of individuals with missing genotype data at different SNPs are taken to be i.i.d. random variables drawn from a Beta(3, 12) distribution, which has a mean of 0.2 and a SD of 0.1. Given the proportion of missing genotypes at a marker, the individuals whose genotypes will be set to missing for that marker are chosen uniformly at random from the sample.

Table 4

Empirical Type 1 Error at Level 0.0001 when Genotypes are Missing at Random in the Presence of Both Related Individuals and Population Structure

		Empirical Type 1 Error
Population Structure	Relationship Configuration	Rχ₂	Genomic Control
2	1	0.00008	0.00031
2	2	0.00007	0.00036
3	1	0.00006	0.00057
3	2	0.00013	0.00058

Empirical type 1 error rates are calculated based on 100,000 simulated random SNPs. Rates that are significantly different from the nominal 0.0001 level are in bold.

The empirical type 1 error rates for Rχ₂ and GC are given in Table 4. GC has inflated type 1 error for all of the simulation settings, because of undercorrection of test statistics from SNPs that have relatively low levels of missing genotypes, whereas the empirical type 1 error for Rχ₂ is not significantly different from the nominal 0.0001 level. The results illustrate that ROADTRIPS is not only robust to cryptic population and pedigree structure, but also to varying rates of randomly missing genotype data. This results from the fact that the entire empirical covariance matrix is estimated in ROADTRIPS, so when some individuals have missing genotype data at the SNP being tested, the corresponding rows and columns can be deleted from the empirical covariance matrix, allowing one to obtain a variance estimator that accounts for missing genotype data at the marker being tested.

Power Comparison

We assess power to detect association in the presence of population and pedigree structure only for those tests that maintain correct nominal type 1 error, namely the four ROADTRIPS tests, FBAT, and GC (although, as can be seen in Table 4, the type 1 error of GC might not be correct when the rate of missing genotype data varies across markers). The simulations are performed with relationship configuration 2 under each of the four different settings of population structure. For each setting, 5000 simulated replicates are performed, and SNP 2 of the trait model is tested for association with the trait by each method. Table 5 reports power for each statistic for settings 1, 2, and 4 of population structure. Here, power is calculated as the proportion of simulations for which the statistic exceeds the χ₁² quantile corresponding to nominal type 1 error level 0.0001. Power for population structure 3 is close to 0 for all of the statistics (data not shown). The ROADTRIPS statistics are all calculated with estimator ${\hat{σ}}_{2}^{2}$ of Equation 14. The FBAT and RM methods are given the information of the correct pedigree structure, but not the population structure. The FBAT statistic is calculated with offset value set equal to the prevalence, and the RM statistic is calculated with k set equal to the prevalence. As expected, the RM test is the most powerful in all settings, because it uses the known pedigree information to incorporate phenotype information about relatives with missing genotype data, and it corrects for additional unknown population structure by means of the empirical covariance matrix. In contrast, the FBAT test, which was given all the same information as the RM test, performs very poorly in these simulations, because it is not able to incorporate the data on the 500 unrelated individuals, and it is also not able to incorporate the data from pedigree types 1, 2, and 3, because they do not meet the FBAT criteria for “informative families.” If we assume that no pedigree information is available, then the Rχ, RM_NI, and GC tests all give identical or almost identical power, assuming that all markers have comparable amounts of missing genotype data. When different SNPs have different amounts of missing genotype data, GC might not adequately control type 1 error, so Rχ and RM_NI would be preferred. The RW_NI test, which corresponds to the test of Rakovski and Stram,¹³ has lower power than Rχ and RM_NI, which is not surprising in light of the fact that the W_QLS test, of which the RW_NI is an extension, was shown¹⁶ to have generally lower power than the M_QLS and corrected χ² tests, of which the RM_NI and Rχ, respectively, are extensions.

Table 5

Power to Detect Association in the Presence of Both Related Individuals and Population Structure

Population Structure	Power (Standard Error)
Population Structure	Rχ or RM_NI	RM	*RW_NI*	GC	FBAT
1	0.79 (0.006)	0.94 (0.003)	0.59 (0.007)	0.78 (0.006)	0.0012 (0.0005)
2	0.42 (0.007)	0.48 (0.007)	0.36 (0.007)	0.43 (0.007)	0.0016 (0.0006)
4	0.70 (0.007)	0.80 (0.006)	0.48 (0.007)	0.70 (0.007)	0.0002 (0.0002)

Power is assessed at significance level 0.0001 on the basis of 5000 simulated replicates. The highest power for each simulation setting is in bold. Relationship configuration 2 is used in each case. Rχ and RM_NI are equivalent statistics for all the settings shown. For the ROADTRIPS statistics, the ${\hat{σ}}_{2}^{2}$ of Equation 14 is used.

We also assess power to detect association when there is pedigree structure but no population structure. The simulation is carried out as in the preceding paragraph except that relationship configuration 2 with no population structure is simulated. In addition to the tests compared in Table 5, we also calculate power for the M_QLS and corrected Armitage χ² tests, both of which have correct type 1 error in this setting, provided that the pedigree structure is known. Estimated power for this setting is given in Table 6. The M_QLS and RM tests are the most powerful among the tests considered. The RM test is able to match the power of M_QLS even though RM uses the estimator $\hat{Ψ}$ in the variance calculation, whereas M_QLS uses the true Ψ. As in the previous power comparison, FBAT has almost no power in this setting, and the RW_NI test has lower power than all of the other statistics except FBAT. There is no significant difference in power among Rχ, RM_NI, GC, and the corrected Armitage χ² tests. The M_QLS and the corrected Armitage χ² tests have power identical to their corresponding ROADTRIPS extensions, RM and Rχ, which illustrates that power is not compromised by use of the empirical matrix $\hat{Ψ}$ in the variance correction for the ROADTRIPS statistics.

Table 6

Power to Detect Association with Related Individuals in the Absence of Population Structure

	Power (Standard Error)
*Rχ or RM_NI*	RM	*RW_NI*	GC	FBAT	*M_QLS*	Corr Arm
0.90 (0.004)	0.98 (0.002)	0.81 (0.006)	0.91 (0.004)	0.0002 (0.0002)	0.98 (0.002)	0.90 (0.004)

Power is assessed at significance level 0.0001 on the basis of 5000 simulated replicates of relationship configuration 2 with no population structure. The highest power is in bold. Rχ and RM_NI are equivalent statistics in this simulation setting. For the ROADTRIPS statistics, the ${\hat{σ}}_{2}^{2}$ of Equation 14 is used. Abbreviations of test names are genomic control (GC) and corrected Armitage χ² (Corr Arm).

GAW 15 Rheumatoid Arthritis Data

We apply RM, Rχ, GC, and FBAT to the GAW 15 data to test for association of SNPs with RA. Using a previously reported²² prevalence of 0.8% for RA in people of European descent, we set both the offset value in FBAT and the prevalence value in RM to 0.008. The entries of the empirical covariance matrix $\hat{Ψ}$ and the correction factor for GC are calculated using SNP data across the autosomal chromosomes for the study individuals. Table 7 gives the results of all tests for those SNPs for which at least one of the tests has a nominal p value <5.0 × 10⁻⁵. For 8 of these 10 SNPs, the RM test has the smallest p value among the four tests used. After Bonferroni correction to adjust for four different tests of association at each of 9,871 SNPs, the RM test is significant at the 5% level for 2 SNPs: snp264363 on chromosome 11 (p = 5.2 × 10⁻⁷ uncorrected, 0.021 corrected) and snp152076 on chromosome 9 (p = 5.4 × 10⁻⁷ uncorrected, 0.021 corrected). The Rχ test is significant at the 5% level for an additional SNP, snp547632 on chromosome 11 (p = 1.2 × 10⁻⁶ uncorrected, 0.047 corrected).

Table 7

Rheumatoid Arthritis Data Results: SNPs with p Value <0.00005 for at Least One of the Four Tests

Chr	Marker	Pos. (cM)	*N_CA*	*N_CO*	p	p Value
Chr	Marker	Pos. (cM)	*N_CA*	*N_CO*	p	Rχ	RM	GC	FBAT
11	snp264363	105.57721	208	133	0.86	4.7e-3	5.2e-7^a	1.7e-2	1.2e-2
9	snp152076	77.047858	267	162	0.92	3.0e-2	5.4e-7^a	1.1e-1	9.9e-3
11	snp547632	63.954804	230	135	0.80	1.2e-6^a	6.2e-2	1.8e-4	3.0e-1
3	snp151721	114.80384	279	174	0.84	2.8e-4	2.2e-6	6.8e-4	3.0e-3
18	snp511091	107.60685	219	130	0.57	4.2e-2	1.1e-5	6.6e-2	6.6e-2
14	snp51741	68.2942	298	188	0.95	1.4e-3	1.2e-5	1.3e-3	1.5e-2
15	snp66639	51.864772	296	187	0.97	1.5e-5	1.0e-2	1.4e-4	NA
3	snp71651	40.817259	188	118	0.74	3.5e-2	1.8e-5	6.3e-2	3.6e-2
4	snp570108	66.453602	224	142	0.71	5.1e-2	3.9e-5	1.2e-1	2.4e-1
8	snp261673	67.427701	192	131	0.61	2.9e-1	4.6e-5	4.0e-1	3.7e-1

The chromosome (Chr), the name of the marker (Marker), the position of the marker on the chromosome (Pos.), the number of genotypes available in cases (N_CA) and controls (N_CO), and the major allele frequency in the case-control sample as a whole (p) are displayed. An insufficient number of informative families for the FBAT analysis are indicated by NA.

^aGenome-wide significance after Bonferroni correction.

A histogram of the estimated self-kinship values (where the estimated self-kinship value of individual i is taken to be $.5 {\hat{Ψ}}_{i, i}$ ) of the individuals included in the analysis can be found in Figure 1. The histogram shows that the values are not centered around 0.5, which is the self-kinship value in the absence of population structure and inbreeding. The self-kinship mean is 0.512, and the majority of the kinship values (77%) are greater than 0.5. There are a few pairs of individuals that should have a kinship coefficient value equal to 0.25 based on the available pedigree information (i.e., parent-offspring pairs and sibling pairs), but have estimated kinship coefficient values close to 0 and thus appear to be unrelated. There are also a few pairs that are not members of the same pedigree but have kinship coefficient estimates that indicate that they are related. Both these phenomena could be caused by sample switches. An attractive feature of the ROADTRIPS methods is that they automatically correct for misspecified relationships in a sample, in addition to hidden population structure, while simultaneously allowing for different SNPs to have different rates of missing genotypes.

An external file that holds a picture, illustration, etc.
Object name is gr1.jpg

Figure 1

Histogram of Estimated Self-Kinship Coefficient Values for the GAW 15 UK RA Data

The vertical line at 0.5 represents the self-kinship value in the absence of population structure and inbreeding.

GAW 14 COGA Data

We apply RM, Rχ, and GC to test for association with an alcoholism-related phenotype in the GAW 14 COGA data. A previous analysis¹⁶ of these data used the M_QLS (with k set to 0.05) and corrected χ² tests; these tests correct for known pedigree information, but do not make any correction for unknown structure. In our reanalysis, we compare the results of the tests with and without the correction for unknown structure. To make the results comparable, we set k = 0.05 in the RM test. Table 8 lists SNPs for which at least one of the RM, Rχ, and GC tests has a nominal p value <1.0 × 10⁻⁵. For 11 of these 12 SNPs, the RM test has the smallest p value among the three tests used. After Bonferroni correction to adjust for three different tests of association at each of 10,407 SNPs, the RM test is significant at the 5% level for 4 SNPs: tsc1750530 on chromosome 16 (p = 2.3 × 10⁻⁸ uncorrected, 0.0007 corrected), tsc0046696 on chromosome 18 (p = 1.4 × 10⁻⁷ uncorrected, 0.004 corrected), tsc1177811 on chromosome 1 (p = 1.5 × 10⁻⁷ uncorrected, 0.005 corrected), and tsc1637642 on chromosome 5 (p = 6.8 × 10⁻⁷ uncorrected, 0.02 corrected). The Rχ test is significant at the 5% level for an additional SNP, tsc0571038 on chromosome 11 (p = 8.0 × 10⁻⁷ uncorrected, 0.025 corrected). Of the 5 SNPs that were previously¹⁶ identified as genome-wide significant by M_QLS or $W_{χ_{c o r r}^{2}}$ , 4 were also identified by ROADTRIPS. The exception is tsc0057290 on chromosome 18, which was identified by M_QLS and is no longer significant after correcting for cryptic structure. ROADTRIPS identifies an additional SNP, tsc1637642 on chromosome 5, which was not identified in the previous analysis. GC did not identify any significant SNPs. There are some SNPs in Table 8 that are in or near genes of interest; the details have previously been reported.¹⁶ A previous analysis³² of these data with FBAT, in which a slightly larger sample of individuals was used, detected one SNP with nominal p value 6 × 10⁻⁵, which is not significant after Bonferroni correction for the number of SNPs tested.

Table 8

COGA Data Results: SNPs with p Value <0.00001 for at Least One of the Three Tests

Chr	Marker	Pos. (cM)	*N_CA*	*N_CO*	p	p Value
Chr	Marker	Pos. (cM)	*N_CA*	*N_CO*	p	Rχ	RM	GC
16	tsc1750530	59.8297	644	145	.85	3.6e-4	2.3e-8^a	1.6e-3
18	tsc0046696	104.665	459	118	.60	4.0e-1	1.4e-7^a	4.8e-1
1	tsc1177811	105.535	587	149	.68	2.9e-2	1.5e-7^a	3.4e-2
5	tsc1637642	95.4901	419	159	.84	3.6e-1	6.8e-7^a	4.3e-2
11	tsc0571038	95.3968	581	122	.56	8.0e-7^a	4.5e-3	5.4e-5
18	tsc0057290	33.9594	497	126	.71	6.1e-2	1.8e-6	5.3e-2
6	tsc0808295	47.1522	681	162	.76	9.4e-1	1.8e-6	9.4e-1
11	tsc0569292	6.78451	455	127	.74	6.0e-1	3.6e-6	5.7e-1
19	tsc1189131	68.94	478	121	.55	7.4e-1	4.2e-6	7.3e-1
3	tsc0175005	158.199	594	152	.82	1.6e-2	4.8e-6	2.2e-2
3	tsc1519933	167.431	515	134	.64	9.5e-2	5.3e-6	7.6e-2
13	tsc0056748	73.9934	530	133	.84	6.5e-1	6.5e-6	6.7e-1

^aGenome-wide significance after Bonferroni correction.

Figure 2 gives a histogram of the estimated self-kinship coefficient values for the genotyped individuals who were included in the analysis. The center of the histogram is shifted from 0.5 (the self-kinship coefficient in the absence of population structure and inbreeding). Seventy-one percent of the values are greater than 0.5, and the mean self-kinship value is 0.506. Just as there were in the UK RA data, in the COGA data there are a few pairs of individuals who appear to have misspecified relationships or be cryptically related. As previously mentioned, ROADTRIPS adjusts for this in the variance correction.

An external file that holds a picture, illustration, etc.
Object name is gr2.jpg

Figure 2

Histogram of Estimated Self-Kinship Coefficient Values for the GAW 14 COGA Data

The vertical line at 0.5 represents the self-kinship value in the absence of population structure and inbreeding.

Assessment of Computation Time

Using a single processor on a shared machine with eight quad-core AMD Opteron 8384 25 GHz processors with 64 GB RAM, analysis of 10,156 SNPs from the RA data and 10,810 SNPs from the COGA data with four tests (Rχ, RM, GC, and a ROADTRIPS version of W_QLS) took approximately 5 and 21 min, respectively. The large difference in computing time is due to the COGA data having extended pedigrees and a sample size that is almost twice that of the RA data. The slowest step is the Cholesky decomposition³³ of Φ (for the calculation of RM and a ROADTRIPS version of W_QLS), which we compute at every SNP because the pattern of missing genotype data varies. The computing time scales linearly with the number of SNPs. The speed could presumably be improved, as we have not made serious attempts to optimize the code.

Discussion

Technological advances in high-density genome scans have made it feasible to perform case-control association studies on a genome-wide basis with hundreds of thousands or millions of markers. The observations in these studies can have several sources of dependence, including population structure and relatedness among the sampled individuals, some of which might be known and some unknown. Failure to properly account for this structure can lead to spurious association or reduced power. We develop ROADTRIPS, a case-control association testing method that simultaneously corrects for both pedigree and population structure, including admixture, where some or all of the structure can be unknown. The method also automatically adjusts for pedigree errors and sample switches. ROADTRIPS is computationally feasible for the analysis of genome-screen data with millions of markers, and it is applicable to association studies with completely general combinations of family and case-control designs. The method does not require the genealogy of the sampled individuals to be known, but when pedigree data are available, ROADTRIPS can incorporate this information to improve power. Our simulation studies indicate that including known pedigree information in ROADTRIPS (by use of the RM test) provides an overall and, in some cases, substantial improvement in power over other available methods. In an analysis of GAW 15 RA data from small UK pedigrees, ROADTRIPS detected three SNPs that have significant association with a RA phenotype. In a reanalysis of the GAW 14 COGA data, ROADTRIPS detected five SNPs that have significant association with alcoholism, one of which had not been identified as significant in the previous analysis, and another SNP identified as significant in the previous analysis is no longer identified when cryptic structure is accounted for.

We have shown that when different SNPs have different rates of missing genotype data, ROADTRIPS is still valid, whereas GC is not properly calibrated for this setting. The ROADTRIPS method takes into account both the structure in the data and the particular missing genotype pattern at each SNP to construct a valid test. In contrast, the uniform inflation factor applied to all SNPs by GC can result in both an increase in type 1 error, due to undercorrection of SNPs with lower rates of missing genotype data, as well as a loss of power, due to overcorrection of SNPs with higher rates of missing genotype data. One approach to dealing with this problem in samples of unrelated individuals is to impute missing genotype data. However, there are special difficulties that arise with the use of imputation methods in samples with related individuals and hidden structure. First, Mendelian errors and incompatible genotypes can be introduced with this approach, unless the imputation is performed jointly among related individuals. Second, in samples with hidden population structure, e.g., samples from admixed populations, it is often unclear what the reference population should be, because an individual's ancestry at a particular SNP will generally be unknown. Finally, imputed genotypes are dependent among relatives, where the dependence among imputed genotypes differs from the ordinary dependence among genotypes and is affected by the type and amount of information available for each individual for each SNP. Thus, unlike the pedigree and population structure we consider, the dependence structure among imputed genotypes for different individuals differs across SNPs. This complex dependence among imputed genotypes for related individuals would need to be taken into account in the analysis in order to construct a valid test. However, to our knowledge, the current generation of imputation methods gives information only on the marginal accuracy (e.g., marginal posterior probabilities and not joint posterior probabilities) of imputed genotypes across individuals, so these methods would not allow valid assessment of uncertainty in the general setting of case-control association testing with related individuals.

The ROADTRIPS method uses an estimator of Ψ that is closely related to the estimated covariance matrix used in EIGENSTRAT.⁶ Recently, Choi et al.¹⁷ have used a different estimated kinship matrix in the context of association testing when pedigree information is missing. The kinship estimation approach of Choi et al. is suitable for close relatives in the absence of population structure but is not particularly well suited to accounting for population structure. To make this statement more precise, we consider the case of unrelated individuals with population structure based on the Balding-Nichols model with or without admixture. In this context, if one assumed that genotypes at different SNPs were independent with $| S_{i j} | \to ∞$ and p_s known and that σ_s² = 0.5p_s(1 − p_s) held at all but a finite number of SNPs, then the ROADTRIPS estimator of Ψ_ij would be consistent, whereas the estimator of Choi et al. would generally be inconsistent. The estimator of Choi et al. is also substantially more computationally intensive to calculate than the $\hat{Ψ}$ we use in ROADTRIPS.

Web Resources

The URLs for data presented herein are as follows:

ROADTRIPS source code, http://www.stat.uchicago.edu/~mcpeek/software/index.html
Online Mendelian Inheritance in Man (OMIM), http://www.ncbi.nlm.nih.gov/Omim/

Acknowledgments

This study was supported in part by the University of California President's Postdoctoral Fellowship and the Lamond Family Foundation Postdoctoral Fellowship (to T.T.) and by National Institutes of Health grant R01 HG001645 (to M.S.M.). Data on alcohol dependence were provided by the Collaborative Study on the Genetics of Alcoholism (U10AA008401) through the Genetics Analysis Workshop (R01GM031575). Data on RA were provided by a UK group led by Jane Worthington and Sally John through the Genetics Analysis Workshop (R01GM031575).

References

1. Lander E.S., Schork N.J. Genetic dissection of complex traits. Science. 1994;265:2037–2048. [Abstract] [Google Scholar]

2. Devlin B., Roeder K. Genomic control for association studies. Biometrics. 1999;55:997–1004. [Abstract] [Google Scholar]

3. Pritchard J.K., Stephens M., Donnelly P. Inference of population structure using multilocus genotype data. Genetics. 2000;155:945–959. [Europe PMC free article] [Abstract] [Google Scholar]

4. Zhu X., Zhang S., Zhao H., Cooper R.S. Association mapping, using a mixture model for complex traits. Genet. Epidemiol. 2002;23:181–196. [Abstract] [Google Scholar]

5. Zhang S., Zhu X., Zhao H. On a semiparametric test to detect associations between quantitative traits and candidate genes using unrelated individuals. Genet. Epidemiol. 2003;24:44–56. [Abstract] [Google Scholar]

6. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006;38:904–909. [Abstract] [Google Scholar]

7. Lee A.B., Luca D., Klei L., Devlin B., Roeder K. Discovering genetic ancestry using spectral graph theory. Genet. Epidemiol. 2010;34:51–59. [Abstract] [Google Scholar]

8. Zhang J., Niyogi P., McPeek M.S. Laplacian eigenfunctions learn population structure. PLoS ONE. 2009;4:e7928. [Europe PMC free article] [Abstract] [Google Scholar]

9. Satten G.A., Flanders W.D., Yang Q. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 2001;68:466–477. [Europe PMC free article] [Abstract] [Google Scholar]

10. Hoggart C.J., Parra E.J., Shriver M.D., Bonilla C., Kittles R.A., Clayton D.G., McKeigue P.M. Control of confounding of genetic associations in stratified populations. Am. J. Hum. Genet. 2003;72:1492–1504. [Europe PMC free article] [Abstract] [Google Scholar]

11. Kimmel G., Jordan M.I., Halperin E., Shamir R., Karp R.M. A randomization test for controlling population stratification in whole-genome association studies. Am. J. Hum. Genet. 2007;81:895–905. [Europe PMC free article] [Abstract] [Google Scholar]

12. Luca D., Ringquist S., Klei L., Lee A.B., Gieger C., Wichmann H.-E., Schreiber S., Krawczak M., Lu Y., Styche A. On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am. J. Hum. Genet. 2008;82:453–463. [Europe PMC free article] [Abstract] [Google Scholar]

13. Rakovski C.S., Stram D.O. A kinship-based modification of the armitage trend test to address hidden population structure and small differential genotyping errors. PLoS ONE. 2009;4:e5825. [Europe PMC free article] [Abstract] [Google Scholar]

14. Slager S.L., Schaid D.J. Evaluation of candidate genes in case-control studies: a statistical method to account for related subjects. Am. J. Hum. Genet. 2001;68:1457–1462. [Europe PMC free article] [Abstract] [Google Scholar]

15. Bourgain C., Hoffjan S., Nicolae R., Newman D., Steiner L., Walker K., Reynolds R., Ober C., McPeek M.S. Novel case-control test in a founder population identifies P-selectin as an atopy-susceptibility locus. Am. J. Hum. Genet. 2003;73:612–626. [Europe PMC free article] [Abstract] [Google Scholar]

16. Thornton T., McPeek M.S. Case-control association testing with related individuals: a more powerful quasi-likelihood score test. Am. J. Hum. Genet. 2007;81:321–337. [Europe PMC free article] [Abstract] [Google Scholar]

17. Choi Y., Wijsman E.M., Weir B.S. Case-control association testing in the presence of unknown relationships. Genet. Epidemiol. 2009;33:668–678. [Europe PMC free article] [Abstract] [Google Scholar]

18. Spielman R.S., McGinnis R.E., Ewens W.J. Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM) Am. J. Hum. Genet. 1993;52:506–516. [Europe PMC free article] [Abstract] [Google Scholar]

19. Rabinowitz D., Laird N. A unified approach to adjusting association tests for population admixture with arbitrary pedigree structure and arbitrary missing marker information. Hum. Hered. 2000;50:211–223. [Abstract] [Google Scholar]

20. Risch N., Teng J. The relative power of family-based and case-control designs for linkage disequilibrium studies of complex human diseases I. DNA pooling. Genome Res. 1998;8:1273–1288. [Abstract] [Google Scholar]

21. Bacanu S.A., Devlin B., Roeder K. The power of genomic control. Am. J. Hum. Genet. 2000;66:1933–1944. [Europe PMC free article] [Abstract] [Google Scholar]

22. Amos C.I., Chen W.V., Remmers E., Siminovitch K.A., Seldin M.F., Criswell L.A., Lee A.T., John S., Shephard N.D., Worthington J. Data for Genetic Analysis Workshop (GAW) 15 Problem 2, genetic causes of rheumatoid arthritis and associated traits. BMC Proc. 2007;1(Suppl 1):S3. [Europe PMC free article] [Abstract] [Google Scholar]

23. Edenberg H.J., Bierut L.J., Boyce P., Cao M., Cawley S., Chiles R., Doheny K.F., Hansen M., Hinrichs T., Jones K. Description of the data from the Collaborative Study on the Genetics of Alcoholism (COGA) and single-nucleotide polymorphism genotyping for Genetic Analysis Workshop 14. BMC Genet. 2005;6(Suppl 1):S2. [Europe PMC free article] [Abstract] [Google Scholar]

24. Sasieni P.D. From genotypes to genes: doubling the sample size. Biometrics. 1997;53:1253–1261. [Abstract] [Google Scholar]

25. McPeek M.S., Wu X., Ober C. Best linear unbiased allele-frequency estimation in complex pedigrees. Biometrics. 2004;60:359–367. [Abstract] [Google Scholar]

26. Lange K. Springer; New York: 2002. Mathematical and Statistical Methods for Genetic Analysis. [Google Scholar]

27. Wang Z., McPeek M.S. An incomplete-data quasi-likelihood approach to haplotype-based genetic association studies on related individuals. Journal of the American Statistical Association. 2009;104:1251–1260. [Europe PMC free article] [Abstract] [Google Scholar]

28. Balding D.J., Nichols R.A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica. 1995;96:3–12. [Abstract] [Google Scholar]

29. Pritchard J.K., Donnelly P. Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 2001;60:227–237. [Abstract] [Google Scholar]

30. Wright S. The genetical structure of populations. Ann. Eugenics. 1951;15:323–354. [Abstract] [Google Scholar]

31. Lake S.L., Blacker D., Laird N.M. Family-based tests of association in the presence of linkage. Am. J. Hum. Genet. 2000;67:1515–1525. [Europe PMC free article] [Abstract] [Google Scholar]

32. Zhu X., Cooper R., Kan D., Cao G., Wu X. A genome-wide linkage and association study using COGA data. BMC Genet. 2005;6(Suppl 1):S128. [Europe PMC free article] [Abstract] [Google Scholar]

33. Schott J.R. John Wiley; New York: 1996. Matrix analysis for statistics. [Google Scholar]

Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics

Full text links

Read article at publisher's site: https://doi.org/10.1016/j.ajhg.2010.01.001

Read article for free, from open access legal sources, via Unpaywall: http://www.cell.com/article/S0002929710000029/pdf

Citations & impact

Impact metrics

112

Citations

Jump to Citations

Citations of article over time

Alternative metrics

Altmetric item for https://www.altmetric.com/details/1445693

Altmetric
Discover the attention surrounding your research
https://www.altmetric.com/details/1445693

Article citations

Using GWAS and Machine Learning to Identify and Predict Genetic Variants Associated with Foodborne Bacteria Phenotypic Traits.
Tsoumtsa Meda L, Lagarde J, Guillier L, Roussel S, Douarre PE
Methods Mol Biol, 2852:223-253, 01 Jan 2025
Cited by: 0 articles | PMID: 39235748
Review
Estimation of inbreeding and kinship coefficients via latent identity-by-descent states.
Guan Y, Levy D
Bioinformatics, 40(2):btae082, 01 Feb 2024
Cited by: 0 articles | PMID: 38364309 | PMCID: PMC10902678
Free full text in Europe PMC
RETROSPECTIVE VARYING COEFFICIENT ASSOCIATION ANALYSIS OF LONGITUDINAL BINARY TRAITS: APPLICATION TO THE IDENTIFICATION OF GENETIC LOCI ASSOCIATED WITH HYPERTENSION.
Xu G, Amei A, Wu W, Liu Y, Shen L, Oh EC, Wang Z
Ann Appl Stat, 18(1):487-505, 31 Jan 2024
Cited by: 2 articles | PMID: 38577266 | PMCID: PMC10994004
Free full text in Europe PMC
A resource of induced pluripotent stem cell (iPSC) lines including clinical, genomic, and cellular data from genetically isolated families with mood and psychotic disorders.
Detera-Wadleigh SD, Kassem L, Besancon E, Lopes F, Akula N, Sung H, Blattner M, Sheridan L, Lacbawan LN, Garcia J, Gordovez F, Hosey K, Donner C, Salvini C, Schulze T, Chen DTW, England B, Cross J, Jiang X, [...] McMahon FJ
Transl Psychiatry, 13(1):397, 16 Dec 2023
Cited by: 4 articles | PMID: 38104115 | PMCID: PMC10725500
This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.
Free full text in Europe PMC
Genetic association models are robust to common population kinship estimation biases.
Hou Z, Ochoa A
Genetics, 224(1):iyad030, 01 May 2023
Cited by: 1 article | PMID: 36843304 | PMCID: PMC10474929
Free full text in Europe PMC

Go to all (112) article citations

Other citations

Wikipedia

https://en.wikipedia.org/wiki/Mega2,_the_Manipulation_Environment_for_Genetic_Analysis

Data

Data behind the article

This data has been text mined from the article, or deposited into data resources.

BioStudies: supplemental material and supporting data

http://www.ebi.ac.uk/biostudies/studies/S-EPMC2820184?xr=true

Diseases (2)

(1 citation) OMIM - 103780
(1 citation) OMIM - 180300

Funding

Funders who supported this work.

NCI NIH HHS (1)

Grant ID: K01 CA148958
24 publications

NHGRI NIH HHS (1)

Grant ID: R01 HG001645
41 publications

NIAAA NIH HHS (1)

Grant ID: U10 AA008401
475 publications

NIGMS NIH HHS (2)

Grant ID: R01 GM031575-28
150 publications
Grant ID: R01 GM031575
551 publications

Search life-sciences literature (45,103,589 articles, preprints and more)

ROADTRIPS: case-control association testing with partially or completely unknown population and pedigree structure.

Author information

Affiliations

Authors

Abstract

Free full text

ROADTRIPS: Case-Control Association Testing with Partially or Completely Unknown Population and Pedigree Structure

Timothy Thornton

Mary Sara McPeek

Abstract

Introduction

Material and Methods

Overview of Association Testing with Known Structure

Estimation of σ2 when Structure Is Known

Examples of Association Tests with Known Structure

Corrected Pearson χ2 and Armitage Trend Tests

WQLS Test

MQLS Test

Outline of ROADTRIPS Approach for Unknown Structure

Population Genetic Modeling Assumptions

Example 1: Related Individuals without Additional Population Structure

Example 2: Balding-Nichols Model with Admixture

Estimation of the Matrix Ψ

Estimation of σ2 when Structure Is Unknown

Association Tests when Structure Is Partially or Completely Unknown

Table 1

Rχ1 and Rχ2, the ROADTRIPS Versions of the Corrected χ2 and Corrected Armitage Tests

RM, the ROADTRIPS Version of MQLS when Structure Is Partially Known

RMNI, the ROADTRIPS Version of MQLS when Structure Is Completely Unknown

RWNI, the ROADTRIPS Version of WQLS when Structure Is Completely Unknown

GAW 15 Rheumatoid Arthritis Data

GAW 14 COGA Data

Results

Simulation Studies

Relationship Configurations

Table 2

Population Structure Settings

Random and Causal SNPs

Assessment of Type 1 Error

Table 3

Type 1 Error with Missing Genotypes: Rχ2 and GC

Table 4

Power Comparison

Table 5

Table 6

GAW 15 Rheumatoid Arthritis Data

Table 7

GAW 14 COGA Data

Table 8

Assessment of Computation Time

Discussion

Web Resources

Acknowledgments

References

Full text links

Citations & impact

Impact metrics

Citations of article over time

Alternative metrics

Article citations

Other citations

Wikipedia

Data

Data behind the article

BioStudies: supplemental material and supporting data

Diseases (2)

Similar Articles

Funding

NCI NIH HHS (1)﻿

NHGRI NIH HHS (1)﻿

NIAAA NIH HHS (1)﻿

NIGMS NIH HHS (2)﻿

Partnerships & funding

Estimation of σ² when Structure Is Known

Corrected Pearson χ² and Armitage Trend Tests

W_QLS Test

M_QLS Test

Estimation of σ² when Structure Is Unknown

Rχ₁ and Rχ₂, the ROADTRIPS Versions of the Corrected χ² and Corrected Armitage Tests

RM, the ROADTRIPS Version of M_QLS when Structure Is Partially Known

RM_NI, the ROADTRIPS Version of M_QLS when Structure Is Completely Unknown

RW_NI, the ROADTRIPS Version of W_QLS when Structure Is Completely Unknown

Type 1 Error with Missing Genotypes: Rχ₂ and GC

NCI NIH HHS (1)

NHGRI NIH HHS (1)

NIAAA NIH HHS (1)

NIGMS NIH HHS (2)