Abstract
Free full text
SumHer better estimates the SNP heritability of complex traits from summary statistics
Abstract
We present SumHer, software for estimating confounding bias, SNP heritability, enrichments of heritability and genetic correlations using summary statistics from genome-wide association studies. The key difference between SumHer and the existing software LD Score Regression (LDSC) is that SumHer allows the user to specify the heritability model. We apply SumHer to results from 24 large-scale association studies (average sample size 121,000) using our recommended heritability model. We show that these studies tended to substantially over-correct for confounding, and as a result the number of genome-wide significant loci was under-reported by about a quarter. We also estimate enrichments for 24 categories of SNPs defined by functional annotations. A previous study using LDSC reported that conserved regions were 13-fold enriched, and found a further six categories with above 3-fold enrichment. By contrast, our analysis using SumHer finds that none of the categories have enrichment above 2-fold. SumHer provides an improved understanding of the genetic architecture of complex traits, which enables more efficient analysis of future genetic data.
LD Score Regression (LDSC) has been frequently applied to summary statistics from genome-wide association studies (GWAS).1–4 It has four main aims: to estimate the average bias due to confounding, to estimate the “SNP heritability” of a trait (the proportion of phenotypic variance explained by all SNPs), to estimate the heritability enrichments of SNP categories, and to estimate the genetic correlation between a pair of traits. LDSC assumes a specific heritability model in which each SNP in the genome is expected to contribute equally.1 Although this model is widely used in statistical genetics, we recently showed that across a range of human traits, it poorly reflects reality.5 In particular, it fails to appreciate that in regions of high linkage disequilibrium (LD), the average heritability of each SNP tends to be lower due to multiple tagging of causal variation.6 As a result of this model misspecification, LDSC tends to over-estimate confounding bias, under-estimate SNP heritability and produce exaggerated estimates of enrichments.
We propose SumHer, software for estimating SNP heritability from summary statistics that allows the user to specify the heritability model. We apply SumHer to publicly-available GWAS results for 24 disease and quantitative traits, using our recommended heritability model.5,6 Firstly we show that these GWAS tended to over-correct for confounding; although in total they reported 2060 genome-wide significant loci, were it not for this over-correction they would have discovered approximately 2800. Secondly, we show that for each of the 24 traits, the SumHer estimate of SNP heritability is at least double that from LDSC. Thirdly, we consider functional enrichments. A previous study using LDSC concluded that heritability is highly concentrated in specific functional categories;2 for example, it estimated that across 17 diseases, conserved regions contribute 35% of SNP heritability, indicating that they are 13-fold enriched for causal variation. When we repeat the analysis using SumHer and our 24 traits, the concentration of heritability in functional categories is more modest: for example, while conserved regions remain significantly enriched, we estimate that they contribute only 7.1% (s.d. 0.2) of SNP heritability, representing 2.0-fold (s.d. 0.07) enrichment. We finish by providing an example of how results from SumHer enable more efficient analysis of genetic data. We show how for body mass index, height, HDL & LDL cholesterol and triglyceride, we are able to significantly improve the predictive performance of polygenic risk scores by incorporating our heritability model and estimates of enrichments. We make SumHer freely available within our software package LDAK (www.ldak.org).6
Results
SumHer
SumHer has the same four aims as LDSC;1–3 we outline them here, with methodological details provided in Online Methods. Suppose we are provided with summary statistics from a GWAS where each of m SNPs has been tested individually for association with a particular trait. Suppose also that we have access to a well-matched reference panel, from which we can reliably estimate r2jl the squared correlation between SNPs j and l. Let S1, S2,…,Sm denote the χ2(1) test statistics from single-SNP analysis; the first aim is to estimate the average inflation of these test statistics due to confounding. Let h2j denote the heritability directly contributed by SNP j; the second aim is to estimate h2SNP = ∑j h2j the SNP heritability of the trait.7 Let index a category of SNPs; the third aim is to estimate (∑j h2j) / h2SNP, the share8 of h2SNP contributed by SNPs in (we can then estimate the enrichment of the category by dividing its estimated share by its expected share). Finally, if we are also provided with summary statistics from a second trait, the fourth aim is to estimate the correlation between SNP effect sizes for the two traits.9
In order to achieve these four aims, we require a heritability model, which describes how h2j is expected to vary across the genome. Suppose this heritability model takes the form E[h2j] qj. The main difference between LDSC and SumHer is that SumHer allows for any heritability model (i.e., the user can specify arbitrary qj), whereas LDSC assumes all qj are the same. We recommend using SumHer with the “LDAK Model”:
where fj is the minor allele frequency (MAF) of SNP j and wj is a weighting based on local levels of LD.5,6 In this model, a SNP with high MAF is expected to contribute more heritability than one with low MAF, while a SNP in a region of low LD is expected to contribute more than one in a region of high LD. By contrast, LDSC estimates are obtained by setting qj = 1, which corresponds to the assumption that all SNPs are expected to contribute equally.1 We refer to this as the “GCTA Model” as it is a core assumption of the software GCTA.5,10
A second difference between LDSC and SumHer is how they estimate confounding bias. In a GWAS with no confounding, E[Sj] ≈ 1 + nv2j, where n is the sample size and v2j = h2j + ∑l near j r2jl h2l is the heritability tagged by SNP j (a working definition of “near” is within 1 Mb1). Both LDSC and SumHer estimate the deviation of test statistics from their expected values assuming no confounding. LDSC uses the model E[Sj] ≈ 1 + A + nv2j, where A indicates the average amount test statistics are inflated additively due to confounding (LDSC then reports 1 + A, which it refers to as “the intercept”).1 By contrast, we recommend using the model E[Sj] ≈ C(1 + nv2j), where C is the average amount test statistics are inflated multiplicatively. While switching between additive and multiplicative models of inflation will not generally impact estimates of confounding bias, enrichment or genetic correlation, it will affect estimates of h2SNP (see Online Methods). We prefer to model inflation as multiplicative because our analyses below indicate that the largest contributor of bias in published GWAS is genomic control, which affects test statistics multiplicatively.11
In total we use six versions of SumHer, which differ according to their assumed heritability and confounding models. LDSC-Zero assumes the GCTA Model and that test statistics are not inflated due to confounding (A = 0, C = 1); this is equivalent to using the LDSC software1 with the option “--intercept-h2 1”. LDSC assumes the GCTA Model and that confounding inflation is additive (A free to vary, C = 1); this is equivalent to using the LDSC software1 with default options. SumHer-Zero assumes the LDAK Model and that there is no confounding inflation (A = 0, C = 1); this is our recommended version when estimating h2SNP or enrichments and confident that confounding is negligible. SumHer-GC assumes the LDAK Model and that confounding inflation is multiplicative (A = 0, C free to vary); this is our recommended version when estimating confounding bias or genetic correlation, or when estimating h2SNP or enrichments and it is likely that test statistics are inflated due to population structure or familial relatedness, or were obtained using genomic control or mixed-model association analysis (see below). Hybrid-Zero assumes the heritability model
and that there is no confounding inflation (A = 0, C = 1), while Hybrid-GC assumes the same heritability model and that confounding inflation is multiplicative (A = 0, C free to vary). Model (2) is a linear combination of the GCTA and LDAK models, where p indicates the weight assigned to the LDAK model; we use this heritability model to compare the fit of the GCTA and LDAK models on real data (see below).
For all analyses, we consider only common (MAF ≥ 0.01), biallelic, autosomal SNPs. Following previous guidelines,1,2,4,5 we exclude SNPs within the major histocompatibility complex (MHC; Chromosome 6: 25-34 Mb), as well as SNPs which individually explain > 1% of phenotypic variation, and SNPs in LD with these (within 1 cM and r2jl > 0.1). Our reference panel is 404 non-Finnish, European individuals from the 1000 Genomes Project, recorded for 8,598,995 SNPs.12 For our enrichment analyses, we consider the same 24 functional annotations as Finucane et al.2 which include coding, conserved, enhancer and promoter regions (see Supplementary Table 1 for a full list). When estimating enrichments using LDSC-Zero or LDSC, we use the 53-part model recommended by Finucane et al.2 (one category for each annotation, one for each of 28 “buffer regions”, plus one containing all SNPs); when using SumHer-Zero or SumHer-GC, we use a 25-part model (one category for each annotation, plus one containing all SNPs).
Summary statistics
For our real analyses, we use summary statistics from two sets of GWAS. The “25 raw GWAS” (18 binary traits, 7 quantitative, average sample size 9 700; see Supplementary Table 2) are those for which we have access to individual-level data, from either the Wellcome Trust Case Control Consortium13 (WTCCC) or the eMERGE Network,14 and therefore we perform the association analysis ourselves. The “24 summary GWAS” (9 binary traits, 15 quantitative, average sample size 121,000; see Table 1), are those for which we use summary statistics from previously-published analyses.15
Table 1
LDSC | SumHer-GC | No. of significant loci after dividing test statistics by | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Trait (disease prevalence, %) | n | GIF | h2SNP | s.d. | 1+A | s.d. | h2SNP | s.d. | C | s.d. | 1 | GIF | 1+A | C |
Alzheimer's Disease46 (7.5) | 54000 | 1.09 | 0.03 | 0.01 | 1.07 | 0.02 | 0.12 | 0.03 | 1.03 | 0.01 | 21 | 19 | 19 | 21 |
Coronary Artery47 (6) | 79000 | 1.10 | 0.04 | 0.01 | 1.06 | 0.01 | 0.15 | 0.02 | 0.99 | 0.01 | 10 | 6 | 7 | 10 |
Crohn's Disease36 (0.5) | 21000 | 1.14 | 0.15 | 0.03 | 1.08 | 0.01 | 0.47 | 0.06 | 0.97 | 0.02 | 64 | 52 | 58 | 64 |
Ever Smoked?48 (56) | 74000 | 1.11 | 0.08 | 0.01 | 1.02 | 0.01 | 0.19 | 0.02 | 0.96 | 0.01 | 0 | 0 | 0 | 0 |
Inflammatory Bowel36 (0.7) | 35000 | 1.17 | 0.09 | 0.02 | 1.13 | 0.01 | 0.33 | 0.03 | 0.98 | 0.01 | 78 | 59 | 65 | 80 |
Rheumatoid Arhritis49 (0.5) | 58000 | 1.05 | 0.05 | 0.01 | 1.00 | 0.01 | 0.17 | 0.03 | 0.90 | 0.02 | 109 | 104 | 109 | 123 |
Schizophrenia50 (1) | 82000 | 1.57 | 0.19 | 0.01 | 1.16 | 0.01 | 0.42 | 0.02 | 0.91 | 0.01 | 105 | 23 | 63 | 140 |
Type 2 Diabetes35 (8) | 157000 | 1.17 | 0.08 | 0.01 | 1.07 | 0.01 | 0.23 | 0.02 | 0.95 | 0.01 | 38 | 25 | 32 | 42 |
Ulcerative Colitis36 (0.2) | 27000 | 1.12 | 0.06 | 0.01 | 1.10 | 0.01 | 0.27 | 0.03 | 0.99 | 0.01 | 38 | 31 | 31 | 38 |
Bone Mineral Density37 | 33000 | 1.11 | 0.10 | 0.02 | 1.07 | 0.01 | 0.28 | 0.04 | 1.00 | 0.01 | 19 | 18 | 18 | 19 |
Body Mass Index25 | 230000 | 1.13 | 0.09 | 0.01 | 0.80 | 0.01 | 0.33 | 0.03 | 0.55 | 0.02 | 69 | 52 | 135 | 336 |
Depressive Symptoms38 | 161000 | 1.12 | 0.02 | 0.00 | 1.03 | 0.01 | 0.07 | 0.01 | 0.96 | 0.01 | 0 | 0 | 0 | 1 |
Fasting Glucose39 | 58000 | 1.08 | 0.05 | 0.01 | 1.04 | 0.01 | 0.14 | 0.03 | 0.99 | 0.01 | 22 | 20 | 20 | 23 |
Glycated Hemoglobin40 | 46000 | 1.04 | 0.02 | 0.01 | 1.03 | 0.01 | 0.10 | 0.02 | 0.99 | 0.01 | 10 | 10 | 10 | 10 |
HDL Cholesterol26 | 96000 | 1.03 | 0.07 | 0.03 | 1.04 | 0.07 | 0.50 | 0.09 | 0.68 | 0.03 | 130 | 122 | 121 | 216 |
Height41 | 246000 | 2.09 | 0.20 | 0.02 | 1.69 | 0.06 | 0.46 | 0.04 | 0.98 | 0.04 | 720 | 196 | 288 | 754 |
LDL Cholesterol26 | 91000 | 1.03 | 0.08 | 0.03 | 1.00 | 0.04 | 0.43 | 0.10 | 0.73 | 0.04 | 101 | 96 | 101 | 155 |
Menarche Age42 | 253000 | 1.66 | 0.15 | 0.01 | 1.21 | 0.02 | 0.32 | 0.02 | 0.89 | 0.02 | 289 | 111 | 190 | 354 |
Menopause Age43 | 69000 | 1.10 | 0.06 | 0.01 | 1.06 | 0.02 | 0.25 | 0.03 | 0.92 | 0.02 | 49 | 39 | 39 | 55 |
Neuroticism38 | 171000 | 1.26 | 0.06 | 0.01 | 1.06 | 0.01 | 0.17 | 0.02 | 0.90 | 0.02 | 10 | 4 | 7 | 18 |
Subjective Well-Being38 | 298000 | 1.16 | 0.02 | 0.00 | 1.03 | 0.01 | 0.04 | 0.00 | 0.97 | 0.02 | 0 | 0 | 0 | 0 |
Triglyceride26 | 92000 | 1.02 | 0.14 | 0.04 | 0.92 | 0.03 | 0.45 | 0.11 | 0.70 | 0.04 | 82 | 82 | 91 | 152 |
Waist-Hip Ratio44 | 142000 | 1.05 | 0.06 | 0.01 | 0.92 | 0.01 | 0.20 | 0.02 | 0.76 | 0.01 | 26 | 23 | 33 | 66 |
Years Education45 | 329000 | 1.54 | 0.07 | 0.00 | 1.11 | 0.01 | 0.20 | 0.01 | 0.83 | 0.01 | 70 | 13 | 46 | 148 |
Average | 121000 | 1.21 | 0.04 | 0.00 | 1.04 | 0.00 | 0.12 | 0.00 | 0.93 | 0.00 | 86 | 46 | 62 | 118 |
Total | 2060 | 1105 | 1483 | 2825 |
For each of the 25 raw GWAS, we perform strict quality control (see Online Methods), then use REML to estimate how much of the total phenotypic variance explained by SNPs is inflation due to population structure or familial relatedness.8,16 We estimate that on average 3.1% (range 0.2% to 7.2%) of the variance explained is inflation (Supplementary Fig. 1), indicating that confounding due to population structure or familial relatedness is is modest. For the 24 summary GWAS, we are largely reliant on the quality control decisions of the original authors, which are generally much less strict than ours (Supplementary Table 3). For 4 of the GWAS, imputation info scores are available, so we exclude SNPs with score < 0.95; for the remaining 20 GWAS, we instead restrict to the 4,555,718 SNPs present in the eMERGE data, as we observe that these SNPs tend to be well-imputed in Caucasian GWAS (Supplementary Table 4).
The importance of the heritability model
We begin by analyzing summary statistics from simulated phenotypes. Using the eMERGE data14,17 (25,875 individuals recorded for 4,555,718 SNPs), we generate 200 phenotypes each with 2,000 causal SNPs, h2SNP = 0.5, no confounding bias, and such that consecutive pairs (Phenotypes 1 & 2, 3 & 4, etc) have genetic correlation 0.5. For Phenotypes 1-100, we sample causal SNP effect sizes according to the GCTA Model, for Phenotypes 101-200, according to the LDAK Model. We perform single-SNP analysis for each phenotype, then analyze the resulting summary statistics using LDSC-Zero, LDSC, SumHer-Zero and SumHer-GC. Selected results are provided in Figure 1, with full results in Supplementary Figures 2 & 3.
First we examine results from analyzing each phenotype individually. Figure 1a shows that, as expected, accurate estimates of h2SNP are returned when phenotypes are analyzed assuming the matching heritability model (i.e., when GCTA phenotypes are analyzed using LDSC-Zero or LDSC and LDAK phenotypes are analyzed using SumHer-Zero or SumHer-GC), but that using a different heritability model can result in very poor estimates; SumHer-Zero tends to over-estimate h2SNP by about 20% when applied to GCTA phenotypes, while LDSC-Zero tends to under-estimate h2SNP by about 40% when applied to LDAK phenotypes. As shown in Supplementary Figure 2a, LDSC infers that there is only slight confounding when used on GCTA phenotypes (average intercept 1.008, s.d. 0.001), and therefore its estimates of h2SNP closely match those from LDSC-Zero. However, when used on LDAK phenotypes, LDSC wrongly infers that much of the causal signal is in fact confounding (average intercept 1.096, s.d. 0.001), and as a result, its estimates of h2SNP are on average about 60% lower than those from LDSC-Zero and about 75% lower than the true value.
Figure 1b reports the estimated enrichment of SNPs in conserved regions. As causal SNPs were picked at random from across the genome, the true enrichment is one. Again, we see that assuming the correct heritability model produces reliable estimates, but assuming the wrong model can result in misleading conclusions. In particular, we find that when LDSC-Zero or LDSC are used to analyze LDAK phenotypes, they infer that conserved regions are over 2-fold enriched for heritability. We chose to focus on conserved regions as this was the category that, by applying LDSC to real data, Finucane et al.2 found to be most enriched. These simulations show that a substantial portion of the enrichment observed by Finucane et al. could be an artifact of misspecifying the heritability model.
Figure 1c compares results from analyzing consecutive pairs of phenotypes. Whereas LDSC-Zero and SumHer-Zero only produce accurate estimates of genetic correlation when applied to GCTA and LDAK phenotypes, respectively, we find that LDSC and SumHer-GC produce accurate estimates for both sets of phenotypes (Supplementary Fig. 2f shows why this is the case). Even so, highest precision is achieved when the correct heritability model is assumed; the s.d. of SumHer-GC estimates is on average about 50% higher than that of LDSC estimates when analyzing GCTA phenotypes, but about 50% lower when analyzing LDAK phenotypes.
Determining the most appropriate heritability model
With access to individual-level data, we can compare how well different heritability models fit real data using the likelihood from REML analysis.5 With only summary statistics, we can instead use the likelihood from SumHer. Supplementary Table 5 shows that across the 25 raw GWAS, the SumHer log likelihood is on average 17 nats higher under the LDAK Model than under the GCTA Model (for comparison, the average difference in REML log likelihood is 19 nats; Supplementary Table 6). Across the 24 summary GWAS, the SumHer log likelihood is on average 76 nats higher under the LDAK Model (Supplementary Table 7).
Rather than comparing based on likelihoods, an easier-to-visualize approach is to run SumHer using a hybrid heritability model containing both the GCTA and LDAK Models, and allow the data to decide the relative weighting of each. Specifically, we use Hybrid-Zero (or Hybrid-GC if the test statistics are confounded), with the focus on estimating p, the “proportion of LDAK”, in Model (2). Figure 2a demonstrates proof of principle using our simulated phenotypes: we see that Hybrid-Zero correctly estimates p ≈ 0 when applied to the GCTA phenotypes, p ≈ 1 for the LDAK phenotypes, and p ≈ 0.5 for “Hybrid Phenotypes” (created by summing Phenotypes 1 & 101, 2 & 102, etc.). For Figure 2b and Supplementary Table 5, we apply Hybrid-Zero to the 25 raw GWAS; the average estimate of p is 1.03 (s.d. 0.02), indicating that the data overwhelmingly support the LDAK Model over the GCTA Model. For Figure 2c and Supplementary Table 7, we apply Hybrid-GC to the 24 summary GWAS; the average estimate of p is 0.85 (s.d. 0.01), again strongly supporting the LDAK Model (in the Discussion, we consider why this estimate is lower than that for the 25 raw GWAS).
Population structure and relatedness
Supplementary Tables 2, 8 & 9 reports estimates of confounding bias, SNP heritability, enrichments and genetic correlations for the 25 raw GWAS. As a consequence of our strict quality control, SumHer-GC finds only slight confounding bias; its average estimate of the scaling factor C is 1.005 (s.d. 0.002). We now construct GWAS with substantial confounding. First, for each of the 13 WTCCC GWAS, we introduce population structure by replacing 2000 of the controls (on average 54%) with 2000 individuals from POBI18 (People of the British Isles); this generates population structure because, although both WTCCC and POBI individuals were recruited from the UK, the latter predominately came from isolated, rural regions (Supplementary Fig. 4). Second, for each of the 12 eMERGE GWAS, we no longer filter individuals based on relatedness; this leads to the retention of approximately 1650 pairs of relatives (Supplementary Fig. 5). Supplementary Tables 10 & 11 confirm that in both cases, SumHer-GC now detects significant confounding; its average estimates of C are 1.022 (s.d. 0.003) and 1.020 (s.d. 0.003), respectively.
Genomic control and mixed-model association analysis
Although its use is declining, it remains that the majority of published GWAS have performed genomic control (divided test statistics by the GIF) at least once in their analyses.11 As the GIF tends to over-estimate confounding bias,19 genomic control will tend to result in deflated test statistics. Supplementary Figures 6 & 7 show that, if not accounted for, genomic control will result in under-estimation of h2SNP and inaccurate estimates of heritability enrichments, but that reliable estimates can be obtained by allowing for multiplicative inflation of test statistics. SumHer, like LDSC, is designed to be used with results from classical (least-squared) regression. However, with the development of software such as Fast-LMM, GCTA-LOCO and Bolt-LMM20–22 it is now common for GWAS to instead use mixed-model association analysis.23,24 Supplementary Figures 6, 8 & 9 show that when estimating h2SNP and enrichments, the impact of mixed-model analysis is similar to, albeit less severe than, that of genomic control. However, as with genomic control, reliable estimates can be obtained by allowing for multiplicative inflation of test statistics.
Estimates of confounding bias and SNP heritability
Table 1 reports estimates of confounding bias for each of the 24 summary GWAS. LDSC estimates the average intercept to be 1.042 (s.d. 0.002), indicating that the test statistics tend to be inflated. By contrast, SumHer-GC estimates the average scaling factor to be 0.929 (s.d. 0.003), indicating that the test statistics tend to be deflated. The widespread deflation observed by SumHer-GC is consistent with the fact that at least 15 of the 24 summary GWAS performed genomic control (Supplementary Table 3); in particular, we note that the four GWAS estimated to have lowest C (0.55 for body mass index, 0.68 for HDL cholesterol, 0.73 for LDL cholesterol and 0.70 for triglyceride levels), are all meta-analyses that used genomic control both before and after combining results across cohorts.25,26 Table 1 also reports the number of independent loci with P < 5×10-8. Without adjustment, there are on average 86 loci per trait. If we adjust based on the results of LDSC (i.e., for each trait divide test statistics by the LDSC estimate of confounding bias), the average reduces to 62, whereas if we adjust based on the results of SumHer-GC, it increases to 118. For these counts, we defined two significant SNPs as dependent if they are within 1 cM and have r2jl > 0.1, but we find that results are very similar if instead we increase the window size to 3 cM or use r2jl > 0.2 (Supplementary Table 12).
Table 1 shows that across the 24 traits, estimates of h2SNP from SumHer-GC are on average 2.7 (s.d. 0.05) times higher than those from LDSC. As shown in Supplementary Table 13, this difference is primarily due to changing the heritability model, rather than changing the confounding model; for example, were we to switch from the GCTA to LDAK Model but to continue using an additive model for confounding inflation, estimates would still be on average 2.3 (s.d. 0.04) times higher.
Estimates of functional enrichments and genetic correlations
Figure 3a and Supplementary Table 14 report estimates of enrichments for the 24 functional categories, averaged across the 24 traits. We see striking differences between the estimates from LDSC and SumHer-GC. For example, LDSC estimates of enrichments range from -1.1 to 9.4, whereas SumHer-GC estimates range from 0.78 to 2.0. The estimated enrichment of a category is its estimated share of h2SNP divided by its expected share under the assumed heritability model.27 Supplementary Table 14 shows that changing from the GCTA to LDAK model has a small impact on the expected share of each category (the denominator), but typically has a large impact on the estimated share (the numerator). Supplementary Figure 10 and Supplementary Tables 15 & 16 show that the large differences between LDSC and SumHer-GC estimates of enrichments remain if we calculate the former using the LDSC software1 (in this setting, LDSC is often referred to as stratified LDSC or S-LDSC) rather than using our implementation of LDSC in SumHer, and likewise if we vary the SNP sets (the authors of LDSC recommend that the reference panel contains as many SNPs as possible, but that only HapMap3 SNPs28 with MAF ≥ 0.05 are used when performing the regression1,2), or if we use the 75-part model proposed by Gazal et al.29
Based on the results from SumHer-GC, we conclude that conserved regions30,31 and transcription start sites32 are most enriched for heritability; they are estimated to contribute 7.1% (s.d. 0.2) and 3.6% (s.d. 0.2) of h2SNP, respectively, 1.95 (s.d. 0.07) and 1.97 (s.d. 0.09) times higher than their expected contributions under the LDAK Model. Repressed region are the only category significantly depleted; although they are estimated to contribute 35% (s.d. 0.5) of h2SNP, this is only 0.78 (s.d. 0.01) times their expected contribution.
Supplementary Figure 11 and Supplementary Table 17 provide estimates of genetic correlation for the 276 pairs of traits. As predicted by our simulation study, there is strong concordance between estimates from LDSC and SumHer-GC, but the SumHer-GC estimates are more precise; for example, across the 38 pairs of traits that both methods find to be significantly correlated (P < 0.05/276), the s.d. of SumHer-GC estimates is on average about 20% lower than that of LDSC estimates.
Improving polygenic risk scores
Figure 3b shows that there is strong concordance between the average estimates of enrichments obtained from the 9 binary traits and those from the 15 quantitative traits. This suggests broad similarities between the genetic architectures of different traits, which in turn implies that it should be possible to use information from existing GWAS to improve the efficiency of future analyses. As a demonstration, we consider prediction using polygenic risk scores (PRS). We focus on body mass index, height, HDL & LDL cholesterol and triglyceride levels, as for these five traits we can train prediction models using the 24 summary GWAS, then measure how well these perform using the (independent) eMERGE data.14
To construct a PRS, we need estimates of SNP effect sizes. The most common approach is to use estimates from single-SNP analysis33 (“Classical PRS”). However, in Online Methods, we explain how, given a heritability model, we can obtain a prior distribution for v2j, the heritability tagged by SNP j, then calculate a “Bayesian PRS” using the posterior mean effect sizes. For each trait, we construct four Bayesian PRS corresponding to four heritability models. First we use the GCTA heritability model (qj = 1). Next we use the “Enriched GCTA Model”, obtained by scaling the qj based on the LDSC estimates of enrichments (e.g., if a category was estimated to have 2-fold enrichment, then the SNPs it contains would have average qj = 2). We similarly construct PRS based on the LDAK Model, then the “Enriched LDAK Model”, where the qj are scaled according to the SumHer-GC estimates of enrichments.
Figure 4a compares the four prior distributions for v2j. We see that the two priors derived from the GCTA Model are more diffuse than the two from the LDAK Model. As explained in the Online Methods, this is because the GCTA Model predicts that v2j scales with local levels of LD, which vary considerably across the genome, whereas the LDAK Model predicts that v2j scales with local MAFs, which vary less. Figure 4b and Supplementary Table 18 report the performance of each PRS, measured as correlation between predicted and observed phenotypes for the eMERGE individuals. Averaged across the five traits, the Bayesian PRS constructed from the GCTA Model performs 0.1% (s.d. 1.6) worse than the Classical PRS (i.e., no significant difference). However, the Bayesian PRS constructed from the Enriched GCTA, LDAK and Enriched LDAK Models perform, respectively, 5.2%, 4.7% and 6.1% better (s.d.s all 1.6), each of which is a significant improvement (P < 0.05/4) over the Classical PRS.
Discussion
We have presented SumHer, software for estimating confounding bias, SNP heritability, enrichments of heritability and genetic correlations from GWAS results. While the aims of SumHer are the same as those of LDSC, the key difference is that SumHer allows the user to specify the heritability model. If SumHer is run using the GCTA Model, its estimates will match those from LDSC. However, we instead recommend using the LDAK Model, which we have shown is better supported by real data. We have analyzed GWAS results for tens of traits, showing that the impact of using an improved heritability model is often substantial, and overall provides a very different description of the genetic architecture of complex traits than has to date been obtained from LDSC analyses.
The heritability model matters
The heritability model describes the prior assumptions regarding how heritability is distributed across the genome. We began by showing that estimates of confounding bias, h2SNP and enrichments are sensitive to the assumed heritability model, and that using an unrealistic heritability model can result in misleading conclusions. Next we evaluated the GCTA and LDAK heritability models on real data. Not only did we demonstrate that the LDAK Model is substantially more realistic than the GCTA Model, but also our prediction analysis showed that the LDAK Model is sufficiently realistic to be useful. Nonetheless, it is important to recognize that the LDAK heritability model is not perfect, and as such SumHer provides tools for testing and comparing different heritability models on large-scale GWAS data.
Hybrid heritability models
As well as comparing the GCTA and LDAK Models based on likelihood, we also ran SumHer using a hybrid heritability model where the fractions 1-p and p indicate the proportions of GCTA and LDAK, respectively; across the 25 raw GWAS, the average estimate of p was 1.03 (s.d. 0.02), while across the 24 summary GWAS, the average estimate of p was 0.85 (s.d. 0.01). Although both estimates of p are greater than 0.5, consistent with the LDAK Model being more realistic than the GCTA Model, the latter suggests the hybrid model might improve on the LDAK Model. We recommend using only the LDAK Model for two reasons. Firstly, estimates from the LDAK Model tend to be more precise than those from the hybrid model. This is particularly true when estimating enrichments; for example, Supplementary Table 19 shows that across the 24 summary GWAS, the s.d. of SumHer-GC estimates is on average about 70% lower than that of Hybrid-GC estimates. Secondly, Supplementary Figure 12 shows that introducing poorly-genotyped SNPs makes traits appear more GCTA-like, indicating that the hybrid model is more sensitive to genotyping errors. In any case, Supplementary Figure 13 shows that even if we had instead preferred results from the hybrid model, it would not have affected our overall conclusions.
Confounding bias in GWAS
In addition to examining the choice of heritability model, we also considered how to model inflation of test statistics due to confounding. Whereas LDSC uses an additive model for confounding inflation, we prefer a multiplicative model, due to the fact that genomic control (which scales test statistics multiplicatively) has been widely used by published GWAS (we provide further support for our choice in Supplementary Fig. 14 and Supplementary Table 13). We recognize that as genomic control becomes less common, it will be necessary to check whether the multiplicative model remains preferable to the additive model.
Implications for complex trait genetics
Our work has three main conclusions. (i) Large-scale GWAS have tended to over-correct for confounding, which substantially reduced the discovery of genome-wide significant associations. Although SumHer estimates of confounding bias can be used to adjust test statistics from a confounded GWAS (as we did for Table 1, in order to estimate the number of genome-wide significant loci missed by the 24 summary GWAS), adjusted test statistics will not be as accurate as those from a GWAS without confounding. Therefore, we would prefer that SumHer is used during the primary analysis; if SumHer finds substantial confounding bias, this indicates that the analysis should be repeated with improved quality control. (ii) LDSC tends to substantially under-estimate h2SNP. Accurate estimates of h2SNP are important not only because they improve our understanding of genetic architecture,16 but also because they are now being incorporated in software for analyzing complex traits (e.g., the mixed-model association software Bolt-LMM22 and the prediction software LDPred34). The efficiency of these analyses can therefore be improved by incorporating SumHer estimates of h2SNP. (iii) The most striking differences between LDSC and SumHer are observed when estimating enrichments of heritability in functional regions. Whereas analyses using LDSC have found that heritability is concentrated in specific genomic regions, we have instead shown that heritability is spread more evenly across the genome. Although the realization that complex traits are even more complicated than previously thought is daunting, it is only by properly understanding their complexity that we can develop more efficient tools for analyzing genetic data.
Methods
Estimating SNP heritability
Suppose that we have summary statistics from a GWAS on n individuals and m SNPs; let Sj denote the χ2(1) test statistic from regressing the phenotype on Xj, the vector of additively-coded genotypes for SNP j, and let nj ≤ n denote the number of individuals used in this regression (note that in the main text, for simplicity, we assumed nj = n). If Sj was obtained using classical (i.e., least-squares) linear regression, then19
where v2j is the total amount of heritability tagged by SNP j. In the main text, we referred to h2j as the heritability “directly contributed” by SNP j, to emphasize that while a causal signal can contribute to multiple v2j (i.e., be tagged by multiple SNPs), it can only contribute to one h2j. More formally, the h2j represent a partitioning of h2SNP, the total heritability tagged by the m SNPs genotyped by the GWAS; this definition recognises that an ungenotyped causal variant can contribute to h2SNP, provided it is tagged by one or more genotyped SNPs (in which case its heritability will be shared across the h2j of the tagging SNPs, even though none of these “directly contribute” its heritability). If there is no population structure or familial relatedness, then Cor(Xj,Xl)2 will be negligible for distant SNPs, while for local SNPs an unbiased estimate of Cor(Xj,Xl)2 is1
where X′j is the vector of SNP j genotypes for the n′ individuals in a reference panel (for accurate estimates of r2jl, the individuals in the reference panel should have similar ancestry to those used in the GWAS). Therefore, in place of Equation (3) we use
where the set Nj indexes those SNPs “near” SNP j (a working definition of near is within 1 cM; Supplementary Fig. 15). Finally, given a heritability model of the form E[h2j] qj, we replace h2j by its expected value qj h2SNP / Q, where Q = ∑ qj, resulting in
which allows us to estimate h2SNP by regressing (Sj – 1) on uj. To account for correlated datapoints and heteroscedasticity we use weighted least-squares regression. Specifically, if D is a diagonal matrix whose non-zero entries are the regression weights, then the estimate of h2SNP would be (uTDu)-1uTD(S - 1), where u and S are vectors containing the m values for uj and Sj, respectively. Following LDSC,1 we use 1 / Djj = (∑lNj r2jl)(1 + uj h2SNP). Since h2SNP is unknown, we proceed iteratively starting at h2SNP = 0, then successively updating Djj and h2SNP until convergence. Again following LDSC, we estimate standard errors via block jackknifing (by default we use 200 blocks).1
Comparing heritability models
The (weighted) natural log likelihood is
To evaluate L(S | h2SNP, D) requires values for h2SNP and D. While the natural choice is to use the values after the final iteration, in order to compare different heritability models based on L(S | h2SNP, D), we must use the same D for each (else models leading to lower Djj would have an unfair advantage). Therefore, SumHer reports
Estimating confounding bias
When assuming that confounding inflates test statistics multiplicatively, we replace Equation (5) with E[Sj] ≈ C(1 + uj h2SNP), then estimate C and h2SNP by jointly regressing Sj on 1 and uj. To instead copy LDSC and assume confounding causes additive inflation, we replace Equation (5) with E[Sj] ≈ 1 + nj a + uj h2SNP, then estimate a and h2SNP by jointly regressing (Sj - 1) on nj and uj (note that in the main text we assumed nj = n, allowing us to replace nj a by A).
Estimating enrichments
Suppose we have K categories; let Ijk {0,1} indicate whether SNP j belongs to Category k. We wish to estimate h2Cat k = ∑ j Ijkh2j, the heritability contributed by SNPs in Category k. We now use the heritability model
where qj Ijk h2Part k / Qk represents the contribution of Category k to the expected heritability of SNP j, and h2SNP = h2Part 1 + … + h2Part K. We consider two scenarios: in Scenario 1, the K categories partition the genome (∑k Ijk = 1); in Scenario 2, the first K-1 categories correspond to annotations, and the Kth is the base category containing all SNPs (IjK = 1, QK = m). Using Equation (6) ensures that when there is no enrichment, E[h2j] = qj h2SNP / Q for both scenarios. To appreciate why, consider that for Scenario 1, h2Part k = h2Cat k, so that when no categories are enriched, h2Part k = Qk h2SNP / Q; for Scenario 2, h2Part 1, … , h2Part K-1 indicate how much each annotation increases the SNP heritability from h2Part K, so that when there is no enrichment, h2Part 1 = … = h2Part K-1 = 0 and h2Part K = h2SNP. Now when we replace h2j in Equation (4) by its expected value, we obtain
and therefore we can estimate h2Part 1, …, h2Part K by jointly regressing (Sj - 1) on uj1, …, ujK. Given these, our estimate of h2Cat k is ∑j,k′ (Ijk qj Ijk′ h2Part k’ / Qk’), which we then divide by Qk h2SNP / Q to get an estimate of the enrichment of Category k.
Estimating genetic correlation
Suppose we have summary statistics from two GWAS. Instead of χ2(1) test statistics, we now use (signed) Z-statistics. Let ZAj and ZBj denote the two Z-statistics for SNP j, computed using nAj and nBj individuals, respectively, of which nCj were common to both GWAS (if the two GWAS were independent, nCj = 0). We assume
where cAB is the phenotypic correlation between the two traits and h2AB is their genetic covariance. This equation matches that used by LDSC,3 except that we have replaced r2jl / m by qj r2jl / Q. By regressing ZAjZBj on u′j, we obtain an estimate of h2AB, which we then divide by estimates of
Impact of changing the confounding model
Note that when per-SNP sample sizes are constant, nj = n, so we can write 1 + nj a + uj h2SNP as C (1+ uj h2SNP / C), where C = 1 + n a = 1 + A. Therefore, given the heritability model (which determines the uj), the estimate of confounding bias will be the same whether we assume additive inflation of test statistics (and estimate 1+A) or assume multiplicative inflation (and estimate C), but estimates of h2SNP from the additive model will be C times those from the multiplicative model (estimates of enrichments and genetic correlations will be unaffected, as these are ratios of heritabilities so the factor C cancels).
Reference panel
Our primary reference panel is 404 non-Finnish, European individuals from the 1000 Genomes Project.12 In Supplementary Figure 16, we repeat our analyses of the 24 summary GWAS using instead 8,850 unrelated Caucasian individuals from the Health & Retirement Study. The results are almost identical, indicating that despite its small size, it suffices to construct reference panels from the 1000 Genomes Project data.
Regions of extreme LD and large-effect loci
Estimates can be unduly affected by regions of extreme LD and by SNPs with disproportionately large effect size.1,4,5 Therefore, for all analyses, we exclude SNPs within the MHC (Chromosome 6: 25-34 Mb), as well as SNPs which individually explain >1% of phenotypic variation (Sj > nj / 99), and SNPs in LD with these (within 1 cM and r2jl > 0.1); the latter resulted in the exclusion of SNPs for 10 of the 25 raw GWAS and for 4 of the 24 summary GWAS (Supplementary Table 20). Excluding large-effect loci will result in under-estimation of h2SNP; this has little consequence for our analyses, as we are primarily interested in comparing models, but if preferred can be avoided by re-introducing the large-effect SNPs as fixed effects.5
Binary phenotypes
So far, we have implicitly assumed the trait is quantitative. If instead it is binary (i.e., for case-control GWAS), there are two considerations. Firstly, heritability estimates now correspond to the observed scale; SumHer will convert them to the liability scale if provided with the prevalence and case-control ratio.51,52 Secondly, the p-values may have come from logistic regression instead of linear regression; Supplementary Figure 17 shows that because linear and logistic p-values closely match for SNPs with small or moderate effect, this has limited impact.
Polygenic risk scoring
To predict phenotypes, we construct PRS of the form ∑j βj X′j, where the vector X′j contains standardized genotypes for SNP j (obtained by centering Xj, then scaling to have variance 1). Without loss of generality, we assume phenotypes have also been standardized, in which case the estimate of βj from classical linear regression is ρj, the correlation observed between SNP j and the phenotype, and has variance (1 – ρ2j) / nj. We can infer ρj from the Wald test Z-statistic
Choosing a heritability model
Although the heritability model is defined in terms of h2j, as is clear from Equation (4), its fit depends only on how well it models v2j = h2j + ∑lNj r2jl h2l. Therefore, when specifying qj, the focus should be on accurately describing how v2j varies across the genome. LDSC assumes the GCTA Model,5,7 (qj = 1), which implies that v2j correlates with local levels of LD. Despite its widespread use in statistical genetics (e.g., it is implicitly assumed by any regression method where SNPs are standardized and the same penalty function / prior distribution is applied to each), the GCTA Model poorly reflects real data.5 Even for traits where a correlation is observed between v2j and Sj (those with p < 1 in Figures 2b & 2c), the correlation is substantially weaker than predicted by the GCTA Model (hence p >> 0), and partly a consequence of including lower-certainty SNPs in the GWAS (Supplementary Fig. 12). We recommend using the LDAK Model. Originally, this took the form qj = wj, where the weightings wj are calculated based on the assumption that E[v2j] is constant.6 We recently updated it to qj = wj [fj(1 - fj)]0.75 rj, reflecting that v2j tends to depend on MAF, fj, and genotype certainty, rj (we did not use the latter as either rj ≈ 1 or was unknown).5 The exponent 0.75 was chosen as this value performed well in a previous analysis of 42 human traits. In Supplementary Figure 18, we confirm 0.75 also performs well for both the 25 raw and 24 summary GWAS.
Quality control
To prepare the 13 WTCCC GWAS, we used our previously-described protocol.5 In summary, after excluding apparent population outliers, individuals with extreme missingness or heterozygosity and SNPs with MAF < 0.01, call rate < 0.95 or Hardy-Weinberg P < 1 × 10-6, we phased using SHAPEIT, then imputed using IMPUTE2 with the 1000 Genomes Project Phase 3 (2014) reference panel.12,55,56 When merging case and control datasets, we converted genotype probabilities to hard calls using a threshold of 0.95, then retained only autosomal SNPs that in all cohorts had MAF ≥ 0.01 and info score ≥ 0.99. Finally, we thinned individuals, so no pair remained with allelic correlation57 > 0.05. For the 12 eMERGE GWAS, data were provided post-imputation; this was also performed using SHAPEIT and IMPUTE2, but used the 1000 Genomes Project Phase 2 reference panel.17 We converted genotype probabilities to hard calls using a certainty threshold of 0.95, then retained only biallelic SNPs with MAF ≥ 0.01, call rate ≥ 0.95, info score ≥ 0.95 and whose genomic positions matched those in the 1000 Genomes Project. Finally, we excluded individuals whose genotypes were inconsistent with those of non-Finnish Europeans from the 1000 Genomes Project (Supplementary Fig. 19), and those whose ethnicity was reported as “Hispanic or Latino”, then filtered until no pair remained with allelic correlation57 > 0.05 (which left 25,875 individuals). For the 24 summary GWAS, quality control varied by study (see Table 1 for references), but was in general much less strict than we would have performed had we access to the individual-level data16 (for example, the most common info score thresholds were 0.3 and 0.5). Info scores were only available for Crohn’s Disease, inflammatory bowel disease, schizophrenia and ulcerative colitis; for these traits we restricted to SNPs with score ≥ 0.95, for the remainder, we restricted to SNPs with info score ≥ 0.95 in the eMERGE data (Supplementary Table 4).
Association analysis
For the 25 raw GWAS, we performed the association analysis using linear regression (regardless of whether the trait was quantitative or binary), including as covariates sex and ten principal components (five derived from the reference panel, five from the 1000 Genomes Project). The 24 summary GWAS varied in whether they used classical or mixed-model regression, and whether they performed a meta or mega analysis (Supplementary Table 3).
Software
SumHer is implemented in Version 5.0 of the LDAK software. In Supplementary Figure 20 we confirm that LDSC estimates computed using SumHer match those computed using Version 1.0.0 of the LDSC Software.1
Run times
To run SumHer using the LDAK Model requires three steps: the first is to calculate SNP weightings; the second is to compute a “tagfile”, which contains qj + ∑lNj ql r2jl for each SNP; the third is to perform the regression. When using the 404 non-Finnish Europeans from the 1000 Genomes Project, the first step takes a few hours, the second and third take minutes. When running SumHer using the GCTA Model, it is not necessary to calculate the SNP weightings, but then computing the tagfile will generally take a few hours.
Acknowledgments
We thank A Price, H Finucane, P O’Reilly and M Speed for helpful discussions. Access to Wellcome Trust Case Control Consortium data was authorized as work related to the project “Genome-wide association study of susceptibility and clinical phenotypes in epilepsy”, access to eMERGE Network data was granted under dbGaP Project 14422, “Comprehensive testing of SNP-based prediction models”, while access to the Health and Retirement Study was granted under dbGaP Project 15139, “Developing summary-statistic tools for analysing genetic association study data”. D.S. is funded by the UK Medical Research Council under grant MR/L012561/1, by the European Unions Horizon 2020 Research and Innovation Programme under the Marie Skłodowska-Curie grant agreement number 754513, by Aarhus University Research Foundation (AUFF) and the Independent Research Fund Denmark under Project 7025-00094B. The eMERGE Network was initiated and funded by NHGRI through the following grants: U01HG006828 (Cincinnati Childrens Hospital Medical Center/Boston Childrens Hospital); U01HG006830 (Childrens Hospital of Philadelphia); U01HG006389 (Essentia Institute of Rural Health, Marshfield Clinic Research Foundation and Pennsylvania State University); U01HG006382 (Geisinger Clinic); U01HG006375 (Group Health Cooperative); U01HG006379 (Mayo Clinic); U01HG006380 (Icahn School of Medicine at Mount Sinai); U01HG006388 (Northwestern University); U01HG006378 (Vanderbilt University Medical Center); and U01HG006385 (Vanderbilt University Medical Center serving as the Coordinating Center). The Health and Retirement Study genetic data is sponsored by the National Institute on Aging (grant numbers U01AG009740, RC2AG036495, and RC4AG039029) and was conducted by the University of Michigan. Analyses were performed with the use of the UCL Computer Science Cluster and the help of the CS Technical Support Group, as well as the use of the UCL Legion High-Performance Computing Facility (Legion@UCL) and associated support services.
Footnotes
Data availability
The simulations and 25 raw GWAS used data from the Wellcome Trust and eMERGE Network. These were applied for and downloaded from, respectively, the European Genome-phenome Archive (accession codes: EGAD00000000001, EGAD00000000002, EGAD00000000003, EGAD00000000004, EGAD00000000005, EGAD00000000006, EGAD00000000007, EGAD00000000008, EGAD00000000009, EGAD00000000021, EGAD00000000022, EGAD00000000023, EGAD00000000024, EGAD00000000025, EGAD00000000057, EGAD00010000124, EGAD00010000264, EGAD00010000506, EGAD00010000634, EGAS00001000672 ) and from dbGaP (accession codes: phs000888.v1.p1.c1, phs000888.v1.p1.c3, phs000888.v1.p1.c4, phs000888.v1.p1.c5). To investigate the impact of the reference panel, we used data from the Health and Retirement Study, also available from dbGaP (accession code: phs000428.v2.p2). Results for each of the 24 summary GWAS are available to download from the websites of the corresponding studies (see Table 1 for references).
Accession Codes
Wellcome Trust Case Control Consortium data: EGAD00000000001, EGAD00000000002, EGAD00000000003, EGAD00000000004, EGAD00000000005, EGAD00000000006, EGAD00000000007, EGAD00000000008, EGAD00000000009, EGAD00000000021, EGAD00000000022, EGAD00000000023, EGAD00000000024, EGAD00000000025, EGAD00000000057, EGAD00010000124, EGAD00010000264, EGAD00010000506, EGAD00010000634, EGAS00001000672. eMERGE Network data: phs000888.v1.p1.c1, phs000888.v1.p1.c3, phs000888.v1.p1.c4, phs000888.v1.p1.c5. Health and Retirement Study data: phs000428.v2.p2.
Code availability
Step-by-step code for using SumHer to estimate SNP heritability, confounding bias, heritability enrichments and genetic correlations from summary statistics are provided in the Supplementary Note.
URLs
LDAK, http://www.ldak.org; LDSC, http://www.github.com/bulik/ldsc; 24 functional annotations, https://data.broadinstitute.org/alkesgroup/LDSCORE/baseline_bedfiles.tgz.
Author contributions
D.S. performed the analysis, D.S. and D.J.B. wrote the manuscript.Competing interests
The authors declare no competing financial interests.
References
Full text links
Read article at publisher's site: https://doi.org/10.1038/s41588-018-0279-5
Read article for free, from open access legal sources, via Unpaywall: https://europepmc.org/articles/pmc6485398?pdf=render
Citations & impact
Impact metrics
Article citations
Scalable summary-statistics-based heritability estimation method with individual genotype level accuracy.
Genome Res, 34(9):1286-1293, 11 Oct 2024
Cited by: 0 articles | PMID: 39038848 | PMCID: PMC11529871
Improved polygenic risk prediction in migraine-first patients.
J Headache Pain, 25(1):161, 27 Sep 2024
Cited by: 0 articles | PMID: 39333847 | PMCID: PMC11438044
Widespread natural selection on metabolite levels in humans.
Genome Res, 34(8):1121-1129, 20 Sep 2024
Cited by: 0 articles | PMID: 39152035 | PMCID: PMC11444169
The goldmine of GWAS summary statistics: a systematic review of methods and tools.
BioData Min, 17(1):31, 05 Sep 2024
Cited by: 0 articles | PMID: 39238044 | PMCID: PMC11375927
Evaluation of heritability partitioning approaches in livestock populations.
BMC Genomics, 25(1):690, 13 Jul 2024
Cited by: 0 articles | PMID: 39003468 | PMCID: PMC11246585
Go to all (105) article citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
dbGaP - The database of Genotypes and Phenotypes (2)
- (4 citations) dbGaP - phs000888
- (1 citation) dbGaP - phs000428
Diseases (Showing 7 of 7)
- (1 citation) OMIM - 142000
- (1 citation) OMIM - 246000
- (1 citation) OMIM - 253000
- (1 citation) OMIM - 121000
- (1 citation) OMIM - 230000
- (1 citation) OMIM - 161000
- (1 citation) OMIM - 171000
Show less
European Genome-Phenome Archive (Showing 20 of 20)
- (2 citations) European Genome-Phenome Archive - EGAD00000000009
- (2 citations) European Genome-Phenome Archive - EGAD00010000506
- (2 citations) European Genome-Phenome Archive - EGAD00000000006
- (2 citations) European Genome-Phenome Archive - EGAD00000000005
- (2 citations) European Genome-Phenome Archive - EGAD00000000008
- (2 citations) European Genome-Phenome Archive - EGAD00010000634
- (2 citations) European Genome-Phenome Archive - EGAD00000000007
- (2 citations) European Genome-Phenome Archive - EGAD00000000002
- (2 citations) European Genome-Phenome Archive - EGAD00000000024
- (2 citations) European Genome-Phenome Archive - EGAD00000000057
- (2 citations) European Genome-Phenome Archive - EGAD00000000001
- (2 citations) European Genome-Phenome Archive - EGAD00000000023
- (2 citations) European Genome-Phenome Archive - EGAD00000000004
- (2 citations) European Genome-Phenome Archive - EGAS00001000672
- (2 citations) European Genome-Phenome Archive - EGAD00000000003
- (2 citations) European Genome-Phenome Archive - EGAD00000000025
- (2 citations) European Genome-Phenome Archive - EGAD00010000124
- (2 citations) European Genome-Phenome Archive - EGAD00010000264
- (2 citations) European Genome-Phenome Archive - EGAD00000000022
- (2 citations) European Genome-Phenome Archive - EGAD00000000021
Show less
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
Evaluating and improving heritability models using summary statistics.
Nat Genet, 52(4):458-462, 23 Mar 2020
Cited by: 87 articles | PMID: 32203469
Summary statistic analyses can mistake confounding bias for heritability.
Genet Epidemiol, 43(8):930-940, 20 Sep 2019
Cited by: 5 articles | PMID: 31541496
Leveraging LD eigenvalue regression to improve the estimation of SNP heritability and confounding inflation.
Am J Hum Genet, 109(5):802-811, 13 Apr 2022
Cited by: 11 articles | PMID: 35421325 | PMCID: PMC9118121
SNP-based heritability and selection analyses: Improved models and new results.
Bioessays, 44(5):e2100170, 13 Mar 2022
Cited by: 10 articles | PMID: 35279859
Review
Funding
Funders who supported this work.
Medical Research Council (1)
Improved methodology for understanding the genetics of complex traits, with particular application to epilepsy.
Prof Doug Speed, University College London
Grant ID: MR/L012561/1
NHGRI NIH HHS (10)
Grant ID: U01 HG006388
Grant ID: U01 HG006389
Grant ID: U01 HG006375
Grant ID: U01 HG006828
Grant ID: U01 HG006830
Grant ID: U01 HG006378
Grant ID: U01 HG006380
Grant ID: U01 HG006382
Grant ID: U01 HG006385
Grant ID: U01 HG006379
NIA NIH HHS (3)
Grant ID: RC4 AG039029
Grant ID: U01 AG009740
Grant ID: RC2 AG036495