Accounting for cellular heterogeneity is critical in epigenome-wide association studies.

Jaffe AE; Irizarry RA

doi:10.1186/gb-2014-15-2-r31

Jaffe AE¹, Irizarry RA²

Affiliations

1. Lieber Institute for Brain Development, Johns Hopkins Medical Campus and Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
2. Biostatistics and Computational Biology, Dana Farber Cancer Institute and Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA

ORCIDs linked to this article

Jaffe AE | 0000-0001-6886-1454

Genome Biology, 04 Feb 2014, 15(2):R31
https://doi.org/10.1186/gb-2014-15-2-r31 PMID: 24495553 PMCID: PMC4053810

This article is in the Europe PMC Open access subset. Refer to the copyright information in the article for licensing details.

Abstract

Background

Epigenome-wide association studies of human disease and other quantitative traits are becoming increasingly common. A series of papers reporting age-related changes in DNA methylation profiles in peripheral blood have already been published. However, blood is a heterogeneous collection of different cell types, each with a very different DNA methylation profile.

Results

Using a statistical method that permits estimating the relative proportion of cell types from DNA methylation profiles, we examine data from five previously published studies, and find strong evidence of cell composition change across age in blood. We also demonstrate that, in these studies, cellular composition explains much of the observed variability in DNA methylation. Furthermore, we find high levels of confounding between age-related variability and cellular composition at the CpG level.
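
To illustrate the general idea of estimating cell-type proportions from a methylation profile, here is a minimal sketch. It is not the method used in the paper: it assumes reference profiles from sorted cells are available and fits a mixed profile as a non-negative combination of them by projected gradient descent. The function name `estimate_proportions`, the three unnamed cell types, and all beta values are hypothetical.

```python
def estimate_proportions(reference, mixture, iterations=2000):
    """Estimate mixing proportions by non-negative least squares.

    reference: one row per CpG, one beta value per cell type (assumed input).
    mixture:   one beta value per CpG, measured in the mixed sample.
    """
    n_types = len(reference[0])
    # The squared Frobenius norm bounds the largest eigenvalue of X^T X,
    # so its inverse is a safe gradient step size.
    step = 1.0 / sum(v * v for row in reference for v in row)
    weights = [1.0 / n_types] * n_types
    for _ in range(iterations):
        residuals = [
            sum(r * w for r, w in zip(row, weights)) - y
            for row, y in zip(reference, mixture)
        ]
        gradient = [
            sum(row[k] * res for row, res in zip(reference, residuals))
            for k in range(n_types)
        ]
        # Gradient step, then projection onto the non-negative orthant.
        weights = [max(0.0, w - step * g) for w, g in zip(weights, gradient)]
    total = sum(weights)
    if total == 0:
        raise ValueError("all estimated weights are zero")
    # Rescale so the estimates sum to one and read as proportions.
    return [w / total for w in weights]


# Toy reference: six CpGs (rows) by three hypothetical cell types (columns).
reference = [
    [0.9, 0.1, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.1, 0.9],
    [0.8, 0.2, 0.1],
    [0.2, 0.8, 0.2],
    [0.1, 0.2, 0.8],
]
true_proportions = [0.6, 0.3, 0.1]
# Simulate a noise-free mixed profile from the known proportions.
mixture = [sum(r * p for r, p in zip(row, true_proportions)) for row in reference]
estimated = estimate_proportions(reference, mixture)
```

On this noise-free toy input the estimates should closely match the simulated proportions. With real data, the quality of the estimate depends on choosing CpGs that discriminate well between cell types.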

Conclusions

Our findings underscore the importance of considering cell composition variability in epigenetic studies based on whole blood and other heterogeneous tissue sources. We also provide software for estimating and exploring this composition confounding for the Illumina 450k microarray.

Andrew E Jaffe¹ and Rafael A Irizarry²

Associated Data

Supplementary Materials: Additional file 1: Table S1 Studies included in the cellular composition analyses. 'Dataset' refers to each study used in the paper, followed by its citation (see References for full citation); 'N' is the number of samples included from each study; 'GEO ID' is the Gene Expression Omnibus identifier; 'Primary Outcome' is the main disease or trait reported by the referenced article - note that only some datasets were primarily focused on age; 'Median Age [IQR] (yrs)' is the median age of the study participants, followed by their interquartile range (25th percentile, 75th percentile), in years.
gb-2014-15-2-r31-S1.xlsx (13K)
Additional file 2: Figure S1. Differential DNA methylation by cell composition. Figure S2. Contributions of age and cell type to cell-sorted DNAm data. Figure S3. Age versus cell type for Liu et al. [5] and Hannum et al. [10] studies. Figure S4. Global variation in DNA methylation by composition, by study sample (Additional file 1). Figure S5. Composition P-values from previously reported age-associated differentially methylated regions. Figure S6. Composition confounding in Alisch et al. [8]. Figure S7. Removal of samples with outlying granulocyte counts. Figure S8. Differences between sorted profiles on the Illumina 27k versus the Illumina 450k. Figure S9. Cross-validated cell counts. Figure S10. Validation of algorithm using brain data.
gb-2014-15-2-r31-S2.pdf (4.6M)
Additional file 3: Table S2 Association of each probe on the Illumina 450k with blood cell composition. Note that probes on the sex chromosomes and those that contain annotated SNPs have been filtered (see Materials and methods). We recommend using the CpG identifiers to match each probe from a user’s differential methylation analysis in whole blood to its corresponding composition P-value. If many sites that are significantly differentially methylated for the exposure, outcome, or trait of interest also have small composition P-values, this may be a sign of confounding via composition differences; in that case we recommend estimating cellular components using the minfi Bioconductor package and formally exploring the potential correlation between the trait, composition, and DNAm. 'Name' refers to the CpG identifier from the Illumina 450k; 'Fstat' and 'p.value' are the F-statistic and corresponding P-value for composition from the ANOVA containing six samples/biological replicates per cell type across six cell types (n=36; see Materials and methods); 'CD8T_mean', 'CD4T_mean', 'NK_mean', 'Bcell_mean', 'Mono_mean', and 'Gran_mean' are the mean DNAm across the six replicates of CD8+ T cells, CD4+ T cells, natural killer cells, B cells, monocytes, and granulocytes, respectively, each on the beta/proportion methylation scale; 'DNAm_min' and 'DNAm_max' are the minimum and maximum beta values, respectively, across the 36 samples at each locus; 'DNAm_range' is the range of beta values.
gb-2014-15-2-r31-S3.zip (27M)
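
The matching step recommended for Table S2 can be sketched as follows. This is an illustrative outline only: it assumes the 'Name' and 'p.value' columns have already been loaded into a dictionary, and the function name `composition_screen`, the cutoff of 1e-8, and the CpG identifiers are hypothetical placeholders rather than values from the paper.

```python
def composition_screen(hit_ids, composition_pvalues, threshold=1e-8):
    """Check trait-associated CpGs against composition P-values.

    hit_ids:             CpG identifiers significant in the user's own analysis.
    composition_pvalues: mapping from CpG identifier ('Name') to 'p.value'.
    threshold:           assumed cutoff for a "small" composition P-value.
    """
    # Filtered probes (sex chromosomes, annotated SNPs) will not be in the
    # table, so report them separately instead of failing.
    unmatched = [cpg for cpg in hit_ids if cpg not in composition_pvalues]
    matched = [cpg for cpg in hit_ids if cpg in composition_pvalues]
    flagged = sorted(cpg for cpg in matched if composition_pvalues[cpg] < threshold)
    share = len(flagged) / len(matched) if matched else 0.0
    return share, flagged, unmatched


# Made-up identifiers and P-values, for illustration only.
composition_pvalues = {
    "cg_demo_1": 1e-20,
    "cg_demo_2": 0.4,
    "cg_demo_3": 3e-12,
    "cg_demo_4": 0.07,
}
hits = ["cg_demo_1", "cg_demo_3", "cg_demo_4", "cg_demo_9"]
share, flagged, unmatched = composition_screen(hits, composition_pvalues)
```

A large share of flagged sites would suggest, as the description above notes, that composition differences may be confounding the trait association and that cell proportions should be estimated and examined directly.
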
Additional file 4: Table S3 Previously published results for age-associated differential methylation in blood. 'Study (Reference)' refers to a particular study, along with its reference, that reported age-associated differentially methylated regions (aDMRs); 'Platform' is the DNA methylation microarray platform used by the study - '450k' is the Illumina 450k, '27k' is the Illumina 27k and 'CHARM 2.0' is the second generation of the Comprehensive High-Throughput Arrays for Relative Methylation platform. '# of aDMRs' reports the number of differentially methylated loci associated with age - the number to the left of the backslash is the number reported at genome-wide significance (as determined by the respective publication) and the number to the right is the number of significant sites available as a Supplementary Table from the respective manuscript; 'SVA?' displays whether surrogate variable analysis was used in the paper, which may have partially adjusted for blood cell composition effects.
gb-2014-15-2-r31-S4.xlsx (10K)
Additional file 5: Table S4 Gene Ontology (GO) enrichment before and after removing Illumina 450k probes associated with cellular composition. 'GO ID' refers to the GO identifier; 'Background' refers to all of the probes on the Illumina 450k that mapped to an Entrez Gene ID; 'Before' refers to age-associated probes that were not filtered by whether they associated with cellular composition; 'After' refers to age-associated probes after those probes associated with cellular composition were filtered from the analysis; 'Number of Probes Enriched' is the number of probes that mapped to that GO category for each condition; 'Expected Number of Probes' is the expected number of probes, assuming no enrichment, for each category; 'Observed/Expected Ratio' is the ratio of observed to expected counts, a.k.a. the odds ratio; 'GO Term' is the biological term corresponding to each GO ID; 'Set Size' is the number of genes for each GO set. 'Ontology' refers to the three GO classifications - molecular function ('MF'), biological processes ('BP'), and cellular component ('CC'); 'Rank' refers to the P-value rank, smallest to largest, before and after filtering age-associated probes also associated with cellular composition.
gb-2014-15-2-r31-S5.xlsx (169K)
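
As a worked example of the 'Expected Number of Probes' and 'Observed/Expected Ratio' columns, the sketch below assumes the usual no-enrichment expectation: the number of selected probes multiplied by the category's share of the background. The table description does not state how the expected counts were computed, so this formula, the function name `observed_expected_ratio`, and all counts are assumptions for illustration.

```python
def observed_expected_ratio(observed, n_selected, category_size, background_size):
    """Return (expected count, observed/expected ratio) for one GO category.

    Assumes that without enrichment, selected probes fall into the category
    in proportion to its share of the background probes.
    """
    expected = n_selected * category_size / background_size
    return expected, observed / expected


# Made-up counts: 1,000 selected probes, a category covering 500 of
# 100,000 background probes, and 20 selected probes observed in it.
expected, ratio = observed_expected_ratio(
    observed=20, n_selected=1000, category_size=500, background_size=100000
)
```

Under these made-up counts the expected number is 5 and the ratio is 4, meaning the category holds four times as many selected probes as chance alone would predict.
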

Data Availability Statement: All datasets are publicly available in the GEO database [27] at the accessions available in Additional file 1.