Cell Syst. Author manuscript; available in PMC 2018 Mar 12.
Published in final edited form as:
PMCID: PMC5846465
NIHMSID: NIHMS804301
PMID: 27467249

Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments

Abstract

Hi-C experiments explore the three-dimensional structure of the genome, generating terabases of data to create high-resolution contact maps. Here, we introduce Juicer, an open-source tool for analyzing terabase-scale Hi-C datasets. Juicer allows users without a computational background to transform raw sequence data into normalized contact maps with one click. Juicer produces a hic file containing compressed contact matrices at many resolutions, facilitating visualization and analysis at multiple scales. Structural features, such as loops and domains, are automatically annotated. Juicer is available as open-source software at http://aidenlab.org/juicer/.

Main Text

Hi-C experiments probe the three-dimensional structure of DNA and chromatin by ligating and sequencing DNA loci that are spatially proximate to one another (Lieberman-Aiden and Van Berkum et al., 2009; Rao and Huntley et al., 2014). The resulting maps reflect patterns of physical contact between loci, making it possible to deduce how loci are organized in 3D.

Efforts to improve the resolution of 3D maps have caused the amount of DNA sequence produced from Hi-C experiments to skyrocket. Our original maps, derived from 30 million reads and 16 Gb of DNA sequence, described the genome at 1 megabase resolution (Lieberman-Aiden and Van Berkum et al., 2009). In contrast, we recently generated 6.5 billion reads and 1.6 Tb of DNA sequence in order to create a single 3D map of the genome at kilobase resolution (Rao and Huntley et al., 2014).

Although pipelines for Hi-C data analysis exist (Lieberman-Aiden and Van Berkum et al., 2009; Schmid et al., 2015; Servant et al., 2015; Suria et al., 2015), these packages are not designed to process datasets at the terabase scale or to annotate the structural features that these maps reflect. Moreover, when designing tools that require high-performance computation, ensuring reliability and ease-of-use across software platforms and hardware instances becomes a crucial desideratum. Ensuring such compatibility can be a considerable engineering challenge.

Here, we introduce Juicer, an easy-to-use, fully automated pipeline for the processing and annotation of data from Hi-C and other contact mapping experiments. Juicer is closely based on the algorithms that we recently developed to analyze and annotate our terabase-scale Hi-C experiments (Rao and Huntley et al., 2014). To meet the engineering challenge of handling such massive datasets, Juicer supports the use of parallelization and hardware acceleration whenever possible, including CPU clusters, general-purpose graphics processing units (GP-GPUs), and field-programmable gate arrays (FPGAs). Juicer is also compatible with a variety of cloud and cluster architectures.

Juicer comprises three tools, which are designed to be run one after another.

First, Juicer transforms raw sequence data into a list of Hi-C contacts (pairs of genomic positions that were adjacent to each other in three-dimensional space during the experiment). To accomplish this, read pairs are aligned to the genome; both duplicates and near-duplicates are removed, and read pairs that align to three or more locations are set aside. When appropriate hardware is available, this procedure can be accelerated, either by parallelizing across multiple CPUs or by using an FPGA (see Table 1).
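The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not Juicer's implementation: the `ReadPair` record layout, the `n_locations` field, and the `wobble` tolerance used to decide what counts as a near-duplicate are all assumptions made for the example.

```python
from typing import List, NamedTuple


class ReadPair(NamedTuple):
    """One aligned read pair (hypothetical record layout for this sketch)."""
    chrom1: str
    pos1: int
    strand1: int
    chrom2: str
    pos2: int
    strand2: int
    n_locations: int  # how many genomic locations the pair aligned to


def to_contacts(pairs: List[ReadPair], wobble: int = 4) -> List[ReadPair]:
    """Return a deduplicated contact list from aligned read pairs."""
    # Set aside read pairs that align to three or more locations.
    kept = [p for p in pairs if p.n_locations <= 2]
    # Sort so that candidate duplicates end up next to each other.
    kept.sort(key=lambda p: (p.chrom1, p.chrom2, p.strand1, p.strand2,
                             p.pos1, p.pos2))

    contacts: List[ReadPair] = []
    for p in kept:
        group = (p.chrom1, p.chrom2, p.strand1, p.strand2)
        duplicate = False
        # Walk back over accepted contacts whose first position lies
        # within `wobble` bases of the current pair (assumed tolerance).
        for q in reversed(contacts):
            if (q.chrom1, q.chrom2, q.strand1, q.strand2) != group:
                break
            if p.pos1 - q.pos1 > wobble:
                break
            if abs(p.pos2 - q.pos2) <= wobble:
                duplicate = True
                break
        if not duplicate:
            contacts.append(p)
    return contacts
```

Sorting first means that duplicates only need to be searched for among recently accepted contacts, which is one reason a merge sort step precedes duplicate removal in a pipeline of this kind.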

Table 1

Using Juicer to process 1.5 billion paired-end Hi-C reads on different cluster systems. “RAM (GB)” and “VM (GB)” are the maximum RAM and virtual memory, respectively, used for each task. Loop annotation was not performed on the Broad cluster, which does not offer GPUs. See Table S1.

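| System | Amazon Web Services g2.8xlarge | Broad Univa Grid Engine | Rice PowerOmics | Rice PowerOmics + FPGA |
| --- | --- | --- | --- | --- |
| CPU | Intel Xeon E5-2670 @2.60GHz | Intel Xeon X5650 @2.66GHz | IBM POWER8E @2.061GHz revision: 2.1 | IBM POWER8E @2.061GHz revision: 2.1 |
| Cores/node | 4×8 cores | 4×6 cores | 2×24 cores | 2×24 cores |
| RAM | 60GB | 32GB | 256GB | 256GB |
| Cluster OS | OpenLava 2.2 (LSF Compatible) | UGE 8.3.0 | Slurm 14.11.8 | Slurm 14.11.8 |
| GPU | NVIDIA Quadro K5000 | None | NVIDIA Tesla K80 | NVIDIA Tesla K80 |
| FPGA | None | None | None | Edico Genome DRAGEN Bio-IT Platform |
| Max Parallel Cores | 32 | 1200 | 1536 | 1536 |

| Task | AWS Core Hours (hr:min) | AWS RAM (GB) | AWS VM (GB) | Broad Core Hours (hr:min) | Broad RAM (GB) | Broad VM (GB) | Rice Core Hours (hr:min) | Rice RAM (GB) | Rice VM (GB) | Rice + FPGA Core Hours (hr:min) | Rice + FPGA RAM (GB) | Rice + FPGA VM (GB) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Align | 8744:49 | 12.3 | 13.5 | 11614:07 | 10.8 | 11.9 | 4221:29 | 13.1 | 14.0 | 1:29 | 0 | 0 |
| Merge Sort | 35:36 | 9.9 | 10.1 | 117:03 | 8.7 | 198.1 | 452:13 | 14.0 | 120.0 | 426:30 | 30.0 | 120.0 |
| Duplicate Removal | 12:21 | 0.5 | 0.5 | 17:04 | 0.4 | 0.5 | 3:12 | 0.4 | 0.0 | 1:28 | 0.4 | 0.0 |
| .hic Creation | 112:43 | 21.8 | 34.9 | 209:43 | 13.4 | 19.5 | 139:17 | 19.3 | 8 | 177:04 | 19.3 | 8 |
| Feature Annotation | 2:07 | 10.5 | 139.3 | 1:04 | 6.4 | 19.5 | 3:25 | 4.2 | 9.1 | 4:28 | 77.1 | 9.1 |
| Total | 8906:11 | | | 11959:01 | | | 4819:36 | | | 608:59 | | |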

Next, the catalog of contacts is used to create contact matrices. To do so, the linear genome is partitioned into loci of a fixed size, or “resolution” (e.g., 1 Mb or 1 kb). These loci correspond to the rows and columns of a contact matrix; each entry in the matrix reflects the number of contacts observed between the corresponding pair of loci during a Hi-C experiment. Due to factors such as chromatin accessibility, certain loci are observed more frequently in Hi-C experiments. Juicer can adjust for these biases in multiple ways. The options include our original normalization scheme (Lieberman-Aiden and Van Berkum et al., 2009), as well as a matrix balancing scheme that ensures that each row and column of the contact matrix sums to the same value (Knight and Ruiz, 2012). A wide array of quality statistics is also calculated, making it possible to assess the success and reliability of a given experiment before the costly deep-sequencing step.
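As a rough illustration of binning and balancing, the sketch below builds a symmetric contact matrix for a single chromosome and then rescales it so that every non-empty row sums to 1. The balancing loop is a simple damped iterative scaling, swapped in for brevity; it is not the Knight and Ruiz algorithm, although both aim for equal row and column sums. The function names and the `(pos1, pos2)` contact representation are assumptions made for the example.

```python
def contact_matrix(contacts, chrom_length, resolution):
    """Bin (pos1, pos2) contacts on one chromosome into a symmetric matrix."""
    n_bins = -(-chrom_length // resolution)  # ceiling division
    matrix = [[0.0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in contacts:
        i, j = pos1 // resolution, pos2 // resolution
        matrix[i][j] += 1.0
        if i != j:
            matrix[j][i] += 1.0  # keep the matrix symmetric
    return matrix


def balance(matrix, max_iter=500, tol=1e-9):
    """Find per-locus biases so that M[i][j] / (b[i] * b[j]) has unit row sums.

    Damped iterative scaling for a symmetric matrix; empty rows are left alone.
    """
    n = len(matrix)
    bias = [1.0] * n
    for _ in range(max_iter):
        row_sums = [
            sum(matrix[i][j] / (bias[i] * bias[j]) for j in range(n))
            for i in range(n)
        ]
        nonzero = [s for s in row_sums if s > 0]
        if not nonzero or max(abs(s - 1.0) for s in nonzero) < tol:
            break
        # The square root damps the update, which avoids oscillation.
        bias = [b * s ** 0.5 if s > 0 else b for b, s in zip(bias, row_sums)]
    balanced = [
        [matrix[i][j] / (bias[i] * bias[j]) for j in range(n)]
        for i in range(n)
    ]
    return balanced, bias
```

Because the matrix is symmetric, equal row sums imply equal column sums, so a single bias vector is enough.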

The contact matrices generated in this way are stored efficiently in a compressed format, which is designed to facilitate all subsequent computations. For instance, 1 terabyte of raw sequencing data is represented as an 80 gigabyte hic file containing normalized and non-normalized contact matrices at 18 different resolutions, from 2.5 Mb resolution to single restriction fragment resolution for a 4-cutter restriction enzyme (~400 bp). Contact matrices in the hic format can also be visualized using Juicebox, which is described in the accompanying paper.

Finally, Juicer contains a suite of algorithms that are designed to annotate contact matrices and thus identify features of genome folding. These features include loops, loop anchor motifs, and contact domains.

Loops are identified using the HiCCUPS algorithm (Rao and Huntley et al., 2014), which searches for clusters of contact matrix entries in which the frequency of contact is enriched relative to the local background. Since there are trillions of pixels in a kilobase-resolution Hi-C map, HiCCUPS is implemented using GP-GPUs. Given CTCF and/or cohesin ChIP-Seq tracks for the same cell type, HiCCUPS can frequently use FIMO (Grant et al., 2011) to identify the CTCF motif that serves as the anchor for each loop. We recently performed CRISPR experiments disrupting seven different CTCF motifs, each of which was identified by HiCCUPS as the anchor of one or more loops. In each case, disruption of the motif led to disruption of the corresponding loop, thus confirming the accuracy of HiCCUPS loop anchor annotations (Sanborn and Rao et al., 2015).
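To make "enriched relative to the local background" concrete, here is a deliberately simplified sketch. It compares one pixel with the mean of a square ring of surrounding pixels. HiCCUPS itself uses several neighborhood shapes, expected-value models, and statistical testing (Rao and Huntley et al., 2014), none of which is reproduced here; the function name and the `inner` and `outer` parameters are hypothetical.

```python
def local_enrichment(matrix, i, j, inner=1, outer=3):
    """Ratio of pixel (i, j) to the mean of a square ring around it.

    Pixels within `inner` of (i, j) are excluded from the background, and
    pixels farther than `outer` are ignored. Returns None when the
    background is empty or zero.
    """
    n = len(matrix)
    background = []
    for r in range(max(0, i - outer), min(n, i + outer + 1)):
        for c in range(max(0, j - outer), min(n, j + outer + 1)):
            if max(abs(r - i), abs(c - j)) > inner:
                background.append(matrix[r][c])
    if not background:
        return None
    mean = sum(background) / len(background)
    if mean <= 0:
        return None
    return matrix[i][j] / mean
```

Each pixel can be scored independently of the others, which is why this kind of computation maps naturally onto GPUs.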

Contact domains are identified using a dynamic programming algorithm that relies on applying the Arrowhead transformation, $A_{i,i+d} = (M^*_{i,i-d} - M^*_{i,i+d}) / (M^*_{i,i-d} + M^*_{i,i+d})$, to a normalized contact matrix $M^*$ (Rao and Huntley et al., 2014). Many of these domains are associated with loops, and can be disrupted by manipulating the corresponding loop anchors (Sanborn and Rao et al., 2015).
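The Arrowhead transformation quoted above is straightforward to compute directly. The sketch below applies it to a small normalized matrix stored as a list of lists. It covers only the transformation, not the dynamic programming step that turns the transformed matrix into domain calls; leaving entries at zero where the denominator vanishes or an index falls off the matrix is a choice made for this example.

```python
def arrowhead(m):
    """Apply A[i][i+d] = (M[i][i-d] - M[i][i+d]) / (M[i][i-d] + M[i][i+d])."""
    n = len(m)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in range(1, n):
            if i - d < 0 or i + d >= n:
                break  # one of the two entries falls outside the matrix
            upstream, downstream = m[i][i - d], m[i][i + d]
            total = upstream + downstream
            if total > 0:
                a[i][i + d] = (upstream - downstream) / total
    return a
```

An entry is negative when locus i contacts the downstream locus i+d more often than the upstream locus i-d, and positive in the opposite case.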

It is frequently useful to examine the cumulative signal from a large number of putative features at once, including both loops and domains. To this end, Juicer includes an implementation of Aggregate Peak Analysis (Rao and Huntley et al., 2014).
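The idea behind aggregating signal over many putative features can be sketched as follows: cut out a fixed-size window around each feature, sum the windows, and compare the center of the sum with its surroundings. This is an illustrative reduction of the approach, not Juicer's APA implementation; the `half_width` parameter and the center-to-surroundings score are assumptions made for the example.

```python
def aggregate_peaks(matrix, peaks, half_width=2):
    """Sum the windows centered on each (i, j) peak; skip peaks near the edge."""
    n = len(matrix)
    size = 2 * half_width + 1
    total = [[0.0] * size for _ in range(size)]
    used = 0
    for i, j in peaks:
        if (i - half_width < 0 or j - half_width < 0
                or i + half_width >= n or j + half_width >= n):
            continue  # window would extend past the matrix
        for di in range(size):
            for dj in range(size):
                total[di][dj] += matrix[i - half_width + di][j - half_width + dj]
        used += 1
    return total, used


def center_score(aggregate):
    """Center pixel divided by the mean of all other pixels in the window."""
    size = len(aggregate)
    mid = size // 2
    others = [aggregate[r][c] for r in range(size) for c in range(size)
              if (r, c) != (mid, mid)]
    mean = sum(others) / len(others)
    return aggregate[mid][mid] / mean if mean > 0 else None
```

A score well above 1 suggests that the putative features are, in aggregate, enriched relative to their surroundings, even when individual features are too weak to stand out on their own.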

Juicer is an open-source project. It is available at github.com/theaidenlab/juicer as a series of packages designed for a variety of hardware configurations: either a single machine, or clusters that run LSF, Univa Grid Engine, or SLURM. In addition, Juicer is available on the cloud at Amazon Web Services. Table 1 displays different performance metrics on each cluster system; the details of each setup are in the supplemental text. Once installed, Juicer can be executed using a single command, by users without informatics experience.

Experimental Methods

All algorithms and data are drawn from Rao and Huntley et al., 2014, except as described in the supplement.

Figure. Juicer analyzes terabases of Hi-C data with one click

(A) Sequenced read pairs (horizontal bars) are aligned to the genome in parallel. Color indicates genomic position. Read pairs aligning to more than two positions are excluded. Those remaining are sorted by position and merged into a single list, at which point duplicate reads are removed. The .hic file stores contact matrices at many resolutions, which can be loaded into Juicebox for visualization. See Table S2. (B) Contact domains (yellow) are annotated using the Arrowhead algorithm. (C) Loops (cyan) are annotated using HiCCUPS.

Acknowledgments

Supported by NIH New Innovator Award 1DP2OD008540, NIH 4D Nucleome Grant U01HL130010, NSF Physics Frontier Center PHY-1427654, NHGRI HG006193, Welch Foundation Q-1866, Cancer Prevention Research Institute of Texas Scholar Award R1304, an NVIDIA Research Center Award, an IBM University Challenge Award, a Google Research Award, a McNair Medical Institute Scholar Award, and the President’s Early Career Award in Science and Engineering to E.L.A.; an NHGRI grant (HG003067) to E.S.L.; and a PD Soros Fellowship to S.S.P.R. The Rice PowerOmics cluster was a gift from IBM.

Footnotes

Author Contributions: E.L.A. conceived of this project; N.C.D. created the pipeline; S.S.P.R. created HiCCUPS; M.H.H. created APA; M.H.H. and N.C.D. created Arrowhead; M.S.S. re-implemented all feature annotation algorithms in Java as fully-automated, end-to-end tools; I.M. ported the pipeline to SLURM and AWS; N.C.D., M.S.S., I.M., and E.S.L. contributed to tool development; N.C.D. and E.L.A. prepared the manuscript.

The software and test data sets used to review this manuscript are available at http://dx.doi.org/10.17632/c6bg4cbggn.1

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018.
  • Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2012;33:1029–1047.
  • Lieberman-Aiden E, van Berkum N, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie B, Sabo P, Dorschner M, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293.
  • Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A Three-dimensional Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159:1665–1680.
  • Sanborn AL, Rao SSP, Huang S, Durand NC, Huntley MH, Jewett AI, Bochkov ID, Chinnappan D, Cutkosky A, Geeting KP, Gnirke A, Melnikov A, McKenna D, Stamenova EK, Lander ES, Aiden EL. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proceedings of the National Academy of Sciences. 2015;112(47):E6456–E6465.
  • Schmid MW, Grob S, Grossniklaus U. HiCdat: a fast and easy-to-use Hi-C data analysis tool. BMC Bioinformatics. 2015;16(1):277.
  • Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology. 2015;16:259.
  • Suria MEG, Phillips-Cremins JE, Corces VG, Taylor J. HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biology. 2015;16:237.
