Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments.

Durand NC; Shamim MS; Machol I; Rao SS; Huntley MH; Lander ES; Aiden EL

doi:10.1016/j.cels.2016.07.002

Cell Systems, 01 Jul 2016, 3(1):95-98
https://doi.org/10.1016/j.cels.2016.07.002 PMID: 27467249 PMCID: PMC5846465

A comment on this article appears in "Minute-Made Data Analysis: Tools for Rapid Interrogation of Hi-C Contacts." Mol Cell. 2016 Oct 6;64(1):9-11.

Abstract

Hi-C experiments explore the 3D structure of the genome, generating terabases of data to create high-resolution contact maps. Here, we introduce Juicer, an open-source tool for analyzing terabase-scale Hi-C datasets. Juicer allows users without a computational background to transform raw sequence data into normalized contact maps with one click. Juicer produces a hic file containing compressed contact matrices at many resolutions, facilitating visualization and analysis at multiple scales. Structural features, such as loops and domains, are automatically annotated. Juicer is available as open source software at http://aidenlab.org/juicer/.

Cell Syst. Author manuscript; available in PMC 2018 Mar 12.

Published in final edited form as:

Cell Syst. 2016 Jul; 3(1): 95–98.

https://doi.org/10.1016/j.cels.2016.07.002

PMCID: PMC5846465

NIHMSID: NIHMS804301

PMID: 27467249

Neva C. Durand,¹,²,³,⁴,* Muhammad S. Shamim,¹,²,³,* Ido Machol,¹,²,³ Suhas S. P. Rao,¹,²,³,⁵ Miriam H. Huntley,¹,²,³,⁶ Eric S. Lander,⁴,⁷,⁸ and Erez Lieberman Aiden¹,²,³,⁴,⁹

¹The Center for Genome Architecture, Baylor College of Medicine, Houston, TX 77030, USA

²Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA

³Department of Computer Science, Department of Computational and Applied Mathematics, Rice University, Houston, TX 77005, USA

⁴Broad Institute of Harvard and Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, USA

⁵School of Medicine, Stanford University, Stanford, CA 94305, USA

⁶John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138, USA

⁷Department of Biology, MIT, Cambridge, MA 02139, USA

⁸Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA

⁹Center for Theoretical Biological Physics, Rice University, Houston, TX 77030, USA

The publisher's final edited version of this article is available at Cell Syst

Associated Data

Supplementary Materials:
1. NIHMS804301-supplement-1.docx (31K)
2. NIHMS804301-supplement-2.xlsx (34K)
3. NIHMS804301-supplement-3.pdf (188K)

Graphical abstract

[Image: nihms804301u1.jpg]

Main Text

Hi-C experiments probe the three-dimensional structure of DNA and chromatin by ligating and sequencing DNA loci that are spatially proximate to one another (Lieberman-Aiden and Van Berkum et al., 2009; Rao and Huntley et al., 2014). The resulting maps reflect patterns of physical contact between loci, making it possible to deduce how loci are organized in 3D.

Efforts to improve the resolution of 3D maps have caused the amount of DNA sequence produced from Hi-C experiments to skyrocket. Our original maps, derived from 30 million reads and 16 Gb of DNA sequence, described the genome at 1 megabase resolution (Lieberman-Aiden and Van Berkum et al., 2009). In contrast, we recently generated 6.5 billion reads and 1.6 Tb of DNA sequence in order to create a single 3D map of the genome at kilobase resolution (Rao and Huntley et al., 2014).

Although pipelines for Hi-C data analysis exist (Lieberman-Aiden and Van Berkum et al., 2009; Schmid et al., 2015; Servant et al., 2015; Sauria et al., 2015), these packages are not designed to process datasets at the terabase scale or to annotate the structural features that these maps reflect. Moreover, when designing tools that require high-performance computation, ensuring reliability and ease of use across software platforms and hardware instances becomes a crucial desideratum. Ensuring such compatibility can be a considerable engineering challenge.

Here, we introduce Juicer, an easy-to-use, fully-automated pipeline for the processing and annotation of data from Hi-C and other contact mapping experiments. Juicer is closely based on the algorithms that we recently developed in order to analyze and annotate our terabase-scale Hi-C experiments (Rao and Huntley et al., 2014). In order to meet the engineering challenge of handling such massive datasets, Juicer supports the use of parallelization and hardware acceleration whenever possible, including CPU clusters, general-purpose graphics processing units (GP-GPUs), and field-programmable gate arrays (FPGAs). Juicer is also compatible with a variety of cloud and cluster architectures.

Juicer comprises three tools, which are designed to be run one after another.

First, Juicer transforms raw sequence data into a list of Hi-C contacts (pairs of genomic positions that were adjacent to each other in three-dimensional space during the experiment). To accomplish this, read pairs are aligned to the genome; both duplicates and near-duplicates are removed, and read pairs that align to three or more locations are set aside. When appropriate hardware is available, this procedure can be accelerated, either by parallelizing across multiple CPUs or by using an FPGA (see Table 1).
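The duplicate-removal step described above can be illustrated with a minimal sketch. It is not Juicer's implementation: the contact tuple layout, the function name `remove_near_duplicates`, and the `wobble` tolerance of 4 bp are all hypothetical choices made for illustration. Two contacts are treated as near-duplicates when their chromosomes and strands match and both positions differ by at most `wobble` base pairs.

```python
from typing import List, Tuple

# Hypothetical contact layout: (chrom1, pos1, strand1, chrom2, pos2, strand2).
Contact = Tuple[str, int, str, str, int, str]


def remove_near_duplicates(contacts: List[Contact], wobble: int = 4) -> List[Contact]:
    """Keep one representative per group of exact or near-duplicate contacts.

    `wobble` is an assumed tolerance in base pairs, not a documented default.
    """
    kept: List[Contact] = []
    # Sort so that contacts sharing chromosomes and strands, with nearby
    # first positions, end up next to each other.
    ordered = sorted(contacts, key=lambda c: (c[0], c[3], c[2], c[5], c[1], c[4]))
    for c in ordered:
        group = (c[0], c[3], c[2], c[5])
        is_duplicate = False
        # Walk back through kept contacts while they remain in the same
        # group and their first position is still within the tolerance.
        for k in reversed(kept):
            if (k[0], k[3], k[2], k[5]) != group or c[1] - k[1] > wobble:
                break
            if abs(c[4] - k[4]) <= wobble:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(c)
    return kept


example = [
    ("chr1", 100, "+", "chr1", 5000, "-"),
    ("chr1", 102, "+", "chr1", 5001, "-"),  # near-duplicate of the first
    ("chr1", 100, "+", "chr1", 5000, "-"),  # exact duplicate of the first
    ("chr1", 100, "+", "chr2", 5000, "-"),  # different second chromosome
    ("chr1", 200, "+", "chr1", 5000, "-"),  # first position too far away
]
unique_contacts = remove_near_duplicates(example)
```

Sorting first keeps the scan local, which matters when the list holds billions of contacts; a production pipeline would do this on disk rather than in memory.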

Table 1

Using Juicer to process 1.5 billion paired-end Hi-C reads on different cluster systems. “RAM (GB)” (resp., “VM (GB)”) is the maximum RAM (resp., virtual memory) used for each task. Loop annotation was not performed on the Broad cluster, which does not offer GPUs. See Table S1.

System configuration

System	Amazon Web Services g2.8xlarge	Broad Univa Grid Engine	Rice PowerOmics	Rice PowerOmics + FPGA
CPU	Intel Xeon E5-2670 @2.60GHz	Intel Xeon X5650 @2.66GHz	IBM zHG160.2@E8REWOP revision: 2.1	IBM zHG160.2@E8REWOP revision: 2.1
Cores/node	4×8 cores	4×6 cores	2×24 cores	2×24 cores
RAM	60GB	32GB	256GB	256GB
Cluster OS	OpenLava 2.2 (LSF Compatible)	UGE 8.3.0	Slurm 14.11.8	Slurm 14.11.8
GPU	NVIDIA Quadro K5000	None	NVIDIA Tesla K80	NVIDIA Tesla K80
FPGA	None	None	None	Edico Genome DRAGEN Bio-IT Platform
Max Parallel Cores	32	1200	1536	1536

Performance by task (AWS = Amazon Web Services g2.8xlarge; Broad = Broad Univa Grid Engine; Rice = Rice PowerOmics; Rice+FPGA = Rice PowerOmics + FPGA)

Task	AWS Core Hours (hr:min)	AWS RAM (GB)	AWS VM (GB)	Broad Core Hours (hr:min)	Broad RAM (GB)	Broad VM (GB)	Rice Core Hours (hr:min)	Rice RAM (GB)	Rice VM (GB)	Rice+FPGA Core Hours (hr:min)	Rice+FPGA RAM (GB)	Rice+FPGA VM (GB)
Align	8744:49	12.3	13.5	11614:07	10.8	11.9	4221:29	13.1	14.0	1:29	0	0
Merge Sort	35:36	9.9	10.1	117:03	8.7	198.1	452:13	14.0	120.0	426:30	30.0	120.0
Duplicate Removal	12:21	0.5	0.5	17:04	0.4	0.5	3:12	0.4	0.0	1:28	0.4	0.0
.hic Creation	112:43	21.8	34.9	209:43	13.4	19.5	139:17	19.3	8	177:04	19.3	8
Feature Annotation	2:07	10.5	139.3	1:04	6.4	19.5	3:25	4.2	9.1	4:28	77.1	9.1
Total	8906:11			11959:01			4819:36			608:59		

Next, the catalog of contacts is used to create contact matrices. To do so, the linear genome is partitioned into loci of a fixed size, or “resolution” (e.g., 1 Mb or 1 kb). These loci correspond to the rows and columns of a contact matrix; each entry in the matrix reflects the number of contacts observed between the corresponding pair of loci during a Hi-C experiment. Due to factors such as chromatin accessibility, certain loci are observed more frequently in Hi-C experiments. Juicer can adjust for these biases in multiple ways. The options include our original normalization scheme (Lieberman-Aiden and Van Berkum et al., 2009), as well as a matrix balancing scheme that ensures that each row and column of the contact matrix sums to the same value (Knight and Ruiz, 2012). A wide array of quality statistics is also calculated, making it possible to assess the success and reliability of a given experiment before the costly deep-sequencing step.
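As a rough illustration of the binning and balancing ideas in this paragraph, the sketch below counts contacts into fixed-size loci on a single chromosome and then rescales the matrix so that every row (and, by symmetry, every column) sums to approximately 1. The function names `bin_contacts` and `balance` are hypothetical. The balancing loop is a simple damped iterative scaling swapped in for readability; it is not the Knight and Ruiz (2012) algorithm, which is designed to reach a balanced matrix much faster.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def bin_contacts(contacts: List[Tuple[int, int]], resolution: int, n_bins: int) -> Matrix:
    """Count contacts between fixed-size loci on one chromosome."""
    m = [[0.0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in contacts:
        i, j = pos1 // resolution, pos2 // resolution
        m[i][j] += 1.0
        if i != j:
            m[j][i] += 1.0  # mirror the count to keep the matrix symmetric
    return m


def balance(m: Matrix, n_iter: int = 200) -> Matrix:
    """Rescale a symmetric matrix so each nonempty row sums to about 1.

    Each locus gets a bias factor; entry (i, j) is divided by
    bias[i] * bias[j]. The square-root update damps the iteration.
    """
    n = len(m)
    bias = [1.0] * n
    for _ in range(n_iter):
        row_sums = [
            sum(m[i][j] / (bias[i] * bias[j]) for j in range(n)) for i in range(n)
        ]
        bias = [b * s ** 0.5 if s > 0 else b for b, s in zip(bias, row_sums)]
    return [[m[i][j] / (bias[i] * bias[j]) for j in range(n)] for i in range(n)]


# Toy data: positions in base pairs, binned at an assumed 1 kb resolution.
toy_contacts = [(100, 200), (100, 300), (100, 1500), (1200, 2500),
                (2100, 2900), (1100, 1300), (500, 2500)]
raw = bin_contacts(toy_contacts, resolution=1000, n_bins=3)
balanced = balance(raw)
```

Dividing by per-locus bias factors reflects the assumption that the observed count is the true contact frequency multiplied by a bias for each of the two loci involved.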

The contact matrices generated in this way are stored efficiently in a compressed format, which is designed to facilitate all subsequent computations. For instance, 1 terabyte of raw sequencing data is represented as an 80 gigabyte hic file containing normalized and non-normalized contact matrices at 18 different resolutions, from 2.5 Mb resolution to single restriction fragment resolution for a 4-cutter restriction enzyme (~400 bp). Contact matrices in the hic format can also be visualized using Juicebox, which is described in the accompanying paper.
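To show how one set of contacts can yield matrices at several resolutions, here is a small sketch that derives a coarser matrix from a finer one by summing blocks of bins. It says nothing about the actual layout of the hic file; the sparse upper-triangle dictionary and the function name `coarsen` are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, Tuple

# Assumed sparse representation: {(i, j): count} with i <= j (upper triangle).
SparseCounts = Dict[Tuple[int, int], float]


def coarsen(counts: SparseCounts, factor: int) -> SparseCounts:
    """Merge every `factor` consecutive bins into one coarser bin."""
    out: Dict[Tuple[int, int], float] = defaultdict(float)
    for (i, j), c in counts.items():
        a, b = i // factor, j // factor
        # Because i <= j, integer division preserves a <= b.
        out[(a, b)] += c
    return dict(out)


fine = {(0, 0): 2.0, (0, 1): 3.0, (1, 2): 1.0, (3, 3): 4.0}
coarse = coarse_2x = coarsen(fine, factor=2)
```

Working on the upper triangle means each contact is counted once, so contacts between two fine bins that fall into the same coarse bin land on the coarse diagonal without being double-counted.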

Finally, Juicer contains a suite of algorithms that are designed to annotate contact matrices and thus identify features of genome folding. These features include loops, loop anchor motifs, and contact domains.

Loops are identified using the HiCCUPS algorithm (Rao and Huntley et al., 2014), which searches for clusters of contact matrix entries in which the frequency of contact is enriched relative to the local background. Since there are trillions of pixels in a kilobase-resolution Hi-C map, HiCCUPS is implemented using GP-GPUs. Given CTCF and/or cohesin ChIP-Seq tracks for the same cell type, HiCCUPS can frequently use FIMO (Grant et al., 2011) to identify the CTCF motif that serves as the anchor for each loop. We recently performed CRISPR experiments disrupting seven different CTCF motifs, each of which was identified by HiCCUPS as the anchor of one or more loops. In each case, disruption of the motif led to disruption of the corresponding loop, thus confirming the accuracy of HiCCUPS loop anchor annotations (Sanborn and Rao et al., 2015).
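The notion of enrichment "relative to the local background" can be made concrete with a deliberately simplified sketch. It compares one pixel with the mean of a ring of surrounding pixels and returns the ratio. This is not HiCCUPS: the statistical testing and clustering of enriched entries are omitted, and the function name `donut_enrichment` and the window sizes `p` and `w` are hypothetical.

```python
from typing import List, Optional


def donut_enrichment(m: List[List[float]], i: int, j: int,
                     p: int = 1, w: int = 3) -> Optional[float]:
    """Ratio of pixel (i, j) to the mean of a surrounding ring of pixels.

    The ring is the (2w+1) x (2w+1) window around (i, j) with the inner
    (2p+1) x (2p+1) window removed. Windows are clipped at the matrix edge.
    Returns None when the ring is empty or contains no signal.
    """
    n = len(m)
    total, count = 0.0, 0
    for a in range(max(0, i - w), min(n, i + w + 1)):
        for b in range(max(0, j - w), min(n, j + w + 1)):
            if abs(a - i) <= p and abs(b - j) <= p:
                continue  # skip the inner window, including the pixel itself
            total += m[a][b]
            count += 1
    if count == 0 or total == 0:
        return None
    return m[i][j] / (total / count)


# Flat background of 1.0 with a single bright pixel in the middle.
grid = [[1.0] * 9 for _ in range(9)]
grid[4][4] = 10.0
peak_ratio = donut_enrichment(grid, 4, 4)
corner_ratio = donut_enrichment(grid, 0, 0)
```

Because every pixel is scored independently of the others, this kind of computation parallelizes naturally, which is consistent with the GP-GPU implementation mentioned above.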

Contact domains are identified using a dynamic programming algorithm that relies on applying the Arrowhead transformation, $A_{i,i+d} = (M^*_{i,i-d} - M^*_{i,i+d})/(M^*_{i,i-d} + M^*_{i,i+d})$, to a normalized contact matrix $M^*$ (Rao and Huntley et al., 2014). Many of these domains are associated with loops, and can be disrupted by manipulating the corresponding loop anchors (Sanborn and Rao et al., 2015).
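The Arrowhead transformation itself is simple to compute. The sketch below applies the formula to a small dense matrix; it covers only the transformation, not the dynamic programming step that calls domains from it. The function name `arrowhead` and the choice to leave entries at 0 when the denominator is 0 are assumptions made for this example.

```python
from typing import List


def arrowhead(m: List[List[float]]) -> List[List[float]]:
    """Compute A[i][i+d] = (M[i][i-d] - M[i][i+d]) / (M[i][i-d] + M[i][i+d])."""
    n = len(m)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in range(1, n):
            if i - d < 0 or i + d >= n:
                break  # one of the two pixels falls outside the matrix
            upstream, downstream = m[i][i - d], m[i][i + d]
            if upstream + downstream > 0:
                a[i][i + d] = (upstream - downstream) / (upstream + downstream)
    return a


# Toy normalized matrix with two domains: bins 0-3 and bins 4-7.
n_bins = 8
toy = [[10.0 if (i < 4) == (j < 4) else 1.0 for j in range(n_bins)]
       for i in range(n_bins)]
transformed = arrowhead(toy)
```

In this toy matrix, a locus at the end of a domain contacts its upstream neighbors more than its downstream ones, giving positive values, while a locus at the start of a domain gives negative values. Pairs that lie inside the same domain on both sides give values near zero.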

It is frequently useful to examine the cumulative signal from a large number of putative features at once, including both loops and domains. To this end, Juicer includes an implementation of Aggregate Peak Analysis (Rao and Huntley et al., 2014).
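The idea of pooling signal across many putative features can be sketched as follows. The code sums equally sized submatrices centered on each feature and then compares the central pixel with the mean of the lower-left corner of the aggregate. It is a simplified illustration, not Juicer's implementation of Aggregate Peak Analysis; the function names, the window half-width `w`, and the choice of summary ratio are assumptions.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def aggregate_peaks(m: Matrix, peaks: List[Tuple[int, int]], w: int = 2) -> Tuple[Matrix, int]:
    """Sum the (2w+1) x (2w+1) submatrices centered on each peak (i, j).

    Returns the aggregate matrix and the number of peaks actually used.
    """
    n = len(m)
    size = 2 * w + 1
    agg = [[0.0] * size for _ in range(size)]
    used = 0
    for i, j in peaks:
        if i - w < 0 or j - w < 0 or i + w >= n or j + w >= n:
            continue  # skip peaks whose window would run off the matrix
        for a in range(size):
            for b in range(size):
                agg[a][b] += m[i - w + a][j - w + b]
        used += 1
    return agg, used


def center_to_corner_ratio(agg: Matrix, corner: int = 1) -> Optional[float]:
    """Central pixel divided by the mean of the lower-left corner block."""
    size = len(agg)
    center = agg[size // 2][size // 2]
    values = [agg[a][b] for a in range(size - corner, size) for b in range(corner)]
    mean = sum(values) / len(values)
    return center / mean if mean > 0 else None


# Flat background with two bright pixels; the third peak is too close to the edge.
background = [[1.0] * 10 for _ in range(10)]
background[3][7] = 5.0
background[2][6] = 5.0
aggregate, n_used = aggregate_peaks(background, [(3, 7), (2, 6), (0, 9)], w=2)
score = center_to_corner_ratio(aggregate)
```

Aggregation helps when individual features are too weak to stand out on their own: a consistent signal at the center of many windows adds up, while unrelated background averages out.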

Juicer is an open-source project. It is available at github.com/theaidenlab/juicer as a series of packages designed for a variety of hardware configurations: either a single machine, or clusters that run LSF, Univa Grid Engine, or SLURM. In addition, Juicer is available on the cloud at Amazon Web Services. Table 1 displays different performance metrics on each cluster system; the details of each setup are in the supplemental text. Once installed, Juicer can be executed using a single command, by users without informatics experience.

Experimental Methods

All algorithms and data are drawn from Rao and Huntley et al., 2014, except as described in the supplement.

Figure 1

Juicer analyzes terabases of Hi-C data with one click

(A) Sequenced read pairs (horizontal bars) are aligned to the genome in parallel. Color indicates genomic position. Read pairs aligning to more than two positions are excluded. Those remaining are sorted by position and merged into a single list, at which point duplicate reads are removed. The .hic file stores contact matrices at many resolutions, which can be loaded into Juicebox for visualization. See Table S2. (B) Contact domains (yellow) are annotated using the Arrowhead algorithm. (C) Loops (cyan) are annotated using HiCCUPS.

Supplementary Material

Acknowledgments

Supported by NIH New Innovator Award 1DP2OD008540, NIH 4D Nucleome Grant U01HL130010, NSF Physics Frontier Center PHY-1427654, NHGRI HG006193, Welch Foundation Q-1866, Cancer Prevention Research Institute of Texas Scholar Award R1304, an NVIDIA Research Center Award, an IBM University Challenge Award, a Google Research Award, a McNair Medical Institute Scholar Award, and the President’s Early Career Award in Science and Engineering to E.L.A.; an NHGRI grant (HG003067) to E.S.L.; and a PD Soros Fellowship to S.S.P.R. The Rice PowerOmics cluster was a gift from IBM.

Footnotes

Author Contributions: E.L.A. conceived of this project; N.C.D. created the pipeline; S.S.P.R. created HiCCUPS; M.H.H. created APA; M.H.H. and N.C.D. created Arrowhead; M.S.S. re-implemented all feature annotation algorithms in Java as fully-automated, end-to-end tools; I.M. ported the pipeline to SLURM and AWS; N.C.D., M.S.S., I.M., and E.S.L. contributed to tool development; N.C.D. and E.L.A. prepared the manuscript.

The software and test data sets used to review this manuscript are available at http://dx.doi.org/10.17632/c6bg4cbggn.1

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018.
Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2012;33:1029–1047.
Lieberman-Aiden E, van Berkum N, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie B, Sabo P, Dorschner M, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293.
Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A Three-dimensional Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159:1665–1680.
Sanborn AL, Rao SSP, Huang S, Durand NC, Huntley MH, Jewett AI, Bochkov ID, Chinnappan D, Cutkosky A, Geeting KP, Gnirke A, Melnikov A, McKenna D, Stamenova EK, Lander ES, Aiden EL. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proceedings of the National Academy of Sciences. 2015;112(47):E6456–E6465.
Sauria MEG, Phillips-Cremins JE, Corces VG, Taylor J. HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biology. 2015;16:237.
Schmid MW, Grob S, Grossniklaus U. HiCdat: a fast and easy-to-use Hi-C data analysis tool. BMC Bioinformatics. 2015;16(1):277.
Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology. 2015;16:259.
