The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized-A New Paradigm in Large-Scale Computational Research.

Authors: Lau JW¹, Lehnert E¹, Sethi A¹, Malhotra R¹, Kaushik G¹, Onder Z¹, Groves-Kirkby N¹, Mihajlovic A¹, DiGiovanna J¹, Srdic M¹, Bajcic D¹, Radenkovic J¹, Mladenovic V¹, Krstanovic D¹, Arsenijevic V¹, Klisic D¹, Mitrovic M¹, Bogicevic I¹, Kural D¹

¹ Seven Bridges Genomics, Cambridge, Massachusetts.


Cancer Research, 01 Nov 2017, 77(21):e3-e6
https://doi.org/10.1158/0008-5472.can-17-0387 PMID: 29092927 PMCID: PMC5832960


This article has been corrected. See Cancer Res. 2018 Sep 1;78(17):5179.

Abstract

The Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org) enables researchers to rapidly access and collaborate on massive public cancer genomic datasets, including The Cancer Genome Atlas. It provides secure on-demand access to data, analysis tools, and computing resources. Researchers from diverse backgrounds can easily visualize, query, and explore cancer genomic datasets visually or programmatically. Data of interest can be immediately analyzed in the cloud using more than 200 preinstalled, curated bioinformatics tools and workflows. Researchers can also extend the functionality of the platform by adding their own data and tools via an intuitive software development kit. By colocalizing these resources in the cloud, the CGC enables scalable, reproducible analyses. Researchers worldwide can use the CGC to investigate key questions in cancer genomics. Cancer Res; 77(21); e3-6. ©2017 AACR.


Cancer Res. Author manuscript; available in PMC 2018 Nov 1.

Published in final edited form as:

Cancer Res. 2017 Nov 1; 77(21): e3–e6.

https://doi.org/10.1158/0008-5472.CAN-17-0387

PMCID: PMC5832960

NIHMSID: NIHMS895282

PMID: 29092927


The publisher's final edited version of this article is available at Cancer Res


Associated Data

Supplementary Materials: 1.
NIHMS895282-supplement-1.mp4 (31M)


Keywords: cancer, genomics, cloud computing, data, bioinformatics


Introduction

As the size and complexity of cancer genomic datasets continue to grow, the availability of scalable compute resources (i.e., the ‘cloud’) facilitates rapid and cost-effective data analysis (1). The Seven Bridges Cancer Genomics Cloud (CGC; www.cancergenomicscloud.org) was funded as a pilot project by the US National Cancer Institute (NCI) to explore novel approaches to democratize access to massive cancer genomic datasets alongside the tools and computational resources to analyze them. The CGC was publicly launched in February 2016 and is open to all cancer researchers worldwide, who can create a free profile online, or log in via their eRA Commons or NIH Center for Information Technology account.

The CGC enables researchers to quickly access The Cancer Genome Atlas (TCGA; 2), which contains genomic, transcriptomic, and clinical data from more than 11,000 cancer patients. TCGA has contributed greatly to understanding the molecular basis of cancer and identifying novel therapeutic targets (3, 4). However, downloading and storing TCGA requires significant time and resources. Furthermore, querying and using the data can be challenging for researchers without adequate computational resources or appropriate technical knowledge. The CGC addresses these challenges to make TCGA and other large cancer genomics datasets usable by a wide range of cancer researchers.


Methods: Scalable cancer genomics analysis in the cloud

The CGC comprises an intuitive interface for rapidly accessing and using large public genomic datasets; comprehensive security controls; preloaded bioinformatics tools and workflows; access to scalable cloud-based computation; a software development kit; an application programming interface (API) for automation; data visualization and querying tools; and extensive support for collaborative, reproducible research (Figure 1A, Video 1). Built on Amazon Web Services’ enterprise-level cloud services, the CGC provides researchers with secure access to public genomic datasets (including TCGA and the Cancer Cell Line Encyclopedia [CCLE; 5]), alongside the high-performance computation needed to analyze them. Because the data are stored together with cloud computing resources, analysis is readily scalable on demand, allowing thousands of samples to be quickly analyzed (Figure 1B).


Figure 1

The Cancer Genomics Cloud is designed to enable scalable cancer genomics research, with features that support usability. A) In addition to hosting public cancer genomics datasets and providing curated analysis tools, the CGC enables users to upload and annotate their own data, as well as integrate their own tools. Data analyses are run using optimized computing resources and executions are recorded to ensure reproducibility. B) Time course of an RNA-Seq quantitation experiment in which more than 9,000 samples were analyzed in parallel. All samples were completed within 100 minutes. C) The visual Data Browser allows users to explore and select data by specifying properties of interest. Once selected, these files can be added to a project for analysis. D) The interactive Case Explorer allows users to visualize and select cases based on type of cancer, genes of interest, types of mutations, and more.

The hosted datasets are complemented by more than 200 biomedical data analysis tools and workflows, including pipelines for variant calling on whole genome and exome sequencing data, differential expression analysis on RNA sequencing data, and complex data visualization. These curated tools are continually revised to include updated versions, according to demand as determined by user interviews and surveys. Tools on the CGC are packaged within Docker containers, a lightweight software virtualization technology (www.docker.com). Execution instructions are described using Common Workflow Language (CWL; www.commonwl.org), an open-source, community-developed specification for describing analysis workflows and tools in a way that is portable and scalable across software and hardware environments. The CGC also offers researchers a robust software development kit, which enables them to easily describe their own tools and custom scripts in CWL for use on the platform. A visual workflow editor allows users to intuitively build reproducible workflows from individual tools.
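To make the packaging model above concrete, here is a minimal sketch of what a CWL v1.0 tool description with a Docker requirement might look like, built as a Python dictionary and serialized to JSON (CWL documents may be written as JSON or YAML). The tool name `my_counter`, the image `example/my-counter:1.0`, and the input and output names are hypothetical placeholders, not tools hosted on the CGC.

```python
import json


def make_cwl_tool(base_command, docker_image, input_name, output_glob):
    """Build a minimal CWL v1.0 CommandLineTool description as a plain dict."""
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        # DockerRequirement names the container image the tool runs in.
        "requirements": [
            {"class": "DockerRequirement", "dockerPull": docker_image}
        ],
        # One positional File input passed on the command line.
        "inputs": {
            input_name: {"type": "File", "inputBinding": {"position": 1}}
        },
        # One File output collected by filename pattern after the run.
        "outputs": {
            "result": {"type": "File", "outputBinding": {"glob": output_glob}}
        },
    }


# All names below are hypothetical and chosen only for illustration.
tool = make_cwl_tool(
    base_command=["my_counter"],
    docker_image="example/my-counter:1.0",
    input_name="alignments",
    output_glob="counts.txt",
)
cwl_json = json.dumps(tool, indent=2, sort_keys=True)
print(cwl_json)
```

Because the description records both the command and the container image, the same document can in principle be handed to any CWL-aware executor rather than being tied to one platform.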

The CGC offers a suite of technologies for visualizing, querying, and exploring complex datasets to identify data of interest. By using a Semantic Web approach to link more than 140 clinical, biospecimen, and analysis metadata properties, the CGC enables researchers to build complex queries both visually and programmatically (Figure 1C). This allows scientists to quickly access, for example, all RNA sequencing count files from normal and tumor samples taken from patients with thyroid cancer who were treated with local radiation therapy. Importantly, this approach is readily extensible to support multi-study integration. We have made data from the CCLE available with TCGA, and we will continue to add further important public datasets. While the Data Browser allows users to explore datasets based on metadata, the Case Explorer focuses on the genetic properties of the data and allows global views of gene expression, copy number variation, and gene mutation status (Figure 1D).
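The thyroid cancer example can be read as a conjunction of metadata constraints. The sketch below illustrates that idea on a small in-memory list; it is not the CGC query interface, and every property name and value (`disease_type`, `sample_type`, `radiation_therapy`, and so on) is a hypothetical stand-in for the platform's actual metadata vocabulary.

```python
def query_files(files, **criteria):
    """Return files whose metadata satisfies every criterion.

    A criterion is either a single required value or a set of accepted values.
    """
    def matches(metadata, key, wanted):
        value = metadata.get(key)
        if isinstance(wanted, (set, frozenset, list, tuple)):
            return value in wanted
        return value == wanted

    return [
        f for f in files
        if all(matches(f["metadata"], k, v) for k, v in criteria.items())
    ]


# Toy records with hypothetical property names and values.
files = [
    {"name": "a.counts.txt", "metadata": {
        "disease_type": "Thyroid carcinoma", "data_type": "RNA-Seq counts",
        "sample_type": "Primary Tumor", "radiation_therapy": "local"}},
    {"name": "b.counts.txt", "metadata": {
        "disease_type": "Thyroid carcinoma", "data_type": "RNA-Seq counts",
        "sample_type": "Solid Tissue Normal", "radiation_therapy": "local"}},
    {"name": "c.counts.txt", "metadata": {
        "disease_type": "Thyroid carcinoma", "data_type": "RNA-Seq counts",
        "sample_type": "Primary Tumor", "radiation_therapy": "none"}},
    {"name": "d.bam", "metadata": {
        "disease_type": "Breast carcinoma", "data_type": "Aligned reads",
        "sample_type": "Primary Tumor", "radiation_therapy": "local"}},
]

hits = query_files(
    files,
    disease_type="Thyroid carcinoma",
    data_type="RNA-Seq counts",
    sample_type={"Primary Tumor", "Solid Tissue Normal"},
    radiation_therapy="local",
)
print([f["name"] for f in hits])
```

Each keyword argument narrows the result set, which mirrors how a visual query is assembled one property at a time.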


Methods: Collaborative, reproducible, extensible and scalable

The CGC has been designed to support best-practice, reproducible scientific research at scale. On the basis of our engagements and collaborations with cancer researchers, and our experience in user-centric software development, we identified key principles to guide the design and implementation of the platform. We suggest that any modern biomedical ‘knowledge cloud’ should follow similar principles.

First, public cancer genomic data need to be usable by all collaborators. Few institutions have the resources to download and manage TCGA, and specialized skills are required to manipulate the data. By contrast, users of the CGC can immediately begin to explore and analyze more than a petabyte of cancer genomics data through a simple web interface. Users benefit from visual tools and guides, documentation, a dedicated user support team, and the continual development of the Seven Bridges core infrastructure upon which the CGC is built. We regularly review the usability of the CGC and incorporate feedback from researchers to create new solutions and features. In addition, collaboration between multiple researchers and institutions promotes scientific advances (6). By colocalizing data, computation, and analysis, the CGC enables distributed, multidisciplinary teams to collaborate on data analysis. Shared project spaces allow approved collaborators to access the same data and workflows, and to see the same results.

Second, reproducibility is required throughout the entire research workflow, from data management to analysis. Replication studies highlight the difficulty of reproducing a number of key studies in cancer biology (7, 8). The CGC ensures reproducibility of computational analyses by recording all aspects of data analysis, including files used, tool versions, and parameter settings. Moreover, because workflows are defined using CWL, they can be readily reproduced by collaborators, reviewers, and journal editors across different computational environments. The CGC currently supports CWL 1.0 via the API, and we are in the process of updating the CGC with Rabix (9), our open-source CWL executor capable of testing and running all CWL applications in any Unix-based environment. Rabix integration will further facilitate developer testing and interoperability with external platforms.
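One way to picture what recording "files used, tool versions, and parameter settings" buys is to treat each execution as a structured record with a stable fingerprint. The following is a sketch under that assumption, not a description of how the CGC stores executions; the field names and values are hypothetical.

```python
import hashlib
import json


def execution_fingerprint(record):
    """Hash a canonical JSON form of an execution record.

    Sorting keys makes the digest independent of dictionary ordering,
    so two records with the same content always hash identically.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical execution records.
record_a = {
    "tool": "my_counter",
    "tool_version": "1.0",
    "inputs": ["sample_001.bam"],
    "parameters": {"min_quality": 20},
}
# Same content, different key order.
record_b = {
    "parameters": {"min_quality": 20},
    "inputs": ["sample_001.bam"],
    "tool_version": "1.0",
    "tool": "my_counter",
}
# One parameter changed.
record_c = dict(record_a, parameters={"min_quality": 30})

print(execution_fingerprint(record_a) == execution_fingerprint(record_b))
print(execution_fingerprint(record_a) == execution_fingerprint(record_c))
```

A collaborator or reviewer holding the same record can then check whether a rerun used exactly the same tool version, inputs, and settings.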

Third, the impact of large public cancer genomics datasets is extended by new tools and data. Besides providing curated biomedical data analysis tools and workflows, we created a software development kit for users to add their own tools and workflows, and a RESTful (10) API supported in multiple languages (Python, R). Using these tools, researchers on the CGC have developed custom pipelines for a variety of scientific applications, including characterizing tumor microbiomes, epitope identification, and segmenting patient populations by gene sequence. Researchers can extend the utility of TCGA data by adding their own private datasets and analyzing them together with the same workflows. Robust upload utilities (including a visual interface, command line, and API) use all available client bandwidth while ensuring secure data transfer. Once uploaded, data can be annotated with more than 40 standard metadata properties or any number of custom properties. Integration and semantic mapping of user-defined metadata to a common ontology is an area of active development.
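As a rough illustration of what programmatic access through a RESTful API involves, the sketch below assembles the pieces of a request that lists the files in a project. The base URL, the `/files` path, the query parameters, and the bearer-token header are placeholders invented for this example; the platform's API documentation defines the real endpoints and authentication scheme.

```python
from urllib.parse import urlencode

# Placeholder base URL; not the platform's real API address.
API_BASE = "https://api.example.org/v2"


def build_list_files_request(project, token, limit=50):
    """Describe a GET request that would list files in a project.

    Returns a plain dict so the request can be inspected before it is
    handed to any HTTP client.
    """
    query = urlencode({"project": project, "limit": limit})
    return {
        "method": "GET",
        "url": f"{API_BASE}/files?{query}",
        "headers": {
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Accept": "application/json",
        },
    }


# Hypothetical project identifier and token.
request = build_list_files_request("alice/thyroid-study", token="MY_TOKEN")
print(request["url"])
```

Keeping request construction separate from sending makes scripts easier to test and keeps credentials out of the code that handles responses.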

Finally, data analyses must be scalable to fully make use of available datasets. The elasticity of the CGC infrastructure means that as data analysis scales up, additional computational resources are allocated to enable parallelization and processing of batch jobs. Cloud computing costs have decreased by more than 80% over the last 10 years (11), making the cloud the most cost-efficient way to analyze large genomic datasets (12). As an example, one researcher was able to perform targeted variant calling across the 11,000 TCGA participants in about three hours for under $15.
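The variant-calling example can be turned into per-participant figures with simple arithmetic. The sketch below treats the reported $15 as an upper bound and three hours as an approximate wall-clock time, so the results are rough bounds rather than measured values.

```python
# Figures reported in the text.
participants = 11_000
total_cost_usd = 15.0      # reported as "under $15", so an upper bound
wall_clock_hours = 3.0     # reported as "about three hours"

# Upper bound on cost per participant, in US dollars.
cost_per_participant = total_cost_usd / participants

# Approximate throughput when the whole batch finishes in three hours.
participants_per_hour = participants / wall_clock_hours

print(f"cost per participant: under ${cost_per_participant:.4f}")
print(f"throughput: about {participants_per_hour:.0f} participants per hour")
```

At well under a cent per participant, the cost of the computation itself is small compared with the cost of downloading and storing the data locally.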


Results: User uptake and impact

Within the 15 months since the launch of the CGC, over 1,900 researchers have registered on the platform, representing 150 institutions across 30 countries. In total, CGC users have deployed more than 5,000 tools or workflows and performed 80,000 executions, representing over 97 years of total computation. There is significant collaboration among users, with an average of seven members per project on the platform.
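These usage figures imply some rough averages, worked out below. The calculation assumes a year of 365.25 days and treats the reported counts as exact, although the text gives several of them as lower bounds ("over"), so the outputs are approximations only.

```python
# Figures reported in the text.
researchers = 1_900
executions = 80_000
compute_years = 97

HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year

total_compute_hours = compute_years * HOURS_PER_YEAR
hours_per_execution = total_compute_hours / executions
executions_per_researcher = executions / researchers

print(f"total compute: about {total_compute_hours:,.0f} hours")
print(f"average per execution: about {hours_per_execution:.1f} hours")
print(f"average executions per researcher: about {executions_per_researcher:.0f}")
```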

The CGC enables a diverse range of research (13, 14). For example, last year, Diermeier and colleagues (13) reported on potentially oncogenic mammary-tumor-associated RNAs (MaTARs). After identifying and characterizing MaTARs using mouse models of breast cancer, they confirmed the relevance of human MaTAR orthologs in clinical breast cancer with the CGC. By analyzing RNA-Seq data available in TCGA, the researchers found that some of the MaTARs were upregulated in breast tumors.


Discussion

The CGC makes massive cancer genomic datasets available and usable for research, while safeguarding privacy and security. This approach is readily extensible: in addition to hosting TCGA, we have made CCLE data available and empowered users to integrate their private data with these public resources. New genomics datasets, including TARGET, CGCI, and Simons diversity data, as well as new types of data (imaging and proteomic data), are being added to further extend the utility of the system. The CGC represents a successful model for democratizing access to and use of massive public datasets, allowing users to maximize their research productivity. Biomedical ‘knowledge clouds’ like this can serve as the gateway to a wide ecosystem of interoperable cloud resources to support scientific discovery.


Supplementary Material

Supplementary file 1 (mp4, 31M).


Acknowledgments

Financial support: The Cancer Genomics Cloud is powered by Seven Bridges and has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN261201400008C.

We thank the entire Seven Bridges team, the Cancer Genomics Cloud Pilot teams from the National Cancer Institute, the Broad Institute, and the Institute for Systems Biology, the Genomic Data Commons team, countless early users, and data donors. Thank you to Laura Tramontozzi for assistance with the figure.


Footnotes

COI statement: All authors are employees of Seven Bridges Genomics.


References

1. Stein LD, Knoppers BM, Campbell P, Getz G, Korbel JO. Data analysis: Create a cloud commons. Nature. 2015;523:149–151.

2. The future of cancer genomics. Nat Med. 2015;21:99.

3. Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17:98–110.

4. The Cancer Genome Atlas Research Network. Integrated genomic and molecular characterization of cervical cancer. Nature. 2017. doi:10.1038/nature21386.

5. Barretina J, Caponigro G, Stransky N, Venkatesan K, Margolin AA, Kim S, et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature. 2012;483:603–607.

6. Wuchty S, Jones BF, Uzzi B. The increasing dominance of teams in production of knowledge. Science. 2007;316:1036–1039.

7. Errington TM, Iorns E, Gunn W, Tan FE, Lomax J, Nosek BA. An open investigation of the reproducibility of cancer biology research. Elife. 2014. doi:10.7554/eLife.04333.

8. Nosek BA, Errington TM. Making sense of replications. Elife. 2017. doi:10.7554/eLife.23383.

9. Kaushik G, Ivkovic S, Simonovic J, Tijanic N, Davis-Dusenbery B, Kural D. Rabix: an open-source workflow executor supporting recomputability and interoperability of workflow descriptions. Pac Symp Biocomput. 2016;22:154–165.

10. Fielding R. Architectural Styles and the Design of Network-based Software Architectures. University of California; Irvine: 2000. Chapter 5: Representational State Transfer (REST). http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

11. Barr J. AWS Storage Update – S3 & Glacier Price Reductions + Additional Retrieval Options for Glacier. Amazon Web Services. 2016. https://aws.amazon.com/blogs/aws/aws-storage-update-s3-glacier-price-reductions/

12. Stein LD, Knoppers BM, Campbell P, Getz G, Korbel JO. Data analysis: Create a cloud commons. Nature. 2015;523:149–151.

13. Diermeier SD, Chang KC, Freier SM, Song J, El Demerdash O, Krasnitz A, et al. Mammary Tumor-Associated RNAs Impact Tumor Cell Proliferation, Invasion, and Migration. Cell Rep. 2016;17:261–274.

14. Collado-Torres L, Nellore A, Kammers K, Ellis SE, Taub MA, Hansen KD, et al. recount: A large-scale resource of analysis-ready RNA-seq expression data. bioRxiv. 2016. https://doi.org/10.1101/068478
