Abstract
Background
Exome or whole-genome deep sequencing of tumor DNA along with paired normal DNA can potentially provide a detailed picture of the somatic mutations that characterize the tumor. However, analysis of such sequence data can be complicated by the presence of normal cells in the tumor specimen, by intratumor heterogeneity, and by the sheer size of the raw data. In particular, determination of copy number variations from exome sequencing data alone has proven difficult; thus, single nucleotide polymorphism (SNP) arrays have often been used for this task. Recently, algorithms to estimate absolute, but not allele-specific, copy number profiles from tumor sequencing data have been described.Materials and methods
We developed Sequenza, a software package that uses paired tumor-normal DNA sequencing data to estimate tumor cellularity and ploidy, and to calculate allele-specific copy number profiles and mutation profiles. We applied Sequenza, as well as two previously published algorithms, to exome sequence data from 30 tumors from The Cancer Genome Atlas. We assessed the performance of these algorithms by comparing their results with those generated using matched SNP arrays and processed by the allele-specific copy number analysis of tumors (ASCAT) algorithm.Results
Comparison between Sequenza/exome and SNP/ASCAT revealed strong correlation in cellularity (Pearson's r = 0.90) and ploidy estimates (r = 0.42, or r = 0.94 after manual inspecting alternative solutions). This performance was noticeably superior to previously published algorithms. In addition, in artificial data simulating normal-tumor admixtures, Sequenza detected the correct ploidy in samples with tumor content as low as 30%.Conclusions
The agreement between Sequenza and SNP array-based copy number profiles suggests that exome sequencing alone is sufficient not only for identifying small scale mutations but also for estimating cellularity and inferring DNA copy number aberrations.Free full text
Sequenza: allele-specific copy number and mutation profiles from tumor sequencing data
Associated Data
Abstract
Background
Exome or whole-genome deep sequencing of tumor DNA along with paired normal DNA can potentially provide a detailed picture of the somatic mutations that characterize the tumor. However, analysis of such sequence data can be complicated by the presence of normal cells in the tumor specimen, by intratumor heterogeneity, and by the sheer size of the raw data. In particular, determination of copy number variations from exome sequencing data alone has proven difficult; thus, single nucleotide polymorphism (SNP) arrays have often been used for this task. Recently, algorithms to estimate absolute, but not allele-specific, copy number profiles from tumor sequencing data have been described.
Materials and methods
We developed Sequenza, a software package that uses paired tumor-normal DNA sequencing data to estimate tumor cellularity and ploidy, and to calculate allele-specific copy number profiles and mutation profiles. We applied Sequenza, as well as two previously published algorithms, to exome sequence data from 30 tumors from The Cancer Genome Atlas. We assessed the performance of these algorithms by comparing their results with those generated using matched SNP arrays and processed by the allele-specific copy number analysis of tumors (ASCAT) algorithm.
Results
Comparison between Sequenza/exome and SNP/ASCAT revealed strong correlation in cellularity (Pearson's r = 0.90) and ploidy estimates (r = 0.42, or r = 0.94 after manual inspecting alternative solutions). This performance was noticeably superior to previously published algorithms. In addition, in artificial data simulating normal-tumor admixtures, Sequenza detected the correct ploidy in samples with tumor content as low as 30%.
Conclusions
The agreement between Sequenza and SNP array-based copy number profiles suggests that exome sequencing alone is sufficient not only for identifying small scale mutations but also for estimating cellularity and inferring DNA copy number aberrations.
introduction
Cancer is a genetic disease in which specific mutations or genomic aberrations can enable tumor initiation or progression, and in certain cases can determine the effectiveness of specific anticancer therapies. Several tumor resequencing projects have collected and analyzed genetic material from large cohorts of patients in an effort to identify important somatic events that may represent drug targets or predictive biomarkers [1]. In such projects, nonsynonymous substitutions and short indels in coding regions are typically detected by analyzing exome sequencing data derived from matched pairs of tumor and normal tissues of cancer patients, whereas larger aberrations such as copy number alterations or loss of heterozygosity (LOH) are typically detected using genome-wide single nucleotide polymorphism (SNP) arrays, which remains the current state-of-the-art.
Tumor tissue specimens comprise a mixture of cancer cells and normal cells; therefore, analysis of tumor data must take the specimen cellularity into consideration [2–5]. However, it is currently not possible to make a histological estimate of tumor cellularity and extract high-quality DNA from the very same specimen; therefore, cellularity estimates based on histology are commonly made from an adjacent tumor section which often does not reflect the cellularity of the section used for DNA sequencing. Thus, using the DNA itself to make cellularity estimates is an appealing approach. Several methods have been described that estimate, and then correct for, tumor cellularity in SNP array data in order to improve copy number profiles [2–5] or in DNA sequencing data for mutation calling [6].
Copy number profiles can be inferred from sequencing data of sufficient depth and coverage, by using the relative number of reads mapped to a given genomic position (depth ratio) as an indicator of copy number. This approach has recently been demonstrated in algorithms such as VarScan 2 [7] and APOLLOH [8], wherein inferred copy number profiles from whole-exome sequencing alone (WES profiles) are largely concordant with profiles inferred from SNP array data (SNP profiles). APOLLOH estimates the tumor cellularity, whereas VarScan 2 does not. In addition, algorithms such as PurityEst [9] and PurBayes [10] are specialized to estimate tumor cellularity directly from paired tumor-normal sequence data. Only recently, newer tools including absCN-seq [11] and newer versions of ABSOLUTE [4] have provided methods to estimate cellularity and ploidy and calculate copy number profiles directly from exome sequencing data. In such algorithms, accurate cellularity and ploidy estimation is essential for the generation of correct copy number profiles.
Here we describe Sequenza, a software package that uses paired tumor-normal exome or whole-genome sequencing data to estimate tumor cellularity and ploidy and to infer allele-specific tumor copy number profiles. Using publicly available matched tumor-normal data, we compare the results of exome sequence data analyzed by Sequenza with SNP array data from the same tumors analyzed by allele-specific copy number analysis of tumors (ASCAT). For comparison, we also assess the performance of the previously described algorithms absCN-seq and ABSOLUTE.
materials and methods
algorithm
Sequenza is based on a probabilistic model applied to segmented data. The observations include the average depth ratio (tumor versus normal) and B allele frequency (the lesser of the two allelic fractions as measured at germline heterozygous positions) for each segment. The model parameters include overall tumor ploidy and cellularity, and segment-specific copy number and minor allele copy number. The location of the segments and the segment-level dispersion are taken as known constants. We estimate model parameters using a maximum a posteriori approach in which prior probabilities are defined for the copy number such that two copies (by default) are preferred over other values. Under this model, given values for cellularity and ploidy, the segment-level parameters can be quickly estimated. Thus, we solve the overall estimation problem using a grid-based search over reasonable values of cellularity and ploidy (see supplementary Methods, available at Annals of Oncology online).
implementation
The Sequenza software consists of two distinct parts: a python-based preprocessing tool, and an R package implementing the model fitting and visualization functions (supplementary Figure S1, available at Annals of Oncology online).
The python script ‘sequenza-utils’ has two roles. First, it calculates the GC content in sliding windows from a genome reference file in FASTA format. Second, it processes the sequencing data from the tumor and normal specimens, which must be in the Pileup format, as output by SAMtools [12]. For genomic positions with sufficient sequencing depth (by default, >20 reads total from tumor and normal specimens), the script extracts sequencing depth, determines homozygous and heterozygous positions in the normal specimen, and calculates the variant alleles and allelic frequency from the tumor specimen. The output is a tab-delimited text file suitable for import into R. Additionally, ‘sequenza-utils’ is compatible with the pypy python implementation [13], which performs around six times faster than the standard python implementation.
The ‘sequenza’ R package is used to perform the analysis on the output of the sequenza-utils and is implemented with three high-level functions (supplementary Figure S1B, available at Annals of Oncology online): first, sequenza.extract efficiently reads the input file into R, performs GC-content normalization of the tumor versus normal depth ratio, and performs allele-specific segmentation using the ‘copynumber’ package [14]. Second, sequenza.fit applies the model described in the supplementary Material, available at Annals of Oncology online, to infer cellularity and ploidy parameters and copy number profiles. Alternative solutions are also provided, using local maxima of the posterior probability space. Finally, sequenza.results returns the results of the estimation together with alternative solutions and visualization of the data and the model along the genome and the individual chromosomes.
Detailed methods are available in supplementary Methods, available at Annals of Oncology online. The software has a web page at http://www.cbs.dtu.dk/biotools/sequenza and is freely available from CRAN.
data and analysis
Thousands of specimens are available from the TCGA; we arbitrarily selected the first 10 ovarian serous carcinomas (OVCA) and 20 clear-cell renal cell carcinomas (KIRC) sample IDs as of May 2013, when sorted alphabetically. The SNP arrays for ovarian serous carcinomas and renal clear-cell carcinomas were obtained on 22 January 2010 and 17 November 2011, respectively. Exome sequence data, previously aligned to the human genome version hg19, was obtained in BAM format in May 2013.
The SNP array files were preprocessed using the aroma.affymetrix package [15] as described [16], and copy number variations were determined using ASCAT version 2.1 [3]; sex chromosomes were excluded from the analysis.
The Sequenza results were obtained using version 2.1.0 with default parameters; the input was generated by the python script sequenza-utils.py version 2.1.0 with default binning size of 50 bases for the exome sequencing or 200 bases for the whole-genome sequencing. The absCN-seq results were obtained using version 1.0 with default parameters; the input was the same genomic segments used by Sequenza as well as high-quality somatic mutations calls detected by VarScan2 as described in the software documentation. The ABSOLUTE results were obtained using software version 1.0.6 with default parameters except that the platform was specified as ‘Illumina_WES’; the input was the same genomic segments used with Sequenza and absCN-seq.
Exome sequencing data from 31 of the NCI-60 tumor cell lines, aligned to the genome version hg19, were downloaded in May 2014 in the BAM format [17].
Whole-genome sequencing, aligned to the hg19 genome in the BAM format at ×30 of coverage, of two cell lines HCC1143 and HCC1954, matching normal blood, and simulated admixtures at tumor cellularity of 20%, 40%, 60%, and 80%, were obtained in March 2014 from the TCGA4 benchmark cohort (https://cghub.ucsc.edu/datasets/benchmark_download.html).
All BAM files were processed to remove PCR duplicates and low-quality mappings with Picard, and then converted to pileup format using SAMtools [12].
results
application of sequenza to tumor exome sequencing data
To compare Sequenza WES profiles with the current state-of-the-art, SNP profiles, we obtained paired tumor-normal exome and Affymetrix SNP6 arrays from 10 OVCA patients [18] and 20 KIRC patients [19]. We chose renal and ovarian cancer because these represent two widely different cancer types: clear-cell renal cancer has low cellularity and few copy number variations, whereas ovarian cancer typically shows extensive copy number alterations and high tumor cellularity.
The exome data were processed with Sequenza using default settings. Running on a single CPU core, this required an average per-specimen running time of 4 h for preprocessing, 30 min for segmentation, and 4 min for model fitting and parameter estimation. Results from a representative sample are shown in Figure Figure1.1. Of the 20 renal cancer copy number profiles, 17 exhibited 3p loss (supplementary Figure S5, available at Annals of Oncology online), consistent with previous observations of renal cancer [19].
comparison between exome/Sequenza and SNP array/ASCAT profiles
There is no tumor gold standard that could be used to validate the performance of Sequenza. However, the use of SNP arrays processed by ASCAT is an established approach for determining copy number profiles; therefore, a positive agreement between these two platforms would confirm the performance of Sequenza. Hereafter, for simplicity, we use the terms ‘Sequenza’ and ‘ASCAT’ with the understanding that it is actually the combined measurement platform/software that is being considered. Sequenza and ASCAT both provide estimates of cellularity and ploidy, and we found a strong correlation for both parameters (r = 0.90 and r = 0.42, respectively, Figure Figure2A2A and B, Table Table1).1). Interestingly, the ploidy comparison seems to be characterized by a few large outliers, many of which have low cellularity. Details about three highly discordant samples are shown in supplementary Figures S6–S8, available at Annals of Oncology online.
Table 1.
Algorithm | |||||
---|---|---|---|---|---|
Sequenza | 0.90 (0.91) | 0.42 (0.94) | 0.69 | 0.095 (0.087) | 0.95 (0.25) |
ABSOLUTE | 0.19 (0.61) | 0.13 (0.50) | 0.08 | 0.35 (0.19) | 1.81 (1.08) |
absCN-seq | 0.46 (0.65) | −0.26 (0.46) | 0.02 | 0.16 (0.13) | 1.91 (0.76) |
Both Sequenza and ASCAT return a list of genomic segments, each with an estimated copy number state. However, the breakpoints between segments are different, and the genomic coverage of the two platforms is not the same. We compared only the positions covered in the segmentation for both platforms (Figure (Figure2C).2C). Aside from samples where the Sequenza and ASCAT ploidy estimates disagree, the genome fraction with perfect agreement (
To assess copy number profile agreement between the two platforms in a ploidy-independent manner, we carried out hierarchical clustering of the sample profiles using a Pearson correlation distance metric. In all but one case, sample profiles from the same patient derived from different platforms clustered together (data not shown), and are thus more similar to each other than to other profiles, even when the ploidy call is substantially different between the two algorithms.
comparison with other methods
We are aware of two previously published methods to generate copy number profiles from exome sequencing data in a way that accounts for tumor cellularity: ABSOLUTE [4] and absCN-seq [11]. We assessed the performance of these methods using the same criteria as we used to evaluate Sequenza. Similarly to above, we use the terms ‘ABSOLUTE’ and ‘absCN-seq’ to indicate the results derived from exome sequencing applied to each specific method, and we compared the results of each algorithm with the results from ASCAT applied to SNP array data. To focus the comparison on ploidy and copy number estimation algorithms rather than segmentation algorithms, we used the same segmented input (processed by copynumber [14]) as input to each algorithm. The comparison results for each algorithm are summarized in Table Table11.
First, we compared cellularity and ploidy estimates. The ABSOLUTE estimates of cellularity and ploidy were weakly correlated with ASCAT estimates (r = 0.19 and r = 0.13, supplementary Figure S2A and B). The absCN-seq estimates were moderately correlated with the ASCAT cellularity estimate (r = 0.46), but had a negative correlation with the ASCAT ploidy estimate (r = −0.26, supplementary Figure S3A and B, available at Annals of Oncology online). Next, we compared segment-wise copy number estimates. As expected from the low agreement in ploidy estimates, the majority of samples showed substantial disagreement with ASCAT copy number estimates (supplementary Figure S2C and S3C, available at Annals of Oncology online).
Previous publications of copy number inference algorithms have stated the performance obtained after manual selection of a single solution from a set of multiple solutions proposed by the algorithm. Thus, we manually inspected the list of possible solutions from the three algorithms and selected the solution with best agreement to the SNP array solution. As expected, this resulted in increased accuracy for all three algorithms, with Sequenza obtaining the highest agreement (Table (Table11).
application to cell line data
We applied Sequenza to exome sequencing data from 31 cell lines from the NCI-60 panel [17], and compared the estimated ploidy with previously published modal chromosome numbers derived from spectral karyotyping [20]. These particular samples were selected to compare Sequenza performance with previously published results [11]. To accommodate the lack of matched normals in this dataset, we modified our algorithm to calculate the depth ratio and identify the heterozygous positions from two different sources: we used the near-diploid hematopoietic cell line SR as the normal genome for depth ratio calculation, and the selected cell line itself to determine heterozygous positions. However, with this approach, any LOH regions in the cell line would result in the absence of identified heterozygous positions; thus, we adjusted to zero the B allele frequency of segments with fewer than three heterozygous positions per megabase. Despite the suboptimal input data, we obtained a root mean square error (RMSE) between the karyotype-derived ploidy and Sequenza-estimated ploidy of 1.2 (supplementary Figure S4A, available at Annals of Oncology online), comparable with results of absCN-seq applied to the same data (0.55) [11].
For comparison to previously published results in which manual inspection of solutions was carried out, we carried out a similar analysis in which we visually inspected two to four alternative solutions, and for eight of the samples selected a solution different from the point estimate, resulting in an RMSE of 0.44 (supplementary Figure S4B, available at Annals of Oncology online). This can be compared with previously published results in which absCN-seq obtained an RMSE of 0.34 using the same data [11], or to results obtained with SNP array of the NCI-60 cohort with an RMSE of 0.54 using ABSOLUTE and 0.85 using ASCAT [4].
To assess how ploidy estimation accuracy is affected by low cellularity, we analyzed simulated tumor-normal admixtures at proportions of 100%, 80%, 60%, 40%, and 20% provided by the ‘TCGA benchmark 4’ whole-genome sequencing of the HCC1143 and HCC1954 cell lines [21]. Transformations from the normal-tumor reads admixture percentage to tumor content have to consider the tetraploid genomes of the cell line. Result from the simulations shows that the algorithm estimates the correct ploidy until the cellularity values decrease to below 0.3 (Figure (Figure2D2D and E).
discussion
We have described a simple model to infer accurate copy number profiles from next-generation sequencing data and its implementation in the software package Sequenza. For the majority of specimens we analyzed, we observed a strong agreement between the output of Sequenza and the output from ASCAT using matched SNP array data. The few cases with substantial disagreement in copy number profile seem to stem from disagreement in the ploidy and were more common in specimens with low cellularity. It is possible to determine ploidy experimentally using flow cytometry [22], but this was not carried out on the TCGA specimens. In cases where experimentally derived ploidy data are available, it is possible with Sequenza to explicitly specify the ploidy rather than determine it by model fitting.
One advantage of SNP arrays over exome sequencing is the genomic coverage. SNP arrays are often designed to both determine SNP genotypes and detect copy number changes. In particular, the Affymetrix SNP6.0 platform used for the samples tested in this manuscript covers more than 900 000 positions evenly distributed in the genome for copy number detection, and another 900 000 SNP positions, of which on average ~26% are heterozygous in a given individual. This design allows for highly accurate allele-specific determination of copy number profiles. In contrast, exome enrichment kits are generally not designed for the purpose of determining copy number states. Covered genomic regions are based on known exons, which are on average ~150 bases in size, and are not evenly distributed throughout the genome. Inference of allele-specific copy numbers requires heterozygous positions and thus can only be achieved for those exons that include SNPs. When working with the exome sequencing data, we recorded an average of 45 000 heterozygous positions for each patient, corresponding to ~1/5 the number identified by the SNP arrays.
However, it seems likely that whole-genome sequencing will eventually become more cost efficient and widely used than exome sequencing. Sequenza is compatible with whole-genome data, and we expect this to result in increased accuracy due to better genomic coverage and increased number of heterozygous positions. In fact, when processing available whole-genome sequencing data (data not shown), we identified an average of 1.7 × 106 heterozygous SNPs, and genotyping and depth information for 2.6 × 109 positions.
We are aware of four other methods also designed to estimate copy number profiles in tumor samples of unknown cellularity, but only two of these are designed to work on exome sequencing data. The three algorithms have many common elements in their models, but several important differences. AbsCN-seq uses a least squares method to estimate the most likely model, providing a fast running time; whereas ABSOLUTE and Sequenza use likelihood or posterior probability to estimate the best solution. ABSOLUTE incorporates prior probabilities from previous karyotype analyses, whereas Sequenza uses much simpler prior probabilities on copy numbers that are the same on each segment to estimate the best solution. AbsCN-seq does not incorporate prior probabilities. Additionally, Sequenza and ABSOLUTE provide graphical reports to further inspect the alternative solutions, whereas absCN-seq reports only the numerical alternative cellularity and ploidy values. One possible advantage of Sequenza over the other two algorithms is the use of the B allele frequency, which not only provides additional information beyond the depth ratio, but also enables calculation of allele-specific copy number, whereas the other algorithms provide only absolute copy number profiles. However, the requirement for the B allele frequency is a drawback in cases where it is not possible to accurately determine the heterozygous positions, for example in cell lines where the normal sample is not available.
In our comparison with previously published methods ABSOLUTE and absCN-seq, we found that Sequenza shows substantially stronger agreement with SNP array-based cellularity, ploidy, and copy number estimates. However, in the analysis of cell line exome data, absCN-seq performed better than Sequenza, likely because Sequenza relies on identification of heterozygous positions from a matched normal sample that was not available for these cell lines.
One limitation of Sequenza as well as its competing algorithms is that the segmentation is taken as a given; a more sophisticated analysis would consider uncertainty in the assignment of segment boundaries. Also, Sequenza does not account for possible heterogeneity of mutations within a tumor specimen, which has important consequences for patient diagnosis and for identification of driver mutations [23, 24]. However, it is possible to use the variant allele frequency and corresponding copy number states from Sequenza as input for external software such as PyClone [25] in order to resolve subclonal structures.
funding
This work was supported by the European Commission 7th Framework Programme (HEALTH-2010-F2-259303;); the Danish Council for Independent Research (09-073053/FSS); and the Breast Cancer Research Foundation (to ZS). Funding for open access charge: the Danish Council for Independent Research (09-073053/FSS).
disclosure
The authors have declared no conflicts of interest.
acknowledgement
We thank Anders Gorm Petersen for helpful discussions. The results published here are based on data generated by the Cancer Genome Atlas pilot project established by the NCI and NHGRI. Information about TCGA and the investigators and institutions who constitute the TCGA research network can be found at http://cancergenome.nih.gov/. The data were retrieved through dbGaP authorization (Accession No. phs000178.v8.p7).
references
Articles from Annals of Oncology are provided here courtesy of Oxford University Press
Full text links
Read article at publisher's site: https://doi.org/10.1093/annonc/mdu479
Read article for free, from open access legal sources, via Unpaywall: http://www.annalsofoncology.org/article/S0923753419313237/pdf
Citations & impact
Impact metrics
Article citations
Dual phenotypes in recurrent astrocytoma, IDH-mutant; coexistence of IDH-mutant and IDH-wildtype components: a case report with genetic and epigenetic analysis.
Acta Neuropathol Commun, 12(1):169, 26 Oct 2024
Cited by: 0 articles | PMID: 39456052 | PMCID: PMC11515116
Molecular profiling reveals novel therapeutic targets and clonal evolution in ovarian clear cell carcinoma.
BMC Cancer, 24(1):1403, 14 Nov 2024
Cited by: 0 articles | PMID: 39543535 | PMCID: PMC11566382
Proteogenomic analysis dissects early-onset breast cancer patients with prognostic relevance.
Exp Mol Med, 01 Nov 2024
Cited by: 0 articles | PMID: 39482530
Homologous recombination deficiency score is an independent prognostic factor in esophageal squamous cell carcinoma.
J Pathol Clin Res, 10(6):e70007, 01 Nov 2024
Cited by: 0 articles | PMID: 39469984 | PMCID: PMC11519826
Polyclonal-to-monoclonal transition in colorectal precancerous evolution.
Nature, 30 Oct 2024
Cited by: 0 articles | PMID: 39478225
Go to all (420) article citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
dbGaP - The database of Genotypes and Phenotypes
- (1 citation) dbGaP - phs000178
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing.
Nucleic Acids Res, 44(16):e131, 07 Jun 2016
Cited by: 669 articles | PMID: 27270079 | PMCID: PMC5027494
Ploidy- and Purity-Adjusted Allele-Specific DNA Analysis Using CLONETv2.
Curr Protoc Bioinformatics, 67(1):e81, 01 Sep 2019
Cited by: 10 articles | PMID: 31524989 | PMCID: PMC6778654
AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data.
Bioinformatics, 30(8):1056-1063, 02 Jan 2014
Cited by: 31 articles | PMID: 24389661 | PMCID: PMC6280749
Using SAAS-CNV to Detect and Characterize Somatic Copy Number Alterations in Cancer Genomes from Next Generation Sequencing and SNP Array Data.
Methods Mol Biol, 1833:29-47, 01 Jan 2018
Cited by: 2 articles | PMID: 30039361
Review
Funding
Funders who supported this work.
Danish Council for Independent Research (1)
Grant ID: 09-073053/FSS
European Commission (1)
Grant ID: HEALTH-2010-F2-259303
the Breast Cancer Research Foundation (1)
Grant ID: 09-073053/FSS