Abstract
Free full text
Using sigLASSO to optimize cancer mutation signatures jointly with sampling likelihood
Abstract
Multiple mutational processes drive carcinogenesis, leaving characteristic signatures in tumor genomes. Determining the active signatures from a full repertoire of potential ones helps elucidate mechanisms of cancer development. This involves optimally decomposing the counts of cancer mutations, tabulated according to their trinucleotide context, into a linear combination of known signatures. Here, we develop sigLASSO (a software tool at github.com/gersteinlab/siglasso) to carry out this optimization efficiently. sigLASSO has four key aspects: (1) It jointly optimizes the likelihood of sampling and signature fitting, by explicitly factoring multinomial sampling into the objective function. This is particularly important when mutation counts are low and sampling variance is high (e.g., in exome sequencing). (2) sigLASSO uses L1 regularization to parsimoniously assign signatures, leading to sparse and interpretable solutions. (3) It fine-tunes model complexity, informed by data scale and biological priors. (4) Consequently, sigLASSO can assess model uncertainty and abstain from making assignments in low-confidence contexts.
Introduction
Mutagenesis is a fundamental process underlying cancer development. Examples of mutational mechanisms include spontaneous deamination of cytosines, the formation of pyrimidine dimers by ultraviolet (UV) light, and the crosslinking of guanines by alkylating agents. Multiple endogenous and exogenous mutational processes drive cancer mutagenesis and leave distinct fingerprints1. Notably, these processes have characteristic mutational nucleotide context biases2–6. Sequencing cancer samples at presentation revealed all mutations accumulated over lifetime; these include somatic alterations generated by multiple mutational processes both before cancer initiation and during cancer development. In a generative model, multiple latent mutational processes generate mutations over time, drawing from their corresponding nucleotide context distributions (mutation signature)4,5. Here, a mutation signature is a multinomial probability distribution of mutations of a set of nucleotide contexts. In cancer samples, mutations from various mutational processes are mixed and observable by sequencing.
By applying unsupervised methods such as non-negative matrix factorization (NMF) and clustering to large-scale cancer studies, researchers have decomposed the mutation mixture and identified at least 30 distinct mutational signatures2,7. Many signatures have been linked with mutational processes with known etiologies, such as aging, smoking, or ApoBEC activity. Investigating the fundamental processes underlying mutagenesis could help elucidate the initiation and development of cancer.
A major task in cancer research is to leverage signature studies on large-scale cancer cohorts and efficiently select and attribute active signatures to new cancer samples. A popular previously published method, deconstructSigs,8 decomposes the mutation profile into a signature mixture using binary search to iteratively test coefficients one-by-one and then hard pruning signatures with low estimated contribution to achieve sparsity. Other approaches use linear programming9 or iterate all combinations by brute force10. None of these approaches explicitly formulates sampling uncertainty into the model or uses efficient regression techniques. Moreover, no off-the-shelf implementation of these methods, besides deconstructSigs, is available.
Although we do not fully know the latent mutational processes in cancer samples, we can make reasonable and logical assumptions that facilitate our method design. Here, we aimed to design a computational framework, sigLASSO, which could meet these criteria. First, we assumed that the set of estimated mutational mechanisms should be small, as de novo studies indicate that not all signatures can be active in a single sample or even a given cancer type. In most cancer samples, only a few signatures are identified in the original de novo discovery studies. An ideal tool should generate solutions that follow this sparse distribution. Moreover, the number of detectable mutation signatures is limited by the amount of data support. Too many signatures lead to overfitting and unstable solutions. We aimed for a sparser solution as it explains observations in a simpler fashion. Second, the estimated mutational mechanisms should be biologically interpretable and reflect some cancer type specificity. For example, we should not observe UV-associated signatures in tissues that are not exposed to UV. Likewise, we only expect to observe activation-induced cytidine deaminase mutational processes, which are biologically involved in antibody diversification, in B-cell lymphomas. Finally, we felt the solution should be robust and the data should control the model complexity.
In particular, reliably recovering the signature composition is challenging when the mutation number is low8. Low mutation count results in high sampling variance, leading to an unreliable estimation of the mutation context probability distribution, which is the target for signature fitting. A desirable signature identification tool should model the sampling process and take sampling variance into consideration.
In this work, we formulated the task as a joint optimization problem with L1 regularization. First, by jointly fitting signatures and the parameters of a multinomial sampling process, sigLASSO takes into account the sampling uncertainty. Cooperatively fitting a linear mixture and maximizing the sampling likelihood enables knowledge transfer and improves performance. Specifically, signature fitting imposes constraints on the previously unconstrained multinomial sampling probability distribution. Conversely, a better estimation of the multinomial sampling probability improves signature fitting. This property is especially critical in high sampling variance settings, for example, when we only observe low mutation counts in whole-exome sequencing (WES). Existing methods use continuous relaxation, which makes the model invariant to different mutation counts. Second, sigLASSO penalizes the model complexity and achieves variable selection by regularization. Regularization is essential for this fitting problem with a large amount of signatures to achieve proper variable selection and avoid overfitting. Using regularization to promote sparsity and prevent overfitting is also a standard practice in de novo signature discovery. A few recent examples are rule-based constraints (SigProfiler), a Bayesian variant of NMF that penalizes on model complexity (SignatureAnalyzer11) and L1-based regularization (SparseSignatures12). The most straightforward way to do this would be to use the L0 norm (cardinality of active signatures), but this approach cannot be effectively optimized. Conversely, using the L2 norm flattened out at small values leads to many tiny, non-zero coefficients, which do not resemble the sparse signature distribution in the original de novo studies and are hard to interpret biologically. sigLASSO uses the L1 norm, which promotes sparsity. The L1 norm is convex, and thus allows efficient optimization13,14. In addition, this approach is able to harmoniously integrate prior biological knowledge into the solution by fine-tuning penalties on the coefficients. Compared with the approach of subsetting signatures before fitting, our soft thresholding method is more flexible to noise and unidentified signatures. Finally, sigLASSO is aware of data complexity such as mutational number and patterns in the observation. It is able to abstain (decline to assign mutational processes under high uncertainty) and defer to the human researcher to decide. Our method is automatically parameterized empirically on performance, allowing data complexity to inform model complexity. In this way, our approach also promotes result reproducibility and fair comparison of data sets.
In sum, sigLASSO exploits constraints in signature identification and provides a robust framework for scientists to achieve biologically sound solutions. sigLASSO also can empower researchers to use and integrate their biological knowledge and expertise into the model. Unveiling the underlying mutational processes in cancer samples will enable us to recognize and quantify new mutagens, understand mutagenesis and DNA repair processes, and develop new therapeutic strategies for cancer9,15–18.
Results
The signature identification problem
Mutational processes leave mutations in the genome within distinct nucleotide contexts. We denoted the total number of contexts as n. Typically, we considered the mutant nucleotide context and looked one nucleotide ahead and behind each mutation, dividing the mutations into n=96 trinucleotide contexts. Each mutational process carries a unique signature, which is represented by a multinomial probability distribution over mutations of trinucleotide context (Fig 1a).
Then there are K latent mutational processes. Large-scale pan-cancer analyses identified K=30 COSMIC signatures by NMF (with Frobenius norm penalty) and clustering2,3. Here, our objective was to leverage the pan-cancer analysis and decompose mutations from new samples into a linear combination of signatures. Mathematically, we formulated the following non-negative regression problem, maintaining the original Frobenius norm:
The mutation matrix,
Sampling variance
In practice, this problem is optimized using continuous relaxation for efficiency and simplicity,8 neglecting the discrete nature of mutation counts. This approach essentially transforms observed mutations into a multinomial probability distribution, making model estimation insensitive to the total mutation count. However, the total mutation count has a critical role in inference. Assuming mutations are drawn from a latent probability distribution, which is the mixture of several mutational signatures, the mutations follow a multinomial distribution. The total mutation count is the sample size of the distribution, thus greatly affecting the variance of the inferred distribution.
For instance, 20 mutations within the 96 categories give us very little confidence in inferring the underlying mutation distribution. By contrast, if we observed 2000 mutations, we would have much higher confidence. Methods using continuous relaxation treat these two conditions indifferently. Here, we aimed to use a likelihood-based approach to acknowledge the sampling variance and design a tool sensitive to the total mutation count.
sigLASSO model
We divided the data generation process into two parts. First, multiple mutational signatures mix together to form an underlying latent mutation distribution. Second, we observed a set of categorical data (mutations), which is a realization of the underlying mutation distribution. We used mi(i = 1…n) to denote the mutation count of the ith category. The vector
To achieve variable selection and promote sparsity and interpretability of the solution, sigLASSO adds an L1 norm regularizer on the weights
Overall, from the generative model, we can determine the likelihood function for a single sample.
Our objective function is to maximize the log-likelihood function, which is given as:
Here, α=1/σ2. We can infer α from the residual errors from linear regression (see parameter tuning). Meanwhile, because of its continuous nature,
sigLASSO is aware of the sampling variance
By jointly optimizing both the sampling process and signature fitting, sigLASSO is aware of the sampling variance and infers an underlying mutational context distribution
We illustrated how the mutation count affects the estimation of
Because linear fitting and sampling likelihood optimization mutually inform each other, concurrently learning an auxiliary sampling likelihood improves performance. We compared the accuracy of the estimation of
Performance on simulated data sets
We first evaluated sigLASSO on simulated data sets. Both sigLASSO (with and without priors) and deconstructSigs performed better with higher mutation number and lower noise (Fig. 2a, Supplementary Figure 1). A decrease in mutation number leads to an increase of uncertainty in sampling, which is mostly negligible in the high mutation scenarios. As expected, the MSE jumped to the 0.05–0.3 range regardless of the noise level when the mutation number was low. We observed a similar pattern in support recovery (i.e., precision/recall/accuracy). Thus, the error is dominated by undersampling rather than embedded noise. Despite giving a simpler, sparse solution, sigLASSO with priors outperformed sigLASSO with no priors and deconstructSigs in both MSE and support recovery. In support recovery, sigLASSO with priors showed a 5–10% gain in accuracy compared with deconstructSigs. sigLASSO with no priors also had better support recovery performance, measured by accuracy, than deconstructSigs. Overall, sigLASSO maintained a higher precision level when the mutation number decreased and/or the noise increased, which shows its ability to abstain (decline to assign signatures) and provide some control on the false positive rate. Notably, sigLASSO with priors maintained an accuracy above 0.8 in all simulation settings. Moreover, the precision of sigLASSO was minimally affected by the number of signatures.
Although the recall was slightly lower with sigLASSO than deconstructSigs in noisy settings, in very low and no noise situations sigLASSO achieved, a higher recall owing to its ability to adapt to different noise levels. By contrast, deconstructSigs, which assumes a fixed noise level, overly pruned the signatures in the post-fitting step when the noise was low. At last, adding correct priors helped boost both precision and recall significantly.
We next explored how different priors affect sigLASSO’s performance. Our experiments showed that using known signatures as priors to tune the weights boosts performance. Priors improved performance even when we included only a small fraction of true signatures or blended in a large number of wrong signatures (Fig. 2b, Supplementary Figure 2). As the fraction of true signatures given as prior knowledge increased from zero, the performance immediately started improving and continued to do so. When more false signatures were mixed with true signatures given as prior knowledge, the performance slowly deteriorated. However, even with 1.5 times false signatures mixed in with true ones, the performance was slightly better (~2% in accuracy) than the baseline that used no priors. Stronger priors had larger boosting effects on the solution, as expected.
WGS scenario using renal cancer data sets
We next moved from synthetic data sets to real cancer mutational profiles, which are likely noisier than simulations and exhibit a highly non-random distribution of signatures.
We benchmarked the two methods using 35 WGS papillary kidney cancer samples21. The median mutation count was 4528 (range: 912–9257). We found that without prior knowledge, both sigLASSO and deconstructSigs showed high contributions from signatures 3 and 5 (Fig. 3a, b). deconstructSigs also assigned a high proportion to signatures 8(9.9%), 9(6.9%), and 16(4.7%). Signatures 3, 8, 9, and 16 were not found to be active in pRCC in previous studies and currently no biological support connects them to pRCC2. As expected, sigLASSO resulted in sparser solutions than deconstructSigs (mean signatures assigned: 3.40 and 4.43, respectively). Adding prior from COSMIC (kidney cancer combined) helps sigLASSO to put more weight on Signature 5, and increase the number of signatures assigned to 4.06.
However, if we naively subset the signatures and took the ones that were found to be active in previous studies, the signature profile was completely dominated by signature 5, to which only ~10% mutations on average were assigned with other signatures. Moreover, the model assigned highly similar signature profiles to all samples. This finding suggests an overly simple, underfitted model.
To show that sigLASSO is sensitive to fitting uncertainty, able to abstain, and is robust, we performed subsampling on 30 pRCC samples (mutation counts ≥3000, Fig. 3c). As the sample size decreased, sigLASSO assigned fewer mutations with signatures (mean fraction dropped from 0.93 (all mutations) to 0.55 (20 mutations)), reflecting the greater uncertainty in fitting. This quality allows the user to be aware of the uncertainty in the solutions.
Moreover, the model complexity also declined accordingly (Supplementary Figure 3). The mean number of signatures assigned by sigLASSO decreased from 4.06 to 1.67, representing simpler models. Last, the performance of sigLASSO was robust and stable, as evidenced by stable outputs even in low sampling counts. Multiple subsampling runs showed high agreement in active and inactive signatures (~0.95 across all subsample sizes).
In contrast, deconstructSigs was unable to reflect model uncertainty. Surprisingly, as the mutation number decreased, the fraction of fitted mutations in deconstructSigs unexpectedly increased from 0.83 (all) to 0.97 (20 mutations). The model complexity, as reflected by the mean number of signatures assigned also increased from 4.43 to 4.70. The agreement between subsamples dropped significantly as the subsampling size decreased, indicating the solution is unstable and potentially overfitted.
WES scenario using esophageal carcinoma data sets
We next aimed to evaluate the two methods on 182 WES esophageal carcinoma (ESCA) samples with >20 mutations. The median mutation count was 175.5 (range: 28–2146), which is considerably lower than WGS but typical for WES.
In sigLASSO, the L1 penalty strength was tuned based on model performance. In a low mutation count setting, ESCA WES data set, the model variance was high, which pushed up the penalty. As expected by its ability to abstain, sigLASSO assigned a lower fraction of mutations with signatures than deconstructSigs (median: 0.68 and 0.90; interquartile range (IQR): 0.56–0.75 and 0.86–0.94, respectively, Supplementary Figure 4). deconstructSigs did not achieve 100% assignment because it performed a hard normalization of the coefficients after an unconstrained binary search and then discarded signatures that had ≤6% contribution. The leading signatures were 1, 16, 17, 3, and 13 in sigLASSO and 1, 17, 16, 6, 13, 3, and 2 in deconstructSigs.
deconstructSigs has been applied to distinguish between two different histological types of esophageal cancer8. We demonstrated that sigLASSO generates a sparser but comparable result with wider signature 1, and 16 gaps between the subtypes (Fig. 3d, Supplementary Figure 5). The adenocarcinoma subtypes had higher fractions of signature 1 and 17, and lower fractions of signature 3, 13, and 16.
Performance on 8893 TCGA samples
We ran sigLASSO and deconstructSigs with step-by-step set-ups on 8893 The Cancer Genome Atlas (TCGA) tumors (33 cancer types, Supplementary Table 11) with more than 20 mutations (Fig. 4, Supplementary Figure 6). The median mutation number is 100, IQR is 52–200.
Simple non-negative regression resulted in an overly dense matrix. Applying an L1 penalty made the solution remarkably sparse. Then, by incorporating the prior knowledge, the signature landscape further changed without significantly affecting the assignment sparsity. The change was inconsistent with the priors given. With low-count WES data (upper quartile: 200 and 96 mutational contexts), we expect the signature assignment to be very sparse as the uncertainty is high, the model should from making assignments. sigLASSO indeed assigned low number of signatures to the samples (median number of signatures assigned per sample: 1, IQR: 1–2). In comparison, the solutions of deconstructSigs were less sparse (median number of signatures assigned per sample: 4, IQR: 3–5). In all, 18.5% samples were fitted by deconstuctSigs with six signatures or more, compared with 0.4% by sigLASSO. We provide a kidney cancer example and detailed paired analysis in the Supplement (Supplementary Figure 7, 8).
When the mutation number cutoff increases to 50, the above results of the two methods still stand (Supplementary Figure 9). We also released all of the signature results of 8893 TCGA samples on the sigLASSO GitHub site.
Using the large-scale tumor signature profiles, we further explored the correlation of smoking signature and smoking status using annotations from a previous study15. In lung adenocarcinoma (LUAD) samples, smoking samples carry significantly higher signature 4 (“smoking signatures”) fractions than non-smoking ones (median: 0 and 0.68, respectively, p≤1×10−15, Wilcoxon rank test, Supplementary Figure 10). Similarly, in lung squamous cell carcinoma (LUSC) samples, we observed high fractions of signature 4 in smokers, but because only 3.5% of the LUSC cohort are nonsmokers (6/171), we were underpowered to draw a statistical significance. In non-lung cancer samples (N=1500), we also found a weaker but statistically significant trend of higher signature 4 in smokers (mean: 0.008 and 0.038, respectively, p=3.3×10−8, Wilcoxon rank test, Supplementary Figure 10). This result is in agreement with previous studies on smoking signatures15.
sigLASSO is computationally efficient
sigLASSO iteratively solves two convex problems. The
By profiling sigLASSO and deconstructSigs (Fig. 5), we noticed that neither total mutation numbers nor signature numbers remarkably affected the running time of sigLASSO. With a high mutation number, sigLASSO was 3–4 times faster than deconstructSigs; with a low mutation number (50 mutations), these two tools showed a comparable computation time. Noticeably, despite using less time sigLASSO employs empirical parameterization and alternative optimization, which regresses the signature-fitting problems hundreds of thousands of times with different parameters in a typical run. Therefore, by carefully designing an effective algorithm, the core fitting step of sigLASSO is orders of magnitudes faster than deconstructSigs. This enables sigLASSO to probe the data complexity and accordingly tune the model complexity.
Discussion
Studies decomposing cancer mutations into a linear combination of signatures have provided invaluable insights into cancer biology4–6,22. Indeed, researchers have gained a better understanding of one of the fundamental driving forces of cancer initiation and development, mutagenesis, by inferring mutational signatures and latent mutational processes.
A practical problem for many researchers is how to leverage results from large-scale signature studies and apply them to a small set of incoming samples. Although this might seem to be a simple linear system problem, core challenges include (1) preventing over- and underfitting on only one sample, often with very few mutations (especially in WES), (2) achieving proper variable selection from a large amount of recognized signatures, and (3) promoting interpretability.
First, with less than a few hundred mutations, sampling variance becomes a significant factor in reliably identifying signatures. Therefore, the fitting scheme should be aware of the sampling variance, which is especially pronounced in low mutation count scenarios (WES or cancer types with low mutation burden). Ideally, the tool should be able to attribute the signatures by flexibly inferring the underlying true mutation distribution given the sampling variance and the signature-fitting performance. Second, the large number of available signatures necessitates a proper variable selection, especially with a limited amount of observed data. Signature studies on large-scale cancer data sets have revealed that mutational signatures are not all active in one sample or cancer type; in most tumor cases, only a few signatures prevail. A recent signature summary suggested that 2–13 known signatures are observed in a given cancer type (based on 30 COSMIC signatures), which might include hundreds and even thousands of samples. Therefore, sparse solutions are in line with previous de novo studies, and are biologically sound and interpretable. In addition, sparse solutions can give better predictions owing to lower estimator variance. Third, a desirable method should be parameterized according to the data complexity to achieve optimum fitting. Finally, mutational signatures are not orthogonal owing to their biological nature. Collinearity of the signatures will lead to unstable fittings that change erratically with even a slight perturbation of the observation.
deconstructSigs was the first tool that could identify signatures even in a single tumor. Instead of regression, this tool uses binary search to iteratively tune coefficients. To achieve sparsity, deconstructSigs performs post hoc pruning with a preset 6% cutoff value. The mutation spectrum is normalized before fitting, thus making mutation counts invariant to the model. Moreover, the binary search operates on unbounded coefficients and uses a hardcoded upper and lower bounds, which provides no guarantee of finding the optimal solution. Finally, the greedy nature of stepwise coefficient tuning is prone to eliminating valuable predictors in later steps that are correlated with previously selected ones23.
Here, we describe sigLASSO, which simultaneously optimizes both the sampling process and an L1 regularized signature fitting. By explicitly formulating a multinomial sampling likelihood into the optimization, we designed sigLASSO to take into account the sampling variance. Meanwhile, sigLASSO uses the L1 norm to penalize the coefficients, thus achieving effective variable selection and promoting sparsity. By fine-tuning the penalizing terms using prior biological knowledge, sigLASSO is able to further exploit previous signature studies from large cohorts and promote signatures that are believed to be active in a soft thresholding manner.
Jointly optimizing a mutation sampling process enables sigLASSO to be aware of the sampling variance. By additionally modeling an auxiliary multinomial sampling process and corresponding distribution, we demonstrated that sigLASSO achieves better and more stable signature attribution, especially in cases with low mutation counts. In cancer research, WES data are abundant, but it also suffers from undersampling in signature attribution. In these cases, sigLASSO generates more reliable and robust solutions. We showcase the successful application of sigLASSO on 8893 TCGA WES samples. Overall, sigLASSO achieves better sparsity than deconstructSigs and promote structures in the previous signature studies by injecting priors into the optimization process. The final signature distribution of neither sigLASSO nor deconstructSigs is identical to the original signature studies, which is probably owing to biases in the original signature discovery, sequencing/variants calling, and unknown hypermutational processes. We noticed deconstructSigs assigned signatures to 100% of the mutations in most samples, while in sigLASSO assignment fraction correlated with cancer types. This abstain pattern indicates the presence of potentially unknown signatures and cancer subtypes.
As the cost of WGS drops rapidly, we expect an even greater number of cancer samples to be sequenced24. The vast amount of cancer genomics data will give scientists larger power to discern unknown or rare signatures. The growing number of signatures will eventually make the signature matrix underdetermined (when k>96, i.e., the number of possible mutational trinucleotide contexts). A traditional simple solver method would give infinitude (noiseless) or unstable (noisy) solutions in this underdetermined linear system. However, by assuming the solution is sparse, we were able to apply regularization to achieve a simpler, sparser, and more stable solution. Furthermore, sigLASSO is very efficient and thus able to handle a larger number of samples and signatures, as well as hyperparameter optimization.
Moreover, sigLASSO does not specify a noise level explicitly beforehand, but instead empirically tunes parameters based on model performance. By contrast, deconstructSigs specifies a noise level of 0.05 to derive a cutoff of 0.06 to achieve sparsity. In general, sigLASSO lets the data itself control the model complexity and leave any post hoc filtering to users. Abstention is a desired trait in machine learning models. In addition to the objectives, the model learns what it knows and what it does not (KWIK: Knows What It Knows),25 In both subsampling and simulation experiments, sigLASSO is able to abstain under high uncertainty, providing solutions with consistently high precision and robustness under various scenarios.
Next, owing to the collinearity nature of signatures, pure mathematical optimization might lead algorithms to select wrong signatures that are highly correlated with truly active ones. To overcome this problem, sigLASSO allows researchers to incorporate domain knowledge to guide signature identification. This input could be cancer type-specific signatures or patient clinical information (e.g., smoking history or chemotherapy). Furthermore, we transformed the binary classification (active or inactive) to a continuous space, with weights indicating the strength of the prior. These weights could be effectively learned using patient information or other assays (e.g., RNA sequencing or methylation arrays). Moreover, sigLASSO can adapt to sparsity promoting schemes in de novo discovery tools through priors. For example, SparseSignatures12 also uses L1 regularization but does not penalize on background signatures. sigLASSO could use this piece of information in its priors to better assign signatures when working together with SparseSignatures. We showcased the performance of sigLASSO on real cancer data sets. Although we lack the ground truth of the operative mutational signatures in tumors, we have several reasonable beliefs about the signature solution. sigLASSO produced signature solutions that are biologically interpretable, properly align with our current knowledge about mutational signatures, and well distinguish cancer types and histological subtypes.
We also implemented elastic net with hyperparameter optimization in sigLASSO. Elastic net blends L2 with L1 regularization and in principle demonstrates better performance and stability than LASSO on strongly correlated features26. We found that elastic net did not improve the performance (measured as meaningful reduction in MSE in cross-validation) in our simulations using the current 30 COSMIC signatures, likely because the correlations between them are not too high. However, with more cancer samples sequenced every day, researchers will gain power to discern highly correlated mutational processes and grow the size of the signature set significantly. Therefore, elastic net might be beneficial in the near future.
Finally, sigLASSO uses quadratic loss, which follows the previous de novo studies2,3,5. By doing so, sigLASSO is unbiased in detecting signatures in cancer samples. Nonetheless, the initial discovery suffers from several limitations. First, the actual number of mutational processes is likely higher than 30, the current number of COSMIC signatures. This is because the power of the de novo discovery is bounded by the amount of available data. Second, the nature of mutagenesis leads to some mutational processes being more prevalent than others. Some of the signatures are cancer type specific. The original signature discovery did not use a balanced data set with equal numbers of cancer types. Third, the original NMF objective function employs quadratic loss, which might lead to bias towards flat signatures over spiky ones. In addition to the original NMF approach, researchers have proposed other decomposition methods for signature discovery. For example, SomaticSignatures27 uses PCA. Because the loss function is also quadratic, we expect sigLASSO to work seamlessly with SomaticSignatures.
Our simulation reveals that sigLASSO does not have a discernable preference to assign to either spiky or smooth signatures (Supplementary Figure 12). However, we found that the signature distribution from sigLASSO and deconstructSigs cannot be fully explained by the current signature knowledge. Further research probing the biological foundation of the signatures and quantifying biological priors will help advance our understanding and improve the interpretability of signature assignments. Meanwhile, as there are still many unsolved issues in signature discovery, we advocate that sigLASSO should also be used with discretion and its results interpreted cautiously.
sigLASSO exploits the previously overlooked mutation sampling uncertainty and formulates a framework that jointly optimizes the objective (i.e., signature fitting with L1 regularization) and the sampling likelihood. In biological experiments, low-count observations on discrete variables are common. For example, in single cell RNA sequencing (scRNAseq), the measured discrete mRNA counts are often very low or even zero (owing to undersampling). Our joint optimization approach could have further implications in these scenarios. We, indeed, find some similar work in scRNAseq28,29. Another method for de novo signature discover, EMu, formulates the discreet mutational process as a Poisson generative model30. When the total mutational count is fixed, such a Poisson generative model is equivalent to our multinomial sampling process. Our work also could be extended to mutagenesis modeling and parameter estimation—for example, in estimating the nucleotide-specific background mutation rate in cancer.
Methods
Optimizing sigLASSO
The negative log-likelihood is convex in respect to both
Algorithm 1.
sigLASSO algorithm
1: initialization: |
2: while t<tmax do |
3: |
( |
4: |
( |
5: if |
6: break |
7: t←t+1 |
8: return
|
Specifically, to begin the iteration, we initialized
Next, we solved the
Intuitively, in the
Optimizing the -step
In the
We added the Lagrangian multiplier Λ to satisfy the linear constraint of
The roots of the first n quadratic equations are given by
α=1/σ2 is strictly positive and mi is non-negative. Therefore, if mi=0, there exists only one zero root and
We plugged all the roots into the last equation (i.e., the linear constraint) and used the R function uniroot() to solve Λ.
Parameter tuning
We tuned λ by repeatedly splitting the nucleotide contexts into training and testing sets and testing the performance. Because mutations of the same single-nucleotide substitution context are often correlated, we split the data set into eight subsets. Each subset contained two of each single-nucleotide substitution. We then held one subset as the testing data set and only fit the signatures on the remaining ones. After circling all eight subsets and repeating the process 20 times, we used the largest λ (which leads to a sparser solution) that gave an MSE 0.5 or 1 SD from the minimum MSE. λ was tuned whenever
α=1/σ2, σ2 is estimated using
Option for elastic net
Adding an L2 regularizer (i.e., elastic net) might improve and stabilize the performance when variables are highly correlated26. Therefore, we also implemented elastic net in sigLASSO. The objective function then became:
We tuned the hyperparameter γ by grid search together with λ. We always picked the largest γ and the largest λ (for better sparsity) that gave an MSE that was 0.5 SD from the minimal MSE. In simulations, we did not find that introducing L2 regularization added additional benefits. The model gave almost identical solutions to using L1 only. Nonetheless, we kept elastic net as an option in highly correlated signature scenarios to the user in our implementation.
Data simulation and model evaluation
We downloaded 30 previously identified COSMIC signatures v2 (http://cancer.sanger.ac.uk/cosmic/signatures). We created a simulated data set by randomly and uniformly drawing two to eight signatures and corresponding weights (minimum: 0.02). The reason for picking up to eight signatures is because (1) empirically, in cancer signature studies, most samples have only a few signatures. (2) We further confirmed the sparsity of signatures by running deconstructSigs on large-scale TCGA data and found 99.96% samples got assigned with eight or less signatures. (3) Moreover, biologically, we believe that it is unreasonable for more than eight of the processes to act simultaneously in one sample. Many of these processes describe biological disjoint conditions, e.g., smoke, UV radiation and tissue-specific cellular processes. We simulated additive Gaussian noise at various levels with a positive normal distribution of up to 25 (1–25, uniformly drawn) randomly selected trinucleotide contexts. Then, we summed all the signatures and noise to form a mutation distribution. We sample mutations from this distribution with different mutation counts.
We ran deconstructSigs according to the original publication8 and sigLASSO, both with and without prior knowledge of the underlying signature. To evaluate their performance, we compared the inferred signature distribution with the simulated distribution and calculated MSE. We also measured the number of false positive and false negative signatures in the solution (support recovery).
Illustrating sigLASSO on 8893 TCGA samples
To assess the performance of our method on real-world cancer data sets, we used somatic mutations from 33 cancer types from TCGA. We downloaded MAF files from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). A detailed list of files used in this study can be found in Supplementary Table 1.
We extracted prior knowledge on active signatures in various cancer types from a previous pan-cancer signature analysis (http://cancer.sanger.ac.uk/cosmic/signatures).
sigLASSO software suite
sigLASSO is implemented as an R package “siglasso”. It accepts processed mutational spectrums and VCF files. We provided functions to parse mutational spectrums from VCF files as well as some visualization methods in the package. “siglasso” allows users to specify biological priors (i.e., signatures that should be active or inactive) and their weights. “siglasso” uses 30 COSMIC signatures by default. Users are also given the option to supply customized signature files. It is computationally efficient; using default settings, the program can successfully decompose a whole genome sequencing (WGS) cancer sample in less than a few seconds on a regular laptop. For time profiling purposes, we ran siglasso and deconstructSigs on an Intel Xeon E5-2660 (2.60 GHz) CPU. We employed the R package “microbenchmark” to profile the function call siglasso() and whichSignatures(). For each setup, we generated 100 noiseless simulated data sets and repeated the process 100 times for each evaluation.
We made siglasso source code available at http://github.com/gersteinlab/siglasso.
Evaluation criteria for signature assignment
One of the limitations of cancer signature research is that the ground truth of real samples typically cannot be obtained. Previous large-scale signature studies have relied largely on mutagen exposure association from patient records and biochemical knowledge on mutagenesis. Here, besides using simulation, we illustrated the outputs of different models and compared the results on real data set with existing signature knowledge and distributions. Although no gold standard exists to evaluate the performance, we do have a few reasonable expectations about the solution:
Sparsity: one or more signature should be active in a given cancer sample and type. However, not all signatures should be active. A sparse distributed signature set among cancer samples and types is defined in previous de novo discovery studies. Any signature-fitting tool should follow and produce a similar signature distribution. Moreover, mutational processes are discrete in nature and tied with certain endogenous and environmental factors. An obvious example is that the UV signature should not exist in unexposed tissues. Existing signature-identifying methods aim to implicitly achieve sparse solutions by dropping signatures with small coefficients or pre-selecting a small signature subset for fitting.
The ability to abstain (decline to assign signatures to all mutations): signature assignment issues are often undefined owing to collinearity of the signatures and a larger number of possible signatures. Overfitting is a serious concern, especially when the mutation counts are low (e.g., WXS or cancer genomes with fewer mutations) or the fitting is poor (e.g., unknown mutational process or sample contamination causing high noise). A good solution should refuse to fully assign signatures to every mutation when it does not have enough confidence to do so and, instead, defer to the human researchers to decide.
Robustness: solutions should be robust and reproducible. Signatures are not orthogonal, thus simple regression might lead to solutions that change erratically when a small perturbation is made in the observation. Moreover, the solution should reflect the level of ascertainment. Especially in WES, low mutation count is often a severe obstacle for assigning signatures owing to undersampling. In particular, under low mutation count, not all of the operative signatures would be reliably discovered. It is better to abstain under high uncertainty.
Biological interpretability: the solution should be biological interpretable. Because of the biological nature of collinearity in the signatures, simple mathematical optimization might pick the wrong signature. Researchers now tackle this problem by simply removing the majority of predictors they believe to be inactive. sigLASSO allows users to supply domain knowledge to guide the variable selection in a soft thresholding manner, leaving space for noise and rare or unknown signatures. For instance, we expected to find divergent signature distributions in different cancer types, which is reported in previous de novo discovery studies. Various tissues have divergent endogenous biological features, are exposed to diverse mutagens, and undergo mutagenesis in dissimilar fashions. Signature patterns should be able to distinguish between cancer types. This can be achieved by using different cancer type-specific priors.
These expectations are not quantitative, but they help direct us to recognize the most plausible solution as well as the less-favorable ones.
Acknowledgements
This work was supported in part by NIH/NHGRI (1R01HG008126), NIH/NICHD (1DP2HD091799-01), A.L. Williams Professorship and the facilities and staffs of the Yale University Faculty of Arts and Sciences High Performance Computing Center. We thank Harrison Zhou, Cong Liang, Jinrui Xu, Diego Galeano, and Mengting Gu for helpful discussion.
Author contributions
S.L. and M.G. conceived and designed research. S.L. designed the algorithm, developed the software, and carried out experiments. S.L. and F.W.C. performed the theoretical analysis of the algorithm. S.L., F.W.C., and M.G. analyzed the results. S.L. and M.G. wrote the manuscript. M.G. supervised the study.
Data availability
The data that support the findings of this study are avaiable from Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). A detailed list of files used in this study can be found in Supplementary Table 1.
Footnotes
Peer review information Nature Communications thanks Javier Herrero and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary information is available for this paper at 10.1038/s41467-020-17388-x.
References
Articles from Nature Communications are provided here courtesy of Nature Publishing Group
Full text links
Read article at publisher's site: https://doi.org/10.1038/s41467-020-17388-x
Read article for free, from open access legal sources, via Unpaywall: https://www.nature.com/articles/s41467-020-17388-x.pdf
Citations & impact
Impact metrics
Article citations
Accurate and sensitive mutational signature analysis with MuSiCal.
Nat Genet, 56(3):541-552, 15 Feb 2024
Cited by: 6 articles | PMID: 38361034
Assigning mutational signatures to individual samples and individual somatic mutations with SigProfilerAssignment.
Bioinformatics, 39(12):btad756, 01 Dec 2023
Cited by: 23 articles | PMID: 38096571 | PMCID: PMC10746860
BRCA2 Germline Mutations Identify Gastric Cancers Responsive to PARP Inhibitors.
Cancer Res, 83(10):1699-1710, 01 May 2023
Cited by: 6 articles | PMID: 37129948 | PMCID: PMC10183806
A phylogenetic approach to study the evolution of somatic mutational processes in cancer.
Commun Biol, 5(1):617, 22 Jun 2022
Cited by: 3 articles | PMID: 35732905 | PMCID: PMC9217972
Comprehensive profiling of 1015 patients' exomes reveals genomic-clinical associations in colorectal cancer.
Nat Commun, 13(1):2342, 29 Apr 2022
Cited by: 18 articles | PMID: 35487942 | PMCID: PMC9055073
Go to all (12) article citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
Hidden Markov models lead to higher resolution maps of mutation signature activity in cancer.
Genome Med, 11(1):49, 26 Jul 2019
Cited by: 10 articles | PMID: 31349863 | PMCID: PMC6660659
Bioinformatic Methods to Identify Mutational Signatures in Cancer.
Methods Mol Biol, 2185:447-473, 01 Jan 2021
Cited by: 6 articles | PMID: 33165866
A comprehensive comparison of tools for fitting mutational signatures.
Nat Commun, 15(1):9467, 02 Nov 2024
Cited by: 0 articles | PMID: 39487150 | PMCID: PMC11530434
Computational tools to detect signatures of mutational processes in DNA from tumours: A review and empirical comparison of performance.
PLoS One, 14(9):e0221235, 12 Sep 2019
Cited by: 34 articles | PMID: 31513583 | PMCID: PMC6741849
Review Free full text in Europe PMC
Funding
Funders who supported this work.
NHGRI NIH HHS (1)
Grant ID: R01 HG008126
NICHD NIH HHS (1)
Grant ID: DP2 HD091799