Abstract
Background
Single-cell RNA-sequencing (scRNA-seq) technologies have advanced rapidly in recent years and enabled the quantitative characterization at a microscopic resolution. With the exponential growth of the number of cells profiled in individual scRNA-seq experiments, the demand for identifying putative cell types from the data has become a great challenge that appeals for novel computational methods. Although a variety of algorithms have recently been proposed for single-cell clustering, such limitations as low accuracy, inferior robustness, and inadequate stability greatly impede the scope of applications of these methods.Results
We propose a novel model-based algorithm, named VPAC, for accurate clustering of single-cell transcriptomic data through variational projection, which assumes that single-cell samples follow a Gaussian mixture distribution in a latent space. Through comprehensive validation experiments, we demonstrate that VPAC can not only be applied to datasets of discrete counts and normalized continuous data, but also scale up well to various data dimensionality, different dataset size and different data sparsity. We further illustrate the ability of VPAC to detect genes with strong unique signatures of a specific cell type, which may shed light on the studies in system biology. We have released a user-friendly python package of VPAC in Github ( https://github.com/ShengquanChen/VPAC ). Users can directly import our VPAC class and conduct clustering without tedious installation of dependency packages.Conclusions
VPAC enables highly accurate clustering of single-cell transcriptomic data via a statistical model. We expect to see wide applications of our method to not only transcriptome studies for fully understanding the cell identity and functionality, but also the clustering of more general data.Free full text
VPAC: Variational projection for accurate clustering of single-cell transcriptomic data
Abstract
Background
Single-cell RNA-sequencing (scRNA-seq) technologies have advanced rapidly in recent years and enabled the quantitative characterization at a microscopic resolution. With the exponential growth of the number of cells profiled in individual scRNA-seq experiments, the demand for identifying putative cell types from the data has become a great challenge that appeals for novel computational methods. Although a variety of algorithms have recently been proposed for single-cell clustering, such limitations as low accuracy, inferior robustness, and inadequate stability greatly impede the scope of applications of these methods.
Results
We propose a novel model-based algorithm, named VPAC, for accurate clustering of single-cell transcriptomic data through variational projection, which assumes that single-cell samples follow a Gaussian mixture distribution in a latent space. Through comprehensive validation experiments, we demonstrate that VPAC can not only be applied to datasets of discrete counts and normalized continuous data, but also scale up well to various data dimensionality, different dataset size and different data sparsity. We further illustrate the ability of VPAC to detect genes with strong unique signatures of a specific cell type, which may shed light on the studies in system biology. We have released a user-friendly python package of VPAC in Github (https://github.com/ShengquanChen/VPAC). Users can directly import our VPAC class and conduct clustering without tedious installation of dependency packages.
Conclusions
VPAC enables highly accurate clustering of single-cell transcriptomic data via a statistical model. We expect to see wide applications of our method to not only transcriptome studies for fully understanding the cell identity and functionality, but also the clustering of more general data.
Background
Single-cell RNA-sequencing (scRNA-seq) has emerged as a revolutionary tool to reveal previously unknown heterogeneity and functional diversity at a microscopic resolution. Since the first protocols were published in 2009 [1], a massive expansion in method development has derived scRNA-seq technologies with distinct advantages and applicability [2]. For example, Smart-seq2 [3] and MARS-seq [4] is preferable when quantifying transcriptomes of fewer cells, while Drop-seq [5] is preferable when quantifying transcriptomes of large numbers of cells with low sequencing depth [6]. Advances in scRNA-seq technology have resulted in a wealth of studies aiming to reveal new cell types [7, 8], assess tissue composition [4, 9, 10], identify gene regulatory mechanisms [11, 12], investigate cell development or lineage processes [13–15], and many others. With the exponential growth of the number of cells profiled in individual scRNA-seq experiments, there is a demand for novel analysis methods for this new type of transcriptomic data, which has not only much greater scale of datasets than that of bulk experiments but also various challenges unique to the single-cell context [16].
A key advantage of scRNA-seq is that it can be used to identify putative cell types using unsupervised clustering, which is essential to fully understand the cell identity and functionality. A variety of algorithms have recently been proposed for single-cell clustering. For example, CellTree produces tree structures outlining the hierarchical relationship between single-cell samples using a novel statistical approach based on document analysis techniques [17]. Other statistical approaches based on Dirichlet mixture model are shown to be well suited for single cell clustering, especially for data as unique molecular identifiers (UMI) matrix. For example, DIMM-SC models UMI count data and characterizes variations across different cell clusters via a Dirichlet mixture prior [18]. Para-DPMM further improves the clustering quality by introducing a Dirichlet process prior to automatically infers the number of clusters from the dataset, and a split-merge mechanism is adopted to improve convergence and optimality of the result [19]. Given the high dimensionality of single-cell data, clustering directly on the original dimension may affect the performance due to the intrinsic noise of single-cell data, and usually demands tedious preprocessing such feature selection, which may restrict the robustness of clustering. A simple idea is to use traditional methods to reduce dimensions and then cluster. However, there are too many alternative combinations that make it difficult for us to make a choice. Methods combining dimension reduction with classic clustering such as t-Distributed Stochastic Neighbor Embedding (t-SNE) with K-means [8], and principal component analysis (PCA) with hierarchical clustering [20] for single-cell clustering are proposed. Combining PCA with K-means, a consensus clustering approach is proposed to achieve high accuracy and robustness [21]. CIDR further uses an implicit imputation approach to alleviate the impact of dropouts in scRNA-seq data in a principled manner and identifies putative cell types using hierarchical clustering [22]. Other recent works are also proposed to cluster high dimensional single-cell data with gene regulatory networks [23], or ranking on shared nearest neighbors (SNN) [13].
However, there are still limitations in single-cell clustering to be addressed. First, even the state-of-the-art methods have achieved encouraging performance, the clustering quality can still be significantly improved for challenging tasks as shown in the Results Section. Second, most methods are designed for one of the continuous data and discrete UMI counts. There is demand for methods that can be applied to single-cell data created with different scRNA-seq technologies, such as discrete counts created with UMI based techniques and continuous data normalized to transcripts per million mapped reads (TPM)—if reads are only generated from one end of the transcript, or fragments per kilobase per million mapped reads (FPKM)—if reads cover the entire transcript [24]. Third, a method that can be generally applied to data of various sizes or dimensions is desirable. Nevertheless, most proposed methods cannot scale up well with various data dimensionality and with different dataset size. Last but equally important, most public packages cannot be easily installed due to system environment or software version problems. In addition, some of them require other dependency packages, and thus further inconveniences the use of these packages. Therefore, there is a demand for an easy-to-install and easy-to-use package.
Motivated by the above understanding, we propose in this paper a model-based algorithm named VPAC (Variational Projection for Accurate Clustering) of single-cell transcriptomic data. VPAC is a novel extension of the framework of probabilistic principal components analysis (PPCA) [25], which has been shown to be effective in dimensionality reduction for scRNA-seq data [26]. VPAC projects single-cell samples to a latent space, where the samples are constrained to follow a Gaussian mixture distribution [27]. With a coordinate ascent variational inference (CAVI) algorithm, VPAC can implement parameter estimation efficiently and steadily. Using five scRNA-seq datasets, we show that our model is not only superior to existing methods in the clustering of single-cell transcriptomic data, but also able to be applied to single-cell data of both discrete counts and normalized continuous data. Through comprehensive experiments, we further show the robustness of our model for various data dimensionality, different dataset size and different data sparsity, and the ability of our model to detect genes with strong unique signatures of a specific cell type.
Methods
The statistical model of VPAC
Let N be the number of single-cell samples, D the number of genes, and M the desired number of clusters. We assume that the gene expression vector xn of cell n is generated from a projection of a latent L-dimensional vector zn (LD). We use n=1, ..., N to index over samples, i=1, ..., L to index over latent dimensions, and j=1, …, M to index over clusters in all derivations below. The distribution of xn could be a complex high-dimensional distribution. We assume xn follows a multivariate Gaussian distribution given zn, while the latent vector zn follows a Gaussian mixture distribution. An intuitive understanding of this assumption is that, when projecting, constraining samples to follow a Gaussian mixture distribution in the latent space contributes to the accurate clustering. As a generative model, VPAC independently draw each sample xi through the following process.
In the Gauss mixture distribution (1), the binary latent variable sjn, which gives discrete distributions governed by ρ, describes which component in the mixture gives rise to the latent vector zn, i.e., if zn is generated from component j then sjn=1, and sjn=0 otherwise. We choose a Dirichlet distribution for the prior of ρ
We complete the specification of Gaussian mixture distribution in Eq. (1) by introducing Gaussian and Wishart conjugate priors over the means m and precisions T, as
where γ is a small and fixed parameter chosen to give a broad prior to m, while the degrees of freedom υ and scale matrix V give a broad prior to T.
In order to endow VPAC the ability to automatically determine the appropriate dimensionality for the latent space to avoid discrete model selection, we introduce a hierarchical prior P(W|α) over the projection matrix W in Eq. (2), as
where each item of the L-dimensional vector α controls the corresponding column of the matrix W by playing a role as the precision of a Gaussian distribution. We again introduce broad priors over the parameters α,μ and τ to complete the specification of VPAC
The graphical model representation of VPAC is shown in Fig. 1. The broad priors introduced above are obtained by setting aα=bα=aτ=bτ=β=γ=10−3. The initial parameters δ of the Dirichlet distribution are set as
Variational inference of parameters
We use variational methods to find a lower bound on P(X) because it is analytically intractable to directly evaluate P(X). With θ denoting the set of all parameters and latent variables in VPAC, we introduce an approximating distribution Q(θ) of the true posterior distribution. The log marginal likelihood is then given by
The function
We assume the variational approximation is mean-field, i.e.,
Because the proposed model is conjugate, we can derive a coordinate ascent variational inference (CAVI) algorithm to update the variational parameters
We implement VPAC in Python using common packages for data analysis (including Numpy, Scikit-learn and Scipy), without other dependency packages. Users can directly import our VPAC class and conduct clustering. Algorithm 1 gives a pseudo-code description of VPAC.
Data collection
We collected a dataset of three T cell types (CD4+/CD25+ regulatory T cells, CD4+/CD45RA+/CD25- naive T cells and CD8+/CD45RA+ naïve cytotoxic T cells) from 10X Genomics [28]. The dataset measures the expression of 32,738 genes in 32,695 cells, which were enriched from fresh peripheral blood mononuclear cells (PBMCs) and sequenced by Illumina NextSeq 500 instrument. Clustering on this dataset was claimed as a challenging task by Z.Sun et al. [18] and T.Duan et al. [19]. We used this dataset to demonstrate the ability of our method to scale up well with various data dimensionality and with different dataset size on datasets from UMI based techniques. We also downloaded three preprocessed datasets of different scales used by Para-DPMM [19], which is the state-of-the-art method for clustering discrete UMI counts, to fairly evaluate the performance of our model.
A dataset of 561 cells derived from seven cell lines (A549, H1437, HCT116, IMR90, K562, GM12878, and H1) was downloaded from the NCBI Gene Expression Omnibus via accession GSE81861 [29]. Sequenced using the 101-bp paired-end protocol on the Illumina HiSeq 2000 platform, the dataset measures the expression of 55,186 DNA regions, and the expression was quantified as FPKM values. We used this dataset to demonstrate the robustness of our model for FPKM-normalized data of various dimensions. From the Single Cell Portal, we downloaded a plate-based scRNA-seq dataset of 24,649 genes in 27,998 cells from 12 clusters. The dataset was sequenced using a modified Smart-seq2 protocol, and the expression levels of genes were quantified as TPM values [30]. With this dataset, we further demonstrated the robustness of our model for TPM-normalized datasets of different size.
In order to illustrate the superior ability of our model to account for the sparsity of scRNA-seq data, we collected a dataset of 301 cells captured from 11 cell types using microfluidics. This dataset is publicly available from the NCBI Sequence Read Archive under accession SRP041736, and the quantification of gene expression levels in TPM for all 23,730 genes in all samples was performed [31]. We also collected a Smart-seq2 sequenced dataset of 742 dendritic cells from six clusters with two batches from the Single Cell Portal to illustrate the application of our model [32, 33].
Assessment of performance
We used adjusted rand index (ARI) [34] and normalized mutual information (NMI) [35] to assess the clustering performance, namely, the similarity between the clustering results and known cell types provided by their original references. Suppose T is the known cell types, C the predicted clustering results, N the total number of single-cell samples, xi the number of samples clustered to the i-th cluster of C, yj the number of samples belong to the j-th cell type of T, and nij the number of overlapping samples between the i-th cluster and the j-th cell type. ARI is computed as
where ()denotes a binomial coefficient. By denoting the entropy of C and T as H(C) and H(T), respectively, and the mutual information between them as MI(C,T), NMI can be computed as
Results
VPAC accurately clusters scRNA-seq data
To verify the clustering performance of VPAC, we conducted each experiment five times and computed two widely used metrics, ARI and NMI. We compared the performance of our method with four baseline methods (using their default parameters), including Para_DPMM, the state-of-the-art method for clustering discrete UMI counts [19], pcaReduce, another method combining dimension reduction with clustering [20], and other two widely used methods (Seurat [13] and SC3 [21]).
In order to fairly evaluate the performance of our model, we compared the performance of VPAC with other four baseline methods on the three T cell datasets of different scales provided by T.Duan et al. [19]. The small scale dataset (S-Set) consists of 1200 cells with the 1000 top variable genes, the medium scale dataset (M-Set) consists of 3000 cells with the 3000 top variable genes, and the large scale dataset (L-Set) consists of 6000 cells with the 5000 top variable genes. As shown in Table 1, our method consistently outperforms the four baseline methods for a large margin on all the three datasets. In more detail, pcaReduce and Seurat fail in the clustering task, while SC3 achieves satisfactory performance in the first two datasets. However, the performance of SC3 decreases dramatically when the number of cells exceeds about 3 thousand, which will be further demonstrated in the following sections. Compared to Para_DPMM, our method achieves average 11.3% improvement on ARI and 8.1% improvement on NMI even the clustering on these datasets was claimed as a challenging task [18, 19]. In addition, our method consistently achieves stable performance, which demonstrates its higher robustness than the baseline methods. In terms of efficiency, the performance of VPAC is moderate compared with discriminant methods, which have an inherent advantage of high speed. Taking the L-Set as an example, VPAC cost 241s, Para_DPMM cost 1000s, pcaReduce cost 37s, Seurat cost 66s and SC3 cost 653s on average. It is worth noting that the default minimal training time of Para_DPMM is 1000s, which slightly improves the performance of Para_DPMM shown in the original research.
Table 1
Dataset | S-Set | M-Set | L-Set | |||
---|---|---|---|---|---|---|
Model | ARI | NMI | ARI | NMI | ARI | NMI |
VPAC | 0.769 ± 0.000 | 0.759 ± 0.000 | 0.765 ± 0.000 | 0.765 ± 0.000 | 0.779 ± 0.000 | 0.769 ± 0.000 |
Para_DPMM | 0.675 ± 0.008 | 0.696 ± 0.008 | 0.704 ± 0.000 | 0.713 ± 0.002 | 0.700 ± 0.002 | 0.711 ± 0.002 |
pcaReduce | 0.290 ± 0.019 | 0.289 ± 0.010 | 0.279 ± 0.007 | 0.252 ± 0.007 | 0.311 ± 0.026 | 0.309 ± 0.044 |
Seurat | 0.378 ± 0.000 | 0.379 ± 0.000 | 0.379 ± 0.000 | 0.383 ± 0.000 | 0.383 ± 0.000 | 0.383 ± 0.000 |
SC3 | 0.724 ± 0.005 | 0.722 ± 0.010 | 0.724 ± 0.009 | 0.717 ± 0.007 | 0.523 ± 0.053 | 0.501 ± 0.030 |
VPAC scales up well with various data dimensionality
To illustrate the robustness of our model for various data dimensionality, we selected different numbers of top variable genes (features) based on their standard deviations across the cell transcriptome profiles. We first used the dataset of 32,695 cells enriched from PBMCs, which is the full version of the three preprocessed datasets, to demonstrate the superior performance of VPAC for discrete UMI counts of various dimensionality. As illustrated in Fig. 2a, our method significantly outperforms the four baseline methods and achieves consistent performance across datasets of different numbers of features. Note that pcaReduce raised an error when training the dataset with 30k features even the memory is sufficient. SC3 cannot be competent for clustering the dataset of 32,695 cells.
We further evaluated the robustness of our model for FPKM-normalized data using a dataset of 561 cells derived from seven cell lines. As shown in Fig. Fig.2b,2b, VPAC consistently outperforms the four baseline methods again. Different from the performance for discrete UMI counts, all the performance of pcaReduce, Seurat and SC3 are superior to that of Para_DPMM. SC3 achieves the second best performance, but, as other three baseline methods, the performance is obviously unstable, while VPAC achieves consistent performance across datasets of different numbers of features. The results demonstrate that VPAC can scale up well with various data dimensionality whether the scRNA-seq data is discrete or continuous.
VPAC scales up well with different dataset size
Different scRNA-seq technologies can be used to quantify transcriptomes for different numbers of cells. In order to demonstrate the robustness of our model for different dataset size, we randomly selected different proportions of single-cell samples from the whole cell population. Besides, we selected the top 90% variable genes (features) to avoid errors raised by pcaReduce. We again used the cells enriched from PBMCs to illustrate the superior performance of VPAC for discrete UMI counts of different dataset size. As illustrated in Fig. 3a, our method significantly outperforms the four baseline methods and achieves superior performance across datasets of different numbers of samples. As the number of samples increased, the performance of Para_DPMM deteriorates, while that of SC3 decreases dramatically since the number of cells exceeds about 3 thousand, which confirms the previous results.
We further illustrated the robustness of our model for TPM-normalized data using a dataset of 24,649 cells from 12 clusters. As shown in Fig. Fig.3b,3b, VPAC is still superior to the baseline methods and achieves more stable performance across datasets of different numbers of samples. The ARI scores of VPAC are higher than that of Seurat for a large margin, even that Seurat achieves comparable NMI scores. The performance of SC3 again deteriorates as the number of samples increased. The performance of pcaReduce is still very unstable even outperforms Para_DPMM. The above results demonstrate that VPAC can also scale up well with different dataset size whether the scRNA-seq data is discrete or continuous.
VPAC scales up well with different data sparsity
Single-cell gene expression data suffer from the high sparsity problem because they contain an abundance of dropout events that lead to zero expression measurements. In order to illustrate the superior ability of our model to account for the sparsity of scRNA-seq data, we randomly zeroed different proportions of non-zero items five times on a TPM-normalized dataset of 301 cells with about 2.7×105 reads per cell [31]. As illustrated in Fig. 4, VPAC is superior to the baseline methods and achieves stable performance. One-sided paired-sample Wilcoxon signed rank test consistently suggest that our metty hod achieves higher ARI scores (p-values=5.05e-5 for pcaReduce and 4.98e-5 for SC3) and NMI scores than a baseline method (p-values=8.60e-4 for pcaReduce and 2.25e-5 for SC3). The performance of VPAC, pcaReduce and SC3 obviously deteriorate when we set 80% of the non-zero expression data to zero. Seurat achieves relatively poor performance, while Para_DPMM again fails on clustering the TPM-normalized dataset. The results demonstrate that our model has the ability to account for the sparsity of scRNA-seq data, and thus benefits the clustering tasks with the exponential growth of the Drop-seq based single-cell transcriptomic data.
Applications of VPAC
To demonstrate potential applications of VPAC, we collected a dataset of 742 dendritic cells from six clusters [32, 33]. By setting the number of clusters to 6 and the dimensionality of latent space to 10, VPAC achieved an ARI of 0.876 and an NMI of 0.898. Actually, the items of α with large values in VPAC will result in ‘switching off’ the corresponding column of the matrix W, and thus provide guidance for determining the appropriate dimensionality for the latent space. As shown in Fig. 5, the visualization of the projection matrix W (the left one) indicates there are four principal directions in the latent space, while the visualization of the projection matrix obtained by classical PCA (the right one) cannot effectively reveal the number of principal directions. Therefore, by setting the dimensionality of latent space to 5 (in practice we have found that one more dimensionality, which may contain some additional information, tends to give better results), VPAC achieved an ARI of 0.875 and an NMI of 0.903, which are approximately equal to the results with the dimensionality of latent space setting to 10, which means the only one parameter we should determine for VPAC is the number of clusters.
We used the t-SNE algorithm to visualize the dimension-reduced data in the latent space of VPAC by projecting the data into a two-dimensional space so that certain hidden structures can be depicted intuitively. Note that t-SNE is a visualization tool, and it is not intended to be used for clustering scRNA-seq data [18]. As shown in Fig. 6a, VPAC accurately clustered different cell types accounting for cells from different batches. It is worth noting that VPAC intended to cluster DC2 and DC3 cell types together while inferring a small cluster (in the bottom) containing some individual cells. According to the original research for this dataset, it is reasonable to cluster DC2 and DC3 together because both of them correspond to new subdivisions of the CD1C/BDCA-1+cDC2, while DC1 corresponds to the cross-presenting CD141/BDCA-3+ cDC1, DC4 corresponds to CD1C−CD141−CD11C+ DC, DC5 is a unique DC subtype, AS DCs, and DC6 corresponds to the interferon-producing pDC [32]. With this understanding, we set the number of clusters to 5. VPAC successfully clustered the DC2/3 cluster and achieved an ARI of 0.892 and an NMI of 0.943. In addition, as illustrated in Fig. Fig.6b,6b, VPAC accounted for batch effects even there are still very few samples wrongly clustered, which means that our method is robust to scRNA-seq data from different batches.
We further conducted gene co-expression network analysis to identify which genes have a tendency to show a coordinated expression pattern of the DC2/3 cluster inferred by VPAC. By calculating the Pearson correlation coefficient (PCC) of each gene pairs, we plotted a co-expression network with a threshold 0.4 of PCC [36]. As illustrated in Fig. 7, the three genes, namely S100A9, S100A8 and CD14, with the highest degree of connectivity are expected to be drivers required for signaling pathways of essential functions. It is worth noting that, in the original research, these three genes were claimed as acute and chronic inflammatory genes that play as a strong unique signature to distinguish the DC2/3 cluster, which means our method has the ability to reveal the genes with strong unique signatures of a specific cell type.
It is worth noting that VPAC successfully clustered the novel DC5 samples detected by the original research. In order to study the potential mechanism implied in VPAC, we further analyzed the samples in the latent space inferred by VPAC. As shown in Fig. 8, the fifth items in latent vectors of the DC5 cluster are much larger than that of other clusters, which means that the fifth direction of the latent space can effectively distinguish the DC5 cluster. We sorted the absolute values of the fifth column in projection matrix W, which are the absolute weights of linear combinations of genes, and found that the two genes with the highest absolute weights (0.290 and 0.288) are AXL and SIGLEC6, which have much higher absolute weight than the successive gene CX3CR1 with a weight of 0.225. Interestingly, AXL and SIGLEC6 serve as unique markers for the population emerged from the unbiased cluster analysis (cluster DC5) according to the original literature, which further demonstrates that our method has the inherent ability to detect genes with strong unique signatures of a specific cell type while accurately clustering.
Discussion
scRNA-Seq technologies have advanced rapidly in recent years and enable the quantitative characterization of cell types based on transcriptome profiles. VPAC is proposed to identify putative cell types using unsupervised clustering. The superiority of our method over other baseline methods, such as Para_DPMM and pcaReduce, is mainly attributed to the variational projection while constraining single-cell samples to follow a Gaussian mixture distribution in the latent space. In addition, with the hierarchical prior over the projection matrix, VPAC can automatically determine the appropriate dimensionality for the latent space to avoid discrete model selection, which means the only one parameter we should determine for VPAC is the number of clusters. The comprehensive experiments demonstrate the generality and robustness of our model.
Our model can certainly be improved in some aspects. First, a more complex non-linear projection can be introduced to better model the scRNA-seq data. Second, more constraints or assumptions can be included in our model considering the characteristics of scRNA-seq data, such as the high sparsity. Recent studies have shown the introduction of zero-inflated assumption can effectively model the dropout evens of single-cell data [26, 37], which may also improve the performance of our model. Third, our model can be extended to incorporate other types of functional genomics data such as chromatin accessibility. For example, in the literature, the method of coupled nonnegative matrix factorizations performs clustering by the integrative analysis of scRNA-seq and single-cell ATAC-sequencing data [38]. Finally, the performance and efficiency of the model may be further improved by parameters inference using stochastic optimization and deep neural networks, which have been shown to be effective in statistical models for single-cell data analysis [39, 40].
Conclusions
We have proposed a model-based algorithm, named VPAC, for accurate clustering of single-cell transcriptomic data through variational projection. Benefitting from the variational projection while constraining single-cell samples to follow a Gaussian mixture distribution in the latent space, VPAC is superior to existing methods in the clustering of datasets of discrete counts, normalized continuous data, various data dimensionality, different dataset size and different data sparsity. We have further demonstrated the ability of VPAC to detect genes with strong unique signatures of a specific cell type, which may shed light on the studies in system biology. Eventually, with the explosive growth of scRNA-seq data, we expect that such a statistical approach will provide us with superior performance and be widely applicable.
Acknowledgements
We thank Yong Wang for his helpful discussions. Rui Jiang is a RONG professor at the Institute for Data Science, Tsinghua University.
Funding
Publication costs are funded by the National Key Research and Development Program of China (No. 2018YFC0910404), the National Natural Science Foundation of China (Nos. 61873141, 61721003, 61573207, 71871019 and 71471016), and the Tsinghua-Fuzhou Institute for Data Technology.
Availability of data and materials
The datasets supporting the conclusions of this article are publicly available from 10xgenomics (https://support.10xgenomics.com/single-cell-gene-expression/datasets), the NCBI Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/), the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/), and the Single Cell Portal (https://portals.broadinstitute.org/single_cell).
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 7, 2019: Selected papers from the 12th International Conference on Computational Systems Biology (ISB 2018). The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-20-supplement-7.
Abbreviations
ARI | Adjusted rand index |
CAVI | Coordinate ascent variational inference |
ELBO | Evidence lower bound |
FPKM | Fragments per kilobase per million mapped reads |
NMI | Normalized mutual information |
PBMCs | Peripheral blood mononuclear cells |
PCA | Principal component analysis |
PCC | Pearson correlation coefficient |
PPCA | Probabilistic principal components analysis |
scRNA-seq | Single-cell RNA-sequencing |
SNN | Shared nearest neighbors |
TPM | Transcripts per million mapped reads |
t-SNE | t-Distributed Stochastic Neighbor Embedding |
UMI | Unique molecular identifiers |
Authors’ contributions
RJ designed the research. SC and KH designed and implemented the models. SC and HC collected data and analyzed the results. SC, KH, HC and RJ wrote the manuscript. All authors read and confirmed the manuscript.
Notes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Shengquan Chen, Email: nc.ude.auhgnist.sliam@71qcc.
Kui Hua, Email: nc.ude.auhgnist.sliam@41kauh.
Hongfei Cui, Email: nc.ude.auhgnist.sliam@60fhiuc.
Rui Jiang, Email: nc.ude.auhgnist@gnaijiur.
References
Articles from BMC Bioinformatics are provided here courtesy of BMC
Full text links
Read article at publisher's site: https://doi.org/10.1186/s12859-019-2742-4
Read article for free, from open access legal sources, via Unpaywall: https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-019-2742-4
Citations & impact
Impact metrics
Alternative metrics
Smart citations by scite.ai
Explore citation contexts and check if this article has been
supported or disputed.
https://scite.ai/reports/10.1186/s12859-019-2742-4
Article citations
Celda: a Bayesian model to perform co-clustering of genes into modules and cells into subpopulations using single-cell RNA-seq data.
NAR Genom Bioinform, 4(3):lqac066, 13 Sep 2022
Cited by: 16 articles | PMID: 36110899 | PMCID: PMC9469931
EnClaSC: a novel ensemble approach for accurate and robust cell-type classification of single-cell transcriptomes.
BMC Bioinformatics, 21(suppl 13):392, 17 Sep 2020
Cited by: 3 articles | PMID: 32938367 | PMCID: PMC7496207
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
GEO - Gene Expression Omnibus
- (1 citation) GEO - GSE81861
IGSR: The International Genome Sample Resource
- (1 citation) IGSR/1000 Genomes - GM12878
Nucleotide Sequences
- (1 citation) ENA - SRP041736
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
EnClaSC: a novel ensemble approach for accurate and robust cell-type classification of single-cell transcriptomes.
BMC Bioinformatics, 21(suppl 13):392, 17 Sep 2020
Cited by: 3 articles | PMID: 32938367 | PMCID: PMC7496207
DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data.
Bioinformatics, 34(1):139-146, 01 Jan 2018
Cited by: 27 articles | PMID: 29036318 | PMCID: PMC6454475
Autoencoder-based cluster ensembles for single-cell RNA-seq data analysis.
BMC Bioinformatics, 20(suppl 19):660, 24 Dec 2019
Cited by: 20 articles | PMID: 31870278 | PMCID: PMC6929272
Machine learning and statistical methods for clustering single-cell RNA-sequencing data.
Brief Bioinform, 21(4):1209-1223, 01 Jul 2020
Cited by: 79 articles | PMID: 31243426
Review