Abstract
Free full text
Single-cell topological RNA-Seq analysis reveals insights into cellular differentiation and development
Abstract
Transcriptional programs control cellular lineage commitment and differentiation during development. Understanding cell fate has been advanced by studying single-cell RNA-seq, but is limited by the assumptions of current analytic methods regarding the structure of data. We present single-cell topological data analysis (scTDA), an algorithm for topology-based computational analyses to study temporal, unbiased transcriptional regulation. Compared to other methods, scTDA is a non-linear, model-independent, unsupervised statistical framework that can characterize transient cellular states. We applied scTDA to the analysis of murine embryonic stem cell (mESC) differentiation in vitro in response to inducers of motor neuron differentiation. scTDA resolved asynchrony and continuity in cellular identity over time, and identified four transient states (pluripotent, precursor, progenitor, and fully differentiated cells) based on changes in stage-dependent combinations of transcription factors, RNA-binding proteins and long non-coding RNAs. scTDA can be applied to study asynchronous cellular responses to either developmental cues or environmental perturbations.
Introduction
The differentiation of motor neurons from neuroepithelial cells in the vertebrate embryonic spinal cordis a well characterized example of cellular lineage commitment and terminal cellular differentiation1. Neural precursor cells differentiate in response to spatiotemporally regulated morphogen gradients that are generated in the neural tube by activating a cascade of specific transcriptional programs1. A detailed understanding of this process has been hindered by the inability to isolate and purify sufficient quantities of synchronized cellular subpopulations from the developing murine spinal cord. Although in vitro approaches have been used to study both the mechanisms of motor neuron differentiation2, and motor neuron disease3, 4, alimitation of these approaches is the differential exposure of embryoid bodies (EBs) to inductive ligands and uncharacterized paracrine signaling within EBs, which lead to the generation of heterogeneous populations of differentiated cell types5. Motor neuron disease mechanisms are currently studied in a heterogeneous background of cell types whose contributions to pathogenesis are unknown. Methods to analyse the transcriptome of individual differentiating motor neurons could provide fundamental insights into the molecular basis of neurogenesis and motor neuron disease mechanisms.
Single-cell RNA-sequencing carried out over time enables the dissection of transcriptional programs during cellular differentiation of individual cells, thereby capturing heterogeneous cellular responses to developmental induction. Several algorithms for the analysis of single-cell RNA-sequencing data from developmental processes have been published, including Diffusion Pseudotime6, Wishbone7, SLICER8, Destiny9, Monocle10, and SCUBA11 (Supplementary Table 1). All of these methods can be used to order cells according to their expression profiles, and they enable the indentification of lineage branching events. However, Destiny9 lacks an unsupervised framework for determining the transcriptional events that are statistically associated with each stage of the differentiation process; and the statistical framework of Diffusion Pseudotime, Wishbone, Monocle, and SCUBA is biased, for example by assuming a differentiation process with exactly one branch event6, 7 or a tree-like structure10, 11. Although these methods can reveal the lineage structure when the biological process fits with the assumptions, an unsupervised method would be expected to have the advantage of extracting more complex relationships. For example, the presence of multiple independent lineages, convergent lineages, or the coupling of cell cycle to lineage commitment. Moreover, apart from SCUBA, these methods do not exploit the temporal information available in longitudinal single cell RNA-sequencing experiments, and they require the user to explicitly specify the least differentiated state6-10.
We present an unbiased, unsupervised, statistically robust mathematical approach to single cell RNA-sequencing data analysis that addresses these limitations. Topological data analysis (TDA) is a mathematical approach used to study the continuous structure of high-dimensional data sets. TDA has been used to study viral re-assortment12, human recombination13, 14, cancer15, and other complex genetic diseases16. scTDA is applied to study time-dependent gene expression using longitudinal single-cell RNA-seq data. Our scTDA method is a statistical framework for the detection of transient cellular populations and their transcriptional repertoires, and does not assume a tree-like structure for the expression space or a specific number of branching points. scTDA can be used to assess the significance of topological features of the expression space, such as loops or holes. In addition, it exploits temporal experimental information when available, inferring the least differentiated state from the data.
Here we apply scTDA to analyse the transcriptional programs that regulate developmental decisions as mESCs transition from pluripotency to fully differentiated motor neurons and concomitant cell types.
Results
Overview of scTDA
Single-cell gene expression can be represented as a sparse high-dimensional point cloud, with the number of dimensions equivalent to the number of expressed genes (~10,000). Extracting biological information from such data requires a reduction in the dimensionality of the space. Widely-used algorithms, such as multidimensional scaling (MDS), independent component analysis (ICA), and t-distributed stochastic neighbor embedding (t-SNE), have been adapted to flow cytometry, mass spectrometry11, 17, and single-cell RNA-seq measurements10, 18. These strategies, however, are not designed to preserve continuous relationships in high dimensions. Figure 1a represents a simple example of a one-dimensional continuous manifold (a circle) twisted in three dimensions. Reduction to two dimensions using these algorithms introduces artifacts in the low-dimensional representation, including artifactual intersections (in MDS and ICA), and tearing apart the original continuous structure (in t-SNE). As cell differentiation is a continuous asynchronous process, we reasoned that longitudinal single-cell data would be best analyzed using a mathematical approach that accounts for continuous structures.
We developed a computational approach to the analysis of longitudinal single-cell RNA-seq data based on the TDA algorithm Mapper19 (Supplementary Note 1). Mapper builds upon any given dimensional reduction algorithm, such as MDS, and produces a low-dimensional topological representation of the data that preserves locality. To that end, the projection obtained by the dimensional reduction algorithm is covered with overlapping bins, and clustering of the data within each bin is performed in the original high-dimensional space (Fig. 1b). A network is constructed by assigning a node to each cluster, and clusters that share one or more cells are connected by an edge. The result is a low-dimensional network representation of the data where nodes represent sets of cells with similar global transcriptional profiles, and edges connect nodes that have at least one cell in common (Fig. 1b). An important feature of these networks is that elements that are connected in the low-dimensional representation lie near each other in the original high-dimensional expression space, contrary to what occurs with most of the dimensional reduction algorithms currently in use, including PCA, ICA, and MDS (Fig. 1a). The scTDA approach can be applied even if there are complex structures in the high-dimensional expression space, including non-tree-like trajectories (Fig. 1a). Additionally, since nodes represent clusters of cells, the approach is highly scalable to large datasets.
We adapted topological representations so that they could be used to analyse longitudinal single-cell RNA-seq data, introducing the necessary mathematical concepts and statistics (Supplementary Fig. 1). To identify expression programs associated with different parts of the topological representation without predefining any cellular population, we developed the concept of gene connectivity. The connectivity of a gene acquires a significant value when cells that share similar global transcriptional profiles express that particular gene more than random (Fig. 1c and Online Methods). Genes with a significant connectivity are expected to be involved in a particular stage (or stages) of differentiation. To identify the least differentiated state in the topological representation, we introduced a root node, which is the node that maximizes correlation between sampling time and graph distance. Using this root node as a reference, we computed the centroid (Fig. 1d) and dispersion (the standard deviation) of genes in the topological representation (Online Methods). Genes with low centroids are upregulated in stem-like cells, whereas genes with large centroids are specific to fully differentiated cells.
To identify transient cellular subpopulations arising during differentiation, we clustered low-dispersion genes with a significant connectivity in the topological representation based on their centroid (Davies-Bouldin criterion, Fig. 1e). Finally, to assess the topological features of the representation, such as loops and holes, we used persistent homology, a framework within TDA for deriving and classifying topological features associated to data (Online Methods, Supplementary Note 1). Further details of the scTDA method and its mathematical foundations can be found in the Online Methods and Supplementary Note 1. A code implementing the scTDA method is provided as Supplementary Code.
scTDA for temporal ordering of differentiating cells
We assessed the capacity of scTDA to order asynchronously differentiating cells using a controlled setting where the truth is known. First, we simulated a noisy, branched, asynchronous cellular differentiation process with two branching points (Fig. 2a and Online Methods). 700 cells were sampled at three time points and the expression levels of 500 genes were simulated. From this data, scTDA correctly reconstructed the topological structure of the differentiation tree, and identified the most stem-like state (Fig. 2b). In contrast, the algorithms Diffusion Pseudotime6, Wishbone7 and SLICER8 failed to correctly assign some of the branches in our hands, producing artifacts in the inferred pseudo-temporal ordering of the cells (Figs. 2c to 2e), even if the most stem-like state was provided by the user of these algorithms.
To quantify the performance of scTDA, we extended the above simulations to processes with one, two, or three branching points (Online Methods). scTDA showed higher correlation between the inferred pseudo-time and the actual simulated differentiation time compared to the other three methods, thus reconstructing the underlying differentiation process more accurately from the data (Fig. 2f). The improvement was particularly large when drop-out events were included in the simulation (Fig. 2g), where the graphical representations produced by Diffusion Pseudotime, SLICER, and Wishbone were often unable to correctly infer the structure of the differentiation tree (Supplementary Fig. 2).
Motor neuron differentiation analysis with scTDA
We applied scTDA to longitudinal single-cell RNA-seq data from in vitro motor neuron differentiation experiments. We differentiated mESCs into motor neurons using a well-established protocol2. The mESC line expresses enhanced green fluorescent protein (eGFP) under the control of the early motor neuron-specific promoter Mnx1, allowing identification of committed motor neurons. Using a modified CEL-seq approach20 (Online Methods and Supplementary Note 2), we generated cDNA libraries from individual cells, spanning the transition from pluripotency to post-mitotic commitment. To assess reproducibility, we performed two separate differentiations. In the first differentiation (experiment 1) we sequenced 440 cells sampled throughout the entire course of motor neuron differentiation. Cells were concomitantly sampled in bulk at each time point. Using this data, we assessed library saturation and the optimal depth of sequencing coverage. In the second differentiation (experiment 2) we sought to increase statistical power and sequenced 2,304 individual with the depth of sequencing coverage near the saturation point (~250k reads/cell). In both experiments cells were sampled each day throughout the entire course of the differentiation, from day two to six. In total we detected 15,609 and 19,009 transcripts respectively across experiments 1 and 2. Given the improved statistical power associated with an increased number of cells, we focused our attention on the results from experiment 2, and assessed the reproducibility of the results using experiment 1.
Cell populations and surface markers of differentiation
We used scTDA to analyze the two longitudinal motor neuron differentiation single-cell RNA-seq datasets. We filtered the data for cells that passed stringent quality control tests (Online Methods, Supplementary Fig. 3), retaining, respectively, 373 and 1,964 individual cells in experiments 1 and 2. An scTDA analysis recapitulated chronological order based on mRNA expression levels alone (Figs. 3a and 3b, and Supplementary Figs. 4 and and5),5), and simultaneously detected detailed transcriptional relationships between individual cells. We observed only a mild dependence of library complexity on timescale of our differentiation (Supplementary Fig. 6, Online Methods), unlike other experimental settings where differences in cell cycle give rise to substantial variation of the library complexity21, 22. We did not observe large batch effects (Supplementary Fig. 7). Compared with analyses of the same data sets using PCA, MDS, t-SNE, Monocle, Wishbone, Slicer and Diffusion Pseudotime, we showed that scTDA had superior ability to identify the continuous chronological structure of the differentiation process (Fig. 3c and Supplementary Figs. 8 and 9).
Analysis of the data from experiment 2 identified 7,620 genes with significant gene connectivity (q-value < 0.05, Supplementary Fig. 10 and Supplementary Table 2), compromising 74% overlap with those identified from experiment 1 (p-value < 10-100, Fisher exact test, Fig. 3d). It is likely that the large number of significant genes is due to the transcriptional heterogeneity of the dataset, because the samples of cells represent multiple developmental stages. The centroids of the significant genes were consistent between the two datasets (Pearson correlation = 0.9, p-value < 10-100, Fig. 3d). We validated our findings by identifying known markers of pluripotent cells, progenitor cells and mature motor neurons in the dataset over time (Fig. 3e and Supplementary Fig. 4c). We also found that cell cycle genes were downregulated (as expected) in post-mitotic neurons (Supplementary Fig. 11). We verified that the choice of parameters for building the topological representation (number of genes, bins, and overlap between bins) did not affect the result of the analysis (Supplementary Fig. 12).
Interpretation of the distribution of the centroid and the dispersion of genes (Supplementary Fig. 13) using scTDA revealed four transcriptionally distinct cellular populations arising during differentiation from mESCs to motor neurons (Fig. 4a). 488 of 7,620 genes were identified in the four expression groups (Fig. 4b, and Supplementary Table 2). Groups 1a and 1b contain genes that are only expressed in early EBs (Oct3/4+ cells), and correspond to pluripotent and neural precursor states. Genes in groups 2 and 3 are only expressed in the progenitor (Olig2+ cells) and post-mitotic ensembles (VAChT+ cells), respectively.
Ontology enrichment analyses using PANTHER23 revealed developmental genes and genes related to DNA replication in groups 1a, 1b, and 2, whereas group 3 was enriched for genes related to axonogenesis, neuron migration, and regionalization (Supplementary Table 3), consistent with the underlying cellular differentiation process.
The analysis identified 74 transcripts in the four groups that encode proteins with an extracellular domain. These are therefore putative cell surface marker candidates that could be used to purify niche populations for further study (Supplementary Table 2). Several of these surface markers have not been reported as markers of neurogenesis. We selected three of these 74 markers that had high expression level and specific antibodies available for experimental validation. The three markers were validated using immunohistochemistry in vitro in EBs (Fig. 4c, and Supplementary Fig. 14) and in vivo in the murine embryonic spinal cord (Fig. 4d).
scTDA identifies distinct proliferative states
scTDA analysis of our motor neuron differentiation data revealed numerous loops in the neural precursor population (Fig 4a) that were not present in the mESCs or the motor neurons. These loops were also observed in the scTDA analysis of experiment 1 (Supplementary Fig. 4). Neural precursor cells proliferate rapidly upon induction with retinoic acid (RA). This is in keeping with biological expectations, given that the cell cycle will give rise to periodic structures in the expression space. To evaluate whether the loops in the topological representation of neural precursors are caused by differences in the cell cycle, we built a topological representation using only cell cycle genes (Supplementary Fig. 15a, Online Methods). The cell cycle topological representation contained substantially larger loops in the same region (Supplementary Fig. 15a), separating Stra8up neural precursors and progenitors into proliferative and non-proliferative populations according to the expression of Mki67 and other markers of proliferation (Supplementary Figs. 15b and 15c). We used persistent homology to statistically assess the significance of these loops in the topological representation, asking whether loops of similar size could arise from noise effects (Online Methods). This analysis showed a strong statistical significance for some of the larger loops (q-value < 0.05, Supplementary Fig. 15d), consistent with a biological, rather than technical, origin for these features.
Characterization of developmental transitions
We next used scTDA to characterize the developmental transitions that occur during the differentiation of motor neurons. These results are available through an online database (Online Methods). We used scTDA to identify transcriptional programs that are associated with the transitions between the four transient cellular populations that we identified (Fig. 5a). These findings were consistent across both experiments.
First, the transition from a pluripotent cell to a neural precursor population is characterized by the transcriptional dynamics of pathways involved in RA signaling and downstream effector proteins24, 25 (Fig. 5a). Our analysis resolved with unprecedented resolution the timing of the transcriptional events occurring during this transition, identifying upregulation of Stra8 and downregulation of Fgf4 as some of the earliest events that mark the transition (Supplementary Fig. 16). Subsequently, there is transcriptional activation of a subset of the homeobox gene family, including Hoxa1 and Hoxb2-8, which continue to be expressed during later stages of the differentiation process, and Hoxb1, which is transiently expressed along with caudalizing transcription factors (Figs. 5a and 5b, and Supplementary Fig. 17).
A second wave of RA inducible gene activation was identified during the formation of neural progenitors. This is accompanied by transcriptional inactivation of Stra8, and activation of Crabp1 and a second set of homeobox genes, Hoxa2, Hoxa3, Hoxc5, Hoxd3 and Hoxd4 (Fig. 5a). This pattern of Hox gene activation suggests that the linear chromosomal arrangement of the Hox gene clusters does not necessitate temporal co-linearity in anterior Hox gene expression, a phenomenon that has been reported in the developing spinal cord26. Both waves of homeobox gene expression were accompanied by the upregulation of several long non-coding RNAs (lncRNAs) derived from the opposite strand (Fig. 5b, Supplementary Fig. 17, and Supplementary Table 4), consistent with previously identified lncRNA-based regulation of homeobox gene clusters27-30.
The transition to mature neurons was marked by exit from the cell cycle and post-mitotic differentiation. In keeping with expectations from previous data31, scTDA identified Neurog1/2 (modulators of neuronal specification, cyclin regulation, cell cycle exit) as well as Ascl1 and Sox9 among the known mediators of neuronal commitment, concomitant with a marked repression of topoisomerase 2A in the neuron population (Fig. 5a). We observed variability in eGFP expression levels and fluorescence intensities among differentiated neurons, which might indicate heterogeneity in later stages of differentiation. We performed differential expression analysis of En1+, Gata3+, Vsx2+ and Egfp+ cells with in this cellular population (Online Methods), and classified differentiated neurons into motor neurons (n = 343), and V1 (n = 19, En1+), V2a (n = 10, Vsx2+) and V2b interneurons (n = 15, Gata3+) (Fig. 5c), confirming the presence of cellular heterogeneity.
Interestingly, we observed the presence of several lncRNAs in the differentially expressed genes. V1 interneurons specifically express Gm12688, an intergenic lncRNA located near Foxd3, and transcribed from the opposite strand. We validated this finding using qPCR (Supplementary Fig. 18). Similarly, we identified a lncRNA in chromosome 15 (NONCODEv4 accession number NONMMUG015572) which was only expressed in V2b interneurons. Gm14204, an intergenic lncRNA located near Slc32a1 and transcribed from the opposite strand was only expressed in V1 and V2b GABAergic interneurons. These results may suggest a role for lncRNAs in neural diversification.
Among the genes identified by scTDA as being associated with developmental transitions were 20 genesencoding RNA-binding proteins (Supplementary Table 2). These include known developmental state-dependent pre-mRNA splicing factors, as well as stage-specific but uncharacterized RNA binding proteins, which may guide cellular differentiation and post-mitotic commitment. In the context of the progenitor to neuron transition, our analysis identified Nova1/2, Rbfox3, Srrm4, and the Elav-like transcripts Elavl4 and Celf3 (Fig. 5a), consistent with previous data32, 33. scTDA further revealed the upregulation of Mex3b in the progenitor and post-mitotic cell populations, and constitutive expression of Ptbp1 and Ptbp2. Previously published studies have documented Srrm4 directed inclusion of neural specific exon 10 in Ptbp134. Our results may indicate a transcriptional switch in splicing factor regulation that culminates in neural specific splicing.
scTDA of heterogeneous cellular responses
Our analyses of mESC differentiation using scTDA showed that it can chronologically order asynchronous populations of single cells, while simultaneously preserving high-dimensional relationships between their transcriptional programs. We next tested whether scTDA could be applied to other data sets to provide new insights. We analyzed three different in vivo cellular differentiation datasets35-37 (Online Methods). In each case, scTDA produced an accurate reconstruction of developmental trajectories that confirmed and extended the published observations.
First, we examined 80 single cells from the differentiating distal lung epithelium of mouse embryo35 (Supplementary Table 5). All cells were sampled from the same time point (embryonic day E18.5). As shown in Figure 6a, scTDA recapitulated the proposed relationships between differentiating cells in both the bronchiolar and alveolar lineages. Moreover, our analysis uncovered a set of alveolar type I cells in which genes associated with mitochondrial respiration are down-regulated (Supplementary Fig. 19), which might indicate cellular stress or quiescence.
Second, we applied scTDA to 1,529 cells from human pre-implantation embryos36 (Supplementary Table 6). In this case, scTDA correctly ordered cells according to embryonic developmental time, and identified the segregation of several lineages, including the inner cell mass, the early trophectoderm, and a polar trophectoderm36 (Fig. 6b). In contrast with the published analysis, the identification and characterization of these cellular populations was completely unsupervised, highlighting the value of scTDA for characterizing cell differentiation processes, a priori.
Finally, we used scTDA to analyse 272 differentiating neurons in the mouse neo-cortex37 (Supplementary Table 7). scTDA identified a continuum of cellular states with a bifurcation between apical and basal progenitors, and neurons (Fig. 6c). The topological representation more accurately reflects a basal to apical progenitor migration and a split in potency, with apical progenitors sharing transcriptional profiles that more closely reflect their neuronal counterparts. scTDA faithfully represented nonlinear and converging cellular lineages. The convergence of distinct lineages cannot be identified by tree-like based algorithms.
Discussion
In order to preserve the continuity of cellular differentiation, account for asynchronous development, and enable statistical interpretation of patterns of gene activation we used a method that exploits topological data analysis. Our method, named scTDA, enables unsupervised analyses of single-cell RNA-seq datasets over time.
We applied scTDA to study the in vitro differentiation of mESCs into neurons1, 2 and showed that it accurately recapitulated cellular differentiation, and provided the statistics needed to identify and characterize the transcriptional programs that accompany lineage commitment. We comprehensively characterized the dynamic appearance of mRNAs encoding signaling proteins, transcription factors, RNA splicing factors and long non-coding RNAs (lncRNAs). These transcripts were dynamically regulated during the transition from pluripotent cells to neural precursors, progenitors, motor neurons, and interneurons, thus providing a valuable resource for studies of stem cell differentiation and neurogenesis, and more generally to any cellular differentiation process amenable to single cell transcriptomic analysis. Inclusion of topological data enabled scTDA to identify additional processes coupled to differentiation, such as the cell cycle, which cannot be captured by tree-like structures. Furthermore, scTDA identified extensive transcriptional co-regulation of thousands of coding and noncoding genes in precursor, progenitor and neuronal populations. Some of these, such as the surface markers Pecam1, Ednrb, and Slc10a4, respectively for the pluripotent, neural progenitor, and mature motor neuron populations, as well as the unique expression of the lncRNA Gm12688 by V1 interneurons were validated.
In principle, scTDA can be applied to any biological system responding to inductive cues or environmental perturbations (Fig. 6). For example, scTDA could be used to study cellular differentiation processes such as hematopoiesis, the evolution of cancer cells, neurodegeneration, or developmental disorders, all of which arise from extracellular signals and mutations that culminate in heterogeneous transcriptional responses and cellular behavior.
Accession Codes
We deposited the single-cell RNA-seq data and read counts from the two motor neuron differentiation experiments in the NCBI Gene Expression Omnibus (GEO) database, with accession number GSE94883. The single-cell RNA-seq data from the differentiating distal lung epithelium of mouse embryo35 can be accessed through the GEO database, accession number GSE52583. The single-cell RNA-seq data from human pre-implantation embryos36 can be accessed through the Array Express database, with accession number E-MTAB-3929. The single-cell RNA-seq data from differentiating neurons in the mouse neo-cortex37 can be accessed at https://vpn.ias.edu/science2016/,DanaInfo=genebrowser.unige.ch+telley_govindan_science2016.tgz.
Online Methods
Cell culture and single cell isolation
Murine embryonic stem cell (mESC) based differentiations were performed using the method of Wichterle et al.2. In brief, stem cell colonies are expanded on an adherent substrate, after which they are dissociated and monodispersed in a serum-free suspension (day 0). Individual stem cells aggregate into embryoid bodies (EBs), which maintain exclusive expression of pluripotency markers until they are induced down the neuron lineage by addition of RA and Smoothened Agonist (SAG) (day 2). Metabolized culture medium is replenished on day 5, coincident with the appearance of early eGFP positive cells within the EBs. EBs were dissociated into single cells using the Worthington Biochemical Papain Dissociation System (LK003178). Single cell deposition was accomplished using a Beckman Coulter MoFlo Astrios EQ cell sorter into 96 well plates. Cells were then snap-frozen and subsequently lysed.
Single cell library generation
To obtain single cell expression profiles, modified CEL-Seq20 was carried out as described in the Supplementary Note 2. Briefly, we reverse transcribed the mRNA in each cell lysate with barcoded primers, pooled the single cell samples, synthesized second strand cDNA, and then linearly amplified by in vitro transcription with T7 RNA polymerase. The amplified RNA (aRNA) was chemically fragmented and T4 RNA ligase was used to ligate Illumina 3′-RNA adapters. The aRNA was then run on an Agilent Bioanalyzer to assess proper fragmentation and then reverse transcribed to generate cDNA and subjected to PCR enrichment. The subsequent multiplexed samples were paired end sequenced (2×125 bps) on an Illumina HiSeq 2500. Bulk RNA samples were purified from 2×106 cells in 1 mL of Trizol and standard Qiaquick RNA extraction protocols with a RIN 9.8 or higher. Stranded RNA-seq libraries were generated at the New York Genome Center using a TruSeq Stranded Total RNA Library Prep Kit. Stranded cDNA libraries were paired end sequenced (2×125 bps) on a HiSeq 2500, operating in high output mode, yielding 30M reads per indexed library.
Immunofluorescence
EBs were washed three times in ice cold PBS and fixed for two hours at room temperature in 4% PFA. They were then washed with ice cold PBS and either embedded into OCT and stored at -80°C for sectioning, or stored in PBS at 4°C for whole EB staining. 20um sections were cut using a Leica CM 1950 cryotome and mounted onto a glass slide. Whole and sectioned EBs were incubated in blocking solution (0.1% tween-20, 5% donkey serum, 1% BSA) for two hours at room temperature followed by primary antibodies diluted in blocking solution overnight. Primary antibodies used were: goat-αSlc9a1 (Santa Cruz sc-16097, 1:100), rabbit-αOlig2 (Millipore AB9610, 1:100), mouse-αOlig2: clone 21F1.1 (Millipore MABN50, 1:100), rabbit-αEdnrb (Novus NLS54, 1:200), guinea pig-αVAChT (gift from Dr. Neil Schneider's lab, 1:400), rabbit-αSlc10a4 (Novus Biologicals NBP1-81134, 1:100), mouse-αPou5f1 (BD Biosciences 611202, 1:100), goat-αPECAM1 (Santa Cruz sc-1506, 1:100), sheep-αCD44 (R&D Systems AF6127, 1:250), goat-αFoxc1 (Santa Cruz sc-21396, 1:50), and goat-αSstr2 (Santa Cruz sc-11606, 1:50). Alexa Fluor-conjugated secondary antibodies from Life Technologies were used at a 1:1000 dilution for two hours at room temperature. CD44, Foxc1 and Sstr2 were respectively conjugated to Cy-5, Alexa-405, and Cy-3 fluorophores using the DyLight system from Abcam (ab201798, ab188287, ab188288). Coverslips were mounted with Vectashield and EBs were imaged on an Olympus Fluoview FV1000 microscope using Olympus Fluoview v4.1.
Processing of single cell RNA-seq data
Paired-end 125bp reads were de-multiplexed, trimmed, and mapped to the UCSC mouse reference (mm10) using Tophat38. Gene expression was quantified using transcript read counts as derived from HTSeq39. Read counts were normalized as,
where ri denotes the unambiguous read count for transcript i, and r denotes the total number of reads that are mapped to transcripts in the cell. A strategy based on spike-in read counts, as described by Stegle et al.40, was implemented to filter out cells with a low content of mapped RNA and/or low sequencing depth. Specifically, the ratio 1 between average spike-in reads in the cell and average spike-in reads in the library (for spike-ins with an average of more than five reads in the library) was used to discard cells with low sequencing coverage. Similarly, the ratio 2 between the total number of spike-in reads in the cell and the total number of mapped reads was used to discard cells with a low number of mapped reads, relative to the sequencing coverage. Those cells showed very low expression of house-keeping genes, possibly representing cells under stressed conditions and/or with large amounts of degraded RNA. Based on the distribution of 1 and 2 in each of the two experiments (Supplementary Fig. 2a), cells with 1 > 5 · 10−4 and 0.7 > 2 > 0.01 in the experiment 1, and cells with 4.0 > 1 > 0.05 and 1.0 > 2 in experiment 2, were kept for subsequent analysis. The distribution of filtered out cells across different libraries was uniform in both experiments (Supplementary Fig. 2b). To reduce the noise observed near detection threshold in experiment 1, read counts with ri < 5 were set to zero. Additionally, one of the libraries (RPI36) was discarded from subsequent analysis, because it presented a large batch effect. To assess the dependence of the library complexity on the differentiation time (Supplementary Fig. 3), we computed at each time point the distribution of the geometric library size, defined as the sum of log expression values over all genes in a cell22.
Topological representation
The algorithm Mapper19 (Supplementary Note 1) was used to build topological representations of the RNA-seq data, through the implementation provided by Ayasdi Inc. Several open-source implementations of Mapper are available (https://github.com/MLWave/kepler-mapper, http://danifold.net/mapper/, https://github.com/RabadanLab/sakmapper, https://github.com/paultpearson/TDAmapper). In brief, the processed RNA-seq data was endowed with a dissimilarity matrix by taking pairwise correlation distance (1 - Pearson correlation). To minimize the effect of drop-out events present in single-cell data, we only considered the 5,000 genes (for experiment 1) and the 4,600 genes (for experiment 2) with highest variance across each dataset. These are highly expressed transcripts for which the probability of not being captured by the RNA amplification (drop-out events) is small41-44. The space was reduced to R2 using MDS, as displayed in Fig. 3c. A covering of R2 consisting of 26×26 (62×62) rectangular patches was considered for experiment 1 (2), respectively. The size of the patches was chosen such that the number of cells in each row or column of patches was the same, hence avoiding sampling density biases. The overlap between patches was 66% (on average). Single linkage clustering was performed in each of the pre-images of the patches using the algorithm described in Singh et al.19. A network was constructed in which each vertex corresponds to a cluster, and edges correspond to non-vanishing intersections between clusters. We checked for the absence of batch effects in the topological representation by verifying that batches from the same timepoint had a substantial overlap in the representation (Supplementary Figs. 3d and 7) and the stability against different choices for the threshold for the number of genes used to compute the distance matrix and for the covering of R2 (Supplementary Fig. 12).
We verified that our analysis was reproducible using a freely-available open-source implementation of Mapper. To that end, we repeated the analysis of experiment 1 using a topological representation built with Sakmapper (https://github.com/RabadanLab/sakmapper) (Supplementary Fig. 20). Comparison with the results of the scTDA analysis presented in the main text shows a clear consistency between the two analyses. In particular, 78% of the genes with significant gene connectivity (q-value < 0.05) in the representation generated with Sakmapper also displayed significant gene connectivity in the Ayasdi representation (Fisher exact test p-value < 10-100). The centroid and dispersion of those genes were also consistent between the two topological representations, with Pearson's correlation coefficient r = 0.98 (p-value < 10-100) for the centroid (Supplementary Fig. 20b), and r = 0.94 (p-value < 10-100) for the dispersion (Supplementary Fig. 20c) of significant genes.
Gene connectivity, centroid and dispersion within the topological representation
A notion of gene connectivity in the topological representation was introduced, defined as
where ei,α represents the average expression of gene i in node α of the topological representation, normalized as described in the paragraph “Processing of single cell RNA-seq data”; Γ denotes the set of nodes of the topological representation; Aαβ is its adjacency matrix; and N the total number of nodes. With this normalization, si takes values between 0 and 1.
The gene connectivity score depends on the distribution of expression values of the specific gene (Supplementary Fig. 10), and therefore genes cannot be ranked accordingly to their gene connectivity score in a meaningful way. To assess the magnitude of the connectivity score relative to genes with the same expression profile, and rank genes accordingly, we introduced a non-parametric statistical test. We tested for the null hypothesis of a randomly expressed gene with the same distribution of expression values having a higher gene connectivity score. To that end, a null-distribution was built for each gene i using a permutation test. Cell labels were randomly permuted 5,000 times for each gene, computing si after each permutation. A p-value was estimated by counting the fraction of permutations that led to a larger value of si than the original one. Gene connectivity and its statistical significance were computed for each gene expressed in at least three cells. The resulting p-values were adjusted for multiple testing by using Benjamini-Hochberg procedure for controlling the false discovery rate.
To establish a pseudo-temporal ordering within the topological representation, a notion of root node was introduced. The latter was defined as the node that maximizes the function,
where corr(x,y) denotes Pearson's correlation coefficient between x and y; dα is the graph distance function to node α, that assigns to each node of the topological representation a value corresponding to the number of edges that are crossed in the shortest path from that node to node α; and t is the chronological sampling time function, that assigns to each node of the topological representation the average sampling time (expressed in days) of the cells contained in the node.
Least-squares linear regression was performed to determine the best fit for the coefficients a0 and a1 in the relation
where droot is the graph distance function to the root node, determined in the previous paragraph. These coefficients were used to define the centroid and dispersion of each gene in the topological representation, expressed in days, and given respectively by
and
Such normalization, i.e. using coefficients a0 and a1 to express the centroid and dispersion in units of pseudo-time (days), allows comparing the connectivity and dispersion of a gene across different topological representations or studies.
Significance of topological features
We computed the first persistent homology group45, 46 using the graph distance of the topological representation. Given the pairwise distances of a set of points sampled from a space, persistent homology enables the quantification of topological features (connected components, loops, cavities, etc., preserved under continuous deformations of the space) compatible with the data at each scale. The first homology group, in particular, classifies loops of the space (Supplementary Note 1). We used persistent homology death times as a proxy of the size of the loops, and evaluated their statistical significance using a permutation test. To that end, we randomly permuted the labels of the genes 500 times, for each cell independently. For each permutation we built a topological representation using the same parameters as in the original representation and computed the first persistent homology group. A p-value for each of the loops was estimated from the distribution of the number of loops as a function of their death time. The resulting p-values were adjusted for multiple testing by using Benjamini-Hochberg procedure for controlling the false discovery rate.
Comparison with other methods for analyzing longitudinal single-cell RNA-seq data
We dimensionally reduced the processed single cell RNA-seq data of experiment 2 using MDS, ICA and tSNE. For MDS and ICA we used the same set of highly expressed, highly variant genes that we used for building the topological representation. For tSNE, we used the top 10 principal components of the gene space. In each representation we determined the cell that maximized the Pearson correlation coefficient between the two-dimensional Euclidean distance to the cell and chronological sampling time, corresponding to the least differentiated cellular state.
Additionally, we compared scTDA to the single cell software Monocle10, based on ICA and minimum-spanning trees, Diffusion Pseudotime6 and Wishbone7, based on diffusion coefficients, and SLICER8, based on locally linear embedding. We followed all recommendations in the documentation of these algorithms. In our tests, Monocle failed in running over the complete dataset from experiment 2, consisting of 2,304 cells, and only a partial set of 834 cells, sampled from all time points was analyzed.
Simulated data
Noisy, branched asynchronous cellular differentiation processes were simulated, from which 700 cell were sampled at three time points. To that end, we used the following strategy:
- We simulated a noisy branched tree-like structure with three parameters (t, u, and v), and performed a non-linear transformation into four-dimensional embedding space (spanned by the variables g(1), g(2), g(3), and g(4)). This space provides the structure for four different groups of correlated genes.
- We randomly sampled 700 points from this space, corresponding to 700 cells, and assigned a sampling day based on a multinomial distribution with probabilities given by a logistic function of t, to simulate asynchrony.
- We simulated the expression of 300 genes driven by the variables g(1), g(2), g(3), and g(4), in addition to 200 genes with non-correlated expression, sampled from normal distributions.
- We simulated the effect of drop-out events using the standard logistic dependence of the drop-out probability as a function of expression.
In what follows, we provide a detailed description of each of these steps.
First, we simulated four groups of correlated genes. These were defined by the equations,
where
where only the first 2 + r equations are considered in simulated differentiation processes with r lineage branching points. The variable ti represents the differentiation pseudo-time of cell i at the time of sampling. To simulate asynchrony, each sampled cell was assigned a sampling day accordingly to
where pi [0,1] is a random number uniformly distributed, and τ is the logistic function
We simulated 75 genes in each of the four groups of genes, with expression values given by
where
with
where m is the original expression value of the gene in the cell.
Apart from scTDA, we ran the algorithms Diffusion Pseudotime6, Wishbone7, and SLICER8 in the simulated datasets. We followed all the recommendations in the documentation of these algorithms. We assessed the performance of each algorithm by computing the Pearson's correlation coefficient between the cell pseudotime inferred by each of these algorithms and the simulated variable t.
Gene ontology annotation
Gene ontologies were obtained from EMBL-EBI QuickGO47. Specifically, categories GO:0006355 “Regulation of transcription, DNA-templated”, GO:0008380 “RNA splicing”, GO:0044822 “Poly(A) RNA binding”, GO:0051726 “Regulation of cell cycle”, and GO:0007049 “Cell cycle” were used to annotate genes. Expression of genes associated to DNA replication was analyzed by considering the 99 genes in the category GO:0006260 “DNA replication” expressed in less than 1,400 cells in experiment 2. Genes coding for proteins in the cellular surface were identified by looking in UniProtKB database48 for proteins annotated with an extracellular topological domain. Gene ontology enrichment analysis was performed using PANTHER classification system23.
Transient cellular populations
Low-dispersion genes (ki < 1.7 days and ki < 2.25 days respectively in experiments 2 and 1) with significant gene connectivity (q < 0.05) in the topological representation were clustered according to their centroid using k-means clustering (Supplementary Fig. 13b and 13c). The optimal number of clusters according to Davies-Bouldin index was four in experiment 2 (three in experiment 1) (Supplementary Fig. 13a), as it was also evidenced from visual inspection of the centroid distribution for low-dispersion genes. A state r = 1, …, 4 was assigned to each node of the topological representation based on the average expression of each cluster of low-dispersion genes in the node (Supplementary Fig. 13d). Genes with significant gene connectivity according to the permutation test described in the paragraph “Gene connectivity, centroid, and dispersion within the topological representation” were assigned to each of the four populations based on the number of cells expressing the gene in each state r. Only genes expressed in at least 80 cells and at most 1,500 cells were considered.
Analysis of long non-coding RNAs
The coordinates of intergenic and antisense lncRNAs were downloaded from NONCODEv449, and read counts were obtained using HTSeq (http://www-huber.embl.de/users/anders/HTSeq). The connectivity and statistical significance of each long non-coding gene in the topological representation was computed using scTDA. Only lncRNAs that were significant (q < 0.05) in both experiments 1 and 2, and that were supported by at least 50 reads in the longitudinal stranded RNA-seq data were kept. Curation was performed to remove lncRNAs whose 3′-end overlapped the 3′-end of another gene, gene assignment therefore being ambiguous, or that corresponded to possible miss-annotations of the 3′ UTR of a nearby gene.
Characterization of interneuron populations
Differential expression analysis between En1+, Gata3+, Vsx2+ and Egfp+ cells in nodes characterized as post-mitotic (Supplementary Fig. 13d, state 3) was performed using the software SCDE42 with default parameters.
Analysis of mouse and human developmental datasets
We downloaded the read count data of the single cell RNA-seq experiments performed in refs. 35 (mouse distal lung epithelium), 36 (human preimplantation embryos), and 37 (mouse neocortex), normalized it as described in the paragraph “Processing of single cell RNA-seq data”, and anlysed it following the same steps described in the paragraphs “Topological representation” and “Gene connectivity, centroid and dispersion within the topological representation”. The topological representations were built using the Ayasdi implementation of Mapper, using correlation distance as metric and two dimensional MDS as dimensional reduction algorithm. We used a covering of R2 consisting of 33×33, 52×52, and 30×30 rectangular patches with an average overlap of 73%, 70%, and 71%, respectively for the data from mouse distal lung epithelium, human preimplantation embryos, and mouse neocortex. In the case of the data from human preimplantation embryos, we also performed an analysis of transient cellular populations, following the same steps described in paragraph “Transient cellular populations” and setting a threshold of 2.95 in gene dispersion.
scTDA software
The algorithms described in this work were implemented and documented in an object oriented python library for topological data analysis of high-throughput longitudinal single-cell RNA-seq data, called scTDA. scTDA is provided as Supplementary Code and is publicly available at https://github.com/RabadanLab/scTDA.
Online database
We established a database for the topological representations and statistics of the two motor neuron differentiation datasets. The database is publicly available at https://rabadan.c2b2.columbia.edu/motor_neurons_tda.
Supplementary Material
1
2
3
4
5
6
7
8
Acknowledgments
We thank Tom Jessell, Nicole Francis, and Hemali Phatnani for critical reading of the manuscript. A.H.R. and T.M. thank the New York Genome Center and D. Goldstein for sequencing support, Susan Morton for providing Engrailed antibody and P. Sims for experimental discussions. P.G.C. and R.R. thank A. J. Levine, F. Abate, I. Filip, S. Zairis, U. Rubin, and P. van Nieuwenhuizen for useful comments and discussions, and O. T. Elliott for technical support with the online database. The work of P.G.C and R.R. is supported by the NIH grants U54-CA193313-01 and R01GM117591. The work of A.H.R., E.K.K., T.J.R. and T.M. is supported by ALS Therapy Alliance grant ATA-2013-F-056 and NIH grant NS088992.
Footnotes
Author Contributions: P.G.C. and R.R. developed the topology-based computational approach (scTDA) and applied it to single cell RNA sequencing data. A.H.R., E.K.K, and T.M. designed all experiments. A.H.R., E.K.K., and T.J.R. conducted experiments. I.S. conducted all flow cytometry. A.H.R., P.G.C., E.K.K., T.M., and R.R. analyzed the data and wrote the manuscript.
Competing Financial Interests Statement: The authors declare no competing financial interests.
References
Rk f−1 Rk Methods-only References
Full text links
Read article at publisher's site: https://doi.org/10.1038/nbt.3854
Read article for free, from open access legal sources, via Unpaywall: https://europepmc.org/articles/pmc5569300?pdf=render
Citations & impact
Impact metrics
Citations of article over time
Alternative metrics
Smart citations by scite.ai
Explore citation contexts and check if this article has been
supported or disputed.
https://scite.ai/reports/10.1038/nbt.3854
Article citations
The topology and geometry of neural representations.
Proc Natl Acad Sci U S A, 121(42):e2317881121, 07 Oct 2024
Cited by: 0 articles | PMID: 39374397
Multiscale differential geometry learning of networks with applications to single-cell RNA sequencing data.
Comput Biol Med, 171:108211, 28 Feb 2024
Cited by: 1 article | PMID: 38422960
Scbean: a python library for single-cell multi-omics data analysis.
Bioinformatics, 40(2):btae053, 01 Feb 2024
Cited by: 1 article | PMID: 38290765 | PMCID: PMC10868338
Topological and geometric analysis of cell states in single-cell transcriptomic data.
Brief Bioinform, 25(3):bbae176, 01 Mar 2024
Cited by: 1 article | PMID: 38632952 | PMCID: PMC11024518
Topological data analysis reveals core heteroblastic and ontogenetic programs embedded in leaves of grapevine (Vitaceae) and maracuyá (Passifloraceae).
PLoS Comput Biol, 20(2):e1011845, 05 Feb 2024
Cited by: 0 articles | PMID: 38315720 | PMCID: PMC10868772
Go to all (103) article citations
Other citations
Data
Data behind the article
This data has been text mined from the article, or deposited into data resources.
BioStudies: supplemental material and supporting data
Functional Genomics Experiments
- (1 citation) ArrayExpress - E-MTAB-3929
GEO - Gene Expression Omnibus (2)
- (1 citation) GEO - GSE52583
- (1 citation) GEO - GSE94883
Quick GO (5)
- (1 citation) GO - GO0008380
- (1 citation) GO - GO0006260
- (1 citation) GO - GO0006355
- (1 citation) GO - GO0051726
- (1 citation) GO - GO0007049
Similar Articles
To arrive at the top five similar articles we use a word-weighted algorithm to compare words from the Title and Abstract of each citation.
scPADGRN: A preconditioned ADMM approach for reconstructing dynamic gene regulatory network using single-cell RNA sequencing data.
PLoS Comput Biol, 16(7):e1007471, 27 Jul 2020
Cited by: 5 articles | PMID: 32716923 | PMCID: PMC7410337
Single-cell RNA-Seq analysis reveals dynamic trajectories during mouse liver development.
BMC Genomics, 18(1):946, 04 Dec 2017
Cited by: 48 articles | PMID: 29202695 | PMCID: PMC5715535
Single-cell quantification of a broad RNA spectrum reveals unique noncoding patterns associated with cell types and states.
Proc Natl Acad Sci U S A, 118(51):e2113568118, 01 Dec 2021
Cited by: 33 articles | PMID: 34911763 | PMCID: PMC8713755
Constructing cell lineages from single-cell transcriptomes.
Mol Aspects Med, 59:95-113, 26 Nov 2017
Cited by: 12 articles | PMID: 29107741
Review
Funding
Funders who supported this work.
NCI NIH HHS (1)
Grant ID: U54 CA193313
NIGMS NIH HHS (2)
Grant ID: T32 GM008224
Grant ID: R01 GM117591
NINDS NIH HHS (2)
Grant ID: DP1 NS082099
Grant ID: R21 NS088992