Abstract
In biomedical research, lymphoblastoid cell lines (LCLs), often established by in vitro infection of resting B cells with Epstein-Barr virus, are commonly used as surrogates for peripheral blood lymphocytes. Genomic and transcriptomic information on LCLs has been used to study the impact of genetic variation on gene expression in humans. Here we present single-cell RNA sequencing (scRNA-seq) data on GM12878 and GM18502—two LCLs derived from the blood of female donors of European and African ancestry, respectively. Cells from three samples (the two LCLs and a 1:1 mixture of the two) were prepared separately using a 10x Genomics Chromium Controller and deeply sequenced. The final dataset contained 7,045 cells from GM12878, 5,189 from GM18502, and 5,820 from the mixture, offering valuable information on single-cell gene expression in highly homogenous cell populations. This dataset is a suitable reference for population differentiation in gene expression at the single-cell level. Data from the mixture provide additional valuable information facilitating the development of statistical methods for data normalization and batch effect correction.
Subject terms: Gene expression, Gene expression analysis
Design Type(s) | transcription profiling design • strain comparison design |
Measurement Type(s) | transcription profiling assay |
Technology Type(s) | RNA sequencing |
Factor Type(s) | ancestry status • sex |
Sample Characteristic(s) | GM12878 cell • GM18502 cell • immortal human peripheral vein-derived B cell line cell |
Machine-accessible metadata file describing the reported data (ISA-Tab format)
Background & Summary
Immortalized cell lines are continuously growing cells derived from biological samples. Lymphoblastoid cell lines (LCLs) are an important class of immortalized cell lines1. LCLs are usually established by infecting human peripheral blood lymphocytes in vitro with Epstein-Barr virus (EBV). The viral infection selectively immortalizes resting B cells, giving rise to an actively proliferating B cell population2. LCLs exhibit a low somatic mutation rate in continuous culture, making them the preferred choice for storing individuals’ genetic material3. As one of the most reliable, inexpensive, and convenient sources of cells, LCLs have been used by several large-scale genomic DNA sequencing efforts such as the International HapMap and the 1000 Genomes projects4,5, in which a large collection of LCLs was derived from individuals of different genetic backgrounds to document the extensive genetic variation in human populations.
LCLs are also an in vitro model system for a variety of molecular and functional assays, contributing to studies in immunology, cellular biology, genetics, and other research areas6–12. It is also believed that gene expression in LCLs encompasses a wide range of metabolic pathways specific to the individuals from whom the cells originated13. LCLs have been used in population-scale RNA sequencing projects14–16, as well as epigenomic projects17. For many LCLs used as reference strains, both genomic and transcriptomic information is available, making it possible to detect correlations between genotype and gene expression level and to infer the potential causative function of genetic variants18. Furthermore, comparisons of gene expression profiles of LCLs between populations, such as between Centre d’Etude du Polymorphisme Humain – Utah (CEPH/CEU) and Yoruba in Ibadan, Nigeria (YRI), have revealed the genetic basis underlying the differences in transcriptional activity between the two populations16,19.
With the advent of single-cell RNA sequencing (scRNA-seq) technology20,21, our approach to understanding the origin, global distribution, and functional consequences of gene expression variation is ready to be extended. For example, data generated from scRNA-seq provide unprecedented resolution of gene expression profiles at the single-cell level, which allows the identification of previously unknown subpopulations of cells and of functional heterogeneity in a cell population22–24.
In this study, we used scRNA-seq to assess gene expression across thousands of cells from two LCLs: GM12878 and GM18502. Cells were prepared using a Chromium Controller (10x Genomics, Pleasanton, CA) as described previously21 and sequenced using an Illumina NovaSeq 6000 sequencer. We present this dataset of single-cell gene expression profiles for more than 7,000 cells from GM12878 and more than 5,000 from GM18502. GM12878 is a popular sample that has been widely used in genomic studies. For example, it is one of three ‘Tier 1’ cell lines of the Encyclopedia of DNA Elements (ENCODE) project17,25. GM18502, derived from a donor of African ancestry, serves as a representative sample from a divergent population. The two cell lines are part of the International HapMap project, and genotypic information is available for both of them4. We also processed and sequenced an additional sample, a 1:1 mixture of GM12878 and GM18502, using the same scRNA-seq procedure. The dataset presented here provides a suitable reference for researchers interested in performing between-population comparisons of gene expression at the single-cell level, as well as for those developing new statistical methods and algorithms for scRNA-seq data analysis.
Methods
Cell culture
GM12878 and GM18502 cell lines were purchased from the Coriell Institute for Medical Research. Cells were cultured in Roswell Park Memorial Institute (RPMI) Medium 1640 supplemented with 2 mM L-glutamine and 20% non-inactivated fetal bovine serum in T25 tissue culture flasks. Flasks with 20 mL of medium were incubated in the upright position at 37 °C under 5% carbon dioxide. Cell cultures were split every three days for maintenance. Note that authentication testing and mycoplasma contamination screening of these freshly purchased cell lines were not undertaken in this study.
Growth curve
Four culture flasks for each cell line were started with approximately 200,000 viable cells/mL to measure the growth rate of each cell line. Cells were prepared and cultured as described above. The number of viable cells was estimated daily for four days. Briefly, 100 µL of suspended cells was taken from each flask every day and stained with 10 µL of Trypan Blue (0.4%) to visualize the viable cells, and live cells were counted manually using a Neubauer counting chamber.
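The conversion from chamber counts to a concentration can be sketched as follows. This is a minimal illustration, not the procedure used in the study: it assumes counts are taken from the large 1 mm × 1 mm squares of the chamber (0.1 µL each, hence the factor of 10,000), that the only dilution is the Trypan Blue addition described above, and that growth between two time points is exponential. The function names and the example counts are hypothetical.

```python
import math

def viable_cells_per_ml(live_counts, sample_ul=100.0, dye_ul=10.0):
    """Estimate viable cells/mL from live-cell counts in large chamber squares.

    Each large square holds 0.1 uL (1 mm x 1 mm x 0.1 mm), so the mean count
    per square is multiplied by 10,000 and by the dye dilution factor.
    """
    mean_count = sum(live_counts) / len(live_counts)
    dilution = (sample_ul + dye_ul) / sample_ul  # 1.1 for 100 uL cells + 10 uL dye
    return mean_count * dilution * 1e4

def doubling_time_hours(n_start, n_end, elapsed_hours):
    """Doubling time under an assumed exponential growth model."""
    return elapsed_hours * math.log(2) / math.log(n_end / n_start)

# Hypothetical live-cell counts from four large squares of one sample.
counts = [18, 22, 20, 20]
concentration = viable_cells_per_ml(counts)

# Hypothetical growth from 200,000 to 800,000 cells/mL over 48 hours.
doubling = doubling_time_hours(200_000, 800_000, 48)
```

Averaging over several squares reduces counting noise; the dilution factor matters little here (1.1) but would dominate if samples were diluted further before counting.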
Single cell preparation
Single-cell sample preparation was conducted according to the Sample Preparation Demonstrated Protocol provided by 10x Genomics as follows: 1 mL of cell suspension from each cell line (day 4, stable phase) was pelleted in Eppendorf tubes by centrifugation (400 g, 5 min). The supernatant was discarded, and the cell pellet was then resuspended in 1x PBS with 0.04% BSA, followed by two washes by centrifugation (150 g, 3 min). After the second wash, cells were resuspended in ~500 µL of 1x PBS with 0.04% BSA and mixed by gentle pipetting 10–15 times. Cells were counted using an Invitrogen Countess automated cell counter (Thermo Fisher Scientific, Carlsbad, CA), and cell viability was assessed by Trypan Blue staining (0.4%).
Generation of single cell GEMs (Gel bead in EMulsion) and sequencing libraries
Libraries were prepared using the 10x Genomics Chromium Controller in conjunction with the single-cell 3′ v2 kit. Briefly, the cell suspensions were diluted in nuclease-free water according to the manufacturer’s instructions to achieve a targeted cell count of 5,000 for each cell line. The cDNA synthesis, barcoding, and library preparation were then carried out according to the manufacturer’s instructions. Libraries were sequenced in the North Texas Genome Centre facilities on a NovaSeq 6000 sequencer (Illumina, San Diego).
Mapping of reads to transcripts and cells
Sample demultiplexing, barcode processing, and unique molecular identifier (UMI) counting were performed using the 10x Genomics pipeline CellRanger v.2.1.0 with default parameters. Specifically, for each library, raw reads were demultiplexed using the pipeline command ‘cellranger mkfastq’ in conjunction with ‘bcl2fastq’ (v2.17.1.14, Illumina) to produce two fastq files: the read 1 file contains 26-bp reads, each consisting of a cell barcode and a UMI, and the read 2 file contains 96-bp reads of cDNA sequence. Reads were then aligned to the human reference genome (GRCh38), filtered, and counted using ‘cellranger count’ to generate the gene-barcode matrix. Summary metrics of barcoding and sequencing from raw data are given in Table 1.
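The layout of the 26-bp read 1 can be illustrated with a short parser. This sketch assumes the standard layout of the 10x single-cell 3′ v2 chemistry, a 16-bp cell barcode followed by a 10-bp UMI; the read name and sequence below are hypothetical, and CellRanger itself performs this step (with barcode whitelist correction) internally.

```python
CB_LEN = 16   # cell barcode length assumed for 10x 3' v2 chemistry
UMI_LEN = 10  # UMI length assumed for 10x 3' v2 chemistry

def split_read1(seq):
    """Split a 26-bp read 1 sequence into (cell_barcode, umi)."""
    if len(seq) != CB_LEN + UMI_LEN:
        raise ValueError(f"expected {CB_LEN + UMI_LEN} bp, got {len(seq)}")
    return seq[:CB_LEN], seq[CB_LEN:]

def parse_fastq(lines):
    """Yield (read_name, sequence) from FASTQ lines, four lines per record."""
    it = iter(lines)
    for header, seq, _plus, _qual in zip(it, it, it, it):
        yield header.strip()[1:].split()[0], seq.strip()

# A hypothetical single-record read 1 file, given as a list of lines.
fastq_lines = [
    "@read1 1:N:0:0\n",
    "AAACCTGAGAAACCATTTGCGTAGCA\n",
    "+\n",
    "I" * 26 + "\n",
]

records = [(name, *split_read1(seq)) for name, seq in parse_fastq(fastq_lines)]
```

For a gzip-compressed file, the same generator can be fed from `gzip.open(path, "rt")`, since it only needs an iterable of text lines.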
Table 1.
Metric | GM12878 | GM18502 | Mixture |
---|---|---|---|
Estimated Number of Cells | 7,247 | 5,530 | 5,828 |
Mean Reads per Cell | 65,466 | 91,493 | 83,326 |
Median Genes per Cell | 2,954 | 3,960 | 3,621 |
Number of Reads | 474,436,605 | 505,958,821 | 485,628,282 |
Valid Barcodes | 97.20% | 97.30% | 97.20% |
Sequencing Saturation | 50.30% | 53.50% | 53.30% |
Q30 Bases in Barcode | 94.90% | 94.80% | 94.80% |
Q30 Bases in RNA Read | 90.20% | 89.60% | 89.90% |
Q30 Bases in Sample Index | 91.50% | 93.40% | 92.20% |
Q30 Bases in UMI | 94.80% | 93.40% | 94.70% |
Reads Mapped to Genome | 93.90% | 93.70% | 93.70% |
Reads Mapped Confidently to Genome | 92.00% | 92.00% | 92.00% |
Reads Mapped Confidently to Intergenic Regions | 2.60% | 2.70% | 2.70% |
Reads Mapped Confidently to Intronic Regions | 12.90% | 13.10% | 12.80% |
Reads Mapped Confidently to Exonic Regions | 76.50% | 76.20% | 76.50% |
Reads Mapped Confidently to Transcriptome | 72.60% | 71.90% | 72.50% |
Reads Mapped Antisense to Gene | 0.90% | 0.90% | 0.90% |
Fraction Reads in Cells | 90.70% | 91.70% | 89.80% |
Total Genes Detected | 21,329 | 20,701 | 21,151 |
Median UMI Counts per Cell | 18,214 | 25,973 | 22,608 |
The estimates were produced by CellRanger on raw data, i.e., unfiltered feature-barcode matrix; values may differ slightly from those reported in the main text. For detailed definitions of metrics, refer to the 10x Genomics support website, https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/gex-metrics.
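Some entries of Table 1 are derived from others, which allows a quick consistency check. The sketch below recomputes ‘Mean Reads per Cell’ as the total number of reads divided by the estimated number of cells; the reported values agree with this quotient rounded down, although the exact rounding rule used by CellRanger is an assumption here.

```python
# Values copied from Table 1.
table1 = {
    "GM12878": {"cells": 7_247, "reads": 474_436_605, "mean_reads_per_cell": 65_466},
    "GM18502": {"cells": 5_530, "reads": 505_958_821, "mean_reads_per_cell": 91_493},
    "Mixture": {"cells": 5_828, "reads": 485_628_282, "mean_reads_per_cell": 83_326},
}

def mean_reads_per_cell(reads, cells):
    """Total reads divided by estimated cells, rounded down (assumed rounding)."""
    return reads // cells

# Map each sample to whether the recomputed value matches the reported one.
consistent = {
    sample: mean_reads_per_cell(row["reads"], row["cells"]) == row["mean_reads_per_cell"]
    for sample, row in table1.items()
}
```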
Quality control
Expression matrices were processed using the Seurat (v2.3.4) R package26. Briefly, for each library, the expression matrix was loaded using the ‘Read10X’ function, and the default log-normalization was performed using the ‘NormalizeData’ function, followed by centering and scaling of the normalized values using the ‘ScaleData’ function. Quality control (QC) measures, including UMI count, the number of genes detected per cell, and the percentage of mitochondrial transcripts, were calculated. Cells with a proportion of mitochondrial reads lower than 10% and a library size smaller than 2.5x standard deviation (SD) from the average library size were considered good-quality cells. The corresponding code used for the QC procedure is available online (see Code availability).
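The two QC criteria can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the repository: the library-size criterion is read as a one-sided upper bound (mean plus 2.5 SD, using the sample SD), and the field names `umi_count` and `mito_pct` as well as the example cells are hypothetical.

```python
from statistics import mean, stdev

def qc_pass(cells, max_mito_pct=10.0, n_sd=2.5):
    """Keep cells below the mitochondrial cutoff and below mean + n_sd * SD library size."""
    sizes = [c["umi_count"] for c in cells]
    upper = mean(sizes) + n_sd * stdev(sizes)  # assumed one-sided upper bound
    return [
        c for c in cells
        if c["mito_pct"] < max_mito_pct and c["umi_count"] < upper
    ]

# Hypothetical cells: 19 ordinary cells, one of them with high mitochondrial
# content, plus one cell with a very large library (a candidate multiplet).
cells = [
    {"barcode": f"cell{i}", "umi_count": 10_000 + 100 * i, "mito_pct": 3.0}
    for i in range(19)
]
cells[0]["mito_pct"] = 15.0
cells.append({"barcode": "cell19", "umi_count": 60_000, "mito_pct": 3.0})

kept = qc_pass(cells)
kept_barcodes = {c["barcode"] for c in kept}
```

Because the threshold is computed from the sample itself, a single extreme library inflates the SD; with very few cells no value can exceed mean + 2.5 SD at all, so the rule is only meaningful for samples of reasonable size.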
Cell cycle phase and population assignment
Cell cycle phase assignment was made using the ‘CellCycleScoring’ function in the Seurat R package26, which uses the phase-specific marker genes given by the ‘cc.genes’ dataset27. Cell population assignment, i.e., assigning each cell in the mixture sample back to the cell line (GM12878 or GM18502) it belongs to, was made using the Brunet algorithm28 for non-negative matrix factorization, as implemented in the NMF (v0.21) R package29. A set of marker genes (n = 252) with absolute log-fold change >2.5, identified by comparing the pure cell lines, was used as input, and the resulting probabilities after 2,000 iterations were used to assign each cell in the mixture to either GM12878 or GM18502.
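The marker-selection step that precedes the factorization can be sketched as a simple filter on log-fold change. The text does not state the logarithm base or how zero expression is handled, so the natural logarithm and a pseudocount of 1 are assumptions here; the gene names and mean expression values are hypothetical, and the factorization itself is not reproduced.

```python
import math

def select_markers(mean_a, mean_b, threshold=2.5, pseudocount=1.0):
    """Return {gene: log-fold change} for genes with |logFC| above the threshold.

    mean_a and mean_b map gene names to mean expression in each pure cell line.
    Natural log and the pseudocount are assumptions of this sketch.
    """
    markers = {}
    for gene in mean_a.keys() & mean_b.keys():
        lfc = math.log((mean_a[gene] + pseudocount) / (mean_b[gene] + pseudocount))
        if abs(lfc) > threshold:
            markers[gene] = lfc
    return markers

# Hypothetical mean expression values in the two pure cell lines.
mean_gm12878 = {"GENE_A": 50.0, "GENE_B": 0.0, "GENE_C": 5.0}
mean_gm18502 = {"GENE_A": 0.5, "GENE_B": 30.0, "GENE_C": 4.0}

markers = select_markers(mean_gm12878, mean_gm18502)
```

Positive values mark genes higher in the first cell line and negative values genes higher in the second; restricting the factorization to such genes concentrates the signal that separates the two populations.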
Dimensionality reduction
Expression matrices from GM12878, GM18502, and the mixture sample were merged and log-normalized using the function ‘MergeSeurat’. The resultant matrix was then centered and scaled. Highly variable genes were identified using function ‘FindVariableGenes’ in the Seurat R package26. Identified highly variable genes were used as input to produce the t-Distributed Stochastic Neighbour Embedding (t-SNE) projection using the ‘RunTSNE’ function with standard settings (perplexity = 30, theta = 0.5, maximum iteration = 1000, learning rate = 250, and momentum reduction = 0.5, by using the first 5 components from the principal component analysis). The Uniform Manifold Approximation and Projection (UMAP) was produced with the same set of highly variable genes as input using the function ‘RunUMAP’ with standard settings (min_dist = 0.3, metric = correlation, n_neighbors = 30).
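The log-normalization and the centering and scaling steps can be written out explicitly. The sketch below follows the default behaviour of Seurat's ‘LogNormalize’ method as we understand it (counts divided by the cell total, multiplied by a scale factor of 10,000, then log1p-transformed) and a per-gene z-score with the sample SD; clipping of extreme scaled values is omitted, and the example cell is hypothetical.

```python
import math
from statistics import mean, stdev

def log_normalize(counts, scale_factor=10_000):
    """Log-normalize one cell: log1p(count / total * scale_factor) per gene."""
    total = sum(counts.values())
    return {gene: math.log1p(c / total * scale_factor) for gene, c in counts.items()}

def center_and_scale(values):
    """Z-score one gene's normalized values across cells (sample SD)."""
    mu, sd = mean(values), stdev(values)
    return [(v - mu) / sd if sd > 0 else 0.0 for v in values]

# Hypothetical UMI counts for one cell with three genes.
cell = {"G1": 10, "G2": 0, "G3": 90}
normalized = log_normalize(cell)

# Hypothetical normalized values of one gene across three cells.
scaled = center_and_scale([1.0, 2.0, 3.0])
```

Normalizing by the cell total removes differences in sequencing depth between cells, while scaling gives every gene equal weight in the subsequent principal component analysis.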
scRNA-seq versus bulk RNA-seq
For both GM12878 and GM18502, the transcriptome has been previously sequenced using bulk RNA-seq. The availability of these existing data allowed us to examine the correlation between gene expression levels measured using scRNA-seq and bulk RNA-seq in the same LCLs. Thus, we downloaded the raw fastq files of bulk RNA-seq experiments from the Gene Expression Omnibus (GEO) database using accessions GSM484896 (refs 30,31; for GM12878) and GSM2392689 (refs 32,33; for GM18502) and quantified gene expression for both samples using Salmon34 (v0.12.0) against the human transcriptome (GRCh38). In addition, we compared gene expression measured using scRNA-seq in GM12878 and GM18502 with the average gene expression measured in multiple samples from the CEU and YRI populations. To do so, we downloaded the bulk RNA-seq data of 91 CEU and 89 YRI LCLs from the website of the Geuvadis RNA-seq project of 1000 Genomes. The expression of each gene was measured as the mean of the transcripts per million (TPM) values across all individuals of the CEU or YRI population. To visualize the relationship between the single-cell gene expression profiles of the two cell lines and their respective populations, a principal component analysis (PCA) was performed. The input data for PCA were batch-effect corrected using the ‘removeBatchEffect’ function in the limma (3.4.0) R package35 and quantile normalized using the ‘normalize.quantiles’ function in the preprocessCore (1.46.0) R package.
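Quantile normalization, the last step above, forces every sample to share the same distribution of values. The following is a minimal sketch of the basic idea only: it ignores ties and missing values, which the preprocessCore implementation handles in its own way, and the example expression values are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples given as equal-length lists of gene values.

    The k-th smallest value in every sample is replaced by the mean of the
    k-th smallest values across all samples. Ties are not treated specially.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # gene indices, smallest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Hypothetical expression of four genes in three samples (one list per sample).
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 3.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After normalization the within-sample ranking of genes is unchanged, but all samples contain exactly the same set of values, which removes sample-wide shifts in scale before PCA.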
Data Records
The sequencing data from this study have been submitted under the BioProject reference PRJNA508890, with descriptions of the Biosamples (SUB4895416, SUB4895422, SUB4895423). Raw data of the three samples have been deposited at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) with accession ID SRP172838 (ref. 36). For each sample, data include unprocessed scRNA-seq reads in two raw fastq files (*R1.fastq.gz for cell barcodes and UMIs, and *R2.fastq.gz for RNA reads), as well as an expression matrix file in Matrix Market exchange format (*.mtx) with columns corresponding to cells and rows to genes. UMI matrices of this study have been deposited with the Gene Expression Omnibus under accession GSE126321 (ref. 37). The identifiers for the columns and rows are included in separate files (barcodes.tsv and genes.tsv). These processed files correspond to the output produced by the CellRanger pipeline. In addition, a supplementary table with the barcodes, population, UMI count, gene count, and mitochondrial transcript levels is included.
Technical Validation
Here we present the scRNA-seq gene expression profiles for 7,045 and 5,189 cells from GM12878 and GM18502, respectively. For GM12878, the median UMI count per cell is 18,214 and the median number of genes detected (at least 1 UMI) per cell is 3,167; for GM18502, the corresponding values are 25,973 and 3,891. Figure 1 is a heatmap of log-transformed expression data for the top 200 most highly expressed genes in the two LCLs. Cells are grouped by their cell cycle phases (G1, S, and G2/M) and sorted within each group by their library size. Among the top expressed genes, there are several immunoglobulin genes such as IGLC2, IGHA1, IGKC, IGLC3, and IGHM. These genes are not only highly expressed on average but also highly variable across cells, i.e., highly expressed in one set of cells but not expressed in another. We consider that this highly variable expression pattern can be attributed to immunoglobulin gene rearrangement. During the formation of naïve B cells, gene rearrangement reshuffles different subunits of the variable (V), diversity (D) and joining (J) segments of immunoglobulin genes, generating a wide range of organism-specific antigen receptors that allow the immune system to recognize foreign molecules and initiate differential immune responses38,39. LCLs are produced through the rapid proliferation of a few EBV-driven B cells from the blood cell population40. Thus, our scRNA-seq data of GM12878 and GM18502 offer a ‘snapshot’ of highly diverse immunoglobulin rearrangement profiles in a much larger population of polyclonal B cells found in the two donors.
We also performed scRNA-seq on a 1:1 mixture sample of the two LCLs and obtained data for an additional 5,820 cells with a median UMI count per cell of 22,608 and a median number of genes detected per cell of 3,625. This mixture sample can be considered a technical replicate for both GM12878 and GM18502. The use of the mixture sample facilitates direct comparison of gene expression between GM12878 and GM18502 because cells from the two cell lines in the mixture were processed simultaneously in the same reaction, minimizing the batch effect. We found that cells in the mixture could be assigned back to their original cell lines almost unambiguously using a non-negative matrix factorization algorithm (see Methods). Furthermore, the average gene expression measured in cells in the mixture, after assigning each cell to its original cell line, was virtually indistinguishable from that measured in the original ‘pure’ cells (Fig. 2).
The percentage of mitochondrial transcripts, an indicator of apoptotic cells, was computed for all cells sequenced in all three samples. We found that no more than 0.4% of cells, that is, 26 cells from GM12878, 6 from GM18502, and 23 from the mixture sample, surpassed the commonly used threshold of 10% mitochondrial transcripts41. This suggests that the majority of cells processed and sequenced were viable. Furthermore, because the 10x Genomics Chromium technology relies on droplets to partition and barcode cells, some droplets are expected to contain multiple cells, making the estimation of the frequency of multiplets a critical aspect of quality control42. There are several ways to identify multiplets43–45. Here we adopted the threshold of 2.5x SD from the average library size for each cell. Based on this threshold, only 171 cells were considered to be multiplets for GM12878, 66 for GM18502, and 87 for the mixture (Fig. 3). These results support the quality of the dataset.
In both the t-SNE and UMAP projections, no separation was observed between cells from the two pure cell lines, GM12878 and GM18502, and cells from the corresponding replicates of the two pure cell lines in the mixture (Fig. 4). This result suggests that cells in the mixture have global expression profiles indistinguishable from those of cells of their original samples. The population signal of each sample allows it to be separated from the others in the first two t-SNE or UMAP dimensions. Furthermore, for each cell line, cells of different cell cycle phases are not entirely separated: a continuous path between the different clusters of cells exists. This allows researchers interested in cell cycle progression to perform pseudo-time analysis46. Also, cells in the same cell cycle phase tend to be spread out and form a spectrum of cells in intermediate stages, indicating that cell proliferation is a continuous process; researchers interested in this process can use this dataset to refine reference cell sub-populations by their characteristic expression profiles.
For both GM12878 and GM18502, we conducted correlation analyses to validate our scRNA-seq expression data using bulk RNA-seq expression information as a reference. We first compared gene expression measured using scRNA-seq and bulk RNA-seq in the same LCL, GM12878 or GM18502. We also compared gene expression measured using scRNA-seq in GM12878 (and GM18502) with the average gene expression in corresponding population CEU (and YRI). We found that in all cases the correlations are highly significant and strong with Spearman correlation coefficients (SCCs) of 0.78, 0.58, 0.76, and 0.77, respectively (Fig. 5). Thus, when scRNA-seq data are pooled across cells, genes’ expression levels are largely recapitulated as they were measured using bulk RNA-seq. These results further support the quality of our scRNA-seq dataset. We note that the SCC (0.58) between GM18502 scRNA-seq and GM18502 bulk RNA-seq is lower than that (0.78) between GM12878 scRNA-seq and GM12878 bulk RNA-seq. This may be due to differences in cell population state at the time when GM18502 cells were harvested for scRNA-seq and bulk RNA-seq.
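The comparison described above amounts to pooling UMI counts across cells (a ‘pseudo-bulk’ profile) and computing a rank correlation with the bulk measurements. The following is a minimal sketch with hypothetical numbers: Spearman's coefficient is computed as the Pearson correlation of average ranks, and pooling by summation is an assumption, since averaging across cells would give the same ranks.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def pseudobulk(cells):
    """Sum UMI counts per gene across cells (one list of gene counts per cell)."""
    return [sum(gene_counts) for gene_counts in zip(*cells)]

# Hypothetical counts for four genes in three cells, and bulk TPM values.
cells = [[5, 0, 1, 20], [3, 1, 0, 30], [4, 0, 2, 25]]
bulk_tpm = [100.0, 2.0, 9.0, 400.0]

pooled = pseudobulk(cells)
scc = spearman(pooled, bulk_tpm)
```

A rank-based coefficient is a natural choice here because UMI counts and TPM values are on different scales and are related monotonically rather than linearly.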
As long-lasting supplies of cells containing genotypic and phenotypic information matching that of B-cell origins, LCLs have contributed significantly to biomedical research. We present a high-quality dataset of scRNA-seq from homogenous cell populations of two LCLs, including GM12878—one of the most popular reference cell lines. Our dataset provides information that can be used to quantify cell-to-cell variability in gene expression and study cellular states and associated gene expression changes. It also informs the analysis and comparison of gene expression at the single-cell level between European and African LCLs. The data from the mixture sample are a suitable resource for estimating the technical variability of scRNA-seq and can also be used to calibrate statistical methods for data normalization and batch effect correction.
Acknowledgements
We thank Andrew Hillhouse and Chris Blazier for help with single cell preparation and raw data processing and Jianhua Huang, Yan Zhong and Guanxun Li for helpful discussion on data analysis. This study was supported by Texas A&M University T3 grant for J.J.C., E.S. and P.Y. J.J.C. was supported by NIH grant R21AI126219.
Author Contributions
D.O., X.Y., P.Y., E.S. and J.J.C. conceived and designed the project; D.O. and X.Y. cultured the cells; D.O. and J.J.C. performed bioinformatics analysis; D.O., X.Y., P.Y., E.S. and J.J.C. analyzed the data; D.O. and J.J.C. wrote the manuscript. All authors reviewed the manuscript.
Code Availability
All the code required to replicate the feature characterization of GM12878, GM18502, and the mixture, as well as all figures included in this document, is available in a public repository on GitHub at https://github.com/cailab-tamu/sciData-LCL.
Competing Interests
The authors declare no competing interests.
Footnotes
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
These authors contributed equally: Daniel Osorio and Xue Yu.
Contributor Information
Peng Yu.
Erchin Serpedin.
James J. Cai.
ISA-Tab metadata is available for this paper at 10.1038/s41597-019-0116-4.
References
- 1. Nagy N. Establishment of EBV-Infected Lymphoblastoid Cell Lines. Methods in Molecular Biology. 2017;1532:57–64. doi: 10.1007/978-1-4939-6655-4_3.
- 2. Neitzel H. A routine method for the establishment of permanent growing lymphoblastoid cell lines. Human Genetics. 1986;73:320–326. doi: 10.1007/BF00279094.
- 3. Mohyuddin A, et al. Genetic instability in EBV-transformed lymphoblastoid cell lines. Biochimica et Biophysica Acta (BBA) 2004;1670:81–83. doi: 10.1016/j.bbagen.2003.10.014.
- 4. Durbin RM, et al. A map of human genome variation from population-scale sequencing. Nature. 2010;467:1061–1073. doi: 10.1038/nature09534.
- 5. Sabeti PC, et al. Genome-wide detection and characterization of positive selection in human populations. Nature. 2007;449:913–918. doi: 10.1038/nature06250.
- 6. Sie L, Loong S, Tan EK. Utility of lymphoblastoid cell lines. Journal of Neuroscience Research. 2009;87:1953–1959. doi: 10.1002/jnr.22000.
- 7. Hussain T, Mulherkar R. Lymphoblastoid Cell lines: a Continuous in Vitro Source of Cells to Study Carcinogen Sensitivity and DNA Repair. International Journal of Molecular and Cellular Medicine (IJMCM) 2012;1:75–87.
- 8. Jiang S, et al. CRISPR/Cas9-Mediated Genome Editing in Epstein-Barr Virus-Transformed Lymphoblastoid B-Cell Lines. Current Protocols in Molecular Biology. 2018;121:31.12.31–31.12.23. doi: 10.1002/cpmb.51.
- 9. Shim S-M, et al. MicroRNAs in human lymphoblastoid cell lines. Critical Reviews in Eukaryotic Gene Expression. 2012;22:189–196. doi: 10.1615/CritRevEukarGeneExpr.v22.i3.20.
- 10. Wheeler HE, Dolan ME. Lymphoblastoid cell lines in pharmacogenomic discovery and clinical translation. Pharmacogenomics. 2012;13:55–70. doi: 10.2217/pgs.11.121.
- 11. Gurwitz D. Human iPSC-derived neurons and lymphoblastoid cells for personalized medicine research in neuropsychiatric disorders. Dialogues in Clinical Neuroscience. 2016;18:267–276. doi: 10.31887/DCNS.2016.18.3/dgurwitz.
- 12. Ansel A, Rosenzweig JP, Zisman PD, Melamed M, Gesundheit B. Variation in Gene Expression in Autism Spectrum Disorders: An Extensive Review of Transcriptomic Studies. Frontiers in Neuroscience. 2016;10:601–601. doi: 10.3389/fnins.2016.00601.
- 13. Amoli M, Carthy D, Platt H, Ollier W. EBV Immortalization of human B lymphocytes separated from small volumes of cryo-preserved whole blood. International Journal of Epidemiology. 2008;37:i41–i45. doi: 10.1093/ije/dym285.
- 14. Lappalainen T, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501:506–511. doi: 10.1038/nature12531.
- 15. Pickrell JK, et al. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010;464:768–772. doi: 10.1038/nature08872.
- 16. Martin AR, et al. Transcriptome Sequencing from Diverse Human Populations Reveals Differentiated Regulatory Architecture. PLoS Genetics. 2014;10:e1004549–e1004549. doi: 10.1371/journal.pgen.1004549.
- 17. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 18. Sajantila A. Editors’ pick: transcriptomes of 1000 genomes. Investigative Genetics. 2013;4:17–17. doi: 10.1186/2041-2223-4-17.
- 19. Stranger BE, et al. Population genomics of human gene expression. Nature Genetics. 2007;39:1217–1224. doi: 10.1038/ng2142.
- 20. Tang F, et al. mRNA-Seq whole-transcriptome analysis of a single cell. Nature Methods. 2009;6:377–382. doi: 10.1038/nmeth.1315.
- 21. Zheng GXY, et al. Massively parallel digital transcriptional profiling of single cells. Nature Communications. 2017;8:14049–14049. doi: 10.1038/ncomms14049.
- 22. Kolodziejczyk AA, Kim JK, Svensson V, Marioni JC, Teichmann SA. The Technology and Biology of Single-Cell RNA Sequencing. Molecular Cell. 2015;58:610–620. doi: 10.1016/j.molcel.2015.04.005.
- 23. Shalek AK, et al. Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells. Nature. 2013;498:236–240. doi: 10.1038/nature12172.
- 24. Marinov GK, et al. From single-cell to cell-pool transcriptomes: Stochasticity in gene expression and RNA splicing. Genome Research. 2014;24:496–510. doi: 10.1101/gr.161034.113.
- 25. Zhao B, et al. The NF-κB Genomic Landscape in Lymphoblastoid B Cells. Cell Reports. 2014;8:1595–1606. doi: 10.1016/j.celrep.2014.07.037.
- 26. Satija R, Farrell JA, Gennert D, Schier AF, Regev A. Spatial reconstruction of single-cell gene expression data. Nature Biotechnology. 2015;33:495–502. doi: 10.1038/nbt.3192.
- 27. Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189–196. doi: 10.1126/science.aad0501.
- 28. Brunet JP, Tamayo P, Golub TR, Mesirov JP. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences. 2004;101:4164–4169. doi: 10.1073/pnas.0308531101.
- 29. Gaujoux R, Seoighe C. A flexible R package for nonnegative matrix factorization. BMC Bioinformatics. 2010;11:367–367. doi: 10.1186/1471-2105-11-367.
- 30. Kasowski M, et al. Variation in Transcription Factor Binding Among Humans. Science. 2010;328:232–235. doi: 10.1126/science.1183621.
- 31. Kasowski M. 2009. Gene Expression Omnibus. GSM484896
- 32. Banovich NE, et al. Impact of regulatory variation across human iPSCs and differentiated cells. Genome Research. 2018;28:122–131. doi: 10.1101/gr.224436.117.
- 33. Banovich NE. 2016. Gene Expression Omnibus. GSM2392689
- 34. Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 2017;14:417–419. doi: 10.1038/nmeth.4197.
- 35. Ritchie ME, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research. 2015;43:e47. doi: 10.1093/nar/gkv007.
- 36. Osorio D, Xue Y, Yu P, Serpedin E, Cai J. 2019. NCBI Sequence Read Archive. SRP172838
- 37. Osorio D, Xue Y, Yu P, Serpedin E, Cai J. 2019. Gene Expression Omnibus. GSE126321
- 38. Papavasiliou F, et al. V(D)J recombination in mature B cells: a mechanism for altering antibody responses. Science. 1997;278:298–301. doi: 10.1126/science.278.5336.298.
- 39. Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302:575–581. doi: 10.1038/302575a0.
- 40. Ryan JL, et al. Clonal evolution of lymphoblastoid cell lines. Laboratory Investigation. 2006;86:1193–1200. doi: 10.1038/labinvest.3700472.
- 41. MacParland SA, et al. Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations. Nature Communications. 2018;9:4383–4383. doi: 10.1038/s41467-018-06318-7.
- 42. Bloom JD. Estimating the frequency of multiplets in single-cell RNA sequencing from cell-mixing experiments. PeerJ. 2018;6:e5578–e5578. doi: 10.7717/peerj.5578.
- 43. McGinnis CS, Murrow LM, Gartner ZJ. DoubletFinder: Doublet detection in single-cell RNA sequencing data using artificial nearest neighbors. Preprint at https://www.biorxiv.org/content/10.1101/352484v3 (2018).
- 44. Wolock SL, Lopez R, Klein AM. Scrublet: computational identification of cell doublets in single-cell transcriptomic data. Preprint at https://www.biorxiv.org/content/10.1101/357368v1 (2018).
- 45. DePasquale EA, et al. DoubletDecon: cell-state aware removal of single-cell RNA-seq doublets. Preprint at https://www.biorxiv.org/content/10.1101/364810v2 (2018).
- 46. Trapnell C, et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nature Biotechnology. 2014;32:381–386. doi: 10.1038/nbt.2859.