Abstract
Molecular profiles of tumors and tumor-associated cells hold great promise as biomarkers of clinical outcomes. However, existing datasets are fragmented and difficult to analyze systematically. Here we present a pan-cancer resource and meta-analysis of expression signatures from ~18,000 human tumors with overall survival outcomes across 39 malignancies. Using this resource, we identified a FOXM1 regulatory network as a major predictor of adverse outcomes, and found that expression of favorably prognostic genes, including KLRB1, largely reflect tumor-associated leukocytes. By applying CIBERSORT, a computational approach for inferring leukocyte representation in bulk tumor transcriptomes, we identified complex associations between 22 distinct leukocyte subsets and cancer survival. For example, tumor-associated neutrophil and plasma cell signatures emerged as significant but opposite predictors of survival for diverse solid tumors, including breast and lung adenocarcinomas. This resource and associated analytical tools (http://precog.stanford.edu) may help delineate prognostic genes and leukocyte subsets within and across cancers, shed light on the impact of tumor heterogeneity on cancer outcomes, and discover biomarkers and therapeutic targets.
Genomic features of tumors and their microenvironments represent promising candidates for predictive and prognostic biomarkers1-3. Despite intensive efforts over the past decade to leverage emerging high-throughput genomic technologies4,5, heterogeneity in patient cohorts, treatment regimens, and technological platforms, among other factors, have led to apparently inconsistent results and modest translational impact6-11. To address these issues, and to gain new insights into molecular features of tumors with prognostic associations, we integrated tumor gene expression profiles (GEPs) and overall survival data from nearly 18,000 patients within a meta-analytical framework that enhances statistical power and improves reproducibility12. This effort differs from previous efforts as the latter have not been amenable to meta-analysis13, or were limited to either single cancer types14 or single datasets per cancer type15.
We applied to this resource the recently described CIBERSORT method16 to analyze associations between clinical outcomes and abundance of diverse tumor-associated leukocyte (TAL) subsets, several of which have been linked to tumor growth17,18, cancer progression and outcome19. This approach quantifies relative expression of cell type specific signatures in bulk tumors, and is largely independent of methods relying on cell isolation, preservation, and reagent quality, all of which are difficult to standardize across large numbers of tumors.
Using this resource and these analytical tools, we constructed a global pan-cancer map of the landscape of both genes and TALs predicting clinical outcomes, integrating with existing resources such as ENCODE20. Our findings reveal genome-wide molecular portraits of human tumors and identify candidate genes and TALs for prognostic stratification and targeted therapy.
Results
A cancer-wide atlas of prognostic genes
We assembled, curated, and integrated cancer gene expression and clinical outcome data from the public domain into a new resource for PREdiction of Clinical Outcomes from Genomic profiles (PRECOG, Fig. 1a, Supplementary Table 1, available at http://precog.stanford.edu). PRECOG encompasses nearly 30,000 expression profiles from 166 cancer expression datasets covering 39 distinct malignant histologies (Fig. 1b). Importantly, all data were manually curated with respect to relevant clinical parameters, including disease status, stage, and histology, and key results from selected studies (e.g., survival plots) were verified (Methods). To avoid ambiguities in outcome annotation, we restricted the results presented herein to the ~18,000 cases with overall survival data. We further confirmed consistency and quality of integrated datasets and curated annotations by assessing concordance between molecularly-inferred versus clinically-reported patient gender. In all, 98% of tumors were gender-concordant, with less than 2% of tumors affected by likely annotation errors (such as “off-by-one” mismatches) (Supplementary Fig. 1a–c). All microarray studies in PRECOG were consistently normalized and pre-processed (Methods) and associations between gene expression and survival were assessed by univariate Cox regression.
To compare prognostic associations across independent datasets, and to minimize the confounding influence of batch effects, the statistical associations between genes and clinical outcomes were assessed by z-scores. Z-scores are directly related to P values, but conveniently encode both the directionality and robustness of statistical associations (Supplementary Fig. 1d). Moreover, z-scores have a straightforward interpretation; they represent the number of standard deviations from the mean of a normal distribution. For example, |z| > 1.96 is equivalent to a two-sided P < 0.05 (Supplementary Fig. 1d). Unlike parameters such as hazard ratios, z-scores are independent of different time-scales measuring survival follow-up times, and of the range/scale of predictor variables, permitting direct comparison across studies and platforms. To facilitate cross-cancer analyses, z-scores for individual studies were combined to yield “meta-z scores” for the prognostic significance of each gene in each cancer type (Methods; Supplementary Table 1). We observed high concordance between meta-z scores and z-scores, where the latter were obtained by first merging expression data from multiple studies of the same cancer (e.g., lung adenocarcinoma, Spearman's R = 0.9, P < 2.2×10−16; Methods). To further evaluate the robustness of the meta-z metric, we calculated a global meta-z score for each gene across all cancers, and compared PRECOG to a validation set of 9 independent studies that were held-out (Supplementary Table 1). Globally prognostic genes were significantly correlated between PRECOG and the validation set (R = 0.55, P < 2.2 × 10−16; Supplementary Fig. 2a,b). In addition, pan-cancer prognostic genes were significantly concordant between PRECOG and another validation set comprised of studies profiled by RNA-seq from The Cancer Genome Atlas (TCGA) (R = 0.52, P < 2.2 × 10−16; Supplementary Fig. 2a,b). We also evaluated the influence of batch effects21 on z-score values. Notably, only modest differences in z-scores were observed following batch effect removal (e.g., for samples run on different dates) (Supplementary Fig. 2c–e).
Pan-cancer prognostic genes
PRECOG provides an unprecedented opportunity to quantify commonalities in prognostic genes across a large number of human malignancies. We found that prognostic genes (filtered at |meta-z| > 3.09, or nominal one-sided P < 0.001) are significantly more likely to be shared by distinct tumor types than expected by random chance (Fig. 1c, Supplementary Table 2). This result was reproducible across a broad range of statistical thresholds (Supplementary Fig. 3a,b), and is reminiscent of the high cancer-wide concordance reported among somatic aberrations influencing genome-wide copy number22. Conversely, cancer-specific prognostic genes are less frequent than expected by random chance (Fig. 1c, Supplementary Fig. 3a,b), and predominantly reflect tissues of origin (Supplementary Fig. 3c, Supplementary Table 2).
To obtain a global map of prognostic patterns, we clustered survival-associated z-scores across all 166 PRECOG datasets using AutoSOME, an unsupervised method that is robust against outliers and does not require pre-specification of the number of clusters23 (Fig. 1d, Supplementary Table 3). Prognostic clusters include genes involved in cell adhesion and epithelial-mesenchymal transitions, vascularization, and immunological and proliferative processes (Supplementary Table 3). When clusters were ordered by a metric that integrates gene-level meta-z scores and cluster size, the two largest clusters were most highly ranked (Fig. 1d, left; Methods). One of these two clusters is broadly associated with inferior outcomes, and is functionally linked to cell proliferation and cell cycle phase (Fig. 1d, right). While this cluster is prognostic in many solid tumors, such as breast and lung adenocarcinoma, proliferation genes were not adversely prognostic in some cancers, including colon cancer and AML (Supplementary Table 1), two malignancies for which the clinical relevance of generally quiescent cancer stem cells has been demonstrated24,25. The other large cluster is associated with favorable survival and is highly enriched in immunological processes and immune response genes (Fig. 1d, right; Supplementary Table 3), suggesting that the immune microenvironment is a key factor contributing to favorable survival across cancers.
To further explore cancer-wide prognostic signatures, we used PRECOG to define robust pan-cancer survival models. First, we determined the number of histologies needed to identify genes with maximal prognostic power. Using a cross-validation approach to avoid outliers, we observed quantitative improvements in the significance of pan-cancer prognostic genes until ~30 distinct histologies were sampled, after which additional gains were marginal (Fig. 2a, left). With this framework, we identified the top ten adverse and top ten favorable prognostic genes, largely consisting of genes regulating or oscillating with the mitotic cell cycle and markers of distinct lymphocyte subsets (Fig. 2a, right). Expression of the proto-oncogene FOXM1 (encoding forkhead box M1)26 was most frequently associated with adverse risk, andoutperformed MKI67, which encodes a protein (Ki-67) used clinically as a marker of proliferation27 (Supplementary Fig. 4a). Expression of KLRB1 (encoding CD161, a surface marker on several T cell subsets28 and NK cells; Supplementary Fig. 4b) was most frequently associated with favorable outcomes. Notably, FOXM1 and KLRB1 were among the top pan-cancer genes in validation sets, including TCGA RNA-seq data (Fig. 2b and Supplementary Fig. 2). Bivariate models based on these two genes outperformed either gene alone (Supplementary Table 4). Moreover, when weighted by their Cox regression coefficients in a training set (Supplementary Table 4), a FOXM1-KLRB1 composite score stratified patient survival in diverse internal validation datasets, including carcinomas, brain tumors, and hematopoietic neoplasms (Fig. 2c), and was prognostic in multivariate models integrating common clinical indices (Supplementary Table 4). Furthermore, the FOXM1-KLRB1 composite score remained significant in validation datasets held-out from PRECOG, including TCGA RNA-seq datasets (Supplementary Fig. 4c,d).
We next asked whether integrating cancers in a meta-analysis might illuminate additional functional elements of biological programs impacting survival. To address this, we evaluated connectivity among top prognostic genes within protein-protein association networks. For adversely prognostic genes, a considerably higher connectivity was achieved when considering all cancer histologies together as opposed to separately (Fig. 2d, bottom; Supplementary Table 5), suggesting that individual malignancies in PRECOG contribute complementary information to a pervasive proliferation program (Fig. 2d, middle). We further characterized this network by integrating PRECOG with data from ENCODE20, ChEA29, and mSigDB30 for 1,006 gene sets defined by shared transcription factor binding sites. Adversely prognostic genes in this network and across PRECOG were most significantly enriched in FOXM1 ChIP-Seq binding targets31 (Supplementary Fig. 4e, Supplementary Table 5), which together with FOXM1 itself, may be drivers of inferior survival. In contrast, we observed no difference in protein-protein connectivity among favorably prognostic genes, whether identified from pooled cancers or individual cancers (Fig. 2d, bottom; Supplementary Table 5). We hypothesized that transcriptome heterogeneity among distinct TALs, which account for the majority of favorable pan-cancer prognostic associations, may not be well captured by aggregated protein-protein association studies. In addition, several highly prognostic genes, including KLRB1, are expressed by multiple immune subsets (Supplementary Fig. 4b). Therefore we performed an “in silico dissection” of PRECOG to infer the specific leukocyte subsets associated with survival across cancers.
Leukocyte composition in bulk tumors
Infiltration of tumors by specific leukocyte cell subsets such as CD8+ and CD45RO+ memory T-lymphocytes has been largely linked with favorable outcomes in different cancers2,32, while others such as regulatory T-cells and macrophages can confer good or poor prognosis depending on context33-36. To systematically and comprehensively map compositional differences in TALs and their relationships to survival, we applied a new machine-learning framework for Cell-type Identification By Estimating Relative Subsets Of known RNA Transcripts (or CIBERSORT)16. CIBERSORT outperforms previous deconvolution methods with respect to noise, unknown mixture content, and closely related cell types, in statistically estimating relative proportions of cell subsets from expression profiles of complex tissues (e.g., bulk tumors)16. As input, we used purified expression profiles for 22 distinct leukocyte subsets, and defined “barcodes” of gene expression signatures that robustly distinguish these cell types without requiring cell type-specific marker genes16. At a |meta-z score| > 3.3 (corresponding to two-sided P < 0.001), 28% of these barcode genes (152 of 547) are individually significant in PRECOG, out of 2,851 total pan-cancer prognostic genes at the same significance threshold. This is higher than expected by random chance (P < 0.001, Chi-squared test). Whether directly or indirectly compared against flow cytometry and immunohistochemistry, CIBERSORT exhibited robust performance on solid tumors, accurately estimating relative fractions of leukocyte subsets in colorectal cancer and lung adenocarcinoma (Fig. 3a), and follicular lymphoma16.
Applied to PRECOG, CIBERSORT revealed striking differences in relative leukocyte composition between hematopoietic neoplasms, brain cancers, and non-brain solid tumors (Fig. 3b, Supplementary Table 6). Variation in TAL content was also consistent and reproducible across independent studies of the same cancer type, including solid tumors (Supplementary Fig. 5a). Of note, while the majority of tumors profiled within PRECOG were unpurified and uncontrolled with respect to tumor content (Supplementary Table 1), CIBERSORT correctly inferred high fractions of plasma cells in multiple myeloma-enriched specimens (Fig. 3b). Furthermore, as expected, B-cell signatures were found to predominate in B-cell malignancies (Fig. 3b), suggesting that CIBERSORT has general utility for discerning cell of origin in diverse cancers.
Prognostic associations of TALs
To complement our gene-centric survival analysis, we assembled a global map of prognostic associations for 22 immune populations across human malignancies (Supplementary Fig. 6a). We observed considerable variation between cell subsets and cancer-specific outcomes, and many of these associations are statistically significant (Supplementary Fig. 6b–d). Pooling cancers yielded significant global leukocyte prognostic patterns, in which higher levels of estimated T-cell fractions were found to generally correlate with superior survival while increasing levels of myeloid populations primarily correlated with poorer survival. Intra-tumoral γδ T-cell37,38 and polymorphonuclear (PMN)39,40 signatures emerged as the most significant favorable and adverse cancer-wide prognostic populations, respectively (Fig. 3c, left). Moreover, when inferred leukocyte fractions were compared with KLRB1 expression across cancers, γδ T-cell and CD8 T-cell signatures were most highly correlated (Supplementary Fig. 5b), suggesting a link to the prognostic significance of this gene. We found no relationship between estimated PMN levels in datasets with annotated necrotic tissue content (Methods), suggesting that intra-tumoral PMNs are not simply a correlate of tissue necrosis41. Furthermore, consistent with previous reports33,42, signatures of tumor-associated M2 macrophages were found to predict worse outcomes than pro-inflammatory M1 macrophages, and anti-CD3/anti-CD28-costimulated, but not resting, CD45RO+ memory helper T-cells were correlated with superior outcomes.
Prognostic TALs in solid tumors
By comparing leukocyte survival signatures in breast and lung cancer—two of the most highly profiled cancers in PRECOG—we identified two populations, PMNs and plasma cells (PCs), with unexpectedly strong yet reciprocal relationships to survival (Fig. 3d). PC signatures are significant predictors of favorable survival across solid tumors in general (Fig. 3c, right), and were the most inversely correlated prognostic population to PMNs (Fig. 4a) when assessed globally in a cross-correlation analysis between human cancers (Supplementary Fig. 5c). Estimated PC levels were not correlated with tumor stage (Supplementary Fig. 7a). Since PC signatures were found to be higher in tumors than in adjacent normal tissues (Supplementary Fig. 7b), the prognostic value of tumor infiltrating PCs is unlikely a proxy for general immunological health, supporting a role for antigen-driven processes required for their clonal expansion and emergent humoral immune responses43. Furthermore, a simple ratio of estimated PMN to PC levels was found to be significantly prognostic in diverse solid tumors (Fig. 4b).
To experimentally evaluate the reciprocal survival associations of PMN and PC signatures, we assessed their infiltration of 187 lung adenocarcinomas using tissue microarray (TMA) analysis (Supplementary Table 7). Characteristics of both cell types were observed by H&E staining of tissue sections (Supplementary Fig. 7c,d), and the presence of tumor infiltrating plasmacytic cells (i.e., plasmablasts or plasma cells) was confirmed in fresh tumor specimens using both flow cytometry (Supplementary Fig. 7e) and morphological assessment (Supplementary Fig. 7f). Moreover, we confirmed an elevated presence of plasmacytic cells in non-small cell lung cancer (NSCLC) tumors, as compared to normal adjacent tissues (Supplementary Fig. 7g,h). In serial lung adenocarcinoma tissue sections, we stained for the presence of MPO (myeloperoxidase) and IGKC (Immunoglobulin kappa constant), markers of PMNs and PCs, respectively (Supplementary Fig. 8a). Since B-cells express varying levels of IGKC, we also tested for CD20, a surface marker of mature B-cells but not PCs (Supplementary Fig. 7e). We found <10% overlap with CD20, indicating the high specificity of IGKC for PCs (Supplementary Fig. 8b; Methods). Next, we quantitated the staining area for each marker in the tissue array (Methods; Supplementary Fig. 8c,d). While operating on differing scales and measured on independent tumor specimens, fractional levels of these three markers measured in situ on TMAs were comparable to relative infiltrate levels inferred by CIBERSORT (Fig. 4c). Moreover, in both continuous and binary models, we found a strong relationship between inferior survival and a higher ratio of PMN to PC levels in lung adenocarcinoma, whether measured in PRECOG (Fig. 4d), in held-out microarray validation datasets (Supplementary Fig. 8e), or by surrogate markers in tissue microarray specimens (Fig. 4e). Furthermore, TMA results remained significant in multivariate models incorporating relevant clinical parameters (Supplementary Table 7). Together, these data validate our computational approach, and demonstrate that tumor-associated PMNs and PCs exhibit opposite associations with overall survival.
Circulating leukocytes, including PMNs and B-lymphocytes contribute to the tumor microenvironment44-46, and leukocyte frequencies of innate and adaptive effectors in peripheral blood can have prognostic value43,47. Therefore, we examined a subset of NSCLC patients from the TMA with available peri-operative complete blood counts to assess the concordance between levels of circulating leukocytes and TALs. While intra-tumoral PMN to PC ratios remained significantly prognostic within this subset, we found no significant correlation between circulating and infiltrating compartments, and no prognostic value from circulating leukocyte levels (Supplementary Table 7).
Discussion
Because expression signatures of small numbers of genes48 and cells47 can have utility for predicting response to therapy, including anti-tumor agents, we envision that these resources, available at http://precog.stanford.edu, will be useful for future discovery of predictive biomarkers, including immunotherapies.
PRECOG has unique advantages over related resources49. First, multiple datasets are included for most human cancers, and the use of a robust survival meta-z approach to integrate studies reduces the potential for erroneous conclusions drawn from single datasets. For example, in contrast to one recent study15, we found that molecular markers can indeed add significant prognostic information to clinical variables (see Integrated Discrimination Improvement (IDI) and Net Reclassification Improvement (NRI) analyses for FOXM1-KLRB1 in Supplementary Table 4), consistent with our previous work50. Nevertheless, we acknowledge that prognostic associations remain modest in some cancers (e.g. lung squamous cell carcinoma), and development of clinically applicable molecular models in such cases may remain challenging. Second, although recent studies have inferred TAL activities in relation to patient outcomes32,51,52, most have focused on one or two cell types in multiple cancer histologies32,52, have analyzed a relatively small number of markers that are not uniquely expressed by single immune cell types (e.g., granzyme A (GZMA) and perforin 1 (PRF1); Supplementary Fig. 4b)52, or have studied multiple infiltrating immune cell types in a single cancer histology51.
By applying CIBERSORT, a computational method for inferring leukocyte representation in bulk tumors16, we identified complex relationships between 22 immune subset signatures and overall survival across 25 cancer histologies. One potential limitation of our approach is the fidelity of input reference profiles, which could deviate from the expression programs and functional states of tumor-associated leukocytes. However, CIBERSORT was developed to be robust to noise, and thus far we have observed strong agreement with ground truth assessments in bulk tumors (e.g., Fig. 3a in this work and Fig. 2i in ref. 16). We identified and validated an intriguing reciprocal prognostic association of PC and PMN signatures in diverse and common solid tumors, though the immunological basis for this observation remains unclear. For example, it is unknown whether infiltrating PCs passively reflect non-specific host immunity or actively contribute to anti-tumor humoral immune responses. Future studies could help define novel tumor-associated antigens targeted by these PCs, with potential implications for new monoclonal antibody therapies. We also speculate that tumor-associated PMNs may be functionally related to myeloid derived suppressor cells (MDSCs), known to inhibit active T-cell anti-tumor immune responses53. Indeed, our analysis revealed several broadly favorable prognostic T cell signatures, including γδ and CD8 T cells, whose corresponding cell subsets express KLRB1 (CD161), the top favorable pan-cancer prognostic gene across PRECOG and a marker of enhanced innate immune characteristics in diverse T cell subsets28.
The favorable prognostic association of specific TALs across cancers, including effector T-cell subsets, is relevant to emerging cancer immunotherapies. Recently, therapeutic antibodies targeting T-cell checkpoints, including PD-1 and PD-L1, have yielded unprecedented successes for immunotherapy in some cancers, but not others54. The immunological diversity underlying the presence or absence of such responses in different cancers and in individual patients remains poorly understood55. Our approach provides a framework for characterizing diverse immune effectors as candidate biomarkers for predicting clinical response to these and other novel immunotherapies. It is also amenable to pharmacodynamic measurements assessing the recruitment and/or activation of targeted immune effectors by specific immunotherapies as a means of better understanding treatment responses or failures.
Our results illuminate the prognostic landscape of genes and TALs across human cancers, and suggest numerous hypotheses for further investigation and clinical translation. Moreover, the data and tools presented here should facilitate a variety of future studies, including the investigation of prognostic biomarkers within molecular subtypes and the assessment of multivariate interactions between genes, TALs, and other clinical indices.
Online Methods
PRECOG assembly and quality control
To identify cancer gene expression datasets with corresponding patient outcome data we queried NCBI Gene Expression Omnibus (GEO), EBI ArrayExpress, NCI caArray, and Stanford Microarray Database for the terms, survival, prognosis, prognostic, or outcome. Perl scripts were implemented to download processed and raw data, and associated annotation. For data within NCBI, the array platform was determined from the SOFT format file, and the corresponding annotation file was retrieved from GEO. From these, the Probe ID, Genbank accession, HUGO gene symbol and gene description were extracted based on the internal headers of the SOFT annotation file. The desired fields were specified manually if this automated procedure failed. For older platforms, such as cDNA microarrays, where annotations had not been recently updated, we re-mapped the probe sequences to HUGO gene symbols via the Genbank or Refseq accession number through the NCBI Entrez gene identifier. In cases without available accessions, but with the DNA sequence of the probe, we performed the mapping using BLAT to compare probes to a Refseq reference and look for unique highest-scoring hits.
Scripts were written to extract sample annotation information from GEO SOFT format files and parse them into tables. Since the contents of annotation fields are not semantically enforced, sample data can be contained in various fields, including Sample_title, Sample_characteristics, Sample_description, and Sample_source. Moreover, not all fields are specified for every sample. To parse this information into tabular format, we attempted to estimate the correct variable name (column header) by searching for common substrings across samples. In some cases, a dataset clearly had survival information, but was not deposited with the genomic data. In such cases, we first searched supplementary information of corresponding literature for the missing information. Failing this, we contacted corresponding and first authors, of which roughly half supplied the requested data.
All tabulations of clinical annotations were further checked and manually curated. This process included verification of results in selected studies by direct comparison of Kaplan-Meier plots and time scales with those in the corresponding primary publications, as well as consistency of prognostic genes across studies. Separately, errors due to technical issues or the curation process were estimated by comparing annotated gender to the ratio of RPS4Y1 to XIST (male:female) expression levels56 after microarray normalization, as detailed below (Supplementary Fig. 1a–c). Furthermore, identical samples present in more than one dataset were identified using MD5 checksums for Affymetrix data, and by cross-correlation analysis of expression vectors, and redundant samples were accordingly eliminated.
We applied the following gene expression normalization strategy to allow unification of data from diverse microarray platforms within PRECOG. For Affymetrix GeneChip data, raw CEL files were obtained when possible, and were normalized with the MAS5 algorithm (affy package v. 1.26 of Bioconductor v. 1.8 in R 2.15.1), using a custom CDF (Chip Definition File) for probeset summarization, which updates and maps array oligonucleotides to Entrez gene identifiers57-59 (http://brainarray.mbni.med.umich.edu/Brainarray/). Each dataset, regardless of platform, was quantile normalized separately. Moreover, each gene was log2 transformed if not already in log space, and was then unit mean/variance standardized across samples within a given dataset. While alternative microarray normalization methods have been proposed (e.g., RMA60, gcRMA61, fRMA62, SCAN-UPC63), for survival analysis we did not observe any significant benefit in comparing Affymetrix data normalized as described above to alternate normalization strategies (data not shown). TCGA RNA-seq and clinical data were downloaded from the TCGA Data Coordinating Center using TCGA-assembler64. The gene-level RNA-seq data were pre-processed using TCGA-assembler's ProcessRNASeqData function. RNA-seq and clinical data were matched via the patient barcode provided by TCGA.
For each study, the association of each probe on an array platform with survival outcomes was assessed via Cox proportional hazards regression using the coxph function of the R survival package (v. 2.37). Cox coefficients, hazard ratios with 95% confidence intervals, P values, and z-scores were obtained for each array probe. For datasets that had not been processed with Custom CDF, which yields a unique per-gene expression value, survival z-scores for probes were collapsed to the gene level by averaging z-scores of probes that matched to the same HUGO gene symbol. Z-scores for each gene were summarized across all datasets in each malignancy using Lipták's weighted meta-z test65,66, with weights set to the square roots of sample sizes67. To identify genes with cancer-wide prognostic significance, and avoid bias due to cancers with different sample sizes, we further combined weighted meta-z-scores into a single global meta-z-score for each gene using Stouffer's method (unweighted)66.
Validation of z statistics in PRECOG
Using lung adenocarcinoma as a test case, we assessed the relationship between the weighted meta-z-score metric and standard z-scores, the latter of which were derived from a merged expression matrix consisting of GEPs from lung adenocarcinoma studies in PRECOG (Supplementary Table 1). For this purpose, we selected datasets that had at least 40 stage I samples (indicated in Supplementary Table 1). To mitigate batch effects, we standardized each gene in each dataset such that it had unit mean and variance across stage I samples. Sample annotations were manually reviewed to ensure that staging corresponded to American Joint Committee on Cancer (AJCC) version 6 (2002), based on TNM (Tumor-Nodes-Metastasis) information. Many datasets pre-dated version 7 of AJCC, and did not contain the required detail for annotating to that standard. These refinements and standardizations permitted merging of samples from different datasets comprising different array platforms and different distributions of tumor stage across the cohort. In all, we compared lung adenocarcinoma GEPs from n = 1,106 patients, and found that weighted meta-z scores are significantly correlated with merged z-scores (Spearman's R = 0.9, P < 2.2 × 10−16). We observed similar results when comparing the meta- and merged-z statistics for a compendium of 5 AML studies, thus validating our use of the meta-z statistic. Of note, while we applied batch-correction procedures to merge expression datasets prior to calculating cross-study z-scores, these steps were not necessary with the meta-z metric, as z-scores from individual studies were directly integrated. This suggests that the meta-z approach effectively overcomes batch differences across datasets.
We further evaluated the influence of batch effects21 within individual datasets using Combat68 (Supplementary Fig. 2c–e). Applied to microarray processing dates in four AML studies, we observed only a modest effect on prognostic z-scores, as pre- and post-batch-corrected data were all highly correlated (R ≥ 0.92, P < 2.2 × 10−16; Supplementary Fig. 2c). To test whether batch correction of samples profiled by different study sites would improve data quality, we compared pre- and post-batch-corrected expression data from the NCI director's challenge lung adenocarcinoma dataset (ca00182) with a control dataset consisting of prognostic meta-z scores from a pooled set of all remaining 19 lung adenocarcinoma studies in PRECOG. Little difference in performance was observed for the most prognostic genes, with changes primarily affecting genes whose association with survival outcomes was subtle. (Supplementary Fig. 2d,e).
PRECOG false discovery rate
While z-scores and meta-z scores were analyzed in this work, Q values for global unweighted meta-z and weighted cancer-specific meta-z-scores were estimated using the False Discovery Rate (FDR) method of Storey and Tibshirani69, and are available for all analyzed z-score matrices online (http://precog.stanford.edu). Notably, of 23,288 HUGO gene symbols in PRECOG, 4,385 (19%) have a global meta-z significant at Q < 0.05 (|meta-z| > 2.6), and 2,986 (13%) are significant at a Q < 0.01 (|meta-z| > 3.22) (Supplementary Table 1).
Blinding and sample selection criteria
No blinding was used in this work. Duplicate and non-diagnostic (relapse) samples were excluded from analysis.
Statistical analysis of shared prognostic genes
To determine an empirical P value for the fraction of overlapping genes in Figure 1c, we randomized gene labels for every cancer, and for each cancer, calculated the fraction of prognostic genes shared with at least 1 other cancer (again, using |meta-z| > 3.09). We then calculated the median fraction of shared genes across all cancers and repeated the analysis 100,000 times. We performed a similar analysis in Supplementary Fig. 3a,b, but used 1,000 Monte Carlo iterations to test a broad range of P value and Q value thresholds, considering prognostic genes shared by at least 2, 3, or 4 tumor types in PRECOG.
Clustering of the PRECOG z-score matrix
To identify groups of genes with similar prognostic patterns across the entire PRECOG database (166 cancer datasets and 22,461 gene symbols; 827 genes represented in only a small subset of cancers were excluded), we performed unsupervised cluster analysis using AutoSOME23. Prior to clustering, columns (i.e., cancer datasets) were pre-processed to unit variance, rows (i.e., genes) were median-centered, and sum of squares = 1 normalization was applied to rows and then columns23. AutoSOME was run with a P value threshold of 0.01, 100 ensemble iterations, and otherwise default settings. In all, 665 clusters were identified, and ~50% genes were assigned to the 55 largest clusters (Supplementary Table 3). Functional annotation for each cluster was assessed using a variety of published gene sets and significance of overlap was determined by a hypergeometric test (for PubMed IDs, see Supplementary Table 3). Gene sets with a Bonferroni-corrected P < 0.01 were considered significant. To further evaluate each cluster, we calculated a global meta-z score for each gene (as in Supplementary Table 1), but considered only absolute meta-z scores to avoid biasing clusters with pan-cancer prognostic genes with the same survival orientation. We then calculated a compound score for each cluster by applying the same formula used for the unweighted meta-z score, which weights cluster-wide meta-z scores by cluster size. Importantly, this formula is not intended to yield a prognostic score in this context, since intra-cluster meta-z scores are not independent variables. Moreover, our approach is not simply a proxy for cluster size, as the top 5 ranking clusters were originally ranked 2, 1, 4,13, and 7 with respect to size. The top 5 clusters ranked by decreasing compound scores are shown in Figure 1d.
Cross-validation of top prognostic genes
The utility of PRECOG as a cancer-wide meta-analysis tool was assessed by cross-validation. Cancer histologies (from Supplementary Table 1) were randomly selected and pooled in a training set, with the number of cancers t ranging from 1 to 31 (of 39) total cancers. For each t, the top 10 adverse and favorable prognostic genes were identified using the mean meta-z score across the training set, and the mean meta-z for the same genes was determined in the remaining (39 – t) cancers (validation set). This process was repeated 100 times for each t, and the results are plotted in Fig. 2a. To determine the top prognostic genes across PRECOG, the top 10 adverse and favorable prognostic genes in each validation set were recorded at t = 31, and the most frequently occurring genes are shown in Fig. 2a.
Construction and assessment of a FOXM1-KLRB1 prognostic model
In brief, we used the coxph function in the R survival package to test the prognostic significance of FOXM1 and KLRB1 in bivariate models across PRECOG (each dataset was normalized as described for PRECOG construction). Bivariate models with a P < 0.05 (Wald test) were considered significant. Derivation of the FOXM1-KLRB1 composite score is described in detail below.
To integrate FOXM1 and KLRB1 into a composite score, we analyzed cancer types with at least two independent datasets having a FOXM1, KLRB1 bivariate P < 0.05. Of these, we extracted bivariate coefficients from a randomly selected group of 20% of the corresponding datasets, which served as the training set (n = 8; also see Supplementary Table 5). The median value of each coefficient in the training set was used as an independent weight in the following formula to define a composite score, (FOXM1 × 0.243) + (KLRB1 × −0.169). The median value of the FOXM1-KLRB1 composite score was determined for each left-out dataset (validation set), and used to stratify corresponding patients into high and low risk groups, which were then aggregated by cancer type, and subjected to Kaplan Meier analysis (Fig. 2b). Importantly, each cancer type in Fig. 2b only includes left-out datasets (i.e., validation data) for which FOXM1 and KLRB1 expression values were both available (see Supplementary Table 1 for datasets used). Moreover, the composite score was not re-optimized for RNA-seq when applied to TCGA data.
PRECOG network connectivity analysis
We evaluated the functional coherence of top prognostic genes in PRECOG by interrogation of human protein-protein associations in STRING version 9.070. STRING integrates evidence from multiple sources, including curated databases, experimentally confirmed physical interactions, co-expression data, and associations inferred via text mining70. We compared the connectivity of STRING networks (for edges with a combined confidence level of at least 0.4) for the top 100 adverse and favorable prognostic genes identified from all cancers (by mean meta-z score), individual cancers (meta-z score), and individual datasets (z-score). For each set of proteins, the largest connected component was assessed by two metrics, (1) the average number of associations (i.e., edges) per protein (i.e., node), and (2) algebraic connectivity, a graph theoretic measure of overall network connectedness71. Only networks with at least 10 nodes were considered. Functional enrichment of global networks shown in Fig. 2c was performed with ToppFun72 and P values were determined using a Benjamini-Hochberg-corrected hypergeometric test.
Geneset analysis of PRECOG clusters and ranked z-scores
Enrichment of biological processes with respect to survival z-scores was assessed in two ways. First, cluster memberships defined by AutoSOME were compared to pre-defined gene sets by hypergeometric test. Comparison gene sets were obtained from the Molecular Signatures Database (mSigDB)73, and additional sets of genes that are targets of specific transcription factors were extracted from CHEA29 and ENCODE20. Ranked lists of gene survival z-scores were also analyzed with respect to these gene sets using the PreRanked tool of Gene Set Enrichment Analysis74.
Inferring TAL levels in bulk tumor GEPs
The samples profiled within PRECOG primarily represent bulk diagnostic pre-therapy tumor specimens, which often contain a variety of cell types, including diverse TALs. Given the enrichment of lymphocyte markers in favorably prognostic genes across PRECOG (Figs. 1d, 2c), a method to systematically “unmix” or deconvolve bulk tumor GEPs in PRECOG may reveal new insights into tumor immunobiology. We recently developed a new approach for Cell Type Identification By Estimating Relative Subsets Of RNA Transcripts (CIBERSORT), a machine-learning method that outperformed other approaches in benchmarking experiments16. CIBERSORT produces an empirical P value for the deconvolution using Monte Carlo sampling. Like other linear deconvolution methods, CIBERSORT only operates on expression values in non-log linear space75.
TAL heterogeneity and prognostic associations
CIBERSORT was applied to all normalized PRECOG GEPs from Affymetrix HGU133 platforms (57 studies and 25 cancers; Supplementary Table 6). In all, 5,782 tumor GEPs were successfully deconvolved (CIBERSORT P < 0.005). For each dataset, estimated mRNA fractions of each leukocyte subset were related to survival using univariate Cox regression. Weighted meta-z scores were determined using the same approach described for PRECOG in order to build an immune-centric version of PRECOG (iPRECOG, Supplementary Fig. 6a), and unweighted global meta-z scores were used to summarize pan-cancer leukocyte associations in Figure 3c.
Immune-PRECOG false discovery rate
To differentiate real from stochastic variation in inferred leukocyte prognostic associations, we first compared P values and meta-z scores in immune-PRECOG (Supplementary Fig. 6b), as any deviation from a standard normal distribution must be considered when drawing statistical conclusions. We generated 1000 null meta-z matrices by (1) shuffling the cell type fractions inferred for each dataset in Supplementary Table 6, and (2) computing z-scores and corresponding meta-z scores to capture relationships to overall survival. We found a tight correspondence between the distribution of null meta-z scores and a standard normal distribution (Supplementary Fig. 6b). Having validated the normality of the meta-z score, we then filtered Supplementary Fig. 6a using a range of statistical significance thresholds, and at each cutoff, compared observed versus expected fractions for all leukocyte prognostic associations (Supplementary Fig. 6c). At a two-sided P value threshold of 0.05 (|z| > 1.96), we found nearly three times more prognostic associations than would be expected by random chance; at P < 0.01, there is a five-fold enrichment, which continues to increase with lower P value cutoffs (Supplementary Fig. 6c).
Separately, we performed a similar analysis on the global meta-z scores shown in Fig. 3c. Here, we integrated the null meta-z scores from Supplementary Fig. 6c into null global meta-z scores and recomputed the analysis shown for pan-cancer leukocyte prognostic associations (plotted as the fraction of leukocyte subsets retained at different significance thresholds; Supplementary Fig. 6d). Taken together, these results explicitly quantify significant versus stochastic variation in leukocyte prognostic associations at different statistical cutoffs, and allow others to tune the nominal statistical threshold to achieve a desired false discovery rate.
Relative PMN levels versus necrotic tissue content
Relative RNA fractions of PMNs inferred by CIBERSORT were not correlated with annotated necrotic content in lung squamous cell carcinoma (TCGA; R2 = 0.01; P = NS) or melanoma (microarray dataset GSE840176; R2 ~ 0; P = NS).
Flow cytometry versus CIBERSORT
Flow cytometry analysis of non-small cell lung cancer tumor (n = 13) specimens was performed as described below, and median fractions of CD4+, CD8+, CD19+, CD56+, and CD14+ populations were normalized by overall CD45+ content (Fig. 3a). For comparison with CIBERSORT, leukocyte signature matrix populations were grouped into the same cluster of differentiation categories: CD14+, monocytes, macrophages, and dendritic cells; CD4+, all T cell subsets except CD8 and γδ T cells; CD8+, CD8 T cells; CD19+, naïve and memory B-cells, CD56+, resting and activated NK cells. Median CIBERSORT-inferred fractions for lung adenocarcinoma GEPs, shown in Fig. 3a, were determined from two publicly available microarray datasets, GSE767077 and GSE1007278.
Patient samples
All aspects of this study were approved by the Stanford Institutional Review Board in accordance with the Declaration of Helsinki guidelines for the ethical conduct of research, and all patients involved provided informed consent. For Fig. 3a, fresh human lung tumor samples were obtained from Stanford Tissue Bank. For tissue microarray analyses (Fig. 4c,e, Supplementary Figs. 7c–h), patient samples were retrieved from the surgical pathology archives at the Stanford Department of Pathology and linked to a clinical database using the Cancer Center Database and STRIDE Database tools from Stanford.
Human lung dissociation and flow cytometry
Fresh human lung tumor samples were cut into small pieces and dissociated into single cell suspensions by 45 minutes of Collagenase I (STEMCELL Technologies) digestion. Dissociated single cells were suspended at 1 × 107 per mL in staining buffer (HBSS with 2% heat-inactivated calf serum). After 10 minutes of blocking with 10 μg μl−1 rat IgG, the cells were stained for at least 10 minutes with the antibodies listed below. After washing, stained cells were re-suspended in staining buffer with 1μg/ml DAPI, analyzed, and sorted with a FACS Aria II cell sorter (BD Biosciences). Antibodies used for experiments related to Fig. 3a: CD45-A700, CD14-PE, CD8-APC, CD4-FITC, CD56-PE-cy7, and CD19-PerCP-cy5.5. Antibodies used for enumeration of plasmacytic cells: CD45-PE-cy7, CD20-PerCP-cy5.5, CD138-PE, CD38-APC, CD19-A700, and CD27-FITC. All antibodies were obtained from BioLegend.
Tissue Microarray (TMA) cohort
We reviewed patients with lung cancer who had surgically treated disease and paraffin embedded samples from 1995 through June, 2010 for inclusion. Patients with recurrent or metastatic disease samples only were excluded. Medical charts were reviewed to clinically annotate the tumor specimens with demographic, operative procedures, imaging data, and follow-up. Pathology reports were reviewed to confirm specimen type, site, pathology, stage, histology, invasion status and operative procedure. Treated samples (neoadjuvant therapy) were excluded, resulting in a final analysis cohort of 187 pre-treated lung adenocarcinoma tumor specimens with follow-up data.
TMA cohort follow-up
Recurrence was defined by imaging or biopsy and patients with advanced disease or who did not have at least 6 months of follow-up were censored for further analyses. The National Death Index (NDI) was used to define vital status through October 30, 2010. Patients not dead were assumed to be alive except for those who had left the country or were from other countries (who were censored) since the NDI relies on a social security number for vital status assessment. Synchronous tumors resected over time were eligible for prognostic assessment in patients with two primaries.
TMA construction
The Stanford Lung Cancer TMA was developed from surgical specimens that contained viable tumor from duplicate slides that were reviewed by a board-certified pathologist (R.B.W.). The pathologist was not blinded to sample identity. The area of highest tumor content was marked for coring blocks corresponding to the slides. We used 2 mm cores to build the tissue microarray. These cores were aligned by histology and stage and negative controls were taken from the West Lab and included a variety of benign and malignant tissues (65 cores) that included normal non-lung tissue (12 cores), abnormal non-lung tissue (13 cores), placental markers (23 cores) and normal lung (17 cores). Normal lung consisted of a specimen adjacent, but distinct, from tumor over the years 1995 through 2010 to assess the variability of staining by year. OligoDT analysis was performed on the finished array to assess the architecture of selected cores and adequacy of tissue content prior to target IHC analysis. A co-registered Hematoxylin & Eosin (H&E) slide was used as well to verify tumor location for cases where this was unclear on initial inspection.
TMA immunohistochemistry
MPO (DAKO) and CD20 (clone L26, DAKO) immunohistochemistry performed on 4 mm sections using the Ventana BenchMark XT automated immunostaining platform (Ventana Medical Systems/Roche, Tucson, AZ).
TMA RNA in situ hybridization
The RNA in situ hybridization probe for IGKC was designed against chr2: 88,937,790–88,938,290 (hg18) using primer 5′ – CTG TTG TGT GCC TGC TGA AT – 3’ and the T7 promoter-tagged primer 5’ – CTA ATA CGA CTC ACT ATA GGG TTA AAG CCA AGG AGG AGG AG – 3’. RNA in situ hybridizations were performed on TA369, as described previously79.
TMA microscopy
All slides were scanned at 20x on an Ariol imaging analysis system (originally built by Applied Imaging).
TMA staining quantification and analysis
To facilitate consistency and reproducibility in quantitating TMA staining patterns, we evaluated the performance of GemIdent80, a supervised in silico image segmentation system. As an initial exercise, we trained GemIdent on a single lung adenocarcinoma specimen to recognize both IGKC stains and non-tissue background (white space). GemIdent was then applied to 10 TMA specimens to generate separate image masks of both IGKC localization and non-tissue background (i.e., “empty space”). A custom Perl script was used to process each image mask and quantify the staining area of IGKC for each specimen (by first removing non-tissue white space to calculate the surface area of each tissue). To test the utility of this approach, a board-certified pathologist (RBW) scored IGKC for the same 10 specimens. The pathologist had no knowledge of the results from automated staining, but was not blinded to sample identity. Both assessments were highly correlated (R2 = 0.98; Supplementary Fig. 8c). In a separate exercise, two independent operators trained GemIdent on distinct CD20-stained specimens. CD20-stained fractions were then quantified across the entire TMA (n = 187 lung adenocarcinomas) and results were processed as described above. The concordance between independent operators was very high (R2 ~ 1; Supplementary Fig. 8d). These data support the utility of GemIdent coupled with image post-processing for automated scoring of TMA specimens. We applied this approach to quantitatively score IGKC, CD20, and MPO for all lung adenocarcinoma TMA specimens (e.g., see Supplementary Fig. 8a).
Comparison between TALs and circulating leukocytes
Among patients with available perioperative circulating leukocyte (lymphocyte and PMN) counts, we analyzed the sample closest to the date of procedure (DOP), within −120 to +28 days, where precedence was given to preoperative samples (total n = 48 lung adenocarcinoma patients). As shown in Supplementary Table 7, no relationships were found between circulating leukocyte (CL) levels and TALs quantified on the TMA. Moreover, while the ratio of MPO to IGKC levels remained significantly prognostic within this patient subset (P = 0.02), CL levels had no significant relationship to survival.
Supplementary Material
Acknowledgements
We would like to thank S. Galli, I. Weissman, P. Brown, R. Levy, and H. Kohrt for critically reading the manuscript, and members of the Center for Cancer Systems Biology, Plevritis, Diehn, Levy, and Alizadeh labs for valuable guidance and suggestions. This work was supported by grants from the Doris Duke Charitable Foundation (A.A.A.), Damon Runyon Cancer Research Foundation (A.A.A.), V-Foundation (A.A.A.); by the US Public Health Service/National Institutes of Health U01 CA194389 (A.A.A.), R01 CA188298 (M.D. and A.A.A.), U54 CA149145 (S.K.P.), U01CA154969 (S.K.P.), and 5T32 CA09302-35 (A.M.N.); by the Bent & Janet Cardan Oncology Research Fund (A.A.A.); by the Ludwig Institute for Cancer Research (A.A.A.); by a Department of Defense grant W81XWH-12-1-0498 (A.M.N.); and by a grant from the Siebel Stem Cell Institute and the Thomas and Stacey Siebel Foundation (A.M.N.).
Footnotes
Author Contributions A.J.G., S.K.P., and A.A.A. conceived PRECOG, and A.M.N. and A.A.A. conceived immune-PRECOG. A.J.G., A.M.N., and A.A.A. designed the framework, collected and curated the primary data, and developed strategies for implementation and optimizations in related experiments, analyzed the data, and wrote the paper. A.M.N. and A.J.G. wrote all bioinformatics software for PRECOG and related analyses. A.J.G. and C.L.L. implemented web infrastructure for hosting PRECOG. S.V.B., V.S.N., R.B.W., and M.D. curated the NSCLC tumor GEP and TMA data including clinical annotations. Y.X., A.K. and C.D.H. identified and provided viable NSCLC patient specimens. D.K. and W.F. assisted with flow cytometry characterizations of primary NSCLC tumor specimens and enumeration of corresponding TALs. V.S.N. and R.B.W. constructed the NSCLC TMA and R.B.W. performed in situ hybridizations and immunohistochemistry characterizations for TALs. A.A.A. and S.K.P. contributed equally as senior authors to supervising and funding the project. All authors discussed the results and implications and commented on the manuscript at all stages.
The authors have no competing financial interests to declare.
References
- 1.Coussens LM, Werb Z. Inflammation and cancer. Nature. 2002;420:860–867. doi: 10.1038/nature01322. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Zhang L, et al. Intratumoral T cells, recurrence, and survival in epithelial ovarian cancer. N. Engl. J. Med. 2003;348:203–213. doi: 10.1056/NEJMoa020177. [DOI] [PubMed] [Google Scholar]
- 3.Topalian SL, et al. Safety, activity, and immune correlates of anti-PD-1 antibody in cancer. N. Engl. J. Med. 2012;366:2443–2454. doi: 10.1056/NEJMoa1200690. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Fan J, Chee M, Gunderson K. Highly parallel genomic assays. Nat. Rev. Genet. 2006;7:632–644. doi: 10.1038/nrg1901. [DOI] [PubMed] [Google Scholar]
- 5.Koscielny S. Why Most Gene Expression Signatures of Tumors Have Not Been Useful in the Clinic. Sci. Transl. Med. 2010;2:14ps12–14ps12. doi: 10.1126/scitranslmed.3000313. [DOI] [PubMed] [Google Scholar]
- 6.Dupuy A, Simon RM. Critical review of published microarray studies for cancer outcome and guidelines on statistical analysis and reporting. J. Natl. Cancer Inst. 2007;99:147–157. doi: 10.1093/jnci/djk018. [DOI] [PubMed] [Google Scholar]
- 7.Subramanian J, Simon R. Gene Expression–Based Prognostic Signatures in Lung Cancer: Ready for Clinical Use? J. Natl. Cancer Inst. 2010;102:464–474. doi: 10.1093/jnci/djq025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Dalton WS, Friend SH. Cancer Biomarkers–An Invitation to the Table. Science. 2006;312:1165–1168. doi: 10.1126/science.1125948. [DOI] [PubMed] [Google Scholar]
- 9.Ein-Dor L, Zuk O, Domany E. Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. Proc. Natl. Acad. Sci. U. S. A. 2006;103:5923–5928. doi: 10.1073/pnas.0601231103. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Ransohoff DF. Rules of evidence for cancer molecular-marker discovery and validation. Nat. Rev. Cancer. 2004;4:309–314. doi: 10.1038/nrc1322. [DOI] [PubMed] [Google Scholar]
- 11.Varmus H. Ten Years On–The Human Genome and Medicine. N. Engl. J. Med. 2010;362:2028–2029. doi: 10.1056/NEJMe0911933. [DOI] [PubMed] [Google Scholar]
- 12.Lee HK, Hsu AK, Sajdak J, Qin J, Pavlidis P. Coexpression analysis of human genes across many microarray data sets. Genome Res. 2004;14:1085–1094. doi: 10.1101/gr.1910904. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Mizuno H, Kitada K, Nakai K, Sarai A. PrognoScan: a new database for meta-analysis of the prognostic value of genes. BMC Med. Genomics. 2009;2:18. doi: 10.1186/1755-8794-2-18. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Hebestreit K, et al. Leukemia gene atlas--a public platform for integrative exploration of genome-wide molecular data. PLoS One. 2012;7:e39148. doi: 10.1371/journal.pone.0039148. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Yuan Y, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat. Biotechnol. 2014;32:644–652. doi: 10.1038/nbt.2940. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Newman AM, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Hanahan D, Weinberg RA. Hallmarks of cancer: the next generation. Cell. 2011;144:646–674. doi: 10.1016/j.cell.2011.02.013. [DOI] [PubMed] [Google Scholar]
- 18.Mantovani A, et al. Chemokines in the recruitment and shaping of the leukocyte infiltrate of tumors. Semin. Cancer Biol. 2004;14:155–160. doi: 10.1016/j.semcancer.2003.10.001. [DOI] [PubMed] [Google Scholar]
- 19.Coussens LM, Zitvogel L, Palucka AK. Neutralizing Tumor-Promoting Chronic Inflammation: A Magic Bullet? Science. 2013;339:286–291. doi: 10.1126/science.1232227. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489:91–100. doi: 10.1038/nature11245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Leek JT, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010;11:733–739. doi: 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Beroukhim R, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463:899–905. doi: 10.1038/nature08822. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Newman AM, Cooper JB. AutoSOME: a clustering method for identifying gene expression modules without prior knowledge of cluster number. BMC Bioinformatics. 2010;11:117. doi: 10.1186/1471-2105-11-117. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Gentles AJ, Plevritis SK, Majeti R, Alizadeh AA. Association of a leukemic stem cell gene expression signature with clinical outcomes in acute myeloid leukemia. JAMA. 2010;304:2706–2715. doi: 10.1001/jama.2010.1862. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Zeuner A, Todaro M, Stassi G, De Maria R. Colorectal Cancer Stem Cells: From the Crypt to the Clinic. Cell Stem Cell. 2014;15:692–705. doi: 10.1016/j.stem.2014.11.012. [DOI] [PubMed] [Google Scholar]
- 26.Myatt SS, Lam EW-F. The emerging roles of forkhead box (Fox) proteins in cancer. Nat. Rev. Cancer. 2007;7:847–859. doi: 10.1038/nrc2223. [DOI] [PubMed] [Google Scholar]
- 27.Scholzen T, Gerdes J. The Ki-67 protein: from the known and the unknown. J. Cell. Physiol. 2000;182:311–322. doi: 10.1002/(SICI)1097-4652(200003)182:3<311::AID-JCP1>3.0.CO;2-9. [DOI] [PubMed] [Google Scholar]
- 28.Fergusson JR, et al. CD161 defines a transcriptional and functional phenotype across distinct human T cell lineages. Cell Rep. 2014;9:1075–1088. doi: 10.1016/j.celrep.2014.09.045. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Lachmann A, et al. ChEA: transcription factor regulation inferred from integrating genome-wide ChIP-X experiments. Bioinformatics. 2010;26:2438–2444. doi: 10.1093/bioinformatics/btq466. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Liberzon A, et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740. doi: 10.1093/bioinformatics/btr260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Chen X, et al. The forkhead transcription factor FOXM1 controls cell cycle-dependent gene expression through an atypical chromatin binding mechanism. Mol. Cell. Biol. 2013;33:227–236. doi: 10.1128/MCB.00881-12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Galon J, et al. Type, Density, and Location of Immune Cells Within Human Colorectal Tumors Predict Clinical Outcome. Science. 2006;313:1960–1964. doi: 10.1126/science.1129139. [DOI] [PubMed] [Google Scholar]
- 33.Fridman WH, Pagès F, Sautès-Fridman C, Galon J. The immune contexture in human tumours: impact on clinical outcome. Nat. Rev. Cancer. 2012;12:298–306. doi: 10.1038/nrc3245. [DOI] [PubMed] [Google Scholar]
- 34.Lewis CE, Pollard JW. Distinct role of macrophages in different tumor microenvironments. Cancer Res. 2006;66:605–612. doi: 10.1158/0008-5472.CAN-05-4005. [DOI] [PubMed] [Google Scholar]
- 35.de Visser KE, Eichten A, Coussens LM. Paradoxical roles of the immune system during cancer development. Nat. Rev. Cancer. 2006;6:24–37. doi: 10.1038/nrc1782. [DOI] [PubMed] [Google Scholar]
- 36.Beyer M, Schultze JL. Regulatory T cells in cancer. Blood. 2006;108:804–811. doi: 10.1182/blood-2006-02-002774. [DOI] [PubMed] [Google Scholar]
- 37.Girardi M, et al. Regulation of cutaneous malignancy by gamma delta T cells. Science Signaling. 2001;294:605. [Google Scholar]
- 38.Haas W, Pereira P, Tonegawa S. Gamma/delta cells. Annu. Rev. Immunol. 1993;11:637–685. doi: 10.1146/annurev.iy.11.040193.003225. [DOI] [PubMed] [Google Scholar]
- 39.Fridlender ZG, Albelda SM. Tumor-associated neutrophils: friend or foe? Carcinogenesis. 2012;33:949–955. doi: 10.1093/carcin/bgs123. [DOI] [PubMed] [Google Scholar]
- 40.Di Carlo E, et al. The intriguing role of polymorphonuclear neutrophils in antitumor reactions. Blood. 2001;97:339–345. doi: 10.1182/blood.v97.2.339. [DOI] [PubMed] [Google Scholar]
- 41.Vakkila J, Lotze MT. Inflammation and necrosis promote tumour growth. Nat. Rev. Immunol. 2004;4:641–648. doi: 10.1038/nri1415. [DOI] [PubMed] [Google Scholar]
- 42.Sica A, Schioppa T, Mantovani A, Allavena P. Tumour-associated macrophages are a distinct M2 polarised population promoting tumour progression: potential targets of anti-cancer therapy. Eur. J. Cancer. 2006;42:717–727. doi: 10.1016/j.ejca.2006.01.003. [DOI] [PubMed] [Google Scholar]
- 43.Teramukai S, et al. Pretreatment neutrophil count as an independent prognostic factor in advanced non-small-cell lung cancer: An analysis of Japan Multinational Trial Organisation LC00-03. Eur. J. Cancer. 2009;45:1950–1958. doi: 10.1016/j.ejca.2009.01.023. [DOI] [PubMed] [Google Scholar]
- 44.de Visser KE, Korets LV, Coussens LM. De novo carcinogenesis promoted by chronic inflammation is B lymphocyte dependent. Cancer Cell. 2005;7:411–423. doi: 10.1016/j.ccr.2005.04.014. [DOI] [PubMed] [Google Scholar]
- 45.Liyanage UK, et al. Prevalence of Regulatory T Cells Is Increased in Peripheral Blood and Tumor Microenvironment of Patients with Pancreas or Breast Adenocarcinoma. J. Immunol. 2002;169:2756–2761. doi: 10.4049/jimmunol.169.5.2756. [DOI] [PubMed] [Google Scholar]
- 46.Minárik I, et al. Regulatory T cells, dendritic cells and neutrophils in patients with renal cell carcinoma. Immunol. Lett. 2013 doi: 10.1016/j.imlet.2013.05.010. [DOI] [PubMed] [Google Scholar]
- 47.Yamanaka T, et al. The baseline ratio of neutrophils to lymphocytes is associated with patient prognosis in advanced gastric cancer. Oncology. 2008;73:215–220. doi: 10.1159/000127412. [DOI] [PubMed] [Google Scholar]
- 48.Paik S, et al. A Multigene Assay to Predict Recurrence of Tamoxifen-Treated, Node-Negative Breast Cancer. N. Engl. J. Med. 2004;351:2817–2826. doi: 10.1056/NEJMoa041588. [DOI] [PubMed] [Google Scholar]
- 49.Chen X, Sun X, Hoshida Y. Survival analysis tools in genomics research. Hum Genomics. 2014;8:21. doi: 10.1186/s40246-014-0021-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Alizadeh AA, et al. Prediction of survival in diffuse large B-cell lymphoma based on the expression of 2 genes reflecting tumor and microenvironment. Blood. 2011;118:1350–1358. doi: 10.1182/blood-2011-03-345272. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 51.Bindea G, et al. Spatiotemporal dynamics of intratumoral immune cells reveal the immune landscape in human cancer. Immunity. 2013;39:782–795. doi: 10.1016/j.immuni.2013.10.003. [DOI] [PubMed] [Google Scholar]
- 52.Rooney MS, Shukla SA, Wu CJ, Getz G, Hacohen N. Molecular and genetic properties of tumors associated with local immune cytolytic activity. Cell. 2015;160:48–61. doi: 10.1016/j.cell.2014.12.033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Gabrilovich DI, Nagaraj S. Myeloid-derived suppressor cells as regulators of the immune system. Nat. Rev. Immunol. 2009;9:162–174. doi: 10.1038/nri2506. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54.Pardoll DM. The blockade of immune checkpoints in cancer immunotherapy. Nat. Rev. Cancer. 2012;12:252–264. doi: 10.1038/nrc3239. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55.Ribas A. Tumor Immunotherapy Directed at PD-1. N. Engl. J. Med. 2012;366:2517–2519. doi: 10.1056/NEJMe1205943. [DOI] [PubMed] [Google Scholar]
Methods-only References
- 56.Day A, Carlson MR, Dong J, O'Connor BD, Nelson SF. Celsius: a community resource for Affymetrix microarray data. Genome Biol. 2007;8:R112. doi: 10.1186/gb-2007-8-6-r112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Dai M, et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 2005;33:e175. doi: 10.1093/nar/gni179. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.Gautier L, Cope L, Bolstad BM, Irizarry RA. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004;20:307–315. doi: 10.1093/bioinformatics/btg405. [DOI] [PubMed] [Google Scholar]
- 59.Gentleman RC, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5:R80. doi: 10.1186/gb-2004-5-10-r80. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Irizarry RA, et al. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003;31:e15. doi: 10.1093/nar/gng015. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F. A Model-Based Background Adjustment for Oligonucleotide Expression Arrays. J Am Stat Assoc. 2004;99:909–917. [Google Scholar]
- 62.McCall MN, Bolstad BM, Irizarry RA. Frozen robust multiarray analysis (fRMA). Biostatistics. 2010;11:242–253. doi: 10.1093/biostatistics/kxp059. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Piccolo SR, et al. A single-sample microarray normalization method to facilitate personalized-medicine workflows. Genomics. 2012;100:337–344. doi: 10.1016/j.ygeno.2012.08.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Zhu Y, Qiu P, Ji Y. TCGA-Assembler: open-source software for retrieving and processing TCGA data. Nat. Methods. 2014;11:599–600. doi: 10.1038/nmeth.2956. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Lipták T. On the combination of independent tests. Magyar Tud. Akad. Mat. Kutato Int. Közl. 1958;3:171–196. [Google Scholar]
- 66.Stouffer S, DeVinney L, Suchmen E. The American Soldier: Adjustment During Army Life. Princeton University Press; Princeton, NJ: 1949. [Google Scholar]
- 67.Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J. Evol. Biol. 2011;24:1836–1841. doi: 10.1111/j.1420-9101.2011.02297.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 68.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
- 69.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 70.Szklarczyk D, et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011;39:D561–D568. doi: 10.1093/nar/gkq973. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 71.Fiedler M. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal. 1973;23:298–305. [Google Scholar]
- 72.Chen J, Bardes EE, Aronow BJ, Jegga AG. ToppGene Suite for gene list enrichment analysis and candidate gene prioritization. Nucleic Acids Res. 2009;37:W305–W311. doi: 10.1093/nar/gkp427. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 73.Liberzon A, et al. Molecular signatures database (MSigDB) 3.0. Bioinformatics. 2011;27:1739–1740. doi: 10.1093/bioinformatics/btr260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 74.Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 75.Zhong Y, Liu Z. Gene expression deconvolution in linear space. Nat. Methods. 2012;9:8–9. doi: 10.1038/nmeth.1830. [DOI] [PubMed] [Google Scholar]
- 76.Xu L, et al. Gene Expression Changes in an Animal Melanoma Model Correlate with Aggressiveness of Human Melanoma Metastases. Mol. Cancer Res. 2008;6:760–769. doi: 10.1158/1541-7786.MCR-07-0344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 77.Su L-J, et al. Selection of DDX5 as a novel internal control for Q-RT-PCR from microarray data using a block bootstrap re-sampling scheme. BMC Genomics. 2007;8:140. doi: 10.1186/1471-2164-8-140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 78.Landi MT, et al. Gene Expression Signature of Cigarette Smoking and Its Role in Lung Adenocarcinoma Development and Survival. PLoS One. 2008;3:e1651. doi: 10.1371/journal.pone.0001651. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Brunner A, et al. Transcriptional profiling of long non-coding RNAs and novel transcribed regions across a diverse panel of archived human cancers. Genome Biol. 2012;13:R75. doi: 10.1186/gb-2012-13-8-r75. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Holmes S, Kapelner A, Lee PP. An interactive java statistical image segmentation system: Gemident. J Stat Softw. 2009;30:i10. [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.