lenhancer: LASSO based Enhancer identification
Combining a novel strategy to identify communities of related control elements with a penalized regression approach, we developed individual gene-by-gene models to identify the potential control elements predictive of the expression of the gene.
The package requires the following data:
- expression (E): gene expression data in say n cell types
- regions (R): A set of putative enhancer CREs with a given distance of the genes
- regulation signals (S): regulation signals such as active chromatin marks, DNaseI-seq or ATAC-seq in the n cell types. You can normalise the way you prefer for library sizes, and region sizes.
- tfbs (T): Transcription factor binding signals in these regions. You can use as many TF binding data as you want, but TF ChIP-seq in the n cell types is highly recommended.
For each gene, the method consists of the following steps:
- m = consider peaks that are with 100kB (can be changed)
- Using correlation of T and S between these m regions, a community clustering algorithm is used to group the CREs into clusters called coCREs.
- singleton CREs within 10Kb and coCREs are then considered as predictors, and the gene's expression across cell types as response variable.
- Penalised regression (L1 regularisation, LASSO) along with cross-validation is used to identify enhancers for the gene.
- covariance testing is used to compute the p-value.
install.packages("devtools")
devtools::install_github("hadley/devtools")
devtools::install_github(c("vjbaskar/covTest", "vjbaskar/lenhancer"))
The below example run is for gene "Runx1". Please not the format of the four input data below and keep them the same for your input as well.
#### Input variables
library(lenhancer)
gene = "Runx1"
coCRE_cutoff = 100000 # 100Kb upper limit for considering regions as being mapped to the gene
singleton_cutoff = 20000 # 20Kb upper limit for considering regions that are not coCREs in the model
coCRE_corr_cutoff = 0.5 # The higher the value the tighter the community of CREs
min_tfb_events = 2 # lower limit for considering a region for walktrap community clustering
alphaVal=1 # 1 is the lasso penalty, and 0 the ridge penalty.
scale.predictors=TRUE # scale predictors True or False
family = "gaussian" # "gaussian","binomial","poisson","multinomial","cox","mgaussian"
nfold = 10 # nfold cross-validation: >= 3 and can be as large as the sample (# of cell types) size (leave-one-out CV)
data("expression")
data("regulation_signal")
data("regulation_tfbs")
data("region_gene_mapping")
gene_preds = findEnhancer(gene, expression, regulation_signal, regulation_tfbs, region_gene_mapping, min_tfb_events = 2, coCRE_corr_cutoff = 0.5, singleton_cutoff = 20000, alphaVal = 1, scale.predictors = TRUE, family = "gaussian", nfoldxval = 10)
expression data consists of gene expression values (such as log2(FPKM)). gene as rows, cell types as columns.
> data(expression)
> head(expression)
ESC CESC MES CMES CM CP HB HE HP
Sergef 1.137991 1.044118 1.895715 2.217683 1.666260 0.646738 2.008090 1.244708 1.802633
Bcl7a 2.206694 2.550776 2.844359 3.496080 3.116237 3.478903 2.129927 1.403524 2.419382
Lnx2 2.053856 2.500270 1.863102 1.802951 1.930653 1.332876 2.289595 2.377188 2.195265
Ppia 10.074714 10.342072 9.826837 10.160430 9.599557 9.509954 1.916361 1.533936 2.086379
Gkn1 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784
Lrat -6.122783 -6.090128 -6.229214 -7.160255 -6.348240 -7.412639 -8.432611 -5.641031 -6.950677
MAC
Sergef 2.6563687
Bcl7a 0.2551161
Lnx2 2.8721921
Ppia 0.8198570
Gkn1 -9.9657843
Lrat -5.6209541
region_gene_mapping data frame consists of all the regulatory regions that are mapped to the genes in expression. There are many software such as ChIPseeker, ChIPpeakAnno and HOMER that can map regions to genes. If you map peaks to genes, and the peak lies within the gene you can mark their distance as any value less than singleton_cutoff (default, 20Kb)
> data("region_gene_mapping")
> head(region_gene_mapping)
chr start end peak gene distance
7786 chr16 92514276 92515568 merge.105629 Runx1 85898
7792 chr16 92515907 92516390 merge.105630 Runx1 85076
7797 chr16 92522863 92525592 merge.105632 Runx1 75874
7815 chr16 92572528 92574113 merge.105643 Runx1 27353
7820 chr16 92579265 92579735 merge.105644 Runx1 21731
7824 chr16 92605464 92606424 merge.105650 Runx1 3000
NOTE: Please keep the colnames the same. findEnhancer function looks for c("peak", "gene", "distance")
columns specifically in its codes and is case-senstive.
NOTE that the region_gene_mapping$gene and row.names(expression) must overlap.
regulation_signal data consists of regions as rows and chromatin activity signal (such as log2(FPKM)) for each cell type as columns.
NOTE that the colnames for expression and regulation_signal are the same. NOTE that the rownames for regulation_signal and region_gene_mapping$peak should be same.
> data("regulation_signal")
> head(regulation_signal)
ESC CESC MES CMES CM CP HB HE
merge.105629 2.56820271 2.5806279 0.9168354 1.5256741 0.4129932 0.5330959 1.6784691 -0.1319497
merge.105630 2.44097491 1.6429856 0.5797611 2.1639965 1.5515023 1.7488139 3.1067132 0.7631582
merge.105632 1.30810596 0.7371804 0.9214790 1.5322401 1.1576157 1.1091539 1.2070316 0.7889895
merge.105643 0.02400262 -0.3086307 0.6754163 0.2950902 1.3850548 0.7898872 0.7038305 3.2788514
merge.105644 1.24260338 1.2721741 1.9149712 2.2436335 2.2543188 2.6906041 3.7203829 2.8612361
merge.105650 2.54822747 3.0395300 3.8857761 3.0925726 2.1348828 1.5124602 4.0995385 2.2658194
HP MAC
merge.105629 0.9904566 0.8144609
merge.105630 1.6498018 2.3616716
merge.105632 1.2059928 4.0246107
merge.105643 1.5171622 5.4947875
merge.105644 3.3901037 0.6582151
merge.105650 1.3444846 -3.3652031
regulation_tfbs data consists of TF binding data as 0 = non-binding and 1 = binding. The rownames are regions and colnames are TF ChIP-seq data. In theory, you can use any ChIP-seq data for computing overlap, but the most meaningful is to use the TFBS data for the cell types you are studying.
NOTE that rownames(regulation_tfbs)
is same as rownames(regulation_signal)
> data("regulation_tfbs")
> head(regulation_tfbs)
ESC_ESRRB ESC_NANOG ESC_OCT4 ESC_SOX2 MES_CEBPB MES_ELK4 MES_OCT4 HB_CEBPB HB_GATA2
merge.105677 0 0 0 0 0 0 0 0 0
merge.105727 0 0 0 0 0 0 0 0 0
merge.105746 0 0 0 0 0 0 0 0 0
merge.105753 0 0 0 0 0 0 0 0 0
merge.105769 0 0 0 0 0 0 0 0 0
merge.105629 0 0 0 0 0 0 0 0 0
HB_LMO2 HB_TAL1 HB_TEAD4 HE_CEBPB HE_FLI1 HE_LMO2 HE_MEIS1 HE_TAL1 HP_CEBPB HP_FLI1
merge.105677 1 1 0 0 0 0 0 0 0 0
merge.105727 0 0 0 0 0 0 0 0 0 0
merge.105746 1 1 1 0 0 0 0 0 0 0
merge.105753 0 0 0 0 0 0 0 0 0 0
merge.105769 0 0 0 0 0 0 0 0 0 0
merge.105629 0 0 0 0 0 0 0 0 0 0
HP_GATA1 HP_GATA2 HP_GFI1B HP_GFI1 HP_LMO2 HP_PU1 HP_RUNX1 HP_TAL1 MAC_CEBPB MAC_FLI1
merge.105677 0 0 0 0 0 0 0 0 0 0
merge.105727 0 0 0 0 0 0 0 0 1 0
merge.105746 0 0 0 0 0 0 0 0 0 0
merge.105753 0 0 0 0 0 0 0 0 1 0
merge.105769 0 0 0 0 0 0 0 0 1 0
merge.105629 0 0 0 0 0 0 0 0 0 0
MAC_LMO2 MAC_PU1 MAC_RUNX1 MAC_TAL1
merge.105677 0 1 0 0
merge.105727 0 1 0 0
merge.105746 0 0 0 0
merge.105753 0 1 0 0
merge.105769 0 1 0 0
merge.105629 0 0 0 0
The output gene_pred is a list and the main output are lambda_min = the predictors (regions) obtained when thresholding for lambda at min(MSE), lambda_1se = same as lambda_min but with lambda at min(MSE) + 1stdev and more conservative and p = p value
> summary(gene_preds$Runx1)
Length Class Mode
expression 10 -none- numeric
meta 6 data.frame list
reg_signal 10 data.frame list
reg_tfbs 33 data.frame list
lambda_1se 2 data.frame list
lambda_min 6 data.frame list
p 1 -none- numeric