Skip to content

vjbaskar/lenhancer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

lenhancer: LASSO based Enhancer identification

Introduction

Combining a novel strategy to identify communities of related control elements with a penalized regression approach, we developed individual gene-by-gene models to identify the potential control elements predictive of the expression of the gene.

The package requires the following data:

  • expression (E): gene expression data in say n cell types
  • regions (R): A set of putative enhancer CREs with a given distance of the genes
  • regulation signals (S): regulation signals such as active chromatin marks, DNaseI-seq or ATAC-seq in the n cell types. You can normalise the way you prefer for library sizes, and region sizes.
  • tfbs (T): Transcription factor binding signals in these regions. You can use as many TF binding data as you want, but TF ChIP-seq in the n cell types is highly recommended.

For each gene, the method consists of the following steps:

  • m = consider peaks that are with 100kB (can be changed)
  • Using correlation of T and S between these m regions, a community clustering algorithm is used to group the CREs into clusters called coCREs.
  • singleton CREs within 10Kb and coCREs are then considered as predictors, and the gene's expression across cell types as response variable.
  • Penalised regression (L1 regularisation, LASSO) along with cross-validation is used to identify enhancers for the gene.
  • covariance testing is used to compute the p-value.

Installation

install.packages("devtools")
devtools::install_github("hadley/devtools")
devtools::install_github(c("vjbaskar/covTest", "vjbaskar/lenhancer"))

Example run

The below example run is for gene "Runx1". Please not the format of the four input data below and keep them the same for your input as well.

#### Input variables
library(lenhancer)
gene = "Runx1"
coCRE_cutoff = 100000 # 100Kb upper limit for considering regions as being mapped to the gene
singleton_cutoff = 20000 # 20Kb upper limit for considering regions that are not coCREs in the model
coCRE_corr_cutoff = 0.5 # The higher the value the tighter the community of CREs
min_tfb_events = 2 # lower limit for considering a region for walktrap community clustering
alphaVal=1 # 1 is the lasso penalty, and 0 the ridge penalty.
scale.predictors=TRUE # scale predictors True or False
family = "gaussian" # "gaussian","binomial","poisson","multinomial","cox","mgaussian"
nfold = 10 # nfold cross-validation: >= 3 and can be as large as the sample (# of cell types) size (leave-one-out CV) 


data("expression")
data("regulation_signal")
data("regulation_tfbs")
data("region_gene_mapping")
gene_preds = findEnhancer(gene, expression, regulation_signal, regulation_tfbs, region_gene_mapping, min_tfb_events = 2, coCRE_corr_cutoff = 0.5, singleton_cutoff = 20000, alphaVal = 1, scale.predictors = TRUE, family = "gaussian", nfoldxval = 10)

Input data types

expression

expression data consists of gene expression values (such as log2(FPKM)). gene as rows, cell types as columns.

> data(expression)
> head(expression)
             ESC      CESC       MES      CMES        CM        CP        HB        HE        HP
Sergef  1.137991  1.044118  1.895715  2.217683  1.666260  0.646738  2.008090  1.244708  1.802633
Bcl7a   2.206694  2.550776  2.844359  3.496080  3.116237  3.478903  2.129927  1.403524  2.419382
Lnx2    2.053856  2.500270  1.863102  1.802951  1.930653  1.332876  2.289595  2.377188  2.195265
Ppia   10.074714 10.342072  9.826837 10.160430  9.599557  9.509954  1.916361  1.533936  2.086379
Gkn1   -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784 -9.965784
Lrat   -6.122783 -6.090128 -6.229214 -7.160255 -6.348240 -7.412639 -8.432611 -5.641031 -6.950677
              MAC
Sergef  2.6563687
Bcl7a   0.2551161
Lnx2    2.8721921
Ppia    0.8198570
Gkn1   -9.9657843
Lrat   -5.6209541

regions

region_gene_mapping data frame consists of all the regulatory regions that are mapped to the genes in expression. There are many software such as ChIPseeker, ChIPpeakAnno and HOMER that can map regions to genes. If you map peaks to genes, and the peak lies within the gene you can mark their distance as any value less than singleton_cutoff (default, 20Kb)

> data("region_gene_mapping")
> head(region_gene_mapping)
       chr    start      end         peak  gene distance
7786 chr16 92514276 92515568 merge.105629 Runx1    85898
7792 chr16 92515907 92516390 merge.105630 Runx1    85076
7797 chr16 92522863 92525592 merge.105632 Runx1    75874
7815 chr16 92572528 92574113 merge.105643 Runx1    27353
7820 chr16 92579265 92579735 merge.105644 Runx1    21731
7824 chr16 92605464 92606424 merge.105650 Runx1     3000

NOTE: Please keep the colnames the same. findEnhancer function looks for c("peak", "gene", "distance") columns specifically in its codes and is case-senstive. NOTE that the region_gene_mapping$gene and row.names(expression) must overlap.

regulation signal

regulation_signal data consists of regions as rows and chromatin activity signal (such as log2(FPKM)) for each cell type as columns.

NOTE that the colnames for expression and regulation_signal are the same. NOTE that the rownames for regulation_signal and region_gene_mapping$peak should be same.

> data("regulation_signal")
> head(regulation_signal)
                    ESC       CESC       MES      CMES        CM        CP        HB         HE
merge.105629 2.56820271  2.5806279 0.9168354 1.5256741 0.4129932 0.5330959 1.6784691 -0.1319497
merge.105630 2.44097491  1.6429856 0.5797611 2.1639965 1.5515023 1.7488139 3.1067132  0.7631582
merge.105632 1.30810596  0.7371804 0.9214790 1.5322401 1.1576157 1.1091539 1.2070316  0.7889895
merge.105643 0.02400262 -0.3086307 0.6754163 0.2950902 1.3850548 0.7898872 0.7038305  3.2788514
merge.105644 1.24260338  1.2721741 1.9149712 2.2436335 2.2543188 2.6906041 3.7203829  2.8612361
merge.105650 2.54822747  3.0395300 3.8857761 3.0925726 2.1348828 1.5124602 4.0995385  2.2658194
                    HP        MAC
merge.105629 0.9904566  0.8144609
merge.105630 1.6498018  2.3616716
merge.105632 1.2059928  4.0246107
merge.105643 1.5171622  5.4947875
merge.105644 3.3901037  0.6582151
merge.105650 1.3444846 -3.3652031

TF binding data

regulation_tfbs data consists of TF binding data as 0 = non-binding and 1 = binding. The rownames are regions and colnames are TF ChIP-seq data. In theory, you can use any ChIP-seq data for computing overlap, but the most meaningful is to use the TFBS data for the cell types you are studying.

NOTE that rownames(regulation_tfbs) is same as rownames(regulation_signal)

> data("regulation_tfbs")
> head(regulation_tfbs)
             ESC_ESRRB ESC_NANOG ESC_OCT4 ESC_SOX2 MES_CEBPB MES_ELK4 MES_OCT4 HB_CEBPB HB_GATA2
merge.105677         0         0        0        0         0        0        0        0        0
merge.105727         0         0        0        0         0        0        0        0        0
merge.105746         0         0        0        0         0        0        0        0        0
merge.105753         0         0        0        0         0        0        0        0        0
merge.105769         0         0        0        0         0        0        0        0        0
merge.105629         0         0        0        0         0        0        0        0        0
             HB_LMO2 HB_TAL1 HB_TEAD4 HE_CEBPB HE_FLI1 HE_LMO2 HE_MEIS1 HE_TAL1 HP_CEBPB HP_FLI1
merge.105677       1       1        0        0       0       0        0       0        0       0
merge.105727       0       0        0        0       0       0        0       0        0       0
merge.105746       1       1        1        0       0       0        0       0        0       0
merge.105753       0       0        0        0       0       0        0       0        0       0
merge.105769       0       0        0        0       0       0        0       0        0       0
merge.105629       0       0        0        0       0       0        0       0        0       0
             HP_GATA1 HP_GATA2 HP_GFI1B HP_GFI1 HP_LMO2 HP_PU1 HP_RUNX1 HP_TAL1 MAC_CEBPB MAC_FLI1
merge.105677        0        0        0       0       0      0        0       0         0        0
merge.105727        0        0        0       0       0      0        0       0         1        0
merge.105746        0        0        0       0       0      0        0       0         0        0
merge.105753        0        0        0       0       0      0        0       0         1        0
merge.105769        0        0        0       0       0      0        0       0         1        0
merge.105629        0        0        0       0       0      0        0       0         0        0
             MAC_LMO2 MAC_PU1 MAC_RUNX1 MAC_TAL1
merge.105677        0       1         0        0
merge.105727        0       1         0        0
merge.105746        0       0         0        0
merge.105753        0       1         0        0
merge.105769        0       1         0        0
merge.105629        0       0         0        0

Output

The output gene_pred is a list and the main output are lambda_min = the predictors (regions) obtained when thresholding for lambda at min(MSE), lambda_1se = same as lambda_min but with lambda at min(MSE) + 1stdev and more conservative and p = p value

> summary(gene_preds$Runx1)
           Length Class      Mode   
expression 10     -none-     numeric
meta        6     data.frame list   
reg_signal 10     data.frame list   
reg_tfbs   33     data.frame list   
lambda_1se  2     data.frame list   
lambda_min  6     data.frame list   
p           1     -none-     numeric

About

Enhancer identification using LASSO

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages