ELMER workshop

Introduction

Workshop Description

This workshop demonstrates how to perform ELMER analysis using matched RNA-seq and DNA methylation data.

You can find a detailed information about ELMER at http://bioconductor.org/packages/3.10/bioc/vignettes/ELMER/inst/doc/index.html.

Articles about ELMER:

  • Tiago C Silva, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J Hazelett, Houtan Noushmehr, De-Chen Lin, Benjamin P Berman, ELMER v.2: an R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles, Bioinformatics, Volume 35, Issue 11, 1 June 2019, Pages 1974–1977, https://doi.org/10.1093/bioinformatics/bty902
  • Yao, Lijing, et al. "Inferring regulatory element landscapes and transcription factor networks from cancer methylomes." Genome biology 16.1 (2015): 105. https://doi.org/10.1186/s13059-015-0668-3
  • Yao, Lijing, Benjamin P. Berman, and Peggy J. Farnham. "Demystifying the secret mission of enhancers: linking distal regulatory elements to target genes." Critical reviews in biochemistry and molecular biology 50.6 (2015): 550-573. https://doi.org/10.3109/10409238.2015.1087961

Material

Pre-requisites

  • Basic knowledge of R syntax
  • Familiarity with the SummarizedExperiment classes
  • Familiarity with ’omics data types including DNA methylation and gene expression

Workshop Participation

Students will have a chance to run ELMER analysis on a provided MultiAssayExperiment object created from TCGA data from GDC data portal.

Goals and objectives

  • gain familiarity ELMER input data, a MultiAssayExperiment object
  • Execute ELMER analysis on real data and understand its meaning

Main R/Bioconductor packages used

Starting ELMER analysis

Loading libraries

First, we need to load the R packages used in this workshop, they should provide the required functions to run the code.

In [1]:
suppressMessages({
    library("ELMER")
    library("MultiAssayExperiment")
})

ELMER input data

The RNA-seq data, DNA methylation data and patients metadata (i.e age, gender) used in this workshop is structured as a MultiAssayExperiment. You can read more about a MultiAssayExperiment object at http://bioconductor.org/packages/MultiAssayExperiment/ and https://bioconductor.github.io/BiocWorkshops/workflow-for-multi-omics-analysis-with-multiassayexperiment.html.

Through the next section we will:

  1. Load the data
  2. Verify the RNA-seq data
  3. Verify the DNA methylation data
  4. Verify the samples metadata

Loading the data

The data which is available in this machine, can also be downloaded at this google drive.

In [2]:
mae <- readRDS("Data/TCGA_ESCA_MAE_distal_regions.rds")
mae
colnames(mae)
A MultiAssayExperiment object of 2 listed
 experiments with user-defined names and respective classes. 
 Containing an ExperimentList class object of length 2: 
 [1] DNA methylation: RangedSummarizedExperiment with 148951 rows and 171 columns 
 [2] Gene expression: RangedSummarizedExperiment with 56830 rows and 171 columns 
Features: 
 experiments() - obtain the ExperimentList instance 
 colData() - the primary/phenotype DataFrame 
 sampleMap() - the sample availability DataFrame 
 `$`, `[`, `[[` - extract colData columns, subset, or experiment 
 *Format() - convert into a long or wide DataFrame 
 assays() - convert ExperimentList to a SimpleList of matrices
CharacterList of length 2
[["DNA methylation"]] TCGA-2H-A9GF-01A-11D-A37D-05 ...
[["Gene expression"]] TCGA-2H-A9GF-01A-11R-A37I-31 ...

Check RNA-seq object

In [3]:
# check experiments
experiments(mae)
ExperimentList class object of length 2: 
 [1] DNA methylation: RangedSummarizedExperiment with 148951 rows and 171 columns 
 [2] Gene expression: RangedSummarizedExperiment with 56830 rows and 171 columns 
In [4]:
# Get the Gene expression object
rna.seq <- mae[[2]]

message("Number of genes ",nrow(rna.seq))

message("Number of samples ",ncol(rna.seq))

message("Check genes metadata")
rowRanges(rna.seq)

message("Check gene expression data")
assay(rna.seq)[1:4,1:4]
Number of genes 56830
Number of samples 171
Check genes metadata
GRanges object with 56830 ranges and 3 metadata columns:
                  seqnames              ranges strand | ensembl_gene_id
                     <Rle>           <IRanges>  <Rle> |     <character>
  ENSG00000000003     chrX 100627109-100639991      - | ENSG00000000003
  ENSG00000000005     chrX 100584802-100599885      + | ENSG00000000005
  ENSG00000000419    chr20   50934867-50958555      - | ENSG00000000419
  ENSG00000000457     chr1 169849631-169894267      - | ENSG00000000457
  ENSG00000000460     chr1 169662007-169854080      + | ENSG00000000460
              ...      ...                 ...    ... .             ...
  ENSG00000281904     chr2   90365737-90367699      + | ENSG00000281904
  ENSG00000281909    chr15   22480439-22484840      - | ENSG00000281909
  ENSG00000281910    chr16   58559796-58559931      - | ENSG00000281910
  ENSG00000281912     chr1   45303910-45305619      + | ENSG00000281912
  ENSG00000281920     chr2   65623272-65628424      + | ENSG00000281920
                  external_gene_name original_ensembl_gene_id
                         <character>              <character>
  ENSG00000000003             TSPAN6       ENSG00000000003.13
  ENSG00000000005               TNMD        ENSG00000000005.5
  ENSG00000000419               DPM1       ENSG00000000419.11
  ENSG00000000457              SCYL3       ENSG00000000457.12
  ENSG00000000460           C1orf112       ENSG00000000460.15
              ...                ...                      ...
  ENSG00000281904         AC233263.6        ENSG00000281904.1
  ENSG00000281909            HERC2P7        ENSG00000281909.1
  ENSG00000281910           SNORA50A        ENSG00000281910.1
  ENSG00000281912          LINC01144        ENSG00000281912.1
  ENSG00000281920         AC007389.5        ENSG00000281920.1
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths
Check gene expression data
A matrix: 4 × 4 of type dbl
TCGA-2H-A9GF-01A-11R-A37I-31TCGA-2H-A9GG-01A-11R-A37I-31TCGA-2H-A9GH-01A-11R-A37I-31TCGA-2H-A9GI-01A-11R-A37I-31
ENSG0000000000317.6316516.75826717.62450817.89604
ENSG00000000005 0.00000 7.721927 9.236257 0.00000
ENSG0000000041919.3312519.18803019.76541120.90188
ENSG0000000045715.8756516.25120515.74524315.92405

Check DNA methylation data

In [5]:
# Get the DNA methylation object
dna.met <- mae[[1]]

message("Number of probes: ", nrow(dna.met))
message("Number of samples: ", ncol(dna.met))


message("Check DNA methylation probes metadata")
rowRanges(dna.met)[,1:4]

message("Check DNA methylation beta values")
assay(dna.met)[1:4,1:4]
Number of probes: 148951
Number of samples: 171
Check DNA methylation probes metadata
GRanges object with 148951 ranges and 4 metadata columns:
             seqnames              ranges strand | address_A address_B
                <Rle>           <IRanges>  <Rle> | <integer> <integer>
  cg08258224     chr1       864703-864704      - |  23712478      <NA>
  cg13938959     chr1       898803-898804      + |  38601418      <NA>
  cg12445832     chr1       898915-898916      - |  70726486      <NA>
  cg23999112     chr1       898976-898977      - |  49747457      <NA>
  cg11527153     chr1       902156-902157      + |  47662324      <NA>
         ...      ...                 ...    ... .       ...       ...
  cg16428758     chrX 155539968-155539969      - |  28623392      <NA>
  cg05682970     chrX 155609878-155609879      - |  34723345      <NA>
  cg07211220     chrX 155616254-155616255      + |  33714463      <NA>
  cg25059696     chrY     7563149-7563150      + |  61613414      <NA>
  cg13851368     chrY   11954167-11954168      + |  36601306  32634316
                 channel  designType
             <character> <character>
  cg08258224        Both          II
  cg13938959        Both          II
  cg12445832        Both          II
  cg23999112        Both          II
  cg11527153        Both          II
         ...         ...         ...
  cg16428758        Both          II
  cg05682970        Both          II
  cg07211220        Both          II
  cg25059696        Both          II
  cg13851368         Grn           I
  -------
  seqinfo: 26 sequences from an unspecified genome; no seqlengths
Check DNA methylation beta values
A matrix: 4 × 4 of type dbl
TCGA-2H-A9GF-01A-11D-A37D-05TCGA-2H-A9GG-01A-11D-A37D-05TCGA-2H-A9GH-01A-11D-A37D-05TCGA-2H-A9GI-01A-11D-A37D-05
cg082582240.90683150.88881160.82470160.8174693
cg139389590.27910420.21780400.32419530.3408468
cg124458320.19004860.12026980.22904220.1741125
cg239991120.31981710.19690210.42689680.4687823

Check samples metadata

In [6]:
message("metadata can be accessed using the function colData")
message("we will check the first 4 samples and their 4 columns")
colData(mae)[1:4,1:4]

message("or you can also access the metadata directly with the $")
mae$primary_diagnosis[1:4]

message("summarize the number of tissue type with the function table")
table(mae$definition)

message("summarize the numbers of samples using two features")
table(mae$primary_diagnosis,mae$definition)
metadata can be accessed using the function colData
we will check the first 4 samples and their 4 columns
DataFrame with 4 rows and 4 columns
                           sample      patient shortLetterCode
                      <character>  <character>     <character>
TCGA-2H-A9GF-01A TCGA-2H-A9GF-01A TCGA-2H-A9GF              TP
TCGA-2H-A9GG-01A TCGA-2H-A9GG-01A TCGA-2H-A9GG              TP
TCGA-2H-A9GH-01A TCGA-2H-A9GH-01A TCGA-2H-A9GH              TP
TCGA-2H-A9GI-01A TCGA-2H-A9GI-01A TCGA-2H-A9GI              TP
                          definition
                         <character>
TCGA-2H-A9GF-01A Primary solid Tumor
TCGA-2H-A9GG-01A Primary solid Tumor
TCGA-2H-A9GH-01A Primary solid Tumor
TCGA-2H-A9GI-01A Primary solid Tumor
or you can also access the metadata directly with the $
  1. 'Adenocarcinoma, NOS'
  2. 'Adenocarcinoma, NOS'
  3. 'Adenocarcinoma, NOS'
  4. 'Adenocarcinoma, NOS'
summarize the number of tissue type with the function table
         Metastatic Primary solid Tumor Solid Tissue Normal 
                  1                 161                   9 
summarize the numbers of samples using two features
                                            
                                             Metastatic Primary solid Tumor
  Adenocarcinoma, NOS                                 0                  78
  Basaloid squamous cell carcinoma                    0                   1
  Mucinous adenocarcinoma                             0                   1
  Squamous cell carcinoma, keratinizing, NOS          0                   4
  Squamous cell carcinoma, NOS                        1                  76
  Tubular adenocarcinoma                              0                   1
                                            
                                             Solid Tissue Normal
  Adenocarcinoma, NOS                                          7
  Basaloid squamous cell carcinoma                             0
  Mucinous adenocarcinoma                                      1
  Squamous cell carcinoma, keratinizing, NOS                   0
  Squamous cell carcinoma, NOS                                 1
  Tubular adenocarcinoma                                       0

Selecting the samples

Since we will use only Primary solid Tumor samples we will remove the Metastatic and Solid Tissue Normal samples from our object

In [7]:
mae <- mae[, mae$definition == "Primary solid Tumor"]

Performing ELMER analysis

In [8]:
# which is the column of the samples metadata defining our groups
group.col <- "primary_diagnosis"

# which are the grops withing the group.col that we want to compare 
group1 <- "Adenocarcinoma, NOS"
group2 <- "Squamous cell carcinoma, NOS"

# Identify probes hypo in group1 or hypo in group2 ? 
direction <- "hypo" # hypo in group 1

cores <- 2
mode <- "supervised"

# Where do we want to save the results ?
dir.out <- paste0("analysis/",group.col,"-",
                  gsub("[[:punct:]]| ", "_", group1),
                  "_vs_",
                  gsub("[[:punct:]]| ", "_", group2),
                  "_", mode,
                  "/", direction)
message("Results will be saved at: ",dir.out)
dir.create(dir.out, showWarnings = FALSE,recursive = T)
Results will be saved at: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo

Identifying hypo methylated distal probes between ESAD samples compared to ESCC samples

In [9]:
# time expected with one core: 4 minuntes
diff.probes <- get.diff.meth(data = mae, 
                             group.col = group.col,
                             group1 = group1,
                             group2 = group2,
                             diff.dir = direction, # Get probes hypometh. in group 1 
                             cores = cores, 
                             mode = mode,
                             pvalue = 0.01, 
                             sig.dif = 0.3,
                             dir.out =  dir.out, 
                             save = TRUE)
# check the results
head(diff.probes)
Supervised mode will use all samples from boths groups. Setting argument minSubgroupFrac to 1
ELMER will search for probes hypomethylated in group Adenocarcinoma, NOS (n:78) compared to Squamous cell carcinoma, NOS (n:76)
ooo Arguments ooo
o Number of probes: 148951
o Beta value difference cut-off: 0.3
o FDR cut-off: 0.01
o Mode: supervised
o % of samples per group in each comparison: 1
o Min number of samples per group in each comparison: 5
o Nb of samples group1 in each comparison: 78
o Nb of samples group2 in each comparison: 76
Output direction: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo
ooooooooooooooooo
Progress disabled when using parallel plyr
Warning message in setup_parallel():
"No parallel backend registered"Saving results
Number of relevant probes found: 1524
A data.frame: 6 × 4
probepvalueAdenocarcinoma..NOS_Minus_Squamous.cell.carcinoma..NOSadjust.p
<chr><dbl><dbl><dbl>
cg19860160cg198601603.178629e-41-0.42562664.782424e-38
cg02838683cg028386837.322499e-46-0.31388762.423763e-42
cg22512779cg225127791.305816e-45-0.35382534.225151e-42
cg17094363cg170943634.500204e-35-0.47516642.376985e-32
cg20429981cg204299811.834374e-22-0.32276211.790510e-20
cg01189072cg011890729.596881e-18-0.37710105.028016e-16

Correlate RNA-seq expression and DNA methylation

Now that we have our hypomethylated probes we will look for the 20 nearest genes
to each probes to identify if any of these genes could being affected by the hypomethylation in the distal regulatory region.

In [10]:
# time expected less than one minute
nearGenes <- GetNearGenes(data = mae, 
                          probes =  diff.probes$probe, 
                          numFlankingGenes = 20)
head(nearGenes)
Searching for the 20 near genes
Identifying gene position for each probe
A grouped_df: 6 × 5
IDGeneIDSymbolDistanceSide
<chr><chr><chr><dbl><chr>
cg00009553ENSG00000261590AC018554.2-1462705L10
cg00009553ENSG00000278499AC018554.3-1460699L9
cg00009553ENSG00000261436AC018554.1-1377934L8
cg00009553ENSG00000261310AC009081.1-1296929L7
cg00009553ENSG00000259844GNPATP -1163530L6
cg00009553ENSG00000261178AC009169.1 -891638L5

Now that we have our hypomethylated probes and the 20 nearest genes we want to identify if any of these genes could being affected by the hypomethylatio in the distal regulatory region.

In [11]:
# time expected  one minute
pair <- get.pair(data = mae,
                 nearGenes = nearGenes,
                 group.col = group.col,
                 group1 = group1,
                 group2 = group2,
                 mode = mode,
                 diff.dir = direction, 
                 raw.pvalue = 0.001,   
                 Pe = 0.001, 
                 filter.probes = TRUE,
                 filter.percentage = 0.05,
                 filter.portion = 0.3,
                 dir.out =  dir.out,
                 cores = cores, 
                 label = direction)

head(pair)
Using pre-defined groups. U (unmethylated): Adenocarcinoma, NOS, M (methylated): Squamous cell carcinoma, NOS
-------------------
* Filtering probes
-------------------
For more information see function preAssociationProbeFiltering
Making sure we have at least 5% of beta values lesser than 0.3 and 5% of beta values greater 0.3.
Removing 100701 probes out of 148951
Calculating Pp (probe - gene) for all nearby genes
Progress disabled when using parallel plyr
Warning message in setup_parallel():
"No parallel backend registered"File created: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo/getPair.hypo.all.pairs.statistic.csv
A data.frame: 6 × 7
ProbeGeneIDSymbolDistanceSidesRaw.pFDR
<chr><chr><chr><dbl><chr><dbl><dbl>
cg13100118.ENSG00000178828cg13100118ENSG00000178828RNF186345152R94.334152e-273.345954e-24
cg19533443.ENSG00000173702cg19533443ENSG00000173702MUC13 -38437L26.023782e-273.345954e-24
cg14981432.ENSG00000124920cg14981432ENSG00000124920MYRF 0L17.607601e-273.345954e-24
cg26545245.ENSG00000166866cg26545245ENSG00000166866MYO1A 40016R47.909052e-273.345954e-24
cg17277896.ENSG00000143375cg17277896ENSG00000143375CGN 37443R18.886298e-273.345954e-24
cg21157873.ENSG00000143375cg21157873ENSG00000143375CGN 0L18.886298e-273.345954e-24

We can plot the correlation heatmap of DNA methylation and gene expression with the heatmapPairs function. We will plot only the first 100 pairs.

In [12]:
# Plot size options
options(repr.plot.width=12, repr.plot.height=7)


heatmapPairs(data = mae, 
             group.col = group.col,
             group1 = group1,
             group2 = group2,
             subset = TRUE,
             pairs = pair[1:100,],
             filename =  NULL)
Subsetting
Ordering groups

Identifying TF enriched motifs on the hypomethylated regions

We have hypomethylated regions that has a distal gene upregulated. We want to identify which is the MR TF binding to those regions and regulating the distal gene. First we identify which are the enriched motifs for those hypomethylated regions using the function get.enriched.motif. This will output the enriched motifs and the respective hypomethylated probes with the motif signature around it (since a DNA methylation probes is 1bp, we extended the search window to $\pm$ 250 bp).

In [13]:
# time expected  less than one minute
enriched.motif <- get.enriched.motif(data = mae,
                                     probes = unique(pair$Probe), 
                                     dir.out = dir.out,
                                     label = direction,
                                     min.incidence = 10,
                                     lower.OR = 1.5)
Loading object: Probes.motif.hg38.450K
|====================================================|100%                      Completed after 6 s 
Retrieving TFClass family classification from ELMER.data.
Retrieving TFClass subfamily classification from ELMER.data.
------------------------------------
** Filtering motifs based on quality
------------------------------------
Number of enriched motifs with quality:
-----------
 => A: 18
 => B: 5
 => C: 5
 => D: 7
 => S: 0
-----------
Considering only motifs with quality from A up to DS: 35 motifs are enriched.
---------------------------------------------------
* Adding enriched motifs to significant pairs file
---------------------------------------------------
Adding coordinates for probes and genes from the provided data
Joining, by = "GeneID"
Joining, by = "Probe"
Saving file: analysis/primary_diagnosis-Adenocarcinoma__NOS_vs_Squamous_cell_carcinoma__NOS_supervised/hypo/getPair.hypo.pairs.significant.withmotif.csv
In [14]:
message("check top 5 enriched motifs")
names(enriched.motif)[1:5]

message("Check the paired hypomethylated probes with the motif signature")
enriched.motif$GATA4_HUMAN.H11MO.0.A

message("Plot enrichement analysis results")
motif.result.file <- file.path(dir.out, "getMotif.hypo.motif.enrichment.csv")

# Plot size options
options(repr.plot.width=12, repr.plot.height=7)
motif.enrichment.plot(motif.enrichment = motif.result.file,
                      significant = list(lowerOR = 2.0), 
                      label = "hypo", 
                      summary = TRUE,
                      save = FALSE)
check top 5 enriched motifs
  1. 'GATA4_HUMAN.H11MO.0.A'
  2. 'GATA6_HUMAN.H11MO.0.A'
  3. 'HNF4A_HUMAN.H11MO.0.A'
  4. 'GATA2_HUMAN.H11MO.1.A'
  5. 'GATA1_HUMAN.H11MO.1.A'
Check the paired hypomethylated probes with the motif signature
  1. 'cg22050733'
  2. 'cg04275506'
  3. 'cg19860160'
  4. 'cg22512779'
  5. 'cg19626253'
  6. 'cg03601886'
  7. 'cg13784017'
  8. 'cg11959316'
  9. 'cg07191152'
  10. 'cg25098793'
  11. 'cg05625834'
  12. 'cg17277896'
  13. 'cg26787863'
  14. 'cg00075975'
  15. 'cg07005353'
  16. 'cg00577847'
  17. 'cg07117596'
  18. 'cg14520947'
  19. 'cg23486701'
  20. 'cg22107147'
  21. 'cg20043649'
  22. 'cg08818866'
  23. 'cg08847775'
  24. 'cg12077754'
  25. 'cg27239243'
  26. 'cg10560561'
  27. 'cg05977696'
  28. 'cg23367358'
  29. 'cg10018933'
  30. 'cg21926118'
  31. 'cg07613752'
  32. 'cg01450522'
  33. 'cg24590723'
  34. 'cg03865925'
  35. 'cg18342703'
  36. 'cg21425296'
  37. 'cg00584840'
  38. 'cg23174148'
  39. 'cg22539744'
  40. 'cg01377807'
  41. 'cg04319611'
  42. 'cg11578055'
  43. 'cg15634747'
  44. 'cg23971987'
  45. 'cg01923775'
  46. 'cg05626226'
  47. 'cg03464224'
  48. 'cg00796302'
  49. 'cg26728709'
  50. 'cg08988420'
  51. 'cg08830157'
  52. 'cg18430895'
  53. 'cg08609445'
  54. 'cg16016281'
  55. 'cg11821200'
  56. 'cg24606701'
  57. 'cg27351783'
  58. 'cg21566177'
  59. 'cg03172688'
  60. 'cg10505610'
  61. 'cg11584098'
  62. 'cg19692192'
  63. 'cg05165087'
  64. 'cg07186962'
  65. 'cg08015382'
  66. 'cg27565517'
  67. 'cg15094236'
  68. 'cg12663302'
  69. 'cg23381884'
  70. 'cg14493742'
  71. 'cg09305161'
  72. 'cg10846969'
  73. 'cg23226631'
  74. 'cg10784258'
  75. 'cg22443940'
  76. 'cg19929194'
  77. 'cg12964881'
  78. 'cg24645679'
  79. 'cg00084338'
  80. 'cg17287122'
  81. 'cg27246129'
  82. 'cg09022230'
  83. 'cg00772192'
  84. 'cg02498579'
  85. 'cg08828787'
  86. 'cg14033170'
  87. 'cg18100008'
  88. 'cg09422239'
  89. 'cg15059474'
  90. 'cg24300607'
  91. 'cg19935531'
  92. 'cg14199261'
  93. 'cg05921889'
  94. 'cg05744395'
  95. 'cg18991321'
  96. 'cg00987534'
  97. 'cg02901002'
  98. 'cg12018521'
  99. 'cg01824603'
  100. 'cg10087556'
  101. 'cg14533298'
  102. 'cg17291136'
  103. 'cg04237288'
  104. 'cg18336854'
  105. 'cg12726213'
  106. 'cg24027477'
  107. 'cg15235922'
  108. 'cg07273304'
  109. 'cg26895804'
  110. 'cg10131026'
  111. 'cg14325112'
  112. 'cg03489427'
  113. 'cg10063407'
  114. 'cg01364755'
  115. 'cg14509153'
  116. 'cg10476206'
  117. 'cg17568611'
  118. 'cg23681599'
  119. 'cg10494981'
  120. 'cg25227803'
  121. 'cg27065896'
  122. 'cg11688410'
  123. 'cg17425818'
  124. 'cg04337594'
  125. 'cg18203466'
  126. 'cg10955669'
  127. 'cg19754622'
  128. 'cg13608733'
  129. 'cg26725502'
  130. 'cg16311302'
  131. 'cg15722533'
  132. 'cg20482760'
  133. 'cg23091777'
  134. 'cg03078767'
  135. 'cg02856190'
  136. 'cg25015613'
  137. 'cg15110243'
  138. 'cg18246298'
  139. 'cg02159731'
  140. 'cg06978117'
  141. 'cg23327896'
  142. 'cg16182447'
  143. 'cg26540315'
  144. 'cg18248586'
  145. 'cg03554573'
  146. 'cg21127079'
  147. 'cg21672401'
  148. 'cg21738443'
  149. 'cg25757470'
  150. 'cg17046586'
  151. 'cg19091779'
  152. 'cg01579458'
  153. 'cg10621809'
  154. 'cg17066349'
  155. 'cg27614751'
  156. 'cg26545245'
  157. 'cg04874286'
  158. 'cg11658060'
  159. 'cg13712197'
  160. 'cg17843665'
  161. 'cg20360395'
  162. 'cg16378261'
  163. 'cg12194594'
  164. 'cg16536610'
  165. 'cg12128696'
  166. 'cg13995599'
  167. 'cg02909176'
  168. 'cg16217297'
  169. 'cg07800658'
  170. 'cg15935770'
  171. 'cg23639055'
  172. 'cg14850604'
  173. 'cg21207694'
  174. 'cg19708418'
  175. 'cg13594138'
  176. 'cg21523688'
  177. 'cg04094640'
  178. 'cg09113939'
  179. 'cg07092029'
  180. 'cg07098502'
  181. 'cg10020890'
  182. 'cg00599850'
  183. 'cg13688786'
  184. 'cg04726374'
  185. 'cg04136369'
  186. 'cg08659204'
  187. 'cg03828329'
  188. 'cg04770088'
  189. 'cg04603419'
  190. 'cg17415265'
  191. 'cg00777445'
  192. 'cg19078037'
  193. 'cg00423871'
  194. 'cg09926389'
  195. 'cg18788725'
  196. 'cg07604565'
  197. 'cg26162564'
  198. 'cg27403406'
  199. 'cg21211039'
  200. 'cg01327767'
  201. 'cg13758960'
  202. 'cg05321038'
  203. 'cg00389552'
Plot enrichement analysis results
TableGrob (2 x 2) "arrange": 2 grobs
  z     cells    name           grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (1-1,2-2) arrange gtable[layout]

Inferring MR TF binding to hypomethylated regions

Now that we have our enriched motifs, we have the following question:

  • Which is the candidate master regulator Transcription factor binding to those regions ?

It is important to highlight that since TF within the same family and subfamily have very close motifs, they will likely be indentified to bind the same regions. You can check that by verifying the intersection of probes from our enrichment analysis results.

So more than 70% of probes are enriched for both GATA4 and GATA6. How could we infer which of those MR might be the one with higher impact ?

For that ELMER correlates the TF expression vs the average mean methylation of the hypomethylated probes with a motif signature and rank them. We expect that a TF with higher expression and lower methylation in the binding regions will be a better MR TF candidate than TF that are for example equally expressed on both groups.

In [15]:
# time expected  ~5 minutes
TF <- get.TFs(data = mae, 
              group.col = group.col,
              mode = mode,
              group1 = group1,
              group2 = group2,
              diff.dir = direction,
              enriched.motif = enriched.motif,
              dir.out = dir.out,
              label = direction, 
              cores = cores)
Using pre-defined groups. U (unmethylated): Adenocarcinoma, NOS, M (methylated): Squamous cell carcinoma, NOS
-------------------------------------------------------------------------------------------------------------------
** Downloading TF list from Lambert, Samuel A., et al. The human transcription factors. Cell 172.4 (2018): 650-665.
-------------------------------------------------------------------------------------------------------------------
Accessing TF families from TFClass database to indentify known potential TF
Retrieving TFClass family classification from ELMER.data.
Retrieving TFClass subfamily classification from ELMER.data.
Calculating the average methylation at all motif-adjacent probes 
Progress disabled when using parallel plyr
Warning message in setup_parallel():
"No parallel backend registered"Performing Mann-Whitney U test
Progress disabled when using parallel plyr
Warning message in setup_parallel():
"No parallel backend registered"Finding potential TF and known potential TF
Progress disabled when using parallel plyr
Warning message in setup_parallel():
"No parallel backend registered"
In [16]:
message("Check the results for the GATA4_HUMAN.H11MO.0.A motif")
TF["GATA4_HUMAN.H11MO.0.A",1:5]
Check the results for the GATA4_HUMAN.H11MO.0.A motif
A data.frame: 1 × 5
motiftop.potential.TF.familytop.potential.TF.subfamilypotential.TF.familypotential.TF.subfamily
<chr><chr><chr><chr><chr>
GATA4_HUMAN.H11MO.0.AGATA4_HUMAN.H11MO.0.AGATA6GATA6GATA6;GATA4GATA6;GATA4

Visualizing the results with TF ranking plot

ELMER provides an easy way to visualize the rank of all human TF evaluated with the function TF.rank.plot.

In [17]:
# Plot size options
options(repr.plot.width=7, repr.plot.height=7)

load(file.path(dir.out,"getTF.hypo.TFs.with.motif.pvalue.rda"))
motif <-  "GATA4_HUMAN.H11MO.0.A"
TF.rank.plot(motif.pvalue = TF.meth.cor, 
             motif = motif,
             save = FALSE)
Retrieving TFClass family classification from ELMER.data.
Retrieving TFClass subfamily classification from ELMER.data.
|====================================================|100%                      Completed after 0 s 
$GATA4_HUMAN.H11MO.0.A

attr(,"split_type")
[1] "array"
attr(,"split_labels")
  X1
1  1

Summarizing several ELMER analysis

MR TF heatmap

When you perform several ELMER analysis with different arguments, you might want to summarize and visualize those several results. For that, we have the ELMER:::get.tabs function , which accepts as input the directories of the results and if you want to select the MR TF within the family of subfamily classification.

In the Data/analysis_april_2019 folder we provided 4 ELMER analysis. The first step is to retrieve the directory names with the results. Also, it is a golden rule when running several ELMER runs to add a understable name to the dir.out directories, such that you easily know which analysis were ran, with which arguments. In the example below, you can identify the tumor type, type of ELMER run (supervised/unsupervised), groups compared and direction of the analysis (hypo or hyper in group 1).

In [18]:
analsysis.dir <- unique(dirname(dir("Data/analysis_april_2019/",recursive = T,full.names = T)))
analsysis.dir
  1. 'Data/analysis_april_2019//COAD-READ/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypo'
  2. 'Data/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Adenocarcinoma.Solid.Tissue.Normal/hypo'
  3. 'Data/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Squamous.cell.carcinoma.Primary.solid.Tumor/hypo'
  4. 'Data/analysis_april_2019//PAAD/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypo'

With those directory names we will use the function ELMER:::get.tabs to summarize the results.

In [19]:
classification <- "subfamily"
suppressWarnings({
    suppressMessages({
        tabs <- ELMER:::get.tabs(analsysis.dir,classification)
    })
})
# A list of matrices are returned
print(class(tabs))

# names of the matrices
print(names(tabs))

# The default it the name directory as column name
tabs$tab[1:3,1:4]

# To be more readable we will change the path names to the analysis name
analysis.names <- c( "COAD-READ (vs Normal)",
                     "EAC (vs Normal)",
                     "EAC (vs ESCC)",
                     "PAAD (vs Normal)")
for (i in 1:3) colnames(tabs[[i]]) <- analysis.names
[1] "list"
[1] "tab"         "tab.pval"    "tab.or"      "tf.or.table"
A data.frame: 3 × 4
Data/analysis_april_2019//COAD-READ/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypoData/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Adenocarcinoma.Solid.Tissue.Normal/hypoData/analysis_april_2019//ESCA/unsupervised/group-Adenocarcinoma.Primary.solid.Tumor_vs_Squamous.cell.carcinoma.Primary.solid.Tumor/hypoData/analysis_april_2019//PAAD/unsupervised/definition-Primary.solid.Tumor_vs_Solid.Tissue.Normal/hypo
<dbl><dbl><dbl><dbl>
ARID3A1000
FOXQ11000
HNF1A1111

This function creates 4 matrices using some of the files in each analysis directory:

  1. a binary matrix saying if the TF was found or not in the analysis: this function parses each getTF.hypo.significant.TFs.with.motif.summary.csv (create after you run the get.TFs function) file that has the MR TF for each enriched motif.
In [20]:
# 1.  a binary matrix saying if the TF was found or not in the analysis
tab.binary <- tabs$tab
head(tab.binary)
A data.frame: 6 × 4
COAD-READ (vs Normal)EAC (vs Normal)EAC (vs ESCC)PAAD (vs Normal)
<dbl><dbl><dbl><dbl>
ARID3A1000
FOXQ11000
HNF1A1111
PDX11000
POU5F1B1000
SOX91001
  1. a matrix with the MR TF p-value found in the analysis: this uses getTF.hypo.significant.TFs.with.motif.summary.csv to identify the MR TF and getTF.hypo.TFs.with.motif.pvalue.rda, which is a matrix with TF as columns and enriched motifs as rows and the contains the p-value of the wilcoxon test of the TF expression for the Unmethylated vs Methylated groups.
In [21]:
# 2. a matrix  with the MR TF p-value found in the analysis
tab.pval <- tabs$tab.pval
head(tab.pval)
A matrix: 6 × 4 of type dbl
COAD-READ (vs Normal)EAC (vs Normal)EAC (vs ESCC)PAAD (vs Normal)
ARID3A8.397450e-08 NA NA NA
FOXQ13.774568e-08 NA NA NA
HNF1A5.845199e-100.0077005481.689635e-091.777426e-07
PDX13.243946e-09 NA NA NA
POU5F1B1.259413e-10 NA NA NA
SOX98.410768e-08 NA NA2.017003e-07
  1. a matrix with the highest motif OR for the MR TF found in the analysis: this function uses the file getTF.hypo.significant.TFs.with.motif.summary.csv to get the MR TF and then uses getMotif.hypo.motif.enrichment.csv to get the motifs OR to which the MR TF is known to bind and returns the highest OR.
In [22]:
# 3. a matrix  with the highest motif OR for the MR TF found in the analysis
tab.or <- tabs$tab.or
head(tab.or)
A matrix: 6 × 4 of type dbl
COAD-READ (vs Normal)EAC (vs Normal)EAC (vs ESCC)PAAD (vs Normal)
ARID3A1.348007 NA NA NA
FOXQ11.362852 NA NA NA
HNF1A1.4719071.5018692.1901331.960243
PDX11.730954 NA NA NA
POU5F1B1.786492 NA NA NA
SOX91.689930 NA NA1.684580
  1. a matrix with 4 collums: MR TF, motif, FDR and analysis it was identified: This function uses the same files as the previous matrices, but returns the results in a different format.
In [23]:
# 4. a matrix  with 4 collums: MR TF, motif, FDR and analysis it was identified.
tf.or.table <- tabs$tf.or.table

# Rename analysis
tf.or.table$analysis <- gsub("ESCA","EAC",paste0(basename(dirname(dirname(dirname(tf.or.table$analysis)))), 
                               " (vs ",
                               ifelse(grepl("Normal",tf.or.table$analysis),"Normal","ESCC"),")"))
head(tf.or.table)
A data.frame: 6 × 4
TFmotifFDRanalysis
<chr><chr><dbl><chr>
7ZNF263 ZIM3_HUMAN.H11MO.0.C 7.120537e-11COAD-READ (vs Normal)
5POU5F1BP5F1B_HUMAN.H11MO.0.D1.259413e-10COAD-READ (vs Normal)
8ZSCAN16ZNF85_HUMAN.H11MO.0.C3.818943e-10COAD-READ (vs Normal)
3HNF1A HNF1B_HUMAN.H11MO.1.A5.845199e-10COAD-READ (vs Normal)
17HNF1A HNF1B_HUMAN.H11MO.1.A1.689635e-09EAC (vs ESCC)
21FOXA3 FOXA1_HUMAN.H11MO.0.A1.806201e-09EAC (vs ESCC)

Creating Summary plots

With those summarized results we can create two summary plots:

  1. A heatmap with the MR TFs identified in each analysis.
  2. A scatter plot of the TF expression vs the avg DNA methylation of the TFBS.

MR TF heatmap

The code below creates the heatmap using complexHeatmap package for the binary MR TF matrix and the p-value one.

In [24]:
library(ComplexHeatmap)
library(vegan)

options(repr.plot.width=14, repr.plot.height=8)

col <- ifelse(classification == "family","top.potential.TF.family", "top.potential.TF.subfamily")

analysis.colors <- c(
  "COAD-READ (vs Normal)" = '#ff964f',
  "EAC (vs Normal)" = '#94e894',
  "EAC (vs ESCC)" = '#eaadb9',
  "PAAD (vs Normal)" = '#9ab9f9'                   
)

# get top TFs in each anaylsis
labels <- ELMER:::get.top.tf.by.pval(tab.pval,top = nrow(tab.pval))

# Annotate each column of the matrix (it should be correspondant to the input matrix)
hb = HeatmapAnnotation(df = data.frame("Mode" = rep("unsupervised",4),
                                       "Analysis" = analysis.names),
                       col = list("Analysis" = analysis.colors,
                                  "Mode" = c("unsupervised" = "#ebd540",
                                             "supervised" = "#718da5")),
                       show_annotation_name = FALSE,
                       show_legend = T,
                       annotation_name_side = "left",
                       annotation_name_gp = gpar(fontsize = 12))

# Binary heatmap
ht_list_binary <- Heatmap(as.matrix(tab.binary),
                   name = "Binary heatmap" ,
                   col = c("1" = "black", "0" = "white"),
                   column_names_gp = gpar(fontsize = 8),
                   show_column_names = FALSE,
                   use_raster = TRUE,
                   na_col = "white",
                   raster_device = c("png"),
                   raster_quality = 2,
                   show_row_names = F,
                   show_heatmap_legend = TRUE,
                   cluster_columns = FALSE,
                   cluster_rows = TRUE,
                   show_column_dend = FALSE,
                   show_row_dend = FALSE,
                   clustering_distance_rows = function(x) {
                     vegan::vegdist(tab.binary,method = "jaccard",binary = TRUE)
                   },
                   clustering_method_rows = "average",
                   top_annotation = hb,
                   width = unit(5, "cm"),
                   row_names_gp = gpar(fontsize = 1),
                   column_title = paste0("Binrary MR TF Heatmap"), 
                   column_title_gp = gpar(fontsize = 10), 
                   row_title_gp = gpar(fontsize = 16)) 

ht_list_binary <-  ht_list_binary + 
  rowAnnotation(link = anno_mark(at = match(labels,rownames(tab.pval)), 
                                 labels = labels,labels_gp = gpar(fontsize = 10)),
                width = unit(1, "cm") + max_text_width(labels)
  ) 
draw(ht_list_binary,
     newpage = FALSE, 
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "left",
     show_heatmap_legend = TRUE,
     annotation_legend_side = "right")

# P-value heatmap
ht_list <- Heatmap(scale(as.matrix(-log10(tab.pval))),
                   name = "column z-score FDR",
                   col = colorRampPalette(c('#3361A5', '#248AF3', '#14B3FF', 
                                            '#88CEEF', '#C1D5DC', '#EAD397', 
                                            '#FDB31A','#E42A2A', '#A31D1D'))(100),
                   column_names_gp = gpar(fontsize = 8),
                   show_column_names = FALSE,
                   use_raster = TRUE,
                   na_col = "white",
                   raster_device = c("png"),
                   raster_quality = 2,
                   show_row_names = F,
                   show_heatmap_legend = TRUE,
                   cluster_columns = FALSE,
                   cluster_rows = FALSE,
                   row_order = row_order(ht_list_binary),
                   top_annotation = hb,
                   width = unit(5, "cm"),
                   row_names_gp = gpar(fontsize = 1),
                   column_title = paste0("MR TF Heatmap"), 
                   column_title_gp = gpar(fontsize = 10), 
                   row_title_gp = gpar(fontsize = 16)) 

ht_list <-  ht_list + 
  rowAnnotation(link = anno_mark(at = match(labels,rownames(tab.pval)), 
                                 labels = labels,labels_gp = gpar(fontsize = 10)),
                width = unit(1, "cm") + max_text_width(labels)
  ) 

draw(ht_list,
     newpage = TRUE,     
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "left",
     show_heatmap_legend = TRUE,
     annotation_legend_side = "right")
Loading required package: grid
========================================
ComplexHeatmap version 2.1.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.
========================================

Loading required package: permute
Loading required package: lattice
This is vegan 2.5-6