The major packages used in this tutorial as listed below. Vignettes with example are available in the links provided.
A MultiAssayExperiment is a R data structure created to store different assays (RNA-Seq, DNA methylation, ATAC-Sec, ...) and samples metadata (age, gender, treatment information, etc.) in one single object. This data structure is used as to create the input for the ELMER analysis, which requires a MultiAssayExperiment containing DNA methylation and gene expression for all samples.
For more information about the MultiAssayExperiment you can check its Bioconductor page: http://bioconductor.org/packages/MultiAssayExperiment/.
We can use the TCGAbiolinks package (http://bioconductor.org/packages/TCGAbiolinks/) to retrieve TCGA DNA methylation and gene expression data from GDC data portal https://portal.gdc.cancer.gov/. TCGAbiolinks searches the data, download it and then transform it into either a matrix or a SummarizedExperiment (http://bioconductor.org/packages/SummarizedExperiment/), which is another Bioconductor data structure to handle a single assay and its samples metadata.
In this tutorial we will download data for 4 TCGA samples and create a MultiAssayExperiment.
suppressMessages({
library(TCGAbiolinks)
library(SummarizedExperiment)
})
# Define samples to be downloaded - to download all samples just remove the barcode argument
samples <- c("TCGA-2H-A9GF","TCGA-2H-A9GG","TCGA-2H-A9GH","TCGA-2H-A9GI")
# DNA methylation data aligned to hg19
query.met <- GDCquery(project = "TCGA-ESCA",
legacy = TRUE,
data.category = "DNA methylation",
barcode = samples, # Filter by barcode
platform = "Illumina Human Methylation 450",
sample.type = "Primary solid Tumor")
GDCdownload(query.met)
methylation <- GDCprepare(query = query.met,
save = TRUE,
save.filename = paste0("TCGA-ESCA_450K_hg19.rda"),
summarizedExperiment = TRUE
)
# Checking object
methylation
# Checking DNA methylation beta-values of the first 5 probes (rows)
as.data.frame(assay(methylation)[1:4,])
# Gene expression
query.exp <- GDCquery(project = "TCGA-ESCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = samples,
legacy = TRUE)
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp)
# Checking object
gene.expression
# Checking expression values of the first 5 genes (rows)
as.data.frame(assay(gene.expression)[1:5,])
To create a MultiAssayExperiment object, ELMER provides the function createMAE
, which requires the DNA methylation data
(either a matrix or a Summarized Experiment object) [function argument met
] and Gene expression data
(either a matrix or a Summarized Experiment object) [function argument exp
].
For DNA methylation we only keep distal probes (at least $2Kbp$ away from TSS) since it tries to
infer for distal interactions regulating genes. ELMER provides the function get.feature.probe
to retrieve those set of probes.
The DNA methylation data from the 450k platform has three types of probes cg (CpG loci), ch (non-CpG loci) and rs (SNP assay).
The last type of probe can be used for sample identification and tracking and should be excluded for differential methylation analysis according to the ilumina manual (https://www.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/genomestudio/genomestudio-2011-1/genomestudio-methylation-v1-8-user-guide-11319130-b.pdf). Along with the filter based on the distance to TSS, probes with the rs prefix
is also removed by get.feature.probe
.
Finally, some probes should also be masked for other reasons presented by @zhou2017comprehensive and listed below (source: https://zwdzwd.github.io/InfiniumAnnotation).
Thus, MASK.general
is used by ELMER to remove the masked probes from the eavluated distal probes.
library(SummarizedExperiment)
library(ELMER)
# 1) retrieve distal probes
genome <- "hg19"
distal.probes <- get.feature.probe(feature = NULL,
genome = genome,
met.platform = "450K")
ranges(distal.probes)
# Number of distal probes
length(distal.probes)
For Gene expression data the row names accepted are Ensembl gene IDs, since our data has gene name as inout we will change the row names.
# 2) ELMER required ENSEMBL ID as rownames for gene expression
# for hg19, gene symbols|entre_id were the default default for RNA-seq
# so we need to rename our rows to the ensembl_gene_id which is
# already in the rowRanges information
rowRanges(gene.expression)
rownames(gene.expression) <- rowRanges(gene.expression)$ensembl_gene_id
Finally, we can use createMAE
setting the genome of reference ("hg19" or "hg38")
used to add DNA methylation probes metadata and gene metadata information.
To perform a correlation between DNA methylation and gene expression it is better to take the log2 of expression data
(executed if the argument linearize.exp
is TRUE).
Also, if samples are from TCGA
, ELMER will
automatically pull samples information from TCGA, otherwise it should be provided by the user
using colData
and sampleMap
arguments (tip: you can check the documentation of a function in R with the following command ?createMAE
)
mae <- createMAE(exp = gene.expression,
met = methylation,
met.platform = "450K",
genome = genome,
linearize.exp = TRUE, # takes log2(expression + 1)
filter.probes = distal.probes,
save = FALSE,
TCGA = TRUE)
You can check the MAE object using some accessor described in this cheatsheet.
mae
# Since it is TCGA ELMER add autmaticalle the samples information
as.data.frame(colData(mae)[1:4,1:5])
# Available information
as.data.frame(colnames(colData(mae)))
The next sections we will provide the code to download the complete TCGA ESCA dataset.
The code below downloaded all TCGA-ESCA data (DNA methylation and Gene Expression) , but due to the long time to download the data it will not be run.
{R}
# 1) get all DNA methylation data aligned annotated with hg38 information
query.met <- GDCquery(project = "TCGA-ESCA",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 450",
sample.type = "Primary solid Tumor")
GDCdownload(query.met)
methylation <- GDCprepare(query = query.met,
save = TRUE,
save.filename = "TCGA-ESCA_450K_hg38.rda"
)
# 1) get all RNA-seq data aligned against hg38
query.exp <- GDCquery(project = "TCGA-ESCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp,
save = TRUE,
save.filename = "TCGA-ESCA_450K_hg38.rda")
# 3) Create ELMER input
# A MultiAssayExperiment with samples
# that has both RNA-seq and 450K DNA methylation (distal probes)
genome <- "hg38"
distal.probes <- get.feature.probe(feature = NULL,
genome = genome,
met.platform = "450K")
mae <- createMAE(exp = gene.expression,
met = methylation,
met.platform = "450K",
genome = genome,
linearize.exp = TRUE,
filter.probes = distal.probes,
save = TRUE,
save.filename = "TCGA_ESCA_MAE_distal_regions.rda",
TCGA = TRUE)
sessionInfo()
We have a set of recorded videos, explaining some of the workshops.