An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .
Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.
In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.
For more information about the data please visit GDC publication website and read the paper:
A recorded video explaining this workshop is available at: https://youtu.be/3ftZecz0lU4. Also, other workshop videos are available: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt.
Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.
The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.
# to read txt files
library(readr)
# to transform data into GenomicRanges
library(GenomicRanges)
# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)
# For the t.test loop
library(plyr)
# For easy volcano plot
library(TCGAbiolinks)
# For heatmap plot
library(ComplexHeatmap)
library(circlize)
# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.
The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. It is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).
There are mainly two types of ATAC-Seq Counts Matrices raw and normalized which covers mainly two sets of peaks:
If we check the both sets (Files downloaded from GDC: "All cancer type-specific peak sets. [ZIP]" and "Pan-cancer peak set. [TXT]"), the set of peaks "pan-cancer peak set" consists of $~562K$ peaks, and it contains a subset of each "cancer type-specific peak set". We show an example for Esophageal carcinoma (ESCA) below.
# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt", col_types = readr::cols())
head(atac_esca)
# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt", col_types = readr::cols())
head(atac_pan)
# from the pancan set how many belongs to each cancer type?
table(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])
message("How many of the ESCA peaks are the strongest in the pancan")
plyr::count(atac_esca$name %in% atac_pan$name)
plyr::count(grep("ESCA",atac_pan$name,value = T) %in% atac_esca$name)
However, it is important to highlight that the "pan-cancer peak set" will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the "pan-cancer peak set" consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the "ACC_10008" peak, which has a higher score.
"ESCA_17603" %in% atac_pan.gr$name
subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])
Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp.
unique(width(atac_pan.gr))
unique(width(atac_esca.gr))
In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.
For each set of peaks previously identified, a count matrix was produced. As discribed on the supplemental section "ATAC-seq data analysis – Constructing a counts matrix and normalization":
To obtain the number of independent Tn5 insertions in each peak, first the BAM files were corrected for the Tn5 offset (“+” stranded +4 bp, “-” stranded -5 bp) (16) into a Genomic Ranges object in R using Rsamtools “scanbam”. To get the number of Tn5 insertions per peak, each corrected insertion site (end of a fragment) was counted using “countOverlaps”. This was done for all individual technical replicates and a 562,709 x 796 counts matrix was compiled. From this, a RangedSummarizedExperiment was constructed including peaks as GenomicRanges, a counts matrix, and metadata detailing information for each sample. The counts matrix was then normalized by using edgeR’s “cpm(matrix , log = TRUE, prior.count = 5)” followed by a quantile normalization using preprocessCore’s “normalize.quantiles” in R.
Through the next section we will load the normalized counts data for ESCA and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.
The main file used in this section is included in the following folder.
In the code below, we are showing the beginning of the objects. It is important to highlight
that the samples are using Stanford UUID
instead of TCGA barcodes
and each patient
normally has two replicates.
atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt", col_types = readr::cols())
atac_esca_norm_ct[1:4,1:8]
We will change the samples names to TCGA barcodes using the file "Lookup table for various TCGA sample identifiers. [TXT]" from GDC. The file path can be found at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG by copying the URL link to the file, readr::read_tsv downloads and reads the table automatically.
#link from gdc.cancer.gov/about-data/publication/ATACseq-AWG, “Lookup table”
gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
head(samples.ids)
Now we will match our Standford UUIDs to TCGA barcodes which might be more meaningful https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/ For that we get the columns names removing non-sample names (seqnames, start, end, name, score) and match them to the bam_prefix column. With the matched index (i.e 1st UUID is the 10th row from samples.ids), we will use it to get the TCGA barcode (Case_ID).
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
Now that we have our matrix, we will create a SummarizedExperiment from it. Which will contain 3 matrices, one with the ATAC-seq values, the metadata of the peaks and the samples metadata.
atac <- atac_esca_norm_ct
non.cts.idx <- 1:5
# We will get the samples metadata from GDC using the TCGAbiolinks packages
# we simply need to give the TCGA barcode, and all the metadata available from that sample will be pull out.
# you can check the first sample metadata at GDC
# https://portal.gdc.cancer.gov/cases/f8dbab24-b9f4-4b8a-bfea-57856ccf6364?bioId=3889c9fe-3777-4b3d-9cf1-12135d4d7f7d
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))
# We will raname Squamous cell carcinoma to ESCC and Adenocarcinoma to ESAD (removing other informations)
head(samples.info)[,c("sample","primary_diagnosis")]
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
Now we have all the data required to create our SummarizedExperiment (SE) object.
# Matrix 1: data
counts <- atac[,-c(1:5)]
head(counts)
# Matrix 2: genomic information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
rowRanges
# Matrix 3: Samples metadata
# create key for merging
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colData <- unique(left_join(samples.info,samples.ids))
head(colData)
esca.rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
rowRanges = rowRanges,
colData = DataFrame(colData))
esca.rse
Since we have two samples for each patient we will rename tham as rep1 and rep2
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
colnames(esca.rse)
A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.
escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")
# We will use 2 cores to run the code
library(doParallel)
registerDoParallel(2)
# Time expected ~ 3 min
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
return(tibble::tibble("raw_p_value"= results$p.value,
"ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak", .parallel = TRUE)
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")
It is possible to visualize the results of a t.test
as a volcano plot,
which can be used to better select a cut-off for the significant ATAC-Seq peaks.
In this example we will use the $FDR < 0.01$ and $\Delta Log_2 Counts > 2$.
We will be using TCGAbiolinks
package to plot it, but you can do it using ggplot2
or plotly
for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.
fdr.cut.off <- 0.01
diff.cut.off <- 2
TCGAbiolinks:::TCGAVisualize_volcano(x = result$ESCC_minus_ESAD,
y = result$FDR,
title = paste0("Volcano plot - ATAC-seq peaks ",
"difference in ",
"ESCC vs ESAD\n"),
filename = NULL,
label = c("Not Significant",
paste0("High in ESCC (vs ESAD)"),
paste0("Low in ESCC (vs ESAD)")),
ylab = expression(paste(-Log[10],
" (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
)),
xlab = expression(paste(
"Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
)),
x.cut = diff.cut.off,
y.cut = fdr.cut.off)
# How many peaks pass our cut-offs
message(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off))
First, we will load the libraries used to plot the heatmap.
library(ComplexHeatmap)
library(circlize)
# Colors of the heatmap
pal_atac <- colorRampPalette(c('#3361A5',
'#248AF3',
'#14B3FF',
'#88CEEF',
'#C1D5DC',
'#EAD397',
'#FDB31A',
'#E42A2A',
'#A31D1D'))(100)
# Upper track with the samples annotation
ha = HeatmapAnnotation(df = data.frame("Group" = esca.rse$primary_diagnosis,
"Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")),
show_annotation_name = T,
col = list(Group = c("Squamous cell carcinoma, NOS" = "red",
"Adenocarcinoma, NOS" = "blue")),
show_legend = T,
annotation_name_side = "left",
annotation_name_gp = gpar(fontsize = 6))
# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]
# Define the color scale
col <- colorRamp2(seq(min(plot.atac), max(plot.atac),
by = (max(plot.atac) - min(plot.atac))/99), pal_atac)
# Show the names of the peaks 1 and 18
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))
# Plot the ATAC-Seq signals
ht_list <-
Heatmap(plot.atac,
name = "ATAC-seq log2(counts)",
col = col,
column_names_gp = gpar(fontsize = 8),
show_column_names = F,
heatmap_legend_param = list(legend_direction = "horizontal",
labels_gp = gpar(fontsize = 12),
title_gp = gpar(fontsize = 12)),
show_row_names = FALSE,
cluster_columns = TRUE,
use_raster = TRUE,
raster_device = c("png"),
raster_quality = 2,
cluster_rows = T,
right_annotation = rows.annot,
row_title = paste0(sum(result$FDR < fdr.cut.off &
abs(result$ESCC_minus_ESAD) > diff.cut.off),
" ATAC-seq peaks"),
row_names_gp = gpar(fontsize = 4),
top_annotation = ha,
column_title_gp = gpar(fontsize = 12),
row_title_gp = gpar(fontsize = 12))
options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE,
column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
", Diff mean log2 Count > ",diff.cut.off,")"),
column_title_gp = gpar(fontsize = 12, fontface = "bold"),
heatmap_legend_side = "bottom",
annotation_legend_side = "right")