Introduction

An Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), initially described by Buenrostro et al. (2013), captures open chromatin sites and is able to reveal the interplay between genomic locations of open chromatin, DNA-binding proteins, individual nucleosomes and chromatin compaction at nucleotide resolution .

Later, Corces et al. (2017) described Omni-ATAC, an improved ATAC-seq protocol for chromatin accessibility able to generates chromatin accessibility profiles from archival frozen tissue samples, which was recently used to investigate the genome-wide chromatin accessibility profiles of 410 tumor samples spanning 23 cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018). The integration of this data with other TCGA multi-omic data was able to provide numerous discoveries which helped to understand the noncoding genome in cancer to advance diagnosis and therapy.

Workshop description

In this workshop, we present in more detail this TCGA ATAC-seq data, which is available to the public through the Genomic Data Commons Portal (https://gdc.cancer.gov/about-data/publications/ATACseq-AWG), and demonstrate how this data can be analyzed within the R/Bioconductor environment.

For more information about the data please visit GDC publication website and read the paper:

Workshop video

A recorded video explaining this workshop is available at: https://youtu.be/3ftZecz0lU4. Also, other workshop videos are available: https://www.youtube.com/playlist?list=PLoDzAKMJh15kNpCSIxpSuZgksZbJNfmMt.

Pre-requisites

  • Basic knowledge of R syntax

Workshop Participation

Students will have a chance to download ATAC-Seq cancer-specific peaks from GDC and import to R. After, esophageal adenocarcinoma (ESAD) vs esophageal squamous cell carcinoma (ESCC) analysis is performed and the results are visualized as a volcano plot and a heatmap.

Goals and objectives

  • Download and understand the ATAC-Seq data
  • Compare two different groups of samples ATAC-Seq data

Enviroment: R libraries

The code below will load all the required R libraries to perform the workshop. Their version is available at the session information section.

In [1]:
# to read txt files
library(readr)

# to transform data into GenomicRanges
library(GenomicRanges)

# other ones used to prepare the data
library(tidyr)
library(dplyr)
library(SummarizedExperiment)

# For the t.test loop
library(plyr)

# For easy volcano plot
library(TCGAbiolinks)

# For heatmap plot
library(ComplexHeatmap)
library(circlize)

# For the bigwig plot
library(karyoploteR)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
Loading required package: stats4
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: ‘BiocGenerics’

The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs

The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, basename, cbind, colMeans,
    colnames, colSums, dirname, do.call, duplicated, eval, evalq,
    Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply,
    lengths, Map, mapply, match, mget, order, paste, pmax, pmax.int,
    pmin, pmin.int, Position, rank, rbind, Reduce, rowMeans, rownames,
    rowSums, sapply, setdiff, sort, table, tapply, union, unique,
    unsplit, which, which.max, which.min

Loading required package: S4Vectors

Attaching package: ‘S4Vectors’

The following object is masked from ‘package:base’:

    expand.grid

Loading required package: IRanges
Loading required package: GenomeInfoDb

Attaching package: ‘tidyr’

The following object is masked from ‘package:S4Vectors’:

    expand


Attaching package: ‘dplyr’

The following objects are masked from ‘package:GenomicRanges’:

    intersect, setdiff, union

The following object is masked from ‘package:GenomeInfoDb’:

    intersect

The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, setdiff, slice, union

The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: Biobase
Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

Loading required package: DelayedArray
Loading required package: matrixStats

Attaching package: ‘matrixStats’

The following objects are masked from ‘package:Biobase’:

    anyMissing, rowMedians

The following object is masked from ‘package:dplyr’:

    count

Loading required package: BiocParallel

Attaching package: ‘DelayedArray’

The following objects are masked from ‘package:matrixStats’:

    colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges

The following objects are masked from ‘package:base’:

    aperm, apply

------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
------------------------------------------------------------------------------

Attaching package: ‘plyr’

The following object is masked from ‘package:matrixStats’:

    count

The following objects are masked from ‘package:dplyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from ‘package:IRanges’:

    desc

The following object is masked from ‘package:S4Vectors’:

    rename

Loading required package: grid
========================================
ComplexHeatmap version 2.1.0
Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/
Github page: https://github.com/jokergoo/ComplexHeatmap
Documentation: http://jokergoo.github.io/ComplexHeatmap-reference

If you use it in published research, please cite:
Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional 
  genomic data. Bioinformatics 2016.
========================================

========================================
circlize version 0.4.8
CRAN page: https://cran.r-project.org/package=circlize
Github page: https://github.com/jokergoo/circlize
Documentation: http://jokergoo.github.io/circlize_book/book/

If you use it in published research, please cite:
Gu, Z. circlize implements and enhances circular visualization 
  in R. Bioinformatics 2014.
========================================

Loading required package: regioneR
Loading required package: GenomicFeatures
Loading required package: AnnotationDbi

Attaching package: 'AnnotationDbi'

The following object is masked from 'package:dplyr':

    select

Data

The files used in this workshop are available at at google drive which contains some of the TCGA ATAC-seq data from https://gdc.cancer.gov/about-data/publications/ATACseq-AWG.

Understanding the data: peaks sets

The ATAC-Seq data used in this workshop is available at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG. It is important to highlight, that this data has been aligned against Homo sapiens (human) genome assembly GRCh38 (hg38).

There are mainly two types of ATAC-Seq Counts Matrices raw and normalized which covers mainly two sets of peaks:

  1. "cancer type-specific peak set" containing all of the reproducible peaks observed in an individual cancer type. These peaks were observed in at least two samples with a score per million value $>=5$
  2. "pan-cancer peak set" representing reproducible peaks from all cancer types that could then be used for cross-cancer comparisons

Comparing pan-cancer peak set and cancer type-specific peak set

If we check the both sets (Files downloaded from GDC: "All cancer type-specific peak sets. [ZIP]" and "Pan-cancer peak set. [TXT]"), the set of peaks "pan-cancer peak set" consists of $~562K$ peaks, and it contains a subset of each "cancer type-specific peak set". We show an example for Esophageal carcinoma (ESCA) below.

In [2]:
# ESCA specific peaks set
atac_esca <- readr::read_tsv("Data/ESCA_peakCalls.txt", col_types = readr::cols())
head(atac_esca)
A tibble: 6 × 8
seqnamesstartendnamescoreannotationpercentGCpercentAT
<chr><dbl><dbl><chr><dbl><chr><dbl><dbl>
chr112900951290596ESCA_107 2.4643783' UTR0.67664670.3233533
chr112911151291616ESCA_108 2.5879293' UTR0.70259480.2974052
chr112917531292254ESCA_109 7.5799623' UTR0.63872260.3612774
chr114408241441325ESCA_160 4.4672743' UTR0.65868260.3413174
chr116301881630689ESCA_17920.6213243' UTR0.72854290.2714571
chr120302182030719ESCA_34115.5725813' UTR0.61876250.3812375
In [3]:
# pan-cancer peak set
atac_pan <- readr::read_tsv("Data/TCGA-ATAC_PanCancer_PeakSet.txt", col_types = readr::cols())
head(atac_pan)

# from the  pancan set how many belongs to each cancer type?
table(stringr::str_split(atac_pan$name,"_",simplify = T)[,1])

message("How many of the ESCA peaks are the strongest in the pancan")
plyr::count(atac_esca$name %in% atac_pan$name)

plyr::count(grep("ESCA",atac_pan$name,value = T) %in% atac_esca$name)
A tibble: 6 × 7
seqnamesstartendnamescoreannotationpercentGC
<chr><dbl><dbl><chr><dbl><chr><dbl>
chr1 906012 906513ACC_10 7.171193Intron 0.6127745
chr2112541661112542162ACC_1000822.030579Promoter0.5548902
chr1 21673421 21673922ACC_1001 6.459954Distal 0.5089820
chr2112584205112584706ACC_1001343.208555Promoter0.5868263
chr2112596243112596744ACC_10016 5.428209Intron 0.4910180
chr1 21725692 21726193ACC_1002 5.201273Intron 0.4051896
  ACC  BLCA  BRCA  CESC  CHOL  COAD  ESCA   GBM  HNSC  KIRC  KIRP   LGG  LIHC 
29311 25337 49748 14358 11819 25404 13237 15394 16651 15067 24324 23836 35787 
 LUAD  LUSC  MESO  PCPG  PRAD  SKCM  STAD  TGCT  THCA  UCEC 
23729 22195 22958 31372 30067 36591 17358 26120 31568 20478 
How many of the ESCA peaks are the strongest in the pancan
A data.frame: 2 × 2
xfreq
<lgl><int>
FALSE113818
TRUE 13237
A data.frame: 1 × 2
xfreq
<lgl><int>
TRUE13237

However, it is important to highlight that the "pan-cancer peak set" will keep the most significant peaks (highest score) for the overlapping peaks. In other words, the name in the "pan-cancer peak set" consists of the one cancer-specific one with highest score. If we check the regions overlap of the ESCA peaks, we can see that the majority of the peaks are still within the PAN-can, but they are higher in another cancer type.

In [4]:
atac_esca.gr <- makeGRangesFromDataFrame(atac_esca,keep.extra.columns = T)
atac_pan.gr <- makeGRangesFromDataFrame(atac_pan,keep.extra.columns = T)
length(subsetByOverlaps(atac_esca.gr,atac_pan.gr))
126935

Checking an overlapping peak

So, we will check an overlaping peak. The named ESCA_17603" peak is not within the PanCAN set of peaks, because it overlaps the "ACC_10008" peak, which has a higher score.

In [5]:
"ESCA_17603" %in% atac_pan.gr$name

subsetByOverlaps(atac_pan.gr[atac_pan.gr$name == "ACC_10008"],atac_esca.gr)
subsetByOverlaps(atac_esca.gr,atac_pan.gr[atac_pan.gr$name == "ACC_10008"])
FALSE
GRanges object with 1 range and 4 metadata columns:
      seqnames              ranges strand |        name       score  annotation
         <Rle>           <IRanges>  <Rle> | <character>   <numeric> <character>
  [1]     chr2 112541661-112542162      * |   ACC_10008 22.03057903    Promoter
       percentGC
       <numeric>
  [1] 0.55489022
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths
GRanges object with 1 range and 5 metadata columns:
      seqnames              ranges strand |        name            score
         <Rle>           <IRanges>  <Rle> | <character>        <numeric>
  [1]     chr2 112541649-112542150      * |  ESCA_17603 6.25798951741993
       annotation        percentGC        percentAT
      <character>        <numeric>        <numeric>
  [1]    Promoter 0.55688622754491 0.44311377245509
  -------
  seqinfo: 24 sequences from an unspecified genome; no seqlengths

Peaks size

Also, it is important to note that the peaks size is the same. Each peak has a size of 502bp.

In [6]:
unique(width(atac_pan.gr))
unique(width(atac_esca.gr))
502
502

Summary

In summary, in the pan-can set the ESCA named peaks will be the ones that have the strongest signal on the ESCA samples when compared to the other cancer types, which will be a subset of all ESCA peaks. So, if you are looking for all ATAC-Seq ESCA peaks identified in at least two samples the cancer-specific set should be used.

Analysis: Using ATAC-Seq counts to compare two groups

For each set of peaks previously identified, a count matrix was produced. As discribed on the supplemental section "ATAC-seq data analysis – Constructing a counts matrix and normalization":

To obtain the number of independent Tn5 insertions in each peak, first the BAM files were corrected for the Tn5 offset (“+” stranded +4 bp, “-” stranded -5 bp) (16) into a Genomic Ranges object in R using Rsamtools “scanbam”. To get the number of Tn5 insertions per peak, each corrected insertion site (end of a fragment) was counted using “countOverlaps”. This was done for all individual technical replicates and a 562,709 x 796 counts matrix was compiled. From this, a RangedSummarizedExperiment was constructed including peaks as GenomicRanges, a counts matrix, and metadata detailing information for each sample. The counts matrix was then normalized by using edgeR’s “cpm(matrix , log = TRUE, prior.count = 5)” followed by a quantile normalization using preprocessCore’s “normalize.quantiles” in R.

Through the next section we will load the normalized counts data for ESCA and compare two groups of samples, to identify which peaks are stronger in a given group compared to the other one.

The main file used in this section is included in the following folder.

  1. All cancer type-specific count matrices in normalized counts. [ZIP]

In the code below, we are showing the beginning of the objects. It is important to highlight that the samples are using Stanford UUID instead of TCGA barcodes and each patient normally has two replicates.

In [7]:
atac_esca_norm_ct <- readr::read_tsv("Data/ESCA_log2norm.txt", col_types = readr::cols())
atac_esca_norm_ct[1:4,1:8]
A tibble: 4 × 8
seqnamesstartendnamescoreESCA_1E6AC686_96C2_46C4_A451_12D30114DD06_X012_S02_L027_B1_T1_P024ESCA_1E6AC686_96C2_46C4_A451_12D30114DD06_X012_S02_L028_B1_T2_P026ESCA_1FEF5E19_2351_4D46_99B4_63F0637ABF17_X010_S08_L063_B1_T1_P016
<chr><dbl><dbl><chr><dbl><dbl><dbl><dbl>
chr1 10180 10679ESCA_12.0842201.4408351.220656 1.9305615
chr1180641181140ESCA_22.1668902.5453052.229090 2.2083821
chr1181193181692ESCA_36.6338012.0967891.978599 2.0173702
chr1184246184745ESCA_42.5879291.4408351.158167-0.2450712

We will change the samples names to TCGA barcodes using the file "Lookup table for various TCGA sample identifiers. [TXT]" from GDC. The file path can be found at https://gdc.cancer.gov/about-data/publications/ATACseq-AWG by copying the URL link to the file, readr::read_tsv downloads and reads the table automatically.

In [8]:
#link from gdc.cancer.gov/about-data/publication/ATACseq-AWG, “Lookup table”
gdc.file <- "https://api.gdc.cancer.gov/data/7a3d7067-09d6-4acf-82c8-a1a81febf72c"
samples.ids <- readr::read_tsv(gdc.file, col_types = readr::cols())
head(samples.ids)
A tibble: 6 × 5
bam_prefixstanfordUUIDaliquot_idCase_UUIDCase_ID
<chr><chr><chr><chr><chr>
BRCA-000CFD9F-ADDF-4304-9E60-6041549E189C-X017-S06-L011-B1-T1-P040000CFD9F-ADDF-4304-9E60-6041549E189CTCGA-A7-A13F-01A-31-A615-422cf68894-168b-458b-af4f-53cad72989a8TCGA-A7-A13F-01A-31-A615-42
BRCA-000CFD9F-ADDF-4304-9E60-6041549E189C-X017-S06-L012-B1-T2-P046000CFD9F-ADDF-4304-9E60-6041549E189CTCGA-A7-A13F-01A-31-A615-422cf68894-168b-458b-af4f-53cad72989a8TCGA-A7-A13F-01A-31-A615-42
PCPG-007124EC-1F9B-4FCB-BC6E-DB8C25FD9146-X033-S03-L098-B1-T1-P073007124EC-1F9B-4FCB-BC6E-DB8C25FD9146TCGA-RM-A68W-01A-31-A644-421a1cf490-8bd4-4a99-bf3a-34f06435de86TCGA-RM-A68W-01A-31-A644-42
PCPG-007124EC-1F9B-4FCB-BC6E-DB8C25FD9146-X033-S03-L100-B1-T2-P077007124EC-1F9B-4FCB-BC6E-DB8C25FD9146TCGA-RM-A68W-01A-31-A644-421a1cf490-8bd4-4a99-bf3a-34f06435de86TCGA-RM-A68W-01A-31-A644-42
STAD-00DFAA4D-DE64-4476-9546-18E728653046-X029-S06-L011-B1-T1-P07200DFAA4D-DE64-4476-9546-18E728653046TCGA-BR-A4J1-01A-31-A646-42e9a98a44-83f2-490c-b053-1e953ebd4e7eTCGA-BR-A4J1-01A-31-A646-42
STAD-00DFAA4D-DE64-4476-9546-18E728653046-X029-S06-L012-B1-T2-P07700DFAA4D-DE64-4476-9546-18E728653046TCGA-BR-A4J1-01A-31-A646-42e9a98a44-83f2-490c-b053-1e953ebd4e7eTCGA-BR-A4J1-01A-31-A646-42

Now we will match our Standford UUIDs to TCGA barcodes which might be more meaningful https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/ For that we get the columns names removing non-sample names (seqnames, start, end, name, score) and match them to the bam_prefix column. With the matched index (i.e 1st UUID is the 10th row from samples.ids), we will use it to get the TCGA barcode (Case_ID).

In [9]:
colnames(atac_esca_norm_ct)[-c(1:5)] <- samples.ids$Case_ID[match(gsub("_","-",colnames(atac_esca_norm_ct)[-c(1:5)]),samples.ids$bam_prefix)]
atac_esca_norm_ct[1:4,1:8]
A tibble: 4 × 8
seqnamesstartendnamescoreTCGA-IG-A51D-01A-31-A616-42TCGA-IG-A51D-01A-31-A616-42TCGA-LN-A9FQ-01A-11-A617-42
<chr><dbl><dbl><chr><dbl><dbl><dbl><dbl>
chr1 10180 10679ESCA_12.0842201.4408351.220656 1.9305615
chr1180641181140ESCA_22.1668902.5453052.229090 2.2083821
chr1181193181692ESCA_36.6338012.0967891.978599 2.0173702
chr1184246184745ESCA_42.5879291.4408351.158167-0.2450712

Now that we have our matrix, we will create a SummarizedExperiment from it. Which will contain 3 matrices, one with the ATAC-seq values, the metadata of the peaks and the samples metadata.

In [10]:
atac <- atac_esca_norm_ct
non.cts.idx <- 1:5

# We will get the samples metadata from GDC using the TCGAbiolinks packages
# we simply need to give the TCGA barcode, and all the metadata available from that sample will be pull out.
# you can check the first sample metadata at GDC
# https://portal.gdc.cancer.gov/cases/f8dbab24-b9f4-4b8a-bfea-57856ccf6364?bioId=3889c9fe-3777-4b3d-9cf1-12135d4d7f7d
samples.info <- TCGAbiolinks:::colDataPrepare(unique(colnames(atac)[-c(non.cts.idx)]))

# We will raname Squamous cell carcinoma to ESCC and Adenocarcinoma to ESAD (removing other informations)
head(samples.info)[,c("sample","primary_diagnosis")]
samples.map <- gsub(",| |NOS","",gsub("Adenocarcinoma","ESAD",gsub("Squamous cell carcinoma","ESCC",paste0(samples.info$primary_diagnosis,"-",samples.info$sample))))
colnames(atac)[-c(non.cts.idx)] <- samples.map[match(substr(colnames(atac)[-c(non.cts.idx)],1,16),substr(samples.map,6,21))]
Starting to add information to samples
 => Add clinical information to samples
Add FFPE information. More information at: 
=> https://cancergenome.nih.gov/cancersselected/biospeccriteria 
=> http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
 => Adding subtype information to samples
esca subtype information from:doi:10.1038/nature20805
A data.frame: 6 × 2
sampleprimary_diagnosis
<chr><chr>
TCGA-IG-A51D-01A-31-A616-42TCGA-IG-A51D-01ASquamous cell carcinoma, NOS
TCGA-LN-A9FQ-01A-11-A617-42TCGA-LN-A9FQ-01ASquamous cell carcinoma, NOS
TCGA-L7-A6VZ-01A-31-A616-42TCGA-L7-A6VZ-01AAdenocarcinoma, NOS
TCGA-L5-A4OE-01A-31-A616-42TCGA-L5-A4OE-01AAdenocarcinoma, NOS
TCGA-LN-A49O-01A-31-A616-42TCGA-LN-A49O-01ASquamous cell carcinoma, NOS
TCGA-IG-A7DP-01A-21-A616-42TCGA-IG-A7DP-01AAdenocarcinoma, NOS

Now we have all the data required to create our SummarizedExperiment (SE) object.

In [11]:
# Matrix 1: data
counts <- atac[,-c(1:5)]
head(counts)
A tibble: 6 × 33
ESCC-TCGA-IG-A51D-01AESCC-TCGA-IG-A51D-01AESCC-TCGA-LN-A9FQ-01AESCC-TCGA-LN-A9FQ-01AESAD-TCGA-L7-A6VZ-01AESAD-TCGA-L7-A6VZ-01AESAD-TCGA-L5-A4OE-01AESAD-TCGA-L5-A4OE-01AESCC-TCGA-LN-A49O-01AESCC-TCGA-LN-A49O-01AESCC-TCGA-LN-A4A2-01AESCC-TCGA-LN-A4MR-01AESCC-TCGA-IG-A625-01AESCC-TCGA-IG-A625-01AESCC-TCGA-LN-A49W-01AESCC-TCGA-LN-A49W-01AESAD-TCGA-IC-A6RE-01AESAD-TCGA-IC-A6RE-01AESAD-TCGA-M9-A5M8-01AESAD-TCGA-M9-A5M8-01A
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1.4408351.220656 1.93056151.38565151.4197561.28653141.3176991.42115940.92281439 0.91435832.7565289 1.04519211.8370442.13183911.77438531.7021271.9919842.1346861.4998361.7181801
2.5453052.229090 2.20838212.69951971.0968470.80682712.0698281.84670542.51514795 1.39092411.6890171 1.09684712.2822682.34668922.16954422.2144471.9189991.8104740.9363991.6480467
2.0967891.978599 2.01737022.22316982.0460831.78162992.9623721.84670542.30934956 2.28570472.4644978 0.42742801.7135142.31098112.73991642.9155563.6614313.7386163.7381824.0028779
1.4408351.158167-0.24507120.83706652.6699342.73655541.0968470.90800520.02756702-0.18400650.9603146-0.82325811.1170820.58494640.56138771.0473001.8349931.5711711.8447190.4924719
1.3371331.540553 3.13140193.12642941.9940561.62422831.3176990.75801451.78351492 1.11043501.6218309 1.75880142.2521582.17297701.70212701.4197562.2330442.1124901.0413060.9972153
5.4768005.351221 4.00709494.04230475.0299884.98635205.0686124.92909614.43081814 4.71926575.7945555 4.54633224.7937644.73467415.13849995.0228525.4407205.4386865.2965035.0908551
In [12]:
# Matrix 2: genomic information
rowRanges <- makeGRangesFromDataFrame(atac)
rowRanges$score <- atac$score
rowRanges$name <- atac$name
names(rowRanges) <- paste(atac$name,atac$seqnames,atac$start,atac$end, sep = "_")
rowRanges
GRanges object with 126935 ranges and 2 metadata columns:
                                       seqnames              ranges strand |
                                          <Rle>           <IRanges>  <Rle> |
               ESCA_1_chr1_10180_10679     chr1         10180-10679      * |
             ESCA_2_chr1_180641_181140     chr1       180641-181140      * |
             ESCA_3_chr1_181193_181692     chr1       181193-181692      * |
             ESCA_4_chr1_184246_184745     chr1       184246-184745      * |
             ESCA_5_chr1_267755_268254     chr1       267755-268254      * |
                                   ...      ...                 ...    ... .
  ESCA_126940_chrX_155880541_155881040     chrX 155880541-155881040      * |
  ESCA_126941_chrX_155881050_155881549     chrX 155881050-155881549      * |
  ESCA_126942_chrX_155963006_155963505     chrX 155963006-155963505      * |
  ESCA_126943_chrX_155985136_155985635     chrX 155985136-155985635      * |
  ESCA_126944_chrX_156030008_156030507     chrX 156030008-156030507      * |
                                                  score        name
                                              <numeric> <character>
               ESCA_1_chr1_10180_10679 2.08422038177439      ESCA_1
             ESCA_2_chr1_180641_181140 2.16689026248904      ESCA_2
             ESCA_3_chr1_181193_181692  6.6338010584359      ESCA_3
             ESCA_4_chr1_184246_184745 2.58792851900195      ESCA_4
             ESCA_5_chr1_267755_268254 4.97049203476181      ESCA_5
                                   ...              ...         ...
  ESCA_126940_chrX_155880541_155881040 2.01910337344523 ESCA_126940
  ESCA_126941_chrX_155881050_155881549 15.5822634683249 ESCA_126941
  ESCA_126942_chrX_155963006_155963505 4.76520941542765 ESCA_126942
  ESCA_126943_chrX_155985136_155985635 4.08426076220543 ESCA_126943
  ESCA_126944_chrX_156030008_156030507 2.29289618413069 ESCA_126944
  -------
  seqinfo: 23 sequences from an unspecified genome; no seqlengths
In [13]:
# Matrix 3: Samples metadata
# create key for merging
samples.ids$sample <- substr(samples.ids$Case_ID,1,16)
colData <- unique(left_join(samples.info,samples.ids))
head(colData)
Joining, by = "sample"
A data.frame: 6 × 149
samplepatientbarcodeshortLetterCodedefinitionyear_of_diagnosisclassification_of_tumorlast_known_disease_statusupdated_datetime.xprimary_diagnosissubtype_GEA.CIN.Integrated.Cluster...COCAsubtype_GEA.CIN.Integrated.Cluster...iClustersubtype_GEA.CIN.Integrated.Cluster...SuperClustersubtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7bam_prefixstanfordUUIDaliquot_idCase_UUIDCase_ID
<chr><chr><chr><chr><chr><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
TCGA-IG-A51D-01ATCGA-IG-A51DTCGA-IG-A51D-01A-31-A616-42TPPrimary solid Tumor2012not reportednot reported2019-08-08T16:38:14.344513-05:00Squamous cell carcinoma, NOSNANANANANAESCA-1E6AC686-96C2-46C4-A451-12D30114DD06-X012-S02-L027-B1-T1-P0241E6AC686-96C2-46C4-A451-12D30114DD06TCGA-IG-A51D-01A-31-A616-42f8dbab24-b9f4-4b8a-bfea-57856ccf6364TCGA-IG-A51D-01A-31-A616-42
TCGA-IG-A51D-01ATCGA-IG-A51DTCGA-IG-A51D-01A-31-A616-42TPPrimary solid Tumor2012not reportednot reported2019-08-08T16:38:14.344513-05:00Squamous cell carcinoma, NOSNANANANANAESCA-1E6AC686-96C2-46C4-A451-12D30114DD06-X012-S02-L028-B1-T2-P0261E6AC686-96C2-46C4-A451-12D30114DD06TCGA-IG-A51D-01A-31-A616-42f8dbab24-b9f4-4b8a-bfea-57856ccf6364TCGA-IG-A51D-01A-31-A616-42
TCGA-LN-A9FQ-01ATCGA-LN-A9FQTCGA-LN-A9FQ-01A-11-A617-42TPPrimary solid Tumor2013not reportednot reported2019-08-08T16:39:06.938006-05:00Squamous cell carcinoma, NOSNANANANANAESCA-1FEF5E19-2351-4D46-99B4-63F0637ABF17-X010-S08-L063-B1-T1-P0161FEF5E19-2351-4D46-99B4-63F0637ABF17TCGA-LN-A9FQ-01A-11-A617-4237d13493-975c-432c-bd21-65f383fc66c9TCGA-LN-A9FQ-01A-11-A617-42
TCGA-LN-A9FQ-01ATCGA-LN-A9FQTCGA-LN-A9FQ-01A-11-A617-42TPPrimary solid Tumor2013not reportednot reported2019-08-08T16:39:06.938006-05:00Squamous cell carcinoma, NOSNANANANANAESCA-1FEF5E19-2351-4D46-99B4-63F0637ABF17-X010-S08-L064-B1-T2-P0161FEF5E19-2351-4D46-99B4-63F0637ABF17TCGA-LN-A9FQ-01A-11-A617-4237d13493-975c-432c-bd21-65f383fc66c9TCGA-LN-A9FQ-01A-11-A617-42
TCGA-L7-A6VZ-01ATCGA-L7-A6VZTCGA-L7-A6VZ-01A-31-A616-42TPPrimary solid Tumor2013not reportednot reported2019-08-08T16:38:43.222690-05:00Adenocarcinoma, NOS C1C1C2C4C7ESCA-22102BD7-4D9F-4914-84D0-C64E2CB0D8C1-X018-S02-L027-B1-T1-P04022102BD7-4D9F-4914-84D0-C64E2CB0D8C1TCGA-L7-A6VZ-01A-31-A616-42dc4062d7-1c81-4b19-aabd-8307a0df5029TCGA-L7-A6VZ-01A-31-A616-42
TCGA-L7-A6VZ-01ATCGA-L7-A6VZTCGA-L7-A6VZ-01A-31-A616-42TPPrimary solid Tumor2013not reportednot reported2019-08-08T16:38:43.222690-05:00Adenocarcinoma, NOS C1C1C2C4C7ESCA-22102BD7-4D9F-4914-84D0-C64E2CB0D8C1-X018-S02-L028-B1-T2-P04522102BD7-4D9F-4914-84D0-C64E2CB0D8C1TCGA-L7-A6VZ-01A-31-A616-42dc4062d7-1c81-4b19-aabd-8307a0df5029TCGA-L7-A6VZ-01A-31-A616-42
In [14]:
esca.rse <- SummarizedExperiment(assays=SimpleList(log2norm=as.matrix(counts)),
                                 rowRanges = rowRanges, 
                                 colData = DataFrame(colData))
esca.rse
class: RangedSummarizedExperiment 
dim: 126935 33 
metadata(0):
assays(1): log2norm
rownames(126935): ESCA_1_chr1_10180_10679 ESCA_2_chr1_180641_181140 ...
  ESCA_126943_chrX_155985136_155985635
  ESCA_126944_chrX_156030008_156030507
rowData names(2): score name
colnames(33): ESCC-TCGA-IG-A51D-01A ESCC-TCGA-IG-A51D-01A ...
  ESAD-TCGA-M9-A5M8-01A ESAD-TCGA-M9-A5M8-01A
colData names(149): sample patient ... Case_UUID Case_ID

Since we have two samples for each patient we will rename tham as rep1 and rep2

In [15]:
duplicated.idx <- duplicated(colnames(esca.rse))
colnames(esca.rse)[!duplicated.idx] <- paste0(colnames(esca.rse)[!duplicated.idx],"_rep1")
colnames(esca.rse)[duplicated.idx] <- paste0(colnames(esca.rse)[duplicated.idx],"_rep2")
colnames(esca.rse)
  1. 'ESCC-TCGA-IG-A51D-01A_rep1'
  2. 'ESCC-TCGA-IG-A51D-01A_rep2'
  3. 'ESCC-TCGA-LN-A9FQ-01A_rep1'
  4. 'ESCC-TCGA-LN-A9FQ-01A_rep2'
  5. 'ESAD-TCGA-L7-A6VZ-01A_rep1'
  6. 'ESAD-TCGA-L7-A6VZ-01A_rep2'
  7. 'ESAD-TCGA-L5-A4OE-01A_rep1'
  8. 'ESAD-TCGA-L5-A4OE-01A_rep2'
  9. 'ESCC-TCGA-LN-A49O-01A_rep1'
  10. 'ESCC-TCGA-LN-A49O-01A_rep2'
  11. 'ESAD-TCGA-IG-A7DP-01A_rep1'
  12. 'ESAD-TCGA-IG-A7DP-01A_rep2'
  13. 'ESCC-TCGA-LN-A4A5-01A_rep1'
  14. 'ESCC-TCGA-LN-A4A5-01A_rep2'
  15. 'ESCC-TCGA-IG-A4P3-01A_rep1'
  16. 'ESCC-TCGA-IG-A3YA-01A_rep1'
  17. 'ESCC-TCGA-IG-A3YA-01A_rep2'
  18. 'ESAD-TCGA-IG-A4QS-01A_rep1'
  19. 'ESAD-TCGA-IG-A4QS-01A_rep2'
  20. 'ESCC-TCGA-LN-A7HX-01A_rep1'
  21. 'ESCC-TCGA-LN-A7HX-01A_rep2'
  22. 'ESCC-TCGA-LN-A8HZ-01A_rep1'
  23. 'ESCC-TCGA-LN-A4A2-01A_rep1'
  24. 'ESCC-TCGA-LN-A4A2-01A_rep2'
  25. 'ESCC-TCGA-LN-A4MR-01A_rep1'
  26. 'ESCC-TCGA-IG-A625-01A_rep1'
  27. 'ESCC-TCGA-IG-A625-01A_rep2'
  28. 'ESCC-TCGA-LN-A49W-01A_rep1'
  29. 'ESCC-TCGA-LN-A49W-01A_rep2'
  30. 'ESAD-TCGA-IC-A6RE-01A_rep1'
  31. 'ESAD-TCGA-IC-A6RE-01A_rep2'
  32. 'ESAD-TCGA-M9-A5M8-01A_rep1'
  33. 'ESAD-TCGA-M9-A5M8-01A_rep2'

Comparing ESCC vs ESAD ATAC-Seq

A t-test will be used to identify the peaks that have a significant different mean counts between the ESCC and ESAD samples.

In [16]:
escc.idx <- which(esca.rse$primary_diagnosis == "Squamous cell carcinoma, NOS")
esad.idx <- which(esca.rse$primary_diagnosis == "Adenocarcinoma, NOS")

# We will use 2 cores to run the code
library(doParallel)
registerDoParallel(2)

# Time expected ~ 3 min
result <- plyr::adply(assay(esca.rse),.margins = 1,.fun = function(peak){
  results <- t.test(peak[escc.idx],peak[esad.idx],conf.level = TRUE)
  return(tibble::tibble("raw_p_value"= results$p.value,
                        "ESCC_minus_ESAD" = results$estimate[1] - results$estimate[2]))
}, .progress = "time", .id = "peak", .parallel = TRUE)
result$FDR <- stats::p.adjust(result$raw_p_value,method = "fdr")
Loading required package: foreach
Loading required package: iterators
Progress disabled when using parallel plyr

Volcano plot of t-test analysis

It is possible to visualize the results of a t.test as a volcano plot, which can be used to better select a cut-off for the significant ATAC-Seq peaks. In this example we will use the $FDR < 0.01$ and $\Delta Log_2 Counts > 2$.

We will be using TCGAbiolinks package to plot it, but you can do it using ggplot2 or plotly for an interactive volcano plot. You can also find some examples at https://huntsmancancerinstitute.github.io/hciR/volcano.html.

In [17]:
fdr.cut.off <- 0.01
diff.cut.off <- 2

TCGAbiolinks:::TCGAVisualize_volcano(x = result$ESCC_minus_ESAD,
                                     y = result$FDR, 
                                     title =  paste0("Volcano plot - ATAC-seq peaks ",
                                                     "difference in ", 
                                                     "ESCC vs ESAD\n"),
                                     filename = NULL,
                                     label =  c("Not Significant",
                                                paste0("High in ESCC (vs ESAD)"),
                                                paste0("Low in ESCC (vs ESAD)")),
                                     ylab =  expression(paste(-Log[10],
                                                              " (FDR) [two tailed t-test] - cut-off FDR < ",fdr.cut.off
                                     )),
                                     xlab =  expression(paste(
                                       "Log2(Counts) difference - cut-off log2 delta(cts) > ",diff.cut.off
                                     )),
                                     x.cut = diff.cut.off, 
                                     y.cut = fdr.cut.off)

# How many peaks pass our cut-offs
message(sum(result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off))
4067

Heatmap of differential significant peaks

First, we will load the libraries used to plot the heatmap.

In [18]:
library(ComplexHeatmap)
library(circlize)

# Colors of the heatmap
pal_atac <- colorRampPalette(c('#3361A5',
                               '#248AF3',
                               '#14B3FF',
                               '#88CEEF',
                               '#C1D5DC',
                               '#EAD397',
                               '#FDB31A',
                               '#E42A2A',
                               '#A31D1D'))(100)

# Upper track with the samples annotation
ha = HeatmapAnnotation(df = data.frame("Group" = esca.rse$primary_diagnosis, 
                                       "Replicate" = stringr::str_match(colnames(esca.rse),"rep[0-9]?")),
                       show_annotation_name = T,
                       col = list(Group = c("Squamous cell carcinoma, NOS" =  "red", 
                                            "Adenocarcinoma, NOS" = "blue")),
                       show_legend = T,
                       annotation_name_side = "left",
                       annotation_name_gp = gpar(fontsize = 6))

# Select significant peals to plot
plot.atac <- assay(esca.rse)[result$FDR < fdr.cut.off & abs(result$ESCC_minus_ESAD) > diff.cut.off,]

# Define the color scale
col <- colorRamp2(seq(min(plot.atac), max(plot.atac), 
                      by = (max(plot.atac) - min(plot.atac))/99), pal_atac)

# Show the names of the peaks 1 and 18
rows.annot <- rowAnnotation(foo = anno_mark(at = c(1,18), labels = rownames(plot.atac)[c(1,18)]))


# Plot the ATAC-Seq signals
ht_list <- 
  Heatmap(plot.atac,
          name = "ATAC-seq log2(counts)", 
          col = col,
          column_names_gp = gpar(fontsize = 8),
          show_column_names = F,
          heatmap_legend_param = list(legend_direction = "horizontal",
                                      labels_gp = gpar(fontsize = 12), 
                                      title_gp = gpar(fontsize = 12)),
          show_row_names = FALSE,
          cluster_columns = TRUE,
          use_raster = TRUE,
          raster_device = c("png"),
          raster_quality = 2,
          cluster_rows = T,
          right_annotation = rows.annot,
          row_title = paste0(sum(result$FDR < fdr.cut.off & 
                                   abs(result$ESCC_minus_ESAD) > diff.cut.off),
                             " ATAC-seq peaks"),
          row_names_gp = gpar(fontsize = 4),
          top_annotation = ha,
          column_title_gp = gpar(fontsize = 12), 
          row_title_gp = gpar(fontsize = 12)) 

options(repr.plot.width=15, repr.plot.height=8)
draw(ht_list,newpage = TRUE, 
     column_title = paste0("ATAC-seq ESCC vs ESAD (FDR < ", fdr.cut.off,
                           ",  Diff mean log2 Count > ",diff.cut.off,")"),
     column_title_gp = gpar(fontsize = 12, fontface = "bold"),
     heatmap_legend_side = "bottom",
     annotation_legend_side = "right")