Workshop description

Aims

This workshop will:

  1. Introduce you to the TCGA (The Cancer Genome Atlas) data available at the NCI's Genomic Data Commons (GDC).
  2. Demonstrate how to access and import TCGA data using the R/Bioconductor package TCGAbiolinks [@colaprico2015tcgabiolinks].
  3. Provide instruction to visualize copy number alteration and mutation data using the R/Bioconductor package maftools.

R/Bioconductor packages used

Pre-requisites

Introduction

Understanding GDC (genomics data common portal)

In this workshop we will access (The Cancer Genome Atlas) data available at the NCI's Genomic Data Commons (GDC). Before we start, it is important to know that the data is deposit in two different databases:

An overview of the web portal is available at https://docs.gdc.cancer.gov/Data_Portal/Users_Guide/Getting_Started/.

Also, a more in-depth comparison between the legacy and Harmonized data was recently published:

Understanding TCGA data

Data access

GDC provides the data with two access levels:

  • Open: includes high level genomic data that is not individually identifiable, as well as most clinical and all biospecimen data elements.
  • Controlled: includes individually identifiable data such as low-level genomic sequencing data, germline variants, SNP6 genotype data, and certain clinical data elements

You can find more information about those two levels and how to get access to controlled data at: https://gdc.cancer.gov/access-data/data-access-processes-and-tools.

TCGA barcode description

Each TCGA sample has a unique identifier called TCGA barcode, which contains important information about each sample. A description of the barcode is shown below (Source: https://docs.gdc.cancer.gov/Encyclopedia/pages/TCGA_Barcode/).

You can find a table with all the code and meanings at https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables.

Data structure

In order to filter the data available in GDC some fields are available such as project (TCGA, TARGET, etc.), data category (Transcriptome Profiling, DNA methylation, Clinical, etc.), data type (Gene Expression Quantification, Isoform Expression Quantification, Methylation Beta Value, etc.), experimental strategy (miRNA-Seq, RNA-Seq, etc.), Workflow Type, platform, access type and others.

In terms of data granularity, a project has data on several categories, each category contains several data types that might have been produced with different workflows, experimental strategy and platforms. In that way, if you select data type "Gene Expression Quantification" the data category will be Transcriptome Profiling.

You can find the entry possibilities for each filter at the repository page of the database at https://portal.gdc.cancer.gov/repository.

The SummarizedExperiment data structure

Before we start, it is important to know that the R/Bioconductor environment provides a data structure called SummarizedExperiment, which was created to handle both samples metadata (age, gender, etc), genomics data (i.e. DNA methylation beta value) and genomics metadata information (chr, start, end, gene symbol) in the same object. You can access samples metadata with colData function, genomics data with assays and genomics metadata with rowRanges.

Loading required packages

In [1]:
suppressMessages({
    library(TCGAbiolinks)
    library(MultiAssayExperiment)
    library(maftools)
    library(dplyr)
    library(ComplexHeatmap)
})
In [2]:
clinical <- GDCquery_clinic("TCGA-COAD")
head(clinical)
A data.frame: 6 × 73
submitter_idyear_of_diagnosisclassification_of_tumorlast_known_disease_statusupdated_datetimeprimary_diagnosistumor_stageage_at_diagnosismorphologydays_to_last_known_disease_statustreatments_radiation_treatment_or_therapytreatments_radiation_days_to_treatment_starttreatments_radiation_treatment_effecttreatments_radiation_initial_disease_statustreatments_radiation_regimen_or_line_of_therapytreatments_radiation_treatment_anatomic_sitetreatments_radiation_treatment_outcometreatments_radiation_days_to_treatment_endbcr_patient_barcodedisease
<chr><int><chr><chr><chr><chr><chr><int><chr><lgl><chr><lgl><lgl><lgl><lgl><lgl><lgl><lgl><chr><chr>
TCGA-3L-AA1B2013not reportednot reported2019-08-08T16:33:45.855164-05:00Adenocarcinoma, NOS stage i 223798140/3NAnoNANANANANANANATCGA-3L-AA1BCOAD
TCGA-4N-A93T2013not reportednot reported2019-08-08T16:33:45.855164-05:00Adenocarcinoma, NOS stage iiib245238140/3NAnoNANANANANANANATCGA-4N-A93TCOAD
TCGA-4T-AA8H2013not reportednot reported2019-08-08T16:33:45.855164-05:00Mucinous adenocarcinomastage iia 154948480/3NAnoNANANANANANANATCGA-4T-AA8HCOAD
TCGA-5M-AAT42009not reportednot reported2019-08-08T16:33:45.855164-05:00Adenocarcinoma, NOS stage iv 270958140/3NAnoNANANANANANANATCGA-5M-AAT4COAD
TCGA-5M-AAT62009not reportednot reported2019-08-08T16:33:45.855164-05:00Adenocarcinoma, NOS stage iv 148528140/3NAnoNANANANANANANATCGA-5M-AAT6COAD
TCGA-5M-AATE2011not reportednot reported2019-08-08T16:33:45.855164-05:00Adenocarcinoma, NOS stage iia 278708140/3NAnoNANANANANANANATCGA-5M-AATECOAD
In [3]:
query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Clinical",
                  data.type = "Clinical Supplement", 
                  data.format = "BCR Biotab")
GDCdownload(query)
clinical.BCRtab.all <- GDCprepare(query)
names(clinical.BCRtab.all)
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-ACC
--------------------
oo Filtering results
--------------------
ooo By data.format
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-ACC
Of the 7 files for download 7 already exist.
All samples have been already downloaded
  |========================================                              |  57%
Warning message:
“Duplicated column names deduplicated: 'metastatic_tumor_site' => 'metastatic_tumor_site_1' [39]”
  |======================================================================| 100%
  1. 'clinical_radiation_acc'
  2. 'clinical_drug_acc'
  3. 'clinical_follow_up_v4.0_acc'
  4. 'clinical_omf_v4.0_acc'
  5. 'clinical_patient_acc'
  6. 'clinical_nte_acc'
  7. 'clinical_follow_up_v4.0_nte_acc'
In [4]:
clinical.BCRtab.all$clinical_drug_acc  %>% 
  head  %>% 
  as.data.frame
A data.frame: 6 × 28
bcr_patient_uuidbcr_patient_barcodebcr_drug_barcodebcr_drug_uuidform_completion_datepharmaceutical_therapy_drug_nameclinical_trial_drug_classificationpharmaceutical_therapy_typepharmaceutical_tx_started_days_topharmaceutical_tx_ongoing_indicatorregimen_indicationregimen_indication_notesregimen_numberroute_of_administrationstem_cell_transplantationstem_cell_transplantation_typetherapy_type_notestotal_dosetotal_dose_unitstx_on_clinical_trial
<chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
bcr_patient_uuid bcr_patient_barcodebcr_drug_barcode bcr_drug_uuid form_completion_datedrug_name clinical_trail_drug_classificationtherapy_type days_to_drug_therapy_starttherapy_ongoingregimen_indicationregimen_indication_notesregimen_number route_of_administrationstem_cell_transplantationstem_cell_transplantation_typetherapy_type_notestotal_dose total_dose_unitstx_on_clinical_trial
CDE_ID: CDE_ID:2003301 CDE_ID: CDE_ID: CDE_ID: CDE_ID:2975232CDE_ID:3378323 CDE_ID:2793530 CDE_ID:3392465 CDE_ID:3103479 CDE_ID:2793511 CDE_ID:2793516 CDE_ID:2744948 CDE_ID:2003586 CDE_ID:3090688 CDE_ID:2730901 CDE_ID:2001762 CDE_ID:1515 CDE_ID:3088785 CDE_ID:3925111
FB54458D-C373-46C2-841E-82663E13EFAATCGA-OR-A5JM TCGA-OR-A5JM-D49539B4F66718-B197-4EFB-AF9B-380B238BF6B52013-10-3 sunitinib [Not Available] Targeted Molecular therapy378 NO [Not Available] [Not Applicable] [Not Available][Not Available] [Not Available] [Not Available] [Not Available] [Not Available][Not Available] NO
FB54458D-C373-46C2-841E-82663E13EFAATCGA-OR-A5JM TCGA-OR-A5JM-D495402B5DD32A-1F38-4899-A90F-77C16B1F900E2013-10-3 ketoconazole [Not Available] Targeted Molecular therapy378 NO [Not Available] [Not Applicable] [Not Available][Not Available] [Not Available] [Not Available] [Not Available] [Not Available][Not Available] NO
344F7EA4-2BD4-4F2D-89F2-9FC7F572A3D6TCGA-OR-A5JY TCGA-OR-A5JY-D483101125569B-151B-4740-A864-9F8BFC63E2B42013-9-11 xeloda [Not Available] Chemotherapy 78 NO [Not Available] [Not Applicable] [Not Available][Not Available] [Not Available] [Not Available] [Not Available] [Not Available][Not Available] NO
6D980941-219E-40A0-9B41-E84C3FB0BD1ATCGA-OR-A5K2 TCGA-OR-A5K2-D482153C5B0FF6-2A7C-456E-8662-D38484AA7F3C2013-9-10 Adriamycin [Not Available] Chemotherapy 118 NO [Not Available] [Not Applicable] [Not Available][Not Available] [Not Available] [Not Available] [Not Available] [Not Available][Not Available] NO

RNA-Seq data

The RNA-Seq pipeline produces raw counts, FPKM and FPKM-UQ quantifications and is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/.

The following options are used to search mRNA results using TCGAbiolinks:

  • data.category: "Transcriptome Profiling"
  • data.type: "Gene Expression Quantification"
  • workflow.type: "HTSeq - Counts", "HTSeq - FPKM", "HTSeq - FPKM-UQ"

Here is the example to download the raw counts, which can be used with DESeq2 (http://bioconductor.org/packages/DESeq2/) for differential expression analysis.

In [5]:
query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
raw.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded
|====================================================|100%                      Completed after 1 s 
In [6]:
head(raw.counts)
A tibble: 6 × 3
X1TCGA-14-0736-02A-01R-2005-01TCGA-06-0211-02A-02R-2005-01
<chr><dbl><dbl>
ENSG00000000003.1342823853
ENSG00000000005.5 34 8
ENSG00000000419.11 8991785
ENSG00000000457.12 293 551
ENSG00000000460.15 143 378
ENSG00000000938.11 286 683
In [7]:
query.exp.hg38 <- GDCquery(project = "TCGA-GBM", 
                  data.category = "Transcriptome Profiling", 
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ",
                  barcode =  c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query.exp.hg38)
fpkm.uq.counts <- GDCprepare(query = query.exp.hg38, summarizedExperiment = FALSE)
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By data.type
ooo By workflow.type
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded
|====================================================|100%                      Completed after 0 s 
In [8]:
head(fpkm.uq.counts)
A tibble: 6 × 3
X1TCGA-14-0736-02A-01R-2005-01TCGA-06-0211-02A-02R-2005-01
<chr><dbl><dbl>
ENSG00000242268.2 1532.591 3517.609
ENSG00000270112.3 4253.036 4978.410
ENSG00000167578.15261186.594135247.024
ENSG00000273842.1 0.000 0.000
ENSG00000078237.5 224251.000180269.615
ENSG00000146083.10117080.711197064.420

Mutation

TCGAbiolinks has provided a few functions to download mutation data from GDC. There are two options to download the data:

  1. Use GDCquery_Maf which will download MAF aligned against hg38.

This example will download MAF (mutation annotation files) for variant calling pipeline muse. Pipelines options are: muse, varscan2, somaticsniper, mutect. For more information please access GDC docs.

You can download the data using TCGAbiolinks GDCquery_Maf function.

In [9]:
maf <- GDCquery_Maf("COAD", pipelines = "muse")
maf %>% head %>% as.data.frame
============================================================================
 For more information about MAF data please read the following GDC manual and web pages:
 GDC manual: https://gdc-docs.nci.nih.gov/Data/PDF/Data_UG.pdf
 https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/
 https://gdc.cancer.gov/about-gdc/variant-calling-gdc
============================================================================
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-COAD
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
ooo By workflow.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-COAD
Of the 1 files for download 1 already exist.
All samples have been already downloaded
|=================================================================| 100%  286 MB
A data.frame: 6 × 120
Hugo_SymbolEntrez_Gene_IdCenterNCBI_BuildChromosomeStart_PositionEnd_PositionStrandVariant_ClassificationVariant_TypeFILTERCONTEXTsrc_vcf_idtumor_bam_uuidnormal_bam_uuidcase_idGDC_FILTERCOSMICMC3_OverlapGDC_Validation_Status
<chr><int><chr><chr><chr><int><int><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr><chr>
ATAD3B 83858BCMGRCh38chr1 1485803 1485803+Nonsense_MutationSNPPASSTCAGTCGACCC9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNACOSM1333470 TrueUnknown
PLCH2 9651BCMGRCh38chr1 2487195 2487195+Silent SNPPASSAGGAGCCCTGC9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNACOSM1340725;COSM1340726TrueUnknown
CHD5 26038BCMGRCh38chr1 6146395 6146395+Missense_MutationSNPPASSAGTTGCGATAC9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNACOSM911251 TrueUnknown
IFNLR1163702BCMGRCh38chr12415906024159060+Missense_MutationSNPPASSGGCCCGTGGCA9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNACOSM1340840 TrueUnknown
YTHDF2 51441BCMGRCh38chr12876913628769136+3'UTR SNPPASSAAAAAAAGAAA9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNANA TrueUnknown
LRP8 7804BCMGRCh38chr15328963453289634+Silent SNPPASSCGTTCGTGGAT9130f121-b7ce-460f-b90c-8e31add6cd107de9d9e2-c4a4-4311-826f-973cb7987c66861ad835-790a-47ad-8c03-06f7eb7b57107a70f061-9a6f-408e-a416-7f5295ceba3bNANA TrueUnknown

Then visualize the results using the maftools package.

In [10]:
# create maftools input
maftools.input <- maf %>% read.maf
-Validating
-Silent variants: 77997 
-Summarizing
--Mutiple centers found
BCM;WUGSC;WUGSC;BCM;BCM;BI--Possible FLAGS among top ten genes:
  TTN
  MUC16
  SYNE1
  OBSCN
-Processing clinical data
--Missing clinical data
-Finished in 14.3s elapsed (12.8s cpu) 
In [11]:
# Check summary
plotmafSummary(maf = maftools.input, 
               rmOutlier = TRUE, 
               addStat = 'median', 
               dashboard = TRUE)
In [12]:
oncoplot(maf = maftools.input, 
         top = 10, 
         removeNonMutated = TRUE)
In [13]:
# classifies Single Nucleotide Variants into Transitions and Transversions
titv = titv(maf = maftools.input, 
            plot = FALSE, 
            useSyn = TRUE)
plotTiTv(res = titv)

You can extract sample summary from MAF object.

In [14]:
getSampleSummary(maftools.input) %>% head
A data.table: 6 × 7
Tumor_Sample_BarcodeMissense_MutationNonsense_MutationNonstop_MutationSplice_SiteTranslation_Start_Sitetotal
<fct><int><int><int><int><int><dbl>
TCGA-AA-A010-01A-01D-A17O-106440691512837267
TCGA-CA-6717-01A-11D-1835-105989773812456899
TCGA-AZ-4315-01A-01D-1408-1052225362 7925841
TCGA-AA-3984-01A-02D-1981-1038674755 5904406
TCGA-AA-A00N-01A-02D-A17O-1036904314 5244181
TCGA-CK-4951-01A-01D-1408-1032542056 7253542

Copy number alteration data

The Copy Number Variation Analysis Pipeline is described at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/CNV_Pipeline/

Numeric focal-level Copy Number Variation (CNV) values were generated with "Masked Copy Number Segment" files from tumor aliquots using GISTIC2 on a project level. Only protein-coding genes were kept, and their numeric CNV values were further thresholded by a noise cutoff of 0.3:

  • Genes with focal CNV values smaller than -0.3 are categorized as a "loss" (-1)
  • Genes with focal CNV values larger than 0.3 are categorized as a "gain" (+1)
  • Genes with focal CNV values between and including -0.3 and 0.3 are categorized as "neutral" (0).

You can access "Gene Level Copy Number Scores" from GISTIC with the code below:

In [15]:
query <- GDCquery(project = "TCGA-GBM",
             data.category = "Copy Number Variation",
             data.type = "Gene Level Copy Number Scores",              
             access = "open")
GDCdownload(query)
scores <- GDCprepare(query)
scores[1:5,1:5]
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By access
ooo By data.type
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 1 files for download 1 already exist.
All samples have been already downloaded
Reading GISTIC file
Reading file: GDCdata/TCGA-GBM/harmonized/Copy_Number_Variation/Gene_Level_Copy_Number_Scores/45e4aef6-2dbf-405d-a033-722241c79565/GBM.focal_score_by_genes.txt
A tibble: 5 × 5
Gene SymbolGene IDCytobandTCGA-19-1790-01B-01D-1224-01TCGA-41-2572-01A-01D-1224-01
<chr><dbl><chr><dbl><dbl>
ENSG00000008128.2101p36.3300
ENSG00000008130.1401p36.3300
ENSG00000067606.1401p36.3300
ENSG00000078369.1601p36.3300
ENSG00000078808.1501p36.3300

You can visualize the data using the R/Bioconductor package complexHeatmap.

In [16]:
scores.matrix <- scores %>% 
  dplyr::select(-c(1:3)) %>%  # Removes metadata from the first 3 columns
  as.matrix

rownames(scores.matrix) <- paste0(scores$`Gene Symbol`,"_",scores$Cytoband)

# gain in more than 200 samples
gain.more.than.twohundred.samples <- which(rowSums(scores.matrix == 1) > 200)

# loss in more than 200 samples
loss.more.than.twohundred.samples <- which(rowSums(scores.matrix == -1) > 200)

lines.selected <- c(gain.more.than.twohundred.samples,loss.more.than.twohundred.samples)

Heatmap(scores.matrix[lines.selected,],
        show_column_names = FALSE, 
        show_row_names = TRUE,
        row_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(c(-1,0,1), colors = c("red","white","blue")))

DNA methylation data

The processed DNA methylation data measure the level of methylation at known CpG sites as beta values, calculated from array intensities (Level 2 data) as Beta = $M/(M+U)$ [@zhou2017comprehensive] which ranges from 0 being unmethylated and 1 fully methylated.

More information about the DNA methylation pipeline is available at https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Methylation_LO_Pipeline/.

We will download two Glioblastoma (GBM) as a summarizedExperiment object.

In [17]:
query_met.hg38 <- GDCquery(project = "TCGA-GBM", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 27", 
                           barcode = c("TCGA-02-0116-01A","TCGA-14-3477-01A-01D"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38,summarizedExperiment = TRUE)
--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-GBM
--------------------
oo Filtering results
--------------------
ooo By platform
ooo By barcode
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-GBM
Of the 2 files for download 2 already exist.
All samples have been already downloaded
|====================================================|100%                      Completed after 0 s 
Joining, by = c("Composite.Element.REF", "Chromosome", "Start", "End", "Gene_Symbol", "Gene_Type", "Transcript_ID", "Position_to_TSS", "CGI_Coordinate", "Feature_Type")
Starting to add information to samples
 => Add clinical information to samples
Add FFPE information. More information at: 
=> https://cancergenome.nih.gov/cancersselected/biospeccriteria 
=> http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html
 => Adding subtype information to samples
gbm subtype information from:doi:10.1016/j.cell.2015.12.028
In [18]:
data.hg38
class: RangedSummarizedExperiment 
dim: 27578 2 
metadata(1): data_release
assays(1): ''
rownames(27578): cg00000292 cg00002426 ... cg27662877 cg27665659
rowData names(7): Composite.Element.REF Gene_Symbol ... CGI_Coordinate
  Feature_Type
colnames(2): TCGA-02-0116-01A-01D-0199-05 TCGA-14-3477-01A-01D-0915-05
colData names(113): sample patient ...
  subtype_Telomere.length.estimate.in.blood.normal..Kb.
  subtype_Telomere.length.estimate.in.tumor..Kb.

You can access the probes information with rowRanges.

In [19]:
data.hg38 %>% rowRanges %>% as.data.frame %>% head
A data.frame: 6 × 12
seqnamesstartendwidthstrandComposite.Element.REFGene_SymbolGene_TypeTranscript_IDPosition_to_TSSCGI_CoordinateFeature_Type
<fct><int><int><int><fct><chr><chr><chr><chr><chr><chr><chr>
cg00000292chr16 28878779 288787802*cg00000292ATP2A1;ATP2A1;ATP2A1;ATP2A1;ATP2A1 protein_coding;protein_coding;protein_coding;protein_coding;protein_coding ENST00000357084.6;ENST00000395503.7;ENST00000536376.4;ENST00000562185.4;ENST00000563975.1 373;290;-1275;-465;-83 CGI:chr16:28879633-28880547 N_Shore
cg00002426chr3 57757816 577578172*cg00002426SLMAP;SLMAP;SLMAP;SLMAP;SLMAP;SLMAP protein_coding;protein_coding;protein_coding;protein_coding;protein_coding;protein_codingENST00000295951.6;ENST00000295952.6;ENST00000383718.6;ENST00000428312.4;ENST00000449503.5;ENST00000467901.11585;368;261;257;257;514 CGI:chr3:57756198-57757263 S_Shore
cg00003994chr7 15686237 156862382*cg00003994MEOX2 protein_coding ENST00000262041.5 576 CGI:chr7:16399497-16399700 .
cg00005847chr2 1761643451761643462*cg00005847AC009336.19;HOXD3;HOXD3;HOXD3;RP11-387A1.5protein_coding;protein_coding;protein_coding;protein_coding;antisense ENST00000468418.4;ENST00000249440.4;ENST00000410016.4;ENST00000432796.2;ENST00000608941.1 13259;267;3453;27387;1372CGI:chr2:176164685-176165509N_Shore
cg00006414chr7 1491257451491257462*cg00006414RN7SL521P;ZNF398;ZNF425;ZNF425 misc_RNA;protein_coding;protein_coding;protein_coding ENST00000488398.3;ENST00000426851.5;ENST00000378061.5;ENST00000483014.1 242;-672;602;562 CGI:chr7:149126122-149127136N_Shore
cg00007981chr11 94129428 941294292*cg00007981PANX1;PANX1 protein_coding;protein_coding ENST00000227638.6;ENST00000436171.2 499;498 CGI:chr11:94128394-94129607 Island

You can access the samples metadata with colData.

In [20]:
data.hg38 %>% colData %>% as.data.frame
A data.frame: 2 × 113
samplepatientbarcodeshortLetterCodedefinitionyear_of_diagnosisclassification_of_tumorlast_known_disease_statusupdated_datetime.xprimary_diagnosissubtype_Transcriptome.Subtypesubtype_Pan.Glioma.RNA.Expression.Clustersubtype_IDH.specific.RNA.Expression.Clustersubtype_Pan.Glioma.DNA.Methylation.Clustersubtype_IDH.specific.DNA.Methylation.Clustersubtype_Supervised.DNA.Methylation.Clustersubtype_Random.Forest.Sturm.Clustersubtype_RPPA.clustersubtype_Telomere.length.estimate.in.blood.normal..Kb.subtype_Telomere.length.estimate.in.tumor..Kb.
<chr><chr><chr><chr><chr><int><chr><chr><chr><chr><fct><fct><fct><fct><fct><fct><fct><fct><dbl><dbl>
TCGA-02-0116-01A-01D-0199-05TCGA-02-0116-01ATCGA-02-0116TCGA-02-0116-01A-01D-0199-05TPPrimary solid Tumor2003not reportednot reported2019-08-08T16:40:20.353826-05:00GlioblastomaNALGr4NALGm4IDHwt-K1Classic-likeNAK1NANA
TCGA-14-3477-01A-01D-0915-05TCGA-14-3477-01ATCGA-14-3477TCGA-14-3477-01A-01D-0915-05TPPrimary solid Tumor2009not reportednot reported2019-08-08T16:42:28.969386-05:00GlioblastomaPNLGr1NALGm6IDHwt-K3LGm6-GBM NANANANA

You can access the DNA methylation levels with assay.

In [21]:
data.hg38 %>% assay %>% head %>% as.data.frame
A data.frame: 6 × 2
TCGA-02-0116-01A-01D-0199-05TCGA-14-3477-01A-01D-0915-05
<dbl><dbl>
cg000002920.784952500.11314067
cg000024260.060284240.02843910
cg000039940.042854540.24091388
cg000058470.624900420.07579907
cg00006414 NA NA
cg000079810.040205150.03530111
In [22]:
# plot 10 most variable probes
data.hg38 %>% 
  assay %>% 
  rowVars %>% 
  order(decreasing = TRUE) %>% 
  head(10) -> idx

pal_methylation <- colorRampPalette(c("#000436",
                                      "#021EA9",
                                      "#1632FB",
                                      "#6E34FC",
                                      "#C732D5",
                                      "#FD619D",
                                      "#FF9965",
                                      "#FFD32B",
                                      "#FFFC5A"))(100)

Heatmap(assay(data.hg38)[idx,],
        show_column_names = TRUE, 
        show_row_names = TRUE,
         name = "Methylation Beta-value", 
         row_names_gp = gpar(fontsize = 8),
         column_names_gp = gpar(fontsize = 8),
        col = circlize::colorRamp2(seq(0, 1, by = 1/99), pal_methylation))

ATAC-Seq data

Please, check our ATAC-seq Workshop: http://rpubs.com/tiagochst/atac_seq_workshop

Session information

In [23]:
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS/LAPACK: /opt/conda/lib/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
 [1] grid      parallel  stats4    stats     graphics  grDevices utils    
 [8] datasets  methods   base     

other attached packages:
 [1] ComplexHeatmap_2.1.0        dplyr_0.8.3                
 [3] maftools_2.0.20             MultiAssayExperiment_1.8.1 
 [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0         
 [7] BiocParallel_1.16.6         matrixStats_0.55.0         
 [9] Biobase_2.42.0              GenomicRanges_1.34.0       
[11] GenomeInfoDb_1.18.2         IRanges_2.16.0             
[13] S4Vectors_0.20.1            BiocGenerics_0.28.0        
[15] TCGAbiolinks_2.13.6        

loaded via a namespace (and not attached):
  [1] uuid_0.1-2                  backports_1.1.4            
  [3] circlize_0.4.8              aroma.light_3.12.0         
  [5] NMF_0.21.0                  plyr_1.8.4                 
  [7] selectr_0.4-1               ConsensusClusterPlus_1.46.0
  [9] repr_1.0.1                  lazyeval_0.2.2             
 [11] splines_3.5.1               ggplot2_3.2.1              
 [13] gridBase_0.4-7              sva_3.30.1                 
 [15] digest_0.6.20               foreach_1.4.7              
 [17] htmltools_0.3.6             magrittr_1.5               
 [19] memoise_1.1.0               cluster_2.1.0              
 [21] doParallel_1.0.15           limma_3.38.3               
 [23] Biostrings_2.50.2           readr_1.3.1                
 [25] annotate_1.60.1             wordcloud_2.6              
 [27] R.utils_2.9.0               prettyunits_1.0.2          
 [29] colorspace_1.4-1            blob_1.2.0                 
 [31] rvest_0.3.4                 ggrepel_0.8.1              
 [33] xfun_0.9                    crayon_1.3.4               
 [35] RCurl_1.95-4.12             jsonlite_1.6               
 [37] genefilter_1.64.0           zeallot_0.1.0              
 [39] survival_2.44-1.1           zoo_1.8-6                  
 [41] iterators_1.0.12            glue_1.3.1                 
 [43] survminer_0.4.6             registry_0.5-1             
 [45] gtable_0.3.0                zlibbioc_1.28.0            
 [47] XVector_0.22.0              GetoptLong_0.1.7           
 [49] shape_1.4.4                 scales_1.0.0               
 [51] DESeq_1.34.1                rngtools_1.4               
 [53] DBI_1.0.0                   edgeR_3.24.3               
 [55] bibtex_0.4.2                ggthemes_4.2.0             
 [57] Rcpp_1.0.2                  xtable_1.8-4               
 [59] progress_1.2.2              clue_0.3-57                
 [61] bit_1.1-14                  matlab_1.0.2               
 [63] km.ci_0.5-2                 httr_1.4.1                 
 [65] RColorBrewer_1.1-2          pkgconfig_2.0.2            
 [67] XML_3.98-1.19               R.methodsS3_1.7.1          
 [69] locfit_1.5-9.1              reshape2_1.4.3             
 [71] tidyselect_0.2.5            rlang_0.4.0                
 [73] AnnotationDbi_1.44.0        munsell_0.5.0              
 [75] tools_3.5.1                 downloader_0.4             
 [77] generics_0.0.2              RSQLite_2.1.2              
 [79] broom_0.5.2                 evaluate_0.14              
 [81] stringr_1.4.0               knitr_1.24                 
 [83] bit64_0.9-7                 survMisc_0.5.5             
 [85] purrr_0.3.2                 EDASeq_2.16.3              
 [87] nlme_3.1-141                R.oo_1.22.0                
 [89] xml2_1.2.2                  biomaRt_2.38.0             
 [91] compiler_3.5.1              curl_4.1                   
 [93] png_0.1-7                   ggsignif_0.6.0             
 [95] tibble_2.1.3                geneplotter_1.60.0         
 [97] stringi_1.4.3               GenomicFeatures_1.34.8     
 [99] lattice_0.20-38             IRdisplay_0.7.0            
[101] Matrix_1.2-17               KMsurv_0.1-5               
[103] vctrs_0.2.0                 pillar_1.4.2               
[105] lifecycle_0.1.0             GlobalOptions_0.1.0        
[107] data.table_1.12.2           bitops_1.0-6               
[109] rtracklayer_1.42.2          R6_2.4.0                   
[111] latticeExtra_0.6-28         hwriter_1.3.2              
[113] ShortRead_1.40.0            gridExtra_2.3              
[115] codetools_0.2-16            assertthat_0.2.1           
[117] pkgmaker_0.27               rjson_0.2.20               
[119] withr_2.1.2                 GenomicAlignments_1.18.1   
[121] Rsamtools_1.34.1            GenomeInfoDbData_1.2.1     
[123] mgcv_1.8-28                 hms_0.5.1                  
[125] IRkernel_1.0.2              tidyr_1.0.0                
[127] ggpubr_0.2.3                pbdZMQ_0.3-3               
[129] base64enc_0.1-3            

Workshop materials

Source code

All source code used to produce the workshops are available at https://github.com/tiagochst/ELMER_workshop_2019.

Workshops HTMLs

Workshop videos

We have a set of recorded videos, explaining some of the workshops.