Title: | Retrieve Genomic and Clinical Data from CBioPortal Including TCGA Data |
---|---|
Description: | The Cancer Genome Atlas (TCGA) is a program aimed at improving our understanding of Cancer Biology. Several TCGA Datasets are available online. 'TCGAretriever' helps access and download TCGA data hosted on 'cBioPortal' via its Web Interface (see <https://www.cbioportal.org/> for more information). |
Authors: | Damiano Fantini |
Maintainer: | Damiano Fantini <[email protected]> |
License: | GPL-3 |
Version: | 1.9.1 |
Built: | 2024-10-25 02:44:50 UTC |
Source: | https://github.com/dami82/tcgaretriever |
A list of objects including examples of the output returned by different 'TCGAretriever' functions. The objects were obtained from the '"blca_tcga"' study (bladder cancer).
data(blcaOutputExamples)
A list including 9 elements.
exmpl_1: data.frame (dimensions: 10 by 13). Sample output of the 'get_cancer_studies()' function.
exmpl_2: data.frame (dimensions: 10 by 5). Sample output of the 'get_cancer_types()' function.
exmpl_3: data.frame (dimensions: 9 by 5). Sample output of the 'get_case_lists()' function.
exmpl_4: list including 9 elements. Sample output of the 'expand_cases()' function.
exmpl_5: data.frame (dimensions: 10 by 94). Sample output of the 'get_clinical_data()' function.
exmpl_6: data.frame (dimensions: 9 by 8). Sample output of the 'get_genetic_profiles()' function.
exmpl_7: data.frame (dimensions: 10 by 3). Sample output of the 'get_gene_identifiers()' function.
exmpl_8: data.frame (dimensions: 2 by 10). Sample output of the 'get_molecular_data()' function.
exmpl_9: data.frame (dimensions: 6 by 27). Sample output of the 'get_mutation_data()' function.
The object was built using the following lines of code.
blcaOutputExamples <- list(
  exmpl_1 = head(get_cancer_studies(), 10),
  exmpl_2 = head(get_cancer_types(), 10),
  exmpl_3 = head(get_case_lists("blca_tcga"), 10),
  exmpl_4 = expand_cases("blca_tcga"),
  exmpl_5 = head(get_clinical_data("blca_tcga"), 10),
  exmpl_6 = head(get_genetic_profiles("blca_tcga"), 10),
  exmpl_7 = head(get_gene_identifiers(), 10),
  exmpl_8 = get_molecular_data(case_list_id = 'blca_tcga_3way_complete',
                               gprofile_id = 'blca_tcga_rna_seq_v2_mrna',
                               glist = c("TP53", "E2F1"))[, 1:10],
  exmpl_9 = head(get_mutation_data(case_list_id = 'blca_tcga_sequenced',
                                   gprofile_id = 'blca_tcga_mutations',
                                   glist = c('TP53', 'PTEN'))))
data(blcaOutputExamples)
blcaOutputExamples$exmpl_1
Each study includes one or more "case lists". Each case list is a collection of samples that were analyzed using one or more platforms/assays. It is possible to obtain a list of all sample identifiers for each case list of interest.
expand_cases(csid, dryrun = FALSE)
csid |
String corresponding to a TCGA Cancer Study identifier. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
A list containing as many elements as there are case lists available for the given TCGA study. Each element is named after a case list identifier and is a character vector including all sample identifiers (case ids) that belong to that case list.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
x <- expand_cases("blca_tcga", dryrun = TRUE)
lapply(x, utils::head)
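Because the return value is a named list of character vectors, base R list tools can be applied to it directly. The sketch below uses a small hand-made list as a stand-in for the real output; the case list names and sample identifiers in it are hypothetical and chosen only for illustration.

```r
# Hypothetical stand-in for the output of expand_cases():
# a named list with one character vector of sample ids per case list.
cases <- list(
  study_all       = c("S-01", "S-02", "S-03", "S-04"),
  study_sequenced = c("S-01", "S-02", "S-04"),
  study_rna       = c("S-02", "S-03", "S-04")
)

# Number of samples in each case list
n_samples <- lengths(cases)

# Samples that appear in every case list
shared <- Reduce(intersect, cases)

n_samples
shared
```

Intersecting case lists in this way is one option for finding samples that were profiled on several platforms at once.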
Recursively query cbioportal to retrieve data corresponding to all available genes. Data are returned as a 'data.frame' that can be easily manipulated for downstream analyses.
fetch_all_tcgadata(case_list_id, gprofile_id, mutations = FALSE)
case_list_id |
string corresponding to the identifier of the TCGA Case List of interest |
gprofile_id |
string corresponding to the identifier of the TCGA Profile of interest |
mutations |
logical. If TRUE, extended mutation data are fetched instead of the standard TCGA data |
A data.frame is returned, including the desired TCGA data. Typically, rows are genes and columns are cases. If "extended mutation" data are retrieved (mutations = TRUE), rows correspond to individual mutations while columns are populated with mutation features.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# The examples below require an active Internet connection.
# Note: execution may take several minutes.
## Not run:
# Download all brca_pub mutation data (complete samples)
all_brca_MUT <- fetch_all_tcgadata(case_list_id = "brca_tcga_pub_complete",
                                   gprofile_id = "brca_tcga_pub_mutations",
                                   mutations = TRUE)
# Download all brca_pub RNA expression data (complete samples)
all_brca_RNA <- fetch_all_tcgadata(case_list_id = "brca_tcga_pub_complete",
                                   gprofile_id = "brca_tcga_pub_mrna",
                                   mutations = FALSE)
## End(Not run)
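This manual does not describe how 'fetch_all_tcgadata()' splits its work internally. As a rough illustration of the general idea of querying a long gene list in batches and combining the results, the following offline sketch may help. The helpers `chunk_genes()` and `mock_fetch()` are hypothetical and are not part of 'TCGAretriever'; the batch size of 500 simply mirrors the query size mentioned for the other retrieval functions.

```r
# Split a vector of gene identifiers into batches of at most `size` genes.
# (Hypothetical helper, not part of TCGAretriever.)
chunk_genes <- function(genes, size = 500) {
  split(genes, ceiling(seq_along(genes) / size))
}

# Hypothetical fetcher: stands in for a per-batch download and returns
# one row per gene, so the combining step can be shown without a network.
mock_fetch <- function(glist) {
  data.frame(gene = glist, value = nchar(glist), stringsAsFactors = FALSE)
}

genes <- paste0("GENE", 1:1200)   # made-up identifiers
batches <- chunk_genes(genes, size = 500)

# Query each batch in turn and stack the partial results
result <- do.call(rbind, lapply(batches, mock_fetch))
rownames(result) <- NULL

lengths(batches)
nrow(result)
```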
Retrieve information about the studies or datasets available at cbioportal.org. The information includes a 'studyId', description, references, and more.
get_cancer_studies(dryrun = FALSE)
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Data Frame including cancer study information.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
all_studies <- get_cancer_studies(dryrun = TRUE)
utils::head(all_studies)
Retrieve information about cancer types and corresponding abbreviations from cbioportal.org. The information includes identifiers, names, and parent cancer type.
get_cancer_types(dryrun = FALSE)
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
A data.frame including cancer type information.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
all_canc <- get_cancer_types(dryrun = TRUE)
utils::head(all_canc)
Each study includes one or more "case lists". Each case list is a collection of samples that were analyzed using one or more platforms/assays. It is possible to obtain a list of case list identifiers from cbioportal.org for a cancer study of interest. Identifier, name, description and category are returned for each entry.
get_case_lists(csid, dryrun = FALSE)
csid |
String corresponding to the Identifier of the Study of Interest |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Data Frame including Case List information.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
blca_case_lists <- get_case_lists("blca_tcga", dryrun = TRUE)
blca_case_lists
Retrieve Clinical Information about the samples included in a cancer study of interest. For each sample/case, information about the corresponding cancer patient is returned. This may include sex, age, therapeutic regimen, tumor stage, survival status, as well as other information.
get_clinical_data(csid, case_list_id = NULL, dryrun = FALSE)
csid |
String corresponding to a TCGA Cancer Study identifier. |
case_list_id |
String corresponding to the case_list identifier of interest. This can be NULL. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
data.frame including clinical information of a list of samples/cases of interest.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
clinic_data <- get_clinical_data("blca_tcga", dryrun = TRUE)
utils::head(clinic_data[, 1:7])
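Clinical tables are often combined with molecular data by matching on the sample identifier. The sketch below shows one way to do this with base R. Both tables are small hand-made stand-ins, and every column name and value in them is an assumption made for illustration rather than the actual layout returned by 'get_clinical_data()'.

```r
# Hypothetical clinical table: one row per sample (column names are assumptions).
clin <- data.frame(
  sampleId = c("S01", "S02", "S03", "S04"),
  SEX      = c("Male", "Female", "Male", "Female"),
  AGE      = c(61, 70, 58, 66),
  stringsAsFactors = FALSE
)

# Hypothetical per-sample expression values for a single gene.
expr <- data.frame(
  sampleId = c("S02", "S03", "S04", "S05"),
  TP53     = c(11.5, 9.8, 10.9, 12.1),
  stringsAsFactors = FALSE
)

# Inner join: keep only the samples present in both tables
merged <- merge(clin, expr, by = "sampleId")

# Mean expression by sex
aggregate(TP53 ~ SEX, data = merged, FUN = mean)
```

An inner join silently drops samples that lack either clinical or molecular data, so it can be worth comparing `nrow(merged)` with the size of each input.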
Obtain all valid gene identifiers, including ENTREZ gene identifiers and HUGO gene symbols. Genes are classified according to the gene type (*e.g.*, 'protein-coding', 'pseudogene', 'miRNA', ...). Note that miRNA and phosphoprotein genes are associated with a negative entrezGeneId.
get_gene_identifiers(dryrun = FALSE)
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Data Frame including gene identifiers.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
x <- get_gene_identifiers(dryrun = TRUE)
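A table of gene identifiers can be filtered by gene type, or by the sign of the Entrez identifier, using ordinary data.frame subsetting. The sketch below works on a hand-made stand-in; apart from 'entrezGeneId', which is named in the description above, the column names and all values are assumptions made for illustration.

```r
# Hypothetical stand-in for the output of get_gene_identifiers().
# Column names (other than entrezGeneId) and all ids are made up.
genes <- data.frame(
  entrezGeneId   = c(101L, 102L, -201L, 103L),
  hugoGeneSymbol = c("GENEA", "GENEB", "MIRX", "GENEC"),
  type           = c("protein-coding", "pseudogene", "miRNA", "protein-coding"),
  stringsAsFactors = FALSE
)

# Keep protein-coding genes only
coding <- genes[genes$type == "protein-coding", ]

# Entries with a negative identifier (miRNA and phosphoprotein genes)
special <- genes[genes$entrezGeneId < 0, ]

table(genes$type)
```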
Retrieve Information about all genetic profiles associated with a cancer study of interest. Each cancer study includes one or more types of molecular analyses. The corresponding assays or platforms are referred to as genetic profiles. A genetic profile identifier is required to download molecular data.
get_genetic_profiles(csid = NULL, dryrun = FALSE)
csid |
String corresponding to the cancer study id of interest |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
data.frame including information about genetic profiles.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
get_genetic_profiles("blca_tcga", dryrun = TRUE)
Retrieve Data corresponding to a Genetic Profile of interest from a cancer study of interest. This function is the workhorse of the TCGAretriever package and can be used to fetch data concerning several genes at once. For retrieving mutation data, please use the 'get_mutation_data()' function. For large queries (more than 500 genes), please use the 'fetch_all_tcgadata()' function.
get_molecular_data(
  case_list_id,
  gprofile_id,
  glist = c("TP53", "E2F1"),
  dryrun = FALSE
)
case_list_id |
String corresponding to the Identifier of a list of cases. |
gprofile_id |
String corresponding to the Identifier of a genetic Profile of interest. |
glist |
Vector including one or more gene identifiers (ENTREZID or OFFICIAL_SYMBOL). ENTREZID gene identifiers should be passed as numeric. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
data.frame including the molecular data of interest. Rows are genes, columns are samples.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
x <- get_molecular_data(case_list_id = 'blca_tcga_3way_complete',
                        gprofile_id = 'blca_tcga_rna_seq_v2_mrna',
                        glist = c("TP53", "E2F1"),
                        dryrun = TRUE)
x[, 1:10]
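Since rows are genes and columns are samples, many downstream analyses first transpose the table so that each row is a sample. The sketch below shows this step on a hand-made stand-in. The presence of a single leading 'gene' column, the sample names, and the values are assumptions made for illustration; the real output may carry different annotation columns.

```r
# Hypothetical stand-in for get_molecular_data() output:
# one row per gene, one column per sample (exact layout is an assumption).
expr <- data.frame(
  gene = c("TP53", "E2F1"),
  S01  = c(10.2, 7.1),
  S02  = c(11.5, 6.4),
  S03  = c(9.8, 8.0),
  stringsAsFactors = FALSE
)

# Move gene symbols to row names, then transpose to samples x genes
mat <- as.matrix(expr[, -1])
rownames(mat) <- expr$gene
by_sample <- as.data.frame(t(mat))

by_sample
```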
Retrieve DNA Sequence Variations (Mutations) identified by exome sequencing projects. This function is the workhorse of the TCGAretriever package for mutation data and can be used to fetch data concerning several genes at once. For retrieving non-mutation data, please use the 'get_molecular_data()' function. For large queries (more than 500 genes), please use the 'fetch_all_tcgadata()' function.
get_mutation_data(
  case_list_id,
  gprofile_id,
  glist = c("TP53", "E2F1"),
  dryrun = FALSE
)
case_list_id |
String corresponding to the Identifier of a list of cases. |
gprofile_id |
String corresponding to the Identifier of a genetic Profile of interest. |
glist |
Vector including one or more gene identifiers (ENTREZID or OFFICIAL_SYMBOL). ENTREZID gene identifiers should be passed as numeric. |
dryrun |
Logical. If TRUE, all other arguments (if any) are ignored and a representative example is returned as output. No Internet connection is required for executing the operation when 'dryrun' is TRUE. |
Data frame including one row per mutation.
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
# Set `dryrun = FALSE` (default option) in production!
x <- get_mutation_data(case_list_id = 'blca_tcga_sequenced',
                       gprofile_id = 'blca_tcga_mutations',
                       glist = c('TP53', 'PTEN'),
                       dryrun = TRUE)
utils::head(x[, c(4, 7, 23, 15, 16, 17, 24, 18, 21)])
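With one row per mutation, simple tabulation gives per-gene counts and the number of mutated samples. The sketch below runs on a hand-made stand-in; its column names and values are assumptions made for illustration and may not match the 27 columns of the real output.

```r
# Hypothetical stand-in for get_mutation_data() output: one row per mutation.
# Column names and values are made up for this illustration.
muts <- data.frame(
  sampleId       = c("S01", "S01", "S02", "S03", "S03"),
  hugoGeneSymbol = c("TP53", "PTEN", "TP53", "TP53", "TP53"),
  mutationType   = c("Missense", "Nonsense", "Missense", "Frameshift", "Missense"),
  stringsAsFactors = FALSE
)

# Number of mutations recorded per gene
per_gene <- table(muts$hugoGeneSymbol)

# Number of distinct samples carrying at least one mutation in each gene
mutated_samples <- tapply(muts$sampleId, muts$hugoGeneSymbol,
                          function(x) length(unique(x)))

per_gene
mutated_samples
```

Counting distinct samples rather than rows matters here because a single sample can carry several mutations in the same gene.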
Assign each element of a numeric vector to a group. Grouping is based on ranks: numeric values are sorted and then split in 2 or more groups. Values may be sorted in an increasing or decreasing fashion. The vector is returned in the original order. Labels may be assigned to each group.
make_groups(num_vector, groups, group_labels = NULL, desc = FALSE)
num_vector |
numeric vector. It includes the values to be assigned to the different groups |
groups |
integer. The number of groups that will be generated |
group_labels |
character vector. Labels for each group. Note that the length of group_labels has to be equal to the number of groups |
desc |
logical. If TRUE, the sorting is applied in a decreasing fashion |
data.frame including the vector provided as argument in the original order ("value") and the grouping vector ("rank"). If labels are provided as an argument, group labels are also included in the data.frame ("labels").
Damiano Fantini, [email protected]
https://www.data-pulse.com/dev_site/TCGAretriever/
exprs_geneX <- c(19.1, 18.4, 22.4, 15.5, 20.2, 17.4, 9.4, 12.4, 31.2, 33.2, 18.4, 22.1)
groups_num <- 3
groups_labels <- c("high", "med", "low")
make_groups(exprs_geneX, groups_num, groups_labels, desc = TRUE)
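To make the idea of rank-based grouping concrete, here is a minimal base R sketch. The helper `rank_groups()` is hypothetical: it is not the code used by 'make_groups()', and its handling of ties and of group sizes that do not divide evenly is an assumption that may differ from the package. It only mirrors the documented output columns ("value", "rank", "labels").

```r
# Illustrative re-implementation of rank-based grouping
# (hypothetical helper, not the code used by make_groups()).
rank_groups <- function(x, groups, labels = NULL, desc = FALSE) {
  stopifnot(is.numeric(x), groups >= 2)
  if (!is.null(labels)) stopifnot(length(labels) == groups)
  # Rank the values; ties are broken by position so every rank is unique
  r <- rank(if (desc) -x else x, ties.method = "first")
  # Map ranks 1..n onto group indices 1..groups of near-equal size
  grp <- ceiling(r * groups / length(x))
  out <- data.frame(value = x, rank = grp)
  if (!is.null(labels)) out$labels <- labels[grp]
  out
}

x <- c(19.1, 18.4, 22.4, 15.5, 20.2, 17.4, 9.4, 12.4, 31.2, 33.2, 18.4, 22.1)
res <- rank_groups(x, 3, c("high", "med", "low"), desc = TRUE)
table(res$labels)
```

With `desc = TRUE` the largest values receive the lowest ranks, so the first label ("high") is attached to the top third of the values, while the rows stay in the original input order.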