Cancer cells accumulate DNA mutations as result of DNA damage and DNA repair processes. mutSignatures is a computational framework aimed at deciphering DNA mutational signatures operating in cancer.
The framework includes three modules that support 1) raw data import and pre-processing, 2) mutation counts deconvolution, and 3) data visualization. Therefore, mutSignatures is a comprehensive software suite that can guide the user throughout all steps of a mutational signature analysis.
Inputs are VCF files, MAF files, as well as other commonly used
file formats storing DNA variant information. These files are imported
and pre-processed to obtain a mutationCounts
-class object
(which includes a numeric matrix of DNA mutation counts).
The framework de-novo extracts mutational signatures via Non-Negative Matrix Factorization (NMF). Bootstrapping is performed as part of the analysis.
The framework relies on parallelization (optional) and is optimized for use on multi-core systems.
The mutSignatures framework was developed in the Meeks Lab at Northwestern University (Chicago, IL), and was described in a Scientific Reports paper Fantini D et al, 2020
The software is based on a custom R-based implementation of the MATLAB WTSI framework by Alexandrov LB et al, 2013, and includes a wide range of tools for expanded functionality.
The mutSignatures framework has been used for analyses described in peer-reviewed publications, including Fantini D et al, 2018 and Fantini D et al, 2019.
More vignettes discussing how to extract and analyze mutational
signatures from different sources using mutSignatures
can
be found at the following URLs:
Getting started with mutSignatures (extended): link to PDF
Deploying mutSignatures on Computational Clusters: link to PDF
Deconvoluting Mutation Counts Against Known Mutational Signatures: link to PDF
More tutorials: mutsignatures official website
The package is available from CRAN. The official version of
mutSignatures
(version 2.1.1) can be installed from any
CRAN mirror.
The latest (most recently updated, beta) version of the
package is available on GitHub. It is possible to install
mutSignatures
using devtools
.
devtools::install_github("dami82/mutSignatures", force = TRUE,
build_opts = NULL, build_vignettes = TRUE)
One of the pre-processing steps in the mutSignatures pipeline relies on a list of Bioconductor (https://www.bioconductor.org/) libraries. These libraries are optional, since they are only required for the attachContext() step (these libs are not automatically downloaded when installing the package). The bioconductor libraries are: i) IRanges; ii) GenomicRanges; iii) BSgenome; iv) GenomeInfoDb. You can install them as shown below.
# Get BiocManager
if (!"BiocManager" %in% rownames(utils::installed.packages()))
install.packages("BiocManager")
# Get Bioc libs
BiocManager::install(pkgs = c("IRanges", "GenomicRanges", "BSgenome", "GenomeInfoDb"))
To complete the attachContext() step, a BSgenome package including full genome sequences of the organism of interest has to be installed as well. For example, human hg19 genome sequences can be installed as shown below.
The typical mutSignatures
analytic pipeline includes
three steps:
Data Import and pre-processing: Mutation data
are imported from a VCF, MAF, or similar file. Single Nucleotide
Variants (SNVs) are filtered, nucleotide context around the mutation
site is retrieved, and tri-nucleotide mutation types are computed and
counted across samples. The result of this step is a
mutationCounts
-class object.
De-novo extraction of mutational signatures: Non-negative Matrix factorization is performed. Typically, 500 to 1,000 bootstrapping iterations are performed (for preparative analytic runs). 100-500 iterations are typically enough for a preliminary analysis. The NMF step returns both: i) mutational signatures, as well as ii) mutSignature exposures.
Visualization, downstream analysis, data export:
Results can be visualized (barplots). Signatures can be compared and
matched to known signatures (i.e. previously extracted
signatures or published signatures, such as the COSMIC signatures
published by the Sanger Institute, London). You can use the
msigPlot()
method to generate plots.
If the set of mutational signatures is already known, it is possible
to use mutSignatures
to estimate the contribution of each
mutational pattern in a collection of samples. In this case, the
pipeline includes the following steps:
Data Import and pre-processing: see above.
Estimate exposures to known mutational signatures: Use a Fast Combinatorial Nonnegative Least-Square approach to compute exposures when the mutation signatures are pre-defined.
Visualization, downstream analysis, data export: see above.
For running the demos described in this vignette, the following R
libraries will be used: dplyr
, reshape2
,
ggplot2
, kableExtra
,
BSgenome.Hsapiens.UCSC.hg19
, and
mutSignatures
. Before proceeding, we assign the
hg19 BSgenome object to a variable named hg19
, and
we load the mutSigData
dataset, which is provided together
with the mutSignatures
package.
The first demo shows how to extract mutational signatures from a dataset including DNA mutations from bladder cancer samples.
Here, the input has a VCF-like structure, and is decorated with an
extra column (namely, SAMPLEID
) that includes a unique
identifier for the biological samples.
# Import data (VCF-like format)
x <- mutSigData$inputC
# Filter non SNV
x <- filterSNV(dataSet = x, seq_colNames = c("REF", "ALT"))
# Visualize head
head(x) %>% kable() %>% kable_styling(bootstrap_options = "striped")
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID |
---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 |
# Attach context
x <- attachContext(mutData = x,
chr_colName = "CHROM",
start_colName = "POS",
end_colName = "POS",
nucl_contextN = 3,
BSGenomeDb = hg19)
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID | context |
---|---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 | CGA |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGA |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGC |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGG |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGA |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 | GGA |
# Remove mismatches
x <- removeMismatchMut(mutData = x, # input data.frame
refMut_colName = "REF", # column name for ref base
context_colName = "context", # column name for context
refMut_format = "N")
# Compute mutType
x <- attachMutType(mutData = x, # as above
ref_colName = "REF", # column name for ref base
var_colName = "ALT", # column name for mut base
context_colName = "context")
CHROM | POS | ID | REF | ALT | QUAL | FILTER | INFO | FORMAT | XTR1 | SAMPLEID | context | mutType |
---|---|---|---|---|---|---|---|---|---|---|---|---|
chr15 | 73624589 | . | G | T | 29.084 | . | . | GT:PL | 0/1 | BLCA.001103810 | CGA | T[C>A]G |
chr22 | 35684398 | . | G | A | 21.995 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGA | T[C>T]A |
chr3 | 183689484 | . | G | A | 27.296 | . | . | GT:PL | 0/1 | BLCA.001103810 | TGC | G[C>T]A |
chr16 | 89261395 | . | G | A | 24.398 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGG | C[C>T]T |
chr19 | 44662014 | . | G | C | 21.626 | . | . | GT:PL | 0/1 | BLCA.001103810 | AGA | T[C>G]T |
chr16 | 30724634 | . | G | C | 38.741 | . | . | GT:PL | 0/1 | BLCA.001103810 | GGA | T[C>G]C |
# Count
blca.counts <- countMutTypes(mutTable = x,
mutType_colName = "mutType",
sample_colName = "SAMPLEID")
## Mutation Counts object - mutSignatures
##
## Total num of MutTypes: 96
## MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
##
## Total num of Samples: 50
## Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...
After multi-nucleotide mutation count data have been prepared, it is
possible to proceed with the de-novo signature extraction.
Settings guiding the analysis are defined as a list of parameters that
can be tuned via the setMutClusterParams()
function. Next,
the NMF analysis is executed by the
decipherMutationalProcesses()
function. At the end of the
analysis, a silhouette plot is returned. This plot can be very helpful
to understand if the obtained signatures are relatively weak or
robust.
# how many signatures should we extract?
num.sign <- 4
# Define parameters for the non-negative matrix factorization procedure.
# you should parallelize if possible
blca.params <-
mutSignatures::setMutClusterParams(
num_processesToExtract = num.sign, # num signatures to extract
num_totIterations = 20, # bootstrapping: usually 500-1000
num_parallelCores = 4) # total num of cores to use (parallelization)
# Extract new signatures - may take a while
blca.analysis <-
decipherMutationalProcesses(input = blca.counts,
params = blca.params)
Profiles of de-novo extracted mutational signatures can be
visualized (by barplots). Also it is possible to plot the estimated
exposures of each sample to the identified mutational patterns. It is
also possible to compare de-novo extracted signatures with
known signatures (such as signatures from previous analyses, or COSMIC
signatures). To cast mutSignature
objects as
data.frames
, you can use the
mutSignatures::as.data.frame()
method.
Note: you can use the
mutSignatures::getCosmicSignatures()
function to download a
copy of the COSMIC signatures.
# Retrieve signatures (results)
blca.sig <- blca.analysis$Results$signatures
# Retrieve exposures (results)
blca.exp <- blca.analysis$Results$exposures
# Plot signature 1 (standard barplot, you can pass extra args such as ylim)
msigPlot(blca.sig, signature = 1, ylim = c(0, 0.10))
# Plot exposures (ggplot2 object, you can customize as any other ggplot2 object)
msigPlot(blca.exp, main = "BLCA samples") +
scale_fill_manual(values = c("#1f78b4", "#cab2d6", "#ff7f00", "#a6cee3"))
# Export Signatures as data.frame
xprt <- coerceObj(x = blca.sig, to = "data.frame")
head(xprt) %>% kable() %>% kable_styling(bootstrap_options = "striped")
Sign.01 | Sign.02 | Sign.03 | Sign.04 | |
---|---|---|---|---|
A[C>A]A | 0.0022120 | 0.0040737 | 0.0057090 | 0.0062098 |
A[C>A]C | 0.0007129 | 0.0088800 | 0.0016303 | 0.0145214 |
A[C>A]G | 0.0002019 | 0.0039400 | 0.0007855 | 0.0120000 |
A[C>A]T | 0.0005315 | 0.0022517 | 0.0110999 | 0.0075955 |
A[C>G]A | 0.0009370 | 0.0048720 | 0.0024274 | 0.0098213 |
A[C>G]C | 0.0013582 | 0.0001527 | 0.0054321 | 0.0046073 |
# Get signatures from data (imported as data.frame)
# and then convert it to mutSignatures object
cosmixSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("COSMIC")) %>%
as.mutation.signatures()
blcaKnwnSigs <- mutSigData$blcaSIGS %>%
dplyr::select(starts_with("BLCA")) %>%
as.mutation.signatures()
# Compare de-novo signatures with selected COSMIC signatures
msig1 <- matchSignatures(mutSign = blca.sig, reference = cosmixSigs,
threshold = 0.45, plot = TRUE)
msig2 <- matchSignatures(mutSign = blca.sig, reference = blcaKnwnSigs,
threshold = 0.45, plot = TRUE)
# Visualize match
# signature 1 is similar to COSMIC ;
# signatures 2 and 3 are similar to COSMIC
# Here, we should probably extract only 2 mutational signatures
hm1 <- msig1$plot + ggtitle("Match to COSMIC signs.")
hm2 <- msig2$plot + ggtitle("Match to known BLCA signs.")
# Show
grid.arrange(hm1, hm2, ncol = 2)
The second demo illustrates a case where the signatures are known. We
will use the COSMIC signatures 1, 2, 5, and 13 in the analysis, as well
as a set of custom signatures previously identified in bladder cancer
tumor samples. We will estimate their contributions to a collection of
blca mutations that are included in the mutSigData
dataset.
Data import and pre-processing can be conducted as shown above.
Note that if a data.frame
with counts of
mutation types across samples is available, it is sufficient to use the
as.mutation.counts()
method to convert it to the desired
object.
# Retrieve a mutation.counts data.frame
x <- mutSigData$blcaMUTS
# Visualize header
x[1:10, 1:5] %>% kable() %>% kable_styling(bootstrap_options = "striped")
BLCA.001103810 | BLCA.001289938 | BLCA.001362782 | BLCA.001427035 | BLCA.001575414 | |
---|---|---|---|---|---|
A[C>A]A | 1 | 0 | 0 | 0 | 0 |
A[C>A]C | 0 | 0 | 0 | 0 | 0 |
A[C>A]G | 0 | 0 | 0 | 1 | 0 |
A[C>A]T | 1 | 0 | 0 | 1 | 0 |
A[C>G]A | 1 | 0 | 0 | 2 | 0 |
A[C>G]C | 0 | 0 | 0 | 0 | 0 |
A[C>G]G | 1 | 0 | 0 | 0 | 0 |
A[C>G]T | 0 | 0 | 2 | 0 | 0 |
A[C>T]A | 1 | 0 | 1 | 3 | 0 |
A[C>T]C | 2 | 1 | 1 | 1 | 0 |
## Mutation Counts object - mutSignatures
##
## Total num of MutTypes: 96
## MutTypes: A[C>A]A, A[C>A]C, A[C>A]G, A[C>A]T, A[C>G]A ...
##
## Total num of Samples: 50
## Sample Names: BLCA.001103810, BLCA.001289938, BLCA.001362782, BLCA.001427035, BLCA.001575414 ...
After preparing the mutational signatures as a
mutationSignatures
-class object, it is sufficient to run
the resolveMutSignatures()
function. It is not recommended
to use a large number of signatures when running this analysis (for
example, refrain from using all COSMIC signatures). Likewise, you don’t
want to run this step using a very low number of signatures. The
algorithm is way more accurate if ONLY AND ALL the
relevant signatures are used (i.e., the signatures we are
reasonably expecting to be operative in the tumor samples being
analyzed). This analysis is typically very fast!
# Obtain 4 COSMIC signatures
cosmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("COSMIC"))
cosmx <- as.mutation.signatures(cosmx)
# Obtain 4 BLCA signatures
blcmx <- mutSigData$blcaSIGS %>% dplyr::select(starts_with("BLCA"))
blcmx <- as.mutation.signatures(blcmx)
## Mutation Signatures object - mutSignatures
##
## Total num of Signatures: 4
## Total num of MutTypes: 96
##
## Sign.1 Sign.2 Sign.3 Sign.4
## ------ ------ ------ ------
## + 0.0111 0.0007 0.0149 0.0003 + A[C>A]A
## + 0.0091 0.0006 0.0090 0.0006 + A[C>A]C
## + 0.0015 0.0001 0.0022 0.0000 + A[C>A]G
## + 0.0062 0.0003 0.0092 0.0008 + A[C>A]T
## + 0.0018 0.0003 0.0117 0.0038 + A[C>G]A
## + 0.0026 0.0003 0.0073 0.0009 + A[C>G]C
## + 0.0006 0.0002 0.0023 0.0000 + A[C>G]G
## + 0.0030 0.0006 0.0117 0.0039 + A[C>G]T
## + 0.0295 0.0074 0.0218 0.0015 + A[C>T]A
## + 0.0143 0.0027 0.0128 0.0000 + A[C>T]C
## ...... ...... ...... ......
## Mutation Signatures object - mutSignatures
##
## Total num of Signatures: 4
## Total num of MutTypes: 96
##
## Sign.1 Sign.2 Sign.3 Sign.4
## ------ ------ ------ ------
## + 0.0019 0.0020 0.0072 0.0059 + A[C>A]A
## + 0.0003 0.0073 0.0111 0.0105 + A[C>A]C
## + 0.0001 0.0064 0.0049 0.0069 + A[C>A]G
## + 0.0006 0.0038 0.0071 0.0061 + A[C>A]T
## + 0.0007 0.0028 0.0055 0.0118 + A[C>G]A
## + 0.0017 0.0002 0.0021 0.0060 + A[C>G]C
## + 0.0017 0.0028 0.0062 0.0029 + A[C>G]G
## + 0.0024 0.0031 0.0015 0.0089 + A[C>G]T
## + 0.0019 0.0082 0.0135 0.0155 + A[C>T]A
## + 0.0028 0.0119 0.0171 0.0100 + A[C>T]C
## ...... ...... ...... ......
As discussed above, exposures to mutational signatures can be plotted
(barplots), and results can be exported using the
mutSignatures::data.frame
method.
# Retrieve exposures (results)
blca.exp.1x <- blca.expo1$Results$count.result
blca.exp.2x <- blca.expo2$Results$count.result
# Plot exposures
bp1 <- msigPlot(blca.exp.1x, main = "BLCA | COSMIC sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
bp2 <- msigPlot(blca.exp.2x, main = "BLCA | pre-blca sigs.") +
scale_fill_manual(values = c("#fdbf6f", "#e31a1c", "#fb9a99", "#1f78b4"))
# Visualize
grid.arrange(bp1, bp2, ncol = 2)
# Compare sigs
# Export Exposures as data.frame
xprt <- as.data.frame(blca.exp.1x, transpose = TRUE)
head(xprt) %>% round() %>% kable() %>%
kable_styling(bootstrap_options = "striped")
COSMIC.1 | COSMIC.2 | COSMIC.5 | COSMIC.13 | |
---|---|---|---|---|
BLCA.001103810 | 32 | 29 | 23 | 30 |
BLCA.001289938 | 42 | 24 | 0 | 32 |
BLCA.001362782 | 50 | 119 | 24 | 228 |
BLCA.001427035 | 79 | 35 | 14 | 74 |
BLCA.001575414 | 49 | 49 | 0 | 40 |
BLCA.001698493 | 34 | 6 | 53 | 0 |
More info and other examples can be found at the following URLs:
Our Scientific Reports paper describing the latest version of the software: MutSignatures: An R Package for Extraction and Analysis of Cancer Mutational Signatures, https://www.nature.com/articles/s41598-020-75062-0/. Please, cite this paper if you use mutSignatures in your work! Thanks.
Oncogene paper: Mutational Signatures operative in bladder cancer, https://www.nature.com/articles/s41388-017-0099-6/
More tutorials and vignettes about the mutSignatures R library: https://www.data-pulse.com/dev_site/mutsignatures/
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mutSignatures_2.1.5 foreach_1.5.2 gridExtra_2.3
## [4] ggplot2_3.5.1 kableExtra_1.4.0 reshape2_1.4.4
## [7] dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 generics_0.1.3 xml2_1.3.6 stringi_1.8.4
## [5] pracma_2.4.4 digest_0.6.37 magrittr_2.0.3 evaluate_1.0.3
## [9] grid_4.4.2 iterators_1.0.14 fastmap_1.2.0 doParallel_1.0.17
## [13] plyr_1.8.9 jsonlite_1.8.9 viridisLite_0.4.2 scales_1.3.0
## [17] codetools_0.2-20 jquerylib_0.1.4 cli_3.6.3 rlang_1.1.5
## [21] munsell_0.5.1 withr_3.0.2 cachem_1.1.0 yaml_2.3.10
## [25] parallel_4.4.2 tools_4.4.2 colorspace_2.1-1 buildtools_1.0.0
## [29] vctrs_0.6.5 R6_2.5.1 proxy_0.4-27 lifecycle_1.0.4
## [33] stringr_1.5.1 cluster_2.1.8 pkgconfig_2.0.3 pillar_1.10.1
## [37] bslib_0.9.0 gtable_0.3.6 glue_1.8.0 Rcpp_1.0.14
## [41] systemfonts_1.2.1 xfun_0.50 tibble_3.2.1 tidyselect_1.2.1
## [45] rstudioapi_0.17.1 sys_3.4.3 knitr_1.49 farver_2.1.2
## [49] htmltools_0.5.8.1 labeling_0.4.3 rmarkdown_2.29 svglite_2.1.3
## [53] maketools_1.3.1 compiler_4.4.2
Thanks for your interest in our software. If you run into any issue,
bug, or you have a question about mutSignatures
, feel free
to email [email protected]. At this time,
mutSignatures is mainly supported by DF in his free
time (so, please, be patient upon your inquires).
C-2020 (November-01, 2020).