Introduction
Preparations
We have selected a publicly available dataset
from GEO with accession number GSE179590 which can be downloaded here.
You can download the zipped data using wget or curl,
e.g. wget http://kkh.bric.ku.dk/fabienne/crmetrics_testdata.tar.gz
,
and then unpack using
tar -xvf crmetrics_testdata.tar.gz
Using Python modules
CRMetrics utilizes several Python packages. Of these, the packages
for doublet detection, scrublet
and
DoubletDetection
, can be run from R using
reticulate
. However, CRMetrics can provide a Python script
to run the analyses directly in Python instead through the
export
parameter.
In order to run the analyses in R, first you should install
reticulate
:
install.packages("reticulate")
library(reticulate)
Also, you need to install Conda. Then, you are ready to create a
Conda virtual environment. In this example, we’re on a server and we
load miniconda
using modules. You may need to include the
conda
parameter to point to wherever your Conda binary is
located (in terminal, try whereis conda
). In this example,
we install a virtual environment for scrublet
.
conda_create("scrublet",
conda = "/opt/software/miniconda/4.12.0/condabin/conda",
python_version = 3.8)
There is a known problem with openBLAS which may be different between
R and Python. If this is the case, you will receive the error
floating point exception
and R will crash when you try to
run a Python script using reticulate
. In Python, the
problem lies within numpy
. numba
requires
numpy
< 1.23, so force reinstall from scratch with no
binaries in the scrublet
Conda environment from terminal
module load miniconda/4.12.0
conda activate scrublet
python -m pip install numpy==1.22.0 --force-reinstall --no-binary numpy
Then, follow the instructions to install scrublet
.
Finally, restart your R session.
Please note, if at any point you receive an error that you can’t
change the current Python instance, please remove any Python-dependent
object in your environment and restart your R session.
Main use
Load the library
library(CRMetrics)
library(magrittr)
library(dplyr)
There are two ways to initialize a new object of class
CRMetrics
, either by providing data.path
or
cms
. data.path
is the path to a directory
containing sample-wise directories with the Cell Ranger count outputs.
Optionally, it can be a vector of multiple paths. cms
is a
(named, optional) list of (sparse, optional) count matrices.
Please note, if data.path
is not provided, some
functionality is lost, e.g. ambient RNA removal.
Optionally, metadata can be provided, either as a file or as a
data.frame. For a file, the separator can be set with the parameter
sep.meta
(most often, either ,
(comma) or
\t
(tab) is used). In either format, the columns must be
named and one column must be named sample
and contain
sample names. In combination with data.path
, the sample
names must match the sample directory names. Unmatched directory names
are dropped.
If cms
is provided, it is recommended to add summary
metrices afterwards:
crm <- CRMetrics$new(cms = cms,
n.cores = 10,
pal = grDevices::rainbow(8),
theme = ggplot2::theme_bw())
crm$addSummaryFromCms()
Please note, some functionality depends on aggregation of sample and
cell IDs using the sep.cell
parameter. The default is
!!
which creates cell names in the format of
<sampleID>!!<cellID>
. If another separator is
used, this needs to be provided in relevant function calls.
Here, the folder with our test data is stored in
/data/ExtData/CRMetrics_testdata/
and we provide metadata
in a comma-separated file.
crm <- CRMetrics$new(data.path = "/data/ExtData/CRMetrics_testdata/",
metadata = "/data/ExtData/CRMetrics_testdata/metadata.csv",
sep.meta = ",",
n.cores = 10,
verbose = FALSE,
pal = grDevices::rainbow(8),
theme = ggplot2::theme_bw())
We can review our metadata
crm$metadata
## sample age sex type RIN
## 1 SRR15054421 43 male RRMS medium
## 2 SRR15054422 57 male RRMS high
## 3 SRR15054423 52 male SPMS high
## 4 SRR15054424 66 female SPMS medium
## 5 SRR15054425 50 female SPMS high
## 6 SRR15054426 58 female RRMS high
## 7 SRR15054427 56 female SPMS low
## 8 SRR15054428 61 male SPMS high
Plot summary statistics
We can investigate which metrics are available and choose the ones we
would like to plot
crm$selectMetrics()
## no metrics
## 1 1 estimated number of cells
## 2 2 mean reads per cell
## 3 3 median genes per cell
## 4 4 number of reads
## 5 5 valid barcodes
## 6 6 sequencing saturation
## 7 7 q30 bases in barcode
## 8 8 q30 bases in rna read
## 9 9 q30 bases in umi
## 10 10 reads mapped to genome
## 11 11 reads mapped confidently to genome
## 12 12 reads mapped confidently to intergenic regions
## 13 13 reads mapped confidently to intronic regions
## 14 14 reads mapped confidently to exonic regions
## 15 15 reads mapped confidently to transcriptome
## 16 16 reads mapped antisense to gene
## 17 17 fraction reads in cells
## 18 18 total genes detected
## 19 19 median umi counts per cell
Samples per condition
First, we can plot the number of samples per condition. Here, we
investigate how the distribution of the sex differs between the type of
MS of the samples where RRMS is short for relapsing remitting MS, and
SPMS is short for secondary progressive MS.
crm$plotSummaryMetrics(comp.group = "sex",
metrics = "samples per group",
second.comp.group = "type",
plot.geom = "bar")
Metrics per sample
In one plot, we can illustrate selected metric summary stats. If no
comparison group is set, it defaults to sample
.
metrics.to.plot <- crm$selectMetrics(ids = c(1:4,6,18,19))
crm$plotSummaryMetrics(comp.group = "sample",
metrics = metrics.to.plot,
plot.geom = "bar")
Metrics per condition
We can do the same, but set the comparison group to
type
. This will add statistics to the plots. Additionally,
we can add a second comparison group for coloring.
crm$plotSummaryMetrics(comp.group = "type",
metrics = metrics.to.plot,
plot.geom = "point",
stat.test = "non-parametric",
second.comp.group = "sex")
Metrics per condition with >2 levels
For the sake of the example, we change the RIN
values to
low
(RIN<6), medium
(6<RIN<7), and
high
(RIN>7). This will provide us with three
comparisons groups to exemplify how to use automated statistics for such
situations.
crm$metadata$RIN %<>%
as.character() %>%
{c("medium","high","high","medium","high","high","low","high")} %>%
factor(., levels = c("low", "medium", "high"))
crm$plotSummaryMetrics(comp.group = "RIN",
metrics = metrics.to.plot,
plot.geom = "point",
stat.test = "non-parametric",
second.comp.group = "type",
secondary.testing = TRUE)
Metrics per condition with numeric covariate
We can choose a numeric comparison group, in this case
age
, which will add regression lines to the plots.
crm$plotSummaryMetrics(comp.group = "age",
metrics = metrics.to.plot,
plot.geom = "point",
second.comp.group = "type",
se = FALSE)
If the numeric vector has a significant effect on one of the metrics
we can investigate it closer by performing regression analyses for both
conditions of type
.
crm$plotSummaryMetrics(comp.group = "age",
metrics = "mean reads per cell",
plot.geom = "point",
second.comp.group = "type",
group.reg.lines = TRUE)
We see that there is no significant effect of the numeric vector on
neither of the MS types.
Add detailed metrics
We can read in count matrices to assess detailed metrics. Otherwise,
if count matrices have already been added earlier, this step prepares
data for plotting UMI and gene counts.
crm$addCms(add.metadata = FALSE)
crm$addDetailedMetrics()
We plot the detailed metrics. The horizontal lines indicates the
median values for all samples.
metrics.to.plot <- crm$detailed.metrics$metric %>%
unique()
crm$plotDetailedMetrics(comp.group = "type",
metrics = metrics.to.plot,
plot.geom = "violin")
Embed cells using Conos
In order to plot our cells in our embedding, we need to perform
preprocessing of the raw count matrices. To do this, either
pagoda2
(default) or Seurat
can be used.
crm$doPreprocessing()
Then, we create the embedding using conos
.
crm$createEmbedding()
We can now plot our cells.
crm$plotEmbedding()
Cell depth
We can plot cell depth, both in the embedding or as histograms per
sample.
crm$plotEmbedding(depth = TRUE,
depth.cutoff = 1e3)
crm$plotDepth()
We can see that the depth distribution varies between samples. We can
create a cutoff vector specifying the depth cutoff per sample. It should
be a named vector containing sample names.
depth_cutoff_vec <- c(2.5e3, 2e3, 1e3, 1.5e3, 1.5e3, 2e3, 2.5e3, 2e3) %>%
setNames(crm$detailed.metrics$sample %>%
unique() %>%
sort())
depth_cutoff_vec
## SRR15054421 SRR15054422 SRR15054423 SRR15054424 SRR15054425 SRR15054426
## 2500 2000 1000 1500 1500 2000
## SRR15054427 SRR15054428
## 2500 2000
Let’s plot the updated cutoffs:
crm$plotDepth(cutoff = depth_cutoff_vec)
Also, we can do this in the embedding:
crm$plotEmbedding(depth = TRUE,
depth.cutoff = depth_cutoff_vec)
Mitochondrial fraction
We can also investigate the mitochondrial fraction in our cells
crm$plotEmbedding(mito.frac = TRUE,
mito.cutoff = 0.05,
species = "human")
Similar as for depth, we can plot the distribution of the
mitochondrial fraction per sample and include sample-wise cutoffs (not
shown here).
crm$plotMitoFraction(cutoff = 0.05)
Remove ambient RNA
We have added functionality to remove ambient RNA from our samples.
This approach should be used with caution since it induces changes to
the UMI counts (NB: it does not overwrite the outputs from Cell Ranger).
We have included preparative steps for CellBender as
well as incorporated SoupX into
CRMetrics.
CellBender
Installation
To install, follow these
instructions. It is highly recommended to run
CellBender
using GPU acceleration. If you are more
comfortable installing through reticulate
in R, these lines
should be run:
library(reticulate)
conda_create("cellbender",
conda = "/opt/software/miniconda/4.12.0/condabin/conda",
python_version = 3.7)
conda_install("cellbender",
conda = "/opt/software/miniconda/4.12.0/condabin/conda",
forge = FALSE,
channel = "anaconda",
packages = "pytables")
conda_install("cellbender",
conda = "/opt/software/miniconda/4.12.0/condabin/conda",
packages = c("pytorch","torchvision","torchaudio"),
channel = "pytorch")
Then, clone the CellBender
repository as instructed in
the manual. Here, we clone to /apps/
through
cd /apps/; git clone https://github.com/broadinstitute/CellBender.git
and then CellBender
can be installed:
conda_install("cellbender",
conda = "/opt/software/miniconda/4.12.0/condabin/conda",
pip = TRUE,
pip_options = "-e",
packages = "/apps/CellBender/")
Analysis
For CellBender
, we need to specify expected number of
cells and total droplets included (please see the manual
for additional information). As hinted in the manual, the number of
total droplets included could be expected number of cells multiplied by
3 (which we set as default). First, we plot these measures:
crm$prepareCellbender()
We could change the total droplets included for any sample. Let us
first look at the vector.
droplets <- crm$getTotalDroplets()
droplets
## SRR15054421 SRR15054422 SRR15054423 SRR15054424 SRR15054425 SRR15054426
## 16212 17652 19728 20931 14322 13044
## SRR15054427 SRR15054428
## 16842 16620
Then we change the total droplets for SRR15054424.
droplets["SRR15054424"] <- 2e4
We plot this change.
crm$prepareCellbender(shrinkage = 100,
show.expected.cells = TRUE,
show.total.droplets = TRUE,
total.droplets = droplets)
We could also multiply expected cells by 2.5 for all samples and save
this in our CRMetrics object.
crm$cellbender$total.droplets <- crm$getTotalDroplets(multiplier = 2.5)
Finally, we save a script for running CellBender
on all
our samples. To only select specific samples, use the
samples
parameter. Here, we use our modified total droplet
vector. If total.droplets
is not specified, it will use the
stored vector at crm$cellbender$total.droplets
.
crm$saveCellbenderScript(file = "/apps/cellbender_script.sh",
fpr = 0.01,
epochs = 150,
use.gpu = TRUE,
total.droplets = droplets)
We can run this script in the terminal. Here, we activate the
environment: conda activate cellbender
and we run the bash
script: sh /apps/cellbender_script.sh
Plotting
We can plot the changes in cell numbers following CellBender
estimations.
crm$plotCbCells()
We can plot the CellBender training results.
crm$plotCbTraining()
We can plot the cell probabilities.
crm$plotCbCellProbs()
We can plot the identified ambient genes per sample.
crm$plotCbAmbExp(cutoff = 0.005)
Lastly, we can plot the proportion of samples expressing ambient
genes. We see that MALAT1 is identified as an ambient gene in
all samples which
is expected.
crm$plotCbAmbGenes(cutoff = 0.005, pal = grDevices::rainbow(17))
Then, we can add the filtered CMs to our object. Additionally, it is
recommended to generate new summary metrics from the adjusted CMs which
can then be plotted.
crm$cms <- NULL
crm$addCms(cellbender = T)
crm$addSummaryFromCms()
crm$detailed.metrics <- NULL
crm$addDetailedMetrics()
SoupX
The implementation of SoupX uses the automated estimation of
contamination and correction. Please note, SoupX depends on Seurat for
import of data. Since this calculation takes several minutes, it is not
run in this vignette.
crm$runSoupX()
Then, we can plot the corrections.
crm$plotSoupX()
Then, we can add the SoupX adjusted CMs to our object. Additionally,
it is recommended to generate new summary metrics from the adjusted CMs
which can then be plotted.
crm$cms <- NULL
crm$addCms(cms = crm$soupx$cms.adj,
unique.names = TRUE,
sep = "!!")
## Warning in crm$addCms(cms = crm$soupx$cms.adj, unique.names = TRUE, sep = "!!"): Overwriting metadata
## Warning in crm$addCms(cms = crm$soupx$cms.adj, unique.names = TRUE, sep = "!!"): Consider updating detailed metrics by setting $detailed.metrics <- NULL and running $addDetailedMetrics().
## Warning in crm$addCms(cms = crm$soupx$cms.adj, unique.names = TRUE, sep = "!!"): Consider updating embedding by setting $cms.preprocessed <- NULL and $con <- NULL, and running $doPreprocessing() and $createEmbedding().
## Warning in crm$addCms(cms = crm$soupx$cms.adj, unique.names = TRUE, sep = "!!"): Consider updating doublet scores by setting $doublets <- NULL and running $detectDoublets().
crm$addSummaryFromCms()
## Warning in crm$addSummaryFromCms(): Overwriting existing summary metrics
crm$detailed.metrics <- NULL
crm$addDetailedMetrics()
Doublet detection
For doublet detection, we included the possibility to do so using the
Python modules scrublet
and DoubletDetection
.
Here, we use virtual environments where we installed each method.
scrublet
is the default method, which is fast.
DoubletDetection
is significantly slower, but performs
better according to this
review. Here, we show how to run scrublet
and
DoubletDetection
to compare in the next section. Since this
takes some time, the results have been precalculated and are not run in
this vignette.
crm$detectDoublets(env = "scrublet",
conda.path = "/opt/software/miniconda/4.12.0/condabin/conda",
method = "scrublet")
crm$detectDoublets(env = "doubletdetection",
conda.path = "/opt/software/miniconda/4.12.0/condabin/conda",
method = "doubletdetection")
It is also possible to generate a Python script to run each method
from the terminal. To do this, set export = TRUE
and run
python <METHOD>.py
inside the virtual environment to
generate data. Then, load data with:
crm$addDoublets()
We can plot the estimated doublets in the embedding.
crm$plotEmbedding(doublet.method = "scrublet")
crm$plotEmbedding(doublet.method = "doubletdetection")
And we can plot the scores for the doublet estimations.
crm$plotEmbedding(doublet.method = "scrublet",
doublet.scores = TRUE)
crm$plotEmbedding(doublet.method = "doubletdetection",
doublet.scores = TRUE)
Differences between methods
We can compare how much scrublet
and
DoubletDetection
overlap in their doublets estimates.
First, let us plot a bar plot of the number of doublets per sample.
scrub.res <- crm$doublets$scrublet$result %>%
select(labels, sample) %>%
mutate(method = "scrublet")
dd.res <- crm$doublets$doubletdetection$result %>%
select(labels, sample) %>%
mutate(labels = as.logical(labels),
method = "DoubletDetection")
dd.res[is.na(dd.res)] <- FALSE
plot.df <- rbind(scrub.res,
dd.res) %>%
filter(labels) %>%
group_by(sample, method) %>%
summarise(count = n())
ggplot(plot.df, aes(sample, count, fill = method)) +
geom_bar(stat = "identity", position = position_dodge()) +
crm$theme +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
labs(x = "", y = "No. doublets", fill = "Method", title = "Doublets per sample") +
scale_fill_manual(values = crm$pal)
We can also show the total number of doublets detected per
method.
plot.df %>%
group_by(method) %>%
summarise(count = sum(count)) %>%
ggplot(aes(method, count, fill = method)) +
geom_bar(stat = "identity") +
crm$theme +
guides(fill = "none") +
labs(x = "", y = "No. doublets", title = "Total doublets per method") +
scale_fill_manual(values = crm$pal)
Finally, let’s plot an embedding showing the method-wise estimations
as well as overlaps.
plot.vec <- data.frame(scrublet = scrub.res$labels %>% as.numeric(),
doubletdetection = dd.res$labels %>% as.numeric()) %>%
apply(1, \(x) if (x[1] == 0 & x[2] == 0) "Kept" else if (x[1] > x[2]) "scrublet" else if (x[1] < x[2]) "DoubletDetection" else "Both") %>%
setNames(rownames(scrub.res)) %>%
factor(levels = c("Kept","scrublet","DoubletDetection","Both"))
crm$con$plotGraph(groups = plot.vec,
mark.groups = FALSE,
show.legend = TRUE,
shuffle.colors = TRUE,
title = "Doublets",
size = 0.3) +
scale_color_manual(values = c("grey80","red","blue","black"))
Plot filtered cells
We can plot all the cells to be filtered in our embedding
crm$plotFilteredCells(type = "embedding",
depth = TRUE,
depth.cutoff = depth_cutoff_vec,
doublet.method = "scrublet",
mito.frac = TRUE,
mito.cutoff = 0.05,
species = "human")
And we can plot the cells to be filtered per sample where
combination
means a cell that has at least two filter
labels, e.g. mito
and depth
.
crm$plotFilteredCells(type = "bar",
doublet.method = "scrublet",
depth = TRUE,
depth.cutoff = depth_cutoff_vec,
mito.frac = TRUE,
mito.cutoff = 0.05,
species = "human")
Finally, we can create a tile plot with an overview of sample quality
for the different filters. NB, this is experimental and has not been
validated across datasets.
crm$plotFilteredCells(type = "tile",
doublet.method = "doubletdetection",
depth = TRUE,
depth.cutoff = depth_cutoff_vec,
mito.frac = TRUE,
mito.cutoff = 0.05,
species = "human")
We can also extract the raw numbers for plotting in other ways than
those included here
filter.data <- crm$plotFilteredCells(type = "export")
filter.data %>% head()
Filter count matrices
Finally, we can filter the count matrices to create a cleaned list to
be used in downstream applications.
crm$filterCms(depth.cutoff = depth_cutoff_vec,
mito.cutoff = 0.05,
doublets = "scrublet",
samples.to.exclude = NULL,
species = "human",
verbose = TRUE)
## Filtering based on: depth.cutoff = 2500; depth.cutoff = 2000; depth.cutoff = 1000; depth.cutoff = 1500; depth.cutoff = 1500; depth.cutoff = 2000; depth.cutoff = 2500; depth.cutoff = 2000; mito.cutoff = 0.05 and species = human; doublet method = scrublet
## 2023-07-06 15:56:04 Preparing filter
## 2023-07-06 15:56:07 Removing 6158 of 45117 cells (13.6%)
The filtered list of count matrices is stored in $cms.filtered which
can be saved on disk afterwards.
qs::qsave(crm$cms.filtered, "/data/ExtData/CRMetrics_testdata/cms_filtered.qs",
nthreads = 10)
sessionInfo()
## R version 4.2.2 (2022-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Red Hat Enterprise Linux 8.8 (Ootpa)
##
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.15.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.0.10 magrittr_2.0.3 CRMetrics_0.3.0
##
## loaded via a namespace (and not attached):
## [1] drat_0.2.3 Rtsne_0.16 ggbeeswarm_0.7.1
## [4] colorspace_2.1-0 ggsignif_0.6.4 rjson_0.2.21
## [7] ellipsis_0.3.2 dendsort_0.3.4 circlize_0.4.15
## [10] GlobalOptions_0.1.2 clue_0.3-64 rstudioapi_0.14
## [13] ggpubr_0.5.0 MatrixModels_0.5-1 urltools_1.7.3
## [16] conos_1.5.0 ggrepel_0.9.3 fansi_1.0.4
## [19] codetools_0.2-18 splines_4.2.2 R.methodsS3_1.8.2
## [22] sparseMatrixStats_1.10.0 doParallel_1.0.17 cachem_1.0.7
## [25] knitr_1.42 jsonlite_1.8.4 broom_1.0.1
## [28] cluster_2.1.4 png_0.1-8 R.oo_1.25.0
## [31] compiler_4.2.2 backports_1.4.1 assertthat_0.2.1
## [34] Matrix_1.5-3 fastmap_1.1.1 cli_3.6.1
## [37] htmltools_0.5.4 quantreg_5.94 tools_4.2.2
## [40] igraph_1.4.2 gtable_0.3.3 glue_1.6.2
## [43] Rcpp_1.0.10 carData_3.0-5 jquerylib_0.1.4
## [46] vctrs_0.5.1 nlme_3.1-160 iterators_1.0.14
## [49] sccore_1.0.3 xfun_0.39 lifecycle_1.0.3
## [52] irlba_2.3.5.1 rstatix_0.7.1 MASS_7.3-58.1
## [55] scales_1.2.1 MatrixGenerics_1.10.0 parallel_4.2.2
## [58] SparseM_1.81 RColorBrewer_1.1-3 qs_0.25.4
## [61] ComplexHeatmap_2.14.0 yaml_2.3.7 gridExtra_2.3
## [64] ggplot2_3.4.2 ggpmisc_0.5.1 sass_0.4.5
## [67] triebeard_0.3.0 Rook_1.2 S4Vectors_0.36.1
## [70] foreach_1.5.2 BiocGenerics_0.44.0 ggpp_0.5.0
## [73] shape_1.4.6 rlang_1.1.0 pkgconfig_2.0.3
## [76] matrixStats_0.63.0 RMTstat_0.3.1 evaluate_0.20
## [79] lattice_0.20-45 N2R_1.0.1 purrr_1.0.1
## [82] pagoda2_1.0.10 leidenAlg_1.0.5 cowplot_1.1.1
## [85] tidyselect_1.2.0 R6_2.5.1 IRanges_2.32.0
## [88] generics_0.1.3 DBI_1.1.3 pillar_1.9.0
## [91] withr_2.5.0 mgcv_1.8-41 survival_3.4-0
## [94] abind_1.4-5 tibble_3.2.1 crayon_1.5.2
## [97] car_3.1-1 utf8_1.2.3 RApiSerialize_0.1.2
## [100] rmarkdown_2.21 GetoptLong_1.0.5 grid_4.2.2
## [103] digest_0.6.31 tidyr_1.2.1 brew_1.0-8
## [106] R.utils_2.12.2 RcppParallel_5.1.5 stats4_4.2.2
## [109] munsell_0.5.0 stringfish_0.15.7 beeswarm_0.4.0
## [112] vipor_0.4.5 bslib_0.4.2