Introduction

Preparations

We have selected a publicly available dataset from GEO with accession number GSE179590 which can be downloaded here. You can download the zipped data using wget or curl, e.g. wget http://kkh.bric.ku.dk/fabienne/crmetrics_testdata.tar.gz, and then unpack using tar -xvf crmetrics_testdata.tar.gz

Using Python modules

CRMetrics utilizes several Python packages. Of these, the packages for doublet detection, scrublet and DoubletDetection, can be run from R using reticulate. However, CRMetrics can provide a Python script to run the analyses directly in Python instead through the export parameter.

In order to run the analyses in R, first you should install reticulate:

install.packages("reticulate")
library(reticulate)

Also, you need to install Conda. Then, you are ready to create a Conda virtual environment. In this example, we’re on a server and we load miniconda using modules. You may need to include the conda parameter to point to wherever your Conda binary is located (in terminal, try whereis conda). In this example, we install a virtual environment for scrublet.

conda_create("scrublet", 
             conda = "/opt/software/miniconda/4.12.0/condabin/conda", 
             python_version = 3.8)

There is a known problem with openBLAS which may be different between R and Python. If this is the case, you will receive the error floating point exception and R will crash when you try to run a Python script using reticulate. In Python, the problem lies within numpy. numba requires numpy < 1.23, so force reinstall from scratch with no binaries in the scrublet Conda environment from terminal module load miniconda/4.12.0 conda activate scrublet python -m pip install numpy==1.22.0 --force-reinstall --no-binary numpy

Then, follow the instructions to install scrublet. Finally, restart your R session.

Please note, if at any point you receive an error that you can’t change the current Python instance, please remove any Python-dependent object in your environment and restart your R session.

Main use

Load the library

library(CRMetrics)
library(magrittr)
library(dplyr)

There are two ways to initialize a new object of class CRMetrics, either by providing data.path or cms. data.path is the path to a directory containing sample-wise directories with the Cell Ranger count outputs. Optionally, it can be a vector of multiple paths. cms is a (named, optional) list of (sparse, optional) count matrices.

Please note, if data.path is not provided, some functionality is lost, e.g. ambient RNA removal.

Optionally, metadata can be provided, either as a file or as a data.frame. For a file, the separator can be set with the parameter sep.meta (most often, either , (comma) or \t (tab) is used). In either format, the columns must be named and one column must be named sample and contain sample names. In combination with data.path, the sample names must match the sample directory names. Unmatched directory names are dropped.

If cms is provided, it is recommended to add summary metrices afterwards:

crm <- CRMetrics$new(cms = cms, 
                     n.cores = 10, 
                     pal = grDevices::rainbow(8),
                     theme = ggplot2::theme_bw())
crm$addSummaryFromCms()

Please note, some functionality depends on aggregation of sample and cell IDs using the sep.cell parameter. The default is !! which creates cell names in the format of <sampleID>!!<cellID>. If another separator is used, this needs to be provided in relevant function calls.

Here, the folder with our test data is stored in /data/ExtData/CRMetrics_testdata/ and we provide metadata in a comma-separated file.

crm <- CRMetrics$new(data.path = "/data/ExtData/CRMetrics_testdata/", 
                     metadata = "/data/ExtData/CRMetrics_testdata/metadata.csv", 
                     sep.meta = ",",
                     n.cores = 10,
                     verbose = FALSE,
                     pal = grDevices::rainbow(8),
                     theme = ggplot2::theme_bw())

We can review our metadata

crm$metadata
##        sample age    sex type    RIN
## 1 SRR15054421  43   male RRMS medium
## 2 SRR15054422  57   male RRMS   high
## 3 SRR15054423  52   male SPMS   high
## 4 SRR15054424  66 female SPMS medium
## 5 SRR15054425  50 female SPMS   high
## 6 SRR15054426  58 female RRMS   high
## 7 SRR15054427  56 female SPMS    low
## 8 SRR15054428  61   male SPMS   high

Plot summary statistics

We can investigate which metrics are available and choose the ones we would like to plot

crm$selectMetrics()
##    no                                        metrics
## 1   1                      estimated number of cells
## 2   2                            mean reads per cell
## 3   3                          median genes per cell
## 4   4                                number of reads
## 5   5                                 valid barcodes
## 6   6                          sequencing saturation
## 7   7                           q30 bases in barcode
## 8   8                          q30 bases in rna read
## 9   9                               q30 bases in umi
## 10 10                         reads mapped to genome
## 11 11             reads mapped confidently to genome
## 12 12 reads mapped confidently to intergenic regions
## 13 13   reads mapped confidently to intronic regions
## 14 14     reads mapped confidently to exonic regions
## 15 15      reads mapped confidently to transcriptome
## 16 16                 reads mapped antisense to gene
## 17 17                        fraction reads in cells
## 18 18                           total genes detected
## 19 19                     median umi counts per cell

Samples per condition

First, we can plot the number of samples per condition. Here, we investigate how the distribution of the sex differs between the type of MS of the samples where RRMS is short for relapsing remitting MS, and SPMS is short for secondary progressive MS.

crm$plotSummaryMetrics(comp.group = "sex", 
                       metrics = "samples per group", 
                       second.comp.group = "type",
                       plot.geom = "bar")

Metrics per sample

In one plot, we can illustrate selected metric summary stats. If no comparison group is set, it defaults to sample.

metrics.to.plot <- crm$selectMetrics(ids = c(1:4,6,18,19))
crm$plotSummaryMetrics(comp.group = "sample",
                       metrics = metrics.to.plot, 
                       plot.geom = "bar")

Metrics per condition

We can do the same, but set the comparison group to type. This will add statistics to the plots. Additionally, we can add a second comparison group for coloring.

crm$plotSummaryMetrics(comp.group = "type",
                       metrics = metrics.to.plot, 
                       plot.geom = "point", 
                       stat.test = "non-parametric",
                       second.comp.group = "sex")