Introduction

In this tutorial, we demonstrate how to use Monocle 3 (alpha version) to cluster very large datasets and then identify marker genes specific to each cluster. We will mainly introduce: 1) using the DelayedArray package to facilitate calculations in functions such as estimateSizeFactors, estimateDispersions, and preprocessCDS for large datasets; 2) applying UMAP, a very promising non-linear dimension reduction method, to learn a low-dimensional representation of the data; 3) a general differential test (principalGraphTest) to identify genes that are spatially autocorrelated on the low-dimensional embedding; 4) the find_cluster_markers function to identify cluster-specific genes; and 5) various visualization functions, including the dot plot, gene expression over the low-dimensional embedding, and the marker heatmap, to visualize marker gene expression specificity.

Note that this mouse cell atlas (MCA) dataset (Han, Xiaoping, et al. "Mapping the mouse cell atlas by Microwell-seq." Cell 172.5 (2018): 1091-1107.) has also been used as a tutorial by Seurat from Rahul Satija's group (https://satijalab.org/seurat/mca.html), but our analysis here is complementary to theirs.

In [1]:
rm(list = ls()) # clear the environment 
#load all the necessary libraries 
options(warn=-1) # turn off warning message globally 

# install.packages("flexclust")
# install.packages("mcclust")

suppressMessages(library(reticulate))

# py_install("r-reticulate", "louvain")

suppressMessages(library(devtools))
suppressMessages(load_all("/Users/andyshehmohajeri/Documents/monocle-dev-7-19/monocle-dev"))
suppressMessages(library(flexclust))
suppressMessages(library(mcclust))

Make sure the Python package louvain is installed so that we can run Louvain clustering at different resolutions (see below). Please also make sure the newest umap-learn (version 0.3) is installed. Please run the following code:

py_install('umap-learn', pip = T, pip_ignore_installed = T) # ensure umap-learn 0.3 is installed
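If the louvain Python module is not already available in your environment, it can be installed the same way (a minimal sketch; the exact call depends on how your reticulate Python environment is set up):

py_install('louvain', pip = T) # install the Python louvain module used for clustering below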
In [2]:
library(reticulate)
import("louvain")
Module(louvain)

Create a cds object for the MCA dataset

In the following, we load a preprocessed sparse matrix and a cell metadata data frame and then create the CellDataSet object for downstream analysis. We thank the Seurat development team for providing the processed data (https://satijalab.org/seurat/mca.html)!

In [3]:
MCA <- readRDS("/Users/andyshehmohajeri/Downloads/MCA_merged_mat.rds")
cell.meta.data <- read.csv("/Users/andyshehmohajeri/Downloads/MCA_All-batch-removed-assignments.csv", row.names = 1)

overlapping_cells <- intersect(row.names(cell.meta.data), colnames(MCA))
gene_ann <- data.frame(gene_short_name = row.names(MCA), row.names = row.names(MCA))

pd <- new("AnnotatedDataFrame",data=cell.meta.data[overlapping_cells, ])
fd <- new("AnnotatedDataFrame",data=gene_ann)

MCA_cds <- newCellDataSet(MCA[, overlapping_cells], phenoData = pd,featureData =fd,
                           expressionFamily = negbinomial.size(),
                           lowerDetectionLimit=1)

Estimate size factors and dispersions

As usual, we will now calculate library size factors so we can account for variation in the number of mRNAs per cell when we look for differentially expressed genes. We will also fit a dispersion model, which will be used by Monocle's differential expression algorithms. These operations can take a lot of memory when run on very large datasets, so we use the DelayedArray package to split them up into more manageable blocks. You can control the block size as shown below: the larger the value of DelayedArray.block.size, the faster these steps will go, but the more memory they will use. If you run out of memory, the whole thing will take forever or possibly crash R entirely.

In [4]:
# MCA_cds <- MCA_cds[, Matrix::colSums(exprs(MCA_cds)) > 1000 & Matrix::colSums(exprs(MCA_cds) > 0) > 500]
DelayedArray:::set_verbose_block_processing(TRUE)
options(DelayedArray.block.size=1000e6)
MCA_cds <- estimateSizeFactors(MCA_cds)
MCA_cds <- estimateDispersions(MCA_cds)
FALSE
Processing block 1/41 ... OK
Processing block 2/41 ... OK
Processing block 3/41 ... OK
Processing block 4/41 ... OK
Processing block 5/41 ... OK
Processing block 6/41 ... OK
Processing block 7/41 ... OK
Processing block 8/41 ... OK
Processing block 9/41 ... OK
Processing block 10/41 ... OK
Processing block 11/41 ... OK
Processing block 12/41 ... OK
Processing block 13/41 ... OK
Processing block 14/41 ... OK
Processing block 15/41 ... OK
Processing block 16/41 ... OK
Processing block 17/41 ... OK
Processing block 18/41 ... OK
Processing block 19/41 ... OK
Processing block 20/41 ... OK
Processing block 21/41 ... OK
Processing block 22/41 ... OK
Processing block 23/41 ... OK
Processing block 24/41 ... OK
Processing block 25/41 ... OK
Processing block 26/41 ... OK
Processing block 27/41 ... OK
Processing block 28/41 ... OK
Processing block 29/41 ... OK
Processing block 30/41 ... OK
Processing block 31/41 ... OK
Processing block 32/41 ... OK
Processing block 33/41 ... OK
Processing block 34/41 ... OK
Processing block 35/41 ... OK
Processing block 36/41 ... OK
Processing block 37/41 ... OK
Processing block 38/41 ... OK
Processing block 39/41 ... OK
Processing block 40/41 ... OK
Processing block 41/41 ... OK
Removing 273 outliers

Reduce the dimensionality of the MCA dataset

Note that preprocessCDS may support additional initial dimensionality reduction methods in the future; currently the default method is PCA.

In [5]:
library(dplyr)

disp_table <- dispersionTable(MCA_cds)
# rank genes by how much their empirical dispersion exceeds the fitted trend
disp_table <- disp_table %>%
  mutate(excess_disp = (dispersion_empirical - dispersion_fit) / dispersion_fit) %>%
  arrange(plyr::desc(excess_disp))
top_subset_genes <- as.character(head(disp_table, 2500)$gene_id)

MCA_cds <- setOrderingFilter(MCA_cds, top_subset_genes)
MCA_cds <- preprocessCDS(MCA_cds, method = 'PCA',
                         norm_method = 'log',
                         num_dim = 50,
                         verbose = T)
MCA_cds <- reduceDimension(MCA_cds, max_components = 2,   
                       reduction_method = 'UMAP',
                       metric="correlation", 
                       min_dist = 0.75, 
                       n_neighbors = 50, 
                       verbose = T)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:igraph’:

    as_data_frame, groups, union

The following object is masked from ‘package:Biobase’:

    combine

The following objects are masked from ‘package:IRanges’:

    collapse, desc, intersect, setdiff, slice, union

The following objects are masked from ‘package:S4Vectors’:

    first, intersect, rename, setdiff, setequal, union

The following objects are masked from ‘package:BiocGenerics’:

    combine, intersect, setdiff, union

The following object is masked from ‘package:matrixStats’:

    count

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Remove noise by PCA ...
Processing block 1/5 ... OK
Processing block 2/5 ... OK
Processing block 3/5 ... OK
Processing block 4/5 ... OK
Processing block 5/5 ... OK
Retrieving normalized data ...
Running Uniform Manifold Approximation and Projection

Group cells into different clusters

Monocle 3 includes powerful new routines for clustering cells, which is often a key step in classifying cells according to type. In Monocle 2, we used density peak clustering (the default) or the Louvain clustering implemented in the igraph package to identify cell clusters in the reduced tSNE space. However, density peak clustering doesn't scale well to large datasets, and the Louvain implementation in igraph doesn't provide the flexibility to cluster cells at different resolutions.

In Monocle 3, we added support for Louvain clustering based on the flexible Python louvain package, which provides a variety of improved community detection algorithms and allows clustering at different resolutions.

In [6]:
MCA_cds <- clusterCells(MCA_cds, method = 'louvain', res = 1e-6, louvain_iter = 1, verbose = T)
Run kNN based graph clustering starts:
  -Input data of 242533 rows and 2 columns
  -k is set to 20
  Finding nearest neighbors...DONE ~ 0.954 s
  Compute jaccard coefficient between nearest-neighbor sets ...DONE ~ 0.106 s
  Build undirected graph from the weighted links ...DONE ~ 1.665 s
  Run louvain clustering on the graph ...
Running louvain iteration  1 ...
Current iteration is 1; current resolution is 1e-06; Modularity is 0.871906702724192; Number of clusters are 18
Maximal modularity is 0.871906702724192; corresponding resolution is 1e-06

Run kNN based graph clustering DONE, totally takes 36.79847407341 s.
  -Number of clusters: 18 

Here res represents the resolution parameter (ranging from 0 to 1) for Louvain clustering. Values between 0 and 1e-2 work well; larger values give you more clusters. The default is 1e-4. louvain_iter represents the number of iterations of Louvain clustering to run; the iteration with the largest modularity score is used as the final clustering result. The default is 1.
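For example, to obtain a finer clustering you could rerun clusterCells with a larger resolution (a sketch only; MCA_cds_fine is just an illustrative variable name, and the resulting number of clusters depends on your data):

MCA_cds_fine <- clusterCells(MCA_cds, method = 'louvain', res = 1e-4, louvain_iter = 1, verbose = T) # larger res typically gives more clusters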

Note that Monocle 3 also supports clustering cells in PCA space prior to UMAP embedding. The results are often very similar to those identified directly from the low-dimensional embedding. You can test this out by setting use_pca = T, as sketched below.
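For instance (a sketch using the parameter mentioned above; MCA_cds_pca is an illustrative variable name):

MCA_cds_pca <- clusterCells(MCA_cds, use_pca = T, method = 'louvain', res = 1e-6, louvain_iter = 1) # cluster in PCA space instead of the UMAP embedding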

Visualize the clustering results

Let's visualize the clustering results. Here's what the cells look like in the UMAP space:

In [7]:
col_vector_origin <- c("#db83da",
                "#53c35d",
                "#a546bb",
                "#83b837",
                "#a469e6",
                "#babb3d",
                "#4f66dc",
                "#e68821",
                "#718fe8",
                "#d6ac3e",
                "#7957b4",
                "#468e36",
                "#d347ae",
                "#5dbf8c",
                "#e53e76",
                "#42c9b8",
                "#dd454a",
                "#3bbac6",
                "#d5542c",
                "#59aadc",
                "#cf8b36",
                "#4a61b0",
                "#8b8927",
                "#a24e99",
                "#9cb36a",
                "#ca3e87",
                "#36815b",
                "#b23c4e",
                "#5c702c",
                "#b79add",
                "#a55620",
                "#5076af",
                "#e38f67",
                "#85609c",
                "#caa569",
                "#9b466c",
                "#88692c",
                "#dd81a9",
                "#a35545",
                "#e08083",
                "#17becf",
                "#9edae5")
col_vector <- col_vector_origin[1:length(unique(as.character(pData(MCA_cds)$Tissue)))]
names(col_vector) <- unique(as.character(pData(MCA_cds)$Tissue))
options(repr.plot.width = 11)
options(repr.plot.height = 8)
plot_cell_clusters(MCA_cds, color_by = 'Tissue', cell_size = 0.1, show_group_id = T)  + 
theme(legend.text=element_text(size=6)) + #set the size of the text 
theme(legend.position="right") #put the color legend on the right