Single-cell experiments are often performed on tissues containing many cell
types. Monocle 3 provides a simple set of functions you can use to group your
cells according to their gene expression profiles into
In this section, you will learn how to cluster cells using Monocle 3. We will
demonstrate the main functions used for clustering with the
You can load the data into Monocle 3 like this:
Now that the data's all loaded up, we need to
It's a good idea to check that you're using enough PCs to capture most of
the variation in gene expression across all the cells in the data set. You
can look at the fraction of variation explained by each PC using
plot_pc_variance_explained()
:
We can see that using more than 100 PCs would capture only a small amount of additional variation, and each additional PC makes downstream steps in Monocle slower.
Now we're ready to visualize the cells. To do so, you can use either
t-SNE,
which is very popular in single-cell RNA-seq, or UMAP,
which is increasingly common. Monocle 3 uses UMAP by default, as we feel that it is both faster and better suited for
clustering and trajectory analysis in RNA-seq. To reduce the dimensionality of the data down into the X, Y plane so
we can plot it easily, call reduce_dimension()
:
To plot the data, use Monocle's main plotting function, plot_cells()
:
Each point in the plot above represents a different cell in the cell_data_set
object cds
. As you can see the cells form many groups, some with thousands of
cells, some with only a few. Cao & Packer annotated each cell according to type manually by looking
at which genes it expresses. We can color the cells in the UMAP plot by the authors' original annotations
using the color_cells_by
argument to plot_cells()
.
You can see that many of the cell types land very close to one another in the UMAP plot.
Except for a few cases described in a moment, color_cells_by
can be the name of any column in
colData(cds)
. Note that when color_cells_by
is a categorical variable, labels are
added to the plot, with each label positioned roughly in the middle of all the cells that have that label.
You can also color your cells according to how much of a gene or set of genes they express:
If you have a relatively large dataset (with >10,000 cells or more), you may want to take advantage
of options that can accelerate UMAP. Passing umap.fast_sgd=TRUE
to
reduce_dimension()
will use a fast stochastic gradient descent method inside of UMAP.
If your computer has multiple cores, you can use the cores
argument to make UMAP multithreaded.
However, invoking reduce_dimension()
with either of these options will make it produce slighly
different output each time you run it. If this is acceptable to you, you could see signifant reductions in
the running time of reduction_dimension()
.
If you want, you can also use t-SNE to visualize your data. First, call reduce_dimension with
reduction_method="tSNE"
.
Then, when you call plot_cells()
, pass reduction_method="tSNE"
to it as well:
You can actually use UMAP and t-SNE on the same cds
object - one won't overwrite the results of the
other. But you must specify which one you want in downstream functions like plot_cells
.
When performing gene expression analysis, it's important to check for
You should always check for batch effects when you perform dimensionality reduction. You should add a column to
the colData
that encodes which batch each cell is from. Then you can simply color the cells by batch.
Cao & Packer et al included a "plate" annotation in their data, which specifies which sci-RNA-seq plate each cell
originated from. Coloring the UMAP by plate reveals:
Dramatic batch effects are not evident in this data. If the data contained more substantial variation due to plate, we'd expect to see groups of cells that really only come from one plate. Nevertheless, we can try and remove what batch effect is by running the align_cds()
function:
When run with the alignment_group
argument, align_cds()
tries to remove batch effects using mutual nearest neighbor alignment, a technique introduced by John Marioni's lab. Monocle 3 does so by calling Aaron Lun's excellent package batchelor. If you use align_cds()
, be sure to call get_citations()
to see how you should cite the software on which Monocle depends.
Grouping cells into clusters is an important step in identifying the cell types represented in your data. Monocle uses a technique called community detection to group cells. This approach was introduced by Levine et al as part of the phenoGraph algorithm. You can cluster your cells using the cluster_cells()
function, like this:
Note that now when we call plot_cells()
with no arguments, it colors the cells by
cluster according to default.
The cluster_cells()
also divides the cells into larger, more well separated groups called
Once you run cluster_cells()
, the plot_cells()
function will label each cluster of cells
is labeled separately according to how you want to color the cells. For example, the call below colors the cells
according to their cell type annotation, and each cluster is labeled according the most common annotation within it:
You can choose to label whole partitions instead of clusters by passing group_cells_by="partition"
.
You can also plot the top 2 labels per cluster by passing labels_per_group=2
to plot_cells()
.
Finally, you can disable this labeling policy, making plot_cells()
behave like it did before we called
cluster_cells()
, like this:
Once cells have been clustered, we can ask what genes makes them different from one another. To do that, start by calling
the top_markers()
function:
The data frame marker_test_res
contains a number of metrics for how specifically expressed each gene
is in each partition. We could group the cells according to cluster, partition, or any categorical variable in
colData(cds)
. You can rank the table according to one or more of the specificity metrics and take the top
gene for each cluster. For example, pseudo_R2
is one such measure. We can rank markers according to
pseudo_R2
like this:
Now, we can plot the expression and fraction of cells that express each marker in each group with the
plot_genes_by_group
function:
It's often informative to look at more than one marker, which you can do just by changing the first argument to
top_n()
:
There are many ways to compare and contrast clusters (and other groupings) of cells. We will explore them in great detail in the differential expression analysis section a bit later.
Identifying the type of each cell in your dataset is critical for many downstream analyses. There are several ways of
doing this. One commonly used approach is to first cluster the cells and then assign a cell type to each cluster based
on its gene expression profile. We've already seen how to use top_markers()
. Reviewing literature associated
with a marker gene often give strong indications of the identity of clusters that express it. In Cao & Packer
colData(cds)$cao_cell_type
.
To assign cell types based on clustering, we begin by creating a new column in colData(cds)
and
initialize it with the values of partitions(cds)
(can also use clusters(cds) depending on your dataset):
Now, we can use the dplyr
package's recode()
function to
remap each cluster to a different cell type:
Let's see how the new annotations look:
Partition 7 has some substructure, and it's not obvious just from looking at the output of top_markers()
what cell type or types it corresponds to. So we can isolate it with the choose_cells()
function for
further analysis:
Now we have a smaller cell_data_set
object that contains just the cells from the partition we'd like to
drill into. We can use graph_test()
to identify genes that are differentially expressed in different
subsets of cells from this partition:
We will learn more about graph_test()
in the
differential expression analysis
section later. We can take all the genes that vary across this set of cells and group those that have similar patterns
of expression into
Plotting these modules' aggregate expression values reveals which cells express which modues.
You can explore the genes in each module or conduct gene ontology enrichment analysis on them to glean insights about which cell types are present. Suppose after doing this we have a good idea of what the cell types in the partition are. Let's recluster the cells at finer resolution and then see how they overlap with the clusters in the partition:
Based on how the patterns line up, we'll make the following assignments:
Now we can transfer the annotations from the cds_subset
object back to the full dataset. We'll also filter
out low-quality cells at this stage
The above process for manually annotating cells by type can be laborious, and must be re-done if the underlying cluster changes. We recently developed Garnett, a software toolkit for automatically annotating cells. Garnett classifies cells based on marker genes. If you've gone through the trouble of annotated your cells manually, Monocle can generate a file of marker genes that can be used with Garnett. This will help you annotate other datasets in the future or reannotate this one if you refine your analysis and update your clustering in the future.
To generate a Garnett file, first find the top markers that each annotated cell type expresses:
Next, filter these markers according to how stringent you want to be:
Then call generate_garnett_marker_file
:
generate_garnett_marker_file
The marker files produced by generate_garnett_marker_file()
are just a starting point for classifying your
cells with Garnett. You may want to edit this file to add or remove markers based on literature or other information. You
also should consider defining subtypes of cells, which can greatly increase the usefulness and accuracy of Garnett. For
example, the L2 data contains many different types of neurons. Making a "Neuron" cell type in the file above and then using
the subtype of
keyword to organize the various subtypes of neurons will make Garnett more able to recognize
them and distinguish them from non-neuronal cell types. When two or more of your cell types share most of their top markers
in plot_genes_by_group()
, consider defining a broader cell type definition of which they are both subtypes.
You might also want to define markers for the various subtypes of neurons by subsetting the cds
object above
and running top_markers()
just on them. See the Garnett
documentation for more on how you can enrich your marker
files.
When you're ready run Garnett, load the package:
Garnett was originally written to work with Monocle 2. We have created a branch of Garnett that works with Monocle 3, which will eventually replace the main branch. In the meantime, you must install and load the Monocle 3 branch of Garnett!
Now train a Garnett classifier based on your marker file like this:
Now that we've trained a classifier worm_classifier
, we can use it to annotate the L2 cells according to type:
Here's how Garnett annotated the cells:
Garnett classifiers can be applied to datasets other than the one they were trained on. We strongly encourage you to share your Garnett files and include them with your papers so that others can use them.
As part of writing a paper about Garnett, we trained a Garnett model to classify classify_cells()
function: