Differential gene expression analysis is a common task in RNA-Seq experiments. Monocle can help you find genes that are differentially expressed between groups of cells and assesses the statistical signficance of those changes. Monocle 3 includes a powerful system for finding genes that vary across cells of different types, were collected at different developmental time points, or that have been perturbed in different ways.
There are two approaches for differential analysis in Monocle:
fit_models()
, you can evaluate whether each gene depends on
variables such as time, treatments, etc.
graph_test()
, you can find genes that vary over a
trajectory or between clusters.
Monocle also comes with specialized functions for finding co-regulated modules of differentially expressed genes. Monocle also allows you to interactively interrogate specific clusters or regions of a trajectory (e.g. branch points) for genes that vary within them.
Let's examine these tools in turn.
In this section, we'll explore how to use Monocle to find genes that are differentially expressed according to several different criteria. Performing differential expression analysis on all genes in a cell_data_set object can take anywhere from minutes to hours, depending on how complex the analysis is. To keep the vignette simple and fast, we'll be working with small sets of genes. Rest assured, however, that Monocle can analyze several thousands of genes even in large experiments, making it useful for discovering dynamically regulated genes during the biological process you're studying.
Let's begin with a small set of genes that we know are important in ciliated neurons to demonstrate Monocle's capabilities:
The differential analysis tools in Monocle are extremely flexible. Monocle works by fitting a
where $y_i$ is a random variable corresponding to the expression values of gene $i$, $x_t$ is the time each cell was
collected (in minutes), and the $\beta_t$ capture the effect of time on expression, and $\beta_0$ is an intercept term. We
can identify genes that vary over time by fitting this model to each one, and then testing whether its $\beta_t$ is
significantly different from zero. To do so, we first call the fit_models()
function:
gene_fits
is a tibble that contains a row for each gene. The
model
column contains generalized linear model objects, each of which aims to explain the expression of a gene
across the cells using the equation above. The parameter model_formula_str
should be a string specifying
the model formula. The model formulae you use in your tests can include any term that exists as a column in the
colData
table, including those columns that are added by Monocle in other analysis steps. For example, if you
use cluster_cells
, you can test for genes
that differ between clusters and partitions by using ~cluster
or ~partition
(respectively) as
your model formula. You can also include multiple variables, for example ~embryo.time + batch
, which can be
very helpful for subtracting unwanted effects.
Now let's see which of these genes have time-dependent expression. First, we extract a table of coefficients from each
model using the coefficient_table()
function:
fit_coefs
looks like this:
id | gene_short_name | num_cells_expressed | status | term | estimate | std_err | test_val | p_value | normalized_effect | model_component | q_value |
---|---|---|---|---|---|---|---|---|---|---|---|
WBGene00001820 | ham-1 | 1618 | OK | (Intercept) | 1.09 | 0.0665 | 16.4 | 4.58e-59 | 0 | count | 1.83e-58 |
WBGene00001820 | ham-1 | 1618 | OK | embryo.time | -0.00388 | 0.000211 | -18.4 | 9.18e-74 | -0.00558 | count | 4.59e-73 |
WBGene00007058 | dmd-6 | 873 | OK | (Intercept) | -5.07 | 0.123 | -41.3 | 0 | 0 | count | 0 |
WBGene00007058 | dmd-6 | 873 | OK | embryo.time | 0.0075 | 0.000209 | 35.9 | 3.39e-256 | 0.00418 | count | 2.03e-255 |
WBGene00001961 | hlh-17 | 51 | OK | (Intercept) | -6.04 | 0.501 | -12.1 | 4.26e-33 | 0 | count | 8.52e-33 |
WBGene00001961 | hlh-17 | 51 | OK | embryo.time | 0.00334 | 0.00103 | 3.23 | 0.00123 | 0.00093 | count | 0.00247 |
WBGene00003605 | nhr-6 | 133 | OK | (Intercept) | -6.09 | 0.287 | -21.2 | 2.35e-96 | 0 | count | 1.17e-95 |
WBGene00003605 | nhr-6 | 133 | OK | embryo.time | 0.00579 | 0.000523 | 11.1 | 2.8e-28 | 0.00155 | count | 8.41e-28 |
WBGene00000457 | ceh-36 | 1123 | OK | (Intercept) | -1.63 | 0.105 | -15.5 | 3.02e-53 | 0 | count | 9.05e-53 |
WBGene00000457 | ceh-36 | 1123 | OK | embryo.time | 0.00279 | 0.000224 | 12.5 | 3.62e-35 | 0.00382 | count | 1.45e-34 |
WBGene00000483 | che-1 | 292 | OK | (Intercept) | -1.56 | 0.185 | -8.43 | 4.27e-17 | 0 | count | 4.27e-17 |
WBGene00000483 | che-1 | 292 | OK | embryo.time | -0.000419 | 0.000481 | -0.871 | 0.384 | -0.000577 | count | 0.384 |
Note that the table includes one row for each
Now, let's pull out the genes that have a significant time component. coefficient_table()
tests
whether each coefficient differs significantly from zero under the
Wald test. By default, coefficient_table()
adjusts these
p-values for multiple hypothesis testing
using the method of
Benjamini and Hochberg. These adjusted values can be found
in the q_value
column. We can filter the results and control the false discovery rate as follows:
gene_short_name | term | q_value | estimate |
---|---|---|---|
ham-1 | embryo.time | 4.59e-73 | -0.00388 |
dmd-6 | embryo.time | 2.03e-255 | 0.0075 |
hlh-17 | embryo.time | 0.00247 | 0.00334 |
nhr-6 | embryo.time | 8.41e-28 | 0.00579 |
ceh-36 | embryo.time | 1.45e-34 | 0.00279 |
We can see that five of the six genes significantly vary as a function of time.
Monocle also provides some easy ways to plot the expression of a small set of genes grouped by the factors you use during differential analysis. This helps you visualize the differences revealed by the tests above. One type of plot is a "violin" plot.
By default, the violin plot log-scales the expression, which drops cells with zero expression, resulting in potentially misleading figures. The Monocle3 develop branch has a hybrid plot where cells appear as red dots in a Sina plot and the cell distribution appears as a histogram with green bars and a blue median_qi interval.
gene_short_name | term | q_value | estimate |
---|---|---|---|
ham-1 | embryo.time | 3.65e-27 | -0.00276 |
ham-1 | batchWaterston_400_minutes | 0.00228 | 0.212 |
ham-1 | batchWaterston_500_minutes_batch_1 | 0.00535 | -0.357 |
ham-1 | batchWaterston_500_minutes_batch_2 | 4.24e-09 | -1.4 |
ham-1 | batchMurray_r17 | 0.702 | 0.0932 |
ham-1 | batchMurray_b01 | 0.234 | -0.187 |
ham-1 | batchMurray_b02 | 0.168 | -0.451 |
dmd-6 | embryo.time | 2.61e-150 | 0.00713 |
dmd-6 | batchWaterston_400_minutes | 1.08e-06 | 2.18 |
dmd-6 | batchWaterston_500_minutes_batch_1 | 3.28e-12 | 3 |
dmd-6 | batchWaterston_500_minutes_batch_2 | 5.86e-06 | 2.06 |
dmd-6 | batchMurray_r17 | 0.00108 | 1.65 |
dmd-6 | batchMurray_b01 | 2.64e-05 | 1.96 |
dmd-6 | batchMurray_b02 | 1.46e-05 | 2.12 |
hlh-17 | embryo.time | 0.262 | -0.00155 |
hlh-17 | batchWaterston_400_minutes | 1 | 16 |
hlh-17 | batchWaterston_500_minutes_batch_1 | 0.987 | 18.3 |
hlh-17 | batchWaterston_500_minutes_batch_2 | 0.986 | 18.7 |
hlh-17 | batchMurray_r17 | 0.988 | 16.1 |
hlh-17 | batchMurray_b01 | 1 | 17.3 |
hlh-17 | batchMurray_b02 | 1 | 1.13 |
nhr-6 | embryo.time | 2.97e-13 | 0.00532 |
nhr-6 | batchWaterston_400_minutes | 0.0458 | 3.01 |
nhr-6 | batchWaterston_500_minutes_batch_1 | 0.0152 | 3.49 |
nhr-6 | batchWaterston_500_minutes_batch_2 | 0.0974 | 2.7 |
nhr-6 | batchMurray_r17 | 0.264 | 2.38 |
nhr-6 | batchMurray_b01 | 0.262 | 2.19 |
nhr-6 | batchMurray_b02 | 0.128 | 2.94 |
ceh-36 | embryo.time | 7.22e-13 | 0.00222 |
ceh-36 | batchWaterston_400_minutes | 0.00087 | 0.523 |
ceh-36 | batchWaterston_500_minutes_batch_1 | 2.32e-13 | 1.14 |
ceh-36 | batchWaterston_500_minutes_batch_2 | 0.132 | 0.341 |
ceh-36 | batchMurray_r17 | 0.702 | 0.21 |
ceh-36 | batchMurray_b01 | 1 | 0.0985 |
ceh-36 | batchMurray_b02 | 1 | -0.227 |
che-1 | embryo.time | 0.0449 | 0.00143 |
che-1 | batchWaterston_400_minutes | 1 | 0.0917 |
che-1 | batchWaterston_500_minutes_batch_1 | 0.0152 | -0.752 |
che-1 | batchWaterston_500_minutes_batch_2 | 0.000384 | -1.63 |
che-1 | batchMurray_r17 | 0.000266 | -1.91 |
che-1 | batchMurray_b01 | 0.0291 | -0.831 |
che-1 | batchMurray_b02 | 0.168 | -2.48 |
How good are these models at "explaining" gene expression? We can evaluate the fits of each model using the
evaluate_fits()
function:
id | gene_short_name | num_cells_expressed | status | null_deviance | df.null | logLik | AIC | BIC | deviance | df_residual |
---|---|---|---|---|---|---|---|---|---|---|
WBGene00001820 | ham-1 | 1618 | OK | 2.1e+04 | 6187 | NA | NA | NA | 1.45e+04 | 6180 |
WBGene00007058 | dmd-6 | 873 | OK | 6.37e+03 | 6187 | NA | NA | NA | 4.1e+03 | 6180 |
WBGene00001961 | hlh-17 | 51 | OK | 642 | 6187 | NA | NA | NA | 600 | 6180 |
WBGene00003605 | nhr-6 | 133 | OK | 1.66e+03 | 6187 | NA | NA | NA | 1.44e+03 | 6180 |
WBGene00000457 | ceh-36 | 1123 | OK | 1.8e+04 | 6187 | NA | NA | NA | 1.56e+04 | 6180 |
WBGene00000483 | che-1 | 292 | OK | 8.85e+03 | 6187 | NA | NA | NA | 7.91e+03 | 6180 |
Should we include the batch term in our model of gene expression or not? Monocle provides a function
compare_models()
that can help you decide. Compare models takes two models and returns the result of a
likelihood ratio test between them. Any time you add
terms to a model, it will improve the fit. But we should always to use the simplest model we can to explain our data. The
likelihood ratio test helps us decide whether the improvement in fit is large enough to justify the complexity our extra
terms introduce. You run compare_models()
like this:
The first of the two models is called the full model. This model is essentially a way of predicting the expression value of each gene in a given cell knowing both what time it was collected and which batch of cells it came from. The second model, called the reduced model, does the same thing, but it only knows about the time each cell was collected. Because the full model has more information about each cell, it will do a better job of predicting the expression of the gene in each cell. The question Monocle must answer for each gene is how much better the full model's prediction is than the reduced model's. The greater the improvement that comes from knowing the batch of each cell, the more significant the result of the likelihood ratio test.
As we can see, all of the genes' likelihood ratio tests are significant, indicating that there are substantial batch effects
in the data. We are therefore justified in adding the batch
term to our model.
gene_short_name | p_value |
---|---|
ham-1 | 4.46e-18 |
dmd-6 | 8.46e-50 |
hlh-17 | 2.48e-06 |
nhr-6 | 7.09e-05 |
ceh-36 | 3.84e-11 |
che-1 | 2.97e-06 |
Monocle uses generalized linear models to capture how a gene's expression depends on each variable in the experiment. These models require you to specify a distribution that describes gene expression values. Most studies that use this approach to analyze their gene expression data use the negative binomial distribution, which is often appropriate for sequencing read or UMI count data. The negative binomial is at the core of many packages for RNA-seq analysis, such as DESeq2.
Monocle's fit_models()
supports the negative binomial distribution and several others listed in the table below.
The default is the "quasipoisson", which is very similar to
the negative binomial. Quasipoisson is a a bit less accurate than the negative binomial but much faster to fit, making it
well suited to datasets with thousands of cells.
There are several allowed values for expression_family
:
expression_family | Distribution | Accuracy | Speed | Notes |
---|---|---|---|---|
quasipoisson |
Quasi-poisson | ++ | ++ | Default for fit_models() . Recommended for most users. |
negbinomial |
Negative binomial | +++ | + | Recommended for users with small datasets (fewer than 1,000 cells). |
poisson |
Poisson | - | +++ | Not recommended. For debugging and testing only. |
binomial |
Binomial | ++ | ++ | Recommended for single-cell ATAC-seq |
evaluate_fits()
and compare_models()
will be NA
.
In the L2 worm data, we identified a number of clusters that were very distinct as neurons:
Subset just the neurons:
There are many subtypes of neurons, so perhaps the different neuron clusters correspond to different subtypes. To
investigate which genes are expressed differentially across the clusters, we could use the regression analysis tools
discussed above. However, Monocle provides an alternative way of finding genes that vary between groups of cells in UMAP
space. The function graph_test()
uses a statistic from spatial autocorrelation analysis called
Moran's I, which Cao & Spielmann
You can run graph_test()
like this:
The data frame pr_graph_test_res
has the Moran's I test results for each gene in the cell_data_set
.
If you'd like to rank the genes by effect size, sort this table by the morans_I
column, which ranges from -1 to +1.
A value of 0 indicates no effect, while +1 indicates perfect positive autocorrelation and suggests that nearby cells have
very similar values of a gene's expression. Significant values much less than zero are generally rare.
Positive values indicate a gene is expressed in a focal region of the UMAP space (e.g. specific to one or more clusters). But how do we associate genes with clusters? The next section explains how to collect genes into modules that have similar patterns of expression and associate them with clusters.
Once you have a set of genes that vary in some interesting way across the clusters, Monocle provides a means of grouping them
into modules. You can call find_gene_modules()
, which essentially runs UMAP on the genes (as opposed to the cells)
and then groups them into modules using Louvain community analysis:
The data frame gene_module_df
contains a row for each gene and identifies the module it belongs to. To see which
modules are expressed in which clusters or partitions you can use two different approaches for visualization. The first is just
to make a simple table that shows the aggregate expression of all genes in each module across all the clusters. Monocle
provides a simple utility function called aggregate_gene_expression
for this purpose:
Some modules are highly specific to certain partitions of cells, while others are shared across multiple partitions. Note that
aggregate_gene_expression
can work with arbitrary groupings of cells and genes. You're not limited to looking at
modules from find_gene_modules()
, clusters()
, and partitions()
.
The second way of looking at modules and their expression is to pass gene_module_df
directly to
plot_cells()
. If there are many modules, it can be hard to see where each one is expressed, so we'll just look
at a subset of them:
Identifying the genes that change as cells progress along a trajectory is a core objective of this type of analysis. Knowing the order in which genes go on and off can inform new models of development. For example, Sharon and Chawla et al recently analyzed pseudotime-dependent genes to arrive at a whole new model of how islets form in the pancreas.
Let's return to the embryo data, which we processed using the commands
How do we find the genes that are differentially expressed on the different paths through the trajectory? How do we find the ones that are restricted to the beginning of the trajectory? Or excluded from it?
Once again, we turn to graph_test()
, this time passing it neighbor_graph="principal_graph"
,
which tells it to test whether cells at similar positions on the trajectory have correlated expression:
Here are a couple of interesting genes that score as highly significant according to graph_test()
:
As before, we can collect the trajectory-variable genes into modules:
Here we plot the aggregate module scores within each group of cell types as annotated by Packer & Zhu
We can also pass gene_module_df
to plot_cells()
as we did when we compared clusters in the L2
data above.
Monocle offers another plotting function that can sometimes give a clearer view of a gene's dynamics along a single path. You can select a path with choose_cells()
or by subsetting the cell data set by cluster, cell type, or other annotation that's restricted to the path.
Let's pick one such path, the AFD cells:
The function plot_genes_in_pseudotime()
takes a small set of genes and shows you their dynamics as a function of
pseudotime:
You can see that
Analyzing the genes that are regulated around trajectory branch points often provides insights into the genetic circuits
that control cell fate decisions. Monocle can help you drill into a branch point that corresponds to a fate decision in
your system. Doing so is as simple as selecting the cells (and branch point) of interest with choose_cells()
:
And then calling graph_test()
on the subset. This will identify genes with interesting patterns of
expression that fall only within the region of the trajectory you selected, giving you a more refined and relevant
set of genes.
Grouping these genes into modules can reveal fate specific genes or those that are activate immediate prior to or following the branch point:
We will organize the modules by their similarity (using hclust
) over the trajectory so it's a little easier to
see which ones come on before others: