Differential expression analysis

Differential gene expression analysis is a common task in RNA-Seq experiments. Monocle can help you find genes that are differentially expressed between groups of cells and assesses the statistical signficance of those changes. Monocle 3 includes a powerful system for finding genes that vary across cells of different types, were collected at different developmental time points, or that have been perturbed in different ways.

There are two approaches for differential analysis in Monocle:

  • Regression analysis: using fit_models(), you can evaluate whether each gene depends on variables such as time, treatments, etc.
  • Graph-autocorrelation analysis: using graph_test(), you can find genes that vary over a trajectory or between clusters.

Monocle also comes with specialized functions for finding co-regulated modules of differentially expressed genes. Monocle also allows you to interactively interrogate specific clusters or regions of a trajectory (e.g. branch points) for genes that vary within them.

Let's examine these tools in turn.

Regression analysis

In this section, we'll explore how to use Monocle to find genes that are differentially expressed according to several different criteria. Performing differential expression analysis on all genes in a cell_data_set object can take anywhere from minutes to hours, depending on how complex the analysis is. To keep the vignette simple and fast, we'll be working with small sets of genes. Rest assured, however, that Monocle can analyze several thousands of genes even in large experiments, making it useful for discovering dynamically regulated genes during the biological process you're studying.

Let's begin with a small set of genes that we know are important in ciliated neurons to demonstrate Monocle's capabilities:

ciliated_genes <- c("che-1",
cds_subset <- cds[rowData(cds)$gene_short_name %in% ciliated_genes,]

The differential analysis tools in Monocle are extremely flexible. Monocle works by fitting a regression model to each gene. You can specify this model to account for various factors in your experiment (time, treatment, and so on). For example, In the embryo data, the cells were collected at different time points. We can test whether any of the genes above change over time in their expression by first fitting a generalized linear model to each one:

\begin{equation} log(y_i) = \beta_0 + \beta_t x_t \end{equation}

where $y_i$ is a random variable corresponding to the expression values of gene $i$, $x_t$ is the time each cell was collected (in minutes), and the $\beta_t$ capture the effect of time on expression, and $\beta_0$ is an intercept term. We can identify genes that vary over time by fitting this model to each one, and then testing whether its $\beta_t$ is significantly different from zero. To do so, we first call the fit_models() function:

gene_fits <- fit_models(cds_subset, model_formula_str = "~embryo.time")

gene_fits is a tibble that contains a row for each gene. The model column contains generalized linear model objects, each of which aims to explain the expression of a gene across the cells using the equation above. The parameter model_formula_str should be a string specifying the model formula. The model formulae you use in your tests can include any term that exists as a column in the colData table, including those columns that are added by Monocle in other analysis steps. For example, if you use cluster_cells, you can test for genes that differ between clusters and partitions by using ~cluster or ~partition (respectively) as your model formula. You can also include multiple variables, for example ~embryo.time + batch, which can be very helpful for subtracting unwanted effects.

Now let's see which of these genes have time-dependent expression. First, we extract a table of coefficients from each model using the coefficient_table() function:

fit_coefs <- coefficient_table(gene_fits)

fit_coefs looks like this:

id gene_short_name num_cells_expressed status term estimate std_err test_val p_value normalized_effect model_component q_value
WBGene00001820 ham-1 1618 OK (Intercept) 1.174944 0.0798585 14.7128 3.45e-48 0 count 1.38e-47
WBGene00001820 ham-1 1618 OK embryo.time -0.0044819 0.0002364 -18.9621 5.52e-78 -0.00644606244572239 count 2.76e-77
WBGene00007058 dmd-6 873 OK (Intercept) -4.7629322 0.1310131 -36.3546 1.9e-262 0 count 1.14e-261
WBGene00007058 dmd-6 873 OK embryo.time 0.0069775 0.0002214 31.5177 2.47e-202 0.00464572663641225 count 1.482e-201
WBGene00001961 hlh-17 51 OK (Intercept) -5.0452735 0.4754091 -10.6125 4.37e-26 0 count 8.74e-26
WBGene00001961 hlh-17 51 OK embryo.time 0.0018918 0.0009967 1.898 0.0577 0.00106972293168327 count 0.0577
WBGene00003605 nhr-6 133 OK (Intercept) -5.856329 0.3123133 -18.7515 2.39e-76 0 count 1.195e-75
WBGene00003605 nhr-6 133 OK embryo.time 0.0053983 0.0005571 9.6903 4.75e-22 0.00173648950379359 count 1.9e-21
WBGene00000457 ceh-36 1123 OK (Intercept) -1.4275416 0.1258996 -11.3387 1.65e-29 0 count 4.95e-29
WBGene00000457 ceh-36 1123 OK embryo.time 0.0022045 0.0002597 8.4898 2.57e-17 0.00305328700191416 count 7.71e-17
WBGene00000483 che-1 292 OK (Intercept) -1.0230766 0.2036804 -5.023 5.23e-07 0 count 5.23e-07
WBGene00000483 che-1 292 OK embryo.time -0.0020443 0.0005311 -3.8495 0.00012 -0.00286940056906084 count 0.00024

Note that the table includes one row for each term of each gene's model. We generally don't care about the intercept term $\beta_0$, so we can easily just extract the time terms:

emb_time_terms <- fit_coefs %>% filter(term == "embryo.time")

Now, let's pull out the genes that have a significant time component. coefficient_table() tests whether each coefficient differs significantly from zero under the Wald test. By default, coefficient_table() adjusts these p-values for multiple hypothesis testing using the method of Benjamini and Hochberg. These adjusted values can be found in the q_value column. We can filter the results and control the false discovery rate as follows:

emb_time_terms %>% filter (q_value < 0.05) %>%
         select(gene_short_name, term, q_value, estimate)
gene_short_name term q_value estimate
ham-1 embryo.time 2.76e-77 -0.0044819
dmd-6 embryo.time 1.482e-201 0.0069775
nhr-6 embryo.time 1.9e-21 0.0053983
ceh-36 embryo.time 7.71e-17 0.0022045
che-1 embryo.time 0.00024 -0.0020443

We can see that five of the six genes significantly vary as a function of time.

Monocle also provides some easy ways to plot the expression of a small set of genes grouped by the factors you use during differential analysis. This helps you visualize the differences revealed by the tests above. One type of plot is a "violin" plot.

plot_genes_violin(cds_subset, group_cells_by="embryo.time.bin", ncol=2) +
      theme(axis.text.x=element_text(angle=45, hjust=1))

Controlling for batch effects and other factors

gene_fits <- fit_models(cds_subset, model_formula_str = "~embryo.time + batch")
fit_coefs <- coefficient_table(gene_fits)
fit_coefs %>% filter(term != "(Intercept)") %>%
      select(gene_short_name, term, q_value, estimate)
gene_short_name term q_value estimate
ham-1 embryo.time 3.645e-30 -0.0033071
ham-1 batchWaterston_400_minutes 0.02444 0.2081798
ham-1 batchWaterston_500_minutes_batch_1 0.0752 -0.2915104
ham-1 batchWaterston_500_minutes_batch_2 2.712e-09 -1.3399132
ham-1 batchMurray_r17 1 -0.0282036
ham-1 batchMurray_b01 0.376 -0.192458
ham-1 batchMurray_b02 0.1285 -0.4719288
dmd-6 embryo.time 1.074e-124 0.0068932
dmd-6 batchWaterston_400_minutes 0.000732 2.7842421
dmd-6 batchWaterston_500_minutes_batch_1 2.485e-06 3.6380532
dmd-6 batchWaterston_500_minutes_batch_2 0.000995 2.7043763
dmd-6 batchMurray_r17 0.01095 2.2870417
dmd-6 batchMurray_b01 0.001902 2.621274
dmd-6 batchMurray_b02 0.001332 2.7209527
hlh-17 embryo.time 0.1808 -0.0021732
hlh-17 batchWaterston_400_minutes 0.9834 15.0971566
hlh-17 batchWaterston_500_minutes_batch_1 0.9807 17.4913067
hlh-17 batchWaterston_500_minutes_batch_2 0.9809 17.3903823
hlh-17 batchMurray_r17 1 15.7524311
hlh-17 batchMurray_b01 1 16.2552263
hlh-17 batchMurray_b02 1 0.4047871
nhr-6 embryo.time 2.708e-11 0.0050927
nhr-6 batchWaterston_400_minutes 0.1512 2.5443605
nhr-6 batchWaterston_500_minutes_batch_1 0.0752 3.0097966
nhr-6 batchWaterston_500_minutes_batch_2 0.201 2.4127551
nhr-6 batchMurray_r17 0.42 2.2002814
nhr-6 batchMurray_b01 0.48 2.0580199
nhr-6 batchMurray_b02 0.1587 2.6081939
ceh-36 embryo.time 1.929e-10 0.0022543
ceh-36 batchWaterston_400_minutes 0.000732 0.7040218
ceh-36 batchWaterston_500_minutes_batch_1 2.034e-07 1.0919273
ceh-36 batchWaterston_500_minutes_batch_2 0.342 0.3063349
ceh-36 batchMurray_r17 0.528 0.3259252
ceh-36 batchMurray_b01 1 0.0628626
ceh-36 batchMurray_b02 1 -0.1126079
che-1 embryo.time 0.666 -0.0002872
che-1 batchWaterston_400_minutes 0.24 0.3021055
che-1 batchWaterston_500_minutes_batch_1 0.0752 -0.6486185
che-1 batchWaterston_500_minutes_batch_2 0.00644 -1.291998
che-1 batchMurray_r17 0.01032 -1.5233989
che-1 batchMurray_b01 0.48 -0.4155791
che-1 batchMurray_b02 0.1285 -2.4638001

Evaluating models of gene expression

How good are these models at "explaining" gene expression? We can evaluate the fits of each model using the evaluate_fits() function:

id gene_short_name num_cells_expressed status null_deviance df.null logLik AIC BIC deviance df_residual
WBGene00001820 ham-1 1618 OK 13655.7654578325 6187 NA NA NA 11893.0882454786 6180
WBGene00007058 dmd-6 873 OK 7030.86130128015 6187 NA NA NA 4718.1063335795 6180
WBGene00001961 hlh-17 51 OK 883.667871513098 6187 NA NA NA 792.617506155167 6180
WBGene00003605 nhr-6 133 OK 1800.16220335684 6187 NA NA NA 1593.59947339399 6180
WBGene00000457 ceh-36 1123 OK 17125.8121504849 6187 NA NA NA 16048.5463210961 6180
WBGene00000483 che-1 292 OK 7024.02456305247 6187 NA NA NA 6639.71745553004 6180

Should we include the batch term in our model of gene expression or not? Monocle provides a function compare_models() that can help you decide. Compare models takes two models and returns the result of a likelihood ratio test between them. Any time you add terms to a model, it will improve the fit. But we should always to use the simplest model we can to explain our data. The likelihood ratio test helps us decide whether the improvement in fit is large enough to justify the complexity our extra terms introduce. You run compare_models() like this:

time_batch_models <- fit_models(cds_subset,
                                model_formula_str = "~embryo.time + batch",
time_models <- fit_models(cds_subset,
                          model_formula_str = "~embryo.time",
compare_models(time_batch_models, time_models) %>% select(gene_short_name, q_value)

The first of the two models is called the full model. This model is essentially a way of predicting the expression value of each gene in a given cell knowing both what time it was collected and which batch of cells it came from. The second model, called the reduced model, does the same thing, but it only knows about the time each cell was collected. Because the full model has more information about each cell, it will do a better job of predicting the expression of the gene in each cell. The question Monocle must answer for each gene is how much better the full model's prediction is than the reduced model's. The greater the improvement that comes from knowing the batch of each cell, the more significant the result of the likelihood ratio test.

As we can see, all of the genes' likelihood ratio tests are significant, indicating that there are substantial batch effects in the data. We are therefore justified in adding the batch term to our model.

gene_short_name p_value
ham-1 3.73572311938246e-20
dmd-6 7.10722667535268e-37
hlh-17 1.04335923260778e-05
nhr-6 0.00594883637080634
ceh-36 6.22590118281485e-10
che-1 5.05815654718081e-06

Choosing a distribution for modeling gene expression

Monocle uses generalized linear models to capture how a gene's expression depends on each variable in the experiment. These models require you to specify a distribution that describes gene expression values. Most studies that use this approach to analyze their gene expression data use the negative binomial distribution, which is often appropriate for sequencing read or UMI count data. The negative binomial is at the core of many packages for RNA-seq analysis, such as DESeq2.

Monocle's fit_models() supports the negative binomial distribution and several others listed in the table below. The default is the "quasipoisson", which is very similar to the negative binomial. Quasipoisson is a a bit less accurate than the negative binomial but much faster to fit, making it well suited to datasets with thousands of cells.

There are several allowed values for expression_family:

expression_family Distribution Accuracy Speed Notes
quasipoisson Quasi-poisson ++ ++ Default for fit_models(). Recommended for most users.
negbinomial Negative binomial +++ + Recommended for users with small datasets (fewer than 1,000 cells).
poisson Poisson - +++ Not recommended. For debugging and testing only.
binomial Binomial ++ ++ Recommended for single-cell ATAC-seq

Likelihood based analysis and quasipoisson

The quasi-poisson distribution doesn't have a real likelihood function, so some of Monocle's methods won't work with it. Several of the columns in results tables from evaluate_fits() and compare_models() will be NA.

Graph-autocorrelation analysis for comparing clusters

In the L2 worm data, we identified a number of clusters that were very distinct as neurons:

# reload and reprocess the data as described in the 'Clustering and classifying your cells' section
expression_matrix <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_expression.rds"))
cell_metadata <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_colData.rds"))
gene_annotation <- readRDS(url("http://staff.washington.edu/hpliner/data/cao_l2_rowData.rds"))

# Make the CDS object
cds <- new_cell_data_set(expression_matrix,
                         cell_metadata = cell_metadata,
                         gene_metadata = gene_annotation)
cds <- preprocess_cds(cds, num_dim = 100)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds, resolution=1e-5)
colData(cds)$assigned_cell_type <- as.character(partitions(cds))
colData(cds)$assigned_cell_type <- dplyr::recode(colData(cds)$assigned_cell_type,
                                                "2"="Body wall muscle",
                                                "3"="Unclassified neurons",
                                                "4"="Vulval precursors",
                                                "5"="Failed QC",
                                                "6"="Seam cells",
                                                "7"="Pharyngeal epithelia",
                                                "9"="Am/PH sheath cells",
                                                "10"="Failed QC",
                                                "11"="Touch receptor neurons",
                                                "12"="Intestinal/rectal muscle",
                                                "13"="Pharyngeal neurons",
                                                "15"="flp-1(+) interneurons",
                                                "16"="Canal associated neurons",
                                                "17"="Ciliated sensory neurons",
                                                "18"="Other interneurons",
                                                "19"="Pharyngeal gland",
                                                "20"="Failed QC",
                                                "21"="Ciliated sensory neurons",
                                                "22"="Oxygen sensory neurons",
                                                "23"="Ciliated sensory neurons",
                                                "24"="Ciliated sensory neurons",
                                                "25"="Ciliated sensory neurons",
                                                "26"="Ciliated sensory neurons",
                                                "27"="Oxygen sensory neurons",
                                                "28"="Ciliated sensory neurons",
                                                "29"="Unclassified neurons",
                                                "30"="Socket cells",
                                                "31"="Failed QC",
                                                "32"="Pharyngeal gland",
                                                "33"="Ciliated sensory neurons",
                                                "34"="Ciliated sensory neurons",
                                                "35"="Ciliated sensory neurons",
                                                "36"="Failed QC",
                                                "37"="Ciliated sensory neurons",
                                                "38"="Pharyngeal muscle")

Subset just the neurons:

neurons_cds <- cds[,grepl("neurons", colData(cds)$assigned_cell_type, ignore.case=TRUE)]
plot_cells(neurons_cds, color_cells_by="partition")

There are many subtypes of neurons, so perhaps the different neuron clusters correspond to different subtypes. To investigate which genes are expressed differentially across the clusters, we could use the regression analysis tools discussed above. However, Monocle provides an alternative way of finding genes that vary between groups of cells in UMAP or t-SNE space. The function graph_test() uses a statistic from spatial autocorrelation analysis called Moran's I, which Cao & Spielmann et al showed to be effective in finding genes that vary in single-cell RNA-seq datasets.

You can run graph_test() like this:

pr_graph_test_res <- graph_test(neurons_cds, neighbor_graph="knn", cores=8)
pr_deg_ids <- row.names(subset(pr_graph_test_res, q_value < 0.05))

The data frame pr_graph_test_res has the Moran's I test results for each gene in the cell_data_set. If you'd like to rank the genes by effect size, sort this table by the morans_Icolumn, which ranges from -1 to +1. A value of 0 indicates no effect, while +1 indicates perfect positive autocorrelation and suggests that nearby cells have very similar values of a gene's expression. Significant values much less than zero are generally rare.

Positive values indicate a gene is expressed in a focal region of the UMAP space (e.g. specific to one or more clusters). But how do we associate genes with clusters? The next section explains how to collect genes into modules that have similar patterns of expression and associate them with clusters.

Finding modules of co-regulated genes

Once you have a set of genes that vary in some interesting way across the clusters, Monocle provides a means of grouping them into modules. You can call find_gene_modules(), which essentially runs UMAP on the genes (as opposed to the cells) and then groups them into modules using Louvain community analysis:

gene_module_df <- find_gene_modules(neurons_cds[pr_deg_ids,], resolution=1e-2)

The data frame gene_module_df contains a row for each gene and identifies the module it belongs to. To see which modules are expressed in which clusters or partitions you can use two different approaches for visualization. The first is just to make a simple table that shows the aggregate expression of all genes in each module across all the clusters. Monocle provides a simple utility function called aggregate_gene_expression for this purpose:

cell_group_df <- tibble::tibble(cell=row.names(colData(neurons_cds)), 
agg_mat <- aggregate_gene_expression(neurons_cds, gene_module_df, cell_group_df)
row.names(agg_mat) <- stringr::str_c("Module ", row.names(agg_mat))
colnames(agg_mat) <- stringr::str_c("Partition ", colnames(agg_mat))

pheatmap::pheatmap(agg_mat, cluster_rows=TRUE, cluster_cols=TRUE,
                   scale="column", clustering_method="ward.D2",

Some modules are highly specific to certain partitions of cells, while others are shared across multiple partitions. Note that aggregate_gene_expression can work with arbitrary groupings of cells and genes. You're not limited to looking at modules from find_gene_modules(), clusters(), and partitions().

The second way of looking at modules and their expression is to pass gene_module_df directly to plot_cells(). If there are many modules, it can be hard to see where each one is expressed, so we'll just look at a subset of them:

           genes=gene_module_df %>% filter(module %in% c(8, 28, 33, 37)),

Finding genes that change as a function of pseudotime

Identifying the genes that change as cells progress along a trajectory is a core objective of this type of analysis. Knowing the order in which genes go on and off can inform new models of development. For example, Sharon and Chawla et al recently analyzed pseudotime-dependent genes to arrive at a whole new model of how islets form in the pancreas.

Let's return to the embryo data:

           color_cells_by = "cell.type",

How do we find the genes that are differentially expressed on the different paths through the trajectory? How do we find the ones that are restricted to the beginning of the trajectory? Or excluded from it?

Once again, we turn to graph_test(), this time passing it neighbor_graph="principal_graph", which tells it to test whether cells at similar positions on the trajectory have correlated expression:

ciliated_cds_pr_test_res <- graph_test(cds, neighbor_graph="principal_graph", cores=4)
pr_deg_ids <- row.names(subset(ciliated_cds_pr_test_res, q_value < 0.05))

Here are a couple of interesting genes that score as highly significant according to graph_test():

plot_cells(cds, genes=c("hlh-4", "gcy-8", "dac-1", "oig-8"),

As before, we can collect the trajectory-variable genes into modules:

gene_module_df <- find_gene_modules(cds[pr_deg_ids,], resolution=c(0,10^seq(-6,-1)))

Here we plot the aggregate module scores within each group of cell types as annotated by Packer & Zhu et al:

cell_group_df <- tibble::tibble(cell=row.names(colData(cds)), 
agg_mat <- aggregate_gene_expression(cds, gene_module_df, cell_group_df)
row.names(agg_mat) <- stringr::str_c("Module ", row.names(agg_mat))
                   scale="column", clustering_method="ward.D2")

We can also pass gene_module_df to plot_cells() as we did when we compared clusters in the L2 data above.

           genes=gene_module_df %>% filter(module %in% c(27, 10, 7, 30)),

Monocle offers another plotting function that can sometimes give a clearer view of a gene's dynamics along a single path. You can select a path with choose_cells() or by subsetting the cell data set by cluster, cell type, or other annotation that's restricted to the path. Let's pick one such path, the AFD cells:

AFD_genes <- c("gcy-8", "dac-1", "oig-8")
AFD_lineage_cds <- cds[rowData(cds)$gene_short_name %in% AFD_genes,
                       colData(cds)$cell.type %in% c("AFD")]

The function plot_genes_in_pseudotime() takes a small set of genes and shows you their dynamics as a function of pseudotime:


You can see that dac-1 is activated before the other two genes.

Analyzing branches in single-cell trajectories

Analyzing the genes that are regulated around trajectory branch points often provides insights into the genetic circuits that control cell fate decisions. Monocle can help you drill into a branch point that corresponds to a fate decision in your system. Doing so is as simple as selecting the cells (and branch point) of interest with choose_cells():

cds_subset <- choose_cells(cds)

And then calling graph_test() on the subset. This will identify genes with interesting patterns of expression that fall only within the region of the trajectory you selected, giving you a more refined and relevant set of genes.

subset_pr_test_res <- graph_test(cds_subset, neighbor_graph="principal_graph", cores=4)
pr_deg_ids <- row.names(subset(subset_pr_test_res, q_value < 0.05))

Grouping these genes into modules can reveal fate specific genes or those that are activate immediate prior to or following the branch point:

gene_module_df <- find_gene_modules(cds_subset[pr_deg_ids,], resolution=0.001)

We will organize the modules by their similarity (using hclust) over the trajectory so it's a little easier to see which ones come on before others:

agg_mat <- aggregate_gene_expression(cds_subset, gene_module_df)
module_dendro <- hclust(dist(agg_mat))
gene_module_df$module <- factor(gene_module_df$module, 
                                levels = row.names(agg_mat)[module_dendro$order])

Previous Next