Differential abundance

Hooke can identify differentially abundant cell types for a given contrast. The underlying PLN model mimics the standard (G)LM-like interface of R::stats. This allows users to fit a multivariate Poisson lognormal model after correcting for effects of offsets and covariates. This framework allows users to perform multivariate statistical regression to describe how perturbations alter the relative abundances of each cell state. For example, you could model the effects of genotype (WT vs MT) while controlling for experimental batch.

Analysis of differentially abundant cell types in silicosis with Hooke

This data was published in Hasegawa, Franks, et al. bioRxiv. The authors performed a longitudinal transcriptomic analysis using scRNA-sequencing of the lungs from pre- and post-silica challenged mice.

The data includes 12 whole lung samples across pre- and post- intratracheal silica. 35 unique cell states were identified using highly and specifically expressed marker genes. For simplicity, we are splitting samples into 2 levels: exposed and not exposed. The goal is to identify which cell types are differentially abundant post-silica exposure.

cds = readRDS("silicosis_cds.rds")

# for simplicity, we are lumping together pre and post i.t. silica
colData(cds)$exposed = ifelse(colData(cds)$Timepoint == 0, "not exposed", "exposed")
colData(cds)$Rep = as.factor(colData(cds)$Rep)

plot_cells(cds, color_cells_by = "fine_annotation", label_groups_by_cluster = F)

silicosis_umap

Making a cell count set object

The first step of Hooke is to create a cell_count_set object. This requires three inputs:

cds - a monocle3 cell_data_set object
sample group - A column in colData(cds) that specifies how cells are grouped into samples
cell group - A column in colData(cds) that specifies how cells are grouped into types or states (e.g. cluster)

ccs = new_cell_count_set(cds, 
                         sample_group = "ID", 
                         cell_group = "fine_annotation")

Fitting a cell count model

The next step is to fit a cell_count_model given a model formula. This function works with the standard R formula notations.

A cell count model requires the following inputs:

ccs - a Hooke cell_count_set object
main_model_formula_str - A character string specifying the model of cell abundances across samples, where terms refer to columns in colData(ccs). Put main effects here.
nuisance_model_formula_str - A character string specifying the model of cell abundances across samples. Put nuisance effects here.

In this case we are interested in comparing silica exposed mouse lungs to unexposed lungs.

ccm  = new_cell_count_model(ccs,
                            main_model_formula_str = "~ exposed")

Estimating abundances

Next we can predict fold change values for a given contrast of interest using estimate_abundances().

ccm - a Hooke cell_count_model object
newdata - A tibble of variables used for the prediction.
min_log_abund - numeric Minimum log abundance value.

cond_exp = estimate_abundances(ccm, tibble::tibble(exposed = "exposed"))
cond_not_exp = estimate_abundances(ccm, tibble::tibble(exposed = "not exposed"))

cond_not_exp %>% head()

exposed	cell_group	log_abund	log_abund_se	log_abund_sd
not exposed	Endothelial Arterial	2.87	0.196	0.559
not exposed	Endothelial Capillary	5.31	0.0591	0.559
not exposed	Endothelial Venous	3.09	0.179	0.559
not exposed	Lymphatic Endothelial	2.63	0.213	0.559
not exposed	Aerocytes	4.06	0.110	0.559
not exposed	Fibroblasts	5.26	0.0607	0.559

Comparing abundances

And then compare two estimates of cell abundances from a Hooke model to identify significant changes with compare_abundances(). In this case, exposed vs not exposed.

ccm - a Hooke cell_count_model object
cond_x - A cell type abundance estimate from estimate_abundances().
cond_y - A cell type abundance estimate from estimate_abundances().

cond_ne_v_e_tbl = compare_abundances(ccm, cond_not_exp, cond_exp)

cond_ne_v_e_tbl %>% select(cell_group, exposed_x, exposed_y,
                           delta_log_abund, delta_log_abund_se, delta_q_value)

cell_group	delta_log_abund	delta_log_abund_se	delta_q_value
Endothelial Arterial	-0.584	0.196	0.0774
Endothelial Capillary	-0.400	0.0769	0.0131
Endothelial Venous	-0.353	0.145	0.130
Lymphatic Endothelial	0.0529	0.330	0.989
Aerocytes	-0.354	0.130	0.102
Fibroblasts	-0.477	0.102	0.0131

Plotting differential abundance changes on a UMAP

Finally, we can color our UMAP by the fold changes we estimate in compare_abundances() using the plot_contrast() function. We can select fold changes that are signifiant given a q_value_thresh.

ccm - A Hooke cell_count_model object.
cond_b_vs_a_tbl - A data frame from compare_abundances().
log_abundance_thresh - numeric Select cell groups by log abundance.
q_value_thresh - Remove contrasts whose change in q-value exceeds q_value_thresh.

plot_contrast(ccm, cond_ne_v_e_tbl, q_value_thresh = 0.05)

silicosis_rep

Controlling for batch terms

If your data contains multiple experimental batch, Hooke supports batch correction with the use of a nuisance model string term.

ccm_rep  = new_cell_count_model(ccs,
                            main_model_formula_str = "~ exposed",
                            nuisance_model_formula_str = "~ Rep")

cond_exp_rep = estimate_abundances(ccm_rep, tibble::tibble(exposed = "exposed", Rep = "1"))
cond_not_exp_rep = estimate_abundances(ccm_rep, tibble::tibble(exposed = "not exposed", Rep = "1"))
cond_ne_v_e_tbl_rep = compare_abundances(ccm_rep, cond_not_exp_rep, cond_exp_rep)

plot_contrast(ccm_rep, cond_ne_v_e_tbl_rep, q_value_thresh = 0.05)

If we control for replicate, we see some difference in significant fold changes.

silicosis_rep