Cicero is an R package that provides tools for analyzing single-cell chromatin accessibility experiments. The main function of Cicero is to use single-cell chromatin accessibility data to predict cis-regulatory interactions (such as those between enhancers and promoters) in the genome by examining co-accessibility. In addition, Cicero extends Monocle to allow clustering, ordering, and differential accessibility analysis of single cells using chromatin accessibility. For more information about Cicero, explore our publications.
The main purpose of Cicero is to use single-cell chromatin accessibility data to predict regions of the genome that are more likely to be in physical proximity in the nucleus. This can be used to identify putative enhancer-promoter pairs, and to get a sense of the overall stucture of the cis-architecture of a genomic region.
Because of the sparsity of single-cell data, cell must be aggregated by similarity to allow robust correction for various technical factors in the data.
Ultimately, Cicero provides a "Cicero co-accessibility" score between -1 and 1 between each pair of accessibile peaks within a user defined distance where a higher number indicates higher co-accessibility.
In addition, the Cicero package provides an extension toolkit for analyzing single-cell ATAC-seq experiments using the framework provided by Monocle. This vignette provides an overview of a single-cell ATAC-Seq analysis workflow with Cicero. For further information and more options, please see the manual pages for the Cicero R package, and our publications.
Cicero can help you perform two main types of analysis:
Before we look at Cicero's functions for each of these analysis tasks, let's see how to install Cicero.
Cicero runs in the R statistical computing environment. You will need R version 3.4 or higher and Cicero 1.0.0 or higher to have access to the latest features.
Cicero was recently added to Bioconductor. It has been released as part of Bioconductor 3.8 and can be installed with the following code:
For the absolute latest updates to the Cicero package, you can install Cicero directly from the github repository using these instructions:
Cicero builds upon a package called Monocle. Before installing Cicero, first, follow these instructions to install Monocle.
In addition to Monocle, Cicero requires the following packages from Bioconductor
You're now ready to install Cicero:
After installation, test that Cicero installed correctly by opening a new R session and typing:
Questions about Cicero should be posted on our Google Group. Please do not email technical questions to Cicero contributors directly.
The Cicero workflow is broken up into broad steps. When there's more than one way to do a certain step, we've labeled the options as follows:
Required | You need to do this. |
Recommended | Of the ways you could do this, we recommend you try this one first. |
Alternative | Of the ways you could do this, this way might work better than the one we usually recommend. |
Cicero holds data in objects of the
CellDataSet (CDS)
class. The class is derived from the Bioconductor ExpressionSet
class, which provides a common interface familiar to those who have analyzed microarray experiments with
Bioconductor. Monocle provides detailed documentation about how to generate an input CDS
here.
To modify the CDS object to hold chromatin accessibility rather than expression data, Cicero uses peaks as
its feature data fData
rather than genes or transcripts. Specifically, many Cicero functions
require peak information in the form chr1_10390134_10391134. For example, an input fData table might look
like this:
site_name | chromosome | bp1 | bp2 | |
---|---|---|---|---|
chr10_100002625_100002940 | chr10_100002625_100002940 | 10 | 100002625 | 100002940 |
chr10_100006458_100007593 | chr10_100006458_100007593 | 10 | 100006458 | 100007593 |
chr10_100011280_100011780 | chr10_100011280_100011780 | 10 | 100011280 | 100011780 |
chr10_100013372_100013596 | chr10_100013372_100013596 | 10 | 100013372 | 100013596 |
chr10_100015079_100015428 | chr10_100015079_100015428 | 10 | 100015079 | 100015428 |
The Cicero package includes a small dataset called cicero_data
as an example.
For convenience, Cicero includes a function called make_atac_cds
. This function takes as input a
data.frame
or a path to a file in a sparse matrix format. Specifically, this file should be a
tab-delimited text file with three columns. The first column is the peak coordinates in the form
"chr10_100013372_100013596", the second column is the cell name, and the third column is an integer that
represents the number of reads from that cell overlapping that peak. The file should not have a header line.
For example:
chr10_100002625_100002940 | cell1 | 1 |
---|---|---|
chr10_100006458_100007593 | cell2 | 2 |
chr10_100006458_100007593 | cell3 | 1 |
chr10_100013372_100013596 | cell2 | 1 |
chr10_100015079_100015428 | cell4 | 3 |
The output of make_atac_cds
is a valid CDS object ready to be input into downstream Cicero
functions.
Option | Default | Notes |
---|---|---|
binarize |
FALSE | Logical, should the count matrix be converted to binary? Because of the sparsity of single-cell data, we don't generally expect more than 1 read per cell per peak. We have found that converting a count matrix to a binary "is this site open in this cell" can ultimately give clearer results. |
If your scATAC-seq data comes from the 10x Genomics platform, here is an example of creating your input CDS from the output of Cell Ranger ATAC. The output of interest will be in the folder called filtered_peak_bc_matrix.
Because single-cell chromatin accessibility data is extremely sparse, accurate estimation of co-accessibility scores requires us to aggregate similar cells to create more dense count data. Cicero does this using a k-nearest-neighbors approach which creates overlapping sets of cells. Cicero constructs these sets based on a reduced dimension coordinate map of cell similarity, for example, from a tSNE or DDRTree map.
You can use any dimensionality reduction method to base your aggregated CDS on. We will show you how to create two versions, tSNE and DDRTree (below). Both of these dimensionality reduction methods are available from Monocle (and loaded by Cicero).
Once you have your reduced dimension coordinate map, you can use the function make_cicero_cds
to
create your aggregated CDS object. The input to make_cicero_cds
is your input CDS object, and
your reduced dimension coordinate map. The reduced dimension map reduced_coordinates
should be
in the form of a data.frame
or a matrix
where the row names match the cell IDs from
the pData
table of your CDS. The columns of reduced_coordinates
should be the
coordinates of the reduced dimension object, for example:
ddrtree_coord1 | ddrtree_coord2 | |
---|---|---|
cell1 | -0.7084047 | -0.7232994 |
cell2 | -4.4767964 | 0.8237284 |
cell3 | 1.4870098 | -0.4723493 |
Here is an example of both dimensionality reduction and creation of a Cicero CDS. Using monocle as a guide, we first find tSNE coordinates for our input_cds: For more information on the above code, see the Monocle website on clustering cells
Next, we access the tSNE coordinates from the input CDS object where they are stored by Monocle and run
make_cicero_cds
:
The main function of the Cicero package is to estimate the co-accessiblity of sites in the genome in order to predict cis-regulatory interactions. There are two ways to get this information:
run_cicero
: get Cicero outputs with all defaults The function
run_cicero
will call each of the relevant pieces of Cicero code using default values, and
calculating best-estimate parameters as it goes. For most users, this will be the best place to start.
run_cicero
Recommended
The easiest way to get Cicero co-accessibility scores is to run run_cicero
. To run
run_cicero
, you need a cicero CDS object (created above) and a genome coordinates file, which
contains the lengths of each of the chromosomes in your organism. The human hg19 coordinates are included
with the package and can be accessed with data("human.hg19.genome")
. Here is an example call,
continuing with our example data:
Peak1 | Peak2 | coaccess |
---|---|---|
chr18_10025_10225 | chr18_104385_104585 | 0.03791526 |
chr18_10025_10225 | chr18_10603_11103 | 0.86971957 |
chr18_10025_10225 | chr18_111867_112367 | 0.00000000 |
The alternative to calling run_cicero
is to run each piece of the Cicero pipeline in pieces. The
three functions that make up the Cicero pipeline are:
estimate_distance_parameter
. This function calculates the distance
penalty parameter based on small random windows of the genome.
generate_cicero_models
. This function uses the distance parameter
determined above and uses graphical LASSO to calculate the co-accessibility scores of overlapping windows
of the genome using a distance-based penalty.
assemble_connections
. This function takes as input the output of
generate_cicero_models
and reconciles the overlapping models to create the final list of
co-accessibility scores.
Use the manual pages in R (example: ?estimate_distance_parameter
) to see all options. Key
options are detailed below:
Option | Default | Notes |
---|---|---|
window |
500000 | The window parameter controls how large of a window (in base pairs) in the genome is used for
calculating each individual model. This parameter will control the furthest distance between sites
that are compared. This parameter must be the same as the window parameter for
generate_cicero_models . |
sample_num |
100 | This is the number of sample regions to calculate a distance parameter for. |
distance_constraint |
250000 | The distance_constraint controls the distance at which Cicero expects few meaningful
cis-regulatory contacts. estimate_distance_parameter uses this value to estimate the
distance parameter by increasing regularization until there are few contacts at a distance beyond
the distance_constraint . The distance_constraint must be
sufficiently lower than the window parameter - as a rule of thumb, set your window to be at least
1/3 larger than your distance_constraint . |
s |
0.75 | s is a constant that captures the power-law distribution of contact frequencies
between different locations in the genome as a function of their linear distance. For a complete
discussion of the various polymer models of DNA packed into the nucleus and of justifiable values
for s , we refer readers to (Dekker et al., 2013). We use a value of 0.75 by default in
Cicero, which corresponds to the “tension globule” polymer model of DNA (Sanborn et al., 2015). For
data not from humans, we recommend you read further discussion of this parameter here:
Important
Considerations for Non-Human Data. This parameter must be the same as the
s parameter for generate_cicero_models . |
Option | Default | Notes |
---|---|---|
window |
500000 | The window parameter controls how large of a window (in base pairs) in the genome is used for
calculating each individual model. This parameter will control the furthest distance between sites
that are compared. This parameter must be the same as the window parameter for
estimate_distance_parameter . |
distance_parameter |
100 | The distance_parameter controls the distance-based scaling of the regularization
parameter in graphical LASSO. This parameter is generally the mean of the sample values calculated
using estimate_distance_parameter |
s |
0.75 | See above in the documentation for estimate_distance_parameter . This parameter
must be the same as the s parameter for
estimate_distance_parameter . |
Option | Default | Notes |
---|---|---|
silent |
FALSE | Logical, whether to print summary information. |
The Cicero package includes a general plotting function for visualizing co-accessibility called
plot_connections
. This function uses the
Gviz framework for plotting
genome browser-style plots. We have adapted a function from the
Sushi R package for
mapping connections. plot_connections
has many options, some detailed in the
Advanced Visualization section, but to get a
basic plot from your co-accessibility table is quite simple:
Often, it is useful to compare Cicero connections to other datasets with similar kinds of connections. For
example, you might want to compare the output of Cicero to ChIA-PET ligations. To do this, Cicero includes a
function called compare_connections
. This function takes as input two data frames of connection
pairs, conns1
and conns2
, and returns a logical vector of connections from
conns1
found in conns2
. The comparison in this function is conducted using the
GenomicRanges
package, and uses the max_gap
argument from that package to allow slop in the comparisons.
For example, if we wanted to compare our Cicero predictions to a set of (made-up) ChIA-PET connections, we could run:
You may find that this overlap is too strict when comparing completely distinct datasets. Looking carefully,
the 2nd line of the ChIA-PET data matches fairly closely to the last line shown of conns. The difference is
only ~67 base pairs, which could be a matter of peak-calling. This is where the max_gap
parameter can be useful:
In addition, Cicero's plotting function has a way to compare datasets visually. To do this, use the
comparison_track
argument. The comparison data frame must include a third
columns beyond the first two peak columns called "coaccess". This is how the plotting function determines the
height of the plotted connections. This could be a quantitative measure, like the number of ligations in
ChIA-PET, or simply a column of 1s. More info on plotting options in manual pages
?plot_connections
and in the Advanced
Visualization section.
In addition to pairwise co-accessibility scores, Cicero also has a function to find Cis-Co-accessibility
Networks (CCANs), which are modules of sites that are highly co-accessible with one another. We use the
Louvain community detection algorithm (Blondel et al., 2008) to find clusters of sites that tended to be
co-accessible. The function generate_ccans
takes as input a connection data frame and outputs a
data frame with CCAN assignments for each input peak. Sites not included in the output data frame were not
assigned a CCAN.
The function generate_ccans
has one optional input called coaccess_cutoff_override
.
When coaccess_cutoff_override
is NULL, the function will determine and report an appropriate
co-accessibility score cutoff value for CCAN generation based on the number of overall CCANs at varying
cutoffs. You can also set coaccess_cutoff_override
to be a numeric between 0 and 1, to override
the cutoff-finding part of the function. This option is useful if you feel that the cutoff found
automatically was too strict or loose, or for speed if you are rerunning the code and know what the cutoff
will be, since the cutoff finding procedure can be slow.
We have found that often, accessibility at promoters is a poor predictor of gene expression. However, using Cicero links, we are able to get a better sense of the overall accessibility of a promoter and it's associated distal sites. This combined score of regional accessibility has a better concordance with gene expression. We call this score the Cicero gene activity score, and it is calculated using two functions.
The initial function is called build_gene_activity_matrix
. This function takes an input CDS and
a Cicero connection list, and outputs an unnormalized table of gene activity scores.
IMPORTANT: the input CDS must have a column in the fData table called "gene" which indicates
the gene if that peak is a promoter, and NA
if the peak is distal. One way to add this column is
demonstrated below.
The output of build_gene_activity_matrix
is unnormalized. It must be normalized
using a second function called normalize_gene_activities
. If you intend to compare gene
activities across different datasets of subsets of data, then all gene activity subsets should be normalized
together, by passing in a list of unnormalized matrices. If you only wish to normalized one matrix, simply
pass it to the function on its own. normalize_gene_activities
also requires a named vector of
of total accessible sites per cell. This is easily found in the pData table of your CDS, called
"num_genes_expressed". See below for an example. Normalized gene activity scores range from 0 to 1.
Here, we will show some example of plotting parameters you may want to use to further explore your data
visually. Find documentation of all of the parameters in R's manual pages using
?plot_connections
.
Viewpoints: the viewpoint
parameter lets you view only the connections
originating from a specific place in the genome. This might be useful in comparing data with 4C-seq data.
alpha_by_coaccess
: the alpha_by_coaccess
parameter lets you is
useful when you are concerned about overplotting. This parameter causes the alpha (transparency) of the
connection curves to be scaled based on the magnitude of the co-accessibility. Compare the following two
plots.
Colors: There are several parameters having to do with colors:
peak_color
comparison_peak_color
connection_color
comparison_connection_color
gene_model_color
viewpoint_color
viewpoint_fill
peak_color
and
comparison_peak_color
, the column should correspond to Peak1 color assignment.
return_as_list
We realize that you may wish to plot other kinds of data along with Cicero connections. Because of this, we
include the return_as_list
option. This parameter allows you to use the extensive functionality
of Gviz to plot many different
kinds of genomic data. See the
Gviz user guide
to see the many options.
If return_as_list
TRUE
, an image will not be plotted, and instead, a
list of plot components will be returned. Let's go back to a simpler example:
This made a list of components in the order of plotting. First is the CustomTrack, which is the Cicero connections, next is the peak annotation track, next is the genome axis track, and lastly is the gene model track. I can now add another track, rearrange tracks, or replace tracks using Gviz:
Cicero's default parameters were designed for use in human and mouse. There are a few parameters that we expect will need to be changed for use in different model organisms. Here are the parameters along with a brief discussion of each:
Parameter | Functions where used | Notes | Potential values |
---|---|---|---|
s |
estimate_distance_parameter , generate_cicero_models |
As discussed previously, the value s is the scaling exponent of the power-law
function that estimates the global decay in contact frequency. This value is generally estimated
using Hi-C contact frequencies. As a default, we have used 0.75, which corresponds to the
“tension globule” polymer model of DNA (Sanborn et al., 2015). However, in other organisms,
different decay functions have been observed. For example, in Drosophila, a scaling value of 0.85
has been proposed (Sexton et al., 2012).
|
Human and mouse: 0.75 Drosophila: 0.85 |
distance_constraint |
estimate_distance_parameter |
The value distance_constraint is the distance in base pairs at beyond which few
(less than 5%) meaningful cis-regulatory connections are expected to exist. While there are
examples of very long range cis-regulatory contacts in various organisms, this value should be
more of a general rule as Cicero can still detect longer range connections with high
co-accessibility.
|
Human and mouse: 250,000 Drosophila: 50,000 |
window |
estimate_distance_parameter , generate_cicero_models |
The window value is the length in base pairs of the sliding window that Cicero
will use when measuring co-accessibility. This value must be significantly longer (>30%) than
the value for distance_constraint . This is the maximum distance of sites that
will be compared.
|
Human and mouse: 500,000 Drosophila: 100,000 |
The second major function of the Cicero package is to extend Monocle 2 for use with single-cell accessibility data. The main obstacle to overcome with chromatin accessibility data is the sparsity, so most of the extensions and methods are designed to address that.
We strongly recommend that you consult the Monocle website, especially this section prior to reading about Cicero's extension of the Monocle analysis described. Briefly, Monocle infers pseudotime trajectories in three steps:
The primary way that the Cicero package deals with the sparsity of single-cell chromatin accessibility data is through aggregation. Aggregating the counts of either single cells or single peaks allows us to produce a "consensus" count matrix, reducing noise and allowing us to move out of the binary regime. Under this grouping, the number of cells in which a particular site is accessible can be modeled with a binomial distribution or, for sufficiently large groups, the corresponding Gaussian approximation. Modeling grouped accessibility counts as normally distributed allows Cicero to easily adjust them for arbitrary technical covariates by simply fitting a linear model and taking the residuals with respect to it as the adjusted accessibility score for each group of cells. We demonstrate how to apply this grouping practically below.
aggregate_nearby_peaks
The primary aggregation used for trajectory reconstruction is betweeen nearby peaks. This keeps single cells
separate while aggregating regions of the genome and looking for chromatin accessibility within them. The
function aggregate_nearby_peaks
finds sites within a certain distance of each other and
aggregates them together by summing their counts. Depending on the density of your data, you may want to try
different distance
parameters. In published work we have used 1000 and 10,000.
There are several options for choosing the sites to use during dimensionality reduction. Monocle has a discussion about the options here. Any of these options could be used with your new aggregated CDS, depending on the information you have a available in your dataset. Here, we will show two examples:
If your data has defined beginning and end points, you can determine which sites define progress by a
differential accessibility test. We use Monocle's differentialGeneTest
function looking for
sites that are different in the timepoint groups.
Alternatively, you can choose sites for dimensionality reduction by using Monocle's dpFeature method. dpFeature chooses sites based on how they differ among clusters of cells. Here, we give some example code reproduced from Monocle, for more information, see the Monocle description.
However you chose your ordering sites, the first step of dimensionality reduction is to use
setOrderingFilter
to mark the sites you want to use for dimesionality reduction. In the
following figures, we are using the ordering_sites from
Choose sites by differential analysis above.
Next, we use DDRTree to reduce dimensionality and then order the cells along the trajectory. Importantly, we use num_genes_expressed in our residual model formula to account for overall assay efficiency.
Once you have a cell trajectory, you need to make sure that pseudotime proceeds how you expect. In our example, we want pseudotime to start where most of the time 0 cells are located and proceed towards the later timepoints. Further information on this can be found here on the Monocle website. We first color our trajectory plot by State, which is how Monocle assigns segments of the tree.
From this plot, we can see that the beginning of pseudotime should be from state 4. We now reorder cells setting the root state to 4. We can then check that the ordering makes sense by coloring the plot by Pseudotime.
Now that we have Pseudotime values for each cell (pData(agg_cds)$Pseudotime
), we need to assign
these values back to our original CDS object. In addition, we will assign the State information back to the
original CDS.
Once you have your cells ordered in pseudotime, you can ask where in the genome chromatin accessibility is changing across time. If you know of specific sites that are important to your system, you may want to visualize the accessibility at those sites across pseudotime.
plot_accessibility_in_pseudotime
For simplicity, we will exclude the branch in our trajectory to make our trajectory linear.
differentialGeneTest
with single cell chromatin accessibility data
In the previous section, we used aggregation of sites to discover cell-level information (cell pseudotime).
In this section, we are interested in a site-level statistic (whether a site is changing in pseudotime), so
we will aggregate similar cells. To do this, Cicero has a useful function called
aggregate_by_cell_bin
.
aggregate_by_cell_bin
We use the function aggregate_by_cell_bin
to aggregate our input CDS object by a column in the
pData table. In this example, we will assign cells to bins by cutting the pseudotime trajectory into 10 parts.
We are now ready to run Monocle's differentialGeneTest function to find sites that are differentially accessible across pseudotime. In this example, we include num_genes_expressed as a covariate to subtract its effect.
annotate_cds_by_site
It is often useful to add additional annotations about your peaks to your CDS object. For example, you might
want to know which peaks overlap an exon, or a transcription start site. Cicero includes the function
annotate_cds_by_site
, which takes as input your CDS, and a data frame or file path with
bed-format information (chromosome, bp1, bp2, further columns). For example, suppose I have some data
(feat
) about epigenetic marks in my system:
I can use annotate_cds_by_site to determine which of my peaks have those epigenetic marks.
This function has added two columns compared to above. Overlap represents the number of base pairs
overlapping between the interval in feat
and the given peak, and type is the assigned status
based on the overlap. If you look closely, site chr18_10603_11103
was actually overlapped by two
intervals in feat
. In its default mode, annotate_cds_by_site
will choose the
largest overlapping interval to report. If you would like to see all overlapping intervals, you can set the
all
flag to TRUE
.
As you see above, when all
is set to TRUE
, all of the intervals that overlap are
reported.
find_overlapping_coordinates
Lastly, you may simply want to know which of your peaks overlap a specific region of the genome. For that
Cicero includes the function find_overlapping_coordinates
.
The genome_coordinates files are simply a table for chromosomes and their sizes. You can find one for your genome/build on the USCS genome browser website. Go here, find the "Full Data Set" link for your genome of interest, and find the [genome].chrom.sizes file. Download this and pass the path into the cicero function you're running! Note: some genomes have a lot of non-standard chromosomes in the sizes file (chr_unk, alternative haplotypes, etc). You may want to remove these lines from your file, which will help speed up your run.
A gene_model to pass to plotting functions can be derived from the ENSEMBL GTF for your organism/build. Download the GTF file from here - found under the "Gene sets" column. GTF files need a bit of processing to be ready for plotting:
Blondel, V.D., Guillaume, J.-L., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of communities in large networks.
Dekker, J., Marti-Renom, M.A., and Mirny, L.A. (2013). Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat. Rev. Genet. 14, 390–403.
Sanborn, A.L., Rao, S.S.P., Huang, S.-C., Durand, N.C., Huntley, M.H., Jewett, A.I., Bochkov, I.D., Chinnappan, D., Cutkosky, A., Li, J., et al. (2015). Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. . Proc. Natl. Acad. Sci. U. S. A. 112, E6456–E6465.
Sexton, T., Yaffe, E., Kenigsberg, E., Bantignies, F., Leblanc, B., Hoichman, M., Parrinello, H., Tanay, A., and Cavalli, G. (2012). Three-Dimensional Folding and Functional Organization Principles of the Drosophila Genome. . Cell 148:3, 458-472.