##Install quick-start

###Required software

Monocle runs in the R statistical computing environment. You will need R version 3.1 or higher. You will also need to install Bioconductor:

> source("http://bioconductor.org/biocLite.R") 
> biocLite()

Once you’ve installed Bioconductor, you’re ready to install Monocle and all of its required dependencies:

> biocLite("monocle")

###Testing the installation

To ensure that Monocle was installed correctly, start a new R session and type:

> library(monocle)

###Installing Monocle 2

Monocle 2 is currently available through GitHub. Once it works through Bioconductor’s development-to-release cycle, which takes 6 months, it will become available via the steps above. To install it through GitHub, enter the following commands at the R console:

> install.packages("devtools")
> devtools::install_github("cole-trapnell-lab/monocle-release@monocle2")

Monocle 2 has a number of new dependencies that Monocle 1 didn’t, so you may see errors about missing packages when you run the above command. You can install the packages named in the error message by typing (for example):

> biocLite(c("DDRTree", "pheatmap"))

If you install Monocle 2, make sure to have a look at the new vignette and reference manual, as many things have changed.

##Computing expression values for single cells

To use Monocle, you must first compute the expression of each gene in each cell for your experiment. There are a number of ways to do this for RNA-Seq. We recommend using Cufflinks, but you could also use RSEM, eXpress, Sailfish, or another tool for estimating gene and transcript expression levels from aligned reads. Here, we’ll show a simplified workflow for using TopHat and Cufflinks to estimate expression. You can read more about how to use TopHat and Cufflinks to calculate expression here.

To estimate gene and transcript expression levels for single-cell RNA-Seq using TopHat and Cufflinks, you must have a file of RNA-Seq reads for each cell you captured. If you performed paired-end RNA-Seq, you should have two files for each cell. Depending on how the base calling was performed, the naming conventions for these files may differ. In the examples below, we assume that each file follows the format:

CELL_TXX_YY.RZ.fastq.gz

Here, XX is the time point at which the cell was collected in our experiment, YY is the well of the 96-well plate used during library prep, and Z is 1 for the left mate or 2 for the right mate of a paired-end sequencing run. So CELL_T24_A01.R1.fastq.gz is the left-mate file for a cell collected 24 hours into our experiment and prepped in well A01 of the 24-hour capture plate.
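
As a quick check that your files follow this naming convention, the sketch below splits a file name into its time point, well, and mate using shell parameter expansion. The function name `parse_fastq_name` is hypothetical and not part of any tool mentioned here; the sketch assumes names shaped like CELL_T24_A01.R1.fastq.gz with no directory prefix.

```shell
# Hypothetical helper: split CELL_T<time>_<well>.R<mate>.fastq.gz into its parts.
parse_fastq_name() {
    local name="${1%.fastq.gz}"        # CELL_T24_A01.R1
    local mate="${name##*.R}"          # 1 or 2
    name="${name%.R*}"                 # CELL_T24_A01
    local well="${name##*_}"           # A01
    name="${name%_*}"                  # CELL_T24
    local timepoint="${name#CELL_T}"   # 24
    echo "timepoint=${timepoint} well=${well} mate=${mate}"
}

parse_fastq_name CELL_T24_A01.R1.fastq.gz   # prints: timepoint=24 well=A01 mate=1
```

If your sequencing facility uses a different naming scheme, the expansions above would need to be adapted accordingly.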

###Aligning reads to the genome with TopHat

We begin by aligning each cell’s reads separately, so we will have one BAM file for each cell. The commands below show how to run each cell’s reads through TopHat. These alignment commands can take a while, but they can be run in parallel if you have access to a compute cluster. If so, contact your cluster administrator for more information on how to run TopHat in a cluster environment.

tophat -o CELL_T24_A01_thout -G GENCODE.gtf bowtie-hg19-idx CELL_T24_A01.R1.fastq.gz CELL_T24_A01.R2.fastq.gz 
tophat -o CELL_T24_A02_thout -G GENCODE.gtf bowtie-hg19-idx CELL_T24_A02.R1.fastq.gz CELL_T24_A02.R2.fastq.gz 
tophat -o CELL_T24_A03_thout -G GENCODE.gtf bowtie-hg19-idx CELL_T24_A03.R1.fastq.gz CELL_T24_A03.R2.fastq.gz 

The commands above show how to align the reads for each of three cells in the experiment. You will need to run a similar command for each cell you wish to include in your analysis. These TopHat alignment commands are simplified for brevity; the TopHat manual describes further options you may want to explore, such as the number of CPUs TopHat uses and other settings that control how it aligns reads. The key components of the above commands are:

  • The -o option, which sets the directory in which each cell’s output will be written.
  • The gene annotation file, specified with -G, which tells TopHat where to look for splice junctions.
  • The Bowtie index for the genome of your organism, in this case build hg19 of the human genome.
  • The read files for each cell as mentioned above.
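
Since a similar command is needed for every cell, it may be convenient to generate the commands from the read files themselves. The following is a minimal sketch, assuming the paired-end naming convention described earlier and that all FASTQ files sit in the current directory. The function name `tophat_cmd` is hypothetical, and the script only prints the commands so that you can inspect them before running anything.

```shell
# Hypothetical helper: print the TopHat command for one cell,
# given the name of its left-mate (R1) FASTQ file.
tophat_cmd() {
    local r1="$1"
    local cell="${r1%.R1.fastq.gz}"    # e.g. CELL_T24_A01
    local r2="${cell}.R2.fastq.gz"     # matching right-mate file
    echo "tophat -o ${cell}_thout -G GENCODE.gtf bowtie-hg19-idx ${r1} ${r2}"
}

# Print one command per left-mate file found in the current directory.
for r1 in CELL_*.R1.fastq.gz; do
    [ -e "$r1" ] || continue           # skip if the pattern matched nothing
    tophat_cmd "$r1"
done
```

Once the printed commands look right, they could be saved to a file and run one by one, or handed to whatever job scheduler your cluster uses.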

When the commands finish, there will be a BAM file in each cell’s TopHat output directory. For example, CELL_T24_A01_thout/accepted_hits.bam will contain the alignments for cell T24_A01.
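
With many cells, it is easy to miss an alignment that failed. The sketch below lists TopHat output directories that lack a non-empty accepted_hits.bam file. The function name `missing_bams` is hypothetical; the only assumption taken from the text above is the location of the BAM file inside each output directory.

```shell
# Hypothetical helper: print each TopHat output directory that has
# no accepted_hits.bam, or only an empty one.
missing_bams() {
    local dir
    for dir in "$@"; do
        [ -s "${dir}/accepted_hits.bam" ] || echo "$dir"
    done
}

# Check every output directory in the current directory.
for dir in CELL_*_thout; do
    [ -d "$dir" ] || continue          # skip if the pattern matched nothing
    missing_bams "$dir"
done
```

Any directory printed here belongs to a cell whose alignment should probably be rerun before moving on.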

###Computing gene expression using Cufflinks

Now, we will use Cufflinks to estimate gene expression levels for each cell in your study.

cuffquant -o CELL_T24_A01_cuffquant_out GENCODE.gtf CELL_T24_A01_thout/accepted_hits.bam 
cuffquant -o CELL_T24_A02_cuffquant_out GENCODE.gtf CELL_T24_A02_thout/accepted_hits.bam 
cuffquant -o CELL_T24_A03_cuffquant_out GENCODE.gtf CELL_T24_A03_thout/accepted_hits.bam 

The commands above show how to convert aligned reads for each cell into gene expression values for that cell. You will need to run a similar command for each cell you wish to include in your analysis. These commands are simplified for brevity; the Cufflinks manual describes further options you may want to explore, such as the number of CPUs the cuffquant utility uses and other settings that control how it estimates expression. The key components of the above commands are:

  • The -o option, which sets the directory in which each cell’s output will be written.
  • The gene annotation file, which tells cuffquant what the gene structures are in the genome.
  • The BAM file containing the aligned reads.
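
As with the alignment step, the per-cell cuffquant commands can be generated from the TopHat output directories. This is a sketch under the same assumptions as the commands above (directory names ending in _thout and an annotation file called GENCODE.gtf); the function name `cuffquant_cmd` is hypothetical, and the commands are printed rather than executed.

```shell
# Hypothetical helper: print the cuffquant command for one cell,
# given the name of its TopHat output directory.
cuffquant_cmd() {
    local thout="$1"
    local cell="${thout%_thout}"       # e.g. CELL_T24_A01
    echo "cuffquant -o ${cell}_cuffquant_out GENCODE.gtf ${thout}/accepted_hits.bam"
}

# Print one command per TopHat output directory in the current directory.
for thout in CELL_*_thout; do
    [ -d "$thout" ] || continue        # skip if the pattern matched nothing
    cuffquant_cmd "$thout"
done
```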

Next, you will need to merge the expression estimates into a single table for use with Monocle. You can do this with the following command:

cuffnorm --use-sample-sheet -o sc_expr_out GENCODE.gtf sample_sheet.txt

The option --use-sample-sheet tells cuffnorm that it should look in the file sample_sheet.txt for the expression files, to make the above command simpler. If you choose not to use a sample sheet, you will need to specify the expression files on the command line directly. The sample sheet is a tab-delimited file that looks like this:

sample_name group
CELL_T24_A01_cuffquant_out/abundances.cxb T24_A01
CELL_T24_A02_cuffquant_out/abundances.cxb T24_A02
CELL_T24_A03_cuffquant_out/abundances.cxb T24_A03
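
Writing the sample sheet by hand gets tedious with hundreds of cells, and it is easy to type spaces where tabs are required. The sketch below builds the sheet from a list of cuffquant output directories, using the same header and layout as the example above. The function name `make_sample_sheet` is hypothetical, and the group label is assumed to be the directory name without its CELL_ prefix and _cuffquant_out suffix.

```shell
# Hypothetical helper: print a tab-delimited sample sheet for the
# cuffquant output directories given as arguments.
make_sample_sheet() {
    local dir cell
    printf 'sample_name\tgroup\n'
    for dir in "$@"; do
        cell="${dir%_cuffquant_out}"   # e.g. CELL_T24_A01
        printf '%s/abundances.cxb\t%s\n' "$dir" "${cell#CELL_}"
    done
}

make_sample_sheet CELL_T24_A01_cuffquant_out CELL_T24_A02_cuffquant_out CELL_T24_A03_cuffquant_out
```

Redirecting the output, for example with `make_sample_sheet CELL_*_cuffquant_out > sample_sheet.txt`, would produce the file that the cuffnorm command above expects.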

Now, you are ready to load the expression data into Monocle and start analyzing your experiment.

##Analyzing data with Monocle

Monocle provides a number of tools you can use to analyze your single cell expression experiments. To get started, we must create a CellDataSet object. You can do this with the commands below:

> library(monocle)
> sample_sheet <- read.delim("sc_expr_out/samples.table", row.names=1)
> gene_annotations <- read.delim("sc_expr_out/genes.attr_table", row.names=1)
> fpkm_matrix <- read.delim("sc_expr_out/genes.fpkm_table", row.names=1)
> pd <- new("AnnotatedDataFrame", data = sample_sheet)
> fd <- new("AnnotatedDataFrame", data = gene_annotations)
> my_data <- newCellDataSet(as.matrix(fpkm_matrix), phenoData = pd, featureData = fd)

Now, you have created an object named “my_data” that stores your single-cell expression data. This object is the central object in Monocle. You will use it to identify differentially expressed genes and perform other analyses. To see what Monocle can do for you and how to proceed, please have a look at the vignette (PDF). Good luck!