If you are using Monocle or Monocle 2 (loading using library(monocle)), then follow the instructions on this page. If you are instead using Monocle 3 (loading using library(monocle3)), then please follow the instructions here.
NOTE: Garnett for Monocle will eventually be deprecated and will only be available for Monocle3.
Garnett is a software package that facilitates automated cell type classification from single-cell expression data. Garnett works by taking single-cell data, along with a cell type definition (marker) file, and training a regression-based classifier. Once a classifier is trained for a tissue/sample type, it can be applied to classify future datasets from similar tissues. In addition to describing training and classifying functions, this website aims to be a repository of previously trained classifiers.
Garnett runs in the R statistical computing environment. You will need R version 3.5 or higher to install Garnett.
To install Garnett directly from the GitHub repository, use the following instructions:
Garnett builds upon a package called Monocle. Before installing Garnett, first install Monocle using Bioconductor:
You're now ready to install Garnett:
After installation, test that Garnett installed correctly by opening a new R session and typing:
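The three installation steps above might look like the following. This is a sketch rather than the official install script: it assumes BiocManager is used to reach Bioconductor, that devtools handles the GitHub install, and that the repository is cole-trapnell-lab/garnett. Check the Garnett GitHub page for the current list of dependencies.

```r
# Step 1: install Monocle from Bioconductor (using BiocManager is an assumption).
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("monocle")

# Step 2: install Garnett from GitHub (the repository path is an assumption).
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
devtools::install_github("cole-trapnell-lab/garnett")

# Step 3: in a new R session, confirm that the package loads without error.
library(garnett)
```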
Questions about Garnett should be posted on our Google Group. Please do not email technical questions to Garnett contributors directly.
The Garnett workflow has two major parts, each described in detail below: (1) train or obtain a classifier, and (2) classify your cells.
There are two options for the first part of the Garnett workflow: using a pre-trained classifier, or generating your own classifier.
We have generated a series of pre-trained classifiers for various organisms and tissues. If a pre-trained classifier exists for your data type, we recommend you try it. The list of available classifiers can be found here. We hope to continually update and add new classifiers as we generate them. We also accept classifiers generated by others - please submit any classifiers you make and help build the community! Details on how to submit your classifier can be found here.
To use a pre-trained classifier, first download the classifier, and then load it into your R session using:
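A minimal sketch of the loading step, assuming the downloaded classifier is an R object serialized to an .RDS file. The helper name load_classifier and the example path are hypothetical.

```r
# Hypothetical helper: read a downloaded classifier from an .RDS file,
# failing early with a clear message if the path is wrong.
load_classifier <- function(path) {
  if (!file.exists(path)) {
    stop("Classifier file not found: ", path)
  }
  readRDS(path)
}

# Usage (hypothetical path, replace with your download location):
# classifier <- load_classifier("path/to/classifier.RDS")
```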
Once you've loaded the classifier, go on to part 2, Classifying your cells.
If a classifier doesn't exist for your tissue type, or doesn't include the cell types you expect in your data, then you'll need to generate your own.
The first step to train your classifier is to load your single-cell data.
Because Garnett builds on Monocle, data for Garnett is held in objects of the CellDataSet (CDS) class. This class is derived from the Bioconductor ExpressionSet class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. Monocle provides detailed documentation about how to generate an input CDS here.
As an example, Garnett includes a small dataset derived from the PBMC 10x V1 expression data [1].
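Loading the included example data into a CDS could be sketched as follows. The file names under extdata (exprs_sparse.mtx, fdata.txt, pdata.txt) and the object name pbmc_cds are assumptions about the package's example data. For your own data, substitute your own matrix and annotation tables.

```r
library(monocle)
library(garnett)

# Read the example expression matrix and its gene/cell annotations.
# system.file() is only needed because these files ship inside the package.
mat <- Matrix::readMM(system.file("extdata", "exprs_sparse.mtx", package = "garnett"))
fdata <- read.table(system.file("extdata", "fdata.txt", package = "garnett"))
pdata <- read.table(system.file("extdata", "pdata.txt", package = "garnett"), sep = "\t")

# Rows of the matrix are genes and columns are cells.
rownames(mat) <- rownames(fdata)
colnames(mat) <- rownames(pdata)

# Wrap the annotation tables and build the CellDataSet.
pd <- new("AnnotatedDataFrame", data = pdata)
fd <- new("AnnotatedDataFrame", data = fdata)
pbmc_cds <- newCellDataSet(as(mat, "CsparseMatrix"),
                           phenoData = pd,
                           featureData = fd)

# Size factors are used for normalization in later steps.
pbmc_cds <- estimateSizeFactors(pbmc_cds)
```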
Besides the expression data, the second major input you'll need is a marker file. The marker file contains a list of cell type definitions written in an easy-to-read text format. The cell type definitions tell Garnett how to choose cells to train the model on. Each cell type definition starts with a '>' symbol and the cell type name, followed by a series of lines with definition information. Definition lines start with a keyword and a ':', and entries are separated by commas. Note: the Garnett syntax allows entries following the ':' to continue onto following lines; however, you may not move to a new line mid-entry (i.e. you can go to a new line only after a comma).
A simple valid example:
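A minimal marker file along these lines might look like the following. The cell types and marker genes are illustrative choices for PBMC data, not a recommended definition.

```text
>B cells
expressed: CD19, MS4A1

>T cells
expressed: CD3D
```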
There are several ways to define cell types in the Garnett marker file format. In general, each cell type's definition can have three major components. Only the first component is required.
The first and most important specification for a cell type is its expression. Garnett offers several options for specifying marker genes, detailed below.
Format | Example |
---|---|
expressed: gene1, gene2 | expressed: MYOD1, MYH3 |
not expressed: gene1, gene2 | not expressed: PAX6, PAX3 |
This is the default way to specify marker genes for your cell types. When using this specification, Garnett calculates a marker score for each cell, accounting for potential leakage, overall expression levels, and read depth. expressed: markers should be specific to the cell type being defined.
Format | Example |
---|---|
expressed above: gene1 value, gene2 value | expressed above: MYOD1 4.2, MYH3 700 |
expressed below: gene1 value, gene2 value | expressed below: PAX6 20, PAX3 4 |
expressed between: gene1 value1 value2, gene2 value1 value2 | expressed between: PAX6 10 20, PAX3 4 100 |
This is an alternative way to specify expression, which can be useful if you know the exact range you expect a gene's expression to occupy. In general, however, we do not recommend using these specifications because they do not account for read depth and overall expression in each cell. Values are in the same units as your input data.
In addition to expression information, you can further refine your cell type definitions using meta data. This is also where you will specify any subtypes you expect in your data.
Format | Example |
---|---|
subtype of: celltype | subtype of: T cells |
custom meta data: attribute1, attribute2 | tissue: spleen, thymus |
subtype of: allows you to specify that a cell type is a subtype of another cell type in your definition file.
The custom meta data: specification allows you to provide any further meta data requirements for your cell type. Any column in the pData table of your CDS object can be used as a meta data specification. In the example above, there would be a column in the pData table called "tissue".
Lastly, we highly recommend that you document how you chose your marker definitions. To make this easier to keep track of, we provide an additional specification - references: - that will store your citation information for each cell type. Add a set of URLs or DOIs and they will be included in your classifier. See here for functions to get access to this information.
Similar to R code, we have included a comment character # so you can add notes/comments in your marker file. Anything after a # on the same line will be ignored by Garnett.
A more complex example:
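A marker file that combines the specifications described above might look like the sketch below. Everything in it is illustrative: the genes and the threshold are arbitrary, the "tissue" line assumes a column named tissue exists in your pData table, and the references entry is a placeholder URL to be replaced with a real URL or DOI.

```text
>B cells
expressed: CD19, MS4A1
expressed above: CD79A 10
references: https://example.org/b-cell-markers

>T cells
expressed: CD3D
tissue: blood  # a meta data specification

>CD4 T cells
expressed: CD4
not expressed: CD8A
subtype of: T cells
```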
Because defining the marker file is often the hardest part of the process, Garnett includes functions to check whether your markers are likely to work well. The two relevant functions are check_markers and plot_markers. check_markers generates a table of information about your markers and plot_markers plots the most relevant information.
In addition to the small included dataset, we have included two example marker files with the package. Here are the contents of "pbmc_bad_markers.txt"
Besides the CDS object and the path to the marker file, there are a few arguments to add:
- db: db is a required argument for a Bioconductor AnnotationDb-class package used for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at the Bioconductor website. Load your chosen db using library(db). If your species does not have an AnnotationDb-class package, see here.
- cds_gene_id_type: This argument tells Garnett the format of the gene IDs in your CDS object. It should be one of the values in columns(db). The default is "ENSEMBL".
- marker_file_gene_id_type: Similarly to above, this argument tells Garnett the format of the gene IDs in your marker file.
Then use plot_markers to view the results.
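The marker check described above could be sketched as follows. It assumes pbmc_cds is a CellDataSet built from the included example data with gene symbols as IDs, and that the example marker file sits under the package's extdata directory. Both are assumptions of this sketch.

```r
library(garnett)
library(org.Hs.eg.db)

# Path to the example marker file shipped with the package (location assumed).
marker_file_path <- system.file("extdata", "pbmc_bad_markers.txt",
                                package = "garnett")

# Print the marker file so you can see the definitions being checked.
cat(readLines(marker_file_path), sep = "\n")

# pbmc_cds: a CellDataSet you have already built (see the data loading step).
marker_check <- check_markers(pbmc_cds, marker_file_path,
                              db = org.Hs.eg.db,
                              cds_gene_id_type = "SYMBOL",
                              marker_file_gene_id_type = "SYMBOL")

# Plot the most relevant columns of the marker table.
plot_markers(marker_check)
```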
This marker plot provides some key information about whether the chosen markers are good. First, the red note 'not in db' lets us know that the marker ACTN was not present as a 'SYMBOL' in the org.Hs.eg.db annotation. In this case, it's a typo. Next, the x-axis shows the ambiguity score for each of the markers - a measure of how many cells receive ambiguous labels when this marker is included. In this case, ACTB and PTPRC have high ambiguity and should be excluded. The last piece of relevant information is what percent of all the cells nominated in a cell type were chosen because of that marker. This is indicated by color.
NOTE: The values output by check_markers and plotted by plot_markers are estimates of the numbers of cells that could be chosen by the classifier. check_markers uses heuristics to quickly find candidate cells, so its counts will not exactly match the cells that are ultimately chosen. Please use the numbers as relative measures, rather than absolute representations of the training set.
A further note on ambiguity scores: An ambiguity score is the fraction of the cells a marker nominates that become labeled as "Ambiguous" when that marker is included in the marker file. However, a high ambiguity score does not necessarily mean that a given marker is not specific. It could mean that a different marker is the culprit, but that marker also nominates a lot of otherwise unlabeled cells (high nomination rate). Look closely at both the marker with a high ambiguity score and the markers in the cell type that it is most ambiguous with before deciding which marker to exclude.
After making a marker plot, you may want to revise your marker file. In our toy example, we'll remove ACTN, ACTB and PTPRC to get our final 'pbmc_test.txt' marker file.
Use the classifier_gene_id_type argument in both train_cell_classifier and check_markers to specify a different ID type. The value you choose will be stored with the classifier, so you do not need to specify it again when classifying future datasets.
Now it's time to train the classifier. The arguments are nearly the same as those for check_markers. The one parameter we change from its default below is the num_unknown argument. This tells Garnett how many outgroup cells it should compare against. The default is 500, but in this toy dataset with so few cells we want fewer.
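A training call along these lines might look like the sketch below. It assumes pbmc_cds is the CellDataSet built from the included example data and that pbmc_test.txt sits under the package's extdata directory. The seed and the value num_unknown = 50 are illustrative choices, not recommendations.

```r
library(garnett)
library(org.Hs.eg.db)

# Arbitrary seed so that repeated runs give the same result.
set.seed(260)

# Revised marker file with the problematic markers removed (location assumed).
marker_file_path <- system.file("extdata", "pbmc_test.txt",
                                package = "garnett")

# pbmc_cds: a CellDataSet you have already built (see the data loading step).
pbmc_classifier <- train_cell_classifier(cds = pbmc_cds,
                                         marker_file = marker_file_path,
                                         db = org.Hs.eg.db,
                                         cds_gene_id_type = "SYMBOL",
                                         num_unknown = 50,
                                         marker_file_gene_id_type = "SYMBOL")
```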
After running train_cell_classifier, the output object, of type "garnett_classifier", contains all of the information necessary to classify cells.
Garnett classification is trained using a multinomial elastic-net regression. This means that certain genes are chosen as the relevant genes for distinguishing between cell types. Which genes are chosen may be of interest, so Garnett includes a function to access the chosen genes. Note: Garnett does not regularize the input markers, so they will be included in the classifier regardless.
The function we use to see the relevant genes is get_feature_genes. The arguments are the classifier, which node you'd like to view (if your tree is hierarchical) - use "root" for the top node and the parent cell type name for other nodes - and the db for your species. The function will automatically convert the gene IDs to SYMBOL if you set convert_ids = TRUE.
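For instance, inspecting the genes selected at the top node could be sketched like this, assuming pbmc_classifier is a classifier trained on human data as described above (the object name is an assumption):

```r
library(garnett)
library(org.Hs.eg.db)

# pbmc_classifier: a garnett_classifier you have already trained or loaded.
feature_genes <- get_feature_genes(pbmc_classifier,
                                   node = "root",
                                   db = org.Hs.eg.db,
                                   convert_ids = TRUE)

# Show the first few selected genes.
head(feature_genes)
```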
We explained above how to include documentation about how you chose your markers in your marker file. To get this information back out - to see how markers were chosen for an already trained classifier - use the function get_classifier_references. Besides the classifier, there is one additional optional argument called cell_type. If you pass the name of a cell type, only the references for that cell type will be printed; otherwise they will all be printed.
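Both forms of the call might look like the following sketch. The object name pbmc_classifier and the cell type "T cells" are assumptions. Use a cell type name that exists in your own classifier.

```r
library(garnett)

# pbmc_classifier: a garnett_classifier you have already trained or loaded.

# Print the references stored for every cell type.
get_classifier_references(pbmc_classifier)

# Print only the references stored for one cell type.
get_classifier_references(pbmc_classifier, cell_type = "T cells")
```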
We encourage you to submit your high-quality classifiers to us so we can make them available to the community. To do this, open a special issue and fill out the form in the Garnett GitHub repository. Click here and click the "New issue" button to get started!
If you haven't loaded your data yet (because you're using a pre-trained classifier), now is the time!
Because Garnett builds on Monocle, data for Garnett is held in objects of the CellDataSet (CDS) class. This class is derived from the Bioconductor ExpressionSet class, which provides a common interface familiar to those who have analyzed microarray experiments with Bioconductor. Monocle provides detailed documentation about how to generate an input CDS here.
As an example, Garnett includes a small dataset derived from the PBMC 10x V1 expression data [1].
Once you've got your classifier, it's time to classify your cells using the classify_cells function! The key arguments are:
- cds: This is the CDS object containing your gene expression data (see above).
- classifier: This is the garnett_classifier you obtained above.
- db: db is a required argument for a Bioconductor AnnotationDb-class package used for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at the Bioconductor website. Load your chosen db using library(db). If your species does not have an AnnotationDb-class package, see here.
- cluster_extend: This tells Garnett whether to create a second set of assignments that expands classifications to cells in the same cluster. You can either provide cluster IDs in the pData table in a column titled "garnett_cluster", or you can let Garnett calculate the clusters and populate the column. If you set cluster_extend to TRUE with a very large dataset, this function will slow down considerably. For convenience, Garnett will save the clusters it calculates to "garnett_cluster", so the function will be faster if run again.
- cds_gene_id_type: This argument tells Garnett the format of the gene IDs in your CDS object. It should be one of the values in columns(db). The default is "ENSEMBL".
The classify_cells function returns the input CDS object with one (or two if cluster_extend = TRUE) new columns in the pData table containing the Garnett classifications.
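A classification call with these arguments could be sketched as follows. It assumes pbmc_cds and pbmc_classifier already exist, and that the two new pData columns are named cell_type and cluster_ext_type. Those column names are an assumption here, so check the output of head(pData(...)) on your own data.

```r
library(monocle)
library(garnett)
library(org.Hs.eg.db)

# pbmc_cds: your CellDataSet; pbmc_classifier: a trained or downloaded classifier.
pbmc_cds <- classify_cells(pbmc_cds, pbmc_classifier,
                           db = org.Hs.eg.db,
                           cluster_extend = TRUE,
                           cds_gene_id_type = "SYMBOL")

# Inspect the new columns added to the pData table.
head(pData(pbmc_cds))

# Tabulate the assignments (column names are assumed, see the note above).
table(pData(pbmc_cds)$cell_type)
table(pData(pbmc_cds)$cluster_ext_type)
```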
The top plot above shows Garnett's cell type assignments, and the second plot shows Garnett's cluster-extended type assignments. You can see that the T cell subsets (CD4 and CD8) are not well separated in these clusters, so when the cluster-extended type is calculated, Garnett backs up the hierarchy to the more confident assignment of "T cells".
Because this example data is from a FACS-sorted cell sample, we can compare Garnett's assignments to the "true" cell types.
Here, we provide examples of a few of the common marker file errors and potential outcomes in Garnett's classification. For all panels, a classifier was trained on the 10x PBMC version 2 (V2) data and the classifier was then used to classify the 10x PBMC version 1 (V1) data shown above. The first panel is colored by the FACS-based 10x cell type assignments. The remaining panels are colored by the Garnett cluster-agnostic cell type assignment.
If your species doesn't have an available AnnotationDb-class database, then Garnett won't be able to convert among gene ID types. However, you can still use Garnett for classification. Set db = 'none' and then be sure that you use the same gene ID type in your marker file as in your CDS object. When db = 'none', Garnett ignores the arguments for gene ID type.
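Training and classifying without an annotation database might then look like this sketch. The object name my_cds and the marker file path are hypothetical, and the marker file must already use the same gene ID type as the CDS.

```r
library(garnett)

# my_cds: a CellDataSet for a species without an AnnotationDb-class package.
# The marker file path is hypothetical.
my_classifier <- train_cell_classifier(cds = my_cds,
                                       marker_file = "path/to/markers.txt",
                                       db = "none")

# No gene ID conversion happens here, so the gene ID type arguments are omitted.
my_cds <- classify_cells(my_cds, my_classifier, db = "none")
```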
More troubleshooting to come...