# Garnett Documentation for Monocle 3

#### Monocle or Monocle3?

If you are using Monocle 3 (loading using library(monocle3)), then follow the instructions on this page. If you are instead using Monocle or Monocle 2 (loading using library(monocle)), then please follow the instructions here.

## Abstract

Garnett is a software package that faciliates automated cell type classification from single-cell expression data. Garnett works by taking single-cell data, along with a cell type definition (marker) file, and training a regression-based classifier. Once a classifier is trained for a tissue/sample type, it can be applied to classify future datasets from similar tissues. In addition to describing training and classifying functions, this website aims to be a repository of previously trained classifiers.

## Installing Garnett

Garnett runs in the R statistical computing environment. You will need R version 3.5 or higher to install Garnett.

### Install from Github

To install Garnett directly from the Github repository, use the following instructions:

Garnett builds upon a package called Monocle 3. Before installing Garnett, first install Monocle 3 following the instructions here. Briefly, install Monocle 3 using:

You're now ready to install Garnett:

After installation, test that Garnett installed correctly by opening a new R session and typing:

## Getting help

Questions about Garnett should be posted on our Google Group. Please do not email technical questions to Garnett contributors directly.

## Garnett overview

The Garnett workflow has two major parts, each described in detail below:

1. Train/obtain the classifier: Either download an existing classifier, or train your own. To train, Garnett parses a marker file, chooses a set of training cells, and then trains a multinomial classifier to distinguish cell types.
2. Classify cells: Next, Garnett applies the classifier to a set of cells to generate cell type assignments. Optionally, Garnett will expand classifications to similar cells to generate a separate set of cluster-extended type assignments.

## 1a. Using a pre-trained classifier

There are two options for the first part of the Garnett workflow: using a pre-trained classifier, or generating your own classifier.

We have generated a series of pre-trained classifiers for various organisms and tissues. If a pre-trained classifier exists for your data type, we recommend you try it. The list of available classifiers can be found here. We hope to continually update and add new classifiers as we generate them. We also accept classifiers generated by others - please submit any classifiers you make and help build the community! Details on how to submit your classifier are can be found here.

Once you've loaded the classifier, go on to part 2, Classifying your cells

## 1b. Train your own classifier

If a classifier doesn't exist for your tissue type, or doesn't include the cell types you expect in your data, then you'll need to generate your own.

#### The cell_data_set class

Because Garnett builds on Monocle 3, data for Garnett is held in objects of the cell_data_set (CDS) class. This class is derived from the Bioconductor SingleCellExperiment class, which provides a common interface familiar to those who have analyzed single-cell data with Bioconductor. Monocle 3 provides detailed documentation about how to generate an input CDS here.

As an example, Garnett includes a small dataset derived from the PBMC 10x V1 expression data [1].

### Constructing a marker file

Besides the expression data, the second major input you'll need is a marker file. The marker file contains a list of cell type definitions written in an easy-to-read text format. The cell type definitions tell Garnett how to choose cells to train the model on. Each cell type definition starts with a '>' symbol and the cell type name, followed by a series of lines with definition information. Definition lines start with a keyword and a ':' and entries are separated by a comma. Note: the Garnett syntax allows for entries following the ':' to move onto following lines however, you may not move to a new line mid entry (i.e. you can go to a new line only after a comma).

A simple valid example:

There are several ways to define cell types in the Garnett marker file format. In general, each cell's definition can have three major components. Only the first component is required.

#### Define the expression

The first and most important specification for a cell type is its expression. Garnett offers several options for specifying marker genes, detailed below.

#### Recommended expression specifications

Format Example
expressed: gene1, gene2 expressed: MYOD1, MYH3
not expressed: gene1, gene2 not expressed: PAX6, PAX3

This is the default way to specify marker genes for your cell types. When using this specification, Garnett calculates a marker score for each cell, accounting for potential leakage, overall expression levels, and read depth. expressed: markers should be specific to the cell type being defined.

#### Alternative expression specifications

Format Example
expressed above: gene1 value, gene2 value expressed above: MYOD1 4.2, MYH3 700
expressed below: gene1 value, gene2 value expressed below: PAX6 20, PAX3 4
expressed between: gene1 value1 value2, gene2 value1 value2 expressed between: PAX6 10 20, PAX3 4 100

This is an alternative way to specify expression, which can be useful if you know an exact range you expect the expression of a gene to occupy. In general however, we do not recommend using these specifications because they will not account for read depth and overall expression in each cell. Values are in the same units as your input data.

#### Define the meta data

In addition to expression information, you can further refine your cell type definitions using meta data. This is also where you will specify any subtypes you expect in your data.

#### Meta data specifications

Format Example
subtype of: celltype subtype of: T cells
custom meta data: attribute1, attribute2 tissue: spleen, thymus

subtype of: allows you to specify that a cell type is a subtype of another cell type in your definition file.

custom meta data: specification allows you to provide any further meta data requirements for your cell type. Any column in the pData table of your CDS object can be used as a meta data specification. In the example above, there would be a column in the pData table called "tissue".

Lastly, we highly recommend that you document how you chose your marker definitions. To make it easier to keep track of, we provide an additional specification - references: - that will store your citation information for each cell type. Add a set of URLs or DOIs and they will be included in your classifier. See here for functions to get access to this information.

Similar to R code, we have included a comment character # so you can add notes/comments in your marker file. Anything after a # on the same line will be ignored by Garnett.

A more complex example:

#### Warning for Windows users:

At the moment, the Garnett marker file cannot have carriage returns (\r), which will be included automatically if you generate your marker file in certain Windows text editors. Instead, you must use newline characters (\n). There are many ways to replace carriage returns on Windows, for example using Notepad++ you can use replace, choose extended, and replace all \r with \n. I intend to address this in a more automated way soon.

Because defining the marker file is often the hardest part of the process, Garnett includes functions to check whether your markers are likely to work well. The two functions relevant are check_markers and plot_markers. check_markers generates a table of information about your markers and plot_markers plots the most relevant information.

In addition to the small included dataset, we have included two example marker files with the package. Here are the contents of "pbmc_bad_markers.txt"

Besides the CDS object and the path to the marker file, there are a few arguments to add:

• db: db is a required argument for a Bioconductor AnnotationDb-class package used for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at the Bioconductor website. Load your chosen db using library(db). If your species does not have an AnnotationDb-class package, see here.
• cds_gene_id_type: This argument tells Garnett the format of the gene IDs in your CDS object. It should be one of the values in columns(db). The default is "ENSEMBL".
• marker_file_gene_id_type: Similarly to above, this argument tells Garnett the format of the gene IDs in your marker file.
After generating your marker_check data.frame, you can use our built in plotting function plot_markers to view the results.

This marker plot provides some key information about whether the chosen markers are good. First, The red note 'not in db' lets us know that the marker ACTN was not present as a 'SYMBOL' in the org.Hs.eg.db annotation. In this case, it's a typo. Next, the x-axis shows the ambiguity score for each of the markers - a measure of how many cells receive ambiguous labels when this marker is included - in this case, ACTB and PTPRC have high ambiguity and should be excluded. The last piece of relevant information is what percent of all the cells nominated in a cell type were chosen because of that marker. This is indicated by color.

NOTE: The values output by check_markers and plotted by plot_markers are estimates of the numbers of cells that could be chosen by the classifier. However, it uses heuristics to quickly find candidate cells, and will not exactly match the cells that are chosen by the marker. Please use the numbers as relative measures, rather than absolute representations of the training set.

A further note on ambiguity scores: Ambiguity scores are the fraction of the cells a marker nominates that a become labeled as "Ambiguous" when that marker is included in the marker file. However, a high ambiguity score does not necessarily mean that a given marker is not specific. It could mean that a different marker is the culprit, but that marker also nominates a lot of otherwise unlabeled cells (high nomination rate). Look closely at both the marker with a high ambiguity score, and markers in the cell type that it is most ambiguous with before deciding which marker to exclude.

After making a marker plot, you may want to revise your marker file. In our toy example, we'll remove ACTN, ACTB and PTPRC to get our final 'pbmc_test.txt' marker file.

#### No ENSEMBL in your db

By default, Garnett converts gene IDs into ENSEMBL for use in the classifier. This allows classifiers to be portable so they can classify future datasets that might have different gene IDs. For some organisms however, ENSEMBL IDs are either not available or not commonly used. In these cases, you can use the parameter classifier_gene_id_type in both train_cell_classifier and check_markers to specify a different ID type. The value you choose will be stored with the classifier, so you do not need to specify it again when classifying future datasets.

#### Train the classifier

Now it's time to train the classifier. The arguments should be pretty close to those for check_markers. The one parameter I am changing from default below is the num_unknown argument. This tells Garnett how many outgroup cells it should compare against. The default is 500, but in this toy dataset with so few cells we want fewer.

After running train_cell_classifier, the output object, of type "garnett_classifier" contains all of the information necessary to classify cells.

### Viewing the classification genes

Garnett classification is trained using a multinomial elastic-net regression. This means that certain genes are chosen as the relevant genes for distinguishing between cell types. Which genes are chosen may be of interest, so Garnett includes a function to access the chosen genes. Note: Garnett does not regularize the input markers, so they will be included in the classifier regardless.

The function we use to see the relevant genes is get_feature_genes. The arguments are the classifier, which node you'd like to view (if your tree is hierarchical) - use "root" for the top node and the parent cell type name for other nodes, and the db for your species. The function will automatically convert the gene IDs to SYMBOL if you set convert_ids = TRUE.

### Viewing references

We explained above how to include documentation about how you chose your markers in your marker file. In order to get this information back out - to see how markers were chosen for an already trained classifier - use the function get_classifier_references. Besides the classifier, there is one additional optional argument called cell_type. If you pass the name of a cell type, only the references for that cell type will be printed, otherwise they will all be printed.

### Submitting a classifier

We encourage you to submit your high quality classifiers to us so we can make them available to the community. To do this, open a special issue and fill out the form in the Garnett github repository. Click here and click the "New issue" button to get started!

## 1c. Train a marker-free classifier

#### Warning - Alpha Feature:

This is a new feature of Garnett. This method has been neither published, nor tested extensively, and is therefore an alpha feature. This means that it is subject to change, and in addition that we would appreciate hearing any feedback that you might have! Please submit a bug report if you run into any issues, or if you would like to suggest additional/different features.

If a classifier doesn't exist for your tissue type, but there exists an additional dataset from the same tissue and organism with known cell types, then you can train a classifier on this dataset to apply to any future data without providing marker genes by using the marker-free method.

To use marker-free classifier training, follow all of the steps above for creating your own classifier, but provide a marker-free cell type definition file as described below:

### Constructing a marker-free cell type definition file

As described above, the Garnett marker file contains a list of cell type definitions written in an easy-to-read text format. The cell type definitions tell Garnett how to choose cells to train the model on. Each cell type definition starts with a '>' symbol and the cell type name, followed by a series of lines with definition information. Definition lines start with a keyword and a ':' and entries are separated by a comma.

In the case of the marker-free cell type definition file, instead of providing marker gene information, you will instead provide alternative definition information which corresponds to a column in your cell metadata (colData/pData). For example, if I have a column in my colData table called "known_type", then I can use definitions there to generate cell type definitions.

Marker-free cell type definition allows nested cell types, for example, another valid marker file would be:

Lastly, some cell types can be defined by gene expression as above while others are defined without markers. However please note that at present, cell types defined marker-free cannot also include expression markers. In the case that both an expression marker and a marker-free definition are included in the same cell type definition, the marker-free tag will be treated as additional metadata (see above) rather than as a marker-free definition. See below for and example of a valid marker file with some cell types defined using expression and others defined marker-free:

#### Training example using marker-free cell type definitions

Using the example cell type definition file from above:

After running train_cell_classifier, the output object, of type "garnett_classifier" contains all of the information necessary to classify cells.

If you haven't loaded your data yet (because you're using a pre-trained classifier), now is the time!

#### The cell_data_set class

Because Garnett builds on Monocle 3, data for Garnett is held in objects of the cell_data_set (CDS) class. This class is derived from the Bioconductor SingleCellExperiment class, which provides a common interface familiar to those who have analyzed single-cell data with Bioconductor. Monocle 3 provides detailed documentation about how to generate an input CDS here.

As an example, Garnett includes a small dataset derived from the PBMC 10x V1 expression data [1].

Once you've got your classifier, it's time to classify your cells using the classify_cells function! The key arguments are:

• cds: This is the CDS object containing your gene expression data (see above).
• classifier: This the the garnett_classifier you obtained above.
• db: db is a required argument for a Bioconductor AnnotationDb-class package used for converting gene IDs. For example, for humans use org.Hs.eg.db. See available packages at the Bioconductor website. Load your chosen db using library(db). If your species does not have an AnnotationDb-class package, see here.
• cluster_extend: This tells Garnett whether to create a second set of assignments that expands classifications to cells in the same cluster. You can either provide cluster IDs in the pData table in a column titled "garnett_cluster", or you can let Garnett calculate the clusters and populate the column.
Warning: if not providing a "garnett_cluster" column and setting cluster_extend to TRUE with a very large dataset, this function will slow down considerably. For convenience, Garnett will save the clusters it calculates to "garnett_cluster", so the function will be faster if run again.
• cds_gene_id_type: This argument tells Garnett the format of the gene IDs in your CDS object. It should be one of the values in columns(db). The default is "ENSEMBL".
The classify_cells function returns the input CDS object with one (or two if cluster_extend = TRUE) new columns in the pData table containing the Garnett classifications.

The top plot above shows Garnett's cell type assignments, and the second plot show Garnett's cluster-extended type assignments. You can see that the T cell subsets (CD4 and CD8) are not well separated in these clusters, so when the cluster-extended type is calculated, Garnett backs up the hierarchy to the more confident assignment of "T cells".

Because this example data is from a FACS sorted cell sample, we can compare Garnett's assignments to the "true" cell types.

## Troubleshooting

### Common marker file errors

Here, we provide examples of a few of the common marker file errors and potential outcomes in Garnett's classification. For all panels, a classifier was trained on the 10x PBMC version 2 (V2) data and the classifier was then used to classify the 10x PBMC version 1 (V1) data shown above. The first panel is colored by the FACS-based 10x cell type assignments. The remaining panels are colored by the Garnett cluster-agnostic cell type assignment.

1. A cell type is missing from the marker file. For example, in the PBMC marker file, not including a T cell definition (Panel 2). We found that, in general, Garnett labels the missing cell type as ‘Unknown’. The exception to this, discussed in an addition to the manuscript, is when a feature describing an existing cell type is present in missing cell type (i.e. the NKT cells that express NK marker FCGR3A.
2. A cell type is defined but includes no good specific markers. For example, in the PBMC marker file, defining T cells using only CD4 and not CD3 (Panel 3). In this case, we found that Garnett labelled only a subset of the T cells as such and left the remaining cells unlabelled.
3. A gene that is not specific and widely expressed is used to define a cell type. For example, if we add MALAT1 (the most expressed transcript in the PBMC dataset) to the T cell definition (Panel 4). In this case, we find that each of the cell types ends up with mixed assignments between the true cell types and T cells. In an alternative scenario, the inclusion of a widely expressed non-specific gene may cause Garnett not to find sufficient training samples at all, because it will consider all cells to be ambiguous (i.e. they will express other marker plus the non-specific one).
4. A cell type definition includes genes that are specific to another cell type. This is the case where a definition is truly ‘wrong’, i.e. if the best marker for B cells (CD79A) is add to the T cell definition (Panel 5). We found that the B cell clusters gets mixed cell type assignments of B cell and T cell, but that the remaining cell types’ labels are mostly unchanged.

### My species doesn't have an AnnotationDbi-class database

If your species doesn't have an available AnnotationDbi-class database, then Garnett won't be able to convert among gene ID types. However, you can still use Garnett for classification. Set db = 'none' and then be sure that you use the same gene ID type in your marker file as your CDS object. When db = 'none' Garnett ignores the arguments for gene ID type.

More troubleshooting to come...