** Disk-based storage is in the develop branch only at this time. **
You need to install the Monocle3 develop branch to use disk-based count matrix storage. Please see Installing Monocle 3.

Disk-based count matrix storage

Monocle3 can store the count matrix on disk rather than in memory in a way that is essentially transparent to the user. This substantially reduces the memory required to process a data set while letting you use the familiar Monocle3 functions. This feature depends on Ben Parks' excellent BPCells R package.

Make a cell_data_set with a disk-based matrix

To make a cell_data_set with a disk-based counts matrix, pass the matrix_control parameter to a Monocle3 function that makes a cell_data_set, such as load_mm_data(). In this example, the counts matrix is in a MatrixMarket file called counts.mtx, the gene names are in features.txt, and the cell names are in cells.txt.

cds <- load_mm_data(mat_path='counts.mtx', feature_anno_path='features.txt', cell_anno_path='cells.txt', matrix_control=list(matrix_class='BPCells'))

Functions that have the matrix_control parameter include

  • load_mm_data()
  • load_mtx_data()
  • load_worm_embryo()
  • load_worm_l2()
  • load_a549()
  • combine_cds()

You can convert a dgCMatrix sparse matrix in the cell_data_set to a disk-based matrix using the function convert_counts_matrix(). For example

counts(cds_bpcells) <- convert_counts_matrix(counts(cds), matrix_control=list(matrix_class='BPCells'))

where counts(cds) is a dgCMatrix. convert_counts_matrix() makes and stores both the BPCells column-order matrix and the row-order matrix, in an effort to keep the two consistent.

The new_cell_data_set() function has no matrix_control parameter so it does not convert the input matrix to a disk-based matrix. However, if the input matrix is a BPCells matrix, it makes and stores the row-order copy of the input matrix.

Process a cell_data_set with disk-based matrix

After making the cell_data_set with a disk-based matrix, you process it using the same functions as for a cell_data_set with a dgCMatrix counts matrix. For example,

cds <- load_mm_data(mat_path='counts.mtx', feature_anno_path='features.txt', cell_anno_path='cells.txt', matrix_control=list(matrix_class='BPCells'))
cds <- preprocess_cds(cds)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)

Store a cell_data_set with a disk-based matrix

You must use the save_monocle_objects() function to store a cell_data_set that has a disk-based counts matrix, and the load_monocle_objects() function to reload the saved cell_data_set.
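
As a minimal sketch of the save and reload steps, assuming that cds already holds a cell_data_set with a disk-based counts matrix: the directory name cds_objects is a hypothetical choice, not a name that Monocle3 requires.

```r
library(monocle3)

# Assumes `cds` is a cell_data_set with a BPCells counts matrix.
# 'cds_objects' is a hypothetical output location chosen for this sketch.
save_monocle_objects(cds=cds, directory_path='cds_objects')

# Later, possibly in a new R session, reload the saved cell_data_set.
cds <- load_monocle_objects(directory_path='cds_objects')
```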

Temporary disk-based matrix working directory

Monocle3 keeps the disk-based counts matrix files in temporary working directories until you quit the R session, at which time it tries to remove them. By default, Monocle3 makes these directories in the directory where you are running R. The directories have names like monocle.bpcells.20240412.3106e35c0e4a2.tmp, which include the date on which the directory was made, a unique string, and the .tmp suffix. Do not delete these directories while the Monocle3 R session is running: doing so loses the counts matrix, which makes the cell_data_set unusable. If a temporary directory remains after you quit the R session, you may delete it once you are certain that the session has ended.
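
The following base R sketch lists directories that look like leftover Monocle3 working directories so that you can review them after a session has ended. The helper find_bpcells_tmp_dirs() is hypothetical and is not part of Monocle3, and the name pattern is an assumption based on the example directory name above. The sketch only lists candidates and deletes nothing.

```r
# Hypothetical helper, not part of Monocle3: list subdirectories of `path`
# whose names look like monocle.bpcells.<date>.<unique string>.tmp.
# The pattern is assumed from the example name in the text.
find_bpcells_tmp_dirs <- function(path='.') {
  dirs <- list.dirs(path, full.names=TRUE, recursive=FALSE)
  dirs[grepl('^monocle\\.bpcells\\..*\\.tmp$', basename(dirs))]
}

# List candidates in the directory where R is running.
print(find_bpcells_tmp_dirs('.'))
```

Before deleting any directory it reports, confirm that no running R session still uses it.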

You can tell Monocle3 where to make the disk-based working directories by setting the matrix_path element of the matrix_control list. For example,

cds <- load_worm_embryo(matrix_control=list(matrix_class='BPCells', matrix_path='/home/me/tmp_bpcells'))

Working with BPCells count matrices

  • BPCells queues operations for streaming and lazy evaluation, which increases time and memory efficiency. However, performance can suffer when matrix elements, rows, and columns are accessed repeatedly while the matrix has queued operations because the queued operations are applied for each access. To avoid repeating these operations, use a temporary disk-based copy, which applies the operations to the full matrix once.
  • Disk-based matrix operation performance depends strongly on disk latency and bandwidth. Matrix operations are fastest on fast disks attached to the main peripheral bus, such as solid-state disks on the system PCIe bus. Network-attached storage may be slower, and USB drives should be avoided for storing working directories.
  • Disk-based matrix storage requires free disk space, so you may need to watch disk usage. Some commands, such as preprocess_cds(), create an additional temporary disk-based matrix while they run.
  • When using disk-based matrix storage, Monocle3 stores both column-order and row-order matrices. It uses the column-order matrix except when the operations require frequent gene subsetting.
  • The set_matrix_control() function help has additional information about the BPCells features supported by Monocle3.
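
The temporary disk-based copy mentioned in the first item above can be sketched with the BPCells function write_matrix_dir(), which writes a matrix to a directory on disk and returns a matrix backed by that directory. This is a sketch that assumes cds holds a cell_data_set with a BPCells counts matrix. The use of tempfile() for the location is an illustrative choice, and a directory on a fast local disk may be preferable.

```r
library(monocle3)
library(BPCells)

# Assumes `cds` is a cell_data_set with a BPCells counts matrix.
mat <- counts(cds)

# Write the matrix, with any queued operations applied, to a new directory.
# The location is a hypothetical choice made for this sketch.
tmp_dir <- tempfile(pattern='bpcells_copy_')
mat_copy <- write_matrix_dir(mat=mat, dir=tmp_dir)

# Repeated accesses now read the stored result rather than re-running
# the queued operations.
row_totals <- rowSums(mat_copy)

# Remove the copy when finished; mat_copy is unusable afterwards.
unlink(tmp_dir, recursive=TRUE)
```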