Monocle3 can store the counts matrix on disk rather than in memory in a way that is essentially transparent to the user. This substantially reduces the memory required to process a data set, while you continue to use the familiar Monocle3 functions. This feature depends on Ben Parks' excellent BPCells R package.
To make a cell_data_set with a disk-based counts matrix, pass the matrix_control parameter to a Monocle3 function that makes a cell_data_set, such as load_mm_data(). In this example, the counts matrix is in a MatrixMarket file called counts.mtx, the gene names are in features.txt, and the cell names are in cells.txt.
cds <- load_mm_data(mat_path='counts.mtx', feature_anno_path='features.txt', cell_anno_path='cells.txt', matrix_control=list(matrix_class='BPCells'))
Functions that have the matrix_control parameter include load_mm_data(), load_mtx_data(), load_worm_embryo(), load_worm_l2(), load_a549(), and combine_cds().
You can convert a dgCMatrix sparse matrix in the cell_data_set to a disk-based matrix using the function convert_counts_matrix(). For example
counts(cds_bpcells) <- convert_counts_matrix(counts(cds), matrix_control=list(matrix_class='BPCells'))
where counts(cds) is a dgCMatrix. convert_counts_matrix() makes and stores both the BPCells column-order matrix and the row-order matrix, in an effort to keep the two consistent.
The new_cell_data_set() function has no matrix_control parameter so it does not convert the input matrix to a disk-based matrix. However, if the input matrix is a BPCells matrix, it makes and stores the row-order copy of the input matrix.
After making the cell_data_set with a disk-based matrix, you process it using the same functions as for a cell_data_set with a dgCMatrix counts matrix. For example,
cds <- load_mm_data(mat_path='counts.mtx', feature_anno_path='features.txt', cell_anno_path='cells.txt', matrix_control=list(matrix_class='BPCells'))
cds <- preprocess_cds(cds)
cds <- reduce_dimension(cds)
cds <- cluster_cells(cds)
You must use the save_monocle_objects() function to store a cell_data_set that has a disk-based counts matrix, and the load_monocle_objects() function to reload the saved cell_data_set.
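As a minimal sketch of the save and reload step, the helper below wraps the two calls named above. The helper name save_and_reload_cds and the directory name in the usage line are hypothetical, and the sketch assumes that both Monocle3 functions accept a directory_path argument.

```r
# Hypothetical helper (not part of Monocle3): save a cell_data_set that has
# a disk-based counts matrix, then load it back from the same directory.
# Assumes both Monocle3 functions take a 'directory_path' argument.
save_and_reload_cds <- function(cds, directory_path) {
  monocle3::save_monocle_objects(cds = cds, directory_path = directory_path)
  monocle3::load_monocle_objects(directory_path = directory_path)
}

# Usage (requires Monocle3 and an existing cell_data_set called cds):
# cds <- save_and_reload_cds(cds, 'cds_objects')
```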
Monocle3 makes temporary working directories where the disk-based counts matrix files are kept until you quit the R session, at which time Monocle3 tries to remove them. By default, Monocle3 makes these directories in the directory where you are running R. The directories have names like monocle.bpcells.20240412.3106e35c0e4a2.tmp, which include the date on which the directory is made, a unique string, and the .tmp suffix. Do not delete these directories while the Monocle3 R session is running. If a temporary directory remains after you quit the R session for some reason, you may delete it, if you are certain that the session completed. If you delete such a directory before the session ends, you will lose the counts matrix, which will make the cell_data_set unusable.
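The naming scheme described above makes it possible to look for leftover directories after a session has ended. The following sketch uses only base R. The function name find_bpcells_tmp_dirs is hypothetical, and the regular expression is inferred from the example directory name, so treat it as an assumption rather than a rule that Monocle3 guarantees.

```r
# List directories under 'path' whose names look like Monocle3 BPCells
# temporary directories, e.g. monocle.bpcells.20240412.3106e35c0e4a2.tmp.
# The pattern is inferred from that one example name (an assumption).
find_bpcells_tmp_dirs <- function(path = '.') {
  pattern <- '^monocle\\.bpcells\\.[0-9]{8}\\.[0-9A-Za-z]+\\.tmp$'
  candidates <- list.files(path, pattern = pattern, full.names = TRUE)
  # Keep directories only; ordinary files with a matching name are ignored.
  candidates[dir.exists(candidates)]
}

leftover <- find_bpcells_tmp_dirs('.')
print(leftover)

# Remove them only when you are certain that no Monocle3 session uses them:
# unlink(leftover, recursive = TRUE)
```

The removal step is left commented out on purpose, because deleting a directory that a running session still needs makes the cell_data_set unusable.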
You can tell Monocle3 where you want the disk-based working directories using the matrix_control parameter with the list element matrix_path. For example,
cds <- load_worm_embryo(matrix_control=list(matrix_class='BPCells', matrix_path='/home/me/tmp_bpcells'))
preprocess_cds() creates an additional temporary disk-based matrix while it runs.
The set_matrix_control() function help has additional information about the BPCells features supported by Monocle3.
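If several calls should share the same working location, one option is to build the matrix_control list once and reuse it. The sketch below uses only base R. The helper name make_bpcells_control is hypothetical, and creating the directory in advance is an assumption made for safety, not a documented Monocle3 requirement.

```r
# Hypothetical helper: build a matrix_control list for disk-based matrices.
make_bpcells_control <- function(matrix_path) {
  # Assumption: creating the directory beforehand is harmless to Monocle3.
  if (!dir.exists(matrix_path)) {
    dir.create(matrix_path, recursive = TRUE)
  }
  list(matrix_class = 'BPCells', matrix_path = normalizePath(matrix_path))
}

# Example location under the R session's temporary directory.
ctl <- make_bpcells_control(file.path(tempdir(), 'tmp_bpcells'))
str(ctl)

# The same list can then be passed to any function that has matrix_control:
# cds <- load_worm_embryo(matrix_control = ctl)
```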