Cell Ranger ATAC 2.1 (latest), printed on 02/07/2023
The cellranger-atac pipeline outputs three types of feature-barcode matrices. Each matrix has features as rows and barcodes as columns. Each element of a matrix is the number of cut sites associated with that feature and barcode.
|Unfiltered (Raw) peak-barcode matrix|Contains every observed barcode including background and non-cellular barcodes.|
|Filtered peak-barcode matrix|Contains only detected cellular barcodes.|
|Filtered tf-barcode matrix|Contains only detected cellular barcodes.|
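Conceptually, a filtered matrix is the raw matrix restricted to the columns of detected cellular barcodes. The sketch below illustrates that relationship on made-up data. It does not show how Cell Ranger ATAC detects cells; the `filter_barcodes` helper, the toy barcodes, and the `cells` set are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Sparse counts keyed by (feature_index, barcode_index); zero entries are absent.
Counts = Dict[Tuple[int, int], int]


def filter_barcodes(
    counts: Counts, barcodes: List[str], cell_barcodes: Set[str]
) -> Tuple[Counts, List[str]]:
    """Keep only the columns whose barcode is in cell_barcodes."""
    kept = [i for i, bc in enumerate(barcodes) if bc in cell_barcodes]
    # Map old column indices to their positions in the filtered matrix.
    new_index = {old: new for new, old in enumerate(kept)}
    filtered = {
        (row, new_index[col]): n
        for (row, col), n in counts.items()
        if col in new_index
    }
    return filtered, [barcodes[i] for i in kept]


# Toy data, made up for illustration: 2 peaks x 3 barcodes.
raw_barcodes = ["AAAC-1", "CCCG-1", "GGGT-1"]
raw_counts = {(0, 0): 4, (1, 0): 1, (0, 1): 1, (1, 2): 7}
cells = {"AAAC-1", "GGGT-1"}  # hypothetical set of detected cells

filtered_counts, filtered_barcodes = filter_barcodes(raw_counts, raw_barcodes, cells)
print(filtered_barcodes)
print(filtered_counts)
```

The rows (features) are untouched; only the barcode columns are subset and renumbered.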
The cellranger-atac pipeline generates peak-barcode and tf-barcode matrices. Each matrix is stored in Market Exchange Format (MEX). Each matrix directory also contains a BED file with peaks (a TSV file for transcription factors) and a file of barcode sequences, corresponding to the row and column indices, respectively. For example, if cellranger-atac is run with a human reference, the matrix output may look like:
$ cd /home/jdoe/runs/sample345/outs
$ tree filtered_peak_bc_matrix
filtered_peak_bc_matrix
├── barcodes.tsv
├── peaks.bed
└── matrix.mtx
1 directories, 3 files
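To make the link between matrix.mtx and its two index files concrete, here is a minimal sketch of a reader for the Matrix Market coordinate layout: a banner line, optional `%` comment lines, one size line (`rows cols nonzeros`), then one `row col value` triplet per line with 1-based indices. The `read_mex_triplets` function is a hypothetical helper, the counts are made up, and integer values are assumed. For real files, a library reader such as scipy.io.mmread is the more robust choice.

```python
import io
from typing import Dict, TextIO, Tuple


def read_mex_triplets(
    handle: TextIO,
) -> Tuple[Tuple[int, int, int], Dict[Tuple[int, int], int]]:
    """Parse coordinate-format text into ((rows, cols, nnz), {(row, col): value}).

    Indices in the file are 1-based; they are converted to 0-based here.
    """
    shape = None
    entries: Dict[Tuple[int, int], int] = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        fields = line.split()
        if shape is None:
            # The first non-comment line holds the matrix dimensions.
            shape = (int(fields[0]), int(fields[1]), int(fields[2]))
            continue
        row, col, value = int(fields[0]), int(fields[1]), int(fields[2])
        entries[(row - 1, col - 1)] = value
    if shape is None:
        raise ValueError("no size line found")
    return shape, entries


# Toy matrix with made-up counts: 3 peaks (rows) x 2 barcodes (columns).
toy_mtx = io.StringIO(
    "%%MatrixMarket matrix coordinate integer general\n"
    "3 2 3\n"
    "1 1 2\n"
    "3 1 4\n"
    "2 2 6\n"
)
peaks = ["chr1:237588-237917", "chr1:564444-565537", "chr1:567478-568248"]
barcodes = ["AAACATACAAAACG-1", "AAACATACAAAAGC-1"]

shape, entries = read_mex_triplets(toy_mtx)
for (row, col), value in sorted(entries.items()):
    # Row index -> line in peaks.bed, column index -> line in barcodes.tsv.
    print(peaks[row], barcodes[col], value)
```

Only the nonzero entries appear in the file, which is why the row and column labels have to be kept in separate files.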
Features correspond to row indices. For each peak, its chromosome, start and end
positions are stored in the peaks.bed file:
$ head filtered_peak_bc_matrix/peaks.bed
chr1 237588 237917
chr1 564444 565537
chr1 567478 568248
chr1 569021 569641
chr1 713461 715293
chr1 752379 753032
chr1 762073 763379
chr1 773651 774064
chr1 779547 780286
chr1 793345 794375
These peaks match the original peaks called by the peak calling algorithms. They are duplicated in the mex/ directory for safekeeping while processing the matrices outside of the pipeline.
The fully specified and common names for each transcription factor in the reference are stored as the first and second columns in the motifs.tsv file:
$ head filtered_tf_bc_matrix/motifs.tsv
Arnt_HUMAN.MA0004.1 Arnt
Ahr::Arnt_HUMAN.MA0006.1 Ahr::Arnt
Ddit3::Cebpa_HUMAN.MA0019.1 Ddit3::Cebpa
NFIL3_HUMAN.MA0025.1 NFIL3
Mecom_HUMAN.MA0029.1 Mecom
FOXF2_HUMAN.MA0030.1 FOXF2
FOXD1_HUMAN.MA0031.1 FOXD1
Gfi1_HUMAN.MA0038.1 Gfi1
Foxq1_HUMAN.MA0040.1 Foxq1
Foxd3_HUMAN.MA0041.1 Foxd3
The fully specified transcription factor names correspond to the names in the motifs.pfm reference. The common names correspond to the prefix that precedes the species in the fully specified names in the first column.
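The relationship between the two columns can be sketched as a small string operation. This assumes, based only on the rows shown above, that the species token follows the last underscore in the fully specified name; `common_name` is a hypothetical helper, not part of Cell Ranger ATAC.

```python
def common_name(full_name: str) -> str:
    """Return the part of a fully specified motif name before the species token.

    Assumption: the species token (e.g. HUMAN) follows the last underscore.
    Names without an underscore are returned unchanged.
    """
    prefix, sep, _rest = full_name.rpartition("_")
    return prefix if sep else full_name


# Rows taken from the motifs.tsv excerpt above: (fully specified, common).
rows = [
    ("Arnt_HUMAN.MA0004.1", "Arnt"),
    ("Ahr::Arnt_HUMAN.MA0006.1", "Ahr::Arnt"),
    ("Ddit3::Cebpa_HUMAN.MA0019.1", "Ddit3::Cebpa"),
]
for full, expected in rows:
    print(full, "->", common_name(full), "(matches)" if common_name(full) == expected else "(differs)")
```

In practice the second column of motifs.tsv already provides the common name, so a derivation like this is mainly useful as a consistency check.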
For multi-species experiments, reference contigs (the first column in peaks.bed) are prefixed with the genome name to avoid name collisions between chromosomes of different species, e.g. chr1 becomes hg19_chr1.
|Cell Ranger ATAC does not produce the tf-barcode matrix for multi-species experiments.|
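When working with a multi-species peaks.bed, it can be handy to separate the genome prefix from the contig name. The sketch below assumes the list of genome names is known in advance. `split_contig` is a hypothetical helper, and `mm10` is only an illustrative second genome that does not come from this page.

```python
from typing import Iterable, Optional, Tuple


def split_contig(contig: str, genomes: Iterable[str]) -> Tuple[Optional[str], str]:
    """Split a prefixed contig such as hg19_chr1 into (genome, contig).

    Matching against known genome names avoids mis-splitting contigs that
    contain underscores themselves. Returns (None, contig) if nothing matches.
    """
    for genome in genomes:
        prefix = genome + "_"
        if contig.startswith(prefix):
            return genome, contig[len(prefix):]
    return None, contig


genomes = ["hg19", "mm10"]  # mm10 is an assumed example genome
print(split_contig("hg19_chr1", genomes))
print(split_contig("mm10_chr1", genomes))
print(split_contig("chr1", genomes))
```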
Barcode sequences correspond to column indices.
$ head filtered_peak_bc_matrix/barcodes.tsv
AAACATACAAAACG-1
AAACATACAAAAGC-1
AAACATACAAACAG-1
AAACATACAAACGA-1
AAACATACAAAGCA-1
AAACATACAAAGTG-1
AAACATACAACAGA-1
AAACATACAACCAC-1
AAACATACAACCGT-1
AAACATACAACCTG-1
Each barcode sequence includes a suffix with a dash separator followed by a number, for example the -1 in AAACATACAAAACG-1.
More details on the barcode sequence format are available in the barcoded BAM section.
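As a small illustration of the format described above, the following sketch separates a barcode into its nucleotide sequence and numeric suffix. `split_barcode` is a hypothetical helper; what the number means is covered in the barcoded BAM section and is not interpreted here.

```python
from typing import Tuple


def split_barcode(barcode: str) -> Tuple[str, int]:
    """Split a barcode such as AAACATACAAAACG-1 into (sequence, suffix number)."""
    sequence, sep, suffix = barcode.rpartition("-")
    if not sep or not sequence or not suffix.isdigit():
        raise ValueError(f"unexpected barcode format: {barcode!r}")
    return sequence, int(suffix)


print(split_barcode("AAACATACAAAACG-1"))
```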
R and Python support MEX format, and sparse matrices can be used for more efficient manipulation.
MEX files can be loaded directly into R, for example:
require(magrittr)
require(readr)
require(Matrix)
require(tidyr)
require(dplyr)

# peak-bc matrix
mex_dir_path <- "/opt/sample345/outs/filtered_peak_bc_matrix"
mtx_path <- paste(mex_dir_path, "matrix.mtx", sep = '/')
feature_path <- paste(mex_dir_path, "peaks.bed", sep = '/')
barcode_path <- paste(mex_dir_path, "barcodes.tsv", sep = '/')

# unite() joins the BED columns into a single label per peak
features <- readr::read_tsv(feature_path, col_names = FALSE) %>% tidyr::unite(feature)
barcodes <- readr::read_tsv(barcode_path, col_names = FALSE) %>% tidyr::unite(barcode)

mtx <- Matrix::readMM(mtx_path) %>%
  magrittr::set_rownames(features$feature) %>%
  magrittr::set_colnames(barcodes$barcode)

# tf-bc matrix
mex_dir_path <- "/opt/sample345/outs/filtered_tf_bc_matrix"
mtx_path <- paste(mex_dir_path, "matrix.mtx", sep = '/')
feature_path <- paste(mex_dir_path, "motifs.tsv", sep = '/')
barcode_path <- paste(mex_dir_path, "barcodes.tsv", sep = '/')

features <- readr::read_tsv(feature_path, col_names = c('feature', 'common_name'))
barcodes <- readr::read_tsv(barcode_path, col_names = FALSE) %>% tidyr::unite(barcode)

mtx <- Matrix::readMM(mtx_path) %>%
  magrittr::set_rownames(features$feature) %>%
  magrittr::set_colnames(barcodes$barcode)
The csv, os and scipy.io libraries are recommended for loading a feature-barcode matrix into Python.
import csv
import os
import scipy.io

# peak-bc matrix
matrix_dir = "/opt/sample345/outs/filtered_peak_bc_matrix"
mat = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx"))

# each peaks.bed row is (chromosome, start, end)
peaks_path = os.path.join(matrix_dir, "peaks.bed")
with open(peaks_path) as f:
    peaks = [(row[0], int(row[1]), int(row[2])) for row in csv.reader(f, delimiter="\t")]

# each barcodes.tsv row holds a single barcode
barcodes_path = os.path.join(matrix_dir, "barcodes.tsv")
with open(barcodes_path) as f:
    barcodes = [row[0] for row in csv.reader(f, delimiter="\t")]

# tf-bc matrix
matrix_dir = "/opt/sample345/outs/filtered_tf_bc_matrix"
mat = scipy.io.mmread(os.path.join(matrix_dir, "matrix.mtx"))

# motifs.tsv columns: fully specified name, common name
motifs_path = os.path.join(matrix_dir, "motifs.tsv")
with open(motifs_path) as f:
    motif_rows = list(csv.reader(f, delimiter="\t"))
motif_ids = [row[0] for row in motif_rows]
motif_names = [row[1] for row in motif_rows]
Cell Ranger ATAC represents the feature-barcode matrices using sparse formats (only the nonzero entries are stored) in order to cut down on file size.
However, certain programs (e.g. Excel) only support dense formats (where every row-column entry is explicitly stored, even if it's a zero). You can convert a feature-barcode matrix to dense CSV format using the cellranger mat2csv command from the Cell Ranger software. This command takes two arguments: an input matrix generated by Cell Ranger ATAC (either an H5 file or a MEX directory) and an output path for the dense CSV. For example, to convert a matrix from a pipestance named sample123 in the current directory, either of the following commands would work:
# convert from MEX
$ cellranger mat2csv sample123/outs/filtered_peak_bc_matrix sample123.csv

# or, convert from H5
$ cellranger mat2csv sample123/outs/filtered_peak_bc_matrix_h5.h5 sample123.csv
You can then load sample123.csv into Excel.
WARNING: dense files can be very large and may cause Excel to crash, or may even cause mat2csv to fail if your computer doesn't have enough memory. For example, a peak-barcode matrix with 30k peaks and ~3k barcodes uses at least 200 MB of disk space.
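A back-of-the-envelope estimate shows why dense output grows so quickly. The sketch below assumes each cell of the dense CSV costs at least two bytes (one digit plus one separator or newline). It ignores the header row, the row labels, and multi-digit counts, so it is only a rough lower bound and not a description of the exact file that mat2csv writes. `dense_csv_lower_bound_bytes` is a hypothetical helper.

```python
def dense_csv_lower_bound_bytes(
    n_features: int, n_barcodes: int, bytes_per_cell: int = 2
) -> int:
    """Rough lower bound on dense CSV size: every cell is written, even zeros.

    Assumption: a cell takes at least one digit plus one separator character.
    """
    return n_features * n_barcodes * bytes_per_cell


size = dense_csv_lower_bound_bytes(30_000, 3_000)
print(f"{30_000 * 3_000:,} cells, about {size / 1e6:.0f} MB before labels")
```

For 30k peaks and 3k barcodes this gives 90 million cells and roughly 180 MB before labels and larger counts are added, which is in line with the figure quoted in the warning. A sparse file, by contrast, grows only with the number of nonzero entries.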