10x Genomics
Chromium Single Cell CNV

Evaluating Single Cell CNVs against Bulk CNVs

We discuss here in detail how we calculated the sensitivity and positive predictive value (PPV) of copy number calls made using the cellranger-dna cnv pipeline on the 5k MKN-45 Gastric Cancer Cell Line dataset. We chose this cell line because it shows a relatively low level of cell-to-cell variation in copy number and has a broad spectrum of copy number events, ranging from less than 100 kb to more than 10 Mb. The low heterogeneity allowed us to evaluate the single cell copy number calls using bulk data.

To validate the copy number calling performance, we created a second library, an Illumina PCR-free TruSeq library, from bulk genomic DNA extracted from the same cell line and sequenced it to a depth of 700 million reads (2x150 paired-end reads). We called copy number variants on the bulk TruSeq data using Ginkgo (see the publication for more details on the method) with a bin size of 5 kb. The resulting CNV calls can be found here. We treat the bulk CNV calls as true events in the cell line and calculate the sensitivity and PPV for event detection among the single cells and groups of cells.

We divided the GRCh37 reference genome into 20 kb bins and selected bins where at least 70% of the 35-mers in the bin were unique in the genome. We restricted our comparison of CNV calls to these "confident regions". We consider an event detected if one or more CNV calls match the copy number of the event and together overlap the event region by at least 50%. A CNV call is a false positive if the corresponding genomic region in the ground truth has a different copy number. The sensitivity is defined as the fraction of total events detected, and the PPV is the fraction of all CNV calls that are true. Additionally, since the algorithms detect copy number changes, a segment with copy number 2 between two segments with copy numbers not equal to 2 is considered an event.
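The detection rule and the two metrics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the evaluation: the `Segment` type and all function names are hypothetical, calls are assumed not to overlap one another, and coordinates are assumed to be already restricted to the confident regions. The source does not give an overlap threshold for deciding whether a call is true, so the sketch assumes the same 50% rule in that direction.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical record for a bulk event or a single cell CNV call."""
    chrom: str
    start: int        # 0-based, inclusive
    end: int          # exclusive
    copy_number: int


def overlap(a: Segment, b: Segment) -> int:
    """Number of bases shared by two segments (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def is_detected(event: Segment, calls: List[Segment], min_frac: float = 0.5) -> bool:
    """An event is detected if calls with the same copy number together
    cover at least min_frac of it (calls are assumed non-overlapping)."""
    covered = sum(overlap(event, c) for c in calls
                  if c.copy_number == event.copy_number)
    return covered >= min_frac * (event.end - event.start)


def is_true_call(call: Segment, truth: List[Segment], min_frac: float = 0.5) -> bool:
    """ASSUMPTION: a call counts as true if truth segments with the same
    copy number cover at least min_frac of the call."""
    matched = sum(overlap(call, t) for t in truth
                  if t.copy_number == call.copy_number)
    return matched >= min_frac * (call.end - call.start)


def sensitivity_and_ppv(truth: List[Segment], calls: List[Segment]) -> Tuple[float, float]:
    """Sensitivity = detected events / all events; PPV = true calls / all calls."""
    detected = sum(is_detected(e, calls) for e in truth)
    true_calls = sum(is_true_call(c, truth) for c in calls)
    sens = detected / len(truth) if truth else float("nan")
    ppv = true_calls / len(calls) if calls else float("nan")
    return sens, ppv


# Toy data, invented for illustration only.
truth = [Segment("chr1", 0, 1000, 3), Segment("chr1", 5000, 6000, 1)]
calls = [Segment("chr1", 0, 400, 3), Segment("chr1", 500, 700, 3),
         Segment("chr1", 5000, 5400, 1), Segment("chr2", 0, 100, 4)]
sens, ppv = sensitivity_and_ppv(truth, calls)
```

In the toy data, the first event is covered 60% by two matching calls and counts as detected, while the second is covered only 40% and does not. Three of the four calls agree with the truth, and the call on chr2 has no matching truth segment.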

The figure below shows the single cell CNV detection performance. The vast majority of points in the left panel are in the upper right quadrant, with high sensitivity and PPV. The points in red are cells that are labeled as noisy, the majority of which are cells undergoing DNA replication. When we restrict our attention to the dominant homogeneous cluster in the population, Group 9303, the cells cluster in the upper right quadrant. The median sensitivity and PPV are both greater than 90% for events in the size range of 1 - 2 megabases (84 such events).

Figure 1 Sensitivity vs PPV for single cell CNV detection compared against bulk CNVs for the same sample. Left: each point is a cell and the noisy cells are colored red; right: all cells in Group 9303, a homogeneous cluster of 2,408 cells. Dashed black lines show the median sensitivity and PPV over all the blue points.
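The dashed median lines described in the caption can be reproduced from per-cell results with a short helper. This is a hedged sketch: the function name and the `(sensitivity, ppv, is_noisy)` tuple layout are assumptions made for illustration, not part of the cellranger-dna output format.

```python
from statistics import median


def median_sens_ppv(cells):
    """Return (median sensitivity, median PPV) over cells not flagged as noisy.

    cells: iterable of (sensitivity, ppv, is_noisy) tuples, one per cell
    (a hypothetical layout chosen for this example).
    """
    kept = [(s, p) for s, p, noisy in cells if not noisy]
    if not kept:
        raise ValueError("no non-noisy cells to summarize")
    return median(s for s, _ in kept), median(p for _, p in kept)


# Toy values, invented for illustration only; the last cell is flagged noisy.
cells = [(0.9, 0.95, False), (0.8, 0.85, False), (1.0, 0.9, False), (0.2, 0.3, True)]
med_sens, med_ppv = median_sens_ppv(cells)
```

Excluding the noisy cells before taking the median mirrors the caption, where the medians are computed over the blue points only.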
 

The quality of CNV detection improves as we aggregate similar cells. The CNV detection performance for the 2,408 cells aggregated in Group 9303 is as follows:

Event size               Number of events   Sensitivity (%)   PPV (%)
100 kb < size < 200 kb   108                91                95
200 kb < size < 500 kb   109                95                98
500 kb < size < 1 Mb     58                 97                100
1 Mb < size < 2 Mb       84                 98                98
2 Mb < size < 10 Mb      196                100               100
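A breakdown by event size like the one above can be produced by assigning each bulk event to a size class and computing sensitivity within each class. The following is a minimal sketch under stated assumptions: the class labels, function names, and the `(length_bp, detected)` input layout are hypothetical, and the boundaries are treated as strict inequalities, so an event whose length falls exactly on a boundary is left unclassified.

```python
from collections import defaultdict

# Size classes in base pairs, mirroring the ranges reported for Group 9303.
SIZE_CLASSES = [
    ("100 kb to 200 kb", 100_000, 200_000),
    ("200 kb to 500 kb", 200_000, 500_000),
    ("500 kb to 1 Mb", 500_000, 1_000_000),
    ("1 Mb to 2 Mb", 1_000_000, 2_000_000),
    ("2 Mb to 10 Mb", 2_000_000, 10_000_000),
]


def size_class(length_bp):
    """Return the label of the class containing length_bp, or None."""
    for label, low, high in SIZE_CLASSES:
        if low < length_bp < high:
            return label
    return None


def sensitivity_by_size(events):
    """events: iterable of (length_bp, detected) pairs (hypothetical layout).

    Returns {label: (number of events, sensitivity in percent)}.
    """
    counts = defaultdict(lambda: [0, 0])  # label -> [n_events, n_detected]
    for length_bp, detected in events:
        label = size_class(length_bp)
        if label is None:
            continue  # outside every reported size range
        counts[label][0] += 1
        counts[label][1] += int(bool(detected))
    return {label: (n, 100.0 * d / n) for label, (n, d) in counts.items()}


# Toy events, invented for illustration only.
events = [(150_000, True), (150_000, False), (1_500_000, True), (50_000, True)]
by_size = sensitivity_by_size(events)
```

PPV per size class can be tabulated the same way by classifying calls instead of events and counting the calls that agree with the bulk truth.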

We computed the sensitivity and PPV for each subcluster of the majority cluster, as defined by the tree; the results are shown in the figure below. The sensitivity and PPV of the full majority cluster are already reached at a subcluster size of around 10 cells. This demonstrates that high quality CNV event detection is possible in a 1% subpopulation of a 1000-cell experiment.

Figure 2 Sensitivity and PPV for events in the ranges 100kb-200kb and 200kb-500kb for subclusters of cells within Group 9303, as defined by the tree, as a function of subcluster size. The solid line shows the median value, while the shaded region around the median is the interquartile range.