
# Single cell copy number calling

The observed coverage per cell for a fixed 20 kb genome bin is influenced by

• bin mappability : regions with low mappability produce fewer read pairs
• bin GC content : drop in coverage in regions of very low or very high GC content
• scale factor : the number of read pairs generated by a single fragment of DNA in a partition, which is a function of both the underlying library complexity and the sequencing depth
• bin copy number : bins with higher copy number have higher coverage

These effects can be incorporated into a generative model where DNA in a fixed, unknown copy number state "emits" read pairs that are observed according to a Poisson distribution. The mean of the Poisson distribution is determined by the product of the copy number, the effect of GC content, the mappability, and the scale factor: $$X_i \ \sim \ \text{Poisson} ( S \ p_i \ Q(g_i) \ m_i)$$ where i is an index over all the mappable bins, X is the read coverage, S is the scale factor, p is the copy number, Q(g) is the impact of GC content g, and m the mappability.
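The model above can be illustrated with a minimal Python sketch that computes the Poisson mean for one bin and evaluates the log-probability of an observed count. The function names and the example values (S = 10, copy number 2, mappability 0.95) are hypothetical and chosen only for illustration; the GC effect Q(g) is passed in as a precomputed factor.

```python
import math

def poisson_mean(scale, copy_number, gc_factor, mappability):
    """Expected read-pair count S * p * Q(g) * m for a single bin."""
    return scale * copy_number * gc_factor * mappability

def poisson_logpmf(x, mean):
    """log Prob(X = x) for X ~ Poisson(mean), with mean > 0."""
    return x * math.log(mean) - mean - math.lgamma(x + 1)

# Hypothetical bin: S = 10, copy number 2, no GC effect, mappability 0.95.
mean = poisson_mean(scale=10.0, copy_number=2, gc_factor=1.0, mappability=0.95)

# A count near the mean is far more likely than a count twice as large.
near = poisson_logpmf(19, mean)
far = poisson_logpmf(40, mean)
```

Under this sketch a bin with copy number 4 and otherwise identical covariates would have twice the expected count, which is what produces the distinct coverage bands discussed later.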

## Correcting for mappability

We restrict the estimation of copy number to mappable 20 kb bins of the genome, defined as bins with mappability greater than 90% as determined by the simulation procedure described earlier. These highly mappable bins comprise 85-90% of the human reference genome GRCh37, depending on the sequenced read length and insert size. After restricting ourselves to mappable bins, we include an additional mappability term (m) for each bin in estimates of read counts to correct for any residual modulation effect due to varying mappability.

## Correcting for GC bias

To determine the effect of GC content we aggregate the read coverage across a window of consecutive mappable bins to minimize the effects of sampling noise at low depth. The aggregation window w is approximately the number of 20 kb bins that would have to be aggregated to reach a mean read count of 200 reads per window.
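As a rough illustration of how the aggregation window could be chosen, the sketch below takes w as the smallest number of bins whose expected total reaches 200 reads. The text only says the window is "approximately" this size, so the ceiling rule and the helper names here are assumptions.

```python
import math

def aggregation_window(bin_counts, target_reads=200):
    """Number of 20 kb bins to aggregate so a window averages ~target_reads."""
    mean_per_bin = sum(bin_counts) / len(bin_counts)
    if mean_per_bin <= 0:
        raise ValueError("no reads observed")
    return max(1, math.ceil(target_reads / mean_per_bin))

def aggregate(values, w):
    """Sum values over consecutive, non-overlapping windows of w bins.

    A trailing partial window is dropped.
    """
    return [sum(values[i:i + w]) for i in range(0, len(values) - w + 1, w)]

# Hypothetical cell with about 10 read pairs per mappable bin.
counts = [10] * 100
w = aggregation_window(counts)
windowed = aggregate(counts, w)
```

With about 10 reads per bin this gives w = 20, that is, 400 kb windows.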

The effect of GC content on read coverage is made visually clear in the top panel of the plot above, where the read coverage (scaled to a mean of 1.0) is plotted against the GC content for a window size w = 20 (or 400 kb). The distinct bands are due to copy number variation in the cell: the lowest band corresponds to segments of copy number one, the next band to segments of copy number two, etc. The bottom panel shows the results after our GC correction procedure. The scaled GC-corrected coverage shows no relationship to the GC content.

We model the coverage variation as a multiplicative two-parameter (l, q) quadratic function Q(g; l, q) of the GC content such that the variation is 1.0 at an arbitrarily chosen value GC = 0.45. When l and q are both zero, Q is identically 1.0, which represents no variation of coverage with GC content. Given values of l, q we define the GC-corrected coverage y from the read coverage x as $$y_i = \frac{x_i}{m_i Q(g_i; l, q)}$$ We find the best fit values of l, q by minimizing the entropy of the histogram of y.
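The entropy criterion can be sketched in Python as follows. The explicit form Q(g; l, q) = 1 + l (g - 0.45) + q (g - 0.45)^2 is an assumption consistent with the description (quadratic, equal to 1.0 at GC = 0.45, flat when l = q = 0); the histogram bin width, the grid search, and all function names are likewise illustrative choices rather than the pipeline's actual settings.

```python
import math
from collections import Counter

GC_REF = 0.45  # GC content at which Q is defined to be 1.0

def gc_effect(g, lin, quad):
    """Assumed quadratic form of Q(g; l, q)."""
    d = g - GC_REF
    return 1.0 + lin * d + quad * d * d

def corrected_coverage(x, m, g, lin, quad):
    """y_i = x_i / (m_i * Q(g_i)), rescaled to a mean of 1.0."""
    y = [xi / (mi * gc_effect(gi, lin, quad)) for xi, mi, gi in zip(x, m, g)]
    mean = sum(y) / len(y)
    return [v / mean for v in y]

def histogram_entropy(y, bin_width=0.03):
    """Shannon entropy of a fixed-width histogram of y."""
    counts = Counter(math.floor(v / bin_width) for v in y)
    n = len(y)
    return -sum(c / n * math.log(c / n) for c in counts.values())

def fit_gc(x, m, g, lin_grid, quad_grid):
    """Grid search for the (entropy, l, q) triple with the lowest entropy."""
    best = None
    for lin in lin_grid:
        for quad in quad_grid:
            if min(gc_effect(gi, lin, quad) for gi in g) <= 0:
                continue  # skip parameter values that give a non-positive Q
            h = histogram_entropy(corrected_coverage(x, m, g, lin, quad))
            if best is None or h < best[0]:
                best = (h, lin, quad)
    return best

# Idealized, noise-free coverage with copy numbers 2 and 4 and a known GC bias.
n = 200
g = [0.30 + 0.30 * i / (n - 1) for i in range(n)]
m = [1.0] * n
p = [2 if i % 4 else 4 for i in range(n)]
x = [10.0 * pi * gc_effect(gi, 1.0, -5.0) for pi, gi in zip(p, g)]

h_raw = histogram_entropy(corrected_coverage(x, m, g, 0.0, 0.0))
h_true = histogram_entropy(corrected_coverage(x, m, g, 1.0, -5.0))
best_h, best_lin, best_quad = fit_gc(x, m, g, [0.0, 0.5, 1.0, 1.5], [-10.0, -5.0, 0.0])
```

The intuition is that correct parameters collapse the corrected coverage onto a few sharp copy number bands, which concentrates the histogram and lowers its entropy, whereas residual GC bias smears each band across many histogram bins.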

## Segmenting the genome

We identify breakpoints in the read coverage that separate adjacent regions with distinct copy number by successive application of a log-likelihood ratio test based on the Poisson read-emission model described in the beginning of this page. Once all the breakpoints are identified the regions between two adjacent breakpoints define segments with uniform copy number.

For each mappable bin i in the genome and a half-open interval [l, r) around i we define the log-likelihood ratio (LLR) as $$\mathrm{LLR}(i; l, r) = \sum_{j \in [l, i)} \log \frac{\mathrm{Prob}\left(x_j \ | \ \mu_{[l, i)}\right)}{\mathrm{Prob}\left(x_j \ | \ \mu_{[l, r)}\right)} + \sum_{j \in [i, r)} \log \frac{\mathrm{Prob}\left(x_j \ | \ \mu_{[i, r)}\right)}{\mathrm{Prob}\left(x_j \ | \ \mu_{[l, r)}\right)}$$ where $$\mu_{[a, b)} = \frac{\sum_{a \le k < b} x_k}{\sum_{a \le k < b} m_k Q(g_k; l, q)}$$ (inside Q, l and q are the GC parameters fitted in the previous section, not the interval endpoint l). The LLR helps decide between the hypotheses:

1. The interval [l, r) has uniform copy number.
2. The sub-intervals [l, i) and [i, r) have different copy numbers.

Both hypotheses are evaluated under the Poisson generative model outlined at the beginning of the page. We choose a significance threshold of 5: values of LLR greater than 5 favor hypothesis 2, making i a candidate breakpoint, while values less than 5 favor hypothesis 1.
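The LLR can be computed directly from its definition, as in the sketch below. Here `e[j]` stands for the per-bin factor m_j Q(g_j; l, q) (called "exposure" in the code, a name introduced only for this example), so that the Poisson mean of bin j under a segment rate mu is mu * e[j].

```python
import math

def poisson_logpmf(x, mean):
    """log Prob(X = x) for X ~ Poisson(mean); handles mean == 0."""
    if mean == 0.0:
        return 0.0 if x == 0 else float("-inf")
    return x * math.log(mean) - mean - math.lgamma(x + 1)

def segment_rate(x, e, a, b):
    """mu_[a, b): total counts divided by total exposure over bins a..b-1."""
    return sum(x[a:b]) / sum(e[a:b])

def llr(x, e, i, l, r):
    """Log-likelihood ratio for splitting [l, r) at i (requires l < i < r)."""
    mu_all = segment_rate(x, e, l, r)
    mu_left = segment_rate(x, e, l, i)
    mu_right = segment_rate(x, e, i, r)
    total = 0.0
    for j in range(l, i):
        total += poisson_logpmf(x[j], mu_left * e[j]) - poisson_logpmf(x[j], mu_all * e[j])
    for j in range(i, r):
        total += poisson_logpmf(x[j], mu_right * e[j]) - poisson_logpmf(x[j], mu_all * e[j])
    return total

# Hypothetical counts: a step from 20 to 30 reads per bin versus a flat profile.
exposure = [1.0] * 20
step = [20] * 10 + [30] * 10
flat = [20] * 20

llr_step = llr(step, exposure, 10, 0, 20)
llr_flat = llr(flat, exposure, 10, 0, 20)
```

For the step profile the LLR is about 10, above the threshold of 5, while the flat profile gives exactly 0.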

We calculate the LLR at each mappable bin i using a symmetric interval [i-w, i+w) around i, where w is the aggregation window defined in the previous section. The figure below shows the read-pair counts (x) for a single cell over 500 mappable bins (10 Mb) in the top panel and the calculated LLR for each bin in the bottom panel. The peak above the significance threshold of 5 indicates the presence of one or more breakpoints.

We use the following algorithm to identify breakpoints

1. We select all local maxima of the LLR, calculated using the aggregation window w, that lie above the significance threshold of 5 as an initial candidate breakpoint set B0.
2. For each triple of consecutive breakpoints l, b, r we calculate LLR(b; l, r) and discard b if LLR(b; l, r) < 5. This gives a smaller set of breakpoints B1.
3. For consecutive breakpoints l, r in B1 we calculate LLR(i; l, r) for each bin i in (l, r-1) and if LLR(i; l, r) > 5, we add i as a breakpoint. We repeat until no more breakpoints can be added giving us the final set of breakpoints.
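The three steps above can be sketched as follows. This is one possible reading, not the pipeline's implementation: the genome ends are used as outer boundaries, step 2 is evaluated in a single pass, and step 3 adds only the highest-scoring bin per segment in each round. The LLR is written in an algebraically equivalent closed form (the factorial and linear terms of the Poisson log-probabilities cancel), and `e[j]` again denotes m_j Q(g_j; l, q).

```python
import math

THRESHOLD = 5.0

def llr(x, e, i, l, r):
    """Poisson LLR for splitting [l, r) at i, in closed form (l < i < r)."""
    xl, xr = sum(x[l:i]), sum(x[i:r])
    el, er = sum(e[l:i]), sum(e[i:r])
    mu_all = (xl + xr) / (el + er)
    out = 0.0
    for xs, es in ((xl, el), (xr, er)):
        if xs > 0:
            out += xs * math.log((xs / es) / mu_all)
    return out

def find_breakpoints(x, e, w, threshold=THRESHOLD):
    n = len(x)
    # Step 1: local maxima of the windowed LLR above the threshold (set B0).
    score = {i: llr(x, e, i, i - w, i + w) for i in range(w, n - w + 1)}
    b0 = [i for i, s in score.items()
          if s > threshold
          and s >= score.get(i - 1, -math.inf)
          and s > score.get(i + 1, -math.inf)]
    # Step 2: re-test each candidate against its neighbours (set B1).
    bounds = [0] + b0 + [n]
    bps = [b for l, b, r in zip(bounds, bounds[1:], bounds[2:])
           if llr(x, e, b, l, r) >= threshold]
    # Step 3: search inside each segment for further breakpoints until stable.
    added = True
    while added:
        added = False
        bounds = [0] + bps + [n]
        for l, r in zip(bounds, bounds[1:]):
            if r - l < 2:
                continue
            best = max(range(l + 1, r), key=lambda i: llr(x, e, i, l, r))
            if llr(x, e, best, l, r) > threshold:
                bps.append(best)
                added = True
        bps.sort()
    return bps

# Hypothetical profile with a single copy number change at bin 30.
x = [20] * 30 + [40] * 30
e = [1.0] * 60
breakpoints = find_breakpoints(x, e, w=10)
flat_breakpoints = find_breakpoints([20] * 60, e, w=10)
```

On this noise-free example the only breakpoint found is bin 30, and a flat profile yields none.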

We divide the genome into a set D of non-overlapping segments [l, r) based on the final breakpoint set where each segment is bounded by consecutive breakpoints l and r.

## Scale factor estimation

After segmentation we determine the integer copy number of each segment in D by estimating the scale factor S. We define an objective function as the length-weighted sum over all segments [l, r) $$O(S) = \sum_{[l, r) \in D} (r-l) \sin^2 \left( \frac{\pi \mu_{[l, r)}'}{S}\right)$$ where $$\mu_{[l, r)}' = \mu_{[l, r)}w$$ is the segment mean, that is, the mean normalized read-pair count of the segment scaled to the aggregation window w. The objective function is minimized when each segment mean is an integer multiple of the scale factor S, because every sine term then vanishes. The objective function has multiple local minima, and each minimum defines a candidate copy number solution $$C_i = {\rm round}\left(\frac{\mu_{[l, r)}'}{S}\right)$$ where [l, r) is the unique segment that contains the bin i. We filter out candidate solutions that are poor fits to the data or correspond to an average copy number greater than 8. We use the following heuristic to determine the best candidate copy number solution:

• if most of the genome has the same copy number, we pick the copy number solution where the average ploidy is closest to 2, otherwise
• pick the copy number solution that corresponds to the lowest value of the objective function.

This procedure results in accurate copy number profiling over regions of the genome with high mappability.
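To make the objective function concrete, the sketch below evaluates O(S) on a grid of scale factors, collects local minima as candidate solutions, and applies only the average-ploidy filter. The grid, the segment representation as (length, mean) pairs, and the function names are assumptions; the fit-quality filter and the final selection heuristic are left out because their thresholds are not stated here.

```python
import math

def objective(segments, scale):
    """O(S): length-weighted sum of sin^2(pi * mean / S) over all segments."""
    return sum(length * math.sin(math.pi * mean / scale) ** 2
               for length, mean in segments)

def copy_numbers(segments, scale):
    """Integer copy number of each segment for a given scale factor."""
    return [round(mean / scale) for _, mean in segments]

def mean_ploidy(segments, scale):
    """Length-weighted average copy number."""
    total = sum(length for length, _ in segments)
    weighted = sum(length * cn
                   for (length, _), cn in zip(segments, copy_numbers(segments, scale)))
    return weighted / total

def candidate_scales(segments, grid, max_ploidy=8.0):
    """Grid points that are local minima of O(S) and pass the ploidy filter."""
    values = [objective(segments, s) for s in grid]
    return [grid[k] for k in range(1, len(grid) - 1)
            if values[k] < values[k - 1] and values[k] <= values[k + 1]
            and mean_ploidy(segments, grid[k]) <= max_ploidy]

# Hypothetical genome: 80 bins with segment mean 20 and 20 bins with mean 30.
segments = [(80, 20.0), (20, 30.0)]
grid = [k / 20 for k in range(40, 801)]  # S from 2.0 to 40.0 in steps of 0.05
candidates = candidate_scales(segments, grid)
```

In this example S = 10 is a candidate with copy numbers 2 and 3, and S = 5 is another with copy numbers 4 and 6, which shows why a heuristic is needed to choose among the minima.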

## Copy number imputation in unmappable regions

For each region R with low mappability (simulated mappability < 0.90) we consider the copy number values determined for the adjacent bins with high mappability. When the values agree and R is shorter than 500 kb, we set the copy number across R equal to the copy number of the adjacent bins. When this is not the case, for example in large centromeric regions, we designate R as a no call. See HDF5 Output for more information about how this imputation is conveyed in the outputs.
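The imputation rule can be sketched on a per-bin list of copy numbers, where `None` marks a low-mappability bin. This representation and the function name are assumptions made for the example, and chromosome boundaries are ignored for brevity.

```python
BIN_SIZE_KB = 20
MAX_GAP_KB = 500

def impute(copy_numbers):
    """Fill short runs of None whose flanking called bins agree.

    Runs that are too long, or whose flanks differ or are missing,
    stay None (a no call).
    """
    out = list(copy_numbers)
    n = len(out)
    i = 0
    while i < n:
        if copy_numbers[i] is not None:
            i += 1
            continue
        j = i
        while j < n and copy_numbers[j] is None:
            j += 1  # j is one past the end of the unmappable run
        left = copy_numbers[i - 1] if i > 0 else None
        right = copy_numbers[j] if j < n else None
        if left is not None and left == right and (j - i) * BIN_SIZE_KB < MAX_GAP_KB:
            out[i:j] = [left] * (j - i)
        i = j
    return out

short_gap = impute([2, None, None, 2, None, 3])
long_gap = impute([2] + [None] * 25 + [2])
```

The first run is filled with copy number 2, the run between 2 and 3 stays a no call because the flanks disagree, and the 25-bin (500 kb) run stays a no call because it is not shorter than 500 kb.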

# Computing copy number event confidence scores

For every copy number event n in the interval [l, r) we induce a Negative Binomial distribution (or a Poisson distribution if the sample variance is less than or equal to the mean) using the sample statistics gathered over all bins in [l, r). We employ an alternative parameterization of the Negative Binomial in which it is treated as an overdispersed Poisson with mean (μ) and standard deviation (σ).
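A mean and standard deviation parameterization of the Negative Binomial can be mapped onto the standard (r, p) form with r = μ² / (σ² - μ) and p = μ / σ², which is valid only when σ² > μ. The sketch below uses that mapping and falls back to the Poisson otherwise; the function names are illustrative.

```python
import math

def poisson_logpmf(k, mean):
    """log Prob(K = k) for K ~ Poisson(mean), with mean > 0."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def count_logpmf(k, mean, sd):
    """log-probability of count k under NB(mean, sd), or Poisson if var <= mean."""
    var = sd * sd
    if var <= mean:
        return poisson_logpmf(k, mean)
    r = mean * mean / (var - mean)  # NB size (dispersion) parameter
    p = mean / var                  # NB success probability
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(p) + k * math.log(1.0 - p))

# Sanity check on a hypothetical segment with mean 10 and standard deviation 5.
total = sum(math.exp(count_logpmf(k, 10.0, 5.0)) for k in range(400))
first_moment = sum(k * math.exp(count_logpmf(k, 10.0, 5.0)) for k in range(400))
```

The probabilities sum to 1 and the first moment recovers the requested mean, which confirms that the mapping is self-consistent.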

The probability of error is approximated as $${\rm P}(n) = 1 - \sum_{k = C_n - 0.5}^{C_n + 0.5} {\rm NB}\left(\frac{Sk}{r - l} \sum_{i = l}^{r - 1} m_i Q(g_i; l, q); \mu_n, \sigma_n \right)$$ We calculate the confidence score for the event using the familiar Q-score translation: $${\rm conf}(n) = -10 \log_{10} {\rm P}(n)$$ The confidence score is then rounded, clipped to the interval [0, 256), and represented as uint8. When we have a good fit to the copy number n hypothesis, P(n) is very close to 0, and the confidence score is large and positive.
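The Q-score translation and the uint8 range can be illustrated with a short sketch. Capping the score at 255 when the error probability is zero is an assumption made here so that the logarithm is never taken of zero.

```python
import math

def confidence_score(p_error):
    """Phred-style score -10 * log10(P), rounded and clipped to 0..255."""
    if p_error <= 0.0:
        return 255  # assumed cap for a vanishing error probability
    q = -10.0 * math.log10(p_error)
    return int(min(255, max(0, round(q))))

scores = [confidence_score(p) for p in (1.0, 0.1, 0.001, 1e-40, 0.0)]
```

An error probability of 0.001 corresponds to a score of 30, and anything at or below 10^-25.5 saturates at 255.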
