Supplemental Information: Method

Input data. For the purpose of this work, a cell from a man was deemed to have a normal diploid DNA copy-number (CN) profile if the CN was two for at least 99% of the autosome and one for at least 95% of chromosome X. With this definition, CN profiles of approximately 4,300 individual cells collected for a study of prostate biopsies were examined and, among these, 1,306 cells with normal diploid CN profiles were identified. These profiles were derived from low-pass sequencing of single-cell genomes, as previously described [1, 2]. A sequence read set was available for each of the selected cells, with the number of uniquely mapping, PCR-duplicate-free reads per cell ranging from 3.4×10^5 to 6.7×10^6 (median 1.84×10^6).
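As an illustration only, the selection criterion above can be sketched in Python as follows. The segment representation (chromosome, start, end, CN), the function name and the choice to measure the 99% and 95% fractions by genomic length are assumptions made for this sketch; the text does not specify how the fractions were computed.

```python
def is_normal_male_diploid(segments, autosome_min=0.99, chrx_min=0.95):
    """Check the normal-diploid criterion for a CN profile of a male cell.

    segments: iterable of (chrom, start, end, cn) tuples with half-open
    intervals [start, end). This layout is hypothetical.
    """
    auto_total = auto_ok = x_total = x_ok = 0
    for chrom, start, end, cn in segments:
        length = end - start
        if chrom == "X":
            x_total += length
            if cn == 1:          # one copy of chromosome X expected
                x_ok += length
        elif chrom != "Y":       # every other chromosome except Y is autosomal
            auto_total += length
            if cn == 2:          # two copies expected in the autosome
                auto_ok += length
    if auto_total == 0 or x_total == 0:
        return False             # profile lacks the regions needed for the test
    return (auto_ok / auto_total >= autosome_min
            and x_ok / x_total >= chrx_min)


# Toy profiles on a made-up coordinate scale.
normal = [("1", 0, 1000, 2), ("2", 0, 1000, 2), ("X", 0, 100, 1)]
with_gain = [("1", 0, 900, 2), ("1", 900, 1000, 3),
             ("2", 0, 1000, 2), ("X", 0, 100, 1)]
print(is_normal_male_diploid(normal), is_normal_male_diploid(with_gain))
```

In the toy data, the second profile carries a gain over 5% of the autosome and therefore fails the 99% requirement.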

Integer-valued CN profiles of tumors were previously derived using ABSOLUTE [3] for 3,852 patient cases in the TCGA Pan-Cancer collection [4], with 11 solid-tissue tumor types represented in this set.

Simulation of genomic sequence read sets of individual circulating cells at reduced coverage. Given a mapped sequence read set R of a single-cell diploid genome and an integer genome-wide CN profile P, the following procedure was adopted to simulate a read set of a cell with the CN profile P at a coverage reduced by a factor F relative to that of R. First, R is interpreted as a read count profile in which, for each read r in R, a read count C(x(r)) = 1 is assigned to the genomic position x(r) to which the 5’ end of r maps. Next, the read counts are resampled: a new read count c(x(r)) is drawn at random from a Poisson distribution with mean C(x(r))·P(x(r))/(2F), where P(x(r)) is the CN value of P at the position x(r). The reduction factor F = 16 was used throughout.
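The resampling step can be illustrated with a minimal Python sketch. The data layout (a list of read positions and a callable returning the CN at a position), the function names and the use of Knuth's multiplication method for Poisson draws are assumptions of this sketch, not a description of the software actually used.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw from a Poisson distribution by Knuth's multiplication method.

    Adequate for the small means arising here (CN / 2 / F with F = 16).
    """
    if mean <= 0:
        return 0
    limit = math.exp(-mean)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_read_counts(read_positions, copy_number_at, reduction_factor=16, seed=0):
    """Resample a read set to mimic a cell with a given CN profile.

    read_positions: 5' mapping positions of the reads in R, one entry per read,
                    so each entry carries the count C = 1.
    copy_number_at: hypothetical callable returning the CN of P at a position.
    Returns a dict mapping position to the simulated read count c.
    """
    rng = random.Random(seed)
    counts = {}
    for pos in read_positions:
        mean = 1 * copy_number_at(pos) / 2 / reduction_factor
        c = poisson_draw(mean, rng)
        if c:
            counts[pos] = counts.get(pos, 0) + c
    return counts


# Toy usage: 32,000 reads and a diploid profile, so about 32,000 / 16 = 2,000
# reads are expected after down-sampling.
reads = [("chr1", i) for i in range(32000)]
diploid = simulate_read_counts(reads, lambda pos: 2)
print(sum(diploid.values()))
```

With P(x) = 2 the mean is 1/F per input read, so the procedure reduces to plain down-sampling; with P(x) = 4 the expected number of retained reads doubles.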

Three types of integer copy-number profiles P were used to generate reduced-coverage count profiles as described: (a) a diploid profile with P(x) = 2 for any genomic position x in the autosome; (b) a tumor profile from the TCGA pan-cancer collection of 3,852 profiles; and (c) a tumor-like profile. These represent, respectively, the ND, CR and UTL cell populations, as defined in the main text. The tumor-like profiles were derived from the TCGA pan-cancer collection by permuting homologous chromosomes at random among the profiles. Thus, for example, chromosomes 2 of profile U may replace chromosomes 2 of profile V, while chromosomes 5 of the latter profile may be reassigned as chromosomes 5 of profile W, etc.
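The chromosome permutation used to build tumor-like profiles may be sketched as follows. The representation of a profile as a dictionary keyed by chromosome name, and the function name, are hypothetical; any additional constraints that may have been imposed on the permutations are not stated in the text and are not modeled here.

```python
import random


def make_tumor_like_profiles(profiles, seed=0):
    """Permute each chromosome independently among the profiles.

    profiles: list of dicts mapping chromosome name to that chromosome's CN
    data (any object). All profiles are assumed to share the same chromosomes.
    Returns a new list of the same length; each output profile takes every
    chromosome from a randomly assigned donor profile.
    """
    rng = random.Random(seed)
    chromosomes = sorted(profiles[0])
    shuffled = [dict() for _ in profiles]
    for chrom in chromosomes:
        donors = list(range(len(profiles)))
        rng.shuffle(donors)  # a fresh random permutation for every chromosome
        for recipient, donor in enumerate(donors):
            shuffled[recipient][chrom] = profiles[donor][chrom]
    return shuffled


# Toy usage: integers stand in for per-chromosome CN data.
toy_profiles = [{"1": i, "2": 10 + i, "3": 20 + i} for i in range(5)]
tumor_like = make_tumor_like_profiles(toy_profiles, seed=1)
print(tumor_like[0])
```

Because every chromosome is permuted separately, each chromosome of the original collection is used exactly once, while the combination of chromosomes within any one output profile no longer corresponds to a single tumor.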

Derivation of single-cell copy-number profiles. Copy-number profiles were derived from read count profiles as previously described [1, 2]. Briefly, the genome was partitioned into bins with a constant number of uniquely mapping reads per bin. Genomic regions known to give rise to an anomalously high number of reads were excluded from the partition. Then, for a given read count profile, bin-wise read counts were computed by summing the read counts within each bin. These were normalized by the mean bin count. Finally, LOWESS smoothing was used to control for the dependence of the bin-wise read counts on the GC content of the bin sequence. The number of bins used in this work was 1,180 for the autosome, resulting in a median number of reads per bin close to 100 prior to normalization.
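The binning and normalization steps can be illustrated with the short Python sketch below. It assumes, for simplicity, a single coordinate axis with precomputed bin start positions, and it omits both the exclusion of anomalous regions and the LOWESS correction for GC content; the function name and data layout are hypothetical.

```python
from bisect import bisect_right


def bin_and_normalize(counts, bin_starts):
    """Sum read counts within bins and divide by the mean bin count.

    counts: dict mapping a genomic coordinate to its read count.
    bin_starts: sorted bin start coordinates; bin i covers
                [bin_starts[i], bin_starts[i + 1]) and the last bin is open-ended.
    """
    bin_counts = [0] * len(bin_starts)
    for pos, c in counts.items():
        i = bisect_right(bin_starts, pos) - 1  # index of the bin containing pos
        if i >= 0:                             # skip positions before the first bin
            bin_counts[i] += c
    mean = sum(bin_counts) / len(bin_counts)
    if mean == 0:
        raise ValueError("no reads fall within the bins")
    return [b / mean for b in bin_counts]


# Toy usage with four bins of width 10 (the last one open-ended).
toy_counts = {5: 1, 15: 2, 25: 3, 95: 2}
print(bin_and_normalize(toy_counts, [0, 10, 20, 30]))  # [0.5, 1.0, 1.5, 1.0]
```

After normalization the mean bin value is 1, so profiles from cells sequenced at different depths become directly comparable.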

Analysis of simulated samples of circulating cells. Sensitivity to the presence of clonal cells in the specimen was evaluated as follows. For each of the 3,852 tumors in the TCGA pan-cancer collection, a matrix M was formed with 10 normalized single-cell CN profiles as columns. Each of these was generated using the tumor CN profile, to represent circulating cells from a cancer clone. For each column, an input sequence read set R was chosen at random, without replacement, among the 1,306 normal diploid read sets available. Next, a matrix Cor(M) of Pearson correlations among the columns was computed. This correlation matrix was transformed into an adjacency matrix Adj(M) by applying a threshold T: the matrix elements of Cor(M) above T were set to 1, and the remaining elements were set to 0. This square matrix represents a graph with the number of vertices equal to the dimension of the matrix and an edge connecting any pair of vertices for which the corresponding off-diagonal matrix element is 1. The R package igraph [5] was used to determine the size S of the largest connected component of the graph. For T = 0.7, the values of S are listed, for all 3,852 cases investigated, in Supplementary Table S1.
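For illustration, the chain from correlations to the largest component can be sketched in plain Python. This is a stand-in for the igraph-based computation, not a reproduction of it: the function names are hypothetical, profiles are given as equal-length lists of bin values, and a profile with zero variance is assigned a correlation of 0 by convention.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def largest_component_size(profiles, threshold=0.7):
    """Size S of the largest connected component of the thresholded correlation graph."""
    n = len(profiles)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # An edge joins two profiles whose correlation exceeds the threshold T.
            if pearson(profiles[i], profiles[j]) > threshold:
                neighbors[i].append(j)
                neighbors[j].append(i)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, size = [start], 0
        while stack:                 # depth-first traversal of one component
            v = stack.pop()
            size += 1
            for w in neighbors[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        best = max(best, size)
    return best


# Toy usage: three mutually similar profiles and two unrelated ones.
toy = [
    [1, 2, 3, 4, 5],
    [2, 4, 6, 8, 10.5],
    [1.1, 2, 2.9, 4.2, 5],
    [5, 1, 4, 2, 3],
    [3, 3, 3, 3, 3],
]
print(largest_component_size(toy))  # 3
```

A sample of cells from a single clone is expected to produce a large component, whereas unrelated cells should mostly remain isolated vertices.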

For the purpose of evaluating the specificity of detection, as summarized in Table 1, 1,306 single-cell CN profiles, representing unrelated tumor-like genomes, were simulated by resampling each of the available diploid read sets. From this set of profiles, 10^4 random subsets of size N were drawn for each of the following values of N: 10, 20, 50, 100 and 200. For each subset drawn, an adjacency matrix was computed with T = 0.7 and the size S of the largest component was determined as described above.
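The subset-sampling scheme may be sketched as follows, assuming that the thresholded graph over the full set of profiles has already been computed and is supplied as a list of vertex-index pairs. The function names, the edge-list representation and the union-find implementation are choices made for this sketch only.

```python
import random
from collections import Counter


def largest_component_in_subset(subset, edges):
    """Largest connected component among `subset`, using only edges inside it."""
    parent = {v: v for v in subset}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving keeps trees shallow
            v = parent[v]
        return v

    for a, b in edges:
        if a in parent and b in parent:
            parent[find(a)] = find(b)
    sizes = Counter(find(v) for v in subset)
    return max(sizes.values()) if sizes else 0


def largest_component_distribution(n_profiles, edges, subset_size, n_draws, seed=0):
    """Tally the largest-component size S over random subsets of the profiles."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_draws):
        subset = rng.sample(range(n_profiles), subset_size)  # without replacement
        tally[largest_component_in_subset(subset, edges)] += 1
    return tally


# Toy usage: six profiles, of which 0, 1 and 2 form a chain in the graph.
toy_edges = [(0, 1), (1, 2)]
print(largest_component_distribution(6, toy_edges, subset_size=4, n_draws=1000))
```

The resulting tally gives an empirical distribution of S for unrelated cells, from which the frequency of spuriously large components at a given subset size can be read off.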

The probability distribution of correlations among CN profiles of diploid cells was estimated as follows. A total of 1,306 such profiles were simulated by down-sampling from each of the available read sets of diploid cells, and pairwise correlations were computed. These ranged between -0.07 and 0.32. Extreme-value theory, as implemented in the R package extRemes [6], was used to extrapolate this empirical distribution to larger values of the correlation. In a similar fashion, the distribution of correlations in mixed pairs of cells with diploid and unrelated tumor-like genomes was estimated, with the precaution that the two CN profiles forming a pair be derived by resampling from two different single-cell read sets.

1. Baslan, T., et al., Optimizing sparse sequencing of single cells for highly multiplex copy number profiling. Genome Research, 2015. 25(5): p. 714-724.

2. Kendall, J. and A. Krasnitz, Computational methods for DNA copy-number analysis of tumors. Methods Mol Biol, 2014. 1176: p. 243-59.

3. Carter, S.L., et al., Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol, 2012. 30(5): p. 413-21.

4. Zack, T.I., et al., Pan-cancer patterns of somatic copy number alteration. Nat Genet, 2013. 45(10): p. 1134-40.

5. Csardi, G. and T. Nepusz, The igraph software package for complex network research. InterJournal Complex Systems, 2006.

6. Gilleland, E. and R.W. Katz, extRemes 2.0: An Extreme Value Analysis Package in R. Journal of Statistical Software, 2016. 72(8): p. 1-39.