Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
SINGLE CELL CLONING APPROACHES FOR BIOLOGICAL STUDIES
Document Type and Number:
WIPO Patent Application WO/2019/241249
Kind Code:
A1
Abstract:
Clonally derived barcoded cell populations and methods of generating the same, as well as methods of use thereof, e.g., to evaluate heterogeneity of a starting population of cells.

Inventors:
MCALLISTER SANDRA (US)
OLIVE JESSICA (US)
Application Number:
PCT/US2019/036554
Publication Date:
December 19, 2019
Filing Date:
June 11, 2019
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
BRIGHAM & WOMENS HOSPITAL INC (US)
International Classes:
C07K17/02; C12N5/02; C12N5/07; C12N5/071; C12N5/09; C12N5/10; C12N5/16; C12N5/22; C12N15/07; C12N15/10; C12N15/11
Domestic Patent References:
WO2016176322A12016-11-03
WO2013192570A12013-12-27
WO2017027549A12017-02-16
WO2019113499A12019-06-13
Foreign References:
US20130323732A12013-12-05
US20180002764A12018-01-04
Attorney, Agent or Firm:
DEYOUNG, Janice, Kugler et al. (US)
Download PDF:
Claims:
WHAT IS CLAIMED IS:

1. A method of creating a plurality of barcoded clonal populations (BCPs), the method comprising:

(a) providing an initial heterogeneous population of cells;

(b) dividing the initial heterogeneous population of cells into separate cultures, each culture comprising a single cell from the initial heterogeneous population;

(c) maintaining the single cells in culture to provide a plurality of stable single cell-derived monoclonal populations; and

(d) introducing individual identifying nucleic acid sequences into each cell of the plurality of stable single cell-derived monoclonal populations; to thereby create a plurality of barcoded clonal populations (BCPs).

2. The method of claim 1, wherein the initial heterogeneous population of cells comprises cells from cancer cell lines or patient-derived cells, preferably cells comprising affected and/or normal cells.

3. The method of claim 1, wherein the identifying nucleic acid sequences comprise unique sequences of 10-40 nucleotides.

4. The method of claim 3, wherein the identifying nucleic acid sequences comprise unique sequences of 20-30 nucleotides, preferably 24 nucleotides.

5. The method of claims 1-4, wherein the identifying nucleic acid sequences are flanked by uniform sequences comprising PCR primer binding sites.

6. The method of claim 1, wherein the identifying nucleic acid sequences are

integrated into the genomes of the cells of the plurality of stable single cell- derived monoclonal populations.

7. The method of claims 1-6, wherein the identifying nucleic acid sequences are introduced into the cells of the plurality of stable single cell-derived monoclonal populations using a viral vector.

8. The method of claim 5, wherein the viral vectors are lentiviral vectors.

9. The method of claims 1-8, further comprising mixing equal numbers of each BCP to create a barcoded polyclonal population of cells (BPP).

10. The method of claim 9, further comprising exposing the BPP to a test condition.

11. The method of claims 1-10, further comprising determining one or both of identity and relative abundance of each BCP in the BPP.

12. The method of claim 11, wherein identity and/or relative abundance of each BCP is determined using a method comprising PCR, a hybridization assay, or next- generation sequencing.

13. A method of selecting a therapy for a subject who has cancer, the method

comprising: creating a plurality of barcoded clonal populations (BCPs) by a method comprising:

(a) providing an initial heterogeneous population of cells from the cancer in the subject;

(b) dividing the initial heterogeneous population of cells into separate cultures, each culture comprising a single cell from the initial heterogeneous population;

(c) maintaining the single cells in culture to provide a plurality of stable single cell-derived monoclonal populations; and

(d) introducing individual identifying nucleic acid sequences into each cell of the plurality of stable single cell-derived monoclonal populations; to thereby create a plurality of barcoded clonal populations (BCPs);

(e) mixing equal numbers of each BCP to create a barcoded polyclonal population of cells (BPP);

(f) exposing the BPP to a candidate therapeutic compound; and

determining one or both of identity and relative abundance of each BCP in the BPP.

14. The method of claim 13, wherein identity and/or relative abundance of each BCP is determined using a method comprising PCR, a hybridization assay, or next- generation sequencing.

15. The method of claim 13, wherein the initial heterogeneous population of cells comprises cells from cancer cell lines or patient-derived cells, preferably comprising affected and/or normal cells.

16. The method of claim 13, wherein the identifying nucleic acid sequences comprise unique sequences of 10-40 nucleotides.

17. The method of claim 16, wherein the identifying nucleic acid sequences comprise unique sequences of 20-30 nucleotides, preferably 24 nucleotides.

18. The method of claims 13-17, wherein the identifying nucleic acid sequences are flanked by uniform sequences comprising PCR primer binding sites.

19. The method of claim 13, wherein the identifying nucleic acid sequences are integrated into the genomes of the cells of the plurality of stable single cell- derived monoclonal populations.

20. The method of claims 13-19, wherein the identifying nucleic acid sequences are introduced into the cells of the plurality of stable single cell-derived monoclonal populations using a viral vector.

21. The method of claim 20, wherein the viral vectors are lentiviral vectors.

22. The method of claims 2 and 15, wherein the affected cells are tumor cells.

Description:
SINGLE CELL CLONING APPROACHES FOR

BIOLOGICAL STUDIES

CLAIM OF PRIORITY

This application claims the benefit of U.S. Provisional Application Serial No. 62/683,225, filed on June 11, 2018. The entire contents of the foregoing are incorporated herein by reference. FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

This invention was made with Government support under Grant No.

CA166284 awarded by the National Institutes of Health. The Government has certain rights in the invention.

TECHNICAL FIELD

Described are clonally derived barcoded cell populations and methods of generating the same, as well as methods of use thereof, e.g., to evaluate heterogeneity of a starting population of cells.

BACKGROUND

Human cancer cell lines have facilitated fundamental discoveries in cancer biology and translational medicine 1 . An implicit assumption has been that cell lines are clonal and genetically stable, and hence results obtained in one study can be readily extended to another. Yet findings involving cancer cell lines are often difficult to reproduce 2 · 3 , leading investigators to conclude that the findings were either weak or the studies not carefully conducted. For example, while pharmacogenomic profiling of large collections of cancer cell lines have proven largely reproducible, some discrepancies in drug sensitivity remain unexplained 4-11 .

SUMMARY

Gene editing protocols often require the use of a subcloning step to isolate successfully edited cells, the behavior of which is then compared to the aggregate parental population and/or other non-edited subclones. The results herein demonstrate that the inherent functional heterogeneity present in many cell lines can render these populations inappropriate controls, resulting in erroneous interpretations of experimental findings. The present protocol incorporates a single-cell cloning step prior to gene editing, allowing for the generation of appropriately matched, functionally equivalent control and edited cell lines. The results demonstrate that heterogeneity should be considered during experimental design when utilizing gene editing protocols and provide a solution to account for it.

Thus provided herein are methods for creating a plurality of barcoded clonal populations (BCPs). The methods include (a) providing an initial heterogeneous population of cells; (b) dividing the initial heterogeneous population of cells into separate cultures, each culture comprising a single cell from the initial heterogeneous population; (c) maintaining the single cells in culture to provide a plurality of stable single cell-derived monoclonal populations; and (d) introducing individual identifying nucleic acid sequences into each cell of the plurality of stable single cell-derived monoclonal populations; to thereby create a plurality of barcoded clonal populations (BCPs).

In some embodiments, the methods include (e) mixing equal numbers of each

BCP to create a barcoded polyclonal population of cells (BPP). In some

embodiments, the methods include exposing the BPP to a test condition. In some embodiments, the methods include determining one or both of identity and relative abundance of each BCP in the BPP, e.g., using a method comprising PCR, a hybridization assay, or next-generation sequencing.

In some embodiments, the initial heterogeneous population of cells comprises cells from cancer cell lines (e.g., from a single cancer cell line) or patient-derived cells, e.g., from a single patient, optionally including cells from normal tissues, e.g., affected and/or normal cells.

In some embodiments, the identifying nucleic acid sequences comprise unique sequences of 10-40 nucleotides.

In some embodiments, the identifying nucleic acid sequences comprise unique sequences of 20-30 nucleotides, e.g., 24 nucleotides.

In some embodiments, the identifying nucleic acid sequences are flanked by uniform sequences comprising PCR primer binding sites. The sites allow for PCR amplification of the identifying nucleic acid sequences from genomic DNA preparations. In some embodiments, the identifying nucleic acid sequences are integrated into the genomes of the cells of the plurality of stable single cell-derived monoclonal populations.

In some embodiments, the identifying nucleic acid sequences are introduced into the cells of the plurality of stable single cell-derived monoclonal populations using a viral vector. In some embodiments, the viral vectors are lentiviral vectors.

Also provided herein are methods for selecting a therapy for a subject who has a cancer. The methods comprising: creating a plurality of barcoded clonal populations (BCPs) by a method comprising (a) providing an initial heterogeneous population of cells from the cancer in the subject; (b) dividing the initial

heterogeneous population of cells into separate cultures, each culture comprising a single cell from the initial heterogeneous population; (c) maintaining the single cells in culture to provide a plurality of stable single cell-derived monoclonal populations; and (d) introducing individual identifying nucleic acid sequences into each cell of the plurality of stable single cell-derived monoclonal populations; to thereby create a plurality of barcoded clonal populations (BCPs); (e) mixing equal numbers of each BCP to create a barcoded polyclonal population of cells (BPP); (f) exposing the BPP to a candidate therapeutic compound; and determining one or both of identity and relative abundance of each BCP in the BPP.

In some embodiments, identity and/or relative abundance of each BCP is determined using a method comprising PCR, a hybridization assay, or next-generation sequencing.

In some embodiments, the initial heterogeneous population of cells comprises cells from cancer cell lines (e.g., from a single cancer cell line) or patient-derived cells, e.g., from a single patient, optionally including cells from normal tissues, e.g., affected and/or normal cells.

In some embodiments, the identifying nucleic acid sequences comprise unique sequences of 10-40 nucleotides.

In some embodiments, the identifying nucleic acid sequences comprise unique sequences of 20-30 nucleotides, e.g., 24 nucleotides.

In some embodiments, the identifying nucleic acid sequences are flanked by uniform sequences comprising PCR primer binding sites. The sites allow for PCR amplification of the identifying nucleic acid sequences from genomic DNA preparations.

In some embodiments, the identifying nucleic acid sequences are integrated into the genomes of the cells of the plurality of stable single cell-derived monoclonal populations.

In some embodiments, the identifying nucleic acid sequences are introduced into the cells of the plurality of stable single cell-derived monoclonal populations using a viral vector. In some embodiments, the viral vectors are lentiviral vectors. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Methods and materials are described herein for use in the present invention; other, suitable methods and materials known in the art can also be used.

The materials, methods, and examples are illustrative only and not intended to be limiting. All publications, patent applications, patents, sequences, database entries, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control.

Other features and advantages of the invention will be apparent from the following detailed description and figures, and from the claims.

DESCRIPTION OF DRAWINGS

Figures 1A-H: Extensive genetic variation across 27 strains of the cancer cell line MCF7. (A) The distribution of pairwise allelic fraction (AF) correlations between the Broad and the Sanger cell lines (n=l06), for germline (black) and somatic

(gray) SNVs. One-tailed paired Wilcoxon rank-sum test. (B) The number of gene- level copy number alterations (CNAs) shared by each number of MCF7 strains.

Gains, medium grey; losses, dark grey. (C) CNAs of two genes, PTEN and ESR1. (D)

The number of non-silent point- mutations shared by each number of MCF7 strains.

(E) The AFs of inactivating mutations in the tumor suppressor RΊEN. (F) Top:

unsupervised hierarchical clustering of 27 MCF7 strains, based on CNA profiles derived from low-pass whole-genome sequencing. Strain M subjected to in v/vopassaging and drug treatment; rightmost 11 Connectivity Map strains cultured in the same lab without extensive passaging; strains D and E cultured in the same lab and separated by few passages; strains I and K separated by Cas9 introduction.

Bottom: a corresponding heatmap, showing the CNA landscapes of the strains, relative to the median CNA landscape. (G) Top: unsupervised hierarchical clustering of 27 MCF7 strains, based on their non-silent SNV profiles derived from deep targeted sequencing. Bottom: a corresponding heatmap, showing the mutation status of non-silent mutations across strains. Shown are mutations identified in a subset of the strains at AF>0.05. Mutation present, light grey; mutation absent, dark gray. (H) Comparison of the magnitude of CNAs observed following multiple freeze-thaw cycles (n=9; R/A/S vs. W/X/Y), extensive passaging (n=5; D vs. L vs. AA, B vs. I/P), and genetic manipulations (n=4; AA vs. O, B vs. C, I vs. J/K). Bar, median; box,

25 th and 75 th percentiles; whiskers, l.5*IQR of lower and upper quartile; circles: data points. Two-tailed Wilcoxon rank-sum test.

Figures 2A-D: Genetic heterogeneity and clonal dynamics underlying genetic variation. (A) Top: unsupervised hierarchical clustering of 27 MCF7 strains, based on the allelic fractions of their non-silent SNVs. Colors, as in Fig. 1. Bottom: a corresponding heatmap, showing the AFs of non-silent mutations present in a subset of the strains. (B) The distribution of AFs of non-silent mutations across strains. (C) The cellular prevalence of mutation clusters across MCF7 strains, as identified by a PyClone analysis. Shown are mutation clusters with differential abundance

(ACP>0. l5), the clonal cluster (cluster #6; CP~l in all clones) and a cluster unique to MCF7-M (cluster #12). n = mutations per cluster, mean ± s.e.m. (D) Top:

unsupervised hierarchical clustering of 27 samples of DNA-barcoded MCF7-D, based on barcode representation. Dendrogram branches are colored by culture condition. Bottom: a corresponding heatmap of barcode representation. ETP, early time point; RPMI, RPMI-1640 medium; DMEM, DMEM medium; DMSO, RPMI-1640 with 0.05% DMSO; ESTDEP, estrogen-depleted RPMI-1640 medium; BORT, bortezomib (500nM; 48hr exposure) followed by RPMI-1640.

Figures 3A-H: Extensive transcriptomic variation associated with genetic variation. (A) AtSNE plot of gene expression profiles from multiple samples of nine cancer cell lines. *, the 27 MCF7 strains profiled in the current study (also encircled). (B) Unsupervised hierarchical clustering of the strains, based on their global gene expression profiles. Colors, as in Fig. 1. (C) Schematics presenting the analysis performed to evaluate the association between genetic variation and transcriptional programs. (D) Arm-level gains and losses are associated with significant up- and down-regulation of genes transcribed from the aberrant arms. (E) Gene-level CNAs are associated with significant dysregulation of the perturbed pathways. For example, up-regulation of mTOR signaling in strains that have lost a copy of PTEN. (F) Point mutations are associated with significant dysregulation of the perturbed pathways. For example, up-regulation of mTOR signaling in strains with an

inactivating PTEVmutation. (G) Copy number loss of ESR1 is associated with significant down-regulation of the estrogen response. (H) A tSNE plot of single-cell RNA sequencing data from a parental population and three of its single cell-derived clones.

Figures 4A-K: Drug response consequences of genetic and transcriptomic variation. (A) Top: unsupervised hierarchical clustering of 27 MCF7 strains, based on their response to the 55 active compounds in the primary screen. Bottom: a corresponding heatmap, showing the % of viability change for each compound across strains. Compounds shaded based on their mechanism-of-action. (B) Classification of the screened compounds based on their differential activity. Consistent, viability change <-50% for all strains; variable, viability change <-50% for some strains and >-20% for other strains; intermediate, viability change in between these values. (C) Comparison of the similarity in drug response patterns between compounds that share the same mechanism-of-action (n=39) and compounds that work through different mechanisms (n=l,439). One-tailed Wilcoxon rank-sum test. (D) Highly similar differential drug response patterns for three proteasome inhibitors: bortezomib, MG- 132 and carfilzomib. Each data point represents the mean of two replicates. The number of data points per strain is mentioned within parentheses. The response pattern with no drug (DMSO control) is presented for comparison. (E) Schematics presenting the analysis performed to evaluate the association between drug response and transcriptional variation. (F) Up-regulation of the KEGG cell cycle signature in strains sensitive to the cell cycle inhibitor alsterpaullone (8 sensitive vs. 15 resistant strains). (G) Up-regulation of mTOR signaling in strains sensitive to the PI3K inhibitor BKM-120 (8 sensitive vs. 5 resistant strains). (H) Up-regulation of the genes that are up-regulated when PTENis knocked-down in strains sensitive to AKT inhibitor IV (6 sensitive vs. 9 resistant strains). (I) Strains with PTEN mutation (n=l2) respond more strongly to AKT inhibitor IV than strains without the mutation (n=l4). (J) Strains with ESR1 copy number loss (n=5) grow better in estrogen-depleted medium than strains without ESR1 loss (n=2l). (K) Comparison of GSEA-based MoA identification between the MCF7 cohort and the CTD 2 (n=l5) and GDSC (n=l9) cohorts, across matched drugs. Two-tailed Fisher’s exact test. For all box plots: bar, median; box, 25 th and 75 th percentiles; whiskers, l.5*IQR of lower and upper quartile; circles: data points.

Figure 5. Correlation plots between various measures to estimate cell line strains (n=35l strain pairs). CNA distances (based on low-pass whole-genome sequencing or targeted sequencing), SNV distances, gene expression distances and drug response distances were compared to each other. CNV distance based on LP- WGS was determined by the fraction of the genome affected by discordant CNV calls. CNV and SNV distances based on targeted sequencing were determined by Jaccard indices. Gene expression and drug response distances were determined by Euclidean distances. Spearman’s rho value and p-value indicate the strength and significance of the correlation, respectively.

DETAILED DESCRIPTION

The present disclosure describes methods develop to evaluate lack of clonality and instability in cancer cell lines, which can generate variability in drug sensitivity. The results showed that established cancer cell lines, generally thought to be clonal, are in fact highly genetically heterogeneous across strains. This heterogeneity results both from clonal dynamics (i.e., changes in the abundance of pre-existing clones) and from ongoing instability (i.e., the appearance of new genetic variants). Moreover, genetic heterogeneity yields varying patterns of gene expression, which in turn can result in differential drug sensitivity.

We found that changes in clonal composition underlie much of the observed variability in cell line behavior. Such clonal composition changes follow selection by particular conditions (e.g., growth media), or by genetic manipulations associated with a population bottleneck. The genetic distance between cell line strains was strongly correlated with their gene expression distance and with their drug response distance. Cell line diversification can therefore be estimated using the present profiling methods (Fig. 5).

Variation within cancer cell lines can also be useful in at least two ways. First, deeper characterization (e.g., by single-cell sequencing) of the heterogeneity within cultures of common cell lines could enable the study of cooperative and competitive interactions between cancer cell populations 28 · 29 , and mechanisms of pre-existing drug resistance 19 . Second, due to their matched genetic background, naturally -occurring “isogenic-like” strains could help uncover the association between molecular features and phenotypes such as drug response.

The present disclosure provides methods of evaluating variation across cell line strains, which should be considered in experimental design and data

interpretation.

The barcoding approaches described herein can be used to track and study individual subclonal populations within a heterogeneous populations of cells or tissues. Traditional cell tagging approaches currently do not enable one to enumerate cells at the end point of a study or know anything about their identity or the ability to isolate them for further study. Therefore, the present inventors developed a molecular barcoding approach that enables the analysis of intratumoral subclonal composition, tracking of cells over time, and retrieval of barcoded cells for further study.

Importantly, the present approach is different from others that have been reported in that we generate single cell subclones prior to introducing the barcode tags. Other reported approaches infect heterogeneous parental populations of cells with an entire library of barcodes at low MOI, without the ability to identify which cells are tagged with which barcode. Hence, one advantage of the present approach is that by introducing single barcodes into monoclonal populations and then generating the pooled barcoded polyclonal population rather than infecting the bulk parental population with a library of barcodes, we gain the ability to retroactively characterize barcoded monoclonal populations in any given experiment. This approach also allows us to be confident that the same barcode is not unwittingly introduced into multiple unique clonal populations, thus confounding subsequent analyses.

The general methods are as follows. First stable single cell-derived monoclonal populations (CPs) are generated from heterogeneous populations of cells, e.g., cultured cells (e.g., cancer cell lines, e.g., any of the NCI-60 cancer cell lines, see, e.g., dtp.cancer.gov/discovery_development/nci-60/cell_list.htm) or patient- derived cells, e.g., affected and/or normal cells; affected cells are cells that are affected by a disease, e.g., tumor cells. Methods for obtaining and culturing the cells are known in the art. For example, cells are separated (e.g., by dilution or cell sorting) into individual cells that are placed into individual culture environments, e.g., individual vials or wells of a culture dish. Where the cells are obtained from a patient tumor or other tissue, they can first be dissociated, e.g., enzymatically, chemically, or mechanically. The individual cells are then maintained in culture to produce individual clonal populations. Any number of individual clonal populations can be produced, e.g., 10, 100, 1000, 10 4 , 10 5 , 10 6 , or more.

Each individual clonal population is then tagged with a unique molecular

“barcode” sequence (also referred to herein as individual identifying nucleic acid sequence), e.g., using a viral vector, e.g., recombinant retroviruses, adenovirus, adeno-associated virus, alphavirus, and lentivirus vectors (Yu, et al. Nat Biotech 2016) to create barcoded clonal populations (BCPs). The individual identifying nucleic acid sequences preferably are integrated into the genome of the cells. In preferred embodiments, upon integration, each individual identifying nucleic acid sequence introduces a unique heritable DNA barcode tag of 10-50 base pairs, 20-30 base pairs, e.g., 24-base pairs, into each cell clone genome; these individual identifying nucleic acid sequences can be used to precisely follow the progeny of each cell over time. Each individual identifying nucleic acid is flanked by uniform sequences that are common to all of the cells and allow for PCR amplification of the individual identifying nucleic acid sequences from genomic DNA preparations made from the cells.

In some embodiments, substantially (i.e., within about plus or minus 10%, given difficulties in exactly determining numbers of cells) equal numbers of each BCP, or known ratios of each BCP, are then mixed together to create a barcoded polyclonal population of cells (BPP). The BPP can be exposed to a number of conditions. Then the identity and relative abundance of each clonal population (BCP) within a polyclonal mixture of cells (BPP), e.g., optionally including tumor and non- tumor stromal cells, is determined, e.g., using a Luminex-based PRISM detection (Yu, et al, Nat Biotech, 2016) or next-generation sequencing. These methods allow the identification and quantification of the representation of each individual barcode in a given tissue sample.

While other gene editing, barcoding, and other cell manipulation technologies exist, the present methods derive individual clonal populations of cells from a parental cell line or tissue prior to any type of modulation so that individual clonally related cells can be tracked and studied in any given experiment. The barcode detection system can also be optimized for applicability with typical DNA sequencing technologies.

Provided herein are collections of single cell clonal populations derived from cell lines or tissues; collections of single cell clonal populations tagged with unique molecular barcodes; and collections of mixed populations of cells comprised of uniquely barcoded clonal populations.

The presence of an individual identifying nucleic acid sequences as described herein can be evaluated using methods known in the art, e.g., using polymerase chain reaction (PCR), reverse transcriptase polymerase chain reaction (RT-PCR), quantitative or semi-quantitative real-time RT-PCR, multiplex PCR, digital PCR, e.g., BEAMing ((Beads, Emulsion, Amplification, Magnetics), Diehl (2006) Nat Methods 3:551-559); various types of nucleic acid sequencing (Sanger, pyrosequencing, NextGeneration Sequencing); multiplexed gene analysis methods, e.g., oligo hybridization assays including DNA microarrays; hybridization based digital barcode quantification assays such as the nCounter® System (NanoString Technologies, Inc., Seattle, WA; Kulkami, Curr Protoc Mol Biol. 2011 Apr;Chapter 25:Unit25B.10) and hybridization assays, e.g., utilizing branched DNA signal amplification such as the QuantiGene 2.0 Single Plex and Multiplex Assays (Affymetrix, Inc., Santa Clara, CA; see, e.g., Linton et a., J Mol Diagn. 2012 May-Jun;l4(3):223-32); SAGE; MLPA; or luminex/XMAP. See, e.g., W02012/048113, which is incorporated herein by reference in its entirety.

The methods described herein can include exposing the barcoded polyclonal population of cells (BPPs) to test conditions, e.g., the presence or absence of one or more environmental factors (e.g., temperature, light, atmosphere (e.g., levels of oxygen or nitrogen) or test compounds (e.g., polypeptides, polynucleotides, inorganic or organic large or small molecule test compounds) to determine whether different BCPs within the BPP react differently to the test conditions.

As used herein,“small molecules” refers to small organic or inorganic molecules of molecular weight below about 3,000 Daltons. In general, small molecules useful for the invention have a molecular weight of less than 3,000 Daltons (Da). The test compounds can be, e.g., natural products or members of a

combinatorial chemistry library. The methods can include comparing genetic, genomic, epigenomic, transcriptomic, proteomic, and other profiles across and within the BCPs, e.g., preferably before being combined in a BPP, to determine heterogeneity of a starting population of cells. The methods can alternatively or in addition include determining effects on viability, proliferation, motility, cell cycle, or other cellular characteristics. In some embodiments, one or more characteristics of each BCP is determined before they are mixed together to form a BPP, e.g., genetic, genomic, epigenomic, transcriptomic, proteomic, or other profiles, or viability, proliferation, motility, cell cycle, or other cellular characteristics can be determined; such characteristics can be determined using methods known in the art. For example, the presence of therapy- resistant BCPs can be identified. In some embodiments, therapy-resistant cells are present naturally in the starting sample, and make up some proportion of the BCPs in a BPP; in other embodiments, BCPs consisting of therapy-resistant cells are intentionally spiked in (added) to the BPP.

In some embodiments, e.g., where the initial heterogeneous populations of cells comprises patient-derived cells, e.g., tumor cells and optionally non-tumor cells (e.g., from a biopsy that includes all or part of a tumor and some surrounding non- cancerous tissue), the methods can include exposing one or more populations of BCPs or BPPs generated from those patient-derived cells to one or more test conditions to determine the effect on the patient cells. Thus, for example, BCPs or BPPs generated from tumor cells can be exposed to test conditions that comprise one or more potential therapeutics (e.g., cancer therapeutic agents), and identity and/or relative abundance of each BCP is determined, e.g., using a method comprising PCR, a hybridization assay, or next-generation sequencing. Thus, an effect on the different kinds of cells in the BPP can be evaluated, e.g., an effect on viability or growth of cells having known genetic, genomic, epigenomic, transcriptomic, proteomic, or other profiles, and/or an effect on viability, proliferation, invasiveness, motility, cell cycle, or other cellular characteristics. For example, the methods can be used to determine responses to medication and potential drug resistance (e.g., to monitor the development or overgrowth of resistant cells, and optionally to identify those populations that later develop resistance). The methods can be used to identify and select therapeutics that provide the most complete response (greatedt reduction in affected/tumor cells and/or that overcome resistance (e.g., reduce numbers or don’t elicit development of drug- resistant populations of cells) and/or that selectively affect resistant or tumor cells and not normal cells.

These methods can also be used to evaluate drug responses and resistance mechanisms across BCPs and BPPs generated from various cell populations, e.g., primary or non-primary tumor cells, or cells from other disease tissues or models, to screen for new candidate therapeutics or to identify new targets.

EXAMPLES

The invention is further described in the following examples, which do not limit the scope of the invention described in the claims. Methods

The following materials and methods were used in the Examples below.

Cell culture

MCF7, HT29, MDAM453 and A375 cell line strains were cultured in RPMI- 1640 (Life Technologies), with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). A549, VCaP, PC3, HCC515, HepG2, HeLa and Ben-Men- 1 cell line strains were cultured in DMEM (Life

Technologies), with 10% Fetal Bovine Serum (Sigma), 2 mM Glutamine (Sigma- Aldrich), and 1% Penicillin-Streptomycin-Glutamine (Life Technologies). HA1E cell line strains were cultured in MEMa (Life Technologies), with 10% Fetal Bovine Serum (Sigma), 2mM Glutamine (Sigma-Aldrich), and 1% Penicillin-Streptomycin- Glutamine (Life Technologies). MCF10A cell line strains were cultured in MEGM Mammary Epithelial Cell Growth Medium (Lonza) supplemented with the MEGM Bulletkit (Lonza). The single cell-derived clones scWT3, scWT4 and scWT5, as well as their parental MCF7 population, were cultured in DMEM (Life Technologies), with 10% Fetal Bovine Serum (Sigma), 2mM Glutamine (Sigma-Aldrich), 1% Penicillin-Streptomycin-Glutamine (Life Technologies), and lOpg/mL Insulin (Sigma-Aldrich). Cells were incubated at 37°c, 5% CO2, and passaged twice a week using Trypsin-EDTA (0.25%) (Life Technologies). All strains of the same cell line were cultured under the same conditions, cell identity was confirmed and they were confirmed to be mycoplasma-free. Cells were tested for mycoplasma contamination using the MycoAlert™ Mycoplasma Detection Kit (Lonza), according to the manufacturer’s instructions. Cell line identity was confirmed using SNP-based DNA fingerprinting (see below).

Derivation of single-cell clones

The WT single cell-derived MCF7 clones were generated by cell sorting. Single cells were sorted into individual wells of 96-well plates, using BD FACSAriall SORP Cell Sorter. Three resultant clones were expanded for a period of ~3 months before prior to the experiments. The genetically -manipulated single cell-derived MCF7 -GREB1 and MCF7-/AA/ clones were generated using CRISPR/Cas9 mediated genome engineering to insert aNanoLuciferase reporter gene into the 3’-UTR of the respective genes. Briefly, a selectable reporter gene cassette was engineered using the EMCV IRES element to drive expression of the destabilized NLucP reporter gene (Promega) fused to the N-terminus of the Bsr blasticidin-resistance gene (Invivogen) containing a P2A self-cleaving peptide element. For targeting GREB1, the reporter gene cassette was subcloned into a construct containing ~2 kb of GREB1 gene surrounding the termination codon in exon 33, such that reporter gene cassette is located 9 bp downstream of the GREB1 termination codon in the resulting mRNA hybrid transcript. A G7/A7i/-speciric sgRNA was generated recognizing the sequence GCTGACGGGACGACACATCTG (SEQ ID NO: l) on the sense strand, and utilizing a PAM site that is adjacent to the GREB1 termination codon. For targeting ESR1, the reporter gene cassette was subcloned into a construct containing ~2 kb of ESR1 gene surrounding the termination codon in exon 8, such that reporter gene cassette is located 21 bp downstream of the ESR1 termination codon in the resulting hybrid mRNA transcript. An ESR1 -specific sgRNA was generated recognizing the sequence GT CTCC AGC AGC AGGT CAT AG (SEQ ID NO:2) on the anti-sense strand, and utilizing a PAM site that is 160 bp upstream of the ESR1 termination codon.

Corresponding Cas9-sgRNA and targeting construct pairs were transiently co transfected into MCF7 cells using the LipofectAMINE 2000 reagent (Thermo-Fisher Scientific). After outgrowth for 7 days, the cells were cultured in media containing 5 pg/ml blasticidin to select for the desired recombinants. Single-cell clones were then isolated by a limiting dilution single-cell cloning in 96-well plates.

Growth rate analysis

Cells were seeded in triplicates in white, clear bottom, 96-well plates (Coming #3903), at a density of 5,000 cells/well. Plates were incubated in an IncuCyte ZOOM instrument (Essen Bioscience) at 37°C, 5% CC . Four non-overlapping phase-contrast images (10X) were taken every 2 hours for a total of 160 hours. The IncuCyte ZOOM Software (version 2015 A) was used to calculate the mean confluence per well at every time point (filtered to exclude objects smaller than 100 pm 2 ), and averaged across wells to calculate the mean confluence per strain. Doubling times were calculated for each strain, using the formula Tdoubiing = (log2*AT)/(log(c2)-log(ci)), where ci and C2 were the minimum and maximum percent confluence during the linear growth phase, respectively, and AT was the time elapsed between ci and C2. To account for potential differences in cell recovery following seeding, t=0 h was defined as the first time point in which the mean strain confluence surpassed a threshold of 15%. To examine the effect of estrogen-depletion on the growth of MCF7 strains, cells were cultured either in standard conditions (described above) or in estrogen- depleted conditions: RPMI-1640 without Phenol Red (Life Technologies), with 10% Charcoal Stripped Fetal Bovine Serum (Life Sciences) and 1% Penicillin- Streptomycin-Glutamine (Life Technologies). Comparison between standard and estrogen-depleted conditions was performed by calculating the fold-change in doubling time between the two conditions.

Cell Painting

Cells were plated in triplicate at a density of 1,000 cells per well, and then stained and fixed as previously described 26 · 30 . Images were taken on a Perkin-Elmer Opera Phenix microscope with a 20X/1.0NA water immersion lens. Image quality control was carried out as previously described 31 , using CellProfiler 30 and

CellProfiler-Analyst 31 . For all 27 MCF7 strains, the majority of images in all three wells passed quality control, and therefore all strains were further considered. Image illumination correction and analysis were performed in CellProfiler. For each of the 27 MCF7 strains, the median value of the 1,784 measured features was computed and used for hierarchical clustering.

DMA & RNA extraction

Genomic DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen), according to the manufacturer’s protocol. Total RNA was extracted using the RNeasy Plus Mini Kit (Qiagen), according to the manufacturer’s protocol. DMA fingerprinting

Fingerprinting analysis was performed using 44 polymorphic loci. Picard Tools“GenotypeConcordance” was used to calculate the concordance between every pair of samples (for the MCF7 and A549 cohorts, separately). Samples with >95% concordance were called a match.

Ultra low pass whole-genome DNA sequencing (ULP-WGS)

Copy number characterization was performed using low-pass (~0.2x coverage) whole-genome sequencing. Libraries were prepared from 50ng of DNA using ThruPLEX-DNAseq sample preparation kits (Rubicon Genomics), according to the manufacturer’s protocol. The resultant libraries were quantified by Qubit fluorometer, Agilent TapeStation 2200, and RT-qPCR using the Kapa Library Quantification kit (Kapa Biosystems), according to the manufacturer’s protocol. Uniquely indexed libraries were pooled in equimolar ratios and sequenced on a single IlluminaNextSeq500 run with paired-end 35bp reads, at the Dana-Farber Cancer Institute Molecular Biology Core Facilities. The reads were aligned to the UCSC hgl9 reference genome, using“BWA-MEM” (v0.07T5), with default parameters.

ULP-WGS data analysis

The ichorCNA algorithm 32 was applied to identify copy number alterations (CNAs) of large genomic segments, chromosome arms and whole chromosomes. First, thegenome was divided into lMb bins and read counts were generated for each bin using the HMMcopy Suite (compbio.bccrc.ca/software/hmmcopy/). The raw read counts were then normalized to correct for GC-content and mappability biases using the HMMcopy R package 33 , generating corrected read counts for each lMb bin. Segmentation and copy number prediction for each sample were performed using ichorCNA vO.1.0 (github.com/broadinstitute/ichorCNA), which is optimized for low coverage whole-genome sequencing. Parameters were initialized based on prior knowledge:— normal=0.0l,— ploidy=c(3, 3.5, 4),— txnE=0.99999—

txnStrength=l 0,000,— maxCN=8. Remaining parameters were set to the default. For bin-level comparison between strains, we used the log2-transformed corrected read counts and determined gain and loss status using thresholds of 0.1 and -0.1, respectively. For arm-level calls, the copy number status was determined based on the largest overlapping segment. Deep targeted sequencing (Profile OncoPanel v3)

Deep (~250x coverage) targeted exon sequencing of 447 genes commonly mutated in cancer was performed. Prior to library preparation, DNA was fragmented (Covaris sonication) to 250bp and further purified using Agentcourt AMPure XP beads. Size-selected DNA was ligated to sequencing adaptors with sample-specific barcodes during automated library preparation (SPRIworks, Beckman-Coulter). Libraries were pooled and sequenced on an Illumina Miseq to estimate library concentration based on the number of index reads per sample. Library construction was considered to be successful if the yield was >250ng, and all samples yielded sufficient library. Normalized libraries were pooled in batches, and hybrid capture was performed using the Agilent Sureselect Hybrid Capture kit with the

POPv3_824272 bait set 34 . Captures were then pooled and sequenced on one

HiSeq3000 lane. Pooled sample reads were de-convoluted and sorted using the Picard tools (broadinstitute.github.io/picard). The reads were aligned to the reference sequence b37 edition from the Human Genome Reference Consortium using“bwa aln” (bio-bwa.sourceforge.net/bwa.shtml ), with the following parameters:“-q 5 -1 32 -k 2 -o 1”, and duplicate reads were identified and removed using the Picard tools 35 . The alignments were further refined using the GATK tool for localized realignment around indel sites (software.broadinstitute.org/gatk/

documentation/tooldocs/current/org_broadinstitute_gatk_to ols_walkers_indels_Indel

Realigner. php). Recalibration of the quality scores was also performed using GATK tools (gatkforums.broadinstitute.org/discussion/44/base-quality-sc ore-recalibration- bqsr) 36 · 37 . Metrics for the representation of each sample in the pool were generated on the unaligned reads after sorting on the barcode

(broadinstitute.github.io/picard/picard-metric-definition s.html). All samples achieved the CCGD recommended threshold of >30x coverage for >80% of the targeted bases. Average mean exon target coverage was 251.5x (range: 171.5c-336.7c) for the MCF7 samples, 288.9x (range: 208.2x-398.9x) for the A549 samples, and 257.32 (range 2l l.7x-442.68x) for the additional cell line samples.

Targeted sequencing data analysis

Mutation analysis for single nucleotide variants (point mutations, or SNVs) was performed using MuTect vl. l.4 38 . Indel calling was performed using the SomaticIndelDetector tool in GATK (broadinstitute.org/cancer/cga/indelocator). Consecutive variants in the same codon were re-annotated to maximize the effect on the codon and marked as“Phased” variants. MuTect was run in paired mode, pairing the MCF7/A549 samples to a normal sample, CEPH1408. Mutations were called if detected in >2% of the reads (AFX).02). All SNVs, indels, and phased variants were annotated with Variant Effect Predictor (VEP) 39 . Variants were filtered against the 6,500 exome release of the Exome Sequencing Project (ESP) database. Variants represented more than once in either the African- or European- American populations and less than twice in COSMIC were considered to be germline (given that no matched normal samples were available). A germline filter was not applied to the downstream analyses, however, as changes in such mutations between strains of the same cell line would have to arise in culture and may be functionally relevant. Non- silent mutations were considered to be those with the following BestEffect Variant Classification: missense, initiator codon, nonsense, splice acceptor, splice donor, splice region, frameshift, inframe insertion or inframe deletion. Mutations that appeared more than once in COSMIC were regarded as COSMIC mutations.

Copy number variants (CNVs, or CNAs) were identified using RobustCNV, an algorithm that relies on localized changes in the mapping depth of sequenced reads in order to identify changes in copy number at the sampled loci (Ducar et al.

Manuscript in preparation). Systematic bias in mapping depth was reduced using robust regression, fitting the observed tumor mapping depth against a panel of normal samples (PON) captured using the same bait set. Observed values were then normalized against predicted values and expressed as log2ratios. A second normalization step was performed to remove GC bias, using a loess fit. Log2ratios were centered on segments determined to be diploid based on the allele fraction of heterozygous SNPs in the targeted panel. Normalized coverage data were next segmented using Circular Binary Segmentation 40 with the“DNAcopy” Bioconductor package. Finally, segments were assigned gain, loss, or normal-copy calls using a cutoff derived from the within-segment standard deviation of post-normalized mapping depths. Due to the high data quality and low within-segment standard deviation, a cutoff of ~0. l was applied to all samples. Segment calls were summarized to gene calls by assigning them to capture intervals, and then counting the interval calls for each gene. Gene level calls were determined according to the following rules: “gain” =“+” calls >50%;“loss” =“-” calls >2 or in 100%;“gain+loss” =“-” calls >2 times and“+” calls <50%;“mixed” =“+” and calls in the same gene, but below threshold;“Normal+” =“+” calls, but below threshold;“Normal-” =“-” calls, but below threshold;“Normal” = no“+” or“-” calls.

For a subset of 60 genes, rearrangements (structural variants, or SVs) were detected using BreaKmer 41 , which is designed to detect larger genomic structural variations from single sample aligned short read target-captured high-throughput sequence data. Briefly, the method extracts’misaligned’ sequences from a targeted region, such as split-reads and unmapped mates, assembles a contig from these reads, and re-aligns the contig to make a variant call. It classifies detected variants as ’’insertions/deletions”,’’tandem duplications”,’’inversions”, and’’translocations”. Rearrangements were visualized using the“Circos” visualization tool 69 .

Clonality analysis

To resolve clonal dynamics/composition we applied the PyClone algorithm nq.13.0 (bitbucket.org/aroth85/pyclone/wiki/Home) to the measured allelic fractions, accounting for copy number, LOH and cellularity 17 . PyClone enabled us to follow clonal dynamics throughout the evolution of cell populations 17 18 . For copy number input, we used results from ichorCNA segmentation and copy number predictions. Mutations with <50 read depth were excluded. The following parameters were used for PyClone: 10,000 iterations, 1,000 bum-in,“total_copy_number” for the prior. We also performed multi-sample analysis using PyClone, to determine the changes in clonal composition across strains. For the multi-sample analysis, mutations were selected as the union set across all 27 strains. The same parameters were used for PyClone multi-sample analysis as for the individual-sample runs.

DMA Barcoding experiment

Degenerate oligos for sgRNA-barcode library construction were synthesized by IDT and cloned into lentiGuide-Puro 42 by Gibson assembly, as describe in Joung et al. 43 . Approximately 300pg of Gibson product was transformed into 25pL of Endura electrocompetent cells (Lucigen). After a 1 hour recovery period, 0.1% of transformed bacteria were plated in a 10-fold dilution series on ampicillin plates to determine the number of successful transformants. The remainder of the transformed bacteria were cultured in 50mL of LB with 50ug/mL ampicillin for 16 hours at 30°c. Plasmid libraries were extracted using Plasmid MidiPlus kit (Qiagen) and sequenced to a depth of 6.2 million reads on Illumina Miniseq, corresponding to 6X coverage of >1 million barcodes. Lentivirus was prepared by transfecting a total of 10 million HEK 293FT cells, as described in Joung et al. 43 . The MCF7-D strain was cultured in standard conditions (described above), and four million cells were infected with a low multiplicity of infection (20-30%) to reduce the probability of each cell being infected with more than one barcode. Cells underwent puromycin selection, and the final cell pool contained -160,000 unique barcodes. Cells were expanded for the experiment, and five million cells were then plated into each of 25 tissue culture flasks. Five culture conditions were then applied (with five replicates per condition):

1) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma- Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies); 2) DMEM (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin- Streptomycin-Glutamine (Life Technologies); 3) RPMI-1640 without Phenol Red (Life Technologies), with 10% Charcoal Stripped Fetal Bovine Serum (Life Sciences) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies); 4) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich), 1% Penicillin-

Streptomycin-Glutamine (Life Technologies) and 0.05% DMSO (Sigma-Aldrich); 5) RPMI-1640 (Life Technologies) with 10% Fetal Bovine Serum (Sigma-Aldrich) and 1% Penicillin-Streptomycin-Glutamine (Life Technologies), supplemented for the first 48 hours with 500nM bortezomib (Selleckchem S 1013). After five weeks of culture, DNA was extracted and barcode abundance was assessed by DNA sequencing, as described in Joung et al. 43 . Libraries were sequenced to a median depth of 4.2 million reads, corresponding to a barcode coverage of >26X.

Transcriptional profiling with L1000

The L1000 expression-profiling assay was performed as previously described 16 . First, mRNA was captured from cell lysate using an oligo dT coated 384 well Turbocapture plate. The lysate was then removed, and a reverse transcription mix containing MMLV was added. The plate was washed and a mixture containing both upstream and downstream probes for each gene was added. Each probe contained a gene specific sequence, along with a universal primer site. The upstream probe also contained a microbead-specific barcode sequence. The probes were annealed to the cDNA over a 6-hour period, and then ligated together to form a PCR template. After ligation, Hot Start Taq and universal primers were added to the plate. The upstream primer was biotinylated to allow later staining with strepdavodin-phycorethrin. The PCR amplicon was then hybridized to Luminex microbeads via the complimentary, probe-specific barcode on each bead. After overnight hybridization the beads were washed and stained with strepdavodin-phycorethrin to prepare them for detection in Luminex FlexMap 3D scanners. The scanners measured each bead independently and reported the bead color/identity and the fluorescence intensity of the stain. A deconvolution algorithm converted these raw fluorescence intensity measurements into median fluorescence intensities for each of the 978 measured genes, producing the GEX level data. This GEX data was then normalized based on an invariant gene set, and then quantile normalized to produce QNORM level data. An inference model was applied to the QNORM data to infer gene expression changes for a total of 10,174 features. Per-strain gene expression signatures were calculated using a weighted average of the replicates, where the weights are proportional to the

Spearman correlation between the replicates.

Transcriptional profiling data analysis

To examine how newly profiled MCF7 and A549 cells compared in gene expression to a previously acquired collection of cell line profiles (untreated samples that served as controls for Connectivity Map perturbational experiments), we used t- distributed stochastic neighbor embedding (t-SNE). Profiles were restricted to untreated profiles from the nine core Connectivity Map cell lines, and to batches with multiple untreated profiles. As samples first clustered based on their project codes, batch effect was next removed using the COMBAT algorithm 44 . t-SNE was applied on the batch-corrected data and visualized using a scatter plot. Analysis was completed using the“Rtsne” R package version 0.13. For the comparison of transcriptional variation across the nine core Connectivity Map cell lines, the collection of untreated profiles generated with the L1000 assay was used. Five profiles from each cell line were randomly chosen, and the expression variance of the 978 L1000“landmark” genes was calculated for each cell line. For the comparison of L1000 gene expression data to the Cancer Cell Line Encyclopedia (CCLE) gene expression profiles, RNAseq and Affymetrix gene expression profiles were downloaded from the CCLE website (portals.broadinstitute.org/ccle/data). Data within each platform were processed using invariant set scaling, which adjusts profiles according the expression of 80“invariant” genes, followed by quantile normalization 16 . The ranked gene expression order of the 978“landmark” genes was compared using Spearman’s correlation.

Chemical screening

MCF7 strains were tested against a small molecule Informer Set library of 321 anti-cancer compounds, assembled by the Cancer Target Discovery and Development (CTD 2 ) (ocg.cancer.gov/programs/ctd2/data-portal), using the same principles as those described in the Cancer Therapeutics Response Portal 8 · 45 . Cells were seeded in in their culture media in white, 384 well plates (Coming #3570) at an initial density of 2,500 cells per well and incubated overnight at 37°c, 5% CO2. The next day, 25nL (for primary screen) or lOOnL (for confirmation dose response screen) of compound stocks in DMSO were added by pin transfer. Plates were incubated for 72 hours, cooled at RT for 10 minutes, and viability was measured using the CellTiter-Glo luminescent cell viability assay (Promega), according the manufacturer’s protocol. After 10 minutes of incubation, luminescence was read on a Perkin Elmer Envision reader, at a speed of 0.1 s/well.

Chemical screening data analysis

Data were analyzed in Genedata Screener version 13.0, using the

normalization method“neutral controls”, where the median of 32 DMSO wells on each plate was set to 0% activity and 0 raw signal was set to -100%. Positive controls (20mM MG-132 or 20mM bortezomib) were included on all plates (16 wells each) but were not used for normalization due to variability of response across cell lines. Dose response curves were fit using the“Smart Fit” strategy in Genedata. The % effect was defined as the high-concentration asymptote (Sint) and qEC50 was the concentration at which the fitted curve crossed the inhibitory value representing half the maximal % effect. In addition, parameters were calculated at which the curve crossed absolute inhibitory values of 30% or 50% regardless of maximal effect (AbsEC30 and

AbsEC50, respectively). AUC calculations were performed as previously described 8 : curves were fit with nonlinear sigmoid functions, forcing the low concentration asymptote to 1 using a 3-parameter sigmoidal curve fit. The AUC for each compound- strain pair was calculated by numerically integrating under the 8-point concentration- response curve. For visualization purposes, drug response curves were fit with a 4- parameter log-logistic function, based on normalized viability data from which the lowest dose viability had been subtracted. Plots were generated using the“LL.4” function in the“drc” R package (cran.r-project.org/web/packages/drc/). To examine a potential link between proliferation rate and differential drug response, doubling times were compared against the AUC values of the 33 differentially-active compounds.

Gene Set Enrichment Analysis (GSEA)

GSEA was performed using the 10,147 genes best inferred from the

Connectivity Map linear model 33 , also known as the BING gene set. Samples were divided into two classes depending on the comparison being made: samples with a genetic alteration vs. samples without it; samples sensitive to a drug (>50% inhibition) vs. samples insensitive to the same drug (<20% inhibition). Differential expression was calculated using the signal -to-noise (S2N) metric 46 . A ranked gene list and S2N values served as the input for the GSEA pre-ranked module of Gene Set Enrichment Analysis, using the Java app version 3.0. The analysis was run using the ‘hallmark’,‘KEGG’,‘positional’ and‘oncogenic’ signature collections from

MsigDB. To compare between our MCF7 panel, CTD 2 and GDSC, drug response were downloaded from the CTRP website (ocg.cancer.gov/programs/ctd2/data-portal; “v20.data.curves_post_qc” file, updated October l4 th 2015) and from the GDSC website (cancerrxgene.org/downloads;“log(IC50) and AUC values” file, updated July 4 th 2016). Expression profiles were downloaded from the CCLE website to match the CTD2 drug response data (portals.broadinstitute.org/ccle/data;

“CCLE_Expression_Entrez_2012-09-29. get”; updated October l7 th 2012), and from the GDSC website to match with the GDSC drug response data

(cancerrxgene.org/downloads;“RMA normalized expression data for cell lines”, updated March 2 nd 2017). Expression profiles were filtered to include only the genes that belong to the L1000 BING set. GSEA compared the expression patterns of the 5 strains/cell lines with the highest AUC values for each matched drug with the 5 strains/cell lines with the lowest AUC values for that drug. As the robustness of gene expression signatures varies, this quantitative analysis was restricted to the 50 well- defined hallmark GSEA gene sets 27 .

Single-cell RNA sequencing

MCF7 cells were cultured as described above. For following transcriptional changes post drug treatment, MCF7-AA cells were exposed to 500nM of bortezomib (Selleckchem S 1013 ) and harvested before treatment, after 12 hours of exposure (tl2), after 24 hours of exposure (t48), or after 72 hours of exposure followed by drug wash and 24 hours of recovery (t72+24). Cells were washed, trypsinized, passed through a 40mM cell strainer, centrifuged at 400g, and resuspended at a concentration of 1,000 cells/pL in PBS + 0.5% BSA. Single cells were processed through the Chromium Single Cell 3' Solution platform using the Chromium Single Cell 3’ Gel Bead, Chip and Library Kits (10X Genomics), as per the manufacturer’s protocol. Briefly, 7,000 cells were added to each channel, and were then partitioned into Gel Beads in emulsion in the Chromium instrument, where cell lysis and barcoded reverse transcription of RNA occurred, followed by amplification, shearing and 5' adaptor and sample index attachment. Libraries were sequenced on an IlluminaNextSeq 500.

Single-cell RNA sequencing data analysis

Reads were mapped to the GRCh38 human transcriptome using cell ranger 2.1.0, and transcript-per-million (TPM) was calculated for each gene in each filtered cell barcodes sample. TPM values were then divided by 10, since the complexity of single-cell libraries is estimated to be on the order of 100,000 transcripts. For each cell, we quantified the number of genes expressed and the proportion of the transcript counts derived from mitochondrial encoded genes. Cells with either <1,000 detected genes or >0.15 mitochondrial fraction were excluded from further analysis. Finally, the resulting expression matrix was filtered to remove genes detected in <3 cells. We focused on highly variable genes for downstream principal component analysis (PCA). For each dataset, we used the Seurat R package to detect variable genes based on fitting a relationship between the mean and the dispersion of each gene. We next scaled the data and regressed out UMI number and mitochondrial gene fraction to remove technical noise. The resulting scaled data were used as an input for PCA. Top significant PCs, estimated by a manual inspection of the PCA standard deviations elbow plots, were used to generate tSNE plots. For each dataset, we used

Seurat 47 (satijalab.org/seurat/) to identify genes that vary between samples. To detect differentially-active pathways, gene ontology (GO) enrichment analysis was performed with MSigDB 27 (software.broadinstitute.org/gsea/msigdb) using the differentially expressed genes that passed the following thresholds: |log2FC|>0.25, Bonferroni corrected p-value<0.0l, the gene was detected in >10% of the cells in each of the compared groups. Expression signatures for selected pathways were downloaded from MSigDB 27 . We evaluated the degree to which individual cells express a certain expression signature by using a procedure that takes into account the variability in signal-to-noise ratio, as previously reported 48 . To calculate pairwise cell distances, variable genes were detected, and the cell embedding matrix for the top significant PCs was used to calculate the Euclidean distance between every two cells within each sample.

Analysis of genome-wide CRISPR screens

CERES dependency scores 49 were obtained from the Broad Institute Achilles website (portals.broadinstitute.org/achilles/datasets/l8/download). Due to an unusually large difference in screen quality between MCF7 and KPL1, the subtle differences in dependency status between these lines were dominated by effects related to screen quality. To remove these uninteresting sources of variation, we corrected CERES gene scores by removing their first six principal components. These components were well-explained by experimental batch effects related to screen performance and pDNA pool. Corrected dependency scores <-0.5 were defined as dependencies. Genes listed as“pan_dependent” in the original dependency dataset were excluded from further analysis. For a more stringent overlap comparison, genes with CERES scores between -0.4 and -0.6 in MCF7 or KPL1 were further excluded. To implement the force directed layout, the full corrected dependency matrix was reduced to its top 100 principle components and a //-means clustering algorithm was run repeatedly on cell lines. Here, k is the number of clusters, and the mean cluster size (number of cell lines) I k is a parameter similar to perplexity in tSNE, set to 6 for our data. Edges between cells were weighted according to the frequency with which they co-clustered, with edges appearing less than 30% of the time ignored. Cells were then laid out using the SFPD spring-block algorithm 50 . Cell line RNAseq gene expression data and RPPA protein expression data were obtained from the CCLE website (portals.broadinstitute.org/ccle/data). Single sample GSEA was calculated using the ssGSEA algorithm 51 .

Chymotrypsin-like activity

MCF7 cells were plated in triplicates in 96 well plates at a density of 20,000 cells per well. 24 hours later, chymotrypsin-like activity of the proteasome was assayed, using the Proteasome-Glo™ assay (Promega), according to manufactures protocol. The activity levels were normalized to the relative cell number that was measured using the fluorescent detection of resazurin dye reduction (544-nm excitation and 590-nm emission). Western blots

For PSMC2 and PSMD2 immunoblotting, cells were lysed in HENG buffer (50mM Hepes-KOH pH 7.9, l50mM NaCl, 2mM EDTA pH 8.0, 20mM sodium molybdate, 0.5% Triton X-100, 5% glycerol), with protease inhibitor cocktail (Roche Diagnostics #11836153001). Protein concentration was determined by the BCA assay (Thermo-Fisher Scientific #23227), and proteins were resolved on SDS-PAGE for immunoblot analysis. Antibodies against the following human proteins were used: alpha-Tubulin (ab80779; Abeam), PSMC2 (MSS 1-104; Enzo Life Sciences) and PSMD1 (C-7; Santa-Cruz). Visualization was performed using the ChemiDoc MP System (Bio-Rad), and ImageLab Software (Bio-Rad) was used to quantify relative band intensities. For ERa immunonblotting, cells were lysed with a mix of 4X protein loading buffer (Li-Cor 928-40004) and 10X NuPAGE sample reducing agent (Life Technologies NP0009). Protein concentration was normalized by cell counting, and proteins were resolved on SDS-PAGE. Antibodies against the following human proteins were used: beta-Actin (N-21; Santa Cruz), ERa (F-10; Santa Cruz).

Visualization was performed using the Odyssey CLx imaging machine (Li-Cor), and Image Studio Software (Li-Cor) was used to quantify the relative intensities.

Generation and comparison of dendrograms

Dendrograms were constructed using Euclidean distances for continuous measures and Manhattan distances for discrete measures. Complete linkage hierarchical clustering was performed in all cases. The mutation status dendrogram was based on mutations with AFX).05. The gene expression dendrogram was based on the 978“landmark” genes directly measured by the L1000 assay. The copy number dendrograms were based on discrete calls (loss, normal or gain) assigned to each event based on its log2 copy number ratio, using a cutoff value of +/-0.1. The drug response dendrogram was based on normalized viability values. The cell morphology dendrogram was based on the full list of 1,784 cellular features measured. The barcode representation dendrogram was based on the log2 transformed number of reads, including only barcodes with >1,000 reads in at least one sample. To understand how dendrograms from different sources compared, the Fowlkes-Mallows index was used, as it could capture similarities in global clustering while ignoring within-group variance 52 . The“Bk” function in the“dendextend” R package was used for computations and visualizations. We compared dendrograms from different sources with k values ranging from 5 to 26. A background distribution was calculated by randomly shuffling the labels of the trees a 1,000 times, and calculating Bk values. The 95% upper quantile of the randomized distribution for each k was plotted. The maximum Bk value was used to estimate the degree of similarity between the compared pair of dendrograms.

Calculation of the distances between strains based on their genomic features

CNA distance based on LP-WGS was determined by the fraction of the genome affected by discordant CNA calls. CNA and SNV distances based on targeted sequencing were determined by Jaccard indices, defined as the number of shared events between strains (intersection) divided by the total number of evens in these strains (union). For SNVs, both the mutated gene and the exact amino acid change had to be identical to be counted as a shared event. Gene expression distances were defined as the Euclidean distances between L1000 expression profiles. Drug response distances were defined as the Euclidean distances between drug response profiles, after limiting the drug set to active drugs only (i.e., drugs that reduced the viability of at least one strain by >50%) and thresholding viability values to +/-100.

Comparisons across CCLE cell lines

Gene-level mRNA expression, copy number and mutation status data were downloaded from the CCLE website (portals.broadinstitute.org/ccle/data;

“CCLE_Expression_Entrez_2012-09-29. get”, updated October 17 lh 2012;

CCLE_copynumber_byGene_20l3-l2-03.txt, May 27 th 2014;

CCLE_MUT_CNA_AMP_DEL_binary_Revealer.gct, updated February 29 th 2016). The total number of point mutations and copy number changes were counted for each cell line. Chromosome arm-level events in CCLE samples were generated as described in Ben-David et al. 53 , and the number of arm-level events was counted for each cell line. The fraction of the genome affected by subclonal events was estimated using ABSOLUTE 54 . Combined CNA-SNV genomic instability scores were calculated as described in Zhang et al. 55 . The DNA repair gene set was derived from the Molecular Signature Databse (software.broadinstitute.org/gsea/msigdb), using the “DNA_Repair” GO signature 56 . The CIN70 gene set was derived from the publication by Carter et al. 57 . For each gene set, genes not expressed at all in the CCLE dataset were removed, and the remaining gene expression values were log2 -transformed and scaled by subtracting the gene expression means. The signature score was defined as the sum of these scaled gene expression values.

Comparison of Broad (CCLE) and Sanger (GDSC) genomic features

Whole-exome sequencing data for 107 matched cell lines were downloaded from the Sanger Institute (cancer.sanger.ac.uk/cell_lines, EGA accession number: EGAD00001001039) for the GDSC cell lines, and from the GDC portal

(portal.gdc.cancer.gov/legacy-archive) for the CCLE cell lines. For copy number analysis, For copy number analysis, the GATK4 somatic copy number variant pipeline was applied (gatkforums.broadinstitute.org/gatk/discussion/9l43/how-to- call-somatic-copy -number-variants-using-gatk4-cnv)36, 37. Gene-level copy number calls were generated by mapping genes from segment calls using the Consensus Coding Sequence database 58 . The gene-level values were log2 transformed, and converted to discrete values using pre-defmed thresholds (+/-0.1, +/-0.3 and +/-0.5). To determine the % of discordance for each cell line, the number of discordant CNA calls between each pair of strains was divided by the total number of genes (excluding genes with a neutral copy number call in both data sets). For analysis of somatic variants, the CCLE/Sanger merged mutation calls were downloaded from the CCLE portal (portals.broadinstitute.org/ccle/data), and target interval list files were generated for each of the 107 matched cell lines in CCLE. Mutation calling was performed using MuTect 38 , with default parameters and force outpuf’ enabled, to count the number of reads supporting the reference and alternate allele for each variant in each cell line. For analysis of germline variants, a common target interval list file consisting of a panel of 105,995 SNPs was generated, based on common SNVs found in 1,019 CCLE RNAseq samples, and Mutect was applied with the same parameters as described above. Comparison of allelic fractions was performed using the subset of variants with minimum depth of coverage of 10 in both Sanger and CCLE datasets and with minimum of allelic fraction of 0.1 in at least one dataset. Out of the 107 cell lines, one cell line (Dovl3) lacked any germline concordance and was thus excluded from all analyses.

Cytogenetic analysis

Karyotyping was performed by Karyo Logic. Inc. (karyologic.com/) on 50 G- banded metaphase spreads per sample. Every spread displayed multiple chromosomal rearrangements with many marker chromosomes. A marker was defined as“a structurally abnormal chromosome that cannot be unambiguously identified by conventional banding cytogenetics.” The analysis was performed according to the International System for Human Cytogenetic Nomenclature (ISCN) 2016 guidelines. Rare metaphases with >100 chromosomes were excluded from further analysis.

E-karyotyping analysis

RNAseq data from non-manipulated/non-treated samples of the near-diploid human cell line RPE1 were downloaded from the NCBI SRA website

(ncbi.nlm.nih.gov/sra). STAR- paired aligner was used to align paired-end samples, and STAR -non-paired aligner was used to align the non-paired samples 59 . The STAR to RSEM tool 60 was used to generate the gene-level expression values using the gtex pipeline (github.com/broadinstitute/gtex-pipeline). To infer arm-level copy-number changes from gene expression profiles, the RSEM values were subjected to the e- karyotyping method 61 . Briefly, RSEM values were log2-converted, genes that were not expressed (log2RSEM<l) in >20% of the samples were excluded, and expression levels of the remaining genes were floored to RSEM=l. The median expression value of each gene across all samples was subtracted from the expression value of that gene, in order to obtain comparative values. The 10% most variable genes were removed from the dataset to reduce transcriptional noise. The relative gene expression data were then subjected to a CGH-PCF analysis, with a stringent set of parameters: Least allowed deviation = 0.5; Least allowed aberration size = 30; Winsorize at quantile = 0.001; Penalty = 18; Threshold = 0.01. CNAs exceeding 80% of the length of a chromosome arm were called arm-level CNAs.

Comparison of arm-level CNAs between cell line propagation and tumor progression

Recurrence of chromosome arm-level CNAs during breast cancer progression was determined by their frequency in TCGA samples, as previously described 53 . Recurrence of chromosome arm-level CNAs during cell line propagation was determined by comparing the arm-level calls of the strains directly separated by extensive passaging (strain D vs. strain L vs. strain AA, strain B vs. strains I/P). Only arms that are recurrently gained or lost (but not both) in TCGA (q-value<0.05), and that have variable copy number status across the MCF7 panel, were considered for the comparison. Statistical analysis

The significance of the difference between genomic instability associated with different sources of genetic variation, and that of the difference in chromosome number between two time points of single cell-derived clones, were determined using the two-tailed Wilcoxon rank-sum test. The significance of the difference in the euclidean distance between compounds that work through the same MoA and compounds that work through different MoA’s, that of the difference in the discordance of non-silent SNVs at different stages of transformation, that of the difference in CIN70 and wGII scores between cell lines derived from primary tumors and those derived from metastases, that of the difference between the somatic and germline SNV Pearson correlations of the Broad-Sanger cell lines, and that of the differences in the Broad-Sanger somatic SNV concordance between MSI and MSS cell lines and between primary-derived and metastasis-derived cell lines, were determined using the one-tailed Wilcoxon rank-sum test. The significance of the difference in mutation cellular prevalence across strains was determined by a Kruskal- Wallis test. The significance of the difference in AKT inhibitor IV sensitivity between PTEN- wt and PTEN-rmt strains, that of the difference in the relative growth effect of ER-depletion between /AW -loss and no-fiSW-loss strains, that of the difference in proteasome activity between bortezomib-sensitive and bortezomib- insensitive strains, that of the difference in ERa protein expression levels between strains, and that of the difference in the number of arm-level CNAs between matched early-late MCF7 strains were determined using the one-tailed Student’s t-test. The significance of the difference in doubling times, and that of the difference in sensitivity to estrogen depletion, was determined using the two-tailed Student’s t-test. The significance of the correlation between the two replicates of the primary screen was determined using Pearson’s correlation. The significance of the correlation between doubling time and the number of protein coding mutations, that of the correlation between doubling time and the fraction of subclonal mutations, and that of the correlation between doubling time and drug response, were determined using a Spearman’s correlation, excluding the broadly resistant strains Q and M. The significance of the correlation between ESR1 CERES dependency scores and estrogen signaling, and that of the correlation between GATA3 CERES dependency scores and GAT A3 protein expression levels were determined using a Spearman’s correlation. The deviation of the doubling time-drug response correlations from a hypothetical mean value of 0 was determined using a two-tailed one-sample t-test. The significance of the difference between the emergence and disappearance of recurrent arm-level CNAs during cell line propagation was determined using McNemar’s test. The significance of the correlation between the primary and secondary drug screens was determined using a Spearman’s correlation (including only compounds that were active in both screens). The significance of the directionality of drug-pathway association, and the likelihood that a mutation would be clonal given the number of reads that detected it, were determined using a binomial test. The significance between the fraction of pathways correctly identified between the MCF7 panel,

CTD 2 and GDSC, was determined using a two-tailed Fisher’s exact test. GSEA p- values and FDR-corrected q-values are shown as provided by the default analysis output. For the comparison of pathway prediction, FDR q-values were re-calculated using only the pre-selected pathways. Thresholds for significant associations were determined as: p<0.05; q<0.25. The significance of the difference in the karyotypic variation between parental and single cell-clone derived cultures was determined using the Levene’s test. The significance of differentially-expressed genes in the single-cell RNAseq data was determined by an analysis of variance (ANOVA) followed by a Games-Howell post hoc test and a Bonferroni correction. Box plots show the median, 25 th and 75 th percentiles, lower whiskers show data within

25 th percentile -1.5 times the IQR, upper whiskers show data within 75 th percentile +1.5 times the IQR, and circles show the actual data points. Statistical tests were performed using the R statistical software (r-project.org/), and the box plots and violin plots were generated using the“boxplot” and“vioplot” R packages, respectively.

Example 1. Cross-laboratory comparisons

To test the hypothesis that clonal variation exists within established cell lines, we re-analyzed whole-exome sequencing data from 106 cell lines generated by both the Broad Institute (the Cancer Cell Line Encyclopedia (CCLE)) and the Sanger Institute (the Genomics of Drug Sensitivity in Cancer (GDSC)), using the same analytical pipeline for both datasets (Methods).

As expected, estimates of allelic fraction (AF) for germline variants were nearly identical across the two datasets (median r=0.95), indicating that sequencing artifacts do not substantially contribute to the erroneous appearance of low AF calls. However, the degree of agreement in AF for somatic variants was substantially lower (median r=0.86; p<2* 10 K ’; Fig. la). Moreover, a median of 19% of the detected non- silent mutations (range, 10% to 90%) were identified in only one of the two datasets. Likewise, 26% of genes altered by copy number alterations (CNAs) (range, 7% to

99%) were discordant. These results indicate that genetic variability across versions of the same cell line is common. Indeed, a median of 22% of the genome was estimated to be affected by subclonal events across 916 CCLE cell lines, suggesting that subclonality may underlie the observed differences. Example 2. Genetic variation across 27 MCF7 strains

We performed extensive genomic characterization of 27 versions (hereafter called“strains”) of the commonly used estrogen receptor (ER)-positive breast cancer cell line MCF7 (ref 12 l4 : Methods), including 19 strains that had not undergone drug treatment or genetic manipulation, 7 strains that carried a genetic modification generally considered to be neutral (e.g., introduction of a reporter gene, Cas9 or a DNA-barcode), and one strain (MCF7-M) that had been expanded in vivo in mice following anti-estrogen therapy. Strain M was found to be an outlier, consistent with having been through strong bottlenecks, and was therefore excluded from downstream quantitative analyses.

Ten chromosome arms (25% of the genome) were differentially gained or lost in a pairwise comparison of strains. We detected 283 genes with copy -gains and 405 genes with copy-losses (compared to basal ploidy) in at least one strain. Only a small minority of these (13% of gains and 21% of losses) were detected in all strains. 7% of gains and 13% of losses were detected in only a single strain, and the remaining events were observed variably across strains (Fig. lb). The differential events included genes commonly gained or lost in breast cancer

(e.g. TP 53, PTEN, EGFR, PIK3CA and MLR2K4). For example, PTEN was deleted in 17 strains and retained in the other 10 (Fig. lc). Similarly, the estrogen receptor gene ESR1 was gained in 12 strains, lost in 6, and unaltered in 9 (Fig. lc), and this correlated with differential expression of ERa (p=0.009).

Genetic variation was similarly observed at the level of point mutations, small insertions/deletions (indels) and chromosomal translocations. Only 35% of 95 non- synonymous single nucleotide variants (SNVs) and indels affecting coding sequence or splicing were shared by all strains. 29% were unique to a single strain, and the remaining were present in a subset of strains (Fig. ld-e). Similar, albeit lower, variability was observed among mutations listed as recurrent in the COSMIC database 15 , consistent with COSMIC mutations tending to be clonal founder mutations.

Unsupervised hierarchical clustering, where genetic distance is reflected by branch lengths of the dendrogram, generated branch structure that accurately reflected the strains’ history. For example, strain M, which had been subjected to in vivo passaging and drug treatment, was the most genetically distinct; the 11 strains used by the Connectivity Map project 16 over a 10 year period clustered tightly together; and sibling strains D and E, merely a few passages apart, were the closest to each other (Fig. lf-g). The genetic distance between strains appeared to be affected more by passage number and genetic manipulation than by freeze-thaw cycles (Fig. lh). Example 3. Sources of variation

Analysis of variant AFs revealed extensive subclonality across strains (Fig. 2a-b). For example, all 27 strains harbored the /VA Cri -activating mutation c. l633G>A, but the AF varied from 0.21 to 0.70. Based on AFs and copy number status, 45% of all observed mutations were determined to be subclonal (pO.Ol in a binomial test). PyClone 17 18 , which reconstructs subclonal structure by clustering mutations with similar cellular prevalence (CP), indicated multiple subclones within each MCF7 strain, with varying abundance across strains (Fig. 2c). Indeed, for 43% of the non-silent SNVs, CP differed by >50% across strains.

We next asked whether clonal dynamics were stochastic or the product of selection. We barcoded MCF7 cells (strain D) and evaluated the change in barcode representation over time under five culture conditions, each in five replicates. We reasoned that if clonal dynamics were stochastic, distinct barcoded populations would emerge in independent replicates. In contrast, if pre-existing subclones were selected under different conditions, enrichment of the same barcodes would be observed in replicate cultures 19 . Unsupervised hierarchical clustering by barcode representation revealed that biological replicates clustered together (Fig. 2d), indicating that pre existing clones are indeed selected by changes in culture conditions. Next, we characterized the genetic stability of three wild-type (WT) single cell-derived MCF7 clones and five single cell-derived clones with a“neutral” genetic manipulation (stable expression of a luciferase reporter; Methods). Clones derived from the same parental population differed in their mutational landscapes: a median of 15% of the non-silent SNVs detected in the WT parental population (range, 13% to

16%), were not observed in their single cell-derived progeny, or vice versa.

Moreover, the single-cell clones continued to evolve into heterogeneous populations. We propagated two clones for 8-14 months, and sequenced their DNA at multiple time points. A median of 13% of the non-silent SNVs (range, 8% to 16%) were not shared between time points. Similar results were observed based on cytogenetic analysis, indicating that even single cell-derived clones are genomically unstable.

Example 4. Gene expression variation

We next measured transcriptomic variation across the MCF7 strains using the L1000 assay 16 · 20 · 21 . Despite an overall similarity in their global gene expression profiles (Fig. 3a), the 27 strains also showed extensive expression variation: a median of 654 genes (range, 10-1,574) were differentially expressed by at least two-fold between pairs of strains (p<0.05, q<0.05), and the differentially expressed genes converged on important biological pathways. Importantly, the 27 strains clustered similarly in the space of mutations and expression profiles, and the expected downstream consequences of genetic mutations were observed in the gene expression variation (Fig. lf-g, Fig. 3b-g). For example, strains with inactivating RΊEN mutation or activating PIK3CA mutation exhibited decreased PTEN and increased mTOR gene expression signatures, respectively (Fig. 3e-f). Similarly, copy loss of ESR1 was associated with reduced estrogen signaling (Fig. 3g).

We further explored gene expression heterogeneity by single-cell RNA sequencing (scRNA-seq) of 26,465 individual cells from two parental and four single cell-derived clones (Methods). Unsupervised clustering showed that cells from the single cell-derived clones did not cluster independently, but were mixed with the parental population, indicating high similarity in overall gene expression (Fig. 3h). Interestingly, the extent of expression heterogeneity among the single cell-derived clones was not substantially lower than that seen in the parental population, and increased with time in culture. These results suggest that variation in gene expression arises de novo, in addition to reflecting selection of pre-existing clones 22 .

Example 5. Verification in additional cell lines

To exclude the possibility that the variation observed across MCF7 strains was unique to that cell line, we repeated genomic analyses on 23 strains of the commonly used lung cancer cell line A549 (ref. 23 ). We observed a similar level of molecular variation across these strains. For example, loss of CDKN2A, the most significantly deleted gene in lung adenocarcinomas 24 , was detected in 5 strains, but normal copy number was retained in the other 18. Whereas transcriptome analyses showed that estrogen signaling was the most variable pathway in MCF7 cells, KRAS signaling was the most variable pathway in A549, a commonly used model of A7/AS'-dependent cancer.

The generalizability of our findings was further confirmed by deep targeted sequencing of multiple strains from 11 additional cell lines. Notably, genomic instability was not limited to transformed cancer cell lines. For example, the variation across 15 strains of MCF10A 25 , a non-transformed human mammary cell line, was as high as that seen in MCF7 cancer cells (median discordance, 26%; range, 17% to 40).

Example 6. Functional consequences of genomic variation

The extensive genomic variation across strains was associated with variation in biologically meaningful cellular properties. We examined several measures of basic cellular function, including doubling time and cell morphology, using quantitative live cell imaging 26 (Methods). MCF7 strains varied in doubling times by as much as 3.5- fold (median, 3lh; range, 22-78h). Similarly, cell size and shape were highly variable across strains. Clustering based on morphological traits echoed that based on genomics or transcriptomics, and genomic features correlated with proliferation.

Genomic instability also had major impact on drug response. We measured cell viability following treatment with 321 drugs at a single concentration (5mM) across the 27 MCF7 strains. 55 compounds had strong activity (>50% growth inhibition) against at least one strain. However, at least one strain was entirely resistant (<20% growth inhibition) to 48 of 55 (87%) active compounds (Fig. 4a-b). The same phenomenon was observed at a more stringent threshold: of 42 compounds with strong activity in at least two strains, 33 (79%) were inactive in at least two strains. All 33 differentially active compounds validated in an 8-point dose-response testing of each of the 27 strains (median Spearman’s rho=0.42 between screens, p=3*lCT 9 ).

The high degree of variability in drug response cannot be explained by irreproducibility of the assay. First, replicate treatments yielded highly concordant results (median Pearson’s r=0.97, p<2*lCT 16 ). Second, compounds with the same mechanism-of-action had similar patterns of activity across strains (Fig. 4a, c;

p=3* 10 ~ ). For example, the same activity pattern was observed for three proteasome inhibitors (bortezomib, MG- 132 and carfilzomib) (Fig. 4d), and was associated with biochemically-measured differential proteasome activity. Third, for 82% of differentially active compounds, we found differential gene expression signatures of compound mechanism-of-action 27 between sensitive and insensitive strains

(p=2*lCT 5 ; Fig. 4e-h).

Indeed, drug response was associated with transcriptional differences in relevant pathways. For example, strains sensitive to CDK inhibitors had an upregulated cell cycle signature, and strains sensitive to PI3K inhibitors had an upregulated mTOR signature (Fig. 4f-g). Interestingly, the strains most resistant to treatment in general (strains M and Q) downregulated a signature of drug metabolism. Differences in proliferation rate did not explain the majority of the observed differential drug activity (median Spearman’s rho=0.0l7; p=0.60).

Genetic variation could be linked directly to differential drug response. For example, genetic inactivation of PTEN was associated with decreased PTEN and increased AKT expression signatures (Fig. lc,e and Fig. 3e-f), and increased sensitivity to the AKT inhibitor IV (Fig. 4h-i). Similarly, ESR1 loss was associated with reduced estrogen signaling (Fig. lc and 3g), which was in turn associated with reduced sensitivity to tamoxifen or estrogen depletion (Fig. 4j). More broadly, clustering of the MCF7 strains based on their drug response was highly similar to clustering based on genetics or gene expression (Fig. lg, 2a, 3b, 4a). Genome-wide CRISPR screens revealed that genetic dependencies were affected by genomic variation similarly to pharmacological dependencies, and functional analyses revealed that single cell-derived clones remained phenotypically unstable.

We thus hypothesized that variation across otherwise isogenic strains might be harnessed to discover mechanisms of drug sensitivity and resistance. Indeed, we found that basal gene expression profiles across the 27 MCF7 strains could be more readily connected to mechanism-of-action of active drugs than did larger panels of breast cancer cell lines derived from different patients 5 · 8 (Fig. 4k).

Example 7. Genomic evolution of cell lines appears to be different from genomic evolution of tumors during disease progression

The genetic changes that arise during the propagation of cell lines in culture may be associated with actual breast cancer disease progression. To address this question, we compared the genetic changes that arose following MCF7 passaging to those that arise during breast cancer disease progression. Yates et al. 78 found seven genes that were significantly more frequently mutated in relapsed or metastatic breast cancer than in primary breast cancer. In a pairwise comparison of earlier vs. later passage MCF7 strains, we could not detect new mutations in any of these genes.

We recently reported that recurrent arm-level CNAs are seen more frequently in relapsed or metastatic tumors compared to matched primary tumors across cancer types, but that some of these events tend to gradually disappear during the

propagation of patient-derived xenografts53. In breast cancer, the number of chromosomal changes was reported to be higher in metastases than in primary tumors, and recurrent arm-level CNAs (e.g., loss of l8q) were reported to be more frequent in metastases than in their matched primary tumors 79 82 . In contrast, the number of arm- level CNAs did not increase following extensive passaging of MCF7 (p=0.31), and - similar to our previous findings in PDXs - common CNAs mostly disappeared rather than arose on prolonged cell line passaging (p=0.047 in a McNemar’s test).

Therefore, we found no evidence that genomic evolution in culture reflected breast cancer disease progression in patients. We note that tumors and cell lines probably follow distinct evolutionary trajectories due to the widely distinct in vivo and in vitro selection pressures to which they are subjected.

Example 8. Single cell RNA sequencing following MCF7 response to bortezomib

To validate that scRNA-seq could be used to evaluate transcriptional changes in MCF7-derived populations, we treated MCF7 (strain AA) with bortezomib

(500nM), and profiled the transcriptome of 7,555 single cells at four time points: before drug treatment (tO), 12 hours after drug exposure (tl2), 48 hour after drug exposure (t48), and 72 hours after drug exposure followed by 24 hours without it (t72+24). Cell populations clustered separately at the different time points. In line with the previously reported transcriptional response to bortezomib 83 85 , drug treatment led to upregulation of the proteasome and unfolded protein response gene expression programs, and to downregulation of estrogen signaling and proliferation signatures. As expected, and in line with what we observed visually (data not shown), the genes most differentially expressed at t48 vs. tO were significantly enriched for apoptosis (r<2*10-16). We conclude that scRNA-seq can successfully capture transcriptional differences between MCF7 cell populations. Example 9. Single cell RNA sequencing identifies differentially expressed genes between single cell-derived clones

Despite an overall similarity between the single cell clones and their parental populations, a median of 165 differentially expressed genes (DEGs) (|log2FC|>0.25, Bonferroni-corrected p<0.0l; range, 95 to 226) were identified in a pairwise comparison . Almost twice as many genes were identified between the clonal populations themselves (median: 311; range, 290 to 395), indicating that single cell clones were closer to their parental population than to each other. Moreover, 153 DEGs were detected even between the two cultures of the same clone, before and after prolonged passaging. Moreover, cultures of the same clone before and after prolonged passaging exhibited significant differences in important pathways, including estrogen signaling and proliferation.

Example 10. Genetic variation across strains of additional cancer and non-cancer cell lines

MCF7 has been in culture for decades, and may have been subjected to varying selection pressures. We therefore tested the genetic variation of multiple additional cell lines. In paired comparisons between strains of the same cancer cell line, a median of 17% of the detected non-silent mutations (range, 0% to 33%) were only identified in one strain. The variation within our MCF7 and A549 panels was well within this observed range, with MCF7 being more variable than the median cell line (median discordance, 27%; range, 0% to 44%) and A549 being as variable as the median cell line (median discordance, 15%; range, 2% to 30%). The variation between the two samples of the benign tumor cell line BEN-MEN- 1 - separated by three years of continuous culture - was not lower than that within malignant cell lines, indicating that genomic instability is not limited to cell lines from malignant tumors.

We analyzed multiple strains of the non-transformed human cell lines MCF10A 25 , HA1E 86 and RPE1 87 . Whereas the variation in non-silent mutations across kidney HA1E strains was relatively low (median discordance, 5%; range, 2% to 7%), the variation across breast MCF10A strains was as high as that seen in MCF7 cancer cells (median discordance, 26%; range, 17% to 40%). Similarly, re-analysis of genomic data from immortalized RPE1 retinal cells showed discordant arm-level CNAs in three of 26 strains (-10%).

To ask whether the degree of cell line genomic instability is associated with its degree of transformation, we compared MCF10A cells at different stages of transformation 88 90 . We compared fraction of discordant non-silent SNVs between non-transformed MCF10A strains (expressing GFP or an empty vector), partially- transformed MCF10A strains (expressing a single oncogene), and a fully -transformed MCF10A strain (expressing ERBB2 and 1433z). The fully-transformed MCF10A strain was significantly more unstable than its partially -transformed or non- transformed counterparts).

Taken together, these results suggest that genomic instability is not restricted to transformed cells, but that increasing instability is associated with malignant progression. A causal relationship, however, cannot be inferred from these analyses.

Example 11. Doubling time was strongly correlated to the number of protein altering mutations and the extent of subclonality of the strains

Because mutation load serves as a molecular reflection of cellular“age” 91 , we examined the association between mutation load and growth rate. Remarkably, doubling time was strongly correlated with the number of protein altering mutations in naturally-occurring strains (Spearman’s Rho=-0.79, p=l*l0-4), consistent with cell line fitness increasing with time in culture. Similarly, the extent of subclonality of strains was highly predictive of proliferation rate (Spearman’s Rho=-0.77, p=2*l0-4), consistent with the association between subclonality and prognosis observed in primary tumors 92 . Example 12. Cell line instability is associated with the degree of aggressiveness and genomic instability of its tumor of origin 7

Analysis across 106 cell lines common to the Broad CCLE and Sanger GDSC indicated that the concordance in mutation allelic fractions was significantly lower for cell lines derived from metastases as opposed to primary tumors. Indeed, cell lines derived from metastases exhibited significantly higher genomic instability scores 57 · 93 than those derived from primary tumors. Of note, the concordance between mutation AFs was also significantly lower for cell lines with microsatellite instability (MSI), suggesting that deficient maintenance of genome integrity underlies at least some of the observed differences. These data suggest that cell line instability is associated with the degree of aggressiveness and genomic instability of its tumor of origin.

Example 13. Variability in drug response is observed in multiple analyses

The high variability in drug response across MCF7 strains was observed when the analysis was performed in multiple ways: 1) When broadly resistant strains were removed; 2) when using a more stringent definition of strong activity (>80% growth inhibition), and 3) when averaging the 10 WT Connectivity Map strains.

Example 14. The Connectivity Map strains do not skew variation estimates

As our panel of MCF7 strains included 11 CMap strains, which were highly similar to one another, we wanted to confirm that our estimations of variation across strains were not skewed due to this closely-related group of strains. We therefore averaged the genetic, transcriptomic and drug response profiles of the 10 naturally- occurring CMap strains, and repeated the quantitative analyses of variation using this single averaged CMap strain. The degree of heterogeneity across strains was very similar to that observed when the CMap strains were considered independently. This is true for the variation in the SNV and CNA landscapes, the correlations between doubling time and the number of protein altering mutations and between doubling time and subclonal fraction, the variation in gene expression, and the variation in drug response. Example 15. Dose-response curves are indicative of heterogeneous drug response

Dose-response analysis showed that while some compounds were active in all strains, their EC50 often varied dramatically. For example, the histone deacetylase inhibitor romidepsin varied in its cytotoxicity by more than 800-fold across the strains. We note that the dose-response curves of many differentially active drugs were characterized by shallow hill slopes and incomplete killing in the resistant strains, which is indicative of heterogeneity within the cell population 94 .

Example 16. Comparison of clustering trees 8

Our results suggest an overarching view in which genetic variation across strains of common cell lines is associated with relevant variation in cell morphology, gene expression, and drug response (Fig. lg, 2a, 3b, 4a). We compared the MCF7 clustering trees representing these different datasets using the Fowlkes-Mellows index52 (see Methods), and found that all trees were much more similar to one another than expected by chance. The similarity between the genetic trees and the gene expression tree was higher than that between either of them and the drug response tree. Among the genetic trees, the arm-level copy number-based tree was the most similar to the gene expression-based tree, probably due to the direct effect of aneuploidy on the expression of hundreds of genes simultaneously. Finally, the drug response tree was slightly more similar to the gene expression tree than to the mutation and gene-level copy number trees, in line with previous reports showing gene expression as a strong predictor of drug dependencies 95 .

Example 17. Genomic variation results in differential genetic

dependencies

We asked whether genomic variation between strains affects genetic dependencies in a similar way to its effect on drug response. This is particularly important because pooled CRISPR/Cas9 knock-out screens use genome copy number estimates to correct for the non-specific toxicity of Cas9, which is associated with the number of Cas9-mediated cuts in DNA49,96,97. We used the CERES algorithm to generate gene dependency scores52 from genome-wide CRISPR screens performed in both MCF7 and the cell line KPL1, now known to be a strain of MCF7 (ref. 98). The dependency landscapes of MCF7 and KPL1 were more similar to one another than to other breast cancer cell lines, but MCF7 had nearly twice as many dependencies as KPL1 (254 vs. 142, respectively; excluding pan-essential genes required for the survival of all cell lines). -70% of the MCF7 dependencies and -30% of the KPL1 dependencies were not shared by both cell lines. Remarkably, two key genes involved in estrogen signaling, ESR1 and BCAS3, were among the top differential

dependencies. Moreover, the genes that were over-expressed in MCF7 compared to KPL1 were significantly enriched for estrogen signaling expression signatures, which were in turn strongly associated with the dependency of breast cancer cell lines on ESR1. Similarly, the protein levels of GATA3, one of the most frequently mutated genes in breast cancer99, were higher in MCF7 than in KPL1, and MCF7 was more sensitive to GATA3 knockout. We concluded from these analyses that genetic variation within cancer cell lines can have substantial impact on the interpretation of genetic loss-of-function screens.

Example 18. Single cell-derived clones remain phenotypically unstable Lastly, we performed functional analyses of parental strains and their single cell-derived clone, comparing their proliferation rates and their dependency on estrogen. This analysis demonstrated that single cell-derived clones remain phenotypically unstable.

While single cell cloning initially reduces genomic heterogeneity, it does not ultimately yield genomically and phenotypically stable clonal lines.

Example 19. Differential drug response can be used to identify drugs’ mechanisms of action

Comparison of gene expression profiles between the most sensitive and most resistant cell lines or cell strains identified differential expression of the correct associated pathways in about two thirds of the cases in the MCF7 panel, compared to about a third of the cases in the CTD2 and the GDSC panels (p=0.022; Fig. 41). Therefore, panels of multiple strains of the same cell line, like the one described in this study, may offer a novel approach to identifying genotype-specific cancer vulnerabilities. This approach may be particularly useful for those drugs for which there is a highly variable sensitivity within the panel. Example 20. Comparison of naturally-occurring and genetically- manipulated strains

We found that“neutral” genetic manipulations, which usually involve antibiotic selection, are a major contributor to genomic variation across cell line strains. To ask whether genetic manipulations, as a group, lead to any reproducible transcriptional and drug response alterations, we compared the naturally-occurring to the genetically-manipulated strains. We first compared gene expression profiles between these groups and found no evidence for consistent gene expression changes. The number of differentially expressed genes between in-group pairs of strains was not statistically different from that between out-group pairs of strains. The genes differentially expressed (>2FC) between the two groups were not enriched for any hallmark pathway. We next compared the drug response patterns between the two groups. We did not find any drug that differentially (p<0.05, q<0.25) affected manipulated strains compared to naturally-occurring strains. We note that this does not rule out the possibility that specific genetic manipulations, or specific viral integration sites, may affect the genomic stability of cancer cell lines or may be associated with reproducible genomic changes.

References

1. Sharma SV, Haber DA & Settleman J Cell line-based platforms to evaluate the therapeutic efficacy of candidate anticancer agents. Nat Rev Cancer 10, 241-53

(2010).

2. Freedman LP, Cockbum IM & Simcoe TS The Economics of

Reproducibility in Preclinical Research. PLoS Biol 13, el002l65 (2015).

3. Prinz F, Schlange T & Asadullah K Believe it or not: how much can we rely on published data on potential drug targets? Nat Rev Drug Discov 10, 712 (2011).

4. Barretina J et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603-7 (2012).

5. Garnett MJ et al. Systematic identification of genomic markers of drug sensitivity in cancer cells. Nature 483, 570-5 (2012).

6. Basu A et al. An interactive resource to identify cancer genetic and lineage dependencies targeted by small molecules. Cell 154, 1151-61 (2013). 7. Yang W et al. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells. Nucleic Acids Res 41, D955-61 (2013).

8. Seashore-Ludlow B et al. Harnessing Connectivity in a Large-Scale Small- Molecule Sensitivity Dataset. Cancer Discov 5, 1210-23 (2015).

9. Haibe-Kains B et al. Inconsistency in large pharmacogenomic studies. Nature 504, 389-93 (2013).

10. Cancer Cell Line Encyclopedia, C. & Genomics of Drug Sensitivity in Cancer, C. Pharmacogenomic agreement between two cancer cell line data sets. Nature 528, 84-7 (2015).

11. Haverty PM et al. Reproducible pharmacogenomic profiling of cancer cell line panels. Nature533, 333-7 (2016).

12. Soule HD, Vazguez J, Long A, Albert S & Brennan M A human cell line from a pleural effusion derived from a breast carcinoma. J Natl Cancer Inst 51, 1409- 16 (1973).

13. Brooks SC, Locke ER & Soule HD Estrogen receptor in a human cell line (MCF-7) from breast carcinoma. J Biol Chem 248, 6251-3 (1973).

14. Lee AV, Oesterreich S & Davidson NE MCF-7 cells— changing the course of breast cancer research and care for 45 years. J Natl Cancer Inst 107(2015).

15. Bamford S et al. The COSMIC (Catalogue of Somatic Mutations in

Cancer) database and website. Br J Cancer 91, 355-8 (2004).

16. Subramanian A et al. A Next Generation Connectivity Map: L1000 Platform And The First 1,000,000 Profiles. bioRxiv (2017).

17. Roth A et al. Py Clone: statistical inference of clonal population structure in cancer. Nat Methodsl 1, 396-8 (2014).

18. Eirew P et al. Dynamics of genomic clones in breast cancer patient xenografts at single-cell resolution. Nature 518, 422-6 (2015).

19. Bhang HE et al. Studying clonal dynamics in response to cancer therapy using high-complexity barcoding. Nat Med 21, 440-8 (2015).

20. Berger AH et al. High-throughput Phenotyping of Lung Cancer Somatic

Mutations. Cancer Cell30, 214-228 (2016).

21. Peck D et al. A method for high-throughput gene expression signature analysis. Genome Biol 7, R61 (2006). 22. Gupta PB et al. Stochastic state transitions give rise to phenotypic equilibrium in populations of cancer cells. Cell 146, 633-44 (2011).

23. Lieber M, Smith B, Szakal A, Nelson-Rees W & Todaro G A continuous tumor-cell line from a human lung carcinoma with properties of type II alveolar epithelial cells. Int J Cancer 17, 62-70 (1976).

24. Cancer Genome Atlas Research, N. Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543-50 (2014).

25. Soule HD et al. Isolation and characterization of a spontaneously immortalized human breast epithelial cell line, MCF-10. Cancer Res 50, 6075-86 (1990).

26. Bray MA et al. Cell Painting, a high-content image-based assay for morphological profiling using multiplexed fluorescent dyes. Nat Protoc 11, 1757-74 (2016).

27. Liberzon A et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst 1, 417-425 (2015).

28. Janiszewska M et al. In situ single-cell analysis identifies heterogeneity for PIK3CA mutation and HER2 amplification in HER2-positive breast cancer. Nat Genet 47, 1212-9 (2015).

29. Venteicher AS et al. Decoupling genetics, lineages, and microenvironment in IDH-mutant gliomas by single-cell RNA-seq. Science 355(2017).

30. Bray MA, Fraser AN, Hasaka TP & Carpenter AE Workflow and metrics for image quality control in large-scale high-content screens. J Biomol Screen 17, 266-74 (2012).

31. Dao D et al. CellProfiler Analyst: interactive data exploration, analysis and classification of large biological image sets. Bioinformatics 32, 3210-3212 (2016).

32. Adalsteinsson VA et al. Scalable whole-exome sequencing of cell-free DNA reveals high concordance with metastatic tumors. Nat Commun 8, 1324 (2017).

33. Ha G et al. Integrative analysis of genome-wide loss of heterozygosity and monoallelic expression at nucleotide resolution reveals disrupted pathways in triple- negative breast cancer. Genome Res 22, 1995-2007 (2012).

34. Sholl LM et al. Institutional implementation of clinical tumor profiling on an unselected cancer population. JCI Insight 1, e87062 (2016). 35. Li H & Durbin R Fast and accurate short read alignment with Burrows- Wheeler transform. Bioinformatics 25, 1754-60 (2009).

36. McKenna A et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20, 1297-303 (2010).

37. DePristo MA et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet 43, 491-8 (2011).

38. Cibulskis K et al. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31, 213-9 (2013).

39. McLaren W et al. Deriving the consequences of genomic variants with the

Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069-70 (2010).

40. Olshen AB, Venkatraman ES, Lucito R & Wigler M Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557-72 (2004).

41. Abo RP et al. BreaKmer: detection of structural variation in targeted massively parallel sequencing data using kmers. Nucleic Acids Res 43, el9 (2015).

42. Sanjana NE, Shalem O & Zhang F Improved vectors and genome-wide libraries for CRISPR screening. Nat Methods 11, 783-784 (2014).

43. Joung J et al. Genome-scale CRISPR-Cas9 knockout and transcriptional activation screening. Nat Protoc 12, 828-863 (2017).

44. Johnson WE, Li C & Rabinovic A Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-27 (2007).

45. Rees MG et al. Correlating chemical sensitivity and basal gene expression reveals mechanism of action. Nat Chem Biol 12, 109-16 (2016).

46. Golub TR et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, 531-7 (1999).

47. Macosko EZ et al. Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets. Cell 161, 1202-1214 (2015).

48. Tirosh I et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science 352, 189-96 (2016).

49. Meyers RM et al. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells. Nat Genet 49, 1779-1784 (2017). 50. Hu Y Efficient, High-Quality Force-Directed Graph Drawing. The Mathematica Journal 10, 37-71 (2006).

51. Barbie DA et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature 462, 108-12 (2009).

52. Fowlkes EB & Mallows CL A Method for Comparing Two Hierarchical

Clusterings. Journal of the American Statistical Association 78, 553-569 (1983).

53. Ben-David U et al. Patient-derived xenografts undergo mouse-specific tumor evolution. Nat Genet 49, 1567-1575 (2017).

54. Carter SL et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol 30, 413-21 (2012).

55. Zhang S, Yuan Y & Hao D A genomic instability score in discriminating nonequivalent outcomes of BRCA½ mutations and in predicting outcomes of ovarian cancer treated with platinum-based chemotherapy. PLoS One 9, el 13169 (2014).

56. Subramanian A et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A 102, 15545-50 (2005).

57. Carter SL, Eklund AC, Kohane IS, Harris LN & Szallasi Z A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet 38, 1043-8 (2006).

58. Pujar S et al. Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation. Nucleic Acids Res 46, D221-D228 (2018).

59. Dobin A et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).

60. Li B, Ruotti V, Stewart RM, Thomson JA & Dewey CN RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics 26, 493-500 (2010).

61. Ben-David U, Mayshar Y & Benvenisty N Virtual karyotyping of pluripotent stem cells on the basis of their global gene expression profiles. Nat Protoc 8, 989-97 (2013).

62. Jones, C. et al. Comparative genomic hybridization reveals extensive variation among different MCF-7 cell stocks. Cancer Genet Cytogenet 117, 153-8 (2000). 63. Nugoli, M. et al. Genetic variability in MCF-7 sublines: evidence of rapid genomic and RNA expression profile modifications. BMC Cancer 3, 13 (2003).

64. Leung, E., Kannan, N., Krissansen, G.W., Findlay, M.P. & Baguley, B.C. MCF-7 breast cancer cells selected for tamoxifen resistance acquire new phenotypes differing in DNA content, phospho-HER2 and PAX2 expression, and rapamycin sensitivity. Cancer Biol Ther 9, 717-24 (2010).

65. Coser, K.R. et al. Antiestrogen-resistant subclones of MCF-7 human breast cancer cells are derived from a common monoclonal drug-resistant progenitor. Proc Natl Acad Sci U S A 106, 14536-41 (2009).

66. Esquenet, M., Swinnen, J.V., Heyns, W. & Verhoeven, G. LNCaP prostatic adenocarcinoma cells derived from low and high passage numbers display divergent responses not only to androgens but also to retinoids. J Steroid Biochem Mol Biol 62, 391-9 (1997).

67. Briske- Anderson, M.J., Finley, J.W. & Newman, S.M. The influence of culture time and passage number on the morphological and physiological

development of Caco-2 cells. Proc Soc Exp Biol Med 214, 248-57 (1997).

68. Yu, EL, Cook, T.J. & Sinko, P.J. Evidence for diminished functional expression of intestinal transporters in Caco-2 cell monolayers at high passages. Pharm Res 14, 757-62 (1997).

69. Wenger, S.L. et al. Comparison of established cell lines at different passages by karyotype and comparative genomic hybridization. Biosci Rep 24, 631-9 (2004).

70. Sambuy, Y. et al. The Caco-2 cell line as a model of the intestinal barrier: influence of cell and culture-related factors on Caco-2 cell functional characteristics. Cell Biol Toxicol 21 , 1 -26 (2005).

71. Lin, H.K. et al. Suppression versus induction of androgen receptor functions by the phosphatidylinositol 3-kinase/ Akt pathway in prostate cancer LNCaP cells with different passage numbers. J Biol Chem 278, 50902-7 (2003).

72. Riley, S.A., Warhurst, G., Crowe, P.T. & Tumberg, L.A. Active hexose transport across cultured human Caco-2 cells: characterisation and influence of culture conditions. Biochim Biophys Acta 1066, 175-82 (1991). 73. Langeler, E.G., van Uffelen, C.J., Blankenstein, M.A., van Steenbrugge, G.J. & Mulder, E. Effect of culture conditions on androgen sensitivity of the human prostatic cancer cell line LNCaP. Prostate 23, 213-23 (1993).

74. Engelholm, S.A. et al. Genetic instability of cell lines derived from a single human small cell carcinoma of the lung. Eur J Cancer Clin Oncol 21, 815-24 (1985).

75. Masramon, L. et al. Genetic instability and divergence of clonal populations in colon cancer cells in vitro. J Cell Sci 119, 1477-82 (2006).

76. Frattini, A. et al. High variability of genomic instability and gene expression profiling in different HeLa clones. Sci Rep 5, 15377 (2015).

77. Orth, K. et al. Genetic instability in human ovarian cancer cell lines. Proc Natl Acad Sci U S A 91, 9495-9 (1994).

78. Yates, L.R. et al. Genomic Evolution of Breast Cancer Metastasis and Relapse. Cancer Cell 32, 169-184 e7 (2017).

79. Nishizaki, T. et al. Genetic alterations in lobular breast cancer by comparative genomic hybridization. Int J Cancer 74, 513-7 (1997).

80. Li, H. et al. A preliminary study of the relationship between breast cancer metastasis and loss of heterozygosity by using exome sequencing. Sci Rep 4, 5460 (2014).

81. Friedrich, K. et al. Chromosomal genotype in breast cancer progression: comparison of primary and secondary manifestations. Cell Oncol 30, 39-50 (2008).

82. Kroigard, A.B. et al. Clonal expansion and linear genome evolution through breast cancer progression from pre-invasive stages to asynchronous metastasis. Oncotarget 6, 5634-49 (2015).

83. Meiners, S. et al. Inhibition of proteasome activity induces concerted expression of proteasome genes and de novo formation of Mammalian proteasomes. J Biol Chem 278, 21517-25 (2003).

84. Mitsiades, N. et al. Molecular sequelae of proteasome inhibition in human multiple myeloma cells. Proc Natl Acad Sci U S A 99, 14374-9 (2002).

85. Powers, G.L., Ellison-Zelski, S.J., Casa, A.J., Lee, A.V. & Alarid, E.T.

Proteasome inhibition represses ERalpha gene expression in ER+ cells: a new link between proteasome activity and estrogen signaling in breast cancer. Oncogene 29, 1509-18 (2010). 86. Hahn, W.C. et al. Creation of human tumour cells with defined genetic elements. Nature 400, 464-8 (1999).

87. Bodnar, A.G. et al. Extension of life-span by introduction of telomerase into normal human cells. Science 279, 349-52 (1998).

88. Nieminen, A.I., Partanen, J.I., Hau, A. & Klefstrom, J. c-Myc primed mitochondria determine cellular sensitivity to TRAIL-induced apoptosis. EMBO J 26, 1055-67 (2007).

89. Lu, J. et al. l4-3-3zeta Cooperates with ErbB2 to promote ductal carcinoma in situ progression to invasive breast cancer by inducing epithelial- mesenchymal transition. Cancer Cell 16, 195-207 (2009).

90. Seton-Rogers, S.E. et al. Cooperation of the ErbB2 receptor and transforming growth factor beta in induction of migration and invasion in mammary epithelial cells. Proc Natl Acad Sci U S A 101, 1257-62 (2004).

91. Bromham, L. & Penny, D. The modem molecular clock. Nat Rev Genet 4, 216-24 (2003).

92. Andor, N. et al. Pan-cancer analysis of the extent and consequences of intratumor heterogeneity. Nat Med 22, 105-13 (2016).

93. Burrell, R.A. et al. Replication stress links structural and numerical cancer chromosomal instability. Nature 494, 492-496 (2013).

94. Xia, X., Owen, M.S., Lee, R.E. & Gaudet, S. Cell-to-cell variability in cell death: can systems biology help us make sense of it all? Cell Death Dis 5, el26l (2014).

95. Tshemiak, A. et al. Defining a Cancer Dependency Map. Cell 170, 564- 576 el6 (2017).

96. Aguirre, A.J. et al. Genomic Copy Number Dictates a Gene-Independent

Cell Response to CRISPR/Cas9 Targeting. Cancer Discov 6, 914-29 (2016).

97. Munoz, D.M. et al. CRISPR Screens Provide a Comprehensive

Assessment of Cancer Vulnerabilities but Generate False-Positive Hits for Highly Amplified Genomic Regions. Cancer Discov 6, 900-13 (2016).

98. Capes-Davis, A. et al. Check your cultures! A list of cross-contaminated or misidentified cell lines. Int J Cancer 127, 1-8 (2010).

99. Cancer Genome Atlas, N. Comprehensive molecular portraits of human breast tumours. Nature 490, 61-70 (2012). OTHER EMBODIMENTS

It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Other aspects, advantages, and modifications are within the scope of the following claims.