

Title:
NON-RANDOMNESS IN THE CLUSTERING OF GENOMIC EVENTS DETECTS INSERTIONAL MUTAGENESIS IN CLINICAL GENE THERAPY
Document Type and Number:
WIPO Patent Application WO/2008/071682
Kind Code:
A2
Inventors:
VON KALLE CHRISTOF (DE)
SCHMIDT MANFRED (DE)
Application Number:
PCT/EP2007/063666
Publication Date:
June 19, 2008
Filing Date:
December 11, 2007
Assignee:
DKFZ KREBSFORSCHUNGSZENTRUM (DE)
UNIV RUPRECHT KARLS HEIDELBERG (DE)
ABEL ULRICH (DE)
VON KALLE CHRISTOF (DE)
SCHMIDT MANFRED (DE)
International Classes:
G01N33/50; G16B40/30
Other References:
No Search
Attorney, Agent or Firm:
DICK, Alexander (Patentanwalt, Theodor-Heuss-Anlage 12, Mannheim, DE)
Claims:

What is claimed is:

1. A method of determining a selective advantage of affected cell clones, comprising: determining a distribution of integration sites, determining an expected frequency of insertions, applying a common integration site definition to identify a non-random integration site distribution.

2. The method of claim 1, wherein the act of applying a common integration site definition comprises detecting biological effects of insertion sites on clonal selection of hematopoiesis.

3. The method of claim 1, wherein the act of applying comprises detecting substantial biological effects of insertion sites on clonal selection of hematopoiesis in humans.

4. A method for identifying a random or non-uniform sequence motif distribution in an organism, comprising: subjecting a unit of observation, subjecting random uniform or non-uniform allocation, subjecting a cluster of order n of a n-tuple of specific DNA and/or RNA motifs such that the maximum distance between the lowest and highest position is no greater than a fixed bound, subjecting the number of sampled DNA and/or RNA motifs, defining a probability that a given (sub)set of n specific DNA and/or RNA motifs that are randomly or non-uniformly allocated form a cluster of order n, determining an expected value, and applying statistical inferences.

5. The method of claim 4, wherein the act of applying statistical inferences comprises the calculation of P-values.

6. The method of claim 4, wherein the act of applying statistical inferences comprises the statistical analysis of experiments involving viral vector insertion into the human genome.

7. The method of claim 4, wherein the act of applying statistical inferences comprises calculating an expected number of common integration sites (CIS) of various order under the assumption that the distribution of the integration sites (IS) of viral vectors into the genome, or part of it, is random and uniform.

8. The method of claim 4, wherein the act of applying statistical inferences comprises calculating p-values from the number of CIS (of order 2,3,4) observed in a particular experiment, wherein the p-values are probabilities that indicate whether or not the observed number of CIS is compatible with the null hypothesis that the IS produced in the experiment have a random and uniform distribution on at least part of the genome.

9. A method of determining an influence of sequence motifs on affected cell clones, comprising:

(i) determining a distribution of sequence target sites in a sample of nucleic acid, (ii) determining an expected frequency of those target sites in that sample, and

(iii) detecting biological effects of the target sites on clonal selection of a particular condition in a subject containing the sample of nucleic acid.

10. The method of claim 9, wherein the sample of nucleic acid is a genome.

11. The method of claim 10, wherein the subject containing the sample is a human.

12. The method of claim 9, wherein the subject is a human.

13. The method of claim 9, wherein the condition is hematopoiesis.

Description:

NON-RANDOMNESS IN THE CLUSTERING OF GENOMIC EVENTS DETECTS INSERTIONAL MUTAGENESIS IN CLINICAL GENE THERAPY

BACKGROUND OF THE INVENTION

Gene therapy using retrovirus-based vectors has reached unprecedented levels of success, potentially allowing the genetic correction of stem cell systems for the entire lifespan of the treated patient [1-4]. While in the past the chance of vector insertional mutagenesis inadvertently deregulating a gene was considered negligible [5, 6], the increasing efficiency of gene transfer procedures has demonstrated that the genetic modification of cells is indeed not inert. The risk of vector insertion-induced oncogenesis manifested severely in a clinical gene therapy trial treating X-linked severe combined immunodeficiency (SCID-X1), where insertional activation of LMO2 led to a T-ALL-like lymphoproliferation in 3 patients [7, 8]. We have further reported insertional side effects in a clinical gene therapy trial of chronic granulomatous disease (CGD) that resulted in an in vivo expansion of gene-corrected myeloid cells, augmented by insertional activation of MDS1-EVI1, PRDM16 and SETBP1 [4]. Meanwhile, vector-induced clonal selection of gene-modified cells has been studied in a variety of in vitro studies and preclinical in vivo models of insertional oncogenesis [9-16]. We have now shown that subtle influences of insertional gene activation on clonal proliferation dynamics in humans can be demonstrated by insertion site clustering [17, 18]. Thus, varying degrees of non-random growth or survival of targeted cells have to be considered a frequent side effect in gene transfer studies using randomly integrating vector systems.

SUMMARY

The present invention provides a method of determining a selective advantage of affected cell clones, comprising: determining a distribution of integration sites, determining an expected frequency of insertions, applying a common integration site definition to identify a non-random integration site distribution.

In one embodiment, the act of applying a common integration site definition comprises detecting biological effects of insertion sites on clonal selection of hematopoiesis.

In a further embodiment, the act of applying comprises detecting substantial biological effects of insertion sites on clonal selection of hematopoiesis in humans.

The present invention also provides a method for identifying a random or non-uniform sequence motif distribution in an organism, comprising: subjecting a unit of observation, subjecting random uniform or non-uniform allocation, subjecting a cluster of order n of a n-tuple of specific DNA and/or RNA motifs such that the maximum distance between the lowest and highest position is no greater than a fixed bound, subjecting the number of sampled DNA and/or RNA motifs, defining a probability that a given (sub)set of n specific DNA and/or RNA motifs that are randomly or non-uniformly allocated form a cluster of order n, determining an expected value, and applying statistical inferences.

In one embodiment, the act of applying statistical inferences comprises the calculation of P-values. In another embodiment, the act of applying statistical inferences comprises the statistical analysis of experiments involving viral vector insertion into the human genome. In another embodiment, the act of applying statistical inferences comprises calculating an expected number of common integration sites (CIS) of various order under the assumption that the distribution of the integration sites (IS) of viral vectors into the genome, or part of it, is random and uniform.

In a further embodiment, the act of applying statistical inferences comprises calculating p-values from the number of CIS (of order 2,3,4) observed in a particular experiment, wherein the p-values are probabilities that indicate whether or not the observed number of CIS is compatible with the null hypothesis that the IS produced in the experiment have a random and uniform distribution on at least part of the genome.

The present invention also comprises a method of determining an influence of sequence motifs on affected cell clones, comprising:

(i) determining a distribution of sequence target sites in a sample of nucleic acid, (ii) determining an expected frequency of those target sites in that sample, and (iii) detecting biological effects of the target sites on clonal selection of a particular condition in a subject containing the sample of nucleic acid.

In one embodiment, the sample of nucleic acid is a genome. In another embodiment, the subject containing the sample is a human. In any of the methods disclosed herein, in one embodiment, the subject is a human. In any of the methods disclosed herein, in one embodiment, the condition is hematopoiesis.

DETAILED DESCRIPTION

Features such as mutations or structural characteristics can be non-randomly or non-uniformly distributed within a genome. So far, only simulations, requiring extensive computational resources and fixed parameters of cluster definition, could assess statistical inferences on the distribution of sequence motifs. Here, we show that these analyses are possible by mathematical means.

Concerted retrovirus insertional mutagenesis studies using tumor-prone mouse models have aimed to identify new cancer genes by determining the gene configuration near frequently affected integration site loci [14-16]. The related consortium has collected a database of murine cancer genes based on a definition of common insertion sites (CIS) to specify those insertions that potentially affect identical genes with more than expected frequency of involvement in tumorigenesis. In this definition of CIS, both distance and intra- or intergenic location of integrants were used as criteria [14]. In contrast, CIS were originally introduced as two independent viral integration events located close to each other (<30 kb), irrespective of whether they lie inside or outside gene coding regions [19]. Although the invention is not so limited, we followed the latter CIS definition in an illustrative embodiment of the invention, as it is a more relevant model of insertion site proximity to a gene's promoter. We considered 2, 3 or 4 insertions as CIS of 2nd, 3rd or 4th order if they fell within a 30 kb, 50 kb or 100 kb window of genomic sequence from each other, respectively, as previously described [14], but scored independently of intra- or intergenic location.

To validate the correctness of our mathematical approach, we simultaneously performed computer simulations in which a window of size $d_n$ ($d_n$ = the maximum distance defining a CIS of order n) was shifted through the ordered sequence of the IS. For each window W(j) = [IS(j), IS(j) + $d_n$], we counted how many CIS of order n including IS(j) as first element were contained in W(j). The program was written in the open-source 'R' language (http://cran.r-project.org).
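The sliding-window count described above can be sketched as follows. This is a minimal illustration in Python, not the R program the text refers to; the function names, the use of integer kb positions, and the seed are assumptions made for the example.

```python
import random
from bisect import bisect_right
from math import comb


def count_cis(positions, order, d_n):
    """Count CIS of the given order: subsets of `order` IS whose span is <= d_n."""
    xs = sorted(positions)
    total = 0
    for j, start in enumerate(xs):
        # IS that follow IS(j) in sorted order and lie inside W(j) = [IS(j), IS(j) + d_n]
        hi = bisect_right(xs, start + d_n)
        followers = hi - j - 1
        # Every choice of (order - 1) followers forms one CIS with IS(j) as first element
        total += comb(followers, order - 1)
    return total


def simulate_mean_cis(n_is, order, d_n, g, runs, seed=0):
    """Mean number of CIS over `runs` random uniform allocations on a genome of g kb."""
    rng = random.Random(seed)  # assumed seed, only for reproducibility
    total = 0
    for _ in range(runs):
        positions = [rng.randint(1, g) for _ in range(n_is)]
        total += count_cis(positions, order, d_n)
    return total / runs


if __name__ == "__main__":
    # 140 IS, CIS of 2nd order, 30 kb window, haploid human genome of 3.12e6 kb
    print(simulate_mean_cis(140, 2, 30, 3_120_000, runs=2000))
```

With enough runs the printed mean should settle near the simulated mean values reported for the same number of IS; 2000 runs are used here only to keep the example quick.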

An "insertion site" is only one class of genomic or nucleic acid loci that can be assimilated by the present algorithms and computer manipulations to make an assessment concerning the non-randomness of those insertion sites. For instance, the present algorithms and software calculations also can be applied to any nucleotide motif present in a mass of nucleic acid material, such as an organism's genome. That is, the present invention permits the skilled artisan to determine the uniformity and randomness of any nucleotide sequence that has a certain frequency of occurrence and distribution in a genome. Hence, the present inventive calculative assessment methods can be applied not only to the presently described "insertion sites" but also to (i) restriction enzyme recognition sites, (ii) any sequence motif, such as (iii) transcription motifs, (iv) protein or factor binding sites, such as recognition sites present in 5'- and 3'-untranslated regions, e.g., for termination signal protein binding sites, (v) sequence motifs in promoters that facilitate binding of promoter-specific proteins, (vi) telomeric repeat motifs, (vii) cis-regulatory DNA motifs, (viii) any enzymatic recognition site, such as those involved in DNA repair, splicing, and intron excision, (ix) nucleotide mutation sites or chromosomal hotspots, (x) single nucleotide polymorphisms (SNPs), and (xi) any sequence that is conserved between the genomes of individuals of a species and distributed throughout the genome at the same or different frequency among those individuals.

The present invention encompasses two fundamental aspects, namely (1) a "biological perspective" and (2) a "mathematical perspective," which are described in more detail below.

(1) Biological Perspective

Integration site (IS) distribution was long thought to be random. With the complete genome sequencing efforts of the last decade, however, it has become obvious that this is not necessarily the case. For instance, avian sarcoma and leukosis virus (ASLV) IS are evenly distributed over the genome, with no known preferred integration sites or hotspots and no preference for insertion in gene coding regions or at CpG island loci. In this biological sense, therefore, the IS distribution is random (and uniform).

By contrast, murine leukemia virus is known to integrate preferably at gene loci that surround transcriptional start sites. Accordingly, in biology, this is non-random (and non-uniform) distribution.

Lentivirus (HIV-1) is known to preferentially integrate in genes but typically in an evenly distributed manner. In biological terms, this is non-random (and uniform) distribution. The present invention can be adapted to each of these cases. The biological community usually equates "non-uniform" with "non-random", likely because the preference for certain IS locations associated with non-uniformity intuitively corresponds to a non-random biological process. From the biological perspective, then, the present inventive analysis and algorithms are useful for determining biological distribution of non-random, random, uniform and non-uniform events, and any combinations thereof, depending on the genome and, for instance, specific viral vector activity.

(2) Mathematical Perspective

From the mathematical point of view, the present invention defines non-randomness generally. Thus, all of the viral examples, for instance, are, mathematically speaking, "random." For ASLV this is clear. Contextually, the mathematical analysis can assess distribution at particular loci as opposed to the entire genome. Thus, for MLV, the transcriptional start regions are assessed independently: the distribution within those regions is random, and the distribution in the rest of the genome is also random. The same holds for HIV-1, where the gene coding regions are viewed separately from the rest of the genome.

In order to assess whether or not an observed number of CIS is "higher than" expected, one has to specify exactly what "expected" means. This necessarily implies determining a baseline of IS loci probability distributions for comparison purposes. This probability distribution describes a "random" variable in mathematical terms. Thus, the mathematical comparison always includes a "random" element. However, IS distribution itself actually may or may not be uniform.

Thus, statistically speaking, the present invention covers comparing CIS observations with (i) a uniform (random) distribution; and (ii) with a non-uniform (random) distribution.

The present invention is not limited to the assessment of distribution and frequency of nucleotide motifs in a genome - any mass of nucleic acid may be assessed, such as a pool or library of cDNA molecules, or a synthetic nucleic acid molecule, such as a vector, plasmid, or cosmid. Similarly, the present inventive methods can be applied to identifying distribution and frequency qualities associated with sequences present in RNA, not only DNA.

Accordingly, the use of the term "insertion sites" herein is not all-encompassing but is simply one exemplified embodiment of the present invention. Any sequence that is of interest may be investigated according to the present invention, such as those identified above, not only the sites of insertion. Thus, a generic term for these sequences of interest is, according to the present invention, a "sequence target site."

The calculations that are applied herein to any such sequence target site simply require knowledge of (1) the genome size or nucleotide size of the mass of nucleic acid to be investigated, (2) the number of (sampled) sequence motifs, (3) the number of sequence motifs within a distance, (4) the order (number of affected sequence motifs within a distance) and (5) the distance. Based on our mathematical allocations, we can provide related workspaces to be loaded into publicly available basic programs to retrieve the expected frequency and p-value in real time. Thus, for each individual problem a flexible cluster definition can be applied to most effectively identify non-random or non-uniform sequence motif distribution. As an example, we show the effectiveness and reliability of our mathematical approach in clinical retroviral vector integration site distribution.

Part 1: Random uniform allocation of IS

For the purpose of this discussion, the unit of observation (location and distance) is kb. We assume that a number $n_{IS}$ of IS is randomly allocated (with a uniform distribution) to the locations of a genome consisting of g kb. A CIS of order n is an n-tuple of IS such that the maximum distance between the lowest and highest position is no greater than a fixed bound.

Further terminology

$d_n$, defining the "size" or distance of a CIS of order n, i.e. the maximum permissible distance between any two members of a CIS of order n

$P_n$, probability that a given (sub)set of n IS that are randomly allocated form a CIS of order n

P(m,d), probability that a given subset of m randomly allocated IS has a span (= maximum distance between any two elements) of exactly d

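We start with the elementary observation that the expectancy $E_n$ equals $P_n$ times the number of subsets of IS consisting of n elements:

$$E_n = \binom{n_{IS}}{n}\, P_n \qquad (1)$$

Clearly,

$$P_n = \sum_{d=0}^{d_n} P(n,d) \qquad (2)$$

It remains to determine P(n,d). First note that P(1,d) = 0 for d > 0. Furthermore, for all m ≥ 1:

$$P(m,0) = \frac{1}{g^{\,m-1}} \qquad (3)$$

A recursive formula for P(m,d), d > 0, can be derived by breaking down the potential CIS of order m into subsets of m-1 elements having a span of d' ≤ d, to which an m-th IS is added such that the maximum span is exactly d:

$$P(m,d) = \frac{1}{g}\left\{ \sum_{d'=0}^{d-1} \left[ 2\, P(m-1,d') \right] + (d+1)\, P(m-1,d) \right\} + r \qquad (4)$$

where r is a negligible correction term that arises because the uncorrected recursion formula is strictly valid only for subsets of IS that have a distance > d from the telomeres.

By mounting the recursive ladder (m = 1, ..., n), these formulae successively yield P(n,d), $P_n$, and $E_n$. In particular, one easily obtains (d > 0):

$$P(2,d) \approx \frac{2}{g}, \qquad P(3,d) \approx \frac{6d}{g^2}, \qquad P(4,d) \approx \frac{12d^2 + 2}{g^3}$$

Plugging this into formulae (1) and (2), together with the d = 0 term from (3), gives $E_n$:

$$E_2 \approx \binom{n_{IS}}{2} \frac{2d_2 + 1}{g}, \qquad E_3 \approx \binom{n_{IS}}{3} \frac{3d_3^2 + 3d_3 + 1}{g^2}, \qquad E_4 \approx \binom{n_{IS}}{4} \frac{4d_4^3 + 6d_4^2 + 4d_4 + 1}{g^3}$$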
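The recursive ladder can also be climbed numerically. The sketch below is an illustration only: it takes P(1,0) = 1 and P(m,0) = 1/g^(m-1) as starting values (an assumption consistent with the stated results for P(2,d), P(3,d) and P(4,d)), neglects the correction term r, and uses function names chosen for the example.

```python
from math import comb


def span_probabilities(n, d_max, g):
    """Return [P(n,0), ..., P(n,d_max)] from recursion (4), with r neglected."""
    prev = [1.0] + [0.0] * d_max           # P(1,d): a single IS always has span 0
    for m in range(2, n + 1):
        row = [1.0 / g ** (m - 1)]         # P(m,0): all m IS at the same position (assumed)
        running = prev[0]                  # sum of P(m-1,d') for d' = 0 .. d-1
        for d in range(1, d_max + 1):
            row.append((2.0 * running + (d + 1) * prev[d]) / g)
            running += prev[d]
        prev = row
    return prev


def expected_cis(n_is, n, d_n, g):
    """E_n = C(n_IS, n) * P_n, with P_n the sum of P(n,d) over d = 0 .. d_n."""
    return comb(n_is, n) * sum(span_probabilities(n, d_n, g))


if __name__ == "__main__":
    g = 3.12e6  # haploid human genome size in kb, as used in the text
    print(expected_cis(140, 2, 30, g))    # compare with the 'Formula' mean value for 140 IS
    print(expected_cis(1000, 3, 50, g))
    print(expected_cis(1000, 4, 100, g))
```

Summing the recursion over d reproduces the compact form P_n = ((d_n + 1)^n - d_n^n) / g^(n-1), which offers a quick way to cross-check any implementation.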

As shown in Table 1, our mathematical approximation corresponds extremely well to the mean values found in 50000 simulation runs.

Table 1

Table 1 Mean values for random CIS formation (1000 IS) determined either with computer simulations or mathematically, with g = 3.12 × 10^6, d2 = 30, d3 = 50, d4 = 100, and $n_{IS}$ = 1000. Simulations were performed with 50000 runs each. g, haploid size of the human genome (3.12 × 10^6 kb); $d_n$, genomic window size [kb] for CIS of nth order.

Statistical inferences, such as the calculation of P-values, can be based on the observation that, under the null hypothesis ($H_0$) of random uniform allocation of the IS, the number of CIS of order n is (approximately) Poisson distributed with parameter λ = $E_n$. Thus, if the random variable X denotes the number of CIS of order n, and X = k is observed in a trial, then the p-value P(X ≥ k) of this observation calculated under $H_0$, i.e. from the Poisson distribution Po($E_n$), is given by

$$P(X \ge k \mid H_0) = \sum_{i=k}^{\infty} e^{-E_n}\, \frac{E_n^{\,i}}{i!} = P(\chi^2 < 2E_n),$$

where the random variable $\chi^2$ has a chi-square distribution with 2k degrees of freedom [20, 21].

The Poisson approximation to the true random distribution of CIS is exceedingly close. In fact, if the number of simulation runs is sufficiently high, the simulated distribution is virtually indistinguishable from Po($E_n$). In particular, both the expected values and the p-values derived from Po($E_n$) are nearly identical to those obtained in computer simulations. The latter point is apparent from Table 2, where, as a final "proof of principle" of our mathematical calculations, results of the analysis of our integration data set retrieved from two clinical SCID-X1 therapy trials [17, 18] are given.

Table 2

CIS    IS     MV (Simulation)   MV (Formula)   p-Value (Simulation)   p-Value (Formula)
3      140    0.188             0.190          0.0009                 0.001
1      134    0.175             0.174          0.16                   0.16
4      102    0.100             0.101          0                      3.9 × 10^-6
15     304    0.899             0.900          0                      6.8 × 10^-14
102    572    3.200             3.193          0                      < 10^-16

Table 2 Statistical analysis of the results on CIS formation in clinical samples [17, 18] using the methods described above. Simulations were performed with 50000 runs on the haploid size of the human genome (3.12 × 10^6 kb). P-values estimated from simulations equal the proportion per 50000 runs in which the number of CIS was at least as high as the number observed in the trials. The genomic window size chosen for CIS of 2nd order was 30 kb. CIS, number of identified CIS of 2nd order; IS, number of identified integration sites; MV, mean value.

The p-value can be calculated by means of either of the following commands ('R' code): 1 - ppois(q = k - 1, lambda = E_n) or pchisq(q = 2 * E_n, df = 2 * k). Using the data of Table 2 (first line), this is 1 - ppois(q = 2, lambda = 0.19) or pchisq(q = 0.38, df = 6). In both instances, the result is 0.00099. Alternatively, the table of the chi-square distribution with 6 degrees of freedom can be used to look up the probability P(X < 0.38) [20]. One should note that, for low $E_n$, the p-value of a single observed CIS is virtually identical to $E_n$. This implies that, for n ≥ 5, no p-values need to be calculated (and hence no formulae are required for $E_n$, n ≥ 5), because even with an extremely liberal definition of the CIS ($d_n$ = 500) and a fairly high number of IS ($n_{IS}$ = 1000) a single CIS will be statistically significant (p = 0.027).
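For readers without R at hand, the same two routes to the p-value can be sketched with the Python standard library alone. The function names are chosen for this example, and the chi-square probability is obtained by plain numerical integration of the density (Simpson's rule), an assumption made to keep the sketch dependency-free rather than a method taken from the text.

```python
from math import exp, factorial, gamma


def poisson_tail(k, lam, terms=200):
    """P(X >= k) for X ~ Poisson(lam), summed directly from i = k upwards."""
    term = exp(-lam) * lam ** k / factorial(k)
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)
    return total


def chi2_cdf(x, df, steps=2000):
    """P(chi-square with df >= 2 degrees of freedom <= x) via Simpson's rule."""
    half = df / 2.0
    norm = 2.0 ** half * gamma(half)

    def density(t):
        return t ** (half - 1.0) * exp(-t / 2.0) / norm

    h = x / steps  # steps must be even for Simpson's rule
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0


if __name__ == "__main__":
    k, e_n = 3, 0.19                    # first line of Table 2: 3 CIS observed, E_n = 0.19
    print(poisson_tail(k, e_n))         # Poisson route
    print(chi2_cdf(2 * e_n, 2 * k))     # chi-square route with 2k degrees of freedom
```

Both calls should print approximately 0.00099, the value quoted for the first line of Table 2. Summing the tail directly, instead of subtracting the cumulative probability from 1, avoids cancellation when the p-value is very small.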

Part 2: Non-random uniform allocation of IS

Defining non-randomness in the clustering of genomic events often requires additional precautions, as sequence structures of interest may already have known specific distribution biases. In the case of our clinical example [17, 18], it is known that retroviral vectors based on the murine leukaemia virus (MLV) tend to integrate into gene coding regions, preferentially near the transcriptional start site (TSS) [17, 18, 22-25]. It has also been proposed that additional, mostly unknown factors may influence the accessibility of certain genomic DNA regions to vectors [24]. Thus, the null hypothesis of random uniform allocation of MLV IS may not be adequate according to the current state of the art, as has recently been argued by Wu et al. [25]. In line with this study, we partitioned the genome into 2 areas that differ in the likelihood of being targeted by vectors [25]: 25% of IS were limited to the +/-5 kb region surrounding the TSS, and 75% of IS were randomly distributed over the remaining human genome.

Further terminology

$n_{TSS}$, number of TSS

U5, an interval of +/-5 kb around a TSS

GU5, union of all U5

$n_{IS,GU5}$, $n_{IS,Comp}$, number of IS occurring in GU5 and in the complement of GU5, respectively

$n_{CIS,GU5}$, $n_{CIS,Mix}$, $n_{CIS,Comp}$, number of CIS occurring in GU5 only, both in GU5 and in the complement of GU5, and in the complement of GU5 only, respectively.

The expected value of $n_{CIS}$ is given by the following sum:

$$E(n_{CIS}) = E(n_{CIS,GU5}) + E(n_{CIS,Mix}) + E(n_{CIS,Comp}) \qquad (5)$$

In the following it will be shown how to calculate the terms on the right side of (5).

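We start with the expected value of $n_{CIS,GU5}$, for which we assume that vector integration into any U5 occurs with the same probability. Then

$$E(n_{CIS,GU5}) = n_{TSS} \cdot E(X) \qquad (6)$$

where X is the number of CIS (among those occurring in GU5) that occur in a fixed U5. Observing that i IS in a fixed U5 yield $\binom{i}{k}$ CIS of order k in this U5, one easily obtains the expected value of X:

$$E(X) = \sum_{i=k}^{n_{IS,GU5}} \binom{i}{k}\, p(x = i) \qquad (7)$$

Since the number x of IS falling into a fixed U5 is binomially distributed, x ~ B($n_{IS,GU5}$, 1/$n_{TSS}$),

$$p(x = i) = \binom{n_{IS,GU5}}{i} \left(\frac{1}{n_{TSS}}\right)^{i} \left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{IS,GU5}-i} \qquad (8)$$

Merging equations (6)-(8) yields the desired formula for E($n_{CIS,GU5}$):

$$E(n_{CIS,GU5}) = n_{TSS} \sum_{i=k}^{n_{IS,GU5}} \binom{i}{k} \binom{n_{IS,GU5}}{i} \left(\frac{1}{n_{TSS}}\right)^{i} \left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{IS,GU5}-i} \qquad (9)$$

If $n_{IS,GU5}$ is small compared to $n_{TSS}$ (undoubtedly, this is mostly the case), terms of higher order can be neglected so that, because ($n_{TSS}$ - 1)/$n_{TSS}$ ≈ 1, formula (9) simplifies to

$$E(n_{CIS,GU5}) \approx \binom{n_{IS,GU5}}{k} \frac{1}{n_{TSS}^{\,k-1}} \qquad (10)$$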

Notice that formulas (6)-(10) do not depend on the spatial distribution of the IS within the U5. (It is unnecessary to account for the closeness of IS within U5 because any pair - or triple, quadruple etc. for that matter - of IS within a U5 yields a CIS.)
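To see how close the simplified expression is to the full binomial sum, both can be evaluated side by side. The sketch below is illustrative: the function names are made up for the example, and the value n_TSS = 25000 is an assumed placeholder, since the text does not state the number of TSS.

```python
from math import comb


def expected_cis_gu5_full(n_is_gu5, n_tss, k):
    """Full sum: n_TSS times the expected number of order-k CIS in one fixed U5."""
    p = 1.0 / n_tss
    per_u5 = sum(
        comb(i, k) * comb(n_is_gu5, i) * p ** i * (1.0 - p) ** (n_is_gu5 - i)
        for i in range(k, n_is_gu5 + 1)
    )
    return n_tss * per_u5


def expected_cis_gu5_short(n_is_gu5, n_tss, k):
    """Simplified form: C(n_IS_GU5, k) / n_TSS^(k-1)."""
    return comb(n_is_gu5, k) / n_tss ** (k - 1)


if __name__ == "__main__":
    n_tss = 25_000    # ASSUMPTION: illustrative number of TSS, not given in the text
    n_is_gu5 = 35     # e.g. 25% of 140 IS falling into GU5
    print(expected_cis_gu5_full(n_is_gu5, n_tss, 2))
    print(expected_cis_gu5_short(n_is_gu5, n_tss, 2))
```

The two printed values should agree up to floating-point rounding: the sum over i is the k-th factorial moment of a binomial variable, which collapses to the short form, so little is lost by using the simplified expression.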

If, as recently described [25], the integrations into GU5 depend on the level of gene expression ("on" or "off") so that they are limited to a certain proportion p of the U5 (e.g. p = 5%), one has to use p·$n_{TSS}$ instead of $n_{TSS}$ in formulas (8) through (10). This results in E($n_{CIS,GU5}$) increasing (by a factor 1/p for CIS of order 2).

The expected value E($n_{CIS,Mix}$) is not independent of the distance between the IS and the TSS. Thus, inevitably, assumptions regarding the spatial distribution of the IS will influence its value. In the following, a formula for E($n_{CIS,Mix}$) is derived for the case k = 2. As before, CIS of order 2 are defined by a maximum distance of 30 kb between the IS.

If the TSS are indistinguishable with respect to the probability distribution of the integrations, then

$$E(n_{CIS,Mix}) = n_{IS,GU5} \cdot n_{IS,Comp} \cdot n_{TSS} \cdot p_{mix} \qquad (11)$$

where $p_{mix}$ denotes the probability that an arbitrary pair of IS (with one element in GU5 and one element in the complement of GU5) forms a CIS of order 2 around a fixed TSS. We will assume that the distributions of IS within a U5 and within +/-30 kb around a TSS are symmetric. Then, using kb as unit of distance,

In formula (12) the point x=0 and y=0 corresponds to the TSS-5; f(x) designates the probability density function of vector integrations in U5; and g(y) designates the corresponding density function in [TSS-35,TSS-5]. Formula (12) shall be evaluated for two special cases:

Case 1: Vector integrations are uniformly distributed in GU5 and in the complement of GU5, respectively, i.e.,

$$f(x) \equiv \frac{1}{n_{TSS} \cdot 10}, \qquad g(x) \equiv \frac{1}{g - n_{TSS} \cdot 10}$$

Solving the integrals in formula (12) we have

$$p_{mix} = \frac{400}{10\, n_{TSS}\,(g - 10\, n_{TSS})} \qquad (13)$$

Case 2: As above, vector integrations in the complement of GU5 are assumed to be uniformly distributed. However a triangular distribution is assumed for f(x). The corresponding formula is easily calculated:

By plugging this into (12) we get

$$p_{mix} = \frac{170}{3\, n_{TSS}\,(g - 10\, n_{TSS})} \qquad (14)$$

It may be surprising that a triangular distribution in U5 results in a higher expected value for $n_{CIS,Mix}$ than a uniform distribution. However, this becomes more plausible if one notes that a higher value is also obtained if the IS are concentrated in an extreme manner within the U5, viz. in a one-point distribution with total mass in the TSS. In this special case (which is particularly easy to evaluate), $p_{mix}$ = 50/($n_{TSS}$(g - $n_{TSS}$)).

If, with respect to the formation of CIS, the complement of GU5 could be regarded as a continuum, the expected value of $n_{CIS,Comp}$ would be given by the formulas developed in Part 1 of this contribution. In the light of the biases of retroviral (MLV) vectors, however, the complement of GU5 may rather be viewed as a partitioned set consisting of approximately $n_{TSS}$ disjoint intervals. It follows that the residual term on the right-hand side of equation (4) (Part 1) may no longer be negligible. For an exact calculation of E($n_{CIS,Comp}$) one would need to specify the lengths of all components in the complement of GU5; and even then, given that the resulting formulas would be prohibitively complex, one would have to rely on numerical computation using computer simulations. On the other hand, it should be noted that the assumption of a continuum (at least if the IS are taken to be uniformly distributed) tends to lead to an overestimation of the number of CIS, because the boundaries of the components reduce the number of CIS occurring in their neighborhood. It follows that the formulas derived in Part 1 form an upper bound for E($n_{CIS,Comp}$). In particular, the true p-values are less than or equal to the values calculated by means of the formulas derived in Part 1. Therefore, any positive statements regarding statistical significance remain valid. Moreover, the overestimation is probably fairly small given that the sections of the complement of GU5 located between the TSS are mostly rather wide compared to the length defining a CIS.

Indeed, the null hypothesis of non-random uniform allocation for IS distribution does not substantially change the results we obtained under the hypothesis of random uniform allocation for CIS formation in our clinical samples (Table 2), as is shown in Table 3.

Table 3

CIS    IS     MV (Uniform)   MV (Triangular)   p-Value (Uniform)   p-Value (Triangular)
3      140    0.191          0.212             0.001               0.0014
1      134    0.175          0.195             0.161               0.177
4      102    0.101          0.124             4.0 × 10^-6         6.1 × 10^-6
15     304    0.905          1.006             7.4 × 10^-14        3.3 × 10^-13
102    572    3.212          3.568             < 10^-16            < 10^-16

Table 3 Statistical analysis of the results on CIS formation in clinical samples [17, 18] using the methods described above. Calculations were performed on the haploid size of the human genome (3.12 × 10^6 kb) and on the basis of an IS skewing (25% of all IS) to the +/-5 kb TSS region, for which a uniform and a triangular IS distribution, respectively, was assumed. The genomic window size chosen for CIS of 2nd order was 30 kb. CIS, number of identified CIS of 2nd order; IS, number of identified integration sites; MV, mean value.
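The way the three terms of the sum (5) combine for CIS of 2nd order can be illustrated with a short calculation. Everything numerical below that is not printed in the text is an assumption: the number of TSS (25000) is not stated in the source and was chosen because it brings the totals close to the mean values of Table 3; the p_mix constants 40 and 170/3 are one reading of the partly illegible formulas (13) and (14); and the Part 1 formula for E_2 is applied to the complement of GU5 as if it were a continuum of length g - 10·n_TSS.

```python
from math import comb

G_KB = 3.12e6       # haploid human genome size in kb (from the text)
D2_KB = 30          # window for CIS of 2nd order (from the text)
U5_KB = 10          # width of a +/-5 kb interval around a TSS
N_TSS = 25_000      # ASSUMPTION: number of TSS, not stated in the text

# ASSUMPTION: numerators of p_mix * n_TSS * (g - 10 n_TSS), read from (13) and (14)
P_MIX_NUMERATOR = {"uniform": 400.0 / 10.0, "triangular": 170.0 / 3.0}


def expected_cis_order2(n_is, shape, frac_gu5=0.25):
    """Sum (5) for k = 2: CIS inside GU5, mixed CIS, and CIS in the complement."""
    n_gu5 = round(frac_gu5 * n_is)
    n_comp = n_is - n_gu5
    comp_kb = G_KB - U5_KB * N_TSS

    e_gu5 = comb(n_gu5, 2) / N_TSS                      # formula (10) with k = 2
    p_mix = P_MIX_NUMERATOR[shape] / (N_TSS * comp_kb)
    e_mix = n_gu5 * n_comp * N_TSS * p_mix              # formula (11)
    e_comp = comb(n_comp, 2) * (2 * D2_KB + 1) / comp_kb  # Part 1 E_2 on the complement
    return e_gu5 + e_mix + e_comp


if __name__ == "__main__":
    for n_is in (140, 304, 572):
        print(n_is,
              round(expected_cis_order2(n_is, "uniform"), 3),
              round(expected_cis_order2(n_is, "triangular"), 3))
```

Under these assumptions the totals land within about 0.001 of the Table 3 mean values for 140, 304 and 572 IS. That is a consistency check on the structure of the calculation, not a confirmation of the parameters actually used.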

Embodiments of the present invention can provide a reliable, straightforward calculation of non-randomness in CIS and other genomic event distributions under the null hypothesis of random and non-random uniform allocation. Expected values for the number of CIS of different orders that are significantly different from a random distribution can be calculated with superior flexibility compared to the previous computer simulations, which were complex to model and required extensive computational resources. Our approach allows the use of a versatile definition of CIS, enabling a closely problem-oriented, highly exact evaluation of non-randomness that is particularly useful for assessing clinical risks of insertional toxicity in novel vector systems such as transposases and zinc finger nucleases.

According to one example embodiment of the invention, the software, and the mathematical framework upon which it is based, permits the statistical analysis of experiments involving viral vector insertion into the human genome.

According to a further example embodiment of the invention, the software, and the mathematical framework upon which it is based, permits the calculation of the expected number of common integration sites (CIS) of various order (i.e. involving 2 or more vector insertions) under the assumption that the distribution of the integration sites (IS) of viral vectors into the genome, or part of it, is random and uniform.

According to another example embodiment of the invention, the software, and the mathematical framework upon which it is based, permits the calculation of p-values from the number of CIS (of order 2, 3, 4) observed in a particular experiment. These p-values are probabilities that indicate whether or not the observed number of CIS is compatible with the null hypothesis that the IS produced in the experiment have a random and uniform distribution on the genome (or part of it).

According to a further example embodiment of the invention, a method is provided for identifying non-random distribution (clusters) of specific DNA and/or RNA motifs and/or structures of specific source and/or derivation in a given natural and/or artificial genome without computational simulations, comprising: (a) subjecting the unit of observation (e.g. kilobasepairs), (b) subjecting random uniform allocation, (c) subjecting a cluster of order n of a n-tuple of specific DNA and/or RNA motifs such that the maximum distance between the lowest and highest position is no greater than a fixed bound, (d) subjecting the number of sampled DNA and/or RNA motifs, (e) defining the probability that a given (sub)set of n specific DNA and/or RNA motifs that are randomly allocated form a cluster of order n, (f) determining the expected value, and (g) applying statistical inferences such as the calculation of P-values. Of note: subjecting 'random uniform allocation' can be extended for individual cases to defined 'non-uniform allocation.'

According to various implementations of the invention, examples of the specific DNA structure/motif may be: in general integrating/inserted DNA (RNA) and DNA (RNA) integrated by homologous and/or non-homologous recombination; integrated in full length and/or in portions; DNA present in the natural and/or artificial genome

1. virus (e.g. retrovirus, lentivirus, foamy virus, adeno-associated virus)

2. viral vectors,

3. plasmid DNA

4. naked DNA

5. ultraconserved and/or conserved nucleotide sequences

6. repetitive elements (e.g. Alu, LINE, SINE, Satellite)

7. transcription factor binding sites

8. retrotransposons, transposons

9. methylation pattern

According to various implementations of the invention, examples of the specific source and/or derivation can include: in general each of eukaryotic and prokaryotic and viral DNA (RNA)

1. vertebrates (e.g. mammals like man, non-human primate, mouse, rat)

2. invertebrates

3. plants

4. fungi

5. bacteria

6. virus

According to various implementations of the invention, examples of the natural and/or artificial genome can include: in general each of eukaryotic and prokaryotic and viral DNA (RNA) and artificial produced DNA (RNA)

1. vertebrates (e.g. mammals like man, non-human primate, mouse, rat)

2. invertebrates

3. plants

4. fungi

5. bacteria

6. virus

7. artificial chromosome (e.g. yeast artificial chromosome, bacterial artificial chromosome)

Various embodiments of the present invention can be implemented by the use of an electronic device, or in a network connecting multiple electronic devices. The electronic device is representative of a number of different technologies, such as personal computers (PCs), laptop computers, workstations, personal digital assistants (PDAs), Internet appliances, routers, switches, cellular telephones, wireless devices, and the like. In the illustrated embodiment, the electronic device includes a central processing unit (CPU) and a display device. The display device enables the electronic device to communicate directly with a user through a visual display.

The electronic device can include a primary storage device and a secondary storage device for storing data and instructions. The primary and secondary storage devices can include, but are not limited to, such technologies as a floppy drive, hard drive, tape drive, optical drive, read only memory (ROM), random access memory (RAM), and the like. Applications such as browsers, JAVA virtual machines, C compilers, and other utilities and applications can be resident on one or both of the primary and secondary storage devices.

The electronic device can also include a network interface for communicating with one or more electronic devices external to the electronic device depicted. Modems and Ethernet cards are examples of network interfaces for establishing a connection with an external electronic device or network. The CPU has either internally, or externally, attached thereto one or more of the aforementioned components. Interactive programming and/or development applications, and other applications can be installed and operated on the electronic device.

It should be noted that the electronic device is merely representative of a structure for implementing portions of the present invention. However, one of ordinary skill in the art will appreciate that the present invention is not limited to implementation on only the described electronic device. Other implementations can be utilized, including an implementation based partially or entirely in embedded code, where no user inputs or display devices are necessary. In such an instance, a processor can communicate directly with another processor, or other device.

As noted above, embodiments within the scope of the present invention include program products comprising machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media which can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can comprise RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a machine, the machine properly views the connection as a machine-readable medium. Thus, any such connection is properly termed a machine-readable medium.

Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions comprise, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machines to perform a certain function or group of functions.

It should be noted that although the description herein may describe a specific order of method steps, it is understood that the order of these steps may differ from what is depicted. Also two or more steps may be performed concurrently or with partial concurrence. Such variation will depend on the software and hardware systems chosen and on designer choice. It is understood that all such variations are within the scope of the invention. Likewise, software implementations of the present invention could be accomplished with standard programming techniques with rule based logic and other logic to accomplish the various connection steps, processing steps, comparison steps and decision steps.

EXAMPLE 1

SUMMARY

Features such as mutations or structural characteristics can be non-randomly or non-uniformly distributed within a genome. So far, computer simulations have been required for statistical inferences on the distribution of sequence motifs. Here, we show that these analyses are possible using an analytical, mathematical approach. For the assessment of non-randomness, our calculations require only the genome size, the number of (sampled) sequence motifs and distance parameters. We have developed computer programs evaluating our analytical formulas for the real-time determination of expected values and p-values. This approach permits a flexible cluster definition that can be applied to most effectively identify non-random or non-uniform sequence motif distribution. As an example, we show the effectiveness and reliability of our mathematical approach in clinical retroviral vector integration site distribution.

INTRODUCTION

With the sequences of complete genomes available (Adams et al., 2000; Lander et al., 2001; Holt et al., 2002; Waterston et al., 2002), and accelerating technologies for high-throughput sequencing (Margulies et al., 2005), genome-wide sequence analyses of individual samples will soon become a reality. Comparative analyses of sequence composition and sequence motif distribution have become central parts of genome and transcriptome research, providing new insights on evolution, physiology and medical diagnosis (Camargo et al., 2001; Riva et al., 2004; Gerhard et al., 2004; Ota et al., 2004; Wang et al., 2006; Garrigan and Hammer, 2006; Subramanian et al., 2003, 2003; Miranda et al., 2006; Bakker et al., 2006). Our understanding of integrating viruses and related vectors in gene therapy trials is an interesting example of such approaches. Since the completion of the human and murine genome sequencing projects, the location of the vector in the cellular genome can be defined precisely, allowing the determination of possible vector integration induced effects on the surrounding genomic DNA regions at the molecular level. Integration site analyses have gained increasing interest with the dramatic development of a retroviral vector-induced lymphoproliferative disease in 3 patients cured of X-linked severe combined immunodeficiency (X-SCID) that was triggered by insertional activation of the proto-oncogene LMO2 (Hacein-Bey-Abina, 2003; Hacein-Bey-Abina, 2003). Meanwhile, insertion induced side effects have been identified ranging from immortalization (Du, 2006) to clonal dominance (Hematti, 2004; Calmels, 2005; Kustikova, 2006; Ott, 2006) and even oncogenesis (Li, 2002; Modlich, 2005; Montini, 2006) in a variety of gene therapy studies. These studies have in common that a clustering of integration sites (IS) in certain genomic loci was detectable, and likely provided a selective advantage for the affected cell clone.

The clustering of integrations, termed common integration sites (CISs), as an indicator for clone selection has already been used in concerted retrovirus insertional mutagenesis studies that aimed to identify new cancer genes by determining the gene configuration near frequently affected integration site loci (Mikkers, 2002; Lund, 2002; Suzuki, 2002). For CIS determination, computer simulations were performed to assess non-randomness of IS distribution in tumors (Suzuki, 2002). To validate the correctness of our mathematical approach defining non-randomness and non-uniform sequence motif distribution, we analyzed the IS distribution and presence of CIS in 2 successful clinical SCID-X1 studies (Cavazzana-Calvo et al., 2000; Gaspar et al., 2004; unpublished data). We considered 2, 3 or 4 insertions as a CIS of 2nd, 3rd or 4th order if they fell within a 30kb, 50kb or 100kb window of genomic sequence from each other, respectively. Simultaneously, we performed computer simulations written in the open source 'R' language (http://cran.r-project.org), for which a window of size d_n (d_n = the maximum distance defining a CIS of order n) was shifted through the ordered sequence of the IS. For each window W(j) = [IS(j), IS(j)+d_n] it was then counted how many CIS of order n including IS(j) as first element were contained in W(j). We show that the output of our mathematical approach for defining biased IS distribution is comparable to that of computational simulations, but that the approach is superior in performance. Even if the null hypothesis of random uniform allocation is not adequate, as is known for retroviral vector integration (Wu, 2006), our calculations can be extended to address non-uniform sequence motif distributions.
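The window-shifting count described above can be sketched as follows. This is a minimal Python sketch, not the R program used for the reported simulations; the function names count_cis and simulate_mean_cis, the fixed seed, and the choice to draw integer kb positions uniformly from 0 to g-1 are assumptions made for illustration only.

```python
import random
from math import comb


def count_cis(positions, n, d_n):
    """Count CIS of order n: n-tuples of IS whose span is at most d_n (kb)."""
    pos = sorted(positions)
    total = 0
    hi = 0
    for j, start in enumerate(pos):
        # Move hi to the first index lying beyond W(j) = [IS(j), IS(j) + d_n].
        if hi < j + 1:
            hi = j + 1
        while hi < len(pos) and pos[hi] <= start + d_n:
            hi += 1
        m = hi - j - 1  # IS other than IS(j) that fall inside W(j)
        # Every choice of n-1 of these forms a CIS with IS(j) as first element.
        total += comb(m, n - 1)
    return total


def simulate_mean_cis(n_is, g, n, d_n, runs, seed=0):
    """Mean number of CIS of order n over `runs` random uniform allocations."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    total = 0
    for _ in range(runs):
        positions = [rng.randrange(g) for _ in range(n_is)]
        total += count_cis(positions, n, d_n)
    return total / runs
```

For example, count_cis([0, 10, 25, 100], 2, 30) counts the three pairs (0, 10), (0, 25) and (10, 25), while the same positions contain a single CIS of order 3 within a 30kb window.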

RESULTS AND DISCUSSION

Part 1: Random uniform allocation of IS

For the purpose of this discussion, the unit of observation (location and distance) is kilobasepair (kb). We assume that a number n_IS of IS is randomly allocated (with a uniform distribution) to the locations of a genome consisting of g kb. A CIS of order n is an n-tuple of IS such that the maximum distance between the lowest and highest position is no greater than a fixed bound.

Further terminology

d_n, defining the "size" or distance of a CIS of order n, i.e. the maximum permissible distance between any two members of a CIS of order n.

P_n, probability that a given (sub)set of n IS that are randomly allocated form a CIS of order n.

P(m,d), probability that a given subset of m randomly allocated IS has a span (= maximum distance between any two elements) of exactly d.

E_n, expected value of the number of CIS of order n.

We start with the elementary observation that E_n equals P_n times the number of subsets of IS consisting of n elements:

$$E_n = \binom{n_{IS}}{n} P_n \qquad (1)$$

Clearly,

$$P_n = \sum_{d=0}^{d_n} P(n,d) \qquad (2)$$

It remains to determine P(n,d). First note that P(1,d) = 0 for d > 0. Furthermore, for all m ≥ 1:

$$P(m,0) = \frac{1}{g^{m-1}} \qquad (3)$$

A recursive formula for P(m,d), d > 0, can be derived by breaking down the potential CIS of order m into subsets of m-1 elements having a span of d' ≤ d, to which an m-th IS is added such that the maximum span is exactly d:

$$P(m,d) = \frac{1}{g}\left\{\sum_{d'=0}^{d-1} 2\,P(m-1,d') + (d+1)\,P(m-1,d)\right\} + r \qquad (4)$$

where r is a negligible correction term that arises because the uncorrected recursion formula is strictly valid only for subsets of IS that have a distance > d from the telomeres. By mounting the recursive ladder (m = 1,...,n), these formulae successively yield P(n,d), P_n, and E_n. In particular, one easily obtains (d > 0):

$$P(2,d) \approx \frac{2}{g}, \qquad P(3,d) \approx \frac{6d}{g^2}, \qquad P(4,d) \approx \frac{12d^2+2}{g^3}$$

Plugging this into formulae (2) and (1) yields for the expected value E_n:

$$E_2 \approx \binom{n_{IS}}{2}\frac{1+2d_2}{g}, \qquad E_3 \approx \binom{n_{IS}}{3}\frac{1+3d_3(d_3+1)}{g^2}, \qquad E_4 \approx \binom{n_{IS}}{4}\frac{1+2d_4\{1+(d_4+1)(2d_4+1)\}}{g^3}$$

As shown in Table 1, our mathematical approximation corresponds extremely well to the mean values found in 50000 simulation runs.
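As a hedged illustration of how formulae (1), (2) and (4) can be evaluated, the following Python sketch mounts the recursive ladder numerically with the correction term r set to zero, starting from P(1,0) = 1 and P(1,d) = 0 for d > 0. The function names are hypothetical, and the closed forms in expected_cis_closed are simply what this recursion gives for n = 2, 3, 4 when worked through by hand under the same simplification.

```python
from math import comb


def span_probabilities(n, d_max, g):
    """Return [P(n, 0), ..., P(n, d_max)] from recursion (4), neglecting r."""
    prev = [1.0] + [0.0] * d_max  # P(1, 0) = 1, P(1, d) = 0 for d > 0
    for _m in range(2, n + 1):
        cur = []
        running = 0.0  # holds the sum of 2 * P(m-1, d') over d' < d
        for d in range(d_max + 1):
            cur.append((running + (d + 1) * prev[d]) / g)
            running += 2.0 * prev[d]
        prev = cur
    return prev


def expected_cis(n_is, n, d_n, g):
    """E_n = C(n_IS, n) * P_n, with P_n summed over d = 0..d_n."""
    p_n = sum(span_probabilities(n, d_n, g))  # formula (2)
    return comb(n_is, n) * p_n                # formula (1)


def expected_cis_closed(n_is, n, d, g):
    """Closed forms for n = 2, 3, 4 (assumed: r neglected throughout)."""
    numerator = {
        2: 1 + 2 * d,
        3: 1 + 3 * d * (d + 1),
        4: 1 + 2 * d * (1 + (d + 1) * (2 * d + 1)),
    }[n]
    return comb(n_is, n) * numerator / g ** (n - 1)
```

With g = 3.12 x 10^6 kb, d_2 = 30 and 140 IS, expected_cis returns approximately 0.190, in line with the 'Formula' mean value in the first row of Table 2.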

Statistical inferences, such as the calculation of p-values, can be based on the observation that, under the null hypothesis (H_0) of random uniform allocation of the IS, the number of CIS of order n is (approximately) Poisson distributed with parameter λ = E_n. Thus, if the random variable X denotes the number of CIS of order n, and X = k is observed in a trial, then the p-value P(X ≥ k) of this observation calculated under H_0, i.e. from the Poisson distribution Po(E_n), is given by

$$P(X \ge k \mid H_0) = 1 - \sum_{j=0}^{k-1} \frac{\lambda^j}{j!}\, e^{-\lambda} = P(\chi^2 \le 2E_n),$$

where the random variable χ² has a chi-square distribution with 2k degrees of freedom (Hartung, 1987; Dudewicz and Mishra, 1988).

The Poisson approximation to the true random distribution of CIS is exceedingly close. In fact, if the number of simulation runs is sufficiently high, the simulated distribution is virtually indistinguishable from Po(E_n). In particular, both the expected values and the p-values derived from Po(E_n) are nearly identical to those obtained in computer simulations. The latter point is apparent from Table 2, where, for a final proof of principle of our mathematical calculations, results of the analysis of our integration data set retrieved from two clinical SCID-X1 therapy trials (unpublished data) are given. The p-value can be calculated by means of either of the following commands ('R' code): 1-ppois(lambda = E_n, q = k-1) or pchisq(df = 2*k, q = 2*E_n). Using the data of Table 2 (first line), these become 1-ppois(lambda = 0.19, q = 2) or pchisq(df = 6, q = 0.38). In both instances, the result is 0.00099. Alternatively, the table of the chi-square distribution with 6 degrees of freedom can be used to look up the probability P(χ² ≤ 0.38). One should note that, for low E_n, the p-value of a single observed CIS is virtually identical to E_n. This implies that, for n>5, no p-values need to be calculated (and hence no formulae are required for E_n, n>5), because even with an extremely liberal definition of the CIS (d_5 = 500) and a fairly high number of IS (n_IS = 1000) a single CIS of order 5 will be statistically significant (p = 0.027).
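For readers without R at hand, the Poisson tail probability used above can be sketched in plain Python. This is an illustrative stand-in for the ppois command, not the code used for the reported tables; the function name poisson_p_value and the stopping rule are assumptions. The upper tail is summed directly so that very small p-values are not lost to rounding when subtracting from 1.

```python
from math import exp


def poisson_p_value(k, expected):
    """P(X >= k) for X ~ Poisson(expected); the analogue of 1 - ppois(k - 1, expected)."""
    if k <= 0:
        return 1.0
    term = exp(-expected)  # P(X = 0)
    for j in range(1, k + 1):
        term *= expected / j  # after the loop, term equals P(X = k)
    total = 0.0
    j = k
    while True:
        total += term
        j += 1
        term *= expected / j  # P(X = j) from P(X = j - 1)
        if term <= total * 1e-17:  # remaining terms no longer change the sum
            break
    return total
```

With the first line of Table 2 (k = 3, E_n = 0.19), poisson_p_value(3, 0.19) gives approximately 0.00099, matching the value quoted for the R commands.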

Part 2: Non-uniform allocation of IS

Defining non-randomness in the clustering of genomic events often requires additional precautions, as sequence structures of interest may already have known specific distribution biases. In the case of our clinical example (unpublished data), it is known that retroviral vectors based on the murine leukaemia virus (MLV) tend to integrate into gene coding regions, preferentially near the transcriptional start site (TSS) (Wu, 2003; Mitchell, 2004; Laufs, 2003). It has also been proposed that additional, mostly unknown factors may influence the accessibility of vectors to certain genomic DNA regions (Bushman, 2003). Thus, the null hypothesis of random uniform allocation of MLV IS distribution may not be adequate according to the current 'state of the art', as has recently been argued (Wu, 2006). In line with this study, we partitioned the genome into 2 areas that differ in the likelihood of being targeted by vectors.

Further terminology

n_TSS, number of TSS

T5, an interval of +/-5kb around a TSS

GT5, union of all T5

n_IS,GT5, n_IS,comp, number of IS occurring in GT5 and in the complement of GT5, respectively

n_CIS,GT5, n_CIS,Mix, n_CIS,comp, number of CIS occurring in GT5, both in GT5 and in the complement of GT5, and in the complement of GT5 only, respectively.

Clearly, the expected value E_n of the number of CIS of order n is given by the following sum:

$$E_n = E(n_{CIS,GT5}) + E(n_{CIS,Mix}) + E(n_{CIS,comp}) \qquad (5)$$

In the following it will be shown how to calculate the terms on the right side of (5).

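We start with the expected value of n_CIS,GT5, for which we assume that vector integration into any T5 occurs with the same probability. Then

$$E(n_{CIS,GT5}) = n_{TSS}\, E(X) \qquad (6)$$

where X is the number of CIS (among those occurring in GT5) that occur in a fixed T5. Observing that i IS in a fixed T5 yield $\binom{i}{n}$ CIS of order n in this T5, one easily obtains the expected value of X

$$E(X) = P(X=n) + \binom{n+1}{n} P(X=n+1) + \binom{n+2}{n} P(X=n+2) + \dots \qquad (7)$$

Here P(X = i) denotes the probability that exactly i IS fall into the fixed T5. Since this number is binomially distributed as B(n_IS,GT5, 1/n_TSS),

$$P(X=i) = \binom{n_{IS,GT5}}{i}\left(\frac{1}{n_{TSS}}\right)^{i}\left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{IS,GT5}-i} \qquad (8)$$

Merging equations (6)-(8) yields the desired formula for E(n_CIS,GT5):

$$E(n_{CIS,GT5}) = n_{TSS} \sum_{i=n}^{n_{IS,GT5}} \binom{i}{n}\binom{n_{IS,GT5}}{i}\left(\frac{1}{n_{TSS}}\right)^{i}\left(\frac{n_{TSS}-1}{n_{TSS}}\right)^{n_{IS,GT5}-i} \qquad (9)$$

If n_IS,GT5 is small compared to n_TSS (undoubtedly, this is mostly the case), terms of higher order can be neglected so that, because (n_TSS - 1)/n_TSS ≈ 1, formula (9) simplifies to

$$E(n_{CIS,GT5}) \approx n_{TSS}\binom{n_{IS,GT5}}{n}\left(\frac{1}{n_{TSS}}\right)^{n} = \binom{n_{IS,GT5}}{n}\left(\frac{1}{n_{TSS}}\right)^{n-1} \qquad (10)$$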

Notice that formulas (6)-(10) do not depend on the spatial distribution of the IS within the T5. (It is unnecessary to account for the closeness of IS within T5 because any pair - or triple, quadruple etc., for that matter - of IS within a T5 yields a CIS.)
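The relationship between formulas (9) and (10) can be checked numerically. The sketch below reads (9) as n_TSS times the sum over i ≥ n of C(i,n) P(X = i), with the number of IS in a fixed T5 distributed as B(n_IS,GT5, 1/n_TSS), and reads (10) as C(n_IS,GT5, n) / n_TSS^(n-1). The function names and the example values (76 IS in GT5, 20000 TSS) are hypothetical and chosen only for illustration.

```python
from math import comb


def expected_cis_gt5_sum(n_is_gt5, n_tss, n):
    """Formula (9): n_TSS * sum over i >= n of C(i, n) * P(X = i)."""
    p = 1.0 / n_tss  # probability that a given IS in GT5 hits the fixed T5
    total = 0.0
    for i in range(n, n_is_gt5 + 1):
        prob_i = comb(n_is_gt5, i) * p ** i * (1.0 - p) ** (n_is_gt5 - i)
        total += comb(i, n) * prob_i
    return n_tss * total


def expected_cis_gt5_simplified(n_is_gt5, n_tss, n):
    """Formula (10): C(n_IS,GT5, n) / n_TSS^(n-1)."""
    return comb(n_is_gt5, n) / n_tss ** (n - 1)


if __name__ == "__main__":
    # Hypothetical inputs: 76 IS inside GT5 and 20000 TSS, CIS of order 2.
    print(expected_cis_gt5_sum(76, 20000, 2))
    print(expected_cis_gt5_simplified(76, 20000, 2))
```

For these assumed inputs the two functions agree to well within floating-point tolerance, which suggests that little is lost by using the simplified form when n_IS,GT5 is small compared to n_TSS.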

Clearly, the expected value E(n_CIS,Mix) of n_CIS,Mix is not independent of the distance between the IS and the TSS. Thus, inevitably, assumptions regarding the spatial distribution of the IS will influence its value. In the sequel, a formula for E(n_CIS,Mix) shall be derived for the case n=2. As before, CIS of order 2 are defined by a maximum distance d_2 of 30kb between the IS.

If the TSS are indistinguishable with respect to the probability distribution of the integrations, then

$$E(n_{CIS,Mix}) = n_{IS,GT5} \cdot n_{IS,comp} \cdot n_{TSS} \cdot P_{Mix} \qquad (11)$$

where P_Mix denotes the probability that an arbitrary pair of IS (with one element in GT5 and one element in the complement of GT5) forms a CIS of order 2 around a fixed TSS. We will assume that the distributions of IS within a T5 and within +/-35kb around a TSS are symmetric. Then, again using kb as unit of distance,

In formula (12) the point x=0 and y=0 corresponds to the TSS-5; f(x) designates the probability density function of vector integrations in T5; and g(y) designates the corresponding density function in [TSS-35,TSS-5]. Formula (12) shall be evaluated for two special cases:

Case 1: Vector integrations are uniformly distributed in GT5 and in the complement of GT5, respectively. I.e.,

f(x) ≡ 1/(10 n_TSS), g(y) ≡ 1/(g - 10 n_TSS).

Solving the integrals in formula (12) we have

$$P_{Mix} = \frac{400}{\ldots\; n_{TSS}\,(g - 10\,n_{TSS})} \qquad (13)$$

Case 2: As above, vector integrations in the complement of GT5 are assumed to be uniformly distributed. However, a triangular distribution is assumed for f(x). The corresponding formula is easily calculated:

By plugging this into (12) we get

$$P_{Mix} = \frac{\ldots}{\ldots\; n_{TSS}\,(g - 10\,n_{TSS})} \qquad (14)$$

It may be surprising that a triangular distribution in T5 results in a higher expected value than a uniform distribution. However, this becomes more plausible if one notes that a higher value is also obtained if the IS are concentrated in an extreme manner within the T5, viz. in a one-point distribution with total mass in the TSS. In this special case (which is particularly easy to evaluate), P_Mix = 50/(n_TSS (g - 10 n_TSS)).

If, with respect to the formation of CIS, the complement of GT5 could be regarded as a continuum, the expected value of n_CIS,comp would be given by the formulas developed in Part 1 of this contribution. In the case of retroviral (MLV) vectors, however, the complement of GT5 has rather to be viewed as a partitioned set consisting of approximately n_TSS disjoint intervals. It follows that the residual term on the right-hand side of equation (4) (Part 1) may no longer be negligible. Note, however, that the assumption of a continuum clearly tends to lead to an overestimation of the number of CIS, because the boundaries of the components reduce the number of CIS occurring in their neighborhood. It follows that the formulas derived in Part 1 form an upper bound for E(n_CIS,comp). In particular, the true p-values are less than or equal to the values calculated by means of the formulas derived in Part 1. Therefore, any positive statements regarding statistical significance remain valid. Moreover, the overestimation is probably fairly small given that the sections of GT5 located between the TSS are mostly rather wide compared to the length defining a CIS.

Indeed, the null hypothesis of non-uniform allocation for IS distribution does not substantially change the results we have obtained based on the hypothesis of a random uniform allocation for CIS formation in our clinical samples (Table 2), as is shown in Table 3.

Our mathematical formulae allow a reliable, straightforward calculation of non-randomness in CIS and other genomic event distributions under the null hypothesis of uniform and non-uniform allocation. Using formula based workspaces (available on request), expected values and p-values can be calculated with ease and superior flexibility in real-time compared to computer simulations, which were complex to model and required extensive computational resources. Our approach enables a closely problem-oriented, highly exact evaluation of non-randomness that is useful for assessing IS distribution in clinical trials and for assessing the distribution of any sequence motif of interest in a natural or artificial genome.

LEGENDS

Table 1

Mean values for random CIS formation (1000 IS) determined either with computer simulations or mathematically. Simulations were performed with 50000 runs each. g, haploid size of the human genome: 3.12 x 10^6 kb; d_n, genomic window size [kb] for CIS of nth order: d_2=30, d_3=50, and d_4=100; n_IS, number of (assumed) sampled integration sites: 1000.

Table 2

Comparative analysis of mean values and p-values obtained computationally ('Simulation') or mathematically ('Formula'). The results refer to the presence of CIS detected in 2 clinical X-SCID gene therapy studies (unpublished data). Simulations were performed with 50000 runs on the haploid size of the human genome (3.12 x 10^6 kb). P-values estimated from simulations equal the proportion per 50000 runs in which the number of CIS was at least as high as the number observed in the trials. The genomic window size chosen for CIS of 2nd order was 30kb. CIS, number of identified CIS of 2nd order in patient and control samples pre- and post-transplant; IS, number of all unique identified integration sites in patient and control samples pre- and post-transplant; MV, mean value.

Table 3

Formulae based statistical analysis of the results on CIS formation in clinical samples derived from 2 clinical X-SCID gene therapy studies. Calculations were performed on the haploid size of the human genome (3.12 x 10^6 kb) and on the basis of an IS skewing (25% of all IS) to the +/- 5kb TSS region, for which a (*) uniform or a (§) triangular IS distribution, respectively, was assumed. 75% of IS were assumed to be uniformly distributed over the remaining human genome. The genomic window size chosen for CIS of 2nd order was 30kb. CIS, number of identified CIS of 2nd order in patient and control samples pre- and post-transplant; IS, number of all unique identified integration sites in patient and control samples pre- and post-transplant; MV, mean value.

TABLES

Table 1

Table 2

CIS   IS    MV (Simulation)   MV (Formula)   p-Value (Simulation)   p-Value (Formula)
3     140   0.188             0.190          0.0009                 0.001
1     134   0.175             0.174          0.16                   0.16
4     102   0.100             0.101          0                      3.9 x 10^-6
15    304   0.899             0.900          0                      6.8 x 10^-14
102   572   3.200             3.193          0                      <10^-16

Table 3

CIS   IS    MV (*)   MV (§)   p-Value (*)     p-Value (§)
3     140   0.191    0.212    0.001           0.0014
1     134   0.175    0.195    0.161           0.177
4     102   0.101    0.124    4.0 x 10^-6     6.1 x 10^-6
15    304   0.905    1.006    7.4 x 10^-14    3.3 x 10^-13
102   572   3.212    3.568    <10^-16         <10^-16

EXAMPLE 2

We have treated ten children with X-linked severe combined immunodeficiency (SCID-X1) using gammaretrovirus-mediated gene transfer. Those with sufficient follow up have recovered substantial immunity in the absence of any serious adverse events up to 5 years after treatment. To determine the influence of vector integration on lymphoid reconstitution, we have compared retrovirus integration sites (RIS) from peripheral blood CD3+ T lymphocytes with transduced CD34+ progenitor cells. Integration occurred preferentially in gene regions either side of transcription start sites, was clustered, and correlated with the expression level in CD34+ progenitors during transduction. In contrast to CD34+ cells, RIS recovered from engrafted CD3+ T cells were significantly over-represented within or near genes encoding proteins with kinase or transferase activity, or involved in phosphorus metabolism. Though gross patterns of gene expression were unchanged in transduced cells, divergence of RIS target frequency between transduced progenitor cells and post-thymic T lymphocytes indicates that vector integration influences cell survival, engraftment or proliferation.

Introduction

Retroviral vectors have been widely used in human hematopoietic stem cell (HSC) gene therapy trials as they stably integrate into the genome and therefore provide an opportunity for sustained clinical effect. This principle has been applied successfully to treat inherited immunodeficiencies including X-linked Severe Combined Immunodeficiency (SCID-X1) 1-3, adenosine deaminase-deficient SCID (ADA-SCID) 4,5 (Gaspar H.B. et al, in press) and more recently X-linked Chronic Granulomatous Disease (X-CGD) 6. Despite highly encouraging results, evidence has accumulated both in animal and human studies for mutagenic side effects as a direct result of vector integration 7-11. It has therefore become of particular importance to understand the risks of harmful mutagenesis, and to define the patterns of retroviral insertion that may predispose to these events.

Recent studies have shown that the distribution of retroviral integration sites (RIS) within the genome is not arbitrary, and is variable in pattern depending on the nature of the virus or vector. MLV, HIV-1, and ASLV-based vectors exhibit quite distinct target site preferences 12. Gammaretroviral vectors and HIV-1 based lentiviral vectors both preferentially integrate into gene coding regions 13, although gammaretroviruses particularly favor a 5 kb window either side of the start site of transcription 14. In contrast, ASLV exhibits only a weak preference for genes. The mechanisms that dictate the differential integration site patterns have not been clearly elucidated, but may depend to variable extents on the accessibility of euchromatin to the pre-integration complex, the transcriptional activity of the locus, and binding or tethering to specific DNA sequences via host proteins at the sites of insertion 15. It is therefore quite likely that integration patterns may also be skewed by the nature and activation status of the target cell.

Although integration patterns are easily defined in homogeneous cell populations in vitro, the influence of integration when measured in complex in vivo situations is more relevant for our understanding of the risks of harmful mutagenesis. In HSC gene therapy, starting cell populations that are transduced ex vivo are heterogeneous, and the minority of progenitor cells among them that do engraft are subject to post-engraftment influences that dictate survival, homing to appropriate microenvironmental niches, and subsequent differentiation and proliferation in vivo. As a result, significant selection pressure may favor specific retroviral insertions if they change the expression of one or several cellular genes, thereby influencing the biological fate of a cell clone, over and above (or even cooperating with) any selective advantage arising from successful expression of the vector transgene. In this study we have performed high throughput analysis to examine RIS patterns in post thymic CD3+ T cells following successful treatment of SCID-X1, and have compared these with freshly transduced CD34+ cells. Significant changes in RIS distribution among engrafted compared to pretransplant cell populations demonstrate that vector insertion influences the biological characteristics of a significant percentage of transplanted cells.

Results

Successful recovery of immunity following gammaretrovirus-mediated gene therapy

Ten patients with molecularly defined SCID-X1 were treated by retrovirus-mediated gene transfer to autologous bone marrow CD34+ progenitor cells (Table 1a). Details of the gibbon ape leukaemia virus (GALV)-pseudotyped vector and transduction conditions have been published previously, and were unchanged for the duration of the study 2. Between 60 and 207 x 10^6 cells were infused in the absence of conditioning, of which 20-60% were estimated (in patients with null mutations) to be CD34/γc positive. Where possible to evaluate, all patients benefited from substantial immunological recovery, usually with normalization of T cell numbers (Fig. 1a), TCRVβ diversity, and proliferative responses in vitro (data not shown). In most patients with sufficient follow up, this has been accompanied by recovery of humoral immunity and withdrawal of immunoglobulin supplementation. The levels of functional cell surface γc on engrafted CD3+ populations were generally less than in control cells, and there was no selection for higher expressing cells over time in any patient (Fig. 1b and data not shown). Average transgene copy number in CD3+ T cells from all patients was determined by qPCR on sorted populations to be one (data not shown). No serious adverse events have been documented, and all patients are clinically well at home.

Distribution analysis of retroviral vector insertions in transduced CD34+ cells and engrafted patient CD3+ T cells.

Linear amplification mediated PCR (LAM-PCR) 16,17 and high-throughput sequencing of insertion sites were performed on DNA isolated from purified peripheral blood CD3+ T cells obtained from 5 patients (obtained >9 months after gene therapy), in comparison with DNA from transduced CD34+ cells of one patient (pre-engraftment sample, P6), and a healthy donor (transduced under identical conditions). In total, 439 unique insertion sites were isolated from post engraftment CD3+ T cell populations, of which 304 could be mapped exactly to the human genome using the UCSC BLAT alignment tools. Similarly, 134 and 140 unique mappable sites were isolated from transduced pre-engraftment CD34+ cells, and transduced normal CD34+ cells, respectively (Table 1b and Table 4). The chromosomal distribution of RIS was analyzed to determine the relationship between chromosome size, gene density, and insertion frequency. For each of the human chromosomes, the number of integrations was not related to the size of the chromosome, whereas a correlation between gene content and insertion number in CD34+ progenitor cell and post transplantation CD3+ T-cells was evident (Fig. 2a). This suggests that integration is dependent on the gene density of a chromosome, rather than its size. Given the number of genes in the human genome and assuming random integration, 25% of RIS would be expected to fall into or within a 10kb window around RefSeq genes, which account for approximately one third of the human genome. The actual frequency of insertions within these genes (44%), and including the 10kb window up- and downstream (64%) was significantly higher, though similar for all cell populations (Table 1b). This indicates that in these treated cell populations, genes are preferred targets for integration of MLV vectors over non-coding regions.

Recent studies in cell lines and primary cells have demonstrated a preferred integration of MLV-based gammaretroviral vectors near the region of transcription start sites (TSS) 12,14,22,23. Here, we observed a similar preference for transduced normal CD34+ cells, transduced pre-engraftment CD34+ progenitors, and post-engraftment CD3+ T cells (Fig. 2b and Table 1b). 29%, 25% and 23% of RIS were located within 5 kb of transcription start sites, respectively. When the entire region of the targeted RefSeq gene was examined, the frequency of integration decreased with distance from the TSS (Fig. 2c). As previously shown by Wu et al. for HeLa cells in vitro, CpG islands and a surrounding 1kb genomic region also represented preferred targets of vector integration harboring 24% and 13% of all integrants derived from the patient pre- and post-transplantation samples, respectively, and 14% in transduced healthy donor CD34+ cells.

Insertions are clustered in common integration sites

In order to define whether there was preferential insertion or selection of insertions at specific genomic loci, mapped insertions were examined for common integration sites (CIS) (Table 2). CIS have been defined as integrations into the same intergenic locus in 2 different cells or samples which are not more than 30 kb apart from each other 18. To maintain high stringency, we followed this definition, but also counted 2 integrants as a CIS when they were located intragenically. Accordingly, a total of 38 of the 578 (6.6%) exactly mappable RIS found in all cell populations were located in CIS (Table 2). Within the same cell fraction (engrafted CD3+ cells, pre-engraftment CD34+ cells P6 or control CD34+ cells), each CIS comprised 2 RIS: 30 of 304 RIS (9.9%) from CD3+ T cells, 6 of 140 RIS (4.3%) from transduced normal CD34+ cells, and 2 of 134 RIS (1.5%) from pre-engraftment CD34+ cells. These findings were characterized by performing computer simulations to compare the likelihood of random CIS occurrence based on the size of the human genome with the number of CIS detected in real samples (Table 5). By this analysis, the number of CIS in engrafted CD3+ cells was significantly greater than expected (p < 0.0001). Also, in CD3+ T cells, 16 RIS located in CISs were intergenic, 12 RIS were intragenic and 2 RIS were located in and near a RefSeq gene. Of the intragenic CISs, 12 RIS were detected in intron 1 of the affected genes and 1 RIS in intron 4. In normal CD34+ progenitor cells, 2 intergenic RIS, 2 intragenic RIS and 2 inter-/intragenic RIS were located in CISs, of which 3 RIS were located in the first intron of the genes involved. In pre-engraftment CD34+ cells from P6, only 1 CIS consisting of 2 RIS was detected in the first intron of a gene.
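The 30 kb distance criterion can be sketched as a scan over sorted insertion coordinates. This is a simplified illustration under stated assumptions: it checks only neighbouring insertions on the same chromosome and does not model the requirement that the integrants come from different cells or samples. The function name `find_second_order_cis` and all coordinates are hypothetical.

```python
from collections import defaultdict

def find_second_order_cis(sites, window=30_000):
    """Return (chrom, left, right) for neighbouring insertions on the same
    chromosome that lie no more than `window` bp apart."""
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)
    pairs = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        # Compare each insertion with its nearest downstream neighbour.
        for left, right in zip(positions, positions[1:]):
            if right - left <= window:
                pairs.append((chrom, left, right))
    return pairs

# Hypothetical coordinates, for illustration only.
example_sites = [
    ("chr1", 1_000_000), ("chr1", 1_020_000), ("chr1", 5_000_000),
    ("chr2", 1_010_000), ("chr11", 33_800_000), ("chr11", 33_829_999),
]
cis_pairs = find_second_order_cis(example_sites)
print(cis_pairs)
```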

RIS in relation to levels of gene expression.

To determine whether gene expression was globally altered in post-engraftment CD4+ T cells (which by default contain one RIS each) as a result of vector integration, expression profiles were compared to an untransduced age-matched control using an Affymetrix U133A microarray. The sensitivity of the assay is low due to the polyclonal nature of the sample, but no gross disturbance of gene expression was observed (Fig. 3a). To assess whether the expression of the 96 probesets exceeding the log2 fold change > 2 threshold could be caused by RIS, the distances from the RIS to the significantly expressed genes were calculated. For 77 of 96 probesets, a closest RIS could be determined. The smallest distance observed was 112 kb and in only 6 cases were the distances smaller than 1 Mb. Since the largest distance reported for a RIS influencing gene expression is 90 kb 24, we conclude that significant differences in expressed probesets in the patients likely represent individual variation rather than differential expression caused by the RIS (Fig. 3b).

As MLV-based vectors are known to favor highly expressed genes for integration, we hypothesized that a correlation might exist between the gene expression profile obtained in transduced CD34+ progenitors and the genes discovered in the RIS analysis on engrafted CD3+ cells. For a statistical analysis of this relationship, genes were organized into 10 'bins' according to their relative expression levels on the Affymetrix U133 Plus 2.0 microarray. The numbers of integrations in, or closest to, each gene in those bins were calculated. The integration sites observed in the peripheral CD3+ T cells of 5 patients were found at a higher frequency in genes that were more highly expressed in representative purified CD34+ progenitor cell populations during transduction (p = 8.5 x 10^-13 for PBMC CD34+ cells; Fig. 3c). Interestingly, the distribution of RIS in CD3+ T cells from a single patient (P1) also demonstrated a clear correlation (p = 5.7 x 10^-4) with gene expression patterns in a matching CD4+ T cell population from the same patient (Fig. 3d), presumably as a reflection of the shared portion of the gene expression pattern of mature T cells and their immature progenitors.
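The binning step can be sketched as follows: genes are ranked by expression, split into 10 equal-sized bins, and the genes in or closest to an integration site are counted per bin. The function names and the toy expression values are hypothetical; real input would be the normalized microarray values.

```python
def assign_expression_bins(expression, n_bins=10):
    """Map each gene to a bin 0..n_bins-1 by expression rank (0 = lowest)."""
    ranked = sorted(expression, key=expression.get)
    return {gene: (rank * n_bins) // len(ranked) for rank, gene in enumerate(ranked)}

def insertions_per_bin(expression, hit_genes, n_bins=10):
    """Count hit genes per expression bin; genes absent from the array are skipped."""
    bins = assign_expression_bins(expression, n_bins)
    counts = [0] * n_bins
    for gene in hit_genes:
        if gene in bins:
            counts[bins[gene]] += 1
    return counts

# Hypothetical data: 20 genes with increasing expression, 5 of them hit by a RIS.
expression = {f"G{i}": float(i) for i in range(20)}
hits = ["G19", "G18", "G17", "G10", "G3", "NOT_ON_ARRAY"]
counts = insertions_per_bin(expression, hits)
print(counts)
```

The resulting per-bin counts are the kind of input a trend test across ordered expression categories would take.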

Oncogenes and tumor suppressor genes

A total of 475 genes annotated with 'tumor suppressor gene' and 390 genes annotated with 'oncogene' were identified from Entrez Gene, and were searched for in the RIS dataset. Eleven oncogenes and 14 tumor suppressor genes were found (Table 4). STAT3, RUNX1, BCL2 and HIF1A are present in both oncogene and tumor suppressor gene categories. The T-ALL oncogenes LMO2, TAL1, TAN1, LCK, LMO1, HOX11, HOX11L2, LYL1, TAL2 and C-MYC were not present in the RIS datasets.

Gene ontology annotation

To determine the biological characteristics of RefSeq genes containing a RIS either within the gene or within the neighboring 10 kb window, these were analyzed according to the gene ontology terms. Characterization of 175 RefSeq genes derived from post-engraftment CD3+ T cells revealed a significant overrepresentation of genes which encode proteins with kinase or transferase function and phosphorylation activity compared to that expected by random integration over the whole genome (Table 3). In contrast, analysis of 92 RefSeq genes from transduced normal CD34+ cells revealed that genes encoding proteins with cytokine binding or enzyme activator activity and genes involved in defense responses harbored significantly more RIS (Table 3). By comparison, analysis of 86 RefSeq genes from pre-engraftment CD34+ cells showed an overrepresentation of genes encoding either proteins with SH3/SH2 adaptor protein activity, receptor signaling proteins, or proteins controlling GTPases as well as genes whose products are involved in biological processes like RAS protein- and small GTPase mediated signal transduction (Table 3).

Discussion

The RIS profile in peripheral CD3+ T cells derived from patients successfully treated by gene therapy for SCID-X1 showed interesting differences compared to transduced cells prior to transplantation. To date, most large-scale RIS analyses have been conducted in cell lines and primary cells in vitro 12,14. Although large amounts of information concerning the preferred sites of viral integration have been amassed, the data may be skewed by the use of aneuploid cells, and cell lines with atypical gene expression patterns. Likewise, it is possible that the distribution of RIS determined from in vivo samples may have been strongly influenced by expansion or deletion of cells as a result of vector-specific or host contextual influences. Overall, the distribution of RIS in patient materials is consistent with earlier descriptions of preferential integration around TSS of RefSeq genes both in vitro, and in myeloid cells and transduced T cells recovered from patients in other human clinical trials 6,25. Remarkably, the distribution of RIS in CD3+ T cells correlated quite well with that of transduced CD34+ progenitors in which integration is clearly favored for highly expressed genes. This pattern therefore persists even though recovering T cells in SCID-X1 patients have undergone extensive in vivo expansion including thymic differentiation with positive and negative selection events, and were additionally subject to continuous antigen-mediated clonal expansion. Gene expression patterns were essentially identical to those of normal CD3+ T cells; it therefore appears that retroviral insertion does not cause major global disturbances.

On a more subtle level, however, gene ontology analysis of RefSeq genes carrying insertions within a 10 kb window revealed divergence from what would be expected from a semi-random distribution, and differences between transduced CD34+ progenitor cells and post-engraftment CD3+ T cells. Similarly, the frequency of CIS was specifically increased in the latter cell population. It therefore seems likely that within a generally conserved pattern of vector insertion, there is skewing as a result of either host- or vector-specific influences. For example, RIS that result in preferential cell survival (through homing, engraftment or proliferative advantage) may be over-represented and may favor the appearance of CIS. This may arise as a result of inadvertent gene activation as has been noted in murine and human recipients of gammaretrovirus-transduced stem and progenitor cell populations 9,26. In a recent clinically successful trial of gene therapy for chronic granulomatous disease (CGD), it was demonstrated by high-throughput sequencing of RIS that insertions in PRDM16, MDS1/EVI1 and SETBP1 led to a selective growth advantage of affected cells without signs of malignancy 6. In this study we have not observed clonal outgrowth in vivo (over and above normal T cell receptor clonal diversity), although the occurrence of increased numbers of CIS in CD3+ T cells may be an indicator of similar effects. The profound growth advantage conferred by successful expression of γc to immature CD3- thymocytes and to all mature CD3+ T cells may, however, obscure other more subtle influences. Consideration should also be given to the possibility that RIS patterns may be skewed in a negative way through transgene silencing as has been noted for gammaretroviral vectors previously 27-29, though there is of course no direct evidence for this phenomenon in this study as it would preclude T cell development and compromise survival. Finally, it is possible that antigen-mediated skewing of the T cell repertoire would alter the representation of RIS, although this is unlikely as TCR diversity is highly polyclonal.

Considerable attention has been focused on the development of malignant lymphoproliferation in 3 patients treated using a similar gene therapy protocol for SCID-X1 7. In these patients, inadvertent activation of the T cell proto-oncogene LMO2 has at least in part contributed to the clonal expansion through insertional mutagenesis. We have screened a large number of randomly cloned RIS from CD3+ T cells, and used a specific tracking-PCR reaction in 3 patients, but have been unable to detect CIS or RIS at the LMO2 locus, suggesting that there may be measurable differences between the trials. However, the overall genomic integration site distribution (chromosomal distribution, proportion of targeting events in RefSeq genes, and preference for TSS) in both studies is very similar (Deichmann et al, manuscript in submission). Likewise, while there were no consistent findings between transduced CD34+ progenitor cells in either study, gene ontology analysis of targeted genes in engrafted CD3+ cells revealed an overrepresentation of the same categories, namely phosphorus metabolism, kinase and transferase activity. In contrast, the frequency of CIS in CD3+ cells and, strikingly, in pre-engraftment CD34+ cells was much lower in our study, suggesting that any differences may have arisen during the transduction process, for example due to divergent ex vivo culture conditions or vector pseudotype. It is not clear whether patients treated in our study are at the same risk of developing lymphoproliferation, as the follow-up period is relatively short and the majority of our patients have not reached the 34 month manifestation time point typical of the previously described lymphoproliferations. However, this outcome may also be influenced by effective engrafting cell dosage, the kinetics of T cell reconstitution, and other undefined host factors.

Our data show that even without obvious side effects, there are indeed vector insertion induced influences on engraftment, clonal proliferation and survival. Delineation of RIS in clinical gene therapy trials such as this provides important information on the biology of vectors in vivo, and on the way in which they interact with host genes and environment. Ultimately this has major implications for clinical efficacy, safety, and for rationalization of vector design.

Methods

Transduction of CD34+ progenitor cells

CD34+ cells were purified using magnetic bead sorting (CliniMACS) from bone marrow harvested under general anesthetic. Cells were pre-activated for 40 hours in the presence of cytokines (SCF 300 ng/ml, TPO 100 ng/ml, IL-3 20 ng/ml, and Flt3-L 300 ng/ml, R&D Systems), then transduced on three sequential occasions over the next 56 hours in gas-permeable cell culture containers. Serum-free conditions were maintained during the entire ex vivo culture period. The gammaretroviral vector (containing intact MoMLV LTR sequences) was pseudotyped with a gibbon ape leukemia virus envelope, and produced in PG13 cells.

Trial approval and patient consent

The gene therapy protocol was approved by the UK Gene Therapy Advisory Committee, the Medicines Control Agency (now the Medicines and Healthcare products Regulatory Agency), and the local institutional research ethics committee. Written informed consent was obtained from each family.

Flow cytometry for the detection of gamma chain expression

200 μL of whole blood was collected in EDTA. 100 μL was stained with 5 μL of an anti-γc antibody or isotype control (Pharmingen, Becton Dickinson, San Diego, CA). The stained cells were detected using a FACSCalibur (Becton Dickinson). Analysis was performed using CellQuest software (Becton Dickinson) to determine the level of γc expression on CD3+ gated cells.

Preparation of DNA

Peripheral blood samples were taken from patients 9 to 30 months after the reinfusion of autologous CD34+ cells transduced with a MoMLV retrovirus encoding the therapeutic gene as previously described 2. CD3+ cells were isolated by fluorescence-activated cell sorting (Epics Altra, Beckman Coulter). A pre-engraftment sample of CD34+ cells was held back from the cells of patient 6 at the time of reinfusion, and a control sample of CD34+ cells was separated from healthy donor bone marrow cells using CliniMACS (Miltenyi) and transduced with the same protocol used in the clinical trial 2. Genomic DNA was isolated from all cells using a DNeasy kit (Qiagen).

Linear amplification mediated (LAM-) PCR and sequence alignment

LAM-PCR was performed on 10-100 ng of DNA isolated from sorted peripheral blood leukocytes to characterize the unknown genomic DNA flanking the 5'LTR and the 3'LTR of the vector. For LAM-PCR, 5' biotinylated LTR specific vector primers (Roth, Karlsruhe, Germany) were used as follows: linear PCR (5'LTR: 5'>TGC TTA CCA CAG ATA TCC TG<3' and 5'>ATC CTG TTT GGC CCA TAT TC<3', 3'LTR: 5'>TCC GAT TGA CTG AGT CGC<3' and 5'>GGT ACC CGT GTA TCC AAT A<3'), 1st exponential PCR (5'LTR: 5'>GCC CTT GAT CTG AAC TTC TC<3', 3'LTR: 5'>TCT TGC AGT TGC ATC CGA CT<3') and 2nd exponential PCR (5'LTR: 5'>TTC CAT GCC TTG CAA AAT GGC<3', 3'LTR: 5'>GTG GTC TCG CTG TTC CTT<3'). Linear PCR, magnetic capture, hexa-nucleotide priming, restriction digest (enzymes used: Tsp509I, MseI or HinP1I), linker ligation and exponential PCRs have been previously described 16,17. Optionally, the 1st exponential biotinylated PCR product was magnetically captured before reamplification by the 2nd PCR step. LAM-PCR amplicons were either isolated and cloned (Elchrom Scientific, Cham, Switzerland) into the TOPO TA vector (Invitrogen, Carlsbad, CA) or PCR-purified (Qiagen), shotgun cloned and sequenced (GATC, Konstanz, Germany). Sequences were aligned to the human genome (assembly July 2003) using the UCSC BLAT genome browser (http://www.ucsc.genome.edu). Relations to annotated genome features were studied using the UCSC and Ensembl databases (http://www.ensembl.org).

Analysis of the LMO2 Transcription Start Site (TSS) region

A total of 40 ng of CD3+ cell DNA from each of 3 patients (P1: 9 months post-transplant; P2: 9 months; P3: 14 months) was analyzed for potentially dangerous integration events surrounding the TSS of LMO2. To screen the 5 kb upstream and downstream region of the TSS and possible forward and reverse orientation of integrated vector, the initial 1st 'mid-range' PCR (PeqLab, Germany) was set up 'four-fold': upstream (using LMO2 forward primer 5'>TCG TCC AAA CTG AGG ATC AC<3' and biotinylated LTR forward primer LTRA1 5'>TGC TTA CCA CAG ATA TCC TG<3' or LTR reverse primer LTRB1 5'>TTC AAA TAA GGC ACA GGG TC<3'); downstream (LMO2: 5'>CTT CCC AAT TCT GCT CAA GG<3'; LTR: LTRA1 or LTRB1). After a 30-cycle PCR (initial denaturation: 94°C, 1 min; 94°C, 30 sec; 56°C, 45 sec and 68°C, 3.5 min; 68°C, 7 min final elongation), final PCR products were captured via magnetic beads (Dynal) and reamplified by nested PCR (Taq polymerase; Qiagen). For the TSS upstream and downstream region, respectively, each of five LMO2 specific primers was applied with LTR primer LTRA2 5'>GAC CTT GAT CTG AAC TTC TC<3' for vector forward orientation and LTR primer LTRB2 5'>GTG GTC TCG CTG TTC CTT<3' for vector reverse orientation. LMO2 specific primers for nested PCR located upstream of TSS were 5'>AGC TCT CTC ACA CCA GAT G<3', 5'>TAC ATT GCT AGC TTG CAG AC<3', 5'>ATG CAG AGT GTC AGA CTA TG<3' and 5'>GCT GGC AAA GTG GAA TAG TG<3', respectively. LMO2 specific primers downstream of TSS were 5'>CAA GTC TCC ACA TTC TGA GT<3', 5'>ACA GGC CGG GCA CAT TGG CT<3', 5'>CAA AGA AGA GCA GAG CTC CA<3', 5'>GAG GAT CAC CTG AAC TCA GA<3' and 5'>ATC CCA GCA CTT TGG GAG GC<3', respectively.

Culturing and gene expression profiling of CD34+ and CD4+ cells.

G-CSF mobilized peripheral blood CD34+ cells from 3 donors were transduced using the identical conditions of the gene therapy trial 2. RNA of transduced cells was isolated using TriReagent (Sigma) following the manufacturer's protocol. The mRNA expression levels were determined using Affymetrix U133 Plus 2.0 arrays and normalised as described previously 19. CD4 cells were isolated from patient 1 and a healthy age-matched control using the 'CD4 untouched' kit (Miltenyi). Isolated cells were expanded for 1 week in culture in RPMI containing 2% human AB serum, IL-2 (100 IU/ml) and 75 μl CD3/CD28 T cell expander Dynabeads® per 10^6 cells (Dynal), before isolating the RNA with TriReagent (Sigma). The RNA was then double extracted with an RNA isolation kit (Qiagen) and applied to an Affymetrix U133A microarray. The normalized expression values were used to generate MvA plots, in which the average log2 expression level (A) and the difference in log2 expression level (M) for each probeset are plotted. Significantly differentially expressed probesets, as determined by a Sidak step-up adjusted p-value < 0.05, were indicated separately 20. Genes described by these probesets were retrieved from the Affymetrix HG-U133A annotation file (April 2006).

Here, $M = \log_2\left(\mathrm{Expression}_{\mathrm{patient}} / \mathrm{Expression}_{\mathrm{control}}\right)$ and $A = \log_2\sqrt{\mathrm{Expression}_{\mathrm{patient}} \cdot \mathrm{Expression}_{\mathrm{control}}}$.

To determine the relationship between expression levels and viral integration, the normalized microarray values were sorted on expression and divided into 10 equal sized expression level categories. The presence of the gene closest to a virus integration site as identified by LAM-PCR analysis was determined in each expression category. A Cochran-Armitage test for trend was performed to test whether higher expression level categories correspond to larger numbers of insertions 21. For each unique Gene Symbol represented on the array, the highest expression value over all probesets representing it was used for analysis.
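The trend test over ordered expression categories can be sketched with the textbook form of the Cochran-Armitage statistic and a normal approximation for the p-value. This is a minimal illustration, not the authors' software; the equally spaced scores 0..9, the function name `cochran_armitage_trend`, and the example counts are assumptions.

```python
from math import erfc, sqrt

def cochran_armitage_trend(hits, totals, scores=None):
    """Return (z, two-sided p) for a linear trend in hits/totals across ordered groups."""
    if scores is None:
        scores = list(range(len(hits)))  # assumed equally spaced scores
    n_total = sum(totals)
    p_bar = sum(hits) / n_total  # overall proportion of hit genes
    # Score-weighted deviation of observed hits from the no-trend expectation.
    t_stat = sum(s * (r - n * p_bar) for s, r, n in zip(scores, hits, totals))
    s1 = sum(n * s for s, n in zip(scores, totals))
    s2 = sum(n * s * s for s, n in zip(scores, totals))
    variance = p_bar * (1.0 - p_bar) * (s2 - s1 * s1 / n_total)
    z = t_stat / sqrt(variance)
    return z, erfc(abs(z) / sqrt(2.0))

# Hypothetical data: 10 bins of 100 genes, with more hit genes in higher bins.
totals = [100] * 10
hits = [2, 3, 4, 5, 6, 8, 9, 11, 13, 15]
z, p = cochran_armitage_trend(hits, totals)
print(f"z = {z:.2f}, p = {p:.2e}")
```

A positive z indicates that the proportion of genes carrying an insertion rises with the expression category.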

Gene ontology analysis.

Gene ontology analysis was performed on all RefSeq genes that were targeted by the vector in their gene coding region or the surrounding 10 kb genomic region, using the DAVID-EASE software (http://david.niaid.nih.gov/david/ease.htm). The database classifies each gene into defined categories of 'cellular compartment', 'biological process' and 'molecular function'. Over-represented gene categories were calculated and defined by the 'Fisher Exact' test.
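Over-representation of a gene category can be illustrated with a plain one-sided Fisher exact test, computed here as a hypergeometric upper tail. This sketch is not the DAVID-EASE implementation, whose exact scoring may differ. Apart from the 175 targeted genes quoted in the text, the example numbers (20,000 annotated genes, 600 in the category, 15 targeted genes in the category) and the function name `enrichment_p_value` are hypothetical.

```python
from math import comb

def enrichment_p_value(n_genome, n_category, n_hit, n_hit_in_category):
    """One-sided Fisher exact p-value: probability of drawing at least
    n_hit_in_category category genes among n_hit genes sampled at random."""
    upper = min(n_category, n_hit)
    total = comb(n_genome, n_hit)
    tail = sum(
        comb(n_category, k) * comb(n_genome - n_category, n_hit - k)
        for k in range(n_hit_in_category, upper + 1)
    )
    return tail / total

# Hypothetical example: 15 of 175 targeted genes fall into a 600-gene category.
p_kinase = enrichment_p_value(n_genome=20_000, n_category=600,
                              n_hit=175, n_hit_in_category=15)
print(f"p = {p_kinase:.3g}")
```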

EXAMPLE 3

Recent reports have challenged the notion that retroviruses and retroviral vectors integrate randomly into the host genome. These reports have pointed to a strong bias towards integration in and near gene coding regions and, for gammaretroviral vectors, around transcription start sites. Here, we report the results obtained from a large-scale mapping of 572 retroviral integration sites (RIS) isolated from 9 SCID-X1 patients treated by a retrovirus-based gene therapy protocol. Our data show that two thirds of insertions occurred in or very near to genes, of which more than half are highly expressed in CD34+ cells. Strikingly, one fourth of all integrations are clustered as common integration sites (CIS). The highly significant incidence of CIS in circulating T cells and the nature of their locations indicate that insertion in many gene loci has an influence on cell engraftment, survival and proliferation. Beyond the observed cases of insertional mutagenesis in 3 patients, these data help to elucidate the relationship between vector insertion and long-term in vivo selection of transduced cells in human patients with SCID-X1.

Introduction

Retroviruses have been used as efficient gene-delivery vehicles in several gene-therapy trials, as they integrate stably into the genome, allowing the genetic correction of stem cells, potentially for the entire lifespan of the affected individual 1-3. Due to the availability of the complete human genome sequence, large-scale sequence-based surveys of retroviral integration sites (RIS) have become possible and have strongly challenged the notion that retrovirus vector integration may be a semirandom event 4,5. Schroeder et al. (2002) investigated targeting of human immunodeficiency virus (HIV) and HIV-based vectors in a human lymphoid cell line (SupT1) and found that genes were favored integration targets 6. Similarly, Wu et al. (2003) examined targeting of murine leukemia virus (MLV) in human HeLa cells and found that MLV strongly favors integration in transcriptional units, with integration focusing near the start of transcription 7. This non-random distribution of integrations has been confirmed by Laufs et al. (2003) for human bone marrow repopulating cells in mouse xenografts 8.

A comparative analysis of human primary cell types and cell lines transduced with HIV-1, ASLV or MLV based vectors showed that each vector type produces a unique pattern of RIS distribution in the human genome 9. These analyses revealed a significant association between integration target sites and transcriptional profiling for HIV-1, but not for ASLV and MLV 9. Thus, the statistics of the integration process of retro- and lentiviruses and derived vectors suggest that a more specific mechanism, e.g. active tethering of the pre-integration complex to DNA motifs, DNA binding factors or other connections to the gene activation or expression status of target cells, is of influence beyond the accessibility of the euchromatin 10-12. In line with this hypothesis, a comparative analysis of retrovirus integration and gene expression status demonstrated reduced integration in genomic sites with highly active transcription 13. A large-scale mapping of RIS in gene-modified T lymphocytes from leukemic patients after allogeneic stem cell transplantation has shown that retroviral vectors integrated preferentially in genes expressed during transduction and that integrations can deregulate gene expression, albeit without obvious side effects 14.

The published large-scale in vitro integration site studies have in common that none of them followed the possible selective advantage induced by virus or vector integration for an individual transduced cell over time. Interestingly, an analysis of MLV retrovirus and simian immunodeficiency virus (SIV) lentivirus integration sites in a preclinical nonhuman primate model discovered the presence of common insertion sites (CIS) in transcriptional units 15. Recent studies on transduced CD34+ cells have further demonstrated that vector integration is indeed non-random, often clustered and potentially capable of inducing immortalization in vitro, clonal dominance in vivo or even leukemogenesis in vivo 16-18. Insertion in the human gene-modified T lymphocytes occurred preferentially at the TSS, but only a low incidence of CIS insertion was found 14.

Recurrent integration in specific gene loci strongly indicates that the insertion has provided a non-random growth or survival advantage to the affected target cell clones 17,18. Our recent observation in a clinical gene therapy trial for chronic granulomatous disease (CGD) that cell clones with integrations in MDS1/EVI1, PRDM16 or SETBP1 drove a 3- to 4-fold in vivo expansion of the gene-corrected myeloid cell pool emphasizes the importance of analyzing the influence of the integration sites present in transduced cells and their clonal progeny in current gene therapy trials aimed at curing disorders of the myeloid or lymphoid blood cell compartment 19. The occurrence of a lymphoproliferative disease in 3 of our 9 patients showed the biological relevance the integration of replication-defective retroviral vectors may have 20.

Here, we show by high-throughput integration site analysis and sequencing performed on CD34+ transduced cells and sorted peripheral blood cell samples obtained from patients of the first SCID-X1 gene therapy trial that integration of retroviral vectors takes place preferentially in gene coding regions, is skewed to the transcriptional start site of genes and is significantly correlated with the gene expression pattern of the gene-corrected cell population. Most strikingly, the significant clustering of distinct cellular integration events hitting common integration sites (CIS) in different circulating lymphocytes indicates that in vivo selection of transduced cells in the clinical setting occurs in relation to vector insertion, and may critically influence an individual cell's repopulation and proliferation capacity.

Results

Distribution analysis of retrovirus vector insertions in patients' mature blood cells

To study the characteristics of retroviral insertion in clinical γc gene correction, a high-throughput analysis of insertion sites was conducted by LAM-PCR 21-23 on the DNA of whole blood leukocytes (554 sites) and purified peripheral blood T cells, B cells (CD19+), granulocytes (CD15+) and monocytes (CD14+) (18 sites) collected 4 to 41 months after the reinfusion of autologous CD34+ cells transduced with a γc retrovirus vector. 704 unique insertion site sequences were retrieved from the 9 analyzed patients, of which 572 (81%) (Table 4) could be mapped unequivocally to the human genome using the UCSC BLAT alignment tool. Chromosomal distribution analysis demonstrated that the frequency of insertion sites detected for each of the 23 human chromosomes correlated well with gene content but not with chromosome size (Fig. 1a). Insertions were most frequent on chromosome 1, which is the largest chromosome, and least frequent on chromosomes Y and 18. At the same time, the high insertion site frequency on chromosomes 17 and 19 correlates with a higher than average number of genes on these chromosomes. Of the 572 unique RIS, 216 (38%) were located within a RefSeq gene, 157 (27%) were within 5 kb surrounding the TSS and 356 (62%) were located in the gene coding sequence or less than 10 kb away (Fig. 1b and 1c, Table 1 and Table 4). Insertion data sets of the 3 patients (P4, P5, P10) who developed a vector-associated T-ALL-like disorder 30-34 months after gene therapy were analyzed separately 20. Their integration pattern was not found to be significantly different for any of the assessable parameters as compared to the other patients (Table 1).

RIS distribution in transduced CD34+ cells

To study the influence of the differentiation process on the distribution of insertion sites, we compared the insertion site distribution of transduced pre-injection CD34+ cells (total RIS: 167, mappable RIS: 102) with the profile found in the sorted circulating cell population (total RIS: 191, mappable RIS: 141) of the same patient (P4). We did not observe any substantial difference in the frequencies of gene-associated insertions between pre- and post-transplantation cells (49% versus 41% respectively, p=0.22, defined by chi-square test), of targeting the transcription start site (+/- 5 kb TSS: 16% versus 26% respectively, p=0.05), of insertions in the proximity to RefSeq genes and their 10 kb upstream and downstream vicinity (64% versus 64% respectively, p=0.98) and of targeting CpG islands (14.7% versus 16.3% respectively, p=0.73) (Fig. 2).
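The comparison of gene-associated insertion frequencies before and after transplantation can be reproduced approximately with a Pearson chi-square test on a 2x2 table. This is a sketch under stated assumptions: the counts are back-calculated by rounding the reported percentages (49% of 102 and 41% of 141), no continuity correction is applied, and the function name `chi_square_2x2` is hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square without continuity correction for the table
    [[a, b], [c, d]]; returns (chi2, p) for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df, expressed through erfc.
    return chi2, erfc(sqrt(chi2 / 2.0))

pre_total, post_total = 102, 141
pre_in_genes = round(0.49 * pre_total)    # assumed count, back-calculated from 49%
post_in_genes = round(0.41 * post_total)  # assumed count, back-calculated from 41%

chi2, p = chi_square_2x2(pre_in_genes, pre_total - pre_in_genes,
                         post_in_genes, post_total - post_in_genes)
print(f"chi2 = {chi2:.2f}, p = {p:.2f}")
```

With these rounded counts the p-value should land close to the reported p=0.22, which suggests the reported test was of this uncorrected form, although the source does not say so explicitly.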

Vector integration is clustered in common integration sites (CIS)

For the purpose of analyzing high-throughput insertional mutagenesis models in mice, a non-random insertion clustering in the form of retrovirus integration into the same genomic locus in two or more different cells has been defined as a common integration site (CIS). A CIS has been shown to be indicative of a non-random functional association of the insertion locus with the transformation event 24-26. To distinguish random coincidence of neighboring integration from non-random CIS formation, we followed a more stringent CIS definition as recently defined by Suzuki et al. 26. We classified CIS only by distance, independently of whether vector integrants were located inter- or intragenically. 2, 3 or 4 insertions were considered CIS if they fell within a 30 kb, 50 kb or 100 kb window, respectively. CIS of 5th or higher order were defined by a 200 kb window. Computer simulations show that with 572 unique mappable RIS, the average number of randomly occurring second order CIS (formed by 2 individual integrants) was 3.2 (Table 5 and Methods). The null hypothesis that the 102 observed CIS were due to random clustering could be rejected (estimated p-value = 0). No CIS of third order (CIS formed by 3 integrants) or higher orders were obtained in 10,000 simulation runs.
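A Monte Carlo simulation of this kind can be sketched as follows. This is a simplified illustration, not the authors' simulation: it treats the genome as one linear sequence of 3.1 x 10^9 bp (an assumed size), ignores chromosome boundaries and any mappability constraints, counts only neighbouring pairs within 30 kb, and uses 1,000 runs instead of 10,000 to keep the run time short. All names are hypothetical.

```python
import random

GENOME_SIZE = 3_100_000_000  # assumed haploid genome length in bp
WINDOW = 30_000              # second-order CIS window from the text

def count_second_order_cis(positions, window=WINDOW):
    """Count neighbouring insertion pairs lying within `window` bp of each other."""
    ordered = sorted(positions)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= window)

def simulate_random_cis(n_sites, n_runs, seed=0):
    """Draw n_sites uniform random positions per run and count CIS in each run."""
    rng = random.Random(seed)
    return [
        count_second_order_cis([rng.randrange(GENOME_SIZE) for _ in range(n_sites)])
        for _ in range(n_runs)
    ]

counts = simulate_random_cis(n_sites=572, n_runs=1000)
mean_cis = sum(counts) / len(counts)

# Empirical p-value: fraction of random runs reaching the observed CIS count.
observed = 102
p_estimate = sum(c >= observed for c in counts) / len(counts)
print(f"mean random CIS = {mean_cis:.2f}, estimated p = {p_estimate}")
```

Under these assumptions the mean should fall near the value of about 3 second-order CIS reported for 572 RIS, and no random run comes close to the observed count.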

Of the 572 mappable unique insertions found in blood cells, 122 (21.0%) were part of a CIS (Table 6), which is 33-fold the value to be expected under random distribution of the RIS. Out of the 47 different loci harboring CIS, 38 (81%) were closer than 30 kb in distance to the next RefSeq gene. Among the 47 different CIS loci, 11 are known proto-oncogenes involved in human chromosomal translocations described in acute leukemia or other forms of cancer: ZNF217, VAV3, CCND2, LMO2, MDS1, BCL2L1, NOTCH2, SOCS2, RUNX1, RUNX3 and SEPT6. Out of these, nine are well known transcription factors involved in human hematopoiesis. 14 particularly relevant CIS consisted of 3 or more integrants, the majority (10/14, 71%) localized <30 kb away from genes. Here, proto-oncogene insertion was found in nearly one half (6/14, 43%) (Table 2). Of note, three CIS with 5 (LMO2), 8 (ZNF217) and 9 insertions (CCND2) accounted for 22 (4%) of all independent RIS, suggesting that they confer a strong selective advantage to the cell clones harboring these RIS.

In the CD34+ cells of P4 prior to transplantation, we could identify 4 CIS (4%) of second order out of the 102 unique RIS (Table 6), compared to an expected value of 0.03. Computer simulations reached a maximum of only 3 CIS in 10,000 runs (mean value: 0.098, median: 0, standard deviation: 0.31, p=0; see Methods). This non-random integration could indicate that these CIS are particularly accessible, but it is substantially lower than in post-transplantation samples.
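As a rough cross-check on the simulated means, the expected number of second-order CIS under uniform random integration can be approximated analytically. The sketch below assumes a single linear genome of length G ≈ 3.1 x 10^9 bp (an assumed value, not taken from the text), a window w = 30 kb with w much smaller than G, and n independent insertions; chromosome boundaries are ignored.

```latex
\[
\Pr\bigl(|X_i - X_j| \le w\bigr) \approx \frac{2w}{G},
\qquad
\mathbb{E}[\text{pairs within } w] \approx \binom{n}{2}\,\frac{2w}{G}
\]
\[
n = 572:\ \binom{572}{2}\,\frac{2 \cdot 3\times10^{4}}{3.1\times10^{9}} \approx 3.2,
\qquad
n = 102:\ \binom{102}{2}\,\frac{2 \cdot 3\times10^{4}}{3.1\times10^{9}} \approx 0.10
\]
```

Under these assumptions the approximation is close to the simulated means of 3.2 (572 RIS) and 0.098 (102 RIS) quoted in the text.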

RIS are located next to growth promoting genes

To characterize the potential biological influence of vector integration on clonal selection, we used the gene ontology (GO) database and related DAVID-EASE software to classify each gene into defined functional and biological categories. While we did not find any overrepresented gene classes in the transduced pre-transplant samples, insertion analysis of engrafted cells showed highly significant overrepresentation of genes involved in phosphorus metabolism, cell survival, kinase, transferase, receptor signalling and DNA binding (Table 3).

Further comparative analysis showed an accumulation of RIS in or near genes that are listed in the database of the cancer genome project (CGP; http://www.sanger.ac.uk/genetics/CGP/) (Table 4). 31 (9%) vector targeted genes were known oncogenes described in the CGP (total number of genes listed: 356). Together, these data underline an integration related selective advantage of RIS located in the vicinity of growth-promoting genes.

RIS and CIS loci correlate to the gene expression profile of transduced cells

To test whether gene expression is associated with the likelihood of receiving a retrovirus insertion into expressed gene loci, we analyzed the correlation between gene-related insertion and the expression pattern of CD34+ cells. RIS in engrafted cells were found significantly more frequently among the genes with the highest expression levels in CD34+ cells (n=422, Cochran-Armitage p<10^-6) (Fig. 3a). We further compared insertions in pre-transplant CD34+ cells from P4. Interestingly, the correlation of gene expression and vector integration, though significant, was less pronounced than that observed in the in vivo setting (n=83, Cochran-Armitage p=4.99x10^-4) (Fig. 3b). CIS location correlated even better with the genes highly expressed in CD34+ cells (Table 6). Of 47 CIS genes, 43 could be analyzed as they were represented on the microarrays. The average expression bin was 6.8. With the exception of FAM9C (average expression bin: 0.7), PDE4B (average expression bin: 3.3) and TSRC1 (average expression bin: 4.66), 11 of 14 genes associated with CIS of 3 or more integrants were found to be in the highest quartile of expression (average expression bin: 7.1). LMO2, PTPRC, TOMM20, PRKCBP1 and RUNX1 were among the 10% of genes with the highest expression, in bin '9'.

Discussion

To understand the biology of insertional gene transfer in clinical trials, we performed high-throughput insertion site mapping on samples derived from a clinical gene therapy trial for SCID-X1. We compared RIS distribution in circulating mature cell populations from patients who developed a lymphoproliferative adverse event and those who did not. Overall RIS distribution did not differ between the two groups. Both revealed the expected distribution features of retroviral vectors, with a strong preference for gene coding regions and symmetrical accumulation close to the transcription start site. Similar to what has been reported previously by Wu et al. for HeLa cells and by Laufs et al. for CD34+ cells, the frequency of RIS was more closely related to gene density than to overall chromosome size, most frequently targeting chromosomes 1, 17 and 19.

Compared to the distribution in pre-transplant cells, in vivo repopulation and normal function of the corrected T cell pool led to a significant skewing of the RIS distribution. 21 percent of all RIS detected in post-transplantation blood samples were found to be clustered, compared with a much lower CIS frequency in the CD34+ pre-transplantation sample (7.8%). The observed changes in RIS distribution indicate that non-random selection or other biological effects of insertions in or near CIS genes have a strong influence on the in vivo fate of gene-corrected cell clones.

Several mechanisms may account for the differences between insertion distribution profiles in pre- versus post-transplantation samples. First, the majority of cells in the pre-transplantation sample have no repopulating ability. Therefore, the insertion site distribution of this population is not completely representative of the repopulating cells from which post-transplantation cells derive. Second, post-transplantation CIS are even more frequently found near genes related to cell growth than post-transplantation RIS. Consequently, integration sites in lymphocytes and their progenitor cells are not only related to the gene expression status at the time of vector entry into the repopulating target cell, but might in addition confer a selective advantage, most likely as a result of gene activation, in gene loci which govern growth and/or survival of CD34+ cells and T cell precursors. This observation is further corroborated by the analysis of whether the catalogue of gene-associated insertions correlates with the target cells' gene expression pattern. In samples obtained after transplantation, there was an even higher correlation between the level of gene expression present in CD34+ cells, the population initially targeted by the transduction, and RIS frequency than in the analyzed pre-transplant sample. The relevance of this association and its influence on clonal selection of engrafted cells is most evident in CIS with 3 or more RIS, where nearly 80% of CIS affect genes of the highest expression quartile in the engrafted gene-corrected cells. Gene ontology analysis provides further strong evidence that the biological function of genes at the insertion site is related to the in vivo fate of cell clones. When vector-targeted genes are grouped according to their role in cellular physiology, engrafted cells show a clear preponderance of RIS located in or near growth-promoting genes, in particular genes with kinase and transferase activity. This feature was not seen in the pre-transplant samples, indicating that in vivo selection of clones having integrants in or near growth-promoting genes occurred in our patients.

In line with this observation, more than 2/3 of the detected CIS genes are related to cell signaling and growth regulation, control of cell cycle, tyrosine kinases or differentiation. The most frequent CIS-associated genes, CCND2, a cyclin found deregulated in a number of human cancer cells [27,28], ZNF217, a zinc finger transcription factor hyperexpressed in solid tumors [29], and LMO2, a T-ALL related proto-oncogene [30], are well known to influence clonal proliferation and survival if activated. Together, these areas represent 3% of all clones but only 7x10^-7 % of the genetic code. Aberrant expression of many of these CIS genes in the context of other genetic changes has been linked to human oncogenesis. However, while the presence of CIS indicates that such clones engrafted and/or grew better than others, no evidence of clonal dominance was detectable in the analyzed samples.

In a T cell gene transfer trial, RIS distribution was similar between clinical in vivo and experimental in vitro samples [14]. To test whether pre-transplant RIS distribution would have discernible characteristics related to a later lymphoproliferation event, we studied the integration sites in the CD34+ cell population cryopreserved immediately after the transduction phase for patient P4, the first patient who developed an LMO2-associated T-ALL-like disease. No LMO2 RIS and only a low number of CIS were found among the 102 sequences analyzed in CD34+ cells by LAM-PCR. In contrast, CIS were as frequent in post-transplantation T cells of P4 as in the other patients, with CCND2-related insertions being the most frequent CIS in this patient. Our findings support the concept that insertional activation of CIS genes, even when providing a subtle selective advantage to transduced precursors, will not lead to uncontrolled proliferation in the absence of other genetic changes. This latter hypothesis is compatible with our recent observation of clonal myeloid cell expansion in a clinical retroviral vector-based gene therapy trial to correct chronic granulomatous disease (CGD). We found that a non-random integration site distribution had developed by extensive expansion of progenitor cells with MDS1/EVI1, PRDM16 and SETBP1 related integration sites in two patients. Expression of these genes conferred a selective advantage on the transduced myeloid cells, leading to a 3- to 4-fold self-limiting expansion of the gene-corrected cell fraction [19].

Our data indicate that the retrovirus vector integration pattern in T cells following clinical gene transfer is non-randomly distributed, correlates well with CD34+ target cell gene expression and is characterized by highly significant clustering into multiple different CIS. These CIS preferentially map to growth-regulating genes expressed in CD34+ cells, highlighting that integration occurs preferentially in active gene loci and that maintaining their activation in later cell generations by insertion confers a clonal selection advantage. These data also show that subtle alterations of gene expression by the use of retrovirus vectors are likely to occur frequently. Vector integration at many different sites in our clinical SCID-X1 study has actively influenced the fate of corrected cell clones in vivo. Potential therapeutic advantages associated with the preferential growth of particular clones over time will be the subject of further investigation. Additional biosafety measures designed into vectors could include inactivation of the 3' LTR enhancer activity, e.g. by use of retro- or lentivirus self-inactivating (SIN) vectors and insulators. Thus, the prospects are excellent that it will be possible in the future to develop safety measures for gene therapy of severe immunodeficiencies, cancer and other diseases with limited therapeutic options which avoid or at least minimize unwanted gene activation. The excellent therapeutic success achieved in gene therapy trials can be maintained while the probability of insertional side effects is significantly decreased.

Methods

Patients' cells.

Blood samples were obtained at various time points from patients enrolled in the SCID-X1 gene therapy trial [31]. CD3 T cells, CD19 B cells and CD14 monocytes were selected from patients' PBMC by immunomagnetic columns (Miltenyi Biotec). Granulocytes (CD15) were sorted by FACS (Becton Dickinson). A CD34+ cell sample from patient 4 was separated just prior to re-infusion. Genomic DNA was isolated from all cells using commercially available DNA isolation kits (Qiagen). Informed consent was obtained from parents and the study was approved by the Cochin ethical committee.

Integration site analysis by linear amplification mediated (LAM) PCR. 1-100 ng of DNA derived from patient blood cells was used for integration site sequencing as previously described [21]. Biotinylated primers LTRIa (5'>TGC TTA CCA CAG ATA TCC TG<3') and LTRIb (5'>ATC CTG TTT GGC CCA TAT TC<3') were used for the preamplification of the vector-genome junctions. After magnetic capture, hexanucleotide priming, and a restriction digest with Tsp509I, a linker cassette was ligated at the 5' end of the genomic sequence. A first exponential amplification of the vector-genome junction with linker cassette primer LCI and vector LTR-specific primer LTRII was followed by a second exponential PCR with primers LCII and LTRIII [22,23]. LAM-PCR amplicons were purified, shotgun cloned into the TOPO TA vector (Invitrogen, Carlsbad, CA) and sequenced (GATC, Konstanz, Germany and Centre National de Sequencage, Evry, France). Alignment of the integration sequences to the human genome was carried out using the UCSC BLAT genome browser (http://genome.ucsc.edu/). The same tool and the Ensembl database (http://www.ensembl.org) were used to study the relation to annotated genome features. Unmappable sequences were either too short (<20 bp), showed no definitive hit, or showed multiple hits on the human genome.

Definition of CIS and statistics. For the determination of CIS, we measured the distance between individual integrants regardless of whether they were located inside or outside gene coding regions. 2, 3 or 4 insertions were considered a CIS if they fell within a 30 kb, 50 kb or 100 kb window from each other, respectively. Of note, three clusters of 5, 8 and 9 integrants (nearest RefSeq gene: LMO2, ZNF217 and CCND2) covered 40 kb, 170 kb and 60 kb of genomic DNA, respectively. The genomic window for CIS of fifth and higher orders was set to 200 kb.
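The window rule above can be written as a small lookup. The following is a minimal sketch, not the authors' software: the names `WINDOW_BY_ORDER`, `cis_window` and `is_cis` are hypothetical, the positions are assumed to lie on one chromosome, and the window bound is assumed to be inclusive.

```python
# Maximum genomic span (bp) allowed for a cluster of a given order,
# taken from the CIS definition in the text.
WINDOW_BY_ORDER = {2: 30_000, 3: 50_000, 4: 100_000}
HIGHER_ORDER_WINDOW = 200_000  # fifth and higher orders


def cis_window(order):
    """Return the window size in bp for a CIS of the given order."""
    if order < 2:
        raise ValueError("a CIS needs at least 2 insertions")
    return WINDOW_BY_ORDER.get(order, HIGHER_ORDER_WINDOW)


def is_cis(positions):
    """True if the insertions (bp positions on one chromosome) span no more
    than the window allowed for a cluster of order len(positions).
    Assumption: the bound is inclusive."""
    span = max(positions) - min(positions)
    return span <= cis_window(len(positions))
```

Under these assumptions, the three large clusters described in the text (5, 8 and 9 integrants covering 40 kb, 170 kb and 60 kb) all satisfy the 200 kb bound for higher-order CIS.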

Computer simulations (10,000 runs) on the haploid size of the human genome (3.12x10^9 bp) were performed to calculate the likelihood of random, coincidental CIS formation. We counted the number of CIS of second order formed by 2 integrants within a 30 kb window, the number of CIS of third order formed by 3 integrants within a 50 kb window, the number of CIS of fourth order formed by 4 integrants within a 100 kb window and the number of CIS of higher orders within a 200 kb window. Of note, CIS of different orders were analyzed independently of each other, e.g. a CIS formed by 3 integrants located within 20 kb was counted as 3 CIS for the calculation of CIS of second order and as 1 CIS for the calculation of CIS of third order (Tables 5 and 6).
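The counting convention and the simulation can be illustrated as follows. This is a hedged sketch rather than the program used for the study: it assumes a single linear genome without chromosome boundaries and uniformly distributed insertions, and the function names (`count_cis`, `simulate_cis_counts`, `expected_pairs`) are hypothetical. A CIS of order n is counted as any n-tuple of insertions whose lowest and highest positions are at most one window apart, which reproduces the "3 CIS of second order, 1 CIS of third order" example above.

```python
import bisect
import math
import random

GENOME_SIZE = 3_120_000_000  # haploid genome size in bp, as stated in the text


def count_cis(positions, order, window):
    """Count all n-tuples (n = order) of insertions spanning <= window bp."""
    pos = sorted(positions)
    total = 0
    for i, start in enumerate(pos):
        # Index one past the last insertion lying within `window` bp of `start`.
        end = bisect.bisect_right(pos, start + window)
        k = end - i - 1  # insertions after `start` that fall inside the window
        # Choose the remaining (order - 1) members among those k insertions.
        total += math.comb(k, order - 1)
    return total


def simulate_cis_counts(n_insertions, order, window, runs, seed=0):
    """Assumption: insertions are uniform on one linear genome."""
    rng = random.Random(seed)
    counts = []
    for _ in range(runs):
        pos = [rng.randrange(GENOME_SIZE) for _ in range(n_insertions)]
        counts.append(count_cis(pos, order, window))
    return counts


def expected_pairs(n_insertions, window):
    """Pairwise approximation: C(n, 2) pairs, each within `window` bp
    with probability of about 2 * window / genome size."""
    return math.comb(n_insertions, 2) * 2 * window / GENOME_SIZE


if __name__ == "__main__":
    counts = simulate_cis_counts(102, order=2, window=30_000, runs=1000)
    print("simulated mean:", sum(counts) / len(counts))
    print("pairwise approximation:", expected_pairs(102, 30_000))
```

For 102 insertions and a 30 kb window, the pairwise approximation evaluates to roughly 0.099, which is close to the simulated mean of 0.098 reported for the pre-transplantation sample.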

Transcription profile in CD34+ cells.

G-CSF mobilized peripheral blood CD34+ cells from 3 donors were cultured using the same conditions as in the original gene therapy trial [1] and served as 3 independent and individual sample sources for further RNA expression analysis. RNA was isolated using TriReagent (Sigma) following the manufacturer's protocol. The mRNA expression levels were determined using Affymetrix U133 Plus 2.0 arrays and normalized as described previously [32]. The normalized microarray values were sorted in ascending order of expression and divided into 10 equal-sized expression level categories (0-9). The presence of the gene closest to a vector integration site as identified by LAM-PCR analysis was determined in each expression category. A Cochran-Armitage test for trend was performed to test whether higher expression level categories corresponded to larger numbers of insertions. For all gene symbols on the array, the highest expression values were used to describe the gene expression.
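The binning and trend test described above could be sketched as below. This is an illustration under stated assumptions, not the analysis code of the study: it uses the textbook Cochran-Armitage statistic with a normal approximation and bin indices as scores, and the names `expression_bins` and `cochran_armitage_trend` are hypothetical.

```python
import math


def expression_bins(values, n_bins=10):
    """Rank values in ascending order and assign equal-sized bins 0..n_bins-1."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n
    return bins


def cochran_armitage_trend(hits, totals, scores=None):
    """Return (z, two-sided p) for a trend in hits/totals across ordered bins.

    hits[i]   = genes in bin i that carry an insertion
    totals[i] = all genes in bin i
    Assumption: normal approximation, scores default to the bin index.
    """
    if scores is None:
        scores = list(range(len(hits)))
    n_all = sum(totals)
    p_bar = sum(hits) / n_all  # overall fraction of genes with an insertion
    t = sum(s * (r - n * p_bar) for s, r, n in zip(scores, hits, totals))
    s1 = sum(n * s for s, n in zip(scores, totals))
    s2 = sum(n * s * s for s, n in zip(scores, totals))
    var = p_bar * (1 - p_bar) * (s2 - s1 * s1 / n_all)
    if var <= 0:
        return 0.0, 1.0  # no variation, so no trend can be tested
    z = t / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p
```

A positive z indicates that the proportion of genes with insertions rises with the expression bin; a one-sided p-value, if preferred, would be half of the returned value when z is positive.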

Gene ontology analysis. To classify vector-targeted genes according to gene ontology (GO) terms, we analysed RefSeq genes which were hit by the vector or which had a vector integration within 10 kb of the gene. Gene ontology analysis was performed using the publicly available EASE software from NIH-DAVID (http://david.niaid.nih.gov/david/ease.htm). The database sorts the genes into categories according to GO terms regarding their 'molecular function', 'biological process' and 'cellular compartment'. Over-represented gene categories were determined by Fisher's exact test.
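As an illustration of the over-representation test, the sketch below computes a plain one-sided Fisher exact p-value from the hypergeometric distribution. It is an assumption-laden example and does not claim to reproduce the EASE software's internal calculation; the function name `fisher_overrepresentation_p` and its parameters are hypothetical.

```python
import math


def fisher_overrepresentation_p(list_hits, list_total, pop_hits, pop_total):
    """One-sided Fisher exact p-value for over-representation.

    list_hits  = vector-targeted genes that belong to the GO category
    list_total = all vector-targeted genes analyzed
    pop_hits   = genes in the background population in the category
    pop_total  = all genes in the background population
    Returns P(X >= list_hits) for a hypergeometric X.
    """
    upper = min(pop_hits, list_total)
    numerator = sum(
        math.comb(pop_hits, k) * math.comb(pop_total - pop_hits, list_total - k)
        for k in range(list_hits, upper + 1)
    )
    return numerator / math.comb(pop_total, list_total)
```

For example, if all 3 genes in a list belong to a category that holds 3 of 6 background genes, the function returns 1/20 = 0.05.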

REFERENCES

Adams, M.D., Celniker, S. E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. (2000). The genome sequence of Drosophila melanogaster. Science 287, 2185-2195.

Aiuti, A., Slavin, S., Aker, M., Ficara, F., Deola, S., Mortellaro, A., Morecki, S., Andolfi, G., Tabucchi, A., Carlucci, F., et al. (2002). Correction of ADA-SCID by stem cell gene therapy combined with nonmyeloablative conditioning. Science 296, 2410-2413.

Bakker, E.G., Toomajian, C., Kreitman, M., and Bergelson, J. (2006). A genome-wide survey of R gene polymorphisms in Arabidopsis. Plant Cell 18, 1803-1818.

Bushman, F. D. (2003). Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell 115, 135-138.

Calmels, B., Ferguson, C., Laukkanen, M.O., Adler, R., Faulhaber, M., Kim, H.J., Sellers, S., Hematti, P., Schmidt, M., von Kalle, C., et al. (2005). Recurrent retroviral vector integration at the MDS1-EVI1 locus in non-human primate hematopoietic cells. Blood 106, 2530-2533.

Cavazzana-Calvo, M., Hacein-Bey, S., de Saint Basile, G., Gross, F., Yvon, E., Nusbaum, P., Selz, F., Hue, C., Certain, S., Casanova, J.L., et al. (2000). Gene therapy of human severe combined immunodeficiency (SCID)-X1 disease. Science 288, 669-672.

Coffin, J.M., Hughes, S.H., and Varmus, H.E. (1997). Retroviruses. (Plainview, New York: Cold Spring Harbor Laboratory Press).

Dudewicz, E. J., and Mishra, S.N. (1988). Modern Mathematical Statistics. (Wiley, New York).

Du, Y., Jenkins, N.A., and Copeland, N.G. (2005). Insertional mutagenesis identifies genes that promote the immortalization of primary bone marrow progenitor cells. Blood 106, 3932-3939.

Hematti, P., Hong, B.K., Ferguson, C., Adler, R., Hanawa, H., Sellers, S., Holt, I.E., Eckfeldt, C.E., Sharma, Y., Schmidt, M., et al. (2004). Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biology 2, e423.

Garrigan, D., and Hammer, M. F. (2006). Reconstructing human origins in the genomic era. Nat Rev Genet. 7, 669-680.

Gaspar, H.B., Parsley, K.L., Howe, S., King, D., Gilmour, K.C., Sinclair, J., Brouns, G., Schmidt, M., von Kalle, C., Barington, et al. (2004). Gene therapy of X-linked severe combined immunodeficiency by use of a pseudotyped gammaretroviral vector. Lancet 364, 2181-2187.

Gerhard, D.S., Wagner, L., Feingold, E.A., Shenmen, C.M., Grouse, L.H., Schuler, G., Klein, S.L., Old, S., Rasooly, R., Good, P., Guyer, M., et al. (2004). The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 14, 2121-2127.

Hacein-Bey-Abina, S., von Kalle, C., Schmidt, M., McCormack, M.P., Wulffraat, N., Leboulch, P., Lim, A., Osborne, C.S., Pawliuk, R., Morillon, E., et al. (2003). LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 302, 415-419.

Hacein-Bey-Abina, S., von Kalle, C., Schmidt, M., Le Deist, F., Wulffraat, N., McIntyre, E., Radford, I., Villeval, J.L., Fraser, C.C., Cavazzana-Calvo, M., and Fischer, A. (2003). A serious adverse event after successful gene therapy for X-linked severe combined immunodeficiency. N. Engl. J. Med. 348, 255-256.

Hartung, J., Elpelt, B., and Klösener, K.-H. (1987). Statistik. (Oldenbourg Verlag, München-Wien).

Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. (2002). The genome sequence of the malaria mosquito Anopheles gambiae. Science 298, 129-149.

Kustikova, O., Fehse, B., Modlich, U., Yang, M., Dullmann, J., Kamino, K., von Neuhoff, N., Schlegelberger, B., Li, Z., and Baum, C. (2005). Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking. Science 308, 1171-1174.

Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.

Laufs, S., Gentner, B., Nagy, K.Z., Jauch, A., Benner, A., Naundorf, S., Kuehlcke, K., Schiedlmeier, B., Ho, A.D., Zeller, W.J., and Fruehauf, S. (2003). Retroviral vector integration occurs in preferred genomic targets in human bone marrow repopulating cells. Blood 101, 2191-2198.

Li, Z., Düllmann, J., Schiedlmeier, B., Schmidt, M., von Kalle, C., Meyer, J., Forster, M., Stocking, C., Wahlers, A., Frank, O., et al. (2002). Murine leukemia induced by retroviral gene marking. Science 296, 497.

Lund, A.H., Turner, G., Trubetskoy, A., Verhoeven, E., Wientjens, E., Hulsman, D., Russell, R., DePinho, R.A., Lenz, J., and van Lohuizen, M. (2002). Genome-wide retroviral insertional tagging of genes involved in cancer in Cdkn2a-deficient mice. Nat. Genet. 32, 160-165.

Mikkers, H., Allen, J., Knipscheer, P., Romeijn, L., Hart, A., Vink, E., and Berns, A. (2002). High-throughput retroviral tagging to identify components of specific signalling pathways in cancer. Nat. Genet. 32, 153-159.

Miranda, K.C., Huynh, T., Tay, Y., Ang, Y.S., Tam, W.L., Thomson, A.M., Lim, B., and Rigoutsos, I. (2006). A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell 126, 1203-1217.

Mitchell, R.S., Beitzel, B.F., Schroder, A.R., Shinn, P., Chen, H., Berry, C.C., Ecker, J.R., and Bushman, F.D. (2004). Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biology 2, e234.

Ota, T., Suzuki, Y., Nishikawa, T., Otsuki, T., Sugiyama, T., Irie, R., Wakamatsu, A., Hayashi, K., Sato, H., Nagai, K., et al. (2004). Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. 36, 40-45.

Moolten, F.L., and Cupples, L.A. (1992). A model for predicting the risk of cancer consequent to retroviral gene therapy. Hum. Gene Ther. 3, 479-486.

Ott, M.G., Schmidt, M., Schwarzwaelder, K., Stein, S., Siler, U., Koehl, U., Glimm, H., Kühlcke, K., Schilz, A., Kunkel, H., et al. (2006). Correction of X-linked chronic granulomatous disease by gene therapy, augmented by insertional activation of MDS1-EVI1, PRDM16 or SETBP1. Nat. Med. 12, 401-409.

Riva, A., Delorme, M.-O., Chevalier, T., Guilhot, N., Henaut, C., and Henaut, A. (2004). The difficult interpretation of transcriptome data: the case of the GATC regulatory network. Computational Biology and Chemistry 28, 109-118.

Steffen, D., and Weinberg, R.A. (1978). The integrated genome of murine leukemia virus. Cell 15, 1003-1010.

Subramanian, S., Madgula, V.M., George, R., Mishra, R.K., Pandit, M. W., Kumar, C. S., and Singh, L. (2003). Triplet repeats in human genome: distribution and their association with genes and other genomic regions. Bioinformatics 19, 549-552.

Subramanian, S., Mishra, R.K., and Singh, L. (2003). Genome-wide analysis of Bkm sequences (GATA repeats): predominant association with sex chromosomes and potential role in higher order chromatin organization and function. Bioinformatics 19, 681-685.

Suzuki, T., Shen, H., Akagi, K., Morse, H.C., Malley, J.D., Naiman, D.Q., Jenkins, N.A., and Copeland, N.G. (2002). New genes involved in cancer identified by retroviral tagging. Nat. Genet. 32, 166-174.

Wang, J., Song, L., Gonder, M.K., Azrak, S., Ray, D.A., Batzer, M.A., Tishkoff, S.A., and Liang, P. (2006). Whole genome computational comparative genomics: A fruitful approach for ascertaining Alu insertion polymorphisms. Gene 365, 11-20.

Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., Antonarakis, S. E., Attwood, J., et al. (2002). Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562.

Wu, X., Li, Y., Crise, B., and Burgess, S. M. (2003). Transcription start regions in the human genome are favored targets for MLV integration. Science 300, 1749-1751.

Wu, X., Luke, B. T., and Burgess, S.M. (2006). Redefining the common insertion site. Virology 344, 292-295.

References

Cavazzana-Calvo, M. et al. Gene therapy of human severe combined immunodeficiency (SCID)-X1 disease. Science 288, 669-672 (2000).

Gaspar, H. B. et al. Gene therapy of X-linked severe combined immunodeficiency by use of a pseudotyped gammaretroviral vector. Lancet 364, 2181-2187 (2004).

Hacein-Bey-Abina, S. et al. Sustained correction of X-linked severe combined immunodeficiency by ex vivo gene therapy. N. Engl. J. Med. 346, 1185-1193 (2002).

Aiuti, A. et al. Correction of ADA-SCID by stem cell gene therapy combined with nonmyeloablative conditioning. Science 296, 2410-2413 (2002).

Aiuti, A. et al. Immune reconstitution in ADA-SCID after PBL gene therapy and discontinuation of enzyme replacement. Nat. Med. 8, 423-425 (2002).

Ott, M. G. et al. Correction of X-linked chronic granulomatous disease by gene therapy, augmented by insertional activation of MDS1-EVI1, PRDM16 or SETBP1. Nat. Med. 12, 401-409 (2006).

Hacein-Bey-Abina, S. et al. LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 302, 415-419 (2003).

Li, Z. et al. Murine leukemia induced by retroviral gene marking. Science 296, 497 (2002).

Kustikova, O. et al. Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking. Science 308, 1171-1174 (2005).

Modlich, U. et al. Leukemias following retroviral transfer of multidrug resistance 1 (MDR1) are driven by combinatorial insertional mutagenesis. Blood 105, 4235-4246 (2005).

Montini et al. Hematopoietic stem cell gene transfer in a tumor-prone mouse model uncovers low genotoxicity of lentiviral vector integration. Nature Biotechnol. 24, 687-696 (2006).

Mitchell, R.S. et al. Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLOS Biol. 2, e234 (2004).

Schroeder, A. et al. HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110, 521-529 (2002).

Wu, X. et al. Transcription start regions in the human genome are favored targets for MLV integration. Science 300, 1749-1751 (2003).

Bushman, F. D. Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell 115, 135-138 (2003).

Schmidt, M. et al. Polyclonal long-term repopulating stem cell clones in a primate model. Blood 100, 2737-2743 (2002).

Schmidt, M. et al. Clonality analysis after retroviral-mediated gene transfer to CD34+ cells from the cord blood of ADA-deficient SCID neonates. Nat. Med. 9, 463-468 (2003).

Suzuki, T. et al. New genes involved in cancer identified by retroviral tagging. Nat. Genet. 32, 166-174 (2002).

Dik, W.A. et al. New insights on human T cell development by quantitative T cell receptor gene rearrangement studies and gene expression profiling. J. Exp. Med. 201, 1715-1723 (2005).

Ge, Y., Dudoit, S. & Speed, T. P. Resampling-based multiple testing for microarray analysis. Test 12, 1-77 (2003).

Armitage, P., Berry, G. & Matthews, J. N. S. Statistical Methods in Medical Research, 4th edition, 2001, Blackwell Science, Oxford, Malden, MA, 2002.

Hematti, P. et al. Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biol. 2, e423 (2004).

Laufs, S. et al. Retroviral vector integration occurs in preferred genomic targets in human bone marrow repopulating cells. Blood 101, 2191-2198 (2003).

Bartholomew, C. & Ihle, J.N. Retroviral insertions 90 kilobases proximal to the Evi-1 myeloid transforming gene activate transcription from the normal promoter. Mol. Cell. Biol. 11, 1820-1828 (1991).

Recchia, A. et al. Retroviral vector integration deregulates gene expression but has no consequence on the biology and function of transplanted T cells. Proc. Natl. Acad. Sci. USA 103, 1457-1462 (2006).

Calmels, B. et al. Recurrent retroviral vector integration at the MDS1/EVI1 locus in nonhuman primate long-term repopulating cells. Blood 106, 2530-2533 (2005).

Hoeben, R. C. et al. Inactivation of the Moloney murine leukemia virus long terminal repeat in murine fibroblast cell lines is associated with methylation and dependent on its chromosomal position. J. Virol. 65, 904-912 (1991).

Palmer, T. D. et al. Genetically modified skin fibroblasts persist long after transplantation but gradually inactivate introduced genes. Proc. Natl. Acad. Sci. USA 88, 1330-1334 (1991).

Xu, L. et al. Factors affecting long-term stability of Moloney murine leukemia virus-based vectors. Virology 171, 331-341 (1989).

References

1. Cavazzana-Calvo, M. et al. Gene therapy of human severe combined immunodeficiency (SCID)-X1 disease. Science 288, 669-672 (2000).

2. Aiuti, A. et al. Correction of ADA-SCID by stem cell gene therapy combined with nonmyeloablative conditioning. Science 296, 2410-2413 (2002).

3. Gaspar, H. B. et al. Gene therapy of X-linked severe combined immunodeficiency by use of a pseudotyped gammaretroviral vector. Lancet 364, 2181-2187 (2004).

4. Coffin, J.M.; Hughes, S. H.; Varmus, H.E. Retroviruses. Cold Spring Harbor Laboratory Press, Plainview, New York, 843 p. (1997).

5. Moolten, F.L. & Cupples, L. A. A model for predicting the risk of cancer consequent to retroviral gene therapy. Hum. Gene Ther. 3, 479-486 (1992).

6. Schroeder, A. et al. HIV-1 integration in the human genome favors active genes and local hotspots. Cell 110, 521-529 (2002).

7. Wu, X. et al. Transcription start regions in the human genome are favored targets for MLV integration. Science 300, 1749-1751 (2003).

8. Laufs, S. et al. Retroviral vector integration occurs in preferred genomic targets in human bone marrow repopulating cells. Blood 101, 2191-2198 (2003).

9. Mitchell, R.S. et al. Retroviral DNA integration: ASLV, HIV, and MLV show distinct target site preferences. PLoS Biol. 2, e234 (2004).

10. Mooslehner, K., Karls, U. & Harbers, K. Retroviral integration sites in transgenic Mov mice frequently map in the vicinity of transcribed DNA regions. J. Virol. 64, 3056-3058 (1990).

11. Scherdin, U., Rhodes, K. & Breindl, M. Transcriptionally active genome regions are preferred targets for retrovirus integration. J. Virol. 64, 907-912 (1990).

12. Bushman, F. Targeting survival: Integration site selection by retroviruses and LTR-retrotransposons. Cell 115, 135-138 (2003).

13. Maxfield, L. F., Fraize, C.D. & Coffin, J.M. Relationship between retroviral DNA-integration-site selection and host cell transcription. Proc. Natl. Acad. Sci. USA 102, 1436-1441 (2005).

14. Recchia, A. et al. Retroviral vector integration deregulates gene expression but has no consequence on the biology and function of transplanted T cells. Proc. Natl. Acad. Sci. USA 103, 1457-1462 (2006).

15. Hematti, P. et al. Distinct genomic integration of MLV and SIV vectors in primate hematopoietic stem and progenitor cells. PLoS Biol. 2, e423 (2004).

16. Du, Y., Jenkins, N.A. & Copeland, N. G. Insertional mutagenesis identifies genes that promote the immortalization of primary bone marrow progenitor cells. Blood 106, 3932-3939 (2005).

17. Kustikova, O. et al. Clonal dominance of hematopoietic stem cells triggered by retroviral gene marking. Science 308, 1171-1174 (2005).

18. Calmels, B. et al. Recurrent retroviral vector integration at the Mds1/Evi1 locus in nonhuman primate hematopoietic cells. Blood 106, 2530-2533 (2005).

19. Ott, M. G. et al. Correction of X-linked chronic granulomatous disease by gene therapy, augmented by insertional activation of MDS1-EVI1, PRDM16 or SETBP1. Nat. Med. 12, 401-409 (2006).

20. Hacein-Bey-Abina, S. et al. LMO2-associated clonal T cell proliferation in two patients after gene therapy for SCID-X1. Science 302, 415-419 (2003).

21. Schmidt, M. et al. Clonal evidence for the transduction of CD34+ cells with lymphomyeloid differentiation potential and self-renewal capacity in the SCID-X1 gene therapy trial. Blood 105, 2699-2706 (2005).

22. Schmidt, M. et al. Polyclonal long-term repopulating stem cell clones in a primate model. Blood 100, 2737-2743 (2002).

23. Schmidt, M. et al. Clonality analysis after retroviral-mediated gene transfer to CD34+ cells from the cord blood of ADA-deficient SCID neonates. Nat. Med. 9, 463-468 (2003).

24. Mikkers, H. et al. High-throughput retroviral tagging to identify components of specific signalling pathways in cancer. Nat. Genet. 32, 153-159 (2002).

25. Lund, A.H. et al. Genome-wide retroviral insertional tagging of genes involved in cancer in Cdkn2a-deficient mice. Nat. Genet. 32, 160-165 (2002).

26. Suzuki, T. et al. New genes involved in cancer identified by retroviral tagging. Nat. Genet. 32, 166-174 (2002).

27. von Eyben, F. E. Chromosomes, genes, and development of testicular germ cell tumors. Cancer Genet. Cytogenet. 151, 93-138 (2004).

28. Hideshima, T. et al. Advances in biology of multiple myeloma: clinical applications. Blood 104, 607-618 (2004).

29. Collins, C. et al. Comprehensive genome sequence analysis of a breast cancer amplicon. Genome Res. 11, 1034-1042 (2001).

30. Nam, C.H. & Rabbitts, T.H. The role of LMO2 in development and in T cell leukemia after chromosomal translocation or retroviral insertion. Mol. Ther. 13, 15-25 (2006).

31. Hacein-Bey-Abina, S. et al. Sustained correction of X-linked severe combined immunodeficiency by ex vivo gene therapy. N. Engl. J. Med. 346, 1185-1193 (2002).

32. Dik W.A. et al. New insights on human T cell development by quantitative T cell receptor gene rearrangement studies and gene expression profiling. J. Exp. Med. 201, 1715-1723 (2005).

33. Bickel, P.J. & Doksum, K.P. Mathematical statistics. Holden Day, San Francisco (1977).

Table 1 Summary of clinical trial patient details and genomic distribution of all mappable RIS. a, The clinical and molecular parameters of T cell recovery for ten consecutive patients, including the molecular diagnosis. All patients are well and thriving at home. *, no longer requiring prophylactic immunoglobulin replacement. b, Cumulative genomic distribution of RIS detected in peripheral blood T cells from 5 patients, pre-engraftment CD34+ cells from patient 6, and transduced CD34+ cells from a healthy normal donor. The absolute numbers of RIS as well as the percentages relative to the sum of all exactly mappable RIS are indicated. RIS, retroviral insertion sites; n, number of RIS; kb, kilobase pairs; TSS, transcription start site.

Table 2 Common integration sites (CIS) detected in all cell samples analyzed. All CIS within one cell fraction (engrafted cells, freshly transduced patient cells and healthy donor cells) were formed of 2 individual integrants. CIS formed by 3 integrants could be detected only across different fractions. The RefSeq genes closest to the individual insertions are listed. Pat., patient number; nd, normal donor; 'Source', RIS detected either in a pre- or post-transplantation sample; post, post transplantation; pre, pre transplantation; 5' of TSS, RIS upstream of transcription start site; 3' of TSS, RIS in gene; Intron, number of intron in which RIS is located; 3' of RefSeq gene, RIS downstream of RefSeq gene; UCSC locus, RIS locus; bp, base pairs.

Table 3 Gene ontology (GO) analysis. Each vector-targeted gene was classified in defined categories of molecular function and biological process. For each cell fraction (engrafted cells, freshly transduced patient cells and healthy donor cells) the most significantly over-represented gene categories are listed. 'Level' classifies the specificity and coverage of the gene ontology declaration. Level 1 is the most general category and provides the highest coverage. Level 5 offers more specific information and less coverage. 'List hits' identifies the number of analyzed RefSeq genes which belong to the corresponding category. The Fisher exact value determines the significance of the over-representation of RefSeq genes belonging to one category.
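As an illustration of the kind of over-representation test referred to in the Table 3 legend, the following minimal Python sketch computes a one-sided Fisher exact (hypergeometric tail) p-value. It is not taken from the application: the function name, the argument names and the choice of a one-sided test against a background gene set are assumptions made for this example only.

```python
from math import comb


def fisher_overrepresentation_p(list_hits, list_total,
                                background_hits, background_total):
    """One-sided Fisher exact p-value for over-representation.

    Probability of seeing at least `list_hits` category members among
    `list_total` genes drawn without replacement from a background of
    `background_total` genes, `background_hits` of which belong to the
    category. All names are hypothetical (assumption of this sketch).
    """
    max_hits = min(list_total, background_hits)
    denominator = comb(background_total, list_total)
    # Sum the hypergeometric probabilities of list_hits, list_hits+1, ...
    tail = sum(
        comb(background_hits, k)
        * comb(background_total - background_hits, list_total - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / denominator


# Hypothetical numbers, for illustration only.
p = fisher_overrepresentation_p(list_hits=12, list_total=164,
                                background_hits=400, background_total=15000)
print(f"p = {p:.3g}")
```

A category would be reported as over-represented when this value falls below the chosen significance threshold; any correction for testing many categories is outside the scope of the sketch.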

Table 4 Overall characteristics of retroviral integration sites (RIS) found in 9 patients. RIS are shown as absolute numbers (n) and as percentage (%) of the exactly mappable sequences for each category. RIS distribution of P4, P5 and P10, which developed leukaemia following gene therapy, is shown separately in comparison to the vector distribution in the other patients. The last columns summarize the RIS distribution of all patients analyzed. TSS, transcriptional start site; kb, kilo base pairs.

Table 5 Common integration sites (CIS) of third and higher order detected in patients. The nearest RefSeq gene and the distribution of integrations among the different patients are indicated for all CIS that are formed of at least 3 individual integrants. The numbers in brackets denote the number of unique integrants retrieved from the individual patient. The 6 protooncogenes (CCND2, ZNF217, LMO2, NOTCH2, RUNX3, and RUNX1) are depicted in red.

Table 6 Gene ontology (GO) classification. RefSeq genes which received an insertion hit within the gene or the surrounding 10 kb have been used for GO analysis. Of 356 affected genes identified in engrafted cells, 164 could be analyzed regarding their molecular function and 189 regarding the biological process according to GO terms. Significant results are listed (Fisher exact test, p < 0.05).

Legends

Figure 1 Functional restoration of immunity. a, Lymphocyte recovery in patients after treatment in the clinical trial. CD3+ counts were obtained for each patient at regular time points after treatment. Although variable, all patients demonstrated an increase in lymphocyte count which is stable over time. b, Surface expression of gamma chain protein. Expression of γc on CD3+ cells was determined 25 months post-treatment for patient 6, who had no cell-surface γc protein before gene therapy.

Figure 2 Genomic distribution of retroviral integration sites (RIS). a, The relationship between chromosome size, number of known genes, and the retroviral insertion frequency. O, lengths of the autosomes, which are counted twice to allow for the diploid status of hematopoietic cells and shown as a percentage of the total genome size; X and Y chromosomes were counted once only. D, gene density of each chromosome; Q, RIS detected in CD34+ cells from a healthy donor; ^, RIS derived from transduced pre-transplant CD34+ cells from patient 6; J, RIS detected in patients' engrafted cells. b, c, RIS location related to RefSeq genes. Columns display the percentage of all mappable insertions detected in different fractions. Q, RIS derived from transduced CD34+ cells of a healthy normal donor; ^, RIS derived from transduced pre-engraftment CD34+ cells from patient 6; H, RIS derived from patients' engrafted cells. b, RIS distribution 10 kb up- and downstream of transcription start sites (TSS). Negative numbers indicate the region upstream of TSS and positive numbers the gene coding region. up, upstream of TSS. c, RIS in and near gene coding regions. Negative numbers indicate the region upstream of TSS and positive numbers the gene coding region as well as the region downstream of genes. RIS locations inside genes are expressed as the percentage of the overall length of each individual vector-targeted gene, which was divided into 10 sections of equal length. -5 kb, all RIS located 5 kb upstream of TSS; +5 kb, all RIS located 5 kb downstream of RefSeq genes; up, upstream of TSS; down, downstream of RefSeq genes.
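The assignment of an intragenic RIS to one of 10 equal-length gene sections, as described for panel c, can be sketched as follows. This is an illustrative Python example only; the function name, the coordinate convention (inclusive transcript start and end on the forward genomic axis) and the strand handling are assumptions that the legend does not specify.

```python
def gene_section(ris_pos, tx_start, tx_end, strand="+", n_sections=10):
    """Return the 1-based section (1..n_sections) of an intragenic RIS.

    Section 1 is nearest the transcription start site. Returns None if the
    RIS lies outside the gene. Coordinates and strand handling are
    assumptions of this sketch, not taken from the source.
    """
    length = tx_end - tx_start
    if length <= 0 or not (tx_start <= ris_pos <= tx_end):
        return None
    # Distance from the TSS, measured along the direction of transcription.
    offset = ris_pos - tx_start if strand == "+" else tx_end - ris_pos
    # Integer arithmetic; a hit exactly at the gene end goes to the last section.
    return min(offset * n_sections // length, n_sections - 1) + 1


# Hypothetical gene spanning positions 1000-2000 on the plus strand.
print(gene_section(1250, 1000, 2000))  # RIS one quarter into the gene
```

Counting the returned section indices over all intragenic RIS, and dividing by the number of mappable RIS, would give per-section percentages comparable to the columns described in the legend.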

Figure 3 Comparative analysis of vector integration and gene expression. a, b, MvA plots for all probesets and probesets closest to RIS in patient 1 (P1) and healthy donor. a, MvA plot of RNA expression determined by Affymetrix U133A microarray of P1 and healthy donor CD3/CD28 stimulated CD4+ T-cells. Indicated in blue, all 22283 probesets on the U133A array are shown. 3173 of these probesets were significantly different in the patient versus the control as indicated by a Sidak step-up adjusted p-value < 0.05 (red circles), corresponding to 1549 upregulated and 1624 downregulated genes. 96 probesets, corresponding to 65 upregulated and 16 downregulated genes, exceeded a log2 fold change > 2. None of these were associated with RIS. b, MvA plots for 200 probesets (blue diamonds) describing 134 genes closest to RIS in P1. 48 probesets were significantly differentially expressed (red circles), corresponding to 17 upregulated and 19 downregulated genes. Most of the differences were marginal; only 5 of these probesets, describing the genes FLJ10986, SPTLC2 (upregulated) and ITGAL, PDCD4 and DPH5 (downregulated), exceeded a log2 fold change > 1.5 but < 2. c, d, Comparative analysis of gene expression in (c) CD34+ cells stimulated under transduction conditions and the integration sites retrieved from engrafted CD3+ cells in 5 patients and in (d) engrafted CD4+ T cells (P1) and the integration sites retrieved from the corresponding CD3+ population (P1). There is a significant correlation between gene expression and the numbers of integration events, though as expected this is less pronounced. All genes on the array were organized into 10 bins according to expression levels and the number of integrations was calculated for each category. The dotted line represents the number of genes in each expression level category when uniform random distribution is assumed.
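The binning described for panels c and d (genes ranked by expression, split into 10 equally sized categories, integrations counted per category) can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the analysis code used for the figure: the function name, the input structures (one expression value per gene, one gene name per integration event) and the tie handling are hypothetical.

```python
def integrations_per_expression_bin(expression, hit_genes, n_bins=10):
    """Count integration events per expression-level bin.

    expression: dict mapping gene name -> expression value (one per gene).
    hit_genes:  iterable of gene names, one entry per integration event.
    Returns a list of n_bins counts, lowest-expression bin first.
    Input layout is an assumption of this sketch.
    """
    ranked = sorted(expression, key=expression.get)  # ascending expression
    n = len(ranked)
    # Rank i (0-based) goes to bin i * n_bins // n, giving near-equal bins.
    bin_of = {gene: i * n_bins // n for i, gene in enumerate(ranked)}
    counts = [0] * n_bins
    for gene in hit_genes:
        if gene in bin_of:  # ignore hits in genes absent from the array
            counts[bin_of[gene]] += 1
    return counts


# Toy data: 20 hypothetical genes with increasing expression.
toy_expression = {f"g{i}": float(i) for i in range(20)}
toy_hits = ["g0", "g1", "g19", "g19", "not_on_array"]
counts = integrations_per_expression_bin(toy_expression, toy_hits)
print(counts)
print("expected per bin if uniform:", sum(counts) / len(counts))
```

Under a uniform random distribution each bin would receive the same expected number of integrations, which is the reference level that a dotted line in such a plot would mark.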

Figure 4 Retroviral insertion site (RIS) distribution analysis of engrafted cells. a, RIS distribution compared to chromosome size and gene content. The displayed chromosome distribution accounts for the double copy number of diploid autosomes. ■, size of chromosomes; D, number of known genes; D, number of RIS. b, c, Vector integration in and near RefSeq genes. RIS were preferentially found near the transcription start site (TSS) (b) and within gene coding regions (c). Negative numbers denote the region upstream (Up) of a gene, positive numbers indicate the gene region downstream (Down) of the TSS. The position of intragenic hits was mapped according to the percentage of overall gene length (c). kb, kilo base pairs.

Figure 5 Comparison of pre- and post-transplant retroviral integration site (RIS) distribution in Patient 4. Black columns show the percentage of RIS detected in the indicated gene region after transplantation. Grey columns represent insertion sites prior to transplantation. kb, kilo base pairs; TSS, transcription start site.

Figure 6 Comparative analysis of gene expression and vector integration. a, b, Correlation between gene expression pattern in stimulated peripheral blood CD34+ cells and gene-related insertions detected in engrafted cells (a) and in CD34+ cells prior to reinfusion (b). For each gene, the probeset with the highest expression value was used. All genes (20,600) present on the array were sorted by expression and divided into 10 percentile categories according to their expression level, so that each category contains 10% of the genes. The columns represent the average number of genes in each category based on 3 individual arrays (Methods).