CHEMICAL SCREENING - DEOMED LTD

Title:

CHEMICAL SCREENING

Document Type and Number:

WIPO Patent Application WO/2009/138767

Kind Code:

Abstract:

A method and apparatus for screening a molecule against a set of different polypeptides and corresponding drug discovery method are described. The method includes treating a plurality of different polypeptides with the molecule and identifying those polypeptides that bound with the molecule. A three dimensional combinatorial surface binding signature for at least one of the identified polypeptides is compared with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind. Methods and apparatus for generating the three dimensional combinatorial surface binding signature are also described.

Inventors:

HUMPHERY-SMITH IAN (GB)

Application Number:

PCT/GB2009/050381

Publication Date:

November 19, 2009

Filing Date:

April 17, 2009

Assignee:

DEOMED LTD (GB)
HUMPHERY-SMITH IAN (GB)

International Classes:

G06F17/00; G01N33/50; G01N33/566; G01N33/68

Other References:

HUMPHERY-SMITH I ET AL: "The search for validated biomarkers in the face of biosystems complexity" DRUG DISCOVERY WORLD, R J COMMUNICATIONS & MEDIA WORLD LTD., LONDON, GB, vol. 6, no. 2, 1 January 2005 (2005-01-01), pages 49-56, XP009118425 ISSN: 1469-4344
PETERS K P ET AL: "THE AUTOMATIC SEARCH FOR LIGAND BINDING SITES IN PROTEINS OF KNOWN THREE-DIMENSIONAL STRUCTURE USING ONLY GEOMETRIC CRITERIA" JOURNAL OF MOLECULAR BIOLOGY, LONDON, GB, vol. 256, no. 1, 1 January 1996 (1996-01-01), pages 201-213, XP000882565 ISSN: 0022-2836
ZHOU H X ET AL: "Prediction of protein interaction sites from sequence profile and residue neighbor list." PROTEINS 15 AUG 2001, vol. 44, no. 3, 15 August 2001 (2001-08-15), pages 336-343, XP002533180 ISSN: 0887-3585
CHUNG JO-LAN ET AL: "Exploiting sequence and structure homologs to identify protein-protein binding sites." PROTEINS 15 MAR 2006, vol. 62, no. 3, 15 March 2006 (2006-03-15), pages 630-640, XP002533181 ISSN: 1097-0134
SEIDL T ET AL: "SOLVENT ACCESSIBLE SURFACE REPRESENTATION IN A DATABASE SYSTEM FOR PROTEIN DOCKING" PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS FOR MOLECULAR BIOLOGY, XX, XX, 1 January 1995 (1995-01-01), pages 350-358, XP009069154

Attorney, Agent or Firm:

ALTON, Andrew (Merrion Way, Leeds, Yorkshire LS2 8PA, GB)

Claims:

1. A method for screening a molecule against a set of different polypeptides, the method comprising: treating a first plurality of different polypeptides with the molecule; identifying those polypeptides that bound with the molecule; and comparing a three dimensional combinatorial surface binding signature for at least one of the identified polypeptides with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

2. The method as claimed in claim 1, wherein the polypeptides are proteins.

3. The method as claimed in claim 2, wherein the proteins are human proteins.

4. The method as claimed in claim 1, wherein the set of different polypeptides are all the known polypeptides for an organism.

5. The method as claimed in claim 1, wherein the set of different polypeptides are all the known polypeptides.

6. The method as claimed in any preceding claim, wherein a three dimensional combinatorial surface binding signature for a plurality of the identified polypeptides are compared.

7. The method as claimed in claim 6, wherein the three dimensional combinatorial surface binding signature represents a common property of the plurality of the identified polypeptides.

8. The method as claimed in any preceding claim, wherein the three dimensional combinatorial surface binding signature is expressed as a one dimensional data string.

9. The method as claimed in claim 8, wherein the three dimensional combinatorial surface binding signature for at least one of the identified polypeptides is compared with the plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides by determining the similarity of the three dimensional combinatorial binding signatures.

10. The method as claimed in claim 1, wherein the plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides comprises a three dimensional combinatorial surface binding signature for every exposed surface amino acid residue of each of the set of different polypeptides.

11. A computer implemented method for screening a molecule against a set of different polypeptides, the method comprising: identifying a three dimensional combinatorial surface binding signature for at least one of a first plurality of different polypeptides that have been treated with the molecule and that has exhibited some binding to the molecule; and comparing the three dimensional combinatorial surface binding signature with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

12. Data processing apparatus for screening a molecule against a set of different polypeptides, the apparatus comprising a data processing device and a storage device storing computer program instructions which can configure the data processing device to: identify a three dimensional combinatorial surface binding signature for at least one of a first plurality of different polypeptides that have been treated with the molecule and that has exhibited some binding to the molecule; and to compare the three dimensional combinatorial surface binding signature with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

13. A computer implemented method for determining a three dimensional combinatorial surface binding signature for a polypeptide comprising a plurality of amino acid residues, comprising: identifying at least one of the plurality of amino acid residues that are exposed at the surface of the polypeptide; for at least one of the identified amino acid residues, determining all of the unique combinations of amino acid residues within an area centred on the amino acid residue; and storing a data representation of the determined combinations for the identified amino acid residues to create a three dimensional combinatorial binding signature.

14. A method as claimed in claim 13, wherein: all of the plurality of amino acid residues that are exposed at the surface are identified; all of the unique combinations of amino acid residues within an area centred on a current amino acid residue are determined for all of the identified surface amino acid residues; and a data representation of the determined combinations for all the identified amino acid residues is stored to create a three dimensional combinatorial binding signature for the entire polypeptide.

15. A method as claimed in claim 14, wherein the data representation comprises a one dimensional data string.

16. A method as claimed in any of claims 12 to 15, wherein the polypeptide is a protein.

17. A method for creating a screening database of three dimensional combinatorial surface binding signatures comprising repeating the method of claim 14 for every identified human protein.

18. Data processing apparatus for determining a three dimensional combinatorial surface binding signature for a polypeptide comprising a plurality of amino acid residues, the apparatus comprising a data processing device and a storage device storing computer program instructions which can configure the data processing device to: identify at least one of the plurality of amino acid residues that are exposed at the surface of the polypeptide; for at least one of the identified amino acid residues, determine all of the unique combinations of amino acid residues within an area centred on the amino acid residue; and store a data representation of the determined combinations for the identified amino acid residues to create a three dimensional combinatorial binding signature.

19. Computer program code executable by a data processing device to carry out the method of any of claims 1 to 11 or 13 to 17.

20. At least one computer readable medium bearing computer program code as claimed in claim 19.

21. A drug discovery method comprising: systematically assessing the potential for on-target and/or off-target binding of drugs and other ligands with respect to all possible drug binding sites found on the surface of molecules within an entire proteome; and ranking candidate drugs with respect to their selectivity of binding within an entire organism.

22. A screening method substantially as hereinbefore described.

23. A method for determining a three dimensional combinatorial surface binding signature for a polypeptide substantially as hereinbefore described.

Description:

Chemical Screening

The present invention relates to screening chemicals and in particular to methods and apparatus for use in assessing the potential usefulness of molecules as a treatment for an organism.

Drug discovery is very difficult and therefore expensive to conduct but essential in order to produce new treatments for conditions and diseases. There are massive numbers of potential compounds available in chemical libraries but it is practically impossible to easily and quickly identify out of all of those compounds, a much smaller subset of compounds which are worthy of further research and investigation to identify candidate drugs.

Even after a potentially suitable candidate compound is identified, it can often give rise to side effects which are hard to predict using conventional screening and/or testing procedures. Sometimes side effects can be tolerated, for example hair loss in some chemotherapies. However, sometimes side effects cannot be tolerated, for example those arising from the use of Thalidomide.

Many mechanisms of action in a living organism occur at the polypeptide or protein level and often can involve the binding of a chemical or molecule at a site on a polypeptide or protein. However, all molecules will have some degree of binding with all potential binding sites of a protein and so it is impossible to determine a priori exactly how a chemical is likely to affect an organism owing to the massive number of potential interactions which could occur.

It would therefore be beneficial to provide a screening mechanism to help identify potentially useful chemicals or compounds which also takes into account the wide variety of possible interaction mechanisms between the chemical or compound and the organism.

The present invention provides a screening method which includes an assay of a molecule, which, for example, may be a small molecular mass chemical drug or a biotherapeutic macromolecule or other molecule of interest, with polypeptides whose binding sites can be, or have been, characterised by a binding site signature, in order to identify those polypeptides exhibiting significant binding and then using the binding site signatures for those polypeptides to identify other polypeptides to which the molecule is likely to bind. The invention also provides a method and apparatus for generating the binding site signature.

According to a first aspect of the present invention, there is provided a method for screening a molecule against a set of different polypeptides. The method can comprise treating a first plurality of different polypeptides with the molecule, identifying those polypeptides that bound with the molecule, and comparing a three dimensional combinatorial surface binding signature for at least one of the identified polypeptides with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

The polypeptides can be proteins. The proteins can be human proteins.

The set of different polypeptides can be all the known polypeptides for an organism.

The set of different polypeptides can be all known polypeptides.

A three dimensional combinatorial surface binding signature for a plurality of the identified polypeptides can be compared.

The three dimensional combinatorial surface binding signature can represent a common property of the plurality of the identified polypeptides.

The three dimensional combinatorial surface binding signature can be expressed as a one dimensional data string.

The three dimensional combinatorial surface binding signature for at least one of the identified polypeptides can be compared with the plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides by determining the similarity of the three dimensional combinatorial binding signatures.

The plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides can comprise a three dimensional combinatorial surface binding signature for every exposed surface amino acid residue of each of the set of different polypeptides.

A further aspect of the invention provides a computer implemented method for screening a molecule against a set of different polypeptides. The method can comprise identifying a three dimensional combinatorial surface binding signature for at least one of a first plurality of different polypeptides that have been treated with the molecule and that has exhibited some binding to the molecule; and comparing the three dimensional combinatorial surface binding signature with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

A further aspect of the invention provides data processing apparatus for screening a molecule against a set of different polypeptides. The apparatus can comprise a data processing device and a storage device storing computer program instructions which can configure the data processing device to: identify a three dimensional combinatorial surface binding signature for at least one of a first plurality of different polypeptides that have been treated with the molecule and that has exhibited some binding to the molecule; and to compare the three dimensional combinatorial surface binding signature with a plurality of three dimensional combinatorial surface binding signatures for the set of different polypeptides to identify those of the set of different polypeptides with which the molecule is likely to bind.

A further aspect of the invention provides a computer implemented method for determining a three dimensional combinatorial surface binding signature for a polypeptide comprising a plurality of amino acid residues, comprising: identifying at least one of the plurality of amino acid residues that are exposed at the surface of the polypeptide; for at least one of the identified amino acid residues, determining all of the unique combinations of amino acid residues within an area centred on the amino acid residue; and storing a data representation of the determined combinations for the identified amino acid residues to create a three dimensional combinatorial binding signature.

All of the plurality of amino acid residues that are exposed at the surface can be identified. All of the unique combinations of amino acid residues within an area centred on a current amino acid residue can be determined for all of the identified surface amino acid residues. A data representation of the determined combinations for all the identified amino acid residues can be stored to create a three dimensional combinatorial binding signature for the entire polypeptide.

The data representation can comprise a one dimensional data string.
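By way of illustration only, the determined combinations could be flattened into a one dimensional data string as sketched below in Python. The encoding shown (residue identifier, colon, combinations joined by `|`, residues joined by `;`) and the function name `to_data_string` are assumptions made for this example; the description does not specify a string format.

```python
def to_data_string(combos_by_residue, sep="|", residue_sep=";"):
    """Serialise per-residue combinations into one string.

    combos_by_residue maps a residue identifier to a set of combination
    tokens (hypothetical format, e.g. "ABD"). Sorting makes the output
    deterministic, so equal signatures give equal strings.
    """
    parts = []
    for rid in sorted(combos_by_residue):
        tokens = sep.join(sorted(combos_by_residue[rid]))
        parts.append(f"{rid}:{tokens}")
    return residue_sep.join(parts)


# Toy input: residue 1 has one combination, residue 2 has two.
example = to_data_string({2: {"AB", "AA"}, 1: {"CD"}})
```

A deterministic ordering is chosen here so that two signatures can be compared directly as strings; other encodings would serve equally well.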

The polypeptide can be a protein.

A further aspect of the invention provides a method for creating a database for screening three dimensional combinatorial surface binding signatures comprising repeating the method for every identified human protein.

A further aspect of the invention provides a data processing apparatus for determining a three dimensional combinatorial surface binding signature for a polypeptide comprising a plurality of amino acid residues. The apparatus can comprise a data processing device and a storage device storing computer program instructions which can configure the data processing device to: identify an amino acid residue that is exposed at the surface of the polypeptide; determine all of the unique combinations of amino acid residues within an area centred on a particular surface-exposed amino acid residue; and store a data representation of the determined combinations for the identified amino acid residues.

A further aspect of the invention provides computer program code executable by a data processing device to carry out any of the method aspects of the invention. A further aspect of the invention provides at least one computer readable medium bearing such computer program code.

A further aspect of the invention provides a drug discovery method or screening method for use in drug discovery. The method can include systematically assessing the potential for on-target (i.e. any known) and/or off-target (i.e. an unknown) binding of drugs and other ligands with respect to all possible or potential drug binding sites found on the surface of molecules within an entire proteome. Candidate drugs can then be ranked with respect to their selectivity of binding within an entire organism.

Hence, the selectivity with which a compound, be it a drug, candidate drug, other therapeutic or potential therapeutic agent, binds with all binding sites of an entire organism can be determined, so that the likely usefulness of the compound as a drug, either by being able to bind, or not, to a specific target or targets, can be used to help select compounds for further drug development work.
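By way of illustration only, one simple way of ranking candidates by selectivity is sketched below in Python. The selectivity measure used (the fraction of predicted binding proteins that are intended targets) is an assumption introduced for this example and is not defined in the description; the names `selectivity` and `rank_candidates` and the toy data are likewise hypothetical.

```python
def selectivity(predicted_proteins, on_targets):
    """Fraction of predicted binding proteins that are intended targets.

    Assumed measure for illustration: 1.0 means only on-target binding
    is predicted, values near 0 mean mostly off-target binding.
    """
    predicted = set(predicted_proteins)
    if not predicted:
        return 0.0
    return len(predicted & set(on_targets)) / len(predicted)


def rank_candidates(predictions, on_targets):
    """Order candidate names from most to least selective."""
    return sorted(
        predictions,
        key=lambda name: selectivity(predictions[name], on_targets),
        reverse=True,
    )


# Toy data: each candidate maps to the proteins it is predicted to bind.
predictions = {"drugA": ["P1", "P2", "P3", "P4"], "drugB": ["P1", "P2"]}
ranking = rank_candidates(predictions, on_targets={"P1"})
```

In this toy example drugB is ranked first because half of its predicted binding is on-target, against a quarter for drugA.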

The drug discovery method can include using the screening method aspect of the invention.

Embodiments of the invention will now be described in detail, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 shows a schematic block diagram of apparatus including data processing apparatus according to the invention;

Figure 2 shows a graphical representation of protein three dimensional structure data stored in a database of the apparatus shown in Figure 1;

Figure 3 shows a graphical representation of protein three dimensional combinatorial surface binding site signature data stored in the database of the apparatus shown in Figure 1;

Figure 4 shows a flow chart illustrating a screening method according to the invention;

Figure 5 shows a process flow chart illustrating a three dimensional combinatorial surface binding site signature generation part of the method of Figure 4 in greater detail;

Figure 6 shows a schematic diagrammatic representation of surface residues of a molecule illustrating the determination of a first three dimensional combinatorial surface binding site signature;

Figure 7 shows a schematic diagrammatic representation of the surface residues shown in Figure 6 illustrating the determination of a second three dimensional combinatorial surface binding site signature;

Figure 8 shows a flow chart illustrating a parallel target assay part of the method illustrated in Figure 4;

Figure 9 shows a process flow chart illustrating a target specificity determining part of the method illustrated in Figure 4; and

Figure 10 shows a process flow chart illustrating an alternate target specificity determining part of the method illustrated in Figure 4.

Similar items in different Figures share common reference signs unless indicated otherwise.

The term "ligand" is used herein in a very general sense to mean what is binding or potentially binding to a polypeptide. In some instances, the ligand may be the entire molecule, for example if the molecule is a small molecular mass chemical. In other instances, for example if the molecule is a biotherapeutic or other macromolecule, then the ligand may be a part, or parts, of the molecule, such as a functional group or groups, an ion or ions, an atom or atoms, or any other chemical entities of the molecule. Hence, it will be understood that in the context of the present invention, ligand is to be interpreted broadly as the part, parts or whole of the molecule which can participate in binding rather than in any narrower, more specific sense. Hence, the expression ligand molecule or ligand compound will be understood to be referring to a molecule or compound which may be binding either via one or more of its parts or its whole.

With reference to Figure 1 there is shown a schematic diagram of a screening system 100 according to the invention. The screening system 100 includes a data processing device 102 in the form of a general purpose programmable computer. Computer 102 is in communication with a mass storage device 104 holding a first database 106 and a second database 108. The data stored in databases 106 and 108 will be described in greater detail below with reference to Figures 2 and 3.

Screening system 100 also includes an assay device or instrument 110 which can be used in a wet assay part of the invention in an automatic, semi-automatic or manual manner. Assay device 110 is also in communication with computer 102 to allow data derived from the assay to be transmitted to computer 102. In other embodiments, assay device 110 need not be connected to computer 102 and data can be transferred manually, e.g. via a data storage medium. However, it is preferred if data can be transferred to computer 102 over a communication link, such as a wired or wireless network. The data derived from the wet assay is stored in database 109 as described in greater detail below.

In alternative embodiments, the data in databases 106, 108 and 109 can be combined into a single database. In other embodiments, the data can be stored in separate mass storage devices depending on the implementation of the screening system.

Computer 102 is also in communication with a network 112, such as the internet, by which computer 102 can communicate with other remote computers or data storage devices.

Various aspects of the methods described below are implemented by appropriate software which can be executed to control computer system 102 to carry out various data processing operations and manipulations on data stored in, or retrieved from, mass storage device 104 and/or assay device 110. In particular, computer 102 can be used to generate a three dimensional combinatorial surface binding site signature for a protein. Computer 102 can also be used to carry out a screening method aspect of the invention by searching database 108 to identify proteins likely to have similar binding properties to those identified by a parallel wet assay.
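By way of illustration only, a search of stored signatures for those most similar to the signature of an assay hit could be sketched as follows in Python. The description does not specify a similarity measure at this point; the Jaccard index over combination tokens used here is an assumption, as are the `|`-separated signature format and the names `jaccard` and `rank_similar`.

```python
def jaccard(sig_a, sig_b, sep="|"):
    """Jaccard similarity of two signature strings split into tokens."""
    a, b = set(sig_a.split(sep)), set(sig_b.split(sep))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query, signatures):
    """Return (score, protein) pairs, most similar first.

    signatures maps a protein identifier to its signature string.
    """
    scored = ((jaccard(query, sig), name) for name, sig in signatures.items())
    return sorted(scored, reverse=True)


# Toy stand-in for stored signatures (hypothetical values).
stored = {"P1": "ABD|ACD", "P2": "ABD|BCE", "P3": "EEE"}
hits = rank_similar("ABD|ACD", stored)
```

Proteins scoring above some chosen cut-off would then be reported as likely to bind the molecule; the cut-off itself would need to be set empirically.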

In the following description of an embodiment of the invention, the discussion will focus on application of the invention to proteins. However, it will be appreciated that the invention is not limited only to proteins, but can include any other molecules made up from combinations of amino acids or combinations of small molecule chemical scaffolds. As well as proteins, the invention can be applied particularly to amino acids combined with any number of post-translational modifications, including, for example, sugars, lipids, phosphorylation products, acetylation products, succinylation products, etc.

Figure 2 shows a graphic representation 120 of the data items stored in and organised by database 106 of Figure 1. In general, database 106 stores data defining the structure of a number of different proteins. The database can store protein structure data for a plurality of different organisms (O). For example, as illustrated in Figure 2, the database stores protein structure data for N different organisms (O1 to ON), such as, for example, human beings, chimpanzees, pigs and other mammals. The different organisms can be different varieties of the same species, e.g. human beings having different genders, races, or other phenotypic properties. The different organisms may be different species of the same type, e.g. different types of mammals. The different organisms may also be different types of species, e.g. mammals, insects, invertebrates, bacteria, fungi, plants, etc. The different types of organisms stored in the database will depend on the organisms that the drug discovery process is intended to treat.

For each type of organism 122, there is provided a set of proteins 124, P1 to PT. For example, Figure 2 illustrates the total set of proteins P1 to PT for the Xth organism OX. The set of proteins is preferably as comprehensive as possible and ideally includes all proteins expressed by the organism, or as close to all as have been identified or as is practicable.

For each protein, the database stores data from which the three dimensional structure of the protein can be derived. For example, Figure 2 illustrates the set of protein structure data 126 for the Yth protein, PY. The set of protein structure data 126 includes an indication of the type of amino acid (aa) together with position data (in this example Cartesian x, y, z co-ordinates) indicating the position for each of the amino acids in a co-ordinate system for the protein structure. This information can also include the traditional residue number from the N-terminus of a protein. However, in order to establish the uniqueness of a particular amino acid residue, each residue should be accompanied by a Cartesian reference indicating its position so that the same residues at different positions can be distinguished. It is preferred if a common Cartesian reference frame is used into which all protein molecules can be located using the positional data stored in database 106, for example, the XYZ coordinates within this framework. Equally efficient is expression of diagonal reference coordinates with respect to a defined central reference point (e.g. an alpha carbon or actual centre of the most centrally positioned amino acid residue within the whole protein) with respect to the crystalline coordinates of a given protein molecule within the Protein Data Bank (PDB) or Macromolecular Structure Database (MSD).
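By way of illustration only, the protein structure data 126 could be held in memory as one record per residue, as sketched below in Python. The record layout and the names `Residue`, `residue_number` and `distance` are assumptions introduced for this example and do not form part of the described database.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Residue:
    aa: str              # amino acid type, e.g. a one-letter code
    residue_number: int  # position counted from the N-terminus
    x: float             # Cartesian co-ordinates in the common
    y: float             # reference frame (units as stored,
    z: float             # e.g. Angstroms)


def distance(a: Residue, b: Residue) -> float:
    """Euclidean distance between two residue positions."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


# Two residues of the same type are distinguished by their positions.
r1 = Residue("D", 10, 0.0, 0.0, 0.0)
r2 = Residue("D", 42, 3.0, 4.0, 0.0)
```

Keeping the co-ordinates alongside the residue type is what allows identical amino acids at different positions to be told apart.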

Using the position and amino acid data, the structure of the protein can be generated by computer system 102. In particular, those amino acid residues exposed at the surface of the protein can be determined from the protein structure data 126 as will be described in greater detail below. Each amino acid residue within a protein molecule can also be associated with data specifying the constituent atoms and electrons and their relative positions within crystalline coordinates.

With reference to Figure 3, there is shown a graphical representation 130 of the data stored in and organised by database 108. Similarly to database 106, database 108 is organised by different types of organisms 132, O1 to ON, and, for each organism, different proteins 134, P1 to PT. For each protein, database 108 stores a three dimensional combinatorial surface binding site signature 136 in the form of a one dimensional data string. A single three dimensional combinatorial surface binding site signature 136 is associated with each protein 134. However, database 108 is not necessarily fully populated with all signatures for all proteins. Ideally, signatures have previously been generated and stored in database 108 for all proteins present in the database. However, it may be necessary to generate binding signatures as required, depending on an assay being carried out and additionally or alternatively, as new proteins or organisms are added to the database.
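By way of illustration only, the organisation of the signature data by organism and protein, with signatures generated on demand where none is yet stored, could be sketched as follows in Python. The nested mapping, the placeholder signature strings and the names `signature_db` and `get_signature` are assumptions made for this example.

```python
# organism -> protein -> signature string; None marks "not yet generated".
signature_db = {"O1": {"P1": "ABD|ACD", "P2": None}}


def get_signature(db, organism, protein, generate):
    """Return the stored signature, generating and storing it if absent.

    generate is a caller-supplied function (hypothetical) that computes
    a signature string for the given organism and protein.
    """
    proteins = db.setdefault(organism, {})
    sig = proteins.get(protein)
    if sig is None:
        sig = generate(organism, protein)
        proteins[protein] = sig
    return sig


# P2 has no stored signature, so the generator is called once.
calls = []


def fake_generate(organism, protein):
    calls.append((organism, protein))
    return "XYZ"


first = get_signature(signature_db, "O1", "P2", fake_generate)
second = get_signature(signature_db, "O1", "P2", fake_generate)
```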

Figure 4 shows a flow chart illustrating an assay method 140 according to the invention at a high level. At a first step 142, protein structure data from database 106 is processed to generate a three dimensional combinatorial surface binding signature for the protein. The signature is then stored in database 108. As discussed above, the generation of signatures can have been carried out previously or can be carried out as required. As well as generating signatures from protein structures stored in database 106, computer system 102 can obtain protein structure data from external sources, via network 112, and generate signatures for those proteins to be stored in database 108. Therefore, it is not necessary that the protein structure data be stored or accessed locally. Protein structure data can be obtained from various sources, such as the Protein Data Bank, PDB, or Macromolecular Structure Database, MSD. The protein structure data can be real structure data measured by NMR or crystallography or can be predicted structure data with any associated errors derived from protein modelling techniques, such as homology matching, threading or ab initio.

In the case of predicted structures, as this invention is primarily concerned with screening populations of molecules, the most important information is the juxtaposition of surface exposed amino acid residues. This property is less open to predictive errors than total molecular structure.

Figure 5 shows a process flow chart illustrating a computer implemented method 150 for generating three dimensional combinatorial surface binding signatures. At step 152, the protein structure data 126 is loaded for the protein of interest so that a virtual 3-D model of the protein can be represented in computer 102. It is not actually necessary to create a three dimensional model of the protein structure, although displaying a visual representation of the surface of the protein structure can be helpful for data visualisation purposes. Rather, all that is required is that computer 102 can determine the relative positions of each of the amino acid residues of the protein so that the degree of surface exposure can be determined. The degree of surface exposure can be measured in absolute or relative terms and expressed in square Angstroms of surface area of a particular amino acid residue.

Then at step 154, those amino acid residues that can be considered to be exposed at the surface of the protein are identified. As will be appreciated, some of the amino acid residues will be located deep within the structure of the protein whereas others will be exposed on the outer surface of the protein. All outermost amino acid residues can be considered to be exposed. Some amino acid residues may not be outermost but a substantial portion thereof may be exposed, slightly shadowed by outermost residues. In that case, those amino acid residues that have their surface area exposed by more than approximately 1.7 square Angstroms can be considered exposed at the surface of the protein.
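By way of illustration only, the exposure test of step 154 could be sketched as a simple threshold filter in Python. The sketch assumes that an exposed surface area per residue has already been computed by some other means; the names `EXPOSURE_THRESHOLD` and `exposed_residues` and the toy values are hypothetical.

```python
EXPOSURE_THRESHOLD = 1.7  # square Angstroms, the approximate figure given above


def exposed_residues(exposure_by_residue, threshold=EXPOSURE_THRESHOLD):
    """Return identifiers of residues exposed by more than the threshold.

    exposure_by_residue maps a residue identifier to its exposed surface
    area in square Angstroms (assumed to be precomputed).
    """
    return [rid for rid, area in exposure_by_residue.items() if area > threshold]


# Toy values: a buried residue, a borderline one and a clearly exposed one.
kept = exposed_residues({"A1": 0.0, "B2": 1.7, "C3": 12.5})
```

The comparison is strict here, so a residue exposed by exactly the threshold value is treated as buried; whether the boundary is inclusive is a design choice not settled by the description.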

Further, the degree of steric hindrance of amino acid residues can be taken into account in determining whether an amino acid residue can be considered to be exposed or not. Steric hindrance contributed by Post-Translational Modifications (PTMs) can also be taken into account with respect to known or estimated molecular structure of a particular PTM bound to individual surface exposed amino acids and those in juxtaposition with such modified amino acid residues exposed at the protein surface. Such relative steric hindrance for different PTMs can be estimated with respect to known molecular masses associated with a particular surface-exposed amino acid residue as determined experimentally by a variety of approaches based on MS/MS mass spectrometry, i.e. exact mass calculations of a particular amino acid residue and its chemical adducts (PTMs).

All the non-exposed amino acid residues can be discarded and processing proceeds only for the set of exposed amino acid residues. At step 156, a first exposed amino acid residue is selected. Then at step 158, a potential interaction region or patch, representing the area over which interaction between a molecule wanting to bond with the protein can be considered to occur, is centred on the current exposed residue. Each and every surface exposed amino acid residue across the entire protein molecule is treated in turn as the central exposed residue as indicated by processing loop 163.

Figure 6 shows a schematic representation of a two dimensional projection of exposed surface amino acid residues for a part of the surface of a protein 170. In this illustrative example, only five different types of amino acid residue (A, B, C, D, E) are considered to exist. However, it will be appreciated that in practice there are actually 20 different amino acids that constitute protein molecules, i.e. the 20-letter code of life exploited by proteins and derived from the 3-letter DNA code within individual codons that is translated into individual amino acid residues within a protein molecule.

As illustrated in Figure 6, a number of amino acid residues, shaded grey, are not considered to be exposed at the surface as they are significantly overlapped by outermost residues or are too deep within the structure of the protein. As illustrated in Figure 6, the interaction region is in the form of a disc 172 centred on a D residue 174. Based on the frequency of distribution of amino acids and their mean molecular diameters (as measured for approximately 3 million known proteins across all of known biology in 2006), a mean amino acid residue diameter of 6.35342 Angstroms is preferred in practice. This mean value will change little in the future as a significant sample size was employed to calculate its value. In the example illustrated in Figure 6, the diameter of disc 172 can be spanned maximally by three amino acid residues, a central residue plus one on each side. In practice, for a five residue diameter disc, for example, the method will seek out combinations of five amino acid residues within the twelve nearest residues, twelve being the theoretical number in a perfectly packed set of baubles of identical diameter within a disc of maximal diameter of 5 x 6.35342 Angstroms.

The example interaction patch 172 shown in Figure 6 can be spanned by up to three amino acid residues, but in practice larger interaction patches can be used and can provide more accurate results. The interaction patch size can be expressed in terms of units of mean amino acid residue diameter. The maximum number of exposed amino acid residues for a patch diameter of three amino acid residues is the central amino acid, plus six others able to be perfectly stacked in juxtaposition. This equates to the central amino acid residue plus the sum over the radius of the interaction patch of six times the radius, in which the radius is expressed in whole mean amino acid residue diameters (subtracting the 0.5 of a residue diameter contributed by the central residue). Hence, as the radius of the disc, as measured in units of mean amino acid diameter, increases so does the maximum number of possible surface exposed residues within a disc. For example, an interaction disc of diameter 3 (i.e. radius 1) as illustrated in Figure 6 can contain up to 1 + (1 x 6) = 7 amino acid residues in total. Similarly, a disc with a diameter of 5 (i.e. radius 2) mean residue diameters can contain up to 1 + (1 x 6) + (2 x 6) = 19 exposed amino acid residues in total, and a disc with a diameter of 7 (i.e. radius 3) mean residue diameters can contain up to 1 + (1 x 6) + (2 x 6) + (3 x 6) = 37 exposed amino acid residues, etc. The central residue can be considered to count as a whole unit of one residue, as opposed to 0.5 residues, in an exemplar patch with a 7 residue diameter spanned by a string A, B, C, D, E, F, G in which D represents the central residue. This rule applies for any body comprised of perfectly-packed baubles/beads of any identical size and thereby provides a means to systematically envisage macromolecules. In reality, however, proteins are not perfectly packed, contain empty spaces and consist of amino acid molecules of differing mass. Nonetheless, amino acids are sufficiently similar in mass that real proteins do not depart greatly from the formula given above. In practice, however, this invention employs merely the central residue plus the nearest X residues in 3D space situated on the protein surface for a given intended interaction patch as defined in the formula immediately above, for example, the central residue and the nearest 36 residues.
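By way of illustration only, the packing formula given above (one central residue plus six times each whole radius step) can be written as a short routine. The function name is hypothetical; the arithmetic is that of the examples above and is equivalent to the closed form 1 + 3r(r + 1) for radius r.

```python
def max_residues_in_disc(diameter):
    """Maximum residue count for a perfectly packed disc.

    `diameter` is an odd whole number of mean residue diameters;
    the radius is (diameter - 1) / 2 whole residue diameters.
    """
    if diameter < 1 or diameter % 2 == 0:
        raise ValueError("diameter must be a positive odd number of residue diameters")
    radius = (diameter - 1) // 2
    # Central residue plus a ring of 6 * k residues at each radius step k.
    return 1 + sum(6 * k for k in range(1, radius + 1))


# Counts for the disc diameters mentioned in the description.
counts = {d: max_residues_in_disc(d) for d in (3, 5, 7)}
```

Subtracting one from each count gives the number of nearest neighbours taken around the central residue, e.g. 36 for a disc of diameter 7.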

In practice, however, most protein macromolecules are not perfectly globular and thus a disc of diameter 3 refers specifically to a central amino acid plus the nearest 6 surface-exposed amino acid residues. Similarly, a disc of diameter 7 refers specifically to a central amino acid plus the nearest 36 surface-exposed amino acid residues, as defined by the formula above. In practice, a surface exposed amino acid residue can be considered as a solvent accessible residue wherein that solvent is water, i.e. a small molecule with equally low molecular mass.
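By way of illustration only, the selection of a central residue plus its nearest surface-exposed neighbours can be sketched as a distance ranking. The sketch assumes each exposed residue is represented by a label and a single (x, y, z) coordinate, for example a residue centroid; that representation, the function name and the sample coordinates are hypothetical.

```python
import math


def nearest_patch(centre, others, n_neighbours):
    """Return the `n_neighbours` exposed residues closest to `centre` in 3D space.

    `centre` and each entry of `others` are (label, (x, y, z)) pairs.
    """
    _, centre_xyz = centre
    ranked = sorted(others, key=lambda item: math.dist(centre_xyz, item[1]))
    return ranked[:n_neighbours]


# Hypothetical coordinates in Angstroms.
centre = ("D", (0.0, 0.0, 0.0))
others = [
    ("A", (6.0, 0.0, 0.0)),
    ("B", (20.0, 0.0, 0.0)),
    ("C", (0.0, 6.5, 0.0)),
    ("E", (0.0, 0.0, 13.0)),
]
patch = [label for label, _ in nearest_patch(centre, others, 2)]
```

For a disc of diameter 3 the routine would be called with six neighbours, and for a disc of diameter 7 with thirty-six.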

In practice, a disc size which has a diameter spanned maximally by 5, 7 or 9 amino acid residues is preferred. All diameters between a minimum of 3 and a maximum of 13 residues (i.e. radius 0.5, plus 1 through to radius 6 around the central amino acid) can be used, depending on the application. Thereafter, the curvature at a particular molecule surface is likely to restrict the applicability of this invention due to the inability of atoms and molecules to physically interact over greater distances. Concomitantly, numerical projections in terms of screening possible combinations for a maximal disc size of diameter 13 and above become increasingly difficult to compute, nearing a non-polynomially complete state when considered for large protein structure databases and the need to interrogate such datasets, i.e. bordering on impossible to compute.

The disc size is considered more likely to represent the actual area over which binding occurs in practice. However, a binding region spanned maximally by three residues is used in the example by way of simplicity of explanation only. The interaction disc 172 referred to in Figure 6 corresponds to a schematic representation of the actual interaction region (patch) or molecular surface binding site interface between any molecule and any target protein, with diameters measured in Angstroms between approximately 3 x 6.35342 and 13 x 6.35342.

Then at step 160 all possible unique ways in which a molecule could bind with some of the exposed residues within the interaction region are enumerated for this particular binding site. That is, for the current binding site characterised by D 174, all unique three exposed residue combinations, rather than permutations, within the current interaction region 172 are enumerated. For example, a first three exposed residue combination is DCE, which is considered equivalent to the permutations DEC, ECD, EDC or CDE and all other permutations and therefore counts as a single unique combination 176. Another unique three exposed residue combination is DCB 178, which is equivalent to the permutations DBC, BCD, etc. Step 160 continues until all possible unique combinations of three exposed residues within the interaction region have been identified. As indicated above, in practice, combinations of 5, 7 or 9 exposed residues would be used.
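By way of illustration only, the enumeration of step 160 can be sketched with a standard combinations routine. The sketch assumes that combinations are drawn from all exposed residues within the patch and are identified by residue type only; the function name and the sample patch are hypothetical and do not reproduce the actual contents of Figure 6.

```python
from itertools import combinations


def unique_combinations(patch_residue_types, k=3):
    """Enumerate unique k-residue combinations within a patch, ignoring order.

    Each combination is written with its letters in alphabetical order so that
    all permutations of the same residues collapse to one entry.
    """
    found = {"".join(sorted(combo)) for combo in combinations(patch_residue_types, k)}
    return sorted(found)


# Hypothetical patch: a central D residue with three exposed neighbours.
patch_types = ["D", "C", "E", "B"]
combos = unique_combinations(patch_types)
```

Because each combination is stored in a canonical letter order, DCE, DEC, ECD and the other permutations are all recorded once as CDE.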

When step 160 has completed for the current exposed residue 174 then at step 162 a next exposed residue is selected and process flow returns, as illustrated by return line 163, to step 156. Then at step 158 the interaction region is centred over the next exposed residue to be evaluated and step 160 is carried out. This is illustrated in Figure 7 in which the interaction region 172 has been centred on exposed C residue 180 so that residues C, A, D, B, E₁ and E₂ are within the interaction region 172, in which E₁ and E₂ are two instances of the same type of residue. Again, all unique combinations of three exposed residues within the interaction region are enumerated as illustrated at 182.

Because residues E₁ and E₂ occur within the same interaction region they must be counted to represent two distinct occurrences of the same type of amino acid residue.

In practice, any such amino acid residue will be accompanied by its exact nomenclature, e.g. glycine; position with respect to the N-terminus of the protein; X, Y, Z coordinates in a standardised Cartesian reference frame, or 3-dimensional radial references from a molecular centroid again within a standardised reference frame, or X, Y, Z coordinates as employed by the internationally-accessible Protein Data Bank or the Macromolecular Structural database; extent of surface exposure in absolute or relative terms and measured in square Angstroms; and any information pertaining to post-translational modifications of a particular amino acid residue, for example. This accompanying data is used to detail exact information of relevance to a particular surface-exposed amino acid residue in database 108. Where regions of a molecule contain low redundancy information, such as repeats of just one amino acid type, this data will allow a more informed interpretation of the molecular surface with respect to others in database 108.
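By way of illustration only, the data accompanying each surface-exposed residue could be held in a record such as the following. The class name, field names and sample values are hypothetical; the fields merely mirror the items listed above (nomenclature, position from the N-terminus, coordinates, surface exposure and post-translational modifications).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExposedResidue:
    name: str                                 # exact nomenclature, e.g. "glycine"
    sequence_position: int                    # position counted from the N-terminus
    coordinates: Tuple[float, float, float]   # X, Y, Z in a standardised frame
    exposed_area: float                       # surface exposure in square Angstroms
    ptms: Tuple[str, ...] = ()                # post-translational modifications, if any


# Two instances of the same residue type at different positions (values hypothetical).
e1 = ExposedResidue("glutamic acid", 17, (1.0, 2.0, 3.0), 42.0)
e2 = ExposedResidue("glutamic acid", 58, (4.0, 2.5, 3.1), 12.5)
distinct = e1 != e2
```

Keeping position and coordinates in the record is what allows two residues of the same type within one interaction region to be counted as distinct occurrences.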

After all the unique combinations of exposed residues within the interaction region have been enumerated at step 160, then processing proceeds as described above and continues until all surface exposed residues have been evaluated.

For any protein, there will exist a fixed number of surface exposed residues for which the above process must be completed, noting that much duplicated information will initially be generated as a consequence of juxtaposing surface exposed residues sharing many unique combinations in their respective information strings that define potential binding sites. Some of this redundancy of information is removed later on as described below.

At step 164, a signature is generated for each potential binding site by creating a one dimensional data string as the sequence of unique combinations of exposed residues for each binding site. By combining the signatures for all the binding sites, as described by the concatenated lists for each and all surface exposed residues, a signature can be generated for the entire protein surface. As unique combinations within each disc are counted across the surface of a whole protein, and as the interaction regions overlap due to stepping one exposed residue position at a time to an adjacent amino acid residue of a new interaction region, considerable redundancy or duplication in the count of unique surface exposed combinations occurs. This redundancy may be removed from the list of unique combinations attributed to a given whole protein molecule in order to more accurately reflect the potential three dimensional combinatorial surface binding site diversity of the protein, or it can be retained so as to afford higher regional selectivity in screening database entries with respect to one another.

The duplication of counting of 3 residue strings is removed. Also, the concatenation of the multiple 3 residue strings into a single string for each binding site is carried out, prior to saving the signatures to the database 108 in a reproducible manner, for example, in alphabetical order of the information strings. The 1D data string to be stored in database 108 must first be arranged in a consistent manner. Any consistent arrangement can be used and an alphabetical ordering is described by way of example below. Each three residue string is arranged in alphabetical order internally and then the three residue strings are arranged in alphabetical order and concatenated so as to produce a 1D signature for the site. Finally, any duplicate 3 residue strings are removed from the 1D signature to provide the final 1D signature for a site, which is then saved to database 108 at step 164.

For example, if a binding site gave rise to the combinations ACD, DAE, DBA, DCA, DCE and DBE, then these combinations are each re-arranged alphabetically internally to give ACD, ADE, ABD, ACD, CDE and BDE. Arranging the 3 residue strings alphabetically gives ABD, ACD, ACD, ADE, BDE and CDE. However, ACD appears twice and so the duplicate is removed to leave the final concatenated 1D signature of ABDACDADEBDECDE. It will be appreciated that the duplicate 3 residue combinations can be removed prior to arranging the 3 residue strings alphabetically.
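By way of illustration only, the ordering, de-duplication and concatenation described above can be sketched as follows. The function name is hypothetical, and the input triplets are chosen here so that the routine reproduces the final signature ABDACDADEBDECDE quoted above.

```python
def site_signature(combinations_found):
    """Build a 1D signature for one binding site.

    Each combination is first sorted internally, duplicates are removed,
    and the remaining strings are sorted and concatenated.
    """
    canonical = {"".join(sorted(combo)) for combo in combinations_found}
    return "".join(sorted(canonical))


# Illustrative input triplets in arbitrary letter order.
signature = site_signature(["ACD", "DAE", "DBA", "DCA", "DCE", "DBE"])
```

Using a set removes duplicates before sorting, which corresponds to the observation that duplicates can be removed prior to the alphabetical arrangement without changing the result.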

Recapitulating, a string which contained the 3-letter string ABD three times, such as ABD, ABD, ABD, ADE, BCD, BDE, CDE, is reduced to ABD, ADE, BCD, BDE, CDE. Importantly, each 3 residue string, such as ABD, is accompanied by unique positional information (e.g. linear with respect to the N-terminus, Cartesian or radial coordinates, etc.) so that A₁B₁D₁ at a first potential binding site may be distinguished from A₂B₂D₂ situated at a second different potential binding site elsewhere on the protein surface.

The result of this process is a data string that allows a 3 residue description of any protein surface. Of course, a 5 residue, 7 residue, 9 residue, etc. description of a protein surface would be represented in the same way but with longer data strings. It is important to note that the resultant one dimensional data string of concatenated 3 residue letter strings has translated the three dimensional binding capacity of the protein into a one dimensional data string more akin to traditional methods of representing genomic and protein sequence. However, here the linear arrangement for any protein, while able to be reproducibly expressed, bears no linear relationship to the genomic or protein sequence of the encoded gene-product.

The set of 1D data string signatures for each binding site of the protein is then saved in database 108 for that protein.

Hence the present invention allows a linear information string to be generated which represents the entire 3D molecular landscape of any potential ligand binding-site of varying maximal diameter that may occur on the surface of any target protein molecule existing within any living organism. As a consequence it is possible to compare such linear information strings with other linear information strings for similarity and identity using any of a number of existing procedures in bioinformatics. As will be described later, this one-dimensional string is now in a format able to be interrogated by existing gene and protein bioinformatics tools. The greater the similarity or identity as measured by traditional bioinformatics tools, the more similar the exposed protein surface. Once defined as similar, other extant tools can then be employed to assess and rank the potential for ligand (small molecule or protein macromolecule) to target (protein macromolecule) interaction. Predictions can further be based upon Quantitative Structure Activity Relationship (QSAR) and any number of molecular dynamic methods which take into account detailed predictions of potential energies of interaction.

However, the latter methods alone are impractical for the assessment of ligand binding potential or the comparison of known binding sites for similarity of 3D molecular landscape characteristics within an entire proteome or for populations of greater than 50 potential protein molecules. This is especially true when confronted with an almost infinite population of diverse protein structures found within any living organism, each sharing varying degrees of three dimensional surface binding-site similarity. The present invention allows a deep look within an almost infinite molecular universe and renders tractable subsequent assessments of potential atomic and molecular interactions within a more restricted set of 3D molecular landscapes which can be compared initially in terms of linear information string similarity, i.e. a sieving strategy to bring an otherwise computationally impossible task within the realms of tractability by known bioinformatics methods, a task not previously possible on a proteomic scale. For enhanced sensitivity and selectivity, the method described here can be modified by reference to smaller regions of a protein surface, subsets or portions of a whole protein, whereby such information can be stored in database 108 as combinatorial strings as described immediately above, i.e. so as to define sub-regions of the protein surface, be they structural domains as housed in the CATH or SCOP databases, for example, or surface quadrants or any other arbitrary delineations of a protein surface.

Returning to Figure 4, at step 144, as part of the screening method, a wet parallel assay of the target compound is carried out. As indicated above, the generation of the binding signatures can be carried out before, after, or in parallel with the wet assay. It is only necessary that the signatures have been generated before the final step of the method.

Figure 8 shows a flow chart illustrating a wet assay method 200, corresponding generally to step 144. At step 202, approximately 400 proteins are selected and high purity, high quality samples are synthesised which should be as free as possible of contaminants. Approximately 400 to 600 proteins gives a reasonable sample of the 40,000 or so human proteins currently identified and so is likely to lead to at least some binding with the target compound. However, in other embodiments a greater number or lesser number of proteins can be used. As computer power increases, it is conceivable that the number of target proteins able to be compared simultaneously for similarity of their 3D surface molecular topography could be measured in the tens and hundreds of thousands of molecules, and even millions of proteins could become computationally facile. It is preferred if at least 50 different proteins are used. Irrespective of the number of proteins selected, it is preferred if multiple samples are produced to allow for replicate experiments. The proteins may include Post-Translational Modifications (PTMs) and/or may also include other isoforms based on splice variants of particular proteins and in turn PTMs of each of these splice variants of the proteins and/or any other variants of proteins in order to refine the assay.

For populations of proteins numbering fewer than 50, such applications of the current invention would be applied to specifically dissect protein isoform binding disparity of one or more closely-related proteins and/or their associated isoforms, e.g. splice and PTM variants, to provide an isoform-specific assay. When the wet-lab and in silico aspects of the present invention are combined, the method described can be particularly powerful in terms of differential binding of chemical agents with respect to closely-related targets and potentially quite distinct functional consequences intracellularly, inter-organ and inter-organism.

In the following, chip based high-throughput screening assays will be described. However, the invention is not limited to such approaches and other parallel assay systems and assay methods can be used such as multiple, but low-throughput, test-tube-based assay methods. At step 204 the chips are prepared with randomly immobilised samples of each protein and replicates. Alternatively, non-immobilised samples of each protein can be provided on the chips. At step 206, the proteins on the chips are exposed to the compound and this can be repeated for each replicate of the experiment. At step 208, the degree of binding of the compound to each protein on the chips is detected and the intensity of binding is detected by light emission for each chip position and recorded by assay device 110. A number of other quantitative methods well known in molecular biology can be used additionally or alternatively, e.g. radioisotopes, different wavelengths of light, mass changes, substrate thickness changes due to molecular binding, heat of reaction, changes in magnetic and molecular spin properties, changes in refraction patterns, surface plasmon resonance, etc. The protein at each position on the chip is known beforehand and so the intensity of light, or other signal, from each position on the chip can be associated with the corresponding protein to provide a measure of the degree of binding of the compound with each protein. At step 210, those sites on the chip which have exhibited significant binding can be identified. For example, those sites for which the intensity of light exceeds the noise level by at least two standard deviations can be considered to be sites exhibiting significant binding. By identifying those sites exhibiting significant binding, the associated proteins can be identified.
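By way of illustration only, the significance test of step 210 can be sketched as follows. The sketch assumes that the noise level and its standard deviation are estimated from blank (protein-free) chip positions, which is one possible reading of the criterion above; the function name and all intensity values are hypothetical.

```python
import statistics


def significant_sites(intensities, blank_intensities, n_sd=2.0):
    """Return proteins whose signal exceeds the noise level by at least n_sd
    standard deviations.

    `intensities` maps protein identifier to measured intensity;
    `blank_intensities` are readings from blank positions (an assumption).
    """
    noise_mean = statistics.mean(blank_intensities)
    noise_sd = statistics.stdev(blank_intensities)
    cutoff = noise_mean + n_sd * noise_sd
    return [protein for protein, value in intensities.items() if value >= cutoff]


# Hypothetical readings in arbitrary light-intensity units.
blanks = [10.0, 12.0, 11.0, 9.0, 13.0]
readings = {"P1": 50.0, "P2": 12.0, "P3": 14.5, "P4": 14.0}
binders = significant_sites(readings, blanks)
```

The same routine applies to any of the alternative signals mentioned above, since it depends only on a per-position measurement and a noise estimate.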

It is assumed that within each protein at a given site on the chip, there will be common information within the linear information string defining the potential surface binding sites on the particular set of proteins (i.e. those sites exhibiting significant binding). It is this common information that is used as the product of the wet assay to define the 3D surface binding signature of any given ligand. As the number of proteins included in the wet assay increases it becomes statistically less likely that no significant binding will be detected in the wet assay. Data indicating the proteins which exhibited significant binding and a measure of their binding can then be supplied to computer 102 from assay device 110 and stored in a new database 109. The wet assay part of the method is then completed.

Returning to Figure 4, the results of the wet assay are used to determine the likely specificity of binding of the compound at the level of a whole organism. Figure 9 shows a process flow chart illustrating a method 220 for determining the specificity of binding. As used herein, specificity of binding is used to indicate the degree with which a target will bind to proteins within an organism, for example by binding to very many or all of the proteins in an organism (low specificity) or by binding to very few, a single or none of the proteins in an organism (high specificity).

The efficacy of the wet assay conducted in a parallel manner is dependent upon the inclusion on the chip or in the array of proteins having known 3D structures or proteins for which 3D structure can be determined retrospectively. From these 3D structures a set of linear information strings defining the potential 3D surface binding sites of each protein on the chip or in the parallel array can be determined using the method described above. If numbers of proteins greater than 400 to 600 are included in such wet assays, the accuracy with which ligand binding specificity, or lack thereof, can be estimated with respect to a whole target proteome will also increase accordingly. A proteome is defined here as the total protein complement of a given organism, including those forms expressed during different temporal variants of a given organism and/or variants expressed in different regions of an organism. The accuracy of such estimates of ligand binding specificity, or lack thereof (on- and off-target binding specificity respectively), can be further enhanced by inclusion in the wet assay of one or more target proteins known to bind with a given ligand, although this is not essential to the working of the current invention.

It is however important to note that any ligand of interest does not necessarily need to be accompanied by known 3D structure and/or estimates thereof in order for the current invention to work. It is the purpose of the wet assay to define the 3D binding signature for any ligand.

At step 221, computer 102 identifies a number N of the proteins from the assay which exhibited substantial binding with the compound. This can be done in a number of ways. For example, all proteins which bound to the compound above a given threshold can be selected. Alternatively, the proteins can be ranked by the strength of their binding with the compound and the top six binding proteins can be identified. Irrespective of how they are identified, at step 221 a number, N, of proteins from the 400 to 600 used in the assay are identified. Although proteins are used in this example, it will be appreciated that it is not necessary to use proteins and that binding to any molecule can be assessed using the present invention.

At step 222 it is determined whether there are too many proteins which exhibited significant binding to the ligand compound. For example, if more than 20% of the proteins used in the assay exhibited significant binding as determined in the preceding step, then it is likely that the ligand compound is going to bind to too many proteins in practice and so may not be a good candidate therapeutic agent. Therefore, the ligand compound can be rejected for further processing at step 223. It will be appreciated that different proportions of proteins exhibiting significant binding can be used depending on the sensitivity being sought.
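By way of illustration only, the selection of the binding proteins and the promiscuity check described above can be sketched together. The function name, the data layout and the sample binding strengths are hypothetical; the 20% limit is the example proportion given above and is exposed as a parameter.

```python
def screen_binders(binding, threshold, max_fraction=0.20):
    """Select proteins that bound above `threshold` and flag promiscuous ligands.

    `binding` maps protein identifier to measured binding strength.
    Returns (binders ranked by strength, rejected flag).
    """
    if not binding:
        raise ValueError("no assay results supplied")
    binders = sorted(
        (protein for protein, strength in binding.items() if strength >= threshold),
        key=lambda protein: binding[protein],
        reverse=True,
    )
    # Reject the ligand if more than max_fraction of assayed proteins bound.
    rejected = len(binders) / len(binding) > max_fraction
    return binders, rejected


# Hypothetical assay of ten proteins, two of which bind strongly.
assay = {f"P{i}": 1.0 for i in range(1, 11)}
assay["P3"] = 9.0
assay["P7"] = 7.5
binders, rejected = screen_binders(assay, threshold=5.0)
```

A ligand flagged as rejected would not proceed to signature comparison; otherwise the ranked list supplies the N proteins used in the following steps.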

Then at step 224 the 3D combinatorial surface binding site signature is obtained for each of the N proteins which exhibited significant binding. If the signature already exists for a protein, then it can simply be looked up in database 108. However, if the signature does not already exist for a protein, then the signature can be generated now, or before the wet assay, or in parallel with the wet assay, using the method described above and the signature stored in database 108 or 109. Irrespective of whether they are generated or looked up at step 224, a set of N signatures is obtained.

Then at step 226, the N signatures are compared with one another in order to identify regions on the surface of target molecules (by means of their own signatures) that are common to all or some of the signatures associated with molecules to which the ligand compound bound experimentally. The parts of the signatures which are common to all of the signatures can now be employed to represent a signature for those sites on the N proteins to which the ligand compound bound. Then at step 228 a 1D data string representing the common parts of the signatures is generated. This "consensus string" generated from the wet lab assay, or indeed directly from co-crystallisation of ligand compound and known target, can then be exploited as the "3D consensus binding signature" for the ligand compound in question. The wet lab assay can also be used to screen ligand compounds of unknown chemistry, but definable by their "3D consensus binding signature".
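By way of illustration only, the extraction of the common parts of N signatures can be sketched by treating each signature as a set of 3 residue strings and keeping those shared by all, or by a chosen fraction, of the signatures. This set-based comparison is an assumption made for the sketch rather than the only possible way of comparing signatures; the function names and the sample signatures are hypothetical.

```python
from collections import Counter


def split_triplets(signature, k=3):
    """Split a concatenated 1D signature into its set of k-letter strings."""
    return {signature[i:i + k] for i in range(0, len(signature), k)}


def consensus_signature(signatures, min_fraction=1.0, k=3):
    """Concatenate, in alphabetical order, the k-letter strings present in at
    least `min_fraction` of the supplied signatures."""
    counts = Counter()
    for signature in signatures:
        counts.update(split_triplets(signature, k))
    needed = min_fraction * len(signatures)
    return "".join(sorted(t for t, count in counts.items() if count >= needed))


# Hypothetical signatures for three proteins that bound the ligand compound.
bound = ["ABDACDADE", "ABDADEBDE", "ABDADECDE"]
consensus = consensus_signature(bound)
```

Lowering `min_fraction` below 1.0 corresponds to retaining parts that are common to only some of the signatures.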

The following example is intended to demonstrate this principle, whereby each letter in two sentences can be taken to represent the 1D information strings as detailed above. It is intended to identify identical or similar letters (in practice amino acids can be either identical or similar with respect to properties such as size, charge, hydrophobicity, etc.) and indeed whole words or regions within a given signature.

The big book was very interesting. Big houses appeal very much.

Similar information is apparent as: "big, h, e, o, a, s, very, i, r" and replicates of one or more letters in each sentence. However, a sentence:

"The big book was very stimulating",

would score as a more similar / identical protein, whereby the protein is being defined by the whole sentence / information string and each letter to be taken as an example of a surface exposed amino acid residue and unique combinations of defined length with its nearest neighbours.
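By way of illustration only, the sentence analogy can be reproduced with a general-purpose string comparison. The sketch uses the `difflib.SequenceMatcher` class from the Python standard library as a stand-in for the bioinformatics tools referred to in this description; the helper function name is hypothetical and the comparison is made case-insensitively as a simplifying assumption.

```python
from difflib import SequenceMatcher

reference = "The big book was very interesting."
candidates = [
    "Big houses appeal very much.",
    "The big book was very stimulating",
]


def similarity(a, b):
    """Similarity ratio between two strings, ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


scores = {candidate: similarity(reference, candidate) for candidate in candidates}
best = max(scores, key=scores.get)
```

The candidate sharing whole words and regions with the reference sentence scores higher than the one sharing only scattered letters, which is the behaviour the analogy is intended to convey.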

Then at step 230 the 1D data string representing the common parts of the signature is used to search database 108 to identify all proteins in database 108 for an organism which have a signature similar to some degree to the common signature. Hence, this effectively identifies all the proteins in the organism to which the ligand compound is likely to bind as they too are likely to have a binding site at least similar to the binding site or sites on the N proteins exhibiting binding to the ligand compound. A metric reflecting the degree of similarity, or percentage identity, of the common signature to each of the signatures is generated at step 230 by comparing the 1D data strings using suitably adapted standard bioinformatics tools, namely substitution matrices such as PAM250 or BLOSUM to score the degree of amino acid similarity and identity.
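By way of illustration only, a very simple percentage identity metric between a consensus signature and a stored protein signature can be sketched as follows. This is a deliberately simplified stand-in that counts exact matches of 3 residue strings only; it does not implement substitution-matrix scoring such as PAM250 or BLOSUM, and the function names and sample signatures are hypothetical.

```python
def triplets(signature, k=3):
    """Split a concatenated 1D signature into its set of k-letter strings."""
    return {signature[i:i + k] for i in range(0, len(signature), k)}


def percent_identity(consensus, candidate, k=3):
    """Percentage of the consensus k-letter strings found in the candidate."""
    wanted = triplets(consensus, k)
    if not wanted:
        return 0.0
    return 100.0 * len(wanted & triplets(candidate, k)) / len(wanted)


# Hypothetical consensus signature and stored protein signature.
score = percent_identity("ABDADE", "ABDACDBDE")
```

A fuller implementation would additionally credit similar, rather than only identical, residues, which is the role of the substitution matrices mentioned above.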

Then at step 232, the proteins of the organism are ranked by their degree of similarity metric. Then at step 234 a threshold for sensitivity of detection for proteins within an organism most likely to exhibit binding to the ligand compound can be established. Thresholds can be established by statistical measures, but most simply can be the top 10, 100, 200 or 1000 most similar protein surfaces contained within database 108, thereby allowing protein targets likely to be bound by the ligand compound to be identified within, for example, the entire human proteome. Such information is a measure of 'catholicness of binding' or a relative 'target-selectivity' measure. When drugs bind to numerous non-intended targets, these off-target responses can give rise to undesirable and life threatening Adverse Drug Reactions. The individual proteins identified as potential binders will often be known as members of key physiological pathways associated with known molecular biochemistry. Such information may also allow drug developers to address this undesirable off-target binding by a variety of approaches known to the pharmaceutical industry so as to develop safer and higher efficacy drugs.
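By way of illustration only, the ranking and the simplest form of threshold described above (keeping the top N most similar protein surfaces) can be sketched as follows. The function name, the protein identifiers and the similarity values are hypothetical.

```python
def rank_proteins(similarity_by_protein, top_n=10):
    """Rank proteins by similarity metric, highest first, and keep the top_n."""
    ranked = sorted(similarity_by_protein.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Hypothetical similarity metrics for four database entries.
metrics = {"P1": 50.0, "P2": 100.0, "P3": 0.0, "P4": 75.0}
top = rank_proteins(metrics, top_n=2)
```

The length of the resulting list relative to the size of the proteome gives one crude indication of how widely the ligand compound may bind.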

Other applications of this invention are possible if the information generated at step 228 can be derived for a sub-region of interest of any given protein or proteins of interest. Examples of such sub-regions of interest include suspected or confirmed drug or other molecular binding sites of interest as determined by co-crystallisation, protein and DNA mutagenesis studies, or merely an area of interest on the surface of any protein. Once defined, such an area of interest can now be expressed as 5-, 7- or 9-residue combinatorial descriptions of potential 3D surface binding sites, converted into a linear information string akin to the 3D binding site signature as generated in step 228 and compared as in step 234 with information housed in database 108 to identify all of those proteins for the organism with definable identity or similarity to a given area of interest found on the surface of a particular protein. In this manner, it becomes possible for the first time to assess realistically and systematically the potential for on- (any known) and off- (unknown) target binding of drugs and other ligands with respect to all possible / potential drug binding sites found in an entire proteome and, most importantly, to rank candidate drugs with respect to their target selectivity (degree of 'catholicness of binding') within an entire organism.

Those drugs exhibiting the highest degree of specificity (as determined by having few potential binding sites within an organism, such as a human) can immediately be considered less likely to give rise to a diversity of Adverse Drug Reactions (ADRs) within that organism. Of course, specificity to some target proteins can also give rise to deleterious side effects due to drug administration, but these can be more knowingly monitored and predicted as a result of the invention described herein. ADRs are known to be a major cause of mortality and morbidity associated with the use of pharmaceutical compounds registered for use in humans in different countries around the world and also a major cause of high-cost attrition during drug discovery and development programs.

Thus, this invention can provide a tool for developing safer and more target-specific therapeutic agents (such as small molecule and macromolecule compounds) and for concomitantly assessing their probable safety or risk, based upon an ability to screen compounds in silico against multiple targets on a genomic scale (all the protein entities encoded by a given genome as expressed in an organism's proteome). As the knowledge of 3D structure associated with all human proteins increases, so too will the general applicability and utility of the present invention increase.

Hence, the present invention provides a method for screening a target compound against all possible binding sites of a protein and can consider all known identified proteins of an organism to give a more holistic view of the likely extent of interactions between the target compound and the organism. This is of paramount importance to drug development, both to assess and to measure enhancement of target selectivity so as to deliver more specific compounds and thereby reduce the likelihood of Adverse Drug Reactions. In addition, chemical iterations or isoforms of a therapeutic compound can be knowingly ranked against one another with respect to both affinity for target and the degree of target selectivity during drug discovery and drug development. Computational efficiency is achieved by reducing the problem of determining the likelihood of binding to any site on the protein to a 1D pattern matching problem, which is readily solvable and computationally tractable using current computer power.
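Purely as an illustrative sketch, the ranking of chemical iterations of a compound by target selectivity and affinity might look as follows. It assumes that the set of proteins predicted to bind each candidate is already available from the signature comparison, and that affinity is given as a single number where higher means tighter binding. The Candidate record, the rank_candidates function, the affinity scale and the tie-breaking rule are all hypothetical and are not specified in the described embodiment.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    target_affinity: float      # assumed scale: higher means tighter binding
    predicted_hits: frozenset   # protein IDs predicted to bind the candidate


def rank_candidates(candidates, intended_target):
    """Order candidates by fewest predicted off-target proteins.

    Ties are broken by higher affinity for the intended target
    (an assumed rule, chosen only for illustration).
    """
    def sort_key(candidate):
        off_target = len(candidate.predicted_hits - {intended_target})
        return (off_target, -candidate.target_affinity)

    return sorted(candidates, key=sort_key)
```

A candidate predicted to bind only the intended target therefore ranks ahead of one with higher affinity but several predicted off-target proteins; a different weighting of affinity against selectivity could equally be chosen.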

It will be appreciated that various modifications and changes to the specific embodiment described herein are envisaged. In particular, and unless the context requires otherwise, the sequence of steps in the illustrated methods may not be important. Similarly, individual steps may be broken down into sub-steps or combined into more general steps, and the specific steps illustrated in the flow charts are by way of clarity of explanation only.
