COMPARATIVE DETECTION OF STRUCTURE PATTERNS IN INTERACTION SITES OF MOLECULES

Title:

COMPARATIVE DETECTION OF STRUCTURE PATTERNS IN INTERACTION SITES OF MOLECULES

Document Type and Number:

WIPO Patent Application WO/2008/091225

Abstract:

Disclosed are methods and systems for representing an interaction site on a molecule, the method comprising: selecting a global region of the molecule, the global region encompassing an interaction site on the molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site by combining the at least one local descriptor for each of the local regions. Optionally, at least one descriptor for the global region may be determined and the representation of the interaction site may be formed by combining the at least one global descriptor and the at least one local descriptor for each of the local regions. Further methods and systems are also disclosed for identifying or screening for common structural patterns in an interaction site of molecules, the method comprising: representing an interaction site of at least one first molecule and of at least one second molecule; and comparing the representation of the interaction site of the at least one first molecule with the representation of the interaction site of the at least one second molecule to identify or screen for common structural patterns in the interaction sites.

Inventors:

TONG JOO CHUAN (SG)
CHOO KHAR HENG (SG)

Application Number:

PCT/SG2008/000025

Publication Date:

July 31, 2008

Filing Date:

January 22, 2008

Assignee:

AGENCY SCIENCE TECH & RES (SG)
TONG JOO CHUAN (SG)
CHOO KHAR HENG (SG)

International Classes:

C12Q1/68; G01N33/53; G06Q99/00; G16C20/64

Foreign References:

US6182016B1

2001-01-30

Attorney, Agent or Firm:

ROBINSON, Kristian (1531 Robinson Road Post Office, Singapore 1, SG)

Claims:

THE CLAIMS DEFINING THE INVENTION ARE AS FOLLOWS:

1. A method for representing an interaction site on a target molecule, the method comprising: selecting a global region of the target molecule, the global region encompassing an interaction site on the target molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site by combining the at least one local descriptor for each of the local regions.

2. A method according to claim 1 further comprising the step of determining at least one global descriptor for the global region and wherein the representation of the interaction site is formed by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

3. A method of identifying or screening for common structural patterns in an interaction site of molecules, the method comprising: representing an interaction site of at least one first target molecule and of at least one second target molecule using the method according to either claim 1 or claim 2; and comparing the representation of the interaction site of the at least one first target molecule with the representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

4. A method of predicting interactions between at least one first and at least one second target molecule, the method comprising: representing at least one interaction site of each of the at least one first target molecule and of the at least one second target molecule using the method according to claim 1; and comparing the representation of the at least one interaction site of the at least one first target molecule with the at least one representation of the interaction site of the at least one second target molecule to predict interactions between the interaction sites.

5. A method according to claim 4 wherein the comparison of the interaction site representations comprises direct or complementary comparison of the descriptors in each of the interaction site representations.

6. A method according to claim 5 wherein the comparison of the descriptors is performed by either matching descriptor values, or fuzzy matching of descriptor values.

7. A method according to claim 6 wherein the fuzzy matching of descriptor values comprises matching of a descriptor value to a fuzzy value within a desired range of values.

8. A method according to claim 5 wherein the comparison of the descriptors is performed by one or more machine-learning techniques selected from the group of artificial neural network, support vector machine, hidden Markov model and genetic algorithms.

9. A method according to claim 1 wherein the interaction site comprises a plurality of functional residues and a plurality of non-functional residues and wherein each of the plurality of functional residues is encompassed by the global region and at least one local region.

10. A method according to claim 9 wherein each of the functional residues is encompassed by a plurality of local regions.

11. A method according to claim 10 wherein the plurality of local regions are a plurality of adjacent overlapping local regions.

12. A method according to claim 1 wherein the representation of the interaction site is selected from the group consisting of: a list of the at least one descriptors for each of the local regions; a delimited list of the at least one descriptors for each of the local regions; a linear combination of the at least one descriptors for each of the local regions; and a matrix representation of the at least one descriptors for each of the local regions.

13. A method according to claim 12 wherein the linear combination of the at least one descriptors for each of the local regions is a weighted linear combination of the at least one descriptors for each of the local regions wherein each of the descriptors is assigned a weight.

14. A method according to claim 12 wherein the matrix representation of the at least one descriptors for each of the local regions is a weighted matrix representation of the at least one descriptors for each of the local regions wherein each of the descriptors is assigned a weight.

15. A method according to claim 1, wherein the target molecule is a plurality of macromolecules.

16. A method according to claim 1 wherein the target molecule is selected from the group consisting of a polypeptide, a polynucleotide, a polysaccharide, a glycoprotein, and a lipoprotein.

17. A method according to claim 1 or claim 3, wherein the target molecule is in association with an associated molecule which interacts with the target molecule via the association site.

18. A method according to claim 3, wherein the first target molecule and the second target macromolecule form an association pair, the association pair being selected from the group consisting of: a receptor-ligand pair; an antibody-epitope pair, and an enzyme-substrate pair.

19. A method according to any one of the preceding claims wherein the at least one descriptor is selected from the group consisting of free binding energy, entropic free energy, electrostatic energy, charge, van der Waals energy, torsion energy, hydrogen bonding energy, hydrophobic energy, bond stretching energy, and disulfide bonding energy, overall charge, charge distribution, contour, solvent accessible surface area, energetics, and torsion angles.

20. A system for identifying or screening for common structural patterns in an interaction site of molecules, the system comprising: means for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; means for selecting a plurality of local regions which lie substantially within the global region; means for determining at least one local descriptor for each of the local regions; means for forming a representation of the interaction site from a combination of the at least one local descriptor for each of the local regions; and means for comparing the representation of the interaction site of the target molecule with representations of an interaction site of at least one test molecule to identify or screen for common structural patterns in the interaction sites of the at least one target molecule and the at least one test molecule.

21. A system according to claim 20 further comprising means for determining at least one global descriptor for the global region and wherein the representation of the interaction site is formed from a combination of the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

22. A computer system comprising a computer processor and memory, the memory comprising software code stored therein for execution by the computer processor of a method for representing an interaction site on a target molecule, the method comprising: selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site by combining the at least one local descriptor for each of the local regions.

23. A system according to claim 22 further comprising means for determining at least one global descriptor for the global region and wherein the representation of the interaction site is formed from a combination of the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

24. A computer system comprising a computer processor and memory, the memory comprising software code stored therein for execution by the computer processor of a method for identifying or screening for common structural patterns in an interaction site of molecules, the method comprising: representing an interaction site of at least one first target molecule and of at least one second target molecule using the method according to claim 22 or claim 23; and comparing the representation of the interaction site of the at least one first target molecule with the representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

25. A software product for representing an interaction site of a molecule comprising: code for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; code for selecting a plurality of local regions which lie substantially within the global region; code for determining at least one local descriptor for each of the local regions; and code for forming a representation of the interaction site by combining the calculated at least one local descriptor for each of the local regions.

26. A software product according to claim 25 further comprising code for determining at least one global descriptor for the global region and wherein the code for forming the representation comprises code for forming a representation of the interaction site by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

27. A software product for identifying/screening for common structural patterns in an interaction site of molecules, comprising: code for representing an interaction site of at least one first target molecule and of at least one second target molecule; and code for comparing the representation of the interaction site of the target molecule with the representations of an interaction site of a test molecule to identify or screen for common structural patterns in the interaction sites of the at least one target molecule and the at least one test molecule.

28. A software product according to claim 27 wherein the code for representing an interaction site comprises: code for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; code for selecting a plurality of local regions which lie substantially within the global region; code for determining at least one local descriptor for each of the local regions; and code for forming a representation of the interaction site from the calculated at least one local descriptor for each of the local regions.

29. A software product according to claim 28 further comprising code for determining at least one global descriptor for the global region and wherein the code for forming the representation comprises code for forming a representation of the interaction site by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

30. An information database product comprising information on interaction site representations for a plurality of molecules, the interaction site representations for each of the plurality of molecules each being formed by the method comprising: selecting a global region in a molecule, the global region encompassing an interaction site in the molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site of the molecule from the calculated at least one local descriptor for each of the local regions.

31. An information database product according to claim 30 wherein the method for forming the interaction site representations for each of the plurality of molecules further comprises determining at least one global descriptor for the global region and wherein the representation of the interaction site is formed by combining the calculated at least one global descriptor for the global region and the calculated at least one local descriptor for each of the local regions.

32. An information database product comprising information on a plurality of interaction site representations for each of a plurality of molecules, each of the plurality of interaction site representations for each of the plurality of molecules being formed by the method of either claim 30 or claim 31.

Description:

COMPARATIVE DETECTION OF STRUCTURE PATTERNS IN INTERACTION SITES OF MOLECULES

TECHNICAL FIELD

[ 0001 ] The present invention relates to methods and systems for the systematic identification of structure patterns and representation of an interaction site of one or more molecules or macromolecules and analysis of interaction sites of associated pairs of molecules or macromolecules and of interactions between the associated pairs.

RELATED APPLICATION

[ 0002 ] This application claims priority to US provisional patent application No. 60/881,479, the entire contents of which are incorporated herein by cross-reference.

BACKGROUND OF THE INVENTION

[ 0003 ] Any discussion of the background art throughout the specification should in no way be considered as an admission that such background art is prior art, nor that such background art is widely known or forms part of the common general knowledge in the field.

[ 0004 ] The association of two molecules, where one or both of the molecules may be macromolecules, is a fundamental biological event that is essential for the initiation and regulation of biological responses. For example, a receptor protein has a binding site which selectively binds a particular ligand, and which initiates a cascade of reactions that induce a change of the state of the affected cell. This new state of the cell results in a biological response, such as enzyme activation or deactivation, protein synthesis, protein stabilization, or the release of hormones or transmitters, among others. Examples of receptor ligands include naturally occurring and synthetic hormones, pheromones, neurotransmitters, peptides, drugs, and small molecules.

[ 0005 ] Understanding the structural principles involved in ligand-receptor interaction is important for the analysis of biological responses, and related applications. A receptor may bind multiple ligands, or the same ligand may be recognized by multiple receptors. A cell may contain multiple copies of a particular receptor. The same type of receptor may be present in different cells. Some of these receptors belong to families with a large number of variants. Even if two proteins share similar structures, they may have different functions, as the binding site is highly sensitive and a small number of amino acid residue differences may alter the function of the protein.

[ 0006 ] With the rapid growth in the number of known protein sequences (2408264 in Swiss-Prot/TrEMBL as of October 2005) (Bairoch et al., 2004) and structures (33065 in the Protein Data Bank (PDB) as of October 2005) (Bernstein et al., 1978), there is a need for methods describing and identifying common functionally important units in related structures. Screening a family of receptors for their ligands or vice-versa through wet lab experimentation is not feasible due to the large number of combinations, leading to excessive experimental costs. Thus significant efforts have been invested in developing computational techniques for screening protein functional interaction sites to model receptor-ligand interactions. These techniques are based on computer simulated ligand binding or docking (Abagyan and Totrov, 2001; Gane and Dean, 2000), and the use of descriptors (Fetrow and Skolnick, 1998; Di Gennaro et al. 2001).

[ 0007 ] High-throughput flexible docking is an emerging technology for rational lead discovery that attempts to find the correct binding mode of a candidate ligand within a target receptor (binding site) over a three-dimensional search space (Abagyan and Totrov, 2001; Shoichet and Bussiere, 2000; Gane and Dean, 2000). Many docking algorithms have been developed, guided by the geometry and/or energy of the candidate receptor and ligand. FlexX (Rarey et al., 1999) is an incremental docking technique that places short fragments of the target ligand within the receptor binding site and gradually constructs the entire ligand by extending and linking the fragments together in a series of steps. DOCK4.0 (Fradera et al., 2000) is another incremental docking technique in which, at certain steps of the calculation, the similarity of a target ligand to a reference ligand is used as a weighting factor to correct the docking energy score. DARWIN (Taylor and Burnett, 2000) was developed based on a genetic algorithm and the CHARMM force field. Internal Coordinate Mechanics (Abagyan and Totrov, 2001; Fernandez-Recio et al., 2002) utilizes a biased Monte Carlo technique to sample different orientations of a ligand around a receptor (binding site) using a soft energy function precalculated on a grid. A variety of other computational docking methods for modeling peptides into the receptor binding site have been reported. These include QXP, AutoDock, GOLD and LUDI, among others. Docking techniques have proven to be successful in practice but, being computationally intensive, they are highly dependent on the availability of the functional interaction site location. Accordingly, this methodology is currently less useful for large-scale screening of potential binding targets, especially when the location of the receptor interaction site is unknown, as docking requires prior knowledge of the receptor binding site.

SUMMARY OF THE INVENTION

[ 0008 ] Accordingly, in a first aspect there is provided a method for representing an interaction site on a target molecule. The method may comprise selecting a global region of the target molecule. The global region may encompass an interaction site on the target molecule. A plurality of local regions may be selected. At least one local descriptor for each of the local regions may be determined. The plurality of local regions may lie substantially within the global region. The plurality of local regions may be partially overlapping with one, two, three, four, five, six or more adjacent local regions. Optionally, at least one global descriptor for the global region may be determined. In one arrangement, a representation of the interaction site is formed by combining the local descriptors for each of the local regions. In another arrangement, the representation of the interaction site may be formed by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

[ 0009 ] In an embodiment of the first aspect, the method may comprise in combination the steps of: selecting a global region of the target molecule, the global region encompassing an interaction site on the target molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site from the local descriptors for each of the local regions.
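By way of illustration only, the four steps of the first aspect might be sketched in Python as follows. The `Atom` record, the spherical regions, the `net_charge` descriptor and all function names are assumptions made for this sketch; the specification does not prescribe a region shape or a particular implementation, and charge is used only because it is one of the descriptors listed.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Atom:
    """Hypothetical atom record: a 3-D position and a partial charge."""
    position: tuple  # (x, y, z) coordinates
    charge: float


def select_region(atoms, centre, radius):
    """Return the atoms within `radius` of `centre` (a spherical region, assumed)."""
    return [a for a in atoms if dist(a.position, centre) <= radius]


def net_charge(region):
    """Illustrative local descriptor: the summed partial charge of a region."""
    return sum(a.charge for a in region)


def represent_site(atoms, site_centre, global_radius, local_centres, local_radius):
    """Form a list representation of an interaction site from local descriptors."""
    # Step 1: select a global region encompassing the interaction site.
    global_region = select_region(atoms, site_centre, global_radius)
    # Step 2: select local regions; drawing them from the global region
    # keeps every local region within the global region.
    local_regions = [
        select_region(global_region, centre, local_radius)
        for centre in local_centres
    ]
    # Steps 3 and 4: determine one descriptor per local region and
    # combine the descriptors into a representation (here a plain list).
    return [net_charge(region) for region in local_regions]
```

Calling `represent_site` with a list of atoms, a site centre and a set of local centres returns one descriptor value per local region, in the order the local centres were given.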

[ 0010 ] The method may further comprise determining at least one global descriptor for the global region. The representation of the interaction site may be formed by combining the global descriptors for the global region and the local descriptors for each of the local regions.

[ 0011 ] The target molecule may be a macromolecule, for example a receptor, ligand, antibody, epitope, antigen, enzyme, or substrate among others, or it may be a small molecule. The small molecule may bind to a macromolecule. The at least one descriptor may be selected from the group consisting of free binding energy, entropic free energy, electrostatic energy, charge, van der Waals energy, torsion energy, hydrogen bonding energy, hydrophobic energy, bond stretching energy, and disulfide bonding energy, overall charge, charge distribution, contour, solvent accessible surface area, energetics, or torsion angles.

[ 0012 ] The interaction site may comprise a plurality of elements (which may be functional elements of the interaction site or non-functional "spacer" elements) which may be residues, for example amino acid residues, nucleotide residues or monosaccharide residues for proteins, polynucleotides (e.g. DNA/RNA) or polysaccharides (e.g. carbohydrates) respectively. The residues which participate in an interaction between interaction sites of two or more molecules are referred to as functional residues. There may be a plurality of functional residues and a plurality of non-functional residues, and each of the plurality of functional residues may be encompassed by the global region and at least one local region. Each of the functional residues may be encompassed by a plurality of local regions. The plurality of local regions may be a plurality of adjacent overlapping local regions. The local regions may be selected such that they are evenly distributed throughout the global region, i.e. from the centre of the global region to its boundary; however, the local regions may be selected in an irregular manner provided that each of the functional elements of the interaction site is encompassed by at least one local region. Selection of a particular method of selecting the local regions may be based on its accuracy, soundness, availability and computational time for a particular interaction site.
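As a non-limiting sketch of adjacent overlapping local regions, the residues of a global region may be treated as a linear sequence and covered by fixed-length windows. The linear indexing, the `window` and `step` parameters and both function names are assumptions introduced for this example; the specification equally allows spatial or irregular selection.

```python
def overlapping_windows(n_residues, window, step):
    """Return (start, end) index pairs of local regions over residues 0..n_residues-1.

    Each region spans `window` residues (end is exclusive). A `step` smaller
    than `window` makes adjacent regions overlap.
    """
    if n_residues <= 0 or window <= 0 or step <= 0:
        raise ValueError("n_residues, window and step must be positive")
    last_start = max(n_residues - window, 0)
    starts = list(range(0, last_start + 1, step))
    # Add a final region if the stepping did not reach the end of the sequence,
    # so that every residue is encompassed by at least one local region.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, min(s + window, n_residues)) for s in starts]


def regions_covering(residue, windows):
    """Return the local regions that encompass the residue at index `residue`."""
    return [(start, end) for start, end in windows if start <= residue < end]
```

With a step of half the window length, interior residues fall inside two local regions, which corresponds to the arrangement in which each functional residue is encompassed by a plurality of local regions.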

[ 0013 ] The representation of the interaction site may be selected from the group consisting of: a list of the at least one descriptors for each of the local regions; a delimited list of the at least one descriptors for each of the local regions; a linear combination of the at least one descriptors for each of the local regions and a matrix representation of the at least one descriptors for each of the local regions. The linear combination of the at least one descriptors for each of the local regions may be a weighted linear combination of the at least one descriptors for each of the local regions wherein each of the descriptors is assigned a weight. The matrix representation of the at least one descriptors for each of the local regions may be a weighted matrix representation of the at least one descriptors for each of the local regions wherein each of the descriptors is assigned a weight.
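The alternative forms of representation named above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the descriptors are held as one row per local region and one column per descriptor type, and that a single weight is assigned to each descriptor type; the function names and separators are hypothetical.

```python
def as_list(descriptors):
    """Flatten the per-region descriptor rows into a single list."""
    return [value for region in descriptors for value in region]


def as_delimited_list(descriptors, value_sep=",", region_sep=";"):
    """Render the descriptors as a delimited string, one group per local region."""
    return region_sep.join(
        value_sep.join(f"{value:g}" for value in region) for region in descriptors
    )


def weighted_matrix(descriptors, weights):
    """Scale each descriptor by the weight assigned to its descriptor type."""
    return [[w * value for value, w in zip(region, weights)] for region in descriptors]


def weighted_linear_combination(descriptors, weights):
    """Collapse all weighted descriptors into a single scalar value."""
    return sum(sum(row) for row in weighted_matrix(descriptors, weights))
```

For two local regions with descriptor rows `[1.0, 2.0]` and `[3.0, 4.0]` and weights `[0.5, 2.0]`, the delimited list is `1,2;3,4` and the weighted linear combination is `14.0`. The list and matrix forms keep per-region detail for later comparison, whereas the linear combination trades that detail for a single value.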

[ 0014 ] In a second aspect, there is provided a method of identifying or screening for common structural patterns in an interaction site of molecules. The method may comprise representing an interaction site of at least one first target molecule and of at least one second target molecule using the method according to the first aspect. The representation of the interaction site of the first target molecule may be compared with the representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

[ 0015 ] In an embodiment of the second aspect, the method may comprise the combination of representing an interaction site of at least one first target molecule and of at least one second target molecule using the method according to the first aspect; and comparing the representation of the interaction site of the at least one first target molecule with the representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

[ 0016 ] In particular embodiments, the first target molecule and the second target molecule form an association pair, where one or both of the molecules in the association pair may be macromolecules. The association pair may be selected from the group consisting of: a receptor-ligand pair; an antibody-epitope pair or antibody-antigen pair; an enzyme-substrate pair; protein-protein pairs; channel/transporter-solute pairs or cytoskeletal-protein pairs; or other donor-acceptor molecule pair. The method of representing the interaction site may be provided by the method of the first aspect.

[ 0017 ] In one embodiment, the target molecule is at least one molecule, and in certain embodiments a plurality of molecules. The at least one or the plurality of molecules may be macromolecules. In one embodiment, the target molecule is selected from the group consisting of a polypeptide, a polynucleotide, a polysaccharide, a glycoprotein, and a lipoprotein.

[ 0018 ] In certain embodiments, the target molecule is in association with an associated molecule which interacts with the target molecule via the association site.

[ 0019 ] In a third aspect there is provided a method of predicting interactions between at least one first and at least one second target molecule. The method may comprise representing at least one interaction site of each of the at least one first target molecule and of the at least one second target molecule using the method according to the first aspect. The method may further comprise comparing the representation of the at least one interaction site of the at least one first target molecule with the at least one representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

[ 0020 ] In an embodiment of the third aspect, the method may comprise in combination: representing at least one interaction site of each of the at least one first target molecule and of the at least one second target molecule using the method according to the first aspect; and comparing the representation of the at least one interaction site of the at least one first target molecule with the at least one representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

[ 0021 ] The comparison of the interaction site representations may comprise comparison of the descriptors in each of the interaction site representations. The comparison of the descriptors may be performed by either matching descriptor values or fuzzy matching of descriptor values. The fuzzy matching of descriptor values may comprise matching of a descriptor value to a fuzzy value within a desired range of values. Alternatively, the comparison of the descriptors may be performed by one or more machine-learning techniques selected from the group consisting of artificial neural network, support vector machine, hidden Markov model and genetic algorithms.

[ 0022 ] In a fourth aspect, there is provided a system for identifying or screening for common structural patterns in an interaction site of molecules. The system may comprise means for selecting a global region in at least one target molecule. The global region may encompass an interaction site in the at least one target molecule. The system may further comprise means for selecting a plurality of local regions which lie substantially within the global region. The system may further comprise means for determining at least one local descriptor for each of the local regions. The system may further comprise means for forming a representation of the interaction site from a combination of the at least one descriptor for each of the local regions. The system may further comprise means for comparing the representation of the interaction site of the target molecule with the representations of an interaction site of at least one test molecule to identify or screen for common structural patterns in the interaction sites of the at least one target molecule and the at least one test molecule. The system may further comprise means for determining at least one global descriptor for the global region. The means for forming the representation of the interaction site may comprise means for forming a representation of the interaction site from a combination of the at least one descriptor for the global region and for each of the local regions.
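The comparison of two interaction site representations by matching or fuzzy matching of descriptor values, as set out in paragraph [ 0021 ], might be sketched as follows. The equal-length list representations, the symmetric `tolerance` used to model the "desired range of values", and the fraction-of-matches score are assumptions of this example rather than features stated in the specification.

```python
def fuzzy_match(value_a, value_b, tolerance):
    """True if the two descriptor values differ by no more than `tolerance`.

    A tolerance of 0 reduces to matching of identical descriptor values.
    """
    return abs(value_a - value_b) <= tolerance


def match_fraction(rep_a, rep_b, tolerance=0.0):
    """Return the fraction of corresponding descriptors that match.

    `rep_a` and `rep_b` are list representations of equal length, with
    descriptors for corresponding local regions at the same index.
    """
    if len(rep_a) != len(rep_b):
        raise ValueError("representations must have the same length")
    if not rep_a:
        return 0.0
    hits = sum(fuzzy_match(a, b, tolerance) for a, b in zip(rep_a, rep_b))
    return hits / len(rep_a)
```

A screening step could then rank candidate molecules by `match_fraction` against a reference site and retain those above a chosen threshold; the threshold, like the tolerance, would need to be tuned for the descriptors in use.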

[ 0023 ] The system of the fourth aspect may, in a particular embodiment, comprise in combination: means for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; optionally means for determining at least one descriptor for each of the local regions; means for forming a representation of the interaction site from a combination of the at least one descriptor for each of the local regions; and means for comparing the representation of the interaction site of the target molecule with the representations of an interaction site of at least one test molecule to identify or screen for common structural patterns in the interaction sites of the at least one target molecule and the at least one test molecule. The system may further comprise means for selecting a plurality of local regions which lie substantially within the global region, wherein the representation of the interaction site is formed from a combination of the at least one descriptor for the global region and for each of the local regions.

[ 0024 ] In a fifth aspect, there is provided a computer system comprising a computer processor and memory, the memory comprising software code stored therein for execution by the computer processor of a method for representing an interaction site on a target molecule, the method comprising: selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site by combining the at least one local descriptor for each of the local regions.

The computer system may further comprise means for determining at least one global descriptor for the global region wherein the representation of the interaction site is formed by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

[ 0025 ] In a sixth aspect there is provided a computer system comprising a computer processor and memory, the memory comprising software code stored therein for execution by the computer processor of a method for identifying or screening for common structural patterns in an interaction site of molecules, the method comprising: representing an interaction site of at least one first target molecule and of at least one second target molecule using the method of the first or fifth aspects; and comparing the representation of the interaction site of the at least one first target molecule with the representation of the interaction site of the at least one second target molecule to identify or screen for common structural patterns in the interaction sites.

[ 0026 ] In a seventh aspect there is provided a software product for representing an interaction site of a molecule. The software product may comprise, either in combination or separately in discrete functional software code units or modules: code for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; code for selecting a plurality of local regions which lie substantially within the global region; code for determining at least one local descriptor for each of the local regions; and/or code for forming a representation of the interaction site by combining the at least one local descriptor for each of the local regions. The software product may further comprise code for determining at least one global descriptor for the global region. The code for forming a representation of the interaction site may comprise code for forming a representation of the interaction site by combining the at least one global descriptor for the global region and the at least one local descriptor for each of the local regions.

[ 0027 ] Where the software product comprises code in discrete functional software code units or modules, the software product further comprises code for using one or more of the functional software code units or modules in combination to form a representation of an interaction site in a molecule.

[ 0028 ] In an eighth aspect there is provided a software product for identifying or screening for common structural patterns in an interaction site of molecules. The software product may comprise, either in combination or separately in discrete functional software code units or modules: code for representing an interaction site of at least one first target molecule and of at least one second target molecule; and code for comparing the representation of the interaction site of the target molecule with the representations of an interaction site of at least one test molecule to identify or screen for common structural patterns in the interaction sites of the at least one target molecule and the at least one test molecule.

[ 0029 ] Where the software product comprises code in discrete functional software code units or modules, the software product further comprises code for using one or more of the functional software code units or modules in combination to either identify or screen for common structural patterns in an interaction site of molecules from a representation of an interaction site in a molecule.

[ 0030 ] The code of the eighth aspect for representing an interaction site may comprise, either in combination or separately in discrete functional software code units or modules: code for selecting a global region in at least one target molecule, the global region encompassing an interaction site in the at least one target molecule; code for selecting a plurality of local regions which lie substantially within the global region; code for determining at least one local descriptor for each of the local regions; and code for forming a representation of the interaction site by combining the at least one descriptor for each of the local regions. The code may further comprise code for determining at least one global descriptor for the global region. The code for forming a representation of the interaction site may comprise code for forming a representation of the interaction site by combining the at least one descriptor for the global region and the at least one descriptor for each of the local regions.

[ 0031 ] In a ninth aspect there is provided an information database product comprising information on interaction site representations for a plurality of molecules. The interaction site representations for each of the plurality of molecules in the information database product may each be formed by the method comprising: selecting a global region in a molecule, the global region encompassing an interaction site in the molecule; selecting a plurality of local regions which lie substantially within the global region; determining at least one local descriptor for each of the local regions; and forming a representation of the interaction site of the molecule by combining the at least one descriptor for each of the local regions. The method for deriving the interaction site representations may further comprise determining at least one global descriptor for the global region. The step of forming the representation of the interaction site may comprise combining the at least one descriptor for the global region and the at least one descriptor for each of the local regions.

[ 0032 ] In a tenth aspect there is provided an information database product comprising information on a plurality of interaction site representations for each of a plurality of molecules, each of the plurality of interaction site representations for each molecule being formed by the method of the ninth aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

[ 0033 ] The present invention is described in greater detail below, by way of an example only, with reference to the accompanying figures in which:

[ 0034 ] Figure 1 is a diagrammatic illustration of the process of extracting a representation of descriptors for a protein interaction site using a single source protein (which for example may be a receptor or ligand). The contact amino acids are those that are involved in interaction, or that provide for structural integrity of the interaction. Figure 1A illustrates a 3D global region container, in this case a sphere, used to enclose structural features and optionally define global descriptors, such as structural features, of the source protein interaction site. Figure 1B illustrates the extraction of descriptors from contact elements at the global level. Figure 1C illustrates the seeding of the global region with a plurality of (possibly overlapping) local regions, in this case spherical local regions, to capture local 3D substructure features similar to those described in Figure 1B. Figure 1D illustrates the extraction of local descriptors from contact elements at the local level. Figure 1E illustrates the combination of descriptors from both global and local levels. Figure 1F illustrates the representation of descriptors of the source protein in a format suitable for use with a means which is capable of comparison with other representations of descriptors from other molecules. Prefixes used are: G - Global region descriptor, Lx - Local region descriptor for local region number x; Suffixes used are: C - Overall charge within enclosed environment, SA - Solvent accessible surface area, E - Binding energy, within enclosed environment.

[ 0035 ] Figure 2 provides a diagrammatic illustration of the process of extracting a representation of descriptors for a protein interaction site using a receptor-ligand association pair; the contact amino acids are those that are involved in interaction, or that provide for structural integrity of the interaction. Figure 2A illustrates a 3D global region container, in this example a sphere, used to select global structural features of both the receptor and the ligand binding site. The contact elements are amino acids that directly or indirectly affect a ligand-receptor interaction. Figure 2B illustrates the extraction of global descriptors from contact elements at the global level. Figure 2C illustrates the seeding of the global region with multiple (possibly overlapping) local regions to capture micro-3D substructure features similar to those described in Figure 2B. Figure 2D illustrates the extraction of local descriptors from contact elements at the local region level. Figure 2E illustrates the combination of descriptors from both global and local levels. Figure 2F illustrates the representation of descriptors in a format suitable for use with a determining means. Prefixes used are: G - Global region descriptor, Lx - Local region descriptor for local region number x; Suffixes used are: C - Overall charge within enclosed environment, SA - Solvent accessible surface area, E - Binding energy, within enclosed environment.

[ 0036 ] Figure 3 illustrates the hierarchical clustering of structural interaction characteristics for ligands binding to three different families of protein receptors, the clustering identified using the methods provided herein. Three well-defined clusters representing ligands binding to their respective receptors can be identified in this figure.

[ 0037 ] Figure 4 is a flow chart of an embodiment of the procedure for forming an interaction site representation in accordance with the aspects of the invention.

[ 0038 ] Figure 5 is a flow chart of an embodiment of the procedure for identifying and/or screening for structural patterns between two or more molecules, where one or more of the molecules may be macromolecules, in accordance with the aspects of the invention.

[ 0039 ] Figure 6 is a schematic block diagram of a general purpose computer upon which the arrangements described can be practiced.

[ 0040 ] Figures 7A and 7B show a table of sample data for the binding and non-binding sites from mAb E5.2 (PDB ID 1DVF) and HEL (PDB ID 1A2Y) of Example 2.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

[ 0041 ] Disclosed herein are methods for representing an interaction site on a target molecule. These methods may be used in methods of identifying or screening for common structural patterns in an interaction site of molecules, by comparing the representations of interaction sites in silico. Also disclosed herein are systems for practicing these methods.

[ 0042 ] The methods and systems provided herein may be used for the systematic identification of structure patterns involved in interactions between an association pair, such as between receptors and ligands or between an antibody and an epitope or between an enzyme and a substrate. The methods and systems also allow the prediction of molecule function, the prediction of receptor-ligand/antibody-epitope/enzyme-substrate binding sites, and the construction or screening of virtual ligands, receptors, antibody epitope-binding regions, epitopes, enzyme active sites, or substrates. For example, the methods and systems described herein may be used to perform screening using a ligand (i.e. binding target) interaction site alone, a receptor interaction site alone, or a ligand interaction site in combination with a receptor interaction site, if available.

[ 0043 ] The term "molecule" as described herein is intended to encompass macromolecules and compounds which interact with macromolecules via an association site.

The molecule may be a polypeptide, such as a protein, a glycoprotein, a lipoprotein or a proteoglycan. The molecule may be a polynucleotide, such as an RNA or a DNA polynucleotide. The molecule may be a polysaccharide. The molecule may be a molecule which is not a polypeptide, a polynucleotide or a polysaccharide, for example a cholesterol-based hormone. These and other molecules may be described by common features such as geometry (e.g. the position of subunits of the molecule, surface area, bonding angles) and energy (e.g. entropic free energy, van der Waals energy, electrostatic energy or others), which may be used as descriptors in the methods described herein.

[ 0044 ] A "target molecule" will be a molecule for which at least one structure is described from which three-dimensional data may be obtained. The structure may be, for example, a Nuclear Magnetic Resonance structure, an X-ray crystallography structure, or a cryo-electron microscope structure, or a structure obtained using computational modelling techniques such as homology modelling and ab initio molecular modelling techniques or any combination of these. Alternatively, sequence data may be used from which structure information of a potential test or target molecule may be determined by homology modelling or ab initio molecular modelling. In some embodiments, there will be more than one described structure of the target molecule. Where the target molecule is a polypeptide, the amino acid sequence will be known. Where the target molecule comprises polysaccharide units, the sequence and saccharide bond structure between monosaccharide residues will be known. Where a target molecule is a polynucleotide, the nucleotide sequence of the polynucleotide will be known. The target molecule will have X-Y-Z positional data for each of the residues or atoms which it contains, although the methods described herein can utilise low resolution structures or models (also known as "fuzzy" data).

[ 0045 ] Other descriptors, such as the amount and distribution of charge carried in each global or local region may be calculated from the structural data using selected molecular force fields, or may be imported from database sources.

[ 0046 ] An "interaction site" on a molecule refers to at least one site on the molecule which is known or suspected of interacting with another molecule. The interaction site is the site on the molecule which is represented by the methods described herein, to allow comparison with sites on other molecules for similarity or complementarity.

[ 0047 ] An "interaction site" consists of those parts of the target molecule which may interact (for example via contact) with an interaction site of an associated molecule when the molecule interacts with the associated molecule, such as the other component of an "association pair", or those parts of the target molecule which are essential for the structural integrity of the contact parts. The shape and charge distribution of an interaction site may provide for the recognition and specificity of an interaction between an association pair.

[ 0048 ] An "association pair" is a pair of molecules, each having an interaction site on their molecular structure, which interact with each other. Exemplary association pairs may comprise any one of, among others, receptor-ligand pairs, antibody-epitope or antibody-antigen pairs, enzyme-substrate pairs, protein-protein pairs, channel/transporter-solute pairs or cytoskeletal-protein pairs, or other donor-acceptor molecule pairs which are known to interact. As used herein, an association pair generally includes a pair of moieties and/or molecules that can be linked directly via a respective complementary interaction site on each of the pair of moieties and/or molecules. The linkage between the components of the association pair may be a non-covalent linkage formed by physical interaction or binding between the members of the association pair.

[ 0049 ] The binding interaction between members of an association pair may be driven by any suitable physical interaction(s), including but not limited to electrostatic, charge-charge (including distributed charge interactions), or ionic interactions, van der Waals interactions, hydrogen-bonding interactions, hydrophobic-hydrophilic interactions, dipole-dipole interactions, sulphide-sulphide bonding or creation of di-sulphide bridges and/or the like, and may be limited or moderated by the physical structural shape of the component moieties and/or molecules of the association pair. These interactions generally do not require covalent interactions; however, these interactions may, in some cases, be supplemented by such interactions, for example, using cross-linking reagents.

[ 0050 ] Alternatively, the linkage between the components of the association pair may be a covalent linkage, that is, one or more covalent bonds may be formed between the members of the association pair by chemical reaction. Accordingly, the components of the association pair may form a chemically reactive pair, in which both members of such a pair may be chemically modified by formation of the covalent linkage.

[ 0051 ] Additional examples of association pairs may comprise channel/transporter (eg. passive or active holes) - transported solute pairs (eg. ions, amino acids); or cytoskeletal - protein (eg. muscle fibres) pairs. In the case of antibody - epitope or antibody - antigen pairs, the antibody may comprise for example IgA, IgE, IgG, IgM or IgD immunoglobulins and the antigen may comprise for example any foreign entity such as food antigens (eg. food allergies), foreign antigens (eg. toxins from invasive bacteria, or any antigen derived from parasites, fungi, or viruses), or self antigens (eg. involved in autoimmune diseases such as rheumatoid arthritis). Also, the interaction between a ligand-receptor association pair may comprise, for example, the analysis or screening of receptor-ligand drug interactions, for example for predicting viral gene delivery activity and/or efficacy for diseases such as cancer and neuro-degenerative diseases such as Parkinson's disease, Huntington's disease or Alzheimer's disease.

[ 0052 ] Where the target molecule is a polypeptide, the interaction site may comprise those amino acids on the target molecule which contact another molecule, or which are directly involved in positioning or stabilizing the amino acids which contact the other molecule.

Where the molecule is a polypeptide the interacting site may be comprised of a contiguous sequence of amino acids, or it may be a site comprised of two or more amino acid residues which are not contiguous. Typically for an interaction site of a protein, the amino acids which make up the interaction site may be identified by substituting amino acid residues or functional groups within the subject site and determining whether there is a loss or elimination of binding strength with its association counterpart. The borders of an interaction site may be identified by experimental data such as the identification of essential amino acid residues, as described above, or through the use of solution structures of experimentally solved receptor-ligand complexes.

[ 0053 ] If the molecule is a receptor, the interaction site comprises or consists of the ligand binding site on the receptor. Conversely, if the molecule is a ligand, the interaction site comprises or consists of the receptor-binding portion of the ligand. If the molecule is an enzyme, the interaction site comprises or consists of the substrate-binding portion of the enzyme. If the molecule is an antibody, the interaction site may comprise or consist of one or more of the hypervariable regions of the antibody which interact with an epitope.

[ 0054 ] The methods provided herein include the step of selecting a global region which encompasses an interaction site on the target molecule, and the selection of a plurality of local regions which lie substantially within the global region.

[ 0055 ] Typically, the global region encompasses the entire interaction site within a three dimensional defined space. In some embodiments the shape and size of the global region is selected so as to minimise those parts of the molecule encompassed by the global region which are not involved in the interaction site, in order to limit the comparison to the region of interest of the target molecule; however, it is not critical to the methods described herein if the global region encompasses parts of the molecule not contributing to the interaction site.

[ 0056 ] Simple three dimensional shapes such as spheres are computationally less taxing and are easier to implement as global regions in the methods described herein; however, the methods described herein may be implemented using global regions of any three dimensional shape. In certain embodiments the global region may be any one of a platonic solid, a sphere, a cube, a prism, a pyramid, a cone or a cylinder, or a combination of any two or more of these. In some embodiments, the global region may itself be used as an additional descriptor. The selection of the global region may be based on either wet-lab experimental data such as point mutations, or three-dimensional solution structures of experimentally solved receptor-ligand complexes. The global region is used to determine the location of the entire binding site of interest and the location of local regions within the global region. Optionally, specific descriptors (at least one, but many more may be used) for the global region may be determined which may be used to improve the predictive performance of the method in combination with the descriptors (at least one, but many more may be used) from each of the local regions. In the methods described herein a "plurality of local regions" which lie substantially within the global region are selected. The plurality of local regions may be selected after the global region is selected, or the plurality of local regions may be selected before the global region, and then a global region is selected to substantially encompass the local regions. The local regions may be defined using standard molecular three-dimensional visualization software such as, among others, Internal Coordinate Mechanics (Abagyan and Totrov 1994, Abagyan et al. 1994), or the RasMol (http://www.RasMol.org) or Jmol (http://jmol.sourceforge.net) molecular graphics visualization software programs.

[ 0057 ] The number of local regions within the global region may be selected based on the structure of the interaction site (or sites where there is more than one interaction site within the global region) and the granularity and spacing between elements (which may be functional elements of the interaction site or non-functional "spacer" elements) of the molecule, particularly at the interaction site. The local regions are typically selected such that they are evenly distributed throughout the global region, i.e. from the centre of the global region to its boundary; however, the local regions may be selected in an irregular manner provided that at least each of the functional elements of the interaction site is encompassed by at least one local region. Selection of a particular method of selecting the local regions may be based on its accuracy, soundness, availability and computational time for a particular interaction site.

[ 0058 ] Each of the elements or sub-units of the molecule or interaction site may be residues, for example amino acid residues, nucleotide residues or monosaccharide residues for proteins, polynucleotides (eg. DNA/RNA) or polysaccharides (eg. carbohydrates) respectively. The residues which participate in an interaction between interaction sites of two or more molecules are referred to as functional residues. The local regions are chosen such that each of the functional residues, at least in the interaction site of interest, is encompassed by a local region. Each of the local regions may be centred on a functional residue of the interaction site. To ensure the most accurate prediction, each of the functional residues (eg. each of the contact amino acids in a protein) of the interaction site is to be encompassed within at least one local region.
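By way of illustration only, the selection of a spherical global region and of local regions centred on functional residues might be sketched as follows. The residue names, coordinates, the 3.5 Å local radius and all function names are hypothetical and are not taken from the disclosure; each residue is reduced to a single centroid point for simplicity.

```python
import math

# Hypothetical contact residues: (name, x, y, z) centroid coordinates in angstroms.
residues = [
    ("ARG45", 0.0, 0.0, 0.0),
    ("ASP48", 3.0, 0.0, 0.0),
    ("TYR50", 0.0, 4.0, 0.0),
    ("GLY90", 20.0, 0.0, 0.0),  # far from the interaction site
]
# Assumed to be known in advance, e.g. from point-mutation data.
functional = {"ARG45", "ASP48", "TYR50"}


def within_sphere(point, centre, radius):
    """Return True if the point lies inside or on the sphere."""
    return math.dist(point, centre) <= radius


def select_global_region(residues, functional):
    """Centre a sphere on the centroid of the functional residues,
    sized just large enough to enclose all of them."""
    pts = [(x, y, z) for name, x, y, z in residues if name in functional]
    centre = tuple(sum(c) / len(pts) for c in zip(*pts))
    radius = max(math.dist(p, centre) for p in pts)
    return centre, radius


def seed_local_regions(residues, functional, local_radius):
    """Create one spherical local region per functional residue,
    centred on that residue, and list the residues it encloses."""
    regions = []
    for name, x, y, z in residues:
        if name in functional:
            members = [n for n, a, b, c in residues
                       if within_sphere((a, b, c), (x, y, z), local_radius)]
            regions.append({"centre": name, "members": members})
    return regions


centre, radius = select_global_region(residues, functional)
local_regions = seed_local_regions(residues, functional, local_radius=3.5)
```

In this sketch the local regions overlap (ARG45 and ASP48 each fall within the other's sphere), and every functional residue is covered by at least one local region, as the paragraph above requires.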

[ 0059 ] The local regions may in some arrangements be approximately the same size as particular residues or they may be larger or smaller. The local regions may be about 1 Å or greater in diameter for a spherical local region; however, the local regions may be any convenient shape and may be a different shape to that of the global region or other local regions. The local regions need not necessarily completely fill the space of the global region; however, the local regions need to encompass each of the functional residues which are active in the interaction site. The local regions may be selected to encompass non-functional residues of the interaction site, or they may be selected to encompass empty space within the global region. The selection of local regions for non-functional residue sites may be useful in determining whether that residue is a functional residue. The selection of local regions which encompass empty space may be used to identify differences in three-dimensional shape between the interaction sites of one or more molecules.

[ 0060 ] The local regions may overlap within the global region and may overlap significantly with adjacent local regions. One or more of the functional elements/residues of the interaction site may be encompassed by one or more local regions, and may be encompassed by a significant number (ie. more than 5, more than 10, or more than 20) of local regions. The amount of overlap may be determined on the basis of the amount that adjacent or nearby residues are expected to influence or otherwise affect each other, for example the amount in which amino acids influence the conformations of each other. The local regions may extend partially beyond the boundary of the global region; however, the residues of interest within the local region will also be within the global region. Where a functional residue lies near the boundary of the global region, the local region encompassing that functional residue may extend beyond the boundary of the global region, and the local region may reside substantially outside of the global region.

[ 0061 ] It will be appreciated that the size of the interaction sites for different macromolecules will vary and that the interaction site size varies even within a particular class of macromolecule (eg. proteins), thus the total number of local regions for representation of a particular interaction site is variable depending upon the specific macromolecule of interest. One or more of the properties of the local regions, eg. the size, shape, number, amount of overlap between adjacent regions etc may be optimised to improve the performance of the method/system. The optimisation may for example be an iterative process where one or more of the properties of the local region are changed eg. increased/decreased repeatedly and the degree to which the interaction site representation agrees with the properties of a known equivalent or similar interaction site tested after each iteration to improve the performance. Machine-learning algorithms can also be used to handle fuzziness in the comparison with known or similar interaction sites to benchmark the accuracy of the interaction site representation obtained from the local regions. The performance may be benchmarked against a test dataset. The iterative process may cease when there is no further increase in accuracy of the representation to known equivalent or similar interaction site or the test dataset.
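The iterative optimisation described above might, purely as a sketch, take the form of a simple search over one property of the local regions. Here the property is the radius; the function names, step size and the toy scoring function are assumptions made for illustration, with the scoring function standing in for a benchmark against a known interaction site or test dataset.

```python
def optimise_radius(score, start=1.0, step=0.5, max_radius=15.0):
    """Increase the local-region radius until the benchmark score
    stops improving, then return the best radius and its score."""
    best_r, best_s = start, score(start)
    r = start + step
    while r <= max_radius:
        s = score(r)
        if s <= best_s:  # no further increase in accuracy: stop iterating
            break
        best_r, best_s = r, s
        r += step
    return best_r, best_s


def toy_score(radius):
    """Toy stand-in for a benchmark accuracy; it peaks at a radius of 4.0."""
    return 1.0 - abs(radius - 4.0) / 10.0


best_radius, best_score = optimise_radius(toy_score)
```

The same loop could be applied to the number of local regions or to the amount of overlap; a real benchmark may have several local maxima, in which case a broader search than this greedy one would be needed.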

[ 0062 ] The methods provided herein include the step of "determining at least one descriptor" for each of the local regions and optionally for the global region.

[ 0063 ] A "descriptor" as described herein is intended to encompass a numerical value for a measurable or calculable parameter of the molecule which may be derived from knowledge of the structure of the molecule. A descriptor is a value which may be computed or measured from a three dimensional structure of a target molecule. Non-limiting examples of descriptors include the numbers of subunits, geometry descriptors, and energy descriptors.

[ 0064 ] Descriptor parameters associated with the number of subunits may include the number of amino acid residues (for proteins), the number of monosaccharide units (for polysaccharides), the number of nucleotides (for polynucleotides) or the number of functional groups (for small molecules). Thus one descriptor for a region may be a simple count of the number of amino acid residues within the region. For geometry descriptors, measurable parameters may include the surface area, the solvent accessible surface area, the torsion angles of amino acids within a specific region, or contours (including height, horizontal, vertical angles, width). For energy descriptors, measurable parameters include free binding energy, entropic free energy, electrostatic energy, charge, van der Waals energy, torsion energy, hydrogen bonding energy, hydrophobic energy, bond stretching energy, and disulfide bonding energy.
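As a minimal sketch of two of the simplest descriptors named above, a residue count and an overall charge for a region could be computed as follows. The charge table is a deliberate simplification (formal side-chain charges at neutral pH, with histidine and termini ignored) and the function and key names are hypothetical; in practice charges would be taken from a force field or database.

```python
# Simplified formal side-chain charges at neutral pH (an assumption for
# illustration; a force field or database would normally supply these).
SIDE_CHAIN_CHARGE = {"ARG": 1, "LYS": 1, "ASP": -1, "GLU": -1}


def region_descriptors(residue_types):
    """Return a residue-count descriptor (N) and a net-charge
    descriptor (C) for the residues enclosed by one region."""
    return {
        "N": len(residue_types),
        "C": sum(SIDE_CHAIN_CHARGE.get(r, 0) for r in residue_types),
    }


local_1 = region_descriptors(["ARG", "ASP", "TYR"])
local_2 = region_descriptors(["LYS", "GLY"])
```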

[ 0065 ] The descriptors may be computed from three dimensional information using diverse techniques available in the art. Binding energy may be computed using, for example, 1) molecular dynamic algorithms and equations such as the Gibbs free energy change on binding $\Delta G = -RT \ln(K_a)$ where $R$ is the universal gas constant, $T$ is the temperature (K), and $K_a$ is the equilibrium binding constant between protein and ligand; 2) partitioning the binding energy into biophysical energy terms such as $\Delta G = \alpha \Delta G_H + \beta \Delta G_S + \gamma \Delta G_{EL} + C$; or 3) knowledge-based scoring functions such as using pair-wise potentials (Gohlke and Klebe, 2001). The accessible surface area of a region of a molecule may be measured by tracing out the maximum permitted van der Waals' contact that is covered by the center of a water molecule as it rolls over the surface of the target molecule (Tong et al., 2007). Torsion angles may be computed from 3D coordinates (Hao et al., 2007).
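The first of these relations, the Gibbs free energy change expressed as minus RT times the natural logarithm of the equilibrium binding constant, can be evaluated directly. The sketch below assumes the constant is expressed relative to a 1 M standard state so that it is dimensionless, and uses a hypothetical value of 1e9 for a tight binder; the function name and default temperature are illustrative choices.

```python
import math

R = 8.314462618e-3  # universal gas constant in kJ/(mol*K)


def binding_free_energy(k_binding, temperature=298.15):
    """Gibbs free energy change on binding, dG = -R*T*ln(K), in kJ/mol.

    k_binding is the (dimensionless) equilibrium binding constant and
    temperature is in kelvin.
    """
    if k_binding <= 0:
        raise ValueError("equilibrium constant must be positive")
    return -R * temperature * math.log(k_binding)


dg = binding_free_energy(1.0e9)  # hypothetical binding constant
```

For this hypothetical constant the result is roughly -51 kJ/mol at 298.15 K, a negative value, as expected for favourable binding.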

[ 0066 ] Typically the at least one descriptor used in the methods described herein will include one or more of the surface area, the free binding energy, and the number of amino acids or functional groups within a region.

[ 0067 ] Each descriptor for each region is assigned a value based on its calculation. This value will typically be a numerical value. Although a variety of alternative calculation methods will be available for determining descriptors, the particular calculation method used will not be critical to the methods described herein provided the calculation method used is consistent between the target molecule and the molecule being compared with the target molecule.

[ 0068 ] Examples of software which may be used to calculate descriptors include Internal Coordinate Mechanics (Abagyan and Totrov 1994, Abagyan et al. 1994), SURFNET (Laskowski, RA, available from University College London <http://www.biochem.ucl.ac.uk/~roman/surfnet/surfnet.html>), HBPLUS (McDonald and Thornton, 1994), LIGPLOT (Andrew Wallace and Roman Laskowski, available from University College London), AutoDock (Morris et al., 1998), GOLD (available from The Cambridge Crystallographic Data Centre), GLIDE (Friesner et al., 2004) and FlexX (available from BioSolveIT GmbH).

[ 0069 ] The values for each of the descriptors will depend on the type of descriptors used. For example, binding energies are typically in the negative range, surface areas are normally greater than zero (>0), charges may be positive, 0 (neutral) or negative.

[ 0070 ] The minimum number of descriptors for any given region will be one. More typically, however, there will be a plurality of descriptors for each region. In particular embodiments there will be at least 2, 3, 4, 5, 6, 7, 8, 9 or 10 descriptors for any given global or local region.

[ 0071 ] The descriptors which are used to characterise the global region may be the same or different to the descriptors used to characterise each of the local regions. The number of descriptors used to characterise the global region may be the same or different to the number of descriptors used to characterise each of the plurality of local regions. Different sets and combinations of global descriptors and local descriptors may be used, provided the same global descriptors are used in a comparison between molecules and the same local descriptors are used in a comparison between molecules. A global descriptor, such as overall charge, may also be used as a local descriptor, although the value of the descriptor may be different for the global region and each of the local regions.

[ 0072 ] In particular embodiments, a global or local descriptor is selected from the group consisting of charge, solvent accessible surface area, van der Waals energy, hydrophobic energy, electrostatic energy, torsion energy, free binding energy, the number of amino acid residues within the defined enclosed environment, and torsion angles.

[ 0073 ] When one or more descriptors for the global region are used, the step of calculation of the plurality of global descriptors for the global region and local descriptors for each of the local regions may be carried out sequentially, with either the global region or the local regions calculated first, or simultaneously.

[ 0074 ] The methods provided herein comprise the step of forming a representation of the interaction site by combining the at least one descriptor for each of the local regions. The representation of the interaction site thus comprises a series of numerical values of parameters which are characteristic of the interaction site and which may be presented in a form which is readily compared with representations of other interaction sites or possible interaction sites, for example by computer means. When one or more descriptors for the global region are used, the representation of the interaction site may be formed by combining the descriptor(s) for the global region and the at least one descriptor for each of the local regions.

[ 0075 ] In particular embodiments, a representation of the interaction site comprises providing the descriptors as a string of values or comprises providing the descriptors as a matrix of values. These embodiments are illustrated in the Examples below. Thus the representation of the interaction site preserves each of the values of the descriptors for each region.
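
By way of non-limiting illustration only, the matrix and string forms may be sketched in Python as follows. The descriptor names, the numerical values and the delimiters used in the string are assumptions made for this illustration.

```python
# Hypothetical descriptor set; the same set is used for every region.
DESCRIPTORS = ["surface_area", "charge", "binding_energy"]

def as_matrix(global_desc, local_descs):
    """Row 0 holds the global region; each later row holds one local region."""
    rows = [global_desc] + list(local_descs)
    return [[region[name] for name in DESCRIPTORS] for region in rows]

def as_string(matrix):
    """Flatten the matrix row by row into one delimited string of values."""
    return ";".join(" ".join(f"{v:g}" for v in row) for row in matrix)

# Illustrative values only.
global_region = {"surface_area": 850.0, "charge": -1.0, "binding_energy": -7.5}
local_regions = [
    {"surface_area": 120.0, "charge": 0.0, "binding_energy": -1.2},
    {"surface_area": 95.5, "charge": -1.0, "binding_energy": -2.0},
]

matrix = as_matrix(global_region, local_regions)
encoded = as_string(matrix)   # "850 -1 -7.5;120 0 -1.2;95.5 -1 -2"
```

Both forms keep one value per descriptor per region, so two sites built from the same descriptor set can be compared position by position.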

[ 0076 ] In certain aspects, the methods described herein involve the comparison of two or more representations of interaction sites. Such comparisons may be performed using matrix comparison techniques, exact matching, fuzzy matching, or using machine-learning techniques such as artificial neural networks (ANN), support vector machines (SVM), hidden Markov models (HMM) or genetic algorithms. Typically, a system will be trained based on the descriptors of the target molecule and the same set of descriptors will be used to screen other (test) molecules to identify potential similarities. It is preferable that multiple structures are used to train the system, in which case techniques such as machine-learning, fuzzy matching or matrix comparison may be used; however, it is possible to operate the system with a single structure, i.e. the target molecule, by utilising either exact or fuzzy matching techniques. The representations of the interaction sites to be compared are formed from the same set of descriptors derived from the selection of the same global and local regions around each of the interaction sites. The comparison of representations may be a direct one-to-one comparison of the descriptors in each of the corresponding global and local regions, or the descriptors may be compared using fuzzy logic matching or machine-learning techniques, for example by allowing a range on the descriptor value which indicates a match.
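
By way of non-limiting illustration only, a direct one-to-one comparison and a range-based fuzzy comparison may be sketched in Python as follows. The function names, the descriptor values and the per-descriptor tolerances are assumptions made for this illustration.

```python
def exact_match(target, test):
    """True only if every corresponding descriptor value is identical."""
    return len(target) == len(test) and all(a == b for a, b in zip(target, test))

def fuzzy_match(target, test, tolerances):
    """True if each test value lies within +/- tolerance of the target value."""
    if not (len(target) == len(test) == len(tolerances)):
        raise ValueError("representations must use the same set of descriptors")
    return all(abs(a - b) <= tol for a, b, tol in zip(target, test, tolerances))

# Illustrative values: surface area, charge, binding energy.
target = [850.0, -1.0, -7.5]
candidate = [832.0, -1.0, -7.1]
tolerances = [25.0, 0.0, 0.5]   # hypothetical allowed range per descriptor

is_exact = exact_match(target, candidate)              # False
is_fuzzy = fuzzy_match(target, candidate, tolerances)  # True
```

A tolerance of zero reduces the fuzzy comparison to an exact comparison for that descriptor, which is how the charge is treated in this example.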

[ 0077 ] The descriptors in the representations of either of the interaction sites under comparison are first computed independently and compiled in a suitable representation as above for comparison. Each of the descriptors may be weighted or otherwise manipulated appropriately prior to the comparison. For instance, the descriptors may indicate a match or potential similarity where the descriptors for the target molecule are complementary to the corresponding descriptors of the test molecule(s), such as for example complementary charge (positive and negative) matches for interaction sites in a receptor-ligand pair. Further examples of complementary descriptors in receptor-ligand pairs may include geometry and binding energy and the descriptors for potential interaction sites in the receptor-ligand pair may be weighted or otherwise manipulated appropriately in accordance with known functions of interaction site processes.

[ 0078 ] Also provided are methods of identifying or screening for common structural patterns in interaction sites of molecules, comprising representing the interaction site of at least one first molecule and the interaction site of at least one second molecule using the method as described above, and comparing the representations to identify or screen for common structural patterns in the interaction sites. Common structural patterns in the interaction sites may, for example, include geometry/shape of the molecule at the interaction site (i.e. three dimensional structure), charge distribution (including both sign and magnitude) and hydrophilicity/hydrophobicity amongst others. Comparisons of the structural patterns between interaction sites may be performed using linear combinations and/or matrices comprising numerical values for each of the descriptors used in the representations of the interaction sites. The comparison of descriptors may use exact matching, fuzzy matching (such as within a desired range of values), or machine-learning techniques such as artificial neural networks, support vector machines, hidden Markov models and genetic algorithms. The output from the comparison of the representations may be presented as a numeric score indicative of the probability that the interaction site is a binding site or the molecule is a potential binder. As a result of the comparison between the interaction sites, the interaction sites for each of the test molecules may be ranked on the basis of their probability of being a binding site. Also, the comparison of the interaction sites of the test molecules may be used to predict whether a particular interaction site will behave similarly to or better than an alternative interaction site, where the alternative interaction site may (although not necessarily) reside on the same molecule.
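
By way of non-limiting illustration only, the scoring and ranking of test interaction sites may be sketched in Python as follows. The score used here, the fraction of descriptors falling within an allowed range, is an assumption chosen for simplicity and is not a calibrated probability; the site names and values are likewise hypothetical.

```python
def match_score(target, test, tolerances):
    """Fraction of descriptors whose test value lies within the allowed range."""
    hits = sum(abs(a - b) <= tol for a, b, tol in zip(target, test, tolerances))
    return hits / len(target)

def rank_sites(target, test_sites, tolerances):
    """Return (name, score) pairs sorted from best to worst score."""
    scored = [(name, match_score(target, vec, tolerances))
              for name, vec in test_sites.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative values: surface area, charge, binding energy.
target = [850.0, -1.0, -7.5]
tolerances = [25.0, 0.0, 0.5]
test_sites = {
    "site_A": [600.0, 1.0, -3.0],
    "site_B": [840.0, -1.0, -7.3],
    "site_C": [845.0, 0.0, -7.4],
}

ranking = rank_sites(target, test_sites, tolerances)
# site_B matches all three descriptors, site_C two, site_A none.
```

A trained determining means, such as one of the machine-learning techniques named above, could replace match_score without changing the ranking step.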

[ 0079 ] It will be appreciated that the comparison may be carried out on a receptor-ligand pair which is in the act of interacting and also on each component separately, and the interaction site representation is also of use for screening for only one of the pair (i.e. either the receptor or the ligand in isolation) based on similarities between components based on the initial data such as the three-dimensional structure of the macromolecule. Also, since descriptors such as the geometry and binding energy of interacting pairs are complementary, the properties of the target macromolecule may be manipulated (eg. by a weighting or inverse function) and the modified descriptors used to screen for potential test macromolecules likely to interact with the target molecule. This is particularly useful as typically only one component (either the receptor or ligand) is available for initial determination of the interaction site representation.

[ 0080 ] Virtual ligand screening using descriptors derived from three-dimensional structures allows capturing of sophisticated patterns that define receptor-ligand binding. Important functional units (eg. functional residues) can be described by structure patterns, analogous to the patterns of (protein) amino acid sequences or functional groups of chemicals.

[ 0081 ] Families of proteins and synthetic catalysts with the same function (represented by the same or similar structures) may be formed. It may be possible to find a signature (pattern) derived from the protein structure which is characteristic of its function. Such descriptors can be used to match against a new protein of unknown function, and the outcome of the comparison may be indicative of the presence or absence of the particular function. These patterns can be used to classify protein structures into structure families, for example by considering the occurrence of common arrangements of secondary structure elements in the core of proteins, as described by Koch et al. (1996). Structure patterns can also be used to infer the function of a protein, for example the "coordinate templates" for finding Ser-His-Asp catalytic triads in the serine proteinases and lipases as reported in Wallace et al. (1996), and to screen for candidate binding partners.

[ 0082 ] The methods described above may be used to identify or predict ligand and/or receptor interaction sites, or antibody and/or epitope interaction sites, or enzyme and/or substrate interaction sites using descriptors derived from the three-dimensional structure of the ligand, receptor, receptor-ligand complex, or antibody, epitope or antibody-epitope complex, or enzyme, substrate or enzyme-substrate complex respectively. The following uses of the methods described above are proposed in the context of receptor-ligand interactions, but should not be construed as suggesting that they are limited to only receptor-ligand interactions.

[ 0083 ] The methods may be used to screen a binding candidate to a particular ligand or receptor for which no experimental data or three-dimensional structure is available, but which may be refined by inclusion of new experimental data.

[ 0084 ] The methods may be used for the construction of a three-dimensional binding candidate to a particular ligand or receptor for which no experimental data or three-dimensional structure is available, but which may be refined by inclusion of new experimental data.

[ 0085 ] The methods may be used to predict the activity of molecules for which no experimental data is available, but which may be refined by inclusion of new experimental data.

[ 0086 ] The methods may be used for large-scale, high-throughput screening of molecules which can be generalized for prediction of receptor-ligand interactions for various receptor families.

[ 0087 ] The methods may be used to study the phylogeny of protein families using descriptors derived from the three-dimensional structure of the ligand, receptor or receptor-ligand complex.

[ 0088 ] The methods described above are based on the use of descriptors, such as functional features, derived from the three-dimensional structure of a molecule, such as a polypeptide or a protein-protein complex, for the prediction of interaction site patterns or binding activity. The methods described may be used in building a single model which can predict ligand binding to a multiplicity of different receptors, and/or may be used in the construction of three-dimensional representations of molecule structural formations, particularly known or potential interaction sites or vice versa. The methods may facilitate cyclical refinement of predictive models for improved accuracy by inclusion of new experimental data. In addition the methods may facilitate high accuracy predictions of ligand binding to receptor molecules for which no experimental data are available. The methods also enable large-scale, high-throughput screening of receptor-binding ligands and have the advantage of being adapted easily for the prediction of receptor-ligand interactions for various receptor families. The methods may be generalised to the prediction of a wide variety of types of molecular interactions, for example interactive pair interactions such as between receptor-ligand complexes, antibody-epitope complexes or enzyme-substrate complexes.

[ 0089 ] In another aspect the invention relates to a computer program, residing on a computer-readable medium, for identifying molecule interaction sites comprising instructions or code for causing a computer to represent an interaction site (for example a ligand or receptor interaction site; or alternatively an antibody or epitope interaction site; or alternatively still an enzyme or substrate interaction site) by descriptors using a probabilistic approach or other deterministic means such as multiple regression, artificial neural network (ANN), hidden Markov model (HMM) or other dynamic Bayesian network model, support vector machines (SVM) or alternatively a fuzzy means for representing descriptors as would be appreciated by the skilled addressee. Where necessary, each descriptor may be associated with a variance in order to provide a degree of relaxation.
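
By way of non-limiting illustration only, the association of each descriptor with a degree of relaxation may be sketched in Python as follows. The sketch estimates a mean and a sample standard deviation for each descriptor from several known sites and accepts a test site if every value lies within k standard deviations of the mean. The use of the standard deviation rather than the variance itself, the factor k and all values shown are assumptions made for this illustration.

```python
from statistics import mean, stdev

def train(training_vectors):
    """Per-descriptor mean and sample standard deviation across known sites."""
    columns = list(zip(*training_vectors))
    return [mean(c) for c in columns], [stdev(c) for c in columns]

def within_relaxation(test, means, spreads, k=2.0):
    """True if every test value lies within k standard deviations of the mean."""
    return all(abs(x - m) <= k * s for x, m, s in zip(test, means, spreads))

# Illustrative values: surface area, charge, binding energy for three known sites.
known_sites = [
    [850.0, -1.0, -7.5],
    [830.0, -1.0, -7.0],
    [870.0, -2.0, -8.0],
]
means, spreads = train(known_sites)

accepted = within_relaxation([860.0, -1.0, -7.2], means, spreads)
rejected = within_relaxation([700.0, -1.0, -7.2], means, spreads)
```

At least two known sites are needed for the sample standard deviation to be defined; with a single structure, fixed tolerances would have to be supplied instead.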

[ 0090 ] The computer program may be generally described by the following pseudo code:

Select known/possible interaction site of molecule
Input interaction site elements (eg. residues)
Compute locations of global and local regions in three dimensional space
Optionally compute and print interaction site descriptors for global region
For each local region
    Compute properties
    Print properties
End For
Combine descriptors to form interaction site representation
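
By way of non-limiting illustration only, the pseudo code above may be sketched in Python as follows. The sketch assumes spherical regions, a global sphere centred on the centroid of the site residues, one local sphere per residue, and two descriptors (net charge and residue count); the residue records, radii and coordinates are hypothetical.

```python
import math

def describe(residues, centre, radius):
    """Net charge and residue count for the residues inside a sphere."""
    inside = [r for r in residues if math.dist(r["xyz"], centre) <= radius]
    return [sum(r["charge"] for r in inside), len(inside)]

def represent_site(residues, local_radius=4.0):
    # Compute locations of global and local regions in three dimensional space.
    n = len(residues)
    centroid = tuple(sum(r["xyz"][i] for r in residues) / n for i in range(3))
    global_radius = max(math.dist(r["xyz"], centroid) for r in residues)
    # Optionally compute descriptors for the global region.
    representation = [describe(residues, centroid, global_radius)]
    # For each local region (here one sphere per residue), compute properties.
    for r in residues:
        representation.append(describe(residues, r["xyz"], local_radius))
    # Combine descriptors to form the interaction site representation.
    return representation

# Input interaction site elements (hypothetical residues and coordinates).
site = [
    {"name": "ASP", "xyz": (0.0, 0.0, 0.0), "charge": -1},
    {"name": "HIS", "xyz": (3.0, 0.0, 0.0), "charge": 0},
    {"name": "LYS", "xyz": (9.0, 0.0, 0.0), "charge": 1},
]
rep = represent_site(site)
```

The first row of the result describes the global region and each following row describes one local region, in the order in which the residues were supplied.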

[ 0091 ] The computer program may further include instructions or code to represent an interaction between one or more molecules by combining and/or comparing representations of interaction sites of, for example, a receptor interaction site and a ligand interaction site.

[ 0092 ] The computer program may further include instructions or code to train the computer or other determining means with representations characterizing at least one interaction site of one or both molecules of an association pair, where the location(s) of the interaction site(s) may be known or estimated.

[ 0093 ] The computer program may further include instructions or code to apply representations of at least one test association pair (eg. ligand-receptor) interaction of unknown interaction site(s) and/or unknown structure, using the same representation form as used in training the computer or other determining means.

[ 0094 ] The computer program may further include instructions or code to analyse each applied test association pair interaction in order to predict the interaction site(s) of each test interaction. Preferably, the determining means is selected from the group using probabilistic means, fuzzy means or multiple regression means, however other means may be employed such as direct or complementary matching means. The computer program may optionally comprise instructions or code for manipulating one or more of the descriptors, for example adding weights to one or more descriptors of the interaction site(s), prior to analysis.

[ 0095 ] In certain embodiments the methods and systems described herein utilize properties of amino acids, monosaccharides or nucleic acids which are enclosed within a three-dimensional region, such as charge, energetics, solvent accessible surface area and other properties, as descriptors. The characterisation of an interaction site of a molecule based solely on examination of one or two components of a pair in isolation may utilize a probabilistic approach or other deterministic means such as multiple regression, SVM, ANN or HMM for representing descriptors.

[ 0096 ] Characterization based on complexes of an association pair, such as a receptor-ligand complex, may combine both representations of ligand and receptor for each single training data point and is thus based on the characteristics of the descriptors derived from receptor-ligand interactions rather than on the characteristics of either ligand or receptor component in isolation.

[ 0097 ] Building an application for detecting structure patterns in molecule (eg. ligand or receptor) interaction sites using descriptors typically involves several stages: a) Representing a known association pair (eg. receptor-ligand) interaction site in a format useful for training a determining means; b) Training the determining means; c) Representing an unknown (or test) association pair interaction site in the same format as defined in step a); d) Predicting the interaction site of the unknown association pair interaction. The outcome of the comparison may be indicative of the presence or absence of the particular function. In the presence of a particular function, the derived descriptors can be used for the construction of a three-dimensional structure of a virtual target.

[ 0098 ] The method and systems described herein can generally be defined in four discrete but inter-related portions.

[ 0099 ] The first portion is the utilization of amino acid properties (such as charge, energetics, solvent accessible surface area, among others) enclosed within a three-dimensional container (such as a sphere, cube, pyramid, cylinder, among others) as descriptors.
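
By way of non-limiting illustration only, the enclosure of residues by containers of different shapes may be sketched in Python as follows. Each container is expressed as a membership test on a coordinate, so that a sphere and an axis-aligned cube can be interchanged; the residue records, the container sizes and the restriction to these two shapes are assumptions made for this illustration.

```python
import math

def in_sphere(point, centre, radius):
    """Sphere: distance from the centre does not exceed the radius."""
    return math.dist(point, centre) <= radius

def in_cube(point, centre, half_edge):
    """Axis-aligned cube: every coordinate within half_edge of the centre."""
    return all(abs(p - c) <= half_edge for p, c in zip(point, centre))

def enclosed(residues, contains):
    """Residues whose coordinates satisfy the container membership test."""
    return [r for r in residues if contains(r["xyz"])]

# Hypothetical residues and coordinates.
residues = [
    {"name": "SER", "xyz": (1.0, 1.0, 1.0), "charge": 0},
    {"name": "ASP", "xyz": (2.5, 2.5, 2.5), "charge": -1},
    {"name": "ARG", "xyz": (6.0, 0.0, 0.0), "charge": 1},
]
centre = (0.0, 0.0, 0.0)

sphere_hits = enclosed(residues, lambda p: in_sphere(p, centre, 3.0))
cube_hits = enclosed(residues, lambda p: in_cube(p, centre, 3.0))
cube_charge = sum(r["charge"] for r in cube_hits)   # a charge descriptor
```

With these values the cube encloses a residue near its corner that the sphere of the same half-width excludes, which shows that the choice of container shape can change the resulting descriptor values.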

[ 0100 ] The second portion is the description and the representation of the association pair (eg. receptor-ligand) interactions by using descriptors derived from a receptor-ligand complex (or a series of similar complexes with the same receptor but different ligands; or vice-versa). Taking the example of a receptor-ligand interaction, the steps of the present invention include, but are not limited to, the following:
a) Identification of the contact elements in a three-dimensional structure of a receptor-ligand complex from a representative known structure using an enclosed three-dimensional (3D) container (which may be, but is not limited to, a sphere, cube, pyramid or cylinder). The contact elements are amino acids that directly or indirectly affect a ligand-receptor interaction.
b) Extract descriptors from contact elements at macro- and micro-levels. A macro-3D container (e.g. sphere, cube, pyramid, cylinder, among others) is used to detect global structural features of both receptor and ligand binding site. In addition, the macro-3D container is further seeded with multiples of (possibly overlapping) micro-containers to capture micro-3D substructure features or profiles (e.g. charge distribution, contour, solvent accessible surface area, energetics, torsion angles). Macro- and micro-features detected by all containers are mapped into descriptors.
c) Represent descriptors in a format suitable for use with a determining means.
d) Train the determining means.
e) Represent a protein target (which may be a ligand or receptor in isolation) of unknown interaction site in the format suitable for use with the determining means (following the procedure described in steps a) to c)).
f) Predict the interaction site of the unknown target.

[ 0101 ] The third portion is the description and the representation of the target binding candidate to a source protein (which may be a receptor or ligand) using descriptors derived from the source in isolation. Taking the example of a receptor-ligand interaction the steps of the present invention include, but are not limited to, the following:

a) Identification of the contact elements in a three-dimensional structure of a source receptor or source ligand in isolation from a representative known structure using an enclosed three-dimensional (3D) container (which may be, but is not limited to, a sphere, cube, pyramid or cylinder). The contact elements are amino acids that directly or indirectly affect a ligand-receptor interaction.
b) Extract descriptors from contact elements at macro- and micro-levels. A macro-3D container (e.g. sphere, cube, pyramid, cylinder, among others) is used to detect global structural features of both receptor and ligand binding site. In addition, the macro-3D container is further seeded with multiples of (possibly overlapping) micro-containers to capture micro-3D substructure features or profiles (e.g. charge distribution, contour, solvent accessible surface area, energetics, torsion angles, among others). Macro- and micro-features detected by all containers are mapped into descriptors.
c) Obtain a complementary relationship, which may be, but is not limited to, an inverse or weighted relationship, of appropriate descriptors (e.g. change a positive charge of +1 to a negative charge of -1) to obtain (possibly fuzzy) descriptors for the target binding partner of the association pair. The derived descriptors are used to describe a virtual binding site (VBS) to the source protein.
d) Represent descriptors in a format suitable for use with a determining means.
e) Train the determining means.
f) Represent a protein of unknown interaction site in the format suitable for use with the determining means (following the procedure described in steps a) to c)).
g) Predict novel binding candidates or interaction site to the source protein where only experimental 3D data of the source is available.
h) The VBS is also applicable for construction of three-dimensional structures/profiles of virtual targets.
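
By way of non-limiting illustration only, step c) may be sketched in Python as follows. Only the inversion of charge is taken from the description above; the choice of which other descriptors are copied or weighted, the weighting factor and the descriptor values are assumptions made for this illustration.

```python
# Hypothetical per-descriptor transforms from source site to virtual binding site.
COMPLEMENT = {
    "charge": lambda v: -v,                    # +1 on the source pairs with -1
    "surface_area": lambda v: v,               # copied unchanged (assumption)
    "hydrophobic_energy": lambda v: 0.8 * v,   # illustrative weighting only
}

def virtual_binding_site(source_regions):
    """Apply the complementary transform to every descriptor of every region."""
    return [{name: COMPLEMENT[name](value) for name, value in region.items()}
            for region in source_regions]

# Illustrative local-region descriptors of a source protein in isolation.
source = [
    {"charge": 1.0, "surface_area": 120.0, "hydrophobic_energy": -2.5},
    {"charge": -1.0, "surface_area": 95.0, "hydrophobic_energy": -1.0},
]

vbs = virtual_binding_site(source)
```

The resulting descriptors can then be represented and compared in the same way as descriptors computed directly from a structure, as set out in steps d) to g).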

[ 0102 ] The fourth portion is the training of derived descriptors using statistical means such as probabilistic function, artificial neural network, hidden Markov model, multiple regression or Bayesian network.

[ 0103 ] A computer-based general system and method for prediction of protein interaction sites operates as follows. In one aspect, the methods described herein perform the process of forming an interaction site representation of a molecule as depicted in one arrangement as method 100 presented in flow chart form in Figure 4.

[ 0104 ] The method 100 for representing an interaction site on a target molecule comprises selecting 100 a global region in at least one target molecule. The global region may encompass an interaction site on the target molecule. A plurality of local regions may further be selected 105. The plurality of local regions may lie substantially within the global region. The plurality of local regions may be partially overlapping with one, two, three, four, five, six or more adjacent local regions.

[ 0105 ] One or a plurality of descriptors are then determined 109 for each of the local regions. Optionally, at least one descriptor for the global region may be determined 107, which may be determined either sequentially or simultaneously to the determination of the local descriptors. The determination 107 of the global descriptors may optionally be performed prior to the selection 105 of the local regions of interest. In some arrangements, the selection 105 of the local regions of interest may be determined from or influenced by the global descriptors.

[ 0106 ] The at least one descriptor for each of the local regions is then combined 111 to form a representation of the interaction site. Where one or more global descriptors have been determined, the representation of the interaction site is formed from a combination of the at least one global descriptor and the at least one descriptor for each of the local regions.

[ 0107 ] In a further aspect, the methods described herein perform the process of identifying and/or screening for common structural patterns between two or more molecules as depicted in one arrangement as method 200 presented in flow chart form in Figure 5.

[ 0108 ] The method 200 may in one arrangement comprise selecting 201 a target molecule with at least one target interaction site. A representation of the at least one target interaction site is formed by the method 100 of Figure 4. At least one test molecule having at least one test interaction site is then selected 203 and a representation of the at least one test interaction site is formed, again by the method 100 of Figure 4. Optionally, one or more additional test molecules and/or test interaction sites may be selected 205 and a site representation of the known and/or potential interaction sites for each additional test molecule and/or test interaction site may be formed by the method 100 of Figure 4.

[ 0109 ] The representation of the target interaction site is then compared 207 with the representation of the test interaction site(s) to identify 209 or screen for common structural patterns in the interaction sites. Optionally, a test molecule may then be selected 211 on the basis of the comparison 209 where a favourable interaction is found.

[ 0110 ] Further, although process steps, method steps, algorithms or the like as described above may be described in a sequential order, such processes, methods and algorithms may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously.

[ 0111 ] It will be readily apparent that the various methods and algorithms described herein may be implemented by, e.g., appropriately programmed general purpose computers and computing devices. Typically a processor (e.g., a microprocessor) will receive instructions from a memory or like device, and execute those instructions, thereby performing a process defined by those instructions. Further, programs that implement such methods and algorithms may be stored and transmitted using a variety of known media.

[ 0112 ] The method 100 may be implemented using a computer system 400, such as that shown in Figure 6, wherein the processes of Figures 4 or 5 may be implemented as software, such as one or more application programs executable within the computer system 400. In particular, the steps of the method 100 of forming an interaction site representation of a molecule or the method 200 of identifying and/or screening for common structural patterns between two or more molecules are effected by instructions in the software that are carried out within the computer system 400.

[ 0113 ] The software may be stored in a computer readable medium, including the storage devices described below, for example. The software is loaded into the computer system 400 from the computer readable medium, and then executed by the computer system 400. A computer readable medium having such software or computer program recorded on it is a computer program product. The use of the computer program product in the computer system 400 preferably effects an advantageous apparatus for performing the steps of the method described herein.

[ 0114 ] As seen in Figure 6, the computer system 400 is formed by a computer module 401, input devices such as a keyboard 402 and a mouse pointer device 403, and output devices including a printer 415, and a display device 414.

[ 0115 ] The computer module 401 typically includes at least one processor unit 405, and a memory unit 406. The module 401 also includes a number of input/output (I/O) interfaces including a video interface 407 that couples to the video display 414, an I/O interface 413 for the keyboard 402 and mouse 403, and an interface 408 for the printer 415. Computer readable medium and/or storage devices 409 are provided and typically include a hard disk drive (HDD) and optical disk drive.

[ 0116 ] The components 405 to 413 of the computer module 401 typically communicate via an interconnected bus 404 and in a manner which results in a conventional mode of operation of the computer system 400 known to those in the relevant art. Typically, the application programs discussed above are resident on the hard disk drive and read and controlled in execution by the processor 405. Intermediate storage of such programs and any data generated may be accomplished using the memory 406, possibly in concert with the hard disk drive.

[ 0117 ] The term "computer-readable medium" as used herein refers to any medium that participates in providing data (e.g., instructions) which may be read by a computer, a processor or a like device. Such a medium may take many forms, including but not limited to, non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks and other persistent memory. Volatile media include dynamic random access memory (DRAM), which typically constitutes the main memory. Transmission media include coaxial cables, copper wire and fibre optics, including the wires that comprise a system bus coupled to the processor. Transmission media may include or convey acoustic waves, light waves and electromagnetic emissions, such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EEPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

[ 0118 ] Various forms of computer readable media may be involved in carrying sequences of instructions to a processor. For example, sequences of instructions (i) may be delivered from RAM to a processor, (ii) may be carried over a wireless transmission medium, and/or (iii) may be formatted according to numerous formats, standards or protocols, such as Bluetooth, TDMA, CDMA, and 3G.

[ 0119 ] The methods described herein may alternatively be implemented in dedicated hardware such as one or more integrated circuits performing the functions or sub-functions of the methods eg. either or both of the methods 100 or 200.

[ 0120 ] For the example of interaction between a receptor-ligand association pair, identification of contact (interaction) sites of the receptor-ligand complex from structural models facilitates representation of the receptor-ligand interaction by combining representations of the receptor interaction site and the ligand interaction site. Using such receptor-ligand representations, a determining means (such as probability density function, multiple regression system, ANN, HMM, SVM, among others) is trained with input data characterizing instances of receptor-ligand interactions of known 3D structure. After training, test data representing a test protein (which may be a receptor or ligand) of unknown interaction site (using the same representation form as for training the determining means) is applied and analyzed to predict the interaction site of the test protein. The method of the present invention may be set out in the form of a computer program, residing on a computer-readable medium, and may be implemented using a computer programmed with one of the above mentioned determining means.

[ 0121 ] As indicated above, the present invention uses descriptors which may be input to a probabilistic function, an artificial neural network, a hidden Markov model, multiple regression or a Bayesian network. The use of such a technique facilitates cyclical refinement of predictive models for improved accuracy by inclusion of new experimental data as it becomes available. If no experimental data with which to train the determining means is available, the training process may be based on estimated binding affinity produced using other methods. For example, if binding activity of a ligand-receptor interaction is unknown, but there is experimental evidence of biological activity, a reasonable estimate of binding affinity can be deduced and used for training a predictive model.

[ 0122 ] The systems and methods provided herein are generally applicable to data sets based on any type of association pair interaction, such as interactions between, among others, receptor-ligand, antibody-epitope or antibody-antigen, enzyme-substrate, protein-protein, channel/transporter-solute or cytoskeletal-protein interaction.

[ 0123 ] The methods and systems provided herein may be applicable to, but are not limited to, the identification of novel interaction sites of components of association pairs comprising pairs of moieties and/or molecules, for example receptors or ligands, the identification of unknown binding counterparts eg. of a known receptor or ligand, the identification of unknown and secondary therapeutic targets of drugs, drug leads, drug candidates, natural products, the identification of novel receptor or ligand molecule with similar interaction site as the source or target molecule, the prediction of drug targets related to side effect and toxicity (including drug safety evaluation), the prediction of targets to drug ADME (pharmacokinetics), and the construction of virtual binding targets in docking simulations.

Data preparation and training

[ 0124 ] Receptor families typically share a common conserved structure. The representations of known interaction residues derived from three-dimensional models can therefore be used to train a system for prediction of interaction site and relative binding to a range of related receptors. Two techniques for representing interaction sites using descriptors derived from a single protein in isolation and from receptor-ligand complexes are illustrated:

[ 0125 ] The representation of an interaction site for a protein (receptor or ligand) in isolation may be expressed as [G-P_n][Lx_m-P_n][BA], where G-P_n stands for the global descriptor for amino acid property P_n, where n is an integer between 1 and the total number of defined properties; Lx_m-P_n stands for the local descriptor enclosed by micro-container x_m for amino acid property P_n, where m is an integer between 1 and the total number of micro-containers and n is an integer between 1 and the total number of defined properties; and BA (optional) stands for the strength of the interaction (the binding affinity or weight) where applicable.

[ 0126 ] The descriptors may be calculated using a variety of different methods, depending on the nature of the descriptor. For example, using suitable three-dimensional information, binding energies may be computed using molecular dynamics algorithms, partitioning of the binding energy into biophysical energy terms, or knowledge-based scoring functions; the accessible surface area of a region of interest of a molecule may be measured by tracing out the maximum permitted van der Waals' contact that is covered by the centre of a water molecule as it rolls over the surface of the protein; and the torsion angles of amino acids may be computed from three-dimensional XYZ coordinates using mathematical formulas. The interaction site may be represented by collating the values for each of the descriptors together in either a linear combination or a matrix representation. In some arrangements, a continuous string of numerical digits may be suitable for representing the descriptors of the interaction site. The descriptors may be used for training a computer or other determining means including but not limited to binding matrices, fuzzy systems and machine-learning algorithms (e.g. support vector machines, artificial neural networks, hidden Markov models, and genetic algorithms). As an example, the representation of the interaction site for a protein in isolation for one global region and n local descriptor regions may be an encoded string of the form:

G_SurfaceArea G_Charge G_BindingEnergy G_NoOfResidues; L1_SurfaceArea L1_Charge L1_BindingEnergy L1_NoOfResidues; L2_SurfaceArea L2_Charge L2_BindingEnergy L2_NoOfResidues; ...; Ln_SurfaceArea Ln_Charge Ln_BindingEnergy Ln_NoOfResidues

where G represents a descriptor for the global region and L represents a descriptor for a local region. Alternatively, the encoded string of the same interaction site using the same descriptors may be of the form:

G_SurfaceArea G_Charge G_BindingEnergy G_NoOfResidues L1_SurfaceArea L1_Charge L1_BindingEnergy L1_NoOfResidues L2_SurfaceArea L2_Charge L2_BindingEnergy L2_NoOfResidues ... Ln_SurfaceArea Ln_Charge Ln_BindingEnergy Ln_NoOfResidues

where no delimiters are used. Alternatively, different delimiters, more delimiters, or combinations of delimiters may be used between each of or groups of descriptors as convenient.
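By way of illustration only, the encoded-string form described above may be sketched as follows. The function names, the fixed descriptor order and the numeric formatting are assumptions made for this sketch and are not prescribed by the present description; any values passed in are hypothetical.

```python
from typing import Dict, Sequence

# Descriptor order is an assumption taken from the example string in the text.
DESCRIPTORS = ("SurfaceArea", "Charge", "BindingEnergy", "NoOfResidues")


def encode_region(values: Dict[str, float],
                  names: Sequence[str] = DESCRIPTORS) -> str:
    """Join one region's descriptor values in a fixed order, space-separated."""
    return " ".join(f"{values[name]:g}" for name in names)


def encode_site(global_region: Dict[str, float],
                local_regions: Sequence[Dict[str, float]],
                delimiter: str = "; ") -> str:
    """Encode the global region first, then each local region in order.

    Passing delimiter=" " gives the delimiter-free form described in the text.
    """
    parts = [encode_region(global_region)]
    parts.extend(encode_region(region) for region in local_regions)
    return delimiter.join(parts)
```

Keeping the descriptor order fixed means that two encoded sites can be compared position by position, whichever delimiter is chosen.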

[ 0127 ] A further example of the representation may be a linear combination of the descriptors such as:

(X_G1 × G_SurfaceArea) + (X_G2 × G_Charge) + (X_G3 × G_BindingEnergy) + (X_G4 × G_NoOfResidues) + (X_L11 × L1_SurfaceArea) + (X_L12 × L1_Charge) + (X_L13 × L1_BindingEnergy) + (X_L14 × L1_NoOfResidues) + (X_L21 × L2_SurfaceArea) + (X_L22 × L2_Charge) + (X_L23 × L2_BindingEnergy) + (X_L24 × L2_NoOfResidues) + ... + (X_Ln1 × Ln_SurfaceArea) + (X_Ln2 × Ln_Charge) + (X_Ln3 × Ln_BindingEnergy) + (X_Ln4 × Ln_NoOfResidues)

where X_Gx are optional weights for each of the global descriptors and X_Lxx are optional weights for each of the local descriptors of each local region 1 to n. A still further example of the representation may be in matrix form such as:

G_SurfaceArea    G_Charge    G_BindingEnergy    G_NoOfResidues    ...
L1_SurfaceArea   L1_Charge   L1_BindingEnergy   L1_NoOfResidues   ...
L2_SurfaceArea   L2_Charge   L2_BindingEnergy   L2_NoOfResidues   ...
...              ...         ...                ...               ...
Ln_SurfaceArea   Ln_Charge   Ln_BindingEnergy   Ln_NoOfResidues   ...
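As a minimal sketch, the matrix form and the weighted linear combination described above may be related as follows: each row of the matrix holds the descriptors of one region (global region first), and a weight matrix of the same shape supplies the optional weights. The function name and the list-of-lists layout are assumptions of this sketch.

```python
from typing import Sequence


def linear_combination(matrix: Sequence[Sequence[float]],
                       weights: Sequence[Sequence[float]]) -> float:
    """Sum weight * descriptor over every region (row) and descriptor (column).

    matrix[0] is assumed to hold the global descriptors and matrix[1:] the
    local descriptors; weights must have the same shape as matrix.
    """
    if len(matrix) != len(weights):
        raise ValueError("matrix and weights must have the same number of rows")
    total = 0.0
    for row, weight_row in zip(matrix, weights):
        if len(row) != len(weight_row):
            raise ValueError("each weight row must match its descriptor row")
        total += sum(w * v for w, v in zip(weight_row, row))
    return total
```

Setting every weight to 1 reduces the combination to a plain sum of the descriptor values.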

[ 0128 ] The representation of a receptor-ligand interaction site can be described as [GR-P_n][GL-P_n][LRx_m-P_n][LLx_m-P_n][BA], where GR-P_n is the global receptor descriptor for amino acid property P_n; GL-P_n is the global ligand descriptor for amino acid property P_n; LRx_m-P_n is the local receptor descriptor enclosed by micro-container x_m; LLx_m-P_n is the local ligand descriptor enclosed by micro-container x_m; and BA (optional) is the strength of the interaction (the binding affinity or weight) where applicable; and where n is an integer between 1 and the total number of defined properties and m is an integer between 1 and the total number of micro-containers for amino acid property P_n.

EXAMPLES

Extraction of descriptors from a single protein

[ 0129 ] The procedure for extracting descriptors from a molecule includes several steps, as illustrated in Figure 1 and described in Example 1.

Extraction of descriptors from a receptor-ligand complex

[ 0130 ] The procedure for extracting descriptors from an association pair complex includes several steps which are illustrated in Figure 2 and described in Example 2.

Example 1

[ 0131 ] Hen egg white lysozyme (HEL) (Worldwide Protein Data Bank IDs: 1A2Y, 1G7M, 1G7L, 1G7I, 1G7H, 1KIR, 1VFB, 1FDL, 1KIQ, and 1KIP; Berman et al., 2000) binds monoclonal antibody (mAb) D1.3. This mAb D1.3 recognizes a conformational (nonlinear) epitope on HEL. The positional binding environments of the complex have been resolved by crystallography (Dall'Acqua et al. 1998, Fischmann et al. 1991, Sundberg et al. 2000, Fields et al. 1996, and Bhat et al. 1994).

[ 0132 ] The process of obtaining the representation of the interaction of the said ligand in isolation is shown in Figure 1. HEL has 18 contact amino acids that constitute the ligand interaction site (Table 1 shows a listing of contact amino acid residue pairs derived from 3D coordinates of the D1.3/HEL complex using structures provided in Protein Data Bank (PDB) IDs 1A2Y, 1G7M, 1G7L, 1G7I, 1G7H, 1KIR, 1VFB, 1FDL, 1KIQ, or 1KIP). These residues may be effectively captured by a three-dimensional sphere of radius 9.00 Å (Figure 1A). Combining the global (Figure 1B) and local (Figure 1D) descriptors, in this example using the region enclosed by a sphere of radius 9.00 Å, results in the representation of the interaction site (Figure 1F). The descriptors which were used to characterize the interaction area in this Example were the free binding energies of the global and local regions (refer to Table 2), which were derived from three-dimensional coordinate files such as PDB, as used in the present example, or Tripos Mol2, among others. Experimental (e.g. point mutation) data from published literature was also used in the representation of the interaction site. Furthermore, wet-lab experiments such as point mutations may be used to reveal important residues or regions in the binding site, and may also reveal the location of the interaction site.

[ 0133 ] The process of obtaining the representation of the combined interaction of the said ligand and said receptor is shown in Figure 2. D1.3 has 15 contact amino acids on the surface of the binding groove that constitute the receptor interaction site, and HEL has 18 contact amino acids that constitute the ligand interaction site. These residues were effectively captured by a three-dimensional sphere of radius 9.00 Å (Figure 2A). Putting together the receptor and ligand interaction sites, in this example using the region enclosed by a sphere of radius 9.00 Å, resulted in the representation of the interaction site (Figure 2F).
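The capture of residues by a sphere, as described above, may be sketched as a simple distance test. This sketch assumes that each residue is represented by a single coordinate (for example its C-alpha atom or side-chain centroid), which is a simplification not stated in the text; the function name and the residue labels used with it are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def residues_in_sphere(residue_coords: Dict[str, Point],
                       centre: Point,
                       radius: float) -> List[str]:
    """Return the sorted labels of residues whose point lies within the sphere.

    residue_coords maps a residue label to one representative XYZ position in
    angstroms (an assumption of this sketch, not a rule from the text).
    """
    return sorted(
        label
        for label, xyz in residue_coords.items()
        if math.dist(xyz, centre) <= radius
    )
```

The same function may be called with a radius of 9.00 Å for a global region and with a smaller radius for a local region centred elsewhere within it.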

Table 1: HEL residue positional environments for D1.3 at a distance of 4.50 Å.

Example 2

[ 0134 ] Monoclonal antibody (mAb) D1.3 binds HEL and the anti-idiotypic monoclonal antibody mAb E5.2. The structures of these antibodies are described in Fischmann et al., 1991; Bhat et al., 1994; and Sundberg et al., 2000. A representation of the interaction site comprising a set of descriptors was prepared consisting of binding sites (8.5 - 9.0 Å radius) from 9 HEL crystallographic structures (PDB IDs 1G7M, 1G7L, 1G7I, 1G7H, 1KIR, 1VFB, 1FDL, 1KIQ, 1KIP). The global and local free binding energy profiles of each HEL interaction site were effectively captured using a total of 30 three-dimensional spheres, the global sphere having a radius of 9 Å, and the 29 local spheres having a radius of 4.5 Å. The position and size of the global sphere were based on experimental data of the interaction site, and the local spheres were defined to be half the radius of the global sphere. The descriptor used for this example was the free binding energy profile (shown in Table 2) of each of the global and the local spheres, which was computed based on empirical free energy functions as implemented in Internal Coordinate Mechanics software based on the equation

ΔG_bind = αΔG_H + βΔG_S + γΔG_EL.

Here, ΔG_H is the hydrophobic energy computed as the product of solvent accessible surface area (determined by rolling a sphere of 1.4 Å radius along the surface of the molecule) by the surface tension. ΔG_S refers to the entropic contribution from the protein side-chains computed from the maximal burial entropies for each type of amino acid and their relative accessibilities. ΔG_EL denotes the electrostatic term composed of coulombic interactions between receptor and ligand and the desolvation of partial charges transferred from an aqueous medium to a protein core environment, and is determined by the numeric solution of the Poisson equation using an implementation of the boundary element algorithm (Schapira et al., 1999). No other descriptors were used in this example.

[ 0135 ] Next, the maximum (E_max) and minimum (E_min) free binding energies of each global and local region of each crystal structure (PDB IDs 1G7M, 1G7L, 1G7I, 1G7H, 1KIR, 1VFB, 1FDL, 1KIQ, 1KIP) were computed using the software program Internal Coordinate Mechanics and provided with a degree of relaxation from 0 kJ/mol to 90 kJ/mol, representing a range from highly specific binding to low specific binding, to form the training set.
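The following is a minimal sketch of how the three-term free energy and the relaxed energy bounds described above might be combined. It is not the Internal Coordinate Mechanics implementation. The default weights of 1.0 are placeholders, since the values of the coefficients are not given here, and the matching rule (a candidate site matches if every sphere's energy lies within the training minimum and maximum widened by the relaxation) is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def binding_energy(dg_h: float, dg_s: float, dg_el: float,
                   alpha: float = 1.0, beta: float = 1.0,
                   gamma: float = 1.0) -> float:
    """Weighted sum of hydrophobic, entropic and electrostatic terms (kJ/mol)."""
    return alpha * dg_h + beta * dg_s + gamma * dg_el


def energy_bounds(profiles: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Per-sphere (E_min, E_max) taken across all training structures.

    Each profile lists the energies of the global sphere and the local
    spheres in the same order.
    """
    return [(min(column), max(column)) for column in zip(*profiles)]


def matches(profile: Sequence[float],
            bounds: Sequence[Tuple[float, float]],
            relaxation: float) -> bool:
    """Assumed rule: every energy lies in [E_min - r, E_max + r]."""
    return all(
        low - relaxation <= energy <= high + relaxation
        for energy, (low, high) in zip(profile, bounds)
    )
```

Under this assumed rule, a relaxation of 0 kJ/mol accepts only profiles that fall inside the range seen in training, and larger relaxations accept progressively less similar sites.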

[ 0136 ] The test dataset included 2 binding and 1344 non-binding sites from mAb E5.2 (PDB ID 1DVF) and HEL (PDB ID 1A2Y), a sample of which is shown in Figures 7A and 7B for the two binding sites and 21 1DVF and 9 1A2Y non-binding sites. For each fuzzy system, the model is tested on the test dataset.

[ 0137 ] The predictive performance at different energy thresholds (10 kJ/mol) was assessed using sensitivity [SE=TP/(TP+FN)] and specificity [SP=TN/(TN+FP)], which are shown in Table 3. SE and SP represent percentages of correctly predicted interaction sites and non-interaction sites, respectively. TP (true positives) stands for interaction sites correctly predicted as interaction sites and TN (true negatives) for non-interaction sites correctly predicted as non-interaction sites. FN (false negatives) refers to interaction sites predicted as non-interaction sites and FP (false positives) represents non-interaction sites predicted as interaction sites.
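The two performance measures defined above may be computed as in the following sketch. The function names are illustrative, and any counts passed in are supplied by the caller; none are taken from Table 3.

```python
def sensitivity(tp: int, fn: int) -> float:
    """SE = TP / (TP + FN): fraction of interaction sites correctly predicted."""
    if tp + fn == 0:
        raise ValueError("no interaction sites in the test set")
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """SP = TN / (TN + FP): fraction of non-interaction sites correctly predicted."""
    if tn + fp == 0:
        raise ValueError("no non-interaction sites in the test set")
    return tn / (tn + fp)
```

Multiplying either value by 100 gives the percentage form in which SE and SP are reported.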

[ 0138 ] Using the methods and systems provided herein, the interaction sites of ligands binding to D1.3 were predicted with high accuracy using fuzzy descriptors of up to 60 kJ/mol (sensitivity = 100%, specificity = 80.95%). The present method provided the additional advantage that all the predictions were produced using a single predictive model. In the present case, the interaction sites were known; however, where the interaction sites are unknown, a prediction may still be obtained by using exact or fuzzy matching techniques.

Table 2: Global (GL) and local (LL) free binding energy profiles of HEL interaction site from 9 crystallographic structures.

Table 3: Results of prediction for mAb D1.3 ligand binding sites (TP = true positive, FP = false positive, TN = true negative, FN = false negative, SE = sensitivity, SP = specificity)

Example 3

[ 0139 ] Three-dimensional complexes of ligands bound to three common protein receptors of different families were extracted from the Protein Data Bank as shown in Table 4. Descriptors (solvent accessible surface area, free binding energy and number of residues) for the ligand interaction sites were determined using the same procedure as described in Example 2.

[ 0140 ] A hierarchical clustering technique based on the single-linkage algorithm as implemented in MATLAB version 7.0 was applied. The algorithm began by merging nearest neighbours into clusters and then merging smaller clusters into larger clusters until all items were combined. The clustering results are shown in Figure 3. Using the methods and systems provided herein, ligands were clustered into classes according to their binding receptor families. This makes it possible to screen for ligands (or receptors) with a similar binding partner (which may be of unknown structure and/or interaction site).
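For illustration, single-linkage agglomerative clustering of descriptor vectors may be sketched in plain Python as follows. This is not the MATLAB implementation referred to above: it is a naive version that uses Euclidean distance (an assumption) and stops at a requested number of clusters rather than building the full merge tree.

```python
import math
from typing import List, Sequence


def single_linkage(points: Sequence[Sequence[float]],
                   n_clusters: int) -> List[List[int]]:
    """Merge the two closest clusters repeatedly until n_clusters remain.

    The distance between two clusters is the smallest distance between any
    member of one and any member of the other (single linkage). Returns
    clusters as sorted lists of indices into points.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

Calling the function with n_clusters set to 1 continues merging until all items are combined, which corresponds to the end point described above.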

Table 4: Experimentally determined receptor-ligand structures

Example 4

[ 0141 ] Chymotrypsin is a proteolytic enzyme acting in the digestive systems of mammals and other organisms. It facilitates the cleavage of peptide bonds by a hydrolysis reaction. The receptor cleaves peptides and polypeptides into shorter peptide chains, tri- and dipeptides. The crystallographic structures of chymotrypsin in complex with nine different ligands have been solved and deposited in the Protein Data Bank (Table 4).

[ 0142 ] The training set consisted of 8 binding sites (13.0 - 13.5 Å radius) from APPI (PDB ID 1CA0), BPTI (PDB IDs 1T8O, 1T8N, 1T8M, 1CBW), autocatalytic peptide (PDB ID 1OXG), and the human pancreatic secretory trypsin inhibitor (PDB IDs 1CGJ, 1CGI). The global and local free binding energy profiles of each interaction site were effectively captured using a total of 30 three-dimensional spheres as described in Example 2. Next, the maximum (E_max) and minimum (E_min) free binding energies within each sphere were computed and provided with a degree of relaxation from 0 kJ/mol to 90 kJ/mol, representing a range from highly specific binding to low specific binding.

[ 0143 ] The test dataset included 7 binding sites from ecotin (PDB ID 1N8O), eglin C (PDB ID 1ACB), ovomucoid (1CHO), pmp-C (1GL1), pmp-D2v (1GL0), and BPTI (1T7C, 1T8L), and 2237 non-binding sites from mAb E5.2 (PDB ID 1DVF), HEL (PDB ID 1A2Y) and MHC class I allele A*0201 (PDB ID 1DUZ). E_max and E_min were gradually relaxed from 0 kJ/mol to 90 kJ/mol.

[ 0144 ] The results of prediction are shown in Table 5. The interaction sites of ligands binding to D1.3 were predicted with high accuracy using fuzzy descriptors of up to 90 kJ/mol (sensitivity = 100%, specificity = 93.83%). The present method has the advantage that all the predictions were produced using a single predictive model.

Table 5: Results of prediction for chymotrypsin ligand binding sites

[ 0145 ] The methods and systems described herein, and/or shown in the drawings, are presented by way of example only and are not limiting as to the scope of the invention.

Unless otherwise specifically stated, individual aspects and components of the methods and systems may be modified, or may have substituted therefor known equivalents, or as yet unknown substitutes such as may be developed in the future or such as may be found to be acceptable substitutes in the future. The methods or systems may also be modified for a variety of applications while remaining within the scope and spirit of the claimed invention, since the range of potential applications is great, and since it is intended that the present methods or systems be adaptable to many such variations.

[ 0146 ] The terms "including", "comprising" and variations thereof as used herein are to be construed as meaning "including but not limited to", unless expressly specified otherwise.

REFERENCES

Abagyan, R., and Totrov, M. 2001. High throughput docking for lead generation. Curr. Opin. Chem. Biol. 5, 375-382.

Abagyan, R.A., and Totrov, M.M. 1994. Biased Probability Monte Carlo Conformational Searches and Electrostatic Calculations For Peptides and Proteins. J. Mol. Biol. 235, 983-1002.

Abagyan, R.A., Totrov, M.M., and Kuznetsov, D.A. 1994. ICM: A New Method For Protein Modeling and Design: Applications To Docking and Structure Prediction From The Distorted Native Conformation. J. Comp. Chem. 15, 488-506.

Alesker, V., Nussinov, R., and Wolfson, H.J. 1996. Detection of non-topological motifs in protein structures. Protein Eng. 9, 1103-1119.

Alexandrov, N., and Fischer, D. 1996. Analysis of topological and nontopological structural similarities in the PDB: new examples with old structures. Proteins: Struct., Func. Gen. 25, 354-365.

Alexandrov, N., and Go, N. 1994. Biological meaning, statistical significance and classification of local spatial similarities in non homologous proteins. Prot. Sci. 3, 866-875.

Alexandrov, N., Takahashi, K.J., and Go, N. 1992. Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. J. Mol. Biol. 225, 5-9.

Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

Artymiuk, P., Poirrette, A., Grindley, H., Rice, D., and Willett, P. 1994. A graph-theoretic approach to the identification of three-dimensional patterns of amino acid side-chains in protein structures. J. Mol. Biol. 243, 327-344.

Bagley, S., and Altman, R. 1995. Characterizing the microenvironment surrounding protein sites. Prot. Sci. 4, 622-635.

Bairoch, A., Boeckmann, B., Ferro, S., and Gasteiger, E. 2004. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5, 39-55.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Research 28, 235-242.

Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.F., Brice, M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T., and Tasumi, M. 1978. The protein data bank: A computer-based archival file for macromolecular structures. Arch. Biochem. Biophys. 185, 584-591.

Bhat, T.N., Bentley, G.A., Boulot, G., Green, M.I., Tello, D., Dall'Acqua, W., Souchon, H., Schwarz, F.P., and Mariuzza, R.A. 1994. Bound water molecules and conformational stabilization help mediate an antigen-antibody association. Proc. Nat. Acad. Sci. USA 91: 1089.

Brazma, A., Jonassen, I., Eidhammer, I., and Gilbert, G. 1998. Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5, 279-305.

Brenner, S., Chothia, C., and Hubbard, T. 1998. Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc. Natl. Acad. Sci. 95, 6073-6078.

Chew, L., Huttenlocher, D., Kedem, K., and Kleinberg, J. 1999. Fast detection of common geometric substructure in proteins. Proceedings of the Third International Conference on Computational Molecular Biology, RECOMB'99.

Dall'Acqua, W., Goldman, E.R., Lin, W., Teng, C., Tsuchiya, D., Li, H., Ysern, X., Braden, B.C., Li, Y., Smith-Gill, S.J., and Mariuzza, R.A. 1998. A mutational analysis of binding interactions in an antigen-antibody protein-protein complex. Biochemistry 37, 7981-7991.

de Rinaldis, M., Ausiello, G., Cesareni, G., and Helmer-Citterich, M. 1998. Three-dimensional profiles: A new tool to identify protein surface similarities. J. Mol. Biol. 284, 1211-1221.

Diederichs, K. 1995. Structural superposition of proteins with unknown alignment and detection of topological similarity using a six-dimensional search algorithm. Proteins: Struct., Func., Gen. 23, 187-195.

Eidhammer, I., Jonassen, I., and Taylor, W.R. 2000. Structure comparison and structure patterns. J. Comput. Biol. 7, 685-716.

Falicov, A., and Cohen, F. 1996. A surface of minimum area metric for the structural comparison of proteins. J. Mol. Biol. 258, 871-892.

Fan, S.C., and Zhang, X. G. 2005. Characterizing the microenvironment surrounding phosphorylated protein sites. Geno. Prot. Bioinfo. 3, 213-217.

Fernandez-Recio, J., Totrov, M., and Abagyan, R. 2002. Soft protein-protein docking in internal coordinates. Protein Sci. 11: 280-291.

Fetrow, J.S., and Skolnick, J. 1998. Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J. Mol. Biol. 281, 949-968.

Fischmann, T.O., Bentley, G.A., Bhat, T.N., Boulot, G., Mariuzza, R.A., Phillips, S.E.V., Tello, D., and Poljak, R.J. 1991. Crystallographic refinement of the three-dimensional structure of the FabD1.3-lysozyme complex at 2.5-Å resolution. J. Biol. Chem. 266: 12915.

Fradera, X., Knegtel, R.M., and Mestres, J. 2000. Similarity-driven flexible ligand docking. Proteins 40: 623-636.

Friesner, R.A., Banks, J.L., Murphy, R.B., Halgren, T.A., Klicic, J.J., Mainz, D.T., Repasky, M.P., Knoll, E.H., Shelley, M., Perry, J.K., Shaw, D.E., Francis, P., and Shenkin, P.S. 2004. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47: 1739-1749.

Gane, P.J., and Dean, P.M. 2000. Recent advances in structure-based rational drug design. Curr. Opin. Struct. Biol. 10: 401-404.

Gilbert, D., Westhead, D., Nagano, N., and Thornton, J. 1999. Motif-based searching in tops protein topology databases. Bioinformatics 15, 317-326.

Gohlke, H., and Klebe, G. 2001. Statistical potentials and scoring functions applied to protein-ligand binding. Curr. Opin. Struct. Biol. 11: 231-235.

Di Gennaro, J.A., Siew, N., Hoffman, B.T., Zhang, L., Skolnick, J., Neilson, L.I., and Fetrow, J.S. 2001. Enhanced functional annotation of protein sequences via the use of structural descriptors. J. Struct. Biol. 134, 232-245.

Hao, M.H., Haq, O., and Muegge, I. 2007. Torsion angle preference and energetics of small-molecule ligands bound to proteins. J. Chem. Inf. Model. 47: 2242-2252.

Jonassen, I., Eidhammer, I., Taylor, W., and Grindhaug, S. 2000. Searching the protein structure databank with weak sequence patterns and structural constraints. J. Mol. Biol. 304(4), 597-617.

Jonassen, I., Eidhammer, I., and Taylor, W.R. 1999. Discovery of local packing motifs in protein structures. Proteins: Struct., Func., Gen. 34, 206-219.

Kastenmüller, G., Kriegel, H.P., and Seidl, T. 1998. Similarity search in 3d protein databases. German Conference on Bioinformatics, GCB 98.

Kasuya, A., and Thornton, J. 1999. Three-dimensional structure analysis of prosite patterns. J. Mol. Biol. 286, 1673-1691.

Levitt, M., and Gerstein, M. 1998. A unified statistical framework for sequence comparison and structure comparison. Proc. Natl. Acad. Sci USA 95, 5913-5920.

McDonald, I.K., and Thornton, J.M. 1994. Satisfying Hydrogen Bonding Potential in Proteins. Journal of Molecular Biology 238: 777-793.

Morris, G.M., Goodsell, D.S., Halliday, R.S., Huey, R., Hart, W.E., Belew, R.K., and Olson, A.J. 1998. Automated Docking Using a Lamarckian Genetic Algorithm and an Empirical Binding Free Energy Function. J. Computational Chemistry 19: 1639-1662.

Nussinov, R., and Wolfson, H. 1991. Efficient detection of three-dimensional structural motifs in biological macromolecules by computer vision techniques. Proc. Natl. Acad. Sci USA 88, 10495-10499.

Pennec, X., and Ayache, N. 1998. A geometric algorithm to find small but highly similar 3D substructures in proteins. Bioinformatics 14, 516-522.

Petitjean, M. 1998. Interactive maximal common 3d substructure searching with the combined sdm/rms algorithm. Comp. Chem. 22, 463-465.

Rarey, M., Kramer, B., and Lengauer, T. 1999. The particle concept: placing discrete water molecules during protein-ligand docking predictions. Proteins 34, 17-28.

Rossmann, M.G., and Argos, P. 1975. A comparison of the heme binding pocket in globins and cytochrome b5. J. Biol. Chem. 250, 7523-7532.

Russell, R. 1998. Detection of protein three-dimensional side-chain patterns: New examples of convergent evolution. J. Mol. Biol. 279, 1211-1227.

Schapira, M., Totrov, M., and Abagyan, R. 1999. Prediction of the binding energy for small molecules, peptides and proteins. J. Mol. Recognit. 12: 177-190.

Shoichet, B.K., and Bussiere, D.E. 2000. Macromolecular crystallography and lead discovery: possibilities and limitations. J. Mol. Biol. 295: 337-356.

Sundberg, E.J., Urrutia, M., Braden, B.C., Isern, J., Tsuchiya, D., Fields, B.A., Malchiodi, E.L., Tormo, J., Schwarz, F.P., and Mariuzza, R.A. 2000. Estimation of the hydrophobic effect in an antigen-antibody protein-protein interface. Biochemistry 39: 15375.

Taylor, J.S., and Burnett, R.M. 2000. DARWIN: a program for docking flexible molecules. Proteins 41: 173-191.

Tong, J.C., Tan, T.W., and Ranganathan, S. 2007. In silico grouping of peptide/HLA class I complexes using structural interaction characteristics. Bioinformatics 23: 177-183.

Verbitsky, G., Nussinov, R., and Wolfson, H. 1999. Flexible structural comparison allowing hinge bending, swiveling motions. Proteins 34, 232-254.

Vriend, G., and Sander, C. 1991. Detection of common three-dimensional substructures in proteins. Proteins 11, 52-58.

Wako, H., and Yamato, T. 1998. Novel method to detect a motif of local structures in different protein conformations. Protein Eng. 11, 981-990.

Wallace, A., Borkakoti, N., and Thornton, J. 1997. TESS: A geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases: Applications to enzyme active sites. Prot. Sci. 6, 2308-2323.

Wallace, A., Laskowski, R., and Thornton, J. 1996. Derivation of 3D coordinate templates for searching structural databases: Applications to ser-his-asp catalytic triads in the serine proteinases and lipases. Prot. Sci. 5, 1001-1013.
