Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
ANALYZING MOLECULE AND PROTEIN DIVERSITY
Document Type and Number:
WIPO Patent Application WO/2000/060507
Kind Code:
A2
Abstract:
A computer-based method in which a set of constraints is placed on possible target surfaces, and a fully enumerated set of theoretical target surfaces under the given constraints is created, such that each surface has a defined, continuous volume and a defined, continuous surface area. One or more sets of objects are mapped to the fully enumerated set of theoretical target surfaces to define corresponding subsets of the fully enumerated set of theoretical target surfaces. An aspect of diversity of the objects is analyzed based on degrees of similarities and differences among the corresponding subsets.

Inventors:
WINTNER EDWARD A (US)
MOALLEMI CIAMAC C (US)
Application Number:
PCT/US2000/008777
Publication Date:
October 12, 2000
Filing Date:
March 31, 2000
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
NEOGENESIS INC (US)
WINTNER EDWARD A (US)
MOALLEMI CIAMAC C (US)
International Classes:
G06F17/50; G16B15/00; (IPC1-7): G06F17/50
Other References:
JOHN MOUNT ET AL: "Icepick: A flexible surface-based system for molecular diversity" JOURNAL OF MEDICINAL CHEMISTRY, [Online] vol. 42, no. 1, 1999, pages 60-66, XP002156348 Retrieved from the Internet: [retrieved on 2000-12-28]
PARKS CAMDEN A ET AL: "The measurement of molecular diversity by receptor site interaction simulation." JOURNAL OF COMPUTER-AIDED MOLECULAR DESIGN, vol. 12, no. 5, 1998, pages 441-449, XP000972384 ISSN: 0920-654X
BURKHARD P ET AL: "An example of a protein ligand found by database mining: Description of the docking method and its verification by a 2.3 A X-ray structure of a thrombin-ligand complex." JOURNAL OF MOLECULAR BIOLOGY, vol. 277, no. 2, 27 March 1998 (1998-03-27), pages 449-466, XP000974379 ISSN: 0022-2836
JIANG F ET AL: "SOFT DOCKING MATCHING OF MOLECULAR SURFACE CUBES" JOURNAL OF MOLECULAR BIOLOGY, vol. 219, no. 1, 1991, pages 79-102, XP000972336 ISSN: 0022-2836
Attorney, Agent or Firm:
Klunder, Janice M. (LLP 60 State Street Boston, MA, US)
Download PDF:
Claims:
CLAIMS
1. A computerbased method comprising defining a set of constraints on possible target surfaces, defining a fully enumerated set of theoretical target surfaces under the defined constraints, such that each surface has a defined, continuous volume and a defined, continuous surface area, mapping one or more sets of objects to the fully enumerated set of theoretical target surfaces to define corresponding subsets of the fully enumerated set of theoretical target surfaces, and analyzing an aspect of diversity of the objects based on degrees of similarities and differences among the corresponding subsets.
2. The method of claim 1 in which the target surfaces comprise negative space target surfaces.
3. The method of claim 1 in which the objects comprise positive space object surfaces associated with different molecules.
4. The method of claim 2 in which the objects comprise positive space object surfaces associated with different molecules and in which the objects are mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which positive space object surfaces of conformations of molecules are complementary, and the aspect of diversity that is analyzed is the difference or similarity between the molecules which map to those negative space theoretical target surfaces.
5. The method of claim 1 in which the objects comprise negative space object surfaces associated with different proteins.
6. The method of claim 2 in which the objects comprise negative space object surfaces associated with different proteins and in which the objects are mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which negative space object surfaces of protein pockets are similar, and the aspect of diversity that is analyzed is the difference or similarity between protein pockets which map to those negative space theoretical target surfaces.
7. The method of claim 1 in which the objects comprise positive space object surfaces associated with different molecules and negative space object surfaces associated with different proteins.
8. The method of claim 2 in which the objects comprise positive space object surfaces associated with different molecules and negative space object surfaces associated with different proteins and in which, in the case of molecules, the objects are mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which positive space object surfaces of conformations of molecules are complementary, in the case of proteins, the objects are mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which negative space object surfaces of protein pockets are similar, and the aspect of diversity that is analyzed is the difference or similarity of the molecules which map to those negative space theoretical target surfaces to the protein pockets which map to those negative space theoretical target surfaces.
9. The method of claim 1 in which the theoretical target surfaces comprise polyhedrons.
10. The method of claim 1 in which the objects comprise polyhedrons.
11. The method of claim 9 or 10 in which the polyhedrons comprise cubes.
12. The method of claim 9 or 10 in which the polyhedrons are all of the same size and shape.
13. The method of claim 1 in which the set of all theoretical target surfaces defines a diversity space within which the diversity of objects can be measured by mapping those objects to the diversity space.
14. The method of claim 13 also including identifying regions of the diversity space to which no objects map.
15. The method of claim 14 also including designing molecules that occupy at least one of the unfilled theoretical target surfaces of the diversity space.
16. The method of claims 4 or 8 in which complementarity is associated with binding affinities of positive space object surfaces of conformations of molecules to negative space theoretical target surfaces.
17. The method of claim 1 in which the constraints comprise volume.
18. The method of claim 1 in which the constraints comprise associating each of a number of sites of the target surface with a preselected molecular property.
19. The method of claim 18 in which each of the preselected molecular properties is drawn from a larger set of possible molecular properties.
20. The method of claim 18 in which the preselected molecular properties include hydrophobic, polarizable, Hbond acceptor, Hbond donor, Hbond donor/acceptor, potentially positively charged, and potentially negatively charged.
21. The method of claim 18 in which fewer than all of the sites of the target surface are each associated with a different one of the molecular properties and all of the other sites of the target surface are associated with a common molecular property.
22. The method of claim 21 in which the common molecular property comprises slightly hydrophobic.
23. The method of claim 1 in which the degrees of similarities or differences comprise functional properties associated with the corresponding subsets of the fully enumerated set of theoretical target surfaces.
24. The method of claim 1 in which the degrees of similarities or differences comprise shape properties associated with the corresponding subsets of the fully enumerated set of theoretical target surfaces.
25. The method of claim 1 further comprising defining each of the objects by quantizing molecules into polyhedrons.
26. The method of claim 1 also including fitting each of a fixed set of orientations of each conformation of each of the objects to each of the target surfaces.
27. The method of claim 26 further comprising scoring each of the fittings.
28. The method of claim 9 in which the constraints comprise a resolution of the polyhedrons.
29. The method of claim 28 in which the resolution is 4.24 Angstroms.
30. The method of claim 9 in which the constraints comprise maximum and minimum numbers of polyhedrons.
31. The method of claim 9 in which each of the polyhedrons shares a common interface with another of the polyhedrons.
32. The method of claim 1 in which each of the target surfaces has no occlusions of volume greater than a given parameter.
33. The method of claim 1 in which the target surfaces are defined conceptually as having been carved out of a flat surface.
34. A method comprising categorizing existing molecules based on negative space target surfaces to which conformations of the molecules are complementary, and designing novel molecules that are complementary to negative space target surfaces to which no conformations of the existing molecular are complementary.
35. A method of creating novel molecules to be tested as ligands for proteins, comprising categorizing proteins based on target surfaces to which their pockets of known structure map, and designing novel molecules that are complementary to the negative space target surfaces to which the protein pockets map.
36. A computer programmed to determine the chemical similarity of different molecules, the program comprising approximating the surface shape of each one of a plurality of molecules of interest by linking a series of cubes, each cube having a dimension R, the locations of the cubes being determined by the calculated electron probability density of the individual one of the molecules of interest, each cube sharing at least one of its six faces with another cube, such that there is a specific number of linked cubes which varies for each individual one of the plurality of molecules of interest; approximating the chemical reactivity of each individual one of the plurality of molecules of interest by assigning each cube of each individual one of the plurality of molecules of interest, no more than one functionality value from a plurality of M different chemical functionality values; approximating the surface shape and chemical reactivity of a chemically active surface having a volume equal to V by subtracting a number V/R3 cubes of dimension R from a surface, wherein each of the cube spaces shares at least one face with another cube space and wherein N of the cube spaces has one of a plurality of M different chemical functionality values; calculating an attraction value K for each one of the plurality of molecules of interest to the chemically active surface; and calculating a list of overall attraction values to the chemically active surface.
37. The computer of claim 36, wherein further the calculation of the attraction value K is performed on a plurality of different predetermined chemically active surfaces, and a matrix of overall attractive values of each molecule of interest to each of the different surfaces is calculated.
38. The computer of claim 36, wherein the plurality of molecules of interest includes organic molecules.
39. The computer of claim 38, wherein further the chemically active surface having a plurality of predetermined active chemical locations is calculated to correspond to the shape of an actual protein surface structure.
40. The computer of claim 36, wherein further the molecules of interest are organic molecules of 1500 Daltons or less.
41. The computer of claim 36 wherein further the chemically active surface having a plurality of predetermined active chemical locations is compared to an actual protein surface to calculate a similarity value of the actual protein surface to the predetermined active chemical locations.
42. The computer of claim 41 wherein further a plurality of predetermined chemically active surfaces are compared to a plurality of actual protein surfaces and a matrix of similarity values is calculated.
43. The computer of claim 42 wherein further the cube spaces subtracted from the surface are calculated to approximate the electron probability density of at least one of a plurality of depressions in known protein surface structures.
44. The computer of claim 42 wherein further the N sites of chemical functionality are calculated to approximate the location and type of chemical functionality of actual depressions in known protein structures.
Description:
ANALYZING MOLECULE AND PROTEIN DIVERSITY This application claims priority from Provisional United States Patent Application Serial Number 60/127,486, filed April 2,1999, and incorporated by reference.

This invention relates to analyzing molecule and protein diversity.

Combinatorial chemistry allows the creation of unprecedented numbers of organic compounds. The rational synthesis of millions of small organic molecules is now achievable in a matter of days. There are estimated to be more than ten to the hundredth power of small molecules that could be synthesized using current methods. By"small", we mean a molecule having fewer than 1500 Daltons, where a Dalton is defined as 1/12 of the weight of a carbon 12 atom or roughly the weight of a hydrogen atom.

An important question is how can one create a set of molecules of such diversity as to contain at least one potent binder to any given target of interest? This question is central to drug discovery in an era that is characterized by a growing wealth of DNA sequence information and a relative dearth of corresponding target structures and their functions. In a world in which there are many more putative targets than can be studied by x-ray crystallography, multi- dimensional NMR, or other high resolution biophysical techniques, any attempt to generate biologically active ligands to targets of unknown structure will require general screening libraries: libraries of molecules that cover a high percentage of so-called"diversity space".

The utility of small molecules as drugs depends in part on molecular complementarity: how well the molecules fit and/or stick to chemically active sites (often in the form of depressions in a protein) on the surface of a cell, on the surface of an intracellular organelle, or on a cytosolic protein. The potential molecular complementarity of a small molecule is in large part determined by two factors: 1) The shape of the molecule, meaning the total Van Der Waals (VDW) surface of a given conformation of the molecule and how it follows or does not

follow the VDW surface of the target site of interest. The shape complementarity of molecule and target are largely responsible for energetic forces such as the displacement of water (the"hydrophobic effect") and so called"Van Der Waals" or"London dispersion"forces.

2) The types of potential energetic interactions (such as hydrogen bonding, percentage ionic bonding, proximity of polarizable moieties) found at various places on the molecule, and on the manner, order, and spatial orientation in which the energetically interactive portions of the molecule are connected to each other and presented in the presence of the target of interest.

Typically, each of these factors is measured or calculated in only a relative way with respect to a specific set of molecules or with respect to specific protein surfaces being examined. A general comparison of potential molecular complementarity between two sets of molecules, however, requires doing calculations or experiments using an absolute or fixed frame of reference.

For example, if company 1 has tested a molecule set A for affinity to a set of target surfaces X found in cancerous cells, and company 2 has tested a molecule set B against a set of target surfaces Y found in nervous tissue, the value of molecule set B with respect to the target surfaces X of company 1 is not apparent because each of the affinity evaluations has been performed against a different standard. Also, without a full molecular calculation of set A against target surfaces Y, it is not apparent whether potential chemically active portions of the target surface Y could be bonded by molecules in set A. Finally, even if all calculations of sets A and B vs. surfaces X are performed, neither company will gain a measure of the"absolute diversity"of their molecule sets; that is, they will have no measure of the likelihood that either sets A or B will contain a molecule that has potential activity against any given target T. This is because the standard against which they have measured their molecules are only a small subset of potential target surfaces.

SUMMARY OF THE INVENTION

Implementations of the invention provide an absolute or fixed frame of reference for comparing sets of molecules and protein surfaces. By defining molecules in terms of their complementarity or attraction to a fully enumerated basis set of theoretical protein surfaces within given parameters, it is possible to measure the diversity of a set of molecules against a non relative standard, i. e., an absolute measurement. This allows for an efficient comparison of different sets of molecules, for example, sets of drugs, and enables meaningful categorization of classes of molecules against a standard set of surfaces. Furthermore, this allows for the detection of theoretical protein surfaces to which no molecules in a set are complementary, thus enhancing the ability of a chemist to design novel molecules that supplement the deficiencies of the original set. By defining real world protein surfaces in terms of their similarity to a basis set of theoretical protein surfaces, it is similarly possible to categorize real world sets of protein surfaces against a standard set of theoretical surfaces. This allows for improved classification of proteins into similar target classes by the similarity of their surface sites. By evaluating an actual set of molecules against theoretical protein surfaces and further by evaluating a set of real world protein surfaces of interest against the same theoretical protein surfaces, it is possible to select classes of molecules that are likely to have substantial activity against the real world protein surfaces. This allows for improved molecule screening, for example, in drug research. By evaluating an actual set of molecules against theoretical protein surfaces and further by evaluating a set of real world protein surfaces of interest against the same theoretical protein surfaces, it is also possible to find actual protein surfaces to which no molecule in the set A has likely activity. This enhances the ability of a chemist to design molecules beyond those in set A that match the previously unmatched protein surfaces of interest, and may thereby be tested for pharmaceutical activity, thus supplementing deficiencies in the original set of molecules.

Thus, in general, in one aspect, the invention features a computer-based method in which a set of constraints on possible target surfaces is defined, and a fully enumerated set of theoretical target surfaces under the defined constraints is

also defined, such that each surface has a defined, continuous volume and a defined, continuous surface area. One or more sets of objects are mapped to the fully enumerated set of theoretical target surfaces to define corresponding subsets of the fully enumerated set of theoretical target surfaces. An aspect of diversity of the objects is analyzed based on degrees of similarities and differences among the corresponding subsets.

Implementations of the invention may include one or more of the following features. The target surfaces may include negative space target surfaces. The objects may include positive space object surfaces associated with different molecules. The objects may be mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which positive space object surfaces of conformations of molecules are complementary. The aspect of diversity that is analyzed may be the difference or similarity between the molecules which map to those negative space theoretical target surfaces.

The objects may include negative space object surfaces associated with different proteins, and the objects may be mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which negative space object surfaces of protein pockets are similar. The aspect of diversity that is analyzed may be the difference or similarity between protein pockets which map to those negative space theoretical target surfaces. The objects may include positive space object surfaces associated with different molecules and negative space object surfaces associated with different proteins.

In the case of molecules, the objects may be mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which positive space object surfaces of conformations of molecules are complementary. In the case of proteins, the objects may be mapped by defining corresponding subsets of the fully enumerated set of negative space theoretical target surfaces to which negative space object surfaces of protein pockets are similar. The aspect of diversity that is analyzed may be the difference or similarity of the molecules which map to those negative space theoretical target

surfaces to the protein pockets which map to those negative space theoretical target surfaces.

The theoretical target surfaces and the objects may be polyhedrons, e. g., cubes, all of the same size and shape. The set of all theoretical target surfaces defines a diversity space within which the diversity of objects can be measured by mapping those objects to the diversity space. Regions of the diversity space to which no objects map may be identified, and molecules may be designed that occupy at least one of the unfilled theoretical target surfaces of the diversity space.

Complementarity may be associated with binding affinities of positive space object surfaces of conformations of molecules to negative space theoretical target surfaces.

The constraints may include volume, associations of each of a number of sites of the target surface with a preselected molecular property drawn from a larger set of possible molecular properties, including hydrophobic, polarizable, H-bond acceptor, H-bond donor, H-bond donor/acceptor, potentially positively charged, and potentially negatively charged. Fewer than all of the sites of the target surface may each be associated with a different one of the molecular properties and all of the other sites of the target surface may be associated with a common molecular property, such as slightly hydrophobic. The degrees of similarities or differences may involve functional properties associated with the corresponding subsets of the fully enumerated set of theoretical target surfaces or shape properties associated with the corresponding subsets of the fully enumerated set of theoretical target surfaces.

Each of the objects may be defined by quantizing molecules into polyhedrons. Each of a fixed set of orientations of each conformation of each of the objects may be fitted to each of the target surfaces, and each of the fittings may be scored.

The constraints may include a resolution of the polyhedrons, e. g., 4.24 Angstroms, or maximum and minimum numbers of polyhedrons.

Each of the polyhedrons may share a common interface with another of the polyhedrons. The constraints may also include the absence of any occluded volumes greater than a given user-defined parameter. The target surfaces may be defined conceptually as having been carved out of a flat surface.

In general, in another aspect, the invention features categorizing existing molecules based on negative space target surfaces to which conformations of the molecules are complementary, and designing novel molecules that are complementary to negative space target surfaces to which no conformations of the existing molecular are complementary.

In general, in another aspect, the invention features a method of creating novel molecules to be tested as ligands for proteins. In the method, proteins are categorized based on target surfaces to which their pockets of known structure map, and novel molecules are designed that are complementary to the negative space target surfaces to which the protein pockets map.

In general, in another aspect, the invention features a computer programmed to determine the chemical similarity of different molecules. The program approximates the surface shape of each one of a plurality of molecules of interest by linking a series of cubes, each cube having a dimension R, the locations of the cubes being determined by the calculated electron probability density of the individual one of the molecules of interest, each cube sharing at least one of its six faces with another cube, such that there is a specific number of linked cubes which varies for each individual one of the plurality of molecules of interest. The chemical reactivity of each individual one of the plurality of molecules of interest is approximated by assigning each cube of each individual one of the plurality of molecules of interest, no more than one functionality value from a plurality of M different chemical functionality values. The surface shape and chemical reactivity of a chemically active surface having a volume equal to V is approximated by subtracting a number V/R3 cubes of dimension R from a surface, wherein each of the cube spaces shares at least one face with another cube space and wherein N of the cube spaces has one of a plurality of M different chemical functionality values. An attraction value K is calculated for each one of

the plurality of molecules of interest to the chemically active surface. A list of overall attraction values to the chemically active surface is calculated.

Implementations of the invention may include one or more of the following features. The calculation of the attraction value K may be performed on a plurality of different predetermined chemically active surfaces, and a matrix of overall attractive values of each molecule of interest to each of the different surfaces may be calculated. The molecules of interest may include organic molecules. The chemically active surface having a plurality of predetermined active chemical locations may be calculated to correspond to the shape of an actual protein surface structure. The molecules of interest may be organic molecules of 1500 Daltons or less. The chemically active surface having a plurality of predetermined active chemical locations may be compared to an actual protein surface to calculate a similarity value of the actual protein surface to the predetermined active chemical locations. The predetermined chemically active surfaces may be compared to a plurality of actual protein surfaces and a matrix of similarity values may be calculated. The cube spaces subtracted from the surface may be calculated to approximate the electron probability density of at least one of a plurality of depressions in known protein surface structures. The N sites of chemical functionality may be calculated to approximate the location and type of chemical functionality of actual depressions in known protein structures.

Other advantages and features will become apparent from the following description and from the claims.

DESCRIPTION Figure 1 shows a (CH2) n chain encapsulated by 4.24 A cubic units.

Figure 2 shows examples of surfaces allowed and disallowed by the non- occlusion parameter in a theoretical target surface generation algorithm. Gray shading represents the opening of the theoretical surface. A is allowed. B is disallowed due to two occluded negative space cubes (marked X).

Figure 3 shows a theoretical target surface of 13 negative space cubes and four sites of specific molecular property interaction: hydrophobic (white), polarizable (purple), H-bond accepting (green), and H-bond donating (orange).

Blue shading indicates the opening of the theoretical surface.

Figure 4 shows a"quantized"representation (Q-file) of one conformation of molecule 6a superimposed on its atomic structure (ball and stick and space- filling model). Molecular property characteristics of the Q-file are hydrophobic (white quanta), polarizable (purple quanta), H-bond accepting (green quanta), and negatively charged (red quanta).

Figure 5 shows test molecules.

Figure 6 shows a ranking of molecules by QCSD similarity scores.

Figure 7 illustrates examination of a theoretical target surface common to molecules 8c (top) and 8a (bottom). Blue shading indicates opening of the theoretical surface. Specific points of complementarity on the theoretical target surface are hydrophobic (white) polarizable (purple) and H-bond donating (orange). Superimposition of the original molecular conformations onto the theoretical target surface demonstrates that the extra phenyl substituent of 8c protrudes from the opening of the theoretical surface and is not involved in complementarity to the surface.

Figure 8 illustrates examination of a theoretical target surface common to molecules la (A) and 5a (B). Blue shading indicates opening of the theoretical surface. Specific points of surface complementarity found by QSCD are hydrophobic (white), polarizable (purple), and H-bond donating/accepting (yellow). Overlay plot (C) of the non-hydrogen backbones of la (orange) and 5a (green) indicate similar features. Aromatic/hydrophobic overlaps are shown in purple; H-Bond donating oxygens are in red. Overlay generated with Sybyl version 6.5 (Tripos Inc, 1699 S. Hanley Rd., St. Louis, MO, 63144).

Figure 9 illustrates ranking of molecules in Figure 5 by Tanimoto similarity score of 2D UNITY fingerprints.

Figure 10 shows a QSCD plot of all of the theoretical surface shapes covered by all of the conformations of all of the molecules (blue dots) in Figure

5. The total volume of the cube encompasses all 49,268,918 theoretical surface shapes as listed in Table 1. Red dots show two exemplary theoretical surface shapes (a, b) not covered by any of the molecules in Fig. 5. Axes used are functions of opening area, opening length/width, and depth per opening quantum.

Figure 11 shows a map of the 20 compounds in Fig. 5 (blue dots) in a representative BCUT three-axis diversity space. BCUT axes used are, respectively: 1) BCUT HACCEPT S INVDIST 050 R H, 2) BCUT HDONOR S INVDIST 030 R H, and 3) BCUT TAB POLAR S INVDIST 300 R L. Red dot shows an unfilled coordinate of diversity space, at (7.54,7.25,6.82). The information contained in this BCUT coordinate does not reveal information about the shape of a molecule which might be able to fill this position in diversity space.

Figure 12 shows use of QSCD to design complementary combinatorial libraries to unmatched theoretical target surfaces. Many conceivable libraries of a given shape and functionality may be designed to fill a given unmet diversity need.

Figure 13 shows two sample surfaces.

Figure 14 illustrates the quantization process.

Figure 15 shows a legend for symbols used in functionality rule diagrams.

Figure 16 shows functionality rules for Potential Negative Charge Functionality. The structures are searched for in order.

Figure 17 shows functionality rules for Potential Positive Charge Functionality. The structures are searched for in order.

Figure 18 shows a functionality rule for Hydrogen Bond Donor/Acceptor Functionality.

Figure 19 shows a functionality rule for Hydrogen Bond Donor Functionality.

Figure 20 shows a functionality rule for Hydrogen Bond Acceptor Functionality. The structures are searched for in order.

Figure 21 shows a functionality rule for Polarizable Functionality.

Figure 22 shows Table 5, a ranking of molecules in Fig. 5 by QSCD diversity score. Blue = homogeneous pairs, yellow = +phenyl pairs (8c), green = ATI-AT2 pairs (3,4) Figure 23 shows Table 6, a ranking of molecules in Fig. 5 by Tanimoto similarity score of 2D UNITY fingerprints. Blue = homogeneous pairs, yellow = +phenyl pairs (8c), green = ATI-AT2 pairs (3,4) Figure 24 shows an example subset of theoretical surfaces Ti containing 4 members and an example central set Ci for Ti (F = 1, E = 3) where the black face denotes a point of attachment A1 on Ci: Figure 25 shows an example Core Molecule Mi to fill Central set Ci.

Figure 26 shows an example Library (Mi, B) where B = a set of amines.

Figure 27 shows an example subset of target surfaces Ti containing 4 members and an Example Central set Ci for Ti (F = 1, E = 3), where the black face denotes a point of attachment Al on Ci: Figure 28 shows an Example Core Molecule Mi to fill Central set Ci.

Figure 29 shows an example Library L (Mi, B) where B = a set of amines.

Figure 30 shows a protein quantization process.

We define diversity as the measure, based on pre-defined criteria, of the difference or similarity among all members of a set. In a pharmaceutical setting, molecular diversity can be defined as the measure, based on biological criteria, of the difference or similarity between small molecules. Each of the existing methods of calculating biologically relevant diversity of small molecules defines slightly different criteria for molecular comparison, and thus a different configuration of diversity space as a whole. Examples include low dimensional diversity space such as BCUT metrics, high dimensional diversity space such as Chem-X/ChemDiverse multiple point pharmacophores, and empirical biological diversity space such as affinity fingerprinting.

Many known ways of quantifying the diversity of molecules use molecular properties such as functionality and connectivity as a basis for

categorization (see for instance Potter and Matter, J. Med. Chem., 1998, p. 478).

For example, in the BCUT method used to generate four-to six-dimensional diversity space, molecules are broken down into matrices according to connectivity and molecular interaction properties. Coordinates in diversity space are assigned through the resulting eigenvalues of these matrices, leading to useful multi-dimensional plots of molecular diversity. However, because the use of eigenvalues is an irreversible transformation (different 3D shapes can map to the same eigenvalues), it follows that an empty coordinate in BCUT diversity space cannot be translated into a 3D template of a"missing molecule. "Thus, while a model such as BCUT diversity is well validated as a tool for finding combinatorial matches to a lead compound or pharmacophore, it cannot be directly used to populate the entire diversity space that it defines.

Similarly, in the popular Chem-X/ChemDiverse diversity package, molecules are broken down into all accessible three-or four-point pharmacophores of triangular or tetrahedral functionality distances. If the model is used to display molecular diversity, coordinates in diversity space are assigned through the resulting string of accessible three-or four-point pharmacophores; this method has been shown to be highly effective in classifying molecules by pharmacological similarity. However, the mapping of complex 3D shapes to a set of triangular or tetrahedral functionality distances is an irreversible transformation; empty three-or four-point pharmacophores in Chem-X derived diversity space cannot be translated into a 3D template of a complex shape. Since a set of coordinates in Chem-X is insufficient to define the shape of"missing molecules,"Chem-X cannot be used to directly populate empty molecular diversity space.

Another example of current diversity methods is affinity fingerprinting, in which molecules are empirically assayed against a panel of 10-20 actual proteins selected to be promiscuous in their ability to bind small molecules. Position in molecular diversity space is assigned through the resulting string of ICSO binding values, and these affinity fingerprints provide unprecedented ability to group similarly active compounds in diversity space. However, because the actual mode

of binding in any assay is not incorporated in the resulting ICso value, the mapping of molecules to the selected protein panel is an irreversible transformation. Thus, an empty coordinate in affinity fingerprinting diversity space (an"unmatched"string of IC50s to a given protein panel) cannot be back- translated into a 3D molecular template. A similar affinity fingerprinting diversity method has been put into practice using a panel of computational surfaces of real- world protein pockets and a modified form of the DOCK program. While this method shows similar promise in its ability to detect pharmacological similarity, it is, like its empirical affinity fingerprinting counterpart, an irreversible mapping.

Thus, for the most part, current methods are able to successfully identify compounds of the same pharmacological class as being similar and compounds of different pharmacological classes as being different. Given a starting pharmacophore from known ligands and/or the target site of a target crystal structure, such methods interface well with the design of complementary combinatorial libraries.

The design of combinatorial libraries to cover all of diversity space is a rather different problem, however. In this case, it is not enough to be able to compare existing molecules for differences or similarities. In addition to being able to place molecules relative to one another in diversity space, one must be able to point to an absolute area of diversity space not yet covered and from its coordinates design a novel set of compounds to fill that uncovered space.

In order to rationally and systematically fill diversity space, an informationally reversible diversity model is needed. This model must be formulated such that: members (in this case molecules) can be assigned to coordinates for similarity/dissimilarity comparison, and empty coordinates retain the information necessary to directly generate coordinate membership.

One good path for such a model is to use as coordinates the exact information that differentiates one member from another, without intervening, irreversible transformations. To apply this reasoning to molecular diversity, it must first be asked: what are the criteria by which diversity of compounds is to be measured (what information differentiates one molecule from another). One of

the most fundamental criterion in molecular drug discovery is the extent to which two molecules have similar or different binding affinities to a given target. With the assumption that similar binding affinity tracks with a molecule's complementarity to similar target surfaces, we have selected as our criterion for diversity complementarity to a fully enumerated set of theoretical target surfaces.

Given the above definition of molecular diversity, it remains to provide parameters under which to define a biologically relevant basis set of enumerated theoretical target surfaces and to quantify molecular complementarity to a given theoretical target surface at a level which is both in accordance with known principles of molecular recognition and computationally applicable to millions of compounds. With a numerical determination of complementarity and a biologically relevant basis set of surfaces, molecular diversity space is thus absolutely established as the molecular complement to a fully enumerated set of theoretical target surfaces.

We introduce the concept quantized surface complementarity diversity (QSCD), which defines a molecule numerically by a mapping that describes its complementarity K to every distinct theoretical protein surface of resolution R not exceeding volume V with N sites of M types of chemical functionality PMN. K is defined as an algorithm that takes into account the molecular shape and chemical functionality of both the given molecule and the given theoretical protein surface. From this definition, it follows that a comparison of two molecules will yield a numerical difference that is representative of their complementarities: to the extent that two molecules each have complementarity for the same theoretical protein surfaces, the molecules are similar; to the extent that two molecules have no complementarity to common theoretical protein surfaces, the molecules are dissimilar. In other words,"similarity"between molecules is defined as the ability to complement the same theoretical protein surfaces and"difference"between molecules is defined as the ability to complement different theoretical protein surfaces.

Because QSCD uses complementarity to theoretical protein surfaces as a basis for categorization, both 3-D shape and molecular functionality are taken

into account. Also, because QSCD uses as its basis a complete set of theoretical protein surfaces (under R, V, and PMN), the method provides diversity information in a fixed frame of reference: coordinates in quantized surface complementarity diversity space are independent of any molecules or natural protein surfaces compared in that space. Because of this, not only can the diversity of disparate sets of molecules be compared without having to compare the molecules to each other directly, but the diversity of any set of actual natural protein surfaces can be examined through complementarity to the theoretical basis set. In addition, complementarity of molecules to a set of theoretical protein surfaces representative of actual natural protein surfaces can be examined in the context both of any other set of molecules and any other set of protein surfaces.

Furthermore, given a set of molecules, it becomes immediately apparent what percentage of theoretical protein surfaces are covered by complementary molecules, thus giving a measure of the set's molecular diversity in the space defined by all potential surfaces under R, V, and PMN. Not only does this provide a measure of diversity in an absolute sense which is not relative to any historically biased set of surfaces or molecules, but it also makes clear a set of theoretical target surfaces to which no molecules in the initial set bind, allowing a straightforward design of novel molecules to supplement the initial set.

Theoretical Target Surfaces In one implementation, to generate a finite set of theoretical target surfaces that approximates all possible binding pockets with volume equal to or less than V, we consider each theoretical surface to be formed by successively carving c ubic units out of an initially flat surface. These cubic units represent "negative space"that a potential ligand could occupy. Given cubic units with sides of length R (the resolution of the model), we use at most V/R3 negative space cubes to describe each theoretical target surface. Others have previously employed cubic units to successfully approximate complementarity between small molecules and individual protein surfaces.

The size of a negative space cube is directly related to the resolution and type of diversity data which the user desires as output. In choosing the size of the

negative space cube, one motivation is to maximize negative space cube size such that the difference of a single cube in a surface is highly differentiating in terms of molecular recognition (i. e., every surface is orthogonal to every other surface).

At the same time, enough information must be retained in each negative space cube to predict shape and functional complementarity at a ligand/surface interface. The former constraint minimizes overlap of diversity information while the latter constraint maximizes precision of diversity information. Together, the competing constraints result in a basis unit for the enumeration of theoretical target surfaces that minimizes the number of negative space cubes needed to accurately model diversity for a given volume V.

A resolution of 4.24 A negative space cubes was found by computer optimization of test molecules to provide an upper limit of cube size while still maintaining an acceptable level of molecular shape information. Interestingly, 4.24 A is the approximate VDW"cross-section"of a (CH2) n chain; a series of 4.24 A units neatly encapsulates a (CH2) n chain in its ground state conformation as shown in Figure 1.

In one implementation the basis set for diversity can be a set of theoretical target surfaces comprised of all possible shape combinations of 6 to 14 negative space cubes of resolution 4.24 A (negative volume between 460 and 1070 cubic A) subject to the following rules: Surfaces are created by successively"carving out"negative space cubes from a flat block of infinite width and depth (the theoretical target). All negative space cubes of a given surface must share at least one face with another negative space cube of the surface, and all must be part of a single, contiguous negative surface. No negative space cubes may be occluded in the +Z axis of the infinite surface block; that is, there may be no solid surface between any negative space cube and the surface plane of the infinite block. As shown in figure 2, the surface A is allowed, but the surface B is disallowed.

Surfaces duplicating a previous surface with respect to rotation in the X-Y plane are discarded.

The occlusion rule provides a compromise between complete coverage of topological possibilities and acceptable computational speed. This compromise

was made based on the topological assumption that occlusions of 4.24 A or more are infrequent in small molecule/target interactions, and that their omission would thus have only a small effect on predicting diversity of binding affinities of small molecules.

Applying the rules yields 49,268,918 unique negative surface shapes including chiral opposites. Covering a negative volume between 460 and 1070 cubic A, these surface shapes are deemed sufficient to examine diversity of most small molecules. For instance, examining a previously published reference set of pharmaceutically relevant compounds (a filtered Comprehensive Medicinal Chemistry or CMC database), 5049 out of 5120 compounds (98.6%) have a volume of 1070 cubic A or less.

Within each of the 49,268,918 unique negative surface shapes, each negative space cube is assigned a molecular property characteristic Pm that represents the dominant molecular environment which any atoms that are placed within that negative space will experience. Properties used are PI hydrophobic, P2 polarizable (includes aromatics), P3 H-bond acceptor, P4 H-bond donor, P5 H-bond donor/acceptor, P6 potentially positively charged (basic), and P7 potentially negatively charged (acidic). These seven types of molecular environments are assumed to represent a minimal basis set of factors that contributes to the electrostatic/VDW complementarity of a ligand and a target surface. In one implementation, four positions of particular molecular property P1-7 are assigned, leading to 74*N !/ ( (N-4)! *4!) surfaces for each surface shape of N negative space cubes. All other (N-4) cubes not assigned a particular molecular property are given property P8, slightly hydrophobic. The latter assignment is based on an assumption that hydrophobic effects are, on average, the largest single component contributing to ligand/target interaction.

In sum, the above process implies as a basis set for molecular diversity 1.1 * 1014 theoretical target surfaces of negative volume between 460 and 1070 cubic A and having four sites of specific molecular property characteristics P 1-7. The numerical breakdown of these 110 trillion surfaces is listed in Table 1.

Table 1: Numerical breakdown of the total number of theoretical target surfaces created using the algorithm given in the text. Surfaces consist of 6-14 negative space cubes and 4 sites of 7 possible molecular property characteristics. Number of functionally different surfaces per surface shape varies for infrequent cases in which a given shape has an axis of symmetry, so actual number of unique surfaces is slightly less than (# surface shapes) * 7 4*N !/ ( (N-4)! *4!).

Volume Number of Approx. number Exact number of (number unique of functionally unique surfaces (N) of surface different surfaces negative shapes per unique space surface shape: cubes) 74*N !/ ((N-4) ! *4!) 6 212 36, 015 7,163,338 7 885 84,035 73,271,443 8 3,959 168,070 655,324,488 9 17,747 302,526 5,350,917,208 10 81,407 504,210 40,912,578,322 11 375,897 792,330 297,622,676,624 12 1,753,218 1,188,495 979,379 13 8,224,443 1,716,715 14,116,888,070,845 14 38,811,150 2, 403,401 93,264,917,290,356 Total 6-14 653,272,003 One such surface is shown in Fig. 3.

A pseudocode description of an algorithm for determining the set of theoretical target surfaces is set forth in Appendix A.

Molecular Quantization To measure complementarity of small molecules to the basis set of theoretical target surfaces, the small molecules must be formatted in a similar frame of reference, for instance by quantizing them into positive space cubes ("quanta") of resolution 4.24 A according to the following process (illustrated in Fig. 4): A set of up to 100 minimized energy conformations within user-defined parameters is created. In one implementation, Tripos Multisearch modeling is used, and all conformations within 10 kcal of the lowest energy conformation found are accepted.

QuantizedMoteeuteProperties

For each conformation, a 4.24 A 3D grid of cubes (quanta) is aligned on top of the 3D structure using the molecule's principle axes of rotation (calculated with all atoms having mass 1).

To all 4.24 A quanta which contain at least a user-defined % of the VDW radius of any atom, a dominant molecular property characteristic is assigned based on connectivity rules (e. g. R- [C=O]-O-H yields P7, R-O-H yields P5; see definitions of Pl-P7 above). Order of dominance is from P7 to Pl, in order of maximum complementarity score obtainable by a given characteristic as shown in Table 2: Table 2: Relative magnitudes of parameters used in calculating molecular property interactions between negative space cubes (theoretical target surfaces) and positive space cubes (quantized molecules). Magnitudes (listed from highest to lowest): +++, ++, +, 0, -, --, ---. Theoretical Target (P7) (P6) (P5) (P4) (P3) (P2) (P1) Surface Properties neg pos hb don/acc hb donor hb acceptor polarizabl hydrophobic e ---+++0+-----(P7)negcharged (P6)pos charged +++---0 + 0 (P5) hb don/acc O O ++ (P4) hb donor + + ++ (P3)hb acceptor ++ 0---++0(P2)polarizable-- (P I) hydrophobic O + --0--00(P8)(surfaceonly)

Minimum % of VDW radius parameter allows for a user-defined protrusion beyond the surface of a quantum cube, adding a measure of topological"flexibility"to the quantization process. A user defined 32% was found to be especially good.

The total number of 4.24 A quanta that have been assigned a property characteristic is counted.

The grid alignment is shifted per user-defined parameters and the process is repeated until all shift combinations have been searched.

For each conformation in the original set, a"Q-file" (3D configuration of property-assigned quanta) is saved that has the lowest number of quanta in and is closest to the principle alignment.

Thus, an average molecule in this implementation is represented by 100 Q-files, each file consisting of N positive space cubes or quanta of 4.24 A resolution having an assigned molecular property characteristic Pm (m=1-7). A typical Q-file (molecule 6a) is shown in Fig. 4, superimposed upon its corresponding conformation. The process of optimization of quantization parameters is described later.

A pseudocode description of an algorithm for performing the quantization is set forth in Appendix B.

Mapping Given molecules which have been rendered into sets of Q-files, each quantized conformation can be mapped into the diversity space defined by the set of 1.1 x 1014 theoretical target surfaces. In general, the following process is used.

For each quantized conformation of each molecule, each of its 24 possible X/Y/Z rotations (6 faces * 4 rotations per face) is fit to each of the 49,268,918 available surface shapes. For a given conformation-to-surface shape fit, if at least a user-defined minimum number of negative and positive space cubes overlap (in one implementation either 9 quanta or N-2 quanta of a conformation of N quanta), and if no quanta of the conformation extend beyond the bounds of the surface shape except at the mouth of the surface shape, then the complementarity of the quantized conformation to all theoretical target surfaces of that shape is examined in detail as explained next. If the above conditions are not met, the next conformation is examined.

A score is generated for the complementarity of the given conformation to each theoretical target surface of a given shape from based on user-defined parameters. (The process of optimization of the complementarity parameters is described later.) The following complementarity parameters can be used: a) A negative parameter for each rotatable bond of the conformation. b) If conformational energies are calculated, a negative parameter for the energy of the conformation above the lowest energy conformation from that molecule.

c) A positive parameter for the hydrophobic energy gained by removing "water"from any hydrophobic (PI) or polarizable (P2) surface face of either the conformation or the theoretical surface. d) A positive parameter for the hydrophobic energy gained by removing "water"from any mildly hydrophobic (P8) surface face of the theoretical surface. e) A positive or negative molecular property interaction parameter for overlapping negative and positive space cubes as depicted in Table 2.

If and only if the resulting score meets a user-defined minimum, then the conformation (and thus the molecule it represents) is said to be complementary to the given theoretical target surface.

The computational advantage inherent in the process of molecule and surface quantization is realized in the speed of complementarity checking.

Whereas a traditional docking program must search a high-dimensional configuration space, the implementations of the invention resolve the problem to a framework bounded by 24 possible fitting orientations and a finite number of translations. This approximation allows three-dimensional diversity computation on a scale that is applicable to very large sets of molecules.

A pseudocode description of an algorithm for performing the mapping is set forth in Appendix C.

The above process results in a complementarity map that consists of a list of all theoretical target surfaces to which at least one conformation of a molecule is complementary. Comparison of these maps provides a novel method for measuring diversity of small molecules. We term the model on which this process is based quantized surface complementarity diversity (QSCD) because it calculates diversity by measuring complementarity to a quantized representation of theoretical target surfaces.

To maintain a computationally efficient complementarity scoring system, QSCD makes many approximations of molecular recognition. As explained, these include cubic units of 4.24 A resolution, gross approximations of surface contact area, exactly 4 points of 7 finite types of molecular property

characteristics, static theoretical surfaces, and a limited set (up to 100) of low energy conformers. Thus, the final complementarity scores are not presumed to give precise binding energies for any individual match of conformation to target surface. However, taken over all conformations of a molecule and across an enumerated set of theoretical target surfaces, the scoring system is statistically relevant as explained below.

Model Validation To test the validity of the QCSD model, i. e., its ability to predict the extent to which two molecules have similar or different binding affinities, eight sets of test molecules were analyzed (Fig. 5), seven of which were known to have binding affinities to seven distinct targets (in addition to a known overlap between sets 3 and 4). An eighth set with no known binding affinities was chosen with minor atomic and spatial changes to examine the sensitivity of the QCSD model at 4.24 A resolution. Known activities of the molecules in Fig. 5 are listed in Table 3 with references.

Table 3: Pharamacological activities of the molecules used in this study (see Fig.

5). a) Doherty et al. J. Med. Chem. b) Uehling et al. J.

Med. Chem. c) Chang et al. J Med. Chem. 1994,37, 4464-4478. d) Chang et al. J. Med. Chem. e) Tsutsumi et al. J. Med. Chem. 1994 37, 3492-3502. f) Penning et al. J. Med. Chem. 1995, 38,858-868. g) Cristalli etal. J. Med. Chem. h: numbers in parentheses indicate ICso in AT1 subtype assay of series 4. i: numbers in parentheses indicate ICso in AT2 subtype assay of series 3.

Assay IC50 or Ki (nm) Ref.

1 a Binding to Endothelin A Receptor 400 a 1 b Binding to Endothelin A Receptor 170 a 2a Inhibition of DNA fragmentation by 28 b Topoisomerase I 2b Inhibition of DNA fragmentation by 143 b Topoisomerase I 3a Binding to AT2 subtype of Angiotensin II 17 (0.45) h c Receptor 3b Binding to AT2 subtype of Angiotensin II 173 (31) h c Receptor 4a Binding to AT1 subtype of Angiotensin 11 0.85 d Receptor

4b Binding to AT1 subtype of Angiotensin 11 1.4 d Receptor 4c Binding to AT1 subtype of Angiotensin 11 1. 2 d Receptor (23,000)' 5a Inhibition of Prolylendopeptidase Protease 5 e Activity 5b Inhibition of Prolylendopeptidase Protease 10.3 e Activity 6a Binding to Leukotriene B4 Receptor 320 f 6b Binding to Leukotriene B4 Receptor 3.2 f 7a Binding to A2A Type Adenosine Receptor 6.3 g<BR> <BR> <BR> 7b Binding to A2A Type Adenosine Receptor 41.3 g 8a none-- 8b none-- 8c none-- 8d none-- 8e none-- The bulk of these molecules have previously been used as part of an in- depth study validating molecular descriptor approaches for the prediction of molecular diversity within compound classes. This is a more stringent discrimination than the base criterion sought for the QCSD model, which seeks at a minimum to show accurate diversity prediction between compound classes.

Conformations of all 20 test molecules were"quantized"and then mapped onto the basis set of 1.1 * 1014 theoretical surfaces. Complementary surfaces are tabulated for each molecule in Table 4.

Table 4: Tabulation of surface shapes and total number of theoretical target surfaces complementary to each molecule in Fig. 5.

Complementary Complementary surface shapes surfaces (shape plus functionality) 1 a 376 16,127,687 1 b 379 9,086,768 2a 27 545,584 2b 27 416,210 3a 414 4,970,816 3b 315 813,024

4a 487 4,542,463 4b 479 12,388,826 4c 482 7,595,982 5a 337 2,080,523 5b 374 1,966,837 6a 220 192,067 6b 186 153,436 7a 362 298,927 7b 269 22,367 8a 45 5,561,654 8b 41 3,959,678 8c 333 17,324,247 8d 64 2,059,546 8e 87 1,343,811 average: 4,572,523 There are many ways to analyze the resulting set of complementarity mappings.

Because in this case individual molecule comparisons were desired, each of the 20 mappings was compared pairwise for a total of 190 data points. Mappings were scored in similarity from 0 to 1000 based on a function of the number of theoretical surfaces in common: Score = SS * FS = ShapeScore * FunctionalityScore SS = 100 * # theoretical target surface shapes common to A & B total # surface shapes complementary either to A or to B FS theoretical target surface shapes common to A & B with at least I set of 4 common functionahies) ## total # theoretical target surface shapes common to A & B The first term in this equation gives a percentage measure (0-100) of shape similarity between molecules A and B, while the second term gives a measure from 0-10 of functional similarity per given shape overlap. The complete scores are detailed in Table 5 contained in Figure 22. Using this scoring system, the maximum score obtainable by very rigid, structurally similar molecules is 1000. However, many molecules can only be sampled by an examination of up to 100 low energy conformations (an average molecule w/5+

rotatable bonds will have at least 35 = 243 conformations). Thus, for most molecules with more than 100 accessible conformations, similarity scores between 0-100 are observed. The scoring constant D in the equation above adjusts the influence of functionality on scoring. A value of 0.33 was found to be optimal (as discussed below), meaning that shape is the dominant criterion in our measure of diversity. A pseudocode description of an algorithm for determining either the similarity of two molecules or the similarity of two libraries of molecular structures is set forth in Appendix D.

Fig. 6 shows a plot of all 190 pairings ranked by similarity score. Circles show"heterogeneous"pairs of expected dissimilarity (e. g. 2a, 6b), while squares show"homogeneous"pairs of expected similarity (e. g. 2a, 2b). Clearly, the QCSD model ranks homogeneous pairs almost exclusively higher than heterogeneous pairs; all 15 pharmacologically similar pairs fell within the top 20 scores out of 190. All homogeneous scores were ranked above 25, while the median score in this experiment was 2.8, showing good"signal to noise."The QCSD model is thus a valid predictor of target binding similarity among these molecules.

The pairings also reveal further validation. As might be expected from their relative rigidity (low number of accessible conformations) and structural similarity, the highest scoring pairs are 2a/2b, 8a/8b, and 8d/8e. Furthermore, examination of the pairings of 8c with 8a, b, d, e (triangles in Fig. 6, yellow in Table 5) yields scores that are within the top 20% of the pairing experiment but which are generally lower that the"homogeneous"pairs. This makes sense from a target-binding point of view, considering that one face of 8c contains a large molecular difference (an extra phenyl substituent). To the extent that this face is not involved in complementarity to a target surface, the molecules are similar; to the extent that this face must be complementary for binding to occur, the molecules are quite different. Fig. 7 shows one such case of a surface common to both 8a and 8c; the protruding phenyl substituent plays no role in complementarity.

As a rule, there is close similarity of shape and functionality for molecules which score 25 or higher. In addition to the pairings that one would expect, several other pairs scored between 25-35. When these molecules were examined by molecular modeling, significant overlaps were found, suggesting that these high scores are not just"noise"in the QCSD model. Fig. 8 depicts one such case between la and 5a; conformations of la and 5a are displayed that were found in the QCSD model to be complementary to the same surface (Fig. 8A, 8B). 3D overlays (Fig. 8C) confirm correlation of general shape and 4 points of functionality, although they also make clear the limits of resolution of complementarity information using 4.24 A units. As can be seen from Fig. 8, the surface in question can detect general shape and functional similarity, but by does not provide a basis to predict atom-for-atom overlap between molecules.

A final result comes from examination of the QCSD model's rankings of sets 3 and 4. While set 4 is known to bind exclusively to the AT1 subtype of the Angiotensin II receptor, set 3 is known to bind to both the AT1 subtype and the AT2 subtype. While the QCSD model found high similarity within sets 3 (score = 27) and 4 (avg. score = 54), it found an average similarity of 6.9 between 3a and set 4 and an average similarity of 3.3 between 3b and set 4 (diamonds in Fig.

6, green in Table 5 contained in Figure 22). Based on the QCSD model, one would therefore conclude that while sets 3 and 4 share a limited number of complementary theoretical surfaces, they are dissimilar with respect to the majority of theoretical target surfaces. This is in fact the case with the AT2 subtype of the Angiotensin II receptor, to which 4c binds 50,000 times more poorly than 3a (see Table 3).

Advantages of the QCSD model Having validated the basis set used for the QCSD model in the classification of molecular diversity, it must be noted that other models may do as well or better in detecting target binding similarity/dissimilarity between molecules. For instance, Fig. 9 and Table 6, contained in Figure 23, show the same set of 20 molecules ranked by Tanimoto similarity of standard 2D UNITY fingerprints (see discussion below). The data demonstrate that the 2D model is

equally capable of predicting pharmacologically similar pairs; UNITY ranks similarity between AT1 and AT2 subtype binders much higher than our QCSD model, although it finds unusually high similarity between 8a and 8c. In general, such 2D fingerprint descriptors have been found effective in clustering pharmacologically similar compounds, and are widely used in determining molecular diversity of existing structures.

Among the advantages of the QCSD model is the value of its negative information: The QCSD model determines not only diversity of existing structures, but also the structure of non-existing diversity. Given theoretical surface shapes for which no complements exist in a general screening library, QSCD allows the design of molecules to fill the given diversity void.

As stipulated in its formulation, the QSCD basis set is created through a reversible process. Although some information resolution may be lost in fixing the parameters of a cube's size and functional scope, information content is retained in either direction. Just as a single molecular conformation and orientation corresponds to a defined pattern in QSCD space, likewise, a single point in QSCD space (within the limits of volume V, resolution R, and N sites of functionality Pm) corresponds to a unique 3D shape with a defined 3D array of functionality. Given any starting set of molecules, unoccupied points in QSCD space directly define the molecular shapes and functionalities which those molecules do not cover. Thus, a set of detailed 3D molecular templates (at the resolution of the QSCD model used) is immediately available for the creation of novel molecules.

As an example, Fig. 10 shows a plot of all of the theoretical surface shapes covered by all of the conformations of all of the molecules used in the example implementation (see Fig. 5). The total volume of the cube in Fig. 10 encompasses all 49,268,918 theoretical surface shapes as listed in Table 1. As can be seen from the plot and two expanded points, many theoretical surface shapes are"unfilled"by the set of compounds shown in Fig. 5. Thus, in searching for molecules or libraries to enhance the diversity of the given set of compounds, the chemist is presented with a set of actual 3D templates into which

new compound libraries may be designed. In comparison, although mapping the same set of compounds in a"non-reversible"diversity space would also display a set of coordinates to which the molecules map, there would be no way to visualize the 3D shape of any point that was not filled by one of the compounds in the set. Using BCUT values for example (Fig. 11), the coordinates specified for an unfilled point leave the chemist with a set of normalized eigenvalues.

While these may give an idea of relative abundance of a given functionality (e. g.

H-Bond Donor) at this point in diversity space, the coordinates give no hint of what shape or class of molecules might fill that diversity void.

The above example shows how QSCD is a reversible diversity model with respect to molecular shape. Within a given surface shape in QSCD, there are many combinations of functionality leading to many different theoretical surfaces. If a given library fills only a portion of theoretical surfaces of a given surface shape, by following the same process outlined above and in Fig. 10, unfilled surfaces of specific shape and functionality may be identified and filled with complementary libraries. By using data-mining algorithms to analyze and intersect the shape and functionality of unfilled surfaces, a minimal set of "missing"3D combinatorial templates can be deduced from the QSCD mapping of a given set of general screening compounds. These templates represent the smallest number of combinatorial syntheses which need to be executed in order to fill out the diversity of the set of screening compounds. One such template is depicted in Fig. 12. In conjunction with the efficiency of core-based combinatorial chemistry, QSCD makes possible the contemplation of a "complete"library of screening molecules at a given resolution. The model thus offers a theoretical and practical answer to the problem of generating lead structures for genomic targets of unknown structure and function.

Technical details of an implementation For the example discussed above, molecular conformations were generated with Multisearch in Sybyl (version 6.5, Tripos Inc, 1699 S. Hanley Rd., St. Louis, MO, 63144) on an R10000 Silicon Graphics workstation.

Conformations were subsequently sorted by energy and conformations within 10

kcal of the lowest energy were accepted. Overlay plots of molecules (Fig. 8B) were also generated using Sybyl.

UNITY 2D fingerprints (Unity 4.0, Tripos Inc, 1699 S. Hanley Rd., St.

Louis, MO, 63144) were generated on an R10000 Silicon Graphics workstation.

Pairwise Tanimoto coefficients were computed as described by Dixon and Koehler.

QSCD software for molecule quantization, mapping of Q-files, and surface complementarity display was developed using the Java programming language (JDK 1.2) and the Java3D graphics API (version 1.1) on Intel-based workstations. Theoretical target surfaces were stored and indexed using an Oracle 7.3.3 database.

Parameters for theoretical target surface generation/molecular quantization and parameters for complementarity mapping/scoring were alternately optimized in three successive rounds as described below.

The parameters used for theoretical target surface generation and the closely related parameters for quantization of small molecules into quantized files (Q-files) were optimized in the context of the algorithms mentioned above.

Parameters were iteratively optimized by varying a given parameter and then quantizing training molecules other than those in Fig. 5. Training molecules used were taken from in house structures and two published SAR sets.

Concomitant with molecular quantization, an enumerated set of theoretical target surfaces was created with corresponding parameters. Using the current optimized complementarity/scoring parameters, molecules were then mapped to theoretical target surfaces and all diversity pairing scores generated as described in the text.

Parameters were chosen which accurately predicted known homogeneous/heterogeneous pairs and which maximized"signal to noise"of homogeneous scores over heterogeneous scores.

The parameters used for mapping/scoring molecular conformations to theoretical target surfaces were optimized in the context of the algorithm stated above. Parameters were iteratively optimized by varying a given parameter and then mapping a constant set of training molecules (see above) to a constant set of

theoretical target surfaces, using the most current surface generation and quantization parameters. Diversity pairing scores were generated for all training molecules, and parameters were chosen which accurately predicted known homogeneous/heterogeneous pairs and which maximized"signal to noise"of homogeneous scores over heterogeneous scores.

As mentioned earlier, for a given conformation-to-surface shape fit to be accepted, the minimum overlap requirement was set to either 9 quanta or N-2 quanta of a conformation of N quanta. This range allows large conformations to fit partially into a theoretical surface (protruding volume must be at the mouth of the surface) while also allowing smaller conformations to be considered for complementarity. It excludes large conformations which do not overlap at least 9 quanta.

Approximate computational speeds of typical QSCD operations are as follows on a single Pentium III 500 MHz workstation: Generation of the basis set of theoretical target surface used in the study required 17 min.; this data was stored for access by subsequent QSCD functions. Quantization of 100 conformations of a given molecule into 100 Q-files required 250 seconds.

Complementarity mapping of 100 Q-files onto the basis set of theoretical target surfaces used in the study required 40 seconds.

Algorithm for Designing Molecules for Unfilled Target Surfaces The following algorithm could be used to design novel molecules based on complementarity to unfilled theoretical target surfaces that are not complementary to any existing molecular conformations.

1) Existing molecules are quantized and those negative space cube target surfaces to which their conformations are complementary are identified. (see Appnedixes A, B, and C) 2) For a given set of existing molecules and a desired set of theoretical target surfaces with given shapes and functionalities, those theoretical target surfaces are identified to which no existing molecular conformations are complementary. Novel molecules are designed to be complementary to the above identified theoretical target surfaces as follows: a) Let the set of those theoretical target surfaces to which no existing molecular conformations are complementary = T.

b)'Cluster T into subsets Tl-Tn such that for each Ti there is a central set of negative space cubes of given functionality Ci such that all target surfaces in Ti can be created by adding up to E additional negative space cubes of given functionality to Ci at up to each of F points of attachment on Ci where each point of attachment Aj (j = 1-F) is a single face of Ci to which zero, one or a set of up to E negative space cubes may be added. (See Fig.

24) c) For each Ci, design a core molecule Mi that fills but does not extend beyond the space defined by Ci within a tolerance limit TOL, the core molecule furthermore complementary to the functionality of Ci, and the core molecule furthermore containing at least one combinatorial site (defined as a reactive site that can be further functionalized and/or extended in a combinatorial step under given chemical conditions) that can project potential combinatorial building blocks through at least one plane Aj. (See Fig. 25) d) Computationally enumerate the combinatorial library L (Mi, B) defined by Mi and a set of building blocks B that are each no larger in volume than E negative space cubes (See Fig. 26). e) Quantize nc conformations of each molecule in L (Mi, B) and determine the set W of all theoretical surfaces to which any conformation of any molecule in L (Mi, B) is complementary (see Appendixes B and C). f) Compute the set of target surfaces M = W n Tn g) If M contains an acceptable number of novel surfaces as defined by the user, then chemically synthesize the actual library L (Mi, B). Otherwise, choose a new Mi in step 3 for the given Ci and repeat until conditions in this step are met. h) Move on to the next Ci in step c.

Protein Diversity An extension of any diversity model based on an absolute frame of reference is that the same basis set may be used to classify actual proteins. By mapping onto the QSCD basis set all surfaces of volume V of a known protein, actual proteins can be compared and classified by their 3D binding sites. In addition to providing a diversity map of known protein binding sites within the universe of all theoretical protein surfaces under given parameters, the theoretical

surfaces of QSCD may thus be used to correlate protein classes to complementary molecular core structures.

Appendix E describes an algorithm for quantization of protein surfaces.

Appendix F describes an algorithm for comparing protein surfaces to determine a degree of similarity or dissimilarity. The following algorithm generates a set of files T of quantized protein binding surfaces which together represent the available surface of a given protein binding site.

Algorithm T (protein binding site, H, R, V, TolA, ToIB, Transinc, Rot, Rotvar, PM) Protein binding site: see 1. below H: resolution of scanning grid in angstroms, H<=R R: resolution of cube = length of cube side in angstroms V: total volume of each surface in cubic angstroms TolA: tolerance (%) of atomic radii ToIB: tolerance (%) of cube volume which must intersect convex hull Transinc: translational increment in angstroms Rot: # rotations (odd integer) Rotvar: rotational variance (%) PM: For each of V/R3 cubes used, any of M types of chemical functionality 1. Take a protein binding site file (obtained by any number of commercially available methods, such as running a"Connolly surface search"on a standard "PDB file" (Michael L. Connolly, 1259 El Camino Real, #184, Menlo Park, CA 94025) which minimally consists of : a) A calculated probable electron density surface of the binding site b) A list of all known atom types in the molecule with their coordinates and atomic radii c) A list of known connectivities of all atoms with the type of bond connecting each atom 2. Overlay flexible 2D square grid (s) of grid size HxH angstroms over all calculated probable electron density surfaces in the protein binding site 3. Calculate the convex hull of the set of points defined by the points of each grid nexus 4. Examine each grid nexus one at a time 5. At a given grid nexus place a cube with the center of one face tangent to the probable electron density surface of the protein binding site

6. If the cube contains a protein atom coordinate or any atomic radii of protein atoms protrude into the trial cube by more than Tol % of their atomic radius, then step 7., otherwise step 8.

7. Translate the cube Transinc angstroms away from the grid nexus on the probable electron density surface while keeping the center of the cube on a line perpendicular to the probable electron density surface at the the grid nexus. If the cube is now R angstroms away from the grid nexus, take the next nexus in 4., otherwise return to step 6.

8. Designate the cube as being a set cube with 5 unchecked faces and 1 checked face (the checked face being that which faces the grid nexus being examined).

Designate the initial frame of reference to be the X, Y, and Z axes co-linear with the sides of the cube.

9. Rotate the unchecked cube about its center in all combinations (total of Rot combinations) of the following units: a) X rotations: any one of Rot units (degrees) from-Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot b) Y rotations: any one of Rot units (degrees) from-Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot c) Z rotations: any one of Rot units (degrees) from-Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot 10. For a given rotated system from 9., place face contiguous trial cubes of side length R at all unchecked cube face (s), not to exceed the nearest integer to V/R3 total cubes. If addition of a new trial cube would exceed the nearest integer to V/R3 total cubes, proceed directly to 11 without adding further trial cubes 11. All unchecked cube faces (not including cube faces on trial cubes) become checked cube faces.

12. If a trial cube contains a protein atom coordinate, or any atomic radii of protein atoms protrude into the trial cube by more than TolA of their atomic radius, or if the cube does not intersect with a volume of the convex hull equal to at least R3* TolB (see 3. above), then the trial cube is removed.

Otherwise the trial cube becomes a set cube (with unchecked faces).

13. If the total number of set cubes becomes = the nearest integer to V/R3 (yielding a resulting cube combination of the nearest integer to V/R3 cubes), then go to step 15.

14. If no trial cubes in step 12 became set cubes and there are no unchecked cube faces remaining then start with the next rotated cube in step 9. If no rotations produce resulting cube combinations in step 13. then start with the next translation in step 7.

15. Ignore any remaining rotations from step 9. Designate all cubes as negative space cubes fully enclose except as detailed below. Designate the layer of cubes which is a) perpendicular to the line perpendicular to the probable electron density surface at the grid nexus being examined b) farthest from the grid nexus as negative space cubes which are open at their faces farthest from the grid nexus and perpendicular to the line perpendicular to the probable electron density surface at the grid nexus.

16. Based on the types of proximal atoms in the protein surrounding each negative space cube, and based upon the bonds which these proximal atoms form and the other atoms to which these proximal atoms are bonded, ascribe to each negative space cube one and only one of M types of chemical functionality.

Types of functionality M may include but are not limited to: Acidic regions, Basic regions, regions of formal charge +1, regions of formal charge-1, regions of partial charge between +0.5 and +1, regions of partial charge between-0.5 and-1, regions of partial charge between 0 and +0.5, regions of partial charge between 0 and-0.5, hydrophobic regions, polarizable regions, hydrogen bond donating regions hydrogen bond accepting regions hydrogen bond donating/accepting regions 17. Yield a"quantized"protein binding surface in terms of the nearest integer to V/R3 negative space cubes of side length R with any one of M types of functionality per negative space cube.

18. Return to step 4. for each grid nexus 19. Compare all quantized surfaces from 17. and remove any which are identical

20. Yield a set of files T of quantized protein binding surfaces which together represent the available surfaces of the given protein binding site Comparing Molecules to Proteins Appendix G describes an algorithm for determining the complementarity of a library of molecules to a set of protein surfaces.

Algorithm for Designing Molecules for a Set of Protein Target Surfaces In another procedure, novel molecules (for testing as ligands for proteins) can be designed based on complementarity to negative space cube targets to which a set of protein pockets map. The following outline describes the steps for doing so: 1) A set of existing protein pockets is quantized and those negative space cube target surfaces to which their quantizations map are identified. (See Appendixes B and C.) 2) Novel molecules are designed to be complementary to the above identified set of target surfaces as follows: a) Let the set of target surfaces = T. b) Cluster T into subsets Tl-Tn such that for each Ti there is a central set of negative space cubes of given functionality Ci such that all target surfaces in Ti can be created by adding up to E additional negative space cubes of given functionality to Ci at up to each of F points of attachment on Ci where each point of attachment Aj (j = 1-F) is a single face of Ci to which zero, one or a set of up to E negative space cubes may be added. (See Fig. 27) c) For each Ci, design a core molecule Mi that fills but does not extend beyond the space defined by Ci within a tolerance limit TOL, the core molecule furthermore complementary to the functionality of Ci, and the core molecule furthermore containing at least one combinatorial site (defined as a reactive site that can be further functionalized and/or extended in a combinatorial step under given chemical conditions) that can project potential combinatorial building blocks through at least one plane Aj. (See Fig. 28)

d) Computationally enumerate the combinatorial library L (Mi, B) defined by Mi and a set of building blocks B that are each no larger in volume than E negative space cubes (See Fig. 29). e) Quantize nc conformations of each molecule in L (Mi, B) and determine the set W of all theoretical surfaces to which any conformation of any molecule in L (Mi, B) is complementary (See Appendixes B and C.) f) Compute the set of target surfaces M = W fl Tn g) If M contains an acceptable number of novel surfaces as defined by the user, then chemically synthesize the actual library L (Mi, B). Otherwise, choose a new Mi in step 3 for the given Ci and repeat until conditions in step 7 are met. h) Move on to the next Ci in step 3.

Parameter Values Appendix H contains example parameter values useful in connection with the algorithms described in Appendixes A through G.

Further Extensions As mentioned, a 4.24 A cube was found to be the largest predictive unit size of diversity measure for our criteria of designing general screening libraries.

For example, both 4.48 and 4.00 A units gave poorer prediction of homogeneous/heterogeneous pairs than the pairings of Fig. 6 (4.24 A units). This is likely due to the fact that most organic small molecules are themselves quantized by a limited basis set: the VDW radii of H, C, N, O and a few other atoms (see for example Fig. 1). If there is no constraint on size of cubic units, however (i. e., if there is no attempt to maximize orthogonality of theoretical target surfaces), other unit measures of diversity can be found. A unit of 2.12 A should also provide effective diversity information but at a much higher resolution. Such a"high-resolution"adaptation of QSCD brings with it numerical (and thus computational) challenges. 112 negative space cubes (14 x 8) are now required at the upper limit of theoretical target surface size, translating to exponentially greater numbers of theoretical target surfaces and, depending on

the stringency of fitting parameters, correspondingly greater numbers of surface fits per molecule as in Table 4. At this resolution, the assumption of no occlusions in theoretical target surfaces becomes less valid, and removal of this assumption increases computational complexity further.

A corollary of any absolute diversity model is a prediction of the total "size"of diversity space in terms of unique molecular points. In other words, what is the minimum set of molecules needed to fully cover a given diversity space. This calculation is dependent on two factors: the resolution stipulated in the model (e. g., what amount of molecular change is recognized as different) and the maximum values of each dimension of the model's basis axes. In the model of QSCD discussed above, resolution is fixed by cubic units of 4.24 A, and maximum values are fixed at 14 units (molecular volume of 1070 cubic A) and 4 points of 7 types of molecular property characteristics. As describe above, the result is a set of 1.1* 1014 unique molecular points. Since, using the parameters of this study, an average molecule covers 4.6 million of the unique molecular <BR> <BR> <BR> <BR> points bounded by QSCD space (Table 4), the model predicts a minimum 1.1 * 1014/4.6 * 106 = 24 million molecules would be necessary to completely cover diversity space.

We estimate that an average complementary molecule in the context of the QSCD model has a AG of complementarity on the order of-11 kcal (Table 7).

Table 7. Summation of binding energies for an interaction of an average complementary molecule/theoretical target surface pair in the context of the QSCD model used herein. An average molecule is assumed to have a buried volume of 12 cubic quanta (= 915 cubic A at 4.24 A resolution), 36 exposed faces (4.24 A square), 21 non-polar exposed faces (60%), 10 rotatable bonds, 4 points of complementary electrostatic/VDW potential, and a conformational energy within 2 kcal/mol of ground state. An average complementary theoretical target surface is also assumed to have 60% non-polar exposed faces. Constants used in the table are taken from Ajay and Murcko.

Energetic contribution Average AG

translational/vibrational entropic loss (constant) +9 kcal/mol constant +0.7 kcal/mol (RTln3) per rotatable +7 kcal/mol bond; assume rigid theoretical target surface AAG conformation from ground state +2 kcal/mol constant-0.03 kcal/mol per square A non-polar-23 kcal/mol buried surface =-0.54 kcal/mol per 4.24 A square non-polar buried face; total 21 molecular faces + 21 theoretical surface faces total interaction from Table 2-6.0 kcal/mol (four complementary points) sum of binding energies-11 kcal/mol binding affinity to nearest integer (. 1. 363) 10-8 = 10 nanomolar In other words, the resolution used to calculate diversity translates roughly to nanomolar binding conditions for an average molecule/target surface pair. Given that some 24 million molecules are needed to completely cover diversity space under these conditions, a general screening library guaranteed to contain at least one nanomolar binder to any given target of interest would thus number at least 24 million molecules. This is a large number and will be attenuated by the fact that some molecules have significantly more than 100 conformations available to them. However, the QSCD model suggests that if, in the near future, combinatorial chemistry and high-throughput screening are to generate initial hits primarily in the nanomolar rather than micromolar range, then the field must continue to focus its efforts on the development of numerically competent synthesis and screening technologies.

By defining molecules in terms of their complementarity to a fully enumerated set of theoretical protein surfaces under given parameters, and by defining actual protein surfaces in terms of their similarity to the same set of theoretical protein surfaces, the model allows:

1. Numerical prediction of protein surface diversity (how many different types of possible protein surfaces exist under a given set of parameters).

2. Numerical prediction of how many and what type of molecules would be necessary to create a"universal molecular library" (a library which contains at least one complement [of a given score] to any given protein surface).

3. Comparison of the similarity or difference of molecules or sets of molecules based on their complementarity to theoretical protein surfaces. The frame of reference for such comparisons is fixed no matter how many or what types of molecules are involved.

4. Numerical prediction of the percent of protein surface diversity to which a given set of molecules is complementary.

5. Comparison of the similarity or difference of actual protein surfaces or sets of surfaces based on their similarity to theoretical protein surfaces. The frame of reference for such comparisons is fixed no matter how many or what types of protein surfaces are involved.

6. Numerical prediction of the percent of protein surface diversity to which a given set of actual protein surfaces is similar.

7. Prediction of the actual protein surfaces to which a given molecule is complementary.

8. Prediction of how many and what type of molecules would be necessary to create a"universal molecular library"against a given set of actual protein surfaces (a library which contains at least one complement [of a given score] to each actual protein surface).

This application discloses information from an article which has been accepted for publication in the peer-reviewed Journal of Medicinal Chemistry and is currently scheduled for publication in the May 18th issue. The article, listed by the Journal of Medicinal Chemistry as JM990504B, is titled"Quantized surface complementarity diversity (QSCD): A model based on small molecule- target complementarity,"and is incorporated herein by reference.

Other implementations are within the scope of the following claims.

Appendix A Theoretical Surfaces A surface opening O is a set of lattice squares in Z2 denoted by their comers: {(xi,yi)##2,i=1...m}O= By definition, a surface opening must be connected. For connectivity purposes, two points p, q E Z2 are neighbors iff<BR> <BR> <BR> <BR> <BR> <BR> <BR> #px#px- #py-qy#=1+ The area of a surface opening is defined to be the number of lattice squares it contains.

Surface openings are considered to be unique upto translations and rotations of the x-y plane.

A surface shape S is a set of"negative space"cubes represented as lattice cubes in 23 denoted by their comers:<BR> <BR> <BR> <BR> <BR> <BR> {(xi,yi,zi)##3,i=1...n}S= By definition, a surface shape must satisfy the following conditions: <BR> <BR> <BR> <BR> # All cubes are below the x-y plane. That is, if (x, y, z) E S, then z < 0.<BR> <BR> <BR> <BR> <BR> <BR> <P> 'The surface shape is connected. For connectivity purposes, two cubes p, q E #3 are neighbors iff <BR> <BR> <BR> <BR> qx#+#py-qy#+#pz-qz#=1#px- <BR> <BR> <BR> <BR> <BR> <BR> <BR> * There are no occlusions along the z-axis. That is, if (a y, z) E S and z < 0, then (x, y, z + 1) E S.

The volume of a surface shape is defined to be the number of lattice cubes it contains.

Surface shapes are considered to be unique up to translations and rotations of the x-y plane.

Every surface shape specifies a unique opening: opening (S) = {(x,y)#(x,y,z) # S} APPENDIX A. THEORETTCAL SURFACES Further, given a surface opening O and a function d: O # N, a unique surface shape with the given opening is defined: shape (O, d) = { (x, y,z)#(x,y) # O, d (x, y) >-z} The function d specifies the"depth"of the surface shape at each opening point. Sample surface shapes and openings can be seen in Figure 13.

A theoretical surface consists of a surface shape where the cubes in the surface <BR> <BR> shape are each associated with functionality. The set F of seven specific types of characteristic functionality is used: Fi: Potential Negative Charge <BR> <BR> <BR> <BR> PotentialPositiveCharge#F2: <BR> <BR> <BR> <BR> <BR> ; F3 : Hydrogen Bond Donor/Acceptor<BR> <BR> <BR> <BR> <BR> F4: Hydrogen Bond Donor<BR> <BR> <BR> <BR> <BR> F5: Hydrogen Bond Acceptor<BR> <BR> <BR> <BR> <BR> JF6 : Polarizable 57 : Hydrophobic 8 : Slightly Hydrophobic A functionality map f: S @ F defines the assignment. By default, all cubes are <BR> <BR> assigned functionality. 58, and all possibilties are considered where upto nf of the cubes are given one of the functionalities F1-F7.

Generating all surface shapes is accomplished by the following steps: 1. Generate all surface openings (S URFACEOPENINGS).

2. Filter the surface openings to remove openings that are unlikely to resemble surfaces found in nature (OPENINGFILTER).

3. From the set of filtered openings, generate all surfaces shapes whose associated openings are in the set (S URFACESHAPES).

4. Filter the set of surfaces shapes to remove shapes unlikely to resemble surfaces found in nature (SHAPEFILTER).

5. From the set of filtered surface shapes, add all possible combinations of func- tionality to define a set of theoretical surfaces (FUNCTIONALIZESURFACE).

APPENDIX A. THEORETICAL SURFACES Algorithm A. 1 SURFACEOPENINGS (A): calculate all surface openings with area less than or equal to A.

1: define N1 to to be set containing containing only the surface opening with a single square at (0,0) 2: fori~2toAdo 3: Ni##{Ni will ultimately contain all openings of area i} 4: for all O E Ni_l do 5: define P to the set of all possible openings obtained by adding a single square to O adjacent to a square already present in O 6: P#pdoall 7: if no rotation or translation of P is present in Ni then 8: add P to N 9: end if 10: end for 11 : end for 12 : end for 13: O#Ui=1A Ni 14:(#) APPENDIX A. THEORETICAL SURFACES Algorithm A. 2 OPENINGFILTER(@, At, Mnc, MC): filter a set of surface openings 0 using area-threshold parameter At, max-non-central parameter MnC and max- contiguous parameter Me. <BR> <BR> <BR> <P> 1:###<BR> <BR> <BR> <BR> <BR> 2: for all O E 0 do<BR> <BR> <BR> <BR> <BR> 3:### {delete from # all lattice squares with exactly one neighbor that has 2 or more other neighbors} 4: for all (x, y) ## do 5: N {(x-1,y),(x+1,y),(x,y-1),(x,y+1)]## =1then6:if#N# 7: define (#, #) to be the single element in N 8: # {(#-1,#),(#+1,#),(#,#-1),(#,#+1)}## <BR> <BR> <BR> <BR> >2then9:if### 10: remove (x, y) from O if11:end 12: end if 13: end for {delete from 0 all 2x2 blocks of lattice squares in 0} 14: for all (x,y) # O do 15 : if (x + 1,y) # O and (x, y + 1) # O and (x + 1, y + 1) E 0 then (x,y),(x+1,y),(x,y+1),(x+1,y+1)fromO16:remove 17 : end if 18 : end for 19 : define c to be the number of squares in the largest connected component left in 6<BR> <BR> <BR> <BR> <BR> 20: if area(#) # Mnc and (area (O) < At or c # Ue) then 21: add 6 to 6 22: end if 23: end for 24: return (#) APPENDIX A. THEORETICAL SURFACES Algorithm A. 3 SURFACESHAPES (#, V): calculate all surface shapes with openings in the set 0 and volume less than or equal to V. <BR> <BR> <BR> allO#Odo1:for <BR> <BR> <BR> <BR> <BR> 2: define d: O # N such that d (x, y) = 1 3: So # # {So will contain all surfaces with opening 0} 4: v # area (O) 5: while v < V do 6: S # shape(O, d) 7: if So contains no translations or rotations of S then 8: add S to So 9: end if 10: for z | area (O) to area (O), y # -area (O) to area (O) do 11: Oandv+1#Vthen# 12: v+1,d(x,y)#d(x,y)+1# 13 : goto step 5 14: else #v-d(x,y)+1,d(x,y)#115:v 16: end if 17: end for 18 : goto step t {no more surfaces left with this opening} 19: end while 20: end for 21: UOEOSo# 22: return (u) Algorithm A. 4 SHAPEFILTER (S, Me): filter a set of surfaces S using max-extrusion parameter self. <BR> <BR> <BR> <P> 1: ## <BR> <BR> <BR> <BR> <BR> <BR> 2: for all S E S do<BR> <BR> <BR> <BR> <BR> 3: 0 opening (S)<BR> <BR> <BR> <BR> <BR> 4: d depth (S) {corresponding depth function}<BR> <BR> <BR> <BR> <BR> <BR> 5: for all (x, y) E O do 6: for all (#,#) # {(x - 1,y),(x + 1,y),(x,y - 1),(x,y + 1)} do (#,#)#Oandd(#,#)#d(x,y)-Methen7:if 8: goto step 5 9: end if 10: end for t 1: goto step 2 {surface does not qualify} 12: end for 13: add S to surface does qualify} :: end for (#)15:return APPENDIX A. THEORETICAL SURFACES Algorithm A. 5 FUNCTIONALIZESURFACE (S, nf): Return the set of theoretical sur- faces defined by surface shape S with nf points of specific functionality attached.

1: S # #{set of functionalized surfaces} 2: define V to be the set of all the (volume(S)) selections of nf cubes from S<BR> nf<BR> <BR> <BR> 3: for all V E V do<BR> <BR> <BR> 4: define f: S # F such that f(s) = F8 if s # V, f(s) = F1 for s E V 5: number the elements of V as v1,..., vnf 6: loop <BR> <BR> 7: add (S, f) to S<BR> <BR> <BR> <BR> 8: for i +-1 to do 9: if f(vi) # F7 then 10 : assign f (vi) to be the next higher functionality 11: goto step 6 12: else 13: f(vi) # F1 <BR> <BR> 14: end if<BR> <BR> <BR> <BR> 15: end for<BR> <BR> <BR> 16: goto step 3 {no more functionality assigments left with the set V of cubes} : end loop 18: end for 19 : return (S) Appendix B Quantization In the quantization process a molecule is reduced into a representation such that com- plementarity can be calculated against a set of theoretical target surfaces. Quantization takes place in the following steps : 1. Each atom in the molecule is assigned a functionality based on its type and con- nectivity.

2. Three dimensional conformations of the molecular structure are generated.

3. Each conformation is converted into"positive space"cubes based on the posi- tions of its atoms.

4. Each cube is assigned a functionality based on the atoms that it contains.

Finally, the quantized form of the molecule is defined to be the set of all selected conformation quantizations with functionalities assigned. The quantization process is summarized in Figure 14.

APPENDIX B. QUANTIZATION B. 1 Atomic Functionality Assignment <BR> <BR> <BR> <BR> <BR> Given a molecular structure as a set of atoms M, a map f : M is defined assigning functionalities to all of the atoms. The map is defined by the using a set of rules to match molecular substructures based on extended atom types (as generated by Tripos, for example) and bonding patems that encapsulate each functionality type.

The algorithm keeps track of atoms that are excluded from matching a lower priority rule because they have already been matched in a higher priority rule, where has the highest priority and. the lowest. Atoms not matching any rule are assigned functionality 57. No atoms are assigned functionality 8. The functionality rules used can be seen as follows: s Figure 16 <BR> <BR> <BR> <BR> <BR> 52 Figure 17<BR> <BR> <BR> <BR> <BR> <BR> 18#F3:Figure <BR> <BR> <BR> <BR> <BR> <BR> <BR> 19#F4:Figure <BR> <BR> <BR> <BR> <BR> <BR> <BR> 20#F5:Figure <BR> <BR> <BR> <BR> <BR> <BR> 21#F6:Figure Within each functionality type, the functionality rules are search sequentially in the order listed in Figures 16-21. Figure 15 provides a legend for the symbols used in the functionality rule diagrams.

Algorithm B. 1 ATOMFUNCTIONALMAP (M) : assign functionality to the atoms in molecular structure M and determine which atoms are excluded from quantization. i: E,-0 IS, is set set atoms atoms that excluded excluded consideration.} 2: E,-0E, is the set of atoms that are excluded from the rest of the quantization process.} <BR> <BR> 3: define J : MfF such that f (a) = F- for all atoms a 6. M.<BR> <BR> <BR> <BR> <P> 4: fori + l to6 do<BR> <BR> <BR> <BR> 5: for each functionality rule for Fi considered in the proper order do<BR> <BR> <BR> 6 : find all matches in M with consideration to £c<BR> <BR> <BR> <BR> 7 : according to the rule, add appropriate atoms to S, and ## ; and set f to : F, for selected atoms 8: end for 9: end for io: return (f, Eq) APPENDIX B. QUANTIZATION<BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> B. 2 Conformation Generation<BR> <BR> <BR> <BR> <BR> <BR> A conformation is a mapping c: M R3 of the molecular structure into three dimensional space. Using OpenEye Omega software (Open Eye Scientific Software Inc., 335c Winische Way, Santa Fe, NM, 87501), upto nc representation conformations for each molecule are generated within given rule-based energy parameters.

APPENDIX B. QUANTIZATION B. 3 Base Coordinate Frame <BR> <BR> <BR> <BR> <BR> A coordinate frame T = (R. t): 23) 23 is a rigid motion of the space defined by a<BR> <BR> <BR> <BR> rotation R E S03 (R) and a translation t E R3 that transforms a point p by the rule: T (p) = Rp + t Given a resolution r (the length of a side of a lattice cube) and a coordinate frame, a lattice on RI is implicitly defined by q E Z3 # [rqx,rqx+ r) x [rqy, rqy + r) x [rqz, rqz + r) C R3 The base coordinate frame is generated from a conformation of molecular structure M. First, a subset M of the atoms in M are selected via F RAMEATOMS. These are atoms in or near ring structures close to the center of the conformation. A ring atom is defined to be an atom that contains at least one bond which, if removed, would not result in the molecule being disconnected. If there are an insufficient number of ring atoms, all atoms sufficently close to the center of the conformation are used.

Then, the base coordinate frame is calculated in B ASEFRAME. The x-axis of the base coordinate frame is defined to to be the solution to the optimization problem: is the center of the conformation. Due to the non-linear nature of this problem, it is solved using an approximate gradient descent method. Given x, the y-axis of the base coordinate frame is defined to be the solution of a second optimization problem: Given x and y, z is uniquely specified. The base coordinate frame then simply involves <BR> <BR> centering the conformation by translating p to the origin, and using the new z, y, and z axes.

APPENDIX B. QUANTIZATION <BR> <BR> <BR> <BR> <BR> <BR> Algorithm B. 2 FRAMEATOMS (M, c, rf, rm, q,): select a set of atoms to use for gen- erating the coordinate frame from a molecular structure M and a conformation c. The following parameters are used: ring factor rf, ring minimum rm, radius factor q,.

1: R # # {R will hold the set of ring atoms} 2: for all a E M do 3: if a is not a Hydrogen atom and a is in a ring then 4: add a to R 5: end if 6: end for {the mathematical center of the conformation} 8: r # maxa#M #c(a) - p# {the maximum distance trom the center of the contor- mation} 9 : C # # {a placeholder for atoms in rings we have already examined} tu : Rg # #{a set of rings which are close to the center of the conformation} i: for all a E 7Z do t2 if #c(a) - p# # rqr and a # C then {if a ring atom is sufficiently close to the center of the conformation, add all members of its ring group} #RINGGROUP(a,R,rf)13:# 4: add # to 7 ?-g 5 S C-CUC 6 : end if 7 : end for <BR> <BR> <BR> #{a#Msuchthat#c(a-p##rqr}{adefaultsetofatomsclosetothe18:D center, to be used if we haven't collected enough ring atoms} 19 : ifRg = 0 then {if there are no ring groups, use the default} 20: return (D) 21: end if 22 : if max##Rg ### < rm then {if the biggest ring group is smaller that rm, use the default} 23 : return (D) 24: end if 25 : Bg # Ug#Rg,###>6#{a set of"big"ring groups, having at least 6 atoms} 26: # 0<BR> <BR> <BR> <BR> <BR> <BR> 27: for all a E U##Bg # do {use all "big" ring group atoms and any neighboring atoms} 28: add a and all atoms bonded to a to #, if not already present 29: end for 30: return APPENDIX B. QUANTIZATION Algorithm B. 3 RINGGROUP (a,R,rf) : calculate the set of atoms that can be reached starting at atom a and crossing over at most rf atoms that are not in the set of ring atoms 7Z. <BR> <BR> <P> #01:d <BR> <BR> <BR> <BR> 2: {a},Tn### <BR> <BR> <BR> <BR> 3: {a},##{a}# 4: while d < rf do 5: if Th = 0 then 6: d +1d 7: Tn,Tn### 8: else 9: # pop(Th) 10: if a is not a Hydrogen atom and â # # then for all atoms o bonded to â do 12: if o is not a Hydrogen atom and o # # then 13: add o to # 14: if o # R then 15 : add o to 9 otoTh16:add 17:else <BR> <BR> 18: add o to T,<BR> <BR> <BR> <BR> 19: end if 20: end if 21: end for 22: end if 23: end if 24: end while 25: return (9) APPENDIX B. QUANTIZATION Algorithm B. 4 BASEFRAME (#. c): compute the base coordinate frame given a set of atoms M and a conformation c.

2: define u :# # R3 by u (a) = c (a)-p 3: di- 4: for all a E M do 5: v # u(a)/#u(a)# 6: loop 7: s LOeM sign (vTu (o)) u (o) 8: s/#s## 9: for all o E M do 10: if sign(sTu(o)) # 0 and sign (sTu (o)): sign (vTu (o)) then #s11:v 12: goto step 6 if13:end 14: end for 15: s# step1816:goto 17 : end loop 20: end if 21: end for 22:(0,0,0)# 23: for all a E M do 24: v 4--di x (u (a) d 1 u (a) dl) 25: u/tH ! 26 : s (, O"sign (vTu (o)) u (o) 27 : s (s-d'sdl) llls-dtsdill 28: if sign (sTu (o)) = sign (vTu (o)) for all o E M such that sign (sTu (o)) # 0 then 29: if Ea |sTu (o) l > Eai d2Tu (o) | 30: d2 + s 31: end if 32: end if 33: end for 34:[d1,d2,d1#d2]T# 35: return (R,-Rp) APPENDIX B. QUANTIZATION B. 4 Lattice Centering By our definition above, the lattice defined by a coordinate frame places comers of cubes at points all of whose coordinates are integer multiples of r. Given a particular conformation, it may be better to shift the lattice by a length of r/2 in a particular direction, recentering the lattice cubes.

Algorithm B. 5 CENTERFRAME (T, M, c, cf, ct): recenter the coordinate frame T for the set of atoms M and conformation c using parameters: centering fraction cf, reso- lution r, centering tolerance ct.

#[cf#M#]1:l 2: set xl to be the lth lowest element in the set #M}a 3: set y, to be the lth lowest element in the set {IT (c (a)) yl, a E M}<BR> <BR> <BR> 4: set zl to be the lth lowest element in the set {#T (c (a)) z#, a E M} 5 : A # {a # M such that IT (c (a)) l # xl, #T(c (a)) yl # y@, #T (c (a)) z # zl} 6: define v1 (r/2. r/2, r/2) v2 (0, r/2, r/2) v3v3# r/2)0. v4 # (0, 0, r/2) v5 ~ (r/2, r/2,0) v6 # (0, r/2,0) v7 # (r/2, 0,0) us (0, 0. 0) 7: #1to8doi 8: define the coordinate frame Ti (p) = p + vi 9: CUBIFY(M,c,TioT,r,ct)# <BR> <BR> 10: #Qi## <BR> <BR> <BR> #maxq#Qiqx-minq#Qiqx11:dx,i <BR> <BR> <BR> <BR> #maxq#Q,qy-minq#Qiqy12:dy,i <BR> <BR> <BR> #maxq#Qiqz-minq#Qiqz13:dz,i <BR> <BR> <BR> <BR> 14: end for 15 : define j to be the index of the least member of the set { (ni,dz,i,dy,i,dx,i}, where comparisons are done using a dictionary ordering (that is, compare the first com- ponent, if case of equality compare the second component, etc.) (Tj)16:return APPENDIX B. QUANT ON B. 5 Lattice Perturbations The base coordinate frame is not necessarily optimal for quantization, so a set of "close"frames are also examined. By parameterizing the space of coordinate frames S03 (1fS) x R3 into 3 rotational dimensions and 3 translational dimensions and taking nr equal steps in each rotational dimension and nt equal steps in each translational di- mension, a set of nr3nt3 perturbation frames is built. These can later be composed with the base coordinate frame to give the desired set.

Algorithm B. 6 PERTURBATIONFRAMES(nt,vt,nr,vr) : generate a set of perturbation coordinates frames using parameters: resolution r, number of translations nt, transla- tional variance vt, number of rotations nr, rotational variance vr. t: 7'<-0 2: for iz O to nr-1 do 3: define R. to be the coordinate frame corresponding to rotation about the x-axis by 1-nr)/(4nr)radians+ 4 : for iy # 0 to nr-1 do 5: define Ry to be the coordinate frame corresponding to rotation about the y- axis by 7TMr (2iy + 1 - nr)/(4nr) radians 6: #0tonr-1doiz <BR> <BR> 7 : define R, to be the coordinate frame corresponding to rotation about the<BR> <BR> <BR> <BR> z-axis by 1rUr (2iz + 1 - nr)/(4nr) radians 8: #0tont-1dojx 9: define the coordinate frame p+(rvt(2jx+1-nr)/(2nr),0,0)Tx(p)= 10: #0tont-1dojy li : define the coordinate frame p+(0,rvt(2jy+1-nr)/(wnr),0)Ty(p)= 12: for jz # 0 to nt - 1 do 3: define the coordinate frame p+(0,0,rvt(2jz+1-nr)/(2nr))Ty(p)= TzoTyoTxoRzoRyoRxtoT14:add 15 : end for 16: end for 17: end for ts: end for 19: end for 20: end for 21:return(T) APPENDIX B. QUANTIZATTON B. 6 Cubification Given a conformation and a lattice (defined by a coordinate frame and resolution), cubification is the process of determining which lattice cubes are filled by the confor- mation. The set of lattice cubes is constructed by taking any cube in which an atom center in the conformation directly falls and also cubes which are sufficiently close to the van Der Waals sphere of an atom.

Algorithm B. 7 CUBIFY (M, c, T, r, t): quantize the conformation c of the molecular structure M into a set of cubes defined on the coordinate frame T using parameters: coordinate frame T, resolution r, tolerance t.

@ : Q # # { # will contain the set of cubes in the quantization.} <BR> <BR> 2: define a function d: M-R that takes each atom to its van Der Waals radius<BR> <BR> <BR> <BR> <BR> 3: for all a E M do<BR> <BR> <BR> <BR> <BR> 4: p # T (c (a)) {the transformed position} 5: q # ([px/r], [py/r], [pz/r]) {the integer coordinates of the cube into which the atom falls} 6: if q # Q then 7: add q to Q 8: end if 9: define # to be the set of cubes neighboring q: <BR> <BR> <BR> <BR> <BR> {##Z3suchthatmax(#qx-#x#,#qy-#y#,#qz-#z#)=1}## <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> io: for all q E Q do define d to be the distance of the point in RI in the cube defined by q that is closest to p: <BR> <BR> <BR> <BR> <BR> # #p-v#min <BR> <BR> <BR> v#(rqx,rqx+r)#(rqy,rqv+r)#(rqz,rqx+r)<BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> 12: if d (a) < tr and # # # then t3: add # to Q {add a neighboring cube if it is too close to the van Der Waals sphere of the atom} 4: end if :: end for 16 : end for 7: return (Q) APPENDIX B. QUANTIZATION B. 7 Entire Process Given a set of conformations, in Q UANTIZE we see the entire quantization algorithm For each conformation: 1. a base frame and centering frame are calculated 2. perturbations of the base frame are used in order to find the lattice that results first in the smallest number of cubes and second with the least distance from the base frame 3. using the atomic functionality map, functionalities are assigned to the cubes in the minimal quantization APPENDIX B. QUANTIZATTON Algorithm B. 8 QUANTIZE (M, C, r, t, rf, #m, ar, cf, ct, nt, vt, nr, vr): quantize the set of conformations C for the molecular structure M with parameters: resolution r, tolerance t, ring factor rf, ring minimum radius factor qr, centering fraction cf, centering tolerance ct, number of translations nt, translational variance vt, number of rotations variancevr,polarizableminimumpm.rotational t : (f,#q) # ATOMFUNCTIONMAP(M) {build a functionality map and a list of atoms to be excluded from quantization} 2: S +-0 {will hold the resulting quantizations} 3: for all c E C do 4: # -#q,c,rf,rm,qr)FRAMEATOMS(M 5: Tb # BASEFRAME(#, c) {base coordinate frame} 6: Te CENTERFRAME (T, M - #q, c, cf, ct) {lattice centering} 7 : qm # #{smallest number of cubes seen so far} 8: dm # #{smallest transform distance seen so far} 9 : # # PERTURBATIONFRAMES(nt,vt,nr,vr) {set of perturbation frames} to: for all Tp E T do l: T # Te o Tp o Tb {the total coordinate transform} 12: # -#q,c,T,r,t){cubesgiventhistransform}CUBIFY(M 13: q # ###{number of cubes} 14d'SåEM-S"llTp (c (a)) c (a) || {transformdistance} 15: if ### < qm or (### = qm and d < dm) then<BR> <BR> 16: q,dm#d,#m##,Tm#T# 17: end if 18 : end for l9: define fm : #m # F as fm(q) = F@.

20: for all a E M = #q do 21: ([Tm(c(a))x/r],[Tm(c(a))y/r],[Tm(c(a))z/r])# {q is the the cube a falls into} 22: if f (a) has higher priority than f, (q) then<BR> <BR> 23: if f (a) # F6 then<BR> <BR> 24: fm?)-/ (a) 25: else if at least pm atoms with 56 functionality are in q then 26: f(a)# 27: end if 28: end if 29: end for 30: add (#m,fm) to S 31: end for 32: return (S) Appendix C Surface Complementarity In order to view molecules in the space of theoretical surfaces, we must establish com- plementarity between quantized conformations and theoretical surfaces. This is done in FITSURFACES as follows: 1. all 24 possible orientations of each quantizaed conformation are considered 2. for each orientation, the quantized conformation is shifted down and below the plane 3. a set of surfaces which fit each shifted conformation are detected 4. functionalities for each surface are computed such that the binding energy be- tween the surface and the quantized conformation are favorable SURFACECOMPLEMENTARITYAPPENDIXC. <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> <BR> <P>Algorithm C. 1 FITSURFACES (Q, fq, Ec, rb,...) : calculate all surfaces with func- tionality that are complementary to the quantizated conformation Q with function- <BR> <BR> ality map fq, conformational energy EcX and rb rotatable bonds using the following parameters: minimum surface opening area A, maximum surface volume V, area- threshold At, max-non-central Mnc, max-contiguous Mc, max-extrustion Me, num- <BR> <BR> ber of points of characteristic functionality nf, minimum energy Emtn. minimum fit quanta qmin, minimum slackness spins maximum slackness Smax, maximum protru- sion levels pmax, translational-rotational-vibrational entropy Et,,, rotatable bond co- efficient cr, hydrophobic energy coefficient Ch, hydrophobic surface energy coefficient CJ, potential function P. t: S # #{will hold the resulting surfaces} 2: define R to be the set of 24 lattice rotations 3: for all R e do 4: # # {Rq, q # #} {rotate the quantization by R} 5: [(maxq##qx+minq##qx)/2]# 6: [(maxq##qy+minq##qy)/2]# 7: minq##qz# 8: {(qx-tx,qy-ty,qz-tz),q##}{recenteroverthex-yplane}# 9: for all d: O to max go 10: Od # {(qx, qy, qz - d), q # # # such that qz # d} {shift the quantization down and only keep cubes on or under the x-y plane} 11: if ##d# < min(qmin, ###) - smin then {skip if not enough cubes are below the plane} 12: goto step 9 13: end if 14: if maxq##d qz > pmax then {skip if too many layers of cubes are sticking out of the plane} 15: goto step 9 16: end if <BR> <BR> li: Sd # CORESURFACE(#d)<BR> <BR> <BR> <BR> <BR> <BR> #min(smax+#Sd#,A),Vd#min(smax+#Sd#,V)18:Ad <BR> <BR> <BR> <BR> <BR> <BR> 19: for all S E DETECTSURFACES (Sd, Ad, Vd, At, Mnc, Mc, Me) do<BR> <BR> <BR> <BR> <BR> <BR> 20: for all (Sf. fs) E FUNCTIONALIZESURFACE (S, nf) do 21: (Sf,fs,#q,fq,Ec,rb,r,Etrv,cr.ch,cs,P)#EminthenENERGY 22: if no translation or rotation of (Sf, fs) is in S then 23: add (Sf, f9) to S 24: end if 25: end if 26 : end for 27 : end for 28: end for 29: end for 30: return (S) APPENDIXC SURFACE COMPLEMENTARITY Algorithm C. 2 CORESURFACE (#) : calculate the minimal surface fitting a quantized conformation.

1: #{surfaceopening}# 2: for all q E Q do 3: add (qz, qv) to O, if not already present 4: end for 5: define a depth function d: O # N such that d (x, y) = 1 6: for all q E Q do 7: #max(d(qx,qy),1-qz)qy) 8: end for 9: return (shape (O. d)) APPENDIX C. SURFACE COMPLEMENTARITY Algorithm C. 3 DETECTSURFACES (Sc, A, V, At, Mnc, Mc, Me): detect additional surfaces by adding cubes to SC subject to parameters: minimum surface opening area A, maximum surface volume V, area-threshold At, max-non-central Mc, max- max-extrusionMe.contiguousMc, 1: #{thedetectedsurfaces}# 2: <-{} Sn##do3:while 4: pop(Sn),O#opening(Sm),dm#depth(S)# 5: if O is connected and area (O) < A then {add surfaces obtained by adding cubes below Su} 6: d # dm, v # volume(Sm) <BR> <BR> v#Vdo7:while <BR> <BR> <BR> <BR> 8S ShaPe (OEd) 9: add S to S 10: for x # -area (O) to area (O), y # -area (O) to area (O) do <BR> <BR> (x,y)#Oandv+1#Vthen11:if <BR> <BR> <BR> 12: v-v + 1, d (x, y)-d (x, y) + 1 13 : goto step 7 14: else #v-d(x,y)-dm(x,y),d(x,y)#dm(x,y)15:v 16: end if 17 : end for 18: goto step 20 {no more surfaces left} 19: end while 20: end if 21: if area (O) < A and volume (Sm) < V then {consider surfaces made by en- larging the opening} 22: define P to be the set of all possible openings obtained by adding a single square to O adjacent to a square already present in O 23: for all P E P do 24: define dp: P N such that dp (x. y) = dm (x, y) for (x, y) E O and dp (x, y) = 1 otherwise zs: add shape (P, dp) to Sn 26: end for 27: end if 28: end while 29: # S#S},At,Mnc,Mc)OPENINGFILTER({opening(S), 30: for all S E S do Ifilter surfaces based on openings} 31: if opening (S) ¢ e) then 32: delete S from S 33: end if 34: end for 35: S surfacesbasedonshape}SHAPEFILTER(S,Me){filter 36: return (S) APPENDIX C. SURFACE COMPLEMENTARITY C. 1 Energy Complementarity energy between a quantization of a molecular conformation and a a cubic theoretical surface is the sum of several components: . translational-vibration-rotational entropy (a constant) 'a conformational energy term, representing the energy of this conformation rela- tive to the minimal energy conformation (calculated by the conformational gen- erator) * a term proportional to the number of rotatable bonds * potential energy, the sum of energy due to interactions between the functionali- ties of overlapping negative space surface cubes and positive space quantization cubes (represented as a function P: IF x F R U {-#}) * hydrophobic energy due to the exclusion of water from cubes in the surface, proportional to the surface area from which water is excluded APPENDIX C. SURFACE COMPLEMENTARITY Algorithm C. 4 ENERGY(S, fs, #, fq, Ec, rb, r, Etrv, cr, ch, cs, P) : calculate the com- <BR> <BR> plementarity energy between a surface S with functionality map 5 and a quantized conformation Q with functionality map fq, conformational energy Ec, and rb rotatable bonds using parameters: resolution r, translational-rotational-vibrational entropy EtrV, rotatable bond coefficient Cr, hydrophobic energy coefficient Ch, hydrophobic surface energy coefficient cs, potential function P. <BR> <BR> <P> #crrb{energyduetorotatablebonds}1:Er <BR> <BR> <BR> <BR> 2: Ep * 0 {energy due to potential interactions} 3: for all s E S do 4: if s E Q then 5: Ep+P(fq(s),fs(s))# 6: end if 7 : end for 8: Eh t-0 {energy due to hydrophobicity} <BR> <BR> 9: s#S##doall <BR> <BR> <BR> <BR> <BR> <BR> 10: define<BR> <BR> <BR> <BR> TT# +1,sy,sx),(sx-1,sy,sz),(sx <BR> <BR> <BR> <BR> +1,sz),(sz,sy-1,sz),(sx,sy <BR> <BR> <BR> <BR> (SIX,-1)} 11: a # r2#T # ##{the area of contact between the surface and the quantized con- formation at this point} 12: if fa (s) = F8 then {hydrophobic energy due to slightly hydrophobic surface cubes} 13: Eh: Eh + ChCsa <BR> <BR> ta: else if fs(S) # {F6, F7} then {hydrophobic energy due to non-polar surface cubes} <BR> <BR> #Eh+cha15:Eh <BR> <BR> <BR> <BR> <BR> 16: end if 17 : if fq(S) # {F6, F7} then {hydrophobic energy due to non-polar quantized con- formation cubes} 18: Eh+cha# 19: end if 20: end for 21: return (Etr + Er + Ep + Eh-Ec) Appendix D Molecular Library Comparison A library is a set of molecular structures. Given a library, the set of complementary theoretical surfaces is defined as the union of all surface shape/functionality pairs com- plementary to any quantized conformation of any molecule in the library.

The algorithm LIBRARYCOMPARE calculates a score proportional to the similar- ity of the two libraries. The score is calculated by representing each library as its set of complementary theoretical surfaces, and using the S IMILARITYSCORE primitive to determine the similarity or dissimilarity two sets of theoretical surfaces. If the molec- ular libraries each contain only one molecule, then the algorithm calculates a score proprotional to the similarity of the two molecules.

Algorithm D. 1 LIBRARYCOMPAREGI, L2): calculate a similarity score between 0 and 1000 for two molecular libraries. t: define fi to be the theoretical surfaces complementary to L, 2: define T2 to be the theoretical surfaces complementary to L2 3: return (SIMILARITYSCORE (T1, T2)) APPENDIX D. MOLECULAR LIBRARY COMPARISON Algorithm D. 2 SIMILIARITYSCORE (T1, T2): calculate a similarity score between 0 and 1000 for two sets of theoretical surfaces. i: define sets of complementary surface shapes: {Ssuchthatforsomef,(S,f)#T1}S1# S2 f {5 such that for some f, (5. f) E 2} For purposes of comparing two theoretical surfaces or surface shapes, note that translations and x-y plane rotations of a surface are considered identical to the original surface.

2: s # #S1 # S2#/#S1 # S2#{shape score} 3: S f Of functionality score 1 4: for all S E S1 n S2 do 5: define complementary functionalities for this surface shape: <BR> <BR> <BR> {fsuchthat(S,f)#T1}F1# <BR> <BR> <BR> F2' {f such that (5, f) E 2} 6: define sets of"active"cubes, that is cubes with non-default functionality: <BR> <BR> <BR> {Q#Z3suchthat#f#F1,#q#Q,f(q)#F8}#1# <BR> <BR> <BR> #2 # Z3suchthat#f#F2,#q#Q,f(q)#F8}# <BR> <BR> <BR> {Q#Z3suchthat#f#F1#F2,#q#Q,f(q)#F8}#i# 7: end for 8 : end for 9: return (1000sssf) Appendix E Protein Quantization In a process complementary to the quantization of a molecular conformation, the target sites of a protein surface are quantized into the same negative space cubic representa- tion used by theoretical surfaces. This allows the following analyses: * comparison between a set of known protein target sites and the set of all possible theoretical surfaces within given parameters of volume, shape, and functionality * comparison between two different sets of known protein target sites. o comparison between a set of known protein target sites and a set of theoretical surfaces to which a given set of molecules is complementary The protein quantization process is accomplished in the following steps, as depicted in Figure 30: 1. A 3D crystal structure of the protein is examined and a functionality map is built for the protein atoms using the algorithm A TOMFUNCTIONALMAP.

2. A protein surface is generated from the 3D structure. A protein surface is a set of triangles defining the surface of the protein that is accessible to water molecules (known as the Connolly surface). Michael Connolly's MSRoll software is an example of a package that can generate a protein surface suitable for this purpose.

3. Subsets of the surface which are target sites likely for the binding of small molecules are detected. This can be accomplished, for example, by looking for highly concave regions. Michael Connolly's MSForm software is an example of a package that can measure surface curvature and detect pockets suitable for this purpose.

4. Each target site is isolated and examined individually.

5. Each target site is quantized into a set of negative space cubes with associated functionalities using the protein function map and the algorithm T ARGETSITE- QUANTIZE. The underingly process is very similar to the algorithm Q UANTIZE, APPENDIX E. PROTEIN QUANTIZATION using an imaginary molecule that is defined by atoms centered at a set of points with a given radius that fill the target site.

6. Each set of quantized negative space cubes with functionality is converted to a set of theoretical surfaces satisfying the proper constraints (for example, no occluded cubes are allowed) using the algorithm B UILDSURFACES.

Algorithm E. 1 TARGETSITEQUANTIZE (T, f, v, r, ri, ru, b, np, Sr....): calculate a negative space cubic representation of the target site defined by the triangles in set T, with v as a normal vector pointing out of the target site, f as a functionality map for the entire protein, using parameters: resolution r, lattice density r1, lattice van Der Waals radius rv, lattice tolerance ti, number of lattice neighbors n p, buffer distance b, search radius Sr, and additional parameters for subrountine calls (see below) as necessary. i : define a coordinate frame T, with origin at the center of the target site, z axis in the direction of v, x and y axes determined by the longest side of the pocket <BR> <BR> 2: define a set of points P C R3 to be the points on a lattice with coordinate frame t<BR> <BR> <BR> <BR> <BR> and cube side length ri such that p E P iff p is contained in the target site and the closest triangle in T is at least distance b away from p 3: remove from P all points who do not have at least np neighboring points also in P, where each point has neighbors consisting of the 26 points on the lattice offset by at most one cube from the point in question 4: if P is disconnected, consider each connected component seperately in the follow- ing steps 5: define a"molecular"structure M with conformation c by imagining P to be a set of atoms with van Der Waals radius r v 6: define a base coordinate frame Tb using the algorithm B ASEFRAME on M and c 7 define a centering coordinate frame Tc using the algorithm C ENTERFRAME on M and c 8: define a set of perturbation frames using PERTURBATIONFRAMES 9: as in QUANTIZE, by examining all perturbation frames, find the total frame which minimizes the number of negative space cubes of resolution r (calculated using CUBIFY with tolerance tl) and the total transformation distance o: denote the above set of negative space cubes by Q t t : redefine the coordinate system for the negative space cubes in Q, such that the z axis is the direction with the greatest component in the v direction, and the maxi- mum z value is 0 <BR> <BR> <BR> tu: build a functionality map fq : Q ~ F by, for each q E #, finding the highest priority functionality associated by f with an atom in the protein within search radius sr to the center of q, or assigning 58 if there are no such atoms (#,fq)13:return APPENDIX E. PROTEIN QUANTIZATION Algorithm E. 2 BUILDSURFACES fq, m@,nr...) : return the set of theoretical sur- faces satifying proper constraints described by the negative space cubes Q with func- tionality fq, using parameters: maximum occlusions mO, number of functionality points nf, additional parameters for subroutine calls (see below) as necessary. <BR> <BR> t: define O # # and d: Z2 # Z such that d (x, y) = 0 {future surface opening and<BR> <BR> <BR> <BR> <BR> depths} 2: for all q E Q do {build an opening and depth function} (qx,qy)#Othen3:if 4: add (qx, qv) to O, d(qx, qy) # max(d(qx, qy), 1 - qz) 5: end if 6: end for 7: oc # REMOVEOCCLUDEDCUBES(#, O, d) 8: if oc > ma then 9: return (0) {too many occluded cubes} io: end if t: if O is disconnected then 12: define 0 to be the set of connected components, S ffi 13: for all OC # # do {generate a set of surfaces for each connected component} 14: {q##suchthat(qx,qy)#Oc}# 15: S#BUILDSURFACES(#c,fq)# 16: end for 17 : return (S) 18 : end if 19 : if O does not satisfy filtering rules imposed by OPENINGFILTER then 20: return () 2t end if <BR> <BR> <BR> #shape(O,d)22:S <BR> <BR> <BR> <BR> 23: if S does not satisfy filtering rules imposed by SHAPEFILTER then<BR> <BR> <BR> <BR> <BR> 24 : return (0) 25: end if <BR> <BR> <BR> 26: ## <BR> <BR> <BR> <BR> <BR> 27 {q # # such that fq(q) # F8}{set of negative space cubes with non- default functionality} 28: for all Q C Qa such that #Q# = nf do 29: define S#Fsuchthatfs(x,y,z)=fq(x,y,z)for(x,y)##a,: F8otherwise,add(S,fs)toSfs(x,y,z)= 30: end for 31: return (S) APPENDIX E. PROTEIN QUANTIZATION Algorithm E. 3 REMOVEOCCLUDEDCUBES(#,O,d) : remove from Q cubes which are occluded, adjusting opening O and depths d, return the number of occluded cubes removed. <BR> <BR> <P> #0{countofoccludedcubes}1:oc <BR> <BR> 2: for all (x, y) E O do 3: for z # -1 to 1 - d(x,y) do 4: if (x, y, z) E Q and (s, y, z + 1) ¢ Q then 5: oc+1# (x,y,z)from#6:remove 7 : end if 8: end for #{qz,(x,y,qz)##}9:#xy 10: =#then#xy delete (x, y) from 0, d (x,y) # 0 12:else 13: min{1-qz,qz##xy}# 4: end if 15: end for 16: return (oc) Appendix F Protein Surface Comparison A target surface set is defined as the set of all theoretical target surfaces to which a set of known protein surfaces map. The target surface set may comprise, for example, all of the surfaces mapped from one protein, all of the surfaces mapped from multiple proteins, or all of the surfaces mapped from specific sites on multiple proteins.

The algorithm P ROTEINCOMPARE calculates a score proportional to the similarity of the two sets of protein surfaces. The score is calculated by representing each protein surface set as its target surface set, and using the S IMILARITYSCORE primitive to determine the similarity or dissimilarity two sets of theoretical surfaces.

Algorithm F. 1 PROTEtNCOMPARECPi,): calculate a similarity score between 0 and 1000 for two protein surface sets. t: define fi to be the target surface set corresponding to P1 2: define T2 to be the target surface set corresponding to P2 3: return (SIMILARITYSCORE (Ti, T2)) Appendix G Protein/Library Comparison The algorithm P ROTEINLIBRARYCOMPARE calculates a score proportional to the com- plementarity of a library of small molecules and a set of protein surfaces. The score is calculated by representing the protein surface set as the theoretical surface set to which it is similar, the molecular library as the theoretical surface set to which it is comple- mentary, and using the S IMILARITYSCORE primitive to determine the similarity or dissimilarity two sets of theoretical surfaces.

Algorithm G. 1 PROTEINLIBRARYCOMPARE (L, P): calculate a complementarity score between 0 and 1000 for a molecular library C and a protein surface set P. t: define T to be the theoretical surfaces complementary to L 2: define Tp to be the target surface set corresponding to P 3: return (SIMILARITYSCORE (T, Tp)) Appendix H Parameter Values The values of parameters used are as follows: # Maximum opening area (A): 15 Maximum surface volume (V): 18 Area threshold: (At) : 8 Maximum non-central opening squares (Mnc): 5 Maximum contiguous opening squares (Me) : 3 Maximum surface extrusions (Me): 3 # Number of surface cubes of specific functionality (nf) : 4 Maximum number of conformations per molecule: 300 Resolution (r): 4.24 Angstroms Tolerance (t): 0. 32 Ring factor (rf): 0 Ring minimum (rm): 13 Radius factor (qr) : 0.75 Centering tolerance (ct): 0.1 fraction(cf):0.75#Centering # Number of translations (nt) : 5 Translational variance (vt): 0.2 ofrotations(nr):5#Number APPENDIX H. PARAMETER VALUES # Rotational variance (for): 0.1 # Polarizable minimum (pm): 2 <BR> <BR> <BR> <BR> fitquanta(qmin):9#Minimum <BR> <BR> <BR> <BR> <BR> slackness(smin):2#Minimum <BR> <BR> <BR> <BR> <BR> slackness(smax):0#Maximum <BR> <BR> <BR> <BR> <BR> protrusionlevels(pmax):1#Maximum # Minimum energy (Emin) : 8.0 cal <BR> <BR> <BR> <BR> entropy(Etrv):-9.0kCal#Translational-rotation-vibrational # Rotatable bond coefficient (rc):-0.7 kCal/bond # Hydrophobic energy coefficient (Ch): 0.025 kCal/Angstrom2 # Hydrophobic surface energy coefficient (cs): 0.8 # Potential function (P): khat Surface Molecule * Lattice density (ri): 1.5 Angstroms # Lattice van Der Waals radius (rv): 0.75 Angstroms * Lattice tolerance (t@) : 0.2 * Buffer distance (b): 0.5 Angstroms * Number of lattice neighbors (np): 3 radius(sr):2.22Angstroms#Search * Maximum number of occluded cubes (m@) : 1