

Title:
SCALABLE METHOD FOR THE EVALUATION OF DISTANCE-LIMITED PAIRWISE PARTICLE INTERACTIONS
Document Type and Number:
WIPO Patent Application WO/2006/113825
Kind Code:
A1
Abstract:
A generalized approach to particle interaction can confer advantages over previously described methods in terms of one or more of communication bandwidth, communication latency, and memory access characteristics. These generalizations can involve one or more of spatial decomposition, import region rounding, and multiple zone communication scheduling. In a parallel implementation, the generalized approach can have the advantage that the communication bandwidth required by several instances of the approach comes close to, or achieves, an approximate lower bound for communication bandwidth. In a serial implementation, effective cache utilization can substantially reduce memory access latency in processing.

Inventors:
BOWERS KEVIN J (US)
DROR RON (US)
SHAW DAVID E (US)
Application Number:
PCT/US2006/014782
Publication Date:
October 26, 2006
Filing Date:
April 19, 2006
Assignee:
DE SHAW RES LLC (US)
BOWERS KEVIN J (US)
DROR RON (US)
SHAW DAVID E (US)
International Classes:
G06F17/50; G16B15/00
Other References:
SNIR M: "A note on n-body computations with cutoffs", THEORY OF COMPUTING SYSTEMS SPRINGER-VERLAG USA, vol. 37, no. 2, March 2004 (2004-03-01), pages 295 - 318, XP002392175, ISSN: 1432-4350
HEFFELFINGER G S: "Parallel atomistic simulations", COMPUTER PHYSICS COMMUNICATIONS ELSEVIER NETHERLANDS, vol. 128, no. 1-2, June 2000 (2000-06-01), pages 219 - 237, XP002392176, ISSN: 0010-4655
MCCOY R A ET AL: "PARALLEL PARTICLE SIMULATIONS OF THIN-FILM DEPOSITION", INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, SAGE SCIENCE PRESS, THOUSAND OAKS, CA, US, vol. 13, no. 1, 21 March 1999 (1999-03-21), pages 16 - 32, XP000865057, ISSN: 1078-3482
GEORGE S. ALMASI, CALIN CASCAVAL, ET AL.: "Demonstrating the Scalability of a Molecular Dynamics Application on a Petaflops Computer", INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, vol. 30, no. 4, August 2004 (2004-08-01), pages 317 - 351, XP002392177
PANDE VIJAY S ET AL: "Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing.", BIOPOLYMERS. JAN 2003, vol. 68, no. 1, January 2003 (2003-01-01), pages 91 - 109, XP002392178, ISSN: 0006-3525
Attorney, Agent or Firm:
Rohlicek, Robin J. (P.O. Box 1022, Minneapolis, Minnesota, US)
Claims:
What is claimed is:
1. A method for performing computations associated with bodies located in a computation region, the method comprising: for each subset of multiple subsets of the computations, performing the computations in that subset of computations, including accepting data of bodies located in each of a plurality of import regions associated with the subset of the computations, the import regions being parts of the computation region; for each combination of a predetermined plurality of combinations of multiple of the import regions, performing computations associated with sets of bodies, wherein for each of the sets of bodies, at least one body of the set is located in each import region of the combination.
2. The method of claim 1 wherein at least some of the combinations of multiple of the import regions each consists of a pair of the import regions.
3. The method of claim 1 wherein each of the subsets of computations is performed on a different node of a multiple node computing system.
4. The method of claim 3 wherein accepting the data for bodies from the import regions includes accepting data from a communication medium coupling at least two of the nodes of the computing system.
5. The method of claim 1 wherein each of the subsets of computations is performed in sequence on a computing system.
6. The method of claim 5 wherein the subsets of computations are selected according to memory limitations of the computing system.
7. The method of claim 1 further comprising associating each of the subsets of computations with an iteration in an iterative sequence of computation phases.
8. The method of claim 7 wherein accepting the data for bodies from the import regions includes loading the data into a local memory associated with a processor on which the computations for the iteration are performed.
9. The method of claim 1 wherein each of the subsets of computations is associated with a separate subregion of the computation region.
10. i.
11. The method of claim 9 wherein for each of the subsets of computations, the import regions are separate from the subregion of the computation region with which the subset is associated.
12. The method of claim 9 wherein the subregions associated with the multiple subsets of computations form a partition of the computation region.
13. The method of claim 9 wherein the subregions form a regular partition of the computation region.
14. The method of claim 12 wherein the subregions form a rectilinear partition of the computation region.
15. 15 14.
16. The method of claim 9 wherein performing the computations in each subset of computations further includes: performing computations associated with sets of bodies, for each of the sets each body of the set being located in the subregion associated with the subset of computations; and performing computations associated with sets of bodies, for each of the sets one body being located in the subregion associated with the subset of computations and the other bodies of the set being located in the import regions associated with the subset of computations.
17. The method of claim 14 wherein the performing of computations associated with sets of bodies in which each body of the set of bodies is located in the subregion associated with the subset of computations is performed while accepting the data for bodies from the import regions.
18. The method of claim 9 wherein performing the computations in each subset of computations further includes: performing computations associated with sets of bodies, each body of the set being located in one of the import regions associated with the subset of computations, and no body of the set being located in the subregion associated with the subset of computations.
19. The method of claim 1 wherein performing the computations in each subset of computations includes applying a schedule of computations specified according to the import regions.
20. The method of claim 1 wherein each of the computations in the set of computations includes a computation of data characterizing an interaction of the set of bodies associated with that computation.
21. The method of claim 1 wherein each of the computations in the set of computations includes recording in a data structure an identification of the set of bodies associated with that computation.
22. The method of claim 19 further comprising performing computations of data characterizing an interaction of each of the sets of bodies recorded in the data structure.
23. The method of claim 1 wherein in performing the computations in all the subsets of computations, for any set of bodies, a computation for that set of bodies is performed in at most one of the subsets.
24. The method of claim 1 wherein for at least some of the selected plurality of combinations of multiple import regions, a computation is performed for every combination of one body in each of the multiple import regions.
25. The method of claim 22 wherein in performing the computations in all the subsets of computations, for any set of bodies, a computation for that set of bodies is performed in at most one of the subsets.
26. The method of claim 1 wherein the steps are repeated for each of a series of time steps, the computations performed at each time step being used to update locations of the bodies in the computation region for use in a subsequent time step.
27. A method for performing a set of computations associated with bodies located in a computation region, each of the computations in the set of computations being associated with a pair of the bodies, the method comprising: accepting data for bodies located in a neighborhood, the neighborhood comprising a plurality of zones; and performing computations associated with pairs of bodies, the bodies of each pair being located in different ones of the zones; wherein a spatial extent of at least one of the zones is determined to eliminate at least some points in one of the zones that are further away than a minimum distance from all points in another of the zones.
28. The method of claim 25 wherein the spatial extent is determined according to a threshold distance between pairs of bodies for which computations are to be performed.
29. The method of claim 25 wherein a spatial boundary of one of the plurality of zones is determined according to a minimum distance to points within another of the plurality of zones.
30. The method of claim 25 wherein at least one of the plurality of zones includes a non-planar boundary.
31. The method of claim 25 wherein the computation region is partitioned into regular regions, and at least one of the plurality of zones includes some but not all of one of the regular regions.
32. The method of claim 25 wherein the computation region is partitioned into regular regions and each region is associated with a subset of the computations.
33. The method of claim 30 wherein the neighborhood corresponds to one of the regular regions.
34. A method for performing a set of computations associated with bodies located in a computation region, the computation region being partitioned into regular regions, each of the computations in the set of computations being associated with a set of the bodies, the method comprising: for at least some of the regular regions, accepting data for bodies located in a neighborhood of the regular region, the neighborhood comprising a plurality of zones; wherein the plurality of zones includes at least one zone that is not contiguous, with its parts being distributed in at least two dimensions of the computation region.
35. The method of claim 32 wherein each computation in the set of computations is associated with a pair of bodies.
36. The method of claim 33 wherein the neighborhood consists of two zones, and for each of the pairs of bodies, one body of the pair is located in a different one of the two zones.
37. The method of claim 32 wherein the plurality of zones includes at least one contiguous zone.
38. The method of claim 35 wherein the contiguous zone is contiguous with the regular region.
39. The method of claim 32 wherein at least two of the zones are elongated in one dimension of the computation region.
40. The method of claim 37 wherein at least one of the zones extends over at least one other dimension of the computation region.
41. A method for performing computations associated with bodies located in a computation region using a processor coupled to a local memory, the method comprising: iteratively performing multiple subsets of the computations, in each iteration performing one subset of the multiple subsets of computations, including loading into the local memory data of bodies located in each of a plurality of import regions associated with the subset of computations, the import regions being parts of the computation region; performing on the processor computations associated with pairs of bodies, wherein for each pair, one body of the pair is located in a first of the import regions and the other body of the pair is located in a second of the import regions.
42. The method of claim 39 wherein the performing the computations associated with pairs of bodies is performed for each of a predetermined plurality of combinations of a first and a second of the import regions.
43. The method of claim 39 wherein performing the computations associated with the pairs of bodies includes performing computations associated with substantially all pairs of bodies in which one body of the pair is located in the first of the import regions and the other body of the pair is located in the second of the import regions.
44. A method for performing computations associated with bodies located in a computation region, the method comprising: for each of a plurality of subregions of the computation region, maintaining data for bodies in that subregion in a corresponding separate storage area, wherein the subregions overlap such that at least some bodies have data maintained in multiple of the storage areas; and for each of the subregions, accessing the data for bodies in that subregion from the corresponding storage area and performing computations associated with sets of two or more of the bodies in that subregion.
45. The method of claim 42 wherein each of the separate storage areas is associated with a node of a multiple node computing system.
46. The method of claim 43 wherein performing the computations associated with pairs of the bodies in the subregion is performed on the node associated with the storage area corresponding to the subregion.
47. The method of claim 43 wherein each of the separate storage areas is associated with a different node.
48. The method of claim 42 wherein each of the subregions includes a home region and a region extending at least a predetermined distance from the home region.
49. The method of claim 46 wherein the home regions form a partition of the computation region.
50. The method of claim 46 wherein the predetermined distance is at least half an interaction radius for pairwise interactions between bodies.
51. The method of claim 48 wherein performing the computations associated with the pairs of bodies in the subregion includes performing computations for pairs of bodies for which a midpoint of locations of the bodies is within the home region for the subregion.
52. A method for computing interactions among sets of bodies located in a computation region, the method comprising: for each computation associated with one of the sets of bodies, determining a computation unit of a plurality of computation units for performing the computation according to a mid location of the set of bodies determined from a location of each of the bodies.
53. The method of claim 50 wherein the mid location is determined according to a geometric midpoint of the locations of each of the bodies.
54. The method of claim 50 wherein the mid location is determined according to a center of a minimal sphere bounding the locations of each of the bodies.
55. The method of claim 50 wherein each of the computation units is associated with a node of a multiple node computing system.
56. The method of claim 53 wherein each of the computation units is associated with a different node.
57. The method of claim 53 where each of the computation units is associated with one of the nodes.
58. The method of claim 50 wherein the computation region is partitioned into regular regions each associated with a different one of the computation units.
59. The method of claim 56 wherein determining the computation unit for performing a computation associated with at least some of the sets of bodies is according to whether the mid location of the set is within the regular region associated with the computation unit.
60. The method of claim 56 wherein determining the computation unit for performing a computation associated with at least some of the sets of bodies includes selecting a computation unit from a set of computation units that include the computation unit for which the mid location of the set falls within the regular region associated with the computation unit.
61. The method of claim 58 wherein the set of computation units further includes at least one other computation unit such that each of the bodies in the set belongs to at least one other set of bodies with a mid location falling in the regular region of the at least one other computation unit.
62. The method of claim 58 wherein selecting the computation unit is performed in part according to distribution of computation load across the computation units.
63. A method for performing a set of computations associated with bodies located in a computation region, each of the computations in the set of computations being associated with a set of bodies, the method comprising: for at least some of the subsets of computations, performing the computations in that subset of computations, including accepting data of bodies located in a vicinity of the subregion associated with the subset of computations; and performing computations for the sets of bodies, each body of the set being located in the vicinity of the subregion or in the subregion, the locations of the bodies of the set and the subregion satisfying a specified criterion, wherein the specified criterion for the locations of the bodies in a set and the subregion requires that a mid location of the bodies in the set falls within the subregion.
64. The method of claim 61 wherein the mid location of the bodies in the set is determined according to the center of a bounding sphere enclosing the locations.
65. The method of claim 61 wherein the mid location of the bodies in the set is determined according to an average of the locations.
66. The method of claim 61 wherein at least some of the sets of bodies consists of two bodies.
67. The method of claim 61 wherein at least some of the sets of bodies consists of three bodies.
68. The method of claim 61 wherein at least some of the sets of bodies include more than three bodies.
69. A method for performing a set of computations associated with bodies located in a computation region, each of the computations in the set of computations being associated with a pair of bodies, the method comprising: for at least some of the subsets of computations, performing the computations in that subset of computations, including accepting data of bodies located in a vicinity of a subregion associated with the subset of computations; performing computations for pairs of bodies, each body of the pair being located in the vicinity of the subregion or in the subregion, the locations of the bodies of the pair and the subregion satisfying a specified criterion.
70. The method of claim 67 wherein the specified criterion incorporates information about a distribution of bodies of pairs in a vicinity of neighboring subregions.
71. The method of claim 67 wherein the specified criterion for the locations of the bodies in a pair and the subregion requires that a midpoint of the locations of the bodies in the pair fall within the subregion.
72. A method for computation of a molecular dynamics simulation on a plurality of processing nodes arranged in a geometric grid, the method comprising: exchanging particle data between processing nodes, such that data for all pairs of particles separated by less than an interaction radius are available on at least one processing node; and performing two or more types of computations for pairs of particles available at a processing node.
73. The method of claim 70 wherein the two or more types of computations include at least two of: electrostatic interaction, bonded force, particle migration, charge spreading, and force spreading computation.
74. The method of claim 71 wherein the two or more types of computations includes electrostatic interaction computation.
75. A method for performing computations involving sets of items, the method comprising: associating each of a plurality of items with a set of groups to which the item belongs; identifying a plurality of sets of the items as candidates for performing a computation associated with each set; and excluding sets from the computation according to the group membership of the items in the set.
76. The method of claim 73 wherein excluding the sets includes excluding a set if multiple of its items are associated with a common group.
77. The method of claim 74 wherein excluding the sets includes excluding a set if all of its items are associated with a common group.
78. The method of claim 73 wherein the items are associated with bodies in a simulation system.
79. The method of claim 76 wherein the bodies form linked sets of bodies and at least some of the groups are each associated with a linked set of bodies, and wherein each of the items associated with the bodies in the set of bodies is associated with the group associated with the linked set.
80. The method of claim 77 wherein the linked sets of bodies include bonded sets of bodies.
81. The method of claim 76 wherein identifying the plurality of sets of the items as candidates includes identifying the sets such that the associated bodies fall within a prescribed interaction radius.
82. A software product adapted to perform all the steps of any of the previous method claims.
83. The software product of claim 80 including instructions for performing the steps embodied on a computer-readable medium.
Description:
SCALABLE METHOD FOR THE EVALUATION OF DISTANCE-LIMITED PAIRWISE

PARTICLE INTERACTIONS

Cross-Reference to Related Applications

This application claims the benefit of U.S. Provisional Application Serial No. 60/672,717, filed April 19, 2005, U.S. Provisional Application Serial No. 60/709,184, filed August 18, 2005, and of U.S. Provisional Application Serial No. 60/756,448, filed January 4, 2006, all of which are incorporated herein by reference.

This application is also related to U.S. Application Serial No. 11/171,619, filed June 30, 2005, U.S. Application Serial No. 11/171,634, filed June 30, 2005, and International Application Serial No. PCT/US2005/023184, filed June 30, 2005, designating the United States, and published on January 12, 2006, as WO 2006/004877 A2, all of which are incorporated herein by reference.

Background

This document relates to computation of particle interactions, including algorithms and hardware and software techniques that are applicable to such computation.

Simulation of multiple-body interactions (often called "N-body" problems) is useful in a number of problem areas including celestial dynamics and computational chemistry. Biomolecular or electrostatic particle interaction simulations complement experiments by providing a uniquely detailed picture of the interactions between particles in a system. An important practical issue in such simulations is simulation speed.

Determining the interaction between all pairs of bodies in the system by enumerating the pairs can be computationally intensive, and thus, alternate interaction schemes are often used. For example, all particles within a predefined radius of a particular particle are interacted with that particle while particles farther away are ignored. Special-purpose hardware has also been applied in order to reduce the overall computation time required for simulation.
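As an illustrative sketch (not the claimed method), the cutoff scheme described above can be expressed as a simple enumeration that discards pairs whose separation exceeds a predefined radius. The function name and the brute-force double loop are assumptions for illustration only; practical implementations use the spatial decompositions described below.

```python
import math

def pairwise_within_cutoff(positions, cutoff):
    """Return index pairs (i, j), i < j, whose separation is below the cutoff.

    positions: list of (x, y, z) tuples; cutoff: the predefined interaction
    radius.  Particles farther apart than the cutoff are ignored, as in the
    distance-limited scheme described above.
    """
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            dz = positions[i][2] - positions[j][2]
            if math.sqrt(dx * dx + dy * dy + dz * dz) < cutoff:
                pairs.append((i, j))
    return pairs
```

This brute-force version costs O(N^2) distance checks, which is the computational burden that motivates the decomposition methods summarized next.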

Summary

In one aspect, in general, a method for performing computations associated with bodies located in a computation region involves performing the computations in each subset of multiple subsets of the computations. Performing the computations in a subset includes accepting data of bodies located in each of multiple import regions that are associated with the subset of the computations. The import regions are parts of the computation region. For each combination of a predetermined set of combinations of multiple of the import regions, computations that are associated with sets of bodies are performed. For each of these sets of bodies, at least one body of the set is located in each import region of the combination.
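The combination schedule described above can be sketched as follows. This is an illustrative assumption, not the claimed implementation: import regions are represented as named lists of body identifiers, and the schedule is a list of zone-name pairs to be interacted.

```python
from itertools import product

def interactions_for_schedule(import_regions, schedule):
    """Enumerate body pairs for each scheduled combination of import regions.

    import_regions: dict mapping a zone name to the list of body ids whose
    data was imported for that zone.
    schedule: list of (zone_a, zone_b) combinations, standing in for the
    predetermined set of combinations described above.
    """
    pairs = []
    for zone_a, zone_b in schedule:
        # One body of each pair comes from each zone of the combination.
        for a, b in product(import_regions[zone_a], import_regions[zone_b]):
            pairs.append((a, b))
    return pairs
```

For example, with two zones and a one-entry schedule, every body of the first zone is paired with every body of the second.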

Aspects can include one or more of the following features.

At least some of the combinations of multiple of the import regions each consists of a pair of the import regions.

Each of the subsets of computations is performed on a different node of a multiple node computing system.

Accepting the data for bodies from the import regions includes accepting data from a communication medium coupling at least two of the nodes of the computing system.

Each of the subsets of computations is performed in sequence on a computing system.

The subsets of computations are selected according to memory limitations of the computing system.

The method includes associating each of the subsets of computations with an iteration in an iterative sequence of computation phases.

Accepting the data for bodies from the import regions includes loading the data into a local memory associated with a processor on which the computations for the iteration are performed.

Each of the subsets of computations is associated with a separate sub-region of the computation region.

For each of the subsets of computations, the import regions are separate from the sub-region of the computation region with which the subset is associated.

The sub-regions associated with the multiple subsets of computations form a partition of the computation region. For example, the sub-regions form a regular partition of the computation region and/or a rectilinear partition of the computation region.
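A rectilinear partition of this kind can be sketched by mapping each body position to the index of a uniform box. The uniform box size and origin anchoring are assumptions for illustration; the method itself does not require them.

```python
import math

def home_box(position, box_size):
    """Map a body position to the index of its home box under a rectilinear
    partition of the computation region into uniform boxes of edge length
    box_size, anchored at the origin (an illustrative regular partition)."""
    return tuple(math.floor(c / box_size) for c in position)
```

Using `math.floor` (rather than integer truncation) keeps the mapping consistent for negative coordinates.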

Performing the computations in each subset of computations further includes performing computations associated with sets of bodies. For each of these sets, each body of the set is located in the sub-region associated with the subset of computations. The method also includes performing computations associated with sets of bodies. For each of these sets, one body is located in the sub-region associated with the subset of computations and the other body or bodies of the set are located in the import regions associated with the subset of computations.

The performing of computations associated with sets of bodies in which each body of the set of bodies is located in the sub-region associated with the subset of computations is carried out while accepting the data for bodies from the import regions.

Performing the computations in each subset of computations further includes performing computations associated with sets of bodies. For each of these sets, each body of the set is located in one of the import regions associated with the subset of computations, and no body of the set is located in the sub-region associated with the subset of computations.

Performing the computations in each subset of computations includes applying a schedule of computations specified according to the import regions.

Each of the computations in the set of computations includes a computation of data characterizing an interaction of the set of bodies associated with that computation.

Each of the computations in the set of computations includes recording in a data structure an identification of the set of bodies associated with that computation.

The method further includes performing computations of data characterizing an interaction of each of the sets of bodies recorded in the data structure.

In performing the computations in all the subsets of computations, for any set of bodies, a computation for that set of bodies is performed in at most one of the subsets.

For at least some of the selected set of combinations of multiple import regions, a computation is performed for every combination of one body in each of the multiple import regions.

The steps are repeated for each of a series of time steps, the computations performed at each time step being used to update locations of the bodies in the computation region for use in a subsequent time step.

In another aspect, in general, a method is for performing a set of computations associated with bodies located in a computation region. Each of the computations in the set of computations is associated with a pair of the bodies. The method includes accepting data for bodies located in a neighborhood. The neighborhood includes a set of multiple zones. Computations associated with pairs of bodies are performed such that the bodies of each pair are located in different ones of the zones. A spatial extent of at least one of the zones is determined to eliminate at least some points in one of the zones that are further away than a minimum distance from all points in another of the zones.
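The zone-rounding idea above, trimming points of one zone that lie farther than the cutoff from every point of another zone, can be sketched for the special case where the other zone is an axis-aligned box. The function names and the box representation are assumptions for illustration.

```python
import math

def min_distance_to_box(point, box_lo, box_hi):
    """Minimum distance from a point to the axis-aligned box [box_lo, box_hi]."""
    d2 = 0.0
    for p, lo, hi in zip(point, box_lo, box_hi):
        if p < lo:
            d2 += (lo - p) ** 2
        elif p > hi:
            d2 += (p - hi) ** 2
    return math.sqrt(d2)

def round_zone(points, other_lo, other_hi, cutoff):
    """Keep only points that could interact with some point of the other zone.

    A point farther than the cutoff from every point of the other zone cannot
    participate in any pair, so it can be trimmed from the import region,
    "rounding" the zone as described above.
    """
    return [p for p in points
            if min_distance_to_box(p, other_lo, other_hi) <= cutoff]
```

Trimming the corners of an import region in this way is one source of the bandwidth reduction discussed in the abstract.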

Aspects can include one or more of the following features.

The spatial extent is determined according to a threshold distance between pairs of bodies for which computations are to be performed.

A spatial boundary of one of the plurality of zones is determined according to a minimum distance to points within another of the set of zones.

At least one of the set of zones includes a non-planar boundary.

The computation region is partitioned into regular regions, and at least one of the set of zones includes some but not all of one of the regular regions.

The computation region is partitioned into regular regions and each region is associated with a subset of the computations.

The neighborhood corresponds to one of the regular regions.

In another aspect, in general, a method is for performing a set of computations associated with bodies located in a computation region. The computation region is partitioned into regular regions and each of the computations in the set of computations is associated with a set of the bodies. The method includes, for at least some of the regular regions, accepting data for bodies located in a neighborhood of the regular region, the neighborhood including a set of multiple zones. The set of zones includes at least one zone that is not contiguous, with its parts being distributed in at least two dimensions of the computation region.

Each computation in the set of computations is associated with a pair of bodies.

The neighborhood consists of two zones, and for each of the pairs of bodies, one body of the pair is located in a different one of the two zones.

The set of zones includes at least one contiguous zone.

The contiguous zone is contiguous with the regular region.

At least two of the zones are elongated in one dimension of the computation region.

At least one of the zones extends over at least one other dimension of the computation region.

In another aspect, in general, a method is for performing computations associated with bodies located in a computation region. The method uses a processor coupled to a local memory. The method includes iteratively performing multiple subsets of the computations. In each iteration one subset of the multiple subsets of computations is performed. Performing the subset of computations includes loading into the local memory data of bodies located in each of a set of import regions associated with the subset of computations. The import regions are parts of the computation region. Computations associated with pairs of bodies are performed on the processor. For each pair, one body of the pair is located in a first of the import regions and the other body of the pair is located in a second of the import regions.
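The serial, cache-blocked traversal described above can be sketched as follows. The data layout (a dict of zone names to body indices standing in for local memory) and the callback interface are assumptions for illustration, not the claimed implementation.

```python
def process_in_blocks(all_bodies, subsets, schedule, compute_pair):
    """Serial, cache-friendly traversal: handle one subset per iteration.

    subsets: list of dicts mapping zone name -> list of body indices forming
    that iteration's import regions.  schedule: list of (zone, zone) pairs to
    interact.  Loading one subset's data into a small working set before
    computing mimics the local-memory blocking described above.
    """
    for zones in subsets:
        # "Load" each import region into the working set (a dict of body
        # data standing in for local memory / cache).
        local = {z: [all_bodies[i] for i in ids] for z, ids in zones.items()}
        for za, zb in schedule:
            for a in local[za]:
                for b in local[zb]:
                    compute_pair(a, b)
```

Because each iteration touches only its own import regions, the working set can be sized to fit in cache, which is the source of the latency reduction mentioned in the abstract.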

Aspects can include one or more of the following features.

The performing the computations associated with pairs of bodies is performed for each of a predetermined set of combinations of a first and a second of the import regions.

Performing the computations associated with the pairs of bodies includes performing computations associated with substantially all pairs of bodies in which one body of the pair is located in the first of the import regions and the other body of the pair is located in the second of the import regions.

In another aspect, in general, a method for performing computations associated with bodies located in a computation region includes, for each of a plurality of sub-regions of the computation region, maintaining data for bodies in that sub-region in a corresponding separate storage area. The sub-regions overlap such that at least some bodies have data maintained in multiple of the storage areas. For each of the sub-regions, the data for bodies in that sub-region is accessed from the corresponding storage area, and computations associated with sets of two or more of the bodies in that sub-region are performed.

Each of the separate storage areas is associated with a node of a multiple node computing system.

Performing the computations associated with pairs of the bodies in the sub-region is performed on the node associated with the storage area corresponding to the sub-region.

Each of the separate storage areas is associated with a different node.

Each of the sub-regions includes a home region and a region extending at least a predetermined distance from the home region.

The home regions form a partition of the computation region.

The predetermined distance is at least half an interaction radius for pairwise interactions between bodies.

Performing the computations associated with the pairs of bodies in the sub-region includes performing computations for pairs of bodies for which a midpoint of locations of the bodies is within the home region for the sub-region.
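As an illustrative sketch of the midpoint criterion above, the following hypothetical Python fragment assigns a pair interaction to the home region (here, a regular grid box) containing the midpoint of the two particle positions; the box size and coordinates are placeholders:

```python
import math

def home_box_index(position, box_size):
    """Index of the regular grid region containing a point."""
    return tuple(math.floor(c / box_size) for c in position)

def owner_of_pair(p1, p2, box_size):
    """Midpoint rule: the pair interaction is computed by the node whose
    home region contains the midpoint of the two particle locations."""
    mid = tuple((a + b) / 2.0 for a, b in zip(p1, p2))
    return home_box_index(mid, box_size)

# With box_size 2.0, particles at x = 1.9 and x = 2.3 lie in different
# boxes, but their midpoint (x = 2.1) places the interaction in box (1, 0, 0).
owner = owner_of_pair((1.9, 0.5, 0.5), (2.3, 0.5, 0.5), 2.0)
```

Because each sub-region need only import data within half the interaction radius of its home region, any pair within the interaction radius has its midpoint available on some node.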

In another aspect, in general, a method for computing interactions among sets of bodies located in a computation region includes, for each computation associated with one of the sets of bodies, determining a computation unit of a set of computation units for performing the computation according to a mid location of the set of bodies determined from a location of each of the bodies.

Aspects can include one or more of the following features.

The mid location is determined according to a geometric midpoint of the locations of each of the bodies.

The mid location is determined according to a center of a minimal sphere bounding the locations of each of the bodies.

Each of the computation units is associated with a node of a multiple node computing system.

Each of the computation units is associated with a different node.

Each of the computation units is associated with one of the nodes.

The computation region is partitioned into regular regions each associated with a different one of the computation units.

Determining the computation unit for performing a computation associated with at least some of the sets of bodies is according to whether the mid location of the set is within the regular region associated with the computation unit.

Determining the computation unit for performing a computation associated with at least some of the sets of bodies includes selecting a computation unit from a set of computation units that includes the computation unit for which the mid location of the set falls within the regular region associated with the computation unit.

The set of computation units further includes at least one other computation unit such that each of the bodies in the set belongs to at least one other set of bodies with a mid location falling in the regular region of the at least one other computation unit.

Selecting the computation unit is performed in part according to distribution of computation load across the computation units.

In another aspect, in general, a method is for performing a set of computations associated with bodies located in a computation region. Each of the computations in the set of computations is associated with a set of bodies. The method includes, for at least some of the subsets of computations, performing the computations in that subset of computations. This performing of the computations in that subset includes accepting data of bodies located in a vicinity of the sub-region associated with the subset of computations and performing computations for the sets of bodies. Each body of the set is located in the vicinity of the sub-region or in the sub-region. The locations of the bodies of the set and the sub-region satisfy a specified criterion. The specified criterion for the locations of the bodies in a set and the sub-region requires that a mid location of the bodies in the set fall within the sub-region.

Aspects can include one or more of the following features.

The mid location of the bodies in the set is determined according to the center of a bounding sphere enclosing the locations.

The mid location of the bodies in the set is determined according to an average of the locations.

At least some of the sets of bodies consist of two bodies.

At least some of the sets of bodies consist of three bodies.

At least some of the sets of bodies include more than three bodies.

In another aspect, in general, a method is for performing a set of computations associated with bodies located in a computation region. Each of the computations in the set of computations is associated with a pair of bodies. The method includes, for at least some of the subsets of computations, performing the computations in that subset of computations. The performing of the computations in that subset includes accepting data of bodies located in a vicinity of a sub-region associated with the subset of computations and performing computations for pairs of bodies. Each body of the pair is located in the vicinity of the sub-region or in the sub-region. The locations of the bodies of the pair and the sub-region satisfy a specified criterion.

Aspects can include one or more of the following features.

The specified criterion incorporates information about a distribution of bodies of pairs in a vicinity of neighboring sub-regions.

The specified criterion for the locations of the bodies in a pair and the sub-region requires that a midpoint of the locations of the bodies in the pair fall within the sub-region.

In another aspect, in general, a method for molecular dynamics simulation computation on a plurality of processing nodes arranged in a geometric grid includes exchanging particle data between processing nodes such that data for each pair of particles separated by less than an interaction radius is available on at least one processing node. Two or more types of computations are performed for pairs of particles available at a processing node.

Aspects can include one or more of the following features.

The two or more types of computations include at least two of: electrostatic interaction, bonded force, particle migration, charge spreading, and force spreading computation.

The two or more types of computations include electrostatic interaction computation.

In another aspect, in general, a method for performing computations involving sets of items includes associating each of multiple items with a set of groups to which the item belongs. Multiple sets of the items are identified as candidates for performing a computation associated with each set. Sets are excluded from the computation according to the group membership of the items in the set.

Aspects may include one or more of the following.

Excluding the sets includes excluding a set if multiple of its items are associated with a common group.

Excluding the sets includes excluding a set if all of its items are associated with a common group.

The items are associated with bodies in a simulation system.

The bodies form linked sets of bodies and at least some of the groups are each associated with a linked set of bodies.

Each of the items associated with the bodies in the set of bodies is associated with the group associated with the linked set.

The linked sets of bodies include bonded sets of bodies.

Identifying the plurality of sets of the items as candidates includes identifying the sets such that the associated bodies fall within a prescribed interaction radius.
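As an illustrative sketch (not part of the claimed subject matter), the group-based exclusion described above can be expressed as follows; the item names and group labels are hypothetical:

```python
def excluded(pair, item_groups):
    """A candidate set is excluded when multiple of its items share a
    common group, e.g. atoms belonging to the same bonded set."""
    a, b = pair
    return bool(item_groups[a] & item_groups[b])

# each item maps to the set of groups it belongs to (hypothetical labels)
item_groups = {
    "H1": {"water0"},
    "O1": {"water0"},
    "H2": {"water1"},
}

# candidate pairs identified, e.g., by an interaction-radius test
candidates = [("H1", "O1"), ("H1", "H2"), ("O1", "H2")]
kept = [p for p in candidates if not excluded(p, item_groups)]
```

Here the H1-O1 pair is dropped because both items belong to the same linked (bonded) group, while cross-group pairs are retained for computation.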

In another aspect, in general, a software product is adapted to perform all the steps of any of the above methods. The software product can include instructions for performing the steps embodied on a computer-readable medium.

In another aspect, in general, a parallel processing system is for computing particle interactions. The system includes multiple computation nodes arranged according to a geometric partitioning of a simulation volume. Each of the nodes includes a storage for particle data for particles in a region of the simulation volume. The system also includes a communication system including links interconnecting the computation nodes. Each of the nodes includes a processor subsystem, and the processor subsystems of the computation nodes together, in a distributed manner, coordinate computation of the particle interactions.

Aspects can include one or more of the following.

The communication system forms a toroidal mesh.

The communication system includes separate components each linking a different neighborhood of computation nodes.

Each separate component of the communication system includes a point-to-point link between two of the computation nodes.

Sending at least some of the particle data to the other of the computation nodes includes passing that data via other computation nodes over multiple of the separate components of the communication system.

Each node is configured to

(i) send particle data to other of the computation nodes requiring that data for computation,

(ii) accept particle data sent from other of the computation nodes,

(iii) perform computations using the accepted particle data producing first results,

(iv) send the first results to other of the computation nodes,

(v) accept the first results sent from other of the computation nodes, and

(vi) perform further computations using the accepted first results producing second results.

The communication system is configured to accept the particle data from a computation node, and to distribute that particle data to a neighborhood of other of the computation nodes.

The communication system includes storage for configuring the neighborhood for distributing the particle data.

Each of the computation nodes further includes a communication subsystem of the communication system, the communication subsystem being configurable to pass particle data from one computation node to another computation node without intervention of the processor subsystem.

Each of the computation nodes further includes a computation subsystem for performing the computations using the accepted particle data producing the first results.

The computation subsystem is further for sending the first results to other of the computation nodes, without intervention of the processing subsystem at that node.

Each of the computation nodes further includes a memory subsystem for accepting the first results sent from other of the computation nodes, without intervention of the processing subsystem.

The memory subsystem is further for accumulating the accepted first results for particular of the particles.

The memory subsystem is further for providing data based on the accepted first results to the processing subsystem.

The processor subsystem is further for performing further computations using the accepted first results and producing the second results.

The particle data includes particle location data.

The first results include particle force data.

The second results include particle location data.

In another aspect, in general, a processing node for use in a parallel processing system includes a group of subsystems. This group of subsystems includes a processor subsystem, a computation subsystem, and a memory subsystem. The system also includes a communication subsystem linking the subsystems of the group of subsystems and providing at least part of a link between subsystems of the node and subsystems of other processing nodes of the processing system. The computation subsystem includes circuitry for performing pairwise computations between data elements each associated with a spatial location.

Aspects can include one or more of the following features.

The processor subsystem is programmable for substantially general computations.

The processor subsystem includes a plurality of independently programmable processors.

The communication subsystem includes a set of interface units each linked to a subsystem of the group of subsystems, the communication subsystem also including one or more interface units for linking with interface units of other processing nodes.

The communication subsystem includes a communication ring linking the subsystems of the group of subsystems.

The communication system provides addressable message passing between subsystems on the same and on different processing nodes.

The computation subsystem is adaptable for computations involving pairs of particles in a particle simulation system.

The computation subsystem is adaptable for performing computations each involving a particle and a grid location in a particle simulation system.

The group of subsystems and the communication subsystem are integrated on a single integrated circuit.

The memory subsystem includes an interface to a storage device external to the integrated circuit.

The communication subsystem includes an interface for communicating with the communication subsystems on other integrated circuits.

In another aspect, in general, a computation system includes an array of processing modules arranged into one or more serially interconnected processing groups of the processing modules. Each of the processing modules includes a storage for data elements and includes circuitry for performing pairwise computations between data elements, each of the pairwise computations making use of a data element from the storage of the processing module and a data element passing through the serially interconnected processing modules.

Aspects can include one or more of the following.

The array of processing modules includes two or more serially interconnected processing groups.

At least one of the groups of the processing modules includes a serial interconnection of two or more of the processing modules.

The system further includes distribution modules for distributing data elements among the processing groups.

The system further includes combination modules for combining results of the pairwise computations produced by each of the processing groups.

In another aspect, in general, a method includes receiving a first set of data elements, distributing the first set of data elements to the storages of an array of processing modules that are arranged into one or more serially interconnected processing groups of the processing modules, receiving a second set of data elements, and passing the data elements through the serially interconnected processing groups. At each of at least some of the processing modules, a pairwise computation is performed involving a data element from the first set that is stored in the storage of the processing module and a data element passing through the processing group, and a result of the pairwise computation is passed through the serially interconnected processing group. The method further includes combining the results of the pairwise computations passed from each of the processing groups and transmitting the combined results.

Aspects can include one or more of the following.

The data elements of the first set and of the second set are associated with bodies in a multiple body simulation system.

The pairwise computation is associated with a force between a pair of the bodies.

The method further includes, at each of the at least some of the processing modules, selecting some but not all of the data elements passing through the processing group for the performing of the pairwise computations.

Selecting the data elements includes selecting the pairs of data elements for the performing of the pairwise computations.

Selecting the pairs of data elements includes selecting the pairs according to separation between spatial locations associated with the data elements.
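As an illustrative software analogue of the serial pass-through arrangement with selection by spatial separation (a sketch, not the claimed hardware; all names are hypothetical):

```python
def pipeline(stored_subsets, stream, cutoff, interact):
    """Sketch of a serially interconnected processing group: each module
    holds a subset of the first data set; elements of the second set pass
    through every module in turn, and each module emits a result for each
    stored element within the cutoff separation of the passing element."""
    results = []
    for element in stream:                  # data element passing through
        for module in stored_subsets:       # each module in the serial chain
            for stored in module:
                d2 = sum((a - b) ** 2 for a, b in zip(stored, element))
                if d2 <= cutoff ** 2:       # matching-unit style selection
                    results.append((stored, element, interact(d2)))
    return results

# two modules, each storing one data element of the first set
modules = [[(0.0, 0.0)], [(3.0, 0.0)]]
hits = pipeline(modules, [(0.5, 0.0)], cutoff=1.0, interact=lambda d2: d2)
```

Only the pair whose spatial separation falls within the cutoff produces a result; the distant stored element is filtered out before the pairwise computation.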

Distributing the first set of data elements to the storages of the array includes, at each of the modules, distributing data elements to a plurality of matching units, each matching unit receiving a different subset of the data elements.

The method further includes, at each of the at least some of the processing modules, at each of the matching units, selecting some but not all of the data elements passing through the processing group for the performing of the pairwise computations.

The method further includes, at each of the at least some of the processing modules, combining the outputs of the matching units, and passing the combined outputs to a computation unit for executing the pairwise computation.

In another aspect, in general, a processing module includes a set of matching units, each unit being associated with a storage for a different subset of a first set of data elements. The processing module also includes an input for receiving a series of data elements in a second set of data elements, the input being coupled to each of the matching units for passing the received data elements to the matching units. The module also includes a computation unit coupled to the matching units for receiving pairs of data elements selected by the matching units, for each pair, one data element coming from the first set and one data element coming from the second set, and an output for transmitting results produced by the computation unit.

Aspects can include one or more of the following features.

The module includes an output for transmitting the received series of data elements to another module.

The module includes an input for receiving results produced by other modules.

The module includes a merging unit for combining the results produced by the computation unit and the results received on the input for receiving results.

The merging unit includes circuitry for merging results according to a data element of the first set that is associated with each of the results.

Other aspects and features are apparent from the following description and from the claims.

Description

FIG. 1 is a table of asymptotic scaling for various decomposition methods.

FIG. 2 is a perspective view of regions for the Half-Shell (HS) method.

FIG. 3 is a perspective view of regions for the Neutral Territory (NT) method.

FIG. 4 is a perspective view of regions for the Snir's Hybrid (SH) method.

FIG. 5 is a view of regions in a simplified two-dimensional analog of the Half-Shell (HS) method.

FIG. 6 is a view of regions in a simplified two-dimensional analog of the Neutral Territory (NT) method.

FIG. 7 is a view of regions in a two-dimensional decomposition that shares some characteristics with the Snir's Hybrid (SH) method.

FIG. 8 is a view of regions in a two-dimensional decomposition.

FIG. 9 is a view of regions in a two-dimensional decomposition.

FIG. 10 is a view of regions in a two-dimensional decomposition.

FIG. 11 is a view of regions in a two-dimensional decomposition.

FIG. 12 is a view of regions in a simplified two-dimensional analog of the Half-Shell (HS) method for commutative interactions.

FIG. 13 is a view of regions in a simplified two-dimensional analog of the Neutral Territory (NT) method for commutative interactions.

FIG. 14 is a view of regions in a two-dimensional decomposition that shares some characteristics with the Snir's Hybrid (SH) method for commutative interactions.

FIG. 15 is a view of regions in a two-dimensional decomposition for commutative interactions.

FIG. 16 is a view of regions in a simplified two-dimensional analog of the Half-Shell (HS) method for commutative interactions with rounding.

FIG. 17 is a view of regions in a two-dimensional decomposition that shares some characteristics with the Snir's Hybrid (SH) method for commutative interactions with rounding.

FIG. 18 is a perspective view of regions for the (SNT) method with rounding.

FIG. 19 is a perspective view of regions for the "clouds" method.

FIG. 20 is a perspective view of regions for the "city" method.

FIG. 21 is a perspective view of regions for the "foam" method.

FIG. 22 is a view of regions of a two-dimensional multiple zone method.

FIG. 23 is a computation schedule for the regions illustrated in FIG. 22.

FIG. 24 is a computation schedule for a multiple zone version of the decomposition shown in FIG. 3.

FIG. 25 is an algorithm corresponding to the computation schedule in FIG. 24.

FIG. 26 is a graph of computational contributions versus degree of parallelism.

FIG. 27 is a perspective view of regions for the "eighth shell (ES)" method.

FIG. 28 is a computation schedule for the decomposition shown in FIG. 27.

FIG. 29 is a graph of import volume versus degree of parallelism.

FIG. 30 is a perspective view of regions for a "sliced half-shell" method.

FIG. 31 is a computation schedule for the decomposition shown in FIG. 30.

FIG. 32 is a perspective view of regions for the (SNT) method with rounding.

FIG. 33 is a computation schedule for the decomposition shown in FIG. 32.

FIG. 34 is a graph of import volume versus degree of parallelism.

FIG. 35 is a graph of import volume versus degree of parallelism.

FIG. 36 is a view of regions in a two-dimensional pseudo-random assignment method.

FIG. 37 is a block diagram of a parallel implementation.

FIG. 38 is a flowchart.

FIG. 39 is a block diagram including elements of a processing node.

FIG. 40 is a diagram illustrating data flow in a processing iteration.

FIG. 41 is a functional block diagram of a particle-particle interaction module (PPIM).

FIG. 42 is a block diagram of a portion of a midrange subsystem.

FIG. 43 is a block diagram of an implementation of a PPM.

FIG. 44 is a block diagram of a flexible subsystem.

FIG. 45 is a block diagram of a serialized implementation.

Description

Section headings are provided as an aid to the reader, and should not be taken to limit the description. A listing of the section headings is as follows:

1 Overview
1.1 Molecular dynamics
1.2 Molecular force fields
1.3 Pairwise interactions
1.4 Mesh based computations
1.5 Document organization
2 NT, SH, and HS Methods
2.1 HS method
2.2 NT method
2.3 SH method
3 Generalized Decompositions
3.1 The convolution criterion
3.2 Assignment rules
3.2.1 Assignment rules for HS, NT and SH analog methods
3.2.2 Midpoint assignment rule
3.2.3 Pseudo-random assignment rule
3.2.4 Computational load balancing
3.2.5 Translational invariance
3.3 Commutative interactions
3.4 The rounding criterion
3.5 Decompositions in higher dimensions
4 Shared communication
5 Pair-lists and exclusion groups
6 Multiple Zone Methods
6.1 A Lower Bound for Import Volume
6.2 Optimization Priorities
7 Applications of generalized decompositions
7.1 Eighth Shell Method
7.2 Sliced Half Shell Method
7.3 Multiple Zone NT and SNT methods
7.4 Foam Method
7.5 Comparison of Communication Bandwidth Requirements
7.6 Multiple-particle interactions
8 Serialized implementation
9 Cluster architecture
10 Parallel implementation
11 Applications
12 Alternatives
Appendix A
1 Derivation of a Bound on Import Volume
2 Approximation of finite homebox volume effects
3 Limiting Cases
3.1 Massive parallelism limit
3.2 Low parallelism limit
3.3 Fine grained scheduling
Appendix B Zonal Methods for the Parallel Execution of Range-Limited n-Body Simulations
Appendix C The Midpoint Method for Parallelization of Particle Simulations
Claims

1 Overview

Simulation of multiple-body interactions (often called "N-body" problems) is useful in a number of problem areas including celestial dynamics and computational chemistry. For example, biomolecular or electrostatic particle interaction simulations complement experiments by providing a uniquely detailed picture of the interaction between particles in a system. A significant issue in such simulations is simulation speed.

1.1 Molecular dynamics

One approach to molecular simulation is primarily physics-based, predicting structure based on a mathematical model of the physics of a molecular system. Proteins may be modeled at the atomic level, with individual atoms or groups of atoms represented as point bodies in an N-body system. One of these methods is molecular dynamics (MD), in which the force on each particle is calculated and Newton's laws are numerically integrated to predict the physical trajectory of each atom over time. An alternative is a class of Monte Carlo methods, which stochastically sample the potential energy surface of a system. Physics-based methods can either be used to refine homology models or be applied on their own, to determine protein structure from scratch. MD can also be used to study the process of protein folding.
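As an illustrative sketch of the MD inner loop described above, the following fragment integrates Newton's laws with the velocity Verlet scheme (a common choice, though no particular integrator is prescribed here); the one-dimensional harmonic test system and all parameter values are hypothetical:

```python
def velocity_verlet_step(pos, vel, force, mass, dt, compute_forces):
    """One MD time step: half-kick, drift, recompute forces, half-kick."""
    vel = [v + 0.5 * dt * f / mass for v, f in zip(vel, force)]
    pos = [x + dt * v for x, v in zip(pos, vel)]
    force = compute_forces(pos)             # forces at the new positions
    vel = [v + 0.5 * dt * f / mass for v, f in zip(vel, force)]
    return pos, vel, force

# hypothetical test system: one particle in a 1-D harmonic well, F = -k x
k = 1.0
forces = lambda pos: [-k * x for x in pos]
pos, vel, frc = [1.0], [0.0], forces([1.0])
for _ in range(10):
    pos, vel, frc = velocity_verlet_step(pos, vel, frc, 1.0, 0.05, forces)
```

For this test system the total energy (kinetic plus potential) stays close to its initial value, illustrating the scheme's good long-term energy behavior.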

One factor in applying physics-based methods to protein structure prediction and other structure-based problems in computational biochemistry is that these problems are often concerned not with any single molecule or molecular event, but rather with the statistical properties of a very large collection of molecules. For instance, when predicting protein structure, we may not be interested in the structure that one particular protein molecule happens to fold into, but rather in the most probable structure for molecules of that protein: the structure that will be the norm among a large collection of such molecules. When studying binding, the question may not be whether a particular protein and ligand will bind in a particular instance, but rather the concentration of bound ligands in the steady state when a large number of proteins and ligands are put together.

A useful quantity to compute for these problems is the free energy of a molecular system. Free energy is not a property of a single state of a system, but rather of an ensemble of states. In very general terms, the lower the free energy of an ensemble of states, the higher the probability that a molecular system will be found in one of those states at any given time. The free energy of an ensemble is computed by summing probabilities over all states in the ensemble. The probability of a state is an exponential function of its potential energy that depends on the temperature of the system (specifically, the calculation uses a Boltzmann distribution). In practical terms, this means that for protein structure prediction, it is insufficient to perform a single run of an MD simulation, stop at some point, and take the end conformation as the native state. At a minimum, many conformations near the end of such a simulation should be sampled. More likely, multiple separate simulations may need to be run and the results compared.
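The ensemble free energy computation described above can be sketched as follows; the sampled state energies and the temperature factor kT are hypothetical, and a log-sum-exp form is used for numerical stability:

```python
import math

def free_energy(energies, kT):
    """Estimate a free energy from state energies via the Boltzmann
    distribution: F = -kT * ln( sum_i exp(-E_i / kT) ), summing
    probabilities over all states in the ensemble."""
    m = min(energies)
    # factor out the lowest energy to avoid overflow in the exponentials
    z = sum(math.exp(-(e - m) / kT) for e in energies)
    return m - kT * math.log(z)

# a toy ensemble dominated by its low-energy states (hypothetical values)
f = free_energy([0.0, 0.1, 5.0], kT=0.6)
```

Note that the free energy of the ensemble lies below the energy of its lowest state, reflecting the entropic contribution of having several accessible states.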

Once the structure of a target protein has been obtained — through either computational or experimental methods — computational methods can be used to predict the protein's interactions with ligands, a task loosely referred to as "the docking problem."

Screening ligands for activity could in theory be done with physics-based methods. Owing to the enormous computational burden of that approach, other methods have been developed.

Generally, different configurations of a protein and ligand are generated and a score is calculated for each according to a heuristic scoring function. To reduce the number of configurations that must be tested, the protein is held rigid or allowed a small degree of flexibility; the ligand is tested in various poses and orientations at various points near the protein. To further reduce the configuration space, it is necessary (or at least very helpful) to know the active site on the protein.

This heuristic screening may identify ligands that are likely to bind to a target protein, but it may not quantify that binding or predict other properties of the interaction, such as the concentration of the ligand necessary to produce a certain effect. For this purpose, the binding free energy of the interaction, the free energy difference between the unbound and bound states of the system, can be used. Too small a binding energy can make the required dosage of a drug too high to be practical; too high an energy can cause toxicity or other side effects. Accurate calculation of binding free energy is a significantly more computationally intensive task than simply determining whether two molecules are likely to bind, and relies heavily on accurate structural models of these interactions.

1.2 Molecular force fields

In physics-based methods for structure-based problems, the "inner loop" of these computational methods can be chosen to solve the following problem: given a molecular system, calculate either (a) the potential energy of the system as a whole or (b) the force on each particle due to its interactions with the rest of the system. (Note that the force in (b) is simply the three-dimensional vector gradient of (a), making the two variations of the problem similar computationally.) These forces and energies are generally computed according to a particular type of model called a molecular force field.

Force fields can model molecular systems at the atomic level, with atoms or groups of atoms typically represented as point bodies. Each point body has a set of associated parameters such as mass and charge (more specifically a partial charge, so that the complicated electron distribution caused by atomic bonding can be modeled with point charges). These parameters are determined at initialization according to the atom's type and, in standard force fields, remain constant throughout the simulation. Atom type is not as simple as an atomic number on the periodic table: an atom's parameters can depend upon what particles are near it and, in particular, what other atoms are bonded to it. For example, a hydrogen atom bonded to a nitrogen atom (an amine) may have a partial charge that is different than one bonded to an oxygen atom (an alcohol). There can easily be several types each of carbon, hydrogen, oxygen, and nitrogen. The set of atom types and the parameters used for them are one of the characteristics that define a particular force field.

The interactions among atoms can be broken into several components. The bonded terms model the interaction between atoms that are covalently bonded. One such term effectively models a bond between two atoms as a harmonic oscillator, reflecting the tendency for two atoms to settle at a certain distance from each other, known as the bond length. Another term reflects the tendency for two bonds to "bend" towards a certain angle. Yet another term takes into account the effect of torsion, the "twisting" of a bond owing to the relative angles it makes with two bonds on either side of it. Several other types of terms are possible, many of them cross-terms, which take into account the interaction between the basic types of terms. Each term contains one or more parameters, and each parameter must have a value for each combination (pair, triplet, quartet, etc.) of atom types, leading to a profusion of parameters from all of the terms and cross-terms.
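The harmonic bond-stretch term can be sketched as follows; the spring constant and equilibrium bond length are hypothetical per-bond-type force-field parameters:

```python
import math

def harmonic_bond(p1, p2, k, r0):
    """Bonded stretch term: E = 0.5 * k * (r - r0)^2, with equal and
    opposite forces on the two atoms directed along the bond."""
    d = [b - a for a, b in zip(p1, p2)]
    r = math.sqrt(sum(c * c for c in d))
    energy = 0.5 * k * (r - r0) ** 2
    # force magnitude dE/dr = k * (r - r0), directed along the bond axis;
    # a stretched bond (r > r0) pulls the atoms back together
    fmag = k * (r - r0)
    f1 = [fmag * c / r for c in d]
    f2 = [-c for c in f1]
    return energy, f1, f2

# hypothetical parameters: k = 100.0, equilibrium length r0 = 1.0
e, f1, f2 = harmonic_bond((0.0, 0.0, 0.0), (1.2, 0.0, 0.0), k=100.0, r0=1.0)
```

The angle-bend and torsion terms have the same structure: an energy expression in an internal coordinate, differentiated to give forces on each participating atom.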

The non-bonded terms in a force field model all-to-all interactions among atoms. One of the non-bonded terms calculates the electrostatic interaction of charges attracting and repelling each other according to Coulomb's law. Another type of non-bonded term models the van der Waals interaction, a shorter-range interaction that comprises an attractive component and a repulsive component. The repulsive component dominates at very short distances of a few angstroms (abbreviated Å, equal to 10^-10 meters). In a rough sense, the repulsive force can be thought of as keeping atoms from overlapping or crashing into one another. There are various ways to model the van der Waals potential. The attractive component is usually modeled as dropping off as the inverse sixth power of the distance between two particles, or 1/r^6; the repulsive component can be modeled with a similar power function (usually 1/r^12, for computational convenience on general-purpose processors) or with an exponential. Such forces are in contrast to the electrostatic potential, which drops off more slowly, as 1/r.
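The non-bonded terms can be sketched as a Coulomb term plus a 12-6 Lennard-Jones van der Waals term; the units, charges, and Coulomb constant below are placeholders:

```python
def nonbonded_energy(r, q1, q2, epsilon, sigma, coulomb_const=1.0):
    """Nonbonded pair energy: Coulomb term (proportional to 1/r) plus a
    12-6 Lennard-Jones van der Waals model (repulsive 1/r^12 component,
    attractive -1/r^6 component)."""
    electrostatic = coulomb_const * q1 * q2 / r
    sr6 = (sigma / r) ** 6
    van_der_waals = 4.0 * epsilon * (sr6 * sr6 - sr6)
    return electrostatic + van_der_waals

# for neutral particles, the LJ minimum sits at r = 2^(1/6) * sigma
# with well depth -epsilon
e_min = nonbonded_energy(2 ** (1 / 6), 0.0, 0.0, epsilon=1.0, sigma=1.0)
```

The steep 1/r^12 repulsion makes the van der Waals interaction effectively short-range, while the 1/r Coulomb term decays much more slowly, which is why the two are often treated by different computational strategies.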

1.3 Pairwise interactions

Simulations in many fields require computation of explicit pairwise interactions of particles. In addition to the molecular dynamics and Monte Carlo simulations in biochemistry discussed above, similar simulations may be applied to materials science, N-body simulations in astrophysics and plasma physics, and particle-based hydrodynamic and equation of state simulations in fluid dynamics. To reduce the computational workload of such simulations, instead of computing interactions between all pairs of particles, one approach is to compute interactions only between pairs separated by less than some interaction radius, R. These interactions are referred to as "near interactions." The remaining interactions ("distant interactions") are either neglected or handled by a less computationally expensive method. Even with the savings associated with limiting the interaction radius, the near interactions still frequently dominate the computational cost. Methods for decomposing this computation for parallel and/or serial processing can be useful in reducing the total time and/or computation or communication required for these simulations.
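A minimal sketch of enumerating near interactions under an interaction radius R follows; it checks all pairs for clarity, whereas practical implementations use cell or neighbor lists to avoid the O(n^2) scan (the coordinates and radius are hypothetical):

```python
def near_pairs(particles, radius):
    """Enumerate 'near interactions': unordered pairs of particles
    separated by less than the interaction radius R."""
    r2 = radius * radius
    pairs = []
    for i in range(len(particles)):
        for j in range(i + 1, len(particles)):
            d2 = sum((a - b) ** 2
                     for a, b in zip(particles[i], particles[j]))
            if d2 < r2:                  # compare squared distances: no sqrt
                pairs.append((i, j))
    return pairs

pts = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
found = near_pairs(pts, radius=1.0)
```

Pairs beyond the radius are the "distant interactions," which are neglected or handled separately by a cheaper method.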

Traditional methods for parallelization are described in a number of review papers, including Heffelfinger, "Parallel atomistic simulations," Computer Physics Communications, vol. 128, pp. 219-237 (2000), and Plimpton, "Fast Parallel Algorithms for Short Range Molecular Dynamics," Journal of Computational Physics, vol. 117, pp. 1-19 (March 1995). They include atom, force, and spatial decomposition methods. Spatial decomposition methods can offer an advantage over atom and force decomposition methods because they can exploit spatial locality to only consider pairs of nearby particles rather than all pairs in the system. Force decomposition methods, on the other hand, can offer an advantage over spatial and atom decomposition methods in that the communication bandwidth required by each processor decreases as the number of processors increases.

Two recently developed methods - Snir's hybrid (SH, "hybrid") method, published as "A Note on N-Body Computations with Cutoffs," by Marc Snir (Theory Comput. Systems 37, 295-318, 2004), and Shaw's Neutral Territory (NT) method, to be published as "A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions" by David E. Shaw - combine spatial and force decompositions. They have the property that the communication required decreases as the interaction radius decreases. They also have the property that the communication required per node decreases as the number of processors increases. Referring to FIG. 1, a comparison of asymptotic scaling properties of the SH and NT methods with those of traditional methods shows that the SH and NT methods have the same scaling properties, which are significantly better than those of traditional methods. Specifically, the asymptotic communication requirements of both of these two methods grow more slowly with the number of processors, p, than those of all but the force method, which does not exploit spatial locality and therefore does not improve as the interaction radius, R, is reduced.

Both the SH and NT methods can offer advantages over traditional methods for practical machine and simulation sizes. For instance, the NT method can be shown to always require less communication than the SH method, although the SH method can be optimized using features of the NT method such that the difference is small. An optimized version of the SH method that makes use of features of the NT method is referred to below as the SNT method.

The NT, SNT, and SH methods involve different import regions. That is, the set of particles imported by each processor differs from one method to another. The methods, however, share a number of features. In both the NT method and the SH method, each processor stores the particles falling in some rectangular box. In both methods, the processor that computes the interaction between a pair of particles typically is not the one on which either particle resides, but instead a third processor that imports both of the particles of the pair. In both methods, each processor imports particles from two interaction zones and considers all pairs of particles such that one particle is in one interaction zone and the other particle is in the other interaction zone.

The description below addresses a general class of methods that includes traditional spatial decompositions, the SH method, the NT method, and the SNT method as special cases. The communication requirements of a variety of specific instances of the general class are analyzed taking into account both latency and bandwidth of the interprocessor communication network.

A criterion for selecting parameters of a particular method can depend on the properties of both the physical system to be simulated (e.g., system dimensions, interaction radius) and the hardware available for the simulation (e.g., number of nodes, network topology, and memory structure including cache structure and sharing between nodes). For most practical parameter settings, the method that delivers the best performance according to such a criterion is generally not one of the specific methods described above, but rather a different instance of the general class of methods.

1.4 Mesh based computations

As introduced above, distant interactions between particles, for example, between particles separated by more than the maximum interaction radius, can be computed using less computationally expensive methods. One such approach is based on the Ewald Method, as described in detail in copending applications U.S. Application No. 11/171,619, U.S. Application No. 11/171,634, International Application No. PCT/US2005/023184.

The essence of the Ewald method is to split the electrostatic potential field into two main parts, one of which can be computed efficiently in real space (by calculating pairwise interactions) and the other of which can be handled efficiently in the Fourier domain using computations on a regular grid on the simulation region. (Note, however, that not all variations on Ewald use the Fourier domain.)

Implementation of the part involving the Fourier domain calculations can be divided into three phases. First, a discrete charge distribution

ρ(r) = Σ_i q_i δ(r − r_i),

that is defined on the (infinite and periodic) continuous three-dimensional space r, is mapped to a charge distribution ρ_S(r_m) defined on the finite three-dimensional mesh according to

ρ_S(r_m) = (S ⊗ ρ)(r_m) = Σ_i q_i S(r_i^m),

where S is a charge spreading function, the operator ⊗ denotes spatial convolution, m is an index for mesh points that lie within the primary box, and the vector r_i^m is the vector from mesh point m to the nearest image of particle i.

In a second phase, a potential function φ(r_m) is computed at the grid locations using a convolution operation on the charge distribution ρ_S. This convolution can be implemented using Fast Fourier Transform (FFT) techniques.

In the third phase, energy or force quantities defined at the particle locations are found using a convolution approach that maps the mesh-based potential function to those locations. In particular, the force on the i-th particle can be computed as

F_i = −q_i Σ_m φ(r_m) ∇S(r_i^m),

where ∇S denotes the gradient of the charge spreading function. This third phase can be implemented as a sum over the mesh points m that lie within the support of S around particle i.

Note that, computationally, the first phase and the third phase involve sums over pairs of locations, with one location of each pair being a particle location (i) and the other location of the pair being a mesh point location (m).
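The particle-mesh structure of the first phase can be sketched in one dimension. In the sketch below the Gaussian spreading function, mesh spacing, support width, and function name are all illustrative assumptions of ours, not the forms used in the referenced applications:

```python
import math

def spread_charges(charges, positions, mesh, h=1.0, sigma=0.5, support=3):
    """Phase one in 1-D: accumulate each particle's charge onto nearby mesh
    points, weighted by a localized spreading function S (here a Gaussian,
    an illustrative choice; the method only requires some localized S)."""
    rho = [0.0] * mesh
    for q, x in zip(charges, positions):
        m0 = int(round(x / h))
        # Only mesh points within the support of S around the particle
        # receive a contribution -- the "pair" is (particle i, mesh point m).
        for m in range(m0 - support, m0 + support + 1):
            r = m * h - x                       # mesh point relative to particle
            w = math.exp(-r * r / (2.0 * sigma * sigma))
            rho[m % mesh] += q * w              # wrap indices: periodic primary box
    return rho
```

The third phase has the same pairwise particle-mesh structure, with the sum running the other way: mesh values are gathered back to each particle's location.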

1.5 Document organization

The remainder of this document is organized as follows. Sections 2 through 7 generally relate to algorithms that are applicable to computing range-limited pairwise interactions between particles, or more generally, to performing range-limited computations associated with pairs of points in space, where the pair may correspond to a pair of particles in an N-body problem, or the pair may correspond to one grid point and one particle point, for example, as part of a charge spreading or force interpolation method.

Sections 8 through 10 relate to software and hardware techniques, which may be applicable to the algorithms presented in Sections 2 through 7, or to an overall n-body simulation implementation. Note that the hardware techniques described, for example in Section 10, have aspects that are applicable to a much broader range of applications than n-body simulation.

Possible applications of the algorithms and software and hardware techniques are discussed briefly in Section 11.

An Appendix provides mathematical details related to some of the algorithmic approaches described in Sections 2 through 7.

2 NT, SH, and HS Methods

The recently presented NT and SH methods, as well as an example of a conventional spatial decomposition method, the "Half-Shell" or HS method, can be described in the following framework. We assume that the system of particles (bodies) being simulated resides within a rectangular box called the global cell. In all three methods, we divide the global cell into a regular, rectangular, three-dimensional grid of smaller boxes. We refer to these smaller boxes as homeboxes. In one parallel implementation, we assign a one-to-one correspondence between homeboxes and processors, and we refer interchangeably to a processor and its homebox. The methods are described below in the context of such an implementation, with some alternative implementations being presented after this discussion.

Also, when we refer to a homebox or processor importing particles, this should be understood to include a processor receiving information characterizing those particles, such as their location, physical attributes, identities, etc. Each processor is responsible for updating the position of the particles that lie within its homebox at any point in time. As a convention, the base coordinates of a given homebox (or of any particle within that homebox) are defined as the coordinates of the low-coordinate corner of that homebox. The logical coordinates of a homebox or particle are the integer mesh coordinates of the homebox. The dimensions of each homebox are h_x × h_y × h_z.

Each home box has an associated import region consisting of the region of space from which the processor must import particles. Each home box imports all particles not already present within that homebox that are to be interacted in the home box. The NT, SH, and HS methods each use different import regions.

The NT, SH, and HS methods each interact all particles in one volume of space with all particles in another volume of space. We refer to these volumes as "interaction zones" or simply "zones." In each of these methods, the intersection of the two zones is the homebox.

We use the volume of the import region associated with each homebox as a simple model for (i.e., a proxy for or an estimate of) the communication bandwidth required by a pairwise particle interaction method. Such a proxy is most accurate for a fully connected network topology.

2.1 HS method

In conventional decompositions, generally, two particles are interacted in the homebox in which one of them resides. In the HS method specifically, the location where a pair of particles are interacted follows this rule: the particles are interacted in the homebox with the smaller x base coordinate; if the two particles have the same x base coordinate, the particles are interacted in the homebox with the smaller y base coordinate; if they also have the same y base coordinate, they are interacted in the homebox with the smaller z base coordinate. If they also have the same z base coordinate, they are in the same homebox and the interaction is computed there. Interactions between particles on the same homebox are referred to as "local interactions" while interactions between particles on different homeboxes are referred to as "remote interactions."
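The x-then-y-then-z tie-breaking rule amounts to a lexicographic comparison of base coordinates; a minimal sketch (the function name is ours):

```python
def hs_interaction_box(base_a, base_b):
    """HS rule: interact the pair in the homebox with the smaller x base
    coordinate, breaking ties by y, then by z; equality in all three means
    the particles already share a homebox."""
    # Python's tuple comparison is lexicographic, which implements the
    # x-then-y-then-z tie-break directly.
    return min(base_a, base_b)
```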

Referring to FIG. 2, as a result of the rule, each homebox has an import region that consists of half the points external to the homebox within a distance R of the boundary. Every particle in the import region 220 interacts with every particle in the homebox 210. Every particle in the homebox 210 also interacts with every other particle in the homebox 210. Equivalently, the two interaction zones are the homebox and the union of the homebox with the half shell.

The conventional HS method is guaranteed to consider all unique pairs separated by less than R. The method also could interact some pairs whose separation exceeds R due to the finite volume of the homebox. These "excess interactions" can be eliminated ("filtered") at the pair level as necessary. Local interactions will also need filtering to avoid double counting interactions between local particles. Such filtering can be accomplished by a test within an inner loop of a nested computational iteration of particles.

2.2 NT method

In the NT method, two particles interact in a homebox that has the x and y base coordinates associated with one particle and the z base coordinate associated with the other particle. More specifically, the x and y base coordinates of the homebox in which the particles interact are those of the particle with the smaller x base coordinate or, if both particles have the same x base coordinate, those of the particle with the smaller y base coordinate. The z base coordinate of the homebox in which the particles interact is the z base coordinate of the other particle. If two particles have the same x and y base coordinates, they are interacted on the homebox of the particle with the lower z coordinate.
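This rule can also be written down directly (again, the function name is ours): the particle that wins the x-then-y tie-break contributes the x and y base coordinates, and the other particle contributes z.

```python
def nt_interaction_box(base_a, base_b):
    """NT rule: take x and y from the particle that wins the x-then-y
    tie-break and z from the other particle; if both particles share their
    x and y base coordinates, use the homebox with the lower z."""
    (xa, ya, za), (xb, yb, zb) = base_a, base_b
    if (xa, ya) == (xb, yb):
        return (xa, ya, min(za, zb))
    if (xa, ya) < (xb, yb):        # a wins the x-then-y tie-break
        return (xa, ya, zb)
    return (xb, yb, za)
```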

Referring to FIG. 3, the resulting import region of each homebox is the union of an "outer tower" 320 and an "outer plate" 330. The outer tower comprises all points that lie directly above or below the homebox 310 within a distance R of the boundary. The outer plate 330 comprises half the points external to the homebox 310 that lie in its z-plane within a distance R of the boundary. The interaction zones are the "tower," which consists of the union of the homebox 310 and the outer tower 320, and the "plate," which consists of the union of the homebox 310 and the outer plate 330.

By interacting the tower (310, 320) with the plate (310, 330), all near interactions are guaranteed to be computed at least once. However, to ensure that each near interaction is computed only once, a filtering criterion is used that is not required in the HS method. Namely, local (i.e., homebox) plate particles should only interact with local tower particles and remote (i.e., non-homebox) particles in the upper half of the outer tower (or lower half, but not both).

The import requirements of the NT method can be reduced by an appropriate choice of the aspect ratio of the homebox. An optimum occurs when h_x = h_y > h_z. That is, an optimum occurs when the homeboxes are "flattened" in the z dimension by an amount that depends on the interaction radius and the homebox volume, and generally corresponds to the two interaction zones having equal or approximately equal volumes.

2.3 SH method

The SH method assumes a cubical homebox of side length h. We define r = ⌈R/h⌉ and r′ = ⌈√(r + 1)⌉. Referring to FIG. 4, the import region of the homebox 410 with logical coordinates (i, j, k) (including the homebox) consists of the union of two interaction zones. The first interaction zone is a contiguous region that we refer to as the "base" 420 and consists of the set of homeboxes with logical coordinates (i + u, j, k − w_1) where u ∈ [−r, r] and w_1 ∈ [0, r′ − 1]. The second interaction zone is a set of non-contiguous regions that we refer to as the "comb" 430. It consists of the set of boxes with logical coordinates (i, j + v, k + w_2·r′), where v ∈ [−r, r] and w_2 ∈ [0, ⌈r/r′⌉]. Note that both the comb 430 and the base 420 include the homebox 410.

Interacting the comb 430 and base 420 ensures that all near interactions will be computed at least once. As for the HS and NT methods, filtering can be used to ensure that all near interactions are computed exactly once and that only near interactions are computed.

3 Generalized Decompositions

In a generalized class of methods, each processor imports particles from two interaction zones and interacts particles from one zone with particles from the other zone. The pairs of zones are chosen for the implementation of the method to ensure that all near interactions are computed. Both the case where each particle interacts with other particles in a surrounding cube, and the case where each particle interacts with other particles in a surrounding sphere are considered.

Because a spherical region of radius R has roughly half the volume of the circumscribing cube of side length 2R, "rounding" allows reduction in communication bandwidth requirements.
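The saving from rounding is a fixed geometric factor, which a one-line computation makes concrete (the function name is ours):

```python
import math

def import_rounding_ratio(R=1.0):
    """Volume of a radius-R sphere relative to its circumscribing cube of
    side 2R; "rounding" an import region can save roughly this factor."""
    sphere = (4.0 / 3.0) * math.pi * R ** 3
    cube = (2.0 * R) ** 3
    return sphere / cube           # = pi/6, about 0.52, independent of R
```

The ratio is π/6 ≈ 0.52 regardless of R, consistent with the "roughly half" figure above.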

For clarity of exposition, we initially consider a simplified problem in which:

A. The particles are restricted to the two-dimensional x-y plane.

B. A pair of particles interact if and only if they are separated by a distance less than R in the x-dimension and by a distance less than R in the y-dimension (that is, the Euclidean distance may be greater than R).

C. The homeboxes are square with side length h, and R is an integer multiple of h. Likewise, the interaction zones consist only of complete homeboxes ("voxelized method").

D. The interactions are non-commutative. Thus, for a pair of particles a and b, we must separately compute the influence of a on b and the influence of b on a. (Note that the force between two particles is usually treated as a commutative interaction since, by Newton's Third Law, the force due to an interaction on particle a is equal and opposite to the force on particle b. Non-commutative interactions are also of practical interest; for example, one might want to compute the influence of each particle on nearby mesh points.)

These simplifications are progressively relaxed below after an initial discussion of the simplified problem.

The interaction neighborhood of a homebox is defined to be the smallest region of space that is guaranteed to contain all particles with which any arbitrary local particle (i.e., any particle in the homebox) could interact. In this simplified problem, the interaction neighborhood is a square of side length 2R+h extending a distance R beyond the homebox in the positive and negative x and y directions. Particles in the homebox influence particles in the interaction neighborhood and vice versa.

Referring to FIG. 5, a simplified analog to the HS method (more of a "full shell" than a "half shell" approach in view of the non-commutative assumption D above) involves each homebox 510 computing the influence on its own particles of all particles in its interaction neighborhood 530. This method requires that each homebox import all the particles in its interaction neighborhood, except of course for those that already reside in the homebox.

Referring to FIG. 6, a simplified analog to the NT method involves each homebox 610 computing the influence of all particles in a row of neighboring boxes 620 that span the interaction neighborhood horizontally on all particles in a column of neighboring boxes 630 that span the interaction neighborhood vertically. Each homebox imports both the remote particles in the row of boxes 620 and the remote particles in the column of boxes 630. The import requirements for this method are significantly lower than those of the HS analog, even though the total number of interactions computed is the same.
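Counting remote homeboxes in the simplified problem, with R = n·h, makes the difference concrete. These counting functions are our own illustration of the two analogs:

```python
def hs_analog_import_boxes(n):
    """Remote boxes imported by the 2-D HS analog: the full
    (2n+1) x (2n+1) interaction neighborhood minus the homebox itself,
    where R = n * h."""
    return (2 * n + 1) ** 2 - 1

def nt_analog_import_boxes(n):
    """Remote boxes imported by the 2-D NT analog: a row plus a column
    spanning the neighborhood, each contributing 2n remote boxes."""
    return 4 * n
```

At R = 4h, for example, the NT analog imports 16 remote boxes versus 80 for the HS analog, while both compute the same (2n+1)² box-pair interactions per homebox.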

Note that in both of these methods, each homebox computes the influence of all particles in one region of space, the interaction zone consisting of homebox 510 and boxes 620, on all particles in another region of space, the interaction zone consisting of neighborhood 530 and boxes 630. In these examples, the two zones of each method overlap. Their intersection is the homebox 510, 610. The zones are translationally invariant in the sense that each homebox 510, 610 has the same relative spatial relationship with its two associated zones.

Other choices for the two interaction zones lead to alternative decompositions. Referring to FIG. 7, a decomposition (referred to below as the "SH analog") that shares some characteristics with the SH method has one zone comprised of a set of equally spaced horizontal slabs 720, while the other zone is comprised of a vertically oriented block 730 whose height is equal to the spatial period of the slabs. Again, each homebox 710 imports the particles in the two zones (720, 730) and computes the influence of particles in the one zone 720 on particles in the other zone 730.

Referring to FIG. 8, in another decomposition, one zone 820 comprises a square area, a "brick," centered on the homebox 810. The other zone 830 is a "foam"-like structure that consists of boxes the size of the homebox positioned such that their period in each dimension is equal to the width of the zone 820. The homebox 810 computes the influence of each particle in the zone 820 on each particle in the zone 830.

3.1 The convolution criterion

Given a pair of translationally invariant zones associated with each homebox, verification that all required interactions are computed can be performed using a convolution-based criterion. We define the reception region of a method as the region of space that influences the homebox. Specifically, the reception region is the smallest volume of space that is guaranteed to contain all remote particles that influence local particles if no additional filtering at the pair level is employed. The criterion is based on ensuring that the reception region of a homebox includes its entire interaction neighborhood.

To find the reception region, assume each homebox has two translationally invariant interaction zones associated with it, its "blue" zone and its "red" zone. The homebox computes the influence of each particle in its red zone on each particle in its blue zone. We first find the set of homeboxes whose associated blue zone includes the designated homebox. We then take the union of the red zones for all homeboxes in this set to obtain the reception region of the designated homebox.

To compute the reception region graphically, we first reflect the blue zone about the designated homebox to show the set of homeboxes whose blue zones include the designated homebox. In two dimensions, reflection about the designated homebox is equivalent to a 180° rotation about the designated homebox. We then translate the red zone such that it has the same spatial relationship to each homebox in the reflected blue zone that it initially had to the designated homebox. The reception region is the union of these translated red zones.

This process is similar to a convolution operation. Technically, the reception region is the support of the discretized convolution of an indicator function over the reflected blue zone with an indicator function over the red zone. For that reason, we term the criterion that the reception region includes the entire interaction region the "convolution criterion." For the four decompositions illustrated in Figures 5 through 8, the reception region of the homebox is identical to its interaction neighborhood.
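The graphical procedure can be checked mechanically in the simplified 2-D voxelized problem. In the sketch below (ours, not code from the text), zones are represented as sets of integer box offsets relative to the homebox at (0, 0):

```python
def reception_region(blue, red):
    """Support of the convolution of the reflected blue zone with the red
    zone: reflect blue about the homebox, then translate a copy of red to
    each box of the reflected blue zone and take the union."""
    return {(-bx + rx, -by + ry) for (bx, by) in blue for (rx, ry) in red}

def satisfies_convolution_criterion(blue, red, n):
    """Check that the reception region covers the entire (2n+1) x (2n+1)
    interaction neighborhood (R = n * h, simplified 2-D problem)."""
    neighborhood = {(i, j) for i in range(-n, n + 1) for j in range(-n, n + 1)}
    return neighborhood <= reception_region(blue, red)
```

For the NT analog of FIG. 6 with n = 2, taking the blue zone as the column and the red zone as the row reproduces the full 5×5 neighborhood, so the criterion is satisfied.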

Referring to FIGS. 9-11, several other decompositions also satisfy the convolution criterion. The decomposition shown in FIG. 9 is similar to that of FIG. 8, but here, the two interaction zones 920 and 930 associated with a particular homebox 910 are not centered on that homebox 910. Symmetry of the zones about the homebox is not required. The zone 920 extends up and to the right of the homebox 910. Note that when reflected, the zone 920 extends down and to the left of the homebox. The decomposition of FIG. 9 also differs from those of FIGS. 5-8 in that the reception region of the homebox is larger than its interaction neighborhood and in that parts of the zone 930 lie outside the interaction neighborhood 950 (indicated by the solid outlined square). If one were to eliminate the parts of the zone 930 that are outside the interaction neighborhood 950, the reception region would no longer cover the entire interaction neighborhood, so the decomposition would no longer ensure that all required interactions were computed.

FIGS. 10 and 11 show two other decomposition schemes that also satisfy the convolution criterion. In FIG. 10, neither of the zones 1020 or 1030 for the homebox 1010 is contiguous. Similarly, in FIG. 11, the zones 1120 and 1130 of homebox 1110 are not contiguous.

3.2 Assignment rules

Decompositions can be described in terms of interaction zones associated with each homebox. An alternative description takes the form of an assignment rule, which specifies the processor that will compute the influence of one particle on another as a function of their locations or on the processors on which they initially reside. Such a description was informally used in describing some of the methods, including the previously presented HS and NT methods above.

Suppose that particles a and b are separated by a distance no greater than R in each dimension. We denote the spatial coordinates of particles a and b by (x_a, y_a) and (x_b, y_b), respectively, and denote the logical coordinates of the homebox of particle a, the homebox of particle b, and the homebox that computes the influence of a on b by (i_a, j_a), (i_b, j_b), and (i_a→b, j_a→b), respectively. An assignment rule can be specified as (i_a→b, j_a→b) in terms of the homeboxes (i_a, j_a) and (i_b, j_b) that contain the particles, or in terms of the locations (x_a, y_a) and (x_b, y_b) of the particles.

3.2.1 Assignment rules for HS, NT and SH analog methods

For the simplified analog of the HS method, which is illustrated in FIG. 5, the assignment rule can be represented as (i_a→b, j_a→b) = (i_b, j_b). For the simplified analog of the NT method, which is illustrated in FIG. 6, the assignment rule can be represented as (i_a→b, j_a→b) = (i_a, j_b).

To specify the assignment rules for the decomposition schemes illustrated in FIGS. 7 and 8, we define two auxiliary functions for integer arguments. The assignment rule for the decomposition scheme of FIG. 7, the SH analog, and the assignment rule for the foam-like scheme, which is illustrated in FIG. 8, can each be represented in terms of these functions.

3.2.2 Midpoint assignment rule

The assignment rule can also be specified in terms of the locations of the particles a and b. For example, if mid(a,b) is used to denote the point at the geometric midpoint between the coordinates of a and b (i.e., x_mid(a,b) = (x_a + x_b)/2, and similarly for y), then the assignment rule can be represented as the logical coordinates of the homebox that contains mid(a,b). More generally, a midpoint mapping m(a,b) from the coordinates of the particles to another point ("a mid location") in the space can be used that is not necessarily the geometric midpoint of the points, and a midpoint rule can be applied to three or more points with a suitable definition of a "midpoint." For example, a midpoint of a set of points may be defined as the center of the smallest sphere containing all the points. As another example, it may be defined as an average location (e.g., obtained by averaging, for each coordinate direction, the coordinates of the points in the set).

Given an assignment rule and the interaction radius R, the import region necessary to satisfy the rule can be determined based on the rule. The assignment rule is a many-to-one mapping from pairs of particle locations to logical homebox coordinates. For any homebox in the range of the mapping, the inverse in the domain of pairs of particle locations is well defined. This inverse defines the import region for the homebox through the union of the domain of each coordinate of the pairs in the inverse. The inverse can be specified either according to the coordinates of the particles, or in terms of their logical (homebox) coordinates. The existence of an assignment rule that is defined on the entire domain of pairs of particle locations within the global cell guarantees that each pair of particles will interact on some processor, so application of the convolution criterion is not necessary in this case.

Note that in the case of the midpoint assignment rule using the geometric midpoint mid(a,b) above, the import region extends at most a distance R/2 away from the homebox, because if one of the particles is further than R/2 away from the homebox and the particles are geometrically separated by at most R, then the midpoint necessarily falls outside the homebox.
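For square homeboxes of side h, the geometric-midpoint rule reduces to flooring the midpoint's coordinates by the box size. A two-dimensional sketch (the function name is ours, and it assumes the global cell's origin coincides with a homebox corner):

```python
import math

def midpoint_homebox(pos_a, pos_b, h):
    """Midpoint assignment rule: the pair interacts on the homebox whose
    logical coordinates contain the geometric midpoint of a and b."""
    return tuple(math.floor((ca + cb) / 2.0 / h) for ca, cb in zip(pos_a, pos_b))
```

With h = 8 as in the FIG. 36 example, particles at (1, 1) and (7, 3) have midpoint (4, 2) and so interact on the homebox with logical coordinates (0, 0).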

Referring to FIG. 36, a two-dimensional example uses an array of square homeboxes, each of which is h=8 Å on each side. A homebox 3610 is a representative homebox, with the solid lines delineating the homeboxes. In this example, the interaction radius is R=10 Å. As presented above, the import region 3620 extends R/2=5 Å around homebox 3610. A representative pair of particles, a 3631 and b 3632, has a geometric midpoint m 3633 that falls within the homebox 3610, and therefore the interaction between a and b is computed by the processor assigned to that homebox.

The midpoint method can also handle interactions that involve sets of three or more particles. Suppose that we wish to compute interactions between all sets of m particles that fall within some sphere of radius R/2 (if m = 2, this is equivalent to the statement that we wish to compute interactions between all pairs of particles separated by less than a distance R). In the midpoint method, the interaction between a set of m particles is computed on the processor that contains their midpoint, defined, for example, as the center of the smallest sphere that contains all m particles. The import region of the midpoint method in the m-particle case is identical to that for the two-particle case.

3.2.3 Pseudo-random assignment rule

For a particular import region defined for each homebox, quite frequently the necessary particle data for an interaction is imported to or already available on more than one node. For example, in the example shown in FIG. 36 with a homebox size of h=8 Å and an R=10 Å interaction radius, the data required for the interaction between particles a 3631 and b 3632 is also available to the processor with homebox 3612, which has an import region 3622. In general, in this example, data for any pair of particles is available to nodes associated with up to 9 = 3² homeboxes (in a three-dimensional version, 27 = 3³ homeboxes). For most of the interactions in this particular case, there are at least two nodes that could perform the interaction, and there are quite frequently many more.

An approach to assigning the computation to one of the nodes that has the data for a pair of particles does not necessarily use the midpoint node. Rather, in this approach, the computation is assigned in a generally uniform pseudo-random manner from a set of possible nodes such that at least one node and preferably only one node performs the calculation. For example, each processor that has the data for those particles makes a pseudo-random determination using little if any communication between the nodes to determine whether it should perform the calculation. In the case where only one node can compute the interaction, in this example, that node contains the midpoint of the particle locations.

A specific way of implementing the pseudo-random determination involves the data for a particle that is distributed by its homebox including a bitvector which indicates which nodes have copies. For a particular pair of particles, the logical AND of the bitvectors provides the indication of the nodes that have copies of the data for both particles. The number of such nodes that have the data for both particles a and b is defined to be n_cc(a,b). Using the same random number generator, each node with copies of the particles generates the same pseudo-random number in the range 0 to n_cc(a,b)−1. This random number is used to identify the node that will perform the computation based on the ANDed bitvectors (e.g., by identifying the bit position of the corresponding set bit). One way of implementing the pseudo-random number generator uses quantities associated with the specific particles as seeds. For example, each particle has a global index, index_a and index_b respectively, and the pseudo-random number generator is implemented as symmetric_hash(index_a, index_b) MOD n_cc(a,b), where symmetric_hash(n,m) can be a hash function implemented using any of a number of well-known techniques.
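One way to realize this scheme is sketched below, with node-presence bitvectors represented as Python integers and SHA-256 standing in for the otherwise unspecified symmetric hash; both representation choices are ours:

```python
import hashlib

def symmetric_hash(index_a, index_b):
    """Order-independent hash of two global particle indices (an
    illustrative construction; any well-known hash technique works)."""
    lo, hi = sorted((index_a, index_b))
    digest = hashlib.sha256(f"{lo}:{hi}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def owner_node(bits_a, bits_b, index_a, index_b):
    """AND the per-particle copy bitvectors, then use the symmetric hash
    MOD n_cc to pick which common node computes the interaction; every
    node holding both particles reaches the same answer independently."""
    common = bits_a & bits_b
    nodes = [n for n in range(common.bit_length()) if (common >> n) & 1]
    n_cc = len(nodes)               # number of nodes with data for both particles
    return nodes[symmetric_hash(index_a, index_b) % n_cc]
```

Because the hash is symmetric in the particle indices, every node evaluates the same expression and selects the same owner with no extra communication.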

As another variant of this approach, the pseudo-random function is not necessarily uniform. For example, if a goal is to spread out the computations over the available nodes that have the needed data, the random function is biased toward nodes that have relatively few pairs of particles to choose from. This type of bias can be incorporated by providing shared load information to the nodes, or by providing information from the sending nodes related to the number of particles for which data is being sent to other nodes.

3.2.4 Computational load balancing

In an approach related to the pseudo-random assignment rule discussed above, the fact that multiple nodes are in general available to perform each pairwise computation is used for load balancing, with the nodes cooperating to determine which computations are performed at which node in such a way that each computation is performed only once, and a desirable balancing of computational load is achieved. We refer to a class of methods as "midpoint-ensured methods," in which each processor imports the same data as would be imported in the midpoint method described above, but the assignment of interaction computations to processors is adjusted in such a way that the computational load is more even across the processors.

For expository simplicity, we first describe this method for a one-dimensional space where each "box" is simply an interval of length b. We also assume that R is no greater than b, where a set of m particles interact if they all fall within some sphere of radius R/2. We number the processors such that the home box of processor i is the i-th box from the left. The computation proceeds in the following manner:

First, each processor imports all remote particles within a distance R/2 of its home box.

Each processor then finds its set of locally computable interactions, consisting of all combinations of m particles available to that processor that fall within a sphere of radius R/2. Each processor i divides its set of locally computable interactions into three nonoverlapping subsets: interactions that are also computable on its left neighbor (processor i-1), interactions that are also computable on its right neighbor (processor i+1), and interactions that are only computable on processor i. We denote these subsets L_i, R_i, and C_i, respectively, and we denote their sizes by l_i, r_i, and c_i. R_i and L_{i+1} contain exactly the same elements, namely all combinations of m particles that are within a distance R/2 of the boundary between box i and box i+1, so r_i = l_{i+1}.

Each processor i then sends the value of c_i to its left and right neighbors. This is the only communication required by this parallelization method beyond that required by the basic midpoint method. Each processor i now has the values of c_{i-1} and c_{i+1} as well as c_i, l_i, and r_i.

Each processor i then determines which of the interactions in L_i and which of the interactions in R_i it will compute, and which of these interactions it will leave to its neighbors to compute. In order to ensure that each interaction is in fact computed somewhere, neighboring processors must make consistent decisions about the division of labor between them. Processors i and i+1 both have access to the values of c_i, c_{i+1}, and r_i (because r_i = l_{i+1}). Processor i will compute the first k_i interactions in R_i, while processor i+1 will compute the last r_i - k_i interactions, where processors i and i+1 independently compute identical values for k_i based on c_i, c_{i+1}, and r_i. To avoid missing or duplicating computation of any interaction, processors i and i+1 agree on the order of the interactions in the equivalent sets R_i and L_{i+1}; such a consistent ordering might be based on the positions of the particles involved in each interaction, with ties broken by other particle characteristics such as charge. In the absence of prior statistical information about the spatial distribution of particles, we might compute k_i as

k_i = max(0, min(r_i, round(r_i/2 + (c_{i+1} - c_i)/3)))     (Equation 1)

That is, if c_{i+1} and c_i are identical, then we assume that processors i and i+1 have a comparable load, and we therefore split the interactions in R_i evenly between them. If c_{i+1} > c_i, then we assign more than half the interactions in R_i to processor i, in an attempt to balance the load. The max, min, and round operations in the above equation are necessitated by the fact that k_i can only take on integer values between 0 and r_i. In the absence of such restrictions, distributing interactions according to the formula k_i = r_i/2 + (c_{i+1} - c_i)/3 for all i would result in the assignment of an identical number of interactions to processors i and i+1 if c_{i-1} and c_{i+2} were both equal to the average of c_i and c_{i+1}. This average represents a reasonable estimate of c_{i-1} and c_{i+2} based on information to which processors i and i+1 both have access. Better load balancing could be achieved if both processors had access to c_{i-1} and c_{i+2}, or if processors had some prior statistical information about the particle distribution (e.g., the average particle density).
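The one-dimensional splitting rule can be sketched as follows. This is an illustrative implementation; the particular blend round(r_i/2 + (c_{i+1} - c_i)/3) is one concrete choice consistent with the behavior described above (an even split for equal c's, with more of the shared work shifted to the processor whose neighbor is more heavily loaded), clamped to the allowed integer range.

```python
# Sketch of the one-dimensional midpoint-ensured split. Processor i holds
# c_i interactions only it can compute and shares r_i interactions with
# processor i+1 (so l_{i+1} = r_i).

def k_split(r_i: int, c_i: int, c_ip1: int) -> int:
    """How many of the r_i shared interactions processor i computes."""
    k = round(r_i / 2 + (c_ip1 - c_i) / 3)
    return max(0, min(r_i, k))  # k_i must be an integer in [0, r_i]

def loads(c, r):
    """Total interactions computed per processor; r[i] is shared by i, i+1."""
    load = list(c)
    for i in range(len(c) - 1):
        k = k_split(r[i], c[i], c[i + 1])
        load[i] += k              # processor i takes the first k of R_i
        load[i + 1] += r[i] - k   # processor i+1 takes the remainder
    return load
```

Because both neighbors evaluate k_split from the same shared counts, every shared interaction is computed exactly once, so the per-processor loads always sum to sum(c) + sum(r).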

Generalizations of this method to higher dimensions can take several forms. A computationally efficient and easily implemented generalization is available if we expand the import region slightly such that its outer boundary is a rectangular parallelepiped extending a distance R/2 beyond the interaction box in the positive and negative directions along each coordinate axis. We describe this generalization in the two-dimensional case, where the outer boundary of the import region will be a rectangle (a square if the interaction box is a square). Assuming that the side lengths of the interaction box are all larger than the import radius R, an individual interaction may be computable on four boxes that share a common corner, on two boxes that share a common edge, or only on a single box. Our strategy is first to assign each interaction that can be computed by two or four boxes in neighboring columns to a unique column, and then to assign interactions that can be computed by boxes in neighboring rows to a unique row. We perform these assignments by repeated application of a procedure similar to that used to split R_i in the one-dimensional case, with appropriate values substituted for r_i, c_i, and c_{i+1} in Equation 1.

More specifically, we assume that processor (i,j) is in the i-th row and the j-th column. After importing particles in its import region, processor (i,j) divides its set of locally computable interactions into nine subsets, based on which other processors can compute them. We denote these subsets by S^{a,b}_{i,j}, where a and b can each take on the values -1, 0, or 1. The value of a indicates whether the interactions in S^{a,b}_{i,j} can be computed by processor (i-1,j) (in which case a = -1), by processor (i+1,j) (in which case a = 1), or neither (in which case a = 0). The value of b indicates whether the interactions can be computed by processor (i,j-1) (in which case b = -1), by processor (i,j+1) (in which case b = 1), or neither (in which case b = 0). The shape of the import region implies that an interaction is in S^{1,1}_{i,j} if and only if it is computable in both processors (i,j) and (i+1,j+1) (this would not be the case if the import volume were restricted to that of the midpoint method). Neighboring processors agree on their set of shared elements; for example, S^{1,1}_{i,j} contains the same elements as S^{1,-1}_{i,j+1}, S^{-1,1}_{i+1,j}, and S^{-1,-1}_{i+1,j+1}. We denote the size of S^{a,b}_{i,j} by s^{a,b}_{i,j}.

Each processor (i,j) sends the values s^{1,0}_{i,j}, s^{0,0}_{i,j}, and s^{-1,0}_{i,j} to processors (i,j-1) and (i,j+1). For each a in {-1, 0, 1}, we assign the first k^a_{i,j} interactions in S^{a,1}_{i,j} to processors in column j and the last s^{a,1}_{i,j} - k^a_{i,j} to processors in column j+1, where k^a_{i,j} is computed by Equation 1 with s^{a,1}_{i,j} substituted for r_i, s^{a,0}_{i,j} substituted for c_i, and s^{a,0}_{i,j+1} substituted for c_{i+1}. Processor (i,j+1) independently computes identical column assignments for all interactions in S^{a,1}_{i,j}. Processors (i+1,j) and (i+1,j+1) independently compute identical column assignments for interactions in S^{1,1}_{i,j}, and processors (i-1,j) and (i-1,j+1) independently compute identical column assignments for interactions in S^{-1,1}_{i,j}. At this point, all interactions have been uniquely assigned to a column.

Interactions are then distributed between processors within each column using a procedure identical to that described previously for the one-dimensional case.

One could formulate multi-dimensional midpoint-ensured methods that would require only the import volume of the basic midpoint method, at the cost of a more complex algorithm for assigning interactions to processors. The midpoint-ensured method described above generalizes in a straightforward manner to the case where R is greater than one or more of the box side lengths if one requires that an interaction between a set of particles always be computed either in the box containing their midpoint or in a neighboring box (i.e., a box that shares at least a corner with it). Other midpoint-ensured methods that remove this restriction can lead to further improvements in load balance.

3.2.5 Translational invariance

The decomposition schemes we have described thus far are translationally invariant. That is, each homebox has the same spatial relationship with its associated interaction zones. Assignment rules allow the description of decompositions that are not necessarily translationally invariant.

As an example of an assignment rule that is not translationally invariant, consider an assignment rule where i_{a→b} = i_a, the least significant bit of the binary representation of j_{a→b} is the least significant bit of j_a, and the other bits of j_{a→b} are identical to those of j_b. This decomposition is similar to that of the SH analog illustrated in FIG. 7, but the import region of homeboxes with an odd j coordinate is different from that of homeboxes with an even j coordinate.

Such non-translationally-invariant assignment rules may enable better performance in tree-like networks than a restriction to translationally invariant approaches. In a tree-like network, communication between certain sets of neighboring homeboxes is less expensive or otherwise preferable to communication between other sets of neighboring homeboxes. It is possible that translationally invariant schemes will give optimal performance in translationally invariant networks, including toroidal networks and networks with all-to-all interconnection schemes, while non-translationally-invariant schemes may be preferable when the network itself is not translationally invariant.

We now devote our attention primarily to translationally invariant decompositions, recognizing that many of the approaches can be applied directly or with some modification to non-translationally-invariant decompositions.

A subset of translationally invariant schemes is separable in the sense that the logical x coordinate of the interaction homebox depends only on the logical x coordinates of the particles, and similarly for the y coordinate. Most of the schemes discussed in this paper are separable, again recognizing that many of the approaches can be applied directly or with some modification to the non-separable case as well.

3.3 Commutative interactions

One of the assumptions (labeled D above) in the simplified problem presented above was that for any pair of particles a and b, we separately compute the influence of a on b and the influence of b on a. In many, if not most, applications these influences possess a symmetry such that only one interaction needs to be computed. Even if the interactions are not commutative, the computations may require the same inputs, such as particle positions. We can take advantage of this to reduce communication requirements by computing both non-commutative interactions at the same homebox.

The convolution criterion is modified for commutative interactions to include not only the reception region but also a "transmission region." The transmission region of a method is the smallest volume of space that is guaranteed to contain all particles that are influenced by particles in the homebox. We need to ensure that the union of a homebox's reception region and transmission region includes its entire interaction neighborhood. The transmission region and reception region are closely related. One can determine the transmission region with a procedure similar to that used to determine the reception region, but with the roles of the "red" zone and "blue" zone reversed. The result is that the transmission region is simply the reception region reflected about the homebox. The union of the two regions includes the entire interaction neighborhood if and only if the reception region includes a "symmetric half" of the interaction neighborhood. That is, the two regions include the entire interaction neighborhood if, for any point in the interaction neighborhood that does not fall within the reception region, the point obtained by reflecting it about the homebox falls within the reception region.
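The symmetric-half condition can be checked numerically, as in the following sketch. It treats the homebox as a point at the origin (the high-parallelization limit) and represents a reception region as a point predicate; both simplifications are assumptions made for illustration.

```python
# Sketch of the symmetric-half test: a reception region plus its reflection
# about the homebox covers the whole interaction neighborhood iff, for every
# neighborhood point p not in the region, the reflected point -p is in it.
import itertools

def covers_neighborhood(reception, samples):
    """True if reception, together with its reflection about the origin,
    contains every sampled point of the interaction neighborhood."""
    return all(reception(p) or reception((-p[0], -p[1])) for p in samples)

# Sampled points of a square interaction neighborhood centered on the homebox.
neighborhood = [(x / 4.0, y / 4.0)
                for x, y in itertools.product(range(-8, 9), repeat=2)]

# A half-plane is a "symmetric half": it and its reflection cover everything.
def half_plane(p):
    return p[0] >= 0

# A single quadrant is not: points such as (1, -1) are missed both ways.
def quadrant(p):
    return p[0] >= 0 and p[1] >= 0
```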

If assignment rules are being used to specify the decomposition method, one way to generalize an assignment rule to take advantage of interaction symmetry is to order the particle pairs before applying the assignment rule. That is, a new assignment rule can be defined as the original assignment rule applied to the result of a function that orders the two arguments in a canonical order. For example, the ordering could be such that the first particle in a pair has the minimal x-coordinate, with ties broken by y. Some assignment rules (such as the rule for the eighth shell method described later in this paper) require no additional ordering to utilize the commutative nature of the interaction computations.

FIGS. 12 and 15 illustrate modifications that take advantage of commutative interactions for the import regions of HS and foam-like analogs of FIGS. 5 and 8, respectively. In each case, we have roughly cut one of the two interaction zones in half. The reception region for each scheme is enclosed within a dashed line, 1250 and 1550, respectively. In each case, the reception region includes at least half the interaction neighborhood, and the union of the reception region and the reflection of the reception region about the homebox covers the entire interaction neighborhood. Note that in these decompositions, if we simply interact all particles in each zone with all particles in all other zones, then some near interactions will be computed twice, and some excess interactions for pairs that are separated by a Euclidean distance greater than R might be computed. While pair-level filtering mechanisms can be used, for example by introducing a test in an inner loop of a computation iteration, an alternative method that avoids such double-counting without necessarily requiring such a test is introduced below in Section 6.

3.4 The rounding criterion

As introduced above, the approaches above relate to computation of interactions between particles that are separated by a Euclidean distance less than R, the interaction radius. In the simplified two-dimensional problem, each particle therefore interacts with particles falling within a circle of radius R rather than a square of side length 2R, as set out in assumption B above. Likewise, the interaction neighborhood of a homebox is a subregion of the square region considered above (a consequence of assumption B), with the corners of the subregion "rounded" rather than extending to the limits of the homeboxes that fall at least in part within the interaction neighborhood. The reduction in the volume of the interaction neighborhood through this rounding can be taken advantage of to reduce the size of the import region and thereby reduce the amount of communication needed to import information for the particles to be interacted. This technique for reducing the size of the import region is referred to as "rounding."

The criterion that specifies the degree to which the boundaries of the import region can be rounded can be expressed as follows: a point in an interaction zone v may be discarded from the zone if it is further than R away from the closest point in the set of zones with which v interacts. For the two-zone methods considered in this section, this means that points in the red zone may be discarded if they are further than R away from all points in the blue zone, and points in the blue zone may be discarded if they are further than R away from all points in the red zone.
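The rounding criterion admits a direct point-sampling sketch. This is an illustrative implementation (zones as lists of 2-D sample points), not an efficient production scheme.

```python
# Sketch of the rounding criterion: a point of one interaction zone may be
# discarded if it lies further than R from every point of the zone(s) it
# interacts with.

def min_dist(p, zone):
    """Distance from point p to the nearest sample point of zone."""
    return min(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in zone)

def round_zone(zone, other_zone, R):
    """Keep only the points of `zone` within R of some point of `other_zone`."""
    return [p for p in zone if min_dist(p, other_zone) <= R]
```

Applying round_zone to the red zone against the blue zone, and vice versa, trims the far corners of each import region, which is the volume reduction realized in the rounded figures.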

In FIGS. 16-17, rounded versions of the HS and SH analogs, respectively, show reduced import regions determined according to the rounding criterion presented above, as compared to the corresponding FIGS. 12 and 14, which do not employ rounding. In FIGS. 16-17, the interaction regions 1650 and 1750, respectively, are enclosed by the dashed line. Note that in FIG. 17, the rounded reception region 1750 extends beyond the interaction neighborhood 1755, which is enclosed by the dotted line. While rounding never increases import volume, it does not always reduce it. For example, the import regions of the particular instances of the foam-like analog in FIG. 15 and the NT analog in FIG. 13 are not reduced in volume by rounding.

The rounding criterion also allows us to generalize the techniques of the previous section to arbitrary R (for example, values of R that are not an integral multiple of the side dimension of a homebox) and to arbitrary homebox aspect ratios. In such cases, the interaction zones may contain fractional homeboxes.

In a voxelized method, in which the interaction zones consist of complete homeboxes (assumption C above) without making use of rounding, all local particles are tested for near interactions against the set of particles in that method's reception and transmission regions. When a method is rounded such that zones contain fractional homeboxes, assuming that entire homeboxes in the interaction neighborhood are imported, this is no longer true. For example, a local particle near the corner of a homebox may be tested for a near interaction against a different set of particles than a local particle in the opposite corner. The importing itself is preferably limited to the rounded regions, such that a homebox in the import region provides only particles located in a rounded subregion to the homebox performing the computation. In this way, fewer pairs of particles that are separated by more than R are considered by the homebox performing the computation, and for some pairs of zones being interacted, no test in the computation loop is required to avoid considering pairs of particles separated by more than R.

3.5 Decompositions in higher dimensions

The simplified techniques described above generalize to higher-dimensional cases (relaxing assumption A above). Given a method where each homebox interacts particles in one translationally invariant interaction zone with particles in another zone, we can use the convolution criterion or an assignment rule to determine whether all particle pairs within a distance R will be interacted, and then round the interaction zones with the rounding criterion. The three-dimensional HS, SH, NT and SNT methods described above are specific instances of this general class of schemes.

A class of decompositions that we have found useful in practice uses translationally invariant separable decompositions. Recall that a decomposition is termed "separable" if the logical x coordinate of the interaction homebox depends only on the logical x coordinates of the particles and similarly for the coordinates in other dimensions. A separable decomposition can be described by specifying a separate decomposition for each dimension. As one example of such a separable decomposition, one can use either of two decompositions for each dimension:

A. A decomposition where one zone contains only a single homebox, while the other contains a row of homeboxes spanning the interaction neighborhood. This is the case for both dimensions of the HS analog illustrated in FIG. 12 and of the NT analog illustrated in FIG. 13, and for the horizontal dimension of the SH analog illustrated in FIG. 14.

B. A decomposition where one zone consists of several contiguous homeboxes, while the other consists of a set of non-contiguous individual homeboxes spaced regularly across the interaction neighborhood. This is the case for both dimensions of the "foam" analog illustrated in FIG. 15, and for the vertical dimension of the SH analog illustrated in FIG. 14.

Decomposition A above can be viewed as a special case of decomposition B, but in terms of the properties of the resulting methods, it is worthwhile to treat them separately.

We can construct one zone of such a method by selecting one of the following four choices for each dimension:

• a. All homeboxes

• a'. A single homebox

• b. A set of individual homeboxes spaced regularly across the dimension

• b ' . A block of multiple contiguous homeboxes

The other zone will then be uniquely determined by the requirement that it be complementary to the first zone. In each dimension, a and a' are complementary, as are b and b'.

In three dimensions, there are 20 unique decompositions that fit this description, assuming we take advantage of interaction commutativity by halving one of the interaction zones along one dimension.

Without loss of generality, we require that the "red" zone be halved along the first dimension. This restricts the options for the first dimension of the red zone to a and b. We can select any of the four choices for the second and third dimensions of the red zone. Because there is no distinction between these two dimensions, however, we impose an ordering a > a' > b > b' on the choices and require that the choice for the second dimension be no larger than the choice for the third dimension. Thus we can select the decompositions for the second and third dimensions in 4 + 3 + 2 + 1 = 10 ways, so there are a total of 2 x 10 = 20 decompositions in this class. Note that we have ignored details such as the exact positioning of each zone relative to the home box when counting unique decompositions.
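The counting argument above can be reproduced mechanically, as in this short sketch (labels a, a', b, b' follow the text):

```python
# Counting the separable three-dimensional decompositions described above:
# the red zone's first dimension is restricted to {a, b} (it is halved
# there), while its second and third dimensions form an unordered pair
# drawn from the four choices a, a', b, b'.
from itertools import combinations_with_replacement

pairs = list(combinations_with_replacement(["a", "a'", "b", "b'"], 2))
decompositions = [(first, pair) for first in ("a", "b") for pair in pairs]
```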

The HS and NT methods, which are illustrated in FIGS. 2 and 3, respectively, are two of these unique decompositions. The SNT method is another, as illustrated in FIG. 18, in which zones 1820 and 1830 of the homebox 1810 are rounded. FIGS. 19-21 show three further methods in this class. Referring to FIG. 19, a "clouds" method has one zone 1920 of the homebox 1910 that consists of a series of circular plates of successively smaller radius resulting from the rounding criterion, and a second zone 1930 that is contiguous with the homebox. Referring to FIG. 20, a "city" method has one zone 2020 of a homebox 2010 (hidden from view) that consists of a set of vertical towers truncated according to the rounding criterion, and a second zone 2030 that is contiguous with the homebox 2010. Referring to FIG. 21, a "foam" method has one zone 2120 that consists of a set of subzones that are regularly arranged in a three-dimensional grid and truncated according to the rounding criterion, and a second zone 2130 that is contiguous with the homebox 2110.

Rounding can be taken advantage of to different extents with the methods illustrated in FIGS. 18-21. The NT and SNT methods, illustrated in FIGS. 3 and 18, respectively, take advantage of two-dimensional rounding. That is, they round one or both interaction zones in order to eliminate parts of the zone that lie more than a distance R from the homebox when projected onto the plane. Such rounding guarantees that each particle will be tested against all particles in a surrounding region approximately shaped as a cylinder of radius R and height 2R. When homeboxes are very small relative to R, for example in the high-parallelization limit, two-dimensional rounding reduces the size of the rounded zone and of the reception region by a factor of π/4 compared to the unrounded case. Other decompositions, including the HS, clouds, city and foam methods, illustrated in FIGS. 2 and 19-21, respectively, allow three-dimensional rounding. That is, they allow one to reduce one of the interaction zones in order to eliminate most parts of the reception region lying more than a distance R from the homebox in three-dimensional space. Such rounding guarantees that each particle will interact with all particles in a region shaped approximately as a sphere of radius R surrounding it, rather than some larger region. The size of the rounded zone and of the reception region can be reduced by a factor of π/6 compared to the unrounded case and by a factor of 2/3 compared to the two-dimensionally rounded case. Three-dimensional rounding of a zone is possible when it uses decompositions a or b in all three dimensions.

As discussed below, it is desirable to balance the volume of the import zones in order to minimize import bandwidth requirements. This is more easily accomplished in systems with an even number of dimensions, where we can select two interaction zones with the same shape (e.g., FIG. 6). Even in this case, we typically halve one zone along one dimension due to symmetry considerations, which introduces a difference in volume between the zones that could be balanced via other means. An odd number of dimensions introduces an inherent asymmetry between the two zones. We can still balance their volumes, for example, in one of two ways:

• Use non-cubical homeboxes, making one of the zones larger than it would have been otherwise and the other smaller. This is the technique employed by the NT method.

• In methods where one or both zones involve multiple non-contiguous pieces (e.g., decomposition b above), choose the spacing between the pieces appropriately. This is the technique employed by the SH method.

These optimization strategies can be combined. We return to the topic of optimization later.

4 Shared communication

The midpoint method described above is particularly well suited for parallelization of molecular dynamics simulations because it allows several different components of the computation to utilize the same communicated particle data, thereby reducing total communication requirements relative to parallelizing each component separately. It can be used to parallelize the evaluation of bonded force terms, which commonly involve the interaction of up to four particles, as well as pairwise electrostatic and van der Waals terms. Moreover, it allows several components of the MD computation, including the evaluation of the bonded force terms, to rely on the same data that is communicated for the evaluation of electrostatic and van der Waals forces between pairs of nearby atoms.

The total force on a particle in most biomolecular force fields can be expressed as a sum of bonded force terms, which depend on the covalent bond structure of the molecules, and non-bonded force terms, which comprise electrostatic and van der Waals interactions between all pairs of particles in the system. Bonded force terms include bond length terms, involving two particles connected by a bond; bond angle terms, involving three particles, two of which are bonded to a third; and dihedral angle terms, involving four particles connected by three bonds. We denote by r_1 the maximum radius of a sphere required to enclose the particles involved in any bonded term. In common biomolecular force fields, r_1 is no larger than 4 Å.

The non-bonded force terms, on the other hand, involve all pairs of particles in the system and therefore represent a much larger computational burden. Van der Waals forces fall off sufficiently quickly with distance that they can typically be neglected for pairs of particles separated by more than some cutoff radius r_2, typically chosen between 9 and 12 Å. An ever-mounting body of evidence shows that neglecting electrostatic interactions beyond a cutoff is inadequate for explicit solvent simulations, so electrostatic forces are typically computed by one of several efficient, approximate methods that take long-range interactions into account without requiring the explicit interaction of all pairs of particles. Examples of these methods include mesh-based Ewald methods such as Particle-Particle Particle Mesh (PPPM), Particle Mesh Ewald (PME), Lattice Gaussian Multigrid (LGM), and Gaussian Split Ewald (GSE), which use the Fast Fourier Transform (FFT) or a multigrid solver to compute the electrostatic potential on a mesh given a charge distribution on that mesh. These methods require that modified electrostatic interactions be computed explicitly within some cutoff radius; we will assume that this cutoff radius is chosen to be identical to the van der Waals cutoff radius r_2, as is typically the case. These methods also require that charge be mapped from particles to nearby mesh points before the Fourier-domain or multigrid computation, and that forces on particles be computed from potentials at nearby mesh points afterwards. In the charge spreading computation, the charge on each mesh point is determined as a sum over nearby particles, while in the force spreading computation, the force on each particle is computed as a sum over nearby mesh points. We denote by r_3 the maximum distance between any particle and a mesh point that "interacts" with it in either charge spreading or force spreading. The value of r_3 depends on the parameters of the long-range electrostatics method employed, but is typically around 5 Å.

When one parallelizes the computation of explicit pairwise electrostatic and van der Waals interactions using the midpoint method, each processor must import data for particles lying within a distance of r_2/2 from the boundary of its home box. Each processor computes interactions that exert force on the particles in its import region, so once it has completed computation of pairwise interactions, it must export a force for each of these particles to the respective particle's home box.

If r_3 < r_2/2, then no additional communication is required to support charge spreading and force spreading. The import requirements of the explicit non-bonded force computation ensure that each processor will import all remote particles within a distance r_2/2 of any mesh point lying in the home box. Likewise, the export requirements ensure that each processor will export forces to this set of remote particles (a remote particle is a particle that does not reside on the interaction box). The force contribution on each remote particle due to mesh points in the home box can be combined into each of these forces before inter-chip communication. In MD runs using common biomolecular force fields, r_3 is usually approximately equal to r_2/2. Parameters of the mesh-based Ewald method employed can be tuned to guarantee that r_3 < r_2/2; alternatively, the import region can be expanded to include all particles within a distance r_3 of the home box boundary.

Likewise, if r_1 < r_2/2 (as is typically the case for MD simulations using common biomolecular force fields), then no additional communication is required to support computation of the bonded terms. This computation can also be parallelized using the midpoint method, which specifies that each bonded term be computed by the processor that contains the midpoint of the particles involved in that term. The import requirements of the explicit non-bonded force computation ensure that data for each particle will be available on any processor that needs it to compute a bonded force term, while the export requirements ensure that a force contribution will be exported back to the processor on which that particle resides.

In an MD simulation parallelized via a spatial decomposition, particles migrate from one home box to another as they move. As long as the maximum distance r_4 that a particle moves in a time step (typically under 0.1 Å) is less than r_2/2, no extra communication will be required to migrate the particles; their new positions will be communicated as part of the import data required by the midpoint method for explicit non-bonded interactions (in certain cases data will need to be imported from more boxes than if particle migration had been performed separately, because the data to be imported might reside on any box including points within a distance r_2/2 + r_4 of the interaction box, rather than only on boxes including points within a distance r_2/2 of the interaction box). Note that this is not generally the case for the NT, HS, or ES methods, in which proximity of a particle to an interaction box does not guarantee that the particle will be imported by that interaction box.

Consider a molecular dynamics simulation with a cubic global cell measuring 64 Å on a side, running on a system with 512 processors and partitioned into an 8 × 8 × 8 mesh of boxes. Suppose that we use an interaction radius of r2 = 12 Å for explicit pairwise non-bonded interactions. If we use the midpoint method to parallelize these interactions, the import volume will be 5,923 Å³; no additional communication is required for charge spreading, force spreading, particle migration, or bonded force computation, provided that bonded force computation is also parallelized using a midpoint method. Using the HS method to parallelize the explicit pairwise non-bonded interactions, on the other hand, requires an import volume of 11,352 Å³. This import volume does not ensure, however, that all remote particles near a mesh point on a particular home box are imported onto that home box. If r3 = 4.5 Å, charge spreading will require an additional import volume of 2,164 Å³, and force spreading will require a comparable amount of additional data export. Traditional parallelization strategies for bonded forces, in which each term is computed either on a fixed processor or on the processor containing a specific particle, will require additional communication, as will particle migration.
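The two import volumes above follow from the geometry of the region surrounding a cubic homebox; as an illustrative check (not part of the original disclosure), the following sketch reproduces both figures for the 8 Å homeboxes and r2 = 12 Å of this example:

```python
import math

def shell_volume(b, r):
    """Volume within distance r of a cubic box of side b, excluding the box.

    Decomposes into 6 face slabs, 12 quarter-cylinder edges, and
    8 eighth-sphere corners.
    """
    faces = 6 * b * b * r
    edges = 12 * (math.pi * r * r / 4) * b      # 12 quarter cylinders
    corners = 8 * (4 / 3) * math.pi * r**3 / 8  # 8 eighth spheres
    return faces + edges + corners

b, r2 = 64.0 / 8, 12.0   # 8 x 8 x 8 mesh of a 64 A cell -> 8 A homeboxes
midpoint_import = shell_volume(b, r2 / 2)    # all points within r2/2 of box
half_shell_import = shell_volume(b, r2) / 2  # half of the full shell

print(round(midpoint_import))    # 5923
print(round(half_shell_import))  # 11352
```

The halved import radius of the midpoint method roughly halves the import volume here, despite the cubic growth of the corner contributions.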

We expect that all methods with the import region of the midpoint method allow communication to be shared in the same manner. Also, while a mesh-based Ewald method is presented above, the midpoint method also allows communication sharing between components of the computation for other long-range electrostatic methods. Essentially all of these methods (including the Fast Multipole Method and related tree code methods) require that some pairwise interactions between nearby particles be computed explicitly; at a minimum, the communication required for this computation can be shared with the communication required for bonded interactions.

5 Pair-lists and exclusion groups

In approaches that involve identifying pairs of particles that are to be interacted, computation can be arranged such that the interaction computations are performed at the time that the pairs of particles are identified. As an alternative, approaches such as the NT approach are used to identify the pairs of particles to be interacted, but rather than performing the computations as the pairs are identified, a list of the pairs is generated. The list is then used to iterate through the required computations. For example, the pair list is completely calculated before any pairwise computations begin, or alternatively, a pipelined approach is used in which another software or hardware module receives the list and computes the interactions while more pairs are being identified. In some hardware architectures, separation of the generation of the pair list from the pairwise computations may increase computation efficiency, for example, by allowing simpler hardware or a more efficiently coded inner software loop to be used to compute the pairwise interactions.
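A minimal sketch of this separation follows (illustrative only, not part of the original disclosure; a brute-force O(n²) pair identification stands in for an NT-style matchmaking step, and the pair function is a placeholder):

```python
import itertools

def build_pair_list(positions, cutoff):
    """Identify all pairs within the cutoff (O(n^2) reference version)."""
    c2 = cutoff * cutoff
    pairs = []
    for (i, pi), (j, pj) in itertools.combinations(enumerate(positions), 2):
        d2 = sum((a - b) ** 2 for a, b in zip(pi, pj))
        if d2 <= c2:
            pairs.append((i, j))
    return pairs

def compute_interactions(positions, pairs, pair_fn):
    """Separate, simpler inner loop: consume a precomputed pair list."""
    return [pair_fn(positions[i], positions[j]) for i, j in pairs]

pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
pairs = build_pair_list(pos, cutoff=2.0)
print(pairs)  # [(0, 1)]
```

In a pipelined variant, build_pair_list would stream pairs to a second module running compute_interactions concurrently.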

Another aspect of computation of pairwise particle interactions, whether using pair lists or performing the computations as appropriate pairs are identified, is that some particle pairs within the specified interaction radius must effectively be excluded from the computation. For example, in a molecular dynamics simulation, the computation of Van der Waals and near electrostatic interactions (non-bonded interactions) is performed for all particles that are within a cutoff radius R and that are not covalently bonded. If a pair of particles within the cutoff radius are covalently bonded, then a near electrostatic force term is not included. A pair of particles within radius R that does not have a non-bonded interaction is termed an excluded pair.

Approaches to dealing with exclusion of computations involving bonded pairs of particles include (1) including the electrostatic force contributions for all pairs of particles within the interaction radius, and later subtracting out (i.e., "correcting") the force contributions for pairs that are covalently bonded, and (2) identifying the excluded pairs during the identification of pairs of particles within the interaction radius to avoid introducing the electrostatic force terms in the first place.

Neutral territory methods described above allow interactions between all pairs within R to be found and distributed efficiently, for example, on a parallel machine. However, in order to take exclusions into account efficiently during this computation, a means to quickly detect whether or not a pair of particles is excluded can substantially increase the efficiency of the computation.

One solution is to create a Boolean matrix indexed by particle whose entries indicate whether or not a pair is excluded — conceptually, is_excluded(i, j). However, even for a modest number of particles, direct storage of this matrix (or of half the matrix, since the matrix is symmetric) is likely prohibitive. Since the number of exclusions usually scales in proportion to the number of particles, the matrix is sparse and could be stored in a compressed format.

Exclusion groups provide an alternative means for specifying a set of excluded pairs. Each particle has an associated value that indicates the exclusion group to which the particle is assigned. If two particles belong to the same exclusion group, they are considered an excluded pair and discarded during the matchmaking process of a neutral territory method.

This test is efficient and maps nicely onto both neutral territory method implementations and the physical properties of many problems involving range-limited pairwise interactions. In molecular dynamics, for example, all the non-bonded interactions between particles belonging to the same water molecule are commonly excluded. If there is an exclusion group that corresponds to each water molecule, then no additional handling of water exclusions is necessary.

To demonstrate their usage and assignment, consider a simulation with 12 particles. These 12 particles represent a methane molecule (particles 0, 1, 3, 6, 9), two water molecules (particles 2, 4, 5 and 7, 8, 10), and a sodium ion (particle 11). All pairwise non-bonded interactions between particles that belong to the same molecule are excluded.

Particles 0,1,3,6 and 9 could be assigned to exclusion group 2, particles 2,4 and 5 could be assigned to exclusion group 5, particles 7,8,10 could be assigned to exclusion group 0 and particle 11 could be assigned to exclusion group 1. This demonstration also shows that particles and exclusion group assignments have no special ordering requirements.
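The assignment above can be sketched directly in code (an illustrative example, not part of the original disclosure):

```python
# Exclusion group per particle, following the 12-particle example above:
# methane -> group 2, first water -> group 5, second water -> group 0,
# sodium ion -> group 1.
group = {0: 2, 1: 2, 3: 2, 6: 2, 9: 2,   # methane
         2: 5, 4: 5, 5: 5,               # first water
         7: 0, 8: 0, 10: 0,              # second water
         11: 1}                          # sodium ion

def is_excluded(i, j):
    """O(1) exclusion test: same group means the pair is excluded."""
    return group[i] == group[j]

print(is_excluded(0, 9))   # True  (both in the methane molecule)
print(is_excluded(2, 7))   # False (different water molecules)
print(is_excluded(11, 4))  # False (ion vs. water)
```

The test requires only one value per particle, in contrast to the per-pair storage of an exclusion matrix.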

6 Multiple Zone Methods

In general, in the methods considered above, two overlapping zones are interacted — the homebox/shell, tower/plate, base/comb, brick/foam or, generically, blue/red. In this section, we consider methods involving multiple zones, any pair of which may be selected to interact depending on the particular method being used. In such a multiple zone method, the import region and homebox are treated as the union of a set of several potentially overlapping zones. Particles in a zone are interacted with all particles in a subset of the other zones.

To demonstrate a multiple zone method for a simple two-dimensional case, consider the import region shown in FIG. 22. The homebox 2210 imports a rectangle 2220 adjacent to the homebox extending R in the +x direction, a rectangle 2230 adjacent to the homebox 2210 extending R in the +y direction, and a quarter of a circle 2240 of radius R adjacent to both rectangles 2220 and 2230. The homebox 2210 is treated as a distinct zone. Homebox 2210 particles are interacted with the particles in each of the other three zones (2220, 2230, and 2240), and particles in the vertical rectangle 2230 are interacted with particles in the horizontal rectangle 2220. Completing the specification, pairs of particles are filtered by interaction radius R, and local interactions (which occur exclusively in the homebox-homebox zone interaction) are filtered to avoid double counting.

Note that the filtering operation can be implemented in a number of ways. If the homeboxes are smaller than an excluded volume about the particles such that there is at most only one particle per homebox, then there will be no local interactions to consider at all. If there are many local interactions, they could be filtered by particle ID or other means. Alternatively, they could be handled by a special case loop construct. If it is desirable to treat the homebox like any other zone and the number of particles per homebox is modest, a fix-up loop could be done to subtract off redundant interactions. Still another possibility for certain kinds of near interactions is to treat homebox-homebox zone interactions like regular zone-zone interactions but scale the results by half.
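Two of these filtering options can be sketched as follows (an illustrative example, not part of the original disclosure; the particle representation and pair function are placeholders):

```python
def homebox_interactions_id_filter(particles, pair_fn):
    """Local interactions filtered by particle ID to avoid double counting."""
    out = []
    for a in range(len(particles)):
        for b in range(a + 1, len(particles)):  # ID filter: only b > a
            out.append(pair_fn(particles[a], particles[b]))
    return out

def homebox_interactions_half_scale(particles, pair_fn):
    """Treat the homebox like any other zone, then scale results by half.

    Each unordered pair is visited twice, as (a, b) and (b, a); halving the
    accumulated result removes the redundancy (valid for symmetric pair_fn).
    """
    total = 0.0
    n = len(particles)
    for a in range(n):
        for b in range(n):
            if a != b:
                total += pair_fn(particles[a], particles[b])
    return total / 2

parts = [1.0, 2.0, 4.0]
dist = lambda x, y: abs(x - y)
print(sum(homebox_interactions_id_filter(parts, dist)))  # 6.0
print(homebox_interactions_half_scale(parts, dist))      # 6.0
```

Both strategies yield the same accumulated result; the choice between them depends on whether a uniform zone-zone inner loop is worth the redundant arithmetic.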

The import volume of this method is always less than that of the HS analog, which is illustrated in FIG. 16. When R is small relative to the home box side length, the import volume is also less than that of the NT analog, which is illustrated in FIG. 13.

The validity of a multiple zone method can be checked by either determining an assignment rule or by applying the convolution criterion. The convolution criterion can be generalized to multiple zones using geometric considerations as follows. Each zone-zone interaction generates a reception and transmission region. The union of all the zone-zone reception and transmission regions must cover the interaction neighborhood of the homebox.

The method illustrated in FIG. 22 corresponds to the following assignment rule:

Alternatively, the convolution criterion can be applied. The reception and transmission regions of the interaction between the homebox 2210 and the other zones 2220, 2230 and 2240 cover all but two quarter circle portions of the interaction neighborhood. The interaction of the rectangular zones 2220 and 2230 covers the remainder.

Note that for any particular multiple zone method, the set of combinations (pairs) of zones for which particle interactions are considered is selected, for instance, to avoid double counting. The order or schedule in which to consider combinations of zones can also be specified, for example, selected according to computation or communication considerations. The order of considering combinations of zones is not important for ensuring that all near interactions are computed, as long as all the selected combinations are considered. However, the schedule can be selected to overlap communication with computation.

For commutative interactions, the specification of a schedule can be thought of as an upper triangular matrix whose size is the number of zones in the method. The entries in this matrix are non-negative integers giving the order in which the zone-zone interactions are computed. A zero entry indicates that the zone-zone interaction is not computed at all.

A schedule for the method in FIG. 22 is shown in FIG. 23. A multiple zone method is specified according to the definitions of the zones, the zone interaction schedule, and any additional redundancy filtering mechanisms.

An optimal schedule is influenced by characteristics of the network and processor on which the method is implemented. One approach to determining schedules, and the approach generally used in this description, involves filling the schedule matrix by row and then by column. This order allows the maximum number of zone-zone interactions to be performed with zones already received as the computation progresses. Having selected an order for the rows and columns of the matrix, a remaining degree of freedom is the ordering of the zones. While a selection of an optimal order may in general be hardware dependent, in the schedules shown in FIG. 23 import zones that participate in more zone-zone interactions are scheduled before those that participate in fewer zone-zone interactions.

The order of interactions that allows maximal overlap of computation and communication depends on details of the available computational hardware. A procedure for determining a schedule is as follows. First, we determine the order in which the zones will be imported, requiring that zone u be imported before zone v if zone u participates in more zone-zone interactions than zone v. We then number the zones according to the order in which they are imported, with zone 1 representing the interaction box, zone 2 representing the first zone imported, zone 3 representing the second zone imported, and so on. We set schedule entries corresponding to pairs of zones that should not be interacted to zero. We then number the remaining entries in column-major order; that is, the interaction in row i1 and column j1 will be computed before the interaction in row i2 and column j2 if j1 < j2, or if j1 = j2 and i1 < i2. This ensures that computation of the interactions between a pair of zones that have already been imported never has to wait for the computation of an interaction involving a zone that has not yet been imported.
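The numbering procedure can be sketched as follows (an illustrative reconstruction, not part of the original disclosure; the 4-zone example and its skip set are hypothetical, loosely modeled on a four-zone NT-style decomposition, and zone indices here start at 0 rather than 1):

```python
def build_schedule(n_zones, skip):
    """Fill the upper-triangular schedule matrix in column-major order.

    Zones are assumed pre-numbered by import order (zone 0 = interaction
    box, zone 1 = first zone imported, ...). `skip` holds (row, col) pairs,
    row <= col, that must not be interacted (entry 0 in the schedule).
    """
    schedule = [[0] * n_zones for _ in range(n_zones)]
    order = 1
    for col in range(n_zones):       # fill column by column ...
        for row in range(col + 1):   # ... and by row within each column
            if (row, col) not in skip:
                schedule[row][col] = order
                order += 1
    return schedule

# Hypothetical example: zones H=0, P=1, U=2, L=3, where the homebox is not
# interacted with the lower tower and no remote zone interacts with itself.
skip = {(0, 3), (1, 1), (2, 2), (3, 3), (2, 3)}
sched = build_schedule(4, skip)
for row in sched:
    print(row)
# [1, 2, 3, 0]
# [0, 0, 4, 5]
# [0, 0, 0, 0]
# [0, 0, 0, 0]
```

The resulting order (H-H, H-P, H-U, P-U, P-L) matches the property described above: each interaction waits only on zones imported no later than its own.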

Multiple zone methods can have a number of practical benefits. First, they can reduce bandwidth requirements relative to previously known methods. Increasing the number of zones allows more near interactions to be computed, and more potential pairs to be considered, for a given import volume, because interactions are computed between zones rather than within a zone: a given particle can be tested against particles from more than two zones. An additional effect is that multiple zone methods can expose additional rounding opportunities.

Second, multiple zone methods can be used to simplify or even eliminate many of the filtering criteria required to avoid redundant computations and/or double counting. For example, in the two-zone NT method, which is illustrated in FIG. 3, without redundancy filtering, interactions between particles whose x and y base coordinates are the same (i.e., are in the tower consisting of the home box 310 and zone 320) are computed twice. However, the NT method can be reformulated as a multiple zone method in a manner that eliminates the need for this testing. This reformulation is given below.

Third, using more zones allows more opportunities to concurrently perform communication for later-scheduled zone combinations while performing earlier zone combinations, thereby "hiding" communication delays with computation. For example, zones could be assigned in the schedule based on latency to import data from that processor. The order in which zone-zone interactions are performed would then allow some interactions to be performed earlier while more remote zones are imported.

To demonstrate these advantages on a three-dimensional method, consider the original NT method reformulated using four zones. The zones are denoted the homebox (H) 310, the lower tower (L), which forms a portion of zone 320, the upper tower (U), which forms another portion of zone 320, and the outer plate (P) 330. The zone interaction schedule is shown in FIG. 24. The schedule eliminates the homebox-tower interaction redundancies; in particular, the homebox is not interacted with the lower tower. Consequently, the rounding criterion applied to the lower tower indicates that the corner furthest from the outer plate can be trimmed, reducing communication bandwidth.

Note that the lower tower is trimmed by the intersection of two perpendicular cylinders of radius R; the impact on bandwidth is small for large R but significant for small R.

Additionally, interaction filtering is simplified as there is no longer a need to filter interactions between the particles in the homebox and particles in the lower tower. Referring to FIG. 25, an algorithm for implementing communication and computation corresponding to the schedule in FIG. 24 gives a communication sequence based on the schedule appropriate for MPI-style cooperative message passing systems for a force computation. The algorithm partially overlaps both particle import and force export. At the beginning of a computation cycle (one time step of the simulation), the data characterizing the particles are sent from the homebox to other homeboxes that will need that data for computations at those homeboxes (step 2501). The other homeboxes similarly begin sending their data and the homebox begins receiving data for zones P, U, and L (step 2502). As the other homeboxes compute forces on particles in the homebox, they start sending the computed forces to the homebox, which begins receiving them (step 2503). The homebox already has all the data for particles in the homebox, so it can perform the homebox-homebox interactions without waiting for the completion of any of the communication that is in progress (step 2504). After the homebox ends the step of receiving data for particles in the plate (P) that was initiated in step 2502 (step 2505), the homebox computes the homebox-plate (H-P) interactions (step 2506). After the homebox ends the step of receiving data for particles in the upper tower (U) that was initiated in step 2502 (step 2507), the homebox computes the homebox-upper tower (H-U) interactions (step 2508) and computes the plate-upper tower (P-U) interactions (step 2509). At this point, the homebox begins sending the computed forces for particles in the upper tower (U) to their respective homeboxes (step 2510).
After the homebox ends the step of receiving data for particles in the lower tower (L) that was initiated in step 2502 (step 2511), the homebox computes the plate-lower tower (P-L) interactions (step 2512). At this point, the homebox begins sending the computed forces for particles in the plate and lower tower to their respective homeboxes (step 2513). Finally, the homebox finishes receiving the forces computed for its particles from other homeboxes (step 2514).

Another way a multiple-zone schedule approach can be used is to break up each of the types of interaction zones described above into a collection of uniform voxels. The voxels can have the same shape as the homeboxes, but this is not necessary. The schedule is then specified according to which pairs of voxels are to be interacted, achieving the same result as a schedule defined on the larger interaction zones, with an added advantage that better rounding can be achieved by scheduling only those pairs of voxels for which some pairs of particles may be within the interaction radius.

6.1 A Lower Bound for Import Volume

In the appendix, we establish a lower bound on the import volume (Vimport) of a multiple zone method by considering the number of pairwise interactions that need to be computed. This bound takes the form:

Above, Nzr is the number of zones that import remote particles, and Vh is the volume of a homebox. Vir,remote is the expected volume of the portion of the interaction region that resides on other processors for a randomly selected particle. Vir,remote depends on R, Vh, and the homebox aspect ratio. The lower bound decreases as the number of zones increases, as Vh increases, or as R decreases.

The bound for the import volume of a method is met when the remote import zones are balanced in volume, the interaction schedule is dense in the sense that every zone interacts with every other zone, and the method is efficient in the sense that every pair considered is a non-redundant remote interaction. Geometrically, this is often not feasible, particularly when the interaction radius is small compared to the homebox dimension. In the appendix, we show that one can approximate finite homebox size effects by replacing Vir,remote in the above bound with the volume of the remote portion of a homebox interaction neighborhood. For spherical interaction regions, this volume (Vir,remote,eff) can be expressed as

Above, the normalized homebox aspect ratios and the parallelization parameter are used. This effective interaction region volume is exact in the limit of large αR for any Nzr and in the limit Nzr = 1 for any αR.

6.2 Optimization Priorities

By symmetry, the volume of the effective remote interaction region is minimized for cubic homeboxes (αx = αy = αz = 1). This gives a less tight approximate lower bound that is independent of homebox aspect ratio:

This bound does not indicate that cubic homeboxes are optimal; it is simply a bound that is independent of homebox aspect ratio. There are three terms involving αR in the bound. These terms geometrically originate from the 8 corners (the cubic term), the 12 edges (the quadratic term) and the 6 faces (the linear term) of the interaction neighborhood.

Referring to FIG. 26, the magnitude of the three terms is illustrated as a function of αR. The magnitudes of the three terms are normalized so that the magnitude of the corner term is 1.0. In the low parallelism limit, the face terms dominate the import volume, followed by edge terms and then corner terms. In this limit, a method should strive to minimize the volume of imports required to cover the faces of the interaction neighborhood, followed by the import of edges. As the parallelization parameter increases, though, these priorities change. The first transition point occurs at αR > 2/π ≈ 0.64. After this transition, the priorities are ordered edge, face, corner. The next transition occurs at αR > (9/(2π))^(1/2) ≈ 1.20; the priorities are then ordered edge, corner, face. The final transition occurs at αR > 9/4 = 2.25, when the priorities become ordered corner, edge, face.

These optimization priorities are approximate and meant to serve as guidelines in designing methods. Depending on the exact homebox aspect ratio and the details of the system running the methods, the exact transition points between various methods could be different.
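The transition points follow from the face/edge/corner decomposition of the interaction neighborhood of a cubic homebox; the following sketch (an illustrative reconstruction, not part of the original disclosure, with term coefficients taken from that decomposition) verifies where the term magnitudes cross:

```python
import math

# Contributions to the interaction-neighborhood volume of a cubic homebox,
# normalized by homebox volume, as functions of the parallelization
# parameter a_R = R / (homebox side):
face = lambda a: 6 * a                       # 6 faces   (linear term)
edge = lambda a: 3 * math.pi * a ** 2        # 12 edges  (quadratic term)
corner = lambda a: (4 / 3) * math.pi * a ** 3  # 8 corners (cubic term)

print(round(2 / math.pi, 2))                    # face = edge   at ~0.64
print(round((9 / (2 * math.pi)) ** 0.5, 2))     # face = corner at ~1.2
print(9 / 4)                                    # edge = corner at 2.25
```

Below 0.64 the ordering is face, edge, corner; the three printed values are exactly the transition points quoted above.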

7 Applications of generalized decompositions

In this section, specific instances of the generalized approach are presented. While most methods are not proven to be optimal in any particular sense, in many cases the methods are believed to be optimal or very close to optimal in terms of bandwidth for their intended regime. In most methods described, the only interaction filtering that needs to be applied is interaction radius and local interaction filtering. That is, filtering is not generally required to avoid considering pairs of particles more than once in the system, other than for pairs of particles that come from the same homebox.

7.1 Eighth Shell Method

The import region for a HS method consists of 4 corner, 6 edge, and 3 face import subregions. A face subregion includes homeboxes that differ in one dimension, an edge subregion differs in two dimensions, and a corner subregion differs in all three dimensions. Intuitively, the half-shell method might seem to be the best that could be achieved in the low parallelism limit (αR < 0.64). This is not the case.

The eighth-shell (ES) method is the three-dimensional generalization of the two-dimensional method illustrated in FIG. 22. The ES method import region consists of 7 remote zones of the homebox H 2710: 3 face zones (the +x import subregion denoted FX 2720, and similarly for y (FY is hidden in FIG. 27) and z, FZ 2722), 3 edge zones (the +y+z import subregion denoted EX 2730, the +x+z subregion denoted EY 2731, the +x+y subregion denoted EZ 2732), and 1 corner zone (the +x+y+z import subregion denoted C 2740). Referring to FIG. 28, an interaction schedule corresponds to the regions shown in FIG. 27. The eighth shell method is so named because asymptotically, its import region consists of one-eighth of the full shell surrounding the homebox.

The ES import region is smaller than the HS import region by 3 edges and 3 corners. The edge-face interactions (interactions 9, 11 and 13) cover the interactions that are handled in the HS method by the interaction of the homebox with the 3 removed corners. Likewise, the face-face interactions (interactions 4, 6 and 7) cover interactions that were handled by the interaction between the homebox and the 3 removed edges. Like its two-dimensional counterpart, the method can be described by an assignment rule.

The total import volume is:

This method may be an optimal bandwidth method in the low parallelism limit. It is also highly symmetric, which can have additional benefits on various network topologies. Referring to FIG. 29, the bandwidth of a number of methods is compared to the standard of the half-shell (HS) method 2910. The eighth-shell method 2914 is compared against several other methods and the approximate lower bound when using cubical homeboxes. The eighth-shell method conforms to the low parallelism optimization priorities of minimizing face, edge, and corner regions. The optimization priorities indicate the origin of the contributions to the import volume in terms of different geometric categories as a function of the degree of parallelism. For example, at a low degree of parallelism, face imports dominate the import region and therefore would be preferable to address as a priority for further optimization.
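Assuming the face, edge, and corner subregions round to slabs, quarter cylinders, and eighth spheres respectively (a reconstruction consistent with the subregion counts above, not taken verbatim from the original), the ES/HS import volume ratio can be sketched numerically:

```python
import math

def hs_import(b, R):
    """Half-shell: 3 faces, 6 quarter-cylinder edges, 4 eighth-sphere corners."""
    return 3 * b * b * R + 6 * (math.pi / 4) * R * R * b + 4 * (math.pi / 6) * R ** 3

def es_import(b, R):
    """Eighth-shell: 3 faces, 3 quarter-cylinder edges, 1 eighth-sphere corner."""
    return 3 * b * b * R + 3 * (math.pi / 4) * R * R * b + 1 * (math.pi / 6) * R ** 3

# Ratio of ES to HS import volume for unit homeboxes at several a_R = R/b:
for a_R in (0.1, 1.0, 10.0):
    print(round(es_import(1.0, a_R) / hs_import(1.0, a_R), 2))
# 0.93
# 0.6
# 0.3
```

At low parallelism both methods are face-dominated (ratio near 1), while at high parallelism the ratio tends toward 1/4, since the ES corner import is one-eighth of a full shell's corners versus one-half for HS.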

7.2 Sliced Half Shell Method

A class of methods for the range 0.64 < αR < 1.20 may be well matched to the optimization priorities. Referring to FIG. 30, a Sliced Half Shell (SHS) import region consists of 4 corner import subregions 3010 (denoted as a single zone C4) in a sliced hemisphere arrangement and 5 face imports 3020 (denoted as the 5 zones N, S, E, W, L). Note that the S and W face imports can be rounded further than the N and E face imports, as they do not interact with the homebox 3030. A schedule corresponding to FIG. 30 is given in FIG. 31.

7.3 Multiple Zone NT and SNT methods

The schedule for the multiple zone NT method was presented above and is shown in FIG. 24. Referring to FIG. 32, a multiple zone version of the two-zone SNT method illustrated in FIG. 18 consists of an extended base 3220, which is below the homebox 3210. The comb 3230 and the portion 3240 of the base that are level with the homebox form four import regions. The remainder of the comb 3250 forms the remaining import regions. The multiple zone SNT method has a relatively complicated schedule, shown in FIG. 33.

As for the NT method, the multiple zone version of the SNT method eliminates redundant interactions and allows for some additional rounding opportunities. If the multiple zone SNT method is further generalized to include methods where there is no spacing between the bars of the comb, then the extended base disappears and the +x import subregion becomes unnecessary, as it only interacts with the extended base. The remaining set of import subregions exactly corresponds to the NT method. It can be shown that if one has the freedom to optimize the aspect ratio of a homebox in addition to the spacing between the bars in the SNT method, the SNT method is optimized by turning it into an NT method.

At sufficiently high parallelization parameters, when the homebox aspect ratio is fixed, an SNT method with judiciously chosen periodicity will be superior in bandwidth to the NT method. This can be understood by considering the optimization priorities. Appropriate for the higher parallelization regimes, both methods do not import any corner subregions. Further, both import the same number of edge-like areas. However, for a fixed aspect ratio, the NT method optimizes the number of face imports at the expense of the balance of volume between the outer tower and outer plate. The SNT method optimizes for balance of volume between the extended base and partial comb at the expense of the number of face imports. Asymptotically, the face imports are negligible contributions to the import volume, so that for fixed aspect ratio, the SNT method achieves lower import bandwidth than the NT method. Based on the optimization priorities, the NT method is expected to be most viable in the parallelization regime 1.20 < αR < 2.25, while the SNT method is expected to become viable beyond αR > 2.25. Accordingly, the crossover is at αR ≈ 2.7 for cubic homeboxes, as can be seen in FIG. 34.

7.4 Foam Method

The foam method is the three-dimensional generalization of the method shown in FIG. 15. This variant of the foam method has many rounding opportunities and imports a brick of homeboxes of dimension s × s × s, centered in the x and y directions on the homebox, whose top is level with the top of the homebox. The homebox also imports a foam of homeboxes with a periodicity of s × s × s. The bottom of the foam is level with the bottom of the homebox, and the foam is rounded against the brick to form a hemispherical structure. (In the low parallelism limit, the brick can be rounded as well, but the method is not competitive in that limit.) Asymptotically, the import volume of this method is:

Optimizing the periodicity to minimize the import volume yields:

Thus, the asymptotic import volume of an optimal foam method is:

The foam method achieves the exact lower bound asymptotically for a method with two remote zones. Practically, this method is best matched to situations in which the parallelization parameter is large. As can be seen in FIG. 34, foam only becomes competitive at parallelization parameters in excess of ~14. This is a consequence of the fact that as the number of processors increases, the relative contribution of the overhead of foam faces and foam corners to the import volume only declines very slowly.

Implementations of foam methods can be coupled with zone scheduling on a homebox-by-homebox level to "hide" import latencies from more distant homeboxes. Since the number of homeboxes imported is large when foam is practical, a zone interaction schedule is omitted for brevity. Conceptually, the key elements of the schedule are that the homebox is interacted with both the foam and the brick, while the foam and brick are interacted with one another.

7.5 Comparison of Communication Bandwidth Requirements

FIGS. 34 and 35 summarize the import volumes of a number of methods described above over a range of parallelization parameters. For cubic homeboxes, the ES method is best among these methods for αR < 0.8. The NT method is best for 0.8 < αR < 2.7. The SNT method is best for 2.7 < αR < ~14. Beyond that, foam methods are increasingly competitive. These transitions can be related to the number of processors used for parallelization. For example, consider a molecular dynamics simulation with N = 25,000 atoms and R = 12 Å. Biological systems have a density of atoms of roughly D = 0.1/Å³. Assuming cubic homeboxes are being employed and that each atom is represented by a single particle, the parallelization parameter can be written in terms of these parameters:

The graph in FIG. 34 indicates that when using below ~75 processors (moderate sized clusters), the eighth-shell method will minimize communication bandwidth. Between ~75 and ~2,800 processors (large clusters), the NT method is best. Beyond ~2,800 processors (for example, in the range of ultra-parallel hardware such as BlueGene or QCDOC), more elaborate methods are preferable, beginning with the SNT method with a spacing of 2. The foam methods begin to become preferable at the extreme half-million-processor level.
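As an illustrative check (not part of the original disclosure), the processor counts above can be recovered from the parallelization parameter αR = R/b, where b = (N/(D·p))^(1/3) is the cubic homebox side length:

```python
N, D, R = 25_000, 0.1, 12.0  # atoms, atoms per cubic angstrom, cutoff (A)

def processors(a_R):
    """Processor count at which cubic homeboxes reach parameter a_R.

    a_R = R / b with homebox side b = (N / (D * p))**(1/3), so
    p = (N / D) * (a_R / R)**3.
    """
    return (N / D) * (a_R / R) ** 3

print(round(processors(0.8)))   # 74     (ES -> NT crossover, ~75)
print(round(processors(2.7)))   # 2848   (NT -> SNT crossover, ~2,800)
print(round(processors(14.0)))  # 396991 (foam regime, ~half a million)
```

The cubic dependence of p on αR explains why the crossovers span several orders of magnitude in processor count.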

The relative import volumes of the various methods are different if the homebox aspect ratio is allowed to float. FIG. 35 shows the relative import volumes when each method is allowed to optimize the homebox aspect ratio, assuming a fixed global cell volume and number of processors. In this case, the eighth-shell method is still optimal for αR < ~0.8, followed by the NT method, followed by the foam method at extreme parallelization levels. The floating homebox aspect ratio is particularly important for fully interconnected networks and in the massive parallelism limit. In practice, the global cell aspect ratio is usually fixed by the phenomena being simulated. The set of available potential homebox aspect ratios is then set by the network topology and/or the number of ways the global cell can be partitioned among the processors. On fully interconnected networks, though, the topology does not bias in favor of a particular homebox mesh. Similarly, if there are a large number of processors available for the simulation, the mesh may be partitioned very flexibly, especially if the exact number of processors used is not fixed. As a result, on such hardware, communication schemes may be improved by optimizing homebox aspect ratio on a per-method basis.

7.6 Multiple-particle interactions

In the description above, various approaches are described in the context of computing pairwise interactions. The approaches are also applicable to computing interactions between sets of more than two particles. For example, the computation of forces between atoms in a molecule can be arranged using these approaches. In the multiple zone approaches, for example using the ES method, particular combinations of more than two zones can be scheduled in order to arrange such multiple-particle interactions.

For a midpoint method for interactions that involve three or more particles, computing the center of the smallest sphere containing m particles is computationally nontrivial when m ≥ 3 . One option for avoiding this computation is to use midpoint-ensured methods that do not require computation of a midpoint. Another is to use an efficient approximation for the midpoint. Two such approximations are:

1. For each of the coordinate dimensions, approximate the coordinate of the midpoint as the arithmetic mean of the coordinates of the particles.

2. For each of the coordinate dimensions, approximate the coordinate of the midpoint as the mean of the minimum and maximum coordinates of the particles (that is, use the center of the smallest axis-aligned box containing the particles).
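The two approximations can be sketched as follows. This is a minimal illustration (function names are ours), assuming, consistently with the R/2 import bound discussed below, that approximation 2 uses the center of the smallest axis-aligned box containing the particles:

```python
def midpoint_mean(points):
    """Approximation 1: per-coordinate arithmetic mean of the particles."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def midpoint_bbox(points):
    """Approximation 2: per-coordinate mean of the minimum and maximum
    coordinates, i.e. the center of the smallest axis-aligned box
    containing the particles."""
    dims = len(points[0])
    return tuple((min(p[d] for p in points) + max(p[d] for p in points)) / 2
                 for d in range(dims))

# Two collocated particles and one outlier: the mean is pulled toward
# the cluster, while the bounding-box center is not.
pts = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
print(midpoint_mean(pts))  # (1.0, 0.0, 0.0)
print(midpoint_bbox(pts))  # (1.5, 0.0, 0.0)
```

The example shows why the two approximations need different import-region expansions: collocated particles bias the arithmetic mean, while the bounding-box center depends only on the extremes.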

When using either approximation, one needs to expand the import region slightly to ensure that the box containing the approximate midpoint of a set of interacting particles will import all remote particles in that set. In the absence of any restrictions on the spatial distribution of particles, use of approximation 1 requires that the import region contain all particles within a distance R(m−1)/m of the interaction box boundary, implying a substantial increase in import volume (if m−1 particles are collocated inside the interaction box at a distance just under R/m from the boundary and the mth particle is located outside the interaction box at a distance just under R(m−1)/m from the boundary, then the midpoint will fall within the boundary). In MD simulations using common biomolecular force fields, however, a smaller increase in import volume may be sufficient, because the spatial arrangement of the particles involved in a bonded term is not arbitrary; for example, any pair of atoms must maintain some minimal distance from one another under normal conditions.

A sufficient import volume for use of approximation 2 is that enclosed by a rectangular parallelepiped extending a distance R/2 beyond the interaction box in the positive and negative directions along each coordinate axis. One can show that a somewhat smaller volume suffices: one that includes all remote particles within a rectangular parallelepiped extending a distance R/4 beyond the interaction box in each dimension, as well as all particles within a distance R/4 of that parallelepiped. In three or fewer dimensions, for m ≥ 3, this import volume is always smaller than that required by approximation 1 in the absence of restrictions on the spatial distribution of particles.

8 Serialized implementation

One approach to computation of particle interactions, which can include one or more of the algorithmic techniques described above, makes use of a serialized software implementation. In such an implementation, a processing node performs computations for homeboxes in a serialized manner. For example, for each homebox, a multiple zone method is implemented in which interactions between pairs of particles in respective pairs of zones are computed by the processor. In one version of such a serialized implementation, a single processing node performs all the computations for all the homeboxes in a serialized manner. In another version, a single processing node of a set of multiple processing nodes is responsible for a set of homeboxes, for example, that together define a region of space (e.g., a cube). The processing node may have a number of processors that share at least some common memory. For example, each of the processors at a node may have a private cache memory and access a shared memory that is local to the node as a whole. There may be a further level of memory, for example one shared between processing nodes, but that is not directly significant to the discussion below.

Referring to FIG. 45, in one version of a serialized implementation, a processor 4540 uses a local memory 4550. For example, the local memory 4550 may be a fast cache memory that is associated with the processor 4540, and which provides local copies of data in a global memory 4560. The global memory 4560 holds the particle data 4562 for the entire global cell being simulated by the system.

One aspect of local or cache memories is that they are typically faster than larger memories, but are limited in size. Some arrangements of the computational iterations may result in relatively larger numbers of memory accesses to the global memory 4560 than others. To the extent that the number of global memory accesses is a significant factor that determines the time required for the overall computation, such arrangements of the iterations may require relatively more overall computation time.

There are speed advantages to maintaining all or most of the data required for a computation in the local memory 4550 if that data can be used multiple times. This reduces the overall memory latency in that an initial fetch of the data may be relatively slow, but subsequent accesses to the data that can use the cached copy of the data are relatively fast.

One way of making effective use of limited size caches is to "cache block" the near interactions. The term "cache block" is used by analogy to cache blocking techniques used to accelerate matrix-matrix multiplication. Naïve implementations of large matrix-matrix multiplication require O(N³) data transferred to and from main memory (bandwidth) when the matrices being multiplied do not fit entirely in the cache memory. Generally, this results from the entries in the matrices being repeatedly flushed from and then reloaded into the cache as the computation continues. However, by breaking each matrix into a collection of "cache blocks" and determining an appropriate schedule by which cache blocks are loaded, processed, and stored, matrix-matrix multiplication can be structured to have O(N²) bandwidth.
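The cache-blocking idea for matrix multiplication can be illustrated schematically as follows (a sketch, not the patented method; the function name and the block size are ours, and in practice the block size would be tuned so two blocks fit in cache):

```python
def blocked_matmul(A, B, block=2):
    """Multiply square matrices (lists of lists) block by block.  Each
    (i0, j0) block of C is accumulated from block-sized panels of A and
    B, so a loaded block participates in many updates before being
    evicted, reducing main-memory traffic from O(N^3) toward O(N^2)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # One "cache block" of work: all updates using the
                # (i0,k0) panel of A and the (k0,j0) panel of B.
                for i in range(i0, min(i0 + block, n)):
                    for k in range(k0, min(k0 + block, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + block, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(blocked_matmul(A, B, block=1))  # [[19.0, 22.0], [43.0, 50.0]]
```

The blocked schedule produces the same result as the naïve triple loop for any block size; only the order of memory accesses changes.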

In computation of pairwise interactions, in a worst-case use of the local memory, computation of every interaction between a pair of particles requires accessing the global memory 4560 for at least one of the particles. For example, this may result if the order in which particles are interacted is chosen without regard to their locations, or if a method such as the half-shell method is used but the particle data for an entire half shell does not fit in the cache. In such a worst-case situation, the number of such global memory accesses is O((DV)(DR³)) = O(D²VR³), where V is the volume being simulated, D is the density of particles, and R is the interaction radius. That is, DV is the total number of particles, and the average number of particles within the interaction radius of a particle is proportional to DR³. On average, for every load of a particle's data into the cache, essentially only a single pairwise computation is performed.

In a cache-blocked arrangement of the computation of pairwise interactions, the number of global memory accesses can be reduced to O(D V^(1/2) R^(3/2) N_v^(1/2)), where N_v is the number of "voxels" used. Specifically, the volume V being simulated is partitioned into N_v voxels, in general each having the same shape. In general, each voxel is analogous to a homebox in the decomposition approaches described above.

Referring to FIG. 45, the general approach to arranging the computation of the pairwise interactions involves loading the particle data for an import region 4570 and its associated homebox (i.e., the interaction zones associated with the homebox) into an interaction region data portion 4552 of the local memory 4550. The size of the import region, which depends on the volume and geometry of the voxels, is chosen such that the data for the particles in the interaction region fits in the local memory 4550. Once this data is loaded into the local memory, interactions between any pair of particles in the interaction region can be computed without necessarily accessing the global memory.

The saving in accesses to the global memory can generally be understood through the number of times the data for each particle is used in a pairwise computation. In the case of two interaction zones, if the local memory is sufficiently large to hold the particle data for 2M particles (M from each of two equal-volume interaction zones), then up to M² interactions can be computed. That is, there are M/2 pairwise computations for every particle loaded into the cache, resulting in an M/2 speedup with respect to memory access time relative to the naïve worst-case method.

As one example of a type of interaction region, an NT approach, as described in Section 2.2 and illustrated in FIG. 3, is used. The interaction region 4570 is partitioned into voxels (voxel 4564 being indicated as a representative voxel). The local memory 4550 also includes a schedule that identifies the pairs of voxels that are to be interacted. For example, each voxel in the tower portion of the interaction region 4570 is interacted with each voxel in the plate portion. For each pair of voxels that are interacted, all the pairs of particles, one particle from each voxel, can be considered.

In other examples, the entire import region associated with a homebox does not necessarily fit in the local memory. However, if the import region is divided into multiple zones such that pairs of the multiple zones fit in the local memory, they can be loaded into the local memory according to a multiple-zone schedule and all the pairs of particles interacted while the data remains in the local memory. In other examples, subsets of zones are loaded into the local memory and all scheduled combinations of pairs of the loaded zones are interacted while the data for those zones remains in the local memory. In these examples, generally, an M/2 speedup is also achieved to the extent that 2M particles are loaded and M² interactions are computed.
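The schedule-driven cache-block loop described above can be sketched as follows. This is a toy illustration (names and the pair-counting "kernel" are ours); loading a scheduled pair of zones stands in for loading two cache blocks, after which all cross pairs are examined without further global-memory traffic:

```python
import math

def interact_scheduled(zones, schedule, cutoff):
    """Process a schedule of zone pairs.  For each scheduled pair, the
    two particle lists act as resident cache blocks and every cross
    pair is tested against the cutoff.  Returns the number of in-range
    pairs (a real kernel would accumulate forces instead)."""
    count = 0
    for za, zb in schedule:
        block_a, block_b = zones[za], zones[zb]  # the two loaded blocks
        for pa in block_a:
            for pb in block_b:
                if math.dist(pa, pb) < cutoff:
                    count += 1
    return count

# A tower particle against a two-particle plate; only one pair is in range.
zones = {"tower": [(0.0, 0.0, 0.0)],
         "plate": [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]}
print(interact_scheduled(zones, [("tower", "plate")], cutoff=2.0))  # 1
```

The schedule list plays the role of the multiple-zone schedule in the text: it fixes which blocks are co-resident in the local memory at each step.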

For an optimal choice of voxel height (i.e., a square cross-section tower and a relatively thin plate), the volume of the import region is O(R^(3/2)) rather than O(R²) as with cubic voxels. For the NT method, the geometry of the voxels can be chosen by optimizing the height of the voxel (the thickness of the plate) such that bandwidth scales as the total number of particles times the square root of the expected number of particles in an interaction region. Often, there are 200-400 particles expected in an interaction region, so this square-root reduction can be significant.

Various architectures of local memory can be used with these approaches. In one example, the local memory is a cache memory that uses the same address space as the global memory, and a cache controller maintains copies of the most recently accessed data from the global memory in the cache. As another example, a non-uniform memory architecture can be used in which the local memory uses a distinct address space from the global memory, and the processor explicitly copies the particle data from the global memory to the local memory as needed.

The approach can be extended to a multiple level cache structure in which a first level cache may have space to hold two subzones, which are read from a second level cache that holds more data, for example, the entire import region for a homebox.

The loading of data into the local memory may be aided by a special-purpose memory controller. For example, loading the data from one voxel can use a regular and generally relatively large memory transfer that is performed using hardware external to the processor (e.g., special-purpose FPGAs).

A multiple processor implementation can also be used in which each processor has its own local memory, and all the local memories are coupled to a shared global memory. The schedule of voxel pairs to interact is then split among the processors so that together all the required interactions are performed.

Between iterations, the particle data is reassigned to the voxels in which the particles are located (as a result of motion of the particles caused by the forces computed in the previous iteration). This partitioning of the data is performed using an efficient O(N) algorithm, where N is the number of particles. During the partitioning process, a data structure (an array) is built that allows efficient access to the strips of adjacent voxels along one of the orthogonal axes that are needed when loading an interaction zone into the local memory.
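An O(N) repartitioning of this kind is commonly implemented as a counting sort. The sketch below (a minimal illustration; names are ours) makes one pass to count per-voxel occupancy, a prefix sum over voxels, and a second pass to scatter particle indices, so that the particles of any voxel, or of a contiguous strip of voxels, occupy a contiguous range:

```python
def bin_particles(positions, box, nvox):
    """Counting-sort particles into voxels in O(N).  Returns the
    prefix-sum array `counts` (voxel v owns slots counts[v]:counts[v+1])
    and `order`, the particle indices sorted by voxel."""
    def voxel_of(p):
        return tuple(min(int(p[d] / box[d] * nvox[d]), nvox[d] - 1)
                     for d in range(3))

    def flat(v):  # row-major flat voxel index
        return (v[0] * nvox[1] + v[1]) * nvox[2] + v[2]

    total = nvox[0] * nvox[1] * nvox[2]
    counts = [0] * (total + 1)
    for p in positions:                      # pass 1: per-voxel counts
        counts[flat(voxel_of(p)) + 1] += 1
    for i in range(total):                   # prefix sum over voxels
        counts[i + 1] += counts[i]
    order = [0] * len(positions)
    cursor = counts[:]                       # pass 2: scatter indices
    for i, p in enumerate(positions):
        v = flat(voxel_of(p))
        order[cursor[v]] = i
        cursor[v] += 1
    return counts, order

counts, order = bin_particles(
    [(0.1, 0.1, 0.1), (0.9, 0.9, 0.9), (0.2, 0.1, 0.1)],
    box=(1.0, 1.0, 1.0), nvox=(2, 2, 2))
print(order)  # [0, 2, 1]: the two low-corner particles, then the far one
```

Because voxels are numbered row-major, the `counts` array also gives direct access to contiguous runs ("strips") of voxels along the fastest-varying axis, as used when loading an interaction zone.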

The order of voxels processed is chosen so that in addition to reduced memory bandwidth, particles loaded into the cache blocks for the interaction zones for one voxel will substantially overlap the particles for the interaction zones of the next voxel. This further accelerates the operations by increasing temporal hardware cache locality.

As described above, the computation involves computing interactions between pairs of particles loaded into the local memory in a schedule-based loop. One alternative is to defer computation of the interactions in this loop and instead to compute a list of pairs of indexes of the particles to be interacted. This list is then processed in a second loop to actually compute the interactions. This second loop may be accelerated using special-purpose hardware. Because the list of pairs is already computed, such special-purpose hardware may be relatively simple.
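The two-phase structure can be sketched as follows (an illustration only; the 1/r "energy" stands in for whatever pairwise kernel the second loop, or its hardware accelerator, would evaluate):

```python
import math

def build_pair_list(positions, cutoff):
    """Phase 1: record the indices of particle pairs within the cutoff,
    deferring the interaction math itself."""
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            if math.dist(positions[i], positions[j]) < cutoff:
                pairs.append((i, j))
    return pairs

def compute_interactions(positions, pairs):
    """Phase 2: walk the precomputed list; no distance checks or
    matchmaking remain, only the arithmetic kernel (here a toy 1/r)."""
    return sum(1.0 / math.dist(positions[i], positions[j])
               for i, j in pairs)

pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pairs = build_pair_list(pos, cutoff=2.0)
print(pairs)                             # [(0, 1)]
print(compute_interactions(pos, pairs))  # 1.0
```

Separating matchmaking from evaluation is what keeps the second loop simple enough for special-purpose hardware: it is a straight stream of index pairs with no control flow.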

Non-commutative (e.g., grid-particle) interactions can be organized so that the zone and voxel sequencing is such that multiple processors can write without conflict to the same output arrays. For example, in computations where the output is a charge density array, this can eliminate the need to use one charge array per processor and the need to reduce these partial charge arrays into a single output array.

A specific instance of a highly optimized cache-blocking method is the dual of the NT method described above and has four cache block zones. The first zone is a single homebox voxel. The next zone is a "tower" of voxels. It consists of all non-homebox voxels along an axis-aligned voxel ray that starts on the homebox voxel and that contain at least one point within R of the closest point in the homebox voxel. The last two zones form an approximately circular plate of radius ~R. These two zones consist of all non-homebox voxels that contain at least one point within R of the closest point in the homebox voxel in the plane perpendicular to the tower that contains the homebox voxel. These two zones are split evenly into a "yin" zone and a "yang" zone. These were so named as the most convenient division of the plate results in symmetric zones that fit together like a yin-yang symbol.

The cache block zones are interacted as follows: homebox-homebox, homebox-tower, homebox-yin, tower-yin, tower-yang. The inner loop takes advantage of the packing of the cache blocks to merge these interactions into two composite interactions: homebox-(homebox,tower,yin) and tower-(yin,yang). It also sets up the inner loop such that no redundant pairs from the homebox-homebox interaction are considered. The only matchmaking criterion required is a distance check.

The zones are oriented such that the voxel strips used to encode the import region run parallel to the long direction of the two plate zones. As a result, very few strips are needed to encode the import region. The outer loop is arranged to traverse the homebox voxels such that the plate needed for each outer-loop voxel significantly overlaps the plates needed for the previous and next outer-loop voxels, which improves temporal locality.

The serialized computation described above can be combined with a parallel architecture, for example, using cache blocking techniques with the processor at each node, while performing computations in parallel at different nodes and communicating the results between the nodes after the computation is completed.

9 Cluster architecture

Another approach to computation of particle interactions, which can include one or more of the algorithmic techniques described above, makes use of a cluster architecture. In this architecture, a number of computers are coupled by a communication network, for example by one or more Gigabit Ethernet segments. Each computer can be a general purpose computer, for example, executing an operating system (e.g., Linux).

In one example, each computer in the cluster is assigned a particular set of homeboxes of the simulation region, for example, assigning computers to homeboxes in a one-to-one manner. Each computer includes storage for data for particles associated with its homebox. At the start of each iteration, the computers exchange particle data, for example, sufficient to implement a midpoint method for computing range-limited interactions. Other computations are also performed using this exchanged data, for example, using the communication sharing approach described in Section 4. Some computations are performed in a distributed manner over multiple computers. For example, FFT computations that are part of the GSE method can be implemented in a distributed manner.

10 Parallel implementation

Another approach to computation of particle interactions, which can include one or more of the algorithmic techniques described above, makes use of special-purpose hardware that is particularly suitable for various computations necessary for computing particle interactions. The special purpose hardware is arranged in a parallel architecture that includes multiple processing nodes, with each processing node being responsible for the computations associated with one homebox. The architecture permits internode communication. One arrangement of nodes is in a three-dimensional array in which each node communicates directly with its six neighbors along three orthogonal axes, and communication of particle data can involve passing the data through multiple nodes to reach its destination.

Referring to FIG. 37, a representative three-dimensional array of computation nodes 3720 is arranged such that each node performs the computation for a corresponding homebox 3710. The nodes are linked by communication paths 3730 along the orthogonal axes, and optionally by diagonal communication paths 3735. Each processing node 3720 includes a processor 3720, which performs numerical calculations associated with the particle interactions, and a communication interface 3750, which is responsible for sending and receiving particle data over communication links 3730, 3735 linking the node to its neighboring nodes.

Before discussing the details of the hardware implementation, the computation performed by the hardware can be understood with reference to FIG. 38. The computation involves a loop that is performed at each time step of the simulation, with each time step representing a simulated time such as 1 or 2 femtoseconds (abbreviated fs, equal to 10^-15 seconds). Generally, each iteration involves three basic steps. Velocity integration 3810 uses the calculated forces on each particle to update the velocity of that particle. Position integration 3820 uses the updated velocities to calculate the updated positions of the particles. Force calculation 3830 uses the new positions of the particles to calculate forces on the particles, thereby allowing the iteration to repeat.
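The per-time-step loop can be sketched as follows. This is a deliberately minimal 1-D illustration (function names and the free-particle example are ours, not the hardware's interfaces): velocity integration from the previous forces, position integration from the new velocities, then force recalculation at the new positions.

```python
def step(positions, velocities, masses, forces_fn, forces, dt):
    """One iteration of the FIG. 38 loop on parallel lists of 1-D
    coordinates: v += (F/m)*dt, x += v*dt, then recompute forces."""
    velocities = [v + f / m * dt
                  for v, f, m in zip(velocities, forces, masses)]
    positions = [x + v * dt for x, v in zip(positions, velocities)]
    forces = forces_fn(positions)  # force calculation at new positions
    return positions, velocities, forces

# Free particle (zero force): constant velocity, linear motion.
no_force = lambda xs: [0.0 for _ in xs]
x, v, f = [0.0], [1.0], no_force([0.0])
for _ in range(3):
    x, v, f = step(x, v, [1.0], no_force, f, dt=0.5)
print(x, v)  # [1.5] [1.0]
```

Returning the freshly computed forces from each step is what allows the loop to repeat, mirroring the arrow from force calculation 3830 back to velocity integration 3810.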

Force calculation involves calculations at different spatial scales. These calculations are performed independently. These calculations include bonded force calculation 3832, midrange force calculation 3834, and distant force calculation 3836. As part of each of these calculations, force accumulation 3838 determines the total force on each particle, with the force on any particular particle potentially including components that are determined by computations at each of the spatial scales. It is not necessary to perform the force calculations at each of the spatial scales at every time step. The distant force computations may be performed less frequently than the midrange computations, which in turn may be performed less frequently than the bonded force calculations, for example, using an MTS (Multiple Time Scale) and/or RESPA (REference System Propagator Algorithm) approach.
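An MTS/RESPA-style evaluation schedule can be sketched as follows. The intervals here are illustrative choices of ours, not values from the text; the point is only that each force class is evaluated at its own interval:

```python
def mts_schedule(n_steps, midrange_every=2, distant_every=4):
    """Return, for each time step, which force classes are evaluated.
    Bonded forces run every step; midrange and distant forces run at
    coarser (illustrative) intervals."""
    plan = []
    for t in range(n_steps):
        classes = ["bonded"]
        if t % midrange_every == 0:
            classes.append("midrange")
        if t % distant_every == 0:
            classes.append("distant")
        plan.append((t, classes))
    return plan

for t, classes in mts_schedule(4):
    print(t, classes)
# Distant forces are evaluated half as often as midrange forces,
# which are evaluated half as often as bonded forces.
```

On steps where a class is skipped, its most recently computed contribution is typically reused, which is what makes the coarser intervals a net saving.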

Another aspect of the parallel implementation involves dedicating different hardware elements to various types or spatial scales of computations. For example, hardware elements are dedicated to computations involving pairwise particle interactions that are part of the midrange force calculation step 3834 (or particle-grid computations that form part of the distant force calculations 3836), and to computations of near interactions that are part of the bonded force calculations 3832. In addition, the force accumulation step 3838 makes use of a memory system that incorporates numeric value accumulation functionality. In the multiple node architecture introduced above, each node includes such dedicated hardware elements for the various types of computations.

Referring to FIG. 39, each computation node 3720 includes a number of computation and communication elements. A flexible subsystem 3912 includes one or more programmable processors. A midrange subsystem 3914 includes special-purpose circuitry that is adapted to perform pairwise computations, for example, between pairs of particles. An FFT (Fast Fourier Transform) subsystem 3924 is used in some embodiments to perform computations related to distant force computations. Yet other embodiments may include additional or fewer specialized or general purpose computation elements.

Communication within one processing node makes use of a ring communication architecture that includes a number of ring stations 3920 coupled by communication links 3922. In this version of the system, there are eight ring stations per node, and the links are high speed serial communication links. Each ring station has additional links. For example, links 3924 couple the ring stations to the computation elements of the node. Other links 3730 couple ring stations for this node to ring stations 3930 of the six neighboring processing nodes 3720 of the toroidal mesh of processors. Other links couple the ring stations to a memory controller 3950, which provides access to a memory 3954, and to a host interface 3940, which is used for example for administrative functions such as loading programs to be executed on the computation elements.

The communication architecture, which is made up of the interconnected communication rings on each node, supports a global addressing approach by which each computation or communication element coupled to a ring station can address a data packet to any other element in the system, which may be on the same node or on another node. Also, as is described further below, the communication architecture supports forms of limited multicast that allow data packets sent from a source to be distributed to multiple destinations without necessarily broadcasting the packet to all nodes in the system or sending multiple copies of the data packet from the source node.

In general, implementation of the iteration illustrated in the flowchart of FIG. 38 using the node architecture illustrated in FIG. 39 involves partitioning tasks among nodes, and partitioning the processing steps among the computation elements of each node. Furthermore, each of the subsystems takes advantage of internal parallelism. Overall, the implementation provides parallel processing and pipelining of computations at multiple levels of the architecture.

Referring to FIG. 40, computation elements for two nodes are illustrated, with indications of some of the data flows between the elements in a computation iteration. The communication links within a node and between nodes are not illustrated. The data flow that is illustrated relates to execution of the velocity and position integration steps (3810, 3820 in FIG. 38) as well as the midrange force calculation 3834. Other data flows, for example, related to distant forces and bonded forces are not illustrated in this figure.

Generally, at one point of the iteration time step, the flexible subsystem 3912 on each node retrieves particle force data for particles it is responsible for from its own memory 3954. Using the force data, it performs the integration steps required to determine the new particle positions. This particle position data does not have to be returned to the memories 3954. Rather, each flexible subsystem 3912 multicasts its position data, and this position data is passed to those midrange systems 3914 that will use the data to compute pairwise particle interactions, for example, using the NT method described above. In particular, the multicast approach ensures that the midrange subsystem of each node receives the particle data for the entire import region needed for the decomposition approach being used. Furthermore, data is "pushed" to the consumer of the data without necessarily requiring requests or acknowledgements for the data, thereby reducing communication load and delay associated with such communication.

In a pipelined manner, the midrange subsystems 3914 begin computing pairwise interactions as they receive the required particle data. For example, the midrange subsystem 3914 at a node may receive the particle position data for particles in that node's homebox relatively quickly over on-node communication paths, and then start receiving position data from neighboring nodes.

As the midrange system 3914 computes force contributions for particles, it distributes these contribution terms to the memories 3954 at the nodes responsible for those particles. The memories include accumulation functionality that allows the memory to sum the force contributions on each particle as the force contributions for that particle are received from various nodes. Using the accumulation function of the memories, different processors can compute components of a sum (e.g., of forces) and send them to a memory where the components are accumulated. Because the communication of the components does not require acknowledgement from the memories to the processors that produced the components, efficient accumulation is achieved. The memories implement the accumulation so that it is not order-dependent, ensuring that the identical result is obtained regardless of the order in which components to be summed together arrive at the memory.
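One way to make accumulation exactly order-independent, sketched here in software, is to accumulate in fixed point: integer addition is associative, whereas floating-point addition is not. This is an illustration of the property, not the hardware's implementation, and the scale factor is an arbitrary choice of ours:

```python
SCALE = 2 ** 32  # fixed-point fraction bits; an illustrative choice

def accumulate_fixed(contributions):
    """Sum force contributions in fixed point.  Each contribution is
    quantized once, then summed with exact integer arithmetic, so the
    result is bit-identical for any arrival order."""
    total = 0
    for c in contributions:
        total += round(c * SCALE)
    return total / SCALE

import random
forces = [random.uniform(-1, 1) for _ in range(1000)]
shuffled = forces[:]
random.shuffle(shuffled)
print(accumulate_fixed(forces) == accumulate_fixed(shuffled))  # True
```

A plain floating-point `sum` over the two orderings can differ in the last bits; the fixed-point version cannot, which is the determinism property the text attributes to the accumulating memories.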

Not illustrated in FIG. 40 are similar pipelined parallel procedures for the distant force computations, which involve grid-based computations that make use of Fast Fourier Transform (FFT) techniques with multiple interchanges of data between nodes, and in some embodiments make use of the specialized FFT subsystem 3916 to implement the FFT computations.

Referring again to FIG. 39, the communication system supports packet-based message interchange between elements on the same or on different nodes. Packets can represent operations including read, write, accumulate, and read-modify-write, with the packet header specifying the operation and the packet body carrying operands (e.g., values to be written). A global address space is used for registers, memory locations, etc. in any node, with a 48-bit address space having 16 bits reserved for the node number and the remaining 32 bits for 4G address locations (e.g., bytes) at each node. Different address ranges at a node map to different ring stations on the node based on routing information stored in the ring stations, and at each ring station, different address ranges are mapped to different subsystems coupled to the ring station. Packets are variable length, having between one (i.e., a header only) and nine units of length (a header plus eight units of data), with each unit being, for example, 256 bits.
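The 48-bit global address layout can be sketched as follows. The split (16 node bits, 32 offset bits) is from the text; placing the node number in the high bits is our assumption for illustration:

```python
NODE_BITS, OFFSET_BITS = 16, 32  # 48-bit global address, per the text

def pack_address(node, offset):
    """Form a 48-bit global address: 16-bit node number in the high
    bits (an assumed layout), 32-bit per-node offset in the low bits."""
    assert 0 <= node < (1 << NODE_BITS)
    assert 0 <= offset < (1 << OFFSET_BITS)
    return (node << OFFSET_BITS) | offset

def unpack_address(addr):
    """Recover (node, offset) from a packed 48-bit address."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)

addr = pack_address(node=5, offset=0xDEAD)
print(hex(addr))             # 0x50000dead
print(unpack_address(addr))  # (5, 57005)
```

With this layout, a ring station can route on the high 16 bits alone, and the per-node address ranges that map to different subsystems are simply ranges of the low 32 bits.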

Each ring station is programmable with respect to routing, and is configured for a particular routing pattern before the simulation iterations are started. Each ring station includes a configuration table that is loaded before processing begins. In some examples, this table specifies a set of production rules that are applied to each incoming data packet. In general, the table in each ring station specifies how packets that arrive at that ring station should be handled. Each packet that comes in is passed on one (or more) of the other ports (which can send the data off to another node, or on its way to a module on the same node) according to the table, based on the source or destination address, and optionally based on a packet type. The configuration table can also specify an address translation function to be performed on the data before it is forwarded. Every ring station in this embodiment has two ports for communication on the ring, and two client ports, for example, for connecting to ring stations at other nodes or for connecting to computation elements at the node.

Typically, messages that are sent from client to client are not acknowledged. However, for some communication an acknowledgement may be generated by the destination element and used for synchronization purposes. This embodiment does not perform error correction (for example, by retransmission) if there are communication errors. Rather, the computation is restarted from a previously checkpointed time step. Checkpoints are generated periodically, for example, with a period ranging from approximately one second to one minute of real time.

One use of the multicast mechanism is for distribution of various types of data to nodes according to the type of data. For example, the processors that need copies of particle data from one processor are determined by the geometry of the import region. The ring stations of the communication system are configured such that the source processor for the particle data does not need to be aware of the communication interconnections across which the particle data must pass, or of where copies of the data are to be made. Rather, the source processor tags the data with a group id corresponding to the type of data, and the communication system distributes it to all processors that need to import that data according to the configuration of the communication system.

A multicast mechanism is provided for distributing position data, for example, in the implementation of the NT method discussed above. Multicast is implemented using a limited range of destination addresses, each corresponding to a different class of multicast communication, referred to as the multicast group id. A flag elsewhere in the header of a packet indicates that the message is a multicast rather than a unicast packet. When a multicast packet is received at a ring station, the multicast group id and the source node address, which is also in the header, are examined, and based on those values and on the address of the receiving node, the packet may be replicated and sent on one or more output ports of the ring station according to the ring station's routing tables. Therefore, a different pattern of multicast can be configured for each multicast group id. Note that acknowledgements are not required. The configuration of the multicast pattern is distributed among the ring stations, each of which has a limited view of the overall routing pattern.

Inside each client module (e.g., a computational subsystem), interface hardware monitors "writes" from the ring station and based on pre-specified writes generates signals (e.g., flags, event codes) that may initiate appropriate computations within the client, for example, as a result of polling the generated signals.

In addition to the toroidal-mesh interconnect network, the system can optionally have additional interconnect topologies for different patterns of communication. For instance, certain global quantities such as the temperature or pressure of the system may need to be computed. Global communication patterns may also be required to synchronize the nodes. Options for the additional topologies include a low-latency but not necessarily high-bandwidth tree structure. In the present implementation, routing protocols are layered on top of the mesh interconnect to provide a virtual topology of the required type for the different communication patterns.

Each subsystem at a node in general includes its own internal memories, which are separate from the memory 3954. The memory controller 3950 includes a cache (e.g., that uses fast static RAM), which for many simulation scenarios is sufficiently large to hold the data for all the particles being handled by the node. The memory 3954 includes a larger storage space, which may be DRAM, which can be used for larger problems where each node is responsible for more particles. The interchange of data between the SRAM cache and the DRAM is coordinated in a buffering arrangement in which data in one buffer is being transferred while data in another buffer in the SRAM is being used for a computation.
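The buffering arrangement can be illustrated with a minimal double-buffering loop. Here `load` and `compute` are hypothetical stand-ins for the DRAM-to-SRAM transfer and the computation that consumes the SRAM buffer; in hardware the two proceed concurrently, whereas this serial sketch shows only which buffer is filled while the other is in use:

```python
# Minimal double-buffering sketch; "load" and "compute" stand in for the
# DRAM-to-SRAM transfer and the computation using the SRAM buffer. In
# hardware the two proceed concurrently; this serial loop shows only the
# ordering of which buffer is filled while the other is consumed.
def process_tiles(dram_tiles, load, compute):
    results = []
    if not dram_tiles:
        return results
    buf = [None, None]
    buf[0] = load(dram_tiles[0])                        # prefetch first tile
    for i in range(len(dram_tiles)):
        if i + 1 < len(dram_tiles):
            buf[(i + 1) % 2] = load(dram_tiles[i + 1])  # fill the other buffer
        results.append(compute(buf[i % 2]))             # consume current buffer
    return results

vals = process_tiles([1, 2, 3], load=lambda t: t * 10, compute=lambda b: b + 1)
# vals == [11, 21, 31]
```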

For many simulated systems (e.g., those of dimensions roughly 128 Å cubed, or smaller), it is expected that the memory 3954 managed by the memory controller 3950 will not necessarily be used. In these cases, the primary purpose of the memory controller is the accumulation of force data. The cache of the memory controller is designed for high bandwidth accumulation of force data produced by the midrange or distant interactions in the midrange subsystems and the bond terms and corrections produced by the flexible subsystems. The memory subsystem of any specific node is responsible for accumulating force data for the particles owned by that node.

For large scale systems, the memory 3954 includes 1 to 2 GB of DRAM at each node. In modes that use this DRAM, the system is sub-partitioned, with each box partitioned into sub-boxes, and the algorithms described above are implemented on a sub-box by sub-box basis, with the required sub-boxes being pulled out of DRAM to the memory controller cache as necessary. Note also that such a sub-box mode can in fact be used on smaller systems, before data exceeds the capacity of the memory controller cache, and there may be cases where this yields a performance advantage.

The subsystems described above and shown in FIG. 39 cooperate in the evaluation of a force field, the parameters of which are provisioned at the beginning of any simulation run. At the beginning of a simulation, the parameters of the force field, together with the description of the system — the particles (e.g., atoms) involved, their connectivity (e.g., covalent bonds), their positions in space and their current velocities — are all known and partitioned appropriately among nodes of the system. Specifically, the particle data (positions and velocities) reside in the flexible subsystem of a particle's homebox. In addition, each bonded term has an "owner" node, responsible for its computation.

The flexible subsystem 3912 is the "choreographer" of each simulation timestep. While most of the remaining subsystems are responsible for computation of the force field, the flexible subsystem implements the integration algorithm, which computes accelerations from forces and advances velocities and positions from timestep to timestep. The integrator that is implemented for any particular type of simulation may require more than just simple integration: it may, for example, implement a constant temperature simulation, which generally requires the periodic calculation of temperature throughout the simulation. This in turn requires calculation of kinetic energy from velocity data, and the synthesis of a global scalar, the temperature, from this data. Such a computation, and others like it (pressure control, for example), are implemented by programming the flexible subsystem for these specific tasks; in this particular example, the flexible subsystems of all nodes must cooperate in a global reduction to compute this global scalar.
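As one sketch of such a global-scalar computation, a temperature calculation might look like the following, where the list of per-node kinetic energies stands in for the inter-node reduction and the unit Boltzmann constant (reduced units) is an assumption; all names are illustrative:

```python
# Sketch of the temperature calculation a constant-temperature integrator
# needs. The list of per-node kinetic energies stands in for the all-node
# global reduction; KB = 1.0 assumes reduced units. Names are illustrative.
KB = 1.0

def node_kinetic_energy(masses, velocities):
    """Kinetic energy of the particles held by one node."""
    return 0.5 * sum(m * sum(v * v for v in vel)
                     for m, vel in zip(masses, velocities))

def global_temperature(per_node_ke, total_dof):
    """Temperature from total kinetic energy and total degrees of freedom."""
    ke_total = sum(per_node_ke)  # stands in for the inter-node reduction
    return 2.0 * ke_total / (total_dof * KB)
```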

Before continuing with a detailed discussion of the overall computation timestep, or the architecture of the flexible subsystem, it is useful to consider the function and implementation of the midrange subsystem 3914.

The midrange subsystem 3914 can be thought of as a generic implementation of the NT method that supports a much broader range of algorithms and decompositions. The subsystem is designed to efficiently calculate all pairwise interactions among point objects, whether particles or grid points, locally within a specified cutoff radius. One of its chief tasks is the computation of midrange non-bonded interactions through full pairwise summation. It is also used in the real-space components of the distant force calculation, for charge spreading and force interpolation.

Referring to FIG. 41, operation of the midrange subsystem can be understood by considering a single particle-particle interaction module (PPIM) 4100. Generally, for computation of interactions between particles, during each iteration, memory in the PPIM is first loaded with particle data for one set of particles at the beginning of the iteration. Then, particles in another set are streamed past the PPIM, producing force contributions on the particles being streamed past. At the end of the iteration, force contributions for the particles that were resident in the PPIM are streamed out.

In some examples of computations performed by a PPIM, the PPIM operates in three phases. In the first phase (indicated by the circled numeral 1 in the figure), data for a first set of points, such as particles in the "tower" region for the NT method, are received over an input 4110 and stored in a memory 4130 of the PPIM. In a second phase, data for a second set of points, such as for particles in the "plate" region for the NT method, are received over input 4110 and passed to a computation element 4140. For each received particle, the computation element computes the total force (or other accumulation of pairwise computed quantities) on the particle from all stored particles, which were loaded in the first phase. This accumulated force is passed on an output 4122. In this second phase, there is one force sent on output 4122 for each particle received on input 4110, with the ordering preserved. During the second phase, forces on the particles stored in memory 4130 are also accumulated. In the third phase, these forces are sent on output 4122, with one force sent for each particle received on input 4110 in the first phase.
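The three phases can be modeled serially as follows. The function name and the toy 1-D "force" are illustrative only, and the hardware streams data rather than looping:

```python
# Serial model of the three PPIM phases; names and the toy 1-D "force"
# are illustrative, and the hardware streams data rather than looping.
def ppim_cycle(tower, plate, pair_force):
    resident = list(tower)                 # phase 1: load the first set
    resident_acc = [0.0] * len(resident)
    plate_out = []
    for p in plate:                        # phase 2: stream the second set
        total = 0.0
        for i, t in enumerate(resident):
            f = pair_force(t, p)
            total += f                     # force on the streamed particle
            resident_acc[i] -= f           # equal and opposite contribution
        plate_out.append(total)            # one output per input, order kept
    return plate_out, resident_acc         # phase 3: unload resident forces
```

Note that the model preserves the two properties the text emphasizes: one output per streamed input in the second phase, and one accumulated force per resident particle in the third.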

The PPIM includes storage for a number of different computations, for example three. In this way, different computations can be performed by the PPIM without reloading data, for example, allowing the first phase (data loading) of a second computation to be performed during the third phase (result unloading) of a first computation.

Referring to FIG. 42, the midrange subsystem implements a degree of parallelism by replicating the PPIM 4100 illustrated in FIG. 41 so that each PPIM performs only a subset of the required computations. In general, during the first phase, each PPIM receives only a portion of the first set of particle data, and in the second phase receives all the data in the second set. The outputs of all the PPIMs are then merged to yield the same result as would have been obtained by a single PPIM acting serially.

The midrange subsystem includes an array of PPIMs 4100, as well as memories on the input 4230 and output 4235 of the array. The PPIMs are arranged into multiple rows, each with multiple PPIMs. In the present implementation, each row has eight PPIMs and there are four rows, but other arrangements can also be used.

Considering first one row of PPIMs, each PPIM passes data on its input 4110 (see FIG. 41) to its output 4112 without modification so that in the second phase of computation, each PPIM in a row receives data for the same set of particles. During the first phase, the data for the particles in the first set are also passed from PPIM to PPIM, but each PPIM stores only a separate portion of the data. During the second phase, the output 4122 of each PPIM includes one accumulated force for each particle input on its input 4110. At each PPIM (after the first, left-most) the PPIM receives a partially accumulated force for each particle on its input 4220, and internally sums the input contribution for that particle with the contribution computed by its computation element 4140, so that the output 4122 again has one force for each particle on its input 4110. Therefore, each row of PPIMs functions as a single PPIM, but performs more computations in parallel and therefore achieves higher throughput.

During the third phase, each PPIM streams out the accumulated forces for each point in the first set that was loaded into that PPIM, with "downstream" PPIMs in the row passing that force data without modification, and with the overall order produced by the row matching the order in which the particles were loaded into the PPIM row during the first phase.

Considering the multiple rows of PPIMs, the data for the first set of particles is partitioned as it is read in so that each row has a separate portion of the first set. The data is passed through a set of distribution units 4210 to achieve this partitioning. The result is that each PPIM holds in its memory 4130 a different portion of the first set of points, with the PPIMs together holding all the points.

During the second phase, the distribution units 4210 replicate the particle data such that the same sequence of particles is sent to each row of PPIMs, and therefore also to each PPIM in that row. This results in each row of PPIMs producing a force for each particle in the second set, and these forces must be accumulated across the rows. This accumulation is performed by the accumulation units 4220.

During the third phase, the accumulation units 4220 concatenate the force data for the particles in the first set, so that at the output of the last accumulation unit 4220, the forces for the particles in the first set exit in the same order as the particle data was passed to the array in the first phase.

The midrange subsystem includes a front-end memory 4230, which receives the particle data multicast from the flexible subsystems. A controller (not shown) monitors the writes to this memory and coordinates the passing of data to the array when the required data is available. The subsystem includes a back-end memory 4235, which is used to stage the computed forces (or other data). Note that this data is sent to the memory subsystem without waiting for completion of the processing, so that some forces are being sent to the memory while other forces have not yet been computed.

Referring to FIG. 43, each PPIM implements a further level of parallelism. Each PPIM contains several matchmaking units (MUs) 4350. In the present implementation, each PPIM includes eight MUs. The particle data distributed to the PPIM during the first phase, described above as being stored in memory 4130 (see FIG. 41), is actually distributed to memories 4332 in the MUs. That is, the particles in the first set are partitioned among the MUs. During the second phase, all the particles received by the PPIM are streamed past each matchmaking unit. For each received particle in the second set, a computation element 4342 of an MU determines for which points stored in that MU's memory 4332 interactions should be computed, based primarily (but not exclusively) on the distance between the particles. The computation element 4342 passes the selected pairs to a concentrator 4360, which receives such selected pairs from each of the MUs, and acts essentially as an arbiter, passing the selected pairs in a serial manner to a particle-particle interaction pipeline (PPIP) 4344. The PPIP then computes either the force on each particle due to the other or the potential energy contribution from the pair. The PPIP can be configured to compute a rather general function of the particles, based on their identities and inter-particle distance, calculated by table lookup and interpolation. For example, the PPIP is implemented or preconfigured to support a number of different functional forms, with data loaded into the PPIP determining the specific function within that functional form. The PPIP accumulates the forces on each particle in the second set from the different stored particles, and passes the accumulated forces on output 4122 while storing the forces on the stored particles in a memory 4334. In the third phase, these stored forces in memory 4334 are streamed out on output 4122.
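The matchmaking and concentration steps might be modeled as below. The serial loop stands in for the hardware arbitration, and all names are illustrative:

```python
import math

# Sketch of matchmaking: each MU filters its stored points against a
# streamed particle by cutoff distance, and a concentrator serializes the
# surviving pairs toward the interaction pipeline. Names are illustrative,
# and the serial loop stands in for the hardware arbitration.
def mu_select(stored, streamed, cutoff):
    return [(s, streamed) for s in stored
            if math.dist(s, streamed) <= cutoff]

def concentrator(mus, streamed, cutoff):
    pairs = []
    for stored in mus:                     # visit each MU's selected pairs
        pairs.extend(mu_select(stored, streamed, cutoff))
    return pairs
```

Filtering pairs by distance before they reach the pipeline means the expensive interaction function is only evaluated for pairs that can actually interact.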

In some implementations of the PPIP, the functions that can be computed by the PPIP are represented as piecewise analytic functions. That is, within each of a set of ranges of argument value, a parametric analytic function is used to determine the value of the function. A data structure stores the parameters of the analytic functions for the corresponding ranges of argument values. A variable precision approach is used such that different sizes of ranges of argument value can correspond to different parameters of the analytic function, for example, giving higher precision or greater accuracy in some ranges of argument value than in others. The selection of coefficients is performed in two stages. First, the argument range is divided into N coarse ranges, each of which can have a different size. Each of the coarse ranges is divided into the same number n of subranges, such that within each coarse range the subranges have the same size. For a particular input argument, first the coarse range is determined, and then the subrange within that coarse range is determined in order to look up the appropriate parameters of the analytic function. Therefore, there are N times n sets of coefficients. Given a limited number of sets of coefficients that can be used, the accuracy of the function relative to an ideal function can be chosen according to the argument range. An example of such an ideal function is erfc(1/R), and an example of the analytic functions is quadratic functions.
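The two-stage coefficient selection can be sketched as follows, with invented coarse-range edges and quadratic coefficients (a, b, c) standing in for the loaded table data:

```python
import bisect

# Two-stage coefficient lookup (a sketch): N coarse ranges of varying size,
# each split into n equal subranges; each subrange holds quadratic
# coefficients (a, b, c) for f(x) ~ a + b*x + c*x*x. The table data here
# is invented; real tables would be fit to the ideal function.
def lookup_eval(coarse_edges, n_sub, coeffs, x):
    i = bisect.bisect_right(coarse_edges, x) - 1           # coarse range
    lo, hi = coarse_edges[i], coarse_edges[i + 1]
    j = min(int((x - lo) / (hi - lo) * n_sub), n_sub - 1)  # equal subrange
    a, b, c = coeffs[i * n_sub + j]
    return a + b * x + c * x * x

# With every entry set to f(x) = x**2 the lookup reproduces x**2 exactly.
edges = [0.0, 1.0, 4.0]         # two coarse ranges of different size
coeffs = [(0.0, 0.0, 1.0)] * 4  # N * n = 2 * 2 coefficient sets
# lookup_eval(edges, 2, coeffs, 3.0) == 9.0
```

Because the subranges within a coarse range are equal in size, the second stage is a shift-and-mask style index computation rather than a search, which suits a hardware pipeline.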

The PPIM array of the midrange subsystem is also used for the charge spreading and force interpolation computations of the GSE methods described earlier in this document. For both charge spreading and force interpolation (the first and last thirds of the distant computation), particles are distributed to the PPIMs in a first phase, and the grid locations are streamed past the PPIMs in a second phase. Note that grid locations for the homeboxes of neighboring nodes are streamed, as well as grid locations of the current node's homebox. Note that for this computation using an NT method, a "full plate" is needed because there is an ordering to the particle interaction, so two equal and opposite forces are not computed at one time. During charge spreading, no data needs to be accumulated by the PPIMs and streamed out in the third phase, while in force interpolation, the PPIMs do not need to provide outputs during the second phase, with the interpolated forces being accumulated during the second phase and streamed out of the PPIMs in the third phase.

The flexible subsystem 3912 also implements parallelism within the subsystem. Referring to FIG. 44, groups 4410 of computational elements function relatively independently in parallel. Not only is each particle assigned to a particular node within the system (i.e., the node associated with the homebox holding the particle), but each particle is also assigned to a particular group 4410 within the flexible subsystem of that node. In the present implementation, there are four groups 4410 of computation elements.

Each group 4410 includes a processor 4414. Each processor has data and instruction caches (32KB) 4412 that are backed by the memory system accessed over the communication ring. In many applications the caches 4412 are sufficiently large to hold all the data needed for an iteration. Each processor has a remote access unit (RAU) 4418, which implements functions including direct memory access (DMA) functions for transferring data from the memory over the communication ring to a memory 4416, which is accessible to the processor 4414. The particle data that is the responsibility of a processor is held in memory 4416, and the RAU 4418 retrieves force data for those particles from the memory subsystem using a DMA approach without loading the processor 4414. The RAU is coupled to a ring station through a communication interface, which aggregates outputs from the flexible subsystem and distributes inputs to the appropriate elements (e.g., based on the addresses of received packets).

Each processor has a number of slave special-purpose co-processors 4420, referred to below as "geometry cores" (GCs). In the present implementation, there are two GCs per processor. Each GC has a four-way SIMD architecture with a dual-issue VLIW pipelined ALU, and includes "geometric" primitives including computation of inner ("dot") products. Each GC uses a vector/scalar register architecture in which registers can be addressed in an interleaved manner as either n vector registers or as 4n scalar registers, and in which scalar results can be placed directly in a portion of a register where needed for subsequent computations. There are inbound queues for the GCs that are fed by their processor, and outputs from the GCs can be fed back to the memory system without requiring handling by the processor.

The GCs are used to compute bonded force terms. Each bond is assigned to a particular GC, and the processors exchange particle position data so that each processor has the data needed for its GCs to compute bonded force contributions. Note that a processor that computes a bonded force contribution may not be responsible for any of the particles in that bonded pair. Periodically (e.g., every 2000 time steps) a reassignment of bonds to processors and GCs may be performed as an optimization that may reduce communication requirements.

In a computation iteration, particle position data is distributed among the memories 4416 of the processors of the flexible subsystems of the nodes of the overall system. From there, the particle locations are sent to the midrange subsystems for computation of the pairwise interactions, and to other processors for computation of bonded terms.

The flexible subsystem 3912 also includes a correction processor 4430, which is used to add correction terms to values computed by the GCs. The correction processor implements an example of a general computational approach. In its general form, if one computation is to be performed on a large part A of a data set and a second computation is to be performed on a small part b of the data set, where A and b form a partition of the complete data set, and the results of the computations are to be accumulated (or more generally reduced) over the data set, then it may be efficient to perform and accumulate the first computation over the entire data set, and then for the small part b recompute and subtract the result of the first computation and compute and add the result of the second computation. For example, by not having to make exceptions for elements of the data set in part b, the first computation may be made efficient by regular (i.e., condition or exception free) iteration. In the context of particle interactions, the entire data set (i.e., A plus b) can correspond to a set of particle pairs that are within an interaction radius, the first computation can be an unbonded interaction, for example, as computed in the PPIM array, and the second computation can be a bonded interaction as computed in the GCs. So rather than having to exclude bonded pairs of particles that are within the interaction radius from force computations in the PPIM array, when a bonded interaction between particles is computed by a GC, the correction processor 4430 computes the negative of the unbonded term that is computed in the PPIM for that pair of particles, and adds it to the bonded term before it is sent to the memory where it is accumulated. In some implementations, the correction processor includes the same computational structure as a PPIP 4344, further ensuring that exactly the same unbonded term is computed for subtraction from the total.
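The compute-then-correct pattern can be illustrated with toy scalar "interaction" functions (the real computations act on particle pairs and vector forces, and the names here are invented):

```python
# Sketch of the compute-then-correct pattern: run the regular (unbonded)
# computation over all pairs with no exclusions, then for each pair in the
# small bonded subset add the bonded term and subtract the unbonded term
# that was (deliberately) computed for it. Toy scalar kernels stand in for
# the real pairwise force computations.
def total_force_terms(pairs, bonded_pairs, unbonded, bonded):
    total = sum(unbonded(p) for p in pairs)   # exception-free main loop
    for p in bonded_pairs:                    # small part b of the data set
        total += bonded(p) - unbonded(p)      # correction term
    return total
```

The main loop stays free of per-element conditionals, which is exactly what makes it amenable to regular, pipelined hardware; the corrections touch only the small subset b.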

There are generally three major computations that are hosted in the flexible subsystem: bonded force terms; constraint terms and integration (performed without the constraints); and finally updated positions accounting for the constraints. During the computation of distant forces, the processors and the GCs also perform the FFT calculations in versions of the system that do not include a dedicated FFT subsystem.

In versions of the system that use the flexible subsystem to perform the FFT computations, the distant forces are computed by first distributing charges to the grid locations using the midrange subsystem. These distributed charges are returned to the flexible subsystem, which implements a distributed FFT computation for the grid spanning the entire simulation volume. The FFT is distributed among the processors (and their associated GCs) and requires multiple phases of exchange of intermediate results between the nodes. At the end of the FFT iteration, the transformed data, which represents potentials at grid locations, is passed back to the midrange subsystem, where it is used to compute interpolated forces on the particles. These interpolated forces are then sent to the memory subsystem, where the distant force terms are combined with the other force terms computed for each particle.

In the present system, each node is implemented using an application specific integrated circuit (ASIC) that includes all the computation and communication subsystems shown in FIG. 39, with the exception of the memory 3954 and remote ring stations 3930. Multiple ASICs are arranged on each of multiple circuit boards that are interconnected through a backplane. The internode communication links 3730 are therefore either local to a circuit board or are routed through the backplane.

11 Applications

The algorithmic techniques, or the hardware or software techniques, described above may be applicable to a variety of applications, some of which are related to biomolecular simulation while others are not. The applications may include, for example, the following:

Biomolecular simulation applications related to the use of molecular dynamics, Monte Carlo simulation, or other sampling techniques can include, without limitation: ab initio protein structure prediction; refinement of homology models of protein structure; modeling of protein folding kinetics; computation of protein-ligand binding free energies; determining whether a pair of proteins will interact; determining their binding site and their binding free energy; toxicology studies; computation of hydration free energies; protein design; modeling allosteric changes and protein function; simulation of membrane proteins; simulation of nucleic acids; simulation of carbohydrates; simulation of combinations of the above and interactions between them; predicting membrane penetration, including blood-brain barrier penetration; molecular force field development; and simulations of crystal structures.

Applications may also involve other fields, such as material science applications, weather simulations, hydrodynamic simulations, equation of state simulations, astrophysics simulations, etc.

12 Alternatives

A variety of arrangements of distributed processing nodes can alternatively be used with the computation decompositions described above.

It is also not necessary that there be a one-to-one correspondence between processing nodes and homeboxes. For example, each node may be responsible for a set of homeboxes, such as an n×n×n cube of homeboxes. One use of such a finer-grain definition of homeboxes is in multiple-zone scheduling approaches in which the schedule of zone-to-zone interactions can be used to avoid a need to test for duplicate computations or can be used to reduce the amount of communication through improved rounding. Another use of such finer-grain homeboxes is in serializing the computation involved with each pair of interacting zones to make better use of local memory, such as local cache memory, at each of the processing nodes. Aspects of memory utilization are discussed further below in the context of serial implementations.

In one alternative version of a parallel implementation, each node maintains data for more particles than those that are located within its homebox. A "clone buffer" approach is used in conjunction with a midpoint assignment rule approach. In particular, each node stores particle data from a region that extends a distance R/2 beyond the boundaries of its homebox. After each computation phase, the forces for all the particles in the extended region are sent to that node, allowing it to redundantly compute the new locations of the particles. During the computation phase, further importing of particle data is not necessary, for example, using the midpoint assignment rule or the pseudo-random assignment rule, because the extended region includes all the particles needed.
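A 1-D sketch of the clone-buffer and midpoint tests follows; the geometry and names are invented for illustration, assuming an axis-aligned homebox:

```python
# 1-D sketch of the clone-buffer and midpoint tests, assuming an
# axis-aligned homebox [box_lo, box_hi); names and geometry are invented.
def in_clone_region(x, box_lo, box_hi, R):
    """A node stores particles within R/2 beyond its homebox boundaries."""
    return box_lo - R / 2 <= x <= box_hi + R / 2

def midpoint_owner(x1, x2, box_lo, box_hi):
    """Midpoint rule: the node whose homebox contains the pair midpoint
    computes the interaction."""
    m = (x1 + x2) / 2
    return box_lo <= m < box_hi

# A particle at x = -0.4 lies in the clone region of box [0, 10) for R = 1.0,
# but the midpoint of the pair (-0.4, 0.2) falls outside the box, so a
# neighboring node owns that interaction.
```

The R/2 extension is what makes the midpoint rule self-sufficient: for any pair separated by at most R whose midpoint lies in the homebox, both particles lie within R/2 of the homebox and are therefore already resident.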

Two papers are attached as Appendix B and Appendix C to this description, and provide further information related to the examples described above and/or additional examples. Appendix B is titled "The Midpoint Method for Parallelization of Particle Simulations," and Appendix C is titled "Zonal Methods for the Parallel Execution of Range-Limited n-Body Simulations."

The approaches described above can be implemented in software, in hardware, or using a combination of hardware and software. For example, alternative versions of the system can be implemented in software, in firmware, in digital electronic circuitry, in computer hardware, in other modes of implementation, or in combinations of such modes. The system can include a computer program product tangibly embodied in a machine-readable storage device, such as on a magnetic or an optical disc or in a semiconductor memory, for execution by a programmable processor or by a system that includes multiple programmable processors. The programmable processors can be general-purpose processors, or can be special-purpose processors, for example, designed for computations of the types described above. Parallel implementation can, for example, use special-purpose interconnection networks, or can use more general purpose data networking approaches. Method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. The system can be implemented in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language if desired. Such language can be either a compiled or an interpreted language. Suitable processors include, by way of example, either (or both) general and special purpose microprocessors. Generally, a processor will receive instructions and data from read-only memory and/or random access memory. Generally, a computer will include one or more mass storage devices for storing data files. Such devices include magnetic disks such as internal hard disks and removable disks, magneto-optical disks, or optical disks. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be incorporated in or supplemented by application-specific hardware, such as application-specific integrated circuits (ASICs).

Other embodiments are also within the scope of the appended claims.

Appendix A

Without being limited by the theory and derivations below, aspects of the approaches described above may be better understood in view of the following. In particular, statements made below that certain properties are required or must be true, for example, in view of the derivations should not be construed to necessarily apply to all versions or embodiments.

1 Derivation of a Bound on Import Volume

Approximate lower bounds on the import volume of multiple zone methods with rounding are derived under the assumption of a large number of uniformly distributed independent particles. For specificity, spherical interaction regions and commutative interactions are considered. This derivation can, though, be generalized to non-spherical interaction regions, non-commutative interactions, and small numbers of particles.

Consider a randomly chosen particle. On average, this particle will interact with

N_ipp = V_ir D

other particles ("interaction partners per particle"). Above,

V_ir = (4/3) π R³

is the volume of the interaction region of a single particle, V_gc is the volume of the global cell, D = N / V_gc is the density of particles (N being the total number of particles), and R is the interaction radius. As the number of particles in the simulation is known, the reduced density, D' = (N − 1) / V_gc, removes the particle self-interaction from N_ipp. Consistent with the assumption of a large number of particles, the difference between the density and the reduced density can be neglected. It is straightforward, though, to modify most of the below to take into account the reduced density.

Multiplying by half the total number of particles, N/2, yields the average number of near interactions in the system (the half takes into account commutativity):

N_i = (N/2) V_ir D
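The interaction-partner estimate can be illustrated numerically under these definitions (a sketch for illustration, not part of the derivation; names are invented):

```python
import math

# Numeric illustration of the expected number of interaction partners per
# particle, N_ipp = V_ir * D, with V_ir = (4/3) * pi * R**3 and D = N / V_gc.
def interaction_partners(n_particles, v_global_cell, radius):
    v_ir = 4.0 / 3.0 * math.pi * radius ** 3
    return v_ir * (n_particles / v_global_cell)

# At unit density (1000 particles in volume 1000) and R = 1, each particle
# has (4/3) * pi, roughly 4.19, expected interaction partners.
```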

None of the methods considered in this paper affect the number of non-redundant near interactions. Thus, any valid method must consider at least this many pairs. Essentially, N_i is an invariant quantity across any near interaction parallelization scheme.

N_i can be broken into two components: the average number of near interactions between particles on the same homebox ("local interactions") and the average number of near interactions between particles on different homeboxes ("remote interactions").

Consider a randomly chosen particle again. After the particle has been selected and its position within the homebox determined, the expected number of local interaction partners is the expected number of particles in the portion of the particle's interaction region that is contained within the particle's homebox. This can be succinctly expressed as

N_ipp,local(r') = D ∫_homebox Θ( |r − r'| ≤ R ) dr

where r' is the coordinate of the particle and Θ is the indicator function (equal to 1 when its condition holds and 0 otherwise).

Since the particle could be located anywhere within its homebox, before the particle position within the homebox is determined, the expected local number of interaction partners for the particle is the above quantity times the probability of finding a particle at r', integrated over the homebox. For a uniform particle distribution the relevant probability density is 1/V_h, where V_h is the volume of the homebox:

N_ipp,local = D V_ir,local,  with  V_ir,local = (1/V_h) ∫_homebox ∫_homebox Θ( |r − r'| ≤ R ) dr dr'

V_ir,local is the expected volume of the portion of a randomly chosen particle's interaction region that is contained within the particle's homebox.

Multiplying N_ipp,local by half the number of particles in the simulation yields N_i,local:

N_i,local = (N/2) D V_ir,local

In any method that handles local interactions without communication (which includes all methods considered in this paper), N_i,local is also an invariant quantity. Trivially then, N_i,remote = N_i − N_i,local = (N/2) D V_ir,remote, where V_ir,remote = V_ir − V_ir,local, is also invariant.

In the multiple zone method considered in this paper, a special-case homebox-homebox zone interaction handles all local interactions. Thus, remote interactions are processed exclusively by the remaining zone-zone interactions. A multiple zone method tests

    N_l = (V_gc/V_h) ρ² Σ_{1≤i<j≤N_z} s_ij V_i V_j

remote pairs for having a near interaction. The pre-factor, V_gc/V_h, is the number of processors. The zones have been enumerated from 1 to N_z, where N_z is the number of zones, V_i is the volume of zone i, and zone 1 is the homebox. The schedule indicator s_ij is 1 if the two zones are interacted and 0 if not. A valid multiple zone method must test at least as many remote pairs as there are remote interactions:

    N_l ≥ N_i,remote

The overall volume of the import region is:

    V_import = Σ_{i=2}^{N_z} V_i

To optimize the communication bandwidth required for particle interactions, this volume should be minimized. However, there are limits to how much V_import can be reduced; below a certain import volume, it would be impossible for the method to test enough remote pairs. To see this, let V_import,opt be the optimal import volume for a multiple zone method given the global cell volume, homebox dimensions, interaction radius and number of zones. In order to get the maximum number of interactions out of this import volume, the zone interaction schedule would ideally be dense. For this ideal schedule, the number of remote pairs considered is:

    N_l,opt = (V_gc/V_h) ρ² [ V_h Σ_{i=2}^{N_z} V_i,opt + Σ_{2≤i<j≤N_z} V_i,opt V_j,opt ]

V_i,opt is the volume of zone i in this idealized method. Above, zone-zone interactions involving the homebox zone have been separated from the double sum.

Given the symmetry of V_import,opt and N_l,opt with respect to import zone number (namely, the remote import zones can be renumbered without affecting V_import,opt and N_l,opt), the import zones must have equal volume when the import volume is at the absolute minimum:

    V_2,opt = V_3,opt = … = V_{N_z},opt = V_import,opt / (N_z − 1)

The number of remote zones, N_zr = N_z − 1, is introduced to simplify the following equations. With this optimal set of zones:

    N_l,opt = (V_gc/V_h) ρ² [ V_h V_import,opt + ((N_zr − 1)/(2 N_zr)) V_import,opt² ]

At a minimum, this optimal method must consider at least N_i,remote pairs. Or, after dividing through by the common factor (V_gc/V_h) ρ²:

    V_h V_import,opt + ((N_zr − 1)/(2 N_zr)) V_import,opt² ≥ (V_h/2) V_ir,remote

In the case with multiple remote import zones (N_zr > 1):

    ((N_zr − 1)/(2 N_zr)) V_import,opt² + V_h V_import,opt − (V_h/2) V_ir,remote ≥ 0

At equality, this is a simple quadratic equation with two solutions:

    V_import,opt = (N_zr V_h/(N_zr − 1)) [ −1 ± sqrt(1 + ((N_zr − 1)/N_zr)(V_ir,remote/V_h)) ]

The roots are always real and one root is always negative. The negative root is not physical. Keeping the positive root and noting that equality occurs at the minimum and that a non-optimal method would import even more volume:

    V_import ≥ (N_zr V_h/(N_zr − 1)) [ sqrt(1 + ((N_zr − 1)/N_zr)(V_ir,remote/V_h)) − 1 ]

Simplifying:

    V_import ≥ V_ir,remote / [ 1 + sqrt(1 + ((N_zr − 1)/N_zr)(V_ir,remote/V_h)) ]

In the case with N_zr = 1, the quadratic term vanishes and the bound reduces to:

    V_import ≥ V_ir,remote / 2
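As a numerical sanity check (our own sketch; the function name is ours), the bound just derived can be evaluated directly. For N_zr = 1 and in the low-parallelism regime it reduces to half the remote interaction volume, and in the highly parallel regime the ratio between the N_zr = 2 bound and a very fine schedule approaches sqrt(2):

```python
def import_lower_bound(v_h, v_ir_remote, n_zr):
    """Lower bound on import volume for a multiple zone method, assuming
    equal-volume remote zones and a dense schedule as in the derivation.
    v_h: homebox volume; v_ir_remote: remote interaction volume;
    n_zr: number of remote import zones."""
    if n_zr == 1:
        return v_ir_remote / 2.0
    c = (n_zr - 1) / n_zr
    # positive root of c/2 * V^2 + v_h * V - v_h * v_ir_remote / 2 = 0
    return (v_h / c) * ((1.0 + c * v_ir_remote / v_h) ** 0.5 - 1.0)
```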

2 Approximation of finite homebox volume effects

V_ir,remote is a non-analytic integral and the above derivation only provides limited insight into how to design methods that approach the lower bound. Further, to achieve equality with the lower bound, every pair a method tests must be a near interaction, the zones must be of equal volume and the schedule must be dense. This is particularly unrealistic when the homebox dimensions are much larger than the interaction radius. In this regime, a particle far from the boundary of the homebox will not interact with any imported particle, and particles near the boundary will interact with only a small fraction of the imported particles. While these effects could be tediously incorporated to tighten the lower bound, limiting cases of the bound are highly suggestive of an approximate lower bound that approximately captures finite homebox volume effects while providing insight into how to design methods over a wide range of parameters.

If the interaction radius is much larger than the distance between opposite corners of a homebox, V_ir,remote is exactly the volume of the interaction region minus the volume of the homebox. This is approximately and asymptotically equal to the volume of the remote portion of the homebox interaction neighborhood. At the same time, when there is only one import zone, by inspection, the HS method is optimal and the import volume of the HS method is exactly half the volume of the remote portion of the interaction neighborhood. This suggests that replacing V_ir,remote in the bounds by the volume of the remote portion of the interaction neighborhood is appropriate. This replacement is exact when there is one remote import zone for any interaction radius, and asymptotically exact for any number of remote zones in the limit of large interaction radius. Note that this replacement is equivalent to changing the derivation of the lower bound to say that the number of remote pairs a multiple zone method could interact is greater than or equal to the number of remote pairs that the HS method could interact. Under this approximation, the effective remote interaction volume is:

    V_ir,remote,eff = (b_x + 2R)(b_y + 2R)(b_z + 2R) − b_x b_y b_z

At this point, it is convenient to introduce the normalized homebox aspect ratios α_x = b_x/(b_x b_y b_z)^(1/3), α_y = b_y/(b_x b_y b_z)^(1/3), α_z = b_z/(b_x b_y b_z)^(1/3), and the parallelization parameter α_R = R/(b_x b_y b_z)^(1/3). In terms of these variables, the effective remote interaction region can be written as:

    V_ir,remote,eff = V_h [ 8 α_R³ + 4 α_R² (α_x + α_y + α_z) + 2 α_R (α_x α_y + α_y α_z + α_z α_x) ]

Above, the fact that α_x α_y α_z = 1 has been used. The approximate lower bound is thus:

    V_import ≥ (N_zr V_h/(N_zr − 1)) [ sqrt(1 + ((N_zr − 1)/N_zr)(V_ir,remote,eff/V_h)) − 1 ]    (N_zr > 1)

    V_import ≥ V_ir,remote,eff / 2    (N_zr = 1)
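The equivalence between the direct form of the effective remote interaction volume (the grown bounding box minus the homebox) and its expression in normalized variables is easy to verify numerically. The following sketch is illustrative only; the function names are ours:

```python
def v_eff_direct(bx, by, bz, R):
    # remote portion of the homebox interaction neighborhood:
    # box grown by 2R per dimension, minus the homebox itself
    return (bx + 2 * R) * (by + 2 * R) * (bz + 2 * R) - bx * by * bz

def v_eff_normalized(bx, by, bz, R):
    # same volume via normalized side lengths a_i = b_i / (bx*by*bz)^(1/3)
    # and parallelization parameter a_R = R / (bx*by*bz)^(1/3);
    # relies on ax * ay * az = 1
    vh = bx * by * bz
    g = vh ** (1.0 / 3.0)
    ax, ay, az, aR = bx / g, by / g, bz / g, R / g
    return vh * (8 * aR ** 3 + 4 * aR ** 2 * (ax + ay + az)
                 + 2 * aR * (ax * ay + ay * az + az * ax))
```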

To give a sense of the relationship between the parallelization parameter and machine size, for a small molecular dynamics simulation of biological systems with 25,000 atoms and an interaction radius of 12 Å, p ≈ 145 α_R³. Thus, α_R < 1 corresponds to modest clusters (less than roughly 145 processors, for example), 1 < α_R < 2 corresponds to large clusters (144 to 1152 processors), and 6 < α_R < 8 corresponds to ultra-parallel hardware like BlueGene or QCDOC.
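The mapping between α_R and processor count follows from α_R³ = R³/V_h and V_h = V_gc/p. A small illustrative calculator (our own sketch; the atom density of roughly 0.1 atoms/ų is a typical value we assume, not stated in the text):

```python
def processors_for_alpha(alpha_r, n_atoms=25_000, radius=12.0, density=0.1):
    """Number of processors p implied by a parallelization parameter alpha_R,
    using alpha_R^3 = R^3 / V_h and V_h = V_gc / p. density is atoms per
    cubic angstrom (an illustrative assumption)."""
    v_gc = n_atoms / density          # global cell volume in cubic angstroms
    return (v_gc / radius ** 3) * alpha_r ** 3
```

For these assumed parameters, α_R = 1 indeed corresponds to roughly 145 processors, and doubling α_R multiplies the processor count by eight.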

3 Limiting Cases

3.1 Massive parallelism limit

In the massively parallel limit, α_R → ∞, the cubic term in the square root of the approximate bound dominates:

    V_import ≳ sqrt(8 N_zr/(N_zr − 1)) α_R^(3/2) V_h

Substituting in the form of α_R and noting that V_h = V_gc/p, where p is the number of processors, for fixed V_gc:

    V_import ≳ sqrt(8 N_zr/(N_zr − 1)) R^(3/2) (V_gc/p)^(1/2) ∝ R^(3/2) p^(−1/2)

In this limit, the bound is exact and the equation suggests that using more zones can improve import bandwidth. The "foam" method discussed later approaches this limit asymptotically for large numbers of zones.

3.2 Low parallelism limit

In the low parallelism limit, α_R → 0. In this limit, the square root can be Taylor expanded about the first term and simplified to give:

    V_import ≳ V_ir,remote,eff / 2

The import volume is independent of the number of imported zones. However, in this limit, the inequality is approximate when N_zr > 1; the eighth-shell method described later has 7 import zones and very slightly beats the approximate bound.

3.3 Fine grained scheduling

In the massively parallel case for N_zr > 1, the import lower bound is proportional to:

    sqrt(N_zr/(N_zr − 1))

When N_zr = 2, this factor is ~1.41. In the infinitely fine schedule limit, this factor is 1. Thus, at most, using fine grained schedules could give a ~29% improvement over simple methods in the massively parallel case. However, to achieve this requires a dense zone interaction schedule such that every remote pair considered has a near interaction. Geometrically this is unrealistic. However, the existence of the eighth-shell method and the additional rounding opportunities demonstrated for the multiple zone formulation of the NT method suggest that mild gains in import bandwidth are achievable by using more zones. Further, finer grained scheduling gives more opportunities for overlapping communications with computation.
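The diminishing returns of finer schedules are visible directly in this factor. A one-line sketch (name ours):

```python
def schedule_factor(n_zr):
    # factor multiplying the massively parallel import bound
    # for a method with n_zr remote zones
    return (n_zr / (n_zr - 1)) ** 0.5
```

The factor falls from ~1.41 at N_zr = 2 toward 1 as the schedule becomes arbitrarily fine, which is the source of the ~29% ceiling quoted above.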

Appendix B

Zonal Methods for the Parallel Execution of Range-Limited n-Body Simulations

Kevin J. Bowers,1 Ron O. Dror,1 and David E. Shaw1,2,3

Abstract: Particle simulations in fields ranging from biochemistry to astrophysics require the evaluation of interactions between all pairs of particles separated by less than some fixed interaction radius. The applicability of such simulations is often limited by the time required for calculation, but the use of massive parallelism to accelerate these computations is typically limited by inter-processor communication requirements. Recently, Snir [1] and Shaw [2] independently introduced two distinct methods that offer asymptotic reductions in the amount of data transferred between processors. In the present paper, we show that these schemes represent special cases of a more general class of methods, and introduce several new members of this class of algorithms that offer practical advantages over all previously described methods for a wide range of problem parameters. We also show that several of these algorithms approach an approximate lower bound on inter-processor data transfer.

Key words and phrases: Molecular simulation, molecular dynamics, parallel computing, n-body problem, pairwise particle interactions

1 D. E. Shaw Research, LLC, 39th Floor, Tower 45, 120 West Forty-Fifth Street, New York, NY 10036

2 Center for Computational Biology and Bioinformatics, Columbia University

3 E-mail correspondence: david@deshaw.com

1. Introduction

Simulations in many fields require the explicit evaluation of interactions between all pairs of particles separated by less than some interaction radius R (near interactions). Examples of such range-limited n-body problems include molecular dynamics simulations in biochemistry and materials science, gravitational simulations in astrophysics, particle simulations in plasma physics, and smooth particle hydrodynamic simulations in fluid dynamics [3-11]. The remaining pairwise interactions are either neglected or approximated using one of several less expensive methods [3, 12-17]. The computation of near interactions is often the dominant cost in such simulations, so efficient methods for parallelizing this computation are critical to accelerating these simulations. Although near interactions can be evaluated in parallel, the communication required to bring particle pairs together on the same processor can severely limit scalability.

Traditional methods for parallelizing range-limited n-body problems are described in several review papers [18, 19]. They include atom, force, and spatial decomposition methods. Spatial decomposition methods offer an advantage over atom and force decomposition methods in that the amount of data to be transferred into and out of each processor (the method's communication bandwidth) decreases as the interaction radius decreases. Force decomposition methods, on the other hand, offer an advantage over spatial and atom decomposition methods in that the communication bandwidth decreases as the number of processors increases.

Two independently developed and recently published methods — Snir's hybrid method (referred to here as the SH method) [1] and Shaw's NT (for "Neutral Territory") method [2] — combine the advantages of traditional spatial and force decomposition methods. These newer algorithms have the property that the communication bandwidth decreases both as the interaction radius decreases and as the number of processors increases. Table I compares the asymptotic scaling properties of the SH and NT methods with those of traditional methods. The SH and NT methods have the same scaling properties, which are superior to those of traditional methods.

It has been shown [2] that both the SH and NT methods offer significant advantages over traditional methods for practical machine and simulation sizes. Furthermore, the NT method always requires less communication than the SH method, but the SH method can be optimized using ideas from the NT method such that the difference is small. The NT-optimized version of the SH method is referred to as the SNT method [2].

The volume of space from which a given processor "imports" data that ordinarily resides within other processors (its import region) is significantly different for the NT and SH methods. The two methods, however, do share several essential elements. In both the NT method and the SH method, each processor assumes primary responsibility for all particles falling within some rectangular box. In both methods, the processor that computes the interaction between a pair of particles is typically not the one on which either particle resides, but a third processor that imports both of these particles. In both methods, each processor interacts particles residing in one region of space with particles residing in another region, after importing some of the particles in each region from other processors. This suggests that the two methods may be related and raises the question of whether other new methods exhibiting similar properties might have advantages over both methods for various choices of problem parameters.

In this paper, we describe a new class of parallelization methods, which we refer to as zonal methods. This class includes as special cases the NT method, the SH method, and the SNT method, along with traditional spatial decomposition methods and a number of novel methods introduced in this paper. We refer to a parallelization method for range-limited n-body problems as a generalized spatial decomposition method if each processor assumes responsibility for updating the positions of particles in a distinct region of space, regardless of where the particle-particle interactions are computed. Generalized spatial decomposition methods can be divided into home territory methods, in which a pair of particles always interacts on the processor on which one of them resides, and neutral territory methods, in which a pair of particles may interact on a processor in which neither of them resides. All zonal methods are generalized spatial decomposition methods, but some (for example, traditional spatial decomposition methods) are home territory methods, while others (including the NT, SH, and SNT methods, as well as the new methods introduced in this paper) are neutral territory methods.

Which parallelization method minimizes inter-processor communication bandwidth depends on the properties of both the computation to be performed (e.g., the interaction radius and the dimensions of the physical system) and the hardware available for the simulation (e.g., the number of nodes and the network topology). In this paper, we analyze the communication requirements of a variety of zonal methods. We describe new methods that require less communication bandwidth than any previously described method for a wide range of such parameter values, including those associated with most practical systems. In some cases, these new methods represent minor variations on the NT and SNT methods; in others, they constitute entirely novel methods.

Section 2 reviews the NT and SH methods, as well as a traditional spatial decomposition method that we call the HS method. Sections 3 and 4 describe the general class of zonal methods, and introduce certain tools that prove useful for their analysis. In Section 3, we assume that each processor will interact particles from a single spatial region with particles from another single spatial region. We describe the pairs of regions one can choose to ensure that all near interactions are computed. We also present a systematic process to take advantage of the fact that each particle interacts with other particles that reside within a surrounding sphere, rather than a cube; we refer to this process as rounding. The HS, NT, and SNT methods already exploit a form of rounding, but some methods support a stronger form that we refer to as three-dimensional rounding.

Section 4 generalizes the ideas of Section 3 by considering k-zone methods that specify a set of spatial regions along with a schedule that determines the order in which each processor must compute interactions between particles from various pairs of regions. Such methods can further reduce communication requirements and can increase the extent to which communication and computation are performed simultaneously.

In Section 5, we derive lower bounds on the amount of data to be communicated by any zonal method. These bounds, which depend particularly on the interaction radius and on the number of processors available, guide the design of novel parallelization methods. In Section 6, we specify several methods, each of which has the lowest communication bandwidth requirement of any published method of which we are aware for some set of practical parameter settings. While we do not prove these methods to be optimal, we show that they come close to an approximate lower bound on communication bandwidth requirements.

2. Existing Parallelization Methods and Terminology

We refer to the region containing the system to be simulated as the global cell. Zonal methods require that the global cell be divided into a lattice of identically shaped regions, with each processor assuming primary responsibility for updating the coordinates of all particles in one such subregion. To simplify our exposition, we will assume in this paper that the global cell is a rectangular parallelepiped of dimensions G_x × G_y × G_z and that it is divided into a regular, three-dimensional grid of smaller rectangular parallelepipeds that we call boxes. Each processor updates the coordinates of particles in one box, referred to as the home box of that processor, of those particles, and of any point in the box. In the interest of simplicity, we will refer interchangeably to a processor and its home box. The dimensions of each box are b_x × b_y × b_z. We refer to the quantities b_x/b_y and b_x/b_z as the box aspect ratios. The base coordinates of a given box (and of any particle located within that box) are defined as the coordinates of the low-coordinate corner of that box. We will assume that the low-coordinate corner of the global cell has coordinates (0, 0, 0). If a box has base coordinates (c_x, c_y, c_z), we define the coordinates (c_x/b_x, c_y/b_y, c_z/b_z) as the logical base coordinates of that box and of any particle located within that box.
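The base and logical base coordinates can be computed with a few lines of code. The following is an illustrative sketch (function names and the example box dimensions are ours):

```python
import math

def logical_base_coords(pos, box=(2.0, 2.0, 2.0)):
    """Logical base coordinates (c_x/b_x, c_y/b_y, c_z/b_z) of the box
    holding a point. box holds the example dimensions (b_x, b_y, b_z)."""
    return tuple(math.floor(p / b) for p, b in zip(pos, box))

def base_coords(pos, box=(2.0, 2.0, 2.0)):
    """Base coordinates: the low-coordinate corner of the point's home box."""
    return tuple(l * b for l, b in zip(logical_base_coords(pos, box), box))
```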

We will assume that the global cell tiles an infinite space by repeating in each dimension with a period equal to the side length of the global cell in that dimension. The periodic boundary conditions imposed by this assumption simplify our exposition, but the methods discussed in this paper are also applicable to systems with other boundary conditions. We will also assume for simplicity that G_x ≥ b_x + 2R, G_y ≥ b_y + 2R, and G_z ≥ b_z + 2R, so that at most one image of a given particle interacts with any particle in a given box.1

This section describes the NT and SH methods as well as an example of a traditional spatial decomposition method, the HS (for "Half-Shell") method. These methods can each be described using two spatial regions called zones. Each method interacts all particles in one zone with all particles in the other zone (subject to a local filtering process to be described later). The location of each zone is specified relative to the box in which these interactions take place (the interaction box), and each box has the same spatial relationship with its zones as each other box. For each of these three methods, both zones include the interaction box itself, but are otherwise non-intersecting.

We refer to particles that reside within a particular box as local to that box, and to other particles as remote from that box. Similarly, we refer to interactions between particles that have the same home box as local interactions, and to interactions between particles with different home boxes as remote interactions.

Assuming uniform particle density, the amount of particle data that must be transferred into each processor is proportional to the volume of the import region V_I. We will therefore use the volume of the import region as a measure of communication bandwidth requirements.

For the parallelization methods discussed in this paper, the shape of the import region depends only on the ratios of the box side lengths to the interaction radius R. In three dimensions, the ratio of the import volume V_I to the box volume V_b for a particular method can be determined uniquely given R/b_x, R/b_y, and R/b_z. Alternatively, V_I/V_b can be expressed as a function of the box aspect ratios b_x/b_y and b_x/b_z and the parallelization parameter α_R, where we define α_R as the geometric mean of R/b_x, R/b_y, and R/b_z:

    α_R = (R³/(b_x b_y b_z))^(1/3)

For cubic boxes of side length b, α_R is simply R/b. The parallelization parameter α_R might be viewed as a measure of the extent to which a particular simulation has been parallelized. For convenience, we also define the normalized box side lengths α_x = b_x/(b_x b_y b_z)^(1/3), α_y = b_y/(b_x b_y b_z)^(1/3), and α_z = b_z/(b_x b_y b_z)^(1/3), which can be uniquely determined from the box aspect ratios.
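These definitions translate directly into code. A minimal sketch (function names are ours), showing in particular that α_R depends only on the box volume and that the normalized side lengths multiply to 1:

```python
def alpha_r(bx, by, bz, R):
    # geometric mean of R/b_x, R/b_y, R/b_z
    return ((R / bx) * (R / by) * (R / bz)) ** (1.0 / 3.0)

def normalized_sides(bx, by, bz):
    # normalized box side lengths a_x, a_y, a_z with a_x * a_y * a_z = 1
    g = (bx * by * bz) ** (1.0 / 3.0)
    return bx / g, by / g, bz / g
```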

When interacting a pair of particles a and b, we may wish to compute either a single quantity summarizing the interaction (e.g., an interaction energy) or one quantity for each of the two particles (e.g., the force exerted on each). The former scenario may be viewed as a special case of the latter, as we can assign a single quantity, or a scaled version of that quantity, to both particles. For the sake of generality, we will assume throughout this paper that we wish to compute one quantity for each of the two particles, and we will refer to the quantity associated with particle a as the influence of b on a and to the quantity associated with particle b as the influence of a on b.2 In the HS, NT, and SH methods, a pair of particles always interacts in a single processor that computes the influence of each particle on the other.

2.1 HS method

In traditional spatial decompositions, two particles always interact within a box in which at least one of them resides. In the HS method, the particles interact within the home box of the particle with the smaller x base coordinate. If the two particles have the same x base coordinate, the particles interact within the home box of the particle with the smaller y base coordinate. If they also have the same y base coordinate, they interact within the home box of the particle with the smaller z base coordinate. If the two particles reside within the same box, they interact within that box.

As a result, each interaction box has an import region that consists of half the points external to the interaction box that lie within a distance R of some point within the interaction box. Figure 1 shows this import region in red and the interaction box itself in purple. Every particle in the interaction box interacts with every imported particle, and with every other particle in the interaction box. Equivalently, the two zones are (a) the interaction box itself, and (b) the union of the interaction box and the import region.

This method is guaranteed to interact all pairs of particles separated by less than a distance R. Because the interaction box has a finite volume, however, this method might, in the absence of any modifications, also interact some pairs separated by a distance greater than R. By way of example, if the x base coordinate of the interaction box is C_x, a particle within the interaction box whose x coordinate is C_x + δ (for some sufficiently small positive value of δ) could be interacted with a particle within the import region whose x coordinate is C_x + b_x + R − δ, despite the fact that the two particles are separated by a distance greater than R. Such excess interactions can be eliminated by explicitly testing the distance between each pair to be interacted. Other tests can be used to ensure that a given particle residing within the interaction box does not interact with itself, and that each local interaction is computed once rather than twice. The application of such tests to eliminate inappropriate interactions will be referred to as filtering. Throughout this paper, when considering whether a particular method guarantees that all required particle interactions are computed, we will ignore filtering, as it will be used only to eliminate unnecessary interactions.
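The filtering tests described above can be sketched in a few lines. This is an illustration only (function names are ours, and the lexicographic tie-break for local pairs is one simple choice, not necessarily the one any particular method uses):

```python
def filter_pair(p, q, R):
    """Keep a candidate pair only if it is a genuine near interaction:
    drop self pairs and pairs separated by a distance of R or more."""
    if p == q:                       # a particle never interacts with itself
        return False
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return d2 < R * R                # drop excess pairs beyond the radius

def keep_local_once(p, q):
    """For two particles in the same box, keep only one ordering so each
    local interaction is computed once rather than twice (lexicographic)."""
    return p < q
```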

2.2 NT method

In the NT method, two particles interact within a box that has the x and y base coordinates associated with one particle and the z base coordinate associated with the other particle. Specifically, the x and y base coordinates of the interaction box are those of the particle with the smaller x base coordinate, with ties broken first in favor of the particle with the smaller y base coordinate and then in favor of the particle with the larger z base coordinate. The z base coordinate of the interaction box is the z base coordinate of the other particle.

The resulting import region of each interaction box is the union of its outer tower and outer plate. The outer tower is shown in blue in Figure 1, and comprises all points that lie directly above or below the interaction box within a distance R of the boundary. The outer plate is shown in red, and includes half of the region external to the interaction box that lies in its xy plane within a distance R of the boundary. The zones are the tower, which consists of the union of the outer tower and the interaction box, and the plate, which consists of the union of the outer plate and the interaction box.

Interacting each particle in the tower with each particle in the plate ensures that all near interactions will be computed at least once. To ensure that no near interaction is computed more than once, we can filter pairwise interactions such that plate particles in the interaction box will not interact with particles in the lower half of the outer tower. The filtering criteria of the HS method are also necessary. One can minimize the import requirements of the NT method by choosing the box aspect ratios appropriately.

2.3 SH method

The SH method employs a cubical box of side length b. An interaction box with logical base coordinates (i, j, k) imports particles from two zones. The first is a contiguous region that we refer to as the base. This zone is shown in blue and purple in Figure 1, and consists of a set of boxes at logical base coordinates offset from (i, j, k), where u ∈ […]. The second zone is a set of non-contiguous regions that we refer to as the comb. This zone is shown in red and purple in Figure 1, and consists of a set of boxes at logical base coordinates where v ∈ [−r, r] and w ∈ [0, …]. Note that both the comb and the base include the interaction box. Snir [1] showed that interacting the comb and base ensures that all near interactions will be computed at least once. Filtering is required to ensure that each near interaction is computed only once, as for the NT method.

3. Two-Zone Methods

The HS, NT, and SH methods differ in their choice of zones, which in turn determine their import regions. In this section, we consider the question of which zone choices will guarantee that all required interactions are computed. We consider methods with two zones, and we assume that interactions are computed between each particle in one zone and each particle in the other zone, subject to filtering. In all cases, both zones include the interaction box itself but are otherwise non-intersecting.

We focus initially on describing a broad class of parallelization methods for range-limited n-body problems. The merits of various methods are considered later.

3.1 Simplified model

For clarity of exposition, we initially consider a simplified problem where:

1. The particles are restricted to the two-dimensional xy plane.

2. A pair of particles are required to interact if they are separated by a distance less than R in the x dimension and by a distance less than R in the y dimension.

3. The only allowable parallelization schemes are those whose zones consist only of a set of complete boxes. We refer to such a parallelization scheme as a voxelized method.

4. We will ignore the fact that when computing the influence of two particles on one another, one typically uses the same data, which may (depending on problem- specific considerations) make it advantageous to perform the computation on the same processor. Instead, we assume that each processor computes the influence of particles in one zone (the red zone) on particles in the other zone (the blue zone), but not vice versa.

These simplifying assumptions will be progressively relaxed.

We define the influence region of a box as the smallest region of space that is guaranteed to contain all particles with which any particle in the box could interact. In this simplified problem, the influence region is a rectangle extending a distance R beyond the box in the positive and negative x and y directions.

An analog to the HS method for the simplified problem is shown in Figure 2(a). Each box computes the influence on its own particles of all particles in its influence region. Each box imports all the particles in its influence region that do not already reside in the box.

An analog to the NT method is shown in Figure 2(b). In this method, each box computes the influence of all particles in a row of neighboring boxes spanning the influence region horizontally (red) on all particles in a column of neighboring boxes spanning the influence region vertically (blue). Each box imports both the remote particles in its red zone and the remote particles in its blue zone. The import region of this method is a strict subset of the import region of the HS analog, even though the two methods compute exactly the same interactions when the union of all calculations by all processors is considered.
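The saving is easy to quantify in the simplified model. If the interaction radius spans k box widths (an assumption we introduce for illustration; function names are ours), the HS analog imports every remote box of the (2k+1) × (2k+1) influence region, while the NT analog imports only one spanning row and one spanning column:

```python
def hs_analog_import(k):
    # remote boxes in the full (2k+1) x (2k+1) influence region
    return (2 * k + 1) ** 2 - 1

def nt_analog_import(k):
    # remote boxes in one spanning row plus one spanning column
    return 2 * ((2 * k + 1) - 1)
```

For k = 1 this is 8 imported boxes versus 4; the gap widens quadratically as k grows.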

Other choices for the two zones can also ensure that all required interactions are computed. Figure 2(c) shows a method reminiscent of the SH method. In this method, the red zone comprises a set of equally spaced horizontal slabs, while the blue zone is a vertically oriented block whose height is equal to the spatial period of the slabs. Again, each processor imports the particles in its two zones and computes the influence of particles in its red zone on particles in its blue zone. This method is guaranteed to compute the influence of all particles in a box's influence region on all particles in the box. In Figure 2(c), we have numbered the rows of the influence region of the interaction box B from 1 to 9. Box B will compute the influence of particles in rows 2, 5, and 8 on particles in B. The box directly above B will compute the influence of particles in rows 1, 4, and 7 on particles in B, and the box directly below B will compute the influence of particles in rows 3, 6, and 9 on particles in B.

Figure 2(d) shows a novel method that is also guaranteed to compute all required interactions. The blue zone comprises a rectangular area centered on the interaction box. The red zone is a foam-like structure that consists of boxes positioned such that their period in each dimension is equal to the width of the blue zone. This method is an analog of the three-dimensional foam method discussed in Section 3.6.

In all the methods we have described, the intersection of the two zones is simply the interaction box. We will assume throughout Section 3 that the zones do not overlap outside the interaction box, even though our analysis generally applies even if they do. If the zones were to overlap outside the interaction box, then some pairs of particles that reside in the same box would be interacted in a different box.

3.2 The convolution criterion

Given a pair of zones associated with each interaction box, how can one check whether all required interactions are computed? We first address this question under the assumptions of the simplified two-dimensional model. For any particular method, we define the coverage region of a box as the largest region of space such that the method will compute the influence of any particle in the coverage region on any particle in the box. If the coverage region of a box includes the influence region of that box, then all required interactions will be computed (any additional, redundant interactions can be eliminated by filtering). Note that the coverage region is a property of the parallelization method employed, while the influence region is not.

Under the assumptions of the simplified model, we can easily determine the coverage region of a box B given the zones associated with a particular method. Each box computes the influence of each particle in its red zone on each particle in its blue zone. We first find the set of boxes whose blue zone includes B. We then take the union of the red zones for all boxes in this set to obtain the coverage region of B.

To determine the coverage region graphically, we first reflect the blue zone of B through the center of B to show the set of boxes whose blue zones include B. In two dimensions, reflection through the center of B is equivalent to a 180° rotation about B. If the center of B is at (x, y), then the reflected blue zone will include the point (x - a, y - b) if and only if the blue zone of B includes the point (x + a, y + b). The coverage region is the union of the red zones corresponding to each box in the reflected blue zone, as each box in the reflected blue zone will compute the influence of all particles in its own red zone on all particles in B. The red zones of different boxes differ from one another only by translation, so we can also compute the coverage region by translating the red zone of B such that it has the same spatial relationship to each box in the reflected blue zone of B that it originally had to B, and then taking the union of these translated red zones.

This process can also be described as a convolution operation. The coverage region is the support of the convolution of an indicator function over the blue zone with an indicator function over the red zone. For that reason, we term the criterion that the coverage region include the entire influence region the convolution criterion. In a voxelized method without filtering, all particles in a box are interacted with the same set of particles outside that box. If some part of the influence region of a box is not in the box's coverage region, then no interactions will be computed between particles in the box and particles in that part of the influence region. For a voxelized method, the convolution criterion is thus not only sufficient but also necessary to guarantee that all required particle interactions are computed. We consider non-voxelized methods in Section 3.4.
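Under the simplified voxelized model, this criterion can be checked mechanically. The sketch below (zone shapes are illustrative; offsets are in box units, with R equal to one box side) forms the coverage region as the set {r - b : r in red, b in blue}, i.e., the support of the convolution of the red-zone indicator with the reflected blue-zone indicator:

```python
# Voxelized 2D sketch of the convolution criterion. Zones are sets of box
# offsets relative to the interaction box; the zone shapes below are
# illustrative, not taken from any particular figure.

def coverage_region(red, blue):
    """Union of the red zones of all boxes whose blue zone includes the
    interaction box, i.e. {r - b : r in red, b in blue}."""
    return {(rx - bx, ry - by) for (rx, ry) in red for (bx, by) in blue}

def satisfies_convolution_criterion(red, blue, influence):
    """True if the coverage region includes the entire influence region."""
    return influence <= coverage_region(red, blue)

# Influence region: every box within one box of the interaction box.
influence = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}

# An NT-like pair of zones: a 3-box row (red) and a 3-box column (blue).
red  = {(dx, 0) for dx in (-1, 0, 1)}
blue = {(0, dy) for dy in (-1, 0, 1)}

print(satisfies_convolution_criterion(red, blue, influence))  # True
```

Here a small row and column suffice: translating the 3-box red row to each box of the reflected 3-box blue column sweeps out the full 3 × 3 influence region.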

For the four methods illustrated in Figure 2, the coverage region of the interaction box is identical to its influence region. In the absence of filtering, all particles in a particular box will therefore be interacted with all other particles whose distance from the box is no more than R in both the x and y dimensions. Figure 3 illustrates several other pairs of zones that also satisfy the convolution criterion. The method shown in Figure 3(a) is similar to that of Figure 2(d), but in Figure 3(a), the two zones associated with a particular box are not centered on that box. This method also differs from those of Figure 2 in that the coverage region of the box is larger than its influence region, with parts of the red zone lying outside the influence region. If we eliminated these parts of the red zone, however, the coverage region would no longer include the entire influence region, so the method would no longer ensure that all required interactions were computed. Figures 3(b) and 3(c) show two exotic methods that also satisfy the convolution criterion.

3.3 Symmetric interactions

So far, we have assumed that each box computes the influence of particles in its red zone on particles in its blue zone, but not vice versa. Because the influence of particle a on particle b typically depends on some or all of the same data as the influence of particle b on particle a, we can reduce communication requirements by allowing each box to also compute the influence of its blue zone on its red zone.

While the convolution criterion still holds in this case, we must modify the procedure used to derive the coverage region from the zones. We assume for now that assumptions 1 through 3 of our simplified model are still in effect (i.e., particles in a two-dimensional space interact if they are separated by a distance less than R in each dimension, and

interactions are computed using some voxelized method). We now compute the coverage region as the union of a red-on-blue coverage region, determined using the procedure described in Section 3.2, and a blue-on-red coverage region, computed using the same procedure but with the roles of the red and blue zones reversed.

For a voxelized method, the blue-on-red coverage region of a box B is simply the red-on-blue coverage region reflected through the center of B. The coverage region will therefore include the entire influence region of B if and only if, for any point in the influence region that does not fall within the red-on-blue coverage region, the point obtained by reflecting that point through the center of B falls within the red-on-blue coverage region.

Figure 4 shows modifications to the import regions of Figure 2, assuming that each box will compute the influence of both its blue zone on its red zone and its red zone on its blue zone. In each case, we have cut one of the two zones roughly in half. The red-on-blue coverage region for each scheme is enclosed within a dashed line. In each case, the red-on-blue coverage region includes at least half the influence region, and the union of the red-on-blue coverage region and its reflection covers the entire influence region. Again, filtering can be used to ensure that no particle pair will interact twice and that only required interactions are computed.
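Continuing the voxelized sketch, the symmetric coverage region is the union of the red-on-blue coverage region and its reflection through the box center; the halved blue zone below is illustrative, not a literal rendering of Figure 4:

```python
def red_on_blue_coverage(red, blue):
    """{r - b : r in red, b in blue}, as in the non-symmetric case."""
    return {(rx - bx, ry - by) for (rx, ry) in red for (bx, by) in blue}

def symmetric_coverage(red, blue):
    """Red-on-blue coverage plus its reflection through the box center,
    the latter being the blue-on-red coverage for a voxelized method."""
    rb = red_on_blue_coverage(red, blue)
    return rb | {(-x, -y) for (x, y) in rb}

influence = {(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)}
red  = {(dx, 0) for dx in (-1, 0, 1)}   # full 3-box row
blue = {(0, 0), (0, 1)}                  # 3-box column cut roughly in half

# Red-on-blue alone misses part of the influence region, but the union
# with its reflection covers all of it.
print(influence <= symmetric_coverage(red, blue))  # True
```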

3.4 Non-voxelized methods

Thus far, we have assumed that each zone contains only whole boxes. We can often reduce the volume of the import region, however, by using zones that contain one or more partial boxes.

To handle such non-voxelized methods, we must generalize the convolution criterion. In particular, we will compute a coverage region for each point in a box, rather than for the box as a whole. Given a particular method, we define the coverage region of a point as the region whose influence on a hypothetical particle at that point will be computed by that method. The coverage region of a point may again be constructed as the union of a red-on-blue coverage region and a blue-on-red coverage region. We construct the red-on-blue coverage region for a point q using a procedure similar to that used for its home box B, except that we take the union of red zones of the set of boxes whose blue zones include the point q rather than the full box B. We compute the blue-on-red coverage region of a point using a similar procedure, but with the roles of the red zone and the blue zone reversed.

We define the influence region of a point as the minimal region guaranteed to contain all particles required to interact with a particle at that point. In the simplified model, the influence region is a square of side length 2R centered on the point. In the more typical scenario where particles are required to interact if separated by a Euclidean distance less than R in three dimensions, the influence region is a sphere of radius R surrounding the point.


The generalized version of the convolution criterion states that for each point in a box, the coverage region of that point must include the entire influence region of that point. This criterion is both necessary and sufficient to ensure that each particle will interact with all other particles in its influence region.

Figure 5 shows a modified version of the NT-like method of Figure 4(b), where the zones contain partial boxes. In this illustration, we have assumed that assumptions 1 and 2 of the simplified model still apply (i.e., two-dimensional problem with a square influence region), but we have chosen R to be a non-integer multiple of δ. The method shown ensures that each point in the box interacts with all points within a square of side length 2R centered on that point, even though no point in the box is guaranteed to interact with all points within the influence region of the box.

3.5 Rounding and the rounding criterion

Under assumption 2 of the simplified model, two particles were required to interact if they were separated by a distance no greater than R in each dimension. We now adopt the more realistic assumption that two particles are required to interact if and only if they are separated by a Euclidean distance less than R. In two dimensions, the influence region of each point is therefore enclosed by a circle of radius R rather than a square of side length 2R. Likewise, the influence region of each box is a subset of the rectangular region considered in the previous sections; its corners are rounded. We can take advantage of this fact to reduce the size of the import region, a technique that we will refer to as rounding.

A point may be discarded from zone v if the point is further than R away from the closest point in the set of zones with which v interacts. We refer to this condition as the rounding criterion. For the two-zone methods considered in this section, the rounding criterion implies that points in the red zone may be discarded if they are further than R away from all points in the blue zone, and points in the blue zone may be discarded if they are further than R away from all points in the red zone. Whereas the rounding criterion is a sufficient condition to guarantee that a point can be safely discarded from a zone, it is not a necessary condition, because in some cases one can also safely eliminate points that do not meet this condition; we describe such a method in a separate paper [20]. One can, however, always use the generalized form of the convolution criterion to test whether a particular rounded method is guaranteed to compute all required interactions.
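A discrete sketch of the rounding criterion, approximating continuous zones by point samples (the zone shapes and the sampling density here are illustrative):

```python
import math

def apply_rounding(zone, other, R):
    """Rounding criterion: a point may be discarded from `zone` if it is
    at least R from every point of `other`, the zone it interacts with."""
    return {p for p in zone if any(math.dist(p, q) < R for q in other)}

# Illustrative samples (box side 1, R = 1.2): the blue zone is the box
# itself; the red zone is a square ring of sample points around it.
blue = {(x / 4, y / 4) for x in range(5) for y in range(5)}
red  = {(x / 4, y / 4) for x in range(-5, 10) for y in range(-5, 10)} - blue
kept = apply_rounding(red, blue, 1.2)
print(len(kept) < len(red))  # True: the ring's far corners are discarded
```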

Figure 6 shows rounded versions of the HS and SH analogs. While rounding never increases import volume, it does not always reduce it. The NT analog in Figure 4(b) and the foam analog in Figure 4(d), for example, are not affected by rounding.

3.6 Two-zone methods in higher dimensions

The analytical tools discussed in the previous sections generalize in a straightforward manner to the case where the particles lie in a three-dimensional space (or, for that matter, a higher-dimensional space). Given a method where each box interacts particles in one zone with particles in another zone, we can use the convolution criterion to determine whether all particle pairs within a distance R will be interacted. The influence region is three-dimensional, and the "convolution-like" process of determining the coverage region takes place in three dimensions (or in d dimensions, if the particles lie in a d-dimensional space). We can also apply the rounding criterion to reduce the volumes of the zones. Again, the convolution criterion is a necessary and sufficient condition for ensuring that all particles within a distance R will be interacted. The rounding criterion, on the other hand, is not guaranteed to produce the smallest permissible zones.

Figure 7 shows the SNT method as well as three novel three-dimensional parallelization methods, to which we refer as the clouds, city, and foam methods. All of these might be viewed as rounded, three-dimensional generalizations of the two-dimensional methods of Figures 4(c) and 4(d). The foam method is discussed further in Section 6.3.

These three-dimensional methods differ from one another in the extent to which they can take advantage of rounding. The NT and SNT methods can only use two-dimensional rounding: in the rounded versions of these methods, the coverage region of each point includes a cylinder of radius R and height 2R. In the high parallelization limit, when boxes are very small, two-dimensional rounding reduces the size of one of the zones and of the coverage region by a factor of π/4 compared to the unrounded case. Other methods, including the HS, clouds, city, and foam methods, allow three-dimensional rounding: they reduce the coverage region of each point to a sphere of radius R in the high parallelization limit. Asymptotically, such rounding reduces the size of one of the zones and of the coverage region by a factor of π/6 compared to the unrounded case and by a factor of 2/3 compared to the two-dimensional-rounded case. Likewise, in spaces of more than three dimensions, rounding reduces the coverage region of each point to a multidimensional sphere in the high parallelization limit for certain methods but not for others.

In a space with an odd number of dimensions, the two zones of a method guaranteed to compute all near interactions will necessarily have different shapes and, typically, different volumes. In order to reduce total import volume, one can adjust the relative volumes of the zones in two ways:

• Use non-cubical boxes, making one of the zones larger than it would have been otherwise and the other smaller. The NT method employs this technique.

• In methods where one or both zones involve multiple non-contiguous pieces (e.g., the SH method), choose the spacing between the pieces appropriately.

Methods in which the two zones have approximately equal volumes typically have a smaller total import volume than methods where one zone is much larger than the other. We will return to the topic of minimizing import volume in Sections 5 and 6.

4. k-Zone Methods

The methods considered thus far involve the interaction of two overlapping zones. These methods interact all particles in one zone with all particles in the other zone, subject to filtering criteria. In this section, we consider a class of methods that make use of three or more zones, interacting any number of pairs of these zones. Such k-zone methods interact all particles in a given zone with all particles in a subset of the other zones, subject to filtering criteria. Two-zone methods and k-zone methods together constitute the class of zonal methods.

The four zones of a particular two-dimensional k-zone method are shown in Figure 8. Each processor imports a rectangle extending a distance R beyond the interaction box in the +x direction (zone X), a rectangle extending a distance R beyond the interaction box in the +y direction (zone Y), and a quarter circle of radius R (zone C) adjacent to both zone X and zone Y. The interaction box is treated as a distinct zone (zone I). Zone X interacts with zone Y, and zone I interacts with all the zones, itself included. Filtering eliminates particle pairs separated by a distance greater than R, as well as duplicated local interactions. In this method, two particles interact within a box whose x base coordinate is that of the particle with the smaller x coordinate and whose y base coordinate is that of the particle with the smaller y coordinate. The import volume of this method is always less than that of the rounded HS analog shown in Figure 6(a). When R is small relative to the box side lengths, the import volume is also less than that of the NT analog shown in Figure 4(b).
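The import-area comparison in this paragraph can be checked numerically in two dimensions. The areas below are my own derivation under the stated geometry (box sides b x and b y): the Figure 8 method imports zones X and Y plus the quarter circle C, area R(b x + b y) + πR²/4, while the rounded HS analog is assumed to import half of the rounded shell of thickness R around the box, area R(b x + b y) + πR²/2; the function names are hypothetical.

```python
import math

def kzone_import_area(bx, by, R):
    """Zones X (R wide by by tall), Y (bx by R), and quarter-circle C."""
    return R * by + R * bx + math.pi * R**2 / 4

def rounded_hs_import_area(bx, by, R):
    """Assumed: half of the rounded shell of thickness R around the box."""
    return R * (bx + by) + math.pi * R**2 / 2

bx = by = 1.0
for R in (0.25, 1.0, 4.0):
    # The k-zone method wins by exactly one quarter circle at any R.
    assert kzone_import_area(bx, by, R) < rounded_hs_import_area(bx, by, R)
print("Figure 8 method imports less than the rounded HS analog at each R")
```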

The convolution criterion introduced in the context of two-zone methods also applies to k-zone methods: all required particle pair interactions will be computed if and only if the coverage region for each point in a box covers the entire influence region of that point. The procedure for computing the coverage region of a point, however, is somewhat more complicated for a k-zone method than for a two-zone method. To determine the coverage region of a point for a k-zone method, one takes the union of the zone-pair coverage regions of that point for all pairs of zones interacted by the method, where we define the zone-pair coverage region of a point for a given pair of zones as the region whose influence on a hypothetical particle at that point will be computed by the interaction of those two zones. Each zone-pair coverage region can be computed using the procedure described in Section 3.4 for computation of the coverage region for a two-zone method.

Alternatively, one can compute the coverage region for the full box by taking the union of zone-pair coverage regions of the box for all pairs of zones to be interacted, where we define the zone-pair coverage region of a box for a given pair of zones as the region whose influence on any particle in the box will be computed by the interaction of those two zones. Each zone-pair coverage region can be determined using the procedure of Section 3.3. All required particle-pair interactions are guaranteed to take place if the coverage region of the box includes the entire influence region of the box. For a voxelized method, the converse is also true.

In the k-zone method illustrated in Figure 8, the union of the zone-pair coverage regions for the interaction of the four zones with zone I includes all but two quarter-circle portions of the interaction box influence region. The zone-pair coverage region for the interaction of zone X and zone Y covers these two remaining quarter circles.

In our discussion of k-zone methods, we will assume that the zones do not overlap, and that a zone cannot interact with itself. The exception is the interaction box, which we define as a separate zone that will interact with itself. In the absence of these assumptions, we could convert any k-zone method into a method with just one zone by defining the single zone to include the entire import region and the interaction box. However, this would lead to a situation where some pairs of particles that reside in the same processor would interact on a different processor.

The k-zone methods have several practical benefits. First, they can reduce communication bandwidth requirements relative to two-zone methods, both by allowing more near interactions to be computed for a given import volume (as in the method of Figure 8) and by exposing additional rounding opportunities (as will be illustrated shortly). Second, k-zone methods can be used to simplify the filtering criteria required to avoid redundant computations. For example, the two-zone formulation of the NT method requires filtering to prevent pairs of particles with the same x and y base coordinates from being interacted twice, while a k-zone formulation of the NT method, presented below, eliminates the need for such filtering in all zone-zone interactions except for the interaction of the interaction box with itself. Third, the use of additional zones allows more opportunities to overlap communication and computation, because interactions between some pairs of zones can be computed while other zones are being imported, as discussed below.

The extent to which communication between processors can be overlapped with computation depends on the order in which pairs of zones are interacted. We use an interaction schedule (or simply schedule) to specify which pairs of zones should be interacted and in what order those interactions should take place. The schedule may be represented as an upper triangular matrix in which the entry in row i, column j (where i ≤ j) corresponds to the interaction, if any, between zone i and zone j. A zero entry indicates that the corresponding pair of zones is not interacted at all. The remaining entries are unique positive integers indicating the order in which the zone-zone interactions are computed: the entry containing a 1 represents the first interaction to be computed, a 2 represents the second interaction to be computed, and so on. A schedule for the method of Figure 8 is shown in Table II.

The order of interactions that allows maximal overlap of computation and communication depends on details of the available computational hardware. The schedules presented in this paper were formulated using the following procedure:

• First, we determine the order in which the zones will be imported, requiring that zone u be imported before zone v if zone u participates in more zone-zone interactions than zone v.

• We then number the zones according to the order in which they are imported, with zone 1 representing the interaction box, zone 2 representing the first zone imported, zone 3 representing the second zone imported, and so on.

• We set schedule entries corresponding to pairs of zones that should not be interacted to zero. We then number the remaining entries in column-major order; that is, the interaction in row i₁ and column j₁ will be computed before the interaction in row i₂ and column j₂ if j₁ < j₂, or if j₁ = j₂ and i₁ < i₂. This ensures that computation of the interactions between a pair of zones that have already been imported never has to wait for the computation of an interaction involving a zone that has not yet been imported.
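The column-major numbering procedure above can be sketched as follows. The zone numbering for the Figure 8 method (zone 0 = the interaction box I, quarter-circle zone C imported last) is an assumption based on the import-order rule, not taken from Table II; the zero-based indexing is a convenience of the sketch.

```python
def build_schedule(n_zones, pairs):
    """Number the interacting zone pairs (i, j), i <= j, in column-major
    order, so that interactions involving later-imported zones are always
    computed later. Zones are assumed already numbered in import order
    (zone 0 = interaction box, which may interact with itself)."""
    schedule = [[0] * n_zones for _ in range(n_zones)]
    order = 1
    for j in range(n_zones):          # column-major: sweep columns first
        for i in range(j + 1):        # upper-triangular entries only
            if (i, j) in pairs:
                schedule[i][j] = order
                order += 1
    return schedule

# The Figure 8 method: zone 0 = I, and (assumed) zones 1 = X, 2 = Y, 3 = C;
# I interacts with every zone including itself, and X interacts with Y.
pairs = {(0, 0), (0, 1), (0, 2), (0, 3), (1, 2)}
for row in build_schedule(4, pairs):
    print(row)
```

The inner ordering guarantees that, for example, the X-Y interaction (both imported early) is numbered before the interaction of I with the late-imported zone C.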

To illustrate the advantages of k-zone methods, we describe the kZ-NT method, a reformulation of the NT method using four zones. The zones correspond to the interaction box (I), the lower half of the outer tower (L), the upper half of the outer tower (U), and the outer plate (P), as shown in Figure 9. Zone I interacts with zones I, U, and P, and zone P interacts with zones U and L; the interaction schedule is shown in Table III. We can start computing interactions between pairs of particles in the interaction box while importing particles from the other zones. We can also compute interactions between P and I while importing U and L, and we can complete the import of L while interacting U with I and P. By eliminating the interaction between I and L, we prevent pairs of particles lying in different boxes with the same x and y base coordinates from being interacted twice. The k-zone formulation of the NT method exposes an additional rounding opportunity not exploited by the original NT method. Because the lower tower L interacts only with P, the portions of L that are more than a distance R from any point in P can be eliminated, as shown in Figure 9.

5. Lower Bounds on the Import Volume

5.1 A general bound

In the Appendix, we establish a lower bound for the import volume V I of a zonal method. We assume that the zones are non-overlapping, that one of the zones is the interaction box, and that only the interaction box zone interacts with itself. We refer to zones other than the interaction box as remote zones. We denote the total number of zones by N z and the number of remote zones by N zr = N z - 1. We find that

(1)

where V b is the volume of a box and where V IR,remote , the average across all points in a box B of the volume of the portion of the point's influence region that lies outside B, is given by

This is a strict bound that holds for all zonal parallelization methods.

5.2 A useful approximate lower bound

As discussed in the Appendix, one can obtain a larger, though approximate, lower bound by replacing V IR,remote in the above bound by the volume of the interaction neighborhood V IN , defined as the portion of a box's influence region that lies outside the box. Using the parallelization parameter α R and the normalized box side lengths α x , α y , and α z , defined in Section 2, we can express the volume of the interaction neighborhood V IN as

In the high parallelism limit, where the box volume V b approaches zero and the parallelization parameter α R approaches infinity, V IN and V IR,remote converge to 4πR³/3, so the approximate lower bound approaches the strict one.

The Appendix also shows that if the box volume and interaction radius are fixed, then one minimizes V IN by using cubic boxes (this does not imply that cubic boxes minimize import volume for any particular parallelization method). Thus, for any box aspect ratios,

Substituting this minimum value of V IN for V IR,remote in Eq. (1) gives an approximate lower bound that is independent of box aspect ratios:

In the high parallelism limit, as α R goes to infinity, the leading α R term in Eq. (2) dominates, so the approximate bound of Eq. (2) becomes

Because V IN and V IR,remote converge in the high parallelism limit, the bound of Eq. (3) becomes strict in that limit. The import volume of the foam method discussed in Section 6.3 approaches this lower bound asymptotically for N zr = 2.

By substituting α R = R/V b ^1/3 in Eq. (3) and noting that V b = V gc /p, where V gc is the volume of the global cell, one can show that the minimal import volume for a zonal method scales as R^3/2 p^-1/2 for N zr > 1 and as R³ for N zr = 1. Snir also proved that scaling of R^3/2 p^-1/2 is asymptotically optimal, albeit under slightly different assumptions [1].

The N zr -dependent factor in Eq. (2) decreases as N zr grows, starting at √2 for N zr = 2 and approaching 1 as N zr approaches infinity. In practice, the use of multiple zones typically gives a smaller improvement in import volume, because it is generally not possible to devise a method where every pair of remote zones interacts and where every particle pair considered before filtering corresponds to a unique near interaction.

6. Specific Methods

This section details several specific zonal methods that require a lower import volume than any previously described method of which we are aware for some range of parallelization parameter values, where the parallelization parameter α R = R/V b ^1/3 determines the effective degree of parallelism, as discussed in Section 2. These methods are guaranteed to interact any pair of particles separated by a distance less than R. They will interact a pair of particles twice only if both particles lie in the same box, a form of redundancy that can be eliminated by filtering.

6.1 ES method

Intuitively, the HS method might seem to be optimal in terms of import volume in the low-parallelism limit, but this is not the case. Figure 10 shows the zones of the ES (for "Eighth-Shell") method, which is the three-dimensional generalization of the two-dimensional method illustrated in Figure 8. In the ES method, two particles interact within a box whose x base coordinate is that of the particle with the smaller x coordinate, whose y base coordinate is that of the particle with the smaller y coordinate, and whose z base coordinate is that of the particle with the smaller z coordinate. This method utilizes eight non-overlapping zones: the interaction box (I); three face subregions (FX, FY and FZ) that abut the +x, +y, and +z faces, respectively, of the interaction box; three edge subregions (EX, EY and EZ) that abut the +y+z, +x+z, and +x+y edges, respectively, of the interaction box; and one corner subregion (C) that touches the +x+y+z corner of the interaction box. The import region of the ES method is smaller than that of the HS method at any degree of parallelism. The import region of the ES method, which consists of one corner subregion, three edge subregions, and three face subregions, is a strict subset of the import region of the HS method, which consists of four corner subregions, six edge subregions, and three face subregions. The total import volume of the ES method is

If the home box volume V b and the parallelization parameter α R are fixed, the import volume of the ES method is minimized by using cubic home boxes, for which

The ES method is so named because in the high parallelism limit, its import region becomes one-eighth of the interaction neighborhood, an "eighth-shell." In that limit, its import volume will be one-quarter that of the half-shell method. On the other hand, the import volume of the ES method has the same asymptotic scaling properties as the HS method, so at high levels of parallelism, it proves inferior to the NT method.

Table IV shows an interaction schedule for the ES method. Zone I interacts with all of the zones (including itself). In addition, each face zone interacts with the other two face zones and with one edge zone.

6.2 kZ-NT and kZ-SNT methods

The kZ-NT method was introduced in Section 4. The import volume of this method is smaller than that of the original NT method because the kZ-NT method exploits an additional rounding opportunity. More specifically, the import volume of the original NT method is

One can show that the import volume of the kZ-NT method is smaller than that of the NT method by

where we have assumed that b x = b y , because the import volumes of both the NT and kZ-NT methods are minimized for fixed V b , b z , and R when b x = b y . This reduction in import volume is small relative to the total import volume for high parallelization parameter values, when R is typically much larger than b x and b y , but is significant for lower parallelization parameter values.

The SNT method can also be reformulated using additional zones. Figure 11 shows the zones of the kZ-SNT method, and Table V gives the interaction schedule. The import volume of the kZ-SNT method is smaller than that of the original SNT method because the kZ-SNT method exploits two additional rounding opportunities, illustrated in Figure 11.

If there is no space between the bars of the comb in the kZ-SNT method, then the zone BA disappears and import of zone E becomes unnecessary, as it only interacts with BA. In that case, the remaining import region exactly corresponds to that of the NT method. It can be shown that if one has the freedom to optimize the box aspect ratios in addition to the spacing between the bars in the kZ-SNT method, then one minimizes the import volume of the kZ-SNT method by reducing it to the kZ-NT method.

If the box aspect ratios are fixed, on the other hand, the SNT method has an advantage over the NT method in that the SNT method can approximately balance the volume of the base and the comb by adjusting the spacing between the bars of the comb, substantially reducing the total import volume at higher levels of parallelism. The SNT method therefore has a lower import volume than the NT method for high parallelization parameter values at fixed box aspect ratios. Likewise, the kZ-SNT method has a lower import volume than the kZ-NT method under these conditions.

6.3 Foam method

The foam method is illustrated in Figure 7. The interaction box imports a brick consisting of s³ boxes, which are configured as an s × s × s cube centered on the interaction box in the x and y dimensions and extending below, but not above, the interaction box in the z dimension. The interaction box also imports a foam of individual boxes spaced every s boxes in each dimension. The foam extends above the interaction box, and is rounded to form an approximately hemispherical structure. The foam method dominates the kZ-NT and kZ-SNT methods at very high levels of parallelism because it exploits three-dimensional rounding.
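As a rough illustrative sketch (not the paper's exact construction), membership of a box offset in the brick or in the foam might be tested as follows, with offsets measured in box units, s assumed odd so the brick can be centered in x and y, and the hemisphere radius r given in box units:

```python
def in_brick(dx, dy, dz, s):
    """s*s*s block centred on the interaction box in x and y,
    extending below (but not above) the box in z. Assumes s is odd."""
    h = s // 2
    return -h <= dx <= h and -h <= dy <= h and -(s - 1) <= dz <= 0

def in_foam(dx, dy, dz, s, r):
    """Isolated boxes spaced every s boxes in each dimension, lying above
    the interaction box, rounded to an approximate hemisphere of radius r."""
    return (dx % s == 0 and dy % s == 0 and dz % s == 0
            and dz > 0
            and dx * dx + dy * dy + dz * dz <= r * r)

# Example: with s = 3 and hemisphere radius 5 boxes, the box itself is in
# the brick, and the nearest foam box directly overhead is at dz = 3.
print(in_brick(0, 0, 0, 3), in_foam(0, 0, 3, 3, 5))  # both True
```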

The asymptotic import volume of the foam method for large numbers of processors is

Optimizing s to minimize the asymptotic import volume yields

and

Asymptotically, the foam method achieves the lower bound of Eq. (3) for a method with two remote zones. Shaw [2] showed that the NT and SNT methods achieve an asymptotic import volume of (with optimized box aspect ratios). In the high parallelism limit, the foam method has times the import volume of the NT and SNT methods in either their original or k-zone formulations.

Because the foam method only becomes practical at very high degrees of parallelism, efficient implementations would likely need to use a large number of zones to hide import latencies. A practical interaction schedule is omitted in the interest of brevity. Conceptually, the key elements of the schedule are that the interaction box interacts with the foam, the brick, and itself, while the foam and the brick interact with one another.

6.4 Comparison of communication bandwidth requirements

Figures 12 through 15 graphically depict the import volumes of several zonal methods over a range of parallelization parameter values. These figures include the methods that have the lowest import volume among the methods discussed in this paper for some range of parallelization parameter values. Figures 12 and 13 assume that the box aspect ratios of each method are tuned to minimize import volume at each setting of the parallelization parameter, while Figures 14 and 15 assume that boxes are constrained to be cubical.

When box aspect ratios are tunable, the ES method has the lowest import volume of the methods we have discussed for α R below 0.60. The kZ-NT method has the lowest import volume for values of α R ranging from 0.60 to approximately 15. The foam method has the lowest import volume in the high parallelism limit, but becomes competitive with the kZ-NT method only when α R is greater than about 15.

As an example, consider a reference system having a cubic global cell measuring 80 Å on a side and an interaction radius of 12 Å. These parameters are within the range that might be typical for a molecular dynamics simulation of a biomolecular system; at a typical density of 0.1 atoms/Å³, such a system would contain about 51,000 atoms. For this system, the relationship between α R and the number of processors p is p ≈ 296 α R ^3, so the ES method minimizes import volume for fewer than about 64 processors (small computer clusters). The kZ-NT method minimizes import volume between about 64 and a million processors (large clusters, or even ultraparallel hardware such as QCDOC [21] or Blue Gene/L [22]). The foam method dominates only at extreme levels of parallelism involving more than a million processors.
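The processor-count relationship for this reference system follows directly from p = V gc /V b together with α R = R/V b ^1/3, so that V b = (R/α R)³; a quick numerical check:

```python
# Reference system from the text: cubic global cell of side 80 Å, R = 12 Å.
side, R = 80.0, 12.0
V_gc = side ** 3  # volume of the global cell

def processors(alpha_R):
    """p = V_gc / V_b with V_b = (R / alpha_R)**3, i.e. p ~ 296 * alpha_R**3."""
    V_b = (R / alpha_R) ** 3
    return V_gc / V_b

print(round(processors(1.0)))   # 296
print(round(processors(0.60)))  # 64: the ES-to-kZ-NT crossover in the text
```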

When box aspect ratios are fixed, the kZ-SNT method requires a smaller import volume than the kZ-NT method for sufficiently large parallelization parameters. For cubic boxes, the ES method minimizes import volume for α R below 0.66, followed by the kZ-NT method for α R between 0.66 and 2.7 and the kZ-SNT method for α R between 2.7 and approximately 16. The foam method again has the smallest import volume in the high parallelism limit, but becomes competitive with kZ-SNT only for α R above 16. For our reference system, the ES method has the lowest communication bandwidth of the methods we have discussed for small clusters (up to about 85 processors), the kZ-NT method has the lowest bandwidth for large clusters (about 85 to about 5800 processors), the kZ-SNT method has the lowest bandwidth for existing ultraparallel hardware (about 5800 to 1.2 million processors), and the foam method has the lowest bandwidth only beyond 1.2 million processors.

In practice, the ease with which box aspect ratios can be tuned depends on the architecture of the communication network. The dimensions of the global cell are usually determined by the physical system being simulated. In a network where communication between any two processors is equally expensive, one can easily change the dimensions of the box grid, so aspect ratios are relatively unconstrained. On a network with a mesh or toroidal topology (e.g., in Blue Gene/L or in Cray's T3D, T3E, and XT3 systems [23, 24]), on the other hand, it is generally most convenient for the box grid to correspond to the network grid, determining fixed box aspect ratios.

Figures 12 through 15 also show the approximate lower bounds of Eq. (2) for two remote zones (N_zr = 2) and for an infinite number of remote zones (N_zr = ∞). The parallelization methods discussed in Section 6 come close to the N_zr = 2 bound at all levels of parallelism, with the foam method approaching the bound asymptotically, albeit slowly, at high levels of parallelism and the ES method actually beating the N_zr = 2 bound at low levels of parallelism by using more than two remote zones. The gap between actual import volumes and the N_zr = ∞ bound reflects the fact that this bound is tight only when all remote zones interact with one another without producing redundant interactions, which is rarely if ever the case for a method with a large number of zones.

We have not proven that the methods of Section 6 are optimal in terms of import volume. In fact, we have discovered novel neutral territory methods, to be described in a subsequent paper [25], whose import volume is at least slightly smaller than that of any

method described in this paper over a wide range of parallelization parameters. Another technique called the midpoint method, which we also describe in a separate paper [20], has an import volume identical to that of the ES method at all levels of parallelism and offers certain practical advantages over the ES method on some machines.

Our assumption that G_x ≥ b_x + 2R, G_y ≥ b_y + 2R, and G_z ≥ b_z + 2R (Section 2) implies that the global cell must be partitioned at least once along each dimension (i.e., the mesh of boxes must be at least 2×2×2). When only a small number of processors are available, methods that involve a partition of the space into boxes along only one or two of the three dimensions may minimize import volume. If we partition the global cell into a 1×1×n mesh of boxes, implying that b_x = G_x and b_y = G_y, then the only communication required will be in the positive and negative z directions. If we partition the global cell into a 1×m×n mesh of boxes, implying that b_x = G_x, then the only communication required will be in the yz plane. In either case, one can still utilize the zonal methods we have introduced, but the formulae for their import volumes will be different from those presented previously. When parallelizing our reference system across 16 or fewer processors, application of the ES or kZ-NT methods with a partition along two dimensions will require a lower import volume than application of the ES method with a partition along three dimensions. If fewer than six processors are available, a traditional spatial decomposition method with a partition along one dimension will require a lower import volume than any of the methods we have described with a two- or three-dimensional partition.
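For this reference system, the crossover points just described can be summarized in a small selection rule (an illustrative sketch of ours; the thresholds of 6 and 16 processors are specific to this reference system and would change with the cell size and interaction radius):

```python
def partition_dimensions(p):
    """Number of dimensions along which to partition the 80 Å / 12 Å
    reference system, per the crossover points stated in the text."""
    if p < 6:
        return 1   # 1 x 1 x n mesh: communication only along +/- z
    elif p <= 16:
        return 2   # 1 x m x n mesh: communication in the yz plane
    else:
        return 3   # full three-dimensional mesh of boxes

print(partition_dimensions(4))    # 1
print(partition_dimensions(12))   # 2
print(partition_dimensions(64))   # 3
```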

7. Conclusions

This paper has introduced a broad class of techniques, which we refer to as zonal methods, for the efficient parallelization of range-limited n-body problems. This class includes both traditional spatial decomposition methods and the methods of Shaw and Snir as special cases; it also includes a wide variety of "neutral territory" methods that have not been described previously. We have described a test to determine whether a particular zonal method is guaranteed to interact all pairs of particles separated by a distance less than R. We have also demonstrated a systematic method for exploiting opportunities for "rounding," and have shown how the use of three or more "zones" can reduce communication bandwidth requirements, overlap communication time with computation time, and efficiently avoid the computation of redundant interactions. We have also derived strict and approximate lower bounds on the import volume requirements of zonal methods at various degrees of parallelism.

The current paper also introduces new methods that have lower communication bandwidth requirements than any published method of which we are aware at both very low and very high levels of parallelism. For the broad class of problem and system parameters that lie between these two extremes, we have also described a new set of modifications to the previously described NT and SNT methods that reduce their communication bandwidth requirements below those of any published method of which we are aware.

While this paper has focused primarily on import volume as a metric for comparing different parallelization methods, the best choice of parallelization method may in practice depend on other factors as well. These include:

• Load balancing. The efficiency of a given computation may depend in part on the ability of the parallelization method used to balance computational load across the available processors. Unlike home territory methods, neutral territory methods can achieve a significant degree of load balancing even in the case where the number of simulated particles (as distinct from the number of particle interactions) is smaller than the number of available processors, since neutral territory methods can calculate pairwise interactions within processors that contain no particles. We discuss techniques for load balancing further in a separate paper [20].

• Redundant interactions. To maximize computational efficiency, it is desirable to interact only pairs of particles that lie within a distance R of one another, and to interact each such pair only once. One can eliminate redundant interactions through a process of "filtering," but the filtering process itself imposes an additional computational load. Parallelization methods that reduce the need for filtering will in general improve computational efficiency.

• Communication latency. The amount of time spent on communication depends not only on the amount of data to be communicated but also on the amount of time required to transmit even a minimum-sized packet of information between two processors (communication latency). In some parallel architectures (e.g., a mesh-connected machine), particles in nearby boxes can generally be imported more quickly than those in distant boxes, since the latter may require that a given information packet pass through multiple intermediate processors before arriving at its intended destination [23, 24, 26]. In such cases, methods like the midpoint method [20] that tend to import particles from nearby boxes may have a comparative advantage. Furthermore, some parallelization methods may do a better job than others in balancing the amount of data to be transmitted over different communication links.

• Overlap of communication and computation. The total time required to evaluate near interactions will in general depend on the extent to which communication and computation can be performed simultaneously. Certain k-zone methods lend themselves particularly well to hiding communication time by overlapping it with the required computation. More generally, the calculation of near interactions is often one of several computations to be performed in parallel for a particular simulation. This may present additional opportunities for overlapping computation and communication.

The best parallelization method for a particular range-limited n-body problem on a particular parallel machine will in general depend on a variety of parameters, including the size and shape of the system being simulated, the distribution of particles within the simulated system, the interaction radius required to achieve the level of accuracy required by the problem at hand, the number and speed of the available processors, the topology of the communication network through which they are connected, and the operational characteristics associated with each communication link. Given the number of these considerations and the interactions among them, it is not feasible to formulate a simple, single set of criteria for selecting the best choice of parallelization scheme for all possible applications. The analysis and methods presented in this paper, however, may provide a useful framework for choosing among an expanded range of highly efficient parallelization schemes, and may provide a theoretical foundation for future work in this area.

Acknowledgments

Special thanks to Christine McLeavey for rendering many of the figures in this paper and for providing helpful comments on an earlier draft. We benefited from fruitful discussions with members of the GROMACS [27] development team, particularly Berk Hess, Erik Lindahl, and David van der Spoel.

References

1 M. Snir, A Note on N-Body Computations with Cutoffs, Theor. Comput. Syst. 37, 295 (2004).

2 D. E. Shaw, A Fast, Scalable Method for the Parallel Evaluation of Distance-Limited Pairwise Particle Interactions, J. Comput. Chem. 26, 1318 (2005).

3 R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles (Adam Hilger, Bristol, 1988).

4 E. Bertschinger and J. M. Gelb, Cosmological N-Body Simulation, Computers in Physics 5, 164 (1991).

5 J. M. Thijssen, Computational Physics (Cambridge University Press, 1999).

6 C. L. Brooks III and D. A. Case, Simulations of Peptide Conformational Dynamics and Thermodynamics, Chem. Rev. 93, 2487 (1993).

7 D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications (Academic Press, London, 2001).

8 W. Wang, O. Donini, C. M. Reyes and P. A. Kollman, Biomolecular Simulations: Recent Developments in Force Fields, Simulations of Enzyme Catalysis, Protein-Ligand, Protein-Protein, and Protein-Nucleic Acid Noncovalent Interactions, Annu. Rev. Biophys. Biomol. Struct. 30, 211 (2001).

9 M. Karplus and J. A. McCammon, Molecular Dynamics Simulations of Biomolecules, Nat. Struct. Biol. 9, 646 (2002).

10 J. J. Monaghan, Smoothed Particle Hydrodynamics, Annu. Rev. Astron. Astrophys. 30, 543 (1992).

11 T. Schlick, R. D. Skeel, A. T. Brunger, et al., Algorithmic Challenges in Computational Molecular Biophysics, J. Comput. Phys. 151, 9 (1999).

12 T. Darden, D. York and L. Pedersen, Particle Mesh Ewald: An N·log(N) Method for Ewald Sums in Large Systems, J. Chem. Phys. 98, 10089 (1993).

13 H. Q. Ding, N. Karasawa and W. A. Goddard, The Reduced Cell Multipole Method for Coulomb Interactions in Periodic Systems with Million-Atom Unit Cells, Chem. Phys. Lett. 196, 6 (1992).

14 U. Essmann, L. Perera, M. L. Berkowitz, et al., A Smooth Particle Mesh Ewald Method, J. Chem. Phys. 103, 8577 (1995).

15 L. Greengard and V. Rokhlin, A Fast Algorithm for Particle Simulations, J. Comput. Phys. 73, 325 (1987).

16 C. Sagui and T. Darden, Multigrid Methods for Classical Molecular Dynamics Simulations of Biomolecules, J. Chem. Phys. 114, 6578 (2001).

17 Y. Shan, J. L. Klepeis, M. P. Eastwood, et al., Gaussian Split Ewald: A Fast Ewald Mesh Method for Molecular Simulation, J. Chem. Phys. 122, 054101 (2005).

18 G. S. Heffelfinger, Parallel Atomistic Simulations, Comput. Phys. Commun. 128, 219 (2000).

19 S. Plimpton and B. Hendrickson, A New Parallel Method for Molecular Dynamics Simulation of Macromolecular Systems, J. Comput. Chem. 17, 326 (1996).

20 K. J. Bowers, R. O. Dror and D. E. Shaw, The Midpoint Method for Parallelization of Particle Simulations, under review.

21 P. A. Boyle, C. Jung and T. Wettig, The QCDOC Supercomputer: Hardware, Software, and Performance, International Conference on Computing in High Energy and Nuclear Physics (CHEP) (2003).

22 F. Allen, G. Almasi, W. Andreoni, et al., Blue Gene: A Vision for Protein Science Using a Petaflop Supercomputer, IBM Syst. J. 40, 310 (2001).

23 R. E. Kessler and J. L. Schwarzmeier, Cray T3D: A New Dimension for Cray Research, Proceedings of the IEEE Computer Society International Conference (1993).

24 S. Scott and G. Thomas, The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus, Proceedings of Hot Interconnects IV: A Symposium on High Performance Interconnects, Stanford University, Palo Alto, CA (1996).

25 R. O. Dror, K. J. Bowers and D. E. Shaw, in preparation.

26 N. R. Adiga, G. Almasi, G. S. Almasi, et al., An Overview of the BlueGene/L Supercomputer, Proceedings of the ACM/IEEE Conference on Supercomputing, Baltimore, Maryland, USA (2002).

27 D. van der Spoel, E. Lindahl, B. Hess, et al., GROMACS: Fast, Flexible, and Free, J. Comput. Chem. 26, 1701 (2005).

Appendix

Derivation of an exact lower bound on import volume

We derive a lower bound on the import volume of a zonal method under the following assumptions:

• The method uses boxes with a predetermined volume (V_b) and predetermined aspect ratios.

• The zones do not overlap. Each interaction box has a total of N_z zones, of which one is the box itself. We denote the volume of the import region by V_I and the number of remote zones by N_zr, where N_zr = N_z − 1. Note that a method with two zones whose intersection is the interaction box can be reformulated as a method with three non-overlapping zones.

• A pair of particles in the same box must interact in that box. This implies that only the interaction box zone can interact with itself.

• The method guarantees that any two particles within a distance R of one another interact.

• As in the remainder of this paper, G_x ≥ b_x + 2R, G_y ≥ b_y + 2R, and G_z ≥ b_z + 2R,

implying that the global cell must be partitioned into at least a 2×2×2 mesh of boxes.

In a zonal method, the positions of two particles uniquely determine the processor on which they will interact. We can therefore think of each processor as interacting points in space with one another, whether or not a particle is located at each point. By integrating over all pairs of points that must be interacted, we can express the total quantity of required point-point interactions in units of volume times volume. The influence region of any point q is a sphere of radius R, whose volume we denote by V_IR = (4/3)πR³. The total quantity of required point-point interactions to be computed by the ensemble of p processors is therefore ½ V_gc V_IR, where V_gc is the volume of the global cell and the factor of 1/2 reflects the fact that each pair of points needs to be interacted only once. In a zonal method, each processor is responsible for the same quantity of point-point interactions, so the quantity required of each processor is

I_required = V_gc V_IR / (2p) = ½ V_b V_IR.

Some of these required interactions involve two points within the same box. We denote the quantity of local interactions to be computed in a box B by I_required,local and express it as

I_required,local = ½ ∫_B ∫_B Θ(|q₁ − q₂| < R) dq₂ dq₁,

where Θ is the indicator function: Θ(x) = 1 if x is true, and Θ(x) = 0 otherwise.

The remainder of the required point-point interactions involve points with different home boxes, and are therefore remote interactions. We denote the quantity of remote interactions required of each processor by I_required,remote and compute it as

I_required,remote = I_required − I_required,local.

We can write this expression as

I_required,remote = ½ V_b V_IR,remote,

where

V_IR,remote = V_IR − (2 / V_b) I_required,local.

One can think of V_IR,remote as the average volume of the portion of a point's influence region that lies outside its home box, with the average computed over all points in a box.

Next, we calculate the quantity of remote point-point interactions computed by each processor in a zonal method as

I_actual,remote = ½ Σ_{j≠k} I_jk V_j V_k,

where V_j is the volume of the jth zone and I_jk equals 1 if zones j and k interact and 0 otherwise. (By omitting cases where j = k from this sum, we have omitted the interaction between the interaction box and itself.) If we assume that zone 1 is the interaction box, then

I_actual,remote ≤ V_b V_I + Σ_{2 ≤ j < k ≤ N_z} V_j V_k,

where equality holds if and only if all remote zones interact with all other remote zones. For a fixed number of remote zones and a fixed import volume V_I, one maximizes the quantity Σ_{2 ≤ j < k ≤ N_z} V_j V_k by setting the volumes of the remote zones equal to one another, so

I_actual,remote ≤ V_b V_I + ½ (1 − 1/N_zr) V_I².

The actual quantity of point-point interactions computed by each box must not be less than the required quantity, so I_actual,remote ≥ I_required,remote. Substituting in the previously derived expressions for I_actual,remote and I_required,remote gives

V_b V_I + ½ (1 − 1/N_zr) V_I² ≥ ½ V_b V_IR,remote.

Solving for V_I, we find that

V_I ≥ (N_zr / (N_zr − 1)) [ (V_b² + (1 − 1/N_zr) V_b V_IR,remote)^(1/2) − V_b ].   (4)

A useful approximate lower bound on import volume

The lower bound of Eq. (4) involves V_IR,remote, the average volume of the portion of a point's influence region that lies outside its home box. The expression for V_IR,remote involves a non-analytic integral, and the derivation of the bound provides only limited insight into how to design methods with an import volume near the lower bound. One can obtain a useful approximate lower bound by replacing V_IR,remote in the bound with the volume of a box's interaction neighborhood, defined as the portion of a box's influence region that lies outside the box. (The volume of the interaction neighborhood is simply the volume of the influence region minus the volume of the box.) As the interaction radius becomes large relative to the side lengths of a box, this approximation approaches the exact value of V_IR,remote. For smaller interaction radii, this replacement amounts to raising the lower bound to require that the quantity of remote point-point interactions computed by the method is greater than or equal to that computed by the HS method.

The volume of the interaction neighborhood of a box is

V_n = (4/3)πR³ + πR²(b_x + b_y + b_z) + 2R(b_x b_y + b_x b_z + b_y b_z),

where b_x, b_y, and b_z are the side lengths of the box. Defining the parallelization parameter α_R = R / V_b^(1/3) and the normalized box side lengths β_x = b_x / V_b^(1/3), β_y = b_y / V_b^(1/3), and β_z = b_z / V_b^(1/3), and noting that V_n can be written as

V_n = V_b [ (4/3)πα_R³ + πα_R²(β_x + β_y + β_z) + 2α_R(β_x β_y + β_x β_z + β_y β_z) ],

substituting V_n for V_IR,remote in Eq. (4) gives the approximate lower bound:

V_I ≥ (N_zr / (N_zr − 1)) V_b [ (1 + (1 − 1/N_zr) V_n / V_b)^(1/2) − 1 ].   (2)
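A short script (our own sketch, taking the bound in the quadratic-root form V_I ≥ (N_zr / (N_zr − 1)) ((V_b² + (1 − 1/N_zr) V_b V_n)^(1/2) − V_b), with V_n the interaction-neighborhood volume) makes the bound concrete for a cubic box with α_R = 1:

```python
import math

def neighborhood_volume(bx, by, bz, R):
    """Volume of a box's interaction neighborhood: the influence region
    (the box dilated by R in every direction) minus the box itself."""
    return (4.0 / 3.0) * math.pi * R**3 \
         + math.pi * R**2 * (bx + by + bz) \
         + 2.0 * R * (bx * by + bx * bz + by * bz)

def import_volume_bound(bx, by, bz, R, n_zr):
    """Approximate lower bound on import volume for N_zr remote zones."""
    V_b = bx * by * bz
    V_n = neighborhood_volume(bx, by, bz, R)
    f = 1.0 - 1.0 / n_zr
    return (math.sqrt(V_b**2 + f * V_b * V_n) - V_b) / f

# Cubic box with side b = R (i.e., alpha_R = 1):
b = R = 1.0
hs_import = neighborhood_volume(b, b, b, R)   # HS method imports all of V_n
bound2 = import_volume_bound(b, b, b, R, 2)   # N_zr = 2 bound
print(round(hs_import, 2))   # 19.61
print(round(bound2, 2))      # 4.57
```

The gap between the bound and the HS import volume at this setting illustrates how much room neutral territory methods have to improve on traditional spatial decompositions.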

Footnotes

1 While the base coordinates (c_x, c_y, c_z) of a given box will satisfy c_x ∈ [0, G_x], c_y ∈ [0, G_y], and c_z ∈ [0, G_z], the coordinates of an imported particle may fall outside those ranges, indicating that the particle was imported from a different image of the global cell. When specifying the box on which a pair of particles will interact, we treat corresponding particles in different images of the global cell as separate particles with different coordinates.

2 Variants of the methods described in this paper are also applicable to cases where one wishes to compute the influence of one set of particles on another set of particles, but not vice versa — for example, if one wishes to map charge from atoms to mesh points.

3 These figures are approximate because we have ignored the constraints on box aspect ratios due to the finite number of processors. For example, in order for the boxes to be exactly cubical when the global cell is cubical, the number of processors must be the cube of some integer. In practice, one might choose not to use a few of the available processors in order to obtain more convenient aspect ratios.

Figures and tables appear on the following pages.

Figure 1: Zones associated with each of several previously described parallelization methods. The interaction box and import region are shown for the HS method (top), the NT method (middle), and the SH method (bottom). In each case, one zone consists of the union of the interaction box (purple) and the red region, while the other consists of the union of the interaction box and the blue region. The import region consists of the union of the red and blue regions.

(a) (b)

(C) (d)

Figure 2: Four different two-zone methods for the simplified two-dimensional problem.

The interaction box and import region are shown for analogs of (a) the HS method, (b) the NT method, (c) the SH method, and (d) the foam method. In each case, as in Figure 1, one zone consists of the union of the interaction box (purple) and the red region, while the other consists of the union of the interaction box and the blue region. The import region again consists of the union of the red and blue regions. All figures illustrating two-dimensional methods (Figures 2, 3, 4, 5, 6, and 8) use square boxes with side length b. In this figure, the side length of each box is equal to 1/4 of the interaction radius (R = 4b). The solid line encloses the influence region of the interaction box. In (c), we have numbered the rows of this influence region.

(b)

(C)

Figure 3: Three additional methods that satisfy the convolution criterion. In (a), R = 2b, while in (b) and (c), R = 8b. Colors are assigned as in Figure 2, and the solid line again encloses the influence region of the interaction box. The dashed line in (a) encompasses the coverage region associated with this method.

(a) (b)

(C) (d)

Figure 4: Methods that take advantage of particle-particle interaction symmetry. The methods shown are two-dimensional analogs of (a) the HS method, (b) the NT method, (c) the SH method, and (d) the foam method, with R = 4b. In each case, the dashed line encloses the red-on-blue coverage region of the interaction box, and the solid line encloses its influence region.

Figure 5: A non-voxelized version of the two-dimensional NT analog of Figure 4, for a problem where R = 4.5b. All points in the interaction box have a red-on-blue coverage region that includes the area enclosed by the dashed line, while some points have a red-on-blue coverage region that includes part of the dithered areas. The solid line again encloses the influence region of the interaction box.

(a) (b)

Figure 6: Rounded two-dimensional analogs of (a) the HS method and (b) the SH method, with R = 4b. In each case, the dashed line encloses the red-on-blue coverage region of the interaction box, and the solid line encloses its influence region. As illustrated in (b), one cannot perform valid rounding by simply discarding all parts of the import region that lie outside the influence region.

Figure 7: Zones of several three-dimensional, two-zone parallelization methods. This figure illustrates the import regions and the zones of the SNT (upper left), clouds (upper right), city (lower left) and foam (lower right) methods, with colors assigned as in Figure 1.

Figure 8: Zones of a two-dimensional k-zone method, with R = b. Zone I appears in purple, X in blue, Y in red, and C in green. The solid line encloses the influence region of the interaction box.

Figure 9: Zones of the kZ-NT method. Zone I appears in purple, L in red, U in yellow, and P in blue.

Figure 10: Zones of the ES method. Zone I appears in purple, EX in blue, EY in green, EZ in red, C in orange, FX in beige, and FZ in cyan. Zone FY is hidden from view.

Figure 11: Zones of the kZ-SNT method. Zone I appears in purple, BA in orange, CO in yellow, W in blue, S in white, N in green, and E in red. This k-zone formulation exposes two new rounding opportunities: (1) a portion of E near one of its edges exceeds a distance R from all points in BA and can be eliminated, because E interacts only with BA; and (2) a portion of N near one of its corners exceeds a distance R from all points in CO and W and can be eliminated, because N interacts only with CO and W.

Figure 12: Comparison of the import volume for several methods, with optimized box dimensions. The parallelization parameter α_R increases with increasing number of processors, increasing interaction radius, and decreasing simulated system size. Import volumes for all methods are represented relative to that of the HS method (V_I,HS) at each parallelization parameter setting and plotted on a log axis. The box aspect ratios for each method, as well as the spacing parameter s for the foam method, were optimized at each setting of the parallelization parameter to minimize import volume. The optimization procedure for the foam method involved an analytic approximation, so a fully optimized foam method may have a lower import volume than that shown here. The kZ-SNT method is not included in the figure because it always requires more bandwidth than the kZ-NT method when box aspect ratios are tunable. The approximate lower bounds are those of Eq. (2).

Figure 13: Expanded view of the low-parallelization portion of Figure 12. This figure provides a more detailed view of that portion of the data presented in Figure 12 that corresponds to a relatively small number of processors, and thus a relatively large box size, relative to the interaction radius R. The foam method has been omitted, since it fares poorly throughout this range of parameter values.

(Figure axes: import volume relative to that of the HS method, plotted on a log scale against the parallelization parameter α_R = R / V_b^(1/3).)

Figure 14: Comparison of the import volume for several methods, with boxes constrained to be cubical. The parallelization parameter α_R increases with increasing number of processors, increasing interaction radius, and decreasing simulated system size. Import volumes for all methods are represented relative to that of the HS method (V_I,HS) at each parallelization parameter setting and plotted on a log axis. The spacing parameters for the foam and kZ-SNT methods were optimized at each setting of the parallelization parameter to minimize import volume. The approximate lower bounds are those of Eq. (2).

Figure 15: Expanded view of the low-parallelization portion of Figure 14. The foam and kZ- SNT methods have been omitted, since they fare poorly throughout this range of parameter values.

Table I: Asymptotic scaling properties of communication bandwidth required by various parallelization methods. R is the interaction radius and p is the number of processors.

Columns: exploitable range limitation; scaling with number of processors.

Rows: atom decomposition methods; force decomposition methods; spatial decomposition methods; NT and SH methods.

Table II: Interaction schedule for the two-dimensional k-zone method of Figure 8.

Table III: Interaction schedule for the kZ-NT method. The corresponding zones are illustrated in Figure 9.

Table IV: Interaction schedule for the ES method. The corresponding zones are illustrated in Figure 10.

Table V: Interaction schedule for the kZ-SNT method. The corresponding zones are illustrated in Figure 11.

Appendix C

The Midpoint Method for Parallelization of Particle Simulations

The Midpoint Method for Parallelization of Particle Simulations

Kevin J. Bowers,¹ Ron O. Dror,¹ and David E. Shaw¹,²,³

Abstract: The evaluation of interactions between nearby particles constitutes the majority of the computational workload involved in classical molecular dynamics (MD) simulations. In this paper, we introduce a new method for the parallelization of range-limited particle interactions that proves particularly well suited to MD applications. Because it applies not only to pairwise interactions but also to interactions involving three or more particles, the method can be used for evaluation of both non-bonded and bonded forces in an MD simulation. It requires less inter-processor data transfer than traditional spatial decomposition methods at all but the lowest levels of parallelism. It gains an additional practical advantage in certain commonly used inter-processor communication networks by distributing the communication burden more evenly across network links and by decreasing the associated latency. When used to parallelize MD, it further reduces communication requirements by allowing the computations associated with short-range non-bonded interactions, long-range electrostatics, bonded interactions, and particle migration to use much of the same communicated data. We also introduce certain variants of this method that can significantly improve the balance of computational load across processors.

Key words and phrases: Molecular simulation, molecular dynamics, parallel computing, n-body problem, pairwise particle interaction, multiple-body interactions, load balance

1 D. E. Shaw Research, LLC

2 Center for Computational Biology and Bioinformatics, Columbia University

3 E-mail correspondence: david@deshaw.com

1. Introduction

Molecular dynamics (MD) simulations of biological macromolecules in explicit solvent aim to provide a computational "microscope," yielding insight into biochemical events that are difficult to observe experimentally.¹ Simulations of significantly more than a microsecond, however, require tremendous computational power and will likely only become practical through the use of a large number of processors in parallel.² Such parallelism is limited by inter-processor communication requirements. In this paper, we introduce the midpoint method, a parallelization method that proves particularly useful for MD.

Most of the computational workload in MD is associated with the evaluation of electrostatic and van der Waals forces between all pairs of atoms separated by less than some interaction radius R.³ The explicit evaluation of interactions between pairs of particles separated by less than some maximum distance also constitutes the majority of the computational expense of gravitational simulations in astrophysics, particle simulations in plasma physics, and smooth particle hydrodynamic simulations in fluid dynamics.⁴,⁵ Traditional methods for parallelizing such problems, described in several review papers,⁶ include atom, force, and spatial decomposition methods. Unlike atom and force decomposition methods, spatial decomposition methods offer the desirable property that the amount of data to be transferred into and out of each processor (the method's communication bandwidth) decreases as the interaction radius decreases.

A number of recently introduced methods for parallelizing range-limited particle interactions⁷⁻⁹ require significantly less communication bandwidth than traditional parallelization methods. These novel methods are similar to traditional spatial decompositions in that each processor takes responsibility for updating positions of particles that fall in a certain region of space and in that their communication bandwidth decreases as the interaction radius decreases. Unlike traditional spatial decompositions, however, these new methods sometimes compute the interaction between a pair of particles on a processor on which neither particle resides.⁸ We refer to such techniques as neutral territory methods.

The midpoint method is a neutral territory method that requires less communication bandwidth than any traditional spatial decomposition method at all but extremely low levels of parallelism, where it requires the same amount of communication bandwidth as traditional methods. While certain other neutral territory methods require less communication bandwidth than the midpoint method for pairwise interactions parallelized over a sufficiently large number of processors, the midpoint method offers significant advantages over previously described methods in several respects. It applies not only to pairwise interactions, but also to interactions involving sets of three or more particles. In addition, the method typically incurs a smaller penalty due to communication latency than other methods. It also communicates equal amounts of data in each direction, leading to more effective use of communication links in certain network topologies.

The midpoint method enjoys additional advantages in the context of MD computations. It can be used to parallelize the evaluation of bonded force terms, which commonly involve the interaction of up to four particles, as well as pairwise electrostatic and van der Waals terms. Moreover, it allows several components of the MD computation, including the evaluation of the bonded force terms, to rely on the same data that is communicated for the evaluation of electrostatic and van der Waals forces between pairs of nearby atoms. The midpoint method thus proves a particularly attractive choice for the parallelization of MD computations on up to at least 512 processors. (Throughout this paper, we use the term "processor" to refer to a processing node connected to other such nodes through a communication network; one such "processor" might actually include a number of independent processing units, but we assume that the communication between them is faster than the communication between processing nodes, and thus focus on parallelization across, but not within, the processing nodes.)

We also describe certain variants of the midpoint method, to which we refer as midpoint-ensured methods, that lend themselves particularly well to load balancing. In a midpoint-ensured method, each processor imports the same data as would be imported in the midpoint method, but the assignment of interaction computations to processors is adjusted in such a way that the computational load on each processor is more even.

2. Specification of the Midpoint Method

In the midpoint method, as in traditional spatial decomposition methods and previously described neutral territory methods, each processor assumes responsibility for updating the positions of particles in a distinct region of space, regardless of where the particle interactions are computed. When applied to pairwise interactions, the midpoint method specifies that two particles interact on a particular processor if and only if the midpoint of the segment connecting them falls within the region of space associated with that processor.
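As a concrete illustration, the pairwise assignment rule can be sketched in a few lines of Python. This is our own sketch (the function name is ours), assuming the minimum-image convention of Section 2 so that at most one periodic image of each particle is in range:

```python
def midpoint_box(p1, p2, box_dims, global_dims):
    """Grid index (i, j, k) of the box that computes the interaction of
    two particles under the midpoint method, with periodic wrapping.
    Assumes at most one image of p2 is within range of p1."""
    index = []
    for a, b, s, G in zip(p1, p2, box_dims, global_dims):
        d = b - a
        d -= G * round(d / G)        # minimum-image displacement
        mid = (a + d / 2) % G        # midpoint, wrapped into the global cell
        index.append(int(mid // s))
    return tuple(index)
```

For example, with an 80 × 80 × 80 global cell divided into 10 × 10 × 10 boxes, particles at (1, 1, 1) and (3, 3, 3) interact in box (0, 0, 0); a pair straddling the periodic boundary in x is handled by the minimum-image wrap.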

We refer to the region containing the system to be simulated as the global cell. One can use the midpoint method with a global cell of any shape and with any regular or irregular partition of the global cell into regions assigned to different processors 10 . To simplify our exposition, we will assume in this paper that the global cell is a rectangular parallelepiped of dimensions G_x × G_y × G_z and that it is divided into a regular grid of smaller rectangular parallelepipeds that we call boxes. Each processor updates the coordinates of particles in one box, referred to as the home box of that processor and of those particles. In the interest of simplicity, we will refer interchangeably to a processor and its home box. We refer to particles that do not reside in a particular box as remote to that box. When working in a three-dimensional space, we denote the dimensions of each box by b_x × b_y × b_z, and we refer to the quantities b_x/b_y and b_x/b_z as the box aspect ratios. The base coordinates of a given box and of any particle located within that box are defined as the coordinates of the low-coordinate corner of that box.

We will assume that the global cell tiles an infinite space by repeating in each dimension with a period equal to the side length of the global cell in that dimension. The periodic boundary conditions imposed by this assumption simplify our exposition, but the methods discussed in this paper are also applicable to systems with other boundary conditions. We will also assume for simplicity that G_x ≥ b_x + 2R, G_y ≥ b_y + 2R, and G_z ≥ b_z + 2R, so that at most one image of a given particle interacts with any particle in a given box.

We refer to the box in which a set of particles interact as their interaction box. Figure 1 illustrates the assignment of particle pairs to interaction boxes implied by the midpoint method. Two particles that lie in the same box necessarily interact in that box, but particles that lie in different boxes may interact either in the box in which one of them resides (e.g., particles 2 and 3) or in a box in which neither resides (e.g., particles 1 and 5 or particles 3 and 4).

If two particles are separated by a distance r < R, then the distance from either particle to their midpoint is r/2 < R/2. If the midpoint lies within a particular box, then each particle must lie either within that box or less than a distance R/2 from its boundary. In the midpoint method, the volume of space from which a given processor must "import" particle data that ordinarily resides within other processors (its import region) therefore includes only points within a distance R/2 of its home box. This import region is shown in Figure 2(a) for a two-dimensional system and in Figure 3(a) for a three-dimensional system. In an MD simulation, or any other application that requires computation of the total force on each particle, each interaction box must "export" a force contribution to each of the particles in its import region after it has computed all the interactions assigned to it. For all of the generalized spatial decomposition methods described in this paper, the volume of space to which a given processor must export force contributions is identical to its import region. In applications that require only computation of a global potential energy, force export is not required.
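A minimal sketch of the membership test implied by this paragraph: a point belongs to a box's import region exactly when it lies outside the box but within a distance R/2 of it. The sketch below is non-periodic for clarity, and the function name is ours:

```python
def in_import_region(p, box_lo, box_hi, R):
    """True if point p lies outside the box [box_lo, box_hi] but within a
    distance R/2 of it, i.e., in the box's midpoint-method import region."""
    d2 = 0.0          # squared distance from p to the box
    outside = False
    for x, lo, hi in zip(p, box_lo, box_hi):
        if x < lo:
            d2 += (lo - x) ** 2
            outside = True
        elif x > hi:
            d2 += (x - hi) ** 2
            outside = True
    return outside and d2 < (R / 2) ** 2
```

For a 10 × 10 × 10 box with R = 12, a remote particle 2 units beyond one face is imported, while one 7 units beyond (farther than R/2 = 6) is not.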

Figures 2 and 3 also show the import regions of several previously described methods. The HS (for "Half-Shell") method is an example of a traditional spatial decomposition in which the box that interacts two particles is always the home box of one or both of them. In the HS method, the particles interact within the home box of the particle with the smaller x base coordinate, with ties broken first in favor of the particle with the smaller y base coordinate and then in favor of the particle with the smaller z base coordinate. Figure 3(b) shows the import region of the HS method, and Figure 2(b) shows the import region of a two-dimensional analog of that method.

The NT (for "Neutral Territory") and ES (for "Eighth-Shell") methods, whose import regions are shown in Figure 3(c) and (d), are neutral territory methods that we recently introduced 7-9 . In the NT method, two particles interact within a box that has the x and y base coordinates of one particle and the z base coordinate of the other particle. Specifically, the x and y base coordinates of the interaction box are those of the particle with the smaller x base coordinate, with ties broken first in favor of the particle with the smaller y base coordinate and then in favor of the particle with the larger z base coordinate. The z base coordinate of the interaction box is the z base coordinate of the other particle. In the ES method, a pair of particles interacts within a box whose x base coordinate is that of the particle with the smaller x coordinate, whose y base coordinate is that of the particle with the smaller y coordinate, and whose z base coordinate is that of the particle with the smaller z coordinate.
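The three assignment rules can be summarized in code. The sketch below operates on (x, y, z) base-coordinate tuples and follows the tie-breaking conventions stated above; the function names are ours:

```python
def hs_box(a, b):
    """HS: the home box of the particle whose base coordinates are
    lexicographically smaller (smaller x, then smaller y, then smaller z)."""
    return min(a, b)

def nt_box(a, b):
    """NT: x and y base coordinates from the particle with the smaller x
    base coordinate (ties: smaller y, then larger z); z base coordinate
    from the other particle."""
    first, other = sorted((a, b), key=lambda p: (p[0], p[1], -p[2]))
    return (first[0], first[1], other[2])

def es_box(a, b):
    """ES: componentwise minimum of the two sets of base coordinates."""
    return tuple(min(u, v) for u, v in zip(a, b))
```

Note that for a pair such as (0, 10, 20) and (10, 0, 0), the three methods pick three different interaction boxes, and only the ES result, (0, 0, 0), is a box in which neither particle resides.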

In a previous paper 8 , we described the zonal methods, a class of parallelization algorithms that include both traditional spatial decomposition methods (e.g., the HS method) and certain neutral territory methods (e.g., the NT and ES methods). In a zonal method, one associates with each box a set of regions called zones, where each box has the same spatial relationship with its zones as does each other box. Each box "interacts" certain pairs of its zones by computing the interaction between each particle in one zone and each particle in the other zone. This procedure may lead to some duplicate interactions or to interactions between pairs of particles separated by a distance greater than R, but such redundant interactions can be avoided if each processor computes interactions only for particle pairs that satisfy certain test criteria. The midpoint method for pairwise particle interactions can also be formulated as a zonal method, providing a convenient means to implement the midpoint method. Figure 4 shows a formulation with nine zones for the midpoint method in two dimensions; an analogous formulation in three dimensions would require 27 zones. Alternatively, one might divide the import region and the interaction box into a larger number of zones, which may result in higher performance by reducing wasted computation associated with particle pairs that fail the test criteria and by providing more opportunities to overlap communication time with computation time.

The midpoint method can also handle interactions that involve sets of three or more particles. Suppose that we wish to compute interactions between all sets of m particles that fall within some sphere of radius R/2 (if m = 2, this is equivalent to the statement that we wish to compute interactions between all pairs of particles separated by less than a distance R). In the midpoint method, the interaction between a set of m particles is computed on the processor that contains their midpoint, defined as the center of the smallest sphere that contains all m particles. The import region of the midpoint method in the m-particle case is identical to that for the two-particle case.
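For m > 2 the exact midpoint is the center of the smallest enclosing sphere, which is comparatively expensive to compute. One cheap stand-in, in the spirit of the approximation discussed in the Appendix (whose exact scheme may differ), is the center of the particles' axis-aligned bounding box. A sketch with our own function name:

```python
def approximate_midpoint(particles):
    """Center of the axis-aligned bounding box of a set of particles.
    For m = 2 this equals the exact midpoint; for m > 2 it approximates
    the smallest-enclosing-sphere center at far lower cost."""
    lo = [min(c) for c in zip(*particles)]
    hi = [max(c) for c in zip(*particles)]
    return tuple((a + b) / 2 for a, b in zip(lo, hi))
```

Using any approximation in place of the exact midpoint slightly enlarges the required import region, trading a small amount of extra communication for less computation.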

The import region of the HS method also guarantees that at least one processor will import all the particles necessary to compute the interaction between any set of m particles that fall within a sphere of radius R/2, but this is not the case for previously described neutral territory methods such as the NT and ES methods. The ES method does generalize in a straightforward manner to interactions of m particles; it assigns the interaction of a set of particles to the box whose x base coordinate is that of the particle with the smallest x coordinate, whose y base coordinate is that of the particle with the smallest y coordinate, and whose z base coordinate is that of the particle with the smallest z coordinate. This generalization requires expansion of the import region, however, as discussed further in the Appendix.

In practice, determining the midpoint of three or more particles involves significantly more computation than determining the midpoint of two particles. One can reduce the required amount of computation at the expense of a small amount of additional communication by employing the midpoint-ensured methods introduced in Section 4 or by using an approximation to the midpoint, as described in the Appendix.

3. Performance Characteristics of the Midpoint Method

3.1 Communication bandwidth

Assuming uniform particle density, the amount of particle data that must be transferred into each processor during particle import (and out of each processor during force export, in applications that require force calculation) is proportional to the volume of the import region. We will therefore use the volume of the import region as a measure of communication bandwidth requirements. This volume depends not only on the interaction radius and the volume of the boxes, but also on the box aspect ratios. For the midpoint, HS, and ES methods, one minimizes import volume for a fixed interaction radius and box volume by using cubic boxes. For the NT method, on the other hand, non-cubic boxes typically minimize import volume 9 .

For the parallelization methods discussed in this paper, the shape of the import region depends only on the ratios of the box side lengths to the interaction radius R. In three dimensions, the ratio of the import volume to the box volume (V_i/V_b) for a particular method can be determined uniquely given R/b_x, R/b_y, and R/b_z. Alternatively, V_i/V_b can be expressed as a function of the box aspect ratios b_x/b_y and b_x/b_z and the parallelization parameter α_R, where we define α_R as the geometric mean of R/b_x, R/b_y, and R/b_z:

α_R = (R³ / (b_x b_y b_z))^(1/3)

For cubic boxes of side length b, α_R is simply R/b. The parallelization parameter α_R might be viewed as a measure of the extent to which a particular simulation has been parallelized.
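In code, the definition amounts to a single expression (the function name is ours):

```python
def alpha_R(R, bx, by, bz):
    """Parallelization parameter: the geometric mean of R/bx, R/by, and
    R/bz, i.e., (R**3 / (bx * by * bz)) ** (1/3)."""
    return (R**3 / (bx * by * bz)) ** (1.0 / 3.0)
```

For cubic boxes, alpha_R(R, b, b, b) reduces to R/b; for the 80 Å cell of Section 3.1 on an 8 × 8 × 8 grid with R = 12 Å, it is 1.2.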

Assuming cubic boxes, the import volume for the HS method is

V_i = 3Rb² + (3π/2)R²b + (2π/3)R³,

while that for the midpoint method is

V_i = 3Rb² + (3π/4)R²b + (π/6)R³.

The import volume of the midpoint method is always smaller than that of the HS method, with the difference growing in both relative and absolute terms as α_R grows. The import volume of the ES method for pairwise interactions 8 is identical to that of the midpoint method. To the best of our knowledge, no published method has a smaller import volume than the midpoint method for small α_R (i.e., at low degrees of parallelism) 12 .
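The HS and midpoint import volumes for cubic boxes of side b follow from the face, edge, and corner contributions of the shell surrounding the box: a half shell of thickness R for HS, and a full shell of thickness R/2 for the midpoint method. A sketch as Python functions (names are ours):

```python
import math

def v_import_hs(R, b):
    """HS import volume for cubic boxes: half the shell of thickness R
    surrounding the box (6 faces, 12 quarter-cylinder edges, 8
    eighth-sphere corners, all halved)."""
    return 3 * R * b**2 + 1.5 * math.pi * R**2 * b + (2 * math.pi / 3) * R**3

def v_import_midpoint(R, b):
    """Midpoint import volume for cubic boxes: the full shell of
    thickness R/2 surrounding the box."""
    return 3 * R * b**2 + 0.75 * math.pi * R**2 * b + (math.pi / 6) * R**3
```

The difference, (3π/4)R²b + (π/2)R³, is positive for all R, b > 0, so the midpoint import volume is always the smaller of the two, and the gap widens as R grows relative to b.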

Asymptotically, the import volumes of the midpoint, ES, and HS methods all grow with the cube of α_R, while the import volumes of certain other neutral territory methods, including the NT method, grow as α_R^(3/2) 8,9,13 . Thus for large values of α_R, the NT method has a lower import volume than the midpoint or ES methods. More specifically, the NT method has a lower import volume than the midpoint method when α_R > 0.82, assuming box aspect ratios are chosen to minimize import volume for each method.

Figure 5 shows the import volumes of the various parallelization methods as a function of the number of processors for a system with a cubic global cell measuring 80 Å on a side and an interaction radius of 12 Å. These parameters are within the range that might be typical for computation of explicit pairwise electrostatic and van der Waals forces in an MD simulation of a biomolecular system; at a typical density of 0.1 atoms/Å³, such a system would contain about 51,000 atoms. The midpoint and ES methods have the lowest import volume of the methods shown for up to 160 processors, with the NT method achieving a lower import volume for higher numbers of processors. The kZ-NT method, a variant of the NT method that we described in a recent paper 8 , has a slightly lower import volume than any of the methods shown here for more than 63 processors. As discussed in the remainder of the paper, however, the midpoint method has a number of advantages over the competing methods that are not captured by import volume.

The one situation in which the midpoint method requires the same import volume as traditional spatial decompositions, rather than a smaller one, is when the global cell is partitioned into boxes along only one dimension — that is, when the grid of boxes has dimensions 1 × 1 × n, 1 × n × 1, or n × 1 × 1. Such a situation violates our assumption that G_x ≥ b_x + 2R, G_y ≥ b_y + 2R, and G_z ≥ b_z + 2R (Section 2); the midpoint method still applies in this case, as do traditional spatial decompositions, but their import volumes are different from those computed above, as communication is only required along one dimension. A one-dimensional partition of the global cell minimizes bandwidth only at very low levels of parallelism. In our example system above, such an approach has minimal import volume only when using fewer than six processors. The midpoint method also applies when the global cell is partitioned along two out of three dimensions (e.g., a grid of boxes with dimensions 1 × n_1 × n_2, where n_1 ≥ 2 and n_2 ≥ 2); in this situation, it always has a lower import volume than traditional spatial decompositions.

Thus far we have concentrated our analysis on import volume, which serves as an accurate proxy for per-processor communication bandwidth when particles are distributed uniformly and when a single interaction radius R is used throughout the simulation, as is generally the case for explicit solvent molecular dynamics simulations. When these assumptions do not hold, different processors may need to import different amounts of data, which will negatively impact the scalability of any of the methods discussed in this paper.

In a system with multiple communication links, communication will typically be most efficient if the communication load is spread evenly over all the links. The symmetry of the midpoint method's import region therefore proves advantageous for certain communication network topologies such as the three-dimensional toroidal mesh used in IBM's BlueGene/L, the Sandia/Cray Red Storm, Cray's T3D, T3E and XT3, and some Linux clusters 15 . In this topology each processor is connected to the processors adjacent to it in each of the six cardinal directions (+x, −x, +y, −y, +z, −z). For three-dimensional simulations, the midpoint, HS, ES, and NT methods all map naturally to this topology, but the midpoint method has an advantage over the others because it transfers an approximately equal amount of data in each of the six directions. The ES method, on the other hand, communicates data only in the −x, −y, and −z directions during particle import, so while the ES and midpoint methods have the same import volumes, the maximum bandwidth per communication link for the ES method will be roughly double that for the midpoint method in such an architecture. The import procedures of the NT and HS methods communicate data in five of the six cardinal directions, but they send nearly twice as much data in the −x direction as in any other direction.

3.2 Communication latency

For certain network topologies and protocols, the midpoint method gains a further advantage from the fact that the maximum distance of a point in the import region from the interaction box boundary is smaller for the midpoint method than for other published methods. This is perhaps most obvious for a toroidal mesh network, where the communication latency associated with a pair of processors (a fixed time interval associated with the overhead entailed in transferring a packet of data from one processor to another, independent of the amount of data contained in that packet) increases with the hop count, the number of inter-processor connections a message must traverse to get from one processor to the other. In such a network, the maximum hop count and therefore the maximum communication latency associated with particle import under the midpoint method is less than or equal to that under the HS or ES methods. At high levels of parallelism, the hop count of the midpoint method will be approximately half that of the HS and ES methods. For example, consider a three-dimensional case with cubic boxes and with α_R = 2 (i.e., R = 2b, where b is the box side length). The midpoint method requires each interaction box to import data from the boxes with which it shares a face, edge, or corner. Communication between boxes that share a face requires one hop, communication between boxes that share an edge but no face requires two hops, and communication between boxes that share only a corner requires three hops, so the maximum hop count for the midpoint method in this case is three. The HS and ES methods, on the other hand, have a maximum hop count of six in this case, because these methods require the import of data from more distant boxes in every direction. The maximum hop count for the NT method in this case is four, whether one constrains the boxes to be cubic or optimizes their aspect ratios to minimize import volume. At low levels of parallelism, the maximum hop count of the NT method may be less than that of the midpoint method; in particular, if R were less than any box side length, the maximum hop count of the NT method would be two while that of the midpoint method would still be three. The maximum hop count of the NT method is always greater than or equal to

that of the midpoint method when α_R > √2, or when boxes are cubic and α_R > 1.
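The hop counts in the worked example follow from a simple rule: on a toroidal mesh, a box must reach up to ceil(reach/b) boxes away along each dimension, where the reach is R/2 for the midpoint method and R for the HS and ES methods, and the worst case is a corner box reached via hops in all three dimensions. A sketch (our own helper name):

```python
import math

def max_hops(reach, b, dims=3):
    """Worst-case hop count to the farthest imported box on a toroidal
    mesh: ceil(reach / b) hops along each of `dims` dimensions."""
    return dims * math.ceil(reach / b)

R, b = 2.0, 1.0                      # alpha_R = 2 with cubic boxes
midpoint_hops = max_hops(R / 2, b)   # 3, as in the example above
hs_es_hops = max_hops(R, b)          # 6, as in the example above
```

This counting does not capture the NT method's asymmetric import region, whose maximum hop count in the same example is four.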

Many communication protocols, including some common implementations of MPI, require the use of blocking communication, meaning that once a processor sends a data message, it must wait until that message has been received to send more data or to perform computation. In such a system, the total number of send/receive cycles per processor is often the primary determinant of total communication time, regardless of network topology. Whenever α_R > 1, the midpoint method will require fewer such cycles than the NT and HS methods and the same number of cycles as the ES method, whether box aspect ratio is constrained to be cubic or adjusted to minimize import volume. Consider again the three-dimensional case with cubic boxes and α_R = 2 (i.e., R = 2b). The midpoint method requires that each processor import data from its 26 nearest neighbors. We can accomplish all communication necessary for particle import in six send/receive cycles by first sending all data that must move in the +x direction one step in that direction, then sending all data that must move in the −x direction one step in that direction, and then performing similar send operations for the +y, −y, +z, and −z directions. The ES method also requires six send/receive cycles, because data must move up to two steps in each of the −x, −y, and −z directions. The NT and HS methods require ten send/receive cycles, however, because data must move up to two steps in each of the −x, +y, −y, +z, and −z directions.
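The cycle counts in this example can be tallied directly: data that must travel k boxes along a given direction is forwarded once per cycle, for k cycles in that direction. A sketch of the bookkeeping for α_R = 2 (the direction labels and helper name are ours):

```python
def total_cycles(steps_by_direction):
    """Blocking send/receive cycles for a staged exchange: data moving k
    boxes along a direction is forwarded once per cycle, k cycles total."""
    return sum(steps_by_direction.values())

# alpha_R = 2 (R = 2b), cubic boxes:
midpoint = total_cycles({'+x': 1, '-x': 1, '+y': 1, '-y': 1, '+z': 1, '-z': 1})
es       = total_cycles({'-x': 2, '-y': 2, '-z': 2})
nt_hs    = total_cycles({'-x': 2, '+y': 2, '-y': 2, '+z': 2, '-z': 2})
```

The tallies come out to 6 cycles for the midpoint and ES methods and 10 for the NT and HS methods, matching the counts in the text.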

3.3 Further advantages for molecular dynamics simulations

The midpoint method proves particularly well-suited for parallelization of MD simulations because it allows several different components of the computation to utilize the same communicated particle data, thereby resulting in less total communication time than if each component were parallelized separately.

The total force on a particle in most common biomolecular force fields, such as CHARMM 16 , AMBER 17 , OPLS-AA 18 , GROMOS 19 , and MMFF 20 , is expressed as a sum of bonded force terms, which depend on the covalent bond structure of the molecules, and non-bonded force terms, which comprise electrostatic and van der Waals interactions between all pairs of particles in the system. Bonded force terms include bond length terms, involving two particles connected by a bond; bond angle terms, involving three particles, two of which are bonded to a third; and dihedral angle (torsional) terms, involving four particles connected by three bonds. We denote by r_bonded the maximum radius of a sphere required to enclose the particles involved in any one bonded force term. In common biomolecular force fields, r_bonded is no larger than 4 Å.

The non-bonded force terms involve all pairs of particles in the system and therefore represent a much larger computational burden than the bonded force terms. Van der Waals forces fall off sufficiently quickly with distance that they can typically be neglected for pairs of particles separated by more than some cutoff distance d_non-bonded, typically chosen between 9 and 12 Å. An ever-mounting body of evidence shows that neglecting electrostatic interactions beyond a cutoff is inadequate for explicit solvent simulations 21 , so electrostatic forces are typically computed by one of several efficient, approximate methods that take long-range interactions into account without requiring the explicit interaction of all pairs of particles. The most common of these are mesh-based Ewald methods such as particle-particle particle-mesh (PPPM) 4 , particle mesh Ewald (PME) 22 , lattice Gaussian multigrid (LGM) 23 , and Gaussian split Ewald (GSE) 24 , which use the fast Fourier transform (FFT) or a multigrid solver to compute the electrostatic potential on a mesh given a charge distribution on that mesh. These methods require that modified electrostatic interactions be computed explicitly within some cutoff radius; we will assume that this cutoff radius is chosen to be identical to the van der Waals cutoff distance d_non-bonded, as is typically the case. These methods also require that charge be mapped from particles to nearby mesh points before the FFT or multigrid computation, and that forces on particles be calculated from potentials at nearby mesh points afterwards. In the charge spreading computation, the charge on each mesh point is determined as a sum over nearby particles, while in the force interpolation computation, the force on each particle is computed as a sum over nearby mesh points. We denote by d_spreading the maximum distance between any particle and any mesh point that "interact" in either charge spreading or force interpolation. The value of d_spreading depends on the parameters of the long-range electrostatics method employed, but typical values are around 5 Å.

When one parallelizes the computation of the explicit non-bonded force terms using the midpoint method, each processor must import data for particles lying within a distance of r_non-bonded = d_non-bonded/2 from the boundary of its home box. Once it has completed computation of pairwise interactions, it must export a force contribution for each of these particles to the respective particle's home box. Because d_non-bonded typically lies between 9 Å and 12 Å, r_non-bonded typically lies between 4.5 Å and 6 Å.

If d_spreading ≤ r_non-bonded, then no additional communication is required to support charge spreading and force interpolation. The import requirements of the explicit non-bonded force computation ensure that each processor will import all remote particles within a distance r_non-bonded of any mesh point lying in the home box. Likewise, the export requirements of the explicit non-bonded force computation ensure that each processor will export force contributions to this set of remote particles. The force contribution on each remote particle due to mesh points in the home box can be combined with the corresponding explicit non-bonded force contribution before it is exported. In MD runs using common biomolecular force fields, d_spreading is usually approximately equal to r_non-bonded. Parameters of the mesh-based Ewald method employed can be tuned to guarantee that d_spreading ≤ r_non-bonded; alternatively, the import region can be expanded slightly to include all particles within a distance d_spreading of the home box boundary.

The computation of bonded force terms can also be parallelized using the midpoint method, which specifies that each bonded term be computed by the processor that contains the midpoint of the particles involved in that term. If r_bonded ≤ r_non-bonded — as is typically the case for MD simulations using common biomolecular force fields — then no additional communication is required to support computation of the bonded terms; the import requirements of the explicit non-bonded force computation ensure that data for each particle will be available on any processor that needs it for bonded force term computation, while the export requirements ensure that a force contribution will be exported back to the processor on which the particle resides.

In an MD simulation parallelized via the midpoint method, particles migrate from one box to another as they move. As long as the maximum distance d_motion that a particle can move in a time step — typically under 0.1 Å — is less than r_non-bonded, no extra communication will be required to migrate the particle position data. This data will be communicated as part of the import data required by the midpoint method for explicit non-bonded interactions (in certain cases this data will need to be imported from more boxes than if particle migration had been performed separately; because migration at the end of a time step is delayed to coincide with particle import during the following time step, data for particles within a distance r_non-bonded of a given interaction box may still reside on the home boxes of points within a distance r_non-bonded + d_motion away when import is performed). If one uses the NT, HS, or ES methods, on the other hand, particle migration will require additional communication of particle positions, because proximity of a particle to an interaction box does not guarantee that the particle will be imported by that interaction box. In some molecular dynamics implementations, particle migration may also require communication of particle state variables other than position, such as velocity, but the midpoint method will still require less total communication for particle migration than the NT, HS, or ES methods. Moreover, in the midpoint method, any additional data required for migration can be included in the same communication step as the particle position import, avoiding the need for an additional round of communication.

Consider an MD simulation with a cubic global cell measuring 80 Å on a side, running on a system with 512 processors in an 8 × 8 × 8 configuration. Suppose that d_non-bonded = 12 Å, r_bonded = 4.5 Å, d_spreading = 5 Å, and d_motion = 0.1 Å, all typical values for MD simulations of biomolecular systems. Using the midpoint method to parallelize computation of the explicit non-bonded interactions would require an import volume of 7.9 nm³. Assuming that the bonded force calculation is also parallelized using the midpoint method, no additional communication is required for bonded force computation, charge spreading, force interpolation, or particle position migration.

Using the HS method to parallelize the explicit non-bonded force computation would require an import volume of 14.0 nm³. The import region of the HS method does not ensure that all particles within a distance d_spreading of a particular box are imported onto that box, so each processor will need to import particles from an additional 2.9 nm³ region to support charge spreading, for a combined import volume of 16.9 nm³. Traditional parallelization strategies for bonded forces, in which each term is computed either on a fixed processor or on the processor containing a specific particle, will require additional communication, as will particle position migration.
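The midpoint and HS import volumes quoted in this example can be checked numerically from the shell geometry around a cubic box (faces, quarter-cylinder edges, and eighth-sphere corners); `shell` is our own helper:

```python
import math

def shell(b, t):
    """Volume of the region within distance t of a b x b x b box,
    excluding the box itself: 6 face slabs, 12 quarter-cylinder edges,
    and 8 eighth-sphere corners."""
    return 6 * b**2 * t + 3 * math.pi * t**2 * b + (4 * math.pi / 3) * t**3

R, b = 12.0, 80.0 / 8            # 12 A cutoff; 80 A cell on an 8 x 8 x 8 grid
v_mid = shell(b, R / 2)          # midpoint: full shell of thickness R/2
v_hs = shell(b, R) / 2           # HS: half shell of thickness R
print(round(v_mid / 1000, 1), round(v_hs / 1000, 1))   # 7.9 14.0 (nm^3)
```

The two printed values reproduce the 7.9 nm³ and 14.0 nm³ figures used in this comparison.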

Using the NT method to parallelize the explicit non-bonded force computation would require an import volume of 7.1 nm³. Importing all particles within a distance d_spreading of each interaction box, in order to support charge spreading, increases the combined import volume to 10.1 nm³. One could reduce communication requirements slightly by applying the NT method to charge spreading (where the x and y base coordinates of the interaction box are those of the particle while the z base coordinate is that of the mesh point). Some of the communication will then take the form of charges on mesh points, but if we assume that the amount of data associated with mesh points in a volume of space is comparable with the amount of data associated with particles in the same volume, then the amount of communication required is equivalent to that associated with a combined import volume of 9.8 nm³. Optimization of the box aspect ratios can reduce the equivalent import volume to 9.2 nm³, but this requires that one either adjust the aspect ratios of the global cell or use a grid of boxes with dimensions other than 8 × 8 × 8. Even then, the equivalent import volume is greater than the 7.9 nm³ required by the midpoint method, and the parallelization strategy based on the NT method, unlike one based on the midpoint method, will require additional communication for computation of bonded forces and for particle position migration.

4. Midpoint-Ensured Methods with Improved Computational Load Balance

We describe a particle as being available to a processor if the particle resides in the home box or import region of that processor. If multiple particles are all available to the same processor, we describe them as being coavailable to that processor. The import region of the midpoint method ensures that any set of particles that can be enclosed by a sphere of radius R/2 will be coavailable to at least one processor (the one whose home box contains their midpoint), but frequently, the particles will also be coavailable to one or more additional processors. In Figure 1, for example, particles 1 and 2 interact in the home box of particle 1 (the box labeled b), but these two particles are also coavailable in the boxes labeled a, d, and e. We can take advantage of the fact that sets of interacting particles are frequently coavailable on more than one processor to improve computational load balance among processors. We refer to a parallelization method whose import region includes that of the midpoint method as a midpoint-ensured method, regardless of how it assigns particle interactions to processors, because the fact that the box containing the midpoint of a set of interacting particles is guaranteed to house or import all of them ensures that at least one processor will be able to compute each required interaction.

In order to improve load balance, midpoint-ensured methods can assign interactions to processors based on the spatial distribution of particles. The more information processors share with one another about the particle distribution, the more effectively they can balance computational load. The relative performance of various midpoint-ensured methods depends on the relative costs of communication and computation, on the interaction radius and box size, and on prior statistical knowledge about the spatial distribution of particles. We leave the full exploration of the space of possible midpoint-ensured methods as an area for further work, but we present as an example one particular midpoint-ensured method that can significantly improve load balance with only minimal additional communication requirements.

For expository simplicity, we first describe this method for a one-dimensional space where each "box" is simply an interval of length b and where a set of m particles interact if they all fall within some sphere of radius R/2 (i.e., within some line segment of length R). We also assume that R is no greater than b. We number the processors such that the home box of processor i is the ith box from the left. The computation proceeds in the following manner:

Each processor imports all remote particles within a distance R/2 of its home box.

Each processor finds its set of locally computable interactions, consisting of all combinations of m particles available to that processor that fall within a sphere of radius R/2. Each processor i divides its set of locally computable interactions into three nonoverlapping subsets: interactions that are also computable on its left neighbor (processor i-1), interactions that are also computable on its right neighbor (processor i+1), and interactions that are only computable on processor i. We denote these subsets L_i, R_i, and C_i, respectively, and we denote their sizes by l_i, r_i, and c_i. R_i and L_{i+1} contain exactly the same elements, namely all combinations of m particles that lie within a distance R/2 of the boundary between box i and box i+1, so r_i = l_{i+1}.
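For m = 2, this three-way partition can be sketched as follows (our illustrative code, assuming 1-D coordinates and R <= b; the input pairs are assumed to already be locally computable on processor i):

```python
def partition_interactions(pairs, i, b, R):
    """Split processor i's locally computable pairs (1-D home box
    [i*b, (i+1)*b)) into the sets of the text: pairs also computable on
    the left neighbor, on neither neighbor, or on the right neighbor.
    With R <= b, no pair can be shared with both neighbors at once."""
    left_edge, right_edge = i * b, (i + 1) * b
    L, C, R_ = [], [], []
    for x, y in pairs:
        if max(x, y) <= left_edge + R / 2:     # both within R/2 of left boundary
            L.append((x, y))
        elif min(x, y) >= right_edge - R / 2:  # both within R/2 of right boundary
            R_.append((x, y))
        else:                                  # computable only on processor i
            C.append((x, y))
    return L, C, R_

# Box 1 of width 10 spans [10, 20); R = 4.
print(partition_interactions([(10.5, 11.0), (19.0, 19.5), (14.0, 15.0)],
                             1, 10.0, 4.0))
```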

Each processor i sends the value of c_i to its left and right neighbors. This is the only communication required by this parallelization method beyond that required by the midpoint method. Each processor i now has the values of c_{i-1} and c_{i+1} as well as c_i, l_i, and r_i.

Each processor i determines which of the interactions in L_i and which of the interactions in R_i it will compute, and which of these interactions it will leave to its neighbors to compute. In order to ensure that each interaction is in fact computed somewhere, neighboring processors must make consistent decisions about the division of labor between them. Processors i and i+1 both have access to the values of c_i, c_{i+1}, and r_i (because r_i = l_{i+1}). Processor i will compute the first k_i interactions in R_i, while processor i+1 will compute the last r_i - k_i interactions,

where processors i and i+1 independently compute identical values for k_i based on c_i, c_{i+1}, and r_i. To avoid missing or duplicating computation of any interaction, processors i and i+1 must agree on the ordering of the equivalent sets R_i and L_{i+1}; such a consistent ordering might be based on the positions of the particles involved in each interaction, with ties broken by other particle characteristics such as charge. In the absence of prior statistical information about the spatial distribution of particles, each processor computes k_i as

k_i = round( max( 0, min( r_i, (r_i + c_{i+1} - c_i) / 2 ) ) ).     (1)

That is, if c_{i+1} and c_i are identical, then we assume that processors i and i+1 have a comparable load, and we therefore split the interactions in R_i evenly between them. If c_{i+1} > c_i, then we assign more than half the interactions in R_i to processor i, in an attempt to balance the load. The max, min, and round operations in Eq. (1) are necessitated by the fact that k_i can only take on integer values between 0 and r_i. In the absence of such restrictions, distributing interactions according to

the formula k_i = (r_i + c_{i+1} - c_i)/2 for all i would result in the assignment of an identical number of interactions to processors i and i+1 if c_{i-1} and c_{i+2} were both equal to the average of c_i and c_{i+1}. This average represents a reasonable estimate of c_{i-1} and c_{i+2} based on information to which processors i and i+1 both have access. Better load balancing could be achieved if both processors had access to c_{i-1} and c_{i+2}, or if processors had some prior statistical information about the particle distribution (e.g., the average particle density).
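The split rule above can be sketched as follows (our reading of Eq. (1): the unclamped target (r_i + c_{i+1} - c_i)/2, clamped to [0, r_i] and rounded; names are illustrative):

```python
def split_k(r_i, c_i, c_ip1):
    """Number of the r_i boundary interactions in R_i computed by
    processor i; processor i+1 computes the remaining r_i - k. Both
    processors must apply the same rounding rule (here Python's
    half-to-even round) so that they reach identical values
    independently."""
    return int(round(max(0.0, min(float(r_i), (r_i + c_ip1 - c_i) / 2.0))))

print(split_k(10, 50, 50))  # equal private loads -> even split: 5
print(split_k(10, 48, 52))  # busier right neighbor -> i takes more: 7
print(split_k(10, 40, 60))  # clamped at r_i: 10
print(split_k(10, 60, 40))  # clamped at 0: 0
```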

Generalizations of this method to higher dimensions can take several forms. A computationally efficient and easily implemented generalization is available if we expand the import region slightly such that its outer boundary is a rectangular parallelepiped extending a distance R/2 beyond the interaction box in the positive and negative directions along each coordinate axis. We describe this generalization in the two-dimensional case, where the outer boundary of the import region will be a rectangle (a square if the interaction box is a square). Assuming that the side lengths of the interaction box are all larger than the interaction radius R, an individual interaction may be computable on four boxes that share a common corner, on two boxes that share a common edge, or only on a single box. Our strategy is first to assign each interaction that can be computed by two or four boxes in neighboring columns to a unique column, and then to assign interactions that can be computed by boxes in neighboring rows to a unique row. We perform these assignments by repeated application of a procedure similar to that used to split R_i in the one-dimensional case, with appropriate values substituted for r_i, c_i, and c_{i+1} in Eq. (1).

More specifically, we assume that processor (i,j) is in the ith row and the jth column. After importing particles in its import region, processor (i,j) divides its set of locally computable interactions into nine subsets, based on which other processors can compute them. We denote these subsets by S^{a,b}_{i,j}, where a and b can each take on the values -1, 0, or 1. The value of a indicates whether the interactions in S^{a,b}_{i,j} can be computed by processor (i-1,j) (in which case a = -1), by processor (i+1,j) (in which case a = 1), or by neither (in which case a = 0). The value of b indicates whether the interactions can be computed by processor (i,j-1) (in which case b = -1), by processor (i,j+1) (in which case b = 1), or by neither (in which case b = 0). The shape of the import region implies that an interaction is in S^{1,1}_{i,j} if and only if it is computable in both processors (i,j) and (i+1,j+1) (this would not be the case if the import volume were restricted to that of the midpoint method). Neighboring processors agree on their sets of shared elements; for example, S^{a,1}_{i,j} contains the same elements as S^{a,-1}_{i,j+1}. We denote the size of S^{a,b}_{i,j} by s^{a,b}_{i,j}.

Each processor (i,j) sends the values s^{a,0}_{i,j} (for a = -1, 0, 1) to processors (i,j-1) and (i,j+1). For each a in {-1, 0, 1}, we assign the first k^a_{i,j} interactions in S^{a,1}_{i,j} to processors in column j and the last s^{a,1}_{i,j} - k^a_{i,j} to processors in column j+1, where k^a_{i,j} is computed by Eq. (1) with s^{a,1}_{i,j} substituted for r_i, s^{a,0}_{i,j} substituted for c_i, and s^{a,0}_{i,j+1} substituted for c_{i+1}. Processor (i,j+1) independently computes identical column assignments for all interactions in S^{0,1}_{i,j}. Processors (i-1,j) and (i-1,j+1) independently compute identical column assignments for interactions in S^{-1,1}_{i,j}, and processors (i+1,j) and (i+1,j+1) independently compute identical column assignments for interactions in S^{1,1}_{i,j}. At this point, all interactions have been uniquely assigned to a column of boxes. Interactions are then distributed between boxes within each column using a procedure identical to that described previously for the one-dimensional case.
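The column-assignment pass can be sketched as follows (our illustrative code; the dictionaries of subset sizes and all names are assumptions, not the paper's notation):

```python
def split_k(r, c_self, c_nbr):
    """Eq. (1)-style split of r shared interactions between a processor
    (which takes k) and its neighbor (which takes r - k)."""
    return int(round(max(0.0, min(float(r), (r + c_nbr - c_self) / 2.0))))

def column_split(s_self, s_right):
    """For horizontally adjacent processors (i, j) and (i, j+1), decide,
    for each row offset a in {-1, 0, 1}, how many of the s_self[(a, 1)]
    interactions shared across the column boundary stay in column j.
    s_self[(a, b)] is the size of subset S^{a,b}_{i,j}; only the b = 0
    counts of the neighbor need to be communicated."""
    return {a: split_k(s_self[(a, 1)], s_self[(a, 0)], s_right[(a, 0)])
            for a in (-1, 0, 1)}

# Hypothetical subset sizes for two horizontally adjacent processors:
s_ij = {(-1, 1): 2, (0, 1): 10, (1, 1): 2,
        (-1, 0): 4, (0, 0): 30, (1, 0): 6}
s_ij1 = {(-1, 0): 4, (0, 0): 20, (1, 0): 6}
print(column_split(s_ij, s_ij1))  # -> {-1: 1, 0: 0, 1: 1}
```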

Figure 6 shows the results of applying this algorithm to a three-dimensional system representing a protein solvated in explicit water. We determined particle positions by placing the reported crystal structure of the 213-residue protein Catechol O-Methyltransferase (PDB code 1vid) 25 in a water bath (neutralized with the appropriate number of counterions), with interactions modeled using the OPLS-AA force field 18 and the SPC model for water 26 . A representative frame was generated by subjecting the solvated system to minimization followed by a 50-picosecond MD simulation at 300 K for equilibration. The entire chemical system contains 50,846 atoms in a cubic global cell measuring 80.0 Å per side. Figure 6(a) shows the number of atom pair interactions assigned to each processor using the midpoint method, assuming 64 processors in a 4 x 4 x 4 configuration and an interaction radius of R = 12 Å. Figure 6(b) shows the number of interactions assigned to each processor using the midpoint-ensured method we have described. The maximum number of interactions assigned to a processor by the midpoint method exceeds the mean by 11.2%. With the midpoint-ensured method, the maximum exceeds the mean by only 3.4%. In this example, the import region of the midpoint-ensured method is larger than that of the midpoint method by 12.1%. One could formulate multi-dimensional midpoint-ensured methods that would require only the import volume of the midpoint method, at the cost of a more complex algorithm for assigning interactions to processors.

As we have noted previously 8 , neutral territory methods enjoy improved load balancing properties relative to traditional spatial decompositions when the number of processors is close to or larger than the number of particles in the system, because neutral territory

methods can compute interactions on processors whose home boxes contain no particles. At high levels of parallelism, the midpoint method, the ES method, and the NT method will all result in better load balance than the HS method. At the level of parallelism in our example, however, the differences are negligible; the maximum number of interactions assigned to a processor exceeds the mean by between 9% and 13% for all four methods.

One can generalize the midpoint-ensured method described above in a straightforward manner to the case where R is greater than the smallest side length of the box by requiring that an interaction between a set of particles always be computed either in the box containing their midpoint or in a neighboring box (i.e., a box that shares at least a corner with it). Other midpoint-ensured methods that remove this restriction can lead to further improvements in load balance at the cost of some additional communication or complexity. In some situations, generalizations beyond the scope of this paper may be required to achieve satisfactory load balance. For example, a midpoint-ensured method may take into account the fact that, in some simulations, certain interactions are more computationally expensive than others. Likewise, to compensate for coarse-scale variations in particle density, one might use a midpoint-ensured method where the regions assigned to different processors are of different shapes and sizes.

5. Conclusions

Unlike previously described neutral territory methods, the midpoint method applies to interactions involving more than two particles. It requires less communication bandwidth

than traditional spatial decompositions whenever the global cell is partitioned into boxes along more than one dimension. The midpoint method also offers several other advantages over both traditional spatial decomposition methods and previously described neutral territory methods, including a more even distribution of bandwidth across communication links, a lower maximum hop count in communication networks with a toroidal mesh topology, and a smaller required number of send/receive cycles in networks using a blocking communication protocol. The midpoint method proves applicable to problems in which each interaction involves multiple particles, with no required increase in import volume. Variants of the midpoint method known as midpoint-ensured methods allow for improved computational load balance among processors relative to both the midpoint method and traditional methods.

The midpoint method proves particularly effective for parallelization of MD simulations using common biomolecular force fields. It applies to the evaluation of both bonded force terms and explicit pairwise nonbonded force terms. Moreover, its data communication requirements are such that the communication that supports pairwise computation of nonbonded forces also suffices for bonded force computation, for particle position migration, and for the charge spreading and force interpolation computations associated with the efficient evaluation of long-range electrostatic interactions. At very high degrees of parallelism, certain other neutral territory methods require asymptotically lower communication bandwidth than the midpoint method 8,9,13 . In the course of implementing a parallel molecular dynamics package for a commodity Linux cluster, however, we have been able to obtain better performance with the midpoint method than

with either traditional methods or previously reported neutral territory methods for the parallelization of moderately sized systems (~50,000 atoms) up to at least 512 processing nodes; details of this package will be reported in a future paper.

Acknowledgments

Thanks to Christine McLeavey for rendering many of the figures in this paper and to John Klepeis for preparing the chemical system used as an example to assess load balancing properties of different parallelization methods (Figure 6). We have benefited from fruitful discussions with Mike Eastwood and with members of the GROMACS 3 development team, particularly Berk Hess, Erik Lindahl, and David van der Spoel.

Appendix

One disadvantage of the midpoint method for interactions that involve three or more particles is that determining the center of the smallest sphere containing m particles requires a nontrivial amount of computation when m ≥ 3. One alternative is to use midpoint-ensured methods that do not require computation of a midpoint. Another is to use an efficient approximation for the midpoint. Two such approximations are:

1. For each of the coordinate dimensions, approximate the coordinate of the midpoint as the arithmetic mean of the coordinates of the particles.

2. For each of the coordinate dimensions, approximate the coordinate of the midpoint as the arithmetic mean of the maximum and the minimum of the coordinates of the particles.
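Both approximations are cheap to evaluate; a minimal sketch (our code, with illustrative names):

```python
def approx_midpoint_mean(points):
    """Approximation 1: coordinate-wise arithmetic mean of the particles."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def approx_midpoint_minmax(points):
    """Approximation 2: coordinate-wise mean of the minimum and maximum
    coordinates (the center of the axis-aligned bounding box)."""
    return tuple((min(c) + max(c)) / 2.0 for c in zip(*points))

pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
print(approx_midpoint_mean(pts))    # -> (0.666..., 0.666...)
print(approx_midpoint_minmax(pts))  # -> (0.5, 1.0)
```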

When using either approximation, one needs to expand the import region slightly to ensure that the box containing the approximate midpoint of a set of interacting particles will import all remote particles in that set. In the absence of any restrictions on the spatial distribution of particles, use of approximation 1 requires that the import region contain all particles within a distance (m-1)R/m of the interaction box boundary, implying a substantial increase in import volume (if m-1 particles are collocated inside the interaction box at a distance R/m from the boundary and the mth point is located outside the interaction box at a distance just under (m-1)R/m from the boundary, then the midpoint will fall within the boundary). In MD simulations using common biomolecular force fields, however, a smaller increase in import volume may be sufficient, because the spatial arrangement of the particles involved in a bonded term is not arbitrary; for example, any pair of atoms must maintain some minimal distance from one another under normal conditions.
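A small numeric check of this worst-case configuration, in one dimension with m = 3 (the bound (m-1)R/m and all variable names here are our reconstruction):

```python
# 1-D worst case for approximation 1, with the box boundary at 0 and the
# box interior at x <= 0: m-1 particles sit just inside the box at -R/m,
# and the m-th particle sits outside at just under (m-1)R/m. The set
# still fits in a segment of length R (so the particles interact), yet
# the mean coordinate falls inside the box, so the box must import
# particles out to a distance of (m-1)R/m beyond its boundary.
m, R = 3, 12.0
eps = 1e-9
particles = [-R / m] * (m - 1) + [(m - 1) * R / m - eps]
span = max(particles) - min(particles)  # fits in a segment of length R
mean = sum(particles) / m               # approximate midpoint (approximation 1)
print(span <= R, mean < 0)  # -> True True
```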

A sufficient import region for use of approximation 2 is that enclosed by a rectangular parallelepiped extending a distance R/2 beyond the interaction box in the positive and negative directions along each coordinate axis. One can show that a somewhat smaller region suffices, one that includes all remote particles within a rectangular parallelepiped extending a distance R/4 beyond the interaction box in the positive and negative directions along each coordinate axis, as well as all particles within a distance R/4 of that parallelepiped. In three or fewer dimensions, for m ≥ 3, this import volume is always smaller than that required by approximation 1 in the absence of restrictions on the spatial distribution of particles.

The generalization of the ES method to interactions involving three or more particles (Section 2) also requires expanding the import region, because a set of particles that interact in a particular box under this method may include a particle at a distance greater than R from the box. A sufficient import region includes all remote particles within a rectangular parallelepiped extending a distance R/2 beyond the interaction box in the +x, +y, and +z directions, as well as all particles within a distance R/2 of that parallelepiped whose x, y, and z coordinates are larger than the base coordinates of the interaction box. This import region has the same volume as the smaller of the aforementioned sufficient import regions for use of approximation 2 with the midpoint method; this volume is larger than the import volume of the exact midpoint method.

References

1 M. Karplus and J. A. McCammon, Nat. Struct. Biol. 9 (9), 646 (2002); C. L. Brooks III and D. A. Case, Chem. Rev. 93, 2487 (1993); D. Frenkel and B. Smit, Understanding Molecular Simulation: From Algorithms to Applications, Second ed. (Academic Press, London, 2001); W. Wang, O. Donini, C. M. Reyes, and P. A. Kollman, Annu. Rev. Biophys. Biomol. Struct. 30, 211 (2001); T. Schlick, R. D. Skeel, A. T. Brunger, L. V. Kale, J. A. Board, J. Hermans, and K. Schulten, J. Comput. Phys. 151 (1), 9 (1999).

2 G. S. Almasi, C. Cascaval, J. G. Castanos, M. Denneau, W. Donath, M. Eleftheriou, M. Giampapa, H. Ho, D. Lieber, J. E. Moreira, D. Newns, M. Snir, and H. S. Warren, Int. J. Parallel Prog. 30 (4), 317 (2002).

3 D. van der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J. C. Berendsen, J. Comput. Chem. 26 (16), 1701 (2005).

4 R. W. Hockney and J. W. Eastwood, Computer Simulation Using Particles. (Adam Hilger, Bristol, 1988).

5 J. J. Monaghan, Annu. Rev. Astron. Astrophys. 30, 543 (1992).

6 G. S. Heffelfinger, Comput. Phys. Commun. 128 (1-2), 219 (2000); S. Plimpton, J. Comput. Phys. 117 (1), 1 (1995).

7 K. J. Bowers, R. O. Dror, and D. E. Shaw, J. Phys. Conf. Ser. 16, 300 (2005).

8 K. J. Bowers, R. O. Dror, and D. E. Shaw, under review.

9 D. E. Shaw, J. Comput. Chem. 26 (13), 1318 (2005).

10 One can only guarantee that an interaction between two particles residing in the same processor will be computed on that processor if the region assigned to that processor is convex.

11 R. O. Dror, K. J. Bowers, and D. E. Shaw, in preparation.

12 We have discovered novel neutral territory methods whose import volume is slightly smaller than that of the midpoint and ES methods for small α_R. These methods will be described in a subsequent paper.

13 M. Snir, Theory Comput. Syst. 37, 295 (2004).

14 These figures are approximate because we have ignored the constraints on box aspect ratios due to the finite number of processors. In order for the boxes to be exactly cubic when the global cell is cubic, for example, the number of processors must be the cube of some integer. In practice, one might choose not to use a few of the available processors in order to obtain more convenient aspect ratios.

15 Red Storm Raises Bar on Supercomputing Scalability, Cray, Inc. (2003); N. R. Adiga, G. Almasi, G. S. Almasi, Y. Aridor, R. Barik, D. K. Beece, R. Bellofatto, G. Bhanot, R. Bickford, M. A. Blumrich, A. A. Bright, J. Brunheroto, C. Cascaval, J. Castanos, W. Chan, L. Ceze, P. Coteus, S. Chatterjee, D. Chen, G. Chiu, T. M. Cipolla, P. Crumley, A. Desai, A. Deutsch, T. Domany, M. B. Dombrowa, W. Donath, M. Eleftheriou, C. C. Erway, J. Esch, B. G. Fitch, J. Gagliano, A. Gara, R. Garg, R. S. Germain, M. Giampapa, B. Gopalsamy, J. A. Gunnels, M. Gupta, F. G. Gustavson, S. Hall, R. A. Haring, D. Heidel, P. Heidelberger, L. M. Herger, D. Hoenicke, R. D. Jackson, T. Jamal-Eddine, G. V. Kopcsay, E. Krevat, M. P. Kurhekar, A. P. Lanzetta, D. Lieber, L. K. Liu, M. Lu, M. Mendell, A. Misra, Y. Moatti, L. Mok, J. E. Moreira, B. J. Nathanson, M. Newton, M. Ohmacht, A. Oliner, V. Pandit, R. B. Pudota, R. A. Rand, R. D. Regan, B. Rubin, A. E. Ruehli, S. Rus, R. K. Sahoo, A. Sanomiya, E. Schenfeld, M. Sharma, E. Shmueli, S. Singh, P. Song, V. Srinivasan, B. D. Steinmacher-Burow, K. Strauss, C. Surovic, R. A. Swetz, T. Takken, R. B. Tremaine, M. Tsao, A. R. Umamaheshwaran, P. Verma, P. Vranas, T. J. C. Ward, M. E. Wazlowski, W. Barret, C. Engel, B. Drehmel, B. Hilgart, D. Hill, F. Kasemkhani, D. Krolak, C. T. Li, T. Liebsch, J. Marcella, A. Muff, A. Okomo, M. Rouse, A. Schram, M. Tubbs, G. Ulsh, C. Wait, J. Wittrup, M. Bae, K. Dockser, L. Kissel, M. K. Seager, J. S. Vetter, and K. Yates, presented at the ACM/IEEE Conference on Supercomputing, Baltimore, Maryland, USA, 2002; R. E. Kessler and J. L. Schwarzmeier, presented at the IEEE Computer Society International Conference, 1993; S. Scott and G. Thomas, presented at Hot Interconnects IV: A Symposium on High Performance Interconnects, Stanford University, Palo Alto, CA, 1996.

16 A. D. MacKerell, Jr., D. Bashford, M. Bellott, R. L. Dunbrack, J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, S. Ha, D. Joseph-McCarthy, L. Kuchnir, K. Kuczera, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher, III, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watanabe, J. Wiórkiewicz-Kuczera, D. Yin, and M. Karplus, J. Phys. Chem. B 102 (18), 3586 (1998).

17 P. A. Kollman, R. W. Dixon, W. D. Cornell, T. Fox, C. Chipot, and A. Pohorille, in Computer Simulations of Biological Systems, edited by W. F. van Gunsteren (KLUWER/ESCOM, Dordrecht, Netherlands, 1997), Vol. 3, pp. 83.

18 W. L. Jorgensen, D. S. Maxwell, and J. Tirado-Rives, J. Am. Chem. Soc. 118 (45), 11225 (1996).

19 W. R. P. Scott, P. H. Hünenberger, I. G. Tironi, A. E. Mark, S. R. Billeter, J. Fennen, A. E. Torda, T. Huber, P. Krüger, and W. F. van Gunsteren, J. Phys. Chem. A 103 (19), 3596 (1999).

20 T. A. Halgren, J. Comput. Chem. 20 (7), 730 (1999).

21 C. L. Brooks III, B. M. Pettitt, and M. Karplus, J. Chem. Phys. 83 (11), 5897 (1985); P. Mark and L. Nilsson, J. Comput. Chem. 23 (13), 1211 (2002); J. Norberg and L. Nilsson, Biophys. J. 79 (3), 1537 (2000); M. Patra, M. Karttunen, T. Hyvönen, E. Falck, P. Lindqvist, and I. Vattulainen, Biophys. J. 84, 3636 (2003).

22 T. Darden, D. York, and L. Pedersen, J. Chem. Phys. 98 (12), 10089 (1993); U. Essmann, L. Perera, M. L. Berkowitz, T. Darden, H. Lee, and L. G. Pedersen, J. Chem. Phys. 103 (19), 8577 (1995).

23 C. Sagui and T. Darden, J. Chem. Phys. 114 (15), 6578 (2001).

24 Y. Shan, J. L. Klepeis, M. P. Eastwood, R. O. Dror, and D. E. Shaw, J. Chem. Phys. 122, 054101 (2005).

25 J. Vidgren, L. A. Svensson, and A. Liljas, Nature 368, 354 (1994).

26 H. J. C. Berendsen, J. P. M. Postma, W. F. van Gunsteren, and J. Hermans, in Intermolecular Forces, edited by B. Pullman (D. Reidel Publishing Company, Dordrecht, 1981), pp. 331.

Figures and captions appear on the following pages.

FIG.l

FIG.2

FIG.3

(a)

(b)

FIG.4

FIG. 5

Midpoint method

(a)

Midpoint-ensured method

(b)

FIG.6

Figure Caption List:

FIG. 1:

Assignment of particle pairs to interaction boxes in the midpoint method. In this figure, the boxes are square with side length b, and R = 1.5b. Each pair of particles separated by a distance less than R is connected by a dashed line segment, with the "x" at its center lying in the box which will compute the interaction of that pair.

FIG.2:

Import regions of (a) the two-dimensional midpoint method and (b) the two-dimensional analog of the HS method. Boxes are square with side length b, and R = 1.5b. In each case, the interaction box is shown in light gray and the import region in dark gray.

FIG.3:

Import regions of (a) the three-dimensional midpoint method, (b) the HS method, (c) the NT method, and (d) the ES method. In each case, the interaction box is shown in light gray and the import region in dark gray. The diagrams of the midpoint method, the HS method, and the ES method assume that boxes are cubic with side length b and that R = 1.5b. The diagram of the NT method assumes the same values for R and for box volume, but box aspect ratios have been optimized to minimize import volume.

FIG.4:

A formulation of the two-dimensional midpoint method as a zonal method. (a) The interaction box (I) represents a single zone, and the import region is partitioned into eight zones. (b) An interaction schedule indicating which pairs of zones interact. An entry of 0 indicates that the zones in the corresponding row and column do not interact, while an entry of 1 indicates that they do.

FIG. 5:

Import volumes of several parallelization methods for pairwise interactions as a function of the number of processors p, assuming a cubic global cell with a side length of 80 Å and an interaction radius R = 12 Å. Import volumes are represented relative to that of the HS method (V_I,HS) for any number of processors. The box aspect ratios of the NT method were optimized at each value of p to minimize import volume, while the boxes are assumed to be cubic for the other methods. The midpoint and ES methods have the same import volume for all p.

FIG. 6:

Number of interactions assigned to each processor for a biomolecular system by (a) the midpoint method and (b) a midpoint-ensured method. The system contained 50,846 atoms in a cubic global cell measuring 80.0 Å per side, parallelized on 64 processors with R = 12 Å.