Optimization of Docking Conformations
Using Grid Datafarm
METI Project Report
March 2003
Junwei Cao and Takumi Washio
C&C Research Laboratories, NEC Europe Ltd., Germany
{cao, washio}@ccrl-nece.de
1.Introduction
Grid computing originated from a new computing infrastructure for scientific research and cooperation and is becoming a mainstream technology for large-scale resource sharing and distributed application integration. Data access and management is one of the key services that must be provided by the grid infrastructure and related topics include data transfer, replica, visualization and so on.
Grid Datafarm (GFarm) is a Japanese national project that aims to design an infrastructure for global petascale data intensive computing. GFarm tools and APIs are provided to handle large data files in both single filesystem image and local file views. While the Grid Datafarm is originally motivated by high energy physics applications, it is a generic distributed I/O management and scheduling infrastructure that can be applied to support other data grid applications.
One of the biomedical simulation applications developed in the NEC for the docking conformation optimization is investigated to use the GFarm infrastructure to handle large-scale simulation results. The result files of the application contain detailed conformation information and corresponding energy values, which potentially become very large if the considered problem size and simulation iterations increase. In this report, two cases are considered with the second one more realistically implemented.
This report documents some initially implemented tools, which are based on existing GFarm tools and APIs and can be provided to users with application specific interfaces. Docking confirmation result files can be handled and some key features of the GFarm environment are validated. In the second case, tools are evaluated in a mini scale and some performance results are also included. Some suggestions are also provided for future improvement of the GFarm toolkit.
2.Grid Datafarm
Architecture
The Grid Datafarm architecture couples the disk storage capabilities together with the high performance computing processors to maximize the disk I/O advantages and avoid the bandwidth limitation of the global network in a grid environment. A GFarm metaserver with an LDAP-based metadata DB is utilized for management and information services to achieve a single file system.
One of the key features of the GFarm infrastructure is the support of both global and local file views. The global view eases the file handling, and the local view can be used to achieve the scalability of the file manipulation. These are both used in the GFarm application specific tools introduced later.
GFarm Tools and APIs
Grid Datafarm implementation provides a series of tools and programming APIs. Apart from some general file operation commands (e.g. gfls, gfwhere, gfreg, etc.), several example tools that are included in the GFarm release is found to be useful especially when applied for the docking optimization application. These include:
gfimport_fixed
The command gfimport can be used to import a file into the GFarm infrastructure by dividing the file into same-sized fragments. If users want to define fixed-size blocks in the file and avoid divisions in the middle of blocks, the gfimport_fixed command can be used. And the file will be divided into almost-same-size fragments.
gfimport_text
If the considered file is a text file and the user want to avoid divisions within lines when it is imported into the GFarm infrastructure, the tool gfimport_text can be used. And the file will be divided into almost-same-size fragments.
gfmpirun
The command gfmpirun is used to run MPI programs that handle GFarm files when the datafarm infrastructure is installed in a small-scale cluster computing environment. The gfmpirun firstly looks up the locations of the corresponding file fragments and then transfers multiple copies of the MPI programs onto these file nodes. Current gfmpirun makes system call mpirun and relies on the specific MPI implementation. This has to be updated according to the configurations of the NEC MPI implementation that is described below.
GFarm and NEC MPI
The implementation of the NEC MPI is especially coupled with a cluster scheduling system, COSY. When the user requires anmpirun, the COSY scheduling will issue a resource allocation to the user automatically. But the best way to guarantee the required resources is to reserve the resources via COSY in advance and use an additional reservation ID jid for mpirun. The gfmpirun is updated to include these specific options of the NEC MPI and COSY implementation.
3.Optimization of Docking Conformations
Overview
This optimization code ultimately aims at finding minimal energy conformation for given pairs of proteins and ligands (molecules which become medicines). Such a tool can be used for screening candidates of medicine by computer simulation before doing actual experiments. By doing this screening process fast and accurately, the development cycle of medicines can be dramatically reduced. For this type of minimization problem, there are a quite large number of local minima. Thus, the many optimizations with different initial conformations must take place. Optimization processes for different initial conformations can be performed independently from others. Thus, one can run many optimization processes in parallel on distributed computers and can save the results of local minimum conformations in distributed storage devices. After all processes are completed, one would want to make ranking for all produced data, and then pick up the top energy conformations (from the minimum) at his/her own computer. For such file manipulations, Gfarm can be used.
The development of the optimization code has been almost finished. The optimization code is independent from the energy and force calculation routines. Namely, users can easily couple their molecular mechanics codes with the optimization code by giving energies and forces for conformations provided by the optimization code. This mechanism is described in Fig. 1.
Fig. 1. Flowchart of the optimization code
In most cases, bond lengths and bond angles of molecules are stiff. Thus, in order to accelerate the convergence of the optimization process, one wants to allow movements only for torsion angles. Such optimization is possible with this code. Actually, users can specify places where they want to allow stretching, bending or twisting as they like.
In the test case, eight identical organic compounds (consisting of 37 atoms) are docked from randomly generated initial conformations (See Fig. 2). In Fig.3, the energies are plotted for 1000 optimized conformations. As can be shown in the figure, there are many local minima, thus many local optimizations must take place to obtain a nearly best conformation in terms of energy. And such the optimizations will be done for many ligands. Thus, the amount of data will grow up for realistic situations.
The application result files are currently handled manually. After the optimization is finished, the user needs to look into the energy values and find corresponding item index numbers with minimum energy. The conformation information is then picked up according to the index number and saved into a separate file that can be taken as an input file of visualization tools. This work aims to provide tools on top of the GFarm infrastructure and facilitate these processes.
Fig. 2. Initial conformation and after the optimization
Fig. 3. Energy plots for 1000 optimization processes
Data Formats
When the optimization is finished, two text files are produced: recording of all the detailed information on docking conformations and corresponding energy values. The formats of the file data are illustrated in Fig. 4. The way GFarm should be used to handle these files is related to these formats.
Fig. 4. Example result files
In Fig. 4, both of the files contain data of 100 items which are generated by the optimization processes with randomly made initial conformations. Each line of one item in the first file describes docking conformation information of one of the atoms of a certain dock. The size of each line is fixed, therefore the total size of one item is determined by the total number of atoms involved in the application. In Fig. 4, docking of 8 molecules are considered and allmolecules are chemically identical, and each of them is composed of 37 atoms, so the total number of atoms is 296.
The second file includes the energy values of each docking. The first column is the index number and the second is the total energy value which is the most important and used for ranking of the docking conformations. Each line is corresponded to one item in the first file and the size of each line is not fixed.
The two files potentially becomequite large (especially the first file) if more initial conformations are optimized (instead of 100 in Fig. 4) and molecules are potentially composed of more atoms (say, thousands of atoms). When the GFarm is used to handle these files, it is required to avoid fragment divisions within one item of data.
GFarm Tools
gfopt_import
This is used to import the docking conformation result file into a GFarm file. The number of atoms in the application should be input so that the tool can calculate the fixed size of each block automatically and utilize the gfimport_fixed to avoid breaks in the middle of blocks.
gfopt_import_energies
This is used to import the energy result file into a GFarm file. The gfimport_text is utilised. However, instead of the almost-same-size file division, the almost-same-line fragment division strategy is utilized so that the division can be carried out in the same way that the gfopt_import tool does.
gfopt_export_energies
This is actually a MPI program that takes a GFarm energy file as input. The program reads through the energy results in a parallel way and tries to find the index number with minimum total energy value. This is executed using the updated gfmpirun and the output index numbers can be used as inputs of gfopt_export.
gfopt_export
The gfopt_export can be used to retrieve a specific item of data from a GFarm docking conformation result file, given the index number. The tool can calculate the location of the file fragment where the corresponding block of data is stored, switch to the GFarm local file view, seek to the beginning position (using the gfs_pio_seek API) and read out a fixed-size of data. The output can be saved in an individual file that can be used directly by visualization tools.
The process benefits from the GFarm support of local file views. Since only one of the file fragments is opened, the performance of the operation will not decrease much when the file size increases.
4.Screening with a Docking Optimization Code
Overview
The docking simulation code ultimately aims at finding ligands which dock tightly with a targeted protein. Such ligands are selected from libraries of chemical compounds which include a huge number of candidates. The main engine of our docking code is composed of two parts:
1.Energy and force calculations for given conformations based on a classical molecular dynamic model
2.Optimization of conformations to find local minima of energy
In our molecular dynamic model, to take the solvent (water molecules surrounding the protein-ligand complex) effect into account, we apply an implicit solvent model where the electrostatic interaction between the solvent and the solute (the complex of protein and a ligand) is computed by solving potential equations with an inhomogeneous dielectric, and the non-polar interaction between the solute and the solvent is taken into account by assuming that it is proportional to the molecular surface area. In our optimization, we allow mobility only for the ligand and some important molecules and the side chains of the protein in the active site (a region where the docking takes place). We exploited such conditions (the localized mobility and the implicit solvent model) to reduce computational load. And we have been able to realize the feasible computational time for the local optimization. Here, the local optimization means finding a local minimum close to a given initial conformation. One shot of the optimization takes around a few minutes with a linux PC (Pentium 4) at the current implementation even when the solvent effect computation with the implicit solvent model is combined. To achieve such a feasible execution time, we have developed the following novel methods.
1.Constrained optimization techniques to reduce number of optimization steps.
2.A multilevel optimization technique with which the number of computation of the energies and forces due to the solvation effect on the fine grid are reduced by the use of the coarser grids.
3.Acceleration of the convergence of the above multilevel optimization iterations using a Krylov subspace technique.
Both of an initial conformation and a resultant conformation obtained by our optimization code from it are depicted in Fig. 5. In this example, the initial conformation is taken from experimental results, the atoms of the ligand move 0.84 Angstrom in average during optimization. Though the deformation is not so large, the optimization reduces the energy dramatically (the reduction of the energy from the first iteration to the last is about 100 kcal/mol). Thus, to obtain a reasonable energy for the screening, such an optimization is needed. In Fig. 7, the convergence history of the energy for this example is given. The horizontal axis indicates the number of evaluations of forces. About one thousand of the evaluations were performed in this case. However, among them, only one hundred times of evaluations of the solvation effects on the finest grid were needed by employing the coarse grids computations. For this computation, 242 seconds was consumed with our Linux PC with Pentium 4 processor.
Fig. 5. Conformations before and after the energy optimization
Fig. 6. The conformation change from the initial one (the wires) and the optimizedone (the sticks) including the mobile sidechains and the mobile watermolecules.
Fig. 7.Convergence history of the energy
Though we have made above mentioned efforts to realize fast computation of the energy optimization, it is still not feasible and not efficient enough to apply our code directly to all of possible candidates found in libraries, since such optimization problems have too many local minima. At the moment, we do not have any useful solution to resolve this difficulty by our code. However, we can use existing docking prediction codes which produce a set of possible conformations that may be close to conformations of small energies. In general, these codes use simpler energy models and search conformations in roughly discretized spaces to achieve high throughput. Thus, it makes sense to use the given set of conformations by such tools as the initial conformations of our optimization codes and to evaluate the energies after the optimizations with our more accurate energy model. Furthermore, the most of the existing docking prediction tools fix the protein structure during optimization, whereas our code can allow mobility for any part of proteins as can be seen in Fig. 6. We think that this is also important improvement for the energy evaluation, since the protein is also flexible. After all computations are completed for all the given conformations through all the candidates, we can pick up candidates of the medicine by selecting ones which give smaller binding energies at their best docking conformations. This is called two stage screening strategy.
By saving also atomic coordinates of all the optimized conformations, we can use them later on to understand detailed mechanisms of the protein-ligand interactions. And such knowledge might be exploited to design a new compound.
Fig. 8.Work flow of the two screening strategy
In Fig. 7, an image of the two stage screening is shown. With the first docking prediction tool, some conformations (may be up to 100) are given for each of ligands. Then, all of conformations are optimized by our codes through all the ligands. At the end of each optimization, the computed conformation is saved with its energy in an output file. Here, the coordinates only of the mobile atoms (in the ligand, some molecules around the docking site, and some side chains of the protein) are saved. In case of the previous example, 5K byte of capacity is required to store one conformation of the mobile atoms. Thus, 0.5 M byte of capacity is needed to store one hundred conformations. If we need to compute about ten thousand ligands, 5G byte of capacity is needed to store all of the conformations. As for the computational time, one hundred of optimizations might be done in 10 hours. Thus, if we can use a thousand of PCs, all of optimizations for ten thousand ligands might be finished within 5 days. Even though the required data capacity is not so large for one case (one targeted protein), we may need to perform calculations over different parameter sets of the molecular dynamic model to confirm obtained results. Thus, it may not be possible to save all the output files in one local storage device. Therefore, it is convenient to have tools that allow us to access specified data in distributed output files over some network. In the following sections, we will discuss these issues.