Estimation of absolute solvent and solvation shell entropies via permutation reduction

Despite its prominent contribution to the free energy of solvated macromolecules such as proteins or DNA, and although principally contained within molecular dynamics simulations, the entropy of the solvation shell is inaccessible to straightforward application of established entropy estimation methods. The complication is twofold. First, the configurational space density of such systems is too complex for a sufficiently accurate fit. Second, and in contrast to the internal macromolecular dynamics, the configurational space volume explored by the diffusive motion of the solvent molecules is too large to be exhaustively sampled by current simulation techniques. Here, we develop a method to overcome the second problem and to significantly alleviate the first one. We propose to exploit the permutation symmetry of the solvent by transforming the trajectory in a way that renders established estimation methods applicable, such as the quasiharmonic approximation or principal component analysis. Our permutation-reduced approach involves a combinatorial problem, which is solved through its equivalence with the linear assignment problem, for which O N3 methods exist. From test simulations of dense Lennard-Jones gases, enhanced convergence and improved entropy estimates are obtained. Moreover, our approach renders diffusive systems accessible to improved fit functions. © 2007 American Institute of Physics. DOI: 10.1063/1.2400220


I. INTRODUCTION
Biomolecular processes are driven by molecular free energy changes.Many physico-chemical phenomena, such as the hydrophobic effect, emerge from a fine-tuned competition of the two free energy components, entropy and enthalpy.During protein folding, e.g., molecular interactions compete with the associated huge decrease of conformational entropy of the protein. 1 Hydrophobic forces, which drive many biological phenomena such as membrane association and also protein folding, are also governed by entropy changes, but mainly of the solvent. 2 As a fully atomistic description, molecular dynamics ͑MD͒ simulations should capture this enthalpy-entropy competition.That this is actually the case has become particularly evident by the successful first-principle MD folding of peptides and proteins of increasing size, 3,4 where the obtained native structure is particularly sensitive to the enthalpy-entropy balance.][7][8][9] Unexpectedly, however, accurate values for both enthalpy and entropy are notoriously difficult to extract from such simulations, although they are correctly treated and thus, apart from the sampling problem, are fully contained in the simulation.This at first almost paradoxical gap severely inhibits a full causal and thermodynamic understanding of many macromolecular processes.
Free energy differences between two states of a system are usually obtained from MD simulations by means of free energy perturbation 10 or thermodynamic integration; 11 absolute values, however, are more difficult to obtain.In both cases accuracy is limited by sampling.3][14] For absolute values, however, the sampling problem is usually prohibitive due to the large volume of phase space that needs to be sampled.Even more so, the separate computation of enthalpy or entropy differences typically suffers from large statistical uncertainties. 15Accordingly, to estimate these quantities already represents a considerable challenge.
In addition, biological systems usually involve a solute ͑e.g., a protein͒ solvated in a solvent ͑e.g., water͒.Therefore, an estimation method should include both groups in the analysis and distinguish their contributions.This is an additional challenge which is not met satisfactorily by any method proposed so far.
The quasiharmonic approach 16,17 approximates the phase space density of a system by fitting a multivariate Gaussian density to a MD simulation trajectory using the covariance matrix.From this density estimate, the entropy is calculated analytically.This approach is computationally efficient and often yields good results for the entropy contribution of the solute. 18,19As we will discuss further below, however, the solvent contribution to the entropy is not accessible to this method.This is particularly problematic since the solvent degrees of freedom are typically correlated with those of the solute and, therefore, the associated entropies are usually not separable.
To compute entropy differences between two states, thermodynamic integration ͑TI͒ ͑Refs.10 and 11͒ is the method of choice.TI is most widely used for the computation of free energy differences and, in principle, includes both solute and solvent.More recently, it has been extended to the computation of entropy differences. 15,20This variant, however, suffers from severe sampling problems, and is therefore computationally particularly expensive.
A third method, adiabatic switching, 21 is also based on a gradual change of the system's Hamiltonian.In contrast to TI, this change is performed adiabatically, i.e., the pathway is chosen such that the entropy does not change.This change is used to transform the system into a simpler one whose entropy can be calculated analytically.This method has been successfully applied to pure solvent systems; its application to mixed systems, however, is not straightforward, 15 and has not yet been demonstrated.
In summary, despite its importance, estimation of entropy contributions to the free energy and, in particular, those of the solvent, still represents a considerable challenge.Here, we propose a method to fill this gap.By exploiting permutation symmetry we propose to transform the solvent trajectory such that density-based entropy estimation methods like the quasiharmonic approach become applicable.

II. THEORY
Because our method rests on the quasiharmonic approximation, and also to clarify notation, we will first sketch this method.For a comprehensive treatment, the reader is referred to the literature. 16,17Subsequently, we will describe how permutation symmetry can be exploited to render the solvent accessible to the quasiharmonic approximation.

A. Quasiharmonic approximation
Within the quasiharmonic approximation framework, one assumes that a molecular dynamics trajectory ͕x 1 ͑t͒ , ... ,x N ͑t͖͒ = ͕x͑t͖͒ of a classical N-particle system approximates a Boltzmann ensemble which is distributed according to the configuration phase space density, of the system under study.Here, x͑t͒ denotes the configurational vector of the system at time t, ␤ is the reciprocal thermal energy, V͑x͒ the potential energy ͑force field͒, and Z the canonical partition function; the contributions of the momenta can be treated analytically and are, therefore, not considered here.Usually, and also here, the trajectory is one from which rigid body motions of the solute have been removed.
Accordingly, the trajectory is used to reconstruct an approximate density ˜͑x͒ = 1 where ͗...͘ denotes an average over the trajectory.For harmonic V͑x͒, this approximation is exact; hence, the approach is referred to as quasiharmonic approximation. 22he approximate density ˜serves to obtain an analytical estimate S ˜for the ͑configurational͒ entropy S, 16 S Ϸ S ˜= 1 2 k B ͕3N + ln͓͑2͒ 3N det͑C͔͖͒.͑1͒ More recently, a quantum-mechanical correction has been proposed 17 which additionally is numerically more stable,

͑2͒
Here, i 2 are the 3N eigenvalues of the mass-weighted covariance matrix.
For later reference we note that for indistinguishable particles, a correction k B ln N! ͑Gibbs factor͒ has to be applied.Due to their chemical bonding, the atoms of a protein molecule are usually considered distinguishable, and no Gibbs factor is applied.
By construction, the harmonic approximation is accurate as long as resembles a Gaussian distribution.Solvated biological macromolecules often adopt a well-defined average structure, and the atomic fluctuations around their average structure are small ͑though exceptions exist͒.Therefore, their configurational space density is compact, and the Gaussian density approximation is reasonably good ͓Fig.1͑a͔͒.The quasiharmonic approximation is also applicable to an ideal gas in a finite box. 19,23This is unexpected because the associated rectangular phase space density ͓Fig.1͑b͔͒ is far from Gaussian-shaped.
However, for noncompact densities the harmonic approximation breaks down ͓Fig.1͑c͔͒.In this situation, the Gaussian density approximation is maximal at the "hole" in the center of Fig. 1͑c͒, where the actual density vanishes.This sketch captures the situation for solvents.Here, the diverging potential for overlapping solvent atoms or molecules creates, via the Boltzmann factor, holes in the highdimensional configurational space density of the system.Even larger holes arise from repulsive interactions of the solvent molecules with a macromolecular solute such as a protein.This is one of the reasons why the solvent contribution to the entropy is usually inaccessible to the harmonic approximation.

B. Permutation symmetry
Motivated by this observation, and aiming at removal of the holes, we propose to "localize" the solvent degrees of freedom by exploiting the solvent's permutation symmetry.Specifically, we will permute the indices of the solvent molecules for each individual trajectory frame x͑t͒ such as to minimize the error introduced by the Gaussian density approximation.This process will be referred to as "relabeling."Due to its permutation symmetry, relabeling of the molecules leaves the physics of the system invariant, but translates the 3N-dimensional configurational vectors x of the trajectory and, therefore, enables us to tailor its "shape." Assuming that the approximation error decreases with increasing compactness of the trajectory, we therefore propose to relabel each frame i of the trajectory such as to minimize its 3N-dimensional radius of gyration, where i is a permutation operator such that i • x͑t i ͒ denotes the relabeled configuration at time t i , and ͗x͘ is the average configuration.Note that the latter depends on the particular choice of all optimal permutations i , which renders the above equation recursive and precludes straightforward optimization.We circumvent this problem by replacing the permutation-dependent average configuration with an appropriately chosen reference configuration x 0 .This trick reduces the coupled optimization problem, Eq. ͑3͒, to the independent optimization problems for each trajectory frame i of finding a permutation i which minimizes the distance to the reference configuration, This problem is a special case of the linear assignment problem ͑LAP͒, for which efficient algorithms have been developed. 24,25We will refer to this procedure-relabeling the trajectory according to the optimal permutations-as "permutation reduction," and to an entropy estimate derived from the permutation-reduced trajectory as "PRPCA" estimate.
Figure 2 illustrates this approach for a simulation of 216 water molecules in a cubic box.Shown are superpositions of the trajectories of all oxygen atoms; the atom numbers are color coded.Figure 2͑a͒ shows the trajectory as obtained from the simulation.Clearly visible is the free self-diffusion of the molecules, such that each color covers the whole volume.Figure 2͑b͒ displays the permutation-reduced trajectory.As intended, the diffusive motion is converted into localized fluctuations of the water molecules around their reference positions.In this respect, the water molecules collectively behave similar to a protein fluctuating around its welldefined average structure; in this sense, the solvent has been "proteinized." By construction, permutation reduction projects the trajectory into one of N! identical subvolumes of the configurational space.Accordingly, the underlying configurational space density is projected onto a reduced density 0 , which vanishes everywhere outside the subvolume accessible to the relabeled trajectory.This reduced density 0 is by construction much more compact than the original density .Accordingly, the Gaussian approximation should allow for an improved estimate of the density 0 as well as of its entropy S 0 .The original density is easily recovered by applying all N! permutations FIG. 2. ͑Color͒ Molecular dynamics simulation of 216 water molecules within a cubic box.Shown are trajectories of the oxygen atoms without ͑a͒ and with ͑b͒ permutation reduction applied.In the first case, all molecules diffuse freely within the box volume.After permutation reduction, each molecule remains close to its reference position because it is relabeled otherwise.Note that the physics of the system remains unchanged.
where the factor N! accounts for proper normalization.It follows that S = S 0 + k B ln͑N!͒.We note that for indistinguishable particles, S = S 0 .

III. METHODS
We implemented the outlined approach using the LAP algorithm of Jonker and Volgenant, 26 which solves the LAP with O͑N 3 ͒ efficiency.This implementation is available electronically. 27o assess the accuracy of our approach, we chose a gas of Lennard-Jones particles as a first simple test system.Similar test systems have also been considered previously. 18,19he parameters c 12 and c 6 of the Lennard-Jones interaction potential, were scaled by a dimensionless parameter such that c 12 = ϫ9.8493ϫ 10 −6 nm 12 kJ mol −1 K −1 , c 6 = ϫ6.2644ϫ10 −3 nm 6 kJ mol −1 K −1 .The case = 1 corresponds to argon. 19,28We simulated 100 atoms of mass 39.948 amu within a box of constant volume ͑2.2 nm͒ 3 with periodic boundary conditions.The system was kept at T = 300 K by coupling to a temperature bath with a 0.1 ps relaxation time. 29Lennard-Jones interactions were cut off at r = 1.0 nm.All simulations were started from an energyminimized structure, and a 2 fs time step was used for the numerical integration of the equations of motion.Two sets of simulations were performed.The first set comprised eighteen 50 ns trajectories.For each trajectory, a fixed between 0 and 1 was chosen.Note that = 0 describes an ideal gas.For the second set of simulations, the interaction of the particles was held fixed at = 1, and an additional particle-mimicking a solute-was inserted at the center of the box.This particle was restrained by a harmonic potential with a spring constant of 10 6 kJ mol −1 nm −2 .For its interaction with the other particles, a Lennard-Jones potential with the above parameters was chosen, with fixed ranging from =0 to = 100.
For all trajectories, the 100 solvent atoms were permutation reduced.To characterize the influence of the reference structure x 0 on the entropy estimate, three different types of reference structures were considered, the minimized initial structure, a simple cubic lattice of argon atoms, lattice constant d = 0.55 nm, and the last frame of the respective simulation.
Covariance analyses were carried out for both the original and the relabeled trajectories.From the obtained covariance matrices, the entropy was estimated via Eq.͑2͒.Excess entropies were computed by subtracting the ideal gas value obtained with the respective estimation method.For proper comparison, the N! correction was used for the PRPCA estimate, i.e., the particles were treated as distinguishable.
To obtain accurate entropy reference values, for each of the considered systems the excess entropy to the ideal gas was calculated by thermodynamic integration, 11,15,20 which was possible for the simple test system used here.The computations were performed by switching the interaction parameter in 100 steps from 0 to .For each step, a 200 ps simulation was performed.A soft core potential, as implemented in GROMACS, was used with ␣ = 1.5 and = 0.3.
To check for convergence, forward and backward thermodynamic integration was performed for both simulation sets.As the worst case, the largest interaction strengths were considered ͑ = 1 for the first and = 100 for the second set, respectively͒.Only small differences below 0.12 J mol −1 K −1 for the first and 0.04 J mol −1 K −1 for the second series were observed.Hence, sufficient convergence is assumed.
All simulations were carried out using GROMACS 3.2.1. 30For the thermodynamic integration calculations, the code was slightly modified to output the derivatives of the Hamiltonian required for the entropy thermodynamic integration. 15

A. Convergence of the PRPCA entropy estimate
To test for convergence, the ideal gas case ͑ =0͒ of the first simulation set was considered.For this case, analytic reference values for the entropy as well as the Gaussian entropy estimate, Eq. ͑2͒, are available for comparison, 23 shown as horizontal lines in Fig. 3.
Figure 3 shows entropies from the Schlitter method for trajectory portions of different lengths.As can be seen, the entropy estimates from the relabeled portions ͑PR-SC, circles and PR-EM, diamonds͒ converge about three times faster than the estimates from the original trajectory portions ͑MD, squares͒.This improvement was expected and is because the relabeled trajectory is restrained to the much smaller subvolume of the phase space, which, for given trajectory length, is sampled more completely.
The good agreement of this estimate with the analytical result is unexpected.Although it has been pointed out recently 23 that a Gaussian fit to a constant density distribution within a hypercube yields good entropy estimates, it is far from clear that a Gaussian fit function should remain an appropriate choice for the simplexlike subvolume sampled by the relabeled trajectory.In any case, this is an encouraging result.
All entropy estimates converge within 10 ns; therefore, sampling errors are considered negligible for the 50 ns test simulations described below.

B. Lennard-Jones gas
In real solvents, interactions give rise to correlations between molecules and thereby reduce the entropy.To analyze how well this effect is captured by our method, we studied a system of Lennard-Jones atoms for different interaction strengths ͑the first simulation set͒.With a density of 9.4/ nm 3 or 0.623 g / cm 3 , which approaches that of liquids, this system should represent a sufficiently tough test case.
Figure 4 compares two PRPCA estimates ͑PR-SC, PR-EM͒ with estimates obtained from Gaussian fits to the original trajectory ͑MD͒.As a reference, the thermodynamic integration result is provided ͑solid line͒.Note that the excess entropy is shown here.For this difference between two entropies, the quasiharmonic approximation result is no strict upper limit.As can be seen from the thermodynamic integration curve, the entropy of the gas decreases for increasing interaction strength, due to increasing correlations between the particles.This effect is completely missed by the entropy estimate derived from the unrelabeled trajectory, and even shows a slight increase.This failure has also been observed by others. 19uch better estimates are obtained from the relabeled trajectory.In particular, the entropy decrease with increasing interaction strength is now fully captured.Furthermore, quantitative agreement with the thermodynamic integration reference is obtained for weak to medium interaction strengths ͑Ͻ0.1͒.However, for strong interactions, deviations remain.
Closer analysis suggests that the effect of the repulsive core is well described by our method.In contrast, the attractive part of the interaction-which becomes dominant for large -is described less accurately by the Gaussian approximation.Nevertheless, also for this critical interaction regime, a significantly improved estimate is achieved by exploiting the permutation symmetry.
Figure 5 illustrates this situation.The full configurational space density ͓Fig.5͑a͔͒ exhibits holes due to repulsive-core interactions, which undermine the Gaussian fit.As can be shown, the centers of all these holes are located at the surface of the permutation-reduced subvolumes ͓Fig.5͑b͔͒, which renders Gaussian fits more accurate.

C. Solvation of a particle in a Lennard-Jones gas
The second set of simulations aimed at characterizing the accuracy of our entropy estimate for a macromolecular solute, here mimicked by a strongly interacting Lennard-Jones particle.Figure 6 shows the excess entropies with respect to the pure "solvent" for the four methods discussed above.Additionally, a relabeled trajectory with the last frame of the 50 ns simulations as reference frame x 0 is considered.
Similar to the previous system, increasing interaction strengths induce increasing correlations and thereby decrease the entropy.The PRPCA estimate captures also this effect, which is missed by the estimate from the original trajectory.FIG. 6. ͑Color͒ Solvation of a Lennard-Jones particle solvated in a Lennard-Jones system.The interaction strength between the solvated particle and the solvent is varied from weak to very strong.Entropy estimates are obtained from the original trajectory ͑MD͒ as well as from relabeled trajectories with the simple cubic ͑PR-SC͒ and the energy minimized configuration ͑PR-EM͒, respectively, as reference structures.In an additional set of relabeled trajectories, the respective last trajectory frame we used as reference structure ͑PR-F͒.Thermodynamic integration values ͑TI͒ are given as reference.
This similar behavior is nontrivial, because the sketch in Fig. 5 cannot be applied here in a straightforward manner due to the large radius of the solvated particle.
For large interaction strengths v, self-diffusion of the first solvation shell is drastically reduced, similar to the known behavior of a protein hydration shell.As a result, an accelerated decrease of the TI entropy is seen.An even more pronounced decrease is seen for the estimate from the original trajectory.Because this estimate has been rigorously shown to provide an upper bound for any configurational space density, 17 and assuming increasing deviation from a Gaussian of the density at hand for increasing , the agreement with the TI entropy is likely to be due to insufficient sampling.In this case, it would be purely accidental.Indeed, insufficient convergence is seen for different trajectory lengths ͑data not shown͒.In contrast, much faster convergence is observed for the relabeled trajectory.This effect is particularly pronounced for large .Accordingly, the PRPCA estimates capture the marked decrease seen for the TI entropies, albeit with decreasing accuracy.
Finally, the influence of the choice of the reference structure x 0 remains to be assessed.
To this end, we consider the entropy estimates from the relabeled trajectory which used the last frame as reference ͑PR-F in Fig. 6͒.The apparent scatter is due to the fact that here, in contrast to the other estimates, each entropy estimate rests on a different reference structure.Accordingly, this scatter of about = 0.05 J mol −1 K −1 half-width measures the uncertainty introduced by the choice of the reference structure.Interestingly, the deviation of the PRPCA estimate from the TI reference falls into this range.Optimization of the reference structure, therefore, might provide further improved entropy estimates.

V. CONCLUSION
We have presented a entropy estimation method particularly tailored for diffusive systems, such as the solvent contribution to the entropy of solvated macromolecules, e.g., proteins.As test cases, we have applied this method to a gas of 100 Lennard-Jones particles and to the solution of a strongly interacting Lennard-Jones particle in this gas.This system was chosen because its entropy is accessible to thermodynamic integration, such that a sufficiently accurate reference value can be provided.
In both test cases, our PRPCA method enhanced convergence and significantly improved the entropy estimate compared to straightforward application of the conventional PCA approach.There remain deviations from the exact result as well as from more accurate methods such as thermodynamic integration or hypothetical scanning Monte Carlo. 31These deviations, however, have to be attributed to the chosen quasiharmonic ͑Gaussian͒ approximation for an inherently non-Gaussian configurational space density.Thus, we are now in the favorable situation that construction of a better fit function alone will push the accuracy of our method to the level reached by statistical approaches, which was not possible before.
The intractable sampling problem, common to all competitive methods, is thus significantly alleviated and transformed into a fitting problem that should be addressable with drastically reduced computational effort, especially in view of the fact that the characteristic shape of the permutationreduced configurational space density should allow one to take advantage of specialized fit functions.Our PRPCA method thus opens the route to further systematic entropy estimate improvements without the computational burden of exhaustive sampling, the main obstacle for TI and Monte Carlo-type methods.
Similar developments will enable us to apply the method to more complex solvents such as water.While the permutation reduction will be fully applicable and will provide similarly enhanced sampling, the fit function here will have to be extended to additionally include the rotational and intramolecular degrees of freedom.Finally, by transforming the solvent entropy calculation into the quasiharmonic framework, the solvation entropy of solvated macromolecules such as proteins will become accessible.
An implementation of the PRPCA method is available electronically. 32

FIG. 1 .
FIG.1.Sketch of quasiharmonic density estimates for ͑a͒ a protein fluctuating around its average structure; ͑b͒ an ideal gas in a box; ͑c͒ a solvent in a box.Shown are potential energy ͑solid lines͒, resulting configurational densities ͑dotted͒, and derived Gaussian density estimates ˜͑dashed͒.

FIG. 3 .
FIG.3.͑Color͒ Convergence of an ideal gas simulation.Shown are entropy estimates for different portions of the original trajectory ͑MD͒ as well as of the relabeled trajectories with the simple cubic structure ͑PR-SC͒ and the energy-minimized structure ͑PR-EM͒ as reference structures.The horizontal lines show the analytical results for the ideal gas entropy ͑S͒ and its Gaussian approximation ͑Schlitter, Ana method͒.Note that The PR-SC and PR-EM curves match so closely that they can hardly be distinguished in the figure at the chosen scale.

FIG. 4 .
FIG. 4. ͑Color͒ Excess entropy of a Lennard-Jones gas, estimated from the original trajectory ͑MD͒ as well as from the relabeled trajectories with the simple cubic ͑PR-SC͒ and the energy-minimized structure ͑PR-EM͒ as reference structures.Values computed by thermodynamic integration ͑TI͒ provide a reference.