Abstract
Given the need to extend current quantum-chemical treatments to large molecules with thousands of atoms, the main goal of this work was the implementation of linear scaling approaches in the context of semiempirical molecular orbital theory. At the outset of this project, three such approaches had already been suggested in the literature. We have chosen the conjugate gradient density matrix search (CG-DMS) because it employs reliable and well-established minimization procedures and offers a transparent route towards linear scaling through the use of cutoffs in the density matrix and the Fock matrix.
Three versions of the CG-DMS code have been implemented. The full-matrix versions with precomputation (ITERCG) and on-the-fly computation (DIRCG) of the required integrals serve mainly for testing purposes, whereas the sparse-matrix integral-direct version (DIRCGS) is designed for linear scaling production work on large molecules. DIRCGS employs the compressed sparse row format and subroutines for matrix operations from the public-domain SPARSKIT2 library. In systematic test calculations on small molecules without cutoffs, the three versions of the code yield identical results, which are in full agreement with the results from conventional SCF calculations using matrix diagonalization.
When starting from a diagonal initial density matrix (as commonly done in conventional semiempirical calculations), the SCF/CG-DMS approach does not converge reliably. This failure can be traced to the fact that such an initial density matrix is far from idempotent, which may cause the McWeeny idempotency transformation to behave erratically such that the number of electrons is not conserved. To circumvent such problems, an alternative option to generate an initial density matrix for large molecules has been implemented: a block-diagonal guess is assembled from non-converged fragment density matrices which are obtained by performing typically one conventional SCF iteration for user-defined fragments. This initial guess is idempotent and normalized, by construction, and leads to reliable and robust convergence in all cases considered so far.
The SCF/CG-DMS procedure involves a triply nested loop of iterations (SCF, CG, McWeeny) that are controlled by a large number of parameters. To establish recommended default values for these control parameters, systematic tests have been carried out using three sets of calculations: MNDO for 218 small organic molecules, MNDO/d for 366 small inorganic molecules, and AM1 for 47 large biochemical molecules. Default values for all options have been chosen on this basis. The tests have shown in particular that it is not efficient to converge the inner CG and McWeeny cycles tightly when the outer SCF cycles are still far from convergence: imposing a maximum number of two CG and McWeeny cycles is found to be the best choice.
For validation purposes, the sparse-matrix integral-direct code has been applied to several series of large molecules including polyglycines, water clusters, and proteins. For suitable cutoffs, the computational effort for SCF/CG-DMS calculations is found to scale linearly with molecular size, as expected. The crossover points with conventional SCF calculations occur later than in other published implementations which is probably at least partly due to the use of non-optimized sparse matrix routines from the SPARSKIT2 library. In the case of the AM1 calculations on proteins, rms charge fluctuations of the order of 0.05 e per residue are found which cannot be captured by the usual classical force fields with fixed charges.
The new linear scaling code has been applied to study reactions in the enzymes p-hydroxybenzoate hydroxylase (PHBH) and triosephosphate isomerase (TIM) which had previously been investigated in our group by combined quantum mechanical / molecular mechanical (QM/MM) methods. These single-point AM1 calculations at the available optimized QM/MM geometries would not have been possible with the conventional SCF code due to the size of these systems (7004 and 8326 atoms, respectively). The AM1 relative energies and activation barriers are consistent with the previous AM1/GROMOS and AM1/CHARMM results. An analysis of the charge distributions justifies the chosen QM/MM approach for PHBH since the total charge of the QM region remains close to the formal value of -4 e throughout the reaction. In the case of TIM, the AM1 charges from the full enzyme calculations indicate a considerable charge transfer from the QM region to the protein environment (about 0.4 e) which is however rather uniform for all species involved so that the QM/MM results may benefit from error compensation.
As a result of this thesis, a validated and working semiempirical SCF/CG-DMS code is available that exhibits linear scaling in the computational effort for large molecules. Further optimization and tuning of this code seems possible, in particular with regard to the sparse matrix library routines (especially for sparse matrix multiplication). In its present form, the code can already be used in studies on enzymes to complement corresponding QM/MM work.