Faster Self-Consistent Field (SCF) Calculations on GPU Clusters
A novel implementation of the self-consistent field (SCF) procedure specifically designed for high-performance execution on multiple graphics processing units (GPUs) is presented. The algorithm offloads to GPUs the three major computational stages of the SCF: the calculation of one-electron integrals, the calculation and digestion of electron repulsion integrals, and the diagonalization of the Fock matrix, including SCF acceleration via direct inversion in the iterative subspace (DIIS). Performance results for a variety of test molecules and basis sets show remarkable speedups relative to the state-of-the-art parallel GAMESS CPU code and to other widely used GPU codes, for both single- and multi-GPU execution. The new code outperforms all existing multi-GPU implementations when using eight V100 GPUs, with speedups relative to TeraChem ranging from 1.2× to 3.3×, and speedups over QUICK of up to 28× on one GPU and 15× on eight GPUs. Strong scaling calculations show nearly ideal scalability up to 8 GPUs while retaining high parallel efficiency for up to 18 GPUs.
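For orientation, the DIIS acceleration step mentioned above can be sketched in a few lines. The following is a minimal, generic Pulay DIIS extrapolation in NumPy with synthetic stand-in matrices; it illustrates the linear algebra involved, not the paper's GPU implementation.

import numpy as np

def diis_extrapolate(fock_list, error_list):
    """Solve the Pulay DIIS linear system and return the extrapolated Fock matrix."""
    n = len(fock_list)
    B = -np.ones((n + 1, n + 1))   # bordered matrix; last row/column enforce sum(c) = 1
    B[n, n] = 0.0
    for i in range(n):
        for j in range(n):
            B[i, j] = np.vdot(error_list[i], error_list[j])
    rhs = np.zeros(n + 1)
    rhs[n] = -1.0
    coeffs = np.linalg.solve(B, rhs)[:n]
    return sum(c * F for c, F in zip(coeffs, fock_list))

# Toy usage: random matrices stand in for stored Fock matrices and the commutator
# errors e_k = F_k D_k S - S D_k F_k, which vanish at SCF convergence.
rng = np.random.default_rng(0)
focks = [rng.normal(size=(4, 4)) for _ in range(3)]
errors = [0.1**k * rng.normal(size=(4, 4)) for k in range(1, 4)]
F_next = diis_extrapolate(focks, errors)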
Scaling the Hartree-Fock Matrix Build on Summit
The use of Graphics Processing Units (GPUs) has become strategic for simulating the chemistry of large molecular systems, with the majority of top supercomputers utilizing GPUs as their main source of computational horsepower. In this paper, a new fragmentation-based Hartree-Fock matrix build algorithm designed to scale on many-GPU architectures is presented. The new algorithm uses a novel dynamic load balancing scheme based on a binned shell-pair container to distribute batches of significant shell quartets with the same code path to different GPUs. This maximizes computational throughput, improves load balance, and eliminates the GPU thread divergence otherwise caused by integral screening. Additionally, the code uses a novel Fock digestion algorithm to contract electron repulsion integrals into the Fock matrix, which exploits all forms of permutational symmetry and eliminates thread synchronization requirements. The implementation demonstrates excellent scalability on the Summit supercomputer, achieving good strong scaling performance up to 4096 nodes and linear weak scaling up to 612 nodes.
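As a rough illustration of the binning idea (function names and data layout below are assumptions for the sketch, not the paper's code), shell pairs that survive Schwarz screening can be grouped by angular-momentum class so that every dispatched batch of shell quartets follows a single integral code path:

from collections import defaultdict
import itertools

def bin_shell_pairs(shell_ls, schwarz, threshold=1e-10):
    """Group significant shell pairs by their (l_bra, l_ket) angular-momentum class."""
    bins = defaultdict(list)
    for a, b in itertools.combinations_with_replacement(range(len(shell_ls)), 2):
        if schwarz[a][b] > threshold:            # Cauchy-Schwarz screening of the pair
            bins[(shell_ls[a], shell_ls[b])].append((a, b))   # e.g. (0,0)=ss, (0,1)=sp
    return bins

def quartet_batches(bins, batch_size=1024):
    """Yield batches of shell quartets; all quartets in a batch share one code path.
    (Bra/ket permutational symmetry is ignored here for brevity.)"""
    for bra_class, bra_pairs in bins.items():
        for ket_class, ket_pairs in bins.items():
            batch = []
            for bra in bra_pairs:
                for ket in ket_pairs:
                    batch.append((bra, ket))
                    if len(batch) == batch_size:
                        yield (bra_class, ket_class), batch
                        batch = []
            if batch:
                yield (bra_class, ket_class), batch

# Toy usage: four shells with angular momenta s, s, p, p and a dummy Schwarz matrix.
shell_ls = [0, 0, 1, 1]
schwarz = [[1.0] * 4 for _ in range(4)]
for (bra_cls, ket_cls), batch in quartet_batches(bin_shell_pairs(shell_ls, schwarz), 8):
    pass  # each batch would be handed to the GPU integral kernel for its class

In a many-GPU setting, batches like these would sit on a work queue and be claimed dynamically by GPUs, which is one way to realize the dynamic load balancing described above.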
Enabling large-scale correlated electronic structure calculations: scaling the RI-MP2 method on Summit
Second-order Møller-Plesset perturbation theory using the Resolution-of-the-Identity approximation (RI-MP2) is a state-of-the-art approach to accurately estimate many-body electronic correlation effects. This is critical for predicting the physicochemical properties of complex molecular systems; however, the scale of these calculations is limited by their extremely high computational cost. In this paper, a novel many-GPU algorithm and implementation of a molecular-fragmentation-based RI-MP2 method are presented that enable correlated calculations on over 180,000 electrons and 45,000 atoms in 12 minutes using up to the entire Summit supercomputer. The implementation demonstrates remarkable speedups with respect to other current GPU and CPU codes, excellent strong scalability on Summit with 89.1% parallel efficiency on 4600 nodes, and nearly ideal weak scaling up to 612 nodes. This work makes ab initio correlated quantum chemistry calculations feasible on significantly larger molecular scales than before, on both large supercomputing systems and commodity clusters, with the potential for major impact on progress in the chemical, physical, biological, and engineering sciences.
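For reference, the quantity being evaluated is the standard closed-shell RI-MP2 correlation energy. The toy NumPy sketch below uses synthetic three-index RI factors and orbital energies; it shows the GEMM-dominated structure of the kernel but is not the fragmentation-based GPU code described in the paper.

import numpy as np

# B[P, i, a] are three-index RI factors with (ia|jb) ≈ sum_P B[P, i, a] * B[P, j, b].
rng = np.random.default_rng(0)
nocc, nvirt, naux = 4, 8, 32
B = 0.01 * rng.normal(size=(naux, nocc, nvirt))
eps_occ = np.sort(rng.uniform(-2.0, -0.5, size=nocc))
eps_virt = np.sort(rng.uniform(0.5, 2.0, size=nvirt))

e_corr = 0.0
for i in range(nocc):
    for j in range(nocc):
        # (ia|jb) for all a, b at fixed i, j: one small matrix product, which is
        # exactly the GEMM-heavy workload that maps well onto GPUs.
        iajb = np.einsum('Pa,Pb->ab', B[:, i, :], B[:, j, :])
        denom = eps_occ[i] + eps_occ[j] - eps_virt[:, None] - eps_virt[None, :]
        e_corr += np.sum(iajb * (2.0 * iajb - iajb.T) / denom)  # iajb.T[a, b] = (ib|ja)
print("RI-MP2 correlation energy (toy data):", e_corr)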
Toward an extreme-scale electronic structure system
Electronic structure calculations have the potential to predict key matter transformations for applications of strategic technological importance, from drug discovery to materials science and catalysis. However, a predictive physicochemical characterization of these processes often requires accurate quantum chemical modeling of complex molecular systems with hundreds to thousands of atoms. Due to the computationally demanding nature of electronic structure calculations and the complexity of modern high-performance computing hardware, quantum chemistry software has historically failed to operate at such large molecular scales with accuracy and speed that are useful in practice. In this paper, novel algorithms and software are presented that enable extreme-scale quantum chemistry capabilities, with particular emphasis on exascale calculations. These include the development and application of the multi-GPU library LibCChem 2.0 as part of the General Atomic and Molecular Electronic Structure System (GAMESS) package, and of the standalone Extreme-scale Electronic Structure System (EXESS), designed from the ground up to scale on thousands of GPUs and deliver accurate, high-performance quantum chemistry calculations at unprecedented speed and molecular scales. Among various results, we report that the EXESS implementation enables Hartree-Fock/cc-pVDZ plus RI-MP2/cc-pVDZ/cc-pVDZ-RIFIT calculations on an ionic liquid system with 623,016 electrons and 146,592 atoms in less than 45 minutes using 27,600 GPUs on the Summit supercomputer, with 94.6% parallel efficiency.
Scaling Correlated Fragment Molecular Orbital Calculations on Summit
Correlated electronic structure calculations enable an accurate prediction of the physicochemical properties of complex molecular systems; however, the scale of these calculations is limited by their extremely high computational cost. The Fragment Molecular Orbital (FMO) method is arguably one of the most effective ways to lower this computational cost while retaining predictive accuracy. In this paper, a novel distributed many-GPU algorithm and implementation of the FMO method are presented. When applied in tandem with the Hartree-Fock and RI-MP2 methods, the new implementation enables correlated calculations on 623,016 electrons and 146,592 atoms in less than 45 minutes using 99.8% of the Summit supercomputer (27,600 GPUs). The implementation demonstrates remarkable speedups with respect to other current GPU and CPU codes and excellent strong scalability on Summit, achieving 94.6% parallel efficiency on 4600 nodes. This work makes correlated quantum chemistry calculations feasible on significantly larger molecular systems than before, and with higher accuracy.
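As a schematic of how fragment energies enter a two-body (FMO2-style) total energy, the snippet below assembles monomer and dimer contributions; the numbers are placeholders, not results from the paper.

# E_total ≈ sum_I E_I + sum_{I<J} (E_IJ - E_I - E_J)
monomer_E = {1: -76.02, 2: -76.03, 3: -76.01}                  # hypothetical fragment energies
dimer_E = {(1, 2): -152.06, (1, 3): -152.04, (2, 3): -152.05}  # hypothetical dimer energies

E_total = sum(monomer_E.values()) + sum(
    E_IJ - monomer_E[I] - monomer_E[J] for (I, J), E_IJ in dimer_E.items()
)
print("FMO2-style total energy (toy numbers):", E_total)

Because the fragment and fragment-pair calculations are largely independent of one another, they can be distributed across many GPUs, which is the source of the scalability described above.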