Publication: Regression on Interatomic Descriptor Data: Direct Solution Strategies for Linear Regression in CPU and Memory-Constrained Environments
Abstract
Molecular-dynamics (MD) simulations integrate Newton’s equations of motion to produce atom-by-atom trajectories, from which quantities such as pressure, temperature, and free energy are extracted for various experiments. In any simulated atomic system, the energy is both the most crucial quantity and the most challenging one to estimate. Machine-learning interatomic potentials (MLIPs) address this by fitting models to the system’s definitive functions, which are then used to make fast inferences about the system’s energy. The accuracy of these models depends on domain-specific hyperparameter optimization, which is slow because of the complex deep neural networks involved. In this work, we propose a new interatomic descriptor, quadratic ACE (qACE), which surpasses the neural network’s accuracy. We then explore and benchmark ways to fit a linear regression model to this computationally demanding descriptor in CPU- and memory-constrained environments. Several strategies for solving the regression problem are explored, including data reduction techniques and parallel processing. By leveraging Dask’s task scheduling infrastructure, we compute the direct least-squares regression on a 27 GB dataset in under five minutes, demonstrating both scalability and computational efficiency. Overall, this work demonstrates that efficient feature engineering, combined with lightweight parallel regression strategies, can substitute for deep models without sacrificing accuracy or scalability.
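As a rough illustration of what a direct least-squares solve can look like when the design matrix does not fit in memory, the sketch below accumulates the normal equations one row block at a time. It is not the Dask pipeline described in the abstract: plain NumPy stands in for the task scheduler, the data are synthetic, and the names `chunked_least_squares` and `synthetic_chunks` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def chunked_least_squares(chunks, n_features, ridge=0.0):
    """Solve min ||X w - y||^2 from row blocks via the normal equations.

    Only the (n_features x n_features) Gram matrix and one row block are
    held in memory at a time. `ridge` is an optional Tikhonov term
    (an assumption of this sketch, not something taken from the abstract).
    """
    gram = np.zeros((n_features, n_features))
    moment = np.zeros(n_features)
    for X_chunk, y_chunk in chunks:
        gram += X_chunk.T @ X_chunk      # accumulate X^T X
        moment += X_chunk.T @ y_chunk    # accumulate X^T y
    gram += ridge * np.eye(n_features)
    return np.linalg.solve(gram, moment)


def synthetic_chunks(w_true, n_chunks=8, rows_per_chunk=1000, seed=0):
    """Yield noise-free (X, y) row blocks; a stand-in for descriptor data."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.standard_normal((rows_per_chunk, w_true.size))
        yield X, X @ w_true


w_true = np.array([1.5, -2.0, 0.5, 3.0])
w_fit = chunked_least_squares(synthetic_chunks(w_true), n_features=w_true.size)
print(np.allclose(w_fit, w_true))
```

Because each block contributes independently to the two accumulators, the partial sums could in principle be computed in parallel and then reduced. Note that forming the normal equations squares the condition number of the problem, so a QR-based solve may be preferable for ill-conditioned descriptor matrices.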
