\documentclass[aip,jcp,reprint,noshowkeys,superscriptaddress]{revtex4-1}
%\usepackage{graphicx,dcolumn,bm,xcolor,microtype,multirow,amscd,amsmath,amssymb,amsfonts,physics,longtable,wrapfig,bbold,siunitx,xspace}
\usepackage{graphicx,xcolor,physics,siunitx,xspace}
\usepackage[version=4]{mhchem}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{hyperref}
\hypersetup{
colorlinks,
linkcolor={red!50!black},
citecolor={red!70!black},
urlcolor={red!80!black}
}
\DeclareSIUnit{\bohr}{\text{\ensuremath{a_{0}}}}
\DeclareSIUnit{\hartree}{\text{\ensuremath{E_{\textup{h}}}}}
\newcommand{\mc}{\multicolumn}
\newcommand{\fnm}{\footnotemark}
\newcommand{\fnt}{\footnotetext}
\newcommand{\tabc}[1]{\multicolumn{1}{c}{#1}}
\newcommand{\titou}[1]{\textcolor{red}{#1}}
\newcommand{\toto}[1]{\textcolor{blue}{#1}}
\newcommand{\joonho}[1]{\textcolor{purple}{#1}}
\newcommand{\trashPFL}[1]{\textcolor{red}{\sout{#1}}}
\newcommand{\PFL}[1]{\titou{(\underline{\bf PFL}: #1)}}
\newcommand{\AS}[1]{\toto{(\underline{\bf AS}: #1)}}
\newcommand{\JL}[1]{\joonho{(\underline{\bf JL}: #1)}}
\usepackage{listings}
\definecolor{codegreen}{rgb}{0.58,0.4,0.2}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.25,0.35,0.55}
\definecolor{codeblue}{rgb}{0.30,0.60,0.8}
\definecolor{backcolour}{rgb}{0.98,0.98,0.98}
\definecolor{mygray}{rgb}{0.5,0.5,0.5}
\definecolor{sqred}{rgb}{0.85,0.1,0.1}
\definecolor{sqgreen}{rgb}{0.25,0.65,0.15}
\definecolor{sqorange}{rgb}{0.90,0.50,0.15}
\definecolor{sqblue}{rgb}{0.10,0.3,0.60}
\lstdefinestyle{mystyle}{
backgroundcolor=\color{backcolour},
commentstyle=\color{codegreen},
keywordstyle=\color{codeblue},
stringstyle=\color{codepurple},
basicstyle=\ttfamily\footnotesize,
breakatwhitespace=false,
breaklines=true,
captionpos=b,
keepspaces=true,
numbers=left,
numbersep=5pt,
numberstyle=\ttfamily\tiny\color{mygray},
showspaces=false,
showstringspaces=false,
showtabs=false,
tabsize=2
}
\newcolumntype{d}{D{.}{.}{-1}}
\lstset{style=mystyle}
% addresses
\newcommand{\LCPQ}{Laboratoire de Chimie et Physique Quantiques (UMR 5626), Universit\'e de Toulouse, CNRS, UPS, France}
\newcommand{\Vienna}{Vienna}
\begin{document}
\title{Stochastically accelerated perturbative triples correction in coupled cluster calculations}
% Alphabetic order
\author{Yann \surname{Damour}}
\affiliation{\LCPQ}
\author{Alejandro \surname{Gallo}}
\affiliation{\Vienna}
\author{Andreas \surname{Irmler}}
\affiliation{\Vienna}
\author{Andreas \surname{Gr\"uneis}}
\affiliation{\Vienna}
\author{Anthony \surname{Scemama}}
\email{scemama@irsamc.ups-tlse.fr}
\affiliation{\LCPQ}
\begin{abstract}
We introduce a novel algorithm that leverages stochastic sampling
techniques to approximate the perturbative triples correction in the
coupled-cluster (CC) framework.
By combining elements of randomness and determinism, our algorithm
achieves a favorable balance between accuracy and computational cost.
The main advantage of this algorithm is that the calculation can be
stopped at any time, providing an unbiased estimate with a
statistical error that vanishes as the exact calculation is
approached.
We provide evidence that our semi-stochastic algorithm achieves
substantial computational savings compared to traditional
deterministic methods.
Specifically, we demonstrate that a precision of \SI{0.5}{\milli\hartree} can
be attained with only 10\% of the computational effort required by the
full calculation.
This work opens up new avenues for efficient and accurate
computations, enabling investigations of complex molecular systems
that were previously computationally prohibitive.
\bigskip
%\begin{center}
% \boxed{\includegraphics[width=0.5\linewidth]{TOC}}
%\end{center}
%\bigskip
\end{abstract}
\maketitle
%=================================================================%
\section{Introduction}
\label{sec:introduction}
Coupled cluster (CC) theory is a powerful quantum mechanical approach widely used in computational chemistry and physics to describe the electronic structure of atoms, molecules, and materials.
It offers a systematic and rigorous framework for accurate predictions of molecular properties and reactions by accounting for electron correlation effects beyond the mean-field approximation.
Among the variants of the CC method, CCSD(T) stands as the gold standard of quantum chemistry.
CCSD(T), which incorporates singles, doubles, and a perturbative treatment of triples, has demonstrated exceptional accuracy and reliability, making it one of the preferred choices for benchmark calculations and highly accurate predictions.
The CCSD(T) method has found successful applications in a diverse range of areas, including spectroscopy,\cite{villa_2011,watson_2016,vilarrubias_2020} reaction kinetics,\cite{dontgen_2015,castaneda_2012} and materials design,\cite{zhang_2019} and has played a pivotal role in advancing our understanding of complex chemical phenomena.
In the context of CC theory, perturbative triples represent an important contribution to the accuracy of electronic structure calculations.\cite{stanton_1997}
However, the computational cost associated with their calculation can be prohibitively high, especially for large molecular systems.
The CCSD(T) method is known to have a computational scaling of $\order{N^7}$, where $N$ represents the system size.
This scaling can rapidly become impractical, posing significant challenges in terms of computational resources and time requirements.
To address this computational bottleneck, our goal is to develop a novel semi-stochastic algorithm that reduces the computational time to a level smaller than or comparable to that of the CCSD method, which scales as $\order{N^6}$, while keeping the approximations well controlled.
Our algorithm strikes a balance between computational efficiency and
accuracy, making calculations for larger basis sets more feasible without compromising precision.
By incorporating stochastic sampling techniques, our approach provides an alternative avenue for approximating perturbative triples, relieving the computational burden inherent in traditional deterministic methods. This not only reduces the computational time to a more favorable level but also preserves the parallelism capabilities of CC calculations, ensuring efficient utilization of computational resources.
In the following sections of this paper, we will provide a brief introduction to the computation of perturbative triples in coupled cluster theory. We will explain the principles of our semi-stochastic algorithm, outlining its key features and advantages. Additionally, we will present implementation details, discussing the technical aspects and considerations involved in the algorithm's practical realization. To demonstrate the effectiveness and applicability of our approach, we finally present illustrative examples that showcase the algorithm's performance and compare it with the conventional algorithm.
%=================================================================%
\section{Theoretical Background}
\label{sec:theory}
%a. Brief overview of coupled cluster theory
%b. Perturbative triples in coupled cluster theory
%c. Challenges and limitations of traditional approaches
The perturbative triples correction,
\begin{equation}
E_{(T)} = \sum_{ijk\,abc} E_{ijk}^{abc},
\end{equation}
is a sum of $N_{\text{o}}^3 \times N_{\text{v}}^3$ terms, where $N_{\text{o}}$ and $N_{\text{v}}$ denote the numbers of occupied and virtual molecular orbitals, respectively.
Each individual term is expressed as
\begin{equation}
E_{ijk}^{abc} = \frac{(4 W_{ijk}^{abc} +
W_{ijk}^{bca} + W_{ijk}^{cab})
(V_{ijk}^{abc} - V_{ijk}^{cba})}{\epsilon_i + \epsilon_j + \epsilon_k -
\epsilon_a - \epsilon_b - \epsilon_c},
\end{equation}
and depends on the canonical orbital energies $\epsilon$, and on the tensors $W$ and $V$.
The indices $i,j,k$ and $a,b,c$ refer to occupied and virtual orbitals, respectively.
The bottleneck of the perturbative triples correction is the computation of the $W$ tensor,
which requires $\order{N_{\text{o}}^3 \times N_{\text{v}}^4}$ operations. Fortunately, most of
the operations involved in the computation of $W$ can be recast into matrix
multiplications, which are among the most efficient operations that can be
executed on modern CPUs and
accelerators.\cite{ma_2011,haidar_2015,dinapoli_2014,springer_2018}
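
For concreteness, the following Python sketch is a direct, unoptimized transcription of the two expressions above, with the $W$ and $V$ tensors and the orbital energies assumed to be given as plain arrays. It serves only to fix the conventions: an efficient implementation recasts the inner contractions as matrix multiplications rather than looping over all six indices.
\begin{lstlisting}[language=Python]
import numpy as np

def triples_energy(W, V, eps_o, eps_v):
    """Direct evaluation of E_(T); illustration only.
    W, V: arrays of shape (No, No, No, Nv, Nv, Nv), indexed [i,j,k,a,b,c].
    eps_o, eps_v: occupied and virtual orbital energies."""
    No, Nv = len(eps_o), len(eps_v)
    E = 0.0
    for i in range(No):
      for j in range(No):
        for k in range(No):
          e_occ = eps_o[i] + eps_o[j] + eps_o[k]
          for a in range(Nv):
            for b in range(Nv):
              for c in range(Nv):
                num = (4.0 * W[i,j,k,a,b,c]
                       + W[i,j,k,b,c,a]
                       + W[i,j,k,c,a,b]) * (V[i,j,k,a,b,c] - V[i,j,k,c,b,a])
                E += num / (e_occ - eps_v[a] - eps_v[b] - eps_v[c])
    return E
\end{lstlisting}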
%=================================================================%
\section{Semi-Stochastic Algorithm}
\label{sec:algorithm}
%a. Explanation of the algorithm's main principles and methodology
% - Rewriting with nV in outer-most loops
% - Implies summing all permutations of (a,b,c) to avoid extra dgemms
% - Comparison in number of Gflops to theory
%b. Description of the stochastic sampling procedure
% - Memoization
% - Importance function
% - Semi-stochastic algorithm
%c. Discussion of the algorithm's advantages and potential trade-offs
%d. Detailed pseudocode or algorithmic steps, if applicable
\subsection{Stochastic formulation}
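In its generic form, the semi-stochastic estimator acts on the contributions of individual virtual-orbital triplets, $E_{abc} = \sum_{ijk} E_{ijk}^{abc}$, each of which requires building the sub-tensor $W^{abc}$. The set of triplets is split into a deterministic part $\mathcal{D}$, whose members are all computed, and a stochastic complement, whose members are sampled with probabilities $p_{abc}$ derived from an importance function (the specific importance function and the memoization of already-computed triplets are implementation choices):
\begin{equation}
E_{(T)} \approx \sum_{(a,b,c) \in \mathcal{D}} E_{abc} + \frac{1}{M} \sum_{m=1}^{M} \frac{E_{a_m b_m c_m}}{p_{a_m b_m c_m}},
\end{equation}
where the $M$ samples $(a_m, b_m, c_m)$ are drawn from the complement of $\mathcal{D}$.
Since the expected value of $E_{abc}/p_{abc}$ equals the exact sum of the remaining contributions, the estimate is unbiased at any point of the calculation, and its statistical error vanishes as $\mathcal{D}$ grows to encompass all triplets.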
\subsection{Test code}
\label{subsec:test_code}
% Include the test code here, if applicable.
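A minimal, self-contained test of such an estimator, written in Python for illustration, is sketched below. The contributions and the importance function are synthetic (in a realistic setting the sampling weights must come from an inexpensive importance function rather than from the exact values); the test checks that the semi-stochastic estimate reproduces the exact sum within its statistical error.
\begin{lstlisting}[language=Python]
import numpy as np

rng = np.random.default_rng(0)

def semi_stochastic_sum(contrib, weights, n_det, n_samples):
    """Unbiased semi-stochastic estimate of contrib.sum().
    contrib  : individual contributions E_abc (synthetic here).
    weights  : importance function w_abc > 0.
    n_det    : number of largest-weight terms computed deterministically.
    n_samples: number of stochastic samples of the remainder."""
    order = np.argsort(weights)[::-1]       # largest weights first
    det, sto = order[:n_det], order[n_det:]
    e_det = contrib[det].sum()              # deterministic part
    p = weights[sto] / weights[sto].sum()   # sampling probabilities
    idx = rng.choice(len(sto), size=n_samples, p=p)
    x = contrib[sto][idx] / p[idx]          # each x is unbiased for the rest
    return e_det + x.mean(), x.std(ddof=1) / np.sqrt(n_samples)

# Synthetic contributions spanning several orders of magnitude
E = rng.standard_normal(100_000) * np.exp(-rng.uniform(0.0, 10.0, 100_000))
w = np.abs(E) + 1e-12                       # idealized importance function
est, err = semi_stochastic_sum(E, w, n_det=1_000, n_samples=10_000)
print(f"exact {E.sum():.6f}  estimate {est:.6f} +/- {err:.6f}")
\end{lstlisting}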
%=================================================================%
\section{Implementation Details}
\label{sec:implementation}
%a. Description of the computational framework and software used
%b. Discussion of any specific optimizations or parallelization techniques employed
% - Explain that form_w and form_v can be entirely produced by dgemm
% - Show implementation with OpenMP tasks
%c. Any relevant technical considerations or limitations
% - Limitations: memory because in-core algorithm.
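As a schematic illustration of the matrix-multiplication formulation mentioned in Sec.~\ref{sec:theory}, the dominant contraction entering a given $W^{abc}$ multiplies an $N_\text{o}^2 \times N_\text{v}$ matrix of doubles amplitudes by an $N_\text{v} \times N_\text{o}$ slice of integrals, i.e.\ a single dgemm. The NumPy sketch below uses illustrative index conventions and random arrays standing in for the amplitudes and integrals:
\begin{lstlisting}[language=Python]
import numpy as np

No, Nv = 15, 243                 # benzene / cc-pVTZ, frozen core
t2 = np.random.rand(No * No, Nv) # amplitudes, matricized over (i,j)
v3 = np.random.rand(Nv, No)      # integral slice for a fixed pair (b,c)
w_block = t2 @ v3                # one dgemm -> an No^2 x No block of W^{abc}
\end{lstlisting}
Each $(a,b,c)$ triplet gives rise to an independent set of such multiplications, which makes a task-based OpenMP parallelization natural.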
%=================================================================%
\section{Numerical experiments}
% + Benzene TZ/QZ
% - Streptocyanine QZ: Small molecule in a large basis set
% - Caffeine def2-svp: Large molecule in a small basis set
% + Vibrational frequency of CuCl/cc-pvqz
% + Measure flops and compare to the peak
%c. Analysis of the algorithm's accuracy, efficiency, and scalability
%d. Discussion of any observed limitations or challenges
\subsection{Convergence of the statistical error in benzene}
In this section we illustrate the convergence of the statistical error of the perturbative triples correction as a function of the computational cost.
The benzene molecule serves as our reference system for conducting frozen-core CCSD(T) calculations with the cc-pVTZ and cc-pVQZ basis sets.
This corresponds to correlating 30 electrons in 258 and 503 molecular orbitals, respectively.
The calculations were performed on an Intel Xeon Gold 6130 dual socket server (32 cores in total).
\begin{figure}
\includegraphics[width=\columnwidth]{benzene_tz.pdf}
\includegraphics[width=\columnwidth]{benzene_qz.pdf}
\caption{\label{fig:benzene} Energy convergence of benzene plotted against the program execution time, comparing the cc-pVTZ (upper panel) and cc-pVQZ (lower panel) basis sets. The blue lines indicate the exact CCSD(T) energies.}
\end{figure}
\begin{figure}
\includegraphics[width=\columnwidth]{benzene_err.pdf}
\caption{\label{fig:benzene_err} Convergence of the statistical error of the perturbative triples contribution in benzene as a function of the percentage of computed contributions, for both cc-pVTZ and cc-pVQZ basis sets.}
\end{figure}
Figure~\ref{fig:benzene} shows the convergence of the CCSD(T) energy as a function of the program execution time using the two basis sets.
Notably, the exact CCSD(T) energy always falls within $2\sigma$, affirming the reliability of the statistical error.
Figure~\ref{fig:benzene_err} displays the statistical error as a function of the percentage of computed contributions.
Noteworthy in both figures are the curve discontinuities, attributable to readjustments in the separation between the deterministic and stochastic components of the calculation.
These updates lead to revised estimates and a diminution in statistical error.
Achieving chemical accuracy, defined as \SI{1.6}{\milli\hartree}, necessitates less than 1\% of the total contributions in both basis sets.
Attaining a \SI{0.1}{\milli\hartree} precision level requires computation of 32\% and 15\% of the contributions for cc-pVTZ and cc-pVQZ, respectively.
The more rapid convergence observed with the larger basis set aligns with expectations, as expanding the basis set tends to increase the proportion of minor contributions while maintaining a relatively steady count of significant contributions.
This trend underscores the algorithm's enhanced suitability for systems with fewer electrons and extensive basis sets, as opposed to larger electron counts in smaller basis sets.
\subsection{Vibrational frequency of copper chloride}
Our methodology proves especially advantageous for scenarios requiring the aggregation of numerous CCSD(T) energies, such as neural network training or the exploration of potential energy surfaces.
In this section, we apply our algorithm to the computation of vibrational frequencies, through the example of copper chloride (\ce{CuCl}): we determine the equilibrium bond length and the vibrational frequency of \ce{CuCl} at the CCSD(T)/cc-pVQZ level of theory.
The procedure involves determining the CCSD(T) potential energy curve for \ce{CuCl}, followed by its analytical representation through a Morse potential fitting:
\begin{equation}
E(r) = D_e \left( 1 - e^{-a (r - r_e)} \right)^2 + E_0
\end{equation}
where $E(r)$ represents the energy at a bond length $r$, $D_e$ the depth of the potential well, $r_e$ the equilibrium bond length, $a$ the parameter defining the potential well's width, and $E_0$ the energy at the equilibrium bond length. The vibrational frequency, $\nu$, is derived as follows:
\begin{equation}
\nu = \frac{1}{2 \pi c} \sqrt{\frac{2D_e a^2}{\mu}}
\end{equation}
with $\mu$ denoting the reduced mass of the \ce{CuCl} molecule, and $c$ the speed of light.
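
As an illustration of this procedure, the following Python sketch fits the Morse potential and extracts $\nu$. The data points are synthetic stand-ins for the semi-stochastic CCSD(T) energies and their error bars, and the Morse parameters used to generate them are purely illustrative.
\begin{lstlisting}[language=Python]
import numpy as np
from scipy.optimize import curve_fit
from scipy import constants

def morse(r, De, a, re, E0):
    return De * (1.0 - np.exp(-a * (r - re)))**2 + E0

# Synthetic (r, E) points in bohr/hartree with ~2 mEh statistical errors;
# in practice these are the semi-stochastic CCSD(T) energies.
rng = np.random.default_rng(1)
r = np.linspace(3.4, 4.6, 9)
sigma = np.full_like(r, 2.0e-3)
E = morse(r, 0.10, 0.9, 3.92, -2099.0) + rng.normal(0.0, sigma)

(De, a, re, E0), _ = curve_fit(morse, r, E, sigma=sigma,
                               p0=[0.1, 1.0, 3.9, E.min()])

# Harmonic frequency in cm^-1 from the fitted Morse parameters
hartree = constants.value("Hartree energy")                  # J
bohr = constants.value("Bohr radius")                        # m
mu = 62.9296 * 34.9689 / (62.9296 + 34.9689) * constants.u   # 63Cu35Cl, kg
nu = (np.sqrt(2.0 * De * hartree * (a / bohr)**2 / mu)
      / (2.0 * np.pi * constants.c * 100.0))                 # c in cm/s
print(f"r_e = {re:.3f} bohr   nu = {nu:.1f} cm^-1")
\end{lstlisting}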
\begin{figure}
\includegraphics[width=\columnwidth]{cucl.pdf}
\caption{\label{fig:cucl} CCSD(T) energies of CuCl obtained with the exact CCSD(T) algorithm (dots), the stochastic algorithm using only 1\% of the contributions (error bars), and the Morse potential fitting the points obtained with the stochastic algorithm.}
\end{figure}
The initial step involved the precise calculation of the CCSD(T) energy across various points along the potential curve.
We froze the six lowest molecular orbitals, specifically the $1s$ orbital of \ce{Cl} and the $1s$, $2s$, and $2p$ orbitals of \ce{Cu}, and correlated 34 electrons within 157 molecular orbitals.
The fitted Morse potential revealed a vibrational frequency of $\nu = \SI{414.7}{\per\centi\meter}$ and an equilibrium bond length of $r_e = \SI{3.92}{\bohr}$, aligning remarkably well with the experimental values from the NIST database\cite{nist_2022} ($\nu = \SI{417.6}{\per\centi\meter}$ and $r_e = \SI{3.88}{\bohr}$).
Subsequently, we applied our semi-stochastic algorithm to estimate the perturbative triples correction, utilizing merely 1\% of the total contributions.
This approach yielded a hundredfold acceleration of the computation, with statistical uncertainties of \SIrange{1.3}{2.5}{\milli\hartree}.
The vibrational frequency and equilibrium distance estimated from these data, $\nu = \SI{415.0}{\per\centi\meter}$ and $r_e = \SI{3.92}{\bohr}$, are in close agreement with the full computational results.
Figure \ref{fig:cucl} illustrates the potential energy surface of \ce{CuCl}, displaying both the exact CCSD(T) energies and those estimated via the semi-stochastic method.
\subsection{Performance analysis}
The bottleneck of the proposed algorithm is the creation of the sub-tensor $W^{abc}$ for each given $(a,b,c)$ triplet.
We have mentioned in section~\ref{sec:theory} that this operation could be recast into matrix multiplications, leading to a high efficiency of our implementation.
We have measured the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on an AMD EPYC 7513 dual-socket server (64 cores at \SI{2.6}{\giga \hertz}) and on an
Intel Xeon Gold 6130 dual-socket server (32 cores at \SI{2.1}{\giga \hertz}).
The code was linked with the Intel MKL library for BLAS operations.
The average number of floating-point operations per second (Flop/s) was measured in the section of the code dedicated to the computation of the perturbative triples correction.
We have also run the code on an ARM Q80 server (80 cores at \SI{2.8}{\giga\hertz}). As the hardware performance counters were not available to Likwid on this machine, we estimated the Flop/s rate by comparing the total execution time of the perturbative triples correction with the time measured on the AMD CPU.
The code was linked with the ArmPL library for BLAS operations.
\begin{table}
\begin{ruledtabular}
\begin{tabular}{lccccc}
CPU & $N_{\text{cores}}$ & $V$ & $F$ & Peak DP & Measured \\
& & & (GHz) & (GFlop/s) & (GFlop/s) \\
\hline
EPYC 7513 & 64 & 4 & 2.6 & 2~662 & 1~576 \\
Xeon Gold 6130 & 32 & 8 & 2.1 & 2~150 & 667 \\ % 239.891
ARM Q80 & 80 & 2 & 2.8 & 1~792 & 547 \\ % 292.492
\end{tabular}
\end{ruledtabular}
\caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
\end{table}
Table~\ref{tab:flops} shows the results of these tests.
The peak performance is obtained by counting the maximum number of Flop/s that can be achieved on the CPU:
\begin{equation}
P = N_{\text{cores}} \times N_{\text{FMA}} \times 2 \times V \times F
\end{equation}
where $N_{\text{cores}}$ is the number of cores, $N_{\text{FMA}}$ is the number of vector FMA units per core (all considered CPUs have two), the factor of 2 accounts for the fact that a fused multiply-add (FMA) operation counts as two flops, $V$ is the number of double-precision elements in a vector register, and $F$ is the clock frequency.
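
Plugging in the values of Table~\ref{tab:flops} reproduces its peak column, as the short Python check below shows:
\begin{lstlisting}[language=Python]
# Peak DP throughput: P = Ncores * N_FMA * 2 * V * F, with N_FMA = 2
for cpu, ncores, v, f in [("EPYC 7513", 64, 4, 2.6),
                          ("Xeon Gold 6130", 32, 8, 2.1),
                          ("ARM Q80", 80, 2, 2.8)]:
    print(f"{cpu}: {ncores * 2 * 2 * v * f:.0f} GFlop/s")
\end{lstlisting}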
The Xeon and ARM CPUs both reach around 30\% of the peak performance, while the AMD EPYC CPU is twice as efficient, reaching 60\% of the peak.
The relatively low performance of 30\% is due to the small sizes of the matrices: the largest matrix multiplications in the computation of a task involve a matrix of size $N_\text{o}^2 \times N_\text{v}$ and a matrix of size $N_\text{v} \times N_\text{o}$ to produce an $N_\text{o}^2 \times N_\text{o}$ matrix.
Such matrix multiplications have an arithmetic intensity below $N_\text{o} / 4$ flops/byte. In the case of benzene in the triple-zeta basis set, the arithmetic intensity is 3.52 flops/byte.
Such a value is not sufficient to reach the peak performance on any of these CPUs. From the memory bandwidth and the peak double-precision throughput, one can determine the critical arithmetic intensity needed to reach the peak performance.
On the Xeon and on the ARM CPUs, we obtain respectively 8.39 and 8.75~flops/byte as critical values.
On the EPYC CPU, we obtain a value of 6.50 flops/byte thanks to its high memory bandwidth.
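
These numbers can be reproduced as follows. The flop and byte counts assume that only the two input matrices of the $(N_\text{o}^2 \times N_\text{v}) \times (N_\text{v} \times N_\text{o})$ product are streamed from memory, and the memory bandwidths are the nominal values we assume for these machines, which reproduce the quoted critical intensities to within rounding:
\begin{lstlisting}[language=Python]
No, Nv = 15, 243                          # benzene, cc-pVTZ, frozen core
flops = 2.0 * No**3 * Nv                  # one No^2 x Nv by Nv x No dgemm
nbytes = 8.0 * (No * No * Nv + Nv * No)   # two input matrices, 8 B/double
print(f"AI = {flops / nbytes:.2f} flops/byte")        # 3.52

# Critical intensity AI* = peak / bandwidth (nominal bandwidths in GB/s)
for cpu, peak, bw in [("EPYC 7513", 2662.4, 409.6),
                      ("Xeon Gold 6130", 2150.4, 256.0),
                      ("ARM Q80", 1792.0, 204.8)]:
    print(f"{cpu}: AI* = {peak / bw:.2f} flops/byte")
\end{lstlisting}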
%%%
\section{Conclusion}
\label{sec:conclusion}
%a. Summary of the algorithm and its advantages
%b. Recapitulation of the key findings and contributions
%c. Final remarks and encouragement for further research
%
% Works better for few electrons in large basis sets
% Interesting for ML or PES exploration
%=================================================================%
%%%%%%%%%%%%%%%%%%%%%%%
\acknowledgements{
This work was supported by the European Centre of Excellence in Exascale Computing TREX --- Targeting Real Chemical Accuracy at the Exascale.
This project has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No.~952165.
This work was performed using HPC resources from CALMIP (Toulouse) under allocations p18005 and p22001.}
%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section*{Data availability statement}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The data that support the findings of this study are available within the article and its supplementary material.
%%%%%%%%%%%%%%%%%%%%%%%
\bibliography{stochastic_triples}
%%%%%%%%%%%%%%%%%%%%%%%
\end{document}