Compare commits

...

3 Commits

Author SHA1 Message Date
Anthony Scemama c6866453d7 Parallel 2024-04-16 14:57:21 +02:00
Anthony Scemama 792ab5ceb2 Scaling 2024-04-16 14:45:18 +02:00
Anthony Scemama a83cdf77d1 Algorithm 2024-04-16 14:16:19 +02:00
8 changed files with 128 additions and 214 deletions

View File

@@ -1,20 +1,16 @@
# 1
# 2
# 4
# 8
#16
#32
#64
# TZ : AMD EPYC 7402 24-Core Processor
#1
#2
#4
#6
#12
#24
#32 123.718727827072
#48 121.613038063049
# DZ: AMD EPYC
1 266. 0.1015625
2 133.202792882919 0.203125
4 68.3963158130646 0.40625
8 35.3168480396271 0.8125
16 17.4276471138000 1.625
24 11.5433599948883
28 10.3698871135712
32 9.43897294998169 3.25 12.1150507926941
40 8.34011387825012
48 7.40271902084351
56 6.72331714630127
64 6.40971302986145 6.5
# DZ: ARM Q80

View File

@@ -1,180 +1,13 @@
#!/usr/bin/gnuplot -persist
#
#
# G N U P L O T
# Version 5.4 patchlevel 2 last modified 2021-06-01
#
# Copyright (C) 1986-1993, 1998, 2004, 2007-2021
# Thomas Williams, Colin Kelley and many others
#
# gnuplot home: http://www.gnuplot.info
# faq, bugs, etc: type "help FAQ"
# immediate help: type "help" (plot window: hit 'h')
# set terminal qt 0 font "Sans,9"
# set output
unset clip points
set clip one
unset clip two
unset clip radial
set errorbars front 1.000000
set border 31 front lt black linewidth 1.000 dashtype solid
set zdata
set ydata
set xdata
set y2data
set x2data
set boxwidth
set boxdepth 0
set style fill empty border
set style rectangle back fc bgnd fillstyle solid 1.00 border lt -1
set style circle radius graph 0.02
set style ellipse size graph 0.05, 0.03 angle 0 units xy
set dummy x, y
set format x "% h"
set format y "% h"
set format x2 "% h"
set format y2 "% h"
set format z "% h"
set format cb "% h"
set format r "% h"
set ttics format "% h"
set timefmt "%d/%m/%y,%H:%M"
set angles radians
set tics back
set grid nopolar
set grid xtics nomxtics ytics nomytics noztics nomztics nortics nomrtics \
nox2tics nomx2tics noy2tics nomy2tics nocbtics nomcbtics
set grid layerdefault lt 0 linecolor 0 linewidth 0.500, lt 0 linecolor 0 linewidth 0.500
unset raxis
set theta counterclockwise right
set style parallel front lt black linewidth 2.000 dashtype solid
set key notitle
set key fixed left top vertical Right noreverse enhanced autotitle nobox
set key noinvert samplen 4 spacing 1 width 0 height 0
set key maxcolumns 0 maxrows 0
set key noopaque
unset label
unset arrow
unset style line
unset style arrow
set style histogram clustered gap 2 title textcolor lt -1
unset object
unset walls
set style textbox transparent margins 1.0, 1.0 border lt -1 linewidth 1.0
set offsets 0, 0, 0, 0
set pointsize 1
set pointintervalbox 1
set encoding default
unset polar
unset parametric
unset spiderplot
unset decimalsign
unset micro
unset minussign
set view 60, 30, 1, 1
set view azimuth 0
set rgbmax 255
set samples 100, 100
set isosamples 10, 10
set surface
unset contour
set cntrlabel format '%8.3g' font '' start 5 interval 20
set mapping cartesian
set datafile separator whitespace
set datafile nocolumnheaders
unset hidden3d
set cntrparam order 4
set cntrparam linear
set cntrparam levels 5
set cntrparam levels auto
set cntrparam firstlinetype 0 unsorted
set cntrparam points 5
set size ratio 0 1,1
set origin 0,0
set style data points
set style function lines
unset xzeroaxis
unset yzeroaxis
unset zzeroaxis
unset x2zeroaxis
unset y2zeroaxis
set xyplane relative 0.5
set tics scale 1, 0.5, 1, 1, 1
set mxtics default
set mytics default
set mztics default
set mx2tics default
set my2tics default
set mcbtics default
set mrtics default
set nomttics
set xtics border in scale 1,0.5 mirror norotate autojustify
set xtics norangelimit autofreq
set ytics border in scale 1,0.5 mirror norotate autojustify
set ytics norangelimit autofreq
set ztics border in scale 1,0.5 nomirror norotate autojustify
set ztics norangelimit autofreq
unset x2tics
unset y2tics
set cbtics border in scale 1,0.5 mirror norotate autojustify
set cbtics norangelimit autofreq
set rtics axis in scale 1,0.5 nomirror norotate autojustify
set rtics norangelimit autofreq
unset ttics
set title ""
set title font "" textcolor lt -1 norotate
set timestamp bottom
set timestamp ""
set timestamp font "" textcolor lt -1 norotate
set trange [ * : * ] noreverse nowriteback
set urange [ * : * ] noreverse nowriteback
set vrange [ * : * ] noreverse nowriteback
#!/usr/bin/env gnuplot
set grid
set key bottom
set format y "%.1f"
set xlabel "Number of cores"
set xlabel font "" textcolor lt -1 norotate
set x2label ""
set x2label font "" textcolor lt -1 norotate
set xrange [ * : * ] noreverse writeback
set x2range [ * : * ] noreverse writeback
set ylabel "Speedup"
set ylabel font "" textcolor lt -1 rotate
set y2label ""
set y2label font "" textcolor lt -1 rotate
set yrange [ * : * ] noreverse writeback
set y2range [ * : * ] noreverse writeback
set zlabel ""
set zlabel font "" textcolor lt -1 norotate
set zrange [ * : * ] noreverse writeback
set cblabel ""
set cblabel font "" textcolor lt -1 rotate
set cbrange [ * : * ] noreverse writeback
set rlabel ""
set rlabel font "" textcolor lt -1 norotate
set rrange [ * : * ] noreverse writeback
unset logscale
unset jitter
set zero 1e-08
set lmargin -1
set bmargin -1
set rmargin -1
set tmargin -1
set locale "en_AU.UTF-8"
set pm3d explicit at s
set pm3d scansautomatic
set pm3d interpolate 1,1 flush begin noftriangles noborder corners2color mean
set pm3d clip z
set pm3d nolighting
set palette positive nops_allcF maxcolors 0 gamma 1.5 color model RGB
set palette rgbformulae 7, 5, 15
set colorbox default
set colorbox vertical origin screen 0.9, 0.2 size screen 0.05, 0.6 front noinvert bdefault
set style boxplot candles range 1.50 outliers pt 7 separation 1 labels auto unsorted
set loadpath
set fontpath
set psdir
set fit brief errorvariables nocovariancevariables errorscaling prescale nowrap v5
GNUTERM = "qt"
I = {0.0, 1.0}
VoxelDistance = 0.0
## Last datafile plotted: "scaling.dat"
plot 'scaling.dat' u 1:(740.99828964984044/$2) w lp notitle, x title "Ideal"
# EOF
set term pdfcairo enhanced font "Times,14" linewidth 2 rounded size 5.0in, 3.0in
set output 'scaling.pdf'
set pointsize 0.5
plot 'scaling.dat' i 1 u 1:(740.99828964984044/$2) w lp title "ARM Q80", \
'scaling.dat' i 0 u 1:(266./$2) w lp title "AMD EPYC", \
x title "Ideal"

BIN
Manuscript/benzene_qz.pdf Normal file

Binary file not shown.

BIN
Manuscript/benzene_tz.pdf Normal file

Binary file not shown.

BIN
Manuscript/scaling.pdf Normal file

Binary file not shown.

View File

@@ -260,3 +260,16 @@ @article{watson_2016,
publisher = {North-Holland},
doi = {10.1016/0009-2614(91)87003-T}
}
@article{garniron_2017,
author = {Garniron, Yann and Scemama, Anthony and Loos, Pierre-Fran{\c{c}}ois and Caffarel, Michel},
title = {{Hybrid stochastic-deterministic calculation of the second-order perturbative contribution of multireference perturbation theory}},
journal = {J. Chem. Phys.},
volume = {147},
number = {3},
year = {2017},
month = jul,
issn = {0021-9606},
publisher = {AIP Publishing},
doi = {10.1063/1.4992127}
}

View File

@@ -121,13 +121,20 @@ that were previously computationally prohibitive.
\section{Introduction}
\label{sec:introduction}
Coupled cluster (CC) theory is a powerful quantum mechanical approach widely used in computational chemistry and physics to describe the electronic structure of atoms, molecules, and materials.
Coupled cluster (CC) theory is an accurate quantum mechanical approach widely used in computational chemistry and physics to describe the electronic structure of atoms, molecules, and materials.
It offers a systematic and rigorous framework for accurate predictions of molecular properties and reactions by accounting for electron correlation effects beyond the mean-field approximation.
Among the various variants of the CC method, the Coupled Cluster Singles and Doubles with perturbative Triples method, CCSD(T), stands as the gold standard of quantum chemistry.
CC theory starts with a parametrized wave function, typically referred to as the CC wave function, which is expressed as an exponential series of excitation operators acting on a reference:
\begin{equation}
\ket{\Psi_{\text{CC}}} = e^{\hat{T}} \ket{\Phi}
\end{equation}
where $\ket{\Phi}$ is the reference determinant, and $\hat{T}$ is the cluster operator representing single, double, triple, and higher excitations from the reference wave function.
Coupled Cluster with Singles and Doubles (CCSD) includes single and double excitations and represents the most commonly used variant of CC theory due to its favorable balance between accuracy and computational cost.
Coupled Cluster with Singles, Doubles, and perturbative Triples (CCSD(T)) incorporates a perturbative correction to the CCSD energy to account for some higher-order correlation effects, and stands as the gold standard of quantum chemistry.
CCSD(T) has demonstrated exceptional accuracy and reliability, making it one of the preferred choices for benchmark calculations and highly accurate predictions.
It has found successful applications in a diverse range of areas, including spectroscopy,\cite{villa_2011,watson_2016,vilarrubias_2020} reaction kinetics,\cite{dontgen_2015,castaneda_2012} and materials design,\cite{zhang_2019} and has played a pivotal role in advancing our understanding of complex chemical phenomena.
In the context of CC theory, perturbative triples represent an important contribution to the accuracy of electronic structure calculations.\cite{stanton_1997}
In the context of CC theory, the perturbative triples correction represents an important contribution to the accuracy of electronic structure calculations.\cite{stanton_1997}
However, the computational cost associated with the calculation of this correction can be prohibitively high, especially for large systems.
The inclusion of the perturbative triples in the CCSD(T) method leads to a computational scaling of $\order{N^7}$, where $N$ is proportional to the number of molecular orbitals.
This scaling can rapidly become impractical, posing significant challenges in terms of computational resources and time requirements.
@@ -196,13 +203,62 @@ In the algorithm proposed by Rendell\cite{rendell_1991}, for each given triplet
\subsection{Stochastic formulation}
\subsection{Test code}
\label{subsec:test_code}
% Include the test code here, if applicable.
We propose an algorithm influenced by the semi-stochastic approach originally developed for computing the Epstein-Nesbet second-order perturbation correction to the energy.\cite{garniron_2017}
The perturbative triples correction is expressed as a sum of corrections, each indexed solely by virtual orbitals:
\begin{equation}
E_{(T)} = \sum_{abc} E^{abc} \text{, where }
E^{abc} = \sum_{ijk} E_{ijk}^{abc}.
\end{equation}
Monte Carlo sampling is employed by drawing samples $E^{abc}$.
The principal advantage of this formulation is that the number of triplet combinations $(a,b,c)$, given by $N_\text{v}^3$, is small enough for all contributions $E^{abc}$ to be stored in memory.
The first time a triplet $(a,b,c)$ is drawn, its corresponding value $E^{abc}$ is computed and stored.
Subsequent drawings of the same triplet retrieve the value from memory; we refer to this technique as \emph{memoization}.
Thus, the computational expense of calculating a sample, which scales as $N_\text{o}^3 \times N_\text{v}$, is incurred only once, with all subsequent accesses being computationally trivial.
Consequently, employing a sufficient number of Monte Carlo samples to ensure that each contribution is selected at least once results in a total computational cost that is only negligibly higher than that of an exact computation.
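A minimal sketch of the memoization in Python (the function compute_Eabc and the cache layout are hypothetical stand-ins for the actual $N_\text{o}^3 \times N_\text{v}$ kernel and its storage):

    # Memoized evaluation: the expensive kernel runs once per triplet.
    cache = {}  # (a, b, c) -> E^{abc}; all Nv^3 entries fit in memory

    def sample_Eabc(abc, compute_Eabc):
        """Return E^{abc}, computing it only on the first access."""
        if abc not in cache:
            cache[abc] = compute_Eabc(*abc)  # O(No^3 * Nv) work, paid once
        return cache[abc]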
To reduce the variance, the samples are drawn with the probability
\begin{equation}
P(a,b,c) = \frac{1}{\mathcal{N}} \frac{1}{\bar{\epsilon}_{ijk} - \epsilon_a - \epsilon_b - \epsilon_c}
\end{equation}
where $\mathcal{N}$ is the normalization factor ensuring $\sum_{abc} P(a,b,c) = 1$. Here, $\bar{\epsilon}_{ijk}$ approximates the sum $\epsilon_i + \epsilon_j + \epsilon_k$ by three times the average occupied-orbital energy:
\begin{equation}
\bar{\epsilon}_{ijk} = \frac{3}{N_\text{o}} \sum_{i=1}^{N_\text{o}} \epsilon_i.
\end{equation}
The perturbative contribution is then computed by
\begin{equation}
E_{(T)} = \mathcal{N} \sum_{abc} P(a,b,c) \, E^{abc} \,
(\bar{\epsilon}_{ijk} - \epsilon_a - \epsilon_b - \epsilon_c).
\end{equation}
This estimator is exact in expectation: by construction of $P(a,b,c)$, each factor satisfies $\mathcal{N}\, P(a,b,c)\, \qty(\bar{\epsilon}_{ijk} - \epsilon_a - \epsilon_b - \epsilon_c) = 1$, so the weighted sum collapses term by term to $\sum_{abc} E^{abc}$.
This approach reduces the statistical error bars by approximately a factor of two for the same computational expense, for two main reasons: i) the estimator exhibits reduced fluctuations, and ii) some triplet combinations are more likely to be selected than others, which enhances the efficiency of the memoization.
Samples are selected using inverse transform sampling, for which an array of pairs $\qty(P(a,b,c), (a,b,c))$ is stored.
To further reduce the variance, this array is sorted in descending order of $P(a,b,c)$ and then partitioned into buckets $B$, constructed such that the sum $\sum_{(a,b,c) \in B} P(a,b,c)$ is as uniform as possible across buckets.
As the buckets are equiprobable, a sample is defined as a combination of triplets, one drawn from each bucket.
When the values of $E^{abc}$ are skewed, this refinement significantly reduces the variance.
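The bucket construction can be sketched as a greedy fill over the sorted array; the function name and the equal-weight target are our assumptions about how such a partition would be realized:

    # Partition the (P, (a,b,c)) pairs, sorted by descending P, into
    # buckets of approximately equal total probability. Early buckets
    # hold few high-probability triplets; later ones hold many small ones.
    def make_buckets(pairs, n_buckets):
        target = sum(p for p, _ in pairs) / n_buckets
        buckets, current, acc = [], [], 0.0
        for p, abc in pairs:
            current.append((p, abc))
            acc += p
            if acc >= target and len(buckets) < n_buckets - 1:
                buckets.append(current)
                current, acc = [], 0.0
        buckets.append(current)  # remaining triplets form the last bucket
        return buckets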
The total perturbative contribution is the sum of the bucket contributions:
\begin{equation}
E_{(T)} = \sum_B E_B = \sum_B \sum_{(a,b,c) \in B} E^{abc}.
\end{equation}
Once every triplet within a bucket $B$ has been drawn at least once, the contribution $E_B$ is known exactly.
At this point, $E_B$ no longer needs to be evaluated stochastically, and the buckets can be categorized into stochastic ($\mathcal{S}$) and deterministic ($\mathcal{D}$) groups:
\begin{equation}
E_{(T)} = \sum_{B \in \mathcal{D}} E_B + \frac{1}{|\mathcal{S}|} \sum_{B \in \mathcal{S}}
\left \langle E^{abc} \times \mathcal{N} \, \qty(\bar{\epsilon}_{ijk} - \epsilon_a - \epsilon_b - \epsilon_c) \right \rangle_{P(a,b,c),\, (a,b,c) \in B}.
\end{equation}
Not all buckets are of equal size: as the array is sorted in descending order of probability, the number of triplets per bucket increases with the bucket's index. Consequently, the initial buckets transition into the deterministic set first, gradually reducing the stochastic contribution. When every triplet has been drawn, the exact value of $E_{(T)}$ is obtained, free of statistical error.
To accelerate the completion of the buckets, each Monte Carlo iteration also triggers the computation of the first not-yet-computed triplet. This guarantees that, after a number of drawings equal to the total number of triplets, the exact contribution of every bucket is available.
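Putting the pieces together, a serial sketch of the semi-stochastic loop (hypothetical interfaces: e_abc is the memoized lookup from the first sketch, buckets comes from make_buckets; the actual implementation in Quantum Package described below is parallel):

    import random

    def semi_stochastic_ET(buckets, e_abc, n_steps):
        """Estimate E_(T). Buckets whose triplets are all evaluated
        contribute exactly, the rest stochastically; the weight
        1/P(a,b,c) = N*(eps_bar - eps_a - eps_b - eps_c) is recovered
        from the stored probabilities."""
        n = len(buckets)
        exact = [None] * n                   # E_B once bucket B is complete
        acc, cnt = [0.0] * n, [0] * n        # running average of E^{abc}/P
        seen = [set() for _ in range(n)]
        for _ in range(n_steps):
            for b, bucket in enumerate(buckets):
                if exact[b] is not None:     # bucket already deterministic
                    continue
                p, abc = random.choices(bucket, [q for q, _ in bucket])[0]
                acc[b] += e_abc(abc) / p
                cnt[b] += 1
                seen[b].add(abc)
                # completion drive: evaluate the first untouched triplet
                first = next((t for _, t in bucket if t not in seen[b]), None)
                if first is not None:
                    e_abc(first)
                    seen[b].add(first)
                if len(seen[b]) == len(bucket):
                    exact[b] = sum(e_abc(t) for _, t in bucket)
        total = 0.0
        for b, bucket in enumerate(buckets):
            p_B = sum(q for q, _ in bucket)  # bucket weight, ~1/n
            total += exact[b] if exact[b] is not None else p_B * acc[b] / cnt[b]
        return total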
%=================================================================%
\section{Implementation Details}
\subsection{Implementation Details}
\label{sec:implementation}
The algorithm presented in Algorithm~\ref{alg:stoch} was implemented in the \textsc{Quantum Package} software.
@@ -368,18 +424,15 @@ The vibrational frequency and equilibrium distance estimated using this data, $\
Figure \ref{fig:cucl} illustrates the potential energy surface of \ce{CuCl}, displaying both the exact CCSD(T) energies and those estimated via the semi-stochastic method.
\subsection{Parallel efficiency}
\subsection{Performance analysis}
The primary bottleneck of our proposed algorithm lies in the generation of the sub-tensor $W^{abc}$ for each $(a,b,c)$ triplet, as discussed in Section~\ref{sec:theory}.
However, we have outlined a strategy to reframe this operation into BLAS matrix multiplications,\cite{form_w_abc} offering the potential for significantly enhanced efficiency.
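As a generic illustration of this kind of reframing (not the actual working equations for $W^{abc}$, which are given in the referenced derivation), a tensor contraction is exposed to BLAS by flattening groups of indices:

    import numpy as np

    # W[i,j,k] = sum_d T[i,j,d] * B[d,k] becomes one (No^2 x Nv)(Nv x No)
    # GEMM after flattening the (i,j) pair; toy sizes for illustration.
    no, nv = 4, 10
    T = np.random.rand(no, no, nv)
    B = np.random.rand(nv, no)
    W = (T.reshape(no * no, nv) @ B).reshape(no, no, no)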
We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD EPYC 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD \textsc{Epyc} 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
We linked our code with the Intel MKL library for BLAS operations.
Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}, and although performance counters were unavailable, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
For this, we utilized the ArmPL library for BLAS operations.
For this, we utilized the \textsc{ArmPL} library for BLAS operations.
\begin{table*}
\begin{ruledtabular}
@@ -387,7 +440,7 @@ For this, we utilized the ArmPL library for BLAS operations.
CPU & $N_{\text{cores}}$ & $V$ & $F$ & Memory Bandwidth & Peak DP & Measured performance \\
& & & (GHz) & (GB/s) & (GFlop/s) & (GFlop/s) \\
\hline
EPYC 7513 & 64 & 4 & 2.6 & 409.6 & 2~662 & 1~576 \\
\textsc{EPYC} 7513 & 64 & 4 & 2.6 & 409.6 & 2~662 & 1~576 \\
Xeon Gold 6130 & 32 & 8 & 2.1 & 256.0 & 2~150 & 667 \\ % 239.891
ARM Q80 & 80 & 2 & 2.8 & 204.8 & 1~792 & 547 \\ % 292.492
\end{tabular}
@@ -400,7 +453,7 @@ Peak performance is determined by calculating the maximum achievable Flops/s on
\begin{equation}
P = N_{\text{cores}} \times N_{\text{FMA}} \times 2 \times V \times F
\end{equation}
where $F$ represents the frequency, $V$ the number of double precision elements in a vector register, $N_{\text{FMA}}$ denotes the number of vector FMA units per core (all considered CPUs possess two), and $N_{\text{cores}}$ reflects the number of cores. Notably, the Xeon and ARM CPUs both operate at approximately 30\% of peak performance, while the AMD EPYC CPU demonstrates twice the efficiency, achieving 60\% of the peak.
where $F$ represents the frequency, $V$ the number of double precision elements in a vector register, $N_{\text{FMA}}$ denotes the number of vector FMA units per core (all considered CPUs possess two), and $N_{\text{cores}}$ reflects the number of cores. Notably, the Xeon and ARM CPUs both operate at approximately 30\% of peak performance, while the AMD \textsc{Epyc} CPU demonstrates twice the efficiency, achieving 60\% of the peak.
The relatively modest performance, at around 30\% efficiency, is attributed to the small dimensions of the matrices involved.
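The table's peak values can be checked directly from this formula; a quick sketch (values copied from the table, $N_{\text{FMA}} = 2$ throughout):

    # Peak DP throughput: P = N_cores * N_FMA * 2 * V * F
    def peak_gflops(n_cores, v, f_ghz, n_fma=2):
        return n_cores * n_fma * 2 * v * f_ghz

    print(peak_gflops(64, 4, 2.6))  # EPYC 7513 -> 2662.4 (quoted: 2 662)
    print(peak_gflops(32, 8, 2.1))  # Xeon 6130 -> 2150.4 (quoted: 2 150)
    print(peak_gflops(80, 2, 2.8))  # ARM Q80   -> 1792.0 (quoted: 1 792)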
@@ -410,9 +463,29 @@ These multiplications exhibit an arithmetic intensity of
I = \frac{2\, {N_\text{o}}^3\, N_\text{v}}{8\, \qty({N_\text{o}}^3 + {N_\text{o}}^2 N_\text{v} + {N_\text{o}} N_\text{v})}
\end{equation}
which is bounded above by approximately $N_\text{o} / 4$ flops/byte, a value that is usually relatively low.
For instance, in the case of benzene with a triple-zeta basis set, the arithmetic intensity is calculated to be 3.52 flops/byte, falling short of the threshold required to attain peak performance on any of the CPUs.
By leveraging memory bandwidth and double precision throughput peak, we determined the critical arithmetic intensity necessary to achieve peak performance. On the Xeon and ARM CPUs, this critical value stands at approximately 8.4 and 8.8 flops/byte, respectively. Meanwhile, the EPYC CPU exhibits a value of 6.5 flops/byte, thanks to its superior memory bandwidth.
For instance, in the case of benzene with a triple-zeta basis set, the arithmetic intensity is calculated to be 3.33 flops/byte, falling short of the threshold required to attain peak performance on any of the CPUs.
By leveraging memory bandwidth and double precision throughput peak, we determined the critical arithmetic intensity necessary to achieve peak performance. On the Xeon and ARM CPUs, this critical value stands at approximately 8.4 and 8.8 flops/byte, respectively. Meanwhile, the \textsc{EPYC} CPU exhibits a value of 6.5 flops/byte, thanks to its superior memory bandwidth.
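These figures can be reproduced as follows; the benzene/triple-zeta orbital counts used here ($N_\text{o} = 15$ occupied with frozen core, $N_\text{v} \approx 250$ virtual) are our assumption, chosen because they reproduce the quoted 3.33 flops/byte:

    # Arithmetic intensity of the W^{abc} multiplications, and critical
    # intensity (peak / bandwidth) per machine. No and Nv are assumed.
    def intensity(no, nv):
        return 2 * no**3 * nv / (8 * (no**3 + no**2 * nv + no * nv))

    print(intensity(15, 250))  # -> ~3.33 flops/byte (bound No/4 = 3.75)
    print(2662.4 / 409.6)      # EPYC 7513 -> 6.5  flops/byte
    print(2150.4 / 256.0)      # Xeon 6130 -> 8.4  flops/byte
    print(1792.0 / 204.8)      # ARM Q80   -> 8.75 flops/byte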
\subsection{Parallel efficiency}
\begin{figure}
\includegraphics[width=\columnwidth]{scaling.pdf}
\caption{\label{fig:speedup} Parallel speedup obtained with the ARM Q80 and AMD \textsc{Epyc} servers.}
\end{figure}
The parallel speedup performance of the ARM and AMD servers for computations involving the benzene molecule in a triple-zeta basis set is illustrated in Figure~\ref{fig:speedup}. The results delineate three distinct performance regimes:
\begin{itemize}
\item In the first regime, encompassing up to 24 cores, the performance closely approximates the ideal, with nearly linear speedup.
\item The second regime, spanning 24 to 64 cores, shows decent performance, achieving a 40-fold acceleration with 64 cores.
\item The third regime begins beyond 64 cores, where parallel efficiency rapidly deteriorates.
\end{itemize}
This performance behavior can largely be attributed to the arithmetic intensity and the bandwidth characteristics of these servers.
On the ARM server, the peak performance is attained at an arithmetic intensity of 8.75~flops/byte.
Notably, with fewer cores, the bandwidth per core increases, thereby enhancing efficiency.
For the benzene molecule in the triple-zeta basis set, the arithmetic intensity of the dominant matrix multiplications is 3.33~flops/byte.
This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
Beyond these thresholds, particularly after 64 cores on the ARM server, the heavy demand on memory bandwidth results in a rapid decline in speedup.
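These thresholds follow from a simple bandwidth argument: a core stays compute-bound while its share of the memory bandwidth, multiplied by the arithmetic intensity, covers its peak flop rate. A sketch using the per-core peaks implied by the table:

    # Threshold core count beyond which the run is bandwidth-bound:
    # n_max = total_bandwidth * intensity / per_core_peak
    def core_threshold(bw_gbs, per_core_peak_gflops, intensity):
        return bw_gbs * intensity / per_core_peak_gflops

    print(core_threshold(204.8, 22.4, 3.33))  # ARM Q80:   ~30 cores
    print(core_threshold(409.6, 41.6, 3.33))  # EPYC 7513: ~33 cores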
%%%

View File

@@ -59,7 +59,7 @@ fetched from memory. In this way, the $N_o^3 \times N_v$ cost of
computing the sample is paid only once, and all other evaluations are
negligible.
Hence, using an number of Monte Carlo samples large enough such that each
Hence, using a number of Monte Carlo samples large enough such that each
contribution has been drawn at least once has a computational
cost only negligibly larger than the cost of the
exact computation.
@@ -81,10 +81,9 @@ contributions of the occupied orbitals:
The perturbative contribution is computed as
\[
E_{(T)} = \mathcal{N} \sum_{abc} P(a,b,c) \, E_{abc} \,
(\epsilon_{\text{occ}} - \epsilon_a - \epsilon_b - \epsilon_c)
\]
This modification reduces the statistical error bars by a factor of two
at the same computational cost, for two reasons: