Software Heritage

This commit is contained in:
Anthony Scemama 2024-04-01 20:08:28 +02:00
parent 39d9f66f9b
commit 50f49ea55c
2 changed files with 58 additions and 28 deletions

View File

@@ -178,6 +178,7 @@ @article{watson_2016,
@incollection{treibig_2010,
author = {Treibig, Jan and Hager, Georg and Wellein, Gerhard},
year = 2010,
title = {{LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments}},
booktitle = {{2010 39th International Conference on Parallel Processing Workshops}},
@@ -222,3 +223,28 @@ @article{watson_2016,
publisher = {Multidisciplinary Digital Publishing Institute},
doi = {10.3390/computation8010020}
}
@article{garniron_2019,
author = {Garniron, Yann and Applencourt, Thomas and Gasperich, Kevin and Benali, Anouar and Fert{\'{e}}, Anthony and Paquier, Julien and Pradines, Barth{\'{e}}l{\'{e}}my and Assaraf, Roland and Reinhardt, Peter and Toulouse, Julien and Barbaresco, Pierrette and Renon, Nicolas and David, Gr{\'{e}}goire and Malrieu, Jean-Paul and V{\'{e}}ril, Micka{\"{e}}l and Caffarel, Michel and Loos, Pierre-Fran{\c{c}}ois and Giner, Emmanuel and Scemama, Anthony},
title = {{Quantum Package 2.0: An Open-Source Determinant-Driven Suite of Programs}},
journal = {J. Chem. Theory Comput.},
volume = {15},
number = {6},
pages = {3591--3609},
year = {2019},
month = jun,
issn = {1549-9618},
publisher = {American Chemical Society},
doi = {10.1021/acs.jctc.9b00176},
note = {\url{https://archive.softwareheritage.org/swh:1:dir:6d82ae7ac757c78d7720dd89dfa52d7a453d2f68;origin=https://github.com/QuantumPackage/qp2;visit=swh:1:snp:1f9d307c45a14259eea8991c328065400029b975;anchor=swh:1:rev:c63b69e8dac8017d6415df602c5f7f5c02e35a2a;path=/src/ccsd/}}
}
@misc{form_w_abc,
title = {Formation of the {$W$} tensor in Quantum Package},
year = {2024},
note = {\url{https://archive.softwareheritage.org/swh:1:cnt:12a71045f2333584fe7b499f1c70b5ff2dc4989c;origin=https://github.com/QuantumPackage/qp2;visit=swh:1:snp:1f9d307c45a14259eea8991c328065400029b975;anchor=swh:1:rev:c63b69e8dac8017d6415df602c5f7f5c02e35a2a;path=/src/ccsd/ccsd_t_space_orb_abc.irp.f;lines=233-395}}
}

View File

@@ -165,7 +165,7 @@ The indices $i,j,k$ and $a,b,c$ refer to occupied and virtual orbitals, respectively.
The bottleneck of the perturbative triples correction is the computation of the $W$ tensor
which requires $\order{N_o^3 \times N_v^4}$ operations. Fortunately, most of
the operations involved in the computation of $W$ can be recast into matrix
multiplications, which are among the most efficient operations that can be
multiplications,\cite{form_w_abc} which are among the most efficient operations that can be
executed on modern CPUs and
accelerators.\cite{ma_2011,haidar_2015,dinapoli_2014,springer_2018}
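
To illustrate how this recasting works in practice, the dominant contribution to $W^{abc}$ for a fixed $(a,b,c)$ triplet reduces to a single \texttt{dgemm} multiplying an $N_o^2 \times N_v$ slice of the double-excitation amplitudes by an $N_v \times N_o$ slice of the two-electron integrals. The C/CBLAS sketch below is only illustrative and is not the Fortran routine of the actual implementation;\cite{form_w_abc} the array names \texttt{t2\_a} and \texttt{v\_bc} are placeholders, and the accumulation of the remaining permuted terms is omitted.
\begin{verbatim}
/* Schematic only: for one (a,b,c) triplet, the dominant term of W^{abc}
 * is a contraction over a virtual index d, performed as one dgemm of a
 * (No^2 x Nv) amplitude slice with a (Nv x No) integral slice, giving a
 * (No^2 x No) block of W^{abc}. Array names are illustrative placeholders. */
#include <cblas.h>

void form_w_block(int No, int Nv,
                  const double *t2_a,  /* (No*No) x Nv, row-major         */
                  const double *v_bc,  /* Nv x No, row-major              */
                  double *w_abc)       /* (No*No) x No, row-major output  */
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                No * No,   /* rows of t2_a and w_abc            */
                No,        /* columns of v_bc and w_abc         */
                Nv,        /* contracted (virtual) dimension d  */
                1.0, t2_a, Nv,
                     v_bc, No,
                0.0, w_abc, No);
}
\end{verbatim}
Delegating the contraction to BLAS in this way leaves vectorization and cache blocking to the library, independently for each triplet.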
@@ -195,6 +195,9 @@ accelerators.\cite{ma_2011,haidar_2015,dinapoli_2014,springer_2018}
\section{Implementation Details}
\label{sec:implementation}
The algorithm was implemented in the \textsc{Quantum Package} software.
\cite{garniron_2019}
%a. Description of the computational framework and software used
%b. Discussion of any specific optimizations or parallelization techniques employed
% - Explain that form_w and form_v can be entirely produced by dgemm
@@ -276,43 +279,42 @@ Figure \ref{fig:cucl} illustrates the potential energy surface of \ce{CuCl}, dis
\subsection{Performance analysis}
The bottleneck of the proposed algorithm is the creation of the sub-tensor $W^{abc}$ for each given $(a,b,c)$ triplet.
We have mentioned in section~\ref{sec:theory} that this operation could be recast into matrix multiplications, leading to a high efficiency of our implementation.
The primary bottleneck of our proposed algorithm lies in the generation of the sub-tensor $W^{abc}$ for each $(a,b,c)$ triplet, as discussed in Section~\ref{sec:theory}.
However, we have outlined a strategy to reframe this operation into BLAS matrix multiplications,\cite{form_w_abc} offering the potential for significantly enhanced efficiency.
We have measured the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on an AMD EPYC 7513 dual-socket server (64 cores at \SI{2.6}{\giga \hertz}) and on an
Intel Xeon Gold 6130 dual-socket server (32 cores at \SI{2.1}{\giga \hertz}).
The code was linked with the Intel MKL library for BLAS operations.
Measurements of the average number of floating-point operations per second (Flop/s) were activated in the section of the code for the computation of the perturbative triples correction.
We have also run the code on an ARM Q80 server (80 cores at \SI{2.8}{\giga \hertz}), and as the performance counters were not available to Likwid, we have compared the total execution time of the computation of the perturbative triples correction with the time measured on the AMD CPU to estimate the Flop/s rate.
The code was linked with the ArmPL library for BLAS operations.
We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD EPYC 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
We linked our code with the Intel MKL library for BLAS operations.
Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}, and although hardware performance counters were unavailable to Likwid on this platform, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
For this, we utilized the ArmPL library for BLAS operations.
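
For context, a per-region Flop/s measurement of this kind is typically obtained by enclosing the region of interest with Likwid's marker API and running the binary under \texttt{likwid-perfctr} with the \texttt{FLOPS\_DP} event group. The sketch below shows this pattern in C as an assumption about the general procedure; it is not the actual instrumentation of the Fortran code.
\begin{verbatim}
/* Hypothetical instrumentation sketch (C), not the actual Quantum Package
 * code: the region computing the perturbative triples correction is
 * enclosed by LIKWID markers so that likwid-perfctr reports its Flop/s.
 * Build with -DLIKWID_PERFMON -llikwid; run e.g.
 *   likwid-perfctr -g FLOPS_DP -C 0-63 -m ./a.out                        */
#include <likwid-marker.h>

static void compute_triples_correction(void)
{
    /* Stand-in for the (T) kernel; the real work happens here. */
}

int main(void)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("triples");
    compute_triples_correction();     /* region whose Flop/s are measured */
    LIKWID_MARKER_STOP("triples");
    LIKWID_MARKER_CLOSE;
    return 0;
}
\end{verbatim}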
\begin{table}
\begin{table*}
\begin{ruledtabular}
\begin{tabular}{lccccc}
CPU & $N_{\text{cores}}$ & $V$ & $F$ & Peak DP & Measured \\
& & & (GHz) & (GFlop/s) & (GFlop/s) \\
\hline
EPYC 7513 & 64 & 4 & 2.6 & 2~662 & 1~576 \\
Xeon Gold 6130 & 32 & 8 & 2.1 & 2~150 & 667 \\ % 239.891
ARM Q80 & 80 & 2 & 2.8 & 1~792 & 547 \\ % 292.492
\begin{tabular}{lcccccc}
CPU & $N_{\text{cores}}$ & $V$ & $F$ & Memory Bandwidth & Peak DP & Measured performance \\
& & & (GHz) & (GB/s) & (GFlop/s) & (GFlop/s) \\
\hline
EPYC 7513 & 64 & 4 & 2.6 & 409.6 & 2~662 & 1~576 \\
Xeon Gold 6130 & 32 & 8 & 2.1 & 256.0 & 2~150 & 667 \\ % 239.891
ARM Q80 & 80 & 2 & 2.8 & 204.8 & 1~792 & 547 \\ % 292.492
\end{tabular}
\end{ruledtabular}
\caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
\end{table}
\end{table*}
Table~\ref{tab:flops} shows the results of these tests.
The peak performance is obtained by counting the maximum number of Flop/s that can be achieved on the CPU:
Table~\ref{tab:flops} summarizes the performance tests.
Peak performance is determined by calculating the maximum achievable Flop/s on the CPU using the formula:
\begin{equation}
P = N_{\text{cores}} \times N_{\text{FMA}} \times 2 \times V \times F
P = N_{\text{cores}} \times N_{\text{FMA}} \times 2 \times V \times F
\end{equation}
where $F$ is the frequency, the factor of 2 comes from the fact that all these CPUs can execute fused multiply-add (FMA) operations which account for two flops, $V$ is the number of double precision elements in a vector register, $N_{\text{FMA}}$ is the number of vector FMA units per core (all considered CPUs have two), and $N_{\text{cores}}$ is the number of cores.
The Xeon and ARM CPUs both reach around 30\% of the peak performance, while the AMD EPYC CPU is twice more efficient with 60\% of the peak.
where $F$ represents the frequency, the factor of 2 accounts for the two flops performed by each fused multiply-add (FMA) operation, $V$ the number of double precision elements in a vector register, $N_{\text{FMA}}$ denotes the number of vector FMA units per core (all considered CPUs possess two), and $N_{\text{cores}}$ reflects the number of cores. Notably, the Xeon and ARM CPUs both operate at approximately 30\% of peak performance, while the AMD EPYC CPU demonstrates twice the efficiency, achieving 60\% of the peak.
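
As a consistency check, inserting the parameters of Table~\ref{tab:flops} into this expression reproduces the quoted peak values. For the EPYC 7513, in GFlop/s,
\begin{equation}
P = 64 \times 2 \times 2 \times 4 \times 2.6 \approx 2662,
\end{equation}
and similarly $32 \times 2 \times 2 \times 8 \times 2.1 \approx 2150$ for the Xeon Gold 6130 and $80 \times 2 \times 2 \times 2 \times 2.8 = 1792$ for the ARM Q80.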
The relatively modest performance, at around 30\% efficiency, is attributed to the small dimensions of the matrices involved.
The largest matrix multiplications in the computational task entail a matrix of size $N_\text{o}^2 \times N_\text{v}$ and another of size $N_\text{v} \times N_\text{o}$ to yield an $N_\text{o}^2 \times N_\text{o}$ matrix.
These multiplications exhibit an arithmetic intensity below $N_\text{o} / 4$ flops/byte, which is low for typical numbers of occupied orbitals.
For instance, in the case of benzene with a triple-zeta basis set, the arithmetic intensity is calculated to be 3.52 flops/byte, falling short of the threshold required to attain peak performance on any of the CPUs.
By leveraging memory bandwidth and double precision throughput peak, we determined the critical arithmetic intensity necessary to achieve peak performance. On the Xeon and ARM CPUs, this critical value stands at approximately 8.4 and 8.8 flops/byte, respectively. Meanwhile, the EPYC CPU exhibits a value of 6.5 flops/byte, thanks to its superior memory bandwidth.
The relatively low performance of 30\% is due to the small sizes of the matrices: the largest matrix multiplications in the computation of a task involve a matrix of size $N_\text{o}^2 \times N_\text{v}$ and a matrix of size $N_\text{v} \times N_\text{o}$ to produce an $N_\text{o}^2 \times N_\text{o}$ matrix.
Such matrix multiplications have an arithmetic intensity below $N_\text{o} / 4$ flops/byte. In the case of benzene in the triple-zeta basis set, the arithmetic intensity is 3.52 flops/byte.
Such a value is not sufficient to reach the peak performance on any of these CPUs. Using the memory bandwidth and the double precision throughput peak we can determine the critical arithmetic intensity needed to reach the peak performance.
On the Xeon and on the ARM CPUs, we obtain respectively 8.39 and 8.75~flops/byte as critical values.
On the EPYC CPU, we obtain a value of 6.50 flops/byte thanks to its high memory bandwidth.
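
These figures can be recovered directly. If, as a simple traffic model, only the two input matrices are counted as data read from memory, the dominant product performs $2 N_\text{o}^3 N_\text{v}$ flops while reading $8 N_\text{o} N_\text{v} (N_\text{o}+1)$ bytes, so that
\begin{equation}
I = \frac{2 N_\text{o}^3 N_\text{v}}{8 N_\text{o} N_\text{v} (N_\text{o}+1)} = \frac{N_\text{o}^2}{4 (N_\text{o}+1)} < \frac{N_\text{o}}{4}.
\end{equation}
Assuming $N_\text{o}=15$ correlated occupied orbitals for frozen-core benzene, this estimate gives $225/64 \approx 3.5$~flops/byte, in line with the value quoted above. The critical intensities are the ratios of the peak throughputs to the memory bandwidths of Table~\ref{tab:flops}: $2662/409.6 \approx 6.5$, $2150/256.0 \approx 8.4$, and $1792/204.8 = 8.75$~flops/byte for the EPYC, Xeon, and ARM CPUs, respectively.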
%%%
@@ -330,6 +332,8 @@ On the EPYC CPU, we obtain a value of 6.50 flops/byte thanks to its high memory bandwidth.
%=================================================================%
%%%%%%%%%%%%%%%%%%%%%%%
\acknowledgements{
This work was supported by the European Centre of Excellence in Exascale Computing TREX --- Targeting Real Chemical Accuracy at the Exascale.