Performance

This commit is contained in:
Anthony Scemama 2024-03-30 02:18:06 +01:00
parent 98a897d9c8
commit 39d9f66f9b
2 changed files with 69 additions and 28 deletions

View File

@@ -194,3 +194,31 @@ @article{watson_2016,
note = {[Online; accessed 28. Mar. 2024]},
url = {https://www.nist.gov/pml/diatomic-spectral-database}
}
@article{williams_2009,
author = {Williams, Samuel and Waterman, Andrew and Patterson, David},
title = {{Roofline: an insightful visual performance model for multicore architectures}},
journal = {Commun. ACM},
volume = {52},
number = {4},
pages = {65--76},
year = {2009},
month = apr,
issn = {0001-0782},
publisher = {Association for Computing Machinery},
doi = {10.1145/1498765.1498785}
}
@article{calore_2020,
author = {Calore, Enrico and Gabbana, Alessandro and Schifano, Sebastiano Fabio and Tripiccione, Raffaele},
title = {{ThunderX2 Performance and Energy-Efficiency for HPC Workloads}},
journal = {Computation},
volume = {8},
number = {1},
pages = {20},
year = {2020},
month = mar,
issn = {2079-3197},
publisher = {Multidisciplinary Digital Publishing Institute},
doi = {10.3390/computation8010020}
}

View File

@@ -243,34 +243,6 @@ The more rapid convergence observed with the larger basis set aligns with expect
This trend underscores the algorithm's enhanced suitability for systems with fewer electrons and extensive basis sets, as opposed to larger electron counts in smaller basis sets.
\subsection{Performance analysis}
The bottleneck of the proposed algorithm is the creation of the sub-tensor $W^{abc}$ for each given $(a,b,c)$ triplet.
As mentioned in section~\ref{sec:theory}, this operation can be recast as matrix multiplications, which is the key to the high efficiency of our implementation.
We have measured the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on an AMD EPYC 7513 dual-socket server (64 cores at \SI{2.6}{\giga \hertz}) and on an
Intel Xeon Gold 6130 dual-socket server (32 cores at \SI{2.1}{\giga \hertz}).
The code was linked with the Intel MKL library for BLAS operations.
The number of floating-point operations per second (Flop/s) was measured in the section of the code that computes the perturbative triples correction.
We have also run the code on an ARM Q80 server (80 cores at \SI{2.8}{\giga \hertz}). As the hardware performance counters were not accessible to Likwid on this machine, we estimated the Flop/s rate by comparing the total execution time of the perturbative triples correction with the time measured on the AMD CPU.
The code was linked with the ArmPL library for BLAS operations.
\begin{table}
\begin{ruledtabular}
\begin{tabular}{lcccc}
CPU & \# cores & Vector length & Performance & \% Peak \\
& & (bits) & (GFlop/s) & \\
\hline
EPYC 7513 & 64 & 256 &1~576 & 59.2 \% \\ % 101.53
Xeon Gold 6130 & 32 & 512 & 667 & 31.0 \% \\ % 239.891
ARM Q80 & 80 & 128 & 547 & 30.5 \% \\ % 292.492
\end{tabular}
\end{ruledtabular}
\caption{\label{tab:flops} Performance of the code measured on different architectures.}
\end{table}
Table~\ref{tab:flops} shows the results of these tests on the three machines.
\subsection{Vibrational frequency of copper chloride}
@@ -302,6 +274,47 @@ The vibrational frequency and equilibrium distance estimated using this data, $\
Figure \ref{fig:cucl} illustrates the potential energy surface of \ce{CuCl}, displaying both the exact CCSD(T) energies and those estimated via the semi-stochastic method.
\subsection{Performance analysis}
The bottleneck of the proposed algorithm is the creation of the sub-tensor $W^{abc}$ for each given $(a,b,c)$ triplet.
As mentioned in section~\ref{sec:theory}, this operation can be recast as matrix multiplications, which is the key to the high efficiency of our implementation.
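Schematically, for a given $(a,b,c)$ triplet, the dominant contraction over a virtual index $d$ can be written as
\begin{equation}
W^{abc}_{ij,k} \leftarrow \sum_{d} T^{(a)}_{ij,d}\, B^{(bc)}_{d,k},
\end{equation}
where $T^{(a)}$ and $B^{(bc)}$ stand generically for the amplitude and two-electron integral blocks entering the working equations (the exact tensors are given in section~\ref{sec:theory}).
Treating the compound index $ij$ as a row index, this is the product of an $N_\text{o}^2 \times N_\text{v}$ matrix by an $N_\text{v} \times N_\text{o}$ matrix, which maps directly onto a single BLAS \texttt{dgemm} call per triplet.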
We have measured the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on an AMD EPYC 7513 dual-socket server (64 cores at \SI{2.6}{\giga \hertz}) and on an
Intel Xeon Gold 6130 dual-socket server (32 cores at \SI{2.1}{\giga \hertz}).
The code was linked with the Intel MKL library for BLAS operations.
The average number of floating-point operations per second (Flop/s) was measured in the section of the code that computes the perturbative triples correction.
We have also run the code on an ARM Q80 server (80 cores at \SI{2.8}{\giga \hertz}). As the hardware performance counters were not accessible to Likwid on this machine, we estimated the Flop/s rate by comparing the total execution time of the perturbative triples correction with the time measured on the AMD CPU.
The code was linked with the ArmPL library for BLAS operations.
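This timing-based estimate implicitly assumes that the same total number of floating-point operations is performed on both machines, i.e.
\begin{equation}
R_{\text{ARM}} \approx R_{\text{AMD}} \times \frac{t_{\text{AMD}}}{t_{\text{ARM}}},
\end{equation}
where $R$ denotes the Flop/s rate and $t$ the wall-clock time of the perturbative triples computation on each machine.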
\begin{table}
\begin{ruledtabular}
\begin{tabular}{lccccc}
CPU & $N_{\text{cores}}$ & $V$ & $F$ & Peak DP & Measured \\
 & & (DP elements) & (GHz) & (GFlop/s) & (GFlop/s) \\
\hline
EPYC 7513 & 64 & 4 & 2.6 & 2~662 & 1~576 \\
Xeon Gold 6130 & 32 & 8 & 2.1 & 2~150 & 667 \\ % 239.891
ARM Q80 & 80 & 2 & 2.8 & 1~792 & 547 \\ % 292.492
\end{tabular}
\end{ruledtabular}
\caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
\end{table}
Table~\ref{tab:flops} shows the results of these tests.
The peak performance is obtained by counting the maximum number of floating-point operations per second (Flop/s) that can be achieved on the CPU:
\begin{equation}
P = N_{\text{cores}} \times N_{\text{FMA}} \times 2 \times V \times F
\end{equation}
where $N_{\text{cores}}$ is the number of cores, $N_{\text{FMA}}$ is the number of vector FMA units per core (two for all the CPUs considered here), the factor of 2 accounts for the fact that a fused multiply-add (FMA) operation counts as two floating-point operations, $V$ is the number of double-precision elements in a vector register, and $F$ is the clock frequency.
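For instance, for the EPYC 7513 this gives $P = 64 \times 2 \times 2 \times 4 \times 2.6\,\mathrm{GHz} \simeq 2\,662$~GFlop/s, which is the peak value reported in Table~\ref{tab:flops}.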
The Xeon and ARM CPUs both reach around 30\% of the peak performance, while the AMD EPYC CPU is about twice as efficient, reaching close to 60\% of the peak.
The relatively low performance of 30\% is due to the small sizes of the matrices: the largest matrix multiplications in the computation of a task involve a matrix of size $N_\text{o}^2 \times N_\text{v}$ and a matrix of size $N_\text{v} \times N_\text{o}$ to produce an $N_\text{o}^2 \times N_\text{o}$ matrix.
Such matrix multiplications have an arithmetic intensity below $N_\text{o} / 4$ flops/byte. In the case of benzene in the triple-zeta basis set, the arithmetic intensity is 3.52 flops/byte.
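The $N_\text{o}/4$ bound is obtained by counting the floating-point operations and the double-precision memory traffic of such a product, assuming each matrix is read from or written to main memory exactly once:
\begin{equation}
I = \frac{2\, N_\text{o}^3 N_\text{v}}{8 \left( N_\text{o}^2 N_\text{v} + N_\text{o} N_\text{v} + N_\text{o}^3 \right)}
  < \frac{2\, N_\text{o}^3 N_\text{v}}{8\, N_\text{o}^2 N_\text{v}} = \frac{N_\text{o}}{4}.
\end{equation}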
Such a value is not sufficient to reach the peak performance on any of these CPUs.
Using the memory bandwidth and the double-precision peak throughput, we can determine the critical arithmetic intensity needed to reach the peak performance, i.e. the ridge point of the roofline model.\cite{williams_2009}
On the Xeon and ARM CPUs, we obtain critical values of 8.39 and 8.75~flops/byte, respectively.
On the EPYC CPU, we obtain a lower value of 6.50~flops/byte thanks to its higher memory bandwidth.
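These critical values correspond to the ratio between the double-precision peak and the theoretical memory bandwidth, $I_c = P / B$.
For instance, the 6.50~flops/byte obtained for the EPYC corresponds to $B = P / I_c \simeq 2\,662 / 6.50 \simeq 410$~GB/s, consistent with a dual-socket configuration with eight DDR4-3200 memory channels per socket.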
%%%