From c6866453d75b30dae478544c5358eeb60f49f203 Mon Sep 17 00:00:00 2001
From: Anthony Scemama
Date: Tue, 16 Apr 2024 14:57:21 +0200
Subject: [PATCH] Parallel

---
 Manuscript/stochastic_triples.tex | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/Manuscript/stochastic_triples.tex b/Manuscript/stochastic_triples.tex
index 042fedc..04cde09 100644
--- a/Manuscript/stochastic_triples.tex
+++ b/Manuscript/stochastic_triples.tex
@@ -432,7 +432,7 @@ However, we have outlined a strategy to reframe this operation into BLAS matrix
 We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD \textsc{Epyc} 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
 We linked our code with the Intel MKL library for BLAS operations.
 Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}, and although performance counters were unavailable, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
-For this, we utilized the ArmPL library for BLAS operations.
+For this, we utilized the \textsc{ArmPL} library for BLAS operations.
 \begin{table*}
 \begin{ruledtabular}
@@ -472,15 +472,20 @@ By leveraging memory bandwidth and double precision throughput peak, we determin
 \includegraphics[width=\columnwidth]{scaling.pdf}
 \caption{\label{fig:speedup} Parallel speedup obtained with the ARM Q80 and AMD \textsc{Epyc} servers.}
 \end{figure}
-Figure~\ref{fig:speedup} shows the parallel speedups obtained with the ARM and AMD servers for the benzene molecule in the triple-zeta basis set.
-Three distinct regimes appear.
-The first one, up to 24 cores is close to the ideal regime
-The second one, between 24 and 64 cores is decent and enables an acceleration of $40 \times$ with 64 cores.
 Then, beyond 64 cores, the parallel efficiency drops quickly.
-
-These behaviors can be explained by the arithmetic intensity and the bandwidth of these machines.
-On the ARM server, we have seen that the critical arithmetic intensity to leverage peak performance was 8.8 flops/byte.
 However, if the number of cores decreases, the bandwidth per core increases and so does the efficiency.
+The parallel speedup performance of the ARM and AMD servers for computations involving the benzene molecule in a triple-zeta basis set is illustrated in Figure~\ref{fig:speedup}. The results delineate three distinct performance regimes:
+\begin{itemize}
+\item In the first regime, encompassing up to 24 cores, the performance closely approximates the ideal, with nearly linear speedup.
+\item The second regime, spanning 24 to 64 cores, shows decent performance, achieving a 40-fold acceleration with 64 cores.
+\item The third regime begins beyond 64 cores, where parallel efficiency rapidly deteriorates.
+\end{itemize}
+This performance behavior can largely be attributed to the arithmetic intensity and the bandwidth characteristics of these servers.
+On the ARM server, the peak performance is attained at an arithmetic intensity of 8.75~flops/byte.
+Notably, with fewer cores, the bandwidth per core increases, thereby enhancing efficiency.
+For the benzene molecule in the triple-zeta basis set, the critical arithmetic intensity is 3.33~flops/byte.
+This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
+Beyond these thresholds, particularly after 64 cores on the ARM server, the heavy demand on memory bandwidth results in a rapid decline in speedup.
 %%%