Anthony Scemama 2024-04-16 14:57:21 +02:00
parent 792ab5ceb2
commit c6866453d7
1 changed files with 13 additions and 8 deletions

@@ -432,7 +432,7 @@ However, we have outlined a strategy to reframe this operation into BLAS matrix
We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD \textsc{Epyc} 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
We linked our code with the Intel MKL library for BLAS operations.
Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}, and although performance counters were unavailable, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
-For this, we utilized the ArmPL library for BLAS operations.
+For this, we utilized the \textsc{ArmPL} library for BLAS operations.
\begin{table*}
\begin{ruledtabular}
@@ -472,15 +472,20 @@ By leveraging memory bandwidth and double precision throughput peak, we determin
\includegraphics[width=\columnwidth]{scaling.pdf}
\caption{\label{fig:speedup} Parallel speedup obtained with the ARM Q80 and AMD \textsc{Epyc} servers.}
\end{figure}
-Figure~\ref{fig:speedup} shows the parallel speedups obtained with the ARM and AMD servers for the benzene molecule in the triple-zeta basis set.
-Three distinct regimes appear.
-The first one, up to 24 cores is close to the ideal regime
-The second one, between 24 and 64 cores is decent and enables an acceleration of $40 \times$ with 64 cores. Then, beyond 64 cores, the parallel efficiency drops quickly.
-These behaviors can be explained by the arithmetic intensity and the bandwidth of these machines.
-On the ARM server, we have seen that the critical arithmetic intensity to leverage peak performance was 8.8 flops/byte. However, if the number of cores decreases, the bandwidth per core increases and so does the efficiency.
+The parallel speedup performance of the ARM and AMD servers for computations involving the benzene molecule in a triple-zeta basis set is illustrated in Figure~\ref{fig:speedup}. The results delineate three distinct performance regimes:
+\begin{itemize}
+\item In the first regime, encompassing up to 24 cores, the performance closely approximates the ideal, with nearly linear speedup.
+\item The second regime, spanning 24 to 64 cores, shows decent performance, achieving a 40-fold acceleration with 64 cores.
+\item The third regime begins beyond 64 cores, where parallel efficiency rapidly deteriorates.
+\end{itemize}
+This performance behavior can largely be attributed to the arithmetic intensity and the bandwidth characteristics of these servers.
+On the ARM server, the peak performance is attained at an arithmetic intensity of 8.75~flops/byte.
+Notably, with fewer cores, the bandwidth per core increases, thereby enhancing efficiency.
+For the benzene molecule in the triple-zeta basis set, the critical arithmetic intensity is 3.33~flops/byte.
+This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
+Beyond these thresholds, particularly after 64 cores on the ARM server, the heavy demand on memory bandwidth results in a rapid decline in speedup.
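As a rough check of the ARM threshold, one may assume a simple roofline-style picture that is not spelled out in the text: the compute peak per core is constant and the total memory bandwidth is shared among the active cores. Under this assumption, the intensity needed to reach peak performance with $n$ of the $N = 80$ cores active scales as $I_{\mathrm{peak}}(n) = I_{\mathrm{peak}}(N)\, n / N$, with $I_{\mathrm{peak}}(N) = 8.75$~flops/byte. Setting $I_{\mathrm{peak}}(n)$ equal to the intensity of the benzene calculation, $I_{\mathrm{calc}} = 3.33$~flops/byte, gives
\begin{equation}
n^{\ast} = N \, \frac{I_{\mathrm{calc}}}{I_{\mathrm{peak}}(N)} = 80 \times \frac{3.33}{8.75} \approx 30.4 ,
\end{equation}
which is consistent with the threshold of approximately 30 cores quoted for the ARM server. The symbols $n^{\ast}$, $I_{\mathrm{calc}}$, and $I_{\mathrm{peak}}$ are notation introduced for this check only. The AMD threshold would follow from the same relation with that server's own peak intensity, which is not restated in this passage.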
%%%