Parallel
This commit is contained in:
parent
792ab5ceb2
commit
c6866453d7
@ -432,7 +432,7 @@ However, we have outlined a strategy to reframe this operation into BLAS matrix
We evaluated the efficiency of our implementation using the Likwid~\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD \textsc{Epyc} 7513 dual-socket server with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
We linked our code with the Intel MKL library for BLAS operations.
Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}. Because performance counters were unavailable on this machine, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
For this, we utilized the ArmPL library for BLAS operations.
On the ARM server, we used the \textsc{ArmPL} library for BLAS operations.
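As a sketch of how such an estimate can be formed, and assuming that both machines execute the same number of floating-point operations for a given calculation, the ARM rate can be written as
\begin{equation}
R_{\mathrm{ARM}} \approx R_{\mathrm{AMD}} \, \frac{t_{\mathrm{AMD}}}{t_{\mathrm{ARM}}},
\end{equation}
where $R$ denotes a Flop/s rate and $t$ a total execution time. These symbols are introduced here for illustration only, and the relation ignores any difference in the operation counts of the two BLAS libraries.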
\begin{table*}
\begin{ruledtabular}
@ -472,15 +472,20 @@ By leveraging memory bandwidth and double precision throughput peak, we determin
\includegraphics[width=\columnwidth]{scaling.pdf}
\caption{\label{fig:speedup} Parallel speedup obtained with the ARM Q80 and AMD \textsc{Epyc} servers.}
\end{figure}
Figure~\ref{fig:speedup} shows the parallel speedups obtained with the ARM and AMD servers for the benzene molecule in the triple-zeta basis set.
Three distinct regimes appear.
The first one, up to 24 cores, is close to the ideal regime.
The second one, between 24 and 64 cores, is decent and enables an acceleration of $40 \times$ with 64 cores. Then, beyond 64 cores, the parallel efficiency drops quickly.
These behaviors can be explained by the arithmetic intensity and the bandwidth of these machines.
On the ARM server, we have seen that the critical arithmetic intensity to leverage peak performance was 8.8 flops/byte. However, if the number of cores decreases, the bandwidth per core increases and so does the efficiency.
Figure~\ref{fig:speedup} shows the parallel speedup obtained on the ARM and AMD servers for the benzene molecule in a triple-zeta basis set. Three distinct regimes appear:
\begin{itemize}
\item In the first regime, up to 24 cores, the speedup is nearly linear and close to ideal.
\item In the second regime, from 24 to 64 cores, the code still scales but sublinearly, reaching a 40-fold acceleration with 64 cores.
\item The third regime begins beyond 64 cores, where parallel efficiency rapidly deteriorates.
\end{itemize}
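To put the second regime in perspective, one may use the usual definition of parallel efficiency as the speedup divided by the number of cores (the symbols $E$, $S$ and $n$ are introduced here for illustration):
\begin{equation}
E(n) = \frac{S(n)}{n}, \qquad E(64) = \frac{40}{64} \approx 0.63 .
\end{equation}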
This behavior can largely be attributed to the arithmetic intensity of the computation and to the memory bandwidth of these servers.
On the ARM server, the peak performance is attained at an arithmetic intensity of 8.75~flops/byte.
With fewer active cores, the memory bandwidth available per core increases, which improves efficiency.
For the benzene molecule in the triple-zeta basis set, the critical arithmetic intensity is 3.33~flops/byte.
This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
Beyond these thresholds, and particularly after 64 cores on the ARM server, the computation is limited by memory bandwidth and the parallel efficiency declines rapidly.
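The ARM threshold quoted above can be recovered with a simple roofline-type estimate. This is only a sketch: it assumes that the peak Flop/s rate scales linearly with the number of active cores $n$, that the total memory bandwidth does not depend on $n$, and it introduces the symbols $N$, $I_c$ and $I_{\mathrm{bz}}$ for illustration. Under these assumptions, the intensity required to reach peak performance with $n$ out of $N$ cores is $I_c(n) = I_c(N)\, n / N$, and a computation of intensity $I_{\mathrm{bz}}$ is no longer limited by bandwidth as long as
\begin{equation}
n \le N \, \frac{I_{\mathrm{bz}}}{I_c(N)} = 80 \times \frac{3.33}{8.75} \approx 30.4 ,
\end{equation}
which is consistent with the threshold of approximately 30 cores on the ARM server. The AMD threshold would follow from the same relation using that server's own core count and critical intensity.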
%%%