From c6866453d75b30dae478544c5358eeb60f49f203 Mon Sep 17 00:00:00 2001
From: Anthony Scemama <scemama@irsamc.ups-tlse.fr>
Date: Tue, 16 Apr 2024 14:57:21 +0200
Subject: [PATCH] Parallel

---
 Manuscript/stochastic_triples.tex | 21 +++++++++++++--------
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/Manuscript/stochastic_triples.tex b/Manuscript/stochastic_triples.tex
index 042fedc..04cde09 100644
--- a/Manuscript/stochastic_triples.tex
+++ b/Manuscript/stochastic_triples.tex
@@ -432,7 +432,7 @@ However, we have outlined a strategy to reframe this operation into BLAS matrix
 We evaluated the efficiency of our implementation using the Likwid\cite{treibig_2010} performance analysis tool on two distinct x86 platforms: an AMD \textsc{Epyc} 7513 dual-socket server equipped with 64 cores at \SI{2.6}{\giga\hertz}, and an Intel Xeon Gold 6130 dual-socket server with 32 cores at \SI{2.1}{\giga\hertz}.
 We linked our code with the Intel MKL library for BLAS operations.
 Additionally, we executed the code on an ARM Q80 server featuring 80 cores at \SI{2.8}{\giga\hertz}, and although performance counters were unavailable, we approximated the Flop/s rate by comparing the total execution time with that measured on the AMD CPU.
-For this, we utilized the ArmPL library for BLAS operations.
+For this, we utilized the \textsc{ArmPL} library for BLAS operations.
 
 \begin{table*}
 \begin{ruledtabular}
@@ -472,15 +472,20 @@ By leveraging memory bandwidth and double precision throughput peak, we determin
 \includegraphics[width=\columnwidth]{scaling.pdf}
 \caption{\label{fig:speedup} Parallel speedup obtained with the ARM Q80 and AMD \textsc{Epyc} servers.}
 \end{figure}
-Figure~\ref{fig:speedup} shows the parallel speedups obtained with the ARM and AMD servers for the benzene molecule in the triple-zeta basis set.
-Three distinct regimes appear.
-The first one, up to 24 cores is close to the ideal regime
-The second one, between 24 and 64 cores is decent and enables an acceleration of $40 \times$ with 64 cores. Then, beyond 64 cores, the parallel efficiency drops quickly.
-
-These behaviors can be explained by the arithmetic intensity and the bandwidth of these machines.
-On the ARM server, we have seen that the critical arithmetic intensity to leverage peak performance was 8.8 flops/byte. However, if the number of cores decreases, the bandwidth per core increases and so does the efficiency.
 
+Figure~\ref{fig:speedup} shows the parallel speedup measured on the ARM and AMD servers for the benzene molecule in the triple-zeta basis set. Three distinct regimes appear:
+\begin{itemize}
+\item Up to 24 cores, the speedup is nearly linear and close to the ideal.
+\item Between 24 and 64 cores, the speedup remains decent, reaching a 40-fold acceleration with 64 cores.
+\item Beyond 64 cores, the parallel efficiency deteriorates rapidly.
+\end{itemize}
 
+This behavior can largely be attributed to the arithmetic intensity of the computation and the memory bandwidth of these servers.
+On the ARM server, reaching peak performance requires an arithmetic intensity of at least 8.75~flops/byte.
+With fewer cores, the memory bandwidth available per core increases, so a lower arithmetic intensity suffices and the efficiency improves.
+For the benzene molecule in the triple-zeta basis set, the critical arithmetic intensity is 3.33~flops/byte.
+This intensity corresponds to a threshold of approximately 30 cores on the ARM server and 32 cores on the AMD server.
+Beyond these thresholds, particularly after 64 cores on the ARM server, the heavy demand on memory bandwidth results in a rapid decline in parallel efficiency.
 
 
 %%%