L3 additions

2024-06-19 14:19:40 +02:00 · 2024-06-19 14:19:40 +02:00 · 24f2b44dae
commit 24f2b44dae
parent 0adb3c7d9c
1 changed file with 13 additions and 11 deletions
--- a/Manuscript/stochastic_triples.tex
+++ b/Manuscript/stochastic_triples.tex
@@ -27,6 +27,7 @@

 \newcommand{\todo}[1]{\textcolor{blue}{#1}}
 \newcommand{\yann}[1]{\textcolor{purple}{#1}}
+\newcommand{\anthony}[1]{\textcolor{red}{#1}}

 \usepackage{listings}
 \definecolor{codegreen}{rgb}{0.58,0.4,0.2}
@@ -431,13 +432,13 @@ On the ARM architecture, we utilized the \textsc{ArmPL} library for BLAS operati

 \begin{table*}[htb]
 \begin{ruledtabular}
-\begin{tabular}{lcccccc}
-CPU & $N_{\text{cores}}$ & $V$ & $F$   & Memory Bandwidth & Peak DP   & Measured performance \\
-               &         &     & (GHz) &      (GB/s)      & (GFlop/s) & (GFlop/s) \\
+\begin{tabular}{lccccccc}
+CPU & Shared L3 cache & $N_{\text{cores}}$ & $V$ & $F$   & Memory Bandwidth & Peak DP   & Measured performance \\
+                         &       (MB)      &         &     & (GHz) &      (GB/s)      & (GFlop/s) & (GFlop/s) \\
 \hline                           
-\textsc{EPYC} 7513      &      64 &  4  &  2.6  &    409.6         &     2~662 & 1~576 \\
-Xeon Gold 6130 &      32 &  8  &  2.1  &    256.0         &     2~150 &   667 \\  % 239.891
-ARM Q80        &      80 &  2  &  2.8  &    204.8         &     1~792 &   547 \\  % 292.492
+\textsc{EPYC} 7513       & $2\times 128$  & $2 \times 32$  &  4  &  2.6  &    409.6         &     2~662 & 1~576 \\
+Xeon Gold 6130           & $2\times 22$   & $2 \times 16$  &  8  &  2.1  &    256.0         &     2~150 &   667 \\  % 239.891
+ARM Q80                  &          $32$  &           $80$ &  2  &  2.8  &    204.8         &     1~792 &   547 \\  % 292.492
 \end{tabular}
 \end{ruledtabular}
 \caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
@@ -458,7 +459,7 @@ These multiplications exhibit an arithmetic intensity of
 I = \frac{2\, {N_\text{o}}^3\, N_\text{v}}{8\, \qty({N_\text{o}}^3 + {N_\text{o}}^2 N_\text{v} + {N_\text{o}} N_\text{v})}
 \end{equation}
 which is bounded above by $N_\text{o} / 4$ flops/byte, a value that is usually relatively low.
-For instance, in the case of benzene with a triple-zeta basis set, the arithmetic intensity is calculated to be 3.33 flops/byte, falling short of the threshold required to attain peak performance on any of the CPUs.
+For instance, in the case of benzene with a triple-zeta basis set \anthony{($N_\text{o}=15, N_\text{v}=243$)}, the arithmetic intensity is calculated to be 3.33 flops/byte, falling short of the threshold required to attain peak performance on any of the CPUs.
 From the ratio of peak double-precision throughput to memory bandwidth, we determined the critical arithmetic intensity necessary to achieve peak performance. On the Xeon and ARM CPUs, this critical value stands at approximately 8.4 and 8.8 flops/byte, respectively. Meanwhile, the \textsc{EPYC} CPU exhibits a value of 6.5 flops/byte, thanks to its superior memory bandwidth.

 \subsection{Parallel efficiency}
@@ -477,10 +478,11 @@ The parallel speedup performance of the ARM and AMD servers for computations inv

 This performance behavior can largely be attributed to the arithmetic intensity and the bandwidth characteristics of these servers.
 On the ARM server, the peak performance is attained at an arithmetic intensity of 8.75~flops/byte.
-Notably, with fewer cores, the bandwidth per core increases, thereby enhancing efficiency.
-For the benzene molecule in the triple-zeta basis set, the critical arithmetic intensity is 3.33~flops/byte.
+Notably, with fewer cores, the bandwidth per core \anthony{and the amount of shared level-3 (L3) cache available per core} increase, thereby enhancing efficiency.
+For the benzene molecule in the triple-zeta basis set, the arithmetic intensity is 3.33~flops/byte.
 This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
-Beyond these thresholds, particularly after 64 cores on the ARM server, the heavy demand on memory bandwidth results in a rapid decline in speedup.
+\anthony{Beyond these thresholds, the heavy demand on memory bandwidth results in a decrease in speedup.
+Beyond 64 cores on the ARM server, we observe a severe performance drop due to the limited size of the L3 cache: each matrix multiplication requires 494~kB to hold its three matrices, so 64 independent threads already fill the 32~MB L3 cache.}


 %%%
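
Reviewer note: the figures quoted in the changed paragraphs can be cross-checked with a few lines of arithmetic. The sketch below is not part of the manuscript or its code; the function names are hypothetical, the hardware numbers are copied from the table in this diff, and the core-count threshold assumes that peak throughput scales linearly with the number of active cores while memory bandwidth is shared by all of them.

```python
# Roofline-style sanity check of the numbers quoted in the diff above.
# All names are illustrative; hardware figures come from the table in the diff.

def working_set_bytes(n_o: int, n_v: int) -> int:
    """Bytes held by the three double-precision matrices of one multiplication."""
    return 8 * (n_o**3 + n_o**2 * n_v + n_o * n_v)

def arithmetic_intensity(n_o: int, n_v: int) -> float:
    """Flops per byte, following the equation in the manuscript."""
    return 2 * n_o**3 * n_v / working_set_bytes(n_o, n_v)

# cores, memory bandwidth (GB/s), peak DP (GFlop/s), total shared L3 (MB)
CPUS = {
    "EPYC 7513":      {"cores": 64, "bandwidth": 409.6, "peak": 2662.0, "l3_mb": 256},
    "Xeon Gold 6130": {"cores": 32, "bandwidth": 256.0, "peak": 2150.0, "l3_mb": 44},
    "ARM Q80":        {"cores": 80, "bandwidth": 204.8, "peak": 1792.0, "l3_mb": 32},
}

def critical_intensity(cpu: dict) -> float:
    """Intensity (flops/byte) at which the full machine becomes compute bound."""
    return cpu["peak"] / cpu["bandwidth"]

def threshold_cores(cpu: dict, intensity: float) -> float:
    """Core count beyond which memory bandwidth limits a kernel of this intensity.

    Assumption: peak scales linearly with active cores, bandwidth is shared.
    """
    return cpu["cores"] * intensity / critical_intensity(cpu)

if __name__ == "__main__":
    n_o, n_v = 15, 243  # benzene, triple-zeta basis set (values from the diff)
    intensity = arithmetic_intensity(n_o, n_v)
    footprint = working_set_bytes(n_o, n_v)
    print(f"intensity = {intensity:.2f} flops/byte, footprint = {footprint / 1e3:.0f} kB")
    for name, cpu in CPUS.items():
        print(f"{name}: critical = {critical_intensity(cpu):.2f} flops/byte, "
              f"threshold = {threshold_cores(cpu, intensity):.1f} cores")
    # Footprint of 64 independent threads against the 32 MB L3 of the ARM server
    print(f"64 threads need {64 * footprint / 1e6:.1f} MB of cache")
```

Under these assumptions the thresholds come out near 30 cores for the ARM server and between 32 and 33 cores for the AMD server, and 64 threads occupy roughly 31.6 MB, which is consistent with the 32 MB L3 cache being saturated at that point.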