Final changes

This commit is contained in:
Anthony Scemama 2024-06-25 12:35:28 +02:00
parent 80098d38fe
commit acc4f921dc


@ -210,7 +210,7 @@ To reduce the fluctuations of the statistical estimator, we apply importance sam
P^{abc} = \frac{1}{\mathcal{N}} \frac{1}{\max \left(\epsilon_{\min}, \epsilon_a + \epsilon_b + \epsilon_c \right)}
\end{equation}
where $\mathcal{N}$ normalizes the sum such that $\sum_{abc} P^{abc} = 1$, and $\epsilon_{\min}$ is an arbitrary minimal denominator to ensure that $P^{abc}$ does not diverge. In our calculations, we have set $\epsilon_{\min}$ to 0.2~a.u.
\anthony{The algorithm is not very sensitive to the value of $\epsilon_{\min}$ as long as it is taken within reasonable bounds, in the range of the level-shift parameter in SCF calculations.}
\anthony{The algorithm is not very sensitive to the value of $\epsilon_{\min}$ as long as it is taken within reasonable bounds (in the range of the level-shift parameter of SCF calculations).}
The perturbative contribution is then evaluated as an average over $M$ samples
\begin{equation}
E_{(T)} = \left\langle \frac{E^{abc}}{P^{abc}} \right \rangle_{P^{abc}} =
@ -442,7 +442,7 @@ Xeon Gold 6130 & $2 \times 16$ & $2\times 22$ & 8 & 2.1 & 256.0
ARM Q80 & $80$ & $32$ & 2 & 2.8 & 204.8 & 1~792 & 547 \\ % 292.492
\end{tabular}
\end{ruledtabular}
\caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
\caption{\label{tab:flops} \anthony{Characteristics of the different machines, and the measured performance of the code in terms of double precision (DP) floating-point operations per second (Flop/s).}}
\end{table*}
Table~\ref{tab:flops} summarizes the performance tests.
@ -483,7 +483,7 @@ Notably, with fewer cores, the bandwidth per core \anthony{and the amount of ava
For the benzene molecule in the triple-zeta basis set, the arithmetic intensity is 3.33~flops/byte.
This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
\anthony{Beyond these thresholds, the heavy demand on memory bandwidth results in a decrease in speedup.
Beyond 64 cores on the ARM server, we observe a severe performance drop due to the limited size of the L3 cache: each matrix multiplication requires 494~kB to host the three matrices, and with 64 independent threads the L3 cache is already full.}
Beyond 64 cores on the ARM server, we observe a severe performance drop due to the limited size of the L3 cache: each matrix multiplication requires 494~kB for the three matrices, and with 64 independent threads the L3 cache is already full.}
%%%
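The core-count threshold and the L3 cache argument in the last hunk can be checked with a short back-of-the-envelope calculation. The sketch below is not part of the commit. It assumes a simple roofline-style model, and it assumes that the ARM Q80 row of the table lists 80 cores, a 32 MB L3 cache, a memory bandwidth of 204.8 GB/s and a peak of 1 792 DP GFlop/s (the column headers are not visible in the diff). The function names are hypothetical.

```python
# Rough roofline-style check of the figures quoted in the text.
# Column meanings for the ARM Q80 row are ASSUMED, not confirmed by the diff.
ARM_CORES = 80                # assumed: number of cores
ARM_PEAK_GFLOPS = 1792.0      # assumed: peak DP GFlop/s of the whole machine
ARM_BANDWIDTH_GBS = 204.8     # assumed: memory bandwidth in GB/s
ARM_L3_MB = 32.0              # assumed: L3 cache size in MB

ARITHMETIC_INTENSITY = 3.33   # flops/byte, benzene in triple-zeta (from the text)
WORKING_SET_KB = 494.0        # three matrices per multiplication (from the text)


def core_threshold(peak_gflops, n_cores, bandwidth_gbs, intensity):
    """Cores needed for the compute demand to reach the bandwidth ceiling."""
    peak_per_core = peak_gflops / n_cores          # GFlop/s per core
    bandwidth_bound = bandwidth_gbs * intensity    # GFlop/s sustainable from memory
    return bandwidth_bound / peak_per_core


def l3_footprint_mb(n_threads, working_set_kb):
    """Total cache footprint in MB (decimal units) of independent threads."""
    return n_threads * working_set_kb / 1000.0


if __name__ == "__main__":
    threshold = core_threshold(
        ARM_PEAK_GFLOPS, ARM_CORES, ARM_BANDWIDTH_GBS, ARITHMETIC_INTENSITY
    )
    print(f"ARM bandwidth-bound threshold: {threshold:.1f} cores")
    print(f"L3 footprint with 64 threads: {l3_footprint_mb(64, WORKING_SET_KB):.1f} MB")
```

Under these assumptions the threshold comes out near 30 cores, and 64 working sets of 494 kB occupy about 31.6 MB, close to the assumed 32 MB of L3. Both figures are consistent with the statements in the text. The AMD threshold cannot be checked in the same way because the corresponding table row is not part of the diff.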