Final changes
parent 80098d38fe
commit acc4f921dc
@@ -210,7 +210,7 @@ To reduce the fluctuations of the statistical estimator, we apply importance sam
 P^{abc} = \frac{1}{\mathcal{N}} \frac{1}{\max \left(\epsilon_{\min}, \epsilon_a + \epsilon_b + \epsilon_c \right)}
 \end{equation}
 where $\mathcal{N}$ normalizes the sum such that $\sum_{abc} P^{abc} = 1$, and $\epsilon_{\min}$ is an arbitrary minimal denominator to ensure that $P^{abc}$ does not diverge. In our calculations, we have set $\epsilon_{\min}$ to 0.2~a.u.
-\anthony{The algorithm is not very sensitive to the value of $\epsilon_{\min}$ as long as it is taken within reasonable bounds, in the range of the level-shift parameter in SCF calculations.}
+\anthony{The algorithm is not very sensitive to the value of $\epsilon_{\min}$ as long as it is taken within reasonable bounds (in the range of the level-shift parameter of SCF calculations).}
 The perturbative contribution is then evaluated as an average over $M$ samples
 \begin{equation}
 E_{(T)} = \left\langle \frac{E^{abc}}{P^{abc}} \right \rangle_{P^{abc}} =
@@ -442,7 +442,7 @@ Xeon Gold 6130 & $2 \times 16$ & $2\times 22$ & 8 & 2.1 & 256.0
 ARM Q80 & $80$ & $32$ & 2 & 2.8 & 204.8 & 1~792 & 547 \\ % 292.492
 \end{tabular}
 \end{ruledtabular}
-\caption{\label{tab:flops} Average performance of the code measured as the number of double precision (DP) floating-point operations per second (Flop/s) on different machines.}
+\caption{\label{tab:flops} \anthony{Characteristics of the different machines, and the measured performance of the code in terms of double precision (DP) floating-point operations per second (Flop/s).}}
 \end{table*}

 Table~\ref{tab:flops} summarizes the performance tests.
@@ -483,7 +483,7 @@ Notably, with fewer cores, the bandwidth per core \anthony{and the amount of ava
 For the benzene molecule in the triple-zeta basis set, the arithmetic intensity is 3.33~flops/byte.
 This intensity corresponds to a threshold of approximately 30 cores for the ARM server and 32 cores for the AMD server.
 \anthony{Beyond these thresholds, the heavy demand on memory bandwidth results in a decrease in speedup.
-Beyond 64 cores on the ARM server, we observe a severe performance drop due to the limited size of the L3 cache: each matrix multiplication requires 494~kb to host the three matrices, and with 64 independent threads the L3 cache is already full.}
+Beyond 64 cores on the ARM server, we observe a severe performance drop due to the limited size of the L3 cache: each matrix multiplication requires 494~kb for the three matrices, and with 64 independent threads the L3 cache is already full.}


 %%%
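
The importance-sampling scheme touched by the first hunk can be illustrated with a small Python sketch. It builds $P^{abc}$ from the clamped denominator $\max(\epsilon_{\min}, \epsilon_a + \epsilon_b + \epsilon_c)$ and averages $E^{abc}/P^{abc}$ over $M$ samples drawn from $P^{abc}$. The orbital energies, the placeholder contributions `E_abc`, and all function names are assumptions made for illustration only; they are not taken from the manuscript or its code.

```python
import numpy as np


def triple_denominators(eps, eps_min=0.2):
    """Clamped denominators max(eps_min, eps[a] + eps[b] + eps[c]) for all (a, b, c)."""
    total = eps[:, None, None] + eps[None, :, None] + eps[None, None, :]
    return np.maximum(eps_min, total)


def sampling_probabilities(eps, eps_min=0.2):
    """P[a, b, c] = (1 / N) * 1 / max(eps_min, eps[a] + eps[b] + eps[c])."""
    weights = 1.0 / triple_denominators(eps, eps_min)
    return weights / weights.sum()  # weights.sum() plays the role of N


def estimate_sum(E_abc, P, n_samples, rng):
    """Estimate sum_abc E_abc as the average of E/P over samples drawn from P."""
    flat_E = E_abc.ravel()
    flat_P = P.ravel()
    idx = rng.choice(flat_P.size, size=n_samples, p=flat_P)
    return float(np.mean(flat_E[idx] / flat_P[idx]))


# Toy data (hypothetical): 8 energies and made-up negative contributions.
rng = np.random.default_rng(0)
eps = np.linspace(0.05, 2.0, 8)
P = sampling_probabilities(eps)
E_abc = -rng.random((8, 8, 8)) / triple_denominators(eps)

exact = float(E_abc.sum())
estimate = estimate_sum(E_abc, P, n_samples=20_000, rng=rng)
print(f"exact = {exact:.4f}, stochastic estimate = {estimate:.4f}")
```

Because the toy contributions scale roughly like the sampling weights, the ratio $E^{abc}/P^{abc}$ fluctuates little, which is the point of choosing $P^{abc}$ this way. The clamp at $\epsilon_{\min}$ keeps the weights finite when the sum of energies is small.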