pres_intel/scemama.org
2021-07-01 16:04:42 +02:00

TREX: an innovative view of HPC usage applied to Quantum Monte Carlo simulations

#+LaTeX_CLASS_OPTIONS:[aspectratio=169]

Quantum chemistry

  • Describing matter with quantum mechanics (Schrödinger's equation)
  • Users: theoretical chemists and physicists

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/Water.png

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/casula.png

- Health: Drug design
- Electronics: Nano- and micro-electronics
- Materials: Carbon nanotubes, graphene, …
- Catalysis: Enzymatic reactions, petroleum

The TREX CoE

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/TREX2.png

  • CHAMP
  • QMC=Chem
  • TurboRVB
  • NECI
  • Quantum Package
  • GammCor

TREX: Targeting REal chemical accuracy at the EXascale

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/Curve.png

How: Instead of re-writing codes, provide libraries

  • A library for exchanging information between codes (TREXIO) $\Longrightarrow$ Enables HTC
  • A library of high-performance kernels (QMCkl) $\Longrightarrow$ Enables HPC
  • Highly accurate
  • Massively parallelisable (multiple QMC trajectories)
  • CPU intensive

I/O library (TREXIO)

digraph G {
QP [label="Quantum Package"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
QP -> NECI;
NECI -> GammCor [style="dotted"];
NECI -> QMCCHEM [style="dotted"] ;
QP -> QMCCHEM;
QP -> CHAMP;
QP -> GammCor [style="dotted"];
QP -> Turbo [style="dotted"];
NECI -> Turbo [style="dotted"];
NECI -> CHAMP [style="dotted"];
QMCCHEM -> GammCor [style="dotted"];
CHAMP -> GammCor [style="dotted"];
Turbo -> GammCor [style="dotted"];
}

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/interfaces.png

digraph G {
layout=circo;
External [label="External codes"];
QP [label="Quantum Package"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
TREX [label="TREXIO File", shape="box"];
CHAMP -> TREX;
GammCor -> TREX;
NECI -> TREX;
QMCCHEM -> TREX;
QP -> TREX;
Turbo -> TREX;
External -> TREX;

TREX -> CHAMP;
TREX -> GammCor;
TREX -> NECI;
TREX -> QMCCHEM;
TREX -> QP;
TREX -> Turbo;
TREX -> External;
}

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/interfaces2.png
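
The two graphs above contrast direct code-to-code interfaces with a shared file. As a rough illustration (the function names are hypothetical), a fully connected set of n codes needs up to n(n-1) one-way converters, whereas a common file needs one reader and one writer per code, 2n in total:

```python
def pairwise_converters(n_codes: int) -> int:
    """One-way converters needed if every code talks directly to every other."""
    return n_codes * (n_codes - 1)


def hub_converters(n_codes: int) -> int:
    """Readers plus writers needed if every code only talks to a common file."""
    return 2 * n_codes


codes = ["CHAMP", "GammCor", "NECI", "QMC=Chem", "Quantum Package", "TurboRVB"]
n = len(codes)
print(pairwise_converters(n), hub_converters(n))  # prints: 30 12
```

Adding an external code to the hub layout costs two more interfaces, independently of how many codes already exist.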

(BSD license)
https://github.com/trex-coe/trexio

I/O library (TREXIO)

  • Definition of an API to read/write wave functions
  • C-compatible API: Easy bindings in other languages
  • File is self-contained: no external knowledge needed to compute $\Psi(r_1,\dots,r_n)$ (normalization factors, basis set parameters, etc.)
  • Strong conventions (atomic units, ordering of Cartesian orbitals, etc.)

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/api.png

  • HDF5: Efficient I/O
  • Text: debugging, fallback when HDF5 can't be installed

Source code generated from a config file.
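
To give an idea of what generating source code from a configuration file can look like, here is a minimal sketch. The configuration layout, the `demo_` prefix and the shape of the generated prototypes are assumptions made for illustration; they are not the actual TREXIO configuration or API.

```python
import json

# Hypothetical configuration: group -> attribute -> C type.
CONFIG = json.loads("""
{
  "nucleus":  {"num": "int64_t", "charge": "double*"},
  "electron": {"up_num": "int64_t", "dn_num": "int64_t"}
}
""")


def generate_prototypes(config: dict, prefix: str = "demo") -> list[str]:
    """Emit one read and one write C prototype per (group, attribute) pair."""
    lines = []
    for group, attributes in config.items():
        for name, ctype in attributes.items():
            is_array = ctype.endswith("*")
            # Readers fill an output argument: scalars need a pointer,
            # arrays are already passed as one.
            out_type = ctype if is_array else ctype + "*"
            in_type = "const " + ctype
            lines.append(f"int {prefix}_read_{group}_{name}({prefix}_file* f, {out_type} value);")
            lines.append(f"int {prefix}_write_{group}_{name}({prefix}_file* f, {in_type} value);")
    return lines


for prototype in generate_prototypes(CONFIG):
    print(prototype)
```

With this kind of design, adding a new quantity to the file format means adding one line to the configuration rather than writing new reader and writer functions by hand for each back end.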

Quantum Monte Carlo (QMC)

\alert{Problem}: Stochastic solution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
                         {\int \dcoord \Phi(\coord) \Phi(\coord)} \nonumber \\
                  &\sim & \sum \frac{ {\cal H}\Psi(\coord )}{\Psi(\coord )}
                    \text{, sampled with } (\Psi \times \Phi)
\nonumber
\end{eqnarray}
\begin{columns}
\begin{column}{.5\textwidth}
\begin{itemize}
\item[$\cal H $: ] Hamiltonian operator
\item[$E$: ] Energy
\end{itemize}
\end{column}
\begin{column}{.4\textwidth}
\begin{itemize}
\item[$\coord $: ] Electron coordinates
\item[$\Phi $: ] Almost exact wave function
\item[$\Psi $: ] Trial wave function
\end{itemize}
\end{column}
\end{columns}
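
The estimator above averages the local energy ${\cal H}\Psi/\Psi$ over sampled configurations. The sketch below illustrates that idea on a deliberately simpler case than the one on the slide: a variational Monte Carlo run for a 1D harmonic oscillator, sampling $\Psi^2$ with the Metropolis algorithm instead of $\Psi \times \Phi$. The model, the trial function $\Psi = e^{-a x^2}$ and all function names are assumptions chosen for illustration.

```python
import math
import random


def local_energy(x: float, a: float) -> float:
    """(H Psi)/Psi for H = -1/2 d^2/dx^2 + x^2/2 and Psi = exp(-a x^2)."""
    return a + x * x * (0.5 - 2.0 * a * a)


def vmc_energy(a: float, n_steps: int = 100_000, step: float = 1.0, seed: int = 1) -> float:
    """Metropolis sampling of Psi^2, averaging the local energy."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(n_steps):
        x_new = x + step * (2.0 * rng.random() - 1.0)
        # Acceptance ratio Psi(x_new)^2 / Psi(x)^2
        ratio = math.exp(-2.0 * a * (x_new * x_new - x * x))
        if rng.random() < ratio:
            x = x_new
        total += local_energy(x, a)
    return total / n_steps


for a in (0.4, 0.5):
    print(f"a = {a}: E = {vmc_energy(a):.4f}")
```

For $a = 0.5$ the trial function is the exact ground state, so the local energy is constant and equal to 0.5; for other values of $a$ the average fluctuates around $a/2 + 1/(8a)$.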

Quantum Monte Carlo (QMC)

  • Very low memory requirements (no integrals)
  • Distribute walkers on different cores or compute nodes
  • No blocking communication: near-ideal scaling
  • Difficulty: parallelizing within a single QMC trajectory
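
The near-ideal scaling comes from the fact that trajectories are statistically independent and only their averages need to be combined. A minimal sketch of that combination step follows; real local energies are replaced by Gaussian noise around an arbitrary value of -1.0, which is purely an assumption for illustration, as are the function names.

```python
import math
import random
import statistics


def run_trajectory(seed: int, n_steps: int = 10_000) -> float:
    """Stand-in for one QMC trajectory: returns its average 'local energy'.

    Real local energies are replaced by Gaussian noise around -1.0 (assumption).
    """
    rng = random.Random(seed)
    return sum(rng.gauss(-1.0, 0.5) for _ in range(n_steps)) / n_steps


def combine(averages: list[float]) -> tuple[float, float]:
    """Mean and standard error computed from independent trajectory averages."""
    mean = statistics.fmean(averages)
    error = statistics.stdev(averages) / math.sqrt(len(averages))
    return mean, error


# Each call is independent, so it could run on a different core or node;
# only one number per trajectory has to be communicated at the end.
averages = [run_trajectory(seed) for seed in range(16)]
energy, error = combine(averages)
print(f"E = {energy:.4f} +/- {error:.4f}")
```

Doubling the number of trajectories reduces the statistical error by roughly a factor of $\sqrt{2}$ without any communication during the run.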

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/Qmc.png

QMC kernel library (QMCkl)

Computational kernels

  • QMCkl will contain the main kernels of QMC methods
  • Written together by QMC experts and HPC experts
  • Multiple high performance implementations of the kernels, tuned for different

    • architectures
    • problem sizes
    • requested accuracy (reduced precision)

QMC kernel library (QMCkl)

Two implementations

  • Documentation: easy to read and understand, not necessarily efficient
  • High performance: efficient, but not necessarily readable by physicists/chemists
  • Both implementations expose the same API.
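
The idea of two implementations behind one API can be sketched as follows. The kernel (an electron-electron repulsion sum) and all names are hypothetical and are not taken from QMCkl; the second function merely stands in for a tuned version.

```python
import itertools
import math

Coordinates = list[tuple[float, float, float]]


def ee_potential_doc(coords: Coordinates) -> float:
    """Reference version: follows the formula sum_{i<j} 1/r_ij literally."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            dz = coords[i][2] - coords[j][2]
            total += 1.0 / math.sqrt(dx * dx + dy * dy + dz * dz)
    return total


def ee_potential_fast(coords: Coordinates) -> float:
    """Stand-in for a tuned version: same signature, different internals."""
    return math.fsum(1.0 / math.dist(a, b) for a, b in itertools.combinations(coords, 2))


def check_kernel(reference, candidate, coords, tolerance=1e-12) -> bool:
    """Swap one implementation for the other and compare the results."""
    return math.isclose(reference(coords), candidate(coords), rel_tol=tolerance)


electrons = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
print(check_kernel(ee_potential_doc, ee_potential_fast, electrons))
```

Because the signatures match, the readable version can serve as a test oracle for the optimized one.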

Advantages

  • The code can stay easy for physicists/chemists to understand: performance-related aspects are delegated to the library
  • Scientists can use their preferred language
  • Scientists don't lose control over their codes
  • Codes don't die when the architecture changes
  • Scientific code development does not degrade performance
  • Better re-use of the optimization effort among the community

HPC library

  • Same API as the documentation library
  • Optimization is guided by analysis with MAQAO\footnote{https://maqao.org}.
  • Propose performance-critical choices in the API design (data structures, memory management, etc)
  • Both CPU and GPU versions of the kernels
  • Task parallelism with StarPU\footnote{C. Augonnet et al., doi:10.1002/cpe.1631} to schedule kernels on CPU and GPU and handle asynchronous CPU-GPU transfers

Efficiently guiding the developer

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/maqao1.png

Extensive/automatic testing of different configurations

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/maqao2.png

First application: 3-body Jastrow factor

\[ \Jeen (\br,\bR) = \sum_{\alpha=1}^{\Nat} \sum_{i=1}^{\Nel} \sum_{j=1}^{i-1} \sum_{p=2}^{\Nord} \sum_{k=0}^{p-1} \sum_{l=0}^{\lmax} c_{lkp\alpha} \left( {r}_{ij} \right)^k \left[ \left( {R}_{i\alpha} \right)^l + \left( {R}_{j\alpha} \right)^l \right] \left( {R}_{i\alpha} \, {R}_{j\alpha} \right)^{(p-k-l)/2} \]

/scemama/pres_intel/src/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/speedup.pdf

  • Gradient and Laplacian are also required
  • Up to $20\times$ faster than in the original code
  • $\sim 80\%$ of the AVX-512 peak is reached
  • Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
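
One way to see why such a sum can be cast as matrix products is to look at a single term for fixed $\alpha$, $k$, $l$ and an integer exponent $m$ standing for $(p-k-l)/2$. Since $r_{ij}$ is symmetric, the sum over pairs $j<i$ of $r_{ij}^k (R_i^l + R_j^l)(R_i R_j)^m$ equals $\sum_{i \ne j} R_i^{l+m}\, r_{ij}^k\, R_j^m$, which is a product $u^T P v$. The sketch below checks this identity numerically on random data; it is an illustration under these assumptions, not the QMCkl implementation.

```python
import random


def jastrow_term_loops(r, R, k, l, m):
    """Direct sum over pairs j < i of r_ij^k (R_i^l + R_j^l) (R_i R_j)^m."""
    n = len(R)
    total = 0.0
    for i in range(n):
        for j in range(i):
            total += r[i][j] ** k * (R[i] ** l + R[j] ** l) * (R[i] * R[j]) ** m
    return total


def jastrow_term_matrix(r, R, k, l, m):
    """Same quantity written as u^T P v, with P_ij = r_ij^k and a zero diagonal."""
    n = len(R)
    u = [x ** (l + m) for x in R]
    v = [x ** m for x in R]
    # P v is a matrix-vector product; stacking many v (one per nucleus and
    # exponent) as columns would turn it into a matrix-matrix product.
    Pv = [sum(r[i][j] ** k * v[j] for j in range(n) if j != i) for i in range(n)]
    return sum(u[i] * Pv[i] for i in range(n))


# Random symmetric test data (arbitrary values, for the check only).
rng = random.Random(0)
n = 6
R = [rng.uniform(0.1, 1.0) for _ in range(n)]
r = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i):
        r[i][j] = r[j][i] = rng.uniform(0.1, 1.0)

print(jastrow_term_loops(r, R, 2, 1, 1), jastrow_term_matrix(r, R, 2, 1, 1))
```

The two printed numbers should agree to rounding error; the matrix form is what maps onto a DGEMM-style kernel once many vectors are processed together.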

Verificarlo CI

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/img/cmp-runs.png

  • Track precision of kernels over commits
  • Shows significant digits $s$, standard deviation $\sigma$, variable distribution
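
A commonly used estimate of the number of significant digits from repeated randomized runs is $s = -\log_{10}(\sigma / |\mu|)$, where $\mu$ and $\sigma$ are the mean and standard deviation of the results. The sketch below only imitates the idea by injecting a random relative perturbation after each addition; it does not use Verificarlo, and the noise model and function names are assumptions.

```python
import math
import random
import statistics


def noisy_sum(values, rel_noise, rng):
    """Sum with a random relative perturbation after each addition.

    This only imitates the idea of randomized arithmetic; it is not Verificarlo.
    """
    total = 0.0
    for v in values:
        total = (total + v) * (1.0 + rel_noise * (2.0 * rng.random() - 1.0))
    return total


def significant_digits(samples):
    """Estimate s = -log10(sigma / |mu|) from repeated perturbed runs."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    if sigma == 0.0:
        return float("inf")
    return -math.log10(sigma / abs(mu))


rng = random.Random(0)
values = [1.0 / (i + 1) for i in range(1000)]
samples = [noisy_sum(values, 1e-8, rng) for _ in range(30)]
print(f"s = {significant_digits(samples):.1f}, sigma = {statistics.stdev(samples):.2e}")
```

Recording $s$ for each kernel at every commit makes a loss of precision visible as a drop in the curve.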

/scemama/pres_intel/media/commit/c57746aabfe2965a8af342ea136ebf00b9b3ac6b/img/inspect-runs.png

  • Focus in depth on one particular run
  • Compare multiple implementations of the same kernel