Libraries developed in the TREX CoE

#+LaTeX_CLASS_OPTIONS:[aspectratio=169]

QMC in TREX

  • Highly accurate methods
  • Massively parallelisable (multiple QMC trajectories)
  • Very CPU intensive: One of the most "compute-hungry" methods
  • Still under development: scientists need to run and develop code
  • Input data is complex (wave function)

How: Instead of rewriting codes, provide libraries (free software)

  1. TREXIO: A library for exchanging information between codes $\Longrightarrow$ Enables HTC
  2. QMCkl: A library for high-performance computational kernels $\Longrightarrow$ Enables HPC

Quantum Monte Carlo (QMC)

\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
                        {\int \dcoord \Phi(\coord) \Phi(\coord)} \nonumber \\
                 &\sim & \sum \frac{ {\cal H}\Psi(\coord )}{\Psi(\coord )}
                   \text{, sampled with } (\Psi \times \Phi)
\nonumber
\end{eqnarray}
\begin{columns}
\begin{column}{.5\textwidth}
\begin{itemize}
\item[$\cal H $: ] Hamiltonian operator
\item[$E$: ] Energy
\end{itemize}
\end{column}
\begin{column}{.4\textwidth}
\begin{itemize}
\item[$\coord $: ] Electron coordinates
\item[$\Phi $: ] Almost exact wave function
\item[$\Psi $: ] Trial wave function
\end{itemize}
\end{column}
\end{columns}
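Written out explicitly (a standard rewriting, added here for clarity): the quantity averaged over the sampled configurations is the local energy,

\[ E_{\text{loc}}(\coord) = \frac{{\cal H}\Psi(\coord)}{\Psi(\coord)}, \qquad E \approx \frac{1}{M} \sum_{k=1}^{M} E_{\text{loc}}(\coord_k), \quad \coord_k \text{ sampled from } \Psi \times \Phi. \]

When $\Psi$ approaches the exact wave function, $E_{\text{loc}}$ becomes constant and the variance of the estimator vanishes (zero-variance property).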

Quantum Monte Carlo (QMC)

  • Very low memory requirements (no integrals)
  • Distribute walkers on different cores or compute nodes
  • No blocking communication: near-ideal scaling
  • Parallelizing within a single QMC trajectory is difficult: the available parallelism depends on the number of electrons

[[file:Qmc.png]]

Both libraries

Three objectives

  1. Productivity
    Usable and useful by scientists in different programming languages
  2. Portability
    Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
  3. Performance
    Must be efficient on all architectures: possible tradeoffs between portability and performance

Free (libre) software

  • Requirement for open science
  • BSD license for adoption by any software (academic, commercial, …)

TREXIO: I/O library

digraph G {
QP [label="Quantum Package"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
QP -> NECI;
NECI -> GammCor [style="dotted"];
NECI -> QMCCHEM [style="dotted"] ;
QP -> QMCCHEM;
QP -> CHAMP;
QP -> GammCor [style="dotted"];
QP -> Turbo [style="dotted"];
NECI -> Turbo [style="dotted"];
NECI -> CHAMP [style="dotted"];
QMCCHEM -> GammCor [style="dotted"];
CHAMP -> GammCor [style="dotted"];
Turbo -> GammCor [style="dotted"];
}

[[file:interfaces.png]]

digraph G {
layout=circo;
External [label="External\ncodes"];
QP [label="Quantum\nPackage"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
TREX [label="TREXIO File", shape="box"];
CHAMP -> TREX;
GammCor -> TREX;
NECI -> TREX;
QMCCHEM -> TREX;
QP -> TREX;
Turbo -> TREX;
External -> TREX;

TREX -> CHAMP;
TREX -> GammCor;
TREX -> NECI;
TREX -> QMCCHEM;
TREX -> QP;
TREX -> Turbo;
TREX -> External;
}

[[file:interfaces2.png]]

(BSD license)
https://github.com/trex-coe/trexio

TREXIO: I/O library

  • Definition of an API to read/write wave functions
  • C-compatible API: Easy usage in all common languages

[[file:api.png]]

  • HDF5: Efficient I/O
  • Text:

    • Fallback when HDF5 can't be installed
    • Debugging
    • Version control systems

Content of the files

  • File is self-contained: no external knowledge is needed to compute $\Psi(r_1,\dots,r_N)$ (normalization factors, basis set parameters, etc.)
  • Strong conventions (atomic units, ordering of atomic orbitals, etc.)
  • The data stored in the files is organized in different groups:

    | Metadata               | Electron         | Slater determinants    |
    | Nucleus                | Basis            | CI coefficients        |
    | AO                     | MO               | Two-electron integrals |
    | One-electron integrals | Density matrices | ECP                    |
  • Each group contains multiple attributes: information related to the group
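As an illustration of the API style, here is a minimal sketch that writes the nucleus group from C. The accessor names follow the trexio_<action>_<group>_<attribute> pattern of the public API, but the exact signatures should be checked against the installed trexio.h, and the water-like geometry is made up for the example.

#include <stdio.h>
#include <stdint.h>
#include <trexio.h>

int main(void) {
  trexio_exit_code rc;

  // Open a file using the HDF5 back end ('w' = write mode)
  trexio_t* f = trexio_open("h2o.trexio", 'w', TREXIO_HDF5, &rc);
  if (f == NULL) { fprintf(stderr, "trexio_open failed\n"); return 1; }

  // Attributes of the "nucleus" group: count, charges, coordinates
  // (atomic units, following the TREXIO conventions)
  const int64_t nucleus_num = 3;
  const double  charge[3]   = {8., 1., 1.};
  const double  coord[9]    = {0., 0.,  0.,     // hypothetical geometry
                               0., 0.,  1.8,
                               0., 1.7, -0.6};

  rc = trexio_write_nucleus_num(f, nucleus_num);
  if (rc != TREXIO_SUCCESS) return 1;
  rc = trexio_write_nucleus_charge(f, charge);
  if (rc != TREXIO_SUCCESS) return 1;
  rc = trexio_write_nucleus_coord(f, coord);
  if (rc != TREXIO_SUCCESS) return 1;

  return (trexio_close(f) == TREXIO_SUCCESS) ? 0 : 1;
}

Reading follows the same pattern with the trexio_read_* accessors, so a code can consume a wave function without knowing which program produced it.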

QMCkl: QMC kernel library

Computational kernels

  • QMCkl will contain the main kernels of QMC methods: a domain-specific library, driven by end users
  • Written together by QMC experts and HPC experts
  • Multiple high-performance implementations of the kernels, tuned for different

    • architectures: portability is critical for users
    • problem sizes: from small to large systems
    • requested accuracy: reduced precision

Objectives

  • The code must stay easy for physicists/chemists to understand; performance-related aspects should be delegated to the library
  • Scientists should be able to use their preferred language
  • Scientists should not lose control of their codes
  • Codes should not die when the architecture changes
  • Scientific code development should not kill the performance
  • Reuse of the optimization effort among the community

Functionality and performance

  • Keeping high productivity, portability and performance is very hard in a single piece of software.

    We propose (at least) two implementations:

    1. Documentation library
      Easy to read, understand, modify for scientists, not necessarily efficient.
    2. High performance libraries
      Efficient on a given architecture, but not necessarily readable by physicists/chemists.
      Performance within 10% of the best attainable is acceptable, to maximize portability and simplicity.
    3. Ultra-High performance libraries
      Generated with auto-tuning tools for well identified datasets.
  • Both the Documentation and High performance libraries expose the same API (similar to the reference BLAS on Netlib vs. MKL).
  • Scientific progress is made in the documentation library, and implemented in the HPC versions when the API is stabilized.
  • Performance: enable a data-driven task-based parallelism

Library design

  • Creation of a Context that keeps a consistent state of the library (pointers to computed data, configuration parameters, etc.)
  • Memory allocation is abstract:

    void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);

    allows allocation on CPU/GPU by the HPC variants

  • Low-level functions: simple functions that leave the context untouched (no allocation, no in-place modification of the context)
  • High-level functions: let the library call multiple kernels in an optimal way, possibly updating the context
  • Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies between kernels: re-compute only what is necessary and store computed data in the context
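To make the dependency tracking concrete, here is a minimal sketch of the IRP idea in C, with hypothetical names (a global date counter and one provider per kernel); the actual QMCkl internals differ.

#include <stdbool.h>
#include <stdint.h>

/* Minimal IRP-style node: a value, a timestamp ("date"), and a validity flag.
   A kernel is recomputed only if one of its dependencies is newer. */
typedef struct {
  double   value;
  uint64_t date;
  bool     valid;
} node_t;

static uint64_t clock_counter = 0;

static node_t elec_coord, distance_ee, v_ee;

/* Provider for Distance_ee: depends on elec_coord. */
double provide_distance_ee(void) {
  if (!distance_ee.valid || distance_ee.date < elec_coord.date) {
    /* ... recompute inter-electron distances from elec_coord ... */
    distance_ee.value = 2.0 * elec_coord.value;   /* placeholder math */
    distance_ee.date  = ++clock_counter;
    distance_ee.valid = true;
  }
  return distance_ee.value;
}

/* Provider for V_ee: pulls its dependency first, then checks freshness. */
double provide_v_ee(void) {
  double d = provide_distance_ee();
  if (!v_ee.valid || v_ee.date < distance_ee.date) {
    v_ee.value = 1.0 / d;                          /* placeholder math */
    v_ee.date  = ++clock_counter;
    v_ee.valid = true;
  }
  return v_ee.value;
}

/* Setting the electron coordinates just bumps their date; everything that
   depends on them is recomputed lazily on the next access. */
void set_elec_coord(double x) {
  elec_coord.value = x;
  elec_coord.date  = ++clock_counter;
  elec_coord.valid = true;
}

Each provider pulls its dependencies first, so after set_elec_coord() only the stale nodes on the path to the requested quantity are recomputed.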

Dependencies between kernels

digraph G {
rankdir = BT;
E_pot -> E_loc;
E_kin -> E_loc;
V_NN -> E_pot;
V_eN -> E_pot;
V_ee -> E_pot;
Distance_ee -> V_ee;
Distance_eN -> V_eN;
Distance_NN -> V_NN;
ECP -> E_pot;
ECP_Local -> ECP;
ECP_Non_Local -> ECP;
Determinants -> E_kin;
Jastrow -> E_kin;
J_eN -> Jastrow;
J_ee -> Jastrow;
J_eeN -> Jastrow;
Distance_eN -> J_eN;
Distance_ee -> J_ee;
Distance_eN -> J_eeN;
Distance_ee -> J_eeN;
Det_up -> Determinants;
Det_down -> Determinants;
MOs -> Det_up;
MOs -> Det_down;
AOs -> MOs;
AO_radial -> AOs;
AO_angular -> AOs;
elec_coord -> Distance_ee;
elec_coord -> Distance_eN;
Distance_eN -> AO_radial;
Distance_eN -> AO_angular;
Determinants -> ECP_Non_Local;
}
  • Only the needed sub-graph is computed
  • HPC: Each kernel is one/many parallel Task(s)
  • HPC: Use OpenMP tasks or StarPU\footnote{C. Augonnet et al, doi:10.1002/cpe.1631} for hybrid architectures (StarPU handles asynchronous CPU-GPU transfers very well); a minimal task sketch follows.
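For illustration, a minimal OpenMP sketch of the E_pot corner of the graph above, with hypothetical compute_* kernels (placeholder bodies); each depend clause encodes one arrow of the DAG.

#include <omp.h>

/* One-element buffers standing in for the kernel outputs. */
static double dist_ee[1], dist_eN[1], v_ee[1], v_eN[1], e_pot[1];

/* Hypothetical kernels with placeholder bodies. */
static void compute_distance_ee(double* d) { d[0] = 1.0; }
static void compute_distance_eN(double* d) { d[0] = 2.0; }
static void compute_v_ee(const double* d, double* v) { v[0] =  1.0 / d[0]; }
static void compute_v_eN(const double* d, double* v) { v[0] = -1.0 / d[0]; }
static void compute_e_pot(const double* a, const double* b, double* e) { e[0] = a[0] + b[0]; }

void e_pot_subgraph(void) {
  #pragma omp parallel
  #pragma omp single
  {
    /* Leaves of the sub-graph: both distance kernels are independent. */
    #pragma omp task depend(out: dist_ee)
    compute_distance_ee(dist_ee);

    #pragma omp task depend(out: dist_eN)
    compute_distance_eN(dist_eN);

    /* Potential terms start as soon as their distances are ready. */
    #pragma omp task depend(in: dist_ee) depend(out: v_ee)
    compute_v_ee(dist_ee, v_ee);

    #pragma omp task depend(in: dist_eN) depend(out: v_eN)
    compute_v_eN(dist_eN, v_eN);

    /* E_pot waits for both potential terms. */
    #pragma omp task depend(in: v_ee, v_eN) depend(out: e_pot)
    compute_e_pot(v_ee, v_eN, e_pot);

    #pragma omp taskwait
  }
}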

Use case: low-level

#include <qmckl.h>
#include <assert.h>

// ...
qmckl_exit_code  rc;
qmckl_context    context;
int64_t          m, n, LDA, LDB, LDC;
// ...
double   A[LDA*3];   // m x 3 coordinates, leading dimension LDA
double   B[LDB*3];   // n x 3 coordinates, leading dimension LDB
double   C[LDC*n];   // m x n distance matrix, leading dimension LDC
// ...

context = qmckl_context_create();

// Compute inter-particle distances between the xyz coordinates in A[m][3]
// and B[n][3] (transposed, 'T'), and store the result in C[m][n]
rc = qmckl_distance(context, 'N', 'T', m, n, A, LDA, B, LDB, C, LDC);
assert (rc == QMCKL_SUCCESS);
// ...

Use case: high-level

#include <qmckl.h>
#include <assert.h>
// ... (trexio_filename and elec_coord are declared and initialized elsewhere)
qmckl_exit_code  rc;
double           e_loc;
qmckl_context    context;

context = qmckl_context_create();

// Store WF parameters in the context
rc = qmckl_read_trexio(context, trexio_filename);
assert (rc == QMCKL_SUCCESS);

// Set the electron coordinates in the context
rc = qmckl_set_electron_coord (context, 'N', elec_coord);          
assert(rc == QMCKL_SUCCESS);

// Return the local energy at the current electron positions
rc = qmckl_get_local_energy(context, &e_loc);
// ...

Development strategy

  1. Kernel extraction: QMC specialists agree on the mathematical expression of the problem
  2. A mini-application is written to find the optimal data layout with HPC experts from real-size examples
  3. The kernel is written in the documentation library
  4. The documentation library is linked in a QMC code to check correctness and numerical accuracy
  5. HPC experts provide an HPC version of the kernel
  6. The HPC library is linked in the QMC codes of the CoE

First application : 3-body Jastrow factor

\[ \Jeen (\br,\bR) = \sum_{\alpha=1}^{\Nat} \sum_{i=1}^{\Nel} \sum_{j=1}^{i-1} \sum_{p=2}^{\Nord} \sum_{k=0}^{p-1} \sum_{l=0}^{\lmax} c_{lkp\alpha} \left( r_{ij} \right)^k \left[ \left( R_{i\alpha} \right)^l + \left( R_{j\alpha} \right)^l \right] \left( R_{i\alpha} \, R_{j\alpha} \right)^{(p-k-l)/2} \]

[[file:speedup.pdf]]

  • Gradient and Laplacian are also required
  • Up to $20\times$ faster than in the original code
  • $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on Intel Skylake
  • Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
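Why a DGEMM appears (a sketch of the rewriting, not the exact QMCkl derivation): collecting the powers into tensors $\tilde{r}^{(k)}_{ij} = (r_{ij})^k$ and $\tilde{R}^{(l)}_{j\alpha} = (R_{j\alpha})^l$, the innermost contraction

\[ P^{(k,l)}_{i\alpha} = \sum_{j=1}^{\Nel} \tilde{r}^{(k)}_{ij}\, \tilde{R}^{(l)}_{j\alpha} \]

is a matrix-matrix product for each $(k,l)$ pair, so the factor (and its derivatives) reduces, up to symmetry factors from the $j<i$ restriction, to a batch of DGEMMs followed by cheap element-wise operations.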

High-Performance strategies

Linear algebra hot spots

| GEMM | Rank-1 update    | Matrix inversion          |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |

Matrices are relatively small ($\le 1000\times 1000$)

  • Matrices are stored in tiled format fitting a block formulation of the algorithms $\Longrightarrow$ task-based linear algebra, interleaved computation of multiple kernels
  • Tile sizes will be adjusted by auto-tuning
  • Increase parallelism by aggregating multiple independent walkers in matrices
  • Needs fast linear algebra kernels for small matrices (tile size)
  • For tiny matrices ($<5\times5$) specialized versions are implemented
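As an illustration of the Sherman-Morrison update listed among the hot spots above, here is a textbook rank-1 inverse update in plain C (row-major, unblocked); the QMCkl kernels are tiled and vectorized, so this is only a sketch of the math.

#include <stddef.h>

/* Rank-1 update of an n x n inverse, row-major:
   given Ainv = A^{-1}, replace it with (A + u v^T)^{-1} using
   (A + u v^T)^{-1} = Ainv - (Ainv u)(v^T Ainv) / (1 + v^T Ainv u).
   Returns 0 on success, -1 if the update is numerically singular. */
int sherman_morrison(double* Ainv, const double* u, const double* v,
                     double* work_au, double* work_va, size_t n) {
  /* work_au = Ainv * u (length n); work_va = v^T * Ainv (length n) */
  for (size_t i = 0; i < n; i++) {
    double s = 0.;
    for (size_t j = 0; j < n; j++) s += Ainv[i*n + j] * u[j];
    work_au[i] = s;
  }
  for (size_t j = 0; j < n; j++) {
    double s = 0.;
    for (size_t i = 0; i < n; i++) s += v[i] * Ainv[i*n + j];
    work_va[j] = s;
  }
  double denom = 1.;                               /* 1 + v^T Ainv u */
  for (size_t i = 0; i < n; i++) denom += v[i] * work_au[i];
  if (denom > -1e-12 && denom < 1e-12) return -1;

  for (size_t i = 0; i < n; i++)
    for (size_t j = 0; j < n; j++)
      Ainv[i*n + j] -= work_au[i] * work_va[j] / denom;
  return 0;
}

In QMC this is what makes a one-electron move cheap: the inverse Slater matrix is updated in $O(N^2)$ instead of recomputed in $O(N^3)$.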

Example: Specialized DGEMM kernel I

[Figure: simple blocked DGEMM algorithm (B_block)]

  • Simple micro kernel (GotoDGEMM\footnote{doi:10.1145/1356052.1356053})
  • Code written using asm volatile to force good code generation by compilers
  • Tiling scheme\footnote{doi:10.1109/ICPP.2015.29}
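A hedged sketch of the micro-kernel idea in portable C (a register-blocked 4x4 update over a packed K panel); the real kernel pins the accumulators with asm volatile and uses architecture-specific tile sizes.

/* 4x4 micro-kernel: C[4][4] += A_panel (4 x K, packed by k)
 * times B_panel (K x 4, packed by k). The 16 accumulators stay in
 * registers; the compiler can map the inner updates to FMAs. */
static void micro_kernel_4x4(const double* restrict Ap,
                             const double* restrict Bp,
                             double* restrict C, long ldc, long K) {
  double c[4][4] = {{0.}};
  for (long k = 0; k < K; k++) {
    const double a0 = Ap[4*k+0], a1 = Ap[4*k+1];
    const double a2 = Ap[4*k+2], a3 = Ap[4*k+3];
    for (int j = 0; j < 4; j++) {
      const double b = Bp[4*k+j];
      c[0][j] += a0 * b;
      c[1][j] += a1 * b;
      c[2][j] += a2 * b;
      c[3][j] += a3 * b;
    }
  }
  /* Write the accumulated block back to C. */
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      C[i*ldc + j] += c[i][j];
}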

Example: Specialized DGEMM kernel II

Benchmarks

Why do we like our DGEMM?

  • Open source code : can be modified easily
  • Simple code (280 LOC)
  • Decent performance: within 10% of MKL
  • Can be rewritten in different languages to increase portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
  • Can be coupled with simple pack/unpack routines to handle different data storage (tiled matrices)
  • Lets us keep control of the parallelism
  • A good starting point for autotuning

High-Performance strategies

Tuning

  • Optimization is guided by analysis with MAQAO\footnote{https://maqao.org}.
  • Specialized versions of critical hot-spots
  • MIPP for portable intrinsics / specialized code generation
  • Monitoring of the use of the library to choose the most efficient versions
  • Optimizations guided by monitoring numerical accuracy (Verificarlo\footnote{https://github.com/verificarlo/verificarlo})

Efficiently guiding the developer

[[file:maqao1.png]]

Extensive/automatic testing of different configurations

[[file:maqao2.png]]

Summary

  • QMC codes integrated in an ecosystem of multiple codes for high-accuracy quantum chemistry
  • Development of open-source libraries to be used in the TREX codes and beyond
  • Libraries focus on performance, portability and productivity
  • Strategies to make the collaboration between physicists/chemists and HPC experts optimal

Bonus slides

Verificarlo CI

[[file:img/cmp-runs.png]]

  • Track precision of kernels over commits
  • Shows the number of significant digits $s$, the standard deviation $\sigma$, and the distribution of the variable

[[file:img/inspect-runs.png]]

  • Focus in depth on one particular run
  • Compare multiple implementations of the same kernel