pres_intel/scemama.org
2021-10-05 09:10:22 +02:00

Libraries developed in the TREX CoE

#+LaTeX_CLASS_OPTIONS:[aspectratio=169]

TREX: Targeting REal chemical accuracy at the EXascale

  • Highly accurate methods
  • Massively parallelisable (multiple QMC trajectories)
  • Very CPU intensive: One of the most "compute-hungry" methods
  • Still under development: scientists need to run and develop code
  • Input data is complex (wave function)

How: Instead of re-writing codes, provide libraries (free software)

  1. TREXIO: A library for exchanging information between codes $\Longrightarrow$ Enables HTC
  2. QMCkl: A library of high-performance computational kernels $\Longrightarrow$ Enables HPC

Quantum Monte Carlo (QMC)

\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
                        {\int \dcoord \Phi(\coord) \Phi(\coord)} \nonumber \\
                 &\sim & \sum \frac{ {\cal H}\Psi(\coord )}{\Psi(\coord )}
                   \text{, sampled with } (\Psi \times \Phi)
\nonumber
\end{eqnarray}
\begin{columns}
\begin{column}{.5\textwidth}
\begin{itemize}
\item[$\cal H $: ] Hamiltonian operator
\item[$E$: ] Energy
\end{itemize}
\end{column}
\begin{column}{.4\textwidth}
\begin{itemize}
\item[$\coord $: ] Electron coordinates
\item[$\Phi $: ] Almost exact wave function
\item[$\Psi $: ] Trial wave function
\end{itemize}
\end{column}
\end{columns}

Quantum Monte Carlo (QMC)

  • Very low memory requirements (no integrals)
  • Distribute walkers on different cores or compute nodes
  • No blocking communication: near-ideal scaling
  • Difficulty: parallelize within a QMC trajectory

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/Qmc.png

Both libraries

Three objectives

  1. Productivity
    Used and developed by scientists in different languages
  2. Portability
    Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
  3. Performance
    Must be efficient on all architectures

Free (libre) software

  • Requirement for open science
  • BSD license for adoption by any software (academic, commercial, …)

TREXIO: I/O library

digraph G {
QP [label="Quantum Package"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
QP -> NECI;
NECI -> GammCor [style="dotted"];
NECI -> QMCCHEM [style="dotted"] ;
QP -> QMCCHEM;
QP -> CHAMP;
QP -> GammCor [style="dotted"];
QP -> Turbo [style="dotted"];
NECI -> Turbo [style="dotted"];
NECI -> CHAMP [style="dotted"];
QMCCHEM -> GammCor [style="dotted"];
CHAMP -> GammCor [style="dotted"];
Turbo -> GammCor [style="dotted"];
}

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/interfaces.png

digraph G {
layout=circo;
External [label="External\ncodes"];
QP [label="Quantum\nPackage"];
QMCCHEM [label="QMC=Chem"];
Turbo   [label="TurboRVB"];
TREX [label="TREXIO File", shape="box"];
CHAMP -> TREX;
GammCor -> TREX;
NECI -> TREX;
QMCCHEM -> TREX;
QP -> TREX;
Turbo -> TREX;
External -> TREX;

TREX -> CHAMP;
TREX -> GammCor;
TREX -> NECI;
TREX -> QMCCHEM;
TREX -> QP;
TREX -> Turbo;
TREX -> External;
}

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/interfaces2.png

(BSD license)
https://github.com/trex-coe/trexio

TREXIO: I/O library

  • Definition of an API to read/write wave functions
  • C-compatible API: Easy usage in all common languages

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/api.png

  • HDF5: Efficient I/O
  • Text:

    • Fallback when HDF5 can't be installed
    • Debugging
    • Version control systems

Content of the files

  • File is self-contained: no external knowledge needed to compute $\Psi(r_1,\dots,r_n)$ (normalization factors, basis set parameters, etc.)
  • Strong conventions (atomic units, ordering of atomic orbitals, etc.)
  • The data stored in the files is organized in different groups:

    | Metadata               | Electron         | Slater Determinants    |
    | Nucleus                | Basis            | CI coefficients        |
    | AO                     | MO               | Two-electron integrals |
    | One-electron integrals | Density matrices | ECP                    |
  • Each group contains multiple attributes

Source code

  • For each attribute:

    trexio_exit_code  trexio_[has/read/write]_<group>_<attribute>
                                        (trexio_t* file, <type> attribute)
  • The library can be auto-generated by a script because the function names follow a fixed pattern
  • Productivity: Literate programming with Org-mode
    Table $\rightarrow$ JSON $\rightarrow$ C
    \phantom{Table} $\rightarrow$ Documentation
  • Fortran and Python/NumPy interfaces are also generated
  • Performance: HDF5 back end
  • Portability: Only optional dependency is HDF5
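
The naming rule above is regular enough that a generator only needs the list of groups and attributes. The C sketch below shows the idea on the name alone. It is an assumption-laden illustration: the real TREXIO generator is a script driven by the Org-mode tables, the helper `make_trexio_name` is hypothetical, and the group/attribute pair used in the usage note is only an example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "trexio_<action>_<group>_<attribute>" into buf.
   Returns 0 on success, -1 if an argument is invalid or buf is too small. */
int make_trexio_name(char *buf, size_t size, const char *action,
                     const char *group, const char *attribute) {
    if (buf == NULL || action == NULL || group == NULL || attribute == NULL)
        return -1;
    /* Only the three actions named in the pattern are accepted. */
    if (strcmp(action, "has") != 0 && strcmp(action, "read") != 0 &&
        strcmp(action, "write") != 0)
        return -1;
    int n = snprintf(buf, size, "trexio_%s_%s_%s", action, group, attribute);
    /* snprintf reports the length it needed: detect truncation. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, the action `read`, the group `nucleus` and the attribute `charge` yield `trexio_read_nucleus_charge`. Looping over all groups, attributes and the three actions would enumerate the whole API surface.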

Source code

Productivity:

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/trexio-doc1.png

Documentation

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/trexio-doc2.png

QMCkl: QMC kernel library

QMC kernel library

Computational kernels

  • QMCkl will contain the main kernels of QMC methods (domain-specific library, end-user driven)
  • Written together by QMC experts and HPC experts
  • Multiple high performance implementations of the kernels, tuned for different

    • architectures (portability is critical for users)
    • problem sizes (from small to large systems)
    • requested accuracy (reduced precision)

Objectives

  • The code must stay easy for physicists/chemists to understand. Performance-related aspects are delegated to the library
  • Scientists should be able to use their preferred language
  • Scientists should not lose control of their codes
  • Codes should not die when the architecture changes
  • Scientific code development should not kill the performance
  • Reuse of the optimization effort among the community

Functionality and performance

  • Keeping high productivity, portability and performance is very hard in a single piece of software.

    We propose (at least) two implementations:

    1. Documentation library
      Easy to read, understand, modify for scientists, not necessarily efficient.
    2. High performance libraries
      Efficient on a given architecture, but not necessarily readable by physicists/chemists.
      Target: performance within 10\% of the optimum, in exchange for portability and simplicity.
  • Both Documentation and High performance have the same API (similar to BLAS on netlib vs MKL).
  • Scientific progress is made in the documentation library, and implemented in the HPC versions when the API is stabilized.
  • Performance: enable a data-driven task-based parallelism
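
The "same API, two implementations" idea can be sketched on a toy kernel. The code below is not taken from QMCkl: `dot_doc`, `dot_hpc`, `dot_kernel` and `kernels_agree` are hypothetical names, and a dot product stands in for a real QMC kernel. It only shows a readable variant and a tuned variant sharing one prototype, with the readable one used as the correctness reference.

```c
#include <assert.h>
#include <stdint.h>

/* Shared prototype: both variants are interchangeable behind this type. */
typedef double (*dot_kernel)(int64_t n, const double *x, const double *y);

/* "Documentation" variant: direct transcription of sum_i x_i * y_i. */
double dot_doc(int64_t n, const double *x, const double *y) {
    assert(n >= 0);
    double s = 0.0;
    for (int64_t i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}

/* "High performance" variant: 4-way unrolled with independent accumulators.
   This eases vectorization but changes the summation order. */
double dot_hpc(int64_t n, const double *x, const double *y) {
    assert(n >= 0);
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int64_t i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; ++i)               /* remainder loop */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}

/* Returns 1 if the two variants agree within tol on the given input. */
int kernels_agree(dot_kernel ref, dot_kernel opt, int64_t n,
                  const double *x, const double *y, double tol) {
    double diff = ref(n, x, y) - opt(n, x, y);
    if (diff < 0.0) diff = -diff;
    return diff <= tol;
}
```

Because the tuned variant reorders floating-point additions, the comparison takes a tolerance rather than demanding bitwise equality.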

Library design

  • Creation of a Context that keeps a consistent state of the library
  • Memory allocation is abstract:

    void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);

    allows allocation on CPU/GPU by the HPC variants

  • Low-level functions: access to simple low-level functions leaving the context untouched (no allocation, no in-place modification)
  • High-level functions: let the library call multiple kernels in an optimal way, possibly updating the context
  • Use of the IRP (Implicit Reference to Parameters) programming paradigm to keep track of dependencies between kernels: re-compute only what is necessary
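
The "re-compute only what is necessary" rule can be sketched with a context that stores a date for each cached quantity. This is a toy model under stated assumptions, not QMCkl's internal design: `toy_context` and the `toy_*` functions are hypothetical, and two particles on a line replace the real electron coordinates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    double   x1, x2;        /* inputs: positions of two particles on a line */
    uint64_t coord_date;    /* bumped every time the inputs change */
    double   dist;          /* cached |x1 - x2| */
    uint64_t dist_date;     /* coord_date for which dist was built */
    double   pot;           /* cached 1 / dist */
    uint64_t pot_date;      /* dist_date for which pot was built */
    int      dist_builds;   /* counters, only to observe recomputations */
    int      pot_builds;
} toy_context;

void toy_init(toy_context *ctx) {
    assert(ctx != NULL);
    ctx->x1 = 0.0; ctx->x2 = 0.0;
    ctx->coord_date = 1;    /* inputs start newer than every cache */
    ctx->dist = 0.0; ctx->dist_date = 0;
    ctx->pot = 0.0;  ctx->pot_date = 0;
    ctx->dist_builds = 0; ctx->pot_builds = 0;
}

void toy_set_coord(toy_context *ctx, double x1, double x2) {
    assert(x1 != x2);       /* coincident particles give an infinite potential */
    ctx->x1 = x1; ctx->x2 = x2;
    ctx->coord_date++;      /* invalidates everything that depends on coords */
}

double toy_get_distance(toy_context *ctx) {
    if (ctx->dist_date != ctx->coord_date) {   /* stale: rebuild */
        double d = ctx->x1 - ctx->x2;
        ctx->dist = (d < 0.0) ? -d : d;
        ctx->dist_date = ctx->coord_date;
        ctx->dist_builds++;
    }
    return ctx->dist;
}

double toy_get_potential(toy_context *ctx) {
    double d = toy_get_distance(ctx);          /* make the dependency valid first */
    if (ctx->pot_date != ctx->dist_date) {     /* stale: rebuild */
        ctx->pot = 1.0 / d;
        ctx->pot_date = ctx->dist_date;
        ctx->pot_builds++;
    }
    return ctx->pot;
}
```

Calling `toy_get_potential` twice without changing the coordinates builds each quantity once. A new call to `toy_set_coord` triggers exactly one rebuild of each on the next request.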

Use case: low-level

#include <assert.h>
#include <stdint.h>
#include <qmckl.h>

// ...
qmckl_exit_code  rc;
qmckl_context    context;
int64_t  m, n, LDA, LDB, LDC;
// ...
double   A[LDA*3];
double   B[LDB*3];
double   C[LDC*n];
// ...

context = qmckl_context_create();

// Compute inter-particle distances between xyz coordinates in A[m][3] and B[3][n]
// and store the result in C[m][n]
rc = qmckl_distance(context, 'N', 'T', m, n, A, LDA, B, LDB, C, LDC);
assert (rc == QMCKL_SUCCESS);
// ...

Use case: high-level

#include <assert.h>
#include <qmckl.h>
// ...
qmckl_exit_code  rc;
double           e_loc;
qmckl_context    context;

context = qmckl_context_create();

// Store WF parameters in the context
rc = qmckl_read_trexio(context, trexio_filename);
assert (rc == QMCKL_SUCCESS);

// Set the electron coordinates in the context
rc = qmckl_set_electron_coord (context, 'N', elec_coord);
assert (rc == QMCKL_SUCCESS);

// Return the local energy at the current electron positions
rc = qmckl_get_local_energy(context, &e_loc);
assert (rc == QMCKL_SUCCESS);
// ...

Dependencies between kernels

digraph G {
rankdir = BT;
E_pot -> E_loc;
E_kin -> E_loc;
V_NN -> E_pot;
V_eN -> E_pot;
V_ee -> E_pot;
Distance_ee -> V_ee;
Distance_eN -> V_eN;
Distance_NN -> V_NN;
ECP -> E_pot;
ECP_Local -> ECP;
ECP_Non_Local -> ECP;
Determinants -> E_kin;
Jastrow -> E_kin;
J_eN -> Jastrow;
J_ee -> Jastrow;
J_eeN -> Jastrow;
Distance_eN -> J_eN;
Distance_ee -> J_ee;
Distance_eN -> J_eeN;
Distance_ee -> J_eeN;
Det_up -> Determinants;
Det_down -> Determinants;
MOs -> Det_up;
MOs -> Det_down;
AOs -> MOs;
AO_radial -> AOs;
AO_angular -> AOs;
elec_coord -> Distance_ee;
elec_coord -> Distance_eN;
Distance_eN -> AO_radial;
Distance_eN -> AO_angular;
Determinants -> ECP_Non_Local;
}
  • Only the needed sub-graph is computed
  • HPC: Each kernel is one/many parallel Task(s)
  • HPC: Use OpenMP tasks or StarPU\footnote{C. Augonnet et al., doi:10.1002/cpe.1631} for hybrid architectures (StarPU handles asynchronous CPU-GPU transfers very well).
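
Selecting "only the needed sub-graph" amounts to walking the dependency edges backwards from the requested quantity. The sketch below does this on a small subset of the graph above. The node subset, the `edge` type and `mark_needed` are hypothetical choices made for the illustration, not the way QMCkl or a task runtime represents dependencies.

```c
#include <assert.h>

/* A few nodes of the dependency graph, as integer identifiers. */
enum { ELEC_COORD, DIST_EE, DIST_EN, V_EE, V_EN, E_POT, J_EE, N_NODES };

typedef struct { int from, to; } edge;   /* "from" is an input of "to" */

static const edge toy_edges[] = {
    {ELEC_COORD, DIST_EE}, {ELEC_COORD, DIST_EN},
    {DIST_EE, V_EE},       {DIST_EN, V_EN},
    {V_EE, E_POT},         {V_EN, E_POT},
    {DIST_EE, J_EE},
};
enum { N_TOY_EDGES = sizeof toy_edges / sizeof toy_edges[0] };

/* Depth-first walk against the arrows: sets needed[k] = 1 for the target
   and for every node it depends on, directly or indirectly.
   needed must hold N_NODES entries initialized to 0. */
void mark_needed(const edge *edges, int n_edges, int target, int *needed) {
    assert(target >= 0 && target < N_NODES);
    if (needed[target]) return;          /* already visited */
    needed[target] = 1;
    for (int e = 0; e < n_edges; ++e)
        if (edges[e].to == target)
            mark_needed(edges, n_edges, edges[e].from, needed);
}
```

Requesting `V_EE` marks only `V_EE`, `DIST_EE` and `ELEC_COORD`. The electron-nucleus branch is never touched.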

Development strategy

  1. Kernel extraction: QMC specialists agree on the mathematical expression of the problem
  2. Together with HPC experts, a mini-application is written to find the optimal data layout on real-size examples
  3. The kernel is written in the documentation library
  4. The documentation library is linked in a QMC code to check correctness
  5. HPC experts provide an HPC version of the kernel
  6. The HPC library is linked in the QMC codes of the CoE

Documentation library

Literate programming with Org-mode:

  • Comments are more important than code
  • Can add graphics, \LaTeX{} formulas, tables, etc.
  • Documentation always synchronized with the code
  • Some routines can be generated by embedded scripts
  • Kernels are implemented in Fortran for readability
  • The API is C-compatible: QMCkl appears like a C library $\Longrightarrow$ can be used in all other languages
  • Example: Prototyping in Julia

High-Performance strategies

Linear algebra hot spots

| GEMM | Rank-1 update    | Matrix inversion          |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |

Matrices are relatively small ($\le 1000\times 1000$)

  • Matrices are stored in tiled format $\Longrightarrow$ task-based linear algebra, with interleaved computation of multiple kernels
  • Increase parallelism by aggregating multiple independent walkers in matrices
  • Needs fast linear algebra kernels for small matrices
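
Among the hot spots listed above, the Sherman-Morrison formula updates an inverse after a rank-1 change: $(A + u v^T)^{-1} = A^{-1} - \frac{A^{-1} u \, v^T A^{-1}}{1 + v^T A^{-1} u}$. The C sketch below is a plain, unoptimized transcription for small dense matrices. The function name and the row-major layout are assumptions, and it is not the QMCkl kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* In-place update of Ainv = A^{-1} to (A + u v^T)^{-1}.
   Ainv is n x n, row-major; u and v have n entries.
   Returns 0 on success, -1 on invalid size or allocation failure,
   -2 if 1 + v^T A^{-1} u == 0 (the updated matrix is singular). */
int sherman_morrison(int64_t n, double *Ainv, const double *u, const double *v) {
    if (n <= 0) return -1;
    double *w = malloc(2 * (size_t)n * sizeof *w);
    if (w == NULL) return -1;
    double *z = w + n;
    double denom = 1.0;

    for (int64_t i = 0; i < n; ++i) {        /* w = Ainv * u */
        w[i] = 0.0;
        for (int64_t j = 0; j < n; ++j)
            w[i] += Ainv[i * n + j] * u[j];
        denom += v[i] * w[i];                /* 1 + v^T Ainv u */
    }
    for (int64_t j = 0; j < n; ++j) {        /* z^T = v^T * Ainv */
        z[j] = 0.0;
        for (int64_t i = 0; i < n; ++i)
            z[j] += v[i] * Ainv[i * n + j];
    }
    if (denom == 0.0) { free(w); return -2; }

    for (int64_t i = 0; i < n; ++i)          /* Ainv -= w z^T / denom */
        for (int64_t j = 0; j < n; ++j)
            Ainv[i * n + j] -= w[i] * z[j] / denom;

    free(w);
    return 0;
}
```

The update costs $O(n^2)$ operations instead of the $O(n^3)$ of a fresh inversion. A production kernel would also guard against a denominator that is merely close to zero.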

High-Performance strategies

Tuning

  • Optimization is guided by analysis with MAQAO\footnote{https://maqao.org}.
  • Specialized versions of critical hot-spots
  • MIPP for portable intrinsics / specialized code generation
  • Monitoring of the use of the library to choose the most efficient versions
  • Optimizations guided by monitoring numerical accuracy (Verificarlo\footnote{https://github.com/verificarlo/verificarlo})

Example: Specialized DGEMM kernel

VIJAY

Efficiently guiding the developer

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/maqao1.png

Extensive/automatic testing of different configurations

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/maqao2.png

First application: 3-body Jastrow factor

\[ \Jeen (\br,\bR) = \sum_{\alpha=1}^{\Nat} \sum_{i=1}^{\Nel} \sum_{j=1}^{i-1} \sum_{p=2}^{\Nord} \sum_{k=0}^{p-1} \sum_{l=0}^{\lmax} c_{lkp\alpha} \left( {r}_{ij} \right)^k \left[ \left( {R}_{i\alpha} \right)^l + \left( {R}_{j\alpha} \right)^l \right] \left( {R}_{i\,\alpha} \, {R}_{j\alpha} \right)^{(p-k-l)/2} \]

/scemama/pres_intel/src/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/speedup.pdf

  • Gradient and Laplacian are also required
  • Up to $20\times$ faster than in the original code
  • $\sim 80\%$ of the AVX-512 peak is reached
  • Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
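
One way to see why such a sum maps onto a matrix product is sketched below. It illustrates the algebra only, and the actual kernel may be organized differently. The notation $m = (p-k-l)/2$ and the matrix $P^{(k)}$ are introduced here for the sketch. For fixed $k$, $l$, $p$ and $\alpha$, the summand is symmetric in $i$ and $j$, so the restricted sum over $j < i$ equals an unrestricted sum over $i \ne j$:

```latex
\[
\sum_{i=1}^{\Nel} \sum_{j=1}^{i-1} r_{ij}^{k}
  \left[ R_{i\alpha}^{l} + R_{j\alpha}^{l} \right]
  \left( R_{i\alpha} R_{j\alpha} \right)^{m}
= \sum_{i=1}^{\Nel} R_{i\alpha}^{l+m}
  \sum_{j=1}^{\Nel} P^{(k)}_{ij} \, R_{j\alpha}^{m},
\qquad
P^{(k)}_{ij} =
  \begin{cases} r_{ij}^{k} & i \ne j \\ 0 & i = j \end{cases}
\]
```

The inner sum over $j$ is the $(i,\alpha)$ element of the product of the $\Nel \times \Nel$ matrix $P^{(k)}$ with the $\Nel \times \Nat$ matrix of elements $R_{j\alpha}^{m}$, which is a GEMM.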

Verificarlo CI

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/img/cmp-runs.png

  • Track precision of kernels over commits
  • Shows significant digits $s$, standard deviation $\sigma$, variable distribution

/scemama/pres_intel/media/commit/9046dc87025e091bc3fafe28503e4548e4b1f335/img/inspect-runs.png

  • Focus in depth on one particular run
  • Compare multiple implementations of the same kernel