Libraries developed in the TREX CoE
#+LaTeX_CLASS_OPTIONS:[aspectratio=169]
QMC in TREX
- Highly accurate methods
- Massively parallelisable (multiple QMC trajectories)
- Very CPU intensive: One of the most "compute-hungry" methods
- Still under development: scientists need to run and develop code
- Input data is complex (wave function)
How: Instead of re-writing codes, provide libraries (free software)
- TREXIO: A library for exchanging information between codes $\Longrightarrow$ Enables HTC
- QMCkl: A library of high-performance QMC kernels $\Longrightarrow$ Enables HPC
Quantum Monte Carlo (QMC)
\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
{\int \dcoord \Phi(\coord) \Phi(\coord)} \nonumber \\
&\sim & \sum \frac{ {\cal H}\Psi(\coord )}{\Psi(\coord )}
\text{, sampled with } (\Psi \times \Phi)
\nonumber
\end{eqnarray}
\begin{columns}
\begin{column}{.5\textwidth}
\begin{itemize}
\item[$\cal H $: ] Hamiltonian operator
\item[$E$: ] Energy
\end{itemize}
\end{column}
\begin{column}{.4\textwidth}
\begin{itemize}
\item[$\coord $: ] Electron coordinates
\item[$\Phi $: ] Almost exact wave function
\item[$\Psi $: ] Trial wave function
\end{itemize}
\end{column}
\end{columns}
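The sampled quantity is the local energy; written out with the $1/M$ normalization left implicit above:
\[ E_L(\coord) = \frac{{\cal H}\,\Psi(\coord)}{\Psi(\coord)}, \qquad E \approx \frac{1}{M} \sum_{i=1}^{M} E_L(\coord_i), \qquad \coord_i \sim \Psi \times \Phi \]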
Quantum Monte Carlo (QMC)
- Very low memory requirements (no integrals)
- Distribute walkers on different cores or compute nodes
- No blocking communication: near-ideal scaling
- Parallelizing within a single QMC trajectory is difficult: the available parallelism depends on the number of electrons
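A toy sketch of this walker-level parallelism (illustration only, not TREX code; local_energy and the dynamics are placeholders): each walker evolves independently, so the only communication is the final reduction of the accumulated local energies.

#include <stdlib.h>

static double local_energy(double r) {      /* placeholder for H.Psi/Psi */
    return -0.5 / (r*r + 1.0);
}

double qmc_average(long n_walkers, long n_steps) {
    double acc = 0.0;
#pragma omp parallel for reduction(+:acc)   /* near-ideal scaling */
    for (long w = 0; w < n_walkers; w++) {
        unsigned int seed = (unsigned int) (w + 1);  /* per-walker RNG */
        double r = 1.0, e = 0.0;
        for (long s = 0; s < n_steps; s++) {
            r += 0.01 * ((double) rand_r(&seed) / RAND_MAX - 0.5);
            e += local_energy(r);
        }
        acc += e / (double) n_steps;
    }
    return acc / (double) n_walkers;
}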
Both libraries
Three objectives
- Productivity: usable and useful by scientists in different programming languages
- Portability: target all HPC systems (CPU, GPU, ARM, x86, etc.)
- Performance: must be efficient on all architectures; possible tradeoffs between portability and performance
Free (libre) software
- Requirement for open science
- BSD license for adoption by any software (academic, commercial, …)
TREXIO: I/O library
digraph G {
QP [label="Quantum Package"];
QMCCHEM [label="QMC=Chem"];
Turbo [label="TurboRVB"];
QP -> NECI;
NECI -> GammCor [style="dotted"];
NECI -> QMCCHEM [style="dotted"] ;
QP -> QMCCHEM;
QP -> CHAMP;
QP -> GammCor [style="dotted"];
QP -> Turbo [style="dotted"];
NECI -> Turbo [style="dotted"];
NECI -> CHAMP [style="dotted"];
QMCCHEM -> GammCor [style="dotted"];
CHAMP -> GammCor [style="dotted"];
Turbo -> GammCor [style="dotted"];
}
digraph G {
layout=circo;
External [label="External\ncodes"];
QP [label="Quantum\nPackage"];
QMCCHEM [label="QMC=Chem"];
Turbo [label="TurboRVB"];
TREX [label="TREXIO File", shape="box"];
CHAMP -> TREX;
GammCor -> TREX;
NECI -> TREX;
QMCCHEM -> TREX;
QP -> TREX;
Turbo -> TREX;
External -> TREX;
TREX -> CHAMP;
TREX -> GammCor;
TREX -> NECI;
TREX -> QMCCHEM;
TREX -> QP;
TREX -> Turbo;
TREX -> External;
}
(BSD license)
https://github.com/trex-coe/trexio
TREXIO: I/O library
- Definition of an API to read/write wave functions
- C-compatible API: Easy usage in all common languages
- HDF5: Efficient I/O
- Text:
  - Fallback when HDF5 can't be installed
  - Debugging
  - Version control systems
Content of the files
- File is self-contained: no external knowledge needed to compute $\Psi(r_1,\dots,r_n)$ (normalization factors, basis set parameters, etc.)
- Strong conventions (atomic units, ordering of atomic orbitals, etc.)
- The data stored in the file is organized in groups:
  Metadata, Electron, Slater determinants, Nucleus, Basis, CI coefficients, AO, MO,
  Two-electron integrals, One-electron integrals, Density matrices, ECP
- Each group contains multiple attributes: information related to the group
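For illustration, a minimal sketch of reading such a file with the TREXIO C API (function names follow the TREXIO documentation; check signatures against the installed version):

#include <stdio.h>
#include <trexio.h>

int main(void) {
    trexio_exit_code rc;
    /* Open an existing file with the HDF5 back end */
    trexio_t *f = trexio_open("h2o.hdf5", 'r', TREXIO_HDF5, &rc);
    if (rc != TREXIO_SUCCESS) return 1;

    /* Read an attribute of the Nucleus group */
    int32_t nucleus_num;
    rc = trexio_read_nucleus_num(f, &nucleus_num);
    if (rc == TREXIO_SUCCESS)
        printf("Number of nuclei: %d\n", (int) nucleus_num);

    trexio_close(f);
    return 0;
}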
QMCkl: QMC kernel library
Computational kernels
- QMCkl will contain the main kernels of QMC methods: a domain-specific library, end-user driven
- Written together by QMC experts and HPC experts
- Multiple high-performance implementations of the kernels, tuned for different:
  - architectures: portability is critical for users
  - problem sizes: from small to large systems
  - requested accuracies: reduced precision
Objectives
- The code must stay easy to understand by the physicists/chemists. Performance-related aspects should be delegated to the library
- Scientists should be able to use their preferred language
- Scientists should not lose control of their codes
- Codes should not die when the architecture changes
- Scientific code development should not kill the performance
- Reuse of the optimization effort among the community
Functionality and performance
- Keeping high productivity, portability and performance in a single piece of software is very hard.
  We propose (at least) two implementations:
  - Documentation library: easy to read, understand and modify for scientists; not necessarily efficient.
  - High-performance libraries: efficient on a given architecture, but not necessarily readable by physicists/chemists. Performance within 10% of optimal, to maximize portability and simplicity.
  - Ultra-high-performance libraries: generated with auto-tuning tools for well-identified datasets.
- Both the Documentation and High-performance libraries share the same API (similar to Netlib BLAS vs MKL).
- Scientific progress is made in the documentation library, and implemented in the HPC versions when the API is stabilized.
- Performance: enable a data-driven task-based parallelism
Library design
- Creation of a Context that keeps a consistent state of the library (pointers to computed data, configuration parameters, etc.)
- Memory allocation is abstract:
  void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
  allows allocation on CPU/GPU by the HPC variants
- Low level functions: access to simple low-level functions leaving the context untouched (no allocation, no modification in-place)
- High-level functions: let the library call multiple kernels in an optimal way, possibly updating the context
- Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies between kernels: re-compute only what is necessary and store computed data in the context
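A hypothetical sketch of the IRP mechanism (invented names, not the actual QMCkl internals): each quantity carries a date, and it is recomputed only when one of its dependencies is more recent.

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t date;    /* last time the value was (re)computed */
    bool     valid;
    double   value;
} node_t;

static uint64_t now = 0;   /* monotonic counter, bumped on every update */

/* Example dependency: Distance_ee depends on elec_coord */
double get_distance_ee(node_t *distance_ee, const node_t *elec_coord) {
    if (!distance_ee->valid || distance_ee->date < elec_coord->date) {
        /* placeholder computation from the electron coordinates */
        distance_ee->value = 2.0 * elec_coord->value;
        distance_ee->date  = ++now;
        distance_ee->valid = true;
    }
    return distance_ee->value;  /* cached value when nothing changed */
}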
Dependencies between kernels
digraph G {
rankdir = BT;
E_pot -> E_loc;
E_kin -> E_loc;
V_NN -> E_pot;
V_eN -> E_pot;
V_ee -> E_pot;
Distance_ee -> V_ee;
Distance_eN -> V_eN;
Distance_NN -> V_NN;
ECP -> E_pot;
ECP_Local -> ECP;
ECP_Non_Local -> ECP;
Determinants -> E_kin;
Jastrow -> E_kin;
J_eN -> Jastrow;
J_ee -> Jastrow;
J_eeN -> Jastrow;
Distance_eN -> J_eN;
Distance_ee -> J_ee;
Distance_eN -> J_eeN;
Distance_ee -> J_eeN;
Det_up -> Determinants;
Det_down -> Determinants;
MOs -> Det_up;
MOs -> Det_down;
AOs -> MOs;
AO_radial -> AOs;
AO_angular -> AOs;
elec_coord -> Distance_ee;
elec_coord -> Distance_eN;
Distance_eN -> AO_radial;
Distance_eN -> AO_angular;
Determinants -> ECP_Non_Local;
}
- Only the needed sub-graph is computed
- HPC: Each kernel is one/many parallel Task(s)
- HPC: Use OpenMP tasks or StarPU\footnote{C. Augonnet et al, doi:10.1002/cpe.1631} for hybrid architectures (StarPU handles asynchronous CPU-GPU transfers very well).
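As an illustration (placeholder function names, not QMCkl code), an edge of the graph such as Distance_ee → V_ee maps directly onto OpenMP task dependencies:

void compute_distance_ee(const double *coords, double *r_ee);  /* placeholders */
void compute_v_ee(const double *r_ee, double *v_ee);

void evaluate_v_ee(const double *coords, double *r_ee, double *v_ee) {
#pragma omp parallel
#pragma omp single
    {
#pragma omp task depend(out: r_ee[0])                      /* Distance_ee */
        compute_distance_ee(coords, r_ee);

#pragma omp task depend(in: r_ee[0]) depend(out: v_ee[0])  /* V_ee */
        compute_v_ee(r_ee, v_ee);

#pragma omp taskwait
    }
}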
Use case: low-level
#include <qmckl.h>
// ...
qmckl_exit_code rc;
qmckl_context context;
int64_t m, n, LDA, LDB, LDC;
// ...
double A[LDA*3];
double B[LDB*3];
double C[LDC*n];
// ...
context = qmckl_context_create();
// Compute inter-particle distances between xyz coordinates in A[m][3] and B[3][n]
// and store the result in C[m][n]
rc = qmckl_distance(context, 'N', 'T', m, n, A, LDA, B, LDB, C, LDC);
assert (rc == QMCKL_SUCCESS);
// ...
Use case: high-level
#include <qmckl.h>
// ...
qmckl_exit_code rc;
double e_loc;
qmckl_context context;
context = qmckl_context_create();
// Store WF parameters in the context
rc = qmckl_read_trexio(context, trexio_filename);
assert (rc == QMCKL_SUCCESS);
// Set the electron coordinates in the context
rc = qmckl_set_electron_coord (context, 'N', elec_coord);
assert(rc == QMCKL_SUCCESS);
// Return the local energy at the current electron positions
rc = qmckl_get_local_energy(context, &e_loc);
// ...
Development strategy
- Kernel extraction: QMC specialists agree on the mathematical expression of the problem
- A mini-application is written to find the optimal data layout with HPC experts from real-size examples
- The kernel is written in the documentation library
- The documentation library is linked in a QMC code to check correctness and numerical accuracy
- HPC experts provide an HPC version of the kernel
- The HPC library is linked in the QMC codes of the CoE
First application : 3-body Jastrow factor
\[ \Jeen (\br,\bR) = \sum_{\alpha=1}^{\Nat} \sum_{i=1}^{\Nel} \sum_{j=1}^{i-1} \sum_{p=2}^{\Nord} \sum_{k=0}^{p-1} \sum_{l=0}^{\lmax} c_{lkp\alpha} \left( {r}_{ij} \right)^k \left[ \left( {R}_{i\alpha} \right)^l + \left( {R}_{j\alpha} \right)^l \right] \left( {R}_{i\alpha} \, {R}_{j\alpha} \right)^{(p-k-l)/2} \]
[Figure: speedup plot]
- Gradient and Laplacian are also required
- Up to $20\times$ faster than in the original code
- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on Intel Skylake
- Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
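Schematically (a sketch of the rewriting, not necessarily the exact QMCkl formulation), with $m = (p-k-l)/2$:
\[ \left( R_{i\alpha}^{\,l} + R_{j\alpha}^{\,l} \right) \left( R_{i\alpha} R_{j\alpha} \right)^{m} = R_{i\alpha}^{\,l+m}\, R_{j\alpha}^{\,m} + R_{i\alpha}^{\,m}\, R_{j\alpha}^{\,l+m}, \]
so the inner double sum factorizes into terms of the form $\sum_{j} \left( r_{ij} \right)^k R_{j\alpha}^{\,m}$, i.e. matrix products between the tensor of powers of $r_{ij}$ and that of powers of $R_{j\alpha}$.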
High-Performance strategies
Linear algebra hot spots
| GEMM | Rank-1 update    | Matrix inversion          |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
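For reference, the rank-1 updates rely on the standard Sherman-Morrison identity, which refreshes an inverse in $O(N^2)$ when a single row (one electron) changes:
\[ \left( A + u v^{\mathrm{T}} \right)^{-1} = A^{-1} - \frac{A^{-1} u\, v^{\mathrm{T}} A^{-1}}{1 + v^{\mathrm{T}} A^{-1} u} \]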
Matrices are relatively small ($\le 1000\times 1000$)
- Matrices are stored in tiled format fitting a block formulation of the algorithms $\Longrightarrow$ task-based linear algebra, interleaved computation of multiple kernels
- Tile sizes will be adjusted by auto-tuning
- Increase parallelism by aggregating multiple independent walkers in matrices
- Needs fast linear algebra kernels for small matrices (tile size)
- For tiny matrices ($<5\times5$) specialized versions are implemented
Example: Specialized DGEMM kernel I
Simple algorithm
- Simple micro kernel (Goto DGEMM\footnote{doi:10.1145/1356052.1356053})
- Code written using asm volatile to force good code generation by compilers
- Tiling scheme\footnote{doi:10.1109/ICPP.2015.29}
Tiling scheme
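A minimal sketch of the blocking idea (illustration only: scalar C without the asm micro kernel; TILE is a stand-in for the tuned tile size):

#include <stddef.h>

#define TILE 8   /* stand-in for the auto-tuned tile size */

/* C += A * B for square matrices of order n, row-major storage.
   The three outer loops walk over tiles so that each TILE x TILE
   block stays in cache while the micro kernel works on it. */
void dgemm_tiled(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                /* micro kernel: j innermost so the compiler vectorizes */
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        const double a = A[i*n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}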
Example: Specialized DGEMM kernel II
Benchmarks
- Comparison of MKL vs specialized DGEMM
[Figure: performance of the specialized DGEMM as a percentage of MKL]
- Strong impact of the number of consecutive executions on MKL performance
- The comparison favors MKL: many consecutive executions amortize its setup and JIT costs (Skylake CPU)
Why do we like our DGEMM?
- Open source code : can be modified easily
- Simple code (280 LOC)
- Decent performance: within 10% of MKL
- Can be rewritten in different languages to increase portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
- Can be coupled with simple pack/unpack routines to handle different data storage (tiled matrices)
- Allows us to keep control of parallelism
- A good starting point for autotuning
High-Performance strategies
Tuning
- Optimization is guided by analysis with MAQAO\footnote{https://maqao.org}.
- Specialized versions of critical hot-spots
- MIPP for portable intrinsics / specialized code generation
- Monitoring of the use of the library to choose the most efficient versions
- Optimizations guided by monitoring numerical accuracy (Verificarlo\footnote{https://github.com/verificarlo/verificarlo})
Efficiently guiding the developer
Extensive/automatic testing of different configurations
Summary
- QMC codes integrated in an ecosystem of multiple codes for high-accuracy quantum chemistry
- Development of open-source libraries to be used in the TREX codes and beyond
- Libraries focus on performance, portability and productivity
- Strategies to make the collaboration between physicists/chemists and HPC experts optimal
Bonus slides
Verificarlo CI
- Track precision of kernels over commits
- Shows significant digits $s$, standard deviation $\sigma$, variable distribution
- Focus in depth on one particular run
- Compare multiple implementations of the same kernel