Zoom modifications, Friday morning

This commit is contained in:
Anthony Scemama 2021-10-08 12:28:21 +02:00
commit 59137eb55b
5 changed files with 301 additions and 168 deletions

Binary file not shown.


@@ -88,7 +88,8 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
- Very low memory requirements (no integrals)
- Distribute walkers on different cores or compute nodes
- No blocking communication: near-ideal scaling
- Difficulty to parallelize within a QMC trajectory: depends on the
  number of electrons
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.6\textwidth}
#+ATTR_LATEX: :width \textwidth
@@ -99,11 +100,12 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
** Both libraries
*** Three objectives
1. *Productivity* \\
   Usable and useful by scientists in different programming languages
2. *Portability* \\
   Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
3. *Performance* \\
   Must be efficient on all architectures: possible tradeoffs
   between portability and performance
*** Free (libre) software
- Requirement for open science
@@ -208,7 +210,8 @@ digraph G {
| Nucleus | Basis | CI coefficients |
| AO | MO | Two-electron integrals |
| One-electron integrals | Density matrices | ECP |
- Each group contains multiple *attributes*: information related to the
  group
** Source code :noexport:
@@ -241,23 +244,23 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
* QMCkl: QMC kernel library
** QMC kernel library
*** Computational kernels
- QMCkl will contain the main kernels of QMC methods: Domain
  specific library, end-user driven
- Written together by QMC experts and HPC experts
- Multiple high performance implementations of the kernels, tuned
  for different
  - architectures: portability is critical for users
  - problem sizes: from small to large systems
  - requested accuracy: reduced precision
** Objectives
- The code must stay easy to understand by the physicists/chemists.
  Performance-related aspects should be delegated to the library
- Scientists should be able to use their preferred language
- Scientists should not lose control of their codes
- Codes should not die when the architecture changes
- Scientific code development should not kill the performance
- Reuse of the optimization effort among the community
@@ -273,8 +276,10 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
   Easy to read, understand, modify for scientists, not necessarily efficient.
2. *High performance libraries* \\
   Efficient on a given architecture, but not necessarily
   readable by physicists/chemists. \\
   Performance within 10% to maximize portability and simplicity.
3. *Ultra-High performance libraries* \\
   Generated with auto-tuning tools for well identified datasets.
- Both /Documentation/ and /High performance/ have the same API
  (similar to BLAS on netlib /vs/ MKL).
@@ -283,10 +288,22 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
  implemented in the HPC versions when the API is stabilized.
- Performance: enable a data-driven task-based parallelism
** Documentation library :noexport:
Literate programming with Org-mode:
- Comments are more important than code
- Can add graphics, \LaTeX formulas, tables, etc
- Documentation always synchronized with the code
- Some routines can be generated by embedded scripts
- Kernels are implemented in Fortran for readability
- The API is C-compatible: QMCkl appears like a C library
$\Longrightarrow$ can be used in all other languages
- Example: Prototyping in Julia
** Library design
- Creation of a /Context/ that keeps a consistent state of the
  library (pointers to computed data, configuration parameters, etc.)
- Memory allocation is abstract:
#+begin_src c
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -297,7 +314,8 @@ void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
- High-level functions: let the library call multiple kernels in an
  optimal way, possibly updating the context
- Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
  between kernels: re-compute only what is necessary and store
  computed data in the context
** Dependencies between kernels
@@ -407,58 +425,11 @@ rc = qmckl_get_local_energy(context, &e_loc);
2. A mini-application is written to find the optimal data layout
   with HPC experts from real-size examples
3. The kernel is written in the documentation library
4. The documentation library is linked in a QMC code to check
   correctness and numerical accuracy
5. HPC experts provide an HPC version of the kernel
6. The HPC library is linked in the QMC codes of the CoE
** First application: 3-body Jastrow factor
#+LATEX: \newcommand{\Jeen}{J_{\text{een}}}
@@ -489,14 +460,109 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \begin{column}{0.5\textwidth}
- Gradient and Laplacian are also required
- Up to $20\times$ faster than in the original code
- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on
  Intel Skylake
- Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
#+LATEX: \end{column}
#+LATEX: \end{columns}
** High-Performance strategies
*** Linear algebra hot spots
| GEMM | Rank-1 update | Matrix Inversion |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
*** Matrices are relatively small ($\le 1000\times 1000$)
- Matrices are stored in tiled format fitting a block formulation
of the algorithms $\Longrightarrow$ task-based
linear algebra, interleaved computation of multiple kernels
- Tile sizes will be adjusted by auto-tuning
- Increase parallelism by aggregating multiple independent walkers
in matrices
- Needs fast linear algebra kernels for small matrices (tile size)
- For tiny matrices ($<5\times5$) specialized versions are implemented
** Example: Specialized DGEMM kernel I
*** Simple algorithm :B_block:BMCOL:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.45
:END:
- Simple micro kernel (*GotoDGEMM*\footnote{doi:10.1145/1356052.1356053})
- Code written using ~asm_volatile~ to force good code generation by
compilers
- *Tiling* scheme\footnote{doi:10.1109/ICPP.2015.29}
*** Tiling scheme :B_block:BMCOL:
:PROPERTIES:
:BEAMER_col: 0.45
:BEAMER_env: block
:END:
#+ATTR_LATEX: :width 5cm :height 5cm :keepaspectratio :right
[[./tiling_icpp2015.pdf]]
** Example: Specialized DGEMM kernel II
*** Benchmarks
- Comparison of MKL vs Specialized DGEMM
#+ATTR_LATEX: :height 4cm :keepaspectratio
[[./plot_percentage_vs_mkl_tiled_good.pdf]]
- Strong impact on MKL performance due to the number of consecutive executions
- Comparison favorable to MKL: many consecutive executions to
  amortize setup cost, JIT, Skylake CPU
** Why do we like our DGEMM?
- Open source code: can be modified easily
- Simple code (280 LOC)
- Decent performance: within 10% of MKL
- Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
- Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
- Allows keeping control of parallelism
- A good starting point for autotuning
** High-Performance strategies
*** Tuning
- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
- Specialized versions of critical hot-spots
- *MIPP* for portable intrinsics / specialized code generation
- Monitoring of the use of the library to choose most efficient versions
- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
** Efficiently guiding the developer
#+ATTR_LATEX: :width \textwidth
[[./maqao1.png]]
** Extensive/automatic testing of different configurations
#+ATTR_LATEX: :width \textwidth
[[./maqao2.png]]
* Summary
** Summary
- QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
- Development of open-source libraries to be used in the
TREX codes and beyond
- Libraries focus on /performance/, /portability/ and /productivity/
- Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
* Bonus slides
#+INCLUDE: "verificarlo.tex" export latex
** Verificarlo CI
#+LATEX: \begin{columns}
@@ -518,7 +584,6 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \end{columns}
* Useful links :noexport:
| TREX web site | https://trex-coe.eu |
@@ -597,3 +662,4 @@ together: perf and productivity
: /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf : /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf


@@ -1,4 +1,4 @@
% Created 2021-10-08 Fri 12:27
% Intended LaTeX compiler: pdflatex
\documentclass[aspectratio=169]{beamer}
\usepackage[utf8]{inputenc}
@@ -53,8 +53,8 @@ $^2$University of Versailles, Li-PaRAD (France)}
\maketitle
\section{QMC in TREX}
\label{sec:orge5169ea}
\begin{frame}[label={sec:org16615d0}]{QMC in TREX}
\begin{exampleblock}{QMC: Quantum Monte Carlo methods}
\begin{itemize}
\item Highly accurate methods
@@ -75,7 +75,7 @@ How: Instead of re-writing codes, provide libraries (free software)
\end{exampleblock}
\end{frame}
\begin{frame}[label={sec:orgd8db692}]{Quantum Monte Carlo (QMC)}
\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
@@ -101,14 +101,15 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgcee35fc}]{Quantum Monte Carlo (QMC)}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{itemize}
\item Very low memory requirements (no integrals)
\item Distribute walkers on different cores or compute nodes
\item No blocking communication: near-ideal scaling
\item Difficulty to parallelize within a QMC trajectory: depends on the
number of electrons
\end{itemize}
\end{column}
\begin{column}{0.6\textwidth}
@@ -119,15 +120,16 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org4bb2da0}]{Both libraries}
\begin{block}{Three objectives}
\begin{enumerate}
\item \alert{Productivity} \\
Usable and useful by scientists in different programming languages
\item \alert{Portability} \\
Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
\item \alert{Performance} \\
Must be efficient on all architectures: possible tradeoffs
between portability and performance
\end{enumerate}
\end{block}
@@ -140,8 +142,8 @@ Must be efficient on all architectures
\end{frame}
\section{TREXIO: I/O library}
\label{sec:orga389b46}
\begin{frame}[label={sec:org61be819}]{TREXIO: I/O library}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{exampleblock}{Before}
@@ -163,7 +165,7 @@ Must be efficient on all architectures
\url{https://github.com/trex-coe/trexio}
\end{frame}
\begin{frame}[label={sec:org01dc873}]{TREXIO: I/O library}
\begin{exampleblock}{Front end}
\begin{itemize}
\item Definition of an API to read/write wave functions
@@ -192,7 +194,7 @@ Must be efficient on all architectures
\end{columns}
\end{frame}
\begin{frame}[label={sec:org6f3aa58}]{Content of the files}
\begin{itemize}
\item File is \alert{self-contained}: no external knowledge needed to compute
\(\Psi(r_1,\dots,r_n)\) (normalization factors, basis et
@@ -208,43 +210,44 @@ AO & MO & Two-electron integrals\\
One-electron integrals & Density matrices & ECP\\
\end{tabular}
\end{center}
\item Each group contains multiple \alert{attributes}: information related to the
group
\end{itemize}
\end{frame}
\section{QMCkl: QMC kernel library}
\label{sec:org3669f0e}
\begin{frame}[label={sec:org89970a2}]{QMC kernel library}
\begin{block}{Computational kernels}
\begin{itemize}
\item QMCkl will contain the main kernels of QMC methods: Domain
specific library, end-user driven
\item Written together by QMC experts and HPC experts
\item Multiple high performance implementations of the kernels, tuned
for different
\begin{itemize}
\item architectures: portability is critical for users
\item problem sizes: from small to large systems
\item requested accuracy: reduced precision
\end{itemize}
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:org27f2ac6}]{Objectives}
\begin{itemize}
\item The code must stay easy to understand by the physicists/chemists.
Performance-related aspects should be delegated to the library
\item Scientists should be able to use their preferred language
\item Scientists should not lose control of their codes
\item Codes should not die when the architecture changes
\item Scientific code development should not kill the performance
\item Reuse of the optimization effort among the community
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org7fe4d9a}]{Functionality and performance}
\begin{itemize}
\item Keeping high \emph{productivity}, \emph{portability} and \emph{performance} is very
hard in a single piece of software.
@@ -255,9 +258,11 @@ We propose (at least) two implementations:
\item \alert{Documentation library} \\
Easy to read, understand, modify for scientists, not necessarily efficient.
\item \alert{High performance libraries} \\
Efficient on a given architecture, but not necessarily
readable by physicists/chemists. \\
Performance within 10\% to maximize portability and simplicity.
\item \alert{Ultra-High performance libraries} \\
Generated with auto-tuning tools for well identified datasets.
\end{enumerate}
\item Both \emph{Documentation} and \emph{High performance} have the same API
@@ -270,9 +275,10 @@ implemented in the HPC versions when the API is stabilized.
\end{itemize}
\end{frame}
\begin{frame}[label={sec:orgca18759},fragile]{Library design}
\begin{itemize}
\item Creation of a \emph{Context} that keeps a consistent state of the
library (pointers to computed data, configuration parameters, etc.)
\item Memory allocation is abstract:
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -283,11 +289,12 @@ context untouched (no allocation, no modification in-place)
\item High-level functions: let the library call multiple kernels in an
optimal way, possibly updating the context
\item Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
between kernels: re-compute only what is necessary and store
computed data in the context
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org1c791dc}]{Dependencies between kernels}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{center}
@@ -307,7 +314,7 @@ between kernels: re-compute only what is necessary
\end{columns}
\end{frame}
\begin{frame}[label={sec:org5202b14},fragile]{Use case: low-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
@@ -330,7 +337,7 @@ assert (rc == QMCKL_SUCCESS);
\end{minted}
\end{frame}
\begin{frame}[label={sec:org1ecca91},fragile]{Use case: high-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
// ...
@@ -354,82 +361,21 @@ rc = qmckl_get_local_energy(context, &e_loc);
\end{minted}
\end{frame}
\begin{frame}[label={sec:org3f3c8bf}]{Development strategy}
\begin{enumerate}
\item Kernel extraction: QMC specialists agree on the
mathematical expression of the problem
\item A mini-application is written to find the optimal data layout
with HPC experts from real-size examples
\item The kernel is written in the documentation library
\item The documentation library is linked in a QMC code to check
correctness and numerical accuracy
\item HPC experts provide an HPC version of the kernel
\item The HPC library is linked in the QMC codes of the CoE
\end{enumerate}
\end{frame}
\begin{frame}[label={sec:orgb6a9085}]{First application: 3-body Jastrow factor}
\newcommand{\Jeen}{J_{\text{een}}}
\newcommand{\Nel}{N_{\text{elec}}}
\newcommand{\Nat}{N_{\text{nucl}}}
@@ -460,14 +406,133 @@ VIJAY
\begin{itemize}
\item Gradient and Laplacian are also required
\item Up to \(20\times\) faster than in the original code
\item \(\sim 80\%\) of the AVX-512 peak is reached using standard MKL on
Intel Skylake
\item Expressed with a DGEMM kernel \(\Longrightarrow\) also efficient on GPU
\end{itemize}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgd6d3e26}]{High-Performance strategies}
\begin{block}{Linear algebra hot spots}
\begin{center}
\begin{tabular}{lll}
GEMM & Rank-1 update & Matrix Inversion\\
GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
\end{tabular}
\end{center}
\end{block}
\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
\begin{itemize}
\item Matrices are stored in tiled format fitting a block formulation
of the algorithms \(\Longrightarrow\) task-based
linear algebra, interleaved computation of multiple kernels
\item Tile sizes will be adjusted by auto-tuning
\item Increase parallelism by aggregating multiple independent walkers
in matrices
\item Needs fast linear algebra kernels for small matrices (tile size)
\item For tiny matrices (\(<5\times5\)) specialized versions are implemented
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgeb97339},fragile]{Example: Specialized DGEMM kernel I}
\begin{columns}
\begin{column}{0.45\columnwidth}
\begin{block}{Simple algorithm}
\begin{itemize}
\item Simple micro kernel (\alert{GotoDGEMM}\footnote{doi:10.1145/1356052.1356053})
\item Code written using \texttt{asm\_volatile} to force good code generation by
compilers
\item \alert{Tiling} scheme\footnote{doi:10.1109/ICPP.2015.29}
\end{itemize}
\end{block}
\end{column}
\begin{column}{0.45\columnwidth}
\begin{block}{Tiling scheme}
\begin{center}
\includegraphics[width=5cm,height=5cm]{./tiling_icpp2015.pdf}
\end{center}
\end{block}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org76e8117}]{Example: Specialized DGEMM kernel II}
\begin{block}{Benchmarks}
\begin{itemize}
\item Comparison of MKL vs Specialized DGEMM
\begin{center}
\includegraphics[height=4cm]{./plot_percentage_vs_mkl_tiled_good.pdf}
\end{center}
\item Strong impact on MKL performance due to the number of consecutive executions
\item Comparison favorable to MKL: many consecutive executions to
amortize setup cost, JIT, Skylake CPU
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgc7d8abc}]{Why do we like our DGEMM?}
\begin{itemize}
\item Open source code: can be modified easily
\item Simple code (280 LOC)
\item Decent performance: within 10\% of MKL
\item Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
\item Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
\item Allows keeping control of parallelism
\item A good starting point for autotuning
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org18a9bee}]{High-Performance strategies}
\begin{block}{Tuning}
\begin{itemize}
\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
\item Specialized versions of critical hot-spots
\item \alert{MIPP} for portable intrinsics / specialized code generation
\item Monitoring of the use of the library to choose most efficient versions
\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:org4489490}]{Efficiently guiding the developer}
\begin{center}
\includegraphics[width=\textwidth]{./maqao1.png}
\end{center}
\end{frame}
\begin{frame}[label={sec:orgddd3631}]{Extensive/automatic testing of different configurations}
\begin{center}
\includegraphics[width=\textwidth]{./maqao2.png}
\end{center}
\end{frame}
\section{Summary}
\label{sec:org30e04a5}
\begin{frame}[label={sec:org705d3cf}]{Summary}
\begin{itemize}
\item QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
\item Development of open-source libraries to be used in the
TREX codes and beyond
\item Libraries focus on \emph{performance}, \emph{portability} and \emph{productivity}
\item Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
\end{itemize}
\end{frame}
\section{Bonus slides}
\label{sec:orgb118e4f}
\begin{frame}[fragile]{Numerical analysis with Verificarlo}
@@ -566,8 +631,10 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
\draw[arrow]
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}
\begin{frame}[label={sec:org8493521}]{Verificarlo CI}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{exampleblock}{Compare runs}

BIN
tiling_icpp2015.pdf Normal file

Binary file not shown.


@@ -97,4 +97,4 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}