Friday morning zoom changes
commit 59137eb55b

BIN  plot_percentage_vs_mkl_tiled_good.pdf  (new binary file, not shown)

scemama.org  (192 lines)
@@ -88,7 +88,8 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
  - Very low memory requirements (no integrals)
  - Distribute walkers on different cores or compute nodes
  - No blocking communication: near-ideal scaling
- - Difficulty: parallelize within a QMC trajectory
+ - Difficulty to parallelize within a QMC trajectory: depends on the
+   number of electrons
  #+LATEX: \end{column}
  #+LATEX: \begin{column}{0.6\textwidth}
  #+ATTR_LATEX: :width \textwidth
@@ -99,11 +100,12 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
 ** Both libraries
 *** Three objectives
 1. *Productivity* \\
-   Used and developed by scientists in different languages
+   Usable and useful by scientists in different programming languages
 2. *Portability* \\
    Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
 3. *Performance* \\
-   Must be efficient on all architectures
+   Must be efficient on all architectures: possible tradeoffs
+   between portability and performance
 
 *** Free (libre) software
 - Requirement for open science
@@ -208,7 +210,8 @@ digraph G {
 | Nucleus | Basis | CI coefficients |
 | AO | MO | Two-electron integrals |
 | One-electron integrals | Density matrices | ECP |
-- Each group contains multiple *attributes*
+- Each group contains multiple *attributes*: information related to the
+  group
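The group/attribute layout maps directly onto the generated accessors named `trexio_[has/read/write]_<group>_<attribute>`, the pattern quoted in the surrounding hunk headers. A minimal sketch in C of that calling convention — the stubs and the static in-memory "back end" below are illustrative only, and the real functions also take a `trexio_t*` file handle as first argument:

```c
/* Illustrative mock of the TREXIO access pattern
 *   trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
 * Real TREXIO generates one such triple per (group, attribute);
 * the stubs below only mimic the naming and calling convention. */

typedef int trexio_exit_code;
#define TREXIO_SUCCESS 0
#define TREXIO_HAS_NOT 1

static double stored_charge[16];    /* stands in for the file back end */
static int    charge_is_set = 0;

/* <group> = nucleus, <attribute> = charge */
trexio_exit_code trexio_write_nucleus_charge(const double *q, int n) {
    for (int i = 0; i < n; i++) stored_charge[i] = q[i];
    charge_is_set = 1;
    return TREXIO_SUCCESS;
}

trexio_exit_code trexio_has_nucleus_charge(void) {
    return charge_is_set ? TREXIO_SUCCESS : TREXIO_HAS_NOT;
}

trexio_exit_code trexio_read_nucleus_charge(double *q, int n) {
    for (int i = 0; i < n; i++) q[i] = stored_charge[i];
    return TREXIO_SUCCESS;
}
```

A caller first probes with `has`, then `read`s, so a file missing a group degrades gracefully instead of failing.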
 
 ** Source code :noexport:
 
@@ -243,21 +246,21 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
 ** QMC kernel library
 
 *** Computational kernels
- - QMCkl will contain the main kernels of QMC methods (Domain
-   specific library, end-user driven)
+ - QMCkl will contain the main kernels of QMC methods: Domain
+   specific library, end-user driven
  - Written together by QMC experts and HPC experts
  - Multiple high performance implementations of the kernels, tuned
    for different
   - architectures: portability is critical for users
-  - problem sizes (from small to large systems)
-  - requested accuracy (reduced precision)
+  - problem sizes: from small to large systems
+  - requested accuracy: reduced precision
 
 ** Objectives
 
 - The code must stay easy to understand by the physicists/chemists.
   Performance-related aspects should be delegated to the library
 - Scientists should be able to use their preferred language
-- Scientists should not lose control on their codes
+- Scientists should not lose control of their codes
 - Codes should not die when the architecture changes
 - Scientific code development should not kill the performance
 - Reuse of the optimization effort among the community
@@ -275,6 +278,8 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
    Efficient on a given architecture, but not necessarily
    readable by physicists/chemists. \\
+   Performance within 10% to maximize portability and simplicity.
 3. *Ultra-High performance libraries* \\
    Generated with auto-tuning tools for well identified datasets.
 
 - Both /Documentation/ and /High performance/ have the same API
   (similar to BLAS on netlib /vs/ MKL).

@@ -284,9 +289,21 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
 
 - Performance: enable a data-driven task-based parallelism
 
+** Documentation library :noexport:
+Literate programming with Org-mode:
+- Comments are more important than code
+- Can add graphics, \LaTeX formulas, tables, etc
+- Documentation always synchronized with the code
+- Some routines can be generated by embedded scripts
+- Kernels are implemented in Fortran for readability
+- The API is C-compatible: QMCkl appears like a C library
+  $\Longrightarrow$ can be used in all other languages
+- Example: Prototyping in Julia
+
 ** Library design
 
-- Creation of a /Context/ that keeps a consistent state of the library
+- Creation of a /Context/ that keeps a consistent state of the
+  library (pointers to computed data, configuration parameters, etc.)
 - Memory allocation is abstract:
 #+begin_src c
 void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -297,7 +314,8 @@ void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
 - High-level functions: let the library call multiple kernels in an
   optimal way, possibly updating the context
 - Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
-  between kernels: re-compute only what is necessary
+  between kernels: re-compute only what is necessary and store
+  computed data in the context
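The IRP idea can be sketched as a validity flag per derived quantity in the context: setting an input invalidates everything downstream, and a getter recomputes only when needed, caching the result. A toy illustration with made-up names, not the actual QMCkl implementation:

```c
/* Minimal sketch of IRP-style dependency tracking: the derived
 * quantity carries a validity flag; its getter recomputes only when
 * the inputs changed, and stores the result back in the context. */

typedef struct {
    double coord[2];       /* input: two positions on a line (toy) */
    double distance;       /* derived quantity, cached in the context */
    int    distance_valid; /* validity flag for the cache */
    int    n_recomputes;   /* for illustration: counts the real work */
} toy_context;

void set_coord(toy_context *ctx, double x0, double x1) {
    ctx->coord[0] = x0;
    ctx->coord[1] = x1;
    ctx->distance_valid = 0;   /* invalidate everything downstream */
}

double get_distance(toy_context *ctx) {
    if (!ctx->distance_valid) {           /* recompute only if needed */
        double d = ctx->coord[1] - ctx->coord[0];
        ctx->distance = (d < 0.0) ? -d : d;
        ctx->distance_valid = 1;
        ctx->n_recomputes++;
    }
    return ctx->distance;                 /* otherwise reuse the cache */
}
```

Calling `get_distance` twice in a row does the arithmetic once; a new `set_coord` triggers exactly one recomputation on the next access.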
 
 ** Dependencies between kernels
 
@@ -407,58 +425,11 @@ rc = qmckl_get_local_energy(context, &e_loc);
 2. A mini-application is written to find the optimal data layout
    with HPC experts from real-size examples
 3. The kernel is written in the documentation library
-4. The documentation library is linked in a QMC code to check correctness
+4. The documentation library is linked in a QMC code to check
+   correctness and numerical accuracy
 5. HPC experts provide an HPC version of the kernel
 6. The HPC library is linked in the QMC codes of the CoE
 
-** Documentation library
-Literate programming with Org-mode:
-- Comments are more important than code
-- Can add graphics, \LaTeX formulas, tables, etc
-- Documentation always synchronized with the code
-- Some routines can be generated by embedded scripts
-- Kernels are implemented in Fortran for readability
-- The API is C-compatible: QMCkl appears like a C library
-  $\Longrightarrow$ can be used in all other languages
-- Example: Prototyping in Julia
-
-** High-Performance strategies
-
-*** Linear algebra hot spots
-
-| GEMM | Rank-1 update | Matrix Inversion |
-| GEMV | Diagonal of GEMM | Shermann-Morrison-Woodburry |
-
-*** Matrices are relatively small ($\le 1000\times 1000$)
-
-- Matrices are stored in tiled format $\Longrightarrow$ task-based
-  linear algebra interleaved computation of multiple kernels
-- Increase parallelism by agregating multiple independent walkers
-  in matrices
-- Needs fast linear algebra kernels for small matrices
-
-** High-Performance strategies
-
-*** Tuning
-- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
-- Specialized versions of critical hot-spots
-- MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation
-- Monitoring of the use of the library to choose most efficient versions
-- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
-
-** Example: Specialized DGEMM kernel
-
-VIJAY
-
-** Efficiently guiding the developer
-
-#+ATTR_LATEX: :width \textwidth
-[[./maqao1.png]]
-** Extensive/automatic testing of different configurations
-
-#+ATTR_LATEX: :width \textwidth
-[[./maqao2.png]]
-
 ** First application : 3-body Jastrow factor
 
 #+LATEX: \newcommand{\Jeen}{J_{\text{een}}}

@@ -489,14 +460,109 @@ rc = qmckl_get_local_energy(context, &e_loc);
 #+LATEX: \begin{column}{0.5\textwidth}
 - Gradient and Laplacian are also required
 - Up to $20\times$ faster than in the original code
-- $\sim 80\%$ of the AVX-512 peak is reached
+- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on
+  Intel Skylake
 - Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
 #+LATEX: \end{column}
 #+LATEX: \end{columns}
 
+** High-Performance strategies
+
+*** Linear algebra hot spots
+
+| GEMM | Rank-1 update    | Matrix Inversion          |
+| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
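The Sherman-Morrison-Woodbury entry refers to the rank-1 update identity used in QMC to refresh an inverse Slater matrix in $O(N^2)$ when a single electron moves, instead of a fresh $O(N^3)$ inversion. For nonsingular $A$ and vectors $u$, $v$:

```latex
(A + u v^T)^{-1} \;=\; A^{-1} \;-\; \frac{A^{-1} u\, v^T A^{-1}}{1 + v^T A^{-1} u},
\qquad \text{valid whenever } 1 + v^T A^{-1} u \neq 0.
```

The denominator is the determinant ratio of the updated and original matrices, which QMC methods need anyway for the acceptance test.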
+
+*** Matrices are relatively small ($\le 1000\times 1000$)
+
+- Matrices are stored in tiled format fitting a block formulation
+  of the algorithms $\Longrightarrow$ task-based
+  linear algebra, interleaved computation of multiple kernels
+- Tile sizes will be adjusted by auto-tuning
+- Increase parallelism by aggregating multiple independent walkers
+  in matrices
+- Needs fast linear algebra kernels for small matrices (tile size)
+- For tiny matrices ($<5\times5$) specialized versions are implemented
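Walker aggregation can be sketched in plain C — a toy illustration, not QMCkl code: applying the same $n \times n$ matrix to $W$ independent walker vectors becomes one matrix-matrix product, with more parallelism and arithmetic intensity than $W$ separate matrix-vector products:

```c
#include <stddef.h>

/* Toy sketch of walker aggregation. Plain loops stand in for the
 * optimized GEMV/GEMM kernels; matrices are row-major. */

/* One walker at a time: y = A * x, with A of size n x n */
void gemv(size_t n, const double *A, const double *x, double *y) {
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t k = 0; k < n; k++) s += A[i*n + k] * x[k];
        y[i] = s;
    }
}

/* All walkers at once: Y(n x W) = A(n x n) * X(n x W), where column j
 * of X is walker j. One pass over A updates every walker. */
void gemm_walkers(size_t n, size_t W, const double *A,
                  const double *X, double *Y) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < W; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++) s += A[i*n + k] * X[k*W + j];
            Y[i*W + j] = s;
        }
}
```

Column $j$ of `Y` equals `gemv` applied to walker $j$, so batching changes the schedule, not the result.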
+
+** Example: Specialized DGEMM kernel I
+
+*** Simple algorithm :B_block:BMCOL:
+:PROPERTIES:
+:BEAMER_env: block
+:BEAMER_col: 0.45
+:END:
+- Simple micro kernel (*GotoDGEMM*\footnote{doi:10.1145/1356052.1356053})
+- Code written using ~asm_volatile~ to force good code generation by
+  compilers
+- *Tiling* scheme\footnote{doi:10.1109/ICPP.2015.29}
+
+*** Tiling scheme :B_block:BMCOL:
+:PROPERTIES:
+:BEAMER_col: 0.45
+:BEAMER_env: block
+:END:
+#+ATTR_LATEX: :width 5cm :height 5cm :keepaspectratio :right
+[[./tiling_icpp2015.pdf]]
+
+** Example: Specialized DGEMM kernel II
+
+*** Benchmarks
+
+- Comparison of MKL vs Specialized DGEMM
+
+#+ATTR_LATEX: :height 4cm :keepaspectratio
+[[./plot_percentage_vs_mkl_tiled_good.pdf]]
+
+- Strong impact on MKL performance due to the number of consecutive executions
+- Favorable comparison for MKL: many consecutive executions to
+  amortize setup cost, JIT, Skylake CPU
+
+** Why do we like our DGEMM?
+
+- Open source code: can be modified easily
+- Simple code (280 LOC)
+- Decent performance: within 10% of MKL
+- Can be rewritten in different languages to increase
+  portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
+- Can be coupled with simple pack/unpack routines to handle different
+  data storage (tiled matrices)
+- Allows us to keep control of parallelism
+- A good starting point for autotuning
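The tiling idea behind the specialized kernel can be sketched in a few lines of plain C — a toy version only, with none of the packing, `asm_volatile` micro-kernel, or auto-tuned tile sizes of the real code:

```c
#include <stddef.h>

/* Toy tiled matrix multiply, C += A*B, square n x n row-major
 * matrices. Illustrates the blocking only: each micro-tile of A, B
 * and C stays resident in cache/registers while it is reused. */

enum { TILE = 4 };   /* fixed here; the real library auto-tunes this */

void dgemm_tiled(size_t n, const double *A, const double *B, double *C) {
    for (size_t ii = 0; ii < n; ii += TILE)
      for (size_t kk = 0; kk < n; kk += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
          /* micro-tile update: C[ii:,jj:] += A[ii:,kk:] * B[kk:,jj:] */
          for (size_t i = ii; i < ii + TILE && i < n; i++)
            for (size_t k = kk; k < kk + TILE && k < n; k++) {
              double a = A[i*n + k];          /* scalar reused across j */
              for (size_t j = jj; j < jj + TILE && j < n; j++)
                C[i*n + j] += a * B[k*n + j];
            }
}
```

The result is identical to the untiled triple loop; only the traversal order (and therefore cache behavior) changes, which is why pack/unpack routines and tiled storage compose naturally with it.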
+
+** High-Performance strategies
+
+*** Tuning
+- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
+- Specialized versions of critical hot-spots
+- *MIPP* for portable intrinsics / specialized code generation
+- Monitoring of the use of the library to choose most efficient versions
+- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
+
+** Efficiently guiding the developer
+
+#+ATTR_LATEX: :width \textwidth
+[[./maqao1.png]]
+** Extensive/automatic testing of different configurations
+
+#+ATTR_LATEX: :width \textwidth
+[[./maqao2.png]]
+
 * Summary
 
 ** Summary
 - QMC codes integrated in an ecosystem of multiple codes for
   high-accuracy quantum chemistry
 - Development of open-source libraries to be used in the
   TREX codes and beyond
 - Libraries focus on /performance/, /portability/ and /productivity/
 - Strategies to make the collaboration between physicists/chemists
   and HPC experts optimal
 
 * Bonus slides
 
 #+LATEX: \end{frame}
 #+INCLUDE: "verificarlo.tex" export latex
 
 ** Verificarlo CI
 
 #+LATEX: \begin{columns}

@@ -518,7 +584,6 @@ rc = qmckl_get_local_energy(context, &e_loc);
 #+LATEX: \end{exampleblock}
 #+LATEX: \end{column}
 #+LATEX: \end{columns}
 
 * Useful links :noexport:
 
 | TREX web site | https://trex-coe.eu |

@@ -597,3 +662,4 @@ together: perf et productivity
 : /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf
 
scemama.tex  (263 lines)
@@ -1,4 +1,4 @@
-% Created 2021-10-07 Thu 12:17
+% Created 2021-10-08 Fri 12:27
 % Intended LaTeX compiler: pdflatex
 \documentclass[aspectratio=169]{beamer}
 \usepackage[utf8]{inputenc}
@@ -53,8 +53,8 @@ $^2$University of Versailles, Li-PaRAD (France)}
 \maketitle
 
 \section{QMC in TREX}
-\label{sec:org527cfcf}
-\begin{frame}[label={sec:org3bfadea}]{QMC in TREX}
+\label{sec:orge5169ea}
+\begin{frame}[label={sec:org16615d0}]{QMC in TREX}
 \begin{exampleblock}{QMC: Quantum Monte Carlo methods}
 \begin{itemize}
 \item Highly accurate methods
@@ -75,7 +75,7 @@ How: Instead of re-writing codes, provide libraries (free software)
 \end{exampleblock}
 \end{frame}
 
-\begin{frame}[label={sec:orge26ef23}]{Quantum Monte Carlo (QMC)}
+\begin{frame}[label={sec:orgd8db692}]{Quantum Monte Carlo (QMC)}
 \alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
 \begin{eqnarray}
 E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
@@ -101,14 +101,15 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
 \end{columns}
 \end{frame}
 
-\begin{frame}[label={sec:orgd65402e}]{Quantum Monte Carlo (QMC)}
+\begin{frame}[label={sec:orgcee35fc}]{Quantum Monte Carlo (QMC)}
 \begin{columns}
 \begin{column}{0.4\textwidth}
 \begin{itemize}
 \item Very low memory requirements (no integrals)
 \item Distribute walkers on different cores or compute nodes
 \item No blocking communication: near-ideal scaling
-\item Difficulty: parallelize within a QMC trajectory
+\item Difficulty to parallelize within a QMC trajectory: depends on the
+number of electrons
 \end{itemize}
 \end{column}
 \begin{column}{0.6\textwidth}
@@ -119,15 +120,16 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
 \end{columns}
 \end{frame}
 
-\begin{frame}[label={sec:org3e8242f}]{Both libraries}
+\begin{frame}[label={sec:org4bb2da0}]{Both libraries}
 \begin{block}{Three objectives}
 \begin{enumerate}
 \item \alert{Productivity} \\
-Used and developed by scientists in different languages
+Usable and useful by scientists in different programming languages
 \item \alert{Portability} \\
 Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
 \item \alert{Performance} \\
-Must be efficient on all architectures
+Must be efficient on all architectures: possible tradeoffs
+between portability and performance
 \end{enumerate}
 \end{block}
 
@@ -140,8 +142,8 @@ Must be efficient on all architectures
 \end{frame}
 
 \section{TREXIO: I/O library}
-\label{sec:orgf8ad1e7}
-\begin{frame}[label={sec:org02f0485}]{TREXIO: I/O library}
+\label{sec:orga389b46}
+\begin{frame}[label={sec:org61be819}]{TREXIO: I/O library}
 \begin{columns}
 \begin{column}{0.4\textwidth}
 \begin{exampleblock}{Before}
@@ -163,7 +165,7 @@ Must be efficient on all architectures
 \url{https://github.com/trex-coe/trexio}
 \end{frame}
 
-\begin{frame}[label={sec:org2341c39}]{TREXIO: I/O library}
+\begin{frame}[label={sec:org01dc873}]{TREXIO: I/O library}
 \begin{exampleblock}{Front end}
 \begin{itemize}
 \item Definition of an API to read/write wave functions
@@ -192,7 +194,7 @@ Must be efficient on all architectures
 \end{columns}
 \end{frame}
 
-\begin{frame}[label={sec:org51a55c1}]{Content of the files}
+\begin{frame}[label={sec:org6f3aa58}]{Content of the files}
 \begin{itemize}
 \item File is \alert{self-contained}: no external knowledge needed to compute
 \(\Psi(r_1,\dots,r_n)\) (normalization factors, basis et
@@ -208,43 +210,44 @@ AO & MO & Two-electron integrals\\
 One-electron integrals & Density matrices & ECP\\
 \end{tabular}
 \end{center}
-\item Each group contains multiple \alert{attributes}
+\item Each group contains multiple \alert{attributes}: information related to the
+group
 \end{itemize}
 \end{frame}
 
 \section{QMCkl: QMC kernel library}
-\label{sec:org53e6105}
+\label{sec:org3669f0e}
 
-\begin{frame}[label={sec:org4dc9060}]{QMC kernel library}
+\begin{frame}[label={sec:org89970a2}]{QMC kernel library}
 \begin{block}{Computational kernels}
 \begin{itemize}
-\item QMCkl will contain the main kernels of QMC methods (Domain
-specific library, end-user driven)
+\item QMCkl will contain the main kernels of QMC methods: Domain
+specific library, end-user driven
 \item Written together by QMC experts and HPC experts
 \item Multiple high performance implementations of the kernels, tuned
 for different
 \begin{itemize}
 \item architectures: portability is critical for users
-\item problem sizes (from small to large systems)
-\item requested accuracy (reduced precision)
+\item problem sizes: from small to large systems
+\item requested accuracy: reduced precision
 \end{itemize}
 \end{itemize}
 \end{block}
 \end{frame}
 
-\begin{frame}[label={sec:orgcf8c268}]{Objectives}
+\begin{frame}[label={sec:org27f2ac6}]{Objectives}
 \begin{itemize}
 \item The code must stay easy to understand by the physicists/chemists.
 Performance-related aspects should be delegated to the library
 \item Scientists should be able to use their preferred language
-\item Scientists should not lose control on their codes
+\item Scientists should not lose control of their codes
 \item Codes should not die when the architecture changes
 \item Scientific code development should not kill the performance
 \item Reuse of the optimization effort among the community
 \end{itemize}
 \end{frame}
 
-\begin{frame}[label={sec:org523cd8a}]{Functionality and performance}
+\begin{frame}[label={sec:org7fe4d9a}]{Functionality and performance}
 \begin{itemize}
 \item Keeping high \emph{productivity}, \emph{portability} and \emph{performance} is very
 hard in a single piece of software.
@@ -258,6 +261,8 @@ Easy to read, understand, modify for scientists, not necessarily efficient.
 Efficient on a given architecture, but not necessarily
 readable by physicists/chemists. \\
+Performance within 10\% to maximize portability and simplicity.
 \item \alert{Ultra-High performance libraries} \\
 Generated with auto-tuning tools for well identified datasets.
 \end{enumerate}
 
 \item Both \emph{Documentation} and \emph{High performance} have the same API
@@ -270,9 +275,10 @@ implemented in the HPC versions when the API is stabilized.
 \end{itemize}
 \end{frame}
 
-\begin{frame}[label={sec:org1030a63},fragile]{Library design}
+\begin{frame}[label={sec:orgca18759},fragile]{Library design}
 \begin{itemize}
-\item Creation of a \emph{Context} that keeps a consistent state of the library
+\item Creation of a \emph{Context} that keeps a consistent state of the
+library (pointers to computed data, configuration parameters, etc.)
 \item Memory allocation is abstract:
 \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
 void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -283,11 +289,12 @@ context untouched (no allocation, no modification in-place)
 \item High-level functions: let the library call multiple kernels in an
 optimal way, possibly updating the context
 \item Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
-between kernels: re-compute only what is necessary
+between kernels: re-compute only what is necessary and store
+computed data in the context
 \end{itemize}
 \end{frame}
 
-\begin{frame}[label={sec:orgd8c37c2}]{Dependencies between kernels}
+\begin{frame}[label={sec:org1c791dc}]{Dependencies between kernels}
 \begin{columns}
 \begin{column}{0.5\textwidth}
 \begin{center}

@@ -307,7 +314,7 @@ between kernels: re-compute only what is necessary
 \end{columns}
 \end{frame}
 
-\begin{frame}[label={sec:org465f70f},fragile]{Use case: low-level}
+\begin{frame}[label={sec:org5202b14},fragile]{Use case: low-level}
 \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
 #include <qmckl.h>
 
@@ -330,7 +337,7 @@ assert (rc == QMCKL_SUCCESS);
 \end{minted}
 \end{frame}
 
-\begin{frame}[label={sec:orgb80c323},fragile]{Use case: high-level}
+\begin{frame}[label={sec:org1ecca91},fragile]{Use case: high-level}
 \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
 #include <qmckl.h>
 // ...
@@ -354,82 +361,21 @@ rc = qmckl_get_local_energy(context, &e_loc);
 \end{minted}
 \end{frame}
 
-\begin{frame}[label={sec:org518f369}]{Development strategy}
+\begin{frame}[label={sec:org3f3c8bf}]{Development strategy}
 \begin{enumerate}
 \item Kernel extraction: QMC specialists agree on the
 mathematical expression of the problem
 \item A mini-application is written to find the optimal data layout
 with HPC experts from real-size examples
 \item The kernel is written in the documentation library
-\item The documentation library is linked in a QMC code to check correctness
+\item The documentation library is linked in a QMC code to check
+correctness and numerical accuracy
 \item HPC experts provide an HPC version of the kernel
 \item The HPC library is linked in the QMC codes of the CoE
 \end{enumerate}
 \end{frame}
 
-\begin{frame}[label={sec:org7c60b7a}]{Documentation library}
-Literate programming with Org-mode:
-\begin{itemize}
-\item Comments are more important than code
-\item Can add graphics, \LaTeX formulas, tables, etc
-\item Documentation always synchronized with the code
-\item Some routines can be generated by embedded scripts
-\item Kernels are implemented in Fortran for readability
-\item The API is C-compatible: QMCkl appears like a C library
-\(\Longrightarrow\) can be used in all other languages
-\item Example: Prototyping in Julia
-\end{itemize}
-\end{frame}
-
-\begin{frame}[label={sec:orgf424cd4}]{High-Performance strategies}
-\begin{block}{Linear algebra hot spots}
-\begin{center}
-\begin{tabular}{lll}
-GEMM & Rank-1 update & Matrix Inversion\\
-GEMV & Diagonal of GEMM & Shermann-Morrison-Woodburry\\
-\end{tabular}
-\end{center}
-\end{block}
-
-\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
-\begin{itemize}
-\item Matrices are stored in tiled format \(\Longrightarrow\) task-based
-linear algebra interleaved computation of multiple kernels
-\item Increase parallelism by agregating multiple independent walkers
-in matrices
-\item Needs fast linear algebra kernels for small matrices
-\end{itemize}
-\end{block}
-\end{frame}
-
-\begin{frame}[label={sec:orgea7372b}]{High-Performance strategies}
-\begin{block}{Tuning}
-\begin{itemize}
-\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
-\item Specialized versions of critical hot-spots
-\item MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation
-\item Monitoring of the use of the library to choose most efficient versions
-\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
-\end{itemize}
-\end{block}
-\end{frame}
-
-\begin{frame}[label={sec:orgba656d9}]{Example: Specialized DGEMM kernel}
-VIJAY
-\end{frame}
-
-\begin{frame}[label={sec:orgd3ca712}]{Efficiently guiding the developer}
-\begin{center}
-\includegraphics[width=\textwidth]{./maqao1.png}
-\end{center}
-\end{frame}
-\begin{frame}[label={sec:orgcc14268}]{Extensive/automatic testing of different configurations}
-\begin{center}
-\includegraphics[width=\textwidth]{./maqao2.png}
-\end{center}
-\end{frame}
-
-\begin{frame}[label={sec:org7ee3c30}]{First application : 3-body Jastrow factor}
+\begin{frame}[label={sec:orgb6a9085}]{First application : 3-body Jastrow factor}
 \newcommand{\Jeen}{J_{\text{een}}}
 \newcommand{\Nel}{N_{\text{elec}}}
 \newcommand{\Nat}{N_{\text{nucl}}}

@@ -460,14 +406,133 @@ VIJAY
 \begin{itemize}
 \item Gradient and Laplacian are also required
 \item Up to \(20\times\) faster than in the original code
-\item \(\sim 80\%\) of the AVX-512 peak is reached
+\item \(\sim 80\%\) of the AVX-512 peak is reached using standard MKL on
+Intel Skylake
 \item Expressed with a DGEMM kernel \(\Longrightarrow\) also efficient on GPU
 \end{itemize}
 \end{column}
 \end{columns}
 
 
 \end{frame}
 
+\begin{frame}[label={sec:orgd6d3e26}]{High-Performance strategies}
+\begin{block}{Linear algebra hot spots}
+\begin{center}
+\begin{tabular}{lll}
+GEMM & Rank-1 update & Matrix Inversion\\
+GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
+\end{tabular}
+\end{center}
+\end{block}
+
+\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
+\begin{itemize}
+\item Matrices are stored in tiled format fitting a block formulation
+of the algorithms \(\Longrightarrow\) task-based
+linear algebra, interleaved computation of multiple kernels
+\item Tile sizes will be adjusted by auto-tuning
+\item Increase parallelism by aggregating multiple independent walkers
+in matrices
+\item Needs fast linear algebra kernels for small matrices (tile size)
+\item For tiny matrices (\(<5\times5\)) specialized versions are implemented
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:orgeb97339},fragile]{Example: Specialized DGEMM kernel I}
+\begin{columns}
+\begin{column}{0.45\columnwidth}
+\begin{block}{Simple algorithm}
+\begin{itemize}
+\item Simple micro kernel (\alert{GotoDGEMM}\footnote{doi:10.1145/1356052.1356053})
+\item Code written using \texttt{asm\_volatile} to force good code generation by
+compilers
+\item \alert{Tiling} scheme\footnote{doi:10.1109/ICPP.2015.29}
+\end{itemize}
+\end{block}
+\end{column}
+
+\begin{column}{0.45\columnwidth}
+\begin{block}{Tiling scheme}
+\begin{center}
+\includegraphics[width=5cm,height=5cm]{./tiling_icpp2015.pdf}
+\end{center}
+\end{block}
+\end{column}
+\end{columns}
+\end{frame}
+
+\begin{frame}[label={sec:org76e8117}]{Example: Specialized DGEMM kernel II}
+\begin{block}{Benchmarks}
+\begin{itemize}
+\item Comparison of MKL vs Specialized DGEMM
+
+\begin{center}
+\includegraphics[height=4cm]{./plot_percentage_vs_mkl_tiled_good.pdf}
+\end{center}
+
+\item Strong impact on MKL performance due to the number of consecutive executions
+\item Favorable comparison for MKL: many consecutive executions to
+amortize setup cost, JIT, Skylake CPU
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:orgc7d8abc}]{Why do we like our DGEMM?}
+\begin{itemize}
+\item Open source code: can be modified easily
+\item Simple code (280 LOC)
+\item Decent performance: within 10\% of MKL
+\item Can be rewritten in different languages to increase
+portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
+\item Can be coupled with simple pack/unpack routines to handle different
+data storage (tiled matrices)
+\item Allows us to keep control of parallelism
+\item A good starting point for autotuning
+\end{itemize}
+\end{frame}
+
+\begin{frame}[label={sec:org18a9bee}]{High-Performance strategies}
+\begin{block}{Tuning}
+\begin{itemize}
+\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
+\item Specialized versions of critical hot-spots
+\item \alert{MIPP} for portable intrinsics / specialized code generation
+\item Monitoring of the use of the library to choose most efficient versions
+\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:org4489490}]{Efficiently guiding the developer}
+\begin{center}
+\includegraphics[width=\textwidth]{./maqao1.png}
+\end{center}
+\end{frame}
+\begin{frame}[label={sec:orgddd3631}]{Extensive/automatic testing of different configurations}
+\begin{center}
+\includegraphics[width=\textwidth]{./maqao2.png}
+\end{center}
+\end{frame}
+
 \section{Summary}
 \label{sec:org30e04a5}
 
 \begin{frame}[label={sec:org705d3cf}]{Summary}
 \begin{itemize}
 \item QMC codes integrated in an ecosystem of multiple codes for
 high-accuracy quantum chemistry
 \item Development of open-source libraries to be used in the
 TREX codes and beyond
 \item Libraries focus on \emph{performance}, \emph{portability} and \emph{productivity}
 \item Strategies to make the collaboration between physicists/chemists
 and HPC experts optimal
 \end{itemize}
 \end{frame}
 
 
 \section{Bonus slides}
 \label{sec:orgb118e4f}
 
 \begin{frame}[fragile]{Numerical analysis with Verificarlo}
 
 
@@ -566,8 +631,10 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
 \draw[arrow]
 (targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
 \end{tikzpicture}
 
 \end{frame}
-\begin{frame}[label={sec:org8493521}]{Verificarlo CI}
+
+\begin{frame}[label={sec:org560588a}]{Verificarlo CI}
 \begin{columns}
 \begin{column}{0.5\textwidth}
 \begin{exampleblock}{Compare runs}
BIN  tiling_icpp2015.pdf  (new binary file, not shown)

@@ -97,4 +97,4 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
 (targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
 \end{tikzpicture}
 
 
 \end{frame}