Zoom modifications, Friday morning

This commit is contained in:
Anthony Scemama 2021-10-08 12:28:21 +02:00
commit 59137eb55b
5 changed files with 301 additions and 168 deletions

Binary file not shown.


@ -88,7 +88,8 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
- Very low memory requirements (no integrals)
- Distribute walkers on different cores or compute nodes
- No blocking communication: near-ideal scaling
- Difficulty: parallelize within a QMC trajectory
- Difficult to parallelize within a QMC trajectory: depends on the
number of electrons
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.6\textwidth}
#+ATTR_LATEX: :width \textwidth
@ -99,11 +100,12 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
** Both libraries
*** Three objectives
1. *Productivity* \\
Used and developed by scientists in different languages
Usable by, and useful to, scientists in different programming languages
2. *Portability* \\
Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
3. *Performance* \\
Must be efficient on all architectures
Must be efficient on all architectures: possible tradeoffs
between portability and performance
*** Free (libre) software
- Requirement for open science
@ -208,7 +210,8 @@ digraph G {
| Nucleus | Basis | CI coefficients |
| AO | MO | Two-electron integrals |
| One-electron integrals | Density matrices | ECP |
- Each group contains multiple *attributes*
- Each group contains multiple *attributes*: information related to the
group
** Source code :noexport:
@ -241,23 +244,23 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
* QMCkl: QMC kernel library
** QMC kernel library
*** Computational kernels
- QMCkl will contain the main kernels of QMC methods (Domain
specific library, end-user driven)
- QMCkl will contain the main kernels of QMC methods: a domain-specific
library, end-user driven
- Written together by QMC experts and HPC experts
- Multiple high performance implementations of the kernels, tuned
for different
- architectures: portability is critical for users
- problem sizes (from small to large systems)
- requested accuracy (reduced precision)
- problem sizes: from small to large systems
- requested accuracy: reduced precision
** Objectives
- The code must stay easy for physicists/chemists to understand.
Performance-related aspects should be delegated to the library
- Scientists should be able to use their preferred language
- Scientists should not lose control on their codes
- Scientists should not lose control of their codes
- Codes should not die when the architecture changes
- Scientific code development should not kill the performance
- Reuse of the optimization effort among the community
@ -273,8 +276,10 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
Easy to read, understand, modify for scientists, not necessarily efficient.
2. *High performance libraries* \\
Efficient on a given architecture, but not necessarily
readable by physicists/chemists. \\
Performance within 10% to maximize portability and simplicity.
readable by physicists/chemists. \\
Performance within 10% to maximize portability and simplicity.
3. *Ultra-High performance libraries* \\
Generated with auto-tuning tools for well identified datasets.
- Both /Documentation/ and /High performance/ have the same API
(similar to BLAS on netlib /vs/ MKL).
@ -283,10 +288,22 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
implemented in the HPC versions when the API is stabilized.
- Performance: enable a data-driven task-based parallelism
** Documentation library :noexport:
Literate programming with Org-mode:
- Comments are more important than code
- Can add graphics, \LaTeX formulas, tables, etc.
- Documentation always synchronized with the code
- Some routines can be generated by embedded scripts
- Kernels are implemented in Fortran for readability
- The API is C-compatible: QMCkl appears like a C library
$\Longrightarrow$ can be used in all other languages
- Example: Prototyping in Julia
** Library design
- Creation of a /Context/ that keeps a consistent state of the library
- Creation of a /Context/ that keeps a consistent state of the
library (pointers to computed data, configuration parameters, etc.)
- Memory allocation is abstract:
#+begin_src c
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@ -297,7 +314,8 @@ void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
- High-level functions: let the library call multiple kernels in an
optimal way, possibly updating the context
- Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
between kernels: re-compute only what is necessary
between kernels: re-compute only what is necessary and store
computed data in the context
** Dependencies between kernels
@ -407,58 +425,11 @@ rc = qmckl_get_local_energy(context, &e_loc);
2. A mini-application is written to find the optimal data layout
with HPC experts from real-size examples
3. The kernel is written in the documentation library
4. The documentation library is linked in a QMC code to check correctness
4. The documentation library is linked in a QMC code to check
correctness and numerical accuracy
5. HPC experts provide an HPC version of the kernel
6. The HPC library is linked in the QMC codes of the CoE
** Documentation library
Literate programming with Org-mode:
- Comments are more important than code
- Can add graphics, \LaTeX formulas, tables, etc.
- Documentation always synchronized with the code
- Some routines can be generated by embedded scripts
- Kernels are implemented in Fortran for readability
- The API is C-compatible: QMCkl appears like a C library
$\Longrightarrow$ can be used in all other languages
- Example: Prototyping in Julia
** High-Performance strategies
*** Linear algebra hot spots
| GEMM | Rank-1 update | Matrix Inversion |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
*** Matrices are relatively small ($\le 1000\times 1000$)
- Matrices are stored in tiled format $\Longrightarrow$ task-based
linear algebra, interleaved computation of multiple kernels
- Increase parallelism by aggregating multiple independent walkers
in matrices
- Needs fast linear algebra kernels for small matrices
** High-Performance strategies
*** Tuning
- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
- Specialized versions of critical hot-spots
- MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation
- Monitoring of the use of the library to choose most efficient versions
- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
** Example: Specialized DGEMM kernel
VIJAY
** Efficiently guiding the developer
#+ATTR_LATEX: :width \textwidth
[[./maqao1.png]]
** Extensive/automatic testing of different configurations
#+ATTR_LATEX: :width \textwidth
[[./maqao2.png]]
** First application: 3-body Jastrow factor
#+LATEX: \newcommand{\Jeen}{J_{\text{een}}}
@ -489,14 +460,109 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \begin{column}{0.5\textwidth}
- Gradient and Laplacian are also required
- Up to $20\times$ faster than in the original code
- $\sim 80\%$ of the AVX-512 peak is reached
- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on
Intel Skylake
- Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
#+LATEX: \end{column}
#+LATEX: \end{columns}
#+LATEX: \end{frame}
#+INCLUDE: "verificarlo.tex" export latex
** High-Performance strategies
*** Linear algebra hot spots
| GEMM | Rank-1 update | Matrix Inversion |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
*** Matrices are relatively small ($\le 1000\times 1000$)
- Matrices are stored in tiled format fitting a block formulation
of the algorithms $\Longrightarrow$ task-based
linear algebra, interleaved computation of multiple kernels
- Tile sizes will be adjusted by auto-tuning
- Increase parallelism by aggregating multiple independent walkers
in matrices
- Needs fast linear algebra kernels for small matrices (tile size)
- For tiny matrices ($<5\times5$) specialized versions are implemented
** Example: Specialized DGEMM kernel I
*** Simple algorithm :B_block:BMCOL:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.45
:END:
- Simple micro kernel (*GotoDGEMM*\footnote{doi:10.1145/1356052.1356053})
- Code written using ~asm_volatile~ to force good code generation by
compilers
- *Tiling* scheme\footnote{doi:10.1109/ICPP.2015.29}
*** Tiling scheme :B_block:BMCOL:
:PROPERTIES:
:BEAMER_col: 0.45
:BEAMER_env: block
:END:
#+ATTR_LATEX: :width 5cm :height 5cm :keepaspectratio :right
[[./tiling_icpp2015.pdf]]
** Example: Specialized DGEMM kernel II
*** Benchmarks
- Comparison of MKL vs Specialized DGEMM
#+ATTR_LATEX: :height 4cm :keepaspectratio
[[./plot_percentage_vs_mkl_tiled_good.pdf]]
- MKL performance is strongly affected by the number of consecutive executions
- Benchmark conditions favor MKL: many consecutive executions to
amortize setup cost, JIT, Skylake CPU
** Why do we like our DGEMM?
- Open-source code: can be modified easily
- Simple code (280 LOC)
- Decent performance: within 10% of MKL
- Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
- Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
- Allows keeping control over parallelism
- A good starting point for autotuning
** High-Performance strategies
*** Tuning
- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
- Specialized versions of critical hot-spots
- *MIPP* for portable intrinsics / specialized code generation
- Monitoring of the use of the library to choose most efficient versions
- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
** Efficiently guiding the developer
#+ATTR_LATEX: :width \textwidth
[[./maqao1.png]]
** Extensive/automatic testing of different configurations
#+ATTR_LATEX: :width \textwidth
[[./maqao2.png]]
* Summary
** Summary
- QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
- Development of open-source libraries to be used in the
TREX codes and beyond
- Libraries focus on /performance/, /portability/ and /productivity/
- Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
* Bonus slides
#+INCLUDE: "verificarlo.tex" export latex
** Verificarlo CI
#+LATEX: \begin{columns}
@ -518,7 +584,6 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \end{columns}
* Useful links :noexport:
| TREX web site | https://trex-coe.eu |
@ -597,3 +662,4 @@ together: perf and productivity
: /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf


@ -1,4 +1,4 @@
% Created 2021-10-07 Thu 12:17
% Created 2021-10-08 Fri 12:27
% Intended LaTeX compiler: pdflatex
\documentclass[aspectratio=169]{beamer}
\usepackage[utf8]{inputenc}
@ -53,8 +53,8 @@ $^2$University of Versailles, Li-PaRAD (France)}
\maketitle
\section{QMC in TREX}
\label{sec:org527cfcf}
\begin{frame}[label={sec:org3bfadea}]{QMC in TREX}
\label{sec:orge5169ea}
\begin{frame}[label={sec:org16615d0}]{QMC in TREX}
\begin{exampleblock}{QMC: Quantum Monte Carlo methods}
\begin{itemize}
\item Highly accurate methods
@ -75,7 +75,7 @@ How: Instead of re-writing codes, provide libraries (free software)
\end{exampleblock}
\end{frame}
\begin{frame}[label={sec:orge26ef23}]{Quantum Monte Carlo (QMC)}
\begin{frame}[label={sec:orgd8db692}]{Quantum Monte Carlo (QMC)}
\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
@ -101,14 +101,15 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgd65402e}]{Quantum Monte Carlo (QMC)}
\begin{frame}[label={sec:orgcee35fc}]{Quantum Monte Carlo (QMC)}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{itemize}
\item Very low memory requirements (no integrals)
\item Distribute walkers on different cores or compute nodes
\item No blocking communication: near-ideal scaling
\item Difficulty: parallelize within a QMC trajectory
\item Difficult to parallelize within a QMC trajectory: depends on the
number of electrons
\end{itemize}
\end{column}
\begin{column}{0.6\textwidth}
@ -119,15 +120,16 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org3e8242f}]{Both libraries}
\begin{frame}[label={sec:org4bb2da0}]{Both libraries}
\begin{block}{Three objectives}
\begin{enumerate}
\item \alert{Productivity} \\
Used and developed by scientists in different languages
Usable by, and useful to, scientists in different programming languages
\item \alert{Portability} \\
Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
\item \alert{Performance} \\
Must be efficient on all architectures
Must be efficient on all architectures: possible tradeoffs
between portability and performance
\end{enumerate}
\end{block}
@ -140,8 +142,8 @@ Must be efficient on all architectures
\end{frame}
\section{TREXIO: I/O library}
\label{sec:orgf8ad1e7}
\begin{frame}[label={sec:org02f0485}]{TREXIO: I/O library}
\label{sec:orga389b46}
\begin{frame}[label={sec:org61be819}]{TREXIO: I/O library}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{exampleblock}{Before}
@ -163,7 +165,7 @@ Must be efficient on all architectures
\url{https://github.com/trex-coe/trexio}
\end{frame}
\begin{frame}[label={sec:org2341c39}]{TREXIO: I/O library}
\begin{frame}[label={sec:org01dc873}]{TREXIO: I/O library}
\begin{exampleblock}{Front end}
\begin{itemize}
\item Definition of an API to read/write wave functions
@ -192,7 +194,7 @@ Must be efficient on all architectures
\end{columns}
\end{frame}
\begin{frame}[label={sec:org51a55c1}]{Content of the files}
\begin{frame}[label={sec:org6f3aa58}]{Content of the files}
\begin{itemize}
\item File is \alert{self-contained}: no external knowledge needed to compute
\(\Psi(r_1,\dots,r_n)\) (normalization factors, basis et
@ -208,43 +210,44 @@ AO & MO & Two-electron integrals\\
One-electron integrals & Density matrices & ECP\\
\end{tabular}
\end{center}
\item Each group contains multiple \alert{attributes}
\item Each group contains multiple \alert{attributes}: information related to the
group
\end{itemize}
\end{frame}
\section{QMCkl: QMC kernel library}
\label{sec:org53e6105}
\label{sec:org3669f0e}
\begin{frame}[label={sec:org4dc9060}]{QMC kernel library}
\begin{frame}[label={sec:org89970a2}]{QMC kernel library}
\begin{block}{Computational kernels}
\begin{itemize}
\item QMCkl will contain the main kernels of QMC methods (Domain
specific library, end-user driven)
\item QMCkl will contain the main kernels of QMC methods: a domain-specific
library, end-user driven
\item Written together by QMC experts and HPC experts
\item Multiple high performance implementations of the kernels, tuned
for different
\begin{itemize}
\item architectures: portability is critical for users
\item problem sizes (from small to large systems)
\item requested accuracy (reduced precision)
\item problem sizes: from small to large systems
\item requested accuracy: reduced precision
\end{itemize}
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgcf8c268}]{Objectives}
\begin{frame}[label={sec:org27f2ac6}]{Objectives}
\begin{itemize}
\item The code must stay easy for physicists/chemists to understand.
Performance-related aspects should be delegated to the library
\item Scientists should be able to use their preferred language
\item Scientists should not lose control on their codes
\item Scientists should not lose control of their codes
\item Codes should not die when the architecture changes
\item Scientific code development should not kill the performance
\item Reuse of the optimization effort among the community
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org523cd8a}]{Functionality and performance}
\begin{frame}[label={sec:org7fe4d9a}]{Functionality and performance}
\begin{itemize}
\item Keeping high \emph{productivity}, \emph{portability} and \emph{performance} is very
hard in a single piece of software.
@ -255,9 +258,11 @@ We propose (at least) two implementations:
\item \alert{Documentation library} \\
Easy to read, understand, modify for scientists, not necessarily efficient.
\item \alert{High performance libraries} \\
Efficient on a given architecture, but not necessarily
Efficient on a given architecture, but not necessarily
readable by physicists/chemists. \\
Performance within 10\% to maximize portability and simplicity.
\item \alert{Ultra-High performance libraries} \\
Generated with auto-tuning tools for well identified datasets.
\end{enumerate}
\item Both \emph{Documentation} and \emph{High performance} have the same API
@ -270,9 +275,10 @@ implemented in the HPC versions when the API is stabilized.
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org1030a63},fragile]{Library design}
\begin{frame}[label={sec:orgca18759},fragile]{Library design}
\begin{itemize}
\item Creation of a \emph{Context} that keeps a consistent state of the library
\item Creation of a \emph{Context} that keeps a consistent state of the
library (pointers to computed data, configuration parameters, etc.)
\item Memory allocation is abstract:
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@ -283,11 +289,12 @@ context untouched (no allocation, no modification in-place)
\item High-level functions: let the library call multiple kernels in an
optimal way, possibly updating the context
\item Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
between kernels: re-compute only what is necessary
between kernels: re-compute only what is necessary and store
computed data in the context
\end{itemize}
\end{frame}
\begin{frame}[label={sec:orgd8c37c2}]{Dependencies between kernels}
\begin{frame}[label={sec:org1c791dc}]{Dependencies between kernels}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{center}
@ -307,7 +314,7 @@ between kernels: re-compute only what is necessary
\end{columns}
\end{frame}
\begin{frame}[label={sec:org465f70f},fragile]{Use case: low-level}
\begin{frame}[label={sec:org5202b14},fragile]{Use case: low-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
@ -330,7 +337,7 @@ assert (rc == QMCKL_SUCCESS);
\end{minted}
\end{frame}
\begin{frame}[label={sec:orgb80c323},fragile]{Use case: high-level}
\begin{frame}[label={sec:org1ecca91},fragile]{Use case: high-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
// ...
@ -354,82 +361,21 @@ rc = qmckl_get_local_energy(context, &e_loc);
\end{minted}
\end{frame}
\begin{frame}[label={sec:org518f369}]{Development strategy}
\begin{frame}[label={sec:org3f3c8bf}]{Development strategy}
\begin{enumerate}
\item Kernel extraction: QMC specialists agree on the
mathematical expression of the problem
\item A mini-application is written to find the optimal data layout
with HPC experts from real-size examples
\item The kernel is written in the documentation library
\item The documentation library is linked in a QMC code to check correctness
\item The documentation library is linked in a QMC code to check
correctness and numerical accuracy
\item HPC experts provide an HPC version of the kernel
\item The HPC library is linked in the QMC codes of the CoE
\end{enumerate}
\end{frame}
\begin{frame}[label={sec:org7c60b7a}]{Documentation library}
Literate programming with Org-mode:
\begin{itemize}
\item Comments are more important than code
\item Can add graphics, \LaTeX formulas, tables, etc.
\item Documentation always synchronized with the code
\item Some routines can be generated by embedded scripts
\item Kernels are implemented in Fortran for readability
\item The API is C-compatible: QMCkl appears like a C library
\(\Longrightarrow\) can be used in all other languages
\item Example: Prototyping in Julia
\end{itemize}
\end{frame}
\begin{frame}[label={sec:orgf424cd4}]{High-Performance strategies}
\begin{block}{Linear algebra hot spots}
\begin{center}
\begin{tabular}{lll}
GEMM & Rank-1 update & Matrix Inversion\\
GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
\end{tabular}
\end{center}
\end{block}
\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
\begin{itemize}
\item Matrices are stored in tiled format \(\Longrightarrow\) task-based
linear algebra, interleaved computation of multiple kernels
\item Increase parallelism by aggregating multiple independent walkers
in matrices
\item Needs fast linear algebra kernels for small matrices
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgea7372b}]{High-Performance strategies}
\begin{block}{Tuning}
\begin{itemize}
\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
\item Specialized versions of critical hot-spots
\item MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation
\item Monitoring of the use of the library to choose most efficient versions
\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgba656d9}]{Example: Specialized DGEMM kernel}
VIJAY
\end{frame}
\begin{frame}[label={sec:orgd3ca712}]{Efficiently guiding the developer}
\begin{center}
\includegraphics[width=\textwidth]{./maqao1.png}
\end{center}
\end{frame}
\begin{frame}[label={sec:orgcc14268}]{Extensive/automatic testing of different configurations}
\begin{center}
\includegraphics[width=\textwidth]{./maqao2.png}
\end{center}
\end{frame}
\begin{frame}[label={sec:org7ee3c30}]{First application: 3-body Jastrow factor}
\begin{frame}[label={sec:orgb6a9085}]{First application: 3-body Jastrow factor}
\newcommand{\Jeen}{J_{\text{een}}}
\newcommand{\Nel}{N_{\text{elec}}}
\newcommand{\Nat}{N_{\text{nucl}}}
@ -460,14 +406,133 @@ VIJAY
\begin{itemize}
\item Gradient and Laplacian are also required
\item Up to \(20\times\) faster than in the original code
\item \(\sim 80\%\) of the AVX-512 peak is reached
\item \(\sim 80\%\) of the AVX-512 peak is reached using standard MKL on
Intel Skylake
\item Expressed with a DGEMM kernel \(\Longrightarrow\) also efficient on GPU
\end{itemize}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgd6d3e26}]{High-Performance strategies}
\begin{block}{Linear algebra hot spots}
\begin{center}
\begin{tabular}{lll}
GEMM & Rank-1 update & Matrix Inversion\\
GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
\end{tabular}
\end{center}
\end{block}
\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
\begin{itemize}
\item Matrices are stored in tiled format fitting a block formulation
of the algorithms \(\Longrightarrow\) task-based
linear algebra, interleaved computation of multiple kernels
\item Tile sizes will be adjusted by auto-tuning
\item Increase parallelism by aggregating multiple independent walkers
in matrices
\item Needs fast linear algebra kernels for small matrices (tile size)
\item For tiny matrices (\(<5\times5\)) specialized versions are implemented
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgeb97339},fragile]{Example: Specialized DGEMM kernel I}
\begin{columns}
\begin{column}{0.45\columnwidth}
\begin{block}{Simple algorithm}
\begin{itemize}
\item Simple micro kernel (\alert{GotoDGEMM}\footnote{doi:10.1145/1356052.1356053})
\item Code written using \texttt{asm\_volatile} to force good code generation by
compilers
\item \alert{Tiling} scheme\footnote{doi:10.1109/ICPP.2015.29}
\end{itemize}
\end{block}
\end{column}
\begin{column}{0.45\columnwidth}
\begin{block}{Tiling scheme}
\begin{center}
\includegraphics[width=5cm,height=5cm]{./tiling_icpp2015.pdf}
\end{center}
\end{block}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org76e8117}]{Example: Specialized DGEMM kernel II}
\begin{block}{Benchmarks}
\begin{itemize}
\item Comparison of MKL vs Specialized DGEMM
\begin{center}
\includegraphics[height=4cm]{./plot_percentage_vs_mkl_tiled_good.pdf}
\end{center}
\item MKL performance is strongly affected by the number of consecutive executions
\item Benchmark conditions favor MKL: many consecutive executions to
amortize setup cost, JIT, Skylake CPU
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgc7d8abc}]{Why do we like our DGEMM?}
\begin{itemize}
\item Open-source code: can be modified easily
\item Simple code (280 LOC)
\item Decent performance: within 10\% of MKL
\item Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
\item Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
\item Allows keeping control over parallelism
\item A good starting point for autotuning
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org18a9bee}]{High-Performance strategies}
\begin{block}{Tuning}
\begin{itemize}
\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
\item Specialized versions of critical hot-spots
\item \alert{MIPP} for portable intrinsics / specialized code generation
\item Monitoring of the use of the library to choose most efficient versions
\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:org4489490}]{Efficiently guiding the developer}
\begin{center}
\includegraphics[width=\textwidth]{./maqao1.png}
\end{center}
\end{frame}
\begin{frame}[label={sec:orgddd3631}]{Extensive/automatic testing of different configurations}
\begin{center}
\includegraphics[width=\textwidth]{./maqao2.png}
\end{center}
\end{frame}
\section{Summary}
\label{sec:org30e04a5}
\begin{frame}[label={sec:org705d3cf}]{Summary}
\begin{itemize}
\item QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
\item Development of open-source libraries to be used in the
TREX codes and beyond
\item Libraries focus on \emph{performance}, \emph{portability} and \emph{productivity}
\item Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
\end{itemize}
\end{frame}
\section{Bonus slides}
\label{sec:orgb118e4f}
\begin{frame}[fragile]{Numerical analysis with Verificarlo}
@ -566,8 +631,10 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
\draw[arrow]
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}
\begin{frame}[label={sec:org8493521}]{Verificarlo CI}
\begin{frame}[label={sec:org560588a}]{Verificarlo CI}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{exampleblock}{Compare runs}

BIN
tiling_icpp2015.pdf Normal file

Binary file not shown.


@ -97,4 +97,4 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}