Zoom modifications, Friday morning

This commit is contained in:
Anthony Scemama 2021-10-08 12:28:21 +02:00
commit 59137eb55b
5 changed files with 301 additions and 168 deletions

Binary file not shown.


@@ -88,7 +88,8 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
- Very low memory requirements (no integrals)
- Distribute walkers on different cores or compute nodes
- No blocking communication: near-ideal scaling
- Difficulty to parallelize within a QMC trajectory: depends on the
  number of electrons
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.6\textwidth}
#+ATTR_LATEX: :width \textwidth
@@ -99,11 +100,12 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
** Both libraries
*** Three objectives
1. *Productivity* \\
   Usable and useful by scientists in different programming languages
2. *Portability* \\
   Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
3. *Performance* \\
   Must be efficient on all architectures: possible tradeoffs
   between portability and performance
*** Free (libre) software
- Requirement for open science
@@ -208,7 +210,8 @@ digraph G {
| Nucleus | Basis | CI coefficients |
| AO | MO | Two-electron integrals |
| One-electron integrals | Density matrices | ECP |
- Each group contains multiple *attributes*: information related to the
  group
** Source code :noexport:
@@ -241,23 +244,23 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
* QMCkl: QMC kernel library
** QMC kernel library
*** Computational kernels
- QMCkl will contain the main kernels of QMC methods: Domain
  specific library, end-user driven
- Written together by QMC experts and HPC experts
- Multiple high performance implementations of the kernels, tuned
  for different
  - architectures: portability is critical for users
  - problem sizes: from small to large systems
  - requested accuracy: reduced precision
** Objectives
- The code must stay easy to understand by the physicists/chemists.
  Performance-related aspects should be delegated to the library
- Scientists should be able to use their preferred language
- Scientists should not lose control of their codes
- Codes should not die when the architecture changes
- Scientific code development should not kill the performance
- Reuse of the optimization effort among the community
@@ -273,8 +276,10 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
   Easy to read, understand, modify for scientists, not necessarily efficient.
2. *High performance libraries* \\
   Efficient on a given architecture, but not necessarily
   readable by physicists/chemists. \\
   Performance within 10% to maximize portability and simplicity.
3. *Ultra-High performance libraries* \\
   Generated with auto-tuning tools for well identified datasets.
- Both /Documentation/ and /High performance/ have the same API
  (similar to BLAS on netlib /vs/ MKL).
@@ -283,10 +288,22 @@ trexio_exit_code trexio_[has/read/write]_<group>_<attribute>
  implemented in the HPC versions when the API is stabilized.
- Performance: enable a data-driven task-based parallelism
** Documentation library :noexport:
Literate programming with Org-mode:
- Comments are more important than code
- Can add graphics, \LaTeX formulas, tables, etc
- Documentation always synchronized with the code
- Some routines can be generated by embedded scripts
- Kernels are implemented in Fortran for readability
- The API is C-compatible: QMCkl appears like a C library
$\Longrightarrow$ can be used in all other languages
- Example: Prototyping in Julia
** Library design
- Creation of a /Context/ that keeps a consistent state of the
  library (pointers to computed data, configuration parameters, etc.)
- Memory allocation is abstract:
#+begin_src c
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -297,7 +314,8 @@ void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
- High-level functions: let the library call multiple kernels in an
  optimal way, possibly updating the context
- Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
  between kernels: re-compute only what is necessary and store
  computed data in the context
** Dependencies between kernels
@@ -407,58 +425,11 @@ rc = qmckl_get_local_energy(context, &e_loc);
2. A mini-application is written to find the optimal data layout
   with HPC experts from real-size examples
3. The kernel is written in the documentation library
4. The documentation library is linked in a QMC code to check
   correctness and numerical accuracy
5. HPC experts provide an HPC version of the kernel
6. The HPC library is linked in the QMC codes of the CoE
** First application: 3-body Jastrow factor
#+LATEX: \newcommand{\Jeen}{J_{\text{een}}}
@@ -489,14 +460,109 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \begin{column}{0.5\textwidth}
- Gradient and Laplacian are also required
- Up to $20\times$ faster than in the original code
- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on
  Intel Skylake
- Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
#+LATEX: \end{column}
#+LATEX: \end{columns}
** High-Performance strategies
*** Linear algebra hot spots
| GEMM | Rank-1 update | Matrix Inversion |
| GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
*** Matrices are relatively small ($\le 1000\times 1000$)
- Matrices are stored in tiled format fitting a block formulation
of the algorithms $\Longrightarrow$ task-based
linear algebra, interleaved computation of multiple kernels
- Tile sizes will be adjusted by auto-tuning
- Increase parallelism by aggregating multiple independent walkers
in matrices
- Needs fast linear algebra kernels for small matrices (tile size)
- For tiny matrices ($<5\times5$) specialized versions are implemented
** Example: Specialized DGEMM kernel I
*** Simple algorithm :B_block:BMCOL:
:PROPERTIES:
:BEAMER_env: block
:BEAMER_col: 0.45
:END:
- Simple micro kernel (*GotoDGEMM*\footnote{doi:10.1145/1356052.1356053})
- Code written using ~asm_volatile~ to force good code generation by
compilers
- *Tiling* scheme\footnote{doi:10.1109/ICPP.2015.29}
*** Tiling scheme :B_block:BMCOL:
:PROPERTIES:
:BEAMER_col: 0.45
:BEAMER_env: block
:END:
#+ATTR_LATEX: :width 5cm :height 5cm :keepaspectratio :right
[[./tiling_icpp2015.pdf]]
** Example: Specialized DGEMM kernel II
*** Benchmarks
- Comparison of MKL vs Specialized DGEMM
#+ATTR_LATEX: :height 4cm :keepaspectratio
[[./plot_percentage_vs_mkl_tiled_good.pdf]]
- Strong impact on MKL performance due to the number of consecutive executions
- Comparison favorable to MKL: many consecutive executions to
  amortize setup cost, JIT, Skylake CPU
** Why do we like our DGEMM?
- Open source code: can be modified easily
- Simple code (280 LOC)
- Decent performance: within 10% of MKL
- Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
- Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
- Allows keeping control of parallelism
- A good starting point for autotuning
** High-Performance strategies
*** Tuning
- Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
- Specialized versions of critical hot-spots
- *MIPP* for portable intrinsics / specialized code generation
- Monitoring of the use of the library to choose most efficient versions
- Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
** Efficiently guiding the developer
#+ATTR_LATEX: :width \textwidth
[[./maqao1.png]]
** Extensive/automatic testing of different configurations
#+ATTR_LATEX: :width \textwidth
[[./maqao2.png]]
* Summary
** Summary
- QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
- Development of open-source libraries to be used in the
TREX codes and beyond
- Libraries focus on /performance/, /portability/ and /productivity/
- Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
* Bonus slides
#+INCLUDE: "verificarlo.tex" export latex
** Verificarlo CI
#+LATEX: \begin{columns}
@@ -518,7 +584,6 @@ rc = qmckl_get_local_energy(context, &e_loc);
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \end{columns}
* Useful links :noexport:
| TREX web site | https://trex-coe.eu |
@@ -597,3 +662,4 @@ together: perf and productivity
: /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf : /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf


@@ -1,4 +1,4 @@
% Created 2021-10-08 Fri 12:27
% Intended LaTeX compiler: pdflatex
\documentclass[aspectratio=169]{beamer}
\usepackage[utf8]{inputenc}
@@ -53,8 +53,8 @@ $^2$University of Versailles, Li-PaRAD (France)}
\maketitle
\section{QMC in TREX}
\label{sec:orge5169ea}
\begin{frame}[label={sec:org16615d0}]{QMC in TREX}
\begin{exampleblock}{QMC: Quantum Monte Carlo methods}
\begin{itemize}
\item Highly accurate methods
@@ -75,7 +75,7 @@ How: Instead of re-writing codes, provide libraries (free software)
\end{exampleblock}
\end{frame}
\begin{frame}[label={sec:orgd8db692}]{Quantum Monte Carlo (QMC)}
\alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons
\begin{eqnarray}
E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
@@ -101,14 +101,15 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgcee35fc}]{Quantum Monte Carlo (QMC)}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{itemize}
\item Very low memory requirements (no integrals)
\item Distribute walkers on different cores or compute nodes
\item No blocking communication: near-ideal scaling
\item Difficulty to parallelize within a QMC trajectory: depends on the
number of electrons
\end{itemize}
\end{column}
\begin{column}{0.6\textwidth}
@@ -119,15 +120,16 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org4bb2da0}]{Both libraries}
\begin{block}{Three objectives}
\begin{enumerate}
\item \alert{Productivity} \\
Usable and useful by scientists in different programming languages
\item \alert{Portability} \\
Target: all HPC systems (CPU, GPU, ARM, x86, etc.)
\item \alert{Performance} \\
Must be efficient on all architectures: possible tradeoffs
between portability and performance
\end{enumerate}
\end{block}
@@ -140,8 +142,8 @@ Must be efficient on all architectures
\end{frame}
\section{TREXIO: I/O library}
\label{sec:orga389b46}
\begin{frame}[label={sec:org61be819}]{TREXIO: I/O library}
\begin{columns}
\begin{column}{0.4\textwidth}
\begin{exampleblock}{Before}
@@ -163,7 +165,7 @@ Must be efficient on all architectures
\url{https://github.com/trex-coe/trexio}
\end{frame}
\begin{frame}[label={sec:org01dc873}]{TREXIO: I/O library}
\begin{exampleblock}{Front end}
\begin{itemize}
\item Definition of an API to read/write wave functions
@@ -192,7 +194,7 @@ Must be efficient on all architectures
\end{columns}
\end{frame}
\begin{frame}[label={sec:org6f3aa58}]{Content of the files}
\begin{itemize}
\item File is \alert{self-contained}: no external knowledge needed to compute
\(\Psi(r_1,\dots,r_n)\) (normalization factors, basis et
@@ -208,43 +210,44 @@ AO & MO & Two-electron integrals\\
One-electron integrals & Density matrices & ECP\\
\end{tabular}
\end{center}
\item Each group contains multiple \alert{attributes}: information related to the
group
\end{itemize}
\end{frame}
\section{QMCkl: QMC kernel library}
\label{sec:org3669f0e}
\begin{frame}[label={sec:org89970a2}]{QMC kernel library}
\begin{block}{Computational kernels}
\begin{itemize}
\item QMCkl will contain the main kernels of QMC methods: Domain
specific library, end-user driven
\item Written together by QMC experts and HPC experts
\item Multiple high performance implementations of the kernels, tuned
for different
\begin{itemize}
\item architectures: portability is critical for users
\item problem sizes: from small to large systems
\item requested accuracy: reduced precision
\end{itemize}
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:org27f2ac6}]{Objectives}
\begin{itemize}
\item The code must stay easy to understand by the physicists/chemists.
Performance-related aspects should be delegated to the library
\item Scientists should be able to use their preferred language
\item Scientists should not lose control of their codes
\item Codes should not die when the architecture changes
\item Scientific code development should not kill the performance
\item Reuse of the optimization effort among the community
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org7fe4d9a}]{Functionality and performance}
\begin{itemize}
\item Keeping high \emph{productivity}, \emph{portability} and \emph{performance} is very
hard in a single piece of software.
@@ -255,9 +258,11 @@ We propose (at least) two implementations:
\item \alert{Documentation library} \\
Easy to read, understand, modify for scientists, not necessarily efficient.
\item \alert{High performance libraries} \\
Efficient on a given architecture, but not necessarily
readable by physicists/chemists. \\
Performance within 10\% to maximize portability and simplicity.
\item \alert{Ultra-High performance libraries} \\
Generated with auto-tuning tools for well identified datasets.
\end{enumerate}
\item Both \emph{Documentation} and \emph{High performance} have the same API
@@ -270,9 +275,10 @@ implemented in the HPC versions when the API is stabilized.
\end{itemize}
\end{frame}
\begin{frame}[label={sec:orgca18759},fragile]{Library design}
\begin{itemize}
\item Creation of a \emph{Context} that keeps a consistent state of the
library (pointers to computed data, configuration parameters, etc.)
\item Memory allocation is abstract:
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info);
@@ -283,11 +289,12 @@ context untouched (no allocation, no modification in-place)
\item High-level functions: let the library call multiple kernels in an
optimal way, possibly updating the context
\item Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies
between kernels: re-compute only what is necessary and store
computed data in the context
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org1c791dc}]{Dependencies between kernels}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{center}
@@ -307,7 +314,7 @@ between kernels: re-compute only what is necessary
\end{columns}
\end{frame}
\begin{frame}[label={sec:org5202b14},fragile]{Use case: low-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
@@ -330,7 +337,7 @@ assert (rc == QMCKL_SUCCESS);
\end{minted}
\end{frame}
\begin{frame}[label={sec:org1ecca91},fragile]{Use case: high-level}
\begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c}
#include <qmckl.h>
// ...
@@ -354,82 +361,21 @@ rc = qmckl_get_local_energy(context, &e_loc);
\end{minted}
\end{frame}
\begin{frame}[label={sec:org3f3c8bf}]{Development strategy}
\begin{enumerate}
\item Kernel extraction: QMC specialists agree on the
mathematical expression of the problem
\item A mini-application is written to find the optimal data layout
with HPC experts from real-size examples
\item The kernel is written in the documentation library
\item The documentation library is linked in a QMC code to check
correctness and numerical accuracy
\item HPC experts provide an HPC version of the kernel
\item The HPC library is linked in the QMC codes of the CoE
\end{enumerate}
\end{frame}
\begin{frame}[label={sec:orgb6a9085}]{First application: 3-body Jastrow factor}
\newcommand{\Jeen}{J_{\text{een}}}
\newcommand{\Nel}{N_{\text{elec}}}
\newcommand{\Nat}{N_{\text{nucl}}}
@@ -460,14 +406,133 @@ VIJAY
\begin{itemize}
\item Gradient and Laplacian are also required
\item Up to \(20\times\) faster than in the original code
\item \(\sim 80\%\) of the AVX-512 peak is reached using standard MKL on
Intel Skylake
\item Expressed with a DGEMM kernel \(\Longrightarrow\) also efficient on GPU
\end{itemize}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:orgd6d3e26}]{High-Performance strategies}
\begin{block}{Linear algebra hot spots}
\begin{center}
\begin{tabular}{lll}
GEMM & Rank-1 update & Matrix Inversion\\
GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
\end{tabular}
\end{center}
\end{block}
\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
\begin{itemize}
\item Matrices are stored in tiled format fitting a block formulation
of the algorithms \(\Longrightarrow\) task-based
linear algebra, interleaved computation of multiple kernels
\item Tile sizes will be adjusted by auto-tuning
\item Increase parallelism by aggregating multiple independent walkers
in matrices
\item Needs fast linear algebra kernels for small matrices (tile size)
\item For tiny matrices (\(<5\times5\)) specialized versions are implemented
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgeb97339},fragile]{Example: Specialized DGEMM kernel I}
\begin{columns}
\begin{column}{0.45\columnwidth}
\begin{block}{Simple algorithm}
\begin{itemize}
\item Simple micro kernel (\alert{GotoDGEMM}\footnote{doi:10.1145/1356052.1356053})
\item Code written using \texttt{asm\_volatile} to force good code generation by
compilers
\item \alert{Tiling} scheme\footnote{doi:10.1109/ICPP.2015.29}
\end{itemize}
\end{block}
\end{column}
\begin{column}{0.45\columnwidth}
\begin{block}{Tiling scheme}
\begin{center}
\includegraphics[width=5cm,height=5cm]{./tiling_icpp2015.pdf}
\end{center}
\end{block}
\end{column}
\end{columns}
\end{frame}
\begin{frame}[label={sec:org76e8117}]{Example: Specialized DGEMM kernel II}
\begin{block}{Benchmarks}
\begin{itemize}
\item Comparison of MKL vs Specialized DGEMM
\begin{center}
\includegraphics[height=4cm]{./plot_percentage_vs_mkl_tiled_good.pdf}
\end{center}
\item Strong impact on MKL performance due to the number of consecutive executions
\item Comparison favorable to MKL: many consecutive executions to
amortize setup cost, JIT, Skylake CPU
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:orgc7d8abc}]{Why do we like our DGEMM?}
\begin{itemize}
\item Open source code: can be modified easily
\item Simple code (280 LOC)
\item Decent performance: within 10\% of MKL
\item Can be rewritten in different languages to increase
portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
\item Can be coupled with simple pack/unpack routines to handle different
data storage (tiled matrices)
\item Allows keeping control of parallelism
\item A good starting point for autotuning
\end{itemize}
\end{frame}
\begin{frame}[label={sec:org18a9bee}]{High-Performance strategies}
\begin{block}{Tuning}
\begin{itemize}
\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
\item Specialized versions of critical hot-spots
\item \alert{MIPP} for portable intrinsics / specialized code generation
\item Monitoring of the use of the library to choose most efficient versions
\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
\end{itemize}
\end{block}
\end{frame}
\begin{frame}[label={sec:org4489490}]{Efficiently guiding the developer}
\begin{center}
\includegraphics[width=\textwidth]{./maqao1.png}
\end{center}
\end{frame}
\begin{frame}[label={sec:orgddd3631}]{Extensive/automatic testing of different configurations}
\begin{center}
\includegraphics[width=\textwidth]{./maqao2.png}
\end{center}
\end{frame}
\section{Summary}
\label{sec:org30e04a5}
\begin{frame}[label={sec:org705d3cf}]{Summary}
\begin{itemize}
\item QMC codes integrated in an ecosystem of multiple codes for
high-accuracy quantum chemistry
\item Development of open-source libraries to be used in the
TREX codes and beyond
\item Libraries focus on \emph{performance}, \emph{portability} and \emph{productivity}
\item Strategies to make the collaboration between physicists/chemists
and HPC experts optimal
\end{itemize}
\end{frame}
\section{Bonus slides}
\label{sec:orgb118e4f}
\begin{frame}[fragile]{Numerical analysis with Verificarlo}
@@ -566,8 +631,10 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
\draw[arrow]
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}
\begin{frame}[label={sec:org8493521}]{Verificarlo CI}
\begin{columns}
\begin{column}{0.5\textwidth}
\begin{exampleblock}{Compare runs}

BIN
tiling_icpp2015.pdf Normal file

Binary file not shown.


@@ -97,4 +97,4 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7)
(targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target});
\end{tikzpicture}
\end{frame}