diff --git a/plot_percentage_vs_mkl_tiled_good.pdf b/plot_percentage_vs_mkl_tiled_good.pdf new file mode 100644 index 0000000..fa28347 Binary files /dev/null and b/plot_percentage_vs_mkl_tiled_good.pdf differ diff --git a/scemama.org b/scemama.org index 8197b8a..d283a86 100644 --- a/scemama.org +++ b/scemama.org @@ -88,7 +88,8 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)} - Very low memory requirements (no integrals) - Distribute walkers on different cores or compute nodes - No blocking communication: near-ideal scaling - - Difficulty: parallelize within a QMC trajectory + - Difficulty to parallelize within a QMC trajectory: depends on the + number of electrons #+LATEX: \end{column} #+LATEX: \begin{column}{0.6\textwidth} #+ATTR_LATEX: :width \textwidth @@ -99,11 +100,12 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)} ** Both libraries *** Three objectives 1. *Productivity* \\ - Used and developed by scientists in different languages + Usable and useful by scientists in different programming languages 2. *Portability* \\ Target: all HPC systems (CPU, GPU, ARM, x86, etc.) 3. 
*Performance* \\ - Must be efficient on all architectures + Must be efficient on all architectures: possible tradeoffs + between portability and performance *** Free (libre) software - Requirement for open science @@ -208,7 +210,8 @@ digraph G { | Nucleus | Basis | CI coefficients | | AO | MO | Two-electron integrals | | One-electron integrals | Density matrices | ECP | - - Each group contains multiple *attributes* + - Each group contains multiple *attributes*: information related to the + group ** Source code :noexport: @@ -241,23 +244,23 @@ trexio_exit_code trexio_[has/read/write]__ * QMCkl: QMC kernel library ** QMC kernel library - + *** Computational kernels - - QMCkl will contain the main kernels of QMC methods (Domain - specific library, end-user driven) + - QMCkl will contain the main kernels of QMC methods: Domain + specific library, end-user driven - Written together by QMC experts and HPC experts - Multiple high performance implementations of the kernels, tuned for different - architectures: portability is critical for users - - problem sizes (from small to large systems) - - requested accuracy (reduced precision) + - problem sizes: from small to large systems + - requested accuracy: reduced precision ** Objectives - The code must stay easy to understand by the physicists/chemists. Performance-related aspects should be delegated to the library - Scientists should be able to use their preferred language - - Scientists should not lose control on their codes + - Scientists should not lose control of their codes - Codes should not die when the architecture changes - Scientific code development should not kill the performance - Reuse of the optimization effort among the community @@ -273,8 +276,10 @@ trexio_exit_code trexio_[has/read/write]__ Easy to read, understand, modify for scientists, not necessarily efficient. 2. *High performance libraries* \\ Efficient on a given architecture, but not necessarily - readable by physicists/chemists. 
\\ - Performance within 10% to maximize portability and simplicity. + readable by physicists/chemists. \\ + Performance within 10% to maximize portability and simplicity. + 3. *Ultra-High performance libraries* \\ + Generated with auto-tuning tools for well identified datasets. - Both /Documentation/ and /High performance/ have the same API (similar to BLAS on netlib /vs/ MKL). @@ -283,10 +288,22 @@ trexio_exit_code trexio_[has/read/write]__ implemented in the HPC versions when the API is stabilized. - Performance: enable a data-driven task-based parallelism + +** Documentation library :noexport: + Literate programming with Org-mode: + - Comments are more important than code + - Can add graphics, \LaTeX formulas, tables, etc + - Documentation always synchronized with the code + - Some routines can be generated by embedded scripts + - Kernels are implemented in Fortran for readability + - The API is C-compatible: QMCkl appears like a C library + $\Longrightarrow$ can be used in all other languages + - Example: Prototyping in Julia ** Library design - - Creation of a /Context/ that keeps a consistent state of the library + - Creation of a /Context/ that keeps a consistent state of the + library (pointers to computed data, configuration parameters, etc.) - Memory allocation is abstract: #+begin_src c void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info); @@ -297,7 +314,8 @@ void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info); - High-level functions: let the library call multiple kernels in an optimal way, possibly updating the context - Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies - between kernels: re-compute only what is necessary + between kernels: re-compute only what is necessary and store + computed data in the context ** Dependencies between kernels @@ -407,58 +425,11 @@ rc = qmckl_get_local_energy(context, &e_loc); 2. 
A mini-application is written to find the optimal data layout with HPC experts from real-size examples 3. The kernel is written in the documentation library - 4. The documentation library is linked in a QMC code to check correctness + 4. The documentation library is linked in a QMC code to check + correctness and numerical accuracy 5. HPC experts provide an HPC version of the kernel 6. The HPC library is linked in the QMC codes of the CoE -** Documentation library - Literate programming with Org-mode: - - Comments are more important than code - - Can add graphics, \LaTeX formulas, tables, etc - - Documentation always synchronized with the code - - Some routines can be generated by embedded scripts - - Kernels are implemented in Fortran for readability - - The API is C-compatible: QMCkl appears like a C library - $\Longrightarrow$ can be used in all other languages - - Example: Prototyping in Julia - -** High-Performance strategies - -*** Linear algebra hot spots - - | GEMM | Rank-1 update | Matrix Inversion | - | GEMV | Diagonal of GEMM | Shermann-Morrison-Woodburry | - -*** Matrices are relatively small ($\le 1000\times 1000$) - - - Matrices are stored in tiled format $\Longrightarrow$ task-based - linear algebra interleaved computation of multiple kernels - - Increase parallelism by agregating multiple independent walkers - in matrices - - Needs fast linear algebra kernels for small matrices - -** High-Performance strategies - -*** Tuning - - Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}. 
-  - Specialized versions of critical hot-spots
-  - MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation
-  - Monitoring of the use of the library to choose most efficient versions
-  - Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo})
-
-** Example: Specialized DGEMM kernel
-
-  VIJAY
-
-** Efficiently guiding the developer
-
-  #+ATTR_LATEX: :width \textwidth
-  [[./maqao1.png]]
-** Extensive/automatic testing of different configurations
-
-  #+ATTR_LATEX: :width \textwidth
-  [[./maqao2.png]]
-
 ** First application : 3-body Jastrow factor

 #+LATEX: \newcommand{\Jeen}{J_{\text{een}}}
@@ -489,14 +460,109 @@ rc = qmckl_get_local_energy(context, &e_loc);
 #+LATEX: \begin{column}{0.5\textwidth}
 - Gradient and Laplacian are also required
 - Up to $20\times$ faster than in the original code
-- $\sim 80\%$ of the AVX-512 peak is reached
+- $\sim 80\%$ of the AVX-512 peak is reached using standard MKL on
+  Intel Skylake
 - Expressed with a DGEMM kernel $\Longrightarrow$ also efficient on GPU
 #+LATEX: \end{column}
 #+LATEX: \end{columns}
-
- #+LATEX: \end{frame}
- #+INCLUDE: "verificarlo.tex" export latex
+** High-Performance strategies
+
+*** Linear algebra hot spots
+
+  | GEMM | Rank-1 update | Matrix Inversion |
+  | GEMV | Diagonal of GEMM | Sherman-Morrison-Woodbury |
+
+*** Matrices are relatively small ($\le 1000\times 1000$)
+
+  - Matrices are stored in tiled format fitting a block formulation
+    of the algorithms $\Longrightarrow$ task-based
+    linear algebra, interleaved computation of multiple kernels
+  - Tile sizes will be adjusted by auto-tuning
+  - Increase parallelism by aggregating multiple independent walkers
+    in matrices
+  - Needs fast linear algebra kernels for small matrices (tile size)
+  - For tiny matrices ($<5\times5$) specialized versions are implemented
+
+** Example: Specialized DGEMM kernel I
+
+*** Simple algorithm :B_block:BMCOL:
+:PROPERTIES:
+:BEAMER_env: block
+:BEAMER_col: 0.45
+:END:
+- Simple micro kernel (*GotoDGEMM*\footnote{doi:10.1145/1356052.1356053})
+- Code written using ~asm_volatile~ to force good code generation by
+  compilers
+- *Tiling* scheme\footnote{doi:10.1109/ICPP.2015.29}
+
+*** Tiling scheme :B_block:BMCOL:
+:PROPERTIES:
+:BEAMER_col: 0.45
+:BEAMER_env: block
+:END:
+  #+ATTR_LATEX: :width 5cm :height 5cm :keepaspectratio :right
+  [[./tiling_icpp2015.pdf]]
+
+** Example: Specialized DGEMM kernel II
+
+*** Benchmarks
+
+- Comparison of MKL vs Specialized DGEMM
+
+  #+ATTR_LATEX: :height 4cm :keepaspectratio
+  [[./plot_percentage_vs_mkl_tiled_good.pdf]]
+
+- Strong impact on MKL performance due to the number of consecutive executions
+- Comparison favorable to MKL: many consecutive executions to
+  amortize setup cost, JIT, Skylake CPU
+
+** Why do we like our DGEMM?
+
+  - Open-source code: can be modified easily
+  - Simple code (280 LOC)
+  - Decent performance: within 10% of MKL
+  - Can be rewritten in different languages to increase
+    portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
+  - Can be coupled with simple pack/unpack routines to handle different
+    data storage (tiled matrices)
+  - Allows us to keep control of parallelism
+  - A good starting point for autotuning
+
+** High-Performance strategies
+
+*** Tuning
+  - Optimization is guided by analysis with *MAQAO*\footnote{https://maqao.org}.
+ - Specialized versions of critical hot-spots + - *MIPP* for portable intrinsics / specialized code generation + - Monitoring of the use of the library to choose most efficient versions + - Optimizations guided by monitoring numerical accuracy (*Verificarlo*\footnote{https://github.com/verificarlo/verificarlo}) + +** Efficiently guiding the developer + + #+ATTR_LATEX: :width \textwidth + [[./maqao1.png]] +** Extensive/automatic testing of different configurations + + #+ATTR_LATEX: :width \textwidth + [[./maqao2.png]] + +* Summary + +** Summary + - QMC codes integrated in an ecosystem of multiple codes for + high-accuracy quantum chemistry + - Development of open-source libraries to be used in the + TREX codes and beyond + - Libraries focus on /performance/, /portability/ and /productivity/ + - Strategies to make the collaboration between physicists/chemists + and HPC experts optimal + + +* Bonus slides + + #+INCLUDE: "verificarlo.tex" export latex + ** Verificarlo CI #+LATEX: \begin{columns} @@ -518,7 +584,6 @@ rc = qmckl_get_local_energy(context, &e_loc); #+LATEX: \end{exampleblock} #+LATEX: \end{column} #+LATEX: \end{columns} - * Useful links :noexport: | TREX web site | https://trex-coe.eu | @@ -597,3 +662,4 @@ together: perf et productivity : /home/scemama/MEGA/TEX/Presentations/2021/Intel/scemama.pdf + diff --git a/scemama.tex b/scemama.tex index 2b6c108..adf6a99 100644 --- a/scemama.tex +++ b/scemama.tex @@ -1,4 +1,4 @@ -% Created 2021-10-07 Thu 12:17 +% Created 2021-10-08 Fri 12:27 % Intended LaTeX compiler: pdflatex \documentclass[aspectratio=169]{beamer} \usepackage[utf8]{inputenc} @@ -53,8 +53,8 @@ $^2$University of Versailles, Li-PaRAD (France)} \maketitle \section{QMC in TREX} -\label{sec:org527cfcf} -\begin{frame}[label={sec:org3bfadea}]{QMC in TREX} +\label{sec:orge5169ea} +\begin{frame}[label={sec:org16615d0}]{QMC in TREX} \begin{exampleblock}{QMC: Quantum Monte Carlo methods} \begin{itemize} \item Highly accurate methods @@ -75,7 +75,7 @@ How: 
Instead of re-writing codes, provide libraries (free software) \end{exampleblock} \end{frame} -\begin{frame}[label={sec:orge26ef23}]{Quantum Monte Carlo (QMC)} +\begin{frame}[label={sec:orgd8db692}]{Quantum Monte Carlo (QMC)} \alert{Problem}: Stochastic resolution of the Schr\"odinger equation for $N$ electrons \begin{eqnarray} E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)} @@ -101,14 +101,15 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)} \end{columns} \end{frame} -\begin{frame}[label={sec:orgd65402e}]{Quantum Monte Carlo (QMC)} +\begin{frame}[label={sec:orgcee35fc}]{Quantum Monte Carlo (QMC)} \begin{columns} \begin{column}{0.4\textwidth} \begin{itemize} \item Very low memory requirements (no integrals) \item Distribute walkers on different cores or compute nodes \item No blocking communication: near-ideal scaling -\item Difficulty: parallelize within a QMC trajectory +\item Difficulty to parallelize within a QMC trajectory: depends on the +number of electrons \end{itemize} \end{column} \begin{column}{0.6\textwidth} @@ -119,15 +120,16 @@ E &= &\frac{\int \dcoord \Phi(\coord) {\cal H} \Phi(\coord)} \end{columns} \end{frame} -\begin{frame}[label={sec:org3e8242f}]{Both libraries} +\begin{frame}[label={sec:org4bb2da0}]{Both libraries} \begin{block}{Three objectives} \begin{enumerate} \item \alert{Productivity} \\ -Used and developed by scientists in different languages +Usable and useful by scientists in different programming languages \item \alert{Portability} \\ Target: all HPC systems (CPU, GPU, ARM, x86, etc.) 
\item \alert{Performance} \\ -Must be efficient on all architectures +Must be efficient on all architectures: possible tradeoffs +between portability and performance \end{enumerate} \end{block} @@ -140,8 +142,8 @@ Must be efficient on all architectures \end{frame} \section{TREXIO: I/O library} -\label{sec:orgf8ad1e7} -\begin{frame}[label={sec:org02f0485}]{TREXIO: I/O library} +\label{sec:orga389b46} +\begin{frame}[label={sec:org61be819}]{TREXIO: I/O library} \begin{columns} \begin{column}{0.4\textwidth} \begin{exampleblock}{Before} @@ -163,7 +165,7 @@ Must be efficient on all architectures \url{https://github.com/trex-coe/trexio} \end{frame} -\begin{frame}[label={sec:org2341c39}]{TREXIO: I/O library} +\begin{frame}[label={sec:org01dc873}]{TREXIO: I/O library} \begin{exampleblock}{Front end} \begin{itemize} \item Definition of an API for to read/write wave functions @@ -192,7 +194,7 @@ Must be efficient on all architectures \end{columns} \end{frame} -\begin{frame}[label={sec:org51a55c1}]{Content of the files} +\begin{frame}[label={sec:org6f3aa58}]{Content of the files} \begin{itemize} \item File is \alert{self-contained}: no external knowledge needed to compute \(\Psi(r_1,\dots,r_n)\) (normalization factors, basis et @@ -208,43 +210,44 @@ AO & MO & Two-electron integrals\\ One-electron integrals & Density matrices & ECP\\ \end{tabular} \end{center} -\item Each group contains multiple \alert{attributes} +\item Each group contains multiple \alert{attributes}: information related to the +group \end{itemize} \end{frame} \section{QMCkl: QMC kernel library} -\label{sec:org53e6105} +\label{sec:org3669f0e} -\begin{frame}[label={sec:org4dc9060}]{QMC kernel library} +\begin{frame}[label={sec:org89970a2}]{QMC kernel library} \begin{block}{Computational kernels} \begin{itemize} -\item QMCkl will contain the main kernels of QMC methods (Domain -specific library, end-user driven) +\item QMCkl will contain the main kernels of QMC methods: Domain +specific library, end-user driven 
\item Written together by QMC experts and HPC experts \item Multiple high performance implementations of the kernels, tuned for different \begin{itemize} \item architectures: portability is critical for users -\item problem sizes (from small to large systems) -\item requested accuracy (reduced precision) +\item problem sizes: from small to large systems +\item requested accuracy: reduced precision \end{itemize} \end{itemize} \end{block} \end{frame} -\begin{frame}[label={sec:orgcf8c268}]{Objectives} +\begin{frame}[label={sec:org27f2ac6}]{Objectives} \begin{itemize} \item The code must stay easy to understand by the physicists/chemists. Performance-related aspects should be delegated to the library \item Scientists should be able to use their preferred language -\item Scientists should not lose control on their codes +\item Scientists should not lose control of their codes \item Codes should not die when the architecture changes \item Scientific code development should not kill the performance \item Reuse of the optimization effort among the community \end{itemize} \end{frame} -\begin{frame}[label={sec:org523cd8a}]{Functionality and performance} +\begin{frame}[label={sec:org7fe4d9a}]{Functionality and performance} \begin{itemize} \item Keeping high \emph{productivity}, \emph{portability} and \emph{performance} is very hard in a single piece of software. @@ -255,9 +258,11 @@ We propose (at least) two implementations: \item \alert{Documentation library} \\ Easy to read, understand, modify for scientists, not necessarily efficient. \item \alert{High performance libraries} \\ - Efficient on a given architecture, but not necessarily +Efficient on a given architecture, but not necessarily readable by physicists/chemists. \\ Performance within 10\% to maximize portability and simplicity. +\item \alert{Ultra-High performance libraries} \\ +Generated with auto-tuning tools for well identified datasets. 
\end{enumerate} \item Both \emph{Documentation} and \emph{High performance} have the same API @@ -270,9 +275,10 @@ implemented in the HPC versions when the API is stabilized. \end{itemize} \end{frame} -\begin{frame}[label={sec:org1030a63},fragile]{Library design} +\begin{frame}[label={sec:orgca18759},fragile]{Library design} \begin{itemize} -\item Creation of a \emph{Context} that keeps a consistent state of the library +\item Creation of a \emph{Context} that keeps a consistent state of the +library (pointers to computed data, configuration parameters, etc.) \item Memory allocation is abstract: \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c} void* qmckl_malloc(qmckl_context context, const qmckl_memory_info_struct info); @@ -283,11 +289,12 @@ context untouched (no allocation, no modification in-place) \item High-level functions: let the library call multiple kernels in an optimal way, possibly updating the context \item Use of IRP programming paradigm\footnote{http://arxiv.org/abs/0909.5012} to keep track of dependencies -between kernels: re-compute only what is necessary +between kernels: re-compute only what is necessary and store +computed data in the context \end{itemize} \end{frame} -\begin{frame}[label={sec:orgd8c37c2}]{Dependencies between kernels} +\begin{frame}[label={sec:org1c791dc}]{Dependencies between kernels} \begin{columns} \begin{column}{0.5\textwidth} \begin{center} @@ -307,7 +314,7 @@ between kernels: re-compute only what is necessary \end{columns} \end{frame} -\begin{frame}[label={sec:org465f70f},fragile]{Use case: low-level} +\begin{frame}[label={sec:org5202b14},fragile]{Use case: low-level} \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c} #include @@ -330,7 +337,7 @@ assert (rc == QMCKL_SUCCESS); \end{minted} \end{frame} -\begin{frame}[label={sec:orgb80c323},fragile]{Use case: high-level} +\begin{frame}[label={sec:org1ecca91},fragile]{Use case: high-level} \begin{minted}[frame=lines,fontsize=\scriptsize,linenos]{c} 
#include // ... @@ -354,82 +361,21 @@ rc = qmckl_get_local_energy(context, &e_loc); \end{minted} \end{frame} -\begin{frame}[label={sec:org518f369}]{Development strategy} +\begin{frame}[label={sec:org3f3c8bf}]{Development strategy} \begin{enumerate} \item Kernel extraction: QMC specialists agree on the mathematical expression of the problem \item A mini-application is written to find the optimal data layout with HPC experts from real-size examples \item The kernel is written in the documentation library -\item The documentation library is linked in a QMC code to check correctness +\item The documentation library is linked in a QMC code to check +correctness and numerical accuracy \item HPC experts provide an HPC version of the kernel \item The HPC library is linked in the QMC codes of the CoE \end{enumerate} \end{frame} -\begin{frame}[label={sec:org7c60b7a}]{Documentation library} -Literate programming with Org-mode: -\begin{itemize} -\item Comments are more important than code -\item Can add graphics, \LaTeX formulas, tables, etc -\item Documentation always synchronized with the code -\item Some routines can be generated by embedded scripts -\item Kernels are implemented in Fortran for readability -\item The API is C-compatible: QMCkl appears like a C library -\(\Longrightarrow\) can be used in all other languages -\item Example: Prototyping in Julia -\end{itemize} -\end{frame} - -\begin{frame}[label={sec:orgf424cd4}]{High-Performance strategies} -\begin{block}{Linear algebra hot spots} -\begin{center} -\begin{tabular}{lll} -GEMM & Rank-1 update & Matrix Inversion\\ -GEMV & Diagonal of GEMM & Shermann-Morrison-Woodburry\\ -\end{tabular} -\end{center} -\end{block} - -\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))} -\begin{itemize} -\item Matrices are stored in tiled format \(\Longrightarrow\) task-based -linear algebra interleaved computation of multiple kernels -\item Increase parallelism by agregating multiple independent walkers -in 
matrices -\item Needs fast linear algebra kernels for small matrices -\end{itemize} -\end{block} -\end{frame} - -\begin{frame}[label={sec:orgea7372b}]{High-Performance strategies} -\begin{block}{Tuning} -\begin{itemize} -\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}. -\item Specialized versions of critical hot-spots -\item MIPP\footnote{https://github.com/aff3ct/MIPP} for portable intrinsics / specialized code generation -\item Monitoring of the use of the library to choose most efficient versions -\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo}) -\end{itemize} -\end{block} -\end{frame} - -\begin{frame}[label={sec:orgba656d9}]{Example: Specialized DGEMM kernel} -VIJAY -\end{frame} - -\begin{frame}[label={sec:orgd3ca712}]{Efficiently guiding the developer} -\begin{center} -\includegraphics[width=\textwidth]{./maqao1.png} -\end{center} -\end{frame} -\begin{frame}[label={sec:orgcc14268}]{Extensive/automatic testing of different configurations} -\begin{center} -\includegraphics[width=\textwidth]{./maqao2.png} -\end{center} -\end{frame} - -\begin{frame}[label={sec:org7ee3c30}]{First application : 3-body Jastrow factor} +\begin{frame}[label={sec:orgb6a9085}]{First application : 3-body Jastrow factor} \newcommand{\Jeen}{J_{\text{een}}} \newcommand{\Nel}{N_{\text{elec}}} \newcommand{\Nat}{N_{\text{nucl}}} @@ -460,14 +406,133 @@ VIJAY \begin{itemize} \item Gradient and Laplacian are also required \item Up to \(20\times\) faster than in the original code -\item \(\sim 80\%\) of the AVX-512 peak is reached +\item \(\sim 80\%\) of the AVX-512 peak is reached using standard MKL on +Intel Skylake \item Expressed with a DGEMM kernel \(\Longrightarrow\) also efficient on GPU \end{itemize} \end{column} \end{columns} - - \end{frame} + +\begin{frame}[label={sec:orgd6d3e26}]{High-Performance strategies} +\begin{block}{Linear algebra hot spots} +\begin{center} 
+\begin{tabular}{lll}
+GEMM & Rank-1 update & Matrix Inversion\\
+GEMV & Diagonal of GEMM & Sherman-Morrison-Woodbury\\
+\end{tabular}
+\end{center}
+\end{block}
+
+\begin{block}{Matrices are relatively small (\(\le 1000\times 1000\))}
+\begin{itemize}
+\item Matrices are stored in tiled format fitting a block formulation
+of the algorithms \(\Longrightarrow\) task-based
+linear algebra, interleaved computation of multiple kernels
+\item Tile sizes will be adjusted by auto-tuning
+\item Increase parallelism by aggregating multiple independent walkers
+in matrices
+\item Needs fast linear algebra kernels for small matrices (tile size)
+\item For tiny matrices (\(<5\times5\)) specialized versions are implemented
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:orgeb97339},fragile]{Example: Specialized DGEMM kernel I}
+ \begin{columns}
+\begin{column}{0.45\columnwidth}
+\begin{block}{Simple algorithm}
+\begin{itemize}
+\item Simple micro kernel (\alert{GotoDGEMM}\footnote{doi:10.1145/1356052.1356053})
+\item Code written using \texttt{asm\_volatile} to force good code generation by
+compilers
+\item \alert{Tiling} scheme\footnote{doi:10.1109/ICPP.2015.29}
+\end{itemize}
+\end{block}
+\end{column}
+
+\begin{column}{0.45\columnwidth}
+\begin{block}{Tiling scheme}
+\begin{center}
+\includegraphics[width=5cm,height=5cm]{./tiling_icpp2015.pdf}
+\end{center}
+\end{block}
+\end{column}
+\end{columns}
+\end{frame}
+
+\begin{frame}[label={sec:org76e8117}]{Example: Specialized DGEMM kernel II}
+\begin{block}{Benchmarks}
+\begin{itemize}
+\item Comparison of MKL vs Specialized DGEMM
+
+\begin{center}
+\includegraphics[height=4cm]{./plot_percentage_vs_mkl_tiled_good.pdf}
+\end{center}
+
+\item Strong impact on MKL performance due to the number of consecutive executions
+\item Comparison favorable to MKL: many consecutive executions to
+amortize setup cost, JIT, Skylake CPU
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:orgc7d8abc}]{Why do we like our DGEMM?}
+\begin{itemize}
+\item Open-source code: can be modified easily
+\item Simple code (280 LOC)
+\item Decent performance: within 10\% of MKL
+\item Can be rewritten in different languages to increase
+portability (MIPP\footnote{https://github.com/aff3ct/MIPP})
+\item Can be coupled with simple pack/unpack routines to handle different
+data storage (tiled matrices)
+\item Allows us to keep control of parallelism
+\item A good starting point for autotuning
+\end{itemize}
+\end{frame}
+
+\begin{frame}[label={sec:org18a9bee}]{High-Performance strategies}
+\begin{block}{Tuning}
+\begin{itemize}
+\item Optimization is guided by analysis with \alert{MAQAO}\footnote{https://maqao.org}.
+\item Specialized versions of critical hot-spots
+\item \alert{MIPP} for portable intrinsics / specialized code generation
+\item Monitoring of the use of the library to choose most efficient versions
+\item Optimizations guided by monitoring numerical accuracy (\alert{Verificarlo}\footnote{https://github.com/verificarlo/verificarlo})
+\end{itemize}
+\end{block}
+\end{frame}
+
+\begin{frame}[label={sec:org4489490}]{Efficiently guiding the developer}
+\begin{center}
+\includegraphics[width=\textwidth]{./maqao1.png}
+\end{center}
+\end{frame}
+\begin{frame}[label={sec:orgddd3631}]{Extensive/automatic testing of different configurations}
+\begin{center}
+\includegraphics[width=\textwidth]{./maqao2.png}
+\end{center}
+\end{frame}
+
+\section{Summary}
+\label{sec:org30e04a5}
+
+\begin{frame}[label={sec:org705d3cf}]{Summary}
+\begin{itemize}
+\item QMC codes integrated in an ecosystem of multiple codes for
+high-accuracy quantum chemistry
+\item Development of open-source libraries to be used in the
+TREX codes and beyond
+\item Libraries focus on \emph{performance}, \emph{portability} and \emph{productivity}
+\item Strategies to make the collaboration between physicists/chemists
+and HPC experts optimal
+\end{itemize}
+\end{frame}
+
+
+\section{Bonus slides}
+\label{sec:orgb118e4f} + \begin{frame}[fragile]{Numerical analysis with Verificarlo} @@ -566,8 +631,10 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7) \draw[arrow] (targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target}); \end{tikzpicture} + \end{frame} -\begin{frame}[label={sec:org8493521}]{Verificarlo CI} + +\begin{frame}[label={sec:org560588a}]{Verificarlo CI} \begin{columns} \begin{column}{0.5\textwidth} \begin{exampleblock}{Compare runs} diff --git a/tiling_icpp2015.pdf b/tiling_icpp2015.pdf new file mode 100644 index 0000000..6bed32d Binary files /dev/null and b/tiling_icpp2015.pdf differ diff --git a/verificarlo.tex b/verificarlo.tex index e1262fe..b7eb787 100644 --- a/verificarlo.tex +++ b/verificarlo.tex @@ -97,4 +97,4 @@ vfc\_probe\_assert("Sherman-Morisson", "res", res, \tikzmark{target}1e-7) (targetex.south) to[out=-90,in=90] ([yshift=1.2ex, xshift=.5cm]{pic cs:target}); \end{tikzpicture} - +\end{frame}
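The "Linear algebra hot spots" table added by this patch lists the Sherman-Morrison rank-1 inverse update. A minimal sketch of the formula that update computes, for readers of the patch (plain Python on toy lists; an illustration only, not the QMCkl kernel, and the helper names are hypothetical):

```python
# Sherman-Morrison: given Ainv = A^-1, the inverse of the rank-1
# update A' = A + u v^T is
#     A'^-1 = Ainv - (Ainv u)(v^T Ainv) / (1 + v^T Ainv u)
# which costs O(n^2) instead of the O(n^3) of a full re-inversion.

def matvec(A, x):
    """Matrix-vector product on lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def sherman_morrison(Ainv, u, v):
    """Inverse of (A + u v^T), given Ainv, u and v."""
    Ainv_u = matvec(Ainv, u)                            # Ainv @ u (column)
    At = [list(col) for col in zip(*Ainv)]              # Ainv^T
    vT_Ainv = matvec(At, v)                             # v^T @ Ainv (row)
    denom = 1.0 + sum(a * b for a, b in zip(v, Ainv_u))
    n = len(Ainv)
    return [[Ainv[i][j] - Ainv_u[i] * vT_Ainv[j] / denom
             for j in range(n)] for i in range(n)]
```

The O(n^2) cost is what makes this a hot spot worth a dedicated kernel rather than a call to a general-purpose inversion routine.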
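The patch also describes storing matrices in tiled format so that a small micro-kernel handles one tile at a time. A toy sketch of that blocked formulation (pure Python; the tile size and loop order are illustrative assumptions, not the GotoDGEMM kernel of the slides):

```python
# Blocked (tiled) matrix multiply: C = A @ B processed tile by tile.
# Each innermost block plays the role of the specialized micro-kernel;
# in a task-based runtime each C-tile update could be one task.

def matmul_tiled(A, B, tile=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                # micro-kernel: C-tile += A-tile @ B-tile
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, k)):
                        a = A[i][p]
                        for j in range(j0, min(j0 + tile, m)):
                            C[i][j] += a * B[p][j]
    return C
```

In the real kernel the tile size is chosen to fit cache (and, per the slides, will eventually be set by auto-tuning); here it is just a parameter.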