#+startup: beamer
#+options: H:2
* Program :noexport:

** Tuesday afternoon

I can have them:

- parallelize a matrix product with OpenMP in Fortran/C
- compute pi with a Monte Carlo algorithm using MPI (master-worker), first in Python and then in Fortran/C
- compute an integral in R^3 on a grid of points with MPI in Fortran/C
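The serial building block of the pi exercise can be sketched as below (the function name is illustrative); in the MPI version, each worker would run this with a different seed and the master would combine the hit counts with MPI_Reduce.

```python
import random

# Monte Carlo estimate of pi: the fraction of uniform points in the unit
# square that fall inside the quarter disk tends to pi/4.
def pi_monte_carlo(n_samples, seed=0):
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * hits / n_samples

print(pi_monte_carlo(1_000_000))  # ≈ 3.14
```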

This will only give them the basics with OMP DO and MPI_Reduce, and we will
not be able to go much further. But they will still be able to use all 112
cores of the cluster.

Caveat: we must also anticipate that they may never have used a cluster, and
they probably do not know what one is. So I will have to include a bit of
hardware in my presentation on parallelism and explain that they need to run
sbatch to launch a computation.

SLURM is the batch manager, right?
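A minimal SLURM submission script could be shown for this; the module and executable names are placeholders, and 2 nodes x 56 tasks matches the 112 cores mentioned above.

```shell
#!/bin/bash
# Minimal SLURM job script (module and executable names are placeholders).
#SBATCH --job-name=pi_mc
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=56    # 2 nodes x 56 tasks = 112 cores
#SBATCH --time=00:10:00

module load openmpi             # hypothetical module name
srun ./pi_mc                    # run the MPI executable on all tasks
```

Submitted with `sbatch job.sh`; `squeue` then shows the job in the queue.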

** Wednesday

For IRPF90, I can give a fairly general presentation.

I already have a tutorial for writing a molecular dynamics program with a
Lennard-Jones potential; I think that will be easier since there is not much
time for the hands-on sessions.

If they move fast enough, we can then switch to the OpenMP parallelization of
the loops in the code, and to running multiple trajectories with MPI, reusing
the pi parallelization model from the previous day.

For QP, I think it would be good to present it in 15 minutes once they have
done the IRPF90 tutorial. Then we can demo how to implement an SCF in 10
minutes, but I do not think we will have time to let them do the work in QP
themselves, and this avoids compilation problems on the machines. We can also
give access to the account where QP is installed to those who move very fast
and want to try playing with it.

* Supercomputers

** Computers

#+LATEX: \begin{columns}
#+LATEX: \begin{column}{0.7\textwidth}
#+LATEX: \begin{exampleblock}{Today (order of magnitude)}
- 1 *socket* (x86 CPU @ 2.2-3.3 GHz, *4 cores*, hyperthreading)
- \sim 4-16 GB RAM
- \sim 500 GB SSD
- Graphics card: AMD Radeon, Nvidia GeForce
- Gigabit Ethernet
- USB, webcam, sound card, etc.
- \sim 500 euros
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.3\textwidth}
#+ATTR_LATEX: :width \textwidth
[[./desktop-inspiron-MT-3650-pdp-module-1.jpg]]
#+LATEX: \end{column}
#+LATEX: \end{columns}

** Computer designed for computation

#+LATEX: \begin{columns}
#+LATEX: \begin{column}{0.6\textwidth}
#+LATEX: \begin{exampleblock}{Today (order of magnitude)}
- 2 sockets (x86 CPU @ 2.2 GHz, 32 cores/socket, hyperthreading)
- 64-128 GB RAM
- Multiple SSDs (RAID0)
- Gigabit Ethernet
- Possibly an accelerator (Nvidia Volta/Ampere)
- \sim 5k euros
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.4\textwidth}
#+ATTR_LATEX: :width \textwidth
[[./z840_gallery_img4_tcm245_2164103_tcm245_1871309_tcm245-2164103.jpg]]
#+LATEX: \end{column}
#+LATEX: \end{columns}

** Cluster

#+LATEX: \begin{columns}
#+LATEX: \begin{column}{0.6\textwidth}
#+LATEX: \begin{exampleblock}{}
- Many computers designed for computation
- Compact (1-2U of rack space) per machine
- Network switches
- Login server
- Batch queuing system (SLURM / PBS / SGE / LSF)
- Cheap cooling system
- Requires a lot of electrical power (\sim 10 kW/rack)
- Possibly a low-latency / high-bandwidth network (InfiniBand or 10 Gb Ethernet)
- >50k euros
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.4\textwidth}
#+ATTR_LATEX: :width \textwidth
[[./img_20160510_152246_resize.jpg]]
#+LATEX: \end{column}
#+LATEX: \end{columns}

** Supercomputer

#+LATEX: \begin{columns}
#+LATEX: \begin{column}{0.6\textwidth}
#+LATEX: \begin{exampleblock}{}
- Many computers designed for computation
- Very compact (<1U of rack space) per machine
- Low-latency / high-bandwidth network (InfiniBand or 10 Gb Ethernet)
- Network switches
- Parallel filesystem for scratch space (Lustre / BeeGFS / GPFS)
- Multiple login servers
- Batch queuing system (SLURM / PBS / SGE / LSF)
- Highly efficient cooling system (water)
- Requires a lot of electrical power (>100 kW)
#+LATEX: \end{exampleblock}
#+LATEX: \end{column}
#+LATEX: \begin{column}{0.4\textwidth}
#+ATTR_LATEX: :width \textwidth
[[./Eos.png]]
#+LATEX: \end{column}
#+LATEX: \end{columns}

** Definitions

- Top500 :: Ranking of the 500 fastest supercomputers
- Flop :: Floating-point operation
- Flops :: Flop/s, the number of floating-point operations per second
- RPeak :: Peak performance, the maximum possible number of Flops
- RMax :: Measured performance on the Linpack benchmark (dense linear system solve)
- SP :: Single precision (32-bit floats)
- DP :: Double precision (64-bit floats)
- FPU :: Floating-Point Unit
- FMA :: Fused multiply-add ($a \times x + b$ in one instruction)

** Quantifying performance

#+LATEX: \begin{exampleblock}{Example}
*RPeak* of the Intel Xeon Gold 6140 processor:
- 18 cores
- 2.3 GHz
- 2 FPUs
- 8 FMA (DP)/FPU/cycle (1 FMA = 2 flops)

$18 \times 2.3 \times 10^9 \times 2 \times 8 \times 2 = 1.3$ TFlops (DP)
#+LATEX: \end{exampleblock}

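The RPeak arithmetic above can be checked with a short script (the function name is illustrative); the final factor of 2 counts the multiply and the add of each FMA.

```python
# Peak DP performance: cores x clock x FPUs x FMA/cycle x 2 flops per FMA.
def rpeak_dp(cores, clock_ghz, fpus, fma_per_cycle):
    return cores * clock_ghz * 1e9 * fpus * fma_per_cycle * 2  # flops/s

# Intel Xeon Gold 6140, values from the example above
peak = rpeak_dp(cores=18, clock_ghz=2.3, fpus=2, fma_per_cycle=8)
print(f"{peak / 1e12:.2f} TFlops (DP)")  # → 1.32 TFlops (DP)
```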
- Number of hours :: 730/month, 8760/year
- Units :: Kilo (K), Mega (M), Giga (G), Tera (T), Peta (P), Exa (E), ...

** Top500 (1996)

#+ATTR_LATEX: :height 0.9\textheight
[[./top500_95.png]]

** Top500 (2021)

#+ATTR_LATEX: :height 0.9\textheight
[[./top500_21.png]]

https://www.top500.org/lists/top500/2021/11/

** Curie thin nodes (TGCC, France)

Ranked 9th in 2012, 77 184 cores, 1.7 PFlops, 2.1 MW

#+ATTR_LATEX: :height 0.8\textheight
[[./tgcc.jpg]]

** Mare Nostrum (BSC, Spain)

Ranked 13th in 2017, 153 216 cores, 6.5 PFlops, 1.6 MW

#+ATTR_LATEX: :height 0.8\textheight
[[./marenostrum.jpg]]

** Architecture

#+ATTR_LATEX: :height 0.9\textheight
[[./hierarchy.pdf]]

** Chassis (Front)

#+ATTR_LATEX: :height 0.9\textheight
[[./chassis.jpg]]

** Chassis (Back)

#+ATTR_LATEX: :height 0.9\textheight
[[./chassis_back.jpg]]

** Compute Node

#+ATTR_LATEX: :height 0.9\textheight
[[./blade.jpg]]

** Socket

#+ATTR_LATEX: :height 0.9\textheight
[[./socket.jpg]]

** Core

#+ATTR_LATEX: :height 0.9\textheight
[[./Nehalem.jpg]]

* Fundamentals of parallelization

** Data movement

* OpenMP

* Message Passing Interface (MPI)

: /home/scemama/MEGA/TEX/Cours/TCCM/TCCM2022/Parallelism/parallelism_scemama.pdf

* Figures :noexport:

#+BEGIN_SRC dot :output file :file interfaces.png