
Fundamentals of parallelization

#+LaTeX_CLASS_OPTIONS:[aspectratio=169]

Multi-threading

Processes vs threads

Process

  • Has its own memory address space
  • Context switching between processes is slow
  • Processes interact only through system-provided communication mechanisms
  • Fork: creates a copy of the current process
  • Exec: replaces the current process image with another binary executable
  • Spawn: fork, then exec in the child

Thread

  • Exists as a subset of a process
  • Context switching between threads is fast
  • Threads share the memory address space of their process: they interact via shared memory

Threads

[Figure: smp.png, threads running on the cores of a shared-memory (SMP) node]

  • Concurrent programming
  • Graphical user interfaces (progress bars, …)
  • Asynchronous I/O
  • Standard library: POSIX threads (pthreads)

Communication time

  • Latency of a low-latency network: ~1.2 microseconds
  • Latency of a random memory access: ~0.1 microseconds

Communicating through shared memory is roughly an order of magnitude faster than going over even the fastest networks, which is why threads are cheap to coordinate.

Threads example (Python)

#!/usr/bin/env python
import threading
import time

class test:
 def __init__(self, Nthreads):
     self.Nthreads = Nthreads
     self.data = [ i for i in range(Nthreads) ]  # shared by all threads

 def run_thread(self, j):
     self.data[j] = 0   # clear the slot owned by this thread
     time.sleep(j)      # simulate j seconds of work
     self.data[j] = j   # restore the value

Threads example (Python)

 def run(self):
     thread = [ None ] * self.Nthreads
     t0 = time.time()
     print(self.data)
     for i in range(self.Nthreads):
         thread[i] = threading.Thread( target=self.run_thread, args=(i,) )
         thread[i].start()
     for i in range(self.Nthreads):
         thread[i].join()   # wait for thread i to finish
         print(time.time()-t0, "seconds. ", self.data)

if __name__ == '__main__':
 t = test(4)
 t.run()

$ python thread_python.py 
[0, 1, 2, 3]
0.0009775161743164062 seconds.  [0, 0, 0, 0]
1.0018701553344727 seconds.  [0, 1, 0, 0]
2.003377676010132 seconds.  [0, 1, 2, 0]
3.004056930541992 seconds.  [0, 1, 2, 3]

Computation of π with threads in Python

#!/usr/bin/env python
import os, sys, threading
from random import random, seed
from math import sqrt

NMAX = 10000000            # Nb of MC steps per thread
error_threshold = 1.0e-4   # Stopping criterion

class pi_calculator:
 def __init__(self, Nthreads):
     self.Nthreads = Nthreads
     self.results = []
     self.lock = threading.Lock()

 def compute_pi(self):
     result = 0.
     for i in range(NMAX):         # Loop NMAX times
         x,y = random(), random()   # Draw 2 random numbers x and y
         if x*x + y*y <= 1.:        # Check if (x,y) is in the circle
             result += 1
     with self.lock:   # protect the shared list of results
         self.results.append(4.* float(result)/float(NMAX))

Computation of π with threads in Python

 def run(self):
    thread = [None] * self.Nthreads
    for i in range(self.Nthreads):
       thread[i] = threading.Thread( target=self.compute_pi, args=()  )
       thread[i].start()
    print("All threads started")
    
    while True:
       for i in range(self.Nthreads):
          thread[i].join()
       N = len(self.results)
       average = sum(self.results)/N             # Compute average
       if N > 2:                         # Compute variance
          l = [ (x-average)*(x-average) for x in self.results ]
          variance = sum(l)/(N-1.)
       else:
          variance = 0.
       error = sqrt(variance)/sqrt(N)    # Compute error
       print("%f +/- %f %d"%(average, error, N))

Computation of π with threads in Python

       if N > 2 and error < error_threshold:  # Stopping condition
          return

       for i in range(self.Nthreads):
          thread[i] = threading.Thread( target=self.compute_pi, args=()  )
          thread[i].start()

if __name__ == '__main__':
 calc = pi_calculator(4)
 calc.run()

Note: this is inefficient in Python because the Global Interpreter Lock (GIL) prevents threads from executing Python bytecode simultaneously, but it illustrates the idea.

OpenMP

OpenMP

  • OpenMP is an extension of programming languages that enables multi-threaded parallelism through compiler directives
  • The OpenMP runtime may be implemented on top of pthreads
  • OpenMP 5.0 adds extensions to offload code execution to GPUs
  • The same source code can be compiled with or without OpenMP: to a non-OpenMP compiler, the directives are plain comments

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO
    do i=1,n
      A(i) = B(i) + C(i)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

OpenMP directives (Fortran)

  • !$OMP PARALLEL starts a new multi-threaded section. Everything inside this block is executed by all the threads
  • !$OMP DO tells the compiler to split the loop among the different threads (by changing the loop boundaries, for instance)
  • !$OMP END DO marks the end of the parallel loop. It contains an implicit synchronization: after this line, all the threads have finished executing the loop
  • !$OMP END PARALLEL marks the end of the parallel section. It also contains an implicit barrier
  • DEFAULT(SHARED): all the variables (A, B, C) are in shared memory by default
  • PRIVATE(i): each thread has its own private copy of the variable i

OpenMP directives (Fortran)

  • !$OMP CRITICAL ... !$OMP END CRITICAL: all the statements in this block are protected by a lock, so only one thread executes them at a time
  • !$OMP TASK ... !$OMP END TASK: defines a new task to execute
  • !$OMP BARRIER: synchronization barrier
  • !$OMP SINGLE ... !$OMP END SINGLE: all the statements in this block are executed by a single thread
  • !$OMP MASTER ... !$OMP END MASTER: all the statements in this block are executed by the master thread

A short example combining some of these directives is sketched below.
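
A minimal sketch using CRITICAL, BARRIER and SINGLE to count the running threads (program and variable names are illustrative; compile with -fopenmp):

    program count_threads
    implicit none
    integer :: nthreads
    nthreads = 0
    !$OMP PARALLEL DEFAULT(SHARED)
    !$OMP CRITICAL
    nthreads = nthreads + 1   ! protected by a lock: one thread at a time
    !$OMP END CRITICAL
    !$OMP BARRIER             ! wait until every thread has counted itself
    !$OMP SINGLE
    print *, 'Number of threads: ', nthreads
    !$OMP END SINGLE
    !$OMP END PARALLEL
    end program count_threads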

OpenMP

Functions

  • omp_get_thread_num(): returns the ID of the current thread (analogous to the MPI rank)
  • omp_get_num_threads(): returns the total number of running threads (analogous to the MPI size)
  • OMP_NUM_THREADS: environment variable (shell) that sets the number of threads to run
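
A minimal sketch using these two functions (the omp_lib module is provided by any OpenMP-enabled Fortran compiler):

    program hello_threads
    use omp_lib   ! provides omp_get_thread_num and omp_get_num_threads
    implicit none
    !$OMP PARALLEL
    print *, 'Thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
    !$OMP END PARALLEL
    end program hello_threads

Running OMP_NUM_THREADS=4 ./hello_threads prints one line per thread, in an unpredictable order.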

Important

  • Multiple threads can safely read from the same memory address
  • Multiple threads must not write to the same address at the same time (see the REDUCTION sketch below)
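
When every thread needs to accumulate into the same variable, the REDUCTION clause gives each thread a private copy and combines the copies at the end of the loop. A minimal sketch (the series computed here is only illustrative):

    program sum_reduction
    implicit none
    integer :: i
    double precision :: s
    s = 0.d0
    !$OMP PARALLEL DO REDUCTION(+:s)
    do i=1,1000000
      s = s + 1.d0/dble(i)**2   ! converges to pi**2/6
    end do
    !$OMP END PARALLEL DO
    print *, 'Sum: ', s
    end program sum_reduction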

Matrix multiplication

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do

Matrix multiplication with OpenMP

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE (i,j,k)
do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do
!$OMP END PARALLEL DO
  • The loop is parallelized over j
  • Writing to C(i,j) is safe: each thread handles distinct values of j, so no two threads write to the same address

Matrix multiplication with OpenMP

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE (i,j,k)
!$OMP DO COLLAPSE(2)
do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do
!$OMP END DO
!$OMP END PARALLEL
  • COLLAPSE(2) fuses the two outer loops: the work is distributed over the N² pairs (i,j) instead of the N values of j, which balances the load better when N is small compared to the number of threads
  • Writing to C(i,j) is still safe: each pair (i,j) is handled by a single thread

Exercises

Exercise 1: Monte Carlo

  1. Write a Fortran
    double precision function compute_pi(M)
    that computes $\pi$ with the Monte Carlo algorithm using $M$ samples (a possible sketch is given after the main program)
  2. Call it using this main program:

    program pi_mc
    implicit none
    integer                      :: M
    logical                      :: iterate
    double precision             :: sample
    double precision, external   :: compute_pi
    M = 10000000        ! Nb of MC samples per batch, as in the Python example
    call random_seed()  ! Initialize the random number generator
    open(unit=11, file='fortran_out.fifo', status='old', action='write', &
     access='stream', form='formatted')
    iterate = .True.
    do while (iterate)  ! Compute pi over M samples until 'iterate=.False.'
      sample = compute_pi(M)
      write(11,*) sample
    end do
    end program pi_mc
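
A possible sketch of compute_pi, transcribing the Monte Carlo loop of the earlier Python example (write your own version before peeking):

    double precision function compute_pi(M)
    implicit none
    integer, intent(in) :: M
    integer             :: i, n_in
    double precision    :: x, y
    n_in = 0
    do i=1,M
      call random_number(x)          ! Draw 2 random numbers x and y
      call random_number(y)
      if (x*x + y*y <= 1.d0) then    ! Check if (x,y) is in the circle
        n_in = n_in + 1
      end if
    end do
    compute_pi = 4.d0 * dble(n_in) / dble(M)
    end function compute_pi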

Exercise 1: Monte Carlo

  3. Write a Python server pi_server.py that receives samples of π over a socket and computes the running average of π. Its address and port number are written in a file server.cfg.
  4. Write a Python script pi_bridge.py that reads samples of π from the named pipe fortran_out.fifo and sends them to the server. This script reads the address and port number of the server from the file server.cfg.
  5. When the convergence criterion is reached, the server informs the bridges that they can stop.

Running a simulation on multiple nodes

  • Run a single server
  • Run one bridge per compute node using the mpiexec command
  • Run one Fortran process per core using the mpiexec command

Exercise 2: Divide and conquer for matrix multiplication