Fundamentals of parallelization
#+LaTeX_CLASS_OPTIONS:[aspectratio=169]
Multi-threading
Processes vs threads
Process
- Has its own memory address space
- Context switching between processes is slow
- Processes interact only through system-provided communication mechanisms
- Fork: creates a copy of the current process
- Exec: switches to running another binary executable
- Spawn: fork, then exec in the child process
Thread
- Exist as subsets of a process
- Context switching between threads is fast
- Share the same memory address space: threads interact via shared memory
Threads
- Concurrent programming
- Graphical user interfaces (progress bars, …)
- Asynchronous I/O
- Standard library: POSIX threads (pthreads)
Communication time
- Low-latency network: latency ∼ 1.2 microseconds
- Random memory access: ∼ 0.1 microsecond
Threads example (Python)
#!/usr/bin/env python
import threading
import time

class test:
    def __init__(self, Nthreads):
        self.Nthreads = Nthreads
        self.data = [ i for i in range(Nthreads) ]

    def run_thread(self, j):
        self.data[j] = 0  # Reset entry j
        time.sleep(j)     # Wait j seconds
        self.data[j] = j  # Restore entry j
Threads example (Python)
    def run(self):
        thread = [ None ] * self.Nthreads
        t0 = time.time()
        print(self.data)
        for i in range(self.Nthreads):  # Start all threads
            thread[i] = threading.Thread( target=self.run_thread, args=(i,) )
            thread[i].start()
        for i in range(self.Nthreads):  # Wait for each thread in turn
            thread[i].join()
            print(time.time()-t0, "seconds. ", self.data)

if __name__ == '__main__':
    t = test(4)
    t.run()
$ python thread_python.py
[0, 1, 2, 3]
0.0009775161743164062 seconds. [0, 0, 0, 0]
1.0018701553344727 seconds. [0, 1, 0, 0]
2.003377676010132 seconds. [0, 1, 2, 0]
3.004056930541992 seconds. [0, 1, 2, 3]
Computation of π with threads in Python
#!/usr/bin/env python
import os, sys, threading
from random import random, seed
from math import sqrt

NMAX = 10000000          # Nb of MC steps/thread
error_threshold = 1.0e-4 # Stopping criterion

class pi_calculator:
    def __init__(self, Nthreads):
        self.Nthreads = Nthreads
        self.results  = []
        self.lock     = threading.Lock()

    def compute_pi(self):
        result = 0.
        for i in range(NMAX):         # Loop NMAX times
            x,y = random(), random()  # Draw 2 random numbers x and y
            if x*x + y*y <= 1.:       # Check if (x,y) is in the circle
                result += 1
        with self.lock:               # Protect the shared list with a lock
            self.results.append(4.* float(result)/float(NMAX))
Computation of π with threads in Python
    def run(self):
        thread = [None] * self.Nthreads
        for i in range(self.Nthreads):
            thread[i] = threading.Thread( target=self.compute_pi, args=() )
            thread[i].start()
        print("All threads started")
        while True:
            for i in range(self.Nthreads):  # Wait for the current batch
                thread[i].join()
            N = len(self.results)
            average = sum(self.results)/N   # Compute average
            if N > 2:                       # Compute variance
                l = [ (x-average)*(x-average) for x in self.results ]
                variance = sum(l)/(N-1.)
            else:
                variance = 0.
            error = sqrt(variance)/sqrt(N)  # Compute error
            print("%f +/- %f %d"%(average, error, N))
Computation of π with threads in Python
            if N > 2 and error < error_threshold:  # Stopping condition
                return
            for i in range(self.Nthreads):         # Start a new batch of threads
                thread[i] = threading.Thread( target=self.compute_pi, args=() )
                thread[i].start()

if __name__ == '__main__':
    calc = pi_calculator(4)
    calc.run()
Note: inefficient in Python because of the Global Interpreter Lock (GIL), but you get the idea.
OpenMP
OpenMP
- OpenMP is an extension to programming languages that enables multi-threaded parallelization of code using compiler directives.
- The OpenMP library may be implemented using pthreads
- Extensions in OpenMP 5.0 allow offloading code execution to GPUs
- The same source code can be executed with or without OpenMP: the directives are plain comments unless OpenMP is enabled at compile time
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
!$OMP DO
do i=1,n
   A(i) = B(i) + C(i)
end do
!$OMP END DO
!$OMP END PARALLEL
OpenMP directives (Fortran)
- !$OMP PARALLEL : starts a new multi-threaded section. Everything inside this block is executed by all the threads
- !$OMP DO : tells the compiler to split the loop among the different threads (by changing the loop boundaries, for instance)
- !$OMP END DO : marks the end of the parallel loop. It contains an implicit synchronization: after this line, all the threads have finished executing the loop
- !$OMP END PARALLEL : marks the end of the parallel section. It also contains an implicit barrier
- DEFAULT(SHARED) : all the variables (A, B, C) are in shared memory by default
- PRIVATE(i) : the variable i is private to each thread
OpenMP directives (Fortran)
- !$OMP CRITICAL ... !$OMP END CRITICAL : all the statements in this block are protected by a lock
- !$OMP TASK ... !$OMP END TASK : defines a new task to execute
- !$OMP BARRIER : synchronization barrier
- !$OMP SINGLE ... !$OMP END SINGLE : all the statements in this block are executed by a single thread
- !$OMP MASTER ... !$OMP END MASTER : all the statements in this block are executed only by the master thread
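A minimal sketch combining several of these directives (the names are illustrative, not from the original slides): each thread accumulates a private partial sum, the update of the shared total is protected by CRITICAL, and SINGLE prints the result once.
program critical_demo
  implicit none
  integer, parameter :: n = 1000
  integer            :: i
  double precision   :: A(n), local, total

  A(:) = 1.d0
  total = 0.d0

  !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i, local)
  local = 0.d0
  !$OMP DO
  do i=1,n
     local = local + A(i)
  end do
  !$OMP END DO
  !$OMP CRITICAL
  total = total + local        ! One thread at a time updates the shared sum
  !$OMP END CRITICAL
  !$OMP BARRIER                ! Wait until every thread has contributed
  !$OMP SINGLE
  print *, 'total = ', total   ! Executed by a single thread
  !$OMP END SINGLE
  !$OMP END PARALLEL
end program critical_demo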
OpenMP
Functions
- omp_get_thread_num() : returns the ID of the current running thread (like MPI_Rank)
- omp_get_num_threads() : returns the total number of running threads (like MPI_Size)
- OMP_NUM_THREADS : environment variable (shell) that sets the number of threads to run
Important
- Multiple threads can read from the same memory address
- Multiple threads must not write to the same address
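A minimal sketch using these functions; the file name and compiler invocation below are one possible setup, not prescribed by the slides.
program hello_threads
  use omp_lib   ! Provides omp_get_thread_num() and omp_get_num_threads()
  implicit none
  !$OMP PARALLEL
  print *, 'Thread', omp_get_thread_num(), 'of', omp_get_num_threads()
  !$OMP END PARALLEL
end program hello_threads
$ gfortran -fopenmp hello_threads.f90 -o hello_threads
$ OMP_NUM_THREADS=4 ./hello_threads
Each of the 4 threads prints its own line; the order of the lines is nondeterministic.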
Matrix multiplication
\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]
do j=1,N
   do i=1,N
      C(i,j) = 0.d0
      do k=1,N
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end do
   end do
end do
Matrix multiplication with OpenMP
\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]
!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(i,j,k)
do j=1,N
   do i=1,N
      C(i,j) = 0.d0
      do k=1,N
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end do
   end do
end do
!$OMP END PARALLEL DO
- Loop is parallelized over j
- Writing to C(i,j) is OK: each thread works on its own columns j
Matrix multiplication with OpenMP
\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j,k)
!$OMP DO COLLAPSE(2)
do j=1,N
   do i=1,N
      C(i,j) = 0.d0
      do k=1,N
         C(i,j) = C(i,j) + A(i,k) * B(k,j)
      end do
   end do
end do
!$OMP END DO
!$OMP END PARALLEL
- Loop is parallelized over the pairs (i,j)
- Writing to C(i,j) is OK: each pair (i,j) is handled by a single thread
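By contrast, parallelizing the k loop alone would make all threads write to the same C(i,j). The standard OpenMP remedy for such accumulations is a REDUCTION clause (not covered above); a minimal sketch on a simple dot product, with illustrative names:
program dot_product_omp
  implicit none
  integer, parameter :: N = 1000000
  integer            :: k
  double precision   :: A(N), B(N), s

  A(:) = 1.d0
  B(:) = 2.d0
  s = 0.d0
  !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(k) REDUCTION(+:s)
  do k=1,N
     s = s + A(k) * B(k)   ! Each thread accumulates a private copy of s
  end do
  !$OMP END PARALLEL DO
  print *, 's = ', s       ! Private copies are summed at the end of the loop
end program dot_product_omp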
Exercises
Exercise 1: Monte Carlo
- Write a Fortran double precision function compute_pi(M)
- Call it using this main program:
program pi_mc
  implicit none
  integer          :: M
  logical          :: iterate
  double precision :: sample
  double precision, external :: compute_pi

  call random_seed()   ! Initialize random number generator
  M = 1000000          ! Nb of MC samples per batch (pick any value)

  open(unit=11, file='fortran_out.fifo', status='old', action='write', &
       access='stream', form='formatted')

  iterate = .True.
  do while (iterate)   ! Compute pi over M samples until 'iterate=.False.'
    sample = compute_pi(M)
    write(11,*) sample
  end do

end program pi_mc
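If needed, a possible sketch of compute_pi, transposing the Monte Carlo loop of the earlier Python example (one valid implementation among many):
double precision function compute_pi(M)
  implicit none
  integer, intent(in) :: M
  integer             :: i, n_in
  double precision    :: x, y

  n_in = 0
  do i=1,M
     call random_number(x)        ! Draw 2 random numbers x and y
     call random_number(y)
     if (x*x + y*y <= 1.d0) then  ! Check if (x,y) is in the circle
        n_in = n_in + 1
     end if
  end do
  compute_pi = 4.d0 * dble(n_in) / dble(M)
end function compute_pi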
Exercise 1: Monte Carlo
- Write a Python server pi_server.py that receives samples of π on a socket and computes the running average of π. Its address and port number are written in a file server.cfg.
- Write a Python script pi_bridge.py that reads samples of π from the named pipe fortran_out.fifo and sends the samples to the server. This script can read the address and port number of the server from the file server.cfg.
- When the convergence criterion is reached, the server informs the bridges that they can stop.
Running a simulation on multiple nodes
- Run a single server
- Run one bridge per compute node using the mpiexec command
- Run one Fortran process per core using the mpiexec command