
Fundamentals of parallelization

#+LaTeX_CLASS_OPTIONS:[aspectratio=169]

Multi-threading

Processes vs threads

Process

  • Has its own memory address space
  • Context switching between processes is slow
  • Processes interact only through system-provided communication mechanisms
  • Fork: creates a copy of the current process
  • Exec: replaces the current process image with another binary executable
  • Spawn: fork, then exec in the child

Thread

  • Exists as a subset of a process
  • Context switching between threads is fast
  • Threads share the memory address space of their process: they interact via shared memory

Threads

[Figure: smp.png, threads running on the cores of a shared-memory (SMP) node]

  • Concurrent programming
  • Graphical user interfaces (progress bars, …)
  • Asynchronous I/O
  • Standard library: POSIX threads (pthreads)

Communication time

  • Latency of a low-latency network: ~1.2 microseconds
  • Latency of a random memory access: ~0.1 microseconds

Communicating through shared memory is roughly an order of magnitude faster than going over even the fastest networks, which is why threads are cheap to coordinate.

Threads example (Python)

#!/usr/bin/env python
import threading
import time

class test:
 def __init__(self, Nthreads):
     self.Nthreads = Nthreads
     self.data = [ i for i in range(Nthreads) ]  # shared by all threads

 def run_thread(self, j):
     self.data[j] = 0   # clear the slot owned by this thread
     time.sleep(j)      # simulate j seconds of work
     self.data[j] = j   # restore the value

Threads example (Python)

 def run(self):
     thread = [ None ] * self.Nthreads
     t0 = time.time()
     print(self.data)
     for i in range(self.Nthreads):
         thread[i] = threading.Thread( target=self.run_thread, args=(i,) )
         thread[i].start()
     for i in range(self.Nthreads):
         thread[i].join()   # wait for thread i to finish
         print(time.time()-t0, "seconds. ", self.data)

if __name__ == '__main__':
 t = test(4)
 t.run()

$ python thread_python.py 
[0, 1, 2, 3]
0.0009775161743164062 seconds.  [0, 0, 0, 0]
1.0018701553344727 seconds.  [0, 1, 0, 0]
2.003377676010132 seconds.  [0, 1, 2, 0]
3.004056930541992 seconds.  [0, 1, 2, 3]

Computation of π with threads in Python

#!/usr/bin/env python
import os, sys, threading
from random import random, seed
from math import sqrt

NMAX = 10000000            # Nb of MC steps per thread
error_threshold = 1.0e-4   # Stopping criterion

class pi_calculator:
 def __init__(self, Nthreads):
     self.Nthreads = Nthreads
     self.results = []
     self.lock = threading.Lock()

 def compute_pi(self):
     result = 0.
     for i in range(NMAX):         # Loop NMAX times
         x,y = random(), random()   # Draw 2 random numbers x and y
         if x*x + y*y <= 1.:        # Check if (x,y) is in the circle
             result += 1
     with self.lock:   # protect the shared list of results
         self.results.append(4.* float(result)/float(NMAX))

Computation of π with threads in Python

 def run(self):
    thread = [None] * self.Nthreads
    for i in range(self.Nthreads):
       thread[i] = threading.Thread( target=self.compute_pi, args=()  )
       thread[i].start()
    print("All threads started")
    
    while True:
       for i in range(self.Nthreads):
          thread[i].join()
       N = len(self.results)
       average = sum(self.results)/N             # Compute average
       if N > 2:                         # Compute variance
          l = [ (x-average)*(x-average) for x in self.results ]
          variance = sum(l)/(N-1.)
       else:
          variance = 0.
       error = sqrt(variance)/sqrt(N)    # Compute error
       print("%f +/- %f %d"%(average, error, N))

Computation of π with threads in Python

       if N > 2 and error < error_threshold:  # Stopping condition
          return

       for i in range(self.Nthreads):
          thread[i] = threading.Thread( target=self.compute_pi, args=()  )
          thread[i].start()

if __name__ == '__main__':
 calc = pi_calculator(4)
 calc.run()

Note: this is inefficient in Python because the Global Interpreter Lock (GIL) prevents threads from executing Python bytecode simultaneously, but it illustrates the idea.

OpenMP

OpenMP

  • OpenMP is an extension of programming languages that enables multi-threaded parallelism through compiler directives
  • The OpenMP runtime may be implemented on top of pthreads
  • OpenMP 5.0 adds extensions to offload code execution to GPUs
  • The same source code can be compiled with or without OpenMP: to a non-OpenMP compiler, the directives are plain comments

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i)
    !$OMP DO
    do i=1,n
      A(i) = B(i) + C(i)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

OpenMP directives (Fortran)

  • !$OMP PARALLEL starts a new multi-threaded section. Everything inside this block is executed by all the threads
  • !$OMP DO tells the compiler to split the loop among the different threads (by changing the loop boundaries, for instance)
  • !$OMP END DO marks the end of the parallel loop. It contains an implicit synchronization: after this line, all the threads have finished executing the loop
  • !$OMP END PARALLEL marks the end of the parallel section. It also contains an implicit barrier
  • DEFAULT(SHARED): all the variables (A, B, C) are in shared memory by default
  • PRIVATE(i): each thread has its own private copy of the variable i

OpenMP directives (Fortran)

  • !$OMP CRITICAL ... !$OMP END CRITICAL: all the statements in this block are protected by a lock, so only one thread executes them at a time
  • !$OMP TASK ... !$OMP END TASK: defines a new task to execute
  • !$OMP BARRIER: synchronization barrier
  • !$OMP SINGLE ... !$OMP END SINGLE: all the statements in this block are executed by a single thread
  • !$OMP MASTER ... !$OMP END MASTER: all the statements in this block are executed by the master thread

A short example combining some of these directives is sketched below.
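
A minimal sketch using CRITICAL, BARRIER and SINGLE to count the running threads (program and variable names are illustrative; compile with -fopenmp):

    program count_threads
    implicit none
    integer :: nthreads
    nthreads = 0
    !$OMP PARALLEL DEFAULT(SHARED)
    !$OMP CRITICAL
    nthreads = nthreads + 1   ! protected by a lock: one thread at a time
    !$OMP END CRITICAL
    !$OMP BARRIER             ! wait until every thread has counted itself
    !$OMP SINGLE
    print *, 'Number of threads: ', nthreads
    !$OMP END SINGLE
    !$OMP END PARALLEL
    end program count_threads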

OpenMP

Functions

  • omp_get_thread_num(): returns the ID of the current thread (analogous to the MPI rank)
  • omp_get_num_threads(): returns the total number of running threads (analogous to the MPI size)
  • OMP_NUM_THREADS: environment variable (shell) that sets the number of threads to run
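
A minimal sketch using these two functions (the omp_lib module is provided by any OpenMP-enabled Fortran compiler):

    program hello_threads
    use omp_lib   ! provides omp_get_thread_num and omp_get_num_threads
    implicit none
    !$OMP PARALLEL
    print *, 'Thread ', omp_get_thread_num(), ' of ', omp_get_num_threads()
    !$OMP END PARALLEL
    end program hello_threads

Running OMP_NUM_THREADS=4 ./hello_threads prints one line per thread, in an unpredictable order.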

Important

  • Multiple threads can safely read from the same memory address
  • Multiple threads must not write to the same address at the same time (see the REDUCTION sketch below)
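
When every thread needs to accumulate into the same variable, the REDUCTION clause gives each thread a private copy and combines the copies at the end of the loop. A minimal sketch (the series computed here is only illustrative):

    program sum_reduction
    implicit none
    integer :: i
    double precision :: s
    s = 0.d0
    !$OMP PARALLEL DO REDUCTION(+:s)
    do i=1,1000000
      s = s + 1.d0/dble(i)**2   ! converges to pi**2/6
    end do
    !$OMP END PARALLEL DO
    print *, 'Sum: ', s
    end program sum_reduction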

Matrix multiplication

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do

Matrix multiplication with OpenMP

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

!$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE (i,j,k)
do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do
!$OMP END PARALLEL DO
  • The loop is parallelized over j
  • Writing to C(i,j) is safe: each thread handles distinct values of j, so no two threads write to the same address

Matrix multiplication with OpenMP

\[ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} \]

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE (i,j,k)
!$OMP DO COLLAPSE(2)
do j=1,N
 do i=1,N
   C(i,j) = 0.d0
   do k=1,N
     C(i,j) = C(i,j) + A(i,k) * B(k,j)
   end do
 end do
end do
!$OMP END DO
!$OMP END PARALLEL
  • COLLAPSE(2) fuses the two outer loops: the work is distributed over the N² pairs (i,j) instead of the N values of j, which balances the load better when N is small compared to the number of threads
  • Writing to C(i,j) is still safe: each pair (i,j) is handled by a single thread

Exercises

Exercise 1: Monte Carlo

  1. Write a Fortran
    double precision function compute_pi(M)
    that computes $\pi$ with the Monte Carlo algorithm using $M$ samples (a possible sketch is given after the main program)
  2. Call it using this main program:

    program pi_mc
    implicit none
    integer                      :: M
    logical                      :: iterate
    double precision             :: sample
    double precision, external   :: compute_pi
    M = 10000000        ! Nb of MC samples per batch, as in the Python example
    call random_seed()  ! Initialize the random number generator
    open(unit=11, file='fortran_out.fifo', status='old', action='write', &
     access='stream', form='formatted')
    iterate = .True.
    do while (iterate)  ! Compute pi over M samples until 'iterate=.False.'
      sample = compute_pi(M)
      write(11,*) sample
    end do
    end program pi_mc
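
A possible sketch of compute_pi, transcribing the Monte Carlo loop of the earlier Python example (write your own version before peeking):

    double precision function compute_pi(M)
    implicit none
    integer, intent(in) :: M
    integer             :: i, n_in
    double precision    :: x, y
    n_in = 0
    do i=1,M
      call random_number(x)          ! Draw 2 random numbers x and y
      call random_number(y)
      if (x*x + y*y <= 1.d0) then    ! Check if (x,y) is in the circle
        n_in = n_in + 1
      end if
    end do
    compute_pi = 4.d0 * dble(n_in) / dble(M)
    end function compute_pi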

Exercise 1: Monte Carlo

  3. Write a Python server pi_server.py that receives samples of π over a socket and computes the running average of π. Its address and port number are written in a file server.cfg.
  4. Write a Python script pi_bridge.py that reads samples of π from the named pipe fortran_out.fifo and sends them to the server. This script reads the address and port number of the server from the file server.cfg.
  5. When the convergence criterion is reached, the server informs the bridges that they can stop.

Running a simulation on multiple nodes

  • Run a single server
  • Run one bridge per compute node using the mpiexec command
  • Run one Fortran process per core using the mpiexec command

Exercise 2: Divide and conquer for matrix multiplication