mirror of https://github.com/LCPQ/quantum_package synced 2024-12-22 12:23:48 +01:00

Parallelism.md

Anthony Scemama 2016-02-09 18:29:18 +01:00
parent 87f84156d2
commit 0b21534a36


@ -0,0 +1,61 @@
================
Task parallelism
================
A task scheduler is implemented in OCaml and runs in the `qp_run` program.
The IRPF90 programs communicate with this scheduler to fetch new tasks to do.
The flexibility of IRPF90 makes it possible to build micro-services that help
a running program. For example, if the computation of the AOs takes too much
time, additional tasks can be started on multiple compute nodes to accelerate
the running program.
The typical scheme is the following:
1) The program (IRPF90) asks `qp_run` to create a new queue for a state of
the calculation
2) The program adds multiple tasks to do to the queue
3) The program starts a **collector** thread that waits for the results
computed by the workers
4) The program starts multiple **worker** threads that will fetch tasks to do
from the queue, compute the corresponding task, and send the result directly
to the collector. Then, the queue is informed that the task has been done
5) When the queue is empty and all workers have sent their results, the
last worker receives from `qp_run` a *control* integer, and sends it to
the collector thread
6) The collector thread checks that the control integer is correct: for
instance, it can be the number of AOs to compute, which is compared with the
number of AOs actually computed.
7) The parallel section is terminated
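
The scheme above can be sketched in a few lines. This is a Python illustration using only threads and in-process queues, not the IRPF90/OCaml code of the project: the function name `run_parallel_section`, the squared-integer "computation" and the tuple layout of the messages are all assumptions made for the example, and the worker that finishes the last task produces the control integer itself instead of receiving it from `qp_run`.

```python
import queue
import threading


def run_parallel_section(n_tasks, n_workers=4):
    """Toy model of steps 1-7; returns the results gathered by the collector."""
    if n_tasks == 0:
        return {}
    tasks = queue.Queue()            # steps 1-2: create the queue and fill it
    for task_id in range(n_tasks):
        tasks.put(task_id)
    results = queue.Queue()          # channel from the workers to the collector
    collected = {}
    status = []                      # filled by the collector at the end
    done = 0
    lock = threading.Lock()

    def collector():
        # Step 3: wait for results until the control integer arrives.
        while True:
            kind, key, value = results.get()
            if kind == "control":
                # Step 6: the control integer must match what was received.
                status.append(key == len(collected))
                return
            collected[key] = value

    def worker():
        nonlocal done
        # Step 4: fetch a task, compute it, send the result to the collector.
        while True:
            try:
                task_id = tasks.get_nowait()
            except queue.Empty:
                return
            results.put(("result", task_id, task_id * task_id))
            with lock:               # report the task as done
                done += 1
                last = done == n_tasks
            if last:
                # Step 5: the last worker forwards the control integer.
                results.put(("control", n_tasks, None))

    threads = [threading.Thread(target=collector)]
    threads += [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:                # step 7: the parallel section ends
        t.join()
    if status != [True]:
        raise RuntimeError("control integer mismatch")
    return collected
```

Each worker sends its result before reporting the task as done, so the control integer is always queued after every result and the collector can safely stop when it sees it.
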
The task scheduler
------------------
The task scheduler (implemented in the `TaskServer.ml` file) understands text
messages and transforms them into typed messages. The list of understood
messages can be found in the `of_string` function of the `Message.ml` file.
When `qp_run` starts, it tries to open a ZeroMQ REP socket on a default port,
and tries again on different ports until a suitable port range is found.
The port is bound using both the TCP protocol and the inproc protocol to
support both multi-threaded and distributed execution.
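
The retry-until-bound logic can be illustrated as follows. This sketch deliberately uses plain TCP sockets from the Python standard library instead of ZeroMQ, so it only shows the port search, not the REP socket or the inproc endpoint. The function name `bind_port_range`, the number of ports in a range and the retry limit are assumptions made for the example, not values taken from `qp_run`.

```python
import socket


def bind_port_range(first_port, n_ports=2, max_tries=100, host="127.0.0.1"):
    """Bind n_ports consecutive TCP ports, starting the search at first_port.

    Returns (base_port, sockets). If any port of a candidate range is busy,
    the whole range is released and the next range is tried.
    """
    for attempt in range(max_tries):
        base = first_port + attempt * n_ports
        sockets = []
        try:
            for port in range(base, base + n_ports):
                s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
                sockets.append(s)
                s.bind((host, port))
        except (OSError, OverflowError):
            # Port already in use (or out of range): release and move on.
            for s in sockets:
                s.close()
            continue
        return base, sockets
    raise RuntimeError("no free port range found")
```

The caller owns the returned sockets and is responsible for closing them.
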
To initiate a parallel job, a `Newjob` message has to be sent to the scheduler,
with a job name (`state`) that will be checked at every connection to the
scheduler. This will create a new `Queuing_system` instance. The
`Queuing_system` contains:
* A list of tasks
* A list of connected clients (empty at initialization)
* The subset of tasks still queued
* The subset of tasks currently running, and on which client they run
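
The four components listed above map naturally onto a small record. The following Python sketch is only a model of that data structure: the class name `QueuingSystem`, its method names and the choice of integer task ids are assumptions for illustration and do not reproduce the OCaml `Queuing_system` module.

```python
from dataclasses import dataclass, field


@dataclass
class QueuingSystem:
    tasks: dict = field(default_factory=dict)    # task id -> task description
    clients: set = field(default_factory=set)    # connected clients, empty at first
    queued: list = field(default_factory=list)   # ids of tasks still queued
    running: dict = field(default_factory=dict)  # task id -> client running it

    def add_task(self, description):
        """Register a task and put it in the queued subset."""
        task_id = len(self.tasks) + 1
        self.tasks[task_id] = description
        self.queued.append(task_id)
        return task_id

    def add_client(self, client_id):
        self.clients.add(client_id)

    def pop_task(self, client_id):
        """Move the oldest queued task to the running subset for this client."""
        if client_id not in self.clients:
            raise KeyError(f"unknown client {client_id!r}")
        if not self.queued:
            return None
        task_id = self.queued.pop(0)
        self.running[task_id] = client_id
        return task_id, self.tasks[task_id]

    def end_task(self, task_id, client_id):
        """Mark a running task as finished by the client that ran it."""
        if self.running.get(task_id) != client_id:
            raise ValueError("task is not running on this client")
        del self.running[task_id]
```

A task is always in exactly one of three situations: queued, running on a known client, or finished (present in `tasks` only).
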
To add tasks to the system, the `AddTask` message is sent. It can take one of
the following forms:
* `add_task <state> range <i> <j>` : Builds a list of tasks (l) where
`i<l<j`.
* `add_task <state> triangle <i>` : Builds a list of tasks (l,i) where
`1<l<i`.
* `add_task <state> <msg>` : Builds a list containing the single task (msg)
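
As an illustration of how these three forms expand into tasks, here is a short Python sketch. It is not the parser of `Message.ml`: the function name `expand_add_task` and the representation of tasks as tuples are assumptions, and the strict bounds are taken literally from the description above.

```python
def expand_add_task(message):
    """Return (state, tasks) for an `add_task` text message."""
    words = message.split()
    if len(words) < 3 or words[0] != "add_task":
        raise ValueError(f"not an add_task message: {message!r}")
    state, args = words[1], words[2:]
    if args[0] == "range" and len(args) == 3:
        i, j = int(args[1]), int(args[2])
        tasks = [(l,) for l in range(i + 1, j)]     # i < l < j
    elif args[0] == "triangle" and len(args) == 2:
        i = int(args[1])
        tasks = [(l, i) for l in range(2, i)]       # 1 < l < i
    else:
        tasks = [(" ".join(args),)]                 # free-form message
    return state, tasks
```

For example, `expand_add_task("add_task ao range 1 5")` gives the state `"ao"` and the three tasks `(2,)`, `(3,)` and `(4,)`.
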