mirror of
https://github.com/LCPQ/quantum_package
synced 2024-12-22 12:23:48 +01:00
Parallelism.md
parent
87f84156d2
commit
0b21534a36
61
Parallelism.md
Normal file
61
Parallelism.md
Normal file
@ -0,0 +1,61 @@
|
||||
================
|
||||
Task parallelism
|
||||
================
|
||||
|
||||
|
||||
A task scheduler is implemented in OCaml and runs in the `qp_run` program.
|
||||
The IRPF90 programs communicate with this scheduler to fetch new tasks to do.
|
||||
The flexibility of IRPF90 enables to build micro-services to help a running
|
||||
program. For example, if the computation of the AOs takes too much time, it is
|
||||
possible to start on multiple compute nodes some tasks that will accelerate
|
||||
the running program.
|
||||
|
||||
The typical scheme is the following:
|
||||
|
||||
1) The program (IRPF90) asks `qp_run` to create a new queue for a state of
|
||||
the calculation
|
||||
2) The program adds multiple tasks to do to the queue
|
||||
3) The program starts a **collector** thread that waits for the results
|
||||
computed by the workers
|
||||
4) The program starts multiple **worker** threads that will fetch tasks to do
|
||||
from the queue, compute the corresponding task, and send the result directly
|
||||
to the collector. Then, the queue is informed that the task has been done
|
||||
5) When the queue is empty and all workers have sent their results, the
|
||||
last worker receives from `qp_run` a *control* integer, and sends it to
|
||||
the collector thread
|
||||
6) The collector thread checks that the control integer is correct : this can be
|
||||
for instance the number of AOs to compute and the number of actually
|
||||
computed AOs.
|
||||
7) The parallel section is terminated
|
||||
|
||||
|
||||
The task scheduler
|
||||
------------------
|
||||
|
||||
The task scheduler (implemented in the `TaskServer.ml` file) understands text
|
||||
messages, and transforms them in typed messages. The list of understood
|
||||
messages can be found in the `of_string` function of the `Message.ml` file.
|
||||
|
||||
When `qp_run` starts, it tries to opens a ZeroMQ REP socket on a default port,
|
||||
and tries again on different ports until a suitable port range is found.
|
||||
The port is bound using both the TCP protocal and the inproc protocol to
|
||||
enable both multi-threaded and distributed support.
|
||||
|
||||
To initiate a parallel job, a `Newjob` message has to be sent to the scheduler,
|
||||
with a job name (`state`) that will be checked at every connection to the
|
||||
scheduler. This will create a new `Queuing_system` instance. The
|
||||
`Queuing_system` contains
|
||||
|
||||
* A list of tasks
|
||||
* A list of connected clients (empty at the initialization)
|
||||
* The subset of tasks still queued
|
||||
* The subset of tasks currently running, and on which client they run
|
||||
|
||||
To add tasks to the system, the `AddTask` message is sent. It can be
|
||||
|
||||
* `add_task <state> range <i> <j> ` : Builds a list of tasks (j) where
|
||||
i<l<j.
|
||||
* `add_task <state> triangle <i>` : Builds a list of tasks (l,i) where
|
||||
1<l<i.
|
||||
* `add_task <state> <msg>` : Builds a list of tasks (msg)
|
||||
|
Loading…
Reference in New Issue
Block a user