diff --git a/Parallelism.md b/Parallelism.md new file mode 100644 index 0000000..3bc70d4 --- /dev/null +++ b/Parallelism.md @@ -0,0 +1,61 @@ +================ +Task parallelism +================ + + +A task scheduler is implemented in OCaml and runs in the `qp_run` program. +The IRPF90 programs communicate with this scheduler to fetch new tasks to do. +The flexibility of IRPF90 enables to build micro-services to help a running +program. For example, if the computation of the AOs takes too much time, it is +possible to start on multiple compute nodes some tasks that will accelerate +the running program. + +The typical scheme is the following: + +1) The program (IRPF90) asks `qp_run` to create a new queue for a state of + the calculation +2) The program adds multiple tasks to do to the queue +3) The program starts a **collector** thread that waits for the results + computed by the workers +4) The program starts multiple **worker** threads that will fetch tasks to do + from the queue, compute the corresponding task, and send the result directly + to the collector. Then, the queue is informed that the task has been done +5) When the queue is empty and all workers have sent their results, the + last worker receives from `qp_run` a *control* integer, and sends it to + the collector thread +6) The collector thread checks that the control integer is correct : this can be + for instance the number of AOs to compute and the number of actually + computed AOs. +7) The parallel section is terminated + + +The task scheduler +------------------ + +The task scheduler (implemented in the `TaskServer.ml` file) understands text +messages, and transforms them in typed messages. The list of understood +messages can be found in the `of_string` function of the `Message.ml` file. + +When `qp_run` starts, it tries to opens a ZeroMQ REP socket on a default port, +and tries again on different ports until a suitable port range is found. +The port is bound using both the TCP protocal and the inproc protocol to +enable both multi-threaded and distributed support. + +To initiate a parallel job, a `Newjob` message has to be sent to the scheduler, +with a job name (`state`) that will be checked at every connection to the +scheduler. This will create a new `Queuing_system` instance. The +`Queuing_system` contains + +* A list of tasks +* A list of connected clients (empty at the initialization) +* The subset of tasks still queued +* The subset of tasks currently running, and on which client they run + +To add tasks to the system, the `AddTask` message is sent. It can be + +* `add_task range ` : Builds a list of tasks (j) where + i triangle ` : Builds a list of tasks (l,i) where + 1 ` : Builds a list of tasks (msg) +