10
0
mirror of https://github.com/LCPQ/quantum_package synced 2024-12-22 12:23:48 +01:00

Parallelism.md

Anthony Scemama 2016-02-09 18:55:51 +01:00
parent 491378c3b5
commit 2e2d2b17c8

@ -40,19 +40,68 @@ enable both multi-threaded and distributed support.
To initiate a parallel job, a `Newjob` message has to be sent to the scheduler,
with a job name (`state`) that will be checked at every connection to the
scheduler. This will create a new `Queuing_system` instance. The
`Queuing_system` contains
scheduler.
`new_job <state> <push_address_tcp> <push_address_inproc>`
The collector thread needs to opens a `PULL` socket bound both using the TCP
and the inproc protocols, and those endpoints need to be given together with
the `Newjob` message.
Now, a new `Queuing_system` instance is created. The `Queuing_system` contains
* A list of tasks
* A list of connected clients (empty at the initialization)
* The subset of tasks still queued
* The subset of tasks currently running, and on which client they run
To add tasks to the system, the `AddTask` message is sent. It can be
The format of tasks is up to the user : it is just a string.
To add new tasks to the system, the `AddTask` message is sent:
* `add_task <state> <string>` : Add a unique task to the `Queuing_system`
* `add_task <state> range <i> <j> ` : Builds a list of tasks (j) where
i<l<j.
* `add_task <state> triangle <i>` : Builds a list of tasks (l,i) where
1<l<i.
* `add_task <state> <msg>` : Builds a list of tasks (msg)
When workers connect to the `Queuing_system` with a `Connect` message, they
obtain as a reply the address of the `PULL` socket of the collector,
as well as a new Client ID :
`connect (tcp|inproc)`
Now the Workers connect a `PUSH` socket to the `PULL` endpoint of the
collector (TCP or inproc).
Workers are now ready to fetch new tasks using `GetTask` messages:
`get_task <state> <client_id>`
and the `Queuing_system` now knows which tasks runs on which client.
The reply is the task, as a string, and the corresponding Task ID.
When the task is done, the worker pushes the results to the collector,
and sends to `qp_run` a `TaskDone` message with the corresponding Task ID,
and its contribution to the control integer which will be accumulated in the
`Queuing_system` instance:
`task_done <state> <client_id> <task_id> <control>`
If the queue is empty, the reply to the GetTask` message is a `Terminate` message
which informs the workers to terminate.
When a worker terminates, it sends a `Disconnect` message to the scheduler:
`disconnect <state> <client_id>`
If there are remaining running clients, the reply is `0`. For the last client,
the reply contains the control integer, and this allows the worker to inform the
collector that all tasks are done and all workers are disconnected.
Once the collector thread has finished to pull all the data, it can terminate.
Now, the main thread can send an `End_job` message to the scheduler to inform
it that the parallel task is done.