A run is the primary unit of workload in dstack. Users can:
- Submit a run using `dstack apply` or the API.
- Stop a run using `dstack stop` or the API.
Runs are created from run configurations. There are three types of run configurations:
- `dev-environment`: runs a VS Code server.
- `task`: runs the user's bash script until completion.
- `service`: runs the user's bash script and exposes a port through `dstack-proxy`.
A run can spawn one or multiple jobs, depending on the configuration. A task that specifies multiple nodes spawns a job for every node (a multi-node task). A service that specifies multiple replicas spawns a job for every replica. A job submission is always assigned to one particular instance. If a job fails and the configuration allows retrying, the server creates a new job submission for the job.
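As a rough illustration of how a configuration maps to jobs, here is a minimal sketch. `RunSpec` and `job_count` are hypothetical names, not part of dstack. The sketch also assumes that a dev environment always runs as a single job, which the text above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class RunSpec:
    """Hypothetical, simplified stand-in for a run configuration."""

    type: str          # "dev-environment", "task", or "service"
    nodes: int = 1     # used by tasks only
    replicas: int = 1  # used by services only


def job_count(spec: RunSpec) -> int:
    """Return how many jobs a run with this configuration would spawn."""
    if spec.type == "task":
        return spec.nodes      # one job per node (multi-node task)
    if spec.type == "service":
        return spec.replicas   # one job per replica
    return 1                   # assumption: a dev environment is a single job
```

For example, `job_count(RunSpec("task", nodes=4))` yields four jobs, one per node.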
- STEP 1: The user submits the run. `services.runs.submit_run` creates jobs with status `SUBMITTED`. Now the run has status `SUBMITTED`.
- STEP 2: `background.tasks.process_runs` periodically pulls unfinished runs and processes them:
    - If any job is `RUNNING`, the run becomes `RUNNING`.
    - If any job is `PROVISIONING` or `PULLING`, the run becomes `PROVISIONING`.
    - If any job fails and cannot be retried, the run becomes `TERMINATING`, and after processing, `FAILED`.
    - If all jobs are `DONE`, the run becomes `TERMINATING`, and after processing, `DONE`.
    - If any job fails, can be retried, and there is any other active job, the failed job is resubmitted in-place.
    - If any jobs in a replica fail and can be retried, and there are other active replicas, the jobs of the failed replica are resubmitted in-place (without stopping other replicas). Note that if some jobs in a replica fail, all jobs in that replica are terminated and resubmitted. This includes multi-node tasks, which represent one replica with multiple jobs.
    - If all jobs fail and can be resubmitted, the run becomes `PENDING`.
- STEP 3: If the run is `TERMINATING`, the server makes all jobs `TERMINATING`. `background.tasks.process_runs` sets their status to `TERMINATING`, assigns `JobTerminationReason`, and sends a graceful stop command to `dstack-runner`. `process_terminating_jobs` then ensures that the jobs are terminated and assigns a finished status.
- STEP 4: Once all jobs are finished, the run becomes `TERMINATED`, `DONE`, or `FAILED` based on `RunTerminationReason`.
- STEP 0: If the run is `PENDING`, `background.tasks.process_runs` will resubmit jobs. The run becomes `SUBMITTED` again.
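The STEP 2 rules can be sketched as a pure function from job states to the next run status. This is a simplified illustration, not the actual `process_runs` code: `Status`, `Job`, and `next_run_status` are hypothetical names, a single enum stands in for both run and job statuses, and the order in which the rules are checked (non-retryable failure first) is an assumption, as is falling back to `SUBMITTED` when no rule matches.

```python
from enum import Enum
from typing import List, NamedTuple


class Status(str, Enum):
    SUBMITTED = "submitted"
    PROVISIONING = "provisioning"
    PULLING = "pulling"
    RUNNING = "running"
    TERMINATING = "terminating"
    FAILED = "failed"
    DONE = "done"
    PENDING = "pending"


class Job(NamedTuple):
    status: Status
    can_retry: bool = False  # only meaningful for failed jobs


def next_run_status(jobs: List[Job]) -> Status:
    """Derive the run status from its jobs (rule order is an assumption)."""
    failed = [j for j in jobs if j.status is Status.FAILED]
    # A failure that cannot be retried terminates the whole run.
    if any(not j.can_retry for j in failed):
        return Status.TERMINATING
    # All jobs finished successfully: terminate, then the run ends as DONE.
    if jobs and all(j.status is Status.DONE for j in jobs):
        return Status.TERMINATING
    # Every job failed but can be retried: wait for resubmission.
    if jobs and len(failed) == len(jobs):
        return Status.PENDING
    if any(j.status is Status.RUNNING for j in jobs):
        return Status.RUNNING
    if any(j.status in (Status.PROVISIONING, Status.PULLING) for j in jobs):
        return Status.PROVISIONING
    return Status.SUBMITTED
```

A retryable failure next to an active job leaves the run status unchanged here, because the failed job is resubmitted in-place rather than stopping the run.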
Use `switch_run_status()` for all status transitions. Do not set `RunModel.status` directly.
Nothing except `services.runs.process_terminating_run` may assign a finished status to the run. To terminate the run, assign the `TERMINATING` status and a `RunTerminationReason`.
Services' lifecycle has some modifications:
- During STEP 1, the service is registered on the gateway. If the gateway is not accessible or the domain name is taken, the run submission fails.
- During STEP 2, downscaled jobs are ignored.
- During STEP 4, the service is unregistered on the gateway.
- During STEP 0, the service can stay in `PENDING` status if it was downscaled to zero (WIP).
dstack retries the run only if:
- The configuration enables `retry`.
- The job termination reason is covered by `retry.on_events`.
- The `retry.duration` is not exceeded.
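The three conditions combine with a logical AND. The sketch below shows one way to express that. `RetryPolicy` and `should_retry` are hypothetical names, the event strings are illustrative placeholders, and it is assumed that the job termination reason has already been mapped to an event and that `elapsed` is measured from whatever starting point the server uses for `retry.duration`.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Optional, Set


@dataclass
class RetryPolicy:
    """Hypothetical stand-in for the `retry` section of a run configuration."""

    on_events: Set[str] = field(default_factory=set)
    duration: timedelta = timedelta(hours=1)


def should_retry(
    policy: Optional[RetryPolicy], event: str, elapsed: timedelta
) -> bool:
    """Return True only if all three retry conditions hold."""
    if policy is None:                 # the configuration does not enable retry
        return False
    if event not in policy.on_events:  # the event is not covered
        return False
    return elapsed <= policy.duration  # the retry window is not exceeded
```

With `RetryPolicy(on_events={"no-capacity"}, duration=timedelta(minutes=30))`, a covered event after 10 minutes would be retried, while the same event after 45 minutes would not.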
- STEP 1: A newly submitted job has status `SUBMITTED`. It is not assigned to any instance yet.
- STEP 2: `background.tasks.process_submitted_jobs` tries to assign an existing instance or provision a new one.
    - On success, the job becomes `PROVISIONING`.
    - On failure, the job becomes `TERMINATING`, and after processing, `FAILED` because of `FAILED_TO_START_DUE_TO_NO_CAPACITY`.
- STEP 3: `background.tasks.process_running_jobs` periodically pulls unfinished jobs and processes them.
    - While `dstack-shim`/`dstack-runner` is not responding, the job stays `PROVISIONING`.
    - Once `dstack-shim` (for VM-featured backends) becomes available, the server submits the docker image name, and the job becomes `PULLING`.
    - Once `dstack-runner` inside a docker container becomes available, the server submits the code and the job spec, and the job becomes `RUNNING`.
    - If `dstack-shim` or `dstack-runner` does not respond for a long time, or fails to respond after a successful connection and multiple retries, the job becomes `TERMINATING`, and after processing, `FAILED`.
- STEP 4: `background.tasks.process_running_jobs` processes `RUNNING` jobs, pulling job logs, runner logs, and job status.
    - If the pulled status is `DONE`, the job becomes `TERMINATING`, and after processing, `DONE`.
    - Otherwise, the job becomes `TERMINATING`, and after processing, `FAILED`.
- STEP 5: `background.tasks.process_terminating_jobs` processes `TERMINATING` jobs.
    - If the job has `remove_at` in the future, nothing happens. This gives the job some time for a graceful stop.
    - Once `remove_at` is in the past, it stops the container via `dstack-shim`, detaches instance volumes, and releases the instance. The job becomes `TERMINATED`, `DONE`, `FAILED`, or `ABORTED` based on `JobTerminationReason`.
    - If some volumes fail to detach, it keeps the job `TERMINATING` and keeps checking the volume attachment status.
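The STEP 5 decision can be sketched as a small function. This is an illustration under stated assumptions, not the real `process_terminating_jobs`: the function name and parameters are hypothetical, the finished status is assumed to have been derived from `JobTerminationReason` elsewhere, and a missing `remove_at` is assumed to mean that there is no grace period.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def process_terminating(
    remove_at: Optional[datetime],
    volumes_detached: bool,
    finished_status: str,
    now: datetime,
) -> str:
    """Return the job status after one processing pass over a TERMINATING job."""
    # Grace period: give the job time to stop gracefully.
    if remove_at is not None and remove_at > now:
        return "TERMINATING"
    # The container is stopped and volume detachment is attempted at this point;
    # if any volume is still attached, the job stays TERMINATING for a later pass.
    if not volumes_detached:
        return "TERMINATING"
    # Everything is released: assign the finished status.
    return finished_status


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
still_waiting = process_terminating(now + timedelta(seconds=10), True, "FAILED", now)
finished = process_terminating(now - timedelta(seconds=10), True, "FAILED", now)
```

Here `still_waiting` stays `"TERMINATING"` because `remove_at` is in the future, while `finished` takes the supplied finished status.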
Use `switch_job_status()` for all status transitions. Do not set `JobModel.status` directly.
Nothing except `services.jobs.process_terminating_job` may assign a finished status to the job. To terminate the job, assign the `TERMINATING` status and a `JobTerminationReason`.
Services' jobs lifecycle has some modifications:
- During STEP 3, once the job becomes `RUNNING`, it is registered on the gateway as a replica. If the gateway is not accessible, the job fails.
- During STEP 5, the job is unregistered on the gateway (WIP).