exec: add time-based limit for job termination before draining nodes

The current job termination system uses `max-kill-count` to limit the number of attempts to kill a job before draining nodes with "unkillable user processes." While this provides fine-grained control, it makes it difficult for administrators to answer a simple operational question: "How long should the system wait before giving up and draining nodes?"

The termination sequence involves a complex escalation schedule:

 - Initial `kill-timeout` for SIGTERM → SIGKILL to tasks
 - Delay of `5*kill-timeout` before switching to signaling shells
 - Exponential backoff starting at `kill-timeout`, doubling each attempt (capped at 300s)
 - Limited by `max-kill-count` total attempts

To determine the actual wall-clock time before draining, an administrator must:
 - Understand the multi-phase escalation algorithm
 - Calculate the sum of: task kill attempts + the `5*kill-timeout` delay + exponential backoff series
 - Account for the 300s cap on individual timeouts
 - Adjust `max-kill-count` to achieve their desired total duration
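
To illustrate how indirect this is, here is a rough sketch of the arithmetic in Python. It is not taken from the implementation: the function names are hypothetical, the input values are only examples, and it assumes the three phases listed above run strictly one after another. The real schedule may overlap phases or count attempts differently, so treat the result as an estimate.

```python
# Hypothetical estimate of the wall-clock time before nodes are drained.
# Assumption: the phases run sequentially (initial kill-timeout, then the
# 5*kill-timeout delay, then one backoff interval per kill attempt).

BACKOFF_CAP = 300.0  # seconds, the cap on individual timeouts


def backoff_series(kill_timeout, max_kill_count, cap=BACKOFF_CAP):
    """Return the per-attempt timeouts: kill_timeout doubling, capped."""
    return [min(kill_timeout * 2**i, cap) for i in range(max_kill_count)]


def estimate_drain_delay(kill_timeout, max_kill_count, cap=BACKOFF_CAP):
    """Sum the initial timeout, the shell-signaling delay, and the backoff."""
    initial = kill_timeout            # SIGTERM -> SIGKILL to tasks
    shell_delay = 5 * kill_timeout    # delay before signaling shells
    return initial + shell_delay + sum(
        backoff_series(kill_timeout, max_kill_count, cap)
    )


def min_kill_count_for(target_seconds, kill_timeout, cap=BACKOFF_CAP):
    """Smallest max-kill-count whose estimated delay reaches the target."""
    count = 0
    while estimate_drain_delay(kill_timeout, count, cap) < target_seconds:
        count += 1
    return count


if __name__ == "__main__":
    # Example inputs only: a 5s kill-timeout and 8 attempts.
    print(estimate_drain_delay(5, 8))
    # "Give jobs 30 minutes" has to be translated into an attempt count.
    print(min_kill_count_for(30 * 60, 5))
```

Even under these simplified assumptions, "30 minutes" has to be reverse-engineered into an attempt count, and the answer changes whenever `kill-timeout` changes.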

This is error-prone and unintuitive. Common administrative requirements like "give jobs 30 minutes to clean up before draining nodes" cannot be directly configured.

A time-based cap should be added on how long jobs are given after an initial exception or termination signal. It would override `max-kill-count` and be simpler for administrators to understand.

cc: @kkier

exec: add time-based limit for job termination before draining nodes #7297