Description
The current job termination system uses max-kill-count to limit the number of attempts to kill a job before draining nodes with "unkillable user processes." While this provides fine-grained control, it makes it difficult for administrators to answer a simple operational question: "How long should the system wait before giving up and draining nodes?"
The termination sequence involves a complex escalation schedule:
- Initial kill-timeout for SIGTERM → SIGKILL to tasks
- Delay of 5*kill-timeout before switching to signaling shells
- Exponential backoff starting at kill-timeout, doubling each attempt (capped at 300s)
- Limited by max-kill-count total attempts
To determine the actual wall-clock time before draining, an administrator must:
- Understand the multi-phase escalation algorithm
- Calculate the sum of: task kill attempts + 5*delay + exponential backoff series
- Account for the 300s cap on individual timeouts
- Adjust max-kill-count to achieve their desired total duration
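To illustrate how awkward this calculation is, here is a rough Python sketch of the arithmetic. It is a simplified model, not the actual implementation: it assumes the total is one kill-timeout for SIGTERM → SIGKILL, plus the 5*kill-timeout delay, plus one backoff wait per kill attempt. The function names are hypothetical, and the exact way the phases overlap in the real code may differ.

```python
BACKOFF_CAP = 300.0  # seconds; the per-attempt cap described above


def estimate_drain_delay(kill_timeout, max_kill_count):
    """Estimate seconds from the first signal until nodes are drained.

    Simplified model (an assumption, not the real scheduler logic):
    SIGTERM -> SIGKILL wait, then the 5*kill-timeout delay, then one
    capped, doubling backoff wait per kill attempt.
    """
    if kill_timeout <= 0:
        raise ValueError("kill_timeout must be positive")
    total = kill_timeout          # SIGTERM -> SIGKILL to tasks
    total += 5 * kill_timeout     # delay before signaling shells
    delay = kill_timeout
    for _ in range(max_kill_count):
        total += min(delay, BACKOFF_CAP)  # individual waits are capped
        delay *= 2                        # exponential backoff
    return total


def kill_count_for_budget(kill_timeout, budget):
    """Largest max-kill-count whose estimated delay fits in `budget` seconds."""
    count = 0
    while estimate_drain_delay(kill_timeout, count + 1) <= budget:
        count += 1
    return count


if __name__ == "__main__":
    # "Give jobs 30 minutes" with kill-timeout=5s under this model:
    print(kill_count_for_budget(5, 30 * 60))
```

Even under this simplified model, the answer for a 30 minute budget depends on kill-timeout, the cap, and the backoff series, which is the core of the usability problem.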
This is error-prone and unintuitive. Common administrative requirements like "give jobs 30 minutes to clean up before draining nodes" cannot be directly configured.
A time-based cap on how long a job is given after an initial exception or termination signal should be added. It would override max-kill-count and be easier for administrators to understand.
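One possible shape for the decision logic, as a minimal sketch only. The `kill_deadline` parameter and the `should_drain` function are hypothetical names used for illustration; no such option exists today, and the real design may differ.

```python
from typing import Optional


def should_drain(
    elapsed: float,
    kill_count: int,
    max_kill_count: int,
    kill_deadline: Optional[float] = None,
) -> bool:
    """Decide whether to stop trying to kill the job and drain its nodes.

    `kill_deadline` is a hypothetical time-based cap in seconds, measured
    from the initial exception or termination signal. When it is set, it
    overrides max_kill_count, as proposed above.
    """
    if kill_deadline is not None:
        return elapsed >= kill_deadline
    # Fall back to the existing count-based limit.
    return kill_count >= max_kill_count
```

With a cap like this, "give jobs 30 minutes" becomes a single setting of 1800 seconds, regardless of how many kill attempts fit into that window.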
cc: @kkier