
Deleting pods and other resources with graceful shutdown #2789

Closed
@smarterclayton

Description

@smarterclayton
Contributor

On Friday there was a discussion about how pods could be deleted and convey graceful shutdown of processes.

  • Deleting a pod implicitly conveys a request to terminate the processes in the pod
  • In general, users prefer graceful shutdown: giving processes the opportunity to shut down cleanly, which may take seconds, minutes, or even days in extreme cases
  • Processes may occasionally fail to terminate gracefully - they must then be force-killed
  • Some processes may never terminate due to kernel errors or bugs in code
  • While a pod "exists", the name the pod owns cannot be reused

Goals

  • Allow users to convey a grace period for shutdown as part of the act of deleting a pod (Consistently support graceful and immediate termination for all objects #1535)
    • Avoid creating a different verb for shutdown beyond HTTP delete.
  • Allow users to watch and wait for when all of the processes of a pod are no longer running via a specific API call
    • Due to the nature of processes this may run forever or for extended periods of time - it cannot be a synchronous http call
  • Users who create pods with specific names and then delete those pods often wish to reuse the names. The longer the interval between when the user deletes a pod and when they are able to POST with the same name again, the more likely the user is to view the delay as a failure of the system rather than as desired behavior.
  • Make it easy and efficient for API consumers to watch on important state transitions within the Pod - created -> scheduled, scheduled -> running, running -> completed.
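Concretely, the first two goals imply that a plain HTTP DELETE on the pod carries the grace period in its body rather than introducing a new verb. A hypothetical request body for `DELETE /pods/foo`, mirroring the DeleteOptions shape discussed later in this thread (field names illustrative, not a settled API), might look like:

```json
{"kind": "DeleteOptions", "gracePeriod": 30}
```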

Non-goals

Assumptions

  • If a pod is deleted, the deletion must complete in a finite amount of time - T(delete) - bounded by user input or expectation, in order to free the name for a subsequent create
  • Processes cannot be guaranteed to terminate within T(delete) and so if users wish to continue to watch for process termination, there must be an endpoint that displays the process status of a pod after it has been deleted
  • A pod name is not guaranteed to uniquely identify the pod across time, but the UID is, so when a user wishes to view the process status of a pod even if it is deleted, they should be able to watch on that pod by its UID (Make deleted objects available from API for some time. #1468)
  • If we assume an endpoint that displays pod info along with pod process info even after deletion, that endpoint can be present even prior to deletion and is the natural candidate to watch for process termination.
  • Automatically deleting a run-once pod shortly (<30 seconds) after it reaches completion is confusing for end users - pods would disappear without the opportunity to view the UID or react to status
    • API consumers may desire to specify the period after which run once pods are deleted at creation time

Options

TBD, sleep

Activity

added labels area/api and priority/backlog
on Dec 8, 2014
bgrant0607

bgrant0607 commented on Dec 8, 2014

@bgrant0607
Member

See also #1535 and #1468.

smarterclayton

smarterclayton commented on Dec 8, 2014

@smarterclayton
ContributorAuthor

Going to use this as a proposal issue, with those two as reference. Once I get to it... :)

smarterclayton

smarterclayton commented on Jan 24, 2015

@smarterclayton
ContributorAuthor

This is on my list after name uniquification and pod templates, if no one else gets to it.

modified the milestone: v1.0 on Feb 6, 2015
bgrant0607

bgrant0607 commented on Mar 5, 2015

@bgrant0607
Member

One issue that's come up lately: Nothing GCs terminated pods. We could use this issue for that, or file a new one.

smarterclayton

smarterclayton commented on Mar 5, 2015

@smarterclayton
ContributorAuthor

This is addressed by #5085, and copying the discussion from #1535:

Ok, here's the rough design I'm going with (with various caveats):

  1. Allow a Storage object to implement graceful deletion by implementing a new method Delete(ctx api.Context, name string, options *api.DeleteOptions) - i.e. DELETE /pods/foo {"kind":"DeleteOptions","gracePeriod":10}
  2. DeleteOptions is a "simple" resource that has a single *int64 GracePeriod that is an optional time to delete the resource. If GracePeriod is nil, the default value is used (which comes from the resource type and maybe even from the resource, once pods have a graceful shutdown value). If GracePeriod is 0, termination is immediate (equivalent to current behavior)
  3. The Storage object will check whether the object is already in the process of being deleted - a shorter GracePeriod will shorten the deletion timer, but a longer grace period will be ignored
  4. If the grace period is > 0 and an existing shorter grace period is not pending, I will set the TTL on the etcd record. This generates an Etcd watch event
  5. The resource exposed by etcd will have a new metadata field set "metadata.deleteAt" or "metadata.deleteTimestamp" or something similar that indicates the time that the resource will be deleted at.
  6. The kubelet would see the watch event on the pod and send a SIGTERM to the Docker container with a duration of "metadata.deleteAt - now" - Docker would then SIGKILL automatically
  7. The kubelet would not start a pod that has deleteAt set (even if it dies)
  8. At metadata.deleteAt the pod record will be removed by etcd via the TTL, and a delete watch event is sent

There is weirdness about removing the pod from the bindings, until the kubelet stops using bound pods we have an unclean setup. It can be worked around in a way that isn't visible to end users though.
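The grace-period rules in steps 2-3 above can be sketched in Go. This is a minimal illustration, not the actual kubernetes/kubernetes types: the `DeleteOptions` struct, the `resolveGracePeriod` helper, and the 10-second pod default are all assumptions for the sake of the example.

```go
package main

import "fmt"

// DeleteOptions is a sketch of the "simple" resource described above:
// nil GracePeriod means "use the default"; 0 means delete immediately.
type DeleteOptions struct {
	GracePeriod *int64
}

// Hypothetical per-resource default (pods may eventually carry their own).
const defaultPodGracePeriod int64 = 10

// resolveGracePeriod applies the proposed rules: a nil grace period falls
// back to the resource default, a shorter period than one already pending
// shortens the deletion timer, and a longer one is ignored.
func resolveGracePeriod(opts *DeleteOptions, pending *int64) int64 {
	requested := defaultPodGracePeriod
	if opts != nil && opts.GracePeriod != nil {
		requested = *opts.GracePeriod
	}
	// A longer grace period never extends a deletion already in progress.
	if pending != nil && *pending < requested {
		return *pending
	}
	return requested
}

func main() {
	gp := func(v int64) *int64 { return &v }
	fmt.Println(resolveGracePeriod(nil, nil))                                // 10: default applies
	fmt.Println(resolveGracePeriod(&DeleteOptions{GracePeriod: gp(0)}, nil)) // 0: immediate, current behavior
	fmt.Println(resolveGracePeriod(&DeleteOptions{GracePeriod: gp(30)}, gp(5))) // 5: pending shorter period wins
	fmt.Println(resolveGracePeriod(&DeleteOptions{GracePeriod: gp(3)}, gp(5)))  // 3: shorter request shortens the timer
}
```

The resulting value would then drive the etcd TTL in step 4 and the SIGTERM-to-SIGKILL window in step 6.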

smarterclayton

smarterclayton commented on Mar 5, 2015

@smarterclayton
ContributorAuthor

DeleteOptions should also support "reason" as described by #1535

bgrant0607

bgrant0607 commented on Mar 6, 2015

@bgrant0607
Member

Actually, reason parameter issue is #1462.

bgrant0607

bgrant0607 commented on Mar 6, 2015

@bgrant0607
Member

I see a couple issues:

  1. Wait while entities are shutdown gracefully. This requires a change to the desired state that indicates the object is in cleanup mode so that the responsible controller continues to work on shutting it down until it's done. For a pod, it would be executing pre- and (eventually) post-stop hooks and/or sending SIGTERM and waiting for the container to exit. For a replication controller, it would be waiting for all pods to be terminated. For a service, it could involve continuing to serve traffic for some amount of time, and then deleting cloud load balancers. For a namespace, it could involve deleting all objects within the namespace.
  2. Maintaining visibility of deleted objects for some time in order to facilitate observability of the final object status and/or cleanup progress.

Setting the object TTL is useful for (2). It feels like we need to treat (1) distinctly.
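For (1), the deletion intent has to be visible in the object itself so the responsible controller can keep reconciling toward shutdown. A sketch of what such cleanup-mode metadata might look like, using the metadata.deleteAt-style field from the proposal earlier in this thread (field name and values purely illustrative):

```json
{
  "metadata": {
    "name": "foo",
    "deleteAt": "2015-03-06T18:00:00Z"
  },
  "status": {
    "phase": "Running"
  }
}
```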

smarterclayton

smarterclayton commented on Mar 6, 2015

@smarterclayton
ContributorAuthor

On Mar 6, 2015, at 12:47 PM, Brian Grant notifications@github.com wrote:

> I see a couple issues:
>
> Wait while entities are shutdown gracefully. This requires a change to the desired state that indicates the object is in cleanup mode so that the responsible controller continues to work on shutting it down until it's done. For a pod, it would be executing pre- and (eventually) post-stop hooks and/or sending SIGTERM and waiting for the container to exit. For a replication controller, it would be waiting for all pods to be terminated. For a service, it could involve continuing to serve traffic for some amount of time, and then deleting cloud load balancers. For a namespace, it could involve deleting all objects within the namespace.
>
> Maintaining visibility of deleted objects for some time in order to facilitate observability of the final object status and/or cleanup progress.
>
> Setting the object TTL is useful for (2). It feels like we need to treat (1) distinctly.

Graceful delete period seems to me to be "the time I wait before I hard kill everything". At least in what you described in 1, I don't see the difference for pods / rc / services between the use of ttl (hard kill when ttl exceeded) and use of the grace period. Namespace I agree is slightly special.

modified the milestones: v1.0-post, v1.0 on Apr 27, 2015
removed this from the v1.0-post milestone on Jul 24, 2015
soltysh

soltysh commented on Dec 1, 2015

@soltysh
Contributor

@bgrant0607 is there anything here that still needs work? IIRC current pod deletion works as described by @smarterclayton in the issue description - unless you mean this point of yours, which, I admit, might be a reasonable solution for my use case in #17940:

Maintaining visibility of deleted objects for some time in order to facilitate observability of the final object status and/or cleanup progress.

bgrant0607

bgrant0607 commented on Dec 1, 2015

@bgrant0607
Member

@soltysh That specific detail is covered by #1468.

No resources other than pods currently support graceful termination.

We also need to implement server-side cascading deletion, as proposed in #1535.

lavalamp

lavalamp commented on Dec 3, 2015

@lavalamp
Member

Sounds like we can close this -- cascading deletion seems big enough to deserve its own issue, #1535 can serve for now.


Labels: area/api, area/usability, priority/backlog, sig/api-machinery


Participants: @soltysh, @lavalamp, @smarterclayton, @bgrant0607


          Deleting pods and other resources with graceful shutdown · Issue #2789 · kubernetes/kubernetes