
kube-apiserver endpoint cleanup when --apiserver-count>1 #22609

Closed
Calpicow opened this issue Mar 6, 2016 · 40 comments · Fixed by #51698
Labels
area/HA, priority/backlog, sig/api-machinery, sig/cluster-lifecycle

Comments

@Calpicow
Contributor

Calpicow commented Mar 6, 2016

I'm using v1.2.0-beta.0 and running with --apiserver-count=2 in my cluster. I would expect the kubernetes service endpoints to be cleaned up when one of the apiservers goes offline. This doesn't happen, though, and it causes ~50% of apiserver requests to fail.

@dchen1107 dchen1107 added sig/node and kind/support labels Mar 8, 2016
@dchen1107 dchen1107 self-assigned this Mar 8, 2016
@dchen1107
Member

I haven't tried to reproduce the issue yet, but I suspect you might be hitting #22625 here. There is a pending PR #22626 to resolve #22625. Could you please try to patch your cluster and see if you still have the issue? Thanks!

@Calpicow
Contributor Author

Calpicow commented Mar 9, 2016

I don't believe that issue is related. When I take down a master, I can see the node and pods being removed. What's not being removed is the second apiserver's IP from the kubernetes endpoints.

@dchen1107
Member

cc @mikedanese @lavalamp, who might have seen this issue before.

@mikedanese mikedanese added priority/backlog, area/HA, and team/control-plane labels and removed kind/support and sig/node labels Mar 9, 2016
@mikedanese
Member

This isn't exactly a bug. It's by design, but the design could likely be better.

@lavalamp
Member

Yes, we don't expect an apiserver to go down and stay down without a replacement.

@lavalamp
Member

We could make this more robust by having apiservers count themselves, e.g. by each separately making an entry in etcd somewhere.

For now, if you change the number of apiservers that are running, you must update the --apiserver-count= flag and restart all apiservers.

@hanikesn

What happens if one of them crashes or its machine goes down? The kubernetes service endpoints would not be usable, as 50% of the requests would fail.

@victorgp

This is actually an important issue that can break the high availability of the whole cluster.
The DNS service relies on the default kubernetes service to connect to the API. If an apiserver goes down then, as hanikesn says, 50% of the requests will fail, so the DNS service will fail, and with it the whole cluster might fail. Is there any plan to change the design around apiserver-count?

@javefang

javefang commented Jul 7, 2016

Agreed with @victorgp. We are creating a high-availability cluster, and some services like DNS and Traefik do rely on the default kubernetes endpoints. I can work around this by forcing them to use the load balancer's URL directly (basically not using the default endpoints), but I feel the kubernetes service endpoints should be kept as consistent as any other service's.

@timothysc
Member

Does this still exist, @ncdc, after the recent change to endpoint updates?

@hanikesn

This issue still exists with 1.4.1.

@ncdc
Member

ncdc commented Oct 17, 2016

@timothysc yes, it still exists. We have an endpoint reconciler for the kubernetes service in OpenShift that uses etcd v2's key TTL mechanism to maintain a lease. If an apiserver goes down, one of the remaining members will remove the dead backend IP from the list of endpoints. If we can agree on a mechanism to do this in Kube (there were some concerns about etcd key TTL), I'd be happy to put together a PR.
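
For context, the etcd v2 key TTL mechanism referenced here works roughly as follows. This is a minimal sketch using the etcd v2 Go client; the key path, TTL, and refresh interval are illustrative assumptions, not the actual OpenShift values:

```go
package lease

import (
	"context"
	"log"
	"time"

	etcd "github.com/coreos/etcd/client"
)

// refreshLease repeatedly writes this apiserver's IP under a per-master key
// with a TTL. As long as the apiserver keeps refreshing the key it stays in
// the set; if the apiserver dies, the key expires and a surviving apiserver
// can drop the IP from the kubernetes endpoints on its next reconcile.
func refreshLease(kapi etcd.KeysAPI, ip string) {
	for {
		// Illustrative key path and TTL only.
		_, err := kapi.Set(context.Background(),
			"/masterleases/"+ip, ip,
			&etcd.SetOptions{TTL: 15 * time.Second})
		if err != nil {
			log.Printf("failed to refresh apiserver lease: %v", err)
		}
		time.Sleep(5 * time.Second) // refresh well within the TTL
	}
}
```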

@cristifalcas

👍

@fgrzadkowski
Contributor

This problem came up when discussing the HA master design, and it made me think there may be a better solution. As you say, we should be using a TTL for each IP. What we can do is:

  1. In the Endpoints object annotations, keep a TTL for each IP: each annotation would pair an IP with its TTL.
  2. Each apiserver, when updating the kubernetes service, would do two things:
    1. Add its own IP if it's not there and add/update the TTL for it.
    2. Remove all IPs whose TTL has expired.

This should be a very simple change and hopefully would solve this issue. WDYT?
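
A minimal sketch of what that reconcile step could look like, operating on an Endpoints object the caller has already fetched and will write back. The annotation key, port handling, and timestamp format are illustrative assumptions, and a real implementation would also need write-conflict handling and deterministic ordering:

```go
package reconciler

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key prefix; one annotation per apiserver IP.
const ttlAnnotationPrefix = "endpoints.kubernetes.io/expires-"

// reconcileSelf refreshes this apiserver's entry and drops expired ones.
func reconcileSelf(ep *corev1.Endpoints, selfIP string, port int32, ttl time.Duration) {
	now := time.Now()
	if ep.Annotations == nil {
		ep.Annotations = map[string]string{}
	}
	// Step 2.1: add/update our own IP's TTL annotation.
	ep.Annotations[ttlAnnotationPrefix+selfIP] = now.Add(ttl).Format(time.RFC3339)

	// Gather the IPs currently listed, plus our own.
	ips := map[string]bool{selfIP: true}
	for _, ss := range ep.Subsets {
		for _, addr := range ss.Addresses {
			ips[addr.IP] = true
		}
	}

	// Step 2.2: keep only IPs whose TTL annotation is still in the future.
	var live []corev1.EndpointAddress
	for ip := range ips {
		expiry, err := time.Parse(time.RFC3339, ep.Annotations[ttlAnnotationPrefix+ip])
		if ip == selfIP || (err == nil && expiry.After(now)) {
			live = append(live, corev1.EndpointAddress{IP: ip})
		} else {
			delete(ep.Annotations, ttlAnnotationPrefix+ip)
		}
	}
	ep.Subsets = []corev1.EndpointSubset{{
		Addresses: live,
		Ports:     []corev1.EndpointPort{{Name: "https", Port: port, Protocol: corev1.ProtocolTCP}},
	}}
}
```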

@ncdc
Member

ncdc commented Oct 20, 2016

@fgrzadkowski that's essentially what we're doing in OpenShift here, although we use a separate path in etcd to store the data, instead of using the existing endpoints path.

@fgrzadkowski
Contributor

Do you think it would make sense to bake the logic I described above into the apiserver?

@roberthbailey @jszczepkowski @lavalamp @krousey @nikhiljindal

@fgrzadkowski
Contributor

Slight modification after discussion with @thockin and @jszczepkowski.

We believe that a reasonable approach would be to:

  1. Add a ConfigMap that keeps the list of active apiservers, with their expiration times; these would be updated by each apiserver separately.
  2. Change the EndpointsReconciler in the apiserver to update the Endpoints list to match the active apiservers from the ConfigMap.

That way we get dynamic configuration, and at the same time we will not be updating Endpoints too often, since the expiration times are stored in a dedicated ConfigMap.

@ncdc
Member

ncdc commented Oct 25, 2016

I assume you'd add retries in the event that 2 apiservers tried to update the ConfigMap simultaneously?
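
For reference, client-go's RetryOnConflict helper handles exactly this kind of optimistic-concurrency conflict. A minimal sketch; the ConfigMap name and data layout are placeholders, and the client-go signatures are today's rather than what existed in 2016:

```go
package reconciler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// heartbeat re-reads the ConfigMap and retries on resourceVersion conflicts,
// so two apiservers writing at the same time both eventually land their update.
func heartbeat(client kubernetes.Interface, selfIP string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
			context.TODO(), "kube-apiserver-endpoints", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data[selfIP] = "update.timestamp=" + time.Now().UTC().Format(time.RFC3339)
		_, err = client.CoreV1().ConfigMaps("kube-system").Update(
			context.TODO(), cm, metav1.UpdateOptions{})
		return err // a Conflict error makes RetryOnConflict try again with fresh data
	})
}
```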

@smarterclayton
Contributor

smarterclayton commented Aug 4, 2017

Updating Endpoints is bad, in that every write fans out to the entire cluster (all kube-proxies, any network plugins, every ingress controller that watches endpoints, and any custom firewall controllers). At 5k nodes that is a lot of traffic. Updating another resource like a ConfigMap is less bad. Least bad would be updating a resource designed for this purpose that no one watches globally.

@liggitt
Member

liggitt commented Aug 4, 2017

5 minutes seems really long... I'd expect an unresponsive master to get its IP pulled out of the kubernetes service endpoints way sooner (more like a 15-30 second response time). The masters contending on a single resource to heartbeat also seems like it could be problematic.

@szuecs
Member

szuecs commented Aug 8, 2017

We run 2 masters in AWS with an ASG. If we tear down one master, it takes about 5-10 minutes to get a replacement, which is fine for all load-balanced applications.

If you want an HA Kubernetes you have to set --apiserver-count=N, where N>1, but this means the "kubernetes" endpoints will not be cleaned up for the torn-down master described above. This is not how normal load balancers work, and I think it is much worse than an unavailable control plane!

% kubectl get nodes -l master
NAME                                            STATUS                     AGE       VERSION
ip-172-31-15-39.eu-central-1.compute.internal   Ready,SchedulingDisabled   18h       v1.6.7+coreos.0

% kubectl get endpoints
NAME               ENDPOINTS                                            AGE
kubernetes         172.31.15.37:443,172.31.15.38:443,172.31.15.39:443   21h

@rphillips
Member

Thank you for the feedback.

Proposal

Add a kube-apiserver-endpoints ConfigMap in the kube-system namespace.

Populate the ConfigMap with the following:

kind: ConfigMap
apiVersion: v1
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: kube-apiserver-endpoints
  namespace: kube-system
data:
  192.168.0.3: update.timestamp=2016-02-18T19:14:38Z
  192.168.0.4: update.timestamp=2016-02-18T19:14:38Z

In the endpoints reconcile loop, expire an entry after a configured period of time (~1 minute?).
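
A minimal sketch of that expiry check, operating on the ConfigMap above; the helper name and the idea of rebuilding the Endpoints from the returned IPs are illustrative assumptions:

```go
package reconciler

import (
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// liveEndpointIPs returns the apiserver IPs whose update.timestamp in the
// ConfigMap is newer than the expiry window. The reconcile loop would then
// rewrite the kubernetes Endpoints object to exactly this set of addresses.
func liveEndpointIPs(cm *corev1.ConfigMap, expiry time.Duration) []string {
	cutoff := time.Now().Add(-expiry) // e.g. expiry = 1 * time.Minute
	var live []string
	for ip, value := range cm.Data {
		ts := strings.TrimPrefix(value, "update.timestamp=")
		updated, err := time.Parse(time.RFC3339, ts)
		if err == nil && updated.After(cutoff) {
			live = append(live, ip)
		}
	}
	return live
}
```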

/cc @smarterclayton @liggitt

olavmrk pushed a commit to Uninett/kubernetes-terraform that referenced this issue Aug 22, 2017
This session affinity causes problems if one of the API servers is
down. If a client has a connection to the API server that fails, it
will continue to connect to that node, because the session affinity
tries to steer connections back to the failed node.

(There is a related issue that causes a failed API server to never be removed from the list of valid service endpoints. See: kubernetes/kubernetes#22609)
rphillips pushed a commit to rphillips/kubernetes that referenced this issue Aug 31, 2017
k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 31, 2017
Automatic merge from submit-queue

add apiserver-count fix proposal

This is a proposal to fix the apiserver-count issue at kubernetes/kubernetes#22609. I would appreciate a review on the proposal.

- [x] Add ConfigMap for configurable options
- [ ] Find out dependencies on the Endpoints API and add them to the proposal
rphillips pushed a commit to rphillips/kubernetes that referenced this issue Sep 11, 2017
hh pushed a commit to ii/kubernetes that referenced this issue Sep 23, 2017
…t_reconciler

Automatic merge from submit-queue (batch tested with PRs 52240, 48145, 52220, 51698, 51777). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

add lease endpoint reconciler

**What this PR does / why we need it**: Adds OpenShift's LeaseEndpointReconciler to register kube-apiserver endpoints within the storage registry.

Adds a command-line argument `alpha-endpoint-reconciler-type` to the kube-apiserver.

Defaults to the old MasterCount reconciler.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes/community#939 fixes kubernetes#22609

**Release note**:
```release-note
Adds a command-line argument to kube-apiserver called
--alpha-endpoint-reconciler-type=(master-count, lease, none) (default
"master-count"). The original reconciler is 'master-count'. The 'lease'
reconciler uses the storageapi and a TTL to keep alive an endpoint within the
`kube-apiserver-endpoint` storage namespace. The 'none' reconciler is a noop
reconciler that does not do anything. This is useful for self-hosted
environments.
```

/cc @lavalamp @smarterclayton @ncdc
gurvindersingh pushed a commit to Uninett/daas-kube that referenced this issue Jan 11, 2018
justaugustus pushed a commit to justaugustus/enhancements that referenced this issue Sep 3, 2018
bergmannf pushed a commit to bergmannf/salt that referenced this issue Feb 27, 2019
When a cluster is bootstrapped with multiple kube-apiservers, the `kubernetes`
service contains a list of all of these endpoints.

By default, this list of endpoints will *not* be updated if one of the
apiservers goes down. This can leave the API partially unresponsive and break
clients. To have the endpoints automatically track the apiservers that are
available, the `--endpoint-reconciler-type` option needs to be set to `lease`.

(The default option for 1.10, `master-count`, only changes the endpoints when the
count changes: https://github.com/apprenda/kismatic/issues/987)

See:

kubernetes/kubernetes#22609
kubernetes/kubernetes#56584
kubernetes/kubernetes#51698
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Nov 30, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021