
kube-apiserver endpoint cleanup when --apiserver-count>1 #22609

Closed
Calpicow opened this issue Mar 6, 2016 · 40 comments · Fixed by #51698
Labels
area/HA, priority/backlog, sig/api-machinery, sig/cluster-lifecycle

Comments

@Calpicow
Contributor

Calpicow commented Mar 6, 2016

I'm using v1.2.0-beta.0 and running with --apiserver-count=2 in my cluster. I would expect the kubernetes service endpoints to be cleaned up when one of the apiservers goes offline. This doesn't happen, though, and it causes ~50% of apiserver requests to fail.

@dchen1107 dchen1107 added sig/node and kind/support labels Mar 8, 2016
@dchen1107 dchen1107 self-assigned this Mar 8, 2016
@dchen1107
Member

I haven't tried to reproduce the issue yet, but I suspect you might be hitting #22625 here. There is a pending PR #22626 to resolve #22625. Could you please try to patch your cluster and see if you still have the issue? Thanks!

@Calpicow
Contributor Author

Calpicow commented Mar 9, 2016

I don't believe that issue is related. When I take down a master, I can see the node and pods being removed. What's not being removed is the second apiserver's IP from the kubernetes endpoints.

@dchen1107
Member

cc @mikedanese @lavalamp, who might have seen this issue before.

@mikedanese mikedanese added priority/backlog, area/HA, and team/control-plane labels and removed kind/support and sig/node labels Mar 9, 2016
@mikedanese
Member

This isn't exactly a bug. It's by design, but the design could likely be better.

@lavalamp
Member

Yes, we don't expect an apiserver to go down and stay down without a replacement.

@lavalamp
Member

We could make this more robust by having apiservers count themselves, e.g. by each separately making an entry in etcd somewhere.

For now, if you change the number of apiservers that are running, you must update the --apiserver-count= flag and restart all apiservers.

@hanikesn

What happens if one of them crashes or its machine goes down? The kubernetes service endpoints would not be usable, as 50% of the requests would fail.

@victorgp

This is actually an important issue that can break the high availability of the whole cluster.
The DNS service relies on the default kubernetes service to connect to the API. If an apiserver goes down then, as hanikesn says, 50% of the requests will fail, so the DNS service will fail, and with it the whole cluster might fail. Is there any plan to change the design around apiserver-count?

@javefang

javefang commented Jul 7, 2016

Agreed with @victorgp. We are creating a high-availability cluster, and some services like DNS and Traefik do rely on the default kubernetes endpoints. I can work around this by forcing them to use the load balancer's URL directly (basically not using the default endpoints), but I feel the kubernetes service endpoints should be kept as consistent as any other service's.

@timothysc
Member

Does this still exist, @ncdc, after the recent change to endpoint updates?

@hanikesn

This issue still exists with 1.4.1.

@ncdc
Member

ncdc commented Oct 17, 2016

@timothysc yes, it still exists. We have an endpoint reconciler for the kubernetes service in OpenShift that uses etcd v2's key TTL mechanism to maintain a lease. If an apiserver goes down, one of the remaining members will remove the dead backend IP from the list of endpoints. If we can agree on a mechanism to do this in Kube (there were some concerns about etcd key TTL), I'd be happy to put together a PR.
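
For context, the etcd v2 key TTL mechanism referenced here works roughly as follows. This is a minimal sketch using the etcd v2 Go client; the key path, TTL, and refresh interval are illustrative assumptions, not the actual OpenShift values:

```go
package lease

import (
	"context"
	"log"
	"time"

	etcd "github.com/coreos/etcd/client"
)

// refreshLease repeatedly writes this apiserver's IP under a per-master key
// with a TTL. As long as the apiserver keeps refreshing the key it stays in
// the set; if the apiserver dies, the key expires and a surviving apiserver
// can drop the IP from the kubernetes endpoints on its next reconcile.
func refreshLease(kapi etcd.KeysAPI, ip string) {
	for {
		// Illustrative key path and TTL only.
		_, err := kapi.Set(context.Background(),
			"/masterleases/"+ip, ip,
			&etcd.SetOptions{TTL: 15 * time.Second})
		if err != nil {
			log.Printf("failed to refresh apiserver lease: %v", err)
		}
		time.Sleep(5 * time.Second) // refresh well within the TTL
	}
}
```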

@cristifalcas

👍

@fgrzadkowski
Contributor

This problem came up when discussing the HA master design, and it made me think there may be a better solution. As you say, we should be using a TTL for each IP. What we can do is:

  1. In the Endpoints object annotations, keep a TTL for each IP: each annotation would pair an IP with its TTL.
  2. Each apiserver, when updating the kubernetes service, would do two things:
    1. Add its own IP if it's not there and add/update the TTL for it.
    2. Remove all IPs whose TTL has expired.

This should be a very simple change and hopefully would solve this issue. WDYT?
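
A minimal sketch of what that reconcile step could look like, operating on an Endpoints object the caller has already fetched and will write back. The annotation key, port handling, and timestamp format are illustrative assumptions, and a real implementation would also need write-conflict handling and deterministic ordering:

```go
package reconciler

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// Hypothetical annotation key prefix; one annotation per apiserver IP.
const ttlAnnotationPrefix = "endpoints.kubernetes.io/expires-"

// reconcileSelf refreshes this apiserver's entry and drops expired ones.
func reconcileSelf(ep *corev1.Endpoints, selfIP string, port int32, ttl time.Duration) {
	now := time.Now()
	if ep.Annotations == nil {
		ep.Annotations = map[string]string{}
	}
	// Step 2.1: add/update our own IP's TTL annotation.
	ep.Annotations[ttlAnnotationPrefix+selfIP] = now.Add(ttl).Format(time.RFC3339)

	// Gather the IPs currently listed, plus our own.
	ips := map[string]bool{selfIP: true}
	for _, ss := range ep.Subsets {
		for _, addr := range ss.Addresses {
			ips[addr.IP] = true
		}
	}

	// Step 2.2: keep only IPs whose TTL annotation is still in the future.
	var live []corev1.EndpointAddress
	for ip := range ips {
		expiry, err := time.Parse(time.RFC3339, ep.Annotations[ttlAnnotationPrefix+ip])
		if ip == selfIP || (err == nil && expiry.After(now)) {
			live = append(live, corev1.EndpointAddress{IP: ip})
		} else {
			delete(ep.Annotations, ttlAnnotationPrefix+ip)
		}
	}
	ep.Subsets = []corev1.EndpointSubset{{
		Addresses: live,
		Ports:     []corev1.EndpointPort{{Name: "https", Port: port, Protocol: corev1.ProtocolTCP}},
	}}
}
```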

@ncdc
Member

ncdc commented Oct 20, 2016

@fgrzadkowski that's essentially what we're doing in OpenShift here, although we use a separate path in etcd to store the data, instead of using the existing endpoints path.

@fgrzadkowski
Contributor

Do you think it would make sense to bake the logic I described above into the apiserver?

@roberthbailey @jszczepkowski @lavalamp @krousey @nikhiljindal

@fgrzadkowski
Contributor

Slight modification after discussion with @thockin and @jszczepkowski.

We believe that a reasonable approach would be to:

  1. Add a ConfigMap that keeps the list of active apiservers, with their expiration times; these would be updated by each apiserver separately.
  2. Change the EndpointsReconciler in the apiserver to update the Endpoints list to match the active apiservers from the ConfigMap.

That way we get dynamic configuration, and at the same time we will not be updating Endpoints too often, since the expiration times are stored in a dedicated ConfigMap.

@ncdc
Member

ncdc commented Oct 25, 2016

I assume you'd add retries in the event that 2 apiservers tried to update the ConfigMap simultaneously?
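
For reference, client-go's RetryOnConflict helper handles exactly this kind of optimistic-concurrency conflict. A minimal sketch; the ConfigMap name and data layout are placeholders, and the client-go signatures are today's rather than what existed in 2016:

```go
package reconciler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// heartbeat re-reads the ConfigMap and retries on resourceVersion conflicts,
// so two apiservers writing at the same time both eventually land their update.
func heartbeat(client kubernetes.Interface, selfIP string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
			context.TODO(), "kube-apiserver-endpoints", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data[selfIP] = "update.timestamp=" + time.Now().UTC().Format(time.RFC3339)
		_, err = client.CoreV1().ConfigMaps("kube-system").Update(
			context.TODO(), cm, metav1.UpdateOptions{})
		return err // a Conflict error makes RetryOnConflict try again with fresh data
	})
}
```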

@smarterclayton
Contributor

smarterclayton commented Aug 4, 2017

Updating Endpoints is bad, in that every write fans out to the entire cluster (all kube-proxies, any network plugins, every ingress controller that watches endpoints, and any custom firewall controllers). At 5k nodes that is a lot of traffic. Updating another resource like a ConfigMap is less bad. Least bad would be updating a resource designed for this purpose that no one watches globally.

@liggitt
Member

liggitt commented Aug 4, 2017

5 minutes seems really long... I'd expect an unresponsive master to get its IP pulled out of the kubernetes service endpoints way sooner (more like a 15-30 second response time). The masters contending on a single resource to heartbeat also seems like it could be problematic.

@szuecs
Member

szuecs commented Aug 8, 2017

We run 2 masters in AWS with an ASG. If we tear down one master, it takes about 5-10 minutes to get a replacement, which is fine for all load-balanced applications.

If you want an HA Kubernetes you have to set --apiserver-count=N, where N>1, but this means the "kubernetes" endpoints will not be cleaned up for the torn-down master described above. This is not how normal load balancers work, and I think it is much worse than an unavailable control plane!

% kubectl get nodes -l master
NAME                                            STATUS                     AGE       VERSION
ip-172-31-15-39.eu-central-1.compute.internal   Ready,SchedulingDisabled   18h       v1.6.7+coreos.0

% kubectl get endpoints
NAME               ENDPOINTS                                            AGE
kubernetes         172.31.15.37:443,172.31.15.38:443,172.31.15.39:443   21h

@rphillips
Member

Thank you for the feedback.

Proposal

Add a kube-apiserver-endpoints ConfigMap in the kube-system namespace.

Populate the ConfigMap with the following:

kind: ConfigMap
apiVersion: v1
metadata:
  creationTimestamp: 2016-02-18T19:14:38Z
  name: kube-apiserver-endpoints
  namespace: kube-system
data:
  192.168.0.3: update.timestamp=2016-02-18T19:14:38Z
  192.168.0.4: update.timestamp=2016-02-18T19:14:38Z

In the endpoints reconcile loop, expire an entry after a configured period of time (~1 minute?).
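
A minimal sketch of that expiry check, operating on the ConfigMap above; the helper name and the idea of rebuilding the Endpoints from the returned IPs are illustrative assumptions:

```go
package reconciler

import (
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// liveEndpointIPs returns the apiserver IPs whose update.timestamp in the
// ConfigMap is newer than the expiry window. The reconcile loop would then
// rewrite the kubernetes Endpoints object to exactly this set of addresses.
func liveEndpointIPs(cm *corev1.ConfigMap, expiry time.Duration) []string {
	cutoff := time.Now().Add(-expiry) // e.g. expiry = 1 * time.Minute
	var live []string
	for ip, value := range cm.Data {
		ts := strings.TrimPrefix(value, "update.timestamp=")
		updated, err := time.Parse(time.RFC3339, ts)
		if err == nil && updated.After(cutoff) {
			live = append(live, ip)
		}
	}
	return live
}
```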

/cc @smarterclayton @liggitt

olavmrk pushed a commit to Uninett/kubernetes-terraform that referenced this issue Aug 22, 2017
This session affinity causes problems if one of the API servers is
down. If a client has a connection to the API server that fails, it
will continue to connect to that node, because the session affinity
tries to steer connections back to the failed node.

(There is a related issue that causes a failed API server to never be removed from the list of valid service endpoints. See: kubernetes/kubernetes#22609)
rphillips pushed a commit to rphillips/kubernetes that referenced this issue Aug 31, 2017
k8s-github-robot pushed a commit to kubernetes/community that referenced this issue Aug 31, 2017
Automatic merge from submit-queue

add apiserver-count fix proposal

This is a proposal to fix the apiserver-count issue at kubernetes/kubernetes#22609. I would appreciate a review on the proposal.

- [x] Add ConfigMap for configurable options
- [ ] Find out dependencies on the Endpoints API and add them to the proposal
rphillips pushed a commit to rphillips/kubernetes that referenced this issue Sep 11, 2017
hh pushed a commit to ii/kubernetes that referenced this issue Sep 23, 2017
…t_reconciler

Automatic merge from submit-queue (batch tested with PRs 52240, 48145, 52220, 51698, 51777). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md.

add lease endpoint reconciler

**What this PR does / why we need it**: Adds OpenShift's LeaseEndpointReconciler to register kube-apiserver endpoints within the storage registry.

Adds a command-line argument `alpha-endpoint-reconciler-type` to the kube-apiserver.

Defaults to the old MasterCount reconciler.

**Which issue this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close that issue when PR gets merged)*: fixes kubernetes/community#939 fixes kubernetes#22609

**Release note**:
```release-note
Adds a command-line argument to kube-apiserver called
--alpha-endpoint-reconciler-type=(master-count, lease, none) (default
"master-count"). The original reconciler is 'master-count'. The 'lease'
reconciler uses the storageapi and a TTL to keep alive an endpoint within the
`kube-apiserver-endpoint` storage namespace. The 'none' reconciler is a noop
reconciler that does not do anything. This is useful for self-hosted
environments.
```

/cc @lavalamp @smarterclayton @ncdc
gurvindersingh pushed a commit to Uninett/daas-kube that referenced this issue Jan 11, 2018
justaugustus pushed a commit to justaugustus/enhancements that referenced this issue Sep 3, 2018
bergmannf pushed a commit to bergmannf/salt that referenced this issue Feb 27, 2019
When a cluster is bootstrapped with multiple kube-apiservers, the `kubernetes`
service contains a list of all of these endpoints.

By default, this list of endpoints will *not* be updated if one of the
apiservers goes down. This can leave the API partially unresponsive and break
clients. To have the endpoints automatically track the apiservers that are
available, the `--endpoint-reconciler-type` option needs to be set to `lease`.

(The default option for 1.10, `master-count`, only changes the endpoints when the
count changes: https://github.com/apprenda/kismatic/issues/987)

See:

kubernetes/kubernetes#22609
kubernetes/kubernetes#56584
kubernetes/kubernetes#51698
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Nov 30, 2021
MadhavJivrajani pushed a commit to MadhavJivrajani/design-proposals that referenced this issue Dec 1, 2021
MadhavJivrajani pushed a commit to kubernetes/design-proposals-archive that referenced this issue Dec 1, 2021