raft: support joint consensus for cluster membership change #1468

Closed
siddontang opened this issue Jan 5, 2017 · 8 comments
Labels
sig/raft Component: Raft, RaftStore, etc. status/discussion Status: Under discussion or need discussion

Comments

@siddontang
Contributor

etcd uses a simple implementation of membership change: it adds or removes one peer at a time when applying the Raft log.

This works well most of the time, but it can still be risky, especially when PD is balancing.

E.g., take three racks 1, 2, and 3, each with 2 machines (we write h11 and h12 for the machines in rack 1, and so on). PD first schedules three peers p1, p2, p3 on h11, h21, and h31. It then finds that h11 is under high load, so it decides to add a new peer p4 on h12 and remove p1 from h11.

If rack 1 goes down after p4 is added but before p1 is removed, the region can't serve requests: two of its four peers (p1 and p4) are in rack 1, which leaves only two of the three votes needed for a majority. To avoid this, we must add p4 and remove p1 atomically, which we can't do today.

Supporting joint consensus can fix this problem, but it differs from etcd's approach, and we must do a lot of testing to verify correctness and cover the corner cases.
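The rack scenario above can be checked with a few lines of arithmetic. This is only a minimal sketch: `has_quorum`, `survives_rack_failure`, and the `rack_of` table are hypothetical names made up for illustration, not anything from TiKV or PD, and a quorum is assumed to be a strict majority of voters.

```python
# Hypothetical placement table: which rack each peer lives in.
rack_of = {"p1": 1, "p2": 2, "p3": 3, "p4": 1}


def has_quorum(voters, alive):
    """Return True if the live voters form a strict majority of all voters."""
    return len(voters & alive) >= len(voters) // 2 + 1


def survives_rack_failure(voters, failed_rack):
    """Return True if the region keeps a quorum when one whole rack goes down."""
    alive = {p for p in voters if rack_of[p] != failed_rack}
    return has_quorum(voters, alive)


before = {"p1", "p2", "p3"}        # original placement
during = {"p1", "p2", "p3", "p4"}  # p4 added, p1 not yet removed
after = {"p2", "p3", "p4"}         # p1 removed

for name, voters in [("before", before), ("during", during), ("after", after)]:
    print(name, survives_rack_failure(voters, failed_rack=1))
```

Only the intermediate four-peer state loses quorum when rack 1 fails, which is why the add and the remove would need to take effect together.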

/cc @ngaut @xiang90 @BusyJay @hhkbp2

@Hoverbear
Contributor

#3314

@BusyJay
Member

BusyJay commented Jun 19, 2020

It's tracked by #7587 now.

@BusyJay BusyJay closed this as completed Jun 19, 2020
Safely replace node automation moved this from To do to Done Jun 19, 2020
@Diggsey

Diggsey commented Jul 29, 2020

@siddontang Actually, I don't believe joint consensus solves this problem for the 3-node case.

Let's say we have a cluster of nodes {A, B, C} where A is the current leader, and we want to replace 'C' with 'D' to get {A, B, D}.

If we use joint consensus to achieve this in a single step, then the cluster is no longer resilient to a single node failure:

  1. D is brought up to date as a non-voter
  2. A replicates the new joint configuration {A, B, C}{A, B, D} to nodes B and C
  3. A crashes before replicating to D
  4. Cluster is unable to elect a new leader, because a majority cannot be obtained from {A, B, D} since A has crashed and D is unaware of its voter status. (Consensus must be obtained from both voting groups during joint consensus)
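The four steps above can be sketched as a vote count. This is a rough model, not real Raft code: `majority` and `wins_joint_election` are hypothetical helpers, and the sketch takes as given the assumption in this comment that D, still believing it is a non-voter, grants no votes.

```python
def majority(config, granted):
    """Return True if the granted votes cover a strict majority of config."""
    return len(config & granted) > len(config) // 2


def wins_joint_election(old, new, granted):
    """During joint consensus a candidate needs a majority in BOTH configurations."""
    return majority(old, granted) and majority(new, granted)


old = {"A", "B", "C"}
new = {"A", "B", "D"}

# A has crashed. D is alive but (by assumption) does not grant votes,
# so only B and C can vote for a candidate.
granted = {"B", "C"}

print(majority(old, granted))                  # old configuration {A, B, C}
print(majority(new, granted))                  # new configuration {A, B, D}
print(wins_joint_election(old, new, granted))  # both together
```

B and C satisfy the old configuration but contribute only one vote (B) to the new one, so under this assumption no candidate can win.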

The specific inequalities to ensure fault tolerance at all times are:

  • num_shared - fault_tolerance > num_added
  • num_shared - fault_tolerance > num_removed

As a result of these inequalities, 3 node clusters must first add a node and then remove a node. On the other hand, a 4-node cluster can replace a node in a single step.
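As a quick check of the two inequalities, here is a minimal sketch. The function names are hypothetical, and `fault_tolerance` is assumed to mean the number of failures a majority-quorum cluster of the old size tolerates, i.e. `(n - 1) // 2`; the comment does not define it precisely.

```python
def fault_tolerance(cluster_size):
    """Assumption: failures tolerated by a majority-quorum cluster of this size."""
    return (cluster_size - 1) // 2


def single_step_is_safe(old, new):
    """Evaluate both inequalities from the comment for a change old -> new."""
    num_shared = len(old & new)
    num_added = len(new - old)
    num_removed = len(old - new)
    ft = fault_tolerance(len(old))
    return num_shared - ft > num_added and num_shared - ft > num_removed


# Replace one node of a 3-node cluster in a single step.
print(single_step_is_safe({"A", "B", "C"}, {"A", "B", "D"}))
# Replace one node of a 4-node cluster in a single step.
print(single_step_is_safe({"A", "B", "C", "D"}, {"A", "B", "C", "E"}))
# The two-step route for 3 nodes: add D first, then remove C.
print(single_step_is_safe({"A", "B", "C"}, {"A", "B", "C", "D"}))
print(single_step_is_safe({"A", "B", "C", "D"}, {"A", "B", "D"}))
```

Under that reading, the single-step 3-node replacement fails the check, while the 4-node replacement and each half of the add-then-remove sequence pass.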

@BusyJay
Member

BusyJay commented Jul 30, 2020

Non-voter means the node doesn't campaign, but it's OK for it to respond to vote requests.

@Diggsey

Diggsey commented Jul 30, 2020

@BusyJay Interesting... If it's the case that non-voters should still vote, then that is really terrible naming on behalf of the Raft paper authors... Is there somewhere that explains this aspect of the algorithm in more detail?

@BusyJay
Member

BusyJay commented Jul 30, 2020

We call it a learner instead of a non-voter in both our implementation and etcd's. I don't think responding to vote requests is mentioned in the Raft paper; it's a fix we found in practice.

You can see the related discussion etcd-io/etcd#8568 and tikv/raft-rs#57.
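To illustrate the learner behaviour described here, the following is a simplified model and not the raft-rs or etcd API: the `Peer` class and its methods are hypothetical, and term and log up-to-date checks are deliberately left out. It reuses the {A, B, C} to {A, B, D} scenario from earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    is_learner: bool = False  # the peer's own, possibly stale, view of its role
    alive: bool = True

    def may_campaign(self):
        """A learner never starts an election."""
        return self.alive and not self.is_learner

    def grants_vote(self):
        """A learner still answers vote requests (term/log checks omitted)."""
        return self.alive


def majority(config, granted):
    return len(config & granted) > len(config) // 2


def wins_joint_election(old, new, granted):
    return majority(old, granted) and majority(new, granted)


old, new = {"A", "B", "C"}, {"A", "B", "D"}
peers = [
    Peer("A", alive=False),      # crashed leader
    Peer("B"),
    Peer("C"),
    Peer("D", is_learner=True),  # has not yet learned it was promoted
]

candidates = [p.name for p in peers if p.may_campaign()]
granted = {p.name for p in peers if p.grants_vote()}

print(candidates)
print(wins_joint_election(old, new, granted))
```

Because D grants its vote even though it never campaigns, B or C can collect a majority in both {A, B, C} and {A, B, D}.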

@Diggsey

Diggsey commented Jul 30, 2020

Thanks, I think I will borrow that terminology!

@Diggsey

Diggsey commented Jul 30, 2020

With that change, the inequality seems to be:

  • num_shared > fault_tolerance

(Since at least one shared node must still exist to be elected from the original configuration)

Does that seem correct?
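For reference, a minimal sketch that only evaluates the revised inequality as stated; it does not prove that the condition is sufficient. The helper names are hypothetical, and `fault_tolerance` is again assumed to be `(n - 1) // 2` for the old configuration.

```python
def fault_tolerance(cluster_size):
    """Assumption: failures tolerated by a majority-quorum cluster of this size."""
    return (cluster_size - 1) // 2


def revised_check(old, new):
    """Evaluate num_shared > fault_tolerance for a change old -> new."""
    num_shared = len(old & new)
    return num_shared > fault_tolerance(len(old))


# Replace one node of three: two shared nodes, fault tolerance of one.
print(revised_check({"A", "B", "C"}, {"A", "B", "D"}))
# Replace two nodes of three: only one shared node remains.
print(revised_check({"A", "B", "C"}, {"A", "D", "E"}))
```

With learners answering vote requests, the single-step 3-node replacement passes this check, whereas replacing two of the three nodes at once does not.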
