Cannot Add a member during network partition with StrictReconfigCheck enabled #10114
Comments
If the newly joined node itself is misconfigured, there will be no quorum anymore.
Wouldn't it be better to check that the newly added node is healthy before allowing it to join the cluster? I thought that was already being done. Are you saying that there is no check for that, so instead you rely on full connectivity as a substitute?
That requires the learner feature that @gyuho and @jpbetz are working on. Basically, we need to test whether the newly added node is able to catch up with the cluster before promoting it to a participant in the raft group.
Node health cannot simply be inferred from network connectivity.
This is only true in 3 node clusters.
Yes, of course. But that's also exactly what this check does. I'm simply recommending substituting this check for the same one on the joining node, which I actually already thought existed. Since it doesn't, I suppose it requires larger changes, such as the coming learner change. I understand you are unwilling to make this change, and will happily wait for the learner change. It was quite a shock though to discover tests failing on our system because we couldn't add a node even though we had quorum. More specific documentation on that front may be helpful. Thanks.
3 and 4 node, I should have said. I was only considering odd-numbered cluster sizes above 2.
We can probably loosen that check for larger clusters.
Great. Thanks.
cc @jpbetz
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
hi bot! please stop pinging here
What a silly thing it is to have a bot closing open bugs.
As implemented, StrictReconfigCheck is very valuable in that it prevents adding unhealthy nodes to the cluster and removing nodes if it would result in quorum loss. However, there is a 3rd check that I believe is counterproductive, and was wondering if we could go ahead and remove it.
With StrictReconfigCheck enabled, it is impossible to add new nodes to the cluster when one node is down or partitioned, even if there is quorum. See etcd/etcdserver/server.go, line 1552 in e8b940f.
This requirement means that the procedure for replacing a "healthy" node differs from the procedure for replacing an "unhealthy" node. For the former, it is recommended that new nodes be added first and old nodes removed afterwards. For the latter, the old node must be removed first. This is an inconsistency, and it makes automating cluster changes slightly more complicated.
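One plausible reason for the "add first" recommendation is that it never lowers the number of failures the cluster can tolerate while the replacement is in progress. The sketch below only does the majority arithmetic for a 3-member cluster; `quorum` and `faultTolerance` are hypothetical helper names, not etcd functions, and it assumes every member counted is healthy.

```go
package main

import "fmt"

// quorum returns the majority size for n voting members.
// Hypothetical helper for illustration, not part of etcd.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members may fail while a majority remains.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	// Cluster sizes at each step when replacing one member of a 3-member cluster.
	addFirst := []int{3, 4, 3}    // add the new member, then remove the old one
	removeFirst := []int{3, 2, 3} // remove the old member, then add the new one

	for _, n := range addFirst {
		fmt.Printf("add first:    size=%d tolerance=%d\n", n, faultTolerance(n))
	}
	for _, n := range removeFirst {
		fmt.Printf("remove first: size=%d tolerance=%d\n", n, faultTolerance(n))
	}
}
```

With these assumptions, the add-first path keeps a tolerance of one failure at every step, while the remove-first path passes through a 2-member cluster that tolerates none.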
More importantly however, the other node may just be temporarily partitioned, yet perfectly healthy. Forcing removal in order to increase fault tolerance seems to be a very restrictive requirement.
I am failing to see any practical benefit to this rule, and therefore am requesting that it be removed. Removal would make the node replacement procedure consistent, and allow increased resiliency in the face of partitions.
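To make the scenario concrete, here is a minimal sketch of the quorum arithmetic for a 3-member cluster with one member partitioned. It is not the check in etcdserver/server.go; `quorum` and `hasQuorum` are hypothetical helpers that only count reachable members against a simple majority.

```go
package main

import "fmt"

// quorum returns the majority size for n voting members.
// Hypothetical helper for illustration, not part of etcd.
func quorum(n int) int { return n/2 + 1 }

// hasQuorum reports whether `active` reachable members form a majority of `total`.
func hasQuorum(active, total int) bool { return active >= quorum(total) }

func main() {
	// 3 members, one partitioned: 2 reachable, majority is 2.
	fmt.Println("before add:", hasQuorum(2, 3))

	// A fourth member is added and starts correctly: 3 reachable of 4, majority is 3.
	fmt.Println("healthy new member:", hasQuorum(3, 4))

	// A fourth member is added but never starts: 2 reachable of 4, majority is 3.
	fmt.Println("broken new member:", hasQuorum(2, 4))
}
```

Under these assumptions, the add is safe only if the new member actually comes up, which is the risk the maintainers raise in the comments and the motivation for checking the joining node's health rather than full connectivity of the existing members.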