Kubernetes is vulnerable to stale reads, violating critical pod safety guarantees #59848

Open

Description

@smarterclayton

When we added resourceVersion=0 to reflectors, we didn't properly reason about its impact on nodes. Its current behavior can cause two nodes to run a pod with the same name at the same time when using multiple api servers, which violates the pod safety guarantees on the cluster. Because a read serviced by the watch cache can be arbitrarily delayed, a client that connects to that api server can read arbitrarily old history; this is exactly the kind of stale read we use quorum reads against etcd to prevent.

Scenario:

  1. T1: StatefulSet controller creates pod-0 (uid 1) which is scheduled to node-1
  2. T2: pod-0 is deleted as part of a rolling upgrade
  3. node-1 sees that pod-0 is deleted and cleans it up, then deletes the pod in the api
  4. The StatefulSet controller creates a second pod pod-0 (uid 2) which is assigned to node-2
  5. node-2 sees that pod-0 has been scheduled to it and starts pod-0
  6. The kubelet on node-1 crashes and restarts, then performs an initial list of pods scheduled to it against an API server in an HA setup (more than one API server) that is partitioned from the master (watch cache is arbitrarily delayed). The watch cache returns a list of pods from before T2
  7. node-1 fills its local cache with a list of pods from before T2
  8. node-1 starts pod-0 (uid 1) and node-2 is already running pod-0 (uid 2).

This violates pod safety. Since we support HA api servers, we cannot use resourceVersion=0 from reflectors on the node, and probably should not use it on the masters. We can only safely use resourceVersion=0 after we have retrieved at least one list, and only if we verify that resourceVersion is in the future.
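
To make the two read paths concrete, here is a minimal client-go sketch (not the kubelet's actual code path; the kubeconfig path and node name are placeholders). The two list calls differ only in whether ResourceVersion is set to "0", which determines whether the apiserver may answer from its watch cache or must perform a quorum read:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path, purely for illustration.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.Background()
	selector := "spec.nodeName=node-1" // the node from the scenario above

	// ResourceVersion "0" lets the apiserver answer from its watch cache,
	// which on a partitioned apiserver may reflect an arbitrarily old
	// snapshot (e.g. one from before T2 in the scenario above).
	cached, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		ResourceVersion: "0",
		FieldSelector:   selector,
	})
	if err != nil {
		panic(err)
	}

	// Leaving ResourceVersion unset forces a quorum read against etcd,
	// which preserves the happens-before guarantee the kubelet relies on.
	fresh, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: selector,
	})
	if err != nil {
		panic(err)
	}

	fmt.Printf("watch-cache list rv=%s, quorum list rv=%s\n",
		cached.ResourceVersion, fresh.ResourceVersion)
}
```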

@kubernetes/sig-apps-bugs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-bugs This is a fairly serious issue that can lead to cluster identity guarantees being lost, which means clustered software cannot run safely if it has assumed the pod safety guarantee prevents two pods with the same name running on the cluster at the same time. The user impact is likely data loss of critical data.

This is also something that could happen for controllers - during a controller lease failover the next leader could be working from a very old cache and undo recent work done.

No matter what, the first list performed by a component that starts with a clean state and must preserve "happens-before" has to be a live quorum read against etcd to fill its cache. Today that can only be done by omitting resourceVersion=0.

Fixes:

  1. Disable resourceVersion=0 from being used in reflector list, only use when known safe
  2. Disable resourceVersion=0 for the first list, and optionally use resourceVersion=0 for subsequent lists if we can verify that the returned resourceVersion is newer than the one we last observed (assumes resourceVersion monotonicity)
  3. Disable resourceVersion=0 for the first list from a reflector, then send resourceVersion=<last observed resourceVersion> on subsequent list calls (which causes the watch cache to wait until that resource version shows up).
  4. Perform live reads from Kubelet on all new pods coming in to verify they still exist

1 is a pretty significant performance regression, but is the most correct and safest option (just like when we enabled quorum reads everywhere). 2 is more complex, and there are a few people trying to remove the monotonicity guarantees from resource version, but it would retain most of the performance benefit of using the watch cache in the reflector. 3 is probably less complex than 2, but I'm not positive it actually works. 4 is hideous and won't fix other usages.
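
A rough client-side sketch of what options 2 and 3 could look like, assuming a reflector-style relist helper (the function name, field selector, and clientset wiring are illustrative, not the actual reflector code):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPods is illustrative only. lastSyncedRV is the resourceVersion returned
// by the previous successful list ("" on the first list after a restart).
func listPods(ctx context.Context, client kubernetes.Interface, lastSyncedRV string) (*corev1.PodList, error) {
	opts := metav1.ListOptions{FieldSelector: "spec.nodeName=node-1"}
	if lastSyncedRV != "" {
		// Subsequent relists (options 2/3): ask for data no older than what
		// we have already observed, so a delayed watch cache cannot send us
		// back in time.
		opts.ResourceVersion = lastSyncedRV
	}
	// On the first list, opts.ResourceVersion stays unset, which forces a
	// quorum read against etcd and preserves happens-before.
	return client.CoreV1().Pods("").List(ctx, opts)
}
```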

Probably needs to be backported to 1.6.

Activity

added this to the v1.10 milestone on Feb 14, 2018
added sig/apps, kind/bug, sig/scalability labels on Feb 14, 2018
changed the title from "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one master" to "Use of resourceVersion=0 in reflectors for initial sync breaks pod safety when more than one api server" on Feb 14, 2018

smarterclayton (Contributor, Author) commented on Feb 14, 2018

Note that we added this to improve large-cluster performance, so we are almost certainly going to regress to some degree (how much of that has been mitigated by other improvements over the last year is uncertain).

roycaihw (Member) commented on Feb 15, 2018

/sub

smarterclayton (Contributor, Author) commented on Feb 17, 2018

Disabling resourceVersion=0 for lists when more than one master is present is an option as well.

As Jordan noted, the optimization here is impactful because it avoids fetching many full pod lists from etcd when we only return a subset to each node. It's possible we could require all resourceVersion=0 calls to acquire a read lease on etcd, which bounds the delay but doesn't guarantee happens-before if the cache is delayed. If the watch returned a freshness guarantee we could synchronize on that as well.

smarterclayton (Contributor, Author) commented on Feb 17, 2018

We can do a synthetic write to the range and wait until it is observed by the cache, then service the rv=0. The logical equivalent is a serializable read on etcd for the range, but we need to know the highest observed RV on the range.

We can accomplish that by executing a range request with min create and mod revisions equal to the latest observed revision, and then performing the watch cache list at the highest RV on the range.
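
A hedged sketch of that idea using the etcd v3 client (the key prefix, the lastObservedRev parameter, and the helper function itself are assumptions for illustration; this is not the apiserver's storage code):

```go
package example

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// highestRVOnRange sketches the idea above: issue a range request filtered to
// keys created/modified at or after the latest revision we have observed, and
// use the response header's revision as the point the watch cache must reach
// before serving the list.
func highestRVOnRange(ctx context.Context, cli *clientv3.Client, prefix string, lastObservedRev int64) (int64, error) {
	resp, err := cli.Get(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithSerializable(),                // serializable read over the range
		clientv3.WithKeysOnly(),                    // we only need revisions, not values
		clientv3.WithMinCreateRev(lastObservedRev), // min create revision filter
		clientv3.WithMinModRev(lastObservedRev),    // min mod revision filter
	)
	if err != nil {
		return 0, err
	}
	// The header revision is the store revision the read was served at; the
	// watch cache list can be performed once the cache has observed it.
	return resp.Header.Revision, nil
}
```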

So:

  1. All reflectors need to preserve a “happens before” guarantee when fetching for general sanity - use of rv=0 breaks that
  2. rv=0 is a client visible optimization, but it’s not really necessary except for clients that can observe arbitrarily delayed history safely
  3. Not many clients can observe arbitrarily delayed history safely
  4. We should stop treating rv=0 specially

We can mitigate the performance impact by ensuring that an initial list from the watch cache preserves happens before. We can safely serve historical lists (rv=N) at any time. The watch cache already handles rv=N mostly correctly.

So the proposed change here is:

  1. Clients remove rv=0 and we stop honoring it
  2. Verify the watch cache correctly waits for rv=N queries
  3. We can serve a list or get from watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions
  4. We can make that query efficient by using min_create/mod_revision on the etcd list call
  5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.

That resolves this issue.
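
For reference, the behavior that items 2 and 5 rely on amounts to something like the following toy sketch (not the actual cacher implementation): a list at rv=N blocks until the cache has observed at least revision N.

```go
package example

import "sync"

// watchCache is a toy stand-in for the apiserver's watch cache, only to show
// the blocking behavior relied on above; it is not the real implementation.
type watchCache struct {
	mu        sync.Mutex
	cond      *sync.Cond
	currentRV uint64 // highest resourceVersion observed from etcd so far
}

func newWatchCache() *watchCache {
	c := &watchCache{}
	c.cond = sync.NewCond(&c.mu)
	return c
}

// observe is called as events arrive from etcd and advances the cache.
func (c *watchCache) observe(rv uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if rv > c.currentRV {
		c.currentRV = rv
		c.cond.Broadcast()
	}
}

// waitUntilFresh blocks a "list at rv=N" request until the cache has caught
// up to N, so a client that sends its last observed resourceVersion can never
// be handed an older view than it has already seen.
func (c *watchCache) waitUntilFresh(requiredRV uint64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for c.currentRV < requiredRV {
		c.cond.Wait()
	}
}
```

A real cacher would also bound the wait with a timeout rather than blocking indefinitely.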

added priority/critical-urgent label on Feb 17, 2018

smarterclayton (Contributor, Author) commented on Feb 17, 2018

Serializable read over the range, that is.

smarterclayton (Contributor, Author) commented on Feb 17, 2018

Also:

  1. The next time someone adds a cache at any layer, we need to have a good review process that catches this. Probably a checklist we add into api review and an audit of any existing caches.

jberkus commented on Feb 21, 2018

If this MUST be resolved for 1.10, please add status/approved-for-milestone to it before Code Freeze. Thanks!

[148 remaining items not shown]


Metadata

Labels: kind/bug, lifecycle/frozen, priority/important-soon, sig/api-machinery, sig/apps, sig/scalability
