Recovering from Drift

How ACK controllers detect and remediate resource drift

Kubernetes controllers work on the principal of constant reconciliation. In essence, they continuously look at the current desired state of the system and compare it to the actual state, using the difference to determine the action required to get to the desired end result.

Once a controller has reconciled a resource to its desired state, the controller shouldn’t need to continue reconciling - the actual state of the resource meets the specification. However, this is only true for closed systems, where the controller is the only actor interacting with a resource. Unfortunately, ACK controllers don’t act in a closed system. ACK controllers are not the only actor capable of modifying the actual state of any AWS resources - other programs, or even people, may have their own privileges. When another actor modifies a resource after the ACK controller has reconciled it to its desired state, that’s called “drift”.

ACK controllers detect drift by continuing to reconcile resources after they have reached their desired state, but with much longer delays between reconciliation attempts. By default, all ACK controllers attempt to detect drift once every 10 hours. That is, every 10 hours after a resource has been marked with the ResourceSynced = true condition, its owner controller will describe the resource in AWS to see if it no longer matches the desired state. If the controller detects a difference, it then starts the reconciliation loop again to get back to that state (just as when any other change has been made).

Existing resource overrides

Some resources require more frequent drift remediation. For example, if a resource runs a stateful workload whose status changes frequently (such as a SageMaker TrainingJob). For these resources, the drift remediation period may already have been decreased by the controller authors to improve the responsiveness of the resource’s Status.

All override periods are logged to stdout when the controller is started.

Overriding the drift remediation period

For all resources owned by a controller

If you would like to decrease the drift remediation period for all resources owned by a controller, update the reconcile.defaultResyncPeriod value in the Helm chart values.yaml file with the number of seconds for the new period, like so:

reconcile:
    defaultResyncPeriod: 1800 # 30 minutes (in seconds)

For a single resource type

The most granular configuration for setting reconciliation periods is to apply it to all resources of a given type. For example, all S3 Bucket managed by a single controller.

Add the resource name and the overriding period (in seconds) to the reconcile.resourceResyncPeriods value in the Helm chart values.yaml like so:

reconcile:
    resourceResyncPeriods:
        Bucket: 1800 # 30 minutes (in seconds)

Edit this page on GitHub