ADR-004: CAS (Check-And-Set) for All Restore Operations

Status: Accepted
Date: 2026-04-04

Context

During a restore, Guardian writes values from Git back to Consul KV. Between the time Guardian reads the current state (to build the restore plan) and the time it writes the restored values, another process could modify a key. Without protection, the restore would silently overwrite that concurrent change.

This is a real risk in production. Restore operations often happen during incidents, exactly when other systems (deploy scripts, config management, other operators) are also making changes.

Decision

All restore operations must use CAS (Check-And-Set) via the Consul Transaction API. Every write includes the ModifyIndex read during planning. If the index has changed (meaning someone else modified the key), the write fails instead of overwriting.
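A minimal sketch of how a single CAS write or delete might be encoded as an operation for the Transaction API (PUT /v1/txn). The `cas` and `delete-cas` verbs, the `Index` field, and the base64-encoded `Value` come from Consul's API. The helper names `cas_set_op` and `cas_delete_op` are hypothetical and are not taken from Guardian.

```python
import base64
import json


def cas_set_op(key: str, value: bytes, modify_index: int) -> dict:
    """Build a KV 'cas' operation: applied only if the key's ModifyIndex still matches."""
    return {
        "KV": {
            "Verb": "cas",
            "Key": key,
            # The Transaction API expects KV values as base64 strings.
            "Value": base64.b64encode(value).decode("ascii"),
            "Index": modify_index,
        }
    }


def cas_delete_op(key: str, modify_index: int) -> dict:
    """Build a KV 'delete-cas' operation: deletes only if the ModifyIndex still matches."""
    return {"KV": {"Verb": "delete-cas", "Key": key, "Index": modify_index}}


if __name__ == "__main__":
    # Example keys and indexes are made up for illustration.
    ops = [
        cas_set_op("app/db/host", b"db-1.internal", 42),
        cas_delete_op("app/old-flag", 17),
    ]
    # This JSON list is what would be sent as the body of the transaction request.
    print(json.dumps(ops, indent=2))
```

The index travels with each operation, so the check happens on the Consul servers at commit time rather than in the client.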

The flow:

  1. Read current state from Consul, recording each key's ModifyIndex.
  2. Compare against desired state from Git.
  3. Build a plan of SET and DELETE operations, each tagged with the ModifyIndex.
  4. Execute the plan as a Consul transaction. The transaction makes each batch atomic; the CAS checks detect concurrent modification.
  5. If a CAS check fails, report the conflict. Don't retry automatically.
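Steps 1 to 3 can be sketched as a pure diff function. This is an illustration under stated assumptions, not Guardian's planner: `PlannedOp` and `build_plan` are hypothetical names, the current state is assumed to be a mapping of key to (value, ModifyIndex), and index 0 is used for keys absent from Consul, following the KV API's documented "create only if absent" meaning of a zero CAS index (assumed here to carry over to transactions).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PlannedOp:
    action: str              # "SET" or "DELETE"
    key: str
    value: Optional[bytes]   # None for DELETE
    modify_index: int        # index observed during planning; 0 means the key was absent


def build_plan(
    current: Dict[str, Tuple[bytes, int]],  # key -> (value, ModifyIndex) read from Consul
    desired: Dict[str, bytes],              # key -> value read from Git
) -> List[PlannedOp]:
    plan: List[PlannedOp] = []
    for key in sorted(desired):
        observed = current.get(key)
        if observed is None:
            # Key missing in Consul: create it, guarded by index 0.
            plan.append(PlannedOp("SET", key, desired[key], 0))
        elif observed[0] != desired[key]:
            # Value differs: overwrite, guarded by the index we read.
            plan.append(PlannedOp("SET", key, desired[key], observed[1]))
    for key in sorted(current):
        if key not in desired:
            # Key exists in Consul but not in Git: delete, guarded by the index we read.
            plan.append(PlannedOp("DELETE", key, None, current[key][1]))
    return plan


if __name__ == "__main__":
    current = {"app/a": (b"1", 10), "app/b": (b"2", 11), "app/c": (b"3", 12)}
    desired = {"app/a": b"1", "app/b": b"20", "app/d": b"4"}
    for op in build_plan(current, desired):
        print(op.action, op.key, op.modify_index)
```

Because the function only reads its inputs, printing its result without submitting anything doubles as a preview of the restore.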

Consequences

Positive

  • Prevents silent overwrites. A concurrent change is detected and reported, not lost.
  • Atomic batches. Each Consul transaction (up to 64 operations) is applied all-or-nothing, so a batch never lands half-written.
  • Clear conflict signaling. A failed CAS tells the operator exactly which key was modified concurrently.
  • Dry-run mode. The plan can be previewed without executing, showing exactly what would change.
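Conflict signaling can be sketched as follows. When a transaction is rejected, Consul's response carries an `Errors` list whose entries have an `OpIndex` (position of the failed operation in the submitted list) and a `What` message. The function name `conflicting_keys` is hypothetical, and the response body below is illustrative rather than captured from a real cluster.

```python
from typing import List, Tuple


def conflicting_keys(ops: List[dict], response: dict) -> List[Tuple[str, str]]:
    """Map each entry in the response's Errors list back to the key it refers to."""
    conflicts = []
    for err in response.get("Errors") or []:
        op = ops[err["OpIndex"]]  # OpIndex points into the submitted operation list
        conflicts.append((op["KV"]["Key"], err["What"]))
    return conflicts


if __name__ == "__main__":
    ops = [
        {"KV": {"Verb": "cas", "Key": "app/db/host", "Value": "ZGItMQ==", "Index": 42}},
        {"KV": {"Verb": "delete-cas", "Key": "app/old-flag", "Index": 17}},
    ]
    # Illustrative response; the "What" text is a placeholder, not Consul's wording.
    response = {"Errors": [{"OpIndex": 1, "What": "<error text from Consul>"}]}
    for key, what in conflicting_keys(ops, response):
        print(f"CONFLICT {key}: {what}")
```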

Negative

  • More complex than blind writes. The planner must track ModifyIndex per key.
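  • CAS conflicts require manual intervention. After resolving the conflict, the operator must rebuild the plan and run it again, because the ModifyIndex values recorded in the old plan are stale.
  • Consul's limit of 64 operations per transaction means large restores must be split across multiple transactions.
  • A partially applied restore needs careful handling. Each transaction is all-or-nothing, but when a restore spans several batches, earlier batches may already be committed when a later one is rejected.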
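One way to handle the 64-operation limit is to split the plan into consecutive batches and stop at the first one that is rejected, so the operator can see how much of the restore was committed. This is a sketch only: `chunked`, `apply_in_batches`, and the `submit` callback (which would wrap the real transaction call and return a list of errors, empty on success) are hypothetical names.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

MAX_TXN_OPS = 64  # Consul's per-transaction operation limit


def chunked(ops: Sequence, size: int = MAX_TXN_OPS) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` operations."""
    for start in range(0, len(ops), size):
        yield ops[start:start + size]


def apply_in_batches(
    ops: Sequence,
    submit: Callable[[Sequence], List],  # returns errors; empty list means committed
) -> Tuple[int, List]:
    """Submit batches in order; stop at the first rejected batch, never retry."""
    applied = 0
    for batch in chunked(ops):
        errors = submit(batch)
        if errors:
            # Report how many operations were committed before the rejection.
            return applied, errors
        applied += len(batch)
    return applied, []


if __name__ == "__main__":
    ops = list(range(150))  # stand-ins for real transaction operations

    def fake_submit(batch):
        # Pretend the batch containing operation 100 hits a CAS conflict.
        return ["conflict on op 100"] if 100 in batch else []

    applied, errors = apply_in_batches(ops, fake_submit)
    print(f"applied {applied} of {len(ops)} operations; errors: {errors}")
```

Stopping on the first rejection matches the decision not to retry automatically. The count of committed operations gives the operator a starting point for the rebuilt plan.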

Alternatives Considered

| Option | Pros | Cons |
| --- | --- | --- |
| Blind writes (KVSet) | Simple, fast | Dangerous. Can silently overwrite concurrent changes. |
| Session-based locking | Strong mutual exclusion | Too heavy for restore. Blocks all other writers for the duration. |
| Raft barrier | Strongest consistency | Only available in Consul internals, not exposed via API. |