ADR-004: CAS (Check-And-Set) for All Restore Operations
Status: Accepted
Date: 2026-04-04
Context
During a restore, Guardian writes values from Git back to Consul KV. Between the time Guardian reads the current state (to build the restore plan) and the time it writes the restored values, another process could modify a key. Without protection, the restore would silently overwrite that concurrent change.
This is a real risk in production. Restore operations often happen during incidents, exactly when other systems (deploy scripts, config management, other operators) are also making changes.
Decision
All restore operations must use CAS (Check-And-Set) via the Consul Transaction API. Every write includes the ModifyIndex read during planning. If the index has changed (meaning someone else modified the key), the write fails instead of overwriting.
The flow:
- Read current state from Consul, recording each key's ModifyIndex.
- Compare against desired state from Git.
- Build a plan of SET and DELETE operations, each tagged with the ModifyIndex.
- Execute the plan as a Consul transaction. The transaction applies atomically, and the CAS check on each operation rejects the write if the key changed after planning.
- If a CAS check fails, report the conflict. Don't retry automatically.
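The flow above can be sketched as follows. This is a minimal illustration in Python, not Guardian's actual implementation: the `PlanOp` type and the `build_plan` and `to_txn_payload` helpers are hypothetical names. The payload shape (a list of `KV` operations with `Verb`, `Key`, `Value`, and `Index`, using the `cas` and `delete-cas` verbs, with base64-encoded values) follows the Consul Transaction API. Two further assumptions: keys present in Consul but absent from Git are deleted, and a new key is written with index 0 so the write succeeds only if the key still does not exist.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PlanOp:
    """One planned write (hypothetical type, for illustration only)."""
    verb: str               # "cas" for SET, "delete-cas" for DELETE
    key: str
    value: Optional[bytes]  # None for deletes
    index: int              # ModifyIndex recorded during planning; 0 = key absent


def build_plan(current: dict, desired: dict) -> list:
    """Diff Consul state against Git state.

    current: key -> (value, ModifyIndex) as read from Consul
    desired: key -> value as stored in Git
    """
    plan = []
    for key in sorted(desired):
        value = desired[key]
        if key not in current:
            # Assumption: index 0 makes the CAS succeed only if the key is still absent.
            plan.append(PlanOp("cas", key, value, 0))
        elif current[key][0] != value:
            plan.append(PlanOp("cas", key, value, current[key][1]))
    for key in sorted(current):
        if key not in desired:
            # Assumption: keys missing from Git are removed by the restore.
            plan.append(PlanOp("delete-cas", key, None, current[key][1]))
    return plan


def to_txn_payload(plan: list) -> list:
    """Render the plan as the JSON-serializable body for PUT /v1/txn."""
    payload = []
    for op in plan:
        kv = {"Verb": op.verb, "Key": op.key, "Index": op.index}
        if op.value is not None:
            kv["Value"] = base64.b64encode(op.value).decode("ascii")
        payload.append({"KV": kv})
    return payload
```

Because the plan is plain data, the same list can be printed for a dry run or serialized and submitted, without re-reading Consul in between.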
Consequences
Positive
- Prevents silent overwrites. A concurrent change is detected and reported, not lost.
- Atomic batches. Each Consul transaction holds up to 64 operations and is applied all-or-nothing, so every batch is internally consistent.
- Clear conflict signaling. A failed CAS tells the operator exactly which key was modified concurrently.
- Dry-run mode. The plan can be previewed without executing, showing exactly what would change.
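As an illustration of conflict signaling, a rejected transaction can be mapped back to the keys involved. The sketch below assumes the error body shape documented for the Consul Transaction API (an `Errors` list whose entries carry `OpIndex` and `What`). `conflicting_keys` is a hypothetical helper, and the sample payload and response are made up for the example rather than captured from Consul.

```python
def conflicting_keys(payload: list, response: dict) -> list:
    """Map each error in a rejected txn response to the key of the operation it refers to.

    payload:  the list of {"KV": {...}} operations that was submitted
    response: decoded JSON body of the rejected transaction (assumed shape)
    """
    conflicts = []
    for err in response.get("Errors") or []:
        op = payload[err["OpIndex"]]["KV"]
        conflicts.append((op["Key"], err["What"]))
    return conflicts


# Made-up example data, not real Consul output.
payload = [
    {"KV": {"Verb": "cas", "Key": "app/a", "Value": "bmV3", "Index": 10}},
    {"KV": {"Verb": "delete-cas", "Key": "app/c", "Index": 12}},
]
response = {"Results": None, "Errors": [{"OpIndex": 1, "What": "index is stale"}]}

for key, reason in conflicting_keys(payload, response):
    print(f"CAS conflict on {key}: {reason}")
```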
Negative
- More complex than blind writes. The planner must track ModifyIndex per key.
- CAS conflicts require manual intervention. The operator must re-run the restore after the conflict is resolved, rebuilding the plan so that it carries fresh ModifyIndex values.
- Consul's transaction batch limit of 64 operations requires pagination for large restores.
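- A partially applied restore needs careful handling. Each transaction is all-or-nothing, but when a restore spans several batches, the batches committed before a conflicting one are not rolled back.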
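The batch limit and the no-automatic-retry rule can be sketched together. This is an illustration under assumptions, not Guardian's code: `submit` stands in for whatever function sends one transaction to Consul and returns `True` when it commits, and `batches` and `execute_plan` are hypothetical names.

```python
from typing import Callable, Iterator

MAX_TXN_OPS = 64  # Consul's per-transaction operation limit


def batches(ops: list, size: int = MAX_TXN_OPS) -> Iterator[list]:
    """Yield consecutive slices of at most `size` operations."""
    for start in range(0, len(ops), size):
        yield ops[start:start + size]


def execute_plan(ops: list, submit: Callable[[list], bool]) -> tuple:
    """Submit batches in order; stop at the first rejected batch, with no retry.

    Returns (number of operations committed, operations not applied).
    """
    applied = 0
    for batch in batches(ops):
        if not submit(batch):
            # Earlier batches stay committed; hand the remainder back for reporting.
            return applied, ops[applied:]
        applied += len(batch)
    return applied, []
```

Returning the unapplied remainder gives the operator a precise picture of where the restore stopped, which is the information needed before planning again.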
Alternatives Considered
| Option | Pros | Cons |
|---|---|---|
| Blind writes (KVSet) | Simple, fast | Dangerous. Can silently overwrite concurrent changes. |
| Session-based locking | Strong mutual exclusion | Too heavy for restore. Blocks all other writers for the duration. |
| Raft barrier | Strongest consistency | Only available in Consul internals, not exposed via API. |