RB-003 — Vault is sealed and cannot be unsealed
Use this runbook when:
vault.448.global is reachable but reports `"sealed": true`, and the standard 3-of-5 unseal-key procedure cannot be completed (because the keys are missing, holders are unreachable, or shares fail validation).
id: RB-003
title: Vault unseal and recovery-of-last-resort
severity: critical
estimated_duration: 30 minutes (normal unseal) to multiple days (full secret rotation)
servers_affected: [O1]
apps_affected: ["15"]
related_kis: [KI-017]
related_actions: [RM-013, RM-017, RM-021]
Trigger / detection
curl -s https://vault.448.global/v1/sys/seal-status | jq
# {
# "sealed": true, ← if this is true, Vault is sealed
# "t": 3, ← threshold (number of keys needed)
# "n": 5, ← total shares
# ...
# }
A sealed Vault returns valid HTTP responses for /sys/seal-status but rejects any secret read. Apps that read secrets at startup will fail with "Vault sealed" or 503 errors.
If /sys/seal-status returns nothing (connection refused, 502 from Caddy), the Vault container/process is not running — see Path D below.
Severity
Critical. Apps that read secrets at startup cannot start. Apps with secrets cached in memory continue to work until restart. If Vault is sealed because the storage backend was lost and the snapshot can't be restored, every secret in the company — every credential, API key, signing key, OIDC client secret, DB password — is gone and must be rotated.
This runbook covers four sub-scenarios:
| Scenario | Path |
|---|---|
| Vault is up, just sealed (e.g. after a host restart with no auto-unseal) | Path A — routine unseal |
| Vault storage was lost; need to restore from snapshot AND unseal | Path B — DR restore |
| Unseal keys are missing or holders unreachable; storage may or may not be restorable | Path C — total rotation |
| Vault container won't start (CAP_SETFCAP / image-update / runtime issue) | Path D — container recovery |
Required access
- Vault unseal-key shares — at least the threshold number (typically 3 of 5)
- Names + contact details of unseal-key custodians (on file in infra/audits/[INFO NEEDED])
- (For Path B) Latest Vault Raft snapshot from OCI bucket
- (For Path B) SSH access to O1 (or new O1 instance per RB-002)
- (For Path C) An "all hands" — secret rotation across the estate is a multi-person exercise
Preconditions (must be true before this runbook can succeed)
| Item | Why |
|---|---|
| Unseal-key custody documented | Without knowing who has which share, you cannot reach threshold |
| Vault Raft snapshot off-host (RM-013) | Required for Path B |
| Snapshot has been restore-tested (RM-017) | Otherwise Path B is unproven and may fail mid-recovery |
Today (2026-05) Path A is the only path that can succeed. Path B requires RM-013 (Vault snapshot job) to have landed. Path C is documented for completeness but is the scenario RM-013 is designed to prevent.
Path A — routine unseal (3 of 5 keys)
The straightforward case: Vault is running but sealed (e.g. after a host reboot).
Step A1 — Confirm Vault is reachable and sealed
curl -s https://vault.448.global/v1/sys/seal-status | jq
# Confirm: "sealed": true, threshold matches expected (typically 3)
Step A2 — Reach 3 unseal-key custodians
Coordinate via whatever channel works (phone is best in an outage scenario where Slack/Teams may be SSO-affected). Each custodian provides their key share over a secure channel (in-person ideal; otherwise an out-of-band encrypted channel — never paste keys in shared chat).
Step A3 — Apply the keys
# Each holder runs (on a host with vault CLI installed and configured):
export VAULT_ADDR=https://vault.448.global
vault operator unseal
# Prompts for the key share. After 3 successful applications, Vault unseals.
Step A4 — Verify
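A minimal check, using the same endpoint as the detection step:
export VAULT_ADDR=https://vault.448.global
vault status
# Expect Sealed: false; the HTTP endpoint should agree:
curl -s https://vault.448.global/v1/sys/seal-status | jq '.sealed'
# false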
Step A5 — Notify dependent apps
If any app failed to start during the sealed window, restart it now so it picks up its secrets fresh:
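For example (a sketch; the container names are placeholders, and the real ones are in the per-app docs under apps/):
sudo docker restart <app-container>         # placeholder: one restart per affected app
sudo docker logs --tail=20 <app-container>  # confirm it fetched its secrets and is healthy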
Path B — restore from snapshot, then unseal
When the Vault data was lost and must be restored from a Raft snapshot.
Step B1 — Provision Vault on the recovery host
If this is happening as part of RB-002, Vault has already been provisioned on the new O1. If standalone, run a fresh Vault container as in RB-002 Step 5.
Step B2 — Initialize the new Vault
If this is a fresh Vault that has never been initialized:
vault operator init
# This generates a NEW set of unseal keys + root token.
# These are NOT the same as the keys for the snapshot you're about to restore.
# IMPORTANT: distribute these to ≥5 custodians per the same Shamir threshold policy.
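Vault's defaults already match the 3-of-5 policy; to pin them explicitly (standard vault operator init flags):
vault operator init -key-shares=5 -key-threshold=3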
Step B3 — Pull the snapshot
oci os object get \
--bucket-name <bucket> \
--name vault-snapshots/latest.snap \
--file /tmp/vault.snap
sha256sum /tmp/vault.snap
# Compare against the recorded sum in the snapshot manifest
# (which is also in the bucket).
Step B4 — Restore
vault login # Use the new root token from Step B2
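# Note: restoring a snapshot taken on a different cluster (as here, after a fresh
# init) may require the -force flag; verify against the installed Vault version.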
vault operator raft snapshot restore /tmp/vault.snap
# When the restore completes, the snapshot's keys take over.
# The init keys from Step B2 are discarded; the snapshot's
# unseal keys are now the active set.
Step B5 — Unseal with the snapshot's keys
This is the same as Path A — get 3 of 5 holders of the original keys (the ones in effect at snapshot time) to apply their shares.
Step B6 — Verify and notify (same as Path A Steps A4-A5)
Path D — Vault container won't start
When vault.448.global returns 502 / connection refused, and docker ps shows the Vault container restarting or exited.
Step D1 — Capture the failure mode
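First capture the container state and the tail of its logs (standard Docker commands):
sudo docker ps -a --filter name=vault   # note the state: Restarting, Exited, etc.
sudo docker logs --tail=50 vault        # grab the error before the next restart clears it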
Common error patterns:
| Log message | Cause | Go to |
|---|---|---|
| `unable to set CAP_SETFCAP effective capability: Operation not permitted` | Container missing SETFCAP capability — usually after image update | Step D2 |
| `cannot mlock memory` | IPC_LOCK capability missing | Step D2 |
| `permission denied` on `/vault/data` | Volume mount issue | Manual fix; check ownership |
| `bind: address already in use` | Port conflict | Stop the conflicting process |
| Vault crashes during raft startup | Storage corruption | Switch to Path B |
Step D2 — Add missing capabilities AND SKIP_SETCAP=true
Important: the HashiCorp Vault image runs as the non-root `vault` user (Dockerfile `USER vault`). A non-root container user cannot call `setcap` even when `CAP_SETFCAP` is in the bounding set, because Docker doesn't grant ambient capabilities to non-root users by default. So adding `cap_add: [IPC_LOCK, SETFCAP]` is necessary but not sufficient — you also need `SKIP_SETCAP=true` to tell the entrypoint to skip the `setcap` step entirely. This is the lesson from incident 2026-05-01.
If using docker-compose.yml — this is the canonical pattern, also at infra/vault/docker-compose.yml:
services:
  vault:
    image: hashicorp/vault:<pinned-version>  # pin, do not use :latest
    container_name: vault
    restart: always
    command: server
    cap_add:
      - IPC_LOCK
      - SETFCAP
    environment:
      - SKIP_SETCAP=true  # <-- the critical one for the non-root user case
    expose:
      - "8200"
    volumes:
      - /home/ubuntu/docker/vault/config:/vault/config
      - /home/ubuntu/docker/vault/file:/vault/file
      - /home/ubuntu/docker/vault/logs:/vault/logs
      - /home/ubuntu/docker/vault/data:/vault/data
    networks:
      - caddy_default
    labels:
      - com.centurylinklabs.watchtower.enable=false  # opt out of auto-update

networks:
  caddy_default:
    external: true
Apply and restart:
sudo docker compose up -d vault
sudo docker logs --tail=20 -f vault
# Expect "Vault server starting", "Vault server started! Log data will stream in below"
# Should stay running, not restart-loop.
Trade-off recorded: without `SKIP_SETCAP=true`, the container can't `setcap` and crashes. With it, Vault runs but cannot `mlockall()` to prevent its memory being swapped to disk. Acceptable on a low-memory-pressure host; revisit if the host is upgraded.
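If Vault still exits complaining it cannot lock memory, check whether Vault's disable_mlock = true option is set in the HCL config (a standard Vault setting; whether this deployment already sets it is an assumption to verify):
grep -Rn disable_mlock /home/ubuntu/docker/vault/config/
# Expect: disable_mlock = true. Without it, Vault refuses to start when mlock fails.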
Step D3 — Alternative if you must have mlockall
Run the container as root and let the entrypoint drop privileges itself via su-exec:
services:
  vault:
    user: "0:0"  # start as root
    cap_add:
      - IPC_LOCK
      - SETFCAP
    # SKIP_SETCAP not needed — the entrypoint runs setcap as root, then drops to the vault user
Mildly less ideal from a defense-in-depth standpoint (a kernel escape from container root is worse than one from a dropped user), but functionally equivalent, and mlockall() works.
Step D4 — Once the container is running, unseal it
The container starts in sealed state. Continue with Path A — get 3-of-5 unseal-key holders to apply their shares.
Step D5 — Take an immediate snapshot
Even before RM-013 lands as a scheduled job:
# Use vault login (NOT a token on the command line — see security note below)
export VAULT_ADDR=https://vault.448.global
vault login -method=oidc
vault operator raft snapshot save /tmp/vault-post-incident-$(date +%F).snap
sha256sum /tmp/vault-post-incident-*.snap
oci os object put \
--bucket-name PECommon \
--file /tmp/vault-post-incident-*.snap \
--name infra/vault.448.global/raft-snapshots/$(date +%F).snap
shred -u /tmp/vault-post-incident-*.snap
Security note from the 2026-05-01 incident response: never pass tokens via `--header "X-Vault-Token: hvs.…"` on the command line. Tokens land in shell history and any captured output. Use `vault login` to set the session token via env, run your operations, then `vault token revoke -self` when done.
Step D6 — Lock the image and exclude from auto-update
So this exact failure can't repeat:
- Pin the Vault image in compose to a specific tag (e.g. `hashicorp/vault:1.18.2`), not `:latest`.
- Add the Watchtower-exclusion label: `com.centurylinklabs.watchtower.enable=false`.
- Future Vault upgrades become a deliberate step, not a Watchtower side-effect.
Path C — worst case, Vault is unrecoverable
If unseal keys cannot be assembled (custodians lost contact, shares lost, or Vault data permanently gone with no usable snapshot), Vault must be considered compromised by erasure. Every secret it held must be rotated.
This is a multi-day exercise involving every team. Plan it deliberately.
Step C1 — Establish a fresh Vault
Stand up a brand-new Vault instance on O1 (or anywhere). vault operator init produces fresh keys; distribute to ≥5 custodians under a renewed custody policy. Update apps/15-vault.md with the new key holders.
Step C2 — Inventory every secret that lived in old Vault
Refer to:
- The Vault credential paths schema documented in RM-024
- The `Credentials in Vault` table in every per-app doc under `apps/` — this is the canonical inventory.
For each secret, list:
- Where it was used (which app, which integration)
- Who can rotate it (which DBA / cloud-admin)
- Estimated downtime per rotation
Step C3 — Rotate in dependency order
Rotate from the leaves inward — secrets that affect the smallest blast radius first, leaving Tier-0 / cross-cutting credentials for last. Suggested order:
1. Per-app database passwords (each app's DB user — rotate one app at a time, restart the app)
2. SMTP credentials (one provider at a time)
3. Per-app API keys (one external service at a time)
4. OIDC client secrets between Authentik and apps (rotate Authentik-side, update each app)
5. Object-storage access keys (MinIO root + per-app keys)
6. OCI API signing keys (one user at a time)
7. Domain registrar API tokens (GoDaddy)
8. Authentik secret key (Tier-0 — invalidates all sessions)
9. Microsoft 365 federation secrets (last; coordinate with M365 admin)
For each, populate the new Vault with the new credentials per RM-024.
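Populating the new Vault is a plain vault kv put per secret; the path below is hypothetical, since the real scheme is whatever RM-024 defines:
vault kv put secret/apps/<app>/db username='<db-user>' password='<new-password>'  # hypothetical path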
Step C4 — Audit
Once rotation is complete, run a search across the GitLab repo for any committed secret references:
# In the engineering repo and related repos:
grep -rEn '(password|secret|api[_-]?key|token)\s*=\s*["'\''][^"'\'']{8,}' .
Anything found needs immediate attention.
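Note that the working-tree grep misses secrets that were committed and later removed; a history scan (a sketch, plain git) catches those:
git log -p --all | grep -En '(password|secret|api[_-]?key|token)\s*=\s*["'\''][^"'\'']{8,}' | head -50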
Step C5 — Document
File at infra/incidents/YYYY-MM-DD-vault-loss.md:
- Root cause of the loss
- Timeline
- Which secrets had to be rotated
- Total customer-visible downtime
- New custody policy (in detail)
This event should also kick off RM-013 and RM-017 at the highest urgency if they hadn't landed already.
Verification (post-recovery)
- `vault status` shows `Sealed: false`, `HA: Active`
- A sample secret read succeeds: `vault kv get secret/<known-path>`
- Authentik can start (it depends on a Vault-stored secret key) — `curl -sI https://auth.448.global/-/health/live/` returns 204
- Apps that failed during the sealed window have been restarted and are healthy
- Beszel / Gotify alerting fires on a synthetic Vault outage
- (Path C only) Search of all repos returns no committed credentials
Rollback / abort
- During unseal: applying a wrong key share does not corrupt Vault — it just doesn't count toward the threshold. Safe to retry.
- During Path B restore: the snapshot restore overwrites the new Vault's state, but the snapshot file itself is untouched. If the restore fails, retry with a different snapshot.
- During Path C: there is no rollback. Once a credential is rotated, the old one is dead. Communicate clearly to all consumers before each rotation.
Break-glass: emergency root token
If unseal succeeds but the root token is also lost, generate a new one:
# Requires unseal-key holders again (3 of 5):
vault operator generate-root -init
# Prints a one-time password (OTP) and a nonce; keep both.
# Each custodian then runs (prompted for their key share):
vault operator generate-root
# At threshold, an encoded root token is printed. Decode it with the OTP:
vault operator generate-root -otp=<one-time-password> -decode=<encoded-token>
# Revoke the new root token after use.
Post-incident
- For Path A: brief mention in `infra/incidents/YYYY-MM-DD-vault-unseal.md`. Catalogue why Vault sealed (host restart, OOM, etc.) and whether auto-unseal could prevent the next occurrence (transit / cloud-KMS auto-unseal is a Phase-3 candidate).
- For Path B: full incident report including snapshot age (was the latest snapshot recent enough?). Update backups.md with the restore-test outcome.
- For Path C: the most serious incident type the company can have. Triggers a full post-mortem with leadership.
Related
- KI-017 — Vault has no backup today (Path B precondition)
- KI-016 — No restore-test ever performed
- RM-013 — Vault Raft snapshot job (closes Path B precondition)
- RM-017 — Quarterly restore drills
- RM-024 — Vault path scheme (essential for Path C)
- apps/15-vault.md — Vault application doc
- RB-002 — O1 disaster recovery (calls into this runbook for the Vault-restore step)