
RB-003 — Vault is sealed and cannot be unsealed

Use this runbook when: vault.448.global is reachable but reports "sealed": true, and the standard 3-of-5 unseal-key procedure cannot be completed (because the keys are missing, holders are unreachable, or shares fail validation).

id: RB-003
title: Vault unseal and recovery-of-last-resort
severity: critical
estimated_duration: 30 minutes (normal unseal) to multiple days (full secret rotation)
servers_affected: [O1]
apps_affected: ["15"]
related_kis: [KI-017]
related_actions: [RM-013, RM-017, RM-021]

Trigger / detection

curl -s https://vault.448.global/v1/sys/seal-status | jq
# {
#   "sealed": true,        ← if this is true, Vault is sealed
#   "t": 3,                ← threshold (number of keys needed)
#   "n": 5,                ← total shares
#   ...
# }

A sealed Vault returns valid HTTP responses for /sys/seal-status but rejects any secret read. Apps that read secrets at startup will fail with "Vault sealed" or 503 errors.

If /sys/seal-status returns nothing (connection refused, 502 from Caddy), the Vault container/process is not running — see Path D below.
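The seal check above can be wired into the synthetic alerting mentioned under Verification. A minimal sketch, assuming Gotify as the sink (the `GOTIFY_URL`/`GOTIFY_TOKEN` values are placeholders; grep is used instead of jq so the check has no dependencies):

```shell
# Succeeds (exit 0) when the seal-status JSON on stdin reports sealed=true.
check_sealed() {
  grep -q '"sealed": *true'
}

# Live usage (illustrative):
#   curl -s https://vault.448.global/v1/sys/seal-status | check_sealed \
#     && curl -s -X POST "$GOTIFY_URL/message?token=$GOTIFY_TOKEN" \
#          -F title="Vault sealed" -F priority=8 \
#          -F message="vault.448.global reports sealed=true"

# Offline demonstration against a captured response:
echo '{"sealed": true, "t": 3, "n": 5}' | check_sealed && echo "SEALED"
```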

Severity

Critical. Apps that read secrets at startup cannot start. Apps with secrets cached in memory continue to work until restart. If the Vault is sealed because the storage backend was lost and the snapshot can't be restored, every secret in the company is gone — every credential, API key, signing key, OIDC client secret, DB password — must be rotated.

This runbook covers two sub-scenarios:

| Scenario | Path |
| --- | --- |
| Vault is up, just sealed (e.g. after a host restart with no auto-unseal) | Path A — routine unseal |
| Vault storage was lost; need to restore from snapshot AND unseal | Path B — DR restore |
| Unseal keys are missing or holders unreachable; storage may or may not be restorable | Path C — total rotation |
| Vault container won't start (CAP_SETFCAP / image-update / runtime issue) | Path D — container recovery |

Required access

  • Vault unseal-key shares — at least the threshold number (typically 3 of 5)
  • Names + contact details of unseal-key custodians (on file in infra/audits/ [INFO NEEDED])
  • (For Path B) Latest Vault Raft snapshot from OCI bucket
  • (For Path B) SSH access to O1 (or new O1 instance per RB-002)
  • (For Path C) An "all hands" — secret rotation across the estate is a multi-person exercise

Preconditions (must be true before this runbook can succeed)

| Item | Why |
| --- | --- |
| Unseal-key custody documented | Without knowing who has which share, you cannot reach threshold |
| Vault Raft snapshot off-host (RM-013) | Required for Path B |
| Snapshot has been restore-tested (RM-017) | Otherwise Path B is unproven and may fail mid-recovery |

Today (2026-05) Path A is the only path that can succeed. Path B requires RM-013 (Vault snapshot job) to have landed. Path C is documented for completeness but is the scenario RM-013 is designed to prevent.


Path A — routine unseal (3 of 5 keys)

The straightforward case: Vault is running but sealed (e.g. after a host reboot).

Step A1 — Confirm Vault is reachable and sealed

curl -s https://vault.448.global/v1/sys/seal-status | jq
# Confirm: "sealed": true, threshold matches expected (typically 3)

Step A2 — Reach 3 unseal-key custodians

Coordinate via whatever channel works (phone is best in an outage scenario where Slack/Teams may be SSO-affected). Each custodian provides their key share over a secure channel (in-person ideal; otherwise an out-of-band encrypted channel — never paste keys in shared chat).

Step A3 — Apply the keys

# Each holder runs (on a host with vault CLI installed and configured):
export VAULT_ADDR=https://vault.448.global
vault operator unseal
# Prompts for the key share. After 3 successful applications, Vault unseals.
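Between key applications, the `progress` field of `/sys/seal-status` shows how many valid shares have been accepted so far. A sketch of pulling it out (offline sample shown; live use pipes the curl from Step A1 into the function):

```shell
# Extract the integer from the "progress" field (grep-only, no jq needed).
unseal_progress() {
  grep -o '"progress": *[0-9]*' | grep -o '[0-9]*$'
}

# After two of three shares have been applied, the endpoint reports:
echo '{"sealed": true, "t": 3, "n": 5, "progress": 2}' | unseal_progress   # prints 2
```

Note that `progress` resets to 0 once the threshold is reached and Vault unseals.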

Step A4 — Verify

vault status
# Expect: Sealed   false
#         HA Mode  active   (when running HA / raft)
# Smoke-test that secrets are readable:
vault kv get secret/some/well-known/path

Step A5 — Notify dependent apps

If any app failed to start during the sealed window, restart it now so it picks up its secrets fresh:

# Examples; adjust per app:
docker restart authentik-server
docker restart n8n
# etc.

Path B — restore from snapshot, then unseal

When the Vault data was lost and must be restored from a Raft snapshot.

Step B1 — Provision Vault on the recovery host

If this is happening as part of RB-002, Vault has already been provisioned on the new O1. If standalone, run a fresh Vault container as in RB-002 Step 5.

Step B2 — Initialize the new Vault

If this is a fresh Vault that has never been initialized:

vault operator init -key-shares=5 -key-threshold=3
# This generates a NEW set of unseal keys + root token.
# These are NOT the same as the keys for the snapshot you're about to restore.
# IMPORTANT: distribute the 5 shares to separate custodians per the same
# Shamir threshold policy.

Step B3 — Pull the snapshot

oci os object get \
  --bucket-name <bucket> \
  --name vault-snapshots/latest.snap \
  --file /tmp/vault.snap
sha256sum /tmp/vault.snap
# Compare against the recorded sum in the snapshot manifest
# (which is also in the bucket).
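If the manifest is stored as raw `sha256sum` output (an assumption — confirm the actual manifest format in the bucket), the comparison collapses to a single `sha256sum -c`. Offline demonstration of the pattern:

```shell
# Hypothetical layout: the manifest sits next to the snapshot and contains
# "<sha256>  <path>" lines, i.e. the output of `sha256sum`.
printf 'snapshot-bytes' > /tmp/demo.snap        # stand-in for vault.snap
sha256sum /tmp/demo.snap > /tmp/demo.manifest   # stand-in for the bucket manifest
sha256sum -c /tmp/demo.manifest                 # prints "/tmp/demo.snap: OK"
```

In the live case you would fetch `vault-snapshots/latest.snap.sha256` (name assumed) with a second `oci os object get` and run `sha256sum -c` against it.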

Step B4 — Restore

vault login   # Use the new root token from Step B2
vault operator raft snapshot restore -force /tmp/vault.snap
# -force is required: the snapshot was taken on a different cluster, so its
# unseal keys don't match the ones generated in Step B2.
# When the restore completes, the snapshot's keys take over. The init keys
# from Step B2 are discarded; the snapshot's unseal keys are now the active set.

Step B5 — Unseal with the snapshot's keys

This is the same as Path A — get 3 of 5 holders of the original keys (the ones in effect at snapshot time) to apply their shares.

Step B6 — Verify and notify (same as Path A Steps A4 and A5)


Path D — Vault container won't start

When vault.448.global returns 502 / connection refused, and docker ps shows the Vault container restarting or exited.

Step D1 — Capture the failure mode

sudo docker ps -a --filter name=vault
sudo docker logs --tail=50 <vault-container-id>

Common error patterns:

| Log message | Cause | Go to |
| --- | --- | --- |
| unable to set CAP_SETFCAP effective capability: Operation not permitted | Container missing SETFCAP capability — usually after image update | Step D2 |
| cannot mlock memory | IPC_LOCK capability missing | Step D2 |
| permission denied on /vault/data | Volume mount issue | Manual fix; check ownership |
| bind: address already in use | Port conflict | Stop the conflicting process |
| Vault crashes during raft startup | Storage corruption | Switch to Path B |

Step D2 — Add missing capabilities AND SKIP_SETCAP=true

Important: the HashiCorp Vault image runs as the non-root vault user (Dockerfile USER vault). A non-root container user cannot call setcap even when CAP_SETFCAP is in the bounding set, because Docker doesn't grant ambient capabilities to non-root users by default. So adding cap_add: [IPC_LOCK, SETFCAP] is necessary but not sufficient — you also need SKIP_SETCAP=true to tell the entrypoint to skip the setcap step entirely.

This is the lesson from incident 2026-05-01.

If using docker-compose.yml — this is the canonical pattern, also at infra/vault/docker-compose.yml:

services:
  vault:
    image: hashicorp/vault:<pinned-version>   # pin, do not use :latest
    container_name: vault
    restart: always
    command: server
    cap_add:
      - IPC_LOCK
      - SETFCAP
    environment:
      - SKIP_SETCAP=true   # <-- the critical one for non-root user case
    expose:
      - "8200"
    volumes:
      - /home/ubuntu/docker/vault/config:/vault/config
      - /home/ubuntu/docker/vault/file:/vault/file
      - /home/ubuntu/docker/vault/logs:/vault/logs
      - /home/ubuntu/docker/vault/data:/vault/data
    networks:
      - caddy_default
    labels:
      - com.centurylinklabs.watchtower.enable=false   # opt out of auto-update
networks:
  caddy_default:
    external: true

Apply and restart:

sudo docker compose up -d vault
sudo docker logs --tail=20 -f vault
# Expect "Vault server starting", "Vault server started! Log data will stream in below"
# Should stay running, not restart-loop.

Trade-off recorded: without SKIP_SETCAP=true, the container can't setcap and crashes. With it, Vault runs but cannot mlockall() to prevent its memory being swapped to disk. Acceptable on a low-memory-pressure host; revisit if the host is upgraded.
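The trade-off above only holds if Vault is told not to attempt mlock at all — when the mlock call fails and is not disabled, Vault refuses to start. This is an assumption to verify against the mounted config under /vault/config (stanza shown in isolation; leave the listener/storage stanzas unchanged):

```hcl
# Required alongside SKIP_SETCAP=true (assumption — confirm it is present):
# skip the mlockall() attempt entirely instead of failing at startup.
disable_mlock = true
```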

Step D3 — Alternative if you must have mlockall

Run the container as root and let the entrypoint drop privileges itself via su-exec:

services:
  vault:
    user: "0:0"   # start as root
    cap_add:
      - IPC_LOCK
      - SETFCAP
    # SKIP_SETCAP not needed — the entrypoint runs setcap as root, then drops to vault user

Mildly less ideal from a defense-in-depth standpoint (a kernel escape from container root is worse than one from a dropped user), but functionally equivalent, and mlockall() works.

Step D4 — Once the container is running, unseal it

The container starts in sealed state. Continue with Path A — get 3-of-5 unseal-key holders to apply their shares.

Step D5 — Take an immediate snapshot

Even before RM-013 lands as a scheduled job:

# Use vault login (NOT a token on the command line — see security note below)
export VAULT_ADDR=https://vault.448.global
vault login -method=oidc
vault operator raft snapshot save /tmp/vault-post-incident-$(date +%F).snap
sha256sum /tmp/vault-post-incident-*.snap
oci os object put \
  --bucket-name PECommon \
  --file /tmp/vault-post-incident-*.snap \
  --name infra/vault.448.global/raft-snapshots/$(date +%F).snap
shred -u /tmp/vault-post-incident-*.snap

Security note from the 2026-05-01 incident response: never pass tokens via --header "X-Vault-Token: hvs.…" on the command line. Tokens land in shell history and any captured output. Use vault login to set the session token via env, run your operations, then vault token revoke -self when done.

Step D6 — Lock the image and exclude from auto-update

So this exact failure can't repeat:

  1. Pin the Vault image in compose to a specific tag (e.g. hashicorp/vault:1.18.2), not :latest.
  2. Add the Watchtower-exclusion label: com.centurylinklabs.watchtower.enable=false.
  3. Future Vault upgrades become a deliberate step, not a Watchtower side-effect.

Path C — worst case, Vault is unrecoverable

If unseal keys cannot be assembled (custodians lost contact, shares lost, or Vault data permanently gone with no usable snapshot), Vault must be considered compromised by erasure. Every secret it held must be rotated.

This is a multi-day exercise involving every team. Plan it deliberately.

Step C1 — Establish a fresh Vault

Stand up a brand-new Vault instance on O1 (or anywhere). vault operator init produces fresh keys; distribute to ≥5 custodians under a renewed custody policy. Update apps/15-vault.md with the new key holders.

Step C2 — Inventory every secret that lived in old Vault

Refer to:

  • The Vault credential paths schema documented in RM-024
  • The Credentials in Vault table in every per-app doc under apps/ — this is the canonical inventory.

For each secret, list:

  • Where it was used (which app, which integration)
  • Who can rotate it (which DBA / cloud-admin)
  • Estimated downtime per rotation

Step C3 — Rotate in dependency order

Rotate from the leaves inward — secrets that affect the smallest blast radius first, leaving Tier-0 / cross-cutting credentials for last. Suggested order:

  1. Per-app database passwords (each app's DB user — rotate one app at a time, restart the app)
  2. SMTP credentials (one provider at a time)
  3. Per-app API keys (one external service at a time)
  4. OIDC client secrets between Authentik and apps (rotate Authentik-side, update each app)
  5. Object-storage access keys (MinIO root + per-app keys)
  6. OCI API signing keys (one user at a time)
  7. Domain registrar API tokens (GoDaddy)
  8. Authentik secret key (Tier-0 — invalidates all sessions)
  9. Microsoft 365 federation secrets (last; coordinate with M365 admin)

For each, populate the new Vault with the new credentials per RM-024.
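A hedged sketch of one leaf rotation from the list above — the role name, Vault path, and container name are hypothetical; substitute per app from the inventory in Step C2:

```shell
# 1. Generate a fresh credential locally (never reuse or hand-type one).
NEW_PW=$(openssl rand -base64 32)

# 2. Apply it at the source of truth (example: the app's Postgres role):
#    psql -h <db-host> -c "ALTER ROLE n8n_app WITH PASSWORD '${NEW_PW}'"
# 3. Write it to the new Vault under the RM-024 path scheme (path assumed):
#    vault kv put secret/n8n/db password="${NEW_PW}"
# 4. Restart the consumer so it re-reads the secret:
#    docker restart n8n

# Sanity check: a real value was generated (base64 of 32 bytes = 44 chars).
echo "${#NEW_PW}"   # prints 44
```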

Step C4 — Audit

Once rotation is complete, run a search across the GitLab repo for any committed secret references:

# In the engineering repo and related repos:
grep -rEn '(password|secret|api[_-]?key|token)\s*=\s*["'\''][^"'\'']{8,}' .

Anything found needs immediate attention.
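The working-tree grep misses secrets that were committed and later deleted; scan history as well (e.g. `git grep <pattern> $(git rev-list --all)`). A self-contained demonstration in a throwaway repo — the committed key is fake:

```shell
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
commit() { git -c user.email=ir@example.invalid -c user.name=ir commit -q "$@"; }
commit --allow-empty -m "init"
echo 'api_key = "deadbeefcafe1234"' > config.ini   # fake committed secret
git add config.ini && commit -m "add config"
git rm -q config.ini && commit -m "remove config"
# The secret is gone from the working tree but still present in history:
git grep -h 'api_key' $(git rev-list --all) || true
```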

Step C5 — Document

File at infra/incidents/YYYY-MM-DD-vault-loss.md:

  • Root cause of the loss
  • Timeline
  • Which secrets had to be rotated
  • Total customer-visible downtime
  • New custody policy (in detail)

This event should also kick off RM-013 and RM-017 at the highest urgency if they hadn't landed already.


Verification (post-recovery)

  • vault status shows Sealed: false, HA: Active
  • A sample secret read succeeds: vault kv get secret/<known-path>
  • Authentik can start (it depends on a Vault-stored secret key) — curl -sI https://auth.448.global/-/health/live/ returns 204
  • Apps that failed during the sealed window have been restarted and are healthy
  • Beszel / Gotify alerting fires on a synthetic Vault outage
  • (Path C only) Search of all repos returns no committed credentials

Rollback / abort

  • During unseal: applying a wrong key share does not corrupt Vault — it just doesn't count toward the threshold. Safe to retry.
  • During Path B restore: the snapshot restore is destructive of the new Vault's state, but the old snapshot is untouched. If restore fails, retry with a different snapshot.
  • During Path C: there is no rollback. Once a credential is rotated, the old one is dead. Communicate clearly to all consumers before each rotation.

Break-glass: emergency root token

If unseal succeeds but the root token is also lost, generate a new one:

# Requires unseal-key holders again (3 of 5):
vault operator generate-root -init
# Prints a one-time password (OTP) and a nonce. Each custodian then runs:
vault operator generate-root -nonce=<nonce>
# ...and enters their unseal-key share when prompted. Once the threshold is
# reached, an ENCODED root token is printed. Decode it with the OTP:
vault operator generate-root -decode=<encoded-token> -otp=<otp>
# Revoke the new root token after use: vault token revoke -self

Post-incident

  1. For Path A: brief mention in infra/incidents/YYYY-MM-DD-vault-unseal.md. Catalogue why Vault sealed (host restart, OOM, etc.) and whether auto-unseal could prevent the next occurrence (transit / cloud-KMS auto-unseal is a Phase-3 candidate).
  2. For Path B: full incident report including snapshot-age (was the latest snapshot recent enough?). Update backups.md with restore-test outcome.
  3. For Path C: the most serious incident type the company can have. Triggers a full post-mortem with leadership.

Related
  • KI-017 — Vault has no backup today (Path B precondition)
  • KI-016 — No restore-test ever performed
  • RM-013 — Vault Raft snapshot job (closes Path B precondition)
  • RM-017 — Quarterly restore drills
  • RM-024 — Vault path scheme (essential for Path C)
  • apps/15-vault.md — Vault application doc
  • RB-002 — O1 disaster recovery (calls into this runbook for the Vault-restore step)