Runbooks

Step-by-step recovery procedures for the highest-impact failure modes. Each runbook is structured so that an on-call engineer can follow it under stress, and so that an AI agent can retrieve it (RAG) and, where authorized, execute its verification commands.

Index

ID	Title	Severity	Trigger
RB-001	Caddy reverse proxy is down (E1 or O1)	critical	Customer URLs unreachable; TLS or connection-refused errors
RB-002	ORA448Global VPS (O1) is destroyed or unrecoverable	catastrophic	Entire `*.448.global` estate gone; Vault and Authentik with it
RB-003	Vault is sealed or the container will not start	critical	Vault `/sys/seal-status` shows sealed, or container in restart loop
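As an illustration of the RB-003 trigger, the following is a minimal read-only sketch of a seal check. It assumes Vault's HTTP API is reachable at a caller-supplied address (`vault_addr` is a hypothetical parameter, not something defined in this repo) and relies on the `sealed` field of the `/v1/sys/seal-status` response. It only reads state and does not attempt to unseal anything.

```python
import json
import urllib.request


def is_sealed(seal_status_body: str) -> bool:
    """Interpret the JSON body returned by Vault's seal-status endpoint."""
    status = json.loads(seal_status_body)
    return bool(status["sealed"])


def fetch_seal_status(vault_addr: str, timeout: float = 5.0) -> str:
    """Read-only GET of the seal status. `vault_addr` is a hypothetical
    base URL supplied by the caller, e.g. taken from the environment."""
    url = f"{vault_addr.rstrip('/')}/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A connection error from `fetch_seal_status` is itself a signal: it suggests the container is down or restarting rather than sealed, which is the other RB-003 trigger.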

Conventions

Each runbook follows the same outline:

Trigger / detection — how you know this is the right runbook
Severity — blast radius and urgency
Required access — credentials, network, tools needed
Preconditions — what must be in place before the runbook can succeed (often references Phase-2 actions)
Recovery steps — numbered, each with an action and a verification check
Verification (post-recovery) — confirm full functionality
Rollback / abort — when to stop the runbook and call for help
Post-incident — file the incident record, update the affected KIs, and capture lessons learned
Related — KI / RM / app references using stable IDs
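The outline above lends itself to a simple structural check. The sketch below is one possible way to flag a runbook that is missing a section. The section names come from the list above, but the matching rule (a line that starts with the section name, ignoring case and any leading `#`) is an assumption about how runbook files are written.

```python
# Section names taken from the outline above, in order.
REQUIRED_SECTIONS = [
    "Trigger / detection",
    "Severity",
    "Required access",
    "Preconditions",
    "Recovery steps",
    "Verification (post-recovery)",
    "Rollback / abort",
    "Post-incident",
    "Related",
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return required sections that no line starts with, in outline order."""
    # Assumption: a section heading sits at the start of its own line,
    # optionally prefixed with Markdown '#' characters.
    lines = [
        line.strip().lstrip("#").strip().lower()
        for line in runbook_text.splitlines()
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(line.startswith(section.lower()) for line in lines)
    ]
```

An empty result means every section is present; it says nothing about whether each section's content is adequate.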

ID schemes

The same ID schemes are used throughout the repo:

RB-NNN — runbook
RM-NNN — roadmap action (phase-2-roadmap.md)
KI-NNN — known issue (known-issues.md)
E1, E2, O1, E3, E4, E5, O2, O3 — server / database identifiers (servers.md)
apps/NN-<slug>.md — per-app docs
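A tool that cross-links these IDs could classify them with a few regular expressions. The patterns below are inferred from the schemes listed above and are assumptions: three digits for `NNN`, two for `NN`, a single digit after `E` or `O`, and lowercase alphanumerics and hyphens for the slug.

```python
import re

# Patterns inferred from the ID schemes above; digit counts and the
# allowed slug characters are assumptions, not rules stated by the repo.
ID_PATTERNS = {
    "runbook": re.compile(r"RB-\d{3}"),
    "roadmap action": re.compile(r"RM-\d{3}"),
    "known issue": re.compile(r"KI-\d{3}"),
    "server / database": re.compile(r"[EO]\d"),
    "app doc": re.compile(r"apps/\d{2}-[a-z0-9-]+\.md"),
}


def classify_id(identifier: str):
    """Return the kind of stable ID, or None if it matches no scheme."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return None
```

Using `fullmatch` rather than `search` keeps near-misses such as `RB-2` or `RB-0001` from being accepted silently.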

How AI agents should use these

Treat verification commands as read-only by default. They confirm state; they do not change it.
Do not execute recovery steps automatically. Recovery actions can include destructive operations (re-imaging hosts, restoring backups over live data). These are gated to humans.
Use runbooks for diagnostic narration: when the agent detects a known failure pattern, point the on-call human at the relevant RB-NNN and report which preconditions are met.
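The three rules above can be expressed as a small gate in agent code. This is a sketch under stated assumptions: `StepKind`, `RunbookCommand`, `agent_may_execute`, and `narrate` are hypothetical names, and the sketch presumes each command has already been labeled as verification or recovery when the runbook was parsed.

```python
from dataclasses import dataclass
from enum import Enum


class StepKind(Enum):
    VERIFICATION = "verification"  # read-only: confirms state
    RECOVERY = "recovery"          # may be destructive: humans only


@dataclass(frozen=True)
class RunbookCommand:
    runbook_id: str
    kind: StepKind
    command: str


def agent_may_execute(cmd: RunbookCommand, authorized: bool) -> bool:
    """Recovery steps are never run by the agent; verification commands
    are run only where the agent has been authorized."""
    if cmd.kind is StepKind.RECOVERY:
        return False
    return authorized


def narrate(runbook_id: str, preconditions: dict[str, bool]) -> str:
    """Build the message that points the on-call human at a runbook."""
    met = [name for name, ok in preconditions.items() if ok]
    unmet = [name for name, ok in preconditions.items() if not ok]
    return (
        f"Failure pattern matches {runbook_id}. "
        f"Preconditions met: {', '.join(met) or 'none'}. "
        f"Not met: {', '.join(unmet) or 'none'}."
    )
```

The gate denies by default: an unauthorized agent can still narrate, but it executes nothing.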

Adding a new runbook

Pick the next RB-NNN.
Use the existing runbooks as the structural template.
Add it to the index above.
Reference it from the relevant KIs and app docs.
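Picking the next RB-NNN can be automated from the IDs already in the index. The helper below is a minimal sketch; `next_runbook_id` is a hypothetical name, and it assumes that "next" means one above the highest existing number rather than filling any gap left by a retired runbook.

```python
import re


def next_runbook_id(existing_ids):
    """Return the RB-NNN that follows the highest existing runbook number."""
    numbers = [
        int(match.group(1))
        for identifier in existing_ids
        if (match := re.fullmatch(r"RB-(\d{3})", identifier))
    ]
    # Assumption: IDs are never reused, so take max + 1 instead of gap-filling.
    return f"RB-{max(numbers, default=0) + 1:03d}"
```

With the three runbooks currently indexed, `next_runbook_id(["RB-001", "RB-002", "RB-003"])` yields `RB-004`.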