Runbooks
Step-by-step recovery procedures for the highest-impact failure modes. Each runbook is structured so that an on-call engineer can follow it under stress, and so that an AI agent can retrieve it (RAG) and, where authorized, execute the verification commands.
Index
| ID | Title | Severity | Trigger |
|---|---|---|---|
| RB-001 | Caddy reverse proxy is down (E1 or O1) | critical | Customer URLs unreachable; TLS or connection-refused errors |
| RB-002 | ORA448Global VPS (O1) is destroyed or unrecoverable | catastrophic | Entire *.448.global estate gone; Vault and Authentik with it |
| RB-003 | Vault is sealed or the container will not start | critical | Vault /sys/seal-status shows sealed, or container in restart loop |
Conventions
Each runbook follows the same outline:
- Trigger / detection — how you know this is the right runbook
- Severity — blast radius and urgency
- Required access — credentials, network, tools needed
- Preconditions — what must be in place before the runbook can succeed (often references Phase-2 actions)
- Recovery steps — numbered, each with an action and a verification check
- Verification (post-recovery) — confirm full functionality
- Rollback / abort — when to stop the runbook and call for help
- Post-incident — incident-record, update KIs, lessons-learned
- Related — KI / RM / app references using stable IDs
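Because every runbook is meant to share this outline, a small lint check can flag a draft that is missing a section. The following is a minimal sketch, not part of the repo: the section names come from the list above, while the function name `missing_sections` and the matching rule (a line that starts with the section name, ignoring markdown heading markers and case) are assumptions.

```python
import re

# Section names copied from the outline above.
REQUIRED_SECTIONS = [
    "Trigger / detection",
    "Severity",
    "Required access",
    "Preconditions",
    "Recovery steps",
    "Verification (post-recovery)",
    "Rollback / abort",
    "Post-incident",
    "Related",
]


def missing_sections(runbook_text: str) -> list:
    """Return required section names that no line in the text starts with.

    Assumption: a section is "present" if some line, after stripping leading
    '#' markers and whitespace, starts with the section name (case-insensitive).
    """
    lines = [
        re.sub(r"^[#\s]+", "", line).strip().lower()
        for line in runbook_text.splitlines()
    ]
    return [
        name
        for name in REQUIRED_SECTIONS
        if not any(line.startswith(name.lower()) for line in lines)
    ]
```

Running this over a new runbook before review gives a quick list of sections still to be written; it does not judge whether the content of a section is adequate.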
ID schemes
Same as the rest of the repo:
- `RB-NNN`: runbook
- `RM-NNN`: roadmap action (`phase-2-roadmap.md`)
- `KI-NNN`: known issue (`known-issues.md`)
- `E1`, `E2`, `O1`, `E3`, `E4`, `E5`, `O2`, `O3`: server / database identifiers (`servers.md`)
- `apps/NN-<slug>.md`: per-app docs
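A cross-reference checker could classify identifiers with a few patterns. This is a sketch under stated assumptions: `NNN` is read as exactly three digits (as in `RB-001`), `NN` as two digits, and the slug as lowercase letters, digits, and hyphens. The server identifiers are limited to the ones listed above. The function name `classify_id` is hypothetical.

```python
import re
from typing import Optional

# Assumed formats: three digits for NNN, two digits for NN,
# lowercase alphanumeric-plus-hyphen slugs.
ID_PATTERNS = {
    "runbook": re.compile(r"RB-\d{3}"),
    "roadmap action": re.compile(r"RM-\d{3}"),
    "known issue": re.compile(r"KI-\d{3}"),
    "app doc": re.compile(r"apps/\d{2}-[a-z0-9-]+\.md"),
}

# Server / database identifiers exactly as listed in this document.
SERVER_IDS = {"E1", "E2", "O1", "E3", "E4", "E5", "O2", "O3"}


def classify_id(ref: str) -> Optional[str]:
    """Return the kind of stable ID that `ref` is, or None if unrecognized."""
    if ref in SERVER_IDS:
        return "server / database"
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(ref):
            return kind
    return None
```

For example, `classify_id("RB-001")` returns `"runbook"`, and a malformed reference such as `"RB-1"` returns `None`, which a checker could report as a broken link.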
How AI agents should use these
- Treat `verification` commands as read-only by default. They confirm state; they do not change it.
- Do not execute recovery steps automatically. Recovery actions can include destructive operations (re-imaging hosts, restoring backups over live data). These are gated to humans.
- Use runbooks for diagnostic narration: when the agent detects a known failure pattern, point the on-call human at the relevant `RB-NNN` and report which preconditions are met.
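The rules above amount to a simple gate in front of any agent tooling. The sketch below illustrates one way to express it. All names (`StepKind`, `RunbookStep`, `agent_may_execute`, `narrate`) are hypothetical, and it assumes each step has already been labelled as verification or recovery when the runbook was parsed.

```python
from dataclasses import dataclass
from enum import Enum


class StepKind(Enum):
    VERIFICATION = "verification"  # read-only: confirms state
    RECOVERY = "recovery"          # may be destructive: humans only


@dataclass(frozen=True)
class RunbookStep:
    runbook_id: str
    kind: StepKind
    command: str


def agent_may_execute(step: RunbookStep, agent_authorized: bool) -> bool:
    """Allow only verification steps, and only for an authorized agent."""
    return agent_authorized and step.kind is StepKind.VERIFICATION


def narrate(runbook_id: str, preconditions: dict) -> str:
    """Build the message that points the on-call human at a runbook.

    `preconditions` maps a precondition name to whether it is currently met.
    """
    met = [name for name, ok in preconditions.items() if ok]
    unmet = [name for name, ok in preconditions.items() if not ok]
    return (
        f"Suspected match: {runbook_id}. "
        f"Preconditions met: {', '.join(met) or 'none'}. "
        f"Not met: {', '.join(unmet) or 'none'}."
    )
```

Keeping the decision in one function means a recovery step is refused even when the agent is otherwise authorized, which matches the rule that recovery actions are gated to humans.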
Adding a new runbook
- Pick the next `RB-NNN`.
- Use the existing runbooks as the structural template.
- Add to the index above.
- Reference from the relevant KIs / app docs.
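Picking the next ID can be automated from the IDs already in the index. This is a minimal sketch, assuming that "next" means the highest existing number plus one, zero-padded to three digits; the function name `next_runbook_id` is hypothetical.

```python
import re


def next_runbook_id(existing_ids: list) -> str:
    """Return the next RB-NNN after the highest well-formed ID in the list.

    Entries that do not match RB- followed by three digits are ignored.
    An empty list yields RB-001.
    """
    numbers = []
    for rid in existing_ids:
        match = re.fullmatch(r"RB-(\d{3})", rid)
        if match:
            numbers.append(int(match.group(1)))
    return f"RB-{max(numbers, default=0) + 1:03d}"
```

With the three runbooks currently in the index, `next_runbook_id(["RB-001", "RB-002", "RB-003"])` returns `"RB-004"`.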