Runbooks

Step-by-step recovery procedures for the highest-impact failure modes. Each runbook is structured so that an on-call engineer can follow it under stress, and so that an AI agent can retrieve from it (RAG) and, where authorized, execute the verification commands.

Index

| ID | Title | Severity | Trigger |
|--------|-------|----------|---------|
| RB-001 | Caddy reverse proxy is down (E1 or O1) | critical | Customer URLs unreachable; TLS or connection-refused errors |
| RB-002 | ORA448Global VPS (O1) is destroyed or unrecoverable | catastrophic | Entire *.448.global estate gone; Vault and Authentik with it |
| RB-003 | Vault is sealed or the container will not start | critical | Vault /sys/seal-status shows sealed, or container in restart loop |

Conventions

Each runbook follows the same outline:

  1. Trigger / detection — how you know this is the right runbook
  2. Severity — blast radius and urgency
  3. Required access — credentials, network, tools needed
  4. Preconditions — what must be in place before the runbook can succeed (often references Phase-2 actions)
  5. Recovery steps — numbered, each with an action and a verification check
  6. Verification (post-recovery) — confirm full functionality
  7. Rollback / abort — when to stop the runbook and call for help
  8. Post-incident — incident-record, update KIs, lessons-learned
  9. Related — KI / RM / app references using stable IDs
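
Because every runbook shares this outline, a new or edited runbook can be checked mechanically. The following is a minimal sketch, not part of the repo: it assumes each section appears as a heading whose text matches the outline names above, optionally prefixed with `#` marks or a number. The function names are hypothetical.

```python
import re

# Section names copied from the outline above. That a runbook file uses
# exactly these strings as headings is an assumption of this sketch.
REQUIRED_SECTIONS = [
    "Trigger / detection",
    "Severity",
    "Required access",
    "Preconditions",
    "Recovery steps",
    "Verification (post-recovery)",
    "Rollback / abort",
    "Post-incident",
    "Related",
]


def heading_positions(text: str) -> dict[str, int]:
    """Map each required section name to the first line where it appears."""
    positions: dict[str, int] = {}
    for lineno, line in enumerate(text.splitlines()):
        # Drop leading '#', digits, dots and whitespace ("## 3. Required access").
        title = re.sub(r"^[#\d.\s]+", "", line).strip().lower()
        for name in REQUIRED_SECTIONS:
            if title == name.lower() and name not in positions:
                positions[name] = lineno
    return positions


def check_outline(text: str) -> list[str]:
    """Return a list of problems; an empty list means the outline is complete."""
    positions = heading_positions(text)
    problems = [f"missing: {n}" for n in REQUIRED_SECTIONS if n not in positions]
    present = [n for n in REQUIRED_SECTIONS if n in positions]
    for earlier, later in zip(present, present[1:]):
        if positions[earlier] > positions[later]:
            problems.append(f"out of order: {later} appears before {earlier}")
    return problems
```

A runbook that passes returns an empty list from `check_outline`; anything else is a list of human-readable problems suitable for a review comment.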

ID schemes

Runbooks use the same ID schemes as the rest of the repo:

  • RB-NNN — runbook
  • RM-NNN — roadmap action (phase-2-roadmap.md)
  • KI-NNN — known issue (known-issues.md)
  • E1, E2, O1, E3, E4, E5, O2, O3 — server / database identifiers (servers.md)
  • apps/NN-<slug>.md — per-app docs
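
Stable IDs make cross-references easy to recognise in tooling. The sketch below classifies an identifier by scheme. It assumes that `NNN` means exactly three digits, that `NN` means two digits, that slugs use lowercase letters, digits and hyphens, and that server identifiers are always `E` or `O` followed by one digit. None of these rules is stated in the repo, so treat the patterns as illustrative.

```python
import re
from typing import Optional

# Patterns are assumptions inferred from the examples in the list above.
ID_PATTERNS = {
    "runbook": re.compile(r"RB-\d{3}"),
    "roadmap action": re.compile(r"RM-\d{3}"),
    "known issue": re.compile(r"KI-\d{3}"),
    "server / database": re.compile(r"[EO]\d"),
    "app doc": re.compile(r"apps/\d{2}-[a-z0-9-]+\.md"),
}


def classify(identifier: str) -> Optional[str]:
    """Return the scheme name for an identifier, or None if it matches none."""
    for kind, pattern in ID_PATTERNS.items():
        if pattern.fullmatch(identifier):
            return kind
    return None
```

For example, `classify("RB-001")` yields `"runbook"`, while a malformed ID such as `"RB-1"` yields `None`.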

How AI agents should use these

  • Treat verification commands as read-only by default. They confirm state; they do not change it.
  • Do not execute recovery steps automatically. Recovery actions can include destructive operations (re-imaging hosts, restoring backups over live data), so they are reserved for human operators.
  • Use runbooks for diagnostic narration: when the agent detects a known failure pattern, point the on-call human at the relevant RB-NNN and report which preconditions are met.
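
The rules above amount to a small policy: verification steps may run when the agent is authorized, recovery steps never run automatically, and the agent's main output is a pointer to the right runbook plus a precondition report. One way to sketch that policy is shown below. The `Step` type, the function names, and the precondition labels in the usage note are all hypothetical. They are not an interface defined by this repo.

```python
from dataclasses import dataclass
from enum import Enum


class StepKind(Enum):
    VERIFICATION = "verification"  # read-only: confirms state
    RECOVERY = "recovery"          # may be destructive: humans only


@dataclass(frozen=True)
class Step:
    runbook_id: str
    kind: StepKind
    command: str


def agent_may_execute(step: Step, authorized: bool = False) -> bool:
    """Allow only verification steps, and only when the agent is authorized."""
    return authorized and step.kind is StepKind.VERIFICATION


def narrate(runbook_id: str, preconditions: dict[str, bool]) -> str:
    """Build the message that points the on-call human at a runbook."""
    met = [name for name, ok in preconditions.items() if ok]
    unmet = [name for name, ok in preconditions.items() if not ok]
    return (
        f"See {runbook_id}. "
        f"Preconditions met: {', '.join(met) or 'none'}. "
        f"Not met: {', '.join(unmet) or 'none'}."
    )
```

With illustrative precondition labels, `narrate("RB-003", {"precondition A": True, "precondition B": False})` returns `"See RB-003. Preconditions met: precondition A. Not met: precondition B."`. Defaulting `authorized` to `False` keeps the agent passive unless someone opts in explicitly.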

Adding a new runbook

  1. Pick the next RB-NNN.
  2. Use the existing runbooks as the structural template.
  3. Add it to the index above.
  4. Reference it from the relevant KIs and app docs.
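
Step 1 can be automated if "next" is taken to mean the highest existing number plus one, zero-padded to three digits. That interpretation is an assumption, as is the hypothetical helper name below.

```python
import re


def next_runbook_id(existing: list[str]) -> str:
    """Return the next RB-NNN after the highest well-formed ID in `existing`."""
    numbers = [
        int(match.group(1))
        for identifier in existing
        if (match := re.fullmatch(r"RB-(\d{3})", identifier))
    ]
    # Start at RB-001 when no runbooks exist yet; malformed IDs are ignored.
    return f"RB-{max(numbers, default=0) + 1:03d}"
```

Given the three runbooks in the index, `next_runbook_id(["RB-001", "RB-002", "RB-003"])` returns `"RB-004"`.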