# RB-001 — Caddy reverse proxy is down (E1 or O1)
Use this runbook when: customer or internal URLs return TLS errors, "connection refused", or "site can't be reached", but the backend apps appear healthy.
```yaml
id: RB-001
title: Caddy reverse proxy recovery (E1 or O1)
severity: critical
estimated_duration: 20-60 minutes
servers_affected_e1: [E1]   # if E1 down: subdomain hostnames
servers_affected_o1: [O1]   # if O1 down: entire *.448.global estate
related_kis: [KI-001, KI-014]
related_actions: [RM-006, RM-007, RM-019, RM-020]
```
## Trigger / detection
Any of:
- External uptime check (RM-038) alerts on multiple URLs simultaneously.
- curl -sI https://<host> returns "connection refused" or a TLS handshake failure.
- OCI console shows the Caddy host stopped, paused, or being maintained.
- A scheduled OCI maintenance event was announced and we missed the prep window (this is what triggered the previous occurrence — see KI-001).
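A quick external triage that distinguishes these cases, assuming only `curl` on your laptop (the hostname below is one example from the proxy map):

```bash
# Distinguish "refused" from a TLS failure from a dead host:
curl -sI -m 5 https://git.projecteidos.com/ ; echo "exit=$?"
# exit=0  → proxy answered (check the status line)
# exit=7  → connection refused (Caddy not listening)
# exit=28 → timeout (host or network path down)
# exit=35 → TLS handshake failure (Caddy up, cert state broken)
```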
## Severity
| Server down | Blast radius |
|---|---|
| E1 (Caddy at 140.238.97.163) | All subdomain hostnames go dark: GitLab, Dokploy admin, Teams Bot, the 3 CRMs, the 3 APEX-PE URLs, the 2 workforce-tenant URLs. WordPress apex domains keep working (they bypass E1). |
| O1 (Caddy at 140.238.90.91) | All `*.448.global` hostnames go dark, and Vault and Authentik with them — cascading: any internal app using SSO can no longer authenticate new logins. |
## Required access
- Tailscale membership in the company tailnet (for SSH to the host)
- SSH key authorized on the affected host
- GitLab read access to `internal/engineering` (for Caddyfile sync)
- OCI console login (for instance status / restart)
- (For break-glass) OCI console-connection key
## Preconditions
| Item | Status before runbook can fully succeed |
|---|---|
| Caddyfile in Git (RM-006 / RM-007) | Without this, only the "Fallback: no Git source" path below is available. |
| Tailscale on company account (RM-019) | Until then, the admin path is via personal Tailscale or public SSH on E1. |
| OCI block-volume snapshots configured | Provides a last-resort fallback if the Caddy data dir is corrupted. |
## Recovery steps
### Step 1 — Identify which Caddy is affected
Match the failing hostnames against the proxy map in `infra/proxies.md`:
| Symptom | Affected proxy |
|---|---|
| `crm.eidos-global.com`, `bot.projecteidos.com`, `git.projecteidos.com`, `platform.projecteidos.com`, etc. failing | E1 Caddy |
| `vault.448.global`, `auth.448.global`, `s3.448.global`, etc. failing | O1 Caddy |
| WordPress apex domains (`eidos-global.com`, `tneconnect.app`, `projecteidos.com`) failing | E2 Traefik (different scope; not this runbook) |
If both estates are down simultaneously, treat as separate incidents in parallel — same procedure, different host.
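From a checkout of `internal/engineering`, a one-liner look-up against the proxy map (the hostname is an example):

```bash
# Which proxy fronts this hostname? Check the source-of-truth map:
grep -n 'crm.eidos-global.com' infra/proxies.md
```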
### Step 2 — Verify the host is up at the OCI level
```bash
# Substitute <instance-ocid> with the affected server's OCID from infra/servers.md.
oci compute instance get \
  --instance-id <instance-ocid> \
  --query 'data."lifecycle-state"'
# Expect: "RUNNING"
```
If not RUNNING:
```bash
oci compute instance action --action START --instance-id <instance-ocid>
# Wait ~30s, then re-check lifecycle-state
```
If the instance is STOPPED due to OCI maintenance, that may be the root cause — Caddy itself is fine but the host wasn't running. Once it boots, Caddy should auto-start (assuming `systemctl enable caddy` was set).
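To confirm that assumption once you can reach the host (Step 3):

```bash
systemctl is-enabled caddy          # expect: enabled
sudo systemctl enable --now caddy   # set it and start Caddy if not
```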
### Step 3 — Reach the host
```bash
# Via Tailscale (preferred):
ssh ubuntu@<host-tailnet-name>

# Or via public IP if the target is E1 and Tailscale is not yet on it (KI-014):
ssh -i <key> ubuntu@140.238.97.163
```
If neither works, see Break-glass access at the bottom of this runbook.
### Step 4 — Determine the failure mode
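The runbook assumes systemd manages Caddy; pull the recent journal to match against the table below:

```bash
# Last 100 Caddy log lines; errors usually cluster at the bottom:
sudo journalctl -u caddy -n 100 --no-pager
# Or follow live while you restart in Step 5:
sudo journalctl -u caddy -f
```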
Common patterns:
| Pattern in journal | Likely cause | Go to |
|---|---|---|
| "open /etc/caddy/Caddyfile: no such file" | Caddyfile missing (host re-imaged?) | Step 5 |
| "adapting config using caddyfile: ..." (parse error) | Caddyfile corrupted | Step 5 |
| "loading certificates: ..." | TLS state corrupted or LE rate-limited | Step 6 |
| "could not connect to backend ..." | Caddy is fine; an upstream is down | Out of scope of this runbook — investigate the upstream |
### Step 5 — Restore Caddyfile from Git
```bash
# On the affected host:
cd /tmp
git clone --depth=1 \
  https://<git-readonly-creds>@git.projecteidos.com/internal/engineering.git

# Pick the right Caddyfile:
sudo cp engineering/infra/caddy/E1.Caddyfile /etc/caddy/Caddyfile
# Or for O1: sudo cp engineering/infra/caddy/O1.Caddyfile /etc/caddy/Caddyfile

sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl restart caddy
sudo systemctl status caddy   # confirm 'active (running)'
```
### Step 6 — Verify TLS state
```bash
sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/acme*/ 2>/dev/null
# Should list valid Let's Encrypt cert directories.
```
If certs are missing, Caddy will re-issue them on the first request for each hostname. Watch the Let's Encrypt rate limit (50 certificates per registered domain per week). If many hosts share the same domain, spread the requests — `wget --quiet --spider https://<host>/` per hostname with a 30 s gap, as in the sketch below.
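A minimal warm-up sketch — the hostnames are examples; take the real list for the affected proxy from `infra/proxies.md`:

```bash
# Trigger cert issuance one hostname at a time to stay under the LE limit:
for host in crm.eidos-global.com bot.projecteidos.com git.projecteidos.com; do
  wget --quiet --spider "https://$host/" || echo "warm-up failed: $host"
  sleep 30   # pace requests so issuance is spread out
done
```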
### Step 7 — External verification
From a host outside the affected server (your laptop, a different OCI instance):
```bash
for host in <list of hostnames the proxy fronts>; do
  printf '%-40s ' "$host"
  curl -sI -m 5 "https://$host/" | head -1
done
# Every hostname should return an HTTP status line (200/301/302/403/404/etc.).
# "Connection refused" or no output means it's still down.
```
## Verification (post-recovery)
- Every hostname in `infra/proxies.md` for the affected Caddy returns an HTTP status line.
- TLS certs present: a browser shows the Let's Encrypt issuer (not Caddy's self-signed default).
- `journalctl -u caddy --since "10 min ago"` is clean of error messages.
- Beszel agent on the host is reporting (if RM-037 has landed).
- Affected SSO-dependent apps can authenticate (only relevant if O1 was affected — Authentik went down with it).
## Rollback / abort
Abort if any of:

- The Git copy of the Caddyfile is older than the one on the host and applying it would break currently-working routes you can't recover.
- Caddy starts but TLS issuance fails repeatedly — likely a DNS API token or rate-limit issue. Investigate before retrying.
- A Vault outage cascades from this (only relevant for O1) — Vault-recovery work via RB-003 takes priority.
To abort cleanly:
```bash
sudo systemctl stop caddy
# Restore the previous /etc/caddy/Caddyfile from a manual backup if you have one,
# or leave Caddy stopped and escalate.
```
## Break-glass access (Tailscale unreachable, SSH key unavailable)
OCI Console → Compute → Instance → "Console connection" → "Create local console connection" or "Create VNC console connection". This bypasses SSH entirely; you authenticate with the OCI console user and a key registered to the console-connection feature. Useful when the network path to the host itself is broken.
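The same thing from the OCI CLI, if the console UI is unavailable (verify flags against your CLI version; the key path is an example):

```bash
# Create a serial console connection for the affected instance:
oci compute instance-console-connection create \
  --instance-id <instance-ocid> \
  --ssh-public-key-file ~/.ssh/oci_console_key.pub
# The response includes the SSH connection string for the serial console.
```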
## Fallback: no Git source
If RM-006 / RM-007 has not yet landed, the Caddyfile lives only on the affected host. Recovery options:
- OCI block-volume snapshot: restore the volume from a snapshot to a new instance, then `scp` the Caddyfile across.
- Hand-rebuild from the proxy map: use `infra/proxies.md` as the source-of-truth list of every hostname → backend mapping (see the sketch below). This is exactly what we did during the previous outage; expect 30-60 minutes of careful editing per dozen hostnames.
- Restore from any developer's local clone if anyone has worked on the file recently and still has a copy.
This is the path RM-006/007 is designed to make obsolete.
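For the hand-rebuild path, a minimal sketch of the site-block shape — the backend addresses here are placeholders; every real hostname → backend pair comes from `infra/proxies.md`:

```Caddyfile
# One site block per hostname; Caddy obtains LE certs automatically.
git.projecteidos.com {
    reverse_proxy 10.0.0.12:8080   # placeholder backend
}

crm.eidos-global.com {
    reverse_proxy 10.0.0.15:3000   # placeholder backend
}
```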
## Post-incident
- Create `infra/incidents/YYYY-MM-DD-caddy-down.md` with: timeline, root cause, customer impact, what worked / what didn't.
- If the trigger was OCI maintenance, ensure someone is registered on the OCI maintenance-notification list.
- If the incident exposed any gap not yet in `known-issues.md`, add a new `KI-NNN` and a corresponding `RM-NNN` in `phase-2-roadmap.md`.
- If recovery took longer than estimated, refine this runbook.
## Related
- KI-001 — Caddyfile not in Git
- KI-014 — E1 SSH still public
- RM-006 / RM-007 — the source-control fix
- RM-019 / RM-020 — admin-access modernization
- `proxies.md` — full hostname-to-server map
- `servers.md` — E1 / O1