# RB-001 — Caddy reverse proxy is down (E1 or O1)
Use this runbook when: customer or internal URLs return TLS errors, "connection refused", or "site can't be reached", but the backend apps appear healthy.
```yaml
id: RB-001
title: Caddy reverse proxy recovery (E1 or O1)
severity: critical
estimated_duration: 20-60 minutes
servers_affected_e1: [E1]   # if E1 down: subdomain hostnames
servers_affected_o1: [O1]   # if O1 down: entire *.448.global estate
related_kis: [KI-001, KI-014]
related_actions: [RM-006, RM-007, RM-019, RM-020]
```
## Trigger / detection
Any of:
- External uptime check (RM-038) alerts on multiple URLs simultaneously.
- curl -sI https://<host> returns "connection refused" or a TLS handshake failure.
- OCI console shows the Caddy host stopped, paused, or being maintained.
- A scheduled OCI maintenance event was announced and we missed the prep window (this is what triggered the previous occurrence — see KI-001).
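A quick external triage that distinguishes these cases, assuming only `curl` on your laptop (the hostname below is one example from the proxy map):

```bash
# Distinguish "refused" from a TLS failure from a dead host:
curl -sI -m 5 https://git.projecteidos.com/ ; echo "exit=$?"
# exit=0  → proxy answered (check the status line)
# exit=7  → connection refused (Caddy not listening)
# exit=28 → timeout (host or network path down)
# exit=35 → TLS handshake failure (Caddy up, cert state broken)
```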
## Severity
| Server down | Blast radius |
|---|---|
| E1 (Caddy at 140.238.97.163) | All subdomain hostnames go dark: GitLab, Dokploy admin, Teams Bot, the 3 CRMs, the 3 APEX-PE URLs, the 2 workforce-tenant URLs. WordPress apex domains keep working (they bypass E1). |
| O1 (Caddy at 140.238.90.91) | All `*.448.global` hostnames go dark, and Vault and Authentik with them — cascading: any internal app using SSO can no longer authenticate new logins. |
## Required access
- Tailscale membership in the company tailnet (for SSH to the host)
- SSH key authorized on the affected host
- GitLab read access to `internal/engineering` (for Caddyfile sync)
- OCI console login (for instance status / restart)
- (For break-glass) OCI console-connection key
## Preconditions
| Item | Status before runbook can fully succeed |
|---|---|
| Caddyfile in Git (RM-006 / RM-007) | Without this, only the "Fallback: no Git source" path below is available. |
| Tailscale on company account (RM-019) | Until then, the admin path is via personal Tailscale or public SSH on E1. |
| OCI block-volume snapshots configured | Provides a last-resort fallback if the Caddy data dir is corrupted. |
## Recovery steps
### Step 1 — Identify which Caddy is affected
Match the failing hostnames against the proxy map in `infra/proxies.md`:
| Symptom | Affected proxy |
|---|---|
| `crm.eidos-global.com`, `bot.projecteidos.com`, `git.projecteidos.com`, `platform.projecteidos.com`, etc. failing | E1 Caddy |
| `vault.448.global`, `auth.448.global`, `s3.448.global`, etc. failing | O1 Caddy |
| WordPress apex domains (`eidos-global.com`, `tneconnect.app`, `projecteidos.com`) failing | E2 Traefik (different scope; not this runbook) |
If both estates are down simultaneously, treat as separate incidents in parallel — same procedure, different host.
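From a checkout of `internal/engineering`, a one-liner look-up against the proxy map (the hostname is an example):

```bash
# Which proxy fronts this hostname? Check the source-of-truth map:
grep -n 'crm.eidos-global.com' infra/proxies.md
```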
### Step 2 — Verify the host is up at the OCI level
```bash
# Substitute <instance-ocid> with the affected server's OCID from infra/servers.md.
oci compute instance get \
  --instance-id <instance-ocid> \
  --query 'data."lifecycle-state"'
# Expect: "RUNNING"
```
If not RUNNING:
```bash
oci compute instance action --action START --instance-id <instance-ocid>
# Wait ~30s, then re-check lifecycle-state
```
If the instance is STOPPED due to OCI maintenance, that may be the root cause — Caddy itself is fine but the host wasn't running. Once it boots, Caddy should auto-start (assuming `systemctl enable caddy` was set).
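To confirm that assumption once you can reach the host (Step 3):

```bash
systemctl is-enabled caddy          # expect: enabled
sudo systemctl enable --now caddy   # set it and start Caddy if not
```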
### Step 3 — Reach the host
```bash
# Via Tailscale (preferred):
ssh ubuntu@<host-tailnet-name>

# Or via public IP if the target is E1 and Tailscale is not yet on it (KI-014):
ssh -i <key> ubuntu@140.238.97.163
```
If neither works, see Break-glass access at the bottom of this runbook.
### Step 4 — Determine the failure mode
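The runbook assumes systemd manages Caddy; pull the recent journal to match against the table below:

```bash
# Last 100 Caddy log lines; errors usually cluster at the bottom:
sudo journalctl -u caddy -n 100 --no-pager
# Or follow live while you restart in Step 5:
sudo journalctl -u caddy -f
```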
Common patterns:
| Pattern in journal | Likely cause | Go to |
|---|---|---|
| "open /etc/caddy/Caddyfile: no such file" | Caddyfile missing (host re-imaged?) | Step 5 |
| "adapting config using caddyfile: ..." (parse error) | Caddyfile corrupted | Step 5 |
| "loading certificates: ..." | TLS state corrupted or LE rate-limited | Step 6 |
| "could not connect to backend ..." | Caddy is fine; an upstream is down | Out of scope of this runbook — investigate the upstream |
### Step 5 — Restore Caddyfile from Git
```bash
# On the affected host:
cd /tmp
git clone --depth=1 \
  https://<git-readonly-creds>@git.projecteidos.com/internal/engineering.git

# Pick the right Caddyfile:
sudo cp engineering/infra/caddy/E1.Caddyfile /etc/caddy/Caddyfile
# Or for O1: sudo cp engineering/infra/caddy/O1.Caddyfile /etc/caddy/Caddyfile

sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl restart caddy
sudo systemctl status caddy   # confirm 'active (running)'
```
### Step 6 — Verify TLS state
```bash
sudo ls -la /var/lib/caddy/.local/share/caddy/certificates/acme*/ 2>/dev/null
# Should list valid Let's Encrypt cert directories.
```
If certs are missing, Caddy will re-issue them on the first request for each hostname. Watch the Let's Encrypt rate limit (50 certificates per registered domain per week). If many hosts share the same domain, spread the requests — `wget --quiet --spider https://<host>/` per hostname with a 30 s gap, as in the sketch below.
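A minimal warm-up sketch — the hostnames are examples; take the real list for the affected proxy from `infra/proxies.md`:

```bash
# Trigger cert issuance one hostname at a time to stay under the LE limit:
for host in crm.eidos-global.com bot.projecteidos.com git.projecteidos.com; do
  wget --quiet --spider "https://$host/" || echo "warm-up failed: $host"
  sleep 30   # pace requests so issuance is spread out
done
```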
### Step 7 — External verification
From a host outside the affected server (your laptop, a different OCI instance):
```bash
for host in <list of hostnames the proxy fronts>; do
  printf '%-40s ' "$host"
  curl -sI -m 5 "https://$host/" | head -1
done
# Every hostname should return an HTTP status line (200/301/302/403/404/etc.).
# "Connection refused" or no output means it's still down.
```
## Verification (post-recovery)
- Every hostname in `infra/proxies.md` for the affected Caddy returns an HTTP status line.
- TLS certs present: a browser shows the Let's Encrypt issuer (not Caddy's self-signed default).
- `journalctl -u caddy --since "10 min ago"` is clean of error messages.
- Beszel agent on the host is reporting (if RM-037 has landed).
- Affected SSO-dependent apps can authenticate (only relevant if O1 was affected — Authentik went down with it).
## Rollback / abort
Abort if any of:

- The Git copy of the Caddyfile is older than the one on the host and applying it would break currently-working routes you can't recover.
- Caddy starts but TLS issuance fails repeatedly — likely a DNS API token or rate-limit issue. Investigate before retrying.
- A Vault outage cascades from this (only relevant for O1) — Vault-recovery work via RB-003 takes priority.
To abort cleanly:
```bash
sudo systemctl stop caddy
# Restore the previous /etc/caddy/Caddyfile from a manual backup if you have one,
# or leave Caddy stopped and escalate.
```
## Break-glass access (Tailscale unreachable, SSH key unavailable)
OCI Console → Compute → Instance → "Console connection" → "Create local console connection" or "Create VNC console connection". This bypasses SSH entirely; you authenticate with the OCI console user and a key registered to the console-connection feature. Useful when the network path to the host itself is broken.
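The same thing from the OCI CLI, if the console UI is unavailable (verify flags against your CLI version; the key path is an example):

```bash
# Create a serial console connection for the affected instance:
oci compute instance-console-connection create \
  --instance-id <instance-ocid> \
  --ssh-public-key-file ~/.ssh/oci_console_key.pub
# The response includes the SSH connection string for the serial console.
```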
## Fallback: no Git source
If RM-006 / RM-007 has not yet landed, the Caddyfile lives only on the affected host. Recovery options:
- OCI block-volume snapshot: restore the volume from a snapshot to a new instance, then `scp` the Caddyfile across.
- Hand-rebuild from the proxy map: use `infra/proxies.md` as the source-of-truth list of every hostname → backend mapping (see the sketch below). This is exactly what we did during the previous outage; expect 30-60 minutes of careful editing per dozen hostnames.
- Restore from any developer's local clone if anyone has worked on the file recently and still has a copy.
This is the path RM-006/007 is designed to make obsolete.
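For the hand-rebuild path, a minimal sketch of the site-block shape — the backend addresses here are placeholders; every real hostname → backend pair comes from `infra/proxies.md`:

```Caddyfile
# One site block per hostname; Caddy obtains LE certs automatically.
git.projecteidos.com {
    reverse_proxy 10.0.0.12:8080   # placeholder backend
}

crm.eidos-global.com {
    reverse_proxy 10.0.0.15:3000   # placeholder backend
}
```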
## Post-incident
- Create `infra/incidents/YYYY-MM-DD-caddy-down.md` with: timeline, root cause, customer impact, what worked / what didn't.
- If the trigger was OCI maintenance, ensure someone is registered on the OCI maintenance-notification list.
- If the incident exposed any gap not yet in `known-issues.md`, add a new `KI-NNN` and a corresponding `RM-NNN` in `phase-2-roadmap.md`.
- If recovery took longer than estimated, refine this runbook.
## Related
- KI-001 — Caddyfile not in Git
- KI-014 — E1 SSH still public
- RM-006 / RM-007 — the source-control fix
- RM-019 / RM-020 — admin-access modernization
- `proxies.md` — full hostname-to-server map
- `servers.md` — E1 / O1