RB-002 — ORA448Global VPS (O1) is destroyed or unrecoverable¶
Use this runbook when: the O1 instance is gone — terminated, disk failure, ransomware, or otherwise unrecoverable. This is the highest blast radius scenario in the estate today: ~15 internal apps including Vault (secrets) and Authentik (SSO) are on this single host.
id: RB-002
title: O1 disaster recovery (catastrophic loss of *.448.global estate)
severity: catastrophic
estimated_duration: 4-12 hours (heavily dependent on which Phase-2 actions have landed)
servers_affected: [O1]
apps_affected: ["14", "15", "17", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29"]
related_kis: [KI-002, KI-003, KI-013, KI-015, KI-016, KI-017]
related_actions: [RM-007, RM-013, RM-014, RM-016, RM-017, RM-018]
Trigger / detection¶
- The O1 OCI compute instance is in a non-recoverable state (terminated, unbootable, a disk failure that a block-volume snapshot can't fix).
- All *.448.global URLs return connection refused / DNS failure.
- All apps that depend on Authentik (GitLab, Vault clients, etc.) cannot authenticate new logins.
Severity¶
Catastrophic. Direct impact:
- ~15 internal applications offline: Authentik, Vault, MinIO, Portainer, Beszel, Gotify, Wireguard, Coder, n8n, Open WebUI, Draw.io, IT Tools, PE Tube, Watchtower, the SQLcl image.
- Cascading: SSO-integrated apps (GitLab, others) cannot authenticate new logins until Authentik is restored.
- Cascading: any app that reads secrets at startup will fail to start until Vault is restored.
- The custom SQLcl Docker image and n8n CI/CD pipelines are gone unless they were backed up off-host.
This runbook is a disaster-recovery playbook, not a hot-restart procedure. Expect hours, not minutes. Bring a notepad.
Required access¶
- OCI tenancy owner / root on ORA448Global — Adam Pitt-Stanley (or Vishnu Kant as additional admin)
- OCI block-volume snapshot of O1 (if exists)
- Vault Raft snapshot (RM-013) and unseal-key shares (≥3 of 5 holders available)
- Authentik backup (RM-014): Postgres dump + media volume + secret key
- n8n backup (RM-016) including N8N_ENCRYPTION_KEY (in Vault — chicken-and-egg)
- OCI bucket access for retrieving backups
- GitLab access for Caddyfile + Authentik blueprint + n8n workflow exports + SQLcl Dockerfile
- Microsoft 365 admin on the relevant tenant — for Authentik's upstream identity restore
- DNS write access at GoDaddy — only if the new instance gets a different public IP
Preconditions¶
Any unmet precondition means partial or impossible recovery. Check each before starting:
| Precondition | Status if missing | Phase-2 action |
|---|---|---|
| OCI block-volume snapshot of O1 | If missing, recovery starts from scratch | (snapshot policy on servers.md) |
| Vault Raft snapshot off-host | Tier-0 — without it, every secret is unrecoverable | RM-013 |
| Authentik backup off-host | Without it, SSO must be rebuilt by hand from the OIDC client list | RM-014 |
| n8n workflow exports in Git | Without it, every CI/CD workflow is gone | RM-010 / RM-016 |
| Caddyfile in Git | Without it, hand-rebuild from proxies.md | RM-007 |
| Custom SQLcl Dockerfile in registry | Without it, image must be rebuilt from scratch | RM-008 |
| Authentik blueprint in Git | Without it, OIDC client config rebuilt by hand | RM-011 |
| Vault unseal-key custodians reachable | If <3 reachable, see RB-003 | RM-013 |
Today (2026-05) most of these preconditions are NOT met. Running this runbook successfully depends on Wave 1 of the improvement plan having landed first. Until it has, this runbook is partly aspirational — but it shows the team exactly what's at stake and what's needed.
Recovery steps¶
Step 1 — Provision a new O1 instance¶
# In OCI console (or via CLI), create a new compute instance in ORA448Global:
# Shape: VM.Standard.A1.Flex (Always Free; 4 OCPU / 24 GB RAM, the full free-tier headroom)
# Region: uk-london-1
# OS: same Ubuntu LTS as the old O1
# VCN: same as old O1, or fresh VCN if the old one is gone
# Public IP: assign a new one
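The same launch via the CLI is sketched below; the OCIDs, availability domain, and SSH key path are placeholders to fill in from the tenancy, not recorded values.
# Illustrative CLI equivalent; substitute real OCIDs/AD for ORA448Global:
oci compute instance launch \
  --compartment-id <compartment-ocid> \
  --availability-domain <ad-name> \
  --shape VM.Standard.A1.Flex \
  --shape-config '{"ocpus": 4, "memoryInGBs": 24}' \
  --image-id <ubuntu-lts-image-ocid> \
  --subnet-id <subnet-ocid> \
  --assign-public-ip true \
  --ssh-authorized-keys-file ~/.ssh/id_ed25519.pub \
  --display-name o1-rebuild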
Capture the new public IP. If the old O1's address was a reserved public IP that can be re-attached to the new instance (rare), you can keep DNS unchanged; otherwise proceed to Step 2.
Step 2 — Update DNS for *.448.global¶
If the new instance has a different public IP than the old O1:
# At GoDaddy, update A records for every *.448.global subdomain
# (15 subdomains, see infra/servers.md O1 section).
# Until DNS propagates (TTL is typically 600s = 10 min), URLs continue
# to resolve to the dead host.
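A quick propagation check once the records are updated (the hostname below is illustrative; use any O1-fronted subdomain):
# Both resolvers should return the new public IP before you rely on the URLs:
dig +short vault.448.global @1.1.1.1
dig +short vault.448.global @8.8.8.8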
Step 3 — Base setup¶
# SSH in (expect a new host-key warning: the old host key is gone, so the new instance presents a fresh one)
ssh ubuntu@<new-public-ip>
sudo apt update && sudo apt -y full-upgrade
sudo apt install -y docker.io docker-compose-v2 caddy
sudo systemctl enable --now docker
Step 4 — Restore Caddy config¶
Same as RB-001 Step 5:
cd /tmp
git clone --depth=1 https://<git-creds>@git.projecteidos.com/internal/engineering.git
sudo cp engineering/infra/caddy/O1.Caddyfile /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl restart caddy
Caddy will start trying to issue Let's Encrypt certs immediately. Mind the rate limit — 15 hostnames issuing simultaneously may hit it. Stagger if needed.
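To watch issuance and spot rate-limit errors early (assuming Caddy runs under systemd, as installed in Step 3), something like:
# Follow Caddy's log and look for failed challenges or rate-limit messages:
sudo journalctl -u caddy -f | grep -iE 'certificate|rate limit|error'
# If you do hit the limit, comment out batches of hostnames in the Caddyfile
# and reload between batches:
sudo systemctl reload caddy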
Step 5 — Restore Vault (Tier-0 — do this before anything else)¶
# Pull the latest Vault Raft snapshot from the OCI bucket:
oci os object get \
--bucket-name <bucket> \
--name vault-snapshots/latest.snap \
--file /tmp/vault.snap
# Run a Vault container: mount a fresh data volume plus the server config, and start it:
docker run -d --name vault --cap-add IPC_LOCK \
  -v /opt/vault/data:/vault/data \
  -v /opt/vault/vault.hcl:/etc/vault.hcl:ro \
  -p 8200:8200 \
  hashicorp/vault:<version> \
  server -config=/etc/vault.hcl
# Initialize the fresh Vault (vault operator init), unseal it with the temporary
# keys it prints, then restore the snapshot. A snapshot taken from the old
# cluster may need the -force flag; after the restore Vault reseals and requires
# the ORIGINAL key shares:
vault operator raft snapshot restore /tmp/vault.snap
Then unseal Vault. Requires 3 of 5 unseal-key holders to participate. If you can't reach 3, jump to RB-003.
vault operator unseal <key-share-1>
vault operator unseal <key-share-2>
vault operator unseal <key-share-3>
vault status # Sealed: false
Once Vault is unsealed, the rest of the recovery has access to credentials.
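A quick sanity check at this point, assuming you can authenticate with a token that is valid on the restored cluster (e.g. the original root token):
export VAULT_ADDR=http://127.0.0.1:8200   # adjust scheme/port to match vault.hcl
vault login                        # paste a token from the restored cluster
vault operator raft list-peers     # the new node should appear as leader
vault kv list secret/              # pre-incident paths (authentik, minio, ...) should be listed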
Step 6 — Restore Authentik¶
# Pull latest Authentik backup from OCI bucket
oci os object get --bucket-name <bucket> \
--name authentik-backups/latest-pg_dump.sql.gz \
--file /tmp/authentik-pg.sql.gz
oci os object get --bucket-name <bucket> \
--name authentik-backups/latest-media.tar.gz \
--file /tmp/authentik-media.tar.gz
# Retrieve the Authentik secret key from Vault (it is NOT in the Postgres dump):
export AUTHENTIK_SECRET_KEY=$(vault kv get -field=secret_key secret/authentik/runtime)
# Stand up Authentik's Postgres + Redis containers first (docker-compose
# from the engineering repo: infra/authentik/docker-compose.yml), then
# restore the DB into the running Postgres:
gunzip < /tmp/authentik-pg.sql.gz | docker exec -i authentik-postgres psql -U authentik
# Bring up the server/worker containers, then untar media into the volume:
docker compose up -d
docker exec authentik-server tar xzf - -C /media < /tmp/authentik-media.tar.gz
Verify:
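A minimal check, assuming Authentik's stock health endpoints are reachable through Caddy; the auth.448.global hostname is illustrative:
curl -s -m 5 -o /dev/null -w '%{http_code}\n' https://auth.448.global/-/health/live/    # expect a 2xx
curl -s -m 5 -o /dev/null -w '%{http_code}\n' https://auth.448.global/-/health/ready/   # expect a 2xx
# Then log in to the admin UI and confirm users, providers and applications are present.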
If Authentik is up but the OIDC client config differs from before (e.g. an older blueprint), apply the latest blueprint from the engineering repo.
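A sketch, assuming the blueprint lives at infra/authentik/blueprint.yaml in the repo (path illustrative), the worker container is named authentik-worker, and the container keeps the default /blueprints discovery directory:
# Copy the blueprint into the discovery directory; authentik picks up and
# applies files found under /blueprints:
docker cp engineering/infra/authentik/blueprint.yaml authentik-server:/blueprints/custom/
# Confirm in the admin UI (Customisation > Blueprints) that it shows as applied,
# or restart the worker to force a discovery pass:
docker restart authentik-worker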
Step 7 — Restore MinIO¶
# MinIO is the most likely thing to have lost data IF object storage
# was on local disk on O1 (vs an external block volume that survived).
# If a separate disk attached to O1 survived (block volume snapshot exists),
# attach to new O1 and mount.
# Otherwise: data loss for whatever was in MinIO. Confirm impact before proceeding.
docker run -d --name minio \
-v /opt/minio/data:/data \
-p 9000:9000 -p 9001:9001 \
-e "MINIO_ROOT_USER=$(vault kv get -field=root_user secret/minio/root)" \
-e "MINIO_ROOT_PASSWORD=$(vault kv get -field=root_pw secret/minio/root)" \
minio/minio server /data --console-address ":9001"
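A minimal post-start check with the MinIO client (the alias name is arbitrary):
# Point mc at the new instance using the same root credentials from Vault:
mc alias set o1minio http://127.0.0.1:9000 \
  "$(vault kv get -field=root_user secret/minio/root)" \
  "$(vault kv get -field=root_pw secret/minio/root)"
mc admin info o1minio   # server should report as online
mc ls o1minio           # buckets present only if the block volume survived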
Step 8 — Restore the rest of the apps¶
In rough dependency order (the ones other things depend on first):
- Wireguard / WG-Easy — peers will need re-issued configs unless server private key was backed up to Vault. Bring up; admins re-add devices.
- Portainer — fresh install; reconnect to the local Docker socket.
- Beszel — fresh install; agents on each host need re-registering.
- Gotify — fresh install; tokens for sources (Beszel, Watchtower) re-issued.
- Watchtower — restart; will manage container updates.
- n8n — restore the Postgres dump and apply N8N_ENCRYPTION_KEY from Vault; workflow exports re-imported from Git (see the sketch after this list).
- Coder — fresh; users re-create workspaces.
- Open WebUI, Draw.io, IT Tools, PE Tube — fresh containers; data restored where available.
- Custom SQLcl image — pull from GitLab Container Registry; redeploy on user-defined Docker network with stable alias (per RM-008 / RM-009).
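A sketch of the n8n restore referenced above, assuming the RM-016 backup layout; the object names, container names, compose path, and Vault field are illustrative, not recorded values:
# Pull the n8n Postgres dump from the OCI bucket and restore it:
oci os object get --bucket-name <bucket> \
  --name n8n-backups/latest-pg_dump.sql.gz --file /tmp/n8n-pg.sql.gz
gunzip < /tmp/n8n-pg.sql.gz | docker exec -i n8n-postgres psql -U n8n
# The encryption key must match the old instance, or stored credentials are unreadable:
export N8N_ENCRYPTION_KEY=$(vault kv get -field=encryption_key secret/n8n/runtime)
docker compose -f engineering/infra/n8n/docker-compose.yml up -d
# Re-import the workflow exports kept in Git (directory path illustrative):
docker exec n8n n8n import:workflow --separate --input=/backups/workflows/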
For each app, confirm:
docker ps --filter "name=<app>" --format '{{.Status}}'
# Expect "Up <time> (healthy)" or "Up <time>"
curl -sI https://<app>.448.global -m 5 | head -1
Verification (post-recovery)¶
- Every *.448.global URL returns the expected status (matches pre-incident state); see the loop sketch after this list.
- An end-to-end SSO test succeeds: log in to GitLab via Authentik via M365.
- An app reads a secret from Vault successfully (e.g. vault kv get secret/<app>/<key>).
- Beszel agents on E1 + E2 + O1 are reporting.
- n8n executes a smoke-test workflow without "SQLcl unreachable" errors.
- Watchtower has run at least once and reported via Gotify.
- An external uptime check (RM-038) shows green for all O1-fronted hostnames.
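For the first check, a loop along these lines works; the subdomain list is illustrative, take the authoritative list from infra/servers.md:
# Expect each hostname's normal 2xx/3xx status line, not connection errors:
for app in vault auth minio portainer beszel gotify coder n8n drawio; do
  printf '%-12s' "$app"
  curl -sI -m 5 "https://${app}.448.global" | head -1
done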
Rollback / abort¶
There is no clean rollback once the new O1 is live and DNS has been flipped. If recovery is failing partway:
- Pause and escalate. This is a major incident; convening incident-comms is appropriate.
- Don't destroy the original (dead) O1 if any partial state remains — even a corrupt block volume may yield some data via forensic recovery.
- Communicate to the rest of the team — staff cannot use SSO-dependent tools until Authentik is back; they need to know.
Break-glass for Vault unseal¶
If <3 unseal-key holders can be reached, this runbook cannot complete normally. Switch to RB-003 — accept that Vault is unrecoverable and rotate every secret in the company. This is the worst-case scenario and the reason key custody must be deliberately distributed.
Post-incident¶
- File a major-incident report at infra/incidents/YYYY-MM-DD-o1-loss.md.
- Update known-issues.md with any precondition gap that bit during recovery.
- Update this runbook with timing reality — the estimated_duration field should reflect how long it actually took.
- Convene a post-incident review focused on: was the trigger preventable? Were backups adequate? What single fix would have shortened the recovery most?
- Schedule a follow-up restore-test drill (RM-017) within 90 days.
Related¶
- RB-001 — Caddy down (much narrower scope; sometimes triggers concurrently)
- RB-003 — Vault sealed (the most likely point of failure inside this runbook)
- KI-017 — Backup gap on O1
- KI-016 — No restore-test ever performed
- servers.md — O1
- shared-infra.md — the dependency map showing what cascades when O1 dies