RB-002 — ORA448Global VPS (O1) is destroyed or unrecoverable

Use this runbook when: the O1 instance is gone — terminated, disk failure, ransomware, or otherwise unrecoverable. This is the highest blast radius scenario in the estate today: ~15 internal apps including Vault (secrets) and Authentik (SSO) are on this single host.

id: RB-002
title: O1 disaster recovery (catastrophic loss of *.448.global estate)
severity: catastrophic
estimated_duration: 4-12 hours (heavily dependent on which Phase-2 actions have landed)
servers_affected: [O1]
apps_affected: ["14", "15", "17", "19", "20", "21", "22", "23", "24", "25", "26", "27", "28", "29"]
related_kis: [KI-002, KI-003, KI-013, KI-015, KI-016, KI-017]
related_actions: [RM-007, RM-013, RM-014, RM-016, RM-017, RM-018]

Trigger / detection

  • The O1 OCI compute instance is in a non-recoverable state (terminated, unbootable, or a disk failure that a block-volume snapshot can't fix).
  • All *.448.global URLs return connection refused / DNS failure.
  • All apps that depend on Authentik (GitLab, Vault clients, etc.) cannot authenticate new logins.

Severity

Catastrophic. Direct impact:

  • ~15 internal applications offline: Authentik, Vault, MinIO, Portainer, Beszel, Gotify, Wireguard, Coder, n8n, Open WebUI, Draw.io, IT Tools, PE Tube, Watchtower, the SQLcl image.
  • Cascading: SSO-integrated apps (GitLab, others) cannot authenticate new logins until Authentik is restored.
  • Cascading: any app that reads secrets at startup will fail to start until Vault is restored.
  • The custom SQLcl Docker image and n8n CI/CD pipelines are gone unless they were backed up off-host.

This runbook is a disaster-recovery playbook, not a hot-restart procedure. Expect hours, not minutes. Bring a notepad.

Required access

  • OCI tenancy owner / root on ORA448Global: Adam Pitt-Stanley (or Vishnu Kant as additional admin)
  • OCI block-volume snapshot of O1 (if exists)
  • Vault Raft snapshot (RM-013) and unseal-key shares (≥3 of 5 holders available)
  • Authentik backup (RM-014): Postgres dump + media volume + secret key
  • n8n backup (RM-016) including N8N_ENCRYPTION_KEY (in Vault — chicken-and-egg)
  • OCI bucket access for retrieving backups
  • GitLab access for Caddyfile + Authentik blueprint + n8n workflow exports + SQLcl Dockerfile
  • Microsoft 365 admin on the relevant tenant — for Authentik's upstream identity restore
  • DNS write access at GoDaddy — only if the new instance gets a different public IP

Preconditions

A precondition not met means partial or impossible recovery. Check each before starting:

| Precondition | Status if missing | Phase-2 action |
| --- | --- | --- |
| OCI block-volume snapshot of O1 | Recovery starts from scratch | Snapshot policy (servers.md) |
| Vault Raft snapshot off-host | Tier-0: every secret is unrecoverable | RM-013 |
| Authentik backup off-host | SSO must be rebuilt by hand from the OIDC client list | RM-014 |
| n8n workflow exports in Git | Every CI/CD workflow is gone | RM-010 / RM-016 |
| Caddyfile in Git | Hand-rebuild from proxies.md | RM-007 |
| Custom SQLcl Dockerfile in registry | Image must be rebuilt from scratch | RM-008 |
| Authentik blueprint in Git | OIDC client config rebuilt by hand | RM-011 |
| Vault unseal-key custodians reachable | If fewer than 3 are reachable, see RB-003 | RM-013 |

Today (2026-05) most of these preconditions are NOT met. Running this runbook successfully depends on Wave 1 of the improvement plan having landed first. Until it has, this runbook is partly aspirational — but it shows the team exactly what's at stake and what's needed.
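The precondition checks above can be partly mechanized. A sketch, assuming hypothetical bucket and object names (`o1-backups`, `vault-snapshots/latest.snap`, etc.) that must be matched to the real backup jobs before use:

```shell
# Preflight: confirm the Tier-0 backup objects actually exist in the OCI
# bucket before committing to this runbook. Names below are illustrative.
BUCKET="o1-backups"
OBJECTS="vault-snapshots/latest.snap authentik-backups/latest-pg_dump.sql.gz authentik-backups/latest-media.tar.gz"
RESULTS=""
for obj in $OBJECTS; do
  # `oci os object head` returns non-zero if the object is absent.
  if oci os object head --bucket-name "$BUCKET" --name "$obj" >/dev/null 2>&1; then
    RESULTS="${RESULTS}OK      $obj
"
  else
    RESULTS="${RESULTS}MISSING $obj
"
  fi
done
printf '%s' "$RESULTS"
```

Any MISSING line means the corresponding table row above is in its "status if missing" state; plan accordingly before touching anything.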


Recovery steps

Step 1 — Provision a new O1 instance

# In OCI console (or via CLI), create a new compute instance in ORA448Global:
#   Shape: VM.Standard.A1.Flex  (Always Free; 4 OCPU / 24 GB RAM is the full free-tier allowance)
#   Region: uk-london-1
#   OS: same Ubuntu LTS as the old O1
#   VCN: same as old O1, or fresh VCN if the old one is gone
#   Public IP: assign a new one

Capture the new public IP. If the old O1's IP is still routable (rare), you can attempt to keep DNS unchanged; otherwise proceed to step 2.
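Step 1 can also be driven from the OCI CLI. A command sketch; every angle-bracket value is a placeholder that must come from the ORA448Global tenancy:

```shell
# Launch a replacement A1.Flex instance (placeholder OCIDs throughout).
oci compute instance launch \
  --availability-domain "<ad-name>" \
  --compartment-id "<compartment-ocid>" \
  --shape "VM.Standard.A1.Flex" \
  --shape-config '{"ocpus": 4, "memoryInGBs": 24}' \
  --image-id "<ubuntu-lts-image-ocid>" \
  --subnet-id "<subnet-ocid>" \
  --assign-public-ip true \
  --ssh-authorized-keys-file ~/.ssh/id_ed25519.pub \
  --display-name "O1-replacement"
```

If a block-volume snapshot of the old O1 exists, restore it to a new volume first and pass the resulting boot/attach options instead of a vanilla image.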

Step 2 — Update DNS for *.448.global

If the new instance has a different public IP than the old O1:

# At GoDaddy, update A records for every *.448.global subdomain
# (15 subdomains, see infra/servers.md O1 section).
# Until DNS propagates (TTL is typically 600s = 10 min), URLs continue
# to resolve to the dead host.

Step 3 — Base setup

# SSH in. The instance has a fresh host key, so expect a host-key prompt
# (and remove any stale known_hosts entry for the old IP).
ssh ubuntu@<new-public-ip>
sudo apt update && sudo apt -y full-upgrade
sudo apt install -y docker.io docker-compose-v2 caddy
sudo systemctl enable --now docker

Step 4 — Restore Caddy config

Same as RB-001 Step 5:

cd /tmp
git clone --depth=1 https://<git-creds>@git.projecteidos.com/internal/engineering.git
sudo cp engineering/infra/caddy/O1.Caddyfile /etc/caddy/Caddyfile
sudo caddy validate --config /etc/caddy/Caddyfile
sudo systemctl restart caddy

Caddy will start issuing Let's Encrypt certs immediately. 15 hostnames is within the 50-certificates-per-registered-domain weekly limit, but failed validations are capped at 5 per hostname per hour; while DNS still points at the dead host, every validation fails and burns that quota. Wait for propagation before restarting Caddy, and stagger issuance if failures pile up.

Step 5 — Restore Vault (Tier-0 — do this before anything else)

# Pull the latest Vault Raft snapshot from the OCI bucket:
oci os object get \
  --bucket-name <bucket> \
  --name vault-snapshots/latest.snap \
  --file /tmp/vault.snap

# Run a Vault container: mount the data dir AND the config dir (vault.hcl
# lives in /opt/vault/config on the host), then start the server:
docker run -d --name vault --cap-add IPC_LOCK \
  -v /opt/vault/data:/vault/data \
  -v /opt/vault/config:/vault/config \
  -p 8200:8200 \
  hashicorp/vault:<version> \
  server -config=/vault/config/vault.hcl

# A fresh, never-initialized Vault must be initialized and unsealed first
# (vault operator init, then vault operator unseal x3) before a snapshot
# can be restored. Because the snapshot comes from the old cluster,
# the restore needs -force:
vault operator raft snapshot restore -force /tmp/vault.snap

Then unseal Vault. Requires 3 of 5 unseal-key holders to participate. If you can't reach 3, jump to RB-003.

vault operator unseal <key-share-1>
vault operator unseal <key-share-2>
vault operator unseal <key-share-3>
vault status   # Sealed: false

Once Vault is unsealed, the rest of the recovery has access to credentials.

Step 6 — Restore Authentik

# Pull latest Authentik backup from OCI bucket
oci os object get --bucket-name <bucket> \
  --name authentik-backups/latest-pg_dump.sql.gz \
  --file /tmp/authentik-pg.sql.gz

oci os object get --bucket-name <bucket> \
  --name authentik-backups/latest-media.tar.gz \
  --file /tmp/authentik-media.tar.gz

# Retrieve the Authentik secret key from Vault (it is NOT in the Postgres dump):
export AUTHENTIK_SECRET_KEY=$(vault kv get -field=secret_key secret/authentik/runtime)

# Stand up Authentik's Postgres + Redis containers first (docker-compose
# from the engineering repo: infra/authentik/docker-compose.yml; service
# names below follow that file):
docker compose up -d postgresql redis
# Restore the DB:
gunzip < /tmp/authentik-pg.sql.gz | docker exec -i authentik-postgres psql -U authentik
# Start the remaining containers, then untar media into the volume
# (note -i so tar reads the archive from stdin):
docker compose up -d
docker exec -i authentik-server tar xzf - -C /media < /tmp/authentik-media.tar.gz

Verify:

curl -sI https://auth.448.global/-/health/live/
# Expect 204 No Content

If Authentik is up but the OIDC client config differs from before (e.g. older blueprint), apply the latest blueprint from the engineering repo:

# Refer to RM-011 / infra/authentik/blueprint.yaml

Step 7 — Restore MinIO

# MinIO is the most likely thing to have lost data IF object storage
# was on local disk on O1 (vs an external block volume that survived).
# If a separate disk attached to O1 survived (block volume snapshot exists),
# attach to new O1 and mount.
# Otherwise: data loss for whatever was in MinIO. Confirm impact before proceeding.

docker run -d --name minio \
  -v /opt/minio/data:/data \
  -p 9000:9000 -p 9001:9001 \
  -e "MINIO_ROOT_USER=$(vault kv get -field=root_user secret/minio/root)" \
  -e "MINIO_ROOT_PASSWORD=$(vault kv get -field=root_pw secret/minio/root)" \
  minio/minio server /data --console-address ":9001"

Step 8 — Restore the rest of the apps

In rough dependency order (the ones other things depend on first):

  1. Wireguard / WG-Easy — peers need re-issued configs unless the server's private key was backed up to Vault. Bring it up; admins re-add devices.
  2. Portainer — fresh install; reconnect to the local Docker socket.
  3. Beszel — fresh install; agents on each host need re-registering.
  4. Gotify — fresh install; tokens for sources (Beszel, Watchtower) re-issued.
  5. Watchtower — restart; will manage container updates.
  6. n8n — restore Postgres dump + apply N8N_ENCRYPTION_KEY from Vault. Workflow exports re-imported from Git.
  7. Coder — fresh; users re-create workspaces.
  8. Open WebUI, Draw.io, IT Tools, PE Tube — fresh containers; data restored where available.
  9. Custom SQLcl image — pull from GitLab Container Registry; redeploy on user-defined Docker network with stable alias (per RM-008 / RM-009).

For each app, confirm:

docker ps --filter "name=<app>" --format '{{.Status}}'
# Expect "Up <time> (healthy)" or "Up <time>"
curl -sI https://<app>.448.global -m 5 | head -1

Verification (post-recovery)

  • Every *.448.global URL returns expected status (matches pre-incident state).
  • An end-to-end SSO test succeeds: log in to GitLab via Authentik via M365.
  • An app reads a secret from Vault successfully (e.g. vault kv get secret/<app>/<key>).
  • Beszel agents on E1 + E2 + O1 are reporting.
  • n8n executes a smoke-test workflow without "SQLcl unreachable" errors.
  • Watchtower has run at least once and reported via Gotify.
  • An external uptime check (RM-038) shows green for all O1-fronted hostnames.

Rollback / abort

There is no clean rollback once the new O1 is live and DNS has been flipped. If recovery is failing partway:

  1. Pause and escalate. This is a major incident; convening incident-comms is appropriate.
  2. Don't destroy the original (dead) O1 if any partial state remains — even a corrupt block volume may yield some data via forensic recovery.
  3. Communicate to the rest of the team — staff cannot use SSO-dependent tools until Authentik is back; they need to know.

Break-glass for Vault unseal

If fewer than 3 unseal-key holders can be reached, this runbook cannot complete normally. Switch to RB-003 — accept that Vault is unrecoverable and rotate every secret in the company. This is the worst-case scenario and the reason key custody must be deliberately distributed.


Post-incident

  1. File a major-incident report at infra/incidents/YYYY-MM-DD-o1-loss.md.
  2. Update known-issues.md with any precondition gap that bit during recovery.
  3. Update this runbook with timing reality — the estimated duration column should reflect how long it actually took.
  4. Convene a post-incident review focused on: was the trigger preventable? Were backups adequate? What single fix would have shortened the recovery most?
  5. Schedule a follow-up restore-test drill (RM-017) within 90 days.

Related

  • RB-001 — Caddy down (much narrower scope; sometimes triggers concurrently)
  • RB-003 — Vault sealed (the most likely point of failure inside this runbook)
  • KI-017 — Backup gap on O1
  • KI-016 — No restore-test ever performed
  • servers.md — O1
  • shared-infra.md — the dependency map showing what cascades when O1 dies