# Shared infrastructure — dependency map
The apps that other apps depend on. Understanding these is the difference between "one service is down" and "half the company is offline".
A failure in any one of these has a cascading blast radius. Every per-app maturity / DR plan must reference whichever of these it depends on.
## The dependency graph
```mermaid
graph TB
    subgraph CRITICAL[Tier 0 — keys to everything]
        Vault[Vault<br/>vault.448.global]
        Auth[Authentik<br/>auth.448.global]
        DNS[Domain registrars<br/>+ DNS hosts]
    end
    subgraph BUILD[Tier 1 — build & deploy]
        Git[GitLab<br/>git.projecteidos.com]
        Dok[Dokploy<br/>platform.projecteidos.com]
        Reg[Container registry]
    end
    subgraph PLAT[Tier 1 — platform fabric]
        Port[Portainer<br/>portainer.448.global]
        S3[MinIO<br/>s3.448.global]
        WG[Wireguard<br/>wg.448.global]
    end
    subgraph OPS[Tier 2 — operations visibility]
        Mon[Beszel<br/>monitor.448.global]
        Notify[Gotify<br/>notify.448.global]
        WT[Watchtower]
    end
    subgraph APPS[Customer & internal apps]
        AllApps[All 22 user-facing apps]
    end
    Vault --> APPS
    Auth --> APPS
    DNS --> APPS
    Git --> Dok --> APPS
    Reg --> Dok
    Port --> APPS
    S3 --> APPS
    WG -.admin access.-> APPS
    Mon --> Notify
    WT --> APPS
```
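The same graph, expressed as data, makes the cold-start recovery order computable instead of tribal knowledge. A minimal sketch using Python's standard-library `graphlib`; the dependency lists are assumptions read off the diagram above, not a live inventory:

```python
# Sketch: the diagram above as data, so a restore order can be computed
# rather than argued about. Dependency lists mirror this page and are
# assumptions to verify, not a live inventory.
from graphlib import TopologicalSorter

# For each service, the services it depends on (its predecessors).
deps = {
    "vault": [],
    "authentik": [],
    "dns": [],
    "gitlab": ["dns"],
    "registry": ["gitlab"],
    "dokploy": ["gitlab", "registry"],
    "portainer": ["dns"],
    "minio": ["dns"],
    "wireguard": [],
    "beszel": ["dns"],
    "gotify": ["dns"],
    "watchtower": ["registry"],
    "apps": ["vault", "authentik", "dns", "dokploy", "portainer", "minio"],
}

# static_order() emits a node only after everything it depends on,
# i.e. a valid cold-start recovery order. Tier 0 comes out first.
print(list(TopologicalSorter(deps).static_order()))
```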
## Tier 0 — keys to everything
If any of these fail catastrophically (data loss, not just outage), the recovery cost is enormous.
| App | What it provides | Failure consequence |
|---|---|---|
| Vault | All secrets | Apps reading secrets at startup fail; if storage is lost without restorable backup, every secret must be rotated. |
| Authentik | SSO / identity | Every SSO-integrated app becomes inaccessible. |
| Domain registrars + DNS hosts | DNS resolution + ownership | Total outage of every URL. Loss of registrar account = loss of domain. See domains.md. |
These three should have the strongest backups, the tightest recovery time objectives, and the most rehearsed runbooks.
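A useful companion to those runbooks is an external Tier-0 probe, run from outside this infrastructure so it does not share the blast radius it watches. A sketch, assuming Vault's standard unauthenticated `/v1/sys/seal-status` endpoint and authentik's `/-/health/live/` endpoint (verify both against the actual deployments):

```python
# Sketch: an external Tier-0 probe. Run it from OUTSIDE this
# infrastructure; hostnames come from this page, endpoint paths are
# standard for Vault / authentik but should be verified.
import json
import socket
import urllib.request

def dns_resolves(name: str) -> bool:
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False

def vault_unsealed(base: str) -> bool:
    # /v1/sys/seal-status is unauthenticated in stock Vault.
    with urllib.request.urlopen(f"{base}/v1/sys/seal-status", timeout=5) as r:
        return json.load(r).get("sealed") is False

def authentik_live(base: str) -> bool:
    # authentik exposes /-/health/live/; adjust if this deployment differs.
    with urllib.request.urlopen(f"{base}/-/health/live/", timeout=5) as r:
        return r.status in (200, 204)

if __name__ == "__main__":
    print("dns:", dns_resolves("vault.448.global"))
    print("vault:", vault_unsealed("https://vault.448.global"))
    print("authentik:", authentik_live("https://auth.448.global"))
```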
## Tier 1 — build & deploy
Failure means no new deploys, but running services usually keep serving.
| App | Provides | Depends on |
|---|---|---|
| GitLab | Source control, CI/CD, container registry | Itself + storage backend |
| Dokploy | Auto-deploy from Git | GitLab + container registry + Traefik |
| Container registry (likely GitLab's) | Image storage | GitLab + storage |
Loss profile: the company can't ship new code, but customers don't notice immediately.
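A liveness check for the deploy chain belongs in the same external probe. One wrinkle worth knowing: per the Docker Registry HTTP API, `GET /v2/` answers 200 (or 401 when auth is required) whenever the registry process is up, so both count as healthy. A sketch; the registry URL is an assumption about where GitLab's registry lives, and GitLab's `/-/health` endpoint is IP-allowlist-restricted by default:

```python
# Sketch: "is the deploy chain alive?" The registry URL is a guess;
# GitLab restricts /-/health to allowlisted IPs by default, so the
# probe host may need to be added to that allowlist.
import urllib.error
import urllib.request

def http_status(url: str) -> int:
    try:
        with urllib.request.urlopen(url, timeout=5) as r:
            return r.status
    except urllib.error.HTTPError as e:
        return e.code  # an HTTP error is still an answer from a live service
    except OSError:
        return 0       # no answer at all

reg = http_status("https://registry.projecteidos.com/v2/")  # assumed URL
print("registry:", "up" if reg in (200, 401) else f"down ({reg})")
git = http_status("https://git.projecteidos.com/-/health")
print("gitlab:", "up" if git == 200 else f"down ({git})")
```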
## Tier 1 — platform fabric
Failure has broad operational impact — multiple apps stall.
| App | Provides | What depends on it |
|---|---|---|
| Portainer | Container UI / control plane | Engineers managing containers (CLI fallback exists) |
| MinIO | Object storage | Apps storing files: PE Tube, possibly GitLab artifacts, possibly app uploads, possibly backups |
| Wireguard | VPN to internal services | Every admin trying to reach `*.448.global` services that are not publicly exposed |
Loss profile: depending on what's behind Wireguard, the company can lose its ability to manage its own infrastructure even though customers see nothing.
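The flip side of "admin tooling is Wireguard-only" is verifiable: from a host that is not on the VPN, the admin hostnames should refuse to connect. A sketch where a successful connection is the alarm; the host list is illustrative:

```python
# Sketch: run from an OUTSIDE host (not on the VPN) to confirm admin
# surfaces are unreachable without Wireguard. Connecting successfully
# here is the alarm. Host list is illustrative, taken from this page.
import socket

ADMIN_ONLY = ["portainer.448.global", "monitor.448.global"]

for host in ADMIN_ONLY:
    try:
        socket.create_connection((host, 443), timeout=5).close()
        print(f"ALARM: {host}:443 is reachable without the VPN")
    except OSError:
        print(f"ok: {host}:443 not publicly reachable")
```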
## Tier 2 — operations visibility
Failure means we fly blind but services keep serving.
| App | Provides | Notes |
|---|---|---|
| Beszel | Monitoring | "Who watches the watcher" question — what alerts on Beszel-down? |
| Gotify | Push notifications | Beszel + Watchtower + n8n likely all push here |
| Watchtower | Auto-update | Quietly important; misbehaviour can cause silent outages |
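One answer to the "who watches the watcher" question is a dead-man's-switch probe on an external host: cron runs it, it checks Beszel, and it pushes to Gotify on failure; if Gotify is also down, the script errors and cron's own failure path (exit code / mail) is the last-resort signal. A sketch using Gotify's documented `POST /message?token=` endpoint; the token is a placeholder:

```python
# Sketch: external dead-man's switch for Beszel, run by cron on a host
# outside this infrastructure. Token is a placeholder for a Gotify
# application token.
import json
import urllib.error
import urllib.request

BESZEL = "https://monitor.448.global"
GOTIFY = "https://notify.448.global"
TOKEN = "REPLACE_ME"

def beszel_up() -> bool:
    try:
        with urllib.request.urlopen(BESZEL, timeout=10) as r:
            return r.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500  # the server answered, even if unhappily
    except OSError:
        return False         # no answer at all

def gotify_push(title: str, message: str) -> None:
    # Gotify's documented endpoint: POST /message?token=<app token>
    body = json.dumps({"title": title, "message": message, "priority": 8})
    req = urllib.request.Request(
        f"{GOTIFY}/message?token={TOKEN}",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

if not beszel_up():
    gotify_push("Beszel down", "Monitoring is blind; check monitor.448.global")
```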
## Common failure cascades
### "SSO is down"
Authentik unreachable → every app integrated with it rejects new logins. Active sessions may keep working until their token TTL expires. Recovery requires Authentik itself, its database, and its secret key.
"We can't read secrets"¶
Vault sealed or unreachable → apps that read at startup fail to come up. Existing running apps with secrets in memory keep working until restart. Cascading restart = everything down.
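The "cascading restart = everything down" scenario softens considerably if apps retry the startup secret read instead of crashing. A sketch against Vault's KV v2 read endpoint (`GET /v1/secret/data/<path>`); the secret path and environment variables are illustrative:

```python
# Sketch: retry-with-backoff on the startup secret read, so a restarted
# app rides out a short Vault outage instead of dying immediately.
# The secret path and env vars are illustrative assumptions.
import json
import os
import time
import urllib.request

VAULT = os.environ.get("VAULT_ADDR", "https://vault.448.global")
TOKEN = os.environ["VAULT_TOKEN"]

def read_secret(path: str, attempts: int = 8) -> dict:
    req = urllib.request.Request(
        f"{VAULT}/v1/secret/data/{path}",  # KV v2 read endpoint
        headers={"X-Vault-Token": TOKEN},
    )
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(req, timeout=5) as r:
                return json.load(r)["data"]["data"]  # KV v2 nests payload
        except OSError:
            if attempt == attempts - 1:
                raise  # give up only after the backoff budget is spent
            time.sleep(min(2 ** attempt, 60))  # exponential, capped at 60s

db_creds = read_secret("myapp/database")  # hypothetical path
```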
"We can't deploy"¶
GitLab or Dokploy down → no merges, no auto-deploys, no rollbacks. Existing apps continue serving traffic.
"We can't see anything"¶
Beszel + Gotify both down → no alerts, no dashboards. Engineers operating from logs and user reports.
"We can't reach our own systems"¶
Wireguard down + admin tooling Wireguard-only → admins cannot fix anything internal until they restore Wireguard via console / cloud-provider browser console. Document the break-glass path.
"Container images go bad"¶
Watchtower pulls a broken image at 3am → services restart into a broken state, no human in the loop. Pin to specific tags and wire Gotify alerts on update events.
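A cheap way to put a human back in the loop is to alert on the symptom: containers dying after an update. A sketch using the Docker SDK for Python's events stream, pushing to Gotify on any nonzero-exit death; the Gotify URL and token are placeholders:

```python
# Sketch: alert when any container dies with a nonzero exit code,
# e.g. after Watchtower restarts it into a broken image. Uses the
# Docker events stream; Gotify token is a placeholder.
import json
import urllib.request

import docker  # pip install docker

GOTIFY = "https://notify.448.global"
TOKEN = "REPLACE_ME"

def gotify_push(title: str, message: str) -> None:
    req = urllib.request.Request(
        f"{GOTIFY}/message?token={TOKEN}",
        data=json.dumps({"title": title, "message": message,
                         "priority": 8}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

client = docker.from_env()
# events() yields dicts when decode=True; filter to container die events.
for event in client.events(decode=True,
                           filters={"type": "container", "event": "die"}):
    attrs = event.get("Actor", {}).get("Attributes", {})
    if attrs.get("exitCode", "0") != "0":
        gotify_push("Container died",
                    f"{attrs.get('name')} exited {attrs.get('exitCode')} "
                    f"(image {attrs.get('image')})")
```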
"Domain expires"¶
Renewal lapses → every URL on that domain stops resolving. Worst-case (squat): permanent loss. Mitigated by auto-renew + independent calendar reminders + registrar-account MFA.
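Expiry can also be checked registrar-independently via RDAP, the structured successor to WHOIS, so the reminder does not itself depend on the registrar account. A sketch using the rdap.org bootstrap redirector; the 30-day threshold is an arbitrary choice:

```python
# Sketch: registrar-independent expiry check via RDAP. rdap.org
# redirects to the right registry; the 30-day threshold is arbitrary.
import datetime
import json
import urllib.request

def days_until_expiry(domain: str) -> int:
    with urllib.request.urlopen(f"https://rdap.org/domain/{domain}",
                                timeout=10) as r:
        data = json.load(r)
    # RDAP responses carry an events list; "expiration" is standard.
    for event in data.get("events", []):
        if event.get("eventAction") == "expiration":
            expires = datetime.datetime.fromisoformat(
                event["eventDate"].replace("Z", "+00:00"))
            now = datetime.datetime.now(datetime.timezone.utc)
            return (expires - now).days
    raise ValueError(f"no expiration event in RDAP data for {domain}")

for domain in ("448.global", "projecteidos.com"):
    days = days_until_expiry(domain)
    if days < 30:
        print(f"ALARM: {domain} expires in {days} days")
```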
## What to write down before Phase 2 begins
For each Tier-0 / Tier-1 app, the maturity-upgrade plan needs answers to:
- What is the backup strategy and where do backups live? (must not be on the same host)
- When was the last successful restore test?
- Who has the credentials to recover, and where? (Vault path or offline)
- What is the recovery time objective? (i.e. how long are we willing to be down)
- What is the recovery point objective? (i.e. how much data loss is acceptable)
- What is the break-glass path that does not depend on this app being up?
These six questions, applied to each Tier-0 / Tier-1 app, are the agenda for the resilience phase.
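One way to make the answers durable is to keep them as structured records next to each app's runbook rather than in prose. A sketch; the field names are suggestions rather than an existing schema, and the Vault values are placeholders:

```python
# Sketch: the checklist above as a record type. Field names are
# suggestions, not an existing schema; values below are placeholders.
from dataclasses import dataclass

@dataclass
class DRPlan:
    app: str
    backup_strategy: str          # what is backed up, how often
    backup_location: str          # must be off-host
    last_restore_test: str        # ISO date of last successful rehearsal
    credential_holders: list[str] # who can recover, and from where
    rto_hours: float              # how long we tolerate being down
    rpo_hours: float              # how much data loss we tolerate
    break_glass_path: str         # recovery path that does not need the app

vault_plan = DRPlan(
    app="Vault",
    backup_strategy="nightly storage snapshot",  # illustrative
    backup_location="offsite object storage",
    last_restore_test="2025-01-01",              # placeholder
    credential_holders=["ops-lead", "cto"],
    rto_hours=4,
    rpo_hours=24,
    break_glass_path="unseal keys held offline by named keyholders",
)
```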