# Shared infrastructure — dependency map
The apps that other apps depend on. Understanding these is the difference between "one service is down" and "half the company is offline".
A failure in any one of these has a cascading blast radius. Every per-app maturity / DR plan must reference whichever of these it depends on.
## The dependency graph
```mermaid
graph TB
    subgraph CRITICAL[Tier 0 — keys to everything]
        Vault[Vault<br/>vault.448.global]
        Auth[Authentik<br/>auth.448.global]
        DNS[Domain registrars<br/>+ DNS hosts]
    end
    subgraph BUILD[Tier 1 — build & deploy]
        Git[GitLab<br/>git.projecteidos.com]
        Dok[Dokploy<br/>platform.projecteidos.com]
        Reg[Container registry]
    end
    subgraph PLAT[Tier 1 — platform fabric]
        Port[Portainer<br/>portainer.448.global]
        S3[MinIO<br/>s3.448.global]
        WG[Wireguard<br/>wg.448.global]
    end
    subgraph OPS[Tier 2 — operations visibility]
        Mon[Beszel<br/>monitor.448.global]
        Notify[Gotify<br/>notify.448.global]
        WT[Watchtower]
    end
    subgraph APPS[Customer & internal apps]
        AllApps[All 22 user-facing apps]
    end
    Vault --> APPS
    Auth --> APPS
    DNS --> APPS
    Git --> Dok --> APPS
    Reg --> Dok
    Port --> APPS
    S3 --> APPS
    WG -.admin access.-> APPS
    Mon --> Notify
    WT --> APPS
```
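The same graph, expressed as data, makes the cold-start recovery order computable instead of tribal knowledge. A minimal sketch using Python's standard-library `graphlib`; the dependency lists are assumptions read off the diagram above, not a live inventory:

```python
# Sketch: the diagram above as data, so a restore order can be computed
# rather than argued about. Dependency lists mirror this page and are
# assumptions to verify, not a live inventory.
from graphlib import TopologicalSorter

# For each service, the services it depends on (its predecessors).
deps = {
    "vault": [],
    "authentik": [],
    "dns": [],
    "gitlab": ["dns"],
    "registry": ["gitlab"],
    "dokploy": ["gitlab", "registry"],
    "portainer": ["dns"],
    "minio": ["dns"],
    "wireguard": [],
    "beszel": ["dns"],
    "gotify": ["dns"],
    "watchtower": ["registry"],
    "apps": ["vault", "authentik", "dns", "dokploy", "portainer", "minio"],
}

# static_order() emits a node only after everything it depends on,
# i.e. a valid cold-start recovery order. Tier 0 comes out first.
print(list(TopologicalSorter(deps).static_order()))
```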
## Tier 0 — keys to everything
If any of these fail catastrophically (data loss, not just outage), the recovery cost is enormous.
| App | What it provides | Failure consequence |
|---|---|---|
| Vault | All secrets | Apps reading secrets at startup fail; if storage is lost without restorable backup, every secret must be rotated. |
| Authentik | SSO / identity | Every SSO-integrated app becomes inaccessible. |
| Domain registrars + DNS hosts | DNS resolution + ownership | Total outage of every URL. Loss of registrar account = loss of domain. See domains.md. |
These three should have the strongest backups, the tightest recovery time objectives, and the most rehearsed runbooks.
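A useful companion to those runbooks is an external Tier-0 probe, run from outside this infrastructure so it does not share the blast radius it watches. A sketch, assuming Vault's standard unauthenticated `/v1/sys/seal-status` endpoint and authentik's `/-/health/live/` endpoint (verify both against the actual deployments):

```python
# Sketch: an external Tier-0 probe. Run it from OUTSIDE this
# infrastructure; hostnames come from this page, endpoint paths are
# standard for Vault / authentik but should be verified.
import json
import socket
import urllib.request

def dns_resolves(name: str) -> bool:
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False

def vault_unsealed(base: str) -> bool:
    # /v1/sys/seal-status is unauthenticated in stock Vault.
    with urllib.request.urlopen(f"{base}/v1/sys/seal-status", timeout=5) as r:
        return json.load(r).get("sealed") is False

def authentik_live(base: str) -> bool:
    # authentik exposes /-/health/live/; adjust if this deployment differs.
    with urllib.request.urlopen(f"{base}/-/health/live/", timeout=5) as r:
        return r.status in (200, 204)

if __name__ == "__main__":
    print("dns:", dns_resolves("vault.448.global"))
    print("vault:", vault_unsealed("https://vault.448.global"))
    print("authentik:", authentik_live("https://auth.448.global"))
```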
## Tier 1 — build & deploy
Failure means no new deploys, but running services usually keep serving.
| App | Provides | Depends on |
|---|---|---|
| GitLab | Source control, CI/CD, container registry | Itself + storage backend |
| Dokploy | Auto-deploy from Git | GitLab + container registry + Traefik |
| Container registry (likely GitLab's) | Image storage | GitLab + storage |
Loss profile: the company can't ship new code, but customers don't notice immediately.
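A liveness check for the deploy chain belongs in the same external probe. One wrinkle worth knowing: per the Docker Registry HTTP API, `GET /v2/` answers 200 (or 401 when auth is required) whenever the registry process is up, so both count as healthy. A sketch; the registry URL is an assumption about where GitLab's registry lives, and GitLab's `/-/health` endpoint is IP-allowlist-restricted by default:

```python
# Sketch: "is the deploy chain alive?" The registry URL is a guess;
# GitLab restricts /-/health to allowlisted IPs by default, so the
# probe host may need to be added to that allowlist.
import urllib.error
import urllib.request

def http_status(url: str) -> int:
    try:
        with urllib.request.urlopen(url, timeout=5) as r:
            return r.status
    except urllib.error.HTTPError as e:
        return e.code  # an HTTP error is still an answer from a live service
    except OSError:
        return 0       # no answer at all

reg = http_status("https://registry.projecteidos.com/v2/")  # assumed URL
print("registry:", "up" if reg in (200, 401) else f"down ({reg})")
git = http_status("https://git.projecteidos.com/-/health")
print("gitlab:", "up" if git == 200 else f"down ({git})")
```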
## Tier 1 — platform fabric
Failure has broad operational impact — multiple apps stall.
| App | Provides | What depends on it |
|---|---|---|
| Portainer | Container UI / control plane | Engineers managing containers (CLI fallback exists) |
| MinIO | Object storage | Apps storing files: PE Tube, possibly GitLab artifacts, possibly app uploads, possibly backups |
| Wireguard | VPN to internal services | Every admin trying to reach `*.448.global` services that are not publicly exposed |
Loss profile: depending on what's behind Wireguard, the company can lose its ability to manage its own infrastructure even though customers see nothing.
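The flip side of "admin tooling is Wireguard-only" is verifiable: from a host that is not on the VPN, the admin hostnames should refuse to connect. A sketch where a successful connection is the alarm; the host list is illustrative:

```python
# Sketch: run from an OUTSIDE host (not on the VPN) to confirm admin
# surfaces are unreachable without Wireguard. Connecting successfully
# here is the alarm. Host list is illustrative, taken from this page.
import socket

ADMIN_ONLY = ["portainer.448.global", "monitor.448.global"]

for host in ADMIN_ONLY:
    try:
        socket.create_connection((host, 443), timeout=5).close()
        print(f"ALARM: {host}:443 is reachable without the VPN")
    except OSError:
        print(f"ok: {host}:443 not publicly reachable")
```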
## Tier 2 — operations visibility
Failure means we fly blind but services keep serving.
| App | Provides | Notes |
|---|---|---|
| Beszel | Monitoring | "Who watches the watcher" question — what alerts on Beszel-down? |
| Gotify | Push notifications | Beszel + Watchtower + n8n likely all push here |
| Watchtower | Auto-update | Quietly important; misbehaviour can cause silent outages |
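One answer to the "who watches the watcher" question is a dead-man's-switch probe on an external host: cron runs it, it checks Beszel, and it pushes to Gotify on failure; if Gotify is also down, the script errors and cron's own failure path (exit code / mail) is the last-resort signal. A sketch using Gotify's documented `POST /message?token=` endpoint; the token is a placeholder:

```python
# Sketch: external dead-man's switch for Beszel, run by cron on a host
# outside this infrastructure. Token is a placeholder for a Gotify
# application token.
import json
import urllib.error
import urllib.request

BESZEL = "https://monitor.448.global"
GOTIFY = "https://notify.448.global"
TOKEN = "REPLACE_ME"

def beszel_up() -> bool:
    try:
        with urllib.request.urlopen(BESZEL, timeout=10) as r:
            return r.status < 500
    except urllib.error.HTTPError as e:
        return e.code < 500  # the server answered, even if unhappily
    except OSError:
        return False         # no answer at all

def gotify_push(title: str, message: str) -> None:
    # Gotify's documented endpoint: POST /message?token=<app token>
    body = json.dumps({"title": title, "message": message, "priority": 8})
    req = urllib.request.Request(
        f"{GOTIFY}/message?token={TOKEN}",
        data=body.encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

if not beszel_up():
    gotify_push("Beszel down", "Monitoring is blind; check monitor.448.global")
```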
## Common failure cascades
### "SSO is down"
Authentik unreachable → every app integrated with it rejects new logins. Active sessions may keep working until their token TTL expires. Recovery requires Authentik itself, its database, and its secret key.
"We can't read secrets"¶
Vault sealed or unreachable → apps that read at startup fail to come up. Existing running apps with secrets in memory keep working until restart. Cascading restart = everything down.
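The "cascading restart = everything down" scenario softens considerably if apps retry the startup secret read instead of crashing. A sketch against Vault's KV v2 read endpoint (`GET /v1/secret/data/<path>`); the secret path and environment variables are illustrative:

```python
# Sketch: retry-with-backoff on the startup secret read, so a restarted
# app rides out a short Vault outage instead of dying immediately.
# The secret path and env vars are illustrative assumptions.
import json
import os
import time
import urllib.request

VAULT = os.environ.get("VAULT_ADDR", "https://vault.448.global")
TOKEN = os.environ["VAULT_TOKEN"]

def read_secret(path: str, attempts: int = 8) -> dict:
    req = urllib.request.Request(
        f"{VAULT}/v1/secret/data/{path}",  # KV v2 read endpoint
        headers={"X-Vault-Token": TOKEN},
    )
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(req, timeout=5) as r:
                return json.load(r)["data"]["data"]  # KV v2 nests payload
        except OSError:
            if attempt == attempts - 1:
                raise  # give up only after the backoff budget is spent
            time.sleep(min(2 ** attempt, 60))  # exponential, capped at 60s

db_creds = read_secret("myapp/database")  # hypothetical path
```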
"We can't deploy"¶
GitLab or Dokploy down → no merges, no auto-deploys, no rollbacks. Existing apps continue serving traffic.
"We can't see anything"¶
Beszel + Gotify both down → no alerts, no dashboards. Engineers operating from logs and user reports.
"We can't reach our own systems"¶
Wireguard down + admin tooling Wireguard-only → admins cannot fix anything internal until they restore Wireguard via console / cloud-provider browser console. Document the break-glass path.
"Container images go bad"¶
Watchtower pulls a broken image at 3am → services restart into a broken state, no human in the loop. Pin to specific tags and wire Gotify alerts on update events.
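A cheap way to put a human back in the loop is to alert on the symptom: containers dying after an update. A sketch using the Docker SDK for Python's events stream, pushing to Gotify on any nonzero-exit death; the Gotify URL and token are placeholders:

```python
# Sketch: alert when any container dies with a nonzero exit code,
# e.g. after Watchtower restarts it into a broken image. Uses the
# Docker events stream; Gotify token is a placeholder.
import json
import urllib.request

import docker  # pip install docker

GOTIFY = "https://notify.448.global"
TOKEN = "REPLACE_ME"

def gotify_push(title: str, message: str) -> None:
    req = urllib.request.Request(
        f"{GOTIFY}/message?token={TOKEN}",
        data=json.dumps({"title": title, "message": message,
                         "priority": 8}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

client = docker.from_env()
# events() yields dicts when decode=True; filter to container die events.
for event in client.events(decode=True,
                           filters={"type": "container", "event": "die"}):
    attrs = event.get("Actor", {}).get("Attributes", {})
    if attrs.get("exitCode", "0") != "0":
        gotify_push("Container died",
                    f"{attrs.get('name')} exited {attrs.get('exitCode')} "
                    f"(image {attrs.get('image')})")
```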
"Domain expires"¶
Renewal lapses → every URL on that domain stops resolving. Worst-case (squat): permanent loss. Mitigated by auto-renew + independent calendar reminders + registrar-account MFA.
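Expiry can also be checked registrar-independently via RDAP, the structured successor to WHOIS, so the reminder does not itself depend on the registrar account. A sketch using the rdap.org bootstrap redirector; the 30-day threshold is an arbitrary choice:

```python
# Sketch: registrar-independent expiry check via RDAP. rdap.org
# redirects to the right registry; the 30-day threshold is arbitrary.
import datetime
import json
import urllib.request

def days_until_expiry(domain: str) -> int:
    with urllib.request.urlopen(f"https://rdap.org/domain/{domain}",
                                timeout=10) as r:
        data = json.load(r)
    # RDAP responses carry an events list; "expiration" is standard.
    for event in data.get("events", []):
        if event.get("eventAction") == "expiration":
            expires = datetime.datetime.fromisoformat(
                event["eventDate"].replace("Z", "+00:00"))
            now = datetime.datetime.now(datetime.timezone.utc)
            return (expires - now).days
    raise ValueError(f"no expiration event in RDAP data for {domain}")

for domain in ("448.global", "projecteidos.com"):
    days = days_until_expiry(domain)
    if days < 30:
        print(f"ALARM: {domain} expires in {days} days")
```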
## What to write down before Phase 2 begins
For each Tier-0 / Tier-1 app, the maturity-upgrade plan needs answers to:
- What is the backup strategy and where do backups live? (must not be on the same host)
- When was the last successful restore test?
- Who has the credentials to recover, and where? (Vault path or offline)
- What is the recovery time objective? (i.e. how long are we willing to be down)
- What is the recovery point objective? (i.e. how much data loss is acceptable)
- What is the break-glass path that does not depend on this app being up?
These six questions, applied to each Tier-0 / Tier-1 app, are the agenda for the resilience phase.
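One way to make the answers durable is to keep them as structured records next to each app's runbook rather than in prose. A sketch; the field names are suggestions rather than an existing schema, and the Vault values are placeholders:

```python
# Sketch: the checklist above as a record type. Field names are
# suggestions, not an existing schema; values below are placeholders.
from dataclasses import dataclass

@dataclass
class DRPlan:
    app: str
    backup_strategy: str          # what is backed up, how often
    backup_location: str          # must be off-host
    last_restore_test: str        # ISO date of last successful rehearsal
    credential_holders: list[str] # who can recover, and from where
    rto_hours: float              # how long we tolerate being down
    rpo_hours: float              # how much data loss we tolerate
    break_glass_path: str         # recovery path that does not need the app

vault_plan = DRPlan(
    app="Vault",
    backup_strategy="nightly storage snapshot",  # illustrative
    backup_location="offsite object storage",
    last_restore_test="2025-01-01",              # placeholder
    credential_holders=["ops-lead", "cto"],
    rto_hours=4,
    rpo_hours=24,
    break_glass_path="unseal keys held offline by named keyholders",
)
```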