
Known operational issues

Active risks and recently experienced incidents that the team is aware of and that need to be addressed in Phase 2. Sorted by severity. New entries go at the top.


KI-001 — [MEDIUM] Caddyfile in Git on E1 (resolved); O1 still on host only

What: The Caddy reverse proxies on both VPSes used to read their config from a Caddyfile that lived only on the host filesystem. There was no source-controlled copy.

E1: resolved 2026-05-08. E1's Caddyfile is now committed to source at infra/caddy/E1.Caddyfile with a documented rebuild path. Updates are still applied manually to the host today; a future small improvement is a CI job that auto-deploys on merge to main (tracked under the same KI; not blocking).

O1: still on host only. The Caddyfile that fronts every *.448.global service hasn't been captured yet — same risk pattern as before. Tracked under RM-007.

Original incident: A recent outage broke all internal apps and client apps. The trigger was scheduled OCI VPS maintenance — a routine Oracle event during which the host's ephemeral state was lost. Because the Caddyfile (and Caddy's ACME data dir) lived on host filesystem only, recovery required hand-rebuilding the entire reverse-proxy config. We weren't ready for an event Oracle had announced.

Recovery runbook: RB-001.

Why it matters to leadership: any single corrupting event on either host (disk failure, OS upgrade gone wrong, mistaken edit, ransomware) takes the entire proxy config with it. Recovery requires hand-rebuilding the routing for ~30 hostnames — error-prone, slow, and customer-visible.

Phase-2 actions: 1. Move both Caddyfiles into the engineering repo (suggested path infra/caddy/E1.Caddyfile, infra/caddy/O1.Caddyfile). 2. Add a sync mechanism — pull on host startup / periodic CI deploy / Ansible / GitOps. 3. Document the bare-metal "rebuild Caddy from Git" procedure in infra/runbooks/. 4. Snapshot the host filesystem (OCI block-volume backup) at least daily. 5. Validate the Caddyfile in CI (caddy validate) before any change is deployed.


KI-002 — [HIGH] SQLcl container has no static internal IP, breaking n8n CI/CD pipelines

What: The custom Alpine-based SQLcl Docker image runs on O1 alongside n8n. n8n CI/CD pipelines connect to it by IP — but the container's 172.0.0.xx address changes on restart, so pipelines break unpredictably.

Why it matters: every restart (Watchtower update, host reboot, container OOM) can silently break the database-connected workflows. The operations team has no visibility into the failure mode beyond "the pipeline broke again".

Phase-2 actions: 1. Put the SQLcl container on a named Docker network with a stable alias (e.g. sqlcl.internal); have n8n connect by name, not IP. 2. Alternatively, assign a static IP via docker run --ip in a user-defined bridge network. 3. Add a healthcheck so n8n surfaces "SQLcl is unreachable" as an explicit alert, not a silent pipeline failure. 4. Move the Dockerfile + image into source control and a hosted registry (see KI-003).
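A minimal sketch of actions 1 and 3, assuming the container is launched with docker run; the network name, alias, image name, and healthcheck command are illustrative assumptions:

```bash
# Create a user-defined bridge network once; containers on it resolve each other by name/alias.
docker network create eidos-internal

# Run the SQLcl container with a stable DNS alias and a basic liveness healthcheck.
# Image name, alias, and the health command are assumptions -- adjust to the real image.
docker run -d --name sqlcl \
  --network eidos-internal \
  --network-alias sqlcl.internal \
  --health-cmd "sql -V >/dev/null 2>&1 || exit 1" \
  --health-interval 60s \
  --health-retries 3 \
  custom/sqlcl:latest sleep infinity

# n8n joins the same network and connects to sqlcl.internal by name instead of an IP.
docker network connect eidos-internal n8n
```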


KI-003 — [HIGH] Custom SQLcl image has no source repo and no registry

What: The Alpine-based custom SQLcl image is built once and lives only on O1. The Dockerfile is not in Git, the image is not in a registry. If O1 dies or the image is pruned, the build is lost.

Phase-2 actions: 1. Commit the Dockerfile to the engineering repo (infra/sqlcl/Dockerfile). 2. Push the built image to a hosted registry — GitLab Container Registry on git.projecteidos.com is the natural choice. 3. Pin n8n / Coder workspaces to a versioned tag, not :latest. 4. Document in 32-sqlcl-container.md where to find both.


KI-004 — [MEDIUM] Dokploy-hosted apps have no auto-update mechanism

What: Watchtower runs only on O1, where it auto-updates *.448.global containers. The 9 apps deployed via Dokploy on E2 (GitLab, Teams Bot, 3 WordPress sites, 3 Twenty CRMs, Dokploy itself) get no automated security updates.

Why it matters: these are the most-public-facing apps in the estate. WordPress especially attracts CVE traffic. Without a patching plan, drift accumulates and new vulnerabilities are unmitigated.

Phase-2 actions: 1. Decide patch strategy for Dokploy-hosted apps: - Option A: extend Watchtower to E2 (simple but Dokploy doesn't expect it). - Option B: configure Dokploy's own auto-deploy on Git tag changes. - Option C: scheduled rebuild pipeline (CI re-pulls images weekly). 2. Pin all containers to specific tags, not :latest, to avoid surprise breakage. 3. Set up Gotify / email alerts on update events, regardless of which mechanism is chosen.


KI-005 — [MEDIUM] Parallax prod and pre-prod share one Paid ADB

What: Parallax (paying-customer system) runs prod and pre-prod on the same paid ADB (E5). No isolation at the database layer.

Why it matters: a bad migration, schema change, or load test in pre-prod can affect prod. Recovering from a corruption that starts in pre-prod means recovering prod as well.

Phase-2 actions: 1. Provision a separate ADB for pre-prod (paid or free, depending on cost tolerance). 2. Document the pre-prod → prod promotion process explicitly. 3. Lock pre-prod from running migrations against prod schemas.


KI-006 — [MEDIUM] Free Tier ADBs auto-pause after 7 days idle, including the paying-customer Fourway tenant

What: Fourway TnE Connect tenant (paying client), Eidos TnE Connect tenant, and the two ORA448Global APEX dev ADBs all run on OCI Always Free, which auto-pauses after 7 days of inactivity.

Why it matters: A quiet weekend / holiday can leave the Fourway tenant inaccessible until manually restarted. For a paying customer this is a contractual / reputational risk. Free Tier also has no SLA from Oracle.

Phase-2 actions: 1. Upgrade Fourway tenant ADB to paid tier. 2. Upgrade Eidos tenant ADB if internal usage gaps are likely. 3. As a stop-gap, schedule a daily keep-alive query against each free ADB. 4. Surface ADB-pause events to Gotify.
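A sketch of the stop-gap in action 3, assuming SQLcl and the ADB wallet are available on whichever host runs the cron job; the wallet path, connect string, and credential variable are assumptions (one block per ADB):

```bash
#!/usr/bin/env bash
# adb-keepalive.sh -- run a trivial query so OCI does not count the free ADB as idle.
# Wallet path, connect string, and password variable are placeholders; keep the real
# credential in Vault and inject it via the environment.
set -euo pipefail

WALLET="/opt/wallets/fourway_wallet.zip"
CONN="keepalive_user/${KEEPALIVE_PASS}@fourway_low"

echo "select 1 from dual;" | sql -S -cloudconfig "$WALLET" "$CONN" \
  || echo "keep-alive failed for fourway" >&2

# Crontab entry (daily at 06:15):
# 15 6 * * * /opt/scripts/adb-keepalive.sh >> /var/log/adb-keepalive.log 2>&1
```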


KI-007 — [MEDIUM] Bradley keeps shared credentials in personal Bitwarden

What: Per Vishnu, the credentials Bradley administers (DBA / EIDOSDev1 admin scope) are kept in his personal Bitwarden, not in vault.448.global.

Why it matters: off-boarding risk; bus-factor risk; not audit-loggable; not retrievable by the rest of the team in an incident.

Phase-2 actions: 1. Migrate every shared credential into vault.448.global under a documented path scheme. 2. Retire Bradley's personal Bitwarden as a system-of-record for shared secrets. 3. Document who has access to which Vault paths.


KI-008 — [MEDIUM] Both VPSes are Free Tier with no SLA

What: E1, E2, and O1 are all OCI Always Free Ampere A1 instances. OCI's free tier carries no uptime SLA and Oracle reserves the right to reclaim free resources.

Phase-2 actions: 1. Cost-benefit on upgrading at minimum the VPSes hosting paying-customer surfaces (E1, E2). 2. Maintain a documented rebuild path so if Oracle reclaims a free instance, recovery is hours not days.


KI-009 — [HIGH] WireGuard runs full-tunnel through O1 Free VPS

What: WireGuard on O1 is configured with Allowed IPs: 0.0.0.0/0, ::/0 — every byte of every connected peer's traffic transits O1, an Always-Free Ampere A1 instance that already hosts Vault, Authentik, MinIO, and ~12 other apps.

Why it matters: every additional peer (the strategic plan is to put TnE Connect customers behind WireGuard) competes for CPU and bandwidth with the rest of the *.448.global estate. A peer's slow video call can degrade the company's identity provider.

Phase-2 actions: 1. Switch to split-tunnel — only route the actual internal hosts the VPN is meant to reach (*.448.global, the ADB endpoints) and let normal traffic exit the peer's local internet. 2. If full-tunnel is required for compliance, move WireGuard to a dedicated VPS so it doesn't share resources with the identity / secrets layer.
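A split-tunnel sketch for action 1: on each peer, AllowedIPs lists only the internal destinations instead of 0.0.0.0/0. The WireGuard subnet (10.8.0.0/24) and internal range (10.0.0.0/16) are assumptions; substitute the real O1/VCN ranges and ADB endpoints:

```bash
# Peer-side /etc/wireguard/wg0.conf sketch -- only internal destinations ride the tunnel.
cat > /etc/wireguard/wg0.conf <<'EOF'
[Interface]
PrivateKey = <peer-private-key>
Address    = 10.8.0.2/24

[Peer]
PublicKey  = <o1-public-key>
Endpoint   = <o1-public-ip>:51820
# Split tunnel: route only the WG subnet and the internal ranges, not 0.0.0.0/0.
AllowedIPs = 10.8.0.0/24, 10.0.0.0/16
PersistentKeepalive = 25
EOF
```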


KI-010 — [HIGH] Tailscale (the actual admin path) is on a personal account

What: SSH access to E2 and O1 has been closed at the OCI ingress on port 22. The only admin path to those servers today is via Tailscale, which is registered on Vishnu's personal account, not a company entity.

Why it matters: the production admin path depends on a personal SaaS account. Risks: - Vishnu off-boarding takes the tailnet with them. - Personal billing / payment lapse = ops loses access. - Tailscale free tier device cap (100); ACLs tied to the personal owner. - Other admins (Bradley, Tracey) only have access if Vishnu shares. - E1 (Caddy reverse-proxy VPS) is not yet on the tailnet — its port 22 is still open to the public internet, partly defeating the rest of the lock-down.

Phase-2 actions: 1. Move to Headscale (self-hosted) on a dedicated low-cost host, or 2. Buy a company Tailscale subscription under an ops@ corporate identity. 3. Add E1 to whatever the chosen tailnet becomes; close E1's port 22 on the OCI side. 4. Document the break-glass path (provider console SSH) in case the tailnet itself is unavailable.


KI-011 — [HIGH] All Oracle ADBs have publicly reachable endpoints

What: Free Tier ADBs do not support private endpoints — every ADB (E3, E4, E5, O2, O3) exposes its ORDS / SQL*Net endpoints on the public internet. Authentication is via wallet (mTLS) plus schema credentials.

Why it matters: - The wallet+password is the only thing between the public internet and our paying-customer database (E5 — Parallax). A wallet leak from a compromised host = direct DB access from anywhere. - apex-ur.projecteidos.com and similar APEX-builder URLs expose login pages publicly; brute-force traffic is not theoretical. - Free Tier ADBs also lack VCN integration → can't be put behind Wireguard / NSG rules.

Phase-2 actions: 1. Upgrade at minimum the Parallax (E5) and Fourway (E4) ADBs to a paid tier that supports private endpoints in a VCN. 2. Restrict ADB access to specific source IPs (the E1, E2, O1 hosts and admin Tailscale exit nodes). 3. Audit wallet locations on every host that has one. Move into Vault. 4. Enforce MFA at the APEX login layer where supported.
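A sketch of action 2 via the OCI CLI. The parameter name is taken from the whitelistedIps API field and should be confirmed against the installed CLI version (`oci db autonomous-database update --help`); the OCID and the third IP are placeholders:

```bash
# Restrict an ADB's public endpoint to known source IPs (E1, E2 and O1 egress addresses).
# Flag name is an assumption based on the whitelistedIps API field -- verify before running.
oci db autonomous-database update \
  --autonomous-database-id ocid1.autonomousdatabase.oc1..example \
  --whitelisted-ips '["140.238.97.163","145.241.230.130","<o1-public-ip>"]'
```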


KI-012 — [HIGH] Public exposure of internal admin UIs

What: Per Vishnu, most app URLs are public — including admin dashboards (vault.448.global, portainer.448.global, monitor.448.global, coder.448.global, n8n, etc.). The hardening plan is acknowledged but not yet executed.

Why it matters: the admin UI of any one of these systems is a high-value target. AI-driven scanning probes them continuously. A single weak credential or unpatched CVE on any of them can give an attacker leverage over the whole estate.

Phase-2 actions: 1. Put every admin UI behind either Wireguard / Tailscale or Authentik forward-auth — single sign-on at the proxy layer. 2. Where forward-auth isn't viable, enforce per-app SSO with MFA (most of these apps support OIDC). 3. Add IP allow-lists at the Caddy layer for the admin paths even when the public path is open. 4. Rate-limit /wp-admin/, /users/sign_in, APEX login, Vault UI at the Caddy level.
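A Caddyfile sketch of actions 1 and 3 for a single admin hostname, following Authentik's documented Caddy forward-auth pattern and KI-001's suggested repo path. The outpost address, copied headers, upstream port, and allowed CIDRs are assumptions:

```bash
# Sketch: protect one admin hostname with Authentik forward-auth plus an IP allow-list.
cat >> infra/caddy/O1.Caddyfile <<'EOF'
portainer.448.global {
    # Only the tailnet / WireGuard ranges may even reach the app (CIDRs are assumptions).
    @blocked not remote_ip 100.64.0.0/10 10.8.0.0/24
    respond @blocked 403

    # Authentik forward-auth: every request is checked against the outpost first.
    forward_auth localhost:9000 {
        uri /outpost.goauthentik.io/auth/caddy
        copy_headers X-Authentik-Username X-Authentik-Groups X-Authentik-Email
    }

    reverse_proxy localhost:9443
}
EOF
```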


KI-013 — [MEDIUM] OCI Security Lists are manually managed

What: Both tenancies' OCI security lists / NSGs are edited by hand in the console. No Terraform, no review trail, no diff.

Why it matters: changes are unauditable; rollback is "remember what it was before"; drift between intended and actual state is invisible until something breaks.

Phase-2 actions: 1. Capture current state with oci network CLI / Terraform import. 2. Move to a Terraform module per tenancy with PRs in the engineering repo. 3. Apply via CI (Dokploy or GitLab pipelines) so changes are reviewed and recorded.
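A sketch of action 1: dump the current state for review, then bring one existing security list under Terraform so future changes go through PRs. The compartment OCID, resource name, and security-list OCID are placeholders:

```bash
# 1. Dump the current security lists for review / diffing.
oci network security-list list \
  --compartment-id ocid1.compartment.oc1..example \
  --all --output json > seclists-eidosdev1.json

# 2. Import an existing security list into Terraform state (resource must be declared first).
terraform import oci_core_security_list.e1_ingress ocid1.securitylist.oc1..example
```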


KI-014 — [MEDIUM] E1 still has SSH open to the public internet

What: E1 (the EIDOSDev1 Caddy proxy VPS) has port 22 open on its OCI security list — unlike E2 and O1 which have it closed. There's no Tailscale on E1 either.

Why it matters: SSH on a public IP is a brute-force target. Every public-facing host that doesn't need port 22 reachable should close it.

Phase-2 actions: 1. Add E1 to the tailnet (whichever account ends up holding it post-KI-010). 2. Close OCI port 22 on E1. 3. As an interim, restrict 22 to specific source IPs and enforce key-only auth + fail2ban.


KI-015 — [MEDIUM] Workflows / Caddyfiles / Dockerfiles all live "on host only"

What: Beyond the Caddyfiles (KI-001) and the SQLcl Dockerfile (KI-003), other operationally-critical artifacts likely live only on hosts and not in source control: - n8n workflow definitions (production CI/CD pipelines) - Dokploy app configs (env vars, build settings) - Authentik configuration (providers, applications, blueprint files) - WG-Easy peer configurations

Why it matters: rebuilding any of these from scratch after a host loss is laborious and error-prone. The pattern is the same as KI-001 — recover via known-good Git source, not memory.

Phase-2 actions: 1. Inventory every "config that lives only on host" artifact. 2. Commit each into the engineering repo with a sync mechanism (init script / CI deploy). 3. Validate every change against schema before deploy.


KI-016 — [HIGH] No backup has ever been restore-tested

What: Every backup mechanism in the estate (OCI block-volume snapshots on E2 and O1, Oracle automated ADB backups, Dokploy automatic backups to OCI bucket) has never been restored as a verification.

Why it matters: an untested backup is a hypothesis. Every recovery scenario for the company today is uncalibrated — recovery time and recovery success are both unknown.

Phase-2 actions: 1. Schedule a quarterly cold-restore drill for each Tier-0 / Tier-1 surface. 2. Document each restore procedure in infra/runbooks/. 3. Track restore-test outcomes (date, duration, what worked, what didn't) in backups.md.


KI-017 — [HIGH] No backup configured for Authentik, Vault, or n8n

What: Three Tier-0 / Tier-1 surfaces on O1 — Vault (every secret in the estate), Authentik (the SSO IDP), and n8n (production CI/CD pipelines + workflow integrations) — have no application-level backups. Only the host snapshot of O1 backs them up.

Why it matters: if O1's host snapshot is corrupted or fails to restore, every secret, every SSO config, and every CI/CD workflow disappears. Vault's loss is unrecoverable.

Recovery runbooks if this fires today: RB-002 (O1 disaster recovery), RB-003 (Vault recovery — Path C is unrecoverable today without a snapshot).

Phase-2 actions: 1. Vault: configure vault operator raft snapshot save on a schedule; ship snapshots to OCI bucket and an off-OCI destination. Document the unseal+restore procedure. 2. Authentik: Postgres dump + media volume backup; ship to OCI bucket. Capture the Authentik secret key separately. 3. n8n: workflow exports + Postgres backup; capture N8N_ENCRYPTION_KEY in Vault.
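A minimal sketch of the Vault half of this, assuming Raft integrated storage and that VAULT_ADDR / VAULT_TOKEN are supplied by the environment (e.g. a systemd unit); the bucket name and paths are placeholders:

```bash
#!/usr/bin/env bash
# vault-snapshot.sh -- nightly Raft snapshot shipped to an OCI bucket.
set -euo pipefail

SNAP="/var/backups/vault/vault-$(date +%F).snap"
mkdir -p "$(dirname "$SNAP")"

vault operator raft snapshot save "$SNAP"

# Ship off-host; an additional off-OCI copy should follow the same pattern.
oci os object put --bucket-name vault-backups \
  --file "$SNAP" --name "$(basename "$SNAP")" --force
```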


KI-018 — [HIGH] GitLab has no scheduled backup

What: GitLab on E2 (Dokploy) has no gitlab-backup create schedule. Source code, CI variables, container registry, and issues are all backed only by the E2 host snapshot.

Why it matters: GitLab restore from a host snapshot alone is brittle — Omnibus / Docker-based GitLab restores need both the data backup and gitlab-secrets.json. Without gitlab-secrets.json, encrypted CI variables can't be decrypted on the restored instance.

Phase-2 actions: 1. Schedule gitlab-backup create daily. 2. Capture gitlab-secrets.json separately on every backup run. 3. Ship both to OCI bucket and an off-OCI destination. 4. Restore-test to a parallel test instance and verify CI variables decrypt.
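A sketch of actions 1-3 assuming an Omnibus-style GitLab container named "gitlab" with the default backup path; container name, paths, and bucket name are assumptions:

```bash
#!/usr/bin/env bash
# gitlab-backup.sh -- nightly application backup plus the secrets file a restore needs.
set -euo pipefail

STAMP=$(date +%F)

# Application data backup (repos, DB, registry, uploads).
docker exec gitlab gitlab-backup create STRATEGY=copy

# gitlab-secrets.json is NOT included in gitlab-backup output; capture it separately.
docker cp gitlab:/etc/gitlab/gitlab-secrets.json "/var/backups/gitlab/gitlab-secrets-$STAMP.json"

# Ship the newest backup archive and the secrets file off-host.
LATEST=$(docker exec gitlab sh -c 'ls -t /var/opt/gitlab/backups/*.tar | head -1')
docker cp "gitlab:$LATEST" /var/backups/gitlab/
oci os object put --bucket-name gitlab-backups --file "/var/backups/gitlab/$(basename "$LATEST")" --force
oci os object put --bucket-name gitlab-backups --file "/var/backups/gitlab/gitlab-secrets-$STAMP.json" --force
```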


KI-019 — [MEDIUM] Free ADBs have 60-day backup retention but cannot be restored

What: Oracle's automated 60-day backup applies to all 5 ADBs (E3, E4, E5, O2, O3), but Free Tier ADBs cannot be restored. The backup exists; the recovery does not.

Why it matters most: E4 (Fourway TnE Connect tenant) is a paying-customer system on a Free ADB. A corruption event there is unrecoverable. E3 (our own Eidos tenant) and O2/O3 (internal dev) are less critical but suffer the same limitation.

Phase-2 actions: 1. Upgrade E4 (Fourway) to Paid Tier ADB. This is the single highest-leverage Phase-2 fix for paying-customer durability. 2. Decide whether to upgrade E3 (Eidos tenant) — internal use softens the urgency. 3. Maintain the keep-alive query stop-gap (also helps with auto-pause KI-006).


KI-020 — [MEDIUM] No monitoring on production ADBs

What: Beszel doesn't directly monitor Oracle ADBs (it's a host-metrics tool). There is no other monitoring on E3, E4, E5, O2, O3 — pause events, performance degradation, replication lag, login failures all go unnoticed.

Why it matters: the paying-customer Parallax ADB (E5) has zero observability today. We learn about problems from users.

Phase-2 actions: 1. Use OCI's built-in Database Management / Operations Insights to publish ADB metrics. 2. Forward ADB events to Gotify / Slack via OCI Notifications. 3. Add a synthetic check that runs a query against each ADB every 5 minutes and alerts on failure.
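A sketch of the synthetic check in action 3, assuming SQLcl plus the ADB wallet on the probing host and a Gotify instance for alerting; the wallet path, connect string, Gotify hostname, and token variable are all assumptions:

```bash
#!/usr/bin/env bash
# adb-synthetic-check.sh -- run every 5 minutes from cron; alert Gotify on failure.
set -uo pipefail

CONN="monitor_user/${MONITOR_PASS}@parallax_low"
WALLET="/opt/wallets/parallax_wallet.zip"

if ! echo "select 1 from dual;" | timeout 60 sql -S -cloudconfig "$WALLET" "$CONN" >/dev/null 2>&1; then
  curl -s -X POST "https://gotify.448.global/message?token=${GOTIFY_TOKEN}" \
    -F "title=ADB check failed" \
    -F "message=Synthetic query against E5 (Parallax) failed at $(date -u +%FT%TZ)" \
    -F "priority=8" >/dev/null
fi
```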


KI-021 — [MEDIUM] No external uptime monitoring

What: No external uptime check (UptimeRobot, BetterUptime, Healthchecks.io, Pingdom) on any customer-facing URL. Failures are detected only after Beszel notices a host is down or a user reports.

Phase-2 actions: 1. Pick an external uptime tool (Healthchecks.io has a generous free tier; UptimeRobot is the classic). 2. Monitor at minimum: parallax.projecteidos.com, fourway.tneconnect.app, eidos-global.tneconnect.app, tneconnect.app, projecteidos.com, eidos-global.com, the 3 CRMs, auth.448.global, vault.448.global. 3. Route alerts to a separate channel from Gotify so a Gotify outage doesn't swallow alerts.


KI-022 — [MEDIUM] Alert delivery is broken (Gotify lost, Beszel alerts not configured)

What: Gotify was initially set up but is no longer in active use ("not used now / lost"). Beszel has no alert rules configured. Today, no automated alert reaches a human.

Why it matters: monitoring without alerting is dashboards no one looks at. Today's setup is "wait for users to complain or notice in passing".

Phase-2 actions: 1. Decide alert path: rebuild Gotify, or move to email + Slack (Microsoft Teams via webhook, since corporate identity is M365). 2. Configure Beszel alerts on disk-full, container-down, ADB-pause (via synthetic). 3. Wire OCI Notifications for tenancy-level events (cost spikes, scheduled maintenance, Free Tier reclaim). 4. Test the full alert chain end-to-end.


KI-023 — [HIGH] Ubuntu OS patches not applied (no auto, no manual)

What: None of the 3 VPSes (E1, E2, O1) have unattended-upgrades enabled, and manual apt update && apt upgrade has not been run regularly because of fear of breaking running apps.

Why it matters: all three VPSes accumulate kernel and package CVEs over time. The longer the gap, the riskier each future patch becomes (because more changes accumulate, increasing the chance one of them breaks something). This is precisely the trap that makes patching feel scary — and the only way out is a controlled cadence.

Phase-2 actions: 1. Enable unattended-upgrades in security-only mode on all 3 hosts (low-risk; just CVE patches). 2. Schedule a monthly maintenance window with announcement to users. 3. Take an OCI snapshot before the monthly upgrade (cheap rollback). 4. Subscribe to Ubuntu security mailing lists / OCI maintenance feeds. 5. Restart non-stateful containers on a schedule via Watchtower; reboot the host monthly during the maintenance window.
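A sketch of action 1 for one host. On Ubuntu the stock 50unattended-upgrades already restricts origins to ${distro_id}:${distro_codename}-security, so enabling the periodic job is usually enough; the mail address is an assumption:

```bash
# Enable security-only unattended upgrades on a VPS.
apt-get install -y unattended-upgrades

cat > /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF

# Optional local overrides: notify on every run, never auto-reboot outside the monthly window.
cat > /etc/apt/apt.conf.d/52unattended-upgrades-local <<'EOF'
Unattended-Upgrade::Mail "ops@projecteidos.com";
Unattended-Upgrade::Automatic-Reboot "false";
EOF
```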


KI-024 — [MEDIUM] TLS cert expiry not monitored

What: Cert renewal is left entirely to Caddy's automatic ACME flow. No external check watches certificate expiry (NotAfter) on any hostname. If Caddy fails silently (network blip, DNS API token expired, ACME rate limit), the team finds out from users when the lock icon goes red.

Why it matters most for tneconnect.app: the .app TLD is HSTS-preloaded — browsers refuse to connect to a host with an invalid cert, with no warning prompt. A lapsed cert there is an immediate hard outage.

Phase-2 actions: 1. Add cert-expiry checks to whatever external uptime tool is chosen for KI-021 (most do this for free). 2. Add a Beszel custom-check that runs openssl s_client -connect <host>:443 -servername <host> and parses NotAfter. 3. Log Caddy's renewal events to a centralized log.
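A sketch of the check in action 2; the hostname list is illustrative, and openssl's -checkend takes seconds (14 days = 1209600):

```bash
#!/usr/bin/env bash
# cert-expiry-check.sh -- warn if any listed hostname's cert expires within 14 days.
set -uo pipefail

HOSTS="tneconnect.app fourway.tneconnect.app vault.448.global auth.448.global"

for h in $HOSTS; do
  if ! echo | openssl s_client -connect "$h:443" -servername "$h" 2>/dev/null \
       | openssl x509 -noout -checkend 1209600; then
    echo "WARNING: certificate for $h expires within 14 days (or could not be read)" >&2
  fi
done
```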


KI-025 — [LOW] E2 has dual ingress: apex direct, subdomain via E1

What: E2 (Dokploy, public IP 145.241.230.130) is the single backend for all 9 PE-side apps. But it has two ingress paths with two different TLS issuers: - The 3 WordPress apex hostnames (projecteidos.com, eidos-global.com, tneconnect.app) have DNS pointing direct to E2; Traefik on E2 issues their certs (Let's Encrypt). - The 6 subdomain hostnames (bot., git., platform., crm.*) have DNS pointing to E1 (140.238.97.163); Caddy on E1 issues their certs (Let's Encrypt) and proxies into E2.

Why it matters: mostly architectural inconsistency rather than active risk. Implications: - Two cert-issuance flows mean two failure modes. - Operationally, two reverse proxies need to be kept in sync about which app is at which hostname. - Conversely, having the apexes point direct to E2 means WordPress availability isn't impacted if E1 dies (a small upside).

Discovered via: TLS-cert behaviour. 145.241.230.130 returns Traefik's self-signed cert for unconfigured hostnames; this is how E2 (Dokploy/Traefik) was identified versus E1 (Caddy at 140.238.97.163). Originally mis-framed in earlier doc revisions as a separate "WordPress server".

Phase-2 actions (low-priority): 1. Verify the IP-to-server mapping in OCI console. 2. Decide whether to standardize on a single ingress path (everything via E1 Caddy → E2, or everything direct to E2 Traefik). Standardization simplifies cert management; current split has the small advantage of partial E1-failure tolerance. 3. Document the choice in proxies.md.


KI-026 — [MEDIUM] No CAA records on any domain

What: None of the 4 owned domains has a CAA record published. CAA limits which Certificate Authorities can issue certs for a domain.

Why it matters: without CAA, any CA accepting a malicious validation can issue a valid cert for any host on the domain — broadens the supply-chain attack surface.

Phase-2 actions: 1. Add CAA 0 issue "letsencrypt.org" on all 4 domains. 2. If a commercial CA is in use anywhere, add it as an additional CAA 0 issue value. 3. Add CAA 0 iodef "mailto:security@<domain>" for incident notifications.
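A quick verification and target-record sketch for one domain; the CAA record format is flags / tag / value, and the iodef mailbox shown follows the pattern in action 3 (an assumption, not an existing address):

```bash
# Verify what is currently published (expect no answer today), then re-check after adding records.
dig +short CAA projecteidos.com

# Target records (added via GoDaddy DNS, record type CAA, flags = 0):
#   projecteidos.com.  CAA  0 issue "letsencrypt.org"
#   projecteidos.com.  CAA  0 iodef "mailto:security@projecteidos.com"
```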


KI-027 — [MEDIUM] Malformed SPF on projecteidos.com (double -all)

What: The TXT record at projecteidos.com is:

v=spf1 include:spf.protection.outlook.com include:_spf.wpcloud.com -all include:secureserver.net -all

There are two -all directives. Per RFC 7208, SPF evaluation stops at the first matching mechanism, so evaluators hit the first -all (after _spf.wpcloud.com) and never reach secureserver.net. The record as published does not express the intended union of all three sender sources.

Why it matters: mail sent legitimately from a secureserver.net-included sender (likely GoDaddy / WP Cloud transactional mail) may fail SPF and land in spam or be rejected. Customer-facing transactional email reliability suffers.

Phase-2 actions: 1. Rewrite the record as a single SPF policy:

v=spf1 include:spf.protection.outlook.com include:_spf.wpcloud.com include:secureserver.net -all
2. Validate with https://mxtoolbox.com/spf.aspx before / after. 3. Audit which senders are actually in use; remove obsolete includes.


KI-028 — [MEDIUM] DMARC p=none on projecteidos.com and 448.global

What: DMARC policy on projecteidos.com and 448.global is p=none — monitoring-only mode. The other two domains (eidos-global.com, tneconnect.app) are at p=quarantine, which is better.

Why it matters: at p=none, recipients see DMARC failures in the report but don't act on them. Spoofed mail purporting to be from projecteidos.com may still reach inboxes.

Phase-2 actions: 1. Confirm SPF + DKIM are clean on projecteidos.com (after fixing KI-027 and KI-029). 2. Move projecteidos.com to p=quarantine first, monitor reports for 2-4 weeks. 3. Move to p=reject once clean. 4. Same path for 448.global once DKIM is configured.


KI-029 — [MEDIUM] DKIM not configured at common selectors on any domain

What: Probing for DKIM TXT records at common selectors (default, selector1, selector2, google, mail, dkim, k1, s1, m1) returned nothing on any of the 4 domains. Microsoft 365 typically uses selector1 and selector2 — both absent.

Why it matters: without DKIM, recipients can't cryptographically verify that mail was actually authorized by the sending domain. DMARC enforcement (p=quarantine/p=reject) is much weaker without DKIM in the mix. This is also a deliverability issue (Gmail / M365 increasingly require DKIM for inbox placement).

Phase-2 actions: 1. In each M365 tenant, enable DKIM for the relevant domain (Microsoft 365 Defender → Email & Collaboration → Policies → DKIM). 2. Publish the two CNAMEs M365 generates (selector1._domainkey... and selector2._domainkey...) at GoDaddy DNS. 3. Verify with dig TXT selector1._domainkey.<domain> (or DoH equivalent) before flipping DKIM "Sign" to enabled.


KI-030 — [MEDIUM] Three separate Microsoft 365 tenants

What: DNS-level identifiers reveal at least three distinct M365 tenants: - projecteidos.com → MS=ms38993142 - eidos-global.com → NETORG20317550.onmicrosoft.com - tneconnect.app → NETORG20331173.onmicrosoft.com - 448.global → NETORGFT19859797.onmicrosoft.com (a 4th one)

Why it matters: - Each tenant has independent admin consoles, billing, security policies, conditional-access rules. - The Authentik upstream IdP (Azure AD) is presumably one of these tenants — users in the others can't sign in via Authentik unless explicitly invited as guests. - MFA / DLP / retention policies need to be configured separately per tenant; drift is inevitable. - Off-boarding requires touching every tenant. - Licensing is paid separately per tenant.

Phase-2 actions: 1. Capture the full tenant inventory: tenant ID, billing owner, license SKU mix, admin list, federation status with Authentik. 2. Confirm which tenant Authentik federates with today and what the cross-tenant story is. 3. Long-term: evaluate consolidation into a single tenant with *.<domain> as additional verified domains. Significant project but addresses the off-boarding + MFA-drift risks.


KI-031 — [HIGH] No MFA anywhere on the Parallax stack

What: Parallax — our paying-customer system serving ~40 users across 23 hospitality properties — has no MFA enforced at any layer: - End users log in with single-factor APEX local accounts. - Application admins (when UR's admins are eventually provisioned) will log in the same way. - APEX workspace admins (developers) — no MFA. - ADB ADMIN (database admin) — no MFA. - OCI EIDOSDev1 tenancy admins (Tracey, Bradley) — [INFO NEEDED] confirmed enforced or not.

Why it matters: every credential is one phish or password reuse away from full app or DB compromise. The blast radius is asymmetric — an end-user takeover affects one user, but a workspace-admin or ADB-ADMIN takeover affects all 23 properties' data.

Phase-2 actions: 1. Enforce MFA on the APEX workspace admin logins immediately (smallest user count, biggest blast radius). 2. Enforce MFA on the ADB ADMIN account. 3. Enforce MFA on OCI tenancy admins: EIDOSDev1 (Tracey + Bradley) and ORA448Global (Adam Pitt-Stanley as owner + Vishnu as admin). 4. Plan MFA for end-user APEX accounts — APEX supports OIDC; could federate via Authentik to inherit M365 MFA. Coordinate with UR. 5. Same applies to all 4 other APEX hosts (apex1.PE, apex2.PE, apex1.448, apex2.448) on the same principle.


KI-032 — [MEDIUM] Parallax has no CD pipeline; deploy is manual export-import

What: Parallax is developed in Coder workspaces against an APEX export checked into Git, then manually exported and restored into the prod workspace by hand. There is no CD pipeline, no automated deployment, no rollback mechanism beyond restoring an older Git revision and re-exporting.

Why it matters: - Manual deploys are slow and error-prone — exactly the moment when a new release is going out is when stress is highest. - No clean rollback: if a deploy breaks something, recovery requires re-running the same manual import with an older export. - No environment isolation today — pre-prod and dev environments don't yet exist (KI-005), so any test happens in prod. - A lost laptop / Coder workspace state mid-deploy could leave prod in an inconsistent state.

Phase-2 actions: 1. Stand up the pre-prod APEX workspace (RM-002 — currently scoped as a separate ADB, but per Vishnu the pragmatic interim is a separate workspace in the same ADB). 2. Build a CI/CD pipeline (likely n8n or GitLab CI) that can apply an APEX export from Git to a target workspace. 3. Document the promote-prod workflow (manual approval gate after pre-prod test). 4. Use the existing schema_replication/poc work to seed pre-prod with realistic data without copying PII (Parallax has none, so this is technically simpler than usual).
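A rough sketch of the import half of action 2, assuming SQLcl in the pipeline runner and APEX's APEX_APPLICATION_INSTALL API; the workspace name, schema, application id, export file, wallet path, and connect string are all placeholders — a starting point, not the promotion process itself:

```bash
#!/usr/bin/env bash
# apply-apex-export.sh -- install an APEX export from Git into a target workspace.
set -euo pipefail

EXPORT_FILE="apex/f100.sql"                       # export committed to Git by the dev workflow
TARGET_CONN="deploy_user/${DEPLOY_PASS}@preprod_low"
WALLET="/opt/wallets/preprod_wallet.zip"

sql -S -cloudconfig "$WALLET" "$TARGET_CONN" <<EOF
begin
  apex_application_install.set_workspace('PREPROD');
  apex_application_install.set_application_id(100);
  apex_application_install.set_schema('PARALLAX');
  apex_application_install.generate_offset;
end;
/
@${EXPORT_FILE}
EOF
```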


KI-033 — [LOW] Vault container down since 2026-05-01 (CAP_SETFCAP error) — RESOLVED 2026-05-06

Status: Resolved 2026-05-06.

What happened: The Vault container on O1 was failing to start for ~5 days with the error unable to set CAP_SETFCAP effective capability: Operation not permitted. The Vault data was intact on disk; the process simply would not start. Five days of CI/CD pipelines failing silently.

Root cause: Two compounding factors — 1. Watchtower auto-pulled a newer hashicorp/vault image whose entrypoint runs setcap on the binary. 2. The Vault image's Dockerfile sets USER vault (non-root); a non-root user cannot use CAP_SETFCAP even when granted in the container's bounding set, because Docker doesn't grant ambient caps to non-root by default.

Resolution: SKIP_SETCAP=true env var added to bypass the entrypoint's setcap step. Image pinned to hashicorp/vault:1.18.5. Watchtower opt-out label applied. Compose file reconstructed and committed to source control at infra/vault/docker-compose.yml. Full incident record at incidents/2026-05-01-vault-container-down.md.

Trade-off accepted: Vault now runs without mlockall(), so its memory pages can be swapped under pressure. Acceptable on Free A1 with low memory pressure; revisit if O1 is upgraded.

Follow-ups still open: - Take an ad-hoc Raft snapshot off-host (precursor to RM-013). - Apply the same Watchtower-exclusion pattern to other Tier-0 services on O1.


KI-034 — [MEDIUM] TnE Connect source code on Bitbucket, not on self-hosted GitLab

What: The TnE Connect (workforce) product source repo lives at bitbucket.org/448_global/workforce.git, on Atlassian Bitbucket — outside our self-hosted GitLab estate. The Bitbucket account URL also includes a personal-username path (vishnukant1@bitbucket.org/...), suggesting the account may be personal-named rather than org-owned.

Why it matters: - Third-party dependency — the product's build / CI / source history depends on an external SaaS we don't control. - Inconsistent estate — every other custom app (Parallax) lives on git.projecteidos.com. Two source-of-truth platforms doubles the off-boarding workload, doubles the credential surface, and confuses contributors. - Bus factor on the Bitbucket account — if the account is personal-named, a Bitbucket account compromise or off-boarding event removes our access to the product source.

Phase-2 actions: 1. Confirm Bitbucket account ownership (organization vs personal-named). 2. Migrate the repo to git.projecteidos.com/<group>/workforce (preserve full history). 3. Update the auto-branch / Coder-workspace / CI/CD pipeline references to the new URL. 4. Decommission the Bitbucket repo (or leave as a read-only mirror). 5. Until migration: ensure the Bitbucket repo is mirrored at least daily into a backup bucket.


KI-035 — [HIGH] TnE Connect tenants hold heavy PII on Free Tier infra

What: Both TnE Connect tenants (Fourway and Eidos Global) are workforce-management systems holding heavy personal data — employee names, contact details, hours worked, schedules, manager hierarchies. The Fourway tenant is a paying customer (~150 users at ~£5,000/year) and the Eidos tenant holds our own staff records (~30 users). Both run on Free Tier Oracle Autonomous Databases that:

  • Cannot be restored even though they are backed up (Free Tier limitation, see KI-019).
  • Have no formal SLA from Oracle.
  • Have no off-host backup yet.
  • Have no cross-region DR.

Why it matters: - GDPR / UK DPA scope — PII at this volume puts the systems firmly in scope. We owe Fourway and our own staff the data-subject rights process (access, deletion, portability), and a credible recovery plan. None exists today. - Reputational risk — a data-loss event on Fourway is the kind of thing that ends customer relationships. - Strategic risk — we are betting big on TnE Connect as a growth product (RocketSaas marketing engaged, https://tneconnect.app/ relaunched). The runtime infra needs to match the ambition before more tenants land.

Phase-2 actions: 1. Upgrade Fourway tenant ADB to paid tier (already on the roadmap as RM-001). Re-prioritize as critical given KI-036 below. 2. Decide whether to upgrade the Eidos tenant ADB too — internal data is no less sensitive even if we own the relationship. 3. Conduct a Data Protection Impact Assessment (DPIA) on the TnE Connect product family. 4. Add off-host backup of the workforce schemas (parallel work to RM-013/014/015).


KI-036 — [HIGH] Oracle 19c → 26ai migration required on TnE Connect ADBs, no restorable backup

What: Both TnE Connect ADBs (Fourway APEX2 and Eidos EIDOSDev) currently run Oracle Database 19c. Oracle is asking us to schedule migration to 26ai (the new version line). The migration is mandatory on Oracle's timeline — eventually 19c support / Free Tier hosting will be retired.

Why this is dangerous in our current state: - Major-version DB migrations carry a non-zero risk of data corruption, query-plan regressions, or PL/SQL incompatibility. - The standard mitigation is to clone to a parallel ADB, migrate the clone, validate, and only then cut over — preserving the original as rollback. - Free Tier ADBs cannot be restored (KI-019). If the in-place migration fails or corrupts data, there is no rollback path. - The Fourway tenant is a paying customer with ~150 users and heavy PII (KI-035).

Phase-2 actions (sequencing matters): 1. First, upgrade Fourway tenant to paid ADB (RM-001) — gives us restore + clone capability. 2. Second, take a full data-pump export of both tenants, plus Vishnu's in-progress schema-cloning script (apps/03 §10) — adds a second recovery path independent of Oracle's restore mechanism. 3. Third, perform the 19c → 26ai migration on a clone first, validate the TnE Connect APEX app against it, then cut over. 4. Before any of this, never let Oracle proceed with an in-place migration on the Free Tier ADBs as-is — the no-rollback risk is too high for the paying-customer tenant.

This KI is on the critical path for the next Phase-2 conversation with Oracle.


KI-037 — [HIGH] Authentik runs :latest with Watchtower auto-updating (Vault déjà vu)

What: Authentik — the company's SSO IdP and a Tier-0 system — runs the goauthentik/server:latest image with no com.centurylinklabs.watchtower.enable=false label to opt out of Watchtower auto-updates. This is the identical configuration that took Vault down for 5 days starting 2026-05-01 (KI-033).

Why it matters: - A breaking image change in goauthentik/server:latest would cascade exactly the same way: Watchtower silently pulls, container restarts, the new image fails to start (capability change, env-var change, Postgres-schema mismatch, anything), and Authentik is down. This time the impact is broader than Vault's because all 15 OIDC clients lose login capability simultaneously. - Active sessions continue until token TTL expires — buys minutes to hours of grace, then everything breaks. - The fix takes 5 minutes and prevents the entire failure mode.

Phase-2 actions (do this within Wave 1): 1. Pin the Authentik image to the currently-running version: goauthentik/server:2025.2.3. 2. Add the Watchtower-exclusion label com.centurylinklabs.watchtower.enable=false to the Authentik container. 3. Capture the Authentik compose file in Git at infra/authentik/docker-compose.yml (same pattern as infra/vault/docker-compose.yml after KI-001 fix), so we don't repeat the "compose-file-was-in-Portainer-and-got-lost" trap. 4. While there, take an immediate Postgres + media + secret-key backup off-host (precursor to RM-014) — same playbook as the post-Vault snapshot. 5. Audit all other Tier-0 services on O1 (MinIO, n8n, anything else from the per-app inventory that auto-restarts after Watchtower pulls) for the same pattern.
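A minimal compose sketch of the pin and exclusion label from actions 1-3. The image tag and label come from this KI; the service name, restart policy, env handling, and the omitted Postgres/Redis services are assumptions to be filled in from the running container:

```bash
# Sketch of the fragment to commit at infra/authentik/docker-compose.yml.
cat > infra/authentik/docker-compose.yml <<'EOF'
services:
  authentik-server:
    image: goauthentik/server:2025.2.3        # pinned, not :latest
    restart: unless-stopped
    labels:
      - "com.centurylinklabs.watchtower.enable=false"   # opt out of Watchtower auto-pulls
    env_file: .env
EOF
```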

This KI is sitting in production and could fire any night Watchtower runs. Treat as urgent.


KI-038 — [HIGH] E1 and E2 have no block-volume backups

What: Of the three OCI compute instances, only O1 has a snapshot policy configured (weekly + monthly + yearly incremental, ~£15/month). E1 (Caddy reverse proxy) and E2 (Dokploy host running 9 PE-side apps including GitLab + 3 WordPress sites + 3 Twenty CRMs + Teams Bot) have no automated block-volume backups at all.

Why it matters: - A scheduled OCI maintenance event on E1 already caused a real outage in April 2026 (the Caddyfile incident, KI-001). Without a snapshot, recovery was a hand-rebuild from memory. - E2 is the highest-blast-radius PE-side host: GitLab, three brand-facing WordPress sites, three CRM databases, Teams Bot, Dokploy itself. Loss of E2 with no snapshot = hand-rebuild of nine applications and an unknown amount of CRM customer-pipeline data and GitLab history. - O1's snapshot policy is the model — it should be replicated across E1 and E2 (the cost is small and consistent: ~£15/month per host).

Phase-2 actions: 1. Apply the same backup policy as O1 to E1 and E2 in OCI: weekly + monthly + yearly incremental, 4w / 12m / 5y retention. 2. Verify with oci bv backup-policy-assignment list --asset-id <volume-ocid> that the policy is attached. 3. Once configured, fold a restore drill into the RM-017 quarterly schedule.


KI-039 — [MEDIUM] Vault unseal-key threshold is effectively 1, not 3 (every holder has all 5 shares)

What: Vault was initialised with a 5-share / 3-threshold Shamir key split. The intended design is that no individual can unseal Vault alone — three of five holders must collaborate. In practice every one of the four current holders (Vishnu, Stacy Carpenter, Adam Pitt-Stanley, Bradley Leggett) holds all 5 shares, so any single holder can unseal alone.

Why it matters: - The threshold is the security control. A single account compromise on any one of the four holders gives the attacker full unseal capability (the entire Vault becomes readable to whoever holds the keys). - The threshold is also the resilience control: if it were truly 3-of-5 with one share each, three independently-failing holders breaks recovery. The current "everyone has everything" pattern is the inverse — operationally easier, security-wise no better than no Shamir at all.

Why the current pattern exists (not unreasonable): there are only four custodians for five shares, so a strict one-share-per-person split leaves recovery hanging on most of a small group being reachable at once. Holding all shares per person is the operational compromise.

Phase-2 options: 1. Recommended: add a fifth holder, then run vault operator rekey and distribute one share per person. Restores the intended threshold-of-3 security posture. 2. Alternative: keep the current pattern but tighten the personal storage discipline — each holder stores their copy in a locked password manager with a strong unique passphrase + MFA, so a single account compromise is harder. 3. Whichever option: rotate the Shamir keys after migration so the all-five-shares-everywhere copies are revoked.
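A sketch of option 1 once a fifth holder exists — vault operator rekey regenerates the Shamir shares, which also revokes the existing all-shares-everywhere copies (share counts mirror the current 5/3 design):

```bash
# Start a rekey: 5 new shares, threshold 3. Existing unseal keys stay valid until it completes.
vault operator rekey -init -key-shares=5 -key-threshold=3

# Each current holder then supplies one existing unseal key, using the nonce printed by -init,
# until the current threshold (3) has been met:
vault operator rekey -nonce=<nonce-from-init>   # prompts for an unseal key; run from 3 holders

# Vault prints 5 new shares -- hand exactly one to each of the 5 holders, then destroy old copies.
```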


KI-040 — [LOW] projecteidos.com SPF includes a Brevo verification of unknown purpose

What: The projecteidos.com apex TXT record contains brevo-code:12c21e5857bdd32b3b2dffbbbf0ef484 — a verification key for Brevo (transactional email / marketing-mailing service). When asked, no one currently on the team knows what Brevo was set up for or who is using it.

Why it matters: - If Brevo is in active use, the relevant credentials and account ownership need to be in our systems (Vault) and known to the team. - If Brevo is not in use, the SPF / verification entries should be removed; an orphan account is a leak risk. - It also surfaces the broader concern that DNS-level integrations have been added historically without being recorded anywhere the current team can find them.

Phase-2 actions: 1. Check whose name the Brevo account is in (likely findable from the GoDaddy email + a password reset, or from the Brevo verification record metadata). 2. If active: capture the credential into Vault, document the use-case in external-saas.md. 3. If inactive: delete the Brevo SPF entry from projecteidos.com (also helps clean up KI-027) and close the account.


KI-041 — [MEDIUM] Authentik OIDC callback errors on first sign-in (refresh recovers)

What: When a user signs into any application that uses auth.448.global as its OIDC identity provider (GitLab, Vault, Portainer, n8n, MinIO, etc.) the flow proceeds as expected up to the Authentik consent / authentication step. Authentik authenticates the user successfully, then attempts to redirect back to the application's callback URL — at which point the application displays an Authentik error. The session cookie has, however, been set successfully on the application's domain. Refreshing the original application URL completes sign-in normally — the user is logged in.

Symptom signature:

  1. User clicks "Sign in with Authentik" (or equivalent) on e.g. git.projecteidos.com.
  2. Browser redirects to auth.448.global and authenticates against Microsoft Entra successfully.
  3. Authentik attempts to redirect back to git.projecteidos.com/users/auth/openid_connect/callback (or the relevant callback URL).
  4. The application displays an Authentik-branded error page (rather than the post-login dashboard).
  5. Workaround: the user reloads the original application URL (e.g. git.projecteidos.com) — sign-in completes; user is logged in.

Why it matters: - Confusing user experience — a user who hasn't been told the workaround will believe sign-in failed and may give up. - Affects every SSO-integrated application (15 OIDC clients today, Authentik provider list). - Erodes confidence in the SSO chain — particularly important when leadership starts using docs.eidos-global.com and other Authentik-gated surfaces.

Workaround (document and communicate): When the application shows the Authentik error after sign-in, reload the original application URL in the browser. The session cookie set during the OIDC handshake will be picked up and the user will be signed in. No need to re-authenticate.

Likely root causes (to investigate during the fix):

  1. redirect_uri mismatch — most common cause of OIDC callback errors. The callback URL the app sends in the authorization request must exactly match the value registered against the provider in Authentik (down to trailing slash, http vs https, port).
  2. Token-exchange failure — Authentik issues the authorization code, but the app's back-channel call to exchange the code for a token fails. Often a clock-skew issue between the app server and Authentik (look for "iat in the future" / "exp in the past" in app logs).
  3. HTTPS / cookie-flags issue — Authentik's session cookie may be SameSite=None which requires Secure=true. If anything in the chain is on plain HTTP, the cookie is dropped on the first redirect (but a subsequent same-origin reload reads the previously-set app session cookie).
  4. Authentik flow stage misconfiguration — the post-authentication redirect stage might be configured for an old or conflicting URL.

Diagnosis steps to take:

  1. Authentik admin → Events → Logs: filter on the affected user / time window — look at the actual stage of the flow that errors. Authentik usually logs the exact reason (e.g. redirect_uri_mismatch).
  2. Browser DevTools → Network tab: capture the failing callback request. Inspect the URL, query parameters (code, state), and the response. Compare the URL to the provider's configured callback in Authentik admin.
  3. App-side logs (e.g. GitLab production.log, Vault server log): look for OIDC token-exchange errors at the timestamp of the failure.
  4. Time skew: confirm auth.448.global host's clock is in sync (timedatectl status); same for the app host.
  5. Test on a single provider in isolation: pick the simplest OIDC client (e.g. Portainer or PE Tube) and see if the error reproduces — narrows whether it's app-specific or Authentik-wide.
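A small sketch supporting diagnosis steps 2 and 4: pull each provider's registered redirect URIs from Authentik's API for comparison against what the app actually sends, and spot-check clock skew. The API token, the /api/v3/providers/oauth2/ path, and the host names are assumptions — verify the path against the API browser on the running Authentik version:

```bash
# List each OAuth2 provider's registered redirect URIs (AUTHENTIK_TOKEN is an API token
# created in the Authentik admin UI; endpoint path is an assumption to verify).
curl -s -H "Authorization: Bearer ${AUTHENTIK_TOKEN}" \
  "https://auth.448.global/api/v3/providers/oauth2/" | jq '.results[] | {name, redirect_uris}'

# Rough clock-skew check between the IdP host and an app host (diagnosis step 4).
for host in o1 e2; do
  ssh "$host" 'echo "$(hostname): $(date -u +%FT%TZ)"; timedatectl show -p NTPSynchronized'
done
```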

Phase-2 actions:

  1. Diagnose using the steps above; identify which root cause applies.
  2. Fix at the source (most likely path: align redirect_uri in the Authentik Provider config with what the app actually sends).
  3. Once fixed on one provider, validate against the other 14.
  4. Communicate the workaround to all current Authentik users in the meantime.

Severity legend

  • HIGH — active recent incident or imminent risk
  • MEDIUM — systemic exposure that needs planning but no immediate emergency
  • LOW — known cosmetic / low-impact issue