# IT Estate Overview & Modernization Roadmap
Leadership: the top half of this document gives a high-level view of our IT infrastructure and applications — what we own, where it runs, who uses it, and where we are improving. Engineering & AI agents: the bottom half is a structured action register with stable IDs, target states, and verification commands. See AI agent guidance for how agents should consume it.
## Our IT estate

### The four domains we operate

We own and operate four internet domains, each with a distinct purpose:

```mermaid
graph TB
    subgraph PUB[Customer & corporate brands]
        PE["projecteidos.com<br/>Project Eidos corporate brand"]
        EG["eidos-global.com<br/>Eidos Global corporate brand + UK/IN sales"]
        TNE["tneconnect.app<br/>TnE Connect product (workforce SaaS)"]
    end
    subgraph INT[Internal infrastructure]
        G448["448.global<br/>Internal tools, identity, secrets, monitoring"]
    end
```
| Domain | Purpose | Audience |
|---|---|---|
| projecteidos.com | Corporate brand front-door (currently redirects to eidos-global.com); home of our development platform, source control, and the Parallax client product | Customers + staff |
| eidos-global.com | Eidos Global corporate website + the UK and India CRM systems for our sales operations | Public + staff |
| tneconnect.app | The TnE Connect workforce-management SaaS — the marketing site, the customer CRM, and the per-client tenants (Fourway and our own) | Paying customers + our staff |
| 448.global | The internal back-office estate — staff identity, secret storage, monitoring, automation, internal tools | Staff only |
### What runs on each domain

```mermaid
graph LR
    subgraph PE["projecteidos.com"]
        PE1[Parallax<br/>UR client product]
        PE2[GitLab<br/>source control]
        PE3[Dokploy<br/>app platform]
        PE4[Teams Bot]
    end
    subgraph EG["eidos-global.com"]
        EG1[Corporate WordPress]
        EG2[Twenty CRM UK]
        EG3[Twenty CRM India]
    end
    subgraph TNE["tneconnect.app"]
        TNE1[Marketing WordPress]
        TNE2[Twenty CRM TnE]
        TNE3[Fourway tenant<br/>paying customer]
        TNE4[Eidos tenant<br/>our own staff]
    end
    subgraph G448["448.global"]
        G1[Identity & secrets<br/>Authentik · Vault]
        G2[Storage<br/>MinIO · PE Tube]
        G3[Operations<br/>Portainer · Beszel · Gotify · Wireguard · Watchtower]
        G4[Productivity<br/>n8n · Coder · Open WebUI · Draw.io · IT Tools]
        G5[Internal dev DBs<br/>APEX1 · APEX2]
    end
```
### Apps grouped by business function
| Function | Count | Examples |
|---|---|---|
| Customer-facing products | 5 | Parallax (UR), TnE Connect tenants × 2, three CRM instances |
| Customer-facing brand sites | 3 | The three WordPress sites |
| Engineering & development | 4 | GitLab, Dokploy, Coder, Teams Bot |
| Identity, secrets & networking | 4 | Authentik (SSO), Vault (secrets), Wireguard (VPN), Microsoft 365 |
| Storage & data | 3 | MinIO (S3-compatible), PE Tube (video), 5 Oracle databases |
| Monitoring & ops | 4 | Portainer, Beszel, Gotify, Watchtower |
| Internal productivity | 5 | n8n, Open WebUI, Draw.io, IT Tools, internal dev databases |
| Total | 32 apps | Full inventory in README app index |
### Where it physically runs

```mermaid
graph TB
    Net((Public Internet))
    subgraph EIDOS["OCI tenancy: EIDOSDev1 (uk-london-1)"]
        E1["E1 — Caddy proxy<br/>1 vCPU / 6 GB · Free tier<br/>140.238.97.163"]
        E2["E2 — Dokploy host<br/>3 vCPU / 18 GB · Free tier<br/>145.241.230.130<br/>9 PE-side apps incl. WordPress"]
        E5[("E5 — Paid Oracle DB<br/>Parallax UR client")]
        E3[("E3 — Free Oracle DB<br/>Eidos workforce tenant")]
        E4[("E4 — Free Oracle DB<br/>Fourway workforce tenant")]
    end
    subgraph G448["OCI tenancy: ORA448Global (uk-london-1)"]
        O1["O1 — All-in-one<br/>Ampere A1 · Free tier<br/>140.238.90.91<br/>15 internal-tools apps"]
        O2[("O2 — Free Oracle DB<br/>internal dev")]
        O3[("O3 — Free Oracle DB<br/>internal dev")]
    end
    subgraph EXT["External services"]
        GD["GoDaddy<br/>4 domain registrations<br/>+ Microsoft 365 mail"]
    end
    Net --> E1
    Net --> E2
    Net --> O1
    E1 -.->|internal| E2
    E2 --> E3
    E2 --> E4
    E2 --> E5
    O1 --> O2
    O1 --> O3
    GD -. DNS / mail .- Net
```
The footprint in numbers:
- 2 cloud accounts — both Oracle Cloud Infrastructure tenancies in London (`uk-london-1`).
- 3 servers — `E1` (a small reverse proxy), `E2` (the main application host), `O1` (the internal-tools host).
- 5 managed databases — one paid Oracle Autonomous Database for our paying client (Parallax/UR), four Free-Tier ones for everything else.
- 1 external provider — GoDaddy for domains and Microsoft 365 email.
- No other clouds, no other VPS providers, no on-premises kit.
### How people log in

```mermaid
graph LR
    User[Staff member] --> M365[Microsoft 365<br/>Entra / Azure AD]
    M365 --> Authentik[Authentik<br/>auth.448.global]
    Authentik --> Apps[GitLab · Vault<br/>+ other internal apps]
```
Staff sign in once with their corporate Microsoft 365 account. Microsoft passes the verified identity to Authentik (our single sign-on hub), which in turn lets them into internal applications like GitLab and Vault. This means a leaver's Microsoft 365 deactivation propagates everywhere, and there's a single place to enforce password policy and multi-factor authentication.
### Who runs it
| Person | Role |
|---|---|
| Tracey Weetman | Oracle Lead — primary Oracle relationship; admin of the EIDOSDev1 tenancy |
| Bradley Leggett | Database Administrator — operates the Oracle databases |
| Vishnu Kant | Solutions Architect — additional admin on the ORA448Global tenancy (Adam Pitt-Stanley is the tenancy owner) and day-to-day operator of the 32 applications |
## Where things stand operationally
We run the estate lean — the costs are dominated by one paid Oracle database for our paying UR client, with everything else on Oracle's Always Free tier. That is the right shape for where the business is today, but it carries some specific pressures we are actively working through.
### Current state at a glance
| Area | Status | Notes |
|---|---|---|
| Source control & code | Healthy | Self-hosted GitLab, daily working repo |
| Identity & SSO | Healthy | Microsoft 365 + Authentik chain in place |
| Customer products live & serving | Healthy | Parallax, the two TnE Connect tenants, the three CRMs all up |
| Backup coverage | Partial | Some systems backed up, others not yet — see plan below |
| Disaster-recovery rehearsal | Not yet | Restore-tests scheduled in the next two quarters |
| Monitoring & alerting | Partial | Visibility exists, alert delivery being rebuilt |
| Single-region resilience | One region (UK London) | Cross-region DR is on the medium-term roadmap |
| Patch hygiene | Catching up | Monthly maintenance windows being introduced |
A more detailed engineering view of every operational concern is maintained in infra/known-issues.md — that's the working register the team draws from when planning improvements.
### The few items that warrant board-level attention

These are the items where we are actively investing engineering time:

- **Resilience for the Fourway client tenant.** Our Fourway TnE Connect customer currently runs on Oracle's Always Free database tier, which carries no service-level guarantee. Upgrading this to a paid Oracle DB tier is the single biggest line item in the improvement plan — small monthly cost, large reduction in contractual risk.
- **Backup of identity & secret-storage systems.** The applications that hold our staff identity (Authentik) and our shared credentials (Vault) need formal off-host backup. We have begun this work.
- **Configuration in source control.** A recent brief outage during routine Oracle maintenance highlighted that some operationally critical configuration files live only on the host where they run. We are migrating these into our GitLab repo so that any host can be rebuilt cleanly.
- **Single-region posture.** Everything currently runs in Oracle Cloud's UK London region. A future investment is a second-region disaster-recovery copy for the customer-facing systems.
- **Centralized administration access.** Day-to-day administrative access to our servers is being moved onto a company-managed VPN setup, replacing arrangements that date from the company's early days.
### The improvement plan
A six-month plan organized in three waves. Each wave delivers a meaningful improvement in resilience or security without being so large that it disrupts ongoing work.
| Wave | Months | Theme | What changes for the business |
|---|---|---|---|
| 1 | Month 1 | Close the highest-impact gaps | Fourway tenant moves to paid Oracle DB tier; Vault & Authentik & GitLab gain off-host backup; configuration files in source control; cert-expiry alerts in place |
| 2 | Months 2–3 | Build the operational floor | Parallax production / pre-production isolation; monthly server maintenance windows; restore-test drills; external uptime monitoring; email-authentication enforcement |
| 3 | Months 4–6 | Modernize | Off-OCI backup destination; OCI identity federated with Authentik; admin interfaces gated; second-region DR feasibility; Microsoft 365 tenant rationalization |
The full action plan — 40 individual work items, each with target outcomes and verification steps — is captured in the action register below. It is designed so the engineering team and (in time) automated tooling can pick up items, execute them, and verify completion against unambiguous criteria.
### What this costs
The improvement plan is not capex-heavy. The estimated additions to monthly operating cost:
| Item | Estimate |
|---|---|
| Paid Oracle DB tier for the Fourway tenant | low-to-mid hundreds of GBP per month |
| Centralized VPN service (or self-hosted equivalent) | <£50 / month |
| External uptime monitoring | £0 (free tier) |
| Off-region backup storage | <£20 / month |
| Engineering time | ~30% of one engineer's time over six months |
## How the action register works

The lower half of this document is the action register — each improvement item gets a stable ID (RM-NNN), a structured metadata block, a target-state definition, and a verification step. This is the part the engineering team and AI agents work from. The structure is designed so:
- Humans can scan the register top-to-bottom, follow the wave plan, and assign owners.
- AI agents doing RAG over this repo can extract individual actions, look up affected servers / apps / known issues by ID, and (in some cases) execute the verification checks via the OCI CLI / `curl` / `git`. See AI agent guidance below.
### Stable ID schemes used across the repo

| Prefix | Meaning | Example |
|---|---|---|
| `RM-NNN` | Roadmap action (this document) | `RM-001` |
| `KI-NNN` | Known issue (`infra/known-issues.md`) | `KI-001` |
| `E1`, `E2`, `E3`, `E4`, `E5` | EIDOSDev1 OCI resources | `E2` = Dokploy VPS at 145.241.230.130 |
| `O1`, `O2`, `O3` | ORA448Global OCI resources | `O1` = the all-in-one VPS |
| `apps/NN-<slug>.md` | Per-app doc | `apps/01-parallax.md` |
### Workstreams
Eight themed workstreams. Each addresses a coherent slice of the operational debt.
| WS | Theme | Goal |
|---|---|---|
| WS-1 | Resilience for paying customers | Move client-facing systems off Free Tier and add proper backup + monitoring. |
| WS-2 | Source-of-truth in Git | Everything we'd need to rebuild lives in the engineering repo. |
| WS-3 | Backup & DR program | Every Tier-0/1 surface is backed up, ships off-host, and is restore-tested. |
| WS-4 | Identity & access | Tailscale off personal account; Vault is the secret canon; OCI federated. |
| WS-5 | Network hardening | Admin surfaces gated; ADB exposure reduced; Sec Lists in Terraform. |
| WS-6 | Email / sender authentication | SPF/DKIM/DMARC clean across all 4 domains. |
| WS-7 | OS / patch hygiene | Monthly maintenance windows; security-only auto-updates. |
| WS-8 | Monitoring & alerting | We learn of problems before customers do. |
### Wave plan

#### Wave 1 — Month 1 — "Stop the bleeding"
The set of actions where doing nothing for one more week is most embarrassing.
| ID | Action (short) | WS | Effort |
|---|---|---|---|
| RM-006 | Caddyfile (E1) → Git + sync mechanism | WS-2 | s |
| RM-007 | Caddyfile (O1) → Git + sync mechanism | WS-2 | s |
| RM-013 | Vault Raft snapshot job → OCI bucket | WS-3 | s |
| RM-014 | Authentik Postgres + media backup → OCI bucket | WS-3 | s |
| RM-015 | GitLab `gitlab-backup` schedule (incl. `gitlab-secrets.json`) | WS-3 | s |
| RM-001 | Upgrade E4 (Fourway tenant) Free ADB → Paid ADB | WS-1 | xs |
| RM-019 | Move Tailscale off personal account | WS-4 | s |
| RM-040 | Cert-expiry alerting on all customer URLs | WS-8 | xs |
| RM-029 | Fix projecteidos.com malformed SPF | WS-6 | xs |
| RM-005 | External uptime check on tneconnect.app (.app HSTS) | WS-8 | xs |
#### Wave 2 — Months 2–3 — "Build the floor"
| ID | Action (short) | WS | Effort |
|---|---|---|---|
| RM-002 | Parallax pre-prod → separate ADB (isolate from prod) | WS-1 | m |
| RM-003 | ADB-level monitoring (synthetic + OCI Operations Insights) | WS-8 | m |
| RM-017 | Quarterly cold-restore drill schedule | WS-3 | s |
| RM-021 | Bradley's personal Bitwarden → vault.448.global | WS-4 | s |
| RM-033 | `unattended-upgrades` security-only on E1, E2, O1 | WS-7 | xs |
| RM-034 | Monthly maintenance windows + snapshot-before-patch | WS-7 | s |
| RM-020 | Add E1 to tailnet; close OCI port 22 on E1 | WS-4 | xs |
| RM-038 | External uptime monitor on full customer URL set | WS-8 | xs |
| RM-039 | Revive Gotify or replace with Teams/Slack webhook | WS-8 | s |
| RM-027 | CAA records on all 4 domains | WS-5 | xs |
| RM-031 | DMARC p=quarantine → p=reject on PE + 448.global | WS-6 | s |
| RM-008 | SQLcl Dockerfile → Git + image to GitLab Registry | WS-2 | s |
| RM-009 | SQLcl container static internal IP (Docker network alias) | WS-2 | xs |
| RM-010 | n8n workflow exports → Git | WS-2 | s |
| RM-011 | Authentik blueprint → Git | WS-2 | s |
| RM-016 | n8n Postgres + workflow exports backup | WS-3 | s |
| RM-024 | Vault credential paths schema + rotation policy | WS-4 | s |
#### Wave 3 — Months 4–6 — "Modernize"
| ID | Action (short) | WS | Effort |
|---|---|---|---|
| RM-018 | Off-OCI backup destination (Backblaze B2 or similar) | WS-3 | s |
| RM-022 | OCI IAM ↔ Authentik OIDC federation | WS-4 | m |
| RM-025 | Admin UIs behind Wireguard / Authentik forward-auth | WS-5 | m |
| RM-026 | ADB private endpoints + IP allow-lists (where viable) | WS-5 | m |
| RM-028 | OCI Security Lists → Terraform | WS-5 | m |
| RM-035 | Dokploy app auto-update strategy (Watchtower-on-E2 vs Dokploy webhooks) | WS-7 | s |
| RM-032 | M365 tenant audit + consolidation plan | WS-6 | l |
| RM-030 | DKIM in all M365 tenants (publish CNAMEs) | WS-6 | s |
| RM-004 | Restore-test the Parallax E5 paid ADB (clone-restore drill) | WS-3 | s |
| RM-012 | WG-Easy peer config sync into Git | WS-2 | s |
| RM-023 | MFA audit on registrar accounts + cloud roots | WS-4 | s |
| RM-036 | Rotate registrar PAT into Vault | WS-4 | xs |
| RM-037 | Beszel agent on every host (verify coverage) | WS-8 | xs |
## Action register

Each action is defined by a YAML metadata block (machine-readable) followed by human-readable narrative.

Effort key: `xs` = hours, `s` = 1–2 days, `m` = 1–2 weeks, `l` = >2 weeks. Priority key: `critical` = paying-customer or Tier-0; `high` = significant operational risk; `medium` = important hygiene; `low` = polish.
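The metadata blocks are flat enough that an agent does not need a full YAML library to consume them. A minimal sketch of a parser (hypothetical helper; it handles only the `key: value` and `- item` layout shown in the register, not the nested `verification` entries):

```python
def parse_action_block(text: str) -> dict:
    """Parse a flat RM metadata block: 'key: value' lines plus '- item' lists."""
    action, current_list = {}, None
    for line in text.strip().splitlines():
        if line.startswith("- ") and current_list is not None:
            action[current_list].append(line[2:].strip())   # list entry
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if value == "":                                 # a list-valued key begins
                action[key], current_list = [], key
            else:
                action[key], current_list = value, None
    return action

sample = """id: RM-001
workstream: WS-1
priority: critical
preconditions:
- Cost approval secured
"""
print(parse_action_block(sample)["id"])  # RM-001
```

Anything fancier (the `- command:` / `expected:` pairs) is better handled with a real YAML parser once the blocks are fenced in the repo.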
### RM-001 — Upgrade Fourway tenant ADB from Free to Paid
id: RM-001
title: Upgrade E4 (Fourway tenant Autonomous DB) from Always Free to paid tier
workstream: WS-1
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [E4]
apps: ["02"]
ki_addresses: [KI-006, KI-019]
effort: xs
dependencies: []
preconditions:
- Fourway customer notified of brief reconfiguration window (likely none required)
- Cost approval secured
target_state:
- E4 isFreeTier == false
- 60-day automated backups confirmed RESTORABLE (Free Tier blocks restore)
- Auto-pause-after-7-days-idle behaviour disabled
- Customer-facing SLA can reference paid Oracle tier
verification:
- command: oci db autonomous-database get --autonomous-database-id <E4-ocid> --query 'data."is-free-tier"'
expected: "false"
- command: oci db autonomous-database get --autonomous-database-id <E4-ocid> --query 'data."lifecycle-state"'
expected: "AVAILABLE"
risks_of_inaction: Paying customer's database can pause silently after a quiet weekend; backups exist but cannot be restored if data is corrupted.
risks_of_action: ~£N/month cost increase (TBD; depends on chosen shape).
The single highest-leverage Phase-2 action. Fourway is a paying client; running them on infrastructure Oracle reserves the right to reclaim is a contractual exposure we can't argue away. Free → Paid is a metadata flip in the OCI console — no data migration. Cost is the only friction.
Confirm by running the OCI CLI command from the verification block against the E4 ADB OCID, once that OCID is captured in our records.
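Because the verification command uses `--query`, the CLI prints the bare JMESPath result, so the pass/fail decision an agent has to make is mechanical. A sketch of that decision (assuming the `--query 'data."is-free-tier"'` form above, which emits a single JSON boolean):

```python
import json

def upgrade_verified(cli_output: str) -> bool:
    """RM-001 passes when the queried free-tier flag comes back as JSON false."""
    return json.loads(cli_output) is False

# Fed from:
#   oci db autonomous-database get --autonomous-database-id <E4-ocid> \
#       --query 'data."is-free-tier"'
print(upgrade_verified("false"))  # True — E4 is on the paid tier
print(upgrade_verified("true"))   # False — still Always Free
```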
### RM-002 — Parallax pre-prod onto its own ADB
id: RM-002
title: Provision a separate ADB for Parallax pre-prod; remove pre-prod schemas from E5 prod ADB
workstream: WS-1
wave: 2
priority: critical
owner: TBD
status: proposed
servers: [E5]
apps: ["01"]
ki_addresses: [KI-005]
effort: m
dependencies: [RM-001]
preconditions:
- Pre-prod schema inventory captured
- UR communication on the migration window
target_state:
- New ADB exists in EIDOSDev1 / UR compartment named e.g. PARALLAX_PREPROD
- Pre-prod APEX workspaces / schemas migrated off E5
- E5 hosts ONLY prod schemas
- Migration / promotion flow from pre-prod → prod documented
verification:
- manual: query E5 → confirm only PROD schemas remain
- command: oci db autonomous-database list --compartment-id <UR-compartment-ocid> --query 'data[*].{name:"db-name",freetier:"is-free-tier"}'
expected: 2 ADBs listed (Parallax PROD + new PREPROD)
risks_of_inaction: A bad migration in pre-prod can lock or corrupt the paying-customer prod database with no environment isolation.
risks_of_action: One-time migration effort + new ADB cost. Could be Free Tier for pre-prod (acceptable since it's not customer-facing) — but loses restore ability for pre-prod itself.
The fix is conceptually trivial — Oracle ADB provisioning is a few clicks. The work is in the migration plan: what schemas to move, how to keep pre-prod's data fresh from prod, how to gate promotion.
### RM-003 — ADB-level monitoring + alerting
id: RM-003
title: Stand up monitoring for the 5 ADBs (synthetic checks + OCI Operations Insights)
workstream: WS-8
wave: 2
priority: high
owner: TBD
status: proposed
servers: [E3, E4, E5, O2, O3]
apps: ["01", "02", "03", "08", "09", "10", "30", "31"]
ki_addresses: [KI-020]
effort: m
dependencies: [RM-039]
preconditions:
- Alert delivery channel chosen (Gotify / Teams / Slack)
target_state:
- Each ADB has a synthetic 5-minute heartbeat query
- OCI alarms configured for each ADB on lifecycle changes (PAUSED, FAILED)
- OCI Notifications subscription points at the chosen channel
- Free Tier ADB auto-pause events fire alerts
verification:
- command: oci monitoring alarm list --compartment-id <UR-compartment-ocid>
expected: ≥1 alarm per ADB
- manual: simulate ADB stop on a non-prod ADB, confirm alert arrives
risks_of_inaction: Today, an ADB pause or outage is detected only by user complaint.
risks_of_action: Alert fatigue if thresholds are too aggressive — start with critical-only.
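The heartbeat itself is tiny; what matters is keeping the query transport and the alert channel pluggable until RM-039 settles which channel we use. A sketch with both injected (function names are hypothetical, not an existing script):

```python
from datetime import datetime, timezone

def heartbeat(run_query, alert, name):
    """Fire alert(message) if a trivial query against the named ADB fails.
    run_query and alert are injected so the transport (SQLcl, python-oracledb)
    and the channel (Gotify, Teams, Slack webhook) can be swapped independently."""
    try:
        run_query("SELECT 1 FROM dual")
    except Exception as exc:
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
        alert(f"{stamp} heartbeat FAILED for {name}: {exc}")

def paused(sql):
    # Simulate a Free Tier ADB that auto-paused after 7 idle days
    raise ConnectionError("database paused")

alerts = []
heartbeat(paused, alerts.append, name="E4")
print(alerts[0])  # ...heartbeat FAILED for E4: database paused
```

Running this every 5 minutes from n8n or cron gives the synthetic check; the OCI alarms cover the platform-side lifecycle events.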
### RM-004 — Restore-test the Parallax E5 paid ADB
id: RM-004
title: Perform a clone-restore drill on Parallax E5 to verify the 60-day automated backup
workstream: WS-3
wave: 3
priority: high
owner: TBD
status: proposed
servers: [E5]
apps: ["01"]
ki_addresses: [KI-016]
effort: s
dependencies: []
preconditions:
- UR notified that a clone will be created (no impact to prod)
target_state:
- A clone of E5 created from a backup ≥7 days old
- Clone reaches AVAILABLE state
- APEX login + a representative read query succeeds against the clone
- Clone deleted; restore time + steps documented in infra/runbooks/parallax-restore.md
verification:
- command: oci db autonomous-database list-clones --autonomous-database-id <E5-ocid>
expected: clone present during drill
- manual: log-in test against clone's APEX URL
risks_of_inaction: We assume the paid-tier backup works; we have not proven it.
risks_of_action: Minor OCI clone cost (typically free for the duration of the test).
### RM-005 — Cert-expiry monitor on .app (HSTS-preload critical)
id: RM-005
title: Add external cert-expiry monitoring on tneconnect.app and all customer URLs
workstream: WS-8
wave: 1
priority: critical
owner: TBD
status: proposed
servers: []
apps: ["02", "03", "06", "13"]
ki_addresses: [KI-024]
effort: xs
dependencies: []
preconditions:
- External uptime tool selected (Healthchecks.io free tier suffices)
target_state:
- Cert NotAfter monitored for all customer-facing hostnames (esp. `*.tneconnect.app`)
- Alert fires at 30 days and 7 days before expiry
- Independent of Caddy/Traefik internal logging (a Caddy outage must still alert)
verification:
- manual: pause Caddy auto-renewal in a test, confirm alert fires
risks_of_inaction: A cert lapse on `.app` is catastrophic — HSTS-preload means browsers refuse to connect with no error fallback.
risks_of_action: None significant.
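Whichever external tool runs the check, the 30-day / 7-day decision logic is worth pinning down precisely. A sketch of just that decision (the `NotAfter` timestamp would come from the monitoring tool or an `openssl` / `ssl` probe; this is not a full monitor):

```python
from datetime import datetime, timedelta, timezone

def expiry_alert(not_after, now, thresholds=(30, 7)):
    """Return the tightest crossed threshold in days, or None if no alert is due."""
    days_left = (not_after - now).days
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(expiry_alert(now + timedelta(days=45), now))  # None — comfortably valid
print(expiry_alert(now + timedelta(days=20), now))  # 30 — first warning fires
print(expiry_alert(now + timedelta(days=3), now))   # 7  — escalated warning
```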
### RM-006 — Caddyfile (E1) into Git
id: RM-006
title: Move E1 Caddyfile into the engineering repo; add automated sync; document rebuild procedure
workstream: WS-2
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [E1]
apps: []
ki_addresses: [KI-001]
effort: s
dependencies: []
preconditions:
- SSH access to E1 (needs RM-020 if E1 not yet on tailnet — OK for now via current public SSH)
target_state:
- File at engineering repo path infra/caddy/E1.Caddyfile
- Sync mechanism: cron pull on host OR CI deploy on commit
- Pre-deploy `caddy validate` runs in CI
- Runbook at infra/runbooks/caddy-rebuild-from-git.md
- Test: rebuild Caddy from Git on a sandbox host and confirm functional parity
verification:
- command: git ls-files infra/caddy/E1.Caddyfile
expected: file listed
- manual: rebuild test passes
risks_of_inaction: A repeat of the recent OCI scheduled-maintenance outage — full estate down because the only copy of Caddyfile lived on a pause-vulnerable host.
risks_of_action: None significant; one careful manual capture.
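Whichever sync mechanism wins (cron pull on the host vs CI deploy on commit), the drift check reduces to comparing the committed file with the deployed one. A minimal sketch (file names hypothetical; the real paths are the repo's `infra/caddy/E1.Caddyfile` and the host's live config):

```python
import hashlib, pathlib, tempfile

def caddyfile_in_sync(git_copy, host_copy) -> bool:
    """True when the deployed Caddyfile is byte-identical to the committed one."""
    digest = lambda p: hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
    return digest(git_copy) == digest(host_copy)

# Demo with throwaway files standing in for the repo and host copies
with tempfile.TemporaryDirectory() as d:
    committed = pathlib.Path(d, "E1.Caddyfile")
    deployed = pathlib.Path(d, "Caddyfile")
    committed.write_text("example.com {\n\treverse_proxy app:8080\n}\n")
    deployed.write_text(committed.read_text())
    print(caddyfile_in_sync(committed, deployed))  # True
```

Alerting on a `False` result (rather than silently overwriting) keeps hand-edits on the host visible until they are committed.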
### RM-007 — Caddyfile (O1) into Git
id: RM-007
title: Move O1 Caddyfile into the engineering repo; same pattern as RM-006
workstream: WS-2
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [O1]
apps: []
ki_addresses: [KI-001]
effort: s
dependencies: []
target_state:
- File at engineering repo path infra/caddy/O1.Caddyfile
- Sync + validation pipeline as RM-006
verification:
- command: git ls-files infra/caddy/O1.Caddyfile
expected: file listed
risks_of_inaction: O1 carries the entire `*.448.global` estate (Vault, Authentik, MinIO, etc.) — a Caddy state loss on O1 is even more impactful than the E1 incident.
### RM-008 — SQLcl Dockerfile → Git + image to GitLab Registry
id: RM-008
title: Source-control the custom Alpine SQLcl image; push to GitLab Container Registry
workstream: WS-2
wave: 2
priority: high
owner: Vishnu Kant
status: proposed
servers: [O1]
apps: ["32"]
ki_addresses: [KI-003]
effort: s
dependencies: []
target_state:
- Dockerfile committed at infra/sqlcl/Dockerfile
- Image built and pushed to git.projecteidos.com/<group>/<project>/sqlcl:<version>
- Versioned tag (no `:latest`)
- n8n + Coder consumers updated to pull from registry
verification:
- command: git ls-files infra/sqlcl/Dockerfile
- command: docker manifest inspect git.projecteidos.com/<group>/<project>/sqlcl:<version>
expected: success
risks_of_inaction: If O1 is rebuilt or the image pruned, the build is lost.
### RM-009 — SQLcl static internal IP (fix n8n CI/CD breakage)
id: RM-009
title: Assign the SQLcl container a stable Docker network alias so n8n pipelines stop breaking on restart
workstream: WS-2
wave: 2
priority: high
owner: Vishnu Kant
status: proposed
servers: [O1]
apps: ["32", "25"]
ki_addresses: [KI-002]
effort: xs
dependencies: [RM-008]
target_state:
- SQLcl container on a user-defined Docker network with alias e.g. `sqlcl.internal`
- n8n workflows updated to use the alias, not a hardcoded IP
- Healthcheck added so restart events surface explicit alerts
verification:
- command: docker inspect <sqlcl-container> --format '{{range .NetworkSettings.Networks}}{{.Aliases}}{{end}}'
expected: contains "sqlcl.internal"
risks_of_inaction: n8n CI/CD pipelines break silently on every container restart.
### RM-010 — n8n workflow exports → Git
id: RM-010
title: Export n8n workflow definitions and commit them to engineering repo
workstream: WS-2
wave: 2
priority: high
owner: TBD
status: proposed
servers: [O1]
apps: ["25"]
ki_addresses: [KI-015]
effort: s
dependencies: []
target_state:
- All production n8n workflows exported (JSON)
- Committed at infra/n8n/workflows/<workflow-name>.json
- Daily/weekly export cron job that diffs against committed state
- Drift produces an alert
verification:
- command: git ls-files infra/n8n/workflows/ | wc -l
expected: ≥1 (matches workflow count)
risks_of_inaction: Production CI/CD pipelines are recoverable only by remembering them.
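n8n may re-serialize workflow JSON on every export, so a raw byte diff against the committed copy would flag false drift; comparing canonicalized JSON avoids that. A sketch of the drift test the cron job would run (sample payloads hypothetical):

```python
import json

def workflow_drifted(exported: str, committed: str) -> bool:
    """Compare an n8n workflow export against the committed copy,
    ignoring key order and whitespace differences from re-serialization."""
    canon = lambda s: json.dumps(json.loads(s), sort_keys=True)
    return canon(exported) != canon(committed)

live = '{"name": "deploy-apex", "nodes": []}'
repo = '{\n  "nodes": [],\n  "name": "deploy-apex"\n}'
print(workflow_drifted(live, repo))  # False — same workflow, different serialization
```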
### RM-011 — Authentik blueprint → Git
id: RM-011
title: Export Authentik OIDC providers, applications, and policies as a blueprint; commit to repo
workstream: WS-2
wave: 2
priority: high
owner: TBD
status: proposed
servers: [O1]
apps: ["14"]
ki_addresses: [KI-015]
effort: s
dependencies: []
target_state:
- Blueprint YAML committed at infra/authentik/blueprint.yaml
- Documented procedure to apply blueprint to a fresh Authentik
- Authentik secret key stored in Vault (NOT in Git)
verification:
- command: git ls-files infra/authentik/blueprint.yaml
risks_of_inaction: SSO-integrated app config is recoverable only by hand-rebuilding it post-incident.
### RM-012 — WG-Easy peer config sync
id: RM-012
title: Periodic export of WG-Easy peer registry into engineering repo (not the keys; the metadata)
workstream: WS-2
wave: 3
priority: medium
owner: TBD
status: proposed
servers: [O1]
apps: ["20"]
ki_addresses: [KI-015]
effort: s
dependencies: []
target_state:
- Peer metadata (name, allowed-IPs, last-handshake, NOT private keys) exported to infra/wireguard/peers.txt
- Cron job re-exports daily; commit if changed
- Server private key separately backed up to Vault (Tier-0 secret, NOT to Git)
verification:
- command: git ls-files infra/wireguard/peers.txt
risks_of_inaction: Peer revocation history is invisible; stale ex-employee peers may persist.
security: Private keys must NEVER be committed. Verify the export script with a code review.
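The export script's one hard rule is that secret material never reaches Git, and an allow-list is safer than a deny-list (a new secret field added by a WG-Easy upgrade is excluded by default). A sketch of the sanitization step (the field names are assumptions about WG-Easy's export shape, not its actual schema):

```python
def sanitize_peer(peer: dict) -> dict:
    """Keep only non-secret peer metadata for the Git export.
    Allow-listing means unknown (possibly secret) fields are dropped by default."""
    allowed = ("name", "allowedIPs", "latestHandshakeAt")
    return {k: peer[k] for k in allowed if k in peer}

peer = {
    "name": "vishnu-laptop",
    "allowedIPs": "10.8.0.2/32",
    "latestHandshakeAt": "2025-01-01T00:00:00Z",
    "privateKey": "<never exported>",
}
print(sanitize_peer(peer))  # no privateKey in the output
```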
### RM-013 — Vault Raft snapshot job
id: RM-013
title: Schedule periodic Vault Raft snapshot; ship to OCI bucket and (later) off-OCI destination
workstream: WS-3
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [O1]
apps: ["15"]
ki_addresses: [KI-017, KI-016]
effort: s
dependencies: []
preconditions:
- OCI bucket exists in EIDOSDev1 for backups (or create one in ORA448Global)
- Vault root or appropriate policy token available for snapshot creation
target_state:
- Cron job runs `vault operator raft snapshot save` daily
- Snapshot uploaded to OCI bucket with retention (e.g. 30 daily + 12 monthly)
- Snapshot file integrity verified (sha256)
- Vault unseal-key and root-token recovery procedure documented
- At least one snapshot used in a restore drill (covered by RM-017)
verification:
- command: oci os object list --bucket-name <bucket> --prefix vault-snapshots/ --query 'data[].name' | wc -l
expected: ≥1
risks_of_inaction: Vault loss = every secret in the company is unrecoverable. Tier-0.
security: Snapshot files contain encrypted secret material — handle with same controls as Vault itself.
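The "30 daily + 12 monthly" retention rule is easiest to trust when it is a pure function over snapshot dates, reviewable before it ever deletes anything. A sketch (interpreting "monthly" as the first snapshot of each month, which is an assumption about the intended policy):

```python
from datetime import date, timedelta

def snapshots_to_keep(dates, daily=30, monthly=12):
    """Return the set of snapshot dates the retention policy preserves."""
    dates = sorted(set(dates), reverse=True)
    keep = set(dates[:daily])                 # the N newest daily snapshots
    firsts = {}                               # first snapshot seen in each month
    for d in sorted(dates):
        firsts.setdefault((d.year, d.month), d)
    keep |= set(sorted(firsts.values(), reverse=True)[:monthly])
    return keep

days = [date(2025, 1, 1) + timedelta(n) for n in range(90)]   # Jan 1 – Mar 31
kept = snapshots_to_keep(days)
print(len(kept))  # 33 — the 30 newest dailies plus the Jan/Feb/Mar month-firsts
```

Everything not in the returned set is a deletion candidate; the cron job would prune the OCI bucket accordingly.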
### RM-014 — Authentik Postgres + media backup
id: RM-014
title: Schedule daily Authentik Postgres dump + media volume snapshot; ship off-host
workstream: WS-3
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [O1]
apps: ["14"]
ki_addresses: [KI-017]
effort: s
dependencies: []
target_state:
- Daily `pg_dump` of Authentik Postgres
- Media volume backup (avatars, custom CSS)
- Authentik secret key captured in Vault separately
- Both shipped to OCI bucket with retention
verification:
- command: oci os object list --bucket-name <bucket> --prefix authentik-backups/ --query 'data[?contains("name", `pg_dump`)].name' | wc -l
expected: ≥1 per day for last 7 days
risks_of_inaction: Loss of Authentik = loss of SSO trust state for every integrated app.
### RM-015 — GitLab gitlab-backup schedule
id: RM-015
title: Schedule daily gitlab-backup including the secrets file; ship off-host
workstream: WS-3
wave: 1
priority: critical
owner: TBD
status: proposed
servers: [E2]
apps: ["16"]
ki_addresses: [KI-018]
effort: s
dependencies: []
target_state:
- Daily `gitlab-backup create` runs successfully
- `/etc/gitlab/gitlab-secrets.json` captured separately on the same schedule
- Backups + secrets file both shipped to OCI bucket
- Restore-test once (RM-017 covers this)
verification:
- command: oci os object list --bucket-name <bucket> --prefix gitlab-backups/
expected: ≥1 backup per day
- command: oci os object list --bucket-name <bucket> --prefix gitlab-secrets/
expected: ≥1 secrets file per day
risks_of_inaction: Without the secrets file, encrypted CI variables cannot be decrypted on a restored instance — the most common GitLab-restore failure mode.
### RM-016 — n8n backup
id: RM-016
title: Daily n8n Postgres dump + workflow JSON exports + N8N_ENCRYPTION_KEY in Vault
workstream: WS-3
wave: 2
priority: high
owner: TBD
status: proposed
servers: [O1]
apps: ["25"]
ki_addresses: [KI-017]
effort: s
dependencies: [RM-010]
target_state:
- Daily Postgres dump shipped to OCI bucket
- N8N_ENCRYPTION_KEY confirmed in Vault (without it, stored credentials are dead)
- Workflow exports covered separately under RM-010
verification:
- command: oci os object list --bucket-name <bucket> --prefix n8n-backups/
risks_of_inaction: Production CI/CD workflows cannot be restored.
### RM-017 — Quarterly restore-drill schedule
id: RM-017
title: Establish quarterly cold-restore drills for Vault, Authentik, GitLab, and the paid ADB
workstream: WS-3
wave: 2
priority: high
owner: TBD
status: proposed
servers: [E5, O1, E2]
apps: ["14", "15", "16", "01"]
ki_addresses: [KI-016]
effort: s
dependencies: [RM-013, RM-014, RM-015]
target_state:
- Calendar entries for Q1/Q2/Q3/Q4 drills (Vault Q1, Authentik Q2, GitLab Q3, Parallax ADB Q4 — rotate)
- Each drill restores the backup to a parallel sandbox and runs a smoke test
- Outcomes (date, duration, what worked, what didn't) tracked in infra/backups.md
verification:
- manual: review log of past 4 quarters' drills
risks_of_inaction: The most universal Phase-2 finding — *no backup has ever been restored*. Until tested, hopes are not recoveries.
### RM-018 — Off-OCI backup destination
id: RM-018
title: Add an off-OCI destination (e.g. Backblaze B2) for Tier-0 backups
workstream: WS-3
wave: 3
priority: high
owner: TBD
status: proposed
servers: []
apps: []
ki_addresses: [KI-016]
effort: s
dependencies: [RM-013, RM-014, RM-015]
preconditions:
- Provider account opened with company billing
target_state:
- Backblaze B2 (or equivalent) bucket created
- Vault snapshots, Authentik dumps, GitLab backups mirrored from OCI bucket weekly
- Object-lock / versioning on (ransomware resilience)
- Credentials in Vault
verification:
- manual: confirm latest week's mirrored copies present
risks_of_inaction: A simultaneous OCI tenancy compromise / billing lapse loses both production and backups.
risks_of_action: <£20/month at expected volumes.
### RM-019 — Tailscale off personal account
id: RM-019
title: Migrate Tailscale tailnet from Vishnu's personal account to a company-owned identity (or self-host Headscale)
workstream: WS-4
wave: 1
priority: critical
owner: Vishnu Kant
status: proposed
servers: [E2, O1]
apps: []
ki_addresses: [KI-010]
effort: s
dependencies: []
target_state:
- Either: company Tailscale subscription under ops@projecteidos.com, OR Headscale on a dedicated low-cost VM
- E2 + O1 + (per RM-020) E1 are members of the company tailnet
- Vishnu's personal tailnet decommissioned for org devices
- Procedure for adding/removing devices documented
verification:
- command: tailscale status (on E2/O1) shows ownership = ops@projecteidos.com
risks_of_inaction: Production admin path depends on a personal SaaS account.
### RM-020 — E1 onto tailnet; close port 22 publicly
id: RM-020
title: Add E1 to the company tailnet; close OCI ingress on port 22 for E1
workstream: WS-4
wave: 2
priority: high
owner: TBD
status: proposed
servers: [E1]
apps: []
ki_addresses: [KI-014]
effort: xs
dependencies: [RM-019]
target_state:
- E1 appears in `tailscale status` output from another tailnet member
- OCI Security List for E1 has port 22 closed (or restricted to bastion / Tailscale CGNAT range)
verification:
- command: oci network security-list get --security-list-id <E1-sl-ocid> --query 'data."ingress-security-rules"'
expected: no rule allowing 0.0.0.0/0 on tcp/22
risks_of_inaction: Public SSH on E1 is the last brute-force target on PE-side hosts.
RM-021 — Bradley's Bitwarden → Vault¶
id: RM-021
title: Migrate shared credentials from Bradley's personal Bitwarden into vault.448.global with a documented path scheme
workstream: WS-4
wave: 2
priority: high
owner: Bradley Leggett (with Vishnu)
status: proposed
servers: []
apps: ["15"]
ki_addresses: [KI-007, KI-024]
effort: s
dependencies: []
target_state:
- All Oracle DBA / EIDOSDev1 admin credentials in Vault under a documented path scheme
- Bradley's personal Bitwarden no longer the system-of-record for any shared secret
- Path scheme documented at apps/15-vault.md
verification:
- manual: spot-check that ADB ADMIN, OCI tenancy login, and other DBA-scope credentials resolve via Vault
risks_of_inaction: Off-boarding hazard; bus-factor; not audit-loggable.
RM-022 — OCI IAM ↔ Authentik federation¶
id: RM-022
title: Federate both OCI tenancies' IAM with Authentik (OIDC); deprecate local OCI users
workstream: WS-4
wave: 3
priority: medium
owner: TBD
status: proposed
servers: []
apps: ["14"]
ki_addresses: [KI-030]
effort: m
dependencies: [RM-021]
target_state:
- EIDOSDev1 + ORA448Global identity domains federated to auth.448.global
- Local OCI users disabled (except break-glass admin account)
- Group-based access policies via Authentik group claims
- Procedure tested: an Authentik user can sign in to the OCI console without a local OCI user
verification:
- manual: end-to-end SSO test
risks_of_inaction: Off-boarding requires touching two OCI tenancies independently; MFA / password policy drifts per tenancy.
RM-023 — MFA audit on registrar + cloud roots¶
id: RM-023
title: Confirm MFA enforced on GoDaddy account, both OCI tenancy admins, all M365 tenant globals
workstream: WS-4
wave: 3
priority: high
owner: TBD
status: proposed
servers: []
apps: []
ki_addresses: []
effort: s
dependencies: []
target_state:
- GoDaddy login: MFA enabled (TOTP or security key)
- OCI EIDOSDev1 tenancy admin (Tracey): MFA enforced
- OCI ORA448Global tenancy owner (Adam) + additional admin (Vishnu): MFA enforced for both
- All M365 tenant Global Administrators: MFA enforced
- Audit log committed at infra/audits/mfa-2026-Q2.md (or similar)
verification:
- manual: per-account screenshot or CLI confirmation
risks_of_inaction: Phishable single-credential routes to total tenancy compromise.
RM-024 — Vault credential paths schema + rotation policy¶
id: RM-024
title: Define a stable Vault path scheme; rotate the most-stale shared credentials
workstream: WS-4
wave: 2
priority: high
owner: Vishnu Kant
status: proposed
servers: []
apps: ["15"]
ki_addresses: [KI-007]
effort: s
dependencies: []
target_state:
- Path scheme documented (e.g. /secret/<service>/<env>/<credential-type>)
- Each app doc lists its expected Vault paths in section 4
- Credentials older than 1 year rotated; new ones logged
- Quarterly rotation review on the calendar
verification:
- manual: each [INFO NEEDED] Vault path in apps/*.md is now resolved
risks_of_inaction: Inconsistent paths and stale credentials increase blast radius of any single leak.
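The scheme can be enforced mechanically, e.g. in CI against the paths listed in each app doc. A sketch in Python, assuming the `/secret/<service>/<env>/<credential-type>` shape from the target state (the example paths are hypothetical):

```python
import re

# One segment each for service, environment, and credential type,
# per the /secret/<service>/<env>/<credential-type> convention.
PATH_RE = re.compile(r"^/secret/[a-z0-9-]+/(prod|staging|dev)/[a-z0-9-]+$")

def valid_vault_path(path: str) -> bool:
    """True if the path follows the documented scheme."""
    return PATH_RE.fullmatch(path) is not None

# Hypothetical examples:
assert valid_vault_path("/secret/gitlab/prod/root-password")
assert valid_vault_path("/secret/twenty-crm-uk/prod/db-admin")
assert not valid_vault_path("/secret/gitlab/root-password")      # env segment missing
assert not valid_vault_path("secret/gitlab/prod/root-password")  # no leading slash
```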
RM-025 — Admin UIs behind WG / Authentik forward-auth¶
id: RM-025
title: Gate every admin UI on either Wireguard membership or Authentik forward-auth
workstream: WS-5
wave: 3
priority: high
owner: TBD
status: proposed
servers: [O1, E2]
apps: ["14", "15", "17", "19", "21", "23", "24", "25", "26", "27", "28"]
ki_addresses: [KI-012]
effort: m
dependencies: [RM-019]
target_state:
- vault.448.global, portainer.448.global, monitor.448.global, n8n.448.global, coder.448.global, ai.448.global, draw.448.global, tools.448.global, notify.448.global, s3.448.global console: all reachable only via Wireguard OR through Authentik forward-auth at the proxy
- Public access removed at Caddy level
- Customer-facing apps (CRMs, WordPress, Workforce, Parallax) remain public
verification:
- manual: from a non-tailnet client, confirm each admin UI returns 401/403 (Authentik) or refuses connection (WG-only)
risks_of_inaction: AI-driven scanners continually probe public admin dashboards; one CVE on any of them is a leverage point.
risks_of_action: Engineers must be on Wireguard / SSO for admin work — small UX cost, big security win.
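At the Caddy layer, the forward-auth variant follows Authentik's documented Caddyfile pattern. A minimal sketch; the upstream names (`authentik-outpost:9000`, `portainer:9000`) are placeholders for the real container addresses:

```caddyfile
# Hypothetical example: gate one admin UI behind Authentik forward-auth
portainer.448.global {
    forward_auth authentik-outpost:9000 {
        uri /outpost.goauthentik.io/auth/caddy
        copy_headers X-Authentik-Username X-Authentik-Groups
    }
    reverse_proxy portainer:9000
}
```

The same block repeats per admin hostname; WG-only hosts instead simply have no public Caddy block at all.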
RM-026 — ADB private endpoints / IP allow-lists¶
id: RM-026
title: Where ADB tier supports it, restrict ADB endpoints to a private VCN endpoint + IP allow-list
workstream: WS-5
wave: 3
priority: high
owner: TBD
status: proposed
servers: [E5, E4, E3, O2, O3]
apps: ["01", "02", "03", "08", "09", "10", "30", "31"]
ki_addresses: [KI-011]
effort: m
dependencies: [RM-001]
target_state:
- Paid ADBs (currently E5; E4 after RM-001) use private endpoints in a VCN
- ORDS endpoints reachable only from E1, E2, O1, and the tailnet
- Free Tier ADBs (no private endpoint support): tighten IP allow-list
verification:
- command: oci db autonomous-database get --autonomous-database-id <ocid> --query 'data."private-endpoint"'
expected: not null for paid ADBs
risks_of_inaction: Paying-customer DB endpoints are public; only mTLS wallet protects them.
RM-027 — CAA records on all 4 domains¶
id: RM-027
title: Publish CAA records pinning Let's Encrypt as the only authorized CA on all 4 domains
workstream: WS-5
wave: 2
priority: medium
owner: Vishnu Kant
status: proposed
servers: []
apps: []
ki_addresses: [KI-026]
effort: xs
dependencies: []
target_state:
- For each of projecteidos.com, eidos-global.com, tneconnect.app, 448.global, publish:
- CAA 0 issue "letsencrypt.org"
- CAA 0 iodef "mailto:security@<domain>" (with chosen recipient)
verification:
- command: dig CAA <domain> +short (or DoH equivalent)
expected: at least one issue record
risks_of_inaction: Any CA accepting attack-controlled validation can issue valid certs for our hostnames.
RM-028 — OCI Security Lists → Terraform¶
id: RM-028
title: Capture OCI Security Lists and NSGs as Terraform; apply via CI
workstream: WS-5
wave: 3
priority: medium
owner: Vishnu Kant
status: proposed
servers: [E1, E2, O1, E3, E4, E5, O2, O3]
apps: []
ki_addresses: [KI-013]
effort: m
dependencies: []
target_state:
- Terraform module per tenancy (EIDOSDev1, ORA448Global)
- All current rules imported via `terraform import`
- Changes require a PR + plan + apply via CI pipeline
- State stored in OCI bucket (with locking)
verification:
- command: terraform plan -no-color (in CI)
expected: no diff with current OCI state
risks_of_inaction: Manual security-list edits drift, are unauditable, and lack rollback.
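Captured as Terraform, each security list becomes one resource, imported rather than recreated. A minimal sketch using the OCI provider's `oci_core_security_list` resource; the names, variables, and the single HTTPS rule are illustrative, not the real rule set:

```hcl
# Hypothetical sketch: one list, one rule; the real module covers all lists/NSGs
resource "oci_core_security_list" "e1_public" {
  compartment_id = var.compartment_ocid
  vcn_id         = var.vcn_ocid
  display_name   = "e1-public"

  ingress_security_rules {
    protocol = "6" # TCP
    source   = "0.0.0.0/0"
    tcp_options {
      min = 443
      max = 443
    }
  }
}
```

Each existing list is then attached with `terraform import oci_core_security_list.e1_public <security-list-ocid>`, and the module edited until `terraform plan` reports no diff.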
RM-029 — Fix projecteidos.com malformed SPF¶
id: RM-029
title: Replace projecteidos.com TXT SPF record with a single well-formed SPF policy
workstream: WS-6
wave: 1
priority: high
owner: Vishnu Kant
status: proposed
servers: []
apps: []
ki_addresses: [KI-027]
effort: xs
dependencies: []
preconditions:
- Inventory the actually-used senders (M365, WP Cloud / Brevo, GoDaddy secureserver)
target_state:
- TXT record at projecteidos.com (apex) reads ONE SPF policy with a SINGLE trailing -all
- Validated via mxtoolbox or similar
- Outbound mail tested (esp. transactional from secureserver-related senders)
verification:
- command: dig TXT projecteidos.com +short | grep '^"v=spf1' | head -1
expected: single record, single -all
- manual: send-and-receive test from each authorized sender path
risks_of_inaction: Outbound mail from secureserver.net senders is silently failing SPF; deliverability degrading.
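The "one policy, one trailing `-all`" condition is easy to check mechanically once the TXT strings are in hand (e.g. from the `dig` output above). A sketch; the record contents are hypothetical:

```python
def spf_records(txt_records: list[str]) -> list[str]:
    """Return the SPF policies among a domain's TXT records."""
    return [r for r in txt_records if r.startswith("v=spf1")]

def spf_is_well_formed(txt_records: list[str]) -> bool:
    """Exactly one SPF record, ending in exactly one 'all' mechanism."""
    spf = spf_records(txt_records)
    if len(spf) != 1:
        return False  # zero or multiple SPF records both break SPF evaluation
    terms = spf[0].split()
    return sum(t.endswith("all") for t in terms) == 1 and terms[-1].endswith("all")

# Hypothetical record sets:
assert spf_is_well_formed(["v=spf1 include:spf.protection.outlook.com -all"])
assert not spf_is_well_formed(["v=spf1 include:a.com -all include:b.com -all"])  # double -all
assert not spf_is_well_formed(["v=spf1 include:a.com -all", "v=spf1 -all"])      # two records
```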
RM-030 — Configure DKIM in all M365 tenants¶
id: RM-030
title: Enable DKIM in each M365 tenant for its associated domain; publish required CNAMEs
workstream: WS-6
wave: 3
priority: medium
owner: TBD
status: proposed
servers: []
apps: []
ki_addresses: [KI-029]
effort: s
dependencies: []
target_state:
- Per domain (projecteidos.com, eidos-global.com, tneconnect.app, 448.global): selector1 + selector2 CNAME records published, DKIM signing enabled in M365 Defender
verification:
- command: dig TXT selector1._domainkey.<domain> +short
expected: non-empty
risks_of_inaction: Without DKIM, DMARC enforcement is much weaker; Gmail / M365 inbound deliverability suffers.
RM-031 — DMARC enforce on PE + 448¶
id: RM-031
title: Move DMARC from p=none to p=quarantine, then to p=reject, on projecteidos.com and 448.global
workstream: WS-6
wave: 2
priority: medium
owner: TBD
status: proposed
servers: []
apps: []
ki_addresses: [KI-028]
effort: s
dependencies: [RM-029, RM-030]
target_state:
- Step 1 (after SPF/DKIM clean): DMARC p=quarantine for 30 days, monitor reports
- Step 2: DMARC p=reject
- Both projecteidos.com and 448.global at p=reject; eidos-global.com and tneconnect.app already at p=quarantine — also progress to p=reject when clean
verification:
- command: dig TXT _dmarc.projecteidos.com +short
expected: contains "p=reject"
risks_of_inaction: Spoofed mail from these domains continues to reach inboxes.
RM-032 — M365 tenant audit + consolidation plan¶
id: RM-032
title: Inventory the 3+ M365 tenants; produce a consolidation feasibility plan
workstream: WS-6
wave: 3
priority: medium
owner: TBD
status: proposed
servers: []
apps: []
ki_addresses: [KI-030]
effort: l
dependencies: []
target_state:
- Inventory captured at infra/m365-tenants.md: tenant ID, billing owner, license SKU mix, admin list, federation status, mailbox count
- Proposal: keep separate, or merge — with cost / migration / downtime analysis
verification:
- manual: doc reviewed
risks_of_inaction: Off-boarding requires touching every tenant; MFA / DLP / retention drift; licensing inefficiency.
RM-033 — unattended-upgrades security-only¶
id: RM-033
title: Enable Ubuntu unattended-upgrades in security-only mode on E1, E2, O1
workstream: WS-7
wave: 2
priority: critical
owner: Vishnu Kant
status: proposed
servers: [E1, E2, O1]
apps: []
ki_addresses: [KI-023]
effort: xs
dependencies: []
target_state:
- /etc/apt/apt.conf.d/50unattended-upgrades enabled with Ubuntu-Security only
- APT::Periodic::Unattended-Upgrade "1"
- Logs reviewed monthly
verification:
- command: ssh <host> 'unattended-upgrade --dry-run --debug 2>&1 | tail -5'
expected: indicates security updates applied / would-be applied
risks_of_inaction: Public-facing hosts accumulate kernel + package CVEs. The longer the gap, the harder each future patch becomes.
risks_of_action: Security-only auto-updates are the LOWEST blast-radius patching strategy on stable Ubuntu LTS.
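The security-only setup is two small apt conf fragments; this sketch matches the stock Ubuntu file layout, with only the security pocket in the allowed origins:

```
# /etc/apt/apt.conf.d/50unattended-upgrades -- security pocket only
Unattended-Upgrade::Allowed-Origins {
    "${distro_id}:${distro_codename}-security";
};

# /etc/apt/apt.conf.d/20auto-upgrades -- enable the daily run
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
```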
RM-034 — Monthly maintenance windows¶
id: RM-034
title: Establish monthly maintenance windows with snapshot-before-patch + planned reboot
workstream: WS-7
wave: 2
priority: high
owner: TBD
status: proposed
servers: [E1, E2, O1]
apps: []
ki_addresses: [KI-023]
effort: s
dependencies: [RM-033]
target_state:
- Calendar window (e.g. first Sunday of month, 02:00-04:00 UK)
- Runbook: snapshot block volume → apt full-upgrade → reboot → verify health
- User-facing communication for any customer-impacting work
- Outcome logged at infra/maintenance-log/YYYY-MM.md
verification:
- manual: review log entry exists for each month
risks_of_inaction: Kernel / userland still drifts even with unattended-upgrades; non-security updates accumulate.
RM-035 — Dokploy app auto-update strategy¶
id: RM-035
title: Decide and implement an auto-update mechanism for Dokploy-hosted apps on E2
workstream: WS-7
wave: 3
priority: medium
owner: TBD
status: proposed
servers: [E2]
apps: ["04", "05", "06", "07", "11", "12", "13", "16", "18"]
ki_addresses: [KI-004]
effort: s
dependencies: []
target_state:
- One of: (a) Watchtower-on-E2 with label-based opt-in, (b) Dokploy webhook auto-deploy on Git tag, (c) scheduled CI rebuild pipeline
- Decision documented at apps/22-watchtower.md
- All E2 containers pinned to versioned tags (no :latest)
- Update events alerted (Gotify / chosen channel)
verification:
- manual: trigger an update and verify alert fires
risks_of_inaction: WordPress, GitLab, Twenty CRM all accumulate CVEs unattended.
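If option (a) is chosen, the opt-in is Watchtower's standard enable label per service, paired with `WATCHTOWER_LABEL_ENABLE=true` on the Watchtower container itself so unlabeled containers are ignored. A sketch; the service name and version tag are illustrative:

```yaml
# Hypothetical compose/Dokploy service opting in to Watchtower updates
services:
  twenty-crm:
    image: twentycrm/twenty:v0.32.4   # pinned, versioned tag -- no :latest
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
```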
RM-036 — Rotate GitLab PAT into Vault¶
id: RM-036
title: Generate a fresh GitLab PAT for the engineering repo; store in Vault; rotate the embedded one
workstream: WS-4
wave: 3
priority: medium
owner: Vishnu Kant
status: proposed
servers: []
apps: ["15", "16"]
ki_addresses: []
effort: xs
dependencies: []
target_state:
- PAT in Vault under documented path
- Local clones use credential helper that reads from Vault, OR SSH key-based auth
- The current PAT embedded in .git/config replaced or removed
verification:
- manual: clone / push works without plaintext token in .git/config
risks_of_inaction: Long-lived PAT in plaintext on developer hosts is leak-prone.
RM-037 — Beszel coverage audit¶
id: RM-037
title: Confirm Beszel agent runs on every host (E1, E2, O1) and is delivering metrics
workstream: WS-8
wave: 3
priority: low
owner: TBD
status: proposed
servers: [E1, E2, O1]
apps: ["21"]
ki_addresses: [KI-022]
effort: xs
dependencies: []
target_state:
- Beszel dashboard shows live metrics from E1, E2, O1
- Alerts configured for disk-full, OOM, container-down
verification:
- manual: visual check of Beszel dashboard
risks_of_inaction: A monitoring blind spot is invisible: a host whose agent has died looks the same as a quiet, healthy one.
RM-038 — External uptime monitor¶
id: RM-038
title: Stand up external uptime monitoring on every customer-facing URL
workstream: WS-8
wave: 2
priority: high
owner: TBD
status: proposed
servers: []
apps: ["01", "02", "03", "04", "05", "06", "07", "11", "12", "13", "14", "15", "16"]
ki_addresses: [KI-021]
effort: xs
dependencies: []
target_state:
- External tool (Healthchecks.io / UptimeRobot) checks every customer-facing URL every 1-5 minutes
- Alerts route to a CHANNEL DIFFERENT from Gotify (so a Gotify outage doesn't swallow them)
- Status page (optional) for transparency
verification:
- manual: list of monitored URLs == app inventory
risks_of_inaction: Today, an outage is detected by user complaint.
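The verification ("list of monitored URLs == app inventory") reduces to a set comparison an agent can run against the uptime tool's export. A sketch with hypothetical URL sets:

```python
def coverage_gaps(inventory: set[str], monitored: set[str]) -> dict[str, set[str]]:
    """URLs in the app inventory but not monitored, and stale monitors."""
    return {
        "unmonitored": inventory - monitored,
        "stale_monitors": monitored - inventory,
    }

# Hypothetical inputs:
inventory = {"https://fourway.tneconnect.app", "https://eidos-global.com"}
monitored = {"https://eidos-global.com"}
gaps = coverage_gaps(inventory, monitored)
assert gaps["unmonitored"] == {"https://fourway.tneconnect.app"}
assert gaps["stale_monitors"] == set()
```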
RM-039 — Alert delivery rebuild¶
id: RM-039
title: Decide and rebuild alert delivery (revive Gotify or move to Teams/Slack webhook)
workstream: WS-8
wave: 2
priority: high
owner: TBD
status: proposed
servers: [O1]
apps: ["23"]
ki_addresses: [KI-022]
effort: s
dependencies: []
target_state:
- Single, working alert channel that reaches a human within minutes
- Beszel + Watchtower + n8n + uptime tool + ADB monitor all wired
- Test alerts fired and confirmed received
verification:
- manual: end-to-end test from each source
risks_of_inaction: Monitoring is only as good as the alerts that reach humans.
RM-042 — VPN access for TnE Connect: staff WG client rollout + office leased-line connectivity¶
id: RM-042
title: Roll out VPN access so all Eidos staff can reach the internal TnE Connect tenant from anywhere; connect UK + India offices over leased line
workstream: WS-5
wave: 2
priority: high
owner: Vishnu Kant + Sergiu Pop
status: proposed
servers: [O1]
apps: ["03", "20"]
ki_addresses: [KI-031, KI-035]
related_actions: [RM-019, RM-020]
effort: m
dependencies: []
preconditions:
- WireGuard portal at wg.448.global is operational (currently 4 peers, lightly used)
- Decision made on VPN domain — keep on 448.global, or migrate to projecteidos.com / eidos-global.com (see below)
target_state:
- Every Eidos staff member who needs the TnE Connect Eidos tenant has a WireGuard client installed and a peer config issued
- UK office reaches the internal tenancy over a dedicated leased-line / site-to-site VPN (no per-user clients required for office workstations)
- India office on the same site-to-site arrangement
- TnE Connect Eidos tenant access tightened so it requires either VPN or trusted-source IP (avoids public exposure during the SaaS-hardening period)
- Peer-config issuance procedure documented; off-boarding revocation procedure documented
verification:
- manual: list active WG peers vs current staff roster; confirm match
- manual: from a staff laptop on VPN, reach https://eidos-global.tneconnect.app and confirm load
- manual: from a staff laptop NOT on VPN, confirm denial (post-tightening)
risks_of_inaction: TnE Connect Eidos tenant remains publicly reachable while the SaaS hardening (KI-035, RM-043 VAPT) is still in flight; staff PII exposure surface is broader than it needs to be.
risks_of_action: Increases dependency on VPN availability — the WireGuard server (O1) becomes a critical path for Eidos staff workforce-app access. Mitigated by O1 already being a Tier-0 host in the estate.
Sub-decision: VPN domain. Today the WG portal is at wg.448.global — fitting the internal-tools brand. Two alternatives to consider before the rollout:
- Keep `wg.448.global` (status quo) — fastest. Reasonable since the VPN is for staff (an internal-infra concern), not for customer access.
- Migrate to `wg.projecteidos.com` — aligns the VPN URL with the corporate brand staff identify with. Trivial to set up via DNS + a Caddy block + a one-off WG-Easy URL change.
- Migrate to `wg.eidos-global.com` — same reasoning; depends which brand Stacy / Adam want staff to see.
The decision matters because as soon as we hand out WG client configs, the URL is on every staff laptop. Changing it later means re-issuing every config.
Sergiu's contribution: primary owner of the staff-laptop side — installing WG clients, helping non-technical staff bring up the connection, troubleshooting per-user issues. Vishnu owns the server-side and Caddy-side configuration. Office leased-line setup is Sergiu's responsibility too, with input on the OCI VCN side from Vishnu.
Where this sits relative to other items:
- Closely related to RM-019 (Tailscale off personal account) and RM-020: clarify the role-split between WireGuard (staff + office connectivity, broader rollout) and Tailscale (admin / engineering overlay). Both VPNs can coexist; they serve different audiences.
- Mitigates exposure called out in KI-035 (heavy PII on Free Tier, no formal SLA) by reducing the public attack surface during the security-hardening period.
RM-043 — Vulnerability Assessment + Penetration Test (VAPT) for the TnE Connect SaaS¶
id: RM-043
title: Engage an external assessor for a formal VA + PT of the TnE Connect product (both tenants); produce a remediation plan; close findings to a documented severity threshold
workstream: WS-1
wave: 2
priority: critical
owner: Vishnu Kant + leadership (Stacy + Adam) for commercial sign-off
status: proposed
servers: [E2, E3, E4]
apps: ["02", "03"]
ki_addresses: [KI-031, KI-035, KI-011, KI-036]
effort: l
dependencies: []
preconditions:
- Vendor selection: pick a CREST-accredited (or equivalent UK) penetration-testing firm
- Scope agreement: Fourway tenant + Eidos tenant + the underlying Oracle ADBs + the Caddy / Dokploy / E1 + E2 ingress + the Bitbucket source repo
- Customer notification (Fourway) before any active testing against shared infra
- Insurance / contractual clauses with the assessor on data handling
target_state:
- Independent VAPT report issued covering: external network surface, web app (OWASP-aligned), API security, session / authentication, data exposure, privilege escalation paths, source-code review (SAST + manual)
- Findings categorised (Critical / High / Medium / Low / Info)
- Remediation plan agreed with timeline; all Critical and High findings closed within an agreed window
- Re-test of remediated findings to confirm closure
- Report and remediation evidence retained for prospective customer due-diligence (the SaaS go-to-market makes this essential — every potential enterprise customer will ask)
verification:
- manual: VAPT report on file at infra/audits/YYYY-Q?-tne-connect-vapt.md (redacted as needed) + raw report retained securely
- manual: every Critical / High finding has either a closing commit or a documented compensating control
risks_of_inaction: We sell the TnE Connect product as SaaS while never having had it externally tested. Any reasonably-due-diligent enterprise customer will not buy without a recent independent assessment. Plus the genuine security debt remains undiagnosed.
risks_of_action: Cost (~£10-30k typical for a small-app VAPT in the UK depending on scope and depth); short-term distraction during testing window; some findings may be uncomfortable.
Why this is critical, not just nice-to-have: the TnE Connect product holds heavy staff PII, runs on Free-Tier infra without restorable backups, has no MFA on the admin layer, and is being marketed (via RocketSaas + tneconnect.app) for new enterprise customers. We cannot credibly sell this as SaaS without a recent independent assessment. The combination of KI-031, KI-035 and KI-036 is a meaningful structural exposure even before considering app-layer issues.
Suggested scope sizing for the assessor:
- External infra — E1, E2 public surfaces; Caddy / Traefik configurations; the publicly-reachable ORDS endpoints on E3 / E4 (KI-011).
- Web application — both customer URLs (fourway.tneconnect.app, eidos-global.tneconnect.app) plus the developer URLs (apex1.projecteidos.com, apex2.projecteidos.com).
- Authentication — Microsoft SSO chain through Authentik to the APEX app's authentication scheme, including post-login session handling.
- Authorization — the custom ur_users-style page-and-hotel access-control pattern (verify it's enforced server-side, not just UI-hidden).
- Source review — limited code review of the workforce repo, specifically the access-control + data-handling code paths.
Suggested out-of-scope for this round:
- Parallax (UR has its own commercial relationship; would warrant a separate engagement when UR's resale plans firm up).
- The Twenty CRMs (small user counts; not on the SaaS go-to-market).
- The internal *.448.global estate (separate concern; potentially Wave 3 or a smaller scoped exercise).
Outputs to expect:
- A written report fit to share (in summary or NDA form) with prospective enterprise customers.
- A remediation backlog that becomes its own set of RM-NNN items once findings are known.
RM-041 — Public-friendly documentation site (MkDocs Material at docs.projecteidos.com)¶
id: RM-041
title: Stand up an MkDocs Material documentation site at docs.projecteidos.com so leadership can read the docs without GitLab access
workstream: WS-2
wave: 2
priority: medium
owner: Vishnu Kant
status: in-progress
servers: [E2]
apps: ["18"]
ki_addresses: []
effort: s
dependencies: []
preconditions:
- mkdocs.yml + Dockerfile already committed in the engineering repo
- .gitlab-ci.yml already has docs:build job validating the site builds
- Authentik is up (RM-013/KI-033 closed) so OIDC-protected access works
target_state:
- docs.projecteidos.com resolves and serves the MkDocs Material site rendered from the engineering repo
- Mermaid diagrams render correctly
- Site search works
- Access is gated via Authentik forward-auth (one-time M365 sign-in; no GitLab login required)
- Site rebuilds automatically when main is updated (Dokploy auto-deploy on push, or scheduled rebuild)
verification:
- command: curl -sI https://docs.projecteidos.com/ -m 5 | head -1
expected: 200 OK or 302 to Authentik
- manual: open docs.projecteidos.com in a browser, sign in via M365, see the docs site
- manual: navigate to overview/executive-summary; confirm Mermaid diagrams render
- manual: use the search box; confirm full-text search returns expected results
risks_of_inaction: Leadership and external stakeholders cannot read the operational documentation without a GitLab account; institutional knowledge stays trapped in the repo.
risks_of_action: Surfaces operational documentation behind only the Authentik / M365 layer — verify no sensitive credential text accidentally entered the docs (the convention is paths-not-credentials; spot-check before flipping the auth gate to live).
Implementation steps (concrete):
- Verify the build runs locally to catch any broken links before deploying.
- In Dokploy, create a new application named e.g. `engineering-docs`:
    - Source: this engineering Git repo, branch `main`
    - Build: Dockerfile (the one already in the repo root)
    - Domain: `docs.projecteidos.com`
    - Auto-deploy on push to `main`: enabled
    - Watchtower exclusion label: yes (this is a Tier-1 service we want explicit upgrades on, same lesson as KI-037)
- DNS: add an A record at GoDaddy for `docs.projecteidos.com` pointing at E1's public IP (140.238.97.163). Same pattern as `bot.`, `git.`, `crm.*` — Caddy on E1 fronts it and proxies to Dokploy on E2.
- Caddy on E1: add a `docs.projecteidos.com` block that proxies to the Dokploy container. Capture the change in the Caddyfile (which now lives in `infra/caddy/E1.Caddyfile` once RM-006 is closed).
- Authentik forward-auth gate:
    - In Authentik, create a new Provider: type "Proxy", external host `docs.projecteidos.com`, mode "Forward-auth (single application)".
    - Bind the provider to the relevant group (e.g. `staff`, or a dedicated `docs-readers` group).
    - In Caddy on E1, add the `forward_auth` directive to the `docs.projecteidos.com` site block, pointing at the Authentik outpost.
    - Test from an unauthenticated session — it should redirect to the Authentik / Microsoft 365 sign-in.
- Communicate to leadership: send Stacy and Adam (and anyone else) a one-line note: "Docs are at `https://docs.projecteidos.com` — sign in once with your Microsoft 365 corporate account; you'll stay signed in for normal browsing afterwards."
Cost: £0 — runs as a tiny Nginx container alongside the other Dokploy apps on E2; uses existing Caddy and Authentik infra.
RM-040 — Cert-expiry alerting (general)¶
id: RM-040
title: Add cert NotAfter monitoring on all customer-facing hostnames
workstream: WS-8
wave: 1
priority: critical
owner: TBD
status: proposed
servers: []
apps: ["01", "02", "03", "04", "05", "06", "07", "11", "12", "13", "14", "15", "16", "17"]
ki_addresses: [KI-024, KI-005]
effort: xs
dependencies: [RM-038]
target_state:
- External tool checks NotAfter on every hostname
- Alerts at 30 days and 7 days before expiry
- Special attention on tneconnect.app (HSTS-preload)
verification:
- manual: simulate near-expiry, confirm alert
risks_of_inaction: A renewal failure goes unnoticed until users report.
AI agent guidance¶
This document is structured so an AI agent (or any parser) can extract individual actions and act on them against our repo.
Reading the structure¶
- Each action is a top-level `### RM-NNN — <title>` heading followed by a fenced `yaml` block.
- The yaml fields (`id`, `servers`, `ki_addresses`, `target_state`, `verification`) are the canonical machine-readable form.
- `verification.command` blocks are runnable shell or OCI CLI commands that an agent can execute (with appropriate credentials).
- `target_state` is a list of testable post-conditions.
Suggested agent capabilities¶
- RAG over this repo: index this file plus all of `apps/*.md`, `infra/*.md`, `overview/*.md`. Cross-references use stable IDs (`KI-NNN`, `RM-NNN`, `E1`, `O1`, etc.).
- Status reporting: for each `RM-NNN`, run the `verification.command` set and report whether `target_state` is met. Update the `status` field on success.
- Dependency resolution: never start an `RM` whose `dependencies` aren't all in `status: done`.
- Safety: treat every `verification.command` as read-only unless explicitly tagged otherwise — they confirm state, they do not change it.
- Permission scoping: the agent should run with a credential scoped to the minimum OCI policy needed for the targeted resource — not tenancy-admin.
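The dependency-resolution rule can be sketched over already-parsed action dicts; the field names mirror the yaml blocks above, and the miniature register is hypothetical:

```python
def runnable(actions: dict[str, dict]) -> set[str]:
    """IDs of proposed actions whose dependencies are all in status: done."""
    return {
        aid
        for aid, a in actions.items()
        if a["status"] == "proposed"
        and all(actions[dep]["status"] == "done" for dep in a.get("dependencies", []))
    }

# Hypothetical miniature register:
register = {
    "RM-033": {"status": "done", "dependencies": []},
    "RM-034": {"status": "proposed", "dependencies": ["RM-033"]},
    "RM-031": {"status": "proposed", "dependencies": ["RM-029", "RM-030"]},
    "RM-029": {"status": "proposed", "dependencies": []},
    "RM-030": {"status": "proposed", "dependencies": []},
}
assert runnable(register) == {"RM-034", "RM-029", "RM-030"}
```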
What an agent should NOT do without human approval¶
- Apply Terraform changes (RM-028).
- Upgrade ADB tiers (RM-001) — billing impact.
- Rotate live secrets (RM-021, RM-036).
- Migrate Tailscale ownership (RM-019) — requires identity transitions.
- Anything inside an M365 tenant admin console.
These are deliberately gated to humans; the agent's role is to prepare, verify, and report, not to execute irreversible production changes.
Maintenance¶
- When a new operational risk is discovered, add a `KI-NNN` to `infra/known-issues.md` and a corresponding `RM-NNN` here.
- When an action completes, update `status: done` and add a `completed_date: YYYY-MM-DD` line.
- Quarterly: review the wave plan; promote items from "proposed" to "in-progress" / "done"; identify new KIs.
Decisions needed from leadership¶
Before Wave 1 can start, leadership needs to sign off on:
- OCI cost increase for Fourway ADB upgrade (RM-001). Single largest line item; rest of Phase 2 fits in operational headroom.
- Tailscale company subscription vs Headscale self-host (RM-019). Subscription is faster to ship; self-host is cheaper long-term and avoids vendor lock.
- Off-OCI backup provider (RM-018). Backblaze B2 vs AWS S3 Glacier vs OVH; recommendation is B2 for price-simplicity.
- External uptime tool (RM-038). Free tier of Healthchecks.io is the lowest-friction starting point.
- Engineering capacity allocation. Phase 2 estimated at ~30% of one engineer for 6 months; explicit allocation prevents drift.
The detail behind each line is in the corresponding RM-NNN block above.