
AI sovereignty & vendor risk

Where we are: every AI assistant, IDE-agent and Open-WebUI prompt our staff use today routes to third-party APIs (Anthropic, OpenAI, etc.). This doc captures the resulting risks, the realistic options for moving some of that workload onto infrastructure we own, and a costed comparison for the question that matters in practice — can we run frontier-quality AI for ~10 concurrent coding users in Bedford, on our own metal?

Last reviewed: 2026-05-08.


1. At a glance

  • Today all AI assistance is via third-party APIs. There is no local fall-back. If Anthropic / OpenAI goes down, every Claude-Code / Cursor / Open-WebUI session in the company stops.
  • Worse than that: most developers today use free-tier or personal-account AI tools (free ChatGPT, free Claude.ai, personal Cursor / Copilot subscriptions, personal API keys), not the company-managed Open-WebUI at ai.448.global. This is the single biggest sovereignty problem we have right now. Free-tier ToS frequently allow vendor training on submitted content; the company has zero audit trail; conversation histories live in the developer's personal account, not the company's; and when a developer leaves, that history (containing real customer code / config) walks out the door with them. No technical control we add to the company channel matters if the company channel isn't where the work happens.
  • Sensitive data leaves the building every time a developer pastes code, a customer ticket, or a Vault path into a prompt — and currently we don't even know which third party is receiving it. Acceptable-use policy and data-redaction discipline are the only controls in place today, and they are informal.
  • Frontier-quality local AI (Opus 4.7 / GPT-5-class reasoning + coding) is not yet achievable on a single Mac Studio for 10 concurrent users. It is achievable for 1–2 concurrent users on a good-but-not-frontier open-weight model — useful as a sovereign fallback, not a wholesale replacement.
  • Recommended direction: hybrid. Keep frontier API as the productivity primary, stand up a local Mac Studio (or small cluster) for sensitive workloads + offline resilience, and let usage data drive whether to expand.
  • The biggest single thing we can change today is making the company AI channel good enough that developers actively prefer it to the free / personal tools they use today. That means parity-or-better on model quality, generous per-user budgets so the bill is never a reason to defect to free tiers, native IDE integration (Claude-Code / Cursor pointed at the company endpoint), Authentik SSO so login isn't friction, and a written acceptable-use policy that's clear about what the deal is — the company pays for AI; in exchange, company work happens on the company channel. This is a policy + infra problem, not a hardware problem, and Phase A delivers it without any capex.

2. Why this matters now

Three forces converge:

  1. Usage growth. AI coding has gone from "occasional helper" to "default IDE companion." A small engineering team plus consultants now routes a meaningful percentage of every working hour through one or two API vendors. The dependency is no longer at the margin.
  2. Sensitivity drift. Early prompts were sandbox snippets. Today's prompts increasingly include real customer code (TnE Connect, Parallax), customer data fragments, infra config, and Vault paths. Any "this is just code, not data" framing is out of date.
  3. Customer expectations. Prospective enterprise customers — exactly the audience the TnE Connect SaaS go-to-market is targeting — increasingly ask "is our data sent to OpenAI/Anthropic?" in their procurement questionnaires. A defensible answer is becoming a sales requirement, not just an internal control.

3. Specific risks of relying solely on third-party AI APIs

| # | Risk | Concrete failure mode | Today's mitigation |
|---|------|-----------------------|--------------------|
| R1 | Sensitive-data leak via prompt | A developer pastes a customer's PII row, a Vault token, or a fragment of the workforce schema into Claude / ChatGPT / Copilot. Once sent, the vendor's retention and training policies govern it, not ours. Free-tier products typically reserve broader rights over submitted content than paid API tiers — the same paste through a free ChatGPT account is materially worse than through our paid API. | Informal acceptable-use guidance only. No DLP. No prompt-content audit. |
| R2 | Vendor outage | Anthropic API has a multi-hour outage. Every Claude-Code session, Cursor agent, and Open-WebUI conversation in the company stops at once. No local fallback model exists. | None. |
| R3 | Vendor account compromise | Our shared Anthropic / OpenAI key is leaked or the account is taken over. Conversation history (containing customer code + secrets) is exfiltrated; usage is run up under our billing. | Single shared key in Open-WebUI; rotation cadence not documented. |
| R4 | Vendor policy change | Pricing change, new data-handling clause, capability tier reshuffle (a model we depend on is sunsetted), region-availability change. Each one forces a re-evaluation we don't currently have a runbook for. | None. |
| R5 | Data-residency / regulatory exposure | UK / EU customers ask "where does our data go when you process it?" We currently cannot answer "it stays in the UK" because frontier APIs route through US infrastructure. For some prospects, this is a hard sell-blocker. | None. |
| R6 | Geopolitical / export-control risk | Sanctions or export controls restrict access to a US-based AI vendor for a region we sell into (e.g. an India office on the wrong side of a future trade rule). The dependency becomes a market-access risk. | None — not on the radar, but worth naming. |
| R7 | Lock-in to vendor-specific behaviour | Our Claude-Code / Cursor workflow, agent prompts, and tool schemas are tuned to one vendor's quirks. Switching cost is higher than it looks. | Some Open-WebUI multi-provider config exists; agent workflows are vendor-specific. |
| R8 | No audit trail across team | We cannot answer "what did our team prompt the model with last week?" — a question both a customer DPO and our own incident-response would want answered after a leak. | Open-WebUI keeps per-user history; IDE-side prompts (Claude-Code, Cursor) are largely off-record. |
| R9 | Shadow-IT AI: personal accounts and free-tier tools | Developers today predominantly use free or personal AI — free ChatGPT, free Claude.ai, personal Cursor / Copilot subscriptions, personal API keys. The company has no audit trail at all, no control over the ToS governing the data, and conversation history sits in the developer's personal account — so when they leave, real customer code / infra config / Vault paths leave with them. From a customer-IP / contractual standpoint this is the most consequential risk on this list, and it's the one the company has the least visibility into. | None. The Open-WebUI front-end exists but isn't where the work is happening. |

These risks are not equally urgent — R9 is the most pressing one to address because it makes every other risk worse and it requires zero capex to start fixing. R1, R2, R5 are the ones a customer DPO or a thoughtful CTO would press on in a procurement meeting.


4. The sovereignty spectrum — this is not a binary choice

A useful framing: AI sovereignty isn't "API vs no API," it's a layered set of controls, any of which can be added independently:

| Layer | What it does | Cost |
|-------|--------------|------|
| L1 — Prompt hygiene & redaction | Documented "what not to paste" policy; pre-prompt scrubbing in Open-WebUI (regex-based PII / token filters); per-team API keys for audit | Low — engineering time |
| L2 — Per-user accounting and audit | Move from one shared API key to per-user keys via Open-WebUI / LiteLLM gateway. Every prompt logged + attributable | Low — engineering time |
| L3 — Local fall-back model | Mac Studio in Bedford running an open-weight code model. Used when (a) the work is sensitive, (b) the API is down, or (c) the job is a latency-tolerant batch run | Mid — capex on hardware |
| L4 — Local primary for a defined workload class | Specific staff / tasks (e.g. anything touching customer PII) routed first to local; fallback to API only for hard problems | Mid — operational discipline |
| L5 — Full local primary, API as overflow only | Production-grade local cluster sized for the team's load; API used only for narrow exceptions | High — multi-GPU rig + ops investment |

Most teams stop at L3 or L4 and that's the right answer. L5 is only worth it if (a) the customer base demands it commercially, or (b) the API spend has grown to the point where the GPU capex pays back inside ~2 years.
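The L1 scrubbing layer is small enough to sketch as a regex pass in front of the gateway. The patterns below are illustrative assumptions, not a vetted filter list — the `hvs.` prefix matches current Vault service tokens, but the real list would come from a secrets inventory:

```python
# Minimal sketch of an L1 pre-prompt scrubber. Patterns are illustrative
# placeholders -- tune to the secret formats actually in use at 448.
import re

PATTERNS = [
    (re.compile(r"\bhvs\.[A-Za-z0-9_-]{20,}\b"), "[VAULT_TOKEN]"),  # Vault service tokens
    (re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"), "[API_KEY]"),          # OpenAI-style keys
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),        # email addresses
]

def scrub(prompt: str) -> str:
    """Replace sensitive-looking substrings before the prompt leaves the gateway."""
    for pattern, placeholder in PATTERNS:
        prompt = pattern.sub(placeholder, prompt)
    return prompt
```

A scrubber like this catches mechanical leaks (tokens, keys, emails); it does not catch a pasted customer schema, which is why the policy layer still matters.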


5. Cost comparison — 10 concurrent coding users

Question framed by leadership: what is it going to cost to give 10 coding users an Opus-4.7-class assistant, comparing API vs cloud-GPU vs Mac-Studio-in-Bedford?

This is the most useful question to size the decision against. Below is an honest, assumption-led comparison. Every figure here is a planning estimate; treat it as the shape of the answer, not a quote.

5.1 Workload assumption

A heavy coding-agent user (Claude-Code-style sessions all day, including agentic loops with many tool calls) consumes on the order of 30–80 million tokens per month. The mid-point we'll use:

10 users × 50M tokens/month = 500M tokens/month total, mixed input/output, mostly cache-friendly (large repeated context windows).

If your team's actual usage is lighter (intermittent assistant rather than always-on agent), divide everything below by 2–3.
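The 5.1 arithmetic can be sanity-checked with a tiny cost model. All figures are the doc's planning assumptions, not quotes:

```python
# Back-of-envelope for the section 5.1 / 5.2 figures. Planning estimates only.
def monthly_tokens(users: int, tokens_per_user_m: float) -> float:
    """Total team volume in millions of tokens per month."""
    return users * tokens_per_user_m

def monthly_api_cost_usd(tokens_m: float, blended_usd_per_m: float,
                         cache_discount: float = 0.0) -> float:
    """Blended API cost; cache_discount is the fraction shaved off by prompt caching."""
    return tokens_m * blended_usd_per_m * (1 - cache_discount)

total = monthly_tokens(10, 50)            # 500M tokens/month
raw = monthly_api_cost_usd(total, 25)     # ~$12,500/month, matching 5.2
cached = monthly_api_cost_usd(total, 25, cache_discount=0.7)  # with caching discipline
```

Plugging in a 50–90% caching discount reproduces the ~£3k–£6k/month band quoted in Option A.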

5.2 Option A — Continue with API (Anthropic Opus 4.7)

| Item | Estimate |
|------|----------|
| Per-token cost (current Anthropic Opus 4.x list, blended in:out at ~5:1 typical for coding) | ~$25 / M tokens blended |
| Raw monthly | 500M × $25 = ~$12,500 / month (~£10k) |
| With prompt-caching discipline (50–90% reduction on cached input) | ~£3k–£6k / month realistic |
| Annualised, with caching | ~£40k–£75k / year |

What you get: best-in-class reasoning + coding; full reliability and SLA from Anthropic; instant scale up/down; no capex; no ops overhead.

What you don't get: data residency; offline capability; protection from vendor outage / policy change; ability to put a "no third-party processing" clause in a customer contract.

5.3 Option B — Self-host on cloud GPUs (Llama / DeepSeek / Qwen frontier)

To match Opus-4.7 coding quality (reasoning is a stretch), you would deploy something like DeepSeek-V3 (671B MoE) or Llama 3.3 70B with aggressive quantisation, or Qwen 2.5 / 3 Coder 32B–72B. These are strong on coding (within ~5–15% of Opus on coding-specific benchmarks) but noticeably weaker on hard agentic reasoning, long-context planning, and edge-case correctness.

Hardware to serve 10 concurrent users with reasonable latency on a frontier open model: roughly 4–8 × H100 (80 GB), required to fit a 70B–405B model plus KV cache and still serve the concurrency.
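A rough sizing sketch behind the 4–8 × H100 figure. The model-shape numbers (80 layers, 8 KV heads, head dim 128) are illustrative Llama-3-70B-like values, not a measured config:

```python
# Rough VRAM sizing: weights + KV cache for concurrent long-context users.
def weights_gb(params_b: float, bytes_per_param: float) -> float:
    """Memory for model weights, in GB (params in billions)."""
    return params_b * bytes_per_param

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_val: int = 2) -> float:
    """KV cache size in GB: 2x (keys and values), per layer, per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1e9

w = weights_gb(70, 1)                      # 70B params at 8-bit -> ~70 GB
kv = kv_cache_gb(10 * 32_000, 80, 8, 128)  # 10 users x 32k context -> ~105 GB
```

Roughly 175 GB before framework overhead, so three 80 GB cards is the floor and 4–8 gives headroom for longer contexts and bigger models.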

| Item | Estimate (mid-point) |
|------|----------------------|
| 8 × H100 on-demand at ~$4/GPU-hr typical | $32/hr × 720 hrs = ~$23k / month (~£18k) |
| 8 × H100 on a 1-year reserved contract | ~$15k / month (~£12k) |
| Annual, reserved | ~£140k–£190k / year |
| Engineering overhead (one engineer, ~10–20% time) | ~£8k–£15k / year |

What you get: data stays in your tenant (still cloud, but a cloud you contract with directly); model choice; no per-token bill; predictable budget.

What you don't get: you're still on someone else's hardware; more expensive than the API unless utilisation is genuinely 24/7; inferior model quality vs Opus 4.7 on hard tasks.

Verdict on Option B alone: rarely the right move at our scale. If utilisation is < 60%, you pay more than the API for less quality.

5.4 Option C — Mac Studio M3 Ultra in Bedford

This is the option leadership has flagged: a sovereign, local, capital-purchase appliance.

Realistic capability of one top-spec M3 Ultra (192 GB unified memory, ~800 GB/s memory bandwidth):

| Workload | Realistic experience |
|----------|----------------------|
| Single user, Qwen 2.5 Coder 32B / 72B at Q4–Q5 | Smooth, ~20–40 tok/s depending on quant. Strong coding quality. |
| Single user, Llama 3.3 70B at Q4 | Workable, ~8–15 tok/s. Good reasoning, slightly behind Qwen on code. |
| Single user, DeepSeek-V3 (671B MoE) at very heavy quant | Marginal — only borderline fits in memory even at extreme quantisation, slow output. Not production-comfortable. |
| 2 concurrent users, 70B-class Q4 | Acceptable for short turns, painful for long agentic sessions. |
| 10 concurrent users, 70B-class | Not viable on one box. Would need 4–6 Mac Studios, and you'd still be on a non-frontier model. |

| Item | Estimate |
|------|----------|
| 1 × Mac Studio M3 Ultra, top-spec (192 GB, 32-core CPU, 8 TB) | ~£8k–£10k capex |
| Power (idle ~80 W, peak ~480 W; mostly idle) | ~£300–£600 / year at UK business rates |
| AppleCare + 3-year amortisation | ~£3k / year amortised |
| Engineering setup (one-off) | 2–4 days (LLM serving stack: Ollama / llama.cpp / vLLM-on-Metal, Open-WebUI integration, Authentik SSO, prompt logging) |
| Engineering ongoing | ~5% of one engineer (model upgrades, capacity nudges) |
| Realistic concurrent users at frontier-open quality | 1–2 |
| Realistic concurrent users at Qwen-Coder-32B quality | 3–5 |

Verdict on Option C alone: a Mac Studio is a sovereignty appliance and an offline fall-back, not a wholesale replacement for 10-user Opus-class API access. It pays for itself if it deflects ~£3k–£5k/year of API spend on sensitive workloads, and gives the team a real "AI is up when the internet isn't" capability.
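The per-user throughput figures in the table follow from a simple rule of thumb: single-stream decode is memory-bandwidth-bound, so tokens/sec is roughly bandwidth divided by the quantised model size. A sketch, where the efficiency factor is an assumption rather than a benchmark:

```python
# Rule-of-thumb decode speed on an ~800 GB/s part. The 0.6 efficiency
# factor is an assumed fudge for overheads, not a measured number.
def decode_tok_s(bandwidth_gb_s: float, params_b: float,
                 bits_per_weight: float, efficiency: float = 0.6) -> float:
    """Estimated single-stream tokens/sec: bandwidth / quantised model size."""
    model_gb = params_b * bits_per_weight / 8
    return efficiency * bandwidth_gb_s / model_gb

est_70b = decode_tok_s(800, 70, 4.5)  # ~12 tok/s, inside the 8-15 band above
est_32b = decode_tok_s(800, 32, 4.5)  # ~27 tok/s, inside the 20-40 band above
```

The same formula explains why concurrency collapses quickly: each extra stream shares the same bandwidth budget.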

5.5 Honest gap: Opus-4.7-equivalent is not currently available locally

A direct, plain answer to the framing question:

No open-weight model in 2026 fully matches Claude Opus 4.7 on the combination of agentic reasoning, long-context planning, tool-use reliability, and code correctness on hard problems. The strongest open models (Qwen 2.5/3 Coder, DeepSeek-V3, Llama 3.3) are within ~10–15% on average coding benchmarks but fall further behind on hard, multi-step, or unfamiliar-codebase work — exactly the cases where the API is most valuable.

Anyone who tells you "you can run Opus-4.7-equivalent on a Mac Studio for £10k" is selling something. The honest framing is: you can run a very useful sovereign coding assistant on a Mac Studio — sufficient for the majority of routine work, ideal for sensitive work, and a real fallback for the next vendor outage. It is not a replacement for the API on hard problems.

5.6 The recommended hybrid

| Layer | Provider | Use case | Estimated annual cost |
|-------|----------|----------|-----------------------|
| Default coding assistant | Anthropic API (Opus + Sonnet tiers, prompt-caching enforced) | Day-to-day coding, agentic sessions, hard reasoning | ~£40k–£75k / year |
| Sovereign fallback | 1 × Mac Studio M3 Ultra running Qwen 2.5/3 Coder + a small reasoning model | Sensitive workloads (PII, customer code, Vault paths); API outage; offline at-home / on-the-train work; experiments | ~£3k–£4k / year amortised + ~£500 power + setup time |
| Audit + routing layer | Self-hosted gateway (LiteLLM / Open-WebUI as proxy) in front of both | Per-user accounting, prompt logging, automatic pre-prompt redaction, "if API down then local" routing | Engineering time only |
| Total | | | ~£45k–£80k / year + ~£10k one-off capex |

That hybrid is cheaper than going pure API (because the local box absorbs the fully-cached / repeated workloads where API caching helps least), cheaper than going pure local (because we don't try to size for 10-user frontier locally), and structurally more defensible with customers because we can credibly state that sensitive workloads do not transit a third-party API.
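The routing layer's decision logic is deliberately simple. A sketch — the sensitivity patterns and the `"local"` / `"api"` endpoint labels are placeholders, not the real trigger list or gateway config:

```python
# Sketch of the hybrid routing policy: sensitive -> local, API down -> local,
# everything else -> frontier API. Patterns here are illustrative placeholders.
import re

SENSITIVE = [re.compile(p, re.IGNORECASE) for p in (
    r"\bhvs\.",                     # Vault service-token prefix
    r"customer[_ ]pii",             # stand-in for a real PII trigger list
    r"BEGIN\s+\w+\s+PRIVATE KEY",   # pasted private keys
)]

def route(prompt: str, api_up: bool) -> str:
    """Pick a backend for one prompt; never send sensitive content off-site."""
    if any(p.search(prompt) for p in SENSITIVE):
        return "local"   # stays in Bedford regardless of API health
    if not api_up:
        return "local"   # sovereign fallback during a vendor outage
    return "api"         # frontier quality for everything else
```

The real trigger list is exactly the "sensitive-workload trigger list" deliverable in Phase B.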


6. Phased plan

This dovetails with the Phase 2 roadmap and is captured as RM-044 (a single roadmap item rather than a swarm so it stays decision-shaped).

Phase A — Now (no capex). This is the single highest-leverage block of work in this whole plan.

The premise: the only reason developers reach for free / personal AI tools (R9) is that the company channel today is worse than the alternative — fewer models, less budget headroom, no IDE integration, friction to set up, ambiguous policy. Phase A is the carrot-and-stick delivery that flips that.

The carrot — make the company channel the obviously better option.

  1. Per-user keys via a self-hosted gateway in front of ai.448.global (LiteLLM or Open-WebUI proxy mode). Replaces the single shared API key. Every prompt is attributable, every user has a budget, every team has a usage dashboard. Engineering time only.
  2. Generous monthly token budgets per developer — explicitly sized so a heavy AI-coding day doesn't bump into a limit. The whole point is to remove "I'll use my personal account because the company one runs out" as a reason to defect. Re-cost monthly based on actual usage.
  3. Frontier model parity: the gateway exposes Anthropic Opus + Sonnet + Haiku tiers (and OpenAI / others as needed) so a developer never has to leave the company channel to access the model they would otherwise pick.
  4. Native IDE integration: publish documented configs for Claude Code and Cursor that point at the company gateway with one paste. Developers who use these tools don't have to choose between "use the company channel" and "use the IDE I already use" — the IDE is the company channel.
  5. Authentik SSO into the gateway — same sign-on as everything else; no separate login, no separate password.
  6. Local model preview in Open-WebUI even before the Mac Studio lands — serve a Qwen Coder from a small VPS or temporary inference endpoint, so the local-vs-API choice is one dropdown, not a future promise.

The stick — written, signed acceptable-use policy.

  1. AI Acceptable-Use Policy, distributed and signed by every developer. The contract:
       • The company pays for AI tooling at parity-or-better with what's available externally.
       • In exchange, all company work happens on the company channel (ai.448.global or company-issued IDE configs). No company code / customer data / Vault paths in personal accounts, free tiers, or unmanaged tools.
       • Specific examples of what cannot be pasted (customer PII, Vault paths, full customer-code dumps, secrets), and what can.
       • Reporting flow when someone realises they pasted something they shouldn't — no-blame, but they tell the company. Recovery (token rotation, customer notification if needed) is handled by the company, not hidden by the individual.
       • Personal use of AI on personal time and personal hardware is fine; mixing personal accounts with company work is not.
  2. Personal-key amnesty + collection: a one-week window where any developer who has been using a personal API key for company work hands the key to Vishnu, the keys are rotated, and the company replaces them with attributed gateway access — no awkward conversation, no audit ambush. After the window the policy is "live" and any new exception requires a written justification.

Operational baseline.

  1. Outage runbook for "what to do when Anthropic / OpenAI is down" — currently undefined. Even before Mac Studio lands, the gateway can fall back across vendors (e.g. Anthropic → OpenAI → Mistral hosted) so an outage on one provider doesn't stop the day.
  2. Periodic audit: compare gateway traffic shape to engineer working hours. If a senior developer on a coding-heavy week shows zero gateway traffic, that's a signal someone is back on a personal tool — investigate, fix the friction, don't punish the symptom.

Order of work: the gateway, Authentik SSO, and the signed acceptable-use policy ship first. Generous budgets, frontier-model parity, and the IDE configs ship in the same week — the policy is unenforceable until the company channel is genuinely better than the alternative.

Phase B — 3–6 months (Mac Studio)

  1. Order one Mac Studio M3 Ultra (top-spec) — Bedford office. ~£8–10k capex.
  2. Stand up llama.cpp / Ollama + vLLM with Qwen 2.5/3 Coder 32B (primary) + Llama 3.3 70B (reasoning fallback). Front it with the existing Open-WebUI at ai.448.global. Optionally expose to Cursor / Claude-Code-compatible endpoints via a translation layer.
  3. SSO via Authentik, prompt logging, basic Beszel monitoring.
  4. Define a sensitive-workload trigger list: which categories of prompt route to local-first by policy.

Phase C — 12 months (review)

  1. Review actual usage and API spend. Decide whether to add a second Mac Studio, move to a small GPU rig, or stay with the Bedford box as the sovereignty appliance.
  2. Re-test the API-vs-local quality gap — both vendors and open weights move quickly; 12 months is enough for the gap to materially close (or, occasionally, widen).

7. What we already have

The estate is not at zero on this — there's existing surface to build on:

  • Open WebUI at ai.448.global — already serves as the central AI chat front-end. Currently fronts API providers; can also front a local Ollama / vLLM endpoint without a UI change.
  • Authentik — usable as the SSO layer for any new local-AI endpoint we expose.
  • Coder workspaces — already in place; once a local AI endpoint exists, dev workspaces can be configured to point at it for sensitive projects.
  • Vault — the right place to keep API keys, gateway secrets, and per-user attribution tokens once Phase A lands.

The Mac Studio is therefore not a green-field deployment; it slots into infrastructure we already operate. The marginal new components are: the GPU host itself, an inference-serving stack on it, and the redaction / routing logic in front of Open-WebUI.


8. Open questions / unknowns

These are the things this doc cannot answer without further input — flag them up so future revisions close them.

  • Real prompt volume today: we don't currently aggregate API usage across Anthropic + OpenAI + Cursor + Claude-Code. The "500M tokens/month" figure above is an industry-typical estimate, not measured. Action: the Phase A gateway makes this measurable.
  • Sensitive-prompt incidence: how often does a prompt actually contain something we wouldn't want sent externally? Until logged, we're guessing.
  • Customer ask: which customers (or prospects) have asked, or are about to ask, about AI data flow? Sales can de-risk this conversation early.
  • Bedford office network: Mac Studio in Bedford means client devices need to reach it. Either route via WireGuard (RM-042) or expose via Caddy on a *.448.global host with Authentik enforcement.
  • Power / cooling at Bedford: a Mac Studio is laptop-class power, not a server, so this is a small concern but worth confirming with Sergiu Pop (offices + endpoint provisioning) before purchase.
